Article spider template (article)
Basic use
scrapy crawl article -a url="https://www.zyte.com/blog/"
Parameters
- pydantic model zyte_spider_templates.spiders.article.ArticleSpiderParams
- field crawl_strategy: ArticleCrawlStrategy = ArticleCrawlStrategy.full
Determines how input URLs and follow-up URLs are crawled.
- field extract_from: ExtractFrom | None = None
Whether to perform extraction using a browser request (browserHtml) or an HTTP request (httpResponseBody).
- field geolocation: Geolocation | None = None
Country of the IP addresses to use.
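Like any other spider argument, both parameters can be set on the command line. A minimal sketch, assuming that ISO 3166-1 alpha-2 country codes such as US are valid Geolocation values:

```shell
# Force HTTP-based extraction and route requests through US IP addresses.
# The geolocation value is assumed to be an ISO 3166-1 alpha-2 country code.
scrapy crawl article \
    -a url="https://www.zyte.com/blog/" \
    -a extract_from=httpResponseBody \
    -a geolocation=US
```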
- field incremental: bool = False
Skip items with URLs already stored in the specified Zyte Scrapy Cloud Collection. This feature helps avoid reprocessing previously crawled items and requests by comparing their URLs against the stored collection.
- field incremental_collection_name: str | None = None
Name of the Zyte Scrapy Cloud Collection used during an incremental crawl. By default, a Collection named after the spider (or virtual spider) is used, meaning that matching URLs from previous runs of the same spider are skipped, provided those previous runs had the incremental argument set to true. Using a different collection name makes sense, for example, in the following cases:
- Different spiders share a collection.
- The same spider uses different collections (e.g., for development runs vs. production runs).
Only ASCII alphanumeric characters and underscores are allowed in the collection name.
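A sketch of an incremental run that stores seen URLs in a custom collection; the collection name articles_prod is illustrative:

```shell
# Skip articles whose URLs were already stored by previous incremental runs
# that used the same collection.
scrapy crawl article \
    -a url="https://www.zyte.com/blog/" \
    -a incremental=True \
    -a incremental_collection_name=articles_prod
```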
- field max_requests: int | None = 100
The maximum number of Zyte API requests allowed for the crawl.
Requests with error responses that cannot be retried or that exceed their retry limit also count here, but they incur no costs and do not increase the request count in Scrapy Cloud.
- field max_requests_per_seed: NonNegativeInt | None = None
The maximum number of follow-up requests allowed per initial URL. Unlimited if not set.
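The two limits can be combined. A sketch capping the whole crawl at 1000 Zyte API requests and each seed at 100 follow-up requests (values are illustrative):

```shell
scrapy crawl article \
    -a url="https://www.zyte.com/blog/" \
    -a max_requests=1000 \
    -a max_requests_per_seed=100
```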
- field url: str = ''
Initial URL for the crawl. Enter the full URL including http(s); you can copy and paste it from your browser. Example: https://toscrape.com/
- field urls: List[str] | None = None
Initial URLs for the crawl, separated by new lines. Enter the full URL including http(s); you can copy and paste it from your browser. Example: https://toscrape.com/
- field urls_file: str = ''
URL pointing to a plain-text file with a list of URLs to crawl, e.g. https://example.com/url-list.txt. The linked file must contain one URL per line.
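For example, assuming https://example.com/url-list.txt exists and lists one URL per line:

```shell
# Seed the crawl from a remote plain-text URL list instead of url/urls.
scrapy crawl article -a urls_file="https://example.com/url-list.txt"
```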
Settings
The following zyte-spider-templates settings may be useful for the article spider template:
NAVIGATION_DEPTH_LIMIT
Limit the crawling depth of subcategories.
OFFSITE_REQUESTS_PER_SEED_ENABLED
Skip follow-up requests if their URL points to a domain different from the domain of their initial URL.
ONLY_FEEDS_ENABLED
Extract links only from Atom and RSS news feeds.
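These settings can be set per run with Scrapy's -s option; a sketch combining the three settings above (the values shown are illustrative, not defaults):

```shell
# Limit subcategory depth, keep follow-up requests on the seed's domain,
# and restrict link extraction to Atom/RSS feeds.
scrapy crawl article \
    -a url="https://www.zyte.com/blog/" \
    -s NAVIGATION_DEPTH_LIMIT=2 \
    -s OFFSITE_REQUESTS_PER_SEED_ENABLED=True \
    -s ONLY_FEEDS_ENABLED=True
```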