Settings

MAX_REQUESTS_PER_SEED

Tip

When using the article spider template, you may use the max_requests_per_seed command-line parameter instead of this setting.

Default: 0

Limits the number of follow-up requests per initial URL to the specified amount. A non-positive value (0 or lower) imposes no limit and disables this middleware.

The limit is the total limit for all direct and indirect follow-up requests of each initial URL.

Implemented by MaxRequestsPerSeedDownloaderMiddleware.
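For example, to cap follow-up requests at 100 per seed, you could set the value in your project settings (the value 100 here is illustrative):

```python
# settings.py (or a spider's custom_settings)
MAX_REQUESTS_PER_SEED = 100  # 0 or a negative value disables the limit
```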

OFFSITE_REQUESTS_PER_SEED_ENABLED

Default: True

Setting this value to True enables OffsiteRequestsPerSeedMiddleware; setting it to False disables it completely.

The middleware ensures that most requests belong to the domains of the seed URLs. Offsite requests are allowed only if they were obtained from a response whose domain matches a seed URL; requests obtained from responses in domains outside the seed URLs are not allowed.

This prevents the spider from completely crawling other domains while ensuring that aggregator websites (e.g. a news website with articles from other domains) are supported, as it can access pages from other domains.

Disabling the middleware means offsite requests are not filtered, which can lead to other domains being crawled completely, unless allowed_domains is set in the spider.

Note

If a seed URL gets redirected to a different domain, both the domain from the original request and the domain from the redirected response will be used as references.

If the seed URL is https://books.toscrape.com, all subsequent requests to books.toscrape.com and its subdomains are allowed, but requests to toscrape.com are not. Conversely, if the seed URL is https://toscrape.com, requests to both toscrape.com and books.toscrape.com are allowed.
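The subdomain rule above can be sketched in plain Python. This is an illustrative helper, not the actual middleware code: a URL counts as belonging to a seed's domain if its host equals the seed host or is a subdomain of it.

```python
from urllib.parse import urlparse


def url_in_seed_domain(url: str, seed_url: str) -> bool:
    """Illustrative sketch of the seed-domain rule: the request host must
    equal the seed host or be a subdomain of it."""
    host = urlparse(url).netloc.lower()
    seed_host = urlparse(seed_url).netloc.lower()
    return host == seed_host or host.endswith("." + seed_host)
```

With the seed https://books.toscrape.com, a request to toscrape.com is rejected; with the seed https://toscrape.com, requests to books.toscrape.com pass.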

ONLY_FEEDS_ENABLED

Note

Only works for the article spider template.

Default: False

Whether to extract links from Atom and RSS news feeds only (True) or to also use extracted links from ArticleNavigation.subCategories (False).

Implemented by OnlyFeedsMiddleware.

INCREMENTAL_CRAWL_BATCH_SIZE

Default: 50

The maximum number of seen URLs to read from or write to the corresponding Zyte Scrapy Cloud collection per request during an incremental crawl (see INCREMENTAL_CRAWL_ENABLED).

This setting determines the batch size for interactions with the Collection. If a response yields more URLs than the batch size, they are split into multiple batches for processing. Conversely, if fewer URLs are present, all of them are handled in a single request to the Collection.

Adjusting this value can optimize the performance of a crawl by balancing the number of requests sent to the Collection with processing efficiency.

Note

Setting it too high (e.g. above 100) can cause issues due to the resulting query length, while setting it too low (below 10) removes the benefit of batching.

Implemented by IncrementalCrawlMiddleware.
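The batching behavior described above amounts to chunking the URL list. Here is a minimal sketch of that idea (illustrative only, not the middleware's actual code):

```python
def batched(urls, batch_size=50):
    """Yield chunks of at most batch_size URLs, mirroring how
    INCREMENTAL_CRAWL_BATCH_SIZE bounds each read/write request
    sent to the collection."""
    for i in range(0, len(urls), batch_size):
        yield urls[i:i + batch_size]
```

For example, 120 URLs with the default batch size of 50 produce three requests to the Collection: two with 50 URLs and one with 20.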

INCREMENTAL_CRAWL_COLLECTION_NAME

Note

Virtual spiders are spiders based on spider templates. The explanation of INCREMENTAL_CRAWL_COLLECTION_NAME below applies to both virtual spiders and regular spiders.

Tip

When using the article spider template, you may use the incremental_collection_name command-line parameter instead of this setting.

Note

Only ASCII alphanumeric characters and underscores are allowed.

Default: <The current spider’s name>_incremental. For a virtual spider, the current spider’s name is the virtual spider’s name; otherwise, it is Spider.name.

Name of the Zyte Scrapy Cloud collection used during an incremental crawl (see INCREMENTAL_CRAWL_ENABLED).

By default, a collection named after the spider is used, meaning that matching URLs from previous runs of the same spider are skipped, provided those previous runs had the INCREMENTAL_CRAWL_ENABLED setting set to True or the spider argument incremental set to true.

Using a different collection name makes sense, for example, in the following cases:

- Different spiders share a collection.
- The same spider uses different collections (e.g., for development runs vs. production runs).

Implemented by IncrementalCrawlMiddleware.
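The documented default can be sketched as follows. Note that the replacement of disallowed characters with underscores is an assumption for illustration; only the "<spider name>_incremental" pattern and the ASCII-alphanumeric/underscore restriction come from the documentation above.

```python
import re


def default_collection_name(spider_name: str) -> str:
    """Sketch of the default collection name: '<spider name>_incremental'.
    Replacing disallowed characters with underscores is an assumption,
    not the documented behavior."""
    safe = re.sub(r"[^A-Za-z0-9_]", "_", spider_name)
    return f"{safe}_incremental"
```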

INCREMENTAL_CRAWL_ENABLED

Tip

When using the article spider template, you may use the incremental command-line parameter instead of this setting.

Default: False

If set to True, items seen in previous crawls with the same INCREMENTAL_CRAWL_COLLECTION_NAME value are skipped.

Implemented by IncrementalCrawlMiddleware.
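Putting the incremental-crawl settings together, a project might configure them like this (the collection name and batch size below are illustrative):

```python
# settings.py — illustrative incremental-crawl configuration
INCREMENTAL_CRAWL_ENABLED = True
INCREMENTAL_CRAWL_COLLECTION_NAME = "articles_incremental"  # hypothetical name
INCREMENTAL_CRAWL_BATCH_SIZE = 50  # the default
```

Runs sharing the same collection name skip URLs already seen in previous incremental runs.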