API
Spiders
- class zyte_spider_templates.ArticleSpider(*args: Any, **kwargs: Any)[source]
Yield articles from one or more websites that contain articles.
See ArticleSpiderParams for supported parameters.
- class zyte_spider_templates.EcommerceSpider(*args: Any, **kwargs: Any)[source]
Yield products from an e-commerce website.
See EcommerceSpiderParams for supported parameters.
- class zyte_spider_templates.GoogleSearchSpider(*args: Any, **kwargs: Any)[source]
Yield results from Google searches.
See GoogleSearchSpiderParams for supported parameters.
- class zyte_spider_templates.JobPostingSpider(*args: Any, **kwargs: Any)[source]
Yield job postings from a job website.
See JobPostingSpiderParams for supported parameters.
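These templates are regular Scrapy spiders, and their parameters are passed as spider arguments. A minimal sketch of running one of them from Python, assuming a project already configured for zyte-spider-templates; the URL and parameter values are illustrative, not defaults:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from zyte_spider_templates import EcommerceSpider

# Keyword arguments map to the EcommerceSpiderParams fields documented
# below; the values here are illustrative.
process = CrawlerProcess(get_project_settings())
process.crawl(
    EcommerceSpider,
    url="https://books.toscrape.com/",
    crawl_strategy="automatic",
    extract_from="httpResponseBody",
)
process.start()
```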
Pages
- class zyte_spider_templates.pages.DefaultSearchRequestTemplatePage(response: AnyResponse, page_params: PageParams)[source]
Parameter mixins
- pydantic model zyte_spider_templates.params.CustomAttrsMethodParam[source]
- field custom_attrs_method: CustomAttrsMethod = CustomAttrsMethod.generate
Which model to use for custom attribute extraction.
- enum zyte_spider_templates.params.CustomAttrsMethod(value)[source]
- pydantic model zyte_spider_templates.params.ExtractFromParam[source]
- field extract_from: ExtractFrom | None = None
Whether to perform extraction using a browser request (browserHtml) or an HTTP request (httpResponseBody).
- enum zyte_spider_templates.params.ExtractFrom(value)[source]
- Member Type: str
Valid values: browserHtml, httpResponseBody.
- pydantic model zyte_spider_templates.params.GeolocationParam[source]
- field geolocation: Geolocation | None = None
Country of the IP addresses to use.
- enum zyte_spider_templates.params.Geolocation(value)[source]
- pydantic model zyte_spider_templates.params.UrlParam[source]
- field url: str = ''
Initial URL for the crawl. Enter the full URL including http(s); you can copy and paste it from your browser. Example: https://toscrape.com/
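Each parameter mixin is a pydantic model contributing one validated field, so spider parameter models combine them through inheritance. A minimal sketch of that pattern; the MyParams class is hypothetical:

```python
from zyte_spider_templates.params import GeolocationParam, UrlParam

# Hypothetical combined model: multiple inheritance merges the fields
# declared by each mixin into a single pydantic model.
class MyParams(GeolocationParam, UrlParam):
    pass

params = MyParams(url="https://toscrape.com/")
print(params.url)          # "https://toscrape.com/"
print(params.geolocation)  # None (the default)
```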
- pydantic model zyte_spider_templates.spiders.ecommerce.EcommerceCrawlStrategyParam[source]
- field crawl_strategy: EcommerceCrawlStrategy = EcommerceCrawlStrategy.automatic
Determines how the start URL and follow-up URLs are crawled.
- enum zyte_spider_templates.spiders.ecommerce.EcommerceCrawlStrategy(value)[source]
- pydantic model zyte_spider_templates.spiders.ecommerce.EcommerceExtractParam[source]
- field extract: EcommerceExtract = EcommerceExtract.product
Data to return.
- enum zyte_spider_templates.spiders.ecommerce.EcommerceExtract(value)[source]
- pydantic model zyte_spider_templates.spiders.serp.SerpItemTypeParam[source]
- field item_type: SerpItemType = SerpItemType.off
If specified, follow organic search result links, and extract the selected data type from the target pages. Spider output items will be of the specified data type, not search engine results page items.
- enum zyte_spider_templates.spiders.serp.SerpItemType(value)[source]
- pydantic model zyte_spider_templates.spiders.article.ArticleCrawlStrategyParam[source]
- field crawl_strategy: ArticleCrawlStrategy = ArticleCrawlStrategy.full
Determines how input URLs and follow-up URLs are crawled.
- enum zyte_spider_templates.spiders.article.ArticleCrawlStrategy(value)[source]
- pydantic model zyte_spider_templates.spiders.job_posting.JobPostingCrawlStrategyParam[source]
- field crawl_strategy: JobPostingCrawlStrategy = JobPostingCrawlStrategy.navigation
Determines how input URLs and follow-up URLs are crawled.
Middlewares
- class zyte_spider_templates.CrawlingLogsMiddleware(crawler=None)[source]
For each page visited, this middleware logs what the spider has extracted and what it plans to crawl next. These logs make it easy to debug crawling behavior and see what went wrong. Apart from high-level summarized information, the logs also include JSON-formatted data so that they can easily be parsed later on.
Note that scrapy.utils.request.request_fingerprint is used to match what https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy uses, so these logs can easily be matched with the request fingerprints logged in Scrapy Cloud's request data.
This middleware helps manage navigation depth by setting a final_navigation_page meta key when the predefined depth limit (NAVIGATION_DEPTH_LIMIT) is reached.
Note
Navigation depth is typically increased for requests that navigate to a subcategory originating from its parent category, such as a request targeting a category starting from the website home page. However, it may not be necessary to increase navigation depth for, say, pagination requests to the next page. Spiders can customize this behavior as needed by controlling when navigation depth is incremented.
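A sketch of how a spider might react to the depth limit; the setting value is illustrative, and the meta key name comes from the description above:

```python
# settings.py sketch: cap navigation depth (the value 3 is illustrative).
NAVIGATION_DEPTH_LIMIT = 3

# In a spider callback, a response marked as final should not produce
# further navigation requests (meta key name from the docs above):
def parse_navigation(self, response):
    if response.meta.get("final_navigation_page"):
        return  # depth limit reached; do not navigate deeper
    # ... otherwise yield subcategory requests as usual ...
```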
- class zyte_spider_templates.MaxRequestsPerSeedDownloaderMiddleware(crawler: Crawler)[source]
This middleware limits the number of follow-up requests that each seed request may generate.
To enable this middleware, set the MAX_REQUESTS_PER_SEED setting to the desired positive value. Non-positive integers (i.e. 0 and below) impose no limit and disable this middleware. By default, all start requests are considered seed requests, and all other requests are not.
Note that TrackSeedsSpiderMiddleware must also be enabled for this middleware to work.
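A settings sketch; the middleware import paths and priority numbers are assumptions for illustration, not values taken from the library's documentation:

```python
# settings.py sketch (paths and priorities are illustrative assumptions)
MAX_REQUESTS_PER_SEED = 100  # 0 or below disables the limit

DOWNLOADER_MIDDLEWARES = {
    "zyte_spider_templates.MaxRequestsPerSeedDownloaderMiddleware": 100,
}
SPIDER_MIDDLEWARES = {
    # Required so that seed-tracking meta values are set for each request.
    "zyte_spider_templates.TrackSeedsSpiderMiddleware": 550,
}
```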
- class zyte_spider_templates.OffsiteRequestsPerSeedMiddleware(crawler: Crawler)[source]
This middleware ensures that subsequent requests for each seed do not go outside the original seed’s domain.
However, offsite requests are allowed only if they come from the original domain; offsite requests that follow from offsite responses are not allowed. This behavior makes it possible to crawl articles from news aggregator websites while ensuring the spider does not fully crawl the other domains it discovers.
Disabling the middleware leaves such offsite requests unfiltered, which may lead to other domains being crawled completely, unless allowed_domains is set in the spider.
This middleware relies on TrackSeedsSpiderMiddleware to set the "seed" and "is_seed_request" values in Request.meta. Ensure that middleware is active and sets these values before this middleware processes the spider's output.
Note
If a seed URL gets redirected to a different domain, both the domain from the original request and the domain from the redirected response will be used as references.
If the seed URL is https://books.toscrape.com, all subsequent requests to books.toscrape.com and its subdomains are allowed, but requests to toscrape.com are not. Conversely, if the seed URL is https://toscrape.com, requests to both toscrape.com and books.toscrape.com are allowed.
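A sketch of the seed-tracking metadata this middleware consumes; the key names come from the description above, and the request itself is illustrative:

```python
import scrapy

# Illustrative request: TrackSeedsSpiderMiddleware is described above as
# setting these meta values, which this middleware then consults.
request = scrapy.Request(
    "https://books.toscrape.com/",
    meta={
        "seed": "https://books.toscrape.com/",  # reference for offsite checks
        "is_seed_request": True,
    },
)
```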
- class zyte_spider_templates.OnlyFeedsMiddleware(crawler: Crawler)[source]
This middleware controls whether the spider discovers all links on a webpage or extracts links only from RSS/Atom feeds.
- class zyte_spider_templates.IncrementalCrawlMiddleware(crawler: Crawler)[source]
Downloader middleware to skip items seen in previous crawls.
To enable this middleware, set the INCREMENTAL_CRAWL_ENABLED setting to True.
This middleware keeps a record of the URLs of crawled items in the Zyte Scrapy Cloud collection specified in the INCREMENTAL_CRAWL_COLLECTION_NAME setting, and skips items, responses, and requests with matching URLs.
Use INCREMENTAL_CRAWL_BATCH_SIZE to fine-tune interactions with the collection for performance.
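A settings sketch using the three settings named above; the collection name and batch size are illustrative values, not defaults:

```python
# settings.py sketch (values are illustrative, not library defaults)
INCREMENTAL_CRAWL_ENABLED = True
INCREMENTAL_CRAWL_COLLECTION_NAME = "my_article_urls"  # hypothetical name
INCREMENTAL_CRAWL_BATCH_SIZE = 50  # tune reads/writes to the collection
```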