API
Spiders
- class zyte_spider_templates.ArticleSpider(*args: Any, **kwargs: Any)[source]
Yield articles from one or more websites that contain articles.
See ArticleSpiderParams for supported parameters.
- class zyte_spider_templates.EcommerceSpider(*args: Any, **kwargs: Any)[source]
Yield products from an e-commerce website.
See EcommerceSpiderParams for supported parameters.
- class zyte_spider_templates.GoogleSearchSpider(*args: Any, **kwargs: Any)[source]
Yield results from Google searches.
See GoogleSearchSpiderParams for supported parameters.
- class zyte_spider_templates.JobPostingSpider(*args: Any, **kwargs: Any)[source]
Yield job postings from a job website.
See JobPostingSpiderParams for supported parameters.
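These templates are regular Scrapy spiders, and their parameters are passed as spider arguments. A minimal sketch of running one of them from Python, assuming a project already configured for zyte-spider-templates; the URL and parameter values are illustrative, not defaults:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from zyte_spider_templates import EcommerceSpider

# Keyword arguments map to the EcommerceSpiderParams fields documented
# below; the values here are illustrative.
process = CrawlerProcess(get_project_settings())
process.crawl(
    EcommerceSpider,
    url="https://books.toscrape.com/",
    crawl_strategy="automatic",
    extract_from="httpResponseBody",
)
process.start()
```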
Pages
- class zyte_spider_templates.pages.DefaultSearchRequestTemplatePage(response: AnyResponse, page_params: PageParams)[source]
Parameter mixins
- pydantic model zyte_spider_templates.params.CustomAttrsMethodParam[source]
- field custom_attrs_method: CustomAttrsMethod = CustomAttrsMethod.generate
Which model to use for custom attribute extraction.
- enum zyte_spider_templates.params.CustomAttrsMethod(value)[source]
- pydantic model zyte_spider_templates.params.ExtractFromParam[source]
- field extract_from: ExtractFrom | None = None
Whether to perform extraction using a browser request (browserHtml) or an HTTP request (httpResponseBody).
- enum zyte_spider_templates.params.ExtractFrom(value)[source]
- Member Type: str
Valid values: browserHtml, httpResponseBody.
- pydantic model zyte_spider_templates.params.GeolocationParam[source]
- field geolocation: Geolocation | None = None
Country of the IP addresses to use.
- enum zyte_spider_templates.params.Geolocation(value)[source]
- pydantic model zyte_spider_templates.params.UrlParam[source]
- field url: str = ''
Initial URL for the crawl. Enter the full URL including http(s); you can copy and paste it from your browser. Example: https://toscrape.com/
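Each parameter mixin is a pydantic model contributing one validated field, so spider parameter models combine them through inheritance. A minimal sketch of that pattern; the MyParams class is hypothetical:

```python
from zyte_spider_templates.params import GeolocationParam, UrlParam

# Hypothetical combined model: multiple inheritance merges the fields
# declared by each mixin into a single pydantic model.
class MyParams(GeolocationParam, UrlParam):
    pass

params = MyParams(url="https://toscrape.com/")
print(params.url)          # "https://toscrape.com/"
print(params.geolocation)  # None (the default)
```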
- pydantic model zyte_spider_templates.spiders.ecommerce.EcommerceCrawlStrategyParam[source]
- field crawl_strategy: EcommerceCrawlStrategy = EcommerceCrawlStrategy.automatic
Determines how the start URL and follow-up URLs are crawled.
- enum zyte_spider_templates.spiders.ecommerce.EcommerceCrawlStrategy(value)[source]
- pydantic model zyte_spider_templates.spiders.ecommerce.EcommerceExtractParam[source]
- field extract: EcommerceExtract = EcommerceExtract.product
Data to return.
- enum zyte_spider_templates.spiders.ecommerce.EcommerceExtract(value)[source]
- pydantic model zyte_spider_templates.spiders.serp.SerpItemTypeParam[source]
- field item_type: SerpItemType = SerpItemType.off
If specified, follow organic search result links, and extract the selected data type from the target pages. Spider output items will be of the specified data type, not search engine results page items.
- enum zyte_spider_templates.spiders.serp.SerpItemType(value)[source]
- pydantic model zyte_spider_templates.spiders.article.ArticleCrawlStrategyParam[source]
- field crawl_strategy: ArticleCrawlStrategy = ArticleCrawlStrategy.full
Determines how input URLs and follow-up URLs are crawled.
- enum zyte_spider_templates.spiders.article.ArticleCrawlStrategy(value)[source]
- pydantic model zyte_spider_templates.spiders.job_posting.JobPostingCrawlStrategyParam[source]
- field crawl_strategy: JobPostingCrawlStrategy = JobPostingCrawlStrategy.navigation
Determines how input URLs and follow-up URLs are crawled.
Middlewares
- class zyte_spider_templates.CrawlingLogsMiddleware(crawler=None)[source]
For each page visited, this middleware logs what the spider has extracted and what it plans to crawl next. These logs make it easy to debug crawling behavior and see what went wrong. Apart from high-level summarized information, the logs also include JSON-formatted data so that they can easily be parsed later on.
Note that scrapy.utils.request.request_fingerprint is used to match what https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy uses, so these logs can easily be matched with the request fingerprints logged in Scrapy Cloud's request data.
This middleware helps manage navigation depth by setting a final_navigation_page meta key when the predefined depth limit (NAVIGATION_DEPTH_LIMIT) is reached.
Note
Navigation depth is typically increased for requests that navigate to a subcategory originating from its parent category, such as a request targeting a category starting from the website home page. However, it may not be necessary to increase navigation depth for, say, pagination requests to the next page. Spiders can customize this behavior as needed by controlling when navigation depth is incremented.
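A sketch of how a spider might react to the depth limit; the setting value is illustrative, and the meta key name comes from the description above:

```python
# settings.py sketch: cap navigation depth (the value 3 is illustrative).
NAVIGATION_DEPTH_LIMIT = 3

# In a spider callback, a response marked as final should not produce
# further navigation requests (meta key name from the docs above):
def parse_navigation(self, response):
    if response.meta.get("final_navigation_page"):
        return  # depth limit reached; do not navigate deeper
    # ... otherwise yield subcategory requests as usual ...
```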
- class zyte_spider_templates.MaxRequestsPerSeedDownloaderMiddleware(crawler: Crawler)[source]
This middleware limits the number of follow-up requests that each seed request may generate.
To enable this middleware, set the MAX_REQUESTS_PER_SEED setting to the desired positive value. Non-positive integers (i.e. 0 and below) impose no limit and disable this middleware. By default, all start requests are considered seed requests, and all other requests are not.
Note that TrackSeedsSpiderMiddleware must also be enabled for this middleware to work.
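A settings sketch; the middleware import paths and priority numbers are assumptions for illustration, not values taken from the library's documentation:

```python
# settings.py sketch (paths and priorities are illustrative assumptions)
MAX_REQUESTS_PER_SEED = 100  # 0 or below disables the limit

DOWNLOADER_MIDDLEWARES = {
    "zyte_spider_templates.MaxRequestsPerSeedDownloaderMiddleware": 100,
}
SPIDER_MIDDLEWARES = {
    # Required so that seed-tracking meta values are set for each request.
    "zyte_spider_templates.TrackSeedsSpiderMiddleware": 550,
}
```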
- class zyte_spider_templates.OffsiteRequestsPerSeedMiddleware(crawler: Crawler)[source]
This middleware ensures that subsequent requests for each seed do not go outside the original seed’s domain.
However, offsite requests are allowed only if they come from the original domain; offsite requests that follow from offsite responses are not allowed. This behavior makes it possible to crawl articles from news aggregator websites while ensuring the spider does not fully crawl the other domains it discovers.
Disabling the middleware leaves such offsite requests unfiltered, which may lead to other domains being crawled completely, unless allowed_domains is set in the spider.
This middleware relies on TrackSeedsSpiderMiddleware to set the "seed" and "is_seed_request" values in Request.meta. Ensure that middleware is active and sets these values before this middleware processes the spider's output.
Note
If a seed URL gets redirected to a different domain, both the domain from the original request and the domain from the redirected response will be used as references.
If the seed URL is https://books.toscrape.com, all subsequent requests to books.toscrape.com and its subdomains are allowed, but requests to toscrape.com are not. Conversely, if the seed URL is https://toscrape.com, requests to both toscrape.com and books.toscrape.com are allowed.
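A sketch of the seed-tracking metadata this middleware consumes; the key names come from the description above, and the request itself is illustrative:

```python
import scrapy

# Illustrative request: TrackSeedsSpiderMiddleware is described above as
# setting these meta values, which this middleware then consults.
request = scrapy.Request(
    "https://books.toscrape.com/",
    meta={
        "seed": "https://books.toscrape.com/",  # reference for offsite checks
        "is_seed_request": True,
    },
)
```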
- class zyte_spider_templates.OnlyFeedsMiddleware(crawler: Crawler)[source]
This middleware controls whether the spider discovers all links on a webpage or extracts links only from RSS/Atom feeds.
- class zyte_spider_templates.IncrementalCrawlMiddleware(crawler: Crawler)[source]
Downloader middleware to skip items seen in previous crawls.
To enable this middleware, set the INCREMENTAL_CRAWL_ENABLED setting to True.
This middleware keeps a record of the URLs of crawled items in the Zyte Scrapy Cloud collection specified in the INCREMENTAL_CRAWL_COLLECTION_NAME setting, and skips items, responses, and requests with matching URLs.
Use INCREMENTAL_CRAWL_BATCH_SIZE to fine-tune interactions with the collection for performance.
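A settings sketch using the three settings named above; the collection name and batch size are illustrative values, not defaults:

```python
# settings.py sketch (values are illustrative, not library defaults)
INCREMENTAL_CRAWL_ENABLED = True
INCREMENTAL_CRAWL_COLLECTION_NAME = "my_article_urls"  # hypothetical name
INCREMENTAL_CRAWL_BATCH_SIZE = 50  # tune reads/writes to the collection
```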