zyte-spider-templates documentation

Spider templates for automatic crawlers.

This library contains Scrapy spider templates. They can be used out of the box with Zyte features such as Zyte API, or modified to be used standalone. There is a sample Scrapy project for this library that you can use as a starting point for your own projects.

Initial setup

Learn how to get spider templates installed and configured on an existing Scrapy project.

Tip

If you do not have a Scrapy project yet, use zyte-spider-templates-project as a starting template to get up and running quickly.

Requirements

  • Python 3.8+

  • Scrapy 2.11+

For Zyte API features, including AI-powered parsing, you need a Zyte API subscription.

Installation

pip install zyte-spider-templates

Configuration

In your Scrapy project settings (usually in settings.py):

For Zyte API features, including AI-powered parsing, configure scrapy-zyte-api with scrapy-poet integration.

The following additional settings are recommended:

  • Set CLOSESPIDER_TIMEOUT_NO_ITEM to 600, to force the spider to stop if no item has been found for 10 minutes.

  • Set SCHEDULER_DISK_QUEUE to "scrapy.squeues.PickleFifoDiskQueue" and SCHEDULER_MEMORY_QUEUE to "scrapy.squeues.FifoMemoryQueue", for better request priority handling.

  • Update SPIDER_MIDDLEWARES to include "zyte_spider_templates.middlewares.CrawlingLogsMiddleware": 1000, to log crawl data in JSON format for debugging purposes.

  • Ensure that zyte_common_items.ZyteItemAdapter is also configured:

    from itemadapter import ItemAdapter
    from zyte_common_items import ZyteItemAdapter
    
    ItemAdapter.ADAPTER_CLASSES.appendleft(ZyteItemAdapter)
    
  • Update SPIDER_MIDDLEWARES to include "zyte_spider_templates.middlewares.AllowOffsiteMiddleware": 500 and "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None. This allows crawling item links outside of the domain of the start URL.
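
Put together, and leaving aside the scrapy-zyte-api and scrapy-poet configuration (see their own documentation), the recommendations above could look roughly as follows in settings.py (a sketch, not a complete file):

# Register ZyteItemAdapter (same snippet as in the list above)
from itemadapter import ItemAdapter
from zyte_common_items import ZyteItemAdapter

ItemAdapter.ADAPTER_CLASSES.appendleft(ZyteItemAdapter)

# Stop the spider if no item has been scraped for 10 minutes
CLOSESPIDER_TIMEOUT_NO_ITEM = 600

# FIFO queues for better request priority handling
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

SPIDER_MIDDLEWARES = {
    # Log crawl data in JSON format for debugging
    "zyte_spider_templates.middlewares.CrawlingLogsMiddleware": 1000,
    # Allow crawling item links outside of the domain
    "zyte_spider_templates.middlewares.AllowOffsiteMiddleware": 500,
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
}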

For an example of a properly configured settings.py file, see the one in zyte-spider-templates-project.

Spider templates

Built-in spider templates use Zyte API automatic extraction to provide automatic crawling and parsing, i.e. you can run these spiders on any website of the right type to automatically extract the desired structured data.

For example, to extract all products from an e-commerce website, you can run the e-commerce spider as follows:

scrapy crawl ecommerce -a url="https://books.toscrape.com"

Spider templates support additional parameters beyond url. See the documentation of each specific spider for details.
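
For example, the following command (parameter values are illustrative) limits the crawl to pagination and product detail pages and caps the number of Zyte API requests at 100:

scrapy crawl ecommerce -a url="https://books.toscrape.com" -a crawl_strategy="pagination_only" -a max_requests=100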

You can also customize spider templates to meet your needs.

Spider template list

E-commerce

Get products from an e-commerce website.

E-commerce spider template (ecommerce)

Basic use

scrapy crawl ecommerce -a url="https://books.toscrape.com"

Parameters

pydantic model zyte_spider_templates.spiders.ecommerce.EcommerceSpiderParams[source]
Config:
  • json_schema_extra: dict = {'groups': [{'id': 'inputs', 'title': 'Inputs', 'description': 'Input data that determines the start URLs of the crawl.', 'widget': 'exclusive'}]}

Validators:
  • single_input » all fields

field crawl_strategy: EcommerceCrawlStrategy = EcommerceCrawlStrategy.full

Determines how the start URL and follow-up URLs are crawled.

Validated by:
  • single_input

field extract_from: ExtractFrom | None = None

Whether to perform extraction using a browser request (browserHtml) or an HTTP request (httpResponseBody).

Validated by:
  • single_input

field geolocation: Geolocation | None = None

ISO 3166-1 alpha-2 country code (a 2-character string), as specified in https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation.

Validated by:
  • single_input

field max_requests: int | None = 100

The maximum number of Zyte API requests allowed for the crawl.

Requests with error responses that cannot be retried or exceed their retry limit also count here, but they incur no costs and do not increase the request count in Scrapy Cloud.

Validated by:
  • single_input

field url: str = ''

Initial URL for the crawl. Enter the full URL including http(s); you can copy and paste it from your browser. Example: https://toscrape.com/

Constraints:
  • pattern = ^https?://[^:/\s]+(:\d{1,5})?(/[^\s]*)*(#[^\s]*)?$

Validated by:
  • single_input

field urls_file: str = ''

URL that points to a plain-text file with a list of URLs to crawl, e.g. https://example.com/url-list.txt. The linked file must contain 1 URL per line.

Constraints:
  • pattern = ^https?://[^:/\s]+(:\d{1,5})?(/[^\s]*)*(#[^\s]*)?$

Validated by:
  • single_input

validator single_input  »  all fields

Fields url and urls_file form a mandatory, mutually-exclusive field group: exactly one of them must be defined; the other must be left undefined.
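
For example, to start the crawl from a plain-text list of URLs instead of a single start URL (the file URL below is illustrative):

scrapy crawl ecommerce -a urls_file="https://example.com/url-list.txt"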

Customization

Built-in spider templates can be highly customized:

Customizing spider templates

Subclass a built-in spider template to customize its metadata, parameters, and crawling logic.

Customizing metadata

Spider template metadata is defined using scrapy-spider-metadata, and can be redefined or customized in a subclass.

For example, to keep the upstream title but change the description:

from zyte_spider_templates import EcommerceSpider


class MySpider(EcommerceSpider):
    name = "my_spider"
    metadata = {
        **EcommerceSpider.metadata,
        "description": "Custom e-commerce spider template.",
    }

Customizing parameters

Spider template parameters are also defined using scrapy-spider-metadata, and can be redefined or customized in a subclass as well.

For example, to add a min_price parameter and filter out products with a lower price:

from decimal import Decimal
from typing import Iterable

from scrapy_poet import DummyResponse
from scrapy_spider_metadata import Args
from zyte_common_items import Product
from zyte_spider_templates import EcommerceSpider
from zyte_spider_templates.spiders.ecommerce import EcommerceSpiderParams


class MyParams(EcommerceSpiderParams):
    min_price: str = "0.00"


class MySpider(EcommerceSpider, Args[MyParams]):
    name = "my_spider"

    def parse_product(
        self, response: DummyResponse, product: Product
    ) -> Iterable[Product]:
        for product in super().parse_product(response, product):
            if Decimal(product.price) >= Decimal(self.args.min_price):
                yield product

You can also override existing parameters. For example, to hard-code the start URL:

from scrapy_spider_metadata import Args
from zyte_spider_templates import EcommerceSpider
from zyte_spider_templates.spiders.ecommerce import EcommerceSpiderParams


class MyParams(EcommerceSpiderParams):
    url: str = "https://books.toscrape.com"


class MySpider(EcommerceSpider, Args[MyParams]):
    name = "my_spider"

A mixin class exists for every spider parameter (see Parameter mixins), so you can combine them in any order you like in your custom parameter classes, while still benefiting from future improvements to validation, documentation, or Scrapy Cloud UI integration:

from scrapy_spider_metadata import Args
from zyte_spider_templates.params import GeolocationParam, UrlParam


class MyParams(GeolocationParam, UrlParam):
    pass


class MySpider(Args[MyParams]):
    name = "my_spider"

Customizing the crawling logic

The crawling logic of spider templates can be customized like that of any other Scrapy spider.

For example, you can make a spider that expects a product details URL and does not follow navigation at all:

from typing import Iterable

from scrapy import Request
from zyte_spider_templates import EcommerceSpider


class MySpider(EcommerceSpider):
    name = "my_spider"

    def start_requests(self) -> Iterable[Request]:
        for request in super().start_requests():
            yield request.replace(callback=self.parse_product)

All parsing logic is implemented separately in page objects, making it easier to read the code of built-in spider templates and modify them as desired.

Customizing page objects

All parsing is implemented using web-poet page objects that use Zyte API automatic extraction to extract standard items, both for navigation and for item details.

You can implement your own page object classes to override how extraction works for any given combination of URL and item type.

Tip

Make sure the import path of your page objects module is in the SCRAPY_POET_DISCOVER setting, otherwise your page objects might be ignored.
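
For example, assuming your page objects live in a pages module of a myproject package (a hypothetical module path):

# settings.py
SCRAPY_POET_DISCOVER = ["myproject.pages"]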

Overriding parsing

To change or fix how a given field is extracted, overriding the value from Zyte API automatic extraction, create a page object class, configured to run on some given URLs (web_poet.handle_urls()), that defines the logic to extract that field. For example:

pages/books_toscrape_com.py
import attrs
from number_parser import parse_number
from web_poet import AnyResponse, field, handle_urls
from zyte_common_items import AggregateRating, AutoProductPage


@handle_urls("books.toscrape.com")
@attrs.define
class BooksToScrapeComProductPage(AutoProductPage):
    response: AnyResponse

    @field
    async def aggregateRating(self):
        element_class = self.response.css(".star-rating::attr(class)").get()
        if not element_class:
            return None
        rating_str = element_class.split(" ")[-1]
        rating = parse_number(rating_str)
        if not rating:
            return None
        return AggregateRating(ratingValue=rating, bestRating=5)

AutoProductPage and other page objects from zyte-common-items prefixed with Auto define fields for all standard items that return the value from Zyte API automatic extraction, so that you only need to define the fields you want to override.

The page object above is decorated with @attrs.define so that it can declare a dependency on AnyResponse and use that to implement custom parsing logic. You could alternatively use BrowserHtml if needed.

Parsing a new field

To extract a new field for one or more websites:

  1. Declare a new item type that extends a standard item with your new field. For example:

    items.py
    from typing import Optional
    
    import attrs
    from zyte_common_items import Product
    
    
    @attrs.define
    class CustomProduct(Product):
        stock: Optional[int]
    
  2. Create a page object class, configured to run for your new item type (web_poet.pages.Returns) on some given URLs (web_poet.handle_urls()), that defines the logic to extract your new field. For example:

    pages/books_toscrape_com.py
    import re
    
    from web_poet import Returns, field, handle_urls
    from zyte_common_items import AutoProductPage
    
    from ..items import CustomProduct
    
    
    @handle_urls("books.toscrape.com")
    class BookPage(AutoProductPage, Returns[CustomProduct]):
        @field
        async def stock(self):
            for entry in await self.additionalProperties:
                if entry.name == "availability":
                    match = re.search(r"\d([.,\s]*\d+)*(?=\s+available\b)", entry.value)
                    if not match:
                        return None
                    stock_str = re.sub(r"[.,\s]", "", match[0])
                    return int(stock_str)
            return None
    
  3. Create a spider template subclass that requests your new item type instead of the standard one. For example:

    spiders/books_toscrape_com.py
    from scrapy_poet import DummyResponse
    from zyte_spider_templates import EcommerceSpider
    
    from ..items import CustomProduct
    
    
    class BooksToScrapeComSpider(EcommerceSpider):
        name = "books_toscrape_com"
        metadata = {
            **EcommerceSpider.metadata,
            "title": "Books to Scrape",
            "description": "Spider template for books.toscrape.com",
        }
    
        def parse_product(self, response: DummyResponse, product: CustomProduct):
            yield from super().parse_product(response, product)
    

Reference

Spiders

class zyte_spider_templates.BaseSpider(*args: Any, **kwargs: Any)[source]
class zyte_spider_templates.EcommerceSpider(*args: Any, **kwargs: Any)[source]

Yield products from an e-commerce website.

See EcommerceSpiderParams for supported parameters.

Pages

class zyte_spider_templates.pages.HeuristicsProductNavigationPage(request_url: RequestUrl, product_navigation: ProductNavigation, response: AnyResponse, page_params: PageParams)[source]

Parameter mixins

pydantic model zyte_spider_templates.params.ExtractFromParam[source]
field extract_from: ExtractFrom | None = None

Whether to perform extraction using a browser request (browserHtml) or an HTTP request (httpResponseBody).

enum zyte_spider_templates.params.ExtractFrom(value)[source]
Member Type: str

Valid values are as follows:

httpResponseBody: str = <ExtractFrom.httpResponseBody: 'httpResponseBody'>

Use HTTP responses. Cost-efficient and fast extraction method, which works well on many websites.

browserHtml: str = <ExtractFrom.browserHtml: 'browserHtml'>

Use browser rendering. Often provides the best quality.

pydantic model zyte_spider_templates.params.GeolocationParam[source]
field geolocation: Geolocation | None = None

ISO 3166-1 alpha-2 country code (a 2-character string), as specified in https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation.

enum zyte_spider_templates.params.Geolocation(value)[source]
Member Type: str

Valid values are as follows:

AF: str = <Geolocation.AF: 'AF'>
AL: str = <Geolocation.AL: 'AL'>
DZ: str = <Geolocation.DZ: 'DZ'>
AS: str = <Geolocation.AS: 'AS'>
AD: str = <Geolocation.AD: 'AD'>
AO: str = <Geolocation.AO: 'AO'>
AI: str = <Geolocation.AI: 'AI'>
AQ: str = <Geolocation.AQ: 'AQ'>
AG: str = <Geolocation.AG: 'AG'>
AR: str = <Geolocation.AR: 'AR'>
AM: str = <Geolocation.AM: 'AM'>
AW: str = <Geolocation.AW: 'AW'>
AU: str = <Geolocation.AU: 'AU'>
AT: str = <Geolocation.AT: 'AT'>
AZ: str = <Geolocation.AZ: 'AZ'>
BS: str = <Geolocation.BS: 'BS'>
BH: str = <Geolocation.BH: 'BH'>
BD: str = <Geolocation.BD: 'BD'>
BB: str = <Geolocation.BB: 'BB'>
BY: str = <Geolocation.BY: 'BY'>
BE: str = <Geolocation.BE: 'BE'>
BZ: str = <Geolocation.BZ: 'BZ'>
BJ: str = <Geolocation.BJ: 'BJ'>
BM: str = <Geolocation.BM: 'BM'>
BT: str = <Geolocation.BT: 'BT'>
BO: str = <Geolocation.BO: 'BO'>
BQ: str = <Geolocation.BQ: 'BQ'>
BA: str = <Geolocation.BA: 'BA'>
BW: str = <Geolocation.BW: 'BW'>
BV: str = <Geolocation.BV: 'BV'>
BR: str = <Geolocation.BR: 'BR'>
IO: str = <Geolocation.IO: 'IO'>
BN: str = <Geolocation.BN: 'BN'>
BG: str = <Geolocation.BG: 'BG'>
BF: str = <Geolocation.BF: 'BF'>
BI: str = <Geolocation.BI: 'BI'>
CV: str = <Geolocation.CV: 'CV'>
KH: str = <Geolocation.KH: 'KH'>
CM: str = <Geolocation.CM: 'CM'>
CA: str = <Geolocation.CA: 'CA'>
KY: str = <Geolocation.KY: 'KY'>
CF: str = <Geolocation.CF: 'CF'>
TD: str = <Geolocation.TD: 'TD'>
CL: str = <Geolocation.CL: 'CL'>
CN: str = <Geolocation.CN: 'CN'>
CX: str = <Geolocation.CX: 'CX'>
CC: str = <Geolocation.CC: 'CC'>
CO: str = <Geolocation.CO: 'CO'>
KM: str = <Geolocation.KM: 'KM'>
CG: str = <Geolocation.CG: 'CG'>
CD: str = <Geolocation.CD: 'CD'>
CK: str = <Geolocation.CK: 'CK'>
CR: str = <Geolocation.CR: 'CR'>
HR: str = <Geolocation.HR: 'HR'>
CU: str = <Geolocation.CU: 'CU'>
CW: str = <Geolocation.CW: 'CW'>
CY: str = <Geolocation.CY: 'CY'>
CZ: str = <Geolocation.CZ: 'CZ'>
CI: str = <Geolocation.CI: 'CI'>
DK: str = <Geolocation.DK: 'DK'>
DJ: str = <Geolocation.DJ: 'DJ'>
DM: str = <Geolocation.DM: 'DM'>
DO: str = <Geolocation.DO: 'DO'>
EC: str = <Geolocation.EC: 'EC'>
EG: str = <Geolocation.EG: 'EG'>
SV: str = <Geolocation.SV: 'SV'>
GQ: str = <Geolocation.GQ: 'GQ'>
ER: str = <Geolocation.ER: 'ER'>
EE: str = <Geolocation.EE: 'EE'>
SZ: str = <Geolocation.SZ: 'SZ'>
ET: str = <Geolocation.ET: 'ET'>
FK: str = <Geolocation.FK: 'FK'>
FO: str = <Geolocation.FO: 'FO'>
FJ: str = <Geolocation.FJ: 'FJ'>
FI: str = <Geolocation.FI: 'FI'>
FR: str = <Geolocation.FR: 'FR'>
GF: str = <Geolocation.GF: 'GF'>
PF: str = <Geolocation.PF: 'PF'>
TF: str = <Geolocation.TF: 'TF'>
GA: str = <Geolocation.GA: 'GA'>
GM: str = <Geolocation.GM: 'GM'>
GE: str = <Geolocation.GE: 'GE'>
DE: str = <Geolocation.DE: 'DE'>
GH: str = <Geolocation.GH: 'GH'>
GI: str = <Geolocation.GI: 'GI'>
GR: str = <Geolocation.GR: 'GR'>
GL: str = <Geolocation.GL: 'GL'>
GD: str = <Geolocation.GD: 'GD'>
GP: str = <Geolocation.GP: 'GP'>
GU: str = <Geolocation.GU: 'GU'>
GT: str = <Geolocation.GT: 'GT'>
GG: str = <Geolocation.GG: 'GG'>
GN: str = <Geolocation.GN: 'GN'>
GW: str = <Geolocation.GW: 'GW'>
GY: str = <Geolocation.GY: 'GY'>
HT: str = <Geolocation.HT: 'HT'>
HM: str = <Geolocation.HM: 'HM'>
VA: str = <Geolocation.VA: 'VA'>
HN: str = <Geolocation.HN: 'HN'>
HK: str = <Geolocation.HK: 'HK'>
HU: str = <Geolocation.HU: 'HU'>
IS: str = <Geolocation.IS: 'IS'>
IN: str = <Geolocation.IN: 'IN'>
ID: str = <Geolocation.ID: 'ID'>
IR: str = <Geolocation.IR: 'IR'>
IQ: str = <Geolocation.IQ: 'IQ'>
IE: str = <Geolocation.IE: 'IE'>
IM: str = <Geolocation.IM: 'IM'>
IL: str = <Geolocation.IL: 'IL'>
IT: str = <Geolocation.IT: 'IT'>
JM: str = <Geolocation.JM: 'JM'>
JP: str = <Geolocation.JP: 'JP'>
JE: str = <Geolocation.JE: 'JE'>
JO: str = <Geolocation.JO: 'JO'>
KZ: str = <Geolocation.KZ: 'KZ'>
KE: str = <Geolocation.KE: 'KE'>
KI: str = <Geolocation.KI: 'KI'>
KP: str = <Geolocation.KP: 'KP'>
KR: str = <Geolocation.KR: 'KR'>
KW: str = <Geolocation.KW: 'KW'>
KG: str = <Geolocation.KG: 'KG'>
LA: str = <Geolocation.LA: 'LA'>
LV: str = <Geolocation.LV: 'LV'>
LB: str = <Geolocation.LB: 'LB'>
LS: str = <Geolocation.LS: 'LS'>
LR: str = <Geolocation.LR: 'LR'>
LY: str = <Geolocation.LY: 'LY'>
LI: str = <Geolocation.LI: 'LI'>
LT: str = <Geolocation.LT: 'LT'>
LU: str = <Geolocation.LU: 'LU'>
MO: str = <Geolocation.MO: 'MO'>
MG: str = <Geolocation.MG: 'MG'>
MW: str = <Geolocation.MW: 'MW'>
MY: str = <Geolocation.MY: 'MY'>
MV: str = <Geolocation.MV: 'MV'>
ML: str = <Geolocation.ML: 'ML'>
MT: str = <Geolocation.MT: 'MT'>
MH: str = <Geolocation.MH: 'MH'>
MQ: str = <Geolocation.MQ: 'MQ'>
MR: str = <Geolocation.MR: 'MR'>
MU: str = <Geolocation.MU: 'MU'>
YT: str = <Geolocation.YT: 'YT'>
MX: str = <Geolocation.MX: 'MX'>
FM: str = <Geolocation.FM: 'FM'>
MD: str = <Geolocation.MD: 'MD'>
MC: str = <Geolocation.MC: 'MC'>
MN: str = <Geolocation.MN: 'MN'>
ME: str = <Geolocation.ME: 'ME'>
MS: str = <Geolocation.MS: 'MS'>
MA: str = <Geolocation.MA: 'MA'>
MZ: str = <Geolocation.MZ: 'MZ'>
MM: str = <Geolocation.MM: 'MM'>
NA: str = <Geolocation.NA: 'NA'>
NR: str = <Geolocation.NR: 'NR'>
NP: str = <Geolocation.NP: 'NP'>
NL: str = <Geolocation.NL: 'NL'>
NC: str = <Geolocation.NC: 'NC'>
NZ: str = <Geolocation.NZ: 'NZ'>
NI: str = <Geolocation.NI: 'NI'>
NE: str = <Geolocation.NE: 'NE'>
NG: str = <Geolocation.NG: 'NG'>
NU: str = <Geolocation.NU: 'NU'>
NF: str = <Geolocation.NF: 'NF'>
MK: str = <Geolocation.MK: 'MK'>
MP: str = <Geolocation.MP: 'MP'>
NO: str = <Geolocation.NO: 'NO'>
OM: str = <Geolocation.OM: 'OM'>
PK: str = <Geolocation.PK: 'PK'>
PW: str = <Geolocation.PW: 'PW'>
PS: str = <Geolocation.PS: 'PS'>
PA: str = <Geolocation.PA: 'PA'>
PG: str = <Geolocation.PG: 'PG'>
PY: str = <Geolocation.PY: 'PY'>
PE: str = <Geolocation.PE: 'PE'>
PH: str = <Geolocation.PH: 'PH'>
PN: str = <Geolocation.PN: 'PN'>
PL: str = <Geolocation.PL: 'PL'>
PT: str = <Geolocation.PT: 'PT'>
PR: str = <Geolocation.PR: 'PR'>
QA: str = <Geolocation.QA: 'QA'>
RO: str = <Geolocation.RO: 'RO'>
RU: str = <Geolocation.RU: 'RU'>
RW: str = <Geolocation.RW: 'RW'>
RE: str = <Geolocation.RE: 'RE'>
BL: str = <Geolocation.BL: 'BL'>
SH: str = <Geolocation.SH: 'SH'>
KN: str = <Geolocation.KN: 'KN'>
LC: str = <Geolocation.LC: 'LC'>
MF: str = <Geolocation.MF: 'MF'>
PM: str = <Geolocation.PM: 'PM'>
VC: str = <Geolocation.VC: 'VC'>
WS: str = <Geolocation.WS: 'WS'>
SM: str = <Geolocation.SM: 'SM'>
ST: str = <Geolocation.ST: 'ST'>
SA: str = <Geolocation.SA: 'SA'>
SN: str = <Geolocation.SN: 'SN'>
RS: str = <Geolocation.RS: 'RS'>
SC: str = <Geolocation.SC: 'SC'>
SL: str = <Geolocation.SL: 'SL'>
SG: str = <Geolocation.SG: 'SG'>
SX: str = <Geolocation.SX: 'SX'>
SK: str = <Geolocation.SK: 'SK'>
SI: str = <Geolocation.SI: 'SI'>
SB: str = <Geolocation.SB: 'SB'>
SO: str = <Geolocation.SO: 'SO'>
ZA: str = <Geolocation.ZA: 'ZA'>
GS: str = <Geolocation.GS: 'GS'>
SS: str = <Geolocation.SS: 'SS'>
ES: str = <Geolocation.ES: 'ES'>
LK: str = <Geolocation.LK: 'LK'>
SD: str = <Geolocation.SD: 'SD'>
SR: str = <Geolocation.SR: 'SR'>
SJ: str = <Geolocation.SJ: 'SJ'>
SE: str = <Geolocation.SE: 'SE'>
CH: str = <Geolocation.CH: 'CH'>
SY: str = <Geolocation.SY: 'SY'>
TW: str = <Geolocation.TW: 'TW'>
TJ: str = <Geolocation.TJ: 'TJ'>
TZ: str = <Geolocation.TZ: 'TZ'>
TH: str = <Geolocation.TH: 'TH'>
TL: str = <Geolocation.TL: 'TL'>
TG: str = <Geolocation.TG: 'TG'>
TK: str = <Geolocation.TK: 'TK'>
TO: str = <Geolocation.TO: 'TO'>
TT: str = <Geolocation.TT: 'TT'>
TN: str = <Geolocation.TN: 'TN'>
TM: str = <Geolocation.TM: 'TM'>
TC: str = <Geolocation.TC: 'TC'>
TV: str = <Geolocation.TV: 'TV'>
TR: str = <Geolocation.TR: 'TR'>
UG: str = <Geolocation.UG: 'UG'>
UA: str = <Geolocation.UA: 'UA'>
AE: str = <Geolocation.AE: 'AE'>
GB: str = <Geolocation.GB: 'GB'>
US: str = <Geolocation.US: 'US'>
UM: str = <Geolocation.UM: 'UM'>
UY: str = <Geolocation.UY: 'UY'>
UZ: str = <Geolocation.UZ: 'UZ'>
VU: str = <Geolocation.VU: 'VU'>
VE: str = <Geolocation.VE: 'VE'>
VN: str = <Geolocation.VN: 'VN'>
VG: str = <Geolocation.VG: 'VG'>
VI: str = <Geolocation.VI: 'VI'>
WF: str = <Geolocation.WF: 'WF'>
EH: str = <Geolocation.EH: 'EH'>
YE: str = <Geolocation.YE: 'YE'>
ZM: str = <Geolocation.ZM: 'ZM'>
ZW: str = <Geolocation.ZW: 'ZW'>
AX: str = <Geolocation.AX: 'AX'>
pydantic model zyte_spider_templates.params.MaxRequestsParam[source]
field max_requests: int | None = 100

The maximum number of Zyte API requests allowed for the crawl.

Requests with error responses that cannot be retried or exceed their retry limit also count here, but they incur no costs and do not increase the request count in Scrapy Cloud.

pydantic model zyte_spider_templates.params.UrlParam[source]
field url: str = ''

Initial URL for the crawl. Enter the full URL including http(s); you can copy and paste it from your browser. Example: https://toscrape.com/

Constraints:
  • pattern = ^https?://[^:/\s]+(:\d{1,5})?(/[^\s]*)*(#[^\s]*)?$

pydantic model zyte_spider_templates.spiders.ecommerce.EcommerceCrawlStrategyParam[source]
field crawl_strategy: EcommerceCrawlStrategy = EcommerceCrawlStrategy.full

Determines how the start URL and follow-up URLs are crawled.

enum zyte_spider_templates.spiders.ecommerce.EcommerceCrawlStrategy(value)[source]
Member Type: str

Valid values are as follows:

full: str = <EcommerceCrawlStrategy.full: 'full'>

Follow most links within the domain of the start URL in an attempt to discover and extract as many products as possible.

navigation: str = <EcommerceCrawlStrategy.navigation: 'navigation'>

Follow pagination, subcategories, and product detail pages.

Pagination Only is a better choice if the target URL does not have subcategories, or if Zyte API is misidentifying some URLs as subcategories.

pagination_only: str = <EcommerceCrawlStrategy.pagination_only: 'pagination_only'>

Follow pagination and product detail pages. Subcategory links are ignored.

Changes

0.7.2 (2024-05-07)

0.7.1 (2024-02-22)

0.7.0 (2024-02-09)

  • Updated requirement versions:

  • With the updated dependencies above, this fixes the issue of having 2 separate Zyte API requests (productNavigation and httpResponseBody) for the same URL. Note that this issue only occurs when requesting product navigation pages.

  • Moved zyte_spider_templates.spiders.ecommerce.ExtractFrom into zyte_spider_templates.spiders.base.ExtractFrom.

0.6.1 (2024-02-02)

  • Improved the zyte_spider_templates.spiders.base.BaseSpiderParams.url description.

0.6.0 (2024-01-31)

0.5.0 (2023-12-18)

  • The zyte_spider_templates.page_objects module is now deprecated in favor of zyte_spider_templates.pages, in line with web_poet.pages.

0.4.0 (2023-12-14)

  • Products outside of the target domain can now be crawled using zyte_spider_templates.middlewares.AllowOffsiteMiddleware.

  • Updated the documentation to also set up zyte_common_items.ZyteItemAdapter.

  • The max_requests spider parameter now has a default value of 100. Previously it was None, i.e. unlimited.

  • Improved the description of the max_requests spider parameter.

  • Official support for Python 3.12.

  • Misc documentation improvements.

0.3.0 (2023-11-03)

  • Added documentation.

  • Added a middleware that logs information about the crawl in JSON format, zyte_spider_templates.middlewares.CrawlingLogsMiddleware. This replaces the old crawling information that was difficult to parse using regular expressions.

0.2.0 (2023-10-30)

  • Now requires zyte-common-items >= 0.12.0.

  • Added a new crawl strategy, “Pagination Only”.

  • Improved the request priority calculation based on the metadata probability value.

  • CI improvements.

0.1.0 (2023-10-24)

Initial release.