zyte-spider-templates documentation
Spider templates for automatic crawlers.
This library contains Scrapy spider templates. They can be used out of the box with Zyte features such as Zyte API, or modified for standalone use. There is a sample Scrapy project for this library that you can use as a starting point for your own projects.
Initial setup
Learn how to get spider templates installed and configured on an existing Scrapy project.
Tip
If you do not have a Scrapy project yet, use zyte-spider-templates-project as a starting template.
Requirements
Python 3.8+
Scrapy 2.11+
For Zyte API features, including AI-powered parsing, you need a Zyte API subscription.
Installation
pip install zyte-spider-templates
Configuration
In your Scrapy project settings (usually in settings.py):

- Update SPIDER_MODULES to include "zyte_spider_templates.spiders".
- Configure scrapy-poet, and update SCRAPY_POET_DISCOVER to include "zyte_spider_templates.pages".
For Zyte API features, including AI-powered parsing, configure scrapy-zyte-api with scrapy-poet integration.
The following additional settings are recommended:
- Set CLOSESPIDER_TIMEOUT_NO_ITEM to 600, to force the spider to stop if no item has been found for 10 minutes.
- Set SCHEDULER_DISK_QUEUE to "scrapy.squeues.PickleFifoDiskQueue" and SCHEDULER_MEMORY_QUEUE to "scrapy.squeues.FifoMemoryQueue", for better request priority handling.
- Update SPIDER_MIDDLEWARES to include "zyte_spider_templates.middlewares.CrawlingLogsMiddleware": 1000, to log crawl data in JSON format for debugging purposes.
- Ensure that zyte_common_items.ZyteItemAdapter is also configured:

  from itemadapter import ItemAdapter
  from zyte_common_items import ZyteItemAdapter

  ItemAdapter.ADAPTER_CLASSES.appendleft(ZyteItemAdapter)

- Update SPIDER_MIDDLEWARES to include "zyte_spider_templates.middlewares.AllowOffsiteMiddleware": 500 and set "scrapy.spidermiddlewares.offsite.OffsiteMiddleware" to None. This allows crawling item links outside of the domain.
For an example of a properly configured settings.py file, see the one in zyte-spider-templates-project.
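Putting the above together, a minimal settings.py could look roughly as follows. This is a sketch, not the reference configuration: "myproject" is a placeholder for your own project module, and the scrapy-poet and scrapy-zyte-api settings mentioned above are omitted (see the documentation of those libraries):

# settings.py — a minimal sketch based on the steps above.
# "myproject" is a placeholder for your own project module.
# scrapy-poet and scrapy-zyte-api need additional settings of their own,
# omitted here; see their respective documentation.
from itemadapter import ItemAdapter
from zyte_common_items import ZyteItemAdapter

# Make item adapters aware of zyte-common-items items.
ItemAdapter.ADAPTER_CLASSES.appendleft(ZyteItemAdapter)

SPIDER_MODULES = [
    "myproject.spiders",
    "zyte_spider_templates.spiders",
]

# Page object discovery for scrapy-poet.
SCRAPY_POET_DISCOVER = [
    "zyte_spider_templates.pages",
]

# Stop the spider if no item has been found for 10 minutes.
CLOSESPIDER_TIMEOUT_NO_ITEM = 600

# Better request priority handling.
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

SPIDER_MIDDLEWARES = {
    # Allow crawling item links outside of the domain.
    "zyte_spider_templates.middlewares.AllowOffsiteMiddleware": 500,
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    # Log crawl data in JSON format for debugging.
    "zyte_spider_templates.middlewares.CrawlingLogsMiddleware": 1000,
}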
Spider templates
Built-in spider templates use Zyte API automatic extraction to provide automatic crawling and parsing, i.e. you can run these spiders on any website of the right type to automatically extract the desired structured data.
For example, to extract all products from an e-commerce website, you can run the e-commerce spider as follows:
scrapy crawl ecommerce -a url="https://books.toscrape.com"
Spider templates support additional parameters beyond url. See the documentation of each specific spider for details.
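For example, the urls_file parameter (documented below) lets you start a crawl from a plain-text list of URLs instead of a single start URL; the file URL here is illustrative:

scrapy crawl ecommerce -a urls_file="https://example.com/url-list.txt"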
You can also customize spider templates to meet your needs.
Spider template list
- E-commerce: Get products from an e-commerce website.

E-commerce spider template (ecommerce)
Basic use
scrapy crawl ecommerce -a url="https://books.toscrape.com"
Parameters
- pydantic model zyte_spider_templates.spiders.ecommerce.EcommerceSpiderParams[source]
- Config:
  json_schema_extra: dict = {"groups": [{"id": "inputs", "title": "Inputs", "description": "Input data that determines the start URLs of the crawl.", "widget": "exclusive"}]}
- Validators:
  single_input » all fields
- field crawl_strategy: EcommerceCrawlStrategy = EcommerceCrawlStrategy.full
Determines how the start URL and follow-up URLs are crawled.
- Validated by:
single_input
- field extract_from: ExtractFrom | None = None
Whether to perform extraction using a browser request (browserHtml) or an HTTP request (httpResponseBody).
- Validated by:
single_input
- field geolocation: Geolocation | None = None
ISO 3166-1 alpha-2 2-character string specified in https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation.
- Validated by:
single_input
- field max_requests: int | None = 100
The maximum number of Zyte API requests allowed for the crawl.
Requests with error responses that cannot be retried or exceed their retry limit also count here, but they incur no costs and do not increase the request count in Scrapy Cloud.
- Validated by:
single_input
- field url: str = ''
Initial URL for the crawl. Enter the full URL, including http(s); you can copy and paste it from your browser. Example: https://toscrape.com/
- Constraints:
pattern = ^https?://[^:/\s]+(:\d{1,5})?(/[^\s]*)*(#[^\s]*)?$
- Validated by:
single_input
- field urls_file: str = ''
URL that points to a plain-text file with a list of URLs to crawl, e.g. https://example.com/url-list.txt. The linked file must contain 1 URL per line.
- Constraints:
pattern = ^https?://[^:/\s]+(:\d{1,5})?(/[^\s]*)*(#[^\s]*)?$
- Validated by:
single_input
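For instance, the following command combines several of these parameters to force HTTP-based extraction from a US IP, capped at 1000 Zyte API requests (the parameter values are illustrative):

scrapy crawl ecommerce -a url="https://books.toscrape.com" -a extract_from="httpResponseBody" -a geolocation="US" -a max_requests=1000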
Customization
Built-in spider templates can be highly customized:
Subclass spider templates to customize metadata, parameters, and crawling logic.
Implement page objects to override parsing logic for all or some websites, both for navigation and item detail data.
Customizing spider templates
Subclass a built-in spider template to customize its metadata, parameters, and crawling logic.
Customizing metadata
Spider template metadata is defined using scrapy-spider-metadata, and can be redefined or customized in a subclass.
For example, to keep the upstream title but change the description:
from zyte_spider_templates import EcommerceSpider


class MySpider(EcommerceSpider):
    name = "my_spider"
    metadata = {
        **EcommerceSpider.metadata,
        "description": "Custom e-commerce spider template.",
    }
Customizing parameters
Spider template parameters are also defined using scrapy-spider-metadata, and can be redefined or customized in a subclass as well.
For example, to add a min_price parameter and filter out products with a lower price:
from decimal import Decimal
from typing import Iterable

from scrapy_poet import DummyResponse
from scrapy_spider_metadata import Args
from zyte_common_items import Product

from zyte_spider_templates import EcommerceSpider
from zyte_spider_templates.spiders.ecommerce import EcommerceSpiderParams


class MyParams(EcommerceSpiderParams):
    min_price: str = "0.00"


class MySpider(EcommerceSpider, Args[MyParams]):
    name = "my_spider"

    def parse_product(
        self, response: DummyResponse, product: Product
    ) -> Iterable[Product]:
        for product in super().parse_product(response, product):
            if Decimal(product.price) >= Decimal(self.args.min_price):
                yield product
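You can then pass the new parameter on the command line like any other (the price value is illustrative):

scrapy crawl my_spider -a url="https://books.toscrape.com" -a min_price="10.00"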
You can also override existing parameters. For example, to hard-code the start URL:
from scrapy_spider_metadata import Args

from zyte_spider_templates import EcommerceSpider
from zyte_spider_templates.spiders.ecommerce import EcommerceSpiderParams


class MyParams(EcommerceSpiderParams):
    url: str = "https://books.toscrape.com"


class MySpider(EcommerceSpider, Args[MyParams]):
    name = "my_spider"
A mixin class exists for every spider parameter (see Parameter mixins), so you can use any combination of them in any order you like in your custom classes, while enjoying future improvements to validation, documentation or UI integration for Scrapy Cloud:
from scrapy_spider_metadata import Args

from zyte_spider_templates.params import GeolocationParam, UrlParam


class MyParams(GeolocationParam, UrlParam):
    pass


class MySpider(Args[MyParams]):
    name = "my_spider"
Customizing the crawling logic
The crawling logic of spider templates can be customized as any other Scrapy spider.
For example, you can make a spider that expects a product details URL and does not follow navigation at all:
from typing import Iterable

from scrapy import Request

from zyte_spider_templates import EcommerceSpider


class MySpider(EcommerceSpider):
    name = "my_spider"

    def start_requests(self) -> Iterable[Request]:
        for request in super().start_requests():
            yield request.replace(callback=self.parse_product)
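You would then run such a spider against a product detail URL directly, for example (the URL is illustrative):

scrapy crawl my_spider -a url="https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"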
All parsing logic is implemented separately in page objects, making it easier to read the code of built-in spider templates to modify them as desired.
Customizing page objects
All parsing is implemented using web-poet page objects that use Zyte API automatic extraction to extract standard items, both for navigation and for item details.
You can implement your own page object classes to override how extraction works for any given combination of URL and item type.
Tip
Make sure the import path of your page objects module is in the SCRAPY_POET_DISCOVER setting, otherwise your page objects might be ignored.
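For example, assuming your own page objects live in a module named myproject.pages (a placeholder name), the setting might look like this:

# settings.py — "myproject.pages" is a placeholder for your own module.
SCRAPY_POET_DISCOVER = [
    "zyte_spider_templates.pages",
    "myproject.pages",
]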
Overriding parsing
To change or fix how a given field is extracted, overriding the value from Zyte API automatic extraction, create a page object class, configured to run on some given URLs (web_poet.handle_urls()), that defines the logic to extract that field. For example:
import attrs
from number_parser import parse_number
from web_poet import AnyResponse, field, handle_urls
from zyte_common_items import AggregateRating, AutoProductPage


@handle_urls("books.toscrape.com")
@attrs.define
class BooksToScrapeComProductPage(AutoProductPage):
    response: AnyResponse

    @field
    async def aggregateRating(self):
        element_class = self.response.css(".star-rating::attr(class)").get()
        if not element_class:
            return None
        rating_str = element_class.split(" ")[-1]
        rating = parse_number(rating_str)
        if not rating:
            return None
        return AggregateRating(ratingValue=rating, bestRating=5)
AutoProductPage and the other zyte-common-items page objects prefixed with Auto define fields for all standard items that return the value from Zyte API automatic extraction, so you only need to define your new field.
The page object above is decorated with @attrs.define so that it can declare a dependency on AnyResponse and use that to implement custom parsing logic. You could alternatively use BrowserHtml if needed.
Parsing a new field
To extract a new field for one or more websites:
Declare a new item type that extends a standard item with your new field. For example:
items.py:

from typing import Optional

import attrs
from zyte_common_items import Product


@attrs.define
class CustomProduct(Product):
    stock: Optional[int]
Create a page object class, configured to run for your new item type (web_poet.pages.Returns) on some given URLs (web_poet.handle_urls()), that defines the logic to extract your new field. For example:

pages/books_toscrape_com.py:

import re

from web_poet import Returns, field, handle_urls
from zyte_common_items import AutoProductPage

from ..items import CustomProduct


@handle_urls("books.toscrape.com")
class BookPage(AutoProductPage, Returns[CustomProduct]):
    @field
    async def stock(self):
        for entry in await self.additionalProperties:
            if entry.name == "availability":
                match = re.search(r"\d([.,\s]*\d+)*(?=\s+available\b)", entry.value)
                if not match:
                    return None
                stock_str = re.sub(r"[.,\s]", "", match[0])
                return int(stock_str)
        return None
Create a spider template subclass that requests your new item type instead of the standard one. For example:

spiders/books_toscrape_com.py:

from scrapy_poet import DummyResponse

from zyte_spider_templates import EcommerceSpider

from ..items import CustomProduct


class BooksToScrapeComSpider(EcommerceSpider):
    name = "books_toscrape_com"
    metadata = {
        **EcommerceSpider.metadata,
        "title": "Books to Scrape",
        "description": "Spider template for books.toscrape.com",
    }

    def parse_product(self, response: DummyResponse, product: CustomProduct):
        yield from super().parse_product(response, product)
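You can then run your new spider like any other spider template:

scrapy crawl books_toscrape_com -a url="https://books.toscrape.com"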
Reference
Spiders
- class zyte_spider_templates.EcommerceSpider(*args: Any, **kwargs: Any)[source]
Yield products from an e-commerce website.
See EcommerceSpiderParams for supported parameters.

Pages
Parameter mixins
- pydantic model zyte_spider_templates.params.ExtractFromParam[source]
- field extract_from: ExtractFrom | None = None
Whether to perform extraction using a browser request (browserHtml) or an HTTP request (httpResponseBody).
- enum zyte_spider_templates.params.ExtractFrom(value)[source]
Valid values: browserHtml, httpResponseBody.
- pydantic model zyte_spider_templates.params.GeolocationParam[source]
- field geolocation: Geolocation | None = None
ISO 3166-1 alpha-2 2-character string specified in https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation.
- enum zyte_spider_templates.params.Geolocation(value)[source]
Valid values are ISO 3166-1 alpha-2 country codes.
- pydantic model zyte_spider_templates.params.UrlParam[source]
- field url: str = ''
Initial URL for the crawl. Enter the full URL, including http(s); you can copy and paste it from your browser. Example: https://toscrape.com/
- Constraints:
pattern = ^https?://[^:/\s]+(:\d{1,5})?(/[^\s]*)*(#[^\s]*)?$
- pydantic model zyte_spider_templates.spiders.ecommerce.EcommerceCrawlStrategyParam[source]
- field crawl_strategy: EcommerceCrawlStrategy = EcommerceCrawlStrategy.full
Determines how the start URL and follow-up URLs are crawled.
- enum zyte_spider_templates.spiders.ecommerce.EcommerceCrawlStrategy(value)[source]
Valid values are as follows:
- full: str = <EcommerceCrawlStrategy.full: 'full'>
  Follow most links within the domain of the start URL in an attempt to discover and extract as many products as possible.
  Follows pagination, subcategories, and product detail pages.
  Pagination Only is a better choice if the target URL does not have subcategories, or if Zyte API is misidentifying some URLs as subcategories.
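For example, to use the Pagination Only strategy instead of the default full strategy (assuming its enum value is the string "pagination_only", following the naming of full):

scrapy crawl ecommerce -a url="https://books.toscrape.com" -a crawl_strategy="pagination_only"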
Changes
0.7.2 (2024-05-07)
- Implemented mixin classes for spider parameters, to improve reuse.
- Improved docs, providing an example about overriding existing parameters when customizing parameters, and featuring AnyResponse in the example about overriding parsing.
0.7.1 (2024-02-22)
- The crawl_strategy parameter of EcommerceSpider now defaults to full instead of navigation. We also reworded some descriptions of EcommerceCrawlStrategy values for clarification.
0.7.0 (2024-02-09)
- Updated requirement versions:
  - scrapy-poet >= 0.21.0
  - scrapy-zyte-api >= 0.16.0

  With the updated dependencies above, this fixes the issue of having two separate Zyte API requests (productNavigation and httpResponseBody) for the same URL. Note that this issue only occurred when requesting product navigation pages.
- Moved zyte_spider_templates.spiders.ecommerce.ExtractFrom into zyte_spider_templates.spiders.base.ExtractFrom.
0.6.1 (2024-02-02)
- Improved the zyte_spider_templates.spiders.base.BaseSpiderParams.url description.
0.6.0 (2024-01-31)
- Fixed the extract_from spider parameter, which wasn't working.
- The "www." prefix is now removed when setting the spider's allowed_domains.
- The zyte_common_items.ProductNavigation.nextPage link won't be crawled if zyte_common_items.ProductNavigation.items is empty.
- zyte_common_items.Product items that are dropped due to low probability (below 0.1) are now logged in stats: drop_item/product/low_probability.
- zyte_spider_templates.pages.HeuristicsProductNavigationPage now inherits from zyte_common_items.AutoProductNavigationPage instead of zyte_common_items.BaseProductNavigationPage.
- Moved e-commerce code from zyte_spider_templates.spiders.base.BaseSpider to zyte_spider_templates.spiders.ecommerce.EcommerceSpider.
- Documentation improvements.
0.5.0 (2023-12-18)
- The zyte_spider_templates.page_objects module is now deprecated in favor of zyte_spider_templates.pages, in line with web_poet.pages.
0.4.0 (2023-12-14)
- Products outside of the target domain can now be crawled using zyte_spider_templates.middlewares.AllowOffsiteMiddleware.
- Updated the documentation to also set up zyte_common_items.ZyteItemAdapter.
- The max_requests spider parameter now has a default value of 100. Previously it was None, which meant unlimited requests.
- Improved the description of the max_requests spider parameter.
- Official support for Python 3.12.
- Misc documentation improvements.
0.3.0 (2023-11-03)
- Added documentation.
- Added a middleware that logs information about the crawl in JSON format, zyte_spider_templates.middlewares.CrawlingLogsMiddleware. It replaces the old crawl log output, which was difficult to parse with regular expressions.
0.2.0 (2023-10-30)
- Now requires zyte-common-items >= 0.12.0.
- Added a new crawl strategy, "Pagination Only".
- Improved the request priority calculation based on the metadata probability value.
- CI improvements.
0.1.0 (2023-10-24)
- Initial release.