Changes
0.12.0 (2025-03-31)
Search queries support is added to the job posting spider template.
Fixed support for POST requests in search queries.
Improved validation in the Google search spider template.
0.11.2 (2024-12-30)
Do not log warning about disabled components.
0.11.1 (2024-12-26)
The e-commerce and job posting spider templates no longer ignore item requests for a different domain.
0.11.0 (2024-12-16)
New Articles spider template, built on top of Zyte API’s article and articleNavigation.
New Job Posting spider template, built on top of Zyte API’s jobPosting and jobPostingNavigation.
Search queries support is added to the e-commerce spider template. This allows providing a list of search queries to the spider; the spider finds a search form on the target webpage and submits all the queries.
ProductList extraction support is added to the e-commerce spider template. This allows spiders to extract basic product information without going into product detail pages.
New features are added to the Google Search spider template:
An option to follow the result links and extract data from the target pages (via the extract argument)
Content Languages (lr) parameter
Content Countries (cr) parameter
User Country (gl) parameter
User Language (hl) parameter
results_per_page parameter
Added a Scrapy add-on. This greatly simplifies the initial zyte-spider-templates configuration.
Bug fix: incorrectly extracted URLs no longer make spiders drop other requests.
Cleaned up the CI; improved the testing suite; cleaned up the documentation.
0.10.0 (2024-11-22)
Dropped Python 3.8 support, added Python 3.13 support.
Increased the minimum required versions of some dependencies:
pydantic: 2 → 2.1
scrapy-poet: 0.21.0 → 0.24.0
scrapy-spider-metadata: 0.1.2 → 0.2.0
scrapy-zyte-api[provider]: 0.16.0 → 0.23.0
zyte-common-items: 0.22.0 → 0.23.0
Added custom attributes support to the e-commerce spider template through its new custom_attrs_input and custom_attrs_method parameters.
The max_pages parameter of the Google Search spider template can no longer be 0 or lower.
The Google Search spider template now follows pagination for the results of each query page by page, instead of sending a request for every page in parallel. It stops once it reaches a page without organic results.
Improved the description of EcommerceCrawlStrategy values.
Fixed type hint issues related to Scrapy.
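The page-by-page pagination described for the Google Search spider template in 0.10.0 can be pictured with a small sketch. This is not the template's actual code: follow_pagination, fetch_page and the fake data below are hypothetical names made up for illustration, and the only behavior modeled is "request one page at a time, stop at the first page without results, reject max_pages below 1".

```python
from typing import Callable, Iterator, List


def follow_pagination(
    fetch_page: Callable[[int], List[str]], max_pages: int
) -> Iterator[str]:
    """Yield results page by page, stopping at the first empty page."""
    # Mirrors the rule that max_pages can no longer be 0 or lower.
    if max_pages < 1:
        raise ValueError("max_pages must be 1 or higher")
    for page_number in range(1, max_pages + 1):
        results = fetch_page(page_number)
        if not results:
            # A page without results ends the crawl for this query.
            break
        yield from results


# Hypothetical stand-in for real result pages: page 3 is empty.
fake_pages = {1: ["a", "b"], 2: ["c"], 3: []}
requested = []


def fake_fetch(page_number: int) -> List[str]:
    requested.append(page_number)
    return fake_pages.get(page_number, [])


collected = list(follow_pagination(fake_fetch, max_pages=10))
```

In this sketch only pages 1 to 3 are requested even though max_pages is 10, whereas sending every page in parallel would have issued all 10 requests up front.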
0.9.0 (2024-09-17)
Now requires zyte-common-items >= 0.22.0.
New Google Search spider template, built on top of Zyte API’s serp.
The heuristics of the e-commerce spider template to ignore certain URLs when following category links now also handle subdomains. For example, https://example.com/blog was already ignored before; now https://blog.example.com is ignored as well.
In the spider parameters JSON schema, the crawl_strategy parameter of the e-commerce spider template switches position, from being the last parameter to being between urls_file and geolocation.
Removed the valid_page_types attribute of zyte_spider_templates.middlewares.CrawlingLogsMiddleware.
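The subdomain handling added to the e-commerce URL heuristics in 0.9.0 can be illustrated with a minimal sketch. The keyword set and the looks_ignorable helper are hypothetical and far simpler than whatever the template really does; the sketch also naively assumes that the registered domain is the last two host labels.

```python
from urllib.parse import urlparse

# Hypothetical keyword list, for illustration only.
IGNORED_KEYWORDS = {"blog"}


def looks_ignorable(url: str) -> bool:
    """Return True if a keyword appears as a path segment or subdomain label."""
    parsed = urlparse(url)
    path_segments = [s for s in parsed.path.lower().split("/") if s]
    host_labels = (parsed.hostname or "").lower().split(".")
    # Naive assumption: the last two labels form the registered domain.
    subdomain_labels = host_labels[:-2]
    return any(
        keyword in path_segments or keyword in subdomain_labels
        for keyword in IGNORED_KEYWORDS
    )
```

Under these assumptions, both https://example.com/blog and https://blog.example.com are flagged, while https://example.com/products is not.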
0.8.0 (2024-08-21)
Added new input parameters:
urls accepts a newline-delimited list of URLs.
urls_file accepts a URL that points to a plain-text file with a newline-delimited list of URLs.
Only one of url, urls and urls_file should be used at a time.
Added new crawling strategies:
automatic - uses heuristics to see if an input URL is a homepage, for which it uses a modified full strategy where other links are discovered only in the homepage. Otherwise, it assumes it’s a navigation page and uses the existing navigation strategy.
direct_item - input URLs are directly extracted as products.
Added new parameters classes: LocationParam and PostalAddress. Note that these are available for use when customizing the templates and are not currently being utilized by any template.
Backward incompatible changes:
automatic becomes the new default crawling strategy instead of full.
CI test improvements.
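To make the 0.8.0 input parameters more concrete, here is a minimal sketch of splitting a newline-delimited URL list and enforcing that only one of url, urls and urls_file is set. The helpers parse_url_list and check_single_input are hypothetical; the library's own validation may differ.

```python
from typing import List, Optional


def parse_url_list(text: str) -> List[str]:
    """Split newline-delimited text into URLs, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def check_single_input(
    url: Optional[str] = None,
    urls: Optional[str] = None,
    urls_file: Optional[str] = None,
) -> str:
    """Return the name of the single input parameter that was provided."""
    candidates = (("url", url), ("urls", urls), ("urls_file", urls_file))
    provided = [name for name, value in candidates if value]
    if len(provided) != 1:
        raise ValueError(
            "Expected exactly one of url, urls and urls_file, "
            f"got: {provided or 'none'}"
        )
    return provided[0]


parsed = parse_url_list("https://a.example\n\n  https://b.example  \n")
```

Passing both url and urls to check_single_input raises ValueError in this sketch, which matches the note that the three parameters should not be combined.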
0.7.2 (2024-05-07)
Implemented mixin classes for spider parameters, to improve reuse.
Improved docs, providing an example about overriding existing parameters when customizing parameters, and featuring AnyResponse in the example about overriding parsing.
0.7.1 (2024-02-22)
The crawl_strategy parameter of EcommerceSpider now defaults to full instead of navigation. We also reworded some descriptions of EcommerceCrawlStrategy values for clarification.
0.7.0 (2024-02-09)
Updated requirement versions:
scrapy-poet >= 0.21.0
scrapy-zyte-api >= 0.16.0
With the updated dependencies above, this fixes the issue of having 2 separate Zyte API Requests (productNavigation and httpResponseBody) for the same URL. Note that this issue only occurs when requesting product navigation pages.
Moved zyte_spider_templates.spiders.ecommerce.ExtractFrom into zyte_spider_templates.spiders.base.ExtractFrom.
0.6.1 (2024-02-02)
Improved the zyte_spider_templates.spiders.base.BaseSpiderParams.url description.
0.6.0 (2024-01-31)
Fixed the extract_from spider parameter that wasn’t working.
The “www.” prefix is now removed when setting the spider’s allowed_domains.
The zyte_common_items.ProductNavigation.nextPage link won’t be crawled if zyte_common_items.ProductNavigation.items is empty.
zyte_common_items.Product items that are dropped due to low probability (below 0.1) are now logged in stats: drop_item/product/low_probability.
zyte_spider_templates.pages.HeuristicsProductNavigationPage now inherits from zyte_common_items.AutoProductNavigationPage instead of zyte_common_items.BaseProductNavigationPage.
Moved e-commerce code from zyte_spider_templates.spiders.base.BaseSpider to zyte_spider_templates.spiders.ecommerce.EcommerceSpider.
Documentation improvements.
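The “www.” prefix removal mentioned for 0.6.0 can be sketched as follows. The allowed_domain_for helper is a hypothetical name and this is only an approximation of the idea, not the library's implementation.

```python
from urllib.parse import urlparse


def allowed_domain_for(url: str) -> str:
    """Derive an allowed domain from a URL, dropping a leading 'www.'."""
    host = urlparse(url).hostname or ""
    # Only a leading "www." label is stripped; other subdomains are kept.
    return host[4:] if host.startswith("www.") else host
```

With this sketch, https://www.example.com/p maps to example.com, so links to other subdomains of example.com would also fall within the allowed domain.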
0.5.0 (2023-12-18)
The zyte_spider_templates.page_objects module is now deprecated in favor of zyte_spider_templates.pages, in line with web_poet.pages.
0.4.0 (2023-12-14)
Products outside of the target domain can now be crawled using zyte_spider_templates.middlewares.AllowOffsiteMiddleware.
Updated the documentation to also set up zyte_common_items.ZyteItemAdapter.
The max_requests spider parameter now has a default value of 100. Previously, it was None, which meant unlimited.
Improved the description of the max_requests spider parameter.
Official support for Python 3.12.
Misc documentation improvements.
0.3.0 (2023-11-03)
Added documentation.
Added a middleware that logs information about the crawl in JSON format, zyte_spider_templates.middlewares.CrawlingLogsMiddleware. This replaces the old crawling information that was difficult to parse using regular expressions.
0.2.0 (2023-10-30)
Now requires zyte-common-items >= 0.12.0.
Added a new crawl strategy, “Pagination Only”.
Improved the request priority calculation based on the metadata probability value.
CI improvements.
0.1.0 (2023-10-24)
Initial release.