Customizing page objects
All parsing is implemented using web-poet page objects that use Zyte API automatic extraction to extract standard items: for navigation, for item details, and even for search request generation.
You can implement your own page object classes to override how extraction works for any given combination of URL and item type.
Tip
Make sure the import path of your page objects module is in the SCRAPY_POET_DISCOVER setting, otherwise your page objects might be ignored.
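For example, assuming your page objects live in a pages package of a hypothetical myproject project (the module path below is just an assumption about your project layout), the setting could look like this:

settings.py
# Modules that scrapy-poet scans for @handle_urls-decorated page objects.
SCRAPY_POET_DISCOVER = [
    "myproject.pages",
]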
Overriding parsing
To change or fix how a given field is extracted, overriding the value from Zyte API automatic extraction, create a page object class, configured to run on some given URLs (web_poet.handle_urls()), that defines the logic to extract that field. For example:
import attrs
from number_parser import parse_number
from web_poet import AnyResponse, field, handle_urls
from zyte_common_items import AggregateRating, AutoProductPage


@handle_urls("books.toscrape.com")
@attrs.define
class BooksToScrapeComProductPage(AutoProductPage):
    response: AnyResponse

    @field
    async def aggregateRating(self):
        element_class = self.response.css(".star-rating::attr(class)").get()
        if not element_class:
            return None
        rating_str = element_class.split(" ")[-1]
        rating = parse_number(rating_str)
        if not rating:
            return None
        return AggregateRating(ratingValue=rating, bestRating=5)
AutoProductPage and other page objects from zyte-common-items prefixed with Auto define all standard item fields so that they return the value from Zyte API automatic extraction, meaning you only need to define the fields that you want to override or add.
The page object above is decorated with @attrs.define so that it can declare a dependency on AnyResponse and use that to implement custom parsing logic. You could alternatively use BrowserHtml if needed.
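The following is a minimal sketch of that variation; it assumes that BrowserHtml offers the same .css() selector shortcut as responses and that browser rendering is acceptable for the target website:

import attrs
from number_parser import parse_number
from web_poet import BrowserHtml, field, handle_urls
from zyte_common_items import AggregateRating, AutoProductPage


@handle_urls("books.toscrape.com")
@attrs.define
class BooksToScrapeComProductPage(AutoProductPage):
    # Declaring BrowserHtml as a dependency requests the browser-rendered
    # DOM of the page rather than the raw HTTP response body.
    html: BrowserHtml

    @field
    async def aggregateRating(self):
        # Same parsing logic as above, but reading from the browser HTML.
        element_class = self.html.css(".star-rating::attr(class)").get()
        if not element_class:
            return None
        rating = parse_number(element_class.split(" ")[-1])
        if not rating:
            return None
        return AggregateRating(ratingValue=rating, bestRating=5)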
Parsing a new field
To extract a new field for one or more websites:
Declare a new item type that extends a standard item with your new field. For example:
items.py

from typing import Optional

import attrs

from zyte_common_items import Product


@attrs.define
class CustomProduct(Product):
    stock: Optional[int]
Create a page object class, configured to run for your new item type (web_poet.pages.Returns) on some given URLs (web_poet.handle_urls()), that defines the logic to extract your new field. For example:

pages/books_toscrape_com.py

import re

from web_poet import Returns, field, handle_urls
from zyte_common_items import AutoProductPage

from ..items import CustomProduct


@handle_urls("books.toscrape.com")
class BookPage(AutoProductPage, Returns[CustomProduct]):
    @field
    async def stock(self):
        for entry in await self.additionalProperties:
            if entry.name == "availability":
                match = re.search(r"\d([.,\s]*\d+)*(?=\s+available\b)", entry.value)
                if not match:
                    return None
                stock_str = re.sub(r"[.,\s]", "", match[0])
                return int(stock_str)
        return None
Create a spider template subclass that requests your new item type instead of the standard one. For example:
spiders/books_toscrape_com.py

from scrapy_poet import DummyResponse
from zyte_spider_templates import EcommerceSpider

from ..items import CustomProduct


class BooksToScrapeComSpider(EcommerceSpider):
    name = "books_toscrape_com"
    metadata = {
        **EcommerceSpider.metadata,
        "title": "Books to Scrape",
        "description": "Spider template for books.toscrape.com",
    }

    def parse_product(self, response: DummyResponse, product: CustomProduct):
        yield from super().parse_product(response, product)
Fixing search support
If the default implementation to build a request out of search queries does not work on a given website, you can implement your own search request page object to fix that. See Writing a request template page object.
For example:
from web_poet import field, handle_urls
from zyte_common_items import BaseSearchRequestTemplatePage


@handle_urls("example.com")
class ExampleComSearchRequestTemplatePage(BaseSearchRequestTemplatePage):
    @field
    def url(self):
        return "https://example.com/search?q={{ query|quote_plus }}"