Customizing page objects

All parsing is implemented using web-poet page objects that use Zyte API automatic extraction to extract standard items, both for navigation and for item details.

You can implement your own page object classes to override how extraction works for any given combination of URL and item type.

Tip

Make sure the import path of your page objects module is listed in the SCRAPY_POET_DISCOVER setting; otherwise, your page objects might be ignored.
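As a minimal sketch, assuming your Scrapy project package is called myproject (a hypothetical name) and your page objects live in its pages subpackage, the setting could look like this:

```python
# settings.py
# "myproject.pages" is a hypothetical import path; replace it with the import
# path of your own page objects module.
SCRAPY_POET_DISCOVER = ["myproject.pages"]
```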

Overriding parsing

To change or fix how a given field is extracted, overriding the value from Zyte API automatic extraction, create a page object class that defines the logic to extract that field and is configured to run on some given URLs (web_poet.handle_urls()). For example:

pages/books_toscrape_com.py
import attrs
from number_parser import parse_number
from web_poet import AnyResponse, field, handle_urls
from zyte_common_items import AggregateRating, AutoProductPage


@handle_urls("books.toscrape.com")
@attrs.define
class BooksToScrapeComProductPage(AutoProductPage):
    response: AnyResponse

    @field
    async def aggregateRating(self):
        element_class = self.response.css(".star-rating::attr(class)").get()
        if not element_class:
            return None
        rating_str = element_class.split(" ")[-1]
        rating = parse_number(rating_str)
        if not rating:
            return None
        return AggregateRating(ratingValue=rating, bestRating=5)

AutoProductPage and other page objects from zyte-common-items prefixed with Auto define every field of the corresponding standard item, and each of those fields returns the value from Zyte API automatic extraction, so you only need to define the fields that you want to change.

The page object above is decorated with @attrs.define so that it can declare a dependency on AnyResponse and use that to implement custom parsing logic. You could alternatively use BrowserHtml if needed.
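To see what the aggregateRating logic above does with a class value, the following standalone sketch mirrors its parsing steps. It assumes class values shaped like "star-rating Three", and it replaces number_parser.parse_number with a small hypothetical lookup table (WORD_TO_NUMBER) so that it runs without third-party packages; rating_from_class is likewise an illustrative name, not part of any library.

```python
from typing import Optional

# Hypothetical stand-in for number_parser.parse_number, limited to the
# values this sketch needs.
WORD_TO_NUMBER = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def rating_from_class(element_class: Optional[str]) -> Optional[int]:
    """Return the rating encoded in a class value such as "star-rating Three"."""
    if not element_class:
        return None
    # The rating word is assumed to be the last class name.
    rating_word = element_class.split(" ")[-1]
    # Unknown words yield None, like the "if not rating" guard above.
    return WORD_TO_NUMBER.get(rating_word.lower())


print(rating_from_class("star-rating Three"))  # 3
print(rating_from_class("star-rating"))  # None
print(rating_from_class(None))  # None
```

In the page object itself, the resulting number is what gets passed as ratingValue to AggregateRating.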

Parsing a new field

To extract a new field for one or more websites:

  1. Declare a new item type that extends a standard item with your new field. For example:

    items.py
    from typing import Optional
    
    import attrs
    from zyte_common_items import Product
    
    
    @attrs.define
    class CustomProduct(Product):
        stock: Optional[int]
    
  2. Create a page object class, configured to run for your new item type (web_poet.pages.Returns) on some given URLs (web_poet.handle_urls()), that defines the logic to extract your new field. For example:

    pages/books_toscrape_com.py
    import re
    
    from web_poet import Returns, field, handle_urls
    from zyte_common_items import AutoProductPage
    
    from ..items import CustomProduct
    
    
    @handle_urls("books.toscrape.com")
    class BookPage(AutoProductPage, Returns[CustomProduct]):
        @field
        async def stock(self):
            # additionalProperties may be None, so fall back to an empty list.
            for entry in (await self.additionalProperties) or []:
                if entry.name == "availability":
                    match = re.search(r"\d([.,\s]*\d+)*(?=\s+available\b)", entry.value)
                    if not match:
                        return None
                    stock_str = re.sub(r"[.,\s]", "", match[0])
                    return int(stock_str)
            return None
    
  3. Create a spider template subclass that requests your new item type instead of the standard one. For example:

    spiders/books_toscrape_com.py
    from scrapy_poet import DummyResponse
    from zyte_spider_templates import EcommerceSpider
    
    from ..items import CustomProduct
    
    
    class BooksToScrapeComSpider(EcommerceSpider):
        name = "books_toscrape_com"
        metadata = {
            **EcommerceSpider.metadata,
            "title": "Books to Scrape",
            "description": "Spider template for books.toscrape.com",
        }
    
        def parse_product(self, response: DummyResponse, product: CustomProduct):
            yield from super().parse_product(response, product)
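The regular expression used by the stock field in step 2 can be tried out on its own before running a spider. The sketch below reuses the same pattern and cleanup logic in a hypothetical helper, parse_stock; the sample availability strings are assumptions made for illustration, not values taken from a real crawl.

```python
import re
from typing import Optional

# Same pattern as in the stock field: a number, optionally with ".", "," or
# whitespace as digit group separators, followed by the word "available".
STOCK_RE = re.compile(r"\d([.,\s]*\d+)*(?=\s+available\b)")


def parse_stock(availability: str) -> Optional[int]:
    """Return the stock count found in an availability string, if any."""
    match = STOCK_RE.search(availability)
    if not match:
        return None
    # Drop digit group separators before converting to an integer.
    return int(re.sub(r"[.,\s]", "", match[0]))


print(parse_stock("In stock (22 available)"))  # 22
print(parse_stock("In stock (1,234 available)"))  # 1234
print(parse_stock("Out of stock"))  # None
```

Keeping the parsing in a plain function like this makes it easy to cover with unit tests, independently of the page object that calls it.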