Customizing page objects
All parsing is implemented using web-poet page objects that use Zyte API automatic extraction to extract standard items: for navigation, for item details, and even for search request generation.
You can implement your own page object classes to override how extraction works for any given combination of URL and item type.
Tip
Make sure the import path of your page objects module is in the SCRAPY_POET_DISCOVER setting, otherwise your page objects might be ignored.
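For example, assuming your page objects live in a pages package of a hypothetical myproject project (the module path below is just an assumption about your project layout), the setting could look like this:

settings.py
# Modules that scrapy-poet scans for @handle_urls-decorated page objects.
SCRAPY_POET_DISCOVER = [
    "myproject.pages",
]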
Overriding parsing
To change or fix how a given field is extracted, overriding the value from Zyte API automatic extraction, create a page object class, configured to run on some given URLs (web_poet.handle_urls()), that defines the logic to extract that field. For example:
import attrs
from number_parser import parse_number
from web_poet import AnyResponse, field, handle_urls
from zyte_common_items import AggregateRating, AutoProductPage


@handle_urls("books.toscrape.com")
@attrs.define
class BooksToScrapeComProductPage(AutoProductPage):
    response: AnyResponse

    @field
    async def aggregateRating(self):
        element_class = self.response.css(".star-rating::attr(class)").get()
        if not element_class:
            return None
        rating_str = element_class.split(" ")[-1]
        rating = parse_number(rating_str)
        if not rating:
            return None
        return AggregateRating(ratingValue=rating, bestRating=5)
AutoProductPage and other page objects from zyte-common-items prefixed with Auto define all standard item fields so that they return the value from Zyte API automatic extraction, meaning you only need to define the fields that you want to override or add.
The page object above is decorated with @attrs.define so that it can declare a dependency on AnyResponse and use that to implement custom parsing logic. You could alternatively use BrowserHtml if needed.
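The following is a minimal sketch of that variation; it assumes that BrowserHtml offers the same .css() selector shortcut as responses and that browser rendering is acceptable for the target website:

import attrs
from number_parser import parse_number
from web_poet import BrowserHtml, field, handle_urls
from zyte_common_items import AggregateRating, AutoProductPage


@handle_urls("books.toscrape.com")
@attrs.define
class BooksToScrapeComProductPage(AutoProductPage):
    # Declaring BrowserHtml as a dependency requests the browser-rendered
    # DOM of the page rather than the raw HTTP response body.
    html: BrowserHtml

    @field
    async def aggregateRating(self):
        # Same parsing logic as above, but reading from the browser HTML.
        element_class = self.html.css(".star-rating::attr(class)").get()
        if not element_class:
            return None
        rating = parse_number(element_class.split(" ")[-1])
        if not rating:
            return None
        return AggregateRating(ratingValue=rating, bestRating=5)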
Parsing a new field
To extract a new field for one or more websites:
Declare a new item type that extends a standard item with your new field. For example:
items.py

from typing import Optional

import attrs

from zyte_common_items import Product


@attrs.define
class CustomProduct(Product):
    stock: Optional[int]
Create a page object class, configured to run for your new item type (web_poet.pages.Returns) on some given URLs (web_poet.handle_urls()), that defines the logic to extract your new field. For example:

pages/books_toscrape_com.py

import re

from web_poet import Returns, field, handle_urls
from zyte_common_items import AutoProductPage

from ..items import CustomProduct


@handle_urls("books.toscrape.com")
class BookPage(AutoProductPage, Returns[CustomProduct]):
    @field
    async def stock(self):
        for entry in await self.additionalProperties:
            if entry.name == "availability":
                match = re.search(r"\d([.,\s]*\d+)*(?=\s+available\b)", entry.value)
                if not match:
                    return None
                stock_str = re.sub(r"[.,\s]", "", match[0])
                return int(stock_str)
        return None
Create a spider template subclass that requests your new item type instead of the standard one. For example:
spiders/books_toscrape_com.py

from scrapy_poet import DummyResponse
from zyte_spider_templates import EcommerceSpider

from ..items import CustomProduct


class BooksToScrapeComSpider(EcommerceSpider):
    name = "books_toscrape_com"
    metadata = {
        **EcommerceSpider.metadata,
        "title": "Books to Scrape",
        "description": "Spider template for books.toscrape.com",
    }

    def parse_product(self, response: DummyResponse, product: CustomProduct):
        yield from super().parse_product(response, product)
Fixing search support
If the default implementation to build a request out of search queries does not work on a given website, you can implement your own search request page object to fix that. See Writing a request template page object.
For example:
from web_poet import field, handle_urls
from zyte_common_items import BaseSearchRequestTemplatePage


@handle_urls("example.com")
class ExampleComSearchRequestTemplatePage(BaseSearchRequestTemplatePage):
    @field
    def url(self):
        return "https://example.com/search?q={{ query|quote_plus }}"