.. _custom-page-objects:

========================
Customizing page objects
========================

All parsing is implemented using :ref:`web-poet page objects ` that use
`Zyte API automatic extraction`_ to extract :ref:`standard items `:
for navigation, for item details, and even for
:ref:`search request generation `.

.. _Zyte API automatic extraction: https://docs.zyte.com/zyte-api/usage/extract.html

You can implement your own page object classes to override how extraction
works for any given combination of URL and item type.

.. tip:: Make sure the import path of your page objects module is in the
    :ref:`SCRAPY_POET_DISCOVER ` setting, otherwise your page objects might
    be ignored.

.. _configured scrapy-poet: https://scrapy-poet.readthedocs.io/en/stable/intro/install.html#configuring-the-project

.. _override-parsing:

Overriding parsing
==================

To change or fix how a given field is extracted, overriding the value from
`Zyte API automatic extraction`_, create a page object class that is
configured to run on some given URLs (:func:`web_poet.handle_urls`) and that
defines the logic to extract that field. For example:

.. code-block:: python
    :caption: pages/books_toscrape_com.py

    import attrs
    from number_parser import parse_number
    from web_poet import AnyResponse, field, handle_urls
    from zyte_common_items import AggregateRating, AutoProductPage


    @handle_urls("books.toscrape.com")
    @attrs.define
    class BooksToScrapeComProductPage(AutoProductPage):
        response: AnyResponse

        @field
        async def aggregateRating(self):
            # The rating is the last CSS class of the element, e.g.
            # "star-rating Three".
            element_class = self.response.css(".star-rating::attr(class)").get()
            if not element_class:
                return None
            rating_str = element_class.split(" ")[-1]
            # Convert the spelled-out number into a numeric value.
            rating = parse_number(rating_str)
            if not rating:
                return None
            return AggregateRating(ratingValue=rating, bestRating=5)

``AutoProductPage`` and other page objects from `zyte-common-items`_ prefixed
with ``Auto`` define fields for all standard items that return the value from
`Zyte API automatic extraction`_, so that you only need to define your new
field.

.. _zyte-common-items: https://zyte-common-items.readthedocs.io/en/latest/

The page object above is decorated with ``@attrs.define`` so that it can
declare a dependency on :class:`~web_poet.page_inputs.response.AnyResponse` and
use that to implement custom parsing logic. You could alternatively use
:class:`~web_poet.page_inputs.browser.BrowserHtml` if needed.

.. _add-field:

Parsing a new field
===================

To extract a new field for one or more websites:

#.  Declare a new item type that extends a :ref:`standard item ` with your new
    field. For example:

    .. code-block:: python
        :caption: items.py

        from typing import Optional

        import attrs
        from zyte_common_items import Product


        @attrs.define
        class CustomProduct(Product):
            stock: Optional[int]

#.  Create a page object class, configured to run for your new item type
    (:class:`web_poet.pages.Returns`) on some given URLs
    (:func:`web_poet.handle_urls`), that defines the logic to extract your new
    field. For example:

    .. code-block:: python
        :caption: pages/books_toscrape_com.py

        import re

        from web_poet import Returns, field, handle_urls
        from zyte_common_items import AutoProductPage

        from ..items import CustomProduct


        @handle_urls("books.toscrape.com")
        class BookPage(AutoProductPage, Returns[CustomProduct]):

            @field
            async def stock(self):
                # Fall back to an empty list if no additional properties
                # were extracted.
                for entry in await self.additionalProperties or []:
                    if entry.name == "availability":
                        # Match the number that precedes "available".
                        match = re.search(r"\d([.,\s]*\d+)*(?=\s+available\b)", entry.value)
                        if not match:
                            return None
                        # Drop thousands separators before converting.
                        stock_str = re.sub(r"[.,\s]", "", match[0])
                        return int(stock_str)
                return None

#.  Create a spider template subclass that requests your new item type instead
    of the standard one. For example:

    .. code-block:: python
        :caption: spiders/books_toscrape_com.py

        from scrapy_poet import DummyResponse
        from zyte_spider_templates import EcommerceSpider

        from ..items import CustomProduct


        class BooksToScrapeComSpider(EcommerceSpider):
            name = "books_toscrape_com"
            metadata = {
                **EcommerceSpider.metadata,
                "title": "Books to Scrape",
                "description": "Spider template for books.toscrape.com",
            }

            def parse_product(self, response: DummyResponse, product: CustomProduct):
                yield from super().parse_product(response, product)

.. _fix-search:

Fixing search support
=====================

If the default implementation to build a request out of
:ref:`search queries ` does not work on a given website, you can implement
your own search request page object to fix that. See
:ref:`custom-request-template-page`.

For example:

.. code-block:: python

    from web_poet import field, handle_urls
    from zyte_common_items import BaseSearchRequestTemplatePage


    @handle_urls("example.com")
    class ExampleComSearchRequestTemplatePage(BaseSearchRequestTemplatePage):

        @field
        def url(self):
            return "https://example.com/search?q={{ query|quote_plus }}"
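
To get a feel for what the ``{{ query|quote_plus }}`` placeholder in the URL template above produces, the following minimal sketch expands it by hand. It assumes that the ``quote_plus`` filter behaves like :func:`urllib.parse.quote_plus` from the Python standard library. ``expand_search_url`` is a hypothetical helper written only for this illustration; it is not part of zyte-common-items or web-poet, and it does not reproduce how the library renders templates.

```python
from urllib.parse import quote_plus

# Same template string as in the page object above.
URL_TEMPLATE = "https://example.com/search?q={{ query|quote_plus }}"


def expand_search_url(template: str, query: str) -> str:
    """Hypothetical helper: substitute the placeholder with the encoded query.

    Assumption: the quote_plus filter matches urllib.parse.quote_plus, which
    turns spaces into "+" and percent-encodes reserved characters.
    """
    return template.replace("{{ query|quote_plus }}", quote_plus(query))


print(expand_search_url(URL_TEMPLATE, "harry potter & co"))
# https://example.com/search?q=harry+potter+%26+co
```

Checking the expanded URL against a search you ran manually on the target website is a quick way to confirm that the template matches what the site expects.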