A functional approach to web scraping with Python’s singledispatch decorator - Part II: practice

Published on Sept. 2, 2023 by Sebastian Paulo in Programming

In the second part of this article, I present a practical example of web scraping in a functional programming style. It is a minimalist example based on a recent project of mine. Have a look at the following simplified project description:

Imagine we need to keep up to date with documents from public institutions about a topic of our interest. In this example, we will collect content about the European Union’s trade policy from two websites: a page with trade news from the European Commission and a site containing press releases issued by the European Parliament. Moreover, assume that the documents are organized in tables or some other form of coherent collection in chronological order.

Scraping the two web pages requires different approaches. The first one is server-side rendered with standard pagination; the second is client-side generated with a load-more button. We could write a spider class for each of these types of web page. Instead, we will apply a functional approach using immutable data structures, functions and, as the centrepiece, Python’s @functools.singledispatch decorator.

You can see the complete code for this example in the GitHub repository of my blog.

Preparing the input data: web pages

First, we define the web pages as data, separately from the web scraping process. As explained in the first part of this article, we use immutable data structures to conform with the functional programming style. More concretely, we define two types for the two kinds of web pages we want to scrape: ServerSideWebPage and ClientSideWebPage.

# scraper/webpages.py
...
from pydantic.dataclasses import dataclass
from pydantic import HttpUrl
...

@dataclass(frozen=True)
class BaseWebPage:
    url: HttpUrl
    name: str
    items: str
    item_xpath: ItemXPath

    # methods to load web page data from a json file not shown here


@dataclass(frozen=True)
class ServerSideWebPage(BaseWebPage):
    pagination: str


@dataclass(frozen=True)
class ClientSideWebPage(BaseWebPage):
    load_more_button: str
    load_more_max: int

In this example, we use pydantic dataclasses to define the web page types. Using dataclasses lets ServerSideWebPage and ClientSideWebPage inherit from BaseWebPage, which contains the fields common to all web page types as well as the methods to load web pages from a json file. Setting the @dataclass decorator's frozen flag to True makes the classes immutable.

Pydantic enables runtime type checking of the dataclass fields. The fields hold data about the web page, such as its url. The items field contains an xpath expression, as a string, that tells the scraper how to find the items on the page. ItemXPath is again a frozen dataclass; it holds the xpath expressions for finding the different pieces of content within each item, such as the title or the publication date. ServerSideWebPage and ClientSideWebPage contain additional fields specific to their type. For example, ClientSideWebPage has two fields that specify where to find the load-more button and how often to click it.
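The ItemXPath dataclass itself is not shown above. A minimal sketch of what it could look like, assuming one xpath expression per piece of content, evaluated relative to an item element, might be:

# sketch of ItemXPath in scraper/webpages.py; the field names are assumptions
from pydantic.dataclasses import dataclass


@dataclass(frozen=True)
class ItemXPath:
    # each field holds an xpath expression, evaluated relative to one item element
    title: str
    link: str
    publication_date: str
    description: str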

Running spiders with the singledispatch decorator

Now that we have defined the web page types and can load web pages as data into our program, we can inject them into the web scraping process. The function run_spider takes a web page as input and calls a different set of functions depending on the type of the web page. Thanks to the @singledispatch decorator wrapped around the run_spider function, we can achieve this without clunky conditional logic.

The @singledispatch decorator is part of Python’s functools module. It decorates run_spider, a generic function for running our spiders. This generic function can be overloaded based on the type of its first argument, in our case the type of the web page (ServerSideWebPage or ClientSideWebPage). With the @run_spider.register decorator, we can add more implementations for scraping other kinds of web pages. But the function we actually call will always have the name run_spider.
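Before we look at the scraper code, here is a small, self-contained illustration of the mechanism, unrelated to the project. The decorated function also gains a dispatch() method that tells us which implementation will be used for a given type:

# toy example of functools.singledispatch, independent of the scraper
from functools import singledispatch


@singledispatch
def describe(value: object) -> str:
    return f"something generic: {value!r}"


@describe.register
def _(value: int) -> str:
    return f"an integer: {value}"


@describe.register
def _(value: str) -> str:
    return f"a string: {value}"


print(describe(42))            # an integer: 42
print(describe("spider"))      # a string: spider
print(describe(3.14))          # falls back to the generic implementation
print(describe.dispatch(int))  # the function object registered for int

Back to the project, here is how run_spider is dispatched over the web page types: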

# scraper/dispatch.py
from functools import singledispatch
from dataclasses import asdict
from collections.abc import Iterator
from scraper import webpages
from scraper.documents import Item
from scraper import spider


@singledispatch
def run_spider(webpage: webpages.BaseWebPage) -> Iterator[Item]:
    page = spider.get_page(webpage.url)
    if page is None:
        return
    items = spider.get_items(page, webpage.items)
    yield from spider.parse_items(items, webpage)


@run_spider.register
def _(webpage: webpages.ServerSideWebPage) -> Iterator[Item]:
    page = spider.get_page(webpage.url)
    if page is None:
        return
    items = spider.get_items(page, webpage.items)
    yield from spider.parse_items(items, webpage)
    next_url = spider.get_next_url(page, webpage.url, webpage.pagination)
    if next_url:
        next_page_data = asdict(webpage)
        next_page_data["url"] = next_url
        next_page = webpage.__class__(**next_page_data)
        yield from run_spider(next_page)


@run_spider.register
def _(webpage: webpages.ClientSideWebPage) -> Iterator[Item]:
    with spider.launch_webdriver(options=spider.set_driver_options()) as driver:
        driver = spider.make_driver_request(driver, webpage.url)
        driver = spider.press_load_more_button(driver, webpage)
        page = spider.get_page_from_driver(driver.page_source)
        items = spider.get_items(page, webpage.items)
        yield from spider.parse_items(items, webpage)

In this example, the body of the generic run_spider function handles the BaseWebPage type. We register two more functions that will take care of our ServerSideWebPage and ClientSideWebPage types. The function bodies differ according to what is necessary to scrape these web page types. The functions called in the body come from the spider module in the project.

The spider module serves as a toolbox of functions that can be used in the run_spider function. The functions in scraper/spider.py do what they say they do and use Python libraries like requests, lxml and selenium depending on what is required for a given web page type.
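The actual implementations live in the repository. As a rough sketch, and assuming requests and lxml are used more or less as described, two of these helpers might look like this:

# hypothetical sketch of two helpers in scraper/spider.py
import logging

import requests
from lxml import html
from lxml.etree import _Element
from pydantic import HttpUrl


def get_page(url: HttpUrl) -> _Element | None:
    """Fetch a page and return the parsed HTML tree, or None if the request fails."""
    try:
        response = requests.get(str(url), timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        logging.error(f"Request failed for {url}")
        return None
    return html.fromstring(response.content)


def get_items(page: _Element, items_xpath: str) -> list[_Element]:
    """Return the item elements selected by the web page's items xpath."""
    return page.xpath(items_xpath)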

Functional programming advocates often highlight composability as an advantage, and it comes out quite clearly here. The implementations of run_spider differ depending on the web page type, but they can also reuse the same functions, such as spider.get_items and spider.parse_items, which appear in all versions of run_spider.

Adding or changing web pages is very easy with this approach. For web pages that correspond to one of the types we have already defined, we only have to add or change the input data; the web scraping code is already there. Otherwise, we define a new web page type and register an additional implementation of the run_spider function for this type. Ideally, the new implementation draws on some functions that already exist in our toolbox.
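As a purely hypothetical illustration, suppose we later encounter a client-side page that loads more content through infinite scrolling instead of a load-more button. Extending the scraper could then look roughly like this; InfiniteScrollWebPage and spider.scroll_down do not exist in the project and stand in for whatever the new type would need:

# hypothetical extension of scraper/webpages.py and scraper/dispatch.py
# (imports as in the modules shown above)
@dataclass(frozen=True)
class InfiniteScrollWebPage(BaseWebPage):
    scroll_count: int


@run_spider.register
def _(webpage: InfiniteScrollWebPage) -> Iterator[Item]:
    with spider.launch_webdriver(options=spider.set_driver_options()) as driver:
        driver = spider.make_driver_request(driver, webpage.url)
        # scroll_down would be a new helper added to the spider toolbox
        driver = spider.scroll_down(driver, webpage.scroll_count)
        page = spider.get_page_from_driver(driver.page_source)
        items = spider.get_items(page, webpage.items)
        yield from spider.parse_items(items, webpage)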

Getting the output data: from Items to Documents

The run_spider function works as a generator that yields scraped Items. An Item is a typing.NamedTuple, which is again immutable. Its fields contain the raw HTML elements that carry the information we are looking for. Only the link field is a bit more complicated: it is a tuple of the web page’s url and the HTML element containing the link to a document. We need both to construct absolute urls from relative urls where necessary.

# scraper/documents.py
from pydantic import HttpUrl
from lxml.etree import _Element
from typing import NamedTuple

class Item(NamedTuple):
    title: _Element
    link: tuple[HttpUrl, _Element]
    publication_date: _Element
    description: _Element

We get the final output data for scraped documents by converting Items to Documents. A Document is again a pydantic dataclass with the frozen flag set to True. We take advantage of the @field_validator decorator to convert the HTML elements of an Item into the final data types (string, date, url) of the corresponding Document. Setting the validation mode to before means that this conversion happens before pydantic checks the field types of the newly created Document.

# continuation of scraper/documents.py
...
from pydantic.dataclasses import dataclass
from pydantic import field_validator
from datetime import date
from dateutil.parser import parse, ParserError
from urllib.parse import urljoin, urlparse

...
# get_text_from_sub_elements is a helper function that gets
# text from lxml Elements
...

@dataclass(frozen=True)
class Document:
    title: str
    link: HttpUrl
    publication_date: date
    description: str

    @field_validator("title", "description", mode="before")
    def validate_text(cls: "Document", element: _Element) -> str:
        txt = get_text_from_sub_elements(element)
        if len(txt.strip()) == 0:
            raise ValueError
        return txt.strip()

    @field_validator("publication_date", mode="before")
    def validate_date(cls: "Document", date_element: _Element) -> date:
        date_str = get_text_from_sub_elements(date_element)
        if not date_str:
            raise TypeError
        try:
            return parse(date_str).date()
        except ParserError:
            raise ValueError

    @field_validator("link", mode="before")
    def validate_link(cls: "Document", link_info: tuple[HttpUrl, _Element]) -> HttpUrl:
        base_url, link_element = link_info
        link = link_element.get("href")
        if link is None:
            raise TypeError
        parsed_link = urlparse(link)
        if (len(parsed_link.scheme) == 0) or (len(parsed_link.netloc) == 0):
            return HttpUrl(urljoin(str(base_url), link))
        return HttpUrl(link)

Putting everything together

With all the different pieces in place, we can use the scraper. In this example, we use a simple command line tool built with the click library. The command takes the path to the json file containing the data that describes the web pages we want to scrape. Optionally, we can indicate the maximum number of days we want to go back when looking for documents. By default, the command only looks back 40 days to avoid running through the entire pagination each time we use the scraper. We can execute the following command:

scraper_cli run-spiders data/webpage.json --max-days 60
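The command line tool itself lives in the repository. A rough sketch of what the click command might look like, with the 40-day default from the description and everything else an assumption, is:

# hypothetical sketch of the scraper_cli entry point
import click

from scraper import pipeline, webpages


@click.group()
def cli() -> None:
    """Command line tool for the scraper."""


@cli.command(name="run-spiders")
@click.argument("webpages_file", type=click.Path(exists=True))
@click.option("--max-days", default=40, show_default=True, help="How many days to look back.")
def run_spiders(webpages_file: str, max_days: int) -> None:
    # load_webpages is a hypothetical loader; a possible shape is sketched below
    wp_list = webpages.load_webpages(webpages_file)
    # the maximum number of concurrent requests is an assumption here
    pipeline.scrape_webpages(wp_list, max_conc_req=4, max_days=max_days)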

Executing the command loads the list of web pages from the json file and starts the scraping process that is defined by the functions in scraper/pipeline.py. When loading the web page data, we use a helper function that automatically identifies the type based on the keys of the json data.
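This helper is not shown in the post either. A minimal sketch of how loading and type detection based on json keys could work, with hypothetical function names, is:

# hypothetical sketch of loading web pages and picking their type from json keys
import json
from typing import Any

from scraper.webpages import BaseWebPage, ClientSideWebPage, ServerSideWebPage


def webpage_from_dict(data: dict[str, Any]) -> BaseWebPage:
    """Choose the dataclass to instantiate based on the keys present in the data."""
    if "load_more_button" in data:
        return ClientSideWebPage(**data)
    if "pagination" in data:
        return ServerSideWebPage(**data)
    return BaseWebPage(**data)


def load_webpages(path: str) -> list[BaseWebPage]:
    """Read the json file and convert each entry into the matching web page type."""
    with open(path, encoding="utf-8") as f:
        return [webpage_from_dict(entry) for entry in json.load(f)]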

Here is what happens when the command calls the pipeline:

# scraper/pipeline.py
from collections.abc import Sequence
from datetime import datetime, timedelta
import concurrent.futures
from functools import partial
from pydantic import ValidationError
import logging
from scraper.dispatch import run_spider
from scraper.documents import Document
from scraper.webpages import BaseWebPage

def insert_doc_to_db(document: Document) -> None:
    # code for DB insertion would go here
    logging.info(f"\nAdded document:\n{document}\n")

def add_documents(wp: BaseWebPage, max_days: int) -> None:
    for item in run_spider(wp):
        try:
            document = Document(**item._asdict())
            if max_days and datetime.combine(
                document.publication_date, datetime.min.time()
            ) < (datetime.now() - timedelta(days=max_days)):
                break
            insert_doc_to_db(document)
        except ValidationError:
            logging.error("Skipping document with invalid data.")
            continue


def scrape_webpages(
    wp_list: Sequence[BaseWebPage], max_conc_req: int, max_days: int
) -> None:
    max_workers = min(max_conc_req, len(wp_list))
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        executor.map(partial(add_documents, max_days=max_days), wp_list)

In scraper/pipeline.py, we use threads from the concurrent.futures module to run the scraping process. The thread pool executor maps each web page from our list to a function that adds the documents found on that web page. To pass the add_documents function together with the argument for max_days, we use partial, another higher-order function from the functools module. It "freezes" the max_days argument to the value we have given.
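Outside the pipeline, partial works the same way in a context-free toy example:

# what functools.partial does, illustrated with a toy function
from functools import partial


def add(a: int, b: int) -> int:
    return a + b


add_ten = partial(add, b=10)  # "freezes" b to 10
print(add_ten(5))             # 15, the same as add(5, b=10)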

add_documents then calls the run_spider generator function (the one we decorated with @singledispatch) based on the type of the web page. The yielded items get converted to documents and inserted into a database. As you can see, the functions in scraper/pipeline.py do not return anything, as their purpose is to write to a database; here we compromise on the purity of functional programming. The database code is not part of the example, and we simply log the scraping results.
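If we did want to persist the results, a minimal sketch of what insert_doc_to_db could do with the standard library's sqlite3 module, assuming a documents table already exists, might be:

# hypothetical database insertion with sqlite3; not part of the example project
import sqlite3

from scraper.documents import Document


def insert_doc_to_db(document: Document) -> None:
    # assumes a documents table with a unique constraint on link already exists
    with sqlite3.connect("documents.db") as conn:
        conn.execute(
            "INSERT OR IGNORE INTO documents"
            " (title, link, publication_date, description)"
            " VALUES (?, ?, ?, ?)",
            (
                document.title,
                str(document.link),
                document.publication_date.isoformat(),
                document.description,
            ),
        )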

Here’s an example output:

INFO:root:Starting scraping process.

…
INFO:root:
Added document:
Document(
    title='Joint Statement on the launch of negotiations for an EU-Singapore digital trade agreement',
    link=Url('https://policy.trade.ec.europa.eu/news/joint-statement-launch-negotiations-eu-singapore-digital-trade-agreement-2023-07-20_en'),
    publication_date=datetime.date(2023, 7, 20),
    description='Statement')
…

INFO:root:
Added document:
Document(
    title='MEPs renew trade support measures for Moldova for one year',
    link=Url('https://www.europarl.europa.eu/news/en/press-room/20230707IPR02410/'),
    publication_date=datetime.date(2023, 11, 7),
    description='Parliament gave its green light to suspending EU import duties on Moldovan exports of agricultural products for another year to support the country’s economy.')

…

INFO:root:Scraping completed in 4 seconds.

Conclusion

This concludes our example of how to implement web scraping in a functional programming style without taking purism to the extreme. It illustrates some of the advantages of functional programming in the particular context of web scraping. Most importantly, adding and changing web pages is easy because the code is extensible and composable. If we want to add a web page that corresponds to a type we have already defined, it is sufficient to add this web page to the data we read from (a json file or a database table). If the web page is different from the types we have defined so far, we define a new type and register a new run_spider implementation.

As explained in the first part, Python is agnostic when it comes to programming styles. A mature web scraping framework like Scrapy does a very good job with spider classes. But personally, I have found that the functional programming paradigm helps me write cleaner code, especially when building a custom web scraping pipeline from the ground up.

Reminder: you can see the complete code and instructions on how to use the example in the GitHub repository of my blog.
