In the second part of this article, I will present a practical example of web scraping in a functional programming style. This is a minimalist example based on a recent project of mine. Have a look at the following simplified project description:
Imagine we need to keep up to date with documents from public institutions about a topic of our interest. In this example, we will collect content about the European Union’s trade policy from two websites: a page with trade news from the European Commission and a site containing press releases issued by the European Parliament. Moreover, assume that the documents are organized in tables or some other form of coherent collection in chronological order.
Scraping the two web pages requires different approaches. The first one is server-side rendered with standard pagination; the second is client-side generated with a load-more button. We could write a spider class for each of these types of web page. Instead, we will apply a functional approach using immutable data structures, functions and, as the centrepiece, Python’s @functools.singledispatch decorator.
You can see the complete code for this example in the GitHub repository of my blog.
Preparing the input data: web pages
First, we define the web pages as data separately from the web scraping process. As explained in the previous section, we use immutable data structures to conform with the functional programming style. More concretely, we define two types for the two kinds of web pages we want to scrape: ServerSideWebPage and ClientSideWebPage.
# scraper/webpages.py
...
from pydantic.dataclasses import dataclass
from pydantic import HttpUrl
...


@dataclass(frozen=True)
class BaseWebPage:
    url: HttpUrl
    name: str
    items: str
    item_xpath: ItemXPath

    # methods to load web page data from a json file not shown here


@dataclass(frozen=True)
class ServerSideWebPage(BaseWebPage):
    pagination: str


@dataclass(frozen=True)
class ClientSideWebPage(BaseWebPage):
    load_more_button: str
    load_more_max: int
In this example, we use pydantic dataclasses to define the web page types. With dataclasses, we can inherit from BaseWebPage, which contains the fields common to all web page types as well as the methods to load web pages from a json file. We pass frozen=True to the @dataclass decorator in order to make the classes immutable.
Pydantic enables runtime type checking of dataclass fields. The fields hold data about the web page, such as its url. The items field contains an XPath expression, as a string, that locates the items to scrape. ItemXPath is again a frozen dataclass that holds the XPath expressions for finding the different pieces of content of each item, such as the title or the publication date. ServerSideWebPage and ClientSideWebPage contain additional fields specific to their type. For example, ClientSideWebPage has two fields that specify where to find the load-more button and how often to click it.
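The ItemXPath dataclass is not shown in the snippet above. As a rough sketch, assuming one XPath expression per piece of content (the field names are my assumption and not taken from the project), it could look like this:
# sketch of ItemXPath in scraper/webpages.py -- field names are assumptions
from pydantic.dataclasses import dataclass


@dataclass(frozen=True)
class ItemXPath:
    title: str
    link: str
    publication_date: str
    description: str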
Running spiders with the singledispatch decorator
Now that we have defined the web page types and are able to load web pages as data into our programme, we can inject them into the web scraping process. The function run_spider takes a web page as input and calls a different set of functions depending on the type of the web page. Thanks to the @singledispatch decorator wrapped around the run_spider function, we can achieve this without clunky conditional logic.
The @singledispatch decorator is part of Python’s functools module. It decorates the run_spider function, a generic function to run our spiders. It can be overloaded based on the type of its first argument, in our case the type of the web page (ServerSideWebPage or ClientSideWebPage). With the @run_spider.register decorator, we can add more ways to scrape web pages. But the function we actually call will always have the name run_spider.
# scraper/dispatch.py
from functools import singledispatch
from dataclasses import asdict
from collections.abc import Iterator

from scraper import webpages
from scraper.documents import Item
from scraper import spider


@singledispatch
def run_spider(webpage: webpages.BaseWebPage) -> Iterator[Item]:
    page = spider.get_page(webpage.url)
    if page is None:
        return
    items = spider.get_items(page, webpage.items)
    yield from spider.parse_items(items, webpage)


@run_spider.register
def _(webpage: webpages.ServerSideWebPage) -> Iterator[Item]:
    page = spider.get_page(webpage.url)
    if page is None:
        return
    items = spider.get_items(page, webpage.items)
    yield from spider.parse_items(items, webpage)
    next_url = spider.get_next_url(page, webpage.url, webpage.pagination)
    if next_url:
        next_page_data = asdict(webpage)
        next_page_data["url"] = next_url
        next_page = webpage.__class__(**next_page_data)
        yield from run_spider(next_page)


@run_spider.register
def _(webpage: webpages.ClientSideWebPage) -> Iterator[Item]:
    with spider.launch_webdriver(options=spider.set_driver_options()) as driver:
        driver = spider.make_driver_request(driver, webpage.url)
        driver = spider.press_load_more_button(driver, webpage)
        page = spider.get_page_from_driver(driver.page_source)
        items = spider.get_items(page, webpage.items)
        yield from spider.parse_items(items, webpage)
In this example, the body of the generic run_spider function handles the BaseWebPage type. We register two more functions that take care of our ServerSideWebPage and ClientSideWebPage types. The function bodies differ according to what is necessary to scrape these web page types. The functions called in the bodies come from the spider module of the project.
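If you ever want to check which implementation will be picked for a given type, the wrapper created by @singledispatch exposes a dispatch() method. A small, hypothetical check could look like this:
# inspect how singledispatch resolves the web page types
from scraper.dispatch import run_spider
from scraper import webpages

# dispatch() returns the implementation that would be called for a given class
print(run_spider.dispatch(webpages.ServerSideWebPage))  # the paginating implementation
print(run_spider.dispatch(webpages.BaseWebPage))        # the generic implementation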
The spider module serves as a toolbox of functions that can be used in the run_spider function. The functions in scraper/spider.py do what they say they do and use Python libraries like requests, lxml and selenium, depending on what is required for a given web page type.
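To give an idea of what this toolbox might contain, here is a hedged sketch of two such functions. The actual implementations in scraper/spider.py may differ; this is just one plausible way to write them with requests and lxml:
# sketch of two possible toolbox functions (simplified error handling)
import requests
from lxml import html
from lxml.etree import _Element


def get_page(url) -> _Element | None:
    # fetch a page and return its parsed HTML tree, or None on failure
    try:
        response = requests.get(str(url), timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return html.fromstring(response.content)


def get_items(page: _Element, items_xpath: str) -> list[_Element]:
    # return the HTML elements that represent the individual documents
    return page.xpath(items_xpath)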
Functional programming advocates often highlight composability as an advantage. This comes out quite clearly here. The implementations of run_spider differ depending on the web page type, but they can also share functions like spider.get_items and spider.parse_items, which appear in all versions of the run_spider function.
Adding or changing web pages is very easy with this approach. For web pages that correspond to one of the types we have already defined, we only have to add or change the input data; the web scraping code is already there. Otherwise, we define a new web page type and register an additional implementation of the run_spider function for this type. Ideally, the new implementation draws on functions that already exist in our toolbox.
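As a hedged illustration of that last point, imagine a site that simply spreads its press releases over a handful of static sub-pages. MultiPageWebPage and its extra_urls field are invented for this sketch and are not part of the project, but the registration mechanism and the reused toolbox functions are the same:
# hypothetical new web page type and run_spider implementation (sketch)
from collections.abc import Iterator

from pydantic.dataclasses import dataclass

from scraper import spider
from scraper.dispatch import run_spider
from scraper.documents import Item
from scraper.webpages import BaseWebPage


@dataclass(frozen=True)
class MultiPageWebPage(BaseWebPage):
    extra_urls: tuple[str, ...]


@run_spider.register
def _(webpage: MultiPageWebPage) -> Iterator[Item]:
    # only the traversal logic is new; the toolbox functions are reused
    for url in (webpage.url, *webpage.extra_urls):
        page = spider.get_page(url)
        if page is None:
            continue
        items = spider.get_items(page, webpage.items)
        yield from spider.parse_items(items, webpage)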
Getting the output data: from Items to Documents
The run_spider function works as a generator that yields scraped Items. An Item is a typing.NamedTuple, which is again immutable. Its fields contain the raw HTML Elements that carry the information we are looking for. Only the link field is a bit more complicated: it is a tuple of the web page’s url and the HTML Element containing the link to a document. We need both to construct absolute urls from relative urls where necessary.
# scraper/documents.py
from pydantic import HttpUrl
from lxml.etree import _Element
from typing import NamedTuple


class Item(NamedTuple):
    title: _Element
    link: tuple[HttpUrl, _Element]
    publication_date: _Element
    description: _Element
We get the final output data for scraped documents by converting Items to Documents. Documents are again pydantic dataclasses with the frozen flag set to True. We take advantage of the @field_validator decorator to convert the HTML Elements of an Item to the final data types (string, date, url) of the corresponding Document. Setting the validation mode to before means that this conversion happens before pydantic checks the field types of the newly created documents.
# continuation of scraper/documents.py
...
from pydantic.dataclasses import dataclass
from pydantic import field_validator
from datetime import date
from dateutil.parser import parse, ParserError
from urllib.parse import urljoin, urlparse
...
# get_text_from_sub_elements is a helper function that gets
# text from lxml Elements
...


@dataclass(frozen=True)
class Document:
    title: str
    link: HttpUrl
    publication_date: date
    description: str

    @field_validator("title", "description", mode="before")
    def validate_text(cls: "Document", element: _Element) -> str:
        txt = get_text_from_sub_elements(element)
        if len(txt.strip()) == 0:
            raise ValueError
        return txt.strip()

    @field_validator("publication_date", mode="before")
    def validate_date(cls: "Document", date_element: _Element) -> date:
        date_str = get_text_from_sub_elements(date_element)
        if not date_str:
            raise TypeError
        try:
            return parse(date_str).date()
        except ParserError:
            raise ValueError

    @field_validator("link", mode="before")
    def validate_link(cls: "Document", link_info: tuple[HttpUrl, _Element]) -> HttpUrl:
        base_url, link_element = link_info
        link = link_element.get("href")
        if link is None:
            raise TypeError
        parsed_link = urlparse(link)
        if (len(parsed_link.scheme) == 0) or (len(parsed_link.netloc) == 0):
            return HttpUrl(urljoin(str(base_url), link))
        return HttpUrl(link)
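To see the validators at work, here is a hypothetical usage sketch. The HTML snippets and the base url are invented for illustration; the sketch only assumes that get_text_from_sub_elements extracts the text of an element and its children:
# hypothetical usage sketch -- the HTML snippets are invented
from lxml import html
from pydantic import HttpUrl

from scraper.documents import Document, Item

item = Item(
    title=html.fromstring("<h3>Joint statement on digital trade</h3>"),
    link=(
        HttpUrl("https://policy.trade.ec.europa.eu/"),
        html.fromstring("<a href='/news/joint-statement_en'>Read more</a>"),
    ),
    publication_date=html.fromstring("<span>20 July 2023</span>"),
    description=html.fromstring("<p>Statement</p>"),
)

document = Document(**item._asdict())
# the validators have turned the lxml elements into str, date and HttpUrl,
# and the relative link has been joined with the base url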
Putting everything together
With all the different pieces in place, we can use the scraper. In this example, we use a simple command line tool built with the click library. The command takes the path to the json file containing the data describing the web pages we want to scrape. Optionally, we can indicate the maximum number of days we want to go back to look for documents. By default, the command looks back only 40 days to avoid running through the entire pagination each time we use the scraper. We can execute the following command:
scraper_cli run-spiders data/webpage.json --max-days 60
Executing the command loads the list of web pages from the json file and starts the scraping process that is defined by the functions in scraper/pipeline.py. When loading the web page data, we use a helper function that automatically identifies the type based on the keys of the json data.
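The command line tool itself is not the focus here. A hedged sketch of what the click command could look like (load_webpages and the value for max_conc_req are assumptions on my part):
# sketch of the command line entry point -- load_webpages and max_conc_req=4 are assumptions
import click

from scraper import pipeline, webpages


@click.group()
def scraper_cli() -> None:
    """Command line interface for the scraper."""


@scraper_cli.command(name="run-spiders")
@click.argument("webpages_file", type=click.Path(exists=True))
@click.option("--max-days", default=40, show_default=True, help="How many days to look back.")
def run_spiders(webpages_file: str, max_days: int) -> None:
    # load the web page data and hand it over to the pipeline
    webpage_list = webpages.load_webpages(webpages_file)  # hypothetical loader
    pipeline.scrape_webpages(webpage_list, max_conc_req=4, max_days=max_days)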
Here is what happens when the command calls the pipeline:
# scraper/pipeline.py
from collections.abc import Sequence
from datetime import datetime, timedelta
import concurrent.futures
from functools import partial
from pydantic import ValidationError
import logging

from scraper.dispatch import run_spider
from scraper.documents import Document
from scraper.webpages import BaseWebPage


def insert_doc_to_db(document: Document) -> None:
    # code for DB insertion would go here
    logging.info(f"\nAdded document:\n{document}\n")


def add_documents(wp: BaseWebPage, max_days: int) -> None:
    for item in run_spider(wp):
        try:
            document = Document(**item._asdict())
            if max_days and datetime.combine(
                document.publication_date, datetime.min.time()
            ) < (datetime.now() - timedelta(days=max_days)):
                break
            insert_doc_to_db(document)
        except ValidationError:
            logging.error("Skipping document with invalid data.")
            continue


def scrape_webpages(
    wp_list: Sequence[BaseWebPage], max_conc_req: int, max_days: int
) -> None:
    max_workers = min(max_conc_req, len(wp_list))
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        executor.map(partial(add_documents, max_days=max_days), wp_list)
In scraper/pipeline.py, we use threads with the concurrent.futures module to run the scraping process. The thread pool executor maps each web page from our list to a function that adds the documents found on this web page. To pass the add_documents function together with the argument for max_days, we use partial, another higher-order function from the functools module. It "freezes" the max_days argument to the value we have given.
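As a quick illustration (with 60 days as an example value and webpage standing in for one of our loaded web pages), the following two calls are equivalent:
# partial freezes max_days; executor.map then only has to pass the web page
from functools import partial

scrape_with_limit = partial(add_documents, max_days=60)

scrape_with_limit(webpage)           # what executor.map effectively calls
add_documents(webpage, max_days=60)  # equivalent direct call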
add_documents then calls the run_spider generator function (the one we decorated with @singledispatch) based on the type of the web page. The yielded items get converted to documents and inserted into a database. As you can see, the functions in scraper/pipeline.py do not return anything, as their purpose is to write to a database; here we compromise on the purity of functional programming. The database code is not part of the example, so we simply log the scraping results.
Here’s an example output:
INFO:root:Starting scraping process.
…
INFO:root:
Added document:
Document(
title='Joint Statement on the launch of negotiations for an EU-Singapore digital trade agreement',
link=Url('https://policy.trade.ec.europa.eu/news/joint-statement-launch-negotiations-eu-singapore-digital-trade-agreement-2023-07-20_en'),
publication_date=datetime.date(2023, 7, 20),
description='Statement')
…
INFO:root:
Added document:
Document(
title='MEPs renew trade support measures for Moldova for one year',
link=Url('https://www.europarl.europa.eu/news/en/press-room/20230707IPR02410/'),
publication_date=datetime.date(2023, 11, 7),
description='Parliament gave its green light to suspending EU import duties on Moldovan exports of agricultural products for another year to support the country’s economy.')
…
INFO:root:Scraping completed in 4 seconds.
Conclusion
This concludes our example of how to implement web scraping in a functional programming style without taking purism to the extreme. It illustrates some of the advantages of functional programming in the specific context of web scraping. Most importantly, adding and changing web pages is easy because the code is extensible and composable. If we want to add a web page that corresponds to a type we have already defined, it is sufficient to add this web page to the data we read from (a json file or a database table). If the web page differs from the types we have defined so far, we define a new type and register a new run_spider function.
As explained in the first part, Python is agnostic when it comes to programming styles. A mature web scraping framework like Scrapy does a very good job with spider classes. But, personally, the functional programming paradigm has helped me to write cleaner code, especially when building a custom web scraping pipeline from the ground up.
Reminder: you can see the complete code and instructions on how to use the example in the GitHub repository of my blog.