A functional approach to web scraping with Python’s singledispatch decorator - Part I: theory

Published on Sept. 2, 2023 by Sebastian Paulo in Programming

I have been building quite a few data collection pipelines involving web scraping. For a long time, I have associated web scraping projects with a heavy dose of object-oriented programming (OOP). Most Python programmers will be familiar with spider classes like the one in this code snippet, copied from the well-known web scraping framework Scrapy:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/tag/humor/",
    ]

    def parse(self, response):
        # Extract the author and text of every quote on the page.
        for quote in response.css("div.quote"):
            yield {
                "author": quote.xpath("span/small/text()").get(),
                "text": quote.css("span.text::text").get(),
            }

        # Follow the pagination link, if there is one.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

More recently, I have been working on projects that put a lot of emphasis on a functional programming style in Python. This has led me to think about how I would write web scraping code that fits more naturally under this paradigm. Functional programming emphasizes immutable data structures and pure functions: data flows through functions that receive input and return output, avoiding side effects as far as possible. Moreover, functions can take other functions as input and return functions as output (higher-order functions).
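To make these terms concrete, here is a minimal, hypothetical illustration; extract_title and scrape_all are made-up names for this example, not code from a real project:

from typing import Callable


def extract_title(html: str) -> str:
    """A pure function: the same input always yields the same output, with no side effects."""
    start = html.find("<title>") + len("<title>")
    end = html.find("</title>")
    return html[start:end]


def scrape_all(pages: list[str], extractor: Callable[[str], str]) -> list[str]:
    """A higher-order function: it takes another function as input."""
    return [extractor(page) for page in pages]


print(scrape_all(["<title>Quotes to Scrape</title>"], extract_title))
# ['Quotes to Scrape']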

This is different from OOP, where objects (e.g. an instance of a spider class) keep track of data as their internal state, and this state can change through methods associated with the object. In the example above, data and behaviour are much more intertwined than they would be in a functional style. The spider class holds data in attributes such as the spider's name and its start URLs, but data about the web page to be scraped also lives inside the parse method, in the CSS selectors that identify the relevant content.

In this blog post, I explain how I would go about writing "functional" code for web scraping. In particular, I will define the web pages I want to scrape entirely as data that go as input into a series of functions, which output the scraped content as data again. Of course, there are limits to how pure functional web scraping code can be. Functional programming insists on avoiding side effects to protect programmes from the unpredictability of the "outside world". However, the main purpose of web scraping is to engage with objects outside your programme, i.e. getting data from web sites.
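As a rough sketch of what this can look like (the PageConfig structure, the selectors and the scrape function below are my own illustration, assuming requests and BeautifulSoup are available, not the code from Part II):

from typing import NamedTuple

import requests
from bs4 import BeautifulSoup


class PageConfig(NamedTuple):
    """An immutable description of a web page: everything the scraper needs, as data."""
    url: str
    item_selector: str
    field_selectors: dict[str, str]


def scrape(page: PageConfig) -> list[dict[str, str]]:
    """Fetch the page and return its content as plain data; the page definition is never mutated."""
    soup = BeautifulSoup(requests.get(page.url).text, "html.parser")
    return [
        {field: item.select_one(selector).get_text(strip=True)
         for field, selector in page.field_selectors.items()}
        for item in soup.select(page.item_selector)
    ]


quotes_page = PageConfig(
    url="https://quotes.toscrape.com/tag/humor/",
    item_selector="div.quote",
    field_selectors={"text": "span.text", "author": "span small"},
)
# scrape(quotes_page) -> [{"text": "...", "author": "..."}, ...]

The HTTP request itself is, of course, the unavoidable side effect mentioned above; everything around it stays pure.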

As the HOWTO on functional programming in the Python documentation points out, Python is very flexible about programming paradigms and allows combinations of different styles. In other words, Python does not require you to convert your spider classes into functions just because the rest of a project is supposed to be "functional". So why bother in the first place?

Potential advantages of "functional" web scraping

I do not argue that using spider classes is generally inferior to what I suggest in this post. Most of the time, you should just go with what works for you without having to engage in ideological debates about programming paradigms. However, functional programming is generally associated with some advantages. I have experienced some of these benefits specifically with regard to web scraping.

Web scraping can be notoriously unstable. That is why it is important to plan ahead and write code that is as resilient as possible. Otherwise, you might have to write and change code frequently to scrape new web sites or react to changes on web sites you have already been scraping. Although you can write clean and resilient code under any paradigm, I find that functional programming forces me to think in ways that lead to writing less code and changing code less often.

Avoid code proliferation and keep it simple

With OOP, there is a temptation to cut the planning process short and add a new spider class whenever a web page does not seem to fit the scraping pattern of the existing classes in the code base. This can lead to code proliferation and make maintenance and testing more costly. By contrast, the composability of code written in a functional style makes it easy to rearrange existing functions to scrape different types of web pages, which can be less confusing over time than complex class inheritance or mixins.
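To illustrate what I mean by composability, here is a hedged sketch; compose, fetch, parse_listing and parse_article are hypothetical helpers, not code from an actual scraper:

from functools import reduce
from typing import Callable


def compose(*funcs: Callable) -> Callable:
    """Chain single-argument functions left to right into one pipeline."""
    return lambda value: reduce(lambda acc, func: func(acc), funcs, value)


# Small, reusable steps (stubs for the sake of the example):
def fetch(url: str) -> str: ...
def parse_listing(html: str) -> list[dict]: ...
def parse_article(html: str) -> dict: ...


# Different page types are just different arrangements of the same building blocks.
scrape_listing = compose(fetch, parse_listing)
scrape_article = compose(fetch, parse_article)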

Change the input data instead of the code

Another advantage I see with functional programming in web scraping is that it makes you think about web pages as data, separate from the scraping logic. You can define a web page as an immutable data structure that is easily generated from a JSON file or a database table. This way, adding or changing a web page often does not require any changes in the code. In one of my projects, I developed an admin interface that allows adding and changing the web pages to scrape merely by filling out a form. Most of the time, adding, updating or deleting web pages does not require a programmer at all; anyone with some knowledge of HTML and CSS can submit the correct selectors through the admin interface.
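A minimal sketch of that separation, reusing the PageConfig structure from above and an invented JSON layout:

import json
from typing import NamedTuple


class PageConfig(NamedTuple):
    url: str
    item_selector: str
    field_selectors: dict[str, str]


# In a real project this would come from a JSON file or a database table.
PAGES_JSON = """
[
  {
    "url": "https://quotes.toscrape.com/tag/humor/",
    "item_selector": "div.quote",
    "field_selectors": {"text": "span.text", "author": "span small"}
  }
]
"""

pages = [PageConfig(**entry) for entry in json.loads(PAGES_JSON)]
# Adding, updating or removing a page means editing the JSON (or a database row),
# not the scraping code.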

Functional programming as a learning experience

I would agree that you can achieve the same advantages by writing well-thought-out spider classes, but I still find that functional programming is very good at nudging programmers to think in certain ways and do the right thing. Converting code to a functional programming style is a great learning opportunity. On a very general level, it is an exercise in refactoring that deepens your understanding of your project.

Last but not least, it is a great opportunity to work with some of Python’s cool tools for functional programming. In the second part of this post, I will show a practical example of web scraping in a functional programming style. The singledispatch decorator from Python’s functools module is the cornerstone of this example. So let’s go to the next part and dive right into the example.
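As a preview, and only as a generic illustration of how the decorator works rather than the actual code from Part II, singledispatch lets one function name dispatch to different implementations based on the type of its first argument; ListingPage and ArticlePage are invented marker types:

from functools import singledispatch


class ListingPage:
    """Hypothetical marker type for listing pages."""


class ArticlePage:
    """Hypothetical marker type for article pages."""


@singledispatch
def scrape(page) -> list[dict]:
    raise NotImplementedError(f"No scraper registered for {type(page).__name__}")


@scrape.register
def _(page: ListingPage) -> list[dict]:
    return [{"type": "listing"}]  # listing-specific scraping logic would go here


@scrape.register
def _(page: ArticlePage) -> list[dict]:
    return [{"type": "article"}]  # article-specific scraping logic would go here


print(scrape(ListingPage()))  # dispatch happens on the argument's type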
