HTML extraction in 4 languages

In this post I want to present the same task, solved four times in four different languages.

The task: Parse the front page of the New York Times and extract all main articles with their title, the author name(s) and the URL of the full article. If the author name or the URL can’t be extracted, the article is skipped.

The contenders:

  • Haskell: statically typed functional language with lazy evaluation
  • Clojure: dynamic functional language with persistent data structures
  • Python: dynamic, duck-typed, interpreted language
  • Go: statically typed procedural language

Haskell

Clojure

Python

from bs4 import BeautifulSoup
from urllib.request import urlopen


# Global Variables

nyt_url = "https://www.nytimes.com"

headline_selector = "h2.story-heading"
author_selector = "p.byline"
url_selector = "h2 > a"
summary_selector = "p.summary"

# Utility Functions

def has_class(whitelist=(), blacklist=()):
    # Build a class filter for find_all: the class value must contain
    # every whitelist entry and none of the blacklist entries.
    def _helper(cls):
        if not cls:
            return False
        whitelist_ok = all(k in cls for k in whitelist)
        blacklist_ok = all(k not in cls for k in blacklist)
        return whitelist_ok and blacklist_ok
    return _helper

def fetch_url(url):
    return urlopen(url).read()

def get_articles():
    soup = BeautifulSoup(fetch_url(nyt_url), 'html.parser')
    # First find the content divs, then the story articles inside them.
    divs = soup.find_all("div", class_=has_class(whitelist=['collection'],
                                                 blacklist=['headlines']))
    articles = [div.find_all("article", class_=has_class(whitelist=['story', 'theme-summary'],
                                                         blacklist=['banner']))
                    for div in divs]
    # Flatten the per-div lists into a single list of articles.
    return [article for submatch in articles for article in submatch if article]

def extract(article):
    # Pull out the three fields; anything that can't be found stays None
    # and is filtered out by the caller.
    headline = article.select_one(headline_selector)
    if headline:
        headline = headline.string
    author = article.select_one(author_selector)
    if author:
        author = author.string
    url = article.select_one(url_selector)
    if url:
        url = url['href']
    return {'headline': headline, 'author': author, 'url': url}


if __name__ == '__main__':
    parsed_articles = map(extract, get_articles())
    # Skip articles where any field could not be extracted.
    articles = filter(lambda a: all(a.values()), parsed_articles)
    for article in articles:
        print()
        print(article['headline'])
        print(article['author'])
        print("Url: {}".format(article['url']))

Go

Remarks

As in other similar posts before, I ask you to keep in mind that I am not equally proficient in all four languages. Furthermore, the size and elegance of each solution is strongly influenced by the choice of framework. For each language I tried to pick the most popular framework, or the one I have the most experience with.

The number of lines, not counting comments, blank lines or import statements, is fairly similar in all four cases. However, there are a couple of differences, some subtle, some very apparent.

Notice how the Haskell solution is the only one that doesn’t require filtering the articles for empty values. This is handled silently by the monadic implementation of the scraper, with Maybe working behind the scenes.
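To make that difference concrete, here is a rough Python analogue of what the Maybe monad provides for free. The chain helper is hypothetical (it is not part of any of the four solutions): it applies a series of lookup steps and short-circuits to None as soon as one of them fails, which is essentially Maybe’s bind written by hand.

def chain(value, *steps):
    # Apply each step in turn, stopping with None as soon as any
    # step yields None; a hand-rolled version of Maybe's bind.
    for step in steps:
        if value is None:
            return None
        value = step(value)
    return value

# extract could then be written without any explicit ifs:
def extract_chained(article):
    return {'headline': chain(article,
                              lambda a: a.select_one(headline_selector),
                              lambda tag: tag.string),
            'author': chain(article,
                            lambda a: a.select_one(author_selector),
                            lambda tag: tag.string),
            'url': chain(article,
                         lambda a: a.select_one(url_selector),
                         lambda tag: tag.get('href'))}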

Of the four solutions, the Python one is the least satisfying for me. This is largely due to the fact that BeautifulSoup does not allow nesting criteria as easily as the libraries used in the other solutions, which leads to a lot of manual error-checking boilerplate (see all the ifs in extract). Even worse is the flattening list comprehension in the return statement of get_articles; it will scale horribly with more complicated scraping demands. Python also has a jQuery-like library (pyquery), and using it would probably have led to code similar to the Go version, which uses goquery, but I wanted to use BeautifulSoup because of its popularity within the Python community.
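That said, some of the boilerplate can be avoided with CSS selectors. Below is a sketch (not part of the original comparison) of how get_articles could collapse into a single select call. It assumes a reasonably recent BeautifulSoup, 4.7 or later, whose bundled soupsieve engine supports :not(); the selector mirrors the whitelist/blacklist logic above.

def get_articles_css(soup):
    # One selector instead of nested find_all calls and manual flattening:
    # divs with class 'collection' but not 'headlines', containing articles
    # with classes 'story' and 'theme-summary' but not 'banner'.
    selector = ("div.collection:not(.headlines) "
                "article.story.theme-summary:not(.banner)")
    return soup.select(selector)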