HTML extraction in 4 languages

In this post I want to present the same task, solved four times in four different languages.

The task: Parse the front page of the New York Times and extract all main articles with their title, the author name(s) and the URL to the full article. If the author name or the URL can’t be extracted, the article is skipped.

The contenders:

  • Haskell: Static functional language with lazy evaluation
  • Clojure: Dynamic functional language with persistent data types
  • Python: Dynamic, duck-typed and interpreted language
  • Go: Static procedural language

Haskell

{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Monad
import Text.HTML.Scalpel
import Text.HTML.TagSoup


data ArticleInfo = ArticleInfo
  { articleHeadline :: String
  , articleAuthor :: String
  , articleUrl :: String
  } deriving (Show)

-- Global Variables

nytUrl = "https://www.nytimes.com"

articleSelector = ("div" @: [hasClass "collection",
                             notP (hasClass "headlines")])
                  //
                  ("article" @: [hasClass "story",
                                 hasClass "theme-summary",
                                 notP (hasClass "banner")])
headlineSelector = "h2" @: [hasClass "story-heading"]
authorSelector = "p" @: [hasClass "byline"]
urlSelector = "h2" // "a"

-- Utility functions

extract :: Scraper String ArticleInfo
extract = do
  headline <- text headlineSelector
  author <- text authorSelector
  url <- attr "href" urlSelector
  return (ArticleInfo headline author url)

articleScraper :: Scraper String [ArticleInfo]
articleScraper = chroots articleSelector extract

printResult :: Maybe [ArticleInfo] -> IO ()
printResult Nothing = putStrLn "Failed to parse any articles"
printResult (Just articles) = forM_ articles (\article ->
    do
      putStrLn ""
      putStrLn (articleHeadline article)
      putStrLn (articleAuthor article)
      putStrLn ("Url: " ++ (articleUrl article)))

main :: IO ()
main = do
  scrapeResult <- scrapeURL nytUrl articleScraper
  printResult scrapeResult

Clojure

(ns nyt-clojure.core
  (:gen-class)
  (:require [net.cgrand.enlive-html :as html]
            [clojure.string :as str]))


;; Global Variables

(def nyt-url "https://www.nytimes.com")

(def article-selector [[:div.collection
                        (html/but :.headlines)]
                       [:article.story
                        :article.theme-summary
                        (html/but :.banner)]])
(def headline-selector [:h2.story-heading])
(def author-selector   [:p.byline])
(def url-selector      [:h2 :> :a])

;; Utility functions

(defn fetch-url [url]
  (html/html-resource (java.net.URL. url)))

(defn get-articles
  "Fetches the website of the New York Times and then extracts the articles."
  []
  (html/select (fetch-url nyt-url) article-selector))

(defn extract
  "Extracts the headline, author and URL of an article."
  [node]
  (let [headline-node (first (html/select [node] headline-selector))
        author-node   (first (html/select [node] author-selector))
        url-node      (first (html/select [node] url-selector))
        headline      (html/text headline-node)
        author        (html/text author-node)
        url           (:href (:attrs url-node))]
    {:headline headline :author author :url url}))

(defn -main
  "Parses the website of the New York Times and prints a list of the front page
  articles."
  [& args]
  (let [parsed-articles (map extract (get-articles))
        articles        (filter
                          #(every? false? (map str/blank? (vals %)))
                          parsed-articles)]
    (doseq [article articles]
      (println "")
      (println (:headline article))
      (println (:author article))
      (println (str "Url: " (:url article))))))

Python

from bs4 import BeautifulSoup
from urllib.request import urlopen


# Global Variables

nyt_url = "https://www.nytimes.com"

headline_selector = "h2.story-heading"
author_selector = "p.byline"
url_selector = "h2 > a"
summary_selector = "p.summary"

# Utility Functions

def has_class(whitelist=(), blacklist=()):
    """Build a predicate for BeautifulSoup's class_ argument: the tag's classes
    must include every entry of the whitelist and none of the blacklist."""
    def _helper(cls):
        if cls:
            whitelist_ok = all(map(lambda k: k in cls, whitelist))
            blacklist_ok = all(map(lambda k: k not in cls, blacklist))
            return whitelist_ok and blacklist_ok
        return False
    return _helper

def fetch_url(url):
    return urlopen(url).read()

def get_articles():
    soup = BeautifulSoup(fetch_url(nyt_url), 'html.parser')
    divs = soup.find_all("div", class_=has_class(whitelist=['collection'],
                                                 blacklist=['headlines']))
    articles = [div.find_all("article", class_=has_class(whitelist=['story', 'theme-summary'],
                                                         blacklist=['banner']))
                    for div in divs]
    return [article for submatch in articles for article in submatch if article]

def extract(article):
    headline = article.select_one(headline_selector)
    if headline:
        headline = headline.string
    author = article.select_one(author_selector)
    if author:
        author = author.string
    url = article.select_one(url_selector)
    if url:
        url = url['href']
    return {'headline': headline, 'author': author, 'url': url}


if __name__ == '__main__':
    parsed_articles = map(extract, get_articles())
    articles = filter(lambda a: all(a.values()), parsed_articles)
    for article in articles:
        print()
        print(article['headline'])
        print(article['author'])
        print("Url: {}".format(article['url']))

Go

package main

import (
    "fmt"
    "log"

    "github.com/PuerkitoBio/goquery"
)


const NytUrl = "https://www.nytimes.com"

func getArticles(doc *goquery.Document) *goquery.Selection {
    return doc.Find("div.collection").
               Not(".headlines").
               Find("article.story.theme-summary").
               Not(".banner")
}

func extract(sel *goquery.Selection) map[string]string {
    headline := sel.Find("h2.story-heading").Text()
    author := sel.Find("p.byline").Text()
    url, found := sel.Find("h2 > a").Attr("href")
    if headline == "" || author == "" || !found {
        return nil
    }
    return map[string]string{
        "headline": headline,
        "author":   author,
        "url":      url,
    }
}

func main() {
    doc, err := goquery.NewDocument(NytUrl)
    if err != nil {
        log.Fatal(err)
    }

    articles := getArticles(doc)

    articles.Each(func(i int, s *goquery.Selection) {
        m := extract(s)
        if m != nil {
            fmt.Printf("\n")
            fmt.Printf("%s\n", m["headline"])
            fmt.Printf("%s\n", m["author"])
            fmt.Printf("Url: %s\n", m["url"])
        }
    })
}

Remarks

As in similar posts before, I ask the reader to keep in mind that I am not equally proficient in all four languages. Furthermore, the size and elegance of each solution is strongly influenced by the choice of framework. For every language I tried to pick either the most popular framework or the one I have the most experience with.

The number of lines, not counting comments, blank lines or import statements, is fairly similar in all four cases. However, there are a couple of differences, some subtle, some very apparent.

Notice how the Haskell solution is the only one that doesn’t require filtering the articles for empty values. This is handled silently by the monad implementation of the scraper, with Maybe behind the scenes: if any of the three extractions in extract fails for an article, the scraper for that article fails as a whole and it simply doesn’t appear in the result.
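To make that mechanism more tangible, here is a rough sketch of the same idea in plain Maybe terms, reusing the ArticleInfo type from above (the function names and the tuple input are made up for illustration; this is not Scalpel’s actual implementation):

import Data.Maybe (mapMaybe)

-- Hypothetical per-article extraction where each field may be missing.
extractMaybe :: (Maybe String, Maybe String, Maybe String) -> Maybe ArticleInfo
extractMaybe (mHeadline, mAuthor, mUrl) = do
  headline <- mHeadline   -- a Nothing here aborts the whole do-block
  author   <- mAuthor
  url      <- mUrl
  return (ArticleInfo headline author url)

-- Articles with a missing field are dropped without an explicit filter step.
collectArticles :: [(Maybe String, Maybe String, Maybe String)] -> [ArticleInfo]
collectArticles = mapMaybe extractMaybe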

Of all four solutions the Python one is the least satisfying to me. This is largely because BeautifulSoup does not allow nesting selection criteria as easily as the other libraries, which leads to a lot of manual error-checking boilerplate (see all the ifs in extract). Even worse is the list-flattening list comprehension in the return statement of get_articles, which will scale horribly with more complicated scraping demands. Python also has a jQuery-like library, and using it would probably have led to code similar to the Go version, which uses goquery, but I wanted to use BeautifulSoup because of its popularity within the Python community.
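For the curious, here is a rough, untested sketch of what the same selection might look like with pyquery, one such jQuery-like library (assuming its jQuery-style API; the chain mirrors the goquery calls from the Go version and is not one of the benchmarked solutions):

# Hypothetical pyquery sketch, for comparison only.
from pyquery import PyQuery as pq

doc = pq(url="https://www.nytimes.com")
articles = (doc("div.collection")
            .not_(".headlines")
            .find("article.story.theme-summary")
            .not_(".banner"))

for article in articles.items():
    headline = article.find("h2.story-heading").text()
    author = article.find("p.byline").text()
    url = article.find("h2 > a").attr("href")
    if headline and author and url:
        print()
        print(headline)
        print(author)
        print("Url: {}".format(url))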