Comparing Haskell’s Attoparsec with Python’s pyparsing

Writing parsers for custom data is a fairly common task in a programmer’s everyday life, so it is nice to know the alternatives. In this post I want to compare two parser libraries written in different languages.

On the one side: Haskell with attoparsec. On the other side: Python with pyparsing. Both languages have more than one parser library in their package repositories, but I think these two are nice to compare, because they are similar in spirit.

Input

The input is a simple log file format of an imaginary process logging data. One line contains exactly one log entry and all entries follow the same pattern. I took this example from a nice post on schoolofhaskell.

2013-06-29 11:16:23 124.67.34.60 keyboard
2013-06-29 11:32:12 212.141.23.67 mouse
2013-06-29 11:33:08 212.141.23.67 monitor
2013-06-29 12:12:34 125.80.32.31 speakers
2013-06-29 12:51:50 101.40.50.62 keyboard
2013-06-29 13:10:45 103.29.60.13 mouse
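
To make the line layout concrete, here is a minimal sketch in plain Python that splits one sample line into its three fields. It only uses the standard library and is meant as an illustration of the format, not as one of the two parsers compared in this post. The names SAMPLE_LINE and split_entry are made up for this example.

```python
from datetime import datetime

# One line copied from the sample input above.
SAMPLE_LINE = "2013-06-29 11:16:23 124.67.34.60 keyboard"


def split_entry(line):
    """Split one log line into (timestamp, IP octets, product name)."""
    # The four whitespace-separated columns: date, time, IP, product.
    date_part, time_part, ip_part, product = line.split()
    timestamp = datetime.strptime(date_part + " " + time_part,
                                  "%Y-%m-%d %H:%M:%S")
    octets = tuple(int(octet) for octet in ip_part.split("."))
    return timestamp, octets, product


print(split_entry(SAMPLE_LINE))
```

Both parsers below extract the same three pieces of information (a timestamp, an IP address and a product), only with proper types and error handling.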

Haskell

The Haskell implementation is also taken from schoolofhaskell, but I modified the code slightly:

  • I made the file a stack script
  • Data.Attoparsec.Char8 is deprecated, so I replaced it with Data.Attoparsec.ByteString.Char8
  • I removed all comments
  • I altered the main function a bit: the parser result is now printed more readably, with one parsed log entry per line
#!/usr/bin/env stack
{- stack
  script
  --resolver lts-9.12
  --package attoparsec
  --package bytestring
  --package time
-}

{-# LANGUAGE OverloadedStrings #-}

import Data.Word
import Data.Time
import Data.Attoparsec.ByteString.Char8
import Control.Applicative
import Control.Monad
import qualified Data.ByteString as B


logFile :: FilePath
logFile = "yesterday.log"


data IP = IP Word8 Word8 Word8 Word8 deriving Show

data Product = Mouse | Keyboard | Monitor | Speakers deriving Show

data LogEntry = LogEntry
    { entryTime    :: LocalTime
    , entryIP      :: IP
    , entryProduct :: Product
    } deriving Show

type Log = [LogEntry]


parseIP :: Parser IP
parseIP = do
  d1 <- decimal
  char '.'
  d2 <- decimal
  char '.'
  d3 <- decimal
  char '.'
  d4 <- decimal
  return $ IP d1 d2 d3 d4


timeParser :: Parser LocalTime
timeParser = do
  y  <- count 4 digit
  char '-'
  mm <- count 2 digit
  char '-'
  d  <- count 2 digit
  char ' '
  h  <- count 2 digit
  char ':'
  m  <- count 2 digit
  char ':'
  s  <- count 2 digit
  return $
    LocalTime { localDay = fromGregorian (read y) (read mm) (read d)
              , localTimeOfDay = TimeOfDay (read h) (read m) (read s)
                }


productParser :: Parser Product
productParser =
     (string "mouse"    >> return Mouse)
 <|> (string "keyboard" >> return Keyboard)
 <|> (string "monitor"  >> return Monitor)
 <|> (string "speakers" >> return Speakers)


logEntryParser :: Parser LogEntry
logEntryParser = do
  t <- timeParser
  char ' '
  ip <- parseIP
  char ' '
  p <- productParser
  return $ LogEntry t ip p

logParser :: Parser Log
logParser = many $ logEntryParser <* endOfLine


main :: IO ()
main = do
  log_content <- B.readFile logFile
  let parse_result = parseOnly logParser log_content
  case parse_result of
    Left err -> print err
    Right logs -> forM_ logs print
$ ./log_parser.hs
LogEntry {entryTime = 2013-06-29 11:16:23, entryIP = IP 124 67 34 60, entryProduct = Keyboard}
LogEntry {entryTime = 2013-06-29 11:32:12, entryIP = IP 212 141 23 67, entryProduct = Mouse}
LogEntry {entryTime = 2013-06-29 11:33:08, entryIP = IP 212 141 23 67, entryProduct = Monitor}
LogEntry {entryTime = 2013-06-29 12:12:34, entryIP = IP 125 80 32 31, entryProduct = Speakers}
LogEntry {entryTime = 2013-06-29 12:51:50, entryIP = IP 101 40 50 62, entryProduct = Keyboard}
LogEntry {entryTime = 2013-06-29 13:10:45, entryIP = IP 103 29 60 13, entryProduct = Mouse}

Python

from datetime import datetime
from pyparsing import *


filename = "yesterday.log"
def literal_(s): return Suppress(Literal(s))


class IP:
    def __init__(self, ip1, ip2, ip3, ip4):
        self.ip = [ip1, ip2, ip3, ip4]

    def __str__(self):
        return ".".join(self.ip)

    def __repr__(self):
        return "IP(" + str(self) + ")"


class Product:
    P_NONE = 0
    P_MOUSE = 1
    P_KEYBOARD = 2
    P_MONITOR = 3
    P_SPEAKERS = 4

    def __init__(self, product):
        self.product = product
        if product == "mouse":
            self.snr = self.P_MOUSE
        elif product == "keyboard":
            self.snr = self.P_KEYBOARD
        elif product == "monitor":
            self.snr = self.P_MONITOR
        elif product == "speakers":
            self.snr = self.P_SPEAKERS
        else:
            self.snr = self.P_NONE

    def __str__(self):
        return self.product

    def __repr__(self):
        return "Product(" + str(self) + ")"


def toIP(parseResult):
    ip = parseResult.asDict()["IP"]
    return IP(*ip)


def toProduct(parseResult):
    product = parseResult.asDict()["Product"]
    return Product(product)


def toDateTime(parseResult):
    dt = map(int, parseResult.asDict()["DateTime"])
    return datetime(*dt)


parseIP = ( Word(nums)
          + literal_(".")
          + Word(nums)
          + literal_(".")
          + Word(nums)
          + literal_(".")
          + Word(nums)
          ).setResultsName("IP").setParseAction(toIP)


parseTime = ( Word(nums, exact=4)
            + literal_("-")
            + Word(nums, exact=2)
            + literal_("-")
            + Word(nums, exact=2)
            + Word(nums, exact=2)
            + literal_(":")
            + Word(nums, exact=2)
            + literal_(":")
            + Word(nums, exact=2)
            ).setResultsName("DateTime").setParseAction(toDateTime)


parseProduct = ( Literal("mouse")
               | Literal("keyboard")
               | Literal("monitor")
               | Literal("speakers")
               ).setResultsName("Product").setParseAction(toProduct)


parseLogEntry = parseTime + parseIP + parseProduct


parseLog = ZeroOrMore(Group(parseLogEntry))


if __name__ == '__main__':
    parse_results = parseLog.parseFile(filename)
    for parse_result in parse_results:
        print(parse_result)
$ python3 log_parser.py
[datetime.datetime(2013, 6, 29, 11, 16, 23), IP(124.67.34.60), Product(keyboard)]
[datetime.datetime(2013, 6, 29, 11, 32, 12), IP(212.141.23.67), Product(mouse)]
[datetime.datetime(2013, 6, 29, 11, 33, 8), IP(212.141.23.67), Product(monitor)]
[datetime.datetime(2013, 6, 29, 12, 12, 34), IP(125.80.32.31), Product(speakers)]
[datetime.datetime(2013, 6, 29, 12, 51, 50), IP(101.40.50.62), Product(keyboard)]
[datetime.datetime(2013, 6, 29, 13, 10, 45), IP(103.29.60.13), Product(mouse)]

Note: The parser parseIP could be written much shorter with the help of delimitedList:

parseIP = delimitedList(Word(nums), ".").setResultsName("IP")
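
As a small, hedged sketch of how the shorter form behaves on its own, the snippet below parses a single address with delimitedList and converts the octets to integers. The names parse_ip_short and to_octets are made up for this example, and the parse action is an assumption of mine, not part of the parser above. Newer pyparsing releases also offer DelimitedList, but the older name is used here to stay consistent with the rest of the post.

```python
from pyparsing import Word, nums, delimitedList


def to_octets(tokens):
    # Hypothetical parse action: turn the matched digit strings into ints.
    return [int(token) for token in tokens]


# delimitedList matches "expr delim expr delim ..." and drops the delimiters.
parse_ip_short = delimitedList(Word(nums), ".").setParseAction(to_octets)

octets = parse_ip_short.parseString("124.67.34.60").asList()
print(octets)
```

Note that this form accepts any number of dot-separated numbers, whereas the longer version above insists on exactly four.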

Conclusion

I really like both parser libraries. At least in this use case, it is amazing how easy and almost automatic it was to translate the original Haskell parser to Python.

I used enclosing parentheses as a small trick to be able to write the Python parser as close as possible to the Haskell parser in do-notation. I think both parser implementations are comprehensible and easy to grasp.
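
The trick relies on Python's implicit line continuation inside parentheses. Here is a minimal sketch of the idea with plain strings instead of pyparsing objects, since the + operator is laid out the same way. The name timestamp_format is made up for this example.

```python
# Inside parentheses an expression may span several lines, so every
# operator can start its own line, similar to the steps of a do block.
# Without the parentheses, each line would need a trailing backslash.
timestamp_format = ( "%Y"
                   + "-"
                   + "%m"
                   + "-"
                   + "%d"
                   )

print(timestamp_format)  # %Y-%m-%d
```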