Switching from Pelican to Hakyll

The title says it all: I changed the static site generator from Pelican to Hakyll.

Why?

To be honest, I was quite satisfied with what Pelican offered me. I think it comes with very nice default settings and has nice default themes. The documentation is ok (could be better though) and it is written in Python, a language I like and am quite comfortable with.

So why change? Well, because I am riding on the big Haskell wave. Over the years I was sucked up more and more by Haskell and the ecosystem that has grown around it. I like the look and feel of the language and the tools that the community generates. Many websites that I know and regularily visit are “proudly generated by Hakyll” and I thought “Why not mine?”. The idea was flying in my head for a couple of month until recently I decided to finally make the transition.

Goals

Here is the statement of what I wanted to achieve:

I wanted to switch my static site generator engine to Hakyll and, after I
regenerate all HTML content, end up with (almost) the identical website.

In addition I wanted to do the following:

  1. Change the plain text markup from restructured text to common markdown
  2. Change the format of the filename of every post to yyyy-mm-dd-text.markdown
  3. Use yaml style metadate in all my posts, see here for more info (look for “Extension: yaml_metadata_block”

Problems

These are the problems I faced:

  1. My folder structure was not ideal
  2. The existing Jinja based templates are not usable in Hakyll
  3. All my blog posts have a different metadata format
  4. Converting from .rst to .markdown was not 100% perfect when done with pandoc
  5. I used plugins provided by Pelican that do not exist in this form for Hakyll

The following sections will detail how I solved the above mentioned problems.

New folder structure

Hakyll prefers a different folder structure than Pelican. Luckily Hakyll comes with a nice little tool called ‘hakyll-init’, which will setup a basic dummy folder with all important files and folders already there, a deafult deploy script and a few dummy posts.

Now with the bare bones repository ready we can start copying everything we need. All the Pelican related files (publish.py, etc.) can be omitted, because we obviously don’t need them anymore. Basically all we want is content, i.e. images, code snippets, most of the .css files and the blog posts of course.

So here is what I did:

  1. Run hakyll-init
  2. Copy all .css files in the themes folder to the css folder
  3. Make a new folder code and copy al our code samples there
  4. Copy all images to the images folder
  5. Delete the dummy posts in the posts folder and copy all the existing .rst files there

Create new templates

Luckily this was a very mechanic work. It basically meant writing the same templates in another syntax.

Example

<!DOCTYPE html>
<html lang="{{ DEFAULT_LANG }}">
<head>
        <meta charset="utf-8" />
        <title>{% block title %}{{ SITENAME }}{%endblock%}</title>
        <link rel="stylesheet" href="{{ SITEURL }}/{{ THEME_STATIC_DIR }}/css/{{ CSS_FILE }}" />

        <!--[if IE]>
            <script src="https://html5shiv.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
</head>

<body id="index" class="home">
{% include 'github.html' %}
        <header id="banner" class="body">
                {% block site_header %}
                <h1><a href="{{ SITEURL }}/">{{ SITENAME }} {% if SITESUBTITLE %} <strong>{{ SITESUBTITLE }}</strong>{% endif %}</a></h1>
                {% endblock %}
                <nav><ul>
                {% for title, link in MENUITEMS %}
                    <li><a href="{{ SITEURL }}/{{ link }}">{{ title }}</a></li>
                {% endfor %}
                {% if DISPLAY_PAGES_ON_MENU -%}
                {% for pg in pages %}
                    <li{% if pg == page %} class="active"{% endif %}><a href="{{ SITEURL }}/{{ pg.url }}">{{ pg.title }}</a></li>
                {% endfor %}
                {% endif %}
                {% if DISPLAY_CATEGORIES_ON_MENU -%}
                {% for cat, null in categories %}
                    <li{% if cat == category %} class="active"{% endif %}><a href="{{ SITEURL }}/{{ cat.url }}">{{ cat }}</a></li>
                {% endfor %}
                {% endif %}
                </ul></nav>
        </header><!-- /#banner -->
        {% block content %}
        {% endblock %}
        <section id="extras" class="body">

        </section><!-- /#extras -->

        <footer id="contentinfo" class="body">
                <address id="about" class="vcard body">
                &copy; 2017-2019 Tobias Pleyer
                </address><!-- /#about -->

                <p><i>This blog has been generated with <a href="http://getpelican.com/">Pelican</a></i></p>
        </footer><!-- /#contentinfo -->

{% include 'analytics.html' %}
{% include 'disqus_script.html' %}
</body>
</html>

became

<!doctype html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <meta http-equiv="x-ua-compatible" content="ie=edge">
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <title>$blog_title$</title>
        <link rel="stylesheet" href="/css/reset.css" />
        <link rel="stylesheet" href="/css/typogrify.css" />
        <link rel="stylesheet" href="/css/syntax.css" />
        <link rel="stylesheet" href="/css/main.css" />
    </head>
    <body>
        <header id="banner" class="body">
            <div class="logo">
                <h1>$blog_name$</h1>
            </div>
            <nav>
                <ul>
                    <li><a href="../">Home</a></li>
                    <li><a href="../contact.html">Contact</a></li>
                    <li><a href="../links.html">Links</a></li>
                    <li><a href="../archive.html">Archive</a></li>
                    <li><a href="../tags.html">Tags</a></li>
                </ul>
            </nav>
        </header>

        <main role="main">
            <section id="content" class="body">
                $body$
            </section>
        </main>

        <footer id="contentinfo" class="body">
            <address id="about" class="vcard body">
                &copy; 2017-2019 Tobias Pleyer
            </address>
            <p>
                <i>
                    Site proudly generated by
                    <a href="http://jaspervdj.be/hakyll">Hakyll</a>
                </i>
            </p>
        </footer>
    </body>
</html>

As you can see the template got shorter and more readable because I hard-coded content that was configurable in Pelican before and made some shared code explicit. I think that is totally acceptable because things like .css file names or items in a navigation list do not change that often and if they do it is not a big deal to modify the template.

As a matter of fact I managed to generate the website as it was before, but with quite a few templates less. I went down from 20 to 6!

Different metadata format

The metadata format really posed a problem. In the original .rst files the metadata directly followed the main title of the post:

Title
=====

:key1: value1
:key2: value2
:key3: value3

rest of file

which unfortunately was not recognized by Hakyll. Converting to markdown via

pandoc -f rst -t markdown post.rst -o post.markdown

led to something like this

Title
=====

key1

: value1

key2

: value2

key3

: value3

rest of file

which does not only look wrong, it also didn’t work. I decided to parse and transform the metadata by hand and do it during the conversion process. The next section about the file format conversion will detail this.

Converting from restructured text to markdown

Most of my effort went into converting my blog post files. Pelican uses restructured text as the text file markup syntax. By itself that does not pose a problem to Hakyll, because under the hood it uses Pandoc, which is capable to handle restructured text and many other formats. The bigger problem I had was that my current metadata system that worked for Pelican did not combine well with Hakyll.

The solution was to write a small script that opens all files in the posts directory and does the following for every file:

  1. Read the file content into memory
  2. Locate the metadata and cut it out of the file
  3. Parse the metadata and transform it to YAML format
  4. Convert the rest of the file via Pandoc
  5. Paste the metadata back in at the beginning
  6. Write the in-memory content back to the file with the new naming convention

Here is the full script:

#!/usr/bin/env stack
{- stack
   script
   --resolver lts-13.8
   --package attoparsec
   --package base
   --package directory
   --package filepath
   --package pandoc
   --package text
-}
{-# LANGUAGE OverloadedStrings #-}

import           Control.Monad (forM_)
import qualified Data.Attoparsec.Text as P
import           Data.Char (isAlpha, isSpace)
import           Data.Either (fromRight, isLeft, isRight)
import           Data.List (intersperse)
import           Data.Maybe (fromJust)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import           System.Directory (listDirectory)
import           System.Environment (getArgs)
import           System.FilePath ((<.>), (</>))
import           Text.Pandoc


metadataParser :: P.Parser ( T.Text
                           -- ^ the key
                           , T.Text
                           -- ^ the value
                           )
metadataParser = do
  P.char ':'
  key <- P.takeWhile isAlpha
  P.char ':'
  value <- P.takeText
  return (key,T.strip value)

notEmpty :: T.Text -> Bool
notEmpty = T.any (not . isSpace)

getFirstNonEmpty :: [T.Text] -> Maybe T.Text
getFirstNonEmpty [] = Nothing
getFirstNonEmpty (t:ts)
  | notEmpty t = Just t
  | otherwise  = getFirstNonEmpty ts

sanitizeTitle :: T.Text -> String
sanitizeTitle = T.unpack
              . T.replace " " "-"
              . T.replace "'" ""
              . T.replace "\\" ""
              . T.replace "\"" ""
              . T.replace "." "-"
              . T.replace ":" "-"
              . T.strip

quote :: T.Text -> T.Text
quote t = "\"" <> t <> "\""

joinToLastEntry :: [T.Text] -> [T.Text] -> [T.Text]
joinToLastEntry [] ts' = [T.concat (intersperse " " (map T.strip ts'))]
joinToLastEntry ts ts' = init ts ++ [T.concat (intersperse " " (last ts : map T.strip ts'))]

replaceAssoc :: Eq a => a -> (b -> b) -> [(a,b)] -> [(a,b)]
replaceAssoc k f = go k f []
  where
    go k f seen [] = reverse seen
    go k f seen ((k',v):kvs)
      | k == k'   = go k f ((k',f v):seen) kvs
      | otherwise = go k f ((k',v):seen) kvs

main = do
  -- print $ P.parseOnly metadataParser ":tags: haskell, clojure"
  postDir <- head <$> getArgs
  posts <- listDirectory postDir
  forM_ posts $ \post -> do
    let postPath = postDir </> post
    putStrLn $ "Converting " ++ postPath
    content <- TIO.readFile postPath
    let
      ls = T.lines content
      (before,rest) = span (isLeft . P.parseOnly metadataParser) ls
      (metadataCandidate,afterCandidate) = span (isRight . P.parseOnly metadataParser) rest
      (nonEmptyFollowUpLines, after) = span notEmpty afterCandidate
      metadata = joinToLastEntry metadataCandidate nonEmptyFollowUpLines
      title = quote $ fromJust $ getFirstNonEmpty before
      kvs :: [(T.Text,T.Text)]
      kvs = map (fromRight ("","") . P.parseOnly metadataParser) metadata
      kvsWithTitle = ("title"," " <> title) : kvs
      kvsWithTitle' = replaceAssoc "summary" (quote . T.replace "\"" "\\\"") kvsWithTitle
      date = fromJust $ lookup "date" kvsWithTitle'
      mdPath = postDir </> sanitizeTitle (date <> "-" <> title) <.> "markdown"
      yamlMeta = T.unlines $ ["---"] ++ map (\(k,v) -> k <> ": " <> v) kvsWithTitle' ++ ["---",""]
    mdContent <- runIOorExplode $
                   readRST def{readerExtensions = pandocExtensions} (T.unlines (before ++ after))
                   >>= writeMarkdown def{writerExtensions = pandocExtensions}
    TIO.writeFile mdPath $ yamlMeta <> mdContent

The script is not amazingly pretty, but if it does the right thing it is only needed once and then never again, so no point of over engineering it. I want to point out how nice it is that Pandoc is written in Haskell. Instead of running an impure external process we can run native and pure Haskell code to convert the files from .rst to .markdown.

Thanks to Haskell’s superb parser combinator libraries, finding and extracting the metadata block was very easy. There was only one corner case that needed special attention: overly long summaries. Since I usually try to respect the 80 characters line width convention I wrote long summaries on more than one line. These multiple line summaries were not covered by the line based parser I wrote, so the simplest solution I found was to consider every non-empty line following the metadata block as part of the summary. This expectation does not have to be right in general, but I knew that I followed this convention in all of my blog posts, thus this worked for me.

By randomly inspecting some of the converted files I noticed some other minor conversion mistakes. Since the number of occurrences was small enough, I simply grep’ed the patterns and edited them by hand.

The conversion script also handles extracting the date and saving the file with the new filename convention is settled for. But this caused some more problems that Pandoc could not solve obviously: internal references (links) were wrong. The references were actually wrong for more than one reason:

  • The filenames changed
  • Pelican used some custom syntax that is not recognized by Pandoc or Hakyll
  • The folder hierarchy and thus the paths of the files changed

Luckily bulletin point #2 helps in this situation. In my Pelican posts an internal file reference had the following format:

...some text

`reference_text <{filename}/filename.rst>`_

more text...

which gives a very nice search pattern to find all occurrences:

grep -Hrn '<{filename}'

After the occurrences had been found I could simply rewrite them like this

[reference_text](./filename.html)

As you can see they simply became relative HTML links, which works because I know that all my posts are within the same directory on my webserver.

Retaining plugin functionality

Pelican has a pretty good plugin system and there are plenty of them available. The amount of plugins and their differences are too plenty to give a general catch-all solution how to retain their functionality in Hakyll.

In my case I really justed needed one thing: external code snippet inclusion. In Pelican this is supported by the code-include plugin and I really came to love this functionality because

  • It keeps your blog posts tidier and more concise
  • Saves you from constant copy/paste when you test and modify the code
  • Saves you from remembering which post contains what code

In Pelican the code insertion was achieved like so:

...some text

.. code-include:: code/path/to/file.hs
    :lexer: haskell

more text...

which Pandoc translates to native divs via fenced divs:

...some text

::: {.code-include lexer="haskell" file="code/path/to/file.hs"}
:::
more text...

Important: The native_divs extension was added to Pandoc starting with major version 2. If your Pandoc version is older than that you have to upgrade.

The fact that the code-include directives are translated to native divs that can be read and interpreted by Pandoc gave me the solution to my problem: walk Pandoc’s abstract syntax tree (AST) and replace the div blocks with code blocks. I will devote a separate blog post to this topic.

Deploy Script

For completeness here is the full script that runs my Hakyll based blog:

--------------------------------------------------------------------------------
{-# LANGUAGE OverloadedStrings #-}
import           Control.Arrow ((&&&))
import qualified Data.Map.Strict as M
import           Data.Maybe (maybeToList)
import           Data.Monoid (mappend)
import           System.FilePath
import           Text.Pandoc
import           Text.Pandoc.Definition
import           Text.Pandoc.Walk
import           Hakyll


--------------------------------------------------------------------------------
main :: IO ()
main = hakyll $ do
    match "images/**" $ do
        route   idRoute
        compile copyFileCompiler

    match "css/**" $ do
        route   idRoute
        compile compressCssCompiler

    match (fromList ["links.markdown", "contact.markdown"]) $ do
        route   $ setExtension "html"
        compile $ pandocCompiler
            >>= loadAndApplyTemplate "templates/default.html" blogCtx
            >>= relativizeUrls

    tags <- buildTags "posts/*" (fromCapture "tags/*.html")

    tagsRules tags $ \tag pattern -> do
        let title = "Posts tagged \"" ++ tag ++ "\""
        route idRoute
        compile $ do
            posts <- recentFirst =<< loadAll pattern
            let tagCtx =
                    constField "title" title `mappend`
                    listField "posts" postCtx (return posts) `mappend`
                    blogCtx

            makeItem ""
                >>= loadAndApplyTemplate "templates/tags.html" tagCtx
                >>= loadAndApplyTemplate "templates/default.html" tagCtx
                >>= relativizeUrls

    match "posts/*" $ do
        route $ setExtension "html"
        compile $ do
            snippets <- toSnippetsMap <$> loadAll ("code/**" .||. "site.hs")
            pandocCompilerWithCodeInsertion snippets
              >>= loadAndApplyTemplate "templates/post.html" (postCtxWithTags tags)
              >>= loadAndApplyTemplate "templates/default.html" (postCtxWithTags tags)
              >>= relativizeUrls

    create ["archive.html"] $ do
        route idRoute
        compile $ do
            posts <- recentFirst =<< loadAll "posts/*"
            let archiveCtx =
                    constField "title" "Archives" `mappend`
                    listField "posts" postCtx (return posts) `mappend`
                    blogCtx

            makeItem ""
                >>= loadAndApplyTemplate "templates/archive.html" archiveCtx
                >>= loadAndApplyTemplate "templates/default.html" archiveCtx
                >>= relativizeUrls

    create ["tags.html"] $ do
        route idRoute
        compile $ do
            let tagsCtx =
                    constField "title" "Tags collection" `mappend`
                    listField "tags" postCtx (traverse (makeItem . fst) (tagsMap tags)) `mappend`
                    blogCtx

            makeItem ""
                >>= loadAndApplyTemplate "templates/tag-list.html" tagsCtx
                >>= loadAndApplyTemplate "templates/default.html" tagsCtx
                >>= relativizeUrls

    match "index.html" $ do
        route idRoute
        compile $ do
            posts <- recentFirst =<< loadAll "posts/*"
            let indexCtx =
                    listField "posts" postCtx (return (take 3 posts)) `mappend`
                    constField "title" "Home" `mappend`
                    blogCtx

            getResourceBody
                >>= applyAsTemplate indexCtx
                >>= loadAndApplyTemplate "templates/default.html" indexCtx
                >>= relativizeUrls

    match "code/**" $ do
        route idRoute
        compile getResourceString

    match "site.hs" $ do
        compile getResourceString

    match "templates/*" $ compile templateBodyCompiler


--------------------------------------------------------------------------------
blogCtx :: Context String
blogCtx =
    constField "blog_title" "Tobi's blog" `mappend`
    constField "blog_name" "My blog about programming and other stuff" `mappend`
    defaultContext

postCtx :: Context String
postCtx =
    dateField "date" "%B %e, %Y" `mappend`
    blogCtx

postCtxWithTags :: Tags -> Context String
postCtxWithTags tags = tagsField "tags" tags `mappend` postCtx

toSnippetsMap :: [Item String] -> M.Map FilePath String
toSnippetsMap is = M.fromList kvs
  where kvs = map ((toFilePath . itemIdentifier) &&& itemBody) is

pandocCompilerWithCodeInsertion :: M.Map FilePath String -> Compiler (Item String)
pandocCompilerWithCodeInsertion snippets =
  pandocCompilerWithTransform defaultHakyllReaderOptions defaultHakyllWriterOptions (codeInclude snippets)

codeInclude :: M.Map FilePath String -> Pandoc -> Pandoc
codeInclude snippets = walk $ \block -> case block of
  div@(Div (ident,cs,kvs) bs) -> if "code-include" `elem` cs
                                 then codeBlockFromDiv snippets div
                                 else block
  _ -> block

fromPara (Para is) = is
fromStr (Str s) = s

codeBlockFromDiv snippets div@(Div (ident,cs,kvs) bs) =
  let mLexer = lookup "lexer" kvs
      css = maybeToList mLexer
      path = (fromStr . head . fromPara . head) bs
      content = M.lookup path snippets
  in maybe Null (CodeBlock ("",css,[])) content
codeBlockFromDiv _ _ = Null

Summary

In the end I am very pleased with the end result. The blog hasn’t changed in appearance. Better so: I provide event more information with less lines of code.

Specifically my blog now has a page with all tags that exist on my blog. Clicking on the link of one of those tags opens a nice to read, chronologically ordered list of posts that were tagged with this tag.

One of the big differences between Pelican and Hakyll is the following:

Pelican is configured, while Hakyll is scripted

What’s the difference? In Pelican most of the functionality is controlled by variables that alter the behavior, but to really add something knew you have to change the underlying library or write a plugin.

Hakyll on the other hand comes with its own domain specific language (DSL), which is basically normal monadic Haskell code. The library API exposes all the functions and data types necessary to build the deployment script. As a result, once you know how it all works, you have much more fine grained control. You don’t ask “How does it work?”, because you wrote the code yourself, not just a configuration!

I think this is also the reason that Hakyll does not need plugins as they are needed in Pelican. Plugins usually provide an entry point into the library, but in Hakyll we already have full access (apart from internal stuff), so we don’t need it.