On Modern Documents

Modern documents should be written in HTML. Matthew Butterick put it best:

“PDF is fundamentally a digital simulation of paper. So it’s great for making paper documents available in the digital realm. But for natively digital documents…it removes functionality and imposes design constraints.”

This applies not only to PDF, but to other formats: PostScript, .docx, .odf. The need for first-class HTML output disqualifies many traditional document processing systems from consideration. To fill this gap, I developed two tools: modoc and lark.

Static site generators

Static site generators are meant to produce HTML from human-readable source files. This seems like the perfect solution, but there is an absurd number of static site generators. This may be because they bundle several disparate tools together, so developers feel the need to constantly reinvent the wheel to suit their own preferences. The result is a large body of deficient programs that are difficult to customize.

The best solution is to use a dedicated build automation tool. Such tools were designed to take source files and compile them efficiently. The only difference here is that the input is markup files instead of program source code and the output is HTML instead of binaries. This method decouples the components of a static site generator, and while one loses some convenience, there are major gains in flexibility. Many changes require only a minor edit to the build script. No plugins necessary.
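To make the idea concrete, here is a minimal sketch of the one job a build automation tool does for a document: rebuild an output file only when it is missing or older than one of its inputs. This is an illustration in Python, not the output of modoc or the behavior of any particular tool; the names is_stale, build, and render_page and the {{body}} placeholder are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Sequence


def is_stale(target: Path, sources: Sequence[Path]) -> bool:
    """Return True if the target is missing or older than any source."""
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(src.stat().st_mtime > target_mtime for src in sources)


def build(target: Path, sources: Sequence[Path],
          render: Callable[[Sequence[Path]], str]) -> bool:
    """Regenerate the target only when it is out of date.

    Returns True if the target was rebuilt and False if it was skipped.
    """
    if not is_stale(target, sources):
        return False
    target.write_text(render(sources), encoding="utf-8")
    return True


def render_page(sources: Sequence[Path]) -> str:
    """Hypothetical renderer: paste the source text into a template.

    Assumes exactly two sources, the content file and the template file,
    and a made-up {{body}} placeholder inside the template.
    """
    content_path, template_path = sources
    body = content_path.read_text(encoding="utf-8")
    template = template_path.read_text(encoding="utf-8")
    return template.replace("{{body}}", body)
```

Listing the template among the sources is what makes a template edit trigger a rebuild of every page that uses it. A real build tool adds rule patterns and a dependency graph on top of this timestamp comparison, which is why reusing one is preferable to reimplementing it.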

The most difficult part is getting started. Nobody writes Makefiles from scratch.

modoc generates boilerplate build automation scripts designed for compiling HTML. It supports a number of different build automation tools, markup parsers, and templating languages. The generated file is only meant to be a starting point. The modoc reference provides an array of sample configurations.

Markup parsing

Selecting a markup parser can be difficult, as markup languages trade power against readability. HTML is powerful and expressive, but quite difficult to read and write. Markdown is an extremely readable format, but it is constrained by the language features its parser provides.

The ideal markup language is parsimonious. It only provides the syntax elements you need and nothing more. The particular set of syntax elements is, of course, highly dependent on the author.

Some tools like pandoc offer filters, with which one can modify the AST (abstract syntax tree) before it is converted to HTML. This is useful, but not quite flexible enough. One can transform elements into others, but cannot introduce entirely new syntax.
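For illustration, here is a rough sketch of such a filter, written against pandoc's JSON representation of the AST using only the Python standard library. It turns every emphasis element into a strong element. The function names are my own, and the walker is deliberately simplistic; it assumes that every object with a "t" key under "blocks" is an AST element.

```python
import json
import sys


def walk(node, transform):
    """Rebuild the tree bottom-up, applying transform to each element.

    In pandoc's JSON, an element is an object with a "t" (type) key and
    usually a "c" (contents) key.
    """
    if isinstance(node, list):
        return [walk(item, transform) for item in node]
    if isinstance(node, dict):
        node = {key: walk(value, transform) for key, value in node.items()}
        return transform(node) if "t" in node else node
    return node


def emph_to_strong(element):
    """Replace an Emph element with a Strong element, keeping its contents."""
    if element["t"] == "Emph":
        return {"t": "Strong", "c": element["c"]}
    return element


def main() -> None:
    # pandoc pipes the document in as JSON; do nothing when run interactively.
    raw = "" if sys.stdin.isatty() else sys.stdin.read()
    if not raw.strip():
        return
    document = json.loads(raw)
    document["blocks"] = walk(document["blocks"], emph_to_strong)
    json.dump(document, sys.stdout)


if __name__ == "__main__":
    main()
```

Saved as an executable script (the file name emph_to_strong.py is hypothetical), it could be run with pandoc --filter ./emph_to_strong.py. The limitation is visible in the code: the filter only ever sees elements the reader has already parsed, so it can rewrite the tree but cannot teach the parser a new construct.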

lark enables one to easily construct and modify humane markup language parsers. It returns LPeg patterns, meaning it can easily be extended to include more sophisticated functionality. A module called lark-md implements a significant number of Markdown language features.


The impetus for this document philosophy was twofold: frustration at static site generators for reinventing dependency tracking when build automation tools already exist for this purpose, and a desire to write human-readable source files that implement more features than standard Markdown, specifically for writing mathematics.

I developed these tools to resolve my personal gripes with static site generators and markup parsers. modoc and lark can be used separately, but in tandem I hope they form a powerful and extensible document system that others can use too.


Others before me have used build automation tools to generate HTML pages, as in the m4-bakery and Wakefile projects. My contribution in modoc was to extend the idea to other build automation tools, provide an easy web interface for customizing boilerplates, and write adequate documentation demonstrating sample configurations.

The idea behind lark is more novel. I experimented with several other systems including a general purpose macro processor, but ultimately decided to use a full parsing library which enables one to write more powerful markup languages.

The target markup language I had in mind was Markdown, and it turned out that most of its syntax can be neatly categorized into block and inline elements. lark takes advantage of this simple classification and enables one to replicate most of the features of Markdown with very little code.
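The block and inline split can be sketched in a few lines. The following is not lark, which is built on LPeg, and it does not reflect lark's API; it is a hypothetical regex-based toy in Python, with made-up names, meant only to show how a markup language reduces to a list of block rules and a list of inline rules.

```python
import html
import re

# Inline rules: (pattern, replacement) pairs applied in order to escaped text.
# Known limitation of this toy: later rules also fire inside code spans.
INLINE_RULES = [
    (re.compile(r"`([^`]+)`"), r"<code>\1</code>"),
    (re.compile(r"\*\*([^*]+)\*\*"), r"<strong>\1</strong>"),
    (re.compile(r"\*([^*]+)\*"), r"<em>\1</em>"),
]


def render_inline(text: str) -> str:
    """Escape HTML special characters, then apply each inline rule."""
    out = html.escape(text, quote=False)
    for pattern, replacement in INLINE_RULES:
        out = pattern.sub(replacement, out)
    return out


def render_heading(block: str):
    """Block rule: a single line starting with one to six '#' characters."""
    match = re.fullmatch(r"(#{1,6}) +(.*)", block)
    if match is None:
        return None
    level = len(match.group(1))
    return f"<h{level}>{render_inline(match.group(2))}</h{level}>"


def render_paragraph(block: str) -> str:
    """Fallback block rule: anything else becomes a paragraph."""
    return f"<p>{render_inline(' '.join(block.splitlines()))}</p>"


# Block rules are tried in order; the first one that matches wins.
BLOCK_RULES = [render_heading, render_paragraph]


def render(source: str) -> str:
    """Split the source on blank lines and render each block."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", source) if b.strip()]
    rendered = []
    for block in blocks:
        for rule in BLOCK_RULES:
            result = rule(block)
            if result is not None:
                rendered.append(result)
                break
    return "\n".join(rendered)
```

In a design like this, new syntax is one more entry in a rule list. An inline rule for mathematics delimited by dollar signs, for instance, would be a single additional pattern and replacement pair.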