Writing a Custom Markup Parser for this Site

Published: December 28, 2021

How it All Happened

Almost exactly a year ago, I moved this site from templated markdown to a static site built with mdoc(7). I ported 5 blog posts in the process and ended up writing 7 more over the course of the year.

While it was a success in teaching me mdoc(7), I found that it slowed me down a bit in authoring blog posts.

Around the same time as I was feeling this slowness, I began actively phlogging on my gopherhole and even started posting gopher-only content. I really enjoyed writing plaintext posts because they are quick to write and, more importantly, highly durable to changes in technology. I can't imagine a world where one cannot open a plain .txt file and read/edit it.

I got a real sense for the importance of optimizing for archival while browsing gopher--it's incredible to be reading textfiles older than me. It made me realize that I want to be sure that my content can survive for decades with minimal effort.

So, I started thinking about how I could move my site's source (pre-HTML) to a more readable plaintext format. Markdown was the obvious choice, but I wanted to stay true to my creative limitation of keeping this site buildable by base OpenBSD, so I had to find another option.

It was about 10 days into solving Advent of Code puzzles that I realized I could redirect some of the puzzling effort at the problem and write my own markup parser. The result, a few weeks later, is nihdoc(1). nihdoc(1) (a play on the fact that markdown is Not Invented Here) provides support for all the basic syntax I'd want in a blog post--nested lists, inline styles and code, code blocks and block quotes, and headers. It was a blast to write, and I learned a lot in the process!

I suspect the CSS for the blog will still change (maybe a dark mode? or something a little less plain), but I tried to keep the resulting HTML pretty bare in support of accessibility and portability--it should read well in screenreaders, embedded in RSS feeds, and more.

If you want to see the source for any post, just replace the .html extension in the URL with .txt! For example, here's this post's source.

Implementation Highlights

If you read this far, I figure you might be interested in some of the implementation details and design decisions.

Stream Based Parsing

Probably the most interesting detail of the parser is that it is stream-based with constant memory usage. In other words, it will start spitting out the input and the HTML markup as soon as it can decisively figure out what state it's in (e.g. has the paragraph ended?). Keeping track of this state is done with a handful of booleans/integers and doesn't involve storing lines in memory. In fact, the current implementation reads the input one character at a time!

This is an efficiency win for large documents (not that my posts are that long), but was also just a fun constraint to try to code within. In practice, I found I was able to get support for almost everything I wanted (nested lists, etc) with maybe the exception of "bottom of the document" links that markdown allows. More on that later.
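
To give a sense of the shape of that loop, here's a minimal sketch in C (my own illustration--nihdoc itself may be structured quite differently) that streams stdin to stdout and tracks just enough state to open and close paragraphs:

  #include <stdio.h>

  /* Illustrative sketch, not nihdoc's actual code: wrap runs of
   * text in <p>...</p>, treating a blank line as a paragraph
   * break. The only state is two small integers--no lines are
   * buffered. */
  int
  main(void)
  {
      int c;
      int in_para = 0;   /* inside an open <p>? */
      int newlines = 0;  /* consecutive newlines seen */

      while ((c = getchar()) != EOF) {
          if (c == '\n') {
              if (++newlines >= 2 && in_para) {
                  printf("</p>\n");
                  in_para = 0;
              } else if (in_para) {
                  putchar('\n');
              }
              continue;
          }
          newlines = 0;
          if (!in_para) {
              printf("<p>");
              in_para = 1;
          }
          putchar(c);
      }
      if (in_para)
          printf("</p>\n");
      return 0;
  }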

Balancing Ease of Implementation with Syntax

One of the most interesting challenges in designing a markup language is settling on a syntax that's both easy-ish to implement (I'm a big believer in simpler = fewer bugs) and syntactically appealing in plaintext format (after all, one of the main motivations was to make the source archive-ready).

The best example of this was deciding how to write links.

I started off with the easiest implementation, which is also the least appealing (IMHO). A link looked like this:

[https://alexkarle.com/blog my blog]

This is super easy to parse one character at a time. In pseudo-code:

  1. If the current character is [, print <a href="
  2. Print all characters (the href) until we see a space/newline
  3. Once we see the space/newline, print ">
  4. Print all characters (the description) up until the ]
  5. Once we see the ], print the closing </a>
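
Those steps map almost one-to-one onto C. As a rough sketch (my illustration, with a made-up helper name, assuming the leading '[' has already been consumed--nihdoc's real routine handles more edge cases):

  #include <stdio.h>

  /* Illustrative sketch, not nihdoc's actual code: emit an <a>
   * tag for "[href description]", one character at a time.
   * Error handling omitted. */
  void
  emit_link(void)
  {
      int c;

      printf("<a href=\"");                     /* 1. saw the [ */
      while ((c = getchar()) != EOF && c != ' ' && c != '\n')
          putchar(c);                           /* 2. the href */
      printf("\">");                            /* 3. close the opening tag */
      while ((c = getchar()) != EOF && c != ']')
          putchar(c);                           /* 4. the description */
      printf("</a>");                           /* 5. saw the ] */
  }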

This fits really nicely into our "parse one character at a time" approach, since each special character in the link corresponds to a direct piece of HTML to output. However, it's ugly to write links that have no description, such as:

[https://alexkarle.com https://alexkarle.com]

To address this, the next evolution added a (stack-allocated) "link buffer" that stores the href as it is printed. If the ']' is hit before a space/newline, the description is assumed to be the href itself, and the link buffer is printed in its place. This enables "bare links" like so:

[https://alexkarle.com]
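
In the same sketch style (buffer size and names are illustrative, not necessarily nihdoc's), the bare-link case is a small extension of the routine above:

  #include <stdio.h>

  /* Illustrative sketch, not nihdoc's actual code: copy the href
   * into a small stack buffer while printing it; if ']' arrives
   * before a space, replay the buffer as the description. */
  void
  emit_link(void)
  {
      char linkbuf[1024];
      size_t n = 0;
      int c;

      printf("<a href=\"");
      while ((c = getchar()) != EOF &&
          c != ' ' && c != '\n' && c != ']') {
          putchar(c);
          if (n < sizeof(linkbuf) - 1)
              linkbuf[n++] = c;
      }
      linkbuf[n] = '\0';
      printf("\">");
      if (c == ']')
          fputs(linkbuf, stdout);    /* bare link: reuse the href */
      else
          while ((c = getchar()) != EOF && c != ']')
              putchar(c);            /* normal description */
      printf("</a>");
  }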

I was about to go live on my blog with that iteration because I liked it enough, but the one thing that really bothered me was having to read past the link to get to the description. To the plaintext reader, the description is way more important than the link! Especially for long links, it's distracting to have to scan ahead to continue a sentence.

I really wanted markdown-style links like so:

[my blog](https://alexkarle.com/blog)

The immediate problem was that the parser can no longer print the characters as it sees them, since the URL comes after the description in the input but needs to come before it in the output. I realized, however, that this is a similar problem to the way I used the link buffer for bare links--all I had to do was store the description in the buffer and play it back after printing the href. It's the same amount of memory, but a tad more complex: the description is allowed to have inline styles, so before pushing onto the link buffer, we need to check for styles and push those too (effectively a smaller version of the main loop).
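
Here's roughly what that looks like in the same sketch form (again illustrative--the real parser also recognizes inline styles while filling the buffer, which this skips):

  #include <stdio.h>

  /* Illustrative sketch, not nihdoc's actual code: for
   * "[description](href)", buffer the description, stream the
   * href directly, then play the buffer back. */
  void
  emit_link(void)
  {
      char descbuf[1024];
      size_t n = 0;
      int c;

      while ((c = getchar()) != EOF && c != ']')
          if (n < sizeof(descbuf) - 1)
              descbuf[n++] = c;      /* buffer the description */
      descbuf[n] = '\0';
      getchar();                     /* consume the ( */

      printf("<a href=\"");
      while ((c = getchar()) != EOF && c != ')')
          putchar(c);                /* stream the href */
      printf("\">%s</a>", descbuf);  /* play back the description */
  }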

The final form of markdown links that I'd like to support but can't is a "postfix link", like so:

This is a [link] in
a paragraph
...
[link]: https://alexkarle.com

Since the actual link could be anywhere in the document, this kind of parsing potentially requires buffering the whole document, which violates the streaming constraint (which I'd like to keep!), so I stopped short of supporting it.

Conclusion

I hope you found this discussion of syntax, tradeoffs, and parsers interesting! I'm sure there's a lot more I can learn and improve on, but it's been a fun evolution from the mdoc(7) I started with! Check out the source if you're curious. I expect it'll change rather frequently in the next few weeks, so I wouldn't advise depending on it yourself (but I wanted to open source it to share with others as a teaching tool regardless!).

Back to blog