StrictMark: rational Markdown

Markdown is a wonderful lightweight markup: minimalistic, easy to read and write. Markdown is supported by GitHub, Bitbucket, Reddit, Diaspora, Stack Exchange, and many others. It is not without issues though. Markdown is precedent-based, so to speak. It aimed to codify preexisting practices which were... diverse. For that reason, it is messy and inconsistent between implementations.

StrictMark is a rational subset of Markdown that implements all the features with the shortest formal grammar possible. Hence, uniform syntax and no ambiguities. The idea is that StrictMark can reuse all the existing Markdown support, without sharing the weight of the legacy syntax and its incidental complexity.

StrictMark is Markdown, refactored.

Markdown critique

Markdown implementations are inconsistent. Vim highlights it one way, VS Code does it differently, and the resulting HTML is yet another thing. A textbook fix for inconsistent implementations is having a formal grammar. That might be either a proper EBNF grammar or just some regexes in simpler cases. There must be something formal and unambiguous. HTML has it, CSS has it, every markup or programming language has it. Except Markdown. Sadly, the syntax itself is so ambiguous that making a formal grammar becomes a road of pain. For example, HTML (which is hardly lightweight) has a uniform syntax for its </elements>. With Markdown, every element has its own syntax and those syntaxes interact. A formal grammar for Markdown was attempted in the past, but if you ask me, the result was underwhelming. The PEG-based grammar is 700 lines long. For a minimalistic markup, that is a lot. So ironic.

CommonMark is a Markdown codification effort that produced the most complete spec so far. Still, that spec is rule-and-exception based, no grammar. The text of the spec is full of "legalese":

«An indented code block cannot interrupt a paragraph, so there must be a blank line between a paragraph and a following indented code block»

Here it describes the specifics of interrelations between two particular markup elements. But N elements produce N*N relations! Consider that ATX headings are always single-line while Setext headings can be multiline. Why? Because the spec says so. It is one big heap of rules and another one of exceptions. That's why it advises a "parsing strategy" and not a parser generator.

Whether the CommonMark spec has fixed all the corner cases and ambiguities is unclear. Or maybe clear, as the spec is actively revised. So is the code. The CommonMark C parser is 10KLoC of hand-written code. It has plenty of exceptions and tweaks. While writing RON docs, I ran into issues immediately; I had to use the HEAD version, which has those issues fixed.

Overall, that seemingly theoretical grammar problem causes plenty of accidental complexity. Markdown is messy and hard to reason about; it is not always clear how to interpret a given construct. That combinatorial mess may not be a problem for its current uses, of course. (Although, I highly doubt that.) After all, the existing libcmark parser is fuzzed, thus reasonably reliable. Still, Markdown is a very shaky base if you want to build on top of it. To build something more advanced than a README. Like, full WYSIWYG editing or diff highlighting or other complex behavior.

StrictMark principles

StrictMark's objective is to make a Markdown subset which is a proper markup language. To remove that incidental complexity by rationalizing the grammar and making it formal. This document is StrictMark.

One may ask, why do I want to make Markdown a proper language? After all, there is HTML which is proper enough. Yes, HTML is a widely supported standard, but it is hopelessly elephantine. Let's think, who can afford to develop/support a proper HTML engine? That is roughly one-and-a-half companies in the world. Hence the interest in a minimalistic hypertext markup language.

The next question is obvious. If StrictMark is ever used at some scale, would it not become elephantine, naturally? Consider Wikipedia/Mediawiki markup. Once neat and lean like all such markups, it evolved into an elephantine mess. What about table support, for example? Some people think it is necessary, quite deservedly so.

I plan to prevent feature sprawl by enabling transclusion. A document may reference other documents and objects through hyperlinks. It can also include other objects and documents, through hyperlinks. The good old <img> tag is an example of transclusion. That will not even extend the syntax: StrictMark reuses the image syntax for the general case of transclusion. Once you need a table, you transclude a table! It is up to the renderer to deal with all those other data types. CSV is a much better format for tables than HTML or Markdown.

StrictMark is a backwards compatible subset of CommonMark, for the most part. Any existing CommonMark tooling will support StrictMark reasonably well. A StrictMark parser may not understand arbitrary Markdown.

StrictMark principles:

  1. A formal grammar. StrictMark is a regular language in its structural part (i.e. blocks). The inline markup syntax is based on regex-defined markers. That way, a decent parser can be implemented in regexes only.

  2. There is one way to do a thing, as uniform as possible. Hence: spaces, not tabs! Because spaces can replace tabs, not the other way around. ATX headings only. Formatting brackets are *one char* wide only. All block formatting is indented uniformly.

  3. Minimize ambiguity. The same character should not denote both lists and emphasis, etc.

  4. No spooky action-at-a-distance. Each line can be parsed separately. All the structural markup is 4-char-wide, hence indents are uniform. This restriction makes the nesting structure clear and unambiguous.

  5. Inline markup has very limited nesting; it is of secondary importance anyway. There is clear markup precedence; a code span wins over strong, strong wins over emph.

  6. HTML is not the only output format. It could be PDF, DOC, TeX, whatever. Hence, no HTML inserts. Use transclusion for other formats.

  7. Keep markup to the minimum. The plain text form must stay clean and readable.

  8. If the StrictMark interpretation contradicts CommonMark or CommonMark has an ambiguity then screw CommonMark.

Interestingly enough, similar ideas were proposed in the CommonMark community some years ago.

Inline markup

Markdown inline markup may seem like an easy part. Sadly, it is not. Due to very irregular and ambiguous syntax, implementing it properly is difficult. For that reason, StrictMark rationalizes the inline markup in the following ways:

  1. All inline markup is seen as bracketing. Brackets are matched separately, using regular expressions; e.g. (?<=\s)[*](?=\S) is the opening bracket for STRONG. Bracket pairs only become effective if they satisfy the precedence rules.

  2. Bracket precedence, lower to higher:

    1. _emphasized_,

    2. [link][1],

    3. *strong*,

    4. \* escapes,

    5. `code`.

  3. An open bracket can be paired with any following closing bracket of that kind. A new open bracket will cancel any preceding unmatched open bracket of its kind.

  4. In case of overlap, higher-precedence brackets win; in case of equal-precedence, the earlier range wins.

  5. Higher-precedence brackets may nest in lower-precedence, but not the other way around (the lower one is cancelled). In case of equal-precedence, the earlier range wins.

  6. No double symbols, i.e. *strong* not **strong**.
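
Rules 3 to 5 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the bracket patterns below are hypothetical stand-ins for the accurate ones in the grammar, links and escapes are left out, and brackets are not re-paired after a range is cancelled. The names pair_brackets and resolve are mine, not part of StrictMark.

```python
import re

# Precedence, lower to higher, as listed above (links and escapes omitted).
PRECEDENCE = {"emph": 1, "strong": 3, "code": 5}

# Simplified, hypothetical bracket patterns; the accurate ones are in the grammar.
OPEN = {"emph": r"(?<!\S)_(?=\S)", "strong": r"(?<!\S)\*(?=\S)"}
CLOSE = {"emph": r"(?<=\S)_(?!\w)", "strong": r"(?<=\S)\*(?!\w)"}


def pair_brackets(text, kind):
    """Rule 3: a close pairs with the latest unmatched open of its kind."""
    marks = [(m.start(), "open") for m in re.finditer(OPEN[kind], text)]
    marks += [(m.start(), "close") for m in re.finditer(CLOSE[kind], text)]
    pending, ranges = None, []
    for pos, side in sorted(marks):
        if side == "open":
            pending = pos  # cancels any earlier unmatched open
        elif pending is not None:
            ranges.append((pending, pos + 1, kind))
            pending = None
    return ranges


def resolve(text):
    """Rules 4 and 5: drop ranges that clash with a higher-precedence one."""
    found = [(m.start(), m.end(), "code") for m in re.finditer(r"`[^`]*`", text)]
    for kind in OPEN:
        found += pair_brackets(text, kind)
    # Highest precedence first; among equals, the earlier range first.
    found.sort(key=lambda r: (-PRECEDENCE[r[2]], r[0]))
    kept = []
    for start, end, kind in found:
        # A range survives only if every kept range is either disjoint
        # from it or nested entirely inside it.
        if all(e <= start or end <= s or (start < s and e < end)
               for s, e, _ in kept):
            kept.append((start, end, kind))
    return sorted(kept)
```

Processing the ranges from the highest precedence down means a winner is always decided before anything weaker is looked at, so a cancelled range can never cancel another one.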

Compared to CommonMark, restrictions are many:

That may seem restrictive, but again: inline formatting has a supplementary role. It must pull its own weight or it must not be there. The accurate bracket patterns are listed in the grammar appendix.

Links, images and transclusions

The only form of links StrictMark supports is full reference links. The link label must be exactly one symbol long. This approach:

Reference definitions can be placed anywhere. It is nice to put them at the end of a section or at the end of the document. Example:

    see [Replicated Object Notation][1]

    [1]: http://doc.replicated.cc/ron.sm "What is RON"

If you have more than 10 links, use letters. In case you have thousands of links, use Unicode symbols.

Transclusions and images use the same syntax as links, with an exclamation mark ! prepended. Example:

    ![here is the table][T]
    [T]: /table?@tab "this might be any object"
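
Resolving links and transclusions could look roughly like the Python sketch below. The two regexes are my simplification of LINK_REF_DEF and LINK_OPEN/LINK_CLOSE from the grammar (the label check is looser than LINK_LABEL), and the function names collect_refs and find_refs are made up for the example.

```python
import re

# Reference definition: [x]: target "optional title", with a one-symbol label.
REFDEF = re.compile(r'^\s*\[([^\s\]])\]:\s*(\S+)(?:\s+"([^"]*)")?\s*$')
# Reference link or transclusion: an optional "!", then [text][x].
REF = re.compile(r'(!?)\[([^\]]*)\]\[([^\s\]])\]')


def collect_refs(lines):
    """Map each one-symbol label to its (target, title) pair."""
    refs = {}
    for line in lines:
        m = REFDEF.match(line)
        if m:
            refs[m.group(1)] = (m.group(2), m.group(3))
    return refs


def find_refs(line, refs):
    """Yield (is_transclusion, text, target) for each resolvable reference."""
    for bang, text, label in REF.findall(line):
        if label in refs:
            yield (bang == "!", text, refs[label][0])
```

What to do with a transcluded target (fetch a CSV, render an image, and so on) stays with the renderer; the parser only has to tell the two cases apart.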

Block markup

StrictMark has three types of blocks:

  1. container blocks (lists, blockquotes, divs),

  2. leaf blocks (paragraphs, headers, rulers, fenced code blocks).

Container and entry blocks can contain other blocks; leaf blocks cannot. Depending on the type of a leaf block, it can contain text, metadata or nothing at all.

The block-related markup goes in the beginning of the line, in blocks of four symbols. That part of a line is called a block stack. The allowed blockstack pattern is (INDENT|QUOTE)* LIST? LEAF?. In absence of an explicit leaf block, a formatted text paragraph is implied.

A block can be continued in the following lines; that is signaled by indents (four spaces) in place of the block markup. An empty line is considered to be a continuation line for the container blocks in the stack, but not for the leaf block.

Example:

 #  Multiline
    header

 1. here the entry starts,
    and then it continues

    and continues...
 2. ...till the next entry.

Changes in blockstack depth cause container nesting changes. Additional indent of less than 4 spaces is not meaningful. That allows for easier line-by-line parsing and interpretation. If an indent level starts with a bare indent, that creates a generic container block (in other words, a div). With CommonMark, that should be a code block. StrictMark generalizes that slightly.
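
Since the block stack is a sequence of four-character cells, splitting it off a line takes no lookahead beyond the current cell. Here is a rough Python sketch, assuming the cell patterns from the grammar; the ruler is taken as ---- as in HLINE, the gap-space check is left out, and the names classify and split_block_stack are mine.

```python
import re

# Four-character cells, transcribed loosely from the grammar.
QUOTE = {">   ", " >  ", "  > ", "   >"}
ULIST = {"-   ", " -  ", "  - ", "   -"}
OLIST = re.compile(r" \d\. |\d\.  |  \d\.|\d\d\. | \d\d\.|\d{3}\.")
HEADER = re.compile(r" *#+ *")
REFDEF = re.compile(r"\[[^\s\]]\]:")


def classify(cell):
    """Name the block marker held by one cell, or return None."""
    if len(cell) < 4:
        return None
    if cell == "    ":
        return "indent"
    if cell in QUOTE:
        return "quote"
    if cell in ULIST:
        return "ulist"
    if OLIST.fullmatch(cell):
        return "olist"
    if HEADER.fullmatch(cell):
        return "header"
    if REFDEF.fullmatch(cell):
        return "refdef"
    return {"````": "fence", "----": "hline"}.get(cell)


def split_block_stack(line):
    """Split a line into (stack, rest) following (INDENT|QUOTE)* LIST? LEAF?."""
    stack, pos, stage = [], 0, 0  # stage 0: nesting, 2: after list, 3: after leaf
    while True:
        kind = classify(line[pos:pos + 4])
        if kind in ("indent", "quote") and stage == 0:
            pass
        elif kind in ("ulist", "olist") and stage <= 1:
            stage = 2
        elif kind in ("header", "fence", "hline", "refdef") and stage <= 2:
            stage = 3
        else:
            break
        stack.append(kind)
        pos += 4
    return stack, line[pos:]
```

Comparing the stack of one line with the stack of the previous one is then enough to decide which containers open, continue or close.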

In case a block marker ends with a non-space symbol, the next symbol must be whitespace (the gap space). For example, the only <h4> marker is #### which must be followed by some whitespace. That whitespace might be a part of the next marker or the line itself.

The supported block types are:

     >   blockquote,
     -   bulleted list,
     1.  numbered list,
         generic block (div),
    ```` fenced code block,
     ##  header (4 levels),
    [x]: reference definition,
    ____ ruler,
     [ ] TODO entry (in a list).

Headers

StrictMark is limited to four levels of headers. Only ATX headers are allowed. Headers can be multiline. The markings only go in the beginning of the line. Examples:

    #   Top header
    ##  Subheader
      ## Indented subheader
    ### Small header
    #### Smallest header

Note that header markings are padded to 4 chars with spaces, like the rest of the block markup. In case the last char is not a space, the inline text must start with a gap space.
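
The padding and the gap-space rule are easy to check mechanically. A small Python sketch for a header at the start of a line (no enclosing containers); parse_header is a name invented for this example.

```python
import re


def parse_header(line):
    """Return (level, text) for a header line, or None if it is not one."""
    cell, rest = line[:4], line[4:]
    # One to four "#" in a row, padded with spaces to exactly four chars.
    m = re.fullmatch(r" *(#{1,4}) *", cell)
    if len(cell) < 4 or m is None:
        return None
    # If the marker ends with "#", the text must start with a gap space.
    if not cell.endswith(" ") and rest and not rest[0].isspace():
        return None
    return len(m.group(1)), rest.strip()
```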

Lists

The unordered list markup symbol is a dash -. The other two Markdown options are * and +. But * is ambiguous and + is unpopular and there must be one way only!

The ordered list marker is a number followed by a dot (12. for example). Again, list markup takes four chars per level. If the last char is not a space, the first inline-text char must be a space. The particular indenting of bullet markers is up to the user. Typography purists may prefer _1._ while people who like to use Tab will write 1.__ (an underscore stands for a space here). The GitHub ToDo extension syntax is supported as another leaf block, _[ ].

      - bulleted list
      - still bulleted

         1. nested numbered list
         2. more numbered

        plain paragraph, also nested, indented 4 chars

         1. another nested list
         2. of two entries

      - resume the bulleted list

     1. I am a typography
     2. purist, I set terminal to
     3. custom fonts on a Mac.
    <!-- -->
    1.  I use Tab a lot,
    2.  I don't like to bother.

Note that numbered lists are limited to 999 properly numbered entries. Technically, numbering all entries 1. will produce perfectly correct output, but the raw markup will not be properly numbered then.
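
The 999 limit follows directly from the four-character cell: three digits plus the dot. A tool that renumbers lists might produce the markers as in this Python sketch; the function olist_marker and the style names "purist" and "tab" are hypothetical.

```python
def olist_marker(n, style="purist"):
    """Return the four-character marker for list entry n (1 to 999)."""
    if not 1 <= n <= 999:
        raise ValueError("numbered lists are limited to 999 entries")
    label = f"{n}."
    if style == "tab":
        return label.ljust(4)  # "1.  ", "12. ", "123."
    # "purist": right-align to three chars, then pad to four.
    return label.rjust(3).ljust(4)  # " 1. ", "12. ", "123."
```

With three digits the marker ends with the dot, so the entry text has to start with a gap space.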

To separate two adjacent lists of the same kind, CommonMark suggests using an empty HTML comment. We cannot do better than to recommend exactly the same trick. That counts as two pieces of fictive block markup, 8 chars total. In practice, you'd better put some text in between.

Blockquotes

Blockquote block markup is one > and three spaces, in any order.

  > #   Quoted header
  > Quoted paragraph text.

Blockquotes are the only type of formatting where a continuation line is not necessarily an empty indent; it can also repeat the quotation marker. This exception is made for historical reasons; using indents is strongly advised.

Code blocks

Code blocks use the fenced syntax with exactly four backticks. The opening fence may mention the language used. The code must be indented 4 chars; that follows the same continuation rules as all the other block containers have. The closing code fence is optional; the end of the code block can be signaled by the lack of indent. The code can safely use four backticks.

    ````js
        console.log("JavaScript is the best worst lang ever");
    ````
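
Because the code is indented, reading a code block is a matter of collecting indented lines. A Python sketch for a top-level block follows; it assumes that blank lines inside the block continue it (my reading of the continuation rules above), and read_code_block is an invented name.

```python
def read_code_block(lines, start):
    """Read a fenced code block that opens at lines[start].

    Returns (language, code_lines, index_of_the_next_unread_line).
    """
    opening = lines[start]
    if not opening.startswith("````"):
        raise ValueError("not a code fence")
    language = opening[4:].strip()
    code, i = [], start + 1
    while i < len(lines):
        line = lines[i]
        if line.startswith("    "):
            code.append(line[4:])  # indented, so it is code, fences included
        elif line.strip() == "":
            code.append("")  # assumption: a blank line continues the block
        else:
            if line.rstrip() == "````":
                i += 1  # consume the optional closing fence
            break
        i += 1
    while code and code[-1] == "":
        code.pop()  # trailing blank lines are not part of the code
    return language, code, i
```

An indented line of four backticks is just code here, which is why the code can safely contain them.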

The grammar

The Ragel variant.

    ##  C O N T A I N E R  B L O C K S

    CHAR = any; 
    WS = [ \t\r\n];
    NONWS = CHAR - WS;
    # StrictMark allows NO \n\r as there must be one way only!
    # Also, notepad.exe now supports Unix newlines.
    NL = "\n";
    NONNL = CHAR - NL;
    INLINE = NONNL*;
    MARKUP = [`*_\[\]];
    PUNCT = "!".."/" | ":".."@" | "[".."`" | "{".."~";
    NONWSP = NONWS - PUNCT;
    WSP = WS | PUNCT;
    
    WSA = WS ;

    INDENT = "    ";
    HEAD1 = "   #" | "  # " | " #  " | "#   ";
    HEAD2 = "  ##" | " ## " | "##  ";
    HEAD3 = " ###" | "### ";
    HEAD4 = "####";
    HEADER = HEAD1  | HEAD2  | HEAD3  | HEAD4 ;
    QUOTE = ">   " | " >  " | "  > " | "   >";
    ULIST = "  - " | "-   " | " -  " | "   -";
    OLIST = " " digit ". " | digit ".  " | "  " digit "." | 
        digit digit ". " | " " digit digit "." | digit digit digit ".";
    FENCE = "````";
    HLINE = "----";
    LINK_LABEL = CODEPOINT - WSP - "]";
    REFDEF = "[" LINK_LABEL "]:";

    NESTBLOCK = INDENT   | 
                QUOTE   ;
    LISTBLOCK = ULIST  |
                OLIST ;
    LEAFBLOCK = HEADER | 
                FENCE  |
                HLINE |
                REFDEF  ;

    BLOCK_STACK = NESTBLOCK* LISTBLOCK? LEAFBLOCK?;

    FUCKED_LINE = "TODO";

    LINE = ( BLOCK_STACK INLINE NL )  ;
    WIKITEXT = LINE*;

    ##  S T R I C T M A R K  I N L I N E

    ## note that each word is a URL candidate; use the URL parser separately
    ## as the full URL grammar will blow this state machine up
    
    AWORD = (NONWS+)  ;
    WORDS = AWORD (WS+ AWORD)*;

    ## emphasized text
    EMPH_BOTH_FLANKING = PUNCT "_" PUNCT ;
    EMPH_LEFT_FLANKING = (WSP "_" NONWS ) - EMPH_BOTH_FLANKING;
    EMPH_RIGHT_FLANKING = (NONWS "_" WSP ) - EMPH_BOTH_FLANKING;

    ## strong emphasis; can nest URIs, code spans; may contain escapes
    STRONG_BOTH_FLANKING = PUNCT "*" PUNCT ;
    STRONG_LEFT_FLANKING = (WSP "*" NONWS ) - STRONG_BOTH_FLANKING;
    STRONG_RIGHT_FLANKING = (NONWS "*" WSP ) - STRONG_BOTH_FLANKING;
    STRONG_INTRAWORD = "*"  NONWSP+ "*" ;

    ## a reference link
    LINK_OPEN = [^\]] "[" ;
    LINK_CLOSE = "]["  LINK_LABEL "]" ;

    ## code spans contain arbitrary Unicode, all parsed literally
    CODE = "`" ;

    ## backslash and ampersand escapes have higher precedence
    ## than other inline markup
    BACKSLASH_ESC = "\\" PUNCT ;
    AMPERSAND_ESC = "&"  ( alnum+ | "#" digit+ | "#" [xX] xdigit+ ) ";" ;
    ESC = BACKSLASH_ESC | AMPERSAND_ESC;

    ## to prevent combinatorial state explosion we find markup elements in the
    ## inline soup (left/right separately) then filter them by the nesting rules
    LEFT_MARKUP = EMPH_LEFT_FLANKING | STRONG_LEFT_FLANKING | LINK_OPEN;
    RIGHT_MARKUP = EMPH_RIGHT_FLANKING | STRONG_RIGHT_FLANKING | LINK_CLOSE;
    OTHER = STRONG_INTRAWORD | CODE | ESC;
    INLINE_SOUP = (LEFT_MARKUP|RIGHT_MARKUP|OTHER|CHAR)* ;

    LINK_REF_DEF = REFDEF WS* (NONWS+) (WS+ ["] (CHAR-["])* ["] )? WS*;

    }%%