The Markeven Processor
Markeven transforms text files into HTML using a set of simple rules. It takes most of its ideas from Markdown, but has stricter rules, which lead to better source structure and enhanced performance.
Syntax cheatsheet
Block elements
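Markeven recognizes the following block-level elements: paragraphs, sections (divs), headings, preformatted code blocks, ordered and unordered lists, tables, blockquotes, horizontal rulers, and inline HTML.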
Block elements are always delimited by two or more line ends (\n\n ):
This is a paragraph

    this is a code block

But this is still
    a paragraph.
This behavior differs from Markdown, which interprets the last block as two blocks: a paragraph and a code block.
Paragraphs
A paragraph is simply one or more lines of text. Multiple lines are joined into a single paragraph:
Paragraph one.

Paragraph two.
More text.

Paragraph three.
More text.
Even more text.
The following markup will be generated:
<p>Paragraph one.</p>
<p>Paragraph two. More text.</p>
<p>Paragraph three. More text. Even more text.</p>
As in Markdown, to insert a line break, leave two or more space characters at the end of the line.
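The paragraph rules above can be sketched in a few lines of Scala. This is only an illustration, not the Markeven implementation: the object name `ParagraphSketch` and its methods are hypothetical, and it handles paragraphs only.

```scala
// Hypothetical sketch: names are illustrative, not part of Markeven.
object ParagraphSketch {
  // Split the source into blocks on two or more consecutive line ends.
  def blocks(src: String): Seq[String] =
    src.trim.split("\n{2,}").toSeq.filter(_.nonEmpty)

  // Join the lines of one block into a single paragraph. A line ending
  // with two or more spaces produces an explicit <br/> line break.
  def paragraph(block: String): String = {
    val body = block.trim
      .replaceAll(" {2,}\n", "<br/>\n")
      .replaceAll("\n", " ")
    "<p>" + body + "</p>"
  }

  // Render every block as a paragraph, one per output line.
  def toHtml(src: String): String =
    blocks(src).map(paragraph).mkString("\n")
}
```

For example, `ParagraphSketch.toHtml("Paragraph two.\nMore text.")` joins both lines into one `p` element.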
Sections (divs)
A section is a block which is rendered into an HTML div element. Each line of a section must start with a pipe | character:
| This is a section with two paragraphs.
| I am the first one.
|
| And I am the second one.
The following markup will be generated:
<div>
<p>This is a section with two paragraphs. I am the first one.</p>
<p>And I am the second one.</p>
</div>
Sections are frequently used in conjunction with block selectors by web designers to achieve certain effects like styling, animating, etc.
Headings
Markeven supports both the ATX and Setext heading styles used by Markdown:
This is first-level heading
===========================
This is second-level heading
----------------------------
# First level again
## Second level here
### Third level
#### Fourth level
##### Fifth level
###### Sixth level
Unlike Markdown, Markeven does not allow closing # characters, so the following example:
# This is a heading which ends with #
will be transformed into:
<h1>This is a heading which ends with #</h1>
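A minimal sketch of ATX heading recognition, assuming a regular expression similar in spirit to the one the processor uses (the pattern and the name `HeadingSketch` are assumptions made for illustration). Because trailing # characters are not treated as closing markers, they stay in the heading body.

```scala
// Hypothetical sketch: the pattern below is an assumption, not Markeven's own.
object HeadingSketch {
  // One to six '#' characters, at least one whitespace, then the heading body.
  private val Atx = """^(#{1,6})\s+(.*)$""".r

  // Returns the rendered heading, or None if the line is not an ATX heading.
  def heading(line: String): Option[String] = line match {
    case Atx(marker, body) =>
      val level = marker.length // the heading level equals the marker length
      Some("<h" + level + ">" + body + "</h" + level + ">")
    case _ => None
  }
}
```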
Preformatted code blocks
Code blocks are used to write about programming or markup. Their content is usually rendered in a monospaced font and is interpreted literally. To produce a code block, indent every line of the block with at least 4 spaces or 1 tab:
Here's some code:

    println("Hello world!")
Markeven will produce:
<p>Here's some code:</p>
<pre><code>println("Hello world!")
</code></pre>
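The indentation rule can be sketched as follows. This is a simplified illustration under the assumption that a chunk has already been split off on blank lines; `CodeBlockSketch` and its methods are hypothetical names.

```scala
// Hypothetical sketch of code block detection and rendering.
object CodeBlockSketch {
  // A chunk is a code block when it is non-empty and every non-blank
  // line starts with at least 4 spaces or 1 tab.
  def isCode(chunk: String): Boolean =
    chunk.trim.nonEmpty && chunk.split("\n").forall { l =>
      l.trim.isEmpty || l.startsWith("    ") || l.startsWith("\t")
    }

  // Content is interpreted literally, so HTML-significant characters are encoded.
  private def escape(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

  // Strip one level of indentation from each line and wrap in pre/code.
  def render(chunk: String): String = {
    val body = chunk.split("\n").map(_.replaceFirst("^( {4}|\t)", "")).mkString("\n")
    "<pre><code>" + escape(body) + "\n</code></pre>"
  }
}
```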
Ordered and unordered lists
Lists in Markeven have strict rules which help you build highly structured documents.
The first thing to know about is a list marker. Ordered lists must start with 1. followed by at least one space. Unordered lists must start with * followed by at least one space. Every subsequent list item must start with the same marker (a number followed by a dot and whitespace in case of ordered lists):
1. list item 1
2. list item 2

* list item 1
* list item 2
Here is the generated markup for the above snippet:
<ol>
<li>list item 1</li>
<li>list item 2</li>
</ol>
<ul>
<li>list item 1</li>
<li>list item 2</li>
</ul>
List items can contain other block-level elements. To interpret whitespace-sensitive blocks properly, you should maintain the same indentation inside list items. We refer to this indentation as the list item baseline:
* This paragraph is under first list item.

  This paragraph is also under first list item, because
  it is properly indented.

*   This list item has another baseline.

    So we should indent our second paragraph accordingly.

This paragraph, however, is outside list.
The following markup will be generated:
<ul>
<li>
<p>This paragraph is under first list item.</p>
<p>This paragraph is also under first list item, because it is properly indented.</p>
</li>
<li>
<p>This list item has another baseline.</p>
<p>So we should indent our second paragraph accordingly.</p>
</li>
</ul>
<p>This paragraph, however, is outside list.</p>
Nested lists follow the same rules:
1. List 1 item 1
2. List 1 item 2
   1. List 2 item 1
   2. List 2 item 2
3. List 1 item 3
Code blocks can also be nested inside list items. Each line of a code block must be indented with at least 4 spaces or 1 tab relative to the list item's baseline:
1. Code inside list item:

       def sayHello = {
         println("Hello world!")
       }
You can also add a visual guide indicating the current list item baseline using the pipe | character. This can be useful when the list item is long and its content is complex:
1. | This is a long and complex list item.
   |
   |     code block
   |
   | * another list
   | * ...

2. And that's it.
Note that list items belong to the same list only if their markers are equally indented. The following example shows two different lists:
* List one item one
* List one item two
* List two item one
* List two item two
And a paragraph.
Here's the markup:
<ul>
<li>List one item one</li>
<li>List one item two</li>
</ul>
<ul>
<li>List two item one</li>
<li>List two item two</li>
</ul>
<p>And a paragraph.</p>
Tables
Markeven supports simple syntax for tables:
---------------------------------------------
| Column 1 | Column 2 | Column 3 |
--------------|--------------|---------------
| one | two | three |
| four | five | six |
---------------------------------------------
Here's the markup:
<table>
<thead>
<tr>
<th>Column 1</th>
<th>Column 2</th>
<th>Column 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>one</td>
<td>two</td>
<td>three</td>
</tr>
<tr>
<td>four</td>
<td>five</td>
<td>six</td>
</tr>
</tbody>
</table>
As you can see, the first and last lines of a table should consist of minus - characters only. The only exception is that the first line can optionally end with a > character, in which case the table expands to its maximum width.
Cells are separated by the pipe | character; you can omit leading and trailing pipes. The table header is separated from the table body by a separator line. This line can optionally contain colon : characters to express column alignment: a colon on the left side means left alignment, a colon on the right side means right alignment, and colons at both ends mean center alignment:
---------------------------------------------
| Column 1 | Column 2 | Column 3 |
-------------:|:------------:|:--------------
| one | two | three |
| four | five | six |
---------------------------------------------
You can also omit the header; in this case you cannot specify column alignment with colons:
--------------------
one | two | three
--------------------
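To illustrate how the `:` markers in the header separator line map to column alignment, here is a small sketch. The object name `TableAlignSketch` and the use of `Option[String]` for the result are assumptions made for this example only.

```scala
// Hypothetical sketch: derives column alignment from a table separator line.
object TableAlignSketch {
  // For each column, returns Some("left"), Some("right"), Some("center"),
  // or None when the separator cell carries no ':' marker.
  def alignments(separator: String): Seq[Option[String]] =
    separator.trim.split("\\|").toSeq.map { cell =>
      val c = cell.trim
      (c.startsWith(":"), c.endsWith(":")) match {
        case (true, true)  => Some("center") // markers at both ends
        case (true, false) => Some("left")   // marker on the left side
        case (false, true) => Some("right")  // marker on the right side
        case _             => None           // no alignment requested
      }
    }
}
```

Applied to the separator line from the alignment example above, the three columns come out as right, center, and left aligned.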
Blockquotes
Blockquotes are similar to sections, but they are rendered into an HTML blockquote element. Each line of a blockquote must start with a > character. Like sections, blockquotes can contain nested block elements:
> This is blockquote.
>
> > This blockquote is nested.
>
> That's it.
Here's the generated markup:
<blockquote>
<p>This is blockquote.</p>
<blockquote>This blockquote is nested.</blockquote>
<p>That's it.</p>
</blockquote>
Horizontal rulers
A horizontal ruler is rendered from a block which contains three or more minus - characters:
This is some text.
---
This is some more text.
The following markup will be produced by Markeven:
<p>This is some text.</p>
<hr/>
<p>This is some more text.</p>
No other syntaxes for <hr/> are supported.
Inline HTML
Markeven allows you to place HTML elements right inside your text. Their content won't get processed:
<div>
    This text won't get transformed into a code block
</div>

    But this will.
There are no strict rules about inline HTML. The only important thing is that your markup should be correct (tags closed and properly nested). Markeven does not have the ability to “fix” wrong HTML markup yet :)
Block selectors
Each block can optionally have a selector. It is used to add id and class HTML attributes to blocks:
I have an id. {#para1}
I have two classes. {.class1.class2}
I have an id and a class. {#para3.class1}
The example above will be transformed into the following HTML snippet:
<p id="para1">I have an id.</p>
<p class="class1 class2">I have two classes.</p>
<p id="para3" class="class1">I have an id and a class.</p>
The most common use of selectors is to assign an id attribute so that the block can be referenced in links:
Now I can be referenced by id! {#mypara}
Look! I can reference [another paragraph](#mypara).
The selector expression is enclosed in curly braces and must be placed at the end of the first line of the block (no trailing whitespace allowed!).
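A rough sketch of selector parsing, assuming identifiers consist of word characters and hyphens (that restriction, the pattern, and the names `SelectorSketch` and `Parsed` are assumptions for illustration, not the processor's actual rules):

```scala
// Hypothetical sketch of block selector parsing.
object SelectorSketch {
  // Optional "#id" followed by zero or more ".class" parts,
  // enclosed in braces at the very end of the line.
  private val Selector = """\s*\{(#[\w-]+)?((?:\.[\w-]+)*)\}$""".r

  case class Parsed(text: String, id: Option[String], classes: List[String])

  // Strips the selector from the first line of a block, if one is present.
  def parse(firstLine: String): Parsed =
    Selector.findFirstMatchIn(firstLine) match {
      case Some(m) if m.group(1) != null || m.group(2).nonEmpty =>
        val id = Option(m.group(1)).map(_.substring(1)) // drop the leading '#'
        val classes = m.group(2).split("\\.").filter(_.nonEmpty).toList
        Parsed(firstLine.substring(0, m.start), id, classes)
      case _ => Parsed(firstLine, None, Nil)
    }

  // Renders a paragraph carrying the parsed id and class attributes.
  def paragraph(firstLine: String): String = {
    val p = parse(firstLine)
    val id = p.id.map(i => " id=\"" + i + "\"").getOrElse("")
    val cls = if (p.classes.isEmpty) "" else " class=\"" + p.classes.mkString(" ") + "\""
    "<p" + id + cls + ">" + p.text + "</p>"
  }
}
```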
Text enhancements
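Inside block-level elements, the following text enhancements occur: text wrapped in * is rendered as strong, in _ as em, and in ~ as del; code spans are rendered as code elements; and typographic sequences such as dashes, arrows, ellipses, quotes, and the trademark, registered, and copyright signs are replaced with their typographic equivalents.
You can also use backslash escaping to prevent special characters from being misinterpreted. The following characters can be escaped: \`_*{}[]()#+-~.!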
Links & Images
Two styles of links are supported: inline and reference.
Inline links look like this: [my text](http://my_url) or [some text](http://some_url "some title") and are rendered into HTML a element: <a href="http://my_url">my text</a> and <a href="http://some_url" title="some title">some text</a> .
Reference-style links are split into a link definition and a link usage. Using the previous examples, here is what link definitions could look like:
[id1]: http://my_url
[id2]: http://some_url "some title"
Link usages would then look like this: [my text][id1] and [some text][id2] . The generated markup is equal to the previous one.
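Reference-style resolution can be sketched as two passes: collect the definitions, then substitute the usages. The code below is a simplified illustration with assumed patterns and hypothetical names (`RefLinkSketch`, `LinkDef`); it looks ids up case-insensitively and falls back to the link text when the id is empty, as the processor source does.

```scala
import scala.util.matching.Regex

// Hypothetical sketch of reference-style link resolution.
object RefLinkSketch {
  case class LinkDef(url: String, title: Option[String])

  // "[id]: url" with an optional quoted title (assumed pattern).
  private val Definition = """\s*\[(.+?)\]:\s*(\S+)(?:\s+"(.*?)")?\s*""".r
  // "[text][id]" (assumed pattern).
  private val Usage = """\[(.*?)\]\[(.*?)\]""".r

  def render(src: String): String = {
    // Pass 1: separate definition lines from ordinary text lines.
    val (defLines, textLines) =
      src.split("\n").toList.partition(l => Definition.pattern.matcher(l).matches)
    val defs = defLines.collect {
      case Definition(id, url, title) => id.trim.toLowerCase -> LinkDef(url, Option(title))
    }.toMap
    // Pass 2: replace each usage with an anchor; unknown ids are left untouched.
    Usage.replaceAllIn(textLines.mkString("\n"), m => {
      val rawId = if (m.group(2).isEmpty) m.group(1) else m.group(2)
      val html = defs.get(rawId.trim.toLowerCase) match {
        case Some(d) =>
          val title = d.title.map(t => " title=\"" + t + "\"").getOrElse("")
          "<a href=\"" + d.url + "\"" + title + ">" + m.group(1) + "</a>"
        case None => m.matched
      }
      Regex.quoteReplacement(html) // keep '$' and '\' literal in the output
    })
  }
}
```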
The syntax for images is similar to the one for links: the exclamation ! sign immediately before the opening bracket tells Markeven to interpret the link as an image. The link text becomes the value of the alt attribute:
Inline image: ![some image](/img/hello.png "Hello")
Or reference-style image: ![some image][img]
[img]: /img/hello.png "Hello"
Both cases generate the following markup for the image:
<img src="/img/hello.png" title="Hello" alt="some image"/>
class MarkevenProcessor() {
val protector = new Protector
val links = new HashMap[String, LinkDefinition]()
var level = 0
val macros = new HashMap[String, StringEx => CharSequence]()
def increaseIndent: Unit = level += 1
def decreaseIndent: Unit = if (level > 0) level -= 1
def addMacro(name: String, function: StringEx => CharSequence): this.type = {
macros += (name -> function)
return this
}
def currentIndent: String =
if (level <= 0) return ""
else " " * level
// expand each tab to four spaces and unify line ends
def normalize(s: StringEx): StringEx = s.replaceAll("\t", "    ")
.replaceAll(regexes.lineEnds, "\n")
def cleanEmptyLines(s: StringEx): StringEx = s.replaceAll(regexes.blankLines, "")
def stripLinkDefinitions(s: StringEx): StringEx = s.replaceAll(regexes.linkDefinition, m => {
val id = m.group(1).trim.toLowerCase
val url = processUrl(m.group(2))
var t = m.group(4)
val title = new StringEx(if (t == null) "" else t)
encodeChars(title)
encodeBackslashEscapes(title)
encodeChars(url)
encodeBackslashEscapes(url)
links += id -> new LinkDefinition(url, title)
""
})
def hashInlineHtml(s: StringEx, pattern: Pattern, out: String => String): StringEx =
s.replaceIndexed(pattern, m => {
val startIdx = m.start
var endIdx = 0
if (m.group(2) != null) {
// self-closing tag, escape as is
endIdx = m.end
} else {
// find end-index of matching closing tag
val tagName = m.group(1)
// following regex will have `group(1) == null` for closing tags;
// `group(2)` determines if a tag is self-closing.
val tm = regexes.htmlTag(tagName).matcher(s.buffer)
var depth = 1
var idx = m.end
while (depth > 0 && idx < s.length && tm.find(idx)) {
if (tm.group(1) == null) depth -= 1 // closing tag
else if (tm.group(2) == null) depth += 1 // opening tag
idx = tm.end
}
endIdx = idx
}
// add to protector and replace
val key = protector.addToken(s.buffer.subSequence(startIdx, endIdx))
(out(key), endIdx)
})
def hashHtmlBlocks(s: StringEx): StringEx =
hashInlineHtml(s, regexes.inlineHtmlBlockStart, key => "\n\n" + key + "\n\n")
def hashHtmlComments(s: StringEx): StringEx = s.replaceAll(regexes.htmlComment, m =>
"\n\n" + protector.addToken(m.group(0)) + "\n\n")
def readBlocks(s: StringEx): Seq[Block] = {
val result = new ListBuffer[Block]()
val chunks = new ChunkIterator(s.split(regexes.blocks))
while (chunks.hasNext)
result += readBlock(chunks)
return result
}
def readBlock(chunks: ChunkIterator): Block = {
// get current chunk
val s = chunks.next
// strip selector if any
val selector = stripSelector(s)
// assume hashed inline HTML
if (s.buffer.length == keySize + 2 &&
s.buffer.charAt(0) == '!' && s.buffer.charAt(1) == '}')
protector.decode(s.buffer.toString) match {
case Some(content) => return new InlineHtmlBlock(new StringEx(content))
case _ => return new ParagraphBlock(s, selector)
}
// assume code block
if (s.matches(regexes.d_code))
return processComplexChunk(chunks, new CodeBlock(s, selector),
c => c.matches(regexes.d_code))
// trim any leading whitespace
val indent = s.trimLeft
// do not include empty freaks
if (s.length == 0) return EmptyBlock
// assume unordered list and ordered list
if (s.startsWith("* "))
return processComplexChunk(chunks, new UnorderedListBlock(s, selector, indent),
c => ul_?(c, indent))
if (s.startsWith("1. "))
return processComplexChunk(chunks, new OrderedListBlock(s, selector, indent),
c => ol_?(c, indent))
// assume blockquote and section
if (s.startsWith(">"))
if (s.matches(regexes.d_blockquote)) {
return new BlockquoteBlock(s, selector)
} else return new ParagraphBlock(s, selector)
if (s.startsWith("|"))
if (s.matches(regexes.d_div)) {
return new SectionBlock(s, selector)
} else return new ParagraphBlock(s, selector)
// assume table, headings and hrs
s.matches(regexes.d_table, m => {
new TableBlock(s, selector)
}) orElse s.matches(regexes.d_heading, m => {
val marker = m.group(1)
val body = m.group(2)
new HeadingBlock(new StringEx(body), selector, marker.length)
}) orElse s.matches(regexes.d_h1, m => {
new HeadingBlock(new StringEx(m.group(1)), selector, 1)
}) orElse s.matches(regexes.d_h2, m => {
new HeadingBlock(new StringEx(m.group(1)), selector, 2)
}) orElse s.matches(regexes.d_hr, m => {
new HorizontalRulerBlock(selector)
}) match {
case Some(block: Block) => block
case _ => // nothing matched -- paragraph
new ParagraphBlock(s, selector)
}
}
def processComplexChunk(chunks: ChunkIterator,
block: Block,
accept: StringEx => Boolean): Block = {
var eob = false
while (chunks.hasNext && !eob) {
val c = chunks.peek
if (accept(c)) {
block.text.append("\n\n").append(c)
chunks.next
} else eob = true
}
return block
}
def ol_?(s: StringEx, indent: Int): Boolean = {
val i = new CharIterator(s)
while (i.hasNext && i.index < indent - 1)
if (i.next != ' ') return false
if (!i.hasNext) return false
// first char must be digit or space
var c = i.next
if (c == ' ') return true
if (!c.isDigit) return false
// look for more digits or `. `
while(i.hasNext) {
c = i.next
if (c == '.' && i.hasNext && i.peek == ' ') return true
else if (!c.isDigit) return false
}
return false
}
def ul_?(s: StringEx, indent: Int): Boolean = {
val i = new CharIterator(s)
while (i.hasNext && i.index < indent - 1)
if (i.next != ' ') return false
if (!i.hasNext) return false
// first char must be asterisk or space
var c = i.next
if (c == ' ') return true
return (c == '*' && i.hasNext && i.peek == ' ')
}
def stripSelector(s: StringEx): Selector = {
var id = ""
var classes = new ListBuffer[String]()
s.replaceFirst(regexes.blockSelector, m => {
val idSelector = m.group(1)
val classesSelector = m.group(2)
if (idSelector != null)
id = idSelector.substring(1)
if (classesSelector != null)
classesSelector.split("\\.").foreach { cl =>
if (cl != "")
classes += cl
}
""
})
return new Selector(id, classes)
}
def process(cs: CharSequence, out: Writer): Unit = {
val s = new StringEx(cs)
normalize(s)
stripLinkDefinitions(s)
hashHtmlBlocks(s)
hashHtmlComments(s)
cleanEmptyLines(s)
writeHtml(readBlocks(s), out)
}
def transform(s: StringEx): StringEx = {
protector.clear
normalizeSpan(s)
hashInlineHtml(s, regexes.inlineHtmlSpanStart, key => key)
doMacros(s)
encodeChars(s)
doCodeSpans(s)
encodeBackslashEscapes(s)
doInlineImages(s)
doRefImages(s)
doInlineLinks(s)
doRefLinks(s)
doSpanEnhancements(s)
return unprotect(s)
}
def normalizeSpan(s: StringEx): StringEx =
// two trailing spaces mark an explicit line break; other line ends become spaces
s.trim.replaceAll("  \n", "<br/>\n").replaceAll("\n", " ")
def encodeChars(s: StringEx): StringEx =
s.replaceAll(regexes.e_amp, "&amp;")
.replaceAll("<", "&lt;")
.replaceAll(">", "&gt;")
protected def processSingleMacro(m: Matcher): CharSequence = {
var name = m.group(1)
if (name == null) name = ""
if (name.length > 0)
name = name.substring(0, name.length - 1)
val contents = new StringEx(m.group(2))
val r = macros.get(name).map(f => f(contents)).getOrElse(m.group(0))
r
}
def doMacros(s: StringEx): StringEx =
s.replaceAll(regexes.macro, m => protector.addToken(processSingleMacro(m)))
def doCodeSpans(s: StringEx): Unit = s.replaceAll(regexes.codeSpan, m => {
val s = new StringEx(m.group(2)).trim
// there can be protected content inside codespans, so decode them first
unprotect(s)
encodeChars(s)
protector.addToken(s.append("</code>").prepend("<code>"))
})
def encodeBackslashEscapes(s: StringEx): StringEx =
s.replaceAll(regexes.backslashChar, m => {
val c = m.group(0)
escapeMap.getOrElse(c, c)
})
def doRefLinks(s: StringEx): StringEx = s.replaceAll(regexes.refLinks, m => {
val linkText = m.group(1)
var id = m.group(2)
if (id == "") id = linkText
id = id.trim.toLowerCase
val linkContent = new StringEx(linkText)
// there can be protected content inside linktexts, so decode them first
unprotect(linkContent)
doSpanEnhancements(linkContent)
val result = links.get(id)
.map(ld => ld.toLink(linkContent))
.getOrElse(new StringEx(m.group(0)))
doMacros(result)
protector.addToken(result)
})
def doRefImages(s: StringEx): StringEx = s.replaceAll(regexes.refImages, m => {
val altText = m.group(1)
var id = m.group(2)
if (id == "") id = altText
id = id.trim.toLowerCase
val result = links.get(id)
.map(ld => ld.toImageLink(altText))
.getOrElse(new StringEx(m.group(0)))
doMacros(result)
protector.addToken(result)
})
def doInlineLinks(s: StringEx): StringEx = s.replaceAll(regexes.inlineLinks, m => {
val linkText = m.group(1)
val url = processUrl(m.group(2))
var title = m.group(4)
if (title == null) title = ""
val linkContent = new StringEx(linkText)
// there can be protected content inside linktexts, so decode them first
unprotect(linkContent)
doSpanEnhancements(linkContent)
val result = new LinkDefinition(url, new StringEx(title)).toLink(linkContent)
doMacros(result)
protector.addToken(result)
})
def doInlineImages(s: StringEx): StringEx = s.replaceAll(regexes.inlineImages, m => {
val altText = m.group(1)
val url = processUrl(m.group(2))
var title = m.group(4)
if (title == null) title = ""
val result = new LinkDefinition(url, new StringEx(title)).toImageLink(altText)
doMacros(result)
protector.addToken(result)
})
def doSpanEnhancements(s: StringEx): StringEx = {
doTypographics(s)
recurseSpanEnhancements(s)
s
}
protected def processUrl(url: CharSequence): StringEx = new StringEx(url)
protected def recurseSpanEnhancements(s: StringEx): StringEx =
s.replaceAll(regexes.spanEnhancements, m => {
val element = m.group(1) match {
case "*" => "strong"
case "_" => "em"
case "~" => "del"
case _ => "span"
}
val content = new StringEx(m.group(2))
recurseSpanEnhancements(content)
new StringEx("<").append(element).append(">")
.append(content).append("</").append(element).append(">")
})
def doTypographics(s: StringEx): StringEx = {
s.replaceAll(regexes.ty_dash, typographics.dash)
s.replaceAll(regexes.ty_larr, typographics.larr)
s.replaceAll(regexes.ty_rarr, typographics.rarr)
s.replaceAll(regexes.ty_trade, typographics.trade)
s.replaceAll(regexes.ty_reg, typographics.reg)
s.replaceAll(regexes.ty_copy, typographics.copy)
s.replaceAll(regexes.ty_hellip, typographics.hellip)
s.replaceAll(regexes.ty_ldquo, typographics.ldquo)
s.replaceAll(regexes.ty_rdquo, typographics.rdquo)
s
}
def unprotect(s: StringEx): StringEx = {
val found = s.replaceIfFound(regexes.protectKey, m => {
val key = m.group(0)
protector.decode(key).getOrElse(key)
})
if (found) unprotect(s)
return s
}
def writeHtml(blocks: Seq[Block], out: Writer): Unit =
blocks.foreach(b => if (b != EmptyBlock) {
b.writeHtml(this, out)
out.write(newLine)
})
def formHtml(blocks: Seq[Block], indent: Boolean = false): StringEx = {
val result = new StringEx("")
if (indent) level += 1
blocks.foreach(b =>
if (b != EmptyBlock) result.append(b.toHtml(this)).append(newLine))
if (indent) level -= 1
return result
}
def newLine: String = "\n"
def toHtml(cs: CharSequence): String = {
val out = new StringWriter(cs.length)
process(cs, out)
return out.toString
}
}
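The `Protector` used throughout the class is not shown here. The sketch below illustrates the general idea only, under stated assumptions: tokens are stored under generated keys that start with `!}` (matching the check in `readBlock`), and unprotecting repeats until no key is left, since protected content may itself contain keys. The class name `ProtectorSketch`, the counter-based keys, and the key size of 8 are assumptions, not the real implementation.

```scala
import scala.collection.mutable
import scala.util.matching.Regex

// Hypothetical sketch of the protect/unprotect mechanism.
class ProtectorSketch(keySize: Int = 8) {
  private val tokens = mutable.Map[String, String]()
  private var counter = 0
  // Keys look like "!}" followed by keySize digits (an assumed format).
  private val keyPattern = ("!\\}\\d{" + keySize + "}").r

  // Stores content and returns the key that stands in for it.
  def addToken(content: CharSequence): String = {
    counter += 1
    val key = "!}" + ("%0" + keySize + "d").format(counter)
    tokens(key) = content.toString
    key
  }

  def decode(key: String): Option[String] = tokens.get(key)

  // Replaces keys with their content until nothing changes any more.
  def unprotect(s: String): String = {
    var current = s
    var changed = true
    while (changed) {
      val next = keyPattern.replaceAllIn(current, m =>
        Regex.quoteReplacement(decode(m.matched).getOrElse(m.matched)))
      changed = next != current
      current = next
    }
    current
  }
}
```

Protecting a fragment keeps later passes, such as span enhancements, from touching markup that has already been generated.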