The compiler for your content.
This handbook describes the unified ecosystem. It goes in depth about the numerous syntaxes it supports, usage, and practical guides on writing plugins. Additionally, it will attempt to define murky, computer science-y concepts that unified attempts to abstract away.
- Introduction
- How does it work?
- Supported syntaxes
- Abstract syntax trees
- unist
- unified
- remark
- rehype
- retext
- MDX
- Tree traversal
- Glossary
- Collective
- Authors
- Additional resources
- Acknowledgements
- License
- Notes
unified enables new exciting projects like Gatsby to pull in Markdown, MDX to embed JSX, and Prettier to format it. It’s used in about 300k projects on GitHub and has about 10m downloads each month on npm: you’re probably using it.
It powers remarkjs, rehypejs, mdx-js, retextjs, and redotjs. It's used to build other projects like prettier, gatsbyjs, and more.
Some notable users are Node.js, ZEIT, Netlify, GitHub, Mozilla, WordPress, Adobe, Facebook, Google.
unified uses abstract syntax trees, or ASTs, that plugins can operate on. It can even process between different formats. This means you can parse a markdown document, transform it to HTML, and then transpile back to markdown.
unified leverages a syntax tree specification (called unist or UST) so that utilities can be shared amongst different formats. In practice, you can use unist-util-visit to visit nodes using the same library with the same API on any supported AST.
visit(markdownAST, 'image', transformImages)
visit(htmlAST, {type: 'element', tagName: 'img'}, transformImgs)
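To illustrate the idea without installing anything, here is a minimal sketch of a visitor that relies only on the `type` and `children` fields that unist trees share. The `visit` function below is a simplified stand-in written for this example, not the real unist-util-visit, and both trees are hand-written assumptions.

```javascript
// Simplified stand-in for unist-util-visit (an assumption for illustration):
// walks the tree in preorder and calls `visitor` for nodes matching `test`.
function visit(tree, test, visitor) {
  const matches =
    typeof test === 'string'
      ? node => node.type === test
      : node => Object.keys(test).every(key => node[key] === test[key])

  function walk(node) {
    if (matches(node)) visitor(node)
    for (const child of node.children || []) walk(child)
  }

  walk(tree)
}

// Hand-written markdown-like tree (mdast uses `image` nodes).
const markdownAST = {
  type: 'root',
  children: [
    {type: 'paragraph', children: [{type: 'image', url: 'cat.png', alt: 'A cat'}]}
  ]
}

// Hand-written HTML-like tree (hast uses `element` nodes with a `tagName`).
const htmlAST = {
  type: 'root',
  children: [
    {type: 'element', tagName: 'img', properties: {src: 'cat.png'}, children: []}
  ]
}

const found = []
visit(markdownAST, 'image', node => found.push(node.url))
visit(htmlAST, {type: 'element', tagName: 'img'}, node => found.push(node.properties.src))

console.log(found) // both image sources, collected with the same function
```

The same walking code handles both trees; only the test that selects nodes differs between formats.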
unified supports a few different syntaxes. Each has its own formal specification and is compatible with all unist utility libraries.
Each syntax has its own GitHub organization and subset of plugins and libraries.
An abstract syntax tree, or AST, is a representation of input. It's an abstraction that enables developers to analyze, transform and generate code.
They're the integral data structure in the unified ecosystem. Most plugins operate solely on the AST, receiving it as an argument and then returning a new AST afterwards.
Your most basic plugin looks like the following (where the tree is an AST):
module.exports = options => tree => {
  return tree
}
It accepts the AST as an argument, and then returns it. You can make it do something slightly more interesting by counting the heading nodes.
const visit = require('unist-util-visit')

module.exports = options => tree => {
  let headingsCount = 0

  visit(tree, 'heading', node => {
    headingsCount++
  })
}
Or, turn all h1s in a document into h2s:
const visit = require('unist-util-visit')

module.exports = options => tree => {
  visit(tree, 'heading', node => {
    if (node.depth === 1) {
      node.depth = 2
    }
  })
}
If you ran the plugin above with # Hello, world! and compiled it back to markdown, the output would be ## Hello, world!.
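As a rough check of the behaviour described above, the sketch below applies the same depth change to a hand-written tree. The `visit` helper is a simplified stand-in for unist-util-visit, and the tree and the `demoteH1` name are assumptions made for this example; parsing and compiling are skipped.

```javascript
// Simplified stand-in for unist-util-visit (assumption, not the real library).
function visit(node, type, visitor) {
  if (node.type === type) visitor(node)
  for (const child of node.children || []) visit(child, type, visitor)
}

// Same shape as the plugin above: options => tree => { ... }
const demoteH1 = options => tree => {
  visit(tree, 'heading', node => {
    if (node.depth === 1) {
      node.depth = 2
    }
  })
  return tree
}

// Hand-written tree: an h1 (`# Hello, world!`) followed by an h3.
const tree = {
  type: 'root',
  children: [
    {type: 'heading', depth: 1, children: [{type: 'text', value: 'Hello, world!'}]},
    {type: 'heading', depth: 3, children: [{type: 'text', value: 'Details'}]}
  ]
}

const result = demoteH1()(tree)
const depths = result.children.map(node => node.depth)

console.log(depths) // [ 2, 3 ]: the h1 became an h2, the h3 is untouched
```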
unified uses ASTs because plugins are much easier to write when operating on objects rather than the strings themselves. You could achieve the same result with a string replacement:
markdown.replace(/^#\s+/gm, '## ')
But this would be brittle and doesn't handle the thousands of edge cases with complex grammars which make up the syntax of markdown, HTML, and MDX.
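A small sketch of two such edge cases, using a made-up document. The regular expression here uses the `m` flag so that `^` matches at the start of every line; the sample input is an assumption chosen to show the failure modes.

```javascript
const fence = '`'.repeat(3) // a code fence marker, built indirectly

const markdown = [
  'Title',
  '=====',
  '',
  fence + 'sh',
  '# install dependencies',
  'npm install',
  fence
].join('\n')

// Naive approach: turn every line-leading `# ` into `## `.
const replaced = markdown.replace(/^#\s+/gm, '## ')
const lines = replaced.split('\n')

console.log(lines[0]) // "Title": the setext-style h1 was missed entirely
console.log(lines[4]) // "## install dependencies": a shell comment was rewritten
```

A parser knows that the first two lines form a heading and that the `#` line sits inside a code block, so a plugin working on the tree avoids both mistakes.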
In order to form an AST, unified takes an input string and passes that to a tokenizer. A tokenizer breaks up the input into tokens based on the syntax. In unified the tokenizer and lexer are coupled. When syntax is found the string is "eaten" and it's given metadata like node type (this is the "lexer").
Then, the parser turns this information into an AST. All together the pipeline looks like:
[INPUT] => [TOKENIZER/LEXER] => [PARSER] => [AST]
Consider this markdown input:
# Hello, **world**!
The tokenizer will match the "#" and create a heading node. Then it will begin searching for inline syntax where it will encounter "**" and create a strong node.
It's important to note that the parser first looks for block-level syntax which includes headings, code blocks, lists, paragraphs, and block quotes.
Once a block has been opened, inline tokenization begins which searches for syntax including bold, code, emphasis, and links.
The markdown will result in the following AST:
{
  "type": "heading",
  "depth": 1,
  "children": [
    {
      "type": "text",
      "value": "Hello, ",
      "position": {}
    },
    {
      "type": "strong",
      "children": [
        {
          "type": "text",
          "value": "world",
          "position": {}
        }
      ],
      "position": {}
    },
    {
      "type": "text",
      "value": "!",
      "position": {}
    }
  ],
  "position": {}
}
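The two phases described above (block syntax first, then inline syntax) can be sketched with a toy parser. This is a deliberately tiny example that only understands `#` headings and `**strong**`; the function names are assumptions, it is not how remark's parser is implemented, and it omits `position`.

```javascript
// Inline phase: split on `**...**` and build text / strong nodes.
function parseInline(value) {
  // With a capturing group, `split` keeps the captured text at odd indices.
  return value
    .split(/\*\*(.+?)\*\*/)
    .map((part, index) =>
      index % 2 === 1
        ? {type: 'strong', children: [{type: 'text', value: part}]}
        : {type: 'text', value: part}
    )
    .filter(node => node.type !== 'text' || node.value !== '')
}

// Block phase: recognise a heading, otherwise fall back to a paragraph.
function parseBlock(line) {
  const match = /^(#{1,6})\s+(.*)$/.exec(line)

  if (match) {
    return {type: 'heading', depth: match[1].length, children: parseInline(match[2])}
  }

  return {type: 'paragraph', children: parseInline(line)}
}

const ast = parseBlock('# Hello, **world**!')

console.log(JSON.stringify(ast, null, 2))
```

Apart from the missing positional information, the printed tree has the same shape as the AST shown above.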
A compiler turns an AST into output (typically a string). It provides functions that handle each node type and compiles them to the desired end result.
For example, a compiler for markdown would encounter a link node and transform it into []() markdown syntax.
[AST] => [COMPILER] => [OUTPUT]
It would turn the AST example above back into the source markdown when compiling to markdown. It could also be compiled to HTML and would result in:
<h1>
  Hello, <strong>world</strong>!
</h1>
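The "one function per node type" idea can be sketched as follows. The handler tables and the `compiler` helper are assumptions made for this example (they are not the remark-stringify or rehype-stringify implementations), and the HTML handlers skip escaping and whitespace formatting for brevity.

```javascript
// One handler per node type; each returns a string for that node.
const toMarkdown = {
  heading: (node, all) => '#'.repeat(node.depth) + ' ' + all(node),
  strong: (node, all) => '**' + all(node) + '**',
  text: node => node.value
}

const toHtml = {
  heading: (node, all) => `<h${node.depth}>${all(node)}</h${node.depth}>`,
  strong: (node, all) => `<strong>${all(node)}</strong>`,
  text: node => node.value // note: no HTML escaping in this sketch
}

// Builds a compiler from a table of handlers.
function compiler(handlers) {
  const one = node => handlers[node.type](node, all)
  const all = parent => parent.children.map(one).join('')
  return one
}

// The heading AST from the example above, without positions.
const ast = {
  type: 'heading',
  depth: 1,
  children: [
    {type: 'text', value: 'Hello, '},
    {type: 'strong', children: [{type: 'text', value: 'world'}]},
    {type: 'text', value: '!'}
  ]
}

const markdownOutput = compiler(toMarkdown)(ast)
const htmlOutput = compiler(toHtml)(ast)

console.log(markdownOutput) // # Hello, **world**!
console.log(htmlOutput) // <h1>Hello, <strong>world</strong>!</h1>
```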
unist is a specification for syntax trees which ensures that libraries that work with unified are as interoperable as possible. All ASTs in unified conform to this spec. It's the bread and butter of the ecosystem.
A standard AST allows developers to use the same visitor function on all formats, whether it's markdown, HTML, natural language, or MDX. Using the same library ensures that the core functionality is as solid as possible while cutting down on cognitive overhead when trying to perform common tasks.
When working with ASTs it's common to need to traverse the tree. This is typically referred to as "visiting". A handler for a particular type of node is called a "visitor".
unified comes with visitor utilities so you don't have to reinvent the wheel every time you want to operate on particular nodes.
unist-util-visit is a library that improves the DX of tree traversal for unist trees. It's a function that takes a tree, a node type, and a callback which it invokes with any matching nodes that are found.
visit(tree, 'image', node => {
  console.log(node)
})
Note: This performs a depth-first tree traversal in preorder (NLR).
Something that's useful with unist utilities is that they can be used on subtrees. A subtree would be any node in the tree that may or may not have children.
For example if you only wanted to visit images within heading nodes you could first visit headings, and then visit images contained within each heading node you encounter.
visit(tree, 'heading', headingNode => {
  visit(headingNode, 'image', node => {
    console.log(node)
  })
})
Once you're familiar with some of the primary unist utilities, you can combine them together to address more complex needs.
When you care about multiple node types and are operating on large documents it might be preferable to walk all nodes and add a check for each node type with unist-util-is.
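A minimal sketch of that single-pass approach. The `is` and `walk` helpers below are simplified stand-ins written for this example (the real unist-util-is has a richer API), and the tree is hand-written.

```javascript
// Stand-in for a node type check (assumption, not the real unist-util-is).
const is = (node, type) => node.type === type

// Walk every node exactly once, in preorder.
function walk(node, visitor) {
  visitor(node)
  for (const child of node.children || []) walk(child, visitor)
}

const tree = {
  type: 'root',
  children: [
    {
      type: 'heading',
      depth: 1,
      children: [{type: 'text', value: 'Logo '}, {type: 'image', url: 'logo.png'}]
    },
    {
      type: 'paragraph',
      children: [
        {type: 'link', url: 'https://example.com', children: [{type: 'text', value: 'a link'}]},
        {type: 'image', url: 'photo.png'}
      ]
    }
  ]
}

// One traversal, several node types handled along the way.
const counts = {heading: 0, image: 0, link: 0}

walk(tree, node => {
  if (is(node, 'heading')) counts.heading++
  else if (is(node, 'image')) counts.image++
  else if (is(node, 'link')) counts.link++
})

console.log(counts) // { heading: 1, image: 2, link: 1 }
```

Compared with calling a visitor once per node type, this touches each node a single time, which is where the benefit on large documents would come from.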
In some cases you might want to remove nodes based on their parent context. Consider a scenario where you want to remove all images contained within a heading.
You can achieve this by combining unist-util-visit with unist-util-remove. The idea is that you first visit the parent, which would be heading nodes, and then remove images from the subtree.
visit(tree, 'heading', headingNode => {
  remove(headingNode, 'image')
})
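To see the effect on a concrete tree, here is a self-contained sketch. Both `visit` and `remove` are simplified stand-ins written for this example; in particular, this `remove` only filters matching descendants in place and does not reproduce every behaviour of unist-util-remove.

```javascript
// Simplified stand-in for unist-util-visit (assumption).
function visit(node, type, visitor) {
  if (node.type === type) visitor(node)
  for (const child of node.children || []) visit(child, type, visitor)
}

// Simplified stand-in for unist-util-remove (assumption):
// drops all descendants of the given type, mutating the subtree.
function remove(node, type) {
  if (!node.children) return
  node.children = node.children.filter(child => child.type !== type)
  for (const child of node.children) remove(child, type)
}

const tree = {
  type: 'root',
  children: [
    {
      type: 'heading',
      depth: 1,
      children: [{type: 'text', value: 'Logo '}, {type: 'image', url: 'logo.png'}]
    },
    {type: 'paragraph', children: [{type: 'image', url: 'photo.png'}]}
  ]
}

visit(tree, 'heading', headingNode => {
  remove(headingNode, 'image')
})

const headingTypes = tree.children[0].children.map(node => node.type)
const paragraphTypes = tree.children[1].children.map(node => node.type)

console.log(headingTypes) // [ 'text' ]: the image in the heading is gone
console.log(paragraphTypes) // [ 'image' ]: images elsewhere are kept
```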
unified is the interface for working with syntax trees and can be used in the same way for any of the supported syntaxes.
For unified to work it requires two key pieces: a parser and a compiler.
A parser takes a string and tokenizes it based on syntax. A markdown parser would turn # Hello, world! into a heading node.
unified has a parser for each of its supported syntax trees.
A compiler turns an AST into its "output". This is typically a string. In some cases folks want to parse a markdown document, transform it, and then write back out markdown (like Prettier). In other cases folks might want to turn markdown into HTML.
unified already supports compilers for most common outputs including markdown, HTML, text, and MDX. It even offers compilers for less common use cases including compiling markdown to CLI manual pages.
unified also offers transpilers. This is how one syntax tree is converted to another format.
The most common transpiler is mdast-util-to-hast which converts the markdown AST (mdast) to the HTML AST (hast).
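Converting one tree to another amounts to mapping each node type of the source format onto a node of the target format. The sketch below is a toy version that handles only three node types; `toHast` and `element` are names invented for this example, and the real mdast-util-to-hast covers far more cases.

```javascript
// hast elements carry a `tagName`, `properties`, and `children`.
function element(tagName, children) {
  return {type: 'element', tagName, properties: {}, children}
}

// Toy mdast -> hast conversion (assumption: only three node types are handled).
function toHast(node) {
  switch (node.type) {
    case 'heading':
      return element('h' + node.depth, node.children.map(toHast))
    case 'strong':
      return element('strong', node.children.map(toHast))
    case 'text':
      return {type: 'text', value: node.value}
    default:
      throw new Error('Unhandled node type: ' + node.type)
  }
}

const mdast = {
  type: 'heading',
  depth: 1,
  children: [
    {type: 'text', value: 'Hello, '},
    {type: 'strong', children: [{type: 'text', value: 'world'}]},
    {type: 'text', value: '!'}
  ]
}

const hast = toHast(mdast)

console.log(hast.tagName) // h1
console.log(hast.children.map(child => child.tagName || child.type))
// [ 'text', 'strong', 'text' ]
```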
unified should be invoked:
unified()
Passed plugins:
.use(remarkParse)
And then given a string to operate on:
.process('# Hello, world!', (err, file) => {
  console.log(String(file))
})
A more real-world example might want to turn a markdown document into an HTML string which would look something like:
var unified = require('unified')
var markdown = require('remark-parse')
var remark2rehype = require('remark-rehype')
var doc = require('rehype-document')
var format = require('rehype-format')
var html = require('rehype-stringify')
var report = require('vfile-reporter')

unified()
  .use(markdown)
  .use(remark2rehype)
  .use(doc, {title: '👋🌍'})
  .use(format)
  .use(html)
  .process('# Hello world!', function(err, file) {
    console.error(report(err || file))
    console.log(String(file))
  })
The code is doing the following:
- Receives a markdown string (process())
- Parses the markdown (.use(markdown))
- Converts the mdast to hast (.use(remark2rehype))
- Wraps the hast in a document (.use(doc))
- Formats the hast (.use(format))
- Converts the hast to HTML (.use(html))
It'll result in an HTML string:
<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>👋🌍</title>
    <meta name="viewport" content="width=device-width, initial-scale=1">
  </head>
  <body>
    <h1>Hello world!</h1>
  </body>
</html>
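The parse, transform, and compile flow above can be pictured with a toy processor. Everything in this sketch is an assumption made for illustration: `processor`, `parseLines`, `shout`, and `stringify` are invented names, and the real unified attaches parsers and compilers differently. Only the chaining style and the order of the phases are meant to carry over.

```javascript
// Toy processor (a simplification, not unified's implementation).
function processor() {
  const transformers = []

  const api = {
    parser: null,
    compiler: null,
    // A plugin may configure the processor and may return a transformer.
    use(plugin, options) {
      const transformer = plugin(api, options)
      if (typeof transformer === 'function') transformers.push(transformer)
      return api
    },
    // Parse the input, run all transformers in order, then compile.
    processSync(input) {
      let tree = api.parser(input)
      for (const transform of transformers) tree = transform(tree) || tree
      return api.compiler(tree)
    }
  }

  return api
}

// Hypothetical parser plugin: every line becomes a text node.
const parseLines = api => {
  api.parser = input => ({
    type: 'root',
    children: input.split('\n').map(value => ({type: 'text', value}))
  })
}

// Hypothetical transformer plugin: upper-cases all text nodes in place.
const shout = () => tree => {
  tree.children.forEach(node => {
    node.value = node.value.toUpperCase()
  })
}

// Hypothetical compiler plugin: joins the text nodes back into a string.
const stringify = api => {
  api.compiler = tree => tree.children.map(node => node.value).join('\n')
}

const output = processor()
  .use(parseLines)
  .use(shout)
  .use(stringify)
  .processSync('hello\nworld')

console.log(output) // prints HELLO and WORLD on separate lines
```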
remark is a plugin-based markdown processor. It has the ability to parse markdown, transform it with plugins, and then write back to markdown or transpile it to another format like HTML.
It's highly configurable. Even plugins can customize the parser and compiler if needed.
You can use the remark library directly in your scripts:
remark()
  .processSync('# Hello, world!')
Though, it's really a shortcut for:
unified()
  .use(remarkParse)
  .use(remarkStringify)
  .processSync('# Hello, world!')
remark offers a CLI which can be used to automate tasks.
A useful option with the remark CLI is inspecting the AST of a document. This can be useful when you're trying to remember the name of a node type or you want an overview of the overall structure.
❯ remark doc.md --inspect
root[13] (1:1-67:1, 0-2740)
├─ paragraph[1] (1:1-1:64, 0-63)
│ └─ text: "import TableOfContents from '../src/components/TableOfContents'" (1:1-1:64, 0-63)
├─ heading[1] (3:1-3:15, 65-79) [depth=1]
│ └─ text: "Fecunda illa" (3:3-3:15, 67-79)
├─ html: "<TableOfContents headings={props.headings} />" (5:1-5:46, 81-126)
├─ heading[1] (7:1-7:18, 128-145) [depth=2]
│ └─ text: "Sorore extulit" (7:4-7:18, 131-145)
├─ paragraph[1] (9:1-12:75, 147-454)
│ └─ text: "Lorem markdownum sorore extulit, non suo putant tritumque amplexa silvis: in,\nlascivaque femineam ara etiam! Oppida clipeus formidine, germanae in filia\netiamnunc demisso visa misce, praedaeque protinus communis paverunt dedit, suo.\nSertaque Hyperborea eatque, sed valles novercam tellure exhortantur coegi." (9:1-12:75, 147-454)
├─ list[3] (14:1-16:58, 456-573) [ordered=true][start=1][spread=false]
│ ├─ listItem[1] (14:1-14:22, 456-477) [spread=false]
│ │ └─ paragraph[1] (14:4-14:22, 459-477)
│ │ └─ text: "Cunctosque plusque" (14:4-14:22, 459-477)
│ ├─ listItem[1] (15:1-15:38, 478-515) [spread=false]
│ │ └─ paragraph[1] (15:4-15:38, 481-515)
│ │ └─ text: "Cum ego vacuas fata nolet At dedit" (15:4-15:38, 481-515)
│ └─ listItem[1] (16:1-16:58, 516-573) [spread=false]
│ └─ paragraph[1] (16:4-16:58, 519-573)
│ └─ text: "Nec legerat ostendisse ponat sulcis vincirem cinctaque" (16:4-16:58, 519-573)
You can use plugins with the CLI:
remark doc.md --use toc
This will output a markdown string with a table of contents added. If you'd like, you can overwrite the document with the generated table of contents:
remark doc.md -o --use toc
You can use a lint preset to ensure your markdown style guide is adhered to:
❯ remark doc.md --use preset-lint-markdown-style-guide
15:1-15:38 warning Marker should be `1`, was `2` ordered-list-marker-value remark-lint
16:1-16:58 warning Marker should be `1`, was `3` ordered-list-marker-value remark-lint
34:1-60:6 warning Code blocks should be fenced code-block-style remark-lint
⚠ 4 warnings
If you want to exit with a failure code (1) when the lint doesn't pass you can use the --frail option:
❯ remark doc.md --frail --use preset-lint-markdown-style-guide || echo '!!!failed'
15:1-15:38 warning Marker should be `1`, was `2` ordered-list-marker-value remark-lint
16:1-16:58 warning Marker should be `1`, was `3` ordered-list-marker-value remark-lint
34:1-60:6 warning Code blocks should be fenced code-block-style remark-lint
⚠ 4 warnings
!!!failed
unist-util-visit is useful for visiting nodes in an AST based on a particular type. To visit all headings you can use it like so:
const visit = require('unist-util-visit')

module.exports = () => tree => {
  visit(tree, 'heading', node => {
    console.log(node)
  })
}
The above will log all heading nodes. Heading nodes also have a depth field which indicates whether it's h1-h6. You can use that to narrow down what heading nodes you want to operate on.
Below is a plugin that prefixes "BREAKING" to all h1s in a markdown document.
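const visit = require('unist-util-visit')

module.exports = () => tree => {
  visit(tree, 'heading', node => {
    // Only operate on h1s.
    if (node.depth !== 1) {
      return
    }

    visit(node, 'text', textNode => {
      textNode.value = 'BREAKING ' + textNode.value
      // Stop after the first text node so the prefix is added only once,
      // even when the heading contains several text nodes.
      return visit.EXIT
    })
  })
}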
rehype is an HTML processor in the same way that remark is for markdown.
rehype()
  .processSync('<title>Hi</title><h2>Hello world!')
MDX is a syntax and language for embedding JSX in markdown. It allows you to embed components in your documents for writing immersive and interactive content.
An example MDX document looks like:
import SnowfallChart from '../components/snowfall-chart'

# Last year's snowfall

In the winter of 2018, the snowfall was above average. It was followed by
a warm spring which caused flood conditions in many of the nearby rivers.

<SnowfallChart year="2018" />
The MDX core library extends the remark parser with the remark-mdx plugin in order to define its own JSX-enabled syntax.
MDX uses remark and rehype internally. The flow of MDX consists of the following six steps:
- Parse: MDX text => MDAST
- Transpile: MDAST => MDXAST (remark-mdx)
- Transform: remark plugins applied to AST
- Transpile: MDXAST => MDXHAST
- Transform: rehype plugins applied to AST
- Generate: MDXHAST => JSX text
The final result is JSX that can be used in React/Preact/Vue/etc.
MDX allows you to hook into this flow at step 3 and 5, where you can use remark and rehype plugins (respectively) to benefit from their ecosystems.
Tree traversal is a common task when working with a tree to search it. Tree traversal is typically either breadth-first or depth-first.
In the following examples, we’ll work with this tree:
                 +---+
                 | A |
                 +-+-+
                   |
             +-----+-----+
             |           |
           +-+-+       +-+-+
           | B |       | F |
           +-+-+       +-+-+
             |           |
    +-----+--+--+        |
    |     |     |        |
  +-+-+ +-+-+ +-+-+    +-+-+
  | C | | D | | E |    | G |
  +---+ +---+ +---+    +---+
Breadth-first traversal is visiting a node and all its siblings to broaden the search at that level, before traversing children.
For the syntax tree defined in the diagram, a breadth-first traversal first searches the root of the tree (A), then its children (B and F), then their children (C, D, E, and G).
Alternatively, and more commonly, depth-first traversal is used. The search is first deepened, by traversing children, before traversing siblings.
For the syntax tree defined in the diagram, a depth-first traversal first searches the root of the tree (A), then one of its children (B or F), then their children (C, D, and E, or G).
For a given node N with children, a depth-first traversal performs three steps, simplified to only binary trees (every node has head and tail, but no other children):
- N: visit N itself
- L: traverse head
- R: traverse tail
These steps can be done in any order, but for non-binary trees, L and R occur together. If L is done before R, the traversal is called left-to-right traversal, otherwise it is called right-to-left traversal. In the case of non-binary trees, the other children between head and tail are processed in that order as well, so for left-to-right traversal, first head is traversed (L), then its next sibling is traversed, etcetera, until finally tail (R) is traversed.
Because L and R occur together for non-binary trees, we can produce four types of orders: NLR, NRL, LRN, RLN.
NLR and LRN (the two left-to-right traversal options) are most commonly used and respectively named preorder and postorder.
For the syntax tree defined in the diagram, preorder and postorder traversal both start at the root of the tree (A), then traverse its head (B), then B's children from left to right (C, D, and then E). After all descendants of B are traversed, its next sibling (F) is traversed and then finally its only child (G). The two differ in when each node itself is visited: preorder visits A, B, C, D, E, F, G, whereas postorder visits C, D, E, B, G, F, A.
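The traversal orders discussed above can be checked against the diagram with a short sketch. The `node` helper and the `name` field are assumptions made to keep the example small; they are not part of unist.

```javascript
// Build the tree from the diagram.
const node = (name, children = []) => ({name, children})

const tree = node('A', [
  node('B', [node('C'), node('D'), node('E')]),
  node('F', [node('G')])
])

// Breadth-first: finish a level before moving on to the next one.
function breadthFirst(root) {
  const order = []
  const queue = [root]

  while (queue.length > 0) {
    const current = queue.shift()
    order.push(current.name)
    queue.push(...current.children)
  }

  return order
}

// Depth-first preorder (NLR): visit the node, then its children left to right.
function preorder(current, order = []) {
  order.push(current.name)
  current.children.forEach(child => preorder(child, order))
  return order
}

// Depth-first postorder (LRN): children left to right, then the node itself.
function postorder(current, order = []) {
  current.children.forEach(child => postorder(child, order))
  order.push(current.name)
  return order
}

console.log(breadthFirst(tree).join(' ')) // A B F C D E G
console.log(preorder(tree).join(' ')) // A B C D E F G
console.log(postorder(tree).join(' ')) // C D E B G F A
```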
A tree is a node and all of its descendants (if any).
Node X is a child of node Y, if Y’s children include X.
Node X is a parent of node Y, if Y is a child of X.
The index of a child is its number of preceding siblings, or 0 if it has none.
Node X is a sibling of node Y, if X and Y have the same parent (if any).
The previous sibling of a child is its sibling at its index minus 1.
The next sibling of a child is its sibling at its index plus 1.
The root of a node is itself, if without parent, or the root of its parent.
The root of a tree is any node in that tree without parent.
Node X is a descendant of node Y, if X is a child of Y, or if X is a child of node Z that is a descendant of Y.
An inclusive descendant is a node or one of its descendants.
Node X is an ancestor of node Y, if Y is a descendant of X.
An inclusive ancestor is a node or one of its ancestors.
The head of a node is its first child (if any).
The tail of a node is its last child (if any).
A leaf is a node with no children.
A branch is a node with one or more children.
A node is generated if it does not have positional information.
The type of a node is the value of its type field.
The positional information of a node is the value of its position field.
A file is a source document that represents the original file that was parsed to produce the syntax tree. Positional information represents the place of a node in this file. Files are provided by the host environment and not defined by unist.
For example, see projects such as vfile.
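Several of the terms above translate directly into one-line helpers. The sketch below is only an illustration of the definitions: the helper names are invented for this example and are not an existing unist utility API, and the sample tree is hand-written.

```javascript
// Head and tail: first and last child, if any.
const head = node => (node.children && node.children[0]) || null
const tail = node => (node.children && node.children[node.children.length - 1]) || null

// Leaf and branch: without or with children.
const isLeaf = node => !node.children || node.children.length === 0
const isBranch = node => !isLeaf(node)

// Generated: the node has no positional information.
const isGenerated = node => !node.position

// Index and siblings are defined relative to a parent.
const indexOf = (parent, child) => parent.children.indexOf(child)
const nextSibling = (parent, child) => parent.children[indexOf(parent, child) + 1] || null
const previousSibling = (parent, child) => parent.children[indexOf(parent, child) - 1] || null

// Hand-written nodes; only `hello` has positional information.
const hello = {
  type: 'text',
  value: 'Hello, ',
  position: {start: {line: 1, column: 3}, end: {line: 1, column: 10}}
}
const strong = {type: 'strong', children: [{type: 'text', value: 'world'}]}
const bang = {type: 'text', value: '!'}
const heading = {type: 'heading', depth: 1, children: [hello, strong, bang]}

console.log(head(heading) === hello) // true
console.log(tail(heading) === bang) // true
console.log(indexOf(heading, strong)) // 1
console.log(nextSibling(heading, strong) === bang) // true
console.log(isLeaf(bang), isBranch(strong)) // true true
console.log(isGenerated(bang), isGenerated(hello)) // true false
```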
Preorder (NLR) is depth-first tree traversal that performs the following steps for each node N:
- N: visit N itself
- L: traverse head (then its next sibling, recursively moving forward until reaching tail)
- R: traverse tail
Postorder (LRN) is depth-first tree traversal that performs the following steps for each node N:
- L: traverse head (then its next sibling, recursively moving forward until reaching tail)
- R: traverse tail
- N: visit N itself
Enter is a step right before other steps performed on a given node N when traversing a tree.
For example, when performing preorder traversal, enter is the first step taken, right before visiting N itself.
Exit is a step right after other steps performed on a given node N when traversing a tree.
For example, when performing preorder traversal, exit is the last step taken, right after traversing the tail of N.
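Enter and exit can be made visible by logging both steps during a traversal. The `traverse` function and its `enter`/`exit` callbacks are assumptions made for this sketch, not the API of a particular unist utility.

```javascript
// Calls `enter` before and `exit` after all other work on a node.
function traverse(node, {enter, exit}) {
  enter(node)
  for (const child of node.children || []) traverse(child, {enter, exit})
  exit(node)
}

// Hand-written tree: a heading with a text node and a strong node.
const tree = {
  type: 'heading',
  depth: 1,
  children: [
    {type: 'text', value: 'Hello, '},
    {type: 'strong', children: [{type: 'text', value: 'world'}]}
  ]
}

const log = []

traverse(tree, {
  enter: node => log.push('enter ' + node.type),
  exit: node => log.push('exit ' + node.type)
})

console.log(log.join('\n'))
```

Each node is entered before any of its descendants and exited after all of them, so the heading's enter is the first entry in the log and its exit is the last.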
unified was originally created by Titus Wormer. It's now governed by a collective which handles the many GitHub organizations, repositories, and packages that are part of the greater unified ecosystem.
The collective and its governance won't be addressed in this handbook. If you're interested, you can read more about the collective on GitHub.
This handbook is inspired by the babel-handbook written by James Kyle.
- unist nodes are accompanied by positional information. To keep AST printouts as simple as possible, it will be an empty object ("position": {}) when it isn't relevant for the example.