Org-mode parser with focus on scripting, structured data processing and ease of embedding in software.
- Built for fast processing – core is implemented in C++
- With scripting in mind – Python API, export to structured data (JSON, Yaml, S-expressions, protobuf)
- Strictly defined schema – strongly typed AST/DOM structure for individual nodes, auto-generated protobuf grammar, extensive use of vocabulary types in the data model.
- Expressive debug capabilities – built-in tracing framework for all document processing steps for ease of debugging
- Extensive testing – curated corpus of documents, fuzzy content generation for automatic bug detection using python hypothesis.
Currently working on the second phase of the Python API, improving the ergonomics and adding more tests. Next step is to implement the CI for testing, documentation and artifact builds.
Phase | Status | Notes |
---|---|---|
Qt editor | 🚧 WIP | Qt/C++ editor for the org-mode |
C++ core | 🚧 Alpha | Whole pipeline is complete, working on testing |
Python API | 🚧 Alpha | Whole DOM model is exposed, working on API improvements |
C API | ⬜ Todo | Make the library usable from virtually any PL |
- ✅ Machine-readable data model – describe the data model in machine-readable format and generate C++ code for it
- ✅ Data transform pipeline – lexer, parser, sem convert
- ✅ Document test corpus – yaml specification for the given/expected files
- ✅ WIP Org-mode AST back to string – for document formatting and round-trip PBT testing (make random document, render, parse back, compare with original)
- ✅ Automatically wrap C++ data model for python – done, tested, working on ergonomics improvements
- 🚧 Implement common data format exporters – (latex, HTML, XML, Pandoc)
- 🚧 Use hypothesis for fuzz testing – porting earlier C++
google/fuzztest
implementation
- 🚧 Edit text documents with a fusion of WYSIWYG and text-based approach, similar to the Jupyter markdown cells
- 🚧 Data-driven mind map visualization powered by graphviz and adaptagrams automatic graph layout.
- ⬜ Automatic wrapper of the C++ API to C
- ⬜ Machine-readable description of the C API
Main entry point for file manipulation – python script implementing commands for export, analysis and file modifications.
pybind11
python module exposing the org-mode AST for scripting.
Not a fan of sprawling python dependencies when you only need to parse the file? Totally understandable. haxorg_lite
is intended as a self-contained binary for the C++ parser core – parse org-mode and output to json, yaml, binary protobuf, protobuf JSON, reformatted org-mode, S-expressions.
There is no working and usable org-mode parser that can be decoupled from emacs and even emacs one is absolute garbage. Let’s start with the ‘canonical’ implementation: aside from the fact it seems more like a huge collection of random regex rules, it barely has an AST, certainly not in a way that is ready for export into other consumers. Exporter customization system still mainly revolves around string manipulation. Very few parts are implemented as a data processing pipelines where you take in an AST and return some other content – this is not limited to the exporters, org agenda customization is also a string processing. subtree.isRecurring()
is easy to implement in the C++, but for emacs it is a re-search-forward
with some rx
hacks on top. And so on.
So while the emacs is certainly a good org-mode editor, it does a terrible job at being org-mode processor (all default exports block the UI, batch exporting in CLI is something you need to hack around) unless you are planning to dive knee-deep into the lisp programming and figure out all the details of how things need to fit together. Adding support for new source block languages is also tricky. And be a lisp programmer, again. Most people aren’t lisp programmers – I’m not one, even after using emacs for 5+ years at this point. There are far more python and c++ programmers out there than lisp ones.
This pretty much sums up the problem statement – implement an org-mode parser in some programming language that /I/ know and expose the interface in python for quicker scripting. C++ fits the bill, so that’s what I went with. Might’ve been a good opportunity to use Rust or Zig or some other PL, but as it turned out the C++ can be moved into a very ergonomic direction even without full syntax revamps like Carbon or cppfront
(aka C++ Syntax 2).
After I stated what in the world I’m doing here in this project, lets take a closer look at how I’m planning to actually carry this out. Let’s go over the development tools first. The programming language is C++, specifically the latest C++23 – to simplify toolchain and stdlib bundling I will just use LLVM releases directly. Dependencies are managed by submodules because not all the libraries I used even have conan packaging (fuzztest
, abseil, libgit2
(1 year outdated), other things). And
Emacs is still the reference implementation, but sensible extensions taken either from the common packages or ones that I use personally (nested tags #parent##sub##[subsub1,subsub2]
, @mention
, admonition blocks and NOTE:
prefixes) will be implemented and tested as well. AST structure will conform to whatever data model makes the most sense, not necessarily following the S-Expr blurbs at the org-syntax.
Unit testing for the regular cxx code if possible, plus collection of test documents in the .yaml
spec corpus (tests/org/corpus), followed by the hypothesis-based random document generation. Each test document goes through the whole lex-parse-sem process, then to AST->string
formatter and parsed again. This ensures every test validates the whole processing pipeline, even if no intermediate assertions are provided. For more on testing, read the ARCHITECTURE.org section “Testing infrastructure”.
Where is the project on the roadmap at the moment, are there any fixed plans or it is just me bumping around the code and fixing things if I see anything that catches my attention this particular moment? Not in a formal sense at the moment, but a rough outline of the things I want to do is:
- Finish rewrite to the standard library types and RE-flex lexer – implementation with Qt types was working correctly as far back as August 2023, but since then I decided to completely drop dependency on Qt, use the RE-flex lexer instead of hand-rolled one and so some other things reorganizing the project. It has taken quite a bit of time, the main missing link being the new lexer implementation. Parser and sem convert don’t have to change as much.
- Stabilize exposed python API –
pybind11
wrapper generation relies on the