GitHub contains a large corpus of data that is amenable to NLP, in the form of Issues, READMEs, pull request comments, and other items. However, this text is often accompanied by markdown, which allows the user to specify styling (bold, underline, headings) and specialized formatting (code blocks, tables, block quotes, hyperlinks). This library has two goals:
The first goal is to annotate markdown elements with marker tokens, so that markdown information is not lost. For example, a list block is enclosed with xxxlistB and xxxlistE, and a code block is enclosed with xxxcdb and xxxcde. Other notable examples:
- @mentions: xxxatmention (the handle is removed and replaced by just this indicator)
- quote blocks: xxxqb/xxxqe
- strikethrough: xxxdelb/xxxdele
- horizontal rule: xxxhr
- {large, medium, small} headers: annotated with xxxh{l,m,s}. H1=large, H2-3=medium, H4-6=small.
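The annotation rules above can be illustrated with a minimal sketch. The helper names `replace_mentions` and `header_token` and the mention regex are assumptions made for this example; they are not part of the mdparse API and may not match how the library detects handles.

```python
import re

# Hypothetical pattern for illustration: an "@" followed by a handle made of
# letters, digits, and hyphens, not preceded by a word character (so e-mail
# addresses such as a@b.com are left alone).
MENTION_RE = re.compile(r"(?<![\w@])@[A-Za-z0-9][A-Za-z0-9-]*")


def replace_mentions(text: str) -> str:
    """Drop each @handle and leave only the xxxatmention indicator."""
    return MENTION_RE.sub("xxxatmention", text)


def header_token(level: int) -> str:
    """Map a header level to its token: H1=large, H2-3=medium, H4-6=small."""
    if level == 1:
        return "xxxhl"
    if level in (2, 3):
        return "xxxhm"
    if level in (4, 5, 6):
        return "xxxhs"
    raise ValueError(f"header level must be 1-6, got {level}")


print(replace_mentions("thanks @octocat for the fix"))
print(header_token(3))
```

Replacing the handle outright, rather than keeping it next to the token, keeps individual usernames out of the vocabulary while preserving the fact that someone was mentioned.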
The second goal is to trim verbose content. GitHub issues often contain a large stack trace or a large table of data. This library comes equipped with sensible defaults that surface the most relevant information and discard what would otherwise be a large number of characters for a machine learning algorithm to handle:
- Code Blocks: only the first two and last two rows are kept
- Tables: only table headers are kept
- URLs: only the host is kept. For example, www.google.com/search is reformatted to www.google.com
- Images: the image is discarded, but the file extension and the metadata about the image (the text available to screen readers) are extracted.
- IP addresses and extremely long numbers: marked as xxunk
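Two of the defaults above, code block truncation and URL reduction, can be sketched with the standard library. This is an illustration under stated assumptions, not the library's implementation: `truncate_code` and `url_host` are hypothetical names, and short blocks are assumed to pass through unchanged.

```python
from urllib.parse import urlsplit


def truncate_code(code: str, keep: int = 2) -> str:
    """Keep only the first `keep` and last `keep` rows of a code block."""
    lines = code.splitlines()
    if len(lines) <= 2 * keep:
        # Assumption: blocks that are already short are returned as-is.
        return code
    return "\n".join(lines[:keep] + lines[-keep:])


def url_host(url: str) -> str:
    """Reduce a URL to its host, e.g. www.google.com/search -> www.google.com."""
    # urlsplit only recognizes the host when the URL has a scheme or starts
    # with "//", so bare URLs are given a "//" prefix first.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    return urlsplit(url).hostname or ""


trace = "line 1\nline 2\nline 3\nline 4\nline 5\nline 6"
print(truncate_code(trace))
print(url_host("www.google.com/search"))
```

Keeping the head and tail of a block fits stack traces well, since the first rows usually name the failing call and the last rows carry the error message.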
This parser works by converting markdown to HTML then converting the HTML (along with some of the HTML tags, in certain cases) to text.
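The second stage of that pipeline, turning HTML into annotated text, might look like the sketch below. It uses only Python's built-in `html.parser` and assumes the markdown has already been rendered to HTML by some converter. The class `AnnotatingParser`, the function `html_to_text`, and the tag-to-token table are hypothetical; in particular, mapping `<del>`, `<blockquote>`, `<ul>`/`<ol>`, and `<hr>` to the tokens listed earlier is an assumption about how the HTML tags line up with those tokens.

```python
from html.parser import HTMLParser

# Assumed tag-to-token table for illustration; it only uses tokens that this
# README mentions.
BLOCK_TOKENS = {
    "blockquote": ("xxxqb", "xxxqe"),
    "del": ("xxxdelb", "xxxdele"),
    "ul": ("xxxlistB", "xxxlistE"),
    "ol": ("xxxlistB", "xxxlistE"),
}


class AnnotatingParser(HTMLParser):
    """Collect text and emit begin/end tokens for selected HTML tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "hr":
            self.parts.append("xxxhr")
        elif tag in BLOCK_TOKENS:
            self.parts.append(BLOCK_TOKENS[tag][0])

    def handle_endtag(self, tag):
        if tag in BLOCK_TOKENS:
            self.parts.append(BLOCK_TOKENS[tag][1])

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = AnnotatingParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.parts)


print(html_to_text("<p>keep <del>old</del> text</p><hr>"))
```

Working from HTML rather than raw markdown means the markdown-to-HTML converter resolves syntax ambiguities, so the text stage only has to react to well-formed tags.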
pip install mdparse
This library makes extremely opinionated choices on how to parse markdown and filter information. This library is for experimental purposes only, and may not be appropriate for every problem. Please use with caution.
The primary use case of this parser has been to prepare a large corpus of GitHub Issue data for a language model. However, we envision this parser would be applicable to other machine learning tasks involving the extraction of features from the text of GitHub Issues, Readme files, or pull request comments.
See /notebooks/Demo.ipynb for an example of the transformations this parser does on a markdown file.