Contributing

We welcome any contribution that makes this project better.

You can contribute by raising or commenting on an issue, or by submitting a pull request.

Ideas

There are many aspects you could improve:

- Conversion step, e.g.:
  - Paragraph detection
  - Formula extraction
  - Table extraction
- Computer Vision Model (issue #1)
- Training Data:
  - The training data should include PDF and full-text XML
  - Public data is preferable because it can be shared. We are also looking to partner with publishers who cannot make their data public but can provide it for training purposes
- Training Data preparation:
  - We use the XML to annotate the PDF characters. This works reasonably well but still needs improvement.
- Community
- Your ideas here
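
For contributors interested in the training data preparation idea above, here is a minimal sketch of what annotating PDF characters from XML can look like. It is an illustration only, not the project's actual implementation. The XML layout, the tag names, the sample text and the function name `annotate_characters` are all hypothetical, and the sketch assumes the PDF text has already been extracted into a plain string.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout and tag names, chosen only for this illustration.
XML = """<article>
  <title>Deep learning for PDFs</title>
  <abstract>We convert PDFs to XML.</abstract>
</article>"""

# Hypothetical PDF text, assumed to be already extracted as a flat string.
PDF_TEXT = "Deep learning for PDFs Jane Doe We convert PDFs to XML."


def annotate_characters(xml_string, pdf_text):
    """Return one tag per PDF character (None where nothing matched)."""
    tags = [None] * len(pdf_text)
    root = ET.fromstring(xml_string)
    for element in root:
        value = (element.text or "").strip()
        if not value:
            continue
        # Exact substring match only; real documents would need fuzzy
        # matching to cope with hyphenation, ligatures and layout noise.
        start = pdf_text.find(value)
        if start < 0:
            continue
        for i in range(start, start + len(value)):
            tags[i] = element.tag
    return tags


tags = annotate_characters(XML, PDF_TEXT)
```

Each character covered by an XML field receives that field's tag, and characters with no counterpart in the XML (here, the author name) stay untagged. Exact matching keeps the sketch short; handling text that differs slightly between the PDF and the XML is where most of the improvement work would lie.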