This repository has been archived by the owner on Mar 30, 2022. It is now read-only.


CONTRIBUTING.md

Contributing

We welcome any contribution that makes this project better.

You can contribute by raising or commenting on an issue, or by submitting a pull request.

Ideas

There are many aspects of the project you could improve:

  • Conversion step, e.g.:
    • Paragraph detection
    • Formula extraction
    • Table extraction
  • Computer Vision Model (issue #1)
  • Training Data:
    • The training data should include both the PDF and the full-text XML
    • Public data is preferable because it can be shared, but we are also looking to partner with publishers that cannot make their data public yet can provide it for training purposes
  • Training Data preparation:
    • We use the XML to annotate the PDF characters. This works reasonably well but still needs improvement.
  • Community
  • Your ideas here