PDF-shape is a Rust library for analysing XML files produced by pdf2xml.
Implemented:
- Alignment extraction
- Coordinates extraction
- Shape extraction
- Spacing extraction
- Style extraction
- Blocks extraction (get all the block elements of a given document)
- Texts extraction (get all the text elements of a given document)
- Tokens extraction (get all the token elements of a given document)
Not implemented yet:
- Line detection
- Column detection
- Paragraph detection
- Blocks detection
You can run the example with:
cargo run --example=main
You can build the documentation with:
cargo doc --open --lib --no-deps
The following diagram represents the shape of objects (or sets of objects) and the spacing between them.
A line is a set of objects that share the same base or are horizontally aligned. The horizontal spacing between adjacent objects in a line should not exceed the horizontal spacing mode of the document (the most frequent horizontal spacing value).
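Line detection is not implemented yet, so the following is only a minimal sketch of how the definition above could be applied. The `Object` struct, the `group_into_lines` function and the `base_tolerance` parameter are hypothetical names, not part of the PDF-shape API, and the document's horizontal spacing mode is assumed to be computed elsewhere and passed in.

```rust
/// Hypothetical, simplified object: horizontal extent plus the y coordinate of its base.
#[derive(Debug, Clone, PartialEq)]
pub struct Object {
    pub x_min: f64,
    pub x_max: f64,
    pub base: f64,
}

/// Groups objects into lines: objects must share the same base (within
/// `base_tolerance`) and the gap between neighbours must not exceed `spacing_mode`.
pub fn group_into_lines(
    objects: &[Object],
    spacing_mode: f64,
    base_tolerance: f64,
) -> Vec<Vec<Object>> {
    // Step 1: cluster objects by base, after sorting them by base.
    let mut sorted = objects.to_vec();
    sorted.sort_by(|a, b| a.base.total_cmp(&b.base));

    let mut rows: Vec<Vec<Object>> = Vec::new();
    for obj in sorted {
        // Compare against the first object of the current row.
        let fits = rows
            .last()
            .map_or(false, |row| (obj.base - row[0].base).abs() <= base_tolerance);
        if fits {
            rows.last_mut().unwrap().push(obj);
        } else {
            rows.push(vec![obj]);
        }
    }

    // Step 2: inside each row, go left to right and split on gaps wider than the mode.
    let mut lines: Vec<Vec<Object>> = Vec::new();
    for mut row in rows {
        row.sort_by(|a, b| a.x_min.total_cmp(&b.x_min));
        let mut current: Vec<Object> = Vec::new();
        for obj in row {
            let too_far = current
                .last()
                .map_or(false, |prev| obj.x_min - prev.x_max > spacing_mode);
            if too_far {
                lines.push(std::mem::take(&mut current));
            }
            current.push(obj);
        }
        if !current.is_empty() {
            lines.push(current);
        }
    }
    lines
}
```

With a spacing mode of 3, two objects on the same base separated by a gap of 2 end up in the same line, while a third object 30 units further right starts a new line.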
A paragraph is a set of lines that are equally spaced vertically. In most cases the paragraph spacing is greater than the document line spacing. All lines of a paragraph must be vertically aligned.
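Paragraph detection is not implemented yet either. As a rough sketch of the rule above, lines can be walked from top to bottom and a new paragraph started whenever the vertical gap exceeds the document line spacing or the alignment changes. The `Line` struct and `group_into_paragraphs` function are hypothetical names, not part of the PDF-shape API. The sketch assumes that y grows downward, that "vertically aligned" means sharing the same left edge, and that the document line spacing is already known.

```rust
/// Hypothetical, simplified line: left edge plus vertical extent (y grows downward).
#[derive(Debug, Clone, PartialEq)]
pub struct Line {
    pub left: f64,
    pub top: f64,
    pub bottom: f64,
}

/// Groups lines into paragraphs. A line starts a new paragraph when the gap to
/// the previous line is larger than `line_spacing` (plus `tolerance`) or when
/// its left edge differs from the previous line's left edge.
pub fn group_into_paragraphs(
    lines: &[Line],
    line_spacing: f64,
    tolerance: f64,
) -> Vec<Vec<Line>> {
    // Process the lines from top to bottom.
    let mut sorted = lines.to_vec();
    sorted.sort_by(|a, b| a.top.total_cmp(&b.top));

    let mut paragraphs: Vec<Vec<Line>> = Vec::new();
    let mut current: Vec<Line> = Vec::new();
    for line in sorted {
        let breaks = current.last().map_or(false, |prev| {
            let gap = line.top - prev.bottom;
            let aligned = (line.left - prev.left).abs() <= tolerance;
            gap > line_spacing + tolerance || !aligned
        });
        if breaks {
            paragraphs.push(std::mem::take(&mut current));
        }
        current.push(line);
    }
    if !current.is_empty() {
        paragraphs.push(current);
    }
    paragraphs
}
```

Comparing each gap to the document line spacing is a simplification of "equally spaced": it accepts any gap up to the line spacing rather than checking that all gaps inside one paragraph are equal.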