LibPdf is a fast and efficient Node.js library for converting PDF files to text. This open-source project aims to simplify the process of extracting text content from PDFs, making it easier for developers to work with PDF data in their applications.
Features
- Fast PDF to text conversion
- Easy-to-use API
To install libpdf and its dependencies, ensure you have a supported version of Node installed. You can install the project with npm. In the project directory, run:
$ npm install libpdf --save
This command installs the project along with its dependencies.
To build libpdf, you need to have Rust installed. If you have already installed the project and only want to run the build, use:
$ npm run build
After building libpdf, you can explore its exports at the Node REPL.
- Install libpdf:
  $ npm install libpdf --save
- Open Node REPL:
  $ node
- Execute the following commands:
  > const pdfFile = require("fs").readFileSync("doc.pdf");
  > const doc = require("libpdf").document(pdfFile);
  > console.log(doc);
- Create a file named index.ts with the following content:
  const pdfFile = require('fs').readFileSync("doc.pdf");
  const doc = require('libpdf').document(pdfFile);
  console.log(doc);
- Run the file with Node:
  $ node index.ts
This setup ensures you can easily install, build, and explore the capabilities of libpdf.
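The REPL and index.ts steps above can be wrapped in a small helper. The following is a minimal sketch, assuming only what this README shows: that libpdf exports a document function accepting the bytes of a PDF file. The helper names looksLikePdf and loadPdf, the header check, and the injectable convert parameter are hypothetical and not part of libpdf.

```javascript
// Sketch only: `document` is the one libpdf export this README shows.
// The function names, the header check, and the injected `convert`
// parameter are hypothetical helpers for illustration.
const fs = require("fs");

// PDF files start with the marker "%PDF-". This check is deliberately
// strict and only looks at the first five bytes.
function looksLikePdf(buffer) {
  return (
    buffer.length >= 5 &&
    buffer.subarray(0, 5).toString("latin1") === "%PDF-"
  );
}

// Read a file and pass its bytes to a converter. By default the converter
// is libpdf's `document`, required lazily so this module can still be
// loaded (and exercised with a stub) when libpdf is not installed.
function loadPdf(
  filePath,
  convert = (bytes) => require("libpdf").document(bytes)
) {
  const bytes = fs.readFileSync(filePath); // throws if the file is missing
  if (!looksLikePdf(bytes)) {
    throw new Error(`${filePath} does not start with a %PDF- header`);
  }
  return convert(bytes);
}

module.exports = { looksLikePdf, loadPdf };
```

With libpdf installed, calling loadPdf("doc.pdf") should behave like the index.ts example, while failing earlier and with a clearer message when the input is not a PDF. The check only accepts files that begin with %PDF- at the first byte, which is stricter than some PDF readers.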
- Best for Small and Medium PDFs: libPdf consistently performs the fastest for small and medium PDF files, showing significant speed advantages over pdf-lib and pdf-parse.
- Balanced Performance: pdf-parse offers a middle-ground performance across all file sizes but is generally slower than libPdf for smaller files and pdf-lib for medium files.
- Inefficiency with Complex PDFs: libPdf shows a notable drop in performance with complex PDFs, taking significantly longer compared to pdf-parse and pdf-lib.
- Library Efficiency: pdf-lib excels with small and medium PDFs but struggles significantly with large and complex documents, making it less suitable for those cases.
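Timing results like the ones summarized above depend on the machine and the input files, so it can help to measure on your own documents. The sketch below is a hypothetical timing helper, not the project's benchmark: timeRuns is an illustrative name, and it only handles synchronous converters (assuming the document call shown earlier is synchronous, as the README examples suggest).

```javascript
// Hypothetical timing helper; not part of libpdf or of the project's
// own benchmark suite.
const { performance } = require("perf_hooks");

// Call `fn(input)` `runs` times and return the per-run timings in
// milliseconds plus their median, which is less sensitive to outliers
// than the mean.
function timeRuns(fn, input, runs = 5) {
  if (!Number.isInteger(runs) || runs < 1) {
    throw new RangeError("runs must be a positive integer");
  }
  const timings = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn(input);
    timings.push(performance.now() - start);
  }
  const sorted = [...timings].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
  return { timings, median };
}

module.exports = { timeRuns };
```

To time libpdf on a file, pass a function that calls require("libpdf").document on the bytes read from that file. A converter that returns a promise would need an awaited variant of this loop; otherwise only the time to start the work is measured.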
- Run Benchmark
- Add support for extracting text from specific pages
- Improve text extraction accuracy for complex PDFs
- Implement batch processing for multiple PDFs
- Add CLI support for direct command-line usage
- Create detailed documentation and examples
- Identity-H encoding is not supported
We welcome contributions to improve LibPdf! Feel free to submit issues and pull requests on our GitHub repository.
This project is licensed under the Apache-2.0 license.