Skip to content

overview/overview-convert-office

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Converts office files for Overview using LibreOffice.

Methodology

This program always outputs 0.json and 0.blob.

The output 0.json has wantOcr:false.

These metadata fields from the input document will be written to the output PDF:

  • /Title
  • /Author
  • /Subject
  • /Keywords
  • /CreationDate
  • /ModDate
  • /Creator (program name that produced the document)

And these metadata fields will be written to the output JSON metadata (since there is no equivalent field in PDF):

  • Modified By (akin to Author)
  • Comments (text)

We use custom C++ with LibreOfficeKit and QPDF. This is painful and janky. Here's why:

  • soffice.bin is slow to load, imposing big overhead (hence LOK).
  • soffice.bin does not preserve metadata when converting to PDF (hence QPDF).
  • soffice.bin cannot even extract metadata. The only move is to convert to .odt and then read the result as a zipfile -- costing a second invocation of soffice.bin just to read metadata (hence not using soffice.bin).

We create the LibreOffice profile directory once and reset it with every conversion. This is for security: once an invocation is complete, all its data is wiped.

Testing

Write to test/test-*. docker build . will run the tests.

Each test has input.blob (which means the same as in production) and input.json (whose contents are $1 in do-convert-single-file). The files stdout, 0.json and 0.blob in the test directory are expected values. If actual values differ from expected values, the test fails.

PDF is a tricky format to get exactly right. You may need to use the Docker image itself to generate expected output files. For instance, here is how we build test-odt/0.blob:

  1. Wrote test/test-odt/{input.json,input.blob,0.json,stdout}
  2. Ran docker build .. The end of the output looked like this: Step 12/13 : RUN [ "/app/test-convert-single-file" ] ---> Running in f65521f3a30c 1..1 not ok 1 - test-odt TODO do-convert-single-file wrote /tmp/test-do-convert-single-file912093989/0-thumbnail.jpg, but we expected it not to exist TODO ... TODO1. docker cp f65521f3a30c:/tmp/test-do-convert-single-file912093989/0-thumbnail.jpg test/test-jpg-ocr/ TODO1. docker rm -f f65521f3a30c