Recovering of malformed ENEX file #12

engdan77 · 2021-05-15T07:49:31Z

Hey, awesome work on this project. I found it very useful and it saved me some work. Thanks :)

Some background to this PR:
I've been looking for a tool that lets me transform my personal collection of Evernote notes into a format that is easier to search and potentially easier to import into future services.

I ran into a problem processing my large export (~5 GB) with the existing code: it uses Python's built-in XML parser, which raised an exception and aborted the process.

As a first attempt I tried switching to the more robust lxml package, which supports huge data and has a "recover" option. It worked better, but it still failed to process the whole file. Even the memory-efficient etree.iterparse() unfortunately ran into trouble.

Having had no luck finding any other library that could parse this enormous file, I instead chose to build a "hugexmlparser" module. It parses the file with yield (byte by byte) and lets you set a maximum size, to cater for malformed or undesirably large attachments in the export, so it should get through cases that would otherwise raise exceptions. In some cases the parser finds malformed XML inside a note; there it also tries to save as much as possible by escaping the content (to be dealt with at a later stage, better than nothing). If an end tag is missing before a new start tag (malformed?), it adds the end tag when it encounters the new start tag.
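To make the idea concrete, here is a minimal sketch of that kind of generator. It is not the code from this PR: the name `iter_notes`, its parameters, and the assumption that every note starts with a literal `<note>` tag without attributes are mine. It scans the stream in chunks rather than strictly byte by byte, yields one note at a time, drops notes larger than `max_size`, and closes a note whose end tag is missing when the next start tag shows up. It leaves out the escaping of malformed content.

```python
import io
from typing import BinaryIO, Iterator

START = b"<note>"   # assumed: note start tags carry no attributes
END = b"</note>"


def iter_notes(stream: BinaryIO,
               max_size: int = 10 * 1024 * 1024,
               chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield each <note>...</note> block of `stream` as bytes.

    Notes larger than `max_size` bytes are skipped. A note that is
    missing its end tag is closed when the next start tag (or the end
    of the stream) is reached.
    """
    buf = b""
    in_note = False
    oversized = False
    # Tail bytes to keep so a tag split across two chunks is still found.
    keep = max(len(START), len(END)) - 1

    while True:
        chunk = stream.read(chunk_size)
        buf += chunk
        while True:
            if not in_note:
                start = buf.find(START)
                if start == -1:
                    buf = buf[-keep:]          # nothing useful before a note
                    break
                buf = buf[start:]              # buf now begins with START
                in_note, oversized = True, False
            end = buf.find(END, len(START))
            nxt = buf.find(START, len(START))
            if end != -1 and (nxt == -1 or end < nxt):
                # Well-formed note: cut it out, end tag included.
                note, buf = buf[:end + len(END)], buf[end + len(END):]
                in_note = False
            elif nxt != -1:
                # New start tag before any end tag: add the missing end tag.
                # buf keeps starting with START, so we stay inside a note.
                note, buf = buf[:nxt] + END, buf[nxt:]
            else:
                if len(buf) > max_size:
                    # Too large: stop collecting, keep only a short tail.
                    oversized = True
                    buf = START + buf[-keep:]
                break                          # need more data
            if not oversized and len(note) <= max_size:
                yield note
            oversized = False
        if not chunk:                          # end of stream
            break

    # Stream ended inside a note: close it rather than lose it.
    if in_note and not oversized and len(buf) > len(START):
        if len(buf) + len(END) <= max_size:
            yield buf + END
```

It could be used as `for note in iter_notes(open("export.enex", "rb")): ...`, where the file name is just a placeholder. Memory use stays bounded by roughly `max_size` plus one chunk, at the cost of rescanning the buffer for each new chunk of a large note.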

The code for the recovery process is a bit rough and there is certainly room for refactoring, but at the moment it seems to achieve what I wanted.

With the above in place, the result is passed to save_note_recovery(), a slightly changed version of the existing save function, so that the existing code keeps working.
I also added this as a new recover-enex command in click and kept the original options.
A couple of new tests were added as well to exercise this command.

This currently works for me, but I thought I would share it as a PR in case you find it useful yourself, or it helps others who find this repository.

As a second step, when time allows, it would be nice to also be able to easily export from SQLite to formatted HTML/MD with the attachments saved. That might be better as a separate project, though. Or if you or someone else has something that could be shared to save some trouble, I would be interested ;-)

engdan77 added 10 commits May 14, 2021 18:10

Unable to find a XML parser allow full recovery of XML so this module…

3825976

… should to the trick even though not most efficient way.

Integrating hugexml parser and techniques allowing recovering large E…

23d04ed

…nex files.

Adding dependencies to requirements.txt

317c01d

Adding support for progressbar while parsing large individual notes

d04a3b7

Changes made to recover_enex to support parsing of large ENEX files

7883f3d

Adding function to allow resuming of an already started recovery process

bb65c5e

Adding help to describing the recover enex command

27610cc

Adding tests for recover-enex

a47a5a8

Removing comment

4b6349c

Missing some packages in setup.py

a5839da
