Diff PDFs by converting them to HTML and diffing that #9

Mr0grog · 2018-02-23T16:56:36Z

This should not necessarily replace a differ for actual PDF content, but it could be a lot more useful when it works well: instead of trying to diff two PDF files directly, convert each PDF to HTML (there are at least a few open-source libraries for this) and feed the results through the HTML differ.
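A rough sketch of what that pipeline might look like, just to make the idea concrete. Assumptions here: pdfminer.six is used as the PDF-to-HTML converter (only one of several possible libraries), and a plain `difflib` comparison of the extracted text stands in for the real HTML differ. The names `pdf_to_html`, `html_text_lines`, and `diff_pdfs_as_html` are hypothetical and not part of any existing code.

```python
import difflib
import io
from html.parser import HTMLParser


def pdf_to_html(pdf_bytes):
    """Convert PDF bytes to an HTML string (assumes pdfminer.six is installed)."""
    # Imported lazily so the rest of the sketch works without pdfminer.six.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    out = io.StringIO()
    extract_text_to_fp(io.BytesIO(pdf_bytes), out, laparams=LAParams(),
                       output_type="html", codec=None)
    return out.getvalue()


class _TextLines(HTMLParser):
    """Collects each non-empty run of text in the HTML as one line."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.lines.append(text)


def html_text_lines(html):
    parser = _TextLines()
    parser.feed(html)
    parser.close()
    return parser.lines


def diff_pdfs_as_html(pdf_a, pdf_b, convert=pdf_to_html):
    """Convert both PDFs to HTML, then diff the results.

    The difflib text comparison is only a placeholder for the HTML differ.
    """
    lines_a = html_text_lines(convert(pdf_a))
    lines_b = html_text_lines(convert(pdf_b))
    return list(difflib.unified_diff(lines_a, lines_b,
                                     "a.pdf", "b.pdf", lineterm=""))
```

The `convert` parameter is there so a different converter could be swapped in without touching the diffing step, which is probably where most of the experimentation would happen.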

Not sure what the right name for this is.

Lizz brought this up in Slack and, though I remember having a short discussion about the idea before, I can’t find anywhere we’ve written it down, hence this issue.

Mr0grog · 2018-05-10T16:03:21Z

Potentially useful article on PDF text extraction in Python I ran across today: https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/

stale · 2019-01-10T01:25:20Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Mr0grog · 2019-01-10T04:53:56Z

Definitely still a relevant idea.

Mr0grog transferred this issue from edgi-govdata-archiving/web-monitoring-processing Oct 26, 2020

stale bot added the stale label Jun 2, 2021

stale bot closed this as completed Jun 16, 2021

Mr0grog reopened this Jun 18, 2021

stale bot removed the stale label Jun 18, 2021

edgi-govdata-archiving deleted a comment from stale bot Jun 18, 2021

stale bot added the stale label Jan 8, 2022

stale bot closed this as completed Apr 16, 2022

Mr0grog added the enhancement and never-stale labels and removed the stale label Apr 17, 2022

Mr0grog reopened this Apr 17, 2022

edgi-govdata-archiving deleted a comment from stale bot Apr 17, 2022

Mr0grog added the new-diff label Apr 17, 2022
