Diff PDFs by converting them to HTML and diffing that #9

Open
Mr0grog opened this issue Feb 23, 2018 · 3 comments
Labels
enhancement New feature or request never-stale new-diff A new type of diff

Comments

@Mr0grog
Member

Mr0grog commented Feb 23, 2018

This should not necessarily replace a differ for actual PDF content, but it could be a lot more useful when it works well: instead of trying to diff two PDF files directly, convert each PDF to HTML (there are at least a few open-source libraries for this) and feed the results through the HTML differ.

Not sure what the right name for this is.

Lizz brought this up in Slack and, though I remember having a short discussion about the idea before, I can’t find anywhere we’ve written it down, hence this issue.
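
A minimal sketch of the idea, not an actual implementation: the converter is passed in as a function because no specific PDF-to-HTML library has been chosen, and `difflib` from the standard library stands in for the real HTML differ. The names `diff_pdfs_as_html`, `PdfToHtml`, and `fake_converter` are hypothetical and exist only for illustration.

```python
import difflib
from typing import Callable, List

# Hypothetical type: any function that takes raw PDF bytes and returns HTML text.
# A real version would wrap whichever PDF-to-HTML library gets picked.
PdfToHtml = Callable[[bytes], str]


def diff_pdfs_as_html(old_pdf: bytes, new_pdf: bytes, to_html: PdfToHtml) -> List[str]:
    """Convert both PDFs to HTML, then diff the HTML line by line."""
    old_html = to_html(old_pdf).splitlines()
    new_html = to_html(new_pdf).splitlines()
    # Assumption: difflib is only a stand-in here for the project's HTML differ.
    return list(
        difflib.unified_diff(
            old_html,
            new_html,
            fromfile="old.pdf.html",
            tofile="new.pdf.html",
            lineterm="",
        )
    )


def fake_converter(pdf: bytes) -> str:
    """Placeholder converter for demonstration only: treats the bytes as
    UTF-8 text and wraps each line in a <p> element. Not a real PDF parser."""
    text = pdf.decode("utf-8")
    return "\n".join(f"<p>{line}</p>" for line in text.splitlines())


if __name__ == "__main__":
    for line in diff_pdfs_as_html(b"intro\nold text", b"intro\nnew text", fake_converter):
        print(line)
```

Keeping the converter pluggable means a dedicated PDF differ could still live alongside this one, since only the conversion step is PDF-specific.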

@Mr0grog
Member Author

Mr0grog commented May 10, 2018

Potentially useful article on PDF text extraction in Python I ran across today: https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/

@stale

stale bot commented Jan 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

@Mr0grog
Member Author

Mr0grog commented Jan 10, 2019

Definitely still a relevant idea.

@Mr0grog Mr0grog transferred this issue from edgi-govdata-archiving/web-monitoring-processing Oct 26, 2020
@stale stale bot added the stale label Jun 2, 2021
@stale stale bot closed this as completed Jun 16, 2021
@Mr0grog Mr0grog reopened this Jun 18, 2021
@stale stale bot removed the stale label Jun 18, 2021
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Jun 18, 2021
@stale stale bot added the stale label Jan 8, 2022
@stale stale bot closed this as completed Apr 16, 2022
@Mr0grog Mr0grog added enhancement New feature or request never-stale and removed stale labels Apr 17, 2022
@Mr0grog Mr0grog reopened this Apr 17, 2022
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Apr 17, 2022
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Apr 17, 2022
@Mr0grog Mr0grog added the new-diff A new type of diff label Apr 17, 2022