Add differ for word docs? #7

Open
Mr0grog opened this issue Jan 19, 2018 · 5 comments
Labels
never-stale new-diff A new type of diff

Comments

@Mr0grog
Member

Mr0grog commented Jan 19, 2018

We don’t have a lot of Word docs in our DB, but there are a few and Analysts have noted that they are a pain. That said, we aren’t any worse than the existing tool (Versionista), plus we can do edgi-govdata-archiving/web-monitoring-ui#186, so this isn’t a high priority.

I don’t know if there are any great Linux tools out there for rendering a .doc file, but there are certainly a few libraries that can handle .docx, like Mammoth: https://github.com/mwilliamson/python-mammoth, which can convert to HTML, Markdown, or plain text, any of which we could then diff with existing algorithms.

We could also use a service like Zamzar to convert, then diff.
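
A minimal sketch of the convert-then-diff idea, assuming Mammoth's `extract_raw_text` for the conversion and Python's standard `difflib` as a stand-in for our existing diff algorithms. The names `mammoth_plain_text`, `diff_docx_text`, and the `extract` parameter are hypothetical and not part of any existing code:

```python
import difflib
from typing import BinaryIO, Callable, List


def mammoth_plain_text(fileobj: BinaryIO) -> str:
    """Convert a .docx file object to plain text with Mammoth."""
    # Imported lazily: Mammoth is a third-party package (pip install mammoth).
    import mammoth

    return mammoth.extract_raw_text(fileobj).value


def diff_docx_text(
    old: BinaryIO,
    new: BinaryIO,
    extract: Callable[[BinaryIO], str] = mammoth_plain_text,
) -> List[str]:
    """Return unified-diff lines between the extracted text of two .docx files."""
    old_lines = extract(old).splitlines()
    new_lines = extract(new).splitlines()
    # difflib is only a placeholder here; any text differ could be swapped in.
    return list(
        difflib.unified_diff(
            old_lines, new_lines, "old.docx", "new.docx", lineterm=""
        )
    )
```

Keeping the converter as a parameter means the same function could take an HTML or Markdown converter instead, or the output of an external conversion service, without touching the diff step.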

@danielballan
Contributor

I would guess that handling .doc well enough to get a readable diff is a chore beyond our current capacity, but .docx -> HTML seems easy enough to add.

@stale

stale bot commented Jan 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

@Mr0grog
Member Author

Mr0grog commented Jan 10, 2019

This is more of a long-term idea. Would be great to have someone jump in and take a cut at it.

@stale

stale bot commented Jul 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

@stale stale bot closed this as completed Jul 16, 2019
@Mr0grog
Member Author

Mr0grog commented Aug 1, 2019

Keeping this open as a call for contributions. We probably don’t have the capacity for this right now, but if you’re interested in diffing and would like to take a shot at writing a function that can diff .docx files, we’d love to integrate it!
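
For anyone who wants a starting point, here is a rough sketch of what such a function might look like using only the Python standard library rather than a converter like Mammoth. It assumes the usual .docx layout (a zip archive with the main body in `word/document.xml`), compares paragraph text only, and ignores formatting, headers, footnotes, and images. The names `docx_paragraphs` and `diff_docx` are hypothetical:

```python
import difflib
import xml.etree.ElementTree as ET
import zipfile
from typing import BinaryIO, List

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(fileobj: BinaryIO) -> List[str]:
    """Return the text of each paragraph in a .docx file's main body."""
    with zipfile.ZipFile(fileobj) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join the text runs (<w:t>) that make up this paragraph.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs


def diff_docx(old: BinaryIO, new: BinaryIO) -> List[str]:
    """Return paragraph-level diff lines prefixed with '  ', '- ', or '+ '."""
    lines = difflib.ndiff(docx_paragraphs(old), docx_paragraphs(new))
    # Drop ndiff's '? ' hint lines so only paragraph lines remain.
    return [line for line in lines if not line.startswith("? ")]
```

A real differ would probably need to return results in whatever shape the other diff functions use; this only shows that the text extraction step is small.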

@Mr0grog Mr0grog reopened this Aug 1, 2019
@Mr0grog Mr0grog transferred this issue from edgi-govdata-archiving/web-monitoring-processing Oct 26, 2020
@stale stale bot added the stale label Jun 2, 2021
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Jun 4, 2021
@stale stale bot removed the stale label Jun 4, 2021
@Mr0grog Mr0grog added the new-diff A new type of diff label Apr 17, 2022