Try to identify situations where parts of a page have *moved* #16

Open
Mr0grog opened this issue Aug 16, 2018 · 7 comments

@Mr0grog
Member

Mr0grog commented Aug 16, 2018

One situation that can be confusing for analysts is when a portion of the page has moved by swapping locations with another (maybe two paragraphs get reversed, maybe the navigation moves from the bottom to the top of the page, etc.). Are there any inexpensive heuristics we can use to identify this? I think it matters most in the html_token and text diffs.

My simpleton idea here is to do the diff, then:

  1. Look for insertions or deletions that have a lot of tokens (so only look at reasonably large changes) — maybe 50+ tokens (I pulled that number out of thin air, to be clear)

  2. Check whether there is a corresponding deletion or insertion with the same set of tokens

    Note: in the HTML diff, we should do this before calling assemble_diff so that the tokens we are comparing are the words, not words + tags. I think we’ll be significantly more likely to fail matching once tags are added back into the token stream.

  3. Classify those matching pairs as moved instead of inserted or deleted (not sure how best to show this visually).

  4. (bonus!) based on whether the insertion half of the pair is earlier or later in the token stream, classify the direction of the move.

  5. (bonus!) assign each pair an ID so in the unified diff view, we can actually draw a line connecting them (but we’ll leave actually drawing that line for later — and it probably belongs in the UI project). Not sure how feasible it will be to connect them in the side-by-side views.
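
A minimal sketch of steps 1 through 5, under some assumptions: the diff is taken to be a list of `(op, tokens)` pairs where `op` is `-1` for a deletion, `0` for unchanged and `1` for an insertion, and "same set of tokens" is read strictly as the same token sequence. The `find_moves` name, the result keys and the `earlier`/`later` labels are hypothetical and only for illustration; they are not part of the existing code.

```python
MIN_MOVE_TOKENS = 50  # the arbitrary threshold suggested in step 1


def find_moves(diff, min_tokens=MIN_MOVE_TOKENS):
    """Pair large deletions with identical large insertions.

    `diff` is assumed to be a list of (op, tokens) pairs, where op is
    -1 (deletion), 0 (unchanged) or 1 (insertion) and tokens is a list
    of word tokens (no tags, per the note in step 2).
    """
    # Steps 1 and 2: index every large deletion by its exact token sequence.
    deletions = {}
    for index, (op, tokens) in enumerate(diff):
        if op == -1 and len(tokens) >= min_tokens:
            deletions.setdefault(tuple(tokens), []).append(index)

    moves = []
    for index, (op, tokens) in enumerate(diff):
        if op != 1 or len(tokens) < min_tokens:
            continue
        candidates = deletions.get(tuple(tokens))
        if not candidates:
            continue
        deleted_index = candidates.pop(0)  # use each deletion only once
        moves.append({
            # Step 5: an ID a UI could use to connect the two halves.
            'id': len(moves),
            # Step 3: positions in `diff` of the two halves of the move.
            'deleted': deleted_index,
            'inserted': index,
            # Step 4: did the content end up earlier or later in the stream?
            'direction': 'earlier' if index < deleted_index else 'later',
        })
    return moves
```

Matching on the exact sequence keeps this to a single pass plus dictionary lookups, but it will miss moved chunks that were also edited.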

Some examples:

@Mr0grog
Member Author

Mr0grog commented Aug 16, 2018

First: we need to gather up some concrete examples of this situation.

@Mr0grog
Member Author

Mr0grog commented Aug 16, 2018

Alternative thought for identifying moves with changes inside them: do a simhash on each large token-chunk and consider them equal if their hashes are close enough (how close? idunno! ¯\_(ツ)_/¯). Then re-diff those token streams to identify the insertions and deletions inside them (kind of like the two-level diffing we now do with the links diff).
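
A rough sketch of that idea, using only the standard library. Everything here is an assumption made for illustration: the hand-rolled 64-bit simhash built on MD5, the `max_distance=8` cutoff (as arbitrary as the 50-token threshold), and the use of `difflib.SequenceMatcher` for the second-level diff. None of the function names exist in the project.

```python
import difflib
import hashlib


def simhash(tokens, bits=64):
    """Compute a simhash fingerprint over a list of string tokens."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).digest()
        value = int.from_bytes(digest[:bits // 8], 'big')
        # Each token votes +1 or -1 on every bit of the fingerprint.
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count('1')


def probably_same_chunk(a_tokens, b_tokens, max_distance=8):
    """Treat two chunks as 'the same, but moved' if their hashes are close."""
    distance = hamming_distance(simhash(a_tokens), simhash(b_tokens))
    return distance <= max_distance


def rediff(a_tokens, b_tokens):
    """Second-level diff: return only the changed opcodes inside a moved chunk."""
    matcher = difflib.SequenceMatcher(None, a_tokens, b_tokens, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != 'equal']
```

Because this simhash treats the chunk as a bag of tokens, it ignores token order; the second-level diff is what would surface the actual insertions and deletions inside a matched pair.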

@Mr0grog
Member Author

Mr0grog commented Aug 16, 2018

@Mr0grog
Member Author

Mr0grog commented Aug 21, 2018

See also edgi-govdata-archiving/web-monitoring#146 about Zhang-Shasha edit distance. It would be good to look at the performance of a number of edit distance/similarity algorithms (I suggested simhash above, but have no idea how expensive that may be) and pick whatever is fastest for this use.

@stale

stale bot commented Feb 17, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

@Mr0grog
Member Author

Mr0grog commented Jun 4, 2019

@stale stale bot closed this as completed Dec 8, 2019
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Dec 9, 2019
@Mr0grog Mr0grog reopened this Dec 9, 2019
@Mr0grog Mr0grog transferred this issue from edgi-govdata-archiving/web-monitoring-processing Oct 26, 2020
@stale stale bot added the stale label Jun 2, 2021
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Jun 4, 2021
@stale stale bot removed the stale label Jun 4, 2021