Try to identify situations where parts of a page have moved #16

Mr0grog · 2018-08-16T03:34:45Z

One situation that can confuse analysts is when a portion of the page has moved, for example by swapping locations with another portion (two paragraphs get reversed, the navigation moves from the bottom to the top of the page, etc.). Are there any inexpensive heuristics we can use to identify this? I think it matters most in the html_token and text diffs.

My simpleton idea here is to do the diff, then:

1. Look for insertions or deletions that have a lot of tokens (so only look at reasonably large changes), maybe 50+ tokens (I pulled that number out of thin air, to be clear).
2. Check whether there is a corresponding deletion or insertion with the same set of tokens.

   Note: in the HTML diff, we should do this before calling assemble_diff so that the tokens we are comparing are the words, not words + tags. I think we’ll be significantly more likely to fail matching once tags are added back into the token stream.
3. Classify those matching pairs as moved (instead of inserted or deleted). Not sure how best to show this visually.
4. (bonus!) Based on whether the insertion half of the pair is earlier or later in the token stream, classify the direction of the move.
5. (bonus!) Assign each pair an ID so that in the unified diff view, we can actually draw a line connecting them (but we’ll leave actually drawing that line for later, and it probably belongs in the UI project). Not sure how feasible it will be to connect them in the side-by-side views.
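
The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: it uses Python's standard `difflib.SequenceMatcher` as a stand-in for the project's actual differ, it treats tokens as plain word strings (as they would be before assemble_diff), and the names `find_moves` and `MIN_MOVE_TOKENS` are hypothetical, not part of the existing code.

```python
from collections import Counter
from difflib import SequenceMatcher

# Arbitrary threshold, taken from the "50+ tokens" guess in the list above.
MIN_MOVE_TOKENS = 50


def find_moves(old_tokens, new_tokens, min_tokens=MIN_MOVE_TOKENS):
    """Pair large deletions with large insertions that hold the same tokens.

    Hypothetical helper: difflib stands in for the real diff algorithm.
    """
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    deletions, insertions = [], []
    # `position` is the index of the change in the diff's operation stream.
    for position, (tag, i1, i2, j1, j2) in enumerate(matcher.get_opcodes()):
        if tag in ('delete', 'replace') and i2 - i1 >= min_tokens:
            deletions.append((position, i1, i2, Counter(old_tokens[i1:i2])))
        if tag in ('insert', 'replace') and j2 - j1 >= min_tokens:
            insertions.append((position, j1, j2, Counter(new_tokens[j1:j2])))

    moves = []
    unmatched = list(insertions)
    for del_pos, i1, i2, del_counts in deletions:
        for candidate in unmatched:
            ins_pos, j1, j2, ins_counts = candidate
            # "Same set of tokens": compare token counts, ignoring order.
            if del_counts == ins_counts:
                unmatched.remove(candidate)
                moves.append({
                    'id': len(moves),  # lets a UI connect the two halves
                    'old_range': (i1, i2),
                    'new_range': (j1, j2),
                    # Is the insertion half earlier or later in the stream?
                    'direction': 'earlier' if ins_pos < del_pos else 'later',
                })
                break
    return moves
```

Comparing token counts rather than exact sequences is one reading of "the same set of tokens"; comparing the slices directly would be stricter. Changes below the threshold are ignored, so small reorderings are still reported as ordinary insertions and deletions.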

Some examples:

Navigation moved across the page: https://monitoring.envirodatagov.org/page/c7af604b-5a3d-4497-b6bc-5c55bf865aa1/3b5192c2-b18c-43e4-9908-6fb531eeb2c7..f5358bf6-19a1-4ef4-b9b2-805dfd11beed
A move that also involves a big change; not sure if it will be super feasible to identify, but makes for a good outside target: https://monitoring.envirodatagov.org/page/8db5264c-715b-4a0e-8c5a-7036f91e8b15/bc56f6ae-fa86-4e7a-baa1-5142a703ed1d..7ed11cad-1553-47c0-9f07-b07b2d1fe03f
Paragraph dropped in the middle of a sizable move:
https://monitoring.envirodatagov.org/page/5d79428e-6abe-4090-b81c-ed503510b1b1/3ba94e1f-4585-4739-b53b-4053a767af0b..8ee14e56-3dce-4729-a5ac-bdb853dd0623


Mr0grog · 2018-08-16T03:39:23Z

First: we need to gather up some concrete examples of this situation.

Mr0grog · 2018-08-16T03:44:35Z

Alternative thought for identifying moves with changes inside them: do a simhash on each large token-chunk and consider them equal if their hashes are close enough (how close? idunno! ¯\_(ツ)_/¯). Then re-diff those token streams to identify the insertions and deletions inside them (kind of like the two-level diffing we now do with the links diff).
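
A rough sketch of that idea, for illustration only: a basic 64-bit simhash over word tokens, a Hamming-distance check, and a second-level diff of the two chunks. The function names and the `max_distance=8` cutoff are assumptions made up for this example (the "how close?" question is still open), and `difflib.SequenceMatcher` stands in for whatever differ would actually be used for the inner diff.

```python
import hashlib
from difflib import SequenceMatcher


def simhash(tokens):
    """Compute a simple 64-bit simhash of a list of string tokens."""
    weights = [0] * 64
    for token in tokens:
        digest = hashlib.blake2b(token.encode('utf-8'), digest_size=8).digest()
        value = int.from_bytes(digest, 'big')
        # Each token votes +1 or -1 on every bit of the fingerprint.
        for bit in range(64):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming_distance(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count('1')


def is_probable_move(deleted, inserted, max_distance=8):
    """Treat two chunks as the same moved content if their hashes are close.

    max_distance is a placeholder value, not a tuned threshold.
    """
    return hamming_distance(simhash(deleted), simhash(inserted)) <= max_distance


def inner_diff(deleted, inserted):
    """Re-diff a matched pair to find the changes inside the moved chunk."""
    matcher = SequenceMatcher(a=deleted, b=inserted, autojunk=False)
    return [
        (tag, deleted[i1:i2], inserted[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != 'equal'
    ]
```

With something like this, a large deletion and a large insertion whose fingerprints are within the cutoff would be classified as a move, and `inner_diff` would supply the insertions and deletions to show inside it.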

Mr0grog · 2018-08-16T03:48:10Z

Example of a move with a change inside it, as noted above: https://monitoring.envirodatagov.org/page/8db5264c-715b-4a0e-8c5a-7036f91e8b15/bc56f6ae-fa86-4e7a-baa1-5142a703ed1d..7ed11cad-1553-47c0-9f07-b07b2d1fe03f

Mr0grog · 2018-08-21T22:08:45Z

Another with a change inside:
https://monitoring.envirodatagov.org/page/5d79428e-6abe-4090-b81c-ed503510b1b1/3ba94e1f-4585-4739-b53b-4053a767af0b..8ee14e56-3dce-4729-a5ac-bdb853dd0623

Mr0grog · 2018-08-21T22:13:34Z

See also edgi-govdata-archiving/web-monitoring#146 about Zhang-Shasha edit distance. It would be good to compare the performance of a number of edit distance/similarity algorithms (I suggested simhash above, but I have no idea how expensive that may be) and pick whatever is fastest for this use.
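
One way to run that comparison is a small timing harness like the sketch below. It is only a starting point under stated assumptions: the two candidates shown (a `difflib` ratio and a Jaccard index over token sets) are standard-library stand-ins chosen for illustration, not the algorithms named above, and `benchmark` is a hypothetical helper. Real candidates such as simhash or a Zhang-Shasha implementation could be added to the dictionary in the same way.

```python
import timeit
from difflib import SequenceMatcher


def sequence_ratio(a, b):
    """Order-sensitive similarity between two token lists (0.0 to 1.0)."""
    return SequenceMatcher(a=a, b=b, autojunk=False).ratio()


def jaccard(a, b):
    """Order-insensitive similarity: shared tokens over all distinct tokens."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def benchmark(candidates, a, b, repeat=20):
    """Return the average seconds per call for each candidate function."""
    return {
        name: timeit.timeit(lambda: func(a, b), number=repeat) / repeat
        for name, func in candidates.items()
    }
```

Timings only mean something when measured on token chunks of realistic size, such as those from the example pages linked in this issue.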

stale · 2019-02-17T23:00:16Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Mr0grog · 2019-06-04T16:08:03Z

Some good examples of chunking and moving on this page: https://monitoring.envirodatagov.org/page/1f0bd347-60c1-47cf-9dac-cc4f86345e43/4728ed96-cfdd-471b-a380-139beeccbd63..99c12d5e-d535-40e3-a46b-a25ab588923c

stale bot closed this as completed Dec 8, 2019

edgi-govdata-archiving deleted a comment from stale bot Dec 9, 2019

Mr0grog reopened this Dec 9, 2019

Mr0grog transferred this issue from edgi-govdata-archiving/web-monitoring-processing Oct 26, 2020

stale bot added the stale label Jun 2, 2021

edgi-govdata-archiving deleted a comment from stale bot Jun 4, 2021

stale bot removed the stale label Jun 4, 2021

Mr0grog added the never-stale label Jun 4, 2021
