-
-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try to identify situations where parts of a page have *moved* #16
Comments
First: we need to gather up some concrete examples of this situation. |
Alternative thought for identifying moves with changes inside them: do a simhash on each large token-chunk and consider them equal if their hashes are close enough (how close? idunno! ¯\_(ツ)_/¯). Then re-diff those token streams to identify the insertions and deletions inside them (kind of like the two-level diffing we now do with the links diff). |
Example of a move with a change inside it, as noted above: https://monitoring.envirodatagov.org/page/8db5264c-715b-4a0e-8c5a-7036f91e8b15/bc56f6ae-fa86-4e7a-baa1-5142a703ed1d..7ed11cad-1553-47c0-9f07-b07b2d1fe03f |
See also edgi-govdata-archiving/web-monitoring#146 about zhang-shasha edit distance. Would be good to look at the performance of a number of edit distance/similarity algorithms (I suggested simhash above, but no idea how expensive that may be) and pick (for this use) whatever is fastest. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions. |
Some good examples of chunking and moving on this page: https://monitoring.envirodatagov.org/page/1f0bd347-60c1-47cf-9dac-cc4f86345e43/4728ed96-cfdd-471b-a380-139beeccbd63..99c12d5e-d535-40e3-a46b-a25ab588923c |
One situation we see that can be confusing for analysts is when a portion of the page has moved by swapping locations with another (maybe two paragraphs get reversed, maybe the navigation moves from the bottom to the top of the page, etc). Are there any non-expensive heuristics we can use to identify this? I think it matters most in the
html_token
andtext
diffs.My simpleton idea here is to do the diff, then:
Look for insertions or deletions that have a lot of tokens (so only look at reasonably large changes) — maybe 50+ tokens (I pulled that number out of thin air, to be clear)
Check whether there is a corresponding deletion or insertion with the same set of tokens
Note: in the HTML diff, we should do this before calling
assemble_diff
so that the tokens we are comparing are the words, not words + tags. I think we’ll be significantly more likely to fail matching once tags are added back into the token stream.Classify those matching pairs as moved (instead of inserted or deleted) (not sure how best to show this visually).
(bonus!) based on whether the insertion half of the pair is earlier or later in the token stream, classify the direction of the move.
(bonus!) assign each pair an ID so in the unified diff view, we can actually draw a line connecting them (but we’ll leave actually drawing that line for later — and it probably belongs in the UI project). Not sure how feasible it will be to connect them in the side-by-side views.
Some examples:
Navigation moved across the page: https://monitoring.envirodatagov.org/page/c7af604b-5a3d-4497-b6bc-5c55bf865aa1/3b5192c2-b18c-43e4-9908-6fb531eeb2c7..f5358bf6-19a1-4ef4-b9b2-805dfd11beed
A move that also involves a big change; not sure if it will be super feasible to identify, but makes for a good outside target: https://monitoring.envirodatagov.org/page/8db5264c-715b-4a0e-8c5a-7036f91e8b15/bc56f6ae-fa86-4e7a-baa1-5142a703ed1d..7ed11cad-1553-47c0-9f07-b07b2d1fe03f
Paragraph dropped in the middle of a sizable move:
https://monitoring.envirodatagov.org/page/5d79428e-6abe-4090-b81c-ed503510b1b1/3ba94e1f-4585-4739-b53b-4053a767af0b..8ee14e56-3dce-4729-a5ac-bdb853dd0623
The text was updated successfully, but these errors were encountered: