HTML Diff: Where possible, diff regions of the page independently #5

Open
Mr0grog opened this issue May 17, 2018 · 4 comments
Labels
enhancement New feature or request never-stale

Comments

@Mr0grog
Member

Mr0grog commented May 17, 2018

I thought I’d written this idea down somewhere before, but cannot find it.

It might be nice to have the HTML differ use some heuristics to identify major regions of a page (e.g. main content [distinguished by tag, class name, etc.], headers, footers) and, if it can find the same region on both sides of the diff, diff each of those regions (and the regions between them) independently. This could help ensure, for example, that diffs in menus don't bleed together with diffs of the body content.
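
A rough sketch of the idea, using only the Python standard library. This is not the differ's actual code: `REGION_TAGS`, `RegionSplitter`, `split_regions`, and `diff_by_region` are hypothetical names, and matching on tag name alone is a stand-in for real heuristics (class names, ids, ARIA roles, etc.). Text outside any recognized region is lumped into a single `other` bucket rather than being tracked as separate in-between regions.

```python
import difflib
from html.parser import HTMLParser

# Assumption: tag names only; real heuristics would also check classes, ids, roles.
REGION_TAGS = {"header", "nav", "main", "footer"}


class RegionSplitter(HTMLParser):
    """Collect visible text per region; text outside any region goes to 'other'."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open region tags, innermost last
        self.regions = {}  # region name -> list of text chunks

    def handle_starttag(self, tag, attrs):
        if tag in REGION_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            name = self.stack[-1] if self.stack else "other"
            self.regions.setdefault(name, []).append(text)


def split_regions(html):
    parser = RegionSplitter()
    parser.feed(html)
    parser.close()
    return parser.regions


def diff_by_region(old_html, new_html):
    """Diff each region on its own; return {region: diff lines} for changed regions."""
    old, new = split_regions(old_html), split_regions(new_html)
    result = {}
    for name in sorted(old.keys() | new.keys()):
        a, b = old.get(name, []), new.get(name, [])
        if a != b:
            result[name] = list(difflib.unified_diff(a, b, lineterm="", n=0))
    return result
```

Because each region is diffed against its counterpart only, a change to a menu link shows up under `nav` and cannot be aligned against body text by accident.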

Having these heuristics around would also undoubtedly be useful in auto-classifying changes (e.g. changes that only involved menus).
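
As a minimal illustration of that classification step, assuming some earlier stage has already produced the set of region names that changed: the function name, the region names, and the labels below are all made up for the example.

```python
def classify_change(changed_regions):
    """Label a change by which regions differ between the two versions."""
    changed = set(changed_regions)
    if not changed:
        return "no-change"
    # Assumption: these names mark page "chrome" as opposed to body content.
    chrome = {"header", "nav", "footer"}
    if changed <= {"nav"}:
        return "menu-only"
    if changed <= chrome:
        return "chrome-only"
    return "content"
```

For example, `classify_change({"nav"})` gives `"menu-only"`, while any change touching a non-chrome region falls through to `"content"`.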

@Mr0grog
Member Author

Mr0grog commented May 17, 2018

Side note: it’s also possible this would help clean up diffs like a recent change to EPA’s page layout, where menus moved from after the body content to before it. Depending on the particulars of the body content, we currently might identify the whole body as having changed or instead identify the whole menu as having changed. I imagine splitting up the diffing this way would ensure that the menu is always shown as the part that changed.

Example: https://monitoring.envirodatagov.org/page/2a2cd62e-ded7-4ecc-9749-804ea3e06a0d/9bfd1b57-5872-467f-b556-34cee538493a..0ef9613e-d38d-4aa2-97e7-01f39acf6f17
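
One way to make the menu "always" be the part that moved is to treat the body as a fixed anchor and report any other region whose side of the anchor changed. A small sketch under that assumption; `moved_relative_to` and the region names are hypothetical, and the inputs are just lists of region names in document order.

```python
def moved_relative_to(anchor, old_order, new_order):
    """Return regions that switched sides (before/after) of `anchor`."""
    if anchor not in old_order or anchor not in new_order:
        return []

    def is_before(order, name):
        return order.index(name) < order.index(anchor)

    # Only regions present in both versions can be said to have moved.
    shared = [n for n in old_order if n in new_order and n != anchor]
    return [n for n in shared if is_before(old_order, n) != is_before(new_order, n)]
```

With `old_order = ["header", "main", "nav", "footer"]` and `new_order = ["header", "nav", "main", "footer"]`, anchoring on `"main"` reports only `"nav"` as moved, regardless of how large the body content is.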

@Mr0grog
Member Author

Mr0grog commented Jan 10, 2019

Definitely still an idea worth working on.

@Mr0grog
Member Author

Mr0grog commented Sep 16, 2019

This Microsoft research paper covers an interesting way of using layout information to visually segment a page: https://www.microsoft.com/en-us/research/publication/vips-a-vision-based-page-segmentation-algorithm/

Some things about it might be tough to incorporate easily, though:

  • It needs layout information, so we’d need to use headless Chrome or something similar.
  • It can construct visual blocks from non-contiguous sections of the DOM, so reconstructing a page from it might be hard. It could serve as a visual layer on top, or maybe as a “map” that lets another, simpler differ run several diffs on subsections of the page.
  • It would almost certainly need a lot of updating for the web of today, as opposed to the web of 2003. (New tags, inline SVG, etc.)
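
To make the “map” option a bit more concrete, here is a sketch that does not implement VIPS at all. It assumes some segmentation step has already produced, for each version, a mapping from block name to the indexes of the DOM nodes in that block (possibly non-contiguous), and then runs a simple sequence diff per block. The function name and data shapes are hypothetical.

```python
import difflib


def diff_with_block_map(old_nodes, new_nodes, old_map, new_map):
    """
    old_nodes / new_nodes: text of each DOM node, in document order.
    old_map / new_map: block name -> list of node indexes in that block.
    Returns {block: non-equal opcodes} for blocks present on both sides.
    """
    result = {}
    for block in sorted(old_map.keys() & new_map.keys()):
        a = [old_nodes[i] for i in old_map[block]]
        b = [new_nodes[i] for i in new_map[block]]
        matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
        ops = [op for op in matcher.get_opcodes() if op[0] != "equal"]
        if ops:
            result[block] = ops
    return result
```

The opcodes index into each block's own node list, so mapping them back onto the original DOM for display is still the hard part noted above.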

@Mr0grog Mr0grog transferred this issue from edgi-govdata-archiving/web-monitoring-processing Oct 26, 2020
@stale stale bot added the stale label Jun 2, 2021
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Jun 4, 2021
@stale stale bot removed the stale label Jun 4, 2021
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Jun 4, 2021
@stale stale bot added the stale label Jan 8, 2022
@stale stale bot closed this as completed Apr 16, 2022
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Apr 17, 2022
@Mr0grog Mr0grog added enhancement New feature or request never-stale and removed stale labels Apr 17, 2022
@Mr0grog Mr0grog reopened this Apr 17, 2022
@Mr0grog
Member Author

Mr0grog commented Apr 17, 2022

There is also some useful existing work based on real-world data from government sites at https://github.com/edgi-govdata-archiving/web-monitoring-task-sheets/blob/main/analyst_sheets/normalize.py

Not nearly as generic as the MS paper referenced earlier, though.
