HTML Diff: Where possible, diff regions of the page independently #5
Comments
Side note: this might also help clean up diffs like a recent change to EPA’s page layout, where menus moved from after the body content to before it. Depending on the particulars of the body content, we currently might identify either the whole body or the whole menu as changed. I imagine that splitting up the diffing this way would ensure the menu is always shown as the part that changed.
Definitely still an idea worth working on.
This Microsoft research paper covers an interesting way of using layout information to visually segment a page: https://www.microsoft.com/en-us/research/publication/vips-a-vision-based-page-segmentation-algorithm/ Some things about it might be tough to incorporate easily, though.
There is also some useful existing work, based on real-world data from government sites, at https://github.com/edgi-govdata-archiving/web-monitoring-task-sheets/blob/main/analyst_sheets/normalize.py It is not nearly as generic as the Microsoft paper referenced earlier, though.
I thought I’d written this idea down somewhere before, but cannot find it.
It might be nice to have the HTML differ use some heuristics to identify major regions of a page (e.g. main content [distinguished by tag, class name, etc.], headers, footers) and, if it can find the same region on both sides of the diff, diff each of those regions (and the regions between them) independently. This could help ensure, for example, that diffs in menus don't bleed together with diffs of the body content.
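To make the idea concrete, here is a rough sketch using only the Python standard library. It is not how the differ works today. The names `REGION_TAGS`, `RegionSplitter`, `split_regions`, and `diff_by_region` are made up for illustration, and it assumes regions can be found by landmark tag alone (real pages would probably also need class-name and id heuristics). Text outside any landmark is lumped into a single `other` region rather than split into separate in-between regions.

```python
import difflib
from html.parser import HTMLParser

# Hypothetical landmark tags used as region boundaries; a real differ
# would likely also look at class names, ids, and ARIA roles.
REGION_TAGS = {"header", "nav", "main", "footer"}


class RegionSplitter(HTMLParser):
    """Collect the text of each landmark region; everything else goes to 'other'."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open landmark tags
        self.regions = {}  # region name -> list of text chunks

    def handle_starttag(self, tag, attrs):
        if tag in REGION_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Outermost landmark wins, so a <nav> inside <header> counts as header.
            region = self.stack[0] if self.stack else "other"
            self.regions.setdefault(region, []).append(text)


def split_regions(html):
    parser = RegionSplitter()
    parser.feed(html)
    parser.close()
    return parser.regions


def diff_by_region(old_html, new_html):
    """Diff each region on its own, so changes cannot bleed across regions."""
    old, new = split_regions(old_html), split_regions(new_html)
    result = {}
    for name in sorted(set(old) | set(new)):
        lines = difflib.ndiff(old.get(name, []), new.get(name, []))
        # Keep only removed/added entries; skip unchanged and "? " hint lines.
        changes = [line for line in lines if line[:2] in ("- ", "+ ")]
        if changes:
            result[name] = changes
    return result


old_page = "<nav><a>Home</a><a>About</a></nav><main><p>Air quality report</p></main>"
new_page = "<main><p>Air quality report</p></main><nav><a>Home</a><a>Contact</a></nav>"
region_diffs = diff_by_region(old_page, new_page)
```

In this example the menu both moves and changes one link, and only the `nav` region ends up with a diff. Because regions are matched by name rather than position, a menu that only moves would produce no diff at all here; detecting the move would need a separate comparison of region order.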
Having these heuristics around would also undoubtedly be useful in auto-classifying changes (e.g. a change that only involved menus).
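A minimal sketch of what that classification could look like, assuming some per-region diff has already told us which regions changed. The `CHROME_REGIONS` grouping, the `classify_change` function, and the label strings are all hypothetical, not part of any existing tool.

```python
# Hypothetical grouping of regions into "page chrome" vs. everything else.
CHROME_REGIONS = {"header", "nav", "footer"}


def classify_change(changed_regions):
    """Label a change from the names of the regions whose diffs are non-empty."""
    changed = set(changed_regions)
    if not changed:
        return "no-change"
    if changed <= CHROME_REGIONS:
        return "chrome-only"   # e.g. the change only involved menus
    if changed.isdisjoint(CHROME_REGIONS):
        return "content-only"
    return "mixed"
```

For example, `classify_change({"nav"})` gives `"chrome-only"`, which an analyst could use to deprioritize the change.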