Links diff and HTML diff should ignore Java Servlet Session IDs #18

Open
Mr0grog opened this issue Apr 12, 2019 · 0 comments
Labels
enhancement New feature or request never-stale


Mr0grog commented Apr 12, 2019

Java Servlets can keep your session ID in the URL instead of in cookies by appending `;jsessionid=XYZ` to the path of every link URL in a page. See NOAA NCEI’s Historical Observing Metadata Repository for an example: https://www.ncdc.noaa.gov/homr/

(Note: the behavior only occurs on that site for fresh sessions, so try with a private/incognito browser window.)

https://www.ncdc.noaa.gov/homr/api;jsessionid=A2DECB66D2648BFED11FC721FC3043A1

Since any two captures of a page that does this will almost always have different session IDs, we should ignore this part of link/subresource URLs when diffing. (This should be adjustable via an argument, but I think ignoring it is the right default.) Ideally, the full URL would still appear in the output; it just wouldn’t be highlighted by the differ.

I’m pretty sure there are other (mostly older) systems that do something similar, and we should treat them the same as we discover them.
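
A rough sketch of the comparison described above, to make the idea concrete. This is not the project's actual implementation: `strip_jsessionid`, `urls_match`, and the `ignore_session_ids` argument are hypothetical names, and it assumes the session ID always appears as a `;jsessionid=...` parameter in the URL path. The original URLs are left untouched so they can still be shown in the output.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: the session ID is a `;jsessionid=...` path parameter that runs
# until the next `;`, `/`, `?`, or `#`. Matching is case-insensitive.
JSESSIONID_PATTERN = re.compile(r';jsessionid=[^;/?#]*', re.IGNORECASE)


def strip_jsessionid(url):
    """Return `url` with any `;jsessionid=...` path parameter removed."""
    parts = urlsplit(url)
    clean_path = JSESSIONID_PATTERN.sub('', parts.path)
    return urlunsplit(parts._replace(path=clean_path))


def urls_match(url_a, url_b, ignore_session_ids=True):
    """Compare two URLs, optionally ignoring servlet session IDs.

    Only the values used for comparison are normalized; callers keep the
    original strings for display.
    """
    if ignore_session_ids:
        url_a = strip_jsessionid(url_a)
        url_b = strip_jsessionid(url_b)
    return url_a == url_b
```

With this, two captures of the example link above that differ only in their session ID would compare as equal by default, and passing `ignore_session_ids=False` would restore strict comparison.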

Mr0grog referenced this issue in edgi-govdata-archiving/web-monitoring-processing Oct 22, 2019
This adds a new parameter to the HTML diff: `url_rules`. It should be a comma-separated list of custom rules to use when comparing any two URLs on the page (e.g. link `href` attributes or image `src` attributes).

These are useful for ignoring transient data in the URL that is pointlessly different on every page load or for comparing versions from popular archives like the Wayback Machine.

Partially addresses #391 (still needs to be applied to the links diff, too).
@Mr0grog Mr0grog transferred this issue from edgi-govdata-archiving/web-monitoring-processing Oct 26, 2020
@stale stale bot added the stale label Jun 2, 2021
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Jun 4, 2021
@stale stale bot removed the stale label Jun 4, 2021
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Jun 4, 2021
@Mr0grog Mr0grog added enhancement New feature or request never-stale labels Jun 4, 2021