Java Servlets can keep your session ID in the URL instead of in cookies by tacking `;jsessionid=XYZ` onto the end of all the link URLs in a page. See NOAA NCEI’s Historical Observing Metadata Repository for an example: https://www.ncdc.noaa.gov/homr/
(Note: the behavior only occurs on that site for fresh sessions, so try with a private/incognito browser window.)
Since most captures of a page using this will necessarily have different sessions, we should ignore this part of link/subresource URLs when diffing. (This should be adjustable via an argument, but I think ignoring it is the right default.) Ideally, the full URL would still appear in the output; it just wouldn’t be highlighted by the differ.
I’m pretty sure there are other (mostly older) systems that do something similar, and we should treat them the same as we discover them.
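As a rough illustration of the idea above, here is a minimal Python sketch that ignores the session parameter when comparing two URLs while leaving the original URLs untouched for display. The names `strip_jsessionid` and `urls_match` are hypothetical and not part of any existing API, and the pattern assumes the session ID runs until the next `?`, `#`, `;`, or `/`.

```python
import re

# Assumption: the session parameter looks like ";jsessionid=<value>" and the
# value ends at the next "?", "#", ";", "/" or at the end of the string.
JSESSIONID_PATTERN = re.compile(r';jsessionid=[^?#;/]*', re.IGNORECASE)


def strip_jsessionid(url):
    """Return `url` with any servlet session parameter removed."""
    return JSESSIONID_PATTERN.sub('', url)


def urls_match(url_a, url_b):
    """Compare two URLs while ignoring their session parameters.

    The inputs are not modified, so a differ can still show the full URLs
    in its output and simply skip highlighting them when this returns True.
    """
    return strip_jsessionid(url_a) == strip_jsessionid(url_b)
```

Two captures of the same link with different session IDs would then compare as equal, while links that differ elsewhere (for example in the query string) would still be reported as changed.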
Mr0grog referenced this issue in edgi-govdata-archiving/web-monitoring-processing on Oct 22, 2019
This adds a new parameter to the HTML diff: `url_rules`. It should be a comma-separated list of custom rules to use when comparing any two URLs on the page (e.g. link `href` attributes or image `src` attributes).
These are useful for ignoring transient data in the URL that is pointlessly different on every page load or for comparing versions from popular archives like the Wayback Machine.
Partially addresses #391 (still needs to be applied to the links diff, too).
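To make the comma-separated `url_rules` idea concrete, here is a small Python sketch of how such a list could be parsed and applied before two URLs are compared. Everything in it is an assumption for illustration: the rule names `jsessionid` and `wayback`, the regular expressions, and the helpers `parse_url_rules` and `normalize_url` are hypothetical and are not taken from the actual implementation.

```python
import re

# Hypothetical rule registry: each rule maps a URL to a normalized form.
# The rule names and patterns are illustrative assumptions only.
HYPOTHETICAL_RULES = {
    # Drop a servlet session parameter such as ";jsessionid=ABC123".
    'jsessionid': lambda url: re.sub(
        r';jsessionid=[^?#;/]*', '', url, flags=re.IGNORECASE),
    # Drop a Wayback Machine prefix such as
    # "https://web.archive.org/web/20191022000000/".
    'wayback': lambda url: re.sub(
        r'^https?://web\.archive\.org/web/\d+[a-z_]*/', '', url),
}


def parse_url_rules(url_rules):
    """Turn a comma-separated string of rule names into a list of functions."""
    names = [name.strip() for name in url_rules.split(',') if name.strip()]
    unknown = [name for name in names if name not in HYPOTHETICAL_RULES]
    if unknown:
        raise ValueError('Unknown URL rules: ' + ', '.join(unknown))
    return [HYPOTHETICAL_RULES[name] for name in names]


def normalize_url(url, rules):
    """Apply each rule in order and return the URL to use for comparison."""
    for rule in rules:
        url = rule(url)
    return url
```

With this shape, a differ would compare `normalize_url(a, rules)` against `normalize_url(b, rules)` and still print the original `a` and `b` in its output.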
Mr0grog transferred this issue from edgi-govdata-archiving/web-monitoring-processing on Oct 26, 2020