All notable changes to this project are documented in this file. The format is based on Keep a Changelog. This project adheres to Semantic Versioning.
Also, at the bottom of this file there is a TODO list with planned future changes.
tool-v0.21.0 - 2024-12-29: Bugfixes, incremental improvements
- `serve`:
  - When remapping URLs, hash/fragment parts will be preserved now.
  - When replaying, `HTTP` status codes will be preserved now.
  - From now on, by default, `serve` will replay archived `HTTP` headers over `HTTP`, instead of inlining them into rendered `HTML` documents (see below).
  - When both archiving and replaying, newly dumped reqres that fail given input filters will no longer be indexed and made available for replay.
- `mirror`, `serve`:
  - Reqres containing redirects (e.g. `302 Found`) are handled properly now.
  - From now on, implicit favicons will be mirrored and replayed properly (see below).
- `*`:
  - Fixed `--*grep*` filtering of headers with multiple values.
- `scrub`, `mirror`, `serve`:
  - Added `inline_headers` option, making inlining of headers as `meta http-equiv` tags optional.
  - Implemented `inline_fallback_icon` option.
    When enabled, this option adds a fallback `<link rel=icon href="/favicon.ico">` to the result when the input declares no icons and that URL remaps to something useful.
    This option is enabled by default, thus fixing replay of implicit favicons.
- `serve`:
  - Implemented `--web` and `--mirror` options which control how headers should be replayed.
    With `--web` enabled, `serve` will invoke `scrub` with `-inline_headers` and will replay those headers over `HTTP` instead.
    With `--mirror` it will continue to use `scrub` with `+inline_headers`, like `mirror` does.
    From now on, `--web` is the default.
  - Implemented `--oldest` and `--nearest` options, similar to those of `mirror`.
  - Added more namespaces other than `/web/` and made `serve` use them for different kinds of targets when remapping.
    So that, e.g., links pointing to unavailable URLs get remapped to `/unavailable/<date>/<url>` and links pointing to redirects get remapped to `/redirect/<date>/<url>`.
    This makes links much more informative when hovering over them or when looking at the log output of `serve`.
    For replay, however, all those namespaces are equivalent and can be used interchangeably.
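The namespace equivalence described for `serve` can be pictured with a small sketch. It assumes only the `/<namespace>/<date>/<url>` path shape quoted above; the function names, the example date string, and the restriction to the three namespaces named in this entry are illustrative assumptions, not the actual `hoardy-web` code.

```python
# Hypothetical sketch: not hoardy-web's actual implementation.
# Only the namespaces named in this changelog entry are listed here.
KNOWN_NAMESPACES = {"web", "unavailable", "redirect"}


def parse_replay_path(path: str) -> tuple[str, str, str]:
    """Split '/<namespace>/<date>/<url>' into (namespace, date, url)."""
    parts = path.lstrip("/").split("/", 2)
    if len(parts) != 3 or parts[0] not in KNOWN_NAMESPACES:
        raise ValueError(f"not a replay path: {path!r}")
    namespace, date, url = parts
    return namespace, date, url


def replay_target(path: str) -> tuple[str, str]:
    """Return what gets replayed; the namespace is informative only."""
    _namespace, date, url = parse_replay_path(path)
    return date, url
```

Under these assumptions, `/unavailable/...` and `/redirect/...` links resolve to the same `(date, url)` pair as the matching `/web/...` link, so the namespace only changes what the link tells a reader.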
- `serve`, `mirror`:
  - Renamed `--ignore-bad-inputs` -> `--ignore-some-inputs`.
  - Changed default input filters to allow reqres containing redirects.
- `mirror`:
  - Added a default value for the root filters, which is `--root-status-re ".[23]00C"` to prevent redirects being added as roots.
  - Added `--queue-all-indexed` option to make the previous item optional.
  - Changed (simplified) semantics of the `--boring` option.
    From now on, making a path `--boring` simply disables queuing of its reqres as roots.
    This allows for more interesting uses.
- Improved documentation.
extension-v1.19.0 - 2024-12-21: Reworked popup UI, better replay integration
- Popup UI:
  - Reorganized the whole layout by assigning tags to all elements and allowing switching between those tags as if they were tabs.
    The original idea was to unroll in steps a-la `uBlock Origin`, but this is superior.
  - Improved some help strings.
- Core + Popup UI + Shortcuts:
  - Added `Replay from the archiving server` configuration option.
    It's a tristate of: disallow, enable if `Submit dumps via 'HTTP'` option is enabled and the server supports it, enable even if `Submit dumps via 'HTTP'` option is disabled.
  - Added `Include in global replays` per-tab options.
  - Added popup UI button and keyboard shortcut both of which re-navigate all tabs for which `Include in global replays` is set to their replays.
  - Added popup UI button, keyboard shortcut, and context menu item all of which re-navigate a currently active tab to its replay.
  - Added `Force 'Work offline' in replayed tabs` configuration option which does the same thing the similar options for `file:` and `data:` URLs do, but for tabs that point to replay URLs. Enabled by default.
  - Added `🎄 Winter Days mode` seasonal theme.
  - Added `Escape notification messages` configuration option to help support more notification daemons. Disabled by default.
- The `Help` page:
  - Merged "Handling of failures" section into "Archival".
  - Reworded some awkward places.
- Core + `manifest.json`:
  - Improved server checking logic and error messages.
  - Improved keyboard shortcut descriptions.
- Improved documentation.
- Core:
  - `Snapshot` buttons and keyboard shortcuts will no longer take `DOM` snapshots of replay pages, unless `Capture snapshots of all URLs` option is set.
  - On Chromium, fixed `Hoardy-Web` trying to collect and archive replay pages.
extension-v1.18.0 - 2024-12-16: Replay integration, incremental improvements
This release integrates the `extension` with `tool-v0.20.0`, which can now do both archival and replay over `HTTP`, see below.
- Core:
  - From now on, all requests to all URLs under `Server URL` will be ignored, allowing you to work with `tool-v0.20.0`-replayed pages without fiddling with any settings.
  - From now on, the `extension` will respect archiving server's settings and features given by its `/hoardy-web/server-info` endpoint, if such a thing exists.
  - The default value of `Server URL` does not specify `/pwebarc/dump` endpoint anymore, as this is now configurable server-side.
    For old configs, you can keep the old value, the archiving server handling code will silently elide that path away.
  - From now on, before the first archival, the `extension` will check that a working archiving server is available at the given `Server URL` and generate errors describing what exactly appears to be broken when not.
- Popup UI:
  - From now on, if you set `Server URL` setting to an empty string, it will be reset to the default value.
  - Improved CPU usage when switching tabs really quickly.
tool-v0.20.0 - 2024-12-16: Replay over `HTTP`, mirroring of non-`GET` reqres
- `export mirror`:
  - Renamed `export mirror` sub-command to just `mirror`.
- `*`:
  - Renamed all `--no-overwrites` options -> `--no-overwrite`.
- `*`, `--expr`:
  - Renamed `source` -> `agent`.
  - Renamed `raw_path_parts` -> `path_parts`.
  - Renamed `mq_raw_path` -> `mq_path`.
  - Renamed `qsl_urlencode` atom -> `unparse_query`.
- Improved URL normalization:
  - From now on, it will preserve "=" symbols in query strings even when parameter values are empty, like browsers do.
  - URL path and query quoting and unquoting is now, hopefully, equivalent to what browsers do, too.

  This changes the file names generated by `organize` with the default `--output` format a bit.
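To illustrate what preserving "=" for empty parameter values means in practice, here is a minimal sketch using Python's standard `urllib.parse`. It is not the normalization code `hoardy-web` uses; `normalize_query` and `lossy_query` are hypothetical helper names introduced only for this example.

```python
from urllib.parse import parse_qsl, urlencode


def normalize_query(query: str) -> str:
    """Re-serialize a query string, keeping parameters with empty values."""
    # keep_blank_values=True keeps pairs like "a=" instead of dropping them.
    pairs = parse_qsl(query, keep_blank_values=True)
    return urlencode(pairs)


def lossy_query(query: str) -> str:
    """The same round trip with the default settings, which drop empty values."""
    return urlencode(parse_qsl(query))


kept = normalize_query("a=&b=1")
dropped = lossy_query("a=&b=1")
```

Note that this simple round trip also turns a bare `a` (with no "=" at all) into `a=`, so it is only an approximation of browser behaviour.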
- `serve`:
  - Implemented the `serve` sub-command, which runs `hoardy-web` as a web server that can replay archived data over `HTTP`, a-la `heritrix` and `pywb`.
    After starting it with something like `hoardy-web serve path/to/your/archives`, you can then navigate to
    - http://127.0.0.1:3210/web/*/* to see the list of all available URLs and their versions (visits), or to
    - something like http://127.0.0.1:3210/web/2/https://archiveofourown.org/works/3733123 to view the latest archived version of that URL, or to
    - something like http://127.0.0.1:3210/web/*/https://archiveofourown.org/works/3733123 to view the list of all visits to this URL,
    - which also works with glob patterns http://127.0.0.1:3210/web/*/https://archiveofourown.org/works/[0-9]*.

    This is very reminiscent of the Wayback Machine by design, yes.
  - Added `/hoardy-web/server-info` endpoint support for future integration with the `extension`, similar to what `simple_server` (`hoardy-web-sas`) now does.
  - Implemented archiving server support when running `serve` with `--archive-to` option.
    This is similar to `simple_server`, except newly archived reqres will become available for replay immediately.
  - Implemented archiving-server-only mode when running `serve` with `--no-replay` option.
    In this mode, it is essentially equivalent to `simple_server`, except `hoardy-web serve` supports arbitrary `--output` formats.
  - Implemented `--latest` option, which only indexes and allows replay for the latest available visit to each available URL.
    Archiving new reqres updates the index accordingly, as expected.
  - Documented it all across the whole repository.
- `mirror`:
  - Implemented rendering of non-`GET` reqres.
    So, e.g., `DOM` snapshots and web search answer pages done via `POST` will be included in the outputs now.
    If you do not want some of those, you can filter them out with `--not-method POST` or some such.
- `scrub`, `mirror`, `serve`:
  - From now on, malformed URLs will be kept as-is instead of being voided out.
  - From now on, more types of IE-pragmas will be censored out by `-iepragmas` (which is the default).
  - From now on, `scrub` will use `+verbose` and `+whitespace` as defaults.
    This is a much nicer default, and after content-addressed outputs were implemented in `tool-v0.19.0`, the resulting space savings `-verbose,-whitespace` produce are mostly inconsequential now.
  - Simplified semantics of `(+|-)pretty`, it does not set the `verbose` option anymore.
- `*`:
  - Added `--structure` and `--raw-qbody` options.
  - Added a bunch more parsed URL properties.
  - Added a bunch more similar reqres properties.
- `mirror`:
  - Changed semantics of `--nearest` option a bit. From now on, it will parse its argument as a time interval and then take the middle of it as the target value.
    This is much nicer in practice since, from now on, giving `--nearest 2024` is much less likely to get you the stuff from 2023. It will try to give you stuff nearest to `2024-07-02 00:00:00` instead.
  - Improved performance.
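The arithmetic behind the new `--nearest` semantics can be checked with a short sketch. It only handles whole-year arguments and is an illustration, not the parser `hoardy-web` uses; `year_interval`, `interval_middle`, `nearest`, and the sample visit dates are all made up for this example.

```python
from datetime import datetime


def year_interval(year: int) -> tuple[datetime, datetime]:
    """Interpret a bare year as the half-open interval [Jan 1, next Jan 1)."""
    return datetime(year, 1, 1), datetime(year + 1, 1, 1)


def interval_middle(start: datetime, end: datetime) -> datetime:
    """Take the middle of the interval as the target value."""
    return start + (end - start) / 2


def nearest(visits: list[datetime], target: datetime) -> datetime:
    """Pick the visit with the smallest absolute distance to the target."""
    return min(visits, key=lambda t: abs(t - target))


# Two hypothetical visits: one at the very end of 2023, one in mid-2024.
visits = [datetime(2023, 12, 31), datetime(2024, 6, 1)]
start, end = year_interval(2024)
target = interval_middle(start, end)

picked_from_start = nearest(visits, start)    # targeting the start of the interval
picked_from_middle = nearest(visits, target)  # targeting the middle of the interval
```

Since 2024 is a leap year, the middle of the interval falls 183 days after January 1, which is the `2024-07-02 00:00:00` quoted above. With the start of the interval as the target, the 2023 visit wins; with the middle, the 2024 visit does.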
- `*`:
  - Renamed `--no-remap` option -> `--raw-sbody`, the old name is kept as an alias.
- Improved documentation and help strings.
  Most notably, the input filtering options are shown only once now.
- `pyproject.toml` now explicitly specifies optional `mitmproxy` file format support.
- `*` in theory, but only ever triggered by `mirror`:
  - Fixed a file descriptor semi-leak when lazily reloading reqres.
simple_server-v1.8.0 - 2024-12-16
- Added `-t`, `--to`, and `--archive-to` aliases for `--root`.
- Added `/hoardy-web/server-info` endpoint for future integration with the `extension`.
- From now on, "/" and most other non-word symbols (except "_", "-", and space) in bucket names are forbidden and will be removed.
  This will simplify some future things.
- From now on, when several buckets are specified via several `profile` query parameters, the last one will be used.
- Renamed `--uncompressed` -> `--no-compress`, the old name is kept as an alias.
- Slightly improved performance.
- Started typechecking with `mypy`.
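The two bucket-related rules above (forbidden symbols get removed, the last `profile` parameter wins) can be sketched as follows. This is an approximation under stated assumptions, not the `simple_server` source: the exact character class, the `default` fallback name, and the example URL are guesses made for illustration.

```python
import re
from urllib.parse import parse_qs, urlsplit


def sanitize_bucket(name: str) -> str:
    """Remove everything except word characters, "_", "-", and space."""
    # Assumption: "non-word symbols" is read as "not matching \w" here.
    return re.sub(r"[^\w \-]", "", name)


def bucket_from_url(url: str, default: str = "default") -> str:
    """Use the last `profile` query parameter as the bucket name."""
    profiles = parse_qs(urlsplit(url).query).get("profile", [])
    # The fallback bucket name is hypothetical.
    return sanitize_bucket(profiles[-1]) if profiles else default


bucket = bucket_from_url(
    "http://127.0.0.1:3210/pwebarc/dump?profile=first&profile=second%2Fx"
)
```

Here `bucket` ends up as `secondx`: the first `profile` value is ignored, and the "/" decoded from `%2F` is removed.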
tool-v0.19.0 - 2024-12-07: Powerful filtering, new mirroring modes
- `*`:
  - In `--expr` expressions, `sha256` function changed semantics. From now on it returns the raw hash digest instead of the hexadecimal one. To get the old value, use `sha256|to_hex`.
- `*` except `organize --move`, `organize --hardlink`, `organize --symlink`, `get`, and `run`:
  - From now on, all sub-commands except for the above can take inputs in all supported file formats.
    I.e., you can now do `hoardy-web export mirror --to ~/hoardy-web/mirror1 mitmproxy.*.dump` on `mitmproxy` dumps without even `import`ing them first.
  - By default, the above commands now also automatically dispatch between loaders of different file formats based on file extensions. So you can mix and match different file formats on the same command line.
  - Added a bunch of `--load-*` options that force a specific loader instead, e.g. `--load-wrrb`, `--load-mitmproxy`.
- `*`:
  - Added a ton of new filtering options.
    For example, you can now do:

        hoardy-web find --method GET --method DOM --status-re .200C --response-mime text/html \
          --response-body-grep-re "\bPotter\b" ~/hoardy-web/raw

    As before, these filters can still be used with other commands, like `stream`, or `export mirror`, etc.

    Also, the overall filtering semantics changed a bit. The top-level logical expression the filters compute is now a large conjunction. I.e. the above example now compiles to, a bit simplified, `(response.method == "GET" or response.method == "DOM") and re.match(".200C", status) and (response_mime == "text/html") and re.match("\\bPotter\\b", response.body)`.
  - Added a bunch of new `--output` formats. Mostly, this adds a bunch of output formats that refer to `stime`s. Mainly, to simplify `export mirror --all` usage, described below.
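The "large conjunction" semantics can be made concrete with a small sketch: repeated values of one option are OR-ed together, and the different options are AND-ed. The `Reqres` class, its field names, the `compile_filters` helper, and the `C200C` status string are simplified stand-ins invented for this example, not `hoardy-web` data structures.

```python
import re
from dataclasses import dataclass


@dataclass
class Reqres:
    # A deliberately tiny, hypothetical model of a request+response pair.
    method: str
    status: str
    response_mime: str
    response_body: str


def compile_filters(methods, status_re, mimes, body_grep_re):
    """Build one predicate: a conjunction of per-option disjunctions."""
    status_pat = re.compile(status_re)
    body_pat = re.compile(body_grep_re)

    def predicate(r: Reqres) -> bool:
        return (
            r.method in methods                          # --method A --method B: any of
            and status_pat.match(r.status) is not None   # --status-re
            and r.response_mime in mimes                 # --response-mime
            # "grep" is modelled with search(), i.e. a match anywhere in the body.
            and body_pat.search(r.response_body) is not None
        )

    return predicate


keep = compile_filters({"GET", "DOM"}, r".200C", {"text/html"}, r"\bPotter\b")
page = Reqres("GET", "C200C", "text/html", "<p>Harry Potter</p>")
```

With these definitions `keep(page)` holds, while changing any single field so that its own filter fails (for instance, the method to `POST`) rejects the reqres.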
- `export mirror`:
  - Implemented mirroring of different URL visits.
    I.e., you can now mirror not just `--latest` visit to each URL, but an `--oldest` one, or one `--nearest` to a given date, or `--all` of them.
  - Implemented `--latest-hybrid`, `--oldest-hybrid`, and `--nearest-hybrid` options.
    These allow you to mirror each page with resource requisites that are date-wise closest to the `stime` of the page itself, instead of taking globally `--latest`, `--oldest`, or `--nearest` versions of all requisite URLs.
    At the moment, this takes a lot more memory, but makes the results much more consistent for websites that do not use versioned resource requisites.
  - Implemented `--hardlink` and `--symlink` options, which allow mirroring into content-addressed destinations.
    I.e. `export mirror --hardlink` will render and write each mirrored file to `<--to>/_content/<hash/based/path>.<ext>` and only then hardlink the result to `<--to>/<output/format/based/path>.<ext>` target destination. And similarly for `--symlink`.
    This saves quite a bit of space when pages refer to the same resource requisites by slightly different URLs, same images and fonts get distributed via different CDN hosts, when you mirror `--all` visits to some URLs and many of those are absolutely identical, etc.
    So, from now on, `--hardlink` is the default. The old behavior can be achieved by running it with `--copy` instead.
  - Implemented `--relative` and `--absolute` options, which control if URLs should be remapped to relative or absolute `file:` URLs, respectively.
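The content-addressed layout described for `--hardlink` can be sketched roughly as below. This is a simplified illustration, not the `export mirror` implementation: the hash-to-path scheme under `_content/`, the `write_deduplicated` name, and the example host names are assumptions made for this example.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def write_deduplicated(root: Path, target: str, data: bytes) -> Path:
    """Store `data` once under `_content/`, then hardlink it to `target`."""
    digest = hashlib.sha256(data).hexdigest()
    ext = Path(target).suffix
    # Hypothetical hash-based path: first two hex digits as a directory.
    content = root / "_content" / digest[:2] / (digest[2:] + ext)
    if not content.exists():
        content.parent.mkdir(parents=True, exist_ok=True)
        content.write_bytes(data)
    dest = root / target
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists() or dest.is_symlink():
        dest.unlink()
    os.link(content, dest)  # both names now refer to the same file on disk
    return dest


# The same font served by two different CDN hosts is stored only once.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    a = write_deduplicated(root, "cdn1.example.org/font.woff", b"same bytes")
    b = write_deduplicated(root, "cdn2.example.org/font.woff", b"same bytes")
    shared = os.path.samefile(a, b)
    stored = [p for p in (root / "_content").rglob("*") if p.is_file()]
    stored_count = len(stored)
```

Writing the content-addressed file first and linking second means identical payloads always map to the same stored file, whatever URL they came from. A `--symlink`-style variant would presumably replace `os.link` with `os.symlink`.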
- Documented all the new things.
- Added a bunch of new `test-cli.sh` tests.
- `export mirror`:
  - `--root-*` options now use the same syntax and machinery and support the same filtering options as the normal input filters.
  - Switched default `--output` to `hupq_n` to prevent collisions when using `--*-hybrid` and `--all`.
  - Improved handling of `base` `HTML` tags, `target` attributes are supported now.
  - Links that reference a page from itself will no longer refer to the page's filename, even when the link has no `fragment`.
    The results can be a bit confusing, but this makes the new content de-duplication options much more effective.
  - Made `export mirror` default filters explicit and changed them from `--method "GET" --status-re ".200C"` to `--method "GET" --method "DOM" --status-re ".200C"`.
  - Implemented `--ignore-bad-inputs` and `--index-all-inputs` options to allow you to change the above default.
  - Improved output log format.
- Improved file loading performance a bit.
- Improved documentation.
- Added a bunch of new tests for `organize`, which cover the `organize --symlink --latest` bug of `tool-v0.18.0`. Won't happen again.
- Fixed a couple of silly filtering-related bugs.
tool-v0.18.1 - 2024-11-30: Hotfixes
`tool-v0.18.0` introduced a bunch of issues:

- `organize`:
  - Fixed `organize --symlink --latest` dereferencing output files, which led to it overwriting plain `WRR` source files containing updated URLs with symlinks to their newer versions.

    The good news is that this bug was only triggered when `organize --symlink --latest` was run with some newly archived data and, for each updated `URL`, it only overwrote the second to last `WRR` file with a symlink to the latest `WRR` file. Unfortunately, this error was self-propagating, so those files could then get overwritten again by the next invocation of `organize --symlink --latest` with some more new data. This could happen up to 7 times, at which point it would start crashing, because of the OS symlink dereferencing limit.

    You can check if you were affected by running:

        cd ~/web/raw ; find . -type l

    The paths it outputs will be the paths of lost `WRR` files.

    A reminder that it is good to do daily backups, I suppose.

    The next version will have a test for this, but I'm releasing this hotfix an hour after I discovered this.
  - Fixed it `assert`-crashing sometimes when running with `--symlink`.
  - Improved memory consumption a bit.
- `export mirror`:
  - Fixed overly large memory consumption.
tool-v0.18.0 - 2024-11-20: Incremental improvements
- `export mirror`:
  - Implemented the `--boring` option, which allows you to load some input `PATH`s without adding them as roots, even when no `--root-*` options are specified.
    This makes the CLI a bit more convenient to use. The `README.md` has a new example showcasing it.
- `export mirror`, `scrub`:
  - Implemented support for `@import` `CSS` rules using a string token in place of a URL.
    As far as I can see, this syntax is rarely used in practice, but the spec allows this, so.
  - Implemented `interpret_noscript` option, which enables inlining of `noscript` tags when `scrub` is running with `-scripts`.
    That is, `export mirror` will now use this feature by default.
    This is needed because some websites put `link` tags with `CSS` under `noscript`, thus making such pages look broken when `scrub`bed with `-scripts` (which is the default) and then opened in a browser with scripts enabled.
- `*`: Refactored/reworked a large chunk of internals, as a result:
  - `organize` can now take `WRR` bundles as inputs too,
  - `export mirror` became much faster at indexing inputs that contain archives of the same URLs, repeatedly.

  In general, these changes are aimed towards making `hoardy-web` completely input-agnostic. That is, wouldn't it be nice if you could feed `mitmproxy` files to `export mirror` directly, instead of going through `import mitmproxy` first?
- `export mirror`, `scrub`:
  - From now on, it will stop generating `link` tags with void URLs, it will simply censor them out instead.
  - `scrub` with `+verbose` set will now also show original `rel` attr values for censored out tags.
  - Also, in general, the outputs of `scrub` with `+verbose` set are much prettier now.
- Improved documentation.
tool-v0.17.0 - 2024-11-09: Incremental improvements
- `*` except `organize`, `get`, and `run`:
  - All `WRR`-processing sub-commands except for the above can now take `WRR` bundles as inputs.
    That is, you can now directly do

        hoardy-web pprint ~/Downloads/Hoardy-Web-export-*.wrrb
        hoardy-web export mirror --to ~/web/mirror ~/Downloads/Hoardy-Web-export-*.wrrb
        # etc

    without needing to run `hoardy-web import bundle` first.
    Though, at the moment, `export mirror` will stop respecting `--max-memory` option for such inputs.
- `export mirror`, `scrub`:
  - Implemented support for old-style `HTML` pages using `frameset` and `frame` `HTML` tags.
  - Implemented support for stylesheets stored as `data:` URLs stored in `href`s of `link` tags.
    Yes, this is actually allowed by the specs and the browsers.
- `*`:
  - Parsing of `MIME` types, `Content-Type`, and `Link` headers is much more forgiving towards malformed inputs now.
- `export mirror`, `scrub`:
  - Links pointing to an `id` on the same `HTML` page will now get emitted as `#<id>`, not `./<file>#<id>`.
  - `Refresh` headers with non-`HTTP` URLs will get censored away now.
  - Improved error messages in cases when an `HTTP` header failed to parse.
  - Improved performance a little bit.
- `export mirror`, `scrub`:
  - From now on, `scrub` will simply drop `HTML` tag attributes when all URLs in their values get censored away.
    Previously, it produced attributes with void URLs instead.
    This makes a huge difference for `src` and `srcset` attributes of `HTML` `img` tags where, before, generated pages plugged void URLs for missing sources, which sometimes confused browsers about which things should be used to display stuff, breaking things.
- `*`:
  - `HTML`s with Byte Order Marks will no longer get `mimesniff`ed as `text/plain`.
  - Fixed parsing of quoted `MIME` parameter values.
  - Fixed various crashes when processing data generated by the extension running under Chromium.
extension-v1.17.2 - 2024-11-09: Documentation fixes, mostly
- The `Help` page:
  - Rewrote "Conventions" and "'Work offline' mode" sections to be much more readable.
- `*`:
  - Improved contrast when running with a light `CSS` color scheme.
- Documentation:
  - Fixed some typos.
- `*`:
  - Fixed some potential state display inconsistency bugs and improved UI pages' init performance when the core is very busy.
extension-v1.17.1 - 2024-11-01: Annoyance fixes
- Popup UI:
  - Reverted most of the block reordering bit of popup UI rework of `extension-v1.17.0`.
    The "Globally" block is near the top again.
  - Edited the "Persistence" block a bit more.
    Mainly, to stop graying out always-useful stat lines, even when the associated features are disabled.
    This prevents possible confusion of what buttons can be used when.
  - Renamed some options and stat lines, mostly to make their names shorter to make popup UI on Fenix more readable.
- Toolbar button:
  - Edited its title format to be much shorter, especially on Fenix.
  - Reverted the ordering of parts there to how it was before `extension-v1.17.0`.
    The (much shorter now) "globally" part is at the front again because otherwise the badge being at the front there too without an explanation of its format is kind of confusing.
- Core + All internal pages:
  - Improved internal async message handling infrastructure, making things slightly more efficient.
  - Improved initialization functions of all internal pages, making them more efficient and making the resulting UI much less jiggly when changing zoom level and/or jumping around between pages.
- `*`:
  - Renamed `build.sh` `firefox` target to `firefox-mv2`, for consistency.
- UI:
  - Fixed flaky rendering of `Help` and `Changelog` pages on Fenix.
    They render properly now the very first time you load them, no reloads needed.
  - Fixed duplication of history entries when navigating internal links.
  - Fixed source links sometimes failing to be highlighted when pressing the browser's "Back" button.
  - Fixed some small `CSS` nitpicks.
- Popup UI + Documentation:
  - Realigned some help strings with reality.
- `*`:
  - Fixed a couple more mostly inconsequential tiny bugs.
- The `Help` page:
  - Documented what `webNavigation` permission is used for, improved the rest a bit.
extension-v1.17.0 - 2024-10-30: Halloween special: major UI and state display improvements, fine-grained `Work offline` mode, add-on reloading with its state preserved, new options, etc
In related news, I have 💸☕ a Patreon account now.
- Core:
  - Fixed a bug in `upgradeConfig` that was resetting `bucket` settings to their default values on upgrade to `extension-v1.13.0`. So, this is no longer relevant, but still. Also, refactored code there to prevent such errors in the future.
    However, just in case, if you previously set `bucket` settings to something other than their default values and those settings are important to you, you should probably check your settings to ensure everything there is set as you expect it to be.
- Core + Popup UI + Documentation:
  - Renamed `failed` state and related `failed*` stats to `unarchived` state and `unarchived*` stats. Introduced a new `failed` stat that is now a sum of `unstashed` and `unarchived` stats. Edited the popup UI and the other pages appropriately.
    This makes documentation's terminology more consistent, and simplifies UI a bit.
    In particular, the `Retry` button of `Queued/Failed` stat line will both retry stashing `unstashed` and archiving `unarchived` reqres now.
- Popup UI:
  - Reworked the whole thing quite a bit:
    - Improved option names and help strings.
    - Sorted sections and options to follow a more logically consistent order.
    - Improved layout.
    - Fixed some typos there.
  - From now on, setting `Bucket` for the current tab will set `Bucket` for its new children too, similar to how the rest of those settings work.
  - From now on, setting any of the `Bucket` settings to nothing will reset it to the parent/default value. I.e.:
    - Setting `Bucket` of `This tab's new children` to nothing will reset it to `Bucket` value of `This tab`.
    - Setting `Bucket` of `This tab` to nothing will reset it to `Bucket` value of `New root tabs`.
    - Setting `Bucket` of `New root tabs` to nothing will reset it to `default`.
- The `Help` page:
  - The previous "Desktop" `JavaScript`-generated layout became `columns` `CSS` layout and `JS`-operation mode, while the "Mobile" `JavaScript`-generated layout became `linear` `CSS` layout and `JS`-operation mode. The page will now automatically switch between these two layouts and modes synchronously, depending on viewport width.
    (As before, in `linear` mode hovering over a link does nothing, but in `columns` mode, hovering over a link referring to a target in popup UI scrolls the popup UI column to that target and highlights it.)
    I.e., this means that on a Desktop browser, you can now zoom the `Help` page to arbitrary zoom levels and it will just switch between layouts and link-hover behaviors depending on available viewport width.
  - Greatly improved the styling of all links and documented it in the "Conventions" section.
- All internal pages:
  - All internal pages now color-code links depending on where they point to, using exactly the same `CSS` as the `Help` page.
  - All pages now use the same history state handling behaviour.
    I.e., using the "Back" button of your browser will now not only go back, but also highlight the last link you clicked.
  - All documentation pages now set viewport width to `device-width`, set content's `max-width` to `900px` and `width` to `100% - padding`, preventing horizontal scroll, when possible.
  - Improved the `CSS` styling in general.
- Core + Popup UI + General UI:
  - Implemented a new popup UI tristate toggle named `Color scheme` which allows `Hoardy-Web`'s color-scheme to be different from the browser's default.
  - Implemented a mechanism and popup UI settings for applying additional themes and experimental features.
  - And then I looked at the date. Which is why ◥▅◤◢▅◣◥▅◤ `Hoardy-Web` now has `🦇 Halloween mode`. ◥▅◤◢▅◣◥▅◤.
  - Also, from now on, the neutral states of tristate toggles are displayed with toggle knobs being in the middle of the things, not on their left. This is not a political statement. This means that all tristate toggles, from left to right, now go `false` -> `null` -> `true` both internally (exactly as they did before) and externally (which is new).
- Core + Toolbar button + Icons:
  - Replaced toolbar button's icons representing Cartesian products of other icons with animations.
    In other words, the previous "this tab has limbo mode enabled while this tab's children do not" icon will now be represented with an animation that switches between "this tab has limbo mode enabled" and "this tab is idle" icons instead.
    This takes less space in the `XPI`/`CRX`, makes for a cuter UI, and is the only reasonable solution when the core wants to display more than two icons at the same time.
  - Improved toolbar button's badge and title format a bit.
    "This tab" part goes first now, then "its new children", then "globally".
    Also, the order of sub-parts of those strings is more consistent now.
  - From now on, internal UI updater will generate icon animation frames for all important statuses and setting states.
    - When per-tab and per-tab's-new-children animation frames are equal, the repeated part will be elided.
    - When per-tab and per-tab's-new-children animation frames differ, the `main` icon will be inserted at the end to make it obvious when the animation loop restarts (otherwise, it's easy to interpret such animation loops incorrectly).
  - The update frequency of toolbar button's icon, badge, and title now depends on the amount of not yet done stuff still queued in the core.
    I.e., from now on, when the core has a lot of stuff to do (like when re-archiving thousands of reqres at the same time), it will start updating toolbar button's properties less often to trade update latency for improved performance, and vice versa.
  - Greatly improved performance of state display updates. It uses 2-1000x less CPU now, depending on what the core is doing.
- Icons:
  - Renamed the `error` icon to `failed` and added a new `error` icon.
    From now on, the `failed` icon will only be used for archival/stashing errors, while the `error` icon will only be used for internal errors (i.e. bugs).
  - Improved all icons to make them more visually distinct when they are being rendered at 48x48 or less, both in light and dark mode.
  - On Chromium, all icons are now rendered with transparent backgrounds, so now they will look nice in the dark mode too.
- Core + Popup UI + Toolbar button:
  - From now on, popup UI and toolbar button's badge and title will display information about currently running internal actions.
    (Implementing this took a surprising amount of effort in improvements to infrastructure code.)
- Core + Toolbar button + Icons:
  - Added a new `in_limbo` icon for "this tab has data in limbo" status. Unlike most other icons, this icon will never be used alone, it will always be an animation frame of something longer.
- Core + Popup UI + Toolbar button:
  - Implemented `Animate toolbar icon every` setting for controlling toolbar icon animation speed.
- Core + Toolbar button:
  - Fixed a bunch of bugs that prevented updates to toolbar button's icon and badge in some cases.
  - The icon and the badge will no longer get stuck when the core is very busy, like when re-archiving a lot of stuff all at once.
- Core + Popup UI + Toolbar button + Icons + Documentation:
  - Implemented `Work offline` mode, options, their popup UI, shortcuts, and icons.

    This mode does the same thing as `File > Work Offline` checkbox of Firefox, except it supports per-tab/per-other-origin operation, not just the whole-browser one. Also, enabling any of these options will not break requests that are still in flight, and the requests they do cancel can be logged.

    That is, enabling `Work offline` in a tab will start canceling all new requests that tab generates, and the resulting `canceled` reqres will get logged if `Track new requests` option is enabled in the same tab. Similarly for background tasks and other origins.

    This can be generally useful for debugging your own websites with dynamic responsive `CSS`, or if you just want to prevent a tab from accessing the network for some reason.

    However, the main reason this exists is that the files generated by `hoardy-web export mirror` do not get `scrub`bed absolutely correctly at the moment, and the resulting pages can end up with some references to remote resources (in cases when a mirrored page uses some rare `HTML` and `CSS` tag combinations, or lazy-loads images via `JavaScript`, but still). With `Work offline` options enabled in a tab, you can now be sure that opening pages generated by `hoardy-web export mirror` won't send any requests to the network.

    In fact, from now on, by default, `Hoardy-Web` will enable `Work offline` in all tabs pointing to `file:` URLs. This can be disabled in the settings.
  - Documented it in more detail on the `Help` page.
  - Added a new `offline` toolbar icon to display the above state.
- Core + Popup UI + Documentation:
  - Implemented `reloadSelf` action that reloads the add-on while preserving its state.
    This action is different from similar `Reload` buttons in browser's own UI in that triggering this action will reload the add-on while preserving its state. Meanwhile, using the browser's buttons will reset everything and lose all reqres that are both unarchived and unstashed.
  - Added a popup UI button for triggering this action. (The button is only shown when a new version is available, unless debugging is enabled.)
  - Implemented `Auto-reload on updates` setting to automate away clicking of that button on updates. Though, it is currently disabled by default, because this feature is a bit experimental at the moment.
  - Documented it a little bit on the `Help` page.
  - That is, from now on, after the browser notifies the add-on that it is ready to be updated, the popup UI will display a button allowing you to reload it, so that the browser could load the new version instead. Alternatively, you can now enable `Auto-reload on updates` and it would do that automatically.
- Core + Popup UI:
  - Implemented `Export via 'saveAs' > Bundle dumps` option as a separate toggle instead of forcing you to set the maximum size to `0` to get the same effect.
  - Implemented `Include in global snapshots` per-tab/per-origin setting.
    I.e., you can now exclude specific tabs from being included in all-tab `DOM`-snapshots even when `Track new requests` option is enabled.
- Notifications + Popup UI:
  - Implemented `Notify about 'problematic' reqres` per-tab/per-origin setting.
    I.e., you can now exclude specific tabs from generating notifications about `problematic` reqres even when `Generate notifications about > ... new 'problematic' reqres` option is enabled.
- Notifications + Documentation:
  - From now on, clicking an error notification will open the relevant section of the `Help` page while doing nothing for other notifications.
  - Also, improved that section a little bit.
-
Documentation:
-
Core:
-
On Chromium, there will no longer be duplicates in reqres errors lists.
-
From now on, when you close a tab, all in-flight reqres in it will be emitted with
*::capture::EMIT_FORCED::BY_CLOSED_TAB
error set. Before, some of them sometimes finished withwebRequest::*_ABORT
errors instead. -
Refactored a lot of internal stuff, simplifying how many internal things are done.
-
-
Icons:
-
Core + Documentation:
-
Fixed handling of interactions between page scrolling, node highlighting, and help tooltips.
I.e., on the
Help
page, highlighting an option in the popup UI by hovering over a link there, and then clicking on the help tooltip of the highlighted option will no longer make the UI look weird.
-
-
Core + Notifications:
- Fixed a small bug preventing no longer relevant notifications about
unarchived
reqres from being closed automatically.
tool-v0.16.0 - 2024-10-19
-
scrub
,export mirror
:-
Implemented inlining of
Link
,Refresh
,Content-Security-Policy
, and some otherHTTP
headers into the mirroredHTML
files asmeta http-equiv
tags. -
scrub
now has(+|-)navigations
option which controls whether the resultingmeta http-equiv=refresh
headers should be kept or censored out,-navigations
is the default. -
Also, CSP headers are not supported yet and, thus, the generated
meta http-equiv=content-security-policy
tags will get immediately censored out, which is usually invisible, but can be seen with+verbose
set.
-
-
export mirror
:- Added
--max-memory
option, allowing you to sacrifice arbitrary amounts of RAM to improve performance.
-
*
:-
Added unit tests for all internal parsers.
-
Added a lot of new integration tests.
-
-
export mirror
:-
Improved the mirroring algorithm, switched to a completely recursive implementation, in preparation for future extensions with cool features.
-
From now on, writes to all files that are being mirrored (not just the top-level ones) will be atomic with respect to their dependencies.
-
-
*
:-
Refactored internals a lot.
-
Improved performance a bit.
-
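The header inlining introduced in this release can be pictured with a minimal sketch. It is only an illustration, not the tool's actual code: the function name `headers_to_meta`, the `INLINABLE` allow-list, and the `navigations` parameter are hypothetical, and the sketch does not model the censoring of `content-security-policy` tags mentioned above.

```python
from html import escape

# Hypothetical allow-list: the entry above names Link, Refresh, and
# Content-Security-Policy; the "some other" headers are not listed there.
INLINABLE = {"link", "refresh", "content-security-policy"}


def headers_to_meta(headers, navigations=False):
    """Render selected HTTP headers as <meta http-equiv> tags.

    `headers` is a list of (name, value) pairs. With `navigations=False`
    (mirroring the `-navigations` default described above) `Refresh`
    headers are dropped instead of being inlined.
    """
    tags = []
    for name, value in headers:
        lname = name.lower()
        if lname not in INLINABLE:
            continue
        if lname == "refresh" and not navigations:
            continue
        # escape() also quotes double quotes, keeping the attribute well-formed
        tags.append(f'<meta http-equiv="{escape(lname)}" content="{escape(value)}">')
    return tags
```

Calling it with a `Link`, a `Refresh`, and a `Server` header yields a single tag for `Link` by default, and two tags with `navigations=True`.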
extension-v1.16.1 - 2024-10-15
-
On Chromium, fixed request tracking being frequently broken since
extension-v1.15.0
. -
Fixed reqres without responses but with networking errors having "Responded at" field set in the logs.
tool-v0.15.5 - 2024-10-07
-
get
,export mirror
, etc:-
Restricted the
idna
workaround oftool-v0.15.4
to hostnames with "--" in [2:4] character positions.The previous iteration made
parse_url
start accepting many malformed URLs. -
URL parsing will now strip hostnames of leading and trailing whitespace, like browsers do.
Mainly, this improves
export mirror
outputs. -
Fixed output formatting when redirecting output to a non-tty destination.
-
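As a rough illustration of the two hostname changes above, here is a sketch of what such checks could look like. Both function names are hypothetical, and applying the `[2:4]` test to each dot-separated label (rather than to the hostname as a whole) is an assumption, not something the entry states.

```python
def normalize_hostname(hostname: str) -> str:
    """Strip leading and trailing whitespace, like browsers do."""
    return hostname.strip()


def has_double_hyphen_at_2_4(hostname: str) -> bool:
    """Return True if any label has "--" in character positions [2:4].

    Assumption: the check is applied per dot-separated label.
    """
    return any(label[2:4] == "--" for label in hostname.split("."))
```

Under these assumptions, only hostnames for which `has_double_hyphen_at_2_4` returns `True` would be routed through the workaround; everything else would go through the regular `idna` code path.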
simple_server-v1.7.0 - 2024-10-03
-
File path parts starting with "." in
profile
s (i.e. buckets) given by clients are ignored now.This prevents escapes from the given
--root
.The previous behaviour was not really a security issue, given that the server is not designed to be run with untrusted clients, and filenames are generated by it, not the clients. But still.
-
Renamed command-line options:
--no-print-cbors
->--no-print
--default-profile
->--default-bucket
--ignore-profiles
->--ignore-buckets
This makes them use the same terminology the extension uses.
Old names are kept as aliases.
version
endpoint, for extensibility.
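The bucket path filtering described in this release can be sketched as follows. This is a minimal illustration, not the server's actual code: `safe_bucket_path` is a hypothetical name, and splitting only on "/" is an assumption.

```python
import os


def safe_bucket_path(root: str, bucket: str) -> str:
    """Join a client-supplied bucket onto `root`, ignoring dot-parts.

    Path parts that are empty or start with "." (which covers "." and "..")
    are dropped, so the result cannot climb out of `root`.
    """
    parts = [p for p in bucket.split("/") if p and not p.startswith(".")]
    return os.path.join(root, *parts)
```

For example, a bucket of `../../etc/passwd` ends up under the root as `etc/passwd`, and a bucket consisting only of dot-parts maps to the root itself.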
tool-v0.15.4 - 2024-10-02
-
get
,export mirror
, etc:- Added a work-around for
idna
module failing to parse some hostnames (#5 on GitHub).
-
find
:-
Added
--sniff-*
options to fix crashes introduced inv0.15.0
. Added tests to hopefully prevent this kind of error.
-
-
get
,export mirror
, etc:-
--expr
: technically, renamedfull_url
->url
, though it did not officially exist before.Added it to the docs.
-
ftp
andftps
URL schemes are now allowed everywhere.
-
tool-v0.15.3 - 2024-09-28
-
scrub
,export mirror
:-
From now on
scrub
will simply remove allCORS
andSRI
attributes from all relevantHTML
tags.This works fine 99% of the time. Smarter handling for this will be implemented later.
-
Fixed
MIME
type sniffing ofXHTML
data.Also, added some tests for my
mimesniff
implementation. -
Fixed crashes when URL remapper encounters weirdly malformed URLs.
From now on they will be remapped into void URLs instead.
-
-
export mirror
:-
Fixed it skipping regular files given directly as command line arguments.
This was broken since
v0.15.0
.
-
-
scrub
,import
:- Fixed some places where the documentation was misaligned with the code.
- Improved documentation.
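The attribute removal described in this release can be pictured with a small sketch. It is not the actual `scrub` code: the function name and the `(name, value)` list representation are hypothetical, and treating `crossorigin` and `integrity` as the complete set of CORS and SRI attributes is an assumption.

```python
# Assumed set of attributes: `crossorigin` (CORS) and `integrity` (SRI).
CORS_SRI_ATTRS = {"crossorigin", "integrity"}


def strip_cors_sri(attrs):
    """Drop CORS and SRI attributes from a list of (name, value) pairs."""
    return [(name, value) for name, value in attrs
            if name.lower() not in CORS_SRI_ATTRS]
```

One plausible motivation for dropping `integrity` is that a browser refuses to use a resource whose hash no longer matches the attribute, which can happen once a stylesheet or script has been rewritten during scrubbing.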
tool-v0.15.2 - 2024-09-21
-
scrub
,export mirror
:-
Fixed a stupid bug in
MIME
detection code that prevented externalCSS
files from being detected as such.Added tests to prevent such things in the future.
-
Fixed
CSS
formatting with+whitespace
set. -
Made
scrub
removecrossorigin
attributes fromHTML
tags for which it remapped a URL. This seems to have fixed most of the issues that made pages produced by
export mirror
look broken when opened in a web browser.
-
- Improved documentation.
tool-v0.15.1 - 2024-09-18
-
export mirror
- Added reporting for roots being queued at the very beginning.
-
import *
:- Added
--sniff-*
options to fix crashes introduced inv0.15.0
.
-
export mirror
- It will now remap URL fragments instead of dropping them.
- It will now report the correct
depth
value in the UI. - Stopped reporting of repeated
not remapping '%s'
lines.
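The fragment-preserving remapping mentioned above could look roughly like this. The sketch assumes a plain dictionary from fragment-less URLs to mirrored paths; the function name `remap_with_fragment` and that data structure are hypothetical, not the tool's real internals.

```python
from urllib.parse import urldefrag


def remap_with_fragment(url, remapped):
    """Remap `url` via the `remapped` dict while keeping its #fragment.

    Returns None when the fragment-less URL has no remapping.
    """
    base, fragment = urldefrag(url)
    target = remapped.get(base)
    if target is None:
        return None
    return f"{target}#{fragment}" if fragment else target
```

With a mapping of `https://example.org/a` to `../a/index.html`, a link to `https://example.org/a#sec2` becomes `../a/index.html#sec2` instead of losing its fragment.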
tool-v0.15.0 - 2024-09-16
export mirror
sub-command now produces results quite usable in a normal web browser.
I.e. it is now comparable to, say, what Single-File
produces.
Feature-wise, it reaches a Pareto front, AFAICS, since no other tool I know of can do efficient (with shared page requisites) incremental static semi-open (see --remap-semi
option below) website mirrors.
At the moment, scrubbed CSS can get a bit broken sometimes, because hoardy-web
leans in favor of its results being safe to use, not them being as close to the original as possible.
Also, support for audio
, video
, and source
HTML
tags is still a bit quirky.
But the current state is quite usable.
-
scrub
,export mirror
:-
Implemented stylesheet (
CSS
) scrubbing with the help oftinycss2
.I.e., requisite resource URLs mentioned in stylesheets will now be properly remapped.
I.e., website mirrors will be styled now.
-
-
export mirror
:-
Added
--remap-semi
option, which does the same thing as--remap-open
(which is equivalent towget --convert-links
), except it remaps unavailable action links and page requisites to void URLs, making the resulting generated pages self-contained and safe to open in a web browser without it trying to download something.I.e.
--remap-semi
does whatwget --convert-links
should be doing, IMHO. -
Added
--root-url-prefix
and--root-url-re
options.
-
-
pprint
,get
,run
,stream
,export mirror
:-
Implemented
--sniff-*
options controllingmimesniff
algorithm usage.For
pprint
sub-command they replace--naive
and--paranoid
options.
-
-
--expr
,--output
: Addedpretty_net_url
,pretty_net_nurl
,raw_path_parts
, andmq_raw_path
atoms.
-
scrub
,export mirror
:-
Changed the way all
--remap-*
options are implemented. Most of the remapping logic was moved into thescrub
function.--remap-*
options simply change default values of the corresponding--expr
options now. -
+styles
and+iframes
options are now set by default.Since these things can now be properly mirrored.
-
Renamed
(+|-)srcs
options to(+|-)reqs
to follow the terminology used bywget
.In documentation, "page resources" became "requisite resources" and "page requisites".
-
Improved censoring for
IE
-pragmas. -
Improved
+indent
and+pretty
output layout a bit. -
Improved
+verbose
output format a bit.
-
-
export mirror
:-
Renamed
--root
option to--root-url
,-r
and--root
options now point to--root-url-prefix
instead. The--root
option name is deprecated now and will be removed in the future. -
Improved progress reporting UI.
It's much prettier and more informative now.
-
It ignores duplicate input paths now.
This makes it easy to prioritize mirroring of some files over others: specify them first in the command line arguments, followed by their containing directory in a later argument.
README.md
has a new example showcasing it. -
It delays disk writes for
HTML
pages until after all of their requisite resources finished mirroring now.I.e. newly generated
HTML
pages can now be opened in a web browser whileexport mirror
is still running, having not finished mirroring other things yet.
-
-
Improved content
MIME
type handling a bit, addedtext/vtt
recognition. -
--expr
,--output
:-
Renamed:
path_parts
->npath_parts
,mq_path
->mq_npath
. -
Changed semantics of
net_url
andpretty_url
a bit. Both add trailing slashes after emptyraw_path
s now. Also,pretty_url
does not normalizeraw_path
now, i.e. now it only re-quotes path parts, but does not interpret.
and..
path parts away.
-
-
Greatly improved documentation.
-
scrub
,export mirror
:-
Fixed generation of broken
file:
links for URLs with query parameters. -
From now on
stylesheet
,icon
, andshortcut
link
s are treated as page requisites.This fixed a bug where
export mirror
with--depth
set would forget to mirrorshortcut
icon
s andCSS
files. -
Fixed a bug where
export mirror
with--depth
and--remap-(open|closed)
set would fail to remap unreachable URLs properly.
-
-
Fixed some places where the code was misaligned with the documentation.
- Most importantly,
scrub
andexport mirror
use-verbose
by default now, which the documentation claimed they did, but they did not.
-
Fixed some typos.
extension-v1.16.0 - 2024-09-05
- Renamed
pWebArc
->Hoardy-Web
. - Renamed all
::pWebArc::
error codes into a more consistent naming scheme. - Improved documentation.
tool-v0.14.1 - 2024-09-04
- Renamed
wrrarms
->hoardy-web
.
simple_server-v1.6.1 - 2024-09-04
- Renamed
dumb-dump-server
->hoardy-web-sas
.
extension-v1.15.1 - 2024-09-04
- Fixed some typos.
- Improved notifications.
- Improved documentation.
tool-v0.14.0 - 2024-09-04
-
Improved all the
script
s by adding usage descriptions and--help
options to all of them. -
Added
--to
option towrrarms-pandoc
script. It allows you to change the output format it will use. -
Added
wrrarms-spd-say
script, which can feed contents of an archivedHTML
document, extracted fromHTML
viapandoc -t plain
, tospeech-dispatcher
'sspd-say
, i.e. to your preferred TTS engine. -
Added
--*-url
options towrrarms-w3m
andwrrarms-pandoc
scripts. They allow you to control how to print the document's URL in the output. -
get
: implemented--expr-fd
option, which allows you to extract multiple --expr
values from the same input file to different output file descriptors in a singlewrrarms
call. -
Modified
wrrarms-w3m
andwrrarms-pandoc
scripts to use--expr-fd
option, making them ~2x faster. -
export mirror
: implemented support for multiple--expr
arguments. -
import
: implemented--override-dangerously
option. -
Added more
--output
formats.
-
Renamed all
--no-output
options to--no-print
. -
Edited
--output
formats, making them more consistent with their expected usage:-
Edited the
default
,short
,surl_msn
, andurl_msn
--output
formats, replacing a "." before thenum
field with a "_". Because these formats do not mention any file extensions. -
Edited
surl_msn
andurl_msn
--output
formats, replacing a "_" before the method
with "__". To make these--output
formats useful in programmatic usage. -
Edited most other
--output
formats, replacing a "_" before themethod
field with a "." and a "." before the non-standalonenum
with a "_". Since these--output
formats do use file extensions, this turns the wholewrrarms
-specific suffix into a sub-extension.
-
-
wrrarms-pandoc
usesplain
text--to
output format by default now. The previous default wasorg
-mode. -
Improved error messages.
-
Improved documentation.
-
export mirror
now respects the given--errors
option value not only while indexing inputs, but also while rendering and writing out outputs. -
Resurrected
flat_n
--output
format.
extension-v1.15.0 - 2024-08-29
-
pWebArc
is now officially supported on Fenix (Firefox for Android). It is quite usable there now, so go forth and test it. -
The Chromium version now has an
update_url
set in themanifest.json
, so if you usechromium-web-store
or some such, it can be updated semi-automatically now, see the extension'sREADME.md
for more info. -
Implemented
User Interface and Accessibility > Verbose
option.From now on, by default,
pWebArc
will have its most common but annoying notifications mention they can be disabled and explain how.This is mostly for Fenix users, where these things are not obvious, but it could also be useful for new users elsewhere.
-
Implemented
User Interface and Accessibility > Spawn internal pages in new tabs
option which controls if internal pages should be spawned in new tabs or reuse the current window.It can not be disabled on desktop browsers at the moment, but it is disabled by default on mobile browsers.
-
Implemented a bunch of new notifications about automatic fixes applied to
config
.I.e., it will now not just fix your
config
for you, but also complain if you try to set an invalid combination of options. -
Implemented
Generate desktop notifications about ... > UI hints
option to allow you to disable the above notifications.
-
pWebArc
will CBOR-dump all reqres fields completely raw from now on.wrrarms
learned to handle this properly quite a while ago.This simplifies the parsing of the results and makes the implementation adhere to the stated technical philosophy more closely.
The dumps will grow in size a tiny bit, but this is negligible, since they are compressed by default by all the archival methods now.
Moreover:
- A lot of UI improvements, mainly for Fenix.
- From now on, per-tab
Stash 'in_limbo' reqres
option is being inherited by children tabs like the rest of similar options do. - Renamed
build.sh
chromium
target tochromium-mv2
in preparation for eventualchromium-mv3
support. - Advanced minimum browser versions to Firefox v102, and Fenix v113.
- Improved performance.
- Improved documentation and installation instructions.
- Fixed wrong "In limbo" counts after the extension gets reloaded.
- Fixed race conditions in
browserAction
updates. - Worked around
browserAction
title updates being flaky on Fenix. I.e. you can stare at theExtensions > pWebArc
line in the browser's UI now while the browser fetches some stuff and it will be properly interactively updated. - On Firefox, fixed the id of the extension leaking into
origin_url
field of the very first dump of each session whenWorkarounds for Firefox bugs > Restart the very first request
is enabled (which is the default). - Fixed some typos.
extension-v1.14.0 - 2024-08-25
-
pWebArc now runs under Fenix aka Firefox-for-Android-based browsers, including at least Fennec and Mull.
However, the
Export via 'saveAs'
archival method is broken there, because of a bug in Firefox. Other methods do work, though.(Also, it is not marked as compatible with Firefox on Android on addons.mozilla.org at the moment, it probably will be in the next version.)
-
The above change also added a settings page (aka
options_ui
). At the moment, the settings page is simply a version of the popup UI that is unrolled by default, with per-tab settings removed.
This is needed because on mobile browsers the main screen of the browser is not a tab and there's no toolbar, so there's no popup UI button there, and so the extension UI becomes really confusing without a separate settings page.
-
Split
in_flight
stat into a sum of two numbers.This makes things less confusing on Chromium, the
Help
page explains it in more detail. -
Added toolbar button's badge as a prefix to its title, changed its format a bit.
This is needed because Fenix-based browsers do not display the badge at all, so this change helps immensely there. Meanwhile, on desktop browsers this does not hurt.
-
Improved styling and dark mode contrast of the popup UI.
-
Improved documentation.
In particular, among other things, added a lot of new anchors to the
Help
page, most internal links referencing some fact discussed in another section now point directly to the relevant paragraph instead of pointing to its section header.
On Firefox:
-
Fixed capture of responses produced by service/shared workers.
Also, added a new error code for when it (very rarely) fails because of a race condition inherent in
webRequest
API and documented all of it on theHelp
page. -
Fixed
HTTP
protocol version detection, requests fetched viaHTTP/3
will now be marked as such. -
Added yet another
webRequest
API error to a list of those that mark reqres response data as incomplete.
On Chromium:
- Fixed more edge cases where reqres could get stuck in
in_flight
state indefinitely.
Generally:
-
Fixed navigation with browser's
Back
andForward
buttons to work properly on theHelp
page. -
Fixed a bug where force-stopping all in-flight reqres in a single tab could also drop some of the others.
extension-v1.13.1 - 2024-08-13
- Fixed a lot of places where the documentation was misaligned with current reality.
- Improved documentation, especially the
Help
page. - Tiny improvement in popup UI
HTML
layout. - Changed
config.history
default value.
extension-v1.13.0 - 2024-08-05
-
Implemented reqres persistence across restarts.
pWebArc can now save and reload
collected
but not archived reqres (including thosein_limbo
) by stashing them into browser's local storage. This is now enabled by default, but it can be disabled globally, or per-tab. -
As a consequence, pWebArc now tracks browsing sessions and shows when a reqres belongs to an older session on its
Internal State
page. -
Implemented two new archiving methods. pWebArc can now archive
collected
reqres by-
generating fake-Downloads containing either separate dumps (one dump of an
HTTP
request+response per file) or bundles of them (many dumps in a single file, for convenience, to be later imported viawrrarms import bundle
), -
archiving separate dumps to your own private archiving server (the old one, the previous default, inherited on extension update),
-
archiving separate dumps to your browser's local storage (the new default on a new clean install).
-
-
As a consequence, pWebArc now has a new
Saved in Local Storage
page for displaying the latter. -
Implemented display and filtering for
queued
andfailed
reqres on theInternal State
page. -
Implemented tracking of per-state size totals for reqres in most states after
finished
. -
As a consequence, popup UI will now display those newly tracked sizes.
-
Introduced the
errored
reqres state. With stashing to local storage enabled, pWebArc will now try its best not to lose any captured data even when its archiving code fails (bugs out) with an unexpected exception. If it bugs out in the capture code, then all bets are off, unfortunately.
-
pWebArc will now track if an error is recoverable and will not retry actions with unrecoverable errors automatically by default.
-
pWebArc now follows the following state diagram:
(start) -> (request sent) -> (nIO) -> (headers received) -> (nIO) --> (body recived) | | | | | v v v | (no_response) (incomplete) (complete) | | | | | \ | | |\---> (canceled) ----\ \ | | | \ \ \ | |\-> (incomplete_fc) ---\ \ \ v | >------>---------------------------->-----> (finished) |\--> (complete_fc) ----/ / | | / / | \----> (snapshot) ----/ /- (collected) <--------- (picked) <--/ | / ^ | | (stashIO?) <----/ | v v | \-- (in_limbo) <- (stashIO?) <- (dropped) v | | (queued) <------------------\ | | / | ^ \ \ \-----> (discarded) <-----/ (exported) <-/ | | \----------------\ \ ^ | | | \ \ | | /---/ \-----------------\ \ \ | | | | \ \ | | v | \ \ | |\-> (srvIO) -> (stashIO?) -> (failed) | \ | | | ^ / \ | | v | v | | | (sumbitted) --------------> (saveIO) --> (saved) | {{!saving}} | \ | \-------->-----------------------------------------------/
-
Renamed all
Profile
settings intoBucket
, as this makes more sense. -
Improved popup UI layout.
-
Changed toolbar icon's badge format a bit.
-
Improved debugging options.
-
A huge internal refactoring to solve constant sub-task scheduling issues once and for all.
-
Improved documentation.
- Removed
config.logDiscarded
option as it is no longer needed (pWebArc has proper log filtering now).
- Various small bugfixes.
tool-v0.13.0 - 2024-08-05
- Implemented
import bundle
sub-command which takes WRR-bundles (optionally gzipped concatenations of WRR-dumps) as inputs. The next version of the extension will start (optionally) producing these.
- Improved error handling and error messages.
- A tiny fix for
pprint
output formatting.
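Since WRR-bundles are described above as optionally gzipped concatenations of WRR-dumps, a reader has to decide whether to decompress its input first. The sketch below shows one way to do that by checking for the gzip magic bytes. It is an assumption about how such detection could work, not the tool's actual logic, and the function name is hypothetical. Splitting the result into individual dumps needs a CBOR decoder and is not shown.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of any gzip stream


def read_bundle_bytes(data: bytes) -> bytes:
    """Return the concatenated dump bytes, decompressing if gzipped."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data
```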
extension-v1.12.0 - 2024-07-03
-
pWebArc will no longer automatically reload on updates, waiting for the browser to restart or for you to reload it explicitly instead.
This way you won't lose any data on extension updates.
Proper automatic reloads on updates will be implemented later, after pWebArc gets full persistence.
-
Popup UI:
-
Reverted the split between
Globally
andThis session
.Implementing that split properly will make future things much harder, so, simple is best.
-
Queued
stat moved to a separate line again.It also shows the sum total of sizes of all dumps now.
-
Added
Scheduled ... actions
stat line, showing the names of actions that are scheduled.It is hidden by default, because watching it closely while pWebArc is very busy can probably cause seizures in some people.
-
-
Improved documentation.
- Various small bugfixes.
extension-v1.11.0 - 2024-06-27
-
Implemented DOM snapshots, their popup UI, keyboard shortcuts and documentation.
- Popup UI now has buttons to snapshot a single tab (
snapshotTab
) and snapshot all open tabs (snapshotAll
). Ctrl+Alt+S
runssnapshotTab
by default now.
-
Added a bunch of new toolbar icons for various tab states.
In particular,
problematic
state as well as mixed-capture states (e.g., disabled in this tab, but enabled and with limbo mode in children tabs) now have their own special icons.
-
Changed some default keyboard shortcuts:
Ctrl+Alt+A
andCtrl+Alt+C
runcollectAllInLimbo
andcollectAllTabInLimbo
respectively now;Alt+Shift+D
andAlt+Shift+W
rundiscardAllInLimbo
anddiscardAllTabInLimbo
now.
-
Popup UI:
- Improved layout.
- Destructive actions will start asking for confirmations now.
-
All SVG icons were edited to not reference any fonts, since those are not guaranteed to be available on a user's system.
-
Improved behaviour of new tabs created by clicking buttons on the
Internal State
page. -
Greatly improved documentation.
extension-v1.10.0 - 2024-06-18
-
Implemented dark mode theme. The extension will switch to it automatically when the browser asks (which it will if you switch your browser's theme to a dark one).
-
Implemented some new optional UI-related accessibility config options with toggles in popup UI:
- Colorblind mode: uses bluish colors instead of greenish where possible (which uses mostly the same colors pWebArc used before color-coding of UI toggles was introduced in
v1.9.0
, with slight variations for the new color-coding). - Pure text labels: disables emojis in UI labels, makes screen readers happier.
-
Improved Internal State/Log UI:
- Added a bunch of tristate toggles for filtering the logs.
- Added in-log buttons to open a narrowed page for reqres with an associated tab.
-
Added UI for internal scheduled/delayed actions/functions (e.g., saving of frequently changing stuff to persistent storage, automatic actions when a tab closes, canceling and reloading not-yet-debugged tabs on Chromium, etc):
- If some functions are still waiting to be run, the badge will have
~
or.
in it and change its color, depending on the importance of the stuff that is waiting to be run. - Popup UI has a new stat line showing the number of such delayed actions and buttons to run or cancel them immediately.
-
Added config options and popup UI toggles for picking and marking as problematic reqres with various
HTTP
status codes. -
Implemented new config options and popup UI toggles for browser-specific workarounds. In particular, on Chromium you can now set the URL new root tabs will be reset to (still
about:blank
by default). -
Added more desktop notifications, added config options and popup UI toggles for them.
-
Improved keyboard shortcuts:
- In popup UI, toggles and buttons with bound keyboard shortcuts will now get those shortcuts displayed in their tooltips.
- The "Keyboard shortcuts" section of the
Help
page will now show currently active shortcuts (when viewed via theHelp
button from the extension UI). - The changes to the code there mean all the shortcuts will be reset to their default keys, but it makes stuff much cleaner internally, so.
- Collecting all reqres from currently active tab's limbo is bound to
Alt+S
by default now (similarly to howCtrl+S
saves the page). - Discarding all reqres from currently active tab's limbo is bound to
Alt+W
by default now (similarly to howCtrl+W
closes the tab). - Unmarking all problematic reqres in the currently active tab is bound to
Alt+U
by default now. - Added a few more shortcuts:
Alt+Shift+U
by default unmarks all problematic reqres globally now.Alt+Shift+S
andAlt+Shift+W
by default respectively collect and discard all reqres in limbo globally now.
-
Much of the code working with Chromium's debugger was rewritten. Now it reports all the errors properly and no longer crashes when the debugger gets detached at an inopportune time in the pipeline (which is quite common, unfortunately).
-
Mark reqres as 'problematic' when they finish > ... with reqres errors
config option became> ... with reqres errors and get 'dropped'
, i.e. it is now disjoint with> ... with reqres errors and get 'picked'
. -
Improved desktop notifications.
-
Popup UI, in its default rolled-up state, now exposes
Generate desktop notifications about > ... new problematic reqres
option and has customtabindex
es set, for convenience. -
Changed some config option defaults (your existing config will not get affected).
-
Slightly improved performance in normal operation. Greatly improved performance when archiving large batches of reqres at once, e.g. when collecting a lot of stuff from limbo.
-
Greatly improved documentation.
- Various small bugfixes.
extension-v1.9.0 - 2024-06-07
-
A whole ton of bugfixes.
So many bugfixes that pWebArc on Chromium now actually works almost as well as on Firefox.
All leftover issues on Chromium I'm aware of are consequences of Chromium's debugging API limitations and, as far as I can see, are unsolvable without actually patching Chromium (which is unlikely to be accepted upstream, given that patching them will make ad-blocking easier).
archiveweb.page
project appears to suffer from the same issues.Meanwhile, pWebArc continues to work exceptionally well on Firefox-based browsers.
-
Implemented "negative limbo mode".
It does the same thing as limbo mode does, but for reqres that were dropped instead of picked. (Which is why there is an arrow from
dropped
toin_limbo
on the diagram below.) -
Implemented optional automatic actions when a tab gets closed.
E.g., you can ask pWebArc to automatically unmark that tab's
problematic
reqres and/or collect and archive everything belonging to that tab fromlimbo
. -
Implemented a bunch of new desktop notifications.
-
Added a bunch of new configuration options.
This includes a bunch of them for controlling desktop notifications.
-
Added a bunch of new keyboard shortcuts.
Also, keyboard shortcuts now work properly in narrowed
Internal State
pages. -
Implemented stat persistence between restarts.
You can brag about your archiving prowess to your friends by sharing popup UI screenshots now.
-
Added the
Changelog
page, which can be viewed by clicking the version number in the extension's popup.
-
pWebArc now follows the following state diagram:
(start) -> (request sent) -> (nIO) -> (headers received) -> (nIO) --> (body recived) | | | | | v v v | (no_response) (incomplete) (complete) | | | | | \ | | |\---> (canceled) -----\ \ | | | \ \ \ | | \ \ \ v |\-> (incomplete_fc) ----->----->---------------------------->-----> (finished) | / / | | / /-----/ | \--> (complete_fc) ----/ /--------------- (picked) <---/ v | | (dropped) v v / | (archived) <- (sIO) <- (collected) <------- (in_limbo) <---------/ | | ^ | | | | | | /------/ \-----\ \--> (discarded) <---/ | | \-> (failed to archive) -/
Terminology-wise, most notably,
picked
anddropped
now mean whatcollected
anddiscarded
meant before.See the
Help
page for more info. -
A lot of changes to make pWebArc consistently use the above terminology --- both in the source and in the documentation --- were performed for this release.
-
Improved visuals:
-
Extension's toolbar button icon, badge, and title are much more informative and consistent in their behaviour now.
-
The version number button in the popup (which opens the
Changelog
) will now get highlighted on major updates. -
Similarly, the
Help
button will now get highlighted when that page gets updated. -
The popup, the
Help
page, theInternal State
aka theLog
page all had their UI improved greatly. -
All the toggles in the popup are now color-coded with their expected values, so if something looks red(-dish), you might want to check the help string in question just in case.
-
-
Improved documentation.
tool-v0.12.0 - 2024-06-07
export mirror
: implemented--no-overwrites
,--partial
, and--overwrite-dangerously
options.
-
export mirror
: Switched the default from--overwrite-dangerously
(which is whatexport mirror
did before even if there was no option for it) to--no-overwrites
. This makes the default semantics consistent with that oforganize
. -
Changed format of reqres
.status
to<"C" or "I" for request.complete><"N" for no response or <response.code><"C" or "I" for response.complete> otherwise>
(yes, this changes most--output
formats oforganize
, again).-
Added
~=
expression atom which doesre.match
internally. -
Changed all documentation examples to do
~= .200C
instead of== 200C
to reflect the above change.
-
-
organize
: renamed--keep
->--no-overwrites
for consistency. -
Improved documentation.
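To make the new `.status` format and the `~=` atom from this release more concrete, here is a small sketch. The helper names `format_status` and `matches` are hypothetical; only the format itself and the use of `re.match` come from the entries above.

```python
import re


def format_status(request_complete, response_code=None, response_complete=False):
    """Build a status string such as "C200C", "I404I", or "CN".

    The first letter is "C" or "I" for a complete or incomplete request,
    followed by "N" when there is no response, or otherwise by the response
    code and a "C" or "I" for a complete or incomplete response.
    """
    head = "C" if request_complete else "I"
    if response_code is None:
        return head + "N"
    return f"{head}{response_code}{'C' if response_complete else 'I'}"


def matches(pattern, status):
    """Mimic `~=`, which is described as doing `re.match` internally."""
    return re.match(pattern, status) is not None
```

The pattern `.200C` accepts both `C200C` and `I200C`, i.e. any request state followed by a complete `200` response, which the old `== 200C` comparison can no longer express.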
extension-v1.8.1 - 2024-05-22
- A tiny bugfix.
extension-v1.8.0 - 2024-05-20
(Actually, this releases about half of the new changes in my local branches, so expect a new release soonish.)
-
Implemented
problematic
reqres flag, its tracking, UI, and documentation.This flag gets set for
no_response
andincomplete
reqres by default but, unlikeArchive reqres with
settings, it does not influence archival. Instead pWebArc displays "archival failure" as its icon and its badge gets!
at the end.This is needed because, normally, browsers provide no indication when some parts of the page failed to load properly --- they expect you to actually look at the page with your eyes to notice something looking broken instead --- which is not a proper way to do this when you want to be sure that the whole page with all its resources was archived.
-
Implemented currently active tab's limbo mode indication via the icon.
-
Added a separate state for reqres that are completed from cache:
complete_fc
.
-
Renamed reqres states:
noresponse
->no_response
,incomplete-fc
->incomplete_fc
.
-
pWebArc now follows the following state diagram:
(start) -> (request sent) -> (nIO) -> (headers received) -> (nIO) --> (body recived) | | | | | v v v | (no_response) (incomplete) (complete) | | | | | \ | | |\---> (canceled) -----\ \ | | | \ \ \ | | \ \ \ v |\-> (incomplete_fc) ----->----->---------------------------->-----> (finished) | / / | | / /-----/ | \--> (complete_fc) ----/ /------------- (collected) <--/ v | | (discarded) v v / | (archived) <- (sIO) <--- (queued) <-------- (in_limbo) <---------/ | | ^ | | | | | | /------/ \-----\ \----> (freeed) <----/ | | \-> (failed to archive) -/
- Added more shortcuts, changed defaults for others:
  - Added `toggle-tabconfig-limbo`, `toggle-tabconfig-children-limbo`, and `show-tab-state` shortcuts,
  - Changed the default shortcut for `collect-all-tab-inlimbo` from `Alt+A` to `Alt+Shift+A` for uniformity.
- Improved UI:
  - The internal state/log page is much nicer now.
  - But the popup UI in its default state might have become a bit too long...
- Improved performance when using limbo mode.
- Improved documentation.
- Various small bugfixes.
tool-v0.11.2 - 2024-05-20
- `organize`: now works on Windows.
extension-v1.7.0 - 2024-05-02
- Implemented "limbo" reqres processing stage and toggles.

  "Limbo" is an optional pre-archival-queue stage for finished reqres that are ready to be archived but, unlike non-limbo reqres, are not to be archived automatically.

  This is useful in cases when you need to actually look at a page before deciding if you want to archive it.

  E.g., you enable limbo mode, reload the page, notice there were no updates to the interesting parts of the page, and so you discard all of the reqres newly generated by that tab via the appropriate button in the add-on popup, or via the new keyboard shortcut.
- pWebArc now follows the following state diagram:
(start) -> (request sent) -> (nIO) -> (headers received) -> (nIO) --> (body received) | | | | | v v v | (noresponse) (incomplete) (complete) | | | | | \ | | |\---> (canceled) -----\ \ | | | \ \ \ | | \ \ \ v \--> (incomplete-fc) ----->----->---------------------------->-----> (finished) / | /-----/ | /------------- (collected) <--/ v | | (discarded) v v / | (archived) <- (sIO) <--- (queued) <-------- (in_limbo) <---------/ | | ^ | | | | | | /------/ \-----\ \----> (freeed) <----/ | | \-> (failed to archive) -/
- The `Log` page became the `Internal State` page, which now shows in-flight and in-limbo reqres. It also allows narrowing to data belonging to a single tab now.
- Improved UI.
- Improved performance.
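To make the limbo stage described in this release more concrete, here is a minimal Python sketch of the routing it implies: finished reqres either go straight to the archival queue or wait in limbo until they are collected or discarded. The extension itself is not written in Python, and the names `Tab`, `Archiver`, `finish`, `collect_all`, and `discard_all` are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Tab:
    """Per-tab state: whether limbo mode is on, and what is waiting in limbo."""
    limbo: bool = False
    in_limbo: list = field(default_factory=list)


@dataclass
class Archiver:
    """Holds the archival queue shared by all tabs."""
    queued: list = field(default_factory=list)

    def finish(self, tab, reqres):
        # A finished reqres is queued automatically unless the tab is in limbo mode.
        if tab.limbo:
            tab.in_limbo.append(reqres)
        else:
            self.queued.append(reqres)

    def collect_all(self, tab):
        # Collected reqres leave limbo and join the archival queue.
        self.queued.extend(tab.in_limbo)
        tab.in_limbo.clear()

    def discard_all(self, tab):
        # Discarded reqres are dropped without ever being archived.
        dropped = len(tab.in_limbo)
        tab.in_limbo.clear()
        return dropped


if __name__ == "__main__":
    archiver = Archiver()
    tab = Tab(limbo=True)
    archiver.finish(tab, "GET https://example.org/")
    print(len(tab.in_limbo), len(archiver.queued))
    archiver.discard_all(tab)
    print(len(tab.in_limbo), len(archiver.queued))
```

The sketch only covers the `(finished)`, `(in_limbo)`, `(collected)`, `(discarded)`, and `(queued)` part of the diagram; the network and storage states are left out.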
tool-v0.11.1 - 2024-05-02
- Improved default batching parameters.
- Improved documentation.
tool-v0.11.0 - 2024-04-03
- Implemented `scrub` `--expr` atom for rewriting links/references and wiping inner evils out from `HTML`, `JavaScript`, and `CSS` values.

  `CSS` scrubbing is not finished yet, so all `CSS` gets censored out by default at the moment.

  `HTML` processing uses `html5lib`, which is pretty nice (though, rather slow), but overall the complexity of this thing and the time it took to debug it into working is kind of unamusing.
- Implemented `export mirror` subcommand generating static website mirrors from previously archived WRR files, kind of similar to what `wget -mpk` does, but offline and the outputs are properly `scrub`bed.
- A bunch of `--expr` atoms were renamed, a bunch more were added.
- A bunch of `--output` formats changed, most notably `flat` is now named `flat_ms`.
- Improved performance.
- Improved documentation.
- Various small bugfixes.
tool-v0.9.0 - 2024-03-22
- Updated `wrrarms` to build with newer `nixpkgs` and `cbor2` modules, the latter of which is now vendored, at least until upstream solves the custom encoders issue.
- Made more improvements to `--output` option of `organize` and `import` with IDNA and component-wise quoting/unquoting of tool-v0.8:
  - Added `pretty_url`, `mq_path`, `mq_query`, `mq_nquery` to substitutions and made pre-defined `--output` formats use them.

    `mq_nquery` and `pretty_url` do what `nquery` and `nquery_url` did before v0.8.0, but better.
  - Dropped `shpq`, `hpq`, `shpq_msn`, and `hpq_msn` `--output` formats as they are now equivalent to their `hup` versions.
- `run`: `--expr` option now uses the same semantics as `get --expr`.
- Tiny improvements to performance.
- `pprint`: fixed `clock` line formatting a bit.
tool-v0.8.1 - 2024-03-12
- Added `--output flat_n`.
- Bugfix #1: `tool-v0.8` might have skipped some of the updates when `import`ing and forgot to do some actions when doing `organize`, which was not the case for `tool-v0.6`.

  These bugs should never have been triggered (and with the default `--output` they are impossible to trigger), but to be absolutely sure you can re-run `import mitmproxy` and `organize` with the same arguments you used before.
- Bugfix #2: `organize --output` `num`bering is deterministic again, like it was in `tool-v0.6`.
extension-v1.6.0 - 2024-03-08
- Replaced icons with a cuter set.
tool-v0.8 - 2024-03-08
- Implemented import for `mitmproxy` dumps.
- Improved `net_url` normalization and components handling, added support for IDNA hostnames.
- Improved most `--output` formats, custom `--output` formats now require `format:` prefix to distinguish them from the built-in ones, like in `git`.
- Renamed response status codes:
  - `N` -> `I` for "Incomplete"
  - `NR` -> `N` for "None"
- Renamed `organize --action rename` -> `organize --move` (as it can now atomically move files between file systems, see below), `--action hardlink` -> `--hardlink`, `--action symlink` -> `--symlink`, `--action symlink-update` -> `--symlink --latest`.
- Added `organize --copy`.
- `organize` now performs changes atomically: it writes to newly created files first, `fsync`s them, replaces old destination files, `fsync`s touched directories, reports changes to `stdout` (for consumption by subsequent commands' `--stdin0`), and only then (when doing `--move`) deletes source files.
- Made many internal changes to simplify things in the future.
Paths produced by `wrrarms organize` are expected to change:

- with the default `--output` format you will only see changes to WRR files with international (IDNA) hostnames and those with the above response statuses;
- names of files generated by most other `--output` formats will change quite a lot, since the path abbreviation algorithm is much smarter now.
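The order of operations that `organize` is described as following in this release (write, `fsync`, replace, `fsync` the directory, report, then delete) can be sketched in Python roughly as follows. This is not the tool's actual code: the function names, the `.part` temporary suffix, and the NUL-terminated report format are assumptions made for the example, and the directory `fsync` shown here works on POSIX systems only.

```python
import os
import shutil
import sys


def fsync_dir(path):
    """Flush directory metadata to disk (POSIX only; an assumption of this sketch)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def atomic_place(src, dst, move=False, report=sys.stdout.write):
    """Copy or move `src` to `dst` so that `dst` is never observed half-written."""
    src = os.path.abspath(src)
    dst = os.path.abspath(dst)
    dst_dir = os.path.dirname(dst)
    os.makedirs(dst_dir, exist_ok=True)
    tmp = dst + ".part"  # hypothetical temporary name

    # 1. Write to a newly created file first, then fsync its contents.
    with open(src, "rb") as fin, open(tmp, "xb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())

    # 2. Replace the old destination file in a single step.
    os.replace(tmp, dst)

    # 3. fsync the touched directory so the rename itself is durable.
    fsync_dir(dst_dir)

    # 4. Report the change (NUL-terminated here, which is an assumption).
    report(dst + "\0")

    # 5. Only now, when moving, delete the source file.
    if move:
        os.unlink(src)
        fsync_dir(os.path.dirname(src))
```

Because the source is deleted last, an interruption at any earlier step leaves the original file intact, and copying the data instead of renaming is what lets the same sequence work across file systems.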
dumb_server-v1.6.0 - 2024-02-19
- Implemented `--uncompressed` option.
- Renamed `--no-cbor` option to `--no-print-cbors`.
dumb_server-v1.5.5 - 2023-12-04
- Improved documentation.
tool-v0.6 - 2023-12-04
- `organize`: implemented `--quiet`, `--batch-number`, and `--lazy` options.
- `organize`: implemented `--output flat` and improved other `--output` formats a bit.
- `get` and `run` now allow multiple `--expr` arguments.
- Improved performance.
- Improved documentation.
tool-v0.5 - 2023-11-22
- Initial public release.
dumb_server-v1.5 - 2023-10-25
- Added `--default-profile` option, changed semantics of `--ignore-profiles` a bit.
- Added `--no-cbor` option.
- Packaged as both Python and Nix package.
- Generated filenames for partial files now have `.part` extension.
- Generated filenames now include PID to allow multiple process instances of this to dump to the same directory.
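The two filename changes in this release can be illustrated with a short Python sketch. The naming scheme below (`<timestamp>_<pid>_<counter>.wrr`) is purely hypothetical and is not the server's actual format; the sketch only shows why a `.part` suffix and a PID component help when several processes dump into one directory.

```python
import os
import time


def dump_paths(directory, counter, now=None, pid=None):
    """Return (partial_path, final_path) for a new dump; the scheme is hypothetical."""
    now = time.time() if now is None else now
    pid = os.getpid() if pid is None else pid
    # Including the PID keeps names from different processes from colliding.
    name = f"{int(now)}_{pid}_{counter}.wrr"
    final = os.path.join(directory, name)
    return final + ".part", final


def dump(directory, counter, data):
    """Write `data` under the partial name, then rename it once it is complete."""
    partial, final = dump_paths(directory, counter)
    os.makedirs(directory, exist_ok=True)
    with open(partial, "wb") as f:
        f.write(data)
    # A reader that ignores `*.part` never sees an incomplete dump.
    os.rename(partial, final)
    return final
```

Two processes calling `dump_paths` with the same timestamp and counter still get different names, because their PIDs differ.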
extension-v1.5 - 2023-10-22
- Added keyboard shortcuts for toggling tab-related config settings.
- Improved UI.
- Improved documentation.
- Various small bugfixes.
extension-v1.4 - 2023-09-25
- Implemented context menu actions.
- Improved UI.
- Improved performance of dumping to CBOR.
- Improved documentation.
extension-v1.3.5 - 2023-09-13
- Improved `document_url` and `origin_url` handling.
- Improved documentation.
extension-v1.3 - 2023-09-04
- Experimental Chromium support.
- Improved UI.
- Various small bugfixes.
extension-v1.1 - 2023-08-28
- Improved handling of `304 Not Modified` responses.
- Improved UI and the `Help` page.
- Various small bugfixes.
dumb_server-v1.1 - 2023-08-28
- Implemented `--ignore-profiles` option.
dumb_server-v1.0 - 2023-08-25
- It now prints its own server URL at the start, for convenience.
- Implemented gzipping before dumping to disk.
- The extension can now specify a per-dump `profile`, which is a suffix to be appended to the dumping directory.
- Implemented optional printing of the head and the tail of the dumped data to the TTY.
All planned features are complete now.
extension-v1.0 - 2023-08-25
- Improved popup UI.
- Improved the `Help` page: it's much more helpful now.
- Improved the `Log` page: it's an interactive page that gets updated automatically now.
- Various small bugfixes.
extension-v0.1 - 2023-08-20
- Initial public release.
dumb_server-v0.1 - 2023-08-20
- Initial public release.
... each roughly sorted according to the expected order things will probably get implemented.
- UI:
  - Improve `Internal State` and `Saved into Local Storage` UIs.
  - Add option persistence to `Internal State` and `Saved into Local Storage` UIs.
  - Add URL matching to `Internal State` and `Saved into Local Storage` UIs.
- Core+UI:
  - Add a popup UI section for `Closed tabs`, so that you could easily collect/discard `in_limbo` reqres from such tabs.
  - Track navigations and allow using them as boundaries between batches of reqres saved in limbo mode.
  - (~25% done) Reorganize tracking- and problematic-related options into config profiles, allow them to override each other.
  - Implement per-host profiles.
  - Implement automatic capture of `DOM` snapshots when a page changes.
- Core:
  - Implement automatic management of `network.proxy.no_proxies_on` setting to allow `Hoardy-Web` archival to an archiving server to work out of the box when using proxies.
  - Maybe: Dumping straight into `WARC`, so that third-party tools (i.e. not just `hoardy-web`) could be used for everything except capture.
- `mirror`, `scrub`:
  - Handle SRI things.
  - Handle CSP things.
- `mirror`:
  - Implement `mirror --standalone`, which would inline all resources into each mirrored page, a-la `SingleFile`.
- `organize`:
  - Implement automatic discernment of relatedness of `WRR` files (by URLs and similarity) and packing of related files into `WRR` bundles.
  - Maybe: Implement data de-duplication between `WRR` files.
  - Implement `un206` command/option, which would reassemble a bunch of `GET 206` `WRR` files into a single `GET 200` `WRR` file.
- `mirror`, `organize`:
  - Allow unloading and lazy re-loading of reqres loaded from anything other than separate `WRR` files. The fact that this is not possible at the moment makes memory consumption in those cases rather abysmal.
  - Implement on-the-fly mangling of reqres, so that, e.g. you could `organize` or `mirror` a reqres containing `https://web.archive.org/web/<something>/<URL>` as if it was just a `<URL>`.
- `import`, `export`:
  - Converters from `HAR` and `WARC` to `WRR`.
  - Converter from `WRR` to `WARC`.
  - Converter from `PCAP` to `WRR`.
- `serve`:
  - Allow generating `--symlink --latest` hierarchies on-the-fly when running with `--archive-to`.
- `*`:
  - Maybe: Full text indexing and search. "Maybe", because offloading (almost) everything search-related to third-party tools may be a better idea.
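For the planned `un206` item above, the core idea of reassembling partial responses can be sketched in Python as follows. This is only an illustration of merging `206 Partial Content` bodies by their `Content-Range` headers; it is not how the tool does or will do it, the function name `reassemble_206` is hypothetical, and it assumes each part carries a `bytes <start>-<end>/<total>` header with a known total size.

```python
import re

# Matches headers such as "bytes 0-499/1234"; unknown totals ("/*") are not handled.
_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+)")


def reassemble_206(parts):
    """Merge (content_range, body) pairs into one body, or return None if gaps remain."""
    total = None
    chunks = []
    for header, body in parts:
        m = _CONTENT_RANGE.fullmatch(header.strip())
        if m is None:
            raise ValueError(f"unsupported Content-Range: {header!r}")
        start, end, size = (int(g) for g in m.groups())
        if end - start + 1 != len(body):
            raise ValueError("body length does not match its Content-Range")
        if total is None:
            total = size
        elif total != size:
            raise ValueError("parts disagree about the total size")
        chunks.append((start, body))

    if total is None:
        return None  # no parts were given

    result = bytearray(total)
    covered = 0  # number of leading bytes known to be filled in
    for start, body in sorted(chunks):
        if start > covered:
            return None  # there is a gap before this chunk
        result[start:start + len(body)] = body
        covered = max(covered, start + len(body))
    return bytes(result) if covered == total else None
```

Overlapping ranges are tolerated, with later chunks overwriting earlier bytes; a real implementation would probably also want to check that the overlapping bytes agree.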