
This tool downloads each page from the Wayback Machine for a specific domain and enables further keyword search on each saved page.


dthanvi/wayback-keyword-search

 
 


wayback-keyword-search

[IMPORTANT NOTE: WAYBACK IS NOW RATE LIMITING. I HAD TO UPDATE THE TOOL TO GET IT WORKING; HOWEVER IT IS SLOWER NOW AS IT CAN'T USE PARALLELISM ANYMORE. THE PYTHON VERSION IS UPDATED; THE GO VERSION IS NOT (YET)]

This tool downloads each page that the Wayback Machine has archived for a given input domain and saves each page as a local .txt file, so that you can later search the saved files for keyword matches.

Downloading is done with the "download" file, and searching with the "search" file.

You can download pages saved in a specific year (e.g. 2020), a year and month (e.g. 202001), or a year, month and day (e.g. 20200101) by typing the date in that format at the prompt. If you want to download everything from the 2000s or the 1900s regardless of the exact saved date, type "2" (for pages saved from 2000 onward) or "1" (for pages saved in the 20th century) at the prompt, and the tool will download every saved page matching that criterion. So, if you want a website that was saved across both 1999 and 2000, you will need to run the tool twice.
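The date formats above behave like a prefix match on the capture timestamp. The following is a minimal sketch of that idea, not the tool's actual code: it assumes timestamps in the Wayback Machine's 14-digit YYYYMMDDhhmmss form, and the function names `validate_prefix` and `filter_by_prefix` are hypothetical.

```python
import re
from typing import Iterable, List


def validate_prefix(prefix: str) -> str:
    """Accept 1 to 8 digits, e.g. '2', '2020', '202001' or '20200101'."""
    if not re.fullmatch(r"\d{1,8}", prefix):
        raise ValueError(f"expected 1 to 8 digits, got {prefix!r}")
    return prefix


def filter_by_prefix(timestamps: Iterable[str], prefix: str) -> List[str]:
    """Keep only the timestamps that start with the given date prefix."""
    prefix = validate_prefix(prefix)
    return [ts for ts in timestamps if ts.startswith(prefix)]


if __name__ == "__main__":
    # Made-up sample timestamps, for illustration only.
    sample = ["19991231235959", "20000101000000", "20200115093000"]
    for prefix in ("1", "2", "2020", "202001"):
        print(prefix, filter_by_prefix(sample, prefix))
```

Because a single prefix cannot start with both "1" and "2", captures from 1999 and 2000 never match the same input, which is why two runs are needed in that case.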

There is a Python3 version and a Go version.


[*] Python usage:

python3 download.py

When prompted, specify your domain, e.g. nytimes.com (no quotes!).

When the download is complete, a directory named after the domain will be saved in the local path.
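As a rough illustration of the download step, here is a sketch that fetches snapshots one at a time (no parallelism, with a pause between requests) and writes each one as a .txt file inside a directory named after the domain. It is an assumption-laden sketch rather than the contents of download.py: the file naming scheme, the 5 second delay, the `base_dir` parameter and the function names are all hypothetical. Only the `https://web.archive.org/web/<timestamp>/<url>` snapshot address format comes from the Wayback Machine itself.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, List, Tuple
from urllib.request import urlopen


def fetch_snapshot(timestamp: str, url: str) -> str:
    """Fetch one archived page from the Wayback Machine as text."""
    snapshot_url = f"https://web.archive.org/web/{timestamp}/{url}"
    with urlopen(snapshot_url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def download_all(
    domain: str,
    captures: Iterable[Tuple[str, str]],
    fetch: Callable[[str, str], str] = fetch_snapshot,
    delay_seconds: float = 5.0,  # assumed pause; tune to what the archive tolerates
    base_dir: str = ".",
) -> List[Path]:
    """Save each (timestamp, url) capture as a .txt file under <base_dir>/<domain>/."""
    out_dir = Path(base_dir) / domain
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, (timestamp, url) in enumerate(captures):
        if index > 0:
            time.sleep(delay_seconds)  # sequential on purpose, to avoid rate limiting
        target = out_dir / f"{timestamp}_{index}.txt"  # hypothetical naming scheme
        target.write_text(fetch(timestamp, url), encoding="utf-8")
        saved.append(target)
    return saved
```

Passing the fetch function as a parameter keeps the loop easy to try offline with a stub before pointing it at the real archive.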

You can then search for keyword matches within each file in that local directory using the "search.py" file:

python3 search.py

When prompted, specify your keyword (no quotes!).
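The search step amounts to scanning every saved .txt file for the keyword. A minimal sketch of that idea follows. It is not the contents of search.py: the function name `search_keyword`, the case-insensitive matching and the (file name, line number, line) result shape are assumptions made for illustration.

```python
from pathlib import Path
from typing import List, Tuple


def search_keyword(directory: str, keyword: str) -> List[Tuple[str, int, str]]:
    """Return (file name, line number, line) for every line containing the keyword."""
    needle = keyword.lower()  # assumed: matching ignores case
    matches = []
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                matches.append((path.name, line_number, line.strip()))
    return matches


if __name__ == "__main__":
    # "nytimes.com" is the example domain from the usage above.
    for name, line_number, line in search_keyword("nytimes.com", "election"):
        print(f"{name}:{line_number}: {line}")
```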


[*] Go usage: [NOT WORKING NOW DUE TO ARCHIVE BLOCKING TOO MANY PARALLEL REQUESTS]

go run download.go

and then:

go run search.go

The best way to use the Go version is by running the compiled executables:

go build search.go

go build download.go

Note that the Go version also includes download_channels.go (thanks to Stephen Paulger for this improvement), which is a bit more efficient. Consider testing both!

