Query Web Archive Crawl Indexes (‘CDX’)
Methods are provided to retrieve web archive crawl index (‘CDX’) metadata and directly query the ‘CDX’ ‘API’ endpoint to retrieve mementos for a given set of parameters.
The following functions are implemented:
cdx_query
: Query a CDX index endpointfetch_collections_index
: Fetch collections index
devtools::install_github("hrbrmstr/cdx")
library(cdx)
library(tidyverse)
# current verison
packageVersion("cdx")
## [1] '0.1.0'
cidx <- fetch_collections_index()
rprj <- cdx_query(cidx$cdx_api[1], "*.r-project.org")
rprj
## # A tibble: 14,358 x 12
## urlkey timestamp length url mime_detected offset mime filename status languages charset digest
## * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 org,r-project)/ 20180818010708 2835 http:… text/html 567295… text… crawl-data… 200 eng UTF-8 WTRYJ…
## 2 org,r-project)/ 20180819003057 700 http:… text/html 7358684 text… crawl-data… 301 <NA> <NA> RGN4E…
## 3 org,r-project)/ 20180819003057 2834 http:… text/html 550642… text… crawl-data… 200 eng UTF-8 WTRYJ…
## 4 org,r-project)/ 20180819003103 702 https… text/html 207506… text… crawl-data… 301 <NA> <NA> AMYSZ…
## 5 org,r-project)/ 20180819003103 2839 https… text/html 971885… text… crawl-data… 200 eng UTF-8 WTRYJ…
## 6 org,r-project)/ 20180819194145 2832 http:… text/html 537747… text… crawl-data… 200 eng UTF-8 WTRYJ…
## 7 org,r-project)/ 20180820013726 2832 http:… text/html 570313… text… crawl-data… 200 eng UTF-8 WTRYJ…
## 8 org,r-project)/ 20180820102922 556 https… <NA> 256576… warc… crawl-data… 304 <NA> <NA> 3I42H…
## 9 org,r-project)/ 20180821155855 702 https… text/html 208822… text… crawl-data… 301 <NA> <NA> AMYSZ…
## 10 org,r-project)/ 20180821155857 2837 https… text/html 962600… text… crawl-data… 200 eng UTF-8 WTRYJ…
## # ... with 14,348 more rows