Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.
- Install warc2text from https://github.com/bitextor/warc2text
- Installation will later move to EasyBuild
- On CESNET, this recipe is currently used
./run_warc2text.sh ../wide15-sample300/ test_filtered 250 ./
This takes WARCs from ../wide15-sample300/, saves the extracted texts and URLs to test_filtered and the logs to test_filtered_logs, runs the extraction in 250 parallel processes, and filters documents using the filters from this repository.
To run without filters, omit the last argument:
./run_warc2text.sh ../wide15-sample300/ test_filtered 250
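The internals of run_warc2text.sh are not shown here. As a rough illustration of the general pattern (one process per WARC, a fixed number of processes at a time, outputs and logs in sibling folders), here is a minimal sketch. The function name parallel_over_warcs, the EXTRACT_CMD variable, the *.warc.gz file pattern and the .out/.log file names are all assumptions made for this example, not part of the repository.

```shell
# Sketch only: run a per-file command over every WARC in a directory,
# with a fixed number of parallel processes.
# EXTRACT_CMD is a hypothetical stand-in for the real extraction command;
# it defaults to `echo` so the sketch runs without warc2text installed.
parallel_over_warcs() {
    local in_dir="$1" out_dir="$2" jobs="$3"
    # Logs go to a sibling folder named <out_dir>_logs.
    mkdir -p "$out_dir" "${out_dir}_logs"
    find "$in_dir" -type f -name '*.warc.gz' -print0 |
        EXTRACT_CMD="${EXTRACT_CMD:-echo}" xargs -0 -P "$jobs" -I{} bash -c '
            name=$(basename "$1" .warc.gz)
            "$EXTRACT_CMD" "$1" > "${2}/${name}.out" 2> "${2}_logs/${name}.log"
        ' _ {} "$out_dir"
}
```

With the default stand-in command, `parallel_over_warcs ../wide15-sample300 test_filtered 250` would only write each input path to its own output file; the real extraction command would need to be substituted for EXTRACT_CMD.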
cd stats
bash text_stats.sh ../test_filtered ../test_filtered_stats 250
This calculates statistics for the texts in ../test_filtered extracted by warc2text (number of bytes, words as reported by wc, newlines, and documents for each language) and saves them to ../test_filtered_stats in .tsv format, using 250 parallel processes. It also generates basic plots for some of these metrics and saves them to the same folder.
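To make the metrics above concrete, here is a minimal, sequential sketch of how such per-language counts could be computed with wc. It assumes a layout of <dir>/<lang>/text.gz with one base64-encoded document per line; that layout, the function name lang_stats and the column order are assumptions for this example and may not match what text_stats.sh does.

```shell
# Sketch only: per-language document, byte, word and newline counts as TSV.
# Assumes <dir>/<lang>/text.gz holds one base64-encoded document per line
# (an assumption about the output layout, not taken from text_stats.sh).
lang_stats() {
    local text_dir="$1" f lang docs newlines words bytes line
    printf 'lang\tdocs\tbytes\twords\tnewlines\n'
    for f in "$text_dir"/*/text.gz; do
        [ -e "$f" ] || continue
        lang=$(basename "$(dirname "$f")")
        # One encoded line per document.
        read -r docs < <(gzip -dc "$f" | wc -l)
        # Decode each document and count over the concatenated text;
        # wc always prints newlines, words, bytes in that order.
        read -r newlines words bytes < <(
            gzip -dc "$f" | while IFS= read -r line; do
                printf '%s' "$line" | base64 -d
            done | wc -lwc
        )
        printf '%s\t%s\t%s\t%s\t%s\n' "$lang" "$docs" "$bytes" "$words" "$newlines"
    done
}
```

For example, `lang_stats ../test_filtered > stats.tsv` would write one row per language folder, where stats.tsv is a hypothetical output name. Decoding line by line is slow on large files; it is used here only to keep the sketch simple.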
Language statistics were calculated for cc40, wide00015 and wide00017. To generate custom plots comparing different statistics across several languages and datasets, you may want to start with this notebook.
git clone git@github.com:paracrawl/giashard.git
CGO_ENABLED=0 go build \
-o giashard-static \
-a -ldflags '-extldflags "-static"' \
github.com/paracrawl/giashard/cmd/giashard
CGO_ENABLED=0 go build \
-o giamerge-static \
-a -ldflags '-extldflags "-static"' \
github.com/paracrawl/giashard/cmd/giamerge
Running giashard:
cd path/to/data
./giashard.sh wide00016 mt
In theory, that will create a wide00016-shards/mt folder.
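Since the output location is only the expected one, a quick check after the run may be useful. The helper below is a hypothetical sketch (check_shards is not part of this repository) that only tests whether a directory exists and is non-empty.

```shell
# Sketch only: report whether an expected output directory exists and has content.
# check_shards is a hypothetical helper, not part of giashard or this repository.
check_shards() {
    local dir="$1"
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]; then
        echo "ok: $dir"
    else
        echo "missing or empty: $dir" >&2
        return 1
    fi
}
```

For the run above this would be called as `check_shards wide00016-shards/mt`.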