warc2text-runner

Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.

Install

Install warc2text from https://github.com/bitextor/warc2text
- installation will later move to EasyBuild
- on CESNET, this recipe is currently used

Run

./run_warc2text.sh ../wide15-sample300/ test_filtered 250 ./

This takes WARCs from ../wide15-sample300/, saves the extracted texts and URLs to test_filtered and the logs to test_filtered_logs, runs the extraction in 250 parallel processes, and filters documents using the filters from this repository (enabled by the fourth argument, ./ here).

To run without filters:

./run_warc2text.sh ../wide15-sample300/ test_filtered 250
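
As a rough illustration of what "N parallel processes" means here, the sketch below fans a per-file command out over the WARC files in a directory. It is not taken from run_warc2text.sh: the helper name run_parallel and the *.warc.gz file pattern are assumptions made for the example, and it relies on the -print0, -0, -r and -P options available in GNU find and xargs.

```shell
# Sketch only: run a command once per WARC file, several at a time.
# run_parallel is a hypothetical helper, not part of this repository.
#
# Usage: run_parallel INPUT_DIR JOBS COMMAND [ARGS...]
# Runs "COMMAND [ARGS...] FILE" for every *.warc.gz file under INPUT_DIR,
# keeping at most JOBS processes running at the same time.
run_parallel() (
    input_dir=$1
    jobs=$2
    shift 2
    # -print0 / -0 keep file names with spaces intact;
    # -n 1 passes one file per invocation; -r skips the run on empty input.
    find "$input_dir" -type f -name '*.warc.gz' -print0 |
        xargs -0 -r -n 1 -P "$jobs" "$@"
)
```

For example, run_parallel ../wide15-sample300/ 250 echo would print every matching file name, with up to 250 echo processes running concurrently. A real extraction command would take the place of echo.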

Calculate language statistics

cd stats
bash text_stats.sh ../test_filtered ../test_filtered_stats 250

This calculates statistics for the texts in ../test_filtered extracted by warc2text (for each language: the number of bytes, words as reported by wc, newlines, and documents) and saves them to ../test_filtered_stats in .tsv format, using 250 parallel processes. It additionally generates basic plots for some of these metrics and saves them to the same folder.
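
To make the per-language metrics concrete, here is a minimal sketch of collecting byte, word, and newline counts with wc and writing them as TSV. It is not the logic of text_stats.sh: the helper name lang_stats and the layout of one plain-text file per language (TEXT_DIR/<lang>.txt) are assumptions for illustration, and the document count is left out because it depends on how warc2text separates documents in its output.

```shell
# Sketch only: per-language byte, word and newline counts as TSV.
# lang_stats is a hypothetical helper; it assumes one plain-text file
# per language named TEXT_DIR/<lang>.txt, which may not match the
# actual warc2text output layout.
lang_stats() (
    text_dir=$1
    printf 'lang\tbytes\twords\tnewlines\n'
    for f in "$text_dir"/*.txt; do
        [ -e "$f" ] || continue        # glob matched nothing
        lang=$(basename "$f" .txt)
        # wc always prints newlines, words, bytes in that order;
        # reading from stdin keeps the file name out of the output.
        set -- $(wc -l -w -c < "$f")
        printf '%s\t%s\t%s\t%s\n' "$lang" "$3" "$2" "$1"
    done
)
```

Calling lang_stats some_dir > stats.tsv would write a header row followed by one row per language file.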

Collected statistics and plots

Language statistics were calculated for cc40, wide00015 and wide00017. To generate custom plots comparing different statistics across several languages and datasets, you may want to start with this notebook.

Compiling giashard

git clone git@github.com:paracrawl/giashard.git

CGO_ENABLED=0 go build \
  -o giashard-static \
  -a -ldflags '-extldflags "-static"' \
  github.com/paracrawl/giashard/cmd/giashard

CGO_ENABLED=0 go build \
  -o giamerge-static \
  -a -ldflags '-extldflags "-static"' \
  github.com/paracrawl/giashard/cmd/giamerge

Running giashard:

cd path/to/data

./giashard.sh wide00016 mt

In theory, this creates a wide00016-shards/mt folder.
