
# scripts

## Wikipedia Extraction

First, install WikiExtractor.
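
A minimal way to do this, assuming the PyPI package is sufficient for `dump.sh` (the README does not pin a specific version or install method):

```sh
# Install WikiExtractor from PyPI; a source checkout of
# https://github.com/attardi/wikiextractor also works if you prefer.
pip install wikiextractor
```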

Then, download the Wikipedia dump and extract the texts with the following script. This produces bz2-compressed files and `gold.txt`.

```sh
wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
./scripts/wiki2txt/dump.sh jawiki-latest-pages-articles.xml.bz2 /path/to/output
```

Here is a sample (xz-compressed; 99 MB; original: 449 MB). It is licensed under CC BY-SA 3.0, the same license as Wikipedia.

## Error candidate extraction with Wikipedia links

```sh
cat /path/to/output/gold.txt | ./scripts/gold2plain | mecab | gzip > /path/to/out.mecab.gz

./scripts/eval -g /path/to/output/gold.txt -i <(zcat /path/to/out.mecab.gz) > /path/to/output.json

cat /path/to/output.json | ./scripts/pp > /path/to/output.err.tsv
```

## Filter candidates

```sh
python3 ./scripts/filter.py \
    -e <(zcat /path/to/excludes.json.gz | jq -r .plain) \
    -i <(zcat unidic.err.json.gz | python3 ./scripts/pp) \
    > /path/to/unidic.only.tsv
```