Commit

cleaned a bit repo; committed changes corresponding to HIPE2022 data release v1.0
Matteo Romanello committed Mar 3, 2022
1 parent eff4486 commit 188219b
Showing 40 changed files with 1,793 additions and 1,315 deletions.
18 changes: 16 additions & 2 deletions Makefile
@@ -3,11 +3,25 @@ SHELL:=/bin/bash
SCHEMA?= data/preparation/TypeSystem.xml
DATA_DIR?=data/preparation
RELEASE_DIR?=data/release
VERSION?=v0.1
VERSION?=v1.0
ASSIGNMENTS_TABLE=document-selection.tsv

##########################################
# Make commands for full corpus release #
##########################################

#all: clean download export release

corpus: download-corpus

download-corpus: download-corpus-en

download-corpus-%:
	python scripts/inception/download_curated.py --project-name=ajmc-corpus-$* --output-dir=$(DATA_DIR)/corpus/$*/curated/


##########################################
# Make commands for sample data release #
##########################################

miniref: download-miniref retokenize-miniref export-miniref release-miniref

134 changes: 67 additions & 67 deletions data/preparation/logs/export-annotated-de.log

Large diffs are not rendered by default.

60 changes: 30 additions & 30 deletions data/preparation/logs/export-annotated-en.log
@@ -1,30 +1,30 @@
2022-01-13 15:59:16,835 - root - INFO - Start conversion of 5 files.
2022-01-13 15:59:16,848 - root - INFO - Converting data/preparation/minireference/en/retokenized/cu31924087948174_0035.xmi into data/preparation/minireference/en/tsv/cu31924087948174_0035.tsv
2022-01-13 15:59:17,075 - root - INFO - Hyphenation – Removed character - from spec-tator => spectator
2022-01-13 15:59:17,076 - root - INFO - Hyphenation – Removed character - from her-self => herself
2022-01-13 15:59:17,081 - root - INFO - Converting data/preparation/minireference/en/retokenized/cu31924087948174_0063.xmi into data/preparation/minireference/en/tsv/cu31924087948174_0063.tsv
2022-01-13 15:59:17,274 - root - INFO - Hyphenation – Removed character - from occur-ring => occurring
2022-01-13 15:59:17,275 - root - INFO - Hyphenation – Removed character - from κελαι-νώπαν => κελαινώπαν
2022-01-13 15:59:17,276 - root - INFO - Hyphenation – Removed character - from mari-ners, => mariners,
2022-01-13 15:59:17,279 - root - INFO - Hyphenation – Removed character - from dark-ened => darkened
2022-01-13 15:59:17,284 - root - INFO - Converting data/preparation/minireference/en/retokenized/sophoclesplaysa05campgoog_0014.xmi into data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0014.tsv
2022-01-13 15:59:17,414 - root - ERROR - Transcript for entity Aristotle (Δ δίς 1. 15 § 13) is present in data/preparation/minireference/en/retokenized/sophoclesplaysa05campgoog_0014.xmi, yet entity is not marked as noisy. Levenshtein distance is computed nevertheless.
2022-01-13 15:59:17,416 - root - INFO - Hyphenation – Removed character - from con-nected => connected
2022-01-13 15:59:17,419 - root - INFO - Hyphenation – Removed character - from inter-polation => interpolation
2022-01-13 15:59:17,422 - root - INFO - Converting data/preparation/minireference/en/retokenized/sophoclesplaysa05campgoog_0146.xmi into data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0146.tsv
2022-01-13 15:59:17,589 - root - ERROR - Transcript for entity 257. 1075 is present in data/preparation/minireference/en/retokenized/sophoclesplaysa05campgoog_0146.xmi, yet entity is not marked as noisy. Levenshtein distance is computed nevertheless.
2022-01-13 15:59:17,590 - root - ERROR - Transcript for entity Philostratus ( Viz. Apoll. Δ. 22 § 5) is present in data/preparation/minireference/en/retokenized/sophoclesplaysa05campgoog_0146.xmi, yet entity is not marked as noisy. Levenshtein distance is computed nevertheless.
2022-01-13 15:59:17,592 - root - INFO - Hyphenation – Removed character - from xpé-vov) => xpévov)
2022-01-13 15:59:17,593 - root - INFO - Hyphenation – Removed character - from how-ever, => however,
2022-01-13 15:59:17,597 - root - INFO - Hyphenation – Removed character - from per-son, => person,
2022-01-13 15:59:17,604 - root - INFO - Hyphenation – Removed character - from διοί-yew => διοίyew
2022-01-13 15:59:17,609 - root - INFO - Converting data/preparation/minireference/en/retokenized/sophoclesplaysa05campgoog_0288.xmi into data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0288.tsv
2022-01-13 15:59:17,758 - root - INFO - Hyphenation – Removed character - from re-corded => recorded
2022-01-13 15:59:17,760 - root - INFO - Hyphenation – Removed character - from ξυν-ηρετεῖν => ξυνηρετεῖν
2022-01-13 15:59:17,762 - root - INFO - Hyphenation – Removed character - from ξυνηρε-Toes: => ξυνηρεToes:
2022-01-13 15:59:17,762 - root - INFO - Hyphenation – Removed character - from συῤνγ-ἥσεις => συῤνγἥσεις
2022-01-13 15:59:17,763 - root - INFO - Hyphenation – Removed character - from ξυναρτί-oes.) => ξυναρτίoes.)
2022-01-13 15:59:17,764 - root - INFO - Hyphenation – Removed character - from ξύμ-πλουν => ξύμπλουν
2022-01-13 15:59:17,765 - root - INFO - Hyphenation – Removed character - from ellipti-cal => elliptical
2022-01-13 15:59:17,770 - root - INFO - Hyphenation – Removed character - from *vio-lence’ => *violence’
2022-01-13 15:59:17,774 - root - INFO - Conversion completed.
2022-02-14 10:54:32,861 - root - INFO - Start conversion of 5 files.
2022-02-14 10:54:32,875 - root - INFO - Converting data/preparation/minireference/en/retokenized/cu31924087948174_0035.xmi into data/preparation/minireference/en/tsv/cu31924087948174_0035.tsv
2022-02-14 10:54:33,114 - root - INFO - Hyphenation – Removed character - from spec-tator => spectator
2022-02-14 10:54:33,116 - root - INFO - Hyphenation – Removed character - from her-self => herself
2022-02-14 10:54:33,121 - root - INFO - Converting data/preparation/minireference/en/retokenized/cu31924087948174_0063.xmi into data/preparation/minireference/en/tsv/cu31924087948174_0063.tsv
2022-02-14 10:54:33,330 - root - INFO - Hyphenation – Removed character - from occur-ring => occurring
2022-02-14 10:54:33,331 - root - INFO - Hyphenation – Removed character - from κελαι-νώπαν => κελαινώπαν
2022-02-14 10:54:33,332 - root - INFO - Hyphenation – Removed character - from mari-ners, => mariners,
2022-02-14 10:54:33,335 - root - INFO - Hyphenation – Removed character - from dark-ened => darkened
2022-02-14 10:54:33,339 - root - INFO - Converting data/preparation/minireference/en/retokenized/sophoclesplaysa05campgoog_0014.xmi into data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0014.tsv
2022-02-14 10:54:33,478 - root - ERROR - Transcript for entity Aristotle (Δ δίς 1. 15 § 13) is present in data/preparation/minireference/en/retokenized/sophoclesplaysa05campgoog_0014.xmi, yet entity is not marked as noisy. Levenshtein distance is computed nevertheless.
2022-02-14 10:54:33,480 - root - INFO - Hyphenation – Removed character - from con-nected => connected
2022-02-14 10:54:33,484 - root - INFO - Hyphenation – Removed character - from inter-polation => interpolation
2022-02-14 10:54:33,487 - root - INFO - Converting data/preparation/minireference/en/retokenized/sophoclesplaysa05campgoog_0146.xmi into data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0146.tsv
2022-02-14 10:54:33,663 - root - ERROR - Transcript for entity 257. 1075 is present in data/preparation/minireference/en/retokenized/sophoclesplaysa05campgoog_0146.xmi, yet entity is not marked as noisy. Levenshtein distance is computed nevertheless.
2022-02-14 10:54:33,664 - root - ERROR - Transcript for entity Philostratus ( Viz. Apoll. Δ. 22 § 5) is present in data/preparation/minireference/en/retokenized/sophoclesplaysa05campgoog_0146.xmi, yet entity is not marked as noisy. Levenshtein distance is computed nevertheless.
2022-02-14 10:54:33,666 - root - INFO - Hyphenation – Removed character - from xpé-vov) => xpévov)
2022-02-14 10:54:33,666 - root - INFO - Hyphenation – Removed character - from how-ever, => however,
2022-02-14 10:54:33,671 - root - INFO - Hyphenation – Removed character - from per-son, => person,
2022-02-14 10:54:33,679 - root - INFO - Hyphenation – Removed character - from διοί-yew => διοίyew
2022-02-14 10:54:33,683 - root - INFO - Converting data/preparation/minireference/en/retokenized/sophoclesplaysa05campgoog_0288.xmi into data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0288.tsv
2022-02-14 10:54:33,840 - root - INFO - Hyphenation – Removed character - from re-corded => recorded
2022-02-14 10:54:33,843 - root - INFO - Hyphenation – Removed character - from ξυν-ηρετεῖν => ξυνηρετεῖν
2022-02-14 10:54:33,845 - root - INFO - Hyphenation – Removed character - from ξυνηρε-Toes: => ξυνηρεToes:
2022-02-14 10:54:33,845 - root - INFO - Hyphenation – Removed character - from συῤνγ-ἥσεις => συῤνγἥσεις
2022-02-14 10:54:33,846 - root - INFO - Hyphenation – Removed character - from ξυναρτί-oes.) => ξυναρτίoes.)
2022-02-14 10:54:33,847 - root - INFO - Hyphenation – Removed character - from ξύμ-πλουν => ξύμπλουν
2022-02-14 10:54:33,848 - root - INFO - Hyphenation – Removed character - from ellipti-cal => elliptical
2022-02-14 10:54:33,853 - root - INFO - Hyphenation – Removed character - from *vio-lence’ => *violence’
2022-02-14 10:54:33,857 - root - INFO - Conversion completed.
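
The export log above records two kinds of events: end-of-line hyphens being removed when a split word is rejoined, and a Levenshtein distance being computed for entities that carry a transcript. The conversion script itself is not part of this diff, so the following is only a minimal sketch of both operations, assuming the split word arrives as two halves. The function names `dehyphenate` and `levenshtein` and the logger name are hypothetical and not taken from the repository.

```python
import logging

# Hypothetical logger name; the real script logs through the root logger.
logger = logging.getLogger("dehyphenation-sketch")


def dehyphenate(first_part: str, second_part: str, hyphen: str = "-") -> str:
    """Join two halves of a word that was split by an end-of-line hyphen."""
    if not first_part.endswith(hyphen):
        raise ValueError(f"{first_part!r} does not end with {hyphen!r}")
    hyphenated = first_part + second_part
    joined = first_part[: -len(hyphen)] + second_part
    # Mirrors the wording of the log lines shown in the diff.
    logger.info(
        "Hyphenation – Removed character %s from %s => %s",
        hyphen, hyphenated, joined,
    )
    return joined


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]
```

Calling `dehyphenate("spec-", "tator")` returns `"spectator"`, the same pair that appears in the first hyphenation line of the log. Trailing punctuation stays attached to the second half, as in `mari-ners,`.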
65 changes: 33 additions & 32 deletions data/preparation/logs/release-miniref.log
@@ -1,32 +1,33 @@
2022-01-13 15:59:21,731 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/cu31924087948174_0035.tsv
2022-01-13 15:59:21,732 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/cu31924087948174_0063.tsv
2022-01-13 15:59:21,732 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0014.tsv
2022-01-13 15:59:21,732 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0146.tsv
2022-01-13 15:59:21,733 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0288.tsv
2022-01-13 15:59:21,734 - __main__ - INFO - Written sample to data/release/v0.1/HIPE-2022-ajmc-v0.1-sample-en.tsv
2022-01-13 15:59:21,736 - __main__ - INFO - data/release/v0.1/HIPE-2022-ajmc-v0.1-sample-en.tsv contains all 5 expected documents
2022-01-13 15:59:21,737 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/cu31924087948174_0035-biblio.tsv
2022-01-13 15:59:21,737 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/cu31924087948174_0063-biblio.tsv
2022-01-13 15:59:21,738 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0014-biblio.tsv
2022-01-13 15:59:21,738 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0146-biblio.tsv
2022-01-13 15:59:21,738 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0288-biblio.tsv
2022-01-13 15:59:21,739 - __main__ - INFO - Written sample to data/release/v0.1/HIPE-2022-ajmc_biblio-v0.1-sample-en.tsv
2022-01-13 15:59:21,741 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0007.tsv
2022-01-13 15:59:21,741 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0016.tsv
2022-01-13 15:59:21,742 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0080.tsv
2022-01-13 15:59:21,742 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0087.tsv
2022-01-13 15:59:21,742 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0017.tsv
2022-01-13 15:59:21,743 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0049.tsv
2022-01-13 15:59:21,743 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0085.tsv
2022-01-13 15:59:21,744 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0125.tsv
2022-01-13 15:59:21,744 - __main__ - INFO - Written sample to data/release/v0.1/HIPE-2022-ajmc-v0.1-sample-de.tsv
2022-01-13 15:59:21,746 - __main__ - INFO - data/release/v0.1/HIPE-2022-ajmc-v0.1-sample-de.tsv contains all 8 expected documents
2022-01-13 15:59:21,746 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0007-biblio.tsv
2022-01-13 15:59:21,746 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0016-biblio.tsv
2022-01-13 15:59:21,747 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0080-biblio.tsv
2022-01-13 15:59:21,747 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0087-biblio.tsv
2022-01-13 15:59:21,747 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0017-biblio.tsv
2022-01-13 15:59:21,748 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0049-biblio.tsv
2022-01-13 15:59:21,748 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0085-biblio.tsv
2022-01-13 15:59:21,748 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0125-biblio.tsv
2022-01-13 15:59:21,749 - __main__ - INFO - Written sample to data/release/v0.1/HIPE-2022-ajmc_biblio-v0.1-sample-de.tsv
2022-02-14 10:54:35,262 - __main__ - INFO - Created folder data/release/v1.0 as it did not exist
2022-02-14 10:54:35,264 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/cu31924087948174_0035.tsv
2022-02-14 10:54:35,265 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/cu31924087948174_0063.tsv
2022-02-14 10:54:35,265 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0014.tsv
2022-02-14 10:54:35,266 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0146.tsv
2022-02-14 10:54:35,266 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0288.tsv
2022-02-14 10:54:35,268 - __main__ - INFO - Written sample to data/release/v1.0/HIPE-2022-v1.0-ajmc-sample-en.tsv
2022-02-14 10:54:35,269 - __main__ - INFO - data/release/v1.0/HIPE-2022-v1.0-ajmc-sample-en.tsv contains all 5 expected documents
2022-02-14 10:54:35,269 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/cu31924087948174_0035-biblio.tsv
2022-02-14 10:54:35,270 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/cu31924087948174_0063-biblio.tsv
2022-02-14 10:54:35,270 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0014-biblio.tsv
2022-02-14 10:54:35,271 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0146-biblio.tsv
2022-02-14 10:54:35,271 - __main__ - INFO - Read input from file data/preparation/minireference/en/tsv/sophoclesplaysa05campgoog_0288-biblio.tsv
2022-02-14 10:54:35,272 - __main__ - INFO - Written sample to data/release/v1.0/HIPE-2022-v1.0-ajmc_biblio-sample-en.tsv
2022-02-14 10:54:35,273 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0007.tsv
2022-02-14 10:54:35,274 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0016.tsv
2022-02-14 10:54:35,274 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0080.tsv
2022-02-14 10:54:35,274 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0087.tsv
2022-02-14 10:54:35,275 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0017.tsv
2022-02-14 10:54:35,276 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0049.tsv
2022-02-14 10:54:35,276 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0085.tsv
2022-02-14 10:54:35,276 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0125.tsv
2022-02-14 10:54:35,277 - __main__ - INFO - Written sample to data/release/v1.0/HIPE-2022-v1.0-ajmc-sample-de.tsv
2022-02-14 10:54:35,279 - __main__ - INFO - data/release/v1.0/HIPE-2022-v1.0-ajmc-sample-de.tsv contains all 8 expected documents
2022-02-14 10:54:35,280 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0007-biblio.tsv
2022-02-14 10:54:35,280 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0016-biblio.tsv
2022-02-14 10:54:35,281 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0080-biblio.tsv
2022-02-14 10:54:35,281 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/Wecklein1894_0087-biblio.tsv
2022-02-14 10:54:35,282 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0017-biblio.tsv
2022-02-14 10:54:35,282 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0049-biblio.tsv
2022-02-14 10:54:35,283 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0085-biblio.tsv
2022-02-14 10:54:35,283 - __main__ - INFO - Read input from file data/preparation/minireference/de/tsv/sophokle1v3soph_0125-biblio.tsv
2022-02-14 10:54:35,284 - __main__ - INFO - Written sample to data/release/v1.0/HIPE-2022-v1.0-ajmc_biblio-sample-de.tsv
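
The release log shows that the sample file names changed between the two runs: in v0.1 the version followed the dataset name (`HIPE-2022-ajmc-v0.1-sample-en.tsv`), while in v1.0 it precedes it (`HIPE-2022-v1.0-ajmc-sample-en.tsv`). As a sketch of the v1.0 naming scheme inferred from those log lines only, the helper below builds the expected path. `sample_release_path` is a hypothetical name, and the release script may assemble its paths differently.

```python
from pathlib import PurePosixPath


def sample_release_path(release_dir: str, version: str, dataset: str, language: str) -> str:
    """Build the v1.0-style sample path: <release_dir>/<version>/HIPE-2022-<version>-<dataset>-sample-<lang>.tsv"""
    filename = f"HIPE-2022-{version}-{dataset}-sample-{language}.tsv"
    # PurePosixPath keeps forward slashes on every platform, matching the log output.
    return str(PurePosixPath(release_dir) / version / filename)


# The defaults below correspond to RELEASE_DIR and VERSION in the Makefile diff.
print(sample_release_path("data/release", "v1.0", "ajmc", "en"))
print(sample_release_path("data/release", "v1.0", "ajmc_biblio", "de"))
```

The two printed paths correspond to the "Written sample to" entries for the English main sample and the German bibliographic sample in the 2022-02-14 run.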