updated dataset README
Matteo Romanello committed Jan 20, 2022
1 parent 60972d1 commit eff4486
Showing 1 changed file with 9 additions and 8 deletions.
17 changes: 9 additions & 8 deletions data/release/HIPE2022-ajmc-README.md
@@ -30,10 +30,10 @@ When sampling data for annotation, we kept only pages belonging to the introduct
This dataset comes in the CoNLL-like HIPE TSV format (for further details see the [HIPE 2020 Task Participation Guidelines](https://doi.org/10.5281/zenodo.3677171), p. 8). Sentence boundaries are indicated by the `EndOfLine` flag, contained in the `MISC` column, and correspond to manually identified linguistic sentences (see Guidelines, section 4). Hyphenated words were manually identified and re-composed (i.e. de-hyphenated).
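
As a minimal sketch of how the sentence flag described above could be used, the snippet below groups tokens into sentences. It rests on several assumptions that are not stated in this README: the token column is named `TOKEN`, metadata lines start with `# `, and multiple `MISC` flags are separated by `|`. The sample rows are invented for illustration.

```python
import csv

# Invented sample in a CoNLL-like TSV layout (column names are assumptions).
SAMPLE = (
    "TOKEN\tNE-COARSE-LIT\tMISC\n"
    "# document_id = demo\n"
    "Soph\tB-work\tNoSpaceAfter\n"
    ".\tI-work\tEndOfLine\n"
    "Ph\tB-work\tNoSpaceAfter\n"
    ".\tI-work\t_\n"
)


def read_sentences(tsv_text, flag="EndOfLine"):
    """Group token rows into sentences, closing one whenever `flag` is in MISC."""
    # Drop blank lines and metadata lines (assumed to start with "# ").
    lines = [
        ln for ln in tsv_text.splitlines()
        if ln.strip() and not ln.startswith("# ")
    ]
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    sentences, current = [], []
    for row in reader:
        current.append(row["TOKEN"])
        # MISC may hold several flags; "|" as separator is an assumption.
        if flag in (row.get("MISC") or "").split("|"):
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack a closing flag
        sentences.append(current)
    return sentences


print(read_sentences(SAMPLE))
```

The flag name is a parameter so the same function can be reused if a release uses a different label for sentence boundaries.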

Annotated data come in two flavours, corresponding to two different sets of tasks:
-1) *NER and EL*: data contains annotations of universal entities, both coarse and fine grained, as well as entity links (see sample file).
-2) *Citation mining*: data contains annotations of bibliographic references to both primary and secondary sources, according to the taxonomy described in the Annotation Guidelines section 2.3 (see sample file).
+1) *NER and EL*: data contains annotations of universal entities, both coarse and fine grained, as well as entity links. See [sample file (English)](v0.1/HIPE-2022-ajmc-v0.1-sample-en.tsv).
+2) *Citation mining*: data contains annotations of bibliographic references to both primary and secondary sources, according to the taxonomy described in the Annotation Guidelines section 2.3. See [sample file (English)](v0.1/HIPE-2022-ajmc_biblio-v0.1-sample-en.tsv).

-**NB**: the two files are fully aligned, meaning line e.g. 100 in both files refers to the same annotated token. As such, information from both files can be used in a multi-task learning scenario.
+**NB**: the two files are fully aligned, meaning that line *n* in both files refers to the same annotated token. As such, information from both files can be combined and used in multi-task learning scenarios.
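
The alignment property can be exploited with a simple row-by-row merge. The following is a sketch only: the column names `TOKEN` and `NE-COARSE-LIT`, the `# ` prefix for metadata lines, and the labels in the sample rows are assumptions made for illustration, not taken from the released files.

```python
import csv


def read_rows(tsv_text):
    """Parse TSV text into dict rows, skipping blank and '# ' metadata lines."""
    lines = [
        ln for ln in tsv_text.splitlines()
        if ln.strip() and not ln.startswith("# ")
    ]
    return list(csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE))


def merge_aligned(ner_text, biblio_text, label_col="NE-COARSE-LIT"):
    """Pair the n-th token row of both files and keep one label from each."""
    ner_rows = read_rows(ner_text)
    bib_rows = read_rows(biblio_text)
    if len(ner_rows) != len(bib_rows):
        raise ValueError("files are not aligned: different number of token rows")
    merged = []
    for i, (ner, bib) in enumerate(zip(ner_rows, bib_rows)):
        # Sanity check: aligned rows should carry the same token.
        if ner["TOKEN"] != bib["TOKEN"]:
            raise ValueError(f"token mismatch at row {i}")
        merged.append(
            {"token": ner["TOKEN"], "ner": ner[label_col], "biblio": bib[label_col]}
        )
    return merged


# Invented rows; the label values are placeholders.
NER = "TOKEN\tNE-COARSE-LIT\nSoph.\tB-pers\nPh.\tB-work\n100\tB-scope\n"
BIB = "TOKEN\tNE-COARSE-LIT\nSoph.\tB-primary\nPh.\tI-primary\n100\tI-primary\n"

for row in merge_aligned(NER, BIB):
    print(row)
```

Checking the token on each pair of rows is a cheap guard against silently combining labels from files that have drifted out of alignment.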

## Statistics

@@ -45,12 +45,13 @@ The digitized commentaries are available in the Internet Archive and released in

## Domain specificity

-TBD
+This dataset raises some challenges for NER and EL that are related to its domain-specific nature:
+
-- data sparsity
-- entity linking/importance of context
-- abbreviations: not commonly found in Wikidata, but they can be found in hucitlib (partly linked to Wikidata), see below.
+- *data sparsity*: some entity types are under-represented in this dataset (e.g. `date`), which calls for approaches that deal with data sparsity (e.g. data augmentation, meta-learning);
+- *dependence on context*: the overall context of a commentary directly affects how entity mentions are phrased, especially how concise they are. This matters particularly for EL, where capturing the global document context becomes essential to selecting the correct linking candidate. For example, a scholar commenting on a tragedy by Sophocles will probably omit the ancient author's name when referring to other works by Sophocles: to refer to a line of Sophocles' play *Philoctetes*, she may write "*Ph.* 100" instead of the more easily intelligible "Soph. *Philoct.* 100".
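
To illustrate why document context matters for linking, here is a toy sketch of context-aware candidate filtering. The abbreviation table is a small hand-made example (hypothetical, not taken from the dataset or from any knowledge base), and `resolve_work` is an illustrative helper name rather than part of any released tooling.

```python
# Toy abbreviation table: abbreviation -> list of (author, work) candidates.
# Entries are illustrative only.
TOY_ABBREVIATIONS = {
    "Ph.": [("Sophocles", "Philoctetes"), ("Euripides", "Phoenissae")],
    "Philoct.": [("Sophocles", "Philoctetes")],
}


def resolve_work(abbreviation, context_author=None, table=TOY_ABBREVIATIONS):
    """Return candidate (author, work) pairs, narrowed by the document's author."""
    candidates = list(table.get(abbreviation, []))
    if context_author is not None:
        narrowed = [c for c in candidates if c[0] == context_author]
        # Fall back to all candidates if the context rules everything out.
        if narrowed:
            candidates = narrowed
    return candidates


# Without context the abbreviation stays ambiguous.
print(resolve_work("Ph."))
# Knowing the commentary is about Sophocles narrows it down.
print(resolve_work("Ph.", context_author="Sophocles"))
```

A real system would score candidates rather than filter them, but the sketch shows how a document-level signal can disambiguate a mention that is ambiguous in isolation.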

## Related resources

-**hucitlib knowledge base.** Commentators make abundant use of very concise abbreviations when referring e.g. to ancient authors (`pers.author`) and their works (`work.primlit`). Such abbreviations constitute a substantial challenge, especially for entity linking. An external resource that can be used in this respect is the [`hucitlib` knowledge base](https://hucitlib.readthedocs.io/) which is partially linked to Wikidata and provides abbreviations and variant names/titles for classical authors and their works.
+**Hucitlib Knowledge Base.** Commentators make abundant use of very concise abbreviations when referring e.g. to ancient authors (`pers.author`) and their works (`work.primlit`). Such abbreviations constitute a substantial challenge, especially for entity linking. An external resource that can be used in this respect is the [`hucitlib` knowledge base](https://hucitlib.readthedocs.io/) which is partially linked to Wikidata and provides abbreviations and variant names/titles for classical authors and their works.

**Citation mining.** The dataset [*Annotated References in the Historiography on Venice: 19th–21st centuries*](http://doi.org/10.5334/johd.9), despite originating from a slightly different domain (i.e. history of Venice), contains annotations of primary and secondary bibliographic references. The guidelines according to which it was annotated are compatible with our guidelines for bibliographic entities.
