Skip to content

Our experiments with cross-lingual parsing. A croSSSynt project. Starting with the SFNW paper (Slavic Forest, Norwegian Wood).

Notifications You must be signed in to change notification settings

ptakopysk/crosssynt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

To use the "Slavic Forest, Norwegian Wood" cross-lingual parsing system,
you need to fulfill several prerequisities:

You need to have the following software:
- Bash
- Python 3
- Perl
- Treex (http://ufal.mff.cuni.cz/treex)
- UDPipe (bundled in tools/)
- word2vec (bundled in tools/)
- Moses (install it in mosesdecoder/ !!!)

You need to have the following data:
- UD treebank for the source language
    - train is sufficient
- UD treebank for the target language
    - dev for evaluation
    - train for training the tagger and the supervised upper bound
      (or you may provide a trained target tagger and omit the supervised evaluation)
- parallel data (separate plaintext file for each of the languages, one
  sentence per line, with an implicit sentence-level alignment, i.e. sentences
  on the same line correspond to each other)
Small sample data for Czech and Polish are provided in the sample/ directory.

To run the pipeline on the sample data, run:
    ./sample.sh
This will print a lot of debug info to the error output,
and eventually the evaluation scores to standard output.

Look into the sample.sh script to see what needs to be done to run the
pipeline on some data:
- put the necessary treebanks into the treebanks/ directory
- put the necessary parallel data into the para/ directory
- create the script to run from the all-SSS-TTT.sh template
  by substituting SSS for the source language and TTT for the target language
  (can be done by running ./new.sh [srclang] [tgtlang])
- go to the run/ directory and run the script  

The source code of the template file is commented, modify it or remove parts
of it according to your needs.

If you have an SGE cluster, use the all-SSS-TTT.shc template instead,
generate the run script with ./new_shc.sh,
and submit it by going to run/ and calling qsub scriptname.shc
(you may need to change some of the setting in the shc template).

Of course, you need to get the treebank and parallel data somewhere.
If you are lucky, you can get these automatically:
- for treebanks, simply run:
    ./get_UD-1.4_treebanks.sh
- for parallel data, try to run e.g.:
    ./try_get_para_SRC_TGT.sh cs pl

Authors: Rudolf Rosa, Daniel Zeman, David Mareček and Zdeněk Žabokrtský
    <{rosa,zeman,marecek,zabokrtsky}@ufal.mff.cuni.cz>
    Institute of Formal and Applied Linguistics, Charles University
Licence: GPL v2
Year: 2017


Versions

v1.0 corresponds to our TLT 2018 paper: 
Rosa Rudolf, Žabokrtský Zdeněk: Error Analysis of Cross-lingual Tagging and Parsing. In: Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories, Copyright © Univerzita Karlova, Praha, Czechia, ISBN 978-80-88132-04-2, pp. 106-118, 2017

v2.0 corrseponds to my 2018 dissertation:
Rosa Rudolf: Discovering the structure of natural language sentences by semi-supervised methods. Ph.D. thesis, Charles University, Faculty of Mathematics and Physics, Praha, Czechia, 186 pp., Jun 2018