GitHub - ptakopysk/crosssynt: Our experiments with cross-lingual parsing. A croSSSynt project. Starting with the SFNW paper (Slavic Forest, Norwegian Wood).

ptakopysk / crosssynt Public

Notifications You must be signed in to change notification settings
Fork 0
Star 0

Our experiments with cross-lingual parsing. A croSSSynt project. Starting with the SFNW paper (Slavic Forest, Norwegian Wood).

ufal.mff.cuni.cz/grants/crosssynt

0 stars 0 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 203 Commits
delex_uas_las		delex_uas_las
rur_aligner		rur_aligner
sample		sample
templates		templates
tools		tools
LANG_PAIRS		LANG_PAIRS
README		README
TODO		TODO
TODO_proper_sents		TODO_proper_sents
get_UD-1.4_treebanks.sh		get_UD-1.4_treebanks.sh
get_UD-2.1_treebanks_local.sh		get_UD-2.1_treebanks_local.sh
get_para.shca		get_para.shca
get_wikiextractor.sh		get_wikiextractor.sh
install_fast_align.sh		install_fast_align.sh
install_hunalign.sh		install_hunalign.sh
install_moses_and_mgiza.sh		install_moses_and_mgiza.sh
install_udtools.sh		install_udtools.sh
langs_para		langs_para
langs_wals		langs_wals
mosesdecoder		mosesdecoder
new_shc_setup.sh		new_shc_setup.sh
new_shc_setup_foreach_ensrc_para.sh		new_shc_setup_foreach_ensrc_para.sh
new_shc_setup_foreach_entgt_para.sh		new_shc_setup_foreach_entgt_para.sh
new_shc_setup_foreach_src_src.sh		new_shc_setup_foreach_src_src.sh
new_shc_setup_foreach_src_tgt.sh		new_shc_setup_foreach_src_tgt.sh
new_shc_setup_foreach_tgt_tgt.sh		new_shc_setup_foreach_tgt_tgt.sh
new_shc_setup_foreachpair.sh		new_shc_setup_foreachpair.sh
process_treebanks.sh		process_treebanks.sh
sample.sh		sample.sh
source.bash		source.bash
try_get_para_SRC_TGT.sh		try_get_para_SRC_TGT.sh
wiki_langs_list		wiki_langs_list

Repository files navigation

To use the "Slavic Forest, Norwegian Wood" cross-lingual parsing system,
you need to fulfill several prerequisities:

You need to have the following software:
- Bash
- Python 3
- Perl
- Treex (http://ufal.mff.cuni.cz/treex)
- UDPipe (bundled in tools/)
- word2vec (bundled in tools/)
- Moses (install it in mosesdecoder/ !!!)

You need to have the following data:
- UD treebank for the source language
    - train is sufficient
- UD treebank for the target language
    - dev for evaluation
    - train for training the tagger and the supervised upper bound
      (or you may provide a trained target tagger and omit the supervised evaluation)
- parallel data (separate plaintext file for each of the languages, one
  sentence per line, with an implicit sentence-level alignment, i.e. sentences
  on the same line correspond to each other)
Small sample data for Czech and Polish are provided in the sample/ directory.

To run the pipeline on the sample data, run:
    ./sample.sh
This will print a lot of debug info to the error output,
and eventually the evaluation scores to standard output.

Look into the sample.sh script to see what needs to be done to run the
pipeline on some data:
- put the necessary treebanks into the treebanks/ directory
- put the necessary parallel data into the para/ directory
- create the script to run from the all-SSS-TTT.sh template
  by substituting SSS for the source language and TTT for the target language
  (can be done by running ./new.sh [srclang] [tgtlang])
- go to the run/ directory and run the script  

The source code of the template file is commented, modify it or remove parts
of it according to your needs.

If you have an SGE cluster, use the all-SSS-TTT.shc template instead,
generate the run script with ./new_shc.sh,
and submit it by going to run/ and calling qsub scriptname.shc
(you may need to change some of the setting in the shc template).

Of course, you need to get the treebank and parallel data somewhere.
If you are lucky, you can get these automatically:
- for treebanks, simply run:
    ./get_UD-1.4_treebanks.sh
- for parallel data, try to run e.g.:
    ./try_get_para_SRC_TGT.sh cs pl

Authors: Rudolf Rosa, Daniel Zeman, David Mareček and Zdeněk Žabokrtský
    <{rosa,zeman,marecek,zabokrtsky}@ufal.mff.cuni.cz>
    Institute of Formal and Applied Linguistics, Charles University
Licence: GPL v2
Year: 2017


Versions

v1.0 corresponds to our TLT 2018 paper: 
Rosa Rudolf, Žabokrtský Zdeněk: Error Analysis of Cross-lingual Tagging and Parsing. In: Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories, Copyright © Univerzita Karlova, Praha, Czechia, ISBN 978-80-88132-04-2, pp. 106-118, 2017

v2.0 corrseponds to my 2018 dissertation:
Rosa Rudolf: Discovering the structure of natural language sentences by semi-supervised methods. Ph.D. thesis, Charles University, Faculty of Mathematics and Physics, Praha, Czechia, 186 pp., Jun 2018