Update README.md #646

Closed · wants to merge 1 commit
6 changes: 3 additions & 3 deletions Data/ParlaMint-LT/README.md
@@ -12,9 +12,9 @@ All the proceedings (recordings and transcripts of the debates) on the Seimas floor

### Data source and acquisition

- Transcripts of the Seimas floor debates (in digital format) are freely available from the official website of the Seimas (www.lrs.lt). Data was automatically scraped from the official document search site of the Seimas: https://e-seimas.lrs.lt/portal/documentSearch/lt. We entered the period (2012-11-16 – 2020-11-10) and the type of document (“Stenograma”), and the search engine retrieved a total of 876 transcripts in MS Word (*.doc / *.docx) format. The list was then manually cross-checked against the list available on the main [Seimas website](www.lrs.lt/sip/portal.show?p_r=35727) and no additional transcripts were found. Thus, the corpus consists of 876 transcripts of the Seimas floor debates. These documents have a total of 244835 speeches (with 390179 segments in them) and 14780871 word units in aggregate.
+ Transcripts of the Seimas floor debates (in digital format) are freely available from the official website of the Seimas (www.lrs.lt). Data was automatically scraped from the official document search site of the Seimas: https://e-seimas.lrs.lt/portal/documentSearch/lt. We entered the period (1992-11-25 – 2022-12-23) and the type of document (“Stenograma”), and the search engine retrieved a total of 3823 transcripts in MS Word (*.doc / *.docx) format. The list was then manually cross-checked against the list available on the main [Seimas website](www.lrs.lt/sip/portal.show?p_r=35727) and inconsistencies were corrected. Eventually, the corpus consists of 3822 transcripts of the Seimas floor debates. These documents have a total of ??? speeches (with ??? segments in them) and ??? word units in aggregate.
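For readers who want to reproduce the acquisition step, a minimal sketch of such a scraper is given below. The e-seimas portal has no documented API, so the query parameter names (`dateFrom`, `dateTo`, `docType`) and the handling of the result list are illustrative assumptions, not the portal's actual interface.

```python
# Hypothetical sketch of the scraping step. The e-seimas portal has no
# documented API; the parameter names and result handling below are
# illustrative assumptions only.
import pathlib
import requests

SEARCH_URL = "https://e-seimas.lrs.lt/portal/documentSearch/lt"
OUT_DIR = pathlib.Path("transcripts")
OUT_DIR.mkdir(exist_ok=True)

# Assumed search parameters: the period covered by the corpus and the
# document type ("Stenograma" = transcript).
params = {
    "dateFrom": "1992-11-25",
    "dateTo": "2022-12-23",
    "docType": "Stenograma",
}

response = requests.get(SEARCH_URL, params=params, timeout=30)
response.raise_for_status()

# In practice the hit list (3823 documents) is paged HTML that has to be
# parsed, e.g. with BeautifulSoup, to extract the *.doc / *.docx links,
# which are then downloaded one by one into OUT_DIR.
```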

- Timespan of the corpus - the last two terms of the Seimas: 2012-11-16 - 2016-11-10 and 2016-11-14 - 2020-11-10.
+ Timespan of the corpus - seven full terms of the Seimas, starting from the second term after the restoration of independence on 1990-03-11: 1992-11-25 - 1996-11-19; 1996-11-25 - 2000-10-18; 2000-10-19 - 2004-11-11; 2004-11-15 - 2008-11-14; 2008-11-17 - 2012-11-14; 2012-11-16 - 2016-11-10; 2016-11-14 - 2020-11-10. In addition, transcripts from the latest term of the Seimas are included, starting from 2020-11-13 and ending on 2022-12-23. The first term was not included, as the texts of its transcripts are badly structured and need a lot of manual correction. Inclusion of this term is envisioned in the near future.

The retrieved files had to be converted into plain-text files so that they could be processed with text-analytic tools. It should be noted that the entire data set is in Lithuanian; therefore, it was essential to preserve UTF-8 encoding for further processing. This was something of a challenge, as the downloaded files came in different formats and encodings, so the data had to be unified before it could be processed automatically. Two converters were used: MultiDoc Converter (www.multidoc-converter.com/en/index.html) and EmEditor (www.emeditor.com).
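The corpus itself was converted with the two GUI tools named above; as a rough programmatic equivalent, the sketch below extracts the text from already-modernised `.docx` files and writes it out as UTF-8 plain text. The `python-docx` dependency and the `transcripts/` directory layout are assumptions for illustration.

```python
# Rough programmatic equivalent of the conversion step (the corpus itself
# was converted with MultiDoc Converter and EmEditor). Assumes the
# `python-docx` package and that legacy *.doc files have already been
# resaved as *.docx.
import pathlib

from docx import Document  # pip install python-docx

for path in pathlib.Path("transcripts").glob("*.docx"):
    doc = Document(str(path))
    text = "\n".join(paragraph.text for paragraph in doc.paragraphs)
    # An explicit UTF-8 encoding keeps the Lithuanian diacritics intact.
    path.with_suffix(".txt").write_text(text, encoding="utf-8")
```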

@@ -34,4 +34,4 @@ Mostly, structural elements/attributes included into the ParlaMint Schema were used

### Linguistic annotation

- The processing was carried out by means of a Python script combining an XML parser module within the Spacy package (https://spacy.io). The annotation pipeline includes tokenization, sentence segmentation, lemmatization, UD part-of-speech and morphological tagging, UD dependency parsing and named entity recognition.
+ The processing was carried out by means of a Python script combining an XML parser module with the spaCy package (https://spacy.io). The annotation pipeline includes tokenization, sentence segmentation, lemmatization, UD part-of-speech and morphological tagging, UD dependency parsing and named entity recognition.
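A minimal sketch of such an annotation pass is shown below; the README does not name the exact spaCy model, so the Lithuanian `lt_core_news_lg` package is an assumption.

```python
# Minimal sketch of the annotation pipeline. The exact model is not named
# in the README; spaCy's Lithuanian `lt_core_news_lg` is an assumption.
# Setup: pip install spacy && python -m spacy download lt_core_news_lg
import spacy

nlp = spacy.load("lt_core_news_lg")  # tagger, parser (sentences), NER
doc = nlp("Gerbiamieji kolegos, pradedame Seimo rytinį posėdį.")

for sent in doc.sents:  # sentence segmentation
    for token in sent:
        # form, lemma, UD part of speech, morphology, dependency relation
        print(token.text, token.lemma_, token.pos_, token.morph, token.dep_,
              sep="\t")

for ent in doc.ents:  # named entities
    print(ent.text, ent.label_)
```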