Update README.md #646

Closed · wants to merge 1 commit
6 changes: 3 additions & 3 deletions Data/ParlaMint-LT/README.md
@@ -12,9 +12,9 @@ All the proceedings (recordings and transcripts of the debates) on the Seimas floor

### Data source and acquisition

- Transcripts of the Seimas floor debates (in digital format) are freely available from the official website of the Seimas (www.lrs.lt). Data was automatically scraped from the official document search site of the Seimas: https://e-seimas.lrs.lt/portal/documentSearch/lt. We entered the period (2012-11-16 – 2020-11-10) and the type of document (“Stenograma”), and the search engine retrieved a total of 876 transcripts in MS Word (*.doc / *.docx) format. The list was then manually cross-checked against the list available on the main [Seimas website](www.lrs.lt/sip/portal.show?p_r=35727) and no additional transcripts were found. Thus, the corpus consists of 876 transcripts of the Seimas floor debates. These documents have a total of 244835 speeches (with 390179 segments in them) and 14780871 word units in aggregate.
+ Transcripts of the Seimas floor debates (in digital format) are freely available from the official website of the Seimas (www.lrs.lt). Data was automatically scraped from the official document search site of the Seimas: https://e-seimas.lrs.lt/portal/documentSearch/lt. We entered the period (1992-11-25 – 2022-12-23) and the type of document (“Stenograma”), and the search engine retrieved a total of 3823 transcripts in MS Word (*.doc / *.docx) format. The list was then manually cross-checked against the list available on the main [Seimas website](www.lrs.lt/sip/portal.show?p_r=35727) and inconsistencies were corrected. Eventually, the corpus consists of 3822 transcripts of the Seimas floor debates. These documents have a total of ??? speeches (with ??? segments in them) and ??? word units in aggregate.
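For readers who want to reproduce the acquisition step, a minimal sketch of such a scraper is given below. The e-seimas portal has no documented API, so the query parameter names (`dateFrom`, `dateTo`, `docType`) and the handling of the result list are illustrative assumptions, not the portal's actual interface.

```python
# Hypothetical sketch of the scraping step. The e-seimas portal has no
# documented API; the parameter names and result handling below are
# illustrative assumptions only.
import pathlib
import requests

SEARCH_URL = "https://e-seimas.lrs.lt/portal/documentSearch/lt"
OUT_DIR = pathlib.Path("transcripts")
OUT_DIR.mkdir(exist_ok=True)

# Assumed search parameters: the period covered by the corpus and the
# document type ("Stenograma" = transcript).
params = {
    "dateFrom": "1992-11-25",
    "dateTo": "2022-12-23",
    "docType": "Stenograma",
}

response = requests.get(SEARCH_URL, params=params, timeout=30)
response.raise_for_status()

# In practice the hit list (3823 documents) is paged HTML that has to be
# parsed, e.g. with BeautifulSoup, to extract the *.doc / *.docx links,
# which are then downloaded one by one into OUT_DIR.
```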

- Timespan of the corpus - the last two terms of the Seimas: 2012-11-16 - 2016-11-10 and 2016-11-14 - 2020-11-10.
+ Timespan of the corpus - seven full terms of the Seimas, starting from the second term after the restoration of independence on 1990-03-11: 1992-11-25 - 1996-11-19; 1996-11-25 - 2000-10-18; 2000-10-19 - 2004-11-11; 2004-11-15 - 2008-11-14; 2008-11-17 - 2012-11-14; 2012-11-16 - 2016-11-10; 2016-11-14 - 2020-11-10. In addition, transcripts from the latest term of the Seimas are included, starting from 2020-11-13 and ending on 2022-12-23. The first term was not included, as the texts of its transcripts are badly structured and need a lot of manual correction. Inclusion of this term is envisioned in the near future.

The retrieved files had to be converted into plain-text files so that they could be processed with text-analytic tools. It should be noted that the entire data set is in Lithuanian; therefore, it was essential to preserve UTF-8 encoding for further processing. This was something of a challenge, as the downloaded files came in different formats and encodings, so the data had to be unified before it could be processed automatically. Two converters were used: MultiDoc Converter (www.multidoc-converter.com/en/index.html) and EmEditor (www.emeditor.com).
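The corpus itself was converted with the two GUI tools named above; as a rough programmatic equivalent, the sketch below extracts the text from already-modernised `.docx` files and writes it out as UTF-8 plain text. The `python-docx` dependency and the `transcripts/` directory layout are assumptions for illustration.

```python
# Rough programmatic equivalent of the conversion step (the corpus itself
# was converted with MultiDoc Converter and EmEditor). Assumes the
# `python-docx` package and that legacy *.doc files have already been
# resaved as *.docx.
import pathlib

from docx import Document  # pip install python-docx

for path in pathlib.Path("transcripts").glob("*.docx"):
    doc = Document(str(path))
    text = "\n".join(paragraph.text for paragraph in doc.paragraphs)
    # An explicit UTF-8 encoding keeps the Lithuanian diacritics intact.
    path.with_suffix(".txt").write_text(text, encoding="utf-8")
```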

@@ -34,4 +34,4 @@ Mostly, structural elements/attributes included into the ParlaMint Schema were used

### Linguistic annotation

- The processing was carried out by means of a Python script combining an XML parser module within the Spacy package (https://spacy.io). The annotation pipeline includes tokenization, sentence segmentation, lemmatization, UD part-of-speech and morphological tagging, UD dependency parsing and named entity recognition.
+ The processing was carried out by means of a Python script combining an XML parser module with the spaCy package (https://spacy.io). The annotation pipeline includes tokenization, sentence segmentation, lemmatization, UD part-of-speech and morphological tagging, UD dependency parsing and named entity recognition.
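A minimal sketch of such an annotation pass is shown below; the README does not name the exact spaCy model, so the Lithuanian `lt_core_news_lg` package is an assumption.

```python
# Minimal sketch of the annotation pipeline. The exact model is not named
# in the README; spaCy's Lithuanian `lt_core_news_lg` is an assumption.
# Setup: pip install spacy && python -m spacy download lt_core_news_lg
import spacy

nlp = spacy.load("lt_core_news_lg")  # tagger, parser (sentences), NER
doc = nlp("Gerbiamieji kolegos, pradedame Seimo rytinį posėdį.")

for sent in doc.sents:  # sentence segmentation
    for token in sent:
        # form, lemma, UD part of speech, morphology, dependency relation
        print(token.text, token.lemma_, token.pos_, token.morph, token.dep_,
              sep="\t")

for ent in doc.ents:  # named entities
    print(ent.text, ent.label_)
```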