The corpus has been adapted from the Catalan portion of WikiCorpus v. 1.0, as follows:
- The corpus contains only a selection (fewer than 1.2 million words) of the original set.
- The corpus contains only tokens and parts of speech, not lemmas and word senses.
- The part-of-speech tags have been simplified from the original, resulting in 29 tags.
- The format has been changed to the word/TAG format, with each sentence on a separate line.
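
As an illustration of the word/TAG format described above, the following is a minimal Python sketch of how one line might be split into (word, tag) pairs. The function names parse_line and read_corpus, the UTF-8 encoding, and the tags in the sample line are assumptions made for this example; they are not taken from the corpus or its documentation. The tag is split off at the last slash, on the assumption that a word may itself contain a slash.

```python
from typing import Iterator, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def parse_line(line: str) -> TaggedSentence:
    """Split one sentence line of word/TAG items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split at the LAST slash so that a word such as "1/2" stays whole.
        word, sep, tag = item.rpartition("/")
        if not sep:
            # No slash at all: keep the token and leave the tag empty.
            word, tag = item, ""
        pairs.append((word, tag))
    return pairs


def read_corpus(path: str, encoding: str = "utf-8") -> Iterator[TaggedSentence]:
    """Yield one tagged sentence per non-empty line (encoding is assumed)."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


# Hypothetical sample line; these tags are placeholders, not the real tagset.
sample = "El/DET gat/NOUN dorm/VERB ./PUNCT"
print(parse_line(sample))
```

Because each sentence occupies its own line, read_corpus can stream the file one sentence at a time instead of loading it whole.
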
The corpus is licensed under the same terms as the original, that is, the GNU Free Documentation License (FDL; http://www.fsf.org/licensing/licenses/fdl.html). This means that you may use and redistribute the texts, provided that any derived works keep the same license.