We want to tag the Portuguese Wikipedia using PyPLN and Palavras (we already have a Palavras license). The goals of this project are:
- Release a part-of-speech tagged Portuguese Wikipedia Corpus under a Creative Commons license.
- Train a part-of-speech tagger with NLTK and release it under a free/libre software license.
- We're going to use all Portuguese Wikipedia articles (pages).
- We'll probably use the Palavras tagset, but we can later translate it to NLTK's tagset.
- We won't use an incremental tagger (the entire corpus will be loaded into memory to train an NLTK tagger; see the training sketch after this list).
- Split the corpus (and the tagger) by Wikipedia Portal, so we'll have a tagged corpus per subject.
- Compare the taggers (Palavras versus the tagger we train with NLTK).
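
A minimal sketch of the tagset translation and in-memory training steps, assuming the Palavras output has already been converted to a list of tagged sentences (each a list of `(token, tag)` tuples). The `PALAVRAS_TO_SIMPLE` mapping is an illustrative placeholder covering a few word classes, not the real Palavras-to-NLTK conversion.

```python
import pickle

import nltk

# Illustrative partial mapping only; the real Palavras tagset is much larger.
PALAVRAS_TO_SIMPLE = {
    "N": "NOUN",
    "V": "VERB",
    "ADJ": "ADJ",
    "ADV": "ADV",
    "PRP": "ADP",
}


def translate_tags(tagged_sentences, mapping=PALAVRAS_TO_SIMPLE):
    """Map Palavras tags to a simpler tagset, keeping unmapped tags as-is."""
    return [[(token, mapping.get(tag, tag)) for token, tag in sentence]
            for sentence in tagged_sentences]


def train_tagger(tagged_sentences):
    """Train a default/unigram/bigram backoff chain on the in-memory corpus."""
    split = int(len(tagged_sentences) * 0.9)
    train, test = tagged_sentences[:split], tagged_sentences[split:]
    default = nltk.DefaultTagger("NOUN")                 # last-resort guess
    unigram = nltk.UnigramTagger(train, backoff=default)
    bigram = nltk.BigramTagger(train, backoff=unigram)
    print("held-out accuracy:", bigram.evaluate(test))   # quick sanity check
    return bigram


def save_tagger(tagger, path="ptwiki-tagger.pickle"):
    """Pickle the trained tagger so it can be released alongside the corpus."""
    with open(path, "wb") as fp:
        pickle.dump(tagger, fp)
```

The same `train_tagger` call could be run once per Portal to get one tagger per subject, and the held-out accuracy gives a first number for comparing our tagger against Palavras.
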
Useful links (a sketch for streaming the dump follows the list):

- http://dumps.wikimedia.org/backup-index.html
- http://dumps.wikimedia.org/ptwiki/20130306/
- http://stackoverflow.com/questions/1625162/get-text-content-from-mediawiki-page-via-api
- https://code.google.com/p/wikiteam/
- https://meta.wikimedia.org/wiki/Data_dumps/Download_tools
- https://meta.wikimedia.org/wiki/Data_dumps/Dump_format
- https://meta.wikimedia.org/wiki/Data_dumps/Other_tools
- https://meta.wikimedia.org/wiki/Data_dumps/Tools_for_importing
- https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download
- https://meta.wikimedia.org/wiki/Data_dumps
- https://wikitech.wikimedia.org/wiki/Dumps#Code
- https://wikitech.wikimedia.org/wiki/Dumps#Worker_nodes
- https://www.mediawiki.org/wiki/API:Allpages
- https://www.mediawiki.org/wiki/API:Parsing_wikitext
- https://www.mediawiki.org/wiki/API:Query#Using_list.3Dallpages_as_generator
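
As a starting point for the dump route linked above, here is a minimal sketch that streams article titles and wikitext out of a local pages-articles dump. The filename is an assumption based on the 20130306 dump listed above, and tags are matched by local name because the export schema version embedded in the XML namespace varies between dumps.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree adds to qualified tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(dump_path):
    """Yield (title, wikitext) pairs from a pages-articles dump, one page at a time."""
    with bz2.BZ2File(dump_path) as fp:
        title, text = None, None
        for _, elem in ET.iterparse(fp, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text
            elif name == "page":
                if title is not None and text is not None:
                    yield title, text
                title, text = None, None
                elem.clear()  # free the finished <page> subtree; dumps are huge


if __name__ == "__main__":
    # Path is an assumption; point it at wherever the dump was downloaded.
    for title, wikitext in iter_articles("ptwiki-20130306-pages-articles.xml.bz2"):
        print(title)
```

For pulling only a subset of pages (for example, one Portal), the MediaWiki API pages linked above (Allpages and Parsing wikitext) describe the alternative to processing the full dump.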