The code is ready to be modified to improve the page classifier's accuracy and precision.
Unzip labeled data
unzip dataset.zip
Generate page features
lein generate-features features.tsv
Cross-validate classifier
lein cross-validate features.tsv
Train classifier
lein train features.tsv
Evaluate classifier (optional; use data that was not used for training)
lein evaluate test.tsv
Test classifier with real page
lein classify "http://..."
Add page to dataset (optional)
lein add-page "L" "http://..."
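The required steps above can be chained in a small shell script. This is only a sketch: the `run` and `run_pipeline` functions and the `DRY_RUN` variable are hypothetical helpers introduced here, not part of the project. The `lein` tasks and file names are the ones listed above.

```shell
#!/bin/sh
# Sketch: run the documented steps in order, stopping at the first failure.
# run, run_pipeline and DRY_RUN are hypothetical helpers, not part of the project.

# Print the command, then execute it unless DRY_RUN=1.
run() {
    printf '%s\n' "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Unzip the data, generate features, cross-validate and train.
# The optional first argument is the features file (default: features.tsv).
run_pipeline() {
    features="${1:-features.tsv}"
    run unzip dataset.zip &&
    run lein generate-features "$features" &&
    run lein cross-validate "$features" &&
    run lein train "$features"
}
```

After sourcing the script, `DRY_RUN=1 run_pipeline` prints the commands without running them, and `run_pipeline` executes them. The optional evaluate, classify and add-page steps are left out because they need a test file or a URL chosen by the user.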
- Feature generation: Enlive, Boilerpipe
- Machine learning: Weka
- Use different classifier:
src/pageclass/train.clj
Available ones extend AbstractClassifier.
- Generate meaningful features:
src/pageclass/features.clj
- A - Article
- D - Discussion, Forum
- F - Form
- H - Home page
- L - Listing
- I - Single item or product page in an e-shop
- M - Media
- Z - Contacts page
- X - Unknown
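
When scripting calls such as `lein add-page`, it can help to validate the one-letter label first. The following is a minimal sketch that only restates the list above; `label_name` is a hypothetical helper, not part of the project.

```shell
# Map a one-letter page class code to its description.
# Prints the description, or returns 1 for an unrecognised code.
label_name() {
    case "$1" in
        A) echo "Article" ;;
        D) echo "Discussion, Forum" ;;
        F) echo "Form" ;;
        H) echo "Home page" ;;
        L) echo "Listing" ;;
        I) echo "Single item or product page" ;;
        M) echo "Media" ;;
        Z) echo "Contacts page" ;;
        X) echo "Unknown" ;;
        *) echo "invalid label: $1" >&2; return 1 ;;
    esac
}
```

For example, `label_name L` prints `Listing`, while `label_name Q` writes an error to stderr and returns a non-zero status.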