This repository has been archived by the owner on Jul 30, 2022. It is now read-only.

Scratching my own itch here. I needed some basic natural language processing tools for simple projects. I looked at NLTK and NaturalNode, but each of them targets a single programming language. Sure, you can work around that or wrap it, but that doesn't give me the fun of working with Haxe and picking my own layout pattern.

I have no formal training in linguistics, but have contributed to multiple open source projects. Kinda making things up as I need them. Need something? Contribute or open a ticket!

Usage

Installing

You can git clone the repository directly, or install the latest version from haxelib.

Using haxelib

haxelib install haxe-linguistics

Using haxelib git

haxelib git haxe-linguistics https://github.com/sexybiggetje/haxe-linguistics/

Running your application

Example applications have been included in the examples folder.

haxe -main my.namespace.Application -cp src --interp

Supported languages

English (the main language), Dutch and German have been supported as first class citizens from the beginning. Basic support for the Frisian language has been added as a second class citizen. Want to contribute? Take a peek at the Dutch (nl) implementation and send a pull request.

Tokenizing

Basic tokenizers are present for all currently supported languages.

Linguistics.getInstance().setLanguage(Dutch);
var tokenizer:ITokenizer = Linguistics.getInstance().getBasicTokenizer();
trace(tokenizer.tokenize("Nederlanders drinken 's morgens gemiddeld 2 koppen koffie."));
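
To give a feel for what a basic tokenizer does, here is a minimal stand-alone sketch. It is not the library's BasicTokenizer: the class name TokenizeSketch and the regular expression are assumptions made for illustration, it returns plain strings instead of IToken objects, and it only knows ASCII letters.

```haxe
// Hypothetical stand-alone sketch, not the library's BasicTokenizer.
class TokenizeSketch {
    // Splits text into word/number tokens and single punctuation tokens.
    public static function tokenize(text:String):Array<String> {
        // Either a run of ASCII letters, digits and apostrophes,
        // or any single character that is neither whitespace nor part of a word.
        var pattern = ~/[A-Za-z0-9']+|[^\sA-Za-z0-9']/;
        var tokens:Array<String> = [];
        var rest = text;
        // Take the leftmost match, then continue on the text to its right.
        while (pattern.match(rest)) {
            tokens.push(pattern.matched(0));
            rest = pattern.matchedRight();
        }
        return tokens;
    }

    static function main() {
        trace(tokenize("Nederlanders drinken 's morgens gemiddeld 2 koppen koffie."));
    }
}
```

Saved as TokenizeSketch.hx, it can be run with haxe --main TokenizeSketch --interp. The apostrophe is treated as part of a word so that "'s" stays a single token.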

Removing a set of tokens using a token filter

Linguistics.getInstance().setLanguage(Dutch);
var tokenizer:ITokenizer = Linguistics.getInstance().getBasicTokenizer();
var tokenSet:Array<IToken> = tokenizer.tokenize("Nederlanders drinken 's morgens gemiddeld 2 koppen koffie.");
trace( tokenizer.applyFilter( tokenSet, StopwordTokenFilter ) );
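
Conceptually, a stopword filter drops every token that appears in a list of very common words. The sketch below shows that idea on plain strings. The class name StopwordSketch and the tiny Dutch word list are made up for illustration and are not the list shipped with StopwordTokenFilter.

```haxe
// Hypothetical stand-alone sketch of stopword filtering on plain strings.
class StopwordSketch {
    // Tiny illustrative Dutch stopword list (an assumption, not the library's list).
    static var stopwords:Array<String> = ["de", "het", "een", "en"];

    // Returns only the tokens that are not in the stopword list (case-insensitive).
    public static function removeStopwords(tokens:Array<String>):Array<String> {
        return tokens.filter(function(token) {
            return stopwords.indexOf(token.toLowerCase()) == -1;
        });
    }

    static function main() {
        // "De", "en" and "de" are dropped; "kat" and "hond" remain.
        trace(removeStopwords(["De", "kat", "en", "de", "hond"]));
    }
}
```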

Dictionary

A dictionary indexes tokenized words and keeps track of how often each word occurs. By default it uses the raw token, but you can tell it to use the normalized token instead. If no tokenizer is specified, it defaults to the BasicTokenizer for your given language.

var dict:Dictionary = new Dictionary();
dict.addDocument("To be, or not to be: that is the question.");
trace( dict.getDictionaryWords() );

Or add tokens directly (for example after filtering them):

Linguistics.getInstance().setLanguage(Dutch);

var dict:Dictionary = new Dictionary();
var tokenizer:ITokenizer = Linguistics.getInstance().getBasicTokenizer();
var tokenSet:Array<IToken> = tokenizer.tokenize("Nederlanders drinken 's morgens gemiddeld 2 koppen koffie.");

dict.addTokens( tokenizer.applyFilter( tokenSet, StopwordTokenFilter ) );

trace( dict.getDictionaryWords() );
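
The word counting a dictionary performs boils down to a map from word to count. Here is a rough sketch of that idea on plain strings. WordCountSketch and countWords are hypothetical names, and the real Dictionary class may store its data differently.

```haxe
// Hypothetical stand-alone sketch of word counting, not the library's Dictionary.
class WordCountSketch {
    // Builds a word -> occurrence count map from a list of tokens.
    public static function countWords(tokens:Array<String>):Map<String, Int> {
        var counts = new Map<String, Int>();
        for (token in tokens) {
            // Start at 0 for a word that has not been seen yet.
            var current:Int = counts.exists(token) ? counts.get(token) : 0;
            counts.set(token, current + 1);
        }
        return counts;
    }

    static function main() {
        var counts = countWords(["to", "be", "or", "not", "to", "be"]);
        trace(counts.get("to")); // 2
        trace(counts.get("or")); // 1
    }
}
```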

String distance

Using Levenshtein Distance calculation:

trace( LevenshteinDistance.getDistance( "kitten", "sitting" ) );
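
For readers who want to see how such a distance is computed, below is a minimal sketch of the classic dynamic programming approach that keeps only two rows of the table. It is an independent illustration with a hypothetical class name (LevenshteinSketch), not the source of LevenshteinDistance.getDistance.

```haxe
// Hypothetical stand-alone sketch of the Levenshtein distance.
class LevenshteinSketch {
    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b.
    public static function distance(a:String, b:String):Int {
        // previous[j] = distance between the first i-1 chars of a and the first j chars of b.
        var previous:Array<Int> = [for (j in 0...(b.length + 1)) j];
        for (i in 1...(a.length + 1)) {
            // Turning the first i chars of a into an empty string takes i deletions.
            var current:Array<Int> = [i];
            for (j in 1...(b.length + 1)) {
                var cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                var deletion = previous[j] + 1;
                var insertion = current[j - 1] + 1;
                var substitution = previous[j - 1] + cost;
                var best = deletion < insertion ? deletion : insertion;
                current.push(best < substitution ? best : substitution);
            }
            previous = current;
        }
        return previous[b.length];
    }

    static function main() {
        trace(distance("kitten", "sitting")); // 3
    }
}
```

The three edits for this pair are k to s, e to i, and an added g at the end.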

Classification

Currently there is an implementation of a Naive Bayes classifier. The classifier uses normalized tokens, and if no tokenizer is specified when calling the train method, it defaults to the BasicTokenizer specified for your language. Since there is no stemming support yet in this library, tokens are unstemmed and unfiltered.

Example is shamelessly copied from NaturalNode documentation.

var classifier:IClassifier = new NaiveBayesClassifier();
classifier.addDocument( "i am the long qqqq", "buy" );
classifier.addDocument( "buy the q's", "buy" );
classifier.addDocument( "short gold", "sell" );
classifier.addDocument( "sell gold", "sell" );

classifier.train();

trace(classifier.classify( "i am short silver" ));
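
To show the idea behind this kind of classifier, here is a small stand-alone sketch of a multinomial Naive Bayes with add-one (Laplace) smoothing. Everything in it is an assumption made for illustration: the class name NaiveBayesSketch, lowercasing plus splitting on spaces as the tokenizer, and the smoothing choice. The library's NaiveBayesClassifier may work differently, and this sketch needs no separate train step because it counts words as documents are added.

```haxe
// Hypothetical stand-alone sketch, not the library's NaiveBayesClassifier.
class NaiveBayesSketch {
    var wordCounts = new Map<String, Map<String, Int>>(); // label -> word -> count
    var totalWords = new Map<String, Int>();              // label -> number of tokens
    var docCounts = new Map<String, Int>();               // label -> number of documents
    var vocabulary = new Map<String, Bool>();             // every word seen in training
    var vocabSize:Int = 0;
    var totalDocs:Int = 0;

    public function new() {}

    // Counts the words of one labelled training document.
    public function addDocument(text:String, label:String):Void {
        if (!wordCounts.exists(label)) {
            wordCounts.set(label, new Map<String, Int>());
            totalWords.set(label, 0);
            docCounts.set(label, 0);
        }
        docCounts.set(label, docCounts.get(label) + 1);
        totalDocs++;
        var counts = wordCounts.get(label);
        for (word in text.toLowerCase().split(" ")) {
            counts.set(word, (counts.exists(word) ? counts.get(word) : 0) + 1);
            totalWords.set(label, totalWords.get(label) + 1);
            if (!vocabulary.exists(word)) {
                vocabulary.set(word, true);
                vocabSize++;
            }
        }
    }

    // Returns the label with the highest log prior plus summed log likelihoods.
    public function classify(text:String):String {
        var bestLabel:String = null;
        var bestScore = Math.NEGATIVE_INFINITY;
        for (label in wordCounts.keys()) {
            var counts = wordCounts.get(label);
            var score = Math.log(docCounts.get(label) / totalDocs);
            for (word in text.toLowerCase().split(" ")) {
                var count = counts.exists(word) ? counts.get(word) : 0;
                // Add-one smoothing keeps unseen words from zeroing the probability.
                score += Math.log((count + 1) / (totalWords.get(label) + vocabSize));
            }
            if (score > bestScore) {
                bestScore = score;
                bestLabel = label;
            }
        }
        return bestLabel;
    }

    static function main() {
        var classifier = new NaiveBayesSketch();
        classifier.addDocument("i am the long qqqq", "buy");
        classifier.addDocument("buy the q's", "buy");
        classifier.addDocument("short gold", "sell");
        classifier.addDocument("sell gold", "sell");
        trace(classifier.classify("i am short silver")); // sell
    }
}
```

With this training data the "sell" class wins because it has far fewer tokens, so the single match on "short" outweighs the matches on "i" and "am" in the larger "buy" class.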

Stemming

Stemming is currently being implemented. Stemmers follow a simple interface: you call IStemmer.stem( 'winning' ) and get the stem back. The basic stemmers will optionally be used during tokenization. The snippet below returns the Porter stemmer implementation for English.

Linguistics.getInstance().setLanguage( English );
var stemmer:IStemmer = Linguistics.getInstance().getBasicStemmer();

trace( stemmer.stem( "consigned" ) );
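
A full Porter stemmer applies several rule steps in sequence. As a taste of what such rules look like, the sketch below implements only step 1a of the Porter algorithm (plural endings). The class name PorterStep1aSketch is hypothetical, and this is not the library's stemmer. A word like "consigned" passes through unchanged here, since the -ed ending is handled by a later step.

```haxe
// Hypothetical stand-alone sketch of Porter step 1a only.
class PorterStep1aSketch {
    // Rules, tried in order: SSES -> SS, IES -> I, SS -> SS, S -> (removed).
    public static function step1a(word:String):String {
        if (StringTools.endsWith(word, "sses")) return word.substr(0, word.length - 2);
        if (StringTools.endsWith(word, "ies")) return word.substr(0, word.length - 2);
        if (StringTools.endsWith(word, "ss")) return word;
        if (StringTools.endsWith(word, "s")) return word.substr(0, word.length - 1);
        return word;
    }

    static function main() {
        for (word in ["caresses", "ponies", "caress", "cats"]) {
            trace(word + " -> " + step1a(word));
        }
    }
}
```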

Tests

Some tests might be missing or incomplete due to the early state of the project, but I try to keep them up to date. At the moment the tests only run against the hx source. I try to support every output format of Haxe, but haven't compiled the tests for everything just yet.

Running tests

The quickest way is to use haxe in interpreter mode:

haxe -cp src -main tests.TestCaseRunner --interp

Building the testrunner for all targets

Install the JDK for hxjava (see oracle.com). Install the Mono MDK for monocs (see mono-project.com/download).

Setting up dependencies:

haxelib install hxcpp
haxelib install hxjava
haxelib install hxcs

In the project root (the directory that contains src):

haxe build.hxml

Roadmap

  • Concordance analysis (in progress)
  • Support for n-grams
  • Language detection
  • Nested tokenization, allowing tokens to have a parent and children. This opens the door for a sentence or quotation token. (in progress)
  • Applying tags to tokens and being able to filter tokens based on a tag. This differs from classification, but a classification could be a tag. (in progress)
  • Stemming of languages. A Porter stemmer would be sufficient. (in progress)
  • Part-of-speech tagging. (This would require stemming and POS dictionaries.)
  • WordNet support, and alternatives to it for other languages