fastText4j

A Java port of the C++ version of Facebook Research's fastText.

This implementation supports prediction for supervised and unsupervised models, whether they are quantized or not. Please use the C++ version of fastText for training, testing and quantization.

Supported fastText version

fastText4j currently supports models produced by fastText version 1b (support for subwords in supervised models).

Implementation

This library offers two implementations of the fastText model:

  • A regular in-memory model, which is a simple port of the C++ version
  • A memory-mapped version of the model, which lowers RAM usage

This second implementation relies on memory-mapped IO for reading the dictionary and the input matrix.

Note: To use this second implementation, you first have to convert your fastText model to the memory-mapped model format.
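To make the idea of memory-mapped IO concrete, here is a minimal sketch of reading one row of a float matrix through a `MappedByteBuffer` instead of loading the whole file onto the heap. The class `MmapRowDemo`, its methods and the flat little-endian file layout are assumptions made for illustration; they are not fastText4j's classes or its actual mmap model format.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only: not fastText4j's classes or file format.
public class MmapRowDemo {

  // Writes a matrix of floats to a file, row by row, in little-endian order.
  static void writeMatrix(Path file, float[][] matrix) throws IOException {
    int dim = matrix[0].length;
    ByteBuffer buf = ByteBuffer.allocate(matrix.length * dim * Float.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    for (float[] row : matrix) {
      for (float v : row) {
        buf.putFloat(v);
      }
    }
    Files.write(file, buf.array());
  }

  // Maps the file read-only and reads a single row with absolute gets.
  // Only the pages that are actually touched need to be resident in RAM.
  static float[] readRow(Path file, int row, int dim) throws IOException {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      MappedByteBuffer mapped =
          channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
      mapped.order(ByteOrder.LITTLE_ENDIAN);
      float[] out = new float[dim];
      int offset = row * dim * Float.BYTES;
      for (int i = 0; i < dim; i++) {
        out[i] = mapped.getFloat(offset + i * Float.BYTES);
      }
      return out;
    }
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("matrix", ".bin");
    try {
      writeMatrix(file, new float[][] {{1f, 2f}, {3f, 4f}, {5f, 6f}});
      float[] row = readRow(file, 1, 2);
      System.out.println(row[0] + " " + row[1]);
    } finally {
      Files.deleteIfExists(file);
    }
  }
}
```

The trade-off is the usual one for memory mapping: heap usage stays low because the operating system pages data in on demand, at the cost of possible disk reads on first access to a row.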

Requirements

To build and use fastText4j, you will need:

  • Java 8 or above
  • Maven

Building fastText4j

This project uses Maven as its build tool. To build fastText4j, run:

$ mvn package

Memory-mapped model

Converting fastText model to memory-mapped model

You can convert both non-quantized and quantized fastText models to memory-mapped models. The conversion step takes the binary model file (.bin or .ftz) as input.

Use the following command to obtain a zip archive containing an executable jar with dependencies and a bash script to launch the jar:

$ mvn install -Papp

The zip archive will be built in the app folder. You can then use this distribution to run the mmap model conversion:

$ cd app
$ unzip fasttext4j-app.zip
$ ./fasttext-mmap.sh -input <fastText-model-path> -output <fasttext-mmap-model-path>

Using the memory-mapped model

Model loading

Loading a memory-mapped model with fastText4j is completely transparent. You just have to provide the path <fasttext-mmap-model-path> that you passed to the output parameter above.

Closing the model

When loading a memory-mapped model, fastText4j internally opens FileChannels that need to be closed once you are done with the model. To close a memory-mapped model properly, call the .close() method on your FastText object.
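The pattern is the usual one for any object that owns a `FileChannel`: close it in a `finally` block so the channel is released even if prediction throws. The sketch below uses a hypothetical stand-in class, `MappedModel`, rather than fastText4j's `FastText`; only the existence of a `.close()` method is taken from this README, and everything else is an assumption for illustration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CloseDemo {

  // Hypothetical stand-in for a memory-mapped model: it owns a FileChannel.
  static class MappedModel {
    private final FileChannel channel;

    MappedModel(Path path) throws IOException {
      this.channel = FileChannel.open(path, StandardOpenOption.READ);
    }

    boolean isOpen() {
      return channel.isOpen();
    }

    // Releases the underlying channel; calling it again has no effect.
    void close() throws IOException {
      channel.close();
    }
  }

  // Opens the model, uses it, and always closes it, even on failure.
  // The closed model is returned only so callers can inspect its state.
  static MappedModel useAndClose(Path path) throws IOException {
    MappedModel model = new MappedModel(path);
    try {
      // ... run predictions with the model here ...
      return model;
    } finally {
      model.close();
    }
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("model", ".mmap");
    try {
      MappedModel model = useAndClose(file);
      System.out.println("open after use: " + model.isOpen());
    } finally {
      Files.deleteIfExists(file);
    }
  }
}
```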

Multithreaded use

A memory-mapped FastText instance may only be used from a single thread, because it is not thread-safe (it keeps internal state such as the mapped file positions).

To allow multithreaded use, every FastText instance must be cloned before being used in another thread.
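The clone-per-thread rule can be sketched with plain `java.nio` buffers. In the example below, `MappedModel` and its `copy()` method are hypothetical stand-ins, not fastText4j's API: each copy shares the underlying bytes through `ByteBuffer.duplicate()` but has its own read position, which is the kind of per-instance state that makes sharing one instance across threads unsafe.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ClonePerThreadDemo {

  // Hypothetical stand-in for a memory-mapped model. Reads move the buffer's
  // position, so one instance must not be shared between threads.
  static class MappedModel {
    private final ByteBuffer buffer;

    MappedModel(ByteBuffer buffer) {
      this.buffer = buffer;
    }

    // Shares the underlying bytes but gets an independent position.
    MappedModel copy() {
      return new MappedModel(buffer.duplicate());
    }

    // Relative reads: this mutates the instance's position.
    int sumFrom(int start, int count) {
      buffer.position(start * Integer.BYTES);
      int sum = 0;
      for (int i = 0; i < count; i++) {
        sum += buffer.getInt();
      }
      return sum;
    }
  }

  // Gives every task its own copy of the model before it runs on a pool thread.
  static List<Integer> sumInParallel(MappedModel shared, int threads, int count)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Integer>> futures = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        MappedModel own = shared.copy(); // clone before handing off
        Callable<Integer> task = () -> own.sumFrom(0, count);
        futures.add(pool.submit(task));
      }
      List<Integer> results = new ArrayList<>();
      for (Future<Integer> f : futures) {
        results.add(f.get());
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    ByteBuffer data = ByteBuffer.allocate(4 * Integer.BYTES);
    for (int i = 1; i <= 4; i++) {
      data.putInt(i);
    }
    System.out.println(sumInParallel(new MappedModel(data), 3, 4));
  }
}
```

Copies made this way are cheap because no data is duplicated; only the bookkeeping (position, limit) is per instance.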

FastText references

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

(* These authors contributed equally.)