This system performs word sense induction from text. It is an implementation of the JoBimText approach in Scala and Spark, tuned for the induction of word senses (hence the "S" instead of "B" in the name, which is also a nod to the initial developer of the project, Johannes Simon). The original JoBimText implementation is written in Java and Pig and is more generic, as it assumes that "Jo"s (i.e. objects) and "Bim"s (i.e. features) can be any linguistic objects. This implementation is designed for modeling words and multiword expressions.
The system consists of several modules:
- Term feature extraction
- Term similarity (this repository): construction of a distributional thesaurus from word-feature frequencies.
- Word sense induction
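
To make the term similarity step more concrete, here is a minimal sketch of the general idea behind a distributional thesaurus: two words are similar when they share many salient features. It is a simplification and not the tool's implementation. The tab-separated input format, the helper name `shared_features`, and the plain frequency threshold used in place of a significance measure are all assumptions made for illustration.

```shell
# Hypothetical input: tab-separated "word<TAB>feature<TAB>frequency" lines,
# with each word-feature pair appearing once.
# Simplification: a feature is kept if its frequency is >= min_freq;
# the similarity of two words is the number of kept features they share.
shared_features() {
  min_freq="${1:-1}"
  awk -F'\t' -v min="$min_freq" '
    $3 >= min { words[$2] = words[$2] " " $1 }   # group words by feature
    END {
      for (feat in words) {
        n = split(words[feat], w, " ")
        for (i = 1; i <= n; i++)
          for (j = 1; j <= n; j++)
            if (i != j) sim[w[i] "\t" w[j]]++    # one more shared feature
      }
      for (pair in sim) print pair "\t" sim[pair]
    }' | sort
}

# Toy data: "cat" and "dog" share the features "eats" and "sleeps".
printf 'cat\tpurrs\t3\ncat\teats\t5\ncat\tsleeps\t2\ndog\teats\t4\ndog\tsleeps\t6\ndog\tbarks\t2\n' |
  shared_features 2
# prints:
# cat	dog	2
# dog	cat	2
```

Grouping words by feature first means only words that actually co-occur with a common feature are ever compared, which is the reason this kind of computation distributes well over a cluster.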
To build and run the tool, you need:
- git
- Java 1.8+
- Apache Spark 2.2+
- Get the source code:
git clone https://github.com/uhh-lt/josimtext.git
cd josimtext
- Build the tool:
make
- Set the environment variable SPARK_HOME to the directory with the Spark installation.
- To see the list of available commands:
./run
- To see the arguments of a particular command, e.g.:
./run WordSimFromTermContext --help
- By default, the tool runs locally. To change the Spark and Hadoop parameters of a job (queue, number of executors, memory per job, and so on), modify the conf/env.sh file. A sample file for running jobs on a CDH YARN cluster is provided in conf/cdh.sh.
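
Before the first `./run`, it can help to confirm that SPARK_HOME points at a real Spark installation. The sketch below is one way to do that. The default path `/opt/spark` and the helper name `check_spark_home` are assumptions for illustration and are not part of the tool; the only fact relied on is that a Spark distribution ships `bin/spark-submit`.

```shell
# Hypothetical Spark location; replace with the directory of your installation.
export SPARK_HOME="${SPARK_HOME:-/opt/spark}"

# Report whether the given directory looks like a Spark installation
# (spark-submit lives in its bin/ subdirectory).
check_spark_home() {
  if [ -x "$1/bin/spark-submit" ]; then
    echo "ok: found $1/bin/spark-submit"
  else
    echo "warning: no spark-submit under $1" >&2
    return 1
  fi
}

check_spark_home "$SPARK_HOME" || echo "Set SPARK_HOME before calling ./run"
```

Adding the `export` line to your shell profile keeps the setting across sessions.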