dax-westerman/tgist-taxonomy

 
 


TGist Taxonomy Creator

Simplistic and unfinished code that creates a taxonomy from a small corpus, where the corpus is represented by a list of terms and feature vectors and roles associated with those terms.

Dependencies:

  • The object exploration code by Dimitris Andreou (https://github.com/DimitrisAndreou/memory-measurer). Make sure that the following jars are in your classpath: dist/objectexplorer.jar, lib/guava-r09.jar and lib/jsr305.jar. This code is used for debugging only, so strictly speaking you do not need it when you check out the master branch of this repository.

Data required

Creating a taxonomy requires a list of terms with technology scores, feature vectors for those terms and a list of terms with their roles. These three should be together in one directory and are expected to have the following names:

| filename | description |
| --- | --- |
| classify.MaxEnt.out.s4.scores.sum.az | list of terms |
| features.txt.gz | feature vectors for all terms |
| NB.IG50.test1.woc.9999.results.classes | list of terms with their roles |
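
Before creating a taxonomy it can help to confirm that the data directory holds all three files. The following is a minimal sketch and not part of this repository; the function name check_data_dir is hypothetical. It only tests that each file exists, since the roles file is allowed to be empty.

```shell
# Hypothetical helper, not part of TGistTaxonomy: report which of the three
# required input files are missing from a data directory.
check_data_dir() {
    dir="$1"
    missing=0
    for f in \
        classify.MaxEnt.out.s4.scores.sum.az \
        features.txt.gz \
        NB.IG50.test1.woc.9999.results.classes
    do
        # Existence only: the roles file may legitimately be empty.
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Returns 0 when all files are present, 1 otherwise.
    return "$missing"
}
```

Calling check_data_dir with the path of the data directory prints one line per missing file to standard error and returns a non-zero status if anything is missing.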

The first of those files is created by running the feature extraction and technology classifier over a corpus (see https://github.com/techknowledgist/tgist-features and https://github.com/techknowledgist/tgist-classifiers). The second can be created from the feature vectors created by tgist-features using the extract_features.py script. The third is created with the domain roles code in https://github.com/techknowledgist/act. This third file is allowed to be empty.

Creating a taxonomy

The one-command way to create a new taxonomy from a directory with the needed input data is:

> java -jar dist/TGistTaxonomy.jar --create <TaxonomyLocation> <DataLocation>

Here, <DataLocation> is the directory with the three required files mentioned above and <TaxonomyLocation> is the location of the new taxonomy. If that path already exists, the program exits with a warning.

Using the --create option is a shorthand for four separate commands:

> java -jar dist/TGistTaxonomy.jar --init <TaxonomyLocation> <DataLocation>
> java -jar dist/TGistTaxonomy.jar --import <TaxonomyLocation>
> java -jar dist/TGistTaxonomy.jar --build-hierarchy <TaxonomyLocation>
> java -jar dist/TGistTaxonomy.jar --add-relations <TaxonomyLocation>
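
Running the steps separately is useful when one stage needs to be rerun or inspected. A minimal sketch of a wrapper that runs the four commands in order and stops at the first failure is given here. The function name create_taxonomy_stepwise and the TGIST_CMD variable are hypothetical and only serve this example. It also assumes that the jar returns a non-zero exit status on failure, which this README does not state.

```shell
# Hypothetical wrapper, not part of this repository.
# TGIST_CMD is an assumed override so the command can be swapped out.
TGIST_CMD="${TGIST_CMD:-java -jar dist/TGistTaxonomy.jar}"

create_taxonomy_stepwise() {
    taxonomy="$1"
    data="$2"
    # $TGIST_CMD is left unquoted on purpose so it splits into command and arguments.
    # Each step runs only if the previous one exited with status 0
    # (assumption: the jar signals failure through its exit status).
    $TGIST_CMD --init "$taxonomy" "$data" &&
    $TGIST_CMD --import "$taxonomy" &&
    $TGIST_CMD --build-hierarchy "$taxonomy" &&
    $TGIST_CMD --add-relations "$taxonomy"
}
```

Calling create_taxonomy_stepwise with a taxonomy location and a data location issues the same four commands as listed above.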

With --init the taxonomy is initialized, which boils down to creating a directory that contains a single file named properties.txt. This file stores a short name for the taxonomy (the base name of the path where the taxonomy is created) and the location of the input data directory. With --import the data in the input directory are imported into the taxonomy directory. Only terms that meet a minimum technology score and a minimum frequency are added, and only the feature vectors and roles for those terms are added (which reduces the size of the data significantly). Finally, with --build-hierarchy and --add-relations the taxonomy's hierarchy is built and relations between terms are added.
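
The filtering idea behind --import can be illustrated with a small sketch. It is not the actual implementation: the input layout (tab-separated term, technology score, frequency), the function name filter_terms, and the threshold values are all assumptions made for the example, since this README specifies neither the file format nor the thresholds used.

```shell
# Illustrative only. Assumes a HYPOTHETICAL tab-separated input with three
# columns: term, technology score, frequency. The real file format and the
# thresholds used by --import are not documented here.
filter_terms() {
    min_score="$1"
    min_freq="$2"
    # Print a term only if both its score and its frequency reach the minimum.
    awk -F '\t' -v s="$min_score" -v f="$min_freq" \
        '$2 + 0 >= s + 0 && $3 + 0 >= f + 0 { print $1 }'
}
```

The function reads from standard input, so it can be used as `filter_terms 0.5 5 < terms.tsv`, where terms.tsv is a made-up file in the assumed layout. Feature vectors and roles would then be kept only for the terms that survive this filter.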

During the above processing the following files are created inside the taxonomy:

| option | files created |
| --- | --- |
| --init | properties.txt |
| --import | technologies.txt, features.txt, act.txt |
| --build-hierarchy | hierarchy.txt |
| --add-relations | relations-cooc.txt, relations-term.txt |
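
The mapping from options to files suggests a simple way to see how far a taxonomy has been processed. This sketch assumes that the presence of a stage's files means the stage has run, which is an inference from the list above and not a documented guarantee. The function names report_stage and taxonomy_status are hypothetical.

```shell
# Hypothetical helpers, not part of this repository.
# report_stage OPTION DIR FILE...: print whether all files for a stage exist.
report_stage() {
    option="$1"
    dir="$2"
    shift 2
    for f in "$@"; do
        if [ ! -e "$dir/$f" ]; then
            printf '%s: pending (no %s)\n' "$option" "$f"
            return 0
        fi
    done
    printf '%s: done\n' "$option"
}

# Print one status line per processing stage for the taxonomy in $1.
taxonomy_status() {
    report_stage --init "$1" properties.txt
    report_stage --import "$1" technologies.txt features.txt act.txt
    report_stage --build-hierarchy "$1" hierarchy.txt
    report_stage --add-relations "$1" relations-cooc.txt relations-term.txt
}
```

Running taxonomy_status on a taxonomy directory prints one line per stage, naming the first missing file for any stage that looks incomplete.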

Browsing a taxonomy

To browse a taxonomy do the following:

> java -jar dist/TGistTaxonomy.jar --browse <TaxonomyLocation>

You will get a splash screen and some limited functionality for navigating the taxonomy. Enter q followed by a return to exit the browser.

TODO: document this better once some minimal improvements are made.
