The working title of the project was "Knowledge Extraction from Classification Schemes" (KECS), which is why this acronym is still used in several places.
This tutorial is written for technical users who would like to use the prototype on their own. The help page of the tool gives you an overview of all command-line parameters.
$ java -jar kecs.jar --help
The classification schema can be imported from different sources. In all cases a folder is created that stores your feedback progress. Since bootstrapping could overwrite already established progress, you have to specify an output folder that does not exist yet.
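Because the output folder must not exist yet, a small guard in a wrapper script can catch a reused folder name before bootstrapping starts. The following is only a minimal sketch: the helper name require_new_folder is hypothetical and not part of kecs.jar.

```shell
#!/bin/sh
# Hypothetical helper (not part of kecs.jar): succeeds only if the
# given output folder does not exist yet.
require_new_folder() {
    if [ -e "$1" ]; then
        echo "Output folder '$1' already exists; choose a new one." >&2
        return 1
    fi
    return 0
}

# Usage sketch: bootstrap only when the folder name is still free.
# require_new_folder kecs && \
#     java -jar kecs.jar --mode BootstrapFilesystem --input /home/user/folder --output kecs
```

The check runs before the tool is started, so existing feedback progress in the folder is never touched by the wrapper itself.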
A pre-built JAR can be downloaded here: kecs.jar (ca. 80 MB).
Before using the tool on real data, you can try out a demo filesystem to learn how to use the application.
$ java -jar kecs.jar --mode Demo
The server runs on http://localhost:7572 (default user test and password test).
Before bootstrapping, you can define a simple ontology in JSON format, for example ontology.json.
This way, classes and properties are preloaded when using the tool.
By default, the ontology is loaded from ontology.json, but you can change the path with the --ontology argument.
$ java -jar kecs.jar --ontology another-ontology.json
If no ontology file is found, a default ontology is loaded.
This behavior can be disabled with the --no-default-ontology switch.
The input has to be a folder.
Use --limit to specify how many files should be crawled in the breadth-first traversal.
The default is 100,000, which is the size the prototype should handle well.
$ java -jar kecs.jar --mode BootstrapFilesystem --input /home/user/folder --limit 100000 --output kecs
To create a filesystem filename dump, use the Linux find command.
$ find "$(pwd -P)" -printf "%y %p\n" | gzip > dump.txt.gz
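To get a feeling for what the dump contains, the find command above can be tried on a small throwaway tree first. This is only a sketch: the folder and file names are made up, and it assumes GNU find (the -printf option is not available in every find implementation). In the format string, %y prints the file type (d for a directory, f for a regular file) and %p prints the path.

```shell
#!/bin/sh
# Build a tiny throwaway tree (names are illustrative only).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/projects/report"
touch "$demo_dir/projects/report/draft.txt"

# Write the dump outside the crawled folder so that the dump file
# itself cannot end up in the listing.
dump_file="$(mktemp -d)/dump.txt.gz"

# Same find invocation as above, run inside the demo tree.
(cd "$demo_dir" && find "$(pwd -P)" -printf "%y %p\n" | gzip > "$dump_file")

# Inspect the dump: every line has the form "<type> <absolute path>".
gzip -dc "$dump_file"
```

Each folder and file of the tree shows up as one line, which is the structure the BootstrapFilesystemDump mode reads.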
If the filename ends with gz, GZIP decompression is applied automatically.
Use --file-separator to specify the separator in the path.
$ java -jar kecs.jar --mode BootstrapFilesystemDump --input dump.txt.gz --file-separator / --output kecs
Specify the character encoding (e.g. --charset Windows-1252) when it differs from the default UTF-8.
Use the option --file-path-list if you have a list of file paths instead of the find output.
For a special use case the tool is also able to import an Excel file. We assume that the first row in a sheet contains column names. The following tree structure is extracted:
- sheet name
  - column name
    - distinct textual cell values
  - column name
$ java -jar kecs.jar --mode BootstrapExcel --input excel.xlsx --output kecs
You can whitelist columns (by letter) whose distinct textual cell values should be included. Use this to leave out columns with too much data.
--excel-whitelist 'sheetname1:A,B,C;sheetname2:Z,AB'
To access the graphical user interface, you have to load the created folder and start a local server.
$ java -jar kecs.jar --mode Load --output kecs --server
The server runs on http://localhost:7572 (default user test and password test).
The port can be changed with the --port argument.
The --browser option opens the website in your default browser.
The --language option sets the language: choose from 'semweb', 'en' or 'de' (default is 'semweb').
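The server options can be combined in a single Load call. The sketch below only composes and prints such a command; the port 8080 and the language en are example values, and it assumes that --port and --language take their value as the following argument, in the same way as --mode and --output in the commands above.

```shell
#!/bin/sh
# Assumed example values (not the defaults): port 8080, English UI.
port=8080
language=en

# Compose the Load command from the options described above.
set -- java -jar kecs.jar --mode Load --output kecs --server \
    --port "$port" --browser --language "$language"

# Print the command for review; to actually start the server, run: "$@"
echo "$@"
```

With these values, the user interface would be expected at http://localhost:8080 instead of the default port 7572.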
To configure who has access to the user interface, the file users.csv has to be filled in (a different file can be chosen with the --users argument).
If there is no such file, a default file is created with the following content:
username,password,first name,last name
test,test,Test,Test
For tests, you can log in with the default user test and password test.
In case you run this service for external project partners, change the user list as needed and restart the program. Distribute credentials to the corresponding people to give them access.
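A user list for project partners can be prepared with a few lines of shell. This is a sketch only: the two accounts are hypothetical placeholders, and it assumes that the file follows the same four-column layout as the default file shown above, without quoting or commas inside values.

```shell
#!/bin/sh
# Write a users.csv with the header of the default file and two
# hypothetical partner accounts (replace names and passwords).
users_file=users.csv
cat > "$users_file" <<'EOF'
username,password,first name,last name
alice,change-me-1,Alice,Example
bob,change-me-2,Bob,Example
EOF

# Sanity check: every line should have exactly four comma-separated fields.
awk -F',' 'NF != 4 { bad = 1 } END { exit bad }' "$users_file" \
    && echo "$users_file looks well-formed"
```

Since the passwords are stored as plain text in this file, it is advisable to restrict who can read it and to restart the program after every change.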
In demo mode, the file tree is loaded from demo_semweb.txt. You can investigate the project files.
Export: assertions.ttl, terminology.ttl, topic-statements.ttl