1st project of Language Understanding Systems, Fall Semester 2017, University of Trento.
The project consist in building a model for concept sequence tagging for a Movie Domain. All the details can be found in the task file. All results are discussed in the report file.
The repository is organized in 2 main folders: report
and code
.
The report
folder contains the LaTeX report source files, the code
folder contains the Bash files script used to create, train and test the model, plus some utilities used to parse the results and compute statistics.
The code/model
folder contains the 2 models, the basic one model/v1.sh
and the improved one model/v2.sh
.
Each model accepts some parameters from the command line, like the feature to use for the training, the n-gram order and the smoothing method for the underlying language model.
Just run the script without paramters to get an help message with the right paramters.
Here we provide an example of invocation:
# run the first model, with the best parameters
./code/model/v1.sh word 5 witten_bell
# run the second model, with the best paramters
./code/model/v2.sh word word 4 kneser_ney
The script will automatically create the code/computations
folder.
Inside this folder you will find a folder for each execution one of the model with different parameters.
To check the results of a specific computation, you can check the file performances/performances.txt
, for instance:
# results of the model 1 with parameters: word 5 witten_bell
cat ./code/computations/v1-word-5-witten_bell/performances/performances.txt
# results of the model 1 with parameters: word word 4 kneser_ney
cat ./code/computations/v2-word-word-5-witten_bell/performances/performances.txt
Another interesting file generated is the performances/comparison.txt
, which contains the test data (feature and concept) and the concept predicted by the model (last column).
The script have been tested on Bash and require the GNU sed tool. Please contact me if there is any problem in the execution of the scripts.
The code/analysis
folder contains the scripts used to analyze the provided data and the performances of the model and generate the tables included in the report.
The model source code is licences under the MIT license. A copy of the license is available in the LICENSE file. The LaTeX sources and the report are licenced under the Creative Commons Attribution-ShareAlike 4.0 International License.