This project converts a corpus in the PennTreebank format into a Definite Clause Grammar (DCG) that can be run in Prolog.
The project is still in development, and upcoming updates will address the following tasks:
- Enable PennTreebank format
- Compute probability and frequency count for rules
- Reorder the rules for better efficiency and remove loops
- Generate the probability for the parse tree
- Generate the grammar with argument structure
- Add an option for rule cutting, which prunes rules with a frequency below a given threshold
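The frequency, probability, and rule-cut tasks above are not implemented yet. As a rough illustration of what they could involve, here is a minimal sketch that assumes the probability of a rule is its relative frequency among rules with the same left-hand side, recomputed after pruning. The function name `rule_statistics`, the `min_count` parameter, and the `(lhs, rhs)` tuple representation are hypothetical and not part of this project.

```python
from collections import Counter


def rule_statistics(rules, min_count=1):
    """Return {rule: (count, probability)} for rules seen at least min_count times.

    `rules` is an iterable of (lhs, rhs) pairs, where rhs is a tuple of symbols.
    The probability is an assumed relative-frequency estimate:
    count(lhs -> rhs) / count(lhs), computed over the rules that survive pruning.
    """
    counts = Counter(rules)

    # Rule cut: drop rules whose frequency is below the threshold.
    kept = {rule: c for rule, c in counts.items() if c >= min_count}

    # Total count per left-hand side, over the surviving rules only.
    lhs_totals = Counter()
    for (lhs, _rhs), c in kept.items():
        lhs_totals[lhs] += c

    return {rule: (c, c / lhs_totals[rule[0]]) for rule, c in kept.items()}


# Hypothetical rules extracted from a tiny corpus.
example_rules = [
    ("np", ("dt", "nn")),
    ("np", ("dt", "nn")),
    ("np", ("nn",)),
    ("vp", ("vbz",)),
]
all_stats = rule_statistics(example_rules)
pruned_stats = rule_statistics(example_rules, min_count=2)
```

With `min_count=2`, only the `np --> dt, nn` rule survives, and its probability is renormalized over the remaining `np` rules.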
This project was tested with Python 3.8. To install the dependencies, run:
pip install -r requirements.txt
To use the DCG converter, run the main.py script with the following arguments:
    usage: main.py [-h] --file_path FILE_PATH --file_format {VISL,PennTreebank,TigerXML} --output_folder OUTPUT_FOLDER [--graphviz]

    optional arguments:
      -h, --help            show this help message and exit
      --file_path FILE_PATH
                            File path in the specified format.
      --file_format {VISL,PennTreebank,TigerXML}
                            File format.
      --output_folder OUTPUT_FOLDER
                            Output folder.
      --graphviz            A boolean switch to render the tree in graphviz
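For readers who want to call the converter from another script, the interface above can be mirrored with `argparse`. This is a sketch reconstructed from the help text only, not the actual contents of main.py; the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line interface."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--file_path", required=True,
                        help="File path in the specified format.")
    parser.add_argument("--file_format", required=True,
                        choices=["VISL", "PennTreebank", "TigerXML"],
                        help="File format.")
    parser.add_argument("--output_folder", required=True,
                        help="Output folder.")
    # store_true makes --graphviz a switch that defaults to False.
    parser.add_argument("--graphviz", action="store_true",
                        help="A boolean switch to render the tree in graphviz")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args([
    "--file_path", "corpus.ptb",  # hypothetical input file
    "--file_format", "PennTreebank",
    "--output_folder", "output",
])
```

The three required options raise a usage error when omitted, and `--file_format` rejects any value outside the three listed formats.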
Example of usage:
python main.py --file_path ../dataset/Bosque_CF_8.0.PennTreebank_utf8.ptb --file_format PennTreebank --output_folder ../output
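To give an idea of the kind of conversion involved, the following minimal sketch reads one bracketed PennTreebank tree and emits one DCG rule per tree node. It is an illustration under stated assumptions, not the code used by this project: the helpers `parse_ptb`, `to_atom`, `quote`, and `dcg_rules` are hypothetical, the tree is assumed to have a labelled root, and labels are lowercased so that they are valid Prolog atoms.

```python
import re


def parse_ptb(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves (words) are kept as plain strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1                      # consume the node label
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read()


def to_atom(label):
    """Lowercase a label and replace non-word characters with underscores."""
    return re.sub(r"\W", "_", label.lower())


def quote(word):
    """Wrap a word in single quotes, escaping backslashes and quotes."""
    return "'" + word.replace("\\", "\\\\").replace("'", "\\'") + "'"


def dcg_rules(tree):
    """Return DCG rules for a tree, parent rule first, then the subtrees."""
    label, children = tree
    rhs = [to_atom(c[0]) if isinstance(c, tuple) else "[" + quote(c) + "]"
           for c in children]
    rules = ["%s --> %s." % (to_atom(label), ", ".join(rhs))]
    for child in children:
        if isinstance(child, tuple):
            rules.extend(dcg_rules(child))
    return rules


rules = dcg_rules(parse_ptb("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))"))
print("\n".join(rules))
```

Each printed line is a DCG clause such as `s --> np, vp.` or `dt --> ['the'].`, which a Prolog system with DCG support can load directly.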