VISL/PennTreebank to DCG converter

This project converts a corpus in the VISL or PennTreebank format into a definite clause grammar (DCG) that can be loaded and run in Prolog.
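To make the idea concrete, here is a minimal sketch of the kind of transformation involved: reading one bracketed PennTreebank tree and emitting one DCG rule per node. It is not the converter's actual code. The function names `parse_ptb` and `dcg_rules` are hypothetical, and lowercasing every label is a simplifying assumption so that non-terminals are valid Prolog atoms.

```python
import re


def parse_ptb(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                              # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1                              # consume ")"
        return (label, children)

    return node()


def dcg_rules(tree):
    """Yield one DCG rule string per node, in pre-order."""
    label, children = tree
    body = []
    for child in children:
        if isinstance(child, tuple):
            body.append(child[0].lower())         # non-terminal
        else:
            body.append("[" + child.lower() + "]")  # terminal, as a Prolog list
    yield f"{label.lower()} --> {', '.join(body)}."
    for child in children:
        if isinstance(child, tuple):
            yield from dcg_rules(child)


tree = parse_ptb("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))")
rules = list(dcg_rules(tree))
print("\n".join(rules))
# s --> np, vp.
# np --> dt, nn.
# dt --> [the].
# nn --> [dog].
# vp --> vbz.
# vbz --> [barks].
```

A real corpus needs more care than this sketch takes: labels such as `NP-SBJ` and words containing capitals or punctuation would have to be quoted or normalized before Prolog accepts them.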

Adjustments and improvements

The project is still in development and upcoming updates will address the following tasks:

  • Enable PennTreebank format
  • Compute probability and frequency count for rules
  • Reorder the rules for better efficiency and remove loops
  • Generate the probability for the parse tree
  • Generate the grammar with argument structure
  • Add an option for rule cut, which prunes rules whose frequency falls below a given threshold
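As an illustration of the planned frequency, probability, and rule-cut tasks, the sketch below counts how often each rule occurs, estimates its probability relative to all rules with the same left-hand side, and drops rules below a threshold. None of this is taken from the project: the function `rule_statistics`, the `(lhs, rhs)` rule representation, and the `min_count` parameter are assumptions made for the example.

```python
from collections import Counter


def rule_statistics(rules, min_count=1):
    """Return {(lhs, rhs): (count, probability)} for rules seen at least min_count times.

    The probability is count / total occurrences of the same lhs,
    computed before pruning (no renormalization afterwards).
    """
    counts = Counter(rules)
    lhs_totals = Counter()
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {
        rule: (n, n / lhs_totals[rule[0]])
        for rule, n in counts.items()
        if n >= min_count
    }


# Hypothetical rule occurrences extracted from a treebank.
observed = [
    ("np", ("dt", "nn")),
    ("np", ("dt", "nn")),
    ("np", ("dt", "nn")),
    ("np", ("nn",)),
    ("vp", ("vbz",)),
]
stats = rule_statistics(observed, min_count=2)
print(stats)
# {('np', ('dt', 'nn')): (3, 0.75)}
```

Computing probabilities before pruning keeps the estimates faithful to the corpus, at the cost that the surviving rules for a given left-hand side no longer sum to 1. Renormalizing after the cut is the alternative if a proper distribution is needed.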

💻 Requirements

This project was tested with Python 3.8. To install the dependencies, run:

pip install -r requirements.txt

☕ Using the DCG converter

To use the DCG converter, run the main.py script with the following arguments:

usage: main.py [-h] --file_path FILE_PATH --file_format {VISL,PennTreebank,TigerXML} --output_folder OUTPUT_FOLDER [--graphviz]

optional arguments:
  -h, --help            show this help message and exit
  --file_path FILE_PATH
                        File path in the specified format.
  --file_format {VISL,PennTreebank,TigerXML}
                        File format.
  --output_folder OUTPUT_FOLDER
                        Output folder.
  --graphviz            A boolean switch to render the tree in graphviz

Example of usage:

python main.py --file_path ../dataset/Bosque_CF_8.0.PennTreebank_utf8.ptb --file_format PennTreebank --output_folder ../output
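Although the help text lists every flag under "optional arguments", the usage line shows --file_path, --file_format, and --output_folder without brackets, so they are required. This is how argparse in Python 3.8 prints required options. The sketch below is a reconstruction of a parser that would produce an equivalent interface. It is an assumption based on the help output above, not a copy of main.py, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser matching the help output shown above (reconstruction)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--file_path", required=True,
                        help="File path in the specified format.")
    parser.add_argument("--file_format", required=True,
                        choices=["VISL", "PennTreebank", "TigerXML"],
                        help="File format.")
    parser.add_argument("--output_folder", required=True,
                        help="Output folder.")
    # store_true makes --graphviz a switch: False unless the flag is present.
    parser.add_argument("--graphviz", action="store_true",
                        help="A boolean switch to render the tree in graphviz")
    return parser


# Parse the same arguments as the usage example, without touching the file system.
args = build_parser().parse_args([
    "--file_path", "../dataset/Bosque_CF_8.0.PennTreebank_utf8.ptb",
    "--file_format", "PennTreebank",
    "--output_folder", "../output",
])
print(args.file_format, args.graphviz)
# PennTreebank False
```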