Skip to content

plandes/amr

Repository files navigation

AMR annotation and feature generation

PyPI Python 3.10 Python 3.11 Build Status

Provides support for AMR graph manipulation, annotations and feature generation.

Features:

  • Annotation in AMR metadata. For example, sentence types found in the Proxy report AMR corpus.
  • AMR token alignment as spaCy components.
  • Integrates natural language parsing and features with Zensols zensols.nlparse library.
  • A scoring API that includes Smatch and WLK, which extends a more general NLP scoring module.
  • AMR parsing (amrlib) and AMR co-reference (amr_coref).
  • Command line and API utilities for AMR graph Penman graphs, debugging and files.
  • Tools for training and evaluating new AMR parse (text to graph) and generation (graph to text) models.
  • A method for re-indexing and updating AMR graph variables so that all in a document collection are unique.

Documentation

Installing

The library can be installed with pip from the pypi repository:

pip3 install zensols.amr

Installing the Gsii Model

The Gsii model link expires and requires a manual download of the model. To install it, do the following:

  1. Download the Gsii model (click "direct download").
  2. Move the file to the local directory.
  3. Install the file by forcing a test parse:
    amr parse 'Test sentence.' --override \
        amr_parse_gsii_resource.url=file:model_parse_gsii-v0_1_0.tar.gz

Usage

from penman.graph import Graph
from zensols.nlp import FeatureDocument, FeatureDocumentParser
from zensols.amr import AmrDocument, AmrSentence, Dumper, ApplicationFactory

sent: str = """

He was George Washington and first president of the United States.
He was born On February 22, 1732.

""".replace('\n', ' ').strip()

# get the AMR document parser
doc_parser: FeatureDocumentParser = ApplicationFactory.get_doc_parser()

# the parser creates a NLP centric feature document as provided in the
# zensols.nlp package
doc: FeatureDocument = doc_parser(sent)

# the AMR object graph data structure is provided in the feature document
amr_doc: AmrDocument = doc.amr

# dump a human readable output of the AMR document
amr_doc.write()

# get the first AMR sentence instance
amr_sent: AmrSentence = amr_doc.sents[0]
print('sentence:')
print(' ', amr_sent.text)
print('tuples:')

# show the Penman graph representation
pgraph: Graph = amr_sent.graph
print(f'variables: {", ".join(pgraph.variables())}')
for t in pgraph.triples:
    print(' ', t)
print('edges:')
for e in pgraph.edges():
    print(' ', e)

# visualize the graph as a PDF
dumper: Dumper = ApplicationFactory.get_dumper()
dumper(amr_doc)

Per the example, the t5.conf and gsii.conf configuration show how to include configuration needed per AMR model. These files can also be used directly with the amr command using the --config option.

However, the other resources in the example must be imported unless you redefine them yourself.

Library

When adding the amr spaCy pipeline component, the doc._.amr attribute is set on the Doc instance. You can either configure spaCy yourself, or you can use the configuration files in test-resources as an example using the zensols.util configuration framework. The command line application provides an example how to do this, along with the test case.

Command Line

This library is written mostly to be used by other program, but the command line utility amr is also available to demonstrate its usage and to generate ARM graphs on the command line.

To parse:

$ amr parse -c test-resources/t5.conf 'This is a test of the AMR command line utility.'
# ::snt This is a test of the AMR command line utility.
(t / test-01
   :ARG1 (u / utility
            :mod (c / command-line)
            :name (n / name
                     :op1 "AMR"
                     :toki1 "6")
            :toki1 "9")
   :domain (t2 / this
               :toki1 "0")
   :toki1 "3")

To generate graphs in PDF format:

$ amr plot -c test-resources/t5.conf 'This is a test of the AMR command line utility.'
wrote: amr-graph/this-is-a-test-of-the-amr-comm.pdf

Training

This package uses the amrlib training, but adds a command line and downloadable corpus aggregation / API. To train:

  1. Choose a model (i.e. SPRING, T5).
  2. Optionally edit the train configuration directory of the model you choose.
  3. Optionally edit the resources/train.yml to select/add more corpora (see Adding Corpora).
  4. Train the model: ./amr --config train-config/<model>.conf

Pretrained Models

This library was used to train all of the amrlib models (using the same checkpoints as amrlib), except the T5 Base v1 model, with additional examples from publicly available human annotated corpora. The differences of these trained models include:

  • None of the models were tested against a training set, only the development SMATCH scores are available. This was intentional to provide more training examples.
  • The AMR Release 3.0 (LDC2020T02) test set was added to the training set.
  • The Little Prince and Bio AMR corpora where used to train the models. The first 85% of the AMR sentences were added to training set and the remaining 15% were added to the development set.
  • The mini-batch size changed for generate-t5wtense-base due to memory constraints.
  • The number of training epochs were increased to account for the additional number of training examples.
  • Models have the same naming conventions but are prefixed with zsl.
  • Generative models were trained on graphs metadata annotated by the Sci spaCy en_core_sci_md model.

The performance of these models:

Model Name Model Type Checkpoint Performance
zsl_spring parse facebook/bart-large SMATCH: 81.26
zsl_xfm_bart_base parse facebook/bart-base SMATCH: 80.5
zsl_xfm_bart_large parse facebook/bart-large SMATCH: 82.7
zsl_t5wtense_base generative t5-base BLEU: 42.20
zsl_t5wtense_large generative google/flan-t5-large BLEU: 44.01

These models are available upon request.

Adding Corpora

You can retrain your own model and add additional training corpora by modifying the list of ${amr_prep_manager:preppers} in resources/train.yml. This file defines downloaded corpora for the Little Prince and Bio AMR corpora. To use the AMR 3.0 release, add the LDC downloaded file to (a new) download directory.

Attribution

This project, or reference model code, uses:

  • Python 3.11
  • amrlib for AMR parsing.
  • amr_coref for AMR co-reference
  • spaCy for natural language parsing.
  • zensols.nlparse for natural language features.
  • Smatch (Cai and Knight. 2013) and WLK (Opitz et. al. 2021) for scoring.

Citation

If you use this project in your research please use the following BibTeX entry:

@inproceedings{landes-etal-2023-deepzensols,
	title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
	author = "Landes, Paul  and
	  Di Eugenio, Barbara  and
	  Caragea, Cornelia",
	editor = "Tan, Liling  and
	  Milajevs, Dmitrijs  and
	  Chauhan, Geeticka  and
	  Gwinnup, Jeremy  and
	  Rippeth, Elijah",
	booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
	month = dec,
	year = "2023",
	address = "Singapore, Singapore",
	publisher = "Association for Computational Linguistics",
	url = "https://aclanthology.org/2023.nlposs-1.16",
	pages = "141--146"
}

Changelog

An extensive changelog is available here.

Community

Please star this repository and let me know how and where you use this API. Contributions as pull requests, feedback and any input is welcome.

License

MIT License

Copyright (c) 2021 - 2024 Paul Landes