A Java library for annotating grammatical mistakes in parallel text. Ported from ERRANT.
From the official ERRANT docs:
The main aim of ERRANT is to automatically annotate parallel English sentences with error type information. Specifically, given an original and corrected sentence pair, ERRANT will extract the edits that transform the former to the latter and classify them according to a rule-based error type framework.
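To make "extract the edits that transform the former to the latter" concrete, here is a minimal, self-contained sketch of token-level edit extraction using a plain Levenshtein alignment. It is an illustration only, not Errant4J's or ERRANT's aligner. The class `EditExtractionSketch`, the type `TokenEdit`, and the method `extract` are hypothetical names invented for this example, and tokens are assumed to be pre-split on whitespace.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only. This is NOT Errant4J's aligner; every name below is hypothetical.
final class EditExtractionSketch {

    /** Source tokens [srcStart, srcEnd) are rewritten as the target text. */
    static final class TokenEdit {
        final int srcStart;
        final int srcEnd;
        final String source;
        final String target;

        TokenEdit(int srcStart, int srcEnd, String source, String target) {
            this.srcStart = srcStart;
            this.srcEnd = srcEnd;
            this.source = source;
            this.target = target;
        }

        @Override
        public String toString() {
            return "[" + srcStart + "," + srcEnd + ") '" + source + "' -> '" + target + "'";
        }
    }

    /** Aligns two token sequences with unit-cost Levenshtein and returns the non-matching steps. */
    static List<TokenEdit> extract(String[] src, String[] tgt) {
        int n = src.length;
        int m = tgt.length;
        int[][] cost = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) cost[i][0] = i; // delete every source token
        for (int j = 0; j <= m; j++) cost[0][j] = j; // insert every target token
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diagonal = cost[i - 1][j - 1] + (src[i - 1].equals(tgt[j - 1]) ? 0 : 1);
                cost[i][j] = Math.min(diagonal, Math.min(cost[i - 1][j] + 1, cost[i][j - 1] + 1));
            }
        }

        // Walk back from the bottom-right corner, recording every step that is not a match.
        List<TokenEdit> edits = new ArrayList<>();
        int i = n;
        int j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && src[i - 1].equals(tgt[j - 1]) && cost[i][j] == cost[i - 1][j - 1]) {
                i--; j--; // tokens match, no edit
            } else if (i > 0 && j > 0 && cost[i][j] == cost[i - 1][j - 1] + 1) {
                edits.add(new TokenEdit(i - 1, i, src[i - 1], tgt[j - 1])); // substitution
                i--; j--;
            } else if (i > 0 && cost[i][j] == cost[i - 1][j] + 1) {
                edits.add(new TokenEdit(i - 1, i, src[i - 1], "")); // deletion
                i--;
            } else {
                edits.add(new TokenEdit(i, i, "", tgt[j - 1])); // insertion before source index i
                j--;
            }
        }
        Collections.reverse(edits); // backtrace yields edits right-to-left
        return edits;
    }

    public static void main(String[] args) {
        String[] src = "Yesterday I go to see my therapist .".split(" ");
        String[] tgt = "Yesterday I went to see my therapist .".split(" ");
        extract(src, tgt).forEach(System.out::println);
    }
}
```

For the sentence pair above, the sketch reports a single substitution of "go" with "went" at source token index 2. A rule-based classifier would then assign an error type to that edit.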
Before you begin, ensure you have met the following requirements:
- You have Java 11 installed.
- You have installed spaCy4j as described in its installation instructions.
Add this to the dependencies section of your pom.xml:
<dependency>
<groupId>io.github.manzurola</groupId>
<artifactId>errant4j</artifactId>
<version>0.5.0</version>
</dependency>
To use Errant4J in code, follow these steps:
// Get a spaCy instance (from spaCy4j)
SpaCy spacy = SpaCy.create(CoreNLPAdapter.create());
// Create an English annotator
Annotator annotator = Errant.create().annotator("en", spacy);
// Parse source and target sentences
Doc source = annotator.parse("Yesterday I go to see my therapist.");
Doc target = annotator.parse("Yesterday I went to see my therapist.");
// Annotate grammatical errors
List<Annotation> annotations = annotator.annotate(source.tokens(), target.tokens());
// Inspect annotations
for (Annotation annotation : annotations) {
    GrammaticalError error = annotation.grammaticalError();
    String sourceText = annotation.sourceText();
    String targetText = annotation.targetText();
    System.out.printf("Error: %s, sourceText: %s, targetText: %s%n",
            error,
            sourceText,
            targetText);
    // Inspect the classified edit
    Edit<Token> edit = annotation.edit();
    // ...
}
Errant4J is currently available only for English.
If you wish to develop Errant4J for another language, start with the reference English implementation.
I suggest copying it to a new package, e.g. lang.he for Hebrew, along with the relevant test package.
As per the current design, you will be required to implement a custom Merger and Classifier. Then proceed to create a custom Annotator which provides a default TokenAligner as the first step in the pipeline.
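The division of labour between the pipeline stages can be sketched as follows. This is a hedged illustration under stated assumptions: `PipelineSketch`, `SimpleEdit`, `SimpleMerger`, and `SimpleClassifier` are hypothetical stand-ins invented here, not Errant4J's actual Merger, Classifier, or Annotator types, whose real signatures may differ. The toy classifier only distinguishes ERRANT's three operation prefixes (M for missing, U for unnecessary, R for replacement); a real language-specific classifier would also use linguistic information from the parsed tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT Errant4J's Merger,
// Classifier, or Annotator types, whose real signatures may differ.
final class PipelineSketch {

    /** An aligned edit over source tokens [start, end). */
    static final class SimpleEdit {
        final int start;
        final int end;
        final String source;
        final String target;

        SimpleEdit(int start, int end, String source, String target) {
            this.start = start;
            this.end = end;
            this.source = source;
            this.target = target;
        }
    }

    /** Stage 2: decides which neighbouring edits belong together. */
    interface SimpleMerger {
        List<SimpleEdit> merge(List<SimpleEdit> edits);
    }

    /** Stage 3: assigns an error label to a merged edit. */
    interface SimpleClassifier {
        String classify(SimpleEdit edit);
    }

    private static String join(String a, String b) {
        if (a.isEmpty()) return b;
        if (b.isEmpty()) return a;
        return a + " " + b;
    }

    /** Toy merger: fuses edits whose source spans touch. Input must be sorted by start. */
    static List<SimpleEdit> mergeAdjacent(List<SimpleEdit> edits) {
        List<SimpleEdit> merged = new ArrayList<>();
        for (SimpleEdit edit : edits) {
            int last = merged.size() - 1;
            if (last >= 0 && merged.get(last).end == edit.start) {
                SimpleEdit prev = merged.get(last);
                merged.set(last, new SimpleEdit(prev.start, edit.end,
                        join(prev.source, edit.source), join(prev.target, edit.target)));
            } else {
                merged.add(edit);
            }
        }
        return merged;
    }

    /** Toy classifier: M(issing), U(nnecessary), or R(eplacement). */
    static String classifyOperation(SimpleEdit edit) {
        if (edit.source.isEmpty()) return "M";
        if (edit.target.isEmpty()) return "U";
        return "R";
    }

    /** Runs stages 2 and 3 over edits already produced by an aligner (stage 1). */
    static List<String> annotate(List<SimpleEdit> aligned,
                                 SimpleMerger merger,
                                 SimpleClassifier classifier) {
        List<String> labels = new ArrayList<>();
        for (SimpleEdit edit : merger.merge(aligned)) {
            labels.add(classifier.classify(edit));
        }
        return labels;
    }
}
```

Calling `PipelineSketch.annotate(aligned, PipelineSketch::mergeAdjacent, PipelineSketch::classifyOperation)` wires the toy stages together. Keeping the merger and classifier behind separate interfaces is what lets a new language swap in its own rules while reusing the alignment step.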
I recommend starting with tests and then gradually developing the merger and classifier until they pass, like so:
// Create a custom in-development Annotator.
// The first pipeline component, the Token Aligner, comes preconfigured in the created Annotator.
Annotator annotator = Annotator.of(new EnMerger(), new EnClassifier());
// Prepare source and target docs
Doc source = annotator.parse("I am eat dinner.");
Doc target = annotator.parse("I am eating dinner.");
// Create an expected string edit and transform it to a Token edit
Edit<Token> edit = Edit.builder()
        .substitute("eat")
        .with("eating")
        .atPosition(2, 2)
        .project(source.tokens(), target.tokens());
// Create the expected annotation containing the Edit and GrammaticalError
Annotation expected = Annotation.of(edit, GrammaticalError.REPLACEMENT_VERB_FORM);
// Run Errant for the given source and target
List<Annotation> actual = annotator.annotate(source.tokens(), target.tokens());
// Assert that the actual errors contain our expected error
Assertions.assertTrue(actual.contains(expected));
Alternatively, contact me directly and I'll help you get started fast.
To contribute to Errant4J, follow these steps:
- Fork this repository.
- Create a branch: git checkout -b <branch_name>
- Make your changes and commit them: git commit -m '<commit_message>'
- Push to the original branch: git push origin <project_name>/<location>
- Create the pull request.
Alternatively, see the GitHub documentation on creating a pull request.
Thanks to the following people who have contributed to this project:
If you want to contact me you can reach me at guy.manzurola@gmail.com.
This project uses the following license: MIT.