A tool for detecting grammatical features in sentences, clauses, and phrases in just a few lines of code. This tool is one piece of a larger project to facilitate the creation of reading exercises for language instruction. It is designed to determine if a text contains sentences relevant to the desired grammatical feature. Any language supported by spaCy is theoretically supported.
The patterns for these grammatical features are defined in YAML files called patternsets in lieu of writing code. These YAML files expand the capabilities of the GrammarDetector. The input text to be analyzed is compared against the patterns in the patternsets. In other words, writing more code is unnecessary for supporting new grammatical features. This means that inaccurate results arise from inaccurate patterns (and not from the code itself). To mitigate errors, unittests can be defined in the patternsets.
For the purposes of this tool, a sentence is roughly defined as:
- An independent clause with sentence-final punctuation and, optionally, additional clauses, or
- A dependent clause with sentence-final punctuation which may satisfy the concept of a 'complete thought' in the context of surrounding sentences (e.g. "We tried updating it. Which didn't work. Nor did the reinstall.")
The core of this tool is the GrammarDetector. After construction, it can be used in two different ways:
- Using the GrammarDetector.__call__(self, input: str) instance method on the input to run automatically.
- Looping through the GrammarDetector.detectors: list[Detector] instance property and using the Detector.__call__(self, input: str) instance method on the input to run manually.
Dependencies:
- python (>=3.9) -- frequent use of f-strings and type hints
- pyyaml -- loading patternset YAML files
- spacy -- rule-based grammatical pattern matching
- spacy-lookups-data -- spaCy dependency
- tabulate -- printing token tables to write patterns
Dev dependencies:
- black -- opinionated code formatter
- mypy -- type checking
- python-lsp-server -- IDE integration
- types-pyyaml -- type checking
- types-tabulate -- type checking
- types-setuptools -- type checking
Currently supports the ability to:
- Evaluate a sentence, clause, or phrase for its grammatical features
- Produce results that are reader-friendly and reader-useful
- Use built-in grammatical features with just 3 lines of code (import, construct, and call)
- Create your own grammatical feature by passing the filepath of a simple patternset YAML file with spaCy Tokens
- Convert input into a table of Tokens to aid in visualizing and conceptualizing patterns
- Convert input into a list of Tokenlikes to aid in creating and improving patterns
- Define and run tests in patternsets to evaluate the accuracy of the patterns
- Fragment an input into noun chunks automatically before the Detector is run
Future features:
- Add support for validating patternset YAML files (currently validates patterns only)
- Publish the built-in patternset YAML files as a separate package
All current patterns are relatively naive, so they do not yet effectively handle recursivity. This problem can be solved by 1) writing recursive patterns or 2) writing alternative patterns and suffixing the rulename property with numbers (e.g. ditransitive-1 and ditransitive-2).
- Determiners:
  - Indefinite
  - Definite
  - Other
  - None
- Persons:
  - 1st
  - 2nd
  - 3rd
- Tense-Aspects:
  - Present simple
  - Present simple passive
  - Past simple
  - Past simple passive
  - Future simple will
  - Future simple will passive
  - Future simple be-going-to
  - Future simple be-going-to passive
  - Present continuous
  - Present continuous passive
  - Past continuous
  - Past continuous passive
  - Future continuous
  - Future continuous passive
  - Present perfect
  - Present perfect passive
  - Past perfect
  - Past perfect passive
  - Future perfect
  - Future perfect passive
  - Present perfect continuous
  - Present perfect continuous passive
  - Past perfect continuous
  - Past perfect continuous passive
  - Future perfect continuous
  - Future perfect continuous passive
- Transitivity and Valency:
  - Impersonal (valency == 0)
  - Intransitive (valency == 1)
  - Transitive (valency == 2)
  - Ditransitive (valency == 3)
- Voices:
  - Active
  - Passive
The default language model, en_core_web_md (40 MB), can be substituted with another spaCy language model, such as en_core_web_lg (560 MB) or en_core_web_sm (12 MB). Be sure to disable builtins (see below) if using a model from a language other than English.
$ pip install grammar-detector
$ python -m spacy download en_core_web_md
# my_script.py
from grammardetector import GrammarDetector
# Default values
settings = {
    "builtins": True,
    "language_model": "en_core_web_md",
    "patternset_path": "",  # Custom patternsets
    "verbose": False,
    "very_verbose": False,
}
grammar_detector = GrammarDetector(**settings) # Optionally, pass in **settings
# my_script.py
from grammardetector import GrammarDetector
grammar_detector = GrammarDetector()
input: str = "The dog was chasing a cat into the house."
results = grammar_detector(input)
# my_script.py
from grammardetector import GrammarDetector, Match
from typing import Union
ResultsType = dict[str, Union[str, list[Match]]]
grammar_detector = GrammarDetector()
input: str = "The dog was chasing a cat into the house."
results: ResultsType = grammar_detector(input)
print(results)
# {
# 'input': 'The dog was chasing a cat into the house.',
# 'voices': [<active: was chasing>],
# 'tense_aspects': [<past continuous: was chasing>],
# 'persons': [<3rd: dog>, <3rd: cat>, <3rd: house>],
# 'determiners': [<definite: The dog>, <indefinite: a cat>, <definite: the house>],
# 'transitivity': [<ditransitive: dog was chasing a cat into the house>]
# }
feature: str = "tense_aspects"
verb_tense: Match = results[feature][0]
print(verb_tense)
# <past continuous: was chasing>
print(verb_tense.rulename)
# "past continuous"
print(verb_tense.span)
# "was chasing"
print(verb_tense.span_features)
# {
# 'span': was chasing,
# 'phrase': 'was chasing',
# 'root': 'chasing',
# 'root_head': 'chasing',
# 'pos': 'VERB',
# 'tag': 'VBG',
# 'dep': 'ROOT',
# 'phrase_lemma': 'be chase',
# 'root_lemma': 'chase',
# 'pos_desc': 'verb',
# 'tag_desc': 'verb, gerund or present participle',
# 'dep_desc': 'root'
# }
from grammardetector import GrammarDetector
grammar_detector = GrammarDetector(
    builtins=False,
    patternset_path="path/to/my/patternset/files/",
)
input: str = "The dog was chasing a cat into the house."
results = grammar_detector(input)
print(results)
# Prints only your custom features
# my_script.py
from grammardetector import GrammarDetector
grammar_detector = GrammarDetector(patternset_path="path/to/my/patternset/files/")
grammar_detector.run_tests()
# Run the tests for the built-in patternsets
grammar_detector.run_tests(builtin_tests=True)
# my_script.py
from grammardetector import GrammarDetector
grammar_detector = GrammarDetector()
input: str = "The dog was chasing a cat into the house."
default_kwargs = {
    "pos": True,
    "tag": True,
    "dependency": True,
    "lemma": True,
}
table: str = grammar_detector.token_table(input, **default_kwargs)
print(table)
Word | POS | POS Definition | Tag | Tag Definition | Dep. | Dep. Definition | Lemma. |
---|---|---|---|---|---|---|---|
The | DET | determiner | DT | determiner | det | determiner | the |
dog | NOUN | noun | NN | noun, singular or mass | nsubj | nominal subject | dog |
was | AUX | auxiliary | VBD | verb, past tense | aux | auxiliary | be |
chasing | VERB | verb | VBG | verb, gerund or present participle | ROOT | root | chase |
a | DET | determiner | DT | determiner | det | determiner | a |
cat | NOUN | noun | NN | noun, singular or mass | dobj | direct object | cat |
into | ADP | adposition | IN | conjunction, subordinating or preposition | prep | prepositional modifier | into |
the | DET | determiner | DT | determiner | det | determiner | the |
house | NOUN | noun | NN | noun, singular or mass | pobj | object of preposition | house |
. | PUNCT | punctuation | . | punctuation mark, sentence closer | punct | punctuation | . |
# my_script.py
from grammardetector import GrammarDetector, Tokenlike
# TokenlikeKeys = Literal["pos", "tag", "dep", "lemma", "word"]
# Tokenlike = dict[TokenlikeKeys, str]
grammar_detector = GrammarDetector()
input: str = "The dog was chasing a cat into the house."
default_kwargs = {
    "pos": True,
    "tag": True,
    "dependency": True,
    "lemma": False,
    "word": False,
}
data: list[Tokenlike] = grammar_detector.token_data(input, **default_kwargs)
for entry in data:
    print(entry)
# {'pos': 'DET', 'tag': 'DT', 'dep': 'det'}
# {'pos': 'NOUN', 'tag': 'NN', 'dep': 'nsubj'}
# {'pos': 'AUX', 'tag': 'VBD', 'dep': 'aux'}
# {'pos': 'VERB', 'tag': 'VBG', 'dep': 'ROOT'}
# {'pos': 'DET', 'tag': 'DT', 'dep': 'det'}
# {'pos': 'NOUN', 'tag': 'NN', 'dep': 'dobj'}
# {'pos': 'ADP', 'tag': 'IN', 'dep': 'prep'}
# {'pos': 'DET', 'tag': 'DT', 'dep': 'det'}
# {'pos': 'NOUN', 'tag': 'NN', 'dep': 'pobj'}
# {'pos': 'PUNCT', 'tag': '.', 'dep': 'punct'}
# my_script.py
from grammardetector import GrammarDetector
grammar_detector = GrammarDetector(verbose=True, very_verbose=False) # very_verbose prioritized over verbose
# Prints logs for configuring and loading patternsets
input: str = "The dog was chasing a cat into the house."
results = grammar_detector(input)
# Prints logs for running the matcher and interpreting the results
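The precedence noted above (very_verbose wins over verbose) could be resolved along the following lines. This is a minimal sketch and not the library's actual code: the helper name resolve_log_level and the mapping of the two flags onto the standard logging levels INFO and DEBUG are assumptions made for illustration.

```python
import logging


def resolve_log_level(verbose: bool = False, very_verbose: bool = False) -> int:
    """Pick a logging level from the two flags (hypothetical helper)."""
    # very_verbose is checked first so that it takes priority over verbose.
    if very_verbose:
        return logging.DEBUG
    if verbose:
        return logging.INFO
    # Assumed quiet default when neither flag is set.
    return logging.WARNING
```

Checking the more detailed flag first means that setting both flags behaves the same as setting very_verbose alone.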
This section describes the components used to build and run Detectors inside the GrammarDetector. To expand on the built-in features of the GrammarDetector, understanding how patternset YAML files are created, configured, and loaded is critical. To load your own patternset files, pass the file or directory path to the patternset_path keyword argument when constructing the GrammarDetector.
The GrammarDetector class is the entrypoint for loading in patternset files and evaluating text input. By running the GrammarDetector.__call__(self, input: str) instance method, the text input will be compared against both the built-in patternsets and the patternsets provided via the patternset_path keyword argument. Under the hood, the GrammarDetector wraps a DetectorRepository, which in turn contains the Detectors. Extracting the internal Detectors from the GrammarDetector is unnecessary but easy via the GrammarDetector.detectors: list[Detector] instance property.
# my_script.py
from grammardetector import Detector, GrammarDetector
grammar_detector = GrammarDetector(patternset_path="path/to/my/patternset/files/")
input: str = "The dog was chasing a cat into the house."
results = grammar_detector(input) # Making use of the __call__ method
# Alternatively, extract the detectors
detectors: list[Detector] = grammar_detector.detectors
for detector in detectors:
    print(detector(input))
The smallest piece is the spacy.tokens.Token class. Each Token represents a single word and consists of a single JSON object. A list[Token] represents a chain of words. Lists of Tokens are used in patternset YAML files to define grammatical patterns. Each Token contains a POS (part-of-speech), a TAG (tag), and/or a DEP (dependency). Grammatical categories are denoted with POS and TAG while syntactic categories are denoted with DEP. An OP (operator) may also be included to denote whether a Token is required or optional and how often it may repeat. A complete list of POSs, TAGs, and DEPs can be found in the spaCy glossary.
Some examples of POSs are "VERB", "AUX", "NOUN", "PROPN", and "SYM" for symbol.
Some examples of TAGs are "VB" for base form verb, "VBD" for past tense verb, "VBG" for gerund/present participle verb, "VBN" for past participle verb, "VBP" for non-3rd person singular present verb, and "VBZ" for 3rd person singular present verb.
Some examples of DEPs are "ROOT" for root verb, "aux", "auxpass", "nsubj", and "dobj".
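To make the attribute semantics concrete, the following sketch checks a single token against a single pattern entry. It is a simplified stand-in written for illustration and is not how spaCy's Matcher is implemented: tokens are represented as plain dicts keyed by the same uppercase attribute names, the helper token_matches is hypothetical, and OP is ignored because it governs repetition across tokens rather than a single comparison.

```python
from typing import Any

TokenData = dict[str, str]       # e.g. {"TAG": "VBZ", "DEP": "ROOT"}
TokenPattern = dict[str, Any]    # e.g. {"TAG": {"IN": ["VBP", "VBZ"]}}


def token_matches(token: TokenData, pattern: TokenPattern) -> bool:
    """Return True if every attribute in the pattern is satisfied by the token."""
    for attribute, expected in pattern.items():
        if attribute == "OP":
            # Quantifiers apply to sequences of tokens, not to one comparison.
            continue
        actual = token.get(attribute)
        if isinstance(expected, dict):
            # IN == one of these values; NOT_IN == none of these values.
            if "IN" in expected and actual not in expected["IN"]:
                return False
            if "NOT_IN" in expected and actual in expected["NOT_IN"]:
                return False
        elif actual != expected:
            return False
    return True
```

For instance, a token tagged "VBZ" satisfies {TAG: {IN: ["VBP", "VBZ"]}}, while a token tagged "VBD" does not.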
# my_feature.yaml
patterns:
  # This is a single token (i.e. 1 word)
  - rulename: present simple verb
    tokens:
      - {TAG: {IN: ["VBP", "VBZ"]}}  # IN == one of these
# my_feature.yaml
patterns:
  # This is also a single token (i.e. 1 word)
  - rulename: passive auxiliary
    tokens:
      - {
          TAG: {IN: ["VBP", "VBZ"]},
          DEP: "auxpass",
          LEMMA: "be",
          OP: "+"
        }
# my_feature.yaml
patterns:
  # This is a list of 5 tokens (i.e. 5 words)
  - rulename: future simple be-going-to passive
    tokens:
      - {TAG: {IN: ["VBP", "VBZ"]}, DEP: "aux", OP: "+"}
      - {TAG: "VBG", OP: "+", LEMMA: "go"}
      - {TAG: "TO", DEP: "aux", OP: "+"}
      - {TAG: "VB", DEP: "auxpass", LEMMA: "be"}
      - {TAG: "VBN", OP: "+"}
# my_feature.yaml
patterns:
  # This is a list of 4 tokens minimum with some degree of recursivity
  - rulename: ditransitive
    tokens:
      - {DEP: "nsubj"}
      - {OP: "*"}  # Indicates possible filler words between the tokens
      - {DEP: "ROOT"}
      - {OP: "*"}
      - {DEP: {IN: ["dobj", "iobj", "pobj", "dative"]}}
      - {OP: "*"}
      - {DEP: {IN: ["dobj", "iobj", "pobj", "dative"]}}
Each Pattern in patterns has two properties:
- rulename: str -- the name given to the Pattern with the corresponding list[spacy.tokens.Token]
- tokens: list[spacy.tokens.Token] -- the grammatical pattern
# transitivity.yaml
config:
  how_many_matches: one
patterns:
  - rulename: ditransitive
    tokens:
      - {DEP: "nsubj"}
      - {OP: "*"}
      - {DEP: "ROOT"}
      - {OP: "*"}
      - {DEP: {IN: ["dobj", "iobj", "pobj", "dative"]}}
      - {OP: "*"}
      - {DEP: {IN: ["dobj", "iobj", "pobj", "dative"]}}
  - rulename: transitive
    tokens:
      - {DEP: "nsubj"}
      - {OP: "*"}
      - {DEP: "ROOT"}
      - {OP: "*"}
      - {DEP: "dobj"}
  - rulename: intransitive
    tokens:
      - {DEP: "nsubj", LOWER: {NOT_IN: ["it"]}}
      - {OP: "*"}
      - {DEP: "ROOT"}
  - rulename: impersonal
    tokens:
      - {TAG: "PRP", DEP: "nsubj", LOWER: "it"}
      - {OP: "*"}
      - {DEP: "ROOT"}
The patternsets expand the capabilities of the GrammarDetector to detect new features. The patternsets are created by loading YAML files containing these three properties:
- patterns: list[Pattern] -- an array of named sets of tokens
- config: dict[str, Union[str, bool]] -- a configuration object to modify input/output
- tests: list[Test] -- an array of tests to validate the accuracy of the patterns
Internally, this data from the patternset file is converted into a PatternSet.
# voices.yaml
config:
  how_many_matches: one
patterns:
  - rulename: active
    tokens:
      - {DEP: "aux", OP: "*"}
      - {DEP: "ROOT"}
  - rulename: passive
    tokens:
      - {DEP: "aux", OP: "*"}
      - {DEP: "auxpass", OP: "+"}
      - {TAG: "VBN", DEP: "ROOT"}
tests:
  - input: The cat was chased by the dog.
    rulenames:
      - passive
    spans:
      - was chased
  - input: The dog chased the cat.
    rulenames:
      - active
    spans:
      - chased
The patterns: list[dict[str, Union[str, list[spacy.tokens.Token]]]] list contains rules and grammatical patterns with the following properties:
- rulename: str -- the name of the grammatical pattern (e.g. "present simple")
- tokens: list[spacy.tokens.Token] -- the tokens of the grammatical pattern
The config: dict[str, Union[str, bool]] dict contains several options for modifying the input and/or output:
- extract_noun_chunks: bool -- if true, then fragment the input into noun chunks before running the detector (default false)
- how_many_matches: str -- if "all", then get all matches; if "one", then get the longest match (default "all")
- skip_tests: bool -- if true, then skip all tests in the file when running the unittests (default false)
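The effect of these options can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's implementation: the helpers resolve_config and select_matches are hypothetical, matches are simplified to strings, and "longest" is assumed to mean the most whitespace-separated words.

```python
from typing import Any, Optional

# Default values as documented in the option list above.
DEFAULT_CONFIG: dict[str, Any] = {
    "extract_noun_chunks": False,
    "how_many_matches": "all",
    "skip_tests": False,
}


def resolve_config(user_config: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Overlay the options from a patternset file on top of the defaults."""
    return {**DEFAULT_CONFIG, **(user_config or {})}


def select_matches(spans: list[str], how_many_matches: str) -> list[str]:
    """Return every span for "all", or only the longest span for "one"."""
    if how_many_matches == "one":
        # Length is measured in words here, which is an assumption.
        return [max(spans, key=lambda span: len(span.split()))] if spans else []
    return list(spans)
```

With how_many_matches set to "one", the candidates "chasing" and "was chasing" would reduce to "was chasing" alone.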
The tests: list[dict[str, Union[str, bool, list[str]]]] list contains unittests with the following properties:
- input: str -- the sentence, clause, or phrase to be tested
- rulenames: list[str] -- the expected rulenames
- spans: list[str] -- the expected matching text
- skip: bool -- if true, then skip this test (but not the others)
Each test must contain 1) the input and 2) the rulenames and/or the spans. To run the tests in your patternset, call the GrammarDetector.run_tests(self) instance method (see Usage: Running Tests in Custom Patternset YAML Files).
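The requirements for a test entry can be expressed as a small check. The following is only a sketch: validate_test is a hypothetical helper, not part of the package, and it operates on the dict that a YAML loader would produce for one entry under tests.

```python
from typing import Any


def validate_test(test: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one test entry (empty if valid)."""
    problems: list[str] = []
    # 1) The input is mandatory and must be non-empty text.
    if not isinstance(test.get("input"), str) or not test["input"].strip():
        problems.append("'input' must be a non-empty string")
    # 2) At least one kind of expectation must be present.
    if not test.get("rulenames") and not test.get("spans"):
        problems.append("at least one of 'rulenames' or 'spans' is required")
    for key in ("rulenames", "spans"):
        value = test.get(key, [])
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"'{key}' must be a list of strings")
    # The skip flag is optional but must be a boolean when given.
    if not isinstance(test.get("skip", False), bool):
        problems.append("'skip' must be a boolean")
    return problems
```

Returning a list of problems rather than raising on the first one makes it possible to report everything wrong with an entry at once.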
The PatternSetRepository reads a patternset YAML file and converts it into an internal PatternSet. Each stored PatternSet can be retrieved individually by using its name as the cache key, or all of them can be retrieved collectively as a list[PatternSet]. The PatternSetRepository extends the Repository[Generic[T]] helper class for creating, caching, and querying.
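The create, cache, and query responsibilities can be pictured with a small generic class. This is a sketch of the general idea only: the method names create, get_one, and get_all and the factory argument are assumptions and may not match the package's Repository helper.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class Repository(Generic[T]):
    """Create items on demand, cache them by key, and query the cache."""

    def __init__(self, factory: Callable[[str], T]) -> None:
        self._factory = factory
        self._cache: dict[str, T] = {}

    def create(self, key: str) -> T:
        # Build the item only once; later calls reuse the cached instance.
        if key not in self._cache:
            self._cache[key] = self._factory(key)
        return self._cache[key]

    def get_one(self, key: str) -> T:
        # Retrieve a single item by its cache key.
        return self._cache[key]

    def get_all(self) -> list[T]:
        # Retrieve every cached item in insertion order.
        return list(self._cache.values())
```

A repository of patternsets would then be a Repository parameterized with the PatternSet type and a factory that loads one YAML file per key.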
The PatternSetMatcher is a wrapper class that is composed of an inner spacy.matcher.Matcher and logic to interpret PatternSets. The patterns defined in the PatternSets are automatically loaded into the inner Matcher. The raw matches from the inner Matcher are then converted into a reader-friendly format.
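The conversion step can be illustrated without spaCy. A spaCy Matcher reports each raw match as a (match_id, start, end) tuple of token indices; the sketch below turns such tuples into (rulename, text) pairs. The function to_readable, the rulenames lookup dict, and the plain word list are stand-ins chosen for illustration, not the package's internals.

```python
RawMatch = tuple[int, int, int]  # (match_id, start, end); end is exclusive


def to_readable(
    raw_matches: list[RawMatch],
    rulenames: dict[int, str],
    words: list[str],
) -> list[tuple[str, str]]:
    """Convert raw match tuples into (rulename, matched text) pairs."""
    readable = []
    for match_id, start, end in raw_matches:
        # Look up the rule's name and join the matched words back into text.
        readable.append((rulenames[match_id], " ".join(words[start:end])))
    return readable
```

Given the words of "The dog was chasing a cat" and a raw match covering indices 2 to 4 for a rule named "past continuous", the result pairs that name with "was chasing".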
The Detector is the internal entrypoint by which a sentence, clause, or phrase is analyzed. A Detector contains one PatternSet and one PatternSetMatcher. Each Detector is bound to the specific grammatical feature of the PatternSet. After loading the GrammarDetector, its Detectors can be accessed via the GrammarDetector.detectors: list[Detector] instance property. This permits running them manually and reusing them. Neither the GrammarDetector nor its Detectors are bound to the text input, so the same instances can be reused across inputs.
The DetectorRepository is responsible for creating and storing Detectors. It is wrapped by the GrammarDetector class, the main entrypoint. The repository manages the PatternSetRepository and loads its PatternSets into the PatternSetMatchers. The DetectorRepository extends the Repository[Generic[T]] helper class for creating, caching, and querying.
This tool is only as good as the patternset YAML files that support it. The primary ways to contribute to this project are:
- Creating new built-in patternsets
- Improving existing patterns in the built-in patternsets
- Adding tests to the built-in patternsets
- Adding new config options and features to the codebase
Cloning the repository:
$ git clone https://github.com/SKCrawford/grammar-detector.git
Preparing the dev environment:
$ pipenv shell
$ pipenv install --dev
$ python -m spacy download en_core_web_md
Running the GrammarDetector from the repository:
$ python -m grammardetector "The dog was chasing a cat into the house."
Running the patternset unittests from the repository:
$ python -m unittest
To add new grammatical features or improve existing features, focus your efforts on the patternsets directory and its YAML files. You may find the token tables generated by the GrammarDetector.token_table(self, input: str, **kwargs) instance method to be helpful for conceptualizing sequences of tokens. You may also find the tokenlike lists generated by the GrammarDetector.token_data(self, input: str, **kwargs) instance method to be helpful when generating new patterns or improving upon existing patterns.
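One way to turn tokenlike output into a first draft of a pattern is sketched below. It assumes the lowercase keys shown in the token_data example output and maps them onto spaCy's uppercase token attributes; the helper tokenlikes_to_pattern is hypothetical, and mapping "word" to ORTH (exact text) is an assumption, since LOWER would be an equally reasonable choice.

```python
from typing import Any

# Map Tokenlike keys to spaCy token-pattern attribute names.
KEY_TO_ATTRIBUTE = {
    "pos": "POS",
    "tag": "TAG",
    "dep": "DEP",
    "lemma": "LEMMA",
    "word": "ORTH",  # assumption: match the exact text of the word
}


def tokenlikes_to_pattern(rulename: str, data: list[dict[str, str]]) -> dict[str, Any]:
    """Build a draft pattern entry from token_data-style output."""
    tokens = [
        {KEY_TO_ATTRIBUTE[key]: value for key, value in entry.items()}
        for entry in data
    ]
    return {"rulename": rulename, "tokens": tokens}
```

A draft produced this way is overly specific, so it would typically be loosened by hand (for example, by dropping attributes or adding OP) before being added to a patternset together with its tests.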
Submissions of patternset files will be rejected if they do not include tests for each pattern.
Steven Kyle Crawford
- 0.2.4
  - New feature: Generate lists of tokens, which may be adapted for use in patternset files, via the GrammarDetector.token_data(self, input: str, **kwargs) instance method.
  - Export the Tokenlike return type for the token_data method for type safety.
  - Improve docstrings for the token_data and token_table methods in the GrammarDetector class and utilities package.
- 0.2.3
  - Rename the GrammarDetector constructor keyword argument from dataset to language_model.
  - Change the default language model from en_core_web_lg to en_core_web_md.
- 0.2.2
  - Rename the patternset file property from meta to config. Retain usage of meta internally to avoid confusion with the Config class.
  - Rename the run_tests keyword argument from internal_tests to builtin_tests.
  - Bugfix the repository's test suite runner.
- 0.2.1
  - Improve readme readability
- 0.2.0
  - Alpha release
- 0.1.0
  - Pre-alpha release
This project is licensed under the GNU General Public License V3. See the LICENSE.txt file for details.