Showing 11 changed files with 701 additions and 285 deletions.
@@ -0,0 +1,27 @@
## Specification of form manipulation

This document specifies the value-to-form processing in Lexibank datasets.

The value-to-form processing is divided into two steps, implemented as methods:
- `FormSpec.split`: Splits a string into individual form chunks.
- `FormSpec.clean`: Normalizes a form chunk.

These methods use the attributes of a `FormSpec` instance to configure their behaviour.

- `brackets`: `{'(': ')'}`
  Pairs of strings that should be recognized as brackets, specified as a `dict` mapping each opening string to its closing string.
- `separators`: `;/,`
  Iterable of single-character tokens that should be recognized as word separators.
- `missing_data`: `['*', '---', '-']`
  Iterable of strings that are used to mark missing data.
- `strip_inside_brackets`: `True`
  Flag signaling whether to strip content in brackets (**and** strip leading and trailing whitespace).
- `replacements`: `[]`
  List of pairs (`source`, `target`) used to replace occurrences of `source` in forms with `target` (before stripping content in brackets).
- `first_form_only`: `False`
  Flag signaling whether at most one form should be returned from `split`, effectively ignoring any spelling variants, etc.
- `normalize_whitespace`: `True`
  Flag signaling whether to normalize whitespace, i.e. stripping leading and trailing whitespace and collapsing multi-character whitespace to single spaces.
- `normalize_unicode`: `None`
  Unicode normalization form to use for input of `split` (`None`, `'NFD'` or `'NFC'`).
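
To make the interplay of these attributes concrete, here is a minimal Python sketch of the two steps. It is not the actual `FormSpec` implementation. The class name `FormSpecSketch`, the helper `forms_from_value`, and the method signatures are hypothetical. The sketch also assumes that the values listed above are defaults, that brackets are single characters, and that the caller applies `clean` to each chunk returned by `split`. Nested brackets are not handled in `clean`.

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class FormSpecSketch:
    """Illustrative stand-in only; names and signatures are assumptions."""

    brackets: Dict[str, str] = field(default_factory=lambda: {"(": ")"})
    separators: str = ";/,"
    missing_data: Sequence[str] = ("*", "---", "-")
    strip_inside_brackets: bool = True
    replacements: List[Tuple[str, str]] = field(default_factory=list)
    first_form_only: bool = False
    normalize_whitespace: bool = True
    normalize_unicode: Optional[str] = None

    def split(self, value: str) -> List[str]:
        """Step 1: split a raw value into chunks at top-level separators."""
        if self.normalize_unicode:
            value = unicodedata.normalize(self.normalize_unicode, value)
        chunks, current, closers = [], [], []
        for char in value:
            if closers and char == closers[-1]:
                closers.pop()  # leaving the innermost open bracket
            elif char in self.brackets:
                closers.append(self.brackets[char])  # entering a bracket
            elif char in self.separators and not closers:
                chunks.append("".join(current))  # separator outside brackets
                current = []
                continue
            current.append(char)
        chunks.append("".join(current))
        return chunks[:1] if self.first_form_only else chunks

    def clean(self, form: str) -> Optional[str]:
        """Step 2: normalize one chunk; None signals missing data."""
        for source, target in self.replacements:
            form = form.replace(source, target)  # before bracket stripping
        if self.strip_inside_brackets:
            for opener, closer in self.brackets.items():
                pattern = re.escape(opener) + ".*?" + re.escape(closer)
                form = re.sub(pattern, "", form)
            form = form.strip()
        if self.normalize_whitespace:
            form = " ".join(form.split())  # strip and collapse whitespace
        if not form or form in self.missing_data:
            return None
        return form

    def forms_from_value(self, value: str) -> List[str]:
        """Hypothetical glue: split, clean each chunk, drop missing data."""
        cleaned = (self.clean(chunk) for chunk in self.split(value))
        return [form for form in cleaned if form is not None]


spec = FormSpecSketch()
print(spec.forms_from_value("hand (body part); arm /  upper   limb, ---"))
# → ['hand', 'arm', 'upper limb']
```

In this sketch a separator inside brackets does not split the value, so `"a (b, c); d"` yields the chunks `"a (b, c)"` and `" d"` before cleaning.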