A language-based approach to Unicode filtering.
This project is a WORK IN PROGRESS!
No stable versions are available yet!
Language-Linter is a library that helps projects filter Unicode characters based on linguistic characteristics.
Given an ISO 639 identifier, LangLint's methods enable the identification and management of foreign-language characters.
For example, in ar (Arabic) environments, users will be able to use 'هذه الرموز', but not '这些符号'.
Conversely, in zh (Chinese) environments, users will be able to use '这些符号', but not 'هذه الرموز'.
Subject to configuration, Basic Latin characters (U+0000 to U+007F) may be whitelisted regardless of locale.
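Since no stable API exists yet, the behaviour described above can only be sketched. The following is a minimal illustration built on the JDK's `Character.UnicodeScript`, not Language-Linter's actual interface: the class name, the `isAllowed` method, and the two-entry mapping are all hypothetical, and the real mappings are compiled from CLDR data rather than written by hand.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from Language-Linter itself.
final class LocaleFilterSketch {

    // Hand-written sample mapping from ISO 639-1 codes to scripts (an assumption
    // made for illustration; it covers only the two languages used above).
    private static final Map<String, Set<UnicodeScript>> SCRIPTS_BY_LANGUAGE = Map.of(
            "ar", EnumSet.of(UnicodeScript.ARABIC),
            "zh", EnumSet.of(UnicodeScript.HAN));

    // Returns true if every code point belongs to a script mapped to the language,
    // or, when allowBasicLatin is set, falls within U+0000..U+007F.
    static boolean isAllowed(String input, String languageCode, boolean allowBasicLatin) {
        Set<UnicodeScript> allowed = SCRIPTS_BY_LANGUAGE.get(languageCode);
        if (allowed == null) {
            throw new IllegalArgumentException("No sample mapping for: " + languageCode);
        }
        return input.codePoints().allMatch(cp ->
                (allowBasicLatin && cp <= 0x7F) || allowed.contains(UnicodeScript.of(cp)));
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("هذه الرموز", "ar", true)); // true
        System.out.println(isAllowed("这些符号", "ar", true));     // false
    }
}
```

Note that the space in 'هذه الرموز' is accepted only because Basic Latin is whitelisted in this call; with the whitelist disabled, this sketch would reject it, since spaces belong to the Common script rather than to Arabic.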
Development of this library was started by the SG-Rewritten Project to accommodate the use case outlined below:
Under its default configuration, Stargate allows its end users to name their own gates, networks, etc.
While gate names are used to identify specific portals, network names serve to identify groups of portals.
In both cases, the plugin facilitates valid use cases wherein other players may need to retype the collected strings.
Accordingly, such strings must be memorable, legible, and most importantly, capable of being copied by other players.
It follows that such strings should exclude Unicode characters that most of the relevant userbase cannot type.
Through an ISO 639-1 config option, SG has multilingual support; thus, filtering out non-Latin characters is not an option.
Instead, SG must filter out all Unicode characters that are supported by neither Latin nor a target locale.
Language-Linter is a library developed by SG-Rewritten; its goal is to facilitate linguistic Unicode filtration.
Apart from the above, we have thought of three other situations in which this library may be useful:
- Helping chat filtration systems detect spam and bypasses.
- Sanitizing user inputs to ensure they can be easily reproduced by other users.
- Ensuring string legibility and preventing things such as t̶̪̅h̷̼͝í̴̼s̸̬̋.
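As an illustration of the legibility use case, stacked-diacritic text can be flagged by counting consecutive combining marks. This is a rough sketch using only the JDK's `Character.getType`; the class, the method names, and the idea of a per-base-character threshold are assumptions made here and are not part of Language-Linter.

```java
// Hypothetical sketch: the class, method names, and threshold are illustrative assumptions.
final class CombiningMarkSketch {

    // Returns the length of the longest run of consecutive combining marks
    // (general categories Mn, Mc, and Me) found in the input.
    static int longestMarkRun(String input) {
        int longest = 0;
        int current = 0;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            int type = Character.getType(cp);
            boolean isMark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            current = isMark ? current + 1 : 0;
            longest = Math.max(longest, current);
            i += Character.charCount(cp); // advance by one full code point
        }
        return longest;
    }

    // Flags input whose longest mark run exceeds the caller-chosen limit.
    static boolean looksStacked(String input, int maxMarksPerBase) {
        return longestMarkRun(input) > maxMarksPerBase;
    }
}
```

A suitable limit depends on the target language: some writing systems legitimately attach several marks to one base character, so a single fixed threshold would likely need to vary by locale.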
- Tools to convert from ISO 639-1 to Unicode block aliases.
- Tools to filter strings based on a passed ISO 639-1 locale.
- Future features TBD.
LangLint maps languages to scripts; it also provides a collection of methods that interact with these mappings.
The mappings themselves are compiled with data from the Unicode Consortium's Common Locale Data Repository.
- All extant (and most extinct) languages can be represented by an ISO 639 ID.
- These IDs are assigned by the International Organization for Standardization (ISO).
- Unicode is divided into scripts.
- Scripts are groups of symbols with common histories; generally, they are related to systems of writing.
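To make the notion of a script concrete, the JDK already exposes the Unicode script property through `Character.UnicodeScript`. The sketch below simply collects the scripts present in a string; the class and method names are hypothetical and say nothing about how LangLint itself is implemented.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper, shown only to illustrate the script property.
final class ScriptInspectionSketch {

    // Collects the Unicode script of every code point in the input.
    static Set<UnicodeScript> scriptsIn(String input) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        input.codePoints().forEach(cp -> scripts.add(UnicodeScript.of(cp)));
        return scripts;
    }
}
```

For instance, `scriptsIn("abc这")` would contain `LATIN` and `HAN`, while punctuation and spaces are reported as `COMMON` because they are shared across writing systems.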
Not yet available.