word splitting regex fails with underdot characters #40

chchch · 2018-02-12T06:48:32Z

Hi,

I'm using hypher.js with transliterated Sanskrit, and it doesn't handle characters such as ṇ, ṣ, ḍ, and ṭ correctly. The problem seems to be the long regex used to split a string into words (line 107 of hypher.js). I guess its character class doesn't include the Unicode range for underdot characters (these are in Latin Extended Additional, U+1E00–U+1EFF). I've replaced it with a simpler expression:
var words = str.split(/([\s\n\r\t.,:;'"!?-])/g);
This matches word-boundary characters instead of word characters. It works for me, but it isn't comprehensive: more boundary characters would have to be added to make it work for other languages.
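
To make the difference concrete, here is a minimal sketch of two ways to do the split. Both function names are hypothetical and neither is part of hypher.js. The first is the expression above with the redundant `\n\r\t` dropped (`\s` already covers them) and without the `g` flag, which has no effect on `split`. The second is only an assumption about how this could be generalized: it treats any run of Unicode letters and combining marks as a word, and it needs an engine that supports Unicode property escapes (ES2018).

```javascript
// Hypothetical helper: split on a fixed set of boundary characters.
// The capture group keeps each delimiter in the result array.
function splitOnBoundaries(str) {
  return str.split(/([\s.,:;'"!?-])/);
}

// Hypothetical alternative: split on runs of anything that is not a
// Unicode letter (\p{L}) or combining mark (\p{M}). This covers both
// precomposed characters such as ṇ and a base letter followed by
// U+0323 COMBINING DOT BELOW.
function splitOnNonLetters(str) {
  return str.split(/([^\p{L}\p{M}]+)/u);
}

console.log(splitOnBoundaries('kṛṣṇa, viṣṇu'));
console.log(splitOnNonLetters('kṛṣṇa, viṣṇu')); // [ 'kṛṣṇa', ', ', 'viṣṇu' ]
```

Note that the first version yields an empty string between two adjacent delimiters (here between the comma and the space), so whatever consumes the array has to tolerate empty entries.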
