Normative: Added support for sentence break suppressions to Intl.Segmenter #783

Open
wants to merge 3 commits into
base: main
Changes from 2 commits
19 changes: 12 additions & 7 deletions spec/segmenter.html
@@ -17,16 +17,19 @@ <h1>Intl.Segmenter ( [ _locales_ [ , _options_ ] ] )</h1>

<emu-alg>
1. If NewTarget is *undefined*, throw a *TypeError* exception.
- 1. Let _internalSlotsList_ be &laquo; [[InitializedSegmenter]], [[Locale]], [[SegmenterGranularity]] &raquo;.
+ 1. Let _internalSlotsList_ be &laquo; [[InitializedSegmenter]], [[Locale]], [[SentenceBreakSuppressions]], [[SegmenterGranularity]] &raquo;.
1. Let _segmenter_ be ? OrdinaryCreateFromConstructor(NewTarget, *"%Segmenter.prototype%"*, _internalSlotsList_).
1. Let _requestedLocales_ be ? CanonicalizeLocaleList(_locales_).
1. Set _options_ to ? GetOptionsObject(_options_).
1. Let _opt_ be a new Record.
1. Let _matcher_ be ? GetOption(_options_, *"localeMatcher"*, ~string~, &laquo; *"lookup"*, *"best fit"* &raquo;, *"best fit"*).
1. Set _opt_.[[localeMatcher]] to _matcher_.
+ 1. Let _sentenceBreakSuppressions_ be ? GetOption(_options_, *"sentenceBreakSuppressions"*, ~string~, &laquo; *"none"*, *"standard"* &raquo;, *"none"*).
+ 1. Set _opt_.[[ss]] to _sentenceBreakSuppressions_.
1. Let _localeData_ be %Segmenter%.[[LocaleData]].
1. Let _r_ be ResolveLocale(%Segmenter%.[[AvailableLocales]], _requestedLocales_, _opt_, %Segmenter%.[[RelevantExtensionKeys]], _localeData_).
1. Set _segmenter_.[[Locale]] to _r_.[[locale]].
+ 1. Set _segmenter_.[[SentenceBreakSuppressions]] to _r_.[[ss]].
1. Let _granularity_ be ? GetOption(_options_, *"granularity"*, ~string~, &laquo; *"grapheme"*, *"word"*, *"sentence"* &raquo;, *"grapheme"*).
1. Set _segmenter_.[[SegmenterGranularity]] to _granularity_.
1. Return _segmenter_.
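
For readers less used to spec notation, the new GetOption step above can be sketched in plain JavaScript. This is a minimal sketch under stated assumptions, not the abstract operation itself: the helper name `getSentenceBreakSuppressions` is hypothetical, and the later ResolveLocale negotiation against the `ss` extension key is not modelled.

```javascript
// Hypothetical helper mirroring the GetOption call added in this hunk.
function getSentenceBreakSuppressions(options) {
  const allowed = ["none", "standard"];
  const value = options.sentenceBreakSuppressions;
  // An absent option falls back to the default given in the spec text.
  if (value === undefined) return "none";
  // GetOption with ~string~ coerces the value to a String first.
  const str = `${value}`;
  // Any value outside the allowed list is rejected with a RangeError.
  if (!allowed.includes(str)) {
    throw new RangeError(`Invalid sentenceBreakSuppressions value: ${str}`);
  }
  return str;
}

console.log(getSentenceBreakSuppressions({}));
console.log(getSentenceBreakSuppressions({ sentenceBreakSuppressions: "standard" }));
```

The returned string is what the algorithm stores in _opt_.[[ss]] before calling ResolveLocale.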
@@ -74,11 +77,8 @@ <h1>Internal slots</h1>
</p>

<p>
- The value of the [[RelevantExtensionKeys]] internal slot is &laquo; &raquo;.
+ The value of the [[RelevantExtensionKeys]] internal slot is &laquo; *"ss"* &raquo;.
</p>
- <emu-note>
- Intl.Segmenter does not have any relevant extension keys.
- </emu-note>
Comment on lines -79 to -81
Contributor


lb and lw are only for line break, which Intl.Segmenter doesn't support.

dx is not widely used or implemented, and I've raised questions about its utility.
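
To make the `ss` key concrete: it would be carried in the `-u-` extension of a language tag, for example `en-u-ss-standard`. The following is a minimal sketch of reading such a key, assuming a well-formed tag; the function `unicodeExtensionValue` is hypothetical and is not a full BCP 47 parser (it does not handle private-use `-x-` sequences, for instance).

```javascript
// Hypothetical helper: return the value of a -u- extension key, or undefined.
function unicodeExtensionValue(tag, key) {
  const parts = tag.toLowerCase().split("-");
  const u = parts.indexOf("u");
  if (u === -1) return undefined; // no Unicode extension present
  for (let i = u + 1; i < parts.length; i++) {
    if (parts[i].length === 1) break; // another singleton ends the -u- extension
    if (parts[i] === key) {
      // Keys are two characters; the value subtags that follow are longer.
      const values = [];
      for (let j = i + 1; j < parts.length && parts[j].length > 2; j++) {
        values.push(parts[j]);
      }
      // A key without a value is shorthand for "true".
      return values.join("-") || "true";
    }
  }
  return undefined;
}

console.log(unicodeExtensionValue("en-u-ss-standard", "ss"));
console.log(unicodeExtensionValue("en-US", "ss"));
```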


<p>
The value of the [[LocaleData]] internal slot is implementation-defined within the constraints described in <emu-xref href="#sec-internal-slots"></emu-xref>.
@@ -160,6 +160,9 @@ <h1>Intl.Segmenter.prototype.resolvedOptions ( )</h1>
<td>*"locale"*</td>
</tr>
<tr>
+ <td>[[SentenceBreakSuppressions]]</td>
+ <td>*"sentenceBreakSuppressions"*</td>
+ </tr>
<td>[[SegmenterGranularity]]</td>
<td>*"granularity"*</td>
</tr>
@@ -185,6 +188,7 @@ <h1>Properties of Intl.Segmenter Instances</h1>

<ul>
<li>[[Locale]] is a String value with the language tag of the locale whose localization is used for segmentation.</li>
+ <li>[[SentenceBreakSuppressions]] is one of the String values *"none"* or *"standard"*, identifying whether to suppress certain sentence breaks that would otherwise be found by <a href="https://unicode.org/reports/tr29/">Unicode Standard Annex #29</a> rules.</li>
<li>[[SegmenterGranularity]] is one of the String values *"grapheme"*, *"word"*, or *"sentence"*, identifying the kind of text element to segment.</li>
</ul>
</emu-clause>
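
From the caller's side, the proposed option might be used as sketched below. This assumes the option name and values from this PR; engines that do not implement the PR never read the `sentenceBreakSuppressions` property, so the code still runs there and simply segments without suppressions. The sample text is illustrative, and the exact segments produced depend on the engine's locale data.

```javascript
const text = "Mr. Smith arrived. He sat down.";

const segmenter = new Intl.Segmenter("en", {
  granularity: "sentence",
  // Proposed in this PR; ignored by engines that do not implement it.
  sentenceBreakSuppressions: "standard",
});

// Collect the text of each sentence segment.
const sentences = Array.from(segmenter.segment(text), (s) => s.segment);
console.log(sentences);

// With the PR applied, this object would also report sentenceBreakSuppressions.
console.log(segmenter.resolvedOptions());
```

Under the default rules a break is commonly reported after "Mr. "; the intent of *"standard"* is that such abbreviation breaks are suppressed.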
@@ -394,17 +398,18 @@ <h1>FindBoundary ( _segmenter_, _string_, _startIndex_, _direction_ )</h1>
<emu-note>Boundary determination is implementation-dependent, but general default algorithms are specified in <a href="https://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>. It is recommended that implementations use locale-sensitive tailorings such as those provided by the Common Locale Data Repository (available at <a href="https://cldr.unicode.org">https://cldr.unicode.org</a>).</emu-note>
<emu-alg>
1. Let _locale_ be _segmenter_.[[Locale]].
+ 1. Let _sentenceBreakSuppressions_ be _segmenter_.[[SentenceBreakSuppressions]].
1. Let _granularity_ be _segmenter_.[[SegmenterGranularity]].
1. Let _len_ be the length of _string_.
1. If _direction_ is ~before~, then
1. Assert: _startIndex_ &ge; 0.
1. Assert: _startIndex_ &lt; _len_.
- 1. Search _string_ for the last segmentation boundary that is preceded by at most _startIndex_ code units from the beginning, using locale _locale_ and text element granularity _granularity_.
+ 1. Search _string_ for the last segmentation boundary that is preceded by at most _startIndex_ code units from the beginning, using locale _locale_, sentence break suppressions _sentenceBreakSuppressions_, and text element granularity _granularity_.
1. If a boundary is found, return the count of code units in _string_ preceding it.
1. Return 0.
1. Assert: _direction_ is ~after~.
1. If _len_ is 0 or _startIndex_ &ge; _len_, return +&infin;.
- 1. Search _string_ for the first segmentation boundary that follows the code unit at index _startIndex_, using locale _locale_ and text element granularity _granularity_.
+ 1. Search _string_ for the first segmentation boundary that follows the code unit at index _startIndex_, using locale _locale_, sentence break suppressions _sentenceBreakSuppressions_, and text element granularity _granularity_.
1. If a boundary is found, return the count of code units in _string_ preceding it.
1. Return _len_.
</emu-alg>
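
The control flow of FindBoundary with the extra parameter can be sketched as follows. This is a toy model only: the boundary detector `toyBoundaries` and the abbreviation list `SUPPRESSED` are hypothetical stand-ins for the implementation-dependent UAX #29 rules and CLDR suppression data, and locale handling is omitted.

```javascript
// Hypothetical abbreviation list; real data would come from CLDR.
const SUPPRESSED = ["Mr.", "Mrs.", "Dr."];

// Toy detector: a boundary falls after a terminator followed by spaces,
// unless suppressions are "standard" and the terminator ends a listed abbreviation.
function toyBoundaries(string, suppressions) {
  const out = [];
  const re = /[.!?] +/g;
  let m;
  while ((m = re.exec(string)) !== null) {
    const end = m.index + m[0].length;          // code units preceding the boundary
    const head = string.slice(0, m.index + 1);  // text up to and including the terminator
    const isAbbr = SUPPRESSED.some((a) => head === a || head.endsWith(" " + a));
    if (suppressions === "standard" && isAbbr) continue;
    if (end < string.length) out.push(end);
  }
  return out;
}

// Mirrors the algorithm steps: ~before~ returns the last boundary preceded by
// at most startIndex code units (or 0); ~after~ returns the first boundary
// following startIndex (or len), and +Infinity when startIndex is out of range.
function findBoundary(string, startIndex, direction, suppressions) {
  const len = string.length;
  const bounds = toyBoundaries(string, suppressions);
  if (direction === "before") {
    const prior = bounds.filter((b) => b <= startIndex);
    return prior.length > 0 ? prior[prior.length - 1] : 0;
  }
  if (len === 0 || startIndex >= len) return Infinity;
  const next = bounds.find((b) => b > startIndex);
  return next !== undefined ? next : len;
}

const sample = "Mr. Smith left. Bye.";
console.log(findBoundary(sample, 0, "after", "none"));     // 4
console.log(findBoundary(sample, 0, "after", "standard")); // 16
```

In this model the only effect of *"standard"* is to drop the boundary after "Mr. "; every other step of the algorithm is unchanged.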