Normative: Added support for sentence break suppressions to Intl.Segmenter #783

Open
wants to merge 3 commits into
base: main
3 changes: 3 additions & 0 deletions spec/annexes.html
@@ -164,6 +164,9 @@ <h1>Implementation Dependent Behaviour</h1>
<li>
In Segmenter:
<ul>
+<li>
+The set of data used for standard sentence break suppressions (<emu-xref href="#sec-intl-segmenter-constructor"></emu-xref>)
+</li>
<li>
Boundary determination algorithms (<emu-xref href="#sec-findboundary"></emu-xref>)
</li>
17 changes: 15 additions & 2 deletions spec/locale.html
@@ -19,7 +19,7 @@ <h1>Intl.Locale ( _tag_ [ , _options_ ] )</h1>
<emu-alg>
1. If NewTarget is *undefined*, throw a *TypeError* exception.
1. Let _relevantExtensionKeys_ be %Locale%.[[RelevantExtensionKeys]].
-1. Let _internalSlotsList_ be &laquo; [[InitializedLocale]], [[Locale]], [[Calendar]], [[Collation]], [[HourCycle]], [[NumberingSystem]] &raquo;.
+1. Let _internalSlotsList_ be &laquo; [[InitializedLocale]], [[Locale]], [[Calendar]], [[Collation]], [[HourCycle]], [[NumberingSystem]], [[SentenceBreakSuppressions]] &raquo;.
1. If _relevantExtensionKeys_ contains *"kf"*, then
1. Append [[CaseFirst]] as the last element of _internalSlotsList_.
1. If _relevantExtensionKeys_ contains *"kn"*, then
@@ -52,6 +52,8 @@ <h1>Intl.Locale ( _tag_ [ , _options_ ] )</h1>
1. If _numberingSystem_ is not *undefined*, then
1. If _numberingSystem_ does not match the Unicode Locale Identifier `type` nonterminal, throw a *RangeError* exception.
1. Set _opt_.[[nu]] to _numberingSystem_.
+1. Let _sentenceBreakSuppressions_ be ? GetOption(_options_, *"sentenceBreakSuppressions"*, ~string~, &laquo; *"none"*, *"standard"* &raquo;, *"none"*).
Review comment (Contributor):

For the Intl.Locale constructor, the default value for "sentenceBreakSuppressions" should be undefined, consistent with the other options such as "calendar", "collation", "hourCycle", "caseFirst", "numeric", and "numberingSystem".
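
To make the point in this comment concrete, here is a minimal TypeScript sketch of how the fallback argument passed to GetOption decides what the constructor sees when the caller omits the option. `getStringOption` is a hypothetical, simplified stand-in for the spec's GetOption abstract operation (string-typed options only); it is not the spec text and not an existing API.

```typescript
// Hypothetical, simplified model of GetOption for string-typed options.
function getStringOption(
  options: Record<string, unknown>,
  property: string,
  allowed: readonly string[],
  fallback: string | undefined
): string | undefined {
  const raw = options[property];
  // An omitted option yields the fallback unchanged.
  if (raw === undefined) return fallback;
  const value = String(raw);
  // A non-empty allow-list rejects anything outside it.
  if (allowed.length > 0 && allowed.indexOf(value) === -1) {
    throw new RangeError("Invalid value " + value + " for option " + property);
  }
  return value;
}

const allowedValues = ["none", "standard"];

// Fallback as written in the diff: an omitted option becomes "none".
const withNoneFallback = getStringOption({}, "sentenceBreakSuppressions", allowedValues, "none");

// Fallback suggested in the comment: an omitted option stays undefined.
const withUndefinedFallback = getStringOption({}, "sentenceBreakSuppressions", allowedValues, undefined);

console.log(withNoneFallback, withUndefinedFallback);
```

With an `undefined` fallback, the constructor can tell "not requested" apart from an explicit `"none"`, which is the behaviour the comment attributes to the other Intl.Locale options.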

+1. Set _opt_.[[ss]] to _sentenceBreakSuppressions_.
1. Let _r_ be ! ApplyUnicodeExtensionToTag(_tag_, _opt_, _relevantExtensionKeys_).
1. Set _locale_.[[Locale]] to _r_.[[locale]].
1. Set _locale_.[[Calendar]] to _r_.[[ca]].
@@ -65,6 +67,7 @@ <h1>Intl.Locale ( _tag_ [ , _options_ ] )</h1>
1. Else,
1. Set _locale_.[[Numeric]] to *false*.
1. Set _locale_.[[NumberingSystem]] to _r_.[[nu]].
+1. Set _locale_.[[SentenceBreakSuppressions]] to _r_.[[ss]].
1. Return _locale_.
</emu-alg>
</emu-clause>
@@ -175,7 +178,7 @@ <h1>Intl.Locale.prototype</h1>
<h1>Internal slots</h1>

<p>
-The value of the [[RelevantExtensionKeys]] internal slot is &laquo; *"ca"*, *"co"*, *"hc"*, *"kf"*, *"kn"*, *"nu"* &raquo;. If %Collator%.[[RelevantExtensionKeys]] does not contain *"kf"*, then remove *"kf"* from %Locale%.[[RelevantExtensionKeys]]. If %Collator%.[[RelevantExtensionKeys]] does not contain *"kn"*, then remove *"kn"* from %Locale%.[[RelevantExtensionKeys]].
+The value of the [[RelevantExtensionKeys]] internal slot is &laquo; *"ca"*, *"co"*, *"hc"*, *"kf"*, *"kn"*, *"nu"*, *"ss"* &raquo;. If %Collator%.[[RelevantExtensionKeys]] does not contain *"kf"*, then remove *"kf"* from %Locale%.[[RelevantExtensionKeys]]. If %Collator%.[[RelevantExtensionKeys]] does not contain *"kn"*, then remove *"kn"* from %Locale%.[[RelevantExtensionKeys]].
</p>
</emu-clause>
</emu-clause>
@@ -312,6 +315,16 @@ <h1>get Intl.Locale.prototype.numberingSystem</h1>
</emu-alg>
</emu-clause>

+<emu-clause id="sec-Intl.Locale.prototype.sentenceBreakSuppressions">
+<h1>get Intl.Locale.prototype.sentenceBreakSuppressions</h1>
+<p>`Intl.Locale.prototype.sentenceBreakSuppressions` is an accessor property whose set accessor function is *undefined*. Its get accessor function performs the following steps:</p>
+<emu-alg>
+1. Let _loc_ be the *this* value.
+1. Perform ? RequireInternalSlot(_loc_, [[InitializedLocale]]).
+1. Return _loc_.[[SentenceBreakSuppressions]].
+</emu-alg>
+</emu-clause>
+
<emu-clause id="sec-Intl.Locale.prototype.language">
<h1>get Intl.Locale.prototype.language</h1>
<p>`Intl.Locale.prototype.language` is an accessor property whose set accessor function is *undefined*. The following algorithm refers to <a href="https://www.unicode.org/reports/tr35/#Identifiers">UTS 35's Unicode Language and Locale Identifiers grammar</a>. Its get accessor function performs the following steps:</p>
19 changes: 12 additions & 7 deletions spec/segmenter.html
@@ -17,16 +17,19 @@ <h1>Intl.Segmenter ( [ _locales_ [ , _options_ ] ] )</h1>

<emu-alg>
1. If NewTarget is *undefined*, throw a *TypeError* exception.
-1. Let _internalSlotsList_ be &laquo; [[InitializedSegmenter]], [[Locale]], [[SegmenterGranularity]] &raquo;.
+1. Let _internalSlotsList_ be &laquo; [[InitializedSegmenter]], [[Locale]], [[SentenceBreakSuppressions]], [[SegmenterGranularity]] &raquo;.
1. Let _segmenter_ be ? OrdinaryCreateFromConstructor(NewTarget, *"%Segmenter.prototype%"*, _internalSlotsList_).
1. Let _requestedLocales_ be ? CanonicalizeLocaleList(_locales_).
1. Set _options_ to ? GetOptionsObject(_options_).
1. Let _opt_ be a new Record.
1. Let _matcher_ be ? GetOption(_options_, *"localeMatcher"*, ~string~, &laquo; *"lookup"*, *"best fit"* &raquo;, *"best fit"*).
1. Set _opt_.[[localeMatcher]] to _matcher_.
+1. Let _sentenceBreakSuppressions_ be ? GetOption(_options_, *"sentenceBreakSuppressions"*, ~string~, &laquo; *"none"*, *"standard"* &raquo;, *"none"*).
+1. Set _opt_.[[ss]] to _sentenceBreakSuppressions_.
1. Let _localeData_ be %Segmenter%.[[LocaleData]].
1. Let _r_ be ResolveLocale(%Segmenter%.[[AvailableLocales]], _requestedLocales_, _opt_, %Segmenter%.[[RelevantExtensionKeys]], _localeData_).
1. Set _segmenter_.[[Locale]] to _r_.[[locale]].
+1. Set _segmenter_.[[SentenceBreakSuppressions]] to _r_.[[ss]].
1. Let _granularity_ be ? GetOption(_options_, *"granularity"*, ~string~, &laquo; *"grapheme"*, *"word"*, *"sentence"* &raquo;, *"grapheme"*).
1. Set _segmenter_.[[SegmenterGranularity]] to _granularity_.
1. Return _segmenter_.
@@ -74,11 +77,8 @@ <h1>Internal slots</h1>
</p>

<p>
-The value of the [[RelevantExtensionKeys]] internal slot is &laquo; &raquo;.
+The value of the [[RelevantExtensionKeys]] internal slot is &laquo; *"ss"* &raquo;.
</p>
-<emu-note>
-Intl.Segmenter does not have any relevant extension keys.
-</emu-note>
Comment on lines -79 to -81

Review comment (Member):

Review comment (Contributor):

lb and lw are only for line break, which Intl.Segmenter doesn't support.

dx is not widely used or implemented, and I've raised questions about its utility.


<p>
The value of the [[LocaleData]] internal slot is implementation-defined within the constraints described in <emu-xref href="#sec-internal-slots"></emu-xref>.
@@ -160,6 +160,9 @@ <h1>Intl.Segmenter.prototype.resolvedOptions ( )</h1>
<td>*"locale"*</td>
</tr>
<tr>
+<td>[[SentenceBreakSuppressions]]</td>
+<td>*"sentenceBreakSuppressions"*</td>
+</tr>
<td>[[SegmenterGranularity]]</td>
<td>*"granularity"*</td>
</tr>
@@ -185,6 +188,7 @@ <h1>Properties of Intl.Segmenter Instances</h1>

<ul>
<li>[[Locale]] is a String value with the language tag of the locale whose localization is used for segmentation.</li>
+<li>[[SentenceBreakSuppressions]] is one of the String values *"none"* or *"standard"*, identifying whether to suppress certain sentence breaks that would otherwise be found by <a href="https://unicode.org/reports/tr29/">Unicode Standard Annex #29</a> rules.</li>
<li>[[SegmenterGranularity]] is one of the String values *"grapheme"*, *"word"*, or *"sentence"*, identifying the kind of text element to segment.</li>
</ul>
</emu-clause>
@@ -394,17 +398,18 @@ <h1>FindBoundary ( _segmenter_, _string_, _startIndex_, _direction_ )</h1>
<emu-note>Boundary determination is implementation-dependent, but general default algorithms are specified in <a href="https://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>. It is recommended that implementations use locale-sensitive tailorings such as those provided by the Common Locale Data Repository (available at <a href="https://cldr.unicode.org">https://cldr.unicode.org</a>).</emu-note>
<emu-alg>
1. Let _locale_ be _segmenter_.[[Locale]].
+1. Let _sentenceBreakSuppressions_ be _segmenter_.[[SentenceBreakSuppressions]].
1. Let _granularity_ be _segmenter_.[[SegmenterGranularity]].
1. Let _len_ be the length of _string_.
1. If _direction_ is ~before~, then
1. Assert: _startIndex_ &ge; 0.
1. Assert: _startIndex_ &lt; _len_.
-1. Search _string_ for the last segmentation boundary that is preceded by at most _startIndex_ code units from the beginning, using locale _locale_ and text element granularity _granularity_.
+1. Search _string_ for the last segmentation boundary that is preceded by at most _startIndex_ code units from the beginning, using locale _locale_, sentence break suppressions _sentenceBreakSuppressions_, and text element granularity _granularity_.
1. If a boundary is found, return the count of code units in _string_ preceding it.
1. Return 0.
1. Assert: _direction_ is ~after~.
1. If _len_ is 0 or _startIndex_ &ge; _len_, return +&infin;.
-1. Search _string_ for the first segmentation boundary that follows the code unit at index _startIndex_, using locale _locale_ and text element granularity _granularity_.
+1. Search _string_ for the first segmentation boundary that follows the code unit at index _startIndex_, using locale _locale_, sentence break suppressions _sentenceBreakSuppressions_, and text element granularity _granularity_.
1. If a boundary is found, return the count of code units in _string_ preceding it.
1. Return _len_.
</emu-alg>
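
As a rough illustration of what the extra _sentenceBreakSuppressions_ input changes in the search steps above, the following TypeScript sketch finds sentence starts with and without suppressions. It is a toy under stated assumptions: the boundary rule is deliberately naive (terminator, whitespace, uppercase letter) rather than the UAX #29 rules, and the three-entry suppression list is made up rather than taken from CLDR. The names `SUPPRESSIONS` and `sentenceBoundaries` are hypothetical.

```typescript
// Hypothetical suppression data; a real implementation would use locale data.
const SUPPRESSIONS = ["Mr.", "Mrs.", "Dr."];

type SuppressionMode = "none" | "standard";

// Returns the code unit indices at which a new sentence starts (index 0 excluded).
function sentenceBoundaries(text: string, mode: SuppressionMode): number[] {
  const boundaries: number[] = [];
  // Naive rule: a terminator, then whitespace, then an uppercase letter.
  const pattern = /[.!?]\s+(?=[A-Z])/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const boundary = match.index + match[0].length;
    // Text up to and including the terminator that triggered the candidate break.
    const head = text.slice(0, match.index + 1);
    const suppressed =
      mode === "standard" &&
      SUPPRESSIONS.some((abbr) => head.slice(-abbr.length) === abbr);
    if (!suppressed) boundaries.push(boundary);
  }
  return boundaries;
}

const sample = "Mr. Smith left. He was late.";
console.log(sentenceBoundaries(sample, "none"));     // [ 4, 16 ]
console.log(sentenceBoundaries(sample, "standard")); // [ 16 ]
```

In this sketch the candidate break after "Mr." is dropped only in `"standard"` mode; the break before "He" is found in both modes.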