-
Notifications
You must be signed in to change notification settings - Fork 0
/
n2231.html
997 lines (903 loc) · 37.4 KB
/
n2231.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
<!doctype html public "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<head>
<title>char8_t: A type for UTF-8 characters and strings</title>
<link rel="stylesheet"
href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/default.min.css"/>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<style type="text/css">
pre {
display: inline;
}
table#header th,
table#header td
{
text-align: left;
}
table#references th,
table#references td
{
vertical-align: top;
}
ins, ins * { text-decoration:none; font-weight:bold; background-color:#A0FFA0 }
del, del * { text-decoration:line-through; background-color:#FFA0A0 }
#hidedel:checked ~ * del, #hidedel:checked ~ * del * { display:none; visibility:hidden }
blockquote
{
color: #000000;
background-color: #F1F1F1;
border: 1px solid #D1D1D1;
padding-left: 0.5em;
padding-right: 0.5em;
}
blockquote.stdins
{
text-decoration: underline;
color: #000000;
background-color: #C8FFC8;
border: 1px solid #B3EBB3;
padding: 0.5em;
}
blockquote.stddel
{
text-decoration: line-through;
color: #000000;
background-color: #FFEBFF;
border: 1px solid #ECD7EC;
padding-left: 0.5empadding-right: 0.5em;
}
div.compare {
padding-left: 40px;
display: table; /* undo float:left effect */
}
div.compare_item {
float: left;
margin: 2px;
}
</style>
</head>
<body>
<table id="header">
<tr>
<th>Proposal for C2x</th>
</tr>
<tr>
<th>WG14 N2231</th>
</tr>
<tr>
<th/>
</tr>
<tr>
<th>Title:</th>
<td>char8_t: A type for UTF-8 characters and strings</td>
</tr>
<tr>
<th>Author:</th>
<td>Tom Honermann <tom@honermann.net></td>
</tr>
<tr>
<th>Date:</th>
<td>2018-03-25</td>
</tr>
<tr>
<th>Proposal category:</th>
<td>New features, change to existing features</td>
</tr>
<tr>
<th>Target audience:</th>
<td>Developers working on combined C and C++ code bases</td>
</tr>
</table>
<p>
<strong>Abstract:</strong> A
<a title="[WG21 P0482R1]: char8_t: A type for UTF-8 characters and strings (Revision 1)"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r1.html">
proposal</a>
<sup><a title="[WG21 P0482R1]: char8_t: A type for UTF-8 characters and strings (Revision 1)"
href="#ref_wg21_p0482r1">
[WG21 P0482R1]</a></sup>
currently under consideration for C++ adds a new <tt>char8_t</tt> fundamental
type to be used as the code unit type of <tt>u8</tt> string and character
literals. This paper proposes a corresponding <tt>char8_t</tt> typedef and
related library functions to enable conversions between the execution
character encoding and UTF-8. These facilities are intended to improve
support for UTF-8 and to retain source code compatibility across the C
and C++ languages.
</p>
<ul>
<li><a href="#introduction">
Introduction</a></li>
<li><a href="#motivation">
Motivation</a></li>
<li><a href="#proposal">
Proposal</a></li>
<li><a href="#backward_compat">
Backward Compatibility</a></li>
<li><a href="#implementation_exp">
Implementation Experience</a></li>
<li><a href="#wording">
Formal Wording</a>
</li>
<li><a href="#acknowledgements">
Acknowledgements</a></li>
<li><a href="#references">
References</a></li>
</ul>
<h1 id="introduction">Introduction</h1>
<p>C11 introduced support for UTF-8, 16-bit, and 32-bit encoded string
literals. New <tt>char16_t</tt> and <tt>char32_t</tt> typedefs were added to
hold values of code units for the 16-bit and 32-bit variants, but a new type
was not added for the UTF-8 variant. Instead, UTF-8 string literals were
defined in terms of the <tt>char</tt> type used for the code unit type of
ordinary string literals. UTF-8 is the only text encoding mandated to be
supported by the C standard for which there is no distinctly named code unit
type.
</p>
<p>Whether <tt>char</tt> is a signed or unsigned type is implementation
defined and implementations that use an 8-bit signed char are at a
disadvantage with respect to working with UTF-8 encoded text due to the
necessity of having to rely on conversions to unsigned types in order to
correctly process leading and continuation code units of multi-byte encoded
code points.
</p>
<p>The lack of a distinct type and the use of a code unit type with a range
that does not portably include the full unsigned range of UTF-8 code units
presents challenges for working with UTF-8 encoded text that are not present
when working with UTF-16 or UTF-32 encoded text. Enclosed is a proposal for
a new <tt>char8_t</tt> typedef and related library enhancements intended to
remove barriers to working with UTF-8 encoded text and to enable working with
all five of the standard mandated text encodings in a consistent manner.
</p>
<h1 id="motivation">Motivation</h1>
<p>As of November 2017,
<a title="Usage of UTF-8 for websites"
href="https://w3techs.com/technologies/details/en-utf8/all/all">
UTF-8 is now used by more than 90% of all websites</a>
<sup><a title="Usage of UTF-8 for websites"
href="#ref_w3techs">
[W3Techs]</a></sup>.
While UTF-8 now dominates websites, it has not attained similar usage success
as the execution character encoding of C and C++ compilers. Important
compilers, such as Microsoft's Visual Studio, do not support use of UTF-8 as
the execution character encoding<sup>[*]</sup>. Programs that must consume
and produce text in the execution character encoding and manipulate UTF-8
text must choose one of two approaches to managing text in these distinct
encodings:
<ol>
<li>Use <tt>char</tt> for both encodings while being careful to transcode
between the encodings when necessary.</li>
<li>Use <tt>char</tt> for the execution character encoding, and another
type, generally <tt>unsigned char</tt>, for UTF-8.</li>
</ol>
</p>
<p>The challenge with the first approach is ensuring that text is
appropriately transcoded and is in the correct encoding when passed to
other functions. Since the same type, <tt>char</tt>, is used as the code unit
type for both encodings, the programmer is unable to rely on the type system
to help identify mistakes.
</p>
<p>The challenge with the second approach is that UTF-8 string literals have
type array of <tt>char</tt>. Direct comparisons with UTF-8 string literals
are subject to sign mismatch (depending on the sign of <tt>char</tt>), and
attempts to assign pointers to the desired code unit type directly to UTF-8
string literals results in assignment from incompatible pointer types
(regardless of the sign of <tt>char</tt>).
</p>
<p>The following example demonstrates a potential consequence of failure to
manage character encodings correctly. The <tt>mb_utf8.c</tt> example
incorrectly passes UTF-8 string literals to the "ANSI" version of the Windows
<tt>MessageBox()</tt> function. This function requires strings to be provided
in the system encoding (Windows-1252 on the Windows 10 sytem used to produce
the output below). As shown, when run, mojibake is produced. The
<tt>mb_utf16.c</tt> example is a correct program intended to demonstrate that
Windows supports the example Unicode characters and is able to display them
correctly. This example is intended to demonstrate that, though the
<tt>mb_utf8.c</tt> code is incorrect, the compiler is unable to assist in
diagnosing what is wrong.
</p>
<p>
<div class="compare">
<div class="compare_item">
<fieldset><legend>mb_utf8.c</legend>
<pre><code class="c">
#include <windows.h>
int main() {
const char *caption =
u8"\U0001F631"; // U+1F631 FACE SCREAMING IN FEAR
const char *message =
u8"\U0001F648" // U+1F648 SEE-NO-EVIL MONKEY
u8"\U0001F649" // U+1F649 HEAR-NO-EVIL MONKEY
u8"\U0001F64A"; // U+1F64A SPEAK-NO-EVIL MONKEY
MessageBoxA(NULL, message, caption, MB_OK);
}
</code></pre></fieldset>
<fieldset><legend>Compile</legend>
<pre><code class="Bash">
> cl mb_utf8.c /Femb_utf8.exe user32.lib
Microsoft (R) C/C++ Optimizing Compiler Version 19.13.26128 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
mb_utf8.c
Microsoft (R) Incremental Linker Version 14.13.26128.0
Copyright (C) Microsoft Corporation. All rights reserved.
/out:mb_utf8.exe
mb_utf8.obj
user32.lib
</code></pre></fieldset>
<fieldset><legend>Run</legend>
<pre><code class="Bash">
> mb_utf8.exe
</code></pre>
<img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAIIAAACFCAYAAACXBiBFAAAABHNCSVQICAgIfAhkiAAAByJJREFU
eJzt3LFTWtkCx/Evb9JmAo5/QBTNzoANOlpg6Zg3Zou4jtpud4kWG5sUmbF0xsIGKl/ubrFpwVFT
CJMQy/cKHbFxLIJiinQ6Qv6C8woPiAYU8Sq4+X1mMqsXOPewfD33QJz4wuGwQX56jwD29vZaPQ9p
ob6+Pv7V6klIe1AIAigEsRSCAApBLIUggEIQy6MQjth0V9gtwtGmy8pu0Zth5d54FEIXI84kkQB0
jThMRgLeDCs1ua7LyclJ3dtPTk5wXfdGYzYeQnGXFdfF3TyiuLuC625yRJ0V4Giz6n7lleLs/nJ7
ExMTrH/4UDOGk5MT1j98YGJi4kZjNh5C4CnBDuAwS+p0AMcZoQvo6h+E7VzVi1xkd6fIYL+fr6dB
ppwBTlMuWbrputHUpJ7Ozk7GX778IYZyBOMvX9LZ2XmzQcPhsGnIacEUTu2XuZRJ5U4rNxU+vzv/
vvDZvPtcuG4wk0t9NtfdS652fHxs/vzrL3N8fHzh65sKh8PmUcPFBKDgumQBOjroOM1xFDlfFXY2
v1KMwNedQ4IDIzerUZpSvTIAza0EVoMhFNldyVIcnMKJBM72C6ltCkcjdHUBgQgDAZfcLhQZZKTe
NcA+7rT8vXsIQHDUqf8YuReNhVD8yuFpkIFJ+24gEGEguM1OqQicHevqDpLNbhMcdaj7niEQYdKJ
cBZWDv/kiPYNt1C9JwCa3x/g5QdKXd0E6aDD79mIcoXLG8N6G8hGNRZC4CnBjkN2Km8TjygcdhB8
2uznBQEiWg2aVu/dwW1iaHCPECAyMshhKoW7fXYkOOpw48+NLu8ROB9Le4TGra6uMjExUfMSUI5h
dXUVx3EaHtMXDoeNflXt56ZfVZMKhSCAQhBLIQigEMRSCAIoBLEUggAKQSyFIIBCEMAYoxDkjEIQ
QCGIpRAEUAhiKQQBFIJYCkEAhSCWQhBAIYilEARQCGIpBAEUglgKQQCFIJZCEEAhiKUQBFAIYikE
ARRCaxwkGI5lzv/bBjwPIRPzEcsckBgeJnHg9ej3d447nUfPa/47vo6vd5/5d2N3Nr+baJt/QykT
8/GCNKZN/sf8TMLhcLMrQoaYz4ev/Gc4wQHlnxB7l4MEw8MJMonhyv1imQwxe9+zMWJkADIx1scN
+dDC+eM9PsdB5T5VP72ZGL5rl2aPn2v5vJUxW7uqVTT8j3JXy+dN3hhj8nETddLnx9OOwX6fj0dN
NJ63x+Pm7Mu0iUejxkmfHYtGHZM2dXh6jrRxovFL4+VN3LHH7vO55uMmStTErz3x/QmFQqa5FaEn
z5LPh693DvaqforHxnHcBRIHB2wkYfrXnh8fOz1P6EuCxJdnzPe17hwHid/ZH39NjUffwzz6eHbt
ie9XUyFkYi9wnTTGGN5Pg7tQXgLHeBOH5NISyb55Xtd8sr38SpL9Z1fvBbw9xxhvppP0+nz4epNM
j3/h9/15GtmOeP5ce17zPr7HC1/VpaUd3PzSkDbOhaUtbRyqlvh83ETBOHXX/NafI+04Jm3HACrL
ej4eNYChcu67nEfaOGCINnB5umPNXxqu0vOMPqKEej0f2ZtzZGIshN7AUpLpvMGYPNPJJTJkWEpO
kzcGk58mudTAj+utnusY70yeOHM0cqq71kQIY4w7/yO5Ybe6mXXcaAhvX/e7OkeG2EKI97XX8Xuc
Rxtq6l1DeVm7sIxW3+bFrtj7c6Sdqsc0dGm4g3lUnReovPNopVAoZNrmAyVpnVt8oCT/NApBAIUg
lkIQQCGIpRAEUAhiKQQBFIJYCkEAhSCWQhBAIYilEARQCGIpBAEUglgKQQCFIJZCEEAhiKUQBFAI
YikEAeARwPfv31s9D2kxrQgCKASxFIIACkEshSCAQhBLIQigEMR61OoJXOXJkyetnsKDMDs7y+Li
4q3GaOsQAL59+9bqKbS1zc1NT8Zp+xAAHj9+3Oop/ONpjyCAQhBLIQigEMRSCFfJzuH3+yt/ni8X
qm9kzv+c8qHsnB//XLYl0/SCQqijsPwc/xSkSiVKpRKlUo7f1vprv9jZOaZIUYqP3v9EPaIQairw
cQ0Wc3HOX9puZv6zyNDfG1xMIcvcFKQecASgEGorfGRtK0RP96Xj3f/mt6G/2agqYe3VEr9cCOZh
Ugg3NsQvwfLXW2xtbbH2sXDVAx4EhVBLdw8h9jm4/Pr+sFIMsZhLEXr7iuUH3oJCqGmUPxbhbf9c
1X6gwPKrt7D4x6XLwCjxVIi3r5Z5yC08iL9raIXumU/keE6/3185NrSY49PM5Y0DMBonteGn/znk
Ps1Q4x5tTyFcoXvmE6WZereOEi+drw2j8RKle5nV3dClQQCFIJZCEEAhiKUQBHgA7xq8+p08uVpb
hzA7O9vqKfw02jqE2/6KtjROewQBFIJYCkEAhSCWQhBAIYilEARQCGIpBAHAFw6HTasnIa33f75V
n9ruGcTJAAAAAElFTkSuQmCC
"/>
</fieldset>
</div>
<div class="compare_item">
<fieldset><legend>mb_utf16.c</legend>
<pre><code class="c">
#include <windows.h>
int main() {
const wchar_t *caption =
L"\U0001F631"; // U+1F631 FACE SCREAMING IN FEAR
const wchar_t *message =
L"\U0001F648" // U+1F648 SEE-NO-EVIL MONKEY
L"\U0001F649" // U+1F649 HEAR-NO-EVIL MONKEY
L"\U0001F64A"; // U+1F64A SPEAK-NO-EVIL MONKEY
MessageBoxW(NULL, message, caption, MB_OK);
}
</code></pre></fieldset>
<fieldset><legend>Compile</legend>
<pre><code class="Bash">
> cl mb_utf16.c /Femb_utf16.exe user32.lib
Microsoft (R) C/C++ Optimizing Compiler Version 19.13.26128 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
mb_utf16.c
Microsoft (R) Incremental Linker Version 14.13.26128.0
Copyright (C) Microsoft Corporation. All rights reserved.
/out:mb_utf16.exe
mb_utf16.obj
user32.lib
</code></pre></fieldset>
<fieldset><legend>Run</legend>
<pre><code class="Bash">
> mb_utf16.exe
</code></pre>
<img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAHcAAACFCAYAAABhY2eYAAAABHNCSVQICAgIfAhkiAAAB0dJREFU
eJztnbFvGskXx7+c0kZZLP8BPzj7TgI32JKLoUT2yU4RX5S4vW6IXdy5uSKSJZcUaZbKv/Pviju5
OyScFFmU+ChzhSVDE6XwEHCRVEawKV29X8ECa844CyYmfryPtF6YmZ0d+7Pz3rAgHIrH4wThq2Fn
Zwfn5+fX6mN/fx8fP35EKB6P09u3b0c0NOFrYW5uDt+MexDCl0PkMkbkMkbkMkbkMkbkMkbkMkbk
MubOoAc0a0UUj9+j0fAKpqbw7UIKqUh4xEObLPb29vDw4UNMT09fWl+v15HP56G1Dt5pPB6nYDSo
lPuNcn9XqdHwFzeo+neOfsuVqNH3WOFznJ2d0f9+/53Ozs4GqutHPB6nwGG5ViyisaCRmjpGqVRD
EwDQRO20hOOpFPRCA8ViLfhVJVxgenoaaw8e4PmLF6jX653yer2O5y9eYO3Bg76zuh/B5DbLqE6l
kIoA4cQColELrSAchmVFsZAIA5EUUlNVlJsDnV/w0Sv4OmKBgDm3edoArP94T1wcV4GIl2Pd6jFc
KwKEAVhA47QJhCX/DotfMIChxQJBF1RWFFH3FDWEEQkn8CjVrYqkHnmPajh1o4haQ41D+AIEC8uu
C/dCQRO1chnlcjv3dhrCvdhQGBB/KL4sBw9C4AWVZQHVmreMqrlAIoFEAnC9MtSqrUbC0PTm2H6L
rKAEv4kRSWDeLXUWTJbvJ5plFN15JCIDn1/w6Ld4uo7gQHLDFuA2W/v3pRoAF6flGmrlU7gAaqX3
gBUGmm5rLwxMPp/vu3hqC87n8wP1OfDHbOQO1e1gbm5u8NuP4UgKjyKpzzcUxo68ccAYkcsYkcsY
kcsYkcsYkcsYkcsYkcsYkcsYkcsUIhK5nBG5jBG5jBG5jBG5jBG5jBG5jBG5jBG5jBG5jBG5jBG5
jBG5jBG5jBG5jBG5jBG5jBG5jBG5jBG5jBG5jBG5jBG5jLm+3EoBhUIBhcoIRiOMlOHlVgpIJ0MI
za5idXUVq7MhhJLp/pIrBWTTaaTTaaSz2SsvhkohjWQyhFAohGQyibRcOcMR/Ftb2xhytCIFRbZj
eqpsUlCktE2dGmPIsRUB6NkUKdshY/7dNwBSqmevHeo5m3AFsVgs+Le2tqlkn+Hk13VAb+P+rGmF
5PaGX7CtgfW17/AsXQBQQQUzwDvAdgyICMZ4e2cO/7wDZlBBBQAKWSSTP2F1D7AN4c2bNyDy9sYG
9lbxUzKJdGF0FzZ3Bg/L38WAl++wvgYYs4KVFQDPT4DZFcyaArA2h79OThCL+Y5Z28YvKwbZbAUz
MzMoZLMwK7/C9rWpnADbb7ah1TruzxSQTmdRQQXZZBoF3Me6Ulj/80/ETsRuYIYJy7ZSZPtipHF8
IdPYpJQXlo0h49UTGXK8MN5q335uWqHZ2KQUCN6xxovXrb1DGiAoRbYzXJiaNGKxGA0hl4gcTcpu
iXE8scYxnlCbtCfA2IqU1qQA0o4hMg5ppckxrTpAkdbK66vVL7z8SsYhrVttHd3K08qWrBuUIeUa
slX7j+2Q4xgyxpBxnO7CqT1zPVkACNruLp6MQ1p1F1fti6G9INNak+04ZIwhx7FJdy6Qkf3u7BlO
rl8YNDnGm73GC50XhPnKlE22rUlrTdq2u3JVd2XtaE0df76Z69WSFruBGUquueRlTfvlyoUyL4T6
2yulyXbsC7O2E2qNTQq6m8uNIcf/Usk4pHEx1wv9GUpuO/99dvNm2Wfbt2ejPyIo1Vpc9T6W0ByY
WCxG8s+RmRKPx+WNA86IXMaIXMaIXMaIXMaIXMaIXMaIXMaIXMaIXMaIXMaIXMaIXMaIXMaIXMaI
XMaIXMaIXMaIXMaIXMaIXMaIXMaIXMbcAYBPnz6NexzCF0BmLmNELmNELmNELmNELmNELmNELmNE
LmPu3PQJ7927d9OnvJVsbm4ik8lcq48blwsAHz58GMdpbw3FYnEk/YxFLgDcvXt3XKeeGCTnMkbk
MkbkMkbkMmby5B5uwbKszra8W/VXYstaRrvocMuCtXU4lmGOgomSW91dhvUYyLkuXNeF65bw48H8
5QIPt/AYObj20s0PdERMkNwqXh0AmZKNrq4oNv6bweIfL3FR7yG2HgO5WywWmCS51Vc4OIphJtpT
Hv0BPy7+gZc+uwdPnuH7CxfB7WRy5F7JIr7/tv34CEdHRzh4Vb3qgFvB5MiNziCGd6j0OvvXjF5E
ppRD7OkT7N5yv5MjF0v4OQM8nd/y5dcqdp88BTI/94TgJdi5GJ4+2cVt9ju2e8vjILrxGiUsY96y
OmWLmRJeb/QmYgBLNnIvLcwvA6XXG7ikxVfPRMkFWoLdjX61S7Dd7hxesl24NzKqL8MEheXJQ+Qy
RuQyRuQyRuQyZiyr5VF9Rki4mhuXu7m5edOnnFhuXO51P64pBEdyLmNELmNELmNELmNELmNELmNE
LmNELmNELmPuAEAymRz3OASPnZ0dnJ+fX6uP/f19hEIh/B8jZg5SOCteOQAAAABJRU5ErkJggg==
"/>
</fieldset>
</div>
</div>
</p>
<p>Difficulty in managing multiple encodings with the same code unit type is
not the only challenge posed by use of <tt>char</tt> as the UTF-8 code unit
type. The following code exhibits implementation defined behavior.
<blockquote><pre><code class="c">
_Bool is_utf8_multibyte_code_unit(char c) {
return c >= 0x80;
}
</code></pre></blockquote>
</p>
<p>UTF-8 leading and continuation code units have values in the range 128
(0x80) to 255 (0xFF). In the common case where <tt>char</tt> is implemented
as a signed 8-bit type with a two's complement representation and a range of
-128 (-0x80) to 127 (0x7F), these values exceed the unsigned range of the
<tt>char</tt> type. Such implementations typically encode such code units as
unsigned values which are then reinterpreted as signed values when read. In
the code above, integral promotion rules result in <tt>c</tt> being promoted to
type <tt>int</tt> for comparison to the <tt>0x80</tt> operand. if <tt>c</tt>
holds a value corresponding to a leading or continuation code unit value, then
its value will be interpreted as negative and the promoted value of type
<tt>int</tt> will likewise be negative. The result is that the comparison
is always false for these implementations.</p>
<p>To correct the code above, explicit conversions are required. For example:
<blockquote><pre><code class="c">
_Bool is_utf8_multibyte_code_unit(char c) {
return ((unsigned char)c) >= 0x80;
}
</code></pre></blockquote>
</p>
<p>Finally, no facilities are currently provided for transcoding between the
execution character encoding and UTF-8.
</p>
<p>The issues described above present significant challenges to working with
UTF-8 encoded text. As the use of UTF-8 continues to rise, the ability to
work well with UTF-8 text will only grow more important. The changes proposed
in this paper are intended to address the above issues while retaining the
ability to write source code that is compatible across C and C++.
</p>
<p><em>[*]: Microsoft Visual Studio 2015 added <tt>/utf-8</tt>,
<tt>/source-charset:utf-8</tt>, and <tt>/execution-charset:utf-8</tt> options
that enable use of UTF-8 as the execution character encoding, but in practice,
these options are of limited use since the Windows platform SDK does not, in
general, support UTF-8.</em>
</p>
<h1 id="proposal">Proposal</h1>
<p>The proposed changes include:
<ul>
<li>A new typedef of <tt>unsigned char</tt> named <tt>char8_t</tt> defined
in the <tt><uchar.h></tt> header.</li>
<li>The type of UTF-8 string literals is changed from array of
<tt>const char</tt> to array of <tt>const char8_t</tt>.</li>
<li>The type of UTF-8 character literals (assuming
<a title="[WG14 N2198]: Adding the u8 character prefix"
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2198.pdf">
WG14 N2198</a>
<sup><a title="[WG14 N2198]: Adding the u8 character prefix"
href="#ref_wg14_n2198">
[WG14 N2198]</a></sup> is adopted) is changed from <tt>char</tt> to
<tt>char8_t</tt>.</li>
<li>New <tt>mbrtoc8</tt> and <tt>c8rtomb</tt> functions declared in
<tt><uchar.h></tt> enable converting between the implementation
defined execution character encoding and UTF-8.</li>
<li>A new <tt>__STDC_UTF_8__</tt> macro used to indicate when <tt>u8</tt>
literals are encoded in UTF-8.</li>
<li>New <tt>char8_t</tt> related atomic macros and types.</li>
</ul>
</p>
<p>The addition of the <tt>char8_t</tt> typedef is intended to support source
code compatibility between C and C++ assuming the adoption of
<a title="[WG21 P0482R1]: char8_t: A type for UTF-8 characters and strings (Revision 1)"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r1.html">
WG21 P0482R1</a>
<sup><a title="[WG21 P0482R1]: char8_t: A type for UTF-8 characters and strings (Revision 1)"
href="#ref_wg21_p0482r1">
[WG21 P0482R1]</a></sup>
by the C++ committee. Mutual adoption would enable the following code to be
well-formed and portable for both languages while providing additional type
safety and protection from implementation defined sign issues.
<blockquote><pre><code class="c">
#include <uchar.h>
void use_utf8(const char8_t *p) {
if (p && p[0] >= 0x80) {
/* Handle UTF-8 lead or continuation code unit... */
}
}
int main() {
use_utf8(u8"text");
}
</code></pre></blockquote>
</p>
<h1 id="backward_compat">Backward Compatibility</h1>
<p>The changes proposed in this paper impact backward compatibility as a
result of changing the type of UTF-8 string literals. There are two primary
consequences:
<ol>
<li>Code that directly accesses the code unit values of UTF-8 string literals
without an intervening cast to an unsigned type may experience silent
behavioral changes for implementations with a signed 8-bit <tt>char</tt>
type. In general, such accesses are likely indicative of latent defects
in the code, and are defects likely fixed by the proposed changes.</li>
<li>Initialization or assignment of <tt>const char</tt> pointers (including
parameters) from UTF-8 string literals will now result in incompatible
pointer conversions. This is an intentional change intended to allow
the use of compiler diagnostics to identify cases where incorrectly
encoded text is used.</li>
</ol>
</p>
<p>These changes are a primary objective of this proposal. Implementations
are encouraged to add options to disable <tt>char8_t</tt> support entirely
when necessary to preserve compatibility with prior C language standards.
</p>
<h1 id="implementation_exp">Implementation Experience</h1>
<p>The proposed changes in the corresponding C++
<a title="[WG21 P0482R1]: char8_t: A type for UTF-8 characters and strings (Revision 1)"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r1.html">
WG21 P0482R1</a>
<sup><a title="[WG21 P0482R1]: char8_t: A type for UTF-8 characters and strings (Revision 1)"
href="#ref_wg21_p0482r1">
[WG21 P0482R1]</a></sup>
proposal have been implemented in a fork of gcc and are available on
GitHub in the <tt>char8_t</tt> branch of the following repository:
<ul>
<li>gcc: <a href="https://github.com/tahonermann/gcc/tree/char8_t">
https://github.com/tahonermann/gcc/tree/char8_t</a></li>
</ul>
</p>
<p>The proposed changes in this paper are being implemented in forks of gcc
and glibc, but are not yet complete. Once completed, they will be available
in the <tt>char8_t</tt> branches of the following repositories:
<ul>
<li>gcc: <a href="https://github.com/tahonermann/gcc/tree/char8_t">
https://github.com/tahonermann/gcc/tree/char8_t</a></li>
<li>glibc: <a href="https://github.com/tahonermann/glibc/tree/char8_t">
https://github.com/tahonermann/glibc/tree/char8_t</a></li>
</ul>
</p>
<p>
The new gcc <tt>-fchar8_t</tt> and <tt>-fno-char8_t</tt> compiler options
support enabling and disabling the new features. No backward compatibility
features are currently implemented.
</p>
<h1 id="wording">Formal Wording</h1>
<input type="checkbox" id="hidedel">Hide deleted text</input>
<p>These changes are relative to the ISO/IEC 9899:2017 committee draft as of
2018-03-17.</p>
<p>Additional updates will be necessary if
<a title="[WG14 N2198]: Adding the u8 character prefix"
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2198.pdf">
WG14 N2198</a>
<sup><a title="[WG14 N2198]: Adding the u8 character prefix"
href="#ref_wg14_n2198">
[WG14 N2198]</a></sup>
is adopted.</p>
<p>Change in 6.4.5 (String Literals) paragraph 6:
<blockquote>
[…] For UTF-8 string literals, the array elements have type
<del><tt>char</tt></del><ins><tt>char8_t</tt></ins>, and are initialized with
the characters of the multibyte character sequence, as encoded in UTF–8.
[…]
</blockquote>
</p>
<p>Change in 6.7.9 (Initialization) paragraph 14:
<blockquote>
An array of character type may be initialized by a character string
literal<del> or UTF-8 string literal</del>, optionally enclosed in braces. Successive
bytes of the string literal (including the terminating null character if there
is room or if the array is of unknown size) initialize the elements of the
array.
</blockquote>
</p>
<p><em>Drafting note: The changes to 6.7.9p14 affect backward compatibility
by removing the ability to initialize an array of character type with a UTF-8
string literal. This is an intentional change made to align with the changes
to C++ proposed in
<a title="[WG21 P0482R1]: char8_t: A type for UTF-8 characters and strings (Revision 1)"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r1.html">
WG21 P0482R1</a>
<sup><a title="[WG21 P0482R1]: char8_t: A type for UTF-8 characters and strings (Revision 1)"
href="#ref_wg21_p0482r1">
[WG21 P0482R1]</a></sup>.
</em></p>
<p>Insert a new paragraph after 6.7.9 (Initialization) paragraph 14:
<blockquote class=stdins>
An array with element type compatible with a qualified or unqualified version
of <tt>char8_t</tt> may be initialized by a UTF-8 string literal, optionally
enclosed in braces. Successive bytes of the string literal (including the
terminating null character if there is room or if the array is of unknown size)
initialize the elements of the array.
</blockquote>
</p>
<p>Change in 6.10.8.2 (Environment macros) paragraph 1:
<blockquote>
The following macro names are conditionally defined by the implementation:<br/>
<br/>
[…]<br/>
<br/>
<ins><tt>__STDC_UTF_8__</tt> The integer constant 1, intended to indicate that
values of type <tt>char8_t</tt> are UTF-8 encoded. If some other encoding is
used, the macro shall not be defined and the actual encoding used is
implementation-defined.</ins><br/>
<br/>
<tt>__STDC_UTF_16__</tt> The integer constant 1, intended to indicate that
values of type <tt>char16_t</tt> are UTF-16 encoded. If some other encoding is
used, the macro shall not be defined and the actual encoding used is
implementation-defined.<br/>
<br/>
<tt>__STDC_UTF_32__</tt> The integer constant 1, intended to indicate that
values of type <tt>char32_t</tt> are UTF-32 encoded. If some other encoding is
used, the macro shall not be defined and the actual encoding used is
implementation-defined.<br/>
<br/>
[…]<br/>
</blockquote>
</p>
<p>Change in 7.17.1 (Introduction) paragraph 3:
<blockquote>
The macros defined are the <em>atomic lock-free macros</em>
<blockquote>
ATOMIC_BOOL_LOCK_FREE<br/>
ATOMIC_CHAR_LOCK_FREE<br/>
<ins>ATOMIC_CHAR8_T_LOCK_FREE</ins><br/>
ATOMIC_CHAR16_T_LOCK_FREE<br/>
ATOMIC_CHAR32_T_LOCK_FREE<br/>
ATOMIC_WCHAR_T_LOCK_FREE<br/>
ATOMIC_SHORT_LOCK_FREE<br/>
ATOMIC_INT_LOCK_FREE<br/>
ATOMIC_LONG_LOCK_FREE<br/>
ATOMIC_LLONG_LOCK_FREE<br/>
ATOMIC_POINTER_LOCK_FREE<br/>
</blockquote>
[…]<br/>
</blockquote>
</p>
<p>Change in 7.17.6 (Atomic integer types) paragraph 1:
<blockquote>
For each line in the following table,<sup>261)</sup> the atomic type name is
declared as a type that has the same representation and alignment requirements
as the corresponding direct type.<sup>262)</sup>
<div style="margin-left: 1em;">
<table>
<tr>
<td>Atomic type name</td>
<td>Direct type</td>
</tr>
<tr>
<td>[…]</td>
<td>[…]</td>
</tr>
<tr>
<td><tt>atomic_ullong</tt></td>
<td><tt>_Atomic unsigned long long</tt></td>
</tr>
<tr>
<td><tt><ins>atomic_char8_t</ins></tt></td>
<td><tt><ins>_Atomic char8_t</ins></tt></td>
</tr>
<tr>
<td><tt>atomic_char16_t</tt></td>
<td><tt>_Atomic char16_t</tt></td>
</tr>
<tr>
<td><tt>atomic_char32_t</tt></td>
<td><tt>_Atomic char32_t</tt></td>
</tr>
<tr>
<td><tt>atomic_wchar_t</tt></td>
<td><tt>_Atomic wchar_t</tt></td>
</tr>
<tr>
<td>[…]</td>
<td>[…]</td>
</tr>
</table>
</div>
</blockquote>
</p>
<p>Change in 7.28 (Unicode utilities <uchar.h>) paragraph 2:
<blockquote>
The types declared are <tt>mbstate_t</tt> (described in 7.29.1) and
<tt>size_t</tt> (described in 7.19);
<ins>
<blockquote>
<tt>char8_t</tt>
</blockquote>
which is an unsigned integer type used for 8-bit characters and is the same
type as <tt>unsigned char</tt>; and</ins>
</ins>
<blockquote>
<tt>char16_t</tt>
</blockquote>
which is an unsigned integer type used for 16-bit characters and is the same
type as <tt>uint_least16_t</tt> (described in 7.20.1.12); and</ins>
<blockquote>
<tt>char32_t</tt>
</blockquote>
which is an unsigned integer type used for 32-bit characters and is the same
type as <tt>uint_least32_t</tt> (described in 7.20.1.12).</ins>
</blockquote>
</p>
<p>Insert a new subclause before 7.28.1.1 (The mbrtoc16 function):
<blockquote class="stdins">
7.28.1.1 <strong>The mbrtoc8 function</strong>
</blockquote>
</p>
<p>Add a new paragraph 1:
<blockquote class="stdins">
<strong>Synopsis</strong><br/>
<blockquote>
<div style="margin-left: 1em;">
<tt>#include</tt> <uchar.h><br/>
<tt>size_t</tt> mbrtoc8(<tt>char8_t</tt> * <tt>restrict</tt> pc8,<br/>
<tt>const</tt> <tt>char</tt> * <tt>restrict</tt> s, <tt>size_t</tt> n,<br/>
<tt>mbstate_t</tt> * <tt>restrict</tt> ps);
</div>
</blockquote>
</blockquote>
</p>
<p>Add a new paragraph 2:
<blockquote class="stdins">
<strong>Description</strong><br/>
If <tt>s</tt> is a null pointer, the <tt>mbrtoc8</tt> function is equivalent
to the call:
<blockquote>
<div style="margin-left: 2em;">
mbrtoc8(NULL, "", 1, ps)
</div>
</blockquote>
In this case, the values of the parameters <tt>pc8</tt> and <tt>n</tt> are
ignored.
</blockquote>
</p>
<p>Add a new paragraph 3:
<blockquote class="stdins">
If <tt>s</tt> is not a null pointer, the <tt>mbrtoc8</tt> function inspects at
most <tt>n</tt> bytes beginning with the byte pointed to by <tt>s</tt> to
determine the number of bytes needed to complete the next multibyte character
(including any shift sequences). If the function determines that the next
multibyte character is complete and valid, it determines the values of the
corresponding characters and then, if <tt>pc8</tt> is not a null pointer,
stores the value of the first (or only) such character in the object pointed to
by <tt>pc8</tt>. Subsequent calls will store successive characters without
consuming any additional input until all the characters have been stored. If
the corresponding character is the null character, the resulting
state described is the initial conversion state.
</blockquote>
</p>
<p>Add a new paragraph 4:
<blockquote class="stdins">
<strong>Returns</strong><br/>
The <tt>mbrtoc8</tt> function returns the first of the following that applies
(given the current conversion state):
<table>
<tr>
<td>0</td>
<td>if the next <tt>n</tt> or fewer bytes complete the multibyte character
that corresponds to the null character (which is the value stored).
</td>
</tr>
<tr>
<td><em>between 1 and <tt>n</tt> inclusive</em></td>
<td>if the next <tt>n</tt> or fewer bytes complete a valid multibyte
character (which is the value stored); the value returned is the number
of bytes that complete the multibyte character.
</td>
</tr>
<tr>
<td><tt>(size_t)</tt> (−3)</td>
<td>if the next character resulting from a previous call has been stored
(no bytes from the input have been consumed by this call).
</td>
</tr>
<tr>
<td><tt>(size_t)</tt> (−2)</td>
<td>if the next <tt>n</tt> bytes contribute to an incomplete (but
potentially valid) multibyte character, and all <tt>n</tt> bytes have
been processed (no value is stored).<sup><em>Footnote</em>)</sup>
</td>
</tr>
<tr>
<td><tt>(size_t)</tt> (−1)</td>
<td>if an encoding error occurs, in which case the next <tt>n</tt> or
fewer bytes do not contribute to a complete and valid multibyte
character (no value is stored); the value of the macro <tt>EILSEQ</tt>
is stored in <tt>errno</tt>, and the conversion state is unspecified.
</td>
</tr>
</table>
</blockquote>
</p>
<p>Add a new footnote for the reference in paragraph 4 above:
<blockquote class="stdins">
<sup><em>Footnote</em>)</sup>When <tt>n</tt> has at least the value of the
<tt>MB_CUR_MAX</tt> macro, this case can only occur if <tt>s</tt> points at a
sequence of redundant shift sequences (for implementations with state-dependent
encodings).
</blockquote>
</p>
<p>Insert another new subclause before 7.28.1.1 (The mbrtoc16 function):
<blockquote class="stdins">
7.28.1.2 <strong>The c8rtomb function</strong>
</blockquote>
</p>
<p>Add a new paragraph 1:
<blockquote class="stdins">
<strong>Synopsis</strong><br/>
<blockquote>
<div style="margin-left: 1em;">
<tt>#include</tt> <uchar.h><br/>
<tt>size_t</tt> c8rtomb(<tt>char</tt> * <tt>restrict</tt> s, <tt>char8_t</tt> c8,<br/>
<tt>mbstate_t</tt> * <tt>restrict</tt> ps);
</div>
</blockquote>
</blockquote>
</p>
<p>Add a new paragraph 2:
<blockquote class="stdins">
<strong>Description</strong><br/>
If <tt>s</tt> is a null pointer, the c8rtomb function is equivalent to the call
<blockquote>
<div style="margin-left: 2em;">
c8rtomb(buf, '\0', ps)
</div>
</blockquote>
where <tt>buf</tt> is an internal buffer.
</blockquote>
</p>
<p><em>Drafting note: If
<a title="[WG14 N2198]: Adding the u8 character prefix"
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2198.pdf">
WG14 N2198</a>
<sup><a title="[WG14 N2198]: Adding the u8 character prefix"
href="#ref_wg14_n2198">
[WG14 N2198]</a></sup>
is adopted, the character literal in paragraph 2 above should be changed from
'\0' to u8'\0'.
</em></p>
<p>Add a new paragraph 3:
<blockquote class="stdins">
If <tt>s</tt> is not a null pointer, the <tt>c8rtomb</tt> function determines
the number of bytes needed to represent the multibyte character that corresponds
to the character given or completed by <tt>c8</tt> (including any shift
sequences), and stores the multibyte character representation in the array whose
first element is pointed to by <tt>s</tt>, or stores nothing if <tt>c8</tt> does
not represent a complete character. At most <tt>MB_CUR_MAX</tt> bytes are
stored. If <tt>c8</tt> is a null character, a null byte is stored, preceded
by any shift sequence needed to restore the initial shift state; the resulting
state described is the initial conversion state.
</blockquote>
</p>
<p><em>Drafting note: The wording in paragraph 3 above includes the proposed
wording updates from
<a title="[WG14 DR 488]: c16rtomb() on wide characters encoded as multiple char16_t"
href="http://www.open-std.org/jtc1/sc22/WG14/www/docs/summary.htm#dr_488">
WG14 DR 488</a>
<sup><a title="[WG14 DR 488]: c16rtomb() on wide characters encoded as multiple char16_t"
href="#ref_wg14_dr488">
[WG14 DR 488]</a></sup>.
</em></p>
<p>Add a new paragraph 4:
<blockquote class="stdins">
<strong>Returns</strong><br/>
The <tt>c8rtomb</tt> function returns the number of bytes stored in the array
object (including any shift sequences). When <tt>c8</tt> is not a valid
character, an encoding error occurs: the function stores the value of the macro
<tt>EILSEQ</tt> in <tt>errno</tt> and returns <tt>(size_t)</tt> (−1); the
conversion state is unspecified.
</blockquote>
</p>
<p>Change in B.16 (Atomics <stdatomic.h>)
<blockquote>
[…]<br/>
<tt>ATOMIC_CHAR_LOCK_FREE</tt><br/>
<ins><tt>ATOMIC_CHAR8_T_LOCK_FREE</tt></ins><br/>
<tt>ATOMIC_CHAR16_T_LOCK_FREE</tt><br/>
<tt>ATOMIC_CHAR32_T_LOCK_FREE</tt><br/>
<tt>ATOMIC_WCHAR_T_LOCK_FREE</tt><br/>
[…]<br/>
<tt>atomic_ullong</tt><br/>
<ins><tt>atomic_char8_t</tt></ins><br/>
<tt>atomic_char16_t</tt><br/>
<tt>atomic_char32_t</tt><br/>
<tt>atomic_wchar_t</tt><br/>
[…]<br/>
</blockquote>
</p>
<p>Change in B.27 (Unicode utilities <uchar.h>)
<blockquote>
<table>
<tr>
<td><tt>mbstate_t</tt></td>
<td><tt>size_t</tt></td>
<td><tt><ins>char8_t</ins></tt></td>
<td><tt>char16_t</tt></td>
<td><tt>char32_t</tt></td>
</tr>
</table>
<blockquote>
<div style="margin-left: 1em;">
<ins>
<tt>size_t</tt> mbrtoc8(<tt>char8_t</tt> * <tt>restrict</tt> pc8,<br/>
<tt>const</tt> <tt>char</tt> * <tt>restrict</tt> s, <tt>size_t</tt> n,<br/>
<tt>mbstate_t</tt> * <tt>restrict</tt> ps);<br/>
<tt>size_t</tt> c8rtomb(<tt>char</tt> * <tt>restrict</tt> s, <tt>char8_t</tt> c8,<br/>
<tt>mbstate_t</tt> * <tt>restrict</tt> ps);<br/>
</ins>
<tt>size_t</tt> mbrtoc16(<tt>char16_t</tt> * <tt>restrict</tt> pc16,<br/>
<tt>const</tt> <tt>char</tt> * <tt>restrict</tt> s, <tt>size_t</tt> n,<br/>
<tt>mbstate_t</tt> * <tt>restrict</tt> ps);<br/>
<tt>size_t</tt> c16rtomb(<tt>char</tt> * <tt>restrict</tt> s, <tt>char16_t</tt> c16,<br/>
<tt>mbstate_t</tt> * <tt>restrict</tt> ps);<br/>
<tt>size_t</tt> mbrtoc32(<tt>char32_t</tt> * <tt>restrict</tt> pc32,<br/>
<tt>const</tt> <tt>char</tt> * <tt>restrict</tt> s, <tt>size_t</tt> n,<br/>
<tt>mbstate_t</tt> * <tt>restrict</tt> ps);<br/>
<tt>size_t</tt> c32rtomb(<tt>char</tt> * <tt>restrict</tt> s, <tt>char32_t</tt> c32,<br/>
<tt>mbstate_t</tt> * <tt>restrict</tt> ps);
</div>
</blockquote>
</blockquote>
</p>
<p>Change in J.3.4 (Characters):
<blockquote>
[…]<br/>
— The encoding of any of <tt>wchar_t</tt><ins><tt>, char8_t</tt></ins>,
<tt>char16_t</tt>, and <tt>char32_t</tt> where the corresponding standard
encoding macro (<tt>__STDC_ISO_10646__</tt><ins>, <tt>__STDC_UTF_8__</tt></ins>,
<tt>__STDC_UTF_16__</tt>, or
<tt>__STDC_UTF_32__</tt>) is not defined (6.10.8.2).
</blockquote>
</p>
<h1 id="acknowledgements">Acknowledgements</h1>
<p>Thank you to Aaron Ballman for his kind assistance facilitating interaction
with WG14.</p>
<h1 id="references">References</h1>
<table id="references">
<tr>
<td id="ref_w3techs"><sup>[W3Techs]</sup></td>
<td>
"Usage of UTF-8 for websites", W3Techs, 2017.<br/>
<a href="https://w3techs.com/technologies/details/en-utf8/all/all">
https://w3techs.com/technologies/details/en-utf8/all/all</a></td>
</tr>
<tr>
<td id="ref_wg21_p0482r1"><sup>[WG21 P0482R1]</sup></td>
<td>
Tom Honermann,
"char8_t: A type for UTF-8 characters and strings (Revision 1)", P0482R1, 2018.<br/>
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r1.html">
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r1.html</a></td>
</tr>
<tr>
<td id="ref_wg14_n2198"><sup>[WG14 N2198]</sup></td>
<td>
Aaron Ballman,
"Adding the u8 character prefix", N2198, 2017.<br/>
<a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2198.pdf">
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2198.pdf</a></td>
</tr>
<tr>
<td id="ref_wg14_dr488"><sup>[WG14 DR 488]</sup></td>
<td>
"c16rtomb() on wide characters encoded as multiple char16_t", DR 488, 2016.<br/>
<a href="http://www.open-std.org/jtc1/sc22/WG14/www/docs/summary.htm#dr_488">
http://www.open-std.org/jtc1/sc22/WG14/www/docs/summary.htm#dr_488</a></td>
</tr>
</table>
</body>