<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"><channel><title>Store Halfword Byte-Reverse Indexed</title><link>https://sthbrx.github.io/</link><description>A Power Technical Blog</description><lastBuildDate>Mon, 07 Aug 2023 12:00:00 +1000</lastBuildDate><item><title>Going out on a Limb: Efficient Elliptic Curve Arithmetic in OpenSSL</title><link>https://sthbrx.github.io/blog/2023/08/07/going-out-on-a-limb-efficient-elliptic-curve-arithmetic-in-openssl/</link><description><p>So I've just managed to upstream some changes to OpenSSL for a <a href="https://github.com/openssl/openssl/blob/master/crypto/ec/ecp_nistp384.c">new strategy</a> I've developed for efficient arithmetic used in secp384r1, a curve prescribed by NIST for digital signatures and key exchange. In spite of its prevalence, its implementation in OpenSSL has remained somewhat unoptimised, even as less frequently used curves (P224, P256, P521) each have their own optimisations.</p>
<p>The strategy I have used could be called a 56-bit redundant limb implementation with <em>Solinas reduction</em>. Without too much micro-optimisation, we get a ~5.5x speedup over the default (Montgomery Multiplication) implementation for creation of digital signatures.</p>
<p>How is this possible? Well first let's quickly explain some language:</p>
<h2>Elliptic Curves</h2>
<p>When it comes to cryptography, it's highly likely that those with a computer science background will be familiar with ideas such as key exchange and private-key signing. The stand-in asymmetric cipher in a computer science curriculum is typically RSA. However, the heyday of Elliptic Curve ciphers has well and truly arrived, and their operation seems no less mystical than when they were just a toy for academia.</p>
<p>The word 'Elliptic' may seem to imply continuous mathematics. For cryptographic purposes, we are fundamentally just interested in the algebraic properties of these curves, whose points are elements of a <a href="https://en.wikipedia.org/wiki/Finite_field">finite field</a>. Irrespective of the underlying finite field, the algebraic properties of the elliptic curve group can be shown to exist by an application of <a href="https://en.wikipedia.org/wiki/Bézout%27s_theorem#:~:text=Bézout%27s%20theorem%20is%20a%20statement,the%20degrees%20of%20the%20polynomials.">Bézout's Theorem</a>. The <a href="https://en.wikipedia.org/wiki/Algebraic_group">group operator</a> on points on an elliptic curve for a particular choice of field involves lines that intersect the curve once, twice or thrice, granting notions of addition and doubling for the points of intersection, and giving the 'point at infinity' as the group identity. A closed form exists for computing a point double/addition in arbitrary fields (different closed forms can apply, as determined by the field's <a href="https://en.wikipedia.org/wiki/Characteristic_(algebra)">characteristic</a>, and the same closed form applies for all large prime fields).</p>
<p>Our algorithm uses a field of the form <span class="math">\(\mathbb{F}_p\)</span>, that is the <a href="https://en.wikipedia.org/wiki/Finite_field#Existence_and_uniqueness">unique</a> field with <span class="math">\(p\)</span> (a prime) elements. The most straightforward construction of this field is arithmetic modulo <span class="math">\(p\)</span>. The other finite fields used in practice in ECC are of the form <span class="math">\(\mathbb{F}_{2^m}\)</span> and are sometimes called 'binary fields' (representable as polynomials with binary coefficients). Their field structure is also used in AES through byte substitution, implemented by inversion in <span class="math">\(\mathbb{F}_{2^8}\)</span>.</p>
<p>From a performance perspective, great optimisations can be made by implementing efficient fixed-point arithmetic specialised to reduction modulo a single prime constant, <span class="math">\(p\)</span>. From here on out, I'll be speaking from this abstraction layer alone.</p>
<h2>Limbs</h2>
<p>We wish to multiply two <span class="math">\(m\)</span>-bit numbers, each of which is represented with <span class="math">\(n\)</span> 64-bit machine words in some way. Let's suppose just for now that <span class="math">\(n\)</span> divides <span class="math">\(m\)</span> neatly, then the quotient <span class="math">\(d\)</span> is the minimum number of bits in each machine word that will be required for representing our number. Suppose we use the straightforward representation whereby the least significant <span class="math">\(d\)</span> bits are used for storing parts of our number, which we had better call <span class="math">\(x\)</span> because this is crypto and descriptive variable names are considered harmful (apparently).</p>
<div class="math">$$x = \sum_{k = 0}^{n-1} 2^{dk} l_k$$</div>
<p>If we then drop the requirement for each of our <span class="math">\(n\)</span> machine words (also referred to as a 'limb' from here on out) to have no more than the least significant <span class="math">\(d\)</span> bits populated, we say that such an implementation uses 'redundant limbs', meaning that the <span class="math">\(k\)</span>-th limb has high bits which overlap with the place values represented in the <span class="math">\((k+1)\)</span>-th limb.</p>
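<p>As a minimal sketch of this representation, assuming <span class="math">\(d = 56\)</span> and <span class="math">\(n = 7\)</span> (the values this post settles on later), packing a little-endian byte string into limbs and doing a carry-free, redundant addition might look like the following. The names <code>limbs_from_bytes</code> and <code>limbs_add</code> are hypothetical and are not taken from OpenSSL.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NLIMBS 7     /* n: number of limbs (assumed for illustration) */
#define LIMB_BYTES 7 /* d = 56 bits = 7 bytes per limb (assumed) */

typedef uint64_t limb;

/* Pack a 49-byte little-endian integer into 7 non-redundant 56-bit limbs. */
static void limbs_from_bytes(limb out[NLIMBS],
                             const uint8_t in[NLIMBS * LIMB_BYTES])
{
    for (size_t k = 0; k < NLIMBS; k++) {
        limb l = 0;
        for (size_t j = 0; j < LIMB_BYTES; j++)
            l |= (limb)in[k * LIMB_BYTES + j] << (8 * j);
        out[k] = l; /* only the low 56 bits are populated */
    }
}

/* Limb-wise addition with no carry propagation. The result is redundant:
 * a limb may now use bit 56, which overlaps the next limb's place value. */
static void limbs_add(limb out[NLIMBS], const limb a[NLIMBS],
                      const limb b[NLIMBS])
{
    for (size_t k = 0; k < NLIMBS; k++) {
        assert(a[k] <= UINT64_MAX - b[k]); /* the 64-bit word must not wrap */
        out[k] = a[k] + b[k];
    }
}
```

<p>The spare 8 bits in each 64-bit word are what allow a number of such additions to be chained before any carry has to be propagated.</p>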
<h2>Multiplication (mod p)</h2>
<p>The fundamental difficulty with making modulo arithmetic fast is to do with the following property of multiplication.</p>
<p>Let <span class="math">\(a\)</span> and <span class="math">\(b\)</span> be <span class="math">\(m\)</span>-bit numbers, then <span class="math">\(0 \leq a &lt; 2^m\)</span> and <span class="math">\(0 \leq b &lt; 2^m\)</span>, but critically we cannot say the same about <span class="math">\(ab\)</span>. Instead, the best we can say is that <span class="math">\(0 \leq ab &lt; 2^{2m}\)</span>. Multiplication can in the worst case double the number of bits that must be stored, unless we can reduce modulo our prime.</p>
<p>If we begin with non-redundant, 56-bit limbs, then for 'sufficiently reduced' <span class="math">\(a\)</span> and <span class="math">\(b\)</span> that are not too much larger than <span class="math">\(2^{384} &gt; p_{384}\)</span>, we can multiply our limbs in the following ladder, so long as we are capable of storing the following sums without overflow.</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="cm">/* and so on ... */</span>
<span class="w"> </span><span class="n">out</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">3</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">4</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">5</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="w"> </span><span class="n">out</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">3</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">4</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">5</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">6</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="w"> </span><span class="n">out</span><span class="p">[</span><span class="mi">7</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">3</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">4</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">5</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">6</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="w"> </span><span class="n">out</span><span class="p">[</span><span class="mi">8</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">3</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">4</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">5</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">((</span><span class="n">uint128_t</span><span class="p">)</span><span class="w"> </span><span class="n">in1</span><span class="p">[</span><span class="mi">6</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">in2</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="w"> </span><span class="cm">/* ... and so forth */</span>
</code></pre></div>
<p>This is possible if we back each of the 56-bit limbs with a 64-bit machine word, with products being stored in 128-bit machine words. The numbers <span class="math">\(a\)</span> and <span class="math">\(b\)</span> were able to be stored with 7 limbs, whereas we use 13 limbs for storing the product. If <span class="math">\(a\)</span> and <span class="math">\(b\)</span> were stored non-redundantly, then each of the output (redundant) limbs sums at most seven products and so must contain a value less than <span class="math">\(7 \cdot 2^{56} \cdot 2^{56} &lt; 2^{115}\)</span>, so there is no possibility of overflow in 128 bits. We even have room spare to do some additions/subtractions in cheap, redundant limb arithmetic.</p>
<p>But we can't keep doing our sums in redundant limb arithmetic forever; we must eventually reduce. Doing so may be expensive, and so we would rather reduce only when strictly necessary!</p>
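<p>For reference, the unrolled ladder above can be written as a double loop. This is only a sketch: the function name <code>limbs_mul</code> is hypothetical, and <code>uint128_t</code> is assumed to be a typedef of the GCC/Clang <code>unsigned __int128</code> extension.</p>

```c
#include <stdint.h>

typedef uint64_t limb;
typedef unsigned __int128 uint128_t; /* compiler extension, assumed available */

/* Schoolbook product of two 7-limb inputs into 13 redundant 128-bit limbs:
 * out[k] is the sum of in1[i] * in2[j] over all pairs with i + j == k. */
static void limbs_mul(uint128_t out[13], const limb in1[7], const limb in2[7])
{
    for (int k = 0; k < 13; k++)
        out[k] = 0;

    for (int i = 0; i < 7; i++)
        for (int j = 0; j < 7; j++)
            out[i + j] += (uint128_t)in1[i] * in2[j];
}
```

<p>The middle limb <code>out[6]</code> collects the most products (seven), so it is the one that determines how much headroom is left in 128 bits.</p>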
<h2>Solinas-ish Reduction</h2>
<p>Our prime is a <em>Solinas</em> (<em>Pseudo/Generalised-Mersenne</em>) <em>Prime</em>. Mersenne Primes are primes expressible as <span class="math">\(2^m - 1\)</span>. This can be generalised to low-degree polynomials in <span class="math">\(2^m\)</span>. For example, another NIST curve uses <span class="math">\(p_{224} = 2^{224} - 2^{96} + 1\)</span> (a 224-bit number) where <span class="math">\(p_{224} = f(2^{32})\)</span> for <span class="math">\(f(t) = t^7 - t^3 + 1\)</span>. The simpler the choice of polynomial, the simpler the modular reduction logic.</p>
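<p>To see why such primes are attractive, consider the simplest case of a true Mersenne prime. The sketch below uses <span class="math">\(2^{61} - 1\)</span>, chosen purely for illustration because it fits in a machine word; it is not a prime used by any of the curves discussed here, and the function names are hypothetical. Since <span class="math">\(2^{61} \equiv 1\)</span>, reduction is just a matter of adding the high bits onto the low bits.</p>

```c
#include <stdint.h>

typedef unsigned __int128 uint128_t; /* GCC/Clang extension, assumed available */

#define M61 ((((uint64_t)1) << 61) - 1) /* the Mersenne prime 2^61 - 1 */

/* Reduce x < 2^122 modulo 2^61 - 1 using 2^61 = 1 (mod 2^61 - 1). */
static uint64_t reduce_m61(uint128_t x)
{
    uint64_t lo = (uint64_t)x & M61;
    uint64_t hi = (uint64_t)(x >> 61); /* < 2^61 because x < 2^122 */
    uint64_t r = lo + hi;              /* < 2^62, so no 64-bit overflow */

    r = (r & M61) + (r >> 61);         /* fold the single possible carry bit */
    return r == M61 ? 0 : r;           /* map the value p itself to 0 */
}

/* Multiply two values already reduced below 2^61. */
static uint64_t mulmod_m61(uint64_t a, uint64_t b)
{
    return reduce_m61((uint128_t)a * b);
}
```

<p>No division or multiplication by a precomputed constant is needed, which is the property Solinas primes try to retain with a slightly more complicated polynomial.</p>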
<p>Our choice of <span class="math">\(t\)</span> is <span class="math">\(2^{56}\)</span>. <a href="https://en.wikipedia.org/wiki/Solinas_prime#Modular_reduction_algorithm">Wikipedia</a> describes the ideal case for Solinas reduction, where the bitwidth of the prime is divisible by <span class="math">\(\log_2{t}\)</span>, but that is not our scenario. We choose 56 bits for some pretty simple realities of hardware. 56 is less than 64 (standard machine word size) but not by too much, and the difference is byte-addressable (<span class="math">\(64-56=8\)</span>). Let me explain:</p>
<h2>Just the Right Amount of Reduction (mod p)</h2>
<p>Let's first describe the actual prime that is our modulus.</p>
<div class="math">$$p_{384} = 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$$</div>
<p>Yuck. This number is so yuck in fact, that no one has so far managed to upstream a Solinas reduction method for it in OpenSSL, in spite of <code>secp384r1</code> being the preferred curve for ECDH (Elliptic Curve Diffie-Hellman key exchange) and ECDSA (Elliptic Curve Digital Signature Algorithm) by NIST.</p>
<p>In 56-bit limbs, we would express this number so:</p>
<p>Let <span class="math">\(f(t) = 2^{48} t^6 - 2^{16} t^2 - 2^{40} t + (2^{32} - 1)\)</span>, then observe that all coefficients are smaller than <span class="math">\(2^{56}\)</span>, and that <span class="math">\(p_{384} = f(2^{56})\)</span>.</p>
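<p>As a quick check of that last claim, substituting <span class="math">\(t = 2^{56}\)</span> and expanding term by term gives:</p>
<div class="math">$$f(2^{56}) = 2^{48} \cdot 2^{336} - 2^{16} \cdot 2^{112} - 2^{40} \cdot 2^{56} + (2^{32} - 1) = 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1 = p_{384}$$</div>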
<p>Now let <span class="math">\(\delta(t) = 2^{16} t^2 + 2^{40} t - 2^{32} + 1\)</span>, consider that <span class="math">\(p_{384} = 2^{384} - \delta(2^{56})\)</span>, and thus <span class="math">\(2^{384} \equiv \delta(2^{56}) \mod{p_{384}}\)</span>. From now on let's call <span class="math">\(\delta(2^{56})\)</span> just <span class="math">\(\delta\)</span>. Thus, 'reduction' can be achieved as follows for suitable <span class="math">\(X\)</span> and <span class="math">\(Y\)</span>:</p>
<div class="math">$$ab = X + 2^{384} Y \equiv X + \delta Y \mod{p_{384}}$$</div>
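<p>The same substitution can be tried at toy scale. The sketch below assumes a made-up modulus <span class="math">\(2^{32} - 5\)</span>, so that <span class="math">\(2^{32} \equiv 5\)</span> plays the role of <span class="math">\(2^{384} \equiv \delta\)</span>; the names <code>TOY_DELTA</code> and <code>toy_reduce</code> are hypothetical, and nothing here is taken from the OpenSSL code.</p>

```c
#include <stdint.h>

#define TOY_DELTA 5u /* toy modulus p = 2^32 - TOY_DELTA, so 2^32 = TOY_DELTA (mod p) */

/* Reduce a 64-bit product modulo p = 2^32 - 5 by repeated substitution. */
static uint32_t toy_reduce(uint64_t ab)
{
    const uint64_t p = (1ULL << 32) - TOY_DELTA;

    /* Each pass rewrites X + 2^32 * Y as X + delta * Y, which is congruent
     * modulo p and strictly smaller whenever Y is non-zero. */
    while (ab >> 32)
        ab = (ab & 0xffffffffu) + TOY_DELTA * (ab >> 32);

    /* Now ab < 2^32 < 2p, so at most one subtraction of p remains. */
    if (ab >= p)
        ab -= p;
    return (uint32_t)ab;
}
```

<p>As with the real prime, a single substitution is not enough: the product <span class="math">\(\delta Y\)</span> can itself spill past the top place value, so the substitution is repeated until nothing is left to fold.</p>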
<h3>Calculating <span class="math">\(\delta Y\)</span></h3>
<h4>First Substitution</h4>
<p>First make a choice of <span class="math">\(X\)</span> and <span class="math">\(Y\)</span>. The first thing to observe here is that this choice can actually be made in a large number of ways! We choose:</p>
<div class="math">$$X_1 = \sum_{k=0}^8\texttt{in[k]} t^k$$</div>
<div class="math">$$Y_1 = 2^8 t^2 \sum_{k=9}^{12}\texttt{in[k]} t^{k-9} = 2^8 \sum_{k=9}^{12}\texttt{in[k]} t^{k-7}$$</div>
<p>'Where does the <span class="math">\(2^8 t^{2}\)</span> come from?' I hear you ask. Observe that <span class="math">\(t^9 = t^2 \cdot t^7 = t^2 (2^8 \cdot 2^{384}) \equiv (2^8 t^2) \delta \mod{f(t)}\)</span>. Clearly, the place value of each of <code>in[9] ... in[12]</code> is greater than <span class="math">\(2^{384}\)</span>.</p>
<p>I'm using the subscripts here because we're in fact going to do a series of these reductions to reach a suitably small answer. That's because our equation for reducing <span class="math">\(t^7\)</span> terms is as follows:</p>
<div class="math">$$t^7 \equiv 2^8\delta \equiv 2^{24} t^2 + 2^{48} t + (-2^{40} + 2^8) \mod{f(t)}$$</div>
<p>Thus reducing <code>in[12]</code> involves computing:</p>
<div class="math">$$\texttt{in[12]} t^{12} = \texttt{in[12]} (t^5)(t^7) \equiv 2^8\delta \cdot \texttt{in[12]} t^5 \mod{f(t)}$$</div>
<p>But <span class="math">\(\delta\)</span> is a degree two polynomial, and so our numbers can still have two more limbs than we would want them to have. To be safe, let's store <span class="math">\(X_1 + \delta Y_1\)</span> in accumulator limbs <code>acc[0] ... acc[8]</code> (this will at first appear to be one more limb than necessary), then we can eliminate <code>in[12]</code> with the following logic.</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="cm">/* assign accumulators to begin */</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="mi">9</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="w"> </span><span class="n">acc</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">in</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="w"> </span><span class="cm">/* X += 2^128 Y */</span>
<span class="w"> </span><span class="n">acc</span><span class="p">[</span><span class="mi">8</span><span class="p">]</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">in</span><span class="p">[</span><span class="mi">12</span><span class="p">]</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">32</span><span class="p">;</span>
<span class="w"> </span><span class="n">acc</span><span class="p">[</span><span class="mi">7</span><span class="p">]</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="p">(</span><span class="n">in</span><span class="p">[</span><span class="mi">12</span><span class="p">]</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0xffffffff</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="mi">24</span><span class="p">;</span>
<span class="w"> </span><span class="cm">/* X += 2^96 Y */</span>
<span class="w"> </span><span class="n">acc</span><span class="p">[</span><span class="mi">7</span><span class="p">]</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">in</span><span class="p">[</span><span class="mi">12</span><span class="p">]</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span>
<span class="w"> </span><span class="n">acc</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="p">(</span><span class="n">in</span><span class="p">[</span><span class="mi">12</span><span class="p">]</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0xff</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="mi">48</span><span class="p">;</span>
<span class="w"> </span><span class="cm">/* X += (-2^32 + 1) Y */</span>
<span class="w"> </span><span class="n">acc</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span><span class="w"> </span><span class="o">-=</span><span class="w"> </span><span class="n">in</span><span class="p">[</span><span class="mi">12</span><span class="p">]</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">16</span><span class="p">;</span>
<span class="w"> </span><span class="n">acc</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span><span class="w"> </span><span class="o">-=</span><span class="w"> </span><span class="p">((</span><span class="n">in</span><span class="p">[</span><span class="mi">12</span><span class="p">]</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0xffff</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="mi">40</span><span class="p">);</span>
<span class="w"> </span><span class="n">acc</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">in</span><span class="p">[</span><span class="mi">12</span><span class="p">]</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">48</span><span class="p">;</span>
<span class="w"> </span><span class="n">acc</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="p">(</span><span class="n">in</span><span class="p">[</span><span class="mi">12</span><span class="p">]</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0xffffffffffff</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;&lt;</span><span class="w"> </span><span class="mi">8</span><span class="p">;</span>
</code></pre></div>
<p>Notice that for each term in <span class="math">\(\delta = 2^{128} + 2^{96} - 2^{32} + 1\)</span> we do two additions/subtractions. We split the operands this way to minimise the final size of the numbers and prevent over/underflows. Consequently, we need an <code>acc[8]</code> to receive the high bits of our <code>in[12]</code> substitution given above.</p>
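<p>As a sanity check on the two pairs of updates shown above, here is a small sketch in Python (chosen only because its integers are arbitrary precision). The helper name <code>split_shifted</code> is mine, not OpenSSL's, and the sketch assumes 56-bit limb positions. It confirms that splitting a left-shifted limb into a part that stays in limb <span class="math">\(k\)</span> and a part that moves to limb <span class="math">\(k+1\)</span> preserves its value.</p>

```python
import random

LIMB_BITS = 56  # place value of one limb: t = 2**56


def split_shifted(v, shift):
    """Split v << shift into (low, high) so that
    v << shift == low + (high << LIMB_BITS)."""
    keep = LIMB_BITS - shift                 # low bits of v that stay in limb k
    low = (v & ((1 << keep) - 1)) << shift
    high = v >> keep                         # remaining bits go to limb k+1
    return low, high


# The two pairs above correspond to shifts of 40 and 8 bits:
#   (v & 0xffff) << 40          paired with  v >> 16
#   (v & 0xffffffffffff) << 8   paired with  v >> 48
random.seed(0)
for shift in (40, 8):
    for _ in range(1000):
        v = random.getrandbits(128)
        low, high = split_shifted(v, shift)
        assert low + (high << LIMB_BITS) == v << shift
        assert low < 1 << LIMB_BITS          # the low part fits in 56 bits
```

<p>The low part never exceeds 56 bits, so only the high part can grow the limb above it, which is the point of splitting.</p>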
<h4>Second Substitution</h4>
<p>Let's now eliminate <code>acc[7]</code> and <code>acc[8]</code> through substitution. Let</p>
<div class="math">$$X_2 = \sum^{6}_{k=0}\texttt{acc[k]}t^k $$</div>
<div class="math">$$Y_2 = 2^8(\texttt{acc[7]} t^7 + \texttt{acc[8]} t^8)$$</div>
<p>But this time, <span class="math">\(\delta Y_2\)</span> is a number that takes up just five limbs, so we can comfortably update <code>acc[0], ..., acc[5]</code> in place.</p>
<h4>Third Substitution</h4>
<p>Finally, let's reduce all the high bits of <code>in[6]</code>. Since <code>in[6]</code> has place value <span class="math">\(t^6 = 2^{336}\)</span>, we wish to reduce all but the least significant <span class="math">\(384 - 336 = 48\)</span> bits.</p>
<p>A goal in designing this algorithm is to ensure that <code>acc[6]</code> has as tight a bound as reasonably possible. Intuitively, if we make <code>acc[6]</code> as large as possible by absorbing the high bits of lower limbs, we reduce the number of bits that must be carried forward later on. As such, we carry the high bits of <code>acc[4]</code> and <code>acc[5]</code> into <code>acc[6]</code> before we begin our substitution.</p>
<p>Again, let</p>
<div class="math">$$X_3 = \sum^{5}_{k=0}\texttt{acc[k]}t^k + (\texttt{acc[6]} \text{(low bits)})t^6$$</div>
<div class="math">$$Y_3 = 2^{48}(\texttt{acc[6]} \text{(high bits, right shifted)}) t^6$$</div>
<p>The equation for eliminating <span class="math">\(2^{48}t^6\)</span> is pretty straightforward:</p>
<div class="math">$$2^{384} = 2^{48}t^6 \equiv 2^{16}t^2 + 2^{40}t + (-2^{32} + 1) \mod{f(t)}$$</div>
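<p>The congruence is easy to check numerically. The sketch below uses Python's big integers and the standard secp384r1 prime; <code>fold_high_bits</code> is a hypothetical helper that operates on a whole integer rather than on limbs, so it illustrates the substitution, not the limb-level code.</p>

```python
P384 = 2**384 - 2**128 - 2**96 + 2**32 - 1
t = 2**56

lhs = 2**48 * t**6
rhs = 2**16 * t**2 + 2**40 * t + (-2**32 + 1)

assert lhs == 2**384
assert lhs - rhs == P384          # so lhs and rhs agree modulo P384


def fold_high_bits(x):
    """Replace the part of x at or above bit 384 using the congruence."""
    low, high = x & (2**384 - 1), x >> 384
    return low + high * rhs


x = 3 * 2**384 + 12345
assert fold_high_bits(x) % P384 == x % P384
assert fold_high_bits(x) < x      # the folded value is much smaller
```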
<h4>Carries</h4>
<p>Finally, as each of <code>acc[0], ..., acc[6]</code> can contain values larger than <span class="math">\(2^{56}\)</span>, we carry their respective high bits into <code>acc[6]</code> so as to remove any redundancy. Conveniently, our preemptive carrying before the third substitution has granted us a pretty tight bound on our final calculation - the final reduced number has the range <span class="math">\([0, 2^{384}]\)</span>.</p>
<h4>Canonicalisation</h4>
<p>This is 'just the right amount of reduction' but not <em>canonicalisation</em>. That is, since <span class="math">\(0 &lt; p_{384} &lt; 2^{384}\)</span>, there can be multiple possible reduced values for a given congruence class. <code>felem_contract</code> is a method which uses the fact that <span class="math">\(0 \leq x &lt; 2 p_{384}\)</span> to further reduce the output of <code>felem_reduce</code> into the range <span class="math">\([0, p_{384})\)</span> in constant time.</p>
<p>This code has many more dragons I won't explain here, but the basic premise to the calculations performed there is as follows:</p>
<p>Suppose our 385-bit input has bits <span class="math">\(b_{384}b_{383} \ldots b_1b_0\)</span>, and denote the bits of <span class="math">\(p_{384}\)</span> by <span class="math">\(q_{384}, \ldots, q_0\)</span> (with <span class="math">\(q_{384} = 0\)</span>). Whether the input is greater than or equal to <span class="math">\(p_{384}\)</span> is then determined by the logical predicate <span class="math">\(G(384)\)</span>, defined recursively:</p>
<div class="math">$$G(k) \equiv (b_k \land \lnot q_k) \lor ((b_k = q_k) \land G(k-1))$$</div>
<div class="math">$$G(0) \equiv b_0 = q_0$$</div>
<p>With <span class="math">\(p_{384}\)</span> being a Solinas/pseudo-Mersenne prime, its binary representation consists of long contiguous runs of repeated bits, so we can of course use this to massively simplify our predicate. Doing this in constant time involves some interesting bit-shifting/masking shenanigans. Essentially, you want a bit vector of all ones or all zeros depending on the value of <span class="math">\(G(384)\)</span>; we then logically 'and' <span class="math">\(p_{384}\)</span> with this bitmask to 'conditionally' subtract it from our result.</p>
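<p>To make the idea concrete, here is a rough sketch of the predicate and the masked subtraction in Python. It is not the <code>felem_contract</code> code: it works on whole integers, the function names are mine, and Python integers are not constant-time, so it only illustrates the logic. The loop starts from an 'equal so far' value of 1, which gives the same base case as above because <span class="math">\(p_{384}\)</span> is odd (<span class="math">\(q_0 = 1\)</span>).</p>

```python
P384 = 2**384 - 2**128 - 2**96 + 2**32 - 1
NBITS = 385


def greater_or_equal(x, q=P384):
    """Evaluate G(384) bit by bit, from least to most significant."""
    g = 1                                   # nothing compared yet: treat as equal
    for k in range(NBITS):
        b_k = (x >> k) & 1
        q_k = (q >> k) & 1
        equal = 1 ^ b_k ^ q_k               # 1 when b_k == q_k
        g = (b_k & (1 ^ q_k)) | (equal & g)
    return g


def contract(x):
    """Map x in [0, 2*P384) to [0, P384) using a mask instead of a branch."""
    mask = -greater_or_equal(x)             # 0, or -1 (all ones)
    return x - (P384 & mask)


assert contract(P384 + 5) == 5
assert contract(P384 - 1) == P384 - 1
```

<p>The real implementation exploits the runs of ones and zeros in <span class="math">\(p_{384}\)</span> to avoid looping over every bit.</p>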
<h3>A Side Note about the Weird Constants</h3>
<p>Okay so we're implementing our modular arithmetic with unsigned integer limbs that together represent a number of the following form:</p>
<div class="math">$$x = \sum_{k = 0}^{n-1} 2^{dk} l_k$$</div>
<p>How do we then do subtractions in a way which makes underflow impossible? Well, computing <span class="math">\(a - b\)</span> limb by limb is really straightforward if each limb of <span class="math">\(a\)</span> is at least as large as the corresponding limb of <span class="math">\(b\)</span>. So we first add a suitable multiple of <span class="math">\(p_{384}\)</span> to <span class="math">\(a\)</span> that causes each limb of <span class="math">\(a\)</span> to be sufficiently large; this leaves the result unchanged modulo <span class="math">\(p_{384}\)</span>.</p>
<p>Thankfully, with redundant-limb arithmetic, we can do this easily by means of <em>telescopic sums</em>. For example, in <code>felem_reduce</code> we wanted all limbs of our <span class="math">\(p_{384}\)</span> multiple to be sufficiently large. We overshot the requirement and chose a multiple whose limbs are all bounded below by <span class="math">\(2^{123}\)</span>. We first scale our prime accordingly so that its 'lead term' (speaking in the polynomial representation) is <span class="math">\(2^{124}\)</span>.</p>
<div class="math">$$2^{76} f(t) = 2^{124} t^6 - 2^{92} t^2 - 2^{116} t + (2^{108} - 2^{76}) t^0$$</div>
<p>Notice that most limbs of this multiple (the limbs being the coefficients) are either too small or negative. We then transform this expression into a suitable telescopic sum. Observe that when <span class="math">\(t = 2^{56}\)</span>, <span class="math">\(2^{124} t^k = 2^{124-56}t^{k+1} = 2^{68} t^{k+1}\)</span>, so we simply add a <span class="math">\(2^{124}\)</span> term to each limb where required and subtract the same number, as <span class="math">\(2^{68}\)</span>, from the next higher limb.</p>
<div class="math">$$
\begin{align*}
2^{76} f(t) &amp;= (2^{124} - 2^{68}) t^6 \\
&amp;+ (2^{124} - 2^{68}) t^5 \\
&amp;+ (2^{124} - 2^{68}) t^4 \\
&amp;+ (2^{124} - 2^{68}) t^3 \\
&amp;+ (2^{124} - 2^{92} - 2^{68}) t^2 \\
&amp;+ (2^{124} - 2^{116} - 2^{68}) t \\
&amp;+ (2^{124} + 2^{108} - 2^{76})
\end{align*}
$$</div>
<p>We can then subtract any value whose limbs are no larger than the smallest of the limbs above, without fear of an underflow corrupting the result. In our case, that upper bound on the limb values is <span class="math">\(2^{124} - 2^{116} - 2^{68} &gt; 2^{123}\)</span>. Very comfortable.</p>
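<p>The telescopic sum can be checked mechanically. The following Python sketch copies the coefficients from the expansion above and verifies that they still represent <span class="math">\(2^{76} p_{384}\)</span>, and that every limb clears the <span class="math">\(2^{123}\)</span> bound.</p>

```python
P384 = 2**384 - 2**128 - 2**96 + 2**32 - 1
t = 2**56

# Coefficients of t^0 .. t^6, taken from the telescopic sum above.
limbs = [
    2**124 + 2**108 - 2**76,
    2**124 - 2**116 - 2**68,
    2**124 - 2**92 - 2**68,
    2**124 - 2**68,
    2**124 - 2**68,
    2**124 - 2**68,
    2**124 - 2**68,
]

value = sum(c * t**k for k, c in enumerate(limbs))

assert value == 2**76 * P384                      # still a multiple of the prime
assert min(limbs) == 2**124 - 2**116 - 2**68      # the smallest limb
assert all(c > 2**123 for c in limbs)             # every limb clears the bound
```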
<h2>Concerning Timing Side-Channels</h2>
<p>Cryptographic routines must perform all of their calculations in constant time. More specifically, it is important that timing cryptographic code does not reveal any private keys or random nonces used during computation. Ultimately, all of our work so far has been to speed up arithmetic in the prime field with modulus <span class="math">\(p_{384}\)</span>. But this is done in order to facilitate calculations on the secp384r1 elliptic curve, and ECDSA/ECDH each depend on being able to perform scalar 'point multiplication' (repeated application of the group operation). Since such an operation is inherently iterative, it presents the greatest potential for timing attacks.</p>
<p>We implement constant-time multiplication with the <em>wNAF</em> ladder method. This relies on pre-computing a window of multiples of the group generator, and then scaling and selectively adding multiples when required. <a href="https://en.wikipedia.org/wiki/Elliptic_curve_point_multiplication#Point_multiplication">Wikipedia</a> provides a helpful primer to this method by cumulatively building upon more naive approaches.</p>
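<p>For a feel of what the recoding step looks like, here is a minimal Python sketch of the textbook width-<span class="math">\(w\)</span> NAF digit computation. It is only the scalar recoding, not point arithmetic, it is not constant-time, and it is not the OpenSSL implementation; the function names are mine.</p>

```python
def wnaf(k, w):
    """Return the width-w NAF digits of k >= 0, least significant first."""
    digits = []
    while k > 0:
        if k & 1:
            d = k & ((1 << w) - 1)        # k mod 2^w
            if d >= 1 << (w - 1):
                d -= 1 << w               # pick the signed representative
            k -= d                        # now k is divisible by 2^w
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def from_digits(digits):
    """Recombine signed digits into the scalar they represent."""
    return sum(d << i for i, d in enumerate(digits))


digits = wnaf(123456789, 5)
assert from_digits(digits) == 123456789
assert all(d == 0 or (d % 2 == 1 and abs(d) < 16) for d in digits)
```

<p>Each nonzero digit is odd and small, so it selects one entry of a precomputed table of odd multiples (or its negation), and every nonzero digit is followed by at least <span class="math">\(w - 1\)</span> zeros, which are plain doublings.</p>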
<h2>Conclusion</h2>
<p>While the resulting code borrows from and uses common language of Solinas reduction, ultimately there are a number of implementation decisions that were guided by heuristic - going from theory to implementation was far from cut-and-dry. The limb size, carry order, choice of substitutions as well as pre and post conditions made here are ultimately arbitrary. You could easily imagine there being further refinements obtaining a better result. For now, I hope this post serves to demystify the inner workings of ECC implementations in OpenSSL. These algorithms, although particular and sophisticated, need not be immutable.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Rohan McLure</dc:creator><pubDate>Mon, 07 Aug 2023 12:00:00 +1000</pubDate><guid isPermaLink="false">tag:sthbrx.github.io,2023-08-07:/blog/2023/08/07/going-out-on-a-limb-efficient-elliptic-curve-arithmetic-in-openssl/</guid><category>Cryptography</category></item><item><title>Quirks of parsing SSH configs</title><link>https://sthbrx.github.io/blog/2023/08/04/quirks-of-parsing-ssh-configs/</link><description><h2>Introduction</h2>
<p>I've been using the VSCodium
<a href="https://open-vsx.org/extension/jeanp413/open-remote-ssh">Open Remote - SSH</a>
extension recently with great results. I can treat everything as a single
environment, without any worry about syncing between my local development files
and the remote. This is very different to mounting the remote as a network drive
and opening a local instance of VSCodium on it: in addition to crippling latency
on every action, a locally mounted drive doesn't bring the build context that
tools like <code>clangd</code> require (e.g., system headers).</p>
<p>Instead, the remote extension runs a server on the remote that performs most
actions, and the local VSCodium instance acts as a client that buffers and
caches data seamlessly, so the experience is nearly as good as developing
locally. </p>
<p>For example, a project wide file search on a network drive is unusably slow
because every file and directory read requires a round trip back to the remote,
and the latency is just too large to finish getting results back in a reasonable
time. But with the client-server approach, the client just sends the search
request to the server for it to fulfil, and all the server has to do is send the
matches back. This eliminates nearly all the latency effects, except for the
initial request and receiving any results.</p>
<p>However, there has been one issue with using this for everything: the extension
failed to connect when I wasn't on the same network as the host machine. So I
wasn't able to use it when working from home over a VPN. In this post we find
out why this happened, and in the process look at some of the weird quirks of
parsing an SSH config.</p>
<h2>The issue</h2>
<p>As above, I wasn't able to connect to my remote machines when working from home.
The extension would abort with the following error:</p>
<div class="highlight"><pre><span></span><code>[Error - 00:23:10.592] Error resolving authority
Error: getaddrinfo ENOTFOUND remotename.ozlabs.ibm.com
at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:109:26)
</code></pre></div>
<p>So it's a DNS issue. This would make sense, as the remote machine is not exposed
to the internet, and must instead be accessed through a proxy. What's weird is
that the integrated terminal in VSCodium has no problem connecting to the
remote. So the extension seems to be doing something different than just a plain
SSH connection.</p>
<p>You might think that the extension is not reading the SSH config. But the
extension panel lists all the host aliases I've declared in the config, so it's
clearly aware of the config at least. Possibly it doesn't understand the proxy
config correctly? If it was trying to connect directly from the host, it would
make sense to fail a DNS lookup.</p>
<h2>Investigating</h2>
<p>Enough theorising, time to debug the extension as it tries to connect.</p>
<p>From the error above, the string <code>"Error resolving authority"</code> looks like
something I can search for. This takes me to the
<a href="https://github.com/jeanp413/open-remote-ssh/blob/521098e24f48b4b9e04d476895f9097b03f8c984/src/authResolver.ts#L226"><code>catch</code> case for a large try-catch block</a>.
It could be annoying to narrow down which part of the block
throws the exception, but fortunately debugging is as easy as installing the
dependencies and running the pre-configured 'Extension' debug target. This opens
a new window with the local copy of the extension active, and I can debug it in
the original window.</p>
<p>In this block, there is a conditional statement on whether the <code>ProxyJump</code> field
is present in the config. This is a good place to break on and see what the
computed config looks like. If it doesn't find a proxy then of course it's going
to run everything on the host.</p>
<p>And indeed, it doesn't think there is a proxy. This is progress, but why does
the extension's view of the config not match up with what SSH does? After all,
invoking SSH directly connects properly. Tracing back the source of the config
in the extension, it ultimately comes from manually reading in and parsing the
SSH config. When resolving the host argument it manually computes the config as
per <a href="https://man7.org/linux/man-pages/man5/ssh_config.5.html"><code>ssh_config(5)</code></a>.
Yet somewhere it makes a mistake, because it doesn't include the <code>ProxyJump</code>
field.</p>
<h2>Parsing SSH config</h2>
<p>To get to the bottom of this, we need to know the rules behind parsing SSH
configs. The <code>ssh_config(5)</code> manpage does a pretty decent job of explaining
this, but I'm going to go over the relevant information here. I reckon most
people have a vague idea of how it works, and can write enough to meet their
needs, but have never looked deeper into the actual rules behind how SSH parses
the config.</p>
<ol>
<li>
<p>For starters, the config is parsed line by line. Leading whitespace (i.e.,
indentation) is ignored. So, while indentation makes it look like you are
configuring properties for a particular host, this isn't quite correct.
Instead, the <code>Host</code> and <code>Match</code> lines are special statements that enable or
disable all subsequent lines until the next <code>Host</code> or <code>Match</code>.</p>
<p>There is no backtracking; previous conditions and lines are not re-evaluated
after learning more about the config later on.</p>
</li>
<li>
<p>When a config line is seen, and is active thanks to the most recent <code>Host</code> or
<code>Match</code> succeeding, its value is selected if it is the first of that config
to be selected. So the earliest place a value is set takes priority; this may
be a little counterintuitive if you are used to the latest value being picked,
as tends to be the case with enable/disable command line flags.</p>
</li>
<li>
<p>When <code>HostName</code> is set, it replaces the <code>host</code> value in <code>Match</code> matches. It
is also used as the <code>Host</code> value during a final pass (if requested).</p>
</li>
<li>
<p>The last behaviour of interest is the <code>Match final</code> rule. There are several
conditions a <code>Match</code> statement can have, and the <code>final</code> rule says make this
active on the final pass over the config.</p>
</li>
</ol>
<p>Wait, final pass? Multiple passes? Yes. If <code>final</code> is a condition on a <code>Match</code>,
SSH will do another pass over the entire config, following all the rules above.
Except this time all the configs we read on the first pass are still active (and
can't be changed). But all the <code>Host</code> and <code>Match</code> conditions are re-evaluated, allowing
other configs to potentially be set. I guess that means rule (1) ought to have a
big asterisk next to it.</p>
<p>Together, these rules can lead to some quirky behaviours. Consider the following
config:</p>
<div class="highlight"><pre><span></span><code>Match host=&quot;*.ozlabs.ibm.com&quot;
ProxyJump proxy
Host example
HostName example.ozlabs.ibm.com
</code></pre></div>
<p>If I run <code>ssh example</code> on the command line, will it use the proxy?</p>
<p>By rule (1), no. When testing the first <code>Match host</code> condition, our host value
is currently <code>example</code>. It is not until we reach the <code>HostName</code> config that we
start using <code>example.ozlabs.ibm.com</code> for these matches.</p>
<p>But by rule (4), the answer turns into <em>maybe</em>. If we end up doing a second pass
over the config thanks to a <code>Match final</code> that could be <em>anywhere</em> else, we
would now be matching <code>example.ozlabs.ibm.com</code> against the first line on the
second go around. This will pass, and, since nothing has set <code>ProxyJump</code> yet, we
would gain the proxy.</p>
<p>You may think, yes, but we don't have a <code>Match final</code> in that example. But if
you thought that, then you forgot about the system config.</p>
<p>The system config is effectively appended to the user config, to allow any
system wide settings. Most of the time this isn't an issue because of the
first-come-first-served rule with config matches (rule 2). But if the system
config includes a <code>Match final</code>, it will trigger the entire config to be
re-parsed, including the user section. And it so happens that, at least on
Fedora with the <code>openssh-clients</code> package installed, the system config does
contain a <code>Match final</code> (see <code>/etc/ssh/ssh_config.d</code>).</p>
<p>But wait, there's more! If we want to specify a custom SSH config file, then we
can use <code>-F path/to/config</code> in the command line. But this disables loading a
system config, so we would no longer get the proxy!</p>
<p>To sum up, for the above config:</p>
<ol>
<li><code>ssh example</code> doesn't have a proxy</li>
<li>...unless a system config contains <code>Match final</code></li>
<li>...but invoking it as <code>ssh -F ~/.ssh/config example</code> definitely won't have
the proxy</li>
<li>...but if a subprocess invokes <code>ssh example</code> while trying to resolve another
host, it'll probably not add the <code>-F ~/.ssh/config</code>, so we might get the
proxy again (in the child process).</li>
</ol>
<p>Wait, how did that last one slip in? Well, unlike environment variables, it's a
lot harder for processes to propagate command line flags correctly. If resolving
the config involves running a script that itself tries to run SSH, chances are
the <code>-F</code> flag won't be propagated and you'll see some weird behaviour.</p>
<p>I swear that's all for now, you've probably learned more about SSH configs than
you will ever need to care about.</p>
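<p>The first two outcomes in the summary can be reproduced with a toy model of the rules as described in this post. This Python sketch is emphatically not OpenSSH's parser: the names <code>one_pass</code> and <code>resolve</code> are made up, a config is just a list of (keyword, argument) pairs, and only <code>Host</code>, <code>Match host=</code> and <code>Match final</code> are understood.</p>

```python
from fnmatch import fnmatchcase


def one_pass(lines, original, options, final_pass):
    """Apply one pass of the toy rules; return True if 'Match final' was seen."""
    active = True          # lines before any Host/Match apply to every host
    saw_final = False
    for keyword, arg in lines:
        # Rule 3: once HostName is chosen, it replaces the host used by Match.
        host = options.get("HostName", original)
        if keyword == "Host":
            # Host uses the command-line name, or HostName on the final pass.
            active = fnmatchcase(host if final_pass else original, arg)
        elif keyword == "Match" and arg == "final":
            saw_final = True
            active = final_pass                        # rule 4
        elif keyword == "Match":                       # only host=<pattern> here
            active = fnmatchcase(host, arg.split("=", 1)[1])
        elif active:
            options.setdefault(keyword, arg)           # rule 2: first value wins
    return saw_final


def resolve(lines, name):
    """Resolve options for `name`, doing a second pass if Match final appears."""
    options = {}
    if one_pass(lines, name, options, final_pass=False):
        one_pass(lines, name, options, final_pass=True)
    return options


USER = [
    ("Match", "host=*.ozlabs.ibm.com"),
    ("ProxyJump", "proxy"),
    ("Host", "example"),
    ("HostName", "example.ozlabs.ibm.com"),
]
SYSTEM = [("Match", "final")]   # stand-in for a system config with Match final

print(resolve(USER, "example"))            # user config only: no ProxyJump
print(resolve(USER + SYSTEM, "example"))   # with Match final: ProxyJump appears
```

<p>The only difference between the two calls is whether a <code>Match final</code> appears anywhere in the combined config, which is exactly the spooky action at a distance described above.</p>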
<h2>Back to VSCodium</h2>
<p>Alright, armed now with this knowledge on SSH config parsing, we can work out
what's going on with the extension. It ends up being a simple issue: it doesn't
apply rules (3) and (4), so all <code>Host</code> matches are done against the original
host name.</p>
<p>In my case, there are several machines behind the proxy, but they all share a
common suffix, so I had a <code>Host *.ozlabs.ibm.com</code> rule to apply the proxy. I
also use aliases to refer to the machines without the <code>.ozlabs.ibm.com</code> suffix,
so failing to follow rule (3) led to the situation where the extension didn't
think there was a proxy.</p>
<p>However, even if this were to be fixed, it still doesn't respect rule (4), or
most complex match logic in general. If the hostname bug is fixed then my setup
would work, but it's less than ideal to keep playing whack-a-mole with parsing
bugs. It would be a lot easier if there was a way to just ask SSH for the config
that a given host name resolves to.</p>
<p>Enter <code>ssh -G</code>. The <code>-G</code> flag asks SSH to dump the complete resolved config,
without actually opening the connection (it may execute arbitrary code while
resolving the config however!). So to fix the extension once and for all, we
could swap the manual parser to just invoking <code>ssh -G example</code>, and parsing the
output as the final config. No <code>Host</code> or <code>Match</code> or <code>HostName</code> or <code>Match final</code>
quirks to worry about.</p>
<p>Sure enough, if we replace the config backend with this 'native' resolver, we
can connect to all the machines with no problem. Hopefully the
<a href="https://github.com/jeanp413/open-remote-ssh/pull/103">pull request</a> to add this
support will get accepted, and I can stop running my locally patched copy of the
extension.</p>
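<p>As a rough illustration of the <code>ssh -G</code> approach (in Python here, not the extension's actual code), the sketch below shells out to <code>ssh -G</code> and turns its output into a dictionary. It assumes the output is one lowercase <code>keyword value</code> pair per line, which is what I see from OpenSSH; the function names and the sample input are made up for the example.</p>

```python
import subprocess


def parse_ssh_g(output):
    """Parse `ssh -G` style output: one 'keyword value' pair per line."""
    config = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        keyword, _, value = line.partition(" ")
        # Some keywords (e.g. identityfile) may repeat, so keep a list.
        config.setdefault(keyword.lower(), []).append(value)
    return config


def resolved_config(host):
    """Ask the local ssh binary for the fully resolved config of `host`."""
    result = subprocess.run(
        ["ssh", "-G", host], capture_output=True, text=True, check=True
    )
    return parse_ssh_g(result.stdout)


# Illustrative input only; real `ssh -G` output is much longer.
sample = (
    "hostname example.ozlabs.ibm.com\n"
    "proxyjump proxy\n"
    "identityfile ~/.ssh/id_a\n"
    "identityfile ~/.ssh/id_b\n"
)
config = parse_ssh_g(sample)
assert config["proxyjump"] == ["proxy"]
```

<p>Because SSH itself does the resolution, the caller never needs to know about passes, <code>HostName</code> substitution or system configs.</p>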
<p>In general, I'd suggest avoiding any dependency on a second pass being done on
the config. Resolve your aliases early, so that the rest of your matches work
against the full hostname. If you later need to match against the name passed in
the command line, you can use <code>Match originalhost=example</code>. The example above
should always be written as</p>
<div class="highlight"><pre><span></span><code>Host example
HostName example.ozlabs.ibm.com
Match host=&quot;*.ozlabs.ibm.com&quot;
ProxyJump proxy
</code></pre></div>
<p>even if the reversed order might appear to work thanks to the weird interactions
described above. And after learning these parser quirks, I find
<code>Host</code> match statements unreliable: the fact that they may or may not be run
against the <code>HostName</code> value allows truly strange bugs to appear. Maybe you
should remove this uncertainty by starting your config with <code>Match final</code> to at
least always be parsed the same way.</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Benjamin Gray</dc:creator><pubDate>Fri, 04 Aug 2023 18:00:00 +1000</pubDate><guid isPermaLink="false">tag:sthbrx.github.io,2023-08-04:/blog/2023/08/04/quirks-of-parsing-ssh-configs/</guid><category>Development</category><category>ssh</category></item><item><title>Detecting rootless Docker</title><link>https://sthbrx.github.io/blog/2023/04/05/detecting-rootless-docker/</link><description><h2>Trying to do some fuzzing...</h2>
<p>The other day, for the first time in a while, I wanted to do something with <a href="https://github.com/google/syzkaller">syzkaller</a>, a system call fuzzer that has been used to find literally thousands of kernel bugs. As it turns out, since the last time I had done any work on syzkaller, I switched to a new laptop, and so I needed to set up a few things in my development environment again.</p>
<p>While I was doing this, I took a look at the syzkaller source again and found a neat little script called <a href="https://github.com/google/syzkaller/blob/master/tools/syz-env"><code>syz-env</code></a>, which uses a Docker image to provide you with a standardised environment that has all the necessary tools and dependencies preinstalled.</p>
<p>I decided to give it a go, and then realised I hadn't actually installed Docker since getting my new laptop. So I went to do that, and along the way I discovered <a href="https://docs.docker.com/engine/security/rootless/">rootless mode</a>, and decided to give it a try.</p>
<h2>What's rootless mode?</h2>
<p>As of relatively recently, Docker supports rootless mode, which allows you to run your <code>dockerd</code> as a non-root user. This is helpful for security, as traditional "rootful" Docker can trivially be used to obtain root privileges outside of a container. Rootless Docker is implemented using <a href="https://github.com/rootless-containers/rootlesskit">RootlessKit</a> (a fancy replacement for <a href="https://wiki.debian.org/FakeRoot">fakeroot</a> that uses user namespaces) to create a new user namespace that maps the UID of the user running <code>dockerd</code> to 0.</p>
<p>You can find more information, including details of the various restrictions that apply to rootless setups, <a href="https://docs.docker.com/engine/security/rootless/">in the Docker documentation</a>.</p>
<h2>The problem</h2>
<p>I ran <code>tools/syz-env make</code> to test things out. It pulled the container image, then gave me some strange errors:</p>
<div class="highlight"><pre><span></span><code>ajd@jarvis-debian:~/syzkaller$ tools/syz-env make NCORES=1
gcr.io/syzkaller/env:latest
warning: Not a git repository. Use --no-index to compare two paths outside a working tree
usage: git diff --no-index [&lt;options&gt;] &lt;path&gt; &lt;path&gt;
...
fatal: detected dubious ownership in repository at &#39;/syzkaller/gopath/src/github.com/google/syzkaller&#39;
To add an exception for this directory, call:
git config --global --add safe.directory /syzkaller/gopath/src/github.com/google/syzkaller
fatal: detected dubious ownership in repository at &#39;/syzkaller/gopath/src/github.com/google/syzkaller&#39;
To add an exception for this directory, call:
git config --global --add safe.directory /syzkaller/gopath/src/github.com/google/syzkaller
go list -f &#39;{{.Stale}}&#39; ./sys/syz-sysgen | grep -q false || go install ./sys/syz-sysgen
error obtaining VCS status: exit status 128
Use -buildvcs=false to disable VCS stamping.
error obtaining VCS status: exit status 128
Use -buildvcs=false to disable VCS stamping.
make: *** [Makefile:155: descriptions] Error 1
</code></pre></div>
<p>After a bit of digging, I found that <code>syz-env</code> mounts the syzkaller source directory inside the container as a volume. <code>make</code> was running with UID 1000, while the files in the mounted volume appeared to be owned by root.</p>
<p>Reading the script, it turns out that <code>syz-env</code> invokes <code>docker run</code> with the <code>--user</code> option to set the UID inside the container to match the user's UID outside the container, to ensure that file ownership and permissions behave as expected.</p>
<p>This works in rootful Docker, where files appear inside the container to be owned by the same UID as they are outside the container. However, it breaks in rootless mode: due to the way RootlessKit sets up the namespaces, the user's UID is mapped to 0, causing the files to appear to be owned by root.</p>
<p>The workaround seemed pretty obvious: just skip the <code>--user</code> flag if running rootless.</p>
<h2>How can you check whether your Docker daemon is running in rootless mode?</h2>
<p>It took me quite a while, as a total Docker non-expert, to figure out how to definitively check whether the Docker daemon is running rootless or not. There's a variety of ways you could do this, such as checking the name of the current Docker context to see if it's called <code>rootless</code> (as used by the Docker rootless setup scripts), but I think the approach I settled on is the most correct one.</p>
<p>If you want to check whether your Docker daemon is running in rootless mode, use <code>docker info</code> to query the daemon's security options, and check for the <code>rootless</code> option.</p>
<div class="highlight"><pre><span></span><code>docker info -f &quot;{{println .SecurityOptions}}&quot; | grep rootless
</code></pre></div>
<p>If this prints something like:</p>
<div class="highlight"><pre><span></span><code>[name=seccomp,profile=builtin name=rootless name=cgroupns]
</code></pre></div>
<p>then you're running rootless.</p>
<p>If it prints nothing, then you're running traditional rootful Docker.</p>
<p>Easy! (And I sent a fix which is now <a href="https://github.com/google/syzkaller/commit/340a1b9094e4b3fad232c98c62de653ec48954ab">merged into syzkaller!</a>)</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Donnellan</dc:creator><pubDate>Wed, 05 Apr 2023 13:00:00 +1000</pubDate><guid isPermaLink="false">tag:sthbrx.github.io,2023-04-05:/blog/2023/04/05/detecting-rootless-docker/</guid><category>Development</category><category>Docker</category><category>syzkaller</category></item><item><title>Dumb bugs: the PCI device that wasn't</title><link>https://sthbrx.github.io/blog/2023/04/04/dumb-bugs-the-pci-device-that-wasnt/</link><description><p>I was happily minding my own business one fateful afternoon when I received the following kernel bug report:</p>
<div class="highlight"><pre><span></span><code>BUG: KASAN: slab-out-of-bounds in vga_arbiter_add_pci_device+0x60/0xe00
Read of size 4 at addr c000000264c26fdc by task swapper/0/1
Call Trace:
dump_stack_lvl+0x1bc/0x2b8 (unreliable)
print_report+0x3f4/0xc60
kasan_report+0x244/0x698
__asan_load4+0xe8/0x250
vga_arbiter_add_pci_device+0x60/0xe00
pci_notify+0x88/0x444
notifier_call_chain+0x104/0x320
blocking_notifier_call_chain+0xa0/0x140
device_add+0xac8/0x1d30
device_register+0x58/0x80
vio_register_device_node+0x9ac/0xce0
vio_bus_scan_register_devices+0xc4/0x13c
__machine_initcall_pseries_vio_device_init+0x94/0xf0
do_one_initcall+0x12c/0xaa8
kernel_init_freeable+0xa48/0xba8
kernel_init+0x64/0x400
ret_from_kernel_thread+0x5c/0x64
</code></pre></div>
<p>OK, so <a href="https://www.kernel.org/doc/html/latest/dev-tools/kasan.html">KASAN</a> has helpfully found an out-of-bounds access in <code>vga_arbiter_add_pci_device()</code>. What the heck is that?</p>
<h2>Why does my VGA require arbitration?</h2>
<p>I'd never heard of the <a href="https://en.wikipedia.org/wiki/VGA_connector">VGA</a> arbiter in the kernel (do kids these days know what VGA is?), or <code>vgaarb</code> as it's called. What it does is irrelevant to this bug, but I found the history pretty interesting! <a href="https://lists.freedesktop.org/archives/xorg/2005-March/006663.html">Benjamin Herrenschmidt proposed VGA arbitration back in 2005</a> as a way of resolving conflicts between multiple legacy VGA devices that want to use the same address assignments. This was previously handled in userspace by the X server, but issues arose with multiple X servers on the same machine. Plus, it's probably not a good idea for this kind of thing to be handled by userspace. <a href="https://docs.kernel.org/gpu/vgaarbiter.html">You can read more about the VGA arbiter in the kernel docs</a>, but it's probably not something anyone has thought much about in a long time.</p>
<h2>The bad access</h2>
<div class="highlight"><pre><span></span><code><span class="k">static</span><span class="w"> </span><span class="kt">bool</span><span class="w"> </span><span class="nf">vga_arbiter_add_pci_device</span><span class="p">(</span><span class="k">struct</span><span class="w"> </span><span class="nc">pci_dev</span><span class="w"> </span><span class="o">*</span><span class="n">pdev</span><span class="p">)</span>
<span class="p">{</span>
<span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">vga_device</span><span class="w"> </span><span class="o">*</span><span class="n">vgadev</span><span class="p">;</span>
<span class="w"> </span><span class="kt">unsigned</span><span class="w"> </span><span class="kt">long</span><span class="w"> </span><span class="n">flags</span><span class="p">;</span>
<span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">pci_bus</span><span class="w"> </span><span class="o">*</span><span class="n">bus</span><span class="p">;</span>
<span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">pci_dev</span><span class="w"> </span><span class="o">*</span><span class="n">bridge</span><span class="p">;</span>
<span class="w"> </span><span class="n">u16</span><span class="w"> </span><span class="n">cmd</span><span class="p">;</span>
<span class="w"> </span><span class="cm">/* Only deal with VGA class devices */</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">((</span><span class="n">pdev</span><span class="o">-&gt;</span><span class="n">class</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">8</span><span class="p">)</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">PCI_CLASS_DISPLAY_VGA</span><span class="p">)</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="nb">false</span><span class="p">;</span>
</code></pre></div>
<p>We're blowing up on the read of <code>pdev-&gt;class</code>, and the problem isn't something like the data being uninitialised: the access itself is out-of-bounds. If we look back at the call trace:</p>
<div class="highlight"><pre><span></span><code>vga_arbiter_add_pci_device+0x60/0xe00
pci_notify+0x88/0x444
notifier_call_chain+0x104/0x320
blocking_notifier_call_chain+0xa0/0x140
device_add+0xac8/0x1d30
device_register+0x58/0x80
vio_register_device_node+0x9ac/0xce0
vio_bus_scan_register_devices+0xc4/0x13c
</code></pre></div>
<p>This thing is a VIO device, not a PCI device! Let's jump into the caller, <code>pci_notify()</code>, to find out how we got our <code>pdev</code>.</p>
<div class="highlight"><pre><span></span><code><span class="k">static</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="nf">pci_notify</span><span class="p">(</span><span class="k">struct</span><span class="w"> </span><span class="nc">notifier_block</span><span class="w"> </span><span class="o">*</span><span class="n">nb</span><span class="p">,</span><span class="w"> </span><span class="kt">unsigned</span><span class="w"> </span><span class="kt">long</span><span class="w"> </span><span class="n">action</span><span class="p">,</span>
<span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="n">data</span><span class="p">)</span>
<span class="p">{</span>
<span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">device</span><span class="w"> </span><span class="o">*</span><span class="n">dev</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data</span><span class="p">;</span>
<span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">pci_dev</span><span class="w"> </span><span class="o">*</span><span class="n">pdev</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">to_pci_dev</span><span class="p">(</span><span class="n">dev</span><span class="p">);</span>
</code></pre></div>
<p>So <code>pci_notify()</code> gets called with our VIO device (somehow), and we're converting that <code>struct device</code> into a <code>struct pci_dev</code> with no error checking. We could solve this particular bug by just checking that our device is <em>actually</em> a PCI device before we proceed - but we're in a function called <code>pci_notify</code>, we're expecting a PCI device to come in, so this would just be a bandaid.</p>
<p><code>to_pci_dev()</code> works like other container structs in the kernel - <code>struct pci_dev</code> contains a <code>struct device</code> as a member, so the <code>container_of()</code> macro returns an address based on where a <code>struct pci_dev</code> would have to be if the given <code>struct device</code> was actually a PCI device. Since we know it's not actually a PCI device and this <code>struct device</code> does not actually sit inside a <code>struct pci_dev</code>, our <code>pdev</code> is now pointing to some random place in memory, hence our access to a member like <code>class</code> is caught by KASAN.</p>
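<p>To make that pointer arithmetic concrete, here's a minimal userspace sketch. The structs are cut-down stand-ins rather than the kernel's real definitions, and <code>container_of()</code> is reimplemented with integer arithmetic (an assumption of this example, not the kernel's exact macro) so that the bogus case can be computed safely without ever being dereferenced.</p>
<div class="highlight"><pre><span></span><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Simplified stand-ins for the kernel structs; the fields are illustrative. */
struct device { int id; };

struct pci_dev {
    unsigned int class;
    struct device dev;   /* embedded, so container_of() is valid here */
};

struct vio_dev {
    struct device dev;   /* embedded in a different container type */
};

/* Illustrative reimplementation using integer arithmetic on the address. */
#define container_of(ptr, type, member) \
    ((type *)((uintptr_t)(ptr) - offsetof(type, member)))
#define to_pci_dev(d) container_of(d, struct pci_dev, dev)

int main(void)
{
    struct pci_dev pci = { .class = 0x030000, .dev = { .id = 1 } };
    struct vio_dev vio = { .dev = { .id = 2 } };

    /* Correct use: we get back the struct that really contains the device. */
    printf("pci round-trips: %d\n", to_pci_dev(&amp;pci.dev) == &amp;pci);

    /* Misuse: the result points before the start of vio, at memory that is
     * not a struct pci_dev. We only compare addresses, never dereference. */
    struct pci_dev *bogus = to_pci_dev(&amp;vio.dev);
    printf("bogus is below vio: %d\n", (uintptr_t)bogus &lt; (uintptr_t)&amp;vio);
    return 0;
}
</code></pre></div>
<p>On a typical platform both lines print 1. In the kernel the bogus pointer then gets dereferenced to read <code>class</code>, which is the out-of-bounds read KASAN reported.</p>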
<p>Now we know why and how we're blowing up, but we still don't understand how we got here, so let's back up further.</p>
<h2>Notifiers</h2>
<p>The kernel's device subsystem allows consumers to register callbacks so that they can be notified of a given event. I'm not going to go into a ton of detail on how they work, because I don't fully understand them myself, and there's a lot of internals of the device subsystem involved.
The best references I could find for this are <a href="https://elixir.bootlin.com/linux/latest/source/include/linux/notifier.h">notifier.h</a>, and for our purposes here, <a href="https://elixir.bootlin.com/linux/latest/source/include/linux/device/bus.h#L260">the register notifier functions in bus.h</a>.</p>
<p>Something's clearly gone awry if we can end up in a function named <code>pci_notify()</code> without passing it a PCI device. We find where the notifier is registered in <code>vgaarb.c</code> here:</p>
<div class="highlight"><pre><span></span><code><span class="k">static</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">notifier_block</span><span class="w"> </span><span class="n">pci_notifier</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="p">.</span><span class="n">notifier_call</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pci_notify</span><span class="p">,</span>
<span class="p">};</span>
<span class="k">static</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">__init</span><span class="w"> </span><span class="nf">vga_arb_device_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="w"> </span><span class="cm">/* some stuff removed here... */</span>
<span class="w"> </span><span class="n">bus_register_notifier</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pci_bus_type</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">pci_notifier</span><span class="p">);</span>
</code></pre></div>
<p>This all looks sane. A blocking notifier is registered so that <code>pci_notify()</code> gets called whenever there's a notification going out to PCI buses. Our VIO device is distinctly <em>not</em> on a PCI bus, and in my debugging I couldn't find any potential causes of such confusion, so how on earth is a notification for PCI buses being applied to our non-PCI device?</p>
<p>Deep in the guts of the device subsystem, if we have a look at <code>device_add()</code> we find the following:</p>
<div class="highlight"><pre><span></span><code><span class="kt">int</span><span class="w"> </span><span class="nf">device_add</span><span class="p">(</span><span class="k">struct</span><span class="w"> </span><span class="nc">device</span><span class="w"> </span><span class="o">*</span><span class="n">dev</span><span class="p">)</span>
<span class="p">{</span>
<span class="w"> </span><span class="cm">/* lots of device init stuff... */</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">dev</span><span class="o">-&gt;</span><span class="n">bus</span><span class="p">)</span>
<span class="w"> </span><span class="n">blocking_notifier_call_chain</span><span class="p">(</span><span class="o">&amp;</span><span class="n">dev</span><span class="o">-&gt;</span><span class="n">bus</span><span class="o">-&gt;</span><span class="n">p</span><span class="o">-&gt;</span><span class="n">bus_notifier</span><span class="p">,</span>
<span class="w"> </span><span class="n">BUS_NOTIFY_ADD_DEVICE</span><span class="p">,</span><span class="w"> </span><span class="n">dev</span><span class="p">);</span>
</code></pre></div>
<p>If the device we're initialising is attached to a bus, then we call the bus notifier of that bus with the <code>BUS_NOTIFY_ADD_DEVICE</code> notification, and the device in question. So we're going through the process of adding a VIO device, and somehow calling into a notifier that's only registered for PCI devices. I did a bunch of debugging to see if our VIO device was somehow malformed and pointing to a PCI bus, or the <code>struct subsys_private</code> (that's the <code>bus-&gt;p</code> above) was somehow pointing to the wrong place, but everything seemed sane. My thesis of there being confusion while matching devices to buses was getting harder to justify - everything still looked sane.</p>
<h2>Debuggers</h2>
<p>I do not like debuggers. I am an avid <code>printk()</code> enthusiast. There's no real justification for this; a bunch of my problems could almost certainly be solved more easily by using actual tools, but my brain seemingly enjoys the routine of printing and building and running until I figure out what's going on. It was becoming increasingly obvious, however, that <code>printk</code> could not save me here, and we needed to go deeper.</p>
<p>Very thankfully for me, even though this bug was discovered on real hardware, it reproduces easily in <a href="https://www.qemu.org">QEMU</a>, making iteration easy. With <a href="https://qemu-project.gitlab.io/qemu/system/gdb.html">GDB attached to QEMU</a>, it's time to dive into the guts of this issue and figure out what's happening.</p>
<p>Somehow, VIO buses are ending up with <code>pci_notify()</code> in their <code>bus_notifier</code> list. Let's break down the data structures here with a look at <code>struct notifier_block</code>:</p>
<div class="highlight"><pre><span></span><code><span class="k">struct</span><span class="w"> </span><span class="nc">notifier_block</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">notifier_fn_t</span><span class="w"> </span><span class="n">notifier_call</span><span class="p">;</span>
<span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">notifier_block</span><span class="w"> </span><span class="n">__rcu</span><span class="w"> </span><span class="o">*</span><span class="n">next</span><span class="p">;</span>
<span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">priority</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div>
<p>So notifier chains are <a href="https://en.wikipedia.org/wiki/Linked_list#Singly_linked_list">singly linked lists</a>. Callbacks are registered through functions like <code>bus_register_notifier()</code>, then after a long chain of breadcrumbs we reach <a href="https://elixir.bootlin.com/linux/latest/source/kernel/notifier.c#L22"><code>notifier_chain_register()</code></a>, which walks the list of <code>-&gt;next</code> pointers until it reaches <code>NULL</code>, at which point it sets <code>-&gt;next</code> of the tail node to the <code>struct notifier_block</code> that was passed in. It's very important to note that the data being appended to the list here is <em>not just the callback function</em> (i.e. <code>pci_notify()</code>), but the <code>struct notifier_block</code> itself (i.e. <code>struct notifier_block pci_notifier</code> from earlier). No new data is initialised; a pointer is simply updated to refer to the object that was passed by the caller.</p>
<p>If you've guessed what our bug is at this point, great job! If the same <code>struct notifier_block</code> gets registered to two different bus types, then both of their <code>bus_notifier</code> fields will point to the <em>same memory</em>, and any further notifiers registered to either bus will end up being referenced by both since they walk through the same node.</p>
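<p>Here's a small userspace sketch of that failure mode. It is not the kernel's code: the <code>notifier_block</code> below carries a name instead of a callback and has no priority, and <code>chain_register()</code>, <code>chain_length()</code> and <code>chain_print()</code> are hypothetical helpers that only mimic the walk-to-the-tail behaviour described above.</p>
<div class="highlight"><pre><span></span><code>#include &lt;stddef.h&gt;
#include &lt;stdio.h&gt;

/* Cut-down notifier_block: a name instead of a callback, no priority. */
struct notifier_block {
    const char *name;
    struct notifier_block *next;
};

/* Simplified registration: walk to the tail and link in the caller's node. */
static void chain_register(struct notifier_block **nl, struct notifier_block *n)
{
    while (*nl != NULL)
        nl = &amp;(*nl)-&gt;next;
    n-&gt;next = *nl;
    *nl = n;
}

/* Count how many nodes are reachable from a chain head. */
static int chain_length(const struct notifier_block *nb)
{
    int len = 0;

    for (; nb != NULL; nb = nb-&gt;next)
        len++;
    return len;
}

/* Print the names of all nodes reachable from a chain head. */
static void chain_print(const char *bus, const struct notifier_block *nb)
{
    printf("%s:", bus);
    for (; nb != NULL; nb = nb-&gt;next)
        printf(" %s", nb-&gt;name);
    printf("\n");
}

int main(void)
{
    struct notifier_block *pci_chain = NULL, *vio_chain = NULL;
    struct notifier_block shared = { .name = "shared_notifier" };
    struct notifier_block pci_only = { .name = "pci_only_notifier" };

    chain_register(&amp;pci_chain, &amp;shared);   /* the same node...            */
    chain_register(&amp;vio_chain, &amp;shared);   /* ...registered on two chains */
    chain_register(&amp;pci_chain, &amp;pci_only); /* intended for PCI only       */

    chain_print("PCI", pci_chain);
    chain_print("VIO", vio_chain);
    printf("VIO chain length: %d\n", chain_length(vio_chain));
    return 0;
}
</code></pre></div>
<p>Both chains end up listing both notifiers, even though <code>pci_only</code> was only ever registered on the PCI chain: the third registration walks through <code>shared</code> and updates its <code>next</code> pointer, and the VIO chain reaches the new node through that very same pointer.</p>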
<p>So we bust out the debugger and start looking at what ends up in <code>bus_notifier</code> for PCI and VIO buses with breakpoints and watchpoints.</p>
<h2>Candidates</h2>
<p>Walking the <code>bus_notifier</code> list gave me the following:</p>
<div class="highlight"><pre><span></span><code>__gcov_.perf_trace_module_free
fail_iommu_bus_notify
isa_bridge_notify
ppc_pci_unmap_irq_line
eeh_device_notifier
iommu_bus_notifier
tce_iommu_bus_notifier
pci_notify
</code></pre></div>
<p>Time to find out if our assumption is correct - the same <code>struct notifier_block</code> is being registered to both bus types. Let's start going through them!</p>
<p>First up, we have <code>__gcov_.perf_trace_module_free</code>. Thankfully, I recognised this as complete bait. Trying to figure out what gcov and perf are doing here is going to be its own giant rabbit hole, and unless building without gcov makes our problem disappear, we skip this one and keep on looking. Rabbit holes in the kernel never end, we have to be strategic with our time!</p>
<p>Next, we reach <code>fail_iommu_bus_notify</code>, so let's take a look at that.</p>
<div class="highlight"><pre><span></span><code><span class="k">static</span><span class="w"> </span><span class="k">struct</span><span class="w"> </span><span class="nc">notifier_block</span><span class="w"> </span><span class="n">fail_iommu_bus_notifier</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="p">.</span><span class="n">notifier_call</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fail_iommu_bus_notify</span>
<span class="p">};</span>
<span class="k">static</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">__init</span><span class="w"> </span><span class="nf">fail_iommu_setup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="cp">#ifdef CONFIG_PCI</span>
<span class="w"> </span><span class="n">bus_register_notifier</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pci_bus_type</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">fail_iommu_bus_notifier</span><span class="p">);</span>
<span class="cp">#endif</span>
<span class="cp">#ifdef CONFIG_IBMVIO</span>
<span class="w"> </span><span class="n">bus_register_notifier</span><span class="p">(</span><span class="o">&amp;</span><span class="n">vio_bus_type</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="n">fail_iommu_bus_notifier</span><span class="p">);</span>
<span class="cp">#endif</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div>
<p>Sure enough, here's our bug. The same node is being registered to two different bus types:</p>
<div class="highlight"><pre><span></span><code>+------------------+
| PCI bus_notifier \
+------------------+\
\+-------------------------+ +-----------------+ +------------+
| fail_iommu_bus_notifier |----| PCI + VIO stuff |----| pci_notify |
/+-------------------------+ +-----------------+ +------------+
+------------------+/
| VIO bus_notifier /
+------------------+
</code></pre></div>
<p>when it should be like:</p>
<div class="highlight"><pre><span></span><code>+------------------+ +-----------------------------+ +-----------+ +------------+
| PCI bus_notifier |----| fail_iommu_pci_bus_notifier |----| PCI stuff |----| pci_notify |
+------------------+ +-----------------------------+ +-----------+ +------------+
+------------------+ +-----------------------------+ +-----------+
| VIO bus_notifier |----| fail_iommu_vio_bus_notifier |----| VIO stuff |
+------------------+ +-----------------------------+ +-----------+
</code></pre></div>
<h2>The fix</h2>
<p>Ultimately, the fix turned out to be pretty simple:</p>
<div class="highlight"><pre><span></span><code>Author: Russell Currey &lt;ruscur@russell.cc&gt;
Date: Wed Mar 22 14:37:42 2023 +1100
<span class="w"> </span> powerpc/iommu: Fix notifiers being shared by PCI and VIO buses
<span class="w"> </span> fail_iommu_setup() registers the fail_iommu_bus_notifier struct to both
<span class="w"> </span> PCI and VIO buses. struct notifier_block is a linked list node, so this
<span class="w"> </span> causes any notifiers later registered to either bus type to also be
<span class="w"> </span> registered to the other since they share the same node.
<span class="w"> </span> This causes issues in (at least) the vgaarb code, which registers a
<span class="w"> </span> notifier for PCI buses. pci_notify() ends up being called on a vio
<span class="w"> </span> device, converted with to_pci_dev() even though it&#39;s not a PCI device,
<span class="w"> </span> and finally makes a bad access in vga_arbiter_add_pci_device() as
<span class="w"> </span> discovered with KASAN:
<span class="w"> </span> [stack trace redacted, see above]
<span class="w"> </span> Fix this by creating separate notifier_block structs for each bus type.
<span class="w"> </span> Fixes: d6b9a81b2a45 (&quot;powerpc: IOMMU fault injection&quot;)
<span class="w"> </span> Reported-by: Nageswara R Sastry &lt;rnsastry@linux.ibm.com&gt;
<span class="w"> </span> Signed-off-by: Russell Currey &lt;ruscur@russell.cc&gt;
<span class="gh">diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c</span>
<span class="gh">index ee95937bdaf1..6f1117fe3870 100644</span>
<span class="gd">--- a/arch/powerpc/kernel/iommu.c</span>
<span class="gi">+++ b/arch/powerpc/kernel/iommu.c</span>
<span class="gu">@@ -171,17 +171,26 @@ static int fail_iommu_bus_notify(struct notifier_block *nb,</span>
<span class="w"> </span> return 0;
<span class="w"> </span>}
<span class="gd">-static struct notifier_block fail_iommu_bus_notifier = {</span>
<span class="gi">+/*</span>
<span class="gi">+ * PCI and VIO buses need separate notifier_block structs, since they&#39;re linked</span>
<span class="gi">+ * list nodes. Sharing a notifier_block would mean that any notifiers later</span>
<span class="gi">+ * registered for PCI buses would also get called by VIO buses and vice versa.</span>
<span class="gi">+ */</span>
<span class="gi">+static struct notifier_block fail_iommu_pci_bus_notifier = {</span>
<span class="gi">+ .notifier_call = fail_iommu_bus_notify</span>
<span class="gi">+};</span>
<span class="gi">+</span>
<span class="gi">+static struct notifier_block fail_iommu_vio_bus_notifier = {</span>
<span class="w"> </span> .notifier_call = fail_iommu_bus_notify
<span class="w"> </span>};
<span class="w"> </span>static int __init fail_iommu_setup(void)
<span class="w"> </span>{
<span class="w"> </span>#ifdef CONFIG_PCI
<span class="gd">- bus_register_notifier(&amp;pci_bus_type, &amp;fail_iommu_bus_notifier);</span>
<span class="gi">+ bus_register_notifier(&amp;pci_bus_type, &amp;fail_iommu_pci_bus_notifier);</span>
<span class="w"> </span>#endif
<span class="w"> </span>#ifdef CONFIG_IBMVIO
<span class="gd">- bus_register_notifier(&amp;vio_bus_type, &amp;fail_iommu_bus_notifier);</span>
<span class="gi">+ bus_register_notifier(&amp;vio_bus_type, &amp;fail_iommu_vio_bus_notifier);</span>
<span class="w"> </span>#endif
<span class="w"> </span> return 0;
</code></pre></div>
<p>Easy! Problem solved. The <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d6b9a81b2a45">commit that introduced this bug back in 2012</a> was written by the legendary <a href="http://antonblanchardfacts.com">Anton Blanchard</a>, so it's always a treat to discover an Anton bug. Ultimately this bug is of little consequence, but it's always fun to catch dormant issues with powerful tools like KASAN.</p>
<h2>In conclusion</h2>
<p>I think this bug provides a nice window into what kernel debugging can be like. Thankfully, things are made easier by not dealing with any specific hardware and being easily reproducible in QEMU.</p>
<p>Bugs like this have an absurd amount of underlying complexity, but you rarely need to understand all of it to comprehend the situation and discover the issue. I spent way too much time digging into device subsystem internals, when the odds of the issue lying within were quite low - the combination of IBM VIO devices and VGA arbitration isn't exactly common, so searching for potential issues within the guts of a heavily utilised subsystem isn't going to yield results very often.</p>
<p>Is there something haunted in the device subsystem? Is there something haunted inside the notifier handlers? It's possible, but assuming the core guts of the kernel have a baseline level of sanity helps to let you stay focused on the parts more likely to be relevant.</p>
<p>Finally, the process was made much easier by having good code navigation. A ludicrous number of kernel developers still use plain vim or Emacs, maybe with tags if you're lucky, and get by on <code>git grep</code> (not even ripgrep!) and memory. Sort yourselves out and get yourself an editor with LSP support. I personally use <a href="https://github.com/doomemacs/doomemacs">Doom Emacs</a> with <a href="https://clangd.llvm.org/">clangd</a>, and with the amount of jumping around the kernel I had to do to solve this bug, it would've been a much bigger ordeal without that power.</p>
<p>If you enjoyed the read, why not follow me on <a href="https://ozlabs.house/@ruscur">Mastodon</a> or check out <a href="https://sthbrx.github.io/blog/2023/03/24/dumb-bugs-when-a-date-breaks-booting-the-kernel/">Ben's recount of another cursed bug!</a> Thanks for stopping by.</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Russell Currey</dc:creator><pubDate>Tue, 04 Apr 2023 15:55:00 +1000</pubDate><guid isPermaLink="false">tag:sthbrx.github.io,2023-04-04:/blog/2023/04/04/dumb-bugs-the-pci-device-that-wasnt/</guid><category>Development</category><category>linux</category></item><item><title>Dumb bugs: When a date breaks booting the kernel</title><link>https://sthbrx.github.io/blog/2023/03/24/dumb-bugs-when-a-date-breaks-booting-the-kernel/</link><description><h2>The setup</h2>
<p>I've recently been working on internal CI infrastructure for testing kernels before sending them to the mailing list. As part of this effort, I became interested in <a href="https://reproducible-builds.org/">reproducible builds</a>. Minimising the changing parts outside of the source tree itself could improve consistency and ccache hits, which is great for trying to make the CI faster and more reproducible across different machines. This means removing 'external' factors like timestamps from the build process: the time changes with every build, so two builds of the same tree no longer produce identical binaries. This also prevents using previously cached results, potentially slowing down builds (though it turns out the kernel does a good job of limiting the scope of where timestamps appear in the build).</p>
<p>As part of this effort, I came across the <code>KBUILD_BUILD_TIMESTAMP</code> environment variable. This variable is used to set the kernel timestamp, which is primarily for any users who want to know when their kernel was built. That's mostly irrelevant for our work, so an easy <code>KBUILD_BUILD_TIMESTAMP=0</code> later and... it still uses the current date.</p>
<p>OK, checking <a href="https://docs.kernel.org/kbuild/kbuild.html#kbuild-build-timestamp">the documentation</a>, it says:</p>
<blockquote>
<p>Setting this to a date string overrides the timestamp used in the UTS_VERSION definition (uname -v in the running kernel). The value has to be a string that can be passed to date -d. The default value is the output of the date command at one point during build.</p>
</blockquote>
<p>So it looks like the timestamp variable is actually expected to be a date format. To make it obvious that it's not a 'real' date, let's set <code>KBUILD_BUILD_TIMESTAMP=0000-01-01</code>. A bunch of zeroes (and the ones to make it a valid month and day) should tip off anyone to the fact it's invalid.</p>
<p>As an aside, this is a different date to what I tried to set it to earlier; a 'timestamp' typically refers to the number of seconds since the UNIX epoch (1970), so my first attempt would have corresponded to 1970-01-01. But given we're passing a date, not a timestamp, there should be no problem setting it back to the year 0. And I like the aesthetics of 0000 over 1970.</p>
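<p>As a quick sanity check on that epoch arithmetic, here's a tiny sketch that formats a timestamp of zero as a UTC date. It assumes a POSIX system, where <code>time_t</code> counts seconds since 1970-01-01 00:00:00 UTC, and <code>format_utc_date()</code> is a hypothetical helper written for this example.</p>
<div class="highlight"><pre><span></span><code>#include &lt;stddef.h&gt;
#include &lt;stdio.h&gt;
#include &lt;time.h&gt;

/* Format t as YYYY-MM-DD in UTC; returns 0 on success, -1 on failure. */
static int format_utc_date(time_t t, char *buf, size_t len)
{
    struct tm *utc = gmtime(&amp;t);

    if (utc == NULL)
        return -1;
    return strftime(buf, len, "%Y-%m-%d", utc) == 0 ? -1 : 0;
}

int main(void)
{
    char buf[16];

    /* Zero seconds since the UNIX epoch (assuming a POSIX time_t). */
    if (format_utc_date((time_t)0, buf, sizeof(buf)) != 0)
        return 1;
    printf("%s\n", buf); /* prints 1970-01-01 */
    return 0;
}
</code></pre></div>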
<p>Building and booting the kernel, we see <code>#1 SMP 0000-01-01</code> printed as the build timestamp. Success! After confirming everything works, I set the environment variable in the CI jobs and call it a day.</p>
<h2>An unexpected error</h2>
<p>A few days later I need to run the CI to test my patches, and something strange happens. It builds fine, but the boot tests that load a root disk image fail inexplicably: there is a kernel panic saying "VFS: Unable to mount root fs on unknown-block(253,2)".</p>
<div class="highlight"><pre><span></span><code>[ 0.909648][ T1] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(253,2)
[ 0.909797][ T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc2-g065ffaee7389 #8
[ 0.909880][ T1] Hardware name: IBM pSeries (emulated by qemu) POWER8 (raw) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
[ 0.910044][ T1] Call Trace:
[ 0.910107][ T1] [c000000003643b00] [c000000000fb6f9c] dump_stack_lvl+0x70/0xa0 (unreliable)
[ 0.910378][ T1] [c000000003643b30] [c000000000144e34] panic+0x178/0x424
[ 0.910423][ T1] [c000000003643bd0] [c000000002005144] mount_block_root+0x1d0/0x2bc
[ 0.910457][ T1] [c000000003643ca0] [c000000002005720] prepare_namespace+0x1d4/0x22c
[ 0.910487][ T1] [c000000003643d20] [c000000002004b04] kernel_init_freeable+0x36c/0x3bc
[ 0.910517][ T1] [c000000003643df0] [c000000000013830] kernel_init+0x30/0x1a0
[ 0.910549][ T1] [c000000003643e50] [c00000000000df94] ret_from_kernel_thread+0x5c/0x64
[ 0.910587][ T1] --- interrupt: 0 at 0x0
[ 0.910794][ T1] NIP: 0000000000000000 LR: 0000000000000000 CTR: 0000000000000000
[ 0.910828][ T1] REGS: c000000003643e80 TRAP: 0000 Not tainted (6.3.0-rc2-g065ffaee7389)
[ 0.910883][ T1] MSR: 0000000000000000 &lt;&gt; CR: 00000000 XER: 00000000
[ 0.910990][ T1] CFAR: 0000000000000000 IRQMASK: 0
[ 0.910990][ T1] GPR00: 0000000000000000 c000000003644000 0000000000000000 0000000000000000
[ 0.910990][ T1] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.910990][ T1] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.910990][ T1] GPR12: 0000000000000000 0000000000000000 c000000000013808 0000000000000000
[ 0.910990][ T1] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.910990][ T1] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.910990][ T1] GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.910990][ T1] GPR28: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.911371][ T1] NIP [0000000000000000] 0x0
[ 0.911397][ T1] LR [0000000000000000] 0x0
[ 0.911427][ T1] --- interrupt: 0
qemu-system-ppc64: OS terminated: OS panic: VFS: Unable to mount root fs on unknown-block(253,2)
</code></pre></div>
<p>Above the panic was some more context, saying</p>
<div class="highlight"><pre><span></span><code>[ 0.906194][ T1] Warning: unable to open an initial console.
...
[ 0.908321][ T1] VFS: Cannot open root device &quot;vda2&quot; or unknown-block(253,2): error -2
[ 0.908356][ T1] Please append a correct &quot;root=&quot; boot option; here are the available partitions:
[ 0.908528][ T1] 0100 65536 ram0
[ 0.908657][ T1] (driver?)
[ 0.908735][ T1] 0101 65536 ram1
[ 0.908744][ T1] (driver?)
...
[ 0.909216][ T1] 010f 65536 ram15
[ 0.909226][ T1] (driver?)
[ 0.909265][ T1] fd00 5242880 vda
[ 0.909282][ T1] driver: virtio_blk
[ 0.909335][ T1] fd01 4096 vda1 d1f35394-01
[ 0.909364][ T1]
[ 0.909401][ T1] fd02 5237760 vda2 d1f35394-02
[ 0.909408][ T1]
[ 0.909441][ T1] fd10 366 vdb
[ 0.909446][ T1] driver: virtio_blk
[ 0.909479][ T1] 0b00 1048575 sr0
[ 0.909486][ T1] driver: sr
</code></pre></div>
<p>This is even more baffling: if it's unable to open a console, then what am I reading these messages on? And error <code>-2</code>, or ENOENT, on opening 'vda2' implies that no such file or directory exists. But it then lists vda2 as a present drive with a known driver? So is vda2 missing or not?</p>
<h2>Living in denial</h2>
<p>As you've read the title of this article, you can probably guess as to what changed to cause this error. But at the time I had no idea what could have been the cause. I'd already confirmed that a kernel with a set timestamp can boot to userspace, and there was another (seemingly) far more likely candidate for the failure: as part of the CI design, patches are extracted from the submitted branch and rebased onto the maintainer's tree. This is great from a convenience perspective, because you don't need to worry about forgetting to rebase your patches before testing and submission. But if the maintainer has synced their branch with Linus' tree it means there could be a lot of things changed in the source tree between runs, even if they were only a few days apart.</p>
<p>So, when you're faced with a working test on one commit and a broken test on another commit, it's time to break out the <code>git bisect</code>. Downloading the kernel images from the relevant CI jobs, I confirmed that indeed one was working while the other was broken. So I bisected the relevant commits, and... everything kept working. Each step I would build and boot the kernel, and each step would reach userspace just fine. I was getting suspicious at this point, so I skipped ahead to the known bad commit and built and tested it locally. It <em>also worked</em>.</p>
<p>This was highly confusing, because it meant there was something fishy going on. Some kind of state outside of the kernel tree. Could it be... surely not...</p>
<p>Comparing the boot logs of the two CI kernels, I see that the working one indeed uses an actual timestamp, and the broken one uses the 0000-01-01 fixed date. Oh no. Setting the timestamp with a local build, I can now reproduce the boot panic with a kernel I built myself.</p>
<h2>But... why?</h2>
<p>OK, so it's obvious at this point that the timestamp is affecting loading a root disk somehow. But why? The obvious answer is that it's before the UNIX epoch. Something in the build process is turning the date into an actual timestamp, and going wrong when that timestamp gets used for something.</p>
<p>But it's not like there was a build error complaining about it. As best I could tell, the kernel doesn't try to parse the date anywhere, besides passing it to <code>date</code> during the build. And if <code>date</code> had an issue with it, it would have broken the <em>build</em>. Not <em>booting</em> the kernel. There's no <code>date</code> utility being invoked during kernel boot!</p>
<p>Regardless, I set about tracing the usage of <code>KBUILD_BUILD_TIMESTAMP</code> inside the kernel. The stacktrace in the panic gave the end point of the search; the function <code>mount_block_root()</code> wasn't happy. So all I had to do was work out at which point <code>mount_block_root()</code> tried to access the <code>KBUILD_BUILD_TIMESTAMP</code> value.</p>
<p>In short, that went nowhere.</p>
<p><code>mount_block_root()</code> effectively just tries to open a file in the filesystem. There are massive amounts of code handling this, and any part could have had the undocumented dependency on <code>KBUILD_BUILD_TIMESTAMP</code>. Approaching from the other direction, <code>KBUILD_BUILD_TIMESTAMP</code> is turned into <code>build-timestamp</code> inside a Makefile, which in turn ends up in the generated file <code>include/generated/utsversion.h</code>. This file <code>#define</code>s <code>UTS_VERSION</code> to a string that includes the <code>KBUILD_BUILD_TIMESTAMP</code> value. Searching the kernel for <code>UTS_VERSION</code>, we hit <code>init/version-timestamp.c</code> which stores it in a struct with other build information:</p>
<div class="highlight"><pre><span></span><code><span class="k">struct</span><span class="w"> </span><span class="nc">uts_namespace</span><span class="w"> </span><span class="n">init_uts_ns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="p">.</span><span class="n">ns</span><span class="p">.</span><span class="n">count</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">REFCOUNT_INIT</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span>
<span class="w"> </span><span class="p">.</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="p">.</span><span class="n">sysname</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">UTS_SYSNAME</span><span class="p">,</span>
<span class="w"> </span><span class="p">.</span><span class="n">nodename</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">UTS_NODENAME</span><span class="p">,</span>
<span class="w"> </span><span class="p">.</span><span class="n">release</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">UTS_RELEASE</span><span class="p">,</span>
<span class="w"> </span><span class="p">.</span><span class="n">version</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">UTS_VERSION</span><span class="p">,</span>
<span class="w"> </span><span class="p">.</span><span class="n">machine</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">UTS_MACHINE</span><span class="p">,</span>
<span class="w"> </span><span class="p">.</span><span class="n">domainname</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">UTS_DOMAINNAME</span><span class="p">,</span>
<span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="p">.</span><span class="n">user_ns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">&amp;</span><span class="n">init_user_ns</span><span class="p">,</span>
<span class="w"> </span><span class="p">.</span><span class="n">ns</span><span class="p">.</span><span class="n">inum</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">PROC_UTS_INIT_INO</span><span class="p">,</span>
<span class="cp">#ifdef CONFIG_UTS_NS</span>
<span class="w"> </span><span class="p">.</span><span class="n">ns</span><span class="p">.</span><span class="n">ops</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">&amp;</span><span class="n">utsns_operations</span><span class="p">,</span>
<span class="cp">#endif</span>
<span class="p">};</span>
</code></pre></div>
<p>This is where the trail goes cold: I don't know if you've ever tried this, but searching for <code>.version</code> in the kernel's codebase is not a very fruitful endeavor when you're interested in a specific kind of version.</p>
<div class="highlight"><pre><span></span><code>$ rg &quot;(\.|\-&gt;)version\b&quot; | wc -l
5718
</code></pre></div>
<p>I tried tracing the usage of <code>init_uts_ns</code>, but didn't get very far.</p>
<p>By now I'd already posted this in chat and another developer, <a href="https://shenki.github.io/">Joel Stanley</a>, was also investigating this bizarre bug. They had been testing different timestamp values and made the horrifying discovery that the bug sticks around after a rebuild. So you could start with a broken build, set the timestamp back to the correct value, rebuild, and the resulting kernel would <em>still be broken</em>. The boot log would report the correct time, but the root disk mounter panicked all the same.</p>
<h2>Getting sidetracked</h2>
<p>I wasn't prepared to investigate the boot panic directly until the persistence bug was fixed. Having to run <code>make clean</code> and rebuild everything would take an annoyingly long time, even with ccache. Fortunately, I had a plan. All I had to do was work out which generated files are different between a broken and working build, and binary search by deleting half of them until deleting only one made the difference between the bug persisting or not. We can use <code>diff</code> for this. Running the initial diff we get</p>
<div class="highlight"><pre><span></span><code>$ diff -q --exclude System.map --exclude .tmp_vmlinux* --exclude tools broken/ working/
Common subdirectories: broken/arch and working/arch
Common subdirectories: broken/block and working/block
Files broken/built-in.a and working/built-in.a differ
Common subdirectories: broken/certs and working/certs
Common subdirectories: broken/crypto and working/crypto
Common subdirectories: broken/drivers and working/drivers
Common subdirectories: broken/fs and working/fs
Common subdirectories: broken/include and working/include
Common subdirectories: broken/init and working/init
Common subdirectories: broken/io_uring and working/io_uring
Common subdirectories: broken/ipc and working/ipc
Common subdirectories: broken/kernel and working/kernel
Common subdirectories: broken/lib and working/lib
Common subdirectories: broken/mm and working/mm
Common subdirectories: broken/net and working/net
Common subdirectories: broken/scripts and working/scripts
Common subdirectories: broken/security and working/security
Common subdirectories: broken/sound and working/sound
Common subdirectories: broken/usr and working/usr
Files broken/.version and working/.version differ
Common subdirectories: broken/virt and working/virt
Files broken/vmlinux and working/vmlinux differ
Files broken/vmlinux.a and working/vmlinux.a differ
Files broken/vmlinux.o and working/vmlinux.o differ
Files broken/vmlinux.strip.gz and working/vmlinux.strip.gz differ
</code></pre></div>
<p>Hmm, OK so only some top level files are different. Deleting all the different files doesn't fix the persistence bug though, and I know that a proper <code>make clean</code> does fix it, so what could possibly be the difference when all the remaining files are identical?</p>
<p>Oh wait. <code>man diff</code> reports that <code>diff</code> only compares the top level folder entries by default. So it was literally just telling me "yes, both the broken and working builds have a folder named X". How GNU of it. Re-running the diff command with actually useful options, we get a more promising story</p>
<div class="highlight"><pre><span></span><code>$ diff -qr --exclude System.map --exclude .tmp_vmlinux* --exclude tools build/broken/ build/working/
Files build/broken/arch/powerpc/boot/zImage and build/working/arch/powerpc/boot/zImage differ
Files build/broken/arch/powerpc/boot/zImage.epapr and build/working/arch/powerpc/boot/zImage.epapr differ
Files build/broken/arch/powerpc/boot/zImage.pseries and build/working/arch/powerpc/boot/zImage.pseries differ
Files build/broken/built-in.a and build/working/built-in.a differ
Files build/broken/include/generated/utsversion.h and build/working/include/generated/utsversion.h differ
Files build/broken/init/built-in.a and build/working/init/built-in.a differ
Files build/broken/init/utsversion-tmp.h and build/working/init/utsversion-tmp.h differ
Files build/broken/init/version.o and build/working/init/version.o differ
Files build/broken/init/version-timestamp.o and build/working/init/version-timestamp.o differ
Files build/broken/usr/built-in.a and build/working/usr/built-in.a differ
Files build/broken/usr/initramfs_data.cpio and build/working/usr/initramfs_data.cpio differ
Files build/broken/usr/initramfs_data.o and build/working/usr/initramfs_data.o differ
Files build/broken/usr/initramfs_inc_data and build/working/usr/initramfs_inc_data differ
Files build/broken/.version and build/working/.version differ
Files build/broken/vmlinux and build/working/vmlinux differ
Files build/broken/vmlinux.a and build/working/vmlinux.a differ
Files build/broken/vmlinux.o and build/working/vmlinux.o differ
Files build/broken/vmlinux.strip.gz and build/working/vmlinux.strip.gz differ
</code></pre></div>
<p>There are some new entries here: notably <code>init/version*</code> and <code>usr/initramfs*</code>. Binary searching these files results in a single culprit: <code>usr/initramfs_data.cpio</code>. This is quite fitting, as the <code>.cpio</code> file is an archive defining a filesystem layout, <a href="https://docs.kernel.org/filesystems/ramfs-rootfs-initramfs.html?highlight=initramfs#why-cpio-rather-than-tar">much like <code>.tar</code> files</a>. This file is actually embedded into the kernel image, and loaded as a bare-bones shim filesystem when the user doesn't provide their own initramfs<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>.</p>
<p>So it would make sense that if the CPIO archive wasn't being rebuilt, then the initial filesystem wouldn't change. And it would make sense for the initial filesystem to be causing mount issues of the proper root disk filesystem.</p>
<p>This just leaves the question of how <code>KBUILD_BUILD_TIMESTAMP</code> is breaking the CPIO archive. And it's around this time that a third developer, <a href="https://twitter.com/ajdlinux">Andrew</a>, who I'd roped into this bug hunt for having the (mis)fortune to sit next to me, pointed out that the generator script for this CPIO archive was passing the <code>KBUILD_BUILD_TIMESTAMP</code> to <code>date</code>. Whoop, we've found the murder weapon<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup>!</p>
<p>The persistence bug could be explained now: because the script was only using <code>KBUILD_BUILD_TIMESTAMP</code> internally, <code>make</code> had no way of knowing that the archive generation depended on this variable. So even when I changed the variable to a valid value, <code>make</code> didn't know to rebuild the corrupt archive. Let's now get back to the main issue: why the boot panics.</p>
<h2>Solving the case</h2>
<p>Following along the CPIO generation script, the <code>KBUILD_BUILD_TIMESTAMP</code> variable is turned into a timestamp by <code>date -d"$KBUILD_BUILD_TIMESTAMP" +%s</code>. Testing this in the shell with <code>0000-01-01</code> we get this (somewhat amusing, but also painful) result</p>
<div class="highlight"><pre><span></span><code>date -d&quot;$KBUILD_BUILD_TIMESTAMP&quot; +%s
-62167255492
</code></pre></div>
<p>This timestamp is then passed to a C program that assigns it to a variable <code>default_mtime</code>. Looking over the source, it seems this variable is used to set the <code>mtime</code> field on the files in the CPIO archive. The timestamp is stored as a <code>time_t</code>, which on a 64-bit Linux host is effectively an <code>int64_t</code>. That's 64 bits of data, up to 16 hexadecimal characters. And yes, that's relevant: CPIO stores the <code>mtime</code> (and all other numerical fields) as 32-bit unsigned integers represented by 8 ASCII hexadecimal characters each. The <code>sprintf()</code> call that ultimately embeds the timestamp uses the <code>%08lX</code> format specifier. This formats a <code>long</code> as hexadecimal, padded to at least 8 characters. Hang on... <strong><em>at least</em></strong> 8 characters? What if our timestamp happens to be more?</p>
<p>It turns out that large timestamps are already guarded against. The program will error during build if the date is later than 2106-02-07 (maximum unsigned 8 hex digit timestamp).</p>
<div class="highlight"><pre><span></span><code><span class="cm">/*</span>
<span class="cm"> * Timestamps after 2106-02-07 06:28:15 UTC have an ascii hex time_t</span>
<span class="cm"> * representation that exceeds 8 chars and breaks the cpio header</span>
<span class="cm"> * specification.</span>
<span class="cm"> */</span>
<span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">default_mtime</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mh">0xffffffff</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;ERROR: Timestamp too large for cpio format</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">);</span>
<span class="w"> </span><span class="n">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div>
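<p>As a quick sanity check of that cut-off date, here is a small C sketch that converts <code>0xffffffff</code> seconds into a UTC date string. The helper name <code>cpio_max_mtime_utc()</code> is made up for illustration (it is not part of the kernel tooling), and the sketch assumes a host with a 64-bit <code>time_t</code>.</p>

```c
#include <stdio.h>
#include <time.h>

/* Largest value that fits in 8 hex digits, in seconds since the epoch. */
#define CPIO_MTIME_MAX 0xffffffffLL

/*
 * Hypothetical helper: write the UTC date for CPIO_MTIME_MAX into out.
 * Assumes time_t is 64 bits wide; with a 32-bit time_t the cast below
 * would wrap and the result would be meaningless.
 * Returns 0 on success, -1 on failure.
 */
int cpio_max_mtime_utc(char *out, size_t len)
{
	time_t t = (time_t)CPIO_MTIME_MAX;
	struct tm *tm = gmtime(&t);

	if (!tm)
		return -1;
	return strftime(out, len, "%Y-%m-%d %H:%M:%S", tm) ? 0 : -1;
}
```

<p>Calling it with a 32 byte buffer should give back the same date and time as the comment in the kernel source.</p>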
<p>But we are using an <code>int64_t</code>. What would happen if one were to provide a negative timestamp?</p>
<p>Well, <code>sprintf()</code> happily spits out <code>FFFFFFF1868AF63C</code> when we pass in our negative timestamp representing <code>0000-01-01</code>. That's 16 characters, 8 too many for the CPIO header<sup id="fnref:3"><a class="footnote-ref" href="#fn:3">3</a></sup>.</p>
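<p>The padding behaviour is easy to reproduce outside the kernel tree. The sketch below is a stand-in for the real formatting call, not a copy of it: <code>format_mtime_field()</code> is a hypothetical name, and it uses <code>long long</code> with <code>%08llX</code> rather than <code>%08lX</code> so that it behaves the same regardless of how wide <code>long</code> is on the host.</p>

```c
#include <stdio.h>

/*
 * Hypothetical helper: format an mtime as uppercase hex, zero-padded to
 * a minimum of 8 digits. The "08" is only a minimum width, so values
 * that need more digits are written out in full, never truncated.
 * Returns the number of characters produced (excluding the NUL).
 */
int format_mtime_field(char *out, size_t len, long long mtime)
{
	return snprintf(out, len, "%08llX", (unsigned long long)mtime);
}
```

<p>A small positive value comes back as exactly 8 characters, while <code>-62167255492</code> comes back as the 16 character string quoted above.</p>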
<p>So at last we've found the cause of the panic: the timestamp is being formatted too long, which breaks the CPIO header, so the kernel doesn't create an initial filesystem correctly. This includes the <code>/dev</code> folder (which surprisingly is not hardcoded into the kernel, but must be declared by the initramfs). So when the root disk mounter tries to open <code>/dev/vda2</code>, it correctly complains that it failed to create a device in the non-existent <code>/dev</code>.</p>
<h2>Postmortem</h2>
<p>After discovering all this, I sent in a couple of patches to fix <a href="https://lore.kernel.org/all/20230320040839.660475-1-bgray@linux.ibm.com/">the CPIO generation</a> and <a href="https://lore.kernel.org/all/20230320040839.660475-2-bgray@linux.ibm.com/">rebuild logic</a>. They were not complicated fixes, but wow were they time consuming to track down. I didn't see the error initially because I typically only boot with my own initramfs over the embedded one, and not with the intent to load a root disk. Then the panic itself was quite far away from the real issue, and there were many dead ends to explore.</p>
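<p>This is not the patch itself, but as a rough sketch, a guard covering both ends of the range could look something like the following. The function name <code>mtime_fits_cpio_field()</code> is an assumption made for this example.</p>

```c
#include <stdbool.h>

/*
 * Hypothetical check: an 8 hex digit field can represent 0 through
 * 0xffffffff and nothing else, so both negative and too-large
 * timestamps have to be rejected before formatting.
 */
bool mtime_fits_cpio_field(long long mtime)
{
	return mtime >= 0 && mtime <= 0xffffffffLL;
}
```

<p>With a check like this, the <code>0000-01-01</code> timestamp would be refused at build time instead of producing an archive that only fails at boot.</p>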
<p>I also got curious as to why the kernel didn't complain about a corrupt initramfs earlier. A brief investigation showed a streaming parser that is <em>extremely</em> fault tolerant, silently skipping invalid entries (like ones missing or having too long a name). The corrupted header was being interpreted as an entry with an empty name and 2 gigabyte body contents, which meant that (1) the kernel skipped inserting it due to the empty name, and (2) the kernel skipped the rest of the initramfs because it thought that up to 2 GB of the remaining content was part of that first entry.</p>
<p>Perhaps this could be improved to require that all input is consumed without unexpected EOF, as the userspace <code>cpio</code> tool does (which, by the way, recognises the corrupt archive as such and refuses to extract it). The parsing logic is mostly from the before-times though (i.e., pre initial git commit), so it's difficult to distinguish intentional leniency from bugs.</p>
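<p>To see why the corrupted entry ends up with an empty name and a huge body, it helps to model the header as a 6 character magic followed by thirteen 8 character hex fields read at fixed offsets. The sketch below is a simplified model and not the kernel's parser: <code>build_header()</code> and <code>read_field()</code> are hypothetical helpers, and the inode, mode and device numbers are arbitrary values for a <code>/dev</code>-like directory entry.</p>

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NEWC_MAGIC_LEN 6 /* "070701" */
#define NEWC_FIELD_LEN 8 /* each numeric field is 8 ASCII hex digits */

/* Field positions after the magic, in header order. */
enum newc_field {
	F_INO, F_MODE, F_UID, F_GID, F_NLINK, F_MTIME, F_FILESIZE,
	F_DEVMAJOR, F_DEVMINOR, F_RDEVMAJOR, F_RDEVMINOR, F_NAMESIZE, F_CHECK
};

/*
 * Write a header for an illustrative directory entry whose name is 5
 * bytes long ("/dev" plus NUL). The mtime is deliberately not
 * range-checked. Returns the header length in characters.
 */
int build_header(char *out, size_t len, long long mtime)
{
	return snprintf(out, len,
		"%s%08X%08X%08X%08X%08X%08llX%08X%08X%08X%08X%08X%08X%08X",
		"070701", 721u, 040755u, 0u, 0u, 2u,
		(unsigned long long)mtime,
		0u, 3u, 1u, 0u, 0u, 5u, 0u);
}

/* Read one field by fixed offset, as a parser that trusts the layout would. */
unsigned long read_field(const char *hdr, enum newc_field f)
{
	char digits[NEWC_FIELD_LEN + 1];

	memcpy(digits, hdr + NEWC_MAGIC_LEN + f * NEWC_FIELD_LEN, NEWC_FIELD_LEN);
	digits[NEWC_FIELD_LEN] = '\0';
	return strtoul(digits, NULL, 16);
}
```

<p>With a sane timestamp the header is 110 characters and the name size reads back as 5. With <code>-62167255492</code> the header grows to 118 characters and every field after the mtime shifts along by one slot: the file size field now reads <code>868AF63C</code> (the low half of the timestamp, a bit over 2 GB) and the name size field reads the old device minor, which is zero.</p>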
<h2>Afterword</h2>
<p>Incidentally, in investigating this I came across another bug. There is a helper function <code>panic_show_mem()</code> in the initramfs code that's meant to dump memory information and then call <code>panic()</code>. It takes a standard <code>printf()</code> style format string and arguments, and tries to forward them to <code>panic()</code> which ultimately prints them.</p>
<div class="highlight"><pre><span></span><code><span class="k">static</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="nf">panic_show_mem</span><span class="p">(</span><span class="k">const</span><span class="w"> </span><span class="kt">char</span><span class="w"> </span><span class="o">*</span><span class="n">fmt</span><span class="p">,</span><span class="w"> </span><span class="p">...)</span>
<span class="p">{</span>
<span class="w"> </span><span class="kt">va_list</span><span class="w"> </span><span class="n">args</span><span class="p">;</span>
<span class="w"> </span><span class="n">show_mem</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="nb">NULL</span><span class="p">);</span>
<span class="w"> </span><span class="n">va_start</span><span class="p">(</span><span class="n">args</span><span class="p">,</span><span class="w"> </span><span class="n">fmt</span><span class="p">);</span>
<span class="w"> </span><span class="n">panic</span><span class="p">(</span><span class="n">fmt</span><span class="p">,</span><span class="w"> </span><span class="n">args</span><span class="p">);</span>
<span class="w"> </span><span class="n">va_end</span><span class="p">(</span><span class="n">args</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">void</span><span class="w"> </span><span class="nf">panic</span><span class="p">(</span><span class="k">const</span><span class="w"> </span><span class="kt">char</span><span class="w"> </span><span class="o">*</span><span class="n">fmt</span><span class="p">,</span><span class="w"> </span><span class="p">...);</span>
</code></pre></div>
<p>But variadic arguments don't quite work this way: instead of forwarding the list <code>args</code> as intended, <code>panic()</code> will instead interpret <code>args</code> as a single argument for the format string <code>fmt</code>. Standard library functions address this by providing <code>v*</code> variants of <code>printf()</code> and friends. For example,</p>
<div class="highlight"><pre><span></span><code><span class="kt">int</span><span class="w"> </span><span class="nf">printf</span><span class="p">(</span><span class="k">const</span><span class="w"> </span><span class="kt">char</span><span class="w"> </span><span class="o">*</span><span class="n">fmt</span><span class="p">,</span><span class="w"> </span><span class="p">...);</span>
<span class="kt">int</span><span class="w"> </span><span class="nf">vprintf</span><span class="p">(</span><span class="k">const</span><span class="w"> </span><span class="kt">char</span><span class="w"> </span><span class="o">*</span><span class="n">fmt</span><span class="p">,</span><span class="w"> </span><span class="kt">va_list</span><span class="w"> </span><span class="n">args</span><span class="p">);</span>
</code></pre></div>
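<p>In userspace C, the usual pattern is for the variadic function to start the <code>va_list</code> and hand it to a <code>v*</code> counterpart. Here is a minimal sketch of that pattern; <code>format_message()</code> and <code>vformat_message()</code> are hypothetical names, and <code>vsnprintf()</code> stands in for whatever does the real work.</p>

```c
#include <stdarg.h>
#include <stdio.h>

/* The v* variant takes a va_list that the caller has already started. */
int vformat_message(char *out, size_t len, const char *fmt, va_list args)
{
	return vsnprintf(out, len, fmt, args);
}

/* The variadic wrapper starts the list and forwards it to the v* variant. */
int format_message(char *out, size_t len, const char *fmt, ...)
{
	va_list args;
	int ret;

	va_start(args, fmt);
	ret = vformat_message(out, len, fmt, args);
	va_end(args);
	return ret;
}
```

<p>For example, <code>format_message(buf, sizeof buf, "%s failed: %d", "mount", -2)</code> fills <code>buf</code> with <code>mount failed: -2</code>, because every argument reaches the formatter individually rather than as one opaque <code>va_list</code> value.</p>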
<p>We might create a <code>vpanic()</code> function in the kernel that follows this style, but it seems easier to just make <code>panic_show_mem()</code> a macro and 'forward' the arguments in the source code</p>
<div class="highlight"><pre><span></span><code><span class="cp">#define panic_show_mem(fmt, ...) \</span>
<span class="cp"> ({ show_mem(0, NULL); panic(fmt, ##__VA_ARGS__); })</span>
</code></pre></div>
<p><a href="https://lore.kernel.org/all/20230320230534.50174-1-bgray@linux.ibm.com/">Patch sent</a>.</p>
<p>And that's where I've left things. Big thanks to Joel and Andrew for helping me with this bug. It was certainly a trip.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>initramfs, or initrd for the older format, are specific kinds of CPIO archives. The initramfs is intended to be loaded as the initial filesystem of a booted kernel, typically in preparation for loading your normal root filesystem. It might contain modules necessary to mount the disk for example.&#160;<a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">&#8617;</a></p>
</li>
<li id="fn:2">
<p>Hindsight again would suggest it was obvious to look here because it shows up when searching for <code>KBUILD_BUILD_TIMESTAMP</code>. I unfortunately wasn't familiar with the <code>usr/</code> source folder initially, and focused on the core kernel components too much earlier. Oh well, we found it eventually.&#160;<a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">&#8617;</a></p>
</li>
<li id="fn:3">
<p>I almost missed this initially. Thanks to the ASCII header format, <code>strings</code> was able to print the headers without any CPIO specific tooling. I did a double take when I noticed the headers for the broken CPIO were a little longer than the headers in the working one.&#160;<a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">&#8617;</a></p>
</li>
</ol>
</div></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Benjamin Gray</dc:creator><pubDate>Fri, 24 Mar 2023 00:00:00 +1100</pubDate><guid isPermaLink="false">tag:sthbrx.github.io,2023-03-24:/blog/2023/03/24/dumb-bugs-when-a-date-breaks-booting-the-kernel/</guid><category>Development</category><category>linux</category></item><item><title>What distro options are there for POWER8 in 2022?</title><link>https://sthbrx.github.io/blog/2022/11/16/what-distro-options-are-there-for-power8-in-2022/</link><description><p>If you have POWER8 systems that you want to keep alive, what are your options in 2022? You can keep using the legacy distribution you're still using as long as it's still supported, but if you want some modernisation, that might not be the best option for you. Here's the current landscape of POWER8 support in major distributions, and hopefully it helps you out!</p>
<p>Please note that I am entirely focused on what runs and keeps getting new packages, not what companies will officially support. <a href="https://www.ibm.com/docs/en/linux-on-systems?topic=lpo-supported-linux-distributions-virtualization-options-power8-power9-linux-power-systems">IBM provides documentation for that.</a> I'm also mostly focused on OpenPOWER and not what's supported under IBM PowerVM.</p>
<p><strong>RHEL-compatible</strong></p>
<p>Things aren't too great on the RHEL-compatible side. RHEL 9 is compiled with P9 instructions, removing support for P8. This includes compatible distributions, like CentOS Stream and Rocky Linux.</p>
<p>You can continue to use RHEL 8 for a long time. Unfortunately, Rocky Linux only has a Power release for EL9 and not EL8, and CentOS Stream 8 hits EOL May 31st, 2024 - a bit too soon for my liking. If you're a RHEL customer though, you're set.</p>
<p><strong>Fedora</strong></p>
<p>Fedora seems like a great option - the latest versions still support P8 and there's no immediate signs of that changing. The issue is threefold: Fedora could change this with relatively little warning (as its big brother RHEL already has), it doesn't provide LTS versions that would stay supported if that happens, and any options you could migrate to would be very different from what you're using.</p>
<p>For that reason, I don't recommend using Fedora on POWER8 if you intend to keep it around for a while. If you want something modern for a short-term project, go right ahead! Otherwise, I'd avoid it. If you're still keeping POWER8 systems alive, you probably want something more set-and-forget than Fedora anyway.</p>
<p><strong>Ubuntu</strong></p>
<p>Ubuntu is a mixed bag. The good news is that Ubuntu 20.04 LTS is supported until mid-2025, and if you give Canonical money, that support can extend through 2030. Ubuntu 20.04 LTS is my personal pick for the best distro to install on POWER8 systems that you want to have somewhat modern software but without the risks of future issues.</p>
<p>The bad news is that POWER8 support went away in Ubuntu 22.04, which is extremely unfortunate. Missing an LTS cycle is one thing, but <em>not having a pathway from 21.10 is another</em>. If you were on 20.10/21.04/21.10, you are completely boned, because they're all out of support and 22.04 and later don't support POWER8. You're going to have to reinstall 20.04.</p>
<p>If I sound salty, it's because I had to do this for a few machines. Hopefully you're not in that situation. 20.04 is going to be around for a good while longer, with a lot of modern creature comforts you'd miss on an EL8-compatible distro, so it's my pick for now.</p>
<p><strong>OpenSUSE</strong></p>
<p>I'm pretty ignorant when it comes to chameleon-flavoured distros, so take this with a grain of salt as most of it is from some quick searching. OpenSUSE Leap follows SLES, but without extended support lifetimes for older major versions. From what I can tell, the latest release (15.4) still includes POWER8 support (and adds Power10 support!), but similar to Fedora, it looks to me like a new version could easily drop P8 support.</p>
<p>If the 15.x series stayed alive after 16 came out, you might be good, but it doesn't seem like there's a history of that happening.</p>
<p><strong>Debian</strong></p>
<p>Debian 11 "bullseye" came out in 2021, supports POWER8, and is likely to be supported until around 2026. I can't really chime in on more than that because I am a certified Debian hater (even newer releases feel outdated to me), but that looks like a pretty good deal.</p>
<p><strong>Other options</strong></p>
<p>Those are just some of the major distros; there are plenty of others, including some Power-specific ones from the community.</p>
<p><strong>Conclusion</strong></p>
<p>POWER8's getting old, but is still plenty capable. Make sure your distro still remembers to send your POWER8 a birthday card each year and you'll have plenty more good times to come.</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Russell Currey</dc:creator><pubDate>Wed, 16 Nov 2022 17:30:00 +1100</pubDate><guid isPermaLink="false">tag:sthbrx.github.io,2022-11-16:/blog/2022/11/16/what-distro-options-are-there-for-power8-in-2022/</guid><category>OpenPOWER</category><category>linux</category><category>power8</category><category>distro</category></item><item><title>Power kernel hardening features in Linux 6.1</title><link>https://sthbrx.github.io/blog/2022/10/26/power-kernel-hardening-features-in-linux-61/</link><description><p>Linux 6.1-rc1 was tagged on October 16th, 2022 and includes a bunch of nice things from my team that I want to highlight. Our goal is to make the Linux kernel running on IBM's Power CPUs more secure, and we landed a few goodies upstream in 6.1 to that end.</p>
<p>Specifically, Linux 6.1 on Power will include <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7e92e01b724526b98cbc7f03dd4afa0295780d56">a complete system call infrastructure rework with security <em>and</em> performance benefits</a>, <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a5edf9815dd739fce660b4c8658f61b7d2517042">support for KFENCE (a low-overhead memory safety error detector)</a>, and <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=395cac7752b905318ae454a8b859d4c190485510">execute-only memory (XOM) support on the Radix MMU</a>.</p>
<p>The syscall work from Rohan McLure and Andrew Donnellan replaces arch/powerpc's legacy infrastructure with the syscall wrapper shared between architectures. This was a significant overhaul of a lot of legacy code impacting all of powerpc's many platforms, including multiple different ABIs and 32/64-bit compatibility infrastructure. Rohan's series started at <a href="http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=302791&amp;state=*">v1 with 6 patches</a> and ended at <a href="http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=319348&amp;state=*">v6 with 25 patches</a>, and he's done an incredible job at adopting community feedback and handling new problems.</p>
<p>Big thanks to Christophe Leroy, Arnd Bergmann, Nick Piggin, Michael Ellerman and others for their reviews, and of course Andrew for providing a lot of review and feedback (and prototyping the syscall wrapper in the first place). Our syscalls have entered the modern era: we can zeroise registers to improve security (but don't yet, due to some ongoing discussion around compatibility and making it optional; look out for Linux 6.2), and we gain a nice little performance boost by avoiding the allocation of a kernel stack frame. For more detail, see <a href="http://patchwork.ozlabs.org/project/linuxppc-dev/cover/20220921065605.1051927-1-rmclure@linux.ibm.com/">Rohan's cover letter</a>.</p>
<p>Next, we have Nicholas Miehlbradt's implementation of <a href="https://www.kernel.org/doc/html/latest/dev-tools/kfence.html">Kernel Electric Fence (KFENCE)</a> (and <code>DEBUG_PAGEALLOC</code>) for 64-bit Power, including the Hash and Radix MMUs. Christophe Leroy has already implemented KFENCE for 32-bit powerpc upstream and a series adding support for 64-bit was posted by Jordan Niethe last year, but couldn't proceed due to locking issues. Those issues have since been resolved, and after fixing a previously unknown and very obscure MM issue, Nick's KFENCE patches have been merged.</p>
<p>KFENCE is a low-overhead alternative to memory error detectors like KASAN (<a href="https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=41b7a347bf1491e7300563bb224432608b41f62a">which we implemented for Radix earlier this year, thanks to Daniel Axtens and Paul Mackerras</a>), which you probably wouldn't want to run in production. If you're chasing a memory corruption bug that doesn't like to present itself, KFENCE can help you catch out-of-bounds accesses, use-after-frees, double frees and similar errors without significantly impacting performance.</p>
<p>Finally, I wired up execute-only memory (XOM) for the Radix MMU. XOM is a niche feature that lets users map pages with <code>PROT_EXEC</code> only, creating a page that can't be read or written to, but can still be executed. This is primarily useful for defending against code reuse attacks like ROP, but has other uses such as JIT/sandbox environments. Power8 and later CPUs running the Hash MMU already had this capability through protection keys (pkeys); my implementation for Radix uses the native execute permission bit of the Radix MMU instead.</p>
<p>This basically took me an afternoon to wire up after I had the idea and I roped in Nicholas Miehlbradt to contribute a <a href="https://github.com/torvalds/linux/blob/master/tools/testing/selftests/powerpc/mm/exec_prot.c">selftest</a>, which ended up being a more significant engineering effort than the feature implementation itself. We now have a comprehensive test for XOM that runs on both Hash and Radix for all possible combinations of R/W/X upstream.</p>
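<p>As a rough illustration of what "mapping pages with <code>PROT_EXEC</code> only" looks like from userspace, here is a minimal sketch using the standard <code>mmap(2)</code> call. The helper name <code>map_exec_only_page</code> is made up for this example and is not taken from the kernel tree or the selftest. Whether the resulting page is really unreadable depends on the CPU, the MMU mode and the kernel version; on systems without XOM support, the mapping may still be readable.</p>

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): map one anonymous page with
 * PROT_EXEC and nothing else. Returns the mapping, or NULL on failure.
 * On a kernel and MMU with XOM support, loads and stores to this page
 * are expected to fault while instruction fetches are allowed.
 */
static void *map_exec_only_page(size_t *len_out)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return NULL;

    /* Request execute permission only: no PROT_READ, no PROT_WRITE. */
    void *p = mmap(NULL, (size_t)page_size, PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    if (len_out)
        *len_out = (size_t)page_size;
    return p;
}
```

<p>In practice you would usually populate a page while it is writable and then drop it to <code>PROT_EXEC</code> with <code>mprotect(2)</code>, since an anonymous execute-only page like the one above contains only zeroes. The upstream selftest linked above is the place to look for a thorough check of each permission combination.</p>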
<p>Anyway, that's all I have - this is my first time writing a post like this, so let me know what you think! A lot of our work doesn't result in upstream patches so we're not always going to have kernel releases as eventful as this, but we can post summaries every once in a while if there's interest. Thanks for reading!</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Russell Currey</dc:creator><pubDate>Wed, 26 Oct 2022 16:30:00 +1100</pubDate><guid isPermaLink="false">tag:sthbrx.github.io,2022-10-26:/blog/2022/10/26/power-kernel-hardening-features-in-linux-61/</guid><category>Development</category><category>linux</category><category>kernel</category><category>hardening</category></item><item><title>Fuzzing grub, part 2: going faster</title><link>https://sthbrx.github.io/blog/2021/06/14/fuzzing-grub-part-2-going-faster/</link><description><p>Recently a set of 8 vulnerabilities were disclosed for the <a href="https://wiki.ubuntu.com/SecurityTeam/KnowledgeBase/GRUB2SecureBootBypass2021">grub bootloader</a>. I
found 2 of them (CVE-2021-20225 and CVE-2021-20233), and contributed a number of
other fixes for crashing bugs which we don't believe are exploitable. I found
them by applying fuzz testing to grub. Here's how.</p>
<p>This is a multi-part series: I think it will end up being 4 posts. I'm hoping to
cover:</p>
<ul>
<li><a href="/blog/2021/03/04/fuzzing-grub-part-1">Part 1: getting started with fuzzing grub</a></li>
<li>Part 2 (this post): going faster by doing lots more work</li>
<li>Part 3: fuzzing filesystems and more</li>
<li>Part 4: potential next steps and avenues for further work</li>
</ul>
<p>We've been looking at fuzzing <code>grub-emu</code>, which is basically most parts of grub
built into a standard userspace program. This includes all the script parsing
logic, fonts, graphics, partition tables, filesystems and so on - just not