<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc SYSTEM "rfc2629-xhtml.ent">
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt"?>
<rfc
category="std"
docName="draft-ietf-nfsv4-rpcrdma-version-two-latest"
indexInclude="true"
ipr="pre5378Trust200902"
obsoletes=""
scripts="Common,Latin"
sortRefs="true"
submissionType="IETF"
symRefs="true"
tocDepth="2"
tocInclude="true"
updates=""
version="3"
xml:lang="en">
<front>
<title abbrev="RPC-over-RDMA Version 2">
RPC-over-RDMA Version 2 Protocol
</title>
<seriesInfo name="Internet-Draft" value="draft-ietf-nfsv4-rpcrdma-version-two-latest"/>
<author initials="C." surname="Lever" fullname="Charles Lever" role="editor">
<organization abbrev="Oracle" showOnFrontPage="true">Oracle Corporation</organization>
<address>
<postal>
<street/>
<city/>
<region/>
<code/>
<country>United States of America</country>
</postal>
<email>chuck.lever@oracle.com</email>
</address>
</author>
<author initials="D." surname="Noveck" fullname="David Noveck">
<organization showOnFrontPage="true">NetApp</organization>
<address>
<postal>
<street>1601 Trapelo Road</street>
<city>Waltham</city>
<region>MA</region>
<code>02451</code>
<country>United States of America</country>
</postal>
<phone>+1 781 572 8038</phone>
<email>davenoveck@gmail.com</email>
</address>
</author>
<date/>
<area>Transport</area>
<workgroup>Network File System Version 4</workgroup>
<keyword>NFS-Over-RDMA</keyword>
<abstract>
<t>
This document specifies the second version
of a transport protocol that conveys
Remote Procedure Call (RPC) messages
using Remote Direct Memory Access (RDMA).
This version of the protocol is extensible.
</t>
</abstract>
<note removeInRFC="true">
<t>
Discussion of this draft takes place
on the NFSv4 working group mailing list (nfsv4@ietf.org),
which is archived at
<eref target="https://mailarchive.ietf.org/arch/browse/nfsv4/"/>.
Working Group information can be found at
<eref target="https://datatracker.ietf.org/wg/nfsv4/about/"/>.
</t>
<t>
The source for this draft is maintained in GitHub.
Suggested changes can be submitted as pull requests at
<eref target="https://github.com/chucklever/i-d-rpcrdma-version-two"/>.
Instructions are on that page as well.
</t>
</note>
</front>
<middle>
<section
anchor="section_72f6ba4a-aafb-4e9d-8b87-800ebccc5879"
numbered="true"
removeInRFC="false"
toc="default">
<name>Introduction</name>
<t>
Remote Direct Memory Access (RDMA)
<xref target="RFC5040" format="default" sectionFormat="of"/>
<xref target="RFC5041" format="default" sectionFormat="of"/>
<xref target="IBA" format="default" sectionFormat="of"/>
is a technique for moving data efficiently between network nodes.
By placing transferred data directly into destination buffers
using Direct Memory Access, RDMA delivers the dual benefits of
faster data transfer
and
reduced host CPU overhead.
</t>
<t>
Open Network Computing Remote Procedure Call
(ONC RPC, often shortened in NFSv4 documents to RPC)
<xref target="RFC5531" format="default" sectionFormat="of"/>
is a Remote Procedure Call protocol
that runs over a variety of transports.
Most RPC implementations today use
UDP
<xref target="RFC0768" format="default" sectionFormat="of"/>
or
TCP
<xref target="RFC0793" format="default" sectionFormat="of"/>.
On UDP, a datagram encapsulates each RPC message.
Within a TCP byte stream,
a record marking protocol delineates RPC messages.
</t>
<t>
An RDMA transport, too, conveys RPC messages
in a fashion that must be fully defined
if RPC implementations are to interoperate
when using RDMA to transport RPC transactions.
Although RDMA transports encapsulate messages like UDP,
they deliver them reliably and in order, like TCP.
Further, they implement a bulk data transfer service
not provided by traditional network transports.
Therefore, we treat RDMA as a novel transport type for RPC.
</t>
<section
anchor="section_3ade56d8-45ea-4ab5-b97c-da817d3e0033"
numbered="true"
removeInRFC="false"
toc="default">
<name>Design Goals</name>
<t>
The general mission of RPC-over-RDMA transports is to
leverage network hardware capabilities to
reduce host CPU needs related to the transport of RPC messages.
In particular, this includes
mitigating host interrupt rates
and
limiting the necessity to copy RPC payload bytes on receivers.
</t>
<t>
These hardware capabilities benefit both RPC clients and servers.
On balance, however, the RPC-over-RDMA protocol design approach
has been to bolster clients more than servers, as the client is
typically where applications are most hungry for CPU resources.
</t>
<t>
Additionally,
RPC-over-RDMA transports are designed to
support RPC applications transparently.
However, such transports can also provide mechanisms
that enable further optimization of data transfer
when RPC applications are structured
to exploit direct data placement.
In this context, the Network File System (NFS) family of protocols
(as described in
<xref target="RFC1094" format="default" sectionFormat="of"/>,
<xref target="RFC1813" format="default" sectionFormat="of"/>,
<xref target="RFC7530" format="default" sectionFormat="of"/>,
<xref target="RFC7862" format="default" sectionFormat="of"/>,
<xref target="RFC8881" format="default" sectionFormat="of"/>,
and subsequent NFSv4 minor versions)
are all potential beneficiaries of RPC-over-RDMA.
</t>
<t>
A complete problem statement appears in
<xref target="RFC5532" format="default" sectionFormat="of"/>.
</t>
</section>
<section
anchor="section_0a2befc3-b5d7-468e-a48e-97c46a9c1b40"
numbered="true"
removeInRFC="false"
toc="default">
<name>Motivation for a New Version</name>
<t>
Storage administrators have broadly deployed
the RPC-over-RDMA version 1 protocol specified in
<xref target="RFC8166" format="default" sectionFormat="of"/>.
However, there are known shortcomings to this protocol:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The protocol's default size of Receive buffers forces
the use of RDMA Read and Write transfers for small payloads,
and limits the size of reverse-direction messages.
</li>
<li>
It is difficult to make optimizations or protocol fixes
that require changes to on-the-wire behavior.
</li>
<li>
For some RPC procedures, the maximum reply size is
difficult or impossible for an RPC client to estimate
in advance.
</li>
</ul>
<t>
To address these issues in a way that preserves interoperation
with existing RPC-over-RDMA version 1 deployments,
the current document presents
an updated version of the RPC-over-RDMA transport protocol.
</t>
<t>
This version of RPC-over-RDMA is extensible,
enabling the introduction of <bcp14>OPTIONAL</bcp14> extensions
without impacting existing implementations.
See
<xref target="section_d945b9f0-0666-4db7-9126-be57cf7b5f4f" format="default" sectionFormat="of"/>
for further discussion.
It introduces a mechanism to exchange implementation properties
to automatically provide further optimization of data transfer.
</t>
<t>
This version also contains incremental changes that
relieve performance constraints
and
enable recovery from unusual corner cases.
These changes are outlined in
<xref target="section_c2574344-5aec-427d-a5ed-048d7fcc0d95" format="default" sectionFormat="of"/>
and include
a larger default inline threshold,
the ability to convey a single RPC message using multiple RDMA Send operations,
support for authentication of connection peers,
richer error reporting,
improved credit-based flow control,
and
support for Remote Invalidation.
</t>
</section>
</section>
<section
anchor="section_ef1a2819-4d22-40af-8d38-fde10849c872"
numbered="true"
removeInRFC="false"
toc="default">
<name>Requirements Language</name>
<t>
The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>",
"<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>",
"<bcp14>SHALL NOT</bcp14>", "<bcp14>SHOULD</bcp14>",
"<bcp14>SHOULD NOT</bcp14>",
"<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>",
"<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>"
in this document are to be interpreted
as described in BCP 14
<xref target="RFC2119" format="default" sectionFormat="of"/>
<xref target="RFC8174" format="default" sectionFormat="of"/>
when, and only when, they appear in all capitals, as shown here.
</t>
</section>
<section
anchor="section_4dc39c9c-3770-491f-b674-f824e87e2143"
numbered="true"
removeInRFC="false"
toc="default">
<name>Terminology</name>
<section
anchor="section_4d675e45-c377-48df-8029-c5c9f8c48f9f"
numbered="true"
removeInRFC="false"
toc="default">
<name>Remote Procedure Calls</name>
<t>
This section highlights critical elements of the RPC protocol
<xref target="RFC5531" format="default" sectionFormat="of"/>
and
the External Data Representation (XDR)
<xref target="RFC4506" format="default" sectionFormat="of"/>
it uses.
RPC-over-RDMA version 2 enables
the transmission of RPC messges built using XDR
and
also uses XDR internally to describe its header format.
</t>
<section
anchor="section_8d804fe5-c7c7-4c6c-92d8-888da10caaec"
numbered="true"
removeInRFC="false"
toc="default">
<name>Upper-Layer Protocols</name>
<t>
RPCs are an abstraction used to implement the operations of an Upper-Layer Protocol (ULP).
For RPC-over-RDMA, "ULP" refers to an RPC Program and Version tuple,
which is a versioned set of procedure calls that comprise a single well-defined API.
One example of a ULP is the Network File System Version 4.0
<xref target="RFC7530" format="default" sectionFormat="of"/>.
In the current document, the term "RPC consumer" refers to
an implementation of a ULP running on an RPC client.
</t>
</section>
<section
anchor="section_17a77782-8b11-4fb5-af0b-e0da7759c10A"
numbered="true"
removeInRFC="false"
toc="default">
<name>RPC Procedures</name>
<t>
Like a local procedure call,
every RPC procedure has a set of "arguments" and a set of "results".
A calling context invokes an RPC procedure,
passing arguments to it,
and the procedure subsequently returns a set of results.
Unlike a local procedure call,
an RPC procedure is executed remotely rather than
in the local application's execution context.
</t>
</section>
<section
anchor="section_97382254-b1a3-4e03-98e5-a0814b331bd0"
numbered="true"
removeInRFC="false"
toc="default">
<name>RPC Transactions</name>
<t>
The RPC protocol as described in
<xref target="RFC5531" format="default" sectionFormat="of"/>
is fundamentally a message-passing protocol
between one or more clients, where RPC consumers are running,
and a server, where a remote execution context is available
to process RPC transactions on behalf of these consumers.
</t>
<t>
ONC RPC transactions consist of two types of messages:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
A CALL message, or "Call", requests work.
An RPC Call message is designated
by the value zero (0) in the message's msg_type field.
</li>
<li>
A REPLY message, or "Reply",
reports the results of work requested by an RPC Call message.
An RPC Reply message is designated
by the value one (1) in the message's msg_type field.
</li>
</ul>
<t>
<xref target="RFC5531" section="9" format="default" sectionFormat="of"/>
introduces the RPC transaction identifier,
or "XID" for short.
Each connection endpoint interprets the value of an XID
in the context of the message's msg_type field.
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The sender of a Call message generates an arbitrary XID value
for each RPC that is unique among outstanding Calls from that sender.
</li>
<li>
The sender of a Reply message copies the XID of the initiating Call
to the Reply containing the results of that procedure.
</li>
</ul>
<t>
After receiving a Reply,
a Requester then matches the XID value in that Reply
with a Call it previously sent.
</t>
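<t>
To make the matching rule concrete,
the following C sketch shows one way a Requester might
track outstanding Calls and match an incoming Reply by its XID.
The pending-call table and helper names are hypothetical
and appear here only for illustration;
real implementations typically use a hash table keyed by XID.
</t>
<sourcecode type="c"><![CDATA[
#include <stdint.h>

/* Hypothetical record of a Call awaiting its Reply. */
struct pending_call {
    uint32_t xid;      /* XID carried in the Call header */
    int      in_use;   /* slot tracks an outstanding Call */
    void    *context;  /* caller state to resume on completion */
};

#define MAX_PENDING 128
static struct pending_call pending[MAX_PENDING];

/* Record a new Call; its XID is unique among outstanding Calls. */
int remember_call(uint32_t xid, void *context)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (!pending[i].in_use) {
            pending[i].xid = xid;
            pending[i].context = context;
            pending[i].in_use = 1;
            return 0;
        }
    }
    return -1;  /* too many Calls outstanding */
}

/* Match a Reply against outstanding Calls and retire the XID. */
void *match_reply(uint32_t reply_xid)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (pending[i].in_use && pending[i].xid == reply_xid) {
            pending[i].in_use = 0;  /* XID is retired */
            return pending[i].context;
        }
    }
    return NULL;  /* unrecognized XID: discard the Reply */
}
]]></sourcecode>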
<t>
The ratio of Call messages to Reply messages is typically
but not always one-to-one.
</t>
<t>
The most common operational paradigm is when
a Requester sends a Call message to a Responder,
who then sends a Reply message back to the Requester
with the results of that procedure.
One Call message elicits a single Reply message in response.
A Responder never sends more than one Reply
for each received Call message.
</t>
<t>
A "retransmission" occurs when a Requester sends
exactly the same Call message,
with the same arguments and XID, more than once.
A Requester can retransmit if it believes the network layer
or Responder has dropped a Call message,
or if the Responder's Reply has been likewise lost.
To prevent unnecessary network traffic or the execution
of non-idempotent procedures multiple times,
Requesters avoid retransmitting needlessly.
</t>
<t>
In rare cases, an RPC procedure may not require any
results or even acknowledgement that the Responder
has executed the procedure.
In that case, the Requester sends a Call message
but no Reply is returned.
This document refers to that case as "Call-only".
</t>
</section>
<section
anchor="section_7e64e30d-0519-449b-b0bf-45c3d103b0be"
numbered="true"
removeInRFC="false"
toc="default">
<name>Message Serialization</name>
<t>
RPC messages are always transmitted atomically.
RPC peers may interleave messages,
but the contents of individual messages
cannot be broken up or interleaved
without making the messages unintelligible.
</t>
<t>
An RPC peer acting as a "Requester"
serializes the procedure's arguments
and
conveys them to a "Responder" endpoint via an RPC Call message.
A Call message contains an RPC protocol header with a unique XID,
a header describing the requested upper-layer operation,
and all arguments.
</t>
<t>
An RPC peer acting as a "Responder"
deserializes these arguments and processes the requested procedure.
It then serializes the procedure's results into an RPC Reply message.
An RPC Reply message contains an RPC protocol header with the same XID,
a header describing the upper-layer reply,
and all results.
</t>
<t>
The Requester deserializes the results
and
allows the RPC consumer to proceed.
At this point, the RPC transaction
designated by the XID in the RPC Call message is complete,
and the XID is retired.
</t>
</section>
<section
anchor="section_63f47fcd-629b-4b00-aa8b-dbf836401581"
numbered="true"
removeInRFC="false"
toc="default">
<name>RPC Transports</name>
<t>
The role of an "RPC transport" is
to mediate the exchange of RPC messages
between Requesters and Responders,
bridging the gap between
the RPC message abstraction
and
the native operations of a network transport
(e.g., a socket).
</t>
<t>
When an RPC transport type is connection-oriented,
RPC client endpoints initiate transport connections,
while RPC server endpoints wait passively to accept incoming connection requests.
RPC messages may also be exchanged without a connection association.
Because RPC-over-RDMA is a connection-oriented RPC transport,
connectionless operation is not discussed further in the current document.
</t>
<section
anchor="section_7aafb376-73d8-4fa1-8888-97a02c9a58c1"
numbered="true"
removeInRFC="false"
toc="default">
<name>Transport Failure Recovery</name>
<t>
So that appropriate and timely recovery action can be taken,
the transport implementation is responsible for notifying
a Requester when an RPC Call or Reply could not be conveyed.
Recovery can take the form of establishing a new connection,
re-sending RPC Calls, or terminating RPC transactions pending
on the Requester.
</t>
<t>
For instance, a connection loss may occur after a Responder
has received an RPC Call but before it can send the matching RPC Reply.
Once the transport notifies the Requester of the connection loss,
the Requester can re-send all pending RPC Calls on a fresh connection.
</t>
</section>
<section
anchor="section_2432566f-67e7-4f35-8ec4-9ed44cecd8cc"
numbered="true"
removeInRFC="false"
toc="default">
<name>Forward Direction</name>
<t>
Traditionally, an RPC client acts as a Requester,
while an RPC service acts as a Responder.
The current document
refers to this direction of RPC message passing
as "forward-direction" operation.
</t>
</section>
<section
anchor="section_189a59d0-9235-4c1e-a76f-ea2b20fd6c94"
numbered="true"
removeInRFC="false"
toc="default">
<name>Reverse Direction</name>
<t>
The RPC specification
<xref target="RFC5531" format="default" sectionFormat="of"/>
does not forbid performing RPC transactions
in the other direction.
An RPC service endpoint can act as a Requester,
in which case an RPC client endpoint acts as a Responder.
This direction of RPC message passing is known as
"reverse-direction" operation.
</t>
<t>
During reverse-direction operation,
an RPC client is responsible
for establishing transport connections,
even though the RPC server originates RPC Calls.
</t>
<t>
RPC clients and servers are usually optimized
to perform and scale well when handling traffic
in the forward direction.
They might not be prepared to handle operation
in the reverse direction.
Not until NFS version 4.1
<xref target="RFC8881" format="default" sectionFormat="of"/>
has there been a strong need
to handle reverse-direction operation.
</t>
</section>
<section
anchor="section_05f24e3b-ad49-4370-a0fe-477b0f1364aa"
numbered="true"
removeInRFC="false"
toc="default">
<name>Bi-directional Operation</name>
<t>
A pair of connected RPC endpoints may choose to use
only forward-direction
or
only reverse-direction operation
on a particular transport connection.
Or, these endpoints may send Calls
in both directions concurrently
on the same transport connection.
</t>
<t>
"Bi-directional operation" occurs when both transport endpoints
act as a Requester and a Responder at the same time
on a single connection.
</t>
<t>
Bi-directionality is an extension
of RPC transport connection sharing.
Two RPC endpoints wish to exchange
independent RPC messages over a shared connection
but in opposite directions.
These messages may or may not be related
to the same workloads or RPC Programs.
</t>
<t>
During bi-directional operation,
forward- and reverse-direction XIDs
are typically generated
on distinct hosts by possibly different algorithms.
There is no coordination between the generation of XIDs
used in forward-direction and reverse-direction operation.
</t>
<t>
Therefore, a forward-direction Requester
<bcp14>MAY</bcp14> use the same XID value at the same time
as a reverse-direction Requester
on the same transport connection.
Although such concurrent requests use the same XID value,
they represent distinct RPC transactions.
</t>
</section>
</section>
<section
anchor="section_98bdc62c-0af4-4379-8b5c-6d98b7a520c7"
numbered="true"
removeInRFC="false"
toc="default">
<name>External Data Representation</name>
<t>
One cannot assume that all Requesters and Responders
represent data objects in the same way internally.
RPC uses External Data Representation (XDR)
to translate native data types and serialize arguments and results
<xref target="RFC4506" format="default" sectionFormat="of"/>.
</t>
<t>
XDR encodes data independently
of the endianness or size of host-native data types,
enabling unambiguous decoding of data by a receiver.
</t>
<t>
XDR assumes only that the number of bits in a byte (octet)
and
their order are the same on both endpoints and the physical network.
The smallest indivisible unit of XDR encoding is a group of four octets.
XDR can also flatten
lists,
arrays,
and
other complex data types
into a stream of bytes.
</t>
<t>
We refer to a serialized stream of bytes
that is the result of XDR encoding
as an "XDR stream".
A sender encodes native data
into an XDR stream and then transmits that stream to a receiver.
The receiver decodes incoming XDR byte streams
into its native data representation format.
</t>
<section
anchor="section_c6d3092c-99e6-4cce-b377-fffc4862929F"
numbered="true"
removeInRFC="false"
toc="default">
<name>XDR Opaque Data</name>
<t>
Sometimes, a data item is to be transferred as-is,
without encoding or decoding.
We refer to the contents of such a data item as "opaque data".
XDR encoding places the content of opaque data items
directly into an XDR stream without altering it in any way.
ULPs or applications perform
any needed data translation in this case.
Examples of opaque data items include the content of files
or generic byte strings.
</t>
</section>
<section
anchor="section_c210323f-c524-4e98-a02d-23549a4bebc5"
numbered="true"
removeInRFC="false"
toc="default">
<name>XDR Roundup</name>
<t>
The number of octets in a variable-length data item
precedes that item in an XDR stream.
If the size of an encoded data item is not a multiple of four octets,
the sender appends octets containing zero after the end of the data item.
These zero octets shift the next encoded data item in the XDR stream
so that it always starts on a four-octet boundary.
The addition of extra octets does not change
the encoded size of the data item.
Receivers do not expose the extra octets to ULPs.
</t>
<t>
We refer to this technique as "XDR roundup",
and the extra octets as "XDR roundup padding".
</t>
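<t>
A brief C sketch of the sender's side of this technique follows.
It encodes a variable-length opaque item:
a four-octet length, the data itself,
and zero octets of roundup padding.
Note that the encoded length field carries the item's true size,
not the padded size.
</t>
<sourcecode type="c"><![CDATA[
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/*
 * Encode a variable-length opaque item into an XDR stream:
 * the four-octet length, the data, then XDR roundup padding
 * so that the next item starts on a four-octet boundary.
 * Returns the number of octets written to the stream.
 */
size_t xdr_encode_opaque(uint8_t *stream, const void *data, uint32_t len)
{
    uint32_t be_len = htonl(len);            /* true size on the wire */
    size_t padded = (len + 3) & ~(size_t)3;  /* round up to 4 octets */

    memcpy(stream, &be_len, sizeof(be_len));
    memcpy(stream + 4, data, len);
    memset(stream + 4 + len, 0, padded - len);  /* roundup padding */
    return 4 + padded;
}
]]></sourcecode>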
</section>
</section>
</section>
<section
anchor="section_de830270-64ed-4510-ac25-29837d352031"
numbered="true"
removeInRFC="false"
toc="default">
<name>Remote Direct Memory Access</name>
<t>
When a third party transfers large RPC payloads,
RPC Requesters and Responders can become more efficient.
An example of such a third party might be
an intelligent network interface
(data movement offload),
which places data in the receiver's memory so that
no additional adjustment of data alignment is necessary
(direct data placement or "DDP").
RDMA transports enable both of these optimizations.
</t>
<t>
In the current document, the standalone term "RDMA" refers to
the physical mechanism an RDMA transport utilizes when moving data.
</t>
<section
anchor="section_1b97ecfd-7aba-4299-9007-dab28ac76f81"
numbered="true"
removeInRFC="false"
toc="default">
<name>Direct Data Placement</name>
<t>
Typically, RPC implementations copy
the contents of RPC messages into a buffer before sending them.
An efficient RPC implementation sends bulk data
without first copying it into a separate send buffer.
</t>
<t>
However, socket-based RPC implementations
are often unable to receive data directly
into its final place in memory.
Receivers often need to copy incoming data
to finish an RPC operation,
if only to adjust data alignment.
</t>
<t>
Although it may not be efficient,
before an RDMA transfer, a sender may copy data into an intermediate buffer.
After an RDMA transfer, a receiver may copy that data again to its final destination.
In this document, the term "DDP" refers to
any optimized data transfer where a receiving host's CPU
does not move transferred data
to another location after arrival.
</t>
<t>
RPC-over-RDMA version 2 enables the use of RDMA Read and Write operations
to achieve both data movement offload and DDP.
However, note that
not all RDMA-based data transfer qualifies as DDP,
and
some mechanisms that do not employ explicit RDMA can place data directly.
</t>
</section>
<section
anchor="section_6903045e-bd1c-4e12-bf96-6b534989f46A"
numbered="true"
removeInRFC="false"
toc="default">
<name>RDMA Transport Operation</name>
<t>
RDMA transports require that
RDMA consumers provision resources in advance
to achieve good performance during receive operations.
An RDMA consumer might provide Receive buffers in advance
by posting an RDMA Receive Work Request
for every expected RDMA Send from a remote peer.
These buffers are provided
before the remote peer posts RDMA Send Work Requests.
Thus, this practice is often referred to as "pre-posting" buffers.
</t>
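<t>
As a minimal illustration,
the following sketch pre-posts a single Receive buffer
using the libibverbs API,
one common interface to RDMA providers.
It assumes the caller has already created
a queue pair (qp) and registered the buffer (mr);
these names are parameters of the sketch, not part of this protocol.
</t>
<sourcecode type="c"><![CDATA[
#include <stdint.h>
#include <infiniband/verbs.h>

/* Pre-post one Receive buffer for a future inbound RDMA Send. */
int prepost_receive(struct ibv_qp *qp, struct ibv_mr *mr)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,   /* where inbound Send data lands */
        .length = (uint32_t)mr->length,  /* caps the inbound Send's size */
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = {
        .wr_id   = (uintptr_t)mr,        /* returned in the completion */
        .sg_list = &sge,
        .num_sge = 1,
    };
    struct ibv_recv_wr *bad_wr;

    /* The buffer remains pinned until this Receive completes. */
    return ibv_post_recv(qp, &wr, &bad_wr);
}
]]></sourcecode>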
<t>
An RDMA Receive Work Request remains outstanding
until the RDMA provider matches it to an inbound Send operation.
The resources associated with that Receive must be retained in
host memory, or "pinned", until the Receive completes.
</t>
<t>
Given these tenets of operation,
the RPC-over-RDMA version 2 protocol assumes
each transport provides the following abstract operations.
A more complete discussion of these operations appears in
<xref target="RFC5040" format="default" sectionFormat="of"/>.
</t>
<section
anchor="section_90f88ba5-5ad6-4ac1-b40d-ed9247e61ca5"
numbered="true"
removeInRFC="false"
toc="default">
<name>Memory Registration</name>
<t>
Memory registration assigns a steering tag
to a region of memory,
permitting the RDMA provider
to perform data-transfer operations.
The RPC-over-RDMA version 2 protocol assumes that
a steering tag of no more than 32 bits,
combined with a memory address of up to 64 bits in length,
identifies each registered memory region.
</t>
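<t>
A minimal sketch follows,
again using the libibverbs API as an example provider interface.
The 32-bit rkey returned by the provider
plays the role of the steering tag
that is advertised to the remote peer
along with the region's address and length.
</t>
<sourcecode type="c"><![CDATA[
#include <stddef.h>
#include <infiniband/verbs.h>

/*
 * Register a memory region for remote access. Assumes the
 * caller has already allocated a protection domain (pd).
 */
struct ibv_mr *expose_region(struct ibv_pd *pd, void *buf, size_t len)
{
    /* The access flags grant the remote peer Read/Write rights. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* On success, mr->rkey is the 32-bit steering tag to
     * advertise, together with buf's 64-bit address and len. */
    return mr;
}
]]></sourcecode>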
</section>
<section
anchor="section_07bba55f-c48f-474c-918b-db6c9d2325dd"
numbered="true"
removeInRFC="false"
toc="default">
<name>RDMA Send</name>
<t>
The RDMA provider supports an RDMA Send operation,
with completion signaled on the receiving peer
after the RDMA provider has placed data in a pre-posted buffer.
Sends complete at the receiver
in the order they were posted at the sender.
The size of the remote peer's pre-posted buffers
limits the amount of data
that can be transferred by a single RDMA Send operation.
</t>
</section>
<section
anchor="section_9be6a44c-1ea5-4ccd-b188-ee04e930497b"
numbered="true"
removeInRFC="false"
toc="default">
<name>RDMA Receive</name>
<t>
The RDMA provider supports an RDMA Receive operation
to receive data conveyed by incoming RDMA Send operations.
To reduce the amount of memory that must remain pinned
awaiting incoming Sends,
the amount of memory posted per Receive is limited.
The RDMA consumer (in this case, the RPC-over-RDMA version 2 protocol)
provides flow control to prevent overrunning receiver resources.
</t>
</section>
<section
anchor="section_cfd79bce-e8e9-4a51-b43a-b747af6213f4"
numbered="true"
removeInRFC="false"
toc="default">
<name>RDMA Write</name>
<t>
The RDMA provider supports an RDMA Write operation
to place data directly into a remote memory region.
The local host initiates an RDMA Write
and the RDMA provider signals completion there.
The remote RDMA provider does not signal completion
on the remote peer.
The local host provides
the steering tag,
the memory address,
and
the length of the remote peer's memory region.
</t>
<t>
RDMA Writes are not ordered relative to one another,
but are ordered relative to RDMA Sends.
Thus, a subsequent RDMA Send completion
signaled on the local peer
guarantees that prior RDMA Write data
has been successfully placed in the remote peer's memory.
</t>
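<t>
The following sketch shows this ordering pattern with libibverbs:
a Responder chains an RDMA Write ahead of an RDMA Send
on the same queue pair,
so the peer's Receive completion for the Send implies
the Write data has been placed.
The registered buffers (data_mr, hdr_mr)
and the peer's advertised steering tag and address
are assumed inputs.
</t>
<sourcecode type="c"><![CDATA[
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post an RDMA Write, then an RDMA Send, as one ordered chain. */
int write_then_send(struct ibv_qp *qp,
                    struct ibv_mr *data_mr, uint64_t remote_addr,
                    uint32_t rkey, struct ibv_mr *hdr_mr)
{
    struct ibv_sge data_sge = {
        .addr   = (uintptr_t)data_mr->addr,
        .length = (uint32_t)data_mr->length,
        .lkey   = data_mr->lkey,
    };
    struct ibv_sge hdr_sge = {
        .addr   = (uintptr_t)hdr_mr->addr,
        .length = (uint32_t)hdr_mr->length,
        .lkey   = hdr_mr->lkey,
    };
    struct ibv_send_wr send_wr = {
        .opcode     = IBV_WR_SEND,      /* signals the remote peer */
        .send_flags = IBV_SEND_SIGNALED,
        .sg_list    = &hdr_sge,
        .num_sge    = 1,
    };
    struct ibv_send_wr write_wr = {
        .opcode  = IBV_WR_RDMA_WRITE,   /* no remote completion */
        .sg_list = &data_sge,
        .num_sge = 1,
        .next    = &send_wr,            /* Write ordered before Send */
        .wr      = { .rdma = { .remote_addr = remote_addr,
                               .rkey = rkey } },
    };
    struct ibv_send_wr *bad_wr;

    return ibv_post_send(qp, &write_wr, &bad_wr);
}
]]></sourcecode>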
</section>
<section
anchor="section_f37121af-49ff-4575-a699-7310f4ae1296"
numbered="true"
removeInRFC="false"
toc="default">
<name>RDMA Read</name>
<t>
The RDMA provider supports an RDMA Read operation
to place remote source data directly into local memory.
The local host initiates an RDMA Read
and the RDMA provider signals completion there.
The remote RDMA provider does not signal
completion on the remote peer.
The local host provides
the steering tags,
the memory addresses,
and the lengths for the remote source
and
local destination memory regions.
</t>
<t>
The RDMA consumer (in this case, the RPC-over-RDMA version 2 protocol)
signals Read completion to the remote peer
as part of a subsequent RDMA Send message.
The remote peer can then invalidate steering tags
and
subsequently free associated source memory regions.
</t>
</section>
</section>
</section>
</section>
<section
anchor="section_5ae4b016-9b44-4649-9021-5ae851ac9326"
numbered="true"
removeInRFC="false"
toc="default">
<name>RPC-over-RDMA Framework</name>
<t>
Before an RDMA data transfer can occur,
an endpoint first exposes regions of its memory to a remote endpoint.
The remote endpoint then initiates RDMA Read and Write operations
against the exposed memory.
A "transfer model" designates
which endpoint exposes its memory
and
which is responsible for initiating the transfer of data.
</t>
<t>
In RPC-over-RDMA version 2,
only Requesters expose their memory to the Responder,
and only Responders initiate RDMA Read and Write operations.
Read access to memory regions enables the Responder to
pull RPC arguments
or
whole RPC Calls from each Requester.
The Responder pushes
RPC results
or
whole RPC Replies to a Requester's
memory regions to which it has write access.
</t>
<section
anchor="section_195e0288-862d-40bb-a259-4239930c728a"
numbered="true"
removeInRFC="false"
toc="default">
<name>Message Framing</name>
<t>
Each RPC-over-RDMA version 2 message consists of at most two XDR streams:
</t>
<ul spacing="normal" bare="false" empty="false">
<li>
The "Transport stream" contains a header that describes
and controls the transfer of the Payload stream
in this RPC-over-RDMA message.
Every RDMA Send on an RPC-over-RDMA version 2 connection
<bcp14>MUST</bcp14> begin with a Transport stream.
</li>
<li>
The "Payload stream" contains part or all of a single RPC message.
The sender <bcp14>MAY</bcp14> divide an RPC message at any convenient boundary
but
<bcp14>MUST</bcp14> send RPC message fragments in XDR stream order
and
<bcp14>MUST NOT</bcp14> interleave Payload streams from multiple RPC messages.
</li>
</ul>
<t>
The RPC-over-RDMA framing mechanism described in this section
replaces all other RPC framing mechanisms.
Connection peers use RPC-over-RDMA framing
even when the underlying RDMA protocol runs
on a transport type with well-defined RPC framing, such as TCP.
However, a ULP can negotiate the use of RDMA,
dynamically enabling the use of RPC-over-RDMA on a connection
established on some other transport type.
Because RPC framing delimits an entire RPC request or reply,
the resulting shift in framing must occur between distinct RPC messages,
and in concert with the underlying transport.
</t>
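<t>
The following sketch illustrates how a sender might assemble
the buffer carried by one RDMA Send:
the Transport stream first,
followed by the next in-order fragment
of a single RPC message's Payload stream.
The encode_transport_header() helper is hypothetical,
standing in for XDR encoding of the transport header
defined later in this document.
</t>
<sourcecode type="c"><![CDATA[
#include <stdint.h>
#include <string.h>

/*
 * Stand-in for XDR encoding of the transport header defined
 * later in this document. Hypothetical: a real implementation
 * encodes the header fields; the size here is a placeholder.
 */
static size_t encode_transport_header(uint8_t *buf, size_t buf_len)
{
    (void)buf;
    (void)buf_len;
    return 28;  /* placeholder header size */
}

/*
 * Assemble one RDMA Send buffer: the Transport stream always
 * comes first; any remaining space carries the next in-order
 * fragment of at most one RPC message's Payload stream.
 * Returns the total octets to Send; *consumed reports how far
 * the caller advanced through the Payload stream.
 */
size_t build_send_buffer(uint8_t *buf, size_t buf_len,
                         const uint8_t *payload, size_t payload_len,
                         size_t *consumed)
{
    size_t hdr_len = encode_transport_header(buf, buf_len);
    size_t frag = buf_len - hdr_len;

    if (frag > payload_len)
        frag = payload_len;  /* final fragment of this message */
    memcpy(buf + hdr_len, payload, frag);

    *consumed = frag;        /* fragments stay in XDR stream order */
    return hdr_len + frag;
}
]]></sourcecode>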
</section>
<section
anchor="section_130ce79c-8b13-479e-8108-a943024047dD"
numbered="true"
removeInRFC="false"
toc="default">
<name>Reliable Message Delivery</name>
<t>
RPC-over-RDMA provides
a reliable
and
in-order
data transport service for RPC Calls and Replies.
</t>
<t>
RPC-over-RDMA transports
<bcp14>MUST</bcp14>
operate only on a reliable Queue Pair (QP) such as
the RDMA RC (Reliable Connected) QP type
as defined in Section 9.7.7 of
<xref target="IBA" format="default" sectionFormat="of"/>.
The Marker PDU Aligned (MPA) protocol
<xref target="RFC5044" format="default" sectionFormat="of"/>,
when deployed on a reliable transport such as TCP,
provides similar functionality.
Using a reliable QP type ensures
in-transit data integrity
and