forked from libwww-perl/URI
-
Notifications
You must be signed in to change notification settings - Fork 0
/
rfc2396.txt
2243 lines (1394 loc) · 82.5 KB
/
rfc2396.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
Network Working Group T. Berners-Lee
Request for Comments: 2396 MIT/LCS
Updates: 1808, 1738 R. Fielding
Category: Standards Track U.C. Irvine
L. Masinter
Xerox Corporation
August 1998
Uniform Resource Identifiers (URI): Generic Syntax
Status of this Memo
This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (1998). All Rights Reserved.
IESG Note
This paper describes a "superset" of operations that can be applied
to URI. It consists of both a grammar and a description of basic
functionality for URI. To understand what is a valid URI, both the
grammar and the associated description have to be studied. Some of
the functionality described is not applicable to all URI schemes, and
some operations are only possible when certain media types are
retrieved using the URI, regardless of the scheme used.
Abstract
A Uniform Resource Identifier (URI) is a compact string of characters
for identifying an abstract or physical resource. This document
defines the generic syntax of URI, including both absolute and
relative forms, and guidelines for their use; it revises and replaces
the generic definitions in RFC 1738 and RFC 1808.
This document defines a grammar that is a superset of all valid URI,
such that an implementation can parse the common components of a URI
reference without knowing the scheme-specific requirements of every
possible identifier type. This document does not define a generative
grammar for URI; that task will be performed by the individual
specifications of each URI scheme.
Berners-Lee, et. al. Standards Track [Page 1]
RFC 2396 URI Generic Syntax August 1998
1. Introduction
Uniform Resource Identifiers (URI) provide a simple and extensible
means for identifying a resource. This specification of URI syntax
and semantics is derived from concepts introduced by the World Wide
Web global information initiative, whose use of such objects dates
from 1990 and is described in "Universal Resource Identifiers in WWW"
[RFC1630]. The specification of URI is designed to meet the
recommendations laid out in "Functional Recommendations for Internet
Resource Locators" [RFC1736] and "Functional Requirements for Uniform
Resource Names" [RFC1737].
This document updates and merges "Uniform Resource Locators"
[RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in order
to define a single, generic syntax for all URI. It excludes those
portions of RFC 1738 that defined the specific syntax of individual
URL schemes; those portions will be updated as separate documents, as
will the process for registration of new URI schemes. This document
does not discuss the issues and recommendation for dealing with
characters outside of the US-ASCII character set [ASCII]; those
recommendations are discussed in a separate document.
All significant changes from the prior RFCs are noted in Appendix G.
1.1 Overview of URI
URI are characterized by the following definitions:
Uniform
Uniformity provides several benefits: it allows different types
of resource identifiers to be used in the same context, even
when the mechanisms used to access those resources may differ;
it allows uniform semantic interpretation of common syntactic
conventions across different types of resource identifiers; it
allows introduction of new types of resource identifiers
without interfering with the way that existing identifiers are
used; and, it allows the identifiers to be reused in many
different contexts, thus permitting new applications or
protocols to leverage a pre-existing, large, and widely-used
set of resource identifiers.
Resource
A resource can be anything that has identity. Familiar
examples include an electronic document, an image, a service
(e.g., "today's weather report for Los Angeles"), and a
collection of other resources. Not all resources are network
"retrievable"; e.g., human beings, corporations, and bound
books in a library can also be considered resources.
Berners-Lee, et. al. Standards Track [Page 2]
RFC 2396 URI Generic Syntax August 1998
The resource is the conceptual mapping to an entity or set of
entities, not necessarily the entity which corresponds to that
mapping at any particular instance in time. Thus, a resource
can remain constant even when its content---the entities to
which it currently corresponds---changes over time, provided
that the conceptual mapping is not changed in the process.
Identifier
An identifier is an object that can act as a reference to
something that has identity. In the case of URI, the object is
a sequence of characters with a restricted syntax.
Having identified a resource, a system may perform a variety of
operations on the resource, as might be characterized by such words
as `access', `update', `replace', or `find attributes'.
1.2. URI, URL, and URN
A URI can be further classified as a locator, a name, or both. The
term "Uniform Resource Locator" (URL) refers to the subset of URI
that identify resources via a representation of their primary access
mechanism (e.g., their network "location"), rather than identifying
the resource by name or by some other attribute(s) of that resource.
The term "Uniform Resource Name" (URN) refers to the subset of URI
that are required to remain globally unique and persistent even when
the resource ceases to exist or becomes unavailable.
The URI scheme (Section 3.1) defines the namespace of the URI, and
thus may further restrict the syntax and semantics of identifiers
using that scheme. This specification defines those elements of the
URI syntax that are either required of all URI schemes or are common
to many URI schemes. It thus defines the syntax and semantics that
are needed to implement a scheme-independent parsing mechanism for
URI references, such that the scheme-dependent handling of a URI can
be postponed until the scheme-dependent semantics are needed. We use
the term URL below when describing syntax or semantics that only
apply to locators.
Although many URL schemes are named after protocols, this does not
imply that the only way to access the URL's resource is via the named
protocol. Gateways, proxies, caches, and name resolution services
might be used to access some resources, independent of the protocol
of their origin, and the resolution of some URL may require the use
of more than one protocol (e.g., both DNS and HTTP are typically used
to access an "http" URL's resource when it can't be found in a local
cache).
Berners-Lee, et. al. Standards Track [Page 3]
RFC 2396 URI Generic Syntax August 1998
A URN differs from a URL in that it's primary purpose is persistent
labeling of a resource with an identifier. That identifier is drawn
from one of a set of defined namespaces, each of which has its own
set name structure and assignment procedures. The "urn" scheme has
been reserved to establish the requirements for a standardized URN
namespace, as defined in "URN Syntax" [RFC2141] and its related
specifications.
Most of the examples in this specification demonstrate URL, since
they allow the most varied use of the syntax and often have a
hierarchical namespace. A parser of the URI syntax is capable of
parsing both URL and URN references as a generic URI; once the scheme
is determined, the scheme-specific parsing can be performed on the
generic URI components. In other words, the URI syntax is a superset
of the syntax of all URI schemes.
1.3. Example URI
The following examples illustrate URI that are in common use.
ftp://ftp.is.co.za/rfc/rfc1808.txt
-- ftp scheme for File Transfer Protocol services
gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles
-- gopher scheme for Gopher and Gopher+ Protocol services
http://www.math.uio.no/faq/compression-faq/part1.html
-- http scheme for Hypertext Transfer Protocol services
mailto:mduerst@ifi.unizh.ch
-- mailto scheme for electronic mail addresses
news:comp.infosystems.www.servers.unix
-- news scheme for USENET news groups and articles
telnet://melvyl.ucop.edu/
-- telnet scheme for interactive services via the TELNET Protocol
1.4. Hierarchical URI and Relative Forms
An absolute identifier refers to a resource independent of the
context in which the identifier is used. In contrast, a relative
identifier refers to a resource by describing the difference within a
hierarchical namespace between the current context and an absolute
identifier of the resource.
Berners-Lee, et. al. Standards Track [Page 4]
RFC 2396 URI Generic Syntax August 1998
Some URI schemes support a hierarchical naming system, where the
hierarchy of the name is denoted by a "/" delimiter separating the
components in the scheme. This document defines a scheme-independent
`relative' form of URI reference that can be used in conjunction with
a `base' URI (of a hierarchical scheme) to produce another URI. The
syntax of hierarchical URI is described in Section 3; the relative
URI calculation is described in Section 5.
1.5. URI Transcribability
The URI syntax was designed with global transcribability as one of
its main concerns. A URI is a sequence of characters from a very
limited set, i.e. the letters of the basic Latin alphabet, digits,
and a few special characters. A URI may be represented in a variety
of ways: e.g., ink on paper, pixels on a screen, or a sequence of
octets in a coded character set. The interpretation of a URI depends
only on the characters used and not how those characters are
represented in a network protocol.
The goal of transcribability can be described by a simple scenario.
Imagine two colleagues, Sam and Kim, sitting in a pub at an
international conference and exchanging research ideas. Sam asks Kim
for a location to get more information, so Kim writes the URI for the
research site on a napkin. Upon returning home, Sam takes out the
napkin and types the URI into a computer, which then retrieves the
information to which Kim referred.
There are several design concerns revealed by the scenario:
o A URI is a sequence of characters, which is not always
represented as a sequence of octets.
o A URI may be transcribed from a non-network source, and thus
should consist of characters that are most likely to be able to
be typed into a computer, within the constraints imposed by
keyboards (and related input devices) across languages and
locales.
o A URI often needs to be remembered by people, and it is easier
for people to remember a URI when it consists of meaningful
components.
These design concerns are not always in alignment. For example, it
is often the case that the most meaningful name for a URI component
would require characters that cannot be typed into some systems. The
ability to transcribe the resource identifier from one medium to
another was considered more important than having its URI consist of
the most meaningful of components. In local and regional contexts
Berners-Lee, et. al. Standards Track [Page 5]
RFC 2396 URI Generic Syntax August 1998
and with improving technology, users might benefit from being able to
use a wider range of characters; such use is not defined in this
document.
1.6. Syntax Notation and Common Elements
This document uses two conventions to describe and define the syntax
for URI. The first, called the layout form, is a general description
of the order of components and component separators, as in
<first>/<second>;<third>?<fourth>
The component names are enclosed in angle-brackets and any characters
outside angle-brackets are literal separators. Whitespace should be
ignored. These descriptions are used informally and do not define
the syntax requirements.
The second convention is a BNF-like grammar, used to define the
formal URI syntax. The grammar is that of [RFC822], except that "|"
is used to designate alternatives. Briefly, rules are separated from
definitions by an equal "=", indentation is used to continue a rule
definition over more than one line, literals are quoted with "",
parentheses "(" and ")" are used to group elements, optional elements
are enclosed in "[" and "]" brackets, and elements may be preceded
with <n>* to designate n or more repetitions of the following
element; n defaults to 0.
Unlike many specifications that use a BNF-like grammar to define the
bytes (octets) allowed by a protocol, the URI grammar is defined in
terms of characters. Each literal in the grammar corresponds to the
character it represents, rather than to the octet encoding of that
character in any particular coded character set. How a URI is
represented in terms of bits and bytes on the wire is dependent upon
the character encoding of the protocol used to transport it, or the
charset of the document which contains it.
The following definitions are common to many elements:
alpha = lowalpha | upalpha
lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
"j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
"s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
"J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
"S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
Berners-Lee, et. al. Standards Track [Page 6]
RFC 2396 URI Generic Syntax August 1998
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
"8" | "9"
alphanum = alpha | digit
The complete URI syntax is collected in Appendix A.
2. URI Characters and Escape Sequences
URI consist of a restricted set of characters, primarily chosen to
aid transcribability and usability both in computer systems and in
non-computer communications. Characters used conventionally as
delimiters around URI were excluded. The restricted set of
characters consists of digits, letters, and a few graphic symbols
were chosen from those common to most of the character encodings and
input facilities available to Internet users.
uric = reserved | unreserved | escaped
Within a URI, characters are either used as delimiters, or to
represent strings of data (octets) within the delimited portions.
Octets are either represented directly by a character (using the US-
ASCII character for that octet [ASCII]) or by an escape encoding.
This representation is elaborated below.
2.1 URI and non-ASCII characters
The relationship between URI and characters has been a source of
confusion for characters that are not part of US-ASCII. To describe
the relationship, it is useful to distinguish between a "character"
(as a distinguishable semantic entity) and an "octet" (an 8-bit
byte). There are two mappings, one from URI characters to octets, and
a second from octets to original characters:
URI character sequence->octet sequence->original character sequence
A URI is represented as a sequence of characters, not as a sequence
of octets. That is because URI might be "transported" by means that
are not through a computer network, e.g., printed on paper, read over
the radio, etc.
A URI scheme may define a mapping from URI characters to octets;
whether this is done depends on the scheme. Commonly, within a
delimited component of a URI, a sequence of characters may be used to
represent a sequence of octets. For example, the character "a"
represents the octet 97 (decimal), while the character sequence "%",
"0", "a" represents the octet 10 (decimal).
Berners-Lee, et. al. Standards Track [Page 7]
RFC 2396 URI Generic Syntax August 1998
There is a second translation for some resources: the sequence of
octets defined by a component of the URI is subsequently used to
represent a sequence of characters. A 'charset' defines this mapping.
There are many charsets in use in Internet protocols. For example,
UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
of characters in the repertoire of ISO 10646.
In the simplest case, the original character sequence contains only
characters that are defined in US-ASCII, and the two levels of
mapping are simple and easily invertible: each 'original character'
is represented as the octet for the US-ASCII code for it, which is,
in turn, represented as either the US-ASCII character, or else the
"%" escape sequence for that octet.
For original character sequences that contain non-ASCII characters,
however, the situation is more difficult. Internet protocols that
transmit octet sequences intended to represent character sequences
are expected to provide some way of identifying the charset used, if
there might be more than one [RFC2277]. However, there is currently
no provision within the generic URI syntax to accomplish this
identification. An individual URI scheme may require a single
charset, define a default charset, or provide a way to indicate the
charset used.
It is expected that a systematic treatment of character encoding
within URI will be developed as a future modification of this
specification.
2.2. Reserved Characters
Many URI include components consisting of or delimited by, certain
special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved
purpose. If the data for a URI component would conflict with the
reserved purpose, then the conflicting data must be escaped before
forming the URI.
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | ","
The "reserved" syntax class above refers to those characters that are
allowed within a URI, but which may not be allowed within a
particular component of the generic URI syntax; they are used as
delimiters of the components described in Section 3.
Berners-Lee, et. al. Standards Track [Page 8]
RFC 2396 URI Generic Syntax August 1998
Characters in the "reserved" set are not reserved in all contexts.
The set of characters actually reserved within any given URI
component is defined by that component. In general, a character is
reserved if the semantics of the URI changes if the character is
replaced with its escaped US-ASCII encoding.
2.3. Unreserved Characters
Data characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include upper and lower case
letters, decimal digits, and a limited set of punctuation marks and
symbols.
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
Unreserved characters can be escaped without changing the semantics
of the URI, but this should not be done unless the URI is being used
in a context that does not allow the unescaped character to appear.
2.4. Escape Sequences
Data must be escaped if it does not have a representation using an
unreserved character; this includes data that does not correspond to
a printable character of the US-ASCII coded character set, or that
corresponds to any US-ASCII character that is disallowed, as
explained below.
2.4.1. Escaped Encoding
An escaped octet is encoded as a character triplet, consisting of the
percent character "%" followed by the two hexadecimal digits
representing the octet code. For example, "%20" is the escaped
encoding for the US-ASCII space character.
escaped = "%" hex hex
hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
"a" | "b" | "c" | "d" | "e" | "f"
2.4.2. When to Escape and Unescape
A URI is always in an "escaped" form, since escaping or unescaping a
completed URI might change its semantics. Normally, the only time
escape encodings can safely be made is when the URI is being created
from its component parts; each component may have its own set of
characters that are reserved, so only the mechanism responsible for
generating or interpreting that component can determine whether or
Berners-Lee, et. al. Standards Track [Page 9]
RFC 2396 URI Generic Syntax August 1998
not escaping a character will change its semantics. Likewise, a URI
must be separated into its components before the escaped characters
within those components can be safely decoded.
In some cases, data that could be represented by an unreserved
character may appear escaped; for example, some of the unreserved
"mark" characters are automatically escaped by some systems. If the
given URI scheme defines a canonicalization algorithm, then
unreserved characters may be unescaped according to that algorithm.
For example, "%7e" is sometimes used instead of "~" in an http URL
path, but the two are equivalent for an http URL.
Because the percent "%" character always has the reserved purpose of
being the escape indicator, it must be escaped as "%25" in order to
be used as data within a URI. Implementers should be careful not to
escape or unescape the same string more than once, since unescaping
an already unescaped string might lead to misinterpreting a percent
data character as another escaped character, or vice versa in the
case of escaping an already escaped string.
2.4.3. Excluded US-ASCII Characters
Although they are disallowed within the URI syntax, we include here a
description of those US-ASCII characters that have been excluded and
the reasons for their exclusion.
The control characters in the US-ASCII coded character set are not
used within a URI, both because they are non-printable and because
they are likely to be misinterpreted by some control mechanisms.
control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
The space character is excluded because significant spaces may
disappear and insignificant spaces may be introduced when URI are
transcribed or typeset or subjected to the treatment of word-
processing programs. Whitespace is also used to delimit URI in many
contexts.
space = <US-ASCII coded character 20 hexadecimal>
The angle-bracket "<" and ">" and double-quote (") characters are
excluded because they are often used as the delimiters around URI in
text documents and protocol fields. The character "#" is excluded
because it is used to delimit a URI from a fragment identifier in URI
references (Section 4). The percent character "%" is excluded because
it is used for the encoding of escaped characters.
delims = "<" | ">" | "#" | "%" | <">
Berners-Lee, et. al. Standards Track [Page 10]
RFC 2396 URI Generic Syntax August 1998
Other characters are excluded because gateways and other transport
agents are known to sometimes modify such characters, or they are
used as delimiters.
unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
Data corresponding to excluded characters must be escaped in order to
be properly represented within a URI.
3. URI Syntactic Components
The URI syntax is dependent upon the scheme. In general, absolute
URI are written as follows:
<scheme>:<scheme-specific-part>
An absolute URI contains the name of the scheme being used (<scheme>)
followed by a colon (":") and then a string (the <scheme-specific-
part>) whose interpretation depends on the scheme.
The URI syntax does not require that the scheme-specific-part have
any general structure or set of semantics which is common among all
URI. However, a subset of URI do share a common syntax for
representing hierarchical relationships within the namespace. This
"generic URI" syntax consists of a sequence of four main components:
<scheme>://<authority><path>?<query>
each of which, except <scheme>, may be absent from a particular URI.
For example, some URI schemes do not allow an <authority> component,
and others do not use a <query> component.
absoluteURI = scheme ":" ( hier_part | opaque_part )
URI that are hierarchical in nature use the slash "/" character for
separating hierarchical components. For some file systems, a "/"
character (used to denote the hierarchical structure of a URI) is the
delimiter used to construct a file name hierarchy, and thus the URI
path will look similar to a file pathname. This does NOT imply that
the resource is a file or that the URI maps to an actual filesystem
pathname.
hier_part = ( net_path | abs_path ) [ "?" query ]
net_path = "//" authority [ abs_path ]
abs_path = "/" path_segments
Berners-Lee, et. al. Standards Track [Page 11]
RFC 2396 URI Generic Syntax August 1998
URI that do not make use of the slash "/" character for separating
hierarchical components are considered opaque by the generic URI
parser.
opaque_part = uric_no_slash *uric
uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
"&" | "=" | "+" | "$" | ","
We use the term <path> to refer to both the <abs_path> and
<opaque_part> constructs, since they are mutually exclusive for any
given URI and can be parsed as a single component.
3.1. Scheme Component
Just as there are many different methods of access to resources,
there are a variety of schemes for identifying such resources. The
URI syntax consists of a sequence of components separated by reserved
characters, with the first component defining the semantics for the
remainder of the URI string.
Scheme names consist of a sequence of characters beginning with a
lower case letter and followed by any combination of lower case
letters, digits, plus ("+"), period ("."), or hyphen ("-"). For
resiliency, programs interpreting URI should treat upper case letters
as equivalent to lower case in scheme names (e.g., allow "HTTP" as
well as "http").
scheme = alpha *( alpha | digit | "+" | "-" | "." )
Relative URI references are distinguished from absolute URI in that
they do not begin with a scheme name. Instead, the scheme is
inherited from the base URI, as described in Section 5.2.
3.2. Authority Component
Many URI schemes include a top hierarchical element for a naming
authority, such that the namespace defined by the remainder of the
URI is governed by that authority. This authority component is
typically defined by an Internet-based server or a scheme-specific
registry of naming authorities.
authority = server | reg_name
The authority component is preceded by a double slash "//" and is
terminated by the next slash "/", question-mark "?", or by the end of
the URI. Within the authority component, the characters ";", ":",
"@", "?", and "/" are reserved.
Berners-Lee, et. al. Standards Track [Page 12]
RFC 2396 URI Generic Syntax August 1998
An authority component is not required for a URI scheme to make use
of relative references. A base URI without an authority component
implies that any relative reference will also be without an authority
component.
3.2.1. Registry-based Naming Authority
The structure of a registry-based naming authority is specific to the
URI scheme, but constrained to the allowed characters for an
authority component.
reg_name = 1*( unreserved | escaped | "$" | "," |
";" | ":" | "@" | "&" | "=" | "+" )
3.2.2. Server-based Naming Authority
URL schemes that involve the direct use of an IP-based protocol to a
specified server on the Internet use a common syntax for the server
component of the URI's scheme-specific data:
<userinfo>@<host>:<port>
where <userinfo> may consist of a user name and, optionally, scheme-
specific information about how to gain authorization to access the
server. The parts "<userinfo>@" and ":<port>" may be omitted.
server = [ [ userinfo "@" ] hostport ]
The user information, if present, is followed by a commercial at-sign
"@".
userinfo = *( unreserved | escaped |
";" | ":" | "&" | "=" | "+" | "$" | "," )
Some URL schemes use the format "user:password" in the userinfo
field. This practice is NOT RECOMMENDED, because the passing of
authentication information in clear text (such as URI) has proven to
be a security risk in almost every case where it has been used.
The host is a domain name of a network host, or its IPv4 address as a
set of four decimal digit groups separated by ".". Literal IPv6
addresses are not supported.
hostport = host [ ":" port ]
host = hostname | IPv4address
hostname = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel = alpha | alpha *( alphanum | "-" ) alphanum
Berners-Lee, et. al. Standards Track [Page 13]
RFC 2396 URI Generic Syntax August 1998
IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit
port = *digit
Hostnames take the form described in Section 3 of [RFC1034] and
Section 2.1 of [RFC1123]: a sequence of domain labels separated by
".", each domain label starting and ending with an alphanumeric
character and possibly also containing "-" characters. The rightmost
domain label of a fully qualified domain name will never start with a
digit, thus syntactically distinguishing domain names from IPv4
addresses, and may be followed by a single "." if it is necessary to
distinguish between the complete domain name and any local domain.
To actually be "Uniform" as a resource locator, a URL hostname should
be a fully qualified domain name. In practice, however, the host
component may be a local domain literal.
Note: A suitable representation for including a literal IPv6
address as the host part of a URL is desired, but has not yet been
determined or implemented in practice.
The port is the network port number for the server. Most schemes
designate protocols that have a default port number. Another port
number may optionally be supplied, in decimal, separated from the
host by a colon. If the port is omitted, the default port number is
assumed.
3.3. Path Component
The path component contains data, specific to the authority (or the
scheme if there is no authority component), identifying the resource
within the scope of that scheme and authority.
path = [ abs_path | opaque_part ]
path_segments = segment *( "/" segment )
segment = *pchar *( ";" param )
param = *pchar
pchar = unreserved | escaped |
":" | "@" | "&" | "=" | "+" | "$" | ","
The path may consist of a sequence of path segments separated by a
single slash "/" character. Within a path segment, the characters
"/", ";", "=", and "?" are reserved. Each path segment may include a
sequence of parameters, indicated by the semicolon ";" character.
The parameters are not significant to the parsing of relative
references.
Berners-Lee, et. al. Standards Track [Page 14]
RFC 2396 URI Generic Syntax August 1998
3.4. Query Component
The query component is a string of information to be interpreted by
the resource.
query = *uric
Within a query component, the characters ";", "/", "?", ":", "@",
"&", "=", "+", ",", and "$" are reserved.
4. URI References
The term "URI-reference" is used here to denote the common usage of a
resource identifier. A URI reference may be absolute or relative,
and may have additional information attached in the form of a
fragment identifier. However, "the URI" that results from such a
reference includes only the absolute URI after the fragment
identifier (if any) is removed and after any relative URI is resolved
to its absolute form. Although it is possible to limit the
discussion of URI syntax and semantics to that of the absolute
result, most usage of URI is within general URI references, and it is
impossible to obtain the URI from such a reference without also
parsing the fragment and resolving the relative form.
URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
The syntax for relative URI is a shortened form of that for absolute
URI, where some prefix of the URI is missing and certain path
components ("." and "..") have a special meaning when, and only when,
interpreting a relative path. The relative URI syntax is defined in
Section 5.
4.1. Fragment Identifier
When a URI reference is used to perform a retrieval action on the
identified resource, the optional fragment identifier, separated from
the URI by a crosshatch ("#") character, consists of additional
reference information to be interpreted by the user agent after the
retrieval action has been successfully completed. As such, it is not
part of a URI, but is often used in conjunction with a URI.
fragment = *uric
The semantics of a fragment identifier is a property of the data
resulting from a retrieval action, regardless of the type of URI used
in the reference. Therefore, the format and interpretation of
fragment identifiers is dependent on the media type [RFC2046] of the
retrieval result. The character restrictions described in Section 2
Berners-Lee, et. al. Standards Track [Page 15]
RFC 2396 URI Generic Syntax August 1998
for URI also apply to the fragment in a URI-reference. Individual
media types may define additional restrictions or structure within
the fragment for specifying different types of "partial views" that
can be identified within that media type.
A fragment identifier is only meaningful when a URI reference is
intended for retrieval and the result of that retrieval is a document
for which the identified fragment is consistently defined.
4.2. Same-document References
A URI reference that does not contain a URI is a reference to the
current document. In other words, an empty URI reference within a
document is interpreted as a reference to the start of that document,
and a reference containing only a fragment identifier is a reference
to the identified fragment of that document. Traversal of such a
reference should not result in an additional retrieval action.
However, if the URI reference occurs in a context that is always
intended to result in a new request, as in the case of HTML's FORM
element, then an empty URI reference represents the base URI of the
current document and should be replaced by that URI when transformed
into a request.
4.3. Parsing a URI Reference
A URI reference is typically parsed according to the four main
components and fragment identifier in order to determine what
components are present and whether the reference is relative or
absolute. The individual components are then parsed for their
subparts and, if not opaque, to verify their validity.
Although the BNF defines what is allowed in each component, it is
ambiguous in terms of differentiating between an authority component
and a path component that begins with two slash characters. The
greedy algorithm is used for disambiguation: the left-most matching
rule soaks up as much of the URI reference string as it is capable of
matching. In other words, the authority component wins.
Readers familiar with regular expressions should see Appendix B for a
concrete parsing example and test oracle.
5. Relative URI References
It is often the case that a group or "tree" of documents has been
constructed to serve a common purpose; the vast majority of URI in
these documents point to resources within the tree rather than
Berners-Lee, et. al. Standards Track [Page 16]
RFC 2396 URI Generic Syntax August 1998
outside of it. Similarly, documents located at a particular site are
much more likely to refer to other resources at that site than to
resources at remote sites.
Relative addressing of URI allows document trees to be partially
independent of their location and access scheme. For instance, it is
possible for a single set of hypertext documents to be simultaneously
accessible and traversable via each of the "file", "http", and "ftp"
schemes if the documents refer to each other using relative URI.
Furthermore, such document trees can be moved, as a whole, without
changing any of the relative references. Experience within the WWW
has demonstrated that the ability to perform relative referencing is
necessary for the long-term usability of embedded URI.
The syntax for relative URI takes advantage of the <hier_part> syntax
of <absoluteURI> (Section 3) in order to express a reference that is
relative to the namespace of another hierarchical URI.
relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ]
A relative reference beginning with two slash characters is termed a
network-path reference, as defined by <net_path> in Section 3. Such
references are rarely used.
A relative reference beginning with a single slash character is
termed an absolute-path reference, as defined by <abs_path> in
Section 3.
A relative reference that does not begin with a scheme name or a
slash character is termed a relative-path reference.
rel_path = rel_segment [ abs_path ]
rel_segment = 1*( unreserved | escaped |
";" | "@" | "&" | "=" | "+" | "$" | "," )
Within a relative-path reference, the complete path segments "." and
".." have special meanings: "the current hierarchy level" and "the
level above this hierarchy level", respectively. Although this is
very similar to their use within Unix-based filesystems to indicate
directory levels, these path components are only considered special
when resolving a relative-path reference to its absolute form
(Section 5.2).
Authors should be aware that a path segment which contains a colon
character cannot be used as the first segment of a relative URI path
(e.g., "this:that"), because it would be mistaken for a scheme name.
Berners-Lee, et. al. Standards Track [Page 17]
RFC 2396 URI Generic Syntax August 1998
It is therefore necessary to precede such segments with other
segments (e.g., "./this:that") in order for them to be referenced as
a relative path.
It is not necessary for all URI within a given scheme to be
restricted to the <hier_part> syntax, since the hierarchical
properties of that syntax are only necessary when relative URI are
used within a particular document. Documents can only make use of
relative URI when their base URI fits within the <hier_part> syntax.
It is assumed that any document which contains a relative reference
will also have a base URI that obeys the syntax. In other words,
relative URI cannot be used within a document that has an unsuitable
base URI.
Some URI schemes do not allow a hierarchical syntax matching the
<hier_part> syntax, and thus cannot use relative references.
5.1. Establishing a Base URI
The term "relative URI" implies that there exists some absolute "base
URI" against which the relative reference is applied. Indeed, the
base URI is necessary to define the semantics of any relative URI
reference; without it, a relative reference is meaningless. In order
for relative URI to be usable within a document, the base URI of that
document must be known to the parser.
The base URI of a document can be established in one of four ways,
listed below in order of precedence. The order of precedence can be
thought of in terms of layers, where the innermost defined base URI
has the highest precedence. This can be visualized graphically as:
.----------------------------------------------------------.
| .----------------------------------------------------. |
| | .----------------------------------------------. | |
| | | .----------------------------------------. | | |
| | | | .----------------------------------. | | | |
| | | | | <relative_reference> | | | | |
| | | | `----------------------------------' | | | |
| | | | (5.1.1) Base URI embedded in the | | | |
| | | | document's content | | | |
| | | `----------------------------------------' | | |
| | | (5.1.2) Base URI of the encapsulating entity | | |