FGDB Spec
This is a work-in-progress reverse-engineered specification of .gdbtable, .gdbtablx, .gdbindexes, .atx, .spx and .freelist files found in FileGDB datasets. It generally applies to FileGDB datasets v10, as well as earlier versions, unless otherwise specified.
- ubyte: unsigned byte
- int16: little-endian 16-bit integer
- int32: little-endian 32-bit integer
- float64: little-endian 64-bit IEEE754 floating point number
- utf16: string in little-endian UTF-16 encoding
- string: (UTF-8 ?) string
The terms "row" and "feature" are synonyms in this document.
.gdbtable files describe fields and contain row data.
They are made of a header, a section describing the fields, and a section describing the rows.
- 4 bytes: 0x03 0x00 0x00 0x00 - unknown role. Constant among the files. Kind of signature?
- int32: number of (valid) rows
- 4 bytes: varying values - unknown role (TBC: this value does have something to do with row size. A value larger than the size of the largest row seems to be ok)
- 4 bytes: 0x05 0x00 0x00 0x00 - unknown role. Constant among the files
- 4 bytes: varying values - unknown role. Seems to be 0x00 0x00 0x00 0x00 for FGDB 10 files, but not for earlier versions
- 4 bytes: 0x00 0x00 0x00 0x00 - unknown role. Constant among the files
- int32: file size in bytes
- 4 bytes: 0x00 0x00 0x00 0x00 - unknown role. Constant among the files
- int64: offset in bytes at which the field description section begins (often 40 in FGDB 10). Note: datasets with 5 significant bytes (i.e. beyond 4GB) have been found per https://trac.osgeo.org/gdal/ticket/6830.
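As an illustration, the header fields listed above can be unpacked with Python's `struct` module. This is only a sketch: the function name `parse_gdbtable_header` and the dictionary keys are hypothetical, the 40-byte layout is assumed from the list above, and the sample bytes are synthetic rather than taken from a real dataset.

```python
import struct

# Assumed layout (little-endian, 40 bytes): signature, valid row count,
# four 4-byte fields of unknown role, file size, 4 unknown bytes,
# and the int64 offset of the field description section.
HEADER_FORMAT = "<4si4s4s4s4si4sq"


def parse_gdbtable_header(buf):
    """Decode the first 40 bytes of a .gdbtable file (hypothetical helper)."""
    (signature, n_valid_rows, _unknown1, _unknown2, _unknown3, _unknown4,
     file_size, _unknown5, fields_offset) = struct.unpack_from(HEADER_FORMAT, buf, 0)
    return {
        "signature": signature,
        "n_valid_rows": n_valid_rows,
        "file_size": file_size,
        "fields_offset": fields_offset,
    }


# Synthetic header built for the example only.
sample = struct.pack(
    HEADER_FORMAT,
    b"\x03\x00\x00\x00", 12, b"\x40\x00\x00\x00", b"\x05\x00\x00\x00",
    bytes(4), bytes(4), 4096, bytes(4), 40,
)
header = parse_gdbtable_header(sample)
```

The fields of unknown role are read but deliberately ignored, since their meaning is not established.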
- int32: size of header in bytes (this field excluded)
- int32: version of the file. 3 for FGDB 9.X files and 4 for FGDB 10.X files. No other known values.
- uint32: layer flags, including geometry type:
- bits 0 - 7: (i.e. flag & 0xff) geometry type:
- 0 = none
- 1 = point
- 2 = multipoint
- 3 = (multi)polyline
- 4 = (multi)polygon
- 5 = rectangle (envelope)
- 6 = "path"
- 7 = mixed/any geometry type
- 9 = multipatch
- 11 = ring
- 13 = line
- 14 = circular arc
- 15 = bezier curves
- 16 = elliptic curves
- 17 = geometry collection (any types)
- 18 = triangle strip
- 19 = triangle fan
- 20 = ray
- 21 = sphere
- 22 = TIN
- bit 8: encoding, is set for all known versions of the database
- bit 9: (or bits 10 or 12) likely an indicator of whether the database uses "high precision storage" or not. Always 1 in all encountered files, and according to the ESRI docs, it hasn't been possible to make low precision gdbs since 9.2
- bit 10: possibly storage type, see bit 9
- bit 11: unknown
- bit 12: possibly storage type, see bit 9
- bit 30: geometry has M values
- bit 31: geometry has Z values
- int16: number of fields (including geometry field and implicit OBJECTID field)
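The layer flags described above can be split with plain bit masks. The sketch below only extracts the parts whose meaning is stated in this document (geometry type, M and Z presence); `decode_layer_flags` is a hypothetical name and the sample value is made up.

```python
def decode_layer_flags(flags):
    """Split the uint32 layer flags into the documented components."""
    return {
        "geometry_type": flags & 0xFF,       # bits 0 - 7
        "has_m": bool(flags & (1 << 30)),    # bit 30
        "has_z": bool(flags & (1 << 31)),    # bit 31
    }


# Made-up value: polygon layer (4) with Z values, bits 8 and 9 set.
decoded = decode_layer_flags((1 << 31) | (1 << 9) | (1 << 8) | 4)
```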
Following immediately: the description of the fields (repeated as many times as the number of fields)
- ubyte: number of UTF-16 characters (not bytes) of the name of the field
- utf16: name of the field
- ubyte: number of UTF-16 characters (not bytes) of the alias of the field. Might be 0
- utf16: alias of the field (omitted if previous field is 0)
- ubyte: field type ( 0 = int16, 1 = int32, 2 = float32, 3 = float64, 4 = string, 5 = datetime, 6 = objectid, 7 = geometry, 8 = binary, 9=raster, 10/11 = UUID, 12 = XML )
The next bytes for the field description depend on the field type.
For field type = 4 (string),
- int32: maximum length of string
- ubyte: flag
- varuint: ldf = length of the default value in bytes, present if (flag & 4) != 0, followed by ldf bytes with the default value
For field type = 6 (objectid),
- ubyte: unknown role = 4
- ubyte: unknown role = 2
For field type = 7 (geometry),
- ubyte: unknown role = 0
- ubyte: flag = 6 or 7. If lsb is 1, the field can be null.
- int16: length (in bytes) of the WKT string describing the SRS.
- string: WKT string describing the SRS, or {B286C06B-0879-11D2-AACA-00C04FA33C20} for no SRS (which corresponds to the COM CLSID for the ESRI UnknownCoordinateSystem class, see http://desktop.arcgis.com/en/arcobjects/latest/net/webframe.htm#UnknownCoordinateSystem.htm).
- ubyte: flags. Combination of values:
- (1<<0) seems to be systematically set (only bit for system table a00000004.gdbtable )
- (1<<2) indicates has_z = true
- (1<<1) indicates has_m = true (consistent with the flag value 7 listed for raster fields when has_z = has_m = true)
- float64: xorigin
- float64: yorigin
- float64: xyscale
- float64: morigin (present only if has_m = True)
- float64: mscale (present only if has_m = True)
- float64: zorigin (present only if has_z = True)
- float64: zscale (present only if has_z = True)
- float64: xytolerance
- float64: mtolerance (present only if has_m = True)
- float64: ztolerance (present only if has_z = True)
- float64: xmin of layer extent (might be NaN)
- float64: ymin of layer extent (might be NaN)
- float64: xmax of layer extent (might be NaN)
- float64: ymax of layer extent (might be NaN)
If geometry has z values (bit 31 of layer geometry type flags):
- float64: zmin of layer extent (might be NaN)
- float64: zmax of layer extent (might be NaN)
If geometry has m values (bit 30 of layer geometry type flags):
- float64: mmin of layer extent (might be NaN)
- float64: mmax of layer extent (might be NaN)
Then, values relating to the spatial index for the field:
- a byte always at 0 (possibly an indicator of existence of spatial index or its type?)
- a uint32 whose value is 1, 2 or 3, indicating the number of spatial grid sizes (see e.g. http://desktop.arcgis.com/en/arcmap/10.3/tools/data-management-toolbox/add-spatial-index.htm for more details about spatial grid sizes)
- for each grid size, float64: spatial index grid resolution at this level (referenced as grid_size[] in later section describing .spx files). ESRI software enforces grid_size[1] >= 3 * grid_size[0] and grid_size[2] >= 3 * grid_size[1]
For field type = 8 (binary),
- ubyte: unknown role
- ubyte: flag
For field type = 9 (raster),
- ubyte: unknown role
- ubyte: flag. If lsb is 1, the field can be null.
- ubyte: number of UTF-16 characters (not bytes) of the following string
- utf16: string whose value seems to be "Raster Column"
- int16: length (in bytes) of the WKT string describing the SRS.
- string: WKT string describing the SRS, or {B286C06B-0879-11D2-AACA-00C04FA33C20} for no SRS.
- ubyte: flags. Value is generally 1 (has_z = has_m = false, generally for system table a00000004.gdbtable), 5 (has_z = true, has_m = false) or 7 (has_z = has_m = true). If 0, none of the following float64 values is present: the next one is the ubyte of unknown role.
- float64: xorigin
- float64: yorigin
- float64: xyscale
- float64: morigin (present only if has_m = True)
- float64: mscale (present only if has_m = True)
- float64: zorigin (present only if has_z = True)
- float64: zscale (present only if has_z = True)
- float64: xytolerance
- float64: mtolerance (present only if has_m = True)
- float64: ztolerance (present only if has_z = True)
- ubyte: is_managed (1=if raster is managed within filegdb, 0=if raster is stored externally)
For field type = 10, 11 (UUID)
- ubyte: width : 38
- ubyte: flag
For field type = 12 (XML),
- ubyte: width : 0
- ubyte: flag
For other field types,
- ubyte: width in bytes (e.g. 2 for int16, 4 for int32, 4 for float32, 8 for float64, 8 for datetime)
- ubyte: flag
- ubyte: ldf = length of default value in byte if (flag&4) != 0 followed by ldf bytes
If the lsb of the flag field (when present) is set to 1, then the field can be null in records
The rows section does not necessarily immediately follow the last field description. It starts generally a few bytes after, but not in a predictable way. Note : for FGDB layers created by the ESRI FGDB SDK API, there are 4 bytes between the end of the field description section and the beginning of the rows section : 0xDE 0xAD 0xBE 0xEF (!)
The rows section is a sequence of X rows (where X is the total number of features found in the .gdbtablx, which might be different from the number of valid rows found in the header of the .gdbtable). Each row starts at an offset indicated in the .gdbtablx file
- int32: length in bytes of the row blob ( this field excluded)
- ceil(number_nullable_fields / 8) * ubyte: flags describing if a field is null. See below explanation
Each bit of the flags field encodes the presence or absence of the field content, for a nullable field, for the row. The flag is set to 1 if the field is missing/null, or 0 if the field is present/non-null (0 is used as well for spare bits). The flag for the first field, in the order of the fields of the field description section (typically the geometry), is the least significant bit of the first byte of the flags field.
There are no bits reserved for non-nullable fields.
If all fields are non-nullable, the flag field is absent.
Note: there's no explicit data for OBJECTID and no reserved flag bit for it.
For each non-null field, the field content is appended in the order of the fields of the field description section.
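The null-flag rules above can be sketched as follows. The helper names `flags_size` and `null_fields` are hypothetical; the input is assumed to be the list of "can be null" booleans in field description order, with OBJECTID treated as non-nullable (it has no bit and no data in the row).

```python
def flags_size(nullable):
    """Number of flag bytes: ceil(number_nullable_fields / 8)."""
    return (sum(1 for n in nullable if n) + 7) // 8


def null_fields(nullable, flag_bytes):
    """Return, for each field, True if the field is null in this row.

    Only nullable fields consume a bit; bits are taken from the least
    significant bit of the first byte upwards.
    """
    result = []
    bit = 0
    for is_nullable in nullable:
        if not is_nullable:
            result.append(False)
            continue
        result.append(bool(flag_bytes[bit // 8] & (1 << (bit % 8))))
        bit += 1
    return result


# Made-up table: OBJECTID (not nullable), SHAPE, NAME, a non-nullable
# field, and a nullable COMMENT field.
nullable = [False, True, True, False, True]
row_nulls = null_fields(nullable, b"\x05")  # bits 0 and 2 set
```

Here bit 0 maps to SHAPE, bit 1 to NAME and bit 2 to COMMENT, so SHAPE and COMMENT are reported as null.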
The geometry field is generally called "SHAPE".
Geometry blobs use 2 new encoding schemes :
- varuint (64 bit): a sequence of bytes [b0, b1, ... bN]. All bytes except the last one have their msb (most significant bit) set to 1. The presence of a msb = 0 marks the end of the sequence. The value of the varuint is (b0 & 0x7F) | ((b1 & 0x7F) << 7) | ((b2 & 0x7F) << 14) | ... | ((bN & 0x7F) << (7 * N)). Note that a valid sequence might be just 1 byte.
- varint (64 bit): same concept as varuint. But the 2nd most significant bit of b0 (i.e. the one obtained by masking with 0x40) indicates the sign of the result, and should be ignored in the computation of the unsigned value: (b0 & 0x3F) | ((b1 & 0x7F) << 6) | ((b2 & 0x7F) << 13) | ... | ((bN & 0x7F) << (7 * N - 1)). If the sign bit is set to 1, the value must be negated.
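A minimal decoding sketch for the two encodings described above. The function names `read_varuint` and `read_varint` are hypothetical; each takes a byte buffer and a start position and returns the decoded value with the position of the next unread byte.

```python
def read_varuint(buf, pos):
    """Decode a varuint starting at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:          # msb = 0 marks the last byte
            return value, pos
        shift += 7


def read_varint(buf, pos):
    """Decode a varint starting at buf[pos]; return (value, next_pos)."""
    b = buf[pos]
    pos += 1
    negative = bool(b & 0x40)     # sign bit of the first byte
    value = b & 0x3F              # only 6 payload bits in b0
    shift = 6
    while b & 0x80:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        shift += 7
    return (-value if negative else value), pos
```

For instance, the two bytes 0xAC 0x02 decode as the varuint 300, and the single byte 0x41 decodes as the varint -1.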
Common preamble to all geometry types
- varuint: length of the geometry blob in bytes (this field excluded)
- varuint: geometry_type. 1 = 2D point, 3 = 2D (multi)linestring, 5 = 2D (multi)polygon. Other values possible. See SHPT_ enumeration of ogrpgeogeometry.h. This is generally a single byte, but for SHPT_GENERALxxxxx geometries this can be multi-byte due to flags added to the base type
The bytes of the geometry blob following this preamble depend of course on the geometry type.
- For point geometries (geometry type = 1, 9, 21, 11)
  - varuint: x = (varuint - 1) / xyscale + xorigin
  - varuint: y = (varuint - 1) / xyscale + yorigin
  - varuint (present only if Z component): z = (varuint - 1) / zscale + zorigin
  - varuint (present only if M component): m = (varuint - 1) / mscale + morigin

  Note the (varuint - 1), instead of varint in following geometry types. The reason for that exception is unclear.
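The point formulas can be illustrated with a short sketch for the 2D case. `decode_point_xy` is a hypothetical helper; it assumes the buffer position is already past the common preamble, and the origin and scale values in the example are made up.

```python
def read_varuint(buf, pos):
    """Decode a varuint starting at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, pos
        shift += 7


def decode_point_xy(buf, pos, xorigin, yorigin, xyscale):
    """Decode the x and y of a point body; return ((x, y), next_pos)."""
    vx, pos = read_varuint(buf, pos)
    vy, pos = read_varuint(buf, pos)
    # Note the "- 1", which is specific to point geometries.
    x = (vx - 1) / xyscale + xorigin
    y = (vy - 1) / xyscale + yorigin
    return (x, y), pos


# Made-up body: x varuint = 151 (0x97 0x01), y varuint = 1.
point, end = decode_point_xy(b"\x97\x01\x01", 0, 0.0, 0.0, 100.0)
```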
- For multipoint geometries (geometry type = 8, 20, 28, 18)
  - varuint: number of points
  - varuint: xmin = varuint / xyscale + xorigin
  - varuint: ymin = varuint / xyscale + yorigin
  - varuint: xmax = varuint / xyscale + xmin
  - varuint: ymax = varuint / xyscale + ymin

  followed by points coordinates:

  For each point (dx = dy = 0 initially):
  - varint: dx = dx + varint; x[i] = dx / xyscale + xorigin
  - varint: dy = dy + varint; y[i] = dy / xyscale + yorigin

  If there is a Z component, an array of Z values follows:

  For each point (dz = 0 initially):
  - varint: dz = dz + varint; z[i] = dz / zscale + zorigin
- For (multi)linestring (geometry type = 3, 10, 23, 13) or (multi)polygon (geometry type = 5, 19, 25, 15)
  - varuint: total number of points of all following parts
  - varuint: number of parts, i.e. number of rings for a (multi)polygon (inner and outer rings being at the same level), number of linestrings of a multilinestring, or 1 for a linestring
  - varuint: xmin = varuint / xyscale + xorigin
  - varuint: ymin = varuint / xyscale + yorigin
  - varuint: xmax = varuint / xyscale + xmin
  - varuint: ymax = varuint / xyscale + ymin
  - varuint: number of points of first part (omitted if there is only one part)
  - ...: ...
  - varuint: number of points of (number of parts - 1)th part (the number of points of the last part can be computed by subtracting the sum of the above numbers from the total number of points)

  followed by, for each part, points coordinates:

  For each point of all parts (dx = dy = 0 initially):
  - varint: dx = dx + varint; x[i] = dx / xyscale + xorigin
  - varint: dy = dy + varint; y[i] = dy / xyscale + yorigin

  If there is a Z component, an array of Z values follows:

  For each point of all parts (dz = 0 initially):
  - varint: dz = dz + varint; z[i] = dz / zscale + zorigin
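The following sketch decodes the 2D part of a (multi)linestring or (multi)polygon body as described above. `decode_parts_xy` is a hypothetical name; it assumes the position is just after the common preamble, ignores Z and M arrays, and is exercised on a hand-made blob rather than on real data.

```python
def read_varuint(buf, pos):
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, pos
        shift += 7


def read_varint(buf, pos):
    b = buf[pos]
    pos += 1
    negative, value, shift = bool(b & 0x40), b & 0x3F, 6
    while b & 0x80:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        shift += 7
    return (-value if negative else value), pos


def decode_parts_xy(buf, pos, xorigin, yorigin, xyscale):
    """Return (list of parts, next_pos); each part is a list of (x, y)."""
    n_points, pos = read_varuint(buf, pos)
    n_parts, pos = read_varuint(buf, pos)
    for _ in range(4):                      # skip xmin, ymin, xmax, ymax
        _, pos = read_varuint(buf, pos)
    counts = []
    for _ in range(n_parts - 1):            # last count is implicit
        count, pos = read_varuint(buf, pos)
        counts.append(count)
    counts.append(n_points - sum(counts))
    dx = dy = 0                             # deltas accumulate across parts
    parts = []
    for count in counts:
        part = []
        for _ in range(count):
            delta, pos = read_varint(buf, pos)
            dx += delta
            delta, pos = read_varint(buf, pos)
            dy += delta
            part.append((dx / xyscale + xorigin, dy / xyscale + yorigin))
        parts.append(part)
    return parts, pos


# Hand-made blob: 3 points, 2 parts, zeroed bbox, first part has 2 points,
# deltas (10, 20), (+5, -5), (+1, +1).
blob = bytes([3, 2, 0, 0, 0, 0, 2, 0x0A, 0x14, 0x05, 0x45, 0x01, 0x01])
parts, end = decode_parts_xy(blob, 0, 0.0, 0.0, 1.0)
```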
For polygons, if a ring is clockwise then it is an outer ring, and if it is counterclockwise it is an inner ring. While it is not documented anywhere, ESRI programs make the assumption that inner rings always follow the outer ring that contains them. So
[clockwise,counterclockwise,clockwise,clockwise,counterclockwise,counterclockwise]
can be represented in GeoJSON as
[[clockwise,counterclockwise],[clockwise],[clockwise,counterclockwise,counterclockwise]]
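The grouping rule above can be sketched as below. Both helper names are hypothetical. Orientation is computed with the shoelace sum, under the assumption that the Y axis points up and that rings are closed (first point repeated at the end).

```python
def is_clockwise(ring):
    """True if a closed ring of (x, y) tuples is clockwise (Y axis up)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += (x2 - x1) * (y2 + y1)
    return total > 0


def group_rings(rings):
    """Group a flat list of rings into polygons.

    A clockwise ring starts a new polygon; a counterclockwise ring is
    attached as a hole to the most recent outer ring.
    """
    polygons = []
    for ring in rings:
        if is_clockwise(ring) or not polygons:
            polygons.append([ring])
        else:
            polygons[-1].append(ring)
    return polygons


outer = [(0, 0), (0, 10), (10, 10), (10, 0), (0, 0)]   # clockwise
hole = [(2, 2), (4, 2), (4, 4), (2, 4), (2, 2)]        # counterclockwise
grouped = group_rings([outer, hole, outer, outer, hole, hole])
```

The input order mirrors the example sequence given in the text, so `grouped` holds three polygons with 2, 1 and 3 rings respectively.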
TODO: M values. Likely like Z component. But in FileGDB_API/samples/data/Shapes.gdb/a00000028.gdbtable, which is a polylinezm, the m values all are NaN, which is represented as 0x42 0x00 0x00 0x00 0x00 at the end of the geometry blob.
- For GeneralPolyline ( (geometry type & 0xff) = 50 )
  - varuint: total number of points of all following parts
  - varuint: number of parts, i.e. number of linestrings of a multilinestring, or 1 for a linestring
  - varuint: number of curve descriptions (present if (geom_type & 0x20000000) != 0 )
  - varuint: xmin = varuint / xyscale + xorigin
  - varuint: ymin = varuint / xyscale + yorigin
  - varuint: xmax = varuint / xyscale + xmin
  - varuint: ymax = varuint / xyscale + ymin
  - varuint: number of points of first part (omitted if there is only one part)
  - ...: ...
  - varuint: number of points of (number of parts - 1)th part (the number of points of the last part can be computed by subtracting the sum of the above numbers from the total number of points)

  followed by, for each part, points coordinates:

  For each point of all parts (dx = dy = 0 initially):
  - varint: dx = dx + varint; x[i] = dx / xyscale + xorigin
  - varint: dy = dy + varint; y[i] = dy / xyscale + yorigin

  If there is a Z component ( (geom_type & 0x80000000) != 0 ), an array of Z values follows:

  For each point of all parts (dz = 0 initially):
  - varint: dz = dz + varint; z[i] = dz / zscale + zorigin

  If there is a M component ( (geom_type & 0x40000000) != 0 ), an array of M values follows (unless the next byte is 0x42, in which case the M array is skipped):

  For each point of all parts (dm = 0 initially):
  - varint: dm = dm + varint; m[i] = dm / mscale + morigin

  If there are curves ( (geom_type & 0x20000000) != 0 ), an array of segment modifiers follows. There are as many segment modifiers as indicated by the above "number of curve descriptions" field. The serialization of these curve descriptions is directly based on the esriSegmentModifier, WKSPoint, SegmentArc, SegmentBezierCurve and SegmentEllipticArc C structures described in extended_shape_buffer_format.pdf, with the following equivalences:
  - C long --> int32
  - C enum --> int32
  - C double --> float64
- For GeneralMultiPatch ( (geometry type & 0xff) = 54 )
  - varuint: total number of points of all following parts
  - varuint: unknown role
  - varuint: number of parts, i.e. number of rings for a (multi)polygon (inner and outer rings being at the same level), number of linestrings of a multilinestring, or 1 for a linestring
  - varuint: xmin = varuint / xyscale + xorigin
  - varuint: ymin = varuint / xyscale + yorigin
  - varuint: xmax = varuint / xyscale + xmin
  - varuint: ymax = varuint / xyscale + ymin
  - varuint: number of points of first part (omitted if there is only one part)
  - ...: ...
  - varuint: number of points of (number of parts - 1)th part (the number of points of the last part can be computed by subtracting the sum of the above numbers from the total number of points)

  followed by, for each part, part type:
  - varuint: part type. Only keep the 4 least significant bits (higher bits are for priority, material index. See extended-shapefile-format.pdf). 0 = triangle strip, 1 = triangle fan, 2 = outer ring, 3 = inner ring, 4 = first ring, 5 = ring, 6 = triangles

  followed by, for each part, points coordinates:

  For each point of all parts (dx = dy = 0 initially):
  - varint: dx = dx + varint; x[i] = dx / xyscale + xorigin
  - varint: dy = dy + varint; y[i] = dy / xyscale + yorigin

  If there is a Z component ( (geom_type & 0x80000000) != 0 ), an array of Z values follows:

  For each point of all parts (dz = 0 initially):
  - varint: dz = dz + varint; z[i] = dz / zscale + zorigin
For binary fields (type 8): number of bytes of the content as a varuint, followed by the binary content.
If raster field definition has is_managed = 1:
- uint32: raster ID (points to auxiliary tables)
If raster field definition has is_managed = 0:
- varuint: number of bytes (not characters!) of next string
- utf16: path to the raster
For string fields (type 4) and XML fields (type 12): number of bytes of the string as a varuint, followed by the string content.
For UUID fields (type 10, 11): 16 bytes.
The string representation is the following (printf like expression) :
"{%02X%02X%02X%02X-%02X%02X-%02X%02X-%02X%02X-%02X%02X%02X%02X%02X%02X}", b[3], b[2], b[1], b[0], b[5], b[4], b[7], b[6], b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]
(This is the standard way winapi handles CLSID to string conversions through CLSIDFromString16. See e.g. wine implementation at https://github.com/wine-mirror/wine/blob/6d801377055911d914226a3c6af8d8637a63fa13/dlls/compobj.dll16/compobj.c#L380 )
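The byte ordering of the printf-like expression above can be reproduced as follows. `format_uuid` is a hypothetical helper. As a cross-check, the standard library's `uuid.UUID(bytes_le=...)` uses the same little-endian layout for the first three groups.

```python
import uuid

# Byte order taken from the printf-like expression in the text.
UUID_BYTE_ORDER = (3, 2, 1, 0, 5, 4, 7, 6, 8, 9, 10, 11, 12, 13, 14, 15)


def format_uuid(raw):
    """Format the 16 bytes of a UUID field as a braced, upper-case string."""
    digits = "".join("%02X" % raw[i] for i in UUID_BYTE_ORDER)
    return "{%s-%s-%s-%s-%s}" % (
        digits[0:8], digits[8:12], digits[12:16], digits[16:20], digits[20:32])


raw = bytes(range(16))
text = format_uuid(raw)
same_via_stdlib = "{" + str(uuid.UUID(bytes_le=raw)).upper() + "}"
```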
For other field types: an int16 value for an int16 field, an int32 value for an int32 field, etc.
Note: datetime values are the number of days since 30 December 1899 00:00:00, encoded as float64
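A short sketch of the datetime conversion described above. `decode_datetime` is a hypothetical name, and the value is treated as a naive timestamp since this document does not describe time zone handling.

```python
import struct
from datetime import datetime, timedelta

# Reference date stated in the text: 30 December 1899 00:00:00.
FGDB_EPOCH = datetime(1899, 12, 30)


def decode_datetime(raw):
    """Convert the 8 bytes of a datetime field to a naive datetime."""
    (days,) = struct.unpack("<d", raw)
    return FGDB_EPOCH + timedelta(days=days)


noon = decode_datetime(struct.pack("<d", 25569.5))
```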
.gdbtablx files contain the offset of the rows of the associated .gdbtable file.
- 4 bytes: 0x03 0x00 0x00 0x00 - unknown role. Constant among the files. Kind of signature?
- int32: n1024BlocksPresent = number of blocks of offsets for 1024 features that are effectively present in that file (i.e. sparse blocks are not counted in that number).
- int32: number_of_rows: number of rows, including deleted rows
- int32: size_offset = number of bytes to encode each feature offset. Must be 4 (.gdbtable up to 4GB), 5 (.gdbtable up to 1TB) or 6 (.gdbtable up to 256TB)
The section starts immediately after the header (at offset 16) and is made of size_offset x number_of_rows bytes. For each row,
- int32, int40 or int48: (depending on size_offset value) offset of the beginning of the row in the .gdbtable file, or 0 if the row is deleted. int40 is made of an int32 with the 32 least significant bits followed by a 5th byte with the 8 most significant bits. Similar for int48
If there is a bit array (bitmap) to represent the presence/absence of blocks of offsets for 1024 features, then the correct row iCorrectedRow in the index for the FID iRow+1 is given by :
    GUInt32 nCountBlocksBefore = 0;
    int iBlock = iRow / 1024;

    // Check if the block is not empty
    if( (pabyTablXBlockMap[iBlock / 8] & (1 << (iBlock % 8))) == 0 )
    {
        nCurRow = -1;
        return FALSE;
    }

    // Count the blocks that are present before iBlock
    for( int i = 0; i < iBlock; i++ )
        nCountBlocksBefore += ( pabyTablXBlockMap[i / 8] & (1 << (i % 8)) ) != 0;

    int iCorrectedRow = nCountBlocksBefore * 1024 + (iRow % 1024);
The trailing section is located at offset 16 + size_offset * n1024BlocksPresent * 1024
- int32: nBitmapInt32Words = number of int32 words for the bitmap (rounded to the next multiple of 32)
- int32: n1024BlocksTotal = (number_of_rows + 1023) / 1024. In the case where there's a bitmap, this is also nBitsForBlockMap = number of bits in the block map.
- int32: n1024BlocksPresentBis (must be == n1024BlocksPresent of the header)
- int32: nUsefulBitmapIn32Words = number of int32 words in the bitmap where there's at least a non-zero bit. Said otherwise, all following words until the end of the bitmap are 0. Doesn't seem to be used by proprietary implementations.
if nBitmapInt32Words == 0 (no bitmap), then n1024BlocksTotal == n1024BlocksPresentBis ( == n1024BlocksPresent) and nUsefulBitmapIn32Words = 0
Otherwise, following those 16 trailer bytes, there is a bit array of at least (n1024BlocksTotal + 7) / 8 bytes (in practice its size is rounded to the next multiple of 32 int32 words). Each bit in the array represents the presence of a block of offsets for 1024 features (bit = 1), or its absence (bit = 0). The total number of bits set to 1 must be equal to n1024Blocks
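Putting the header, offsets, trailer and block map together, a lookup of the row offset for a given FID might look like the sketch below. `read_row_offset` and `build_sample` are hypothetical helpers. The sketch assumes the trailer position formula given above and that the whole file is in memory. The two sample files are synthetic and only cover the dense and sparse cases.

```python
import struct


def read_row_offset(tablx, fid):
    """Offset of row `fid` (1-based) in the .gdbtable, or 0 if deleted/absent."""
    _sig, n_blocks_present, n_rows, size_offset = struct.unpack_from("<iiii", tablx, 0)
    i_row = fid - 1
    if not 0 <= i_row < n_rows:
        return 0
    trailer = 16 + size_offset * n_blocks_present * 1024
    (n_bitmap_words,) = struct.unpack_from("<i", tablx, trailer)
    if n_bitmap_words != 0:
        bitmap = tablx[trailer + 16:]
        i_block = i_row // 1024
        if not bitmap[i_block // 8] & (1 << (i_block % 8)):
            return 0                       # block of offsets not present
        blocks_before = sum(
            1 for i in range(i_block) if bitmap[i // 8] & (1 << (i % 8)))
        i_row = blocks_before * 1024 + i_row % 1024
    pos = 16 + size_offset * i_row
    return int.from_bytes(tablx[pos:pos + size_offset], "little")


def build_sample(n_rows, present_blocks, offsets, size_offset=5):
    """Build a synthetic .gdbtablx image; `offsets` maps FID -> offset."""
    n_total = (n_rows + 1023) // 1024
    sparse = len(present_blocks) != n_total
    body = bytearray(size_offset * 1024 * len(present_blocks))
    for fid, offset in offsets.items():
        block = (fid - 1) // 1024
        slot = present_blocks.index(block) * 1024 + (fid - 1) % 1024
        body[slot * size_offset:(slot + 1) * size_offset] = offset.to_bytes(
            size_offset, "little")
    bitmap = bytearray(128 if sparse else 0)   # 32 int32 words when present
    if sparse:
        for block in present_blocks:
            bitmap[block // 8] |= 1 << (block % 8)
    header = struct.pack("<iiii", 3, len(present_blocks), n_rows, size_offset)
    trailer = struct.pack("<iiii", 32 if sparse else 0, n_total,
                          len(present_blocks), 1 if sparse else 0)
    return bytes(header) + bytes(body) + bytes(trailer) + bytes(bitmap)


dense = build_sample(3, [0], {1: 40, 3: 2**32 + 7})
sparse = build_sample(2049, [0, 2], {1: 40, 2049: 99})
```

In the sparse sample, block 1 is absent, so FID 2049 (block 2) is stored as the first entry of the second block actually present in the file.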
.gdbindexes files list the indexes that may exist on certain fields of a .gdbtable. This only applies to FileGDB v10 .gdbindexes: v9 .gdbindexes have a different (and more complicated) structure.
- int32: number of indexes described in the file
The section starts immediately after the header (at offset 4) and is repeated as many times as there are indexes.
- uint32: number of UTF-16 characters for the following field
- utf16: suffix of the index file. If its value is foo, the filename of the index is aXXXXXXXX.foo.atx (unless the index is FDO_OBJECTID in which case the index is the .gdbtablx file, or FDO_SHAPE in which case the index is the .spx file)
- int16: unknown role
- int16: unknown role
- int32: unknown role
- int16: unknown role
- int32: unknown role
- uint32: number of UTF-16 characters for the following field
- utf16: field name (or sometimes expression like "LOWER(Name)" as found in a00000001.gdbindexes)
- int16: unknown role
.atx files contain indexes for a field of a .gdbtable. The general idea is that the values that the field takes in the .gdbtable are listed in ascending order with the associated FID. .atx files are organized in pages of 4096 bytes and have a hierarchical organization whose depth depends on the size of the values of the field and the number of features of the table. The first page is 1, so page N is located at offset (N-1)*4096.
The reading of .atx files must start with its trailing section.
- byte: size in bytes of the values indexed (called size_value afterwards). This has a close relationship with the field type of the field being indexed. So for int16, it is equal to 2. For int32: 4. For float32: 4. For float64: 8. For string: variable number that is a multiple of 2 (string values are encoded as UTF16 characters, so 2 bytes per character) and at maximum 160 bytes (80 characters). For datetime: 8. For UUID: 38 (the string representation is 38 bytes. See above). Indexing of binary or XML fields has not been studied (if it is possible!)
- byte: unknown role
- int32: unknown role. Apparently always/often 1.
- uint32: index depth >= 1. If it is 1 the first page directly references features. Otherwise the first page references pages that reference features (depth = 2), or pages that reference pages that reference features (depth = 3), and so on...
- uint32: number of features referenced in the file. Otherwise said, the number of features that have a non-null value for the field being indexed. Must not be greater than the number of valid features of the .gdbtable. It has been observed that (with FileGDB SDK 1.3) this value is not reliable for an index that has been built while features are inserted, if the values inserted are not in increasing order.
- int32: unknown role. Apparently always/often 0.
- int32: unknown role. Apparently always/often 1.
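The trailing section listed above adds up to 22 bytes (1 + 1 + 4 + 4 + 4 + 4 + 4). A minimal sketch for reading it, assuming it occupies the last 22 bytes of the file; `parse_atx_trailer` is a hypothetical name and the sample file is synthetic.

```python
import struct

# size_value, unknown byte, unknown int32, depth, feature count,
# and two unknown int32 values.
ATX_TRAILER_FORMAT = "<BBiIIii"


def parse_atx_trailer(data):
    """Decode the trailing section from the full content of an .atx file."""
    (size_value, _unknown1, _unknown2, depth, n_features,
     _unknown3, _unknown4) = struct.unpack(ATX_TRAILER_FORMAT, data[-22:])
    return {"size_value": size_value, "depth": depth, "n_features": n_features}


# Synthetic file: one empty 4096-byte page followed by a trailer
# describing an int32 index of depth 1 with 2 features.
sample = bytes(4096) + struct.pack(ATX_TRAILER_FORMAT, 4, 0, 1, 1, 2, 0, 1)
trailer = parse_atx_trailer(sample)
```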
The maximum number of features (or sub-pages references) in a page is : nMaxPerPages = (4096 - 12) / (4 + size_value)
The offset at which field values are found in a page is : nOffsetFirstValInPage = 12 + nMaxPerPages * 4
For a given field value, if found in several features, the features are sorted by ascending ID. The structure of a page referencing features is: header section (12 bytes), followed by FID numbers (maximum of 4 * nMaxPerPages bytes), a few potential padding bytes, and finally field values (maximum of size_value * nMaxPerPages bytes)
Header section structure (offset 0 in the page) :
- uint32: ID of the next page at the same depth, or 0 for last page. Not strictly needed to use the index (under the assumption that if index_depth == 1, there is a single feature page, and for higher index depth, all feature-referencing pages are referenced from page-referencing pages. Such assumption seems to match with how indices are generated, and is a good practice for efficient hierarchical indexing)
- uint32: number of features referenced in the page (nFeatures). Not greater than nMaxPerPages
- uint32: unknown role. Apparently always/often 0.
FID section structure (offset 12 in the page) :
- uint32: FID of the first feature referenced in the page
- ...
- uint32: FID of the (nFeatures)th feature referenced in the page.
Padding section of zeroes (size: nOffsetFirstValInPage - 12 - 4 * nFeatures)
Values section structure (offset nOffsetFirstValInPage in the page):
- type depending on the field (int16/int32/float32/float64/datetime as float64/string as UTF16 characters/UUID): value of field for the first feature referenced in the page
- ...
- type: value of field for the (nFeatures)th feature referenced in the page.
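The layout of a page referencing features can be sketched as below. `page_layout` and `parse_feature_page` are hypothetical helpers. The value type is passed as a `struct` format, which only suits fixed-size numeric fields, and the sample page is synthetic.

```python
import struct

PAGE_SIZE = 4096


def page_layout(size_value):
    """Return (nMaxPerPages, nOffsetFirstValInPage) for a value size."""
    n_max = (PAGE_SIZE - 12) // (4 + size_value)
    return n_max, 12 + n_max * 4


def parse_feature_page(page, size_value, value_format):
    """Return (next_page_id, [(value, fid), ...]) for a feature page."""
    _n_max, first_val = page_layout(size_value)
    next_page, n_features, _unknown = struct.unpack_from("<III", page, 0)
    fids = struct.unpack_from("<%dI" % n_features, page, 12)
    values = [
        struct.unpack_from(value_format, page, first_val + i * size_value)[0]
        for i in range(n_features)
    ]
    return next_page, list(zip(values, fids))


# Synthetic int32 page with two entries: value 10 -> FID 7, value 20 -> FID 3.
page = bytearray(PAGE_SIZE)
struct.pack_into("<III", page, 0, 0, 2, 0)
struct.pack_into("<2I", page, 12, 7, 3)
struct.pack_into("<2i", page, page_layout(4)[1], 10, 20)
next_page, entries = parse_feature_page(bytes(page), 4, "<i")
```

For int32 values this gives nMaxPerPages = 510 and nOffsetFirstValInPage = 2052.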
The structure of a page referencing sub-pages is: header section (8 bytes), followed by sub-pages numbers (maximum of 4 * (1 + nMaxPerPages) bytes), a few potential padding bytes, and finally field values (maximum of size_value * nMaxPerPages bytes)
Header section structure (offset 0 in the page) :
- uint32: ID of the next page at the same depth, or 0 for last page. Not strictly needed to use the index (under the assumption that such a page is always referenced from a page upper in the hierarchy if there are several at that depth. Such assumption seems to match with how indices are generated, and is a good practice for efficient hierarchical indexing)
- uint32: number of sub-pages referenced in the page (nSubPages). Not greater than nMaxPerPages
Sub-pages number section (offset 8 in the page):
- uint32: ID of the first sub-page referenced in the page
- ...
- uint32: ID of the (nSubPages)th sub-page referenced in the page.
- uint32: ID of the (nSubPages+1)th sub-page referenced in the page (note: there is no matching value for that last sub-page number in the values section)
Padding section of zeroes (size: nOffsetFirstValInPage - 8 - 4 * (nSubPages+1))
Values section structure (offset nOffsetFirstValInPage in the page):
- type depending on the field (int16/int32/float32/float64/datetime as float64/string as UTF16 characters/UUID): maximum value of field taken in the features referenced by the sub-page (and its potential sub-sub-pages) for the first sub-page referenced in the page
- ...
- type: maximum value of field taken in the features referenced by the sub-page (and its potential sub-sub-pages) for the (nSubPages)th sub-page referenced in the page
.spx files contain the spatial index for the geometry field of a .gdbtable. They have exactly the same structure as .atx files: same trailing section of 22 bytes, same principle of pages of 4096 bytes, with either pages referencing other pages (depth > 0) or pages referencing features (depth = 0). The payload being indexed is a 64-bit integer number (size_value = 8).
It is built from (x,y) georeferenced coordinates and a grid number (grid_no): point(x,y,grid_no) = (grid_no << 62) | (scaled_x << 31) | scaled_y
where grid_no = 0, 1, 2 (grid_no must be strictly lower than len(grid_size), where grid_size[] is the array giving the spatial grid resolution) and
- scaled_x = int(floor((x / grid_size[grid_no] + (2^29)) / (grid_size[grid_no] / grid_size[0])))
- scaled_y = int(floor((y / grid_size[grid_no] + (2^29)) / (grid_size[grid_no] / grid_size[0])))
Note: for the purpose of building this number, it is convenient to consider it as an unsigned quantity, especially when grid_no = 2, which sets the most significant bit. For sorting purposes in the .spx file, however, it has been found that this number is considered as a signed quantity.
In regular layers of sample files studied, it has been found that len(grid_size) == 1. It appears however that for FileGDB v10, the a00000004 system table can have up to 3 grid sizes.
The principle of spatial indexing consists in "rasterizing" the geometries on the spatial index grid(s) and indexing the 64-bit quantities corresponding to those rasterized points. Consequently for a non-punctual geometry, its FID may appear several times in the file. For a given 64-bit quantity, features appear in increasing FID in the .spx file.
On the read side, when interested in geometries that intersect the (minx, miny, maxx, maxy) envelope, one must search the index for indexed values in [point(x,miny,grid_no), point(x,maxy,grid_no)] for x in [minx, maxx] and grid_no in [0, len(grid_size) - 1].
One can see that if grid_size[] values are not carefully chosen, the size of the .spx file may be huge. A polygon with a large extent can correspond to a big number of indexed values. It is difficult to completely assert the strategy used for indexing when len(grid_size) > 1, but presumably, from an example of an a00000004 system table, it would appear that features that would cause too many values to be generated at grid_no = 0 are rather indexed with grid_no = 1 or 2. On the read side, our assumption is that one should search indexed values for grid_no = 0 ... len(grid_size) - 1, and not only at grid_no = 0 even if there are matches at that resolution.
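The construction of the indexed 64-bit quantity can be sketched as follows. This is an interpretation, not a verified implementation: the scaling is read as floor((coord / grid_size[grid_no] + 2^29) / (grid_size[grid_no] / grid_size[0])), the names `spx_value` and `as_signed64` are hypothetical, and the grid size and coordinates are made up.

```python
import math


def spx_value(x, y, grid_no, grid_size):
    """Build the unsigned 64-bit quantity point(x, y, grid_no)."""
    ratio = grid_size[grid_no] / grid_size[0]
    scaled_x = int(math.floor((x / grid_size[grid_no] + 2 ** 29) / ratio))
    scaled_y = int(math.floor((y / grid_size[grid_no] + 2 ** 29) / ratio))
    return (grid_no << 62) | (scaled_x << 31) | scaled_y


def as_signed64(value):
    """Reinterpret an unsigned 64-bit quantity as signed, as used for sorting."""
    return value - (1 << 64) if value >= (1 << 63) else value


# Made-up single-level grid of resolution 100.
value = spx_value(250.0, -50.0, 0, [100.0])
```

With a single grid size the ratio is 1, so the scaled coordinates are simply the grid cell indices shifted by 2^29.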
.freelist files contain the offset to the holes (rows deleted, or old updates) in the associated .gdbtable file. The file is rewritten after each edit session, with the most recent edit at the start of the file, and order being maintained during repeated edit operations. The file is optional, and will be deleted when the FGDB is compacted.
The file has 344 bytes of buffer at the end, and looks to be created in 4K blocks ( so, smallest is 4096 + 344 = 4440 bytes )
- int32: number of rows
- 4 bytes: 0xFFFFFFFF. No apparent use
The section starts immediately after the header and is made of (4 + size_offset) x number_rows bytes. For each row,
- int32: number of bytes
- int32, int40 or int48: (depending on size_offset value) offset of the beginning of the row in the .gdbtable file. int40 is made of an int32 with the 32 least significant bits followed by a 5th byte with the 8 most significant bits. Similar for int48
Files are named in the format a[number in lowercase hex].[extension], with files with the same base but different extensions being related. Files are numbered incrementally (a00000001 is first, a00000002 is second), but numbers may be skipped.
For FileGDB v10, the first 8 files (a00000001 to a00000008) seem to be reserved for database information and subsequent files are feature classes (a00000009, a0000000a, ...).
-
a00000001
is calledGDB_SystemCatalog
and contains a list of tables (including itself, other reserved tables and user tables). Tables may be mentionned but not actually found on the disk : this is often (only ?) the case of tablea00000008
. The FID of a record in this table determines the name of the file to consider. For example the record of FID 37 (the convention taken here for FID numbering is starting from 1) will be in filea00000025
. There might be deleted rows in this catalog table, so gaps in FID numbering.The table contains a
Name
field and aFileFormat
field. The value ofFileFormat
seems to be 0 in most cases, and sometimes 2 for a few reserved system tables. -
a00000002
contains config parameters for the database and is calledGDB_DBTune
- `a00000003` is called `GDB_SpatialRefs` and contains the SRS as WKT in field `SRTEXT` (in ESRI WKT dialect) and the following fields: `FalseX`, `FalseY`, `XYUnits`, `FalseZ`, `ZUnits`, `FalseM`, `MUnits`, `XYTolerance`, `ZTolerance`, `MTolerance`. All rows are unique, so if there are 3 feature classes that all share the same spatial reference system but one of them has a different `ZTolerance`, there will be two rows.
- `a00000004` is called `GDB_Items` and contains metadata about the items (layers), mostly in XML. The fields are:
    - `UUID` (UUID): UUID
    - `Type` (UUID): item type
    - `Name` (string): item/layer name. Matches the `Name` field of `GDB_SystemCatalog`
    - `PhysicalName` (string): item/layer name in upper case characters
    - `Path` (string): "\mylayername" for top-level layers or "\myfeaturedataset\mylayername" for layers attached to a feature dataset "myfeaturedataset"
    - `DatasetSubType1` (int32): 1 for user tables (TBC)
    - `DatasetSubType2` (int32): layer geometry type. 1 for point layers, 2 for multipoint layers, 3 for linestring layers, 4 for polygon layers
    - `DatasetInfo1` (string): "SHAPE" for user tables (TBC)
    - `DatasetInfo2` (string): NULL for user tables (TBC)
    - `URL` (string): empty string (TBC)
    - `Definition` (XML): DEFeatureClassInfo XML element. Contains an XML version of the information that can be obtained by parsing the header of a table: fields, SRS, ...
    - `Documentation` (XML): metadata XML element
    - `ItemInfo` (XML): NULL for user tables (TBC)
    - `Properties` (int32): 1 for user tables (TBC)
    - `Defaults` (binary): absent for user tables (TBC)
    - `Shape` (geometry): 5-point polygon listing the corners of the bounding box of the layer reprojected into EPSG:4326 (even if the layer SRS is not EPSG:4326). Missing if the layer SRS is undefined.

    A few particular records:
    - The first record is reserved for a kind of root item (`Name` = "", `Path` = "").
    - The second record is reserved for a `Name` = "Workspace" item, with `Path` = "" and `Definition` containing a DEWorkspace XML element.
    - When there are feature datasets, they also appear as records: e.g. `Name` = "featuredataset", `PhysicalName` = "FEATUREDATASET", `Path` = "\FEATUREDATASET", `Definition` containing a DEFeatureDataset XML element.
- `a00000005`, `a00000006` and `a00000007` are each one of `GDB_ItemRelationships`, `GDB_ItemRelationshipTypes` or `GDB_ItemTypes` (the order may vary depending on datasets).
- `a00000008` is called `GDB_ReplicaLog`. It is often listed in `GDB_SystemCatalog`, but actually missing on disk.
Globally for v10 files, the main interesting reserved table seems to be `GDB_SystemCatalog`, which establishes the link between a layer name and its associated .gdbtable file. Using `a00000004` might be needed when `GDB_SystemCatalog` lists user tables of other types that are not vector tables (rasters, relationships, ...). It may also be used to get an overview of all tables by exploiting the XML definition without opening all the corresponding .gdbtable files.
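The link between a `GDB_SystemCatalog` record and its file can be sketched as below. This is a minimal illustration, assuming the catalog records have already been read into `(fid, name)` pairs by some other means; the function names are hypothetical and reading the catalog itself is not shown.

```python
import os


def catalog_fid_to_basename(fid):
    """Map a GDB_SystemCatalog FID (numbered from 1) to a table base name."""
    if fid < 1:
        raise ValueError("FIDs are numbered from 1 in this convention")
    return "a%08x" % fid


def resolve_table_path(gdb_dir, fid):
    """Return the .gdbtable path for a catalog record, or None if it is not on disk.

    A record may be listed in the catalog without a matching file
    (this is often the case for a00000008).
    """
    path = os.path.join(gdb_dir, catalog_fid_to_basename(fid) + ".gdbtable")
    return path if os.path.exists(path) else None


def build_layer_index(records):
    """Build a Name -> base name mapping from (fid, name) pairs.

    `records` is a hypothetical, already-decoded list of catalog rows.
    Deleted rows are simply absent, so FIDs may have gaps.
    """
    return {name: catalog_fid_to_basename(fid) for fid, name in records}
```

For instance, `catalog_fid_to_basename(37)` gives `a00000025`, matching the example given for FID 37.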
For FileGDB v9, the first 36 files (`a00000001` to `a00000024`) seem to be reserved for database information and subsequent files are feature classes (`a00000025`, `a00000026`, ...). Very often, the files between `a00000009` and `a00000024` are missing.
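The reserved ranges for both versions can be summarised in a small sketch. The counts (8 for v10, 36 for v9) are the observed values reported in this document, not a guaranteed rule, and the names used here are hypothetical.

```python
# Number of reserved system tables per FileGDB major version, as observed
# in this document ("seem to be reserved"), not a confirmed constant.
RESERVED_TABLE_COUNT = {9: 36, 10: 8}


def is_reserved_table(number, fgdb_version):
    """Return True if the table number falls in the reserved range."""
    return 1 <= number <= RESERVED_TABLE_COUNT[fgdb_version]


def first_user_table_basename(fgdb_version):
    """Return the base name of the first non-reserved table."""
    return "a%08x" % (RESERVED_TABLE_COUNT[fgdb_version] + 1)
```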
- `a00000001`: `GDB_SystemCatalog`. Similar to v10. Also contains a `DatasetGUID` field. Records 1 to 36 are reserved for `GDB_` tables.
- `a00000002`: `GDB_DBTune`
- `a00000003`: `GDB_SpatialRefs`. Identical to v10.
- `a00000004`: `GDB_Release`. Contains a single record. For v9.2 databases: `Major` = 2, `Minor` = 2, `Bugfix` = 0. For v9.3 databases: `Major` = 2, `Minor` = 3, `Bugfix` = 0.
- `a00000005`: `GDB_FeatureDataset`
- `a00000006`: `GDB_ObjectClasses`. Contains a `Name` field, and other technical fields.
- `a00000007`: `GDB_FeatureClasses`. Simplified version of `GDB_Items` of v10. Contains the layer geometry type in `GeometryType` and the shape field name in `ShapeField`. The `ObjectClassID` field is related to the `FID` of `GDB_ObjectClasses`.
- `a00000008`: `GDB_FieldInfo`. Contains information about some (but not all) fields of layers.
Globally for v9 files, the main interesting reserved table seems to be `GDB_SystemCatalog`, which establishes the link between a layer name and its associated .gdbtable file. Using `a00000007` in conjunction with `a00000006` might be needed when `GDB_SystemCatalog` lists user tables of other types that are not vector tables (rasters, relationships, ...).
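Using `GDB_FeatureClasses` together with `GDB_ObjectClasses` amounts to a join on `ObjectClassID`. The sketch below assumes both tables have already been decoded into plain Python structures (a hypothetical representation); only the field names come from this document.

```python
def vector_layers_v9(object_classes, feature_classes):
    """List (name, geometry type, shape field) for v9 vector layers.

    object_classes: dict mapping the FID of a GDB_ObjectClasses row to its Name.
    feature_classes: list of dicts with the keys ObjectClassID, GeometryType
                     and ShapeField, one per GDB_FeatureClasses row.
    Both inputs are assumed to be decoded beforehand.
    """
    layers = []
    for row in feature_classes:
        name = object_classes.get(row["ObjectClassID"])
        if name is None:
            # Dangling reference (e.g. deleted object class): skip it.
            continue
        layers.append((name, row["GeometryType"], row["ShapeField"]))
    return layers
```

Object classes that have no matching `GDB_FeatureClasses` row (rasters, relationships, ...) do not appear in the result, which is how non-vector tables get filtered out.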
Compressed tables are indicated by the presence of a ".cdf" file, which contains a compressed version of a layer. The encoding of CDF tables is significantly different from standard GDB tables.
- uint32: File identifier, either 0x43444623 or 0x43444632
- uint32: Flags. If flags & 0xff00 == 0x1000, then the table is version 10. If it is 0x0900, it is a version 9 table.
- 16 bytes: Unique table UUID (See UUID field interpretation above for explanation)
- For version 9 only: uint32: codepage. The code page value & 0xffff should be one of 0x2ff, 0x3ff, 0x4ff or 0x5ff
- varint16: Offset for file TOC
- varuint: Number of objects in TOC
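The fixed-size start of this header can be read as sketched below. Two assumptions are made here and are not confirmed by this document: that uint32 values are little-endian like the other integer types of this specification, and that the header starts at offset 0. The variable-length TOC offset and object count that follow are not decoded; `parse_cdf_header_prefix` is a hypothetical name.

```python
import struct

CDF_MAGICS = (0x43444623, 0x43444632)
VALID_V9_CODEPAGES = (0x2FF, 0x3FF, 0x4FF, 0x5FF)


def parse_cdf_header_prefix(buf):
    """Parse identifier, flags, UUID and (v9 only) codepage from a .cdf header."""
    if len(buf) < 24:
        raise ValueError("header too short")
    # Assumption: little-endian uint32 values.
    magic, flags = struct.unpack_from("<II", buf, 0)
    if magic not in CDF_MAGICS:
        raise ValueError("unexpected file identifier 0x%08x" % magic)
    version_bits = flags & 0xFF00
    if version_bits == 0x1000:
        version = 10
    elif version_bits == 0x0900:
        version = 9
    else:
        raise ValueError("unknown version bits 0x%04x" % version_bits)
    uuid_bytes = bytes(buf[8:24])
    offset = 24
    codepage = None
    codepage_ok = None
    if version == 9:
        if len(buf) < 28:
            raise ValueError("header too short for v9 codepage")
        (codepage,) = struct.unpack_from("<I", buf, offset)
        codepage_ok = (codepage & 0xFFFF) in VALID_V9_CODEPAGES
        offset += 4
    return {
        "version": version,
        "uuid": uuid_bytes,
        "codepage": codepage,
        "codepage_ok": codepage_ok,
        "next_offset": offset,  # where the TOC offset field would start
    }
```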
For each object in TOC:
- 16 bytes: object GUID, where GUID may be 0x010000000000000000000000000000 for "CDF Block", 0x02: "CDF Log", 0x03...: "CDF SINFO" (spatial information), 0x04...: "CDF_TABINFO" (table information), 0x14...: "SDC Block", 0x15...: "SDC PHYS", 0x16...: "SDC LOG". There's at most one of each of block, log, sinfo, tabinfo, phys. There MUST be a BLOCK and LOG entry, and for v9 files, there must also be a "SDC PHYS" entry.
- varint16: Offset of object in file
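The TOC rules above can be checked with a small sketch. It relies on two assumptions that this document does not state explicitly: that the first byte of the 16-byte GUID is enough to identify the object, and that the required "BLOCK" and "LOG" entries may be of either the CDF or the SDC kind. All names are hypothetical.

```python
# Object names keyed by the first byte of the GUID (assumed to be sufficient
# to identify the object; the remaining bytes are not inspected).
TOC_OBJECT_NAMES = {
    0x01: "CDF Block",
    0x02: "CDF Log",
    0x03: "CDF SINFO",
    0x04: "CDF_TABINFO",
    0x14: "SDC Block",
    0x15: "SDC PHYS",
    0x16: "SDC LOG",
}


def toc_object_name(guid):
    """Return the object name for a 16-byte GUID, or None if unknown."""
    if len(guid) != 16:
        raise ValueError("GUID must be 16 bytes")
    return TOC_OBJECT_NAMES.get(guid[0])


def missing_required_objects(names, version):
    """Return the required object kinds that are absent from the TOC.

    Assumption: a block and a log entry of either prefix (CDF or SDC)
    satisfy the requirement; v9 files additionally need a PHYS entry.
    """
    kinds = {name.upper().replace("_", " ").split()[-1] for name in names}
    required = ["BLOCK", "LOG"]
    if version == 9:
        required.append("PHYS")
    return [kind for kind in required if kind not in kinds]
```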
First, seek to LOG offset from TOC
- varuint: Field count
- 16 bytes: unknown
For each field:
- varuint: number of UTF-16 characters (not bytes) of the name of the field
- utf16: name of the field
- varuint: field type. These differ from standard GDB field types. Where 1 = INT16, 4 = OBJECTID, 5 = FLOAT32, 6 = FLOAT64, 7 = STRING, 8 = GEOMETRY, 9 = DATETIME, 10 = UUID1, 12 = BINARY, 16 = RASTER, 17 = UUID2
- varuint: unknown
For field type 4 (OBJECTID):
- varuint: unknown
For field type 5 or 6 (FLOAT32/FLOAT64)
- varuint: unknown
- varuint: unknown
For field type 8 (GEOMETRY):
- varuint: unknown
For all field types
- varuint: unknown, must be 0
- 16 unknown bytes
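The field type codes and the layout of a field name can be captured as follows. This is a sketch only: it assumes "UTF-16 characters" means 2-byte code units in little-endian order, and it does not decode the varuint values themselves. The helper names are hypothetical.

```python
# CDF field type codes as listed above (they differ from standard GDB types).
CDF_FIELD_TYPES = {
    1: "INT16",
    4: "OBJECTID",
    5: "FLOAT32",
    6: "FLOAT64",
    7: "STRING",
    8: "GEOMETRY",
    9: "DATETIME",
    10: "UUID1",
    12: "BINARY",
    16: "RASTER",
    17: "UUID2",
}

# Number of extra type-specific varuints of unknown role that follow the
# common fields, per the list above.
_EXTRA_VARUINTS = {4: 1, 5: 2, 6: 2, 8: 1}


def extra_unknown_varuints(field_type):
    """Return how many type-specific unknown varuints follow for this type."""
    return _EXTRA_VARUINTS.get(field_type, 0)


def read_utf16_name(buf, offset, nchars):
    """Decode a name of `nchars` UTF-16 code units starting at `offset`.

    Returns (name, offset just past the name). Assumes 2 bytes per character.
    """
    end = offset + 2 * nchars
    if end > len(buf):
        raise ValueError("name runs past end of buffer")
    return bytes(buf[offset:end]).decode("utf-16-le"), end
```

The same `read_utf16_name` helper would apply to the table name and the WKT string, which use the same character-count convention.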
First, seek to TABINFO offset from TOC
- varuint: number of UTF-16 characters (not bytes) of the name of the table
- utf16: name of the table
First, seek to SINFO offset from TOC
- float64: x min (Extent of layer)
- float64: y min
- float64: x max
- float64: y max
- float64: unknown -- maybe resolution?
If z or m values are present, there appear to be two more sets of doubles for each (likely z/m min/max, but their order is unknown).
- varuint: number of UTF-16 characters (not bytes) of the WKT definition of the table's SRS
- utf16: WKT definition of table's SRS
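Reading the five leading doubles of the SINFO object could look like the sketch below. It assumes the doubles are little-endian (as float64 is defined in this specification) and covers only the 2D case; the optional z/m values and the WKT string are left undecoded. `parse_sinfo_extent` is a hypothetical name.

```python
import struct


def parse_sinfo_extent(buf, offset=0):
    """Read x min, y min, x max, y max and one unknown double.

    Returns ((xmin, ymin, xmax, ymax), unknown, next offset).
    Only the 2D layout is handled; z/m extents are not parsed.
    """
    xmin, ymin, xmax, ymax, unknown = struct.unpack_from("<5d", buf, offset)
    return (xmin, ymin, xmax, ymax), unknown, offset + 5 * 8
```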
This specification document is (C) 2013 Even Rouault and licensed under the CC-BY-SA 3.0 terms.
Formatting to Markdown done by Calvin Metcalf.
Note: the scope of the copyrighted material does, of course, not extend onto any source or binary code derived from the specification, that may be licensed under the terms that their author may see fit.