FGDB Spec

This is a work-in-progress reverse-engineered specification of .gdbtable, .gdbtablx, .gdbindexes, .atx, .spx and .freelist files found in FileGDB datasets. It generally applies to FileGDB datasets v10, as well as earlier versions, unless otherwise specified.

Conventions

ubyte: unsigned byte
int16: little-endian 16-bit integer
int32: little-endian 32-bit integer
float64: little-endian 64-bit IEEE754 floating point number
utf16: string in little-endian UTF-16 encoding
string: (UTF-8 ?) string

Row and feature are used as synonyms in this document.

Specification of .gdbtable files

.gdbtable files describe fields and contain row data.

They are made of a header, a section describing the fields, and a section describing the rows.

Header (40 bytes)

4 bytes: 0x03 0x00 0x00 0x00 - unknown role. Constant among the files. Kind of signature ?
int32: number of (valid) rows
4 bytes: varying values - unknown role (TBC : this value does have something to do with row size. A value larger than the size of the largest row seems to be ok)
4 bytes: 0x05 0x00 0x00 0x00 - unknown role. Constant among the files
4 bytes: varying values - unknown role. Seems to be 0x00 0x00 0x00 0x00 for FGDB 10 files, but not for earlier versions
4 bytes: 0x00 0x00 0x00 0x00 - unknown role. Constant among the files
int32: file size in bytes
4 bytes: 0x00 0x00 0x00 0x00 - unknown role. Constant among the files
int64: offset in bytes at which the field description section begins (often 40 in FGDB 10). Note: datasets with 5 significant bytes (ie beyond 4GB) have been found per https://trac.osgeo.org/gdal/ticket/6830.
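
As a minimal sketch of the layout above, the 40 bytes can be unpacked in one call. Python is used here only for illustration; the function name `parse_gdbtable_header` and the dictionary keys are hypothetical and not part of any FileGDB API.

```python
import struct

def parse_gdbtable_header(buf):
    """Decode the 40-byte .gdbtable header (names are illustrative only)."""
    if len(buf) < 40:
        raise ValueError("a .gdbtable header needs 40 bytes")
    # Eight little-endian 32-bit fields followed by one 64-bit offset.
    (signature, n_valid_rows, unknown_row_size, const_5, unknown_a,
     unknown_b, file_size, unknown_c, fields_offset) = struct.unpack(
        "<8iq", buf[:40])
    return {
        "signature": signature,          # 3 in the files studied
        "n_valid_rows": n_valid_rows,
        "file_size": file_size,
        "fields_offset": fields_offset,  # often 40 in FGDB 10
    }
```

The fields of unknown role are unpacked but not returned, since nothing reliable can be said about them.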

Field description section

Fixed part

int32: size of header in bytes (this field excluded)
int32: version of the file. 3 for FGDB 9.X files and 4 for FGDB 10.X files. No other known values.
uint32: layer flags, including geometry type:
- bits 0 - 7: (i.e. flag & 0xff) geometry type:
  - 0 = none
  - 1 = point
  - 2 = multipoint
  - 3 = (multi)polyline
  - 4 = (multi)polygon
  - 5 = rectangle (envelope)
  - 6 = "path"
  - 7 = mixed/any geometry type
  - 9 = multipatch
  - 11 = ring
  - 13 = line
  - 14 = circular arc
  - 15 = bezier curves
  - 16 = elliptic curves
  - 17 = geometry collection (any types)
  - 18 = triangle strip
  - 19 = triangle fan
  - 20 = ray
  - 21 = sphere
  - 22 = TIN
- bit 8: encoding, is set for all known versions of the database
- bit 9: (or bits 10 or 12) likely an indicator of whether the database uses "high precision storage" or not. Always 1 in all encountered files, and according to the ESRI docs, it hasn't been possible to make low precision gdbs since 9.2
- bit 10: possibly storage type, see bit 9
- bit 11: unknown
- bit 12: possibly storage type, see bit 9
- bit 30: geometry has M values
- bit 31: geometry has Z values
int16: number of fields (including geometry field and implicit OBJECTID field)
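
The bit layout of the layer flags can be illustrated with a short sketch. The function name `decode_layer_flags` and the returned keys are hypothetical; only the bits whose role is stated above are decoded.

```python
def decode_layer_flags(flags):
    """Split the uint32 layer flags into the documented parts."""
    return {
        "geometry_type": flags & 0xFF,     # bits 0 - 7
        "has_m": bool(flags & (1 << 30)),  # bit 30
        "has_z": bool(flags & (1 << 31)),  # bit 31
    }
```

For example, a value with bits 8, 9 and 31 set and a low byte of 4 would decode as a (multi)polygon layer with Z values and no M values.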

Repeated part (per field)

Following immediately: the description of the fields (repeated as many times as the number of fields)

ubyte: number of UTF-16 characters (not bytes) of the name of the field
utf16: name of the field
ubyte: number of UTF-16 characters (not bytes) of the alias of the field. Might be 0
utf16: alias of the field (omitted if the previous length is 0)
ubyte: field type ( 0 = int16, 1 = int32, 2 = float32, 3 = float64, 4 = string, 5 = datetime, 6 = objectid, 7 = geometry, 8 = binary, 9=raster, 10/11 = UUID, 12 = XML )

The next bytes for the field description depend on the field type.

For field type = 4 (string),

int32: maximum length of string
ubyte: flag
varuint: ldf = length of the default value in bytes if (flag & 4) != 0, followed by ldf bytes holding the default value

For field type = 6 (objectid),

ubyte: unknown role = 4
ubyte: unknown role = 2

For field type = 7 (geometry),

ubyte: unknown role = 0
ubyte: flag = 6 or 7. If lsb is 1, the field can be null.
int16: length (in bytes) of the WKT string describing the SRS.
string: WKT string describing the SRS, or {B286C06B-0879-11D2-AACA-00C04FA33C20} for no SRS (which corresponds to the COM CLSID of the ESRI UnknownCoordinateSystem class, see http://desktop.arcgis.com/en/arcobjects/latest/net/webframe.htm#UnknownCoordinateSystem.htm).
ubyte: flags. Combination of values:
- (1<<0) seems to be systematically set (only bit for system table a00000004.gdbtable )
- (1<<2) indicates has_z = true
- (1<<4) indicates has_m = true
float64: xorigin
float64: yorigin
float64: xyscale
float64: morigin (present only if has_m = True)
float64: mscale (present only if has_m = True)
float64: zorigin (present only if has_z = True)
float64: zscale (present only if has_z = True)
float64: xytolerance
float64: mtolerance (present only if has_m = True)
float64: ztolerance (present only if has_z = True)
float64: xmin of layer extent (might be NaN)
float64: ymin of layer extent (might be NaN)
float64: xmax of layer extent (might be NaN)
float64: ymax of layer extent (might be NaN)

If geometry has z values (bit 31 of layer geometry type flags):

float64: zmin of layer extent (might be NaN)
float64: zmax of layer extent (might be NaN)

If geometry has m values (bit 30 of layer geometry type flags):

float64: mmin of layer extent (might be NaN)
float64: mmax of layer extent (might be NaN)

Then, values relating to the spatial index for the field:

a byte always at 0 (possibly an indicator of existence of spatial index or its type?)
a uint32 whose value is 1, 2 or 3, indicating the number of spatial grid sizes (see e.g. http://desktop.arcgis.com/en/arcmap/10.3/tools/data-management-toolbox/add-spatial-index.htm for more details about spatial grid sizes)
for each grid size, float64: spatial index grid resolution at this level (referenced as grid_size[] in later section describing .spx files). ESRI software enforces grid_size[1] >= 3 * grid_size[0] and grid_size[2] >= 3 * grid_size[1]

For field type = 8 (binary),

ubyte: unknown role
ubyte: flag

For field type = 9 (raster),

ubyte: unknown role
ubyte: flag. If lsb is 1, the field can be null.
ubyte: number of UTF-16 characters (not bytes) of the following string
utf16: string whose value seems to be "Raster Column"
int16: length (in bytes) of the WKT string describing the SRS.
string: WKT string describing the SRS, or {B286C06B-0879-11D2-AACA-00C04FA33C20} for no SRS.
ubyte: flags. Value is generally 1 (has_z = has_m = false, generally for system table a00000004.gdbtable), 5 (has_z = true, has_m = false) or 7 (has_z = has_m = true). If 0, none of the following float64 values is present : the next field is the ubyte that follows them.
float64: xorigin
float64: yorigin
float64: xyscale
float64: morigin (present only if has_m = True)
float64: mscale (present only if has_m = True)
float64: zorigin (present only if has_z = True)
float64: zscale (present only if has_z = True)
float64: xytolerance
float64: mtolerance (present only if has_m = True)
float64: ztolerance (present only if has_z = True)
ubyte: is_managed (1=if raster is managed within filegdb, 0=if raster is stored externally)

For field type = 10, 11 (UUID)

ubyte: width : 38
ubyte: flag

For field type = 12

ubyte: width : 0
ubyte: flag

For other field types,

ubyte: width in bytes (e.g. 2 for int16, 4 for int32, 4 for float32, 8 for float64, 8 for datetime)
ubyte: flag
ubyte: ldf = length of the default value in bytes if (flag & 4) != 0, followed by ldf bytes

If the lsb of the flag field (when present) is set to 1, then the field can be null in records

Rows section

The rows section does not necessarily immediately follow the last field description. It starts generally a few bytes after, but not in a predictable way. Note : for FGDB layers created by the ESRI FGDB SDK API, there are 4 bytes between the end of the field description section and the beginning of the rows section : 0xDE 0xAD 0xBE 0xEF (!)

The rows section is a sequence of X rows (where X is the total number of features found in the .gdbtablx, which might be different from the number of valid rows found in the header of the .gdbtable). Each row starts at an offset indicated in the .gdbtablx file

Row description

int32: length in bytes of the row blob ( this field excluded)
ceil(number_nullable_fields / 8) * ubyte: flags describing if a field is null. See below explanation

Null fields flags

Each bit of the flags field encodes the presence or absence of the content of one nullable field in the row. A bit is set to 1 if the field is missing/null, or 0 if the field is present/non-null (0 is used as well for the spare bits). The bit for the first nullable field, in the order of the fields of the field description section (typically the geometry), is the least significant bit of the first byte of the flags field.

There are no bits reserved for non-nullable fields.

If all fields are non-nullable, the flag field is absent.

Note: there's no explicit data for OBJECTID and no reserved flag bit for it.

For each non-null field, the field content is appended in the order of the fields of the field description section.
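
The null-flag rules above can be sketched as follows. The helper name `read_null_flags` is hypothetical; `row` is assumed to be the row blob after its int32 length, and `nullable` is assumed to hold one boolean per stored field in description order, with the OBJECTID field left out (it has no data and no flag bit).

```python
def read_null_flags(row, nullable):
    """Return (is_null per field, number of flag bytes consumed)."""
    n_flag_bytes = (sum(nullable) + 7) // 8  # ceil(nullable fields / 8)
    is_null = []
    bit = 0  # index among nullable fields only
    for can_be_null in nullable:
        if can_be_null:
            is_null.append(bool(row[bit // 8] & (1 << (bit % 8))))
            bit += 1
        else:
            is_null.append(False)  # no bit is reserved for this field
    return is_null, n_flag_bytes
```

The field contents then start at offset `n_flag_bytes` in `row`, and only the fields whose entry is False are present.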

Field content

Geometry field (type = 7)

This field is generally called "SHAPE".

Geometry blobs use 2 new encoding schemes :

varuint (64 bit): a sequence of bytes [b0, b1, ... bN]. All bytes except last one have their msb (most significant bit) set to 1. The presence of a msb = 0 marks the end of the sequence. The value of the varuint is (b0 & 0x7F) | ((b1 & 0x7F) << 7) | ((b2 & 0x7F) << 14) | ... | ((bN & 0x7F) << (7 * N)). Note that a valid sequence might be just 1 byte.
varint (64 bit): same concept as varuint. But the 2nd most significant bit of b0 (i.e. the one obtained by masking with 0x40) indicates the sign of the result, and should be ignored in the computation of the unsigned value : (b0 & 0x3F) | ((b1 & 0x7F) << 6) | ((b2 & 0x7F) << 13) | ... | ((bN & 0x7F) << (7 * N - 1)). If the sign bit is set to 1, the value must be negated.
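
A minimal sketch of both decoders, written directly from the two formulas above. The function names `read_varuint` and `read_varint` are hypothetical; each takes a bytes-like buffer and a position, and returns the decoded value with the position of the next unread byte.

```python
def read_varuint(buf, pos):
    """Decode a varuint starting at buf[pos]; returns (value, new_pos)."""
    value = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not (b & 0x80):  # msb = 0 marks the last byte
            return value, pos
        shift += 7

def read_varint(buf, pos):
    """Decode a varint starting at buf[pos]; returns (value, new_pos)."""
    b = buf[pos]
    pos += 1
    negative = bool(b & 0x40)  # sign bit of the first byte
    value = b & 0x3F           # only 6 payload bits in the first byte
    shift = 6
    while b & 0x80:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        shift += 7
    return (-value if negative else value), pos
```

Python integers are unbounded, so no 64-bit overflow handling is shown; an implementation in a fixed-width language would need to bound the shift.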

Common preamble to all geometry types

varuint: length of the geometry blob in bytes (this field excluded)
varuint: geometry_type. 1 = 2D point, 3 = 2D (multi)linestring, 5 = 2D (multi)polygon. Other values possible. See SHPT_ enumeration of ogrpgeogeometry.h. This is generally a single byte, but for SHPT_GENERALxxxxx geometries this can be multi-byte due to flags added to the base type

The bytes of the geometry blob following this preamble depend of course on the geometry type.

For point geometries (geometry type = 1, 9, 21, 11)
- varuint: x = (varuint - 1) / xyscale + xorigin
- varuint: y = (varuint - 1) / xyscale + yorigin
- varuint ( present only if Z component ): z = (varuint - 1) / zscale + zorigin
- varuint ( present only if M component ): m = (varuint - 1) / mscale + morigin
Note the (varuint - 1), instead of varint in following geometry types. The reason for that exception is unclear.
For multipoint geometries (geometry type = 8, 20, 28, 18)
- varuint: number of points
- varuint: xmin = varuint / xyscale + xorigin
- varuint: ymin = varuint / xyscale + yorigin
- varuint: xmax = varuint / xyscale + xmin
- varuint: ymax = varuint / xyscale + ymin
followed by points coordinates:

For each point of all parts (dx = dy = 0 initially) :
- varint: dx = dx + varint; x[i] = dx / xyscale + xorigin
- varint: dy = dy + varint; y[i] = dy / xyscale + yorigin
If there is a Z component, an array of Z values follows :

For each point of all parts (dz = 0 initially) :
- varint: dz = dz + varint; z[i] = dz / zscale + zorigin
For (multi)linestring (geometry type = 3, 10, 23, 13) or (multi)polygon (geometry type = 5, 19, 25, 15)
- varuint: total number of points of all following parts
- varuint: number of parts (i.e. number of rings for a (multi)polygon, inner and outer rings being at the same level; number of linestrings of a multilinestring; or 1 for a linestring)
- varuint: xmin = varuint / xyscale + xorigin
- varuint: ymin = varuint / xyscale + yorigin
- varuint: xmax = varuint / xyscale + xmin
- varuint: ymax = varuint / xyscale + ymin
- varuint: number of points of first part (omitted if there is only one part)
- ...: ...
- varuint: number of points of the (number of parts - 1)th part (the number of points of the last part can be computed by subtracting the sum of the above numbers from the total number of points)
followed by, for each part, points coordinates:

For each point of all parts (dx = dy = 0 initially) :
- varint: dx = dx + varint; x[i] = dx / xyscale + xorigin
- varint: dy = dy + varint; y[i] = dy / xyscale + yorigin
If there is a Z component, an array of Z values follows :

For each point of all parts (dz = 0 initially) :
- varint: dz = dz + varint; z[i] = dz / zscale + zorigin
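
The part sizes and the delta accumulation described above can be sketched as follows. Both helper names are hypothetical, and the varint values are assumed to have been decoded already; the running sum is not reset between parts, following the "for each point of all parts" wording.

```python
def part_sizes(total_points, leading_counts):
    """Point count of every part; the last one is implicit in the blob."""
    return list(leading_counts) + [total_points - sum(leading_counts)]

def accumulate(deltas, scale, origin):
    """Turn already-decoded varint deltas into coordinates."""
    running = 0
    coords = []
    for delta in deltas:
        running += delta  # dx = dx + varint
        coords.append(running / scale + origin)
    return coords
```

The same `accumulate` helper would apply to X, Y and Z arrays, each with its own scale and origin.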
For polygons, a clockwise ring is an outer ring and a counterclockwise ring is an inner ring. While it is not documented anywhere, ESRI programs assume that inner rings always follow the outer ring that contains them. So
```
[clockwise,counterclockwise,clockwise,clockwise,counterclockwise,counterclockwise] 
```
can be represented in GeoJSON as
```
[[clockwise,counterclockwise],[clockwise],[clockwise,counterclockwise,counterclockwise]] 
```
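
A minimal sketch of that regrouping, assuming each ring is a list of (x, y) tuples in a y-up coordinate system. The helper names are hypothetical, and the orientation test uses the standard shoelace formula, which is a choice made here and not something confirmed for ESRI software.

```python
def signed_area(ring):
    """Shoelace formula: negative for clockwise rings (y axis pointing up)."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        total += x0 * y1 - x1 * y0
    return total / 2.0

def group_rings(rings):
    """Attach each inner ring to the outer ring that precedes it."""
    polygons = []
    for ring in rings:
        # Assumption: a leading counterclockwise ring is treated as outer.
        if signed_area(ring) < 0 or not polygons:
            polygons.append([ring])    # clockwise: starts a new polygon
        else:
            polygons[-1].append(ring)  # counterclockwise: hole
    return polygons
```

This relies entirely on the ordering assumption stated above; it performs no containment test.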
TODO: M values. Likely like Z component. But in FileGDB_API/samples/data/Shapes.gdb/a00000028.gdbtable, which is a polylinezm, the m values all are NaN, which is represented as 0x42 0x00 0x00 0x00 0x00 at the end of the geometry blob
For GeneralPolyline ( (geometry type & 0xff) = 50 )
- varuint: total number of points of all following parts
- varuint: number of parts, number of linestrings of a multilinestring, or 1 for a linestring
- varuint: number of curve descriptions (present if (geom_type & 0x20000000) != 0 )
- varuint: xmin = varuint / xyscale + xorigin
- varuint: ymin = varuint / xyscale + yorigin
- varuint: xmax = varuint / xyscale + xmin
- varuint: ymax = varuint / xyscale + ymin
- varuint: number of points of first part (omitted if there is only one part)
- ...: ...
- varuint: number of points of the (number of parts - 1)th part (the number of points of the last part can be computed by subtracting the sum of the above numbers from the total number of points)
followed by, for each part, points coordinates:

For each point of all parts (dx = dy = 0 initially) :
- varint: dx = dx + varint; x[i] = dx / xyscale + xorigin
- varint: dy = dy + varint; y[i] = dy / xyscale + yorigin
If there is a Z component ( (geom_type & 0x80000000) != 0 ) , an array of Z values follows :

For each point of all parts (dz = 0 initially) :
- varint: dz = dz + varint; z[i] = dz / zscale + zorigin
If there is a M component ( (geom_type & 0x40000000) != 0 ) , an array of M values follows (unless the next byte is 0x42, in which case the M array is skipped) :

For each point of all parts (dm = 0 initially) :
- varint: dm = dm + varint; m[i] = dm / mscale + morigin
If there are curves ( (geom_type & 0x20000000) != 0 ), an array of segment modifiers follows. There are as many segment modifiers as indicated by the above "number of curve descriptions" field. The serialization of these curve descriptions is directly based on the esriSegmentModifier, WKSPoint, SegmentArc, SegmentBezierCurve and SegmentEllipticArc C structures described in extended_shape_buffer_format.pdf, with the following equivalences :
- C long --> int32
- C enum --> int32
- C double --> float64
For GeneralMultiPatch ( (geometry type & 0xff) = 54 )
- varuint: total number of points of all following parts
- varuint: unknown role
- varuint: number of parts (i.e. number of rings for (multi)polygon, inner and outer rings being at the same level; number of linestrings of a multilinestring; or 1 for a linestring)
- varuint: xmin = varuint / xyscale + xorigin
- varuint: ymin = varuint / xyscale + yorigin
- varuint: xmax = varuint / xyscale + xmin
- varuint: ymax = varuint / xyscale + ymin
- varuint: number of points of first part (omitted if there is only one part)
- ...: ...
- varuint: number of points of the (number of parts - 1)th part (the number of points of the last part can be computed by subtracting the sum of the above numbers from the total number of points)
followed by, for each part, part type:
- varuint: part type. Only keep the 4 least significant bits (higher bits are for priority and material index, see extended-shapefile-format.pdf). 0 = triangle strip, 1 = triangle fan, 2 = outer ring, 3 = inner ring, 4 = first ring, 5 = ring, 6 = triangles
followed by, for each part, points coordinates:

For each point of all parts (dx = dy = 0 initially) :
- varint: dx = dx + varint; x[i] = dx / xyscale + xorigin
- varint: dy = dy + varint; y[i] = dy / xyscale + yorigin
If there is a Z component ( (geom_type & 0x80000000) != 0 ) , an array of Z values follows :

For each point of all parts (dz = 0 initially) :
- varint: dz = dz + varint; z[i] = dz / zscale + zorigin

Binary (type = 8)

Number of bytes of the binary content as a varuint, followed by the binary content

Raster (type = 9)

If raster field definition has is_managed = 1:

uint32: raster ID (points to auxiliary tables)

If raster field definition has is_managed = 0:

varuint: number of bytes (not characters!) of next string
utf16: path to the raster

String (type=4) or XML (type=12)

Number of bytes of the string as a varuint, followed by string content

UUID (type=10 or 11)

16 bytes.

The string representation is the following (printf like expression) :

"{%02X%02X%02X%02X-%02X%02X-%02X%02X-%02X%02X-%02X%02X%02X%02X%02X%02X}", b[3], b[2], b[1], b[0], b[5], b[4], b[7], b[6], b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]

(This is the standard way winapi handles CLSID to string conversions through CLSIDFromString16. See e.g. wine implementation at https://github.com/wine-mirror/wine/blob/6d801377055911d914226a3c6af8d8637a63fa13/dlls/compobj.dll16/compobj.c#L380 )
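
The byte reordering above can be sketched as follows. The function names are hypothetical. The second helper uses the `bytes_le` argument of Python's `uuid.UUID`, which treats the first three fields as little-endian and therefore appears to produce the same ordering.

```python
import uuid

# Byte order taken from the printf-like expression above.
_ORDER = (3, 2, 1, 0, 5, 4, 7, 6, 8, 9, 10, 11, 12, 13, 14, 15)

def format_fgdb_uuid(b):
    """Format 16 raw bytes as {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}."""
    if len(b) != 16:
        raise ValueError("a UUID value is 16 bytes")
    h = "".join("%02X" % b[i] for i in _ORDER)
    return "{%s-%s-%s-%s-%s}" % (h[0:8], h[8:12], h[12:16], h[16:20], h[20:32])

def format_with_stdlib(b):
    """Same result through uuid.UUID(bytes_le=...)."""
    return "{%s}" % str(uuid.UUID(bytes_le=bytes(b))).upper()
```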

Other types

An int16 value for an int16 field, an int32 value for an int32 field, etc.

Note : datetime values are the number of days since 30 December 1899 00:00:00, encoded as float64
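
As a small sketch of the datetime convention, the float64 day count can be added to the 30 December 1899 epoch. The names `FGDB_EPOCH` and `fgdb_datetime` are hypothetical, and time zones are ignored because nothing above states one.

```python
from datetime import datetime, timedelta

FGDB_EPOCH = datetime(1899, 12, 30)  # day 0 of the encoding

def fgdb_datetime(days):
    """Convert a float64 day count into a naive datetime."""
    return FGDB_EPOCH + timedelta(days=days)
```

The fractional part of the value carries the time of day, so 1.5 corresponds to noon on 31 December 1899.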

Specification of .gdbtablx file

.gdbtablx files contain the offset of the rows of the associated .gdbtable file.

Header (16 bytes)

4 bytes: 0x03 0x00 0x00 0x00 - unknown role. Constant among the files. Kind of signature ?
int32: n1024BlocksPresent = number of blocks of offsets for 1024 features that are effectively present in that file (ie sparse blocks are not counted in that number).
int32: number_of_rows : number of rows, including deleted rows
int32: size_offset = number of bytes to encode each feature offset. Must be 4 (.gdbtable up to 4GB), 5 (.gdbtable up to 1TB) or 6 (.gdbtable up to 256TB)

Offset section

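The section starts immediately after the header (at offset 16) and holds one entry of size_offset bytes per row. Entries are stored in blocks of 1024, so the section spans size_offset x n1024BlocksPresent x 1024 bytes (consistent with the location of the trailing section). For each row,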

int32, int40 or int48: (depending on size_offset value) offset of the beginning of the row in the .gdbtable file, or 0 if the row is deleted. int40 is made of an int32 with the 32 least significant bits followed by a 5th byte with the 8 most significant bits. Similar for int48
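
Because the extra bytes hold the most significant bits, an entry is simply a little-endian integer of size_offset bytes. A minimal sketch, with a hypothetical function name and with `entry_index` assumed to be the position of the entry in the offset section:

```python
def read_row_offset(tablx, entry_index, size_offset):
    """Read one 4-, 5- or 6-byte offset entry; 0 means a deleted row."""
    if size_offset not in (4, 5, 6):
        raise ValueError("size_offset must be 4, 5 or 6")
    start = 16 + entry_index * size_offset  # 16 = header size
    return int.from_bytes(tablx[start:start + size_offset], "little")
```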

If there is a bit array (bitmap) to represent the presence/absence of blocks of offsets for 1024 features, then the correct row iCorrectedRow in the index for the FID iRow+1 is given by :

        GUInt32 nCountBlocksBefore = 0;
        int iBlock = iRow / 1024;
        // Check if the block is not empty
        if( (pabyTablXBlockMap[iBlock / 8] & (1 << (iBlock % 8))) == 0 )
        {
            nCurRow = -1;
            return FALSE;
        }
        for(int i=0;i<iBlock;i++)
            nCountBlocksBefore += ( pabyTablXBlockMap[i / 8] & (1 << (i % 8)) ) != 0;
        int iCorrectedRow = nCountBlocksBefore * 1024 + (iRow % 1024);
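
The same lookup can be sketched in Python. The function name `corrected_row` is hypothetical, `block_map` is assumed to be the raw bytes of the bit array, and `None` stands in for the "block is empty" case of the fragment above.

```python
def corrected_row(block_map, i_row):
    """Index of row i_row (FID - 1) in the offset section, or None."""
    i_block = i_row // 1024
    # Bit = 0 means the whole block of 1024 offsets is absent.
    if not (block_map[i_block // 8] >> (i_block % 8)) & 1:
        return None
    # Count the blocks present before this one.
    blocks_before = sum(
        (block_map[i // 8] >> (i % 8)) & 1 for i in range(i_block))
    return blocks_before * 1024 + i_row % 1024
```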

Trailing section (16 bytes + variable number )

Located at offset 16 + size_offset * n1024BlocksPresent * 1024

int32: nBitmapInt32Words = number of int32 words for the bitmap (rounded to the next multiple of 32)
int32: n1024BlocksTotal = (number_of_rows + 1023) / 1024. In the case where there's a bitmap, this is also nBitsForBlockMap = number of bits in the block map.
int32: n1024BlocksPresentBis (must be == n1024BlocksPresent of the header)
int32: nUsefulBitmapIn32Words = number of int32 words in the bitmap where there's at least a non-zero bit. Said otherwise, all following words until the end of the bitmap are 0. Doesn't seem to be used by proprietary implementations.

if nBitmapInt32Words == 0 (no bitmap), then n1024BlocksTotal == n1024BlocksPresentBis ( == n1024BlocksPresent) and nUsefulBitmapIn32Words = 0

Otherwise, following those 16 trailer bytes, there is a bit array of at least (n1024BlocksTotal + 7) / 8 bytes (in practice its size is rounded to the next multiple of 32 int32 words). Each bit in the array represents the presence of a block of offsets for 1024 features (bit = 1), or its absence (bit = 0). The total number of bits set to 1 must be equal to n1024BlocksPresent.

Specification of .gdbindexes files

.gdbindexes files list the indexes that may exist on certain fields of a .gdbtable. This only applies to FileGDB v10 .gdbindexes : v9 .gdbindexes have a different (and more complicated) structure.

Header (4 bytes)

int32: number of indexes described in the file

Index description

The section starts immediately after the header (at offset 4) and is repeated as many times as there are indexes.

uint32: number of UTF-16 characters for the following field
utf16: suffix of the index file. If its value is foo, the filename of the index is aXXXXXXXX.foo.atx (unless the index is FDO_OBJECTID in which case the index is the .gdbtablx file, or FDO_SHAPE in which case the index is the .spx file)
int16: unknown role
int16: unknown role
int32: unknown role
int16: unknown role
int32: unknown role
uint32: number of UTF-16 characters for the following field
utf16: field name (or sometimes expression like "LOWER(Name)" as found in a00000001.gdbindexes)
int16: unknown role

Specification of .atx files

.atx files contain indexes for a field of a .gdbtable. The general idea is that the values that the field takes in the .gdbtable are listed in ascending order with the associated FID. .atx files are organized in pages of 4096 bytes and have a hierarchical organization whose depth depends on the size of the values of the field and the number of features of the table. The first page is 1, so page N is located at offset (N-1)*4096.

The reading of .atx files must start with its trailing section.

Trailing section (22 bytes)

byte: size in bytes of the values indexed (called size_value afterwards). This is closely related to the type of the field being indexed. For int16 it is equal to 2. For int32: 4. For float32: 4. For float64: 8. For string: a variable number that is a multiple of 2 (string values are encoded as UTF16 characters, so 2 bytes per character) and at most 160 bytes (80 characters). For datetime: 8. For UUID: 38 (the string representation is 38 bytes. See above). Indexing of binary or XML fields has not been studied (if it is possible !)
byte: unknown role
int32: unknown role. Apparently always/often 1.
uint32: index depth >= 1. If it is 1 the first page directly references features. Otherwise the first page references pages that reference pages referencing features (depth = 2), or pages that reference pages that reference pages that reference features (depth = 3), and so on...
uint32: number of features referenced in the file, in other words the number of features that have a non-null value for the field being indexed. Must not be greater than the number of valid features of the .gdbtable. It has been observed that (with FileGDB SDK 1.3) this value is not reliable for an index that has been built while features are inserted, if the values inserted are not in increasing order.
int32: unknown role. Apparently always/often 0.
int32: unknown role. Apparently always/often 1.

The maximum number of features (or sub-pages references) in a page is : nMaxPerPages = (4096 - 12) / (4 + size_value)

The offset at which field values are found in a page is : nOffsetFirstValInPage = 12 + nMaxPerPages * 4
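
A minimal sketch of reading the 22-byte trailer and deriving the two page constants above. The function names and dictionary keys are hypothetical, and the fields of unknown role are unpacked but ignored.

```python
import struct

def parse_atx_trailer(data):
    """Decode the last 22 bytes of an .atx (or .spx) file."""
    (size_value, _unknown_byte, _unknown_1, depth, n_features,
     _unknown_2, _unknown_3) = struct.unpack("<BBiIIii", data[-22:])
    return {"size_value": size_value, "depth": depth,
            "n_features": n_features}

def page_layout(size_value):
    """Return (nMaxPerPages, nOffsetFirstValInPage) for a 4096-byte page."""
    n_max_per_pages = (4096 - 12) // (4 + size_value)
    return n_max_per_pages, 12 + n_max_per_pages * 4
```

For an int32 index (size_value = 4) this gives 510 entries per page, with values starting at offset 2052.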

Page referencing features (4096 bytes)

For a given field value, if found in several features, the features are sorted by ascending ID. The structure of such a page is header section (12 bytes), followed by FID numbers (maximum of 4 * nMaxPerPages bytes), a few potential padding bytes, and finally field values (maximum of size_value * nMaxPerPages bytes)

Header section structure (offset 0 in the page) :

uint32: ID of the next page at the same depth, or 0 for last page. Not strictly needed to use the index (under the assumption that if index_depth == 1, there is a single feature page, and for higher index depth, all feature-referencing pages are referenced from page-referencing pages. Such an assumption seems to match how indices are generated, and is a good practice for efficient hierarchical indexing).
uint32: number of features referenced in the page (nFeatures). Not greater than nMaxPerPages
uint32: unknown role. Apparently always/often 0.

FID section structure (offset 12 in the page) :

uint32: FID of the first feature referenced in the page
...
uint32: FID of the (nFeatures)th feature referenced in the page.

Padding section of zeroes (size: nOffsetFirstValInPage - 12 - 4 * nFeatures)

Values section structure (offset nOffsetFirstValInPage in the page):

type depending on the field (int16/int32/float32/float64/datetime as float64/string as UTF16 characters/UUID): value of field for the first feature referenced in the page
...
type: value of field for the (nFeatures)th feature referenced in the page.

Page referencing other pages (4096 bytes)

The structure of such a page is header section (8 bytes, i.e. the two uint32 values below), followed by sub-pages numbers (maximum of 4 * (1 + nMaxPerPages) bytes), a few potential padding bytes, and finally field values (maximum of size_value * nMaxPerPages bytes)

Header section structure (offset 0 in the page) :

uint32: ID of the next page at the same depth, or 0 for last page. Not strictly needed to use the index (under the assumption that such a page is always referenced from a page upper in the hierarchy if there are several at that depth. Such an assumption seems to match how indices are generated, and is a good practice for efficient hierarchical indexing).
uint32: number of sub-pages referenced in the page (nSubPages). Not greater than nMaxPerPages

Sub-pages number section (offset 8 in the page):

uint32: ID of the first sub-page referenced in the page
...
uint32: ID of the (nSubPages)th sub-page referenced in the page.
uint32: ID of the (nSubPages+1)th sub-page referenced in the page (note: there is no matching value for that last sub-page number in the values section)

Padding section of zeroes( size: nOffsetFirstValInPage - 8 - 4 * (nSubPages+1))

Values section structure (offset nOffsetFirstValInPage in the page):

type depending on the field (int16/int32/float32/float64/datetime as float64/string as UTF16 characters/UUID): maximum value of field taken in the features referenced by the sub-page (and its potential sub-sub-pages) for the first sub-page referenced in the page
...
type: maximum value of field taken in the features referenced by the sub-page (and its potential sub-sub-pages) for the (nSubPages)th sub-page referenced in the page

Specification of .spx files

.spx files contain the spatial index for the geometry field of a .gdbtable. They have exactly the same structure as .atx files: same trailing section of 22 bytes, same principle of pages of 4096 bytes, with either pages referencing other pages (depth > 0) or pages referencing features (depth = 0). The payload being indexed is a 64-bit integer number (size_value = 8).

It is built from (x,y) georeferenced coordinates and a grid number (grid_no) : point(x,y,grid_no) = (grid_no << 62) | (scaled_x << 31) | scaled_y

where grid_no = 0, 1, 2 (grid_no must be strictly lower than len(grid_size), where grid_size[] is the array giving the spatial grid resolution) and

scaled_x = int(floor(x / grid_size[grid_no] + (2^29)) / (grid_size[grid_no] / grid_size[0]))
scaled_y = int(floor(y / grid_size[grid_no] + (2^29)) / (grid_size[grid_no] / grid_size[0]))

Note: for the purpose of building this number, it is convenient to consider it as an unsigned quantity, especially when grid_no = 2, which sets the most significant bit, but for sorting purposes in the .spx file, it has been found that this number is considered as a signed quantity.
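
The packing of the 64-bit value and its signed reinterpretation can be sketched as follows. Both function names are hypothetical, and the scaled coordinates are taken as inputs so that the sketch does not depend on the scaling formula.

```python
def spx_key(scaled_x, scaled_y, grid_no):
    """Build the unsigned 64-bit indexed value from its three parts."""
    return (grid_no << 62) | (scaled_x << 31) | scaled_y

def as_signed64(value):
    """Reinterpret an unsigned 64-bit value as signed (two's complement)."""
    return value - (1 << 64) if value >= (1 << 63) else value
```

Sorting keys with `key=as_signed64` places grid_no = 2 entries before grid_no = 0 entries, since setting the most significant bit makes them negative.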

In regular layers of sample files studied, it has been found that len(grid_size) == 1. It appears however that for FileGDB v10, the a00000004 system table can have up to 3 grid sizes.

The principle of spatial indexing consists in "rasterizing" the geometries on the spatial index grid(s) and indexing the 64-bit quantities corresponding to those rasterized points. Consequently for a non-punctual geometry, its FID may appear several times in the file. For a given 64-bit quantity, features appear in increasing FID in the .spx file.

On the read side, when interested in geometries that intersect the (minx, miny, maxx, maxy) envelope, one must search the index for indexed values in [point(x,miny,grid_no), point(x,maxy,grid_no)] for x in [minx, maxx] and grid_no in [0, len(grid_size)-1].

One can see that if grid_size[] values are not carefully chosen, the size of the .spx file may be huge. A polygon with a large extent can correspond to a large number of indexed values. It is difficult to completely assert the strategy used for indexing when len(grid_size[]) > 1, but presumably, from an example of an a00000004 system table, it would appear that features that would cause too many values to be generated at grid_no = 0 are rather indexed with grid_no = 1 or 2. On the read side, our assumption is that one should search indexed values for grid_no = 0 ... len(grid_size[])-1, and not only at grid_no = 0 even if there are matches at that resolution.

Specification of .freelist files

.freelist files contain the offsets of the holes (deleted rows, or old versions of updated rows) in the associated .gdbtable file. The file is rewritten after each edit session, with the most recent edit at the start of the file, and order being maintained during repeated edit operations. The file is optional, and will be deleted when the FileGDB is compacted.

The file has 344 bytes of buffer at the end, and looks to be created in 4K blocks ( so, smallest is 4096 + 344 = 4440 bytes )

Header (8 bytes)

int32: number of rows
4 bytes: 0xFFFFFFFF. No apparent use

Offset section

The section starts immediately after the header and is made of (4 + size_offset) x number_rows bytes. For each row,

int32: number of bytes
int32, int40 or int48: (depending on size_offset value) offset of the beginning of the row in the .gdbtable file. int40 is made of an int32 with the 32 least significant bits followed by a 5th byte with the 8 most significant bits. Similar for int48

GDB files

Files are named in the format a[number in lowercase hex].[extension], and files with the same base name but different extensions are related. Files are numbered incrementally (a00000001 is the first, a00000002 the second), but numbers may be skipped.

FileGDB v10

For FileGDB v10, the first 8 (a00000001 to a00000008) files seem to be reserved for database information and subsequent files are feature classes (a00000009, a0000000a, ...).

a00000001 is called GDB_SystemCatalog and contains a list of tables (including itself, other reserved tables and user tables). Tables may be mentioned but not actually found on the disk : this is often (only ?) the case of table a00000008. The FID of a record in this table determines the name of the file to consider. For example the record of FID 37 (the convention taken here for FID numbering is starting from 1) will be in file a00000025. There might be deleted rows in this catalog table, so gaps in FID numbering.
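
The FID-to-filename rule amounts to formatting the FID as 8 lowercase hexadecimal digits. A one-line sketch, with a hypothetical function name:

```python
def table_basename(fid):
    """Base file name for the catalog record with the given 1-based FID."""
    return "a%08x" % fid
```

The extension (.gdbtable, .gdbtablx, and so on) is then appended to this base name.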

The table contains a Name field and a FileFormat field. The value of FileFormat seems to be 0 in most cases, and sometimes 2 for a few reserved system tables.
a00000002 contains config parameters for the database and is called GDB_DBTune
a00000003 is called GDB_SpatialRefs and contains the SRS as WKT in field SRTEXT (in ESRI WKT dialect) and the following fields : FalseX, FalseY, XYUnits, FalseZ, ZUnits, FalseM, MUnits, XYTolerance, ZTolerance, MTolerance. All rows are unique, so if there are 3 feature classes that all share the same spatial reference system but one of them has a different ZTolerance, there will be two rows.
a00000004 is called GDB_Items and contains metadata about the items (layers), mostly in XML. The fields are :
- UUID (UUID) : UUID
- Type (UUID) : item type
- Name (string) : item/layer name. Matches the Name field of the GDB_SystemCatalog
- PhysicalName (string) : item/layer name in upper case characters.
- Path (string) : "\mylayername" for top-level layers or "\myfeaturedataset\mylayername" for layers attached to a feature dataset "myfeaturedataset"
- DatasetSubType1 (int32) : 1 for user tables (TBC)
- DatasetSubType2 (int32) : layer geometry type. 1 for point layer, 2 for multipoint layers, 3 for linestring layers, 4 for polygon layers
- DatasetInfo1 (string) : "SHAPE" for user tables (TBC)
- DatasetInfo2 (string) : NULL for user tables (TBC)
- URL (string) : empty string (TBC)
- Definition (XML) : DEFeatureClassInfo XML element. Contains an XML version of the information that can be obtained by parsing the header of a table : fields, SRS, ...
- Documentation (XML) : metadata XML element
- ItemInfo (XML) : NULL for user tables (TBC)
- Properties (int32) : 1 for user tables (TBC)
- Defaults (binary) : absent for user tables (TBC)
- Shape (geometry) : 5-point polygon listing the corners of the bounding box of the layer, reprojected into EPSG:4326 (even if the layer SRS is not EPSG:4326). Missing if the layer SRS is undefined.
A few particular records :
- The first record is reserved for a kind of root item ( Name = "", Path = "" ).
- The second record is reserved for a Name = "Workspace" item, Path = "", Definition containing a DEWorkspace XML element
- When there are feature datasets, they also appear as records : e.g. Name = "featuredataset", PhysicalName = "FEATUREDATASET", Path = "\FEATUREDATASET", Definition containing a DEFeatureDataset XML element
a00000005, a00000006 and a00000007 are GDB_ItemRelationships, GDB_ItemRelationshipTypes and GDB_ItemTypes (the order may vary between datasets)
a00000008 is called GDB_ReplicaLog. It is often listed in the GDB_SystemCatalog, but actually missing on disk.

Overall, for v10 files, the most useful reserved table seems to be GDB_SystemCatalog, which links a layer name to its associated .gdbtable file. a00000004 (GDB_Items) may be needed when GDB_SystemCatalog lists user tables that are not vector tables (rasters, relationships, ...). It can also give an overview of all tables through its XML definitions, without opening each corresponding .gdbtable file.

FileGDB v9

For FileGDB v9, the first 36 files (a00000001 to a00000024) seem to be reserved for database information, and subsequent files are feature classes (a00000025, a00000026, ...). Very often, the files between a00000009 and a00000024 are missing.

a00000001 : GDB_SystemCatalog. Similar to v10. Also contains a DatasetGUID field. Records 1 to 36 are reserved for GDB_ tables
a00000002 : GDB_DBTune
a00000003 : GDB_SpatialRefs. Identical to v10
a00000004 : GDB_Release. Contains a single record : for v9.2 databases: Major = 2, Minor = 2, Bugfix = 0. For v9.3 databases: Major = 2, Minor = 3, Bugfix = 0
a00000005 : GDB_FeatureDataset
a00000006 : GDB_ObjectClasses. Contains a Name field, and other technical fields.
a00000007 : GDB_FeatureClasses. Simplified version of GDB_Items of v10. Contains the layer geometry type in GeometryType and shape field name in ShapeField. The ObjectClassID field is related to the FID of GDB_ObjectClasses
a00000008 : GDB_FieldInfo. Contains information about some (but not all) fields of layers.

Overall, for v9 files, the most useful reserved table seems to be GDB_SystemCatalog, which links a layer name to its associated .gdbtable file. Using a00000007 in conjunction with a00000006 may be needed when GDB_SystemCatalog lists user tables that are not vector tables (rasters, relationships, ...).

Compressed Tables

Compressed tables are indicated by the presence of a ".cdf" file, which contains a compressed version of a layer. The encoding of CDF tables is significantly different from standard GDB tables.

Header

uint32: File identifier, either 0x43444623 or 0x43444632
uint32: Flags. If (flags & 0xff00) == 0x1000, the table is a version 10 table. If (flags & 0xff00) == 0x0900, it is a version 9 table.
16 bytes: Unique table UUID (See UUID field interpretation above for explanation)
For version 9 only: uint32: codepage. The code page value & 0xffff should be one of 0x2ff, 0x3ff, 0x4ff or 0x5ff
varint16: Offset for file TOC
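
The two checks described in this header (version from the flags, plausibility of the v9 code page) can be sketched as follows in Python. Both function names are hypothetical, and the values are assumed to have already been read from the file as integers. The parentheses around the mask matter in C-like languages, where == binds tighter than &.

```python
def cdf_table_version(flags: int) -> int:
    """Return 10 or 9 according to the version bits of the header flags."""
    version_bits = flags & 0xFF00
    if version_bits == 0x1000:
        return 10
    if version_bits == 0x0900:
        return 9
    # Only the two values above are documented; anything else is unknown.
    raise ValueError(f"unrecognised version bits: {version_bits:#06x}")


def codepage_looks_valid(codepage: int) -> bool:
    """Check the v9-only code page field against the documented values."""
    return (codepage & 0xFFFF) in (0x2FF, 0x3FF, 0x4FF, 0x5FF)
```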

TOC

16 bytes: object GUID, where GUID may be 0x010000000000000000000000000000 for "CDF Block", 0x02: "CDF Log", 0x03...: "CDF SINFO" (spatial information), 0x04...: "CDF_TABINFO" (table information), 0x14...: "SDC Block", 0x15...: "SDC PHYS", 0x16...: "SDC LOG". There is at most one each of block, log, sinfo, tabinfo and phys. There MUST be a BLOCK and a LOG entry, and for v9 files there must also be an "SDC PHYS" entry.
varint16: Offset of object in file

Field Info

First, seek to LOG offset from TOC

varuint: Field count
16 bytes: unknown

For each field:

varuint: number of UTF-16 characters (not bytes) of the name of the field
utf16: name of the field
varuint: field type. These differ from standard GDB field types. Where 1 = INT16, 4 = OBJECTID, 5 = FLOAT32, 6 = FLOAT64, 7 = STRING, 8 = GEOMETRY, 9 = DATETIME, 10 = UUID1, 12 = BINARY, 16 = RASTER, 17 = UUID2
varuint: unknown

For field type 4 (OBJECTID):

varuint: unknown

For field type 5 or 6 (FLOAT32/FLOAT64)

varuint: unknown
varuint: unknown

For field type 8 (GEOMETRY):

varuint: unknown

For all field types

varuint: unknown, must be 0
16 unknown bytes

Table Info

First, seek to TABINFO offset from TOC

varuint: number of UTF-16 characters (not bytes) of the name of the table
utf16: name of the table
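
The length-prefixed UTF-16 strings used for field names, the table name and the SRS WKT can be read with a sketch like the one below. It assumes that varuint is the usual variable-length encoding with 7 payload bits per byte, least significant group first, and the high bit set on every byte except the last. It also assumes that a "character" is one 2-byte UTF-16 code unit. The function names are hypothetical.

```python
def read_varuint(buf: bytes, pos: int):
    """Decode a varuint at pos; return (value, position after it)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not byte & 0x80:              # high bit clear: last byte
            return value, pos
        shift += 7


def read_counted_utf16(buf: bytes, pos: int):
    """Read a varuint character count, then that many UTF-16LE code units."""
    n_chars, pos = read_varuint(buf, pos)
    end = pos + 2 * n_chars              # count is in characters, not bytes
    if end > len(buf):
        raise ValueError("truncated string")
    return buf[pos:end].decode("utf-16-le"), end
```

Both helpers return the position following what they consumed, so successive fields can be read by threading `pos` through the calls.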

Spatial Info

First, seek to SINFO offset from TOC

float64: x min (Extent of layer)
float64: y min
float64: x max
float64: y max
float64: unknown -- maybe resolution?
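
A minimal Python sketch of reading these five doubles, assuming `buf` starts at the SINFO offset. The function name `read_extent` is hypothetical, and the fifth value is returned as-is since its meaning is not established.

```python
import struct


def read_extent(buf: bytes, pos: int = 0):
    """Read x min, y min, x max, y max and the unknown trailing float64.

    Returns (dict of values, position after the 40 bytes consumed).
    """
    xmin, ymin, xmax, ymax, unknown = struct.unpack_from("<5d", buf, pos)
    extent = {
        "xmin": xmin,
        "ymin": ymin,
        "xmax": xmax,
        "ymax": ymax,
        "unknown": unknown,  # maybe a resolution; role not confirmed
    }
    return extent, pos + 5 * 8
```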

If z or m is present, there appear to be two more sets of doubles for each, likely the z/m min/max, but in an unknown order.

varuint: number of UTF-16 characters (not bytes) of the WKT definition of the table's SRS
utf16: WKT definition of table's SRS

License

Formatting to Markdown done by Calvin Metcalf.

Note: the scope of the copyrighted material does, of course, not extend onto any source or binary code derived from the specification, that may be licensed under the terms that their author may see fit.
