FileFormatDrafts

Christian Schneider edited this page May 27, 2018 · 6 revisions

File Format drafts

Note: At the moment all of these pages are drafts, so feel free to add anything you think is missing or to correct errors. There is a Gitter chat room where things can be discussed. Topics where further discussion might be useful are marked with (std) ("subject to discussion").

Current draft by Márcio Pais: https://files.gitter.im/encode-ru-Community-Archiver/Lobby/cnFl/Fairytale-File-Format.pdf


Alternative description by Christian Schneider:

The length column in the tables below can contain "VLI", which stands for "Variable Length Integer". A VLI is 1-9 bytes long and encodes a 64-bit integer as follows: the most significant bit of each byte is a continuation flag. If it is set, more bytes follow; if it is clear, this byte is the last one. The remaining 7 bits of each byte carry 7 bits of the integer value, least significant group first. For example, the VLI 11010101 00110110 encodes the binary value 0110110 1010101, which is 6997 in decimal. Note that the 7-bit groups are stored in the reverse of the order in which the binary value is written. The code for encoding and decoding VLIs can be found in fairytale.cpp, methods vliEncode and vliDecode.
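The VLI rules above can be sketched as follows. This is an illustration only and not the vliEncode/vliDecode code from fairytale.cpp; the names encodeVli, decodeVli and kMaxVliBytes are hypothetical. Nine bytes of 7 payload bits hold 63 bits, so the sketch assumes values below 2^63 and rejects anything larger.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Upper bound taken from the "1-9 bytes" statement in the draft.
constexpr std::size_t kMaxVliBytes = 9;

// Hypothetical encoder: emits 7 bits per byte, least significant group first.
std::vector<std::uint8_t> encodeVli(std::uint64_t value) {
    // Assumption: 9 bytes * 7 bits = 63 bits, so bit 63 must be clear.
    if (value >> 63) throw std::out_of_range("value needs more than 9 VLI bytes");
    std::vector<std::uint8_t> out;
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));  // flag set: more bytes follow
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));  // flag clear: last byte
    return out;
}

// Hypothetical decoder: reads one VLI starting at buf[pos].
// On success stores the result in value, advances pos and returns true.
// Returns false (pos unchanged) if the input ends early or exceeds 9 bytes.
bool decodeVli(const std::vector<std::uint8_t>& buf, std::size_t& pos, std::uint64_t& value) {
    std::uint64_t result = 0;
    for (std::size_t i = 0; i < kMaxVliBytes && pos + i < buf.size(); ++i) {
        const std::uint8_t byte = buf[pos + i];
        result |= static_cast<std::uint64_t>(byte & 0x7F) << (7 * i);
        if ((byte & 0x80) == 0) {
            value = result;
            pos += i + 1;
            return true;
        }
    }
    return false;
}
```

Encoding 6997 with this sketch yields the two bytes 11010101 00110110 from the example, and decoding them returns 6997 again.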

Archive file format

When files are compressed by Fairytale, an archive file with extension .ftl (std) is created that contains everything that is needed to restore the original files. The current draft of the format is:

| Description | Length |
|---|---|
| Magic bytes | 6 bytes (std) |
| Offset to the first structure | 8 bytes |
| Compressed block data | variable |
| Directory tree structure | variable |
| File structure | variable |
| Codec structure | variable |
| Block segmentation structure | variable |

Magic bytes

The beginning of the file identifies it as a Fairytale archive. Storing the version number is important because different versions of Fairytale will very likely create incompatible files that other versions cannot process.

| Description | Length |
|---|---|
| "FTL" (std) | 3 bytes |
| version number | 3 bytes (std) |

Offset to the first structure

As the size of the compressed data is not fixed, this offset is stored so that a reader can skip the data. This allows reading the "meta" structures that follow the compressed data without parsing the data itself.
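A reader of the fixed-size header described so far might look roughly like this. It is a sketch under assumptions the draft does not settle: the 8-byte offset is taken to be little-endian, the three version bytes are kept as raw bytes because their meaning is not defined yet, and the names ArchiveHeader, kHeaderSize and parseHeader are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical in-memory view of the first 14 bytes of an archive.
struct ArchiveHeader {
    std::uint8_t version[3];        // raw version bytes; interpretation is still open (std)
    std::uint64_t structureOffset;  // "Offset to the first structure"
};

// 3 bytes "FTL" + 3 bytes version + 8 bytes offset.
constexpr std::size_t kHeaderSize = 3 + 3 + 8;

// Returns true if the buffer starts with a plausible header.
bool parseHeader(const std::vector<std::uint8_t>& file, ArchiveHeader& header) {
    if (file.size() < kHeaderSize) return false;
    if (std::memcmp(file.data(), "FTL", 3) != 0) return false;  // magic mismatch
    std::memcpy(header.version, file.data() + 3, 3);
    header.structureOffset = 0;
    for (int i = 0; i < 8; ++i) {
        // Assumption: little-endian byte order for the offset.
        header.structureOffset |= static_cast<std::uint64_t>(file[6 + i]) << (8 * i);
    }
    return true;
}
```

With the offset in hand, a reader can seek past the compressed block data and start parsing the structures directly. Whether the offset counts from the start of the file or from the end of the header is not stated in the draft, so the sketch only returns the stored value.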

Structures

Everything following the compressed data is a "structure" with the following format:

| Description | Length |
|---|---|
| Structure size in bytes | VLI |
| Data | variable |
| CRC32 checksum | 4 bytes |

The checksum takes both the first field (structure size) and the data into account.
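Reading one structure and verifying its checksum could be sketched like this. Several details are assumptions because the draft leaves them open: the CRC32 variant is taken to be the common reflected polynomial 0xEDB88320 (as used by zlib and PNG), the checksum is assumed to be stored little-endian, and "structure size" is assumed to count only the data bytes. The names crc32, decodeVli and readStructure are hypothetical and not taken from fairytale.cpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32 with the reflected polynomial 0xEDB88320 (assumed variant).
std::uint32_t crc32(const std::uint8_t* data, std::size_t size) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < size; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Minimal VLI reader (7 bits per byte, least significant group first, at most 9 bytes).
bool decodeVli(const std::vector<std::uint8_t>& buf, std::size_t& pos, std::uint64_t& value) {
    std::uint64_t result = 0;
    for (std::size_t i = 0; i < 9 && pos + i < buf.size(); ++i) {
        const std::uint8_t byte = buf[pos + i];
        result |= static_cast<std::uint64_t>(byte & 0x7F) << (7 * i);
        if ((byte & 0x80) == 0) {
            value = result;
            pos += i + 1;
            return true;
        }
    }
    return false;
}

// Reads one structure at buf[pos]: size (VLI), data, CRC32 (4 bytes).
// On success copies the data, advances pos past the checksum and returns true.
bool readStructure(const std::vector<std::uint8_t>& buf, std::size_t& pos,
                   std::vector<std::uint8_t>& data) {
    const std::size_t start = pos;
    std::size_t cursor = pos;
    std::uint64_t size = 0;
    if (!decodeVli(buf, cursor, size)) return false;
    const std::size_t remaining = buf.size() - cursor;
    if (size > remaining || remaining - size < 4) return false;  // data or checksum truncated
    const std::size_t dataEnd = cursor + static_cast<std::size_t>(size);
    std::uint32_t stored = 0;
    for (int i = 0; i < 4; ++i) {
        stored |= static_cast<std::uint32_t>(buf[dataEnd + i]) << (8 * i);  // assumed little-endian
    }
    // The checksum covers the size field and the data, as stated in the draft.
    if (crc32(buf.data() + start, dataEnd - start) != stored) return false;
    data.assign(buf.begin() + cursor, buf.begin() + dataEnd);
    pos = dataEnd + 4;
    return true;
}
```

Because the size field is part of the checksummed range, a corrupted length is detected in the same way as corrupted data, provided the damaged length still points inside the buffer.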

Directory tree structure

(to be done)

File structure

The file structure contains data for one or more files stored in the archive. For each of these files, this format is used:

| Description | Length |
|---|---|
| Directory ID | VLI |
| Length of filename | VLI |
| Filename | variable |
| Length of metadata (std) | VLI |
| File metadata | variable (std) |
| Number of blocks | VLI |
| Block 0 ID | VLI |
| ... | |
| Block N ID | VLI |
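Parsing a single file entry with this layout might be sketched as follows. The sketch assumes that the filename and metadata are opaque byte strings (the draft does not specify an encoding) and that "Number of blocks" is exactly the count of block IDs that follow. FileEntry, decodeVli, readLengthPrefixed and parseFileEntry are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory form of one file entry.
struct FileEntry {
    std::uint64_t directoryId = 0;
    std::string filename;                 // raw bytes; encoding not defined by the draft
    std::vector<std::uint8_t> metadata;   // opaque until "File metadata" is specified
    std::vector<std::uint64_t> blockIds;
};

// Minimal VLI reader (7 bits per byte, least significant group first, at most 9 bytes).
bool decodeVli(const std::vector<std::uint8_t>& buf, std::size_t& pos, std::uint64_t& value) {
    std::uint64_t result = 0;
    for (std::size_t i = 0; i < 9 && pos + i < buf.size(); ++i) {
        const std::uint8_t byte = buf[pos + i];
        result |= static_cast<std::uint64_t>(byte & 0x7F) << (7 * i);
        if ((byte & 0x80) == 0) {
            value = result;
            pos += i + 1;
            return true;
        }
    }
    return false;
}

// Reads a VLI length followed by that many raw bytes.
bool readLengthPrefixed(const std::vector<std::uint8_t>& buf, std::size_t& pos,
                        std::vector<std::uint8_t>& out) {
    std::size_t cursor = pos;
    std::uint64_t length = 0;
    if (!decodeVli(buf, cursor, length) || length > buf.size() - cursor) return false;
    const std::size_t n = static_cast<std::size_t>(length);
    out.assign(buf.begin() + cursor, buf.begin() + cursor + n);
    pos = cursor + n;
    return true;
}

// Parses one file entry at buf[pos]; pos only advances on success.
bool parseFileEntry(const std::vector<std::uint8_t>& buf, std::size_t& pos, FileEntry& entry) {
    std::size_t cursor = pos;
    FileEntry result;
    std::vector<std::uint8_t> name;
    std::uint64_t blockCount = 0;
    if (!decodeVli(buf, cursor, result.directoryId)) return false;
    if (!readLengthPrefixed(buf, cursor, name)) return false;
    result.filename.assign(name.begin(), name.end());
    if (!readLengthPrefixed(buf, cursor, result.metadata)) return false;
    // Every block ID takes at least one byte, so a larger count cannot be valid.
    if (!decodeVli(buf, cursor, blockCount) || blockCount > buf.size() - cursor) return false;
    for (std::uint64_t i = 0; i < blockCount; ++i) {
        std::uint64_t id = 0;
        if (!decodeVli(buf, cursor, id)) return false;
        result.blockIds.push_back(id);
    }
    entry = result;
    pos = cursor;
    return true;
}
```

Since the file structure holds one or more entries back to back, a caller would invoke parseFileEntry repeatedly until the structure's data is exhausted.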

File metadata

(to be done)

Codec structure

(to be done)

Block segmentation structure

(to be done)

Recovery file format

The recovery file format is intended to protect against different types of corruption. It is designed as a wrapper around the Fairytale file format, similar to .tar.gz, so it can also be used independently of Fairytale. During decompression, Fairytale will check for the recovery header. If it is present, Fairytale will use I/O classes to transparently access the Fairytale file protected inside the recovery format.

Features

  • protection against flipped bits, whether single bits or whole HDD sectors
  • recovery from failed storage media
  • multi part archives
  • encryption?

File format

Data will be split into blocks, which should ideally correspond to file system blocks / HDD sectors. 4 KiB may be a reasonable size. Each block has the following structure:

  • Marker: 2 bytes
  • UUID: 8 bytes
  • Frame ID: VLI (starts with 1, indicator for last block is 0)
  • Recovery parameters: possibly only present in blocks 1, 2, 4, 8, 16, ... and 0
  • Payload data
  • Checksum: 4 bytes CRC32C
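Validating one such block could be sketched as below. Much of this is assumption, since the layout is still in flux: the checksum is taken to be the last 4 bytes of the block, stored little-endian and covering everything before it; the recovery parameters are skipped because their format is not defined; and the marker is returned as raw bytes because its value is not specified. crc32c, FrameHeader and parseFrame are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
std::uint32_t crc32c(const std::uint8_t* data, std::size_t size) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < size; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Hypothetical view of the fixed leading fields of a block.
struct FrameHeader {
    std::uint8_t marker[2];
    std::uint8_t uuid[8];
    std::uint64_t frameId;   // starts with 1; 0 marks the last block
    std::size_t headerSize;  // bytes used by marker, UUID and frame ID
};

// Verifies the trailing CRC32C and extracts marker, UUID and frame ID.
bool parseFrame(const std::vector<std::uint8_t>& frame, FrameHeader& header) {
    const std::size_t fixed = 2 + 8;                      // marker + UUID
    if (frame.size() < fixed + 1 + 4) return false;       // need at least a 1-byte VLI and the CRC
    const std::size_t bodySize = frame.size() - 4;
    std::uint32_t stored = 0;
    for (int i = 0; i < 4; ++i) {
        stored |= static_cast<std::uint32_t>(frame[bodySize + i]) << (8 * i);  // assumed little-endian
    }
    if (crc32c(frame.data(), bodySize) != stored) return false;
    std::memcpy(header.marker, frame.data(), 2);
    std::memcpy(header.uuid, frame.data() + 2, 8);
    // Frame ID: VLI with 7 bits per byte, least significant group first.
    std::uint64_t id = 0;
    for (std::size_t i = 0; i < 9 && fixed + i < bodySize; ++i) {
        const std::uint8_t byte = frame[fixed + i];
        id |= static_cast<std::uint64_t>(byte & 0x7F) << (7 * i);
        if ((byte & 0x80) == 0) {
            header.frameId = id;
            header.headerSize = fixed + i + 1;
            return true;
        }
    }
    return false;  // frame ID VLI did not terminate before the checksum
}
```

Checking the checksum before interpreting any field means a flipped bit anywhere in the block is caught first, which matches the first feature listed above.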

This layout may still change. The important point for the Fairytale format right now is that a reader only needs to check the first two bytes to decide how to read the data. This way the recovery format can be implemented later.

Open questions:

  • How to protect against lost frames as efficiently as possible?