Spatial Audio RFC (draft)

This document describes an open metadata scheme by which MP4 multimedia containers may accommodate spatial and head-locked stereo audio. Comments are welcome on the spatial-media-discuss mailing list or by filing an issue on GitHub.

Metadata Format

MP4

Spatial audio metadata is stored in a new box, SA3D, defined in this RFC.

Spatial Audio Box (SA3D)

Definition

Box Type: SA3D
Container: Sound Sample Description box (e.g., mp4a, lpcm, sowt)
Mandatory: No
Quantity: Zero or one

When present, this box provides additional information about the spatial audio content contained in this audio track.

Syntax

aligned(8) class SpatialAudioBox extends Box('SA3D') {
    unsigned int(8)  version;
    unsigned int(1)  head_locked_stereo;
    unsigned int(7)  ambisonic_type;
    unsigned int(32) ambisonic_order;
    unsigned int(8)  ambisonic_channel_ordering;
    unsigned int(8)  ambisonic_normalization;
    unsigned int(32) num_channels;
    for (i = 0; i < num_channels; i++) {
        unsigned int(32) channel_map;
    }
}
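The field layout above can be read with a few lines of Python. This is a minimal sketch, not a reference implementation: the names `SA3D` and `parse_sa3d_payload` are hypothetical, the input is assumed to be the box body only (the bytes after the standard 8-byte size/type header), and the 1-bit head_locked_stereo flag described under Semantics is assumed to occupy the most significant bit of the byte that precedes ambisonic_order, with ambisonic_type in the remaining 7 bits. MP4 integer fields are big-endian.

```python
import struct
from typing import NamedTuple, Tuple


class SA3D(NamedTuple):
    """Hypothetical container for the decoded SA3D fields."""
    version: int
    head_locked_stereo: bool
    ambisonic_type: int
    ambisonic_order: int
    ambisonic_channel_ordering: int
    ambisonic_normalization: int
    num_channels: int
    channel_map: Tuple[int, ...]


# Fixed-size part: version, flag/type byte, order, ordering, normalization,
# num_channels. ">" selects big-endian with no padding (12 bytes total).
_FIXED = struct.Struct(">BBIBBI")


def parse_sa3d_payload(payload: bytes) -> SA3D:
    """Parse the SA3D box body (bytes after the 8-byte box header)."""
    if len(payload) < _FIXED.size:
        raise ValueError("SA3D payload too short")
    version, type_byte, order, ordering, norm, num_channels = _FIXED.unpack_from(payload, 0)
    if version != 0:
        raise ValueError(f"unsupported SA3D version {version}")
    if len(payload) < _FIXED.size + 4 * num_channels:
        raise ValueError("SA3D payload truncated inside channel_map")
    channel_map = struct.unpack_from(f">{num_channels}I", payload, _FIXED.size)
    return SA3D(
        version=version,
        # Assumption: the flag is the top bit, ambisonic_type the low 7 bits.
        head_locked_stereo=bool(type_byte >> 7),
        ambisonic_type=type_byte & 0x7F,
        ambisonic_order=order,
        ambisonic_channel_ordering=ordering,
        ambisonic_normalization=norm,
        num_channels=num_channels,
        channel_map=channel_map,
    )
```

A caller would typically locate the SA3D box inside the sound sample description first and pass only its body to this function.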

Semantics

version is an 8-bit unsigned integer that specifies the version of this box. Must be set to 0.
head_locked_stereo is a 1-bit flag used to indicate that the stored audio track contains head-locked stereo audio in addition to ambisonics audio. The flag should be set if the track contains head-locked stereo and unset otherwise.
ambisonic_type is a 7-bit unsigned integer that specifies the type of ambisonic audio represented; the following values are defined:

`ambisonic_type`	Ambisonic Type Description
`0`	Periphonic: Indicates that the audio stored is a periphonic ambisonic sound field (i.e., full 3D).

ambisonic_order is a 32-bit unsigned integer that specifies the order of the ambisonic sound field. If the ambisonic_type is 0 (periphonic), this is a non-negative integer representing the periphonic ambisonic order; in this case, it should take a value of sqrt(n) - 1, where n is the number of ambisonic channels in the represented audio data (head-locked stereo channels, if present, are not counted). For example, a periphonic ambisonic sound field with ambisonic_order = 1 requires (ambisonic_order + 1)^2 = 4 ambisonic components.
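The relationship between channel count and order can be checked mechanically. The following is a small sketch; the function name `periphonic_order` is hypothetical, and it assumes the count passed in covers ambisonic channels only.

```python
import math


def periphonic_order(num_ambisonic_channels: int) -> int:
    """Return sqrt(n) - 1, rejecting any n that is not a perfect square."""
    n = num_ambisonic_channels
    if n < 1:
        raise ValueError("at least one ambisonic channel is required")
    root = math.isqrt(n)  # exact integer square root, no float rounding
    if root * root != n:
        raise ValueError(f"{n} channels do not form a full periphonic set")
    return root - 1
```

For instance, 4 channels give order 1, 9 channels give order 2, and 16 channels give order 3, while a count such as 5 is rejected.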
ambisonic_channel_ordering is an 8-bit unsigned integer specifying the channel ordering (i.e., spherical harmonics component ordering) used in the represented ambisonic audio data; the following values are defined:

`ambisonic_channel_ordering`	Channel Ordering Description
`0`	ACN: The channel ordering used is the Ambisonic Channel Number (ACN) system. In this, given a spherical harmonic of degree `l` and order `m`, the corresponding ordering index `n` is given by `n = l * (l + 1) + m`.
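As an illustration of the ACN formula, the sketch below computes the index from a degree and order and also inverts it. The function names are hypothetical; the inverse relies on the fact that degree `l` occupies indices `l^2` through `(l + 1)^2 - 1`.

```python
import math
from typing import Tuple


def acn_index(l: int, m: int) -> int:
    """ACN index n = l * (l + 1) + m for degree l and order m, with |m| <= l."""
    if l < 0 or abs(m) > l:
        raise ValueError("require l >= 0 and -l <= m <= l")
    return l * (l + 1) + m


def acn_to_degree_order(n: int) -> Tuple[int, int]:
    """Invert acn_index: recover (l, m) from an ACN index n >= 0."""
    if n < 0:
        raise ValueError("ACN index must be non-negative")
    l = math.isqrt(n)
    return l, n - l * (l + 1)
```

At first order this yields indices 0, 1, 2, 3 for (l, m) = (0, 0), (1, -1), (1, 0), (1, 1), which corresponds to the W, Y, Z, X ordering used in the examples in this document.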

ambisonic_normalization is an 8-bit unsigned integer specifying the normalization (i.e., spherical harmonics normalization) used in the represented ambisonic audio data; the following values are defined:

`ambisonic_normalization`	Normalization Description
`0`	SN3D: The normalization used is Schmidt semi-normalization (SN3D). In this, the spherical harmonic of degree `l` and order `m` is normalized according to `sqrt((2 - δ(m)) * ((l - m)! / (l + m)!))`, where `δ(m)` is the Kronecker delta function, such that `δ(0) = 1` and `δ(m) = 0` otherwise.
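The SN3D factor can be evaluated directly from the formula above. This is a sketch only; the function name `sn3d` is hypothetical, and it takes the absolute value of `m` on the assumption that the factor is applied as N(l, abs(m)), as in Appendix 1.

```python
import math


def sn3d(l: int, m: int) -> float:
    """SN3D factor sqrt((2 - delta(m)) * (l - |m|)! / (l + |m|)!)."""
    m = abs(m)  # assumption: the factor depends only on |m|
    if l < 0 or m > l:
        raise ValueError("require l >= 0 and |m| <= l")
    delta = 1 if m == 0 else 0  # Kronecker delta: 1 only when m == 0
    return math.sqrt((2 - delta) * math.factorial(l - m) / math.factorial(l + m))
```

All zeroth- and first-order factors evaluate to 1 under this normalization; at second order, `sn3d(2, 1)` equals sqrt(1/3) and `sn3d(2, 2)` equals sqrt(1/12).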

num_channels is a 32-bit unsigned integer specifying the number of audio channels contained in the given audio track.
channel_map is a sequence of 32-bit unsigned integers that maps audio channels in a given audio track to ambisonic components, given the defined ambisonic_channel_ordering. The sequence of channel_map values should match the channel sequence within the given audio track.

For the example case of ambisonic_type = 0 (Periphonic), consider a 4-channel audio track containing ambisonic components W, X, Y, Z at channel indexes 0, 1, 2, 3, respectively. For ambisonic_channel_ordering = 0 (ACN), the ordering of components should be W, Y, Z, X, so the channel_map sequence should be 0, 2, 3, 1.

As a simpler example, for a 4-channel audio track containing ambisonic components W, Y, Z, X at channel indexes 0, 1, 2, 3, respectively, the channel_map sequence should be specified as 0, 1, 2, 3 when ambisonic_channel_ordering = 0 (ACN).

For the example case of ambisonic_type = 0 (Periphonic) with head_locked_stereo = 1, the stored audio will consist of 4 ambisonic components W, Y, Z, X in addition to head-locked stereo components L and R. In this case, the SA3D box will define num_channels = 6 and a channel_map specified as 0, 1, 2, 3, 4, 5, indicating that the channels are laid out in the file as W, Y, Z, X, L, R. This representation extends to different layouts of ambisonic and head-locked stereo components. For example, a channel_map of 4, 5, 0, 1, 2, 3 indicates that the stored audio is laid out as L, R, W, Y, Z, X.
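A reader of the box may want to check that num_channels is consistent with the other fields. The sketch below assumes that a periphonic track carries exactly (ambisonic_order + 1)^2 ambisonic channels plus, when head_locked_stereo is set, two stereo channels and nothing else; that exact-match rule is an assumption drawn from the examples above, and the function names are hypothetical.

```python
def expected_num_channels(ambisonic_order: int, head_locked_stereo: bool) -> int:
    """Channel count implied by the order and the head-locked stereo flag."""
    ambisonic = (ambisonic_order + 1) ** 2
    stereo = 2 if head_locked_stereo else 0  # L and R, when present
    return ambisonic + stereo


def channel_count_is_consistent(
    num_channels: int, ambisonic_order: int, head_locked_stereo: bool
) -> bool:
    """True when num_channels matches the count implied by the other fields."""
    return num_channels == expected_num_channels(ambisonic_order, head_locked_stereo)
```

With ambisonic_order = 1 this gives 4 channels without head-locked stereo and 6 with it, matching the examples in this section.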

Example

Here is an example MP4 box hierarchy for a file containing the SA3D box:

moov
- trak
  - mdia
    - minf
      - stbl
        - stsd
          - mp4a
            - esds
            - SA3D

where the SA3D box has the following data:

Field Name	Value
`version`	`0`
`head_locked_stereo`	`0`
`ambisonic_type`	`0`
`ambisonic_order`	`1`
`ambisonic_channel_ordering`	`0`
`ambisonic_normalization`	`0`
`num_channels`	`4`
`channel_map`	`0`
`channel_map`	`2`
`channel_map`	`3`
`channel_map`	`1`
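To make the example concrete, the field values in the table can be serialized into the bytes of a complete box. This is a hedged sketch: `build_sa3d_box` is a hypothetical helper, it fixes version, ambisonic_type, ambisonic_channel_ordering, and ambisonic_normalization at 0, and it assumes the head_locked_stereo flag is the most significant bit of the byte that also carries ambisonic_type. The 8-byte header (32-bit big-endian size including the header, followed by the four-character type) is the standard MP4 box header.

```python
import struct
from typing import Sequence


def build_sa3d_box(
    ambisonic_order: int,
    channel_map: Sequence[int],
    head_locked_stereo: bool = False,
) -> bytes:
    """Serialize an SA3D box (header plus body) for periphonic ACN/SN3D audio."""
    # Assumption: flag in the top bit, ambisonic_type (0) in the low 7 bits.
    type_byte = 0x80 if head_locked_stereo else 0x00
    body = struct.pack(
        ">BBIBBI",
        0,                 # version
        type_byte,         # head_locked_stereo + ambisonic_type
        ambisonic_order,
        0,                 # ambisonic_channel_ordering: ACN
        0,                 # ambisonic_normalization: SN3D
        len(channel_map),  # num_channels
    )
    body += struct.pack(f">{len(channel_map)}I", *channel_map)
    # Box header: total size (including these 8 bytes), then the type.
    return struct.pack(">I4s", 8 + len(body), b"SA3D") + body


example_box = build_sa3d_box(1, [0, 2, 3, 1])
```

For the table above, the body is 12 bytes of fixed fields plus 4 × 4 bytes of channel_map, so the whole box is 36 bytes long.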

Appendix 1 - Ambisonics

The traditional notion of ambisonics is used, where the sound field is represented by spherical harmonics coefficients using the associated Legendre polynomials (without Condon-Shortley phase) as the basis functions. Thus, the spherical harmonic of degree l and order m at elevation E and azimuth A is given by:

N(l, abs(m)) * P(l, abs(m), sin(E)) * T(m, A)

where:

N(l, m) is the spherical harmonics normalization function used.
P(l, m, x) is the (unnormalized) associated Legendre polynomial, without Condon-Shortley phase, of degree l and order m evaluated at x.
T(m, x) is sin(-m * x) for m < 0 and cos(m * x) otherwise.
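The expression above can be evaluated numerically as follows. This is a sketch under stated assumptions: SN3D is used for N, the associated Legendre polynomial is computed with the standard upward recurrence in degree (without the Condon-Shortley phase), and all function names are hypothetical.

```python
import math
from typing import List


def sn3d(l: int, m: int) -> float:
    """SN3D normalization N(l, m) for 0 <= m <= l."""
    delta = 1 if m == 0 else 0
    return math.sqrt((2 - delta) * math.factorial(l - m) / math.factorial(l + m))


def legendre(l: int, m: int, x: float) -> float:
    """Unnormalized associated Legendre P(l, m, x), no Condon-Shortley phase."""
    # Start from P(m, m, x) = (2m - 1)!! * (1 - x^2)^(m / 2).
    root = math.sqrt(max(0.0, 1.0 - x * x))
    pmm = 1.0
    for k in range(1, m + 1):
        pmm *= (2 * k - 1) * root
    if l == m:
        return pmm
    # P(m + 1, m, x) = x * (2m + 1) * P(m, m, x), then recurse upward in degree.
    prev, cur = pmm, x * (2 * m + 1) * pmm
    for k in range(m + 2, l + 1):
        prev, cur = cur, ((2 * k - 1) * x * cur - (k + m - 1) * prev) / (k - m)
    return cur


def spherical_harmonic(l: int, m: int, elevation: float, azimuth: float) -> float:
    """N(l, |m|) * P(l, |m|, sin(E)) * T(m, A), angles in radians."""
    am = abs(m)
    t = math.sin(-m * azimuth) if m < 0 else math.cos(m * azimuth)
    return sn3d(l, am) * legendre(l, am, math.sin(elevation)) * t


def encode_first_order(elevation: float, azimuth: float) -> List[float]:
    """Gains for a point source in ACN order (W, Y, Z, X)."""
    return [
        spherical_harmonic(l, m, elevation, azimuth)
        for l in range(2)
        for m in range(-l, l + 1)
    ]
```

At first order this reduces to W = 1, Y = cos(E) sin(A), Z = sin(E), X = cos(E) cos(A), so a source directly in front (E = 0, A = 0) produces gains of 1 on W and X and 0 on Y and Z.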

Conventions

Azimuth

A = 0: The source is in front of the listener.
A in (0, pi/2): The source is in the forward-left quadrant.
A in (pi/2, pi): The source is in the back-left quadrant.
A in (-pi/2, 0): The source is in the forward-right quadrant.
A in (-pi, -pi/2): The source is in the back-right quadrant.

Elevation

E = 0: The source is in the horizontal plane.
E in (0, pi/2]: The source is above the listener.
E in [-pi/2, 0): The source is below the listener.
