This document describes an open metadata scheme by which MP4 multimedia containers may accommodate spatial and head-locked stereo audio. Comments are welcome on the spatial-media-discuss mailing list or by filing an issue on GitHub.
Spatial audio metadata is stored in a new box, `SA3D`, defined in this RFC.
Box Type: `SA3D`
Container: Sound Sample Description box (e.g., `mp4a`, `lpcm`, `sowt`, etc.)
Mandatory: No
Quantity: Zero or one
When present, provides additional information about the spatial audio content contained in this audio track.
    aligned(8) class SpatialAudioBox extends Box('SA3D') {
        unsigned int(8)  version;
        unsigned int(1)  head_locked_stereo;
        unsigned int(7)  ambisonic_type;
        unsigned int(32) ambisonic_order;
        unsigned int(8)  ambisonic_channel_ordering;
        unsigned int(8)  ambisonic_normalization;
        unsigned int(32) num_channels;
        for (i = 0; i < num_channels; i++) {
            unsigned int(32) channel_map;
        }
    }
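As a rough illustration of how a reader might decode this layout, the Python sketch below unpacks the payload of an `SA3D` box (the bytes that follow the 8-byte size/type header). It assumes big-endian fields, as is conventional for MP4 boxes, and that `head_locked_stereo` occupies the most significant bit of the byte it shares with the 7-bit `ambisonic_type`. The names `SA3D` (the dataclass) and `parse_sa3d_payload` are hypothetical and not part of this RFC.

```python
import struct
from dataclasses import dataclass
from typing import List


@dataclass
class SA3D:
    version: int
    head_locked_stereo: bool
    ambisonic_type: int
    ambisonic_order: int
    ambisonic_channel_ordering: int
    ambisonic_normalization: int
    channel_map: List[int]


def parse_sa3d_payload(payload: bytes) -> SA3D:
    """Decode an SA3D payload (everything after the box size/type header)."""
    # Fixed part: version, flag+type byte, order, ordering, normalization,
    # num_channels. ">" means big-endian with no padding (assumption).
    fixed = struct.Struct(">BBIBBI")
    if len(payload) < fixed.size:
        raise ValueError("SA3D payload too short")
    version, flags, order, ordering, norm, num_channels = fixed.unpack_from(payload, 0)
    if version != 0:
        raise ValueError(f"unsupported SA3D version {version}")
    if len(payload) < fixed.size + 4 * num_channels:
        raise ValueError("SA3D payload truncated in channel_map")
    channel_map = list(struct.unpack_from(f">{num_channels}I", payload, fixed.size))
    # Assumption: the 1-bit flag is the top bit, the 7-bit type the low bits.
    return SA3D(version, bool(flags >> 7), flags & 0x7F, order, ordering, norm, channel_map)
```

A caller would first locate the `SA3D` box inside the sample entry and pass only its payload bytes to this function.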
- `version` is an 8-bit unsigned integer that specifies the version of this box. Must be set to `0`.
- `head_locked_stereo` is a 1-bit flag that indicates whether the stored audio track contains head-locked stereo audio in addition to ambisonic audio. The flag should be set if the track contains head-locked stereo and unset otherwise.
- `ambisonic_type` is a 7-bit unsigned integer that specifies the type of ambisonic audio represented; the following values are defined:

`ambisonic_type` | Ambisonic Type Description |
---|---|
`0` | Periphonic: Indicates that the audio stored is a periphonic ambisonic sound field (i.e., full 3D). |
- `ambisonic_order` is a 32-bit unsigned integer that specifies the order of the ambisonic sound field. If `ambisonic_type` is `0` (periphonic), this is a non-negative integer representing the periphonic ambisonic order; in this case, it should take a value of `sqrt(n) - 1`, where `n` is the number of channels in the represented ambisonic audio data. For example, a periphonic ambisonic sound field with `ambisonic_order = 1` requires `(ambisonic_order + 1)^2 = 4` ambisonic components.
- `ambisonic_channel_ordering` is an 8-bit unsigned integer specifying the channel ordering (i.e., spherical harmonics component ordering) used in the represented ambisonic audio data; the following values are defined:

`ambisonic_channel_ordering` | Channel Ordering Description |
---|---|
`0` | ACN: The channel ordering used is the Ambisonic Channel Number (ACN) system. In this, given a spherical harmonic of degree `l` and order `m`, the corresponding ordering index `n` is given by `n = l * (l + 1) + m`. |
- `ambisonic_normalization` is an 8-bit unsigned integer specifying the normalization (i.e., spherical harmonics normalization) used in the represented ambisonic audio data; the following values are defined:

`ambisonic_normalization` | Normalization Description |
---|---|
`0` | SN3D: The normalization used is Schmidt semi-normalization (SN3D). In this, the spherical harmonic of degree `l` and order `m` is normalized according to `sqrt((2 - δ(m)) * ((l - m)! / (l + m)!))`, where `δ(m)` is the Kronecker delta function, such that `δ(0) = 1` and `δ(m) = 0` otherwise. |
- `num_channels` is a 32-bit unsigned integer specifying the number of audio channels contained in the given audio track.
- `channel_map` is a sequence of 32-bit unsigned integers that maps audio channels in a given audio track to ambisonic components, given the defined `ambisonic_channel_ordering`. The sequence of `channel_map` values should match the channel sequence within the given audio track.

  For the example case of `ambisonic_type = 0` (Periphonic), consider a 4-channel audio track containing ambisonic components W, X, Y, Z at channel indexes `0`, `1`, `2`, `3`, respectively. For `ambisonic_channel_ordering = 0` (ACN), the ordering of components should be W, Y, Z, X, so the `channel_map` sequence should be `0`, `2`, `3`, `1`.

  As a simpler example, for a 4-channel audio track containing ambisonic components W, Y, Z, X at channel indexes `0`, `1`, `2`, `3`, respectively, the `channel_map` sequence should be specified as `0`, `1`, `2`, `3` when `ambisonic_channel_ordering = 0` (ACN).

  For the example case of `ambisonic_type = 0` (Periphonic) with `head_locked_stereo = 1`, the stored audio will consist of `4` ambisonic components W, Y, Z, X in addition to head-locked stereo components L and R. In this case, the `SA3D` box will define `num_channels = 6` and a `channel_map` specified as `0`, `1`, `2`, `3`, `4`, `5`, indicating that the channels are laid out in the file as W, Y, Z, X, L, R. This representation extends to different layouts of ambisonic and head-locked stereo components. For example, a `channel_map` of `4`, `5`, `0`, `1`, `2`, `3` indicates that the layout of the stored audio is L, R, W, Y, Z, X.
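The relationships between `ambisonic_order`, `head_locked_stereo`, `num_channels`, and `channel_map` described above can be sketched as a few consistency checks. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the requirement that `channel_map` be a permutation of `0 .. num_channels - 1` is inferred from the examples in this document rather than stated by it.

```python
import math
from typing import Sequence


def ambisonic_order_from_channels(n: int) -> int:
    """Return sqrt(n) - 1 for a periphonic sound field with n ambisonic channels."""
    if n < 1:
        raise ValueError("need at least one ambisonic channel")
    root = math.isqrt(n)
    if root * root != n:
        raise ValueError(f"{n} is not a perfect square")
    return root - 1


def expected_num_channels(order: int, head_locked_stereo: bool) -> int:
    """(order + 1)^2 ambisonic components, plus L and R if the flag is set."""
    return (order + 1) ** 2 + (2 if head_locked_stereo else 0)


def is_plausible_channel_map(channel_map: Sequence[int], order: int,
                             head_locked_stereo: bool) -> bool:
    """Assumption: a valid map uses each index 0..num_channels-1 exactly once."""
    n = expected_num_channels(order, head_locked_stereo)
    return sorted(channel_map) == list(range(n))
```

For instance, `expected_num_channels(1, True)` gives the six channels of the head-locked example, and both `[0, 1, 2, 3, 4, 5]` and `[4, 5, 0, 1, 2, 3]` pass the plausibility check.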
Here is an example MP4 box hierarchy for a file containing the `SA3D` box:
- moov
  - trak
    - mdia
      - minf
        - stbl
          - stsd
            - mp4a
              - esds
              - SA3D

where the `SA3D` box has the following data:
Field Name | Value |
---|---|
`version` | `0` |
`ambisonic_type` | `0` |
`ambisonic_order` | `1` |
`ambisonic_channel_ordering` | `0` |
`ambisonic_normalization` | `0` |
`num_channels` | `4` |
`channel_map` | `0` |
`channel_map` | `2` |
`channel_map` | `3` |
`channel_map` | `1` |
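To make the example concrete, the sketch below serializes these values into a complete box: a 32-bit size, the four-character type `SA3D`, then the fields in declaration order. It assumes big-endian encoding and that the 1-bit `head_locked_stereo` flag shares a byte with the 7-bit `ambisonic_type` as its most significant bit. `build_sa3d_box` is a hypothetical helper, not part of this RFC.

```python
import struct
from typing import Sequence


def build_sa3d_box(order: int, channel_map: Sequence[int],
                   head_locked_stereo: bool = False, ambisonic_type: int = 0,
                   channel_ordering: int = 0, normalization: int = 0) -> bytes:
    """Serialize an SA3D box (header plus payload) as big-endian bytes."""
    # Assumption: flag in the top bit, ambisonic_type in the low 7 bits.
    flags = (int(head_locked_stereo) << 7) | (ambisonic_type & 0x7F)
    payload = struct.pack(">BBIBBI", 0, flags, order, channel_ordering,
                          normalization, len(channel_map))
    payload += struct.pack(f">{len(channel_map)}I", *channel_map)
    # Box header: total size (including the 8 header bytes) and type.
    return struct.pack(">I4s", 8 + len(payload), b"SA3D") + payload


box = build_sa3d_box(order=1, channel_map=[0, 2, 3, 1])
print(len(box), box.hex())
```

With the values from the table, the box is 36 bytes long: 8 bytes of header, 12 bytes of fixed fields, and 4 bytes for each of the four `channel_map` entries.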
The traditional notion of ambisonics is used, where the sound field is represented by spherical harmonics coefficients using the associated Legendre polynomials (without Condon-Shortley phase) as the basis functions. Thus, the spherical harmonic of degree `l` and order `m` at elevation `E` and azimuth `A` is given by:

    N(l, abs(m)) * P(l, abs(m), sin(E)) * T(m, A)
where:
- `N(l, m)` is the spherical harmonics normalization function used.
- `P(l, m, x)` is the (unnormalized) associated Legendre polynomial, without Condon-Shortley phase, of degree `l` and order `m` evaluated at `x`.
- `T(m, x)` is `sin(-m * x)` for `m < 0` and `cos(m * x)` otherwise.

The azimuth `A` is interpreted as follows:
- `A = 0`: The source is in front of the listener.
- `A` in `(0, pi/2)`: The source is in the forward-left quadrant.
- `A` in `(pi/2, pi)`: The source is in the back-left quadrant.
- `A` in `(-pi/2, 0)`: The source is in the forward-right quadrant.
- `A` in `(-pi, -pi/2)`: The source is in the back-right quadrant.

The elevation `E` is interpreted as follows:
- `E = 0`: The source is in the horizontal plane.
- `E` in `(0, pi/2]`: The source is above the listener.
- `E` in `[-pi/2, 0)`: The source is below the listener.
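
The definitions above can be combined into a small evaluator. The following is a minimal sketch, assuming SN3D normalization and ACN ordering (the only values this document defines); it uses a standard recurrence for the associated Legendre polynomial with the Condon-Shortley phase omitted. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def acn_index(l: int, m: int) -> int:
    """ACN ordering index n = l * (l + 1) + m."""
    return l * (l + 1) + m


def acn_to_degree_order(n: int) -> Tuple[int, int]:
    """Invert the ACN index: returns (degree l, order m)."""
    l = math.isqrt(n)
    return l, n - l * (l + 1)


def sn3d(l: int, m: int) -> float:
    """SN3D normalization N(l, |m|)."""
    m = abs(m)
    delta = 1 if m == 0 else 0
    return math.sqrt((2 - delta) * math.factorial(l - m) / math.factorial(l + m))


def legendre(l: int, m: int, x: float) -> float:
    """Associated Legendre P(l, m, x), 0 <= m <= l, no Condon-Shortley phase."""
    pmm = 1.0
    if m > 0:
        s = math.sqrt(max(0.0, 1.0 - x * x))
        for k in range(1, m + 1):
            pmm *= (2 * k - 1) * s  # builds (2m - 1)!! * (1 - x^2)^(m/2)
    if l == m:
        return pmm
    pmmp1 = x * (2 * m + 1) * pmm
    for ll in range(m + 2, l + 1):  # upward recurrence in degree
        pmm, pmmp1 = pmmp1, ((2 * ll - 1) * x * pmmp1 - (ll + m - 1) * pmm) / (ll - m)
    return pmmp1


def trig(m: int, a: float) -> float:
    """T(m, A): sin(-m * A) for m < 0, cos(m * A) otherwise."""
    return math.sin(-m * a) if m < 0 else math.cos(m * a)


def spherical_harmonic(l: int, m: int, azimuth: float, elevation: float) -> float:
    return sn3d(l, m) * legendre(l, abs(m), math.sin(elevation)) * trig(m, azimuth)


def encode_gains(order: int, azimuth: float, elevation: float) -> List[float]:
    """Gains for a point source, one per ambisonic component, in ACN order."""
    return [spherical_harmonic(*acn_to_degree_order(n), azimuth, elevation)
            for n in range((order + 1) ** 2)]
```

With these conventions, a source directly to the listener's left (`A = pi/2`, `E = 0`) yields first-order gains of approximately 1, 1, 0, 0 for W, Y, Z, X, which matches the azimuth being positive toward the left.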