Spatial Audio RFC (draft)

This document describes an open metadata scheme by which MP4 multimedia containers may accommodate spatial and head-locked stereo audio. Comments are welcome on the spatial-media-discuss mailing list or by filing an issue on GitHub.


Metadata Format

MP4

Spatial audio metadata is stored in a new box, SA3D, defined in this RFC.

Spatial Audio Box (SA3D)

Definition

Box Type: SA3D
Container: Sound Sample Description box (e.g., mp4a, lpcm, sowt)
Mandatory: No
Quantity: Zero or one

When present, this box provides additional information about the spatial audio content contained in the audio track.

Syntax
aligned(8) class SpatialAudioBox extends Box('SA3D') {
    unsigned int(8)  version;
    unsigned int(1)  head_locked_stereo;
    unsigned int(7)  ambisonic_type;
    unsigned int(32) ambisonic_order;
    unsigned int(8)  ambisonic_channel_ordering;
    unsigned int(8)  ambisonic_normalization;
    unsigned int(32) num_channels;
    for (i = 0; i < num_channels; i++) {
        unsigned int(32) channel_map;
    }
}
Semantics
  • version is an 8-bit unsigned integer that specifies the version of this box. Must be set to 0.

  • head_locked_stereo is a 1-bit flag used to indicate that the stored audio track contains head-locked stereo audio in addition to ambisonics audio. The flag should be set if the track contains head-locked stereo and unset otherwise.

  • ambisonic_type is a 7-bit unsigned integer that specifies the type of ambisonic audio represented; the following values are defined:

| ambisonic_type | Ambisonic Type Description |
|:---------------|:---------------------------|
| 0 | Periphonic: Indicates that the audio stored is a periphonic ambisonic sound field (i.e., full 3D). |

  • ambisonic_order is a 32-bit unsigned integer that specifies the order of the ambisonic sound field. If the ambisonic_type is 0 (periphonic), this is a non-negative integer representing the periphonic ambisonic order; in this case, it should take a value of sqrt(n) - 1, where n is the number of channels in the represented ambisonic audio data. For example, a periphonic ambisonic sound field with ambisonic_order = 1 requires (ambisonic_order + 1)^2 = 4 ambisonic components.

  • ambisonic_channel_ordering is an 8-bit unsigned integer specifying the channel ordering (i.e., spherical harmonics component ordering) used in the represented ambisonic audio data; the following values are defined:

| ambisonic_channel_ordering | Channel Ordering Description |
|:---------------------------|:-----------------------------|
| 0 | ACN: The channel ordering used is the Ambisonic Channel Number (ACN) system. In this, given a spherical harmonic of degree l and order m, the corresponding ordering index n is given by n = l * (l + 1) + m. |

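
As a minimal sketch of the ACN formula above, the following Python helpers convert between a (degree, order) pair and its ACN index. The function names `acn_index` and `acn_to_degree_order` are illustrative and are not defined by this RFC.

```python
import math


def acn_index(l: int, m: int) -> int:
    """Return the ACN index n = l * (l + 1) + m for degree l and order m."""
    if l < 0 or abs(m) > l:
        raise ValueError("require l >= 0 and -l <= m <= l")
    return l * (l + 1) + m


def acn_to_degree_order(n: int):
    """Invert the ACN formula and return the pair (l, m) for index n."""
    if n < 0:
        raise ValueError("ACN index must be non-negative")
    l = math.isqrt(n)      # degree is floor(sqrt(n))
    m = n - l * (l + 1)    # order is the remaining signed offset
    return l, m
```

For first order, indices 0 through 3 correspond to (0, 0), (1, -1), (1, 0) and (1, 1), which is the W, Y, Z, X ordering used in the examples in this document.
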
  • ambisonic_normalization is an 8-bit unsigned integer specifying the normalization (i.e., spherical harmonics normalization) used in the represented ambisonic audio data; the following values are defined:

| ambisonic_normalization | Normalization Description |
|:------------------------|:--------------------------|
| 0 | SN3D: The normalization used is Schmidt semi-normalization (SN3D). In this, the spherical harmonic of degree l and order m is normalized according to sqrt((2 - δ(m)) * ((l - m)! / (l + m)!)), where δ(m) is the Kronecker delta function, such that δ(0) = 1 and δ(m) = 0 otherwise. |

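
The SN3D expression above can be evaluated directly. The sketch below is one possible Python rendering; the name `sn3d_normalization` is hypothetical, and the code takes the absolute value of m on the assumption that the factor is always evaluated as N(l, abs(m)), as in Appendix 1.

```python
import math


def sn3d_normalization(l: int, m: int) -> float:
    """SN3D factor sqrt((2 - delta(m)) * (l - |m|)! / (l + |m|)!)."""
    m = abs(m)  # assumption: the sign of m is dropped, as in N(l, abs(m))
    if l < 0 or m > l:
        raise ValueError("require l >= 0 and |m| <= l")
    delta = 1 if m == 0 else 0  # Kronecker delta: 1 only when m == 0
    return math.sqrt((2 - delta) * math.factorial(l - m) / math.factorial(l + m))
```

Under this formula every component with l <= 1 has a factor of 1, while higher degrees are scaled down, for example sqrt(1/12) for l = 2, m = 2.
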
  • num_channels is a 32-bit unsigned integer specifying the number of audio channels contained in the given audio track.

  • channel_map is a sequence of 32-bit unsigned integers that maps audio channels in a given audio track to ambisonic components, given the defined ambisonic_channel_ordering. The sequence of channel_map values should match the channel sequence within the given audio track.

    For the example case of ambisonic_type = 0 (Periphonic), consider a 4-channel audio track containing ambisonic components W, X, Y, Z at channel indexes 0, 1, 2, 3, respectively. For ambisonic_channel_ordering = 0 (ACN), the ordering of components should be W, Y, Z, X, so the channel_map sequence should be 0, 2, 3, 1.

    As a simpler example, for a 4-channel audio track containing ambisonic components W, Y, Z, X at channel indexes 0, 1, 2, 3, respectively, the channel_map sequence should be specified as 0, 1, 2, 3 when ambisonic_channel_ordering = 0 (ACN).

    For the example case of ambisonic_type = 0 (Periphonic) with head_locked_stereo = 1, the stored audio will consist of 4 ambisonic components W, Y, Z, X in addition to head-locked stereo components L and R. In this case, the SA3D box will define num_channels = 6 and a channel_map specified as 0, 1, 2, 3, 4, 5, indicating that the channels are laid out in the file as W, Y, Z, X, L, R. This representation extends to different layouts of ambisonics and head-locked stereo components. For example, a channel_map of 4, 5, 0, 1, 2, 3 indicates that the layout of the stored audio is L, R, W, Y, Z, X.
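
The relationship between ambisonic_order, head_locked_stereo and num_channels described in the semantics above can be checked with a few lines of Python. This is only a sketch for the periphonic case; the helper names are illustrative, and treating head-locked stereo as exactly two extra channels (L and R) follows the example above rather than an explicit rule in this RFC.

```python
import math


def expected_num_channels(ambisonic_order: int, head_locked_stereo: bool) -> int:
    """Channel count implied by a periphonic order, plus L and R when head-locked stereo is present."""
    if ambisonic_order < 0:
        raise ValueError("ambisonic_order must be non-negative")
    ambisonic_channels = (ambisonic_order + 1) ** 2
    return ambisonic_channels + (2 if head_locked_stereo else 0)


def periphonic_order(num_ambisonic_channels: int) -> int:
    """Return sqrt(n) - 1, rejecting channel counts that are not perfect squares."""
    if num_ambisonic_channels <= 0:
        raise ValueError("at least one ambisonic channel is required")
    root = math.isqrt(num_ambisonic_channels)
    if root * root != num_ambisonic_channels:
        raise ValueError("channel count is not a perfect square")
    return root - 1
```

A reader of the box could compare `expected_num_channels(ambisonic_order, head_locked_stereo)` against the stored num_channels before trusting the channel_map.
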

Example

Here is an example MP4 box hierarchy for a file containing the SA3D box:

  • moov
    • trak
      • mdia
        • minf
          • stbl
            • stsd
              • mp4a
                • esds
                • SA3D

where the SA3D box has the following data:

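| Field Name | Value |
|:-----------|:------|
| version | 0 |
| head_locked_stereo | 0 |
| ambisonic_type | 0 |
| ambisonic_order | 1 |
| ambisonic_channel_ordering | 0 |
| ambisonic_normalization | 0 |
| num_channels | 4 |
| channel_map | 0 |
| channel_map | 2 |
| channel_map | 3 |
| channel_map | 1 |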
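
To make the byte layout concrete, the sketch below serializes and parses an SA3D box with Python's `struct` module. It assumes the usual MP4 conventions (big-endian fields, and a box header made of a 32-bit size followed by the four-character type) and assumes that head_locked_stereo occupies the most significant bit of the byte it shares with ambisonic_type. The function names are hypothetical.

```python
import struct

# version, packed flag/type byte, ambisonic_order, channel ordering,
# normalization, num_channels (big-endian, no padding: 12 bytes)
_FIXED = struct.Struct(">BBIBBI")


def build_sa3d_box(head_locked_stereo, ambisonic_type, ambisonic_order,
                   channel_ordering, normalization, channel_map, version=0):
    """Serialize an SA3D box: 8-byte box header followed by the payload."""
    # Assumption: the 1-bit flag is the top bit, the 7-bit type the low bits.
    packed_type = ((1 if head_locked_stereo else 0) << 7) | (ambisonic_type & 0x7F)
    payload = _FIXED.pack(version, packed_type, ambisonic_order,
                          channel_ordering, normalization, len(channel_map))
    payload += struct.pack(">%dI" % len(channel_map), *channel_map)
    return struct.pack(">I4s", 8 + len(payload), b"SA3D") + payload


def parse_sa3d_box(box):
    """Parse bytes produced by build_sa3d_box back into a dict of fields."""
    size, box_type = struct.unpack_from(">I4s", box, 0)
    if box_type != b"SA3D" or size != len(box):
        raise ValueError("not a complete SA3D box")
    version, packed_type, order, ordering, norm, count = _FIXED.unpack_from(box, 8)
    channel_map = list(struct.unpack_from(">%dI" % count, box, 8 + _FIXED.size))
    return {
        "version": version,
        "head_locked_stereo": packed_type >> 7,
        "ambisonic_type": packed_type & 0x7F,
        "ambisonic_order": order,
        "ambisonic_channel_ordering": ordering,
        "ambisonic_normalization": norm,
        "num_channels": count,
        "channel_map": channel_map,
    }
```

Calling `build_sa3d_box(False, 0, 1, 0, 0, [0, 2, 3, 1])` reproduces the example values above as a 36-byte box: 8 bytes of header, 12 bytes of fixed fields and 16 bytes of channel_map entries.
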

Appendix 1 - Ambisonics

The traditional notion of ambisonics is used, where the sound field is represented by spherical harmonics coefficients using the associated Legendre polynomials (without Condon-Shortley phase) as the basis functions. Thus, the spherical harmonic of degree l and order m at elevation E and azimuth A is given by:

N(l, abs(m)) * P(l, abs(m), sin(E)) * T(m, A)

where:

  • N(l, m) is the spherical harmonics normalization function used.
  • P(l, m, x) is the (unnormalized) associated Legendre polynomial, without Condon-Shortley phase, of degree l and order m evaluated at x.
  • T(m, x) is sin(-m * x) for m < 0 and cos(m * x) otherwise.
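
The definition above can be evaluated numerically. The following Python sketch assumes SN3D for N(l, m) and computes the associated Legendre polynomial with the standard upward recurrence, omitting the Condon-Shortley sign factor. All function names are illustrative and not part of this RFC.

```python
import math


def sn3d(l, m):
    """SN3D normalization N(l, |m|)."""
    m = abs(m)
    delta = 1 if m == 0 else 0
    return math.sqrt((2 - delta) * math.factorial(l - m) / math.factorial(l + m))


def legendre(l, m, x):
    """Unnormalized P(l, m, x) without Condon-Shortley phase, for 0 <= m <= l."""
    # Start from P(m, m, x) = (2m - 1)!! * (1 - x^2)^(m / 2).
    root = math.sqrt(max(0.0, 1.0 - x * x))
    pmm = 1.0
    for k in range(1, m + 1):
        pmm *= (2 * k - 1) * root
    if l == m:
        return pmm
    # P(m + 1, m, x) = x * (2m + 1) * P(m, m, x), then recurse upward in degree.
    prev, cur = pmm, x * (2 * m + 1) * pmm
    for k in range(m + 2, l + 1):
        prev, cur = cur, ((2 * k - 1) * x * cur - (k + m - 1) * prev) / (k - m)
    return cur


def trig(m, a):
    """T(m, a): sin(-m * a) for m < 0 and cos(m * a) otherwise."""
    return math.sin(-m * a) if m < 0 else math.cos(m * a)


def spherical_harmonic(l, m, elevation, azimuth):
    """N(l, |m|) * P(l, |m|, sin(E)) * T(m, A), with angles in radians."""
    if l < 0 or abs(m) > l:
        raise ValueError("require l >= 0 and -l <= m <= l")
    return sn3d(l, m) * legendre(l, abs(m), math.sin(elevation)) * trig(m, azimuth)
```

For degree 1 this reduces to cos(E) * sin(A) for m = -1, sin(E) for m = 0 and cos(E) * cos(A) for m = 1.
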

Conventions

Azimuth

  • A = 0: The source is in front of the listener.
  • A in (0, pi/2): The source is in the forward-left quadrant.
  • A in (pi/2, pi): The source is in the back-left quadrant.
  • A in (-pi/2, 0): The source is in the forward-right quadrant.
  • A in (-pi, -pi/2): The source is in the back-right quadrant.

Elevation

  • E = 0: The source is in the horizontal plane.
  • E in (0, pi/2]: The source is above the listener.
  • E in [-pi/2, 0): The source is below the listener.
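
As an illustration of these conventions, the sketch below converts an azimuth and elevation (in radians) into a unit direction vector. The (front, left, up) axis naming and the function name `direction_vector` are assumptions made for this example; the RFC itself defines only the angles.

```python
import math


def direction_vector(azimuth: float, elevation: float):
    """Return (front, left, up) components of the unit vector toward the source."""
    front = math.cos(elevation) * math.cos(azimuth)  # A = 0 points to the front
    left = math.cos(elevation) * math.sin(azimuth)   # positive A turns to the left
    up = math.sin(elevation)                         # positive E is above the listener
    return front, left, up
```

For example, A = pi/2 with E = 0 yields a vector pointing directly to the listener's left, and a negative azimuth yields a negative left component, that is, a source on the right.
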