Merge pull request #847 from AOMediaCodec/sunghee-hwang-patch-2
Fix #840, align of the keywords for conformance (UPPERCASE <-> lowerc…
sunghee-hwang authored Jul 14, 2024
2 parents 79becc8 + 4bfc95d commit 3dfe67d
Showing 1 changed file with 14 additions and 14 deletions.
28 changes: 14 additions & 14 deletions index.bs
@@ -326,7 +326,7 @@ This specification defines a model for representing [=Immersive Audio=] contents
<center><img src="images/decoding_flow_cropped.svg" width="800"></center>
<center><figcaption>Processing flow to decode, reconstruct, render, and mix the 3D audio signals for immersive audio playback.</figcaption></center>

- The model comprises a number of coded [=Audio Substream=]s and the metadata that describes how to decode, render and mix the [=Audio Substream=]s for playback. The model itself is codec-agnostic; any supported audio codec may be used to code the [=Audio Substream=]s.
+ The model comprises a number of coded [=Audio Substream=]s and the metadata that describes how to decode, render and mix the [=Audio Substream=]s for playback. The model itself is codec-agnostic; any supported audio codec MAY be used to code the [=Audio Substream=]s.

The model includes one or more [=Audio Element=]s, each of which consists of one or more [=Audio Substream=]s. The [=Audio Substream=]s that make up an [=Audio Element=] are grouped into one or more [=Channel Group=]s. The model further includes [=Mix Presentation=]s and [=Parameter Substream=]s.

@@ -336,15 +336,15 @@ The term channel means a component of Scene-based audio, a component of Object-b

The term <dfn noexport>Immersive Audio</dfn> (IA) means the combination of [=3D audio signal=]s recreating a sound experience close to that of a natural environment.

- The term <dfn noexport>Audio Substream</dfn> means a sequence of audio samples, which may be encoded with any compatible audio codec.
+ The term <dfn noexport>Audio Substream</dfn> means a sequence of audio samples, which MAY be encoded with any compatible audio codec.

The term <dfn noexport>Channel Group</dfn> means a set of [=Audio Substream=](s) which is(are) able to provide a spatial resolution of audio contents by itself or which is(are) able to provide an enhanced spatial resolution of audio contents by combining with the preceding [=Channel Group=]s.

The term <dfn noexport>Audio Element</dfn> means a [=3D audio signal=], and is constructed from one or more [=Audio Substream=]s (grouped into one or more [=Channel Groups=]) and the metadata describing them. The [=Audio Substream=]s associated with one [=Audio Element=] use the same audio codec.

The term <dfn noexport>Mix Presentation</dfn> means a series of processes to present [=Immersive Audio=] contents to end-users by using [=Audio Element=](s). It contains metadata that describes how the [=Audio Element=](s) is(are) rendered and mixed together for playback through physical loudspeakers or headphones, as well as loudness information.

- The term <dfn noexport>Parameter Substream</dfn> means a sequence of parameter values that are associated with the algorithms used for reconstructing, rendering, and mixing. It is applied to its associated [=Audio Element=] or [=Mix Presentation=]. [=Parameter Substream=]s may change their values over time and may further be animated; for example, any changes in values may be smoothed over some time duration. As such, they may be viewed as a 1D signal with different metadata specified for different time durations.
+ The term <dfn noexport>Parameter Substream</dfn> means a sequence of parameter values that are associated with the algorithms used for reconstructing, rendering, and mixing. It is applied to its associated [=Audio Element=] or [=Mix Presentation=]. [=Parameter Substream=]s MAY change their values over time and MAY further be animated; for example, any changes in values MAY be smoothed over some time duration. As such, they MAY be viewed as a 1D signal with different metadata specified for different time durations.

The term <dfn noexport>Rendered Mix Presentation</dfn> means a [=3D audio signal=] after the [=Audio Element=](s) defined in a [=Mix Presentation=] is(are) rendered and mixed together for playback through physical loudspeakers or headphones.

@@ -368,7 +368,7 @@ For a given input [=3D audio signal=],

An IAMF generation processing including the Pre-Processor, the [=Channel Group=](s), the Codec Encoder, and the OBU Packetizer are defined in [[#iamfgeneration]]. The [=IA Sequence=] is defined in [[#standalone-ia-sequence]]. An IAMF processing including the OBU Parser, the Codec Decoder, the Element Reconstructor, the Renderer, the Mixer, and the Post-Processor are defined in [[#processing]].

- Although not shown in the figure above, the [=IA Sequence=] may be encapsulated by a file packager, such as the ISO-BMFF Encapsulation, to output an IAMF file (ISO-BMFF file). Then, a file parser, such as the ISO-BMFF Parser, decapsulates it to output the [=IA Sequence=]. The ISO-BMFF Encapsulation, IAMF file (ISO-BMFF file), and ISO-BMFF Parser are defined in [[#isobmff]].
+ Although not shown in the figure above, the [=IA Sequence=] MAY be encapsulated by a file packager, such as the ISO-BMFF Encapsulation, to output an IAMF file (ISO-BMFF file). Then, a file parser, such as the ISO-BMFF Parser, decapsulates it to output the [=IA Sequence=]. The ISO-BMFF Encapsulation, IAMF file (ISO-BMFF file), and ISO-BMFF Parser are defined in [[#isobmff]].

## Bitstream Structure ## {#bitstream}

@@ -398,13 +398,13 @@ The normative definitions for an [=IA Sequence=] are defined in [[#standalone-ia

- The [=Audio Frame OBU=] provides the coded audio frame for an [=Audio Substream=]. Each frame has an implied start timestamp and an explicitly defined duration. A coded [=Audio Substream=] is represented as a sequence of [=Audio Frame OBU=]s with the same identifier, in time order.
- The [=Parameter Block OBU=] provides the parameter values in a block for a [=Parameter Substream=]. Each block has an implied start timestamp and an explicitly defined duration. A time-varying [=Parameter Substream=] is represented as a sequence of parameter values in [=Parameter Block OBU=]s with the same identifier, in time order.
- - The [=Temporal Delimiter OBU=] identifies the [=Temporal Unit=]s. It may or may not be present in [=IA Sequence=]. If present, the first OBU of every [=Temporal Unit=] is the [=Temporal Delimiter OBU=].
+ - The [=Temporal Delimiter OBU=] identifies the [=Temporal Unit=]s. It MAY or MAY NOT be present in [=IA Sequence=]. If present, the first OBU of every [=Temporal Unit=] is the [=Temporal Delimiter OBU=].

## Timing Model ## {#timingmodel}

A coded [=Audio Substream=] is made of consecutive [=Audio Frame OBU=]s. Each [=Audio Frame OBU=] is made of audio samples at a given sample rate. The decode duration of an [=Audio Frame OBU=] is the number of audio samples divided by the sample rate. The presentation duration of an [=Audio Frame OBU=] is the number of audio samples remaining after trimming divided by the sample rate. The decode start time (respectively presentation start time) of an [=Audio Frame OBU=] is the sum of the decode durations (respectively presentation durations) of previous [=Audio Frame OBU=]s in the IA Sequence, or 0 otherwise. The decode duration (respectively presentation duration) of a coded [=Audio Substream=] is the sum of the decode durations (respectively presentation durations) of all its [=Audio Frame OBU=]s. The decode start time of an [=Audio Substream=] is the decode start time of its first [=Audio Frame OBU=]. The presentation start time of an [=Audio Substream=] is the presentation start time of its first [=Audio Frame OBU=] which is not entirely trimmed.

- A [=Parameter Substream=] is made of consecutive [=Parameter Block OBU=]s. Each [=Parameter Block OBU=] is made of parameter values at a given sample rate. The decode duration of a [=Parameter Block OBU=] is the number of parameter values divided by the sample rate. The decode start time of a [=Parameter Block OBU=] is the sum of the decode duration of previous [=Parameter Block OBU=]s if any, 0 otherwise. The decode duration of a [=Parameter Substream=] is the sum of all its [=Parameter Block OBU=]s' decode durations. The start time of a [=Parameter Substream=] is the decode start time of its first [=Parameter Block OBU=]. When all parameter values in a [=Parameter Substream=] are constant, no [=Parameter Block OBU=]s may be present in the [=IA Sequence=].
+ A [=Parameter Substream=] is made of consecutive [=Parameter Block OBU=]s. Each [=Parameter Block OBU=] is made of parameter values at a given sample rate. The decode duration of a [=Parameter Block OBU=] is the number of parameter values divided by the sample rate. The decode start time of a [=Parameter Block OBU=] is the sum of the decode duration of previous [=Parameter Block OBU=]s if any, 0 otherwise. The decode duration of a [=Parameter Substream=] is the sum of all its [=Parameter Block OBU=]s' decode durations. The start time of a [=Parameter Substream=] is the decode start time of its first [=Parameter Block OBU=]. When all parameter values in a [=Parameter Substream=] are constant, no [=Parameter Block OBU=]s MAY be present in the [=IA Sequence=].

Within an [=Audio Element=], the presentation start times of all [=Audio Substream=]s coincide and is the presentation start time of the [=Audio Element=]. All [=Audio Substream=]s have the same presentation duration which is the presentation duration of the [=Audio Element=].
- The decode start times of all coded [=Audio Substream=]s and all [=Parameter Substream=]s coincide and is the decode start time of the [=Audio Element=].
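Reviewer sketch (not part of the commit): the per-frame timing rules quoted in this hunk for [=Audio Frame OBU=]s can be illustrated roughly as follows. The function name `frame_timeline` and its arguments are hypothetical; the sketch assumes the caller already knows, for each frame, the number of coded samples and the number of samples removed by trimming.

```python
from fractions import Fraction


def frame_timeline(samples_per_frame, trimmed_samples, sample_rate):
    """Return one (decode_start, decode_duration, presentation_start,
    presentation_duration) tuple per frame, in seconds.

    Hypothetical helper: the inputs are plain lists, not parsed OBUs.
    """
    timeline = []
    decode_start = Fraction(0)        # 0 when there are no previous frames
    presentation_start = Fraction(0)
    for num_samples, trimmed in zip(samples_per_frame, trimmed_samples):
        # Decode duration: number of audio samples divided by the sample rate.
        decode_duration = Fraction(num_samples, sample_rate)
        # Presentation duration: samples remaining after trimming.
        presentation_duration = Fraction(num_samples - trimmed, sample_rate)
        timeline.append((decode_start, decode_duration,
                         presentation_start, presentation_duration))
        # Start times are running sums of the previous durations.
        decode_start += decode_duration
        presentation_start += presentation_duration
    return timeline


# Three 960-sample frames at 48 kHz, with 312 samples trimmed from the first.
timeline = frame_timeline([960, 960, 960], [312, 0, 0], 48000)
```

Exact fractions are used so that the running sums do not accumulate floating-point error; the same running-sum pattern would apply to [=Parameter Block OBU=] decode times.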
@@ -421,7 +421,7 @@ The figure below shows an example of the Timing Model in terms of the decode sta
<center><img src="images/IAMF Timing Model.png" style="width:100%; height:auto;"></center>
<center><figcaption>An example of the IAMF Timing Model. AFO: [=Audio Frame OBU=], PBO: [=Parameter Block OBU=], \(\text{PT}x\): time \(x\) (ms) on the presentation layer's timeline, \(\text{DT}y\): time \(y\) (ms) on the decoding layer's timeline.</figcaption></center>

- NOTE: For a given decoded [=Audio Substream=] (before trimming) and its associated [=Parameter Substream=](s), a decoder MAY apply trimming in 1 of 2 ways:
+ NOTE: For a given decoded [=Audio Substream=] (before trimming) and its associated [=Parameter Substream=](s), a decoder may apply trimming in 1 of 2 ways:
<br/>
1) The decoder processes the [=Audio Substream=] using the [=Parameter Substream=](s), and then trims the processed audio samples.
<br/>
@@ -556,7 +556,7 @@ NOTE: A future version of the specification may use this flag to specify an exte

Paresers SHOULD ignore <dfn noexport>Reserved OBU</dfn>s.

- NOTE: Future versions of the specification MAY define syntax and semantics for an [=obu_type=] value, making it no longer a [=Reserved OBU=] for those parsers compliant with these future versions.
+ NOTE: Future versions of the specification may define syntax and semantics for an [=obu_type=] value, making it no longer a [=Reserved OBU=] for those parsers compliant with these future versions.

## IA Sequence Header OBU Syntax and Semantics ## {#obu-iasequenceheader}

@@ -639,7 +639,7 @@ NOTE: <code>ipcm</code> should not be confused with <code>lpcm</code>, which is

<dfn noexport>num_samples_per_frame</dfn> indicates the frame length, in samples, of the [=audio_frame=] provided in the audio_frame_obu. It SHALL NOT be set to zero. If the [=decoder_config=] structure for a given codec specifies a value for the frame length, the two values SHALL be equal.

- <dfn noexport>audio_roll_distance</dfn> indicates how many audio frames prior to the current audio frame need to be decoded (and the decoded samples discarded) to set the decoder in a state that will produce the correct decoded audio signal. It SHALL always be a negative value or zero. For some audio codecs, even if an audio frame can be decoded independently, the decoded signal after decoding only that frame may not represent a correct, decoded audio signal, even ignoring compression artifacts. This can be due to overlap transforms. While potentially acceptable when starting to decode an [=Audio Substream=], it may be problematic when automatically switching between similar [=Audio Substream=]s of different quality and/or bitrate.
+ <dfn noexport>audio_roll_distance</dfn> indicates how many audio frames prior to the current audio frame need to be decoded (and the decoded samples discarded) to set the decoder in a state that will produce the correct decoded audio signal. It SHALL always be a negative value or zero. For some audio codecs, even if an audio frame can be decoded independently, the decoded signal after decoding only that frame MAY not represent a correct, decoded audio signal, even ignoring compression artifacts. This can be due to overlap transforms. While potentially acceptable when starting to decode an [=Audio Substream=], it MAY be problematic when automatically switching between similar [=Audio Substream=]s of different quality and/or bitrate.
- It SHALL be set to \(-R\) when [=codec_id=] is set to <code>Opus</code>, where
\[R = \left\lceil{\frac{3840}{\text{num_samples_per_frame}}}\right\rceil.\]
- It SHALL be set to -1 when [=codec_id=] is set to <code>mp4a</code>.
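Reviewer sketch (not part of the commit): the two [=audio_roll_distance=] rules visible in this hunk can be checked with a small helper. The function name `audio_roll_distance` and its string-based `codec_id` argument are assumptions for illustration; the hunk is cut off after the `mp4a` rule, so any other codec is deliberately left unhandled here.

```python
def audio_roll_distance(codec_id, num_samples_per_frame):
    """Return the required roll distance for the codecs shown in this hunk.

    Hypothetical helper covering only the Opus and mp4a rules.
    """
    if num_samples_per_frame <= 0:
        # num_samples_per_frame SHALL NOT be set to zero.
        raise ValueError("num_samples_per_frame must be positive")
    if codec_id == "Opus":
        # R = ceil(3840 / num_samples_per_frame), using integer arithmetic.
        r = -(-3840 // num_samples_per_frame)
        return -r
    if codec_id == "mp4a":
        return -1
    # Rules for other codecs are not visible in this diff hunk.
    raise NotImplementedError(f"no rule shown for codec_id {codec_id!r}")
```

For example, an Opus stream with 960 samples per frame gives R = 4, so the field would be -4; with 2880 samples per frame the ceiling gives R = 2.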
@@ -1078,7 +1078,7 @@ Where FLc: Front Left Centre, FC: Front Centre, FRc: Front Right Centre, FL: Fro

For a given input [=3D audio signal=] with an expanded channel layout defined in [=expanded_loudspeaker_layout=], [=num_layers=] SHALL be set to 1 (i.e., it is a non-scalable channel audio element). Except [=9.1.6ch=] [=Audio Element=], it is RECOMMENDED to use such an [=Audio Element=] as an auxiliary [=Audio Element=] to be mixed with a primary [=Audio Element=] (e.g., TOA or 7.1.4ch) within a [=Mix Presentation=]. If parsers encounter a [=loudspeaker_layout=] = 15 for any layer other than the first layer, they SHOULD skip the [=channel_audio_layer_config=] for that layer and all subsequent layers.

- The following channel layouts may be indicated using an existing [=loudspeaker_layout=] or [=expanded_loudspeaker_layout=]. The stereo pair FLc/FRc is indicated using Stereo (L/R), the stereo pair BL/BR is indicated using Stereo-RS (Lrs/Rrs), the stereo pair TpFL/TpFR is indicated using Stereo-TF (Ltf/Rtf), the stereo pair TpBL/TpBR is indicated using Stereo-TB (Ltb/Rtb), and FLc/FC/FRc is indicated using 3.0ch (L/C/R).
+ The following channel layouts MAY be indicated using an existing [=loudspeaker_layout=] or [=expanded_loudspeaker_layout=]. The stereo pair FLc/FRc is indicated using Stereo (L/R), the stereo pair BL/BR is indicated using Stereo-RS (Lrs/Rrs), the stereo pair TpFL/TpFR is indicated using Stereo-TF (Ltf/Rtf), the stereo pair TpBL/TpBR is indicated using Stereo-TB (Ltb/Rtb), and FLc/FC/FRc is indicated using 3.0ch (L/C/R).

### Scalable Channel Group and Layout ### {#scalalechannelaudio-channelgroupandlayout}

@@ -1776,7 +1776,7 @@ The variable <b>audio_substream_id_in_bitstream</b> does not exist in an [=IA Se

<dfn noexport>explicit_audio_substream_id</dfn> indicates the [=audio_substream/audio_substream_id=] of this frame. The value SHALL be greater than 17. When this field is not present, [=audio_substream/audio_substream_id=] is implicit and is defined as a value from 0 to 17 for OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17, respectively.

- NOTE: The first 18 [=Audio Substream=]s in an [=IA Sequence=] MAY use the OBU types OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17, which have predefined [=audio_substream/audio_substream_id=]s associated with them. This reduces bitrate by avoiding the extra [=explicit_audio_substream_id=] field in the bitstream.
+ NOTE: The first 18 [=Audio Substream=]s in an [=IA Sequence=] may use the OBU types OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17, which have predefined [=audio_substream/audio_substream_id=]s associated with them. This reduces bitrate by avoiding the extra [=explicit_audio_substream_id=] field in the bitstream.

<dfn noexport>coded_frame_size</dfn> is the size of [=audio_frame=] in bytes.
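Reviewer sketch (not part of the commit): the implicit versus explicit substream identifier rule in this hunk could be resolved along these lines. The function `resolve_audio_substream_id` and both of its arguments are hypothetical; in particular, `implicit_index` stands for the N in OBU_IA_Audio_Frame_IDN and is not a numeric [=obu_type=] value, which this diff does not show.

```python
def resolve_audio_substream_id(implicit_index=None, explicit_id=None):
    """Return the audio_substream_id for one Audio Frame OBU.

    implicit_index: N for OBU_IA_Audio_Frame_IDN (0..17), or None when the
        OBU carries an explicit_audio_substream_id instead (assumed input).
    explicit_id: value of explicit_audio_substream_id, if present.
    """
    if implicit_index is not None:
        if not 0 <= implicit_index <= 17:
            raise ValueError("implicit identifiers are limited to 0..17")
        # No explicit field in the bitstream; the ID is implied by the OBU type.
        return implicit_index
    if explicit_id is None:
        raise ValueError("either an implicit index or an explicit ID is required")
    if explicit_id <= 17:
        # explicit_audio_substream_id SHALL be greater than 17.
        raise ValueError("explicit_audio_substream_id must be greater than 17")
    return explicit_id
```

Keeping the two ranges disjoint (0 to 17 implicit, above 17 explicit) means an identifier never has two encodings, which is what lets the first 18 substreams skip the extra field.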

@@ -2070,7 +2070,7 @@ Parsers SHALL ignore these two fields.

<b>Semantics</b>

- <dfn noexport>ia_configuration_box</dfn> is an instance of the IAConfigurationBox() class, which provides the configuration of the [=IA Sequence=]. The position of the instance SHALL comply with the rule specified in [[!ISO-BMFF]] for [=AudioSampleEntry=]. In other words, the instance SHALL be present after the [=samplerate=] field of [=AudioSampleEntry=]. When the instance is present with another optional box such as the BitRateBox() ('btrt'), their exact ordering is not defined.
+ <dfn noexport>ia_configuration_box</dfn> is an instance of the IAConfigurationBox() class, which provides the configuration of the [=IA Sequence=]. The position of the instance SHALL comply with the rule specified in [[!ISO-BMFF]] for [=AudioSampleEntry=]. In other words, the instance SHALL be present after the [=samplerate=] field of [=AudioSampleEntry=]. When the instance is present with another OPTIONAL box such as the BitRateBox() ('btrt'), their exact ordering is not defined.

### IA Configuration Box ### {#iaconfigurationbox-section}

@@ -2222,7 +2222,7 @@ An [=IA Sequence=] SHALL be decoded and processed to output an [=Immersive Audio
8. Post-processing the output mix to perform loudness normalization and peak limiting.
- Details are provided in [[#processing-post]].

- NOTE: The IA decoder MAY choose to lazily parse OBUs to avoid unnecessarily parsing OBUs that are not used by the selected [=Mix Presentation=].
+ NOTE: The IA decoder may choose to lazily parse OBUs to avoid unnecessarily parsing OBUs that are not used by the selected [=Mix Presentation=].

The figure below depicts an example IA decoder architecture with modules that perform the steps above.

@@ -2871,7 +2871,7 @@ All syntax elements conform to the [=Syntactic Description Language=] specified

<b>string</b> <b>syntaxName</b>

- <b>string</b> indicates a null-terminated (i.e., ending at the first byte set to 0x00), UTF-8 encoded as defined in [[!RFC-3629]] and whose length SHALL be limited to 128 bytes.
+ <b>string</b> indicates a null-terminated (i.e., ending at the first byte set to 0x00), UTF-8 encoded as defined in [[!RFC-3629]] and whose length is limited to 128 bytes.

<b>syntaxName</b> is a human readable label.
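Reviewer sketch (not part of the commit): a parser for the <b>string</b> type described in this hunk might look like the following. The name `read_string` is hypothetical, and the sketch assumes the 128-byte limit includes the terminating 0x00 byte, which the quoted text does not state explicitly.

```python
MAX_STRING_BYTES = 128  # assumed to include the 0x00 terminator


def read_string(buf, offset=0):
    """Read a null-terminated UTF-8 string from buf starting at offset.

    Returns (text, next_offset), where next_offset points just past the
    terminator. Raises ValueError if no terminator is found in the window.
    """
    window = buf[offset:offset + MAX_STRING_BYTES]
    end = window.find(b"\x00")  # the string ends at the first 0x00 byte
    if end < 0:
        raise ValueError("no 0x00 terminator within 128 bytes")
    # bytes.decode raises UnicodeDecodeError on malformed UTF-8.
    return window[:end].decode("utf-8"), offset + end + 1


label, next_offset = read_string(b"Stereo\x00rest")
```

Searching only within a bounded window keeps a malformed bitstream from causing an unbounded scan for the terminator.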
