Background

This section presents the background for new material that will follow in subsequent sections.  We discuss the notion of structured audio and the way it interrelates the theories of audio synthesis and natural audio coding.  Then we present an overview of the MPEG-4 Structured Audio coding format, for readers who may not be familiar with this tool.

Theory of structured audio

The term structured audio was introduced by Vercoe et al. [1] as a way of interrelating research on sound synthesis, audio coding, and sound recognition.  The central point of this theory is that every parametric sound representation can be viewed as a tool for sound understanding, transmission, and rendering.  This includes not only representations that are typically considered parametric, such as the frequency and amplitude of a sinusoid, but also more complicated ones, such as the particular set of weight values transported as the perceptual coding of a natural audio soundtrack. 

By always thinking of sound applications in terms of the parametric space they use, it becomes possible to interrelate and compare techniques from disciplines that are normally disparate.  For example, an audio codec such as MPEG-AAC [2] can be considered as the combination of (1) a sound-understanding algorithm that sets the parameters in representation space based on analysis of an acoustic waveform, (2) the transmission of these parameters, and (3) a sound-synthesis algorithm that maps from the transmitted parameters to a new sound.  We normally call the understanding step an encoding step, and the synthesis step a decoding step, but there is no real difference between these styles of terminology.

Similarly, imagine a simple MIDI-stream transmission scenario.[1]  A performance on a solo instrument is analyzed by pitch-tracking, and the pitch values are transmitted to a synthesizer, which turns them back into sound.  Just as in the example above, there are understanding, transmission, and rendering aspects to this process.  Although this scenario is normally considered a “performance” situation, it can also be considered a type of audio compression, particularly if the performance and resynthesis steps are separated by a network.
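The three stages of this scenario can be sketched in a few lines of Python.  This is a toy illustration under stated assumptions, not a practical pitch tracker: the 8 kHz sample rate, the zero-crossing estimator, the sine-wave renderer, and the names `track_pitch` and `synthesize` are all chosen here for illustration, and a bare MIDI note number stands in for the transmitted MIDI data.

```python
import math

SAMPLE_RATE = 8000  # Hz; assumed for this illustration


def synthesize(midi_note, n_samples, sample_rate=SAMPLE_RATE):
    """Rendering: turn one transmitted pitch value into a sine tone."""
    freq = 440.0 * 2.0 ** ((midi_note - 69) / 12.0)
    return [math.sin(2.0 * math.pi * freq * n / sample_rate)
            for n in range(n_samples)]


def track_pitch(samples, sample_rate=SAMPLE_RATE):
    """Understanding: estimate the pitch of a tone as a MIDI note number.

    Counts rising zero crossings, so it is only meaningful for a clean,
    monophonic, sine-like input that contains at least one crossing.
    """
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0.0 <= b)
    freq = crossings * sample_rate / len(samples)
    return round(69 + 12 * math.log2(freq / 440.0))


# One second of a clean tone stands in for the solo performance.
performance = synthesize(60, SAMPLE_RATE)
# Transmission: a single integer crosses the network, not 8000 samples.
transmitted = track_pitch(performance)
# The receiving synthesizer renders the parameter back into sound.
resynthesis = synthesize(transmitted, SAMPLE_RATE)
```

Here the analysis step is simple but applies only to a narrow class of inputs, and the result depends entirely on what the receiving renderer chooses to do with the note number.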

Recognizing that these two different applications have a similar underlying structure allows us to directly compare the analogous stages of processing.  The understanding (encoding) side is less computationally demanding for MIDI than for AAC, but more general for AAC than for MIDI.  That is, it is easier to pitch-track a monophonic sound than to perform AAC encoding (both in terms of development difficulty and in terms of run-time complexity), but the AAC encoder can be used on every sound, whereas the pitch-tracking process only applies to a subset of well-defined monophonic sounds.  The transmission process requires much less bandwidth for the MIDI case, since the parameterization is very succinct.  The rendering process is more exact in the AAC case (for a standard, we would say more normative), in that a particular set of parameters is always synthesized into sound in the same way.  In the MIDI-transmission case, the particular playback (rendering) method is not normative in the same way, and so different playback devices will present different sonic results.

Vercoe et al. [1] presented a taxonomy of many different kinds of sound representations, and the tradeoffs they create when used in applications.  For example, continuing the comparison above, we can say that AAC would be preferred to MIDI in applications where normative sound quality is important, a wide set of soundtracks must be represented, there is adequate bandwidth for the transmission of the parameters, and there are enough resources to deploy an encoder.  MIDI would be preferred to AAC in applications where normative sound quality is not important, we are only concerned with a narrow set of sounds, and there is restricted bandwidth or encoding resources.

The notion of application-driven (or “requirements-driven”) tradeoffs is essential to the present paper.  Every coding method presents tradeoffs; there is no “ideal” coding technique except in relation to a particular set of requirements.  Most high-quality audio research has progressed along one particular path of requirements.  As a result, great progress has been made in this direction; however, we hope to show in the present paper that by taking a broader view, there is progress yet to be realized in the larger marketplace for coding technology.

MPEG-4 Structured Audio

MPEG-4 Structured Audio is one of the coding tools in the MPEG-4 International Standard, which was completed in 1998 and is scheduled for approval and publication in mid-1999.  It is the first standard to allow the direct application of algorithmic structured audio[2] techniques to the transmission of sound in a multimedia context.  Algorithmic structured audio is the idea of using a general-purpose software-synthesis language, and parameters to programs written in that language, to represent sound for transmission.  This idea was suggested as early as 1991 [4], and first implemented in a prototype system called Netsound [5].

Space in this paper does not permit a full description of the structure and capabilities of the SA format, but we present a brief overview here, and the interested reader is referred to other papers [6-10] or to the standard [11] for more information.

At the heart of the SA standard is a sound-synthesis language called SAOL, for “Structured Audio Orchestra Language” [9].  A program in SAOL describes a sound-processing or sound-synthesis algorithm.  SAOL resembles C syntactically, but variables in SAOL contain audio signals, and built-in processing functions allow signals to be generated, mixed, filtered, processed and otherwise manipulated.  Examples of SAOL programs, to be discussed later, are in Appendixes A, C, D, and E.  As will be proved in Section 2, any algorithm for parametrically creating or altering sound can be described in SAOL.

In order to transmit sound in the SA format, the bitstream header contains one or more algorithms written in SAOL, and the streaming data consists of Access Units containing parametric events written in SASL (“Structured Audio Score Language”).  The decoding process for the SA bitstream is shown in Figure 1.

At session startup, the SAOL algorithms (called instrument definitions) are communicated to a reconfigurable synthesis engine.  This engine knows how to interpret the SAOL programs and configure itself for parametric sound manipulation accordingly.  The exact properties of the reconfigurable synthesis engine are specified in the MPEG-4 standard; it implements the same functions in every compliant terminal.

During the streaming part of the session, the Access Units are decoded into events, which are stored in a time-sorted list.  The run-time scheduler, also specified in the standard, keeps track of these events and dispatches them when their time arrives.  Each event is either a parametric instruction that begins the execution of one of the SAOL algorithms (a note), or some new parameters for use by one of the algorithms that is already executing.  At any given time in the decoding process, several notes are active and stored in a note pool.  Each note corresponds to one of the synthesis algorithms delivered in the bitstream header. Multiple notes may use the same algorithm, in which case their data spaces are maintained separately, in the manner of object instances in the object-oriented programming paradigm.
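The bookkeeping described above can be sketched as follows.  This is a minimal illustration, not the scheduler specified in the standard: the class names `Scheduler` and `Note`, the event kinds `"note"` and `"control"`, and the use of a dictionary for parameters are all assumptions made here, and Access Unit decoding and note termination are omitted.

```python
import heapq
from dataclasses import dataclass
from itertools import count


@dataclass
class Note:
    """One running instance of an instrument, with its own private data."""
    instrument: str
    params: dict


class Scheduler:
    def __init__(self):
        self._events = []     # heap ordered by (time, arrival order)
        self._order = count() # tie-breaker so equal times keep arrival order
        self.note_pool = {}   # note id -> active Note instance

    def add_event(self, time, kind, note_id, **data):
        """Store a decoded event in the time-sorted list."""
        heapq.heappush(self._events,
                       (time, next(self._order), kind, note_id, data))

    def advance_to(self, now):
        """Dispatch every stored event whose time has arrived."""
        while self._events and self._events[0][0] <= now:
            _, _, kind, note_id, data = heapq.heappop(self._events)
            if kind == "note":
                # Start a new instance; copy params so instances never share state.
                self.note_pool[note_id] = Note(data["instrument"],
                                               dict(data["params"]))
            elif kind == "control" and note_id in self.note_pool:
                # Deliver new parameters to an instance that is already running.
                self.note_pool[note_id].params.update(data["params"])
```

Two notes started from the same instrument name receive separate `params` dictionaries, which mirrors the separately maintained data spaces described in the text.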

Figure 1: The SA decoding process.  The bitstream header contains a number of instrument definitions written in SAOL.  At session startup, the instrument definitions are used to configure a reconfigurable synthesis engine.  During the streaming session, access units containing parametric events are used to control the synthesis engine.  The run-time scheduler keeps track of multiple note instances, each corresponding to one of the SAOL instruments, that are executed in turn to produce audio output.  Adapted from [10], copyright 1999 Marcel Dekker, used with permission.

At each time step, each of the notes that is active in the note pool is executed by running the corresponding SAOL algorithm for a short amount of time.  This execution creates some output for each note instance. The outputs of all instances are summed to create the final audio output.
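One time step of this process can be sketched as below.  The example is illustrative only: `SineNote` is a hypothetical stand-in for a SAOL instrument instance, and the sample rate and block length are assumed values rather than anything taken from the standard.

```python
import math

SAMPLE_RATE = 8000  # Hz; assumed
BLOCK = 4           # samples produced per time step; assumed


class SineNote:
    """Hypothetical note instance: a sine oscillator with private phase state."""

    def __init__(self, freq, amp):
        self.freq, self.amp, self.phase = freq, amp, 0.0

    def run(self, n):
        """Execute the note's algorithm for n samples and return its output."""
        out = []
        for _ in range(n):
            out.append(self.amp * math.sin(self.phase))
            self.phase += 2.0 * math.pi * self.freq / SAMPLE_RATE
        return out


def render_step(note_pool, n=BLOCK):
    """Run every active note for one time step and sum the outputs."""
    mix = [0.0] * n
    for note in note_pool:
        for i, sample in enumerate(note.run(n)):
            mix[i] += sample
    return mix
```

Because each instance keeps its own phase, calling `render_step` repeatedly continues every note from where the previous time step left off.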

By using a general-purpose software-synthesis language to specify the creation of sound, the normative sound quality required for high-quality applications can be achieved.  Since the exact operation of the SA decoding process is specified very precisely, content authors are guaranteed (if they wish) that sounds will be sample-by-sample identical on different MPEG-4 terminals.

Nothing has been written in this section about the nature of the sound-synthesis algorithms transmitted in SAOL and the parametric data that drive them.  The text of the standard uses language normally associated with sound-synthesis applications (“instrument,” “note,” and so forth), but there is no reason that SAOL algorithms must be simple mappings from two or three parameters to rule-based sound generation.  The built-in functions of SAOL include spectral transforms, filter structures, noise generators, and other techniques useful for the creation of natural sound decoders.  Similarly, the transmitted parameters might consist of time-synchronized frames of data for the control of natural decoders.

SAOL as a language can represent algorithms that are too complex for any particular system.  In order to restrict the computational complexity of the decoding terminal, the MPEG-4 standard contains a simulation tool that allows content authors to determine the amount of complexity required to decode their bitstreams.  Levels of the SA standard are set with regard to complexity as measured with this simulation tool: a Level 1 decoder, in order to conform to the standard, is required to perform a certain amount of computation in real time, a Level 2 decoder somewhat more, and so forth.  The responsibility for ensuring that decoding complexity stays within the bounds of the Level appropriate for the application is left to the content developer.

In practice, many useful Structured Audio bitstreams are actually simpler to decode than bitstreams represented in today's complex wideband audio coders.



[1] MIDI, the Musical Instrument Digital Interface specification [3], was originally created as a protocol for controller-synthesizer communications, but can be naturally viewed as a sound parameterization as well.  In MIDI, a sound is represented as a sequence of notes, each of which is parameterized by its pitch and “velocity,” a sort of loudness function.  The timbres of sounds are not represented in the MIDI protocol.

[2] Henceforth, Structured Audio or “SA” is used to refer to the MPEG-4 implementation of the structured-audio concept.  We will continue to use the phrase structured audio with lowercase initials to refer to the theory outlined in the preceding section.  These are different things: the former is one particular implementation of a general concept denoted by the latter.