Auditory Scene Analysis : A Brief report


In perception and psychophysics, auditory scene analysis (ASA) is a proposed model for the basis of auditory perception. This is understood as the process by which the human auditory system organizes sound into perceptually meaningful elements. The term was coined by psychologist Albert Bregman. The related concept in machine perception is computational auditory scene analysis (CASA), which is closely related to source separation and blind signal separation.

The three key aspects of Bregman's ASA model are: segmentation, integration, and segregation.

Sound reaches the ear and the eardrum vibrates as a whole. This signal has to be analyzed (in some way). Bregman's ASA model proposes that sounds will either be heard as "integrated" (heard as a whole – much like harmony in music), or "segregated" into individual components (which leads to counterpoint). When two or more natural sounds occur at once, all the components of the simultaneously active sounds are received at the same time, or overlapped in time, by the ears of listeners. This presents their auditory systems with a problem: which parts of the sound should be grouped together and treated as parts of the same source or object? Grouping them incorrectly can cause the listener to hear non-existent sounds built from the wrong combinations of the original components.

In many circumstances the segregated elements can be linked together in time, producing an auditory stream. This ability of auditory streaming can be demonstrated by the so-called cocktail party effect. Up to a point, with a number of voices speaking at the same time or with background sounds, one is able to follow a particular voice even though other voices and background sounds are present.

A number of grouping principles appear to underlie ASA, many of which are related to principles of perceptual organization discovered by the school of Gestalt psychology. These can be broadly categorized into sequential grouping mechanisms (those that operate across time) and simultaneous grouping mechanisms (those that operate across frequency):

  • Errors in simultaneous grouping can lead to the blending of sounds that should be heard as separate, the blended sounds having different perceived qualities (such as pitch or timbre) to any of the sounds actually received. For instance two vowels presented simultaneously may not be identifiable if they are segregated.
  • Errors in sequential grouping can lead, for example, to hearing a word created out of syllables originating from two different voices

Segregation can be based primarily on perceptual cues or rely on the recognition of learned patterns ("schema-based").

The job of ASA is to group incoming sensory information to form an accurate mental representation of the individual sounds. When sounds are grouped by the auditory system into a perceived sequence, distinct from other co-occurring sequences, each of these perceived sequences is called an "auditory stream". In the real world, if the ASA is successful, a stream corresponds to a distinct environmental sound source producing a pattern that persists over time, such as a person talking, a piano playing, or a dog barking. However, in the lab, by manipulating the acoustic parameters of the sounds, it is possible to induce the perception of one or more auditory streams.

One example of this is the phenomenon of streaming, also called "stream segregation."If two sounds, A and B, are rapidly alternated in time, after a few seconds the perception may seem to "split" so that the listener hears two rather than one stream of sound, each stream corresponding to the repetitions of one of the two sounds.

epetitions of one of the two sounds.