Coding

In a first-generation cellular handset, you talk into a microphone, which produces a varying voltage representing roughly 3 kHz of voice audio bandwidth. The voltage is then FM-modulated onto an RF carrier, an all-analog processing chain. In second-generation handsets, you talk into a microphone and the voice is turned into a digital bit stream using waveform encoding. For example, in GSM, a 104 kbps data stream is produced prior to the vocoder. It is the vocoder’s job to reduce this data rate to, for example, 13 kbps or less without noticeable loss of quality. In the wireline world and in digital cordless phones, this is achieved with time domain compression techniques that exploit sample-to-sample predictability. These are known as adaptive differential pulse code modulation (ADPCM) codecs. They work well in high background noise conditions but suffer quality loss at low codec rates, for example, 16 kbps or below.
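
The sample-to-sample predictability described above can be illustrated with a minimal sketch. It assumes 8 kHz sampling and 13-bit samples (which would account for the 104 kbps figure quoted for GSM), a first-order predictor, and a fixed quantizer step. A real ADPCM codec also adapts its step size and predictor; the function names and the test tone are illustrative only.

```python
import math

SAMPLE_RATE_HZ = 8000   # assumption: narrowband telephony sampling rate
BITS_PER_SAMPLE = 13    # assumption: linear PCM resolution ahead of the vocoder


def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Raw bit rate of uncompressed PCM in bits per second."""
    return sample_rate_hz * bits_per_sample


def dpcm_encode(samples, step):
    """Quantize the difference between each sample and the previous reconstruction."""
    codes = []
    prediction = 0
    for sample in samples:
        code = round((sample - prediction) / step)
        codes.append(code)
        prediction += code * step  # mirror what the decoder will reconstruct
    return codes


def dpcm_decode(codes, step):
    """Rebuild the waveform by accumulating the quantized differences."""
    decoded = []
    prediction = 0
    for code in codes:
        prediction += code * step
        decoded.append(prediction)
    return decoded


# Hypothetical test signal: 20 ms of a 200 Hz tone at amplitude 4000.
tone = [round(4000 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE_HZ))
        for n in range(160)]

codes = dpcm_encode(tone, step=8)
decoded = dpcm_decode(codes, step=8)

raw_rate = pcm_bit_rate(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)   # 104000 bit/s
ratio_to_13k = raw_rate / 13000                            # 8.0
peak_sample = max(abs(s) for s in tone)
peak_code = max(abs(c) for c in codes)
max_error = max(abs(a - b) for a, b in zip(tone, decoded))
```

Because neighboring samples are close in value, the difference codes span a much smaller range than the samples themselves (here they fit in a signed 8-bit value, while the samples need 13 bits), and the reconstruction error stays within half a quantizer step.
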
For digital cellular handsets, the decision was made to use speech synthesis codecs that code in the frequency domain (see Figure 1.11). The figure shows a female voice saying “der.” Each block represents a 20-ms speech sample. The first block shows the “d,” and the second block shows the “er,” described in the time domain (y-axis) and frequency domain (x-axis). Each sample is described in terms of frequency coefficients. Compression is achieved by exploiting similarity between samples.
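
As a rough illustration of describing 20-ms blocks by frequency coefficients, the sketch below splits a signal into frames and finds the strongest frequency bins in each. It assumes 8 kHz sampling and uses a plain discrete Fourier transform as a stand-in; the text does not specify the analysis a real speech codec performs, and the synthetic "voiced" signal and function names are hypothetical.

```python
import cmath
import math

SAMPLE_RATE_HZ = 8000                          # assumption: sampling rate
FRAME_MS = 20                                  # frame length taken from the text
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 160 samples per frame


def split_frames(samples, frame_len=FRAME_LEN):
    """Cut the signal into consecutive, non-overlapping frames."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def dft_magnitudes(frame):
    """Magnitude of each DFT bin from 0 Hz up to half the sampling rate."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        magnitudes.append(abs(acc))
    return magnitudes


def strongest_bins(magnitudes, count):
    """Indices of the `count` largest bins, in ascending bin order."""
    ranked = sorted(range(len(magnitudes)), key=lambda k: magnitudes[k], reverse=True)
    return sorted(ranked[:count])


def voiced(n):
    """Hypothetical voiced sound: a 200 Hz fundamental plus two harmonics."""
    t = n / SAMPLE_RATE_HZ
    return (math.sin(2 * math.pi * 200 * t)
            + 0.5 * math.sin(2 * math.pi * 400 * t)
            + 0.25 * math.sin(2 * math.pi * 600 * t))


signal = [voiced(n) for n in range(2 * FRAME_LEN)]
frames = split_frames(signal)
bin_width_hz = SAMPLE_RATE_HZ / FRAME_LEN      # 50.0 Hz per bin
peaks = [strongest_bins(dft_magnitudes(frame), 3) for frame in frames]
```

In this example both frames are dominated by the same three bins (200, 400, and 600 Hz at 50 Hz per bin), so a handful of coefficients describes each 160-sample frame, and the second frame repeats the first. That frame-to-frame similarity is the kind of redundancy the text refers to.
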

In the receiver, the frequency coefficients are used to rebuild, or synthesize, the harmonic structure of the original voice sample. The more processing power used in the codec, the better the quality for a given compression ratio. Alternatively, rather than being synthesized, waveforms can be prestored, then fetched and inserted as needed in the decoder. This reduces processor overhead but increases memory bandwidth in the vocoder. These codecs are known as codebook codecs or, more precisely, code-excited linear prediction (CELP) codecs. Codecs used in present CDMA handsets and most future handsets are CELP codecs. Voice codecs are also becoming variable rate, either switchable (for coverage or capacity gain) or adaptive (the codec rate varies according to the dynamic range of the input waveform). The objective of all codecs is to use processor bandwidth to reduce transmission bandwidth. Speech synthesis codecs and codebook codecs can deliver compression ratios of 8:1 or more without significant loss of quality.

3G handsets add MPEG-4 encoders/decoders to support image and video processing. In common with vocoders, these video codecs transform the input into the frequency domain (specifically, using a discrete cosine transform) to identify redundancy in the input image waveform. As we will see, video codecs are capable of delivering compression ratios of 40:1 or more with tolerable image quality. Fourth-generation digital encoders will add embedded rendering and mesh coding techniques to support motion prediction, motion estimation, and motion compensation.
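
To show how a discrete cosine transform exposes redundancy, the following sketch applies a one-dimensional orthonormal DCT to a single row of eight pixel values and rebuilds the row from only two coefficients. The pixel row is invented for illustration. Video codecs typically work on two-dimensional blocks and add quantization, entropy coding, and motion compensation, none of which is modeled here.

```python
import math


def _scale(k, n):
    """Normalization factor that makes the transform orthonormal."""
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)


def dct(values):
    """Orthonormal DCT-II of a sequence."""
    n = len(values)
    return [
        _scale(k, n) * sum(values[i] * math.cos(math.pi * (i + 0.5) * k / n)
                           for i in range(n))
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of dct(): rebuild the sequence from its coefficients."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n))
        for i in range(n)
    ]


# Hypothetical row of pixels: a smooth brightness gradient.
row = [100, 104, 108, 112, 116, 120, 124, 128]

coeffs = dct(row)
kept = coeffs[:2] + [0.0] * (len(coeffs) - 2)   # discard all but two coefficients
approx = idct(kept)

energy_total = sum(c * c for c in coeffs)
energy_kept = sum(c * c for c in kept)
energy_fraction = energy_kept / energy_total
max_error = max(abs(a - b) for a, b in zip(row, approx))
round_trip = idct(coeffs)
```

For smooth image content like this row, almost all of the signal energy lands in the first two coefficients, so the remaining six can be dropped with an error of less than three gray levels per pixel. The 40:1 ratios quoted in the text depend on the additional stages that this sketch leaves out.
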