The challenge of multimedia processing is to seamlessly integrate text, sound, image, and video information into a single communications channel, and to do so in a way that provides high quality communications while preserving the ease-of-use and interactivity of conventional telephony. A number of technology drivers are pushing the field forward, and a number of technological problems must be overcome before multimedia becomes as ubiquitous as voiceband telephony. Perhaps the most important of the technology drivers is the ability to compress and code multimedia signals efficiently. Another key issue is the creation of coding and compression standards that ensure connectivity between customers and a range of service providers. Multimedia processing is a rapidly evolving area of communications; nevertheless, a number of interesting and important multimedia communications applications have emerged over the past several years, and some of these applications will be described in this paper.
In a very real sense, virtually every individual has had experience with multimedia systems of one type or another. Perhaps the most common multimedia experiences are reading the daily newspaper or watching television. For most of us, when we think about multimedia and the promise of future communications systems, we tend to think about systems that combine video, graphics, and animation with special effects (e.g., morphing of one image into another) and CD-quality audio, as seen in movies like "Who Framed Roger Rabbit." On a more business-oriented scale, we think about creating virtual meeting rooms with 3-dimensional realism in sight and sound, including shared whiteboards, shared computer applications, and perhaps even computer-generated Business Meeting Notes that document the meeting in an efficient communications format. Other glamorous applications of multimedia processing include Distance Learning, in which we learn from and interact with instructors remotely over a broadband communication network; Virtual Library Access, in which we have instant access to all of the published material in the world, in its original form and format, and can browse, display, print, and even modify the material instantaneously; and Living Books, which supplement the written word and the associated pictures with animations and hyperlink access to supplementary material.
Modern voice communications networks evolved around the turn of the twentieth century with a focus on creating Universal Service, namely the ability to automatically connect any telephone user with any other telephone user, without the need for operator assistance or intervention. This revolutionary goal defined a series of technological problems that had to be solved before the vision became reality, including the invention of the vacuum tube for amplification of telephone signals, mechanical switching to replace the operator consoles that were used in most localities, numbering plans to route calls, signaling systems to route calls, etc. The first transcontinental call in the United States was completed in 1915, thereby ushering in the "modern age of voice communications," an age that has been developed and improved upon for the past 80 or so years.
We are now in the midst of another revolution in communications, one which holds the promise of providing ubiquitous service in multimedia communications. The vision for this revolution is to provide seamless, easy-to-use, high quality, affordable multimedia communications between people and machines, anywhere, and anytime. There are three key aspects of the vision which characterize the changes that will occur in communications once this vision is achieved, namely:
* the basic currency of communications evolves from narrowband voice telephony to seamlessly integrated, high quality, broadband, transmission of multimedia signals;
* the basic access method changes from wireline connections to combinations of wired and wireless, including cable, fiber, cell sites, satellite, and even electrical power lines;
* the basic mode of communications expands from primarily involving people-to-people communications, to include people-to-machine communications.
There are a number of forces that are driving this multimedia revolution, including:
* the evolution of communications networks and data networks into today's modern POTS (Plain Old Telephone Services) network and Packet (including the Internet) networks, with major forces driving these two networks into an integrated structure;
* the increasing availability of (almost unlimited) bandwidth on demand in the office, the home, and eventually on the road, based on the proliferation of high speed data modems, cable modems, hybrid fiber-coax systems, and recently a number of fixed wireless access systems;
* the availability of ubiquitous access to the network via LANs, wireline, and wireless networks providing the promise of anywhere, anytime access;
* the ever increasing amount of memory and computation that can be brought to bear on virtually any communications or computing system -- based on Moore's law, the doubling of the computation and memory capacity of chips every 18 or so months;
* the proliferation of smart terminals, including sophisticated screen phones, digital telephones, multimedia PC's that handle a wide range of text, image, audio, and video signals, "Network Computers" and other low-cost Internet access terminals, and PDAs (Personal Digital Assistants) of all types that are able to access and interact with the network via wired and wireless connections;
* the digitization of virtually all devices including cameras, video capture devices, video playback devices, handwriting terminals, sound capture devices, etc., fueled by the rapid and widespread growth of digital signal processing architectures and algorithms, along with associated standards for plug-and-play as well as interconnection and communications between these digital devices.
In order for multimedia systems to achieve the vision of the current communications revolution, and become available to everyone, much as POTS service is now available to all telephony customers, a number of technological issues must be addressed and put into a framework that leads to seamless integration, ease-of-use, and high quality outputs. Among the issues that must be addressed are the following:
* the basic techniques for compressing and coding the various media that constitute the multimedia signal, including the signal processing algorithms, the associated standards, and the issues involved with transmission of these media in real communications systems;
* the basic techniques for organizing, storing, and retrieving multimedia signals, including both downloading and streaming techniques, layering of signals to match the characteristics of the network and the display terminal, and the issues involved in defining a basic Quality-of-Service (QOS) for the multimedia signal and its constituent components;
* the basic techniques for accessing the multimedia signals by providing tools that match the user to the machine, such as by using "natural" spoken language queries, through the use of media conversion tools to convert between media, through the use of agents that monitor the multimedia sessions and provide assistance in all phases of access and utilization;
* the basic techniques for searching in order to find multimedia sources that provide the desired information or material -- these searching methods, which in essence are based on machine intelligence, provide the interface between the network and the human user, and provide methods for searching via text requests, image matching methods, and speech queries;
* the basic techniques for browsing individual multimedia documents and libraries in order to take advantage of human intelligence to find desired material via text browsing, indexed image browsing, and voice browsing.
Table 1 shows the signal characteristics and the resulting uncompressed bit rate necessary to support the storage and transmission of speech, audio, image, and video signals with high quality. The table has separate sections for each of these signals, since their characteristics are very different in terms of frequency range of interest, sampling grids, etc.
It can be seen from Table 1a that for narrowband speech, a bit rate of 128 kb/s is required without any form of coding or compression -- i.e., twice the rate used in ordinary POTS telephony. For wideband speech, a bit rate of 256 kb/s is required for the uncompressed signal, and for 2-channel stereo quality CD (Compact Disc) audio, a bit rate of 1.41 Mb/s is required. We will see later in this section that narrowband speech can be compressed to about 4 kb/s (roughly a 30-to-1 compression ratio), wideband speech to about 16 kb/s (a 16-to-1 compression ratio), and CD audio to 64 kb/s (a 22-to-1 compression ratio) while still preserving the quality of the original signal.
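The speech and audio figures above follow directly from sampling rate, bits per sample, and channel count. The short sketch below (plain Python; all rates and coded bit rates are taken from the text, only the arithmetic is added) reproduces the uncompressed rates and the quoted compression factors:

```python
# Uncompressed bit rates from Table 1a, and the compression ratios quoted in
# the text. The sampling rates, bits/sample, and coded rates come from the
# article; this script just reproduces the arithmetic.

def uncompressed_rate(sampling_hz, bits_per_sample, channels=1):
    """Bit rate in bits per second for raw (uncompressed) PCM."""
    return sampling_hz * bits_per_sample * channels

narrowband = uncompressed_rate(8_000, 16)       # 128 kb/s
wideband   = uncompressed_rate(16_000, 16)      # 256 kb/s
cd_audio   = uncompressed_rate(44_100, 16, 2)   # 1.41 Mb/s

assert narrowband == 128_000
assert wideband == 256_000
assert cd_audio == 1_411_200

# Compression ratios relative to the coded rates cited in the text
print(narrowband / 4_000)   # narrowband speech coded at 4 kb/s  -> 32x
print(wideband / 16_000)    # wideband speech coded at 16 kb/s   -> 16x
print(cd_audio / 64_000)    # CD audio coded at 64 kb/s          -> ~22x
```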
Table 1b shows the uncompressed size needed for bilevel (FAX) and color still images. It can be seen that an ordinary FAX of an 8 1/2 by 11 inch document, scanned at 200 dpi (dots per inch), has an uncompressed size of 3.74 Mb, whereas color images (displayed on a computer screen) at VGA resolution require 2.46 Mb, and high resolution XVGA color images require 18.87 Mb for the uncompressed image. It will be shown that most images can be compressed by factors on the order of 100-to-1 (especially text-based FAX documents) without any significant loss in quality.
Finally, Table 1c shows the necessary bit rates for several video types. For standard television, including the North American NTSC standard and the European PAL standard, the uncompressed bit rates are 111.2 Mb/s (NTSC) and 132.7 Mb/s (PAL). For videoconferencing and videophone applications, smaller format pictures with lower frame rates are standard, leading to the CIF (Common Intermediate Format) and QCIF (Quarter CIF) standards, which have uncompressed bit rates of 18.2 Mb/s and 3.0 Mb/s, respectively. Finally, the digital standard for HDTV (in two standard formats) has requirements for an uncompressed bit rate of between 662.9 and 745.7 Mb/s.
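The video bit rates above are likewise simple products of pixel count, frame rate, and bits per pixel. A short check (pixel dimensions, frame rates, and bits/pixel taken from Table 1c; only the arithmetic is new) reproduces each entry:

```python
# Reproducing the uncompressed video bit rates in Table 1c.

def video_rate(width, height, fps, bits_per_pixel):
    """Uncompressed bit rate in Mb/s."""
    return width * height * fps * bits_per_pixel / 1e6

print(round(video_rate(480, 483, 29.97, 16), 1))    # NTSC       -> 111.2
print(round(video_rate(576, 576, 25, 16), 1))       # PAL        -> 132.7
print(round(video_rate(352, 288, 14.98, 12), 1))    # CIF        -> 18.2
print(round(video_rate(176, 144, 9.99, 12), 1))     # QCIF       -> 3.0
print(round(video_rate(1280, 720, 59.94, 12), 1))   # HDTV 720   -> 662.9
print(round(video_rate(1920, 1080, 29.97, 12), 1))  # HDTV 1080  -> 745.7
```

Note that the arithmetic confirms 662.9 Mb/s (not 622.9) for the 1280 x 720 HDTV format, in agreement with the range quoted in the text.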
a) Speech/Audio

   Type                    Narrowband Speech   Wideband Speech   CD Audio
   Frequency Range         200-3200 Hz         50-7000 Hz        20-20000 Hz
   Sampling Rate           8 kHz               16 kHz            44.1 kHz
   Bits/Sample             16                  16                16 x 2 channels
   Uncompressed Bit Rate   128 kb/s            256 kb/s          1.41 Mb/s
b) Image

   Type                FAX           VGA         XVGA
   Pixels per Frame    1700 x 2200   640 x 480   1024 x 768
   Bits/Pixel          1             8           24
   Uncompressed Size   3.74 Mb       2.46 Mb     18.87 Mb
c) Video

   Type                    NTSC         PAL          CIF         QCIF        HDTV         HDTV
   Pixels per Frame        480 x 483    576 x 576    352 x 288   176 x 144   1280 x 720   1920 x 1080
   Image Aspect Ratio      4:3          4:3          4:3         4:3         16:9         16:9
   Frames per Second       29.97        25           14.98       9.99        59.94        29.97
   Bits/Pixel              16*          16           12#         12          12           12
   Uncompressed Bit Rate   111.2 Mb/s   132.7 Mb/s   18.2 Mb/s   3.0 Mb/s    662.9 Mb/s   745.7 Mb/s
* Based on the so-called 4:2:2 color sub-sampling format, with two chrominance samples each of Cb and Cr for every four luminance samples.
# Based on the so-called 4:1:1 color sub-sampling format, with one chrominance sample each of Cb and Cr for every four luminance samples.
Table 1: Characteristics and uncompressed bit rates of speech, audio, image and video signals
Clearly, such high compression ratios are essential if speech, audio, images, video, text, and data are to be combined, transmitted, and stored efficiently, as required in multimedia processing systems.
Over the past two decades, speech and audio coding standards have evolved for network, cellular, and secure telephony applications. Such standards fall into two categories, namely waveform coding and model-based coding methods. The most popular waveform coding methods include:
* PCM (G.711)-pulse code modulation. This direct sample-by-sample quantizer provides high quality encoding at 64 kb/s for telephone bandwidth speech.
* ADPCM (G.726, G.727)-adaptive, differential PCM. By exploiting a backward adaptive differential PCM coder to predict the sample value and then quantizing the difference between the actual value and the predicted value, the encoding rate falls to 32 kb/s for high quality encoding of telephone bandwidth speech.
* Wideband coder (G.722)-2-band ADPCM for 7 kHz bandwidth speech. This source coder is designed for wideband speech and provides good quality, low delay encoding at 64, 56, and 48 kb/s.

Among the most popular model-based coding methods are:
* LD-CELP (G.728)-low delay, code-excited linear prediction coding. This source coder is a low delay, backward adaptive, codebook-excited linear prediction coder. It uses linear predictive coding (LPC) analysis to create three different filters: a 50th-order predictor for the next sample value, a 10th-order predictor to guide the quantization process, and a perceptual weighting filter used to select the excitation signal. The coder runs at 16 kb/s for coding of telephone bandwidth speech.
* CS-ACELP (G.729)-conjugate structure, algebraic CELP. This coder is a forward adaptive, analysis-by-synthesis coder in which the prediction filter and the gains are explicitly transmitted, along with the pitch period estimate. It is used for transmission of telephone bandwidth speech at 8 kb/s, and to date has been used primarily for Simultaneous Voice and Data (SVD) modems.
* MPC-MLQ (G.723.1)-multi-pulse coding, maximum likelihood quantization. This coder is another forward adaptive, analysis-by-synthesis coder, operating at somewhat lower rates than G.729, namely 6.4 and 5.3 kb/s, for telephone bandwidth speech. It has been selected for use in Internet telephony.
* VSELP (IS-54)-vector sum excited linear prediction. This source coder is another forward adaptive, analysis-by-synthesis coder, used as the basis of the first-generation North American digital cellular coding standard at 8 kb/s for telephone bandwidth speech.

Figure 1 shows a plot of speech quality (as measured subjectively in terms of mean opinion scores (MOS)) for a range of telephone bandwidth speech coders spanning bit rates from 64 kb/s down to 2.4 kb/s, along with curves of speech quality based on measurements made around 1980 and 1990. It can be seen from Fig. 1 that telephone bandwidth coders maintain a uniformly high MOS score for source coding rates from 64 kb/s down to about 8 kb/s. It is anticipated that within the next several years this uniformly high MOS score will be attained in the 2-5 kb/s range, based on newer source coding methods currently under investigation in research labs around the world.
Figure 1: Subjective quality of various telephone bandwidth speech coders versus bit rate.
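The waveform coders above (G.711, G.726/G.727, G.722) share one idea: predict each sample from past decoded samples and quantize only the prediction error. The toy first-order DPCM coder below is a minimal sketch of that idea, not the actual G.726 algorithm; the fixed predictor coefficient and uniform quantizer step are illustrative assumptions (real ADPCM adapts both):

```python
# Toy first-order DPCM: predict each sample from the previous decoded sample
# and quantize only the residual. The encoder tracks the decoder's
# reconstruction so that both stay in sync.

def dpcm_encode(samples, step=4, alpha=0.9):
    prediction = 0.0
    codes = []
    for x in samples:
        error = x - alpha * prediction
        code = round(error / step)                      # uniform quantizer on the residual
        codes.append(code)
        prediction = alpha * prediction + code * step   # decoder-tracking reconstruction
    return codes

def dpcm_decode(codes, step=4, alpha=0.9):
    prediction = 0.0
    out = []
    for code in codes:
        prediction = alpha * prediction + code * step
        out.append(prediction)
    return out

signal = [0, 10, 18, 24, 26, 25, 22, 16]   # slowly varying, so residuals stay small
codes = dpcm_encode(signal)
decoded = dpcm_decode(codes)
# Because the encoder tracks the decoder, reconstruction error stays within
# about half a quantizer step for this smooth input.
assert all(abs(a - b) <= 2 for a, b in zip(signal, decoded))
```

Because the residual occupies a much smaller range than the raw samples, it can be coded with fewer bits; adapting the quantizer step and predictor to the signal is what takes G.726 from this sketch to high quality at 32 kb/s.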
Audio coding standards have recently evolved in connection with MPEG video coding, including:
* MPEG-1 audio coder, which provides audio source coding and compression based on perceptual coding methods. This coder achieves high quality at bit rates as low as 96 kb/s for monaural audio signals.
* MPEG-AAC audio coder, which prescribes advanced audio coding methods for bit rates as low as 8 kb/s per channel and as high as 192 kb/s per channel. At 64 kb/s per channel, stereo audio coding has been shown to preserve the CD quality of the original audio signal.
Image coding standards have evolved for FAX, including:
* Group 3 and Group 4 FAX for run length coding. The Group 3 source coder operates on a scanned document line by line, in left-to-right fashion, and does a one-dimensional run length coding of the pixels on each line. For a typical scanned page (8 1/2 by 11 inches) at a scan rate of 200 dots-per-inch, a compression rate of 20-to-1 is obtained on simple text documents. The Group 4 source coder improves on G3 FAX by using a two-dimensional coding scheme that exploits vertical spatial redundancy as well as the horizontal spatial redundancy used in G3 coding. In particular, the G4 source coder uses the previous scan line as a reference when coding the current scan line. G4 FAX coding often provides a 25% improvement over G3 FAX coding for simple text documents.
* JBIG-1 for pixel prediction based on local neighborhoods. This source coder for FAX attempts to do a significantly better job on documents with images. It uses a dynamically adaptive binary arithmetic coder whose statistics adapt to each pixel context, operating in either a sequential mode, in which the pixel to be coded is predicted from 9 adjacent, previously coded pixels plus one adaptive pixel that can be spatially separated from the others, or a progressive mode that provides successive resolution increases with successive encodings. The JBIG-1 FAX coder improves compression by a factor of up to 8-to-1 for binary halftone images, and is comparable to G4 FAX coding for text documents.
* JBIG-2 for soft pattern matching on segmented regions. This source coding standard for FAX documents is based on "soft pattern matching," which attempts to code scanned documents by finding highly repeatable patterns in the document and coding each such pattern only once per document. For text documents, the soft patterns correspond roughly to letters; for mixed documents with text and images, the soft patterns are combinations of letters and image segments.
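The one-dimensional run-length idea behind G3 FAX can be sketched in a few lines. This is a simplified illustration: the standard goes on to Huffman-code the run lengths with modified Huffman tables, a stage omitted here:

```python
# Minimal 1-D run-length coder in the spirit of G3 FAX: each scan line of
# 0/1 pixels is reduced to alternating run lengths of white (0) and black
# (1) pixels, starting with a white run (of length 0 if the line starts black).

def runlength_encode(line):
    runs, current, count = [], 0, 0
    for pixel in line:
        if pixel == current:
            count += 1
        else:
            runs.append(count)
            current, count = pixel, 1
    runs.append(count)
    return runs

def runlength_decode(runs):
    line, color = [], 0
    for count in runs:
        line.extend([color] * count)
        color ^= 1
    return line

# A scan-line fragment with two short black marks on a white background
line = [0] * 12 + [1] * 3 + [0] * 20 + [1] * 5
runs = runlength_encode(line)
print(runs)                      # [12, 3, 20, 5]
assert runlength_decode(runs) == line
```

Long white and black runs dominate scanned text, which is why even this simple scheme, followed by entropy coding of the run lengths, reaches the 20-to-1 ratios cited above.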
Image coding standards include:
* JPEG for DCT processing, perceptual quantization, and entropy encoding. This source coding standard divides an image into 8 (pixel) by 8 (pixel) blocks, spectrally analyzes each block using a forward Discrete Cosine Transform (DCT), and scalar quantizes the resulting DCT coefficients according to a psychophysically based table of quantization levels. The JPEG image coder can compress most color images by a factor of about 32-to-1 while maintaining good image quality.
* JPEG-2000 as a modern multimedia architecture with downloadable software. This evolving image coding standard is intended to provide low bit rate operation with subjective image quality performance superior to the existing JPEG standard, without sacrificing performance at higher bit rates.
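The transform-and-quantize core of the JPEG scheme described above can be sketched as follows. The flat quantizer step used here is a stand-in for JPEG's psychophysically derived quantization tables, and the entropy-coding stage is omitted; the point is only to show energy compaction followed by coarse quantization:

```python
import math

# Sketch of JPEG's per-block processing: forward 2-D DCT of an 8x8 block,
# then scalar quantization of the coefficients.

N = 8

def dct2(block):
    """Naive 2-D DCT-II of an NxN block (O(N^4); fine for illustration)."""
    def c(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * s
    return out

def quantize(coeffs, step=16):
    # Flat step in place of JPEG's frequency-dependent quantization table
    return [[round(c / step) for c in row] for row in coeffs]

# A smooth gradient block: after the DCT, energy concentrates in the
# low-frequency corner, so most quantized coefficients are zero.
block = [[x + y for y in range(N)] for x in range(N)]
q = quantize(dct2(block))
nonzero = sum(1 for row in q for c in row if c != 0)
print(nonzero)   # only a handful of the 64 coefficients survive
```

It is exactly this sparsity, many zero coefficients after quantization, that the subsequent entropy coder converts into the ~32-to-1 compression cited above.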
Finally, video standards include:
* H.261 (p x 64), H.262, and H.263 with motion compensation for interframe coding. This family of video source coders is the baseline video mode for most multimedia conferencing systems. The baseline coder, H.261, codes initial video frames using a JPEG-like method and, for subsequent frames, uses a motion compensation scheme to predict the displacement of groups of pixels from their position in the previous frame. A key aspect of the coding is the creation of a motion compensated prediction error (rather than a motion compensated sequence), which is used as the image signal to be coded, quantized, and entropy encoded using a Variable Length Coder for transmission over a fixed rate channel.
* MPEG-1 with specifications for coding, compression, and transmission of audio, video, and data in packets. This class of video source coding is very similar to the H.26X series described above. A key difference is the use of uni- and bi-directional motion compensated prediction for three classes of pictures, namely Intra pictures (coded without motion compensation), Predictive pictures (coded based on previous Intra or previous Predictive pictures), and Bi-Directionally Predictive pictures (coded based on either the next and/or the previous pictures). MPEG-1 coding was created in order to store video sequences on CD-ROM media, i.e., audio and video coding at 1.4 Mb/s.
* MPEG-2 with capability of handling multi-channel, multimedia signals over broadband networks. This video coding standard was designed to provide the capability for compressing, coding, and transmitting high quality (broadcast), multi-channel, multimedia signals over broadband networks, e.g., ATM protocols. The source coding methods are very similar to those used for H.26X and MPEG-1, with less quantization of the coded video because of the higher bit rates for which the standard was designed.
* MPEG-4, an object-based approach to multimedia with independent coding of objects, interactive composition of objects, and the ability to integrate synthetic and natural objects. This standard was intended to provide the capability of coding video at rates as low as 8 kb/s (low bit rate videophones) and as high as 1 Mb/s, and is still evolving in its utility and capability.
* MPEG-7, which adds the capability for searching, indexing, and authentication of large databases of multimedia objects. This standard will take shape in the next few years.

4. Multimedia Systems
We now show three examples of multimedia systems which illustrate how the technology comes together in practical systems.
FusionNet Service 
A key problem in the delivery of "on-demand" multimedia communications over the Internet, based on POTS or ISDN access, is that the Internet today cannot guarantee the quality of real-time signals such as speech, audio, and video. The FusionNet service overcomes this problem by using the Internet only to browse and request the video and audio, as well as to control the signal delivery (e.g., via VCR-like controls). FusionNet uses either POTS or ISDN to actually deliver guaranteed Quality of Service (QOS) for real-time transmission of audio and video.
Initial implementations of FusionNet Service required the user to maintain either two POTS lines (one for Internet access, one for the guaranteed QOS audio/video link), or a full ISDN connection (i.e., 2 or more B-channels each with 64 kb/s). The most recent implementation provides the FusionNet Service over a single ISDN B-channel by requiring the ISP (Internet Service Provider) to provide ISDN access equipment that seamlessly merges the guaranteed QOS audio/video signal with normal Internet traffic to and from the user via PPP (Point-to-Point Protocol) over dialed-up ISDN connections. Unless the traffic at the local ISP is very high, this method provides high quality FusionNet Service with a single ISDN B channel. Of course, additional channels can always be ganged together for higher quality service.
Cybrary-the Virtual Library
A key aspect in delivering high quality multimedia is the ability to view printed material in its original (uncompressed) form. This requires a capability of compressing, storing, indexing, and browsing documents stored anywhere. We call a system that provides these capabilities a cyber-library or Cybrary. Such a system essentially allows for virtual presence in a remote archive.
The Cybrary system we have created lets anyone connected to the Internet view any document in the library on any available screen. The key technological innovation that provides this capability is a new standard for document image compression (JBIG-2), which makes quick page-flipping and browsing possible. Through the use of advanced OCR (Optical Character Recognition) techniques for translating image text into ASCII characters, the Cybrary system provides full text search, indexing, browsing, and hyperlinking.
Pictorial Transcripts System 
A key challenge in multimedia processing is to provide a compact representation of full motion video that can be indexed, stored efficiently, and displayed as a reference or archive. The Pictorial Transcripts System is one proposed solution to this challenge. Essentially, the Pictorial Transcripts System provides a complete, albeit condensed, representation of a full motion video sequence, consisting of a carefully selected set of still images matched to (synchronized with) the text version of the audio. The technology has commercial applications in broadcast TV, where a network can automatically convert a closed-captioned broadcast into a Website program in real time, and in business settings, where the system could create "Business Meeting Notes" of meetings, seminars, or conferences and make them available on the Internet within minutes after a meeting ends.
Pictorial Transcripts automatically analyzes, condenses, and indexes multimedia information from closed captioned video broadcasts and generates web content in real time. The application thus allows selective retrieval of program content, since Pictorial Transcripts generates text and pictorial indices into video and multimedia images. The technical challenges here include finding ways to combine video, speech, and text processing and compression technologies, as well as perform linguistic analyses, computer programming, and systems engineering to index selected video information and transmit it over phone lines.
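The still-image selection step can be illustrated with a deliberately simplified sketch: keep a frame whenever it differs enough from the last frame kept. The threshold, the mean-absolute-difference metric, and the tiny four-pixel "frames" below are all illustrative assumptions; the actual content-based sampling in Pictorial Transcripts is considerably more sophisticated:

```python
# Toy shot-boundary-style frame selection: compare each frame against the
# most recently kept frame and keep it when the normalized mean absolute
# difference exceeds a threshold.

def select_stills(frames, threshold=0.25):
    """frames: list of equal-length tuples of pixel intensities (0..255)."""
    def distance(a, b):
        return sum(abs(x - y) for x, y in zip(a, b)) / (255 * len(a))
    kept = [0]                       # always keep the first frame
    for i in range(1, len(frames)):
        if distance(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept

# Three "shots" of tiny 4-pixel frames: dark, bright, mid-grey
frames = [(10, 12, 11, 10), (11, 12, 10, 10),          # shot 1
          (200, 198, 205, 201), (199, 200, 202, 200),  # shot 2
          (100, 101, 99, 100)]                         # shot 3
print(select_stills(frames))   # -> [0, 2, 4]
```

One representative frame per shot, paired with the closed-caption text for that interval, is what drives the roughly three-orders-of-magnitude storage reduction discussed next.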
Using efficient, high-performance algorithms reduces the storage for a one-hour news broadcast from as much as 1 gigabyte to about 1.5 megabytes -- a compression ratio on the order of 700-to-1 (for the set of still images and the associated text). This is a propitious number because it means that an entire news program can be saved on a single 3 1/2" floppy disk.
Multimedia processing is defined as the multiplexing and combining of any number of data streams, where the data streams can represent real-time signals (with their concomitant need for some type of guaranteed Quality of Service), data signals, control signals, conformance testing signals, etc. This has led to the evolution of modern multimedia systems that we believe will become commonplace and widely used in the future.
This paper is based on the paper "On the Applications of Multimedia Processing to Communications," by R. V. Cox, B. G. Haskell, Y. LeCun, B. Shahraray, and L. R. Rabiner, which has been submitted to the Proceedings of the IEEE for the special issue on Multimedia.
W. B. Kleijn and K. K. Paliwal, Eds., Speech Coding and Synthesis, Elsevier, 1995.
J. D. Johnston, "Transform Coding of Audio Signals Using Perceptual Noise Criteria," IEEE Journal on Selected Areas in Communications, Vol. 6, No. 2, pp. 314-323, February 1988.
K. R. McConnell, D. Bodson, and R. Schaphorst, FAX: Digital Facsimile Technology and Applications, Artech House, Boston, 1992.
W. B. Pennebaker and J. L. Mitchell, "Other Image Compression Standards," Chapter 20 in JPEG Still Image Data Compression Standard, Van Nostrand Reinhold, New York, 1993.
A. N. Netravali and B. G. Haskell, Digital Pictures: Representation, Compression, and Standards, 2nd Edition, Plenum Press, New York, 1995.
M. R. Civanlar, G. L. Cash, and B. G. Haskell, "FusionNet: Joining the Internet and Phone Networks for Multimedia Applications," ACM Multimedia Proceedings, Boston, November 1996.
B. Shahraray and D. C. Gibbon, "Automatic Generation of Pictorial Transcripts of Video Programs," Multimedia Computing and Networking 1995, Proc. SPIE 2417, February 1995.