ICASSP 97, Table of Contents 1997 International Conference on Acoustics, Speech and Signal Processing Title: Space-Time Processing for Wireless Communications Authors: Arogyaswami Paulraj, Stanford University Volume: 1, Page: 1 Abstract: This paper reviews space-time signal processing in mobile wireless communications. Space-time processing refers to the signal processing performed in the spatial and temporal domain on signals received at or transmitted from an antenna array, in order to improve performance of wireless networks. We focus on antenna arrays deployed at the base stations since such applications are of current practical interest. ** Title: Variability Of Performance In Video Coding Authors: Don Pearson, University of Essex Volume: 1, Page: 5 Abstract: Modern video compression techniques exhibit variability of performance as a function of time. Studies are reported of viewers' reactions to this variability, which indicate a sensitivity to particular features. Some interesting conclusions emerge for future work in video coding. ** Title: Expert Summaries Authors: Renato de Mori, C.E.R.I. Hermann Ney, University of Technology (RWTH), Aachen Hans Georg Musmann, University of Hannover Rama Chellapa, University of Maryland Mark J.T. Smith, Georgia Institute of Technology John R. Treichler, Applied Signal Technology, Inc. Georgios B. Giannakis, University of Virginia Michael D. Zoltowski, Purdue University Volume: 1, Page: 9 Abstract: Leading experts in their fields summarize the most relevant new ideas from submitted papers in the fields of speech processing, digital signal processing, image and multidimensional signal processing, and statistical and array processing. ** Title: Expanding Team Experiences in DSP Education Authors: Delores Etter, University of Colorado Geoffrey C. 
Orsak, George Mason University Volume: 1, Page: 11 Abstract: Since practicing engineers work in multidisciplinary teams, it is important that universities provide as many teaming experiences as possible. In this paper, we present some of the advantages and disadvantages of traditional teaming approaches. We then present issues related to virtual teaming - the teaming of students from geographically distributed locations. Virtual teaming adds a new dimension to the teaming experiences that universities can provide to students to better equip them for the environment in which they will work in research positions and in industry. Experiences with a three-year program in virtual teaming between the University of Colorado and George Mason University will be presented. ** Title: Interactive Classroom For DSP/Communication Courses Authors: Huseyin Abut, School of Applied Science Yusuf Ozturk, San Diego State University Volume: 1, Page: 15 Abstract: In this study, we present a new classroom environment to conduct digital signal processing and communication systems courses. Key features of the model are the collaborating instructor embracing students, a smart classroom equipped with a 'Whiteboard' and advanced telecommunication networks, electronic textbook, and other resources, World Wide Web (WWW), Matlab, and other on-line tools. The underlying assumptions of the educational process are teambuilding instead of independent learning, collaborating/supervising instructor, lateral curriculum instead of a vertical curriculum, and idea-to-product design concept. We will present a sample lecture in the proposed interactive classroom, where the concept of eye diagrams in regenerative repeaters will be presented from the first author's text using Matlab and WWW. ** Title: Experiences in Teaching DSP First in the ECE Curriculum Authors: James H. McClellan, Georgia Institute of Technology Ronald W. Schafer, Georgia Institute of Technology Mark A. 
Yoder, Rose-Hulman Institute of Technology Volume: 1, Page: 19 Abstract: In this paper we describe experiences gained from teaching an introductory electrical engineering course based on digital signal processing rather than the traditional first course in analog circuit theory. We will discuss our motivation for teaching DSP first, before covering analog circuits and systems. We will describe the style of the course and point out difficulties, as well as advantages, in this organization of basic material. Finally, we will make some comments about extending this approach to encompass a wider range of students from other disciplines. ** Title: Analog Signal Processing: A Replacement for the Sophomore-Level Circuit Analysis Course Authors: David C. Munson, University of Illinois Volume: 1, Page: 23 Abstract: A new undergraduate curriculum in electrical engineering has been adopted by the Department of Electrical and Computer Engineering at the University of Illinois. Major changes have been incorporated, including a redistribution of the circuits and signal processing topics within the curriculum. After giving an overview of the new curriculum, this paper focuses on a new, required sophomore-level course on analog signal processing. This course combines material from the traditional course on circuit analysis with material on continuous-time signals and systems. Students completing this course can study digital signal processing as first-semester juniors, which leaves ample time for more advanced signal and image processing courses in future semesters. ** Title: Re-engineering The Electrical Engineering Curriculum Authors: Sanjit K. Mitra, University of California Volume: 1, Page: 27 Abstract: Three specific programs are suggested to modify the electrical engineering curriculum to keep up with the dramatic technological developments of recent years. 
One of the programs is a five-year combined BS/MS program which permits the student to specialize in more than one field. The second proposal is to restructure the BS program into a multi-track program. The third one is an internship-in-industry program to provide the student with a meaningful and valuable real-world design experience before graduation. ** Title: Structural Subband Decomposition: A new Concept in Digital Signal Processing Authors: Sanjit K. Mitra, University of California Volume: 1, Page: 31 Abstract: This paper introduces the concept of structural subband decomposition of sequences, a generalization of the polyphase decomposition of sequences, and outlines a number of applications of this concept, such as efficient FIR filter design and implementation, adaptive filtering, and fast computation of discrete transforms. ** Title: A New Algorithm for the Generalized Eigenvalue Problem Authors: Knut Huper, University of Wurzburg Uwe Helmke, University of Wurzburg Volume: 1, Page: 35 Abstract: The problem of finding the generalized eigenvalues and eigenvectors of a pair of real symmetric matrices A and B, with B>0, can be viewed as a smooth optimization problem on a smooth manifold. We present a cost function approach to the generalized eigenvalue problem which is posed on the product of the n-sphere and Euclidean space R. The critical point set of this cost function is studied. An algorithm is presented based on constrained optimization. A proof of local quadratic convergence is given. ** Title: A lattice structure for perfect reconstruction linear time varying filter banks with all pass analysis banks Authors: Soura Dasgupta, University of Iowa Chris Schwarz, University of Iowa Minyue Fu, University of Newcastle Volume: 1, Page: 39 Abstract: We consider a multi-input, multi-output lattice realization for linear time-varying analysis banks which are all pass. 
Such a realization has been given for LTI systems; and under certain conditions, we show how it generalizes to the LTV case. Moreover, our implementation is simpler than the existing LTI version. Finally, we describe the anticausal inverse of a lattice realization which is used in the synthesis bank. ** Title: Algorithm Design for Structured Systems: Application to Pole Placement Authors: Steffen Paul, Technical University of Munich Josef A. Nossek, Technical University of Munich Volume: 1, Page: 43 Abstract: Numerical algorithms for signal processing and control are quite often constructed by intuition. When the system to be designed contains algebraic or other invariants, then these constraints can be exploited to find appropriate transformations. The transformations in system theory are usually Lie groups. One has to find Lie groups which are consistent with the invariants. We show how this point of view can be applied to construct pole placement algorithms for symmetric and skew-symmetric realizations. However, Lie group theory only reveals the appropriate transformations but is not able to reduce the design process to a trivial task. The problem discussed here also shows this limitation. ** Title: Actions of noncompact groups and algorithm design: A case study Authors: Klaus Diepold, IDT Rainer Pauli, Technical University of Munich Volume: 1, Page: 47 Abstract: Numerical matrix computations involving actions of noncompact transformation groups are known to produce numerical problems since the elements of the pertaining matrix representations are inherently unbounded. In this case study we analyse numerical problems occurring in a class of algorithms that is based on actions of the pseudo-orthogonal group O_n,m -- a group that is noncompact (hyperbolic geometry) and well established in signal processing (Schur methods). 
As a major result, it is shown how to exploit the additional degrees of freedom in defining coordinate frames in a Grassmannian setting in order to impose an a priori bound on the norm of the transformation matrices. This way, numerically disastrous situations can be circumvented systematically. Hence, it becomes possible to develop modified algorithms which exhibit superior numerical performance for a large class of problems based on e.g. hyperbolic transformations. ** Title: Discretization Issues for the Design of Optimal Blind Algorithms Authors: Rodney A. Kennedy, Australian National University Deva K. Borah, Australian National University Zhi Ding, Auburn University Volume: 1, Page: 51 Abstract: The performance and complexity of blind algorithms in a digital receiver are dependent on the prefilter prior to discretization of the received continuous time signal and the sampling rate. This paper shows that symbol spaced blind equalization algorithms are in general sub-optimal, since a matched filter cannot be used. We show that, for fractionally spaced equalizers, the prefilter can be a general low-pass filter and does not need to be matched to the unknown channel. This flexibility in choosing the prefilter can result in different discrete time models with different complexities for the signal processing algorithms to follow. For example, a simpler whitening filter design, which is needed for the success of several important blind equalization algorithms, can be realized using this flexibility. 
** Title: Continuous-Time Envelope-Constrained Filter Design via Laguerre Filter and H_(infinity) Optimization Methods Authors: Zhuquan Zang, ATRI, Curtin University of Technology Antonio Cantoni, ATRI, Curtin University of Technology Koklay Teo, ATRI, Curtin University of Technology Volume: 1, Page: 55 Abstract: Envelope-constrained filtering is concerned with the design of a time-invariant filter to process a given input signal such that the noiseless output of the filter is guaranteed to lie within a prespecified output mask. In this paper, using Laguerre filters and H_(infinity) optimization techniques, the continuous-time envelope-constrained filter design problem has been reformulated and solved as a constrained H_(infinity) model-matching problem. To illustrate the effectiveness of the design method, a numerical example is presented which deals with the design of an equalization filter for a digital transmission channel. ** Title: Local adaptive algorithms for information maximization in neural networks, and application to source separation Authors: Jeroen Dehaene, K.U.Leuven Nanayaa Twum-Danso, Harvard University Volume: 1, Page: 59 Abstract: Information theoretic criteria for neural network adaptation laws have recently become an important focus of attention. We consider the problem of adaptively maximizing the entropy of the outputs of a deterministic feedforward neural network with real valued stochastic input signals, as considered by Bell and Sejnowski. We give a new explanation for the relevance of output information (entropy) maximization for source separation applications and reinterpret Bell and Sejnowski's approach in a more general context of probability density estimation. This insight is the basis for a generalization of the approach, and we consider a family of gradient based algorithms. 
** Title: Quick Aggregation of Markov Chain Functionals via Stochastic Complementation Authors: Kutluyil Dogancay, University of Melbourne Vikram Krishnamurthy, University of Melbourne Volume: 1, Page: 63 Abstract: The paper presents a quick and simplified aggregation method for a large class of Markov chain functionals based on the concept of stochastic complementation. Aggregation results in a reduction in the number of Markov states by grouping them into a smaller number of aggregated states, thereby producing a considerable saving on computational complexity associated with maximum likelihood parameter and state estimation for hidden Markov models. The importance of the proposed aggregation method stems from the ease with which Markov chains with a large number of states can be aggregated. Three Markov chain functionals which have widespread use are considered to illustrate the application of our aggregation method. ** Title: A rank preserving flow algorithm for quadratic optimization problems subject to quadratic equality constraints Authors: John B. Moore, Systems Engineering, ANU Danchi Jiang, Dept. MAE. Chinese University Volume: 1, Page: 67 Abstract: This paper concerns quadratic programming problems subject to quadratic equality constraints such as arise in broadband antenna array signal processing and elsewhere. At first, such a problem is converted into a semidefinite programming problem with a rank constraint. Then, a rank preserving flow is used to accommodate the rank constraint. The associated gradient formulas are carefully developed. The convergence of the resulting algorithm is also guaranteed. Our approach is demonstrated by a numerical experiment. 
** Title: VERBMOBIL: The Combination Of Deep And Shallow Processing For Spontaneous Speech Translation Authors: Thomas Bub, DFKI Wolfgang Wahlster, DFKI Alex Waibel, Carnegie Mellon University Volume: 1, Page: 71 Abstract: Verbmobil is a speech-to-speech translation system for spontaneously spoken negotiation dialogs. The actual system translates 74.2% of spontaneously spoken German input. We give an overview of the Verbmobil system. After the introduction of the Verbmobil scenario and the unique constraints of the project, we describe the underlying system architecture and its realization. The progress that was achieved on the end-to-end translation rate owes much to the increase of the word recognition rate from 45% in 1993 to 87% in 1996. But in order to achieve the envisaged coverage on the uncertain speech recognizer output, deep and shallow approaches to the analysis and transfer problem had to be combined. ** Title: Prosodic Processing and its Use in Verbmobil Authors: Heinrich Niemann, University of Erlangen Elmar Noth, University of Erlangen Andreas Kiessling, University of Erlangen Ralf Kompe, University of Erlangen Anton Batliner, L.M.-Univ. Munchen Volume: 1, Page: 75 Abstract: We present the prosody module of the VERBMOBIL speech-to-speech translation system, the first complete system worldwide that successfully uses prosodic information in the linguistic analysis. This is achieved by computing probabilities for clause boundaries, accentuation, and different types of sentence mood for each of the word hypotheses computed by the word recognizer. These probabilities guide the search of the linguistic analysis. Disambiguation is already achieved during the analysis and not by a prosodic verification of different linguistic hypotheses. So far, the most useful prosodic information is provided by clause boundaries. These are detected with a recognition rate of 94%. 
For the parsing of word hypotheses graphs, the use of clause boundary probabilities yields a speed-up of 92% and a 96% reduction of alternative readings. ** Title: The Language Components in Verbmobil Authors: Hans Ulrich Block, Siemens AG Volume: 1, Page: 79 Abstract: This paper gives an overview of the main problems and their solutions in the language components of the Verbmobil speech translation system. Interpretation of spontaneously spoken language has to take into account that syntax and semantics differ from written language, that punctuation is missing, that accent and intonation have effects on the meaning and the translation, that the output of the speech recognizer may be noisy and that speakers produce errors due to distraction. The Verbmobil interpretation and translation components try to attack these problems by means of a grammar for spoken language, heavy use of prosodic information, a syntactic search on word hypothesis graphs and a shallow robust fall back translation device that is used in case the "deep" translation fails. ** Title: The Karlsruhe-Verbmobil Speech Recognition Engine Authors: Michael Finke, University of Karlsruhe Petra Geutner, University of Karlsruhe Hermann Hild, University of Karlsruhe Thomas Kemp, University of Karlsruhe Klaus Ries, University of Karlsruhe Martin Westphal, University of Karlsruhe Volume: 1, Page: 83 Abstract: Verbmobil, a German research project, aims at machine translation of spontaneous speech input. The ultimate goal is the development of a portable machine translator that will allow people to negotiate in their native language. Within this project the University of Karlsruhe has developed a speech recognition engine that has been evaluated on a yearly basis during the project and shows very promising speech recognition word accuracy results on large vocabulary spontaneous speech. In this paper we will introduce the Janus Speech Recognition Toolkit underlying the speech recognizer. 
The main new contributions to the acoustic modeling part of our 1996 evaluation system -- speaker normalization, channel normalization and polyphonic clustering -- will be discussed and evaluated. Besides the acoustic models we delineate the different language models used in our evaluation system: Word trigram models interpolated with class based models and a separate spelling language model were applied. As a result of using the toolkit and integrating all these parts into the recognition engine the word error rate on the German Spontaneous Scheduling Task (GSST) could be decreased from 30% in 1995 to 13.8% in 1996. ** Title: An Experiment On Korean-To-English And Korean-To-Japanese Spoken Language Translation Authors: Jae-Woo Yang, ETRI Jun Park, ETRI Volume: 1, Page: 87 Abstract: We have implemented a Korean-to-English and Korean-to-Japanese spoken language translation system prototype. The system can translate speech in the travel planning domain with a 5,000-word vocabulary. In our prototype, we concentrate on how to transfer the intention of a user to the partner in spite of the current limitations of spoken language processing technology. We measured the end-to-end performance of the prototype to test whether the output of the system is understandable using a subjective measure. We also used an objective measure to evaluate the system performance and found that its results are consistent with the subjective test. The test result shows that the user can understand the output even in the case that the system cannot translate speech correctly. Thus it is important to provide even partially correct translation output to the user, in order not to neglect the possibility that the user can infer the intended message using the context and his/her intelligence. 
** Title: Multilingual Person to Person Communication at IRST Authors: Bianca Angelini, IRST Mauro Cettolo, IRST Anna Corazza, IRST Daniele Falavigna, IRST Gianni Lazzari, IRST Volume: 1, Page: 91 Abstract: This paper refers to a machine-mediated person-to-person multilingual communication system. Stress is put on robustness, that is the ability of the system to preserve communication even in presence of the variability and errors typical of spoken language systems. The statistical approach is adopted not only at the acoustic level, but also for the linguistic processing. Therefore, while an overview of the global architecture will be briefly introduced, the focus will be put on the acoustic recognizer and the understanding module. Experimental evaluations complete the presentation. ** Title: Fast Word-Graph Generation For Spontaneous Conversational Speech Translation Authors: Tohru Shimizu, ATR-ITL Harald Singer, ATR-ITL Yoshinori Sagisaka, ATR-ITL Volume: 1, Page: 95 Abstract: This paper introduces the latest advances in research at ATR on speech translation for spontaneous conversations, especially focusing on speech recognition efforts. For recognition, we employ a word search technique that generates moderate sized word graphs in real-time. To cope with a variety in length of utterances, e.g., word, phrase, sentence fragment, sentence, and concatenated sentences in spontaneous speech, we have adopted a two pass search strategy that uses variable-order word n-gram statistics in the first stage and task dependent language constraints in the second stage. This strategy is evaluated using the ``ATR Travel Arrangement'' corpus. 
** Title: JANUS-III: Speech-to-Speech Translation in Multiple Languages Authors: Alon Lavie, Carnegie Mellon University Alex Waibel, Carnegie Mellon University Lori Levin, Carnegie Mellon University Michael Finke, Carnegie Mellon University Donna Gates, Carnegie Mellon University Marsal Gavalda, Carnegie Mellon University Torsten Zeppenfeld, Carnegie Mellon University Puming Zhan, Carnegie Mellon University Volume: 1, Page: 99 Abstract: This paper describes JANUS-III, our most recent version of the JANUS speech-to-speech translation system. We present an overview of the system and focus on how system design facilitates speech translation between multiple languages, and allows for easy adaptation to new source and target languages. We also describe our methodology for evaluation of end-to-end system performance with a variety of source and target languages. For system development and evaluation, we have experimented with both push-to-talk as well as cross-talk recording conditions. To date, our system has achieved performance levels of over 80% acceptable translations on transcribed input, and over 75% acceptable translations on speech input recognized with a 75-90% word accuracy. Our current major research is concentrated on enhancing the capabilities of the system to deal with input in broad and general domains. ** Title: State-Transition Cost Functions and an Application to Language Translation Authors: Hiyan Alshawi, AT&T Labs Adam L. Buchsbaum, AT&T Labs Volume: 1, Page: 103 Abstract: We define a general method for ranking the solutions of a search process by associating costs with equivalence classes of state transitions of the process. We show how the method accommodates models based on probabilistic, discriminative, and distance cost functions, including assignment of costs to unseen events. 
By applying the method to our machine translation prototype, we are able to experiment with different cost functions and training procedures, including an unsupervised procedure for training the numerical parameters of our English-Chinese translation model. Results from these experiments show that the choice of cost function leads to significant differences in translation quality. ** Title: Hybrid language processing in the Spoken Language Translator Authors: Manny Rayner, SRI International David M. Carter, SRI International Volume: 1, Page: 107 Abstract: The paper presents an overview of the Spoken Language Translator (SLT) system's hybrid language-processing architecture, focussing on the way in which rule-based and statistical methods are combined to achieve robust and efficient performance within a linguistically motivated framework. In general, we argue that rules are desirable in order to encode domain-independent linguistic constraints and achieve high-quality grammatical output, while corpus-derived statistics are needed if systems are to be efficient and robust; further, that hybrid architectures are superior from the point of view of portability to architectures which only make use of one type of information. We address the topics of ``multi-engine'' strategies for robust translation; robust bottom-up parsing using pruning and grammar specialization; rational development of linguistic rule-sets using balanced domain corpora; and efficient supervised training by interactive disambiguation. All work described is fully implemented in the current version of the SLT-2 system. ** Title: Finite-State Speech-to-Speech Translation Authors: Enrique Vidal, DSIC UPV Volume: 1, Page: 111 Abstract: A fully integrated approach to Speech-Input Language Translation in limited-domain applications is presented. 
The mapping from the input to the output language is modeled in terms of a finite state translation model which is learned from examples of input-output sentences of the task considered. This model is tightly integrated with standard acoustic-phonetic models of the input language and the resulting global model directly supplies, through Viterbi search, an optimal output-language sentence for each input-language utterance. Several extensions to this framework, recently developed to cope with the increasing difficulty of translation tasks, are reviewed. Finally, results for a task in the framework of hotel front-desk communication, with a vocabulary of about 700 words, are reported. ** Title: An Experimental Bidirectional Japanese/English Interpreting Video Phone System Using Internet. Authors: Shoji Hiraoka, MRIT Masakatsu Hoshimi, MRIT Kenji Matsui, CRL Jean-Claude Junqua, STL Volume: 1, Page: 115 Abstract: In this paper we report on an experimental bidirectional Japanese/English interpreting video phone system using the Internet. We particularly emphasize the motivation for this work, the task, and the experiments conducted. Using in-house technology developed both in Japan and in the United States, we demonstrated an Internet home shopping application where an American shop assistant and a Japanese customer engaged in task-directed dialogues, using their native languages. The experiments showed that when users are familiar with the application language, a natural interaction can be obtained. ** Title: From Neural Networks to Neural Strategies Authors: Christian Goerick, Ruhr-Univ. Bochum Bernhard Sendhoff, Ruhr-Univ. Bochum Werner von Seelen, Ruhr-Univ. Bochum Volume: 1, Page: 119 Abstract: Artificial neural networks have evolved from their biologically inspired roots to a well established means to solve a broad spectrum of engineering problems. 
The embedding into modern statistics has provided the necessary theoretical foundation for challenging engineering tasks, such as advanced real-time image and signal processing. These are exemplary demonstrations of the applicability of this approach to complex information processing. However, the large number of applications must not obscure the fact that there are some major unsolved problems concerning neural networks. There are still no satisfactorily constructive ways to determine the optimal structure (elements as well as organization) or the learning and evaluation dynamics. The ongoing research addresses these problems. In addition to pursuing this direction, one can ask what other lessons we can learn from biology concerning complex information processing. Our goal in this paper is to sketch a possible way from neural networks to more comprehensive neural strategies. ** Title: Neural And Traditional Techniques In Diagnostic ECG Classification. Authors: Rosaria Silipo, DSI Giovanni Bortolan, DSI Volume: 1, Page: 123 Abstract: Neural and traditional techniques have been compared for the particular task of automatic ECG analysis. A large validated ECG database has been used. Statistical methods, neural architectures with supervised and unsupervised learning, and a neuro-fuzzy architecture have been considered. The results from the connectionist approach are always at least comparable with those coming from more traditional classification methods. But the best performances have been obtained by the combination of the connectionist with the fuzzy approach. ** Title: Unsupervised Learning for Blind Source Separation: an Information-Theoretic Approach Authors: Dragan Obradovic, Siemens, Munchen Gustavo Deco, Siemens, Munchen Volume: 1, Page: 127 Abstract: This paper provides a detailed and rigorous analysis of the two commonly used methods for redundancy reduction: Linear Independent Component Analysis (ICA) and Information Maximization (InfoMax). 
The paper shows analytically that ICA based on the Kullback-Leibler information as a mutual information measure and InfoMax lead to the same solution if the parameterization of the output nonlinear functions in the latter method is sufficiently rich. Furthermore, this work briefly discusses the alternative redundancy measures not based on the Kullback-Leibler information distance and Nonlinear ICA. The practical issues of applying ICA and InfoMax are also discussed. ** Title: Applications of Neural Blind Separation to Signal and Image Processing Authors: Juha Karhunen, Helsinki University of Technology Aapo Hyvarinen, Helsinki University of Technology Ricardo Vigario, Helsinki University of Technology Jarmo Hurri, Helsinki University of Technology Erkki Oja, Helsinki University of Technology Volume: 1, Page: 131 Abstract: In blind source separation one tries to separate statistically independent unknown source signals from their linear mixtures without knowing the mixing coefficients. Such techniques are currently studied actively both in statistical signal processing and unsupervised neural learning. In this paper, we apply neural blind separation techniques developed in our laboratory to extraction of features from natural images and to separation of medical EEG signals. The new analysis method yields features that describe the underlying data better than, for example, classical principal component analysis. We also briefly discuss difficulties related to real-world applications of blind signal processing. ** Title: Communications and Neural Networks: Theory and Practice Authors: Mark D. Plumbley, KCL Volume: 1, Page: 135 Abstract: In this paper we shall see that neural networks and communications are interlinked in a number of ways, towards the goal of efficient communication of information. One concrete example of this is the use of neural networks to ensure efficient use of communication channels, through connection admission control in ATM networks. 
In addition, however, efficient communication is also important within a decision making system such as a neural network. Finally we examine what type of neural network solutions are suggested by this approach. ** Title: Robust Vector Quantization by Competitive Learning Authors: Joachim M. Buhmann, University of Bonn Thomas Hofmann, University of Bonn Volume: 1, Page: 139 Abstract: Competitive neural networks can be used to efficiently quantize image and video data. We discuss a novel class of vector quantizers which perform noise robust data compression. The vector quantizers are trained to simultaneously compensate channel noise and code vector elimination noise. The training algorithm to estimate code vectors is derived by the maximum entropy principle in the spirit of deterministic annealing. We demonstrate the performance of noise robust codebooks with compression results for a teleconferencing system on the basis of a wavelet image representation. ** Title: Recognizing faces from a new viewpoint Authors: Thomas Vetter, Max-Planck-Institut, Tubingen Volume: 1, Page: 143 Abstract: A new technique is described for recognizing faces from new viewpoints. From a single 2D image of a face, synthetic images from new viewpoints are generated and compared to stored views. A novel 2D image of a face can be computed without knowledge about the 3D structure of the head. The technique draws on prior knowledge of faces based on example images of other faces seen in different poses and on a single generic 3D model of a human head. The example images are used to learn a pose-invariant shape and texture description of a new face. The 3D model is used to solve the correspondence problem between images showing faces in different poses. The performance of the technique is tested on a data set of 200 faces of known orientation for rotations up to 90 degrees. 
** Title: Hybrid Optimization of Feedforward Neural Networks for Handwritten Character Recognition Authors: Wolfgang Utschick, Technical University of Munich Josef A. Nossek, Technical University of Munich Volume: 1, Page: 147 Abstract: An extension of a feedforward neural network is presented. Although utilizing linear threshold functions and a boolean function in the second layer, signal processing within the neural network is real. After mapping input vectors onto a discretization of the input space, real valued features of the internal representation of pattern are extracted. A vector quantizer assigns a class hypothesis to a pattern based on its extracted features and adequate reference vectors of all classes in the decision space of the output layer. Training consists of a combination of combinatorial and convex optimization. This work has been applied to a standard optical character recognition task. Results and comparison to alternative approaches are presented. ** Title: Reading Checks with multilayer graph transformer networks Authors: Yann LeCun, AT&T Labs Leon Bottou, AT&T Labs Yoshua Bengio, AT&T Labs Volume: 1, Page: 151 Abstract: We propose a new machine learning paradigm called Multilayer Graph Transformer Network that extends the applicability of gradient-based learning algorithms to systems composed of modules that take graphs as input and produce graphs as output. A complete check reading system based on this concept is described. The system combines convolutional neural network character recognizers with graph-based stochastic models trained cooperatively at the document level. It is deployed commercially and reads millions of business and personal checks per month with record accuracy. 
** Title: Neural Networks For Process Control In Steel Manufacturing Authors: Martin Schlang, Siemens AG, Munich Einar Broese, Siemens AG, Munich Bjoern Feldkeller, Siemens AG, Munich Otto Gramckow, Siemens AG, Munich Michael Jansen, Siemens AG, Munich Thomas Poppe, Siemens AG, Munich Clemens Schaeffner, Siemens AG, Munich Guenter Soergel, Siemens AG, Munich Volume: 1, Page: 155 Abstract: Neural Networks are particularly suitable for the approximation of non-linear time-variant functions. Due to their learning capabilities, they have proven useful in control applications for complex industrial processes. In collaboration with the Corporate Research and Development Department, the Siemens Industrial and Building Systems Group developed Neural Network applications for the steel industry, resulting in a more economic use of resources and an improvement of productivity. At this time Siemens has installed more than 100 neural nets worldwide at different plants. ** Title: A Neuro-Dynamic Programming Approach to Admission Control in ATM Networks: The Single Link Case Authors: Peter Marbach, LIDS, MIT John N. Tsitsiklis, LIDS, MIT Volume: 1, Page: 159 Abstract: We are interested in solving large-scale Markov Decision Problems. The classical method of Dynamic Programming provides a mathematical framework for finding optimal solutions for a given Markov Decision Problem. However, Dynamic Programming algorithms become computationally infeasible when the underlying Markov Decision Problem evolves over a large state space. In recent years, a new methodology, called Neuro-Dynamic Programming, has emerged which tries to overcome this ``curse of dimensionality''. We present how Neuro-Dynamic Programming can be applied to the Admission Control Problem for a single link in an ATM environment. Based on results obtained through Neuro-Dynamic Programming, we derive a heuristic ``Threshold'' policy. 
Performances of the policies obtained through Neuro-Dynamic Programming are compared with a policy which always accepts a customer when the required resources are available. ** Title: Issues In Measuring The Benefits Of Multimodal Interfaces Authors: James L. Flanagan, Rutgers University Ivan Marsic, Rutgers University Volume: 1, Page: 163 Abstract: Multimedia interfaces are rapidly evolving to facilitate human/machine communication. Most of the technologies on which they are based are, as yet, imperfect. But, the interfaces do begin to allow information exchange in ways familiar and comfortable to the human--principally through natural actions in the sensory dimensions of sight, sound and touch. Further, as digital networking becomes ubiquitous, the opportunity grows for collaborative work through conferenced computing. In this context the machine takes on the role of mediator in human/machine/human communication--the ideal being to extend the intellectual abilities of humans through access to distributed information resources and collective decision making. The challenge is how to design machine mediation so that it extends, not impedes, human abilities. This report describes evolving work to incorporate multimodal interfaces into a networked system for collaborative distributed computing. It also addresses strategies for quantifying the synergies that may be gained. ** Title: Multimodal Interfaces for Multimedia Information Agents Authors: Alex Waibel, Carnegie Mellon University Bernhard Suhm, Carnegie Mellon University Minh Tu Vo, Carnegie Mellon University Jie Yang, Carnegie Mellon University Volume: 1, Page: 167 Abstract: When humans communicate they take advantage of a rich spectrum of cues. Some are verbal and acoustic. Some are non-verbal and non-acoustic. Signal processing technology has devoted much attention to the recognition of speech, as a single human communication signal. 
Most other complementary communication cues, however, remain unexplored and unused in human-computer interaction. In this paper we show that the addition of non-acoustic or non-verbal cues can significantly enhance robustness, flexibility, naturalness and performance of human-computer interaction. We demonstrate computer agents that use speech, gesture, handwriting, pointing, spelling jointly for more robust, natural and flexible human-computer interaction in the various tasks of an information worker: information creation, access, manipulation or dissemination. ** Title: Smart Rooms, Desks, and Clothes Authors: Alex Pentland, MIT Media Lab Volume: 1, Page: 171 Abstract: We are working to develop smart networked environments that can help people in their homes, offices, cars, and when walking about. Our research is aimed at giving rooms, desks, and clothes the perceptual and cognitive intelligence needed to become active helpers. ** Title: Human Machine Interaction by Voice and Gesture Authors: Nikil Jayant, Bell Laboratories Volume: 1, Page: 175 Abstract: Voice and gesture represent fundamental and universal modalities in interhuman communication. With recent advances in automatic methods of speech recognition and synthesis, human-machine interaction by voice is rapidly becoming a technological and commercial reality. Although less mature and deployed, gesture recognition by machine is becoming reliable enough to be considered as a serious supplement to the voice interface between humans and machines. ** Title: Audio-Visual Interaction in Multimedia Communication Authors: Tsuhan Chen, AT&T Labs - Research Ram R. Rao, Georgia Institute of Technology Volume: 1, Page: 179 Abstract: To many people, the word "multimedia" simply means the combination of various forms of information: text, speech, music, images, graphics and video. What is often overlooked is the interaction among these forms. 
In this paper, we will present our recent results in exploiting the audio-visual interaction that is very significant in multimedia communication. The applications include lip synchronization, joint audio-video coding, and person verification. We will present the enabling technologies, including audio-to-visual mapping and facial image analysis, for these applications. Our results show that the joint processing of audio and video provides advantages that are not available when audio and video are studied separately. ** Title: LIP Motion Modeling and Speech Driven Estimation Authors: F. Lavagetto, University of Genova S. Lepsoy, University of Genova C. Braccini, University of Genova S. Curinga, University of Genova Volume: 1, Page: 183 Abstract: Recent advances in joint acoustical/visual analysis for model-based lip motion synthesis are presented. The 2D lip motion field is modeled as a linear combination of a low dimensional motion basis computed through Principal Component Analysis (PCA). The vector of PCA coefficients is expressed as a function of a limited set of articulatory parameters which describe the external appearance of the mouth. The acoustical processing estimates these articulatory parameters from the direct analysis of the speech waveform based on a neural processing stage, i.e. through a bank of Time Delay Neural Networks. The achieved results have been subjectively evaluated by visualizing the estimated motion on a wire-frame mouth template presented in synchronization with speech. The experiments carried out so far deal with single-speaker trained TDNNs and with single-speaker PCA, but suitable algorithms for generalizing the techniques are currently under investigation. 
** Title: Voice Source Localization for Automatic Camera Pointing System in Videoconferencing Authors: Hong Wang, PictureTel Peter Chu, PictureTel Volume: 1, Page: 187 Abstract: This paper describes the voice source localization algorithm used in the PictureTel automatic camera pointing system (LimeLight-TM, Dynamic Speech Locating Technology). The system uses an array 46 cm wide and 30 cm high, which contains 4 microphones and is mounted on top of the monitor. The three dimensional position of a sound source is calculated from the time delays of 4 pairs of microphones. In time delay estimation, the averaging of signal onsets of each frequency band is combined with phase correlation to reduce the influence of noise and reverberation. With this approach, it is possible to provide reliable three dimensional voice source localization by a small microphone array. Post processing based on a priori knowledge is also introduced to eliminate the influences of reflections from furniture such as tables. Results of speech source localization under real conference room conditions will be given. Some system related issues will also be discussed. ** Title: Video interface for spatiotemporal interactions based on multi-dimensional video computing Authors: Akihito Akutsu, NTT Human Interface Laboratories Yoshinobu Tonomura, NTT Human Interface Laboratories Hiroshi Hamada, NTT Human Interface Laboratories Volume: 1, Page: 191 Abstract: Because digital video is becoming increasingly important for the networked multimedia society, the audio-visual access environment should allow us to do more than just passively watch. We propose a new video user interface concept made possible by multi-dimensional video computing. Multi-dimensional video computing offers a framework for analyzing a video, creating new structures, and restyling and visualizing the video according to the user's demands. 
The video interface visualizes video content and context structure comprehensibly to allow us to access the spatiotemporal information in videos intuitively. In this paper, we introduce our research activities toward a video interface based on the information extracted from the video. New video interfaces called VideoBrowser, PanoramaVideo, and VideoJigsaw are described. ** Title: Indexing and Search of Multimodal Information Authors: Alexander G. Hauptmann, Carnegie Mellon University Howard D. Wactlar, Carnegie Mellon University Volume: 1, Page: 195 Abstract: The Informedia Digital Library Project allows full content indexing and retrieval of text, audio and video material. The integration of speech recognition, image processing, natural language processing and information retrieval overcomes limits in each technology to create a useful system. In order to answer the question how good speech recognition has to be in order to be useful and usable for indexing and retrieving speech recognizer generated transcripts, some empirical evidence is presented that illustrates the degradation of information retrieval at different levels of speech accuracy. In our experiments, word error rates up to 25% did not significantly impact information retrieval and error rates of 50% still provided 85 to 95% of the recall and precision relative to fully accurate transcripts in the same retrieval system. ** Title: Acoustic Indexing for Multimedia Retrieval and Browsing Authors: Steve J. Young, Cambridge University Engineering Dept Jonathan T. Foote, Cambridge University Engineering Dept Gareth J.F Jones, Cambridge University Engineering Dept Karen Sparck Jones, Cambridge University Computer Lab Martin G. Brown, ORL Ltd Volume: 1, Page: 199 Abstract: This paper reviews the Video Mail Retrieval (VMR) project at Cambridge University and ORL. 
The VMR project began in September 1993 with the aim of developing methods for retrieving video documents by scanning the audio soundtrack for keywords. The project has shown, both experimentally and through the construction of a working prototype, that speech recognition can be combined with information retrieval methods to locate multimedia documents by content. The final version of the VMR system uses pre-computed phone lattices to allow extremely rapid word spotting and audio indexing, and statistical information retrieval (IR) methods to mitigate the effects of spotting errors. The net result is a retrieval system that is open-vocabulary and speaker-independent, and which can search audio orders of magnitude faster than real time. ** Title: Broadcast News Transcription Authors: Francis Kubala, BBN Hubert Jin, BBN Long Nguyen, BBN Richard Schwartz, BBN Spyros Matsoukas, Northeastern University Volume: 1, Page: 203 Abstract: In this paper we describe our recent work on automatic transcription of radio and television news broadcasts. This problem is very challenging for large vocabulary speech recognition because of the frequent and unpredictable changes that occur in speaker, speaking style, topic, channel, and background conditions. Faced with such a problem, there is a strong tendency to try to carve the input into separable classes and deal with each one independently. In our early work on this problem, however, we are finding that the rewards for condition-specific techniques are disappointingly small. This is forcing us to look for general, robust, and adaptive algorithms for dealing with extremely variable data. Herein, we describe the BBN BYBLOS recognition system configured to handle off-line transcription and we characterize the speech contained in the 1996 DARPA Hub-4 testbed. On the partitioned development test set, we achieved a 29.4% overall word error rate. 
** Title: Image/Speech Processing that Adopts an Artistic Approach -Toward Integration of Art and Technology- Authors: Ryohei Nakatsu, ATR-MIC Volume: 1, Page: 207 Abstract: In the areas of image/speech processing, researchers have long dreamed of producing computer agents that can communicate with people in a human-like way. Although the non-verbal aspects of communications, such as emotions-based communications, play very important roles in our daily lives, most research so far has concentrated on the verbal aspects of communications and has neglected the nonverbal aspects. To achieve human-like agents we have adopted a two-way approach. 1. To provide agents with nonverbal communications capability, engineers have started research on emotions recognition and facial expressions recognition. 2. Artists have begun to design and generate the reactions and behaviors of agents, to fill the gap between real human behaviors and those of computer agents. ** Title: Noise Cancelling for Microphone Arrays Authors: Jens Meyer, Darmstadt University of Technology Carsten Sydow, SIEMENS AG Volume: 1, Page: 211 Abstract: In this paper an application of the noise cancelling method for suppression of noise of a microphone array system is discussed. First an overview of the noise cancelling approach is given. This is followed by a description of the employment of the method in a realized microphone array system. The limiting factors are described and theoretical limits of the noise suppression are derived. Experimental results, which are obtained in a realistic environment, are presented. The results show that, depending on the recording situation, the noise cancelling approach applied to a microphone array system leads to a significant enhancement of the signal to noise ratio of the array output signal. ** Title: A Microphone Array System for Speech Recognition Authors: Kenji Kiyohara, NTT Human Interface Labs. Yutaka Kaneda, NTT Human Interface Labs. 
Satoshi Takahashi, NTT Human Interface Labs. Hiroaki Nomura, NTT Human Interface Labs. Junji Kojima, NTT Human Interface Labs. Volume: 1, Page: 215 Abstract: This paper proposes a microphone array system which realizes the following important functions for speech recognition: i) SNR improvement, ii) flat spectrum response for an arbitrary speaker position, and iii) speech period detection in noisy speech. This microphone array system features time delay estimation using pre-whitening signal processing, delay-and-sum array weighted optimally, and speech period detection based on the level difference (called MLD) between signals before and after array processing. Word recognition experiments performed in the presence of crowd noise demonstrate greater robustness of the proposed system against noise than the system with conventional directional microphone and speech period detection method. ** Title: Strategies for combining Acoustic Echo Cancellers and Adaptive Beamforming Microphone Arrays Authors: Walter Kellermann, FH Regensburg Volume: 1, Page: 219 Abstract: New concepts for efficient combination of acoustic echo cancellation (AEC) and adaptive beamforming microphone arrays (ABMA) are presented. By decomposing common beamforming methods into a time-invariant part, which the AEC can integrate, and a separate time-variant part, the number of echo cancellers is minimized without rendering the system identification problem more difficult. Methods for controlling the interaction of ABMA and AEC are outlined and implementations for typical microphone array applications are discussed briefly. ** Title: A Steerable and Variable First-Order Differential Microphone Array Authors: Gary W. Elko, Acoustics Research Department Anh-Tho Nguyen Pong, Speech Processing Software and Technology Research Volume: 1, Page: 223 Abstract: A new first-order differential microphone array with an infinitely steerable and variable beampattern is described. 
The microphone consists of 6 small pressure microphones flush-mounted on the surface of a 3/4" diameter rigid nylon sphere. The microphones are located on the surface at points where included octahedron vertices contact the spherical surface. By appropriately combining the three Cartesian orthogonal pairs with simple scalar weightings, a general first-order differential microphone beam (or beams) can be realized and directed to any angle in 4π steradian space. A working real-time version has been created and measured results from this microphone are shown. This microphone should be useful for surround sound recording/playback applications and for virtual reality audio applications. ** Title: Microphone Array based Speech Recognition with Different Talker-Array Positions Authors: Maurizio Omologo, ITC-IRST Marco Matassoni, ITC-IRST Piergiorgio Svaizer, ITC-IRST Diego Giuliani, ITC-IRST Volume: 1, Page: 227 Abstract: The use of a microphone array for hands-free continuous speech recognition in noisy and reverberant environments is investigated. An array of eight omnidirectional microphones was placed at different angles and distances from the talker. A time delay compensation module was used to provide a beamformed signal as input to a Hidden Markov Model (HMM) based recognizer. A phone HMM adaptation, based on a small amount of phonetically rich sentences, further improved the recognition rate obtained by applying only beamforming. These results were confirmed both by experiments conducted in a noisy and reverberant environment and by simulations. In the latter case, different conditions were recreated by using the image method to reproduce synthetic versions of the array microphone signals. 
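As a rough illustration of the time delay compensation mentioned in the abstract above, the following Python sketch aligns each microphone channel by a known delay and averages the result (a plain delay-and-sum beamformer). It assumes integer sample delays that are already known; the function name, the test signal, and the delay values are hypothetical and are not taken from the paper, whose actual beamformer may differ.

```python
import math

def delay_and_sum(channels, delays):
    """Advance each channel by its integer delay (in samples) and average.

    channels: list of equal-rate sample lists, one per microphone.
    delays:   assumed-known arrival delay of the source at each microphone.
    """
    # Only keep output samples for which every channel has data.
    n = min(len(ch) - dl for ch, dl in zip(channels, delays))
    return [
        sum(ch[t + dl] for ch, dl in zip(channels, delays)) / len(channels)
        for t in range(n)
    ]

# Synthetic example: one sinusoidal source reaching four microphones
# with different (hypothetical) delays.
source = [math.sin(2 * math.pi * 0.05 * t) for t in range(200)]
delays = [0, 3, 5, 8]
channels = [[0.0] * dl + source for dl in delays]

out = delay_and_sum(channels, delays)
```

With the delays compensated correctly, the averaged output reproduces the source, while uncorrelated noise on the individual channels would be attenuated by the averaging.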
** Title: Acoustic Source Location In A Three-Dimensional Space Using Crosspower Spectrum Phase Authors: Piergiorgio Svaizer, ITC-IRST Marco Matassoni, ITC-IRST Maurizio Omologo, ITC-IRST Volume: 1, Page: 231 Abstract: A microphone array can be used to locate a dominant acoustic source in a given environment. This capability is successfully employed to locate an active talker in teleconferencing or other multi-speaker applications. In this work the source location is obtained in two steps: 1) a Time Difference Of Arrival (TDOA) computation between the signals of the array; 2) an ``optimal'' source location based on the interchannel delay estimates and on a geometrical description of the sensor arrangement. The Crosspower Spectrum Phase technique was used for TDOA estimation, while a Maximum Likelihood approach was followed to derive the source coordinates. Source location experiments in a three-dimensional space were performed by means of an array of 8 microphones. For this purpose both a loudspeaker and a real talker were used to collect data in a large noisy and reverberant room. ** Title: Superdirective Microphone Array for a Set-Top Videoconferencing System Authors: Peter Chu, PictureTel Volume: 1, Page: 235 Abstract: In set-top videoconferencing, the complete videoconferencing system fits unobtrusively on top of the television. The microphone sound pickup system is one of the most important functional blocks with constraints of small size, high performance, and low cost. Persons speaking several feet away from the system must be picked up satisfactorily while noise generated internally in the system by the cooling fan and hard drive, and noise generated externally from air conditioning and nearby computers must be attenuated. In this paper, a three microphone superdirective array is described which meets these constraints. 
An analog highpass and lowpass filter are used to merge two of the microphone signals to form a single channel, so that a single stereo A/D converter is required to process the three microphone signals. The microphone signals are then linearly combined so as to maximize the signal-to-noise ratio, resulting in nulls steered toward nearby objectionable noise sources. ** Title: Simultaneous Echo Cancellation and Car Noise Suppression Employing a Microphone Array Authors: Mattias Dahl, University of Karlskrona/Ronneby Ingvar Claesson, University of Karlskrona/Ronneby Sven Nordebo, University of Karlskrona/Ronneby Volume: 1, Page: 239 Abstract: This paper presents a method to simultaneously perform 20 dB acoustic echo cancellation and 15-20 dB speech enhancement using an adaptive microphone array combined with spectral subtraction. Primarily intended for handsfree telephones in automobiles, the microphone array system simultaneously emphasizes the near-end talker and suppresses the handsfree loudspeaker and the broadband car noise. The array system is based on a fast and efficient on-site calibration and can be used in other situations such as conventional speaker phones. ** Title: Analytical Evaluation of a Self-calibrating Microphone Array Authors: Sven Nordholm, University of Karlskrona/Ronneby Ingvar Claesson, University of Karlskrona/Ronneby Volume: 1, Page: 243 Abstract: This paper gives an analytical description of an adaptive microphone array which facilitates a simple built-in calibration to the environment and instrumentation. The scheme offers several advantages, such as a simple calibration procedure and reduced target signal distortion. The analysis employs noncausal Wiener filters yielding compact and effective theoretical suppression limits. 
** Title: Microphone Array Response to Speaker Movements Authors: Yves Grenier, ENST Sofiene Affes, INRS-Telecommunications Volume: 1, Page: 247 Abstract: Matched filtering and adaptive beamforming are both necessary for efficient speech dereverberation and noise reduction by microphone arrays. This can be achieved by the identification of impulse responses. In this contribution, we show that adaptive microphone arrays are sensitive to identification errors of impulse responses, particularly due to speaker movements. We prove that adjusted matched-filtering and permanent tracking of impulse responses are also necessary. The proposed microphone array responds well to these requirements under realistic conditions. ** Title: A Digital Processing System for Source Location and Sound Capture by Large Microphone Arrays Authors: Harvey F. Silverman, Brown University William R. Patterson, Brown University James L. Flanagan, Rutgers University Daniel Rabinkin, Rutgers University Volume: 1, Page: 251 Abstract: The Huge Microphone Array (HMA) project started in February 1994 to design, construct, and test a real-time 512-microphone array system and to develop algorithms for use on it. Analysis of known algorithms showed that signal-processing performance of over 6 Gigaflops would be required; at the same time, there was a need for portability, i.e., fitting into a small van. These tradeoffs and many others have led to a unique design in both hardware and software. This paper presents the design and its justifications. Performance data for a few important algorithms relative to usage of processing-capability, response latency, and difficulty of programming are discussed. ** Title: 3-D Unitary ESPRIT for Joint 2-D Angle and Carrier Estimation Authors: Martin Haardt, Siemens Josef A. 
Nossek, Technical University of Munich Volume: 1, Page: 255 Abstract: It is essential for an efficient frequency and time slot allocation procedure in future mobile communication systems using space division multiple access (SDMA) to determine the mobiles that are spatially well separated from one another. Thus, once a mobile desires to initiate a call, precise knowledge of the 2-D arrival angles of its dominant wavefronts is required. In this application, 3-D Unitary ESPRIT for joint 2-D angle and carrier estimation offers an efficient way to handle such mobile access requests since it provides efficient high-resolution measurements of the spatial characteristics of the wireless channel, even if only a small number of antennas is available at the base station. Automatic pairing of the 3-D estimates is achieved via a new simultaneous Schur decomposition (SSD) of three real-valued, non-symmetric matrices. In general, the SSD enables an R-dimensional extension of Unitary ESPRIT (R ≥ 3) to estimate several undamped R-dimensional modes or frequencies along with their correct pairing in multidimensional harmonic retrieval problems. Here, we present a Jacobi-type method to calculate the SSD. For each of the R dimensions, the corresponding frequency estimates are obtained from the real eigenvalues of a real-valued matrix. The SSD jointly estimates the eigenvalues of all R matrices and, thereby, achieves automatic pairing of the estimated R-dimensional modes via a closed-form procedure that neither requires any search nor any other heuristic pairing strategy. ** Title: Quality enhancement of coded and corrupted speeches in GSM mobile systems using residual redundancy Authors: Thomas Hindelang, Technical University of Munich Wen Xu, Technical University of Munich Christian Erben, Technical University of Munich Volume: 1, Page: 259 Abstract: There is often residual redundancy remaining in coded speech data, even if a powerful speech codec (e.g. 
the full rate coder used in GSM mobile communications) is employed. By using such redundancy together with the information provided by the channel decoder, such as soft output (L-value), the number of channel bits inverted by the decoder, or a cyclic redundancy check, the bit error rate can be further reduced and a more graceful degradation of speech quality can be achieved, especially under bad channel conditions. In this paper, we report on the study with regard to this aspect for GSM full rate speech transmission and error concealment. The algorithms developed can be easily implemented with a currently available DSP designed for GSM mobile phones. ** Title: Pilot Assisted Coherent DS-CDMA Reverse-Link Communications with Optimal Robust Channel Estimation Authors: Fuyun Ling, Motorola Volume: 1, Page: 263 Abstract: Optimal pilot assisted estimation of communication channels is considered for coherent cellular and PCS CDMA reverse link communications. Both pilot symbol and pilot channel based schemes are described and the optimal estimators for these two schemes are analyzed. Relative mean square estimation error (RMSEE) and optimal power allocation between data and pilot signals are derived based on the analysis. Finally, simulation results are given to show the reverse link performance can be significantly improved by using the pilot assisted coherent communication instead of non-coherent schemes for CDMA reverse link. ** Title: A new Frequency Estimator applied to Burst Transmission Authors: Christian Bergogne, Telecom Paris, Alcatel Telspace Michel Bousquet, ENSAE Philippe Sehier, Alcatel Telspace Volume: 1, Page: 267 Abstract: In TDMA communications systems using all feedforward synchronization techniques, the quality of data decoding strictly depends on the estimation accuracy of the synchronization parameters (timing, carrier phase/frequency and preamble detection) extracted from the received signal. 
The frequency offset estimation is the most critical point. Indeed, an inaccurate frequency estimation can cause cycle slips and then errors during decoding. In this paper, we propose a new frequency estimator, analytically derived from the Maximum Likelihood principle and optimized thanks to variance simulations. Its performance is compared to the Cramer Rao Bound. ** Title: Unified Specification of Control and Data Flow Authors: Thorsten Groetker, ISS, RWTH Aachen Rainer Schoenen, ISS, RWTH Aachen Heinrich Meyr, ISS, RWTH Aachen Volume: 1, Page: 271 Abstract: Many signal processing systems use event driven mechanisms - typically based on finite state machines (FSMs) - to control the operation of computationally intensive (data flow) parts. The state machines in turn are often fueled by external inputs as well as by feedback from the signal processing portions of the system. Packet-based transmission systems are a good example for such a close interaction between data and control flow. For an efficient design flow it is of crucial importance to be able to model and analyze the complete functionality of the system within one single design environment. Therefore, we developed a computational model that integrates the specification of control and data flow by combining the notion of data flow graphs with event driven process activation. ** Title: Reconfigurable Processing: The Solution to Low-Power Programmable DSP Authors: Jan M. Rabaey, University of California at Berkeley Volume: 1, Page: 275 Abstract: One of the most compelling issues in the design of wireless communication components is to keep power dissipation between bounds. While low-power solutions are readily achieved in an application-specific approach, doing so in a programmable environment is a substantially harder problem. This paper presents an approach to low-power programmable DSP that is based on the dynamic reconfiguration of hardware modules. 
This technique has been shown to yield at least an order of magnitude of power reduction compared to traditional instruction-based engines for problems in the area of wireless communication. ** Title: DSP Cores for Mobile Communications: Where are we going? Authors: Gerhard P. Fettweis, Technical University of Dresden Volume: 1, Page: 279 Abstract: Digital signal processors (DSPs) have become a key component for the design of communications ICs. Application customization leads to key market advantages but also to enormous problems of having too many different DSPs and their software development tools. First, by analysis of the problem, open issues are pointed out. Then, a possible solution named CATS is presented, which allows for customization without the generation of too much heterogeneity in hardware and tools. ** Title: DSPs in Mobile Communication in the United States Authors: Sanjay Kasturia, Bell Labs, Lucent Technologies Colin Warwick, Bell Labs, Lucent Technologies Volume: 1, Page: 283 Abstract: The mobile communication industry in the United States is undergoing major changes. Auctioning of additional spectrum will lead to more service providers and will significantly increase competition. Service providers are likely to customize the services they offer to differentiate themselves from others. We will discuss possible technologies for differentiation of services and the implications of these on the requirements for embedded DSPs. In the US, supporting the customization in the absence of a single industry wide standard, and the high likelihood of at least three widely used air interfaces will significantly challenge the ability of the industry to serve the phone needs of all service providers. The need for customization in the context of multiple standards will create strong pressure to significantly improve the code development environment for DSPs. This also implies evolution to architectures that are more friendly to developers. 
** Title: FRIDGE: An Interactive Code Generation Environment for HW/SW CoDesign Authors: Markus Willems, ISS, RWTH Aachen Volker Bursgens, ISS, RWTH Aachen Thorsten Grotker, ISS, RWTH Aachen Heinrich Meyr, ISS, RWTH Aachen Volume: 1, Page: 287 Abstract: Digital mobile systems are sensitive to power consumption, chip size and costs. Therefore they are realized using fixed-point architectures, either dedicated HW or fixed-point processors. On the other hand, system design starts from a floating-point description. These requirements have been the motivation for FRIDGE, a design environment for the specification, evaluation and implementation of fixed-point systems. FRIDGE offers a seamless design flow from a floating-point description to a fixed-point implementation. Within this paper we focus on the FRIDGE-concept of an interactive, automated transformation of floating-point programs written in ANSI-C into fixed-point specifications, based on an interpolative approach. Since HW and SW implementations of the same functionality in general require different fixed-point specifications, the design time reductions that can be achieved by using FRIDGE make it a key component for an efficient HW/SW-CoDesign. ** Title: Staying Ahead of the Game In Silicon for Digital Mobile Communications Authors: Ravi Subramanian, Synopsys Inc. Marc Barberis, Synopsys Inc. Herbert Dawid, Synopsys GmbH Volume: 1, Page: 291 Abstract: While the mobile communication electronics industry's appetite grows for ever more functions and ever higher levels of integration, the complexity of these large designs is creating a discontinuity in the method by which these systems are designed. In this paper, we will take a close look at what is causing the design discontinuity, and how new design technologies are being used to design advanced digital communications systems for portable and wireless communication applications. 
We will examine how system-level design tools closely tied to silicon design implementation and verification technologies are enabling the creation of digital communications ICs in record time. We take several examples of commercially available silicon solutions designed using these methodologies- a G.721 ADPCM speech codec for cordless telephony and a complete variable-rate digital-video broadcast receiver for the DVB-S broadcast standard. ** Title: Approximation of Optimal Step Size Control for Acoustic Echo Cancellation Authors: Christiane Antweiler, RWTH Aachen Jorn Grunwald, RWTH Aachen Holger Quack, RWTH Aachen Volume: 1, Page: 295 Abstract: One of the most widely used gradient-based adaptation algorithms is the so called normalized least mean square (NLMS) algorithm. The rate of convergence, misadjustment and noise insensitivity of the NLMS-type algorithm depend on the proper choice of the step size parameter, which controls the weighting applied to each coefficient update. Different step size methods have been proposed to improve the convergence of NLMS-type filters, while preserving the steady-state performance. The step size methods considered here use either a step size parameter which varies with time or a separate, tap-individual step size for each filter tap. The derivation of the respective step size methods is based on different optimization criteria. In this paper a step size parameter is proposed satisfying a combined optimization criterion leading to a time variant and individual step size parameter. The realization aspects of the new concept are discussed for an acoustic echo control application as an example. 
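As background to the abstract above, the following is a minimal sketch of the conventional NLMS update with a single fixed step size. It is not the time-variant, tap-individual step size the paper proposes; the function names nlms and simulate_echo, the parameter eps, and the three-tap test echo path are assumptions chosen for illustration only.

```python
import random


def nlms(x, d, num_taps, mu=0.5, eps=1e-8):
    """Identify an FIR system with the conventional (fixed step size) NLMS.

    x: input samples, d: desired samples (e.g. the microphone signal),
    mu: fixed step size, eps: small regularizer (an assumption here).
    Returns the final tap weights and the a-priori error sequence.
    """
    w = [0.0] * num_taps
    buf = [0.0] * num_taps  # most recent input sample first
    errors = []
    for x_n, d_n in zip(x, d):
        buf = [x_n] + buf[:-1]
        y_n = sum(w_k * b_k for w_k, b_k in zip(w, buf))
        e_n = d_n - y_n
        norm = eps + sum(b_k * b_k for b_k in buf)
        # Normalized update: the correction is scaled by the input energy.
        w = [w_k + mu * e_n * b_k / norm for w_k, b_k in zip(w, buf)]
        errors.append(e_n)
    return w, errors


def simulate_echo(h, x):
    """Filter x through a hypothetical FIR echo path h (test helper)."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def demo():
    """Noise-free identification of an illustrative three-tap echo path."""
    rng = random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    h = [0.5, -0.3, 0.1]
    w, _ = nlms(x, simulate_echo(h, x), num_taps=len(h))
    return h, w
```

In this sketch mu is the single knob that trades convergence speed against misadjustment, which is the trade-off that step size control methods such as the one in the abstract aim to improve.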
** Title: Subband stereo echo canceller using the projection algorithm with fast convergence to the true echo path Authors: Shoji Makino, NTT Human Interface Labs Klaus Strauss, NTT Human Interface Labs Suehiro Shimauchi, NTT Human Interface Labs Yoichi Haneda, NTT Human Interface Labs Akira Nakagawa, NTT Human Interface Labs Volume: 1, Page: 299 Abstract: This paper proposes a new subband stereo echo canceller that converges to the true echo path impulse response much faster than conventional stereo echo cancellers. Since signals are bandlimited and downsampled in the subband structure, the time interval between the subband signals becomes longer, so the variation of the crosscorrelation between the stereo input signals becomes large. Consequently, convergence to the true solution is improved. Furthermore, the projection algorithm, or affine projection algorithm, is applied to further speed up the convergence. Computer simulations using stereo signals recorded in a conference room demonstrate that this method significantly improves convergence speed and almost solves the problem of stereo echo cancellation with low computational load. ** Title: A Better Understanding and an Improved Solution to the Problems of Stereophonic Acoustic Echo Cancellation Authors: Jacob Benesty, Bell Labs Dennis R. Morgan, Bell Labs M. Mohan Sondhi, Bell Labs Volume: 1, Page: 303 Abstract: Teleconferencing systems employ acoustic echo cancelers (AECs) to reduce echoes that result from coupling between the loudspeaker and microphone. To enhance the sound realism, two-channel audio is necessary. However, in this case (stereophonic sound) the acoustic echo cancellation problem is more difficult to solve because of the necessity to uniquely identify two acoustic paths. In this paper, we explain these problems in detail and give an interesting solution which is much better than previously known solutions.
The basic idea is to introduce a small nonlinearity into each channel that has the effect of reducing the interchannel coherence while not being noticeable for speech due to self masking. ** Title: Comparison of three post-filtering algorithms for residual acoustic echo reduction Authors: Valerie Turbin, CNET Andre Gilloire, CNET Pascal Scalart, CNET Volume: 1, Page: 307 Abstract: We consider an acoustic echo control system composed of a short conventional acoustic echo canceller combined with a post-filter in a teleconference context. The post-filter is implemented in an open-loop structure in the frequency domain, which provides good adaptive performance and flexibility for the choice of the post-filter length. Three post-filtering algorithms are compared in terms of residual echo attenuation and near-end speech distortion. The effect of the post-filter length is also examined. Our study confirms that the post-filtering approach provides high residual echo attenuation. Moreover, it appears that the distortion of the near-end speech can be controlled by choosing appropriately the post-filter length. ** Title: Audio Coding Using Sinusoidal Excitation Representation Authors: Wen-Whei Chang, National Chiao-Tung University De-Yu Wang, National Chiao-Tung University Li-Wei Wang, National Chiao-Tung University Volume: 1, Page: 311 Abstract: Most LPC-based audio coders employ simplistic noise-shaping operations to perform psychoacoustic control of quantization noise. In this paper, we report on new approaches to exploiting perceptual masking in the design of adaptive quantization of LPC excitation parameters. Due to its localized spectral sensitivity, sinusoidal excitation representation is preferred to spectrally flat signals for use in excitation modeling. Simulation results indicate that the proposed multisinusoid excited coder can deliver high quality audio reproduction at the rate of 72 kb/s. 
** Title: Optimum Bit Allocation and Decomposition for High Quality Audio Coding Authors: Xiang Wei, University of Central Lancashire Martyn J. Shaw, University of Central Lancashire Martin R. Varley, University of Central Lancashire Volume: 1, Page: 315 Abstract: Current audio compression schemes are capable of reducing the per channel bit rate of high quality audio signals from 16 bits per sample to around 2-4 bits per sample. In these schemes, knowledge of psychoacoustics is utilised and a uniform or nonuniform frequency decomposition method is used. In this paper we derive the optimum bit allocation to achieve the highest perceptual quality under a fixed bit rate, for an arbitrarily decomposed, critically sampled, filter bank. The resultant optimum bit allocation gives rise to a shaped reconstruction noise floor approximately parallel to the masking threshold level. Perceptual coding gain is defined and should be maximized for an optimum decomposition performed by the filter bank. Optimum band splitting is discussed and it is pointed out that decomposition in the manner of critical band splitting does not lead to optimal performance. ** Title: The D5 Lattice Quantization For A 64 KBit/S Low-Delay Subband Audio Coder With A 15 KHz Bandwidth Authors: Karine Hay, ENST-Br, Dept. SC. S. Saoudi, ENST-Br, Dept. SC. L. Mainard, CCETT, Service RCS/SDA Volume: 1, Page: 319 Abstract: A new method for coding generic audio signals at 64 kbit/s in the 20-15000 Hz bandwidth with a low delay is presented. It combines subband coding, Low Delay CELP algorithm and cascaded filterbanks. Our earlier work showed that, when using an equal bit rate on each subband, the resulting audio quality was not appropriate. We propose here a new technique based on lattice quantization to avoid the search complexity of the statistical vector quantization. It allows an adaptive bit rate allocation in each subband.
Experimental results assessing the validity of the proposed method are also presented. ** Title: An Experimental Audio Codec Based on Warped Linear Prediction of Complex Valued Signals Authors: Aki Harma, Helsinki University of Technology Unto K. Laine, Helsinki University of Technology Matti Karjalainen, Helsinki University of Technology Volume: 1, Page: 323 Abstract: Bark-scale warped linear prediction [WLP] is a very potential core for a monophonic perceptual audio codec. In the current paper the WLP scheme is extended for processing complex valued signals (CWLP). Three different methods of converting a stereo signal to one complex valued signal are introduced. The philosophy behind the coding scheme is to integrate some aspects of modern wideband audio coding (e.g. perceptuality and stereo signal processing) into one computational element in order to find a more holistic and economic way of processing. ** Title: High Quality Low Complexity Scalable Wavelet Audio Coding Authors: William Kurt Dobson, U.S. Robotics Jiankan Jack Yang, U.S. Robotics Kevin J. Smart, U.S. Robotics Feng Kathy Guo, U.S. Robotics Volume: 1, Page: 327 Abstract: This paper presents an audio coder for real-time multimedia applications. To achieve high quality at low bit rate, the audio coder uses a wavelet packet decomposition to transform the audio data into the wavelet domain, and a psychoacoustic model is used to minimize quantization noise. The wavelet packet decomposition tree structures were chosen in a way to closely mimic the critical bands in a psychoacoustic model. Instead of determining the masking thresholds in the Fourier domain, the wavelet coefficients are used to drive the psychoacoustic model directly. Most of the standard industrial sampling frequencies are supported by this coder. An efficient bit rate control scheme was designed such that the audio coder operates at virtually any desired bit rate level. 
The audio coder achieves near perceptually lossless quality at or below 80 kb/s for most audio sources. Real-time encoding/decoding is possible by using only a fraction of a Pentium or faster CPU. ** Title: An Efficient Tonal Component Coding Algorithm For MPEG-2 Audio NBC Authors: Yuichiro Takamizawa, NEC Corporation Masahiro Iwadare, NEC Corporation Akihiko Sugiyama, NEC Corporation Volume: 1, Page: 331 Abstract: This paper proposes a tonal component coding algorithm for a codec that employs a transform followed by Huffman coding, such as MPEG-2 Audio NBC (Non-Backward Compatible). After the input audio signal is mapped onto a frequency domain, the proposed algorithm withdraws local maximum components that degrade coding efficiency. By this withdrawal, the flatness of the spectrum increases and the efficiency in Huffman coding is improved. The withdrawn components are encoded separately as side information. When the frequency resolution of the time/frequency mapping is high, this algorithm works more effectively since local maximum samples appear more frequently with such a mapping. Simulation results show that this algorithm achieves as much as 11% bit reduction per frame and improves the coding efficiency in 41% of all the audio frames. ** Title: Spectral Amplitude Warping (SAW) for Noise Spectrum Shaping in Audio Coding Authors: Roch Lefebvre, University of Sherbrooke Claude Laflamme, University of Sherbrooke Volume: 1, Page: 335 Abstract: In this paper, we present a new approach to shape the coding noise in speech and audio coders. The approach, called Spectral Amplitude Warping (SAW), consists essentially of a pre- and post-processing which apply a non-linear transformation to the signal short-term spectrum prior to, and after, encoding. Since it is possible to view SAW as a separate entity from the coder, the noise shaping capability of an existing coder can be improved without modifying the coder itself. 
Using SAW as a pre- and post-process to the G.722 wideband speech coding standard, it was found in an informal listening test that the quality of the 64 kb/s operating mode can be achieved at only 48 kb/s. The price to be paid is an additional delay. ** Title: A fast noise-scaling algorithm for uniform quantization in audio coding schemes Authors: Carlos A. Serantes, Universidad de Vigo Antonio S. Pena, Universidad de Vigo Nuria Gonzalez-Prelcic, Universidad de Vigo Volume: 1, Page: 339 Abstract: A new bit assignment algorithm is presented. Its goals are the simultaneous assignment on all subbands in a few steps of an iterative calculus, the use of memory to achieve a better speed of convergence and the consideration of a deformable error curve. The basis of the algorithm is discussed, along with other considerations that are likely to arise in practice. Finally, an example of performance is given. ** Title: Pyramid Vector Coding for high quality audio compression Authors: Daniele Cadel, Cefriel Giorgio Parladori, Alcatel Telecom Volume: 1, Page: 343 Abstract: The target of this work is high quality audio coding at low bit rate. It will be shown how the Pyramid Vector Coding (PVC) can conveniently replace the classical Huffman Coding technique in audio compression systems, also giving an advantage in the bit allocation procedure. The compression performance can be further improved by fixing an upper limit value of the vector components. ** Title: Subband Audio Coding with Synthesis Filters Minimizing a Perceptual Criterion Authors: Karine Gosse, ENST Paris Francois Moreau de Saint-Martin, CCETT Xavier Durot, CCETT Pierre Duhamel, ENST Paris Jean-Bernard Rault, CCETT Volume: 1, Page: 347 Abstract: The design of filter banks for source coding purposes classically relies on the perfect reconstruction (PR) property.
However, several recent studies have shown that taking the quantization noise into account in the design could yield a noticeable reduction of the mean square reconstruction error. The purpose of this study is to show that perceptual improvement can also be obtained in the particular audio coding context by relaxing the PR constraint. In this context, the mean square error is not relevant any more, and we define a new perceptual distortion criterion, making use of a simplified ear model, the MPE (Mean Perceptual Error). Then, synthesis filters are optimized so as to minimize this MPE. Finally, this MMPE (Minimum MPE) filter bank is included in an audio coding scheme. Compared to the corresponding PR filter bank-based scheme by means of POM (Perceptual Objective Measure), it shows improved audio quality. ** Title: New Results in Low Bitrate Audio Coding Using a Combined Harmonic-Wavelet Representation Authors: Simon Boland, Queensland University of Technology Mohamed Deriche, Queensland University of Technology Volume: 1, Page: 351 Abstract: In this paper, we propose a new combined harmonic-wavelet representation for audio where a harmonic analysis-synthesis scheme is used, first, to approximate each audio frame as a sum of several sinusoids. Then, the difference between the original signal and the reconstructed harmonic signal is analyzed using a wavelet filtering scheme. After each step (harmonic analysis & wavelet filtering), parameters are quantized and encoded. Compared to previously proposed methods, our audio coder uses different harmonic analysis-synthesis and wavelet filtering schemes. We use the Total Least Squares (TLS)-Prony algorithm for the harmonic analysis scheme, and an M-band wavelet transform for analyzing the residual. Altogether, our proposed coder is capable of delivering excellent audio signal quality at encoder bitrates of 60-70 kb/s. ** Title: Adaptive Inverse Control of Weakly Nonlinear Systems Authors: Wolfgang J.
Klippel, Dresden Volume: 1, Page: 355 Abstract: A weakly nonlinear plant can be linearized and will track an input signal if the plant is preceded by a nonlinear controller which approximates the inverse of the plant's transfer function. Present techniques for adjusting the controller adaptively to the plant require an additional nonlinear adaptive filter to perform a separate system identification. Straightforward update algorithms cannot directly update the filter parameters in the controller because the transfer function of the plant might cause instabilities in the adaptive process. This problem is overcome by applying additional linear filtering to the nonlinear state vector and/or error signal. Novel filtered-A and filtered-E modifications of the stochastic gradient based methods are presented which are capable of updating generic as well as special block-oriented nonlinear filter architectures. ** Title: Broadband Beamforming with Adaptive Postfiltering for Speech Acquisition in Noisy Environments Authors: Sven Fischer, Ericsson Eurolab Karl-Dirk Kammeyer, University of Bremen Volume: 1, Page: 359 Abstract: In this paper the implementation of a broadband beamformer which is built up by several harmonically nested subarrays for each octave band combined with optimal postfiltering is described. This method has the advantage of providing large sensor distances for the postfilter estimation while simultaneously controlling the directivity of the array. The selection of an optimal postfilter is discussed in detail and its estimation based on a Nuttall/Carter method for spectrum estimation is described. The resulting noise reduction system yields improved performance in diffuse noise fields and no distortions in the case of coherent direct path noise. Furthermore, the system is robust to steering misadjustment. ** Title: Near-field Beamforming for Microphone Arrays Authors: James G. Ryan, National Research Council Rafik A.
Goubran, Carleton University Volume: 1, Page: 363 Abstract: This paper describes the application of array optimization techniques to improving the near-field response of an arbitrary microphone array. The optimization exploits the differences in wavefront curvature between near-field and far-field sound sources and is suitable for reverberation reduction in small rooms. The optimum near-field beamformer provides increased array gain over that obtained from a uniformly weighted delay-and-sum beamformer. ** Title: A Robust Adaptive Microphone Array with Improved Spatial Selectivity and Its Evaluation in an Echoic Environment Authors: Osamu Hoshuyama, NEC Akihiko Sugiyama, NEC Akihiro Hirano, NEC Volume: 1, Page: 367 Abstract: This paper presents a new robust adaptive microphone array (AMA) and its evaluation in an echoic environment. The proposed AMA is a generalized sidelobe canceller equipped with a variable blocking matrix using coefficient-constrained adaptive filters, and a multiple-input canceller using norm-constrained adaptive filters (NCAFs). Because the NCAFs have selective nonlinearity in the relationship between coefficient norm and coefficient error, the proposed AMA has better spatial selectivity than the conventional AMA. Evaluation with real acoustic data captured in a room of 0.3-second reverberation time shows that the noise is suppressed by 19 dB. In subjective evaluation, the proposed AMA obtains 3.8 on a 5-point mean-opinion-score scale. ** Title: Tracking Multiple Talkers using Microphone-Array Measurements Authors: Douglas E. Sturim, Brown University Harvey F. Silverman, Brown University Michael S. Brandstein, Harvard University Volume: 1, Page: 371 Abstract: A method for tracking the positional estimates of multiple talkers in the operating region of an acoustic microphone array is presented. Initial talker location estimates are provided by a time-delay-based localization algorithm. 
These raw estimates are spatially smoothed by a Kalman filter derived from a set of potential source motion models. Data association techniques based on the estimate clusterings and source trajectories are incorporated to match location observations with individual talkers. Experimental results are presented for array recorded data using multiple talkers in a variety of scenarios. ** Title: A Robust Method for Speech Signal Time-Delay Estimation in Reverberant Rooms Authors: Michael S. Brandstein, Harvard University Harvey F. Silverman, Brown University Volume: 1, Page: 375 Abstract: Conventional time-delay estimators exhibit dramatic performance degradations in the presence of multipath signals. This limits their application in reverberant enclosures, particularly when the signal of interest is speech and it may not be possible to estimate and compensate for channel effects prior to time-delay estimation. This paper details an alternative approach which reformulates the problem as a linear regression of phase data and then estimates the time-delay through minimization of a robust statistical error measure. The technique is shown to be less susceptible to room reverberation effects. Simulations are performed across a range of source placements and room conditions to illustrate the utility of the proposed time-delay estimation method relative to conventional methods. ** Title: A Model-Based Approach to Active Noise Cancellation Using Loudspeaker Array Authors: Jie Gu, HK University of Science & Tech. Sze Fong Yau, HK University of Science & Tech. Volume: 1, Page: 379 Abstract: This paper presents a new model-based adaptive noise cancellation system using a loudspeaker array and an error sensor array which can be used to reduce the noise in a specific three-dimensional region. First, open loop system transfer functions are designed using a theoretical propagation model. The transfer functions thus found are regarded as the nominal values for the complete system.
Second, to compensate for deviations from the theoretical model, the transfer functions are adapted using error measures from the error sensor array by the LMS algorithm. Computer simulation results show that our approach is effective for noise reduction in 3-D space. Experiments using real-time active noise control hardware also confirm the performance of the system. ** Title: Reverberant Sound Field Analysis using a Microphone Array Authors: Wolfgang Tager, CNET Yannick Mahieux, CNET Volume: 1, Page: 383 Abstract: The use of microphone arrays for sound pickup in reverberant environments has been proposed by many authors. The observation on the M microphones can be decomposed into a spatially coherent and an incoherent part. The first one is due to perfect (plane or spherical) sound waves caused by the direct path and specular reflections, whereas the latter is caused by diffusion, diffraction, non-perfect reflections, electrical and quantization noise. In this paper we first present a deflation method to detect and localize spatially coherent waves from the measured impulse responses. In a second step the filters which model the source directivity and the reflecting materials are estimated. The model takes into account nearfield delay, range attenuation, microphone and source directivity as well as non trivial reflections. ** Title: Minimisation of the Maximum Error Signal in Active Control Authors: Alberto Gonzalez, UPV, Valencia Antonio Albiol, UPV, Valencia Stephen J. Elliott, ISVR, Southampton Volume: 1, Page: 387 Abstract: This paper deals with Multiple Input Multiple Output systems for active control of acoustic signals. These systems are used when the acoustic field is complex and therefore a number of sensors are necessary to estimate the sound field and a number of sources to create the cancelling field. A steepest descent iterative algorithm is applied to minimise the p-norm of a vector composed of the output signals of a microphone array.
The existing algorithms deal with the 2-norm of this vector. This paper describes a general framework that covers the existing systems and then it focuses on the (infinity)-norm minimisation algorithm. The minimax algorithm based on the (infinity)-norm minimises the output signal which has the greatest power. It is shown by means of simulations using measured data from a real room that the minimax algorithm leads to a more uniform final noise field than the existing algorithms. ** Title: Subband Active Noise Control Algorithm Based on a Delayless Subband Adaptive Filter Architecture Authors: Jeong-Hyeon Yun, Yonsei University Dae-Hee Youn, Yonsei University Young-Cheol Park, Samsung Electronic Volume: 1, Page: 391 Abstract: In this paper, a new active noise control algorithm based on a delayless subband adaptive filter architecture is presented. Also, an on-line system identification method implemented in the subband structure is suggested. To implement the filtered-x LMS algorithm in the subband structure, the secondary path transfer function is decomposed into sets of subband functions. The two filter on-line modeling algorithm is then applied to each subband to estimate the secondary-path transfer function in a decomposed form. In this manner, the computational load for the on-line system identification is reduced by a factor 3 compared with the wideband approach. Simulation results are presented to show the efficiency of the new ANC algorithm and the performance of the on-line system identification scheme. ** Title: Nonlinear Active Noise Control in a Linear Duct Authors: Paul Strauch, University of Edinburgh Bernard Mulgrew, University of Edinburgh Volume: 1, Page: 395 Abstract: The problem in active noise control in a linear duct is examined. Essentially, a nonlinear inverse to a nonminimum phase actuator is proposed. The nonlinear inverse exploits the non-Gaussian nature of some chaotic and stochastic noise sources. 
The architecture of the controller is derived using Bayesian estimation theory and is shown to be a combination of a linear adaptive network and a radial basis function (RBF) or Volterra series (VS) network. Because of the nonlinear nature of the controller, the filtered-x least mean square (LMS) architecture cannot be used. Hence a modified active noise controller is proposed. Simulation results demonstrate the improvements in performance achievable with the combined linear and nonlinear controller. ** Title: Fast Exact Filtered-X LMS and LMS Algorithms for Multichannel Active Noise Control Authors: Scott C. Douglas, University of Utah Volume: 1, Page: 399 Abstract: In some situations where active noise control could be used, the well-known multichannel version of the filtered-X LMS adaptive filter is too computationally-complex to implement. In this paper, we develop a fast, exact implementation of this multichannel system whose complexity is approximately O(2L) per filter channel, where L is the FIR filter length. In addition, we provide a computationally-efficient method for effectively removing the delays of the secondary paths within the coefficient updates, thus yielding a fast implementation of the LMS adaptive algorithm for multichannel active noise control. Examples illustrate both the equivalence of the algorithms to their original counterparts and the computational gains provided by the new algorithms. ** Title: A Novel Frequency Domain Filtered-X LMS Algorithm for Active Noise Reduction Authors: Toshifumi Kosakat, Tokyo National College of Technology Stephen J. Elliott, University of Southampton Christopher C. Boucher, University of Southampton Volume: 1, Page: 403 Abstract: A Frequency Domain implementation of the LMS Algorithm has significant advantages.
In broadband applications it is important to use the correct window function before Fourier transformation to obtain an unbiased estimation of the required cross correlation function and to eliminate wrap-around effects. In the Frequency Domain Filtered-X LMS Algorithm described in this paper, the control filter is updated in the frequency domain as a background task, while control filtering is performed in time domain, to minimize processing delays. The frequency domain algorithm showed better performance than the conventional time domain algorithm in simulations of single channel active control systems. The algorithm is also able to improve the convergence of multiple channel systems by compensating for the coupling between the control channels. ** Title: Practical Supergain Head Sized Arrays Authors: Dorra Masmoudi, University of Bordeaux Dominique Dallet, University of Bordeaux Jean Paul Dom, University of Bordeaux Volume: 1, Page: 407 Abstract: This paper presents a new design of head sized sensor arrays with simple delay-and-sum beamforming which provides useful amounts of directivity index with sufficient robustness to errors. A frequency-independent sidelobe reduction is proposed to achieve optimal frequency characteristics. In order to obtain this control, a principle of combining multiple levels of array structures is established. Results are presented for spherically isotropic noise. It is found that good performance can be obtained for a head sized array by combining multiple level structures with a simple delay-and-sum beamformer. ** Title: A Multichannel Compression Strategy for a Digital Hearing Aid Authors: Todd Schneider, Unitron Robert Brennan, Unitron Volume: 1, Page: 411 Abstract: Multi-channel compression schemes are a practical method of mapping the wide dynamic range of speech signals into the reduced dynamic range of hearing impaired listeners.
These systems address two of the shortcomings of single-channel compression systems: (1) the reduction of gain as a result of narrow-band non-speech stimuli and (2) the reduction of gain that often occurs when high-frequency speech components are followed by intense low-frequency speech components. They also provide frequency-dependent compression ratios that are needed by many newer supra-threshold fitting strategies (e.g., DSL I/O). This paper presents a multichannel compression scheme that employs an oversampled, polyphase DFT filterbank. In each compressor channel, the gain is controlled by an adjustable combination of an overall, dual time-constant input signal level and the individual channel signal level that is measured with a short time-constant RMS detector. Informal listening tests have demonstrated that the design has very good audio quality and performs well in real-world listening situations. The design is suited for low-power, real-time operation. ** Title: Multi-Microphone Sub-band Adaptive Signal Processing for Improvement of Hearing Aid Performance: Preliminary Test Results using Normal Hearing Volunteers Authors: Paul Shields, University of Paisley Douglas R. Campbell, University of Paisley Volume: 1, Page: 415 Abstract: A system for the binaural pre-processing of speech signals for input to a standard linear hearing aid has been proposed. The work is based on that of Toner & Campbell which applied the Least Mean Squares (LMS) algorithm in sub-bands to speech signals from various acoustic environments and signal to noise ratios (SNR). The method attempts to take advantage of the multiple inputs to perform noise cancellation. The use of sub-bands enables a diverse processing mechanism to be employed, where the wide-band signal is split into smaller sub-bands, which can subsequently be processed according to their signal characteristics.
The results of a series of intelligibility tests are presented from experiments in which acoustic speech and noise data, generated in a simulated room, were tested on normal-hearing volunteers. ** Title: Environmental noise reduction based on speech/non-speech identification for hearing aids Authors: Kenzo Itoh, NTT HI Labs. Masahide Mizushima, NTT HI Labs. Volume: 1, Page: 419 Abstract: We propose a very practical and useful noise reduction system that has wide application for hearing impaired persons, such as a sound-gathering system at a lecture hall or conference room. The system uses two basic technologies, a speech/non-speech identification process and a new noise reduction process. The speech/non-speech identification process uses four characteristics of the time and frequency domains of the input signal. In the noise reduction process, a frequency weighting function is used for basic spectral subtraction and a loss control algorithm. Various kinds of environmental noise were reduced by this system, which showed excellent performance. Noise is further reduced by using a multi-microphone system as an acoustic noise suppressor. The results of intelligibility tests using persons with hearing loss show excellent noise reduction. ** Title: Blind Separation of Multiple Speakers in a Multipath Environment Authors: Russell Lambert, TRW Anthony Bell, Salk Institute Volume: 1, Page: 423 Abstract: We relate information theoretic blind learning methods (infomax) and Bussgang blind equalization methods. The multipath extension of blind source separation methods can be seen in the frequency domain using FIR matrix algebra (matrices of finite impulse response filters). Three forms of Bussgang algorithms are given. The blind serial update method of Cardoso and Laheld is related to the infomax objective of Bell and Sejnowski. The application emphasis is on speech separation. 
We demonstrate the robustness and power of the new techniques by blindly separating speech signals recorded in a multipath environment. ** Title: A Single Chip 1,200 Sinusoid Real-Time Generator for Additive Synthesis of Musical Signals Authors: Fernando De Bernardinis, Dip. Ing. Informazione, Univ. Pisa Roberto Roncella, Dip. Ing. Informazione, Univ. Pisa Roberto Saletti, Dip. Ing. Informazione, Univ. Pisa Pierangelo Terreni, Dip. Ing. Informazione, Univ. Pisa Graziano Bertini, IEI-CNR, Pisa Volume: 1, Page: 427 Abstract: This paper presents a new hardware implementation of additive synthesis for high quality musical sound generation. The single-chip configuration is capable of performing 1,200 sinusoid real-time synthesis; the system is expandable to 13,200 partials by series connecting 11 chips. Each sinusoid is generated by a marginally stable second order IIR filter, and its frequency, amplitude and phase can be independently specified. The system is clocked at 60 MHz when working with a 44.1 kHz sampling rate. Two completely independent channels are available as output, and each sample relies on a 20 bit representation to achieve an SNR of at least 110 dB, thanks to the internal 24 bit word length. The IC is designed in a 0.5 µm CMOS technology and has a core area of approximately 19 mm^2. ** Title: A Generalized Musical-Tone Generator with Applications to Sound Compression and Synthesis Authors: Carlo Drioli, University of Padova Davide Rocchesso, University of Padova Volume: 1, Page: 431 Abstract: A musical-tone generator based on physical modeling of the sound production mechanisms is presented. To make this scheme general for a wide class of musical instruments, the nonlinear part of the tone-generator is modeled by a neural network. The system learns its parameters and the nonlinearity shape by means of nonlinear identification procedures based on waveform or spectral matching. 
Two possible applications of this model are discussed: sound compression can be obtained when considering the system as a nonlinear predictor, while sound synthesis can be obtained by adding control inputs to the network and by training the system to respond as desired. ** Title: A Singing Voice Synthesis System Based on Sinusoidal Modeling Authors: Michael Macon, Oregon Graduate Institute Leslie Jensen-Link, Momentum Data Systems James Oliverio, Georgia Institute of Technology Mark A. Clements, Georgia Institute of Technology E. Bryan George, Texas Instruments, Dallas Volume: 1, Page: 435 Abstract: Although sinusoidal models have been demonstrated to be capable of high-quality musical instrument synthesis, speech modification, and speech synthesis, little exploration of the application of these models to the synthesis of singing voice has been undertaken. In this paper, we propose a system framework similar to that employed in concatenation-based text-to-speech synthesizers, and describe its extension to the synthesis of singing voice. The power and flexibility of the sinusoidal model used in the waveform synthesis portion of the system enables high-quality, computationally-efficient synthesis and the incorporation of musical qualities such as vibrato and spectral tilt variation. Modeling of segmental phonetic characteristics is achieved by employing a "unit selection" procedure that selects sinusoidally-modeled segments from an inventory of singing voice data collected from a human vocalist. The system, called LYRICOS, is capable of synthesizing very natural-sounding singing that maintains the characteristics and perceived identity of the analyzed vocalist. ** Title: Time-Scale Modification of Audio Signals with Combined Harmonic and Wavelet Representations Authors: Khaled N. Hamdy, University of Minnesota Ahmed H. 
Tewfik, University of Minnesota Satoshi Takagi, Sony Corporation Ting Chen, Stanford University Volume: 1, Page: 439 Abstract: We propose a new time-scale modification method for high quality audio signals. Our approach strives to preserve pitch and timbre. In our method, the signal is represented as the sum of sinusoidal components and a residual (edges and noise). The decomposition is computed via a combined harmonic and wavelet representation. Time-scaling is performed on the harmonic components and residual components separately. The harmonic portion is time-scaled by demodulating each harmonic component to DC, interpolating and decimating the DC signal, and remodulating each component back to its original frequency. The residual portion is time-scaled by preserving edges and relative distances between the edges while time-scaling the stationary (noise) components between the edges. ** Title: A Waveguide Model for Slapbass Synthesis Authors: Erhard Rank, Vienna University of Technology Gernot Kubin, Vienna University of Technology Volume: 1, Page: 443 Abstract: Starting from the waveguide model for plucked strings, a new digital signal processing model for the slapping technique on electric bass guitars is derived. The model includes amplitude limitations for the string at the frets and/or the fingerboard. These highly nonlinear elements are realized by conditional reflections which depend on the local string displacement. A model of the string dynamics for the two slapbass techniques - knocking the string with the thumb knuckle and plucking very strongly with the index or middle finger - has been implemented both as MATLAB and C simulations and synthesizes sounds close to the natural instrument. 
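Illustrative sketch for the slapbass entry above (not the authors' model): a heavily simplified, Karplus-Strong style string loop in which a hard clip stands in for the amplitude limitation at the frets. The paper's conditional reflections depend on the local string displacement along the string; this sketch only limits the single value circulating in one delay line. The function name `pluck` and the parameters `limit`, `damping`, and `seed` are assumptions introduced here for illustration.

```python
import random


def pluck(period, n_samples, limit=None, damping=0.996, seed=0):
    """Karplus-Strong style string loop with an optional displacement limit."""
    rng = random.Random(seed)
    # Initial excitation: a burst of noise filling the delay line.
    line = [rng.uniform(-1.0, 1.0) for _ in range(period)]
    out = []
    prev = 0.0
    for n in range(n_samples):
        cur = line[n % period]
        # Two-point average acts as the loop's low-pass (string loss) filter.
        new = damping * 0.5 * (cur + prev)
        # Hypothetical stand-in for the fret/fingerboard: clip the displacement.
        if limit is not None:
            new = max(-limit, min(limit, new))
        line[n % period] = new
        prev = cur
        out.append(new)
    return out
```

Calling `pluck(50, 500)` gives an unconstrained decaying tone, while `pluck(50, 500, limit=0.3)` keeps every output sample within the assumed displacement limit, which is the kind of nonlinearity the abstract attributes to string-fret contact.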
** Title: Minimum Perceptual Spectral Distance FIR Filter Design Authors: Shao-Po Wu, Stanford University William Putnam, Stanford University Volume: 1, Page: 447 Abstract: This paper addresses the problem of designing finite impulse response filters which optimally approximate desired frequency responses in the sense that they minimize a perceptual audio spectral measure. This measure is based on a simplified auditory model similar to those used in the area of perceptual audio quality measurement. It is shown that this problem can be cast as a logarithmic Chebychev approximation problem, which can be solved efficiently using recent interior point methods. ** Title: A Phase Interpolation Algorithm for Sinusoidal Model Based Music Synthesis Authors: Xiaoshu Qian, URI Yinong Ding, TI Volume: 1, Page: 451 Abstract: This paper presents a least square quadratic phase interpolation algorithm for sinusoidal model based music synthesis. This algorithm uses two additions with one parameter per data frame to generate the phase samples of a component sine wave. Compared with the cubic phase interpolation algorithm proposed by McAulay and Quatieri, the proposed algorithm is more efficient in terms of computational complexity and parameter storage. In the meantime, it also produces smoother frequency tracks. Unlike the existing quadratic phase interpolation algorithm, where the phase measurements are totally ignored ("magnitude-only"), the proposed algorithm interpolates phase in a least square sense from both the phase and the frequency measurements at data frame boundaries. Thus the resulting phase samples are approximately "locked" to the measured ones. Informal listening tests on various musical instrument tones indicate that the proposed algorithm clearly outperforms the magnitude-only synthesis approach and is qualitatively comparable to the cubic one. 
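Illustrative sketch for the phase interpolation entry above: a quadratic phase track can be generated with two additions per sample by running two accumulators, one for the phase and one for the per-sample phase increment. This shows only that recursion; the least-squares fit of the quadratic to the measured phases and frequencies is not reproduced here, and the names `quadratic_phase`, `theta0`, `omega0`, and `alpha` are assumptions made for this example.

```python
def quadratic_phase(theta0, omega0, alpha, n):
    """Return n samples of theta0 + omega0*k + 0.5*alpha*k**2, k = 0..n-1,
    computed with two additions per sample."""
    phases = []
    theta = theta0
    # First difference of the quadratic at k = 0: theta[1] - theta[0].
    step = omega0 + 0.5 * alpha
    for _ in range(n):
        phases.append(theta)
        theta += step   # addition 1: advance the phase
        step += alpha   # addition 2: advance the phase increment
    return phases
```

Under these assumptions, one value of `alpha` per frame (with the phase and increment carried over from the previous frame) is enough to continue the track, which is consistent with the abstract's description of one parameter per data frame.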
** Title: Analytical Approximations of Fractional Delays: Lagrange Interpolators and Allpass Filters Authors: Stephan Tassart, IRCAM Philippe Depalle, IRCAM Volume: 1, Page: 455 Abstract: We propose in this paper a new point of view which unifies two well known filter families for approximating ideal fractional delay filters: Lagrange Interpolator Filters (LIF) and Thiran Allpass Filters. We achieve this unification by approximating the ideal Fourier transform of the fractional delay according to two different Pade approximations: series expansions and continued fraction expansions, and by proving that both approximations correspond exactly either to the LIF family or to the allpass delay filters family. This leads to an efficient modular implementation of LIFs. ** Title: Improved discrete-time modeling of multi-dimensional wave propagation using the interpolated digital waveguide mesh Authors: Lauri Savioja, Helsinki University of Technology Vesa Valimaki, Helsinki University of Technology Volume: 1, Page: 459 Abstract: The digital waveguide mesh is an extension of the one-dimensional digital waveguide technique. Waveguide meshes are used for simulation of two- and three-dimensional wave propagation in musical instruments and acoustic spaces. The original waveguide mesh algorithm suffers from direction-dependent dispersion. In this paper we show that this problem may be reduced by using an interpolated rectilinear mesh. In the analysis part we show the analytical solution for the wave propagation speed and numerical simulations of the magnitude response and phase speed in both the original and the interpolated two-dimensional waveguide mesh algorithms. We demonstrate by simulation that the wave propagation characteristics of the proposed interpolated waveguide mesh are independent of direction and thus the remaining errors caused by dispersion may be corrected with a postprocessor. 
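Illustrative sketch for the waveguide mesh entry above: the original (non-interpolated) two-dimensional rectilinear mesh is commonly written in an equivalent finite-difference form, in which each junction value is half the sum of its four neighbours at the previous step minus its own value two steps back. The code below implements only that baseline on a small grid with fixed zero boundaries; the interpolated mesh proposed in the paper is not reproduced. The names `mesh_step` and `impulse_response` are assumptions introduced here.

```python
def mesh_step(cur, prev):
    """One time step of the rectilinear mesh in its finite-difference form."""
    size = len(cur)
    nxt = [[0.0] * size for _ in range(size)]
    # Boundary nodes are held at zero; only interior junctions are updated.
    for i in range(1, size - 1):
        for j in range(1, size - 1):
            neighbours = (cur[i - 1][j] + cur[i + 1][j]
                          + cur[i][j - 1] + cur[i][j + 1])
            nxt[i][j] = 0.5 * neighbours - prev[i][j]
    return nxt


def impulse_response(size, steps):
    """Excite the centre junction with a unit impulse and run the mesh."""
    centre = size // 2
    prev = [[0.0] * size for _ in range(size)]
    cur = [[0.0] * size for _ in range(size)]
    cur[centre][centre] = 1.0
    for _ in range(steps):
        cur, prev = mesh_step(cur, prev), cur
    return cur
```

Comparing the field returned by `impulse_response` along a grid axis and along a diagonal is one simple way to look at how propagation in this baseline mesh differs with direction.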
** Title: Generalized Likelihood Ratio Test for Selecting a Geo-acoustic Environmental Model Authors: Christoph F. Mecklenbrauker, RUB Peter Gerstoft, SACLANTCEN Pei-Jung Chung, COMNETS Johann F. Bohme, RUB Volume: 1, Page: 463 Abstract: A generalized likelihood ratio test is considered for testing acoustic environmental models with application to parameter inversion using an acoustic propagation code. In the following, we use the term "hierarchy of models" to denote a sequence of model structures M_1, M_2, ... in which each particular model structure M_n contains all previous ones as special cases. We propose a combined parameter estimation and multiple sequential test for simultaneously determining the model order and its parameters: given the observed data, how many parameters should be included in the model? The last question is important for the order selection problem in hierarchies of models with increasing number of parameters where the observations are corrupted by additive noise. Monte Carlo simulations show the behaviour of the sequential test for selecting a model order as a function of the SNR. Finally, the test is applied to broadband data measured using a vertical array near the island of Elba in the Mediterranean Sea. ** Title: Tuning Genetic Algorithms for Underwater Acoustics Using a priori Statistical Information Authors: Maria Joao Rendas, I3S/CNRS Georges Bienvenu, Thomson Marconi Sonar Volume: 1, Page: 467 Abstract: In this paper we present a new technique for the evaluation/selection procedures of genetic algorithms, to be used in the context of parameter estimation problems. The proposed algorithm uses a priori information about the structure of the surface of which an extremum is being searched. 
For parameter estimation problems, the availability, at each iteration of a genetic algorithm, of a collection of samples of the ambiguity surface of the problem, enables the determination of the correlation between the observed ambiguity surface (at the sampled points) and the predicted ambiguity surface. The consideration of this information allows early detection of secondary extrema (which yield an ambiguity surface which does not correlate well with the observed one) and thus contributes to speed the convergence of the algorithm to the global optimal values. The paper applies the proposed technique to a source localization problem. ** Title: Robust Beamformer Design for Broadband Matched-Field Processing Authors: Kerem Harmanci, Duke University Jeffrey L. Krolik, Duke University Volume: 1, Page: 471 Abstract: Matched-field beamforming has been proposed for localizing wideband acoustic sources in uncertain underwater channels. While adaptive matched-field beamforming provides adequate sidelobe suppression for stronger sources, at low signal-to-noise ratios it converges to its quiescent response, in this case the Bartlett beamformer, which has unacceptably high sidelobe levels. In this paper, a design method is presented for reducing matched-field non-adaptive beamformer sidelobe levels given a sufficiently large observation time-bandwidth product. The proposed alpha-beamformer incoherently averages narrowband matched-field beamformer output power over the signal band after a trade-off has been performed at each frequency to achieve better sidelobe suppression at the expense of some reduction in gain against diffuse noise. Simulations and results with Mediterranean vertical array data indicate that the wideband alpha-beamformer can provide improved sidelobe suppression versus conventional techniques. ** Title: FASTMAP: A Fast, Approximate Maximum A Posteriori Probability Parameter Estimator with Application to Robust Matched-Field Processing Authors: Brian F. 
Harrison, NUWCDIVNPT Richard J. Vaccaro, University of Rhode Island Donald W. Tufts, University of Rhode Island Volume: 1, Page: 475 Abstract: In many estimation problems, the set of unknown parameters can be divided into a subset of desired parameters and a subset of nuisance parameters. Using a maximum a posteriori (MAP) approach to parameter estimation, these nuisance parameters are integrated out in the estimation process. This can result in an extremely computationally-intensive estimator. This paper proposes a method by which computationally-intensive integrations over the nuisance parameters required in Bayesian estimation may be avoided under certain conditions. The proposed method is an approximate MAP estimator which is much more computationally efficient than direct, or even Monte Carlo, integration of the joint posterior distribution of the desired and nuisance parameters. As an example of its efficiency, we apply the fast algorithm to matched-field source localization in an uncertain environment. ** Title: Electromagnetic Matched Field Processing for Source Localization Authors: Donald F. Gingras, Naval Command, Control and Ocean Surveillance Center Peter Gerstoft, SACLANT Neil L. Gerr, Office of Naval Research Christoph F. Mecklenbrauker, Vienna University of Technology Volume: 1, Page: 479 Abstract: Matched field processing (MFP) refers to signal and array processing techniques in which, rather than a planewave arrival model, complex-valued (amplitude and phase) field predictions for propagating signals are used. Matched field processing has been successfully applied in ocean acoustics. In this paper the extension of MFP to the electromagnetic domain, i.e., electromagnetic (EM) MFP (EM-MFP) is described. 
Simulations of EM-MFP in the tropospheric setting suggest that, under suitable conditions, EM-MFP methods can enable EM sources to be both detected/localized and used as sources of opportunity for estimating the environmental parameters that determine EM propagation. ** Title: Power-Law Processors for Detecting Unknown Signals in Colored Noise Authors: Ivars P. Kirsteins, NUWCDIVNPT Sanjay K. Mehta, NUWCDIVNPT John Fay, NUWCDIVNPT Volume: 1, Page: 483 Abstract: We propose a new non-parametric adaptive detector for detecting an unknown broadband signal in interference consisting of non-stationary narrowband components and a locally stationary broadband component. An important feature of this detector is that it needs no prior information about the signal or interference. The proposed detector is based on the integration of the non-parametric power law detector of Nuttall with robust narrowband interference removal and whitening using a multiple taper spectral estimation-based technique. Experimental results indicate that the proposed detector outperforms conventional detectors. ** Title: Multitarget detection/tracking of echoes with known waveform: algorithm and applications Authors: Vittorio Rampa, C.S.T.S. - C.N.R. Umberto Spagnolini, Politecnico di Milano Volume: 1, Page: 487 Abstract: The Time of Delay (TOD) estimation of multiple echoes is here solved with an iterative multitarget detection/tracking algorithm. The evaluation of the TODs is based on their a-posteriori probability, while a first-order Markov model is used for a-priori probability estimation. The effectiveness of the algorithm (low false-alarm rate and robustness) is also experimentally proven. Moreover the algorithm exhibits a better noise rejection and an improved target resolution with respect to algorithms that perform separate detection and tracking. ** Title: Detection of Gaussian Bandpass Transients Under Impulsive Noise: A Wavelet Transform Approach Authors: Francisco M. 
Garcia, ISR - IST Isabel M.G. Lourtie, ISR - IST Volume: 1, Page: 491 Abstract: In underwater acoustics, the modeling of impulsive noise ambients by symmetric-alpha-stable laws is motivated by the generalized central limit theorem. However, detection of stochastic signals under such additive noise is a difficult task to implement, due to the lack of a closed-form expression of the a-posteriori probability density function. In this paper, we present a suboptimal detector for Gaussian bandpass transients in impulsive noise that uses a nonlinear, memoryless prefilter followed by a discrete wavelet transform. The resulting signals present a Gaussian-like behavior and the decision is achieved by the comparison of a quadratic likelihood ratio with a threshold. The tuning of the nonlinearity parameter is performed either by looking at the receiver operating characteristic or by using the Chernoff distance, which, although resulting in an approximate solution, is easier to compute. Results of Monte Carlo simulations are presented. ** Title: Maximum Likelihood Estimator for Magneto-Acoustic Localisation Authors: Gilles Dassot, LETI CEA/Grenoble Roland Blanpain, LETI CEA/Grenoble Claude Jauffret, GESSY, University Toulon et Var Volume: 1, Page: 495 Abstract: This paper is devoted to the localization of magneto-acoustic sources moving in a straight line at a constant speed. Our technique is based on the association of narrow band acoustic signals and magnetostatic measurements. First, we describe features that make possible the association of magnetic and acoustic data; second, we show that positioning accuracy is much improved by this association. In this paper we focus on solving the problem with as few sensors as possible. A geometric discussion of identifiability is proposed, as well as a Batch Maximum Likelihood estimator whose covariance matrix asymptotically achieves Cramer Rao Lower Bounds (CRLB). 
** Title: Barankin Bound for Source Localization in Shallow Water Authors: Joseph Tabrikian, Duke University Jeffrey L. Krolik, Duke University Volume: 1, Page: 499 Abstract: Matched-field methods are known to have a severe ambiguity problem. At low signal-to-noise ratios (SNRs), where the estimator cannot distinguish between the ambiguity function peak near the true source location and ambiguous ones, its mean square error deviates radically from the Cramer-Rao lower bound (CRLB). In this paper, the Barankin bound for the source localization problem in an uncertain shallow water environment is derived. In particular, a method of selection of the test-points for evaluation of the bound is presented. The bound is evaluated using a "general mismatch" benchmark scenario. The results presented here predict the threshold SNR below which the performance degrades dramatically. Channel uncertainties in the benchmark scenario are shown to increase this threshold SNR by as much as 3 dB. ** Title: Underwater transient signal processing: marine mammal identification, localization, and source signal deconvolution Authors: Zoi-Heleni Michalopoulou, CAMS, NJIT Volume: 1, Page: 503 Abstract: Processing marine-mammal signals for species classification and monitoring of endangered marine mammals are problems that have recently attracted attention in the scientific literature. For classification it has been proposed to use methods appropriate for non-stationary signals, such as time-frequency and time-scale analysis. This paper shows that a factor that can significantly affect results from marine-mammal signal processing is the impulse response of the ocean in which the signals propagate. The ocean is a dispersive propagation medium and, therefore, affects the time-frequency characteristics of a propagating acoustic signal. Because of this distortion, feature selection should be performed after the oceanic impulse response has been deconvolved from the recorded signals. 
The paper also discusses localization of vocalizing marine mammals using matched-field processing and shows how this becomes a part of the deconvolution process. ** Title: Numerical Optimization of Non-adaptive Microphone Arrays Authors: Alexander Goldin, IBM Israel S&T Volume: 1, Page: 507 Abstract: The paper describes an application of the numerical optimization methods for the design of non-adaptive multi-sensor arrays. The parameters and the geometry of such arrays do not change with changes in the input signals, and must be chosen in advance. Generally, the goal of a non-adaptive multi-sensor array may be numerically expressed through its pattern function which shows the gain for a signal coming from a particular direction in space. The real pattern function depends on the geometry of the array and on the processing which signals from every sensor undergo. The array pattern function is non-linear and it is frequency dependent. The geometry and the processing parameters of the multi-sensor array are optimized to provide the minimum difference between the goal and the real functions over a specified frequency range. Optimization results for several goal functions for multi-microphone arrays are provided and discussed. ** Title: Joint Direction-of-Arrival and Array Shape Tracking for Multiple Moving Targets Authors: Jason Goldberg, Tel Aviv University Ana Perez-Neira, UPC Miguel Lagunas, UPC Volume: 1, Page: 511 Abstract: An algorithm for the joint tracking of source DOAs and sensor positions is presented to address the problem of DOA tracking in the presence of sensor motion. Initial maximum likelihood estimates of source DOAs and sensor positions are refined by Kalman filtering. Spatio-temporally correlated array movement is considered. Source angle dynamics are used to achieve correct data association. 
The new technique is capable of performing well for the difficult cases of sources that cross in angle, fully coherent sources, as well as sources of identical or vastly different (possibly time-varying) power. Computer simulations show that the approach is robust in the presence of array motion modeling uncertainty and effectively reduces dependence on expensive and possibly unreliable hardware. ** Title: Comparison of Probabilistic Least Squares and Probabilistic Multi-Hypothesis Tracking Algorithms for Multi-Sensor Tracking Authors: Mark L. Krieg, DSTO, CSSIP, University of Adelaide Douglas A. Gray, University of Adelaide, CSSIP Volume: 1, Page: 515 Abstract: A key element for successful tracking is knowing from which target each measurement originates. These measurement-to-target associations are generally unavailable, and the tracking problem becomes one of estimating both the assignments and the target states. We present the Probabilistic Least Squares Tracking (msPLST) algorithm for estimating the measurement-to-target assignments and the track trajectories of multiple targets, using measurements from multiple sensors. This is a different approach to that used in Probabilistic Multi-Hypothesis Tracking (PMHT), although both algorithms employ the concept of an extended observer containing both the target states and the measurement-to-target assignments. A comparison of both algorithms is made, and their performance is evaluated using simulated data. ** Title: Direction Finding with Imperfect Wavefront Coherence: A Matrix Fitting Approach Using Genetic Algorithm Authors: Alex B. Gershman, Ruhr University Bochum Christoph F. Mecklenbrauker, Ruhr University Bochum Johann F. Bohme, Ruhr University Bochum Volume: 1, Page: 519 Abstract: The performance of high-resolution direction finding methods degrades in several practical situations where the wavefronts have imperfect spatial coherence. 
The original solution to this problem was proposed by Paulraj and Kailath, but their technique requires a priori knowledge of the matrix characterizing the loss of wavefront coherence along the array aperture. Here, a novel solution to this problem is proposed, which does not require a priori knowledge of the spatial coherence matrix. Our technique is based on the multidimensional minimization of an appropriate concentrated cost function using a genetic algorithm (GA). ** Title: Design Of An Optimum Wideband Active Sonar Array With Robustness Authors: Saman S. Abeysekera, Curtin University of Technology Y.H. Leung, Curtin University of Technology Volume: 1, Page: 523 Abstract: The use of wideband active sonar array processing to estimate the range, velocity and bearing of a target has received much interest in the literature recently. Although increased attention has been focused on wideband correlation processing for estimating range and velocity, array directivity patterns are almost always computed and interpreted under the narrowband signal assumption. This paper considers the target bearing estimation problem using the wideband correlation approach. Via this approach, it will be shown how an optimum set of array weights can be selected for a known transmitted signal. The optimization procedure also provides robustness against errors in the array structure. ** Title: Multipath time-delay estimation Authors: Jean Jacques Fuchs, IRISA Volume: 1, Page: 527 Abstract: A transmitted and known signal is observed at the receiver through more than one path in additive noise. The problem is to estimate the number of paths and for each of them the associated attenuation and delay. It is a frequent problem in sonar, radar and geophysics. We propose an algorithm that is easy to implement, that has a reasonable computational load and seems to be able to solve the problem under more severe conditions (lower SNR) than previous methods. 
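Illustrative sketch for the multipath time-delay entry above: the conventional matched-filter baseline, not the algorithm proposed in the paper. The received signal is correlated with the known transmitted signal and the largest correlation peaks are read off as delay and attenuation estimates. Several simplifications are assumed here: the number of paths is taken as known (the paper estimates it), the data are noise-free, the delays are integer samples, and the paths are well separated. The Barker-13 test signal and the function names `simulate` and `estimate_paths` are choices made for this example.

```python
# Barker code of length 13: aperiodic autocorrelation sidelobes of magnitude <= 1.
BARKER13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]


def simulate(signal, paths, length):
    """Sum delayed, scaled copies of the known signal (noise-free for clarity)."""
    received = [0.0] * length
    for delay, gain in paths:
        for n, s in enumerate(signal):
            received[delay + n] += gain * s
    return received


def estimate_paths(received, signal, num_paths):
    """Matched-filter estimate: pick the num_paths largest correlation peaks."""
    energy = sum(s * s for s in signal)
    max_lag = len(received) - len(signal)
    corr = []
    for lag in range(max_lag + 1):
        value = sum(received[lag + n] * s for n, s in enumerate(signal))
        corr.append(value / energy)  # normalised so an isolated path gives its gain
    best = sorted(range(len(corr)), key=lambda lag: abs(corr[lag]),
                  reverse=True)[:num_paths]
    return sorted((lag, corr[lag]) for lag in best)
```

With `simulate(BARKER13, [(5, 1.0), (17, 0.6)], 40)` as input, `estimate_paths(..., BARKER13, 2)` recovers both delays, while the attenuation estimates are slightly biased by the autocorrelation sidelobes of the other path. That bias, and the loss of resolution when paths are closely spaced or the SNR is low, is the kind of limitation that motivates more refined estimators.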
** Title: Fast Maximum Likelihood Estimation With Multiple Signal Initialization Authors: Robert B. MacLeod, NUWC Richard J. Vaccaro, University of Rhode Island Volume: 1, Page: 531 Abstract: In this paper we are concerned with signal processing of acoustic signals resulting from active transmissions by high frequency sonar systems. These signals consist of structured interference related to propagation effects in the medium, reflections from targets, and measurement noise. The methods herein model these signals as replicas of the transmitted signal, scaled in amplitude and time, and delayed. Furthermore, we are interested in signals with `simple' time frequency profiles, such as linear frequency modulated (LFM) or hyperbolic frequency modulated (HFM) signals. These signals have the underlying property that the principal ridge of the autoambiguity function crosses the mid point of the time-frequency plane in a smooth manner, with a simple relationship between time delay and time scaling (frequency shifting). This paper describes a method for estimating the delay and time scale of signal components using fast maximum likelihood, while preserving the high resolution property of related time delay estimation techniques. ** Title: An Algorithm for Detecting Closely Spaced Delay/Doppler Components Authors: Amir W. Habboosh, NUWCDIVNPT Richard J. Vaccaro, University of Rhode Island Steven M. Kay, University of Rhode Island Volume: 1, Page: 535 Abstract: This paper considers a method for estimating time delays, amplitudes, and Doppler scales of a multipath signal. The method is an extension of work previously reported by Manickam and Vaccaro which dealt solely with time delays and amplitudes, and extended by Habboosh and Vaccaro to include Doppler scale. In this paper, an algorithm is presented for determining the size of the indicator set to reduce ill-conditioning of the signal subspace matrix. 
Simulation results are shown and comparisons to the Cramer-Rao lower bound provided; these results show that significant reduction in estimate variances can be achieved using the deconvolution approach with a properly selected indicator set. ** Title: Improvement of TDOA Measurement using Wavelet Denoising with a Novel Thresholding Technique Authors: Shi Quan Wu, The Chinese University of Hong Kong Hing Cheung So, City University of Hong Kong Pak Chung Ching, The Chinese University of Hong Kong Volume: 1, Page: 539 Abstract: In this paper, wavelet denoising is applied in time delay estimation between signals received at two spatially separated sensors in the presence of noise. Prior to cross correlation, each of the sensor outputs is denoised according to a novel thresholding rule in order to increase the input signal-to-noise ratio. Unlike conventional generalized cross correlators (GCCs), it does not require spectral estimation of the source signal and the corrupting noises which may introduce large delay variance. It is proved that the delay estimate provided by the proposed method is globally convergent to the true value with a high probability. Computer simulations illustrate that the technique outperforms other GCCs for different SNRs when the sampling rate is sufficiently high. ** Title: A Short-Time Wiener Filter for Noise Removal in Underwater Acoustic Data Authors: Charles W. Therrien, NPS, Monterey K.L. Frack, NPS, Monterey N. Ruiz Fontes, NPS, Monterey Volume: 1, Page: 543 Abstract: A noise removal algorithm based on short-time Wiener filtering is described. An analysis of the performance of the filter in terms of processing gain, mean square error, and signal distortion is presented. A generalized form of the filter is also discussed and results of applying the algorithm to some typical underwater acoustic data are presented. ** Title: Fast Approximate Subspace Tracking (FAST) Authors: Donald W. Tufts, University of Rhode Island Edward C. 
Real, Sanders, A Lockheed Martin Company James W. Cooley, University of Rhode Island Volume: 1, Page: 547 Abstract: A new fast and accurate algorithm for tracking singular values, singular vectors and the dimension of the signal subspace through an overlapping sequence of data matrices is presented. The accuracy of the algorithm approaches that of the Prony-Lanczos method with speed and accuracy superior to both the PAST and PASTd algorithms for moderate to large size problems. The algorithm is described for the special case of changes to two columns of the matrix prior to each update of principal singular vectors and values. Comparisons of speed and accuracy are made with the algorithms named above. ** Title: A Subspace Framework for Fast Parameter Estimation with Known Waveforms Authors: Brian E. Freburger, University of Rhode Island Donald W. Tufts, University of Rhode Island Tom A. Palka, University of Rhode Island Volume: 1, Page: 551 Abstract: An efficient scheme for implementing a search of a likelihood function of known form at moderate to high SNR is constructed. Often, the original function to be searched is ill behaved with many local extreme points. By projecting the signal onto a subspace of replica waveforms we first find the maximum of a related function that is more well behaved, and then follow with a local search on the original function. The approach builds on a previous method of estimation of time delay of a narrowband signal, and it can be used to improve the efficiency of Fast Maximum Likelihood estimation. ** Title: Terrain Classification in Polarimetric SAR Using Wavelet Packets Authors: Nirmal Keshava, Carnegie Mellon University Jose M.F. Moura, Carnegie Mellon University Volume: 1, Page: 555 Abstract: POL-SAR data acquired from the two 1994 flights of the SIR-C/X-SAR platform has illustrated the variability of measurements due to seasonal, spectral, and angular changes. 
Consequently, statistical techniques for terrain classification make robust, unsupervised classification problematic. We present an algorithm for classifying terrain that accounts for variability in terrain signatures by deriving a single representative process for each terrain from a family of stochastic scattering models. A best-basis search through a wavelet packet tree, using the Bhattacharyya coefficient as a cost measure, determines the optimal unitary basis of eigenvectors for the representative process and offers a scale-based interpretation of the scattering phenomena. The associated eigenvalues and means are determined through iterative algorithms. The technique is illustrated with a simple example. ** Title: Electromagnetic Matched-Field Processing for Target Height Finding with Over-the-Horizon Radar Authors: Michael Papazoglou, Duke University Dept. of ECE Jeffrey L. Krolik, Duke University Dept. of ECE Volume: 1, Page: 559 Abstract: The refraction of over-the-horizon skywave radar signals by the ionosphere facilitates wide-area surveillance. While current systems measure target ground range, azimuth, and velocity they do not estimate target altitude, which is important for classification purposes. In this paper, a method akin to matched-field processing in underwater acoustics is proposed for target height-finding. The approach exploits the delay-Doppler differences between direct and surface-reflected multipath returns from the target. In particular, the coherent sum of these multipath returns can be matched in the complex delay-Doppler space for a single dwell to estimate target altitude, ground range, and radial velocity. In this paper, a maximum likelihood estimate (MLE) of these target coordinates is developed without requiring knowledge of the target backscatter reflection coefficients. The performance of the MLE is evaluated through simulation for an uncertain quasi-parabolic ionosphere and compared to the Cramer-Rao lower bound (CRLB). 
** Title: Time-Frequency Classification using a Multiple Hypotheses Test: An Application to the Classification of Humpback Whale Signals Authors: Geoff Roberts, SPRC, Queensland University of Technology Abdelhak M. Zoubir, SPRC, Queensland University of Technology Boualem Boashash, SPRC, Queensland University of Technology Volume: 1, Page: 563 Abstract: We present a non-stationary signal classification algorithm based on a time-frequency representation and a multiple hypothesis test. The time-frequency representation is used to construct a time-dependent quadratic discriminant function. At selected points in time we evaluate the discriminant function and form a set of statistics which are used to test the multiple hypotheses. The multiple hypotheses are treated simultaneously using the sequentially rejective Bonferroni test to control the probability of incorrect classification of one class. We show results for classifying three classes of humpback whale calls. The results demonstrate that this time-frequency method performs favourably when compared with a frequency domain method which assumes stationarity. ** Title: Source Classification Using Pole Method of AR Model Authors: Jianguo Huang, NPU, Shaanxi Jianping Zhao, NPU, Shaanxi Yiqing Xie, NPU, Shaanxi Volume: 1, Page: 567 Abstract: An easy and efficient method is proposed for classifying underwater sources in passive sonar by extracting the poles of an AR model as features of the source-emitted noise. Our research demonstrates that the poles of an AR model can represent the intrinsic spectral characteristics of sources, and that simple statistical classifiers can achieve excellent recognition performance owing to the good cluster property and robustness of the poles corresponding to different sources. More importantly, the poles of a low-order AR model can represent the basic features of a source, so the computational burden is reduced significantly.
Real data are processed and classification results show the efficiency even for short data records. ** Title: The LMMSE Estimate-based Multiuser Detector: Performance Analyses and Adaptive Implementation Authors: Hongya Ge, New Jersey Institute of Technology Volume: 1, Page: 571 Abstract: Presented in this work are analytical expressions of the performance measure on the LMMSE estimate-based multiuser detector, including error probability expression and its computationally and notationally efficient approximations, signal to interference-plus-noise ratio, and asymptotic efficiency. Also included in this work are adaptive implementation schemes of the LMMSE detector and the equivalent relation between them under appropriate assumptions. Simulations are included to show the tightness of approximate results over a wide range of near-far ratio and various combinations of SNRs of interfering multiple-access users. ** Title: Improved Doppler Tracking and Correction for Underwater Acoustic Communications Authors: Mark Johnson, WHOI Lee E. Freitag, WHOI Milica Stojanovic, NEU Volume: 1, Page: 575 Abstract: The performance of coherent acoustic communication systems involving moving platforms (e.g., underwater vehicles and ships) is adversely affected by Doppler shift resulting from relative motion of the transmitter and receiver. This paper presents a series of innovations which, together, dramatically improve the response to Doppler shift of a widely-used adaptive receiver algorithm. The innovations include a frequency-shift estimator, time-scale interpolator and robust phase-locked loop (PLL). These techniques reduce the computational load of the coherent equalizer and provide accurate Doppler tracking. Results from at-sea testing are presented to illustrate the performance of the combined algorithm. ** Title: A Blind Multichannel Combiner for Long Range Underwater Communications Authors: Bayan S.
Sharif, University of Newcastle Jeff Neasham, University of Newcastle David Thompson, University of Newcastle Oliver R. Hinton, University of Newcastle Alan E. Adams, University of Newcastle Volume: 1, Page: 579 Abstract: This paper presents the development and performance of blind algorithms for a spatial diversity scheme to enable reliable data telemetry over a long range underwater acoustic channel. A number of Bussgang based stochastic gradient algorithms were tested for this multipath channel with additive white and coloured shipping noise. Both simulation and real experimental tests have shown that a significant improvement is obtained by utilising the spatial diversity of the long range channel and the ability of the combiner to perform joint equalisation and carrier phase tracking. ** Title: Very Long Instruction Word Architectures for Digital Signal Processing Authors: Jon Mellott, HSDAL, University of Florida Fred Taylor, HSDAL, University of Florida Volume: 1, Page: 583 Abstract: Due to advancements in semiconductor processing technology, unprecedented levels of system integration are now possible in digital signal processing systems. MIMD/multicomputer architectures used for parallel digital signal processing applications are not always efficient, and are difficult to program. Very long instruction word processors are uniquely suited to digital signal processing applications, able to exploit opportunities for fine and coarse grained parallelism efficiently without the overhead of MIMD/multicomputer approaches. A flexible, high-level language programming environment has been developed in support of this processor paradigm. 
** Title: A Novel 32 Bit RISC Architecture Unifying RISC and DSP Authors: Christoph Baumhof, Hyperstone Electronics Frank Muller, Hyperstone Electronics Otto Muller, Hyperstone Electronics Manfred Schlett, Hyperstone Electronics Volume: 1, Page: 587 Abstract: A novel 32 bit RISC architecture is presented which is the basis of a powerful general purpose microprocessor and in parallel a 16/32 bit fixed point DSP processor. This unifying of RISC and DSP was not achieved by simply using a microprocessor and DSP core, but a new concept for the implementation of DSP processors has been developed. With the architecture presented it has been proven that a DSP processor can be implemented using strictly the RISC design philosophy. Besides providing basic 16 bit fixed point functionality, the architecture implements a set of DSP instructions that support an efficient mapping of common DSP algorithms to the processor. ** Title: A Dual-Issue RISC Processor for Multimedia Signal Processing Authors: Hisakazu Sato, Mitsubishi Electric Corporation Edgar Holmann, Mitsubishi Electric Corporation Toyohiko Yoshida, Mitsubishi Electric Corporation Masahito Matsuo, Mitsubishi Electric Corporation Toru Kengaku, Mitsubishi Electric Corporation Volume: 1, Page: 591 Abstract: This paper presents the architecture of a newly-developed dual-issue RISC processor, D10V, that achieves high throughput signal processing capability while maintaining flexibility for general purpose applications. To achieve adequate performance for signal processing, this RISC processor operates both a MAC unit and a memory access unit in parallel, where two-word data memory access is supported. As the results of several benchmarks illustrate, the D10V competes favorably with, and in some instances outperforms, conventional DSPs.
** Title: A processor-coprocessor architecture for high end video applications Authors: Elmar Maas, Braunschweig University of Technology Dirk Herrmann, Braunschweig University of Technology Rolf Ernst, Braunschweig University of Technology Peter Ruffer, Braunschweig University of Technology Sieghard Hasenzahl, Philips Martin Seitz, Philips Volume: 1, Page: 595 Abstract: High end video applications are still implemented in hardware consisting of many components. Integration of these components on one IC is difficult as they are typically low volume products and often customization is also required, e.g. in studio applications. This is easier on the board level than on an integrated system. Using hardware parameters for customization can partly overcome the flexibility problem with additional hardware costs. Low cost can be obtained by a change in the architecture paradigm to a processor-coprocessor system. This, however, requires careful design space exploration since the performance target is beyond current DSP processors while at the same time flexibility is required. The paper presents the application of high level synthesis and novel Hardware-Software Co-Synthesis tools to design space exploration. It is shown that completely different algorithms can be mapped to the same target system at a much lower cost than current approaches. ** Title: An MPEG-2 Encoder Architecture Based on a Single-Chip Dedicated LSI with a control MPU Authors: Yasushi Ooi, NEC Corp. Osamu Ohnishi, NEC Corp. Yutaka Yokoyama, NEC Corp. Yoichi Katayama, NEC Corp. Masayuki Mizuno, NEC Corp. Masakazu Yamashina, NEC Corp. Hideto Takano, NEC Corp. Naoya Hayashi, NEC Corp. Ichiro Tamitani, NEC Corp. Volume: 1, Page: 599 Abstract: This paper describes an MPEG-2 encoder architecture based on a hard-wired LSI with a control MPU. All basic functions of MPEG-2 MP@ML video compression are integrated in the dedicated LSI.
For the motion estimation, a horizontally subsampled, diamond search was employed as a simplified first search step. It can reduce operations to 20% of the full-search, with an estimated SNR degradation of only -0.1dB. To help achieve a single-memory interface, a pair of 81MHz, 16Mb SDRAMs are used as a frame buffer and a code buffer. Data bandwidth between the SDRAMs and the LSI is kept to less than 94% of the maximum data rate. Jobs assigned to the control MPU need be executed less frequently than those of the macroblock coding, which helps reduce the requirements for MPU performance to about 7MIPS. ** Title: An Efficient and Reconfigurable VLSI Architecture for Different Block Matching Motion Estimation Algorithms Authors: Xiao-Dong Zhang, University of Science and Technology Chi-ying Tsui, HKUST Volume: 1, Page: 603 Abstract: This paper describes a VLSI architecture which can be reconfigured to support both Full Search Block-Matching algorithm and 3-step Hierarchical Search Block-Matching algorithm. By using a reconfigurable register-mux array and a parameterizable adder tree, the 2-D array architecture provides efficient real time motion estimation for many video applications. We also propose a memory architecture and an associated switching network to solve the simultaneous data access problem. ** Title: An Operation-Saving VLSI Geometry Engine Core Authors: Konstantina Karagianni, University of Patras George Diamantakos, University of Patras Vassilis Paliouras, University of Patras Thanos Stouraitis, University of Patras Volume: 1, Page: 607 Abstract: A floating point geometry engine core is introduced in this paper. The proposed core is optimized for performing the 3-D geometrical transformations, including the hardware evaluation of sin(x) and cos(x) functions. The architecture exploits the structure of the transformation matrices, thus reducing the number of floating point operations required per transformation. 
VLSI chip implementation issues for the specific architecture are also discussed. ** Title: The FFT Butterfly Operation in 4 Processor Cycles on a 24 Bit Fixed-point DSP with a Pipelined Multiplier Authors: Martin Grajcar, University of Passau Bernhard Sick, University of Passau Volume: 1, Page: 611 Abstract: Most of the existing Digital Signal Processors (DSPs) are optimized for a fast and efficient computation of the Fast Fourier Transform (FFT). However, there are only two floating-point DSPs available which perform the butterfly operation of an FFT in 4 processor cycles, and no fixed-point DSP is designed that way. The new 24 bit fixed-point DSP DAISY, which is able to execute the butterfly in 4 cycles even using a two-stage pipelined multiplier, is described in this paper. With this pipelined multiplication it is possible to reduce the processor cycle time significantly. ** Title: New Unified VLSI Architectures for Computing DFT and Other Transforms Authors: Shen-Fu Hsiao, CIE, NSYSU Chung-Yi Yen, CIE, NSYSU Volume: 1, Page: 615 Abstract: Fast computation of the DFT (Discrete Fourier Transform) and other popular transforms is essential in high-speed DSP applications. This paper proposes new architectures with low hardware cost and high throughput rate. The new architectures are very suitable for VLSI implementation since they are regular and require far fewer complex multipliers than recently proposed approaches. Furthermore, the same architectures may be exploited to compute a variety of frequently-used transforms. ** Title: Half-Rate GSM Vocoder Implementation On A Dual MAC Digital Signal Processor Authors: Mohit K. Prasad, Lucent Technologies Paul D'Arcy, Lucent Technologies Arup Gupta, Lucent Technologies Marc S. Diamondstein, Lucent Technologies Hosahalli R. Srinivas, Lucent Technologies Volume: 1, Page: 619 Abstract: The Global System for Mobile (GSM) communications uses a 13 Kbps vocoder which expands to 22.8 Kbps after channel coding.
To increase the user capacity the half-rate channel has a gross transfer rate of 11.4 Kbps. The vocoder for the half-rate channels operates at 5.6 Kbps. The computational requirements of a half-rate vocoder and other necessary services required the design of an entirely new digital signal processing architecture geared towards 1-D signal and speech processing. The architecture is characterized by Very Long Instruction Word (VLIW) and two multiply-accumulate (MAC) units. Other enhancements of the hardware allow an efficient implementation of the half-rate GSM vocoder. This paper describes the architecture and compares the vocoder performance with existing implementations. ** Title: VLSI Implementation of an Area-Efficient Architecture for the Viterbi Algorithm Authors: Carlos Cabrera, University of Santiago de Compostela Montserrat Boo, University of Santiago de Compostela Javier Bruguera, University of Santiago de Compostela Volume: 1, Page: 623 Abstract: The Viterbi algorithm is widely used in communications and signal processing. Recently, several area-efficient architectures for this algorithm have been proposed. Area-efficient architectures trade speed for area by means of mapping the N states of the trellis describing the Viterbi algorithm to P processing elements, where N>P. In this paper a practical VLSI implementation of an area-efficient architecture to evaluate the Viterbi algorithm is presented. The architecture that has been implemented is composed of only two processing elements and the corresponding routing network to process, in different cycles, all the states of the trellis. The resulting architecture has been integrated in a chip using a 0.7 micron CMOS technology, occupying an area of 9 sq. mm. ** Title: Low-Area Dual Basis Divider over GF($2^m$) Authors: Leilei Song, University of Minnesota Keshab K. Parhi, University of Minnesota Volume: 1, Page: 627 Abstract: This paper presents a low-area finite field divider using dual basis representation.
This divider is based on the division algorithm of solving the discrete Wiener-Hopf equation using the Gauss-Jordan elimination method. The hardware complexity of the matrix generation part has been reduced dramatically from $O(m^2)$ to $O(m)$. When it is used as a building block for a large system, this divider can achieve more savings in hardware by utilizing sub-structure sharing techniques. ** Title: VLSI Architecture for Datapath Integration of Arithmetic over $GF(2^m)$ on Digital Signal Processors Authors: Wolfram Drescher, Technical University of Dresden Kay Bachmann, Technical University of Dresden Gerhard P. Fettweis, Technical University of Dresden Volume: 1, Page: 631 Abstract: This paper examines the implementation of Finite Field arithmetic, i.e. multiplication, division, and exponentiation, for any standard basis $GF(2^m)$ with $m \le 8$ on a DSP datapath. We introduce an opportunity to exploit cells and the interconnection structure of a typical binary multiplier unit for the Finite Field operations by adding just a small overhead of logic. We develop division and exponentiation based on multiplication on the algorithm level and present a simple scheme for implementation of all operations on a processor datapath. ** Title: A Fast Direction Sequence Generation Method for CORDIC Processors Authors: Seunghyeon Nahm, Seoul National University Wonyong Sung, Seoul National University Volume: 1, Page: 635 Abstract: This paper describes a new direction sequence generation method for the circular CORDIC algorithm. A conventional approach employs an angle computation algorithm to control the direction of rotation in the form of a sign sequence, where the sign generation is a bottleneck for fast implementations. The proposed method reduces the number of sequential computations by employing a new angle representation model and linearizing the arctangent function in small angles.
The direction sequence can be generated by about a third of the iterative computations required in the conventional algorithm, which reduces the hardware requirements correspondingly. This algorithm is especially attractive when pipelining is not allowed for feedback control, such as found in phase tracking applications. A VLSI implementation example for a high-speed quadrature demodulator is also discussed. ** Title: A Radix-4 Redundant Cordic Algorithm with Fast On-line Variable Scale Factor Compensation Authors: Chieh-Chih Li, Industrial Research Institute Sau-Gee Chen, Nat. Chiao Tung University Volume: 1, Page: 639 Abstract: In this work, a fast radix-4 redundant CORDIC algorithm with variable scale factor is proposed. The algorithm includes an on-line scale factor decomposition algorithm that transforms the complicated variable scale factor into a sequence of simple shift-and-add operations and does the variable scale factor compensation in the same fashion. On the other hand, the on-line decomposition algorithm itself can be realized with simple and fast hardware. The new CORDIC algorithm needs only 0.8n iterations, the smallest number among all CORDIC algorithms and only about two-thirds of the rotations required by the best existing (hybrid radix-2 and radix-4) redundant algorithms. Therefore, the new algorithm achieves fast rotation iterations, high-speed and low-overhead scale factor compensations, which are hard to attain simultaneously for the existing algorithms. The on-line scale factor compensation can also be applied to the existing on-line CORDIC algorithms. ** Title: Pipelining of Cordic Based IIR Digital Filters Authors: Jun P. Ma, University of Minnesota Keshab K. Parhi, University of Minnesota Ed F.
Deprettere, Delft University of Technology Volume: 1, Page: 643 Abstract: Cordic based IIR digital filters possess desirable properties for VLSI implementation such as local connection, regularity, and good finite word-length behavior, but can't be pipelined to finer levels (such as bit or multi-bit levels) due to the presence of feedback loops. In this paper, a pipelining method for the cordic based IIR digital filters is proposed using the constrained filter design methods and the polyphase decomposition technique. Using this method, the filter sample rate can be increased to any desired level. ** Title: An Asynchronous Implementation of the MAXLIST Algorithm Authors: Chris J. Myers, University of Utah Hao Zheng, University of Utah Volume: 1, Page: 647 Abstract: We present an efficient asynchronous VLSI architecture for calculating running maximum or minimum values over a sliding window. Running maximums or minimums are very useful for many signal and image processing tasks. Our architecture performs the calculation using the MAXLIST algorithm. In order to take advantage of the wide delay variations due to data-dependencies and operating conditions, an asynchronous approach is taken to achieve higher performance and lower power. Simulation results demonstrate that our asynchronous architecture is significantly faster than existing and potential synchronous architectures. ** Title: A Novel Systematic Mapping Approach for Highly Efficient Multiplexed FIR-Filter Architectures Authors: Wolfgang Wilhelm, RWTH Aachen Tobias Noll, RWTH Aachen Volume: 1, Page: 651 Abstract: A systematic mapping approach leading to efficient VLSI-architectures for FIR-filters with a wide range of system parameters is presented. This approach is subdivided into two steps. In the first step the folding technique is applied at bit-level. 
The free parameters of this technique are then fixed in the second step according to guidelines which are derived from design-strategies for efficient VLSI-architectures. For many applications this approach leads to a reduced hardware complexity in comparison with state-of-the-art techniques. In addition, regularity and scalability of the resulting architectures keep the design effort small. In order to demonstrate the efficiency and the flexibility of this approach a new class of efficient time-shared FIR-filters for adaptive equalizing and a new class of efficient matched filters for rapid code acquisition in spread spectrum receivers are presented. ** Title: An upper bound of the throughput of multirate multiprocessor schedules Authors: Rainer Schoenen, ISS, RWTH Aachen Vojin E. Zivojnovic, ISS, RWTH Aachen Heinrich Meyr, ISS, RWTH Aachen Volume: 1, Page: 655 Abstract: Multirate Dataflow Graphs are used for modelling iterative computations, allowing concurrency and arbitrary data rates at ports. This model is often used for signal processing algorithms. For static scheduling the iteration period bound represents the final barrier for the computation speed, the approximation of which is often the goal of an implementation. For the singlerate case (SR-DFG), where all rates are one, an explicit bound exists and is the subject of many published papers. This work presents a bound for the multirate case, which reduces to the known bound if applied to an SR-DFG. Assumptions made are a vectorized execution and a blocked schedule that organizes multiple iterations inside one period (also called execution cycle). The influence of characteristic properties in the multirate case is emphasized.
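Editorial illustration (not part of the paper above): the abstract refers to the explicit bound known for the single-rate case. A minimal sketch of that classical bound, assuming the usual textbook form in which the iteration period bound is the maximum, over all loops of the graph, of the total execution time in the loop divided by the number of delays in the loop. The function name `iteration_bound`, the graph encoding, and the example graph are assumptions made for this sketch; the multirate bound of the paper is not reproduced here.

```python
from fractions import Fraction


def iteration_bound(exec_time, edges):
    """Classical iteration period bound of a single-rate dataflow graph.

    exec_time: dict mapping node name -> integer execution time
    edges:     list of (src, dst, delays) with integer delay counts
    Returns the maximum over all simple loops of
    (sum of execution times in the loop) / (number of delays in the loop),
    or None if the graph has no loop.
    """
    adj = {}
    for src, dst, delays in edges:
        adj.setdefault(src, []).append((dst, delays))

    best = None

    def dfs(start, node, time_sum, delay_sum, visited):
        nonlocal best
        for nxt, delays in adj.get(node, []):
            if nxt == start:
                # Closed a loop back to the starting node.
                total_delay = delay_sum + delays
                if total_delay == 0:
                    raise ValueError("delay-free loop: graph is not computable")
                ratio = Fraction(time_sum, total_delay)
                if best is None or ratio > best:
                    best = ratio
            elif nxt not in visited and nxt > start:
                # Only visit nodes larger than the start node, so each
                # simple loop is enumerated once (rooted at its smallest node).
                dfs(start, nxt, time_sum + exec_time[nxt],
                    delay_sum + delays, visited | {nxt})

    for start in sorted(exec_time):
        dfs(start, start, exec_time[start], 0, {start})
    return best


# Hypothetical three-node example with two loops:
#   A -> B -> A       time 1 + 2 = 3,     1 delay  -> ratio 3
#   A -> B -> C -> A  time 1 + 2 + 5 = 8, 2 delays -> ratio 4
times = {"A": 1, "B": 2, "C": 5}
graph = [("A", "B", 0), ("B", "A", 1), ("B", "C", 0), ("C", "A", 2)]
print(iteration_bound(times, graph))
```

The loop enumeration is exponential in the worst case, which is acceptable for a small illustrative graph; larger graphs would normally call for a dedicated minimum or maximum cycle ratio algorithm.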
** Title: Minimizing The Number Of Operations In DSP Computations Authors: Inki Hong, UCLA Miodrag Potkonjak, UCLA Volume: 1, Page: 659 Abstract: Reduction of the number of operations optimizes the important design metrics such as area, cost, throughput, and power consumption for both custom ASIC and programmable processor implementations. We propose a novel technique to minimize the number of operations in DSP computations. The first step of the approach logically partitions a computation into strongly connected components. The second step optimizes each component separately. In the third step the components are merged to further optimize. Finally, the components are scheduled to minimize memory consumption. The effectiveness of our approach is demonstrated on real-life examples. ** Title: BEEHIVE: An Adaptive, Distributed, Embedded Signal Processing Environment Authors: Shahram Famorzadeh, Georgia Institute of Technology Vijay K. Madisetti, Georgia Institute of Technology Thomas Egolf, Georgia Institute of Technology Tuongvu Nguyen, Georgia Institute of Technology Volume: 1, Page: 663 Abstract: We propose an open signal processing system design and implementation environment, BEEHIVE, that allows application developers to rapidly compose and debug functional specifications in a networked, distributed computing environment, and then later migrate the application (transparently) onto an embedded, distributed, computing hardware/software platform, with the capability to reconfigure (adaptively) the resources assigned to the application to meet the dynamic real-time requirements of the implementation. Recent developments in the area of virtual machines; broker-based, distributed, transportable computing; object-oriented programming methodologies, Java and its real-time extensions; reconfigurable and programmable hardware; approximate algorithms; adaptive-load and resource-management algorithms, are harnessed in this operating environment. 
** Title: On Objective Function Selection in List Scheduling Algorithms for Digital Signal Processing Applications Authors: Jan Jonsson, Chalmers University of Technology Jonas Vasell, Chalmers University of Technology Volume: 1, Page: 667 Abstract: In this paper we discuss the choice of objective function in list scheduling algorithms for scheduling data flow graphs onto multiprocessor architectures. A majority of the list scheduling algorithms used in practice utilize a global strategy wherein actor static levels are used for making scheduling decisions. When fine-grain DSP applications such as FIR or elliptical filters need to be scheduled on architectures that consist of commodity part processors and a general interconnection network whose interprocessor communication cost cannot be ignored, a traditional list scheduling algorithm is in many cases not the best choice. In an experimental study we compare these global strategies to local strategies that utilize load balancing. The study reveals that global strategies suffer from flaws that could cause local strategies to yield more than 10% shorter schedule lengths on the average. In particular we find that a novel Earliest Finish Time (EFT) strategy exhibits very good performance. ** Title: VLSI High Level Synthesis of Fast Exact Least Mean Square Algorithms based on Fast FIR filters Authors: Jean Philippe Diguet, University of Rennes, ENSSAT Olivier Sentieys, University of Rennes, ENSSAT Daniel Chillet, University of Rennes, ENSSAT Jean Luc Philippe, University of Rennes, ENSSAT Volume: 1, Page: 671 Abstract: This paper relates experiences of algorithmic transformations in High Level Synthesis, in the area of acoustic echo cancellation. The processing and memory units are automatically designed for various equivalent LMS algorithms, in the FIR case, with important computational load. The results obtained with different filter lengths, give an accurate prototyping of new fast versions of the LMS algorithm. 
They also show that a theoretical arithmetic reduction must be correlated to the associated increase of memory requirements. ** Title: Hierarchical VHDL Libraries for DSP ASIC Design Authors: John McCanny, Queen's University Belfast Douglas Ridge, ISS Ltd. Yi Hu, ISS Ltd. Jill Hunter, Queen's University Belfast Volume: 1, Page: 675 Abstract: Methods are presented for the rapid design of DSP ASICs based on the use of hierarchical VHDL libraries. These are portable across many silicon foundries and allow complex DSP silicon systems to be developed in a fraction of the time normally required. Resulting designs are highly competitive with ones created using conventional methods. The approach is illustrated by its application to ADPCM codec and DCT cores. ** Title: DSP QUANT: Design, Validation, And Applications Of DSP Hard Real-Time Benchmark Authors: Chunho Lee, UCLA Darko Kirovski, UCLA Inki Hong, UCLA Miodrag Potkonjak, UCLA Volume: 1, Page: 679 Abstract: Although the undeniable importance of a high quality, efficient and effective DSP synthesis benchmark has been firmly and widely established, until now the emphasis of benchmarking has been restricted to assembling individual examples. In this paper we introduce the "ideal candidate benchmark methodology" which poses the development of the benchmark as a well-defined statistical and optimization problem. We first outline the goals and requirements relevant for the benchmark development. After discussing the computational complexity of the benchmark selection problem, we present a simulated annealing-based algorithm for solving this computationally intractable optimization task. Using this approach from 150 examples we select 12 examples for the new DSP Quant benchmark for DSP hard Real-Time applications. The DSP benchmark is statistically validated, and its application to the analysis and development of system-level synthesis algorithms is demonstrated.
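Editorial illustration (not part of the paper above): the abstract describes choosing 12 benchmark examples out of 150 with a simulated annealing-based algorithm. A minimal sketch of simulated annealing for subset selection in general follows. Everything specific in it is an assumption made for illustration: the names `cost` and `anneal_select`, the single scalar feature per example, the objective (make the subset mean match the population mean), and the cooling schedule. The paper's actual objective and parameters are not given in the abstract.

```python
import math
import random


def cost(subset, features, target):
    """Hypothetical objective: distance between subset mean and target mean."""
    mean = sum(features[i] for i in subset) / len(subset)
    return abs(mean - target)


def anneal_select(features, k, steps=2000, t0=1.0, alpha=0.995, seed=0):
    """Pick k of len(features) items by simulated annealing (requires k < n)."""
    rng = random.Random(seed)
    n = len(features)
    target = sum(features) / n

    selected = rng.sample(range(n), k)
    chosen = set(selected)
    unselected = [i for i in range(n) if i not in chosen]

    cur_cost = cost(selected, features, target)
    best, best_cost = list(selected), cur_cost
    temp = t0

    for _ in range(steps):
        # Propose a move: swap one selected item with one unselected item.
        i = rng.randrange(k)
        j = rng.randrange(len(unselected))
        selected[i], unselected[j] = unselected[j], selected[i]

        new_cost = cost(selected, features, target)
        delta = new_cost - cur_cost
        # Always accept improvements; accept a worse subset with a
        # probability that shrinks as the temperature falls.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = list(selected), cur_cost
        else:
            # Rejected: undo the swap.
            selected[i], unselected[j] = unselected[j], selected[i]
        temp *= alpha  # geometric cooling

    return sorted(best), best_cost


# Synthetic data standing in for 150 candidate examples.
data_rng = random.Random(1)
features = [data_rng.random() for _ in range(150)]
subset, subset_cost = anneal_select(features, 12)
print(len(subset), subset_cost)
```

The swap move keeps the subset size fixed at k throughout, so only the membership of the subset is searched; a realistic benchmark-selection objective would compare several statistics of the subset against the full population rather than one mean.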
** Title: Constructing Memory Layouts for Address Generation Units Supporting Offset 2 Access Authors: Bernhard Wess, Vienna University of Technology Martin Gotschlich, Vienna University of Technology Volume: 1, Page: 683 Abstract: We present an efficient memory layout generation algorithm for digital signal processors (DSPs) which takes advantage of indirect addressing modes with auto-modify operations. Previously proposed algorithms are optimized with respect to offset 1 access (auto-increment and decrement by 1). Our algorithm is based on a heuristic since the problem of generating optimum memory layouts is NP-complete. However, this algorithm produces optimum results if a bandwidth 2 layout exists for a given program variable access sequence. It is verified by experimental results that our technique achieves significant improvements over existing techniques. ** Title: Modulo-Addressing Utilization in Automatic Software Synthesis for Digital Signal Processors Authors: Markus Willems, ISS, RWTH Aachen Holger Keding, ISS, RWTH Aachen Vojin E. Zivojnovic, ISS, RWTH Aachen Heinrich Meyr, ISS, RWTH Aachen Volume: 1, Page: 687 Abstract: Digital Signal Processors (DSPs) have become key components for the implementation of digital signal processing systems. With DSPs moving into new application domains and the increasing complexity of modern DSP architectures, efficient programming support receives major interest. Therefore, an optimizing compiler becomes a must for future DSP architectures. Today's DSP compilers result in significant overheads both in memory consumption and program execution time compared to hand-written assembly code. This is mainly due to inefficient compiler support of the DSP-specific architectural features, such as the modulo-addressing capability, which is an enabling feature for a large class of DSP algorithms.
Within this paper we analyze why existing compilers fall short in supporting the modulo-addressing mode and present a compiler concept that allows the efficient utilization of this feature. We describe how an advanced compiler optimization strategy allows a near optimum support of the modulo-addressing mode, and point out why this concept is preferable to DSP-specific language extensions. ** Title: Cooperative register assignment and code compaction for digital signal processors with irregular datapaths Authors: Werner Kreuzer, Vienna University of Technology Bernhard Wess, Vienna University of Technology Volume: 1, Page: 691 Abstract: We address the phase ordering problem of code compaction and register assignment in a data flow graph compiler. During register assignment, we take into account the instruction-level parallelism available. Symbolic variables in straight-line code are allocated to register set/memory location pairs which maximally preserve the freedom available for code compaction. Whenever necessary, spill code is inserted during final register assignment and scheduled during code compaction. Register assignment is performed taking into account its impact on code compaction. This strategy results in final code of high quality. ** Title: Optimization of Embedded DSP Programs Using Post-pass Data-flow Analysis Authors: Ashok Sudarsanam, Princeton University Sharad Malik, Princeton University Steven Tjiang, Synopsys, Inc. ATG Stan Liao, Synopsys, Inc. ATG Volume: 1, Page: 695 Abstract: We investigate the problem of code generation for DSP systems on a chip. Such systems devote a limited quantity of silicon to program ROM, so application software must be maximally dense. Additionally, the software must be written so as to meet various high-performance constraints, which may include hard real-time constraints. Unfortunately, current compiler technology is unable to generate dense, high-performance code for DSPs, whose architectures are highly irregular.
Consequently, designers often resort to programming application software in assembly -- a time-consuming, error-prone, and non-portable task. Thus, DSP compiler technology must be improved substantially. We describe some optimizations that significantly improve the quality of compiler-generated code. Our optimizations are applied globally and even across procedure calls. Additionally, they are applied to the machine-dependent assembly representation of the source program. Our target architecture is the Texas Instruments TMS320C25 DSP. ** Title: Code Positioning to Reduce Instruction Cache Misses in Signal Processing Applications on Multimedia RISC Processors Authors: Hans-Joachim Stolberg, University of Hannover Masao Ikekawa, NEC Corp. Ichiro Kuroda, NEC Corp. Volume: 1, Page: 699 Abstract: Real-time operation of signal processing applications on multimedia RISC processors is often limited by high instruction cache miss rates of direct-mapped caches. In this paper, a heuristic approach is presented which reduces high instruction cache miss rates in direct-mapped caches by code positioning. The proposed algorithm rearranges functions in memory based on trace data so as to minimize cache line conflicts. Moreover, a new method to extract potential cache misses from trace data is introduced which enables accurate cache behavior analysis and greatly enhances code positioning efficiency. Application of code positioning to an MPEG-1 video decoder implementation on the V830 multimedia RISC processor reduced instruction cache refill cycles by 66-98%. The proposed code positioning algorithm does not require hardware modifications; it can easily be integrated in an object linker to automate the optimization process. ** Title: Code Generation By Using Integer-Controlled Dataflow Graph Authors: Takashi Miyazaki, NEC Edward A. Lee, EECS, UCB Volume: 1, Page: 703 Abstract: Integer-Controlled Dataflow (IDF) and its code generation applications in Ptolemy are presented.
In IDF graphs, which specify data processing systems, data token flow is controlled by integer control tokens and states of actors at run-time. The firing order of actors (schedule) is determined at compile-time; however, the actors are conditionally activated at run-time. This static schedule contributes to effective simulation of systems. IDF supports code generation. This enables code generation from program graphs that include conditional jumps, loops and repetitions, and greatly improves the practical usability of the program synthesis in Ptolemy. ** Title: Fixed-Point C Compiler for TMS320C50 Digital Signal Processor Authors: Jiyang Kang, Seoul National University Wonyong Sung, Seoul National University Volume: 1, Page: 707 Abstract: A fixed-point C compiler is developed for convenient and efficient programming of the TMS320C50 fixed-point digital signal processor. This compiler supports the `fix' data type that can have an individual integer word-length according to the range of a variable. It can add or subtract two data having different integer word-lengths by automatically inserting shift operations. The accuracy of the fixed-point multiply operation is significantly increased by storing the upper part of the multiplied double-precision result instead of keeping the lower part as is done in integer multiplication. Several target-specific code optimization techniques are employed to improve the compiler efficiency. The empirical results show that a fixed-point C program executes about an order of magnitude faster than a floating-point C program on a fixed-point digital signal processor. ** Title: Transcription of broadcast news - system robustness issues and adaptation techniques Authors: Raimo Bakis, IBM T.J. Watson Research Center Scott Schen, IBM T.J. Watson Research Center Ponani Gopalakrishnan, IBM T.J. Watson Research Center Ramesh Gopinath, IBM T.J. Watson Research Center Stephane Maes, IBM T.J.
Watson Research Center Lazaros Polymenakos, IBM T.J. Watson Research Center Volume: 2, Page: 711 Abstract: This paper describes some of the main problems and issues specific to the transcription of broadcast news and describes some of the methods for solving them that have been incorporated into the IBM Large Vocabulary Continuous Speech Recognition System. ** Title: Transcribing Broadcast News Shows Authors: Jean-Luc Gauvain, LIMSI Gilles Adda, LIMSI Lori Lamel, LIMSI Martine Adda-Decker, LIMSI Volume: 2, Page: 715 Abstract: While significant improvements have been made over the last 5 years in large vocabulary continuous speech recognition of large read-speech corpora such as the ARPA Wall Street Journal-based CSR corpus (WSJ) for American English and the BREF corpus for French, these tasks remain relatively artificial. In this paper we report on our development work in moving from laboratory read speech data to real-world speech data in order to build a system for the new ARPA broadcast news transcription task. The LIMSI Nov96 speech recognizer makes use of continuous density HMMs with Gaussian mixture for acoustic modeling and n-gram statistics estimated on newspaper texts. The acoustic models are trained on the WSJ0/WSJ1, and adapted using MAP estimation with task-specific training data. The overall word error on the Nov96 partitioned evaluation test was 27.1%. ** Title: Broadcast News Transcription Using HTK Authors: Philip C. Woodland, University of Cambridge Mark J.F. Gales, University of Cambridge David Pye, University of Cambridge Steve J. Young, University of Cambridge Volume: 2, Page: 719 Abstract: This paper examines the issues in extending a large vocabulary speech recognition system designed for clean and noisy read speech tasks to handle broadcast news transcription. 
Results using the 1995 DARPA H4 evaluation data set are presented for different front-end analyses and use of unsupervised model adaptation using maximum likelihood linear regression (MLLR). The HTK system for the 1996 H4 evaluation is then described. It includes a number of new features over previous HTK large vocabulary systems including decoder-guided segmentation, segment clustering, cache-based language modelling, and combined MAP and MLLR adaptation. The system runs in multiple passes through the data and the detailed results of each pass are given. ** Title: Transcription of Broadcast Television and Radio News: The 1996 Abbot System Authors: C.D. Cook, Cambridge University D.J. Kershaw, Cambridge University J.D.M. Christie, Cambridge University C.W. Seymour, Cambridge University S.R. Waterhouse, Cambridge University Volume: 2, Page: 723 Abstract: This paper describes the development of the CU-CON system which participated in the 1996 ARPA Hub 4 Evaluations. The system is based on ABBOT, a hybrid connectionist-HMM large vocabulary continuous speech recognition system developed at the Cambridge University Engineering Department. The Hub 4 Evaluation task involves the transcription of broadcast television and radio news programmes. This is an extremely demanding task for state-of-the-art speech recognition systems. Typical programmes include a wide variety of speaking styles and acoustic conditions. These range from read speech recorded in the studio to extemporaneous speech recorded over telephone channels. ** Title: Improved Topic Discrimination of Broadcast News Using a Model of Multiple Simultaneous Topics Authors: Toru Imai, NHK Richard Schwartz, BBN Francis Kubala, BBN Long Nguyen, BBN Volume: 2, Page: 727 Abstract: This paper presents a new method of topic spotting that attempts to retrieve detailed multiple simultaneous topics from broadcast news stories, each of which has about four different topics out of several thousand different topics. 
A new topic model uses a simple HMM where each state of the HMM represents one topic and the topic state emits topic dependent keywords probabilistically. The model allows (unobserved) transitions among topics, word by word. These characteristics improve the discriminative ability between keywords and general words in a topic model and decrease the probabilistic overlap among the topic models more than the conventional topic models (such as a simple multinomial probability model). In addition, the model is not confused by words from multiple topics within one story. We applied the new method to topic spotting from manually transcribed texts of news shows. The new method showed better results in precision and recall rates than the conventional method. ** Title: Enhanced Full Rate Speech Codec For IS-136 Digital Cellular System Authors: Tero Honkanen, NRC Janne Vainio, NRC Kari Jarvinen, NRC Petri Haavisto, NRC Redwan Salami, USH Claude Laflamme, USH Jean-Pierre Adoul, USH Volume: 2, Page: 731 Abstract: In this paper, we describe the enhanced full rate (EFR) speech codec that has recently been standardised for the North American TDMA digital cellular system (IS-136). The EFR codec, specified in the IS-641 standard, has been jointly developed by Nokia and University of Sherbrooke. The codec consists of 7.4 kbit/s speech (source) coding and 5.6 kbit/s channel coding (error protection) resulting in a 13.0 kbit/s gross bit-rate in the channel. Speech coding is based on the ACELP algorithm (Algebraic Code Excited Linear Prediction). The codec offers speech quality close to that of wireline telephony (G.726 32 kbit/s ADPCM used as a wireline reference) and provides a substantial improvement over the quality of the current speech channel. The improved speech quality is not only achieved in error-free conditions, but also in typical cellular operating conditions including transmission errors, environmental noise, and tandeming of speech codecs. 
** Title: A CELP Variable Rate Speech Codec with Low Average Rate Authors: Lei Zhang, SFU Tian Wang, SFU Vladimir Cuperman, SFU Volume: 2, Page: 735 Abstract: This paper presents a variable-rate CELP codec which achieves good communications speech quality at an average rate of about 3 kb/s. The codec operates as a source-controlled variable rate coder with rates of 4.9 kb/s for voiced and transition sounds, 3.0 kb/s for unvoiced sounds and 670 b/s for silent frames. New techniques used in the codec include prediction of the fixed codebook target vector and joint optimization of the adaptive and fixed codebook search. The prediction of the fixed codebook target vector is based on fixed codebook selections in previous subframes and a running estimate for the fundamental frequency. Informal subjective testing (MOS) indicates that the proposed codec, at an average rate of less than 3.2 kb/s, achieves better quality than fixed rate standard codecs with rates in the range 4-4.8 kb/s. ** Title: HCELP: Low bit rate speech coder for voice storage applications Authors: Mustapha Bouraoui, SGS-THOMSON Microelectronics Francois Bill Druilhe, SGS-THOMSON Microelectronics Gang Feng, ICP Volume: 2, Page: 739 Abstract: There is an increasing need for low cost, fully integrated digital phone systems including telephony functions, fax, hands-free and answering machines. For the latter feature, a high quality, low bit rate speech coder is recommended. It should require only a reasonable complexity to stay competitive in this product range. Recent advances in CELP speech coding have shown the feasibility of this concept for this kind of consumer application. A 4.8 kbps Hamming Code Excited Linear Prediction (HCELP) coder is proposed in this paper with an algebraic structure for the codebook. It features a very fast search algorithm which has been evaluated to be 3 times faster than usual algebraic codebook search procedures. Quality evaluation yielded satisfactory results.
Implementation aspects and the integration of the coder in an Advanced Telephone Set are also detailed. ** Title: Low-rate CELP Speech Coding Using an Improved Weighting Function Authors: Chul-Hong Kwon, Digicom Chong-Kwan Un, KAIST Volume: 2, Page: 743 Abstract: Below 4.8 kbits/s, CELP coders in general suffer from two kinds of perceptually important degradation. One is noise between adjacent harmonics of output speech - inter-harmonic noise - which results in roughness in voiced sound. The other is poor reproduction of the speech signal at high frequencies - high frequency mismatch. To remedy these degradations, we propose in this paper an improved weighting function which utilizes the spectral weighting methodology and also takes into account the periodic character of voiced sound. The function can adapt to variation of pitch by itself without any pitch estimation in voiced sound; it is also applicable to all speech segments without any voiced/unvoiced discrimination algorithm. Simulation results show that the performance of the CELP coder with the proposed weighting function is better than that of the conventional CELP coder. ** Title: Toll Quality Variable-Rate Speech Codec Authors: Pasi Ojala, NRC Volume: 2, Page: 747 Abstract: This paper presents a source controlled variable-rate CELP type speech codec. First, a voice activity detection block distinguishes active speech frames from silence and background noise. The active speech is further classified into voiced and unvoiced frames. The voiced frames have variable bit-rate pitch-lag quantization based on the characteristics of the speech, whereas the unvoiced frames are coded without pitch information. A variable bit-rate fixed codebook excitation with a variable number of excitation pulses is determined for each speech frame. The performance of the linear analysis part of the codec as well as the input speech characteristics determine the excitation bit-rate.
The average bit-rate of the codec is around 7.0 kbit/s for active speech, and the overall bit-rate ranges from 0 to 7.85 kbit/s. The described variable-rate codec produces toll quality speech equal to that of the 32 kbit/s ADPCM (G.726) standard. ** Title: A Variable-Rate Multimodal Speech Coder with Gain-Matched Analysis-by-Synthesis Authors: Erdal Paksoy, Texas Instruments Alan V. McCree, Texas Instruments Vishu Viswanathan, Texas Instruments Volume: 2, Page: 751 Abstract: In general, a variable rate coder can obtain the same speech quality as a fixed rate coder, while reducing the average bit rate. We have developed a variable-rate multimodal speech coder with an average bit rate of 3 kb/s for a speech activity factor of 80% and quality comparable to the GSM full rate coder. The coder has four coding modes and uses a robust classification method involving the pitch gain, zero crossings, and a peakiness measure. The coder also employs a novel gain-matched analysis-by-synthesis technique for very low rate coding of unvoiced frames and an improved noise-level-dependent postfilter. This paper describes the details of our algorithm and presents the results from subjective listening tests. ** Title: Design of a Toll-Quality 4-kbit/s Speech Coder Based on Phase-Adaptive PSI-CELP Authors: Kazunori Mano, NTT Human Interface Labs. Volume: 2, Page: 755 Abstract: This paper describes the design of a toll-quality 4-kbit/s speech coder based on phase-adaptive PSI-CELP. This adaptation method not only gives pitch periodicity to the random excitation but also synchronizes the basic point of the stored random vector with the pitch phase. We further improve the proposed coder by introducing a backward gain prediction scheme. In subjective evaluation experiments, there is no significant difference between the quality of the ITU-T G.726 32-kbit/s coder and that of the proposed 4-kbit/s coder under the conditions of normal and low input levels, tandem connection for clean speech.
In noisy environments, there are also no significant differences between the G.726 and 4-kbit/s coders in the MOS results of the ACR test. ** Title: A High-Quality BI-CELP Speech Coder At 8 KBit/s And Below Authors: Soon Y. Kwon, TNI Hochong Park, SEC Hyokang Chang, ComBasis Volume: 2, Page: 759 Abstract: This paper describes "BI-CELP: baseline and implied CELP," which is a high quality speech coding method based on a code excited linear prediction (CELP) model employing excitation vectors combined from two codebooks, one from the baseline codebook and the other from the implied codebook. In this method the index of the baseline codebook is coded and transmitted to the receiver while the index of the implied codebook is extracted from the synthesized speech. This method has been applied to a lower rate voice coder at 8 Kbit/s to produce high quality voice comparable to that of the 16 Kbit/s G.728 LD-CELP. The performance of the 8 Kbit/s BI-CELP coder is measured in terms of SNRseg and MOS. The average SNRseg is 12.14 dB which is 0.6 dB higher than that of the 8 Kbit/s G.729 CS-ACELP. The MOS for the quiet input is 3.8 which is 0.02 higher than that of G.729 CS-ACELP. The BI-CELP algorithm is implemented in real-time on a single TMS320C31 with 27 MIPS of CPU. ** Title: Low Complexity VQ for Multi-tap Pitch Predictor Coding Authors: Jayesh Patel, DSPSE Volume: 2, Page: 763 Abstract: Pitch predictors are successfully used in Linear Prediction Analysis-by-Synthesis (LPAS) coders to model periodicity in speech. The various structures of pitch predictors are investigated and used in LPAS coders. In most low bit-rate LPAS coder designs, single-tap or three-tap pitch predictors are commonly used. Higher prediction gain can be achieved by using additional taps. A 5-tap pitch predictor is rarely used in low bit-rate speech coders because of the high complexity and bandwidth requirement in encoding additional tap gains.
This paper describes a technique for reducing the complexity and bandwidth requirement of a 5-tap pitch predictor. ** Title: A 4 kbit/s Renewal Code Excited Linear Prediction Speech Coder Authors: Hong Kook Kim, SAIT Yong Duk Cho, SAIT Moo Young Kim, SAIT Sang Ryong Kim, SAIT Volume: 2, Page: 767 Abstract: This paper proposes a new 4 kbit/s speech coder based on CELP structure with 45 ms total codec delay. The coder's main features are the renewal codebook of the excitation signal and the linked split-vector quantizer of LSPs, which enable the coder to get high quality speech at low bit rate. In addition, techniques of formant enhancement in the spectral envelope and harmonic recovery in the transient region are also introduced to reduce buzzy and hoarse sounds, respectively. In an intensive listening test with intermediate reference system (IRS) speech, we obtained subjective quality comparable to 32 kbit/s ADPCM (ITU Recommendation G.726) under nominal speech input level of -26 dB overload. ** Title: GSM Enhanced Full Rate Speech Codec Authors: Kari Jarvinen, NRC Janne Vainio, NRC Pekka Kapanen, NRC Tero Honkanen, NRC Petri Haavisto, NRC Redwan Salami, USH Claude Laflamme, USH Jean-Pierre Adoul, USH Volume: 2, Page: 771 Abstract: This paper describes the GSM enhanced full rate (EFR) speech codec that has been standardised for the GSM mobile communication system. The GSM EFR codec has been jointly developed by Nokia and University of Sherbrooke. It provides speech quality at least equivalent to that of a wireline telephony reference (32 kbit/s ADPCM). The EFR codec uses 12.2 kbit/s for speech coding and 10.6 kbit/s for error protection. Speech coding is based on the ACELP algorithm (Algebraic Code Excited Linear Prediction). The codec provides substantial quality improvement compared to the existing GSM full rate and half rate codecs.
The old GSM codecs lag behind wireline quality even in error-free channel conditions, while the EFR codec provides wireline quality not only for error-free conditions but also for the most typical error conditions. With the EFR codec, wireline quality is also sustained in the presence of background noise and in tandem connections (mobile to mobile calls). ** Title: Description of ITU-T Recommendation G.729 Annex A: Reduced Complexity 8 kbit/s CS-ACELP Codec Authors: Redwan Salami, University of Sherbrooke Claude Laflamme, University of Sherbrooke Bruno Bessette, University of Sherbrooke Jean-Pierre Adoul, University of Sherbrooke Volume: 2, Page: 775 Abstract: This paper describes the recently adopted ITU-T Recommendation G.729 Annex A (G.729A) for encoding speech signals at 8 kbit/s with low complexity. G.729A has been selected as the standard speech coding algorithm for multimedia digital simultaneous voice and data (DSVD). G.729A is bitstream interoperable with G.729; i.e., speech coded with G.729A can be decoded with G.729, and vice versa. Like G.729, it uses the CS-ACELP algorithm with 10 ms frames. However, several algorithmic changes have been introduced into G.729 which resulted in a 50% drop in its complexity, enabling a DSP implementation with a complexity of about 10-12 MIPS. This paper describes the algorithmic changes which have been introduced in order to achieve the low complexity goal while meeting the terms of reference. Subjective tests have been performed by ITU-T in both the selection phase and the characterization phase and the results showed that the performance of G.729A is equivalent to both G.729 and G.726 at 32 kbit/s in most operating conditions; however, it is slightly worse in case of three tandems and in the presence of background noise. A breakdown of the complexities of both G.729 and G.729A is given at the end of the paper.
** Title: Semantic Clustering for Adaptive Language Modeling Authors: Reinhard Kneser, Philips Research Jochen Peters, Philips Research Volume: 2, Page: 779 Abstract: In this paper we present efficient clustering algorithms for two novel class-based approaches to adaptive language modeling. In contrast to bigram and trigram class models, the proposed classes are related to the distribution and co-occurrence of words within complete text units and are thus mostly of a semantic nature. We introduce adaptation techniques such as the adaptive linear interpolation and an approximation to the minimum discriminant estimation and show how to use the automatically derived semantic structure in order to allow a fast adaptation to some special topic or style. In experiments performed on the Wall-Street-Journal corpus, intuitively convincing semantic classes were obtained. The resulting adaptive language models were significantly better than a standard cache model. Compared to a static model a reduction in perplexity of up to 31% could be achieved. ** Title: Task adaptation using MAP estimation in N-gram language modeling. Authors: Hirokazu Masataki, ATR Yoshinori Sagisaka, ATR Kazuya Hisaki, Kyoto University Tatsuya Kawahara, Kyoto University Volume: 2, Page: 783 Abstract: This paper describes a method of task adaptation in N-gram language modeling, for accurately estimating the N-gram statistics from the small amount of data of the target task. Assuming a task-independent N-gram to be a-priori knowledge, the N-gram is adapted to a target task by MAP (maximum a-posteriori probability) estimation. Experimental results showed that the perplexities of the task adapted models were 15% (trigram), 24% (bigram) lower than those of the task-independent model, and that the perplexity reduction of the adaptation went up to 39% at maximum when the amount of text data in the adapted task was very small.
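The MAP task-adaptation idea in the abstract above can be sketched for bigrams as follows. This is a minimal illustration assuming a Dirichlet prior centred on the task-independent model, which gives the estimate (c(h,w) + tau * P_prior(w|h)) / (c(h) + tau); the abstract does not state the paper's exact formulation, and the function names and the prior weight `tau` are hypothetical.

```python
from collections import Counter


def map_adapted_bigram(task_tokens, prior_prob, tau=10.0):
    """Return prob(prev, word) adapted to a small task corpus.

    prior_prob(prev, word) is the task-independent bigram probability and
    tau is an assumed prior weight (larger tau trusts the prior more).
    """
    pair_counts = Counter(zip(task_tokens, task_tokens[1:]))
    hist_counts = Counter(task_tokens[:-1])

    def prob(prev, word):
        # Task counts are smoothed towards the prior in proportion to tau.
        num = pair_counts[(prev, word)] + tau * prior_prob(prev, word)
        den = hist_counts[prev] + tau
        return num / den

    return prob
```

With very little task data the estimate stays close to the prior, and for a history never seen in the task corpus it equals the prior exactly, which is the behaviour one would want when the adaptation text is small.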
** Title: Distant Bigram Language Modelling Using Maximum Entropy Authors: Michael Simons, RWTH Aachen Hermann Ney, RWTH Aachen Sven C. Martin, RWTH Aachen Volume: 2, Page: 787 Abstract: In this paper, we apply the maximum entropy approach to so-called distant bigram language modelling. In addition to the usual unigram and bigram dependencies, we use distant bigram dependencies, where the immediate predecessor word of the word position under consideration is skipped. The contributions of this paper are: (1) We analyze the computational complexity of the resulting training algorithm, i.e. the generalized iterative scaling (GIS) algorithm, and study the details of its implementation. (2) We describe a method for handling unseen events in the maximum entropy approach; this is achieved by discounting the frequencies of observed events. (3) We study the effect of this discounting operation on the convergence of the GIS algorithm. (4) We give experimental perplexity results for a corpus from the WSJ task. By using the maximum entropy approach and the distant bigram dependencies, we are able to reduce the perplexity from 205.4 for our best conventional bigram model to 169.5. ** Title: Nonuniform Markov Models Authors: Eric Sven Ristad, Princeton University Robert G. Thomas, Princeton University Volume: 2, Page: 791 Abstract: We propose a new way to model conditional independence in Markov models. The central feature of our nonuniform Markov model is that it makes predictions of varying lengths using contexts of varying lengths. Experiments on the Wall Street Journal reveal that the nonuniform model performs slightly better than the classic interpolated Markov model of Jelinek and Mercer (1980). This result is somewhat remarkable because both models contain identical numbers of parameters whose values are estimated in a similar manner. The only difference between the two models is how they combine the statistics of longer and shorter strings. 
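For orientation, the interpolated Markov model used as the baseline in the abstract above can be sketched at bigram order: the bigram relative frequency is linearly mixed with the unigram relative frequency. This is an illustrative sketch only; the fixed weight `lam` is an assumption (such weights are normally estimated on held-out data), and the nonuniform model proposed in the paper is not implemented here.

```python
from collections import Counter


def interpolated_bigram(tokens, lam=0.7):
    """Return prob(prev, word) = lam * P_ML(word | prev) + (1 - lam) * P_ML(word)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    histories = Counter(tokens[:-1])
    total = len(tokens)

    def prob(prev, word):
        p_uni = unigrams[word] / total
        if histories[prev] == 0:
            # Unseen history: fall back to the unigram estimate alone.
            return p_uni
        p_bi = bigrams[(prev, word)] / histories[prev]
        return lam * p_bi + (1 - lam) * p_uni

    return prob
```

The mixture remains a proper distribution over the training vocabulary for any `lam` between 0 and 1, since both components sum to one.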
** Title: Modelling word-pair relations in a category-based language model Authors: Thomas Niesler, University of Cambridge P.C. Woodland, University of Cambridge Volume: 2, Page: 795 Abstract: A new technique for modelling word occurrence correlations within a word-category based language model is presented. Empirical observations indicate that the conditional probability of a word given its category, rather than maintaining the constant value normally assumed, exhibits an exponential decay towards a constant as a function of an appropriately defined measure of separation between the correlated words. Consequently a functional dependence of the probability upon this separation is postulated, and methods for determining both the related word pairs as well as the function parameters are developed. Experiments using the LOB, Switchboard and Wall Street Journal corpora indicate that this formulation captures the transient nature of the conditional probability effectively, and leads to reductions in perplexity of between 8 and 22%, where the largest improvements are delivered by correlations of words with themselves (self-triggers), and the reductions increase with the size of the training corpus. ** Title: Language Model Adaptation using mixtures and an exponentially decaying cache Authors: Philip Clarkson, Cambridge University Anthony J. Robinson, Cambridge University Volume: 2, Page: 799 Abstract: This paper presents two techniques for language model adaptation. The first is based on the use of mixtures of language models: the training text is partitioned according to topic, a language model is constructed for each component, and at recognition time appropriate weightings are assigned to each component to model the observed style of language. The second technique is based on augmenting the standard trigram model with a cache component in which words' recurrence probabilities decay exponentially over time. 
Both techniques yield a significant reduction in perplexity over the baseline trigram language model when faced with multi-domain test text, the mixture-based model giving a 24% reduction and the cache-based model giving a 14% reduction. The two techniques attack the problem of adaptation at different scales, and as a result can be used in parallel to give a total perplexity reduction of 30%. ** Title: Confidence-driven Estimator Perturbation: BMPC Authors: Stefan Besling, Philips Research Hans-Gunter Meier, Fachhochschule Dusseldorf Volume: 2, Page: 803 Abstract: In most practical applications of speech recognition, the acceptance and performance of the system depends strongly on its capability to adapt to the special speaker characteristics. Restricted to the problem of language model adaptation, one has to find an efficient way to combine a typically well-trained a priori estimator for a domain with a regularly updated but undertrained estimator reflecting the actual speaker-specific data so far. In this paper we present a new language model estimation technique that makes explicit use of the confidence in estimates obtained on the (typically small) adaptation or training data. Mathematically it attempts to perturb a given reliable a priori distribution in such a way that it fits into the confidence regions given by the training material. Experiments performed on real-life data supplied by US radiologists indicate that the method could improve standard adaptation techniques like linear interpolation. ** Title: Domain Adaptation With Clustered Language Models Authors: Joerg Peter Ueberla, Forum Technology - DRA Malvern Volume: 2, Page: 807 Abstract: In this paper, a method of domain adaptation for clustered language models is developed. It is based on a previously developed clustering algorithm, but with a modified optimisation criterion. 
The results are shown to be slightly superior to the previously published 'Fillup' method, which can be used to adapt standard n-gram models. However, the improvement both methods give compared to models built from scratch on the adaptation data is quite small (less than 11% relative improvement in word error rate). This suggests that both methods are still unsatisfactory from a practical point of view. ** Title: Improving Parsing of Spontaneous Speech with the Help of Prosodic Boundaries Authors: Ralf Kompe, University of Erlangen Andreas Kiessling, University of Erlangen Heinrich Niemann, University of Erlangen Elmar Noth, University of Erlangen Anton Batliner, L.M.-Univ. Munchen Stefanie Schachtl, Siemens Tobias Ruland, Siemens Hans Ulrich Block, Siemens Volume: 2, Page: 811 Abstract: Parsing can be improved in automatic speech understanding if prosodic boundaries are taken into account, because syntactic boundaries are often marked prosodically. Since large databases are needed for the training of statistical models, we developed a labeling scheme for syntactic-prosodic boundaries within the German VERBMOBIL speech-to-speech translation project. We compare the results of classifiers (multi-layer perceptrons and language models) trained on these labels with results for perceptual and syntactic labels. Recognition rates of up to 96% were achieved. The turns consist of 20 words on the average and frequently contain sequences of partial sentence equivalents (restarts, ellipsis). The boundary scores computed by our classifiers were successfully integrated into the syntactic parsing of word graphs; currently, they improve the parse time by 92% and reduce the number of parse trees by 96%. This is achieved by introducing a special Prosodic Syntactic Clause Boundary symbol into our grammar and by guiding the search for the best word chain with the boundary scores. 
** Title: Specialized Language Models using Dialogue Predictions Authors: Cosmin Popovici, ICI Paolo Baggia, CSELT Volume: 2, Page: 815 Abstract: This paper analyses language modeling in spoken dialogue systems for accessing a database. The use of several language models obtained by exploiting dialogue predictions gives better results than the use of a single model for the whole dialogue interaction. For this reason several models have been created, each one for a specific system question, such as the request or the confirmation of a parameter. The use of dialogue-dependent language models increases the performance both at the recognition and at the understanding level, especially on answers to system requests. Moreover using other methods to increase performance, like automatic clustering of vocabulary words or the use of better acoustic models during recognition, does not affect the improvements given by dialogue-dependent language models. The system used in our experiments is Dialogos, the Italian spoken dialogue system used for accessing railway timetable information over the telephone. The experiments were carried out on a large corpus of dialogues collected using Dialogos. ** Title: K-TLSS(S) Language Models for Speech Recognition Authors: German Bordel, UPV/EHU, Bilbao Amparo Varona, UPV/EHU, Bilbao Volume: 2, Page: 819 Abstract: The class of K-Testable Languages in the Strict Sense (K-TLSS) is a subclass of regular languages. Previous works demonstrate that stochastic K-TLSS language models describe the same probability distribution as N-gram models, and that smoothing techniques can be efficiently applied (Back-off like methods). Once we have a set of k-TLSS models (k=1... K) and a smoothing technique that specifically fits in them, here we propose an integration into a unique self-contained model (the K-TLSS(S)) which embeds the smoothing within the topology allowing extremely simple parsing procedures. 
To build this model we designed a more general syntactic mechanism that we call Stochastic Deterministic Finite State Automaton with Recursive Transitions. The topology of the new models (K-TLSS(S)) allows an easy pruning procedure. Pruned K-TLSS(S) models give probability distributions that are equivalent to Variable N-gram models. Experimental results indicate that the effect of a small amount of pruning is always positive. ** Title: Language Model Adaptation For Conversational Speech Recognition Using Automatically Tagged Pseudo-Morphological Classes Authors: Carlos Crespo, TID Daniel Tapias, TID Gregorio Escalada, TID Jorge Alvarez, TID Volume: 2, Page: 823 Abstract: Statistical language models provide a powerful tool to model natural spoken language. Nevertheless, a large set of training sentences is required to reliably estimate the model parameters. In this paper we present a method to estimate n-gram probabilities from sparse data. The proposed language modeling strategy allows a generic language model (LM) to be adapted to a new semantic domain with just a few hundred sentences. This reduced set of sentences is automatically tagged with eighty different pseudo-morphological labels, and then a word-bigram LM is derived from them. Finally, this target domain word-bigram LM is interpolated with a generic backoff word-bigram LM, which was estimated using a large text database. This strategy reduces the word error rate on the SPATIS (Spanish ATIS) task by 27%. ** Title: Model Adaptation based on HMM decomposition for Reverberant Speech Recognition Authors: Tetsuya Takiguchi, NAIST Satoshi Nakamura, NAIST Kiyohiro Shikano, NAIST Qiang Huo, ATR ITL Volume: 2, Page: 827 Abstract: The performance of a speech recognizer degrades drastically in reverberant environments. We previously proposed a novel algorithm which can model an observation signal by composition of HMMs of clean speech, noise and an acoustic transfer function. 
However, how to estimate HMM parameters of the acoustic transfer function is a remaining serious problem. In our previous paper, we measured real impulse responses of training positions in an experiment room. It is inconvenient and unrealistic to measure impulse responses for every possible new experiment room. This paper presents a new method to estimate HMM parameters of the acoustic transfer function from some adaptation data by using an HMM decomposition algorithm. Its effectiveness is confirmed by a series of speaker dependent and independent word recognition experiments on simulated distant-talking speech data. ** Title: Model Compensation for Noises in Training and Test Data Authors: Driss Matrouf, LIMSI Jean-Luc Gauvain, LIMSI Volume: 2, Page: 831 Abstract: It is well known that the performances of speech recognition systems degrade rapidly as the mismatch between the training and test conditions increases. Approaches to compensate for this mismatch generally assume that the training data is noise-free, and the test data is noisy. In practice, this assumption is seldom correct. In this paper, we propose an iterative technique to compensate for noises in both the training and test data. The adopted approach compensates the speech model parameters using the noise present in the test data, and compensates the test data frames using the noise present in the training data. The training and test data are assumed to come from different and unknown microphones and acoustic environments. The interest of such a compensation scheme has been assessed on the MASK task using a continuous density HMM-based speech recognizer. Experimental results show the advantage of compensating for both test and training noises. 
** Title: Jacobian Approach to Fast Acoustic Model Adaptation Authors: Shigeki Sagayama, NTT HI Labs Yoshikazu Yamaguchi, NTT HI Labs Satoshi Takahashi, NTT HI Labs Jun-ichi Takahashi, NTT HI Labs Volume: 2, Page: 835 Abstract: This paper describes a Jacobian approach to fast adaptation of acoustic models to noisy environments. Acoustic models under a noise assumption are compensated by Jacobian matrices with the difference between assumed and observed noise cepstra. Detailed mathematical formulation and algorithm derivation are presented. Experiments showed that when a small amount of training data is given, this approach outperforms the existing approaches (such as PMC and NOVO) for composing a model from speech and noise models. It drastically reduces computational cost by replacing the complicated computation of model composition by simple matrix arithmetic and enables real-time environmental noise adaptation. Combination with spectrum subtraction is also discussed. ** Title: A unified maximum likelihood approach to acoustic mismatch compensation: Application to noisy Lombard speech recognition Authors: Mohamed Afify, CRIN/CNRS-INRIA-Lorraine Yifan Gong, Speech research,Texas Instruments Jean-Paul Haton, CRIN/CNRS-INRIA-Lorraine Volume: 2, Page: 839 Abstract: In the context of continuous density hidden Markov model (CDHMM) we present a unified maximum likelihood (ML) approach to acoustic mismatch compensation. This is achieved by introducing additive Gaussian biases at the state level in both the mel cepstral and linear spectral domains. Flexible modelling of different mismatch effects can be obtained through appropriate bias tying. A maximum likelihood approach for joint estimation of both mel cepstral and linear spectral biases from the observed mismatched speech given only one set of clean speech models is presented, where the obtained bias estimates are used for the compensation of clean speech models during decoding. 
The proposed approach is applied to the recognition of noisy Lombard speech, and significant improvement in the word recognition rate is achieved. ** Title: Enhancement and Recognition of Noisy Speech Within an Autoregressive Hidden Markov Model Framework Using Noise Estimates from the Noisy Signal Authors: Beth T. Logan, Cambridge University Anthony J. Robinson, Cambridge University Volume: 2, Page: 843 Abstract: This paper describes a new algorithm to enhance and recognise noisy speech when only the noisy signal is available. The system uses autoregressive hidden Markov models (HMMs) to model the clean speech and noise and combines these to form a model for the noisy speech. The probability framework developed is then used to reestimate the noise models from the corrupted speech waveform and the process is repeated. Enhancement is performed using the Wiener filters formed from the final clean speech models and noise estimates. Results are presented for additive stationary Gaussian and coloured noise. ** Title: Fast speech recognition algorithm under noisy environment using modified CMS-PMC and improved IDMM+SQ Authors: Hiroki Yamamoto, Canon Inc Tetsuo Kosaka, Canon Inc Masayuki Yamada, Canon Inc Yasuhiro Komori, Canon Inc Minoru Fujita, Canon Inc Volume: 2, Page: 847 Abstract: In this paper, we describe a fast speech recognition algorithm for noisy environments. To achieve accurate and fast speech recognition in a noisy environment, a very fast recognition algorithm with a model well adapted to the noisy environment is required. First, for the model adaptation, we propose MCMS-PMC: a combination of the parallel model combination (PMC) and the modified cepstral mean subtraction (MCMS), which estimates the cepstrum mean by taking account of the additive noise. 
Then, for the fast speech recognition, we propose new techniques to create the noise-adapted scalar quantized codebook in order to introduce the MCMS-PMC into the IDMM+SQ, which we proposed at ICASSP96 as a fast speech recognition algorithm using a scalar quantization approach. Finally, the effect of the proposed method is shown through a speaker-independent telephone-bandwidth continuous speech recognition experiment. ** Title: The Effects Of Background Music On Speech Recognition Accuracy Authors: Bhiksha Raj, Carnegie Mellon University, Pittsburgh Vipul Parikh, Carnegie Mellon University, Pittsburgh Richard Stern, Carnegie Mellon University, Pittsburgh Volume: 2, Page: 851 Abstract: Recognition of broadcast data, such as TV and radio programs, is a topic of great interest. One of the problems with such data is the frequent presence of background music that degrades the performance of speech recognition systems. In this paper we examine the effects of different kinds of music on automatic speech recognition systems by comparing the effects of music with the relatively well-known effects of white noise on these systems. We also examine the extent to which compensation algorithms that have been successfully applied to noisy speech are also helpful in improving recognition accuracy for speech that is corrupted by music. It is hoped that these experimental comparisons will lead to a better understanding of how to compensate for the effects of background music. ** Title: Joint Model- and Feature-Space Optimization for Robust Speech Recognition Authors: Jenq-Neng Hwang, University of Washington Chien-Jen Wang, University of Washington Volume: 2, Page: 855 Abstract: This paper presents a maximum likelihood joint-space adaptation technique for robust speech recognition. 
In this joint-space adaptation process, the N-Best HMM inversion adapts the speech features non-parametrically, frame by frame, to compensate for the temporal deviation, while the models are transformed parametrically to capture the global characteristics of the mismatch. The proposed method compensates for the mismatch better than either single-space adaptation alone. This algorithm operates only on the given testing speech and the models; therefore no stereo or adaptation data are required. As verified by the experiments performed under different mismatch environments, the proposed method improves the performance in all the cases without degrading the performance under the matched condition. ** Title: Co-Channel Speech Separation for Robust Automatic Speech Recognition: Stability and Efficiency Authors: Kuan-Chieh Yen, University of Illinois Yunxin Zhao, University of Illinois Volume: 2, Page: 859 Abstract: A signal-separation front-end based on adaptive decorrelation filtering (ADF) was integrated with an HMM based speaker independent continuous speech recognition system for co-channel speech recognition. The ADF is improved by addressing the adaptation gain for system stability and efficiency: an upper bound of adaptation rate is derived for system stability, and an accelerated sequence of adaptation gain is introduced for system efficiency. The system was evaluated under simulated room acoustic conditions with both time-invariant and time-varying channels. It is shown that the system significantly improved the signal-to-interference ratio and the word recognition accuracy, and that the combination of the derived upper bound for adaptation rate with the accelerated adaptation gain sequence achieved the best performance for system stability and efficiency. ** Title: Missing Data Techniques for Robust Speech Recognition Authors: Martin P. Cooke, University of Sheffield Andrew C. Morris, University of Sheffield Philip D. 
Green, University of Sheffield Volume: 2, Page: 863 Abstract: In noisy listening conditions, the information available on which to base speech recognition decisions is necessarily incomplete: some spectro-temporal regions are dominated by other sources. We report on the application of a variety of techniques for missing data in speech recognition. These techniques may be based on marginal distributions or on reconstruction of missing parts of the spectrum. Application of these ideas in the Resource Management task shows performance which is robust to random removal of up to 80% of the frequency channels, but falls off rapidly with deletions which more realistically simulate masked speech. We report on a vowel classification experiment designed to isolate some of the RM problems for more detailed exploration. The results of this experiment confirm the general superiority of marginals-based schemes, demonstrate the viability of shared covariance statistics, and suggest several ways in which performance improvements on the larger task may be obtained. ** Title: Spectral Subtraction and Rasta-Filtering in Text-Dependent HMM-Based Speaker Verification Authors: Detlef Hardt, Technical University of Berlin Klaus Fellbaum, Brandenburg Technical University of Cottbus Volume: 2, Page: 867 Abstract: In real text-dependent telephone-based speaker verification systems, both additive and convolutional noise influence the error rate considerably. In this paper different procedures which make a speaker verification system more robust against noise are compared. We use either spectral subtraction in addition to the MFCC feature extraction, or only PLP and RASTA-PLP (without spectral subtraction). For spectral subtraction, two modifications were examined: one version connected ahead of the system and a second one integrated into the MFCC computation. 
The first version has the advantage that the window length can be chosen independently of that of the MFCC procedure. This led to better results. The most effective procedure for telephone speech data, however, is J-RASTA-PLP, although the estimation of the optimal J factor is difficult. At first we used a fixed J factor based on the off-line measurement of the noise power. Finally, we performed some experiments to optimize the system w ** Title: Noise Robust Speech Recognition with State Duration Constraints Authors: Kari Laurila, NRC Volume: 2, Page: 871 Abstract: In this paper, we present a method to incorporate and re-estimate state duration constraints within the Maximum Likelihood training of hidden Markov models. In the recognition phase we find the optimal state sequence fulfilling the state duration constraints obtained in the training phase. Our target is to get speaker-dependent training and recognition to perform well with a very small amount of training data in the case of mismatch between the training and testing environments. We take advantage of the fact that speakers tend to preserve their speaking style in similar situations (e.g. when speaking to a machine) and our main means to reach the target is to force similar state segmentations in the training and recognition phases. We show that with the proposed method we can substantially improve the robustness of a speech recognizer and decrease the error rates by over 93% when compared with a standard approach. ** Title: Confidence measures for spontaneous speech Authors: Thomas Schaaf, University of Karlsruhe Thomas Kemp, University of Karlsruhe Volume: 2, Page: 875 Abstract: For many practical applications of speech recognition systems, it is desirable to have an estimate of confidence for each hypothesized word, i.e. to have an estimate of which words of the output of the speech recognizer are likely to be correct and which are not reliable. 
We describe the development of the measure-of-confidence tagger JANKA, which is able to provide confidence information for the words in the output of the speech recognizer JANUS-3-SR. On a spontaneous German human-to-human database, JANKA achieves a tagging accuracy of 90% at a baseline word accuracy of 82%. ** Title: A Probabilistic Approach to Confidence Estimation and Evaluation Authors: Larry Gillick, Dragon Systems Yoshiko Ito, Dragon Systems Jonathan Young, Dragon Systems Volume: 2, Page: 879 Abstract: In this paper we propose a novel way of estimating confidences for words that are recognized by a speech recognition system, together with a natural methodology for evaluating the overall quality of those confidence estimates. Our approach is based on an interpretation of a confidence as the probability that the corresponding recognized word is correct, and makes use of generalized linear models as a means for combining various predictor scores so as to arrive at confidence estimates. Experimental results using these models are presented based on four different sources of speech data: Switchboard, Spanish and Mandarin CallHome, and Wall Street Journal. ** Title: Word-based Confidence measures as a guide for stack search in speech recognition Authors: Chalapathy Neti, IBM Research Salim Roukos, IBM Research Ellen Eide, IBM Research Volume: 2, Page: 883 Abstract: The Maximum a posteriori hypothesis is treated as the decoded truth in speech recognition. However, since the word recognition accuracy is not 100%, it is desirable to have an independent confidence measure on how good the maximum a posteriori hypothesis is relative to the spoken truth for some applications. Efforts are in progress [1,2,3] to develop such confidence measures with the intent of applying them to assessment of confidence of whole utterances, rescoring of N-best lists, etc. 
In this paper, we explore the use of word-based confidence measures to adaptively modify the hypothesis score during search in continuous speech recognition: specifically, based on the confidence of the current sequence of hypothesized words during search, the weight of its prediction is changed as a function of the confidence. Experimental results are described for ATIS and SwitchBoard tasks. About 8% relative reduction in word error is obtained for ATIS. ** Title: Neural - Network Based Measures of Confidence for Word Recognition Authors: Mitch Weintraub, SRI International Francoise Beaufays, SRI International Zeev Rivlin, SRI International Yochai Konig, SRI International Andreas Stolcke, SRI International Volume: 2, Page: 887 Abstract: This paper proposes a probabilistic framework to define and evaluate confidence measures for word recognition. We describe a novel method to combine different knowledge sources and estimate the confidence in a word hypothesis, via a neural network. We also propose a measure of the joint performance of the recognition and confidence systems. The definitions and algorithms are illustrated with results on the Switchboard Corpus. ** Title: Improving Utterance Verification Using Hierarchical Confidence Measures in Continuous Natural Numbers Recognition Authors: Javier Caminero, Telefonica I+D Luis Hernandez-Gomez, E.T.S.I. Telecomunicacion, UPM Celinda de la Torre, Telefonica I+D Cesar Martin, Telefonica I+D Volume: 2, Page: 891 Abstract: Utterance Verification (UV) is a critical function of an Automatic Speech Recognition (ASR) System working on real applications where spontaneous speech, out-of-vocabulary (OOV) words and acoustic noises are present. In this paper we present a new UV procedure with two major features: a) Confidence tests are applied to decoded string hypotheses obtained from using word and garbage models that represent OOV words and noises. 
Thus the ASR system is designed to deal with what we refer to as Word Spotting and Noise Spotting capabilities. b) The UV procedure is based on three different confidence tests, two based on acoustic measures and one founded on linguistic information, applied in a hierarchical structure. Experimental results from a real telephone application on a natural number recognition task show a 50% reduction in recognition errors with a moderate 12% rejection rate of correct utterances and a low 1.5% rate of false acceptance. ** Title: On The Influence Of Frame-Asynchronous Grammar Scoring In A CSR System Authors: Antonio J. Rubio, University of Granada Jesus E. Diaz, University of Granada Pedro Garcia, University of Granada Jose C. Segura, University of Granada Volume: 2, Page: 895 Abstract: It is usually assumed that grammar probabilities and acoustic probabilities in a Continuous Speech Recognition system have to be incorporated into the general score with different weights. This is an experimental fact and there is no generally accepted theoretical explanation. In this paper we propose an explanation for this fact, related to the way grammar scoring is incorporated in the searching procedure. In accordance with this explanation, we perform a set of experiments to test our hypothesis. We also propose a new way of introducing grammar probabilities in a tree-based vocabulary search strategy, where systems are usually bound to use the worst strategy. Applying our ideas to unigrams is rather simple. For more complex language models like bigrams we have to implement a new procedure. ** Title: A Segment-based Wordspotter Using Phonetic Filler Models Authors: Alexandros S. Manos, MIT-LCS Victor W. Zue, MIT-LCS Volume: 2, Page: 899 Abstract: A common approach to wordspotting is to augment the keyword models with "filler" models to account for non-keyword intervals. 
An alternative approach is to use a large vocabulary continuous speech recognition system (LVCSR) to produce a word string, and then search for the keywords in that string. While the latter approach typically yields higher performance, it requires costly computation and extensive training data. In this study, we develop several segment-based wordspotters in an effort to achieve performance comparable to that of the LVCSR spotter, but with only a fraction of the vocabulary. We investigate several methods to model the background, ranging from a few general models to refined phone representations. The task is to detect sixty-one keywords from continuous speech in the ATIS domain. The best performance we achieve is 91.4% Figure of Merit for the LVCSR spotter and 86.7% for a spotter using 57 phone-based filler models. ** Title: A Multi-Phase Approach For Fast Spotting Of Large Vocabulary Chinese Keywords From Mandarin Speech Using Prosodic Information Authors: Bo-Ren Bai, NTU Chiu-Yu Tseng, Academia Sinica Lin-Shan Lee, NTU Volume: 2, Page: 903 Abstract: This paper presents a multi-phase approach for fast spotting of large vocabulary Chinese keywords from a spontaneous Mandarin speech utterance using prosodic knowledge. Without searching through the whole utterance using a large number of keyword models, the multi-phase framework proposed here, including some special scoring schemes, provides very good efficiency by considering the monosyllable-based structure of Mandarin Chinese. This approach is therefore very fast due to very good boundary estimations and the deletion of most impossible syllable and keyword candidates using context independent models, and also very accurate with the carefully designed scoring processes. A task with 2611 keywords was tested here. An inclusion rate of 85.79% for the top 10 candidates is attained, at a speed requiring only 1.2 times the utterance length on a Sparc 20 workstation. 
** Title: Accurate Keyword Spotting Using Strictly Lexical Fillers Authors: Rachida El Meliani, INRS-Telecom Douglas O'Shaughnessy, INRS-Telecom Volume: 2, Page: 907 Abstract: Our goal is to design an accurate keyword spotter that can deal with any size of keyword set, since the size actually required in a wide range of applications is large (number of airports, number of names in a directory, etc.). This justifies the choice of an architecture based on a large-vocabulary continuous-speech recognizer. In a previous paper we introduced the use of strictly-lexical subword fillers for keyword spotting based on the INRS large-vocabulary continuous-speech recognizer, showing that they are, when compared to acoustic fillers, a good compromise between memory and time consumption, keyword choice freedom and task-independent training on the one hand and accuracy on the other. We propose here two new high-performance designs of individual strictly-lexical subword fillers that perform, this time, better than their acoustic counterparts while still keeping the mentioned advantages. ** Title: Failure Simulation for a Phoneme HMM Based Keyword Spotter Authors: Martin Holzapfel, Siemens Gunther Ruske, Technical University of Munich Harald Hoge, Siemens Volume: 2, Page: 911 Abstract: A basic problem in keyword spotting is the fact that the keywords themselves cannot be completely different from background speech. Therefore, false alarms arise from those parts of the keyword which are also contained in the background. The paper describes the favourable application of a model trellis which makes it possible to test individual phoneme sequences with respect to their influence on the underlying phoneme HMMs in a statistical way. It is shown that the Viterbi path is strongly affected by those partly fitting phoneme groups. The probability of occurrence of these phoneme sequences is captured by a statistical "speech model" consisting of a Markov graph having an order up to 2. 
In this way sequences of 1, 2, or 3 phonemes are considered. By combining the model trellis and the statistical speech model, the probability of false alarms can be precalculated, thus providing a useful measure for the suitability of the keyword under consideration. When the choice of keywords was optimized by this suitability measure in a practical application (spotting multicom 94.4 data), the false alarm rate could be reduced by a factor of 3.5. ** Title: Wordspotting Using a Predictive Neural Model for the Telephone Speech Corpus Authors: Suhardi Suhardi, Technical University of Berlin Klaus Fellbaum, Brandenburg Technical University of Cottbus Volume: 2, Page: 915 Abstract: We describe a wordspotting algorithm based on a predictive neural model for a telephone speech corpus. Each keyword is modeled as a whole word. For keyword detection scoring we used a minimum accumulated prediction residual. We computed empirically a threshold value for rejecting non-keyword speech in place of building non-keyword models. We tested the algorithm with the TUBTEL telephone speech corpus and compared it with other algorithms like the standard DTW-based wordspotting algorithm and the two-stage wordspotting algorithm based on a DTW and a multilayer perceptron. ** Title: Shape-Invariant Pitch and Time-Scale Modification of Speech by Variable Order Phase Interpolation Authors: Mat P. Pollard, University of Liverpool Barry M.G. Cheetham, University of Liverpool Colin C. Goodyear, University of Liverpool Mike D. Edgington, B.T. Laboratories Volume: 2, Page: 919 Abstract: To preserve the waveform shape and perceived quality of pitch and time-scale modified sinusoidally modelled voiced speech, the phases of the sinusoids used to model the glottal excitation are made to add coherently at estimated pitch pulse locations. The glottal excitation is therefore made to resemble a pseudo-periodic impulse train, a quality essential for shape-invariance. 
Conventional methods attempt to maintain the coherence once per synthesis frame by interpolating the phase through a single modified pitch pulse location, a time where all excitation phases are assumed to be integer multiples of 2π. Whilst this is adequate for small degrees of modification, the coherence is lost when the required amount of modification is increased. This paper presents a technique which is capable of better preserving the impulse-like nature of the glottal excitation whilst allowing its phases to evolve slowly through time. ** Title: A Chinese Text-to-Speech System Based on Part-of-Speech Analysis, Prosodic Modeling and Non-Uniform Units Authors: Fu-Chiang Chou, NTU Chiu-Yu Tseng, Academia Sinica Keh-Jiann Chen, Academia Sinica Lin-Shan Lee, Academia Sinica Volume: 2, Page: 923 Abstract: This paper presents a new Chinese text-to-speech system that produces very natural and intelligible synthetic Mandarin speech based on part-of-speech analysis, prosodic modeling and non-uniform units. The distinguishing features and key technology for the system can be summarized as follows: (1) A text analysis module for word identification and tagging was developed based on part-of-speech modeling and using heuristic rules to achieve very high accuracy. (2) The required prosodic parameters for the synthetic speech are derived from a two-stage procedure. The prosodic structures of the input texts are first derived from a statistical model trained on a large speech database, and the prosodic parameters are then determined according to the structures. (3) A specially designed speech segment inventory constructed with non-uniform and pitch dependent units is used to improve the fluency and intelligibility of the system. ** Title: Automatic Prosodic Modeling for Speaker and Task Adaptation in Text-to-Speech Authors: Eduardo Lopez-Gonzalo, ETSITeleco UPMadrid Jose M. Rodriguez-Garcia, ETSITeleco UPMadrid Luis Hernandez-Gomez, ETSITeleco UPMadrid Juan M. 
Villar, ETSITeleco UPMadrid Volume: 2, Page: 927 Abstract: One of the most important demands for future TTS systems is their ability to improve naturalness when embedded in a particular task or application that requires a particular speaking style for a particular speaker. In this paper, we present a new prosodic modeling procedure for improving naturalness by adapting a TTS system to a new speaker and a new speaking style. The proposed procedure is an extension of our automatic data-driven methodology presented in [1], to model both fundamental frequency and segmental duration. Automatic linguistic and acoustic analyses are performed on both a task dependent text corpus and the recorded material from the selected speaker. ** Title: Prosody Generation with a Neural Network: Weighing the importance of input parameters Authors: Gerit P. Sonntag, University of Bonn Thomas Portele, University of Bonn Barbara Heuft, Lernout and Hauspie Volume: 2, Page: 931 Abstract: As an alternative to synthesis-by-rule, the use of neural networks in speech synthesis has been successfully applied to prosody generation, yet it is not known precisely which input parameters are responsible for good results. The approach presented here tries to quantify the contribution of each input parameter. This is done first by comparing the mean errors of networks trained with only one parameter each and by looking at the performance of a group of networks where each lacks one parameter. In a second approach different networks were perceptually evaluated in a pair comparison test with synthesized stimuli. ** Title: Evaluation of a speech synthesis method for nonlinear modeling of vocal folds vibration effect Authors: Hiroshi Ohmura, ETL Kazuyo Tanaka, ETL Volume: 2, Page: 935 Abstract: In this paper, we present a new speech synthesis method for improving voice quality in parametric rule-based speech synthesis systems. 
We also describe the results of a preference test on speech wave reconstruction to confirm the performance of the proposed method. The method is based on the functional approximation of vocal tract resonance produced by nonlinear interaction between the glottis and the vocal tract. In the performance test, evaluators listen to two kinds of reconstructed speech samples: one is synthesized by the proposed method and the other by an ordinary LPC (linear predictive coding)-based method. The speech sample set used in this test contains 60 sentences uttered by four speakers. Results show that the proposed method is superior in quality. ** Title: Generation of F0 Contour using Stochastic Mapping and Vector Quantization Control Parameters Authors: Byeon Heo-Jin, KAIST Kim Yeon-Jun, KAIST Oh Yung-Hwan, KAIST Volume: 2, Page: 939 Abstract: This paper introduces an F0 contour generation method for text-to-speech synthesis using stochastic mapping and vector quantization control parameters. This model uses a new F0 contour labelling scheme based on the RFC (Rise/Fall/Connection) model, which describes F0 contour patterns with seven F0 labels and three pause labels. This paper also suggests an efficient selection method for control parameters instead of using the mean values of the control parameters. We achieved 78.06% accuracy in the F0 label prediction and 95.87% accuracy in the pause label prediction using this model. The experimental results show that synthesized speech using vector quantization control parameters is more natural than using the mean values of the feature parameters. ** Title: Spectral Normalization Employing Hidden Markov Modeling of Line Spectrum Pair Frequencies Authors: Bryan L. Pellom, Duke University John H.L. Hansen, Duke University Volume: 2, Page: 943 Abstract: This paper proposes a spectral normalization approach in which the acoustical qualities of an input speech waveform are mapped onto those of a desired neutral voice. 
Such a method can be effective in reducing the impact of speaker variability such as accent, stress, and emotion for speech recognition. In the proposed method, the transformation is performed by modeling the temporal characteristics of the Line Spectrum Pair (LSP) frequencies of the neutral voice using hidden Markov models. The overall approach is integrated into a pitch synchronous overlap and add (PSOLA) analysis/synthesis framework. The algorithm is objectively evaluated using a distance measure based on the log-likelihood of observing the input (or normalized input) speech given Gaussian mixture speaker models for both the input and desired neutral voice. Results using the Gaussian mixture model formulated criteria demonstrate consistent normalization using a 10 speaker database. ** Title: Time Domain Technique For Pitch Modification And Robust Voice Transformation Authors: Rivarol Vergin, INRS-Telecom Douglas O'Shaughnessy, INRS-Telecom Azarshid Farhat, INRS-Telecom Volume: 2, Page: 947 Abstract: Modification of speech is a subject of major interest today, with numerous applications including text to speech synthesis. The basic mechanisms behind this process often consist of pitch-scale and time-scale modifications of speech. While these techniques generally give good results, in most cases the same speaker can still be associated with both the original signal and its modified version, which limits their use in some applications where disguising voices is necessary. This paper presents an approach to increase the possibilities of speech modifications while preserving most of the speech quality of the original signal. ** Title: A New Fundamental Frequency Modification Algorithm with Transformation of Spectrum Envelope according to F0 Authors: Kimihito Tanaka, NTT HI Labs. Masanobu Abe, NTT HI Labs. 
Volume: 2, Page: 951 Abstract: This paper proposes a new speech modification algorithm which makes it possible to change the fundamental frequency (F0) while preserving high quality. One novel point of the algorithm is that the spectrum envelope is transformed according to the amount of F0 modification. Based on a codebook mapping formulation, transformation rules are generated using speech data uttered in a different F0 range. The rules have two purposes: one is transforming the spectrum envelope of the low frequency band and the other is adjusting the balance between low band power and high band power. The proposed algorithm is applied to a text-to-speech system based on waveform concatenation, and good performance is confirmed by listening tests. ** Title: Reliability Assessment And Evaluation Of Objectively Measured Descriptors For Perceptual Speaker Characterization Authors: Burhan F. Necioglu, Georgia Institute of Technology Mark A. Clements, Georgia Institute of Technology Thomas P. Barnwell, Georgia Institute of Technology Volume: 2, Page: 955 Abstract: With the more widespread use of lower bit rate speech coders, the evaluation of speaker recognizability becomes a major issue to be addressed as well as the evaluation of overall voice quality. Furthermore, subjective quality evaluation of speech coders may produce different results depending on the voice character of the speakers used in the evaluation process. It follows naturally that methods and procedures to characterize speakers perceptually must be devised. In this paper, we report on an enhanced set of objective descriptors of the speech waveform, assessing the reliability of their measurements as well as their merit in discriminating utterances from different speakers. Of the 45 measures presented, 35 have less than 10% RMS measurement error, and 25 of those have less than 5%. 
** Title: Recent Improvements On Microsoft's Trainable Text-To-Speech System - Whistler Authors: Xuedong Huang, Microsoft Alex Acero, Microsoft Hsiao-Wuen Hon, Microsoft Yun-Cheng Ju, Microsoft Jingsong Liu, Microsoft Scott Meredith, Microsoft Mike Plumpe, Microsoft Volume: 2, Page: 959 Abstract: The Whistler Text-to-Speech engine was designed so that we can automatically construct the model parameters from training data. This paper will focus on recent improvements on prosody and acoustic modeling, which are all derived through the use of probabilistic learning methods. Whistler can produce synthetic speech that sounds very natural and resembles the acoustic and prosodic characteristics of the original speaker. The underlying technologies used in Whistler can significantly facilitate the process of creating generic TTS systems for a new language, a new voice, or a new speech style. The Whistler TTS engine supports Microsoft Speech API and requires less than 3 MB of working memory. ** Title: Automatic Generation of Speech Synthesis Units Based on Closed Loop Training Authors: Takehiko Kagoshima, Toshiba R&D Center Masami Akamine, Toshiba R&D Center Volume: 2, Page: 963 Abstract: This paper proposes a new method for automatically generating speech synthesis units. A small set of synthesis units is selected from a large speech database by the proposed Closed-Loop Training method (CLT). Because CLT is based on the evaluation and minimization of the distortion caused by the synthesis process such as prosodic modification, the selected synthesis units are most suitable for synthesizers. In this paper, CLT is applied to a waveform concatenation based synthesizer, whose basic unit is CV/VC (diphone). It is shown that synthesis units can be efficiently generated by CLT from a labeled speech database with a small amount of computation. Moreover, the synthesized speech is clear and smooth even though the storage size of the waveform dictionary is small. 
** Title: Isolated Word Recognition Using the HMM Structure Selected by the Genetic Algorithm Authors: Tomio Takara, University of the Ryukyus Kazuya Higa, University of the Ryukyus Itaru Nagayama, University of the Ryukyus Volume: 2, Page: 967 Abstract: Hidden Markov models (HMMs) are widely used for automatic speech recognition because they have a powerful algorithm used in estimating the models' parameters, and achieve a high performance. Once a structure of the model is given, the model's parameters are obtained automatically by feeding training data. There is, however, no effective design method leading to an optimal structure of HMMs. In this paper, we propose a new application of a genetic algorithm to search out such an optimal structure. In this method, the left-right structures are adopted for HMMs and the likelihood is used for the fitness of the genetic algorithm. We report the results of our experiment showing the effectiveness of the genetic algorithm in automatic speech recognition. ** Title: Discrete Mixture HMM Authors: Satoshi Takahashi, NTT HI Labs. Japan Kiyoaki Aikawa, NTT HI Labs. Japan Shigeki Sagayama, NTT HI Labs. Japan Volume: 2, Page: 971 Abstract: This paper proposes a new type of acoustic model called the discrete mixture HMM (DMHMM). As large scale speech databases have been constructed for speaker-independent HMMs, continuous mixture HMMs (CMHMMs) are needed to increase the number of mixture components in order to represent complex distributions. This leads to a high computational cost for calculating output probabilities. The DMHMM represents the feature parameter space by using the mixtures of multivariate distributions in the same way as the diagonal covariance CMHMM. Instead of using Gaussian mixtures to represent feature distributions in each dimension, the DMHMM uses the mixtures of the discrete distributions based on the scalar quantization (SQ). 
Since the discrete distribution has a higher degree of freedom in terms of representation, the DMHMM is advantageous in representing the feature distributions efficiently with fewer mixture components. In isolated word recognition experiments for telephone speech, we have found that the DMHMM outper ** Title: Using Word Temporal Structure in HMM Speech Recognition Authors: Luciano Fissore, CSELT Franco Ravera, CSELT Pietro Laface, DAI - Politecnico di Torino Volume: 2, Page: 975 Abstract: Isolated word speech recognizers with fixed vocabularies are often used to provide vocal services through the telephone line. The paper illustrates a simple post processing approach that allows the hypotheses produced by an HMM recognizer to be rescored taking into account the global temporal structure of the pronounced words. Our approach does not directly rely on state/word duration modeling. It models, instead, the global time variations of the spectral features of each word and their correlation in time: two important perceptual cues that are only partially exploited by standard HMMs. Results are presented for isolated word speaker independent systems with vocabularies of different size and complexity. We show that the recognition rate improves not only for small vocabulary recognition systems such as the isolated digit one, but also for a 475 city name vocabulary used in a vocal service that provides information about the main railway connections. ** Title: Smoothness Analysis for Trajectory Features Authors: Zhihong Hu, CSLU, OGI Etienne Barnard, CSLU, OGI Volume: 2, Page: 979 Abstract: Dynamic modeling of speech is potentially a major improvement on Hidden Markov Models (HMMs). In one approach, trajectory models are used to model the dynamics of the spectrum, and are used as a basis for classification. Although some improvement has been achieved in this way, one would hope for more substantial improvements given that the independence assumption is removed. 
One reason why this was not achieved may be that the trajectory models are based on cepstral coefficients; we show that these tracks contain spurious oscillations. This suggests that these trajectory features might have a high within-class variance. We introduce a measure for evaluating the smoothness of trajectory-based features. This measure provides a method of selecting the best of a set of similar features. Formant trajectories prove to be significantly smoother than trajectories of mel scale cepstral coefficients (MFCC) by this measure, but this does not translate directly to improved performance. ** Title: Frequency-Warping and Speaker-Normalization Authors: Srinivasan Umesh, City University Hunter College, New York L. Cohen, City University Hunter College, New York D. Nelson, City University Hunter College, New York Volume: 2, Page: 983 Abstract: Recently, we have proposed the use of scale-cepstral coefficients as features in speech recognition. We have developed a corresponding frequency-warping function, such that, in the warped domain, the formant envelopes of different speakers are approximately translated versions of one another for any given vowel. These methods were motivated by a desire to achieve speaker-normalization. In this paper, we point out very interesting parallels between the various steps in computing the scale-cepstrum and those observed in computing features based on physiological models of the auditory system or psychoacoustic experiments. It may therefore be useful to have a better understanding of the need for the various signal-processing steps, which may result in the development of more robust recognizers. ** Title: Integrating Syllable Boundary Information into Speech Recognition Authors: Su-Lin Wu, ICSI / UC Berkeley Michael L. 
Shire, ICSI / UC Berkeley Steven Greenberg, ICSI / UC Berkeley Nelson Morgan, ICSI / UC Berkeley Volume: 2, Page: 987 Abstract: In this paper we examine the proposition that knowledge of the timing of syllabic onsets may be useful in improving the performance of speech recognition systems. A method of estimating the location of syllable onsets derived from the analysis of energy trajectories in critical band channels has been developed, and a syllable-based decoder has been designed and implemented that incorporates this onset information into the speech recognition process. For a small, continuous speech recognition task, the addition of artificial syllabic onset information (derived from advance knowledge of the word transcriptions) lowers the word error rate by 38%. Incorporating acoustically-derived syllabic onset information reduces the word error rate by 10% on the same task. The latter experiment has highlighted representational issues on coordinating acoustic and lexical syllabifications, a topic we are beginning to explore. ** Title: Explicit, N-best Formant Features for Vowel Classification Authors: Philipp Schmid, Oregon Graduate Inst. Etienne Barnard, Oregon Graduate Inst. Volume: 2, Page: 991 Abstract: We demonstrate the use of explicit formant features for vowel and semi-vowel classification. The formant trajectories are approximated by either three line segments or Legendre polynomials. Together with formant amplitude, formant bandwidth, pitch, and segment duration, these formant features form a compact feature representation which performs as well (71.8%) as a cepstral-based feature representation (71.6%). The combination of the formant and cepstral features improves the accuracy further to 73.4%. Additionally, we outline future experiments using our robust, N-best formant tracker. ** Title: Dual-Channel Auditory Spectrum Modeling Authors: Jayadev Billa, EE Dept. 
University of Pittsburgh Volume: 2, Page: 995 Abstract: In this paper we propose a new approach to the modeling of speech based on cues from the peripheral auditory system. Our approach attempts to incorporate the dynamic adaptation of biological auditory systems to varying sound by simplistically formulating a dual-processing strategy that treats unvoiced and voiced speech as deserving of different processing. Preliminary studies show that this approach possesses significant noise robustness. ** Title: Direct Identification vs. Correlated Models to Process Acoustic and Articulatory Informations in Automatic Speech Recognition Authors: Regine Andre-Obrecht, IRIT Bruno Jacob, IRIT Volume: 2, Page: 999 Abstract: Our work deals with the classical problem of merging heterogeneous and asynchronous parameters. It is well known that lip reading improves the speech recognition score, especially in noisy conditions; so we study more precisely the modeling of acoustic and labial parameters to propose two Automatic Speech Recognition Systems: - a Direct Identification is performed by using a classical HMM approach: no correlation between visual and acoustic parameters is assumed. - two correlated models: a master HMM and a slave HMM, process respectively the labial observations and the acoustic ones. To assess each approach, we use a segmental pre-processing. Our task is the recognition of spelled French letters, in clear and noisy (cocktail party) environments. Whatever the approach and condition, the introduction of labial features improves the performance, but the difference between the two models is not sufficient to prefer one over the other. 
** Title: Adapting PSN Recognition Models to the GSM Environment by Using Spectral Transformation Authors: Thierry Soulas, France Telecom, CNET Chafic Mokbel, France Telecom, CNET Denis Jouvet, France Telecom, CNET Jean Monne, France Telecom, CNET Volume: 2, Page: 1003 Abstract: In this work, environment adaptation is studied in order to transform PSN speaker independent isolated word HMMs to the GSM environment. LMR transformations associated with groups of HMM densities are used to adapt the densities. Both mean vectors and covariance matrices of the densities are adapted. It has been shown that a small amount of GSM data is sufficient to transform the PSN HMM in order to match the GSM environment and to achieve performance equivalent to that of an HMM trained with a large amount of GSM data. The number of groups of Gaussian densities seems to have little influence on the results. However, the minimum number of groups depends on the vocabulary size. Finally, this technique is compared to the Bayesian adaptation and the results show that similar performance can be obtained with both methods. ** Title: Integrated-multilingual speech recognition using universal phonological features in a functional speech production model Authors: Li Deng, University of Waterloo Volume: 2, Page: 1007 Abstract: An outline and general design of an integrated-multilingual speech recognizer is presented, focusing on its key novelty of cross-language portability. This recognizer extends the one described in Deng and Sun (1994) in that the overlapping features designed originally for American English are improved, generalized, and need only a slight expansion to cover Mandarin/Cantonese Chinese and Canadian French. It also enhances the recognizer of Deng and Sameti (1996) in that the object of dynamic modeling is moved from the observable acoustic domain to the hidden production-affiliated variables defined in the task-dynamic model of speech production (Saltzman and Munhall, 1989). 
Major components of the recognizer and the related training and recognition algorithms are described. ** Title: Phone Classification with Segmental Features and a Binary-Pair Partitioned Neural Network Classifier Authors: Stephen A. Zahorian, Old Dominion University Peter L. Silsbee, Old Dominion University Xihong Wang, Old Dominion University Volume: 2, Page: 1011 Abstract: This paper presents methods and experimental results for phonetic classification using 39 phone classes and the NIST recommended training and test sets for NTIMIT and TIMIT. Spectral/temporal features which represent the smoothed trajectory of FFT derived speech spectra over 300 ms intervals are used for the analysis. Classification tests are made with both a binary-pair partitioned (BPP) neural network system (one neural network for each of the 741 pairs of phones) and a single large neural network. Classification accuracy is very similar for the two types of networks, but the BPP method has the advantage of much less training time. The best results obtained (77% for TIMIT and 67.4% for NTIMIT) compare favorably to the best results reported in the literature for this task. ** Title: Smoothed N-best-based Speaker Adaptation for speech recognition Authors: Tomoko Matsui, NTT Tatsuo Matsuoka, NTT Sadaoki Furui, NTT Volume: 2, Page: 1015 Abstract: Smoothed estimation and utterance verification are introduced into the N-best-based speaker adaptation method. That method is effective even for speakers whose decodings using speaker-independent (SI) models are error-prone, that is, for speakers for whom adaptation techniques are truly needed. The smoothed estimation improves the performance for such speakers, and the utterance verification reduces the required amount of calculation. 
Performance evaluation using connected-digit (four-digit strings) recognition experiments performed over actual telephone lines showed a reduction of 36.4% in the error rates for speakers whose decodings using SI models are error-prone. To try and find an effective model-transformation for speaker adaptation, we discuss replacing mixture-mean bias estimation by the widely used mixture-mean linear-regression-matrix estimation. ** Title: A Fast Algorithm for Unsupervised Incremental Speaker Adaptation Authors: Michael Schussler, FORWISS Erlangen Florian Gallwitz, University of Erlangen Stefan Harbeck, University of Erlangen Volume: 2, Page: 1019 Abstract: Speaker adaptation algorithms often require a rather large amount of adaptation data in order to estimate the new parameters reliably. In this paper, we investigate how adaptation can be performed in real-time applications with only a few seconds of speech from each user. We propose a modified Bayesian codebook reestimation which does not need the computationally intensive evaluation of normal densities and thus speeds up the adaptation remarkably, e.g., by a factor of 18 for 24-dimensional feature vectors. We performed experiments in two real-time applications with very small amounts of adaptation data, and achieved a word error reduction of up to 11%. ** Title: Improved Estimation of Supervision in Unsupervised Speaker Adaptation Authors: Shigeru Homma, NTT HILab Kiyoaki Aikawa, NTT HILab Shigeki Sagayama, NTT HILab Volume: 2, Page: 1023 Abstract: Unsupervised speaker adaptation plays an important role in "batch dictation," the aim of which is to automatically transcribe large amounts of recorded dictation using speech recognition. In the case of unsupervised speaker adaptation which uses recognition results of target speech as the means of supervision, erroneous recognition results degrade the quality of the adapted acoustic models. This paper presents a new supervision selection method. 
By using this method, the correctness of the first candidate is judged based on the likelihood ratio of the first and the second candidates. This method eliminates erroneous recognition results and corresponding speech data from the adaptive training data. We implemented this method in the iterative unsupervised speaker adaptation procedure. It is shown that the recognition errors are drastically reduced by 50% in a practical application of batch-style speech-to-text conversion of recorded dictation of Japanese medical diagnoses compared with speaker-independent recognition. ** Title: Improved Bayesian Learning of Hidden Markov Models for Speaker Adaptation Authors: Jen Tzung Chien, National Tsing Hua University Hsiao-Chuan Wang, National Tsing Hua University Chin Hui Lee, Bell Lab Volume: 2, Page: 1027 Abstract: We propose an improved maximum a posteriori (MAP) learning algorithm of continuous-density hidden Markov model (CDHMM) parameters for speaker adaptation. The algorithm is developed by sequentially combining three adaptation approaches. First, the clusters of speaker-independent HMM parameters are locally transformed through a group of transformation functions. Then, the transformed HMM parameters are globally smoothed via the MAP adaptation. Within the MAP adaptation, the parameters of unseen units in adaptation are further adapted by employing the transfer vector interpolation scheme. Experiments show that the combined algorithm converges rapidly and outperforms the other adaptation methods. ** Title: Studies in Transformation-Based Adaptation Authors: Venkatesh Nagesha, DSI Larry Gillick, DSI Volume: 2, Page: 1031 Abstract: This paper studies the use of transformation-based speaker adaptation in improving the performance of large vocabulary continuous speech recognition systems. We present a formulation of the adaptation procedure that is simpler than existing methods. 
Our experiments demonstrate that speaker normalization continues to be important even after significant amounts of speaker adaptation. An automatic clustering algorithm is compared to human expertise in sorting output distributions into collections that share the same transformation. We quantify improvements over standard Bayesian (by maximum a posteriori or MAP) adaptation in terms of (a) speed of adaptation, and (b) robustness to transcription errors. Finally, we discuss the use of speaker transformations in the training process. ** Title: Speaker Adaptation in the Philips System for Large Vocabulary Continuous Speech Recognition Authors: Eric Thelen, Philips Research Xavier Aubert, Philips Research Peter Beyerlein, Philips Research Volume: 2, Page: 1035 Abstract: The combination of Maximum Likelihood Linear Regression (MLLR) with Maximum a posteriori (MAP) adaptation has been investigated both for the enrollment of a new speaker and for the asymptotic recognition rate after several hours of dictation. We show that a least mean square approach to MLLR is quite effective in conjunction with phonetically derived regression classes. Results are presented for both ARPA read-speech test sets and real-life dictation. Significant improvements are reported. While MLLR achieves a faster adaptation rate when only a small amount of data is available, MAP has desirable asymptotic properties and the combination of both methods provides the best results. Both incremental and iterative batch modes are studied and compared to the performance of speaker-dependent training. 
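Editorial illustration (not taken from any of the listed papers): the asymptotic behaviour of MAP adaptation mentioned in the abstract above can be seen in the textbook MAP update of a single Gaussian mean. The sketch below assumes one Gaussian, every adaptation frame assigned to it with full weight, and a hypothetical prior weight `tau`; the function name `map_adapt_mean` is likewise only illustrative.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """Textbook MAP update of one Gaussian mean (illustrative sketch only).

    prior_mean: speaker-independent mean, a list of floats.
    frames:     adaptation vectors, each a list of floats of the same length.
    tau:        hypothetical prior weight; larger values trust the prior more.
    """
    n = len(frames)
    if n == 0:
        # No adaptation data: the estimate stays at the prior mean.
        return list(prior_mean)
    dim = len(prior_mean)
    # Per-dimension sum of the adaptation frames.
    sums = [sum(frame[d] for frame in frames) for d in range(dim)]
    # Weighted combination of the prior mean and the data.
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


few = map_adapt_mean([0.0], [[1.0]] * 10, tau=10.0)     # halfway between prior and data
many = map_adapt_mean([0.0], [[1.0]] * 1000, tau=10.0)  # dominated by the data
```

With little data the estimate stays close to the prior mean, and as the number of frames grows it approaches the sample mean of the speaker's own data, which is the asymptotic property usually attributed to MAP adaptation.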
** Title: Speaker Normalization Based on Frequency Warping Authors: Puming Zhan, Interactive Systems Laboratories Martin Westphal, Interactive Systems Laboratories Volume: 2, Page: 1039 Abstract: In speech recognition, speaker-dependence of a speech recognition system comes from speaker-dependence of the speech feature, and the variation of vocal tract shape is the major source of inter-speaker variations of the speech feature, though there are some other sources which also contribute. In this paper, we address approaches to speaker normalization which aim at normalizing the speaker's vocal tract length based on Frequency WarPing (FWP). The FWP is implemented in the front-end preprocessing of our speech recognition system. We investigate the formant-based and ML-based FWP in linear and nonlinear warping modes, and compare them in detail. All experimental results are based on our JANUS3 large vocabulary continuous speech recognition system and the Spanish Spontaneous Scheduling Task database (SSST). ** Title: Speaker Adaptive Training: A Maximum Likelihood Approach to Speaker Normalization Authors: Tasos Anastasakos, BBN Corporation John W. McDonough, BBN Corporation John Makhoul, BBN Corporation Volume: 2, Page: 1043 Abstract: This paper describes the speaker adaptive training (SAT) approach for speaker independent (SI) speech recognizers as a method for joint speaker normalization and estimation of the parameters of the SI acoustic models. In SAT, speaker characteristics are modeled explicitly as linear transformations of the SI acoustic parameters. The effect of inter-speaker variability in the training data is reduced, leading to parsimonious acoustic models that represent more accurately the phonetically relevant information of the speech signal. The proposed training method is applied to the Wall Street Journal (WSJ) corpus that consists of multiple training speakers. 
Experimental results in the context of batch supervised adaptation demonstrate the effectiveness of the proposed method in large vocabulary speech recognition tasks and show that significant reductions in word error rate can be achieved over the common pooled speaker-independent paradigm. ** Title: Experiments in Speaker Normalisation and Adaptation for Large Vocabulary Speech Recognition Authors: David Pye, University of Cambridge Philip C. Woodland, University of Cambridge Volume: 2, Page: 1047 Abstract: This paper examines techniques for speaker normalisation and adaptation that are applied in training with the aim of removing some of the variability from the speaker independent models. Two techniques are examined: vocal tract normalisation (VTN) which estimates a single "vocal tract length" parameter for each speaker and then modifies the speech parameterisation accordingly and speaker adaptive training (SAT) which estimates Gaussian mean and variance parameters jointly with a speaker specific set of maximum likelihood linear regression (MLLR) based transformations. It is shown that VTN is effective for both clean speech and mismatched conditions and that the further improvements obtained by applying MLLR in testing are essentially additive. Detailed results from the use of SAT show that worthwhile improvements over using MLLR with standard speaker independent models are obtained. ** Title: Effectiveness of Speaker Normalized HMM by Projection to Speaker Subspace Authors: Yasuo Ariki, Ryukoku University Volume: 2, Page: 1051 Abstract: Conventional speaker independent HMMs ignore the speaker differences and collect speech data in an observation space. This causes a problem that probability distribution of the HMMs becomes flat, and then causes recognition errors. To solve this problem, we construct the speaker subspace for an individual speaker and project his speech data to his own subspace. 
By this method we can extract speaker independent phonetic information included in the speech data. Speaker independent HMMs can be constructed using this phonetic information. In this paper, we describe the result of phoneme recognition experiments using the speaker independent HMMs constructed by the speech data projected to the speaker subspaces. ** Title: Speaker normalization and adaptation based on linear transformation Authors: Jun Ishii, ATR Masahiro Tonomura, ATR Volume: 2, Page: 1055 Abstract: We propose novel speaker independent (SI) modeling and speaker adaptation based on a linear transformation. An SI model and speaker dependent (SD) models are usually generated using the same preprocessing of acoustic data. This straightforward preprocessing causes a serious problem. Probability distributions of the SI models become broad and the SI models do not give good initial estimates for speaker adaptation. To solve these problems, a normalized SI model is generated by removing speaker characteristics using a shift vector obtained by the maximum likelihood linear regression (MLLR) technique. In addition, we propose a speaker adaptation method that combines the MLLR and maximum a posteriori (MAP) techniques from the normalized SI model. For the baseline recognition test of the normalized SI model, a 12.8% reduction in phoneme recognition error rate compared to the conventional SI model was achieved. Furthermore, the proposed adaptation method using the normalized SI model was more effective than the tested conventional method. ** Title: Speaker-Adapted Training on the Switchboard Corpus Authors: John W. McDonough, BBN STD Tasos Anastasakos, BBN STD George Zavaliagkos, BBN STD Herbert Gish, BBN STD Volume: 2, Page: 1059 Abstract: Speaker adaptation is the process of transforming some speaker-independent acoustic model in such a way as to more closely match the characteristics of a particular speaker. 
It has been shown by several researchers to be an effective means of improving the performance of large vocabulary continuous speech recognition systems. Until very recently speaker adaptation has been used exclusively as a part of the recognition process. This is undesirable inasmuch as it leads to a mismatched condition between test and training, and hence sub-optimal recognition performance. Very recently, there has been a growing interest in applying speaker-adaptation techniques to HMM training in order to alleviate the training/test mismatch. In prior work, we presented an iterative scheme for determining the maximum likelihood solution for the set of speaker-independent means and variances when speaker-dependent adaptation is performed during HMM training. In the present work, we shall investigate specific issues encountered in applying this general framework to the task of improving recognition performance on the Switchboard Corpus. ** Title: Model Transformation for Robust Speaker Recognition from Telephone Data Authors: Francoise Beaufays, SRI International Mitch Weintraub, SRI International Volume: 2, Page: 1063 Abstract: In the context of automatic speaker recognition, we propose a model transformation technique that renders speaker models more robust to acoustic mismatches and to data scarcity by appropriately increasing their variances. We use a stereo database containing speech recorded simultaneously under different acoustic conditions to derive a synthetic variance distribution. This distribution is then used to modify the variances of other speaker models from other telephone databases. ** Title: Speaker Recognition with the Switchboard corpus Authors: Lori Lamel, LIMSI Jean-Luc Gauvain, LIMSI Volume: 2, Page: 1067 Abstract: In this paper we present our development work carried out in preparation for the March'96 speaker recognition test on the Switchboard corpus organized by NIST. The speaker verification system evaluated was a Gaussian mixture model.
We provide experimental results on the development test and evaluation test data, and some experiments carried out since the evaluation comparing the GMM with a phone-based approach. Better performance is obtained by training on data from multiple sessions, and with different handsets. High error rates are obtained even using a phone-based approach both with and without the use of orthographic transcriptions of the training data. We also describe a human perceptual test carried out on a subset of the development data, which demonstrates the difficulty human listeners had with this task. ** Title: Handset Dependent Background Models for Robust Text-Independent Speaker Recognition Authors: Larry P. Heck, Speech Technology and Research Lab. Mitch Weintraub, Speech Technology and Research Lab. Volume: 2, Page: 1071 Abstract: This paper studies the effects of handset distortion on telephone-based speaker recognition performance, resulting in the following observations: (1) the major factor in speaker recognition errors is whether the handset type (e.g., electret, carbon) is different across training and testing, not whether the telephone lines are mismatched, (2) the distribution of speaker recognition scores for true speakers is bimodal, with one mode dominated by matched handset tests and the other by mismatched handsets, (3) cohort-based normalization methods derive much of their performance gains from implicitly selecting cohorts trained with the same handset type as the claimant, and (4) utilizing a handset-dependent background model which is matched to the handset type of the claimant's training data sharpens and separates the true and false speaker score distributions. Results on the 1996 NIST Speaker Recognition Evaluation corpus show that using handset-matched background models reduces false acceptances (at a 10 % miss ** Title: Telephone Based Speaker Recognition using Multiple Binary Classifier and Gaussian Mixture Models Authors: Pierre J. 
Castellano, Queensland University of Technology Stefan Slomka, Queensland University of Technology Sridha Sridharan, Queensland University of Technology Volume: 2, Page: 1075 Abstract: The present study evaluates MBCM and GMM solutions for both ASV and ASI problems involving text-independent telephone speech from the King speech database. The MBCM's accuracy is enhanced by selectively removing those classifiers within the model which perform worst (pruning). An unpruned MBCM outperforms a GMM for ASV and speakers taken from within the same dialectic region (San Diego, CA). Once pruned, the MBCM is found to be 2.6 times more accurate than the GMM. For closed set ASI, based on the same data, the MBCM is roughly twice as accurate as the GMM but only after pruning. ** Title: Comparison of Whole Word and Subword Modeling Techniques for Speaker Verification with Limited Training Data Authors: Stephan Euler, Bosch Telecom Rainer Langlitz, Bosch Telecom Joachim Zinke, FH Friedberg Volume: 2, Page: 1079 Abstract: In this paper we use whole word and subword hidden Markov models for text dependent speaker verification. In this application usually only a small amount of training data is available for each model. In order to cope with this limitation we propose an intermediate functional representation of the training data allowing the robust initialization of the models. This new approach is tested with two databases and is compared both with standard training techniques and the dynamic time warp method. Secondly, we give results for two types of subword units. The scores of these units are combined in two different ways to obtain word error rates. ** Title: A Comparison of Model Estimation Techniques for Speaker Verification Authors: Michael J. Carey, Ensigma Ltd Eluned S. Parris, Ensigma Ltd Stephen J.
Bennett, Ensigma Ltd Harvey Lloyd-Thomas, Ensigma Ltd Volume: 2, Page: 1083 Abstract: We address the problem of building speaker dependent HMMs for a speaker verification system. A number of model building techniques are described and the comparative performance of a system using models built using these techniques is presented. Mean estimated models, models where the means of the HMMs are estimated using segmental K means but where the variances are taken from speaker independent models, outperformed other techniques for training times of 120 s to 15 s. Mean estimated models were also built with varying numbers of components in the state mixture distributions and a performance gain was again observed. The incorporation of transitional features into the system had degraded performance when the Baum-Welch algorithm was used for model estimation. However the inclusion of delta and delta-delta cepstra into the system using mean estimated models now gave a significant improvement in performance. These changes halved the equal error rate of the system to 7.8%. ** Title: Speaker Verification using Frame and Utterance Level Likelihood Normalization Authors: Seiichi Nakagawa, TUT, Toyohashi Konstantin P. Markov, TUT, Toyohashi Volume: 2, Page: 1087 Abstract: In this paper, we propose a new method, where the likelihood normalization technique is applied at both the frame and utterance levels. In this method based on Gaussian Mixture Models (GMM), every frame of the test utterance is input to the claimed and all background speaker models in parallel. In this procedure, for each frame, likelihoods from all the background models are available, hence they can be used for normalization of the claimed speaker likelihood at every frame. A special kind of likelihood normalization, called Weighting Models Rank, is also proposed. We have evaluated our method using two databases - TIMIT and NTT.
Results show that the combination of frame and utterance level likelihood normalization in some cases reduces the equal error rate (EER) by more than a factor of two. ** Title: A New Codebook Training Algorithm for VQ-based Speaker Recognition Authors: Jialong He, University of Ulm Li Liu, University of Ulm Gunther Palm, University of Ulm Volume: 2, Page: 1091 Abstract: VQ-based speaker recognition has proven to be a successful method. Usually, a codebook is trained to minimize the quantization error for the data from an individual speaker. The codebooks trained based on this criterion have weak discriminative power when used as a classifier. The LVQ algorithm can be used to globally train the VQ-based classifier. However, the correlation between the feature vectors is not taken into consideration; in consequence, a high classification rate for feature vectors does not lead to a high classification rate for the test sentences. In this paper, a heuristic training procedure is proposed to retrain the codebooks so that they give a lower classification error rate for randomly selected vector-groups. Evaluation experiments demonstrated that the codebooks trained with this method provide much higher recognition rates than those trained with the LBG algorithm alone, and often they can outperform the more powerful Gaussian mixture speaker models. ** Title: Bispectrum Features for Robust Speaker Identification Authors: Stanley Wenndt, Rome Laboratory Sanyogita Shamsunder, Colorado State University Volume: 2, Page: 1095 Abstract: Along with the spoken message, speech contains information about the identity of the speaker. Thus, the goal of speaker identification is to develop features which are unique to each speaker. This paper explores a new feature for speech and shows how it can be used for robust speaker identification. The results will be compared to the cepstrum feature due to its widespread use and success in speaker identification applications.
The cepstrum, however, has shown a lack of robustness in varying conditions, especially in a cross-condition environment where the classifier has been trained with clean data but then tested on corrupted data. Part of the bispectrum will be used as a new feature and we will demonstrate its usefulness in varying noise settings. ** Title: Speaker Identification Based Text to Audio Alignment for an Audio Retrieval System Authors: Deb K. Roy, MIT Media Lab Carl Malamud, IMS Volume: 2, Page: 1099 Abstract: We report on an audio retrieval system which lets Internet users efficiently access a large audio database containing recordings of the proceedings of the United States House of Representatives. The audio has been temporally aligned to text transcripts of the proceedings (which are manually generated by the U.S. Government) using a novel method based on speaker identification. Speaker sequence and approximate timing information are extracted from the text transcript and used to constrain a Viterbi alignment of speaker models to the observed audio. Speakers are modeled by computing Gaussian statistics of cepstral coefficients extracted from samples of each person's speech. The speaker identification is used to locate speaker transition points in the audio which are then linked to corresponding speaker transitions in the text transcript. The alignment system has been successfully integrated into a World Wide Web based search and browse system as an experimental service on the Internet. ** Title: Robust Speaker Recognition through Acoustic Array Processing and Spectral Normalization Authors: Joaquin Gonzalez-Rodriguez, Univ. Politecnica de Madrid Javier Ortega-Garcia, Univ. Politecnica de Madrid Volume: 2, Page: 1103 Abstract: The development of a robust speaker recognition system obtained through the joint use of acoustic array processing and spectral normalization as input to a Gaussian Mixture Model speaker recognition system is described in this paper.
Results obtained with these techniques have been reported previously by the authors [10], but operational problems appear if extensive testing with different configurations and testing conditions is intended. In this paper, we describe an open system that has been developed to cope with this problem. The number and geometry of the microphones, the time delay estimation method, the array processing structure and the spectral normalization technique together with the room size, noise type and SNR are some of the options that can be easily changed. It will also allow testing with real multichannel databases and any new algorithm can easily be incorporated into the system. ** Title: Providing Single and Multi-Channel Acoustical Robustness to Speaker Identification Systems Authors: Javier Ortega-Garcia, Univ. Politecnica de Madrid Joaquin Gonzalez-Rodriguez, Univ. Politecnica de Madrid Volume: 2, Page: 1107 Abstract: Acoustical mismatch between training and testing phases induces degradation of performance in automatic speaker recognition systems. Providing robustness to speaker recognizers has to be, therefore, a priority matter. Robustness in the acoustical stage can be accomplished through speech enhancement techniques as a prior stage to the recognizer. These techniques are oriented to the reduction of the impact that acoustical noise produces on the input signal. In this paper, several spectral subtraction-derived techniques are used to enhance single-channel noisy speech. Other perspectives, based on dual-channel (adaptive filtering) and multi-channel (microphone arrays) processing are also presented as optimal solutions to speech enhancement needs. A comparative analysis of the proposed techniques, with different types of noise at different SNRs, as a pre-processing stage to an ergodic HMM-based speaker recognizer, is presented. ** Title: Robust Spoken Language Identification using Large Vocabulary Speech Recognition Authors: James L.
Hieronymus, Bell Labs, Murray Hill, NJ Shubha Kadambe, Atlantic Aerospace, Greenbelt, MD Volume: 2, Page: 1111 Abstract: A robust, task independent spoken Language Identification (LID) system which uses a Large Vocabulary Continuous Speech Recognition (LVCSR) module for each language to choose the most likely language spoken is described. The acoustic analysis uses mean cepstral removal on mel scale cepstral coefficients to compensate for different input channels. The system has been trained on 5 languages: English, German, Japanese, Mandarin Chinese and Spanish using a subset of the Oregon Graduate Institute 11 language data base. The five language results show 88% correct recognition for 50 second utterances without using confidence measures and 98 % correct with confidence measures without the robust front end. The recognition rate is 81 % correct for 10 second utterances without confidence measures and 93 % correct with confidence measures without the robust front end. Adding the robust front end improves the recognition rate approximately 3 % on the short utterances and 1 % for the long utterances. The best performance has been obtained for systems trained on phonetically hand labeled speech. ** Title: Double Bigram-Decoding in Phonotactic Language Identification Authors: Jiri Navratil, Technical University of Ilmenau Werner Zuhlke, Technical University of Ilmenau Volume: 2, Page: 1115 Abstract: In this paper a phonotactic language identification system that employs a multilingual phone-recognizer with multiple language-dependent grammars to tokenize the spoken signal into several phone-streams is described. For each stream an independent set of language models is used to compute the language scores that are subsequently processed by two classification stages. Thus, the system acquires information from both the original-label and the decoded-phone statistics. 
A discriminative weighting method is applied in the second stage for better distinguishing between similar languages. A modified language-bigram model, the so-called skip-gram, that allows exploitation of a wider phonotactic context without increasing the estimation costs of a standard bigram, is introduced. Measured on the NIST'95 evaluation set, the described system outperforms the state-of-the-art phonotactic components that use multiple recognizers, and is, at the same time, less computationally expensive. ** Title: Random Walk Theory Applied to Language Identification Authors: Etienne Marcheret, RPI Michael I. Savic, RPI Volume: 2, Page: 1119 Abstract: In this paper we discuss the most recent evaluation of the RPI language identification system by the National Institute of Standards and Technology (NIST). This system is based on an acousto-phonetic approach where the phonemes present in a language are identified by a hidden semi-Markov model (HSMM). The HSMM was also developed at RPI. Knowledge of these phonemes provides us with the necessary probabilistic framework for classifier design. The classifier used in this system is designed in such a way that language specific scores generated during an evaluation form a random walk. Random walk theory has extensive applications in ecology, metallurgy, chemistry and physics. Until recently random walk theory has been primarily used as a tool for the measurement of the territory covered by a diffusing particle. We now show that random walk theory can be used to effectively design a language identification system. ** Title: Frequency Characteristics of Foreign Accented Speech Authors: Levent M. Arslan, Duke University John H.L. Hansen, Duke University Volume: 2, Page: 1123 Abstract: In this study, frequency characteristics of foreign accented speech are investigated.
Experiments are conducted to discover the relative significance of different resonant frequencies and frequency bands in terms of their accent discrimination ability. It is shown that second and third formants are more important than other resonant frequencies. A filter bank analysis of accented speech supports this statement, where the 1500-2500 Hz range was shown to be the most significant frequency range in discriminating accented speech. Based on these results, a new frequency scale is proposed in place of the commonly used Mel-scale to extract the cepstrum coefficients from the speech signal. The proposed scale results in better performance for the problems of accent classification and language identification. ** Title: A Study on Improving Decisions in Closed Set Speaker Identification Authors: Mubeccel Demirekler, ODTU Afsar Saranli, ODTU Volume: 2, Page: 1127 Abstract: In this study, closed-set, text-independent speaker identification is considered and the problem of improving the reliability of the decisions made by available algorithms is addressed. The work presented here is based on the idea of combining the evidence from different algorithms or decision strategies to improve the recognition performance and the reliability. For this purpose, the models generated by a single algorithm for 17 speakers from the SPIDRE database are considered and a matrix of speaker-to-model fitness values is processed by two different decision strategies. Ideas from the Mathematical Theory of Evidence are applied to combine the decisions produced by these two strategies to generate a better decision on the speaker identity. The combined decision shows an improved degree of correctness, hence suggesting a promising way of combining the decisions from partially successful algorithms.
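Editorial illustration for the preceding abstract: the Mathematical Theory of Evidence is commonly associated with Dempster's rule of combination, which merges two belief assignments over the same set of hypotheses. The sketch below is a generic version of that rule and is not taken from the paper; the function name `combine_evidence`, the speaker labels and the mass values are all hypothetical.

```python
from itertools import product


def combine_evidence(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of candidate speakers to a
    belief mass; the masses in each function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            # Agreeing evidence supports the intersection of the two sets.
            combined[common] = combined.get(common, 0.0) + wa * wb
        else:
            # Disjoint sets contribute to the conflict mass.
            conflict += wa * wb
    if 1.0 - conflict < 1e-12:
        raise ValueError("sources are in total conflict")
    # Renormalise so the remaining masses sum to 1.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Hypothetical outputs of two decision strategies over speakers A, B, C.
everyone = frozenset("ABC")
strategy_1 = {frozenset("A"): 0.6, everyone: 0.4}
strategy_2 = {frozenset("A"): 0.5, frozenset("B"): 0.3, everyone: 0.2}
fused = combine_evidence(strategy_1, strategy_2)
best = max(fused, key=fused.get)
```

In this toy case both strategies lean towards speaker A, so the fused mass on A ends up higher than either strategy assigned on its own.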
** Title: The Use Of Harmonic Features In Speaker Recognition Authors: Bojan Imperl, University of Maribor Zdravko Kacic, University of Maribor Bogomir Horvat, University of Maribor Volume: 2, Page: 1131 Abstract: In this paper the Harmonic features based on the harmonic decomposition of the Hildebrand-Prony line spectrum are introduced. A Hildebrand-Prony method of spectral analysis was applied because of its high resolution and accuracy. Comparative tests with the LP and LP-cepstral features were made with 50 speakers from the Slovene database SNABI (isolated words corpus) and 50 speakers of the German database BAS Siemens 100 (utterances of sentences). With both databases the advantages of the Harmonic features were noticed especially for the speaker identification while for the speaker verification the Harmonic features have performed better on the SNABI database and as good as the LP-cepstral features on the BAS Siemens 100 database. ** Title: An Approach to Speaker Identification Using Multiple Classifiers Authors: Vlasta Radova, University of West Bohemia Josef Psutka, University of West Bohemia Volume: 2, Page: 1135 Abstract: This paper addresses the speaker identification problem. The attributes representing the voice of a particular speaker are obtained from very short segments of the speech waveform corresponding only to one pitch period of vowels. The patterns formed from the samples of a pitch period waveform are either matched in the time domain by use of a nonlinear time warping method, known as dynamic time warping (DTW), or they are converted into the cepstral coefficients and compared using the cepstral distance measure. Since an uttered speech signal usually contains a lot of vowels, techniques using a combination of both various classifiers and multiple classifier outputs are considered in the decision making process. Experiments performed for one hundred speakers are described at the end of this paper.
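Editorial illustration for the preceding abstract: dynamic time warping aligns two sequences of different lengths by allowing samples to be stretched or compressed in time. The following is a minimal textbook sketch, assuming scalar samples, an absolute-difference local cost and no path constraints; the function name `dtw_distance` and the sample values are hypothetical, and the paper's own matching procedure may differ.

```python
def dtw_distance(x, y):
    """Return the accumulated cost of the best DTW alignment of x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] holds the best accumulated cost of aligning x[:i] with y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            # Allowed moves: advance in x only, in y only, or in both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


# Two hypothetical pitch-period patterns; the second repeats one sample.
reference = [1.0, 2.0, 3.0]
stretched = [1.0, 2.0, 2.0, 3.0]
distance = dtw_distance(reference, stretched)
```

Because the second pattern is only a time-stretched copy of the first, the warping path absorbs the repeated sample and the accumulated cost is zero.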
** Title: Development And Evaluation Of The ATOS Spontaneous Speech Conversational System Authors: Jorge Alvarez, TID Daniel Tapias, TID Carlos Crespo, TID Ismael Cortazar, TID Fernando Martinez, TID Volume: 2, Page: 1139 Abstract: In this paper we report our recent development work in Spanish spontaneous speech conversational systems. We describe the Automatic Telephone Operator Service (ATOS) and present the improvements introduced into it to deal with spontaneous speech, which are: (a) a task independent dialogue manager, which can be adapted to a new semantic domain by changing a configuration file. It also generates a prediction about the user's expected utterance to constrain the language model used by the speech recognizer. (b) a language modeling strategy, which allows the statistical language model to be adapted to a new task with just a few hundred sentences. This strategy reduces the word error rate by 27%. We also report the results, conclusions and the speech database collected in the evaluation of the ATOS system, which has been tested by 30 real users. ** Title: A Spoken Language System For Automated Call Routing Authors: Giuseppe Riccardi, AT&T Labs-Research Allen Gorin, AT&T Labs-Research Andrej Ljolje, AT&T Labs-Research Michael Riley, AT&T Labs-Research Volume: 2, Page: 1143 Abstract: We are interested in the problem of understanding fluently spoken language. In particular, we consider people's responses to the open-ended prompt of "How May I help you?". We then further restrict the problem to classifying and automatically routing such a call, based on the meaning of the user's response. Thus, we aim at extracting a relatively small number of semantic actions from the utterances of a very large set of users who are not trained to the system's capabilities and limitations.
In this paper, we describe the main components of our speech understanding system: the large vocabulary recognizer and the language understanding module performing the call-type classification. In particular, we propose automatic algorithms for selecting phrases from a training corpus in order to enhance the prediction power of the standard word n-gram. The phrase language models are integrated into stochastic finite state machines which outperform the standard word n-gram language model. From the speech recognizer output we recognize and exploit automatically acquired salient phrase fragments to make a call-type classification. This system is evaluated on a database of 10K fluently spoken utterances collected from interactions between users and human agents. ** Title: Dialogos: A Robust System for Human-Machine Spoken Dialogue on the Telephone Authors: Dario Albesano, CSELT Paolo Baggia, CSELT Morena Danieli, CSELT Roberto Gemello, CSELT Elisabetta Gerbino, CSELT Claudio Rullent, CSELT Volume: 2, Page: 1147 Abstract: This paper presents Dialogos, a real time system for human-machine spoken dialogue on the telephone in task-oriented domains. The system has been tested in a large trial with inexperienced users and it has proved robust enough to allow spontaneous interactions both to users who get good recognition performance and to those who get lower scores. The robust behavior of the system has been achieved by combining the use of specific language models during the recognition phase of analysis, the tolerance toward spontaneous speech phenomena, the activity of a robust parser, and the use of pragmatic-based dialogue knowledge. This integration of the different modules makes it possible to deal with partial or total breakdowns of the different levels of analysis. We report the field trial data of the system and the evaluation results of the overall system and of the submodules.
** Title: Surfin' the World Wide Web with Japanese Authors: Kazuhiro Kondo, Texas Instruments Inc. Charles T. Hemphill, Texas Instruments Inc. Volume: 2, Page: 1151 Abstract: Previously, we have developed Speech-Aware Multimedia (SAM) which controls a WWW browser using English speech. We recently extended its capability to use Japanese speech to browse Japanese pages, and developed a prototype using speaker-independent, continuous speech recognition with Japanese context-dependent phonetic models. Some challenges not seen in English include: segregation of Japanese text into word units for optional silence insertion, Japanese text to phone conversion and accommodation of English link names embedded in Japanese pages. In order to accomplish the first two, we modified a public-domain dictionary look-up tool for segmentation and to accommodate heuristics required for improved text-to-phone conversion accuracy. Preliminary tests show that the conversion result contains the correct phone sequence over 97% of the time, and the prototype correctly understands the input speech 91.5% of the time. ** Title: Internet Chinese Information Retrieval Using Unconstrained Mandarin Speech Queries Based on A Client-Server Architecture and A PAT-tree-based Language Model Authors: Lee-Feng Chien, IIS, Academia Sinica Ming-Chiuan Chen, IIS, Academia Sinica Hsin-Min Wang, IIS, Academia Sinica Lin-Shan Lee, IIS, Academia Sinica Sung-Chien Lin, Dept. CSIE, National Taiwan University Jenn-Chau Hong, Dept. CSIE, National Taiwan University Jia-Lin Shen, Dept. EE, National Taiwan University Volume: 2, Page: 1155 Abstract: In order to pursue high performance of Chinese information access on the Internet, this paper presents an attractive approach with a successful integration of efficient speech recognition and information retrieval techniques. A working system based on the proposed approach for speech retrieval of real-time Chinese net news services has been implemented and tested.
Very exciting performance has been achieved. ** Title: Combining Key-Phrase Detection and Subword-based Verification for Flexible Speech Understanding Authors: Tatsuya Kawahara, Kyoto University Chin Hui Lee, Bell Labs Biing-Hwang Juang, Bell Labs Volume: 2, Page: 1159 Abstract: A flexible speech understanding framework combining key-phrase detection and verification is presented. Detection of semantically-tagged key-phrases directly leads to robust understanding. In order to select reliable detections and eliminate false alarms, an utterance verification technique is incorporated. A phrase verifier combines subword-based likelihood ratios of correct models and anti-subword alternate models. A confidence measure that focuses on mis-matched subwords is proposed and demonstrated as the most effective. The combined strategy drastically improves the semantic accuracy for out-of-grammar utterances, while maintaining the performance for in-grammar samples. We also found that utterance verification applied after grammar-based decoding is not as effective as the proposed detection and verification strategy. ** Title: Controlling Limited-Domain Applications by Probabilistic Semantic Decoding of Natural Speech Authors: Holger Stahl, TUM Johannes Muller, TUM Manfred Lang, TUM Volume: 2, Page: 1163 Abstract: The paper describes a speech understanding system, which allows the online control of arbitrary running applications owning a well-defined command interface. A sequential combination of a signal preprocessor, a stochastic-driven one-stage semantic decoder and a rule-based intention decoder is proposed. Following this principle and using the respective algorithms, speech understanding front-ends for the domains 'graphic editor' and 'service robot' could be successfully realized.
** Title: Multi-Channel Speech Enhancement in a Car Environment using Wiener Filtering and Spectral Subtraction Authors: Joerg Meyer, University of Bremen Klaus Uwe Simmer, University of Bremen Volume: 2, Page: 1167 Abstract: This paper presents a multichannel algorithm for speech enhancement for hands-free telephone systems in cars. This new algorithm takes advantage of the special noise characteristics in fast driving cars. The incoherence of the noise allows the use of adaptive Wiener filtering in the frequencies above a theoretically determined frequency. Below this frequency a smoothed spectral subtraction (SSS) is used to get an improved noise suppression. The algorithm yields better results in noise reduction with significantly less distortion and artificial noise than spectral subtraction or Wiener filtering alone. ** Title: Weighted Matching Algorithms and Reliability in Noise Cancelling by Spectral Subtraction Authors: Nestor Becerra Yoma, CCIR/University of Edinburgh Fergus McInnes, CCIR/University of Edinburgh Mervyn Jack, CCIR/University of Edinburgh Volume: 2, Page: 1171 Abstract: This paper addresses the problem of speech recognition with signals corrupted by additive noise at moderate SNR. A technique based on spectral subtraction and noise cancellation reliability weighting in acoustic pattern matching algorithms is studied. A model for additive noise is proposed and used to compute the variance of the hidden clean signal information and the reliability of the spectral subtraction process. The results presented in this paper show that a proper weight on the information provided by static parameters can substantially reduce the error rate. ** Title: HMM-Based Speech Enhancement Using Harmonic Modeling Authors: Michael E. Deisher, Intel Andreas S. Spanias, ASU Volume: 2, Page: 1175 Abstract: This paper describes a technique for reduction of non-stationary noise in electronic voice communication systems.
Removal of noise is needed in many such systems, particularly those deployed in harsh mobile or otherwise dynamic acoustic environments. The proposed method employs state-based statistical models of both speech and noise, and is thus capable of tracking variations in noise during sustained speech. This work extends the hidden Markov model (HMM) based minimum mean square error (MMSE) estimator to incorporate a ternary voicing state, and applies it to a harmonic representation of voiced speech. Noise reduction during voiced sounds is thereby improved. Performance is evaluated using speech and noise from standard databases. The extended algorithm is demonstrated to improve speech quality as measured by informal preference tests and objective measures, to preserve speech intelligibility as measured by informal Diagnostic Rhyme Tests, and to improve the performance of a low bit-rate speech coder and a speech recognition system when used as a pre-processor. ** Title: Model Based Speech Pause Detection Authors: Bruce L. McKinley, Signal Processing Consultants Gary H. Whipple, U.S. Department of Defense Volume: 2, Page: 1179 Abstract: This paper presents two new algorithms for robust speech pause detection (SPD) in noise. Our approach was to formulate SPD into a statistical decision theory problem for the optimal detection of noise-only segments, using the framework of model-based speech enhancement (MBSE). The advantages of this approach are that it performs well in high noise conditions, all necessary information is available in MBSE, and no other features are required to be computed. The first algorithm is based on a maximum a posteriori probability (MAP) test and the second is based on a Neyman-Pearson test. These tests are seen to make use of the spectral distance between the input vector and the composite spectral prototypes of the speech and noise models, as well as the probabilistic framework of the hidden Markov model. 
The algorithms are evaluated and shown to perform well against different types of noise at various SNRs. ** Title: Integrated Speech Enhancement and Coding in the Time-Frequency Domain Authors: Andrzej Drygajlo, LTS-DE, EPFL Benito Carnero, LTS-DE, EPFL Volume: 2, Page: 1183 Abstract: This paper addresses the problem of merging speech enhancement and coding in the context of auditory modeling. The noisy signal is first processed by a fast wavelet packet transform algorithm to obtain an auditory spectrum, from which a rough masking model is estimated. Then, this model is used to refine a subtractive-type enhancement algorithm. The enhanced speech coefficients are then encoded in the same time-frequency transform domain using masking threshold constraints for quantization noise. The advantage of the proposed method is that both enhancement and coding are performed with the transform coefficients, without making use of additional FFT processing. ** Title: Quality Enhancement Of Narrowband CELP-Coded Speech Via Wideband Harmonic Re-synthesis Authors: Cheung-Fat Chan, City University of Hong Kong Wai-Kwong Hui, City University of Hong Kong Volume: 2, Page: 1187 Abstract: Results for improving the quality of narrowband CELP-coded speech by enhancing the pitch periodicity and by regenerating the highband components of speech spectra are reported. Multiband excitation (MBE) analysis is applied to enhance the pitch periodicity by re-synthesizing the speech signal using a harmonic synthesizer. The highband magnitude spectra are regenerated by matching to lowband spectra using a trained wideband spectral codebook. Information about the voiced/unvoiced (V/UV) excitation in the highband is derived from a training procedure and recovered by using the matched lowband index. Simulation results indicate that the quality of the wideband enhanced speech is significantly improved over the narrowband CELP-coded speech. 
** Title: Speech Enhancement using CSS-based Array Processing Authors: Futoshi Asano, ETL Satoru Hayamizu, ETL Volume: 2, Page: 1191 Abstract: A method for recovering the LPC spectrum from a microphone array input signal corrupted by ambient noise is proposed. This method is based on the CSS (coherent subspace) method, which is designed for DOA (direction of arrival) estimation of broadband array input signals. The noise energy is reduced in the subspace domain by the maximum likelihood method. To enhance the performance of noise reduction, elimination of noise-dominant subspace using projection is further employed, which is effective when the SNR is low and classification of noise and signals in the subspace domain is difficult. The results of the simulation show that some small formants, which cannot be estimated by the conventional delay-and-sum beamformer, were well estimated by the proposed method. ** Title: Co-channel Speaker Separation Using Constrained Nonlinear Optimization Authors: Daniel S. Benincasa, Rome Laboratory Michael I. Savic, Rensselaer Polytechnic Institute Volume: 2, Page: 1195 Abstract: This paper describes a technique to separate the speech of two speakers recorded over a single channel. The main focus of this research is to separate overlapping voiced speech signals using constrained nonlinear optimization. Based on the assumption that voiced speech can be modeled as a slowly-varying vocal tract filter with a quasi-periodic train of impulses, the speech waveform is represented as a sum of sine waves with time-varying amplitude, frequency and phase. In this work the unknown parameters of our speech model will be the amplitude, frequency and phase of the harmonics of both speech signals. 
Using constrained nonlinear optimization, we will determine, on a frame by frame basis, the best possible parameters that provide the least mean square error (LMSE) between the original co-channel speech signal and the sum of the reconstructed speech signals. ** Title: A Contextual Blind Separation of Delayed and Convolved Sources Authors: Te-Won Lee, Max-Planck-Society Reinhold Orglmeister, Berlin University of Technology Volume: 2, Page: 1199 Abstract: We present a new method to tackle the problem of separating mixtures of real sources which have been convolved and time-delayed under real world conditions. To this end, we learn two sets of parameters to unmix the mixtures and to estimate the true density function. The solutions are discussed for feedback and feedforward architectures. Since the quality of separation depends on the modeling of the underlying density, we propose different methods to approximate the density function more closely using some context. The proposed density estimation achieves separation of a wider class of sources. Furthermore, we employ the FIR polynomial matrix techniques in the frequency domain to invert a true-phase mixing system. The significance of the new method is demonstrated with the successful separation of two speakers and separation of music and speech recorded with two microphones in a reverberating room. ** Title: Segregation of Concurrent Speech: an Application of the Reassigned Spectrum Authors: Georg F. Meyer, Keele University Fabrice Plante, Liverpool University Frederic Berthommier, ICP Grenoble Volume: 2, Page: 1203 Abstract: Modulation maps provide an effective method for the segregation of voiced speech sounds from competing background activity. The maps are constructed by computing modulation spectra in a bank of auditory filters. Target spectra are recovered by sampling the modulation spectra at the initial five multiples of the fundamental frequency of the target sound. 
If the modulation spectra are computed using a conventional DFT, windows of 200ms duration are necessary. Using the reassigned spectrum, a new time-frequency representation, the window size can be reduced to 50ms with minimal loss of performance. The algorithm is tested on a 'double vowel' identification task that has been used extensively in psychophysical experiments. ** Title: Enhancement of esophageal speech by injection noise rejection Authors: Hector Raul Javkin, PTI-STL Michael Galler, PTI-STL Nancy Niedzielski, PTI-STL Volume: 2, Page: 1207 Abstract: Esophageal speakers, who produce a voice source by bringing about a vibration of the esophageal superior sphincter, must insufflate the esophagus with an air injection gesture before every utterance, creating an air reservoir to drive the vibration. The resulting noise is generally undesired by the speakers. This paper describes a method for the automatic recognition and rejection of the injection noise which occurs in esophageal speech. ** Title: Real-Time Digital Speech Processing Strategies For The Hearing Impaired Authors: Neeraj Magotra, University of New Mexico Sudheer Sirivara, University of New Mexico Volume: 2, Page: 1211 Abstract: This paper deals with digital processing of speech as it pertains to the hearing impaired. The issues described in this paper deal with the development of a true real-time digital hearing aid. The system (based on Texas Instruments TMS320C3X) implements frequency shaping, noise reduction, interaural time delay, amplitude compression and various timing options. It also provides a testbed for future development. The device is referred to as the DIgital Programmable Hearing Aid (DIPHA). DIPHA uses a wide bandwidth (up to 16 kHz). DIPHA is a fully programmable device that permits us to program various speech processing algorithms and test them on hearing impaired subjects in the real world as well as in the laboratory. 
** Title: Iterative-Batch And Sequential Algorithms For Single Microphone Speech Enhancement Authors: Sharon Gannot, Tel-Aviv University David Burshtein, Tel-Aviv University Ehud Weinstein, Tel-Aviv University Volume: 2, Page: 1215 Abstract: Speech quality and intelligibility might significantly deteriorate in the presence of background noise, especially when the speech signal is subject to subsequent processing. In this paper we present a class of Kalman-filter based speech enhancement algorithms with some extensions, modifications, and improvements. The first algorithm employs the estimate-maximize (EM) method to iteratively estimate the spectral parameters of the speech and noise. The enhanced speech signal is obtained as a byproduct of the parameter estimation algorithm. The second algorithm is a sequential, computationally efficient, gradient descent algorithm. We discuss various topics concerning the practical implementation of these algorithms. An experimental study using real speech and noise signals is provided to compare these algorithms with alternative speech enhancement algorithms, and to compare the performance of the iterative and sequential algorithms. ** Title: Kalman filtering for low distortion speech enhancement in mobile communication Authors: Patrik Sorqvist, Ericsson Peter Handel, Ericsson Bjorn Ottersten, KTH Volume: 2, Page: 1219 Abstract: This paper presents a model-based approach for noise suppression of speech contaminated by additive noise. A Kalman filter based speech enhancement system is presented and its performance is investigated in detail. It is shown that with a novel speech parameter estimation algorithm, it is possible to achieve 10dB noise suppression with a high total audible quality. 
** Title: Exploiting the Potential of Auditory Preprocessing for Robust Speech Recognition by Locally Recurrent Neural Networks Authors: Klaus Kasper, University of Frankfurt Herbert Reininger, University of Frankfurt Dietrich Wolf, University of Frankfurt Volume: 2, Page: 1223 Abstract: In this paper we present a robust speaker independent speech recognition system consisting of a feature extraction based on a model of the auditory periphery, and a Locally Recurrent Neural Network for scoring of the derived feature vectors. A number of recognition experiments were carried out to investigate the robustness of this combination against different types of noise in the test data. The proposed method is compared with Cepstral, RASTA, and JAH-RASTA processing for feature extraction and Hidden Markov Models for scoring. The presented results show that the information in features from the auditory model can be best exploited by Locally Recurrent Neural Networks. The robustness achieved by this combination is comparable to that of JAH-RASTA in combination with HMM but without any requirement for an explicit adaptation to the noise in speech pauses. ** Title: Feature adaptation using deviation vector for robust speech recognition in noisy environment Authors: Tai-Hwei Hwang, NTHU Lee-Ming Lee, MIT Hsiao-Chuan Wang, NTHU Volume: 2, Page: 1227 Abstract: When a speech signal is contaminated by additive noise, its cepstral coefficients are assumed to be functions of the noise power. By using a Taylor series expansion with respect to noise power, the cepstral vector can be approximated by a nominal vector plus the first derivative term. The nominal cepstrum corresponds to the clean speech signal and the first derivative term is a quantity that adapts the speech feature to the noisy environment. A deviation vector is introduced to estimate the derivative term. The experiments show that feature adaptation based on deviation vectors is superior to projection-based methods. 
** Title: Binaural Phoneme Recognition Using the Auditory Image Model and Cross-Correlation Authors: Keith I. Francis, Cedarville College Timothy R. Anderson, Armstrong Laboratory Volume: 2, Page: 1231 Abstract: An improved method for phoneme recognition in noise is presented using an auditory image model and cross-correlation in a binaural approach called the binaural auditory image model (BAIM). Current binaural methods are explained as background to BAIM processing. BAIM and a variation of the cocktail-party-processor incorporating the auditory image model are applied in phoneme recognition experiments. The results show BAIM performs as well as or better than current methods for most signal-to-noise ratios. ** Title: Utterance dependent parametric warping for a talker-independent HMM-based recognizer Authors: Daniel J. Mashao, Brown University John E. Adcock, Brown University Volume: 2, Page: 1235 Abstract: In an effort to improve recognition performance of talker-independent speech systems, many adaptive methods have been proposed. The methods generally seek to exploit the higher recognition performance rate of talker-dependent systems and extend it to talker-independent systems. This is achieved by some form of placing talkers into several categories, usually using gender or vocal-tract size. In this paper we investigate a similar idea, but categorize each utterance independently. An utterance is processed using several spectral compressions, and the compression with the maximum likelihood is then used to train a better model. For testing, the spectral compression with the maximum likelihood is used to decode the utterance. While the spectral compressions divided the utterances well, this did not translate into significant improvement in performance, and the computational cost increase was significant. 
** Title: Phase-corrected RASTA for automatic speech recognition over the phone Authors: Johan de Veth, University of Nijmegen Louis Boves, University of Nijmegen Volume: 2, Page: 1239 Abstract: In this paper we propose an extension to the classical RASTA technique. The new method consists of classical RASTA filtering followed by a phase correction operation. In this manner, the influence of the communication channel is as effectively removed as with classical RASTA. However, our proposal does not introduce a left-context dependency like classical RASTA. Therefore the new method is better suited for automatic speech recognition based on context-independent modeling with Gaussian mixture hidden Markov models. We tested this in the context of connected digit recognition over the phone. When we used context-dependent hidden Markov models (i.e. word models), we found that classical RASTA and phase-corrected RASTA performed equally well. For context-independent phone-based models, we found that phase-corrected RASTA can outperform classical RASTA depending on the acoustic resolution of the models. ** Title: A Binaural Speech Processing Method Using Subband-Crosscorrelation Analysis for Noise Robust Recognition Authors: Shoji Kajita, Nagoya University Kazuya Takeda, Nagoya University Fumitada Itakura, Nagoya University Volume: 2, Page: 1243 Abstract: This paper describes an extended subband-crosscorrelation (SBXCOR) analysis to improve the robustness against noise. The SBXCOR analysis, which has been already proposed, is a binaural speech processing technique using two input signals and extracts the periodicities associated with the inverse of the center frequency (CF) in each subband. In this paper, by taking an exponentially weighted sum of crosscorrelation at the integral multiples of the inverse of CF, SBXCOR is extended so as to capture more periodicities included in two input signals. 
The experimental results using a DTW word recognizer showed that the processing improves the performance of SBXCOR for both white noise and computer room noise. For white noise, the extended SBXCOR performed significantly better than the smoothed group delay spectrum and the mel-frequency cepstral coefficient (MFCC) extracted from both monaural and binaural signals. However, for the computer room noise, it outperformed them only at an SNR of 0dB. ** Title: Modelling asynchrony in speech using elementary single-signal decomposition Authors: Michael J. Tomlinson, DRA Malvern Martin J. Russell, DRA Malvern Roger K. Moore, DRA Malvern Andrew P. Buckland, DRA Malvern Martin A. Fawley, DRA Malvern Volume: 2, Page: 1247 Abstract: Although the possibility of asynchrony between different components of the speech spectrum has been acknowledged, its potential effect on automatic speech recogniser performance has only recently been studied. This paper presents the results of continuous speech recognition experiments in which such asynchrony is accommodated using a variant of HMM decomposition. The paper begins with an investigation of the effects of partitioning the speech spectrum explicitly into sub-bands. Asynchrony between these sub-bands is then accommodated, resulting in a significant decrease in word errors. The same decomposition technique has previously been used successfully to compensate for asynchrony between the two input streams in an audio-visual speech recognition system. ** Title: Subband-based speech recognition Authors: Herve Bourlard, FPMS - TCTS Stephane Dupont, FPMS - TCTS Volume: 2, Page: 1251 Abstract: In the framework of Hidden Markov Models (HMM) or hybrid HMM/Artificial Neural Network (ANN) systems, we present a new approach towards automatic speech recognition (ASR). 
The general idea is to divide up the full frequency band (represented in terms of critical bands) into several subbands, compute phone probabilities for each subband on the basis of subband acoustic features, perform dynamic programming independently for each band, and merge the subband recognizers (recombining the respective, possibly weighted, scores) at some segmental level corresponding to temporal anchor points. The results presented in this paper confirm some preliminary tests reported earlier. On both isolated word and continuous speech tasks, it is indeed shown that even using quite simple recombination strategies, this subband ASR approach can yield at least comparable performance on clean speech while providing better robustness in the case of narrowband noise. ** Title: Sub-band Based Recognition Of Noisy Speech Authors: Sangita Tibrewala, OGI Hynek Hermansky, ICSI Volume: 2, Page: 1255 Abstract: A new approach to automatic speech recognition based on independent class-conditional probability estimates in several frequency sub-bands is presented. The approach is shown to be especially applicable to environments which cause partial corruption of the frequency spectrum of the signal. Some of the issues involved in the implementation of the approach are also addressed. ** Title: Recognizing Reverberant Speech with RASTA-PLP Authors: Brian E.D. Kingsbury, ICSI / UC Berkeley Nelson Morgan, ICSI / UC Berkeley Volume: 2, Page: 1259 Abstract: The performance of the PLP, log-RASTA-PLP, and J-RASTA-PLP front ends for recognition of highly reverberant speech is measured and compared with the performance of humans and the performance of an experimental RASTA-like front end on reverberant speech, and with the performance of a PLP-based recognizer trained on reverberant speech. 
While humans are able to reliably recognize the reverberant test set, achieving a 6.1% word error rate, the best RASTA-PLP-based recognizer has a word error rate of 68.7% on the same test set, and the PLP-based recognizer trained on reverberant speech has a 50.3% word error rate. Our experimental variant on RASTA processing provides a statistically significant improvement in performance on the reverberant speech, with a best word error rate of 64.1%. ** Title: Multi-Resolution Phonetic/Segmental Features and Models for HMM-Based Speech Recognition Authors: Saeed Vaseghi, QUB Naomi Harte, QUB Ben Milner, QUB Volume: 2, Page: 1263 Abstract: This paper explores the modelling of phonetic segments of speech with multi-resolution spectral/time correlates. For spectral representation a set of multi-resolution cepstral features is proposed. Cepstral features obtained from a DCT of the log energy-spectrum over the full voice-bandwidth (100-4000 Hz) are combined with higher resolution features obtained from the DCT of the lower subband (say 100-2100 Hz) and the upper subband (2100-4000 Hz) halves. This approach can be extended to several levels of different resolutions. For representation of the temporal structure of speech segments, or phones, the conventional cepstral and dynamic cepstral features representing speech at sub-phonetic levels are supplemented by a set of phonetic features that describe the trajectory of speech over the duration of a phoneme. A conditional probability model for phonetic and subphonetic features is considered. Experimental evaluations demonstrate that the inclusion of segmental features results in about a 10% decrease in error rates. ** Title: Maximum Likelihood Weighting of Dynamic Speech Features for CDHMM Speech Recognition Authors: Javier Hernando, UPC-Barcelona Volume: 2, Page: 1267 Abstract: Speech dynamic features are routinely used in current speech recognition systems in combination with short-term (static) spectral features. 
Although many existing speech recognition systems do not weight both kinds of features, it seems convenient to use some weighting in order to increase the recognition accuracy of the system. In cases where this weighting is performed, it is manually tuned or consists simply of compensating the variances. The aim of this paper is to propose a method to automatically estimate an optimum state-dependent stream weighting in a CDHMM recognition system by means of a maximum-likelihood based training algorithm. Unlike other works, it is shown that simple constraints on the new weighting parameters permit the maximum-likelihood criterion to be applied to this problem. Experimental results in speaker independent digit recognition show an important increase in recognition accuracy. ** Title: Speech recognition using automatically derived acoustic baseforms Authors: Richard C. Rose, ATT Labs-Research Eduardo Lleida, University of Zaragoza Volume: 2, Page: 1271 Abstract: This paper investigates procedures for obtaining user-configurable speech recognition vocabularies. These procedures use example utterances of vocabulary words to perform unsupervised automatic acoustic baseform determination in terms of a set of speaker independent subword acoustic units. Several procedures, differing both in the definition of subword acoustic model context and in the phonotactic constraints used in decoding have been investigated. The tendency of input utterances to contain out-of-vocabulary or non-speech information is accounted for using likelihood ratio based utterance verification procedures. Comparisons of different definitions of the likelihood ratio used for utterance verification and of different criteria for estimating parameters used in the likelihood ratio test have been performed. The performance of these techniques has been evaluated on utterances taken from a trial of a voice label recognition service. 
** Title: On Combining Frequency Warping and Spectral Shaping in HMM Based Speech Recognition Authors: Alexandros Potamianos, ATT Labs-Research Richard C. Rose, ATT Labs-Research Volume: 2, Page: 1275 Abstract: Frequency warping approaches to speaker normalization have been proposed and evaluated on various speech recognition tasks. These techniques have been found to significantly improve performance even for speaker independent recognition from short utterances over the telephone network. In maximum likelihood (ML) based model adaptation a linear transformation is estimated and applied to the model parameters in order to increase the likelihood of the input utterance. The purpose of this paper is to demonstrate that significant advantage can be gained by performing frequency warping and ML speaker adaptation in a unified framework. A procedure is described which compensates utterances by simultaneously scaling the frequency axis and reshaping the spectral energy contour. This procedure is shown to reduce the error rate in a telephone based connected digit recognition task by 30-40%. ** Title: Recursive Linear Prediction Using OBE Identification With Automatic Bound Estimation Authors: John R. Deller, Michigan State University Tsung Ming Lin, Michigan State University Majid Nayeri, Michigan State University Volume: 2, Page: 1279 Abstract: Application of set-membership (SM) identification to real-time speech processing is made possible by the optimal bounding ellipsoid algorithm with automatic bound estimation (OBE-ABE) that blindly deduces model-input bounds. To date, lack of any tenable approach to estimating bounds in speech models has rendered these interesting new SM methods impractical. OBE-ABE is consistently convergent, offers significant computational advantages, and provides a set of feasible solutions in finite time. 
** Title: Nonlinear Long-Term Prediction of Speech Signals Authors: Martin Birgmeier, Vienna University of Technology Hans-Peter Bernhard, Vienna University of Technology Gernot Kubin, Vienna University of Technology Volume: 2, Page: 1283 Abstract: We present an in-depth study of nonlinear long-term prediction of speech signals. Successful long-term prediction strongly depends on the nonlinear oscillator framework for speech modeling. This hypothesis has been confirmed in a series of experiments run on a voiced speech database. We provide results for the prediction gain as a function of the prediction delay using two methods. One is based on an extended form of radial basis function networks. The other relies on calculating the mutual information between multiple signal samples. We explain the role of this mutual information function as the upper bound on the achievable prediction gain. We show that with matching memory and dimension, the two methods yield nearly the same value for the achievable prediction gain. It turns out that the nonlinear predictor's gain is significantly higher than that for a linear predictor using the same parameters. ** Title: Vocal Tract Shape Trajectory Estimation using MLP Analysis-by-Synthesis Authors: Hywel B. Richards, University of Wales Swansea John S. Mason, University of Wales Swansea John S. Bridle, Dragon Systems UK Ltd. Melvyn J. Hunt, Dragon Systems UK Ltd. Volume: 2, Page: 1287 Abstract: The objective of this work is a computationally efficient method for inferring vocal tract shape trajectories from acoustic speech signals. We use an MLP to model the vocal tract shape-to-acoustics mapping, then in an analysis-by-synthesis approach, optimise an objective function that includes both the accuracy of the spectrum approximation and the credibility of the vocal tract dynamics. This optimisation carries out gradient descent using back-propagation of derivatives through the MLP. 
Employing a series of MLPs of increasing order avoids getting trapped in local optima caused by the many-to-one mapping between vocal tract shapes and acoustics. We obtain two orders of magnitude speed increase compared with our previous methods using codebooks and direct optimisation of a synthesiser. ** Title: Fast and Robust Joint Estimation of Vocal Tract and Voice Source Parameters Authors: Ding Wen, ATR Interpreting Telecomm Research Lab. Nick Campbell, ATR Interpreting Telecomm Research Lab. Higuchi Norio, ATR Interpreting Telecomm Research Lab. Volume: 2, Page: 1291 Abstract: A new pitch-synchronous method of joint estimation is described to estimate vocal tract and voice source parameters from speech signals based on an ARX model. The method uses Kalman filtering to estimate the time-varying coefficients and simulated annealing to deal with the non-linear optimization of Rosenberg-Klatt parameters. A compact method is suggested in the algorithm in order to reduce the computation cost. Further, an automatic model order selection method is proposed to determine the proper analysis pole-order of the ARX model, based on the estimated formant bandwidths. The new method has been shown to be much faster than our previous method and the order selection technique has been shown to be effective. Finally, an ATR two-channel speech database including varying sentence-level prominence patterns is used to verify the proposed method. ** Title: Spectral correlates of glottal waveform models: an analytic study Authors: Boris Doval, LIMSI, Orsay Christophe d'Alessandro, LIMSI, Orsay Volume: 2, Page: 1295 Abstract: This paper deals with spectral representation of the glottal flow. The LF and the KLGLOTT88 models of the glottal flow are studied. In a first part, we compute analytically the spectrum of the LF-model. Then, formulas are given for computing spectral tilt and amplitudes of the first harmonics as functions of the LF-model parameters. 
In a second part we consider the spectrum of the KLGLOTT88 model. It is shown that this model can be modeled in the spectral domain by an all-pole third-order linear filter. Moreover, the anticausal impulse response of this filter is a good approximation of the glottal flow model. Parameter estimation seems easier in the spectral domain. Therefore our results can be used for modification of the (hidden) glottal flow characteristic of natural speech signals, by directly processing the spectrum, without needing time-domain parameter estimation. ** Title: A time varying ARMAX speech modeling with phase compensation using glottal source model Authors: Keiichi Funaki, Hokkaido University Yoshikazu Miyanaga, Hokkaido University Koji Tochinai, Hokkaido University Volume: 2, Page: 1299 Abstract: This paper presents a new speech analysis method based on a Glottal-ARMAX (Auto Regressive and Moving Average eXogenous) model with phase compensation. A Glottal-ARMAX model consists of two kinds of inputs, a glottal source model excitation and a white Gaussian input, and a vocal tract ARMAX model. The proposed method can simultaneously estimate the glottal source model and vocal tract ARMAX model parameters pitch synchronously. In this method, ARMAX identification using a modified MIS (Model Identification System) method is adopted to estimate ARMAX parameters, and a hybrid approach of Genetic Algorithm (GA) and Simulated Annealing (SA) is employed to efficiently solve the non-linear simultaneous optimization of both parameters. Furthermore, phase compensation using an all-pass filter is introduced within a generation loop in the GA method in order to compensate phase distortion. Experiments using synthetic speech and natural speech demonstrate the efficacy of the proposed method. 
** Title: Speech Representation and Transformation using Adaptive Interpolation of Weighted Spectrum: VOCODER Revisited Authors: Hideki Kawahara, ATR-HIP Volume: 2, Page: 1303 Abstract: A simple new procedure called STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) has been developed. STRAIGHT uses pitch-adaptive spectral analysis combined with a surface reconstruction method in the time-frequency region, and an excitation source design based on phase manipulation. It preserves the bilinear surface in the time-frequency region and allows for over 600% manipulation of such speech parameters as pitch, vocal tract length, and speaking rate, without further degradation due to the parameter manipulation. ** Title: The weft: A representation for periodic sounds Authors: Dan Ellis, ICSI Volume: 2, Page: 1307 Abstract: For the problem of separating sound mixtures, periodicity is a powerful cue used by both human listeners and automatic systems. Short-term autocorrelation of subband envelopes, as in the correlogram, accounts for much perceptual data. We present a discrete representation of common-period sounds, derived from the correlogram, for use in computational auditory scene analysis: The weft describes a sound in terms of a time-varying periodicity and a smoothed spectral envelope of the energy exhibiting that period. Wefts improve on several aspects of previous approaches by providing, without additional grouping, a single, invertible element for each detected signal, and also a provisional solution to detecting and dissociating energy of different periodicities in a single frequency channel (unlike systems which allocate whole frequency channels to one source). We define the weft, describe the analysis procedure we have devised, and illustrate its capacity to separate periodic sounds from other signals. 
** Title: A computationally efficient algorithm for calculating loudness patterns of narrowband speech Authors: Markus Hauenstein, University of Kiel Volume: 2, Page: 1311 Abstract: Loudness patterns are closer to the human perception of sound waves than spectrograms. This paper describes how loudness patterns can be efficiently calculated with an allpass-transformed polyphase filterbank based on a mixed radix FFT and three subsequent non-linear stages that model masking effects in the frequency and time domain as well as loudness compression. ** Title: Two-channel blind deconvolution for non-minimum phase impulse responses Authors: Ken'ichi Furuya, NTT HI Labs. Yutaka Kaneda, NTT HI Labs. Volume: 2, Page: 1315 Abstract: A new blind deconvolution method is proposed for recovering an unknown source signal, which is observed through two unknown channels characterized by non-minimum phase impulse response filters. Conventional methods cannot estimate the non-minimum phase parts. Our method is based on computing the eigenvector corresponding to the smallest eigenvalue of the input correlation matrix and using a cost function to determine the order of the impulse response filter model. Multi-channel inverse filtering with the estimated impulse responses is used to recover the unknown source signal. Sub-band processing is also used to reduce the complexity of dealing with long impulse responses such as room impulse responses. Computer simulation shows the effectiveness of our method. ** Title: Variable Time-scale Modification of Speech using Transient information Authors: Sungjoo Lee, Pusan National University Hee Dong Kim, The University of Suwon Hyung Soon Kim, Pusan National University Volume: 2, Page: 1319 Abstract: Conventional time-scale modification methods have the problem that as the modification rate gets higher the time-scale modified speech signal becomes less intelligible, because they ignore the effect of articulation rate on speech characteristics. 
In this paper, we propose a variable time-scale modification method based on the knowledge that the timing information of transient portions of a speech signal plays an important role in speech perception. After identifying transient and steady portions of a speech signal, the proposed method gets the target rate by modifying steady portions only. The result of a subjective preference test indicates that the proposed method produces performance superior to that of the conventional SOLA method. ** Title: Speech Enhancement with Reduction of Noise Components in the Wavelet Domain Authors: Jong Won Seok, University of KNU Keun Sung Bae, University of KNU Volume: 2, Page: 1323 Abstract: This paper describes a general problem of removing additive background noise from the noisy speech in the wavelet domain. A semisoft thresholding is used to remove noise components from the wavelet coefficients of noisy speech. To prevent the quality degradation of the unvoiced sounds during the denoising process, the unvoiced region is classified first and then thresholding is applied in a different way. Experimental results demonstrate that the proposed speech enhancement algorithm is very promising. ** Title: Blind Separation and Restoration of Signals Mixed in Convolutive Environment Authors: Jiangtao Xi, McMaster University James P. Reilly, McMaster University Volume: 2, Page: 1327 Abstract: This paper proposes new neural network approaches for separating and restoring signals mixed through FIR channels. Firstly, a set of maximal entropy based train rules are developed. Secondly, a new scheme for restoring the original signals is proposed for the 2X2 case. Computer simulation results for speech signals are presented to verify the proposed approaches. ** Title: Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator Authors: Eric Scheirer, Interval Research Corp. Malcolm Slaney, Interval Research Corp. 
Volume: 2, Page: 1331 Abstract: We report on the construction of a real-time computer system capable of distinguishing speech signals from music signals over a wide range of digital audio input. We have examined 13 features intended to measure conceptually distinct properties of speech and/or music signals, and combined them in several multidimensional classification frameworks. We provide extensive data on system performance and the cross-validated training/test setup used to evaluate the system. For the datasets currently in use, the best classifier classifies with 5.8% error on a frame-by-frame basis, and 1.4% error when integrating long (2.4 second) segments of sound. ** Title: Encoding of Speech Spectral Parameters Using the Adaptive Quantization Methods Authors: Insung Lee, Chungbuk National University Hong Chae Woo, Taegu University Volume: 2, Page: 1335 Abstract: Efficient quantization methods of the line spectrum pairs(LSP) which have good performances, low complexity and memory are proposed. The adaptive quantization method utilizing the ordering property of LSP parameters is used in a scalar quantizer and a vector-scalar hybrid quantizer. The maximum quantization range of each LSP parameter is varied adaptively on the quantized value of the previous order's LSP parameter. The proposed scalar quantization algorithm needs 31 bits/frame which is 3 bits less than in the conventional scalar quantization method with interframe prediction to maintain the transparent quality of speech. The improved vector-scalar quantizer achieves an average spectral distortion of 1 dB using 26 bits/frame. The performance of proposed quantization methods are evaluated in the channel errors. 
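The adaptive-range idea in the LSP quantization abstract above (the range of each LSP parameter depends on the quantized value of the previous one) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' quantizer: the uniform step, the fixed per-parameter upper bounds `upper_bounds`, the `min_gap` spacing, and both function names are hypothetical.

```python
def quantize_lsp_adaptive(lsp, bits, upper_bounds, min_gap=0.01):
    """Uniform scalar quantization of ordered LSP frequencies (radians).

    Assumption: the lower edge of each parameter's range is the previous
    quantized LSP plus min_gap, which exploits the ordering property;
    upper_bounds are fixed, strictly increasing per-parameter maxima.
    """
    indices, recon = [], []
    prev = 0.0
    for w, b, hi in zip(lsp, bits, upper_bounds):
        lo = prev + min_gap                    # range starts above previous value
        levels = 1 << b                        # 2**b reconstruction levels
        step = (hi - lo) / (levels - 1)
        idx = max(0, min(levels - 1, round((w - lo) / step)))
        q = lo + idx * step
        indices.append(idx)
        recon.append(q)
        prev = q                               # next range adapts to this value
    return indices, recon

def dequantize_lsp_adaptive(indices, bits, upper_bounds, min_gap=0.01):
    """Decoder: rebuilds the same adaptive ranges from the received indices."""
    recon = []
    prev = 0.0
    for idx, b, hi in zip(indices, bits, upper_bounds):
        lo = prev + min_gap
        step = (hi - lo) / ((1 << b) - 1)
        prev = lo + idx * step
        recon.append(prev)
    return recon

# Illustrative values only (not taken from the paper)
lsp = [0.3, 0.6, 1.0, 1.5]
bits = [4, 4, 4, 4]
upper_bounds = [0.8, 1.4, 2.0, 2.6]
idx, enc = quantize_lsp_adaptive(lsp, bits, upper_bounds)
dec = dequantize_lsp_adaptive(idx, bits, upper_bounds)
```

Because the decoder derives every range from already-decoded values, no side information is needed, but a channel error in one index also shifts the ranges of all later parameters in the frame.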
** Title: Optimal Transformation of LSP Parameters Using Neural Network Authors: Hai Le Vu, BME HIT Laszlo Lois, BME HIT Volume: 2, Page: 1339 Abstract: In this paper, the intraframe correlation properties of Line Spectrum Pair (LSP) are used to develop an efficient encoding algorithm using the Karhunen-Loeve (KL) transformation. An important nonuniform statistical characteristics of LSP frequencies are investigated. Based upon this nonuniform property the neural network based techniques for generating the transform vectors via system training are studied. Using Principal Component Analysis (PCA) network to decorrelate LSP coefficients, we show that these new approaches lead to as good or better distortion as compared to other methods for speech analysis-synthesis. ** Title: Speech spectrum representation and coding using multigrams with distance Authors: Jan Cernocky, ESIEE Genevieve Baudoin, ESIEE Gerard Chollet, ENST Volume: 2, Page: 1343 Abstract: The multigrams allow us to split a string of symbols into a stream of variable length sequences. The direct application of this method to vector-quantized speech spectra fails, we develop an extension of the method called modified multigrams or multigrams with distance. The algorithm for modified multigram dictionary training as well as experimental results are presented. We found a significant improvement of rate/distortion ratio in comparison to vector quantization with small codebooks. For precise spectrum representation, this method is less suitable and we see its application rather in speech segmentation or in very low bit rate coding. ** Title: Incorporating Perception Into LSF Quantization - Some Experiments Authors: Ronald P. Cohn, U.S. DoD John S. Collura, U.S. 
DoD Volume: 2, Page: 1347 Abstract: In the context of vector quantization (VQ) of the line spectrum frequency (LSF) parameters, we determine experimentally a spectral distribution of quantization error perceived to be "balanced", i.e., error at all frequencies contributing equally, on average, to the perceived distortion. Quantizers which have a balanced distribution should outperform those which don't, given the same number of bits. We examine the spectral error distributions produced by various weighted Euclidean distance measures in the LSF domain and develop one which produces a quantizer having an approximately balanced distribution. This quantizer's performance is compared with that of others having different error distributions. ** Title: Predictive VQ for Noisy Channel Spectrum Coding: AR or MA? Authors: Jan Skoglund, Chalmers University of Technology Jan Linden, Chalmers University of Technology Volume: 2, Page: 1351 Abstract: In this paper, the performance of different predictive vector quantization (PVQ) structures is studied and compared for different degrees of channel noise. Predictive quantization schemes with an auto-regressive (AR) decoder structure are compared with schemes that employ a moving average (MA) decoder. For noisy channels MA prediction performs better than AR. It is shown here that a combination of a PVQ scheme (AR or MA) and a memoryless VQ outperforms both types of traditional predictive quantizer schemes in noiseless as well as noisy channels. ** Title: Efficient Encoding of Mel-Generalized Cepstrum for CELP Coders Authors: Kazuhito Koishida, P&I Lab., Tokyo Institute of Technology Takao Kobayashi, P&I Lab., Tokyo Institute of Technology Satoshi Imai, P&I Lab., Tokyo Institute of Technology Keiichi Tokuda, Nagoya Institute of Technology Volume: 2, Page: 1355 Abstract: In this paper, the performance of several algorithms for the quantization of the mel-generalized cepstral coefficients is studied. 
First, the objective and subjective performance of two-stage vector quantization (VQ) is measured. It is shown that subjective quality for the mel-generalized cepstral coefficients is higher than that for LSP. Secondly, interframe prediction is introduced in the encoding of mel-generalized cepstral coefficients. By utilizing interframe moving average (MA) prediction, the mel-generalized cepstral coefficients can be encoded more efficiently than LSP in terms of cepstral distortion. Finally, we implement a CELP coder based on mel-generalized cepstral analysis in which mel-generalized cepstral coefficients are quantized using MA prediction. This coder has higher objective quality than conventional CELP. ** Title: A Candidate Coder for the ITU-T's New Wideband Speech Coding Standard Authors: Juin-Hwey Chen, Voxware Inc. Volume: 2, Page: 1359 Abstract: This paper presents AT&T's candidate coder for the ITU-T's new wideband speech coding standard at 16, 24 and 32 kb/s. This coder achieves high speech quality with a low coder complexity. The basic idea of the coder is to perform closed-loop pitch prediction on perceptually weighted speech, and then quantize the prediction residual using perceptually based transform coding techniques. A first version of the coder based on DFT was thoroughly tested and submitted to the ITU-T in February 1996, and it was selected as one of two surviving candidates to advance to the next phase. A revised version based on MDCT was later submitted in October 1996. Both versions are described in this paper. ** Title: Perceptual Speech Coding Using Time and Frequency Masking Constraints Authors: Benito Carnero, LTS-DE, EPFL Andrzej Drygajlo, LTS-DE, EPFL Volume: 2, Page: 1363 Abstract: This paper presents a new wide-band speech coding system based on a fast wavelet packet transform algorithm as well as a formulation of temporal and spectral psychoacoustic models of masking. 
The proposed FFT-like overlapped block orthogonal transform allows us to approximate the auditory critical band decomposition in an efficient manner, which is a major advantage over previous approaches that used uniform filter banks. As a result of such a decomposition, the perceptually tuned time-frequency structure of the original speech signal is preserved. This allows us to make use of the temporal and spectral properties of the human auditory system to decrease the average bit rate of the encoder, while perceptually hiding the quantization error. ** Title: A Multi-band CELP Wideband Speech Coder Authors: Anil Ubale, UCSB Allen Gersho, Pennsylvania State University Volume: 2, Page: 1367 Abstract: A novel low-delay wideband speech coder, called Multi-band CELP (MB-CELP), overcomes the major obstacles usually associated with two traditional CELP approaches to wideband speech coding - namely fullband CELP and split-band CELP. The new MB-CELP coder employs a multi-band bank of off-line filtered excitation codebooks, fullband linear prediction synthesis, and minimization of the error between original and synthesized speech signal over the full frequency range. A 16 kbps version of MB-CELP coder with two equal bands, is described in this paper. Subjective comparison test results show that this coder performs better than the G.722 coder at the bit-rate of 48 kbps. ** Title: A Design of Transform Coder for Both Speech and Audio Signals at 1 bit/sample Authors: Takehiro Moriya, NTT Human Interface Labs. Naoki Iwakami, NTT Human Interface Labs. Akio Jin, NTT Human Interface Labs. Kazunaga Ikeda, NTT Human Interface Labs. Satoshi Miki, NTT Human Interface Labs. Volume: 2, Page: 1371 Abstract: This paper proposes a speech and audio coder which operates at 1 bit/sample, namely an 8 kbit/s coder for 8 kHz sampling or a 16 kbit/s coder for 16 kHz sampling. 
The basic structure is inherited from a TwinVQ (Transform domain Weighted Interleave Vector Quantization) high-quality audio coding scheme. Periodical component extraction scheme is newly added to the quantization of MDCT coefficients. This scheme is found to be effective for reducing distortion and improving robustness against channel errors. Qualities for music signals at 8 kbit/s are better than those of G.729 at the same bit rates, while they are worse for clean speech. Qualities at 16 kbit/s are comparable or better than those of G.722 at 48 kbit/s. ** Title: Speech Quality Assessment of Compounded Digital Telecommunication Systems Authors: Kim Tilgaard Petersen, Tele Danmark A/S Steffen Duus Hansen, Technical University of Denmark John Aasted Sorensen, Technical University of Denmark Volume: 2, Page: 1375 Abstract: Digital telecommunication networks may involve a multiple number of public switched telephone networks (PSTN), cellular and mobile systems and to some extent also satellite systems. Most of these networks contain non-linear speech coders and other speech algorithms which may degrade the overall end-to-end quality of speech. An important problem is how to assess the speech quality of such compounded systems. The object of this paper is to describe the first stage of the construction of a proposed three-layer model for speech quality assessment. A subjective test of the speech quality of 16 different compounded transmission paths (mixtures of PCM, GSM full and half rate, DECT, CELP, LD CELP, FS10-16) is carried out by 40 subjects using 21 different rating scales. The main result of this paper is the test results which lead to the definition of four main perceptual dimensions to be used in the second layer of the proposed model. ** Title: Performance Assessment of Tandem Connection of Cellular and Satellite-Mobile Coders Authors: Simao F. Campos Neto, COMSAT Franklin L. 
Corcoran, COMSAT Ara Karahisar, Teleglobe Volume: 2, Page: 1379 Abstract: In the near future, 16 and 8 kbit/s toll- or near-toll low-rate codecs are expected to be used together with 32 kbit/s digital circuit multiplication equipment, providing speech compression and digital speech interpolation. Additionally, a growing proportion of international calls originate from different digital cellular/satellite mobile (C/SM) systems. Knowledge of the end-to-end voice quality of tandem connections is fundamental in the planning of international circuits. Previous studies assessed tandem performance of cellular codecs and the fixed network, however satellite-mobile systems were not included. This paper presents a subjective evaluation of the voice quality of tandem connections of C/SM codecs in seven basic scenarios. This study concludes that the number of codecs used in tandem should be minimized and network capacity has to be increased for a given traffic load if voice quality cannot be compromised. In extreme cases, calls originating from C/SM terminals should be transmitted using clear channels. ** Title: The Consequences of Linguistic Perception on Low Rate Speech Coding Authors: John J. Parry, University of Wollongong Ian S. Burnett, University of Wollongong Volume: 2, Page: 1383 Abstract: This paper considers the issue of the effect of languages and linguistic perception on low rate speech coding. Current algorithms exploit the redundancies of speech but these redundancies are not common across all languages. Similarly speech coder evaluation techniques do not take into account the nuances of linguistic perception across languages. This paper illustrates some of the linguistic sensitivities experienced by low-rate coders and explores approaches to low-rate coder design. This is achieved through an evaluation of cross-language spectral distortion measures which account for specific linguistic peculiarities influencing linguistic perception. 
** Title: Using a Quantitative Psychoacoustical Signal Representation for Objective Speech Quality Measurement Authors: Martin Hansen, University of Oldenburg Birger Kollmeier, University of Oldenburg Volume: 2, Page: 1387 Abstract: This paper describes the application of a quantitative psychoacoustical signal preprocessing model for objective speech quality measurement. The preprocessing is applied to transform the original and the distorted speech signal to an internal representation which is thought of as the information that is accessible to higher neural stages of perception. From a comparison of these internal representations a quality measure can be derived that shows a high correlation to the subjective MOS data of various test data bases. The inherent parameters of the preprocessing model were derived directly from psychoacoustical data independent of the present study. The detection thresholds of codec-like distortions obtained in a psychoacoustical experiment could also be predicted by the model. This indicates that the internal representation contains the relevant information for detecting perceivable differences. It provides evidence for a direct relation between speech quality and detectability of a distortion. ** Title: A Method of Extracting Time-Varying Acoustic Features Effective for Speech Recognition Authors: Kazuyo Tanaka, Electrotech. Lab. Hiroaki Kojima, Electrotech. Lab. Volume: 2, Page: 1391 Abstract: Feature extraction plays a substantial role in automatic speech recognition systems. In this paper, a method is proposed to extract time-varying acoustic features that are effective for speech recognition. This issue is discussed from two aspects: one is on speech power spectrum enhancement and the other is on discriminative time-varying feature extraction which employs subphonetic units, called demiphonemes, for distinguishing non-steady labels from steady ones. We confirm its potential by applying it to spoken word recognition. 
The results indicate that recognition scores are improved by using the proposed features, compared with those using ordinary features such as delta-mel-cepstra provided by a well-known software tool. ** Title: Elimination of Trajectory Folding Phenomenon: HMM, Trajectory Mixture HMM and Mixture Stochastic Trajectory Model Authors: Irina Illina, CRIN/CNRS, INRIA-Lor. Yifan Gong, Texas Instruments Volume: 2, Page: 1395 Abstract: In this paper, a study of topology of Hidden Markov Model (HMM) used in speech recognition is addressed. Our main contribution is the introduction of the notion of trajectory folding phenomenon of HMM. In complex phonetic contexts and in speaker-variability, this phenomenon degrades the discriminability of HMM. The goal of this paper is to give some explanation and experimental evidence suggesting the existence of this phenomenon. The systems eliminating (partially or entirely) the trajectory folding are HMM with a special topology, called Trajectory Mixture HMM (TMHMM), and a Mixture Stochastic Trajectory Model (MSTM), proposed recently. HMM, TMHMM and MSTM have been tested on a 1011 words vocabulary, speaker dependent and multi-speaker continuous French speech recognition task. With similar number of model parameters, TMHMM and MSTM cut down the error rate produced by the HMM, which confirms our hypothesis. ** Title: Linear dynamic segmental HMMs: variability representation and training procedure Authors: Wendy J. Holmes, DRA Malvern Martin J. Russell, DRA Malvern Volume: 2, Page: 1399 Abstract: This paper describes investigations into the use of linear dynamic segmental hidden Markov models (SHMMs) for modelling speech feature-vector trajectories and their associated variability. These models use linear trajectories to describe how features change over time, and distinguish between extra-segmental variability of different trajectories and intra-segmental variability of individual observations around any one trajectory. 
Analyses of mel cepstrum features have indicated that a linear trajectory is a reasonable approximation when using models with three states per phone. Good recognition performance has been demonstrated with linear SHMMs. This performance is, however, dependent on the model initialisation and training strategy, and on representing the distributions accurately according to the model assumptions. ** Title: Model parameter estimation for mixture density polynomial segment models Authors: Toshiaki Fukada, ATR Yoshinori Sagisaka, ATR Kuldip K. Paliwal, ATR Volume: 2, Page: 1403 Abstract: In this paper, we propose parameter estimation techniques for mixture density polynomial segment models (henceforth MDPSM) where their trajectories are specified with an arbitrary regression order. MDPSM parameters can be trained in one of three different ways : (1) segment clustering, (2) expectation maximization (EM) training of mean trajectories, or (3) EM training of mean and variance trajectories. These parameter estimation methods were evaluated in TIMIT vowel classification experiments. The experimental results showed that modeling both the mean and variance trajectories are consistently superior to modeling only the mean trajectory. We also found that modeling both trajectories results in significant improvements over the conventional HMM. ** Title: The Importance of Segmentation Probability in Segment Based Speech Recognizers Authors: Jan Verhasselt, RUG Jean-Pierre Martens, RUG Irina Illina, CRIN/CNRS, INRIA-Lorraine, Nancy Jean-Paul Haton, CRIN/CNRS, INRIA-Lorraine, Nancy Yifan Gong, PSL, TI Volume: 2, Page: 1407 Abstract: In segment based recognizers, variable length speech segments are mapped to the basic speech units (phones, diphones,...). In this paper, we address the acoustical modeling of these basic units in the framework of segmental posterior distribution models (SPDM). 
The joint posterior probability of a unit sequence $\underline{u}$ and a segmentation $\underline{s}$, $\Pr(\underline{u},\underline{s}\,|\,\underline{x})$, can be written as the product of the segmentation probability $\Pr(\underline{s}\,|\,\underline{x})$ and the unit classification probability $\Pr(\underline{u}\,|\,\underline{s},\underline{x})$, where $\underline{x}$ is the sequence of acoustic observation parameter vectors. In particular, we point out the role of the segmentation probability and demonstrate that it does improve the recognition accuracy. We present evidence for this in two different tasks (speaker dependent continuous word recognition in French and speaker independent phone recognition in American English) in combination with two different unit classification models. ** Title: Adaptation of Polynomial Trajectory Segment Models for Large Vocabulary Speech Recognition Authors: Ashvin Kannan, Boston University Mari Ostendorf, Boston University Volume: 2, Page: 1411 Abstract: Segment models are a generalization of HMMs that can represent feature dynamics and/or correlation in time. In this work we develop the theory of Bayesian and maximum-likelihood adaptation for a segment model characterized by a polynomial mean trajectory. We show how adaptation parameters can be shared and adaptation detail can be controlled at run-time based on the amount of adaptation data available. Results on the Switchboard corpus show error reductions for unsupervised transcription mode adaptation and supervised batch mode adaptation. ** Title: Speaker adaptation experiments using nonstationary-state hidden Markov models: A MAP approach Authors: Chengalv Rathinavelu, University of Waterloo Li Deng, University of Waterloo Volume: 2, Page: 1415 Abstract: In this paper, we report our recent work on applications of the MAP approach to estimating the time-varying polynomial Gaussian mean functions in the nonstationary-state or trended HMM. 
Assuming uncorrelatedness among the polynomial coefficients in the trended HMM, we have obtained analytical results for the MAP estimates of the time-varying mean and precision parameters. We have implemented a speech recognizer based on these results in speaker adaptation experiments using TI46 corpora. Experimental results show that the trended HMM always outperforms the standard, stationary-state HMM and that adaptation of polynomial coefficients only is better than adapting both polynomial coefficients and precision matrices when fewer than four adaptation tokens are used. ** Title: Vocabulary optimization based on perplexity Authors: Kyuwoong Hwang, ETRI, Taejon Volume: 2, Page: 1419 Abstract: In this paper, we suggest a method to optimize the vocabulary for a given task using the perplexity criterion. The optimization allows us to reduce the size of the vocabulary at the same perplexity of the original word based vocabulary or to reduce perplexity at the same vocabulary size. This new approach is an alternative to phoneme n-gram language model in the speech recognition search stage. We show the convergence of our approach on the Korean training corpus. This method may provide an optimized speech recognizer for a given task. We used phonemes, syllables, morphemes as the basic units for the optimization and reduced the size of the vocabulary to the half of the original word vocabulary size for the morpheme case. ** Title: REMAP for Video Soundtrack Indexing. Authors: Philippe Gelin, Institut Eurecom Christian J. Wellekens, Institut Eurecom Volume: 2, Page: 1423 Abstract: Indexing of video soundtracks is an important issue for the navigation in multimedia databases. Based on wordspotting techniques, it should meet very constraining specifications; namely fast response to queries, concise processed speech information for limiting the storage memory, speaker independent mode, easy characterization of any word by its phonemic spelling. 
A solution based on phonemic lattices and on a division of the indexing process into an off-line and an on-line part is proposed in this paper. Previous works [1][2] based on frame labelling and Maximum Likelihood criterion are now modified to take into account this new approach based on a Maximum a Posteriori (MAP) criterion. The REMAP algorithm [3] implements this MAP criterion for training. It has several advantages such as maximizing the global discriminant criterion, avoiding the difficult problem of phoneme transition detection during the training process and being well suited for a hybrid Hidden Markov Model (HMM) and Neural Network (NN) approach. ** Title: Robust Pitch Detection of Speech Signals Using Steerable Filters Authors: Jinhai Cai, University of Melbourne Zhi-Qiang Liu, University of Melbourne Volume: 2, Page: 1427 Abstract: Most of the well known and widely used pitch determination algorithms are frame-based. They only consider the speech local stationarity within the analysis frame. However, our novel pitch determination algorithms employ the steerable filters to obtain the direction of pitch change. Therefore, the proposed algorithms not only make full use of the information within an analysis frame, but also optimally utilize the information from neighbor frames by taking the advantage of the pitch direction. This allows us to use more than one frame to enhance pitch peaks for non-stationary, noisy speech signals. As a result, the proposed algorithms are superior to conventional methods in terms of accuracy and reliability, and are robust to noise. Besides, the direction of pitch change can be estimated in different domains. Therefore, our algorithms can be applied in either time or frequency domain, or both of them. 
** Title: Evaluation Of The Relationship Between Emotional Concepts And Emotional Parameters On Speech Authors: Tsuyoshi Moriyama, Keio University Hideo Saito, Keio University Shinji Ozawa, Keio University Volume: 2, Page: 1431 Abstract: In this paper, we propose the linear model of the relationship between the physical changes in speech and perceived emotional concepts. We make use of orthogonal bases instead of emotional words and physical parameters themselves in order to avoid dependence on the method of selecting words and parameters. Furthermore we regard the emotions that listeners perceive from speech as the standard of emotional concepts because the emotions that speakers intended rely on personality and temporary psychological state. Evaluation for relative information indicates that the proposed linear model is representable for the relationship between physical quantities and psychological quantities in speech. ** Title: Time-Frequency Analysis of the Glottal Opening Authors: Wolfgang Wokurek, IMS Uni-Stuttgart Volume: 2, Page: 1435 Abstract: Simultaneous recordings of the laryngograph signal and speech recorded in a non-reverberating environment are investigated for acoustic evidence of the glottal opening within the microphone signal. It is demonstrated that the high resolution time-frequency analysis of the microphone signal by the smoothed pseudo Wigner distribution (SPWD) shows responses of the vocal tract to both the glottal closure and the glottal opening. Thus, a convolution-based model for the relation between the laryngograph signal and the microphone signal is evaluated. It turns out that the microphone signal may be viewed as a filtered version of a power function of the laryngograph signal. Hence, such a nonlinear processed laryngograph signal may be an appropriate model for the acoustic excitation of the vocal tract. 
** Title: Time-frequency structured decorrelation of speech signals via nonseparable Gabor frames Authors: Werner Kozek, University of Vienna Hans Georg Feichtinger, University of Vienna Volume: 2, Page: 1439 Abstract: We present a new approach to the linear representation of speech signals that combines desirable structure, computational efficiency and almost decorrelation. The basic principle is a statistically adapted, group-theoretical modification of the classical Gabor expansion. In contrast to traditional linear time-frequency (TF) representations which always correspond to a separable tiling of the TF plane, we suggest the use of a hexagonal (thus nonseparable) tiling whose parameters are matched to the TF correlation of the speech signal. We estimate the TF correlation via a pitch-adapted Zak-transform motivated by modeling the vocal tract as underspread system. The TF correlation determines both the optimum tiling and the optimum window. ** Title: Generalized Mixture of HMMS for Continuous Speech Recognition Authors: Filipp Korkmazskiy, Bell Labs Biing-Hwang Juang, Bell Labs Frank Soong, Bell Labs Volume: 2, Page: 1443 Abstract: This paper presents a new technique for modeling heterogeneous data sources such as speech signals received via distinctly different channels. Such a scenario arises when an automatic speech recognition system is deployed in wireless telephony in which highly heterogeneous channels coexist and interoperate. The problem is that a simple model may become inadequate to describe accurately the diversity of the signal, resulting in an unsatisfactory recognition performance. To deal with such a problem, we propose a Generalized Mixture Model (GMM) approach. For speech signals, in particular, we use mixtures of hidden Markov models (i.e., GMHMM, Generalized Mixture of HMM's). 
By applying discriminative training for GMHMM we obtained a 1.0% word error rate for the recognition of digit strings from the wireless database, compared to a 1.4% word error rate for the conventional HMM based discriminative technique. ** Title: Writer adaptation of a HMM handwriting recognition system Authors: Andrew W. Senior, IBM Research Krishna S. Nathan, IBM Research Volume: 2, Page: 1447 Abstract: This paper describes a scheme to adapt the parameters of a tied-mixture, hidden Markov model, on-line handwriting recognition system to improve performance on new writers' handwriting. The means and variances of the distributions are adapted using the Maximum Likelihood Linear Regression technique. Experiments are performed with a number of new writers in both supervised and unsupervised modes. Adaptation on data quantities as small as 5 words is found to result in models with a 6% lower error rate than the writer independent model. ** Title: In-Service Adaptation of Multilingual Hidden-Markov-Models Authors: Udo Bub, Siemens AG Joachim Kohler, Siemens AG Bojan Imperl, University of Maribor Volume: 2, Page: 1451 Abstract: In this paper we report on advances regarding our approach to porting an automatic speech recognition system to a new target task. If there is not enough acoustic data available to allow for thorough estimation of HMM parameters, it is impossible to train an appropriate model. The basic idea to overcome this problem is to create a task independent seed model that can cope with all tasks equally well. However, the performance of such a generalist model is of course lower than the performance of task dependent models (if these were available). So, the seed model is gradually enhanced by using its own recognition results for incremental online task adaptation. Here, we use a multilingual Romance/Germanic seed model for a Slavic target task.
In tests on Slovene digits, multilingual modeling yields the best recognition accuracy compared to other language dependent models. Applying unsupervised online task adaptation, we observe a remarkable boost in recognition performance. ** Title: Development of Dialect-Specific Speech Recognizers Using Adaptation Methods Authors: Vassilios Diakoloukas, TUC Vassilios Digalakis, TUC Leonardo Neumeyer, STAR-SRI Jaan Kaja, Telia Volume: 2, Page: 1455 Abstract: Several adaptation approaches have been proposed in an effort to improve speech recognition performance in mismatched conditions. However, the application of these approaches has been mostly constrained to speaker or channel adaptation tasks. In this paper, we first investigate the effect of mismatched dialects between training and testing speakers in an Automatic Speech Recognition (ASR) system. We find that a mismatch in dialects significantly influences the recognition accuracy. Consequently, we apply several adaptation approaches to develop a dialect-specific recognition system using a dialect-dependent system trained on a different dialect and a small number of training sentences from the target dialect. We show that adaptation improves recognition performance dramatically with small amounts of training sentences. We further show that, although the recognition performance of traditionally trained systems degrades sharply as we decrease the number of training speakers, the performance of adapted systems is much less affected. ** Title: Syllable-Based Relevance Feedback Techniques for Mandarin Voice Record Retrieval Using Speech Queries Authors: Bo-Ren Bai, NTU Lee-Feng Chien, Academia Sinica Lin-Shan Lee, NTU Volume: 2, Page: 1459 Abstract: To cope with the fast growth of audio resources on the Internet, we have presented a syllable-based approach which is capable of retrieving Mandarin voice records using queries of unconstrained speech.
However, the performance achieved by this previously proposed approach is still not satisfactory, and one of the reasons is that very often the information provided by the speech query for the requested subject may not be sufficient. In this paper, we present approaches based on the relevance feedback technique to improve the performance of the previous research. The proposed approaches include a relevance measure adjustment scheme using a relevance table for the voice database, a query expansion scheme to generate a new query including the feedback information, and a combination of these two schemes. Extensive preliminary experiments were performed and encouraging results were demonstrated. ** Title: Automatic Alternative Transcription Generation and Vocabulary Selection for Flexible Word Recognizers Authors: Doroteo Torre, TID Luis Villarrubia, TID Jose Maria Elvira, TID Luis Hernandez-Gomez, ETSIT-UPM Volume: 2, Page: 1463 Abstract: With the emergence of new Voice Response Systems that use Flexible Vocabulary Recognizers (FVRs), prediction of word confusabilities has received increasing interest during the last few years. In this contribution we present a new method for estimating transcription confusabilities based on a new statistical modelling criterion. We propose the use of the new transcription confusability measure in two different word error rate (WER) reduction procedures for FVRs: an automatic vocabulary selection procedure suitable for those applications where the set of vocabulary words is not totally defined by the application, and an automatic procedure for generation of alternative transcriptions. Experimental results using a telephonic database show 20% WER relative reduction using the automatic alternative transcription generation procedure for a 37 word vocabulary, and over 50% (20%) WER relative reduction using our unrestricted (restricted by groups of synonyms) vocabulary selection procedure instead of random word selection.
** Title: An Advanced System to Generate Pronunciations of Proper nouns Authors: Neeraj Deshmukh, ISIP Julie Ngan, ISIP Jonathan Hamaker, ISIP Joseph Picone, ISIP Volume: 2, Page: 1467 Abstract: Accurate recognition of proper nouns is a critical component of automatic speech recognition (ASR). Since there are no obvious letter-to-sound conversion rules that govern the pronunciation of any large set of proper nouns, this is an open-ended problem that evolves constantly under various sociolinguistic influences. A Boltzmann machine neural network is well-suited for the task of generating the most likely pronunciations of a proper noun. This pronunciation output can be used to build better acoustic models for the noun that result in improved recognition performance. We present here an advanced version of this N-best pronunciations system, and a multiple pronunciations dictionary of 18000 surnames and 25000 pronunciations used as a training database. The database and software are available in the public domain. ** Title: Automatic Pronunciation Scoring for Language Instruction Authors: Horacio Franco, SRI International Leonardo Neumeyer, SRI International Yoon Kim, SRI International Orith Ronen, SRI International Volume: 2, Page: 1471 Abstract: In this work we address the task of grading the pronunciation quality of the speech of a student of a foreign language. The automatic grading system uses SRI's Decipher continuous speech recognition system to generate phonetic segmentations. Based on these segmentations and probabilistic models we produce pronunciation scores for individual sentences or groups of sentences. Scores obtained from expert human listeners are used as the reference to evaluate the different machine scores and to provide targets when training some of the algorithms. In previous work we had found that duration-based scores outperformed HMM log-likelihood-based scores.
In this paper we show that we can significantly improve HMM-based scores by using average phone segment posterior probabilities. Correlation between machine and human scores went up from r=0.50 with likelihood-based scores to r=0.88 with posterior-based scores. The posterior-based scores also outperformed duration-based scores, mainly when few sentences are used to compute a score. ** Title: Speaker-Independent Name Dialing With Out-of-Vocabulary Rejection Authors: Coimbatore S. Ramalingam, Texas Instruments Lorin P. Netsch, Texas Instruments Yu-Hung Kao, Texas Instruments Volume: 2, Page: 1475 Abstract: In this paper we propose a system for speaker-independent name dialing in which a name enrolled by a user can be used by other members of a family or co-workers in an office. We use speaker-independent sub-word models during enrollment; the recognized sub-word string is later used during recognition. We also present a mechanism for rejecting out-of-vocabulary (OOV) phrases. The best in-vocabulary (IV) correct and OOV rejection performance for other speakers is 90%/60% (IV/OOV) on a database containing eighteen speakers. If the orthography is known, the best performance is 96%/65%. ** Title: Hidden Understanding Models For Statistical Sentence Understanding Authors: Richard Schwartz, BBN Systems and Technologies 70 Fawcett Street, Cambridge, MA 02138 Scott Miller, BBN Systems and Technologies 70 Fawcett Street, Cambridge, MA 02138 David Stallard, BBN Systems and Technologies 70 Fawcett Street, Cambridge, MA 02138 John Makhoul, BBN Systems and Technologies 70 Fawcett Street, Cambridge, MA 02138 Volume: 2, Page: 1479 Abstract: We describe the first sentence understanding system that is completely based on learned methods, both for understanding individual sentences and for determining their meaning in the context of preceding sentences. We divide the problem into three stages: semantic parsing, semantic classification, and discourse modeling.
Each of these stages requires a different model. When we ran this system on the last test (December, 1994) of the ARPA Air Travel Information System (ATIS) task, we achieved a 13.7% error rate. The error rate for those sentences that are context-independent (class A) was 9.7%. ** Title: An alternative scheme for perplexity estimation Authors: Frederic Bimbot, ENST / SIG - CNRS / URA-820 Marc El-Beze, LIA Michele Jardino, LIMSI - CNRS Volume: 2, Page: 1483 Abstract: Language models are usually evaluated on test texts using the perplexity derived directly from the model likelihood function. In order to use this measure in the framework of a comparative evaluation campaign, we have developed an alternative scheme for perplexity estimation. The method is derived from the Shannon game and based on a gambling approach on the next word to come in a truncated sentence. We also use entropy bounds proposed by Shannon and based on the rank of the correct answer, in order to estimate a perplexity interval for non-probabilistic language models. The relevance of the approach is assessed on an example. ** Title: Extensions to Phone-State Decision-Tree Clustering: Single Tree and Tagged Clustering Authors: Douglas B. Paul, Dragon Systems, Inc. Volume: 2, Page: 1487 Abstract: The following article describes two extensions to the traditional decision tree methods for clustering allophone HMM states in LVCSR systems. The first, single tree clustering, combines all allophone states of all phones into a single tree. This can be used to improve performance for very small systems. The single tree clustering structure can also be exploited for speaker and channel adaptation and is shown to provide a 30 percent reduction in the error rate for an LVCSR task under matched channel conditions and a greater reduction under mismatched channel conditions. The second, tagged clustering, is a mechanism for providing additional information to the clustering procedure.
The tags are labels for any of a wide variety of factors, such as stress, placed on the triphones. These tags are then accessible to the clustering process. Small improvements in recognition performance were obtained under certain conditions. Both methods can be combined. ** Title: Evaluation of fast algorithms for finding the nearest neighbor Authors: Stephane Lubiarz, Matra Com. Philip Lockwood, Matra Com. Volume: 2, Page: 1491 Abstract: In speech recognition systems as well as in speech coders using vector quantization, the search for the nearest neighbor is a computationally intensive task. In this paper, we address the problem of fast nearest neighbor search. State of the art solutions tend to approach logarithmic access time. The problem is that such performance is generally achieved at the expense of a significant increase in storage requirements. In this contribution, we compare several known approaches and propose new extensions. These new contributions allow for a significant reduction in memory requirements without impacting the performance in terms of number of distances computed and optimality of the search. ** Title: Fusion of Visual and Acoustic Signals for Command-Word Recognition Authors: Rudolf Kober, FAW Ulm Ulrich Harz, FAW Ulm Jutta Schiffers, FAW Ulm Volume: 2, Page: 1495 Abstract: In this paper, we investigate the question of how the visual information of lip movement contributes to command-word recognition. The fusion of the acoustic and visual signal can be carried out either at the feature level or at the class level. Integration at the feature level means merging the acoustic and visual features to yield a combined feature vector which is fed into an HMM system. Fusion at the class level means separate classification of the two sources of information and combination of the classification results. An HMM classifier is used for the acoustic signal and three different classifiers (HMM, DTW and ClaRe) for the visual signal.
The classification results are combined using C4.5. The recognition rates of both fusion schemes are comparable. Both yield small improvements at high SNRs using the acoustic/visual system in comparison to the acoustic system alone. Larger improvements (up to 12%) result at low SNRs. ** Title: Difference in visual information between face to face and telephone dialogues Authors: Yuri Iwano, Waseda University Yosuke Sugita, Waseda University Yusuke Kasahara, Waseda University Shu Nakazato, Waseda University Katsuhiko Shirai, Waseda University Volume: 2, Page: 1499 Abstract: In this research, we analyzed conversations between a pair of subjects under two conditions. One is face-to-face conversation, which has visual contact, and the other is conversation over a telephone line, which does not. From the recorded videotape we extracted the subjects' actions, focusing especially on head movements. Comparing the dialogues under the two conditions, it seems that there are two types of head movements: one intended to give a response to the partner and the other to send some signal. We are going to analyze how visual information contributes to spoken dialogue perception, and the possibility of adopting it in a multi-modal human interface. ** Title: Cepstrum-based filter-bank design using discriminative feature extraction training at various levels Authors: Alain Biem, ATR HIP. Shigeru Katagiri, ATR HIP. Volume: 2, Page: 1503 Abstract: This paper investigates the realization of optimal filter bank-based cepstral parameters. The framework is the Discriminative Feature Extraction method (DFE), which iteratively estimates the filter-bank parameters according to the errors that the system makes. Various parameters of the filter-bank, such as center frequency, bandwidth, and gain, are optimized using a string-level optimization and a frame-level optimization scheme.
Application to vowel and noisy telephone speech recognition tasks shows that the DFE method realizes a more robust classifier by appropriate feature extraction. ** Title: Minimum Error Rate Training for Designing Tree-Structured Probability Density Function Authors: Wu Chou, Bell Labs. Volume: 2, Page: 1507 Abstract: In this paper, we propose a signal prototype classification and evaluation framework in acoustic modeling. Based on this framework, a new tree-structured likelihood function is derived. It uses a designated cluster kernel $f_{m}^{C}$ for signal prototype classification and a designated cluster kernel $f_{m}^{L}$ for likelihood evaluation of outlier or tail events of the cluster. A minimum classification error (MCE) rate training approach is described for designing the tree-structured likelihood function. Experimental results indicate that the new tree-structured likelihood function significantly improves the acoustic resolution of the model. It also yields a larger speedup in decoding than that obtained from the conventional approach. ** Title: A Frequency-Weighted HMM Based on Minimum Error Classification for Noisy Speech Recognition Authors: Hiroshi Matsumoto, Shinshu University Masanori Ono, Shinshu University Volume: 2, Page: 1511 Abstract: As a noise robust HMM, we previously proposed a frequency-weighted HMM (HMM-FW) whose covariance matrices are replaced by the inverse of frequency-weighting matrices. In this HMM, the frequency-weighting parameters were common to all classes and states, and were experimentally adjusted. In order to achieve further noise robustness, this paper examines class- and state-dependent weighting parameters and minimum error classification (MCE) training of their weighting characteristics. Using the NOISEX-92 database, the MCE-trained HMM-FWs are shown to be more robust even under untrained noise conditions than both the previous HMM-FW and conventional HMM.
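As a rough illustration of the HMM-FW idea in the preceding abstract, replacing a covariance matrix by the inverse of a frequency-weighting matrix means the weighting matrix plays the role of the Gaussian precision matrix. The sketch below evaluates such a log-density. The function name, the diagonal example weights, and the inclusion of the normalization term are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def weighted_log_density(x, mean, W):
    """Gaussian log-density with covariance W^{-1}, i.e. precision matrix W."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    W = np.asarray(W, dtype=float)
    d = x.shape[0]
    diff = x - mean
    sign, logdet = np.linalg.slogdet(W)
    if sign <= 0:
        raise ValueError("W must be positive definite")
    # log N(x; mean, W^{-1}) = 0.5*log|W| - 0.5*d*log(2*pi) - 0.5*diff^T W diff
    return 0.5 * logdet - 0.5 * d * np.log(2.0 * np.pi) - 0.5 * diff @ W @ diff

# Hypothetical weights: the first component counts four times as much
W = np.diag([4.0, 1.0])
mean = np.zeros(2)
penalized_more = weighted_log_density([1.0, 0.0], mean, W)
penalized_less = weighted_log_density([0.0, 1.0], mean, W)
```

A unit deviation in the heavily weighted component lowers the log-density more than the same deviation in the lightly weighted one, which is the effect a frequency weighting is meant to have.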
** Title: Dictionary-Based Discriminative HMM Parameter Estimation for Continuous Speech Recognition Systems Authors: Daniel Willett, Duisburg University Christoph Neukirchen, Duisburg University Jorg Rottland, Duisburg University Volume: 2, Page: 1515 Abstract: The estimation of the HMM parameters has always been a major issue in the design of speech recognition systems. Discriminative objectives like Maximum Mutual Information (MMI) or Minimum Classification Error (MCE) have proved to be superior to the common Maximum Likelihood Estimation (MLE) in cases where a robust estimation of the probability density functions (pdfs) is not possible. The determination of the overall likelihood of an acoustic observation is the most crucial point of the MMI parameter estimation when applied to continuous speech systems. Contrary to the common approaches that estimate the overall likelihood of the training observations by evaluating the most confusing sentences or by applying global state frequencies, this paper suggests performing a dictionary analysis in order to get estimates for the dictionary-based risk of mixing up each pair of HMM states. These estimates are used to estimate the observations' likelihood and to control the discriminative MMI training procedure. Results on a monophone SCHMM speech recognition system are presented that demonstrate the practicability of the new approach. ** Title: A DFE-Based Algorithm For Feature Selection In Speech Recognition Authors: Angel de la Torre, University of Granada Antonio M. Peinado, University of Granada Antonio J. Rubio, University of Granada Victoria Sanchez, University of Granada Volume: 2, Page: 1519 Abstract: Algorithms that reduce the number of features without degrading the performance of pattern recognition systems play an important role in real applications. In this work a new algorithm for feature selection is proposed.
This algorithm is based on the Discriminative Feature Extraction (DFE) technique and has been applied to speech recognition. The experimental results show that the recognition systems tolerate substantial reductions in the number of features without a degradation of performance. For the representation used in our experiments, the recognition error rate is not significantly increased when the number of components in the feature vector is reduced from 42 to 20. ** Title: Robustness Issues and Solutions in Speech Recognition Based Telephony Services Authors: Vijay Raman, NYNEX S&T Inc. Vidhya Ramanujam, NYNEX S&T Inc. Volume: 2, Page: 1523 Abstract: HMM-based algorithms for speaker-dependent recognition as well as speaker-independent recognition form the basis of speech services developed at NYNEX S&T and deployed widely by NYNEX and other telephone service providers. Based on the analysis of the initially deployed VoiceDialing service, robustness of these algorithms was recognized to be a dominant issue. In this paper, we discuss the features of a high-performance, robust speaker-dependent recognition algorithm, and include some deployment issues that were successfully resolved. ** Title: Speaker-Dependent Speech Recognition Based on Phone-Like Units Models --- Application to Voice Dialing Authors: Vincent Fontaine, FPMS - TCTS Herve Bourlard, FPMS - TCTS Volume: 2, Page: 1527 Abstract: This paper presents a speaker dependent speech recognition system with application to voice dialing. This work has been developed under the constraints imposed by voice dialing applications, i.e., low memory requirements and limited training material. Two methods for producing speaker dependent word baseforms based on Phone Like Units (PLU) are presented and compared: (1) a classical vector quantizer is used to divide the space into regions associated with PLUs; (2) a speaker independent hybrid HMM/MLP recognizer is used to generate speaker dependent PLU based models.
This work shows that very low error rates can be achieved even with very simple systems, namely a DTW-based recognizer. However, the best results are achieved when using the hybrid HMM/MLP system to generate the word baseforms. Finally, a realtime demonstration simulating voice dialing functions and including keyword spotting and rejection capabilities has been set up and can be tested online. ** Title: Enhanced Control and Estimation of Parameters for a Telephone Based Isolated Digit Recognizer Authors: Josef G. Bauer, Siemens AG Volume: 2, Page: 1531 Abstract: The paper studies the use of discriminative techniques for a telephone based isolated digit recognizer with respect to a reduced system complexity. The combination of Linear Discriminant Analysis (LDA) and Minimum Error Classification (MEC) training provides improved system performance at reduced costs for the training process and for the application. Experiments are performed on an isolated digit database recorded over public lines including approximately 700 speakers. The use of a single linear transformation matrix based on LDA allows the use of density modeling that does not consider variances explicitly, at a high recognition rate. Minimum Classification Error training is found to perform best with a small number of system parameters. An error rate reduction of up to 80% was achieved by the combination of the two methods for such a system configuration. ** Title: HTIMIT and LLHDB: Speech Corpora for the Study of Handset Transducer Effects Authors: Douglas A. Reynolds, MIT Lincoln Laboratory Volume: 2, Page: 1535 Abstract: This paper describes two corpora collected at Lincoln Laboratory for the study of handset transducer effects on the speech signal: the handset TIMIT (HTIMIT) corpus and the Lincoln Laboratory Handset Database (LLHDB). The goal of these corpora is to minimize all confounding factors and produce speech differing only in transducer effects.
The speech is recorded directly from a telephone unit in a sound-booth using prompted text and extemporaneous descriptions. The two corpora allow comparison of speech collected from a person speaking into a handset (LLHDB) versus speech played through a loudspeaker into a handset (HTIMIT). A comparison of results between the two corpora addresses the realism of artificially creating handset degraded speech by playing recorded speech through handsets. The corpora are designed primarily for speaker recognition experimentation, but since both speaker and speech recognition systems use the same acoustic features affected by the handset, knowledge gleaned is directly transferable to speech recognizers. Initial speaker identification performance on these corpora is presented. In addition, the application of HTIMIT in developing a handset detector that was successfully used on a Switchboard speaker verification task is described. ** Title: Robustness Improvements in Continuously Spelled Names over the Telephone Authors: Michael Galler, STL Jean-Claude Junqua, STL Volume: 2, Page: 1539 Abstract: A speaker-independent speech recognizer for continuously spelled names, implemented for a switchboard call-routing task, is analyzed for sources of error. Results indicate most errors are due to extraneous speech and end-point detection errors. Strategies are proposed for improving the robustness of recognition, including tolerance for speech with pauses, and a letter-spotting strategy to handle extraneous speech. Experimental results on laboratory data indicate that with the letter-spotting method, name retrieval error rate is reduced by 60.1% on noisy signals or signals with extraneous speech, while it is increased on clean signals from 4.5% to 5.5%. On data collected during a telephone field trial, error is reduced by 54.1% in offline tests by introducing the letter-spotting algorithm.
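The figures in the preceding abstract mix relative reductions with absolute error rates. The short sketch below works through the arithmetic, assuming the 60.1% and 54.1% values are relative reductions. The helper name and the 10% baseline are hypothetical and do not come from the paper.

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after` (0.25 means +25%)."""
    return (after - before) / before

# Clean-signal error rate quoted in the abstract: 4.5% rising to 5.5%
clean_change = relative_change(4.5, 5.5)     # roughly +0.22, a relative increase

# Hypothetical baseline of 10% error, used only to show what a 60.1%
# relative reduction would mean in absolute terms
hypothetical_baseline = 10.0
after_reduction = hypothetical_baseline * (1.0 - 0.601)
```

Read this way, the one-point absolute increase on clean signals is a relative increase of roughly 22%, which is the cost to weigh against the reported gains on noisy signals.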
** Title: A Fast Algorithm for Stochastic Matching with Application to Robust Speaker Verification Authors: Qi Li, Bell Labs S. Parthasarathy, Bell Labs Aaron E. Rosenberg, Bell Labs Volume: 2, Page: 1543 Abstract: Acoustic mismatch between training and test environments is one of the major problems in telephone-based speaker recognition. Speaker recognition performance is degraded when an HMM trained under one set of conditions is used to evaluate data collected from different telephone channels, microphones, etc. The mismatch can be approximated as a linear transform in the cepstral domain. In this paper, we present a fast, efficient algorithm to estimate the parameters of the linear transform for real-time applications. Using the algorithm, test data are transformed toward the training conditions by rotation, scale, and translation without destroying the detailed characteristics of speech. Speaker dependent HMM's can then be used to evaluate the details under the same conditions as training. Compared to cepstral mean subtraction (CMS) and other bias removal techniques, the proposed linear transform is more general since CMS and others only consider translation; compared to maximum-likelihood approaches for stochastic matching, the proposed algorithm is simpler and faster since iterative techniques are not required. The proposed algorithm improves the performance of a speaker verification system in the experiments reported in this paper. ** Title: A Bayesian Predictive Classification Approach to Robust Speech Recognition Authors: Qiang Huo, ATR-ITL Hui Jiang, University of Tokyo Chin Hui Lee, Bell Labs Volume: 2, Page: 1547 Abstract: We introduce a new Bayesian predictive classification (BPC) approach to robust speech recognition and apply the BPC framework to Gaussian mixture continuous density hidden Markov model based speech recognition. We propose and focus on one of the approximate BPC approaches, called quasi-Bayesian predictive classification (QBPC).
In comparison with the standard plug-in maximum a posteriori decoding, when the QBPC method is applied to speaker independent recognition of a confusable vocabulary, namely 26 English letters, where a broad range of mismatches between training and testing conditions exists, the QBPC achieves around 14% relative recognition error rate reduction. When the QBPC method is applied to cross-gender testing on a less confusable vocabulary, namely 20 English digits and commands, it achieves around 24% relative recognition error rate reduction. ** Title: Robust Speech Recognition Based on Viterbi Bayesian Predictive Classification Authors: Hui Jiang, University of Tokyo Keikichi Hirose, University of Tokyo Qiang Huo, ATR Volume: 2, Page: 1551 Abstract: In this paper, we investigate a new Bayesian predictive classification (BPC) approach to realize robust speech recognition when there exist mismatches between training and test conditions but no accurate knowledge of the mismatch mechanism is available. A specific approximate BPC algorithm called Viterbi BPC (VBPC) is proposed for both isolated word and continuous speech recognition. The proposed VBPC algorithm is compared with the conventional Viterbi decoding algorithm on speaker-independent isolated digit and connected digit string (TIDIGITS) recognition tasks. The experimental results show that VBPC can considerably improve robustness when mismatches exist between training and testing conditions. ** Title: Efficient Mixed Excitation Models in LPC Based Prototype Interpolation Speech Coders Authors: Charalampos Papanastasiou, University of Manchester Costas Xydeas, University of Manchester Volume: 2, Page: 1555 Abstract: This paper presents a new and efficient method for modelling voiced, mixed excitation spectra in Sinusoidal (SC) and Prototype Interpolation Coding (PIC) systems. Speech harmonics are classified as weak-voiced or strong-voiced by simply examining the short-term residual magnitude spectrum.
This information is encoded effectively in terms of fixed width frequency bands and is used to control sets of periodic and random sine wave oscillators which model the short-term mixed excitation nature of speech. In this way the model allows the mixing of periodic and random signal energy on a harmonic basis. The proposed methodology has been used in a 2.4 kbits/sec speech coder, whose recovered speech quality is better than that of the 4.8 kbits/sec DoD standard. ** Title: High Quality Split-Band LPC Vocoder Operating at Low Bit Rates Authors: Ian Atkinson, University of Surrey Centre for Satellite Eng. Research Suat Yeldener, University of Surrey Centre for Satellite Eng. Research Ahmet Kondoz, University of Surrey Centre for Satellite Eng. Research Volume: 2, Page: 1559 Abstract: LPC based speech coders operating at bit rates below 3.0 kbits/sec are usually associated with buzzy or metallic artefacts in the synthetic speech. These are mainly attributable to the simplifying assumptions made about the excitation source, which are usually required to maintain such low bit rates. In this paper a new LPC vocoder is presented which splits the LPC excitation into two frequency bands using a variable cut-off frequency. The lower band is responsible for representing the voiced parts of speech, whilst the upper band represents unvoiced speech. In doing so, the coder's performance during both mixed voicing speech and speech containing acoustic noise is greatly improved, producing soft natural sounding speech. The paper also describes new parameter determination and quantisation techniques vital to the operation of this coder at such low bit rates. ** Title: Non-linear Techniques for Pitch and Waveform Enhancement in PWI Coders Authors: Hui Li, University of Leeds Gordon B.
Lockhart, University of Leeds Volume: 2, Page: 1563 Abstract: Two non-linear interpolation techniques are introduced for enhancing speech reproduction in Prototype Waveform Interpolation (PWI) and similar encoders. A Temporal Differential Rate (TDR) vector is used to characterise the non-uniform evolution of pitch cycle temporal structure during interpolation. Experimental results show a clear improvement in the accuracy of decoded pitch cycle lengths and in the reproduction of periodicity in general. It is also shown that waveform reproduction can be significantly improved by vector quantising sets of Optimal Combination Coefficients (OCC) aimed at maximising the similarity between interpolated and target signal segments. OCC derived from both time domain waveform similarity and frequency domain spectral envelope similarity are tested. Subjective assessment suggests a general preference for non-linear interpolation methods, and the scheme using frequency domain derived OCC with perceptual weighting was the most preferred. ** Title: Multi-Prototype Waveform Coding Using Frame-by-Frame Analysis-by-Synthesis Authors: Ian S. Burnett, University of Wollongong Duong H. Pham, University of Wollongong Volume: 2, Page: 1567 Abstract: A new mechanism for using Analysis-by-Synthesis techniques in low rate Waveform Interpolation based coders is introduced. The algorithm, implemented as part of a Multi-Prototype Waveform coder, exploits the high quality speech produced by interpolating unquantised speech-domain Prototype Waveforms. In the new scheme, a frame of Prototype Waveforms is quantised using two sets of codebook searches, one representing the slowly evolving prototype shape and the other the rapid, noisy components. The scheme offers performance advantages over the previous open-loop Multi-Prototype Waveform coder, particularly when perceptual weighting is incorporated in the search.
Reductions in search complexity and the use of the scheme for quantisation at higher rates are also considered. This results in a generalised Analysis-by-Synthesis Waveform Interpolation architecture with closed-loop optimisation of all Prototype Waveform properties. ** Title: Multiband Prototype Waveform Analysis for Very Low Bit Rate Speech Coding Authors: Khashayar Yaghmaie, University of Surrey Ahmet Kondoz, University of Surrey Volume: 2, Page: 1571 Abstract: Prototype waveform interpolation is one of the most efficient compression techniques for coding the speech signal at bit rates below 4 kb/s. Most of the PWI coders employ prototype waveforms of the linear predictive residual signal for coding purpose. In the latest PWI systems, decomposition methods are used to separate the voiced and unvoiced components of the prototype waveforms prior to coding. This has resulted in high quality speech at very low bit rates. This paper presents a novel combination of the Multiband voicing analysis and PWI coding system in which the Multiband analysis is exploited to identify the voiced and unvoiced spectral components of the prototype waveforms of the original speech signal. To produce a high quality synthetic speech, energy variation of the original signal is recovered by transmitting its energy envelope. This method resulted in a high quality and low complexity coder operating at 2.55 kb/s. ** Title: A Formant Vocoder based on Mixtures of Gaussians Authors: Parham Zolfaghari, Cambridge University Tony Robinson, Cambridge University Volume: 2, Page: 1575 Abstract: This paper describes a new low bit-rate formant vocoder. The formant parameters are represented by Gaussian mixture distributions, which are estimated from the discrete Fourier transform (DFT) magnitude spectrum of the speech signal. A voiced/unvoiced classification mechanism has been developed based on the harmonic nature of each formant in the DFT spectrum modulated by the Gaussian Mixture distribution. 
Using a magnitude-only sinusoidal synthesiser, intelligible synthetic speech has been obtained. Vector quantisation of the vocal tract parameters enables this formant vocoder to operate at bit-rates down to 1248 bps. ** Title: Natural Quality Variable--Rate Spectral Speech Coding Below 3.0 kbps Authors: Engin Erzin, Lucent Technologies Arun Kumar, UC, Santa Barbara Allen Gersho, UC, Santa Barbara Volume: 2, Page: 1579 Abstract: We propose new techniques for natural quality variable rate spectral speech coding at an average rate of 2.2 kbps for dialog speech and 2.8 kbps for monolog speech. The coder models the Fourier spectrum of each frame and it builds on recent enhancements to the classical multiband excitation (MBE) approach. New techniques for robust pitch estimation and tracking, for efficient quantization of voiced and unvoiced spectra and encoding of partial phase information are the key features that result in improved quality over earlier spectral vocoders. Subjective performance results are reported which show that the coder is very close in quality to the ITU-T G.723.1 algorithm at 5.3 kbps. ** Title: A New 2-kbit/s Speech Coder Based on Normalized Pitch Waveform Authors: Yuusuke Hiwasaki, NTT Human Interface Labs Kazunori Mano, NTT Human Interface Labs Volume: 2, Page: 1583 Abstract: Speech coding at very low bit rates is useful for purposes such as voice communication over computer networks. However, it is difficult for CELP coders to maintain high quality at around 2.0 kbit/s. In this paper, a speech coding model called `normalized pitch waveform' and its quantization scheme are presented, aiming for effective compression coding of `voiced' speech. Listening tests have shown that efficient, high-quality coding is achieved at a bit rate of 2.0 kbit/s, less than half that of FS1016. 
Furthermore, this paper discusses the disadvantage of the normalized pitch waveform and presents an alternative method of using non-normalized pitch waveform. Encoding of a transitional `mixed' state between the `voiced' and the `unvoiced' state is discussed for further improvements. ** Title: A Comparison of the New 2400 bps MELP Federal Standard with Other Standard Coders Authors: Mary A. Kohler, U.S. D.o.D. Volume: 2, Page: 1587 Abstract: In 1996, the U.S. Department of Defense Digital Voice Processing Consortium (DDVPC) selected Texas Instruments' mixed excitation linear prediction (MELP) algorithm as the recommended new federal standard for 2400 bps voice communications. The algorithm selection process involved quality, intelligibility, communicability, and recognizability testing in many acoustic noise, error, and tandem conditions. Algorithm complexity was also measured. This paper compares the performance scores, diagnostic information, and complexity of MELP to the 4800 bps federal standard (FS1016) code excited linear prediction (CELP) algorithm, the 16 kbps continuously variable slope delta modulation (CVSD) algorithm, and the venerable federal standard (FIPS Pub. 137) 2400 bps linear predictive coding (LPC-10) algorithm. ** Title: MELP: The New Federal Standard at 2400 bps Authors: Lynn M. Supplee, Department of Defense Ronald P. Cohn, Department of Defense John S. Collura, Department of Defense Alan V. McCree, Texas Instruments Volume: 2, Page: 1591 Abstract: This paper describes the new U.S. Federal Standard at 2400 bps. The Mixed Excitation Linear Prediction (MELP) coder was chosen by the DoD Digital Voice Processing Consortium to replace the existing 2400 bps Federal Standard FS 1015 (LPC-10). This new standard provides equal or improved performance over the 4800 bps Federal Standard FS 1016 (CELP) at a rate equivalent to LPC-10. The MELP coder is based on the traditional LPC model, but includes additional features to improve its performance. 
** Title: Using a Perception-Based Frequency Scale in Waveform Interpolation Authors: Jes Thyssen, AT&T Labs Bastiaan Kleijn, AT&T Labs Roar Hagen, AT&T Labs Volume: 2, Page: 1595 Abstract: In speech coding it is important to focus the coding effort on the perceptually important features of the speech signal. This paper describes new quantization techniques which take advantage of current knowledge of human perception in speech coders. The new procedures exploit the frequency-dependent frequency resolution of the human auditory system. The methods are applied to the waveform interpolation (WI) coder, and their effectiveness is confirmed with experimental results. The principles described in the paper are not restricted to the WI coder, but are also applicable to many other speech coding algorithms. ** Title: Very low complexity interpolative speech coding at 1.2 to 2.4 Kbps Authors: Yair Shoham, Bell Laboratories, Lucent Technologies Volume: 2, Page: 1599 Abstract: The recently-introduced waveform interpolation (WI) coders provide good-quality speech at low rates but may be too complex for commercial use. This paper proposes new approaches to low-complexity WI speech coding at rates of 1.2 and 2.4 kbps. The proposed coders are 4 to 5 times faster than the previously reported ones. At 2.4 kbps, the complexity is about 7.5 and 2.5 MFLOPS for the encoder and decoder, respectively. At 1.2 kbps, the complexity is about 6 and 2.3 MFLOPS for the encoder and decoder, respectively. Informal subjective evaluation shows that, at 2.4 kbps, the quality is close to that of the high-complexity coders. The quality does not significantly degrade at 1.2 kbps and it is considered sufficient for messaging applications. ** Title: Modified Multiband Excitation Model at 2400 bps Authors: Michele Jamrozik, Clemson University John Gowdy, Clemson University Volume: 2, Page: 1603 Abstract: This paper presents the Modified Multiband Excitation Model used for speech coding. 
In many MBE model coders, speech quality is degraded when incorrect voicing decisions are made, particularly for high-pitched female speakers. The MMBE addresses this issue with a modified voiced/unvoiced decision algorithm and a more robust pitch estimate. The listening quality of speech produced using the MMBE model is superior to the FS-1016 CELP coder and is at least comparable with the new 2400 bps MELP coder chosen as the new 2400 bps Federal Standard. ** Title: Variable Bit Rate MBELP Speech Coding Via V/UV Distribution Dependent Spectral Quantization Authors: Eric W.M. Yu, City University of Hong Kong Cheung-Fat Chan, City University of Hong Kong Volume: 2, Page: 1607 Abstract: A variable bit rate multiband excited linear predictive speech coder is proposed in this paper. Speech signal is compressed in different bit rates ranging from 0.88 kbps to 2.6 kbps according to the mode of operation and the optimum V/UV transition frequency. An average bit rate of 1.24 kbps is achieved. The proposed speech coder improves the speech quality by splitting the non-stationary speech segments for analysis. The V/UV distribution of a short-time speech spectrum is represented efficiently by using a closed-loop minimised V/UV transition frequency. Depending on the V/UV transition frequency, the spectrum envelope is quantized in variable bit rate through embedded differential predictive scalar and vector quantizations of the LSP parameters. The proposed spectral quantization scheme results in a spectral distortion comparable to a fixed 24-bit 2-dimensional differential scalar quantization scheme. 
** Title: Voice Characteristics Conversion for HMM-based Speech Synthesis System Authors: Takashi Masuko, P&I Lab., Tokyo Institute of Technology Keiichi Tokuda, Nagoya Institute of Technology Takao Kobayashi, P&I Lab., Tokyo Institute of Technology Satoshi Imai, P&I Lab., Tokyo Institute of Technology Volume: 3, Page: 1611 Abstract: In this paper, we describe an approach to voice characteristics conversion for an HMM-based text-to-speech synthesis system. Since our speech synthesis system uses phoneme HMMs as speech units, voice characteristics conversion is achieved by changing HMM parameters appropriately. To transform the voice characteristics of synthesized speech to the target speaker, we applied the MAP/VFS algorithm to the phoneme HMMs. Using 5 or 8 sentences as adaptation data, speech samples synthesized from a set of adapted tied triphone HMMs, which have approximately 2,000 distributions, are judged to be closer to the target speaker by 79.7% or 90.6%, respectively, in an ABX listening test. ** Title: The Perceptual Importance of Selected Voice Quality Parameters Authors: Gudrun Klasmeyer, Berlin University of Technology Volume: 3, Page: 1615 Abstract: It is well known that personal voice qualities differ in the speaker's use of temporal structures, F0 contours, articulation precision, vocal effort and type of phonation. Whereas temporal structures and F0 contours can be measured in the acoustic signal and conclusions about articulation precision can be made from the formant structure, this paper focuses especially on vocal effort and type of phonation. These voice quality percepts are a combination of several acoustic voice quality parameters: the glottal pulse shape in the time domain or damping of the harmonics in the frequency domain, spectral distribution of turbulent signal components and voicing irregularities. 
In an investigation on emotionally loaded speech material it could be shown, that the named acoustic parameters are useful for differentiating between the emotions happiness, sadness, anger, fear and boredom [Klasmeyer, 1996]. The perceptual importance of selected acoustic voice quality parameters is investigated in perception experiments with synthetic speech. ** Title: A Parametric Three-Dimensional Model of the Vocal-Tract Based on MRI data Authors: Hani Yehia, ATRI, Kyoto Mark Tiede, ATR HIP, Kyoto Volume: 3, Page: 1619 Abstract: In this paper, 24 three-dimensional (3D) vocal-tract (VT) shapes extracted from MRI data are used to derive a parametric model for the vocal-tract. The method is as follows: first, each 3D VT shape is sampled using a semi-cylindrical grid whose position is determined by reference points based on VT anatomy. After that, the VT projections onto each plane of the grid are represented by their two main components obtained via principal component analysis (PCA). PCA is once again used to parametrize the sequences of coefficients that represent the sections along the tract. It was verified that the first four components can explain about 90% of the total variance of the observed shapes. Following this procedure, 3D VT shapes are approximated by linear combinations of four 3D basis functions. Finally, it is shown that the four parameters of the model can be estimated from VT midsagittal profiles. ** Title: Inverse Filter Approach to Pitch Modification: Application to Concatenative Synthesis of Female Speech Authors: Rashid Ansari, University of Illinois, Chicago Volume: 3, Page: 1623 Abstract: In this paper a new method for modifying the pitch of units of recorded female speech is described. This method was developed to overcome limitations in an otherwise promising technique called Residual-Excited Linear Prediction (RELP). In the new method, the stored speech unit is processed with a suitably shaped time-varying filter. 
The filtered signal is modified according to the required change in the fundamental frequency. The modified filtered signal is applied to the inverse of the above-mentioned prefilter. Based on observations of spectra of multiple recordings of the same speech unit at different pitch frequencies, the magnitude response of the inverse filter was chosen to have a significantly less peaky structure than that which is typically obtained in LPC. Speech modifications using this method were found to be superior in quality to those obtained by RELP, while at the same time being less sensitive than RELP to changes in pitch marking. ** Title: Vowel amplitude variation during sentence production Authors: Helen M. Hanson, Sensimetrics Corp. Volume: 3, Page: 1627 Abstract: With the goal of synthesizing natural-sounding speech based on higher-level parameters, sources of vowel amplitude variation were studied for sentences having different prosodic patterns. Previous theoretical and experimental work has shown that sound pressure level (SPL) is proportional to subglottal pressure ($P_s$) on a log scale during production of sustained vowels. The current work is based on acoustic sound pressure signals and estimated $P_s$ signals recorded during the production of reiterant speech, which is closer to natural speech production and includes prosodic effects. The results show individual, and perhaps gender, differences in the relationship between SPL and $P_s$, and in the degree of vowel amplitude contrast between full and reduced vowels. However, a general trend among speakers is to use subglottal pressure to control vowel amplitude at sentence level and main prominences, and to use adjustments of glottal configuration to control vowel amplitude variations for reduced and non-nuclear full vowels. These results have implications not only for articulatory speech synthesis, but also for automatic speech recognition systems. 
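The log-scale relationship cited in the abstract above can be written out as a worked relation. This is a sketch only, reading the statement as a linear relation between SPL in dB and the logarithm of $P_s$; the slope $a$ and offset $b$ are hypothetical speaker-dependent constants, not values taken from the paper.

```latex
\mathrm{SPL} = a \log_{10} P_s + b
\quad\Longrightarrow\quad
\Delta\mathrm{SPL} = a \log_{10}\frac{P_{s,2}}{P_{s,1}}
```

Under this form, doubling $P_s$ changes SPL by $a \log_{10} 2 \approx 0.301\,a$ dB, whatever the starting pressure.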
** Title: Experiments in Female Voice Speech Synthesis Using a Parametric Articulatory Model Authors: Dong Bing Wei, University of Liverpool Colin C. Goodyear, University of Liverpool Volume: 3, Page: 1631 Abstract: A parametric vocal tract model and a two dimensional articulatory parametric subspace for a female voice are presented. The parameters of the model, which determine the vocal tract shape, can be found uniquely for VV transitions by mapping directly from f1 and f2 onto this subspace, while a modified technique involving f3 is available for voiced VC and CV diphones. The area functions of the vocal tract, generated by these parameters, are used to drive a time-domain synthesiser. Synthesis to give female speech, copied from either male or female natural speech, may be performed. ** Title: Articulatory Speech Synthesis Using Diphone Units Authors: Andrew Richard Greenwood, JMU Volume: 3, Page: 1635 Abstract: Two different parametric models of the vocal tract have been developed. These have been used to obtain area functions for use in an articulatory synthesiser based on the Kelly-Lochbaum model. Random sampling of the geometric space spanned by the model has been performed to obtain a codebook for use in spectral copy synthesis. A dynamic programming search of this codebook produces intelligible synthetic speech, but the overall quality is limited by the density of codebook entries in articulatory space. To increase the coverage without significantly increasing the codebook size, a method of generating several small codebooks, each of which covers a small amount of acoustic space, has been developed. By using codebooks which map the regions of acoustic space defined by voiced diphones, it has been possible to significantly improve the quality of the synthetic speech. ** Title: An Auditory-Based Measure for Improved Phone Segment Concatenation Authors: David T. Chappell, Duke University John H.L. 
Hansen, Duke University Volume: 3, Page: 1639 Abstract: This paper describes a new auditory-based distance measure intended for use in a concatenated synthesis technique wherein time- and frequency-domain characteristics are used to perform natural-sounding speaker synthesis. Whereas most concatenation systems use large databases (often +100,000 units), we begin from a small, limited database (approx. 400 units) and use a new spectral distortion measure to aid in the selection of phones for optimal concatenation. At the transition between speech segments, the new auditory-based distance metric assesses perceived discontinuities in the frequency domain. The distortion measure, which employs the Carney auditory model, is used to select phones which minimize the perceived distortion between concatenated segments. Moreover, time- and frequency-domain methods can shape the prosodic and spectral characteristics of each speech segment. The final results demonstrate improved performance over standard concatenation methods applied to small databases. ** Title: Correlation Based Speech Formant Recovery Authors: Douglas Nelson, Department of Defense Volume: 3, Page: 1643 Abstract: A new method for generating speech spectrograms is presented. This algorithm is based on an autocorrelation function whose parameters are chosen to provide processing gain and formant resolution, while minimizing pitch artifacts in the spectrum. Crisp formants are produced, and the power ratio of the formants can be adjusted by pre-filtering the data. The process is functionally equivalent to a time-smoothed, windowed Wigner distribution, in which the cross-terms normally associated with the Wigner distribution are greatly attenuated by the smoothing operation. ** Title: The Modulation Spectrogram: In Pursuit of an Invariant Representation of Speech Authors: Steven Greenberg, ICSI / UC Berkeley Brian E.D. 
Kingsbury, ICSI / UC Berkeley Volume: 3, Page: 1647 Abstract: Understanding the human ability to reliably process and decode speech across a wide range of acoustic conditions and speaker characteristics is a fundamental challenge for current theories of speech perception. Conventional speech representations such as the sound spectrogram emphasize many spectro-temporal details that are not directly germane to the linguistic information encoded in the speech signal and which consequently do not display the perceptual stability characteristic of human listeners. We propose a new representational format, the modulation spectrogram, that discards much of the spectro-temporal detail in the speech signal and instead focuses on the underlying, stable structure incorporated in the low-frequency portion of the modulation spectrum distributed across critical-band-like channels. We describe the representation and illustrate its stability with color-mapped displays and with results from automatic speech recognition experiments. ** Title: From Vocalic Detection to Automatic Emergence of Vowel Systems Authors: Francois Pellegrino, IRIT Regine Andre-Obrecht, IRIT Volume: 3, Page: 1651 Abstract: This paper presents our work on vowel system detection as part of a project of Automatic Language Identification using phonological typologies. We have developed a vowel detection algorithm based on spectral analysis of the acoustic signal and requiring no learning stage. It has been tested with two telephone speech corpora: - with a French corpus provided by the CNET, 7.4 % of detections are false while about 25 % of the vowels present in the signal are not found. - experiments with 5 languages of the OGI_TS corpus result in 88.1 % of correct detection and about 15 % of non-detection. We also present in this paper the Vector Quantizer (VQ) LBG-Rissanen Algorithm that we use for vowel system modeling. Preliminary experiments are reported. 
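The vector quantizer mentioned in the abstract above builds on the classical LBG (Linde-Buzo-Gray) codebook design. The sketch below shows only the plain LBG splitting procedure under stated assumptions: the function name `lbg`, the additive split offset `eps`, the stopping tolerance `tol`, and the squared-Euclidean distortion are illustrative choices, and the Rissanen component of the authors' algorithm, which the abstract does not detail, is omitted.

```python
def lbg(vectors, size, eps=0.01, tol=1e-6, max_iter=100):
    """Train a codebook by LBG splitting (hypothetical helper, not the authors' code).

    The codebook doubles at every split, so `size` is assumed to be a power of two.
    """
    dim = len(vectors[0])

    def centroid(points):
        return [sum(p[d] for p in points) / len(points) for d in range(dim)]

    def dist2(a, b):
        # Squared Euclidean distance, used as the distortion measure.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Start from a single codeword: the centroid of the whole training set.
    codebook = [centroid(vectors)]
    while len(codebook) < size:
        # Split every codeword into a pair offset by +eps and -eps.
        codebook = [[c + s * eps for c in cw] for cw in codebook for s in (1, -1)]
        prev = float("inf")
        for _ in range(max_iter):
            # Nearest-neighbour partition of the training vectors.
            cells = [[] for _ in codebook]
            total = 0.0
            for v in vectors:
                d, i = min((dist2(v, cw), i) for i, cw in enumerate(codebook))
                cells[i].append(v)
                total += d
            # Centroid update; an empty cell keeps its previous codeword.
            codebook = [centroid(c) if c else cw for c, cw in zip(cells, codebook)]
            # Stop once the total distortion no longer decreases noticeably.
            if prev - total <= tol * max(total, 1e-12):
                break
            prev = total
    return codebook
```

As a usage illustration, calling `lbg` with `size=2` on the eight points formed by the unit square at the origin and the same square shifted to (10, 10) settles on codewords at the two cluster centres, (0.5, 0.5) and (10.5, 10.5).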
** Title: Acoustic characteristics of lexical stress in continuous speech Authors: David van Kuijk, Nijmegen University Louis Boves, Nijmegen University Volume: 3, Page: 1655 Abstract: In this paper we investigate acoustic differences between vowels in syllables that do or don't carry lexical stress. The speech material on which the investigation is based differs from the type of material used in previous research: we used phonetically rich sentences from the Dutch POLYPHONE corpus. We briefly discuss the definition of the linguistic feature `lexical stress' and its possible impact on the phonetic realization. We then proceed to explain the experiments that were carried out and the presentation of the results. Although most of the Duration, Energy and Spectral Tilt features that we used in the investigation show statistically significant differences for the population means for stressed and unstressed vowels, it also appears that the distributions overlap to such an extent that automatic detection of stressed and unstressed syllables yields accuracy scores of not much more than 65%. It is argued that this is due to the large variety in the ways in which the abstract linguistic feature `lexical stress' is realized in the acoustic speech signal. ** Title: Pole-Zero Modeling of Vocal Tract for Fricative Sounds Authors: Minsheng Liu, University of Frankfurt Arild Lacroix, University of Frankfurt Volume: 3, Page: 1659 Abstract: This paper presents a pole-zero model based on a multi-tube acoustic model for fricative sounds. This model consists of the front and back cavity formed by oral tract and pharynx, in which the excitation source is located at the point of constriction. The transfer function of this model including poles and zeros is derived and its properties are investigated. 
Small losses such as viscous friction, which is important for fricative sounds in the vocal tract, are considered, and the results show that, if the vocal tract is lossless, the numerator of the pole-zero model is symmetric. The transfer function with small losses overcomes the limitation of the symmetry. This method is applied by employing inverse filtering and an adaptive algorithm to analyse fricative sounds. ** Title: Quantitative characterization of functional voice disorders using motion analysis of highspeed video and modeling Authors: Thomas Wittenberg, University of Erlangen Patrick Mergell, University of Erlangen Monika Tigges, University of Erlangen Ulrich Eysholdt, University of Erlangen Volume: 3, Page: 1663 Abstract: A semiautomatic motion analysis software is used to extract elongation-time diagrams (trajectories) of vocal fold vibrations from digital highspeed video sequences. By combining digital image processing with biomechanical modeling we extract characteristic parameters such as phonation onset time and pitch. A modified two-mass model of the vocal folds is employed in order to fit the main features of simulated time series to those of the extracted trajectories. Due to the variation of the model parameters, general conclusions can be made about laryngeal dysfunctions such as functional dysphonia. We show the first results of semi-automatic motion analysis in combination with model simulations as a step towards a computer aided diagnosis of voice disorders. ** Title: Robust Speech Decoding: A Universal Approach to Bit Error Concealment Authors: Tim Fingscheidt, RWTH Aachen Peter Vary, RWTH Aachen Volume: 3, Page: 1667 Abstract: In digital mobile communication systems there is the need for reducing the subjective effects of residual bit errors which have not been eliminated by channel decoding by the use of error concealment techniques. 
Due to the fact that most standards do not specify these algorithms bit exactly, there is room for new solutions to improve the speech quality. This contribution develops a new approach for optimum estimation of speech codec parameters. It can be applied to any speech codec standard if a bit reliability information is provided by the demodulator (e.g. DECT), or by the channel decoder (e.g. soft-output Viterbi algorithm -- SOVA in GSM). The proposed method includes an inherent muting mechanism leading to a graceful degradation of speech quality in case of adverse transmission conditions. Particularly the additional exploitation of residual source redundancy, i.e. some a priori knowledge about codec parameters gives a significant enhancement of the output speech quality. In the case of an error free channel, bit exactness as required by the standards can be preserved. ** Title: On Optimal and Minimum-Entropy Decoding Authors: Bastiaan Kleijn, Delft University of Technology Volume: 3, Page: 1671 Abstract: In the quantization of a signal in speech coding, dependencies between its samples are often neglected. Generally, these dependencies are then also neglected at the decoder. However, usually a priori information about these dependencies is available, making it possible to improve decoder performance by means of enhanced decoding. An attractive feature of enhanced decoding is that it can be applied to existing coding standards. This paper describes several enhanced decoding methods, including a vector decoding method and a method which aims at reducing the differential entropy rate of the decoded signal. Experimental results are used to confirm that both these decoding procedures can provide better performance than conventional decoding for common signal/encoder combinations. ** Title: A New Sinusoidal Phase Modeling Algorithm Authors: Sassan Ahmadi, ASU Andreas S. 
Spanias, ASU Volume: 3, Page: 1675 Abstract: A new phase modeling algorithm for sinusoidal analysis and synthesis of speech signals is presented. Short-time sinusoidal phases are efficiently approximated by incorporating linear prediction, spectral sampling, delay compensation, and phase correction techniques. The algorithm differs from phase compensation methods proposed for multi-pulse LPC in that it has been tailored to sinusoidal transform coding of speech signals. Performance analysis on a large speech database indicates considerable improvement in temporal and spectral matching between the original and reconstructed signals as compared to other sinusoidal phase models as well as improved subjective quality of the reproduced speech. ** Title: Recursive and Adaptive Predictive Coding of Speech Authors: Kazuo Nakata, Chiba Institute of Technology Kin-ich Higure, Chiba Institute of Technology Volume: 3, Page: 1679 Abstract: A new speech coding algorithm, 'recursive and adaptive prediction', is proposed and tested. An adaptive linear prediction of the input is carried out sample by sample, and only predictive residuals are quantized and transmitted in binary codes. Predictive coefficients are adaptively controlled by the quantized prediction error. Segmental SNR of almost 22 dB is obtained at 16 kb/s by the cascade connection of 2 stages of prediction. The algorithm can handle mixed voices as well, and can easily be implemented on a single DSP. ** Title: The Multimodal Multipulse Excitation Vocoder Authors: Takahiro Unno, Asahi Chemical Thomas P. Barnwell, Georgia Institute of Technology Mark A. Clements, Georgia Institute of Technology Volume: 3, Page: 1683 Abstract: This paper presents a new high-quality, variable-rate vocoder in which the average bit-rate is parametrically controllable. 
The new vocoder is intended for use with data-voice simultaneous channel (DVSC) applications, in which the speech data is transmitted simultaneously with video and other types of data. 
The vocoder presented in this paper achieves state-of-the-art quality at several different bit-rates between 5.5 Kbps and 10 Kbps. Further, it achieves this performance at acceptable levels of complexity and delay. ** Title: Minimum Variance Distortionless Response (MVDR) Modeling of Voiced Speech Authors: Manohar N. Murthi, University of California, San Diego Bhaskar D. Rao, University of California, San Diego Volume: 3, Page: 1687 Abstract: In this paper we propose the MVDR method, which is based upon the Minimum Variance Distortionless Response (MVDR) spectrum estimation method, for modeling voiced speech. Developed to overcome some of the shortcomings of Linear Prediction models, the MVDR method provides better models for medium and high pitch voiced speech. The MVDR model is an all-pole model whose spectrum is easily obtained from a modest non-iterative computation involving the Linear Prediction coefficients thereby retaining some of the computational attractiveness of LPC methods. With the proper choice of filter order, which is dependent on the number of harmonics, the MVDR spectrum models the formants and spectral powers of voiced speech exactly. An efficient reduced model order MVDR method is developed to further enhance its applicability. An extension of the reduced order MVDR method for recovering the correct amplitudes of the harmonics of voiced speech is also presented. ** Title: Phase Modelling of Speech Excitation for Low Bit-Rate Sinusoidal Transform Coding Authors: Xiaoqin Sun, University of Liverpool Fabrice Plante, University of Liverpool Barry M.G. Cheetham, University of Liverpool Kenneth W.T. Wong, B.T. Laboratories Volume: 3, Page: 1691 Abstract: Sinusoidal transform coding (STC) techniques model speech as the sum of sine-waves whose frequencies, amplitudes and phases are specified at regular intervals. 
To achieve a low-bit rate representation, only the spectral envelope is encoded and the phases are regenerated according to a minimum phase assumption. In this paper, the inaccuracy of the minimum phase model is demonstrated. It is shown that the phase spectra of decoded speech segments may be corrected using either the parameters of a Rosenberg pulse model or a second order all-pass filter. Experiments have shown that by applying this correction, the phase accuracy increases and the speech quality improves. ** Title: An Adaptive-Rate Digital Communication System For Speech Authors: John E. Kleider, Motorola SSTG, Speech and Signal Processing Lab William M. Campbell, Motorola SSTG, Speech and Signal Processing Lab Volume: 3, Page: 1695 Abstract: Current digital voice communication systems allow only modest levels of protection of the coded speech and often do not follow the dynamic changes that occur in the transmission channel. We present a method that provides optimal voice quality and intelligibility for any given transmission channel condition. The approach is performed via adaptive rate voice (ARV) coding using an adaptive- rate modem, channel coding, and a multimode sinusoidal transform coder. In general, the receiver utilizes channel state information to not only optimally demodulate and decode the currently corrupted symbols from the channel, but also to inform the transmitter, via a feedback channel, of the optimal strategy for voice/channel coding and modulation format. We compare several source-channel coding schemes at multiple transmission symbol rates and compare the performance to fixed aggregate-rate channel- controlled variable rate voice coding systems. 
** Title: Smoothing The Evolution Of The Spectral Parameters In Linear Prediction Of Speech Using Target Matching Authors: Mohammad Reza Zad-Issa, McGill University Peter Kabal, McGill University Volume: 3, Page: 1699 Abstract: Linear prediction (LP) coefficients are used to describe the formant structure of a speech waveform. Many factors contribute to the frame-to-frame fluctuation of these parameters. These variations adversely affect the performance of the LP quantizer and the quality of the synthesized speech. For voiced speech, efficient coding of the pitch pulses at the output of the inverse formant filter relies on the similarity of successive pitch waveforms. The performance of this coding stage is also jeopardized by LP variations. In this paper, we propose a new method which smooths the evolution of the LP parameters. Our algorithm is based on matching the output of the formant predictor to a target signal constructed using smoothed pitch pulses. With this approach we have successfully reduced the frame-to-frame variation of LP coefficients, while increasing the similarity of pitch pulses. ** Title: Comparative Study Of Different Parameters For Temporal Decomposition Based Speech Coding Authors: Shahrokh Ghaemmaghami, Queensland University of Technology Mohamed Deriche, Queensland University of Technology Boualem Boashash, Queensland University of Technology Volume: 3, Page: 1703 Abstract: Temporal decomposition (TD) is an effective technique to compress the spectral information of speech through orthogonalization of the matrix of spectral parameters leading to an efficient rate reduction in speech coding applications. The performance of TD is basically a function of the properties of the parameter set used. Although the "decomposition suitability" of a parameter set is typically defined on the basis of a "phonetic relevance" criterion, it cannot be directly used in speech coding. Instead, quality evaluation of reconstructed speech is more appropriate.
In this paper, we extend our earlier work in this area and attempt to assess several "popular" spectral parameter sets from the viewpoint of decomposition suitability in very low-rate speech coding using parametric, perceptually-based spectral, and energy distance measures. ** Title: Efficient Algorithm to Compute LSP Parameters from 10th-order LPC Coefficients Authors: Sara Grassi, University of Neuchatel Alain Dufaux, University of Neuchatel Michael Ansorge, University of Neuchatel Fausto Pellandini, University of Neuchatel Volume: 3, Page: 1707 Abstract: Line Spectrum Pair (LSP) representation of Linear Predictive Coding (LPC) parameters is widely used in speech coding applications. An efficient method for LPC to LSP conversion is Kabal's method. In this method the LSPs are the roots of two polynomials $P'_{p}(x)$ and $Q'_{p}(x)$, and are found by a zero crossing search followed by successive bisections and interpolation. The precision of the obtained LSPs is higher than required by most applications, but the number of bisections cannot be decreased without compromising the zero crossing search. In this paper, it is shown that, in the case of $10^{th}$-order LPC, five intervals, each containing only one zero crossing of $P'_{10}(x)$ and one zero crossing of $Q'_{10}(x)$, can be calculated, avoiding the zero crossing search. This allows a trade-off between LSP precision and computational complexity resulting in considerable computational saving. ** Title: Speech Compression with Preservation of Speaker Identity Authors: John Leis, USQ Mark Phythian, USQ Sridha Sridharan, Queensland University of Technology Volume: 3, Page: 1711 Abstract: Although much effort has been directed recently towards speech compression at rates below 4 kb/s, the primary metric for comparison has, understandably, been the amount of spectral distortion in the decompressed speech.
However, an aspect which is becoming important in some applications is the ability to identify the original speaker from the coded speech algorithmically. We investigate here the effect of speech compression using multistage vector quantization of the short-term (formant) filter parameters on text-independent speaker identification. It is demonstrated that in cases where the speech is stored in a compressed database for retrieval, the speaker model should be constructed from the raw speech before spectral compression. Additionally, Gaussian models of sufficiently high order are able to reduce the negative effects of spectral vector quantization upon speaker identification accuracy. ** Title: A method for measuring information transmission of speech systems Authors: Juha Backman, Nokia Mobile Phones Volume: 3, Page: 1715 Abstract: The paper presents a method for measuring the transmission of speech transmitted through a channel with linear or nonlinear distortion and arbitrary noise. The method is a generalization of the well-established method of measuring speech intelligibility using the modulation transmission function, but instead of measuring only the amount of modulation in the received signal and comparing it against the amount of modulation in the transmitted signal in a given carrier and modulation frequency band, the proposed method cross-correlates the envelopes of the transmitted and received signals. ** Title: Perceptual entropy rate estimates for the phonemes of American English Authors: Vincent Van de Laar, Delft University of Technology Bastiaan Kleijn, Delft University of Technology Ed F. Deprettere, Delft University of Technology Volume: 3, Page: 1719 Abstract: We estimated the perceptual entropy rate of the phonemes of American English and found that the upper limit of the perceptual entropy of voiced phonemes is approximately 1.4 bit/sample, whereas the perceptual entropy of unvoiced phonemes is approximately 0.9 bit/sample.
Results indicate that a simple voiced/unvoiced classification is suboptimal when trying to minimize bit rate. We used two different methods for the entropy estimation, and the results of both methods show that short segments of unvoiced speech are approximately Gaussian. ** Title: Rescoring under Fuzzy Degrees with a Multilayer Neural Network in a Rule-Based Speech Recognition System Authors: Olivier Oppizzi, University of Avignon Regis Quelavoine, University of Avignon Volume: 3, Page: 1723 Abstract: In this paper, a speech rescoring system is developed on a set of phonetic hypotheses produced by a bottom-up knowledge-based decoder. An original method to automatically compute a fuzzy membership function from top-down acoustic rules statistics is compared with a possibilistic measure. To aggregate the fuzzy degrees into a phonetic score, a multilayer neural network is trained on the results of all the rules in order to detect how these rules characterize different phonemes and then to give a weight to each rule. Rescoring performance of top-down rules for fricatives will be discussed on an isolated-word speech database of French with 1000 utterances pronounced by five speakers. ** Title: Optimization Of HMM By A Genetic Algorithm Authors: Chak-Wai Chau, City University of Hong Kong S. Kwong, City University of Hong Kong C.K. Diu, Department of Applied Computing W.R. Fahrner, FernUniversitaet Volume: 3, Page: 1727 Abstract: The Hidden Markov Model (HMM) is a natural and highly robust statistical methodology for automatic speech recognition. It has also been tested and proven in a wide range of applications. The model parameters of the HMM are essential in describing the behavior of the utterance of the speech segments. Many successful heuristic algorithms have been developed to optimize the model parameters in order to best describe the trained observation sequences. However, all these methodologies explore only one local maximum in practice.
No single methodology can recover from a local maximum to reach the global maximum or a better local maximum. In this paper, a stochastic search method called Genetic Algorithm (GA) is presented for HMM training. GA mimics natural evolution and performs a global search within the defined search space. Experimental results showed that using GA for HMM training (GA-HMM training) performs better than using other heuristic algorithms. ** Title: Inference of Variable-length Acoustic Units for Continuous Speech Recognition Authors: Sabine Deligne, Telecom Paris Frederic Bimbot, Telecom Paris Volume: 3, Page: 1731 Abstract: In the field of speech recognition, the patterns assumed to structure the speech material (phonemes, triphones, words...) are defined a priori according to a linguistic criterion, whereas the recognition criterion is based on an acoustic similarity measure. From this may result a lack of consistency for the recognition units. In this paper, we explore the possibility of a more data-driven approach, where recognition units are derived according to an acoustic criterion, and then mapped to variable length sequences of phonemes in an unsupervised way. Continuous speech recognition experiments are reported to evaluate the consistency of those units as opposed to linguistically defined units. ** Title: Comparative Performance Analysis of Statistical Trajectory Models in Cellular Environment Authors: Bojan Petek, University of Ljubljana Ove Andersen, CPK, Aalborg University, Denmark Paul Dalsgaard, CPK, Aalborg University, Denmark Volume: 3, Page: 1735 Abstract: Two systems (Statistical Trajectory Models (STM) and continuous density HMMs) utilizing three preprocessing methodologies (MFCC, RASTA and FBDYN) were evaluated on two databases, namely CTIMIT and the corresponding downsampled TIMIT.
Within the bounds of the experimental setup the comparative performance analysis showed that the STM significantly outperforms the HMM system on the CTIMIT database. Specifically, the performance of the STM system was found to be at least 10% better as compared to the one obtained by HMM when the RASTA preprocessing was used. The performance of both systems with FBDYN parametrization was found to be inferior to those using MFCC and RASTA. On the other hand, in low-noise conditions on the TIMIT database FBDYN yielded an improved performance for the HMM system, whereas STM achieved the best results with the MFCC parametrization. ** Title: Inter-Digit HMM Connected Digit Recognition Using the Macrophone Corpus Authors: Yu-Hung Kao, Texas Instruments Lorin P. Netsch, Texas Instruments Volume: 3, Page: 1739 Abstract: Continuous digit recognition over the telephone channel is a key technology for many telecommunications applications such as voice dialing, automatic banking, and credit card number entry. Speech recognizers usually achieve high performance by modeling the acoustics in Hidden Markov Models (HMMs) using large numbers of multivariate Gaussian mixtures with assumed diagonal covariance in order to model the variability of different speakers and channel conditions. In this paper, we present a system that uses single mixture 16 feature Gaussian distributions with an assumed identity covariance to achieve 1.0% word error and 5.7% sentence error rate on the Macrophone corpus. We found that inter-digit modeling, discriminant training, and per-utterance adaptation can each contribute about 30% reduction in error rate. Using this approach, we can realize a system with relatively low memory requirements. ** Title: Wide Context Acoustic Modeling in Read vs.
Spontaneous Speech Authors: Michael Finke, University of Karlsruhe Ivica Rogina, University of Karlsruhe Volume: 3, Page: 1743 Abstract: Context-dependent acoustic models have been applied in speech recognition research for many years, and have been shown to increase the recognition accuracy significantly. The most common approach is to use triphones. Recently, several speech recognition groups have started investigating the use of larger phonetic context windows when building acoustic models. In this paper we discuss some of the computational problems arising from wide context modeling (polyphonic modeling) and present methods to cope with these problems. A two stage decision tree based polyphonic clustering approach is described which implements a more flexible parameter tying scheme. The new clustering approach gave us significant improvement across all tasks - WSJ, SWB, and Spontaneous Scheduling Task - and across all languages involved (German, Spanish, English). We report recognition results based on the JANUS speech recognition toolkit on two tasks comparing acoustic context phenomena in English read versus spontaneous speech. We used our WSJ 60K recognizer and the JANUS SWB 10K polyphonic recognizer. ** Title: Performance of Hybrid MMI-Connectionist / HMM Systems on the WSJ Speech Database Authors: Jorg Rottland, Duisburg University Christoph Neukirchen, Duisburg University Daniel Willett, Duisburg University Volume: 3, Page: 1747 Abstract: In this paper, a hybrid MMI-connectionist / hidden Markov model (HMM) speech recognition system for the Wall Street Journal (WSJ) database is presented. The HMM part of this system uses discrete probability density functions (pdf). The neural network (NN) is used to replace a classical vector quantizer (VQ) like a k-means or LBG algorithm, which are typically used in discrete HMM systems. 
The NN is trained with an algorithm that tries to achieve maximum mutual information (MMI) between the generated output labels and the underlying phonetic description. The system has been trained and tested with the five thousand word speaker independent WSJ task. The error rates of the MMI-Connectionist approach are 21% lower than the error rates of a k-means system. The system achieves error rates which have been achieved before only by the best continuous/semi-continuous HMM speech recognizers, with the advantage of a faster recognition algorithm. ** Title: Statistical Modeling of Co-Articulation in Continuous Speech Based on Data Driven Interpolation Authors: Don Sun, Bell Labs, Lucent Technologies Volume: 3, Page: 1751 Abstract: Parsimonious modeling of the context-dependent nature of speech due to co-articulation is very important for improving speech recognition systems. Most of the proposed methods for dealing with this problem are based on the idea of using context-dependent speech units, which inevitably increases the complexity of the model space. This paper presents a new approach to speech co-articulation modeling with complexity only comparable to context independent models. We model the movement of a sequence of speech signals by a set of anchor points in the feature vector space corresponding to the target phonemic units. The transitions are modeled as interpolations between the target vectors. The auxiliary parameters specifying the transitional units are estimated "online" during recognition, hence they do not contribute to the complexity of the models. Some phonetic classification experiments showed that the new model can achieve the same performance as the more complex context dependent models. ** Title: Microsegment-Based Connected Digit Recognition Authors: John J. Godfrey, TI Dallas Coimbatore S.
Ramalingam, TI Dallas Aravind Ganapathiraju, Mississippi State University Joseph Picone, Mississippi State University Volume: 3, Page: 1755 Abstract: By building acoustic phonetic models which explicitly represent as much knowledge of pronunciation in a small domain (the digits) as possible, we can create a recognition system which not only performs well but allows for meaningful error analysis and improvement. An HMM-based recognizer for the digits and a few associated words was constructed in accord with these principles. About 65 phonetic models were trained on 140 carefully labeled utterances, then iteratively on unlabeled data under orthographic supervision. The basic system achieved less than 3% word error rate on digit strings of unknown length from unseen test speakers, and 1.4% on 7-digit strings of known length. This is competitive with word-based models using the same HMM engine and similar parameter settings. As an R&D system, it allows meaningful analysis of errors and relatively straightforward means of improvement. ** Title: Context-Dependent Hybrid HME / HMM Speech Recognition using Polyphone Clustering Decision Trees Authors: Jurgen Fritsch, University of Karlsruhe Michael Finke, University of Karlsruhe Alex Waibel, University of Karlsruhe Volume: 3, Page: 1759 Abstract: This paper presents a context-dependent hybrid connectionist speech recognition system that uses a set of generalized hierarchical mixtures of experts (HME) to estimate context-dependent posterior acoustic class probabilities. The connectionist part of the system is organized in a modular fashion, allowing the distributed training of such a system on regular workstations. Context classes are based on polyphonic contexts, clustered using decision trees which we adopt from our continuous density HMM recognizer JANUS. The system is evaluated on ESST, an English speaker-independent spontaneous speech database.
Context dependent modeling is shown to yield significant improvements over simple context-independent modeling, requiring only small additional overhead in terms of training and decoding time. ** Title: Improved automatic recognition of Norwegian natural numbers by incorporating phonetic knowledge Authors: Knut Kvale, Telenor R&D Ingunn Amdal, Telenor R&D Volume: 3, Page: 1763 Abstract: This paper addresses the problem of speaker-independent connected natural number recognition over telephone lines. Increasing the vocabulary from digits (0-9) to natural numbers (0-99) opens the way to more user-friendly services, but also introduces many new, language-specific problems, such as more similar sounding words, a more complex grammar network, and more ambiguities due to segmentation problems of connected natural numbers. The paper shows that incorporating phonetic knowledge into a Norwegian natural number recogniser improved the recognition performance from 70.6% to 76.3% correctly recognised 8-digit telephone numbers in noisy conditions. ** Title: Hybrid HMM/ANN Systems for Training Independent Tasks: Experiments on Phonebook and Related Improvements Authors: Stephane Dupont, FPMS - TCTS Herve Bourlard, FPMS - TCTS Olivier Deroo, FPMS - TCTS Vincent Fontaine, FPMS - TCTS Jean-Marc Boite, FPMS - TCTS Volume: 3, Page: 1767 Abstract: In this paper, we evaluate multi-Gaussian HMM systems and hybrid HMM/ANN systems in the framework of task independent training for small size (75 words) and medium size (600 words) vocabularies. To do this, we use the Phonebook database which is particularly well suited to this kind of experiment since (1) it is a very large telephone database and (2) the size and content of the test vocabulary is very flexible. For each system, different HMM topologies are compared to test the influence of state tying (with a number of parameters approximately kept constant) on the recognition performance.
Two lexica (Phonebook and CMU) are also compared and it is shown that the CMU lexicon leads to significantly better performance. Finally, it is shown that with a quite simple system and a few adaptations to the basic HMM/ANN scheme, recognition performance of 98.5% and 94.7% can easily be achieved on lexicons of 75 and 600 words, respectively (isolated words, telephone speech and lexicon words not present in the training data). ** Title: European Speech Databases for Telephone Applications Authors: Harald Hoge, Siemens AG Herbert S. Tropf, Siemens AG Richard Winski, Vocalis Ltd. Henk van den Heuvel, SPEX Reinhold Haeb-Umbach, Philips GmbH Khalid Choukri, ELRA Volume: 3, Page: 1771 Abstract: The SpeechDat project aims to produce speech databases for all official languages of the European Union and some major dialectal variants and minority languages resulting in 28 speech databases. They will be recorded over fixed and mobile telephone networks. This will provide a realistic basis for training and assessment of both isolated and continuous-speech utterances, employing whole-word or subword approaches, and thus can be used for developing voice driven teleservices including speaker verification. The specification of the databases has been developed jointly, and is essentially the same for each language to facilitate dissemination and use. There will be a controlled variation among the speakers concerning sex, age, dialect, environment of call etc. The validation of all databases will be carried out centrally. The SpeechDat databases will be transferred to ELRA for distribution. The next databases to be recorded will cover East European languages. ** Title: Development of a Large Vocabulary Speech Database for Cantonese Authors: Pak Chung Ching, Chinese University of Hong Kong K.F. Chow, Chinese University of Hong Kong Tan Lee, Chinese University of Hong Kong L.W. Chan, Chinese University of Hong Kong Alfred Y.P.
Ng, Chinese University of Hong Kong Volume: 3, Page: 1775 Abstract: This paper describes our recent work on developing a large vocabulary speech database for Cantonese. As a major Chinese dialect, Cantonese is spoken by tens of millions of people in Southern China and Hong Kong. It is very different from Mandarin or Putonghua in phonology, phonetics, vocabulary and grammatical structure. A speech database specially designed for Cantonese is urgently needed for the design, implementation and performance evaluation of various speech recognition systems. The proposed database contains a large number of speech utterances which include isolated syllables, polysyllabic words and phonetically rich sentences. It covers most of the intra-syllable and inter-syllable acoustic variations. We hope that this pioneering work will be useful in facilitating future research activities in the related areas. ** Title: An Approach to Continuous Speech Recognition Based on Layered Self-Adjusting Decoding Graph Authors: Qiru Zhou, Bell Labs., Lucent Technologies Wu Chou, Bell Labs., Lucent Technologies Volume: 3, Page: 1779 Abstract: In this paper, an approach to continuous speech recognition based on a layered self-adjusting decoding graph is described. It utilizes a scaffolding layer to support fast network expansion and releasing. A two level hashing structure is also described. It introduces self-adjusting capability in dynamic decoding on a general re-entrant decoding network. In stack decoding, the scaffolding layer in the proposed approach enables the decoder to look several layers into the future so that long span inter-word context dependency can be exactly preserved. Experimental results indicate that highly efficient decoding can be achieved with significant savings on recognition resources.
** Title: Look-Ahead Techniques for Fast Beam Search Authors: Stefan Ortmanns, RWTH Aachen Andreas Eiden, RWTH Aachen Hermann Ney, RWTH Aachen Norbert Coenen, RWTH Aachen Volume: 3, Page: 1783 Abstract: This paper presents two look-ahead techniques for speeding up large vocabulary continuous speech recognition. These two techniques, which are referred to as language model look-ahead and phoneme look-ahead, are incorporated into the pruning process of the time-synchronous one-pass beam search algorithm. The search algorithm is based on a tree-organized pronunciation lexicon in connection with a bigram language model. Both look-ahead techniques have been tested on the 20 000-word NAB'94 task (ARPA North American Business Corpus). The recognition experiments show that the combination of bigram language model look-ahead and phoneme look-ahead reduces the size of the search space by a factor of about 30 without affecting the word recognition accuracy in comparison with no look-ahead pruning technique. ** Title: An Efficient Search Method for Large-Vocabulary Continuous-Speech Recognition Authors: Hanazawa Ken, Tokyo Institute of Technology Sadaoki Furui, Tokyo Institute of Technology Yasuhiro Minami, NTT Human Interface Laboratories Volume: 3, Page: 1787 Abstract: This paper proposes an efficient method for large-vocabulary continuous-speech recognition, using a compact data structure and an efficient search algorithm. We introduce a very compact data structure DAWG as a lexicon to reduce the search space. We also propose a search algorithm to obtain the N-best hypotheses using the DAWG structure. This search algorithm is composed of two phases: "forward search" and "traceback". Forward search, which basically uses the time-synchronous Viterbi algorithm, merges candidates and stores the information about them in DAWG structures to create phoneme graphs. Traceback traces the phoneme graphs to obtain the N-best hypotheses.
An evaluation of this method's performance using a speech-recognition-based telephone-directory-assistance system having a 4000-word vocabulary confirmed that our strategy improves speech recognition in terms of time and recognition rate. ** Title: Extensions to the Word Graph Method for Large Vocabulary Continuous Speech Recognition Authors: Hermann Ney, RWTH Aachen Stefan Ortmanns, RWTH Aachen Ingo Lindam, RWTH Aachen Volume: 3, Page: 1791 Abstract: This paper describes two methods for constructing word graphs for large vocabulary continuous speech recognition. Both word graph methods are based on a time-synchronous, left-to-right beam search strategy in connection with a tree-organized pronunciation lexicon. The first method is based on the so-called word pair approximation and fits directly into a word-conditioned search organization. In order to avoid the assumptions made in the word pair approximation, we design another word graph method. This method is based on a time conditioned factoring of the search space. For the case of a trigram language model, we give a detailed comparison of both word graph methods with an integrated search method. The experiments have been carried out on the North American Business (NAB'94) 20,000-word task. ** Title: An $O(N\sqrt{\overline{E}})$ Viterbi Algorithm Authors: Sarvar Patel, Bellcore Volume: 3, Page: 1795 Abstract: In continuous speech recognition, a significant amount of time is used every frame to evaluate interword transitions. In fact, if $N$ is the size of the vocabulary and each word transitions on average to $\overline{E}$ other words then $O(N\overline{E})$ operations are required. Similarly when evaluating a partially connected HMM, the Viterbi algorithm requires $O(N\overline{E})$ operations. This paper presents the first algorithm to break the $O(N\overline{E})$ complexity requirement. The new algorithm has an average complexity of $O(N\sqrt{\overline{E}})$.
An algorithm was previously presented by the author for the special case of fully connected models; the new algorithm, however, is general. It speeds up evaluations of both partially and fully connected HMMs and language models. Unlike pruning, the approach in this paper does not use any heuristics which may sacrifice optimality, but fundamentally improves the basic evaluation of the time synchronous Viterbi algorithm. ** Title: CCLMDS'96: Towards a Speaker-Independent Large-Vocabulary Mandarin Dictation System Authors: Tung-Hui Chiang, ATC/CCL/ITRI Chung-Mou Pengwu, ATC/CCL/ITRI Shih-Chieh Chien, ATC/CCL/ITRI Chao-Huang Chang, ATC/CCL/ITRI Volume: 3, Page: 1799 Abstract: This paper presents the first known results for the speaker-independent large-vocabulary Mandarin Dictation System, namely CCLMDS'96, developed by Computer & Communication Research Laboratories (CCL) at Industrial Technology Research Institute (ITRI). First, a fast searching algorithm is proposed to improve the searching efficiency such that CCLMDS'96 can operate in real time on a personal computer. In addition, a discriminative scoring function is proposed to integrate the speech recognizer and the word-class-based bigram language model. With this discriminative scoring function, the system attains a word accuracy rate of 91.3%, which significantly outperforms the conventional integration approach. ** Title: Japanese Large-Vocabulary Continuous-Speech Recognition using a Business-Newspaper Corpus Authors: Tatsuo Matsuoka, NTT Human Interface Labs. Katsutoshi Ohtsuki, NTT Human Interface Labs. Takeshi Mori, NTT Human Interface Labs. Kotaro Yoshida, Institute of Technology, Tokyo Sadaoki Furui, Institute of Technology, Tokyo Katsuhiko Shirai, Waseda University Volume: 3, Page: 1803 Abstract: A large-vocabulary continuous-speech recognition (LVCSR) system was developed and evaluated. To evaluate the system, a Japanese business-newspaper speech corpus was designed and recorded.
The corpus was designed so that it can be used for Japanese LVCSR research in the same way that the Wall Street Journal (WSJ) corpus, for example, is used for English LVCSR research. Since Japanese sentences are written without spaces between words, a morphological analysis was introduced to segment sentences into words so that word n-gram language models could be used. To enable the use of detailed word n-gram language models, a two-pass decoding strategy was applied. Context-dependent (CD) phone models and word trigram language models reduced the word error rate from 80.2% to 10.1% (an error reduction of about 88%). This result shows that CD phoneme modeling and word trigram language models can be used effectively in Japanese LVCSR. ** Title: A PC-based Real-Time Large Vocabulary Continuous Speech Recognizer for German Authors: Meinrad Niemoller, Siemens AG, Munich Alfred Hauenstein, Siemens AG, Munich Erwin Marschall, Siemens AG, Munich Petra Witschel, Siemens AG, Munich Ulrike Harke, Siemens AG, Munich Volume: 3, Page: 1807 Abstract: A large vocabulary speech recognizer for German is presented. The main properties of the recognizer are speaker independence, continuous speech input and real-time operation. It is integrated into a client/server framework, which allows for simple porting between different hard- and software platforms. Methods like simplified language model spreading in beam search and specialized word-begin and -end modelling are introduced in order to achieve real-time operation on a Pentium-based PC. Recognition tests for two different dictation applications (controlled speech newspaper dictation and spontaneous speech medical dictation) are presented showing the importance of adding efforts in the modelling of spontaneous speech. ** Title: Progress in Recognizing Conversational Telephone Speech Authors: Barbara Peskin, Dragon Systems, Inc. Larry Gillick, Dragon Systems, Inc. Natalie Liberman, Dragon Systems, Inc.
Mike Newman, Dragon Systems, Inc. Paul van Mulbregt, Dragon Systems, Inc. Steven Wegmann, Dragon Systems, Inc. Volume: 3, Page: 1811 Abstract: This paper describes recent improvements made to Dragon's speech recognition system which have improved performance on Switchboard recognition by roughly 10 percentage points in the past year. These features include the use of rapid speaker adaptation, a move from a 20 to a 10 msec frame rate for recognition, expansion of the acoustic training set and lexicon, and the introduction of interpolated language models. Preliminary results applying this Switchboard-trained system to conversations drawn from the English CallHome corpus are also quite strong, suggesting that this technology ports well to novel tasks. Finally, the paper includes a report on several research projects currently in progress which show promise of further reducing the error rate. ** Title: Recognition of Conversational Telephone Speech using the Janus Speech Engine Authors: Torsten Zeppenfeld, Carnegie Mellon University Michael Finke, Carnegie Mellon University Klaus Ries, Carnegie Mellon University Martin Westphal, Carnegie Mellon University Alex Waibel, Carnegie Mellon University Volume: 3, Page: 1815 Abstract: Recognition of conversational speech is one of the most challenging speech recognition tasks to date. While recognition error rates of 10% or lower can now be reached on speech dictation tasks over vocabularies in excess of 60,000 words, recognition of conversational speech has persistently resisted most attempts at improvement by way of the proven techniques to date. Difficulties arise from shorter words, telephone channel degradation, and highly disfluent and coarticulated speech. In this paper, we describe the application, adaptation, and performance evaluation of our JANUS speech recognition engine on the Switchboard conversational speech recognition task.
Through a number of algorithmic improvements, we have been able to reduce error rates from more than 50% word error to 38%, measured on the official 1996 NIST evaluation test set. Improvements include vocal tract length normalization, polyphonic modeling, label boosting, speaker adaptation with and without confidence measures, and speaking mode dependent pronunciation modeling. ** Title: Approaches to Phoneme-Based Topic Spotting: An Experimental Comparison. Authors: Roland Kuhn, STL Peter Nowell, DRA Malvern Caroline Drouin, CRIM Volume: 3, Page: 1819 Abstract: Topic spotting is often performed on the output of a large vocabulary recognizer or a keyword spotter. However, this requires detailed knowledge about the vocabulary, and transcribed training data. If portability to new topics and languages is important, then a topic spotter based on phoneme recognition is preferable. A phoneme recognizer is run on training data consisting of audio files labeled by topic alone - no word transcripts are required. Phoneme sub-sequences which help to predict the topic are then extracted automatically. This work was carried out by two teams exploring three different approaches to phoneme-based topic spotting: the ``DP-ngram'', the ``decision tree'', and the ``Euclidean'' approach. Results obtained by each team on the ARM (Airborne Reconnaissance Mission) and Switchboard data sets were compared by means of Receiver Operating Characteristic (ROC) curves. The best performance for each team was obtained via a similar type of discriminative training. ** Title: A Keyword Selection Strategy for Dialogue Move Recognition and Multi-Class Topic Identification Authors: Philip N. Garner, DRA Malvern Aidan Hemsworth, DRA Malvern Volume: 3, Page: 1823 Abstract: The concept of usefulness for keyword selection in topic identification problems is reformulated and extended to the multi-class domain. The derivation is shown to be a generalisation of that for the two class problem. 
The technique is applied to both multinomial and Poisson-based estimates of word probability, and shown to outperform or compare favourably to various information-theoretic techniques classifying dialogue moves in the map task corpus, and reports in the LOB corpus. ** Title: Improved Lexicon Modeling For Continuous Speech Recognition Authors: Seong-Jin Yun, KAIST Yung-Hwan Oh, KAIST Gyung-Chul Shin, KAIST Volume: 3, Page: 1827 Abstract: We propose a stochastic lexicon model that represents pronunciation variations so as to work optimally with a continuous speech recognizer. In this lexicon model, the baseforms of words are represented by subword states and the probability distributions of subwords, in the form of a hidden Markov model. The proposed approach can also be applied to systems employing non-linguistic recognition units, and the lexicon is trained automatically from training utterances. In speaker independent speech recognition tests using a 3000 word continuous speech database, the proposed system improves the word accuracy by about 27.8% and the sentence accuracy by about 22.4%. ** Title: Interpolation, spectrum analysis, error-control coding, and fault-tolerant computing Authors: Jose M.N. Vieira, University of Aveiro Paulo J.S.G. Ferreira, University of Aveiro Volume: 3, Page: 1831 Abstract: This paper uncovers relations between the topics mentioned in the title, relations that we believe to have gone nearly unnoticed so far. More precisely, we show that four often studied problems in signal processing, spectrum analysis, information theory, and computing are closely related or even equivalent in a certain sense (if one of them can be solved, so can any of the others, and using essentially the same algorithms).
The problems are (i) a nonlinear band-limited finite-dimensional interpolation problem (ii) the problem of estimating a signal that is the superposition of a finite number of harmonics (iii) an error-control coding problem in the real field, and (iv) certain techniques that occur in algorithm-based fault tolerant computing. The advantages of recognizing these problems as equivalent are obvious: the techniques commonly used in one field can be imported to the others, the duplication of research efforts is prevented, and the overall degree of understanding of the four problems increases. New algorithms are suggested as a result of these investigations. ** Title: Analysis of the stability of time-domain source separation algorithms for convolutively mixed signals Authors: Yannick Deville, LEP Nabil Charkani, LTIRF/INPG Volume: 3, Page: 1835 Abstract: In this paper, we investigate the self-adaptive source separation problem for convolutively mixed signals. The proposed approach uses a recurrent structure adapted by a generic rule involving arbitrary separating functions. We first analyze the stability of this class of algorithms. We then apply these results to some classical rules for instantaneous and convolutive mixtures that were proposed in the literature but only partly analyzed. This provides a better understanding of the conditions of operation of these rules. Eventually, we define and analyze a normalized version of the proposed type of algorithms, which yields several attractive features. ** Title: Time-varying reconstruction of stationary processes subjected to analogue periodic scrambling Authors: Alban Duverdier, ENSEEIHT Bernard Lacaze, ENSEEIHT Volume: 3, Page: 1839 Abstract: In modern telecommunications, it is often desirable to scramble the contents of the information. This paper presents a particularly efficient method of analogue signal scrambling. A stationary process is subjected to scrambling by means of a linear periodic time-varying filter. 
We then observe a cyclostationary process. We demonstrate that perfect reconstruction is possible. In the presence of overlapping spectra, unscrambling requires a time-varying filter. We apply this method to scramble stationary binary signals. Simulations show that the system is resistant to additive noise. ** Title: Signal Recovery From Grouped Data Authors: M. Pawlak, University of Manitoba U. Stadtmuller, University of Ulm Volume: 3, Page: 1843 Abstract: The problem of recovering a signal in the class of band-limited functions is studied. We consider a situation in which discrete data points are first grouped to the points of a uniform grid and the reconstruction is then carried out from such a reduced data set. Data grouping is common with computer rounding errors and may also be viewed as a data compression process. The accuracy of the proposed grouping techniques is examined. These results are used to provide an understanding of the number of grid points required to achieve a given level of accuracy. ** Title: Two-Dimensional Pilot-Symbol-Aided Channel Estimation by Wiener Filtering Authors: Peter Hoeher, DLR Stefan Kaiser, DLR Patrick Robertson, DLR Volume: 3, Page: 1845 Abstract: The potential of pilot-symbol-aided channel estimation in two dimensions is explored. To this end, the discrete shift-variant 2-D Wiener filter is derived and analyzed given an arbitrary sampling grid, an arbitrary (but possibly optimized) selection of observations, and the possibility of model mismatch. Filtering in two dimensions is revealed to outperform filtering in just one dimension with respect to overhead and mean-square error performance. However, two cascaded orthogonal 1-D filters are simpler to implement and shown to be virtually as good as true 2-D filters.
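As a rough illustration of the Wiener-filtering idea in the abstract above, the following sketch computes 1-D Wiener interpolation weights for estimating a channel coefficient from noisy pilot observations. The exponential correlation model `rho ** |d|`, the real-valued unit-power channel, the pilot positions, and all function names are assumptions made for this example; none of them come from the paper, and the 2-D and cascaded 1-D filters it studies are not reproduced here.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs stay untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def correlation(distance, rho):
    """Assumed channel correlation model (hypothetical): rho ** |distance|."""
    return rho ** abs(distance)


def wiener_weights(pilot_pos, target_pos, rho, noise_var):
    """Weights w minimising E|h(target) - sum_i w_i y_i|^2, y_i = h(p_i) + noise."""
    # Autocorrelation matrix of the noisy pilot observations.
    r_pp = [[correlation(p - q, rho) + (noise_var if i == j else 0.0)
             for j, q in enumerate(pilot_pos)]
            for i, p in enumerate(pilot_pos)]
    # Cross-correlation between the wanted coefficient and the pilots.
    r_hp = [correlation(target_pos - p, rho) for p in pilot_pos]
    return solve_linear(r_pp, r_hp)


def wiener_mse(pilot_pos, target_pos, rho, noise_var):
    """Minimum mean-square error 1 - r^T w of the estimate at target_pos."""
    w = wiener_weights(pilot_pos, target_pos, rho, noise_var)
    r_hp = [correlation(target_pos - p, rho) for p in pilot_pos]
    return 1.0 - sum(wi * ri for wi, ri in zip(w, r_hp))
```

Under these assumptions, a noiseless pilot located exactly at the target position receives weight one and all other pilots weight zero, while adding observation noise spreads the weight over neighbouring pilots and raises the residual error.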
** Title: Blind Equalization of Switching Channels By ICA and Learning of Learning Rate Authors: Howard Hua Yang, RIKEN Shun-ichi Amari, RIKEN Volume: 3, Page: 1849 Abstract: In the literature of blind equalization, algorithms developed for equalizing an SISO or SIMO channel sometimes fail when the channel condition is poor. We derive blind equalization algorithms from blind separation algorithms to equalize the SISO channel with fractional sampling. The approach is also applied to equalize SIMO or MIMO channels. For switching channels, we use an updating rule to tune the learning rate of on-line algorithms automatically to follow the channel change. The idea can be applied to improve any blind equalization algorithm for equalizing switching channels. ** Title: Adaptive Soft-Constraint Satisfaction (SCS) Algorithms for Fractionally-Spaced Blind Equalizers Authors: Buyurman Baykal, Imperial College Oguz Tanrikulu, Imperial College Jonathon A. Chambers, Imperial College Volume: 3, Page: 1853 Abstract: Constant Modulus algorithms based on a deterministic error criterion are presented. Soft constraint satisfaction methods yield a general family of blind equalization algorithms employing nonlinear functions of the equalizer output which must satisfy certain conditions. The algorithms are also extended to cover fractionally-spaced blind equalization. A normalization factor which appears as a result of the deterministic formulation of the problem helps the blind equalizer improve its performance. Also, the family supports a wide range of nonlinear functions. Extensive simulations are presented to reveal convergence characteristics which also include signals from the Signal Processing Information Base (SPIB).
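For orientation, one widely used constant-modulus update with a normalization factor is the normalized CMA, sketched below. The abstract does not say whether this coincides with any member of the authors' SCS family, so it should be read as a generic illustration only; the function name `ncma_update`, the step size `mu`, and the target modulus `radius` are hypothetical names chosen for this example.

```python
def ncma_update(w, x, mu=1.0, radius=1.0):
    """One normalized constant-modulus update of the equalizer taps.

    w, x: equal-length lists of complex numbers (taps and regressor).
    The equalizer output is y = sum(w_i * x_i).
    Returns (new_taps, y), where y is the output before the update.
    """
    y = sum(wi * xi for wi, xi in zip(w, x))
    energy = sum(abs(xi) ** 2 for xi in x)
    if energy == 0.0 or y == 0:
        # No direction to project along; leave the taps unchanged.
        return list(w), y
    # Error between the output and its projection onto the circle |y| = radius.
    err = radius * y / abs(y) - y
    # Dividing by the regressor energy is the normalization step.
    step = mu * err / energy
    new_w = [wi + step * xi.conjugate() for wi, xi in zip(w, x)]
    return new_w, y
```

With `mu = 1` the output recomputed with the updated taps on the same regressor lies exactly on the circle of radius `radius`; smaller values of `mu` move it only part of the way there.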
** Title: Complete iterative reconstruction algorithms for irregularly sampled data in spline-like spaces Authors: Akram Aldroubi, NIH, Bethesda Hans Georg Feichtinger, University of Vienna Volume: 3, Page: 1857 Abstract: We prove that the exact reconstruction of a function $f$ from its samples $f(x_i)$ on any "sufficiently dense" sampling set $\{x_i\}$ in $R^n$, where the index set is countable, can be obtained for a large class of spline-like spaces that belong to $L^p(R^n)$. Moreover, the reconstruction can be implemented using fast iterative algorithms. Since a special case is the space of bandlimited functions, our result generalizes the classical Shannon-Whittaker sampling theorem on regular sampling and the Paley-Wiener theorem on nonuniform sampling. ** Title: Signal De-Noising using the Wavelet Transform and Regularization Authors: Sony John, Caltech Uday Desai, IIT-Bombay Volume: 3, Page: 1861 Abstract: This paper presents a new signal de-noising algorithm using wavelets. We have developed a filtering scheme in the wavelet domain that involves selective smoothing at each scale of the time-frequency plot. The amount of smoothing is controlled by regularizing factors, and gradient-based switches are used to avoid distortion of signal features. The algorithm is seen to compare favorably to that of Mallat et al., as it is able to recover both the smooth portions and the Brownian texture in the input from the noisy signal. ** Title: Exact Multichannel Deconvolution on Radial Domains Authors: Stephen D. Casey, American University Carlos A. Berenstein, American University David F. Walnut, American University Volume: 3, Page: 1865 Abstract: A novel multisensor approach to deconvolution is developed. This theory circumvents the ill-posedness inherent in convolution equations by overdetermining the input signal by a multichannel system of convolvers $(\mu_i)$, chosen so that any information lost by one channel is retained by another.
The deconvolution problem is then solved by constructing ``deconvolvers'' that allow us to construct the Dirac $\delta$ by filtering each $\mu_i$ by its deconvolver, and then adding the filtered channels together. This in turn allows us to reconstruct the original signal $f$. The process is linear and stable with respect to noise. The general multichannel theory is discussed. The deconvolution theory in radially symmetric domains is then developed in greater detail. ** Title: Signal Reconstruction from Phase Only Information and Application to Blind System Estimation Authors: Haralambos Pozidis, Drexel University Athina P. Petropulu, Drexel University Volume: 3, Page: 1869 Abstract: We propose a method for the reconstruction of a complex signal from its Fourier phase only, where the phase is known within a linear phase term, and the sequence's length is unknown. The case of exactly known phase has received a lot of attention in the past; in most cases, however, the phase can only be estimated up to a linear phase term whose slope is unknown. Moreover, in most cases of interest, the exact length of the sequence which is to be recovered is unknown. As an application of the reconstruction-from-phase technique, we propose a method for blind channel identification. ** Title: A fast Gauss-Newton parallel-cascade adaptive truncated Volterra filter Authors: Thomas Panicker, University of Utah V. Mathews, University of Utah Volume: 3, Page: 1873 Abstract: This paper introduces a computationally efficient Gauss-Newton type adaptation algorithm for parallel-cascade realizations of truncated Volterra systems with arbitrary, but finite-order, nonlinearity. Parallel-cascade realizations implement higher-order Volterra systems using parallel and multiplicative combinations of lower-order Volterra systems. The complexity of our system is comparable to the complexity of the system model itself, and is considerably less than that of the fast RLS Volterra filters.
Results of experiments comparing the Gauss-Newton method with a competing structure with similar computational complexity as well as demonstrating the capability of parallel-cascade systems to approximate truncated Volterra systems are also included in the paper. ** Title: Sufficient Stability Bounds for Slowly Varying Discrete-Time Recursive Linear Filters Authors: Alberto Carini, DEEI, University Trieste V. Mathews, University of Utah Giovanni L. Sicuranza, DEEI, University Trieste Volume: 3, Page: 1877 Abstract: This paper derives sufficient time-varying bounds on the maximum variation of the coefficients of an exponentially stable, linear, time-varying and recursive filter. The stability bound is less conservative than all previously derived bounds for time-varying IIR systems. The bound is then applied to control the step size of output error adaptive IIR filters to achieve exponentially stable operation. Experimental results that demonstrate the good stability characteristics of the resulting algorithms are included in this paper. ** Title: Spread Spectrum Interference Suppression Using Adaptive Time-Frequency Tilings Authors: Brian S. Krongold, University of Illinois Kannan Ramchandran, University of Illinois Douglas L. Jones, University of Illinois Michael L. Kramer, University of Illinois Volume: 3, Page: 1881 Abstract: Interference suppression in spread spectrum communication systems is often essential for achieving maximum system performance. Existing interference suppression methods do not perform well for most types of nonstationary interference. We first consider interference suppression schemes based on adaptive orthogonal time-frequency decompositions, such as wavelet packet and arbitrary dyadic time-frequency tilings. These methods often reduce interference substantially, but their performance can vary dramatically with minor changes in interference characteristics such as the center frequency. 
To circumvent these drawbacks, we propose a multiple overdetermined tiling (MODT) with an accompanying blind interference excision scheme which appears very promising for mitigating time-frequency-concentrated interference. Simulations with narrowband, impulsive, and simultaneous impulsive and narrowband interference compare the performance of the various methods and illustrate the promise of approaches based on multiple overdetermined tilings. ** Title: A DSP Based Long Distance Echo Canceller Using Short Length Centered Adaptive Filters Authors: Paulo Alexandre Marques, ISEL Fernando Manuel Sousa, ISEL Jose Manuel Leitao, IST Volume: 3, Page: 1885 Abstract: This paper describes an implementation of a long distance echo canceller which copes with double talking situations and exceeds the CCITT G.165 recommendation. The proposed solution is based on short length adaptive filters centered on the positions of the most significant echoes, which are tracked by time-delay estimators. To deal with double talking situations a speech detector is employed. The resulting algorithm enables long-distance echo cancellation with low computational requirements. It reaches greater echo return loss enhancement and shows faster convergence speed as compared with results reported in recent literature. ** Title: Optimal and Robust Shockwave Detection and Estimation Authors: Brian M. Sadler, ARL Laurel C. Sadler, ARL Tien Pham, ARL Volume: 3, Page: 1889 Abstract: We consider detection and estimation of aeroacoustic shockwaves generated by supersonic projectiles. The shockwave is an N-shaped acoustic wave. The optimal detection/estimation scheme is considered based on an additive white Gaussian noise model. The introduction of an invertible linear transformation, such as the Fourier transform or the wavelet transform, does not improve detection performance under this model. However, if unknown interference and/or model mismatch is present, linear transforms may be of use. 
In addition, they may significantly reduce complexity at the cost of sub-optimality. We consider the use of the wavelet transform as a means of detecting the very fast rise and fall times of the shockwave, resulting in a 1-D edge detection problem. This method is effective at moderate to high SNR and is robust with respect to unknown environmental interference that will generally not exhibit singularities as sharp as the N-wave edges. ** Title: Automatic Fault Monitoring using Acoustic Emissions Authors: Gopal Venkatesan, Dept. of EE, University of Minnesota Dennis West, Dept. of EE, University of Minnesota Kevin Buckley, Dept. of EE, University of Minnesota Ahmed H. Tewfik, Dept. of EE, University of Minnesota Mostafa Kaveh, Dept. of EE, University of Minnesota Volume: 3, Page: 1893 Abstract: Techniques for automatic monitoring of faults in machinery are being considered as a means to safely simplify or dispense with expensive periodic fault inspection procedures. This paper presents results from an ongoing investigation into the feasibility of using Acoustic Emissions (AEs) for automatic detection of microcrack formation/growth in machine components. ** Title: A new Algorithm for double talk detection and separation in the context of digital mobile radio telephone Authors: Hassan Ezzadi, DSA Univ. du Quebec Jean Rouat, DSA Univ. du Quebec Ivan Bourmeyster, Alcatel Volume: 3, Page: 1897 Abstract: This paper describes a new technique that enhances the Voice Activity Detection (V.A.D) performance between the remote speaker (receive signal) and the local speaker (located in the vehicle) in the context of mobile radio telephone environment. We use an Auditory Pitch and voiced/unvoiced Detection (A.P.D) in conjunction with an Auto Regressive (A.R) analysis in order to remove the remote speaker's voice signal from the car hands-free microphone signal. Results are compared with the reference system that doesn't include the APD. 
** Title: Transmission of chosen transform coefficients of normalized cardiac beats for compression Authors: Supratim Saha, University of Erlangen Ramakrishnan Angarai G., Department of Electrical Engineering, Indian Institute of Science Volume: 3, Page: 1901 Abstract: A new technique for ECG compression is presented. Each delineated ECG beat is period normalized by multirate processing and then amplitude normalized. A Discrete Wavelet Transform (DWT) based on Daubechies-4 basis functions is applied to these normalized beats, after shifting each of them to the origin. The concatenation of ordered DWT coefficients of these beats is a near-cyclostationary signal. An algorithm is proposed to select a set of common positions of the significant coefficients to be retained from each beat. Linear Prediction is then applied to predict only these DWT coefficients of the current beat from the corresponding coefficients of a certain number of previous beats. Transmitting only the residuals of selected coefficients improves compression. A significant advantage of this technique is that the maximum reconstruction error in any cycle does not occur in the diagnostically crucial QRS region, while achieving a compression of about 15:1 and a normalized root mean square error of about 10%. ** Title: FIR Filters in Envelope Constrained Filter Design Authors: Ba-Ngu Vo, ATRI, Curtin University of Technology Thi-Ngoc Ho, EE, UWA. Antonio Cantoni, ATRI, Curtin University of Technology Victor Sreeram, EE, UWA. Volume: 3, Page: 1905 Abstract: Consider a continuous-time filter whose structure comprises an A/D converter, an FIR filter, a D/A converter and an analog post-filter. The envelope constrained (EC) filtering problem for this filter structure is to design its digital component so as to minimize the effect of input noise whilst satisfying the constraint that the noiseless response of the filter to a specified excitation fits into a prescribed envelope.
This problem is formulated as a quadratic programming (QP) problem with functional inequality constraints. Approximating this continuum of constraints by a finite set, the problem is solved by QP via an active set strategy. ** Title: Modified Cepstral Analysis For Accurate Estimation Of Echo Parameters In Telecommunication Networks Authors: Matteo Bertocco, Padova University Dionisio Lorenzin, Necsy Pietro Paglierani, Padova University Volume: 3, Page: 1909 Abstract: A modified cepstral analysis for accurate estimation of the echo delay and the echo loss in a telecommunication system is presented. It is based on the optimization of a parametric transformation of the observed signal energy spectrum. Simulation results that show the effectiveness and the accuracy of the proposed method are reported and discussed. ** Title: Bispectral Reconstruction using Incomplete Phase Knowledge: a Neuroelectric Signal Estimation Application Authors: Olivier Meste, University of Nice-Sophia Antipolis Volume: 3, Page: 1913 Abstract: The bispectral averaging technique is often used to analyze signals with variable delay in the presence of noise. Unfortunately, as the bispectrum is time-shift invariant, the initial phase of the signal cannot be recovered. When studying somatosensory evoked potentials (neuroelectric signals), this phase is generally the most important information, especially when it characterizes pathologies. We show that some information about this phase can be extracted from the averaged signal. An attempt is made to include this knowledge in the magnitude and phase recovery algorithms. We illustrate the benefits of this approach on a simulation and a real application, where it leads to enhanced detail in the analyzed signal. ** Title: Optimal Phase-Locked Loop Design with Kalman Predictors for Synchronous Networks Authors: Gustavo A. Hirchoren, UNICAMP Dalton S.
Arantes, UNICAMP Volume: 3, Page: 1917 Abstract: A systematic technique for the optimal design of phase-locked loops for synchronous networks is presented. The method is based on Kalman estimation theory under self-similar random noise processes. This approach is optimal for certain noise models and for linear phase-detectors. The results are then extended in order to maintain the minimum mean-square phase error when the reference signal of a master-slave network is lost. ** Title: Overparametrization in Adaptive Filters Authors: Albertus C. den Brinker, Eindhoven University of Technology Volume: 3, Page: 1921 Abstract: Adaptive filters can be made fault tolerant by overparametrization. Conditions are derived such that no deterioration is caused by the redundancy under fault-free operation and that the deterioration caused by weight failures is minimized. ** Title: Recursive Estimation of Linearly or Harmonically Modulated Frequencies of Multiple Cisoids in Noise Authors: Petr Tichavsky, Inst. of Information Theory and Automation, Prague Peter Handel, Tampere University of Technology Volume: 3, Page: 1925 Abstract: Tracking of slowly varying parameters of multiple sinusoids or cisoids (complex-valued sinusoids) in additive noise is an important problem in many engineering applications such as radar, communications, control, biomedical engineering and others. In some applications the sinusoidal frequencies are piecewise linear or periodic functions of time. Signals with harmonically varying sinusoidal frequencies are encountered e.g. in coherent laser radar technology for remote sensing of vibrational characteristics of objects. In these cases, standard algorithms for tracking of (multiple) sinusoidal frequencies, such as the adaptive notch filter, exhibit a nonzero tracking delay, which can be interpreted as an estimation bias. 
To eliminate this bias, two novel algorithms are designed, one for tracking linearly modulated frequencies and the other for tracking harmonically modulated frequencies. Both algorithms simultaneously separate the measured signal into individual components and update the signal parameters using estimated phase differences. The performance of the algorithms is demonstrated by simulations. ** Title: Selective Coefficient Update of Gradient-Based Adaptive Algorithms Authors: Tyseer Aboulnasr, University of Ottawa Khaled Mayyas, JUST Volume: 3, Page: 1929 Abstract: One common approach to reducing the computational overhead of the normalized LMS (NLMS) algorithm is to update a subset of the adaptive filter coefficients. It is known that the mean square error (MSE) is not equally sensitive to the variations of the coefficients. Accordingly, the choice of the coefficients to be updated becomes crucial. On this basis, we propose an algorithm that belongs to the same family but selects at each iteration a specific subset of the coefficients that will result in the largest reduction in the performance error. The proposed algorithm reduces the complexity of the NLMS algorithm, as do the current algorithms from the same family, while maintaining a performance close to the full update NLMS algorithm specifically for correlated inputs. ** Title: A Pipelined Architecture for LMS Adaptive FIR Filters Without Adaptation Delay Authors: Quanhong Zhu, University of Utah Scott C. Douglas, University of Utah Kent F. Smith, University of Utah Volume: 3, Page: 1933 Abstract: Past methods for mapping the least-mean-square (LMS) adaptive finite-impulse-response (FIR) filter onto parallel and pipelined architectures either introduce delays in the coefficient updates or have excessive hardware requirements.
In this paper, we describe a pipelined architecture for the LMS adaptive FIR filter that produces the same output and error signals as would be produced by the standard LMS adaptive filter architecture without adaptation delays. Unlike existing architectures for delayless LMS adaptation, the new architecture's throughput and hardware complexity are, respectively, independent of and linear in the filter length. ** Title: Deterministic Stability Analyses Of Unit-Norm Constrained Algorithms For Unbiased Adaptive IIR Filtering Authors: Markus Rupp, Lucent Technologies Scott Douglas, Lucent Technologies Volume: 3, Page: 1937 Abstract: Recently, two simple gradient-based algorithms for unbiased IIR system identification in the presence of zero-mean correlated output noise were derived and shown to perform well in simulation. In this paper, we study the stability and robustness of these two adaptive filters, deriving strictly positive real (SPR) conditions on the overall unknown-plus-adaptive systems to guarantee convergence of the coefficients to their optimum values. Unlike other algorithms for unbiased IIR adaptive filtering, the stability of each of these algorithms depends on the initial values of the filter coefficients. However, near the optimum coefficient solutions, both algorithms are locally stable, irrespective of the unknown system. Simulations verify the results of our analyses. ** Title: A Modified Normalized Lattice Adaptive Filter for Fast Sampling Authors: Parthapratim De, University of Cincinnati. Howard Fan, University of Cincinnati. Volume: 3, Page: 1941 Abstract: Most filters, adaptive or not, formulated using the delay operator, have no limit when sampling becomes fast and will therefore have numerical problems. We show that one reason the normalized lattice filter has fewer numerical problems is that it has a limit as the sampling period tends to zero.
The transfer function in the $s$-domain obtained as a limit of the normalized lattice filter will, however, have only every other power in the denominator polynomial. We propose a modified normalized lattice filter that can realize any arbitrary transfer function in the discrete ($z$) domain and whose order-recursive limit as the sampling period tends to zero can realize any arbitrary transfer function in the $s$-domain. Various stability properties of the new lattice are also studied. ** Title: Symmetric Alpha-Stable Filter Theory Authors: John S. Bodenschatz, University of Southern California Volume: 3, Page: 1945 Abstract: Symmetric alpha-Stable (SAS) processes are used to model impulsive noise. Wiener filter theory is generally not meaningful in SASP environments because the expectations may be unbounded. To develop a filter theory for linear finite impulse response systems with independent identically distributed SASP inputs, we propose median orthogonality as a linear filter criterion, present a generalized Wiener-Hopf solution equation, and show a necessary condition for a filter to achieve the criterion. For non-Gaussian SASP densities, zero-forcing least-mean-square is the only well-known filter that satisfies the criterion, but others can easily be designed. We present a second algorithm and simulations showing that both converge to the generalized Wiener-Hopf solution. ** Title: Adaptive Channel Equalization using Context Trees Authors: Owen E. Kelly, Rice University Don H. Johnson, Rice University Volume: 3, Page: 1949 Abstract: The maximum likelihood sequence estimator is the optimal receiver for the inter-symbol interference (ISI) channel with additive white noise. A receiver is demonstrated that estimates sequence likelihood using a variable order Markov model constructed from a crudely quantized training sequence.
Receiver performance is relatively unaffected by heavy-tailed noise that can undermine the performance of Gaussian-based algorithms such as decision feedback equalization with gradient-based (LMS) adaptation. ** Title: Subband Adaptive Filtering with Time-Varying Nonuniform Filter Banks Authors: Michael McCloud, University of Colorado, Boulder Delores Etter, University of Colorado, Boulder Volume: 3, Page: 1953 Abstract: A technique is presented for subband adaptive filtering with nonuniform filter banks. The bandwidth allocations of the subband analysis and synthesis filters are adapted to the spectral characteristics of the input data in such a manner as to minimize an objective function built from the subband error powers. The nonuniform filter bank structure allows for fast convergence times for high order systems with a reduced mean square error relative to the uniform subband scheme. Results are presented for the case of a nonstationary system with time-varying spectral characteristics. ** Title: Best Input for Optimal Tracking Randomly-Time Varying Systems : Justification of Adaptive Predictive Structure Authors: Sofia Ben Jebara, LSTelecoms, ENIT / ESPPT, Tunis Meriem Jaidane, LSTelecoms, ENIT, Tunis Volume: 3, Page: 1957 Abstract: This paper presents a tracking analysis of the LMS algorithm used to identify system variations modeled by a random walk. We prove that the steady state properties are strongly related to the input characteristics. Input correlation degrades performance. Consequently, the best performance is obtained for white input. We then justify coupling adaptive predictive structures with system identification in order to improve the steady state performance of the classical scheme. ** Title: Stability of Variable and Random Stepsize LMS Authors: Saul B. Gelfand, Purdue University Yongbin Wei, Purdue University James V.
Krogmeier, Purdue University Volume: 3, Page: 1961 Abstract: The stability of variable stepsize LMS (VSLMS) algorithms with uncorrelated stationary Gaussian data is studied. It is found that when the stepsize is determined by the past data, the boundedness of the stepsize by the usual stability condition of fixed stepsize LMS is sufficient for the stability of VSLMS. When the stepsize is also related to the current data, the above constraint is no longer sufficient. Instead, both the upper bound and the lower bound of the stepsize must be within a smaller region. An exact expression of the stability region is developed for a single-tap filter. The results are verified by computer simulations. ** Title: Fast Least-Squares Polynomial Approximation in Moving Time Windows Authors: Erich Fuchs, University of Passau Klaus Donner, University of Passau Volume: 3, Page: 1965 Abstract: Only a few time series methods are applicable to signal trend analysis under real-time conditions. The use of orthogonal polynomials for least-squares approximations on discrete data turned out to be very efficient for providing estimators in the time domain. A polynomial extrapolation considering signal trends in a certain time window is obtainable even for high sampling rates. The presented method can be used as a prediction algorithm, e.g. in threshold monitoring systems, or as a trend correction possibility preparing the analysis of the remaining signal. In the theoretical derivation, the recursive computation of orthogonal polynomials allows the development of these fast algorithms for least-squares approximations in moving time windows. ** Title: An Efficient HAAR Wavelet-Based Approach For The Harmonic Retrieval Problem Authors: Yi Chu, National Taiwan Inst. of Technology Wen-Hsien Fang, National Taiwan Inst. 
of Technology Shun-Hsyung Chang, National Ocean University Volume: 3, Page: 1969 Abstract: Modern subspace-based algorithms can offer high-resolution spectral estimates but at the cost of high computational complexity for the eigenvalue decomposition (EVD) involved. In this paper, we propose a novel preprocessing scheme which can be used in conjunction with the subspace-based algorithms to reduce the heavy computation previously required. The new scheme is to demodulate the input data first, and then take the computationally efficient discrete-time Haar wavelet transform (HWT). Only the principal subband component (PSC) of the transformed data is kept for further processing, which not only retains the same amount of information but also possesses the same characteristic as that of the original (noiseless) harmonic data. The subspace-based algorithms are thus applicable to this new set of transformed data but with substantially reduced computational load. Some simulation results are provided to justify the proposed approach. ** Title: Wavelet Transform based Fast Approximate Fourier Transform Authors: Haitao Guo, Rice University C. Sidney Burrus, Rice University Volume: 3, Page: 1973 Abstract: We propose an algorithm that uses the discrete wavelet transform (DWT) as a tool to compute the discrete Fourier transform (DFT). The Cooley-Tukey FFT is shown to be a special case of the proposed algorithm when the wavelets in use are trivial. If no intermediate coefficients are dropped and no approximations are made, the proposed algorithm computes the exact result, and its computational complexity is of the same order as that of the FFT, i.e. $O(N \log_2 N)$. The main advantage of the proposed algorithm is that the good time and frequency localization of wavelets can be exploited to approximate the Fourier transform for many classes of signals, resulting in much less computation. Thus the new algorithm provides an efficient complexity vs. accuracy tradeoff. 
When approximations are allowed, under certain sparsity conditions, the algorithm can achieve linear complexity, i.e. $O(N)$. The proposed algorithm also has built-in noise reduction capability. ** Title: On Computing the 2-D Extended Lapped Transforms Authors: Dragutin Sevic, INFIZ, Belgrade Miodrag Popovic, University of Belgrade Volume: 3, Page: 1977 Abstract: In this paper a new implementation of the two-dimensional Extended Lapped Transform (2-D ELT) is proposed. Compared to the separable solution proposed by Malvar [1], the new realization of the 2-D ELT has reduced arithmetic complexity. Computational savings are achieved because the scaling and inverse scaling of butterfly matrices, suggested by Malvar for the 1-D case, are, after some modifications of the basic separable algorithm, extended to the 2-D case. The new implementation has the same frequency response as Malvar's. ** Title: Efficient computation of the Discrete Wigner Distribution Function through a new iterative algorithm Authors: Isabel Garcia, University of Extremadura Consuelo Gonzalo, Univ. Politecnica de Madrid Margarita Perez-Castellanos, Univ. Politecnica de Madrid Jose A. Moreno, University of Extremadura Jose M. Sanchez-Dehesa, University of Extremadura Volume: 3, Page: 1981 Abstract: This paper presents a new iterative method to speed up the DWDF computation. At present it has been considered from a computational point of view as a 1-D section of the Wigner Kernel (WK) N-point FTs. We propose a new way to compute the DWDF based on the symmetry properties of the WK and the cosine function. The proposed algorithm is doubly based on a subdivision procedure: on the one hand, we have subdivided for each m-value the sum over the k variable into log2N/4-PL partial sums, where PL is the k parity level. On the other hand, for each n-value the algorithm computes the DWDF elements by grouping them into groups depending on the m PL. 
The algorithm has been optimized to reduce memory accesses, and it improves on the FFT algorithms when the number of samples is less than 256; for this number the algorithm matches the FFT algorithms. ** Title: Probabilistic Complexity Analysis of Incremental DFT Algorithms Authors: Joseph M. Winograd, Boston University S. Hamid Nawab, Boston University Volume: 3, Page: 1985 Abstract: We present a probabilistic complexity analysis of a class of multi-stage algorithms for computing successive approximations to the DFT. While the quality of the approximate spectra obtained after any stage of these algorithms can be readily quantified in terms of commonly used input-independent metrics of spectral quality, each stage's arithmetic complexity is dependent on the nature of the input signal. Modeling the input signal as a stationary Gaussian-distributed random process, we obtain estimates of the distribution of the number of arithmetic operations required to complete any algorithm stage. This enables the derivation of important design information such as the probability with which a desired quality of approximation is achieved within a given arithmetic bound. Our results are verified using a Monte Carlo analysis. ** Title: On the Recursive Total Least-Squares Authors: Cuong Pham, St. Clara University Tokunbo Ogunfunmi, St. Clara University Volume: 3, Page: 1989 Abstract: In this paper, by exploiting the Total Least-Squares (TLS) closed-form solution and using the state-space structure in Krein space, we show that the solution of TLS problems can be computed via the recursive Kalman filtering algorithm. This makes it possible to use TLS in real-time applications. ** Title: Parallel-Recursive Filter Structures for the Computation of Discrete Transforms Authors: Richard J. Kozick, Bucknell University Maurice F. 
Aburdene, Bucknell University Volume: 3, Page: 1993 Abstract: A general approach is presented for implementing discrete transforms as a set of first-order or second-order recursive digital filters. Clenshaw's recurrence formulae are used to formulate the second-order filters. The resulting structure is suitable for efficient implementation of discrete transforms in VLSI or FPGA circuits. The general approach is applied to the discrete Legendre transform as an illustration. ** Title: Basefield Transforms Derived From Character Tables Authors: Andreas Klappenecker, University of Karlsruhe Volume: 3, Page: 1997 Abstract: We show that it is possible to define Hartley-like transforms for (generalized) character tables of finite groups. This large class of transforms includes Hartley transforms for discrete Fourier transforms over abelian groups and Hartley-like transforms for the discrete cosine transform of type I. ** Title: Block-Recursive Filters and Filter-Banks Authors: Peceli Gabor, Dept. of Meas.& Instr. Eng., Technical University of Budapest Annamaria R. Varkonyi-Koczy, Dept. of Meas.& Instr. Eng., Technical University of Budapest Volume: 3, Page: 2001 Abstract: Block-oriented signal processing techniques have an exceptional role due to the availability of fast algorithms. However, if larger data segments are to be evaluated in real-time, the delay caused by the block-oriented approach is not always tolerable, especially if the response time of the evaluating system is also specified. This can be exceptionally critical if the signal processing is related to feedback loops. In this paper block-oriented signal processing methods are combined with recursive ones. This combination reduces the delay problem caused by the block-oriented fast algorithms and at the same time keeps the computational complexity at a relatively low level. Possibly the most original component of the suggested solution is the extension of given size signal transformer-bank channels (e.g. 
DFT channels) toward larger blocks simply via recursive averaging. ** Title: Fast Approximate DCT: Basic-Idea, Error Analysis, Applications Authors: Abdulnasir Hossen, University of Kiel Ulrich Heute, University of Kiel Volume: 3, Page: 2005 Abstract: The discrete cosine transform (DCT) has a variety of applications in image and speech processing. The idea of the subband-DFT (SB-DFT) [1], [2] is applied in [3] to the DCT. In this paper the basic idea of the SB-DCT, which is based on subband decomposition of the input sequence, is discussed. Approximation is done by discarding the computations of bands of little energy. The complexity of this fast approximate method is examined by comparing it with a fast cosine-transform method [4] in terms of program running-time. A new accurate analysis of the errors due to the approximation is presented for any number of decomposition stages. New applications of the SB-DCT in speech cepstrum analysis and in echo detection are also included by using the SB-DCT instead of the full-band FFT in calculating the real and complex cepstra. ** Title: Fast Sliding Transforms in Transform-Domain Adaptive Filtering Authors: Annamaria R. Varkonyi-Koczy, Dept. of Meas.& Instr. Eng., Technical University of Budapest Sergios Theodoridis, University of Athens Volume: 3, Page: 2009 Abstract: Transform-domain adaptive signal processing has proved to be very successful in many applications, especially where systems with long impulse responses are to be evaluated. The popularity of these methods is due to the efficiency of the fast signal transformation algorithms and that of the block oriented adaptation mechanisms. In this paper the applicability of the fast sliding transformation algorithms is investigated for transform domain adaptive signal processing. It is shown that these sliding transformers may contribute to a better distribution of the computational load along time and therefore enable higher sampling rates. 
It is also shown that the execution time of the widely used Overlap-Save and Overlap-Add Algorithms can also be shortened. The price to be paid for these improvements is the increase of the end-to-end delay, which in certain configurations may cause some degradation of the tracking capabilities of the overall system. Fortunately, however, there are versions where this delay does not hu ** Title: Time Dependent Autoregressive Spectrum Estimation of Heart Wall Vibrations Authors: Hiroshi Kanai, Tohoku University Michie Sato, Tohoku University Noriyoshi Chubachi, Tohoku University Volume: 3, Page: 2013 Abstract: We present a new method for estimating the spectrum transition of nonstationary signals in low signal-to-noise ratio cases. Instead of the basis functions which are employed by the previously proposed time-varying AR modeling, we introduce the spectrum transition constraint in the cost function described by the partial correlation coefficients so that the method is applicable to noisy nonstationary signals whose spectrum transition patterns are complex. By applying this method to the analysis of vibration signals on the interventricular septum of the heart, noninvasively measured by the method developed in our laboratory using ultrasonics, the spectrum transition pattern is clearly obtained during one beat period for a normal individual and a patient. ** Title: Properties of the Structured Auto-Regressive Time-Frequency Distribution Authors: Jakob Angeby, Chalmers University of Technology Volume: 3, Page: 2017 Abstract: The structured auto-regressive (AR) model was primarily introduced as a means to estimate the parameters of non-stationary signals in additive noise. However, it is straightforward to use the structured AR model as a model-based time-frequency distribution (TFD). It is shown that the structured AR TFD can be interpreted as a member of Cohen's class with a non-stationary adaptive kernel. 
The interpretation of the structured AR TFD as a member of Cohen's class establishes a link between TFDs and signal parameter estimation. ** Title: Zero-Tracking Time-Frequency Distributions Authors: Chenshu Wang, Villanova University Moeness G. Amin, Villanova University Volume: 3, Page: 2021 Abstract: The zero-tracking time-frequency distribution (TFD) is introduced. The local autocorrelation function of the TFD, defined by an appropriate kernel, is used to form a polynomial whose roots correspond to the instantaneous frequencies of the multicomponent signal. Two techniques for zero-tracking based on TFD are presented. The first technique requires updating all of the polynomial signal and extraneous zeros, and is based on the formula relating, to the first order approximation, the changes in the polynomial roots and coefficients. The second technique employs the zero-finding Newton's method to obtain only the zero-trajectories of interest. ** Title: Vector Sampling Expansion: Deterministic and Stochastic Signals Authors: Daniel Seidner, Tel-Aviv University Meir Feder, Tel-Aviv University Volume: 3, Page: 2025 Abstract: This work extends Papoulis' General Sampling Expansion to the vector case where N band limited signals are passed through a multi-input multi-output (MIMO) LTI system that generates M ($M \geq N$) output signals. We find necessary and sufficient conditions for reconstructing the N input signals from the samples of the M output signals, all sampled at N/M the Nyquist rate. A surprising necessary condition is that M/N must be an integer. This condition is no longer necessary when each of the output signals can be sampled at a different rate. 
** Title: Optimal Time Segmentation for Signal Modeling and Compression Authors: Paolo Prandoni, EPFL Martin Vetterli, EECS, UC Berkeley Michael Goodwin, EECS, UC Berkeley Volume: 3, Page: 2029 Abstract: The idea of optimal joint time segmentation and resource allocation for signal modeling is explored with respect to arbitrary segmentations and arbitrary representation schemes. When the chosen signal modeling techniques can be quantified in terms of a cost function which is additive over distinct segments, a dynamic programming approach guarantees the global optimality of the scheme while keeping the computational requirements of the algorithm sufficiently low. Two immediate applications of the algorithm to LPC speech coding and to sinusoidal modeling of musical signals are presented. ** Title: Pre-Filtering for the Initialization of Multi-Wavelet Transforms Authors: Michael Vrhel, Biomedical Engineering and Instrumentation Akram Aldroubi, Biomedical Engineering and Instrumentation Volume: 3, Page: 2033 Abstract: We introduce a new method for initializing the multi-wavelet decomposition algorithm. The approach assumes that the input signal is contained within some well-defined subspace of L2 (e.g. the space of bandlimited functions). The initialization algorithm is the orthogonal projection of the input signal into the space defined by the multi-scaling function. Unlike an interpolation approach, the projection method will always have a solution. We provide examples and implementation details. ** Title: Matching Pursuit With Damped Sinusoids Authors: Michael Goodwin, U.C.Berkeley Volume: 3, Page: 2037 Abstract: The matching pursuit algorithm derives an expansion of a signal in terms of the elements of a large dictionary of time-frequency atoms. This paper considers the use of matching pursuit for computing signal expansions in terms of damped sinusoids. 
First, expansion based on complex damped sinusoids is explored; it is shown that the expansion can be efficiently derived using the FFT and simple recursive filterbanks. Then, the approach is extended to provide decompositions in terms of real damped sinusoids. This extension relies on generalizing the matching pursuit algorithm to derive expansions with respect to dictionary subspaces; of specific interest is the subspace spanned by a complex atom and its conjugate. Developing this particular case leads to a framework for deriving real-valued expansions of real signals using complex atoms. Applications of the damped sinusoidal decomposition include system identification, spectral estimation, and signal modeling for coding and analysis--modification--synthesis. ** Title: Localized Subclasses of Quadratic Time-Frequency Representations Authors: Antonia Papandreou-Suppappola, University of Rhode Island Robin L. Murray, University of Rhode Island Faye G. Boudreaux-Bartels, University of Rhode Island Volume: 3, Page: 2041 Abstract: We discuss the existence of classes of quadratic time-frequency representations (QTFRs), e.g. Cohen, power, and generalized time-shift covariant, that satisfy a time-frequency (TF) concentration property. This important property yields perfect QTFR concentration along group delay curves. It also (1) simplifies the QTFR formulation and property kernel constraints as the kernel reduces from 2-D to 1-D, (2) reduces the QTFR computational complexity, and (3) yields simplified design algorithms. We derive the intersection of Cohen's class with the new power exponential class, and show that it belongs to Cohen's localized-kernel subclass. In addition to the TF shift covariance and concentration properties, these intersection QTFRs preserve power exponential time shifts, important for analyzing signals passing through exponentially dispersive systems. 
** Title: Class-Dependent, Discrete Time-Frequency Distributions via Operator Theory Authors: Jack McLaughlin, University of Washington James Droppo, University of Washington Les E. Atlas, University of Washington Volume: 3, Page: 2045 Abstract: We propose a property for kernel design which results in distributions for each of two classes of signals that maximally separate their energies in the time-frequency plane. Such maximally separated distributions may result in improved classification because the signal representation is optimized to accentuate the differences in signal classes. This is not the case with other time-frequency kernels, which are optimized based upon some criteria unrelated to the classification task. Using our operator theory formulation for time-frequency representations, our "maximal separation" criterion takes on a very easily solved form. Analysis of the solution in both the time-frequency and ambiguity planes is given along with an example on discrete signals. ** Title: Extending the characteristic function method for joint a-b and time-frequency analysis Authors: Franz Hlawatsch, Vienna University of Technology Teresa Twaroch, Vienna University of Technology Volume: 3, Page: 2049 Abstract: We extend the characteristic function method (CFM) to more general groups, operators, and signal spaces. We show that the extended CFM can be applied to projected unitary operators as well as discrete-time/periodic signals. ** Title: An Architecture For Realization Of The Cross-Terms Free Polynomial Wigner-Ville Distribution Authors: Ljubisa Stankovic, University of Montenegro Srdjan Stankovic, University of Montenegro Igor Djurovic, University of Montenegro Volume: 3, Page: 2053 Abstract: A method for the realization of Polynomial Wigner-Ville distributions, in the case of multicomponent signals, is presented. It is based on the author's recently proposed S-method. 
Using this method one may, theoretically, get the sum of the Polynomial Wigner-Ville distributions of each component separately. An architecture for the realization of Polynomial Wigner-Ville distributions, starting from the short time Fourier transform, is given. The method is illustrated on a numerical example. ** Title: Understanding Discrete Rotations Authors: Michael S. Richman, Cornell University Thomas W. Parks, Cornell University Volume: 3, Page: 2057 Abstract: The concept of rotations in continuous-time, continuous-frequency is extended to discrete-time, discrete-frequency as it applies to the Wigner distribution. As in the continuous domain, discrete rotations are defined to be elements of the special orthogonal group over the appropriate (discrete) field. Use of this definition ensures that discrete rotations will share many of the same mathematical properties as continuous ones. A formula is given for the number of possible rotations of a prime-length signal, and an example is provided to illustrate what such rotations look like. In addition, by studying a 90 degree rotation, we formulate an algorithm to compute a prime-length discrete Fourier transform (DFT) based on convolutions and multiplications of discrete, periodic chirps. This algorithm provides a further connection between the DFT and the discrete Wigner distribution based on group theory. ** Title: Design of RNS Frequency Sampling Filter Banks Authors: Uwe Meyer-Base, HSDAL, University of Florida Jon Mellott, HSDAL, University of Florida Fred Taylor, HSDAL, University of Florida Volume: 3, Page: 2061 Abstract: Frequency sampling filters (FSF) are of interest to the designers of multirate filter banks due to their intrinsic low-order, complexity, and linear phase behavior. Fast FSFs residing in smaller packages will be required to support future high-bandwidth, mobile image and signal processing applications. 
Since FSF designs rely on the exact annihilation of selected poles-zeros, a new facilitating technology is required which is fast, compact, and numerically exact. Exact FSF pole-zero annihilation is guaranteed by implementing polynomial filters over an integer ring in the residue arithmetic system (RNS). The design methodology is evaluated as an ASIC. Based on an FPGA technology, at least an 86% complexity reduction can be achieved with even greater advantages gained as a custom VLSI. An RNS-based FSF implementation of an eight channel cochlea filter bank is presented which demonstrates both the performance and packaging advantages of the new FSF paradigm. ** Title: Design of Multirate Systems with Constraints Authors: William M. Campbell, Motorola SSTG Thomas W. Parks, School of Electrical Engineering, Cornell University Volume: 3, Page: 2065 Abstract: The design of constrained multirate systems using a relative $\ell^2$ error criterion is considered. A general algorithm is proposed to solve the problem. One application of the algorithm is the design of a new class of multirate filters for signal decomposition--projection filters. These multirate systems are projection operators that optimally approximate linear time-invariant filters in the $\ell^2$ norm. A second application of constrained multirate filter design is also presented--optimal design of multistage multirate systems. Examples illustrate the new design method and its advantages over design methods intended for linear time-invariant systems. ** Title: Design of Conjugate Quadrature Filters Having Specified Zeros Authors: Wayne Lawton, ISS, NUS Charles A. Micchelli, IBM Volume: 3, Page: 2069 Abstract: Conjugate quadrature filters with multiple zeros at 1 have classical applications to unitary subband coding of signals using exact reconstruction filter banks. 
Recent work shows how to construct, given a set of n negative numbers, a CQF whose degree does not exceed 2n-1 and whose zeros contain the specified negative numbers, and applies such filters to interpolatory subdivision and to wavelet construction in Sobolev spaces. This paper describes a recent result of the authors which extends this construction to an arbitrary set of n nonzero complex numbers that contains no negative or negative reciprocal conjugate pairs. Detailed derivations are to be given elsewhere. We design several filters using an exchange algorithm to illustrate a conjecture concerning the minimal degree and we discuss an application to coding transient acoustic signals. ** Title: Design of Paraunitary Oversampled Cosine-Modulated Filter Banks Authors: Jorg Kliewer, University of Kiel Alfred Mertins, University of Kiel Volume: 3, Page: 2073 Abstract: In this paper we derive perfect reconstruction (PR) conditions for oversampled cosine-modulated filter banks. The results can be regarded as a generalization of the known work for critical subsampling. We show that in the oversampled case we gain some additional degree of freedom, which can be exploited in the filter design process. This leads to PR prototypes with stopband attenuations being much higher than in the critically subsampled PR case. The filters designed as PR filters for the oversampled case can also serve as prototypes for critically subsampled cosine-modulated pseudo QMF banks. ** Title: Time-Domain Design of Linear-Phase PR Filter Banks Authors: Masaaki Ikehara, Keio University Truong Q. Nguyen, University of Wisconsin Volume: 3, Page: 2077 Abstract: In this paper, we present a novel way to design biorthogonal and paraunitary linear phase (LPPUFB) filter banks. The square error of the perfect reconstruction condition is expressed in quadratic form of filter coefficients and the cost function is minimized by solving linear equations iteratively without nonlinear optimization. 
With some modifications, the method can be extended to the design of paraunitary filter banks. Using this method, we can design LPPUFB with many channels easily and quickly. Design examples are given to validate the proposed method. ** Title: QMF Filter Bank Design by a New Global Optimization Method Authors: Benjamin W. Wah, UIUC Yi Shang, UIUC Tao Wang, UIUC Ting Yu, UIUC Volume: 3, Page: 2081 Abstract: In this paper, we study various global optimization methods for designing QMF (quadrature mirror filter) filter banks. We formulate the design problem as a nonlinear constrained optimization problem, using the reconstruction error as the objective, and other performance metrics as constraints. This formulation allows us to search for designs that improve over the best existing designs. We present NOVEL, a global optimization method for solving nonlinear continuous constrained optimization problems. We show that NOVEL finds better designs than simulated annealing and genetic algorithms in solving QMF benchmark design problems. We also show that relaxing the constraints on transition bandwidth and stopband energy leads to significant improvements in the other performance measures. ** Title: Unified Approach to the Design of Quadrature Mirror Filters Authors: Vijay Jain, University of South Florida Volume: 3, Page: 2085 Abstract: A unified approach to the design of linear- and nonlinear-phase QMFs is developed. Formulated as an optimization problem, the design procedure is shown to translate into an eigenvalue-eigenvector problem. To find the optimal filter an algorithm is presented, which typically converges in a few tens of iterations. The flexibility of our design procedure permits several practical extensions to be made readily. These are (a) inclusion of frequency-weighted stopband energy criteria, and (b) inclusion of finite word-length constraints, which is stressed in the paper. 
We have successfully applied our filters to applications such as image coding and analysis; here, their use in wavelet-series analysis of oceanographic data is demonstrated. ** Title: Inverse Filter Technique for High-Precision Ultrasonic Pulsed Wave Range Doppler Sensors Authors: Heinrich Ruser, Siemens AG, ZT KM1, Munchen Martin Vossiek, Siemens AG, ZT KM1, Munchen Alexander v.Jena, Siemens AG, ZT KM1, Munchen Valentin Magori, Siemens AG, ZT KM1, Munchen Volume: 3, Page: 2089 Abstract: Ultrasonic pulsed wave range Doppler sensors find application in various fields, e.g. intruder alarm systems or autonomous vehicle steering. The time-frequency methods commonly used in these sensors, however, have the inherent problem that, due to the transducer's nonconstant and direction-dependent transfer functions, the Doppler frequency cannot be determined with the high accuracy needed for such applications. The easiest way to improve the Doppler resolution is to reduce the signal bandwidth, but only at the expense of worse range resolution. In this paper a direction-dependent inverse filter technique is presented, which compensates erroneous effects of the transfer function in the time-frequency analysis. An ultrasonic intruder alarm system determining location and velocity of persons in rooms serves as an example that the novel approach gives evidently better performance than conventional methods, resulting in both high velocity and range resolution. ** Title: Classification of Piano Sounds Using Time-Frequency Signal Analysis Authors: Christoph Delfs, University of Karlsruhe Friedrich Jondral, University of Karlsruhe Volume: 3, Page: 2093 Abstract: A topical task is the classification of burst-like signals, e.g. in signal detection. Piano sounds are used here as an example. Different time-frequency methods including wavelet processing are used alternatively for feature extraction. A classifier checks whether the generated features are sufficient to identify the correct piano. 
Results of the real data signal processing are presented and discussed. ** Title: Transform/Subband Representations for Signals with Arbitrarily Shaped Regions of Support Authors: John G. Apostolopoulos, MIT Jae S. Lim, MIT Volume: 3, Page: 2097 Abstract: Transform/subband representations form a basic building block for many signal processing algorithms and applications. Most of the effort has focused on developing representations for infinite-length signals, with simple extensions to finite-length 1-D and rectangular support 2-D signals. However, many signals may have arbitrary length or arbitrarily shaped (AS) regions of support (ROS). We present a novel framework for creating critically sampled perfect reconstruction transform/subband representations for AS signals. Our method selects an appropriate subset of vectors from an (easily obtained) basis for a larger (superset) signal space, in order to form a basis for the AS signal. In particular, we have developed a number of promising wavelet representations for arbitrary-length 1-D signals and AS 2-D/$M$-D signals that provide high performance with low complexity. ** Title: On optimum oversampling in the Gabor scheme Authors: Martin J. Bastiaans, Technical University of Eindhoven Volume: 3, Page: 2101 Abstract: The windowed Fourier transform of a time signal is considered, as well as a way to reconstruct the signal from a sufficiently densely sampled version of its windowed Fourier transform using a Gabor representation; following Gabor, sampling occurs on a two-dimensional time-frequency lattice with equidistant time intervals and equidistant frequency intervals. In the limit of infinitely dense sampling, the optimum synthesis window (which appears in Gabor's reconstruction formula) becomes similar to the analysis window (which is used in the windowed Fourier transform). 
It is shown that this similarity can already be reached for a rather small degree of oversampling, if the sampling distances in the time and frequency directions are properly chosen. A procedure is presented with which the optimum ratio of the sampling intervals can be determined. The theory is elucidated by finding the optimum ratio in the cases of a Gaussian and an exponential analysis window. ** Title: The Discrete-Time Frequency Warped Wavelet Transforms Authors: Gianpaolo Evangelista, University of Naples Sergio Cavaliere, University of Naples Volume: 3, Page: 2105 Abstract: In this paper we show that the dyadic wave