
Proceedings of SST 1992

Page numbers refer to nominal page numbers assigned to each paper for purposes of citation.

Speech Perception I

Pages Authors Title and Abstract PDF
2--7 P. Basile, F. Cutugno, P. Maturi The Wavelet Transform And Phonetic Analysis

Abstract  The advantages of the application of a wavelet-transform analysis system to spoken materials are here discussed, with reference to some Italian examples.

PDF
8--13 M.P. Cooke and G.J. Brown Computational Auditory Scene Analysis: Exploiting The Continuity Illusion

Abstract  Acoustic sources are often occluded by other sounds, yet the strategies employed by the auditory system are remarkably robust against these intrusions. There are often sufficient cues which allow the auditory system to determine whether sound components continue through such occlusions. In this paper, we review the situations where an assumption of continuity is warranted and show how the so-called continuity illusion can be modelled within a computational system for auditory scene analysis. We present results which demonstrate the practical effectiveness of the model in improving the performance of a system for segregating speech from other acoustic sources.

PDF
14--19 Kerrie Lee and Phillip Dermody The Relationship Between Perceptual And Acoustic Analysis Of Speech Sounds.

Abstract  The study investigates the relationship between the perceptual identification of CV syllables and acoustic analyses of these sounds. The perceptual data for the speech sounds are used to show the relative robustness of the different speech syllables. Acoustic analyses including traditional measures of overall energy, duration and position of spectral peaks as well as measures based on spectral slope and ratio measures between spectral peaks were performed on the same syllables. The results suggest that acoustic characteristics such as energy and duration are not simply related to perceptual robustness of speech sounds and that perceptual robustness is related to relational measures between spectral peaks in the signal.

PDF

Acoustic Phonetics I

Pages Authors Title and Abstract PDF
22--27 Shuping Ran and J.Bruce Millar Phonetic Feature Extraction Using Artificial Neural Networks

Abstract  The work described in this paper is part of a strategy to investigate useful architectures of parallel computation which encode speech knowledge into a speech recognition system to optimise its performance. At the first level of the system, phonetic features (an extended set of Jakobson et al.'s distinctive features) are extracted from burst onset intervals and pseudo-static vowel intervals of CVd words. Our result gives limited support to the existence of invariant cues for some of these features.

PDF
28--33 Janet Fletcher and Andrew McVeigh Towards A Model Of Segment And Syllable Duration In Australian English

Abstract  A corpus of nearly 6500 syllables and their component segments was analysed to formulate a model of segment and syllable duration for Australian English. Segments were grouped into four prosodic categories: unstressed, stressed, pitch accented and phrase final. Syllables were labelled and analysed according to their length (number of segments), prosodic context and grammatical function. Syllable duration was modelled using a three-layer neural network that was trained and tested on different portions of the database. Segment durations were stretched or compressed to fit the network-assigned syllable duration frame. This relatively simple model was able to account for nearly 80% of the durational variance observed in the database.

PDF
34--39 Santha Sampath, David Slater and Michael Wagner An Acoustic-Phonetic Comparison Of English Diphthongs Produced By Native Speakers Of Tamil And Australian English

Abstract  This paper describes a comparative study of diphthongs produced in two different dialects of English, namely that spoken by native speakers of Tamil and that spoken by native speakers of Australian English. Similarities as well as some notable differences between these speaker groups have been found in the production of certain diphthongs. Further differences have also been observed between diphthongs produced by Tamil speakers who have been resident in Australia for a short time and those who have been resident for a long time. The small number of diphthongs occurring in Tamil is suggested as a possible explanation for some of the contrasts reported herein.

PDF
40--45 K. N. Stevens Models For Production And Acoustics Of Stop Consonants

Abstract  Stop consonants are produced by forming a closure in the vocal tract, building up pressure in the mouth behind this closure, and releasing the closure. Models of the mechanical, aerodynamic, and acoustic events in the vicinity of the stop consonant are described, and examples of calculations of the airflow and of various components of the radiated sound are given.

PDF

Speech Analysis I

Pages Authors Title and Abstract PDF
48--53 Frantz CLERMONT Formant-Contour Parameterisation Of Vocalic Sounds By Temporally-Constrained Spectral Matching

Abstract  A method is described for estimating the three lowest resonance or formant frequencies (F1, F2 and F3) of a vocalic sound, and for tracking the temporal course of each formant through the duration of the sound. The problem of estimating these frequencies at a short-time frame of the speech signal is approached by spectral matching, i.e., by analysis-by-synthesis of hypothesised spectra. The problem of tracking such spectra over consecutive frames is then recast as an optimum path search, with temporal constraints defined in a Dynamic Programming framework. Both estimation and tracking algorithms hinge on the formant-enhancement and the formant-sensitivity properties of the negative-derivative Linear-Prediction phase spectrum. Using a moderately large dataset of ten English vowels produced randomly by a male speaker three times in CVd and VC (C = /b, d, g/) contexts, the method presented here is shown to yield formant-contours (the F1 and F2 in particular) which are very similar to those tracked manually by an expert phonetician. The strength of the correlations is found to be 0.99, 0.98 and 0.71 for F1, F2 and F3, respectively.

PDF
54--59 Y.K. Jang, J.F. Chicharo and B. Ribbum Pitch Detection And Estimation Using Adaptive IIR Comb Filtering

Abstract  A modified gradient-based IIR comb filtering technique is presented to estimate and track the pitch period. The advantages of the proposed method compared to the super resolution pitch determination technique (Medan, Yair and Chazan, 1991) are the following: reduced computational burden, improved detection response and the availability of pitch estimates at every sample update. Experimental results are included which show the potential performance of the proposed technique.

PDF

Speaker Verification I

Pages Authors Title and Abstract PDF
62--66 Tom Downs, Ah Chung Tsoi, Mark Schulz, Brian Lovell, Michael Barlow, Ian Booth, David Shrimpton, Brett Watson An Overview Of The Speaker Verification Project At The University Of Queensland

Abstract  This paper gives an overview of a project on speaker verification at The University of Queensland which is funded under a Syndicated Research and Development Program. In 1991, The University of Queensland initiated a number of projects that were funded under an external Syndicated Research and Development program. This program was established by private investors who were interested in investing in pre-competitive, but potentially commercial, research projects. One of these projects concerns speaker verification and is being conducted in the Department of Electrical and Computer Engineering at The University of Queensland. This paper gives an overview of the project.

PDF
67--72 M.E. Forsyth, A.M. Sutherland, J.A. Elliott and M.A. Jack HMM Speaker Verification With Sparse Training Data On Telephone Quality Speech

Abstract  Speaker verification experiments using discrete and semi-continuous HMMs with telephone quality isolated digits are reported. The models were trained with varying numbers of tokens, giving equal error rates of 14% and 16% respectively on single isolated digits, and 4% and 2% on a sequence of 7 isolated digits.

PDF
73--78 Liam Kilmartin and Eliathamby Ambikairajah A Hybrid MLP-RBF Based Speaker Verification System

Abstract  A two-stage neural architecture model is proposed in this paper for the task of speaker verification. This model operates solely in the time domain and hence removes the need for any computationally demanding pre-processing of the input speech signal. The first stage in the model is a Multi-Layer Perceptron (MLP) based non-linear speech predictor. This non-linear prediction process is non-recursive in nature and is carried out for a complete utterance. When the weights in the MLP have converged after several epochs of training, they are then applied as inputs to the second stage of the model. This second stage is a Radial Basis Function (RBF) classifier which will output a decision as to whether the utterance originated with the true speaker or not. The results obtained from initial experiments with this model were promising, but a slight change in the traditional learning process of the MLP stage causes a great improvement in the results obtained. This new model was tested with a small database of speakers. It was found that the model could not only distinguish correctly all of the utterances used to train it but could also generalize to distinguish correctly all of the utterances in a previously unseen test set.

PDF

Speech Databases I

Pages Authors Title and Abstract PDF
80--85 J.Bruce Millar The Description Of Spoken Language

Abstract  This paper outlines the importance of a principled description of spoken language for the future of speech and language technology and proposes an initial structure for the description of spoken language which will draw on the perspectives of professionals from a range of disciplines. A principled description of spoken language is deemed important for the development of speech and language technology for two major reasons: (1) for the efficient use of resources, and (2) for effective assessment of the technology. The implementation of such a spoken language description is described in the context of the ANDOSL (Australian National Spoken Language Database) project. This implementation is seen to be conveniently structured as a hierarchy of levels containing mostly technical but also political and even legal material. The relationship between this structure and other structures in use is discussed.

PDF
86--91 Karen Croot, Janet Fletcher and Jonathan Harrington Levels Of Segmentation And Labelling In The Australian National Database Of Spoken Language

Abstract  A database of spoken Australian English is being developed at the Speech Hearing and Language Research Centre, Macquarie University, as a preliminary to the Australian National Database of Spoken Language. The database comprises speech data which has been segmented and labelled at multiple levels, including acoustic-phonetic, broad phonetic, and intonational, with other levels to be added in the near future. This paper will describe the bases for some of these levels of labelling.

PDF

Speech Recognition I Algorithms

Pages Authors Title and Abstract PDF
92--97 Florian Schiel Phonetically Seeded SCHMM Of Variable Length For Speaker Independent Recognition Of Isolated Words

Abstract  Most applications of HMMs of words use a constant number of states for each model, which are trained by the well-known Baum-Welch algorithm. We propose a new approach with phonetically seeded word models of variable length, that are trained by a Viterbi-like algorithm. Our basic motivation was the idea that every state of a word SCHMM should model a phonetic unit of the word. We can achieve this by creating seed models, where the mixture coefficients of each state describe the features of a phonetic segment of the word. A phonetic segment should be a segment where the feature loudness spectrum (one of 3 feature vectors) behaves quite stationary. A stationary signal leads to a stationary result of the Vector Quantization (VQ) of the loudness spectrum. It is sufficient to investigate the two best results of the semi-continuous VQ over time for stationary behaviour. We define a segment as a time interval, where at least one of a pair of codebook (CB) symbols (m1,m2) is observed in the first or second position of the VQ result. Furthermore there is information about the position of the vowels by detecting maxima in the smoothed loudness function according to [Zwicker, E., 1982]. The segmentations found in different training utterances are matched into one common seed model by an algorithm that tries to identify corresponding phonetical segments correctly. The seed model is re-trained with all training utterances by a Viterbi-like algorithm similar to the segmental-k-means. Although the number of training utterances was only 12 - 36 (6 speakers) we achieved very good results compared with standard HMM of fixed length.

PDF
98--103 Qiang Huo & Chorkin Chan The Gradient Projection Method For The Training Of Hidden Markov Model

Abstract  By looking at the training of HMM as a general constrained optimization problem with linear constraints, in this paper, a gradient projection method for nonlinear programming with linear constraints has been presented to solve for "optimal" values of the model parameters. The presented algorithm has been shown to be convergent and to have a linear convergence rate. When this method is applied to the training of HMMs with discrete or Gaussian mixture observation densities, a very simple formulation has been derived due to the special structure of the constraints of HMM parameters.

PDF
104--109 Tetsuo Kosaka and Shigeki Sagayama An Algorithm For Automatic HMM Structure Generation In Speech Recognition

Abstract  We discuss the number of mixture components in continuous mixture density HMM phone models (CHMMs) and present the principle of "distribution size uniformity," instead of the "mixture number uniformity" principle applied in conventional approaches. An algorithm is introduced for automatically determining the number of mixture components. The performance of this algorithm is shown through recognition experiments involving all Japanese phonemes.

PDF

Poster Session 1

Pages Authors Title and Abstract PDF
112--117 K. Katagishi, H. Singer, K. Aikawa, and S. Sagayama Linear Filtering Of A Feature Vector Sequence For Speech Recognition

Abstract  This paper provides a new interpretation of the so called "delta-cepstrum". We show that it can be regarded as linear filtering of a cepstrum vector sequence, and this basic idea is then extended to a general class of linear filters. Two possible cases, i.e. scalar and matrix coefficients, are considered and tested using phoneme recognition.
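As an illustration of the interpretation described in this abstract, the following is a minimal sketch of the delta-cepstrum computed as a linear (FIR) filter applied along time to each cepstral coefficient track. It covers only the scalar-coefficient case. The regression window half-width `K`, the edge padding, and the function names are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def delta_filter(K=2):
    """Regression (slope) filter taps h[k] = k / (2 * sum_{j=1..K} j^2), k = -K..K."""
    k = np.arange(-K, K + 1)
    return k / (2.0 * np.sum(np.arange(1, K + 1) ** 2))

def delta_cepstrum(cep, K=2):
    """Apply the slope filter along time to every column of a (T, D) cepstrum array.

    Edges are handled by repeating the first and last frames (an assumption).
    """
    h = delta_filter(K)
    padded = np.pad(cep, ((K, K), (0, 0)), mode="edge")
    # delta[t] = sum_k h[k] * c[t + k], computed per coefficient track
    tracks = [np.correlate(padded[:, d], h, mode="valid")
              for d in range(cep.shape[1])]
    return np.stack(tracks, axis=1)

# Usage: a linear ramp in time should give a constant delta equal to its slope
t = np.arange(20.0)
cep = np.stack([0.5 * t, -2.0 * t + 1.0], axis=1)
delta = delta_cepstrum(cep, K=2)
```

Replacing the scalar taps with one D-by-D matrix per lag would give the matrix-coefficient case that the abstract mentions.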

PDF
118--123 A.S. Prabhavathy and T.V. Sreenivas Noise Robustness In Multi-Level Crossing Interval Based Method Of Spectral Estimation

Abstract  A new method of sound estimation in noise using multi-level crossing interval information is presented. Using the statistics of level crossing intervals, a sinusoidal criterion is formulated and an expression for threshold SNR is derived for any single level. The analysis is extended to a pooled set of multi-level crossing intervals resulting in a better SNR threshold of detecting a sinusoid. It is demonstrated that the multi-level crossing information is more robust than single level crossing information and the robustness improves with increasing contribution from higher levels.

PDF
124--129 Mylene Pijpers and Michael D. Alder Affine Transformations Of The Speech Space

Abstract  The papers Speaker Normalization of static and dynamic vowel spectral features (J.A.S.A. 90, July 1991, pp 67-75) and Minimum Mean-Square Error Transformations of Categorical Data to Target Positions (IEEE Trans. Sig. Proc. 40, Jan 1992, pp 13-23) by Zahorian and Jagharghi describe an algorithm for transforming the space of speech sounds so as to improve the accuracy of classification. Classification was accomplished by both back-propagation neural nets and by a Bayesian Maximum Likelihood method on the model of each vowel class being specified by a gaussian distribution. The transformation was an affine transformation obtained by choosing ideal 'target' points for each cluster in a second space and minimising the mean square distance of the points in the speech space from the appropriate target. The speech space itself was a space of cepstral coefficients obtained from a Discrete Cosine Transform. These findings are remarkable, indeed almost unbelievable. The reason is that both the maximum likelihood classification on the gaussian model, and the Neural Net classifier are essentially affine invariant. In the case where the transformation is invertible, this is clearly the case. When the transformation has non-trivial kernel, it may happen that the classification gets worse, but it cannot get better. A back-propagation neural net in effect classifies by dividing the space into regions by means of hyperplanes. The gaussian model does so by means of quadratic forms with quadratic discrimination hypersurfaces. Projecting a hyperplane by any non-zero affine map which is onto the target space will usually give another hyperplane in the target space, and if the second separates points, so will the first. Conversely, if there is a solution in the target space, it can be pulled back to a solution in the domain space. It is not hard to show that similar considerations apply to the case where we use quadratic hypersurfaces.
In this paper, we attempt to account for the results of Zahorian and Jagharghi by investigating vowel data. We describe a simple projection algorithm which may be applied to high dimensional data to give a view on a computer screen of the data and of transformations of it.
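The affine-invariance argument in this abstract can be checked numerically for the invertible case. The sketch below fits one gaussian per class, classifies by maximum likelihood with equal priors, and repeats the procedure after an invertible affine map. The synthetic data, the matrix `A`, the offset `b`, and the function names are all assumptions made for illustration; nothing here reproduces the authors' vowel data or their projection algorithm.

```python
import numpy as np

def fit_gaussians(X, y):
    """Estimate a mean vector and full covariance matrix for each class label."""
    return [(X[y == c].mean(axis=0), np.cov(X[y == c], rowvar=False))
            for c in np.unique(y)]

def classify(X, params):
    """Gaussian maximum-likelihood classification with equal class priors."""
    scores = []
    for mu, cov in params:
        diff = X - mu
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        maha = np.einsum("ij,jk,ik->i", diff, inv, diff)
        scores.append(-0.5 * (maha + logdet))  # log-likelihood up to a constant
    return np.argmax(np.stack(scores, axis=1), axis=1)

# Synthetic three-class data in three dimensions (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(200, 3)) for m in (0.0, 1.5, 3.0)])
y = np.repeat(np.arange(3), 200)

# An invertible affine map x -> A x + b (det A = 2)
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.5, -1.0],
              [1.0, 0.0, 1.0]])
b = np.array([0.3, -1.0, 2.0])
Xt = X @ A.T + b

labels_orig = classify(X, fit_gaussians(X, y))
labels_affine = classify(Xt, fit_gaussians(Xt, y))
agreement = np.mean(labels_orig == labels_affine)
```

Under an invertible map every class log-likelihood shifts by the same constant, log |det A|, so the decisions agree up to floating-point rounding.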

PDF
130--133 Yuki Kakita and Hitoshi Okamoto Fractal Dimensional Analysis Of Voice Fluctuation: Normal And Pathological Cases.

Abstract  This paper reports on the fractional dimension of fundamental frequency (F0) fluctuation and amplitude fluctuation for two kinds of voice, the normal and the pathological (perceived as "rough"). The values of fractal dimension were computed by the embedding method. The median of fractal dimension was 3.3 for the normal voice and 4.8 for the pathological, differing by 1 to 2 dimensions. Reconstructed phase portraits were also examined to characterize the difference between the two kinds of voices geometrically.

PDF
134--139 Dr Leisa Condie Phase Space Behaviour Of Speech

Abstract  Characterisation of phase space behaviour of speech gives an alternate view of speech waveforms. Correlation dimension and false nearest neighbour measures are just two techniques for investigating the dynamical behaviour of speech. The recent false nearest neighbours technique for determining the embedding dimension for phase space reconstruction of a time series is used on both voiced and unvoiced speech. Results from false nearest neighbours are compared with the correlation dimension results.

PDF
140--145 Jianxiong WU, Chorkin CHAN, Pengfei SHI A Gamma Network Approach To Automatic Speech Recognition

Abstract  Gamma neurons [1] are employed in this paper to construct a feed-forward multi-layer network to perform automatic speech recognition. Gamma network can approximate complex time-dependent connection weights with simple network structure. Its structure parameters can be learned in the training phase to determine an optimal structure for dynamics processing to fit the particular application task. An error back-propagation like learning algorithm is derived and experimental results of speaker-independent phoneme recognition are discussed.

PDF
146--151 Brett Watson, Ah Chung Tsoi Second Order Hidden Markov Models For Speech Recognition

Abstract  The application of Hidden Markov Models (HMMs) to speech research has yielded some of the best performing speech and speaker recognition systems. In this paper an extension to standard Markov models, second order HMMs, which allow dependence of transition probabilities on previous states as well as on the current state, incorporating in-context information on the hidden states, is presented. Both standard Baum-Welch re-estimation and discriminative alpha-network training techniques are presented.
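To make the second-order transition structure concrete, the sketch below stores transition probabilities as a three-index array and rewrites the chain as an equivalent first-order chain over state pairs. This is a standard textbook construction shown only for illustration; the array layout `A2[i, j, k]`, the function name, and the random example values are assumptions, and the abstract does not say that the authors use this reduction.

```python
import numpy as np

def second_order_to_pairs(A2):
    """Rewrite second-order transitions as a first-order chain over state pairs.

    A2[i, j, k] is taken to mean P(s_t = k | s_{t-2} = i, s_{t-1} = j).
    Pair state (i, j) is indexed as i * N + j; it can only move to pairs (j, k).
    """
    N = A2.shape[0]
    P = np.zeros((N * N, N * N))
    for i in range(N):
        for j in range(N):
            for k in range(N):
                P[i * N + j, j * N + k] = A2[i, j, k]
    return P

# Usage: a random, row-normalised second-order transition array with 3 states
rng = np.random.default_rng(1)
A2 = rng.random((3, 3, 3))
A2 /= A2.sum(axis=2, keepdims=True)
P = second_order_to_pairs(A2)
```

With this rewriting, standard first-order forward-backward recursions can be run on the N-squared pair states, at the cost of a larger but sparse transition matrix.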

PDF
152--155 M. I. Dawson and S. Sridharan Speech Enhancement Using Time Delay Neural Networks

Abstract  The use of Time Delay Neural Networks (TDNNs) for removing additive noise from continuous speech is described. Mel scaled frequency coefficients are used to parameterise the noisy speech which is input to the network. The network is trained to extract the speech signal from the noise using the gradient back propagation algorithm. Preliminary results are presented.

PDF

Poster Session 1

Pages Authors Title and Abstract PDF
162--166 L.R. Leerink, M.A. Jabri, A.E. Dutech Detection Of Word-Boundaries From Continuous Phoneme Streams Using Simple Recurrent Neural Networks

Abstract  This paper describes a word boundary detection system based on recurrent neural networks. We show that by using a more effective training algorithm the performance of previous research in this area can be matched by a significantly smaller and simpler network. It is then shown that the performance of this network can be further improved by varying the forward context. For the architecture used in our simulations, empirical results indicate that the optimal amount of forward context depends on the number of recurrent units in the network.

PDF
167--172 S-H Luo and R.W.King A Combined Neural Network And Contour Method For Mouth Image Location For Speech-Driven Image Enhancement

Abstract  This paper describes a new and effective mouth locating method to locate automatically the mouth shape in a human head with shoulder image. This work forms part of our research aimed at improving the quality of image compression for very low bit rate video-telephony where the motion of the mouth should correspond exactly to what is being uttered, and is not excessively smoothed by any more general purpose data compression method. The paper outlines how we intend to integrate phonetic information with the mouth shape and motion.

PDF
173--177 R. Togneri, D. Farrokhi, Y. Zhang and Y. Attikiouzel A Comparison Of The LBG, LVQ, MLP, SOM And GMM Algorithms For Vector Quantisation And Clustering Analysis

Abstract  We compare the performance of five algorithms for vector quantisation and clustering analysis: the Self-Organising Map (SOM) and Learning Vector Quantization (LVQ) algorithms of Kohonen, the Linde-Buzo-Gray (LBG) algorithm, the MultiLayer Perceptron (MLP) and the GMM/EM algorithm for Gaussian Mixture Models. We propose that the GMM/EM provides a better representation of the speech space and demonstrate this by comparing the GMM with the LBG, LVQ, MLP and SOM algorithms in phoneme classification and digit recognition.

PDF
178--183 O. A. Alim, E. A. Youssef and A. G. Mokhtar Two Novel Speech Coding Techniques Based On Multipulse Representation

Abstract  Two new coding techniques based on multipulse representation of the excitation signal in the time and frequency domains are presented. Algorithms, illustrations, modelling processes as well as bit rates are discussed. Real speech, English as well as Arabic mono- and disyllabic words, was used to test and evaluate the coding techniques subjectively and objectively. Both coding techniques produced good quality speech, but the time domain method, which operates at bit rates in the range of 8-16 kbps, was found to be superior to the frequency domain method, which operates at a bit rate of 5740 bit/s.

PDF
184--189 Jean-Pierre RENARD, Henri LEICH Design Of Pseudo-Quadrature Mirror Filter Bank For High Quality Subband Coding

Abstract  An approximation method is presented to design pseudo-QMF filter banks for high quality subband coding. This method takes into account the implementation structure of the whole subband system.

PDF
190--195 Gao Yang and H. Leich Dynamic Scalar Quantization Of LSP (Line Spectral Pair)

Abstract  In the scalar quantization (SQ) of LSP, the interframe and intraframe correlations can be utilized to reduce the bit rate; however, the interframe correlation is generally weak for a longer frame shift, which seems to be necessary for speech coding at a low bit rate below 4 kbps; besides, the influence of channel errors is not local if the interframe correlation is used. In this paper, a dynamic SQ method of LSP without using the interframe correlation is proposed; a simple and efficient distortion measure is presented. The experiments show that a satisfactory perceptual quality can be achieved by using 25 bits per frame with the dynamic SQ.

PDF
196--201 C. F. Chan, W. H. Lau, S. P. Chui, and F. L. Hui Improving VSELP Coding Using Truncated Perceptual Weighting Filter And In-Loop Quantization

Abstract  This paper proposes a method to reduce the complexity of VSELP coding and introduces a more efficient quantization scheme for the short-term predictor coefficients. It is shown that, by using a reduced-order perceptual weighting filter, the subjective speech quality is retained while the complexity in codebook searching is reduced tremendously. An efficient analysis/quantization scheme which quantizes the reflection coefficients inside the analysis loop was developed. The new scheme does not alter the coding format of the VSELP standard and obtains a lower quantization distortion than the original scalar quantization scheme.

PDF
202--206 T. DUTOIT, H. LEICH MBR-PSOLA: Text-To-Speech Synthesis Based On An MBE Re-Synthesis Of The Segments Database.

Abstract  The use of the TD-PSOLA algorithm in a Text-To-Speech synthesizer is reviewed. Its drawbacks are comprehensively underlined and three conditions on the speech database are examined. In order to satisfy them, a previously described high quality re-synthesis process is developed and enhanced, which makes use of the well-known MBE model. An important by-product of this operation is that Pitch Marking turns out to be useless. The temporal interpolation block is finally refined. The resulting synthesis algorithm supports spectral interpolation between voiced parts of segments, with virtually no increase in complexity. It provides the basis of a high-quality Text-To-Speech synthesizer.

PDF
207--211 R.E.E.Robinson Synthesising Facial Movement: Data Capture

Abstract  A method of data capture and analysis of lip movement is described. Cine film was digitised by computer and a frame by frame comparison was done to generate a codebook of lip shapes for the synthesis of facial movement. By various methods of data reduction, real-time image playback was achieved.

PDF

Neural Networks

Pages Authors Title and Abstract PDF
214--219 David B. Grayden and Michael S. Scordilis TDNN Vs. Fully Interconnected Multilayer Perceptron: A Comparative Study On Phoneme Recognition

Abstract  The development and performance of a Time-Delay Neural Network (TDNN) and a Fully Interconnected Neural Network (FINN) are compared for continuous speech, speaker-independent recognition of voiced stops and unvoiced fricatives from the DARPA TIMIT speech database. The results conclusively show that the TDNN is the preferred network for phoneme recognition. A major enhancement of the back-propagation is also included, and it makes possible the speedy development of large neural networks on general purpose workstations.

PDF
220--225 Andrew Hunt Recurrent Neural Networks For Syllabification

Abstract  An important procedure in many prosodic analysis systems is to locate syllables. The location of syllables is used for the identification of stress and pitch accents, and forms the basis for the analysis of rhythm. This paper presents a novel syllabification method using recurrent neural networks which is more accurate than previous techniques. This method achieves high accuracy on continuous speech, by finding 94% of syllables and by placing most syllable boundaries within 20msec of the desired location. The paper also investigates means of optimising the performance of recurrent neural networks.

PDF
226--231 Yohji Fukuda and Haruya Matsumoto Speech Recognition Using Modular Organizations Based On Multiple Hopfield Neural Networks

Abstract  In this paper, we describe the speech recognition that introduces the modular organizations based on Multiple Hopfield Neural Network (MHNN). MHNN is composed of several Hopfield Networks that are connected to each other. Each network owns the individual energy function but when connected, the total energy of the whole network can be minimized because MHNN interacts between the networks. Our recognition architecture has two phases. In the first phase, two mapping-type networks extract features from a spectrum data and a pitch data as the modular organization's inputs. In the second phase, MHNN recognizes the speech signal by the interactions between the networks. We perform Japanese name recognition by using MHNN.

PDF

Speech Coding I

Pages Authors Title and Abstract PDF
234--239 C. F. Chan Vector Quantisation Of Thinned Filter Coefficients For Low-Bit-Rate Speech Coding

Abstract  The results in vector quantization of thinned filter coefficients are presented in this paper. The codeword in the thinned filter codebook is represented by the non-zero coefficients and their corresponding delays. By using a modified likelihood ratio as distortion measure, and employing the generalized Lloyd algorithm for codebook training, we are able to obtain thinned filter codebooks for both the transversal and lattice models with significantly lower distortion than conventional LPC-VQ codebooks. Simulation results show that a 7-bit thinned lattice filter codebook has the same distortion level as a 10-bit LPC-VQ codebook.

PDF
240--245 Kwok-Wah Law and Cheung-Fat Chan Efficient Quantization Of The LPC Coefficients Using Hybrid Codebook Structure

Abstract  An efficient k-parameters vector quantization algorithm using hybrid codebook structure is presented. In the new scheme, the LPC vector is partitioned into 3 stages for vector quantization. In the first stage, a full search codebook is used in order to maintain the spectral accuracy due to quantization. The binary-tree and quad-tree search codebooks are used in the second and third stages for reducing the encoding complexity. By employing the hybrid codebooks in different stages, the proposed scheme can achieve 3.6 times complexity reduction when compared with the full-searched codebooks for all stages, and is therefore very suitable for real-time implementation in low bit rate speech coding.

PDF

Aspects Of Speech Understanding Systems

Pages Authors Title and Abstract PDF
248--253 C. Rowles, X. Huang, M. de Beler, J. Vonwiller, R. King, C. Matthiesson, P. Sefton and M. O'Donnell Using Prosody To Assist In The Understanding Of Spoken English

Abstract  The use of prosody to assist in the understanding of spoken English by computers has recently started to attract some attention. While prosody has been studied for its potential in syntactic analysis and the understanding of dialogue structure, the practical use of prosody has been largely limited to improving the intelligibility of synthesised speech. In this paper we show how prosody can be used to improve syntactic segmentation and disambiguation, assist in the understanding of dialogue structure and allow the proper management of turn-taking in dialogues.

PDF
254--259 Russell J. Collingham and Roberto Garigliano A Word Lattice Parsing Algorithm For Naturally Spoken English

Abstract  A system is described which provides a prototype speech recognition aid for deaf students in university lectures. A dynamic programming algorithm builds a lattice of likely spoken words from the phoneme form of naturally spoken English. A mixture of breadth first search and best first search incorporating a novel set of "anti-grammar" rules to penalise grammatically incorrect or grammatically bad sequences of words is used to parse the word lattice and produce the best recognised word sequence.

PDF
260--265 M O'Kane, P Kenne and O White Determination Of Training Set Size For A Statistically-Based Wordspotter

Abstract  Wordspotting in continuous speech is useful for automatically locating words for audio indexing purposes. Wordspotting is also the basic technology behind concept spotting, in which the location of enough members of a set of semantically-related words and phrases in a particular segment of speech is taken as an indication that the concept represented by that set is being discussed. A set of experiments was conducted as a first attempt to determine the size of the database needed to train a statistically-based wordspotter. False negatives and the false positives are both treated as errors in wordspotting. In the first experiment the size of the wordspotter training set needed was examined for the speech of a single speaker. Sufficient training data were collected until good wordspotting was achieved for this speaker. This experiment was then repeated for the speech of another speaker so that the variation of training set size as a function of speaker could be investigated. The training sets for the speakers were then pooled and the wordspotter was tested on test sentences for these speakers. The obvious generalisation experiment was then carried out in which the wordspotter was tested on test speakers who were not in the training set.

PDF

Speech Technology I - Speech Aids

Pages Authors Title and Abstract PDF
268--273 P.A. Jones, H.J. McDermott, P.M. Seligman, J.B. Millar An Extension Of The Multipeak Speech Processing Strategy For The MSP/Mini 22 Cochlear Implant System.

Abstract  The speech perception of three postlinguistically deaf adults using the Nucleus MSP/Mini System 22 cochlear implant system programmed with a new speech processing strategy, MPEAK+AO, was evaluated. The MPEAK+AO strategy retains all the information of the standard Multipeak speech processing strategy and additionally presents acoustic components below 400 Hz to the most-apical electrode. This extra spectral information may help implantees understand speech, particularly in noise. Since the estimated fundamental frequency is presented as the rate of stimulation at a fixed intracochlear site and is thereby potentially perceived more easily, and the amplitude of the stimulation on the apical electrode, associated with the voice fundamental, is directly determined from the estimated energy in the relevant spectral region, these coding factors may provide a better representation of the prosodic information in speech and a more complete auditory feedback signal. The comparison between Multipeak and MPEAK+AO included tests of vowel, consonant and CNC word recognition. Speech materials were presented with both a male and female speaker. Sentence material, presented with background masking noise (four-speaker babble), was also used. The results showed that the new strategy significantly improved the ability of these MSP users to recognise words in open-set sentences in noisy conditions.

PDF
273--278 P.J. Blamey, G.J. Dooley, P.M. Seligman, J.I. Alcantara, and E.S. Gerin Formant-Based Processing For Hearing Aids

Abstract  A body-worn hearing aid has been developed with the ability to estimate formant frequencies and amplitudes in real time. These parameters can be used to enhance the output signal by "sharpening" the formant peaks, by "mapping" the amplitudes of the formants onto the available dynamic range of hearing at each frequency, or by resynthesizing a speech signal that is suited to the listener's hearing characteristics. Initial evaluations have indicated small improvements in speech perception for three groups of subjects: users of a combined cochlear implant and speech processing hearing aid, normally hearing listeners in background noise, and a hearing aid user with a severe hearing loss.

PDF
279--284 Catherine I. Watson and John H. Andreae A Test To Assess The Remedial Worth Of A Computer-Based Speech Therapy Aid

Abstract  For a visual speech therapy aid to be useful, its displays must distinguish unacceptable from acceptable speech utterances. Whilst a large number of visual speech therapy aids have been developed, very few have been tested adequately. A Visual Display Test is being developed to assess the visual displays of the CASTT, a computer-based aid developed at the University of Canterbury. The evolution of this test is reported in this paper.

PDF

Acoustic Phonetics II

Pages Authors Title and Abstract PDF
286--291 Andrew Butcher Intraoral Pressure As An Independent Parameter In Oral Stop Contrasts

Abstract  An analysis of acoustic and aerodynamic data on contrasting stop series in a number of European and Australian languages confirms that a significant variation in peak intraoral pressure is one of the main factors differentiating many such series. A more detailed examination of the intraoral pressure data from the Australian language suggests that glottal aperture is the main physiological parameter underlying this pressure variation, and there is no evidence to support the notion of a single independent phonetic correlate of the fortis/lenis distinction.

PDF
292--297 Phil Rose Bidirectional Interaction Between Tone And Syllable Coda: Acoustic Evidence From Chinese.

Abstract  The acoustic phonetic interaction between tone and syllable-final velar nasal coda is examined for a dialect of Chinese. A bidirectional effect is demonstrated in that the nasal coda causes significant differences in the tonal duration and F0 height of its syllable, whilst at the same time having its duration conditioned by the tone.

PDF
298--304 Frantz CLERMONT Characterisation Of The Diphthongal Sound Beyond The F1-F2 Plane

Abstract  A well entrenched and still dominating approach to characterise the diphthongal sound is based: (1) on the two lowest, vocal-tract resonance (or formant) frequencies (F1 and F2) considered individually and/or in a planar space; and (2) on a very sparse, temporal representation of these frequencies. While this time-honoured approach has been instrumental in deriving certain important properties of diphthongs, our basic knowledge appears to have advanced little beyond the F1-F2 plane as an acoustic phonetic framework for describing the dynamics and the complex vocalic nature of these speech sounds. In contrast, a new perspective on the formant space of the diphthong is offered here by studying the detailed time-varying behaviour of the individual formants; and by unveiling transition characteristics of particularly the F3-contour, which have hitherto been either unacknowledged or severely attenuated by sparse time-sampling.

PDF

Speech Coding II

Pages Authors Title and Abstract PDF
306--311 S Sridharan, B Goldburg, E Dawson Speech Cryptanalysis

Abstract  The security of frequency domain analog speech scramblers is investigated. It is shown that a vector codebook similar to that deployed in vector quantization of speech can be used to attack the speech scramblers even under the stringent condition that no section of the original speech is available to the attacker and that the encryption key of the system is varied frequently in a random and unknown manner. Subjective as well as objective results demonstrating the success of the attack are given.

PDF
312--317 J. Kostogiannis and A. Perkis A Robust Error Masking Hybrid Spectral Quantisation Scheme For Noisy Channels

Abstract  In this paper a robust error masking hybrid spectral quantisation scheme, based on Line Spectral Pairs (LSPs), critical for perceptual speech quality in noisy channels, is presented. Its performance, in comparison to conventional schemes, is evaluated considering error free conditions and in the presence of up to 10% random errors. In addition the inherent structures of the LSPs are utilised in designing non-redundant techniques for error detection and error masking incorporated in the quantisation scheme.

PDF
317--322 R. Soheili, A.M. Kondoz, B.G. Evans An 8 kb/s LD-CELP With Improved Excitation Modeling

Abstract  Backward prediction of the short term redundancies in speech has resulted in very low delay algorithms, with toll quality at 16 kb/s. At medium bit rates around 8 kb/s the modelling of the excitation signal by conventional CELP techniques can result in high complexity or poor output processed speech for services such as PSTN. In this paper we propose a low delay algorithm based on a vector quantised multi-tap adaptive codebook for producing a high quality speech signal operating at 8 kb/s. A report on the comparisons with other existing standards as well as simplification techniques in realising the algorithm are presented.

PDF

Speech Recognition II: Systems

Pages Authors Title and Abstract PDF
324--329 S. Sagayama, M. Sugiyama, K. Ohkura, J. Takami, A. Nagai, H. Singer, H. Hattori, K. Fukuzawa, Y. Kato, K. Yamaguchi, T. Kosaka, and A. Kurematsu ATREUS: Continuous Speech Recognition Systems At ATR Interpreting Telephony Research Laboratories

Abstract  This paper describes ATREUS, a family of a large variety of continuous speech recognition systems developed at ATR Interpreting Telephony Research Laboratories as the spoken input front-end of an interpreting telephony system. It is one of the major achievements of a seven-year automatic interpreting telephony project, which will reach its completion at the end of this fiscal year. A comparative study is given from the viewpoints of constituent technique and performance. A combination called ATREUS/SSS-LR performed best among the ATREUS systems.

PDF
330--335 Jin'ichi Murakami and Shigeki Sagayama An Efficient Algorithm For Using Word Trigram Models For Continuous Speech Recognition

Abstract  This paper describes an efficient algorithm for using word trigram models in continuous speech recognition. The algorithm reduces the memory requirements and computational cost by employing two techniques: beam search and an improved method for training the Viterbi path. It was tested in a continuous speech recognition experiment.

PDF
336--341 A. Kowalczyk, M. Dale & C. Rowles. Low Cost Speech Recognition For Simple Dialogue Understanding

Abstract  This paper describes the results of some experiments in the development and application of low cost neural networks for isolated speech recognition. Emphasis is placed on low precision weights and low memory requirements, which facilitate, in particular, the use of simple microprocessors for implementation.

PDF
342--347 C. H. Lee, J. L. Gauvain, R. Pieraccini and L. R. Rabiner Large Vocabulary Speech Recognition Using Subword Units

Abstract  Research in large vocabulary speech recognition has been intensively carried out worldwide, in the past several years, spurred on by advances in algorithms, architectures, and hardware. In the United States, the DARPA community has focused efforts on studying several systems including Resource Management, a 991 word task, ATIS (Air Travel Information System), a task with an open vocabulary (in practice on the order of several thousand words) and a natural language component, and Wall Street Journal, a task with a vocabulary on the order of 20,000 words. Although we have learned a great deal about how to build and efficiently implement large vocabulary speech recognition systems, there remain a whole range of fundamental questions for which we have no definitive answers. For example we do not yet know the best way to build and train the fundamental subword units from which word models are created. We do not yet know the best way to impose language constraints on the recognizer so as to utilize all available knowledge in the most computationally efficient manner. We do not yet even understand the best way to implement a recognition system so as to maximize the probability of recognizing the spoken string while minimizing the computation for string comparison and searching through the recognition network. In this paper we review the basic structure of a large vocabulary speech recognition system, discuss the considerations in the choice of subword unit, method of training, integration of language model, and implementation of overall system, and report on some recent results, obtained at AT&T Bell Laboratories and elsewhere, on the DARPA Resource Management Task.

PDF

Speech Disorders

Pages Authors Title and Abstract PDF
356--361 Lydia K. H. So and Barbara Dodd Phonologically Disordered Cantonese-Speaking Children

Abstract  Speech disordered children are not a homogeneous group in terms of aetiology, severity, surface error patterns or response to specific treatment approaches. In this paper we describe the speech error patterns of 17 monolingual Cantonese-speaking children. Two had difficulties articulating specific speech sounds; eight showed a delayed pattern of development, i.e. their errors were appropriate for a younger child acquiring phonology normally; five used unusual (non-developmental) phonological rules (as well as some normal developmental ones); and two showed inconsistent patterns of errors. The possible nature of the deficits underlying each of these surface error patterns is discussed.

PDF
362--367 Alison D. Bagnall An Alternative Approach To The Treatment Of Vocal Fold Nodules In Children - A Case Study

Abstract  A six year old child with long-standing bilateral vocal fold nodules and dysphonic voice received five lessons in skillful yelling or "belting". Laryngoscopy findings, post-treatment, confirmed the resolution of the nodules. Spectrographic analysis, pre- and post-therapy, confirmed the successful resolution of the dysphonia. He was not discouraged from yelling.

PDF

Speech Coding III

Pages Authors Title and Abstract PDF
370--374 Annie George and Bernt Ribbum High Quality Audio Coding Suitable For ISDN Channels

Abstract  In the last several years developments in signal processing technology and transmission technology have made it imperative that high quality audio coding algorithms be developed. A brief study is made into the various coders available for audio coding. OCF (Optimum Coding in the Frequency Domain) is an algorithm that allows audio source coding down to 64 kbits/s. This coder and the inclusion of its basic principles into the MPEG-audio standard draft are discussed. A coder based on the principles of the OCF but for smaller bandwidth is being implemented by the authors.

PDF
375--380 T. S. Lim and M. S. Scordilis New And Improved Pitch Determination For The IMBE Vocoder

Abstract  A robust and accurate pitch determination algorithm of infinite resolution is presented in this paper. This method makes use of a hybrid of time domain and frequency domain pitch estimation techniques. For frame to frame analysis, this method was found to provide high accuracy in extracting the pitch period best representing the average pitch within a speech frame. It is a computationally efficient technique, particularly when used as part of the IMBE vocoder.

PDF
381--386 Shu H. Leung, Chi Y. Chung and Andrew Luk A Low Noise Fixed Point Implementation Of GSM Speech Codec On TMS320C25

Abstract  This paper presents a fixed point implementation of a Regular-pulse Excitation Linear Predictive coder (RPE-LPC) combined with long term prediction (LTP) on a single TMS320C25. The bit rate of the coder is 13 kbits/s. The coding is done by Toeplitz approximation that permits the use of a lattice filter for reducing the finite word length effects such as coefficient sensitivity and roundoff error.

PDF

Speech Recognition III: Prosody

Pages Authors Title and Abstract PDF
388--393 Andrew J. Hunt Recent Advances In Utilising Prosody In Speech Recognition

Abstract  In the last few years both the potential for use of prosody in Automatic Speech Recognition and its actual use have grown substantially. The qualitative and quantitative understanding of prosodic features, including stress, pausing, rhythm, and intonation, has improved significantly. At the same time automatic speech recognition systems have reached a level of sophistication at which prosodic features can play a useful and complementary role to conventional recognition techniques. This paper outlines recent research work on the nature and utilisation of prosody and looks at areas of promise. A trend towards more sophisticated processing of prosodic features is observed.

PDF
394--399 Harald Singer and Shigeki Sagayama Suprasegmental Duration Control With Matrix Parsing In Continuous Speech Recognition

Abstract  This paper describes a unified framework for continuous speech recognition (CSR) under grammatical constraints, where trellis calculations and parsing are performed by the same simple fundamental operations, namely multiplication and addition of likelihood matrices. The matrix parser is shown to be a generalization of the CYK parser, which because of its simplicity lends itself to efficient hardware implementation. It also facilitates explicit supra-segmental duration control for all grammatical categories. Preliminary results showed that improved duration control on the mora level raised the recognition accuracy from 86.6% to 88.2%.

PDF

Poster Session 2

Pages Authors Title and Abstract PDF
402--407 Ann Packman, Janis van Doorn, and Mark Onslow Stuttering Treatments: What Is Happening To The Acoustic Signal?

Abstract  Identifying the critical variables in stuttering treatment will increase treatment effectiveness and contribute to understanding the nature of the disorder. A research program is described which investigates the notion that changes in the variability of acoustic segment durations, in particular the variability of vowel length, may contribute to the success of various stuttering treatments.

PDF
408--413 Jan van Doorn and Alison Purcell The Nasometer: A Clinical Gadget Or A Potential Technological Breakthrough?

Abstract  The excessive nasal quality associated with some types of disordered speech is a perceptual speech feature which is exceptionally difficult to assess reliably. This has led to a search for objective measures of nasality by using the acoustic speech signal. One recent device which has been developed commercially for the clinical market measures the ratio of oral to nasal speech signal intensity, and this device is currently being assessed in various research and clinical settings around the world. To date, the focus of the assessment has been on the validity of Nasometer measures whereby the nasalance measures have been correlated with levels of perceived nasality as rated by clinicians specialising in nasality disorders of speech. This paper addresses the issue of establishing a measure of the reliability of nasalance measures in order to establish criteria for using the device in making clinical decisions.

PDF
414--419 A. Marchal, C. Meunier, D. Masse Hyperbaric Speech Unscrambling: Results Of An Analysis/Synthesis Method Using Psh/Dispe Cdrom Speech Samples.

Abstract  We describe in this paper the bilingual database of subaquatic and hyperbaric speech (PSH/DESPE), and we present an unscrambling technique which uses an analysis/synthesis method to shift frequencies, to reduce the noise level and to increase speech intelligibility.

PDF
420--424 Sallyanne Palethorpe Speech Intelligibility In Communicative Difficulty

Abstract  An inability to maintain adequate intelligibility in situations of communicative difficulty may be due to a speaker being unaware of the requirements of the listener for maintaining intelligible conversation. In the context of an interaction with a simulated automatic speech recognition system, providing directed feedback from the listener as to the source of the communicative failure was an effective method of overcoming the problem.

PDF
425--430 Florian Schiel A New Approach To Speaker Adaptation By Modelling The Pronunciation In Automatic Speech Recognition

Abstract  To deal with large lexica (more than 2000) many systems of automatic speech recognition (ASR) use an internal phonetic representation of the speech signal and phonetic models of pronunciation from the lexicon to search for the spoken word chain or sentence. Therefore there is the possibility to model different pronunciations of a word in the lexicon. In the German language we observed that individual speakers pronounce words in a typical way that depends on several factors such as sex, age, place of living, place of birth, etc. Our goal is to enhance speech recognition by automatically adapting the models of pronunciation in the lexicon to the unknown speaker. The obvious problem is: you can't wait until the present speaker has uttered approx. 2000 different words at least once. We solved this problem by generalization of observed rules of differing pronunciation to unobserved words. Another point presented is speaker adaptation by re-estimating the a-posteriori probabilities of the phonetic units used in a 'bottom up' ASR system. A word hypothesis is evaluated by the product of the a-posteriori probabilities of the phonetic units produced by the classification to the phonetic units belonging to the word hypothesis. Normally these probabilities are estimated during the training of the ASR system and stay fixed during the test. We propose an algorithm which observes the typical confusions of phonetic units of the unknown speaker and adapts the a-priori probabilities. The learning rates can be dynamically adjusted by the entropy of the a-posteriori probabilities. By that we achieve a very fast adaptation of the a-posteriori probabilities to the optimal recognition rates using a Maximum Likelihood criterion.

PDF
431--436 Yasunaga Miyazawa and Shigeki Sagayama Speaker-Normalized HMM-Likelihood For Selecting A Reference Speaker In Speaker-Adaptive Speech Recognition

Abstract  This paper proposes a principle of speaker-normalized HMM-likelihood for pre-selecting a reference speaker to be used in speaker-adaptive HMM-based speech recognition. The experimental evaluation of this principle indicates that speaker selection using the speaker-normalized HMM-likelihood is superior to the simple likelihood-based method.

PDF
437--442 Jun-ichi Takami, Akito Nagai and Shigeki Sagayama Speaker Adaptation Of The SSS (Successive State Splitting)-Based Hidden Markov Network For Continuous Speech Recognition

Abstract  This paper describes a speaker adaptation method called Vector Field Smoothing (VFS) for a Hidden Markov Network (HMnet) generated by the Successive State Splitting (SSS) algorithm, and shows experimental results of speech recognition for multiple input speakers. The VFS method can accurately adapt a standard speaker's HMnet to the input speaker's HMnet with a limited amount of training samples because of its "smoothing" mechanism for transfer vectors. By using this method, remarkable improvements in the continuous speech recognition rates have been obtained.

PDF
443--447 X.Y. Zhu and L.W. Cahill Automatic Gender Identification By Voice

Abstract  Gender identification is a sub-area of speaker identification. This paper reports a traditional pattern recognition method and a connectionist method of automatic gender identification. Six different acoustic features extracted from 16 millisecond segments of the vowel were used and compared in the performance of gender identification. Fifty male speakers and fifty female speakers from the DARPA TIMIT speech corpus were tested in the experiments. The results show that a speaker's sex can be identified accurately from very short segments of his or her voice by the proposed methods. For a large number of subjects the connectionist model is superior to the traditional pattern recognition method in gender identification.

PDF
448--453 J. M. Song A Study On The Combination Of Hidden Markov Models And Multi-Layer Perceptron For Speech Recognition

Abstract  This paper presents an enhanced speech recognition algorithm by combining continuous hidden Markov modelling (HMM) with a multi-layer perceptron (MLP). The first stage of speech recognition, carried out by the HMM, selects a small group of candidates and projects incoming speech vectors into state normalized vectors. The second stage of MLP classifies each normalized vector generated by the HMM and determines the best candidate. In this architecture, the HMM plays a role of pre-classification, while the MLP is used for decision refinement. A simple speaker-independent isolated digits telephone speech database was used to test this approach. The result shows that the recognition performance increases from 92.9% to 93.8%.

PDF
454--459 Elijah Mwangi On The Use Of Acoustic Segmentation In Isolated Word Recognition

Abstract  An isolated word recognition system in which the broad acoustic structure of a word is used to supplement a conventional recognizer is described. The acoustic structure is the voiced, unvoiced, silence pattern of the word. Results obtained by computer simulation show an improvement in the recognition accuracy.

PDF
460--464 David Shrimpton and Brett D. Watson Comparison Of Recurrent Neural Network Architectures For Speaker Verification

Abstract  Recurrent Neural Networks (RNN) have shown promise in the area of automatic speech recognition. In this paper we examine the application of RNN architectures to the problem of text dependent automatic speaker identity verification.

PDF

Poster Session 2/2

Pages Authors Title and Abstract PDF
465--470 J. Ingram, R. Prandolini & S. Ong Phonetic Variability In Speaker Recognition For Forensic Purposes

Abstract  An experiment is reported on the impact of phonetic control in the selection of acoustic segments for formant trajectory based speaker identification under forensic conditions.

PDF
471--476 Frédéric Quesne and Henri Leich Improving The Performances Of Hidden Markov Models For Text Dependent Speaker Verification Or Identification

Abstract  Two algorithms are described here, improving the results of hidden Markov models, one in a text dependent speaker identification process and another in a text dependent speaker verification process. An original method is also presented for the estimation of identification success rates with small databases.

PDF
477--482 Miles P. Moody, Sherman Ong On The Confidence Level Of Speaker Identification Using Statistical Measures On Reflection Coefficients

Abstract  In this paper, we consider the confidence level for text-independent speaker identification. The confidence level is obtained from an analysis of identification accuracy versus weighted Euclidean distance between the reference template and the test vector of reflection coefficients extracted from segmented speech samples. The acceptance rate (the proportion of the intra-class within a limited distance) is dependent on the distance; the setting of a threshold level is then a trade-off between the accuracy and the acceptance rate. For carefully selected samples with 90% acceptance rate (for one minute for each test and each reference respectively), around 94% accuracy is achieved for a population of 14, and with 40% acceptance rate for the same population, about 99% accuracy is achieved.

PDF
483--488 Ian Booth, Michael Barlow, and Brett Watson Enhancements To DTW And VQ Decision Algorithms For Speaker Recognition

Abstract  Dynamic Time Warping (DTW) and Vector Quantisation (VQ) techniques have been applied with considerable success to speaker verification. In this paper we develop two enhancements involving statistical weighting and distance normalisation. Speaker verification results on a population of 42 are reported.

PDF
489--494 L. Penny Acoustic Measurements Of The Diphthongs Of Women Speakers Of General Australian English

Abstract  Formant and duration measures of the general variety of Australian English, as spoken by healthy young native-born women speakers are presented. The findings are discussed in relation to Bernard's (Bernard & Mannell, 1986) data for male speakers, and the implications for views on regional and social varieties of Australian English are discussed. The data may be used to confirm Clark's (Clark, 1989) revised transcription system for Australian vowels.

PDF
495--500 Steve Cassidy and Jonathan Harrington Investigating The Dynamic Structure Of Vowels Using Neural Networks

Abstract  The target theory of vowel perception suggests that the vowels are identified from the static spectral characteristics at the vowel target. This has been challenged recently by Strange who claims that dynamic information may be more important than static spectral shape in identifying vowels. In the work described here we attempt to investigate this issue using a neural network trained to identify vowels from bark spectra inputs. If the network is able to better identify vowels which contain natural dynamic information than similar stimuli which do not, then this dynamic information must be characteristic of the vowel, rather than being noise. Our results confirm that dynamic information is useful in categorising eleven monophthongal vowels.

PDF
501--506 Xiaonong Sean Zhu Intrinsic Vowel F0 In A Contour Tone Language

Abstract  Two contour tones on high and low vowels are measured at different duration points to examine intrinsic vowel F0 effects in Shanghai Chinese. T-test results show the intrinsic vowel F0 occurs on the earlier part of the falling tone and later part of the rising tone.

PDF

Speaker Verification II

Pages Authors Title and Abstract PDF
508--513 Anthony Kelly and Eliathamby Ambikairajah Hidden Control Neural Networks And Neural Prediction Models For The Task Of Speaker Verification

Abstract  The Hidden Control Neural Network (Levin, 1990) and the Neural Prediction Model (Iso & Watanabe, 1990) were recently proposed as speech recognition models. The models are based on speech pattern prediction by multi-layer perceptrons. Each model was tested, by its proposer, with speaker independent digit recognition experiments. In both cases recognition accuracies in excess of 99% were achieved. This paper describes the use of both the Hidden Control Neural Network and the Neural Prediction Model to perform the task of speaker verification. The vulnerability of each model to changes in speech parameters over time is also investigated. A set of sixty utterances from one true-talker and six impostors, collected over a period of six months, is used to evaluate the speaker verification performance. The Neural Prediction Model based system yields a speaker verification accuracy of 100%; however, this falls to 90% for the Hidden Control Neural Network based system. Finally, a multi-transputer implementation of the Neural Prediction Model system is described. This system uses five transputers and operates in real-time.

PDF
515--520 E. Ambikairajah, M. Keane and G. Tattersall Speaker Verification Using Self Segmenting Linear Predictors

Abstract  A new speaker verification model is proposed in this paper. The model uses self aligning linear predictors to represent the temporal structures of speech. Conventional Linear Predictive Coefficient (LPC) methods use short frames for analysis, resulting in neighbouring frames having very similar coefficients. This paper proposes a model that uses variable length segments. This considerably reduces the number of coefficients required to represent the true speaker utterance. Furthermore, the fact that these segments self align in the time domain eliminates the need for time warping. The training algorithm is based on a combination of dynamic programming and steepest descent techniques. The self segmenting model was evaluated on a database of 70 utterances, taken over a period of three weeks. Two different criteria were used in the verification decision: one was based on the accumulated prediction residual and the other was based on the optimal segmentation of the utterance. Individually these two tests yielded verification accuracies of 97% and 95%. An accuracy of 100% was achieved when both the accumulated prediction residual and the optimal segmentation were used in the verification decision.

PDF
521--526 Peter Kootsookos, Ah Chung Tsoi and Brian Lovell Speech Enhancement For Robust Speaker Verification

Abstract  We examine the performance of Kalman filtering and smoothing techniques in the context of a working verification system to see the effect on interspeaker and intraspeaker variability. The efficacy of Kalman noise reduction on speech contaminated by several types of noise in the context of three different speaker verification techniques - dynamic time warping, vector quantization and recurrent neural network - is investigated. Although the neural network system appears to benefit the most from Kalman noise reduction, there is also significant improvement for the other two systems. Vector quantisation had the least noise sensitivity and the best overall performance.

PDF

Speech Synthesis I: Text To Speech

Pages Authors Title and Abstract PDF
528--533 N. Yiourgalis, G. Epitropakis, G. Kokkinakis Some Important Cues On Improving The Quality Of A TTS System

Abstract  This paper presents the experience acquired by improving the quality of a rule-based parametric formant Greek TTS system developed in our laboratory. Improvements are achieved by: i. the addition of rules which control the duration, concatenation and coarticulation behaviour of the appropriate speech segments that are to be abutted. The intended definition/firing of these rules was ensured by the use of a specially designed graphical environment, ii. the introduction of pitch synchronous excitation which removed the appearing spikes in the speech output due to filter instabilities and iii. the application of an intonation scheme based on the syntactic analysis results of the input text, in order to resemble the intonational boundaries and the important prosodic features of natural speech.

PDF
534--539 G. Epitropakis, N. Yiourgalis, G. Kokkinakis Prosody Assignment To TTS-Systems Based On Linguistic Analysis

Abstract  In this paper we present a) a complete method for formulating the rules needed for assigning prosody to a Text-To-Speech system on the basis of linguistic knowledge extracted from text and b) the implementation of the method in the Greek TTS-system developed at our laboratory (Yiourgalis & Kokkinakis, 1991).

PDF

Speech Analysis II: Tools

Pages Authors Title and Abstract PDF
542--547 W.J. Hardcastle, A. Marchal, K. Nicolaidis, N. Nguyen-Trong Non-Linear Annotation Of Multi-Channel Speech Data

Abstract  The principles of the non-linear annotation used in the ACCOR project are described with reference to connected speech data selected from the EUR-ACCOR multi-channel speech database.

PDF
548--553 Andrew McVeigh and Jonathan Harrington Acoustic, Articulatory, And Perceptual Studies Using The Mu+ System For Speech Database Analysis

Abstract  mu+ is a system for retrieving and analysing speech data from large speech databases. The input to the system can be acoustic and articulatory signal files keyed to labels at different hierarchical levels. Using mu+, most combinations of labels, together with their boundary times and associated signal files can be retrieved and analysed. The system has been developed to provide a common environment for experimentation in numerous facets of speech research including: articulatory and acoustic phonetics, prosodic analysis, speech technology research, and linguistic corpus development.

PDF
556--561 Lorenzo Cioni An Environment For Speech Signal Processing

Abstract  The aim of this paper is to describe a project that we are developing at our Laboratory. This project aims at the definition of an environment for speech signal processing in which a set of applications can cooperate through a library of user-defined data that represent either "the outcome of" or "the source for" such applications.

PDF

Speech Recognition V: Word Recognition

Pages Authors Title and Abstract PDF
562--567 J. M. Song and A. Samouelian A Robust Speaker Independent Isolated Word Recognizer Over The Telephone Network Based On A Modified HMM Approach

Abstract  This paper presents an accurate and robust, speaker independent isolated word recognition system based on continuous hidden Markov modelling. By using state-of-the-art techniques, we are able to achieve speech recognition performance of 97.9% testing on 20 male speakers over the public switched telephone network (PSTN). In order to enhance the robustness of the system under noisy conditions, we explore a modified Gaussian pdf using a vector projection approach and improve the recognition performance from 88.8% to 92.1% at 10dB SNR (Gaussian white noise).

PDF
568--573 Yaxin Zhang, Christopher J. S. deSilva, Roberto Togneri, Mike Alder, and Yianni Attikiouzel A Multi-HMM Isolated Word Recognizer

Abstract  A multi-HMM speaker-independent isolated word recognition system is described. In this system, three vector quantization methods are used for the classification of speech space. This multi-HMM system results in an improvement of about 50 per cent in the error rate in comparison to the single model system.

PDF

Speech Coding IV

Pages Authors Title and Abstract PDF
574--579 John Asenstorfer Superresolution Pitch Estimator Using Chaos Theory

Abstract  Using experimental techniques used in analysing nonlinear dynamical systems, a novel pitch estimator is derived. The system allows pitch estimation to a fraction of the sampling period. Issues are addressed that make the estimator reliable and robust.

PDF
580--584 Dale Carnegie, Geoff Holmes, and Lloyd Smith Intelligibility Of Speech Compressed Using An Auditory Model

Abstract  We extract dominant frequencies from speech waveforms by an In-Synchrony-Bands spectrum analyzer based upon an auditory model. Experiments indicate that intelligible reconstructed speech requires only 3 such frequencies per frame. This paper presents the results of our investigation into speech compression employing this technique.

PDF
585--590 Hiroaki Oasa and Michael Wagner A Method To Evaluate The Pre-Processing Stage Of Isolated-Word Recognition Systems

Abstract  This paper describes a cluster analysis method developed to evaluate pre-processors for isolated-word recognition. The method derives various statistical measures on inter-word and intra-word distances, whose ratios are used to evaluate the relative effectiveness of the pre-processors. The study uses a speech database comprising (a) the 36-word set of alphabetic letters and digits and (b) a phonetically balanced set of 36 CVC words, and evaluates three pre-processors which extract FFT log-power spectrum coefficients, linear-prediction based cepstral coefficients, and critical band energies from the speech signal.

PDF

Linguistic Phonetics

Pages Authors Title and Abstract PDF
592--596 Mark Donohue and Yvette van Vugt Analysis Of Tone In The Language Of Vanimo Papua New Guinea

Abstract  Mean fundamental frequency data and duration measurements are presented for the three putative tonemes of the Dumo dialect of Vanimo, a Papuan tone language of northwestern Papua New Guinea. The evidential argument for the separation of three tonemes is presented from the acoustic data.

PDF
597--601 Heather B. King Dyirbal Intonation

Abstract  Declarative intonation phrases in Dyirbal are analysed using Pierrehumbert's model in order to ascertain the constructs required to account for the F0 contours. Results thus far obtained are presented.

PDF
602--607 Cathryn J Donohue The Phonetics Of Register In The Fuzhou Dialect Of Chinese

Abstract  Investigations are made into the phonetic reality of the phonological feature Register, originally defined by Yip as a bifurcation of the pitch range (1990:196). This is discussed in terms of her analysis of Fuzhou tonology, which uses Register to capture the natural class of tones as defined by their participation in vowel alternations. Mean normalized fundamental frequency contours are presented for the citation tones. The degree of abstractness involved in this analysis is assessed in terms of the original definition of Register. Finally, alternative definitions for Register are explored on the basis of the instrumental data.

PDF
608--613 Yasuko Nagano-Madsen Multilingual Prosodic Rules - Introducing A New Project

Abstract  A new project on typologically motivated prosodic analysis is described. It uses three prototypical languages - Japanese, Eskimo and Yoruba - which are chosen on the grounds of how they use duration and F0 for signalling lexical properties.

PDF

Speech Analysis And Recognition

Pages Authors Title and Abstract PDF
614--619 A. G Maher, R. W. King and J. M. Song Adaptive Noise Reduction Techniques For Speech Recognition In Telecommunications Environments

Abstract  This paper examines the application of adaptive noise reduction techniques at the input to a hidden Markov model speech recognizer. The most effective technique of those discussed is spectral subtraction. As this characterizes the noise in periods of silence, it has the capability to deal with non-stationary noise sources, and is thus suitable for use in recognition systems operating over the telephone network. The paper presents results for speaker-dependent recognition of digits in Gaussian white noise, and shows that the spectral subtraction noise reduction technique can maintain good recognition accuracy at signal to noise ratios as low as 5 dB.
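As an illustration of the spectral subtraction idea mentioned in this abstract, here is a minimal sketch. It assumes the speech has already been converted to per-frame magnitude spectra and that some frames are known to be silence; the function names `estimate_noise` and `spectral_subtract`, the magnitude-domain subtraction and the spectral floor are illustrative choices from common textbook variants, not details taken from the paper.

```python
def estimate_noise(silence_frames):
    """Average the magnitude spectra of frames known to contain only noise."""
    n = len(silence_frames)
    return [sum(bin_values) / n for bin_values in zip(*silence_frames)]


def spectral_subtract(frame, noise, floor=0.01):
    """Subtract the noise estimate from one magnitude spectrum.

    Each bin is clamped to a small fraction of its original magnitude
    (the hypothetical `floor` parameter) so that no bin goes negative.
    """
    return [max(m - n, floor * m) for m, n in zip(frame, noise)]


# Two silence frames give the noise estimate; one speech frame is cleaned.
noise = estimate_noise([[1.0, 2.0], [3.0, 2.0]])
cleaned = spectral_subtract([5.0, 1.0], noise)
```

Re-estimating the noise whenever a new silent stretch is detected is what would let such a scheme track noise that changes over time.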

PDF
620--624 George Raicevich and Phillip Dermody Comparison Of Methods For Speech Analysis

Abstract  Comparisons of performance are made between an auditory model and LPC analysis. Furthermore, two types of auditory model outputs, mean rate and synchrony response, are tested for the lowest distance metric error. A time sensitive Euclidean distance measure (Integrated Time Squared Error : ITSE) is used and compared to a Euclidean distance metric. A local speech data base of CV combinations mixed with office environment noise is used for the testing.

PDF
625--628 M.D. Chau and C. D. Summerfield Auditory Models As Front-Ends For Speech Recognition In High Noise Environments

Abstract  This paper describes a series of experiments conducted by Syrinx to determine performance improvements offered by Auditory Model based speech signal processing front-ends for HMM recognisers. The experiments tested an implementation of the Ghitza Model connected to a HMM recogniser through a number of interface algorithms that reduce the Auditory Model's representation dimensionality to a manageable size. The results show that in high noise environments recognisers incorporating front-ends based on the Ghitza Auditory Model outperform those implemented using traditional Delta Cepstrum speech processing algorithms.

PDF
629--634 A. Samouelian Acoustic Feature Extraction Framework For Automatic Speech Recognition

Abstract  This paper presents a feature extraction framework that allows the use of speech knowledge in training a phonetic recognition system. It can train on any combination of features that may be derived from time and/or frequency domains, parametric, acoustic-phonetic and auditory models, including speech specific features. The system requires a moderate size, phonetically labeled database. During the training phase, nominated features per frame are automatically extracted and used as a set of attributes to generate a recognition decision tree, using the C4.5 induction program. During recognition, the feature extraction framework generates the set of attributes which are then fed through the decision tree, which assigns a phonetic label to each frame. Recognition results on the class of semi-vowels are presented.

PDF
635--640 P Kenne and M O'Kane Micro-Measures Of Speech Recogniser Effectiveness

Abstract  Speaker-independent speech recogniser performance measures are generally reported for performance averaged over all utterances by all speakers in a suitably large test database. In this paper we demonstrate that recognisers can perform quite differently on different speakers in a given test database. We examine the amount of variation that occurs across speakers for two different statistically-based recognisers - a Hidden Markov Model recogniser and a SPRITE recogniser. Another issue that is hidden by averaged recogniser performance scores is the variation that can occur as a function of utterance length. Some recognisers perform better on shorter utterances for a given training database. Finally, recogniser performance measures are examined. Some measures do not allow one to infer certain types of errors easily. Overguessing is one such example. Another measure is proposed and its performance is examined for effectiveness in highlighting different types of recogniser error.

PDF

Speech Technology II

Pages Authors Title and Abstract PDF
642--647 A.J. Hunt, P.C.B. Henderson, A. Samouelian, J.M. Song, and R.W. King Engineering A Speech-Controlled Voice-Mail Demonstration System Operating On The Telephone Network

Abstract  Engineering real-time speech recognition into services to operate over the telephone network requires more than good speech recognition. This paper addresses the design and integration issues of such a system: a speech-controlled voicemail demonstration system. This system integrates a PC-based telephone interface card, DSP signal processing, Sun-based isolated-word speaker-independent HMM recognition, an X-window user interface, prompt generation and system control. The system provides a base for evaluating the effectiveness of speech-controlled applications as well as being a pilot for investigating the difficulties in implementing real-world speech technology systems. This system also demonstrated the robust performance of a recently developed HMM package.

PDF
647--652 Wilson Lo and A. Samouelian Application Of Speech Recognition Technology For Telecommunication Services

Abstract  This paper presents the recognition results of two commercial PC based isolated word, Speaker Independent Voice Recognition (SIVR) systems over the Public Switched Telephone Network (PSTN) with a vocabulary of 0-9 and several control words. A pilot service called "World Time Information Service", which was developed using one of the SIVR systems evaluated, is also briefly described.

PDF
652--657 Elizabeth Bednall and Josephine Chessari The Role Of Human Factors Testing In Speech Technology

Abstract  Three studies are reported which focus on human factors issues in three speech-based telecommunications products. The purpose of the paper is to describe how human factors methodology can be applied in different ways. The first product shall be referred to as System A, the second as System B and the third as System C. The methods used provided valuable information regarding the human-computer interface for each system. These included: 1. user needs analysis heuristic evaluation 2. observation of users interacting with the system while completing typical tasks measurement of performance of the system 3. interpretation of questionnaire data 4. testing of product managers on the tasks. The results of these studies clearly illustrate the need for human factors testing to be incorporated into the design process. Such testing ensures a cost-effective way of optimizing usability and customer satisfaction when a product is finally released into the market place.

PDF
658--663 Steven Hiller, Edmund Rooney, John Laver and Mervyn Jack An Automated System For Computer Aided Pronunciation Teaching

Abstract  This paper describes the SPELL workstation which has been designed to improve the pronunciation of foreign languages (English, French and Italian) by non-native speakers. The workstation is used presently for the automated assessment and improvement of the prosodic features of intonation and rhythm, and the segmental feature of vowel quality. The paper highlights the intonation, rhythm and vowel quality metrics used for assessing non-native speech. The results of a preliminary evaluation by language experts and teachers support the underlying phonetic analysis techniques as well as the pedagogic approach presented to the workstation user.

PDF

Speech Perception II

Pages Authors Title and Abstract PDF
666--671 Anne Cutler, Ruth Kearns, Dennis Norris and Donia R. Scott Listeners' Responses To Extraneous Signals Coincident With English And French Speech

Abstract  English and French listeners performed two tasks - click location and speeded click detection - with both English and French sentences, closely matched for syntactic and phonological structure. Clicks were located more accurately in open- than in closed-class words in both English and French; they were detected more rapidly in open- than in closed-class words in English, but not in French. The two listener groups produced the same pattern of responses, suggesting that higher-level linguistic processing was not involved in these tasks.

PDF
672--677 U. Jekosch Spaces Of Perceptual Distinction In Natural And Synthetic Speech

Abstract  Similarity profiles representing spaces of perceptual distinction are presented: Profile A is based on judgements gained in an introspective way, Profile B visualizes judgements on natural speech, and Profile C on synthetic speech. Data are compared and interpreted with regard to their role in synthesis assessment.

PDF
678--683 P.J. Blamey and V.C. Tartter Fricative Perception By Cochlear Implant Users

Abstract  Three implant users were tested with 45 syllables consisting of [v,f,d,e,z,s,3,S,d3,tS,h,t,d,n,1] before the vowels [i,a,o] with three wearable speech processors. The WSP3 processor coded first and second formant frequencies and amplitudes. The MSP1 processor used a similar scheme with improved measurement and coding of the formants. The MSP2 processor added amplitude information from three higher frequency bands. Average scores were 42% for WSP3, 54% for MSP1, and 57% for MSP2. Perception of voicing, manner, and place of articulation of the consonants was significantly greater for the MSP processors than the WSP3 processor. Place perception was slightly higher for MSP2 than MSP1. The listeners used three perceptual dimensions which were highly correlated with the frequencies and amplitudes of peaks in the low frequency region of the frication spectrum, amplitudes of high frequency peaks, and duration of the frication noise.

PDF

Speech Analysis II

Pages Authors Title and Abstract PDF
686--691 Nick Campbell and Yoshinori Sagisaka Automatic Annotation Of Speech Corpora

Abstract  This paper describes a method for automatic annotation of prosodic events in speech corpora and extends previous work that detected prominences from segmental duration and energy measures. It details a way of differentiating prominence-related lengthening from boundary lengthening using durational clues alone, and discusses an anomaly in the phrasing characteristics of four speakers' readings of 200 phonetically balanced sentences.

PDF
692--697 S H Leung, Andrew Luk, Godfrey K F Liu and Caesar S Lun An ARMA Model For Extracting Cantonese Phoneme Characteristics

Abstract  An autoregressive moving average algorithm is proposed for the analysis and extraction of characteristics in Cantonese phonemes. Its performance is found to be more accurate than a number of classical ARMA estimation methods. This method is used to extract the formant characteristics in the Cantonese vowel phonemes and is found to match closely with the estimates obtained by the Cantonese linguist.

PDF
698--703 L. Cerrato, F. Cutugno, P. Maturi A Method For The Statistical Treatment Of Vocalic Formant Values

Abstract  A method of analysis to represent temporal formant variations and vowel dynamics is proposed in this work. This method is based on the following steps: 1) LPC analysis of the first two formants of the vocalic portion to be examined; 2) linear regression computation and extraction of linear equations; 3) polar representation of slopes in terms of r=sqrt(s1^2 + s2^2) and phi = arctg(s1/s2).
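Steps 2 and 3 of this abstract can be sketched as follows, assuming the two formant tracks have already been measured (step 1) and that s1 and s2 denote the regression slopes of the first and second formant over time. The function names and the sample values are hypothetical. The sketch uses `math.atan2(s1, s2)`, which agrees with arctg(s1/s2) when s2 is positive and additionally stays defined when s2 is zero or negative; that substitution is an assumption of this sketch, not something stated in the abstract.

```python
import math


def slope(times, values):
    """Least-squares slope of values regressed on times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den


def polar_slopes(times, f1, f2):
    """Return (r, phi) for the regression slopes of two formant tracks."""
    s1 = slope(times, f1)
    s2 = slope(times, f2)
    r = math.hypot(s1, s2)       # r = sqrt(s1^2 + s2^2)
    phi = math.atan2(s1, s2)     # equals arctg(s1/s2) when s2 > 0
    return r, phi


# Hypothetical tracks: F1 rises 3 Hz per frame, F2 rises 4 Hz per frame.
times = [0.0, 1.0, 2.0, 3.0]
r, phi = polar_slopes(times, [300.0, 303.0, 306.0, 309.0],
                      [2000.0, 2004.0, 2008.0, 2012.0])
```

In this representation r summarises how fast the vowel moves in the F1/F2 plane and phi summarises the direction of that movement.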

PDF

Speech Databases II

Pages Authors Title and Abstract PDF
706--711 Michael Barlow, Ian Booth, and Andrew Parr The Collection Of Two Speaker Recognition Targeted Speech Databases

Abstract  In recent years speech databases such as TIMIT have become available to the general research community. Such databases as are currently available are designed specifically with automatic speech recognition research in mind and as such are deficient in a number of aspects for automatic speaker recognition research; chiefly capturing repeated utterances from a large number of speakers over the long-term. The design and acquisition of two speech databases specifically for speaker recognition research are described, including an access system to a microcomputer laboratory.

PDF
712--717 M O'Kane and P Kenne On The Feasibility Of Using Application-Specific Speech To Derive A General-Purpose Speech Recogniser Training Database

Abstract  A speech database is easier to mark-up if what the speakers are saying is known before marking-up commences. In a joint project with the court reporting services we have examined the feasibility of using court speech recordings and associated transcript to derive a general-purpose speech recogniser training database. The first question addressed was the size of the natural vocabulary that was covered by day-to-day court proceedings. The next question addressed was the frequency of occurrence of the various words and phrases in this vocabulary. We then turned to the issue of how much transcript had to be examined in total in order to get a reasonable number of examples of all the commonly-occurring words in the vocabulary. All this work was done using automatic analysis of transcript text. Another important aspect of speech-database collecting is the overall time it is going to take to mark-up a database of known size. In order to address this issue we conducted mark-up speed trials in which several experienced speech database markers were timed for speed of marking-up speech from associated transcript. A special software mark-up system was used which ideally requires only four mouse-clicks to mark up and confirm each instance of each word entered in the database. Each marker was marking-up at word level only. Quality of marking-up was checked for each marker. While the exact minimum amount of data needed to train a very large speech recogniser is unknown, experiments such as the ones described here suggest that the concept of deriving such databases from application-specific speech is a very large but not an impossible task.

PDF
718--723 Lydia K.H. So and Robin Thelwall Hong Kong Spoken Cantonese Database

Abstract  As the initial stage of collecting a structured sample of spoken Cantonese for a database to be used for speech therapy and pronunciation teaching purposes, the present paper discusses the design of a linguistic questionnaire and illustrates some early acoustic analyses of vowels based on the current wordlist.

PDF

Speech Synthesis II

Pages Authors Title and Abstract PDF
724--729 V. Kraft, J.R. Andrews Design, Evaluation And Acquisition Of A Speech Database For German Synthesis-By-Concatenation

Abstract  This paper presents a systematic approach for the definition of a speech database to be used for parametric or non-parametric speech synthesis for German. Considering coarticulation and practical aspects, the demisyllable (DS) is chosen as the basic unit. Improvements are achieved by adding vowel-to-vowel-diphones and frequent words to the inventory. Describing the whole process of definition, recording and processing of the speech elements including the interface to the TTS system, an account is given on the experience gained on the way to high-quality speech synthesis.

PDF
730--735 K.P.H. Sullivan Novel-Word Pronunciation: A Crosslanguage Study

Abstract  In the case of a 'novel' word absent from a text-to-speech system's pronouncing dictionary the traditional systems invoke letter-to-phoneme rules to produce a pronunciation. A proposal in the psychological literature, however, is that human readers pronounce novel words not using explicit rules, but by analogy with letter/phoneme patterns for words they already know. A synthesis-by-analogy system is presented which is, accordingly, also a model of novel-word pronunciation by humans. The computational methods of assessing the orthographic analogy module and the 'flexible' (context-independent) GPC rule module, a pre-requisite for phonological analogy, are presented. The resultant assessments across language, method of assessment, size and content of the lexical database are compared, before implications for the future development of computer synthesis-by-analogy and for psychological models of oral reading are presented. The investigations into these modules produced useful results for both British English and German.

PDF

Speaker Characteristics

Pages Authors Title and Abstract PDF
738--743 Caroline Henton Sex And Speech Synthesis: Techniques, Successes And Challenges

Abstract  Female speech synthesis has a short history. Its quality is marginal in most current systems. Examples of synthetic speech will be played, indicating remaining problems in synthesizing female speech. A template for female speech is given, together with a review of applications for synthetic speech.

PDF
744--749 J. Pittam and K.R. Scherer The Encoding Of Affect: A Review And Directions For Future Research

Abstract  This paper reviews the work conducted into the encoding of affect in synthesised speech signals and isolates a number of major problem areas that researchers need to consider in the future.

PDF
750--755 John Ingram Prosody, Foreign Accent And Speech Synthesis

Abstract  The contribution of prosody to the perception of foreign accent and the impact of non-native prosody for the intelligibility of speech of second language users (Vietnamese immigrants speaking Australian English) is investigated using speech synthesis.

PDF

Human Factors

Pages Authors Title and Abstract PDF
758--763 Julie Vonwiller and Suzanne Eggins What Is A Communication Difficulty?

Abstract  This paper describes, with examples, several types of communication difficulty which arise in telephone information seeking dialogues. These difficulties arise for a variety of reasons, including mishearings and misunderstandings, by either the caller or information giver. Both use their natural communication skills to effect repairs. The analysis of these natural dialogues gives insights into the language processing requirements of future automated speech response systems.

PDF
764--769 A.BETARI, P.COTE, S. EL-KAREH Interfaces For Standard Arabic In Prolog

Abstract  Our main objective is to deal with databases to satisfy the needs of the Arab world. The conceptual framework we adopted is logical grammars, specifically Modifier Logic Grammar [MLG], and the programming languages used are Prolog II+ and Arity PROLOG.

PDF
770--775 R.W. King and J.P. Vonwiller Undergraduate Education For Speech Technology: An Introductory Course For Electrical Engineers And Linguists

Abstract  This paper describes a novel elective course in speech and language processing offered to undergraduate students from the Departments of Electrical Engineering and Linguistics. The course is presented by staff from the two Departments. The course content focusses on the components of speech and language technology systems and also on the importance of integrating them with high level language processing. The paper discusses the teaching methods employed for the course and provides brief comments on some of the educational issues which have arisen in presenting a course for students from the two, traditionally separate, cultures of Arts and Engineering.

PDF

Speech Recognition

Pages Authors Title and Abstract PDF
778--783 Andrew Tridgell and Bruce Millar A Speaker Independent Phoneme Recognition System

Abstract  A speaker independent phoneme recognition system is presented and discussed. Some of the unique features of this system include the use of a tree based vector quantiser and the use of multiple vector quantisers for each parameter set.

PDF
784--789 Daniel Woo, Phillip Dermody, Richard Lyon and Bruce Lowerre Auditory Model Interfaces Into A DTW Recogniser

Abstract  Auditory models have been proposed as one way to improve the robustness of current speech recognisers. In this work the Lyon auditory model is coupled to a DTW speech recogniser and comparisons are made between LPC coefficients and autocorrelation coefficients derived from the auditory spectrum. The results for both are compared using speaker dependent recognition for three speakers across different signal to noise ratios. The results suggest that auditory models can be used for interface to current recognisers and that the autocorrelation output plus a Euclidean distance measure provides the best performance in the current configuration.

PDF
789--794 G.V.RAMANA RAO Detection Of Word Boundaries In Continuous Hindi Speech Using Pitch And Duration

Abstract  Reliable detection of word boundaries in continuous speech is an important problem in speech recognition. Many studies established the importance of prosodic knowledge in detecting word boundaries. In this paper we report a word boundary hypothesisation technique based on the durational knowledge for Hindi. Recently another technique using pitch patterns was proposed for Hindi. We have also shown in this paper that combining the duration and pitch knowledge leads to significant improvements in the overall detection of word boundaries.

PDF

Speech Analysis III

Pages Authors Title and Abstract PDF
796--801 X.Y. Zhu and L.W. Cahill Combining Template Matching And Multilayer Perceptron For Speaker Identification

Abstract  This paper presents a combination approach to speaker identification. In addition to a traditional template matching method, a multilayer perceptron (MLP) method was applied to further distinguish speakers' voices. In the template matching method, cepstral coefficients were selected as acoustic features, and a dynamic time warping (DTW) algorithm was used to compare the feature vectors at equivalent points in time. An unknown speaker's template was first compared with all the stored speakers' reference templates to choose a few candidates. The MLP method, in which formant parameters of vowels and diphthongs were chosen as features, was then used on these candidates to determine the identity of the speaker. The final results showed that the combination approach was better than either the traditional method or the MLP method used alone.
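The first stage described in this abstract, DTW template matching used to shortlist candidate speakers, might look like the sketch below. It is a textbook DTW with a Euclidean local distance and no path constraints, which is an assumption; the abstract does not specify the local distance, slope constraints or how many candidates are kept. The names `dtw_distance` and `shortlist` and the toy one-dimensional "templates" are hypothetical, and the MLP second stage is not shown.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best accumulated cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # step both
    return cost[n][m]


def shortlist(unknown, references, k):
    """Names of the k reference templates closest to the unknown template."""
    ranked = sorted(references,
                    key=lambda name: dtw_distance(unknown, references[name]))
    return ranked[:k]


# Toy one-dimensional templates standing in for cepstral feature vectors.
references = {
    "A": [(0.0,), (1.0,)],
    "B": [(5.0,), (6.0,)],
    "C": [(0.0,), (1.5,)],
}
candidates = shortlist([(0.0,), (1.0,)], references, k=2)
```

Only the shortlisted candidates would then be passed to a second classifier, which keeps the more expensive comparison to a handful of speakers.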

PDF
802--807 R. Mannell The Effects Of Presentation Level On Sone, Intensity-J.N.D. And Decibel Quantisation Of Channel Vocoded Speech

Abstract  Natural speech tokens were passed through a Bark-scaled channel vocoder simulation and the outputs of 18 B.P. analysis filters were quantised at various multiples of the Sone scale, the intensity-j.n.d.-scale and the dB scale. The resulting synthetic speech was presented to a group of listening subjects at 40, 50, 70 and 90 dB s.p.l. (ref:20 uPa.) and intelligibility scores were obtained for each type and level of quantisation. The largest step on the Sone or amplitude j.n.d. scales that did not result in a significant reduction in intelligibility was found to vary with presentation level. The largest dB step that did not result in a drop in intelligibility was, on the other hand, constant across the four presentation levels. When the Sone scale was transformed into a logSone scale it was found that the maximum allowable step on that new scale was constant across the four presentation levels. This suggests that steps of loudness doubling (and not steps of equal loudness) represent the appropriate scale for the amplitude dimension in speech perception and that the dB scale is a reasonable approximation of that scale.

PDF
808--813 Paul C. Bagshaw An Investigation Of Acoustic Events Related To Sentential Stress And Pitch Accents, In English

Abstract  An algorithm is described to abstract acoustic parameters of a speech waveform to give a scalar measure of the relative stress and pitch movement of each group of phones which can consist of a single prominence. A method of identifying such groups using acoustic information is given. The abstracted parameters are used to locate sentential stress and pitch accents in English speech. These are compared with a hand-labeled prosodic transcription.

PDF

Extra Paper

Pages Authors Title and Abstract PDF
816--821 Hing C. Ng, Shu H. Leung and Andrew Luk An Isolated Chinese Word Recognition System Using Hierarchical Neural Network With Applications To Telephone Dialing

Abstract  This paper presents a neural network based isolated word recognition system for monosyllabic languages, especially Cantonese. The features are extracted from an FFT-based filter bank that is designed according to the formant characteristics of the Cantonese phonemes. A hierarchical neural network is used for recognizing feature vectors with good recognition rate and moderate computational complexity.

PDF