
Proceedings of SST 1990

Page numbers refer to nominal page numbers assigned to each paper for purposes of citation.

Neural Nets I

Pages Authors Title and Abstract PDF
10--15 M. Saseetharan and K. E. Forward "Parcor" Parameters As Features Applied To An Artificial Neural Network Word Recognizer

Abstract  "PARCOR" parameters were extracted using linear predictive coding (LPC) of speech data. The fact that parameters extracted from a stable filter have a magnitude of less than unity was used to confirm the stability of the filter. These parameters were time normalised and used as the input to the three-layer perceptron. Arbitrary non-linear decision surfaces were developed using an error back-propagation algorithm known as the generalized delta rule (GDR) on a three-layer artificial neural network (ANN) of simple computing units. As a recognition task, a simulated perceptron of 140 inputs was trained to an accuracy of 0.1 rms with ten repetitions of twenty isolated words. Recognition was tested with sixteen repetitions of the same twenty isolated words spoken by the same person, and an accuracy of 87.5% was achieved.

PDF
16--21 Adam Kowalczyk , Herman Ferra and Gordon Jenkins Experiments With Mask-Perceptrons For Speech Recognition

Abstract  The paper discusses results from a series of experiments on isolated word recognition using neural networks (multilayer perceptrons). It shows that high recognition accuracy in simple tasks can be achieved with very crude signal processing. It also shows that suitable incorporation of some classical pattern recognition techniques (distributed representation of network output with rows of Hadamard's matrix and optimised quantisation of input) can provide significant improvement in system performance.

PDF
22--27 Shuping Ran and J. Bruce Millar Exploring The Phonetic Structure Of The Speech Signal Using Multi-Layer Perceptrons

Abstract  Two experiments using a multi-layer perceptron to explore phonetically significant boundaries in the speech signal are described. The two fundamental distinctions, between speech segments which have periodic or aperiodic waveforms, and between speech segments which have transitional or steady state spectra, are examined to lay the foundation for possible future work. In the first experiment the refinement of hand segmented vocalic nuclei is shown to be possible for at least one speaker, whereas in the second experiment new boundaries are created using criteria developed from within the data itself.

PDF
28--33 Danqing Zhang, J. Bruce Millar and Iain Macleod Multi-Speaker Digit Recognition Using Neural Networks

Abstract  Application of neural network architectures to the problem of digit recognition is investigated using two different forms of a multi-layer perceptron. The problem of digit recognition is studied from three points of view: firstly, selection of input features representing the spoken digits; secondly, minimisation of training time; thirdly, optimisation of the architecture of neural nets.

PDF
34--39 Luyuan Fang Isolated Words, Multispeaker Speech Recognition With Multilayer Neural Networks

Abstract  A multilayer neural network for speech recognition is described here. This neural network is trained with the back propagation algorithm. The network can recognize different types of speakers. For multispeaker speech recognition, a 95% rate of correct recognition is achieved.

PDF

Coding

Pages Authors Title and Abstract PDF
40--45 A. Perkis and D. Rowe Quantizer Design And Evaluation Procedures For Hybrid Voice Coders

Abstract  This paper presents a complete methodology for designing Max-Lloyd optimized quantizers (Max-quantizers) for a given parameter. The main emphasis is on estimating the parameter's Probability Density Function from a carefully chosen database. Examples are given through optimization of a CELP-based voice coder.

PDF
46--51 Philip Secker and Andrew Perkis A Robust Speech Coder Incorporating Joint Source And Channel Coding Techniques

Abstract  This paper discusses the incorporation of joint source and channel coding techniques into a high quality 12 kbps speech coder. Trellis structures are used to quantize both the Linear Predictive coefficients and the residual signal, with each trellis optimized to the expected channel distortion. The resulting system shows remarkable robustness to a wide range of bit error rates, with most gain achieved at rates as high as 0.1.

PDF
52--57 K. Ratkevicius and A. Rudzonis Some Investigations On The Vocoded Speech Perception And Encoding

Abstract  The relation between the quality of vocoded speech and its compression ratio, and improvements in parametric coding techniques leading to data rates as low as 1200 bits/s, were investigated.

PDF
58--63 Lei Peng and Andrew Perkis Implementation And Discussion Of The VSELP Coder

Abstract  This paper presents a coder implementation procedure for the Vector-Sum Excited Linear Predictive Coder (VSELP), operating at 7950 bits/second. The quality of the coder is evaluated and some pitfalls in the coder specifications are identified. The paper also provides methodologies for verifying the search for self-excitation sequences and the residual codebook searches. For the LPC analysis, a performance comparison between the fixed point covariance algorithm (FLAT) and the standard Autocorrelation method (AUTO) is performed, showing the superiority of the FLAT algorithm.

PDF
64--69 J. Kostogiannis, A. Perkis Evaluation Of Linear Prediction Schemes For Use In Mobile Satellite Communications

Abstract  This paper evaluates and compares several analysis methods and quantization schemes applied to parts of a Linear Predictive coder, considering error free conditions and in the presence of random errors. Subjective and objective measures indicate that all the spectral analysis methods perform comparably, while the quantization schemes are shown to have a great impact on the degradation in speech quality at high bit error rates (BER). The sensitivity of the spectral information is reduced by the implementation of a "smart" filter stability correction algorithm, based on Line Spectrum Pairs (LSP).

PDF

Analysis I

Pages Authors Title and Abstract PDF
72--77 Malcolm Crawford and Martin Cooke Speech Perception Based On Large-Scale Spectral Integration

Abstract  This paper presents a computational model of speech perception, within the framework of a general theory of auditory processing. We believe that large-scale spectral integration may play an important part in speech recognition, and may account for a disparate range of findings in auditory psychophysics. We present initial findings from a model of integration on an ERB scale treated as a post-streaming transformation, discuss some of the current limitations of the model, and outline proposals for future work.

PDF
78--83 Duncan Markham (Dept. of Linguistics, Faculties) Duration, F-Patterns And Tempo In German Syllables

Abstract  Segmental duration and vowel formant frequencies in German nonsense syllables at two different speeds were investigated. The results are compared to previous research in this area and some modelling issues are discussed.

PDF
84--89 J. Pittam and J. Ingram Connected Speech Processes In Vietnamese-Australian

Abstract  Changes to connected speech processes characterising four Vietnamese-English speakers are examined, shortly after their arrival in Australia, and one to one and a half years later.

PDF
90--95 Mark Donohue Differential Accent Deletion Across Phrase Boundaries In Tokyo Dialect Japanese

Abstract  The subject of accent in Japanese has received substantial treatment from phonologists, but relatively little treatment until recently from phoneticians. The fact of accent deletion in Standard Japanese is universally acknowledged, but the mechanism of its function is not yet completely described. This paper addresses some of the aspects of accent reduction across phrase boundaries in conjoined sentences and concludes that the semantic nature of a phrase boundary needs to be taken into consideration when modelling the sentence declination and its effects on the realisation of a post-phrase boundary accent.

PDF

Aids For Hearing Impairment / Speech Disabilities

Pages Authors Title and Abstract PDF
98--103 W.K. Lai, Y.C. Tong, J.B. Millar and G.M. Clark Psychophysical Studies Investigating The Use Of Pulse Rate To Encode Acoustic Speech Information For A Multiple-Electrode Cochlear Implant

Abstract  Two psychophysical studies were conducted on cochlear implant recipients using a place/rate speech coding strategy which encodes F1 and F2 formant information into two different electrode positions as well as the pulse rates presented respectively on the two selected electrode pairs. The results indicated that a significant increase was achieved in the transmission of information encoded into pulse rates between 80 and 250 pps, but not for higher pulse rates. Also, the pulse rates presented on each of the two electrode pairs were found to be perceptually partially independent, which means that the pulse rates presented on two electrode pairs may be used for encoding more than one acoustic speech feature that could be useful for speech perception. Furthermore, the apical pulse rate was found to be perceptually more dominant than the basal pulse rate.

PDF
104--109 R.S.C. Cowan, P.J. Blamey, J.Z. Sarant, K.L. Galvin, and G.M. Clark Speech Processing Strategies In An Electrotactile Aid For Hearing-Impaired Adults And Children

Abstract  An electrotactile speech processor (Tickle Talker) for hearing-impaired children and adults has been developed and tested. Estimates of second formant frequency, fundamental frequency and speech amplitude are extracted from the speech input, electrically encoded and presented to the user through eight electrodes located over the digital nerve bundles on the fingers of the non-dominant hand. Clinical results with children and adults confirm that tactually-encoded speech features can be recognized, and combined with input from vision or residual audition to improve recognition of words in isolation or in sentences. Psychophysical testing suggests that alternative encoding strategies using multiple-electrode stimuli are feasible. Preliminary results comparing encoding of consonant voiced/voiceless contrasts with new encoding schemes are discussed.

PDF
110--115 John Ingram and William Hardcastle Perceptual, Acoustic And Electropalatographic Evaluation Of Coarticulation Effects In Apraxic Speech

Abstract  The relationship between perceptual and instrumental assessments of coarticulation effects in apraxic and normal speech are investigated.

PDF
116--121 M.I. Dawson, S. Sridharan Computer Based Speech Training Methods For The Hearing Impaired

Abstract  The general characteristics of the speech of hearing impaired individuals are reviewed to gain an understanding of the requirements of an appropriate computer based visual speech training aid. Current speech training aids are reviewed with regard to this, and a direction for future research in the field is discussed.

PDF
122--127 Paul F. McCormack Vowel Production Changes Subsequent To Cochlear Implantation

Abstract  Perceptual and acoustic correlates of vowel production changes in three postlingually-deafened speakers were measured subsequent to implantation with a cochlear multichannel prosthesis. Significant changes occurred in the acoustic vowel spaces of all three speakers. Improvements in listener recognition correlated strongly with formant changes in the direction of the normal vowel space. The consequences for speech rehabilitation programs are discussed.

PDF

Auditory Models

Pages Authors Title and Abstract PDF
130--135 E. Ambikairajah and E. Jones An Active Cochlear Model For Speech Recognition

Abstract  A model of the cochlea which includes both passive and active elements is described in this paper. The cochlear model consists of a cascade of 128 digital filters, of which 60 fall within the speech bandwidth of 259 Hz to 4 kHz. The model presented in this paper is suitable for implementation using a digital signal processor, and could act as a front-end processor for a speech recognition system.

PDF
136--141 M. P. Cooke, M.D. Crawford & G.J. Brown An Integrated Treatment Of Auditory Knowledge In A Model Of Speech Analysis

Abstract  This paper addresses the question of how information gleaned about auditory processing from experimental disciplines in hearing may appropriately be incorporated into computational architectures for automatic speech recognition (ASR). Criteria for so doing are developed, based on a software engineering analogy. We present an overview of what we believe to be a coherent system for auditory processing, and illustrate the representations produced at each stage of the model.

PDF
142--147 G.J. Brown and M.P. Cooke A Computational Model Of Amplitude Modulation Processing In The Higher Auditory System

Abstract  A novel speech processing technique is presented, which is based on the concept of a modulation map. This term describes the way in which the higher auditory system codes amplitude modulation rate as spatially distributed peaks of activity within a neural array. The map has applications in grouping and pitch analysis, and will form part of an integrated model of auditory processing. A simple pitch detector based on the map shows a superior performance in noise compared to a conventional autocorrelation analysis.

PDF
148--153 A. Samouelian, A. Markus and C. D. Summerfield An Auditory Model ASIC For Speech Recognition: Functional Design

Abstract  This paper describes the functional design of an ASIC auditory model for use as a front-end signal processor for speech recognition. The model consists of a set of critical band auditory filters, followed by non-linearities, representing the transduction stage of the Organ of Corti. The initial functional design was performed using the Denyer and Renshaw FIRST silicon compiler. The compiler was modified to accommodate look-up tables to implement the compressive rectifier. Initial simulation results indicate that a bank of 32 filters, spanning a centre frequency range of 100 to 6003 Hz, using a Bark spacing of 0.6, can be implemented on a single chip. Results of the simulation and its performance on real speech signals are presented.

PDF
154--159 A. Samouelian and J. Vonwiller Performance Of A Peripheral Auditory Model On Phonemes In Combination

Abstract  A peripheral auditory model, implemented as a front-end speech processor, has the potential to provide relevant acoustic features for speech recognition. The model's performance on utterances containing 'l' and 'r' sounds in various positions in connected speech is described. The effects of stress on these consonants are also examined. Results on real speech signals are presented.

PDF

Prosody / Pitch

Pages Authors Title and Abstract PDF
162--167 M. Beham and W. Datscheweit An Algorithm For Pitch Determination Of Speech Based On An Auditory Spectral Transformation

Abstract  An algorithm for pitch calculation is described which is based on an auditory spectral transformation and the theory of virtual pitch perception. The principle of harmonic coincidence is used to estimate pitch frequency.

PDF
168--173 V. Pikturna and A.Rudzionis Pitch Measuring From Spectra Of Noisy Speech: Amplitude Thresholding Versus Identifying Of Harmonics

Abstract  Various criteria for identifying harmonic peaks in the FFT spectra are investigated. The low order linear prediction spectrum is used as an amplitude threshold crossing the upper parts of the pitch harmonics.

PDF
174--179 Gunnar Hult Pulse-By-Pulse Pitch Analysis Through Zero Phase Low-Pass Filtering

Abstract  A recently proposed pitch detection algorithm is based on iterative low-pass filtering of the speech signal. We discuss some modifications to the proposed algorithm, such as appropriate halting criteria for the iterative filtering, and also how the pitch can be determined from the final, sinusoidal-like filter output. Finally, we compare these pitch estimates to those of two well-known pitch analysis methods, cepstral filtering and time domain parallel processing.

PDF
180--185 Michael Wagner, Bob McKay, Santha Sampath, David Slater Modelling The Prosody Of Simple English Sentences Using Hidden Markov Models

Abstract  A set of 144 declarative sentences with a subject-verb-object structure is drawn from a vocabulary of monosyllabic and disyllabic English words. Fundamental frequency contours and energy contours of the sentence set are analysed with respect to the syllabic structure of the sentences. Multivariate correlation analysis provides predictions for the average energy and fundamental frequency of syllables. Based on the distributions of the energy, voicing and fundamental frequency parameters, 2 different continuously variable Hidden Markov Models are trained to distinguish between intersyllable intervals, stressed and unstressed syllables. One HMM uses single-mixture Gaussian parameter distributions while the other uses double mixtures. The Viterbi algorithm is used for automatic segmentation. It is noted that the convergence of the training procedure is sensitive to the initial distributions. It is also argued that the inclusion of duration modelling is essential to distinguish between stressed and unstressed syllables.

PDF
186--191 Dieter Huber Speech Style Variations Of F0 In A Cross-Linguistic Perspective

Abstract  This study explores the differences between discourse intonation and the kind of pitch contours typically found in isolated sentences. Three kinds of material are evaluated systematically: (i) orally read lists of semantically unrelated sentences, (ii) orally read narrative texts, and (iii) dialogues. The material consists of equivalent samples of Swedish, English and Japanese speech, produced by native speakers (both female and male) of the respective languages. It will be shown that discourse intonation differs from intonation in semantically unrelated sentences with respect to practically all F0 parameters investigated in this study.

PDF

Recognition I

Pages Authors Title and Abstract PDF
194--199 S Nulsen, D Landy, M O'Kane, P Kenne and S Atkins Development Of Rules For Automatic Recognition Of Nasal Consonants

Abstract  Examination of speech waveforms has shown that the nasal consonants are one of the easiest classes of sounds to recognise by eye. What is it that so clearly distinguishes these from the other sounds? Three features may be identified: their characteristic shape; an amplitude which is much lower than nearby vowels but which is well above zero and cannot be confused with noise; and a very clean, smooth waveform, frequently almost sinusoidal, which is indicative of the previously well documented concentration of spectral energy in the low frequency region. A set of recognition rules for the class of nasal consonants was developed to capture these observations using the WAI speech recognition programming environment. The development of these rules is discussed and results are presented for three speakers.

PDF
200--205 N.R. Kew and P.D. Green A Scheme For The Use Of Syllabic Knowledge In Statistical Speech Recognition

Abstract  We describe a new project, SYLK, which aims to combine statistical and knowledge-based approaches in a front-end for Automatic Speech Recognition. It is based on the syllable as an explanation unit. The processing comprises an HMM front-end, which is followed by an inferential reasoning system in which a series of individual tests are applied to enhance the overall performance. We then consider the task of plosive discrimination in the context of SYLK, and illustrate the flexibility of the system in its ability to encompass a variety of approaches. Some preliminary results demonstrate the utility of the syllabic approach.

PDF
206--211 L.A. Smith Selection Of Speech Recognition Features Using A Genetic Algorithm

Abstract  A genetic algorithm was used to select a reduced set of features for a commercial speech recognizer. The algorithm was applied using a test set of 20 words spoken over the telephone by 22 speakers, with the recognition accuracy determining the fitness of competing feature sets. The recognizer correctly recognized 95.1% of the utterances using all 19 of the candidate features. The genetic algorithm found 2 sets of 16 features that recognized 95.3%, and a set of 17 features that recognized 95.4% of the test utterances. A second experiment found a set of feature weights, using all 19 candidates, with which the recognizer correctly identified 95.8% of the test utterances.

PDF

Tools I

Pages Authors Title and Abstract PDF
214--219 Dale Carnegie, Geoff Holmes, and Lloyd Smith Implementation Of An Auditory Model

Abstract  An auditory model has been implemented from its description in the literature. The model attempts to capture the phenomenon of auditory synchrony by detecting frequencies which dominate the output of adjacent bandpass filters. The implementation is described along with some results of processing speech with the model.

PDF
218--223 E. Jones and E. Ambikairajah Implementation Of An Active Cochlear Model On A TMS320C25

Abstract  This paper, which is the second of two papers describing the development of an active cochlear model submitted to this conference, outlines the implementation of the model on a Texas Instruments TMS320C25 single-chip digital signal processor. The cochlear model contains both passive and active elements, in line with recent research findings in the field of auditory physiology. The passive system is operational at normal stimulus amplitudes, while the active system comes into play for low-amplitude stimuli. Tests of the implemented model have been carried out using sinewaves. This implementation could be used as a physiologically-based front-end processor for a speech recognition system.

PDF
224--229 R.E.E.Robinson A Dsp Hearing Aid Simulator And Screening Test

Abstract  A device to screen hearing impaired people to enable the easier fitting of hearing aids is described. It is very flexible and provides a self-paced, easily controlled method of determining a patient's frequency response preferences. The device has a second function whereby it can simulate several commercially available hearing aids. This serves as a useful clinical and research tool.

PDF
228--233 John Ingram, Jeffery Pittam & Robb Hay A Speech Annotation And Phonological Analysis Program

Abstract  A software system for phonetic annotation and phonological analysis of speech samples is described for application to a longitudinal study of sound change in second language learning.

PDF
234--239 Catherine I. Watson, W K Kennedy, R H T Bates Towards A Computer Based Speech Therapy Aid

Abstract  A computer-based speech therapy aid is currently being developed at the University of Canterbury Electrical and Electronic Engineering Department (UCEEE). At present the aid consists of seven speech analysis modules, which include a vocal tract shape reconstruction from speech signals and a fricative sound identifier. The real-time hardware for processing and analysing the speech, and the software of the aid, have all been developed by the UCEEE. The aid has been evaluated by 15 speech therapists, who have all reacted positively to the aid and its potential.

PDF

Synthesis

Pages Authors Title and Abstract PDF
244--249 J.P. Vonwiller, R.W. King, K. Stevens and C.R. Latimer Comprehension Of Prosody In Synthesized Speech

Abstract  An experiment to determine the extent to which prosodically controlled synthesized speech can convey interactional meaning is described. The results reveal that most forms of interactional meaning can be comprehended, and that the formal basis for the definitions of meaning can be used to derive computational rules for prosody in text-to-speech systems.

PDF
250--255 P. A. Taylor and S. D. Isard Automatic Diphone Segmentation Using Hidden Markov Models

Abstract  A two stage automatic method of producing a diphone set from nonsense words is described. Firstly hidden Markov models are used to locate phoneme boundaries and then a spectral discontinuity minimisation algorithm is used to choose diphone boundaries.

PDF
256--261 Walter G. Rolandi English Text-To-Speech As A Function Of Concatenating Digitized Syllables

Abstract  This paper describes preliminary results obtained in an application of some more fundamental core technology. The core technology is the syllabic representation of English. An effort is underway to computationally determine (essentially) all of the syllables that collectively comprise the English language. While rules depicting syllabification in English have been described (Chomsky & Halle, 1968), an actual list of the language's constituent syllables does not appear to exist. Identifying the syllables of English may have several implications for the speech science community. Some are indicated below. This paper discusses the potential for improvement in English text-to-speech applications. The initial results by no means imply revolutionary breakthroughs. On the other hand, initial results do suggest a substantial improvement over some existing text-to-speech methods.

PDF
260--265 G. Abbattista, A. Riccio, S. Terribili Speech Workstation For Italian Text To Speech Development

Abstract  The paper describes the implementation and use of a powerful workstation suitable for the generation of the acoustic units database and the study and evaluation of the prosodic contours for a Text to Speech system for the Italian language.

PDF

Speaker Characteristics

Pages Authors Title and Abstract PDF
268--273 M.P. Moody, R. Prandolini Speaker Recognition Using ILS

Abstract  A means of identifying a speaker from recorded passages of speech using the signal processing package ILS (Interactive Laboratory System) is described, which is efficient and sufficiently accurate for use in legal proceedings. Statistical dependence on variables such as the length of the utterances, numbers and types of parameters used and the length of segments of speech is investigated with the aim of determining the confidence level dependence on these variations. Success rates of distinguishing speakers from similar populations are about 90% for reasonably short samples (a few minutes for references and a few seconds for test samples), while even much shorter reference samples may result in sufficiently significant results to add to the weight of evidence.

PDF
272--277 David Bijl and Frank Fallside Using Probabilistically Conditioned Neural Networks To Achieve Speaker Adaptation

Abstract  Speaker adaptation using neural networks is generally difficult because network weights are adjusted in accordance with a whole training set. Introduction of new adaptation data poses a problem, because back-propagation training would converge exactly on that test data, throwing away previously learnt information. If a neural network is formulated via a probabilistic approach, it is possible to use concepts of maximum likelihood to adapt the parameters of the network so as to accommodate changes without discarding valuable information generalized from initial training. Here, a probabilistic approach is demonstrated which allows speaker adaptation in automatic speech recognition. The units of speech used are phonic and prosodic.

PDF
280--285 P. D. Templeton and B. J. Guillemin Speaker Identification Based On Vowel Sounds Using Neural Networks

Abstract  This paper presents the results of an experiment which applies a neural network approach to the problem of speaker identification. We restrict ourselves to the analysis of the 11 non-diphthongal English vowels. The neural networks were trained on a set of cepstral coefficients derived from an LPC analysis. Results are presented which show that this approach compares favourably with more traditional methods.

PDF
286--293 J. Vogel, R. Wick Analysis Of Pre-Lingual Sound Utterances Of Children In The First Year Of Their Lives Using The Lpc Method Of Formant Extraction

Abstract  Sound utterances made by two children in the first year of their lives were investigated. The focus was on formant extraction using linear prediction analysis in combination with the FFT spectrum. There is reason to assume that articulatory, and therefore neurophysiological, processes are predominantly represented in the 1st formant, while higher formants reflect anatomical and morphological changes during this particular period of ontogenesis.

PDF

Neural Nets II

Pages Authors Title and Abstract PDF
292--297 A. P. Reilly and B. Boashash Alternative Speech Representations For Kohonen Classifiers

Abstract  Several different techniques have been used to process speech signals in preparation for classification. The most commonly used ones rely on assumptions of signal stationarity that are not true for many important speech signal types. Recent work in the field of time-frequency signal analysis has produced a number of representations which do not make these restrictive assumptions. This paper reports on work being done to quantify the differences between these representations, as they relate to the classification of speech signals using the neural network architecture popularized by Teuvo Kohonen in 1984.

PDF
298--303 Brian C. Lovell and Ah Chung Tsoi Speaker Verification Using Artificial Neural Networks

Abstract  Speaker verification is a process by which a machine authenticates the claimed identity of a person from his or her voice characteristics. A major application area of such systems would be providing security for telephone-mediated transaction systems where some form of anatomical or "biometric" identification (which cannot be lost, forgotten or stolen) is desirable. Due to the great potential shown by artificial neural networks (ANNs) in the field of speech recognition, we evaluate the performance of a variant of the multi-layer perceptron ANN in the task of speaker verification. To prove the concept, the technique is applied to the classification of 2 speakers using a single utterance. A clustering algorithm partitions the input and output synaptic weights of the trained networks according to a Euclidean distance measure, and it is found that the input synaptic weights appear to effectively characterize the speaker. The results demonstrate that the chosen ANN model can be used for speaker identification and verification purposes.

PDF
304--309 Roberto Togneri Speech Processing Using Artificial Neural Networks

Abstract  A three layer perceptron network is used to classify the sound using isolated words from different speakers. A classification accuracy of 97% has been achieved. A map of phonemes is used to trace trajectories of utterances using the self-organising neural network. A crinkle factor is proposed which allows the self-organising map to be used to determine the inherent dimensionality of a set of points. By this technique speech data has been shown to possess an inherent dimensionality of at least four. A projection of the map and the speech data shows how the self-organising map fits the speech space.

PDF
310--315 Simon Hawkins and Frantz Clermont Supervised Cepstrum-To-Formant Estimation: A New Piecewise-Linear Model

Abstract  A multiple-linear regression model of the relationship between low-order LPC-cepstral coefficients and vowel formant contours has been proposed by Broad and Clermont (1989). However, less-than-perfect formant estimates generated using this method suggest that the assumption of linearity underlying this model is questionable. In the present study, a neural net, with the potential for developing mappings that provide a nonlinear partitioning of large multi-dimensional space, is shown to produce substantially more accurate formant estimates than is possible using Broad and Clermont's multiple linear regression model. Because the neural net provides no indication as to the nature of the nonlinearity it has discovered, we propose a new piecewise multiple-linear regression model of the cepstrum-to-formant relationship. This parametric nonlinear model is thought to capture the quintessential nonlinearity in the cepstrum-to-formant relationship because it produces first and second formant estimates which are almost as accurate as those generated using the neural net.

PDF

Perception

Pages Authors Title and Abstract PDF
318--323 Janet Fletcher and Eric Vatikiotis-Bateson Prosody And Intrasyllabic Timing In French

Abstract  Durational variation associated with accentuation and final lengthening is examined in a corpus of articulatory data for French. Both factors are associated with measurable differences in acoustic duration. However, two different articulatory strategies are employed to make these contrasts, although both result in superficially longer and more displaced gestures.

PDF
324--329 Anne Cutler and Sally Butterfield Syllabic Lengthening As A Word Boundary Cue

Abstract  Bisyllabic sequences which could be interpreted as one word or two were produced in sentence contexts by a trained speaker, and syllabic durations measured. Listeners judged whether the bisyllables, excised from context, were one word or two. The proportion of two-word choices correlated positively with measured duration, but only for bisyllables stressed on the second syllable. The results may suggest a limit for listener sensitivity to syllabic lengthening as a word boundary cue.

PDF
330--335 D. C. Bradley and B. Dejean de la Batie Resolving Word Boundaries In Spoken French

Abstract  For three populations of listeners (native French speakers, and tertiary students in the first or second year of their post-secondary studies of French), the ease with which word boundaries are located in connected speech was investigated using the phoneme monitoring task. With response required to word-initial /t/, liaison environments created an ambiguity about the status of surface /t/ which could be resolved only lexically, and performance here was contrasted with that in unambiguous cases. The cost of ambiguity tended to be less in native speakers than in second year students, in line with a presumed difference in processing efficiency; for first year students, knowledge limitations and consequent speed-accuracy tradeoffs resulted in an estimated ambiguity cost which was, if anything, less than that in natives. For all three populations of listeners, monitoring performance was frequency-sensitive: word-initial /t/ was detected faster in higher frequency carriers, whether or not these occurred in the ambiguity-creating environment. These findings are discussed with respect to processing models of word-boundary resolution, in native and non-native speakers.

PDF
336--341 Xie Guanghua Syllabic Volume As An Acoustic Correlate Of Metrical Structure And Focus In Mandarin.

Abstract  This paper investigates the use of syllabic volume (a three dimensional acoustic value) as the acoustic correlate of metrical structure and focus in Mandarin.

PDF
342--347 Jonathan Harrington The Acoustic Basis Of The Distinction Between Strong And Weak Vowels

Abstract  This study is a preliminary exploration of the acoustic-phonetic basis of the distinction between 'strong' and 'weak' vowels. Segments were taken from a database of continuous speech and were classified using critical band and formant frequency parameters. The results show that up to 75% of strong vowels and just over 83% of weak vowels are correctly classified, depending on the type of acoustic classification used.

PDF

Recognition II

Pages Authors Title and Abstract PDF
350--355 Walter Weigel A Demi-Syllable Based Continuous Speech Recognition System With Hmms And Syntax-Controlled Word Search

Abstract  A system for recognizing continuous speech in a speaker-dependent mode is described, where demisyllables serve as basic processing units. The acoustic-phonetic decoding uses an explicit segmentation based on a pattern-matching technique and vowel-context-independent HMMs. The sentence recognition uses simplified word-HMMs and a Viterbi algorithm. For the syntax control, a bottom-up and a top-down strategy are compared, achieving a sentence recognition rate of up to 74%.

PDF
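The Viterbi decoding that Weigel's sentence recogniser relies on is a standard dynamic-programming search for the most likely HMM state sequence. As a reader's aid, here is a minimal sketch of generic Viterbi decoding in log probabilities; it is not the authors' implementation, and the array shapes and names are illustrative assumptions:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most likely state sequence for an HMM, in the log domain.

    log_A:  (S, S) log transition probabilities (from-state, to-state)
    log_B:  (T, S) log emission likelihoods of each observed frame
    log_pi: (S,)   log initial-state probabilities
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers for traceback
    for t in range(1, T):
        scores = delta[:, None] + log_A   # score of every (from, to) move
        back[t] = scores.argmax(axis=0)   # best predecessor per state
        delta = scores.max(axis=0) + log_B[t]
    # Trace the best path backwards from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With a two-state model whose emissions clearly favour state 0 for the first frames and state 1 for the last, the decoder returns the expected `[0, 0, 1, 1]` path.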
356--361 Tracy M Clark, W K Kennedy, and R H T Bates Features For A Computer Word Recognition System

Abstract  Any review of the extensive literature on word recognition reveals that a large variety of speech features is used in computer-based word recognition. However, most of the work focuses on a limited set of features adapted to a selected recognition method. We report on a series of experiments designed to isolate the relative merits of a range of features. The results for each feature or set of features are standardised by testing them with female and male speakers having a New Zealand accent, using a vocabulary of the digits zero to nine. A dynamic time warping algorithm is used. The features tested include root mean squared energy, zero crossing rate, linear predictive coefficients, cepstral coefficients, and transitional data in the form of dynamic cepstral coefficients. It is found that the best performance is achieved with the cepstral information. Addition of other features to this set gives only marginal improvement.

PDF
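The dynamic time warping matcher that Clark, Kennedy and Bates use to standardise their feature comparisons can be sketched in a few lines. This is a generic DTW, not the authors' system; the symmetric step pattern and Euclidean local distance are assumptions:

```python
import numpy as np

def dtw_distance(a, b):
    """Accumulated DTW alignment cost between two feature sequences.

    a, b: arrays of shape (T, D) -- frames of D-dimensional features
    (e.g. cepstral coefficients per frame).
    """
    n, m = len(a), len(b)
    # Euclidean local distance between every pair of frames.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Accumulated cost with the usual symmetric step pattern.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j],      # insertion
                                            D[i, j - 1],      # deletion
                                            D[i - 1, j - 1])  # match
    return D[n, m]
```

In an isolated-word recogniser of this kind, the test utterance is scored against one stored template per vocabulary word and the lowest-cost template wins.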
362--367 Tony Robinson and Frank Fallside Word Recognition From The Darpa Resource Management Database With The Cambridge Recurrent Error Propagation Network Speech Recognition System

Abstract  Recent work with Recurrent Error Propagation Networks has shown that they can perform at least as well as the current best Hidden Markov Models for speaker independent phoneme recognition on the TIMIT task. Accurate phoneme recognition is a prerequisite for very large vocabulary word recognition and this paper extends the previous work to word recognition from the DARPA 1000 word Resource Management task. This preliminary work achieves 52.1% word recognition rate (43.3% accuracy) with no grammar when trained on the TIMIT database using single pronunciation word models from the SPHINX system. The paper concludes with a list of topics that should be addressed in order to improve the recognition rate.

PDF
368--373 Alan M. Smith On The Use Of The Relative Information Transmitted (RIT) Measure For The Assessment Of Performance In The Evaluation Of Automated Speech Recognition (ASR) Devices

Abstract  The use of an information-theoretic based metric, the Relative Information Transmitted (RIT), may facilitate the assessment of the performance of automated speech recognition (ASR) devices. The RIT provides a scalar value which may be employed in a manner similar to the use of such traditional scalar measures as 'percent correct'. The complexity of the recognition task is factored into the computation of the RIT. For example, chance-level performance on a two-word recognition task and on a four-word recognition task both yield equivalent RIT values of zero, whereas the associated percent correct performance on these tasks would be 50% and 25%, respectively. The RIT is an entropy-based characterization of the ASR as a receiver in a communication channel in that the distribution of input/output characteristics of the associated confusion matrix is reflected in the generated measure. However, as with all figures of merit, the use of the RIT must be coupled with an understanding of the specific task and application domain associated with the assessment. Examples are provided which indicate that conclusions drawn from the use of the RIT may not always be in agreement with those derived from consideration of the 'percent correct' performance of the same system.

PDF
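The RIT as described in Smith's abstract is an entropy-based measure computed from a confusion matrix. A common reading (assumed here, since the abstract does not give the formula) is the mutual information between stimulus and response, normalised by the stimulus entropy:

```python
import numpy as np

def relative_information_transmitted(confusion):
    """Sketch of an RIT-style measure from a confusion matrix.

    Assumed definition: I(stimulus; response) / H(stimulus), where the
    joint distribution is estimated from the confusion matrix counts.
    """
    p = confusion / confusion.sum()  # joint distribution p(x, y)
    px = p.sum(axis=1)               # stimulus marginal
    py = p.sum(axis=0)               # response marginal
    mask = p > 0
    # Mutual information I(X; Y) in bits.
    mi = (p[mask] * np.log2(p[mask] / np.outer(px, py)[mask])).sum()
    # Stimulus entropy H(X) in bits.
    hx = -(px[px > 0] * np.log2(px[px > 0])).sum()
    return mi / hx
```

Under this definition, chance-level performance on a two-word task (a uniform confusion matrix) gives RIT 0, and error-free recognition gives RIT 1, matching the behaviour the abstract describes.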
374--379 C. Rowles, X. Huang, and G. Aumann Natural Language Understanding And Speech Recognition: Exploring The Connections

Abstract  This paper describes research aimed at integrating natural language understanding (NLU), speech recognition (SR) and the intonational structure of spoken language. NLU is being used to provide a measure of robustness to SR by placing utterances into context based on pragmatics and correctly recognizing the speaker's intention in a database application. The approach is to use context to correct speech recognition errors and to reduce the search space. In turn, basic intonational structure derived from the speech waveform will assist lexical disambiguation, phrase attachment, and anaphoric resolution not dealt with by discourse segmentation. The paper outlines the speech understanding system architecture and describes the understanding process, giving examples of the use of intonation.

PDF

Tonation

Pages Authors Title and Abstract PDF
382--387 Takako Toda Shanghai Tonal Phonology 'Rightward Spreading'? Some Arguments Based On Acoustic Evidence

Abstract  In recent years, linguists have started looking to acoustics for evaluation of phonological questions (Ohala 1986; Ladefoged 1989). This paper presents some acoustic data pertaining to the Shanghai tone sandhi, and argues that for Shanghai the current phonological assumption 'rightward spreading' is not phonetically plausible.

PDF
388--393 Phil Rose Linguistic Phonetic Aspects Of Shanghai Tonal Acoustics

Abstract  Mean fundamental frequency and duration data are presented for the tones of 3 female and 4 male speakers of Shanghai dialect. Normalised f0 shapes for Shanghai and Zhenhai tones are compared, and a linguistic tonetic contrast demonstrated between both falling and rising tones. The importance of retaining durational relationships in normalisation is demonstrated.

PDF
394--399 Phil Rose Thai-Phake Tones: Acoustic, Aerodynamic And Perceptual Data On A Tai Dialect With Contrastive Creak

Abstract  Mean fundamental frequency, amplitude, duration and air flow data are presented for the 5 tonemes of Thai Phake on syllables with [k] and [x] initial consonants and /aa/, /aat/ and /at/ rimes.

PDF
400--405 J S Mirza F0 Perturbation Effects Of Prevocalic Stops On Punjabi Tones

Abstract  The perturbation effects of prevocalic stop consonants on the F0 contours of the following Punjabi tones on the vowel [aa] are investigated. As is the case with the languages already studied, the unvoiced stops perturb the F0 onset values to a higher start while the voiced stops perturb them downwards for each Punjabi tone. This tone-splitting, however, has been found to remain consistent in Punjabi for the first 100 ms, unlike in other languages, which show a fast or slow convergence of the F0 tracks for the corresponding voiced and unvoiced stops. It is suspected that tone-splitting in Punjabi extends over a much longer period than in other tone languages such as Yoruba and Thai. The level-falling tone of Punjabi, which has the highest frequency register, split by 30 Hz, while the low register tones, the level and dipping tones, split by 14 and 12 Hz respectively on average, for the first twelve periods of vocal cord vibration.

PDF
406--411 U Thein-Tun The Domain Of Tones In Burmese

Abstract  The fundamental frequency patterns and the duration of the phonological tones in Burmese were analysed firstly with the neutral initial consonant [h] and secondly in four major syllable types. The F0 patterns of the four tones influenced by the preceding initial consonants in the four syllable types were compared with their counterparts preceded by the initial [h], and an attempt was made to determine the tone domain on the basis of the comparison.

PDF

Analysis II

Pages Authors Title and Abstract PDF
414--419 A. Marchal, W.J. Hardcastle The Relevance Of Basic Research In Articulatory Phonetics To Speech Technology

Abstract  For many applications in speech technology, decisive progress would result from the availability of an articulatory representation of speech utterances. While the acoustic mapping of the geometry of the human vocal tract during speech articulation is well understood today, the solution of the inverse problem, namely reconstructing the articulatory processes from the acoustic information, is still unsolved. The coarticulatory phenomena as a source of information have been almost entirely neglected. We will present in this paper the research action "ACCOR" ("Articulatory-Acoustic Correlations in Coarticulatory Processes: a cross-linguistic investigation") which has been recently launched under the EEC-funded ESPRIT II/BRA Program. It integrates investigation of the coarticulatory regularities themselves with research into new and improved ways of exploiting these regularities in deriving articulatory representations through the acoustic analysis of speech.

PDF
420--425 Andrew Butcher "Place Of Articulation" In Australian Languages

Abstract  Most Australian languages contrast either five or six orders of stops (in both oral and nasal series), distinguished in terms of what is traditionally known as "place of articulation". This study examines the articulatory and acoustic correlates of the four coronal categories in a number of languages, using acoustic and palatographic evidence.

PDF
426--431 Leisa Condie Non-Linearity In Vowel Waveforms

Abstract  Acoustic models of the vocal tract show that turbulence plays a significant role in speech production. Non-linear dynamics has recently provided tools which allow analysis in this very difficult area. In this note vowel waveforms are examined for evidence of irregular behaviours using such tools.

PDF

Tools II

Pages Authors Title and Abstract PDF
434--439 P Kenne, D Landy, M O'Kane, S Nulsen, A Mitchell and S Atkins The Wal Speech Programming Environment

Abstract  The WAL (Wave Analysis Laboratory) Speech Programming Environment was first developed in 1987 to provide software tools for rapid prototyping of the FOPHO and SPRITE speech recognition systems. The environment has been revised and extended several times following evaluation trials at various sites. Current versions are maintained under both MS-DOS and Unix. A central feature of the environment is a programming language which was designed to provide a high-level, natural-language-like facility for phoneticians and other speech scientists not familiar with standard programming languages to write and test speech recognition rules. The rule language provides the usual logical operators ('and', 'or', 'not') for combining rules, together with operators for temporal reasoning ('after', 'before', 'then'). A set of primitives for describing shapes and their deformations allows the notion of rubber templates to be included in the language, and when combined with the temporal reasoning facilities, provides an extension to picture languages. The WAL environment is highly portable. The language is interpreted, with the interpreter implemented in C and the graphical user interface is implemented using X Windows. We describe the language and environment and their implementation together with a number of examples illustrating the usefulness of the WAL environment in both speech and non-speech applications.

PDF
440--445 J. B. Millar, P.Dermody, J.M.Harrington, J.Vonwiller A National Cluster Of Spoken Language Databases For Australia

Abstract  This paper addresses the issue of the nature of a viable national resource of spoken language in Australia. The importance of such a resource for the development of speech technology in Australia is explored against the background of the economic, political, and legal issues that have frustrated previous attempts to develop such a resource. A proposed solution is provided in the form of a cluster of technically compatible databases in which each component of the cluster will have its independently determined content arising from the primary purpose behind its collection. The primary compatibility will be that each component corpus will have the same technical structure and the same standards of data description. Secondary compatibility will arise by making the components of the cluster available under well-defined conditions to the speech technology community via a set of database-nodes, each of which will be accessible by electronic data links.

PDF
446--451 J.E. Clark, C. Whitfield, and P.J. Kennedy Development Considerations For Speech Based Hearing Test Materials For Flight Crew

Abstract  Traditional hearing tests for technical flight crew have relied on the conventional pure tone threshold tests commonly used in clinical audiometry. This paper describes some of the considerations involved in the development of a set of speech based hearing test materials designed to evaluate the capacity of flight crew to satisfactorily process the speech signals routinely used under operational conditions in aviation. The rationale for the development of such tests is the recognition of the limitations in pure tone tests for distinguishing between crew with functionally effective hearing, and those who are no longer functionally effective when their pure tone thresholds are around the borderline of the ICAO limits. Factors investigated in the process of test development are described, including communication systems properties, cockpit noise levels, and operational language characteristics. The need for three classes of test is shown, and some of the considerations and procedures for their compilation are described.

PDF
452--457 A. Marchal, M.H.Casanova, P. Gavarry, M. Avon Dispe: A Divers' Speech Data-Base

Abstract  Gas mixture and pressure modify the spectral characteristics of divers' speech. Additionally, constraints imposed on jaw movements by wearing a facial mask affect the speech production process. The auditory feedback loop is equally concerned. Furthermore, underwater adverse working conditions are characterised by noise from different sources. As a result, divers' speech is poorly intelligible and communications between divers and surface control need to be enhanced. This is clearly true for both security and task efficiency reasons. To this end, "voice unscramblers" are being used. However, the technological state of commercially available equipment is dated and the quality of speech remains insufficient. To help with the design, testing and qualification (NORM) of new communication devices, a bilingual (French-English) database is currently being set up. It consists of phonetically balanced lists of 200 words read by 17 divers under sea and in chambers at operational levels from the surface to -300m. These recordings will be edited, labelled and stored for further distribution on a CD-ROM.

PDF

Signal Processing And Intelligibility

Pages Authors Title and Abstract PDF
460--465 R.H. Mannell The Effects Of Phase Information On The Intelligibility Of Channel Vocoded Speech

Abstract  The intelligibility of vocoded speech with various phase spectra was compared to the intelligibility of the original input natural speech. It was found that vocoded speech with true natural phase was the closest to natural speech in intelligibility.

PDF
466--471 J.Bruce Millar and Xue Yang Evaluation Of The Robustness Of Perceptual Linear Prediction Analysis Using Multi-Speaker Australian English Vowel Data

Abstract  Perceptual linear prediction analysis is a recently developed extension of standard linear predictive analysis which takes into account the characteristics of human hearing. It has been shown that its application to American English data presented to a simple automatic speech recognition system improves the speaker-independent performance of that system. In our studies, we use multi-speaker Australian English vowel data to evaluate the relative sensitivity to speaker difference and to phonetic difference of perceptual linear prediction analysis and standard linear prediction analysis. This work extends that of Hermansky by its detailed analysis of the vowel space for a moderately large number of speakers.

PDF
472--477 B.Goldburg, S.Sridharan and E. Dawson A Discrete Cosine Transform Based Speech Encryption System

Abstract  A speech encryption system suitable for use on bandlimited transmission channels is described. Scrambling is achieved using permutation of discrete cosine coefficients. A method for removing energy variation in the scrambled speech has been incorporated into the scheme which significantly enhances its performance. Simulation results presented in the paper indicate that the scheme provides scrambled speech of low residual intelligibility, and recovered speech of good quality.

PDF
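The core operation Goldburg, Sridharan and Dawson describe, scrambling by permuting discrete cosine coefficients, can be sketched as follows. This is a bare illustration of the principle only: it omits the paper's energy-variation removal and any bandlimiting, and the orthonormal DCT-II construction is an assumption:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (C @ C.T == I)."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    C[0] *= np.sqrt(0.5)  # scale the DC row for orthonormality
    return C

def scramble(frame, perm, C):
    """Transform a speech frame, permute its DCT coefficients, invert."""
    return C.T @ (C @ frame)[perm]

def descramble(frame, perm, C):
    """Undo the coefficient permutation with the inverse permutation."""
    inv = np.argsort(perm)
    return C.T @ (C @ frame)[inv]
```

Because the permuted signal is reconstructed through the inverse DCT, the scrambled output stays a time-domain waveform of the same length, which is what makes the scheme usable on an ordinary speech channel; a receiver holding the permutation key recovers the frame exactly.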

Technology And Applications

Pages Authors Title and Abstract PDF
480--485 Roland Seidl The Application Of Speech I/O Technology To Interactive Telecommunication Services

Abstract  This paper discusses the evolution and current status of speech I/O technology and its application to telecommunication services. Because speech recognition is the limiting factor in the design of these services, the impact of its constraints on the user/service interface is discussed in the context of some generic applications. The key factors to a potentially successful service are outlined.

PDF
486--491 Mark F. Schulz and Lenard J. Payne Providing Multimedia Facilities In A Workstation Independent Form

Abstract  This paper presents the planned work on providing multimedia facilities to a community of users of workstations and PCs. We aim to provide multimedia servers on a network, at least one per workstation, with each server providing a small number of services to the user. The workstation communicates with the multimedia server over an Ethernet via standard TCP/IP protocols. By providing the services separate from the workstation, we are able to upgrade workstations and network services with minimum interruption to our multimedia service.

PDF
492--497 C. D. Summerfield Design And Implementation Of A Multi-Channel Formant Speech Synthesis Asic

Abstract  This paper describes the design and implementation of a single chip multi-channel formant speech synthesis Application Specific Integrated Circuit (ASIC). The aim of the R&D project was the design and implementation of a cost-effective device capable of synthesising multiple channels of high quality speech. The paper describes the ASIC architecture developed to implement the complete multi-channel synthesis system on a single chip. At the core of the device is a fully synchronous bit-serial signal processing architecture which implements the formant synthesiser function. This is augmented by circuits which implement a multi-channel interactive glottal source function, delay line elements for the filter network and a flexible interface circuit which enables the device to be directly connected to an industry standard 32 bit bidirectional data bus. Using the device it is feasible to implement a complete multi-channel text-to-speech system using just two components: a microprocessor unit to run the text-to-speech algorithm and the multi-channel speech synthesis ASIC to provide the synthetic speech output.

PDF
498--503 D Rowe, A Perkis, W.G.Cowley, J.A.Asenstorfer Error Masking In A Real Time Voice Codec For Mobile Satellite Communications

Abstract  The implementation of a real time CELP based speech coder is described. The coder has been optimised for operation on a land mobile satellite channel, and full duplex operation is achieved on a single AT&T DSPSQC device. The coder operates at a bit rate of around 6400 bps, and produces near toll quality speech. Error masking techniques are investigated with the goal of minimising perceptual errors in the coder output in the land mobile satellite channel. The error masking is achieved with a combination of standard FEC techniques and subpacket substitution.

PDF