Page numbers refer to nominal page numbers assigned to each paper for purposes of citation.
Pages | Authors | Title and Abstract
---|---|---
10--15 | M. Saseetharan and K. E. Forward | "PARCOR" Parameters As Features Applied To An Artificial Neural Network Word Recognizer
Abstract "PARCOR" parameters were extracted using linear predictive coding (LPC) of speech data. The fact that parameters extracted from a stable filter have a magnitude of less than unity was used to confirm the stability of the filter. These parameters were time normalised and used as the input to a three-layer perceptron. Arbitrary non-linear decision surfaces were developed using an error back-propagation algorithm known as the generalized delta rule (GDR) on a three-layer artificial neural network (ANN) of simple computing units. As a recognition task, a simulated perceptron of 140 inputs was trained to an accuracy of 0.1 rms with ten repetitions of twenty isolated words. Recognition was tested with sixteen repetitions of the same twenty isolated words spoken by the same person, and an accuracy of 87.5% was achieved. |
|
16--21 | Adam Kowalczyk , Herman Ferra and Gordon Jenkins | Experiments With Mask-Perceptrons For Speech Recognition
Abstract The paper discusses results from a series of experiments on isolated word recognition using neural networks (multilayer perceptrons). It shows that high recognition accuracy in simple tasks can be achieved with very crude signal processing. It also shows that suitable incorporation of some classical pattern recognition techniques (distributed representation of the network output with rows of a Hadamard matrix and optimised quantisation of the input) can provide significant improvement in system performance. |
|
22--27 | Shuping Ran and J.Bruce Millar | Exploring The Phonetic Structure Of The Speech Signal Using Multi-Layer Perceptrons
Abstract Two experiments using a multi-layer perceptron to explore phonetically significant boundaries in the speech signal are described. The two fundamental distinctions, between speech segments which have periodic or aperiodic waveforms, and between speech segments which have transitional or steady state spectra, are examined to lay the foundation for possible future work. In the first experiment the refinement of hand segmented vocalic nuclei is shown to be possible for at least one speaker, whereas in the second experiment new boundaries are created using criteria developed from within the data itself. |
|
28--33 | Danqing Zhang, J.Bruce Millar and Iain Macleod | Multi-Speaker Digit Recognition Using Neural Networks
Abstract Application of neural network architectures to the problem of digit recognition is investigated using two different forms of a multi-layer perceptron. The problem of digit recognition is studied from three points of view: firstly, selection of input features representing the spoken digits; secondly, minimisation of training time; thirdly, optimisation of the architecture of neural nets. |
|
34--39 | Luyuan Fang | Isolated Words, Multispeaker Speech Recognition With Multilayer Neural Networks
Abstract A multilayer neural network for speech recognition is described here. This neural network is trained with the back propagation algorithm. The network can recognize different types of speakers. For multispeaker speech recognition, a 95% rate of correct recognition is achieved. |
40--45 | A. Perkis and D. Rowe | Quantizer Design And Evaluation Procedures For Hybrid Voice Coders
Abstract This paper presents a complete methodology for designing Max-Lloyd optimized quantizers (Max-quantizers) for a given parameter. The main emphasis is on estimating the parameter's probability density function from a carefully chosen database. Examples are given through optimization of a CELP-based voice coder. |
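The Lloyd-Max design loop the abstract refers to can be sketched from first principles. This is a minimal illustration over an empirical sample set, not the authors' CELP implementation; all names are invented:

```python
import numpy as np

def lloyd_max(samples, levels, iters=50):
    """Iteratively fit quantizer reconstruction levels to empirical samples."""
    # Initialise levels from evenly spaced quantiles of the data.
    q = np.quantile(samples, np.linspace(0.05, 0.95, levels))
    for _ in range(iters):
        # Decision thresholds lie midway between adjacent levels.
        t = (q[:-1] + q[1:]) / 2
        idx = np.digitize(samples, t)
        # Each level moves to the centroid of the samples in its cell.
        for k in range(levels):
            cell = samples[idx == k]
            if cell.size:
                q[k] = cell.mean()
        q.sort()
    return q
```

Each iteration alternates the two optimality conditions (nearest-level thresholds, centroid levels), so the distortion is non-increasing; the quantile initialisation stands in for the carefully estimated PDF mentioned in the abstract.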
|
46--51 | Philip Secker and Andrew Perkis | A Robust Speech Coder Incorporating Joint Source And Channel Coding Techniques
Abstract This paper discusses the incorporation of joint source and channel coding techniques into a high quality 12 kbps speech coder. Trellis structures are used to quantize both the Linear Predictive coefficients and the residual signal, with each trellis optimized to the expected channel distortion. The resulting system shows remarkable robustness to a wide range of bit error rates, with most gain achieved at rates as high as 0.1. |
|
52--57 | K. Ratkevicius and A. Rudzonis | Some Investigations On The Vocoded Speech Perception And Encoding
Abstract The relation between the quality of vocoded speech and its compression ratio, and improvements in parametric coding techniques leading to data rates as low as 1200 bits/s, were investigated. |
|
58--63 | Lei Peng and Andrew Perkis | Implementation And Discussion Of The VSELP Coder
Abstract This paper presents a coder implementation procedure for the Vector-Sum Excited Linear Predictive Coder (VSELP), operating at 7950 bits/second. The quality of the coder is evaluated and some pitfalls in the coder specifications are identified. The paper also provides methodologies for verifying the search for self-excitation sequences and the residual codebook searches. For the LPC analysis, a performance comparison between the fixed point covariance algorithm (FLAT) and the standard autocorrelation method (AUTO) is performed, showing the superiority of the FLAT algorithm. |
|
64--69 | J. Kostogiannis, A. Perkis | Evaluation Of Linear Prediction Schemes For Use In Mobile Satellite Communications
Abstract This paper evaluates and compares several analysis methods and quantization schemes applied to parts of a Linear Predictive coder, under error-free conditions and in the presence of random errors. Subjective and objective measures indicate that all the spectral analysis methods perform comparably, while the quantization schemes are shown to have a great impact on the degradation in speech quality at high bit error rates (BER). The sensitivity of the spectral information is reduced by the implementation of a "smart" filter stability correction algorithm, based on Line Spectrum Pairs (LSP). |
72--77 | Malcolm Crawford and Martin Cooke | Speech Perception Based On Large-Scale Spectral Integration
Abstract This paper presents a computational model of speech perception, within the framework of a general theory of auditory processing. We believe that large-scale spectral integration may play an important part in speech recognition, and may account for a disparate range of findings in auditory psychophysics. We present initial findings from a model of integration on an ERB scale treated as a post-streaming transformation, discuss some of the current limitations of the model, and outline proposals for future work. |
|
78--83 | Duncan Markham | Duration, F-Patterns And Tempo In German Syllables
Abstract Segmental duration and vowel formant frequencies in German nonsense syllables at two different speeds were investigated. The results are compared to previous research in this area and some modelling issues are discussed. |
|
84--89 | J. Pittam and J. Ingram | Connected Speech Processes In Vietnamese-Australian
Abstract Changes to connected speech processes characterising four Vietnamese-English speakers are examined shortly after their arrival in Australia and again one to one and a half years later. |
|
90--95 | Mark Donohue | Differential Accent Deletion Across Phrase Boundaries In Tokyo Dialect Japanese
Abstract The subject of accent in Japanese has received substantial treatment from phonologists, but relatively little treatment until recently from phoneticians. The fact of accent deletion in Standard Japanese is universally acknowledged, but the mechanism of its function is not yet completely described. This paper addresses some of the aspects of accent reduction across phrase boundaries in conjoined sentences and concludes that the semantic nature of a phrase boundary needs to be taken into consideration when modelling the sentence declination and its effects on the realisation of a post-phrase boundary accent. |
98--103 | W.K. Lai, Y.C. Tong, J.B. Millar and G.M. Clark | Psychophysical Studies Investigating The Use Of Pulse Rate To Encode Acoustic Speech Information For A Multiple-Electrode Cochlear Implant
Abstract Two psychophysical studies were conducted on cochlear implant recipients using a place/rate speech coding strategy which encodes F1 and F2 formant information into two different electrode positions as well as into the pulse rates presented respectively on the two selected electrode pairs. The results indicated that a significant increase was achieved in the transmission of information encoded into pulse rates between 80 and 250 pps, but not for higher pulse rates. Also, the pulse rates presented on each of the two electrode pairs were found to be perceptually partially independent, which means that the pulse rates presented on two electrode pairs may be used for encoding more than one acoustic speech feature that could be useful for speech perception. Furthermore, the apical pulse rate was found to be perceptually more dominant than the basal pulse rate. |
|
104--109 | R.S.C. Cowan, P.J. Blamey, J.Z. Sarant, K.L. Galvin, and G.M. Clark | Speech Processing Strategies In An Electrotactile Aid For Hearing-Impaired Adults And Children
Abstract An electrotactile speech processor (Tickle Talker) for hearing-impaired children and adults has been developed and tested. Estimates of second formant frequency, fundamental frequency and speech amplitude are extracted from the speech input, electrically encoded and presented to the user through eight electrodes located over the digital nerve bundles on the fingers of the non-dominant hand. Clinical results with children and adults confirm that tactually-encoded speech features can be recognized, and combined with input from vision or residual audition to improve recognition of words in isolation or in sentences. Psychophysical testing suggests that alternative encoding strategies using multiple-electrode stimuli are feasible. Preliminary results comparing encoding of consonant voiced/voiceless contrasts with new encoding schemes are discussed. |
|
110--115 | John Ingram and William Hardcastle | Perceptual, Acoustic And Electropalatographic Evaluation Of Coarticulation Effects In Apraxic Speech
Abstract The relationship between perceptual and instrumental assessments of coarticulation effects in apraxic and normal speech is investigated. |
|
116--121 | M.I. Dawson, S. Sridharan | Computer Based Speech Training Methods For The Hearing Impaired
Abstract The general characteristics of the speech of hearing impaired individuals are reviewed to gain an understanding of the requirements of an appropriate computer-based visual speech training aid. Current speech training aids are reviewed with regard to this and a direction for future research in the field is discussed. |
|
122--127 | Paul F. McCormack | Vowel Production Changes Subsequent To Cochlear Implantation
Abstract Perceptual and acoustic correlates of vowel production changes in 3 postlingually-deafened speakers were measured subsequent to implantation with a multichannel cochlear prosthesis. Significant changes occurred in the acoustic vowel spaces of all three speakers. Improvements in listener recognition correlated strongly with formant changes in the direction of the normal vowel space. The consequences for speech rehabilitation programs are discussed. |
130--135 | E. Ambikairajah and E. Jones | An Active Cochlear Model For Speech Recognition
Abstract A model of the cochlea which includes both passive and active elements is described in this paper. The cochlear model consists of a cascade of 128 digital filters, of which 60 fall within the speech bandwidth of 259 Hz to 4 kHz. The model presented in this paper is suitable for implementation using a digital signal processor, and could act as a front-end processor for a speech recognition system. |
|
136--141 | M. P. Cooke, M.D. Crawford & G.J. Brown | An Integrated Treatment Of Auditory Knowledge In A Model Of Speech Analysis
Abstract This paper addresses the question of how information gleaned about auditory processing from experimental disciplines in hearing may appropriately be incorporated into computational architectures for automatic speech recognition (ASR). Criteria for so doing are developed, based on a software engineering analogy. We present an overview of what we believe to be a coherent system for auditory processing, and illustrate the representations produced at each stage of the model. |
|
142--147 | G.J. Brown and M.P. Cooke | A Computational Model Of Amplitude Modulation Processing In The Higher Auditory System
Abstract A novel speech processing technique is presented, which is based on the concept of a modulation map. This term describes the way in which the higher auditory system codes amplitude modulation rate as spatially distributed peaks of activity within a neural array. The map has applications in grouping and pitch analysis, and will form part of an integrated model of auditory processing. A simple pitch detector based on the map shows a superior performance in noise compared to a conventional autocorrelation analysis. |
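The conventional autocorrelation analysis used here as the baseline comparison can be sketched minimally. This is an illustrative toy estimator on a synthetic voiced frame, not the authors' modulation-map model; sampling rate and search range are invented:

```python
import numpy as np

def autocorr_pitch(x, fs, fmin=60.0, fmax=400.0):
    """Estimate the pitch (Hz) of a voiced frame from its autocorrelation peak."""
    x = x - x.mean()
    # Autocorrelation at non-negative lags only.
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    # Restrict the peak search to lags corresponding to plausible F0 values.
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag

# Synthetic voiced frame: 120 Hz fundamental plus its second harmonic.
fs = 8000
t = np.arange(0, 0.04, 1 / fs)
frame = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
f0 = autocorr_pitch(frame, fs)  # close to 120 Hz for this frame
```

Restricting the lag search window is what keeps the estimator from locking onto harmonics; in noise, the autocorrelation peak broadens and weakens, which is the failure mode the modulation-map detector is reported to improve on.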
|
148--153 | A. Samouelian, A. Markus and C. D. Summerfield | An Auditory Model ASIC For Speech Recognition: Functional Design
Abstract This paper describes the functional design of an ASIC auditory model for use as a front-end signal processor for speech recognition. The model consists of a set of critical band auditory filters, followed by non-linearities, representing the transduction stage of the Organ of Corti. The initial functional design was performed using the Denyer and Renshaw FIRST silicon compiler. The compiler was modified to accommodate look-up tables to implement the compressive rectifier. Initial simulation results indicate that a bank of 32 filters, spanning a centre frequency range of 100 to 6003 Hz, using a Bark spacing of 0.6, can be implemented on a single chip. Results of the simulation and its performance on real speech signals are presented. |
|
154--159 | A. Samouelian and J. Vonwiller | Performance Of A Peripheral Auditory Model On Phonemes In Combination
Abstract A peripheral auditory model, implemented as a front-end speech processor, has the potential to provide relevant acoustic features for speech recognition. The model's performance on utterances containing 'l' and 'r' sounds in various positions in connected speech is described. The effects of stress on these consonants are also examined. Results on real speech signals are presented. |
162--167 | M. Beham and W. Datscheweit | An Algorithm For Pitch Determination Of Speech Based On An Auditory Spectral Transformation
Abstract An algorithm for pitch calculation is described which is based on an auditory spectral transformation and the theory of virtual pitch perception. The principle of harmonic coincidence is used to estimate pitch frequency. |
|
168--173 | V. Pikturna and A.Rudzionis | Pitch Measuring From Spectra Of Noisy Speech: Amplitude Thresholding Versus Identifying Of Harmonics
Abstract Various criteria for identifying harmonic peaks in the FFT spectra are investigated. The low-order linear prediction spectrum is used as an amplitude threshold crossing the upper parts of the pitch harmonics. |
|
174--179 | Gunnar Hult | Pulse-By-Pulse Pitch Analysis Through Zero Phase Low-Pass Filtering
Abstract A recently proposed pitch detection algorithm is based on iterative low-pass filtering of the speech signal. We discuss some modifications to the proposed algorithm, such as appropriate halting criteria for the iterative filtering, and also how the pitch can be determined from the final, sinusoidal-like filter output. Finally, we compare these pitch estimates to those of two well-known pitch analysis methods, cepstral filtering and time domain parallel processing. |
|
180--185 | Michael Wagner, Bob McKay, Santha Sampath, David Slater | Modelling The Prosody Of Simple English Sentences Using Hidden Markov Models
Abstract A set of 144 declarative sentences with a subject-verb-object structure is drawn from a vocabulary of monosyllabic and disyllabic English words. Fundamental frequency contours and energy contours of the sentence set are analysed with respect to the syllabic structure of the sentences. Multivariate correlation analysis provides predictions for the average energy and fundamental frequency of syllables. Based on the distributions of the energy, voicing and fundamental frequency parameters, 2 different continuously variable Hidden Markov Models are trained to distinguish between intersyllable intervals, stressed and unstressed syllables. One HMM uses single-mixture Gaussian parameter distributions while the other uses double mixtures. The Viterbi algorithm is used for automatic segmentation. It is noted that the convergence of the training procedure is sensitive to the initial distributions. It is also argued that the inclusion of duration modelling is essential to distinguish between stressed and unstressed syllables. |
|
186--191 | Dieter Huber | Speech Style Variations Of F0 In A Cross-Linguistic Perspective
Abstract This study explores the differences between discourse intonation and the kind of pitch contours typically found in isolated sentences. Three kinds of material are evaluated systematically: (i) orally read lists of semantically unrelated sentences, (ii) orally read narrative texts, and (iii) dialogues. The material consists of equivalent samples of Swedish, English and Japanese speech, produced by native speakers (both female and male) of the respective languages. It will be shown that discourse intonation differs from intonation in semantically unrelated sentences with respect to practically all F0 parameters investigated in this study. |
194--199 | S Nulsen, D Landy, M O'Kane, P Kenne and S Atkins | Development Of Rules For Automatic Recognition Of Nasal Consonants
Abstract Examination of speech waveforms has shown that the nasal consonants are one of the easiest classes of sounds to recognise by eye. What is it that so clearly distinguishes these from the other sounds? Three features may be identified: their characteristic shape; an amplitude which is much lower than that of nearby vowels but which is well above zero and cannot be confused with noise; and a very clean, smooth waveform, frequently almost sinusoidal, indicative of the concentration of spectral energy in the low frequency region, which has previously been well documented. A set of recognition rules for the class of nasal consonants was developed to capture these observations using the WAl speech recognition programming environment. The development of these rules is discussed and results are presented for three speakers. |
|
200--205 | N.R. Kew and P.D. Green | A Scheme For The Use Of Syllabic Knowledge In Statistical Speech Recognition
Abstract We describe a new project, SYLK, which aims to combine statistical and knowledge-based approaches in a front-end for Automatic Speech Recognition. It is based on the syllable as an explanation unit. The processing comprises an HMM front-end, which is followed by an inferential reasoning system in which a series of individual tests are applied to enhance the overall performance. We then consider the task of plosive discrimination in the context of SYLK, and illustrate the flexibility of the system in its ability to encompass a variety of approaches. Some preliminary results demonstrate the utility of the syllabic approach. |
|
206--211 | L.A. Smith | Selection Of Speech Recognition Features Using A Genetic Algorithm
Abstract A genetic algorithm was used to select a reduced set of features for a commercial speech recognizer. The algorithm was applied using a test set of 20 words spoken over the telephone by 22 speakers, with the recognition accuracy determining the fitness of competing feature sets. The recognizer correctly recognized 95.1% of the utterances using all 19 of the candidate features. The genetic algorithm found 2 sets of 16 features that recognized 95.3%, and a set of 17 features that recognized 95.4% of the test utterances. A second experiment found a set of feature weights, using all 19 candidates, with which the recognizer correctly identified 95.8% of the test utterances. |
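Feature selection of this kind is a standard binary-mask genetic algorithm. A minimal sketch follows, with an invented toy fitness function standing in for the recognizer's accuracy; population size, rates and names are all assumptions:

```python
import random

def ga_select(n_features, fitness, pop_size=20, gens=30, seed=1):
    """Evolve binary feature masks; fitness(mask) plays the recognizer-accuracy role."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(gens):
        scored = sorted(pop, key=fitness, reverse=True)
        survivors = scored[: pop_size // 2]        # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_features)     # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                 # occasional point mutation
                i = rng.randrange(n_features)
                child[i] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Toy fitness: reward the first three "informative" features, penalise the rest.
best = ga_select(8, lambda m: sum(m[:3]) - 0.1 * sum(m[3:]))
```

In the experiment described above, each fitness evaluation would be a full recognition run over the 20-word test set, which is why reducing the candidate pool from 19 features matters.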
214--219 | Dale Carnegie, Geoff Holmes, and Lloyd Smith | Implementation Of An Auditory Model
Abstract An auditory model has been implemented from its description in the literature. The model attempts to capture the phenomenon of auditory synchrony by detecting frequencies which dominate the output of adjacent bandpass filters. The implementation is described along with some results of processing speech with the model. |
|
218--223 | E. Jones and E. Ambikairajah | Implementation Of An Active Cochlear Model On A TMS320C25
Abstract This paper, which is the second of two papers describing the development of an active cochlear model submitted to this conference, outlines the implementation of the model on a Texas Instruments TMS320C25 single-chip digital signal processor. The cochlear model contains both passive and active elements, in line with recent research findings in the field of auditory physiology. The passive system is operational at normal stimulus amplitudes, while the active system comes into play for low-amplitude stimuli. Tests of the implemented model have been carried out using sinewaves. This implementation could be used as a physiologically-based front-end processor for a speech recognition system. |
|
224--229 | R.E.E.Robinson | A Dsp Hearing Aid Simulator And Screening Test
Abstract A device to screen hearing impaired people to enable the easier fitting of hearing aids is described. It is very flexible and provides a self-paced, easily controlled method of determining a patient's frequency response preferences. The device has a second function whereby it can simulate several commercially available hearing aids. This serves as a useful clinical and research tool. |
|
228--233 | John Ingram, Jeffery Pittam & Robb Hay | A Speech Annotation And Phonological Analysis Program
Abstract A software system for phonetic annotation and phonological analysis of speech samples is described for application to a longitudinal study of sound change in second language learning. |
|
234--239 | Catherine I. Watson, W K Kennedy, R H T Bates | Towards A Computer Based Speech Therapy Aid
Abstract A computer-based speech therapy aid is currently being developed at the University of Canterbury Electrical and Electronic Engineering Department (UCEEE). At present the aid consists of seven speech analysis modules, which include vocal tract shape reconstruction from speech signals and a fricative sound identifier. The real-time hardware for processing and analysing the speech and the software of the aid have all been developed by the UCEEE. The aid has been evaluated by 15 speech therapists, who have all reacted positively to the aid and its potential. |
244--249 | J.P. Vonwiller, R.W. King, K. Stevens and C.R. Latimer | Comprehension Of Prosody In Synthesized Speech
Abstract An experiment to determine the extent to which prosodically controlled synthesized speech can convey interactional meaning is described. The results reveal that most forms of interactional meaning can be comprehended, and that the formal basis for the definitions of meaning can be used to derive computational rules for prosody in text-to-speech systems. |
|
250--255 | P. A. Taylor and S. D. Isard | Automatic Diphone Segmentation Using Hidden Markov Models
Abstract A two stage automatic method of producing a diphone set from nonsense words is described. Firstly hidden Markov models are used to locate phoneme boundaries and then a spectral discontinuity minimisation algorithm is used to choose diphone boundaries. |
|
256--261 | Walter G. Rolandi | English Text-To-Speech As A Function Of Concatenating Digitized Syllables
Abstract This paper describes preliminary results obtained in an application of some more fundamental core technology. The core technology is the syllabic representation of English. An effort is underway to computationally determine (essentially) all of the syllables that collectively comprise the English language. While rules depicting syllabification in English have been described (Chomsky & Halle, 1968), an actual list of the language's constituent syllables does not appear to exist. Identifying the syllables of English may have several implications for the speech science community. Some are indicated below. This paper discusses the potential for improvement in English text-to-speech applications. The initial results by no means imply revolutionary breakthroughs. On the other hand, initial results do suggest a substantial improvement over some existing text-to-speech methods. |
|
260--265 | G. Abbattista, A. Riccio, S. Terribili | Speech Workstation For Italian Text To Speech Development
Abstract The paper describes the implementation and use of a powerful workstation suitable for the generation of the acoustic units database and for the study and evaluation of the prosodic contours for a Text to Speech system for the Italian language. |
268--273 | M.P. Moody, R. Prandolini | Speaker Recognition Using ILS
Abstract A means of identifying a speaker from recorded passages of speech using the signal processing package ILS (Interactive Laboratory System) is described, which is efficient and sufficiently accurate for use in legal proceedings. Statistical dependence on variables such as the length of the utterances, the numbers and types of parameters used, and the length of segments of speech is investigated with the aim of determining the confidence level dependence on these variations. Success rates in distinguishing speakers from similar populations are about 90% for reasonably short samples (a few minutes for references and a few seconds for test samples), while even considerably shorter reference samples may yield sufficiently significant results to add to the weight of evidence. |
|
272--277 | David Bijl and Frank Fallside | Using Probabilistically Conditioned Neural Networks To Achieve Speaker Adaptation
Abstract Speaker adaptation using neural networks is generally difficult because network weights are adjusted in accordance with a whole training set. Introduction of new adaptation data poses a problem, because back-propagation training would converge exactly on that test data, throwing away previously learnt information. If a neural network is formulated via a probabilistic approach, it is possible to use concepts of maximum likelihood to adapt the parameters of the network so as to accommodate changes without discarding valuable information generalized from initial training. Here, a probabilistic approach is demonstrated which allows speaker adaptation in automatic speech recognition. The units of speech used are phonic and prosodic. |
|
280--285 | P. D. Templeton and B. J. Guillemin | Speaker Identification Based On Vowel Sounds Using Neural Networks
Abstract This paper presents the results of an experiment which applies a neural network approach to the problem of speaker identification. We restrict ourselves to the analysis of the 11 non-diphthongal English vowels. The neural networks were trained on a set of cepstral coefficients derived from an LPC analysis. Results are presented which show that this approach compares favourably with more traditional methods. |
|
286--293 | J. Vogel, R. Wick | Analysis Of Pre-Lingual Sound Utterances Of Children In The First Year Of Their Lives Using The Lpc Method Of Formant Extraction
Abstract Sound utterances made by two children in the first year of their lives were investigated. The focus was on formant extraction using linear prediction analysis in combination with the FFT spectrum. There is reason to assume that articulatory and therefore neurophysiological processes are predominantly represented in the first formant, while higher formants reflect anatomical and morphological changes during this particular period of ontogenesis. |
292--297 | A. P. Reilly and B. Boashash | Alternative Speech Representations For Kohonen Classifiers
Abstract Several different techniques have been used to process speech signals in preparation for classification. The most commonly used ones rely on assumptions of signal stationarity that are not true for many important speech signal types. Recent work in the field of time-frequency signal analysis has produced a number of representations which do not make these restrictive assumptions. This paper reports on work being done to quantify the differences between these representations as they relate to the classification of speech signals using the neural network architecture popularized by Teuvo Kohonen in 1984. |
|
298--303 | Brian C. Lovell and Ah Chung Tsoi | Speaker Verification Using Artificial Neural Networks
Abstract Speaker verification is a process by which a machine authenticates the claimed identity of a person from his or her voice characteristics. A major application area of such systems would be providing security for telephone-mediated transaction systems where some form of anatomical or "biometric" identification (which cannot be lost, forgotten or stolen) is desirable. Due to the great potential shown by artificial neural networks (ANNs) in the field of speech recognition, we evaluate the performance of a variant of the multi-layer perceptron ANN in the task of speaker verification. To prove the concept, the technique is applied to the classification of 2 speakers using a single utterance. A clustering algorithm partitions the input and output synaptic weights of the trained networks according to a Euclidean distance measure, and it is found that the input synaptic weights appear to effectively characterize the speaker. The results demonstrate that the chosen ANN model can be used for speaker identification and verification purposes. |
|
304--309 | Roberto Togneri | Speech Processing Using Artificial Neural Networks
Abstract A three-layer perceptron network is used to classify isolated words from different speakers. A classification accuracy of 97% has been achieved. A map of phonemes is used to trace trajectories of utterances using the self-organising neural network. A crinkle factor is proposed which allows the self-organising map to be used to determine the inherent dimensionality of a set of points. By this technique speech data has been shown to possess an inherent dimensionality of at least four. A projection of the map and the speech data shows how the self-organising map fits the speech space. |
|
310--315 | Simon Hawkins and Frantz Clermont | Supervised Cepstrum-To-Formant Estimation: A New Piecewise-Linear Model
Abstract A multiple-linear regression model of the relationship between low-order LPC-cepstral coefficients and vowel formant contours has been proposed by Broad and Clermont (1989). However, less-than-perfect formant estimates generated using this method suggest that the assumption of linearity underlying this model is questionable. In the present study, a neural net, with the potential for developing mappings that provide a nonlinear partitioning of large multi-dimensional space, is shown to produce substantially more accurate formant estimates than is possible using Broad and Clermont's multiple linear regression model. Because the neural net provides no indication as to the nature of the nonlinearity it has discovered, we propose a new piecewise multiple-linear regression model of the cepstrum-to-formant relationship. This parametric nonlinear model is thought to capture the quintessential nonlinearity in the cepstrum-to-formant relationship because it produces first and second formant estimates which are almost as accurate as those generated using the neural net. |
Pages | Authors | Title and Abstract | |
---|---|---|---|
318--323 | Janet Fletcher and Eric Vatikiotis-Bateson | Prosody And Intrasyllabic Timing In French
Abstract Durational variation associated with accentuation and final lengthening is examined in a corpus of articulatory data for French. Both factors are associated with measurable differences in acoustic duration. However, two different articulatory strategies are employed to make these contrasts, although both result in superficially longer and more displaced gestures. |
|
324--329 | Anne Cutler and Sally Butterfield | Syllabic Lengthening As A Word Boundary Cue
Abstract Bisyllabic sequences which could be interpreted as one word or two were produced in sentence contexts by a trained speaker, and syllabic durations measured. Listeners judged whether the bisyllables, excised from context, were one word or two. The proportion of two-word choices correlated positively with measured duration, but only for bisyllables stressed on the second syllable. The results may suggest a limit for listener sensitivity to syllabic lengthening as a word boundary cue. |
|
330--335 | D. C. Bradley and B. Dejean de la Batie | Resolving Word Boundaries In Spoken French
Abstract For three populations of listeners (native French speakers, and tertiary students in the first or second year of their post-secondary studies of French), the ease with which word boundaries are located in connected speech was investigated using the phoneme monitoring task. With response required to word-initial /t/, liaison environments created an ambiguity about the status of surface /t/ which could be resolved only lexically, and performance here was contrasted with that in unambiguous cases. The cost of ambiguity tended to be less in native speakers than in second year students, in line with a presumed difference in processing efficiency; for first year students, knowledge limitations and consequent speed-accuracy tradeoffs resulted in an estimated ambiguity cost which was, if anything, less than that in natives. For all three populations of listeners, monitoring performance was frequency-sensitive: word-initial /t/ was detected faster in higher frequency carriers, whether or not these occurred in the ambiguity-creating environment. These findings are discussed with respect to processing models of word-boundary resolution, in native and non-native speakers. |
|
336--341 | Xie Guanghua | Syllabic Volume As An Acoustic Correlate Of Metrical Structure And Focus In Mandarin.
Abstract This paper investigates the use of syllabic volume (a three dimensional acoustic value) as the acoustic correlate of metrical structure and focus in Mandarin. |
|
342--347 | Jonathan Harrington | The Acoustic Basis Of The Distinction Between Strong And Weak Vowels
Abstract This study is a preliminary exploration of the acoustic-phonetic basis of the distinction between 'strong' and 'weak' vowels. Segments were taken from a database of continuous speech and were classified using critical band and formant frequency parameters. The results show that up to 75% of strong vowels and just over 83% of weak vowels are correctly classified, depending on the type of acoustic classification used. |
Pages | Authors | Title and Abstract | |
---|---|---|---|
350--355 | Walter Weigel | A Demi-Syllable Based Continuous Speech Recognition System With Hmms And Syntax-Controlled Word Search
Abstract A system for recognizing continuous speech in a speaker-dependent mode is described, where demisyllables serve as basic processing units. The acoustic-phonetic decoding uses an explicit segmentation based on a pattern-matching technique and vowel-context-independent HMMs. The sentence recognition uses simplified word-HMMs and a Viterbi-algorithm. For the syntax-control a bottom-up and a top-down strategy are compared, achieving a sentence recognition rate of up to 74%. |
|
356--361 | Tracy M Clark, W K Kennedy, and R H T Bates. | Features For A Computer Word Recognition System
Abstract Any review of the extensive literature on word recognition reveals that a large variety of speech features is used in computer based word recognition. However, most of the work focuses on a limited set of features adapted to a selected recognition method. We report on a series of experiments designed to isolate the relative merits of a range of features. The results for each feature or set of features are standardised by testing them with female and male speakers having a New Zealand accent using a vocabulary of the digits zero to nine. A dynamic time warping algorithm is used. The features tested include root mean square energy, zero crossing rate, linear predictive coefficients, cepstral coefficients, and transitional data in the form of dynamic cepstral coefficients. It is found that the best performance is achieved with the cepstral information. Addition of other features to this set gives only marginal improvement. |
|
362--367 | Tony Robinson and Frank Fallside | Word Recognition From The Darpa Resource Management Database With The Cambridge Recurrent Error Propagation Network Speech Recognition System
Abstract Recent work with Recurrent Error Propagation Networks has shown that they can perform at least as well as the current best Hidden Markov Models for speaker independent phoneme recognition on the TIMIT task. Accurate phoneme recognition is a prerequisite for very large vocabulary word recognition and this paper extends the previous work to word recognition from the DARPA 1000 word Resource Management task. This preliminary work achieves 52.1% word recognition rate (43.3% accuracy) with no grammar when trained on the TIMIT database using single pronunciation word models from the SPHINX system. The paper concludes with a list of topics that should be addressed in order to improve the recognition rate. |
|
368--373 | Alan M. Smith | On The Use Of The Relative Information Transmitted (RIT) Measure For The Assessment Of Performance In The Evaluation Of Automated Speech Recognition (ASR) Devices
Abstract The use of an information-theoretic metric, the Relative Information Transmitted (RIT), may facilitate the assessment of the performance of automated speech recognition (ASR) devices. The RIT provides a scalar value which may be employed in a manner similar to the use of such traditional scalar measures as 'percent correct'. The complexity of the recognition task is factored into the computation of the RIT. For example, chance-level performance on a two word recognition task and on a four-word recognition task both yield equivalent RIT values of zero, whereas the associated percent correct performance on these tasks would be 50% and 25%, respectively. The RIT is an entropy-based characterization of the ASR as a receiver in a communication channel in that the distribution of input/output characteristics of the associated confusion matrix is reflected in the generated measure. However, as with all figures of merit, the use of the RIT must be coupled with an understanding of the specific task and application domain associated with the assessment. Examples are provided which indicate that conclusions drawn from the use of the RIT may not always be in agreement with those derived from consideration of the 'percent correct' performance of the same system. |
|
374--379 | C. Rowles, X. Huang, and G. Aumann | Natural Language Understanding And Speech Recognition: Exploring The Connections
Abstract This paper describes research aimed at integrating natural language understanding (NLU), speech recognition (SR) and the intonational structure of spoken language. NLU is being used to provide a measure of robustness to SR by placing utterances into context based on pragmatics and to correctly recognize the speaker's intention in a database application. The approach is to use context to correct speech recognition errors and to reduce the search space. In turn, basic intonational structure derived from the speech waveform will assist lexical disambiguation, phrase attachment, and anaphoric resolution not dealt with by discourse segmentation. The paper outlines the speech understanding system architecture and describes the understanding process, giving examples of the use of intonation. |
Pages | Authors | Title and Abstract | |
---|---|---|---|
382--387 | Takako Toda | Shanghai Tonal Phonology 'Rightward Spreading'? Some Arguments Based On Acoustic Evidence
Abstract In recent years, linguists have started looking to acoustics for evaluation of phonological questions (Ohala 1986; Ladefoged 1989). This paper presents some acoustic data pertaining to the Shanghai tone sandhi, and argues that for Shanghai the current phonological assumption 'rightward spreading' is not phonetically plausible. |
|
388--393 | Phil Rose | Linguistic Phonetic Aspects Of Shanghai Tonal Acoustics
Abstract Mean fundamental frequency and duration data are presented for the tones of 3 female and 4 male speakers of Shanghai dialect. Normalised f0 shapes for Shanghai and Zhenhai tones are compared, and a linguistic tonetic contrast demonstrated between both falling and rising tones. The importance of retaining durational relationships in normalisation is demonstrated. |
|
394--399 | Phil Rose | Thai-Phake Tones: Acoustic, Aerodynamic And Perceptual Data On A Tai Dialect With Contrastive Creak
Abstract Mean fundamental frequency, amplitude, duration and air flow data are presented for the 5 tonemes of Thai Phake on syllables with [k] and [x] initial consonants and /aa/, /aat/ and /at/ rimes. |
|
400--405 | J S Mirza | F0 Perturbation Effects Of Prevocalic Stops On Punjabi Tones
Abstract The perturbation effects of the prevocalic stop consonants on the F0 contours of the following Punjabi tones on a vowel [aa] are investigated. As is the case with the languages already studied, the unvoiced stops perturb the F0 onset values to a higher start while the voiced stops perturb them down for each Punjabi tone. This tone-splitting, however, has been found to remain consistent in Punjabi for the first 100 ms, unlike in other languages, which show a fast or slow convergence of the F0 tracks for the corresponding voiced and unvoiced stops. It is suspected that tone-splitting in Punjabi extends over a much longer period than in other tone languages such as Yoruba and Thai. The level-falling tone of Punjabi, which has the highest frequency register, split by 30 Hz, while the low register tones, the level and dipping tones, split by 14 and 12 Hz respectively on average, for the first twelve periods of vocal cord vibration. |
|
406--411 | U Thein-Tun | The Domain Of Tones In Burmese
Abstract The fundamental frequency patterns and the duration of the phonological tones in Burmese were analysed firstly with the neutral initial consonant [h] and secondly in four major syllable types. The F0 patterns of the four tones influenced by the preceding initial consonants in the four syllable types were compared with their counterparts preceded by the initial [h] and an attempt was made to determine the tone domain on the basis of the comparison. |
Pages | Authors | Title and Abstract | |
---|---|---|---|
414--419 | A. Marchal, W.J. Hardcastle | The Relevance Of Basic Research In Articulatory Phonetics To Speech Technology
Abstract For many applications in speech technology, decisive progress would result from the availability of an articulatory representation of speech utterances. While the acoustic mapping of the geometry of the human vocal tract during speech articulation is well understood today, the inverse problem, namely reconstructing the articulatory processes from the acoustic information, remains unsolved. The coarticulatory phenomena as a source of information have been almost entirely neglected. We will present in this paper the research action "ACCOR" ("Articulatory-Acoustic Correlations in Coarticulatory Processes: a cross-linguistic investigation") which has been recently launched under the EEC-funded ESPRIT II/BRA Program. It integrates investigation of the coarticulatory regularities themselves with research into new and improved ways of exploiting these regularities in deriving articulatory representations through the acoustic analysis of speech. |
|
420--425 | Andrew Butcher | "Place Of Articulation" In Australian Languages
Abstract Most Australian languages contrast either five or six orders of stops (in both oral and nasal series), distinguished in terms of what is traditionally known as "place of articulation". This study examines the articulatory and acoustic correlates of the four coronal categories in a number of languages, using acoustic and palatographic evidence. |
|
426--431 | Leisa Condie | Non-Linearity In Vowel Waveforms
Abstract Acoustic models of the vocal tract show that turbulence plays a significant role in speech production. Non-linear dynamics has recently provided tools which allow analysis in this very difficult area. In this note vowel waveforms are examined for evidence of irregular behaviours using such tools. |
Pages | Authors | Title and Abstract | |
---|---|---|---|
434--439 | P Kenne, D Landy, M O'Kane, S Nulsen, A Mitchell and S Atkins | The Wal Speech Programming Environment
Abstract The WAL (Wave Analysis Laboratory) Speech Programming Environment was first developed in 1987 to provide software tools for rapid prototyping of the FOPHO and SPRITE speech recognition systems. The environment has been revised and extended several times following evaluation trials at various sites. Current versions are maintained under both MS-DOS and Unix. A central feature of the environment is a programming language which was designed to provide a high-level, natural-language-like facility for phoneticians and other speech scientists not familiar with standard programming languages to write and test speech recognition rules. The rule language provides the usual logical operators ('and', 'or', 'not') for combining rules, together with operators for temporal reasoning ('after', 'before', 'then'). A set of primitives for describing shapes and their deformations allows the notion of rubber templates to be included in the language, and when combined with the temporal reasoning facilities, provides an extension to picture languages. The WAL environment is highly portable. The language is interpreted, with the interpreter implemented in C and the graphical user interface is implemented using X Windows. We describe the language and environment and their implementation together with a number of examples illustrating the usefulness of the WAL environment in both speech and non-speech applications. |
|
440--445 | J. B. Millar, P.Dermody, J.M.Harrington, J.Vonwiller | A National Cluster Of Spoken Language Databases For Australia
Abstract This paper addresses the issue of the nature of a viable national resource of spoken language in Australia. The importance of such a resource for the development of speech technology in Australia is explored against the background of the economic, political, and legal issues that have frustrated previous attempts to develop such a resource. A proposed solution is provided in the form of a cluster of technically compatible databases in which each component of the cluster will have its independently determined content arising from the primary purpose behind its collection. The primary compatibility will be that each component corpus will have the same technical structure and the same standards of data description. Secondary compatibility will arise by making the components of the cluster available under well-defined conditions to the speech technology community via a set of database-nodes, each of which will be accessible by electronic data links. |
|
446--451 | J.E. Clark, C. Whitfield, and P.J. Kennedy | Development Considerations For Speech Based Hearing Test Materials For Flight Crew
Abstract Traditional hearing tests for technical flight crew have relied on the conventional pure tone threshold tests commonly used in clinical audiometry. This paper describes some of the considerations involved in the development of a set of speech based hearing test materials designed to evaluate the capacity of flight crew to satisfactorily process the speech signals routinely used under operational conditions in aviation. The rationale for the development of such tests is the recognition of the limitations in pure tone tests for distinguishing between crew with functionally effective hearing, and those who are no longer functionally effective when their pure tone thresholds are around the borderline of the ICAO limits. Factors investigated in the process of test development are described, including communication systems properties, cockpit noise levels, and operational language characteristics. The need for three classes of test is shown, and some of the considerations and procedures for their compilation are described. |
|
452--457 | A. Marchal, M.H.Casanova, P. Gavarry, M. Avon | Dispe: A Divers' Speech Data-Base
Abstract Gas mixture and pressure modify the spectral characteristics of divers' speech. Additionally, constraints imposed on jaw movements by wearing a facial mask affect the speech production process. The auditory feedback loop is equally affected. Furthermore, underwater adverse working conditions are characterised by noise from different sources. As a result, divers' speech is poorly intelligible and communications between divers and surface control need to be enhanced. This is clearly true for both security and task efficiency reasons. To this end, "voice unscramblers" are being used. However, the technological state of commercially available equipment is dated and the quality of speech remains insufficient. To help with the design, testing and qualification (NORM) of new communication devices, a bilingual (French-English) database is currently being set up. It consists of phonetically balanced lists of 200 words read by 17 divers under sea and in chambers at operational levels from the surface to -300 m. These recordings will be edited, labelled and stored for further distribution on CD-ROM. |
Pages | Authors | Title and Abstract | |
---|---|---|---|
460--465 | R.H. Mannell | The Effects Of Phase Information On The Intelligibility Of Channel Vocoded Speech
Abstract The intelligibility of vocoded speech with various phase spectra was compared to the intelligibility of the original input natural speech. It was found that vocoded speech with true natural phase was the closest to natural speech in intelligibility. |
|
466--471 | J.Bruce Millar and Xue Yang | Evaluation Of The Robustness Of Perceptual Linear Prediction Analysis Using Multi-Speaker Australian English Vowel Data
Abstract Perceptual linear prediction analysis is a recently developed extension of standard linear predictive analysis which takes into account the characteristics of human hearing. It has been shown that its application to American English data presented to a simple automatic speech recognition system improves the speaker-independent performance of that system. In our studies, we use multi-speaker Australian English vowel data to evaluate the relative sensitivity to speaker difference and to phonetic difference of perceptual linear prediction analysis and standard linear prediction analysis. This work extends that of Hermansky by its detailed analysis of the vowel space for a moderately large number of speakers. |
|
472--477 | B.Goldburg, S.Sridharan and E. Dawson | A Discrete Cosine Transform Based Speech Encryption System
Abstract A speech encryption system suitable for use on bandlimited transmission channels is described. Scrambling is achieved using permutation of discrete cosine coefficients. A method for removing energy variation in the scrambled speech has been incorporated into the scheme which significantly enhances its performance. Simulation results presented in the paper indicate that the scheme provides scrambled speech of low residual intelligibility, and recovered speech of good quality. |
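The permutation-of-DCT-coefficients idea in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and frame handling are invented for the example, and a real system would also apply the energy-equalisation step the abstract mentions.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis: rows are cosine basis vectors, so D @ D.T = I.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m

def scramble(frame, key):
    # Transform the frame, permute its DCT coefficients with a keyed
    # permutation, and inverse-transform back to a transmittable waveform.
    d = dct_matrix(len(frame))
    perm = np.random.default_rng(key).permutation(len(frame))
    return d.T @ (d @ frame)[perm]

def descramble(frame, key):
    # Regenerate the keyed permutation and apply its inverse to the
    # recovered coefficients before inverse-transforming.
    d = dct_matrix(len(frame))
    perm = np.random.default_rng(key).permutation(len(frame))
    inverse = np.empty_like(perm)
    inverse[perm] = np.arange(len(perm))
    return d.T @ (d @ frame)[inverse]
```

Because the DCT matrix is orthonormal, descrambling with the correct key recovers the frame exactly (up to rounding error), while the scrambled signal itself remains a valid bandlimited waveform, which is what makes this family of schemes usable on bandlimited channels.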
Pages | Authors | Title and Abstract | |
---|---|---|---|
480--485 | Roland Seidl | The Application Of Speech I/O Technology To Interactive Telecommunication Services
Abstract This paper discusses the evolution and current status of speech I/O technology and its application to telecommunication services. Because speech recognition is the limiting factor in the design of these services, the impact of its constraints on the user/service interface is discussed in the context of some generic applications. The key factors to a potentially successful service are outlined. |
|
486--491 | Mark F. Schulz and Lenard J. Payne | Providing Multimedia Facilities In A Workstation Independent Form
Abstract This paper presents the planned work on providing multimedia facilities to a community of users of workstations and PCs. We aim to provide multimedia servers on a network, at least one per workstation, with each server providing a small number of services to the user. The workstation communicates with the multimedia server over an Ethernet via standard TCP/IP protocols. By providing the services separate from the workstation, we are able to upgrade workstations and network services with minimum interruption to our multimedia service. |
|
492--497 | C. D. Summerfield | Design And Implementation Of A Multi-Channel Formant Speech Synthesis Asic
Abstract This paper describes the design and implementation of a single chip multi-channel formant speech synthesis Application Specific Integrated Circuit (ASIC). The aim of the R&D project was the design and implementation of a cost-effective device capable of synthesising multiple channels of high quality speech. The paper describes the ASIC architecture developed to implement the complete multi-channel synthesis system on a single chip. At the core of the device is a fully synchronous bit-serial signal processing architecture which implements the formant synthesiser function. This is augmented by circuits which implement a multi-channel interactive glottal source function, delay line elements for the filter network and a flexible interface circuit which enables the device to be directly connected to an industry standard 32 bit bidirectional data bus. Using the device it is feasible to implement a complete multi-channel text-to-speech system using just two components: a microprocessor unit to run the text-to-speech algorithm and the multi-channel speech synthesis ASIC to provide the synthetic speech output. |
|
498--503 | D Rowe, A Perkis, W.G.Cowley, J.A.Asenstorfer | Error Masking In A Real Time Voice Codec For Mobile Satellite Communications
Abstract The implementation of a real time CELP based speech coder is described. The coder has been optimised for operation on a land mobile satellite channel, and full duplex operation is achieved on a single AT&T DSP32C device. The coder operates at a bit rate of around 6400 bps, and produces near toll quality speech. Error masking techniques are investigated with the goal of minimising perceptual errors in the coder output in the land mobile satellite channel. The error masking is achieved with a combination of standard FEC techniques and subpacket substitution.
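The subpacket substitution mentioned in the final abstract can be illustrated with a small sketch. This is not the authors' codec: the parameter names and the error-flagging interface are hypothetical, and a real coder would combine this with the FEC decoding the abstract describes. The idea is simply that a codec parameter subpacket flagged as corrupted is replaced by the last correctly received value rather than being decoded.

```python
def mask_errors(frames, errored):
    """Repeat-last-good masking of corrupted codec parameter subpackets.

    frames:  list of dicts mapping subpacket name -> decoded value per frame
             (names such as "gain" or "pitch" are hypothetical placeholders).
    errored: parallel list of sets naming the subpackets flagged as corrupt,
             e.g. by an FEC or parity check.
    """
    last_good = {}
    masked = []
    for params, bad in zip(frames, errored):
        repaired = {}
        for name, value in params.items():
            if name in bad:
                # Substitute the previous good subpacket; if none has been
                # received yet, fall back to the value as received.
                repaired[name] = last_good.get(name, value)
            else:
                repaired[name] = value
                last_good[name] = value
        masked.append(repaired)
    return masked
```

Substituting a recent good value rather than decoding corrupt bits tends to produce a short, smooth repetition instead of a loud artefact, which is the perceptual goal of error masking in a voice codec.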