
Proceedings of SST 1996

Page numbers refer to nominal page numbers assigned to each paper for purposes of citation.

Session 1: Multimodal Speech

Pages Authors Title and Abstract PDF
1--6 R.E.E. Robinson Synthesising Facial Movement: Real Time Visual Speech

Abstract  This article describes the method by which a phonetic string is processed into visemes, with adaptability to other languages. The visemes are paired into disemes and lip sequences are extracted from a lip database. The resultant lip sequences are joined by using transition tables and placed in a play buffer. The number of frames in the play buffer is adjusted to match the duration of the required sequence. The system timer is used to play back the sequence in real time on a Sun Ultra 1/170 workstation.

PDF
7--12 Takefumi KITAYAMA, Hiroyuki KAMATA and Yoshihisa ISHIDA Development Of Speech Training System For The Hard Of Hearing Person Based On Voice Synthesis Technique Using Vocal Tract Area Function

Abstract  In our laboratory, we have developed a speech training system called "Speech Trainer" for the deaf. The system presents many practice items for hearing-impaired people, and the training results are displayed on the CRT of a personal computer, so that trainees can visually confirm the characteristics of their own voices. Moreover, if hearing-impaired people retain the ability to hear part of the frequency band, voice synthesis techniques can be adopted for speaking practice. In this paper, we propose an effective educational method of speech training for the hard of hearing, combining listening to synthesized sounds whose formants have been shifted with comparative training of vocal tract shape.

PDF
13--17 Jordi Robert-Ribes and Bruce Millar A Simple System For Measuring Audiovisual Speech

Abstract  A system which allows the collection of audio-visual speech data is described. The system is described at the level of the hardware and software involved, and at the level of the precision of the data collected. The key issues of audio-video synchronisation and the standards required for the study of speech, the formal description of both audio and video signals, and the extraction and manipulation of audio and video features are examined in some detail. An example of the performance of the system is presented as the output of a novel software analysis package which has been developed for use with this system. Further developments of this software package are outlined.

PDF

Session 2 Features Analysis 1

Pages Authors Title and Abstract PDF
19--24 Peter Barger, Stefan Slomka, Pierre Castellano and Sridha Sridharan Gender Gates For Automatic Speaker Recognition

Abstract  The present work, based on telephone speech, proposes gender gates suitable as front-ends to ASR discrimination systems. The gates are composed of connectionist and/or statistical classifiers whose outputs are fused for increased robustness. While gender separation is simpler than speaker separation, the former is not a trivial problem as is commonly assumed.

PDF
25--30 Raphael Ahn and W. Harvey Holmes Voiced/Unvoiced/Silence Classification Of Speech Using 2-Stage Neural Networks With Delayed Decision Input

Abstract  This paper proposes a two-stage feed-forward neural network classifier capable of determining voiced, unvoiced and silence frames in the first stage and refining unvoiced and silence decisions in the second stage. The delayed decision from the previous frame's classification, along with the preliminary decision from the first-stage network, the normalised partial-sum-of-autocorrelation-coefficient ratio and the energy ratio, enables the second stage to correct mistakes made by the first stage in classifying unvoiced and silence frames. Comparisons with a single-stage classifier demonstrate the necessity of two-stage classification techniques. They also show that the proposed classifier performs favourably even in the presence of noise.

PDF
31--36 K.L. Jenkin and M.S. Scordilis Automatic Syllable Stress Classification Methods

Abstract  Several approaches to the task of automatically classifying syllable stress in continuous speech are detailed. The neural network and Markov chain techniques are shown to achieve good performance rates of 81-84% and 78-80% respectively. Preliminary findings concerning the utilisation of the classified stress labels for phoneme recognition are provided, and these strengthen the case for prosodic information to be included within continuous speech understanding systems.

PDF

Session 3 Linguistics/Phonetics 1

Pages Authors Title and Abstract PDF
37--42 Yuko KINOSHITA Linguistic Phonetic Differences In The Acoustics Of Plosives In Chinese Dialects

Abstract  This paper investigates linguistic phonetic differences between three Chinese dialects, Cantonese, Peking and Shanghai, focussing on bilabial plosives in particular. The results show that the three dialects in question hold certain distinguishing linguistic phonetic characteristics and, furthermore, the observations suggest that it is necessary to reassess the nature of linguistic phonetic differences. Linguistic phonetic differences are thus not only manifested as between-language differences (or between-dialect differences in this study) in phonetic parameters, but must also be understood in terms of variances in the realisation of certain phonetic phenomena in languages.

PDF
43--48 Robert Bannert & Peter E. Czigler Observations On The Duration Of /S/ In Standard Swedish

Abstract  From a large investigation of the temporal structure and variations of consonants in clusters in Standard Swedish, the alveolar voiceless fricative /s/ was selected for this presentation. Ten speakers produced /s/ as a singleton and as a member of consonant clusters in a frame sentence under different conditions. These included prominence (focus accent), cluster structure, position in the word, and preceding phonological vowel length. Focus accent has the most pronounced effect on segment duration increasing it by about 40 ms in all conditions. The results are compared to some earlier findings for Swedish and American English.

PDF
49--54 Dawn M. Behne, Peter E. Czigler and Kirk P. Sullivan Acoustic Characteristics Of Perceived Quantity And Quality In Swedish Vowels

Abstract  This project re-examines the perceptual weight of vowel duration and the first two vowel formant frequencies in distinguishing phonologically short and long vowels in Swedish. Based on listeners' responses to synthesized sets of materials for [ɪ]-[i:], [ɔ]-[o:] and [a]-[a:], results indicate that vowel duration is of primary importance for distinguishing [i:] from [ɪ] and [o:] from [ɔ], whereas both formant frequencies and vowel duration were found to influence the perception of [a:] and [a].

PDF
55--60 G. Dogil and J. Roux Notes On Unencoded Speech: Clicks And Their Accompaniments In Xhosa

Abstract  In this paper we argue that clicks are a prototypical case of unencoded speech. Unlike other speech sounds, clicks do not coarticulate with their phonetic environment; they block vowel coarticulation across them and may not be recovered from the transitional features which are left after editing them. Unencoded sounds as such present a challenge to most phonetic theories, in particular to all theories of coarticulation and coproduction.

PDF
61--66 David Deterding Diphthong Measurements In Singapore English

Abstract  It is often stated that the vowels /eɪ/ and /əʊ/ in Singaporean English Pronunciation (SEP) are characterised by little or no movement, so that they may be regarded as long monophthongs. However, few measurements have previously been made to check such claims. Nine Singaporeans with a range of educational levels were recorded, and the movement of the first formant was measured for their /eɪ/ and /əʊ/ vowels. These measurements were compared with similar recordings of three British university lecturers and also some BBC broadcasters from a standard database. It was found that all the Singaporeans, regardless of educational level, do indeed tend to have more monophthongal /eɪ/ and /əʊ/ vowels than the speakers of British English.

PDF

Session 4 Speech Recognition 1: Adverse Conditions

Pages Authors Title and Abstract PDF
67--72 Olli Viikki, Kari Laurila, Petri Haavisto A Confidence Measure For Detecting Recognition Errors In Isolated Word Recognition

Abstract  Error detection is an important technology needed to improve the usability of practical speech recognition systems. In this paper, we propose a confidence-based error detection approach for isolated word recognition using hidden Markov models (HMMs). An on-line garbage modelling technique is used to obtain a reference score for the recognition result. The confidence is defined as the difference between the recognized word model score and the garbage model score between the recognized utterance endpoints. Experiments indicate that a large number of recognition errors can be detected even using a small rejection threshold. In the clean environment, we are able to detect and reject over 80% of incorrect recognitions without rejecting any correct recognitions. If the speech signal is corrupted by background car noise, over 60% of recognition errors can be rejected while 95% of correctly recognized utterances are still accepted. Experiments also show that the proposed technique is capable of rejecting out-of-vocabulary words.

PDF
73--78 S.E. Dixon and D.M.W. Powers The Characterisation, Separation And Transcription Of Complex Acoustic Signals

Abstract  Traditional approaches to sound recognition perform poorly in the presence of background noise or multiple simultaneous signals. This research project aims to tackle these difficulties, thus addressing some of the unsolved problems common to many acoustic processing tasks. Although in its early stages, the project is proceeding on two fronts: music transcription and speech recognition.

PDF
79--84 Jean-Baptiste PUEL Cellular Phone Speech Recognition: Neural Nets Preprocessing Vs. Robust HMM Architectures

Abstract  In this paper, we present and compare two methods contributing to the robustness of automatic speech recognition systems in adverse conditions. The first method consists in computing new acoustic parameters corresponding to the identification of a telephone network using a neural net. The second consists in building robust HMM architectures to include and manage more variability in the learning corpora.

PDF
85--90 B. T. Logan and A. J. Robinson Noise Estimation For Enhancement And Recognition Within An Autoregressive Hidden-Markov-Model Framework

Abstract  This paper describes a new algorithm to enhance and recognise noisy speech when only the noisy signal is available. The system uses autoregressive hidden Markov models (HMMs) to model the clean speech and noise and combines these to form a model for the noisy speech. The combined model is used to determine the likelihood of each observation being just noise. These likelihoods are used to weight each observation to form a new estimate of the noise and the process is repeated. Enhancement is performed using Wiener filters formed from the clean speech and noise models. Results are presented for additive stationary Gaussian and coloured noise.

PDF
91--96 Jinhai Cai and Zhi-Qiang Liu An Adaptive Approach To Robust Speech Recognition

Abstract  The performance of speech recognizers often degrades rapidly in noisy acoustic environments. Environmental noise not only disturbs speech features and affects the reliability of feature extraction, it also causes people to change their speaking manner. In this paper, we focus on the acoustic effects of noise. We propose to use the short-time modified coherence representation with a noise-adaptive approach for the extraction of speech features, and adaptively weighted logarithmic output probabilities of HMMs for enhancing robustness to errors in weak speech segments. As a result, the proposed approach works well in both Gaussian white noise and computer fan noise.

PDF

Session 5 Forensic Linguistics

Pages Authors Title and Abstract PDF
97--102 Andrew Butcher Getting The Voice Line-Up Right: Analysis Of A Multiple Auditory Confrontation

Abstract  The practice of confronting witnesses of a crime with a tape recorded ‘voice line-up’, where the voice of a suspect is included amongst a series of ‘foils’, is becoming more frequent as a forensic technique. A tape recently used in such a procedure was submitted to acoustic analysis and to auditory analysis by a panel of listeners. Two speakers were consistently identified as being different from the rest. One of these was the suspect. The voice line-up evidence was ruled inadmissible.

PDF
103--108 F. Schlichting and K.P.H. Sullivan Discrimination Of Imitated Voices

Abstract  The belief that an individual can discriminate between different voices from memory is central to the concept of the voice line-up and its application in the legal sphere. The question of how voice imitation can affect the accuracy of a voice line-up has received little attention. However, recent research has demonstrated that high-quality imitation can cause a problem for the reliability of speaker discrimination within a voice line-up. This paper examines the effect of amateur imitations on the reliability of speaker discrimination within the line-up. The voice which was imitated and the line-up construction were the same as in the study using the professional imitator. Unlike in the study which used the professional imitation, there was almost no confusion between the imitation and the real voice; the listeners were not convinced that the voice they were to recognize was someone else they knew. However, when the imitations were absent from the line-ups there was some confusion between the natural voice and the 'not present' option. In fact, the amateur imitators, even though they failed to cause confusion with the voice they were imitating, succeeded in disguising their voices so as to cause a high number of judicially fatal errors.

PDF
109--114 Phil Rose Speaker Verification Under Realistic Forensic Conditions

Abstract  A forensic phonetic experiment is described which investigates the nature of within- and between-speaker variation in demonstrably similar-sounding voices. The centre frequencies of F1-F4 in the naturally produced single-word utterance 'hello' are compared for 6 adult males. ANOVA results show that even similar-sounding voices differ in F-pattern, but some of these differences are not realistically demonstrable. The magnitude of the smallest significant difference between similar speakers is proposed as a way of estimating the involvement of more than one speaker forensically.

PDF
115--120 J. Pittam and E.S. Rintel The Acoustics Of Voice And Ethnic Identity

Abstract  This paper examines the acoustic characteristics of ethnic identity. In particular, it looks at the long-term acoustic properties of Anglo-, Vietnamese- and Hong Kong Chinese-Australians. The paper is a preliminary report of a larger project, and focuses on long-term spectral features of four speakers from each ethnic group. Three-mode principal component analysis is conducted on long-term average spectra to find group differences among the speakers.

PDF
121--126 Phil Rose and Alison Simmons F-Pattern Variability In Disguise And Over The Telephone Comparisons For Forensic Speaker Identification

Abstract  A pilot experiment was carried out which investigates the nature of variability in the F-pattern of 3 speakers under conditions of disguise and telephone speech. Evidence is presented which may point to effective exclusion of F1 and F4 in comparisons involving the latter. The acoustic and forensic consequences of three different types of disguise are outlined.

PDF

Session 6 Speech Recognition II

Pages Authors Title and Abstract PDF
127--132 Parham Mokhtari and Frantz Clermont A Methodology For Investigating Vowel-Speaker Interactions In The Acoustic-Phonetic Domain

Abstract  A long-standing problem in speech research is concerned with the separation of the phonetic and speaker-specific attributes of the acoustics of spoken language. In a previous attempt (Mokhtari & Clermont, 1994) to re-examine this problem in the context of machine classification of spoken vowels, we first confirmed the relative importance of the low spectral regions for maximum phonetic distinction. However, we also provided compelling evidence of the relatively large, speaker-related potency of the high spectral regions, where inter-speaker vowel distinction was found to be adversely affected. This contrast in classification accuracy observed across the available spectral range led us then to advance the notion of dichotomy which is unfolded by way of the methodology described in this paper. The consequences of the dichotomy are thus studied more closely, with a view to gaining a better understanding of the acoustic-phonetic basis for the detrimental effects of vowel-speaker interactions observed in the high spectral regions. The proposed methodology is also put forward as having the potential of paving the way for more robust speech or speaker recognition systems.

PDF
133--138 W. J. Tey, N. P. Jong, and R. Togneri Investigation Of Speech And Speaker Recognition Based On Trajectory Modeling Of Utterances

Abstract  We present in this paper a modelling technique used to capture the dynamic and temporal behaviour of transitions between phonemes. This model relies on the trajectory, rather than the geometrical position, of the observations in the parameter space. Transition-based models provide an alternative method for acoustic-phonetic modelling of the speech signal. In our modelling technique, the trajectory is modelled by regression analysis with low-order polynomials, followed by statistical clustering of the regression coefficients. The technique is used for both speech recognition and speaker recognition. Results on a small trial set of isolated alphabet sounds and speakers for both speech and speaker recognition are presented. The speech recognition rate using the trajectory model is found to be comparable to traditional HMM modelling. However, the poor results for speaker identification suggest that the current trajectory model is not suitable for this recognition task.

PDF
139--144 Michael Wagner Combined Speech-Recognition/Speaker-Verification System With Modest Training Requirements

Abstract  This study investigates a combined speech recognition and speaker verification system which performs well when client training is restricted to only a few repetitions per utterance. The system employs vector quantisation and discrete hidden Markov models so as to make use of VQ codeword indices for word recognition and the corresponding VQ distortions for speaker verification. Different codebook sizes are investigated on a speech corpus of 10 computer command words spoken by 8 male and 8 female speakers. At the optimum codebook size of 64, the system achieves word error rates of about 4 percent and speaker verification equal-error rates of about 9 percent for single-word utterances. If up to 4 or 5 consecutive words are accumulated for speaker verification, the equal-error rates of the system fall to about 4 percent.

PDF

Session 7 Linguistics/Phonetics 2

Pages Authors Title and Abstract PDF
145--150 Frantz Clermont Multi-Speaker Formant Data On The Australian English Vowels: A Tribute To J.R.L. Bernard's (1967) Pioneering Research

Abstract  In this paper we endeavour to reinforce the importance of Bernard's (1967) pioneering study of the Australian English vowels, by attempting a more complete account of the extent and the diversity of his multi-speaker spectrographic estimates of the first three formant frequencies (F1, F2 and F3). Through a careful restoration of these parameters we show that, similarly to Peterson and Barney’s (1952) seminal study of the American English vowels, Bernard's work has yielded an invaluable contribution to that body of data, which are still to date difficult to acquire automatically from the speech signal, but which continue to provide theoretically-robust probes into the articulatory and acoustic processes of speech communication.

PDF
151--156 Marija Tabain Nasal Consonants In Yanyuwa And Yindjibarndi: An Acoustic Study

Abstract  This study looks at nasal consonants in two Australian languages, Yanyuwa and Yindjibarndi. Locus equations are used to find evidence for acoustic loci for these consonants, and also to investigate the degree of coarticulation with the following vowel involved in their production. Results strongly suggest that despite having a large nasal consonant inventory, and a small vowel inventory, Yanyuwa and Yindjibarndi show a great deal of coarticulation in their nasal consonants according to vowel context.

PDF
157--162 Gerry Docherty & Paul Foulkes A Corpus-Based Account Of Variation In The Realisation Of 'Released' /t/ In English

Abstract  In this paper we describe the main patterns of realisation which we have observed in non-glottal(is)ed stops in pre-pausal position in British English. We present quantitative results which suggest that the patterns cannot be entirely accounted for in terms of free variation or articulatory economy, and we outline some ideas about how these data might be interpreted. The principal points which emerge are (a) that work on phonological variation and change can be enhanced by making use of a more detailed phonetic analysis than is usually the case; and (b) that aspects of phonological variation which can only be observed from a corpus of naturalistic speech bring into sharp focus issues related to the nature of phonological theory and its relationship to speech performance.

PDF
163--168 Helen Fraser An Introduction To Phenomenological Phonology (PP)

Abstract  Phenomenological Phonology (PP) is a framework for studying phonological phenomena which I have been developing over recent years. It is intended as an alternative to mainstream theories based on or deriving from Generative Phonology (GP), and uses the insights of phenomenological philosophy. In this paper, I give a brief outline of PP. More in-depth discussion and justification is available in Fraser 1992 (hereafter SSP) and Fraser (in press).

PDF

Session 8 Coding & Synthesis

Pages Authors Title and Abstract PDF
169--174 S. C. Chu and J. S. Pan Tabu Search Algorithms To VQ Codevector Index Assignment For Noisy Channels

Abstract  Vector quantization is a popular technique for data compression. It provides good performance against channel noise if a suitable codevector index assignment algorithm is used. The problem of VQ codevector index assignment is NP-hard. In this paper, tabu search approaches are applied to codevector index assignment for noisy channels, with the aim of minimizing the distortion due to bit errors without introducing any redundancy. Experimental results compared with the standard parallel genetic algorithm and the binary switching algorithm confirm the usefulness of these approaches.

PDF
175--180 Mike Wu & W. H. Holmes A Low Rate Sinusoidal Speech Coder

Abstract  This paper reports on a study of low bit rate speech coding based on the sinusoidal model. Several techniques are employed to reduce the coder's bit rate and increase the processing speed. Firstly, the Hilbert transform is used to estimate the system phase response. Secondly, a bilinear transform is used to warp the spectrum to exploit the auditory characteristics of the human ear when coding at very low rates. Thirdly, a method to estimate the coarse pitch for the SEEVOC model is introduced and a method to perform pitch correction is presented. Finally, a simplified birth-and-death algorithm is presented. The simulation results show that the reconstructed speech is of good quality at a bit rate of 4800 b/s, and is still intelligible and speaker-recognizable at 1200 b/s.

PDF
181--186 H.R. Sadegh Mohammadi and W.H. Holmes Differential Interpolative Prediction Scalar Quantization Of The Line Spectral Frequencies For Low Bit-Rate Spectral Coding Of Speech

Abstract  Line Spectral Frequencies (LSFs) have been widely used as a set of parameters for representing the all-pole filter in linear prediction based speech coders. In this paper, a new method called Differential Interpolative Prediction Scalar Quantization (DIPSQ) is proposed for coding the LSFs. It is shown by simulation that with the new quantization scheme these parameters can be encoded more efficiently than with ordinary scalar quantization as used in standard CELP coders.

PDF
187--192 J. S. Pan and S. C. Chu Improved Algorithms For VQ Codeword Search And The Derivation Of Bound For Quadratic Metric Using Principal Component Transform

Abstract  A new bound for the quadratic metric using the principal component transform is derived in this paper. A new fast search algorithm for the quadratic metric is also proposed: the transformed codewords are stored first, and the algorithm then operates on the transformed input data using the previous vector candidate, the bound for the quadratic metric, and partial distortion search. Experimental results demonstrate that, compared with previous work (Pan et al., 1996a), the new algorithm reduces the number of multiplications and the total number of mathematical operations for 1,024 codewords by more than 77% and 50% respectively.

PDF
193--198 H.R. Sadegh Mohammadi and W.H. Holmes Considerations In The Selection Of An Objective Measure To Assess The Quality Of Spectral Coding Methods

Abstract  In low rate speech coders based on the linear prediction method, the quality of the synthesized speech is highly affected by the amount of distortion arising from the spectral coding stage. In this study we investigate two basic models for evaluating the quality of short-term spectrum quantization. The advantages and disadvantages of each model are studied. Moreover, the difficulties in comparing the results of different published studies are found to stem from five groups of incompatibilities. We demonstrate the differences between the results of assessments based on these models for several spectral coding methods using vector quantization.

PDF
199--204 Kerrie Lee, Phillip Dermody, Daniel Woo Evaluation Of A Method For Subjective Assessment Of Speech Quality In Telecommunication Applications

Abstract  A seven-point rating scale was used to obtain mean opinion score (MOS) ratings from a group of listeners who used the scale to judge the quality of speech distorted using a modulated noise reference unit (MNRU). The results showed that there were large individual differences in the use of the scale by listeners. Despite these differences, individual listeners gave perceptual results which reflected the degree of MNRU distortion. An ANOVA showed statistically reliable differences in the trend of the quality judgments across the MNRU conditions.

PDF
205--210 Peter Veprek Czech Text-To-Speech System For A Reading Machine

Abstract  In this paper, a Czech text-to-speech (TTS) system developed for a reading machine is presented. The whole TTS system was divided into rule- and lexicon-based text-to-phoneme conversion, prosody pattern calculation, rule-driven allophone selection, and linear-prediction-based speech production. Description of each of these components is given in the paper. The result is a complete yet compact TTS system that meets criteria laid out in the original project specifications.

PDF

Session 9 Speech Disorders I

Pages Authors Title and Abstract PDF
211--216 Lyn Goldberg Ph.D. American The Effects Of The Attenuation Of Second And Third Formant Frequencies On The Recognition Of Stop Consonant Vowel Syllables In Aphasic And Nonaphasic Subjects

Abstract  This study compares the ability of nonfluent and fluent aphasic and nonaphasic subjects to recognize six stop consonant vowel syllables (/pa, ba, ta, da, ka, ga/) under three formant frequency conditions (synthesized presentation of F1, F2 and F3; attenuation of F2; and attenuation of F3). ANOVAs and confusion matrices are used to document and compare the voicing and place of articulation responses made by each group across all formant conditions. Results indicate that aphasic subjects demonstrate speech recognition abilities remarkably similar to those of their nonaphasic counterparts when acoustic information vital to the recognition of speech is attenuated, and imply that more intact speech production facilitates more accurate speech recognition.

PDF
217--222 P.F. McCormack & B. Dodd A Feature Analysis Of Speech Errors In Subgroups Of Speech Disordered Children

Abstract  A feature analysis was undertaken of the speech production errors of 99 pre-school children categorised into subgroups of speech disorder. Results indicated that the type of errors could not be accounted for by the severity of the speech disorder, indicating the presence of qualitatively different speech disordered groups.

PDF
223--228 Lynda Penny, Simon Mitchell, Natasha Saunders, Jenny Hunwick, Helen Mitchard & Mary Vrlic Some Aspects Of Speech And Voice In Healthy Ageing People

Abstract  Changes to aspects of speech and voice in ageing people are widely reported, but the reports are sometimes contradictory, perhaps because the samples surveyed vary greatly in size and their health status is not always clear. The tasks and methods of measurement vary too, both of which are known to affect results. Reported here are data on rate of utterance and fundamental frequency from a sample of healthy ageing people, all native-born speakers of General Australian English.

PDF
227--232 Sameer Singh, Romola Bucks, Jody M. Cuerden Speech In Alzheimer's Disease

Abstract  This paper describes the quantification of physical characteristics of Alzheimer’s patients’ conversational speech. The study was conducted on a total of eight probable AD patients and eight normal controls. For each group, a total of five measurements were made from their conversational speech recordings which depended on verbal fluency and pauses in speech. The paper discusses statistical results obtained with these parameters and explains their usefulness for quantifying speech deficits in Alzheimer's disease.

PDF
233--238 Sameer Singh and Tom Gedeon Hypertext Tools In Speech And Language Therapy

Abstract  This paper investigates hypertext tools in communication disorders, particularly in aphasia assessment and therapy. The assessment of language comprehension abilities can be facilitated by evaluating patient performance on information retrieval tasks. Hypertext tools can be used to gather information about patients' planning abilities and their semantic understanding of the available information. The paper explores the use of hypertext for generating therapy material which works with traditional methods of aphasia therapy, and highlights its importance in evaluating verbal and non-verbal abilities. Patient performance on hypertext-related tasks can be quantified with the proposed parameters, which quantify the degree of language deficit through information retrieval and understanding processes. These parameters need to be tested for their importance in outcome management: whether or not they are sensitive enough to measure changes. The role of hypertext in developing therapy exercises is discussed. The advantages of hypertext tools in single case-studies are highlighted, and an active role for hypertext applications in aphasia management is suggested.

PDF

Session 10 Speaker Recognition

Pages Authors Title and Abstract PDF
239--244 A. Satriawan and J.B. Millar Broad Phonetic Class Based Speaker Modelling

Abstract  This paper discusses the results of speaker modelling experiments based on broad phonetic classes of a speaker, using different hidden Markov model structures and testing on different durational classes of broad phonetic class test segments. The results show that for fricatives and plosives the full covariance normal models are much better than other models, independent of segment duration, but especially for medium and long segments. For nasals, vowels and approximants, left-to-right models are preferable. However, for short segments, left-to-right models with skips are best for all broad phonetic classes.

PDF
245--249 S. Hussain, F. R. McInnes and M. A. Jack Comparison Of Neural Network Techniques For Speaker Verification

Abstract  In this paper a comparison is made between two alternative approaches for speaker verification (SV) using neural networks. Firstly, a vector quantization preprocessing stage was used as the front end. The preprocessing stage measures the local spectral similarities by using a vector quantizer to select the index. The indices of the winner units are fed to a second stage neural network in which the system can be trained and evaluated. Two experiments were performed. The first experiment used a neural network model (NNM) with frame labelling performed from a client codebook, known as NNM-C. Better performance was obtained from this model when compared with the SCHMM (semi-continuous hidden Markov model). The second set of experiments used the NNM with frame labelling from the client and the impostors' codebooks, known as NNM-CI. The results were not as good when compared with the NNM-C and SCHMM.

PDF
251--256 A. Samouelian Automatic Language Identification Using Inductive Inference

Abstract  Automatic spoken language identification (LID) plays an important part in routing foreign callers to operators who speak the caller's language, or as a front-end to a multi-lingual translation system to route the call to the appropriate translation system. A common approach to spoken language ID is adopted from current speaker independent recognition techniques. These generally involve the development of a phonetic recogniser for each language and then combining the acoustic likelihood scores to determine the highest scoring language. The models are trained using hidden Markov modelling (HMM) or neural networks (NN). This paper proposes a novel approach to spoken language identification by the use of inductive inference "decision trees". To develop the production rules, the classification models are generated inductively by examining a large speech database and then generalizing the pattern from the specific examples. This approach has already been successfully used for isolated digit recognition (Samouelian, 1995). The aim of this research is to demonstrate that inductive learning can provide a viable alternative approach to existing automatic spoken language identification techniques. The proposed LID is based on an automatic speech recognition (ASR) system using inductive inference (Samouelian, 1994a, 1994b). It uses a single decision tree to capture all the complexities of each language, using mel-scaled cepstral coefficients (MFCC) as input. The training database is labelled at the language level. The LID classification is performed at the frame level, using an inference engine to execute the decision tree and classify the firing of the rules. A simple sorting routine is then used to identify the spoken language. Spoken language identification results using the OGI Multi-language Telephone Speech Corpus (OGI_TS) on the three-language task (English, German and Japanese) are presented.

PDF
257--262 Karsten Kumpf LDA Based Modelling Of Foreign Accents In Continuous Speech

Abstract  A foreign accent classification system based on phoneme-dependent LDA models has been implemented. The classifier generates accent likelihood scores for single phoneme segments extracted from continuous speech. An automatic training and model optimisation procedure allows evaluation of the contribution of individual phoneme classes and features to the classification task. The average accent classification rates for single phonemes from three accented speaker groups were 69.4% and 49.5% in a multi-speaker and a speaker-independent test, respectively. The relative positions of the accented speakers in the feature space can be shown.

PDF
263--268 D. R. Dersch The Acoustic Fingerprint: A Method For Speaker Identification, Speaker Verification And Accent Identification

Abstract  We present a new parameter-free framework based on similarity measures between data spaces and demonstrate its performance for different tasks in the field of automatic speech processing, e.g. text-independent speaker identification, speaker verification and accent identification. The speech data are taken from the ANDOSL database. The speech signal is coded by 12 mel-frequency cepstrum coefficients (MFCC). The identification and verification task is performed by calculating similarity measures between data spaces occupied by utterances in the 12-dimensional MFCC space. In order to reduce the computational effort, we apply a vector quantization technique. On a set of 108 speakers we achieve an accuracy of 100% for text-independent speaker identification, and 100% successful acceptances and 99.81% successful rejections for text-independent speaker verification. First results for accent identification yield an accuracy of 72.3-74.5% for the discrimination of two different Australian accents, Lebanese Arabic and South Vietnamese.

PDF

Session 11 Speech Disorders II: Cochlear Implant And Hearing Improvement

Pages Authors Title and Abstract PDF
269--274 J.Z. Sarant, P.J. Blamey and G.M. Clark The Effect Of Language Knowledge On Speech Perception In Children With Impaired Hearing

Abstract  Open-set words and sentences were used to assess the auditory speech perception of three hearing-impaired children aged 9 to 15 years using the Nucleus 22-channel cochlear implant. Vocabulary and syntax used in the tests were assessed following the initial perception tests. Remediation was given in specific vocabulary and syntactic areas, chosen separately for each child, and the children were reassessed. Two children showed a significant post-remediation improvement in their overall scores on the syntactic test and both perception measures. The third child, who was older and had the best language knowledge but the lowest auditory speech perception scores, showed no significant change on any of the measures. Language remediation in specific areas of weakness may be the quickest way to enhance speech perception for some children with impaired hearing in this age range.

PDF
275--280 Cécile Pereira Angry, Happy, Sad Or Plain Neutral? The Identification Of Vocal Affect By Hearing-Aid Users

Abstract  This paper reports a speech perception study on the identification of vocal affect by hearing aid users. The performance of a group of 40 normally-hearing subjects is compared to that of a group of 39 post-lingually deafened subjects using hearing aids. All subjects are adult native speakers of English. Results indicate that there are major differences between the two groups. Scores for overall identification of affect are 85% for the normally hearing listeners and 65% for the hearing-aid users. Patterns of confusion in the identification of emotions are similar in the two groups, but there is a greater degree of confusion in the hearing-aid users. There is a significant negative correlation between hearing loss and the correct identification scores.

PDF
281--286 P.J. Blamey, E.S. Parisi & G.J. Dooley Perception Of Two-Formant Vowels By Normal Listeners And People Using A Hearing Aid And A Cochlear Implant In Opposite Ears.

Abstract  64 two-formant vowels were synthesised in an /h-vowel-d/ context using first formant frequencies (F1) from 300 to 900 Hz in 100 Hz steps and second formant frequencies (F2) from 600 to 2400 Hz in 200 Hz steps. Listeners classified the stimuli according to the nearest word from the list "hid, head, had, hud, hod, hood, heed, heard, hard, who'd, hoard". Analysis of the centre frequencies for each response category showed significant differences between the Australian and American response patterns, but not between the response patterns for male and female listeners with normal hearing. The cultural differences in the response patterns corresponded closely to differences that have been documented for vowel production. The implanted ear patterns were closer than the aided ear patterns to the normal listeners' patterns (from the same country). Binaural response patterns for the hearing-impaired listeners showed influences from both monaural patterns, but tended to be closer to the implanted ear pattern than to the aided ear pattern. The response patterns for normal listeners showed greater consistency than those for implanted ears, which in turn showed greater consistency than those for the severely-to-profoundly hearing-impaired hearing aid ears. The results show that hearing impairment and hearing aid use can change perceived vowel quality as well as affecting frequency resolution.

PDF
287--292 Bernice McGuire Speech, Phonological Awareness And Reading Skills In Children With Impaired Hearing

Abstract  The connection between speech and phonological awareness is discussed in relation to the way in which these might affect the reading ability of children with impaired hearing. The ability to read requires that the person use a code, prior knowledge, vocabulary and linguistic knowledge, as well as contextual information, in order to understand text. Automatic and rapid decoding skills are essential for effective reading, allowing the short-term memory to be utilised for the linguistic interpretation of the words rather than in taking time to decode the individual words (Perfetti, 1992; Stanovich, 1986). Word identification requires that a phonological form of the word is retrieved as part of the decoding process. Good comprehension of the text is dependent on efficient decoding skills. Studies by Stanovich (1992, 1994), and Jorm and Share (1983), suggest that poorer readers make more use of context in order to comprehend text due to a lack of automaticity in decoding skills. In comparison, proficient readers are able to decode the words efficiently and only use context to confirm their understanding of the text. A skilled reader uses context to interpret words and sentences rather than to identify words. In fact, Perfetti (1995) believes that "the hallmark of skilled reading is fast context-free word identification and rich context-dependent text understanding". These two aspects of reading, decoding skill and use of context, have given rise to the two major categories of reading theory. The bases of these theories need to be understood because they have very different implications for the teaching of reading.

PDF

Session 12 Speech Recognition III

Pages Authors Title and Abstract PDF
295--301 A. Samouelian Connected Digit Recognition Using Inductive Inference

Abstract  This paper proposes a novel approach to connected digit recognition by the use of inductive inference "decision trees". To develop the production rules, the expert is bypassed and instead the classification models are generated inductively by examining a large speech database and then generalising the pattern from the specific examples. This approach has already been successfully used for isolated digit recognition [Samouelian, 1996]. The aim of this research is to demonstrate that inductive learning can provide a viable alternative approach to existing automatic speech recognition (ASR) techniques. The proposed system uses a mel frequency cepstral coefficient (MFCC) front-end signal processing technique. The C4.5 inductive system [Quinlan, 1993] generates the decision tree automatically from labelled examples in the training database. The recognition is performed at the frame level, using an inference engine to execute the decision tree and classify the firing of the rules. A sorting routine is then used to identify the digit string. Connected digit recognition results for the Texas Instruments (TI) digit database, for speaker dependent and independent recognition, for known and unknown digit string lengths, are presented.

PDF
301--306 A. Jusek, G. A. Fink, F. Kummert, and G. Sagerer Automatically Generated Models For Unknown Words

Abstract  Especially in the recognition of spontaneous speech, it is necessary to cope with the occurrence of unknown words. We present an approach to unknown word detection which is integrated into a standard HMM speech recognizer. From the context dependent sub-word units, e.g. triphones, that can be found in the training database, a generic word model can be derived automatically, using the context restrictions to form valid sequences of sub-word units. This generic word model combines automatically derived knowledge about the phonotactics of the language considered with the modelling quality of context dependent acoustic units. Detection of unknown words is achieved by adding this model to the recognizer's lexicon. We present results of experiments carried out on a large German spontaneous speech recognition task.

PDF
307--312 D. R. Dersch Neural Network Approaches To Speech Recognition: A General Radial Basis Function Network For Speaker-Independent Phone Classification

Abstract  In this paper we present neural network approaches which enable both the analysis of a high dimensional data space of phone templates and the construction of a speaker-independent isolated phone classifier based on a Generalized Radial Basis Function Network (GRBFN). Firstly, we present a codebook obtained by a neurally motivated fuzzy vector quantization procedure. Such codebooks provide an intrinsic discretization of the data space spanned by phone templates into various phone groups, e.g. stops, fricatives, nasals and vowels of different pitch. Secondly, a codebook is used to train a three-layer GRBFN in a two-step optimization process. As a result we obtain a speaker-independent single phone recognition accuracy of 63.1% on the training set and 62.2% on the test set for 52 different phone classes. A coarse classification into the five phone groups 'stops', 'fricatives', 'nasals', 'semi-vowels' and 'vowels' yields a recognition accuracy of 87.6% on the training set and 87.0% on the test set, respectively. Phone templates are obtained from the male training corpus of the TIMIT database.

PDF
313--318 David B. Grayden and Michael S. Scordilis Using The Vowel Triangle In Automatic Speech Recognition

Abstract  An approach to reducing the number of insertions in a speech recognition system is presented which makes use of the relationship between the places of articulation of sonorant phonemes. A neural network is trained to locate place of articulation and the resulting contour is examined for phoneme boundaries. A Time-Delay neural network is also developed to locate nasal phonemes, a special case of sonorants.

PDF
319--324 Michael Barlow, Stephanie Dal, Tatsuo Matsuoka and Sadaoki Furui An Automatically Acquired CFG For Speech Understanding And Hypotheses Reordering

Abstract  The paper describes the generation and use of a context free grammar (CFG) as a component in both a speech recognition and a speech understanding system. An N-best speech recogniser was run on sentences from the ATIS-2 distribution and the top 25 hypotheses for each were produced. Post-processing grammar models (bigrams, trigrams, cooccurrence, finite state, and semantic MM) were employed to reorder the hypotheses. All showed considerable reduction in sentence error rate. The incorporation of the CFG led to a further, significant reduction, with the best showing more than a halving of the original error rate. The speech understanding experiments comprised a finite-state grammar based system for translating class-A sentences into database queries. Incorporation of the CFG dramatically improved the translation rate as well as reducing the finite-state grammar's perplexity and complexity.

PDF

Session 14 Databases

Pages Authors Title and Abstract PDF
351--356 Peter Roach, Simon Arnfield and Elizabeth Hallum Babel: A Multi-Language Database

Abstract  A speech database is being constructed by a group of European researchers concentrating on languages of Central and Eastern Europe. The languages covered are Bulgarian, Estonian, Hungarian, Polish and Romanian, and the database is modelled on the Western European EUROM-1 design. The project is co-ordinated by the Speech Research Laboratory at the University of Reading, UK.

PDF
355--360 Christoph Draxler The German SpeechDat Telephone Speech Corpus: Overview And Experiences

Abstract  SpeechDat is a European project to collect ISDN quality telephone speech for all major European languages. In the first phase of the project, 1000 speakers were recorded in eight languages, including German. The paper presents the experiences gained during the data collection for German, and outlines the specifications for the second phase of the project.

PDF
361--366 Steve Cassidy and Jonathan Harrington Emu: An Enhanced Hierarchical Speech Data Management System

Abstract  EMU is a system for labelling, managing and retrieving data from speech databases such as the Australian ANDOSL database or the US TIMIT database. EMU is a re-implementation of the earlier MU+ system (Harrington, Cassidy, Fletcher, and McVeigh 1993) with the aim of providing a more flexible environment. The hierarchical structures and database query facility have been generalised and the system has been extended to include an interactive labeller with spectrogram and waveform displays. EMU incorporates the Tcl/Tk scripting language, which can be used to extend the labeller and to perform many automated operations on databases; as an example, scripts have been written to automatically construct hierarchical descriptions given Phonetic-level labels. The need for increased flexibility was driven largely by the desire to use the system on languages other than English. This paper concludes by describing a database for Cantonese, and a database used in a kinematic study of vowel lengthening, both of which include facilities for automatically generating hierarchies.

PDF

Session 13 Speech Development

Pages Authors Title and Abstract PDF
325--330 Christine Kitamura & Denis Burnham Pitch & Communicative Intent In Infant-Directed Speech: Longitudinal Data

Abstract  This study examines the modifications made to pitch and communicative intent in infant-directed speech (IDS) from birth to 12 months, at 3-monthly intervals. With regard to pitch, mothers' mean fundamental frequency (F0) peaks when the infant is 6 months old, while pitch range is highest at 12 months of age. Sex-based differences were also evident, with mothers using higher mean F0 and pitch range in speech to female than to male infants. Furthermore, the average shape of the utterance transposes from a fall-flat/rise in speech to newborns to a flat-fall in speech to adults. With respect to communicative intent, two factors were extracted from five rating scales and these were labelled 'affective' and 'attentional/didactic'. Analysis of these factors showed that mothers express more affection at 6 months than at other ages, while peak levels on the attentional/didactic factor were reached at 9 months of age. Mothers also increase their use of both these components of IDS more in speech to girl than to boy infants.

PDF
331--336 S. McLeod, J. van Doorn, and V. Reed Homonyms And Cluster Reduction In The Normal Development Of Children's Speech

Abstract  As children are learning to speak, they sometimes reduce consonant clusters to a single element, and as a result produce a homonym (e.g. they say "top" for "stop"). There is some evidence in the literature which suggests that even though these words may sound the same, they have differences which can be detected with acoustic analysis, indicating that the children are making covert distinctions between the two contexts. The purpose of this study was to make acoustic comparisons of homonym pairs produced as a result of cluster reduction by a group of 16 young children (2;0 to 2;11 years). Duration and relative energy of aspiration for the stops /k/ and /t/, duration and spectral distribution for the fricative /s/, and voice onset time (VOT) for /k/ were measured in several word-initial contexts. Results showed that for word-initial /s/ plus stop clusters which had been reduced to a stop, the aspiration duration for the stop in the cluster target word was significantly less than that for the singleton target word. No other time or spectral measures reached statistical significance. The results have been interpreted in terms of phonological and speech motor development in children.

PDF
337--342 P.F. McCormack & T. Knighton Gender Differences In The Speech Patterns Of Two And A Half Year Old Children.

Abstract  The speech patterns of 50 normally developing two and a half year old children were investigated. Differences were found between the males and females for a distinct clustering of processes that simplify syllable structure. There was a significantly greater use of final consonant deletion, weak syllable deletion, and cluster reduction by the boys, while there were no differences in the use of other speech processes or in receptive and expressive language abilities. A discriminant function constructed from these 3 syllable-simplifying processes correctly classified each subject as being either male or female with eighty percent success. Interestingly, in the pre-school and early school years boys have a higher incidence of developmental speech disorders than girls (2 to 1), with the marked use of weak syllable deletion, final consonant deletion, and cluster reduction distinguishing severe cases of developmental speech disorder. These are the same processes identified in this study as being generally more evident in boys than girls at two and a half years of age. The question arises as to whether there is any relationship between early patterns of speech development in a child and later identification as having a speech disorder.

PDF
343--348 Christine Kitamura & Denis Burnham Infant Preferences For Infant-Directed Speech: Is Vocal Affect More Salient Than Pitch?

Abstract  The aim of the following experiments was to ascertain whether infants are more responsive to the pitch or the vocal affect in infant-directed speech (IDS). In Experiments 1 and 2, infant preferences were tested for high vs low vocal affect with the level of pitch equated (HiAffect vs LoAffect IDS), and in both experiments infants preferred to listen to HiAffect IDS. In Experiment 3, high vs low pitch was presented with the level of vocal affect equated (HiPitch vs LoPitch IDS), and it was found that infants preferred LoPitch over HiPitch IDS. This result was unanticipated, and when a different procedure was used to rate the vocal affect of the speech exemplars in Experiment 4, there was no difference in infant preferences for Hi or LoPitch IDS. Taken together, these two sets of results suggest that it is the affective salience of IDS that is important to infant responsiveness, and not necessarily the pitch characteristics alone. In the final experiment infants showed no differential preferences for normal or low-pass filtered IDS, confirming they are as responsive to the intonation or pitch characteristics as they are to full spectral versions of speech. Therefore it is suggested that pitch is used as a means of conveying affective intent to infants.

PDF

Session 15 Posters

Pages Authors Title and Abstract PDF
367--372 C. Blight, A. Butcher, & P. McCormack Nasal Airflow Measures Pre- And Post- Tonsillectomy

Abstract  Little is known about the role of the tonsils in speech. In particular, the effect that removal of the tonsils has on the speech mechanism has not been investigated. The present study investigated the effect of tonsillectomy on the ratio of nasal to oral airflow, as an indirect measure of velopharyngeal function, in a group of children (N = 23, age range 7-14 years) compared with a control group of children (N = 33, age range 8-14 years). The results indicated that tonsillectomy does have a significant effect, increasing nasal airflow in proportion to oral airflow in nasal consonant environments. These changes in the balance of nasal to oral airflow in the tonsillectomy group, however, were not detected by experienced judges of nasality.

PDF
373--378 Kimiko Tsukada Acoustic Analysis Of Japanese-Accented Vowels In English

Abstract  In an attempt to characterize Japanese-accented English, vowels in monosyllabic English words produced by 12 Australian talkers and 24 Japanese talkers were analyzed acoustically. The results show clear temporal and spectral differences between the two groups, which may be perceived as a "foreign accent" in the English produced by the Japanese talkers. These differences are in agreement with the notion of the intermediate nature of non-native speech production.

PDF
379--384 Lydia K. H. So and Jing Wang Acoustic Distinction Between Cantonese Long And Short Vowels

Abstract  This study focuses on the acoustic analysis of Cantonese vowels, using a database containing speech data from two subjects, comprising approximately 1860 monosyllable words per subject. The acoustic analysis concerns vowel (acoustic) duration and vowel spectral quality (F1-F2 formants).

PDF
385--390 V. Mildner and Z. Rukavina Hemispheric Specialization For Phonological Processing

Abstract  Ear advantage as an indicator of hemispheric specialization for phonological processing was examined during a rhyme test by means of response time, accuracy and laterality index on right-handed female subjects.

PDF
391--396 P.F. McCormack & B. Dodd Phonetic Variability In Speech Disordered Children: A Comparison Of Real And Nonsense Words

Abstract  A speech production experiment is reported in which the timing and formant variability for the vowels /a/, /i/ and /o/, used in the naming of real words and the imitation of nonsense words, are compared across 4 subgroups of speech disordered children. The first group exhibited normal speech development, the second group exhibited delayed speech development, the third group exhibited unusual but consistent speech patterns, while the fourth exhibited highly inconsistent speech patterns. Multiple analysis of variance indicated no differences in timing variability across groups or contexts. The inconsistent group of children, however, exhibited significantly greater vowel formant variability in naming real words compared to the other 3 groups. No such difference occurred between the groups in the imitation of nonsense words. The results indicate that the inconsistent subgroup of speech disordered children do not exhibit a generalised motor disturbance, but suggest a highly selective type of phonetic disturbance where a particular parameter of the phonetic specification of a lexical item is less specific than in the other groups of children, or is more difficult to access. The implications for models of speech production and their development are discussed.

PDF
397--402 Lydia K. H. So and K.W. Chan Electropalatographic Pattern Of Cantonese Speech

Abstract  Electropalatography (EPG) is a technique which provides a visual display of the tongue's contact with the hard palate. This paper describes the development of a systematic database of EPG patterns of all possible Cantonese syllables (about 1860). EPG version 3 was used in the data collection for the database. Syllable structure includes the following combinations: V, CV, VC, CVC and CVG. The EPG patterns of different consonants and vowels and consonant-vowel interaction are available in this database.

PDF
403--406 David Hawthorn and Chris White A Talking Word Processor

Abstract  Speech is becoming an increasingly common form of computer output. Its applications include any situation where the reading of a computer screen may be difficult. This paper looks at one very easy method of implementing speech in Microsoft Word for Windows (1).

PDF
407--412 Young-Mok Ahn and Hoi-Rin Kim Development Of A Very Fast Preprocessor

Abstract  This paper proposes a very fast preprocessor for large vocabulary isolated word recognition. The preprocessor extracts a few candidate words using the frequency and the time information for each word. For designing reference patterns, we use the order of amplitude of the speech features, so the proposed preprocessor has a small computational load after the extraction of the speech features. In order to show the effectiveness of the proposed preprocessor, we compared it to a speech recognition system based on a semi-continuous hidden Markov model and to a VQ-based preprocessor by computing their recognition performance on speaker-independent isolated word recognition. In the experiments, we used three speech databases: the first consists of 244 words including digits, English alphabet letters, etc.; the second consists of 22 words including section names; the third consists of 35 words. The preprocessor is composed of three major parts: feature extraction, feature sorting, and matching with reference templates. After sorting, it requires only one vector addition per frame, namely, vocabulary size × length of incoming frames. This approach is therefore much faster than our previous version, a VQ-based preprocessor for the isolated word recognition task (Ahn et al., 1994). In the experimental results, the accuracy of the feature-sorting based preprocessor is 99.86% with a 90% reduction rate for the speech database of 244 words.

PDF
415--420 K.P.H. Sullivan and Y.N. Karst The Perceived Naturalness Of The Word-Final Dental Stop In The English Of Native Speakers Of Swedish

Abstract  In Swedish, the duration of consonants after long vowels is shorter than after short vowels. This is a fact of which much is made in the instruction of Swedish as a foreign language. In English the difference can be considered not to exist. In the teaching of English to native Swedish speakers little, if any, importance is placed on the 'removal' of this complementary vowel:consonant length relationship. This paper reports on work undertaken to assess the perceptual importance of acquiring English consonant length. Consonant length was found to be perceptually unimportant, yet the degree of aspiration after the consonantal closure was found to be perceptually noticeable.

PDF
421--426 Hartmut R. Pfitzinger Two Approaches To Speech Rate Estimation

Abstract  This paper introduces two approaches to speech rate estimation: one is based on automatic syllable detection and the other on automatic phone segmentation. For the evaluation of both approaches, we used manually segmented syllables and phones as a reference. Although the segment detectors used are not perfect, it is possible to automatically estimate the local rate of phones and the local rate of syllables reliably. We argue that neither the rate of phones nor the rate of syllables alone suffices for estimating actual speech rate.

PDF
427--432 B. Byrne, A. Butcher, & P. McCormack The Speech Rhythm Of Vietnamese Speakers Of English

Abstract  The durational characteristics of the speech rhythm of Vietnamese speakers of English were compared to those of native speakers of Australian English. The two aspects of English speech rhythm examined were compensatory shortening of stressed syllables when unstressed syllables are added to an interstress interval, and rhythmic stress shifts. Results showed several areas where the Vietnamese subjects' performance differed from that of the native subjects. They evidenced a lesser degree of shortening, or compression, of unstressed syllables in interstress intervals. In stress shift contexts, although marking the shift, they exhibited a lesser degree of durational adjustment to the second stressed syllable. The results indicate an overall difficulty in marking durational contrasts with native-like proficiency.

PDF

Session 16 Poster

Pages Authors Title and Abstract PDF
433--438 Alain Marchal & Yohann Meynadier Coarticulation In /kl/ Sequences In French: A Multisensor Investigation Of The Timing Of Lingual Gestures

Abstract  The timing of lingual gestures in the production of /kl/ clusters in read French sentences is reported in this poster. Our data indicate that the body and tip of the tongue do not behave as independent articulators. In fact, there is a strong resistance to coarticulation from /k/ to /l/.

PDF
439--444 Lydia K.H. So, D.K.K. Au and B.M. Chen Development Of Cantonese Speech And Tone Viewer

Abstract  Modelling and practice are important for good results in articulation/speech therapy. This paper describes the development of a Cantonese speech and tone viewer (CSTV). In the construction of this training tool, Cantonese phonology has been taken into account. This speech training tool allows clients to practise speech production at their own pace and supports their improving the accuracy of their Cantonese tones.

PDF
445--450 Xiaonong Sean Zhu Intrinsic Vowel Duration In Shanghai Open Syllables

Abstract  There is a contradiction between the observed phenomenon that low vowels tend to be longer than high vowels (the Lehiste IVD) and the theoretical prediction that high vowels should be longer than low vowels (the Catford IVD). This paper argues that the Lehiste IVD is environmentally conditioned by a following consonant. The experiments, using the Shanghai dialect, conducted to test the IVD show that the Catford IVD varies with F0 direction: 1) in a CV syllable with a falling tone, a high vowel is longer than a low one; and 2) in a CV syllable with a rising tone, high and low vowels have comparable duration.

PDF
451--456 Lynda Penny The Effect Of Vocal Disguise On Some Vowel Formant Frequencies

Abstract  The effect of assuming a vocal disguise on the formant frequencies of some vowels is examined. The material used was that of an authentic vocal line-up in which the identity of the speaker was in question. The subjects assumed commonly adopted vocal disguises.

PDF
455--460 B. Watson On-Line Speaker Adaptation For HMM-Based Speech Recognisers

Abstract  An investigation of a gradient-descent based training technique was performed for the on-line adaptation of hidden Markov models to new speakers in a speech recognition system. It was found to be successful for supervised speaker adaptation, improving the recognition performance on a 46 word task (alphabet, digits and control words) from 88.0% to 93.2% after adaptation with nine repetitions of each word. Unsupervised adaptation on the same task was unsuccessful. However, for an easier 20 word vocabulary, unsupervised adaptation improved the recognition performance from 97.7% to 99.0%.

PDF
461--466 Ho-Young Lee The Prosodic Typology Of Kyungsang Korean

Abstract  The typological characteristics of stress languages, tone languages, pitch-accent languages, and stress-pitch languages are investigated. Based on the discussion of prosodic typology, the prosodic type of Kyungsang Korean is examined. It is argued that Kyungsang Korean can be treated both as a tone language with tone neutralization and tone modulation processes and as a pitch-accent language with two marked tones.

PDF
467--472 Katsumasa Shimizu Listening Characteristics Of Japanese Learners Of English

Abstract  This paper describes listening characteristics of Japanese learners of English in three areas of comprehension: a synthetic /r/-/l/ continuum, internal open juncture, and syntactically ambiguous sentences. Intermediate ESL learners have a different mode of listening for the /r/-/l/ continuum from native speakers of English, have difficulty in identifying subtle allophonic differences, and show a difference in identifying between sentences of surface- and deep-structure ambiguities. Implications of these observations are discussed in terms of perceptual strategies.

PDF
473--478 J. Leis, M. Phythian and S. Sridharan Automatic Speaker Recognition Using MSVQ-Coded Speech

Abstract  Low bitrate speech coding finds application in both telecommunications (bandwidth compression) and archival (filespace compression). Speaker verification is used in telecommunication applications (to gain access to particular services, for example) and implies that either or both of the speech data streams (incoming and reference) may be compressed. In this paper, we investigate the effect of high compression methods on the effectiveness of automatic speaker identification and verification. Lossy compression of the speech (whether transmitted or stored) requires vector quantization of the short-term spectral parameters in order to achieve high compression ratios, and thus implies some loss of accuracy in the representation of these parameters. However, in the situation where the same spectral parameters are utilized in identifying the speaker, the identification accuracy may be compromised by the compression process. We present in this paper our findings on the effect of compression on identification, for one particular family of vector quantization methods.

PDF
479--484 Jason Chong and Roberto Togneri Speaker Independent Recognition Of Small Vocabulary

Abstract  This paper reports on the implementation of a real-time speaker independent isolated word speech recognition program on a PC Windows platform. The overall structure of the recognition engine is based on the Dynamic Time Warping (DTW) paradigm for computational efficiency. Furthermore, to decrease the recognition time and increase the recognition accuracy, the dictionary is limited to under 15 words. This severely restricts the vocabulary. To overcome this restriction, a new technique is introduced. Many dictionaries are linked in a hierarchical structure, and each word in each dictionary activates a new dictionary related to that word. This represents a basic form of language modelling which is suited for the menu-driven interface found in many of today's applications. The results show that reasonable performance can be achieved by these methods.

PDF
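The hierarchical-dictionary idea in the abstract above can be pictured with a small sketch; the menu names and structure here are invented for illustration and are not from the paper. Each recognised word activates a small child dictionary, so the active vocabulary stays under the ~15-word limit at every step.

```python
# Hypothetical menu hierarchy: each key is a dictionary name, each value
# the word list active once that dictionary is selected.
menus = {
    "root": ["file", "edit", "help"],
    "file": ["open", "save", "close"],
    "edit": ["copy", "paste", "undo"],
}

def next_vocabulary(recognised_word):
    """Return the word list activated by the recognised word,
    falling back to the root menu for terminal words."""
    return menus.get(recognised_word, menus["root"])

print(next_vocabulary("file"))  # ['open', 'save', 'close']
```

The recogniser would run its DTW match only against the currently active list, which is what keeps both recognition time and confusability low.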
485--490 K.S. Ananthakrishnan, Callan Hanley, John Asenstorfer, Bill Cowley, Bill Edwards Considerations In The Realisation Of A Text-To-Speech Synthesis System For Pitjantjatjara Language

Abstract  Pitjantjatjara is one of the most widely used of the Western Desert Australian Aboriginal languages. We intend to exploit information technology to facilitate flexible learning of this language by students, both children and adults. Another objective is to preserve the heritage of Aboriginal languages for future generations. To achieve these goals, a project has been undertaken to study the feasibility of realising a text-to-speech synthesis system for the Pitjantjatjara language, where the ultimate aim is to develop a user-friendly system to serve the Australian community at large. This paper reports the preliminary results obtained from a speech generation module currently under development at the University of South Australia and projects the future directions of research in this application area. To our knowledge, this project is the first attempt to develop a speech synthesis system for the Pitjantjatjara language in Australia.

PDF

Session 16 Second Language Linguistics

Pages Authors Title and Abstract PDF
491--496 John Ingram Perception Of Tensity And Aspiration In Synthesised Korean Stop Consonants

Abstract  An experiment with the synthesis of Korean Tense, Lax, and Aspirated stops is reported. Listeners' responses are evaluated with identification judgements and prototypicality (goodness) measures. The status of Aspirated stops on the Tense-Lax continuum is evaluated.

PDF
497--502 Duncan Markham Similarity And Newness - Workable Concepts In Describing Phonetic Categorisation?

Abstract  Experimental and impressionistic data from learners of the sounds of Japanese and English are presented and discussed with regard to perceptual processing and a model of speech sound learning (Flege's SLM). It is argued that the perceptual classifications 'similarity' and 'newness' as proposed by Flege are not workable, and some alternative characteristics and criteria used in sound analysis by learners are posited.

PDF
503--508 Denis Burnham, Sheila Keane Where Does Auditory-Visual Speech Integration Occur? Japanese Speakers' Perception Of The McGurk Effect As A Function Of Vowel Environment

Abstract  In the McGurk effect, when auditory [b] is dubbed onto visual [g] it is perceived as [d] or [ð]. The occurrence of this in undegraded conditions shows that human speech perceivers use visual information whenever it is available. This study uses phonetic and phonological tools to ascertain the processing level at which auditory-visual speech integration occurs. The phonetic tool is the fact that the relative frequency of [d] and [ð] fusion responses changes with vowel environment: for auditory [ba] + visual [ga], English speakers report [ða] more often than [da], while for auditory [bi] + visual [gi], more [di] than [ði] responses occur. The phonological tool rests on differences in phonology: while Japanese phonology contains [b], [g], and [d], it does not contain [ð]. English speakers and Japanese speakers at three levels of English proficiency (Beginner, Intermediate, Advanced) were tested on the [b] + [g] McGurk effect in [a] and [i] vowel environments. If the integration of auditory and visual speech components occurs at a phonetic level, then Japanese speakers should show appropriate shifts in the frequency of [d] vs [ð] responses in the [a] vs [i] vowel conditions, despite the phonological irrelevance of [ð] in Japanese. This is indeed what occurred: despite Japanese subjects' phonological bias towards perceiving [d] rather than [ð] in both impoverished (auditory-only and visual-only) and unimpoverished (auditory-visual) conditions, all subject groups showed a similar change in the frequency of [d] and [ð] responses over the [a] vs [i] vowel conditions. The results are discussed in relation to the role of phonetic and phonological factors in auditory-visual speech integration.

PDF
509--514 K.P.H. Sullivan and Y.N. Karst Perception Of English Accent By Native British English Speakers And Swedish Learners Of English

Abstract  Much research has been conducted into the perception of dialect variation in the country in which the language is spoken. This study extends this line of investigation by asking how well learners of English in a non-English-speaking environment can perceive accent variation. The Swedish learner of English is exposed daily to many varieties of English in the media, and it is unclear how much attention the learner pays to the accent when listening to a film or television program in a foreign language. This investigation compares the ability of native British speakers to perceive variation in six world Englishes with that of the Swedish learner of English. On an accent identification task no significant difference was found; however, on a discrimination task the Swedish learners fared less well than the British English listeners.

PDF
515--520 C Tsurutani and J. Ingram Prosodic Template In Word Blending: A Comparison Between Native Japanese And English Learners Of Japanese

Abstract  Word blending is used to observe the word segmentation of native speakers and learners of Japanese. The results highlight prosodic differences between Japanese and English word-level prosodic templates.

PDF

Session 17 Signal Processing

Pages Authors Title and Abstract PDF
521--526 Peter Veprek and Michael S. Scordilis Enhanced Speech Classification And Pitch Detection

Abstract  Speech classification into voiced, unvoiced or silent portions is important in many speech processing applications. In addition, segmentation of voiced speech into individual pitch epochs is necessary in several high-quality speech synthesis and coding techniques. In this paper, two different pitch detection methods are evaluated and a set of enhancements is presented which substantially improves their performance.

PDF
527--532 David R.L. Davies and J. Bruce Millar Evaluation Of A Computationally Efficient Method For Generating A Voiced-Source Synchronised Timing Signal

Abstract  This paper evaluates the performance of a system that comprises a low-pass filter and a feature-detecting post-processor to generate a voice-source synchronised timing signal. The evaluation of the output signal is described in terms of its phase relationship to the electro-glottograph signal. The paper discusses the choice of reference phase within the electro-glottograph signal, the degree of synchronisation achieved for data containing a wide range of vowel qualities and excitation fundamentals, and the control of phase in dynamically iterative architectures. We have chosen a simple but extensible architecture and have discussed its failure modes over a wide range of signal conditions.

PDF
533--538 L. Candille, M. George, A. Soquet and H. Meloni Control Of A Vocal Tract Model Based On Articulatory Measurements And Acoustic Optimization

Abstract  The control of Maeda's articulatory model is realized using two different methods. One is based on articulatory measurements; the other consists of an acoustic optimization. V1V2 sequences are studied with both methods and some preliminary results are compared.

PDF
539--544 D. Cole, M. Moody and S. Sridharan Alternative Methods For Reverberant Speech Enhancement

Abstract  A novel method for enhancement of reverberant speech is presented. The technique, which is based on the well known spectral subtraction enhancement method, overcomes the positional sensitivity which makes the conventional estimate/invert method impractical.

PDF
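The abstract above builds on the conventional spectral subtraction method; the sketch below shows that baseline only, not the authors' novel variant. The noise-magnitude estimate and the spectral-floor factor are assumed values chosen for illustration.

```python
import numpy as np

def spectral_subtract(frame, noise_mag, floor=0.01):
    """Classic magnitude spectral subtraction on a single frame:
    subtract an estimated noise magnitude spectrum, keep the phase."""
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    # Floor the result to avoid negative magnitudes ("musical noise" control).
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))

# Example: a 440 Hz tone plus white noise; the noise magnitude is
# estimated from a noise-only stretch of the same process.
rng = np.random.default_rng(0)
t = np.arange(256) / 8000.0
noisy = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(256)
noise_mag = np.abs(np.fft.rfft(0.1 * rng.standard_normal(256)))
enhanced = spectral_subtract(noisy, noise_mag)
```

For reverberation, the "noise" term would instead model the late-reverberant energy, which is where the positional sensitivity discussed in the abstract arises.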
545--550 Peter Veprek and Michael S. Scordilis (The University of Melbourne, Australia) A Constrained DTW-Based Procedure For Speech Segmentation

Abstract  Reconfiguring a speech synthesiser to a new voice requires a substantial amount of effort. As a result, current synthesisers offer only a very limited number of voices. Methods for automating this process will greatly expand the utility of speech synthesis. This paper presents the development of an enhanced method for the automatic segmentation of speech into phonemes, particularly suited for concatenative speech synthesis. Its effectiveness is tested in an analysis/resynthesis procedure, and in subsequent perceptual evaluation of typical sentences selected from a large speaker population. Results indicate that this technique can be successfully used for the segmentation of speech for synthesis applications.

PDF
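The DTW paradigm underlying the segmentation procedure above can be sketched in textbook form; this is the unconstrained distance computation only, not the authors' constrained variant, and the scalar feature sequences are invented for illustration.

```python
import numpy as np

def dtw(a, b):
    """Textbook DTW distance between two feature sequences via
    dynamic programming over a local-distance matrix."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # local distance
            # Best predecessor: insertion, deletion, or match step.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the sequences align exactly
```

In a segmentation setting, backtracking through `D` aligns a reference transcription's frames to a new utterance, and segment boundaries are read off the warping path.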
551--556 Richard Katsch, Phillip Dermody, John Seymour, Loredana Cerrato Objective Identification Of Speech Presented In Noise

Abstract  The initial investigation of an objective measure of speech processing in noise is presented that uses models of auditory processing as the speech analysis stage and a simple distance measure classifier to provide identification scores for speech presented in noise.

PDF

Session 18 Speech Physiology

Pages Authors Title and Abstract PDF
555--560 W. Hardcastle, B. Vaxelaire, F. Gibbon, P. Hoole and N. Nguyen Tongue Kinematics In /Kl/ Clusters And Singleton /K/: A Combined EMA/EPG Study

Abstract  Results of a combined EPG/EMA study show different movement trajectories of the tongue body for /k/ in cluster and singleton environments. A "looping" trajectory observed during singleton /k/ was abruptly halted during cluster production and seen as a downward vertical movement of the tongue body coinciding with raising of the tongue tip/blade for the /l/. The results point to the advantages of combining EMA with EPG data for investigating lingual dynamics.

PDF
561--566 Anders Lofqvist Control Of Oral Closure And Release In Bilabial Stop Consonants

Abstract  This paper examines the control of bilabial closure and release in stop consonants. Recordings of lip kinematics were made in five subjects using an electromagnetic transduction technique. Results suggest that the lips are moving at a high velocity at the instant of oral closure. During the closure, the lip tissues are compressed and the lower lip may push the upper lip upwards. The results are compatible with the hypothesis that one target in the production of labial stop consonants is a region of negative lip aperture.

PDF
567--572 Peter J. Alfonso Long-Term Spatiotemporal Stability Of Lip-Jaw Synergies For Bilabial Closure

Abstract  Assumptions made about invariant control schemes are gleaned from data collected at a single point in time, that is, data collected from a single session. Thus, we know relatively little about the magnitude of the day-to-day articulatory variability that underlies an invariant percept. The results of the experiment reported here support the notion that predictable long-term variability in speech motor output reflects the inherent complexity of the motor system in meeting the time-varying demands of rapid conversational speech.

PDF
577--582 Janet Fletcher, Mary E. Beckman, and Jonathan Harrington Accentual-Prominence-Enhancing Strategies In Australian English

Abstract  Previous studies have documented at least two supralaryngeal strategies that talkers use to highlight accentual prominence on a word. They can lower the jaw more in the stressed vowel, producing a more open vocal tract. They can also manipulate lingual articulation to accentuate the contrastive features of accented syllables. This study examines the acoustic consequences of such supralaryngeal correlates of accenting. Tongue-dorsum and jaw movement were recorded for three female speakers reciting a dialogue designed to elicit different accent placements around words containing high and low vowels. The results showed multiple articulatory strategies that varied across talkers and consonantal context. For example, two of the three speakers lowered the jaw more in accented syllables. However, one of these fronted the tongue in /i:/ to compensate for the jaw lowering in all contexts, whereas the other did so only in the velar context. Moreover, this second speaker (but not the first) also raised the tongue dorsum further away from the jaw to make a narrower constriction during the accented /i:/ vowel in /d/ and /b/ contexts. Despite these inter-talker differences, accenting the word resulted in a consistent raising of the frequencies or amplitudes of spectral peaks in the region of the second and third formants. Thus, the result of these articulatory strategies is a perceptual "sharpening" of the /i:/ timbre, suggesting a localized hyperarticulation of the accented high vowel.

PDF

Session 19 Prosody I

Pages Authors Title and Abstract PDF
581--586 Xiaonong Sean Zhu Two Stress Patterns Of Shanghai Compounds

Abstract  Stress in Shanghai is not uniformly left-headed, as suggested in the literature. It is determined by tone categories and is related to duration and F0 profile. This paper suggests that both left- and right-headed stress can exist at the same lexical level in a language.

PDF
587--592 Denis Burnham, Elizabeth Francis, Di Webster, Sudaporn Luksaneeyanawin, Francisco Lacerda, and Chayada Attapaiboon Facilitation Or Attenuation In The Development Of Speech Mode Processing? Tone Perception Over Linguistic Contexts

Abstract  Perceptual discrimination of Thai tones was tested in three contexts: normal speech, filtered speech, and musical sounds, with speakers of Central Thai; speakers of Cantonese, a tonal language of similar complexity; and speakers of English. The inter-sound interval (ISI) was varied: half the subjects in each group were tested with a 500 msec ISI (to encourage phonetic, language-general processing), and half with a 1500 msec ISI (to encourage phonemic, language-specific processing). English speakers discriminated tonal contrasts best in the musical context, and better in filtered than in full speech. Thai speakers, however, discriminated the tonal contrasts equally well in all three contexts, although the manner of processing across contexts differed: reaction times for speech showed a 1500 msec advantage, while for music and filtered speech there was a 500 msec advantage. Cantonese speakers showed some similarities to their fellow tonal language speakers, and some to the non-tonal English speakers. Preliminary results for a group of native speakers of Swedish, a pitch-accented language with two tonal variants, suggest that they respond similarly to the Cantonese subjects. In Experiment 2, Thai-speaking and English-speaking children of 5, 6, and 8 years were tested in the three contexts. Their results essentially mirrored those of their adult counterparts. Thus it appears that a special mode of speech processing becomes established relatively early in life.

PDF
593--598 Phil Rose Aerodynamic Involvement In Intrinsic F0 Perturbations - Evidence From Thai-Phake

Abstract  Mean fundamental frequency and airflow data are presented for 17 acoustic allotones of the six tonemes of Thai Phake on syllables with [k] and [x] initial consonants. It is argued that the observed F0 perturbations at syllable onset are caused by aerodynamic factors associated with the difference in initial consonants. Historical tonological implications are briefly discussed.

PDF

Session 20 Prosody II

Pages Authors Title and Abstract PDF
599--604 Anne Cutler and Takashi Otake The Processing Of Word Prosody In Japanese

Abstract  The prosodic structure of Japanese polysyllabic words is defined by patterns of high and low pitch accents. The present study investigated whether the accent level of a single syllable extracted from its word context can be reliably identified by listeners. 96 tokens of the same CV sequence, extracted from the utterances of 32 words by three speakers, were presented to 24 listeners; their correct identification rates were high. Scores were higher for word-initial than for word-final syllables, and acoustic correlates of accent level were stronger in word-initial syllables, which is consistent with a role for pitch accent information in lexical access in Japanese.

PDF
605--610 Phil Rose The Realisation Of Stopped-Syllable Tones In Hua Sai And Pakphanang

Abstract  This paper examines the relationship between the acoustics of stopped and unstopped tones in two Southern Thai dialects with a high number of contour tone contrasts. It is shown that the realisation of stopped tones does not conform to a simple 'truncation' model.

PDF
611--616 Janet Fletcher and Jonathan Harrington Timing Of Intonational Events In Australian English

Abstract  There are two competing views of pitch accent timing in English. One suggests pitch accents are timed as a proportion of syllable rhyme duration. The other suggests that pitch accent timing is better modelled as an absolute time delay from the onset of the syllable. Two corpora were analysed to test which model best fits the timing of prenuclear and nuclear pitch accents in Australian English. Results suggest that syllable onset as well as rhyme duration may play an important role in determining pitch accent timing.

PDF

Session 21 Features Analysis II

Pages Authors Title and Abstract PDF
617--622 Stefan Slomka, Peter Barger, Pierre Castellano and Sridha Sridharan Gender Gates In Degraded Environments

Abstract  The present paper extends the investigation of gender gates proposed in (Barger et al., 1996) to speech degraded by coding and, separately, by room reverberation. Coded speech did not degrade gate accuracies relative to the uncoded case in (Barger et al., 1996). Reverberation slightly degraded gate accuracies, although this was only weakly dependent on reverberation time.

PDF
623--628 Marija Tabain and Catherine Watson Classification Of Fricatives

Abstract  The purpose of this study is to explore voiceless fricative consonants in Australian English. In particular, attempts are made to classify the dental and the labio-dental fricatives ([T] and [f] respectively), using pre-emphasised averaged spectra of fricative tokens sampled at 44.1 kHz. Results suggest that the techniques used help to correctly classify the two non-sibilant fricatives, although the results are not as good as those for the other two fricatives, the alveolar [s] and the alveolopalatal [S].

PDF
629--634 K. Chong and R. Togneri Extraction Of A Speech Signal In The Presence Of A Musical Note Signal

Abstract  This paper presents the methodology of extracting a speech signal in the presence of a musical note signal using the GRNN (General Regression Neural Network). An overview of the GRNN is presented first, followed by preliminary simulations. Results of extracting speech in the presence of a flute and a cello note are also presented.

PDF
635--640 Daniel Woo, Phillip Dermody Simulation Of Human Incremental Speech Gating Performance Using Time Frequency Analysis And A Simple Classifier

Abstract  The incremental speech gating task is described as a task showing that human listeners can process short-duration portions of speech signals to achieve speech sound identification. The results of a group of human listeners on the task are presented, and an artificial system using a time-frequency spectral analysis and a simple classifier is used to determine its identification performance on the same task. The use of 1 msec analysis frames and an inefficient probability summation method produces a reasonable match to the human performance-duration function in the speech gating task.

PDF
640--645 Ira Gerson, Orhan Karaali, Gerald Corrigan, and Noel Massey Neural Network Speech Synthesis

Abstract  Text-to-speech conversion has traditionally been performed either by concatenating short samples of speech or by using rule-based systems to convert a phonetic representation of speech into an acoustic representation, which is then converted into speech. Concatenative systems can require large amounts of storage, while speech from synthesis-by-rule systems may not sound natural. A time-delay neural network system is described which produces natural-sounding speech while requiring less storage than concatenative systems.

PDF
645--650 W.N Farrell and W.G. Cowley Maximum A Posteriori Decoding For Speech Codec Parameters

Abstract  This paper looks at applying MAP decoding to low-rate speech codec parameters as a means of protection at low channel SNRs. It has been shown that MAP decoding works well in protecting LSPs, but the method has not been applied to other parameters. By using theoretical source data, results are obtained that compare MAP decoding with other, more conventional techniques.

PDF
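The MAP decoding principle in the abstract above amounts to weighting the channel likelihood by the source statistics of the codec parameter. The toy below illustrates that decision rule for a single quantised parameter; all probability values are invented and no real codec or channel model is implied.

```python
# Toy MAP decision for one quantised codec parameter index.
priors = [0.7, 0.2, 0.1]        # assumed source statistics of each index
likelihood = [0.2, 0.5, 0.3]    # assumed P(received bits | sent index)

# Posterior is proportional to prior * likelihood; pick the argmax.
posterior = [p * l for p, l in zip(priors, likelihood)]
decoded = max(range(len(posterior)), key=posterior.__getitem__)
print(decoded)  # index 0: the strong prior outweighs the channel likelihood
```

A maximum-likelihood decoder would instead pick index 1 here; the difference is exactly the source-statistics weighting that gives MAP decoding its protection at low SNR.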