
Proceedings of SST 2000

Page numbers refer to nominal page numbers assigned to each paper for purposes of citation.

Dialogue Systems

Pages Authors Title and Abstract PDF
32--37 Shinya Kiriyama, Keikichi Hirose, Nobuaki Minematsu Development And Evaluation Of A Spoken Dialogue System For Academic Document Retrieval With A Focus On Reply Generation

Abstract  A spoken dialogue system has been developed for academic document retrieval. It generates speech replies with their important words emphasized by controlling prosodic features, viz., prosodic focusing. The important words are determined by referring to the dialogue flow. The validity of the prosodic focusing was demonstrated through a system evaluation.

PDF
38--42 Xian Fang Wang, Li Min Du The Design Of A Chinese Spoken Dialogue System Engine

Abstract  In this paper, we propose a design principle for a Chinese Spoken Dialogue System Engine. The principle classifies the actions of a dialogue system into two types: interactive actions and task transaction actions. The purpose of the interactive actions is to establish the user's intention and the information items necessary for task transaction, while the task transaction actions provide data and services according to the user's requirements. Under this principle, a dialogue system is divided into two parts: the Dialogue Manager (DM) and the Task Transact (TT). This principle makes it easy to port a dialogue system to another domain.

PDF

Language Acquisition

Pages Authors Title and Abstract PDF
44--49 Peter E Czigler, Jan van Doorn, Kirk P.H. Sullivan An Acoustic Study Of The Development Of Word-Initial /Sp/ Consonant Clusters In The Speech Of A Swedish Child Aged 1:11 - 2:5 Years

Abstract  A pilot six-month longitudinal study of the development of the word-initial /s/+plosive cluster was conducted. The experimental participant, a Swedish female child, was 1:11 years of age at the time of the first recording. Homophones were expected for a single word-initial plosive, a word-initial /s/ and a word-initial /s/+plosive cluster. Durational measurements of the plosive and the fricative were made. The observations suggest that the child distinguished between singletons and reduced clusters from 2:1 years of age. The single plosives are produced with aspiration, while the plosives consistently substituted by the child for consonant clusters are unaspirated. Further, at the age of 1:11 the fricative sound was very short; then, after an unstable period of duration exaggeration, the productions stabilized. This paper has demonstrated that, provided the speech material used is well constructed and can be compared with adult data, duration measurements are an appropriate methodology for following a child's phonological and phonetic development.

PDF
50--55 Chong-Woon Kim and John C. L. Ingram Perception Of Korean Back Vowels By Australian English And Japanese Speaking Adult Learners

Abstract  The results of this study indicate that the nature of perceptual mapping between target L2 vowels and their nearest matching L1 counterparts is crucially related to the ways in which L2 sounds are accommodated by the learners with different L1 backgrounds at different stages of L2 learning.

PDF
56--61 Kimiko Tsukada Some Acoustic Characteristics Of Australian English /Ai/ And Japanese /Ai/ In Native And Non-Native Speech Production

Abstract  An acoustic comparison was made between Australian English /ai/ and Japanese /ai/ produced by native and non-native talkers. Vowel duration and formant frequencies (F1, F2) at the two targets were measured with a view to characterizing cross-linguistic similarities and differences and to examining the non-native production of these sounds in comparison with that of native talkers. There was an interesting finding that, for spectral characteristics, Australian learners of Japanese approximated to the phonetic norms of Japanese to a greater extent than Japanese learners of English did to the Australian English norms.

PDF
62--67 Chiharu Tsurutani and John Ingram Perception Of Mora Timing By English Learners Of Japanese

Abstract  English learners of Japanese at two levels of proficiency in L2 were tested for their ability to perceive moraic contrasts in Japanese vowels (long vs short) and stop consonants (single vs geminate) under listening conditions where the speaker and the rate of speech varied unpredictably on a short carrier phrase (Sore wa ___ desu.). The aim was to assess the acquisition of moraic timing by second language learners and the strategies they use for temporal normalization in the face of speech rate and speaker variation. Learners' performance was compared with that of native speakers of Japanese. Results indicated that advanced learners were able to normalize speech rate, while beginners failed to do so and perceived moraic contrasts reliably only in slow speech. However, perception of moraic timing contrasts of consonants in slow speech was found to be difficult even for advanced learners.

PDF

Language Identification

Pages Authors Title and Abstract PDF
70--75 Javad Sheikhzadegan, Mahmood Reza Roohani Automatic Spoken Language Identification Based On Ann Using Fundamental Frequency And Relative Changes In Spectrum

Abstract  An automatic spoken language identification system based on an artificial neural network (ANN) is described in this paper. Two different sets of statistical parameters, namely prosodic and segmental features, extracted from the fundamental frequency (F0) contour and the frequency spectrum, were used for language classification. The feature extraction procedure is: 1) F0 contour extraction, 2) polygonal-line approximation of the F0 contour, 3) determination of the frequency spectrum, 4) calculation of the relative energy changes of distinctive frequency bands, 5) extraction of statistical parameters of the F0 contour (prosodic parameters) and of the relative band energies (segmental parameters). The F0 contour was extracted by a modified version of the cepstrum pitch-extraction method, and the frequency spectrum was determined by the short-time fast Fourier transform. A multi-layer perceptron (MLP) neural network was used for classification. Training and testing were performed using a multi-language speech database generated at RCISP. The identification rate for six languages is greater than 97% in closed tests and about 75% in open tests.

PDF
74--77 J.P. Willmore, R.C. Price and W.J.J. Roberts Comparing Gaussian Mixture And Neural Network Modelling Approaches To Automatic Language Identification Of Speech

Abstract  In this paper we compare the performance of two well-known approaches to automatic Language Identification: Gaussian Mixture Modelling and Neural Network modelling. The systems were evaluated with the Oregon Graduate Institute Multi-Language Telephone Speech Corpus. In a comparison of the two systems using identical training and testing data, similar performance was obtained.

PDF
78--83 E. Wong, J. Pelecanos, S. Myers and S. Sridharan Language Identification Using Efficient Gaussian Mixture Model Analysis

Abstract  Automatic Language Identification (LID) is the automated process of identifying the language of a speech utterance. In this paper, we will describe a language identification system that utilises Mel-Frequency Cepstral Coefficients (MFCCs) and Gaussian mixture models (GMMs) to model the short-term characteristics of a language. We also compare this standard GMM language model to the models that are adapted from a universal, language-independent background model (UBM). Experiments show that model adaptation gave comparable performance. In addition, a computation speed-up approach was tested on the adapted language models. The accuracy of the system remained comparable while the computation time was reduced significantly.

PDF
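The GMM likelihood scoring at the heart of the language-identification entries above can be sketched in a few lines of numpy. This is a generic illustration under assumed pre-trained parameters (the function names and toy models are ours, not from any of the papers): a sequence of MFCC frames is scored against one diagonal-covariance GMM per language, and the best-scoring language is returned.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames [T, D] under a
    diagonal-covariance GMM with M components (weights [M],
    means [M, D], variances [M, D])."""
    diff = frames[:, None, :] - means[None, :, :]            # [T, M, D]
    exponent = -0.5 * np.sum(diff ** 2 / variances, axis=2)  # [T, M]
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    log_comp = np.log(weights) + log_norm + exponent         # [T, M]
    # log-sum-exp over mixture components for numerical stability
    m = log_comp.max(axis=1, keepdims=True)
    ll = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return ll.mean()

def identify_language(frames, models):
    """Pick the language whose GMM assigns the highest likelihood.
    `models` maps language -> (weights, means, variances)."""
    scores = {lang: gmm_log_likelihood(frames, *p) for lang, p in models.items()}
    return max(scores, key=scores.get)
```

UBM-adapted models of the kind compared by Wong et al. share component alignments with the background model, which is also what makes top-component pruning speed-ups of the sort the abstract mentions possible.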

Multimodal Speech

Pages Authors Title and Abstract PDF
86--91 Denis Burnham, Valter Ciocca, Cheryl Lauw, Susanna Perception Of Visual Information For Cantonese Tones

Abstract  It is often assumed that there is little if any visual speech information for lexical tone. However, the presence or absence of visual information for tone is yet to be tested empirically. In this study Cantonese speakers are asked to identify spoken words as one of six words differing only in tone. Words are presented in three different modes: auditory-visual (AV), auditory only (AO), and visual only (VO). It is found that performance is equivalent in AO and AV conditions, i.e., that there is no augmentation of auditory tone perception when visual information is added. Performance in the VO condition is considerably worse, but under some circumstances it is significantly greater than chance and, interestingly, distinctly different in nature to performance in the auditory conditions. In particular, visual information for tone is evident in the performance of perceivers without phonetic training but not in those with phonetic training; for tone carried on monophthongs but not on diphthongs; in running speech but not in citation form; and for contour tones but not for level tones. As there is no augmentation for AV over AO, these results may have little practical implication for hearing in good listening conditions. Nevertheless, as visual information alone raises performance above chance, the results may have implications for hearing-impaired Cantonese language users, or for situations in which auditory input is degraded.

PDF
92--97 Roland Goecke, Quynh Nho Tran, J Bruce Millar, Alexander Zelinsky and Jordi Robert-Ribes Validation Of An Automatic Lip-Tracking Algorithm And Design Of A Database For Audio-Video Speech Processing

Abstract  We have recently proposed a new algorithm for the automatic extraction of lip feature points. Based on their positions, parameters describing the shape of the mouth are derived. Since the algorithm is based on a stereo vision face tracking system, all measurements are in real-world distances. In this paper, we evaluate the accuracy of the automatic feature extraction algorithm by comparing its results with a manual feature extraction process. The results show an average error of about 1-2mm for the internal mouth width and height. In the second part of the paper, we present the design of an AV speech database for Australian English for future experiments on the correlation of audio and video speech signals.

PDF
98--103 Simon Lucey, Sridha Sridharan and Vinod Chandran An Improvement Of Automatic Speech Reading Using An Intensity To Contour Stochastic Transformation

Abstract  The extraction of lip contour features is difficult and computationally expensive. In this paper we explore the desirable alternative of estimating the contour from area features (i.e. the mouth grey-scale image) directly via a non-linear stochastic mapping technique. Results are presented on our own speaker dependent database to demonstrate this method and explain why it performs better than previous techniques.

PDF
104--109 Jacek C. Wojdel and Leon J. M. Rothkrantz Silence Detection And Vowel/Consonant Discrimination In Video Sequences

Abstract  In this paper we present a set of experiments aimed at investigating the feasibility of using artificial neural networks (ANNs) in a lip-reading task. We present a method for data extraction that is applied to video sequences containing the lower half of the face of a speaking subject. The data are then used to evaluate the performance of ANNs in classifying the frames in the video stream into three possible classes: vowel, consonant or silence.

PDF

Phonetics - Acoustic

Pages Authors Title and Abstract PDF
112--117 Michael Barlow and Frantz Clermont A Parametric Model Of Australian English Vowels In Formant Space

Abstract  This paper concerns the development of a parametric model for characterising the formant (F1-F2-F3) space of three sociolinguistic varieties (Broad, General and Cultivated) of spoken vowels in Australian English. The vowel-formant space is modelled as a quadratic surface, which captures the non-linearity in the F3 dimension and yields a parametric formulation for F3 as a weighted combination of F1 and F2. Differences between the surfaces, together with their application to prediction/classification on a per-speaker basis, provide a holistic quantification of the differences between the varieties of Australian English.

PDF
118--123 Michael Barlow and Frantz Clermont Seeing Is Believing: Beyond A Static 2d-View Of Formant Space For Speech Research And Education

Abstract  The paper describes an online resource developed for the dual purpose of education and research in speech science and technology. The resource consists of three-dimensional (3D) interactive worlds, which are currently based on formant databases and can be used not only to demonstrate familiar phenomena but also to gain new insights from within less constrained spaces. The resource's construction and availability are described, together with examples of some of the phenomena manifest in spoken vowels. Results of employing the resource in an undergraduate course in speech processing suggest that the 3D interactive approach not only is more appealing and natural than two-dimensional (2D), numeric approaches to teaching the same material, but it also enhances learning.

PDF
124--129 Josefina Carrera-Sabat, Ana M. Fernandez-Planas, Josep Matas-Crespo and Alicia Ortega-Escandell Differences In Vowel Quality In Two Catalan Dialects: Data From Mds

Abstract  Phonetic descriptions which concern themselves with dialectal distinctions have different goals: a) general descriptions of the whole linguistic domain and b) more specific detail focusing on the differences between dialects, such as "lleidata" and "barceloni". Following this second line of dialectal contrast, this paper aims at demonstrating acoustic differences of openness between the middle vowels of the anterior series of two Catalan dialects: "barceloni" (Eastern Catalan) and "lleidata" (Western Catalan). The data have been statistically treated by means of MultiDimensional Scaling (MDS), and the configurations obtained allow us to observe significant differences between open and closed vowels in both dialects.

PDF
130--133 Jonathan Harrington, Sallyanne Palethorpe, and Catherine Watson Vowel Change In Received Pronunciation: Evidence From The Queen's English

Abstract  It is well established that accent changes in general originate from younger members of the community whose speech includes more innovative forms of pronunciation. However there is a paucity of studies that have examined experimentally whether a person's vowel space changes with time in the same direction as that of the wider community. In order to examine this further, we analysed nine of the Christmas broadcasts made by Queen Elizabeth II spanning three time periods (the 1950s; the late 1960s/early 70s; the 1980s). An analysis of the monophthongal formant space showed that the first formant frequency was generally higher for open vowels, and lower for mid-high vowels, in the 1960s and 1980s data than in the 1950s data, which we interpret as an expansion of phonetic height. The second formant frequency showed a more modest compression in later, compared with earlier, years.

PDF
134--139 Akiko Onaka and Catherine I. Watson Acoustic Comparison Of Child And Adult Fricatives

Abstract  The study presents acoustic comparisons between child and adult productions of the 9 English fricatives. Fricative tokens are obtained from citation-form words produced by 4 boys and 4 girls (aged 7 to 11) and 5 men and 5 women. The results show that the overall spectral shapes of fricatives produced by the children are similar to those of the adults; however, some significant differences are found in the resonance values and resonance bandwidths. Classification experiments show that in general the children's fricative data performed much more poorly than the adult data. The implications of the results for automatic speech recognition are discussed.

PDF
140--145 Napier Guy Ian Thompson Wuyi Citation Tone Acoustics: Problems For Tonological Representation

Abstract  This paper describes the acoustic characteristics (F0 and duration) of the citation tones of Wuyi, a Southern Wu dialect belonging to the Wuzhou dialect sub-group. Mean acoustic data from one male speaker are presented. It is shown how the results of the analysis pose questions for tonological representation in current tonological theory.

PDF

Phonetics - Forensic

Pages Authors Title and Abstract PDF
148--153 Jennifer Elliott Auditory And F-Pattern Variations In Australian Okay: A Forensic Investigation

Abstract  An understanding of the acoustic properties, as well as the nature of within- and between-speaker variation, of words which occur with high frequency in natural discourse is of great importance in forensic phonetic analyses. One word which occurs with relatively high frequency in natural discourse, including telephone conversations, which are often a source of data in forensic comparisons, is okay. This paper presents the initial findings of a study of auditory and F-pattern variations in okay in a natural telephone conversation spoken by six male speakers of general Australian English. Seven pre-defined sampling points are measured within each token to determine the most efficient sampling points and formants for distinguishing between-speaker variation from within-speaker variation in okay. F-ratios at these seven sampling points are calculated as a mean of ratios of between- to within-speaker variation. The greatest F-ratio is shown to be at voice onset of the second vowel. Forensic implications are discussed.

PDF
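The F-ratio that recurs through these forensic entries is simply the ratio of between-speaker to within-speaker variance of an acoustic parameter: the larger the ratio, the better the parameter separates speakers. A minimal sketch (the function name is ours, and the papers' exact variance estimators may differ):

```python
import numpy as np

def f_ratio(samples):
    """Between-speaker to within-speaker variance ratio for one acoustic
    parameter. `samples` maps speaker -> sequence of measurements."""
    groups = [np.asarray(v, dtype=float) for v in samples.values()]
    grand_mean = np.mean(np.concatenate(groups))
    # between-speaker: spread of speaker means about the grand mean
    between = np.mean([(g.mean() - grand_mean) ** 2 for g in groups])
    # within-speaker: average spread of tokens about each speaker's mean
    within = np.mean([g.var() for g in groups])
    return between / within
```

Computed at each sampling point and formant, as in the okay study, the point with the highest F-ratio is the most useful for telling speakers apart.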
154--159 Jennifer Elliott Comparing The Acoustic Properties Of Normal And Shouted Speech: A Study In Forensic Phonetics.

Abstract  Forensic phoneticians are able to exercise little control over the data they are required to examine and compare. When two speech samples, one from a criminal and one from a suspect, are provided for forensic analysis, it is quite possible that one sample may contain shouted speech, while the other contains normally spoken speech. Analysing these dissimilar speech samples requires an understanding of how the acoustic properties of shouted speech differ from normal speech. This paper reports the findings of a pilot study which investigates the similarities and differences between the acoustic properties of natural speech in both normal and shouted modes. Results of the experiment indicate that F0 and F1 may be significantly higher in shouted speech, but there was no evidence for a significant difference in F-pattern for the other formants.

PDF
160--165 Yuko Kinoshita Effective F2 As A Parameter In Japanese Forensic Speaker Identification

Abstract  The possibility of using effective F2 as a parameter in forensic speaker identification is demonstrated. 11 male Japanese speakers were recorded, and the formants of their 5 accented Japanese short vowels were measured. The potential of effective F2 as a forensic speaker identification parameter was evaluated by F-ratio. The results showed that effective F2 of /e/ produced a considerably higher F-ratio than individual F2 and F3 of the same vowel, although transforming into effective F2 did not improve the F-ratio of the other vowels.

PDF
166--171 Yuko Kinoshita, John Maindonald Statistical Quantification Of Differential Vowel Comparability In Forensic Phonetic Samples

Abstract  In the phonetic comparison of forensic samples, vowels in different words often need to be compared. This paper discusses to what extent vowels embedded in different words are in fact comparable. The experiment was carried out on natural speech data to simulate forensically realistic conditions, and 11 male native Japanese speakers participated as informants. Multi-level analysis of variance was performed on the F2 of three vowels /a/, /i/, and /e/. The experiment shows that, although the phonological environment, namely the nasality of a preceding consonant, affects the F2 of these vowels, the magnitude of the effect can be discounted. What is shown to be important, however, is the identity of the vowel, and comparisons with /a/ in different phonological environments are strongly disfavoured.

PDF
172--177 Phil Rose and Frantz Clermont Comparative Performance Of Cepstrum And Formant Based Analysis On Similar-Sounding Speakers For Forensic Speaker Identification

Abstract  A pilot forensic-phonetic experiment is described which compares the performance of formant- and cepstrally-based analyses on forensically realistic speech: intonationally varying tokens of the word hello said by six demonstrably similar-sounding speakers in recording sessions separated by at least a year. The two approaches are compared with respect to F-ratios and overall discrimination performance utilising a novel band-selective cepstral analysis. It is shown that at the second diphthongal target in hello the cepstrum-based analysis outperforms the formant analysis by about 5%, compared to its 10% superiority for same-session data.

PDF

Phonetics - Linguistic

Pages Authors Title and Abstract PDF
180--185 Helen Fraser Phonetics, Phonology, And The Teaching Of Pronunciation -- A New Cd-Rom For Esl Learners, And Its Rationale

Abstract  As phoneticians, it is relatively easy for us to describe in articulatory or acoustic terms the aspects of a learner's speech which are different from those of a native speaker. It is also becoming more feasible to automate judgement of a learner's utterances in comparison with those of a native speaker, providing instant feedback to the speaker via a computer screen. Questions remain, however. Which of the many differences between a native and non-native pronunciation should be brought to the attention of the learner? Even more importantly, how can those differences be presented to learners so that they can actually make use of the information to change their pronunciation appropriately? For example, though it is obvious that instructing a learner to 'raise the second formant higher when you are 80% through the second vowel' will be of little use, it is far less obvious what metalinguistic descriptions are useful to learners. In this paper, I put forward some principles developed through both theoretical and practical investigation to help ensure effective metalinguistic communication between teachers and learners of pronunciation. I demonstrate a CD-ROM based on these principles which I have recently produced to test experimentally a range of methods of helping ESL learners with pronunciation.

PDF
186--191 D. Gharavian, H. Sheikhzadeh, and S. M. Ahadi An Experimental Multi-Speaker Study On Farsi Phoneme Duration Rules Using Automatic Alignment

Abstract  In this paper we present the results of an experimental study on phoneme duration rules of the Farsi (Persian) language. A multi-speaker speech corpus is used and an automatic alignment algorithm based on CD-HMMs is employed. The results are utilized to examine a few duration rules stated in the literature. After completion of this research, the results could be employed in the frameworks of speech synthesis and speech recognition.

PDF
192--197 Shunichi Ishihara Linguistic-Tonetic Differences In Target-Tone Realisation: Standard Vs. Kagoshima Japanese

Abstract  Normalized F0 data from four Kagoshima Japanese speakers are used to investigate the relationship between the distance of two target tones and their F0 realisations in utterances of HL(n) sequences. The F0 minimum of the HL(n) sequence is shown to vary as a function of the distance of the two target tones. Furthermore, the apparent existence is demonstrated of i) a default contour shape on the basis of which F0 is realised according to the distance of two target tones, and ii) a base-line beyond which F0 does not fall. Implications of the results for Pierrehumbert and Beckman's target-tone model are discussed, and the existence of two linguistically-tonetically different types in terms of target-tone realisation is hypothesised.

PDF
198--202 Phil Rose Hong Kong Cantonese Citation Tone Acoustics: A Linguistic Tonetic Study

Abstract  Mean fundamental frequency and duration data for the six citation tones of Hong Kong Cantonese on unstopped syllables are presented for five male and five female young native speakers. The linguistic-tonetic properties of the tones are specified from mean and standard deviation normalised F0 and duration data. The effectiveness of the normalisation is shown to be much better than for some other Chinese dialects, and it is hypothesised that this is a function of the minimal nature of some of the Cantonese tonal contrasts.

PDF

Prosody

Pages Authors Title and Abstract PDF
206--211 Mariapaola D'Imperio, Jacques Terken, and Michel Pitermann Perceived Tone "Targets" And Pitch Accent Identification In Italian

Abstract  This study investigates the role of temporal alignment, f0 and peak shape in determining perceived tonal target values in Neapolitan Italian. In this variety, the alignment of the accent peak appears to be a strong perceptual cue to the question/statement identification (D'Imperio and House, 1997), everything else being equal. In the present study, the f0 contour of a question, uttered by a female speaker of Neapolitan Italian, was stylized and resynthesized by means of PSOLA. A set of stimuli was created in which either tonal alignment was varied, while f0 height was kept constant, or f0 height was varied orthogonally to alignment. For the alignment manipulation, an additional variable was the shape of the accent peak, which could be either flat (creating a short plateau) or sharp. Thirty Neapolitan subjects listened to the stimuli and identified each as a question or a statement. The results suggest that the contribution of f0 peak height to the question/statement identification is much less important than that of target alignment. Moreover, peak shape affects the perceived alignment of the target tone, in that flat peak stimuli cause the perceived target to be displaced towards the end of the plateau.

PDF
212--217 Petra Hansson Focal Accentuation And Boundary Perception

Abstract  In this paper, an experiment concerning focal accent distribution and phrase boundary perception in South Swedish is discussed. In previous studies on prosodic phrasing in Standard Swedish, no reliable, simple relation between focal accent distribution and boundary perception has been found. However, results from the present perception experiment show that focal accent distribution has an effect on the perceived composition of an utterance in South Swedish. Since the South Swedish focal accent is characterized by a fall (not by a rise as in Standard Swedish), the focal accent gesture can be perceived as a combined focal accent gesture and boundary signal.

PDF
218--223 Shunichi Ishihara Continuous Linguistic Tonetic Representation Using Polynomial Residuals

Abstract  Normalised F0 data from the two accent types of Kagoshima Japanese are used to argue for 1) a continuous (polynomial) rather than a discrete-mean representation, and 2) the superiority of parameters derived from polynomial residuals over standard deviation measurements in the modeling of tone. It is claimed that both are preferable for linguistic-phonetics and speech technology.

PDF
224--229 Shunichi Ishihara The Exponential Nature Of F0 Target Tone Interpolation

Abstract  The intonational phenomena of Kagoshima Japanese (KJ) are investigated in this paper. More precisely, this paper has two aims. Ishihara (1998, 2000) reports that KJ's accentual oppositions, Type A and Type B, show a different F0 realisation range at word level. First, I will show that this difference between Type A and Type B at word level is also maintained at phrase level. Secondly, I will show that KJ's HLLLLL(L) sequences, which are represented by the interpolation from a high to a low tone in the Autosegmental-Metrical (AM) model (i.e. Pierrehumbert and Beckman, 1988), exhibit exponential F0 curves, by describing the sequences in question acoustic-phonetically.

PDF
230--235 Phil Rose Wenzhou Dialect Disyllabic Lexical Tone Sandhi With First Syllable Entering Tones

Abstract  An acoustic description and tonological analysis are presented for some tonologically interesting lexical tone sandhi data in a subset of tonal combinations involving the Entering Tone category in the Southern Wu dialect of Wenzhou.

PDF
236--241 Hyung-Soon Yim An Analysis Of Korean Intonation In Declarative And Propositive Sentence Types

Abstract  It has been claimed that morpho-syntactically equivalent declarative and propositive sentence types are not distinguishable intonationally in Korean. This paper investigates whether this is so, by analysing, within the Autosegmental-Metrical model, short declarative and propositive sentences produced by 4 native Seoul Korean speakers. It is shown that the two types are in fact distinguishable by a combination of boundary tones, the duration of the sentence-final syllable, and accentual phrasing.

PDF

Signal Processing & Feature Analysis

Pages Authors Title and Abstract PDF
244--249 M. R. Flax, E. Ambikairajah, W. H. Holmes, J. S. Jin Improved Auditory Masking Models

Abstract  Hybrid masking models are defined and are informally judged to be improved masking models. Moore's method for deriving spreading functions, a previously unused constituent, is closer to real-world observations than its counterparts. Moore's method lowers the computational complexity of the masking model and suggests that people are auditorily more sensitive to high frequencies than previously assumed. The hybrid models are independent of cochlear mapping functions, hence the same model may be used for assessing auditory redundancy in a variety of mammals. The two best masking models are chosen and classed as models which a) preserve auditory quality at the cost of spectral-character discrimination and b) discriminate spectral character at the cost of auditory quality.

PDF
250--255 Hyoung-Gook Kim, Klaus Obermayer, Mathias Bode, Dietmar Ruwisch Real-Time Noise Cancelling Based On Spectral Minimum Detection And Diffusive Gain Factors

Abstract  In this paper we propose an efficient algorithm for one-channel noise reduction in audio signals. One of the main objectives is to find a balanced trade-off between noise reduction and speech distortion in the processed signal. This is accomplished by a system based on spectral minimum detection and diffusive gain factors. Our approach to speech enhancement is capable of distinguishing between speech and noise interference in the microphone signal, even when they are located in the same frequency band.

PDF
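The spectral-minimum idea above can be sketched generically: the noise power in each frequency band is approximated by the minimum of the recent power spectrum in that band (speech is intermittent, so the minimum tracks the noise floor), and a floored Wiener-style gain is applied per band. This is an illustration of the general technique, not the authors' algorithm; their diffusive gain factors are not reproduced, and all names and parameters below are assumptions.

```python
import numpy as np

def spectral_minimum_denoise(power_spec, win=8, floor=0.1):
    """power_spec: [frames, bands] of |X|^2 values. Returns the gained
    spectrum and the per-frame, per-band gains."""
    frames, bands = power_spec.shape
    gains = np.empty_like(power_spec)
    for t in range(frames):
        lo = max(0, t - win + 1)
        # noise estimate: minimum power over the trailing window
        noise = power_spec[lo:t + 1].min(axis=0)
        # a-posteriori SNR estimate, clipped at zero
        snr = np.maximum(power_spec[t] / np.maximum(noise, 1e-12) - 1.0, 0.0)
        # Wiener-style gain with a spectral floor to limit musical noise
        gains[t] = np.maximum(snr / (snr + 1.0), floor)
    return gains * power_spec, gains
```

Bands dominated by stationary noise get the floor gain, while bands where the power rises well above the tracked minimum (i.e. speech) pass nearly unattenuated.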
256--261 HyunSoo Kim and W. Harvey Holmes Nonparametric Peak Feature Extraction And Its Applications To Speech Signals

Abstract  We have developed a novel peak feature algorithm that can be used for endpoint detection, transient extraction and segmentation of speech. It uses binomial probabilities to measure peak characteristics and estimate sets of endpoints to characterize an utterance. In first tests our algorithm appears to outperform existing methods. It can also be used to improve the segmentation capabilities of existing segmentation methods, and can probably benefit most speech recognition approaches. In addition to the utility of the first order peaks, the statistics of the higher order peaks are also effective for feature extraction and segmentation. The second order peaks can also be used to devise an intelligent update procedure for the feature data windows, where the window update rate changes based on the type of speech signal present. The nonparametric peak feature algorithm is flexible, efficient and very robust in noise.

PDF
262--267 A.A. Kovtonyuk, A.Ya. Kalyuzhny, V.Yu. Semenov Adaptive Kalman Filtering Of Speech Signals, Based On A Block Model In The State Space And Vector Quantization Of Autoregressive Features

Abstract  A novel method of adaptive Kalman filtering (KF) of noisy speech is proposed. The method is based on a block model of the autoregressive (AR) signal in the state space (SS). It is shown that such a representation reduces computational expense and filtering error compared with known methods. An essentially new method for estimating AR parameters in the presence of noise is also developed, based on optimal Bayesian estimation and vector quantization.

PDF
268--273 Iain A. McGowan, Darren Moore, Sridha Sridharan Speech Enhancement Using Near-Field Superdirectivity With An Adaptive Sidelobe Canceler And Post-Filter

Abstract  This paper describes a new microphone array technique and investigates its effectiveness for speech enhancement. A system structure consisting of a fixed near-field superdirective beamformer and an adaptive sidelobe canceling path is proposed (NFSD-ASC). The effect of adding a post-filter is also examined. The system is evaluated in terms of speech quality measures in the context of a computer workstation in an office environment. The speaker is located directly in front of the computer monitor at a distance of 60 cm and the array is designed to fit across the top of a standard 17 inch monitor. The experiments show that the array is effective in both decreasing the noise level and the amount of signal distortion when compared with standard near-field superdirectivity and the generalised sidelobe canceler.

PDF
274--279 Seng Kah Phooi, Zhihong Man, H. R. Wu A New Approach In Designing An Adaptive Lattice Predictor For Nonlinear And Nonstationary Speech Signals In Adpcm Using Lyapunov Theory

Abstract  In this paper, we present a computationally efficient adaptive lattice-ladder predictor for adaptive prediction of nonstationary speech signals in ADPCM. An important advantage of the proposed predictor is that it can adaptively predict the signal, and its algorithm requires no a priori knowledge of the time dependence among the input data. The lattice reflection coefficients and the ladder weights are adaptively adjusted by algorithms designed using Lyapunov theory. The proposed scheme possesses distinct advantages in stability and speed of convergence over linear adaptive LMS or RLS lattice predictors in ADPCM. The theoretical derivation of the lattice predictor is further supported by simulation examples for speech signals.

PDF
280--285 CW Thorpe and CI Watson Vowel Identification In Singing At High Pitch

Abstract  We present a new analysis method that represents the vowel space directly by a factorial analysis of the harmonic amplitudes, without requiring explicit identification of formant frequencies. Analyses of vowels sung by male and female singers across their pitch ranges are performed with this method and also by LP formant extraction. The results indicate that even at high pitch, vowels are well separated with our new method, even though the LP analysis produces clusters of formants locked onto harmonic frequencies. This result suggests that vowel identity at high pitch may be conveyed largely by the magnitudes of individual harmonics, and that some of the observations of "vowel modification" and "convergence" in acoustic analyses of high pitch vowels may be artefacts of formant analysis.

PDF

Speaker Recognition

Pages Authors Title and Abstract PDF
288--293 Michael Barlow, Brett Watson, Ah Chung Tsoi & Tom Downs A-Priori Selection Of Cohort Sets For A Speaker Verification System: Issues And Insights

Abstract  The paper describes a series of speaker verification experiments using the well-known cohort normalisation method. Cohort sets are selected a priori, based only on training data, for a database of 42 speakers uttering digits in isolation, recorded over a period of 18 months. Baseline performance is contrasted with a-posteriori (at time of verification) selection of the cohort set. Further, a-priori set selection is examined along a number of axes: text-independent versus text-dependent, similarity between cohort sets for the different utterances, the significance of speaker ordering in a cohort set, together with the issue of length of verification utterance. This analysis is then used to explore and highlight issues regarding similarity and dissimilarity between speakers.

PDF
294--299 Grazyna Demenko, Adam Mickiewicz Analysis Of Suprasegmental Features For Speaker Verification

Abstract  The objective of the paper is the assessment of suprasegmental speech features for text-independent speaker verification systems. The linguistic material adopted for the research included a read story, a read dialogue and spontaneous speech. Fifty speakers without any known speech defects were recorded (3 times, at different time intervals). For each individual, pitch statistics and dynamic suprasegmental parameters were determined. Selected features were tested with discriminant analysis and neural networks. Different combinations of pitch parameters were found to be significant for speaker identification.

PDF
300--305 S. Myers, J. Pelecanos and S. Sridharan Two Speaker Detection By Dual Gaussian Mixture Modelling

Abstract  Presented in this paper is a method for performing two speaker detection utilising a technique of modeling the output scores of an Adapted Gaussian Mixture Model — Universal Background Model (GMM-UBM) system. This method consists of training a two-mixture Gaussian Mixture Model on the output scores of a speaker recognition engine. A baseline system developed for the NIST 2000 Speaker Recognition Evaluation has demonstrated encouraging results, which are presented. The computation time of this system was significantly less when compared to some of the systems submitted for the NIST 2000 competition. Improvements to the baseline system are suggested, and current experiments indicate that two speaker detection systems based on this method will demonstrate very good performance.

PDF
306--311 J. Pelecanos, S. Myers and S. Sridharan Rapid Channel Compensation For One And Two Speaker Detection In The Nist 2000 Speaker Recognition Evaluation

Abstract  This paper proposes a technique for rapidly compensating for channel effects of telephone speech for speaker verification. The method proposed is generic and can be applied to both the one and two speaker detection tasks to avoid re-training the separate systems. The technique has the advantages that it can be performed in real time (except for the small initial buffering), it does not suffer from a relatively long settling time such as certain RASTA processing techniques, and in addition, it is computationally efficient to apply. Results of the application of this technique to the NIST 2000 Speaker Recognition Evaluation are reported.

PDF
312--317 Conrad Sanderson and Kuldip K. Paliwal Training Method Of A Piecewise Linear Classifier For A Multi-Modal Person Verification System

Abstract  In this paper we propose a training method for a Piece-wise Linear (PL) binary classifier used in a multi-modal person verification system. The training criterion used minimizes the false acceptance rate as well as the false rejection rate, leading to a lower Total Error (TE) made by a multi-modal verification system. The performance of the PL classifier is compared with that of a Support Vector Machine (SVM) binary classifier trained using the traditional Minimum Total Misclassification Error (MTME) criterion. The PL classifier consistently outperforms the SVM classifier, with the TE on average 50% lower.

PDF
318--323 Timothy Wark, Sridha Sridharan, Vinod Chandran A Comparison Of Static And Dynamic Classifier Performance For Multi-Modal Speaker Verification

Abstract  This paper compares the performance of two techniques for the fusion of speech and lip information for robust multi-modal speaker verification. The first approach uses static speech and lip information via GMM classifiers in order to make a speaker verification decision, whilst the second approach uses dynamic information via the use of HMM classifiers. Verification experiments are performed on the M2VTS database which show that the dynamic system significantly outperforms the static system over a range of operating conditions.

PDF
324--329 Brett R. Wildermoth and Kuldip K. Paliwal Use Of Voicing And Pitch Information For Speaker Recognition

Abstract  The speech signal can be decomposed into two parts: the source part and the system part. The system part corresponds to the smooth envelope of the power spectrum and is used in the form of cepstral coefficients in almost all the automatic speaker recognition systems reported in the literature. The source part contains information about voicing and pitch. Though this information is very important for human beings to identify a person from his/her voice, it is rarely used for automatic speaker recognition. In this paper, we propose a simple and reliable method to derive acoustic features based on voicing and pitch information and use them for automatic speaker recognition. We evaluate these features for speaker identification using the TIMIT, NTIMIT and IISC databases and demonstrate their effectiveness.

PDF

Speaker Characteristics

Pages Authors Title and Abstract PDF
330--335 Michael Barlow & Michael Wagner Perceptions Of Identity, Gender And Idiolect In Prosodically Altered Speech Using A Composite Model Approach

Abstract  The paper describes a series of perceptual experiments in which the prosodic parameters F0, energy, voicing and timing of utterances were systematically altered and the resulting speech resynthesised and evaluated by a group of listeners. Using the linear-prediction source-filter model of production, intermediate spectral models were composed from the speech of two or more speakers. Listener perceptions of identity, gender and idiolect were correlated with the systematic alterations of the prosodic parameters. Perceptions of identity were found to be correlated with both the static and dynamic properties of all parameters, though listeners employed different cues for different speakers. Perceptions of gender were based solely on the mean value of F0. Perception of idiolect was found only to correlate with utterance duration: shorter-duration utterances were perceived as more cultivated, while longer duration was perceived as broader.

PDF
336--341 Kirk P.H. Sullivan, Donn Bayard, Ann Weatherall, Cindy Gallois, Frank Schlichting & Jeffery Pittam Does Media Exposure To An Accent Impact Upon The Estimation Of The Age Of Speakers?

Abstract  World English accents are beamed via the media into the homes of those living in English-speaking countries. The situation is the same in non-English-speaking countries with a subtitling rather than dubbing policy. The degree to which an individual is exposed to the range of world Englishes thus varies from country to country. Research into the ability of individuals to identify the English accent of a speaker has demonstrated variation which is dependent upon the country in which one lives, and most likely the degree of exposure to non-local world English accents. This paper explores whether exposure to world English accents impacts upon the forensically important question of an individual's ability to assign the perceived age of speakers based on vocal cues alone. 150 English speakers, 50 from New Zealand, 50 from Australia and 50 from the United States of America, responded to a range of questions about two New Zealand, two Australian and two North American voices. Only one of the questions related to subjective age estimation. Statistical analysis showed no significant differences due to country of residence and hence media exposure to non-local world Englishes. A second set of 150 listeners, 50 from Germany, 50 from Finland and 50 from Sweden, undertook the same task. No significant difference was found due to subtitling as opposed to dubbing policy. This study found no impact of media exposure to accent upon the estimation of the age of speakers based on vocal cues alone.

PDF
342--347 Elisabeth Zetterholm The Significance Of Phonetics In Voice Imitation

Abstract  Speech behaviour and the voice show our regional, social and personal identity. Sometimes we imitate other people's speech behaviour with the aim of learning another language or accent, or for entertainment. This study indicates that it is possible to imitate another speaker's voice and speech behaviour with success. The result of a perception test indicates that some features, such as voice quality and pitch register, are more important than others for voice identification.

PDF

Speech Aids & Disorders

Pages Authors Title and Abstract PDF
350--355 Jan van Doorn Does Artificially Increased Speech Rate Help?

Abstract  Modifications to the acoustic properties of natural speech signals have for some time been mooted as ways of studying intelligibility of disordered speech. An earlier study of acoustic characteristics of dysarthric speech in cerebral palsy showed that syllable length was increased, while relative syllable duration was preserved when compared with normal speech. Hence it was proposed that a global increase of speech rate may bring segmental and suprasegmental features closer to normal and thereby improve intelligibility. The intelligibility of the speech from three speakers with cerebral palsy was determined for three conditions — unmodified, increased rate, and increased rate with pauses removed. Using 50 sentences that were part of the Assessment of the Intelligibility of Dysarthric Speech test, intelligibility was judged by 40 listeners who transcribed what they heard. Results showed that the overall mean intelligibility measures (% words correct) were not significantly different in any of the conditions. However, results for individual speakers were interesting, showing trends that may have implications for future work in the ongoing search to understand the relationship between acoustic features of dysarthric speech and its intelligibility.

PDF
356--361 David B. Grayden and Graeme M. Clark The Effect Of Rate Of Stimulation Of The Auditory Nerve On Phoneme Recognition

Abstract  Five patients implanted with the Nucleus CI-24M cochlear implant were tested on consonant and vowel perception with three different average rates of stimulation: 250 pulses/s per channel, 807 pps/ch and 1615 pps/ch. There were no significant differences in phoneme recognition scores when learning effects were taken into account. Information transmission analyses of consonant confusion matrices revealed that, with higher rates of stimulation, manner of articulation features were better perceived but place of articulation features were more poorly perceived. The results and analyses suggest that high rates of stimulation provide improved information about temporal information and frication in speech, but mask the spectral detail required for the perception of place of articulation.

PDF
362--367 Simone Griffin, Linda Wilson, Elizabeth Clark Speech Pathology Applications Of Automatic Speech Recognition Technology

Abstract  Few studies have investigated the benefits and potential difficulties of automatic speech recognition (ASR) applications for people with speech and language impairment. The literature has demonstrated that ASR has the potential to assist individuals with dysarthria and hearing impairment to communicate with computers, but the use of ASR with people with other speech and language disorders is less well documented. This paper examines the potential applications of ASR in the domain of speech pathology, including therapeutic and assessment applications, report writing, and as a mode of alternative and augmentative communication (AAC). It also identifies areas in which further research is required before these potential applications can be realised.

PDF
367--372 Cheolwoo Jo, Daehyun Kim, Moojin Baek, Sugeon Wang On Predicting Patient's Voice After Surgical Operation

Abstract  This paper describes a procedure to predict a patient's voice after a surgical operation. To do this, the voice before and after the operation is collected from the same patient. The collected voice is analyzed to obtain the differences caused by the operation. To measure the change in the acoustical characteristics of the voice, jitter, shimmer and other spectral-domain and time-domain parameters are computed and compared. The results show that the changes are caused not only by vocal fold components but also by the vocal tract. As one method to implement predictive synthesis of the post-operative voice, a residual-excited PSOLA method is applied. The resulting voice is compared to the actual post-operative voice in terms of spectral and perceptual similarity.

PDF
373--378 Lois F.A. Martin, Peter J. Blamey, Christopher J. James, Karyn L. Galvin, & David Macfarlane Adaptive Dynamic Range Optimisation For Hearing Aids

Abstract  ADRO (Adaptive Dynamic Range Optimisation) is a slowly-adapting digital signal processor that controls the output levels of a set of narrow frequency bands so that the levels fall within a specified dynamic range. ADRO is suitable for a variety of applications, including control of a hearing aid. In the case of a hearing aid, the output dynamic range is defined by the threshold of hearing (T) and a comfortable level (C) at each frequency for the individual listener. A set of rules is used to control the output levels, with each rule directly addressing a requirement for a functional hearing aid. For example, the audibility rule specifies that the output level should be greater than a fixed level between T and C at least 70% of the time. The discomfort rule specifies that the output level should be below C at least 90% of the time. In this study, open-set sentence perception scores for 15 listeners were compared for ADRO and a linear hearing aid fit. Speech was presented at three levels. ADRO improved scores by 1.9% at 75 dB SPL (NS), 15.9% at 65 dB SPL (p = 0.014) and 36% at 55 dB SPL (p < 0.001).
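The two example rules quoted in the abstract are simple fractional constraints on the per-band output level and can be checked directly. This is an illustrative sketch only: the function name, interface and the dB values in the test are invented; only the 70%/90% fractions and the T/C bounds come from the abstract.

```python
def check_adro_rules(levels_db, t_db, c_db, target_db,
                     audibility_frac=0.70, comfort_frac=0.90):
    """Check the two example ADRO rules for one frequency band.

    levels_db  -- sequence of observed output levels (dB) over time
    t_db, c_db -- threshold of hearing (T) and comfortable level (C)
    target_db  -- the fixed audibility target, which lies between T and C

    Returns (audibility_ok, discomfort_ok):
    - audibility rule: level exceeds target_db at least 70% of the time
    - discomfort rule: level stays below C at least 90% of the time
    """
    assert t_db < target_db < c_db
    n = len(levels_db)
    audible = sum(l > target_db for l in levels_db) / n >= audibility_frac
    comfortable = sum(l < c_db for l in levels_db) / n >= comfort_frac
    return audible, comfortable
```

An adaptive processor would then nudge the band gain up when the audibility rule fails and down when the discomfort rule fails.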

PDF

Speech Databases

Pages Authors Title and Abstract PDF
428--433 Steve Cassidy, Pauline Welby, Julie McGory, Mary Beckman Testing The Adequacy Of Query Languages Against Annotated Spoken Dialog

Abstract  Large annotated collections of speech data are now common in spoken language research, and a recent focus has been on the development of annotation standards and query languages for these annotations. As part of this process it is important to evaluate the emerging proposals against a range of linguistic annotation practices and in many different domains. This paper presents an example of a richly annotated discourse segment which includes both DAMSL-style discourse-level annotation and ToBI intonational analysis. We describe how this annotation could be realised in the Emu, MATE or Annotation Graph formalisms. In order to evaluate the different query languages, we take a small number of queries and attempt to express them in each query language. We are particularly interested in the naturalness of the query expression in each case. In some cases we find that queries cannot be expressed in the current language. We make a number of suggestions to guide the development of these query languages.

PDF
434--439 J Bruce Millar Prospects For Speech Technology In The Oceania Region

Abstract  The development of speech technology in the Oceania region is an issue for Australian speech scientists and technologists. In this paper we examine the issues that govern the development of speech technology anywhere, the specific opportunities and inhibiting factors of the Oceania region, and the role that Australia, as the largest and most prosperous nation of the region, can have in the process. The necessary scientific resources required to establish both basic and more sophisticated speech technology are reviewed and mapped against the characteristics of the Oceania region. It is concluded that the most productive approach is likely to be one of creative partnership with the many island communities, such that technology may be developed in a cost-effective and culturally sensitive manner.

PDF
440--445 Li Ming, Jochen Junkawitsch, Tiecheng Yu An Incremental Approach To Selection Of Well Balanced Corpus

Abstract  When building a large-vocabulary speech recognition system, we often need appropriate corpora for training, testing and initialization. Designing these corpora manually is very time consuming, so they are usually generated automatically. With common generation methods, however, some phonemes are always poorly balanced. In this paper, we present a novel approach that yields corpora which are well balanced for all phonemes. The approach adopts an incremental strategy, obtaining the whole corpus part by part. We also put forward a new method of evaluating sentences in the data source, by which we can select data more effectively. In our experiments, this approach achieves a significant improvement in the quality of the selected corpus.

PDF

Speech Coding

Pages Authors Title and Abstract PDF
380--385 Beena Ahmed and W. Harvey Holmes Objective And Subjective Performance Measures For Voice Activity Detectors

Abstract  The accurate performance of a Voice Activity Detector (VAD) is critical in several areas, including speech coders and speech recognition systems as well as in mobile telephony. Hence there is a need to comprehensively evaluate the performance of a VAD before integrating it into an application. We first analyze the behaviour of VADs and their possible errors. Two separate measures are proposed which allow the performances of different VADs to be compared. The first measure is objective and uses the cross-correlation between the VAD output and the corresponding true speech/noise classification. The second measure estimates the perceptual effect of VAD errors on the overall speech quality perceived by listeners. Subjective MOS tests of VADs were carried out, and it is shown that the proposed Perceptual Quality Measure (PQM) closely estimates the subjectively evaluated MOS scores.
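The objective measure described above can be illustrated directly: with per-frame binary decisions, the zero-lag normalised cross-correlation between the VAD output and the true speech/noise labelling gives an agreement score. A minimal sketch (the function name and the handling of constant labellings are our assumptions, not the paper's definition):

```python
def vad_agreement(vad_out, truth):
    """Normalised zero-lag cross-correlation between a VAD's binary
    per-frame output and the true speech/non-speech labelling.
    Both inputs are equal-length sequences of 0/1 frame decisions.
    Returns a value in [-1, 1]; 1.0 means perfect agreement."""
    n = len(vad_out)
    mx = sum(vad_out) / n
    my = sum(truth) / n
    # centre both sequences before correlating
    x = [v - mx for v in vad_out]
    y = [v - my for v in truth]
    denom = (sum(v * v for v in x) * sum(v * v for v in y)) ** 0.5
    if denom == 0:
        return 0.0   # degenerate case: one labelling is constant
    return sum(a * b for a, b in zip(x, y)) / denom
```

A perfect detector scores 1.0, an inverted one -1.0, and chance-level output scores near 0.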

PDF
386--391 Chandranath N. Athaudage, Alan B. Bradley and Margaret Lech Efficient Compression Of Melp Spectral Parameters Using Optimized Temporal Decomposition

Abstract  This paper describes an efficient compression algorithm for MELP (Supplee et al., 1997) spectral parameters based upon an optimized temporal decomposition model of speech. Temporal decomposition (TD) is an effective technique for modelling the dynamics of speech parameters and an optimized algorithm for TD has been presented previously (Athaudage, Bradley & Lech, 1999). In this paper we provide an overview of the optimized TD algorithm with its rate-distortion performance. Application of optimized TD for efficient compression of MELP spectral parameters is discussed with TD parameter quantization issues and effective coupling between TD analysis and parameter quantization stages. Simulation results show that over 50% compression can be achieved using 450 ms delay block coding, which is attractive for speech storage related applications.

PDF
392--397 John Dines, Sridha Sridharan and Miles Moody Compression Of Speech For Mass Storage Using Speech Recognition And Text-To-Speech Synthesis

Abstract  In this paper a speech compression algorithm is presented that utilises speech recognition and text-to-speech synthesis technology to code speech at very low bit rates suitable for mass storage applications. The system relies on a word level transcription and carries out a phonetic alignment of the signal using a lexicon of pronunciations. Speech is synthesised at the decoder by concatenating diphone segments from a speaker dependent database. Prosody and energy information is extracted from the original source speech and compressed using a low rate scheme. In order to synthesise speech that is perceived as being produced by the target speaker it is necessary that a speaker transformation scheme be adopted. Two speaker transformation schemes have been tested: a vector quantisation scheme and a direct estimation mapping scheme. Rates of 220 to 400 bps have been achieved using this approach. Informal subjective testing was carried out to compare the synthesised speech of several coding schemes in terms of intelligibility and speaker recognisability.

PDF
398--403 J.R. Epps and W.H. Holmes Wideband Speech Coding At Narrowband Bit Rates

Abstract  The 'muffled' quality of coded speech, which arises from the bandlimiting of speech to 4 kHz, can be reduced either by coding speech up to a wider bandwidth or by wideband enhancement of the narrowband coded speech. This paper investigates the limitations of wideband enhancement and possibilities for its improvement. A new wideband coding scheme is proposed, based on any narrowband coder and wideband enhancement augmented by a few bits per frame of highband information. The scheme thus has a bit rate only slightly greater than that of the narrowband coder. Subjective listening tests show that this scheme can produce wideband speech of significantly higher quality than the narrowband coded speech.

PDF
404--409 Hyoung-Gook Kim, Klaus Obermayer, Mathias Bode and Dietmar Ruwisch A 1.6 Kbps Speech Codec Using Spectral Vector Quantization Of Differential Features

Abstract  In this paper we propose an efficient, low-rate, low-complexity speech compression algorithm based on spectral vector quantization of differential features in the frequency domain. Speech signals can be effectively encoded at medium transmission rates while maintaining high quality in the reconstructed speech. To operate at lower transmission rates with minimal quality drawbacks, we use differential feature-vector coding in the frequency domain. A spectrogram comparison shows that the proposed method provides reasonable quality of synthesized speech.

PDF
410--415 Michael Mason, Sridha Sridharan and Vinod Chandran A Comparison Of Two Hybrid Audio Coding Structures Incorporating Discrete Wavelet Transforms

Abstract  This paper compares the performance of two related audio coding structures. Both structures are hybridisations of parametric and subband transform coding schemes. The first coder exploits linear predictive (LP) analysis to model the spectral shape of the audio signal, and uses the LPC analysis filter to extract the residue. The residue is decomposed into non-uniform subbands using a multiband discrete wavelet transform (MDWT), and these subband coefficients are quantised in accordance with a perceptually determined dynamic bit allocation scheme. The second coder directly models the audio signal using a sinusoidal model, and the residual is the difference between this model and the original signal. The residual is quantised in the same manner as in the first coder. The quality of the decoded audio and the complexity of the coders are compared.

PDF
416--421 C. H. Ritz, I. S. Burnett Split Temporal Decomposition And Quantisation

Abstract  Standard temporal decomposition derives models for the speech spectral parameter vectors without considering the perceptual significance of the vector elements. To overcome this drawback, Split Temporal Decomposition divides the parameter vector (in this case Line Spectral Frequencies (LSFs)) into sub or split vectors for which separate event functions are determined. Hence, multiple event functions are derived for the overall parameter vectors and these are shown to provide a distinct improvement in the modeling of such vectors. We also show that by using a joint event codebook for the sub-vectors, improvements in LSF vector quantisation can be achieved. In particular, this allows the emphasis of perceptually important LSF vector elements within the quantisation scheme.

PDF
421--426 Peter Veprek and Alan B. Bradley Hierarchical Speech Compression For Storage - A Two-Level Approach

Abstract  In this paper, we investigate speech compression for storage. We propose a compression method operating on several levels of the speech signal hierarchy. The method is suitable for off-line compression of large closed corpora with known transcription. A realisation of the method using two levels of hierarchy is presented and evaluated. Results show that the method can provide high quality of reconstructed speech at low data rate for sufficiently large corpora.

PDF

Speech Perception

Pages Authors Title and Abstract PDF
446--451 Dawn Behne, Peter Czigler and Kirk Sullivan Perception Of Swedish Vowel Quantity: Tracing Stages Of Development

Abstract  Swedish adults generally use vowel duration to identify Swedish long and short vowel quantities. However, when the duration of a vowel is relatively long (due, e.g., to inherent duration or vowel lengthening), adult listeners may also make use of vowel spectra to distinguish vowel quantities. In this respect, use of the vowel spectrum in special cases (e.g., when identifying the quantity of inherently long vowels) might be seen as a result of perceptual fine tuning to improve the processing efficiency of identifying vowel quantities. This project investigates how young, developing listeners acquire the perceptual use of vowel duration and vowel spectra for identifying long and short vowel quantities. Of particular interest is whether younger listeners consistently use vowel duration when characterizing quantities, and whether pre-adult listeners are as likely as adult listeners to use spectral cues to identify the vowel quantity of inherently long vowels.

PDF
452--457 Peter J. Blamey, Christopher J. James & Lois F.A. Martin Sound Separation With A Cochlear Implant And A Hearing Aid In Opposite Ears

Abstract  Two experiments were conducted to investigate the perception of speech and noise presented simultaneously to three subjects with impaired hearing in five monaural and binaural conditions. A broadband noise was found to have no effect on speech perception when the two signals were presented to opposite ears. When speech and noise were presented to the same ear(s), speech perception scores on a closed-set test fell from above 95% at high signal-to-noise ratios (SNR) to 71% at an SNR of about -5 dB. When two speech signals were presented simultaneously at equal intensities (0 dB SNR), speech perception scores fell to 75% or lower, regardless of the ear(s) to which the signals were presented. Thus dichotic presentation helped these listeners to separate speech from a broadband noise, but not to separate two simultaneous speech signals produced by different speakers.

PDF
458--463 Michael D. Tyler and Denis K. Burnham Orthographic Influences On Initial Phoneme Deletion Tasks

Abstract  Here the orthographic effects on initial phoneme deletion tasks are examined with an adult population. Three experiments were conducted in which participants listened to instructions to take the first sound away from a real word to create a new word. For half of the items it was possible to use either sound or spelling to do the task (e.g. /raɪs/ − /r/ = /aɪs/, rice − "r" = ice), but for the other half an orthographic strategy resulted in the incorrect spelling for the resultant word (/kɒf/ − /k/ = /ɒf/ "off", cough − "c" = ough). Longer reaction times were found for interfering items in Experiment 1, and in Experiment 2, with a slightly different method of stimulus delivery. In Experiment 3, participants were specifically instructed not to use spelling in the task, but the same result was observed nonetheless. The results suggest that orthographic processing is automatically activated during phoneme deletion with real words.

PDF

Speech Physiology

Pages Authors Title and Abstract PDF
466--471 Michael Barlow, Frantz Clermont & Parham Mokhtari From Acoustics Of Speech To A 3d Vocal-Tract: Toward A Plausible Model, With Real-Time Constraints

Abstract  A system is described for constructing and visualizing three-dimensional (3D) images of the human vocal-tract (VT), either from directly-measured articulatory data or from acoustic measurements of the speech waveform. The system comprises the following three major components: (1) a method of inversion which maps from acoustic parameters of speech to VT area-functions, (2) a suite of algorithms which transform the VT area-function to a 3D model of the VT airway, and (3) solutions for immersing the 3D model in an interactive visual environment. The emphasis in all stages of modelling is to achieve a balance between computational simplicity as imposed by the constraint of real-time operation, and visual plausibility of the reconstructed 3D images of the human vocal-tract.

PDF
472--477 Julie Carson-Berndsen and Michael Walsh Interpreting Multilinear Representations In Speech

Abstract  This paper discusses the interpretation of multilinear representations of speech utterances using a computational linguistic model. The model uses a feature-based finite state automaton representation of phonotactic constraints and axioms of event logic to provide a multilinear representation with a temporal interpretation. The asynchronous nature of the features in multilinear representations allows coarticulation to be modeled and the phonotactic automaton representation of permissible sound combinations in a language allows not only actual but also potential (well-formed) syllable structures to be recognised. The paper illustrates how a computational linguistic model can provide a robust solution to the problems of coarticulation and out-of-vocabulary items in speech recognition.

PDF
478--483 Parham Mokhtari and Frantz Clermont New Perspectives On Linear-Prediction Modelling Of The Vocal-Tract: Uniqueness, Formant-Dependence And Shape Parameterisation

Abstract  It is well known that the linear-prediction (LP) model yields an inherently unique estimate of the vocal-tract (VT) shape from information contained only in the acoustic speech signal. However, the uniqueness property of the LP-VT model is understood at best incompletely, resulting in a perceived lack of confidence and a popular trend towards more sophisticated VT models and methods of acoustic-to-articulatory mapping. This paper contributes a better understanding of the LP-VT model's uniqueness property, by revealing for the first time the formant frequency- and bandwidth-dependence of LP-derived VT-shapes. Our results thereby (i) provide a shape-related explanation of the uniqueness property of the LP-VT model, and (ii) suggest a new, acoustically-relevant parameterisation of VT-shapes.

PDF
484--489 R. E. E. Robinson Articulograph Interface

Abstract  This paper describes a new interface that allows the Carstens Articulograph to be connected to any computer. The Articulograph is a device which converts physical speech movements into data that can be recorded on a computer. The new interface was designed and built at the Speech Hearing and Language Research Centre at Macquarie University. The existing analog unit functions are described, and the new interface integration is shown, complete with monitoring and synchronisation functions. The calibration and data-gathering software written for a Sun workstation is presented. This new interface allows analog signals to be recorded directly through a computer's own Analog-to-Digital converters, and the Articulograph to be directly controlled. Thus any computer platform can be used for Articulography.

PDF

Speech Recognition

Pages Authors Title and Abstract PDF
492--497 S.M. Ahadi Reduced Context Sensitivity In Persian Speech Recognition Via Syllable Modeling

Abstract  In this paper, an alternative approach to acoustic modeling in Persian continuous speech recognition has been introduced. The approach utilizes syllables in place of phonetic-level units due to their more stable and well-defined characteristics. This system is believed to reduce sensitivity to context, as part of the context is modeled within syllables. The results demonstrate the performance of the approach, especially in comparison to context-independent models, where reductions of up to 21% in the system word error rate have been observed for the single-Gaussian case.

PDF
498--503 Keith Bain and Di Paez Speech Recognition In Lecture Theatres - Liberated Learning Project An Innovation To Improve Access To Higher Education Using Speech Recognition Technology

Abstract  The Liberated Learning Project (LLP) represents a large and complex effort to determine whether automatic speech recognition (ASR) can be successfully used for real-time transcription and display in university lecture theatres. ASR-based notes, which are produced to enhance retention of lecture material by students, will overlap data collected to investigate the use of ASR in the lecture theatre. Stuckless (2000, cited in Konopasky) described the need to collect "data for assessment of accuracy of speech-to-text" for examination by our IBM partner. Leitch (2000, cited in Konopasky) described the need to record outputs such as audio-video tapes of the lecture and displayed text; audio and displayed text; audio and unedited software-generated notes; and audio and edited software-generated notes.

PDF
504--509 JunLan Feng and LiMin Du An Improved Architecture For Word Verification

Abstract  Word verification is important for both LVCSR and other domain-limited tasks. The first contribution of this paper is to provide a novel strategy to train a set of anti-subword models by combining maximum likelihood estimation (MLE) with minimum verification error (MVE) training. Second, this paper proposes a mechanism by which decision thresholds can be set differently for different words depending on their statistical capabilities obtained from training data.

PDF
510--515 Ho-Hyun Jeon, Chang-Sun Ryu, Jae-In Kim and Myoung-Wan Koo A Speech-Operated Railroad Information & Reservation Service With Multi-Stage Dialogue

Abstract  Dialog management is very important in telephone-based services. This paper describes the KORIS (Korean national Railroad Information & reservation Service) system for access to rail travel information. The system allows users to access the timetable by telephone over the public network. It was developed to replace a conventional service such as an ARS (Automatic Response Service) and to support an attendant service with which a user can get information. We use a CHMM (continuous hidden Markov model) for isolated-word recognition and a multi-stage grammar for dialog management. The grammar and some parameters for recognition depend on the service stage, which prevents wrong results that are not relevant to each stage. The paper presents an overview of the system and results of field trials.

PDF