Proceedings of SST 1994

Page numbers refer to nominal page numbers assigned to each paper for purposes of citation.

Feature Analysis I

Pages Authors Title and Abstract PDF
2--7 Yaxin Zhang, Roberto Togneri, Chris deSilva, Mike Alder Optimization Of Phoneme-Based VQ Codebook In A DHMM System

Abstract  A phoneme-based Gaussian mixture VQ codebook can improve the performance of a conventional DHMM system significantly. In this paper, an optimization method for the phoneme-based VQ codebook is proposed. The experimental results show that the optimized phoneme-based VQ codebook leads to both an improvement in system performance and a reduction in system complexity.

PDF
8--13 Lunji Qin, Haiyun Yang and Soo Ngee Koh Estimation Of Continuous Fundamental Frequency Of Speech Signals

Abstract  Most of the well known and widely used speech analysis algorithms are frame-based, with the assumption that a speech signal is locally stationary over the analysis frame. A different approach to estimating the continuous fundamental frequency of speech signals is considered in this paper. The approach can detect the true non-stationarity of speech signals as it provides a continuous (sample-by-sample) fundamental frequency estimate as a function of time. Our algorithm is based on the instantaneous frequency (IF) estimation technique. A bank of two filters is used to remove and attenuate the harmonics of the speech signals. The fundamental frequency of a speech signal is then estimated by the IF technique.
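
As a rough illustration of the IF idea (not the authors' two-filter bank, for which a single Butterworth low-pass filter is substituted here, and the 400 Hz cutoff is an assumed stand-in), the core computation can be sketched in Python:

# Minimal sketch: sample-by-sample F0 from the instantaneous frequency
# of the analytic signal.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def if_f0_track(x, fs, cutoff=400.0):
    b, a = butter(4, cutoff / (fs / 2))       # crude harmonic attenuation
    analytic = hilbert(filtfilt(b, a, x))     # x + j * Hilbert{x}
    phase = np.unwrap(np.angle(analytic))
    return np.diff(phase) * fs / (2 * np.pi)  # IF = dphase/dt / (2*pi), in Hz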

PDF
14--19 Andrew Luk, C.P. Cheung, S.H. Leung, and W.H. Lau Cantonese Phonemes Recognition Via The Gated Neural Network

Abstract  This paper examines the idea and structure of a gated neural network that can be used to recognize Cantonese phonemes. The idea is to partition a fully connected multi-layered feedforward neural network (MLFNN) into a number of functional areas, each of which is a fully connected MLFNN responsible for a subset of the original problem. These functional areas are activated via gating signals generated from either other functional areas or external sources. Preliminary results have indicated that such a network structure can achieve better performance than the MLFNN.

PDF

Speech Enhancement

Pages Authors Title and Abstract PDF
40--45 A. J. Fisher and S. Sridharan Speech Enhancement For Forensic And Telecommunication Applications

Abstract  This paper describes speech enhancement applied to covert recordings to improve both the quality and intelligibility of noise-corrupted speech. In this situation, intelligibility is the key issue since the recording is likely to be used as audio evidence. It is shown that using improved noise estimation and post-processing applied to a noise subtraction technique, both of the above features can be substantially improved. It is also shown that these results are significant for applications in the telecommunications industry, where enhancement of speech is used as a pre-processing stage in speech recognition, coding and speaker verification.
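
For readers unfamiliar with noise subtraction, a bare-bones magnitude spectral subtraction sketch follows; the paper's improved noise estimation and post-processing are not reproduced, and the noise magnitude is assumed to come from a known speech-free segment:

import numpy as np

def spectral_subtract(frame, noise_mag, floor=0.02):
    # Subtract a noise magnitude estimate from one windowed frame,
    # keeping a small spectral floor to limit musical noise.
    spec = np.fft.rfft(frame)
    mag = np.maximum(np.abs(spec) - noise_mag, floor * noise_mag)
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(frame))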

PDF
46--50 K. J. Popel and R. E. Bogner Blind Separation Of Speech Signals

Abstract  Blind signal separation deals with the problem of extracting independent source signals, such as different speech signals, from a set of measurements that are combinations of the source signals, such as measurements made by several microphones in a room with several speakers. Applications that involve this problem are hearing aids, noise cancellation in extreme environments (e.g. aircraft cockpits and factory floors), and eliminating cross-talk between communication channels.

PDF
51--56 Jie Huang and Noboru Ohnishi Voice Separation Based On Multi-Channel Correlation And Components Tracing

Abstract  This paper presents a novel sound separation method for voices produced by multiple persons from different directions. The method uses multi-channel signals from microphones located at different positions. Signals from all microphones are divided into narrow sub-bands by a set of band-pass filters. Envelope section curves over frequencies are calculated at each decimated sampling point. We assume that the voice energy mainly exists in the peaks of the envelope section curves. The separation task, then, is to group all of those peaks. Temporal disparities in arrival time are used to group peaks in the method. In order to cope with echoes, the grouping operation is applied only when a peak first appears, because reflected sound arrives later than direct sound, and thus the onset contains only sound coming directly from its source. The peaks are traced after onset and the grouping is maintained. Voice separation experiments were conducted in both an anechoic chamber and a normal room enclosed by concrete walls. The effectiveness of the method was demonstrated.

PDF
57--62 Yuchang Cao, Sridha Sridharan, Miles P. Moody Multi-Channel Speech Signal Separation By Eigendecomposition And Its Application To Co-Talker Interference Removal

Abstract  This paper describes the concept of eigendecomposition for multi-channel signal separation, an alternative method of enhancing a desired signal corrupted by interference. The method uses two observations, which come from a pair of single sensors (or beamformers), both of which contain the desired signal and the undesired signal(s). The method assumes that the desired signal and the undesired signal(s) are uncorrelated and that the signal-to-noise ratios (the ratio of the desired signal to the undesired signal(s)) of the two observations are different. The technique has been successfully used to separate speech signals heavily corrupted by ambient noise, co-talker interference and other sources such as background music.
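
A toy two-channel version of the eigendecomposition step, under the stated assumptions (uncorrelated sources, different signal-to-noise ratios in the two observations), might look like the sketch below; it illustrates the principle only, not the authors' full method:

import numpy as np

def eig_separate(x1, x2):
    # Stack the two zero-mean observations, form their 2x2 covariance,
    # and project onto its eigenvectors to obtain decorrelated outputs.
    X = np.vstack([x1 - np.mean(x1), x2 - np.mean(x2)])
    R = X @ X.T / X.shape[1]
    _, V = np.linalg.eigh(R)
    return V.T @ X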

PDF
63--68 Michael. S. Scordilis and Stuart Adams Experiments In Multi-Microphone Speech Enhancement For Recognition

Abstract  Operating on high quality speech data is important in present speech recognition systems, and any variation has quite detrimental effects on system recognition performance. Head-mounted, close-talk microphones are usually required, but are too restraining. As an alternative, this paper investigates speech processing for several arbitrarily positioned microphones, and evaluates the performance of a given recognition system for a number of different arrangements. Results indicate that while beamforming improves performance, it does not match the performance of more traditional close-talk setups unless additional signal enhancement is performed.

PDF

Linguistic Phonetics II: Extralinguistic Aspects Of Speech

Pages Authors Title and Abstract PDF
70--75 CHRISTINE M. KITAMURA Infant Preferences For Age-Related Infant-Directed Speech: The Salience Of Vocal Affect

Abstract  Three explanations of the potential benefit that the exaggerated intonation patterns of infant-directed speech (IDS) may have for infants are: (i) attentional, (ii) affective/social and (iii) linguistic/didactic. These were investigated using an auditory preference procedure that tested infants aged 5 and 12 months of age for their preferences for speech directed to 5-month-old infants (5DS) and to 12-month-old infants (12DS). In addition to age, attentional and affective variables were manipulated by using four speaker combinations created by pairing each of two 5DS with two 12DS speakers. Prior to the auditory preference study these speaker combinations were analysed for their pitch characteristics and affective salience. Contrary to previous findings, the 5DS stimuli did not always have higher pitch and more pitch modulation than the 12DS stimuli. Nevertheless, 5DS samples were always rated to be warmer and more affectionate than the 12DS samples. The results of the auditory preference study showed that both 5- and 12-month-old infants preferred to listen to 5DS. This was found for both normal and low-pass filtered versions of the speech stimuli. These results suggest IDS preferences are based not only on pitch but also on expressed affect.

PDF
76--80 L. Penny, A. Russell and C. Pemberton Some Speech And Acoustic Measures Of The Aging Voice

Abstract  Age is one of the personal attributes which may be judged with reasonable accuracy from samples of a person's speech. The features that the listener uses in making these judgements are of interest not only to the speech scientist, but also to the speech pathologist and gerontologist interested in physiological aging. Longitudinal studies are the method of choice when seeking data on age-related processes, but in the human case such studies are often impracticable for reasons of time scale, and so cross-sectional studies, interpreted cautiously, may be resorted to. Information is offered on some aspects of voice and speech, obtained from a longitudinal study that has so far spanned 45 years, with additional data from subjects at relevant ages who have so far been studied at only one point in time.

PDF
81--86 J.P. Scanlan The Transformation Of Bird Sounds Into 'Speech'

Abstract  Budgerigars mould their imitations of speech on species-specific vocalizations. This process is demonstrated, and some of its mechanisms examined, by analysing intermediate forms in which the transformation is incomplete.

PDF
87--91 L. Penny and M. Carmody The Acoustic Correlates Of Heightened Emotion: The Making Of Marriage Vows

Abstract  The speech of people under conditions of high emotional arousal, making their marriage vows, is compared on a number of variables with their speech recorded in ordinary relaxed circumstances. The findings relating to speaking fundamental frequency (but not its variability) and rate of utterance are in broad agreement with the reported influence of heightened emotional arousal on the speech signal. Also reported are changes to formant bandwidths, but the interpretation of these findings is problematical.

PDF
92--97 Duncan Markham Is Foreign Accent Visible?

Abstract  The paper presents a pilot experiment testing speakers' ability to identify language and native/non-native status of other speakers from visual stimuli.

PDF

Speech Coding I

Pages Authors Title and Abstract PDF
100--105 J.S. Pan, F. R. McInnes and M. A. Jack Improvements In Extended Partial Distortion Search And Partial Distortion Search Algorithms For VQ Search

Abstract  A new approach for the extended partial distortion search (EPDS) algorithm is proposed to optimize the EPDS algorithm based on the cost ratio of sorting time to dimension-distortion computation time. The result of this approach is the same as using dynamic programming (DP) to improve the computation time in the EPDS algorithm. The partial distortion search (PDS) algorithm can also be improved by determining the dimension at which it is suitable to start inserting comparisons for each codeword. Experimental results confirm the improved computational performance of these new approaches.
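
The baseline PDS idea that both refinements build on can be sketched as follows (the sorting-cost optimisation and per-codeword start dimension described above are omitted):

def pds_nearest(vector, codebook):
    # Accumulate the squared distance one dimension at a time and
    # abandon a codeword as soon as it exceeds the best distance so far.
    best_i, best_d = -1, float("inf")
    for i, code in enumerate(codebook):
        d = 0.0
        for x, c in zip(vector, code):
            d += (x - c) ** 2
            if d >= best_d:          # partial distortion already too large
                break
        else:                        # survived every dimension: new best
            best_i, best_d = i, d
    return best_i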

PDF
106--111 J. S. Pan, F. R. McInnes and M. A. Jack Comparison Of Fast VQ Training Algorithms

Abstract  Some fast approaches to VQ training for the LBG recursive algorithm are presented and compared. The computational efficiency is assessed in terms of the number of multiplications, comparisons, additions, and the sum of these three mathematical operations. Experimental results on speech data, in comparison to conventional VQ training algorithms, demonstrate that the best approach saves more than 99% in the number of multiplications, as well as giving a considerable saving in the number of additions. The increase in the number of comparisons is moderate.
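
For reference, the unaccelerated LBG iteration that such fast approaches speed up can be sketched as below (numpy broadcasting stands in for the distance computation being optimised):

import numpy as np

def lbg(train, n_codes, iters=20, seed=0):
    train = np.asarray(train, dtype=float)
    rng = np.random.default_rng(seed)
    codebook = train[rng.choice(len(train), n_codes, replace=False)].copy()
    for _ in range(iters):
        # nearest-neighbour partition of the training set
        d = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # centroid update for every non-empty cell
        for k in range(n_codes):
            if (labels == k).any():
                codebook[k] = train[labels == k].mean(axis=0)
    return codebook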

PDF
112--117 Jinho Choi On MSE Of CELP Coder

Abstract  We consider the mean squared error (MSE) between original and synthesized speech signals in terms of the frame length and the size of the stochastic codebook of a CELP coder. A theoretical approximation of the MSE has been found in terms of the frame length and the codebook size. It is shown that the MSE depends not only on the frame length, but also on the correlation properties of the speech signal when the codebook size is fixed. Moreover, the relationship between the MSE and the correlation properties of the speech signal is examined using the number of effective dimensions of the speech signal. For example, an uncorrelated speech signal, which has a large number of effective dimensions, has a larger MSE than a correlated speech signal, which has a small number. In addition, two applications of the MSE are considered for reducing stochastic codebook search time. One application can also be used to reduce the number of bits needed to encode a codeword of the stochastic codebook.

PDF
118--123 H.R. Sadegh Mohammadi and W.H. Holmes Fine-Coarse Split Vector Quantization: An Efficient Method For Spectral Coding

Abstract  Line Spectral Frequencies (LSFs) are the most popular parameters for spectrum quantization in speech coders using linear prediction. We propose a new method for the quantization of the LSFs, namely Fine-Coarse Split Vector Quantization (FCSVQ). The paper explains the principles of this method, including training and optimization of the associated codebooks. It is shown that this quantizer can be implemented efficiently with negligible computational overhead compared to simple scalar quantization. Satisfactory performance of the new method is verified through experimental simulations.
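
As background, plain split VQ of a 10-dimensional LSF vector looks like the sketch below; the fine-coarse refinement proposed in the paper is not reproduced, and the (4, 6) split is an assumed example:

import numpy as np

def split_vq_encode(lsf, codebooks, split=(4, 6)):
    # Quantize each sub-vector against its own codebook and return the
    # selected indices (one per sub-vector).
    indices, start = [], 0
    for size, cb in zip(split, codebooks):
        sub = lsf[start:start + size]
        indices.append(int(((cb - sub) ** 2).sum(axis=1).argmin()))
        start += size
    return indices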

PDF
124--129 Haiyun Yang, Soo-Ngee Koh, Sivaprakassaipillai P Enhancement Of Improved Multi-Band Excitation (IMBE) Using A Novel Method To Encode Spectral Amplitudes

Abstract  A novel method using the pitch-cycle waveform (PCW) to encode the spectral amplitudes of the multi-band excitation (MBE) model is proposed in this paper. With the new method, both the magnitudes and phases of the spectra are transmitted, instead of only the magnitudes, as in the case of the Improved MBE (IMBE). The perceptual weighting filter is also included in the encoding procedure. Through our experiments, it is found that the new method significantly improves the perceptual quality of the MBE decoded speech, especially in the case of male speech.

PDF

Linguistic Phonetics III: Tone And Intonation

Pages Authors Title and Abstract PDF
132--137 Phil Rose Any Advance On Eleven? Linguistic Tonetic Contrasts In A Bidialectal Thai Speaker

Abstract  The use of a Thai bidialectal with at least ten different tones is demonstrated in establishing linguistic tonetic contrasts, and in testing the suitability of tonal normalisation strategies and the descriptive adequacy of current feature systems. A descriptive framework for linguistic tonetics is outlined.

PDF
138--143 Phil Rose Wenzhou Tonal Acoustics - Depressor And Register Effects In Chinese Tonology

Abstract  Mean F0 and duration data from one male speaker of Wenzhou dialect are presented for the eight citation tones, and for tones in selected disyllabic lexical sandhi environments. Morphotonemic sandhi alternations are adduced to argue for the inclusion of a Depressor component in Wu tonological representation in addition to Register.

PDF
144--149 Heather B. King The Interrogative Intonation Of Dyirbal

Abstract  Dyirbal interrogative intonation phrases are analysed to determine the intonational cues a listener uses to differentiate between declarative and interrogative utterances. Results thus far obtained are presented.

PDF
150--155 Yuancheng Zheng, Harald Trost, Ernst Buchberger, and Johannes Matiasek The Intonational Model Used For German Text-To-Speech Generation

Abstract  In Text-To-Speech (TTS) systems the intonation quality depends mostly on the fundamental frequency (F0) contour. In this paper we propose a method of reconstructing the F0 contour in German TTS systems. The input to our system is an intonation hierarchy structure obtained by syntactic analysis. The nodes of our intonation hierarchy tree are scaled by relative numerical values which are normalized by pitch range. Assuming that the F0 tends to move continuously within a word, 11 patterns for reconstructing the syllable F0 contour are presented according to the location of the syllable's stress.

PDF
156--161 Sandra Madureira Pitch Patterns In Brazilian Portuguese: An Acoustic-Phonetic Analysis

Abstract  Fundamental frequency contours of declarative, interrogative and imperative sentences are taken into account in describing the prosodic structure of sentences in Brazilian Portuguese.

PDF

Speech Coding II

Pages Authors Title and Abstract PDF
164--169 S. Boland, M. Deriche, and S. Sridharan Low Bit Rate Speech And Music Coding Using The Wavelet Transform

Abstract  This paper investigates low bit rate coding of high quality speech and music signals using the discrete wavelet transform (DWT). Compression is first achieved by eliminating wavelet coefficients that are zero or near zero in magnitude. The remaining coefficients are then uniformly quantized using 8 bits. Identical experiments are carried out with the DCT and the FFT. In each case, subjective quality and the segmental SNR are recorded to enable accurate comparison with the DWT. The effects of frame size, number of vanishing moments, and number of levels decomposed in the DWT on the subjective quality and segmental SNR are also examined. More efficient techniques for the quantization of wavelet coefficients are currently being considered with the aim of reducing the bit rate while maintaining near-transparent quality.
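
The first two steps described above (discarding near-zero DWT coefficients, then 8-bit uniform quantization) can be sketched with PyWavelets; the wavelet name and threshold here are assumptions, not the paper's settings:

import numpy as np
import pywt

def dwt_compress(x, wavelet="db8", level=5, rel_thresh=1e-3):
    coeffs = pywt.wavedec(x, wavelet, level=level)
    flat = np.concatenate(coeffs)
    flat[np.abs(flat) < rel_thresh * np.abs(flat).max()] = 0.0  # drop tiny coeffs
    step = 2 * np.abs(flat).max() / 255            # uniform 8-bit quantizer
    return np.round(flat / step).astype(np.int16), step  # values fit in 8 bits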

PDF
170--175 W.N. Farrell and W.G. Cowley A Rate 3/4 TCM Decoder For Line Spectral Pairs Using MAP Information

Abstract  This paper describes two schemes which combine Line Spectral Pair (LSP) source information with a Trellis-Coded Modulation (TCM) scheme. The proposed schemes use rate 3/4 TCM to transmit LSP values across a channel. The decoder proposed is a Viterbi decoder which uses LSP transition information to help correct errored paths and choose the most probable set of LSPs. Two variations of this decoder are discussed. By using knowledge of LSP transitions and expected noise variance, a posteriori probabilities can be used as the metrics for branches in the trellis. This provides for more robust transmission over low SNR channels. Results show that there is significant gain in using either of the proposed methods.

PDF
176--181 J. Leis, S. Sridharan and W. Millan Secure Speech Coding For Voice Messaging Applications

Abstract  Voice messaging provides to the user the advantages of both electronic mail (providing message store and forward) and conventional voice telephony. It has the potential to take advantage of both aforementioned technologies, transcending both time and space in providing a more convenient and natural form of communication. A number of issues present themselves in the context of a store-and-forward voice mail system, in particular the ability to compress an entire message at once, and the problem of error control coding for a message which may not be decoded immediately. When combined with the need to secure the message via some cryptographic algorithm, the delayed-decode characteristics of voice mail create a problematic situation in the event of channel or storage errors, as conventional cryptographic algorithms propagate bit errors for the remainder of the message. Although the compression and message encryption may be viewed as independent, sequential stages in the processing of a voice message, this paper proposes a joint coding and encryption approach based on tolerable levels of distortion in the received voice message.

PDF
182--187 Chenthurvasan Duraiappan and Yuliang Zheng Improving Speech Security And Authentication In Mobile Communications

Abstract  This paper points out certain weaknesses in the existing security system of the Global System for Mobile Communications (GSM) and proposes a better security system for GSM. The proposed security system provides an authenticated session key distribution protocol between the authentication center (AUC) and the mobile station (MS) for every call attempt made by an MS. At the end of the authenticated session key distribution protocol, the identities of the AUC of a Public Land Mobile Network (PLMN) and the Subscriber Identity Module (SIM) of an MS are mutually verified, and the session key for call encryption is distributed to the MS. Keywords: GSM, Secure protocols, Scrambling, Digitization, Authentication, Encryption, Roaming

PDF
188--193 Ira A. Gerson, Mark A. Jasiuk, Joseph M. Nowack, and Eric H. Winter Speech And Channel Coding For The Half-Rate GSM Channel

Abstract  A VSELP speech coder and accompanying channel coder designed for the half-rate GSM channel are described. This system is being adopted by ETSI for the half-rate GSM channel. The speech coder employs a novel strategy for vector quantization of the reflection coefficients (r_j), which combines high coding efficiency, low codebook search complexity, and low storage requirements. A computationally streamlined version of a zero-pole spectral noise weighting function is implemented. An adaptive pitch prefilter and an adaptive spectral postfilter are used to improve the speech coder's performance for both the tandemed and non-tandemed cases. Error protection is provided by convolutional codes. Error detection is achieved through the use of a 3 bit CRC and Window Error Detection (WED).

PDF

Speech Phonetics & Speech Databases

Pages Authors Title and Abstract PDF
196--201 K.M. Hird The Function Of Declination In Spontaneous Speech

Abstract  The assumptions about the universality of declination were tested in the context of spontaneous speech. Acoustic analyses provided support for a theory of declination that involves both passive physiological processes and processes that are speaker controlled and context specific. The results also suggested that damage to the right cerebral hemisphere impairs the use of processes that are required for the control of declination in spontaneous speech.

PDF
202--208 Hisham Darjazini and Dr Jo Tibbitts The Construction Of Phonemic Knowledge Using Clustering Methodology

Abstract  In order to simulate inherent human phonemic knowledge, a mechanism for categorising phonemic information and for providing efficient structures to refine information is proposed. A model is used that depends on a hierarchical structure of knowledge which classes the phonemic information in a clustering manner. The classification hierarchy takes advantage of inheritance: once a phoneme is identified within the phonemic stream, it inherits knowledge of all the statistical possibilities of the remaining stream. It is suggested that this process will enhance isolated word recognition as well as continuous speech recognition.

PDF
209--214 J. Bruce Millar and Dave Davies The ANDOR Interface To The Australian National Database Of Spoken Language

Abstract  This paper introduces the ANDOR user interface, which allows remote access, using both special purpose and standard SQL commands, to a descriptive database which describes the current spoken language data holdings of the ANDOSL Project. It also updates the description of the data currently used within the ANDOSL project and indicates how it is mounted in an ORACLE system on the ANDOSL node connected to AARNET at the ANU. ANDOR attempts to do several things: give limited access to all-comers on AARNET; allow access to the ORACLE database via e-mail to registered users; provide tutorial help and personal SQL command library support; and provide short-cut report generation directly from the standard data description format, thus supporting the specific queries expected from the speech science and technology community in an efficient manner. The facilities provided by these various modes of access are described, examples of the scope of the descriptive data are given, and the status of current data holdings in the ANDOSL system is reported.

PDF
215--220 Jonathan Harrington and Lydia K.H. So Some Design Criteria In Segmenting And Labelling A Database Of Spoken Cantonese

Abstract  The paper describes a collaborative project for the creation of a segmented and labelled database of spoken Cantonese. Details are provided both of the phonetic and phonological structure of Cantonese and the way in which the mu+ system for speech corpus analysis has been adapted to extract and analyse data from a tone language.

PDF

Speech & Feature Analyses

Pages Authors Title and Abstract PDF
222--227 Myléne Pijpers, Michael D. Alder and Roberto Togneri Dimension Reduction Of Acoustic Vowel Data

Abstract  From 33 speakers in the TIMIT database a total of around 88 vowel utterances was extracted. These represent eight vowel categories. Each waveform segment was processed, first by taking an FFT on 32 msec frames, then by binning into 12 mel-spaced frequency bands. In this way each frame is described by 12 numbers, and hence becomes a point in R^12. Each vowel utterance is a series of points and a short trajectory in R^12. The eight vowel classes become eight clusters of points in the speech space. The covariance matrix and the centers of the clusters were computed, and dimension estimates of each vowel cluster by Principal Components Analysis show that the centers of the eight vowel clusters are all situated close to a plane, and their principal axes make a small and consistent angle with respect to this plane. This confirms the results of Plomp et al. (1969), who found the vowel space to be essentially two-dimensional. Projecting the centers of the vowel clusters onto the plane gives a representation of the vowels which is very similar to the conventional F1-F2 plot and the vowel diagram.
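
The processing chain described above can be sketched as follows (the rectangular mel binning is a crude stand-in for whatever filter shapes were actually used):

import numpy as np

def mel(f):                                    # Hz -> mel warping
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_binned_frames(x, fs, n_bins=12, frame_ms=32):
    # Each 32 ms frame becomes one point in R^12.
    n = int(fs * frame_ms / 1000)
    frames = x[: len(x) // n * n].reshape(-1, n) * np.hanning(n)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    edges = np.linspace(0.0, mel(fs / 2), n_bins + 1)[1:-1]
    band = np.digitize(mel(np.fft.rfftfreq(n, 1.0 / fs)), edges)
    return np.stack([spec[:, band == b].sum(axis=1)
                     for b in range(n_bins)], axis=1)

def principal_axes(cluster):
    # PCA of one vowel cluster: its center plus sorted eigenpairs.
    centred = cluster - cluster.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(centred.T))
    return cluster.mean(axis=0), evals[::-1], evecs[:, ::-1]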

PDF
228--233 Fikret S Gurgen, Ting Fan, Julie Vonwiller On The Analysis Of Phoneme-Based Features For Gender Identification With Neural Networks

Abstract  As a first step in our research into the identification of non-linguistic features, we introduce the identification of speaker gender or sex using neural networks (NN). The study makes use of phoneme-based features (phonemes and broad phoneme classes) from a phonetically rich database, using a few of the sentences as training data, and investigates the effects on gender identification on a comparative basis. There were 3 speakers, 1 male and 2 females with different accents. Accurate identification of gender is known to increase the performance of speech recognition systems. Briefly, a window-based neural network (WNN) is generally trained for identification of non-linguistic features using phoneme samples. This NN is then used for testing for the feature in unknown phoneme samples. Various numbers of MFCCs are employed for the gender identification. It was found that vowels of just a few sentences provided valuable gender information.

PDF
234--237 T. Schurer Comparing Different Feature Extraction Methods For Telephone Speech Recognition Based On HMMs

Abstract  Robust speech recognition over telephone lines depends heavily on the choice of the feature extraction method used. In recent years, several researchers have experimented with new feature extraction methods based on Perceptual Linear Predictive (PLP) analysis, and showed that these methods sometimes outperform conventional methods like Linear Predictive Cepstral Coefficients (LPC) and Mel-Frequency Cepstrum Coefficients (MFCC). The aim of the study described here is to find the best feature extraction method for a speech recognition system running in the public switched telephone network. Because previous work showed that HMMs clearly outperformed other classification methods, continuous density HMMs are used in this study. The above mentioned and commonly used feature extraction methods (MFCC, LPC) are compared to some PLP-based methods (simple PLP and RastaPLP). Important parameters of each feature extraction method (e.g. model or filter order, number of coefficients) were varied within reasonable ranges, and the recognition performance of each computed set of feature vectors was tested with a set of different HMMs. In order to find the optimal HMM for the specific data, the number of states and mixtures per state were also varied within the tests.

PDF
238--243 T. Matsuoka, N. Hayakawa, Y. Yashiba, Y. Ishida, T. Honda and Y. Ogawa Pitch Estimation Using Discrete Analytic Signals

Abstract  This paper proposes a new method for estimating the fundamental frequency of a speech signal, using a low-pass filter and a Hilbert transformer with approximately ideal frequency responses.

PDF

Linguistic Phonetics I: Individual Speaker Characteristics

Pages Authors Title and Abstract PDF
22--27 Jeffery Pittam The Measurement Of Voice

Abstract  This paper discusses a range of convergent methodologies for the measurement of voice. It argues for an integrated approach to vocal measurement that conceptualises the voice as an important part of social interactions. This approach includes acoustic, perceptual and attributional measures. Two projects currently being undertaken by the author will be briefly introduced.

PDF
28--33 Andrew Butcher On The Phonetics Of Small Vowel Systems: Evidence From Australian Languages

Abstract  The majority of indigenous Australian languages have a 'triangular' system of vowel quality contrasts, said to consist of /i/~/a/~/u/ only. Formant frequency data, both previously published and recently gathered, suggest that the basic phonetic space for Australian vowels is smaller than is generally taken for granted in the literature on the phonetics of vowels. Two possible explanations for this are examined. (1) Australian Aboriginal speakers may have vocal tracts which are of different proportions from those of European speakers, on whom conventional notions of phonetic vowel space are based. (2) Australian languages may not conform to the principle of 'maximum dispersion', which is widely assumed to be a universal one, whereby the vowels of a language are said to be dispersed maximally and evenly within the available phonetic space.

PDF
34--39 Kuniko Kakita Inter-Speaker Interaction Of The Duration Of Sentences And Intersentence Intervals

Abstract  The aim of the present study is to examine how one speaker's speech affects another speaker's speech. The speech parameters investigated are the sentence duration and the intersentence interval duration. The results of the study indicated that when the subjects took over from the preceding speaker and read the remaining part of the speech material, both the sentence duration and the interval duration deviated from their 'preferred' durations obtained in the single-speaker readings. The deviation was mostly assimilative. The results of further analysis indicated that the sentence duration and the interval duration differed characteristically in the way they were affected by another speaker's speech.

PDF

Speech Analysis

Pages Authors Title and Abstract PDF
244--248 Munehiro Namba and Yoshihisa Ishida Design And Implementation Of Digital Filters Using Neural Networks And Its Application To Hearing Aid

Abstract  Techniques for designing digital filters using neural networks, and an example application of a filter designed by this method to a digital hearing aid, are described. We apply neural networks to digital filters in order to obtain much more design flexibility than conventional adaptive digital filter techniques.

PDF
249--254 Hiroyuki KAMATA, Hiroyuki OKA and Yoshihisa ISHIDA Analysis And Synthesis Of Human Voice Considering The Nonstationary Based On The Glottis Open And Close Characteristics

Abstract  In this paper, we present a new structure of transfer function which is suitable for reconstructing the waveform of voiced speech. In voiced speech there are two different states in every pitch period, corresponding to the opening and closing of the glottis, so it is difficult to identify the voice generation system with a single transfer function. In this paper, two transfer functions connected in a parallel structure are used; when compensation of the estimation error is required, a third transfer function is added to the other two. With these three transfer functions, the voiced speech waveform is perfectly reconstructed. The transfer functions are estimated by the least mean square (LMS) method. Furthermore, a method for the reduction of voice information is also discussed in this paper.

PDF
255--260 Xue YANG, J. Bruce Millar and Iain Macleod On The Separation Of Speech Signal Variances From Two Sources

Abstract  Variability is an inherent characteristic of speech signals and its management plays an important role in speech processing. This variability arises from many sources. In this study, techniques are developed to explore the separation of speech signal variances from two sources in the low-order cepstral space. Hidden Markov modelling and statistical comparison of multi-variate distributions are used in a novel way to capture local variance in the midst of the normal dynamics of speech signals. Subsequent analysis demonstrates differences between this local variance and the global variance which is often used to characterise signal variance without reference to its significant components, and reveals speaker characteristics which may not be observed by simply looking at the global variance of speech data.

PDF
261--267 Richard E Favero Comparison Of Perceptual Scaling Of Wavelets For Speech Recognition

Abstract  Recent work has applied wavelets to speech recognition and has shown that the use of perceptual scaling of the wavelet set can reduce the number of coefficients generated per feature vector compared to standard log frequency scaling. This paper examines the formulation and performance of four frequency scaling operations applied to a wavelet set for the parameterisation of speech in a speech recognition system. Three perceptually based frequency scales (two mel-frequency scales and a Bark scale) are compared with standard log-scaled wavelets. Application to a multi-speaker E-set discrimination task shows that the piece-wise mel scale provides a recognition accuracy of 67.5%, outperforming the other perceptually based scales slightly, and the standard wavelet log scale by nearly 7%.
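
The perceptual warpings being compared have simple closed forms; the sketch below shows one common analytic mel approximation and Zwicker's Bark formula, plus mel-spaced centre frequencies of the kind used to place the wavelets (the piece-wise mel variant in the paper is linear below 1 kHz and logarithmic above):

import numpy as np

def mel_scale(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def bark_scale(f_hz):          # Zwicker's critical-band rate
    return 13.0 * np.arctan(7.6e-4 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def mel_spaced_centres(f_lo, f_hi, n):
    # n centre frequencies evenly spaced on the mel scale
    m = np.linspace(mel_scale(f_lo), mel_scale(f_hi), n)
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)   # inverse mel warp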

PDF

Speech Analysis & Audition

Pages Authors Title and Abstract PDF
268--273 S. Ong and P. Castellano Spectral Patterns And Speaker Identification Asymmetry

Abstract  This paper investigates the phenomenon of Automatic Speaker Identification asymmetry. Automatic Speaker Identification attempts to identify one or several speakers in a multi-speaker environment by analysing the speech signal. In this situation, some speakers may routinely be mistaken for others while the latter are rarely identified as the former. The Multiple Signal Classification algorithm was used to provide an eigenvector approach, partitioning the problem space into signal-plus-noise and noise eigenvalue subspaces. A speaker associated with high eigenvalues in the first subspace (well defined spectral patterns) was not routinely mistaken for another. The opposite was found for those with low eigenvalues in that subspace (weak spectral patterns). Hence, a disparity in spectral pattern strengths from speaker to speaker may strongly influence asymmetry in Automatic Speaker Identification.

PDF
274--279 SH Luo and R. W. King Using Speech Signals To Improve Visual Facial Image Reconstruction: An Rnn Approach To Explore The Mutual Information

Abstract  We present a novel approach for improving visual facial image reconstruction by utilizing information in the accompanying acoustic speech signal. From analysis of the speech signal and knowledge of the mutual information between speech and visual features of facial images, the method can be used to synthesize moving facial images. A recurrent neural network is used to map between the acoustic and visual spaces; the input to the RNN is 21 acoustic speech features and the output is the position values of 15 facial feature points and the first 20 coefficients of a principal components representation of the mouth area.

PDF
280--284 R.H. Withnell & R.A. Wilde Preliminary Report: Early Latency Auditory Evoked Potentials In Infants With Down Syndrome

Abstract  In individuals with Down syndrome, it has been speculated that compromise of the neural synapses of the auditory brainstem pathway may result in an auditory brainstem response (ABR) in which the evoked potential components exhibit temporal variability. This paper presents our findings on 14 infants with Trisomy 21. Using a mode of stimulation to directly innervate the cochlea, ABR thresholds were obtained which suggest that synaptic abnormality may be present in some cases.

PDF

Keynote Address

Pages Authors Title and Abstract PDF
285--288 Anne Cutler How Human Speech Recognition Is Affected By Phonological Diversity Among Languages

Abstract  Listeners process spoken language in ways which are adapted to the phonological structure of their native language. As a consequence, non-native speakers do not listen to a language in the same way as native speakers; moreover, listeners may use their native language listening procedures inappropriately with foreign input. With sufficient experience, however, it may be possible to inhibit this latter (counter-productive) behaviour.

PDF

Session 13 Speech Development

Pages Authors Title and Abstract PDF
325--330 Christine Kitamura & Denis Burnham Pitch & Communicative Intent In Infant-Directed Speech: Longitudinal Data

Abstract  This study examines the modifications made to pitch and communicative intent in infant-directed speech (IDS) from birth to 12 months, at 3-monthly intervals. With regard to pitch, mothers use the peak level of mean fundamental frequency (F0) when the infant is 6 months old, while pitch range is highest at 12 months of age. Sex-based differences were also evident, with mothers using higher mean F0 and pitch range in speech to female than to male infants. Furthermore, the average shape of the utterance transposes from a fall-flat/rise in speech to newborns to a flat-fall in speech to adults. With respect to communicative intent, two factors were extracted from five rating scales and these were labelled 'affective' and 'attentional/didactic'. Analysis of these factors showed mothers express more affection at 6 months than at other ages, while peak levels on the attentional/didactic factor were reached at 9 months of age. Mothers also increase their use of both these components of IDS more in speech to girl than to boy infants.

PDF
331--336 S. McLeod, J. van Doorn, and V. Reed Homonyms And Cluster Reduction In The Normal Development Of Children's Speech

Abstract  As children are learning to speak, they sometimes reduce consonant clusters to a single element, and as a result produce a homonym (e.g. they say "top" for "stop"). There is some evidence in the literature which suggests that even though these words may sound the same, they have differences which can be detected with acoustic analysis, indicating that the children are making covert distinctions between the two contexts. The purpose of this study was to make acoustic comparisons of homonym pairs produced as a result of cluster reduction by a group of 16 young children (2;0 to 2;11 years). Duration and relative energy of aspiration for stops /k/ and /t/, duration and spectral distribution for fricative /s/, and voice onset time (VOT) for /k/ were measured in several word-initial contexts. Results showed that for word-initial /s/ plus stop clusters which had been reduced to a stop, the aspiration duration for the stop in the cluster target word was significantly less than that for the singleton target word. No other temporal or spectral measures reached statistical significance. The results have been interpreted in terms of phonological and speech motor development in children.

PDF
337--342 P.F. McCormack & T. Knighton Gender Differences In The Speech Patterns Of Two And A Half Year Old Children

Abstract  The speech patterns of 50 normally developing two and a half year old children were investigated. Differences were found between the males and females for a distinct clustering of processes that simplify syllable structure. There was a significantly greater use of final consonant deletion, weak syllable deletion, and cluster reduction by the boys, while there were no differences in the use of other speech processes or in receptive and expressive language abilities. A discriminant function constructed from these 3 syllable-simplifying processes correctly classified each subject as being either male or female with eighty percent success. Interestingly, in the preschool and early school years boys have a higher incidence of developmental speech disorders than girls (2 to 1), with the marked use of weak syllable deletion, final consonant deletion, and cluster reduction distinguishing severe cases of developmental speech disorder. These are the same processes identified in this study as being generally more evident in boys than girls at two and a half years of age. The question arises as to whether there is any relationship between early patterns of speech development in a child and later identification as having a speech disorder.

PDF
343--348 Christine Kitamura & Denis Burnham Infant Preferences For Infant-Directed Speech: Is Vocal Affect More Salient Than Pitch?

Abstract  The aim in the following experiments was to ascertain whether infants are more responsive to the pitch or the vocal affect in infant-directed speech (IDS). In Experiments 1 and 2, infant preferences were tested for high vs low vocal affect with the level of pitch equated (HiAffect vs LoAffect IDS), and in both experiments infants preferred to listen to HiAffect IDS. In Experiment 3, high vs low pitch was presented with the level of vocal affect equated (HiPitch vs LoPitch IDS), and it was found that infants preferred LoPitch over HiPitch IDS. This result was unanticipated, and when a different procedure was used to rate the vocal affect of the speech exemplars in Experiment 4, there was no difference in infant preferences for Hi or LoPitch IDS. Taken together these two sets of results suggest that it is the affective salience of IDS that is important to infant responsiveness and not necessarily the pitch characteristics alone. In the final experiment infants showed no differential preferences for normal or low-pass filtered IDS, confirming they are as responsive to the intonation or pitch characteristics as they are to full spectral versions of speech. Therefore it is suggested that pitch is used as a means of conveying affective intent to infants.

PDF

Feature Analysis II

Pages Authors Title and Abstract PDF
290--295 Simon Fox and Peter Tischer Exact Sound Compression With Optimal Linear Predictors

Abstract  The optimal Least-Entropy predictor for DPCM exact sound compression is used to compare and evaluate the performance of predictors computed using criteria such as Least-Squares and Least-Absolute-Deviations. Results indicate that the performance of Least-Squares predictors is sensitive to the data, particularly as the dynamic range is increased. This has been addressed previously by breaking the data into small blocks and calculating new Least-Squares predictors for each block. We show that while this leads to considerable improvement, a single Least-Entropy predictor over the whole data set often performs favourably. An improvement for the Least-Squares predictor is suggested whereby a small percentage of the data is discarded in the least-squares modelling. This often gives a predictor considerably closer to optimal. The Least-Absolute-Deviations predictor is found to perform very close to optimal in general, suggesting that the distribution of errors is well approximated by a Laplacian distribution.
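
The Least-Squares baseline referred to above is the easiest of the three fits to state; a minimal sketch follows (the Least-Entropy and Least-Absolute-Deviations fits require iterative optimisation and are not shown):

import numpy as np

def ls_predictor(x, order=4):
    # Least-squares coefficients predicting x[n] from x[n-1] .. x[n-order];
    # the entropy of the residual determines the DPCM code length.
    x = np.asarray(x, dtype=float)
    rows = np.array([x[i:i + order][::-1] for i in range(len(x) - order)])
    target = x[order:]
    coeffs, *_ = np.linalg.lstsq(rows, target, rcond=None)
    return coeffs, target - rows @ coeffs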

PDF
296--301 Andrew Hunt and Richard Favero Using Principal Component Analysis With Wavelets In Speech Recognition

Abstract  Recent work has shown that wavelets can provide an effective spectral representation for use in speech analysis and speech recognition because of their ability to merge both wide-band and narrow-band spectral representations. There are, however, some difficulties associated with using wavelet parameterisation with HMM-based speech recognition. This paper presents a method which uses PCA to transform the feature space of wavelets so that they can be more effectively used with HMMs. The approach yields good results; up to 25% error-rate reduction is achieved on the difficult E-set discrimination task, along with a seven-fold reduction in the number of parameters. Further, the training and execution times with the PCA features are greatly reduced.

PDF
302--307 Chee Wee Loke, Roberto Togneri A Geometric Interpretation Of Hidden Markov Model

Abstract  In this paper, we investigate the relationship between speech trajectories and the hidden Markov model. The speech utterances were transformed into speech feature vectors and the trajectories displayed in a two-dimensional space. The hidden Markov models were also displayed in a two-dimensional space. By visual examination, we think that the state of the HMM is related to a sustained sound. Further experiments showed that each state seems to be associated with a distinct phoneme of the utterance. Therefore, the number of states required in the continuous HMM is related to the number of phonemes in the word to be modelled. In the semi-continuous HMM, it is also possible that the same Gaussian probability density function is shared by the same phoneme sound in different semi-continuous HMMs.

PDF

Linguistic Phonetics IV: Syllable Duration & Rhythm

Pages Authors Title and Abstract PDF
310--315 P.F. McCormack, J. C. Ingram Tempo And The Rhythm Rule

Abstract  The adjustment of linguistic stress patterns under the influence of rhythm is well attested, and forms the basis of a number of models of speech organisation. An experiment is reported on the perceptual and acoustic consequences of manipulating speaking rate for such rhythmic stress shifts. The results are discussed in terms of their implications for the place of rhythm in models of speech production.

PDF
316--321 David Deterding The Rhythm Of Singapore English

Abstract  Singapore English is relevant for the study of rhythm, because it is often claimed to have syllable-timed rhythm. Speakers of Singapore English Pronunciation (SEP) and Standard Southern British (SSB) were recorded, and the duration of each syllable compared with that of the following syllable for 30 utterances from each variety. It was found that there is greater variability in this measure of syllable-to-syllable duration in SSB, which confirms that SEP might indeed be regarded as more syllable-timed than SSB.

PDF
322--327 J. Wang Syllable Duration In Mandarin

Abstract  A Mandarin speech database, aimed at establishing a prosodic model for Mandarin synthesisers, has been established in SHLRC at Macquarie University. The database contains 401 sentences (3168 syllables) read by a female speaker. The duration data was extracted and analysed using mu+. The statistical results have shown that the following durational control factors have a strong influence on syllable duration: syllable structure; syllable tone; stress pattern of prosodic words; syllable position in prosodic words; and prosodic word position in sentences. Of these, syllable structure, syllable tone (including unstressed syllables with neutral tone), and sentence-final position are more dominant. In order to quantify the effects of individual control factors, a simple, additive, syllable-based duration model was tentatively developed. The duration analysis and its modelling in this paper, to a large extent, reveal the prosodic and rhythmic characteristics of spoken Mandarin, since tone, tone sandhi, word stress and rhythmic units were included in the hierarchically structured labels.

PDF

Speech Analysis

Pages Authors Title and Abstract PDF
330--335 Mechtild Tronnier Tracing Nasality With The Help Of The Spectrum Of A Nasal Signal

Abstract  This paper presents preliminary findings in the detection of nasalisation making use of the formant pattern in the spectrum of a nasal signal obtained with a contact microphone attached to the nose.

PDF
336--341 Richard E Favero Compound Wavelets And Speech Recognition

Abstract  This paper reports on a method for improving a speech parameterisation for speech recognition by increasing the bandwidth of a mother wavelet without significantly altering its time resolution. The linear combination of wavelets that have centre frequencies near each other produces a compound wavelet with a larger bandwidth. This paper also shows how more complex wavelets can be constructed for use in other correlation tasks. This work applies a wavelet parameterisation using compounded wavelets to a discriminative recognition task. The wavelet transform of the speech sample using the resultant wavelets is applied to a HMM classifier. Recognition performance on the E-set discrimination task improves from 67.5% to 70.0% through the use of compounding.

PDF
342--347 Simon Hawkins, Iain MacLeod, and Bruce Millar Modelling Individual Speaker Characteristics By Describing A Speaker'S Vowel Distribution In Articulatory, Cepstral And Formant Space

Abstract  Of the various methods of encoding vowels, the cepstral representation has consistently produced the best performance in automatic vowel classification studies. We explain this robust phenomenon in terms of our finding that the shape of a speaker's vowel surface is much more similar across speakers in Cepstral Space than it is in either Articulatory or Formant Space. The cepstral representation thus allows an automatic vowel classifier trained on the vowels of one group of speakers to generalise to the vowels of another group of speakers.

PDF
348--353 Simon Hawkins, Iain Macleod, and Bruce Millar An Unsupervised Algorithm For The Extraction Of Formant-Like Features From Lpc-Cepstral Space

Abstract  This study develops an unsupervised algorithm for extracting the perceptual dimensions of vowel backness (a correlate of F2) and vowel height (a correlate of F1) from the LPC-cepstral representation of a speaker's vocalic system.

PDF
354--359 Frantz Clermont and Parham Mokhtari Frequency-Band Specification In Cepstral Distance Computation

Abstract  Distances derived from the all-pole, Linear-Prediction (LP) cepstrum are known for their ability to capture important spectral differences between speech sounds with relatively small computational complexity, and hence are widely used for computer speech and speaker recognition. However, these so-called cepstral distances have to date been formulated in such a way as to yield similarity measures which are integrated over the entire spectral range defined between zero Hertz and half the sampling frequency. While this time-honoured formulation can be very effective, it is limited by the fact that arbitrary frequency bands within the available spectral range cannot be isolated or emphasised in the distance computation itself. In this paper we show that the existing mathematical framework for deriving LP-cepstrum distances is amenable to one which permits direct frequency-band specification. In particular, the quefrency-weighted cepstral distance, also known as a spectral slope distance, is re-formulated as a parametric function of frequency, and then illustrated using directly selected frequency bands of a pair of speech spectra.
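
The classical quefrency-weighted distance that is being re-formulated has a compact form; a sketch of the full-band version only, without the paper's band-specified parametrisation:

import numpy as np

def weighted_cepstral_distance(c1, c2):
    # sqrt( sum_k ( k * (c1_k - c2_k) )^2 ) over quefrencies k = 1..p;
    # c1 and c2 are assumed to exclude the 0th cepstral coefficient.
    c1, c2 = np.asarray(c1), np.asarray(c2)
    k = np.arange(1, len(c1) + 1)
    return np.sqrt(np.sum((k * (c1 - c2)) ** 2))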

PDF

Linguistic Phonetics V

Pages Authors Title and Abstract PDF
362--367 Anne Cutler, James McQueen, Harald Baayen and Hens Drexler Words Within Words In A Real-Speech Corpus

Abstract  In a 50,000-word corpus of spoken British English the occurrence of words embedded within other words is reported. Within-word embedding in this real speech sample is common, and analogous to the extent of embedding observed in the vocabulary. Imposition of a syllable boundary matching constraint reduces but by no means eliminates spurious embedding. Embedded words are most likely to overlap with the beginning of matrix words, and thus may pose serious problems for speech recognisers.

PDF
368--373 Clive Cooper and Frantz Clermont An Investigation Of The Speaker Factor In Vowel Nuclei

Abstract  Experiments in computer speaker identification are reported, which shed some light on speaker variability in the first three formants of the vowel nuclei of selected monosyllabic English words. The experimental results show that there is considerable spectro-temporal variation of speaker information throughout the vowel nuclei, but that the information does not appear to be "localised". Evidence is also presented in support of the hypothesis that speakers could be characterised in certain vowel subspaces.

PDF
374--380 Michael Ingleby, Wiebke Brockhaus and Carl Chalfont Robust Techniques For Recognition Of New Knowledge-Based Speech Primitives

Abstract  A knowledge-based approach to speech recognition based on relatively recent, post-SPE (Chomsky & Halle, 1968), non-phonemic speech patterns is outlined. The approach emphasises the power of new phonological theories (exemplified by Government Phonology) to model the coarticulation phenomena which make continuous speech hard to recognise by machine, and proposes a set of speaker-independent features (a signature) which map acoustic signal segments to the primitives of this chosen phonological theory. The features are shown to be suitable for 'coarse-to-fine' matching of a speech signal to possible parses, through invocation of a succession of cues about speaker intention embedded in a signal.

PDF
381--386 Andrew Hunt Two Linear Models Relating Acoustic Prosodics And Syntax

Abstract  This paper presents two models based on multivariate statistical techniques which show a significant linear relationship between acoustic prosodic features and syntactic structure for professionally read speech. The models differ from most previous research in three important ways: they do not use a predetermined intermediate phonological representation but instead "learn" one, they capture a broad range of prosody-syntax effects instead of focusing on effects at major boundaries, and the Link Parser is used to provide the syntactic framework instead of constituent parsing. The models show correlations between the acoustic prosodic and syntactic domains of 0.78 and 0.84. The role of individual acoustic and syntactic features is analysed through the use of the multivariate models.

PDF
387--392 Shuping Ran, Phil Rose, J.Bruce Millar and Iain Macleod Automatic Vowel Quality Description Using A Cardinal Vowel Reference Model

Abstract  This paper presents an analysis of the vowel space of a single speaker using an automated method of deriving the phonetic dimensions of the constituent vowels. The boundaries of a three dimensional space are defined by a set of eight extreme vowels. These vowels are used to train two multilayer perceptrons such that they encode in their inter-nodal weights the relationship between the dimensions of the vowel space and the acoustic characteristics of the extreme vowels. The English vowels of the speaker are then processed by the perceptrons. The activation levels of the output nodes are used to represent the position of each vowel within the vowel space. These automatically derived positions are then compared with the positions of these vowels in a similar space as judged by a phonetician, and the acoustic space derived from these vowels. The differences observed are discussed in the light of possible improvements in the procedure.

PDF

Speech Analysis II

Pages Authors Title and Abstract PDF
394--399 D Hawthorn, C White A Data-Driven Speech System

Abstract  Communication is a process of meaningful message exchange that is fundamental to daily living. Any impairment of this process has a significant impact, greatly reducing the quality of life that many take for granted. Opening communication with people who have severe and multiple disabilities is a challenge. Furthermore, developing a system that can be readily customised and reused in a number of application areas poses a number of major challenges. This paper describes a system, currently being developed by the authors, that addresses the areas of communication, customisation and reuse. The system is a Dynamic Data-Driven (3D) system that runs on an IBM compatible PC incorporating speech output through a single point interface to SBTALKER using a Sound Blaster card (1).

PDF
400--405 P. Castellano and S. Sridharan Speaker Identification With Projection Networks

Abstract  This study compares four connectionist approaches to text-independent Automatic Speaker Identification (ASI). It concludes that projection networks such as the Higher Order Neural Network (HONN), the Moody-Darken Radial Basis Function (MD-RBF) network and the Logicon Projection Network (LPN) consistently outperform standard Multi-Layered Perceptron (MLP) networks. The difference in performance is at least 11% according to the criteria of ASI threshold and percentage of correctly recognised speech vectors. In this study, the LPN, capable of creating both open and closed boundary regions for data points, is superior to both the HONN and the MD-RBF on both criteria (mean threshold: 90.55%, mean percentage of correct classifications: 97.2%). Since the LPN's dominance is almost universal across all speakers considered, results need not be confirmed by the additional use of one, or several, of the remaining three classifiers.

PDF
406--410 Malcolm B. Jones Real-Time Speech Enhancement Using Median Filters

Abstract  Considerations involved in using median filters in Digital Signal Processing systems for speech enhancement are discussed. A summary of available algorithms is presented, together with measurements of the real-time performance on sample speech corrupted by impulsive noise.

PDF
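For orientation, the running-median operation at the core of such enhancement systems can be sketched in a few lines of Python; the window width and the test signal below are illustrative assumptions, not the paper's settings.

    import numpy as np

    def median_filter(x, width=5):
        """Running median of odd width; endpoints use edge padding."""
        half = width // 2
        padded = np.pad(np.asarray(x, dtype=float), half, mode="edge")
        windows = np.lib.stride_tricks.sliding_window_view(padded, width)
        return np.median(windows, axis=1)

    # Impulsive clicks on a sinusoidal carrier: the running median tracks the
    # slowly varying signal but rejects isolated outliers.
    t = np.arange(800) / 8000.0
    clean = np.sin(2 * np.pi * 200 * t)
    noisy = clean.copy()
    noisy[::97] += 3.0                      # sparse impulses
    print(np.max(np.abs(median_filter(noisy) - clean)))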
411--416 A. Satriawan and J.B. Millar Speaker Change Detection

Abstract  An approach to speaker change detection based on speaker discontinuity models is described. It builds on the limited evidence of speaker characteristics in individual phone segments to enable a rapid response when a speaker change is detectable. A baseline system based on a speaker identification task is built and tested on the TIMIT corpus.

PDF
417--422 A. R. Kian Aleolfazlian & Brian L. Karlsen The Cocktail Party Listener

Abstract  A complex computational model of the human ability to listen to certain signals in preference to others, also called the cocktail party phenomenon, is built on the basis of surveys of the relevant psychological, DSP, and neural network literature. This model is basically binaural and as such it makes use of both spectral data and spatial data in determining which speaker to listen to. The model uses two neural networks for filtering and speaker identification. Results from some experimentation with the type and architecture of these networks are presented along with the results of the model. These results indicate that the model has a distinctive ability to focus on a particular speaker of choice.

PDF

Perception & Perceptual Features

Pages Authors Title and Abstract PDF
424--429 Robert H. Mannell The Prediction Of "Perceptual Distance" From Spectral Distance Measures Based Upon Auditory And Non-Auditory Models Of Intensity Scaling

Abstract  Spectral and Perceptual distances of channel vocoded speech from the original natural speech tokens are examined. Perceptual distances are defined as the differences between the intelligibility of each natural speech token and the intelligibilities of each of its derived synthetic forms. The spectral distances are second order Euclidean distances between the Bark-scaled spectra of each natural speech token and its derived synthetic tokens (summed over 10 ms frames). A number of different spectral distances based on various auditory and non-auditory intensity scales were compared by examining the correlations between the spectral and perceptual distances for each intensity scale. Correlations were compared globally and across various phonetic categories and the intensity scale(s) that provided the best correlations for each phonetic class were determined. The auditory and perceptual consequences of the results are discussed.

PDF
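As a hedged illustration of the kind of spectral distance this abstract describes (Euclidean distance between Bark-scaled spectra accumulated over 10 ms frames), here is a minimal Python sketch; the Bark formula, log-power intensity scaling and band count are assumptions, and the intensity scaling is precisely the factor the paper varies.

    import numpy as np

    def hz_to_bark(f):
        # One common Bark approximation (Zwicker & Terhardt style).
        return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

    def bark_spectrum(frame, fs, n_bands=18):
        spec = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
        edges = np.linspace(0.0, hz_to_bark(fs / 2.0), n_bands + 1)
        idx = np.clip(np.digitize(hz_to_bark(freqs), edges) - 1, 0, n_bands - 1)
        return np.array([spec[idx == b].sum() for b in range(n_bands)])

    def spectral_distance(x, y, fs=10000, frame_len=100):
        """Euclidean Bark-spectral distance accumulated over 10 ms frames."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        total = 0.0
        for i in range(0, min(len(x), len(y)) - frame_len + 1, frame_len):
            bx = np.log10(bark_spectrum(x[i:i + frame_len], fs) + 1e-12)
            by = np.log10(bark_spectrum(y[i:i + frame_len], fs) + 1e-12)
            total += np.sqrt(np.sum((bx - by) ** 2))
        return total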
430--435 U. Thein-Tun and D. Burnham The Nature Of Information Processing In Speech Perception

Abstract  Since pairs of non-speech sounds at each end of an acoustic continuum can be perceived categorically in the same way as a minimal pair of two phonemes at each end of a speech continuum, the notion of speech perception has been debated as either a special or a non-special mode of information processing. Using the strength of phonetic cue trading as an experimental tool, it can be shown that the higher the linguistic level, from words to sentences, the more special is the mode of information processing in speech perception.

PDF
436--441 Jialong He, Li Liu and Gunther Palm Perception Of Stop Consonants In Vcv Utterances Reconstructed From Partial Fourier Transform Information

Abstract  Identification experiments were performed to investigate the perception of intervocalic stop consonants in Vowel-Consonant-Vowel (VCV) utterances. The VCV utterances were reconstructed from the following partial Fourier transform information: (1) long-term Fourier phase spectra; (2) signed magnitude spectra - Fourier magnitude spectra combined with 1-bit phase spectra. It was shown that the percentage of correct identification of the intervocalic consonants was improved from 68% to more than 93% for the first type of stimuli after applying an iterative reconstruction algorithm with enough phase samples. Near-perfect performance could be reached for stimuli reconstructed from signed magnitude spectra. The effects of different initial guesses for the unknown parts and of vowel context are discussed.

PDF
442--447 Li Liu, Jialong He and Gunther Palm Perception Of Stop Consonants With Conflicting Phase And Magnitude

Abstract  Identification experiments were performed to examine the relative importance of phase versus magnitude for stop consonant perception. Three types of stimuli were constructed from Vowel-Consonant-Vowel (VCV) utterances: (1) swapped stimuli, in which each stimulus has the magnitude spectra of its constituent patches from one VCV signal and the corresponding phase spectra from another; (2) phase-only stimuli, constructed by eliminating the original signal's magnitude information; (3) magnitude-only stimuli. It was shown that the perception of the intervocalic stop consonant in a swapped stimulus varied from magnitude dominance to phase dominance as the size of the analysis patches increased. The crossover lies somewhere between 192 ms and 256 ms, where both magnitude and phase spectra provide equivalent but conflicting perceptual information. It was found that the perception of the voicing property (voiced/voiceless) of a stop consonant relies strongly on its phase information, while the perception of a stop's place of articulation was mainly determined by its magnitude information. As the patch size increases from 16 ms to 512 ms, the identification rates for the intervocalic consonants in magnitude-only stimuli decrease from 78% to 30%, and in phase-only stimuli increase from 18% to 68%.

PDF
448--453 Jordi Robert-Ribes, Jean-Luc Schwartz, Pierre Escudier Audio-Visual Recognition Of Speech Units: A Tentative Functional Model Compatible With Psychological Data

Abstract  We compare four models (drawn from psychology) of Audio-Visual (AV) integration for speech perception. First, we show that two of them are incompatible with psychological data. Then we present a perception test on AV identification with evidence for a complementarity of A and V with respect to place of articulation. Finally, we present an implementation of the two remaining plausible models, and we show that one of them makes better use of the complementarity, and hence achieves higher recognition scores.

PDF

Neural Networks & Artificial Intelligence

Pages Authors Title and Abstract PDF
456--461 P. Castellano and S. Sridharan A Two Stage Fuzzy Decision Classifier For Speaker Identification

Abstract  This paper discusses a new neural network based Two Stage Fuzzy Decision Classifier (TSFDC), applicable to Automatic Speaker Identification (ASI). At the conclusion of its training phase, this system computes one reference possibility set per speaker present in the data. During classification, these sets may be used to assist the network's decision-making process in identifying a speaker for each input vector. The speaker is selected from amongst the two most likely speakers for any given vector. The method was not found to be beneficial when identification scores already exceeded 54%. However, it consistently improved classification (by 15% on average) for those speakers harder to identify (below 54%). Misclassification by as much as 54% is uncommon for an ANN trained with a multiple speaker database. However, more work is needed to determine the probability of such an event occurring.

PDF
462--467 N. Kasabov, C.Watson, S. Sinclair, R. Kilgour Integrating Neural Networks And Fuzzy Systems For Speech Recognition

Abstract  The paper presents a framework of an integrated environment for speech recognition and a methodology for using such an environment. The integrated environment includes a signal processing unit, neural networks and fuzzy rule-based systems. Neural networks are used for "blind" pattern recognition of the phonemic labels of the segments of the speech. Fuzzy rules are used for reducing the ambiguities of the correctly recognised phonemic labels, for final recognition of the phonemes, and for language understanding. The fuzzy system part is organised as a multi-level, hierarchical structure. As an illustration, a model for phoneme recognition of New Zealand English is developed which exploits the advantages of the integrated environment. The model is illustrated on a small set of phonemes.

PDF
468--472 Richard F. Favero Comparison Of Mother Wavelets For Speech Recognition

Abstract  This paper compares four modulated wavelets for speech recognition. The modulated wavelets are based on well understood window functions: the Hanning, Hamming, Blackman and Gaussian windows. The wavelet parameterisation of speech using each of the above mother wavelets is applied to a multi-speaker E-set discrimination task. The results show that the Gaussian and the Blackman windows give a slight improvement in recognition performance.

PDF
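A modulated mother wavelet of the kind compared here can be built by multiplying a window function by a complex exponential. The Python sketch below is illustrative only; the length, centre frequency and Gaussian width are assumptions, not the paper's parameters.

    import numpy as np

    def modulated_wavelet(window, n=64, cycles=4.0):
        """Mother wavelet: a real window modulated by a complex exponential
        (Gabor/Morlet style); `cycles` sets the centre frequency."""
        t = np.arange(n)
        w = window(n) * np.exp(2j * np.pi * cycles * t / n)
        return w - w.mean()        # remove residual DC so the wavelet has zero mean

    wavelets = {name: modulated_wavelet(win) for name, win in
                [("hanning", np.hanning), ("hamming", np.hamming),
                 ("blackman", np.blackman)]}
    wavelets["gaussian"] = modulated_wavelet(
        lambda n: np.exp(-0.5 * ((np.arange(n) - n / 2) / (n / 6)) ** 2))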
473--478 David B. Grayden and Michael S. Scordilis A Hierarchical Approach To Phoneme Recognition Of Fluent Speech

Abstract  An overview is presented of a hierarchical phoneme recognition system which performs the task in a number of steps: segmentation, manner of articulation classification and then place of articulation classification. A combination of knowledge-based techniques and neural networks is used within these modules.

PDF
479--484 A. Samouelian Knowledge Based Approach To Speech Recognition

Abstract  This paper presents a knowledge/rule based approach to continuous speech recognition. The proposed recognition system (Samouelian, 1994) uses a data driven methodology, where the knowledge about the structure and characteristics of the speech signal is captured explicitly from the database by the use of inductive inference (C4.5) (Quinlan, 1986). This allows the integration of features from existing signal processing techniques, that are currently used in HMM stochastic modelling, and acoustic-phonetic features, which have been the cornerstone of traditional knowledge based techniques. Phoneme recognition results on the phonetic classes of plosives, semivowels and nasals for a combination of feature sets, for speaker dependent and independent recognition, are presented.

PDF

Speech & Hearing Disorders

Pages Authors Title and Abstract PDF
486--491 Sameer Singh Linguistic Computing In Speech And Language Disorders

Abstract  This paper describes the methodology of using linguistic measures to test the quality of spontaneous speech for patients with speech and language disorders. The discussion involves agrammatic patients with language problems. The paper discusses the various measures, their usefulness and the actual method of analysis.

PDF
492--497 C. McKilligan, J. van Doorn & S. Pitt The Intelligibility Of Speech In Cerebral Palsy: The Effects Of Manipulating The Acoustic Speech Signal

Abstract  Lack of speech intelligibility is a problem for many people who have cerebral palsy. This pilot study investigated whether improved speech intelligibility could be achieved by modifying the frequency and durational characteristics of the acoustic speech signal. Formant frequency modifications using linear predictive analysis and synthesis, and time scale modifications, were used to manipulate the speech signal for token utterances from a single speaker. The resultant acoustic signals were used as the basis for some preliminary listening tests to establish whether improvements in intelligibility occurred. Varying degrees of improved intelligibility were achieved, depending on the nature of the modifications. These encouraging results provide justification for an expanded project to investigate optimal speech processing methods using more speakers and a larger corpus of utterances.

PDF
498--503 P.J. Blamey, M.L. Grogan and M.B. Shields Using An Automatic Word-Tagger To Analyse The Spoken Language Of Children With Impaired Hearing

Abstract  The grammatical analysis and description of spoken language of children with impaired hearing is time-consuming, but has important implications for their habilitation and educational management. Word-tagging programs have achieved high levels of accuracy with text and adult spoken language. This paper investigates the accuracy of one automatic word tagger (AUTASYS 3.0, developed for the International Corpus of English project, ICE) on a small corpus of spoken language samples from children using a cochlear implant. The accuracy of the tagging and the usefulness of the results in comparison with more conventional analyses are discussed.

PDF
504--509 B.M. Chen, D.J. Calder and G. Mann Computer-Based Multimedia Speech Training Tool For Dyspraxic Clients

Abstract  Currently, there is a shortage of speech therapy services available to match the number of adult clients who require speech therapy. Consequently, we have developed a computer-based therapeutic tool, Articulator, to assist speech therapy services with the aim of reducing time in therapy and improving practice and access to treatment programs. Most research in the area of computer assisted speech rehabilitation addresses language problems, with only very occasional reference to motor speech disorders. Motor speech disorders may be caused by various factors, with stroke being the most common. Computer-based therapy is interesting and visually stimulating to the patient, and costs of therapy may be reduced. Time is saved, and the problem of the shortage of speech therapists is alleviated. Further advantages are that patients can learn at their own pace, perhaps in the comfort of their own home. Articulator is a computer-based therapeutic tool developed for use both in the hospital setting and at home by those with motor speech disorders. It is based on current methods, with full adaptability and utility for a range of speech clinics. The user interface is simple to use, so care-givers can also use it easily. The use of different levels of cues allows the patient to determine the level of difficulty to match their abilities. Articulator is a personal computer multimedia Windows application. It consists of natural sound output, high quality graphics, animation of air-flow and tongue position, and a user-friendly graphical interface. There are two levels of training available: single consonant training and consonant-vowel combination training. Although the emphasis is on stroke patients, it is useful for all patients with motor speech disorders. Developed in association with speech therapists at Royal Perth (Rehabilitation) Hospital, the system is about to undergo further trials prior to general release. This paper addresses the problems faced in developing the user interfaces of the system and the associated problems of integrating current therapeutic treatments with computer technologies.

PDF

ANN/AI Recognition

Pages Authors Title and Abstract PDF
512--517 Tetsuya Hoya, Hiroyuki Kamata and Yoshihisa Ishida Spoken Digit Recognition Using Neural Networks Trained By Incremental Learning

Abstract  In this paper, a new method to develop a practical spoken word recognition system is proposed. In order to improve the recognition rate, the incremental training technique is used to re-configure the original network. The experimental results show that the proposed method is worthy of introduction into the system for improving recognition performance.

PDF
518--521 Kiyoshi Kondou, Hiroyuki Kamata and Yoshihisa Ishida Spoken Japanese Digits Recognition System Using Lvq

Abstract  In this paper we present a new spoken Japanese digits recognition system using LVQ (Learning Vector Quantization). LVQ has a very simple algorithm and generates good reference vectors for DP (Dynamic Programming) matching. Our overall system has two characteristics. One is that the learning and recognition systems have two modes, for vowels and for consonants. The other is that the vowel recognition system predicts the consonant label. Of our two recognition systems, the vowel recognition system segments the vowel part and recognizes the vowel label. In the consonant recognition system, the actually recognized label is compared with the predicted label. Each system (vowel and consonant recognition system) generates a series of labels for a Japanese digit. According to the experiments, our recognition system has good performance.

PDF
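For orientation, the core LVQ1 update (pull the winning prototype toward a training vector on a class match, push it away otherwise) can be sketched in a few lines of Python; this is the textbook rule, not the authors' exact configuration, and the DP matching stage is not shown.

    import numpy as np

    def train_lvq1(data, labels, prototypes, proto_labels, lr=0.05, epochs=10):
        """LVQ1: move the nearest prototype toward the sample when the class
        matches, away from it otherwise; prototypes become reference vectors."""
        protos = np.asarray(prototypes, dtype=float).copy()
        for _ in range(epochs):
            for x, y in zip(np.asarray(data, dtype=float), labels):
                i = int(np.argmin(np.linalg.norm(protos - x, axis=1)))
                sign = 1.0 if proto_labels[i] == y else -1.0
                protos[i] += sign * lr * (x - protos[i])
        return protos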
522--527 Danqing Zhang and J.Bruce Millar Digit-Specific Feature Extraction For Multi-Speaker Isolated Digit Recognition Using Neural Networks

Abstract  The digit-specific feature extraction approach extracts distinguishing features of spoken digits in order to use a smaller amount of data to represent the digits. This reduced representation of the distinctive acoustics of the digits was evaluated in an isolated digit recognition task using a multi-layer perceptron neural network architecture. The acoustic-phonetic design of features for English digits is described, as is the means to extract them from spoken utterances. The results of a recognition system based on this feature set are presented for the conditions of multi-speaker dependent and speaker independent testing. The data set for this study is the ten isolated digits 'zero' to 'nine' spoken by three male and five female Australian speakers.

PDF
528--533 F. Béchet, H. Meloni, P. Gilles Knowledge Based Lexical Filtering: The Lexical Module Of The Spex System

Abstract  We present the lexical component of a speech recognition system (SPEX) which connects an acoustic-phonetic decoding process with a large lexical database. The lexical level is made up of coding modules for the phonetic and lexical data and of a lexical access stage using a set of various filters to reduce the number of hypotheses.

PDF

Enhancement, Adaptation & Coding

Pages Authors Title and Abstract PDF
534--539 Fikret S. Gurgen and H. C. Choi On The Frame-Based And Segment-Based Nonlinear Spectral Transformation For Speaker Adaptation

Abstract  This study presents speaker adaptation using nonlinear spectral transformation of frame-based and segment-based feature vectors for a continuous density hidden Markov model (CDHMM) speech recognizer. A multilayer perceptron (MLP) structure is used for the nonlinear mapping of speech frames and segments, and the adaptation results are compared with those of the canonical correlation analysis (CCA) technique (frame-based), which is a linear method. Experiments are performed using isolated words from the TI-46 database. It is concluded that the linear transformation gives a better recognition rate after frame-based adaptation, but in the segment-based case the nonlinear transformation improves the recognition rate after adaptation in comparison to the frame-based linear transformation, due to the addition of dynamic information of speech.

PDF
540--545 H.C. Choi and R.W. King A Two-Stage Spectral Transformation Approach To Fast Speaker Adaptation

Abstract  A two-stage approach for using spectral transformation to perform supervised speaker adaptation for continuous speech recognition is proposed. In the first stage, all the speech data of the training speakers are first transformed to the spectral space of a reference speaker. Then a set of adaptive Hidden Markov Models (HMMs) is trained for this reference speaker using his/her speech data and the transformed speech data of the other training speakers. Once the models have been trained, they are used for all new speakers. In the second stage, speaker adaptation is performed by transforming the feature vectors of the speech of a new speaker to the spectral space of this reference speaker. Recognition is then performed using the pre-trained models. The effectiveness of this approach is investigated using the DARPA Resource Management (RM1) continuous speech corpus.

PDF
546--550 D. Cole, M. Moody and S. Sridharan Measuring Intelligibility Of Reverberant Speech With And Without Enhancement

Abstract  Three acoustical measures are evaluated as predictors of speech intelligibility for highly reverberant speech, both before and after enhancement by inversion of the room impulse response. The predictors are the Speech Transmission Index (STI), the Lochner-Burger signal to noise ratio, and a simple ratio of useful to detrimental energy. Predictions are compared with articulation scores using phonetically balanced (PB) speech material. Intelligibility improvements were consistent across two variants of the articulation test, but were not reflected in any predicted value for short inverse filter lengths.

PDF
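The third predictor is the easiest to illustrate: a useful-to-detrimental energy ratio treats the early part of the room impulse response as useful and the late reverberant tail as detrimental. A minimal Python sketch follows; the 50 ms split point is an assumption, not necessarily the paper's definition.

    import numpy as np

    def useful_to_detrimental_db(h, fs, early_ms=50.0):
        """Ratio (in dB) of early to late energy in a room impulse response."""
        h = np.asarray(h, dtype=float)
        onset = int(np.argmax(np.abs(h)))            # align to the direct sound
        split = onset + int(round(early_ms * fs / 1000.0))
        early = np.sum(h[onset:split] ** 2)
        late = np.sum(h[split:] ** 2)
        return 10.0 * np.log10(early / max(late, 1e-12))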
551--556 Jianwei Miao Vector Quantization Of Dct Components For Speech Coding

Abstract  In this study, design algorithms and experimental results were investigated for vector quantization of discrete cosine transform components in speech coding. The encoding and decoding systems of vector quantization of discrete cosine transform components were developed for digital speech coding. A training sequence consisting of 200,000 speech samples, sampled at 8 kHz with 16-bit signed integers, was used to design the codebooks. The codebooks were designed using the K-means algorithm. The implementation of the discrete cosine transform vector quantizer used a two-codebook structure, providing different codes for different real component vectors corresponding to different frequency bands. This is a form of subband coding and yields a means of optimizing bit allocations among the subcodes as well as producing good quality speech at low bit rates. The experimental results of implementing the encoding and decoding systems showed good quality speech at 7111 bps with a signal-to-quantization-noise ratio of 15.61 dB.

PDF
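The codebook-design step named here is standard K-means vector quantization. A compact Python sketch is given below for orientation; the two-codebook subband split over DCT coefficients and the bit allocation are not shown, and all sizes are illustrative.

    import numpy as np

    def train_codebook(vectors, codebook_size, iterations=20, seed=0):
        """K-means codebook design for transform-coefficient vectors."""
        vectors = np.asarray(vectors, dtype=float)
        rng = np.random.default_rng(seed)
        codebook = vectors[rng.choice(len(vectors), codebook_size,
                                      replace=False)].copy()
        for _ in range(iterations):
            # Nearest-codeword assignment under squared Euclidean distortion.
            d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # Centroid update; empty cells keep their previous codeword.
            for k in range(codebook_size):
                members = vectors[labels == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
        return codebook

    def encode(vectors, codebook):
        """Return transmitted codeword indices for each input vector."""
        d = ((np.asarray(vectors, dtype=float)[:, None, :] -
              codebook[None, :, :]) ** 2).sum(axis=2)
        return d.argmin(axis=1)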

Speech Recognition

Pages Authors Title and Abstract PDF
558--563 NAGAI Akito, ISHIKAWA Yasushi and NAKAJIMA Kunio Concept-Driven Semantic Interpretation For Robust Spontaneous Speech Understanding

Abstract  This paper describes an integration of speech recognition and two-stage semantic interpretation which detects concepts from a phrase lattice and integrates them with an intention to form a meaning hypothesis. In this approach, a concept represented by several phrases is a unit of semantic interpretation, and a sentence is regarded as a sequence of concepts. An island-driven search method exploits syntactic/semantic knowledge to evaluate the linguistic likelihood of concept hypotheses. The method achieved an understanding rate of 82% at the first rank (94% by the sixth) in a speech understanding experiment using 50 spoken sentences with various sentential expressions.

PDF
564--569 Yaxin Zhang, Chee Wee Loke, Roberto Togneri, Mike Alder A Comparison Of Pbdhmm And Chmm For Isolated Word Recognition

Abstract  Using a phoneme-based Gaussian mixture as the VQ codebook in a DHMM speech recognition system (PBDHMM) is an efficient way to improve system performance. This paper compares the performance of the PBDHMM system with that of the well-known continuous HMM system on an isolated word recognition task. The results show that the PBDHMM system obtained better results than the CHMM system, especially for phoneme-distinct data.

PDF
570--575 C.C. Fung, C. Romeo and A. Gregory Development Of A Microprocessor-Based Speech Recognition System For Remotely Operated Underwater Vehicle

Abstract  Development of a microprocessor-based speech recognition system for a low cost remotely operated underwater vehicle (ROV) is reported in this paper. The system is designed to enhance the operator controls for the TITAN range of ROVs. The experimental system is developed on a Motorola evaluation board with an M68HC16 16-bit microcontroller. Recognition is based on the minimum Euclidean distance between the energy functions of sampled templates. The algorithm is further enhanced with the incorporation of linear predictive coding and cepstral coefficients for distance measurement. It is implemented and tested on a PC-486 system with a SoundBlaster card. Experimental results indicated that the speech-based system will simplify the operator control console, but the performance of the HC16 microcontroller may not be adequate when all the tasks of control, communication and speech recognition are taking place. Further research is continuing to improve the algorithms and system hardware so as to achieve acceptable accuracy and efficiency.

PDF
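As a rough illustration of template matching by minimum Euclidean distance between energy functions, here is a hedged Python sketch; the frame sizes, log-energy feature and truncation to a common length are simplifying assumptions, not the authors' implementation.

    import numpy as np

    def frame_energies(x, frame_len=256, hop=128):
        """Short-time log-energy contour used as the matching feature."""
        x = np.asarray(x, dtype=float)
        steps = range(0, len(x) - frame_len + 1, hop)
        return np.log(np.array([np.sum(x[i:i + frame_len] ** 2) + 1e-9
                                for i in steps]))

    def recognise(utterance, templates):
        """Pick the command whose stored energy contour is nearest in
        Euclidean distance; contours are truncated to a common length."""
        feat = frame_energies(utterance)
        def dist(contour):
            n = min(len(feat), len(contour))
            return np.linalg.norm(feat[:n] - contour[:n])
        return min(templates, key=lambda name: dist(templates[name]))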

Keynote Address

Pages Authors Title and Abstract PDF
576--578 Professor Robert Linggard Speech Science And Technology - Review And Perspective

Abstract  Though Speech Science is probably as old as Science itself (Aristotle wrote on the subject), Speech Technology probably began in 1938 when Homer Dudley of Bell Labs invented the Channel Vocoder. This was a speech transmission system which, at the transmitter, analysed speech into ten frequency channels plus an excitation type (buzz or noise) and re-synthesised the speech at the receiving end using an identical ten channel filter system excited by buzz/noise. The theory of this device relied on the observation that speech was mainly a changing spectrum and its detailed waveform was not significant for perception. The virtue of the Vocoder was that the eleven components of the analysed speech required less bandwidth to transmit than the original speech. It was, what we would now call, a low bit-rate coding system, and was the first real attempt to take speech processing beyond amplification and filtering. A year later, in 1939, at the New York World Fair, the ten band synthesiser, fitted with manual control keys, was shown generating synthetic speech by trained keyboard operators. Speech Science, at this time, was largely descriptive, and the phonetic alphabet and the theory of phonemes were its main theoretical base. After the invention of the Vocoder, it was hoped that, within a few years, an analyser could be developed to detect phonemes, and that the synthesiser would be adapted to accept phonetic text as input. It would then be possible to transmit speech as phonetic text with extremely low bandwidth. More than fifty years later, even though speech recognition and synthesis have advanced enormously, this "Phonetic Vocoder" is still a dream. In the 1950s and 60s, efforts to analyse speech into phonemic units met with frustration, and speech recognition research became concentrated on recognising whole words, a task which proved to be much more tractable. The invention of digital computers and the application of dynamic time warping provided some success, and the first commercial recognition systems appeared on the market in the 1970s. It has since been possible to recognise continuous speech using sub-word units based on phonemic ideas. However, the development of speech recognition has owed more to advances in stochastic processing than linguistic insight. It has been speech synthesis research which has taken up the ideas of phonetics. In 1964, John Holmes at JSRU in England devised the first successful text-to-speech system. In this scheme, control signals are derived from phonetic text by a set of rules running on a digital computer, and are used to drive an electronic, parallel-formant synthesiser. Later work by D. Klatt and J. Allen at MIT improved both the intelligibility and quality of synthetic speech generated from text. The morph decomposition theory of Allen and the co-articulation rules of Klatt have made a significant contribution to speech science. The development of electronic instrumentation and the advent of the digital computer in the 1950s gave Speech Science an impetus which transformed it from a speculative study into a modern, experimental science. The understanding of speech perception advanced rapidly through the modelling of cochlear mechanics, and speech production progressed quickly via experimental phonetics and vocal tract aerodynamics. Digital recording and processing methods facilitated more and more sophisticated psycho-acoustic experiments on humans and detailed neural recordings from the auditory pathways of animals.
In many ways, Speech Technology still has to catch up with this enormous explosion of knowledge in Speech Science. A very large amount of research in Speech Science and Technology has been carried out in the past fifty years, though it is only recently that Speech Technology has become commercially viable. The field has been fortunate in receiving huge amounts of funding, probably because the technology appeared, for a long time, to be just on the verge of breakthrough to commercial exploitation. Speech Technology is now, unquestionably, applicable in many branches of industry and commerce. Yet compared with human capabilities, speech recognition and synthesis have still a long way to go, and Speech Science still has many unanswered questions.

PDF

Speech Synthesis

Pages Authors Title and Abstract PDF
581--586 Caroline Henton Techniques For Synthesizing Visible, Emotional Speech

Abstract  Attempts are described to synthesize visible speech in real-time on a Macintosh personal computer, and to enable the user to color the text of the speech to be synthesized emotionally, according to the users wishes, the representation of the text, or the semantics of the utterance. Animated visible speech will be demonstrated, using a variety of on-screen agents.

PDF
587--592 Sang-Hun Kim, Jung-Chul Lee Korean Text-To-Speech System Using Time Domain-Pitch Synchronous Overlap And Add Method

Abstract  We developed an advanced Korean text-to-speech conversion system using the TD-PSOLA (Time Domain-Pitch Synchronous Overlap and Add) technique. Our system consists of a language processing module, a prosodic processing module, and a synthetic speech generation module. This paper mainly describes the prosodic processing in the text-to-speech system. To derive the segmental duration and intonation models, we selected appropriate sentences containing a variety of phrase structures. The prepared prosodic database, read by a female announcer, is composed of 38 sentences and 1021 syllables. The prosodic processing calculates segmental duration and the F0 contour from rules extracted through analysis of the prosodic database. Finally, we applied phrase-level macro prosody using the syntactic and positional information of prosodic phrases in a sentence. The syllable-level micro prosody was set up using the phonetic context and the position of the syllable in a prosodic phrase. The advanced Korean text-to-speech conversion system applying this prosodic processing shows improved naturalness.

PDF
593--598 Yasushi Ishikawa and Kunio Nakajima Speech Synthesis By Rule Using Synthesis Units Considering Prosodic Features

Abstract  This paper describes synthesis units for text-to-speech synthesis. The choice of synthesis unit and the extraction of units are very important problems in speech synthesis by rule. Many kinds of units have been proposed; VCV, CVC, demi-syllable and triphone are typical units for Japanese text-to-speech systems. Recently, non-uniform units or context-dependent units have been proposed, and good results have been reported. However, such units consider only phonetic context as a factor of spectral variation. Our basic idea is to introduce prosodic features into the control of spectral features in order to produce natural-sounding synthetic speech. In this paper, we report the results of basic analytic experiments. The results show that there is an obvious relation between prosodic features and spectral features, and that a spectral control method which considers not only phonetic context but also prosodic features is able to improve the quality of synthetic speech in a text-to-speech system.

PDF

Speech Production & Speaker Characteristics

Pages Authors Title and Abstract PDF
600--605 C.J. James, M.F. Cheesman, L. Cornelisse and L.T. Miller. Response Times To Sentence Verification Tasks (SVTS) As A Measure Of Effort In Speech Perception

Abstract  Three studies are reported that assessed the potential use of three speech tests for obtaining a measure of listener effort. The tests were designed to yield two performance measures, response time and percent correct responses. The sensitivity of each test and each measure to changes in listener performance (practice effects) and listening condition (signal-to-noise ratio) was compared.

PDF
606--611 Alain Marchal and Sophie Lapierre "Can We Learn Something From Nonsense Words?"

Abstract  Experiments on speech production using nonsense words allow for fine control of the phonetic, linguistic and prosodic variables which can interact in a speech sequence; but it is questionable whether results obtained from these carefully designed experiments bear any significance for the understanding of the processes involved in the production of less artificial speech items. Data for this study have been extracted from the multisensor ACCOR database. We have investigated the production of /l/ by two French speakers in nonsense words, isolated words and sentences. First, the various signals were annotated using a multitiered phonetic approach, whereby articulatory, acoustic and aerodynamic data are marked at major signal discontinuities. These events are then interpreted as landmarks of articulatory gestures. The spatio-temporal organization of these gestures for the production of /l/ is compared across speech styles and across speakers. Our results suggest that more data are needed to account for aerodynamic requirements in the production of /l/.

PDF
612--619 Christophe Vescovi, Eric Castelli Gestural Supervisor For The Vocal Cords Of A Speaking Machine

Abstract  Until now, inversion methods for the speech signal have mainly dealt with the representation of the vocal tract. This work extends those inversion models to the voice source using the two-mass model of the vocal cords. In a first step, the glottal flow produced by this model is characterised in order to define an appropriate control space. Then a forward model of the two-mass model is built, giving relations between the commands of the voice source and the control parameters, taking into account the alterations of the glottal flow caused by vocal tract configurations. The back propagation algorithm used with this forward model can predict the commands required to reach a specified point in the control space. The algorithm is validated by inverting synthetic signals, so that the results of the inversion can be compared with the commands used for the synthesis. Finally, first results on the inversion of natural speech are presented.

PDF

Speech Tools

Pages Authors Title and Abstract PDF
620--625 K. C. Scott, D.S. Kagels, S.H. Watson, H. Rom, J.R. Wright, M. Lee, K.J. Hussey Synthesis Of Speaker Facial Movement To Match Selected Speech Sequences

Abstract  A system is described which allows for the synthesis of a video sequence of a realistic-appearing talking human head. A phonetic-based approach is used to describe facial motion; image processing rather than physical modeling techniques are used to create the video frames.

PDF
626--631 Catherine I. Watson The Visual Display Test: A Test To Assess The Usefulness Of A Visual Speech Aid

Abstract  The facility to display features of speech in a visual speech aid does not by itself guarantee that the aid will be effective in speech therapy. An effective visual speech aid must provide a visual representation of an utterance from which a judgement on the goodness of the utterance can be made. Two things are required for an aid to be effective. Firstly, the clusters of acceptable utterances must be separate from the unacceptable utterances in display space. Secondly, the acoustic features which distinguish acceptable utterances from unacceptable utterances must be evident in the displays of the speech aid. A two-part test, called the Visual Display Test (VDT), has been developed to assess a visual speech aid's capacity to fulfil these requirements.

PDF
632--636 K.M.Knill and S.J.Young Keyword Training Using A Single Spoken Example For Applications In Audio Document Retrieval

Abstract  An open keyword vocabulary word-spotting system is described for audio document retrieval. To model each keyword, an N-Best recogniser is used to hypothesise the keyword's phonetic transcription based on a single spoken example. The system is evaluated on a database of spoken messages and performance is found to be comparable with that obtained using pronunciations taken from a dictionary.

PDF
637--642 Joon Hyung Ryoo, Katunobu Itou, Satoru Hayamizu and Kazuyo Tanaka Korean Speech Dialog System For Hotel Reservation

Abstract  This paper describes an experimental Korean speech dialog system for a hotel reservation task. The system consists of a language-dependent dialog manager and a language-independent speech recognizer which is based on the techniques used in a Japanese continuous speech recognizer. We describe these two modules of the system and how the sentence pattern information is used to reduce the search space. We performed speaker independent recognition experiments to evaluate the performance of the dialog system. The sentence recognition rate is 62.3% with perplexity 13.1 for 1203 utterances uttered by 11 male speakers. Response time of the system is 3-5 seconds when the whole process is run on an HP9000.

PDF
643--648 Florence Sédes, Nadine Vigouroux, Philippe Truillet, Bernard Oriola Hyperaudio : Vocal Navigation Strategies In A Hypermedia Environment

Abstract  Hypermedia systems emphasize the paradigm of a graphic user interface based on navigation through a two/three dimensional space. HyperAudio, the mainly speech-based hypermedia interface described in this paper, explores issues of navigation in an audio environment without a visual display. The user interface under development uses spoken commands to browse through the hyperbase. For contextual information, user feedback and hypermedia object reading, only audio output technologies are used. Navigation strategies in the audio space are more difficult than in the spatial one, and the concepts of navigation need to be redefined. Some navigation solutions based on a voice interface are discussed in terms of how auditory information is processed. We use audio cues as a substitution for visual ones when reading; these audio cues provide navigational information when the reader moves through a hypermedia base. This paper focuses on audio concepts and issues raised by the need for intelligent navigation tools making a system capable of reading a hypermedia base.

PDF

Acoustic Phonetics I: Segments Duration & Prosody

Pages Authors Title and Abstract PDF
650--655 Katsumasa Shimizu F0 In Phonation Types Of Initial-Stops

Abstract  F0 and the F0 curves of the phonation types of initial stops in six Asian languages were examined. The cross-language study shows that a difference in the phonation type of stops has a different effect on the F0 perturbations of the following vowels. The general trend that voiceless types are associated with a higher F0 and voiced types with a lower F0 was observed in these languages, but some language-specific characteristics were observed in Thai and Hindi. The phonetic characteristics of each phonation type were examined.

PDF
656--661 Janet Fletcher, Jonathan Harrington and John Hajek Phonemic Vowel Length And Prosody In Australian English

Abstract  Previous acoustic studies of Australian vowels note the similarity of formant patterns in the vowels /a:/ and /A/. Most linguists and phoneticians describe this as an example of a vowel quantity versus vowel quality contrast. Two experiments were conducted to examine the acoustic and articulatory characteristics (i.e. jaw displacement) of these vowels in accented and unaccented environments. While significant formant and jaw height differences were observed in many instances, it appears these patterns can be interpreted as examples of articulatory undershoot, as opposed to inherent vowel quality differences.

PDF
662--667 J. Hajek Phonological Length And Phonetic Duration In Bolognese: Are They Related?

Abstract  The phonetic basis of a reported phonological correlation between stressed vowel and post-tonic consonant length is examined for the first time. Whilst a vowel length distinction is confirmed for all subjects, a correlation between vowel and consonant duration is not universal.

PDF
668--673 Corinne Roberts Speech Rate Effects On Duration: An Articulatory Analysis

Abstract  Duration differences resulting from speech rate changes are assessed from three perspectives: linear rescaling, gestural overlap and planned shortening. A method of analysis is provided to distinguish between linear rescaling and gestural overlap. The results of the research suggest a range of processes create the durational differences.

PDF
674--679 Stefanie Jannedy Prosodic And Segmental Influences On High Vowel Devoicing In Turkish

Abstract  Prosodic and segmental factors such as rate, stress, preceding environment, following environment, and vowel and syllable type which influence the process of vowel devoicing in Turkish are described and evaluated. Results are contrasted with findings for Japanese and Korean. Browman & Goldstein's Gestural Score Model (1990) is employed to explain the data.

PDF

Speaker Recognition

Pages Authors Title and Abstract PDF
682--687 Jianming Song Enhancement Of Hmm Through Discriminative Analysis

Abstract  Although HMM approaches have been extensively studied in the context of speech recognition for nearly two decades, there still appears to be room for improving the discrimination capability within the HMM framework, from both the training side and the recognition side. The algorithm proposed in this paper integrates the concepts of variable frame rate and discriminative analysis to modify the conventional Viterbi algorithm, in such a way that steady or stationary signal is compressed, while transitional or non-stationary signal is emphasized, through the frame-by-frame searching process. The usefulness of each frame is decided entirely within the Viterbi process and need not be the same for different models. To evaluate this algorithm, we tested a speech database of the highly confusable E-set English letters. With 5 states and 6 mixture components, the conventional HMM baseline system only delivered a recognition accuracy of 73.9%. Through the use of the algorithm proposed in this paper, the recognition accuracy was increased to 82.5%.

PDF
688--693 G. Platt and M.D. Alder A Dynamic Causal Filter Approach To Speech Trajectory Segmentation

Abstract  We consider the possibility of segmenting speech into phonemic segments using an idea that stems from the concept of Linear Predictive Coding. Linear Trajectory Predictive Segmentation, or LTPS, involves performing linear predictive analysis on each dimension of the speech trajectory, and assuming that points in the trajectory where the error of our prediction is large correspond to points where phonemic transitions are occurring. LTPS has two basic parameters that alter the prediction made. The ability of this method to accurately segment speech into phonemes is analysed for various combinations of these parameter values.

PDF
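A hedged sketch of the idea (per-dimension linear prediction over a feature trajectory, with large prediction error flagged as a candidate phonemic transition) follows; fitting the predictor by least squares over the whole utterance and thresholding at mean plus two standard deviations are assumptions, not the authors' parameters.

    import numpy as np

    def prediction_error(traj, order=3):
        """Per-frame linear prediction error summed over feature dimensions.
        `traj` has shape (frames, dims), e.g. a cepstral trajectory."""
        traj = np.asarray(traj, dtype=float)
        T, D = traj.shape
        err = np.zeros(T)
        for d in range(D):
            x = traj[:, d]
            # Lagged design matrix: predict x[t] from x[t-1] ... x[t-order].
            X = np.column_stack([x[order - k - 1:T - k - 1]
                                 for k in range(order)])
            a, *_ = np.linalg.lstsq(X, x[order:], rcond=None)
            err[order:] += (x[order:] - X @ a) ** 2
        return err

    def candidate_boundaries(traj, order=3, z=2.0):
        e = prediction_error(traj, order)
        return np.flatnonzero(e > e.mean() + z * e.std())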
694--699 Hoi-Rin Kim, Kyu-Woong Hwang, Nam-Yong Han, and Young-Mok Ahn Korean Continuous Speech Recognition System Using Context-Dependent Phone Schmms

Abstract  This paper presents a Korean continuous speech recognition system using the phone-based semi-continuous hidden Markov model (SCHMM, also known as a tied-mixture model) method. The system has the following three features. First, an embedded bootstrapping training method was used that enables us to train each phone model without a phoneme segmentation database. Second, for HMM parameter estimation, a hybrid estimation method was proposed, composed of the forward-backward algorithm within phoneme boundaries and the Viterbi algorithm to determine those phoneme boundaries. Third, a between-word modeling technique at word boundaries was used to handle the strong coarticulation between a Korean postpositional word and its preceding word. The task domain of the system is query sentences for hotel reservation, with 244 words including digits, the English alphabet, etc. The speech database for simulation consists of two parts: one is word data pronounced once by 51 male speakers; the other is a set of 5610 different sentences pronounced by the 51 speakers. We have defined 339 context-dependent phone models based on the triphone model for the pronunciation dictionary of each word. For the phone models, we use both the DHMM and SCHMM methods for performance comparison, and define a model topology with 3 states and 8 transitions including skip transitions. The silence model has an additional null transition. We use four feature vectors: LPC cepstrum with a bandpass lifter, delta cepstrum, delta-delta cepstrum, and energy (logE, delta logE, and delta-delta logE). For HMM training, two training stages were applied to the word and sentence data. In the recognition stage, a finite state grammar for language modeling and the Viterbi beam algorithm for search were used. In speaker-independent recognition experiments, the discrete HMM (DHMM) method resulted in 89.7% word accuracy and the SCHMM method in 89.0%.

PDF
700--705 Andrew Hunt Introducing Prosodic Constraints To Stochastic Language Modelling

Abstract  Recent work has shown that prosodic knowledge can be used in conjunction with syntactic analysis to improve continuous speech recognition accuracy and to implement speech understanding by resolving syntactic ambiguity. One major limitation of such work is that it is based on deterministic parsing techniques which are difficult to integrate closely with HMM-based recognition systems. This paper presents a novel way of extending conventional stochastic language modelling techniques to incorporate prosodic constraints with the aim of reducing perplexity; this produces a Stochastic Prosodic Language Model (SPLM). This approach has three major advantages over previous prosody-syntax work: training is entirely unsupervised, no prosodic or syntactic analysis or labelling of training data is required, and the prosodic-syntactic constraints can be directly (and easily) employed in the Viterbi search of a speech recogniser. This paper presents the theoretical background to the SPLM arising from previous work on deterministic approaches, and presents the detail of the SPLM, which uses a combination of conventional stochastic language modelling techniques and linear discriminant analysis.

PDF
706--712 Shuping Ran, Bruce Millar, William Laverty, Iain Macleod, Michael Wagner and Xiaoyuan Zhu Speaker Recognition Using Continuous Ergodic Hmms

Abstract  This paper aims to investigate the effect of training the transition probabilities of Continuous Ergodic Hidden Markov Models (CEHMMs), and the effect of the choice of the number of states and number of mixtures for CEHMMs, when using them in a speaker recognition task. Speaker recognition experiments with and without training of the transition probabilities, using different combinations of numbers of states and mixtures, were carried out. The length of training and testing utterances was from 0.3 to 2.5 seconds, with an average length of about one second. Using a different data set, the results confirm Matsui and Furui's finding that the total number of mixtures (i.e. the product of the number of states and the number of mixtures per state) is an important parameter in determining speaker recognition performance. Training of the transition probabilities did not improve the overall recognition rate. We suggest a possible explanation for our failure to derive any benefits here.

PDF

Acoustic Phonetics II

Pages Authors Title and Abstract PDF
714--719 Mary O'Kane, P.E.Kenne, Hamish Pearcy, Tim Morgan, Gail Ransom and Kathryn Devoy On The Feasibility Of Automatic Punctuation Of Transcribed Speech Without Prosody Or Parsing

Abstract  This paper describes an investigation of the effectiveness of statistical methods for automatic punctuation of transcribed speech. Most work carried out on automatic punctuation is based on prosodic or syntactic analysis. Here, however, we decided to investigate the strengths and weaknesses of automatic punctuation based solely on the more simply-calculated probabilities of the collocation of certain words or groups of words with different punctuation marks, augmented by some simple heuristic rules. Using the techniques described in this paper, just over half the correct positions of punctuation marks can be found, and about 42% of all the expected punctuation marks are correctly placed, at the cost of 7% of the total number of expected marks being incorrect insertions. It has been found that when deriving the statistical training data for the punctuator, it is important to use text of the same style as that which is to be punctuated automatically.

PDF
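A minimal sketch of collocation-based punctuation in Python is shown below for orientation; conditioning mark probability only on the preceding word and the 0.5 threshold are assumptions, and the paper's heuristic rules are omitted.

    from collections import Counter, defaultdict

    def train(punctuated_tokens):
        """Estimate P(mark | preceding word) from a token stream in which
        punctuation marks appear as their own tokens."""
        marks = {",", ".", "?", "!", ";", ":"}
        counts = defaultdict(Counter)
        for prev, tok in zip(punctuated_tokens, punctuated_tokens[1:]):
            if prev not in marks:
                counts[prev][tok if tok in marks else None] += 1
        return counts

    def punctuate(words, counts, threshold=0.5):
        """Insert the most likely mark after a word when its estimated
        probability exceeds the threshold."""
        out = []
        for w in words:
            out.append(w)
            c = counts.get(w)
            if c:
                mark, n = max(((m, k) for m, k in c.items() if m),
                              default=(None, 0), key=lambda p: p[1])
                if mark and n / sum(c.values()) > threshold:
                    out.append(mark)
        return out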
720--725 P.E. Kenne, M.J. O'Kane and H. Pearcy Some Experiments Involving The Annotation Of A Large Speech And Natural Language Database

Abstract  We report on the effect of different interfaces on the manual adjustment of a segmentation of a speech waveform provided by an automatic segmentation process.

PDF
725--730 Simon Hawkins, Iain Macleod and Bruce Millar An Ab Initio Analysis Of Relationships Between Cepstral And Formant Spaces

Abstract  Building on earlier work (Hawkins & Clermont, 1990), this paper uses simple Artificial Neural Networks to learn how to map points representing a single speaker's vowels in a 12-dimensional cepstral space into points in a 3-D formant space. We find that an ANN with only six hidden units can learn this mapping entirely on the basis of the supplied training data. We then analyse the operation of individual hidden units to try to discover the means by which the ANN is able to map input to output points so successfully. The suggestion from this analysis is that the F1 coordinate in the output space is estimated using a linear combination of input coefficients, but that the F2 and F3 coordinates are estimated by piecewise-linear mappings. It appears that one or other of the alternative linear mappings is selected in estimating the latter coordinates according to whether F2 is greater or less than 1350 to 1400 Hz or so, corresponding to the traditional front/back distinction. Having gained some unbiased hints about the possible structure of the relationship between vowel representations in cepstral and formant spaces, we then proceed to evaluate these hints by means of multiple linear regression analysis. The overall results confirm our somewhat limited analysis of the operation of the trained ANN.

PDF
731--736 K.L. Jenkin and M.S. Scordilis Automatic Methods Of Syllable Stress Classification In Continuous Speech

Abstract  This paper addresses the task of classifying syllables from continuous speech into primary stress, secondary stress and zero stress categories using artificial neural networks. The results compare favourably with other studies performed in this area while highlighting the problem of classifying the secondary stress class.

PDF
737--742 Cioni Lorenzo A Simple Program For The Visualisation Of F0

Abstract  The main aim of the present paper is to present a simple program (1), developed at our Laboratory, named Pit. The scope of the program ranges from visualisation to editing of F0. Pit allows a user to visualise up to six F0 graphs within a single window and to perform on each of them editing operations (such as clear or cut), measurements, justification and splitting.

PDF

Speaker Verification

Pages Authors Title and Abstract PDF
744--749 J.Bruce Millar, Fangxin Chen, Iain Macleod, Shuping Ran, Hong Tang, Michael Wagner and Xiaoyuan Zhu. Overview Of Speaker Verification Studies Towards Technology For Robust User-Conscious Secure Transactions

Abstract  The aim of this paper is to provide background material relating to the speaker verification work of the Technology for Robust User-conscious Secure Transactions project. It describes the philosophy of this work as it is expressed in the design and collection of a speech data corpus, and the selection of speech analysis and speaker modelling techniques. It then summarises the range of experiments that have been performed using these data and techniques to explore many issues pertinent to effective speaker verification. This paper sets the scene for six other papers appearing in these proceedings.

PDF
750--755 Iain Macleod, Fangxin Chen, Bruce Millar and William Laverty Optimal Cohort Design In Vq-Distortion Based Text-Independent Speaker Verification

Abstract  The "cohort normalised" method of speaker verification computes for each input utterance its relative distance from models of the client and a cohort of speakers drawn from the same population. It is assumed that variations which reduce the utterance fit to the client model will tend to have similar effects with respect to the cohort speaker models. The use of "relative distance" can lead to improved client/impostor discrimination. This paper explores several issues related to the design of suitable cohorts. Using VQ codebooks in multidimensional cepstral space as the basic speaker models, we show that pairs of codebooks can be related geometrically in terms of vector differences between their centroids in cepstral space. In a well-designed cohort, the cohort members give adequate "coverage" of the clients codebook in multidimensional space. Cohort members are usually chosen on the basis of their similarity to the client. Experiments in which cohort members were instead chosen according to their position relative to the client led to a slight improvement in verification performance, suggesting that joint consideration of similarity and position would give even better results. However, with a limited set of speakers, it will often be difficult to find cohort members who meet these simultaneous requirements. Preliminary tests indicate that in certain cases it may well prove possible to synthesise suitable "phantom" codebooks based on those of real speakers.

PDF
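The relative-distance idea can be sketched compactly in Python: score an utterance by the client's VQ distortion relative to the mean distortion over the cohort codebooks. The subtraction below is one plausible normalisation, not necessarily the paper's exact form, and the threshold is left to the caller.

    import numpy as np

    def vq_distortion(frames, codebook):
        """Mean distance from each feature frame to its nearest codeword."""
        d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
        return d.min(axis=1).mean()

    def cohort_normalised_score(frames, client_cb, cohort_cbs):
        """Client distortion relative to the mean cohort distortion; accept
        the identity claim when the score falls below a tuned threshold."""
        client = vq_distortion(frames, client_cb)
        cohort = np.mean([vq_distortion(frames, cb) for cb in cohort_cbs])
        return client - cohort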
756--761 Xiaoyuan Zhu, Bruce Millar, Iain Macleod and Michael Wagner Speaker Verification: Beyond The Absolute Threshold

Abstract  This paper examines the use of "discriminating probabilities" for text-dependent speaker verification. Conventional left-to-right continuous hidden Markov models are used to represent client speaker models, with the whole impostor speech database being considered during training of client models. In the speaker verification stage, the claimed speaker's discriminating probability is calculated based both on their own trained model and on the impostor distribution. The proposed method attempts to avoid using an absolute threshold. No additional training data for each client is required for this method, and there is only a minor increase in computational overheads. One promising application of discriminating probabilities is in conjunction with cohort-normalised speaker verification, to help reduce the problem of false acceptances of dissimilar impostors associated with the latter approach.

PDF
762--767 Shuping Ran, William Laverty, Bruce Millar, Iain Macleod and Michael Wagner Estimation Of False Acceptance Rate In Speaker Verification

Abstract  In this paper, we show how the false acceptance rate depends on inter-speaker distances and propose a novel technique for the estimation of the False Acceptance Rate (FAR) in speaker verification. The FAR estimate is based on the statistical technique of "bootstrapping". First, a pairwise FAR is obtained as a function of the pairwise inter-speaker distances between each client and impostor pair. Second, the inter-speaker distance distribution is estimated. The FAR is calculated by combining these two.

PDF
768--773 Hong Tang, Xiaoyuan Zhu, Bruce Millar, Iain Macleod and Michael Wagner Robust Speaker Verification In Noisy Environments

Abstract  This paper considers the problem of robust parametric model estimation and classification in noisy acoustic environments. Characterisation and modelling of external noise sources is in itself an important issue in noise compensation. The techniques described here provide a mechanism for deriving parametric models of acoustic backgrounds along with the signal model so that noise compensation is tightly coupled with signal model training and classification. Prior information about the acoustic background process is provided using a maximum likelihood parameter estimation procedure. A max noise model is defined for use in the cepstral domain. We describe experimental evaluation of the benefits gained by applying this approach to text-independent speaker verification. With short speech utterances and various noise environments, useful improvements in speaker verification performance were obtained.
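
The abstract defines its max noise model in the cepstral domain but gives no formula; the sketch below shows only the generic "max" approximation in the log-spectral domain (cepstra being related to log spectra by a discrete cosine transform), with all names our own.

```python
import numpy as np

def max_model(speech_log_spec, noise_log_spec):
    """Generic "max" approximation to noisy speech: in each log-spectral
    channel the observation is dominated by whichever of the signal or
    the noise is more energetic.

    Both inputs are (T, F) arrays of log filterbank energies; a cepstral
    version would apply a DCT to the result.
    """
    return np.maximum(speech_log_spec, noise_log_spec)
```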

PDF

Audition

Pages Authors Title and Abstract PDF
776--781 Graeme K. Yates Dynamic Range Compression In The Cochlea: Experiments And Models

Abstract  Experimental results from this and other laboratories have demonstrated that compression of dynamic range exists at the level of the basilar membrane mechanics and is clearly reflected in the response characteristics of auditory nerve fibres. We have previously shown that plots of the rate of generation of nerve action potentials against sound pressure level (rate-intensity functions, or RI functions) range over a continuum, reflecting the existence of fibre types extending from rapidly saturating ones with dynamic ranges around 20 dB sound pressure level (SPL) to non-saturating types with dynamic ranges in excess of 60 dB SPL. Together, these two types have been shown to cover a range of at least 100 dB. We have also used the neural RI functions to show that the degree of compression appears to extend from around 0.3 (dB/dB) in the low-frequency region of the cochlea to around 0.12 at high frequencies. This suggests that the cochlea compresses a 60 dB range of sound pressure levels into as little as 8 dB of basilar membrane movement. Now we have applied the boundary element technique of solving velocity-potential equations to a two-dimensional model of the cochlea, and have produced solutions which match extremely well with direct observation of basilar membrane mechanics. By including a nonlinearity in the modelled active process, a process of mechanical positive feedback widely believed to operate in the cochlea, we have also shown compression of the modelled basilar membrane input-output function. Slopes as low as 0.12 have been calculated from a simple Boltzmann operating characteristic assumed to be driving the positive feedback through the cochlear outer hair cells. We now believe we have an explanation for the cochlea's ability to process input signals varying over a very wide range of intensities.
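
As a gloss on the figures quoted above (our arithmetic, not the authors'): with a compressive slope of $s \approx 0.12$ dB/dB, an input range of $\Delta_{\mathrm{in}} = 60$ dB maps to $\Delta_{\mathrm{out}} = s\,\Delta_{\mathrm{in}} \approx 7.2$ dB of basilar membrane response, consistent with the "as little as 8 dB" claim.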

PDF
782--787 Michael Oerlemans and Peter Blamey Multisensory Speech Perception: Integration Of Speech Information

Abstract  Speech perception usually occurs in the context of more than one source of sensory information. The effectiveness of speech perception in these multisensory situations depends on how successfully two or more sources of information are combined. An analysis of auditory-visual (AV) speech perception suggests a model of multisensory integration which argues that the extent of integration depends on the physical and cognitive characteristics of the signals which are combined. It is proposed that such a model can explain observed differences between AV, auditory-tactile (AT) and visual-tactile (VT) combination.

PDF
788--793 Peter J Blamey and Elvira S Parisi Pitch And Vowel Perception In Cochlear Implant Users

Abstract  Two methods of determining the pitch or timbre of electrical stimuli in comparison with acoustic stimuli are described. In the first experiment, the pitches of pure tones and electrical stimuli were compared directly by implant users who have residual hearing in the non-implanted ear. This resulted in a relationship between frequency in the non-implanted ear and position of the best-matched electrode in the implanted ear. In the second experiment, one- and two-formant synthetic vowels, with formant frequencies covering the range from 200 to 4000 Hz, were presented to the same implant users through their implant or through their hearing aid. The listeners categorised each stimulus according to the closest vowel from a set of eleven possibilities, and a vowel centre was calculated for each response category for each ear. Assuming that stimuli at the vowel centres in each ear sound alike, a second relationship between frequency and electrode position was derived. Both experiments showed that electrically-evoked pitch is much lower than that produced by pure tones at the corresponding cochlear location in normally-hearing listeners. This helps to explain why cochlear implants with electrode arrays that rarely extend beyond the basal turn of the cochlea have achieved high levels of speech recognition in postlinguistically deafened adults without major retraining or adaptation by the users. The techniques described also have potential for optimising speech recognition for individual implant users.

PDF
794--799 Ambikairajah, E., McDonagh, B. An Active Model Of The Auditory Periphery With Realistic Temporal And Spectral Characteristics

Abstract  This paper describes three auditory models which attempt to reproduce accurate temporal and spectral behaviour in a manner which is not computationally excessive. A particular problem in modelling the cochlea is to compress the input Sound Pressure Level (SPL) range into a smaller range of values (the compression ratio for a human is approximately 2.5:1), while maintaining realistic cochlear bandwidths and latencies. This paper describes a transmission line model which aims to incorporate such a compression ratio into the model, and compares its performance with that of two alternative parallel filterbank models.

PDF
800--805 P.F. McCormack, J.C. Ingram Speech Motor Control In Ataxic Dysarthria

Abstract  Ataxic dysarthria has unique prosodic characteristics in which the production of linguistic stress and speech rhythm are disturbed. Words and syllables are usually produced at an extremely slow rate, with long pauses between them. There is often the perception that words and syllables are being produced with "equal stress". Kent, Netsell & Abbs (1979) suggested that this disturbed prosodic expression may reflect an impairment in anticipatory motor programming as well as any associated difficulties with motor execution. An experiment is reported on the production of "stress shifts" requiring anticipatory planning by 10 speakers with ataxic dysarthria, compared to 10 matched speakers with normal speech production. Comparison with normal speakers at normal and slow rates of speech indicates that the speakers with ataxic dysarthria, unlike the control subjects, do not take account of the position of the main stress in the following word in their production of the shift words. This pattern is consistent with a disruption to motor programming, and is consistent with current knowledge about the functions of the cerebellum in motor control.

PDF

Speech Databases

Pages Authors Title and Abstract PDF
808--813 P.E. Kenne, M.J. O'Kane and H. Pearcy An Australian Speech Database Derived From Court Recordings

Abstract  We describe a pilot version, and give a statistical characterization, of a large speech and natural language database that is being developed from recordings of court proceedings and their transcripts.

PDF
814--819 Dong K. Kim Automatically Assisted Annotation Of The Australian National Speech Database

Abstract  This paper describes a technique for automatically assisted segmentation and labelling of continuous speech using CDHMMs. Given the orthographic transcription of the speech, the system converts this into phonetic transcriptions which are used as the recognition network for segmentation and labelling, and then performs forced state alignment for the speech data. Results based on the manual and the automatic transcriptions, with and without word boundary information, are compared. Effects of word junction and length of state for each phone are also discussed. Our best result shows 80.08% and 93.04% correct boundary placement within 10 ms and 20 ms of the manual boundary respectively.
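
The boundary-placement figures above correspond to a simple tolerance test. The sketch below shows one way such agreement percentages could be computed, assuming the automatic and manual boundaries have already been paired up; this is our illustration, not the paper's evaluation code.

```python
import numpy as np

def boundary_agreement(auto_ms, manual_ms, tol_ms):
    """Fraction of automatic boundaries lying within tol_ms of the
    corresponding manually placed boundary (boundaries assumed paired)."""
    auto = np.asarray(auto_ms, dtype=float)
    manual = np.asarray(manual_ms, dtype=float)
    return float(np.mean(np.abs(auto - manual) <= tol_ms))

# boundary_agreement(auto, manual, 10) and boundary_agreement(auto, manual, 20)
# would yield figures comparable to the 80.08% and 93.04% reported above.
```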

PDF
820--825 Yaxin Zhang, Mylene Pijpers, Roberto Togneri, Mike Alder Cdigits: A Large Isolated English Digit Database

Abstract  This paper describes a large isolated English digit database which was designed for the training and evaluation of statistical algorithms and neural networks. 1108 speakers (575 males and 533 females) were recorded on the UWA campus in an office environment.

PDF
826--831 M.Bijankhan, J.Sheikhzadegan, M.R.Roohani, Y.Samareh, K.Lucas, M.Tebyani Farsdat - The Speech Database Of Farsi Spoken Language

Abstract  The FARSDAT database has been produced for speech and speaker recognition purposes, and for linguistic research as well. It consists of 386 sentences read aloud by 300 native Farsi speakers, each belonging to one of ten dialect regions. Two types of sentence are included. Type one is composed of 384 phonetically balanced sentences to provide allophones in all phonetic contexts. Type two is composed of two sentences to allow dialect comparison. Each speaker read twenty sentences in two sessions. All utterances were divided into two corpora: training and recognition. The speech was sampled at 44.1 kHz by 16-bit Sound Blaster hardware on IBM microcomputers. 2000 utterances have been manually segmented and labelled, providing a vast number of computer files of different isolated words, syllables, allophones and some phone sequences.

PDF

Speech Databases And Speaker Verification

Pages Authors Title and Abstract PDF
834--839 Wenxian Li, Yiqing Zu, Chorkin Chan A Chinese Speech Database (Putonghua Corpus)

Abstract  In this paper, a Chinese speech database (Putonghua corpus) is introduced which has been constructed at HKU. It consists of isolated syllables, words, digit strings and sentences read by a total of 20 native speakers, ten females and ten males. No systematic effort to construct a comprehensive Chinese speech database has been reported before, so this corpus is of great importance as a potential standard. It supplies a large-scale, professionally built common test-bed for Putonghua recognizers.

PDF
840--845 R.E.E. ROBINSON Synthesising Facial Movement: Data Base Design

Abstract  This article describes the design and construction of a database for lip movement during speech. A set of Australian consonants and vowels were arranged into nonsense words and recorded on cine film and video tape. The images of these words were then digitised and analysed to form diphone pairs. The visual diphones were arranged in a database with an entry lookup table for access and a set of transition tables of points of similarity. This allowed phonetic strings to be translated into lip movement sequences.
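
The lookup-table scheme just described can be pictured as a dictionary from diphone keys to stored frame sequences. The structures below are hypothetical; the paper does not specify its table formats.

```python
from typing import Dict, List, Tuple

DiphoneKey = Tuple[str, str]   # e.g. ("a", "b") for the /a/ -> /b/ transition
FrameSeq = List[int]           # indices of digitised image frames

def phones_to_lip_sequence(phones: List[str],
                           lookup: Dict[DiphoneKey, FrameSeq]) -> FrameSeq:
    """Translate a phonetic string into a lip-movement sequence by
    concatenating the stored visual diphones for successive phone pairs."""
    frames: FrameSeq = []
    for left, right in zip(phones, phones[1:]):
        frames.extend(lookup[(left, right)])
    return frames
```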

PDF
846--849 Hitoshi Ihara, Hiroyuki Kamata and Yoshihisa Ishida Speaker Identification Using Neural Networks

Abstract  In this paper we describe a speaker identification system using neural networks. We have found that individual characteristics are present mainly in the mid-frequency band of the spectrum. We use this frequency band of the spectrum together with the fundamental frequency. We show that the individual characteristics are included in that frequency band and consider a practical speaker recognition system.

PDF
850--855 J. Bruce Millar, Fangxin Chen and Michael Wagner The Efficacy Of Cohort Normalisation In A Speaker Verification Task Under Different Types Of Speech Signal Variance

Abstract  This paper examines the influence of three types of speech signal variance on the performance of a text-independent speaker verification system, and the efficacy of cohort normalisation as a technique to compensate for this influence. The three forms of variance comprise that generated by repetition of utterances over time, the addition of extraneous noise to test utterances, and the inclusion of test utterances which use phonemic sequences that do not occur in the training data. A statistical analysis of the results for a gender-balanced set of 20 client/impostor speakers and an independent population of 25 cohort speakers is presented. The results indicate that conventional cohort normalisation is moderately successful against repetition variance and phonetic variance but gives no significant improvement in the presence of moderate levels of noise. A hybrid form of cohort normalisation is shown to combat the former two variance types very effectively and to provide a weakly significant improvement for moderate levels of noise-induced variance.

PDF

Speech Tools

Pages Authors Title and Abstract PDF
858--863 Cioni Lorenzo A Data Base For Speech Signal Processing

Abstract  The present paper gives a brief description of an approach to the creation and administration of large corpora of data to be used for speech signal processing. It also defines persistent and volatile objects, and describes both their structure and the programs that take care of their administration.

PDF
864--869 D. Farrokhi, R. Togneri, Y. Zhang, and Y. Attikiouzel Real Time Voice Processing (Voice/Speaker Recognition)

Abstract  Communication systems form one of the biggest industries in the world. Today, human communication around the globe depends greatly on interaction with computers. One of the most exciting fields of research, which could revolutionise human interaction with computers, is Voice Recognition. So far, research has had reasonable success in providing algorithms for the encoding of speech data segments and the classification of these sequences of segments. Some of the most important challenges are those which attempt to produce real-time automatic voice recognition. Real Time Voice Recognition (RTVR) was not possible for a long time; however, since Digital Signal Processing (DSP) chips such as the TMS320C30/40, i860XP and i860XR have been produced, it is now possible to process the speech signal as fast as a word is spoken. This is done by embedding the pre-processing of the speech signal on a daughter board, and performing post-processing and final processing on the mother board. In this paper, I discuss some problems which can arise in any real-time embedded program. One question addressed in this paper is the minimum processing speed required for pre-processing of speech data. Another question concerns the requirements of the post-processing module. The last topic is the overall speed requirement of the complete Real Time Voice Recognition system.

PDF
870--875 Fumitake SUGANO, Tomoyuki MIZUTANI, Ayano SASAKI, Takefumi KITAYAMA, Hiroyuki KAMATA and Yoshihisa ISHIDA Speech Training System For Hearing Impaired Children Using Technology Of Voice Recognition

Abstract  This paper describes a speech training system called "Speech Trainer" which runs on a personal computer. For over twenty years, our laboratory has been developing speech training systems for the deaf. This system has already been used by about 80 percent of schools for the deaf in Japan. The system presents characteristics of the trainer's voice on a CRT display and has functions such as vowel training, consonant training, pitch accent training using fundamental frequency, articulation training using vocal tract shape and so on.

PDF