Proceedings of SST 1988
Page numbers refer to nominal page numbers assigned to each
paper for purposes of citation.
Synthesis I
Pages |
Authors |
Title and Abstract |
PDF |
2--7 |
Rolf Carlson, Bjorn Granstrom and Sheri Hunnicutt |
Rulsys - The Swedish Multilingual Text-To-Speech Approach
Abstract
Speech synthesis has been a major field of research at our department for several decades. The projects range from basic research on speech production models to applications of speech technology, e.g., for handicapped persons. In this contribution we will concentrate on the development strategies, describe the development environment and discuss some recent results. The synthesis is based on a combination of modules including lexica and rule components. Even if the number of components is about the same for different languages, the emphasis on the different parts varies considerably due to language structure. Rule development is done in the generative phonology tradition. The development system, originally written for a different computer, has now been moved to our network of Apollo workstations and integrated with speech analysis and resynthesis software. Expanded use of morphological and syntactical analysis has proved useful in several languages. Recent experiments with an expanded synthesis model including a more realistic voice source, the LF-model, have given new possibilities to vary both speaker type and speaking style.
|
PDF |
8--13 |
Clive D Summerfield, and Marwan A Jabri |
A Formant Speech Synthesiser Asic: Functional Design
Abstract
This paper is the first of two companion papers on the design and implementation of a multi-channel formant speech synthesiser Application Specific Integrated Circuit (ASIC). The objective of this research is the development of an efficient VLSI structure which can be implemented as a single VLSI device, yet retains the acoustical performance necessary to generate high-quality, highly intelligible synthetic speech and has sufficient processing bandwidth for multi-channel operation. This paper concentrates on the functional design of a VLSI formant speech synthesis structure for achieving these objectives.
|
PDF |
14--20 |
Marwan A Jabri, Kiang Ooi Tan and Clive D Summerfield |
A Formant Speech Synthesiser Asic: Implementation
Abstract
This paper is the second of two companion papers on the design and implementation of a multichannel formant speech synthesiser Application Specific Integrated Circuit (ASIC). In this paper, we describe the implementation aspects of the project. The use of the silicon compiler FIRST in the conception of the synthesiser has reduced the design time considerably. However, as the 5 µm NMOS primitive library of FIRST falls short in providing sufficient processing bandwidth for multi-channel operation, we implemented a new primitive library using the European Silicon Structure (ES2) standard cell design tool (2 µm CMOS). The library is implemented using the MODEL hardware description language. The implementation of the primitives using MODEL is discussed in detail together with the clocking and data synchronisation strategies necessary for reliable high-speed bit-serial operation.
|
PDF |
Perception I
Pages |
Authors |
Title and Abstract |
PDF |
22--27 |
R. H. Mannell |
Perceptual Space Of Male And Female Australian English Vowels
Abstract
This study investigates the phonemic space of synthetic male and female vowel tokens as perceived by native speakers of Australian English. The data were also examined for evidence of vowel normalisation.
|
PDF |
28--34 |
U Thein-Tun |
The Gender And Individual Variations In Processing Linguistic-Phonetic Cues
Abstract
The perception of integrated phonetic cues by males and females was investigated at five levels of information processing. The integrated phonetic cues were the intensity and duration of voice-onset-time in relation to the intensity of the following vowel for the syllable-initial /d/-/t/ distinction. The five levels of information processing were the auditory, phonetic, syllable, word, and sentence levels. The results demonstrate that listeners who cannot effectively process the cues at the auditory and phonetic levels can process them very effectively at the sentence level, and vice versa. Most female listeners belong to the former group.
|
PDF |
Coding I
Pages |
Authors |
Title and Abstract |
PDF |
36--41 |
M.J. Flaherty |
On The Representation Of Time-Varying Lpc Parameters By Cubic Splines With Variable Knots
Abstract
The modelling of log area coefficients by cubic splines with variable knots is discussed. Results are presented which compare variable and uniform knot modelling for a connected digit utterance spoken over a telephone handset using least squares error and spectral difference measures.
|
PDF |
42--47 |
H Gondokusumo and T.S. Ng |
Subband Coding Of Speech Using M-Band Parallel Quadrature Mirror Filters
Abstract
In this paper, the design of both analysis and synthesis filters for a 3-band parallel sub-band coder with perfect reconstruction properties is given. An example of this design using FIR filters of 17th order is implemented on an IBM PC/AT. Experimental results using this subband coder on segments of speech will be demonstrated in the presentation.
|
PDF |
48--53 |
A. W. Johnson and A. B. Bradley |
The Effect Spectral Modifications Have Upon The Performance Of Frequency Domain Coders
Abstract
The effect of spectral modifications on the performance of low data rate frequency domain coders employing overlapped transform operations and the Ramstad bit-allocation procedure is the introduction of distortion into the recovered signal. While this distortion cannot be removed, certain aspects of it can be controlled. This theoretical analysis suggests the design of a new bit-allocation procedure based upon the Ramstad majority vote rule (Ramstad 1984). This is the subject of ongoing research which, it is hoped, will improve the overall subjective quality of speech recovered from a low data rate frequency domain coder.
|
PDF |
54--59 |
S N Koh and P. Sivaprakasapillai |
Analysis And Synthesis Method For Packet Speech Enhancement
Abstract
This paper describes a novel system which employs the filter bank analysis and synthesis method for the packetisation of speech for transmission in packet data communications. Computer simulations of the system indicate that a significant improvement in the perceptual quality of the recovered speech can be obtained, even with zero substitution, compared to the conventional technique of straight packetisation of PCM speech. Further improvement is possible through frequency-domain component replication.
|
PDF |
60--66 |
A. Perkis, B. Ribbum, T. Ramstad |
Improving Subjective Quality In Waveform Coders By The Use Of Postfiltering
Abstract
Adaptive postfiltering is shown to significantly enhance the perceived speech quality of medium bit rate coders. The postfilter, utilizing auditory masking properties, provides an adaptive shaping of the noise and signal spectra, thus reducing the perceived quantization noise level at the cost of introducing some extra signal distortion. This paper will discuss the effectiveness of postfiltering in three distinctly different coding schemes. These are Regular Pulse Excited linear predictive coding (RPE), representing LPC-based coding schemes, and Adaptive Sub-Band Coding (SBC) and Adaptive Transform Coding (ATC), representing frequency domain coders.
|
PDF |
Analysis I
Pages |
Authors |
Title and Abstract |
PDF |
68--73 |
Phil Rose |
Normalisation Of Tonal F0 From Long Term F0 Distributions
Abstract
An attempt is described to ascertain whether the F0 of 7 speakers' tones can be normalised using parameters from their long-term F0 distributions. It is shown that normalisation using the long-term mean and standard deviation is not as effective in reducing the between-speaker variance as normalisation with parameters derived from the tones themselves. However, the approach is still successful enough to be worth pursuing, and some suggestions for improvement are indicated.
|
PDF |
74--79 |
J.S. Chang and Y.C. Tong |
Development Of A Switched Capacitor Speech Spectrum Analyzer System Design
Abstract
The design of a novel Low Power Monolithic Time-Multiplexed Switched Capacitor Speech Spectrum Analyzer is described. Essential features are specified with comments on the reasons for the design decisions. An experimental four-channel spectrum analyzer has been fabricated, and measurements on prototypes show that the design specifications are satisfied.
|
PDF |
80--85 |
Michael G. Barlow and Michael Wagner |
Prosody As A Basis For Determining Speaker Characteristics
Abstract
A speaker identification experiment based on prosodic features is described. Five speakers recorded a set of four sentences in five separate sessions over a period of one week. For each of these utterances, the energy, fundamental frequency, voicing and linear prediction error contours were extracted. For each sentence (four) and each type of contour (four), distance measures based on dynamic time warping were calculated between all twenty-five (five speakers by five repetitions) contours. These distances were compared on an inter-speaker versus intra-speaker basis, and the ratio was generally found to be large. Parameters within the distance measuring process, namely warping window size and contour smoothing, were altered and the effects on speaker distances are discussed.
|
PDF |
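The inter-speaker versus intra-speaker comparison in the abstract above rests on dynamic time warping between contours. A minimal sketch of a DTW distance between two one-dimensional contours is given below; this is an illustrative textbook formulation, not the authors' implementation, and the window-size and smoothing parameters they varied are omitted.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D contours.

    Classic O(len(a) * len(b)) dynamic programme; the local cost is the
    absolute difference between samples.
    """
    INF = float("inf")
    n, m = len(a), len(b)
    # dp[i][j] = cost of the best warp aligning a[:i] with b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # stretch a
                                  dp[i][j - 1],      # stretch b
                                  dp[i - 1][j - 1])  # step both
    return dp[n][m]

# Identical contours warp onto each other at zero cost,
# and a time-shifted copy still aligns at zero cost.
print(dtw_distance([1, 2, 3, 2, 1], [1, 2, 3, 2, 1]))     # 0.0
print(dtw_distance([1, 2, 3, 2, 1], [1, 1, 2, 3, 2, 1]))  # 0.0
```

A speaker-distance experiment of the kind described would apply such a measure to every pair of extracted contours and compare the within-speaker and between-speaker averages.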
86--91 |
Kim E. A. Silverman |
Utterance-Internal Prosodic Boundaries
Abstract
This paper investigates minor prosodic boundaries that often occur in fluent speech, and yet are not well understood. A corpus was collected of utterances with a range of segmental structures, where each utterance was spoken both with and without such an internal boundary. Acoustic measurements of the utterances were then related to perceptual ratings of the salience of the boundaries. Results showed that the F0 fall from the preceding pitch accent is much steeper before a boundary, and that while these boundaries do not contain pauses, they do alter the temporal structure of the speech. The segmental material is lengthened, and the preceding F0 accent occurs considerably earlier relative to its accent-bearing syllable.
|
PDF |
92--97 |
Lori F. Lamel |
Spectrogram Readers' Identification Of Stop Consonants
Abstract
This paper reports on the performance of five spectrogram readers at identifying spectrograms of stop consonants extracted from continuous speech. The stops were spoken by 299 talkers and were presented in their immediate phonemic context. The task was designed to minimize the use of lexical and other higher sources of knowledge. The averaged identification rate across contexts ranged from 73-82% for the top choice, and 77-93% for the top two choices. The readers' performances were comparable to those of other spectrogram reading experiments reported in the literature; however, the other studies have typically evaluated a single subject on speech spoken by a small number of talkers.
|
PDF |
Recognition I
Pages |
Authors |
Title and Abstract |
PDF |
100--105 |
D.Rainton and S.J Young |
Consonant Recognition Using The Covariance Of The Pseudo Wigner Distribution
Abstract
It is generally accepted that the consonant of a consonant-vowel (CV) pair can be identified by the nature of the formant transitions in the vowel. STFT power spectral snapshots fail to capture the detailed time-varying nature of these transitions. In this paper we show that such spectra can be considered weighted time averages of the pseudo-Wigner distribution (PWD) when appropriate Gaussian windows are used in the computation of both. Given this interpretation, we then speculate as to whether the higher order statistics of the PWD convey additional consonant discriminant information. Experimental evidence indicates that they do.
|
PDF |
106--111 |
J. R. Sholicar and F. Fallside |
A Prosodically And Lexically Constrained Approach To Continuous Speech Recognition
Abstract
Psycholinguistic studies have indicated that prosodic cues play a vital role in human speech perception. The prosodic relationships which exist within an utterance are believed to provide fundamental cues for structuring the recognition process. However, in the majority of reported systems for the automatic recognition of continuous speech, prosodic cues are seldom used. In this paper, we review the evidence supporting the exploitation of prosodic cues, and discuss how such cues can be exploited within a machine recognition system to improve the segmental parsing strategy. A practical implementation is then proposed, in which prosodic structure is a major factor in the organisation of the recognition process. The architecture of the system is described and preliminary results relating to the current development of this system are discussed.
|
PDF |
112--118 |
P.Pierucci, A.Paladin |
Multistage Vector Quantization With Acoustic Constraints For Speaker Verification
Abstract
In this paper a new method to build multisection codebooks for the speaker recognition task is investigated. Different methods of threshold evaluation are then discussed for the proposed approach, and a comparison with single section VQ and previously reported Multisection VQ is discussed, in a fixed text speaker verification experiment.
|
PDF |
Production
Pages |
Authors |
Title and Abstract |
PDF |
120--125 |
Peter D. Neilson, Megan D. Neilson, and Nicholas J. O'Dwyer |
Redundant Degrees Of Freedom In Speech Control: A Problem Or A Virtue?
Abstract
It is well established that rapid, functionally specific compensations for unexpected perturbations occur in speech articulators remote from the site of the disturbance. We interpret this in terms of an adaptive controller which incorporates an inverse internal model of the sensory-motor system involved. By using "compliant" control, in which variables representing redundant degrees of freedom are set equal to the feedback of their actual values, sensory consequences crucial to a task can be protected from external disturbances. This subsumes the notions of coordinative structures and feedforward processes.
|
PDF |
126--131 |
David Slater |
Intrinsic Effects Of Voiced And Voiceless Unaspirated Prevocalic Stops On Fundamental Frequency In Luang Prabang Lao
Abstract
Three tones, of two native speakers of the Luang Prabang variety of Lao, are investigated as to the effects of the voicing or voicelessness of the initial consonant of a syllable on the fundamental frequency of the following vowel. Three vowels are used, and the intrinsic effect of vowel height on fundamental frequency observed. The results from this investigation support the hypothesis that the intrinsic effect of lowering of fundamental frequency after voiced consonants is minimised in tonal languages.
|
PDF |
132--137 |
Andrew Butcher |
Invariance And Variability In Tongue Contact Patterns: Electropalatographic Evidence
Abstract
An attempt was made to quantify the variability of tongue contact patterns at certain stages during the pronunciation of VCV sequences, as registered by an electropalatograph system. Of specific interest were: total contact area during the vowels and the rate of change of contact for consonant closures and releases. Results are discussed in the light of some current theories of coarticulation.
|
PDF |
138--144 |
Jinshi Huang |
An Arma Model Of Speech Production Process With Applications
Abstract
An autoregressive moving-average (ARMA) model of the speech production process is proposed. The orders of the model are determined from the generalized partial autocorrelation (GPAC) pattern. Based on maximum likelihood estimation, the parameters of the model are estimated via the Marquardt algorithm. Experiments show that fricatives can be better modeled as an ARMA process. The autocorrelation of the residuals can be used for pitch detection and voiced/unvoiced speech classification.
|
PDF |
Analysis II
Pages |
Authors |
Title and Abstract |
PDF |
146--151 |
Frantz Clermont |
A Dual Exponential Model For Formant Trajectories Of Diphthongs
Abstract
Australian English diphthongs are studied in terms of their second formant-frequency trajectories. These sigmoid-shaped trajectories may be decomposed, around a suitable breakpoint, into two exponential functions approaching two distinct vowel targets. In order to obtain this dual exponential representation, a set of candidate breakpoints defined along the inter-target transition is used to divide a given trajectory into two segments, thus simplifying the problem to that of fitting two single exponentials. A succession of such fits is performed, and the best pair of exponentials is determined in a root-mean-square sense. The method developed for constructing and evaluating the dual exponential model is described and illustrated. While the model fares well in a curve-fitting sense, its components do not always admit of a sensible phonetic interpretation in the case of an incomplete gesture towards the second vowel target.
|
PDF |
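The breakpoint search described in the abstract above can be sketched as a grid search over candidate breakpoints, fitting one exponential per segment and keeping the pair with the lowest root-mean-square error. This is an illustrative reconstruction only: the exact exponential form, the anchoring at the breakpoint, and the discrete rate grid are all assumptions, not the paper's method.

```python
import math

def segment_rmse(ts, ys, t_anchor, y_anchor, target, rates):
    """RMS error of the best exponential approach to `target`, anchored at
    the breakpoint: y = target + (y_anchor - target) * exp(-k |t - t_anchor|),
    with the rate k drawn from a small grid (a simplifying assumption)."""
    best = float("inf")
    for k in rates:
        sq = sum((y - (target + (y_anchor - target)
                       * math.exp(-k * abs(t - t_anchor)))) ** 2
                 for t, y in zip(ts, ys))
        best = min(best, math.sqrt(sq / len(ys)))
    return best

def dual_exponential_fit(ts, ys, target1, target2, rates):
    """Grid search over interior breakpoints: each candidate splits the
    trajectory into a segment approaching target1 (backwards in time) and
    one approaching target2 (forwards); keep the best-fitting pair."""
    best_bp, best_err = None, float("inf")
    for bp in range(2, len(ts) - 2):
        e1 = segment_rmse(ts[:bp + 1], ys[:bp + 1], ts[bp], ys[bp], target1, rates)
        e2 = segment_rmse(ts[bp:], ys[bp:], ts[bp], ys[bp], target2, rates)
        err = math.hypot(e1, e2)
        if err < best_err:
            best_bp, best_err = bp, err
    return best_bp, best_err

# Synthetic F2-like trajectory (hypothetical values): leaves ~1000 Hz,
# breakpoint at t = 10, then approaches 2200 Hz.
ts = list(range(21))
ys = [1000 + 500 * math.exp(-0.5 * abs(t - 10)) if t <= 10
      else 2200 - 700 * math.exp(-0.5 * (t - 10)) for t in ts]
bp, err = dual_exponential_fit(ts, ys, 1000, 2200, [0.2, 0.3, 0.5, 0.8])
print(bp, err)  # recovers the true breakpoint with near-zero error
```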
152--157 |
C. D. Summerfield |
Pole-Zero Analysis For The Detection Of Nasality
Abstract
The detection of nasality and the fine-class categorisation of nasal segments is important for the success of a phonetically based speech recognition machine. In this paper the application of a pole-zero modelling algorithm to this problem is described. It is well known that nasal segments are characterised, and may be classified, by the presence and location of the vocal tract transfer function zeros. The aim, in applying the pole-zero algorithm, is to elucidate this particular acoustic feature. However, as will be demonstrated, the zero response from the pole-zero algorithm shows a considerable amount of extra activity which cannot be attributed to the vocal tract zeros alone. These results indicate that the application of the pole-zero algorithm yields results which are difficult to interpret and, consequently, of limited use in this application.
|
PDF |
158--163 |
R. H. Mannell |
Spectral Distortion And Spectral Distance Measures
Abstract
Several different spectral distance measures have been compared in order to see which measures most closely correlate with the intelligibility of speech systematically distorted by various channel vocoder configurations.
|
PDF |
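The correlation study in the abstract above compares members of the family of spectral distance measures. As one representative member (not necessarily among those the paper compared), the RMS log-spectral distance between two magnitude spectra can be sketched as follows:

```python
import math

def log_spectral_distance(spec_a, spec_b):
    """RMS log-spectral distance in dB between two magnitude spectra
    sampled at the same frequencies."""
    assert len(spec_a) == len(spec_b)
    sq = 0.0
    for a, b in zip(spec_a, spec_b):
        d = 20.0 * math.log10(a / b)  # per-bin level difference in dB
        sq += d * d
    return math.sqrt(sq / len(spec_a))

# Identical spectra are at distance 0 dB; a uniform factor-of-2 gain
# in magnitude shows up as a constant ~6.02 dB offset.
flat = [1.0, 2.0, 4.0, 2.0, 1.0]
print(log_spectral_distance(flat, flat))                  # 0.0
print(log_spectral_distance([2 * x for x in flat], flat))
```

A distortion-versus-intelligibility study would correlate such per-frame distances, averaged over an utterance, against listener scores.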
164--169 |
A. Al-Otaibi and Y. El-Imam |
Automatic Segmentation Of Speech Signals Into Arabic Syllables
Abstract
Due to the nature of the Arabic language in terms of the rules governing the formation of syllables from phonemes, and the one-to-one correspondence between the acoustics and the phonetics of Arabic syllables, a syllable-based approach to speech recognition of the Arabic language has a high potential for success. An experimental evaluation of an algorithm for the automatic segmentation of speech into Arabic syllable units is reported here. The parameter used for segmentation is the energy of the acoustic signal. Speech data consisting of mono-syllabic and multi-syllabic words were used to test the automatic Arabic syllable segmentation algorithm. The algorithm has the advantage of being simple to implement.
|
PDF |
170--175 |
Xi Xiao, D. Nandagopal & D.A.H. Johnson |
On The Application Of Ar Model In Segmenting Isolated-Word Speech Signals
Abstract
In this paper, we present a segmentation method which uses AR modelling of the spectrum of the fullwave rectified speech signal. The FFT of the model coefficients yields a smoothed time domain signal with well defined minima, which locate the segment boundaries. The method is robust in the presence of noise. It is a useful first step in speech processing to segment speech signals into sub-frames which can be treated as time-invariant (stationary) processes.
|
PDF |
Assessment, Intelligibility & Cognition
Recognition II
Pages |
Authors |
Title and Abstract |
PDF |
198--203 |
M.D. Alder |
Automatic Extraction Of Syntax Applied To Speech Recognition.
Abstract
The successful recognition of speech depends heavily upon the use of contextual information. Moreover, the contextual information is too extensive to be put in by hand; this has led to attempts to automate the process. At the level of phonemic data, hidden Markov models are extensively used, while at the lexical level the principal method is that of n-grams (F. Jelinek, 1985). The n-gram method has two related problems associated with it. The first is that there are a very large number of n-grams of consecutive words in English text for n greater than one, and the number goes up very quickly with n. The second is that even this number is sparse in real data, so that even with 'training sets' of millions of words of text, any new text contains a large fraction of n-grams never seen before. And this fraction also increases rapidly with n. The question of what to do in this case is a central problem for this and other classes of stochastic grammars. In this paper I describe algorithms which address both problems. The first issue, that of storing large numbers of n-grams, is treated by storing not the n-grams themselves but classes of n-grams which are 'close together'. The second issue, that of sparseness of data, is solved by a derived method: we average over a neighbourhood of n-grams. The algorithms are computationally intensive, but are amenable to parallelisation. There are implications for layered neural networks.
|
PDF |
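The data-sparseness problem motivating the abstract above is easy to demonstrate on a toy scale: count the bigrams of a training text and measure what fraction of a held-out text's bigrams were never seen. The sketch below only illustrates the problem, not the paper's class-based or neighbourhood-averaging solutions.

```python
from collections import Counter

def bigrams(words):
    """Consecutive word pairs (n-grams with n = 2)."""
    return list(zip(words, words[1:]))

def unseen_fraction(train_words, test_words):
    """Fraction of held-out bigrams never observed in training --
    the sparseness that grows rapidly with n."""
    seen = Counter(bigrams(train_words))
    test = bigrams(test_words)
    unseen = sum(1 for bg in test if bg not in seen)
    return unseen / len(test)

train = "the cat sat on the mat".split()
test = "the cat sat on the hat".split()
print(unseen_fraction(train, test))  # 0.2  (only "the hat" is novel)
```

Even a one-word change in this tiny held-out sentence yields 20% unseen bigrams; with real corpora and larger n the unseen fraction stays substantial, which is what drives the move to classes of n-grams.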
204--209 |
R.A. Bennett, E. Lai, and Y. Attikiouzel |
A Connected Speech Parse For Australian English Utilizing Matrix Syllable Formation
Abstract
A system is proposed to enhance the speed of connected speech recognition systems by the formation of syllables from a phoneme string. Binary matrices are used to provide fast calculation of syllables, which are used in modified dictionary search patterns. The system is designed for use with simple recognition systems which provide minimal allophonic information. Some preliminary results are discussed.
|
PDF |
210--215 |
T. Svendsen, K. K. Paliwal, E. Harborg, P. O. Husoy |
Experiments With A Sub-Word Based Speech Recognizer
Abstract
A system for sub-word based speech recognition is described. The system is evaluated on a vocabulary of 42 Norwegian isolated words, and its performance is compared to that of whole-word based Hidden Markov Model and Dynamic Time Warping speech recognition systems.
|
PDF |
216--221 |
Frantz CLERMONT and Simon J. BUTLER |
Prosodically Guided Methods For Nearest Neighbour Classification Of Syllables
Abstract
An approach to Nearest Neighbour (NN) classification of syllables in continuous speech is described. Acoustic-prosodic segmentation of speech is used to guide the conventional Dynamic Time Warping (DTW) distance measure. The acoustic-prosodic analysis robustly determines the nuclei of syllables, and establishes neighbouring intervals which include the syllable boundaries. The limits of these intervals are then used to define the global constraints, which serve to restrict the DTW warping paths within an allowable region. Reliable syllable boundaries are therefore determined implicitly in the matching process. Furthermore, when the proposed method is used in NN-classification of a small database of Australian English diphthongs embedded in continuous speech, the accuracy is comparable to that achieved by current DTW-based systems for isolated word recognition.
|
PDF |
222--228 |
Simon J. BUTLER and Frantz CLERMONT |
On The Asymptotic Performance Of Nearest Neighbour Pattern Classifiers In Speech Recognition
Abstract
When distance measures based on Linear Prediction are used in Nearest Neighbour speech recognisers with a large number of training samples, it is found that the recognition performance is independent of the distance measure used. This contrasts with the case of small training sample sizes, in which performance is highly sensitive to the choice of distance measure. The "asymptotic nearest neighbour equivalence" of this class of distance measures is explained and demonstrated in a vowel recognition experiment.
|
PDF |
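The nearest neighbour rule at the heart of the two abstracts above takes the distance measure as a pluggable component, which is what makes comparisons across distance measures meaningful. A minimal 1-NN sketch follows; the Euclidean distance and the toy two-formant vowel vectors are illustrative stand-ins, not the paper's LPC-based measures or data.

```python
import math

def nearest_neighbour(train, query, distance):
    """1-NN classification: return the label of the training sample
    closest to the query under the supplied distance measure."""
    best_label, best_d = None, float("inf")
    for vector, label in train:
        d = distance(vector, query)
        if d < best_d:
            best_label, best_d = label, d
    return best_label

def euclidean(a, b):
    """Stand-in distance; any LP-based measure could be swapped in."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy two-class problem in a 2-D "formant space" (hypothetical values).
train = [((300, 2300), "i"), ((320, 2250), "i"),
         ((700, 1200), "a"), ((680, 1250), "a")]
print(nearest_neighbour(train, (310, 2280), euclidean))  # i
print(nearest_neighbour(train, (690, 1220), euclidean))  # a
```

The asymptotic-equivalence result concerns what happens to such a classifier's error rate as the training list grows large, independently of which admissible distance is plugged in.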
Tools
Pages |
Authors |
Title and Abstract |
PDF |
230--233 |
H.S.J.Purvis. |
The Control Of A Speech Synthesiser By An Ibm Pc.
Abstract
This paper describes a computer program and interface card used to control a serial formant speech synthesiser. The program is used to enter parametric data from the keyboard; the data may then be modified using a mouse.
|
PDF |
234--239 |
Ara Samouelian and Clive D. Summerfield |
Computational Model Of The Peripheral Auditory System For Speech Recognition: Initial Results
Abstract
This paper describes the design of a computational model of the peripheral auditory system, which is controlled via the AUDLAB Interactive Speech Signal Processing Package, using a programmable harness to interface the AUDLAB command protocol and track file format to the structural model of the cochlear processor. A suite of signal processing modules originally developed for speech synthesis research has been supplemented by a number of non-linear signal processing modules to model the transduction stage of the cochlea. Some initial results of the cochlear processor model and its performance on real speech signals are presented.
|
PDF |
240--243 |
R.A. Wills and Y.C. Tong |
An ILS Compatible Wide Band Spectrum Analysis/Plotting Program (WBS)
Abstract
The software tool WBS, which plots spectrograms from digitally recorded ILS compatible speech files, is presented. The program's operation is explained, and a comparison is shown between a speech file plotted on a PostScript printer, with selected frequency/time resolution, and the normal spectrogram plot from a "KAY" machine.
|
PDF |
244--247 |
Arthur Lagos and Michael Wagner |
An Integrated Audio Signal Interface For Use In The Teaching Laboratory
Abstract
An audio signal interface for the IBM PC-AT is described which was specifically designed for student use in the teaching laboratory. The interface, which is implemented as an IBM PC-AT plug-in board, allows the recording and playback of audio signals directly to and from disk files. All functions of signal conditioning, data conversion and the DMA bus interface are integrated on the board, which provides for microphone/line inputs and headphone/line outputs.
|
PDF |
248--255 |
H.S.J.Purvis. |
A General Purpose Speech Editor, The Speak Language.
Abstract
This paper describes a simple computer language, known as the speak language, that is used in a general purpose speech editor to output words or sentences in a defined sequence with specified timing.
|
PDF |
Applications
Pages |
Authors |
Title and Abstract |
PDF |
256--261 |
C Wheddon |
Human-Computer Speech Communication Systems
Abstract
The principal means of human communication is speech; this modality is now replicated by computer systems that are able to hold a limited but useful conversation with the user. Systems in operation and under development are described.
|
PDF |
262--267 |
R. W. King and A.J. Hunt |
A Synthetic Speech Terminal. For Viatel: Design, Implementation And Performance
Abstract
Speech synthesis technology has been incorporated successfully into several computer systems for use by blind people. There has, however, been relatively little attention paid to the specific problems of using visually-conceived information services such as videotex (of which Australia's Viatel service is an example) with synthesised speech output. In this implementation of a PC-based prototype 'talking-videotex' terminal, page layout processing and comprehensive user controls are provided to overcome the problems of page scanning. The terminal incorporates a low-cost SSI-263 synthesiser chip, with software to produce an Australianised accent, together with word-based prosody.
|
PDF |
268--273 |
P.J. Kennedy and J.E. Clark |
Operational Language In The Cockpit/Flightdeck Communication Environment Of Australian Civil Aviation Aircraft
Abstract
A recent survey of cockpit noise and communications in 44 Australian civil aviation aircraft included the compilation of a corpus of operational language material heard by aircrew during the performance of their duties. Preliminary analyses were made of the lexicon, syntax and message content of 1,726 transmissions. Constraints found upon the operational vocabulary and message-set construction heard by pilots present opportunities for applications of current speech technology to the civil aviation cockpit. Access to a suitable language database could be very useful for such applications.
|
PDF |
274--281 |
P.J. Kennedy and J.E. Clark |
Ambient Noise In The Cockpit/Flightdeck Communication Environment Of Australian Civil Aviation Aircraft
Abstract
A recent survey of ambient sound pressure levels and noise spectra occurring in the cockpit/flightdeck of various categories and classes of civil aviation aircraft is described. Variations in the cockpit noise between aircraft classes and during different flight operations indicate that cockpit speech technology should be evaluated under a range of conditions, although conditions within aircraft of the same category or class are similar enough to allow construction of an 'average' noise environment for a specified flight condition.
|
PDF |
Synthesis II
Pages |
Authors |
Title and Abstract |
PDF |
282--287 |
J.E. Clark and R.H. Mannell |
Some Comparative Characteristics Of Uniform And Auditorily Scaled Channel Synthesis
Abstract
This paper examines the comparative phonetic-level intelligibility characteristics of two channel vocoder type synthesis systems, one based on a uniform bandwidth filterbank, and the other on an auditorily scaled filterbank. The intelligibility tests were conducted using listeners with no prior experience of synthesised speech, and employed masking noise to help expose differences in the perceptual robustness of the test corpus. The intelligibility of the natural input speech tested under the same conditions was used as the benchmark for all comparisons. The results suggest that the Bark-scale derived synthesis may have intelligibility characteristics closer to those of natural speech than the uniform filterbank synthesis, is perceptually more robust, and is more cost-effective in its use of available channel encoding.
|
PDF |
288--293 |
Simon J. BUTLER |
A Speech Synthesis System Based On Articulatory Modelling
Abstract
The elements of an articulatory synthesis system under development are described. Particular emphasis has been given to modelling the trans-consonantal coarticulation effects for stops in /V1CV2/ context that have been reported by Ohman (1966).
|
PDF |
294--301 |
Danielle Ribot, Frédéric Le Diberder and Pierre Martin |
The Multivoc Text-To-Speech System
Abstract
MULTIVOC is a real-world text-to-speech system geared to the French language. The full system is described including the technical view and the main application of the product up to now as a basic component of a telephone-based mail service.
|
PDF |
Analysis III
Pages |
Authors |
Title and Abstract |
PDF |
302--307 |
J. Ingram |
Connected Speech Processes And Connected Speech Synthesis
Abstract
Construction of a database for the study of connected speech processes (CSPs) in Australian English is described. Application to the problem of speech rate, style, and sociolect sensitive synthesis is discussed.
|
PDF |
308--313 |
Jeffery Pittam and J. Bruce Millar |
The Longterm Spectrum Of Voice
Abstract
This paper presents an analysis of published information about the long-term spectrum of the voice. The historical development of the measure is first examined leading to a classification of the published works. Techniques used to compute the LTS are then presented, and the utility of the spectrum to various applications is considered. The outcome of this work is a research tool in the form of an annotated and classified bibliographic database.
|
PDF |
314--319 |
J.Bruce Millar |
Stability Of Long Term Acoustic Features
Abstract
This paper is a progress report on an ongoing study of speaker characteristics in a number of acoustic feature domains and a number of temporal domains. Variations in the long-term analysis of timing, energy distribution, and fundamental frequency distribution over a three-month period for 33 speakers of Australian English are presented. These data are based on 5 reading passages of a nominal duration of one minute.
|
PDF |
320--325 |
J. Pittam, C. Gallois and V.J. Callan |
The Long-Term Acoustic Characteristics Of Emotion
Abstract
Long-term spectra of recordings of three standard passages differing on perceived dominance and arousal were examined for 30 Australian speakers using three-mode principal components analysis. Results indicated that both affective dimensions were reflected systematically in the spectra, with dominance especially prominent in the upper part of the spectrum, and arousal affecting particularly two bands below 3 kHz.
|
PDF |
326--333 |
Hiroaki Oasa and J. Bruce Millar |
Acoustic Processing Of Phonetically Controlled Vowels
Abstract
A phonetically controlled vowel database, derived from 594 vowel samples spoken by adult males, adult females, and children, was analysed using LPC techniques to obtain a formant description of the database. The measured formants were uniformly transformed using various scaling factors derived from averaged acoustic features or from anatomical features. The effectiveness of these transformations as a first-stage normalisation procedure is evaluated, and the residual inter-speaker variation discussed.
|
PDF |
Disorders
Pages |
Authors |
Title and Abstract |
PDF |
334--337 |
Geoff Plant |
Speech Test Procedures For Use With Hearing Impaired Aboriginal Children
Abstract
The development of a speech test for use with Aboriginal children who speak Warlpiri as their first language is described. The test is extremely simple and easy to administer but appears to give reliable results. Possible applications for other Aboriginal languages are considered.
|
PDF |
338--343 |
Megan D. Neilson and Peter D. Neilson |
Sensory-Motor Integration Capacity Of Stutterers And Nonstutterers
Abstract
We review a series of studies concerning the auditory-motor and visual-motor tracking performance of stutterers and nonstutterers. We find no evidence of lateralization differences between the groups, and interpret the finding that stutterers perform auditory tracking tasks significantly less well than nonstutterers as evidence of a deficit in the ability to form internal auditory-motor models which subserve speech control.
|
PDF |
344--349 |
Corinne Adams |
Prosody And Airflow In Deaf Speech And Visual Feedback Remediation
Abstract
An acoustic/aerodynamic investigation of the speech of normal-hearing and profoundly deaf children is reported. The speech of the latter improved significantly following visual feedback remediation.
|
PDF |
350--357 |
J. Ingram, B. Murdoch and H. Chenery |
Prosody In Hypokinetic And Ataxic Dysarthria
Abstract
This paper contrasts patterns of speech prosody in hypokinetic and ataxic dysarthria. A perceptual and acoustic analysis of the metrical component of prosody in dysarthric speech is undertaken.
|
PDF |
Technical Aids
Pages |
Authors |
Title and Abstract |
PDF |
358--363 |
Rolf Carlson, Bjorn Granstrom and Sheri Hunnicutt |
Applications Of Speech Technology In Aids For The Disabled
Abstract
A number of technical aids which include speech synthesis or speech recognition have been developed at the Department of Speech Communication and Music Acoustics and are now being used by disabled individuals. Applications of synthetic speech include a communication aid, a symbol-to-speech system, word predictors, talking terminals and a daily newspaper. Speech recognition is also being used in a communication aid.
|
PDF |
364--369 |
P.J. Blamey and G.M. Clark |
Perception Of Synthetic Vowels And Stop Consonants By Cochlear Implant Users
Abstract
Three multiple-channel cochlear implant users were tested with speech sounds that were synthesized using electrical parameters representing the fundamental frequency of the voice, and the frequencies and amplitudes of the first and second formants. Using vowels of equal duration and loudness, it was shown that most of the vowel recognition could be attributed to the formant coding. Unvoiced stops with varying burst frequencies, voiced stops with varying second formant loci, bilabial stops with varying voice onset times, and bilabial consonants with varying formant transition durations were also synthesized. For each consonant set, the responses showed similar patterns to those observed with normally-hearing listeners for analogous acoustic stimuli. Interactions between amplitude and frequency cues were observed.
|
PDF |
370--376 |
H.H. Lim, Y.C. Tong and G.M. Clark |
Identification Of Synthetic Vowel Nuclei By Cochlear Implant Patients
Abstract
Six speech processing schemes, differing in the formant frequency-to-electrode position map and in the number of formant frequencies encoded, were investigated. The six schemes consisted of two single-formant (F2) schemes, three two-formant (F1 and F2, or F2 and F3) schemes and one three-formant (F1, F2 and F3) scheme. Eleven steady-state Australian vowel nuclei ([i], [a], [>], [u], [3], [l], [e], [aa], [/\], [o] and [v]), synthesised as electrical signals, were used to evaluate the relative merits of the six schemes on three cochlear implant patients. The first five vowels are long vowels and the remaining six are short vowels. The steady-state formant frequencies of these vowel nuclei (Bernard, 1970) were transformed to steady-state electrode positions using different formant frequency-to-electrode position maps. The confusion matrices were subjected to conditional information transmission analysis. The results showed that: (1) training, experience and adaptability to a new speech processing scheme were the main factors influencing the identification of the synthetic vowels; and (2) adding an extra formant vowel feature to a speech processing scheme tended to decrease the amount of information transmitted about the existing formant feature(s). From these synthesis results, the three-formant (F0/F1/F2/F3/B) speech processing scheme appeared to be the logical choice for future implementation in speech processors for cochlear implant patients.
|
PDF |
Perception II
Pages |
Authors |
Title and Abstract |
PDF |
378--383 |
W.K. Lai, Y.C. Tong, and G.M. Clark |
Absolute Identification By Cochlear Implant Patients Of Synthetic Vowels Constructed From Acoustic Formant Information
Abstract
Five speech processing schemes for presenting speech information to multiple-channel cochlear implant patients were investigated and compared. Tabulated formant-frequency data for the natural vowels were coded into the parameters of the electric stimuli used in the cochlear implant, and these electric stimuli, or synthetic vowels, were presented to two patients in a single-interval absolute identification task. The results suggest that when first and second formant speech information is coded into the pulse rate as well as the electrode position, performance in the identification task can be significantly improved, compared to when the same information is coded into the electrode position only.
|
PDF |
384--389 |
Geoff Plant |
Speech Understanding With Low Frequency Hearing: A Case Study
Abstract
A subject with a long standing high frequency loss was tested using a variety of speech materials. The results indicate much useful information may be gained from even limited amounts of low frequency hearing.
|
PDF |
390--396 |
P.J. Blamey and G.M. Clark |
Combining Tactile, Auditory And Visual Information For Speech Perception
Abstract
Four normally hearing subjects were trained and tested with all combinations of a highly degraded auditory input, a visual input via lipreading, and a tactile input using a multichannel electrotactile speech processor. When the visual input was added to any combination of other inputs, a significant improvement occurred for every test. Similarly, the auditory input produced a significant improvement for all tests except closed-set vowel recognition. The tactile input produced scores that were significantly greater than chance in isolation, but combined less effectively with the other modalities. The less effective combination might be due to lack of training with the tactile input, or to more fundamental limitations in the processing of multimodal stimuli.
|
PDF |
Coding II
Pages |
Authors |
Title and Abstract |
PDF |
398--401 |
Leisa Condie |
Word Recognition Using Error-Correction Codes
Abstract
The Reed-Solomon error-correction code separates input vectors as far as possible from each other. Such codes are known as Maximum Distance Separable (MDS). This property was investigated in a word recognition system to see whether applying such a code would separate word vectors to such a point that recognition rates improved. A vocabulary of 21 words spoken on four occasions by a single speaker formed the basis of the experiment. First formants for each frame were found with the Interactive Laboratory System (ILS) package. The resulting vectors were encoded with a Reed-Solomon code. The reference set for recognition was formed from the average of all the utterances of each word, and a simple distance metric (after suitable Dynamic Time Warping to align the vector lengths) was used to find the closest reference word. A comparison of performance for encoded and unencoded vectors is made.
|
PDF |
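The matching stage described in the abstract above (Dynamic Time Warping to align vectors of unequal length, then a simple distance to the nearest reference) can be sketched as follows. The feature sequences and word labels are illustrative, and the Reed-Solomon encoding stage is omitted:

```python
import math

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) DTW with absolute-difference local cost."""
    n, m = len(a), len(b)
    # dp[i][j] = cost of the best alignment of a[:i] with b[:j]
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # insertion
                                  dp[i][j - 1],      # deletion
                                  dp[i - 1][j - 1])  # match
    return dp[n][m]

def recognise(utterance, references):
    """Return the label of the reference sequence closest under DTW."""
    return min(references, key=lambda label: dtw_distance(utterance, references[label]))

# Usage with toy first-formant tracks (values are made up):
refs = {"one": [1.0, 2.0, 3.0], "two": [5.0, 5.0, 5.0]}
best = recognise([1.0, 2.0, 4.0], refs)
```

The paper's question is then whether running the feature vectors through an MDS code before this distance computation widens the gaps between word classes.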
402--407 |
R.E.E. Robinson |
A Simple Pitch Detector Using A Digital Signal Processor
Abstract
A real-time pitch extraction device using the TMS32010 Digital Signal Processor is described. It is designed as a replacement for an analog pitch extractor and performs the same function, but with greater accuracy and better dynamic response. The two are compared.
|
PDF |
408--413 |
Bernt Ribbum, Andrew Perkis and K.K. Paliwal |
Enhancing The Codebook For Improving The Speech Quality Of Celp Coders
Abstract
A Code Excited Linear Predictive (CELP) coder with a stochastic-multipulse (STMP) codebook is presented. The LPC residual exhibits a certain structure due to non-linearities in the glottal excitation. This structure can be exploited by a refinement of the STMP excitation signal, as a training procedure for the codebook. The algorithms are described and results are reported, both in terms of segmental SNR and subjective preference.
|
PDF |
414--420 |
S. Sridharan, E. Dawson, and J. O'Sullivan |
Speech Encryption Using Fast Fourier Transform Techniques
Abstract
A speech encryption system based on permutation of FFT coefficients is described. Results of simulation and cryptanalysis of the system are presented.
|
PDF |
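The scrambling idea in the abstract above can be sketched minimally: transform a frame, permute the coefficients with a key-seeded permutation, and inverse-transform. The frame size and key handling here are illustrative assumptions, and the sketch keeps a complex scrambled frame for clarity; a practical system permutes conjugate-symmetric coefficient pairs so the transmitted signal stays real-valued:

```python
import random
import numpy as np

def keyed_permutation(n, key):
    """Deterministic permutation of range(n), seeded by the key."""
    perm = list(range(n))
    random.Random(key).shuffle(perm)
    return np.array(perm)

def scramble(frame, key):
    """Permute FFT coefficients with the keyed permutation."""
    coeffs = np.fft.fft(frame)
    return np.fft.ifft(coeffs[keyed_permutation(len(frame), key)])

def unscramble(frame, key):
    """Apply the inverse permutation (argsort inverts a permutation)."""
    coeffs = np.fft.fft(frame)
    inverse = np.argsort(keyed_permutation(len(frame), key))
    return np.fft.ifft(coeffs[inverse])

# Usage: round trip with the correct key recovers the frame exactly.
frame = np.arange(16.0)
restored = unscramble(scramble(frame, key=42), key=42)
```

Without the key, an attacker must undo an unknown permutation of spectral components, which is the property the paper's cryptanalysis examines.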
Analysis IV
Pages |
Authors |
Title and Abstract |
PDF |
422--425 |
R. Potapova |
The Length And Variability In Connected Speech For Russian
Abstract
This paper presents automatic segmentation of speech in a "bottom-up" manner, taking into account a number of linguistic constraints: the specification of texts based on the classification of speech acts; the classification of text fragments on the basis of semantic-syntactic analysis; the segmentation of the utterance once the end-points of a phrase are determined; the segmentation into syllabic units; and the specification of length and its variability in connected Russian speech.
|
PDF |
426--428 |
V.V. Potapov |
The Rhythmic Organization Of Speech In Czech And Russian
Abstract
This research describes perception as revealed in the process of segmenting Czech and Russian speech into rhythmic structures (RS), and the prosodic features that characterise this segmentation in Czech and Russian. In this investigation, rhythm is defined as a regular recurrence of speech units in an utterance. These units comprise syllables, rhythmic structures (phonetic words), sense-groups (syntagmas) and phrases.
|
PDF |