SST 1988 Abstracts

Page numbers refer to nominal page numbers assigned to each paper for purposes of citation.

Synthesis I

Pages Authors Title and Abstract PDF
2--7 Rolf Carlson, Bjorn Granstrom and Sheri Hunnicutt Rulsys - The Swedish Multilingual Text-To-Speech Approach

 

Abstract  Speech synthesis has been a major field of research at our department for several decades. The projects contain everything from basic research on speech production models to applications of speech technology, e.g., for handicapped persons. In this contribution we will concentrate on the development strategies, describe the development environment and discuss some recent results. The synthesis is based on a combination of modules including lexica and rule components. Even if the number of components is about the same for different languages, the emphasis on the different parts varies considerably due to language structure. Rule development is done in the generative phonology tradition. The development system, originally written for a different computer, has now been moved to our network of Apollo workstations and integrated with speech analysis and resynthesis software. Expanded use of morphological and syntactical analysis has proved useful in several languages. Recent experiments with an expanded synthesis model including a more realistic voice source, the LF-model, have given new possibilities to vary both speaker type and speaking style.

 

PDF
8--13 Clive D Summerfield and Marwan A Jabri A Formant Speech Synthesiser ASIC: Functional Design

 

Abstract  This paper is the first of two companion papers on the design and implementation of a multi-channel formant speech synthesiser Application Specific Integrated Circuit (ASIC). The objective of this research is the development of an efficient VLSI structure which can be implemented as a single VLSI device and yet retains the acoustical performance necessary to generate high quality and high intelligibility synthetic speech and has sufficient processing bandwidth for multi-channel operation. This paper concentrates on the functional design of a VLSI formant speech synthesis structure for achieving these objectives.

 

PDF
14--20 Marwan A Jabri, Kiang Ooi Tan and Clive D Summerfield A Formant Speech Synthesiser ASIC: Implementation

 

Abstract  This paper is the second of two companion papers on the design and implementation of a multichannel formant speech synthesiser Application Specific Integrated Circuit (ASIC). In this paper, we describe the implementation aspects of the project. The use of the silicon compiler FIRST in the conception of the synthesiser has reduced the design time considerably. However, as the 5 µm NMOS primitive library of FIRST falls short in providing sufficient processing bandwidth for multi-channel operation, we implemented a new primitive library using the European Silicon Structure (ES2) standard cell design tool (2 µm CMOS). The library is implemented using the MODEL hardware description language. The implementation of the primitives using MODEL is discussed in detail together with the clocking and data synchronisation strategies necessary for reliable high speed bit-serial operation.

 

PDF

Perception I

Pages Authors Title and Abstract PDF
22--27 R. H. Mannell Perceptual Space Of Male And Female Australian English Vowels

 

Abstract  This study investigates the phonemic space of synthetic male and female vowel tokens as perceived by native speakers of Australian English. The data was also examined for evidence of vowel normalisation.

 

PDF
28--34 U Thein-Tun The Gender And Individual Variations In Processing Linguistic-Phonetic Cues

 

Abstract  The perception of integrated phonetic cues by males and females was investigated at five levels of information processing. The integrated phonetic cues were the intensity and duration of voice-onset-time in relation to the intensity of the following vowel for syllable initial /d/-/t/ distinction. The five levels of information processing were auditory, phonetic, syllable, word, and sentence levels. The results demonstrate that listeners who cannot effectively process the cues at the auditory and phonetic levels can process them very effectively at the sentence level and vice versa. Most female listeners belong to the former group.

 

PDF

Coding I

Pages Authors Title and Abstract PDF
36--41 M.J. Flaherty On The Representation Of Time-Varying LPC Parameters By Cubic Splines With Variable Knots

 

Abstract  The modelling of log area coefficients by cubic splines with variable knots is discussed. Results are presented which compare variable and uniform knot modelling for a connected digit utterance spoken over a telephone handset using least squares error and spectral difference measures.

 

PDF
42--47 H Gondokusumo and T.S. Ng Subband Coding Of Speech Using M-Band Parallel Quadrature Mirror Filters

 

Abstract  In this paper, the design of both analysis and synthesis filters for a 3-band parallel sub-band coder with perfect reconstruction properties is given. An example of this design using FIR filters of 17th order is implemented in an IBM PC/AT. Experimental results using this subband coder on segments of speech will be demonstrated in the presentation.

 

PDF
48--53 A. W. Johnson and A. B. Bradley The Effect Spectral Modifications Have Upon The Performance Of Frequency Domain Coders

 

Abstract  The effect of spectral modifications on the performance of low data rate frequency domain coders employing overlapped transform operations and the Ramstad bit allocation procedure is the introduction of distortion into the recovered signal. While this distortion cannot be removed, certain aspects of it can be controlled. This theoretical analysis suggests the design of a new bit allocation procedure based upon the Ramstad majority vote rule [Ramstad 1984]. This is the subject of ongoing research which it is hoped will improve the overall subjective quality of speech recovered from a low data rate frequency domain coder.

 

PDF
54--59 S N Koh and P. Sivaprakasapillai Analysis And Synthesis Method For Packet Speech Enhancement

 

Abstract  This paper describes a novel system which employs the filter bank analysis and synthesis method for the packetisation of speech for transmission in packet data communications. Computer simulations of the system indicate that significant improvement in the perceptual quality of the recovered speech can be obtained even with zero substitutions compared to the conventional technique of straight packetization of PCM speech. Further improvement is possible through frequency-domain component replications.

 

PDF
60--66 A. Perkis, B. Ribbum, T. Ramstad Improving Subjective Quality In Waveform Coders By The Use Of Postfiltering

 

Abstract  Adaptive postfiltering is shown to significantly enhance the perceived speech quality of medium bit rate coders. The postfilter, utilizing auditory masking properties, provides an adaptive shaping of the noise and signal spectra, thus reducing the perceived quantization noise level at the cost of introducing some extra signal distortion. This paper will discuss the effectiveness of postfiltering in three distinctly different coding schemes. These are Regular Pulse Excited linear predictive coding (RPE) representing LPC based coding schemes, and Adaptive Sub Band Coding (SBC) and Adaptive Transform Coding (ATC) representing frequency domain coders.

 

PDF

Analysis I

Pages Authors Title and Abstract PDF
68--73 Phil Rose Normalisation Of Tonal F0 From Long Term F0 Distributions

 

Abstract  An attempt is described to ascertain whether the F0 of 7 speakers' tones can be normalised using parameters from their long term F0 distribution. It is shown that normalisation using long term mean and standard deviation is not as effective in reducing the between-speaker variance as with parameters derived from the tones themselves. However, the approach is still successful enough to be worth pursuing, and some suggestions for improvement are indicated.
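As a rough illustration of normalising F0 by a long-term mean and standard deviation, the sketch below converts tone F0 samples to z-scores. The function name `normalise_f0` and all numeric values are hypothetical; the abstract does not give the paper's exact procedure, so this should be read as one plausible form of such a normalisation rather than the author's method.

```python
from statistics import mean, pstdev

def normalise_f0(tone_f0, long_term_f0):
    """Express tone F0 samples (Hz) as z-scores of a speaker's long-term F0.

    Assumption: the long-term distribution is summarised by its mean and
    population standard deviation only.
    """
    mu = mean(long_term_f0)
    sigma = pstdev(long_term_f0)
    if sigma == 0:
        raise ValueError("long-term F0 distribution has zero spread")
    return [(f - mu) / sigma for f in tone_f0]

# Hypothetical values, not data from the paper: speaker B's F0 is
# exactly twice speaker A's, in both the tones and the long-term sample.
speaker_a = normalise_f0([110.0, 130.0], [100.0, 120.0, 140.0])
speaker_b = normalise_f0([220.0, 260.0], [200.0, 240.0, 280.0])
```

In this contrived case the two speakers' normalised tone values coincide, which is the kind of between-speaker variance reduction the abstract is concerned with.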

 

PDF
74--79 J.S. Chang and Y.C. Tong Development Of A Switched Capacitor Speech Spectrum Analyzer System Design

 

Abstract  The design of a novel Low Power Monolithic Time-Multiplexed Switched Capacitor Speech Spectrum Analyzer is described. Essential features are specified with comments on the reasons for the design decisions. An experimental four channel spectrum analyzer has been fabricated and measurements on prototypes show that the design specifications are satisfied.

 

PDF
80--85 Michael G. Barlow and Michael Wagner Prosody As A Basis For Determining Speaker Characteristics

 

Abstract  A speaker identification experiment based on prosodic features is described. Five speakers recorded a set of four sentences in five separate sessions over a period of one week. For each of these utterances, the energy, fundamental frequency, voicing and linear prediction error contours were extracted. For each sentence (four) and each type of contour (four) distance measures based on dynamic time warping were calculated between all twenty five (five speakers by five repetitions) contours. These distances were compared on an inter-speaker versus intra-speaker basis and the ratio was generally found to be large. Parameters within the distance measuring process, namely warping window size and contour smoothing, were altered and the effects on speaker distances are discussed.
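To make the role of the warping window concrete, here is a minimal sketch of a dynamic time warping distance between two one-dimensional contours. The `window` parameter is assumed to be a band half-width in frames (a Sakoe-Chiba style constraint), and the absolute-difference local cost is also an assumption; the abstract does not specify either choice.

```python
def dtw_distance(a, b, window=None):
    """Dynamic time warping distance between two 1-D contours.

    `window` is a hypothetical band half-width in frames; None means
    the warping path is unconstrained.
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)
    # The band must be wide enough to reach the end point (n, m).
    window = max(window, abs(n - m))
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost (assumed)
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]

# A repeated frame is absorbed by the warp.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
# A zero-width window forces frame-by-frame comparison.
free = dtw_distance([0, 1, 0], [1, 0, 0])
tight = dtw_distance([0, 1, 0], [1, 0, 0], window=0)
```

Narrowing the window can only keep the distance the same or raise it, which is one reason the window size would be expected to affect the inter-speaker and intra-speaker distances.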

 

PDF
86--91 Kim E. A. Silverman Utterance-Internal Prosodic Boundaries

 

Abstract  This paper investigates minor prosodic boundaries that often occur in fluent speech, and yet are not well understood. A corpus was collected of utterances with a range of segmental structures, where each utterance was spoken both with and without such an internal boundary. Acoustic measurements of the utterances were then related to perceptual ratings of the salience of the boundaries. Results showed that the F0 fall from the preceding pitch accent is much steeper before a boundary, and while these boundaries do not contain pauses they do alter the temporal structure of the speech. The segmental material is lengthened, and the preceding F0 accent occurs considerably earlier relative to its accent-bearing syllable.

 

PDF
92--97 Lori F. Lamel Spectrogram Readers' Identification Of Stop Consonants

 

Abstract  This paper reports on the performance of five spectrogram readers at identifying spectrograms of stop consonants extracted from continuous speech. The stops were spoken by 299 talkers and were presented in the immediate phonemic context. The task was designed to minimize the use of lexical and other higher sources of knowledge. The averaged identification rate across contexts ranged from 73-82% for the top choice, and 77-93% for the top two choices. The readers' performances were comparable to those of other spectrogram reading experiments reported in the literature; however, the other studies have typically evaluated a single subject on speech spoken by a small number of talkers.

 

PDF

Recognition I

Pages Authors Title and Abstract PDF
100--105 D. Rainton and S.J. Young Consonant Recognition Using The Covariance Of The Pseudo Wigner Distribution

 

Abstract  It is generally accepted that the consonant in a consonant-vowel (CV) pair can be identified by the nature of the formant transitions in the vowel. STFT power spectral snapshots fail to capture the detailed time-varying nature of these transitions. In this paper we show that such spectra can be considered weighted time averages of the pseudo-Wigner distribution (PWD) when appropriate Gaussian windows are used in the computation of both. Given this interpretation we then speculate as to whether the higher order statistics of the PWD convey additional consonant discriminant information. Experimental evidence indicates that they do.

 

PDF
106--111 J. R. Sholicar, F. Fallside A Prosodically And Lexically Constrained Approach To Continuous Speech Recognition

 

Abstract  Psycholinguistic studies have indicated that prosodic cues play a vital role in human speech perception. The prosodic relationships which exist within an utterance are believed to provide fundamental cues for structuring the recognition process. However, in the majority of reported systems for the automatic recognition of continuous speech, prosodic cues are seldom used. In this paper, we review the evidence supporting the exploitation of prosodic cues, and discuss how such cues can be exploited within a machine recognition system to improve the segmental parsing strategy. A practical implementation is then proposed, in which prosodic structure is a major factor in the organisation of the recognition process. The architecture of the system is described and preliminary results relating to the current development of this system are discussed.

 

PDF
112--118 P. Pierucci, A. Paladin Multistage Vector Quantization With Acoustic Constraints For Speaker Verification

 

Abstract  In this paper a new method to build multisection codebooks for the speaker recognition task is investigated. Different methods of threshold evaluation are then discussed for the proposed approach, and a comparison with single section VQ and previously reported Multisection VQ is discussed, in a fixed text speaker verification experiment.

 

PDF

Production

Pages Authors Title and Abstract PDF
120--125 Peter D. Neilson, Megan D. Neilson, and Nicholas J. O'Dwyer Redundant Degrees Of Freedom In Speech Control: A Problem Or A Virtue?

 

Abstract  It is well-established that rapid, functionally specific compensations for unexpected perturbations occur in speech articulators remote from the site of the disturbance. We interpret this in terms of an adaptive controller which incorporates an inverse internal model of the sensory-motor system involved. By using "compliant" control, in which variables representing redundant degrees of freedom are set equal to the feedback of their actual values, sensory consequences crucial to a task can be protected from external disturbances. This subsumes the notions of coordinative structures and feedforward processes.

 

PDF
126--131 David Slater Intrinsic Effects Of Voiced And Voiceless Unaspirated Prevocalic Stops On Fundamental Frequency In Luang Prabang Lao

 

Abstract  Three tones, of two native speakers of the Luang Prabang variety of Lao, are investigated as to the effects of the voicing or voicelessness of the initial consonant of a syllable on the fundamental frequency of the following vowel. Three vowels are used, and the intrinsic effect of vowel height on fundamental frequency observed. The results from this investigation support the hypothesis that the intrinsic effect of lowering of fundamental frequency after voiced consonants is minimised in tonal languages.

 

PDF
132--137 Andrew Butcher Invariance And Variability In Tongue Contact Patterns: Electropalatographic Evidence

 

Abstract  An attempt was made to quantify the variability of tongue contact patterns at certain stages during the pronunciation of VCV sequences, as registered by an electropalatograph system. Of specific interest were: total contact area during the vowels and the rate of change of contact for consonant closures and releases. Results are discussed in the light of some current theories of coarticulation.

 

PDF
138--144 Jinshi Huang An ARMA Model Of Speech Production Process With Applications

 

Abstract  An autoregressive and moving-average (ARMA) model of speech production process is proposed. The orders of the model are determined from the generalized partial autocorrelation (GPAC) pattern. Based on the maximum likelihood estimation, parameters of the model are estimated via the Marquardt algorithm. Experiments show that the fricatives can be better modeled as an ARMA process. The autocorrelation of the residuals can be used for pitch detection and voiced/unvoiced speech recognition.

 

PDF

Analysis II

Pages Authors Title and Abstract PDF
146--151 Frantz Clermont A Dual Exponential Model For Formant Trajectories Of Diphthongs

 

Abstract  Australian English diphthongs are studied in terms of their second formant-frequency trajectories. These sigmoid-shaped trajectories may be decomposed, around a suitable breakpoint, as two exponential functions approaching two distinct vowel targets. In order to obtain this dual exponential representation, a set of candidate breakpoints defined along the inter-target transition is used to divide a given trajectory into two segments, thus simplifying the problem to that of fitting two single exponentials. A succession of such fits is performed, and the best pair of exponentials is determined in a root-mean-square sense. The method developed for constructing and evaluating the dual exponential model is described and illustrated. While the model fares well in a curve-fitting sense, its components do not always admit of a sensible phonetic interpretation in the case of an incomplete gesture towards the second vowel target.

 

PDF
152--157 C. D. Summerfield Pole-Zero Analysis For The Detection Of Nasality

 

Abstract  The detection of nasality and the fine-class categorisation of nasal segments is important for the success of a phonetically based speech recognition machine. In this paper the application of a pole-zero modelling algorithm to this problem is described. It is well known that nasal segments are characterised and may be classified by the presence and location of the vocal tract transfer function zeros. The aim, in applying the pole-zero algorithm, is to elucidate this particular acoustic feature. However, as will be demonstrated, the zero response from the pole-zero algorithm shows a considerable amount of extra activity which is not attributed to the vocal tract zero alone. These results indicate that the application of the pole-zero algorithm yields results which are difficult to interpret and, consequently, of limited use in this application.

 

PDF
158--163 R. H. Mannell Spectral Distortion And Spectral Distance Measures

 

Abstract  Several different spectral distance measures have been compared in order to see which measures most closely correlate with the intelligibility of speech systematically distorted by various channel vocoder configurations.

 

PDF
164--169 A. Al-Otaibi and Y. El-Imam Automatic Segmentation Of Speech Signals Into Arabic Syllables

 

Abstract  Due to the nature of the Arabic language in terms of the rules governing formation of syllables by phonemes, and the one-to-one correspondence between the acoustics and the phonetics of Arabic syllables, a syllabic based approach for speech recognition of the Arabic language has a high potential for success. An experimental evaluation of an automatic speech segmentation algorithm into Arabic syllable units is reported here. The parameter used for segmentation is the energy of the acoustic signal. Speech data consisting of mono-syllabic and multi-syllabic words were used to test the automatic Arabic syllable segmentation algorithm. The algorithm has the advantage of being simple to implement.

 

PDF
170--175 Xi Xiao, D. Nandagopal & D.A.H. Johnson On The Application Of AR Model In Segmenting Isolated-Word Speech Signals

 

Abstract  In this paper, we present a segmentation method which uses AR modelling of the spectrum of the fullwave rectified speech signal. The FFT of the model coefficients yields a smoothed time domain signal with well defined minima, which locate the segment boundaries. The method is robust in the presence of noise. It is a useful first step in speech processing to segment speech signals into sub-frames which can be treated as time-invariant (stationary) processes.

 

PDF

Assessment Intelligibility & Cognition

Pages Authors Title and Abstract PDF
190--196 Katerina K. Karadjova Cognitive Networks In The Semantic Memory Of Normal And Mentally Retarded Children

 

Abstract  The structure of the conceptions of normal and mentally handicapped children is considered. The experimental data are processed by means of Johnson's hierarchical cluster analysis.

 

PDF
178--183 V.J. Demczuk An Investigation Into The Intelligibility Of Various Synthetic Speech Devices.

 

Abstract  An experiment was conducted to measure the intelligibility of a number of commercially available synthetic speech devices.

 

PDF
184--189 W.A. Ainsworth, A.P. Lobo and G.R. Nest Correlation Of Speech Intelligibility And Glottal Pulse Parameters

 

Abstract  The intelligibility of the voices of eight speakers was measured at three levels of background noise. A number of glottal pulse parameters were estimated for the same voices. Significant correlations were found between intelligibility and some of these parameters.

 

PDF

Recognition II

Pages Authors Title and Abstract PDF
198--203 M.D. Alder Automatic Extraction Of Syntax Applied To Speech Recognition.

 

Abstract  The successful recognition of speech depends heavily upon the use of contextual information. Moreover, the contextual information is too extensive to be put in by hand; this has led to attempts to automate the process. At the level of phonemic data, hidden Markov models are extensively used, while at the lexical level the principal method is that of n-grams (F. Jelinek, 1985). The n-gram method has two related problems associated with it. The first is that there are a very large number of n-grams of consecutive words in English text for n greater than one, and the number goes up very quickly with n. The second is that even this number is sparse in real data, so that even with 'training sets' of millions of words of text, any new text contains a large fraction of n-grams never seen before. And this fraction also increases rapidly with n. The question of what to do in this case is a central problem for this and other classes of stochastic grammars. In this paper I describe algorithms which address both problems. The first issue, that of storing large numbers of n-grams, is treated by storing not the n-grams themselves but classes of n-grams which are 'close together'. The second issue, that of sparseness of data, is solved by a derived method: we average over a neighbourhood of n-grams. The algorithms are computationally intensive, but are amenable to parallelisation. There are implications for layered neural networks.
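The sparseness problem can be seen even on a toy corpus. The sketch below only measures the fraction of n-grams in new text that never occurred in training text; it does not implement the clustering or neighbourhood-averaging algorithms of the paper, and the helper names and sentences are hypothetical.

```python
def ngrams(tokens, n):
    """Return the list of consecutive n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unseen_fraction(train_tokens, new_tokens, n):
    """Fraction of n-grams in new_tokens that do not occur in train_tokens."""
    seen = set(ngrams(train_tokens, n))
    candidates = ngrams(new_tokens, n)
    if not candidates:
        return 0.0
    unseen = sum(1 for g in candidates if g not in seen)
    return unseen / len(candidates)

# Toy sentences, not data from the paper; they differ in one word.
train_text = "the cat sat on the mat".split()
new_text = "the cat sat on the hat".split()

fractions = [unseen_fraction(train_text, new_text, n) for n in (1, 2, 3)]
```

With these two sentences the unseen fraction grows as n goes from 1 to 3, a small-scale version of the growth with n that the abstract describes.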

 

PDF
204--209 R.A. Bennett, E. Lai, and Y. Attikiouzel A Connected Speech Parse For Australian English Utilizing Matrix Syllable Formation

 

Abstract  A system is proposed to enhance the speed for Connected Speech Recognition systems by the formation of syllables from a phoneme string. Binary matrices are used to provide fast calculation of syllables which are used in modified dictionary search patterns. The system is designed for use with simple recognition systems which provide minimal allophonic information. Some preliminary results are discussed.

 

PDF
210--215 T. Svendsen, K. K. Paliwal, E. Harborg, P. O. Husoy Experiments With A Sub-Word Based Speech Recognizer

 

Abstract  A system for sub-word based speech recognition is described. The system is evaluated for a vocabulary of 42 Norwegian isolated words and the performance of the system is compared to the performance of whole-word based Hidden Markov Model and Dynamic Time Warping speech recognition systems.

 

PDF
216--221 Frantz CLERMONT and Simon J. BUTLER Prosodically Guided Methods For Nearest Neighbour Classification Of Syllables

 

Abstract  An approach to Nearest Neighbour (NN) classification of syllables in continuous speech is described. Acoustic prosodic segmentation of speech is used to guide the conventional Dynamic Time Warping (DTW) distance measure. The acoustic prosodic analysis robustly determines the nuclei of syllables, and establishes neighbouring intervals which include the syllable boundaries. The limits of these intervals are then used to define the global constraints, which serve to restrict the DTW warping paths within an allowable region. Reliable syllable boundaries are therefore determined implicitly in the matching process. Furthermore, when the proposed method is used in NN-classification of a small database of Australian English diphthongs embedded in continuous speech, the accuracy is comparable to that achieved by current DTW-based systems for isolated word recognition.

 

PDF
222--228 Simon J. BUTLER and Frantz CLERMONT On The Asymptotic Performance Of Nearest Neighbour Pattern Classifiers In Speech Recognition

 

Abstract  When distance measures based on Linear Prediction are used in Nearest Neighbour speech recognisers with a large number of training samples, it is found that the recognition performance is independent of the distance measure used. This contrasts with the case of small training sample sizes, in which performance is highly sensitive to choice of distance measure. The "asymptotic nearest neighbour equivalence" of this class of distance measures is explained and demonstrated in a vowel recognition experiment.

 

PDF

Tools

Pages Authors Title and Abstract PDF
230--233 H.S.J. Purvis The Control Of A Speech Synthesiser By An IBM PC.

 

Abstract  This paper describes a computer program and interface card used to control a serial formant speech synthesiser. The program is used to enter parametric data using the keyboard; the data may be modified using a mouse.

 

PDF
234--239 Ara Samouelian and Clive D. Summerfield Computational Model Of The Peripheral Auditory System For Speech Recognition: Initial Results

 

Abstract  This paper describes the design of a computational model of the peripheral auditory system, which is controlled via the AUDLAB Interactive Speech Signal Processing Package using a programmable harness to interface the AUDLAB command protocol and track file format to the structural model of the cochlear processor. A suite of signal processing modules, originally developed for speech synthesis research, has been supplemented by a number of non-linear signal processing modules to model the transduction stage of the cochlea. Some initial results of the cochlear processor model and its performance on real speech signals are presented.

 

PDF
240--243 R.A. Wills and Y.C. Tong An ILS Compatible Wide Band Spectrum Analysis/Plotting Program (WBS)

 

Abstract  The software tool WBS, which plots spectrograms from digitally recorded ILS compatible speech files, is presented. The program operation is explained and a comparison is shown between a speech file plotted on a PostScript printer with selected frequency/time resolution and the normal spectrogram plot from a "KAY" machine.

 

PDF
244--247 Arthur Lagos and Michael Wagner An Integrated Audio Signal Interface For Use In The Teaching Laboratory

 

Abstract  An audio signal interface for the IBM PC-AT is described which was specifically designed for student use in the teaching laboratory. The interface, which is implemented as an IBM PC-AT plug-in board, allows the recording and playback of audio signals directly to and from disk files. All functions of signal conditioning, data conversion and the DMA bus interface are integrated on the board, which provides for microphone/line inputs and headphone/line outputs.

 

PDF
248--255 H.S.J. Purvis A General Purpose Speech Editor, The Speak Language.

 

Abstract  This paper describes a simple computer language known as the speak language that is used in a general purpose speech editor to output words or sentences in a defined sequence with specified timing.

 

PDF

Applications

Pages Authors Title and Abstract PDF
256--261 C Wheddon Human-Computer Speech Communication Systems

 

Abstract  The principal means of human communication is speech; this modality is now replicated by computer systems that are able to hold a limited but useful conversation with the user. Systems in operation and under development are described.

 

PDF
262--267 R. W. King and A.J. Hunt A Synthetic Speech Terminal For Viatel: Design, Implementation And Performance

 

Abstract  Speech synthesis technology has been incorporated successfully into several computer systems for use by blind people. There has, however, been relatively little attention paid to the specific problems of using visually-conceived information services such as videotex (of which Australia's Viatel service is an example) with synthesised speech output. In this implementation of a PC-based prototype 'talking-videotex' terminal, page layout processing and comprehensive user controls are provided to overcome the problems of page scanning. The terminal incorporates a low-cost SSI-263 synthesiser chip, with software to produce an Australianised accent, together with word-based prosody.

 

PDF
268--273 P.J. Kennedy and J.E. Clark Operational Language In The Cockpit/Flightdeck Communication Environment Of Australian Civil Aviation Aircraft

 

Abstract  A recent survey of cockpit noise and communications in 44 Australian civil aviation aircraft included the compilation of a corpus of operational language material heard by aircrew during the performance of their duties. Preliminary analyses were made of the lexicon, syntax and message content of 1,726 transmissions. Constraints found upon the operational vocabulary and message-set construction heard by pilots present opportunities for applications of current speech technology to the civil aviation cockpit. Access to a suitable language database could be very useful for such applications.

 

PDF
274--281 P.J. Kennedy and J.E. Clark Ambient Noise In The Cockpit/Flightdeck Communication Environment Of Australian Civil Aviation Aircraft

 

Abstract  A recent survey of ambient sound pressure levels and noise spectra occurring in the cockpit/flightdeck of various categories and classes of civil aviation aircraft is described. Variations in the cockpit noise between aircraft classes and during different flight operations indicate that cockpit speech technology should be evaluated under a range of conditions, although conditions within aircraft of the same category or class are similar enough to allow construction of an 'average' noise environment for a specified flight condition.

 

PDF

Synthesis II

Pages Authors Title and Abstract PDF
282--287 J.E. Clark and R.H. Mannell Some Comparative Characteristics Of Uniform And Auditorily Scaled Channel Synthesis

 

Abstract  This paper examines the comparative phonetic level intelligibility characteristics of two channel vocoder type synthesis systems, one based on a uniform bandwidth filterbank, and the other on an auditorily scaled filterbank. The intelligibility tests were conducted using listeners with no prior experience of synthesised speech, and employed masking noise to help expose differences in the perceptual robustness of the test corpus. The intelligibility of the natural input speech tested under the same conditions was used as the benchmark for all comparisons. The results suggest that the Bark scale derived synthesis may have intelligibility characteristics closer to those of natural speech than the uniform filterbank synthesis, is perceptually more robust, and is more cost effective in its use of available channel encoding.

 

PDF
288--293 Simon J. BUTLER A Speech Synthesis System Based On Articulatory Modelling

 

Abstract  The elements of an articulatory synthesis system under development are described. Particular emphasis has been given to modelling the trans-consonantal coarticulation effects for stops in /V1CV2/ context that have been reported by Ohman (1966).

 

PDF
294--301 Danielle Ribot, Frédéric Le Diberder and Pierre Martin The Multivoc Text-To-Speech System

 

Abstract  MULTIVOC is a real-world text-to-speech system geared to the French language. The full system is described including the technical view and the main application of the product up to now as a basic component of a telephone-based mail service.

 

PDF

Analysis III

Pages Authors Title and Abstract PDF
302--307 J. Ingram Connected Speech Processes And Connected Speech Synthesis

 

Abstract  Construction of a data base for the study of connected speech processes (CSPs) in Australian English is described. Application to the problem of synthesis that is sensitive to speech rate, style, and sociolect is discussed.

 

PDF
308--313 Jeffery Pittam and J. Bruce Millar The Longterm Spectrum Of Voice

 

Abstract  This paper presents an analysis of published information about the long-term spectrum of the voice. The historical development of the measure is first examined leading to a classification of the published works. Techniques used to compute the LTS are then presented, and the utility of the spectrum to various applications is considered. The outcome of this work is a research tool in the form of an annotated and classified bibliographic database.

 

PDF
314--319 J. Bruce Millar Stability Of Long Term Acoustic Features

 

Abstract  This paper is a progress report on an ongoing study of speaker characteristics in a number of acoustic feature domains and a number of temporal domains. Variations in the long-term analysis of timing, energy distribution, and fundamental frequency distribution over a three month period for 33 speakers of Australian English are presented. These data are based on 5 reading passages of a nominal duration of one minute.

 

PDF
320--325 J. Pittam, C. Gallois and V.J. Callan The Long-Term Acoustic Characteristics Of Emotion

 

Abstract  Long-term spectra of recordings of three standard passages differing on perceived dominance and arousal were examined for 30 Australian speakers using three-mode principal components analysis. Results indicated that both affective dimensions were reflected systematically in the spectra, with dominance especially prominent in the upper part of the spectrum, and arousal affecting particularly two bands below 3 kHz.

 

PDF
326--333 Hiroaki Oasa and J. Bruce Millar Acoustic Processing Of Phonetically Controlled Vowels

 

Abstract  A phonetically controlled vowel database, derived from 594 vowel samples spoken by adult males, adult females, and children, was analysed using LPC techniques to obtain a formant description of the database. The measured formants were uniformly transformed using various scaling factors derived from averaged acoustic features or from anatomical features. The effectiveness of these transformations as a first-stage normalisation procedure is evaluated, and the residual inter-speaker variation discussed.

 

PDF

Disorders

Pages Authors Title and Abstract PDF
334--337 Geoff Plant Speech Test Procedures For Use With Hearing Impaired Aboriginal Children

 

Abstract  The development of a speech test for use with Aboriginal children who speak Warlpiri as their first language is described. The test is extremely simple and easy to administer but appears to give reliable results. Possible applications for other Aboriginal languages are considered.

 

PDF
338--343 Megan D. Neilson and Peter D. Neilson Sensory-Motor Integration Capacity Of Stutterers And Nonstutterers

 

Abstract  We review a series of studies concerning the auditory-motor and visual-motor tracking performance of stutterers and nonstutterers. We find no evidence of lateralization differences between the groups and interpret the finding that stutterers perform auditory tracking tasks significantly less well than nonstutterers as evidence of a deficit in ability to form internal auditory-motor models which subserve speech control.

 

PDF
344--349 Corinne Adams Prosody And Airflow In Deaf Speech And Visual Feedback Remediation

 

Abstract  An acoustic/aerodynamic investigation of the speech of normal-hearing and profoundly deaf children is reported. The speech of the latter improved significantly following visual feedback remediation.

 

PDF
350--357 J. Ingram, B. Murdoch and H. Chenery Prosody In Hypokinetic And Ataxic Dysarthria

 

Abstract  This paper contrasts patterns of speech prosody in hypokinetic dysarthria and ataxic dysarthria. A perceptual and acoustic analysis of the metrical component of prosody in dysarthric speech is undertaken.

 

PDF

Technical Aids

Pages Authors Title and Abstract PDF
358--363 Rolf Carlson, Bjorn Granstrom and Sheri Hunnicutt Applications Of Speech Technology In Aids For The Disabled

 

Abstract  A number of technical aids which include speech synthesis or speech recognition have been developed at the Department of Speech Communication and Music Acoustics and are now being used by disabled individuals. Applications of synthetic speech include a communication aid, a symbol-to-speech system, word predictors, talking terminals and a daily newspaper. Speech recognition is also being used in a communication aid.

 

PDF
364--369 P.J. Blamey and G.M. Clark Perception Of Synthetic Vowels And Stop Consonants By Cochlear Implant Users

 

Abstract  Three multiple-channel cochlear implant users were tested with speech sounds that were synthesized using electrical parameters representing the fundamental frequency of the voice, and the frequencies and amplitudes of the first and second formants. Using vowels of equal duration and loudness, it was shown that most of the vowel recognition could be attributed to the formant coding. Unvoiced stops with varying burst frequencies, voiced stops with varying second formant loci, bilabial stops with varying voice onset times, and bilabial consonants with varying formant transition durations were also synthesized. For each consonant set, the responses showed similar patterns to those observed with normally hearing listeners for analogous acoustic stimuli. Interactions between amplitude and frequency cues were observed.

 

PDF
370--376 H.H. Lim, Y.C. Tong and G.M. Clark Identification Of Synthetic Vowel Nuclei By Cochlear Implant Patients

 

Abstract  Six speech processing schemes, differing in the formant frequency-to-electrode position map and the number of formant frequencies encoded, were investigated. The six schemes consisted of two single-formant (F2) schemes, three two-formant (F1 and F2 or F2 and F3) schemes and one three-formant (F1, F2 and F3) scheme. Eleven steady state Australian vowel nuclei ([i], [a], [>], [u], [3], [l], [e], [aa], [/\], [o] and [v]) synthesised as electrical signals were used to evaluate the relative merits of the six schemes on three cochlear implant patients. The first five vowels are long vowels and the remaining six are short vowels. The steady state formant frequencies of these vowel nuclei (Bernard, 1970) were transformed to steady state electrode positions using different formant frequency-to-electrode position maps. The confusion matrices were subjected to conditional information transmission analysis. The results showed that: (1) training, experience and adaptability to a new speech processing scheme were the main factors influencing the identification of the synthetic vowels; and (2) adding an extra formant vowel feature to a speech processing scheme tended to decrease the amount of information transmission about the existing formant feature(s). From these synthesis results, the three-formant (F0/F1/F2/F3/B) speech processing scheme appeared to be the logical choice for future implementation in speech processors for cochlear implant patients.

 

PDF

Perception II

Pages Authors Title and Abstract PDF
378--383 W.K. Lai, Y.C. Tong, and G.M. Clark Absolute Identification By Cochlear Implant Patients Of Synthetic Vowels Constructed From Acoustic Formant Information

 

Abstract  Five speech processing schemes for presenting speech information to multiple-channel cochlear implant patients were investigated and compared. Tabulated data for formant frequencies of the natural vowels (i. ,1 ,a,a@, a, o,v,r,r,A,3 ,¤) ) were coded into the parameters of the electric stimuli used in the cochlear implant, and these electric stimuli or synthetic vowels were presented to two patients in a single-interval absolute identification task. The results suggest that when first and second formant speech information is coded into the pulse rate as well as the electrode position, it is possible for the performance in the identification task to be significantly improved, compared to when the same information is coded into the electrode position only.

 

PDF
384--389 Geoff Plant Speech Understanding With Low Frequency Hearing: A Case Study

 

Abstract  A subject with a long standing high frequency loss was tested using a variety of speech materials. The results indicate much useful information may be gained from even limited amounts of low frequency hearing.

 

PDF
390--396 P.J. Blamey and G.M. Clark Combining Tactile, Auditory And Visual Information For Speech Perception

 

Abstract  Four normally hearing subjects were trained and tested with all combinations of a highly degraded auditory input, a visual input via lipreading, and a tactile input using a multichannel electrotactile speech processor. When the visual input was added to any combination of other inputs, a significant improvement occurred for every test. Similarly, the auditory input produced a significant improvement for all tests except closed-set vowel recognition. The tactile input produced scores that were significantly greater than chance in isolation, but combined less effectively with the other modalities. The less effective combination might be due to lack of training with the tactile input, or to more fundamental limitations in the processing of multimodal stimuli.

 

PDF

Coding II

Pages Authors Title and Abstract PDF
398--401 Leisa Condie Word Recognition Using Error-Correction Codes

 

Abstract  The Reed-Solomon error-correction code separates input vectors as far as possible from each other. Such codes are known as Maximum Distance Separable (MDS). This property was investigated in a word recognition system to see whether applying such a code would separate word vectors to such a point that recognition rates improved. A vocabulary of 21 words spoken on four occasions by a single speaker formed the basis of the experiment. First formants for each frame were found with the Interactive Laboratory System (ILS) package. The resulting vectors were encoded with a Reed-Solomon code. The reference set for recognition was formed from the average of all the utterances of each word, and a simple distance metric (after suitable Dynamic Time Warping to align the vector lengths) was used to find the closest reference word. A comparison of performance for encoded and unencoded vectors is made.

 

PDF
402--407 R.E.E. Robinson A Simple Pitch Detector Using A Digital Signal Processor

 

Abstract  A real-time pitch extraction device using the TMS32010 Digital Signal Processor is described. It is designed as a replacement for an analog pitch extractor and performs the same function but with greater accuracy and better dynamic response. The two are compared.

 

PDF
408--413 Bernt Ribbum, Andrew Perkis and K.K. Paliwal Enhancing The Codebook For Improving The Speech Quality Of CELP Coders

 

Abstract  A Code Excited Linear Predictive (CELP) coder with a stochastic-multipulse (STMP) codebook is presented. The LPC residual exhibits a certain structure due to non-linearities in the glottal excitation. This structure can be exploited by a refinement of the STMP excitation signal, as a training procedure for the codebook. The algorithms are described and results are reported, both in terms of segmental SNR and subjective preference.

 

PDF
414--420 S. Sridharan, E. Dawson, and J. O'Sullivan Speech Encryption Using Fast Fourier Transform Techniques

 

Abstract  A speech encryption system based on permutation of FFT coefficients is described. Results of simulation and cryptanalysis of the system are presented.

 

PDF

Analysis IV

Pages Authors Title and Abstract PDF
422--425 R. Potapova The Length And Variability In Connected Speech For Russian

 

Abstract  This paper presents automatic segmentation of speech in a "bottom-up" manner, taking into account a number of linguistic constraints: the specification of texts based on the classification of speech acts, the classification of text fragments on the basis of semantic-syntactic analysis, the segmentation of the utterance when the end-points of a phrase are determined, the segmentation into syllabic units, and the specification of length and its variability in connected Russian speech.

 

PDF
426--428 V.V. Potapov The Rhythmic Organization Of Speech In Czech And Russian

 

Abstract  This research describes perception as revealed in the process of segmentation of Czech and Russian speech into rhythmic structures (RS), and describes the peculiarities of the prosodic features of segmentation in Czech and Russian. In this investigation rhythm is defined as a regular recurrence of speech units in an utterance. These units comprise syllables, rhythmic structures (phonetic words), sense-groups (syntagmas) and phrases.

 

PDF
 
Copyright © ASSTA