Proceedings of SST 1988
Page numbers refer to nominal page numbers assigned to each
paper for purposes of citation.
Synthesis I
Pages |
Authors |
Title and Abstract |
PDF |
2--7 |
Rolf Carlson, Bjorn Granstrom and Sheri Hunnicutt |
Rulsys - The Swedish Multilingual Text-To-Speech Approach
Abstract
Speech synthesis has been a major field of research at our department for several decades. The projects range from basic research on speech production models to applications of speech technology, e.g., for handicapped persons. In this contribution we will concentrate on the development strategies, describe the development environment and discuss some recent results. The synthesis is based on a combination of modules including lexica and rule components. Even if the number of components is about the same for different languages, the emphasis on the different parts varies considerably due to language structure. Rule development is done in the generative phonology tradition. The development system, originally written for a different computer, has now been moved to our network of Apollo workstations and integrated with speech analysis and resynthesis software. Expanded use of morphological and syntactical analysis has proved useful in several languages. Recent experiments with an expanded synthesis model including a more realistic voice source, the LF-model, have given new possibilities to vary both speaker type and speaking style.
|
PDF |
8--13 |
Clive D Summerfield, and Marwan A Jabri |
A Formant Speech Synthesiser Asic: Functional Design
Abstract
This paper is the first of two companion papers on the design and implementation of a multi-channel formant speech synthesiser Application Specific Integrated Circuit (ASIC). The objective of this research is the development of an efficient VLSI structure which can be implemented as a single VLSI device, yet retains the acoustical performance necessary to generate high-quality, highly intelligible synthetic speech and has sufficient processing bandwidth for multi-channel operation. This paper concentrates on the functional design of a VLSI formant speech synthesis structure for achieving these objectives.
|
PDF |
14--20 |
Marwan A Jabri, Kiang Ooi Tan and Clive D Summerfield |
A Formant Speech Synthesiser Asic: Implementation
Abstract
This paper is the second of two companion papers on the design and implementation of a multichannel formant speech synthesiser Application Specific Integrated Circuit (ASIC). In this paper, we describe the implementation aspects of the project. The use of the silicon compiler FIRST in the conception of the synthesiser has reduced the design time considerably. However, as the 5 µm NMOS primitive library of FIRST falls short in providing sufficient processing bandwidth for multi-channel operation, we implemented a new primitive library using the European Silicon Structure (ES2) standard cell design tool (2 µm CMOS). The library is implemented using the MODEL hardware description language. The implementation of the primitives using MODEL is discussed in detail together with the clocking and data synchronisation strategies necessary for reliable high-speed bit-serial operation.
|
PDF |
Perception I
Pages |
Authors |
Title and Abstract |
PDF |
22--27 |
R. H. Mannell |
Perceptual Space Of Male And Female Australian English Vowels
Abstract
This study investigates the phonemic space of synthetic male and female vowel tokens as perceived by native speakers of Australian English. The data were also examined for evidence of vowel normalisation.
|
PDF |
28--34 |
U Thein-Tun |
The Gender And Individual Variations In Processing Linguistic-Phonetic Cues
Abstract
The perception of integrated phonetic cues by males and females was investigated at five levels of information processing. The integrated phonetic cues were the intensity and duration of voice-onset-time in relation to the intensity of the following vowel for the syllable-initial /d/-/t/ distinction. The five levels of information processing were the auditory, phonetic, syllable, word, and sentence levels. The results demonstrate that listeners who cannot effectively process the cues at the auditory and phonetic levels can process them very effectively at the sentence level, and vice versa. Most female listeners belong to the former group.
|
PDF |
Coding I
Pages |
Authors |
Title and Abstract |
PDF |
36--41 |
M.J. Flaherty |
On The Representation Of Time-Varying Lpc Parameters By Cubic Splines With Variable Knots
Abstract
The modelling of log area coefficients by cubic splines with variable knots is discussed. Results are presented which compare variable and uniform knot modelling for a connected digit utterance spoken over a telephone handset using least squares error and spectral difference measures.
|
PDF |
42--47 |
H Gondokusumo and T.S. Ng |
Subband Coding Of Speech Using M-Band Parallel Quadrature Mirror Filters
Abstract
In this paper, the design of both analysis and synthesis filters for a 3-band parallel sub-band coder with perfect reconstruction properties is given. An example of this design using FIR filters of 17th order is implemented on an IBM PC/AT. Experimental results using this subband coder on segments of speech will be demonstrated in the presentation.
|
PDF |
48--53 |
A. W. Johnson and A. B. Bradley |
The Effect Spectral Modifications Have Upon The Performance Of Frequency Domain Coders
Abstract
The effect of spectral modifications on the performance of low data rate frequency domain coders employing overlapped transform operations and the Ramstad bit-allocation procedure is the introduction of distortion into the recovered signal. While this distortion cannot be removed, certain aspects of it can be controlled. This theoretical analysis suggests the design of a new bit-allocation procedure based upon the Ramstad majority vote rule (Ramstad 1984). This is the subject of ongoing research which, it is hoped, will improve the overall subjective quality of speech recovered from a low data rate frequency domain coder.
|
PDF |
54--59 |
S N Koh and P. Sivaprakasapillai |
Analysis And Synthesis Method For Packet Speech Enhancement
Abstract
This paper describes a novel system which employs the filter bank analysis and synthesis method for the packetisation of speech for transmission in packet data communications. Computer simulations of the system indicate that a significant improvement in the perceptual quality of the recovered speech can be obtained, even with zero substitution, compared to the conventional technique of straight packetisation of PCM speech. Further improvement is possible through frequency-domain component replication.
|
PDF |
60--66 |
A. Perkis, B. Ribbum, T. Ramstad |
Improving Subjective Quality In Waveform Coders By The Use Of Postfiltering
Abstract
Adaptive postfiltering is shown to significantly enhance the perceived speech quality of medium bit rate coders. The postfilter, utilizing auditory masking properties, provides an adaptive shaping of the noise and signal spectra, thus reducing the perceived quantization noise level at the cost of introducing some extra signal distortion. This paper will discuss the effectiveness of postfiltering in three distinctly different coding schemes. These are Regular Pulse Excited linear predictive coding (RPE), representing LPC-based coding schemes, and Adaptive Sub-Band Coding (SBC) and Adaptive Transform Coding (ATC), representing frequency domain coders.
|
PDF |
Analysis I
Pages |
Authors |
Title and Abstract |
PDF |
68--73 |
Phil Rose |
Normalisation Of Tonal F0 From Long Term F0 Distributions
Abstract
An attempt is described to ascertain whether the F0 of 7 speakers' tones can be normalised using parameters from their long-term F0 distributions. It is shown that normalisation using the long-term mean and standard deviation is not as effective in reducing the between-speaker variance as normalisation with parameters derived from the tones themselves. However, the approach is still successful enough to be worth pursuing, and some suggestions for improvement are indicated.
|
PDF |
74--79 |
J.S. Chang and Y.C. Tong |
Development Of A Switched Capacitor Speech Spectrum Analyzer System Design
Abstract
The design of a novel Low Power Monolithic Time-Multiplexed Switched Capacitor Speech Spectrum Analyzer is described. Essential features are specified with comments on the reasons for the design decisions. An experimental four-channel spectrum analyzer has been fabricated, and measurements on prototypes show that the design specifications are satisfied.
|
PDF |
80--85 |
Michael G. Barlow and Michael Wagner |
Prosody As A Basis For Determining Speaker Characteristics
Abstract
A speaker identification experiment based on prosodic features is described. Five speakers recorded a set of four sentences in five separate sessions over a period of one week. For each of these utterances, the energy, fundamental frequency, voicing and linear prediction error contours were extracted. For each sentence (four) and each type of contour (four), distance measures based on dynamic time warping were calculated between all twenty-five (five speakers by five repetitions) contours. These distances were compared on an inter-speaker versus intra-speaker basis, and the ratio was generally found to be large. Parameters within the distance measuring process, namely warping window size and contour smoothing, were altered and the effects on speaker distances are discussed.
|
PDF |
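The inter-speaker versus intra-speaker comparison in the abstract above rests on dynamic time warping between contours. A minimal sketch of a DTW distance between two one-dimensional contours is given below; this is an illustrative textbook formulation, not the authors' implementation, and the window-size and smoothing parameters they varied are omitted.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D contours.

    Classic O(len(a) * len(b)) dynamic programme; the local cost is the
    absolute difference between samples.
    """
    INF = float("inf")
    n, m = len(a), len(b)
    # dp[i][j] = cost of the best warp aligning a[:i] with b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # stretch a
                                  dp[i][j - 1],      # stretch b
                                  dp[i - 1][j - 1])  # step both
    return dp[n][m]

# Identical contours warp onto each other at zero cost,
# and a time-shifted copy still aligns at zero cost.
print(dtw_distance([1, 2, 3, 2, 1], [1, 2, 3, 2, 1]))     # 0.0
print(dtw_distance([1, 2, 3, 2, 1], [1, 1, 2, 3, 2, 1]))  # 0.0
```

A speaker-distance experiment of the kind described would apply such a measure to every pair of extracted contours and compare the within-speaker and between-speaker averages.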
86--91 |
Kim E. A. Silverman |
Utterance-Internal Prosodic Boundaries
Abstract
This paper investigates minor prosodic boundaries that often occur in fluent speech, and yet are not well understood. A corpus was collected of utterances with a range of segmental structures, where each utterance was spoken both with and without such an internal boundary. Acoustic measurements of the utterances were then related to perceptual ratings of the salience of the boundaries. Results showed that the F0 fall from the preceding pitch accent is much steeper before a boundary, and that while these boundaries do not contain pauses, they do alter the temporal structure of the speech. The segmental material is lengthened, and the preceding F0 accent occurs considerably earlier relative to its accent-bearing syllable.
|
PDF |
92--97 |
Lori F. Lamel |
Spectrogram Readers' Identification Of Stop Consonants
Abstract
This paper reports on the performance of five spectrogram readers at identifying spectrograms of stop consonants extracted from continuous speech. The stops were spoken by 299 talkers and were presented in their immediate phonemic context. The task was designed to minimize the use of lexical and other higher sources of knowledge. The averaged identification rate across contexts ranged from 73-82% for the top choice, and 77-93% for the top two choices. The readers' performances were comparable to those of other spectrogram reading experiments reported in the literature; however, the other studies have typically evaluated a single subject on speech spoken by a small number of talkers.
|
PDF |
Recognition I
Pages |
Authors |
Title and Abstract |
PDF |
100--105 |
D.Rainton and S.J Young |
Consonant Recognition Using The Covariance Of The Pseudo Wigner Distribution
Abstract
It is generally accepted that the consonant of a consonant-vowel (CV) pair can be identified by the nature of the formant transitions in the vowel. STFT power spectral snapshots fail to capture the detailed time-varying nature of these transitions. In this paper we show that such spectra can be considered weighted time averages of the pseudo-Wigner distribution (PWD) when appropriate Gaussian windows are used in the computation of both. Given this interpretation, we then speculate as to whether the higher order statistics of the PWD convey additional consonant discriminant information. Experimental evidence indicates that they do.
|
PDF |
106--111 |
J. R. Sholicar and F. Fallside |
A Prosodically And Lexically Constrained Approach To Continuous Speech Recognition
Abstract
Psycholinguistic studies have indicated that prosodic cues play a vital role in human speech perception. The prosodic relationships which exist within an utterance are believed to provide fundamental cues for structuring the recognition process. However, in the majority of reported systems for the automatic recognition of continuous speech, prosodic cues are seldom used. In this paper, we review the evidence supporting the exploitation of prosodic cues, and discuss how such cues can be exploited within a machine recognition system to improve the segmental parsing strategy. A practical implementation is then proposed, in which prosodic structure is a major factor in the organisation of the recognition process. The architecture of the system is described and preliminary results relating to the current development of this system are discussed.
|
PDF |
112--118 |
P.Pierucci, A.Paladin |
Multistage Vector Quantization With Acoustic Constraints For Speaker Verification
Abstract
In this paper a new method to build multisection codebooks for the speaker recognition task is investigated. Different methods of threshold evaluation are then discussed for the proposed approach, and a comparison with single section VQ and previously reported Multisection VQ is discussed, in a fixed text speaker verification experiment.
|
PDF |
Production
Pages |
Authors |
Title and Abstract |
PDF |
120--125 |
Peter D. Neilson, Megan D. Neilson, and Nicholas J. O'Dwyer |
Redundant Degrees Of Freedom In Speech Control: A Problem Or A Virtue?
Abstract
It is well established that rapid, functionally specific compensations for unexpected perturbations occur in speech articulators remote from the site of the disturbance. We interpret this in terms of an adaptive controller which incorporates an inverse internal model of the sensory-motor system involved. By using "compliant" control, in which variables representing redundant degrees of freedom are set equal to the feedback of their actual values, sensory consequences crucial to a task can be protected from external disturbances. This subsumes the notions of coordinative structures and feedforward processes.
|
PDF |
126--131 |
David Slater |
Intrinsic Effects Of Voiced And Voiceless Unaspirated Prevocalic Stops On Fundamental Frequency In Luang Prabang Lao
Abstract
Three tones, of two native speakers of the Luang Prabang variety of Lao, are investigated as to the effects of the voicing or voicelessness of the initial consonant of a syllable on the fundamental frequency of the following vowel. Three vowels are used, and the intrinsic effect of vowel height on fundamental frequency observed. The results from this investigation support the hypothesis that the intrinsic effect of lowering of fundamental frequency after voiced consonants is minimised in tonal languages.
|
PDF |
132--137 |
Andrew Butcher |
Invariance And Variability In Tongue Contact Patterns: Electropalatographic Evidence
Abstract
An attempt was made to quantify the variability of tongue contact patterns at certain stages during the pronunciation of VCV sequences, as registered by an electropalatograph system. Of specific interest were: total contact area during the vowels and the rate of change of contact for consonant closures and releases. Results are discussed in the light of some current theories of coarticulation.
|
PDF |
138--144 |
Jinshi Huang |
An Arma Model Of Speech Production Process With Applications
Abstract
An autoregressive moving-average (ARMA) model of the speech production process is proposed. The orders of the model are determined from the generalized partial autocorrelation (GPAC) pattern. Based on maximum likelihood estimation, the parameters of the model are estimated via the Marquardt algorithm. Experiments show that fricatives can be better modeled as an ARMA process. The autocorrelation of the residuals can be used for pitch detection and voiced/unvoiced speech classification.
|
PDF |
Analysis II
Pages |
Authors |
Title and Abstract |
PDF |
146--151 |
Frantz Clermont |
A Dual Exponential Model For Formant Trajectories Of Diphthongs
Abstract
Australian English diphthongs are studied in terms of their second formant-frequency trajectories. These sigmoid-shaped trajectories may be decomposed, around a suitable breakpoint, into two exponential functions approaching two distinct vowel targets. In order to obtain this dual exponential representation, a set of candidate breakpoints defined along the inter-target transition is used to divide a given trajectory into two segments, thus simplifying the problem to that of fitting two single exponentials. A succession of such fits is performed, and the best pair of exponentials is determined in a root-mean-square sense. The method developed for constructing and evaluating the dual exponential model is described and illustrated. While the model fares well in a curve-fitting sense, its components do not always admit of a sensible phonetic interpretation in the case of an incomplete gesture towards the second vowel target.
|
PDF |
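The breakpoint search described in the abstract above can be sketched as a grid search over candidate breakpoints, fitting one exponential per segment and keeping the pair with the lowest root-mean-square error. This is an illustrative reconstruction only: the exact exponential form, the anchoring at the breakpoint, and the discrete rate grid are all assumptions, not the paper's method.

```python
import math

def segment_rmse(ts, ys, t_anchor, y_anchor, target, rates):
    """RMS error of the best exponential approach to `target`, anchored at
    the breakpoint: y = target + (y_anchor - target) * exp(-k |t - t_anchor|),
    with the rate k drawn from a small grid (a simplifying assumption)."""
    best = float("inf")
    for k in rates:
        sq = sum((y - (target + (y_anchor - target)
                       * math.exp(-k * abs(t - t_anchor)))) ** 2
                 for t, y in zip(ts, ys))
        best = min(best, math.sqrt(sq / len(ys)))
    return best

def dual_exponential_fit(ts, ys, target1, target2, rates):
    """Grid search over interior breakpoints: each candidate splits the
    trajectory into a segment approaching target1 (backwards in time) and
    one approaching target2 (forwards); keep the best-fitting pair."""
    best_bp, best_err = None, float("inf")
    for bp in range(2, len(ts) - 2):
        e1 = segment_rmse(ts[:bp + 1], ys[:bp + 1], ts[bp], ys[bp], target1, rates)
        e2 = segment_rmse(ts[bp:], ys[bp:], ts[bp], ys[bp], target2, rates)
        err = math.hypot(e1, e2)
        if err < best_err:
            best_bp, best_err = bp, err
    return best_bp, best_err

# Synthetic F2-like trajectory (hypothetical values): leaves ~1000 Hz,
# breakpoint at t = 10, then approaches 2200 Hz.
ts = list(range(21))
ys = [1000 + 500 * math.exp(-0.5 * abs(t - 10)) if t <= 10
      else 2200 - 700 * math.exp(-0.5 * (t - 10)) for t in ts]
bp, err = dual_exponential_fit(ts, ys, 1000, 2200, [0.2, 0.3, 0.5, 0.8])
print(bp, err)  # recovers the true breakpoint with near-zero error
```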
152--157 |
C. D. Summerfield |
Pole-Zero Analysis For The Detection Of Nasality
Abstract
The detection of nasality and the fine-class categorisation of nasal segments is important for the success of a phonetically based speech recognition machine. In this paper the application of a pole-zero modelling algorithm to this problem is described. It is well known that nasal segments are characterised, and may be classified, by the presence and location of the vocal tract transfer function zeros. The aim, in applying the pole-zero algorithm, is to elucidate this particular acoustic feature. However, as will be demonstrated, the zero response from the pole-zero algorithm shows a considerable amount of extra activity which cannot be attributed to the vocal tract zeros alone. These results indicate that the application of the pole-zero algorithm yields results which are difficult to interpret and, consequently, of limited use in this application.
|
PDF |
158--163 |
R. H. Mannell |
Spectral Distortion And Spectral Distance Measures
Abstract
Several different spectral distance measures have been compared in order to see which measures most closely correlate with the intelligibility of speech systematically distorted by various channel vocoder configurations.
|
PDF |
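The correlation study in the abstract above compares members of the family of spectral distance measures. As one representative member (not necessarily among those the paper compared), the RMS log-spectral distance between two magnitude spectra can be sketched as follows:

```python
import math

def log_spectral_distance(spec_a, spec_b):
    """RMS log-spectral distance in dB between two magnitude spectra
    sampled at the same frequencies."""
    assert len(spec_a) == len(spec_b)
    sq = 0.0
    for a, b in zip(spec_a, spec_b):
        d = 20.0 * math.log10(a / b)  # per-bin level difference in dB
        sq += d * d
    return math.sqrt(sq / len(spec_a))

# Identical spectra are at distance 0 dB; a uniform factor-of-2 gain
# in magnitude shows up as a constant ~6.02 dB offset.
flat = [1.0, 2.0, 4.0, 2.0, 1.0]
print(log_spectral_distance(flat, flat))                  # 0.0
print(log_spectral_distance([2 * x for x in flat], flat))
```

A distortion-versus-intelligibility study would correlate such per-frame distances, averaged over an utterance, against listener scores.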
164--169 |
A. Al-Otaibi and Y. El-Imam |
Automatic Segmentation Of Speech Signals Into Arabic Syllables
Abstract
Due to the nature of the Arabic language in terms of the rules governing the formation of syllables from phonemes, and the one-to-one correspondence between the acoustics and the phonetics of Arabic syllables, a syllable-based approach to speech recognition of the Arabic language has a high potential for success. An experimental evaluation of an algorithm for the automatic segmentation of speech into Arabic syllable units is reported here. The parameter used for segmentation is the energy of the acoustic signal. Speech data consisting of mono-syllabic and multi-syllabic words were used to test the automatic Arabic syllable segmentation algorithm. The algorithm has the advantage of being simple to implement.
|
PDF |
170--175 |
Xi Xiao, D. Nandagopal & D.A.H. Johnson |
On The Application Of Ar Model In Segmenting Isolated-Word Speech Signals
Abstract
In this paper, we present a segmentation method which uses AR modelling of the spectrum of the fullwave rectified speech signal. The FFT of the model coefficients yields a smoothed time domain signal with well defined minima, which locate the segment boundaries. The method is robust in the presence of noise. It is a useful first step in speech processing to segment speech signals into sub-frames which can be treated as time-invariant (stationary) processes.
|
PDF |
Assessment, Intelligibility & Cognition
Recognition II
Pages |
Authors |
Title and Abstract |
PDF |
198--203 |
M.D. Alder |
Automatic Extraction Of Syntax Applied To Speech Recognition.
Abstract
The successful recognition of speech depends heavily upon the use of contextual information. Moreover, the contextual information is too extensive to be put in by hand; this has led to attempts to automate the process. At the level of phonemic data, hidden Markov models are extensively used, while at the lexical level the principal method is that of n-grams (F. Jelinek, 1985). The n-gram method has two related problems associated with it. The first is that there are a very large number of n-grams of consecutive words in English text for n greater than one, and the number goes up very quickly with n. The second is that even this number is sparse in real data, so that even with 'training sets' of millions of words of text, any new text contains a large fraction of n-grams never seen before. And this fraction also increases rapidly with n. The question of what to do in this case is a central problem for this and other classes of stochastic grammars. In this paper I describe algorithms which address both problems. The first issue, that of storing large numbers of n-grams, is treated by storing not the n-grams themselves but classes of n-grams which are 'close together'. The second issue, that of sparseness of data, is solved by a derived method: we average over a neighbourhood of n-grams. The algorithms are computationally intensive, but are amenable to parallelisation. There are implications for layered neural networks.
|
PDF |
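The data-sparseness problem motivating the abstract above is easy to demonstrate on a toy scale: count the bigrams of a training text and measure what fraction of a held-out text's bigrams were never seen. The sketch below only illustrates the problem, not the paper's class-based or neighbourhood-averaging solutions.

```python
from collections import Counter

def bigrams(words):
    """Consecutive word pairs (n-grams with n = 2)."""
    return list(zip(words, words[1:]))

def unseen_fraction(train_words, test_words):
    """Fraction of held-out bigrams never observed in training --
    the sparseness that grows rapidly with n."""
    seen = Counter(bigrams(train_words))
    test = bigrams(test_words)
    unseen = sum(1 for bg in test if bg not in seen)
    return unseen / len(test)

train = "the cat sat on the mat".split()
test = "the cat sat on the hat".split()
print(unseen_fraction(train, test))  # 0.2  (only "the hat" is novel)
```

Even a one-word change in this tiny held-out sentence yields 20% unseen bigrams; with real corpora and larger n the unseen fraction stays substantial, which is what drives the move to classes of n-grams.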
204--209 |
R.A. Bennett, E. Lai, and Y. Attikiouzel |
A Connected Speech Parse For Australian English Utilizing Matrix Syllable Formation
Abstract
A system is proposed to enhance the speed of connected speech recognition systems by the formation of syllables from a phoneme string. Binary matrices are used to provide fast calculation of syllables, which are used in modified dictionary search patterns. The system is designed for use with simple recognition systems which provide minimal allophonic information. Some preliminary results are discussed.
|
PDF |
210--215 |
T. Svendsen, K. K. Paliwal, E. Harborg, P. O. Husoy |
Experiments With A Sub-Word Based Speech Recognizer
Abstract
A system for sub-word based speech recognition is described. The system is evaluated on a vocabulary of 42 Norwegian isolated words, and its performance is compared to that of whole-word based Hidden Markov Model and Dynamic Time Warping speech recognition systems.
|
PDF |
216--221 |
Frantz CLERMONT and Simon J. BUTLER |
Prosodically Guided Methods For Nearest Neighbour Classification Of Syllables
Abstract
An approach to Nearest Neighbour (NN) classification of syllables in continuous speech is described. Acoustic-prosodic segmentation of speech is used to guide the conventional Dynamic Time Warping (DTW) distance measure. The acoustic-prosodic analysis robustly determines the nuclei of syllables, and establishes neighbouring intervals which include the syllable boundaries. The limits of these intervals are then used to define the global constraints, which serve to restrict the DTW warping paths within an allowable region. Reliable syllable boundaries are therefore determined implicitly in the matching process. Furthermore, when the proposed method is used in NN-classification of a small database of Australian English diphthongs embedded in continuous speech, the accuracy is comparable to that achieved by current DTW-based systems for isolated word recognition.
|
PDF |
222--228 |
Simon J. BUTLER and Frantz CLERMONT |
On The Asymptotic Performance Of Nearest Neighbour Pattern Classifiers In Speech Recognition
Abstract
When distance measures based on Linear Prediction are used in Nearest Neighbour speech recognisers with a large number of training samples, it is found that the recognition performance is independent of the distance measure used. This contrasts with the case of small training sample sizes, in which performance is highly sensitive to the choice of distance measure. The "asymptotic nearest neighbour equivalence" of this class of distance measures is explained and demonstrated in a vowel recognition experiment.
|
PDF |
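The nearest neighbour rule at the heart of the two abstracts above takes the distance measure as a pluggable component, which is what makes comparisons across distance measures meaningful. A minimal 1-NN sketch follows; the Euclidean distance and the toy two-formant vowel vectors are illustrative stand-ins, not the paper's LPC-based measures or data.

```python
import math

def nearest_neighbour(train, query, distance):
    """1-NN classification: return the label of the training sample
    closest to the query under the supplied distance measure."""
    best_label, best_d = None, float("inf")
    for vector, label in train:
        d = distance(vector, query)
        if d < best_d:
            best_label, best_d = label, d
    return best_label

def euclidean(a, b):
    """Stand-in distance; any LP-based measure could be swapped in."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy two-class problem in a 2-D "formant space" (hypothetical values).
train = [((300, 2300), "i"), ((320, 2250), "i"),
         ((700, 1200), "a"), ((680, 1250), "a")]
print(nearest_neighbour(train, (310, 2280), euclidean))  # i
print(nearest_neighbour(train, (690, 1220), euclidean))  # a
```

The asymptotic-equivalence result concerns what happens to such a classifier's error rate as the training list grows large, independently of which admissible distance is plugged in.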
Tools
Pages |
Authors |
Title and Abstract |
PDF |
230--233 |
H.S.J.Purvis. |
The Control Of A Speech Synthesiser By An Ibm Pc.
Abstract
This paper describes a computer program and interface card used to control a serial formant speech synthesiser. The program is used to enter parametric data from the keyboard; the data may then be modified using a mouse.
|
PDF |
234--239 |
Ara Samouelian and Clive D. Summerfield |
Computational Model Of The Peripheral Auditory System For Speech Recognition: Initial Results
Abstract
This paper describes the design of a computational model of the peripheral auditory system, which is controlled via the AUDLAB Interactive Speech Signal Processing Package, using a programmable harness to interface the AUDLAB command protocol and track file format to the structural model of the cochlear processor. A suite of signal processing modules originally developed for speech synthesis research has been supplemented by a number of non-linear signal processing modules to model the transduction stage of the cochlea. Some initial results of the cochlear processor model and its performance on real speech signals are presented.
|
PDF |
240--243 |
R.A. Wills and Y.C. Tong |
An ILS Compatible Wide Band Spectrum Analysis/Plotting Program (WBS)
Abstract
The software tool WBS, which plots spectrograms from digitally recorded ILS compatible speech files, is presented. The program's operation is explained, and a comparison is shown between a speech file plotted on a PostScript printer, with selected frequency/time resolution, and the normal spectrogram plot from a "KAY" machine.
|
PDF |
244--247 |
Arthur Lagos and Michael Wagner |
An Integrated Audio Signal Interface For Use In The Teaching Laboratory
Abstract
An audio signal interface for the IBM PC-AT is described which was specifically designed for student use in the teaching laboratory. The interface, which is implemented as an IBM PC-AT plug-in board, allows the recording and playback of audio signals directly to and from disk files. All functions of signal conditioning, data conversion and the DMA bus interface are integrated on the board, which provides for microphone/line inputs and headphone/line outputs.
|
PDF |
248--255 |
H.S.J.Purvis. |
A General Purpose Speech Editor, The Speak Language.
Abstract
This paper describes a simple computer language, known as the speak language, that is used in a general purpose speech editor to output words or sentences in a defined sequence with specified timing.
|
PDF |
Applications
Pages |
Authors |
Title and Abstract |
PDF |
256--261 |
C Wheddon |
Human-Computer Speech Communication Systems
Abstract
The principal means of human communication is speech; this modality is now replicated by computer systems that are able to hold a limited but useful conversation with the user. Systems in operation and under development are described.
|
PDF |
262--267 |
R. W. King and A.J. Hunt |
A Synthetic Speech Terminal. For Viatel: Design, Implementation And Performance
Abstract
Speech synthesis technology has been incorporated successfully into several computer systems for use by blind people. There has, however, been relatively little attention paid to the specific problems of using visually-conceived information services such as videotex (of which Australia's Viatel service is an example) with synthesised speech output. In this implementation of a PC-based prototype 'talking-videotex' terminal, page layout processing and comprehensive user controls are provided to overcome the problems of page scanning. The terminal incorporates a low-cost SSI-263 synthesiser chip, with software to produce an Australianised accent, together with word-based prosody.
|
PDF |
268--273 |
P.J. Kennedy and J.E. Clark |
Operational Language In The Cockpit/Flightdeck Communication Environment Of Australian Civil Aviation Aircraft
Abstract
A recent survey of cockpit noise and communications in 44 Australian civil aviation aircraft included the compilation of a corpus of operational language material heard by aircrew during the performance of their duties. Preliminary analyses were made of the lexicon, syntax and message content of 1,726 transmissions. Constraints found upon the operational vocabulary and message-set construction heard by pilots present opportunities for applications of current speech technology to the civil aviation cockpit. Access to a suitable language database could be very useful for such applications.
|
PDF |
274--281 |
P.J. Kennedy and J.E. Clark |
Ambient Noise In The Cockpit/Flightdeck Communication Environment Of Australian Civil Aviation Aircraft
Abstract
A recent survey of ambient sound pressure levels and noise spectra occurring in the cockpit/flightdeck of various categories and classes of civil aviation aircraft is described. Variations in the cockpit noise between aircraft classes and during different flight operations indicate that cockpit speech technology should be evaluated under a range of conditions, although conditions within aircraft of the same category or class are similar enough to allow construction of an 'average' noise environment for a specified flight condition.
|
PDF |
Synthesis II
Pages |
Authors |
Title and Abstract |
PDF |
282--287 |
J.E. Clark and R.H. Mannell |
Some Comparative Characteristics Of Uniform And Auditorily Scaled Channel Synthesis
Abstract
This paper examines the comparative phonetic-level intelligibility characteristics of two channel vocoder type synthesis systems, one based on a uniform bandwidth filterbank, and the other on an auditorily scaled filterbank. The intelligibility tests were conducted using listeners with no prior experience of synthesised speech, and employed masking noise to help expose differences in the perceptual robustness of the test corpus. The intelligibility of the natural input speech tested under the same conditions was used as the benchmark for all comparisons. The results suggest that the Bark-scale derived synthesis may have intelligibility characteristics closer to those of natural speech than the uniform filterbank synthesis, is perceptually more robust, and is more cost-effective in its use of available channel encoding.
|
PDF |
288--293 |
Simon J. BUTLER |
A Speech Synthesis System Based On Articulatory Modelling
Abstract
The elements of an articulatory synthesis system under development are described. Particular emphasis has been given to modelling the trans-consonantal coarticulation effects for stops in /V1CV2/ context that have been reported by Ohman (1966).
|
PDF |
294--301 |
Danielle Ribot, Frédéric Le Diberder and Pierre Martin |
The Multivoc Text-To-Speech System
Abstract
MULTIVOC is a real-world text-to-speech system geared to the French language. The full system is described including the technical view and the main application of the product up to now as a basic component of a telephone-based mail service.
|
PDF |
Analysis III
Pages |
Authors |
Title and Abstract |
PDF |
302--307 |
J. Ingram |
Connected Speech Processes And Connected Speech Synthesis
Abstract
Construction of a database for the study of connected speech processes (CSPs) in Australian English is described. Application to the problem of speech rate, style, and sociolect sensitive synthesis is discussed.
|
PDF |
308--313 |
Jeffery Pittam and J. Bruce Millar |
The Longterm Spectrum Of Voice
Abstract
This paper presents an analysis of published information about the long-term spectrum of the voice. The historical development of the measure is first examined leading to a classification of the published works. Techniques used to compute the LTS are then presented, and the utility of the spectrum to various applications is considered. The outcome of this work is a research tool in the form of an annotated and classified bibliographic database.
|
PDF |
314--319 |
J.Bruce Millar |
Stability Of Long Term Acoustic Features
Abstract
This paper is a progress report on an ongoing study of speaker characteristics in a number of acoustic feature domains and a number of temporal domains. Variations in the long-term analysis of timing, energy distribution, and fundamental frequency distribution over a three-month period for 33 speakers of Australian English are presented. These data are based on 5 reading passages of a nominal duration of one minute.
|
PDF |
320--325 |
J. Pittam, C. Gallois and V.J. Callan |
The Long-Term Acoustic Characteristics Of Emotion
Abstract
Long-term spectra of recordings of three standard passages differing on perceived dominance and arousal were examined for 30 Australian speakers using three-mode principal components analysis. Results indicated that both affective dimensions were reflected systematically in the spectra, with dominance especially prominent in the upper part of the spectrum, and arousal affecting particularly two bands below 3 kHz.
|
PDF |
326--333 |
Hiroaki Oasa and J. Bruce Millar |
Acoustic Processing Of Phonetically Controlled Vowels
Abstract
A phonetically controlled vowel database, derived from 594 vowel samples spoken by adult males, adult females, and children, was analysed using LPC techniques to obtain a formant description of the database. The measured formants were uniformly transformed using various scaling factors derived from averaged acoustic features or from anatomical features. The effectiveness of these transformations as a first-stage normalisation procedure is evaluated, and the residual inter-speaker variation discussed.
|
PDF |
Disorders
Pages |
Authors |
Title and Abstract |
PDF |
334--337 |
Geoff Plant |
Speech Test Procedures For Use With Hearing Impaired Aboriginal Children
Abstract
The development of a speech test for use with Aboriginal children who speak Warlpiri as their first language is described. The test is extremely simple and easy to administer but appears to give reliable results. Possible applications for other Aboriginal languages are considered.
|
PDF |
338--343 |
Megan D. Neilson and Peter D. Neilson |
Sensory-Motor Integration Capacity Of Stutterers And Nonstutterers
Abstract
We review a series of studies concerning the auditory-motor and visual-motor tracking performance of stutterers and nonstutterers. We find no evidence of lateralization differences between the groups, and interpret the finding that stutterers perform auditory tracking tasks significantly less well than nonstutterers as evidence of a deficit in the ability to form internal auditory-motor models which subserve speech control.
|
PDF |
344--349 |
Corinne Adams |
Prosody And Airflow In Deaf Speech And Visual Feedback Remediation
Abstract
An acoustic/aerodynamic investigation of the speech of normal-hearing and profoundly deaf children is reported. The speech of the latter improved significantly following visual feedback remediation.
|
PDF |
350--357 |
J. Ingram, B. Murdoch and H. Chenery |
Prosody In Hypokinetic And Ataxic Dysarthria
Abstract
This paper contrasts patterns of speech prosody in hypokinetic and ataxic dysarthria. A perceptual and acoustic analysis of the metrical component of prosody in dysarthric speech is undertaken.
|
PDF |
Technical Aids
Pages |
Authors |
Title and Abstract |
PDF |
358--363 |
Rolf Carlson, Bjorn Granstrom and Sheri Hunnicutt |
Applications Of Speech Technology In Aids For The Disabled
Abstract
A number of technical aids which include speech synthesis or speech recognition have been developed at the Department of Speech Communication and Music Acoustics and are now being used by disabled individuals. Applications of synthetic speech include a communication aid, a symbol-to-speech system, word predictors, talking terminals and a daily newspaper. Speech recognition is also being used in a communication aid.
|
PDF |
364--369 |
P.J. Blamey and G.M. Clark |
Perception Of Synthetic Vowels And Stop Consonants By Cochlear Implant Users
Abstract
Three multiple-channel cochlear implant users were tested with speech sounds that were synthesized using electrical parameters representing the fundamental frequency of the voice, and the frequencies and amplitudes of the first and second formants. Using vowels of equal duration and loudness, it was shown that most of the vowel recognition could be attributed to the formant coding. Unvoiced stops with varying burst frequencies, voiced stops with varying second formant loci, bilabial stops with varying voice onset times, and bilabial consonants with varying formant transition durations were also synthesized. For each consonant set, the responses showed similar patterns to those observed with normally-hearing listeners for analogous acoustic stimuli. Interactions between amplitude and frequency cues were observed.
|
PDF |
370--376 |
H.H. Lim, Y.C. Tong and G.M. Clark |
Identification Of Synthetic Vowel Nuclei By Cochlear Implant Patients
Abstract
Six speech processing schemes, differing in the formant frequency-to-electrode position map and in the number of formant frequencies encoded, were investigated. The six schemes consisted of two single-formant (F2) schemes, three two-formant (F1 and F2, or F2 and F3) schemes and one three-formant (F1, F2 and F3) scheme. Eleven steady-state Australian vowel nuclei ([i], [a], [>], [u], [3], [l], [e], [aa], [/\], [o] and [v]), synthesised as electrical signals, were used to evaluate the relative merits of the six schemes on three cochlear implant patients. The first five vowels are long vowels and the remaining six are short vowels. The steady-state formant frequencies of these vowel nuclei (Bernard, 1970) were transformed to steady-state electrode positions using different formant frequency-to-electrode position maps. The confusion matrices were subjected to conditional information transmission analysis. The results showed that: (1) training, experience and adaptability to a new speech processing scheme were the main factors influencing the identification of the synthetic vowels; and (2) adding an extra formant vowel feature to a speech processing scheme tended to decrease the amount of information transmitted about the existing formant feature(s). From these synthesis results, the three-formant (F0/F1/F2/F3/B) speech processing scheme appeared to be the logical choice for future implementation in speech processors for cochlear implant patients.
|
PDF |
Perception II
Pages |
Authors |
Title and Abstract |
PDF |
378--383 |
W.K. Lai, Y.C. Tong, and G.M. Clark |
Absolute Identification By Cochlear Implant Patients Of Synthetic Vowels Constructed From Acoustic Formant Information
Abstract
Five speech processing schemes for presenting speech information to multiple-channel cochlear implant patients were investigated and compared. Tabulated formant-frequency data for the natural vowels were coded into the parameters of the electric stimuli used in the cochlear implant, and these electric stimuli, or synthetic vowels, were presented to two patients in a single-interval absolute identification task. The results suggest that when first and second formant speech information is coded into the pulse rate as well as the electrode position, performance in the identification task can be significantly improved, compared to when the same information is coded into the electrode position only.
|
PDF |
384--389 |
Geoff Plant |
Speech Understanding With Low Frequency Hearing: A Case Study
Abstract
A subject with a long standing high frequency loss was tested using a variety of speech materials. The results indicate much useful information may be gained from even limited amounts of low frequency hearing.
|
PDF |
390--396 |
P.J. Blamey and G.M. Clark |
Combining Tactile, Auditory And Visual Information For Speech Perception
Abstract
Four normally hearing subjects were trained and tested with all combinations of a highly degraded auditory input, a visual input via lipreading, and a tactile input using a multichannel electrotactile speech processor. When the visual input was added to any combination of other inputs, a significant improvement occurred for every test. Similarly, the auditory input produced a significant improvement for all tests except closed-set vowel recognition. The tactile input produced scores that were significantly greater than chance in isolation, but combined less effectively with the other modalities. The less effective combination might be due to lack of training with the tactile input, or to more fundamental limitations in the processing of multimodal stimuli.
|
PDF |
Coding II
Pages |
Authors |
Title and Abstract |
PDF |
398--401 |
Leisa Condie |
Word Recognition Using Error-Correction Codes
Abstract
The Reed-Solomon error-correction code separates input vectors as far as possible from each other. Such codes are known as Maximum Distance Separable (MDS). This property was investigated in a word recognition system to see whether applying such a code would separate word vectors to such a point that recognition rates improved. A vocabulary of 21 words spoken on four occasions by a single speaker formed the basis of the experiment. First formants for each frame were found with the Interactive Laboratory System (ILS) package. The resulting vectors were encoded with a Reed-Solomon code. The reference set for recognition was formed from the average of all the utterances of each word, and a simple distance metric (after suitable Dynamic Time Warping to align the vector lengths) was used to find the closest reference word. A comparison of performance for encoded and unencoded vectors is made.
|
PDF |
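The matching stage described in the abstract above (Dynamic Time Warping to align vectors of unequal length, then a simple distance to the nearest reference) can be sketched as follows. The feature sequences and word labels are illustrative, and the Reed-Solomon encoding stage is omitted:

```python
import math

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) DTW with absolute-difference local cost."""
    n, m = len(a), len(b)
    # dp[i][j] = cost of the best alignment of a[:i] with b[:j]
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # insertion
                                  dp[i][j - 1],      # deletion
                                  dp[i - 1][j - 1])  # match
    return dp[n][m]

def recognise(utterance, references):
    """Return the label of the reference sequence closest under DTW."""
    return min(references, key=lambda label: dtw_distance(utterance, references[label]))

# Usage with toy first-formant tracks (values are made up):
refs = {"one": [1.0, 2.0, 3.0], "two": [5.0, 5.0, 5.0]}
best = recognise([1.0, 2.0, 4.0], refs)
```

The paper's question is then whether running the feature vectors through an MDS code before this distance computation widens the gaps between word classes.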
402--407 |
R.E.E. Robinson |
A Simple Pitch Detector Using A Digital Signal Processor
Abstract
A real-time pitch extraction device using the TMS32010 Digital Signal Processor is described. It is designed as a replacement for an analog pitch extractor and performs the same function, but with greater accuracy and better dynamic response. The two are compared.
|
PDF |
408--413 |
Bernt Ribbum, Andrew Perkis and K.K. Paliwal |
Enhancing The Codebook For Improving The Speech Quality Of Celp Coders
Abstract
A Code Excited Linear Predictive (CELP) coder with a stochastic-multipulse (STMP) codebook is presented. The LPC residual exhibits a certain structure due to non-linearities in the glottal excitation. This structure can be exploited by a refinement of the STMP excitation signal, as a training procedure for the codebook. The algorithms are described and results are reported, both in terms of segmental SNR and subjective preference.
|
PDF |
414--420 |
S. Sridharan, E. Dawson, and J. O'Sullivan |
Speech Encryption Using Fast Fourier Transform Techniques
Abstract
A speech encryption system based on permutation of FFT coefficients is described. Results of simulation and cryptanalysis of the system are presented.
|
PDF |
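The scrambling idea in the abstract above can be sketched minimally: transform a frame, permute the coefficients with a key-seeded permutation, and inverse-transform. The frame size and key handling here are illustrative assumptions, and the sketch keeps a complex scrambled frame for clarity; a practical system permutes conjugate-symmetric coefficient pairs so the transmitted signal stays real-valued:

```python
import random
import numpy as np

def keyed_permutation(n, key):
    """Deterministic permutation of range(n), seeded by the key."""
    perm = list(range(n))
    random.Random(key).shuffle(perm)
    return np.array(perm)

def scramble(frame, key):
    """Permute FFT coefficients with the keyed permutation."""
    coeffs = np.fft.fft(frame)
    return np.fft.ifft(coeffs[keyed_permutation(len(frame), key)])

def unscramble(frame, key):
    """Apply the inverse permutation (argsort inverts a permutation)."""
    coeffs = np.fft.fft(frame)
    inverse = np.argsort(keyed_permutation(len(frame), key))
    return np.fft.ifft(coeffs[inverse])

# Usage: round trip with the correct key recovers the frame exactly.
frame = np.arange(16.0)
restored = unscramble(scramble(frame, key=42), key=42)
```

Without the key, an attacker must undo an unknown permutation of spectral components, which is the property the paper's cryptanalysis examines.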
Analysis IV
Pages |
Authors |
Title and Abstract |
PDF |
422--425 |
R. Potapova |
The Length And Variability In Connected Speech For Russian
Abstract
This paper presents automatic segmentation of speech in a "bottom-up" manner, taking into account a number of linguistic constraints: the specification of texts based on the classification of speech acts; the classification of text fragments on the basis of semantic-syntactic analysis; the segmentation of the utterance once the end-points of a phrase are determined; the segmentation into syllabic units; and the specification of length and its variability in connected Russian speech.
|
PDF |
426--428 |
V.V. Potapov |
The Rhythmic Organization Of Speech In Czech And Russian
Abstract
This research describes perception as revealed in the process of segmenting Czech and Russian speech into rhythmic structures (RS), and the prosodic features that characterise this segmentation in Czech and Russian. In this investigation, rhythm is defined as a regular recurrence of speech units in an utterance. These units comprise syllables, rhythmic structures (phonetic words), sense-groups (syntagmas) and phrases.
|
PDF |