
Auditory Models

A collection of software, research, history, reflections, and data related to auditory models.

Demos and Software (AMO)

Talk at Johns Hopkins on human phoneme recognition

"The Role of the Cochlea in Human Speech Recognition," Jont Allen (UIUC), CLSP Seminar, Center for Speech and Language Processing, Johns Hopkins University, August 7, 2007 (Vimeo video)

Harvey Fletcher mp3 video (~28 mins long)

Harvey Fletcher movie from c1963

Egbert de Boer publications (to 2005)

CDROM-v2.10

Allen Video about modeling cochlear transduction (ca. 1997)

Videos

Interspeech 2013 Tutorial presentation

Interspeech 2013 Demos of Cue-modified speech

Demos (These older demos are inferior to the Interspeech-2013 Tutorial demos, above)

  • KunLun, software to analyze and modify speech (wav format) using the AI-gram: KunLun (zip) and example wav-file phrases (zip)
  • Demos of what KunLun can do: Video-demos (old broken format: Video-demos)
  • Support documentation that describes the basic speech perception research behind KunLun:
    1. Allen, J. B. and Li, F. (2009). "Speech perception and cochlear signal processing," IEEE Signal Processing Magazine (invited, Life Sciences), 26(4), pp. 73-77, July. (pdf, djvu)
    2. Li, F. and Allen, J. B. (2011). "Manipulation of consonants in natural speech," IEEE Trans. Audio, Speech and Language Processing, pp. 496-504 (officially published July 2010; appeared March 2011). (pdf)
    3. Li, F., Menon, A. and Allen, J. B. (2010). "A psychoacoustic method to find the perceptual cues of stop consonants in natural speech," J. Acoust. Soc. Am., April, pp. 2599-2610. (pdf)
    4. Li, F., Trevino, A., Menon, A. and Allen, J. B. (2012). "A psychoacoustic method for studying the necessary and sufficient perceptual cues of American English fricative consonants in noise," J. Acoust. Soc. Am., 132(4), October, pp. 2663-2675. (pdf)
  • AI-gram source code (zip, txt); if you would like to download this code, ask me for the password.

Research Objectives and Accomplishments

The research in the Human Speech Recognition (HSR) group is directed at a fundamental understanding of speech perception in both normal-hearing (NH) and hearing-impaired (HI) ears. These are related problems; indeed they form a continuum, not two separate topics. Most people are born with normal hearing, and within a few years we learn, seemingly without effort, to understand human speech. How this happens is a mystery, but what happens is not. The research we have done over the past 10 years, documented in the sections below, is a systematic study of the nature of the failure to process and communicate speech under various conditions. Only by stressing the system, causing failure, can we hope to understand it. There are at least four levels of experimentation:

  1. The first level of experiments is with NH ears, with speech in noise.
  2. The second level of experiments is filtering experiments, where the speech is filtered before the noise is added.
  3. In the third series of experiments, the speech is truncated in time.
  4. Finally, small regions of the speech are modified by a few dB, or removed altogether.

Examples of such processing are given later on this page.
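To make the first level concrete, here is a minimal Python sketch of mixing a speech token with masking noise at a prescribed SNR. This is not the group's experiment code; the function name mix_at_snr and the stand-in signals are illustrative assumptions.

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        # Scale the noise so that the speech-to-noise power ratio equals
        # snr_db, then add it to the speech. Inputs are 1-D waveforms.
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2)
        gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
        return speech + gain * noise

    # Example: a stand-in "token" in white noise at -2 dB SNR.
    fs = 16000
    t = np.arange(fs) / fs
    token = np.sin(2 * np.pi * 1000 * t)   # placeholder for a CV token
    noise = np.random.randn(fs)            # white noise; filter it for SWN
    noisy = mix_at_snr(token, noise, snr_db=-2.0)

Levels 2-4 reuse the same harness: filter or truncate the token, or modify selected time-frequency regions, before calling the mixer.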

Findings:

We have found that speech perception is a discrete (binary) zero-error task (Singh and Allen, 2012). Working at the token level, we defined two groups: ZE and NZE. Zero-error (ZE) speech is defined as speech that NH listeners never make an error in identifying, at and above -2 dB SNR. The non-ZE (NZE) sounds are all the rest. All of the CV speech sounds that we have tested contain many ZE tokens: most CV consonants consist of more than 80% ZE utterances.

The remaining 20% of the CVs may be broken down into medium-error (ME; 0% < error < 10%) and high-error (HE; error > 10%) groups. ME consonants are typically utterances that are mispronounced to varying degrees. HE consonants are typically those heard as a different sound, with high probability (>20%). Based on the entropy across NH listeners, we view such sounds as mislabeled. The reasons for these errors can typically be traced to a specific flaw in the production of the sound, which is usually easily identified.
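As a concrete illustration of this grouping, here is a minimal Python sketch that bins tokens by their NH error rate. The token names and error rates are invented; only the ZE/ME/HE thresholds come from the definitions above.

    # Hypothetical per-token error rates (fraction of NH misidentifications,
    # pooled over SNRs at and above -2 dB).
    token_error = {"ta_f103": 0.00, "pa_m112": 0.04, "ba_f105": 0.23}

    def classify(err):
        if err == 0.0:
            return "ZE"   # zero-error: never misheard at SNR >= -2 dB
        elif err < 0.10:
            return "ME"   # medium-error: 0% < error < 10%
        else:
            return "HE"   # high-error: error > 10%

    for token, err in token_error.items():
        print(token, classify(err))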

A chronological history of HSR papers

Summary of UIUC-HSR Experiments (Updated Mar 15, 2014)

Year | Experiment | Students | Details ($N_s$ = # subjects) | Publications | .mat
-----|------------|----------|------------------------------|--------------|-----
2004 | MN64 (MN04SWN) | Phatak & Lovitt | Miller-Nicely in SWN with 4 vowels: f/a/ther, b/a/t, b/i/t, b/ee/t (not b/e/t); i.e., LaTeX tipa \textipa{@, \ae, E, i}; LDCbet [a, xq, i, xi] ([a, Q, i, I]); $V_{ldc}$ = /a, @, i, I/; $N_s$ = 4-4 ``bad subjects'' | Phatak & Allen (2007) [PA07] pdf | MN64
2005 | Study | Allen, J. B. | Consonant recognition and the AI | JASA 117(4), pp. 2212-2223 (2005) pdf |
2005 | MN16-R (MN05WN) | Phatak & Lovitt | Replicate MN04 (WN) | Phatak, Lovitt & Allen (2008) pdf |
2005 | MN64R (MN05SWN) | Phatak & Lovitt | More MN64; 14 new subjects; SWN | pdf | MN64
2005 | HIMCL05 | Yoon & Phatak | CVs; 10 HI ears @ MCL in WN | Phatak, Yoon, Gooler & Allen (2009) pdf |
2006 | HINALR05 | Yoon | CVs; 10 HI ears; NAL-R @ MCL in SWN | |
2006 | Verification | Regnier | Modifications of /ta/ | Regnier & Allen (2008) pdf |
2006 | CV06SWN | Phatak | $C_{ldc}$ = /d,b,k,p,s,t,S,Z,z/; $V_{ldc}$ = /o,E,u,R,Q,U,I,a/ | | cv06swn
2006 | CV06WN | Regnier | 9C+8V WN /d, b, k, p, s, t, xs, xz, z/ | | cv06wn
2007 | CV06 | Pan | Analysis of 9 vowels of CV06 | 2 unpublished MSs |
2007 | HL07 | Li | High- and low-pass repeat of Fletcher | Li & Allen (2009), JASA pdf |
2008 | TR07 | Li | Time truncation after Furui86 | Allen & Li (2009), ASSP Magazine pdf |
2008 | TR08 | Li | Time truncation after Furui86? 3 vowels | ? |
2009 | 3DDS | Li | 3DDS (i.e., MN64, HL07, TR07-8) | Li & Allen (2010), JASA; Li & Allen (2010), IEEE TLSP; Li, Trevino & Allen (2012), JASA |
2009 | Verification | Menon | Remove primary burst | |
2009 | Verification | Abhinauv | Modify ($\pm$6 dB) + remove primary burst | Kapoor & Allen, 131(1), 2012 pdf |
2009 | Verification | Cvengros | Modify burst + devoiced + voiced transition | |
2009 | MN64(+R) | Singh | Full analysis of $N_s$ = 25 of MN64+MN64R | JASA, April 2012 pdf |
2010 | HIMCL10-I/III | Woojae Han | CVs; $N_s$ = 46 HI ears with $N_t$ = 2/token/SNR | pdf |
2010 | HI10NALR-II/IV | Woojae Han | CVs; $N_s$ = 17 HI ears with $N_t$ = 10/token/SNR | pdf |
2011 | HL11 | Trevino | High/low-pass filtered CVs of HI10 | |
2013 | HI Exp2 Analysis | Trevino | Analysis of the individual variability of HI | Trevino & Allen pdf, pdf |

Databases of experiments:

HSR database of various topics

  1. HSR various topics and student data
  2. Exp I & IV source code and log files
  3. Christoph's analysis of Exp IV
  4. Woojae Analysis & Experimental results (Reformatted)

Open projects for MS and PhD

  1. {$\surd$} Study the Hellinger angle {$\theta_{pq}$} between probability distributions p and q, defined via the {$L_2$} (Hellinger) inner product of the square roots of the probabilities: {$\cos(\theta_{pq}) \equiv (\sqrt{p}, \sqrt{q}) = \sum_k \sqrt{p_k\,q_k}$} (EoM entry). This has been shown to be useful for the K-means analysis of confusion matrices (Trevino thesis); a sketch of the computation follows this list.
  2. {$\surd$} Repeat the analysis of Singh & Allen (JASA, Apr. 2012) on the fricatives /S, s, f, z, Z, T, D/. That is, repeat the analysis done by Riya on /p, t, k, b, d, g/, looking at errors in individual utterances, on the remaining consonants used in the PA07 study.
  3. Repeat the Singh & Allen analysis using the white-noise stimulus (MN16-R).
  4. {$\surd$} Introduce forward masking into the AI-gram.
    1. Modify the AI-gram so that it includes the forward and upward spread of masking.
  5. Improve our insight into confidence intervals for Bernoulli trials (Fisher exact methods); a sketch of one such method follows this list.
  6. Use the intersection of the noise spectra at SNR_90 for white and speech-weighted noise to estimate the frequency of a speech feature. Since the SNR_90 labels the masked threshold for a feature, two different noise spectra can be used to label the frequency of the feature.
  7. Obtain much better 3DDS data on sounds that have not worked in the past, such as /p,b/, /T,D/ and /m,n/. Either we had poor tokens in the existing analysis (/p,b,T,D/) or we never tried (/m,n/) (Andrea)
  8. Continue the analysis of Woojae's data. We need to characterize each ear in terms of the SNR at 1 bit of error, along with the SNR corresponding to 1 bit of error for each consonant. Most of the consonants will never attain 1 bit of entropy (error). Those that do may be sorted in terms of these SNRs, by consonant.
  9. Run 3DDS on all the sounds of Woojae's experiment II. The goal is to establish precisely what the features are for the sounds of Exp-II. Given these features we hope to determine the strategy of the HI listeners. (Andrea)
  10. Fully analyze CV06. Len and Austin have analyzed CV06 for the vowels, but nobody has tackled the consonants. Given the large number of vowels, it would be good to check the consonant features across these many vowels.
  11. We need to add a module to the AI-gram code that detects F0 modulations at high frequencies and color-codes the plots. We now know that many fricatives are coded by this feature (z, Z, v, J, r), so it is a really simple but rich set of consonants.
  12. General cleanup of the AI-gram code (it's a mess).
  13. It would be interesting to see if adding nonlinear compression changes the AI-grams, either by improving the prediction of the forward or the upward spread of masking.
  14. Analyze the confusions in the Reading-group database.
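For open project 1 above, here is a minimal Python sketch of the Hellinger-angle computation, assuming p and q are rows of a confusion matrix; the example numbers are illustrative.

    import numpy as np

    def hellinger_angle_deg(p, q):
        # cos(theta_pq) = sum_k sqrt(p_k * q_k): the inner product of the
        # square-root probability vectors, each of which has unit L2 norm.
        cos_theta = np.sum(np.sqrt(np.asarray(p) * np.asarray(q)))
        return np.degrees(np.arccos(np.clip(cos_theta, 0.0, 1.0)))

    p = [0.70, 0.20, 0.05, 0.05]   # e.g., responses to one token
    q = [0.10, 0.60, 0.15, 0.15]   # responses to another token
    print(hellinger_angle_deg(p, q))
    print(hellinger_angle_deg(p, p))   # identical rows give 0 degrees

An angle near 0 means two rows share the same confusion pattern, while distributions with disjoint support give 90 degrees; this bounded, geometric behavior is what makes the measure suitable for K-means clustering of confusion-matrix rows.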
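For open project 5 above, here is a minimal Python sketch of one standard exact method, the Clopper-Pearson interval, which inverts the binomial test in the same spirit as Fisher-exact reasoning; the counts in the example are illustrative.

    from scipy.stats import beta

    def clopper_pearson(k, n, alpha=0.05):
        # Exact two-sided confidence interval for a Bernoulli success
        # probability, given k successes in n trials.
        lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
        hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
        return lo, hi

    # Example: 3 recognition errors in 50 presentations of one token.
    print(clopper_pearson(3, 50))   # 95% interval for the error probability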

Databases

Software of interest

Other category

Measurement systems

  • ARTA software to be used with your sound card. Performance will vary
  • QA400 Inexpensive ({$\approx$}\$200) USB box with Windows software, with a -140 [dB] noise floor and 110 [dB] dynamic range.

HSR Pictures (entertainment value only)

Historical Documents
