Keynotes

Keynote 1: Dr. Haizhou Li
Voice conversion and spoofing countermeasures for speaker verification
Keynote 2: Dr. Shrikanth (Shri) S. Narayanan
Understanding individual-level speech variability: From novel speech production data to robust speaker recognition
Keynote 3: Dr. Najim Dehak
I-Vector Representation Based on GMM and DNN for Audio Classification
 

Keynote 1


Voice conversion and spoofing countermeasures for speaker verification

As automatic speaker verification (ASV) technology becomes increasingly reliable, banks and e-commerce providers are using voice biometrics to enhance security and deliver more convenient customer authentication. Like any other biometric, ASV is vulnerable to spoofing, also referred to as presentation attacks. Spoofing refers to an attack in which a fraudster attempts to masquerade as another enrolled person. Modern technologies, such as speech synthesis and voice conversion, present a genuine threat to ASV systems. Therefore, spoofing countermeasures, which aim to detect such attacks, are as important in commercial deployments as the speaker verification systems themselves. In this talk, we will discuss speech liveness detection in the context of speaker verification. We will also discuss the vulnerability of speaker verification to speech synthesis and voice conversion, and the findings from ASVspoof 2015: the First Automatic Speaker Verification Spoofing and Countermeasures Challenge.

About the speaker

Dr. Haizhou Li
Human Language Technology Department
INSTITUTE FOR INFOCOMM RESEARCH (I2R)

Haizhou Li is currently a Principal Scientist and Head of the Human Language Technology Department at the Institute for Infocomm Research, Singapore. He is also an Adjunct Professor at the National University of Singapore.

Dr Li is the Editor-in-Chief of the IEEE/ACM Transactions on Audio, Speech and Language Processing (2015-2017). He has served on the Editorial Board of Computer Speech and Language (2012-2014). He is an elected Member of the IEEE Speech and Language Processing Technical Committee (2013-2015), the President of the International Speech Communication Association (2015-2017), and the President of the Asia Pacific Signal and Information Processing Association (2015-2016). He was the General Chair of ACL 2012 and INTERSPEECH 2014.

Dr Li is a Fellow of the IEEE. He was a recipient of the National Infocomm Award 2002 and the President’s Technology Award 2013 in Singapore.

Keynote 2


Understanding individual-level speech variability: From novel speech production data to robust speaker recognition

The vocal tract is the universal human instrument, played with great dexterity and skill in the production of speech to convey rich linguistic and paralinguistic information. How individuals differ in their speech articulation due to differences in the shape and size of their physical vocal instrument, and the acoustic consequences of those differences, are not well understood. Knowledge of how people differ in their speech production can help create improved automatic speaker recognition technologies, and can inform the design of technologies for robust speech-based access to people and information.

The talk focuses on steps toward advancing scientific understanding of how vocal tract morphology and speech articulation interact to explain the variant and invariant aspects of speech signal properties across talkers. Of particular scientific interest is the nature of the articulatory strategies individuals adopt, in the presence of structural differences between them, to achieve phonetic equivalence. Equally of interest is which aspects of vocal tract morphological differences are reflected in the acoustic speech signal, how they are reflected, and whether those differences can be estimated from speech acoustics. A crucial part of this goal is to create forward and inverse computational models that relate vocal tract details to speech acoustics, shedding light on individual speaker differences and informing the design of robust speaker recognition technologies.

Speech research has mainly focused on surface acoustic properties of speech; open questions remain on how speech properties co-vary across talkers and across linguistic and paralinguistic conditions. However, there are limits to how much of the underlying detail can be uncovered from the acoustic signal alone. This talk will describe efforts to investigate the dynamic human vocal tract directly, using novel magnetic resonance imaging techniques and computational modeling to illuminate inter-speaker variability in vocal tract structure as well as the strategies by which linguistic articulation is implemented. Applications to speaker modeling and recognition will be presented.

About the speaker

Dr. Shrikanth (Shri) S. Narayanan
Andrew J. Viterbi Professor of Engineering
Professor: EE, CS, Linguistics, Psychology, Neuroscience, Pediatrics
UNIVERSITY OF SOUTHERN CALIFORNIA

Shrikanth (Shri) Narayanan is the Andrew J. Viterbi Professor of Engineering at the University of Southern California, where he is Professor of Electrical Engineering, jointly appointed in Computer Science, Linguistics, Psychology, Neuroscience and Pediatrics, and Director of the Ming Hsieh Institute. Prior to USC he was with AT&T Bell Labs and AT&T Research. His research focuses on human-centered information processing and communication technologies. He is a Fellow of the Acoustical Society of America, the IEEE, and the American Association for the Advancement of Science (AAAS).

Shri Narayanan is the Editor-in-Chief of the IEEE Journal of Selected Topics in Signal Processing, an Editor for the journal Computer Speech and Language, and an Associate Editor for the IEEE Transactions on Affective Computing, the Journal of the Acoustical Society of America, the IEEE Transactions on Signal and Information Processing over Networks, and the APSIPA Transactions on Signal and Information Processing, having previously served as an Associate Editor for the IEEE Transactions on Speech and Audio Processing (2000-2004), the IEEE Signal Processing Magazine (2005-2008) and the IEEE Transactions on Multimedia (2008-2012).

He is a recipient of several honors, including the 2015 Engineers Council’s Distinguished Educator Award and the 2005 and 2009 Best Transactions Paper awards from the IEEE Signal Processing Society, and he has served as an IEEE Signal Processing Society Distinguished Lecturer for 2010-11 and as an ISCA Distinguished Lecturer for 2015-16. With his students, he has received a number of best paper awards, including a 2014 Ten-Year Technical Impact Award from ACM ICMI and wins in the Interspeech Challenges in 2009 (Emotion classification), 2011 (Speaker state classification), 2012 (Speaker trait classification), 2013 (Paralinguistics/Social Signals), 2014 (Paralinguistics/Cognitive Load) and 2015 (Non-nativeness detection). He has published over 650 papers and has been granted 17 U.S. patents.

Keynote 3


I-Vector Representation Based on GMM and DNN for Audio Classification

The i-vector approach has become the state-of-the-art approach in several audio classification tasks, such as speaker and language recognition. It models and captures the variability in the Gaussian Mixture Model (GMM) mean components across audio recordings. More recently, several subspace approaches have been extended to model the variability in the GMM weights rather than the GMM means. These techniques, such as Non-negative Factor Analysis (NFA) and the Subspace Multinomial Model (SMM), must deal with the fact that the GMM weights are always positive and sum to one. In this talk, we will show how the NFA and SMM approaches, or other similar subspace approaches, can also be used to model the hidden-layer neuron activations of a deep neural network for sequential data recognition tasks such as language and dialect recognition.
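For readers less familiar with the formulation, the models mentioned above can be summarized in standard notation (a brief background sketch, not material from the talk itself; exact parameterizations vary across papers). The classical i-vector model writes the recording-dependent GMM mean supervector as a low-rank offset from the universal background model (UBM),

\[ M = m + T\,w, \]

where m is the UBM mean supervector, T is the low-rank total variability matrix, and w is the i-vector. For the GMM weight vector ω, NFA uses a linear subspace with explicit constraints,

\[ \omega = \omega_0 + L\,r, \qquad \omega_c \ge 0, \qquad \textstyle\sum_c \omega_c = 1, \]

while the SMM keeps those constraints implicit by modeling the weights in the log domain and renormalizing with a softmax (one common parameterization includes a per-component bias b_c),

\[ \omega_c = \frac{\exp(b_c + t_c^{\top} r)}{\sum_{c'} \exp(b_{c'} + t_{c'}^{\top} r)}. \]

In each case the low-dimensional vector r plays the role for the weights that the i-vector w plays for the means, and the talk describes how the same subspace machinery can be applied to the hidden-layer activations of a deep neural network.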

About the speaker

Dr. Najim Dehak
Assistant Professor, Department of Electrical and Computer Engineering
JOHNS HOPKINS UNIVERSITY

Najim Dehak received his Engineering degree in Artificial Intelligence in 2003 from the Université des Sciences et de la Technologie d’Oran, Algeria, and his MS degree in Pattern Recognition and Artificial Intelligence Applications in 2004 from the Université Pierre et Marie Curie, Paris, France. He obtained his Ph.D. degree from the École de technologie supérieure (ETS), Montréal, in 2009. During his Ph.D. studies he was also with the Centre de recherche informatique de Montréal (CRIM), Canada.

In the summer of 2008, he participated in the Johns Hopkins University (JHU) Center for Language and Speech Processing Summer Workshop. During that time, he proposed a new system for speaker verification that uses factor analysis to extract speaker-specific features, paving the way for the development of the i-vector framework.

Dr. Dehak is currently an assistant professor in the Department of Electrical and Computer Engineering at JHU. Prior to joining JHU, he was a research scientist in the Spoken Language Systems (SLS) Group at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). His research interests are in machine learning approaches applied to speech processing and speaker modeling. The current focus of his research involves extending the concept of the i-vector representation to other audio classification problems, such as speaker diarization, language recognition, and emotion recognition.