# Conversational Speech Transcription Using Context-Dependent Deep Neural Networks

```bibtex
@inproceedings{Seide2012ConversationalST,
  title     = {Conversational Speech Transcription Using Context-Dependent Deep Neural Networks},
  author    = {Frank Seide and Gang Li and Dong Yu},
  booktitle = {ICML},
  year      = {2012}
}
```

Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, combine classic artificial-neural-network HMMs with traditional context-dependent acoustic modeling and deep-belief-network pre-training. CD-DNN-HMMs greatly outperform conventional CD-GMM (Gaussian-mixture-model) HMMs: the word error rate is reduced by up to one third on the difficult benchmark task of speaker-independent single-pass transcription of telephone conversations.
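In the hybrid architecture the abstract describes, the DNN replaces the GMM as the HMM's emission model: the network emits senone posteriors, which are converted to scaled likelihoods for HMM decoding by dividing out the senone priors (Bayes' rule, dropping the constant frame probability). A minimal NumPy sketch with made-up toy numbers (an illustration, not the paper's code):

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors, eps=1e-10):
    """Convert DNN senone posteriors p(s|x) to scaled log-likelihoods
    log p(x|s) + const = log p(s|x) - log p(s), frame by frame."""
    return np.log(posteriors + eps) - np.log(priors + eps)

# Toy example: 2 frames, 3 senones (all numbers are illustrative only).
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])   # DNN softmax outputs per frame
prior = np.array([0.5, 0.3, 0.2])    # senone priors from training alignments
ll = scaled_log_likelihoods(post, prior)
```

These scaled likelihoods then take the place of GMM likelihoods in a standard Viterbi decoder.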


#### 850 Citations

Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription

- Computer Science
- 2011 IEEE Workshop on Automatic Speech Recognition & Understanding
- 2011

This work investigates the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective to reduce the word error rate for speaker-independent transcription of phone calls.

Standalone training of context-dependent deep neural network acoustic models

- Computer Science
- 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2014

This paper introduces a method for training state-of-the-art CD-DNN-HMMs without relying on such a pre-existing system, and achieves this in two steps: build a context-independent (CI) DNN iteratively with word transcriptions, and cluster the equivalent output distributions of the untied CD-HMM states using the decision tree based state tying approach.

Context-dependent Deep Neural Networks for audio indexing of real-life data

- Computer Science
- 2012 IEEE Spoken Language Technology Workshop (SLT)
- 2012

It is found that for the best speaker-independent CD-DNN-HMM, with 32k senones trained on 2000h of data, the one-fourth reduction does carry over to inhomogeneous field data, and that DNN likelihood evaluation is a sizeable runtime factor even in the wide-beam context of generating rich lattices.

Improving English Conversational Telephone Speech Recognition

- Computer Science
- INTERSPEECH
- 2016

This work investigated several techniques to improve acoustic modeling, namely speaker-dependent bottleneck features, deep Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks, data augmentation and score fusion of DNN and BLSTM models.

Pipelined Back-Propagation for Context-Dependent Deep Neural Networks

- Computer Science
- INTERSPEECH
- 2012

It is shown that the pipelined approximation to BP, which parallelizes computation with respect to layers, is an efficient way of utilizing multiple GPGPU cards in a single server.
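The layer-wise parallelism can be visualized with a toy schedule: once the pipeline is full, each device (one per layer group) works on a different minibatch at the same clock tick, at the cost of slightly stale weights. A small illustrative sketch (my own simplification, not the paper's implementation):

```python
def pipeline_schedule(n_stages, n_ticks):
    """At clock tick t, stage s holds minibatch t - s (None while the
    pipeline is still filling). Adjacent stages therefore compute on
    different minibatches concurrently."""
    return [[t - s if t - s >= 0 else None for s in range(n_stages)]
            for t in range(n_ticks)]

# Three stages over five ticks; after two ticks all stages are busy at once.
sched = pipeline_schedule(3, 5)
```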

Context-dependent deep neural networks for commercial Mandarin speech recognition applications

- Computer Science
- 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference
- 2013

It is demonstrated that CD-DNN-HMMs achieve a relative 26% word-error reduction in Baidu's short-message (SMS) voice-input application and a relative 16% sentence-error reduction in its voice-search application, compared with state-of-the-art CD-GMM-HMMs trained using fMPE.

Pipelined BackPropagation for Context-Dependent Deep Neural Networks

- 2012

The Context-Dependent Deep-Neural-Network HMM, or CD-DNN-HMM, is a recently proposed acoustic-modeling technique for HMM-based speech recognition that can greatly outperform conventional…

Fast-LSTM acoustic model for distant speech recognition

- Computer Science
- 2018 IEEE International Conference on Consumer Electronics (ICCE)
- 2018

The proposed Fast Long Short-Term Memory (Fast-LSTM) acoustic model combines the time-delay neural network (TDNN) and LSTM network to reduce the training time of the standard LSTM acoustic model.

Context dependent state tying for speech recognition using deep neural network acoustic models

- Computer Science
- 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2014

An algorithm to design a tied-state inventory for a context dependent, neural network-based acoustic model for speech recognition that optimizes state tying on the activation vectors of the neural network directly is proposed.

Recent Improvements to Neural Network based Acoustic Modeling in the EML Transcription Platform

- 2016

In recent years, automatic speech recognition has enjoyed tremendous improvements from the use of (deep) neural networks (DNNs) for both acoustic modeling and stochastic language modeling [1, 2]. …

#### References

Showing 1–10 of 21 references

Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

- Computer Science
- IEEE Transactions on Audio, Speech, and Language Processing
- 2012

A pre-trained deep-neural-network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output, and that can significantly outperform conventional context-dependent Gaussian-mixture-model (GMM) HMMs.

Deep Belief Networks for phone recognition

- Computer Science
- 2009

Deep Belief Networks (DBNs) have recently proved very effective in a variety of machine-learning problems, and this paper applies DBNs to acoustic modeling.

Context-dependent connectionist probability estimation in a hybrid hidden Markov model-neural net speech recognition system

- Computer Science
- Comput. Speech Lang.
- 1994

A new training procedure that "smooths" networks with different degrees of context dependence is proposed to obtain a robust estimate of the context-dependent probabilities of the HMM/MLP speaker-independent continuous speech recognition system.
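The "smoothing" idea can be read as interpolating estimators of different context granularity; a one-line sketch (an assumed simplification of the paper's procedure, with a hypothetical interpolation weight `lam`):

```python
def smoothed_prob(p_cd, p_ci, lam=0.7):
    """Blend a sparse context-dependent estimate with a more robust
    context-independent one; lam is a hypothetical interpolation weight."""
    return lam * p_cd + (1.0 - lam) * p_ci
```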

ACID/HNN: clustering hierarchies of neural networks for context-dependent connectionist acoustic modeling

- Computer Science
- Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181)
- 1998

It is argued that a hierarchical approach is crucial in applying locally discriminative connectionist models to the typically very large state spaces observed in LVCSR systems.

Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition

- Computer Science
- 2010

It is shown that pre-training can initialize weights to a point in the space where fine-tuning can be effective, and is thus crucial both for training deep structured models and for the recognition performance of a CD-DBN-HMM-based large-vocabulary speech recognizer.

Recent innovations in speech-to-text transcription at SRI-ICSI-UW

- Computer Science
- IEEE Transactions on Audio, Speech, and Language Processing
- 2006

It is shown that acoustic adaptation can be improved by predicting the optimal regression class complexity for a given speaker, and speech modeling innovations include the use of a syntax-motivated almost-parsing language model, as well as principled vocabulary-selection techniques.

Connectionist probability estimators in HMM speech recognition

- Computer Science
- IEEE Trans. Speech Audio Process.
- 1994

It is shown that a connectionist component improves a state-of-the-art HMM system through a statistical interpretation of connectionist networks as probability estimators.

Learning representations by back-propagating errors

- Computer Science
- Nature
- 1986

Back-propagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector, which helps to represent important features of the task domain.
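The procedure the snippet describes can be sketched concretely: repeated weight adjustments that descend the gradient of the squared difference between actual and desired outputs. A self-contained NumPy toy on XOR with one hidden layer (hyperparameters are arbitrary choices for illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])          # desired outputs (XOR)

W1 = rng.normal(0.0, 1.0, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 1.0, (8, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward():
    h = sigmoid(X @ W1 + b1)
    return h, sigmoid(h @ W2 + b2)

_, y0 = forward()
loss_before = float(np.mean((y0 - T) ** 2))

for _ in range(5000):
    h, y = forward()
    dy = (y - T) * y * (1 - y)                  # error signal at the output
    dh = (dy @ W2.T) * h * (1 - h)              # back-propagated through hidden layer
    W2 -= 0.5 * h.T @ dy; b2 -= 0.5 * dy.sum(axis=0)
    W1 -= 0.5 * X.T @ dh; b1 -= 0.5 * dh.sum(axis=0)

_, y1 = forward()
loss_after = float(np.mean((y1 - T) ** 2))
```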

A Fast Learning Algorithm for Deep Belief Nets

- Mathematics, Computer Science
- Neural Computation
- 2006

A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.
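The greedy recipe trains one layer at a time on the previous layer's output. Hinton et al. use RBMs trained with contrastive divergence; as a simplified stand-in (my assumption for illustration, not the paper's procedure), the sketch below greedily fits tied-weight linear autoencoder layers by gradient descent:

```python
import numpy as np

rng = np.random.default_rng(1)

def pretrain_layer(data, n_hidden, lr=0.01, steps=300):
    """Fit one tied-weight linear autoencoder layer to `data` and
    return (weights, encoded data to feed the next layer)."""
    W = rng.normal(0.0, 0.05, (data.shape[1], n_hidden))
    for _ in range(steps):
        R = data @ W @ W.T - data                          # reconstruction residual
        grad = 2.0 * (data.T @ R @ W + R.T @ data @ W) / len(data)
        W -= lr * grad
    return W, data @ W

X = rng.normal(0.0, 1.0, (64, 12))
W1, H1 = pretrain_layer(X, 8)    # first layer trained on raw data
W2, H2 = pretrain_layer(H1, 4)   # second layer trained on first layer's codes

err_layer1 = float(np.mean((X @ W1 @ W1.T - X) ** 2))
```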