Ensemble Deep Learning for Speech Recognition
Li Deng and John C. Platt
Microsoft Research, One Microsoft Way, Redmond, WA, USA
deng@microsoft.com; jplatt@microsoft.com
Abstract
Deep learning systems have dramatically improved the accuracy of speech recognition, and various deep architectures and learning methods have been developed with distinct strengths and weaknesses in recent years. How ensemble learning can be applied to these varying deep learning systems to achieve greater recognition accuracy is the focus of this paper. We develop and report linear and log-linear stacking methods for ensemble learning, with applications specifically to speech-class posterior probabilities as computed by convolutional, recurrent, and fully-connected deep neural networks. Convex optimization problems are formulated and solved, with analytical formulas derived for training the ensemble-learning parameters. Experimental results demonstrate a significant increase in phone recognition accuracy after stacking the deep learning subsystems that use different mechanisms for computing high-level, hierarchical features from the raw acoustic signals in speech.
Index Terms: speech recognition, deep learning, ensemble learning, log-linear system combination, stacking

1. Introduction
The success of deep learning in speech recognition started with the fully-connected deep neural network (DNN) [24][40][7][18][19][29][32][33][10][11][39]. As reviewed in [16] and [11], during the past few years DNN-based systems have been demonstrated by four major research groups in speech recognition to provide significantly higher accuracy in continuous phone and word recognition than both the earlier state-of-the-art GMM-based systems [1] and the earlier shallow neural network systems [5][26]. More recent advances in deep learning techniques as applied to speech include the use of locally-connected or convolutional deep neural networks (CNN) [30][9][10][31] and of temporally (deep) recurrent versions of neural networks (RNN) [14][15][8][6], both considerably outperforming the early neural networks with convolution in time [36] and the early RNN [28].
While very different phone recognition error patterns (not just the absolute error rates) were found between the GMM- and DNN-based systems, a finding that helped ignite the recent interest in DNNs [12], we found somewhat less dramatic differences in the error patterns produced by the DNN-, CNN-, and RNN-based speech recognition systems. Nevertheless, the differences are pronounced enough to warrant ensemble learning to combine these various systems. In this paper, we describe a method that integrates the posterior probabilities produced by different deep learning systems using the framework of stacking, a class of techniques for forming combinations of different predictors to give improved prediction accuracy [37][4][13].
Taking the simplest yet rigorous approach, we use linear predictor combinations in the same spirit as the stacked regressions of [4]. In our formulation for the specific problem at hand, each predictor consists of the frame-level output vectors of the DNN, CNN, or RNN subsystem, given the common filterbank acoustic feature sequences as the subsystems' inputs. However, the approach presented in this paper differs from that of [4] in several aspects. First, we learn the stacking parameters without imposing non-negativity constraints, as we have observed from experiments that such constraints are often automatically satisfied (see Figure 2 and related discussions in Section 4.3). This advantage derives naturally from our use of the full training data, instead of just cross-validation data as in [4], to learn the stacking parameters, while using the cross-validation data to tune the hyperparameters of stacking for ensemble learning. Second, we treat the outputs from the individual deep learning subsystems as fixed, high-level hierarchical "features" [8], justifying reuse of the training data for learning the ensemble parameters after learning the low-level DNN, CNN, and RNN subsystems' weight parameters. Third, in addition to the linear stacking proposed in [4], we also explore log-linear stacking in ensemble learning, which has a special computational advantage in the context of the softmax output layers of our deep learning subsystems. Finally, we apply stacking only at one component of the full ensemble-learning system: at the frame level of acoustic sequences. Following the rather standard DNN-HMM architecture adopted in [40][7][24], the CNN-HMM in [30][10][31], and the RNN-HMM in [15][8][28], we perform stacking at the most straightforward frame level and then feed the combined output to a separate HMM decoder. Stacking at the full-sequence level is considerably more complex and will not be discussed in this paper.
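In equation form, using notation we introduce here for concreteness (the full matrices $\mathbf{W}_s$ and the bias vector $\mathbf{b}$ correspond to the stacking parameters developed in Sections 2 and 3), the two frame-level combinations are

$$\hat{\mathbf{y}}_t = \sum_{s \in \{\text{DNN, CNN, RNN}\}} \mathbf{W}_s\, \mathbf{p}_t^{(s)} \qquad \text{(linear stacking)},$$

$$\hat{\mathbf{y}}_t = \sum_{s \in \{\text{DNN, CNN, RNN}\}} \mathbf{W}_s \log \mathbf{p}_t^{(s)} + \mathbf{b} \qquad \text{(log-linear stacking)},$$

where $\mathbf{p}_t^{(s)}$ denotes the speech-class posterior vector produced by subsystem $s$ at frame $t$.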
This paper is organized as follows. In Section 2, we present the setting of the stacking method for ensemble learning in the original linear domain of the output layers of all deep learning subsystems, and derive the learning algorithm based on ridge regression. The same setting and a similar learning algorithm are described in Section 3, except that the stacking method is changed to operate on the logarithmic values of the output layers of the deep learning subsystems. The main motivation is to save the significant softmax computation; to this end, additional bias vectors need to be introduced as part of the ensemble-learning parameters. In the experimental Section 4, we demonstrate the effectiveness of both linear and log-linear stacking methods for ensemble learning, and analyze how the various deep learning mechanisms for computing high-level features from the raw acoustic signals in speech naturally give different degrees of effectiveness as measured by recognition accuracy. Finally, we draw conclusions in Section 5.

2. Linear Ensemble
To simplify the stacking procedure for ensemble learning, we perform the linear combination of the original speech-class posterior probabilities produced by the deep learning subsystems at the frame level. A set of parameters in the form of full matrices is associated with the linear combination; these parameters are learned using training data consisting of the frame-level posterior probabilities of the different subsystems and of the corresponding frame-level target values of speech classes. In the testing phase, the learned parameters are used to linearly combine the posterior probabilities from the different systems at the frame level, which are subsequently fed to a parameter-free HMM to carry out a dynamic programming procedure together with an already trained language model. This gives the decoded results and the associated accuracy for continuous phone or word recognition.
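As a concrete illustration of this training and testing procedure, the following is a minimal NumPy sketch; it is not the exact implementation used in this work, and the array shapes, the stacked-feature construction, and the regularization weight lam are illustrative assumptions (lam plays the role of a stacking hyperparameter to be tuned on the cross-validation data).

```python
import numpy as np

def train_linear_stack(posteriors, targets, lam=1e-3):
    """Learn a full stacking matrix W by ridge regression in closed form.

    posteriors: list of (T, C) arrays, one per deep learning subsystem
                (frame-level speech-class posterior probabilities).
    targets:    (T, C) array of one-hot frame-level class labels.
    lam:        ridge (L2) regularization weight, tuned on
                cross-validation data.
    Returns W of shape (S*C, C), where S is the number of subsystems,
    such that X @ W approximates the targets.
    """
    X = np.hstack(posteriors)                      # (T, S*C) stacked features
    d = X.shape[1]
    # Closed-form ridge solution: W = (X^T X + lam*I)^{-1} X^T Y
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ targets)
    return W

def combine(posteriors, W):
    """Combine subsystem posteriors at the frame level for decoding."""
    return np.hstack(posteriors) @ W               # (T, C) combined outputs
```

The combined frame-level outputs from combine would then replace the single-system posteriors as inputs to the HMM decoder.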
The combination of the speech-class posterior probabilities can be carried out in either a linear or a log-linear manner. For the former, the topic of this section, a linear combination is applied directly to the posterior probabilities produced by the different deep learning systems. For the log-linear case, which is the topic of Section 3, the linear combination (including a bias vector) is applied after a logarithmic transformation of the posterior probabilities.
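A minimal sketch of the log-linear variant, under the same illustrative assumptions as above, follows; the appended bias column plays the role of the bias vector mentioned above, and eps is a small constant we add purely to guard against log(0).

```python
import numpy as np

def train_loglinear_stack(posteriors, targets, lam=1e-3, eps=1e-10):
    """Ridge regression on log posteriors, with an appended bias term.

    posteriors: list of (T, C) arrays of frame-level posteriors, one per
                subsystem; targets: (T, C) one-hot frame labels.
    Returns W of shape (S*C + 1, C); its last row acts as the bias vector.
    """
    X = np.log(np.hstack(posteriors) + eps)        # log-domain features
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias column
    d = X.shape[1]
    # Same closed-form ridge solution as in the linear case
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ targets)
    return W
```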
