Microsoft Research, One Microsoft Way, Redmond, WA, USA
Deep learning systems have dramatically improved the accuracy of speech recognition, and various deep architectures and learning methods have been developed with distinct strengths and weaknesses in recent years. How can ensemble learning be applied to these varying deep learning systems to achieve greater recognition accuracy is the focus of this paper. We develop and report linear and log-linear stacking methods for ensemble learning with applications specifically to speech-class posterior probabilities as computed by the convolutional, recurrent, and fully-connected deep neural networks. Convex optimization problems are formulated and solved, with analytical formulas derived for training the ensemble-learning parameters. Experimental results demonstrate a significant increase in phone recognition accuracy after stacking the deep learning subsystems that use different mechanisms for computing high-level, hierarchical features from the raw acoustic signals in speech.
Index Terms: speech recognition, deep learning, ensemble learning, log-linear system combination, stacking
The success of deep learning in speech recognition started with the fully-connected deep neural network (DNN)  . As reviewed in  and , during the past few years the DNN-based systems have been demonstrated by four major research groups in speech recognition to provide significantly higher accuracy in continuous phone and word recognition both than the earlier state-of-the-art GMM-based systems  and than the earlier shallow neural network systems . More recent advances in deep learning techniques as applied to speech include the use of locally-connected or convolutional deep neural networks (CNN)  and of temporally (deep) recurrent versions of neural networks (RNN) , also considerably outperforming the early neural networks with convolution in time  and the early RNN .
While very different phone recognition error patterns (not just the absolute error rates) were found between the GMM and DNN-based systems that helped ignite the recent interest in DNNs , we found somewhat less dramatic differences in the error patterns produced by the DNN, CNN, and RNN-based speech recognition systems. Nevertheless, the differences are pronounced enough to warrant ensemble learning to combine these various systems. In this paper, a method is described that integrates the posterior probabilities produced by different deep learning systems using the framework of stacking as a class of techniques for forming combinations of different predictors to give improved prediction accuracy .
Taking a simplest yet rigorous approach, we use linear predictor combinations in the same spirit as stacked regressions of . In our formulation for the specific problem at hand, each predictor is the frame-level output vectors of either the DNN, CNN, or RNN subsystem given the common filterbank acoustic feature sequences as the subsystems’ inputs. However, our approach presented in this paper differs from that of  in several aspects. We learn the stacking parameters without imposing non-negativity constraints, as we have observed from experiments that such constraints are often automatically satisfied (see Figure 2 and related discussions in Section 4.3). This advantage derives naturally from our use of full training data, instead of just cross-validation data as in , to learn the stacking parameters while using the cross-validation data to tune the hyper-parameters of stacking for ensemble learning. We treat the outputs from individual deep learning subsystems as fixed, high-level hierarchical “features” , justifying re-use of the training data for learning ensemble parameters after learning the low-level CNN, CNN, and RNN subsystems’ weight parameters. In addition to linear stacking proposed in , we also explore log-linear stacking in ensemble learning, which has some special computational advantage only in the context of softmax output layers in our deep learning subsystems. Finally, we apply stacking only at a component of the full ensemble-learning system --- at the frame level of acoustic sequences. Following the rather standard DNN-HMM architecture adopted in , the CNN-HMM in  , and the RNN-HMM in , we perform stacking at the most straightforward frame level and then feed the combined output to a separate HMM decoder. Stacking at the full-sequence level is considerably more complex and will not be discussed in this paper.
This paper is organized as follows: In Section 2, we present the setting of stacking method for ensemble learning in the original linear domain of the output layers in all deep learning subsystems. The learning algorithm based on ridge regression is derived. The same setting and a similar learning algorithm are described in Section 3, except the stacking method is changed to deal with logarithmic values of the output layers in the deep learning subsystems. The main motivation is to save the significant softmax computation, and, to this end, additional bias vectors need to be introduced as part of the ensemble learning parameters. In the experimental Section 4, we demonstrate the effectiveness of both linear and log-linear stacking methods for ensemble learning, and analyze how various deep learning mechanisms for computing high-level features from the raw acoustic signals in speech naturally give different degrees of effectiveness as measured by recognition accuracy. Finally, we draw conclusions in Section 5.
To simplify the stacking procedure for ensemble-learning, we perform the linear combination of the original speech-class posterior probabilities produced by deep learning subsystems at the frame level here. A set of parameters in the form of full matrices are associated with the linear combination, which are learned using the training data consisting of the frame-level posterior probabilities of the different subsystems and of the corresponding frame-level target values of speech classes. In the testing phase, the learned parameters are used to linearly combine the posterior probabilities from different systems at the frame level, which are subsequently fed to a parameter-free HMM to carry out a dynamic programming procedure together with an already trained language model. This gives the decoded results and the associated accuracy for continuous phone or word recognition.
The combination of the speech-class posterior probabilities can be carried out in either a linear or a log-linear manner. For the former, as the topic of this section, a linear combination is applied directly to the posterior probabilities produced by different deep learning systems. For the log-linear case, which is the topic of Section 3, the linear combination (including the bias vector) is applied after logarithmic transformation on the posterior probabilities.