Abstract - Machine Learning || Mark Hasegawa-Johnson

Associate Professor, Dept. of ECE, University of Illinois at Urbana-Champaign



Semi-Supervised Learning for Speech and Audio Processing

Semi-supervised learning requires one to make assumptions about the data. This talk will discuss three different assumptions, and algorithms that instantiate those assumptions, for the classification of speech and audio events. First, I will discuss the case of phoneme classification, and the assumption of softly compact likelihood functions. The acoustic spectra corresponding to different phonemes overlap each other willy-nilly, but there is at least a tendency for the instantiations of each phoneme to cluster within a well-defined region of the feature space---a sort of "soft compactness" assumption. Softly compact distributions can be learned by an algorithm that encourages compactness without strictly requiring it, e.g., by maximizing the likelihood of the unlabeled data, or better still, by minimizing its conditional class entropy. The resulting distributions differ from those that would be learned from a fully labeled dataset, demonstrating the "softness" of the compactness assumption.
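As a rough illustration (a sketch using the standard entropy-regularization formulation, not necessarily the talk's exact objective), the entropy-minimization idea can be written as a single criterion that combines the log-likelihood of the labeled set L with a conditional-entropy penalty on the unlabeled set U; the trade-off weight lambda is a hypothetical hyperparameter:

J(\theta) = \sum_{(x_i, y_i) \in \mathcal{L}} \log p(y_i \mid x_i; \theta) \;-\; \lambda \sum_{x_j \in \mathcal{U}} H\big(p(\cdot \mid x_j; \theta)\big),
\qquad
H\big(p(\cdot \mid x; \theta)\big) = -\sum_{c} p(c \mid x; \theta) \log p(c \mid x; \theta).

Maximizing J rewards the classifier for fitting the labeled tokens while placing each unlabeled token confidently inside some class region, which is one way of operationalizing "soft compactness" without strictly enforcing it.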

Second, I will discuss the problem of recognizing mispronounced words, and the assumption of softly compact transformed distributions. For this problem we have not really developed a semi-supervised method, but rather a transformed one: the canonical phonetic pronunciations are transformed into an articulatory domain, possible mispronunciations are predicted based on a compactness criterion in that domain, and the result is transformed back into the phonetic domain, forming a rather bushy finite-state transducer.
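A minimal, hypothetical sketch of this pipeline follows; the phone set, articulatory feature values, and distance threshold are illustrative assumptions, not the talk's actual articulatory model. The idea: map each canonical phone to an articulatory feature vector, admit substitute phones whose articulatory distance falls within a small radius (a simple compactness criterion), and emit the alternatives as arcs of a linear pronunciation lattice.

# Hypothetical sketch (not the talk's actual system): predict plausible
# mispronunciations by measuring closeness in a toy articulatory feature space.

# Toy articulatory feature vectors: (voicing, place, manner), each coded 0-1.
# Real systems use richer features; these values are illustrative only.
ARTIC = {
    "p": (0.0, 0.1, 0.0),
    "b": (1.0, 0.1, 0.0),
    "t": (0.0, 0.5, 0.0),
    "d": (1.0, 0.5, 0.0),
    "k": (0.0, 0.9, 0.0),
    "g": (1.0, 0.9, 0.0),
    "s": (0.0, 0.5, 0.6),
    "z": (1.0, 0.5, 0.6),
}

def distance(a, b):
    """Euclidean distance between two articulatory feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def alternatives(phone, radius=1.0):
    """Phones whose articulatory features lie within `radius` of the canonical
    phone -- a simple 'soft compactness' criterion in the articulatory domain."""
    ref = ARTIC[phone]
    return [p for p, feats in ARTIC.items() if distance(ref, feats) <= radius]

def pronunciation_fst(canonical):
    """Arcs (src_state, dst_state, phone) of a linear lattice that, at each
    position, allows the canonical phone and its articulatory neighbors."""
    arcs = []
    for i, phone in enumerate(canonical):
        for alt in alternatives(phone):
            arcs.append((i, i + 1, alt))
    return arcs

if __name__ == "__main__":
    # For canonical /b d/, the lattice also admits voicing and nearby-place
    # confusions, yielding the "bushy" structure described above.
    for arc in pronunciation_fst(["b", "d"]):
        print(arc)

In a real system the arc list would be compiled into a weighted finite-state transducer and composed with the recognizer's lexicon, but the compactness test above is the step that generates the extra, articulatorily nearby pronunciations.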

Third, I will discuss the problems of audio normalization and unlabeled class discovery, and the assumption of softly compact distributions of distributions. In this approach, we assume that each training token is generated by an "instance PDF" that is different from every other instance PDF, but that the instance PDFs corresponding to each labeled class are "softly compact" in the space of possible probability density functions. Two methods based on this assumption are worth noting. The GMM supervector method achieves excellent performance in tasks where the target labels are provided, but in which tokens are collected from unlabeled, arbitrarily noisy environments. Conversely, class discovery methods seek to generate labels for a class that does not exist in the labeled training set.
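The following is a hypothetical sketch of a GMM-supervector front end under common assumptions (the frame features, component count, relevance factor, and back-end classifier are all placeholders): train a background GMM (UBM) on pooled frames, MAP-adapt its means to each token, and stack the adapted means into one fixed-length "supervector" per token.

# Hypothetical sketch of a GMM-supervector front end; parameter values and
# frame extraction are assumed, and the downstream classifier is omitted.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_frames, n_components=8, seed=0):
    """Fit a diagonal-covariance universal background model on pooled frames."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=seed)
    ubm.fit(background_frames)
    return ubm

def supervector(ubm, token_frames, relevance=16.0):
    """Relevance-MAP adapt the UBM means to one token and flatten them."""
    resp = ubm.predict_proba(token_frames)          # (n_frames, n_components)
    n_k = resp.sum(axis=0) + 1e-10                  # soft counts per component
    e_k = resp.T @ token_frames / n_k[:, None]      # per-component frame means
    alpha = n_k / (n_k + relevance)                 # adaptation weights
    adapted = alpha[:, None] * e_k + (1.0 - alpha)[:, None] * ubm.means_
    return adapted.ravel()                          # fixed-length token vector

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    background = rng.normal(size=(2000, 13))        # stand-in acoustic frames
    token = rng.normal(loc=0.3, size=(150, 13))     # one (noisy) test token
    ubm = train_ubm(background)
    sv = supervector(ubm, token)
    print(sv.shape)                                 # (8 * 13,) = (104,)

The appeal of this representation is that every token, regardless of its duration or recording conditions, becomes a point in one fixed-dimensional space: each token's instance PDF is summarized by its adapted means, and a standard classifier can then exploit the soft compactness of each class's cloud of supervectors.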