Artificial Intelligence and Information Systems Seminar


William Schuler

Speech Understanding as Sequence Estimation


Spoken language interfaces for applications like home organizers, reminder systems, or immersive design may need to allow users to create new entities with names not found in existing training corpora, reducing the effectiveness of conventional techniques for estimating probabilities of hypothesized words. In such cases an "interactive" language model can be used to condition the probabilities of successive words on the possible meanings of those words in the current application environment. These interpretations can be defined as vectors, corresponding to distributions over head words or over sets of denoted individuals in a world model, which are then composed with relations defined as matrices. Unfortunately, semantic composition is typically understood to conform to syntactic phrase structure, and operations like set intersection and matrix multiplication are generally too expensive to be practical in the conventional cubic-time chart parsers used to hypothesize this phrase structure.
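As a rough illustration of the composition idea above (my own hedged sketch, not the talk's implementation), an interpretation can be a probability vector over a small hypothetical referent domain, a relation can be a matrix over that domain, and composition can be a matrix-vector product followed by renormalization. All names and the toy domain below are invented for illustration:

```python
# Toy sketch of vector/matrix semantic composition -- a hedged
# illustration, not the model described in the talk.
# Interpretations are probability vectors over a small hypothetical
# domain of referents; a relation is a matrix; composition is a
# matrix-vector product, renormalized to a distribution.

ROOMS  = ["kitchen", "bedroom"]       # hypothetical referent domain
LIGHTS = ["lamp", "ceiling_light"]    # hypothetical referent domain

# Relation "light located in room": IN[i][j] = 1.0 if LIGHTS[i] is in ROOMS[j].
IN = [
    [0.0, 1.0],   # the lamp is in the bedroom
    [1.0, 0.0],   # the ceiling light is in the kitchen
]

def compose(relation, referents):
    """Compose a relation matrix with a referent vector; renormalize."""
    out = [sum(row[j] * referents[j] for j in range(len(referents)))
           for row in relation]
    total = sum(out)
    return [x / total for x in out] if total else out

the_kitchen = [1.0, 0.0]                      # distribution over ROOMS
light_in_kitchen = compose(IN, the_kitchen)   # distribution over LIGHTS
print(light_in_kitchen)                       # -> [0.0, 1.0]
```

The cost concern in the paragraph above is visible even here: performing such a product at every cell of a cubic-time chart quickly becomes impractical as the domain grows.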

Psycholinguistic studies suggest an interactive model of human language processing that works somewhat differently: First, it seems to perform *incremental* interpretation of spoken utterances, identifying referents of words in an utterance even while these words are still being pronounced. Second, it seems to preserve ambiguity by maintaining competing interpretations in parallel. And third, it seems to operate within a severely constrained short-term memory store -- possibly limited to as few as three or four distinct elements -- which bounds the complexity of interactive recognition in a natural and computationally tractable way.

In this talk I will describe how these insights have been applied to the problem of real-time interactive speech interpretation, by modeling joint distributions over referents in an explicit three- or four-element memory store, using a factored HMM-like sequence model. First, I will present evidence that even a three-element model can obtain reasonably accurate syntactic recognition and nearly complete coverage on the large syntactically-annotated Penn Treebank corpus, using a simple reversible tree transform applied during training. I will then describe how this model can be applied directly to the recognition of speech repairs (spontaneous edits by a speaker involving repeated words and corrections) without introducing any additional machinery. Then I will show how this framework can be extended to perform incremental interpretation by introducing a variable over referents at each memory element, with an evaluation in an implemented real-time speech interface. The talk will conclude with a description of an implementation of this model that supports references to arbitrary sets of individuals as well as to individuals themselves, requiring very large or even unbounded random variable domains.
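The three properties above -- incremental update, parallel hypotheses, and a bounded memory store -- can be sketched in a few lines. This is my own simplified illustration under invented names, not the talk's factored HMM: hidden states are memory stores of at most three elements, and each observed word advances a weighted set of competing hypotheses:

```python
# Minimal sketch (hedged illustration, not the talk's model) of
# incremental sequence estimation with a bounded memory store.
# A hypothesis is a tuple of at most MAX_DEPTH memory elements,
# weighted by probability; all hypotheses advance in parallel
# as each word is observed.

MAX_DEPTH = 3  # the severely constrained short-term memory store

def step(hypotheses, transitions):
    """Advance every hypothesis by one observed word.

    hypotheses:  dict mapping a memory-store tuple to its probability
    transitions: dict mapping a store tuple to a list of
                 (next_store, probability) pairs for the current word
    """
    new = {}
    for store, p in hypotheses.items():
        for next_store, q in transitions.get(store, []):
            if len(next_store) <= MAX_DEPTH:      # enforce the memory bound
                new[next_store] = new.get(next_store, 0.0) + p * q
    total = sum(new.values())
    return {s: p / total for s, p in new.items()} if total else new

# Hypothetical example: from an empty store, the first word pushes
# either an NP or a VP analysis, and both survive in parallel.
hyps = step({(): 1.0}, {(): [(("NP",), 0.7), (("VP",), 0.3)]})
print(hyps)   # -> {('NP',): 0.7, ('VP',): 0.3}
```

Because every store holds at most three elements, the state space -- and thus the per-word update cost -- stays bounded, which is the tractability argument the abstract makes.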

Two articles (the former under review, the latter in press) describing the syntax and semantics of this model are available on my web site:
- http://www-users.cs.umn.edu/~schuler/paper-jcl08wsj.pdf
- http://www-users.cs.umn.edu/~schuler/paper-jcl07slush.pdf