Hover over a project to read a quick overview; click on the project to read a complete description.
Semantic parsing of sentences is believed to be an important task on the road to natural language understanding, and has immediate applications in tasks such as information extraction and question answering. Semantic Role Labeling (SRL) is a shallow semantic parsing task: for each predicate in a sentence, the goal is to identify all constituents that fill a semantic role and to determine their roles (Agent, Patient, Instrument, etc.) and adjuncts (Locative, Temporal, Manner, etc.).
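To make the task concrete, here is a small sketch of what an SRL analysis looks like for one sentence. The role labels follow the Agent/Patient/Instrument convention above; the span-based data structure itself is a hypothetical representation, not our system's actual output format.

```python
# Illustrative SRL analysis for "John broke the window with a hammer yesterday".
# Spans are (start, end) word-index ranges over the tokenized sentence.

sentence = "John broke the window with a hammer yesterday".split()

srl_frame = {
    "predicate": ("broke", 1),          # predicate word and its position
    "arguments": {
        "Agent":      (0, 1),           # "John"
        "Patient":    (2, 4),           # "the window"
        "Instrument": (4, 7),           # "with a hammer"
    },
    "adjuncts": {
        "Temporal":   (7, 8),           # "yesterday"
    },
}

def span_text(span):
    """Recover the surface text of a (start, end) span."""
    start, end = span
    return " ".join(sentence[start:end])

for role, span in srl_frame["arguments"].items():
    print(f"{role}: {span_text(span)}")
```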
The Semantic Entailment task seeks to move Natural Language Processing towards Natural Language Understanding by defining its goals in terms of the meaning of natural language, namely recognizing when the meaning of one text span follows from the meaning of another. We believe that progress in Textual Entailment will allow us to better define an applied notion of Semantics for natural language. Moreover, progress towards a solution to the Semantic Entailment problem will have immediate applications in a wide range of NLP tasks, particularly in Question Answering.
A given entity - representing a person, a location, or an organization - may be mentioned in text in multiple, ambiguous ways. Understanding natural language and supporting intelligent access to textual information requires identifying whether different entity mentions actually refer to the same entity.
Dependency trees provide a syntactic representation that encodes functional relationships between words; the representation is relatively independent of grammatical theory and can be used to represent the structure of sentences in different languages. Dependency structures are more efficient to parse and are believed to be easier to learn, yet they still capture much of the predicate-argument information needed in applications, which is one reason for the recent interest in learning these structures.
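A dependency tree can be stored very compactly: one head index per word. The sketch below illustrates this for a toy sentence; the relation labels are illustrative functional relations, not a fixed standard inventory.

```python
# Minimal dependency-tree sketch for "She gave him a book".
# heads[i] is the index of word i's head; -1 marks the root.

words  = ["She", "gave", "him", "a", "book"]
heads  = [1, -1, 1, 4, 1]            # "gave" is the root; "a" attaches to "book"
labels = ["subj", "root", "iobj", "det", "obj"]

def children(i):
    """Return the indices of words whose head is word i."""
    return [j for j, h in enumerate(heads) if h == i]

root = heads.index(-1)
print(words[root], "->", [words[c] for c in children(root)])
```

Because the tree is just an index array, it transfers easily across languages and grammatical frameworks, which is part of the representation's appeal.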
Understanding natural language, reasoning, and acting in the world have been long recognized as fundamental phenomena of intelligence. Much progress has been made in understanding them from a computational perspective. Unfortunately, the theories that have emerged for the phenomena are somewhat disparate. We have very little understanding for how having a model of the world -- concepts, objects and relations -- contributes to understanding the meaning of natural language utterances about this world. In this project we study, within an integrated framework, the problem of understanding language in the context of acting in the world, and of acting and reasoning from representations generated by natural language interactions.
Through collaboration with psycholinguists we are studying possible mechanisms through which children acquire language. Using our state-of-the-art SRL system to mimic language-learning processes in children, we are able to control what knowledge sources are available to the learner. With this setup we are able to reproduce findings from child learning experiments, demonstrating the efficacy of psycholinguistic theories in a controlled learning environment.
Named entity (NE) transliteration is the process of transcribing an NE from a source language to some target language based on phonetic similarity between the entities. Identifying transliteration pairs is an important component in many linguistic applications which require identifying out-of-vocabulary words, such as machine translation and multilingual information retrieval. We have developed (almost) unsupervised discriminative techniques for learning transliteration models as well as constrained optimization inference techniques for extracting better transliteration features.
There are three fundamental approaches to computing lexical similarity: (1) knowledge-rich methods, (2) vector-based models using direct distributional similarity, and (3) vector-based models using dimensionality reduction techniques. We have developed several lexical similarity metrics at the word and phrasal levels (e.g., noun phrases, entity phrases, and numerical phrases), and are also comparing the key approaches in the context of NLP tasks such as Textual Entailment (TE).
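Approach (2) can be made concrete with a small sketch: words are compared by the cosine of their context-count vectors, so words seen in similar contexts come out similar. The toy co-occurrence counts below are invented purely for illustration.

```python
# Direct distributional similarity: cosine over context-count vectors.

import math

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    contexts = set(u) | set(v)
    dot = sum(u.get(c, 0) * v.get(c, 0) for c in contexts)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical context-word co-occurrence counts:
car    = {"drive": 10, "road": 7, "engine": 5}
truck  = {"drive": 8,  "road": 6, "cargo": 4}
banana = {"eat": 9, "peel": 5, "yellow": 3}

print(cosine(car, truck) > cosine(car, banana))
```

Knowledge-rich methods (1) would instead consult a resource such as WordNet, and approach (3) would first project these sparse vectors into a low-dimensional space.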
Making complex decisions in real world problems often involves assigning values to sets of interdependent variables where the expressive dependency structure can influence, or even dictate, what assignments are possible. Structured learning problems provide one such example, but the setting we study is broader. We are interested in cases where decisions depend on multiple models that cannot be learned simultaneously as well as cases where constraints among models' outcomes are available only at decision time. We have developed a general framework -- Constrained Conditional Models -- that augments the learning of conditional (probabilistic or discriminative) models with declarative constraints (written, for example, using a first-order representation) as a way to support decisions in an expressive output space while maintaining modularity and tractability of training and inference.
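The idea can be sketched in miniature: local classifiers score each variable independently, and a declarative constraint restricts which joint assignments are admissible at decision time. The labels, scores, and the single constraint below are invented for illustration, and the brute-force search stands in for the integer-programming or search-based inference used in practice.

```python
# Toy Constrained Conditional Model: argmax of summed local scores,
# restricted to assignments satisfying a declarative constraint.

from itertools import product

tokens = ["John", "the", "window"]
labels = ["Agent", "Patient", "O"]

# Hypothetical local scores from a learned model, per token per label.
score = {
    "John":   {"Agent": 2.0, "Patient": 0.5, "O": 0.1},
    "the":    {"Agent": 0.8, "Patient": 0.3, "O": 1.0},
    "window": {"Agent": 1.5, "Patient": 1.2, "O": 0.2},
}

def satisfies_constraints(assignment):
    # Declarative constraint: at most one token may be labeled Agent.
    return sum(1 for y in assignment if y == "Agent") <= 1

best = max(
    (a for a in product(labels, repeat=len(tokens)) if satisfies_constraints(a)),
    key=lambda a: sum(score[t][y] for t, y in zip(tokens, a)),
)
print(best)
```

Note that the unconstrained per-token argmax would label both "John" and "window" as Agent; the constraint forces inference to trade off the local scores and pick a globally coherent assignment instead.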
Learning-based programs are software applications that utilize machine learning technology to interact with naturally occurring data that are highly variable and ambiguous. They lend themselves to a computational model in which some of the variables, concepts, and relations may be defined only in a data-driven way, or may not be unambiguously defined without relying on other concepts acquired this way. Learning Based Programming (LBP) is a new programming paradigm for specifying computations under this model.
A project for identifying medical concepts in text.
Information extraction (IE) is the task of extracting structured information from machine-readable documents. Machine learning approaches to IE have demonstrated superior performance and are now the dominant approach. We have developed machine learning and inference techniques both for frame-like information extraction, which attempts to map free-form text to a database with a given schema, and for the extraction of specific relations and entities from free-form text.
Recognizing named entities and the relations among them is an essential part of natural language understanding and has immediate applications in facilitating access to information. We have developed machine learning and inference techniques for both tasks, focusing on exploiting the inter-dependencies between them to improve performance on each.
The goal of this project is to develop natural language processing techniques and search protocols that support improved search. Specifically, we intend to support search for relations and actions mentioned in text, as well as search via entailment and semantic understanding.
Different information sources may contradict each other, leaving the truth uncertain. Deciding which authors, publishers, documents, and assertions to trust may be of vital importance, with applications that are both immediate (e.g. finding a news article for a user) and intermediate (e.g. building a knowledge base). While trust has been closely studied in fields such as peer-to-peer networks, social networks, recommender systems, game theory, and even user interface design, this "information trustworthiness" has received relatively little attention. We are investigating new algorithms that model trustworthiness, and also seeking to develop a precise notion of what trust means in this context.
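One simple family of trustworthiness models iterates between two estimates, in the spirit of hubs-and-authorities: a source's trust is the average belief in its claims, and a claim's belief is the combined trust of its sources. The sketch below is a toy instance of that idea, not our actual algorithm; the sources, conflicting claims, and fixed iteration count are all invented for illustration.

```python
# Toy iterative trust model: sources that agree with the (trust-weighted)
# majority gain trust, and their claims gain belief.

claims_by_source = {
    "site_a": ["X=1", "Y=2"],
    "site_b": ["X=1", "Y=3"],
    "site_c": ["X=2", "Y=3"],
}

trust = {s: 1.0 for s in claims_by_source}   # start with uniform trust
for _ in range(20):
    # Belief in each claim: sum of its supporters' trust, then normalized.
    belief = {}
    for s, claims in claims_by_source.items():
        for c in claims:
            belief[c] = belief.get(c, 0.0) + trust[s]
    total = sum(belief.values())
    belief = {c: b / total for c, b in belief.items()}
    # Trust in each source: average belief in the claims it makes.
    trust = {s: sum(belief[c] for c in claims) / len(claims)
             for s, claims in claims_by_source.items()}

print(sorted(belief, key=belief.get, reverse=True))
```

Here site_b agrees with a majority on both questions, so it ends up the most trusted source, and the claims it supports ("X=1" and "Y=3") end up the most believed.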
We examine a lifted first-order probabilistic inference algorithm that performs at a first-order level even during inference itself, and not only at the specification stage. The benefits of such an algorithm are improved efficiency and greater comprehensibility of inference steps.
While ad hoc representations of natural language data are sufficient for low-level Natural Language Text Processing tasks in isolation, more complex tasks that require the combination of many localized resources can benefit from a unifying representation. The Cognitive Computation Group's (CCG) MRCS (Modular Representation and Comparison Scheme) is being developed to allow an open-ended set of automated NLP text analyses to be combined, and provides a noisy unification-based architecture for reasoning about natural language text.
Supervised learning strategies are costly in terms of resources. However, one can often reduce costs--make use of only a small amount of labeled data along with a large pool of unlabeled examples--by exploiting regularities present in the data and, possibly, domain specific information. We investigate semi-supervised and unsupervised learning methods to minimize the need for supervision in the context of several machine learning protocols including Ranking and Structure Learning, and apply these to problems such as transliteration, context sensitive paraphrasing and information extraction.
The successful application of machine learning algorithms to real-world problems is often predicated on obtaining sufficient labeled training data. By allowing the learning algorithm to query a domain expert for additional information, the sample complexity can often be reduced. We are working on extending standard active learning to more complex settings, such as structured output spaces and pipeline models, with applications to natural language processing problems.
The need to meaningfully combine sets of rankings often arises in ranked data sets. Although a number of heuristic and supervised learning approaches to rank aggregation exist, they require domain knowledge or supervised ranked data, both of which are expensive to acquire. In our early work in this area we studied generalization properties of ranking functions. More recently, we have focused on investigating learning methods for aggregating (partial) rankings without the need for supervised learning.
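As one concrete point in the space of unsupervised aggregation heuristics, Borda counting combines rankings by summing positional scores; it needs no domain knowledge or labeled ranked data. This is a simple baseline shown only to make the setting concrete, and the item names and input rankings are invented.

```python
# Borda-count rank aggregation: each item earns (n - position) points
# from each input ranking; items are sorted by total points.

def borda_aggregate(rankings):
    """Combine best-first ranked lists into one consensus ranking."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - pos)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda x: (-scores[x], x))

rankings = [
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["b", "a", "c"],
]
print(borda_aggregate(rankings))   # "a" is ranked first by two of three voters
```

Handling partial rankings, where not every item appears in every list, is exactly where such simple positional schemes break down and learning-based aggregation becomes attractive.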
The problem of domain adaptation is critical in natural language processing, given that models trained on a given domain (e.g., news) may not perform well on other domains (e.g., medical text) and that one may want to adapt learned models also across languages and across related tasks. We attempt both to (1) develop new algorithms for adaptation across domains and languages and (2) develop a theoretical understanding of the domain adaptation issues and algorithms.
Pipeline models describe the setting where a complex task is decomposed into a sequence of stages such that the subtask at each stage is dependent both on the initial input and the results from previous stages. This is a widely used paradigm for many natural language processing problems.
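The structure of this setting can be sketched schematically: each stage receives both the raw input and the accumulated outputs of all earlier stages. The stage implementations below are trivial stand-ins (a whitespace tokenizer and a capitalization-based tagger), invented only to show the control flow.

```python
# Schematic pipeline: every stage sees the raw input plus prior results.

def tokenize(text, _prior):
    return text.split()

def pos_tag(text, prior):
    # Toy tagger that depends on the tokenizer's output from the prior stage.
    tokens = prior["tokenize"]
    return [(t, "NOUN" if t[0].isupper() else "OTHER") for t in tokens]

def run_pipeline(text, stages):
    results = {}
    for name, stage in stages:
        results[name] = stage(text, results)   # input + all previous outputs
    return results

out = run_pipeline("Mary saw John", [("tokenize", tokenize), ("pos_tag", pos_tag)])
print(out["pos_tag"])
```

A key consequence of this design, and a focus of work on pipeline models, is that errors made at an early stage propagate to every later stage that consumes its output.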
As the volume of documents expands, organizing them for easy accessibility becomes increasingly difficult. We are developing an approach to automatically and transparently classify and organize electronic documents into ontologies; our approach is inspired by the way people can infer whether a document discusses a particular topic using only the meaning of the labels.