I missed today’s CLIP. Too much work and too much rain. But the description of it made it sound especially interesting, because the speaker is obviously really grappling with the concept of context. It would have been interesting to have heard what he did with it and how he used linguistics (he specifically mentioned the field, albeit probably not in a discourse analytic type of way). I will have to follow up with him or with his papers. Thankfully, he’s local!
Here’s the summary:
February 29: Vlad Eidelman, Unsupervised Textual Analysis with Rich Features
Learning how to properly partition a set of documents into categories in an unsupervised manner is quite challenging, since documents are inherently multidimensional, and a given set of documents can be correctly partitioned along a number of dimensions, depending on the criterion. Since the partition criterion for a supervised model is encoded in the data via the class labels, even the standard information retrieval representation of a document as a vector of term frequencies is sufficient for many state-of-the-art classification models. This representation is especially well suited for the most common application: topic (or thematic) analysis, where term presence is highly indicative of class. Furthermore, for tasks where term presence may not be adequate, such as sentiment or perspective analysis, discriminative models have the ability to incorporate complex features, allowing them to generalize and adapt to the specific domain. In the case where we do not have access to resources for supervised training, we must turn to unsupervised clustering models. Clustering models rely almost exclusively on a simple bag-of-words vector representation, which performs well for topic analysis but, unfortunately, is not guaranteed to perform well for a different task.
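To make the bag-of-words point concrete, here is a minimal sketch in Python. It is not taken from the talk: the three toy documents, the whitespace tokenizer, and the helper names term_frequencies and cosine are all assumptions made for illustration. It builds term-frequency vectors and compares them with cosine similarity, the kind of signal a clustering model would typically work from.

```python
from collections import Counter
import math

def term_frequencies(text):
    """Bag-of-words: lowercase, split on whitespace, count each term."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy documents invented for this example (not data from the talk).
docs = {
    "film_pos": "the film had a great plot and the actors carried every scene",
    "film_neg": "the film had an awful plot and the actors ruined every scene",
    "game_pos": "the team had a great match and the striker scored twice",
}
vectors = {name: term_frequencies(text) for name, text in docs.items()}

# Same topic, opposite sentiment.
same_topic = cosine(vectors["film_pos"], vectors["film_neg"])
# Same sentiment, different topic.
same_sentiment = cosine(vectors["film_pos"], vectors["game_pos"])

print(f"same topic:     {same_topic:.3f}")
print(f"same sentiment: {same_sentiment:.3f}")
```

On these toy documents the two film reviews come out more similar to each other than the two positive reviews do, so a clustering driven by this representation would tend to split by topic rather than by sentiment.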
In this talk, I will present a feature-enhanced unsupervised model for categorizing textual data. The presented model allows for the integration of arbitrary features of the observations within a document. While in generative models the observed context is usually a single unigram or bigram, our model can robustly expand the context to extract features from a larger block of text. After presenting the model derivation, I will describe the use of complex, automatically derived linguistic and statistical features across three practical tasks with different criteria: perspective, sentiment, and topic analysis. I show that by introducing domain-relevant features, we can guide the model towards the task-specific partition we want to learn. For each task, our feature-enhanced model outperforms strong baselines and state-of-the-art models.
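The abstract does not spell out the model itself, so the following Python sketch only illustrates the general idea of steering a partition with task-relevant features; it is not the speaker's method. The toy lexicons, the <POS> and <NEG> feature names, the lexicon_weight parameter, and the documents are all hypothetical. Each document's bag-of-words vector is augmented with two block-level sentiment counts, and the similarities are compared with and without them.

```python
from collections import Counter
import math

# Hypothetical toy lexicons, invented for this illustration only.
POSITIVE = {"great", "good", "excellent"}
NEGATIVE = {"awful", "bad", "terrible"}

def features(text, lexicon_weight=0.0):
    """Bag-of-words counts, optionally augmented with weighted
    block-level sentiment-lexicon counts (an assumed feature design)."""
    tokens = text.lower().split()
    vec = dict(Counter(tokens))
    if lexicon_weight > 0:
        vec["<POS>"] = lexicon_weight * sum(t in POSITIVE for t in tokens)
        vec["<NEG>"] = lexicon_weight * sum(t in NEGATIVE for t in tokens)
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy documents invented for this example (not data from the talk).
film_pos = "the film had a great plot and the actors carried every scene"
film_neg = "the film had an awful plot and the actors ruined every scene"
game_pos = "the team had a great match and the striker scored twice"

def similarities(weight):
    """Return (same-topic, same-sentiment) similarity for a given weight."""
    fp, fn, gp = (features(d, weight) for d in (film_pos, film_neg, game_pos))
    return cosine(fp, fn), cosine(fp, gp)

plain_topic, plain_sentiment = similarities(0.0)
rich_topic, rich_sentiment = similarities(5.0)

print(f"bag-of-words only: topic={plain_topic:.3f} sentiment={plain_sentiment:.3f}")
print(f"with lexicon feats: topic={rich_topic:.3f} sentiment={rich_sentiment:.3f}")
```

With plain counts the two film reviews are the closest pair; once the weighted lexicon features are added, the two positive reviews become the closest pair instead. The weight of 5.0 is arbitrary and was chosen only to make the effect visible on such short texts.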
Bio: Vladimir Eidelman is a fourth-year Ph.D. student in the Department of Computer Science at the University of Maryland, working primarily with Philip Resnik. He received his B.S. in Computer Science and Philosophy from Columbia University in 2008 and an M.S. in Computer Science from UMD in 2010. His research interests are in machine learning and natural language processing problems, such as machine translation, structured prediction, and unsupervised learning. He is the recipient of the National Science Foundation Graduate Research and National Defense Science and Engineering Graduate Fellowships.