Funny Focus Group moment on Oscars

It’s not often that aspects of survey research make it into the public sphere. Last night’s Oscars included some “recovered focus group footage” from the Wizard of Oz. It’s hilarious, and there’s a good reason why: humor often happens when occurrences don’t match expectations. We tend to expect every member of a focus group to be reasonable and representative, but in reality that just isn’t the case.

 

Anyway, enjoy!

Framing: an Important Aspect of Discourse Analysis

One aspect of discourse analysis that is particularly easy to connect with is framing. Framing is a term that we hear very often in public discourse, as in “How was that issue framed?” or “How should this idea be framed if we want people to buy into it?” Framing in discourse analysis is similar, but it is a much more useful concept.

We understand a frame as ‘what is going on.’ This can be very simple. I can see you on the street and greet you. We can both think of it simply as a greeting frame, and we can have similar ideas about what that greeting frame should look like. I can say “Hey there, nice to see you!” and you can answer back “Nice to see you!” We can both then smile at each other, and keep walking, both smiling for having seen each other.

But frames are much more complicated than that, for the most part. Each of the interactants has their own idea of what the frame of the interaction is, and each has their own set of knowledge about what the frame entails. It would be easy for us to have different sets of knowledge or expectations regarding the frame. We do, after all, have a lifetime of separate experiences. We also could disagree about the framing of our interaction. What if I think we are simply greeting and passing, and you think we are greeting and then starting a conversation? Or what if we decide to enter a nearby bar, and I think we are on a date and you do not?

Frames also have layers. We might love to joke, but we will joke differently in a job interview than we will at a bar. Joking in a job interview is what we call an embedded frame in discourse analysis. The layering of frames is an interesting point of analysis as well, because we may or may not have the same idea of what the outer frame of our interaction is.

I believe it was Erving Goffman who pointed out that the range of emotions we access is contingent on the frame we are working within. Truly, anger in an office is generally quite tame compared to anger at home…

Framing accounts for both successful communication and misunderstandings. It’s an especially useful tool with which to evaluate the success or failure of an interaction. It is especially interesting to look at framing in terms of the cuing that interactants do. How do we signal a change in frame? Are those signals recognized as they were intended? Are they accepted or rejected?

Framing is also an interesting way to view relationships. It is easy, especially early in a relationship, to assume that your partner shares your frames and the knowledge about them. Similarly, it is easy to assume that your partner shares the same priorities that you do.

Unfortunately, we tend to judge people by the frames that we have activated. So if I frame our interaction as ‘cleaning the kitchen’ and you view it as ‘chatting in the kitchen while fiddling with the dishcloth,’ I am likely to judge your performance as a cleaner negatively. Similarly, in a job interview situation, framing problems are often not recognized by the interviewers, causing the interviewee to appear incompetent.

Recognizing framing issues is an important element of what discourse analysts do in their professional lives when analyzing communication.

Observations on another CLIP event: ESL and MT

Today I attended another CLIP colloquium at the University of MD:

Feb 22: Rebecca Hwa, The Role of Machine Translation in Modeling English as a Second Language (ESL) Writings

She addressed these research questions:

1. How patterned are the errors of English language learners?

1a. Could ‘English with mistakes’ be used as an input for machine translation?

1b. Could that be used to improve MT outputs?

1c. Could these findings be used for EFL training?

 

Her presentation made me think a lot about the role of linguistics in this type of work and about the nature of English.

First, I am coming to firmly believe that the best text processing should be done in partnership between linguists and computer scientists. Linguistics provides the most thorough and reliable frame for computer scientists to key off of, and once you stray from the nature of what you’re trying to represent, you end up astray.

So, for example, in the first part of her research presentation she talked about a project involving machine translation and English language learners of all backgrounds. One woman in the audience kept asking questions about the conglomeration of non-native English speakers, and I assumed she was from the English department. The issue of mistakes in language use is a huge one, and a focus has to be chosen from which to do the work. Maybe language background would be a more productive way to narrow the focus, and would allow for much more specific structural guidance and bodies of knowledge on language interference.

Second, she spoke about Chinese English language learners in particular and her investigation of lexical choice. Often English language learners’ written English is marked by lexical choices that appear strange to native English speakers. Her hypothesis was that the words that were used in place of the correct words were similar in some way to the correct words, most likely by context. She played a lot with the definition of context: was it proximity? Was it a specific grammatical relationship? This discussion was fascinating, but probably could have benefited from some restrictions on the context of the errors she was targeting. Again, this is from the linguistics end of the linguistics—computer science spectrum.
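For concreteness, here is a minimal Python sketch of the “proximity” notion of context from that discussion: it counts the words within a small window of a target word and measures how much context two usages share. The sentences, window size, and function names are my own illustration, not anything from Hwa’s actual system (and the grammatical-relationship notion would need a dependency parser, which I only gesture at in a comment).

```python
from collections import Counter

# Illustrative only: "context as proximity" for comparing a learner's word
# choice with a candidate correction. "Context as a grammatical relationship"
# would instead collect the word's dependency neighbors (e.g. via a parser).
def window_context(tokens: list[str], index: int, size: int = 2) -> Counter:
    """Count the words within `size` positions of tokens[index]."""
    lo, hi = max(0, index - size), min(len(tokens), index + size + 1)
    return Counter(t for i, t in enumerate(tokens[lo:hi], start=lo) if i != index)

def context_overlap(a: Counter, b: Counter) -> int:
    """How many context words two usages share -- a crude proximity-based
    similarity between two word choices."""
    return sum((a & b).values())

if __name__ == "__main__":
    # Made-up example of an interference-style lexical choice.
    learner = "i will open the light in the room".split()
    native = "i will turn on the light in the room".split()
    # Compare the context of the learner's "open" with the native "turn".
    print(context_overlap(window_context(learner, 2), window_context(native, 2)))
```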

Her talk made me think a lot about the nature of English. I often think about what it means to be a global language. English is spoken in many places where there are no native speakers, and it is spoken in many places that we don’t traditionally think of as native English places. Often the English that arises from these contexts is judged to be full of errors, but I don’t necessarily agree with this. Instead, I would ask two questions:

1. Is the variation patterned?

2. Is communication successful?

If the answer to these questions is yes, then I don’t think that the speaker is producing errors so much as a different variety of English. Varieties of English are not all treated with the same respect, but I suspect that the reasons behind this have more to do with the prejudices of the person judging the grammar than with any deficiency on the part of the speaker.

AAPOR Conference Preliminary Program is Up!

This is exciting!

The conference theme this year is New Frontiers in Public Opinion Research, and now we can get a first glimpse at AAPOR’s take on the future of the field! There are quite a few sessions on web survey design, paradata, alternative data sources, and the potential of social media. It will be interesting to see which of the sessions will have a sociolinguistic bent, because many certainly have that potential. There are also sessions on interviewer effects and context effects, which may even use Conversation Analysis (CA) approaches.

http://www.aapor.org/AM/Template.cfm?Section=AAPOR_Annual_Conference&Template=/CM/ContentDisplay.cfm&ContentID=4986

Patterning in Language, revisited

Language can be pretty mindblowing.
In my paper on the potential of Natural Language Processing (NLP) for social science research, I called NLP a kind of oil rig for the vast reserves of data that we are increasingly desperate to tap.
Sometimes the rigging runs smoothly. This week I read a chapter about compliments in Linguistics at Work. In the chapter, Nessa Wolfson describes her investigations into the patterning of compliments in English. Although some of her commentary in this chapter seems far off base to me (I’ll address this in another post), her quantitative findings are strong. She discovered that 54% of the compliments in her corpus fell into a single syntactic pattern, 85% fell into three syntactic patterns, and 97% fell into a total of nine syntactic patterns. She also found that 86% of the compliments with a semantically positive verb used just two common verbs, ‘like’ and ‘love.’ And she discovered some strong patterning in the adjectival compliments as well.

 

Linguistic patterns such as these are generally not something that native speakers of a language are aware of, yet they offer great potential to English language learners and NLP programmers. It is precisely patterns such as these that NLP programmers use in order to mine information from large bodies of textual data. When language is patterned as strongly as this, it is significantly easier to mine, which makes a strong case for the effectiveness of NLP as a rig and syntax as the bones of the rig.
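As a toy illustration of how such patterning becomes minable, here is a short Python sketch that sorts candidate compliments into a few commonly cited compliment formulas (along the lines of “NP looks ADJ,” “I love NP,” “That’s a ADJ NP”) using rough regular expressions. The specific regexes and example sentences are my own stand-ins, not Wolfson’s actual coding scheme.

```python
import re

# Rough regexes for three commonly cited compliment formulas (e.g. "Your
# hair looks great", "I really love your dress", "That's a nice shirt").
# These are illustrative stand-ins, not Wolfson's coding categories.
PATTERNS = {
    "NP is/looks (really) ADJ": re.compile(
        r"\b(is|are|looks?)\s+(really\s+)?\w+$", re.IGNORECASE),
    "I (really) like/love NP": re.compile(
        r"\bi\s+(really\s+)?(like|love)\b", re.IGNORECASE),
    "PRO is (really) (a) ADJ NP": re.compile(
        r"\b(that|this|it)('s|\s+is)\s+(really\s+)?(a\s+)?\w+\s+\w+", re.IGNORECASE),
}

def classify_compliment(text: str) -> str | None:
    """Return the first formula whose rough regex matches, else None."""
    cleaned = text.strip().rstrip(".!?")
    for name, regex in PATTERNS.items():
        if regex.search(cleaned):
            return name
    return None

if __name__ == "__main__":
    examples = ["Your hair looks great!", "I really love your dress.",
                "That's a nice shirt.", "Where did you get it?"]
    for sentence in examples:
        print(f"{sentence!r} -> {classify_compliment(sentence)}")
```

Even a crude matcher like this would catch a large share of a compliment corpus if the distribution is as skewed as Wolfson reports, which is the point about minability.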

 

But as strongly as language patterns in some areas, it is also profoundly conflicted in others.

 

This week I attended a CLIP Colloquium at the University of Maryland. The speaker was Jan Wiebe, and the title of her talk was ‘Subjectivity and Sentiment Analysis: From Words to Discourse.’ In an information-packed, hour-long talk, Wiebe essentially covered her long history with sentiment analysis and discussed her current research (I took 11 pages of notes! Totally mindblowing). Wiebe approached one of the essential struggles of linguistics, the spectrum between language out of context and language in context (from words to discourse), from a computer science perspective. She spoke about the programming tools and transformations that she had developed and worked with in order to take data out of context in an automated way and build its meaning back in a patterned way. For each stage or transformation, she spoke of the complications and potential errors she had encountered.

 

She spoke of her team’s efforts to tag word senses in WordNet by their subjective or objective orientation and their positive or negative meanings. Her team has created a downloadable subjectivity lexicon, and they hope to make a subjectivity phrase classifier available this spring. For the sense labeling, they decided to use coarser groupings than WordNet’s in order to improve accuracy, so instead of associating words with their full sets of senses, they associate them only with usage domains, or s/o (subjective/objective) and p/n/n (positive/negative/neutral). This increases the accuracy of the tags, but doesn’t account for context effects such as polarity shifting, e.g. from wonderfully (+) horrid (-) to wonderfully horrid (+). The subjectivity phrase classifier will be a next step in the transition between prior polarity (out-of-context, word-level orientation, as in the subjectivity lexicon) and contextual polarity (the ultimate polarity of the sentence, taking into account phrase dependency and longer-distance negation such as “not only good, but amazing”).
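To make the prior-versus-contextual distinction concrete, here is a toy Python sketch: a tiny invented lexicon of prior polarities, a naive word-level sum, and a windowed negation rule that handles “not good” but still misreads the long-distance “not only good, but amazing.” None of this is the actual subjectivity lexicon or classifier from Wiebe’s group; it is only meant to show why a phrase-level, in-context step is needed.

```python
# Toy illustration of the gap between prior polarity (word-level, out of
# context) and contextual polarity (phrase-level). The mini-lexicon and the
# window heuristic are invented for illustration only.
PRIOR = {"wonderfully": +1, "horrid": -1, "good": +1, "amazing": +1}
NEGATORS = {"not", "never", "hardly"}

def prior_score(phrase: str) -> int:
    """Sum the prior polarities of the words, ignoring all context."""
    return sum(PRIOR.get(t.strip(",."), 0) for t in phrase.lower().split())

def windowed_negation_score(phrase: str, window: int = 3) -> int:
    """Flip the prior polarity of any word within `window` tokens after a
    negator -- a common first approximation to contextual polarity that
    still misreads constructions like 'not only good, but amazing'."""
    toks = [t.strip(",.") for t in phrase.lower().split()]
    score = 0
    for i, tok in enumerate(toks):
        p = PRIOR.get(tok, 0)
        if any(toks[j] in NEGATORS for j in range(max(0, i - window), i)):
            p = -p
        score += p
    return score

if __name__ == "__main__":
    for phrase in ["wonderfully horrid", "not good", "not only good, but amazing"]:
        print(f"{phrase!r}: prior={prior_score(phrase)}, "
              f"windowed={windowed_negation_score(phrase)}")
```

The first phrase comes out neutral under both rules even though readers hear it as positive, and the third comes out neutral under the negation rule even though it is emphatically positive, which is exactly the kind of gap the phrase classifier is meant to close.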

 

She also spoke of her team’s research into debate sites. They annotate individual postings by their target relationships (same/alternative/part/anaphora, etc.), p/n/n, and reinforcing vs. non-reinforcing. So, for example, in a debate between BlackBerrys and iPhones, where the sides are predetermined by the setup of the site, she can connect relationships to stances: “fast keyboard” is a positive stance toward the BlackBerry, “slower keyboard” reflects a negative orientation toward the iPhone, and a pro-iPhone post that mentions the “fast keyboard” is building a concession rather than an argument in favor of the BlackBerry.
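Here is a rough Python sketch of how such per-posting annotations might be represented and connected to stances. The field names and the stance rule are my own hypothetical simplification of what she described, not her team’s actual annotation scheme.

```python
from dataclasses import dataclass

# Hypothetical representation of a debate-post opinion annotation.
@dataclass
class OpinionAnnotation:
    span: str          # the opinion expression, e.g. "fast keyboard"
    target: str        # what it is about, e.g. "blackberry" or "iphone"
    relation: str      # same / alternative / part / anaphora / ...
    polarity: str      # "p", "n", or "neutral"
    reinforcing: bool  # does it reinforce the post's overall stance?

def stance_toward(ann: OpinionAnnotation, side: str) -> str:
    """Naive rule: praise of `side` (or criticism of a competing product)
    counts as pro; criticism of `side` (or praise of a competitor) as con."""
    positive = ann.polarity == "p"
    if ann.target == side:
        return "pro" if positive else "con"
    return "con" if positive else "pro"

if __name__ == "__main__":
    fast_bb = OpinionAnnotation("fast keyboard", "blackberry", "same", "p", True)
    slow_ip = OpinionAnnotation("slower keyboard", "iphone", "alternative", "n", True)
    # In a pro-iphone post this would be a concession, which is roughly what
    # the non-reinforcing flag captures.
    concession = OpinionAnnotation("fast keyboard", "blackberry", "alternative", "p", False)
    for ann in (fast_bb, slow_ip, concession):
        print(ann.span, "->", stance_toward(ann, "blackberry"))
```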

 

In sum, she discussed the transformation between words out of context and words in context, a transformation which is far from complete. She discussed the subjectivity or objectivity of individual words, but then showed how these could be transformed through context. She showed the way phrases with the same syntactic structure could have completely different meanings. She spoke of the difficulty of isolating targets, or the subjects of the speech. She spoke of the interdependent structures in discourse, and the way that each compounding phrase in a sentence can change the overall directionality. She spoke of her efforts to account for these more complex structures with a phrase-level classifier, and she spoke of her research into more indirect references in language. Each of these steps is a separate area of research, each compounding error on the path between words and discourse.

 

Patterning such as Wolfson found shows the great potential of NLP, but research such as Wiebe’s shows the complicated nature of putting these patterns into use. In fact, this was exactly my experience working with NLP. NLP is a constant struggle between linguistic patterning and the complicated nature of discourse. It is an important and growing field, but the problems it poses will not be quickly resolved. The rigs are being built, but the quality of the oil is still dubious.