More work on Twitter, Google Searches and Text Analytics in Survey research

I am so excited to see this blog post and read the paper that it was based on!

 

blog post:

https://blogs.rti.org/surveypost/2012/01/04/can-surveillance-of-tweets-and-google-searches-substitute-survey-research-2

paper:

http://www.rti.org/pubs/twitter_google_search_surveillance.pdf

 

Kudos to RTI for continuing to carve out a place for text analytics in the future of survey research!

Word Clouds

Here is an interesting application of word clouds. It is a word cloud analysis of Public Opinion Quarterly, the leading journal in Public Opinion Research:

https://blogs.rti.org/surveypost/2012/02/26/a-visual-history-of-poq-1937-to-present/

Word clouds are a fast and easy tool that produces a visual picture of the most frequently used words in a body of text, or ‘bag of words.’ They are frequently used as a tool for content analysis.
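
To make the mechanics concrete, here is a minimal Python sketch of the counting step that sits behind every word cloud (the toy corpus and stopword list are my own invented examples, not from any of the projects above):

```python
from collections import Counter
import re

# Invented toy corpus standing in for a body of open-ended text.
text = """The public said the survey was too long.
Public opinion on the survey was mixed, but the public responded anyway."""

# Tokenize to lowercase words and drop a few common stopwords.
STOPWORDS = {"the", "was", "on", "but", "too", "a", "an", "and", "anyway"}
tokens = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

# A word cloud is just this frequency table rendered visually,
# with more frequent words drawn in larger type.
for word, count in Counter(tokens).most_common(5):
    print(word, count)
```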

On my ‘my research’ page above, there is a link to a paper I wrote about text analytic strategies. In the paper, I addressed word clouds in great detail. I did that because word clouds are fast gaining popularity and recognition in the survey research community and in society at large. However, the clouds have many limitations that are rarely considered by the people who use them.

One of the complications of a word cloud is that word frequency alone doesn’t speak to the particular ways in which a word was used. So when you see ‘public,’ you may think of the public/private dichotomy that is such a big debate in the current public sphere. However, in the context of a survey, ‘public’ could just as easily be used as a noun, to refer to potential respondents. While word clouds appear to give a lot of information in a quick visual, the picture underlying that information can be clouded by the complexities of language use.
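
To see how much is lost, here is a small sketch using NLTK’s off-the-shelf tools (assuming its standard tokenizer and tagger models are installed). A raw frequency count treats both uses of ‘public’ as the same token; even a simple part-of-speech tagger can at least separate the noun use from the adjective use:

```python
import nltk  # assumes NLTK's standard tokenizer and POS tagger data are installed

sentences = [
    "The public responded to the survey.",       # intended here as a noun
    "Public opinion shifted after the debate.",  # intended here as an adjective
]

# A word cloud would count 'public' twice and stop there. Tagging each
# sentence lets us inspect how each occurrence of 'public' is used.
for sent in sentences:
    tags = nltk.pos_tag(nltk.word_tokenize(sent))
    print([(word, tag) for word, tag in tags if word.lower() == "public"])
```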

I don’t think that these pictures can map directly onto the underlying topical landscape, but they can provide a quick window into the specific words that we have used over the years and the changes in our lexicon over time.

Another CLIP

I missed today’s CLIP. Too much work and too much rain. But the description made it sound especially interesting, because the speaker is obviously really grappling with the concept of context. It would have been interesting to hear what he did with it and how he used linguistics (he specifically mentioned the field, albeit probably not in a discourse analytic way). I will have to follow up with him or with his papers. Thankfully, he’s local!

Here’s the summary:

February 29: Vlad Eidelman, Unsupervised Textual Analysis with Rich Features

Learning how to properly partition a set of documents into categories in an unsupervised manner is quite challenging, since documents are inherently multidimensional, and a given set of documents can be correctly partitioned along a number of dimensions, depending on the criterion. Since the partition criterion for a supervised model is encoded in the data via the class labels, even the standard information retrieval representation of a document as a vector of term frequencies is sufficient for many state-of-the-art classification models. This representation is especially well suited for the most common application: topic (or thematic) analysis, where term presence is highly indicative of class. Furthermore, for tasks where term presence may not be adequate, such as sentiment or perspective analysis, discriminative models have the ability to incorporate complex features, allowing them to generalize and adapt to the specific domain. In the case where we do not have access to resources for supervised training, we must turn to unsupervised clustering models. Clustering models rely almost exclusively on a simple bag-of-words vector representation, which performs well for topic analysis, but unfortunately, is not guaranteed to perform well for a different task.

In this talk, I will present a feature-enhanced unsupervised model for categorizing textual data. The presented model allows for the integration of arbitrary features of the observations within a document. While in generative models the observed context is usually a single unigram or bigram, our model can robustly expand the context to extract features from a block of text of larger size. After presenting the model derivation, I will describe the use of complex, automatically derived linguistic and statistical features across three practical tasks with different criteria: perspective, sentiment, and topic analysis. I show that by introducing domain-relevant features, we can guide the model towards the task-specific partition we want to learn. For each task, our feature-enhanced model outperforms strong baselines and state-of-the-art models.

Bio: Vladimir Eidelman is a fourth-year Ph.D. student in the Department of Computer Science at the University of Maryland, working primarily with Philip Resnik. He received his B.S. in Computer Science and Philosophy from Columbia University in 2008 and an M.S. in Computer Science from UMD in 2010. His research interests are in machine learning and natural language processing problems, such as machine translation, structured prediction, and unsupervised learning. He is the recipient of the National Science Foundation Graduate Research Fellowship and the National Defense Science and Engineering Graduate Fellowship.

Funny Focus Group moment on Oscars

It’s not often that aspects of survey research make it into the public sphere. Last night’s Oscars included some “recovered focus group footage” from the Wizard of Oz. It’s hilarious, and there’s a good reason why: humor often happens when occurrences don’t match expectations. We tend to expect every member of a focus group to be reasonable and representative, but in reality that just isn’t true.

 

Anyway, enjoy!

Observations on another CLIP event: ESL and MT

Today I attended another CLIP colloquium at the University of Maryland:

Feb 22: Rebecca Hwa, The Role of Machine Translation in Modeling English as a Second Language (ESL) Writings

She addressed these research questions:

1. How patterned are the errors of English language learners?

1a. Could ‘English with mistakes’ be used as an input for machine translation?

1b. Could that be used to improve MT outputs?

1c. Could these findings be used for EFL training?

 

Her presentation made me think a lot about the role of linguistics in this type of work and about the nature of English.

First, I am coming to firmly believe that the best text processing is done in partnership between linguists and computer scientists. Linguistics provides the most thorough and reliable frame for computer scientists to key off of, and once you stray from the nature of what you’re trying to represent, you end up lost.

So, for example, in the first part of her presentation she talked about a project involving machine translation and English language learners of all backgrounds. One woman in the audience kept asking questions about this conglomeration of non-native English speakers into a single group, and I assumed she was from the English department. The issue of mistakes in language use is a huge one, and a focus has to be chosen from which to do the work. Maybe language background would be a more productive way to narrow that focus: it would allow for much more specific structural guidance and would tap existing bodies of knowledge on language interference.

Second, she spoke about Chinese English language learners in particular and her investigation of lexical choice. Often English language learners’ written English is marked by lexical choices that appear strange to native English speakers. Her hypothesis was that the words used in place of the correct words were similar to them in some way, most likely by context. She played a lot with the definition of context: was it proximity? Was it a specific grammatical relationship? This discussion was fascinating, but it probably could have benefited from some restrictions on the contexts of the errors she was targeting. Again, this is from the linguistics end of the linguistics/computer science spectrum.

Her talk made me think a lot about the nature of English. I often think about what it means to be a global language. English is spoken in many places where there are no native speakers, and in many places that we don’t traditionally think of as native English territory. Often the English that arises from these contexts is judged to be full of errors, but I don’t necessarily agree with this judgment. Instead, I would ask two questions:

1. Is the variation patterned?

2. Is communication successful?

If the answer to both questions is yes, then I don’t think the speaker is producing errors so much as a different variety of English. Varieties of English are not all treated with the same respect, but I suspect that this has more to do with the prejudices of the person judging the grammar than with any deficiency on the part of the speaker.

Patterning in Language, revisited

Language can be pretty mindblowing.

In my paper on the potential of Natural Language Processing (NLP) for social science research, I called NLP a kind of oil rig for the vast reserves of data that we are increasingly desperate to tap.

Sometimes the rigging runs smoothly. This week I read a chapter about compliments in Linguistics at Work. In the chapter, Nessa Wolfson describes her investigations into the patterning of compliments in English. Although some of her commentary in the chapter seems far off base to me (I’ll address that in another post), her quantitative findings are strong. She discovered that 54% of the compliments in her corpus fell into a single syntactic pattern, 85% fell into three syntactic patterns, and 97% fell into a total of nine syntactic patterns. She also found that 86% of the compliments with a semantically positive verb used just two common verbs, ‘like’ and ‘love.’ And she discovered strong patterning in the adjectival compliments as well.

 

Linguistic patterns such as these are generally not something that native speakers of a language are aware of, yet they offer great potential to English language learners and NLP programmers. It is precisely patterns like these that NLP programmers use to mine information from large bodies of textual data. When language is patterned this strongly, it is significantly easier to mine, which makes a strong case for the effectiveness of NLP as a rig and syntax as the bones of that rig.
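
As an illustration of how a programmer might exploit this, here is a rough sketch. The regular expressions are my own loose approximations of the three most common frames Wolfson reports, not her actual formulation, and a real system would use a parser rather than regexes; the point is simply how far a handful of templates can go when 85% of compliments fit three patterns:

```python
import re

# Loose approximations of the three most common compliment frames
# (roughly: "NP is/looks (really) ADJ", "I (really) like/love NP", and
# "PRO is (really) (a) ADJ NP").
PATTERNS = [
    re.compile(r"\b\w+ (is|are|was|looks?|looked) (really )?\w+[.!]?$", re.I),
    re.compile(r"\bI (really )?(like|love)\b", re.I),
    re.compile(r"\b(that|this|it)('s| is) (really )?(a |an )?\w+ \w+", re.I),
]

compliments = [
    "Your hair looks great.",
    "I really love your jacket.",
    "That's a beautiful photograph.",
]

# Check each sentence against the small set of templates.
for sentence in compliments:
    matched = any(p.search(sentence) for p in PATTERNS)
    print(sentence, "->", "matches a compliment frame" if matched else "no match")
```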

 

But as strongly as language patterns in some areas, it is also profoundly conflicted in others.

 

This week I attended a CLIP colloquium at the University of Maryland. The speaker was Jan Wiebe, and the title of her talk was ‘Subjectivity and Sentiment Analysis: From Words to Discourse.’ In an information-packed, hour-long talk, Wiebe covered her long history with sentiment analysis and discussed her current research (I took 11 pages of notes! Totally mindblowing). Wiebe approached one of the essential struggles of linguistics, the spectrum between language out of context and language in context (from words to discourse), from a computer science perspective. She spoke about the programming tools and transformations she has developed in order to take data out of context in an automated way and build its meaning back in a patterned way. For each stage or transformation, she spoke of the complications and potential errors she had encountered.

 

She spoke of her team’s efforts to tag word senses in WordNet by their subjective or objective orientation and their positive and negative meanings. Her team has created a downloadable subjectivity lexicon, and they hope to make a subjectivity phrase classifier available this spring. For the sense labeling, they decided to use coarser groupings than WordNet’s in order to improve accuracy, so instead of associating words with their full sense inventories, they associate them only with usage domains: s/o (subjective/objective) and p/n/n (positive/negative/neutral). This increases the accuracy of the tags, but it doesn’t account for context effects such as polarity shifting, e.g. from wonderfully (+) and horrid (-) to wonderfully horrid (+). The subjectivity phrase classifier will be a next step in the transition from prior polarity (out-of-context, word-level orientation, as in the subjectivity lexicon) to contextual polarity (the ultimate polarity of the sentence, taking into account phrase dependencies and longer-distance negation such as ‘not only good, but amazing’).
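
Here is a toy sketch of that prior-versus-contextual distinction. The mini-lexicon and the single shifting rule are invented for illustration; the real subjectivity lexicon is far larger, and Wiebe’s classifiers learn these effects rather than hard-coding them:

```python
# Toy prior-polarity lexicon (the real downloadable lexicon is far larger).
PRIOR = {"wonderfully": "+", "horrid": "-", "good": "+", "amazing": "+"}
INTENSIFIERS = {"wonderfully"}  # invented list of evaluative intensifiers

def prior_polarities(tokens):
    """Out-of-context, word-level orientation: one label per word."""
    return [(t, PRIOR.get(t, "neutral")) for t in tokens]

def contextual_polarity(tokens):
    """One hand-coded shifting rule: an evaluative intensifier modifying a
    negative word flips the phrase positive ('wonderfully horrid' -> +)."""
    for first, second in zip(tokens, tokens[1:]):
        if first in INTENSIFIERS and PRIOR.get(second) == "-":
            return "+"
    # Fall back to the most common prior polarity in the phrase.
    labels = [PRIOR[t] for t in tokens if t in PRIOR]
    return max(labels, key=labels.count) if labels else "neutral"

print(prior_polarities(["wonderfully", "horrid"]))    # [('wonderfully', '+'), ('horrid', '-')]
print(contextual_polarity(["wonderfully", "horrid"])) # '+'
```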

 

She also spoke of her team’s research into debate sites. They annotate individual postings by their target relationships (same/alternative/part/anaphora, etc.), by p/n/n, and as reinforcing vs. non-reinforcing. So, for example, in a debate between BlackBerrys and iPhones, where the sides are predetermined by the setup of the site, she can connect relationships to stances: ‘fast keyboard’ expresses a positive stance toward the BlackBerry, ‘slower keyboard’ reflects a negative orientation toward the iPhone, and a pro-iPhone post that mentions the ‘fast keyboard’ is making a concession rather than an argument in favor of the BlackBerry.
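
For concreteness, the annotation scheme she described might be represented something like this (the field names are my own guesses at a schema, not her team’s actual format):

```python
from dataclasses import dataclass

@dataclass
class PostingAnnotation:
    """One annotated evaluation inside a debate posting (hypothetical schema)."""
    text: str          # the evaluative phrase, e.g. "fast keyboard"
    target: str        # what is being evaluated
    relation: str      # target relationship: same / alternative / part / anaphora ...
    polarity: str      # "p", "n", or "neutral"
    reinforcing: bool  # does the evaluation support the poster's own side?

# "fast keyboard" in a pro-BlackBerry post argues for the poster's stance...
pro_blackberry = PostingAnnotation("fast keyboard", "blackberry", "part", "p", True)

# ...while the same praise inside a pro-iPhone post is a concession.
concession = PostingAnnotation("fast keyboard", "blackberry", "part", "p", False)

print(pro_blackberry)
print(concession)
```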

 

In sum, she discussed the transformation between words out of context and words in context, a transformation which is far from complete. She discussed the subjectivity or objectivity of individual words, but then showed how these could be transformed through context. She showed the way phrases with the same syntactic structure can have completely different meanings. She spoke of the difficulty of isolating targets, or the subjects of the speech. She spoke of the interdependent structures in discourse, and the way each compounding phrase in a sentence can change its overall directionality. She spoke of her efforts to account for these more complex structures with a phrase-level classifier, and of her research into more indirect references in language. Each of these steps is a separate area of research, and each compounds error on the path between words and discourse.

 

Patterning such as Wolfson found shows the great potential of NLP, but research such as Wiebe’s shows the complicated nature of putting these patterns to use. In fact, this was exactly my experience working with NLP. NLP is a constant struggle between linguistic patterning and the complicated nature of discourse. It is an important and growing field, but the problems it poses will not be quickly resolved. The rigs are being built, but the quality of the oil is still dubious.

“The combination of designed data [from surveys] with organic data [from the Internet and other automatic sources] is the ticket to the future.” -Robert Groves

I first became familiar with the work of Tom Smith when I was working on my AAPOR paper on multilingual, multinational and multicultural surveys. He is well spoken and an excellent writer. Here is an excellent commentary of his about the future of survey methodology; it really speaks to some of the motivation behind my enrollment in the MLC program. It is his final ‘Letter from the President’ from his tenure as WAPOR (World Association for Public Opinion Research) president.

“Dear WAPOR Members,

Let me raise two inter-related questions:

Is public opinion research about to undergo a paradigm shift?

Should it or shouldn’t it?”

 

The full text can be found here: http://wapor.unl.edu/wp-content/uploads/2012/02/4q2011.pdf

 

 

Research and Little League

I recently had a revelation about research methodology.

In my Intercultural Communication class, a presenter showed a picture of a moment in a baseball game. The conversation that followed was about baseball and about Little League. It missed the point.

Look around you. You are flooded with visual data. Open your ears. You are flooded with auditory data. Open your senses. What are you touching? Do you smell anything? The world is full of sensory data, so much data, in fact, that we could never take it all in.

This is where attention comes in. Focus. Foreground. We quickly pick out the sounds to focus on and the points in the visual field that are most meaningful at any given moment. In this way, we are efficient and capable. But we are not researchers.

To conduct research is to focus on a moment in time, an interaction, a photograph, etc., and to look more deeply at it. Research begins with careful observation. Research includes deconstructing an element into its constituent pieces and thinking carefully about those pieces.

What does a linguist do? A linguist takes the time to look at language and unpack it, to reconstitute its context, creation and motivation. Linguistics is asking ‘what is happening?’ ‘what tools are being used?’ and ‘what is being accomplished?’ Linguistics is taking the time to look more closely at the elements of the picture and not restricting oneself to the natural foreground.

Laypeople talk about the content of language. People talk about the boy in the picture who is jumping for joy. Researchers look at the trajectory of the eyes in the crowd to see where people are focusing their attention. They notice the fence between the audience and the players and the way people interact with it. They notice the baseball on the ground. They notice the sunshine and the clothing that the people are wearing. They can uncover the deeper story of what was happening in that moment, instead of surmising about the apparent focus.

These are the skills we are learning.