Patterning in Language, revisited

Language can be pretty mindblowing.
In my paper on the potential of Natural Language Processing (NLP) for social science research, I called NLP a kind of oil rig for the vast reserves of data that we are increasingly desperate to tap.
Sometimes the rigging runs smoothly. This week I read a chapter about compliments in Linguistics at Work. In the chapter, Nessa Wolfson describes her investigations into the patterning of compliments in English. Although some of her commentary in this chapter seems far off base to me (I’ll address this in another post), her quantitative findings are strong. She discovered that 54% of the compliments in her corpus fell into a single syntactic pattern, 85% of the compliments fell into three syntactic patterns, and 97% fell into a total of nine syntactic patterns. She also found that 86% of the compliments with a semantically positive verb used just two common verbs, ‘like’ and ‘love.’ And she discovered some strong patterning in the adjectival compliments as well.

Linguistic patterns such as these are generally not something that native speakers of a language are aware of, yet they offer great potential to English Language Learners and NLP programmers. It is precisely patterns such as these that NLP programmers use in order to mine information from large bodies of textual data. When language is patterned as strongly as this, it is considerably easier to mine, which makes a strong case for the effectiveness of NLP as a rig and syntax as the bones of the rig.
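To make the mining idea concrete, here is a minimal sketch in Python of how a handful of surface patterns could be matched against a list of compliments and their coverage counted. The pattern names, the regular expressions, the adjective list, and the sample sentences are all my own illustrative assumptions; they are not Wolfson's pattern inventory or her corpus.

```python
import re
from collections import Counter

# Hypothetical adjective list and surface patterns, invented for illustration.
# They are NOT the patterns or data reported in Wolfson's chapter.
ADJECTIVES = r"(?:nice|great|beautiful|good|pretty)"
PATTERNS = {
    "NP is/looks (really) ADJ": re.compile(
        rf"^(?:your|that|this|the)\s+[\w\s]+?\s+(?:is|looks)\s+(?:really\s+)?{ADJECTIVES}\W*$",
        re.IGNORECASE,
    ),
    "I (really) like/love NP": re.compile(
        r"^i\s+(?:really\s+)?(?:like|love)\s+.+$",
        re.IGNORECASE,
    ),
}


def classify(sentence):
    """Return the name of the first pattern that matches, or 'other'."""
    text = sentence.strip()
    for name, pattern in PATTERNS.items():
        if pattern.search(text):
            return name
    return "other"


def pattern_shares(sentences):
    """Return the fraction of sentences that falls under each pattern."""
    counts = Counter(classify(s) for s in sentences)
    total = len(sentences)
    return {name: count / total for name, count in counts.items()}


# Made-up sample sentences, used only to exercise the functions.
SAMPLE = [
    "Your hair looks really nice.",
    "I love your new jacket!",
    "I really like that scarf.",
    "What a lovely garden you have.",
]

if __name__ == "__main__":
    for name, share in pattern_shares(SAMPLE).items():
        print(f"{name}: {share:.0%}")
```

Regular expressions over raw text are the crudest possible rig. A real system would more likely match patterns over part-of-speech tags or parse trees, but the counting logic would look much the same.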

But as strongly as language patterns in some areas, it is also profoundly conflicted in others.

This week I attended a CLIP Colloquium at the University of Maryland. The speaker was Jan Wiebe, and the title of her talk was ‘Subjectivity and Sentiment Analysis. From Words to Discourse.’ In an information-packed, hour-long talk, Wiebe essentially covered her long history with sentiment analysis and discussed her current research (I took 11 pages of notes! Totally mindblowing). Wiebe approached one of the essential struggles of linguistics, the spectrum between language out of context and language in context (from words to discourse), from a computer science perspective. She spoke about the programming tools and transformations that she had developed and worked with in order to take data out of context in an automated way and build their meaning back in a patterned way. For each stage or transformation, she spoke of the complications and potential errors she had encountered.

She spoke of her team’s efforts to tag word senses in WordNet by their subjective or objective orientation and their positive or negative meanings. Her team has created a downloadable subjectivity lexicon, and they hope to make a subjectivity phrase classifier available this spring. For the sense labeling, they decided to use coarser groupings than WordNet in order to improve accuracy, so instead of associating words with their senses, they associate them only along usage domains, or s/o (subjective/objective) and p/n/n (positive/negative/neutral). This increases the accuracy of the tags, but doesn’t account for context effects such as polarity shifting, e.g. from wonderfully (+) horrid (-) to wonderfully horrid (+), or longer-distance negation such as “not only good, but amazing.” The subjectivity phrase classifier will be a next step in the transition between prior polarity (out of context, word-level orientation, as in the subjectivity lexicon) and contextual polarity (the ultimate polarity of the sentence, taking into account phrase dependency, etc.).
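The gap between prior and contextual polarity is easy to see in a toy example. The sketch below is my own illustration and not Wiebe's lexicon or classifier: the four lexicon entries, the negator list, and the two-token negation window are all assumptions I made up for the demonstration. It scores text first by summing word-level prior polarities and then with a naive negation rule, and both approaches stumble on the examples from the talk.

```python
import re

# Toy prior-polarity lexicon (+1 positive, -1 negative). The entries are
# invented for illustration and are not taken from any published lexicon.
PRIOR_POLARITY = {"wonderfully": 1, "horrid": -1, "good": 1, "amazing": 1}
NEGATORS = {"not", "never", "no"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def prior_score(text):
    """Sum word-level prior polarities, ignoring context entirely."""
    return sum(PRIOR_POLARITY.get(tok, 0) for tok in tokenize(text))


def naive_negation_score(text, window=2):
    """Flip a word's polarity if a negator occurs within `window` tokens before it."""
    tokens = tokenize(text)
    score = 0
    for i, tok in enumerate(tokens):
        polarity = PRIOR_POLARITY.get(tok, 0)
        negated = any(t in NEGATORS for t in tokens[max(0, i - window):i])
        score += -polarity if negated else polarity
    return score


if __name__ == "__main__":
    for phrase in ("wonderfully horrid", "not good", "not only good, but amazing"):
        print(phrase, prior_score(phrase), naive_negation_score(phrase))
```

With these assumptions, "wonderfully horrid" scores 0 because the two prior polarities cancel, even though the phrase was presented as positive in context. The naive negation rule handles "not good" but also scores "not only good, but amazing" as 0, because it flips "good" when it should not. A phrase-level classifier exists to close exactly this kind of gap.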

She also spoke of her team’s research into debate sites. They annotate individual postings by their target relationships (same/alternative/part/anaphora, etc.), p/n/n, and reinforcing vs. non-reinforcing. So, for example, in a debate between BlackBerries and iPhones, where the sides are predetermined by the setup of the site, she can connect relationships to stances, e.g. “fast keyboard” is a positive stance toward BlackBerry, “slower keyboard” reflects a negative orientation toward an iPhone, and a pro-iPhone post that mentions the “fast keyboard” is building a concession, rather than an argument in favor of BlackBerry.
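The logic of that example can be sketched in a few lines of Python. This is only my reading of the two-sided case, not the annotation scheme or code used by Wiebe's team: the `Opinion` class, the function names, and the rule that a negative opinion about one side counts as support for the other are all assumptions for illustration.

```python
from dataclasses import dataclass

# The two predetermined sides of the debate, as in the example from the talk.
SIDES = ("blackberry", "iphone")


@dataclass
class Opinion:
    target: str    # which side the opinion is about
    polarity: int  # +1 for positive, -1 for negative


def supported_side(opinion):
    """Positive opinions support their target; negative ones support the alternative."""
    other = SIDES[1] if opinion.target == SIDES[0] else SIDES[0]
    return opinion.target if opinion.polarity > 0 else other


def label(opinion, post_side):
    """Label an opinion as an 'argument' for the post's side, or a 'concession'."""
    return "argument" if supported_side(opinion) == post_side else "concession"


if __name__ == "__main__":
    fast_keyboard = Opinion(target="blackberry", polarity=1)
    slower_keyboard = Opinion(target="iphone", polarity=-1)
    print(label(fast_keyboard, post_side="blackberry"))
    print(label(slower_keyboard, post_side="blackberry"))
    print(label(fast_keyboard, post_side="iphone"))
```

The hard part, of course, is everything this sketch takes for granted: finding the target of each opinion and deciding its polarity in context.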

In sum, she discussed the transformations between words out of context and words in context, a transformation which is far from complete. She discussed the subjectivity/objectivity of individual words, but then showed how these could be transformed through context. She showed the way phrases with the same syntactic structure could have completely different meanings. She spoke of the difficulty of isolating targets or the subject of the speech. She spoke of the interdependent structures in discourse, and the way that each compounding phrase in a sentence can change the overall directionality. She spoke of her efforts to account for these more complex structures with a phrase level classifier, and she spoke of her research into more indirect references in language. Each of these steps is a separate area of research, each compounding error on the path between words and discourse.

Patterning such as Wolfson found shows the great potential of NLP, but research such as Wiebe’s shows the complicated nature of putting these patterns into use. In fact, this was exactly my experience working with NLP. NLP is a constant struggle between linguistic patterning and the complicated nature of discourse. It is an important field and a growing field, but the problems it poses will not be quickly resolved. The rigs are being built, but the quality of the oil is still dubious.

“The combination of designed data [from surveys] with organic data [from the Internet and other automatic sources] is the ticket to the future.” -Robert Groves

I first became familiar with the work of Tom Smith when I was working on my AAPOR paper on multilingual, multinational and multicultural surveys. He is well spoken and an excellent writer. Here is a commentary of his about the future of survey methodology. It really speaks to some of the motivation behind my enrollment in the MLC program. It is the final ‘Letter from the President’ of his tenure as WAPOR (World Association for Public Opinion Research) president.

“Dear WAPOR Members,

Let me raise two inter-related questions:

Is public opinion research about to undergo a paradigm shift?

Should it or shouldn’t it?”

The full text can be found here: http://wapor.unl.edu/wp-content/uploads/2012/02/4q2011.pdf

Research and Little League

I recently had a revelation about research methodology.

In my Intercultural Communication class, a presenter showed a picture of a moment in a baseball game. The conversation that followed was about baseball and about Little League. It missed the point.

Look around you. You are flooded with visual data. Open your ears. You are flooded with auditory data. Open your senses. What are you touching? Do you smell anything? The world is full of sensory data, so much data, in fact, that we could never take it all in.

This is where attention comes in. Focus. Foreground. We quickly pick out the sounds to focus on and the points in the visual field that are most meaningful at any given moment. In this way, we are efficient and capable. But we are not researchers.

To conduct research is to focus on a moment in time, an interaction, a photograph, etc. and look more deeply at it. Research begins with careful observation. Research includes deconstructing an element into its constituent pieces and thinking carefully about those pieces.

What does a linguist do? A linguist takes the time to look at language and unpack it to reconstitute its context, creation and motivation. Linguistics is asking ‘what is happening?’ ‘what tools are being used?’ and ‘what is being accomplished?’ Linguistics is taking the time to look more closely at the elements of the picture and not restrict oneself to the natural foreground.

Laypeople talk about the content of language. People talk about the boy in the picture who is jumping for joy. Researchers look at the trajectory of the eyes in the crowd to see where people are focusing their attention. They notice the fence between the audience and the players and the way people interact with it. They notice the baseball on the ground. They notice the sunshine and the clothing that the people are wearing. They can uncover the deeper story of what was happening in that moment, instead of surmising about the apparent focus.

These are the skills we are learning.