Repeating language: what do we repeat, and what does it signal?

Yesterday I attended a talk by Jon Kleinberg entitled “Status, Power & Incentives in Social Media,” given in honor of the UMD Human-Computer Interaction Lab’s 30th anniversary.

This talk was dense and full of methods that are unfamiliar to me. He first discussed logical representations of human relationships, including orientations of sentiment and status, and then he ventured into discursive evidence of these relationships. Finally, he introduced formulas for influence in social media and talked about ways to manipulate the formulas by incentivizing desired behavior and disincentivizing less desired behavior.

In linguistics, we talk a lot about linguistic accommodation. In any communicative event, it is normal for participants’ speech patterns to converge in some ways. This can happen through repetition of words or grammatical structures. Kleinberg presented research about the social meaning of linguistic accommodation, showing that participants with less power tend to accommodate participants with more power more than the reverse. This idea of quantifying social influence is a very powerful notion in online research, where social influence is a more practical and useful research goal than general representativeness.

I wonder what strategies we use, consciously and unconsciously, when we accommodate other speakers. I wonder whether different forms of repetition have different underlying social meanings.

At the end of the talk, there was some discussion about both the constitution of iconic speech (unmarked words assembled in marked ways) and the meaning of norm flouting.

These are very promising avenues for online text research, and it is exciting to see them play out.

Getting to know your data

On Friday, I had the honor of participating in a microanalysis video discussion group with Fred Erickson. As he was introducing the process to the new attendees, he said something that really caught my attention. He said that videos and field notes are not data until someone decides to use them for research.

As someone with a background in survey research, the question of ‘what is data?’ was never really on my radar before graduate school. Although it’s always been good practice to know where your data comes from and what it represents in order to glean any kind of validity from your work, data was unquestionably that which you see in a spreadsheet or delimited file, with cases going down and variables going across. If information could be formed like this, it was data. If not, it would need some manipulation. I remember discussing this with Anna Trester a couple of years ago. She found it hard to understand this limited framework, because, for her, the world was a potential data source. I’ve learned more about her perspective in the last couple of years, working with elements that I never before would have characterized as data, including pictures, websites, video footage of interactions, and now fieldwork as a participant observer.

Dr. Erickson’s observation speaks to some frustration I’ve had lately, trying to understand the nature of “big data” sets. I’ve seen quite a few people looking for data, any data, to analyze. I can see the usefulness of this for corpus linguists, who use large bodies of text to see how we use words, a systematically patterned phenomenon that goes much deeper than a dictionary definition could. I can also see the usefulness of large datasets in training programs to recognize genre, a really critical element in automated text analysis.

But beyond that, it is deeply important to understand the situated nature of language. People don’t produce text for the sake of producing text. Each textual element represents an intentional social action on the part of the writer, and social goals are accomplished differently in different settings. For studies of textual data to produce valid conclusions with social commentary, contextual elements are extremely important.

This leads me to ask: are these agnostic datasets being used solely as academic exercises by programmers and corpus linguists, or has our hunger for data led us to take any large body of information and declare it to be useful data from which to extract valid conclusions? Worse, are people using cookie-cutter programs to investigate agnostic datasets like this without considering the wider validity?

I urge anyone looking to create insight from textual data to carefully get to know their data.