The surprising unpredictability of language in use

This morning I recieved an e-mail from an international professional association that I belong to. The e-mail was in English, but it was not written by an American. As a linguist, I recognized the differences in formality and word use as signs that the person who wrote the e-mail is speaking from a set of experiences with English that differ from my own. Nothing in the e-mail was grammatically incorrect (although as a linguist I am hesitant to judge any linguistic differences as correct or incorrect, especially out of context).

Then later this afternoon I saw a tweet from Twitter on the correct use of Twitter abbreviations (RT, MT, etc.). If the growth of new Twitter users has indeed leveled off then Twitter is lucky, because the more Twitter grows the less they will be able to influence the language use of their base.

Language is a living entity that grows, evolves and takes shape based on individual experiences and individual perceptions of language use. If you think carefully about your experiences with language learning, you will quickly see that single exposures and dictionary definitions teach you little, but repeated viewings across contexts teach you much more about language.

Language use is patterned. Every word combination has a likelihood of appearing together, and that likelihood varies based on a host of contextual factors. Language use is complex. We use words in a variety of ways across a variety of contexts. These facts make language interesting, but they also obscure language use from casual understanding. The complicated nature of language in use interferes with analysts who build assumptions about language into their research strategies without realizing that their assumptions would not stand up to careful observation or study.

I would advise anyone involved in the study of language use (either as a primary or secondary aspect of their analysis) to take language use seriously. Fortunately, linguistics is fun and language is everywhere. So hop to it!

Reporting on the AAPOR 69th national conference in Anaheim #aapor

Last week AAPOR held it’s 69th annual conference in sunny (and hot) Anaheim California.

Palm Trees in the conference center area

My biggest takeaway from this year’s conference is that AAPOR is a very healthy organization. AAPOR attendees were genuinely happy to be at the conference, enthusiastic about AAPOR and excited about the conference material. Many participants consider AAPOR their intellectual and professional home base and really relished the opportunity to be around kindred spirits (often socially awkward professionals who are genuinely excited about our niche). All of the presentations I saw firsthand or heard about were solid and dense, and the presenters were excited about their work and their findings. Membership, conference attendance, journal and conference submissions and volunteer participation are all quite strong.


At this point in time, the field of survey research is encountering a set of challenges. Nonresponse is a growing challenge, and other forms of data and analysis are increasingly en vogue. I was really excited to see that AAPOR members are greeting these challenges and others head on. For this particular write-up, I will focus on these two challenges. I hope that others will address some of the other main conference themes and add their notes and resources to those I’ve gathered below.


As survey nonresponse becomes more of a challenge, survey researchers are moving from traditional measures of response quality (e.g. response rates) to newer measures (e.g. nonresponse bias). Researchers are increasingly anchoring their discussions about survey quality within the Total Survey Error framework, which offers a contextual basis for understanding the problem more deeply. Instead of focusing on an across the board rise in response rates, researchers are strategizing their resources with the goal of reducing response bias. This includes understanding response propensity (who is likely not to respond to the survey? Who is most likely to drop out of a panel study? What are some of the barriers to survey participation?), looking for substantive measures that correlate with response propensity (e.g. Are small, rural private schools less likely to respond to a school survey? Are substance users less likely to respond to a survey about substance abuse?), and continuous monitoring of paradata during the collection period (e.g. developing differential strategies by disposition code, focusing the most successful interviewers on the most reluctant cases, or concentrating collection strategies where they are expected to be most effective). This area of strategizing emerged in AAPOR circles a few years ago with discussions of nonresponse propensity modeling, a process which is surely much more accessible than it sounds, but it has really evolved into a practical and useful tool that can help any size research shop increase survey quality and lower costs.


Another big takeaway for me was the volume of discussions and presentations that spoke to the fast-emerging world of data science and big data. Many people spoke of the importance of our voice in the realm of data science, particularly with our professional focus on understanding and mitigating errors in the research process. A few practitioners applied error frameworks to analyses of organic data, and some talks were based on analyses of organic data. This year AAPOR also sponsored a research hack to investigate the potential for Instagram as a research tool for Feed the Hungry. These discussions, presentations and activities made it clear that AAPOR will continue to have a strong voice in the changing research environment, and the task force reports and initiatives from both the membership and education committees reinforced AAPOR’s ability to be right on top of the many changes afoot. I’m eager to see AAPOR’s changing role take shape.

“If you had asked social scientists even 20 years ago what powers they dreamed of acquiring, they might have cited the capacity to track the behaviors, purchases, movements, interactions, and thoughts of whole cities of people, in real time.” – N.A.  Christakis. 24 June 2011. New York Times, via Craig Hill (RTI)


AAPOR a very strong, well-loved organization and it is building a very strong future from a very solid foundation.



2014-05-16 15.38.17



This conference is huge, so I could not possibly cover all of it on my own, so I will try to share my notes as well as the notes and resources I can collect from other attendees. If you have any materials to share, please send them to me! The more information I am able to collect here, the better a resource it will be for people interested in the AAPOR or the conference-


Patrick Ruffini assembled the tweets from the conference into this storify


Annie, the blogger behind LoveStats, had quite a few posts from the conference. I sat on a panel with Annie on the role of blogs in public opinion research (organized by Joe Murphy for the 68th annual AAPOR conference), and Annie blew me away by live-blogging the event from the stage! Clearly, she is the fastest blogger in the West and the East! Her posts from Anaheim included:

Your Significance Test Proves Nothing

Do panel companies manage their panels?

Gender bias among AAPOR presenters

What I hate about you AAPOR

How to correct scale distribution errors

What I like about you AAPOR

I poo poo on your significance tests

When is survey burden the fault of the responders?

How many survey contacts is enough?


My full notes are available here (please excuse any formatting irregularities). Unfortunately, they are not as extensive as I would have liked, because wifi and power were in short supply. I also wish I had settled into a better seat and covered some of the talks in greater detail, including Don Dillman’s talk, which was a real highlights of the conference!

I believe Rob Santos’ professional address will be available for viewing or listening soon, if it is not already available. He is a very eloquent speaker, and he made some really great points, so this will be well worth your time.


Let’s talk about data cleaning

Data cleaning has a bad rep. In fact, it has long been considered the grunt work of the data analysis enterprise. I recently came across a piece of writing in the Harvard Business Review that lamented the amount of time data scientists spend cleaning their data. The author feared that data scientists’ skills were being wasted on the cleaning process when they could be using their time for the analyses we so desperately need them to do.

I’ll admit that I haven’t always loved the process of cleaning data. But my view of the process has evolved significantly over the last few years.

As a survey researcher, my cleaning process used to begin with a tall stack of paper forms. Answers that did not make logical sense during the checking process sparked a trip to the file folders to find the form in question. The forms often held physical evidence of a indecision on the part of the respondent, such as eraser marks or an explanation in the margin, which could not have been reflected properly by the data entry person. We lost this part of the process when we moved to web surveys. It sometimes felt like a web survey left the respondent no way to communicate with the researcher about their unique situations. Data cleaning lost its personalized feel and detective story luster and became routine and tedious.

Despite some of the affordances of the movement to web surveys, much of the cleaning process stayed routed in the old techniques. Each form has its own id number, and the programmers would use those id numbers for corrections

if id=1234567, set var1=5, set var7=62

At this point a “good programmer” would also document the changes for future collaborators

*this person was not actually a forest ranger, and they were born in 1962
if id=1234567, set var1=5, set var7=62

Making these changes grew tedious very quickly, and the process seemed to drag on for ages. The researcher would check the data for a potential errors, scour the records that could hold those errors for any kind of evidence of the respondent’s intentions, and then handle each form one at a time.

My techniques for cleaning data have changed dramatically since those days. My goal is to use id numbers as rarely as possible, but instead to ask myself questions like “how can I tell that these people are not forest rangers?” The answer to these questions evokes a subtley different technique:

* these people are not actually forest rangers
if var7=35 and var1=2 and var10 contains ‘fire fighter’, set var1=5)

This technique requires honing and testing (adjusting the precision and recall), but I’ve found it to be far more efficient, faster, more comprehensive and, most of all- more fun (oh hallelujah!). It makes me wonder whether we have perpetually undercut the quality of the data cleaning we do simply because we hold the process in such low esteem.

So far I have not discussed data cleaning for other types of data. I’m currently working on a corpus of Twitter data, and I don’t see much of a difference in the cleaning process. The data types and programming statements I use are different, but the process is very close. It’s an interesting and challenging process that involves detective work, a better and growing understanding of the intricacies of the dataset, a growing set of programming skills, and a growing understanding of the natural language use in your dataset. The process mirrors the analysis to such a degree that I’m not really sure why it would be such a bad thing for analysts to be involved in data cleaning.

I’d be interested to hear what my readers have to say about this. Is our notion of the value and challenge of data cleaning antiquated? Is data cleaning a burden that an analyst should bear? And why is there so little talk about data cleaning, when we could all stand to learn so much from each other in the way of data structuring code and more?