The surprising unpredictability of language in use

This morning I received an e-mail from an international professional association that I belong to. The e-mail was in English, but it was not written by an American. As a linguist, I recognized the differences in formality and word use as signs that the person who wrote the e-mail is speaking from a set of experiences with English that differ from my own. Nothing in the e-mail was grammatically incorrect (although as a linguist I am hesitant to judge any linguistic differences as correct or incorrect, especially out of context).

Then later this afternoon I saw a tweet from Twitter on the correct use of Twitter abbreviations (RT, MT, etc.). If the growth of new Twitter users has indeed leveled off then Twitter is lucky, because the more Twitter grows the less they will be able to influence the language use of their base.

Language is a living entity that grows, evolves and takes shape based on individual experiences and individual perceptions of language use. If you think carefully about your experiences with language learning, you will quickly see that single exposures and dictionary definitions teach you little, but repeated viewings across contexts teach you much more about language.

Language use is patterned. Every word combination has a likelihood of appearing together, and that likelihood varies based on a host of contextual factors. Language use is complex. We use words in a variety of ways across a variety of contexts. These facts make language interesting, but they also obscure language use from casual understanding. The complicated nature of language in use interferes with analysts who build assumptions about language into their research strategies without realizing that their assumptions would not stand up to careful observation or study.
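To make the idea of patterned word combinations a little more concrete, here is a minimal sketch in Python. It estimates how likely one word is to follow another in two tiny, made-up "contexts". The corpora, the function name and the example words are all invented for illustration; a real study would need far more data and a more careful treatment of context.

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Estimate P(second word | first word) from tokenized sentences."""
    pair_counts = Counter()
    first_counts = Counter()
    for tokens in sentences:
        # Walk over each adjacent pair of words in the sentence.
        for first, second in zip(tokens, tokens[1:]):
            pair_counts[(first, second)] += 1
            first_counts[first] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# Two invented contexts: formal e-mail sign-offs and casual tweets.
email_corpus = [["kind", "regards"], ["kind", "regards"], ["kind", "person"]]
tweet_corpus = [["kind", "of"], ["kind", "of"], ["kind", "regards"]]

email_probs = bigram_probabilities(email_corpus)
tweet_probs = bigram_probabilities(tweet_corpus)

# The same word pair has a different likelihood in each context.
print(email_probs[("kind", "regards")], tweet_probs[("kind", "regards")])
```

Even in this toy case, "regards" follows "kind" twice as often in the e-mail sample as in the tweet sample, which is the kind of contextual variation an analyst's assumptions can easily miss.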

I would advise anyone involved in the study of language use (either as a primary or secondary aspect of their analysis) to take language use seriously. Fortunately, linguistics is fun and language is everywhere. So hop to it!


Reporting on the AAPOR 69th national conference in Anaheim #aapor

Last week AAPOR held its 69th annual conference in sunny (and hot) Anaheim, California.

Palm Trees in the conference center area

My biggest takeaway from this year’s conference is that AAPOR is a very healthy organization. AAPOR attendees were genuinely happy to be at the conference, enthusiastic about AAPOR and excited about the conference material. Many participants consider AAPOR their intellectual and professional home base and really relished the opportunity to be around kindred spirits (often socially awkward professionals who are genuinely excited about our niche). All of the presentations I saw firsthand or heard about were solid and dense, and the presenters were excited about their work and their findings. Membership, conference attendance, journal and conference submissions and volunteer participation are all quite strong.

 

At this point in time, the field of survey research is encountering a set of challenges. Nonresponse is a growing challenge, and other forms of data and analysis are increasingly en vogue. I was really excited to see that AAPOR members are greeting these challenges and others head on. For this particular write-up, I will focus on these two challenges. I hope that others will address some of the other main conference themes and add their notes and resources to those I’ve gathered below.

 

As survey nonresponse becomes more of a challenge, survey researchers are moving from traditional measures of response quality (e.g. response rates) to newer measures (e.g. nonresponse bias). Researchers are increasingly anchoring their discussions about survey quality within the Total Survey Error framework, which offers a contextual basis for understanding the problem more deeply. Instead of focusing on an across-the-board rise in response rates, researchers are strategizing their resources with the goal of reducing nonresponse bias. This includes understanding response propensity (Who is likely not to respond to the survey? Who is most likely to drop out of a panel study? What are some of the barriers to survey participation?), looking for substantive measures that correlate with response propensity (e.g. Are small, rural private schools less likely to respond to a school survey? Are substance users less likely to respond to a survey about substance abuse?), and continuous monitoring of paradata during the collection period (e.g. developing differential strategies by disposition code, focusing the most successful interviewers on the most reluctant cases, or concentrating collection strategies where they are expected to be most effective). This area of strategizing emerged in AAPOR circles a few years ago with discussions of nonresponse propensity modeling, a process which is surely much more accessible than it sounds, but it has really evolved into a practical and useful tool that can help a research shop of any size increase survey quality and lower costs.
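As a rough illustration of how accessible the starting point can be, here is a minimal Python sketch that computes observed response rates by subgroup. This is only a crude stand-in for a fitted propensity model, not anything presented at the conference, and the case records, the `school_type` field and the function name are all hypothetical.

```python
from collections import defaultdict


def response_rates(cases, key):
    """Observed response rate per subgroup (a crude stand-in for a propensity model)."""
    totals = defaultdict(int)
    responded = defaultdict(int)
    for case in cases:
        totals[case[key]] += 1
        responded[case[key]] += case["responded"]  # 1 = responded, 0 = did not
    return {group: responded[group] / totals[group] for group in totals}


# Hypothetical case records from a school survey.
cases = [
    {"school_type": "small rural private", "responded": 0},
    {"school_type": "small rural private", "responded": 0},
    {"school_type": "small rural private", "responded": 1},
    {"school_type": "large urban public", "responded": 1},
    {"school_type": "large urban public", "responded": 1},
    {"school_type": "large urban public", "responded": 0},
    {"school_type": "large urban public", "responded": 1},
]

rates = response_rates(cases, "school_type")

# Subgroups sorted from lowest to highest response rate, to suggest
# where extra follow-up effort might be concentrated.
lowest_first = sorted(rates, key=rates.get)
print(lowest_first)
```

A real propensity model would combine many predictors at once (for example with logistic regression), but even a table like this can show where collection resources might do the most good.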

 

Another big takeaway for me was the volume of discussions and presentations that spoke to the fast-emerging world of data science and big data. Many people spoke of the importance of our voice in the realm of data science, particularly with our professional focus on understanding and mitigating errors in the research process. A few practitioners applied error frameworks to analyses of organic data, and some talks were based on analyses of organic data. This year AAPOR also sponsored a research hack to investigate the potential for Instagram as a research tool for Feed the Hungry. These discussions, presentations and activities made it clear that AAPOR will continue to have a strong voice in the changing research environment, and the task force reports and initiatives from both the membership and education committees reinforced AAPOR’s ability to be right on top of the many changes afoot. I’m eager to see AAPOR’s changing role take shape.

“If you had asked social scientists even 20 years ago what powers they dreamed of acquiring, they might have cited the capacity to track the behaviors, purchases, movements, interactions, and thoughts of whole cities of people, in real time.” – N.A.  Christakis. 24 June 2011. New York Times, via Craig Hill (RTI)

 

AAPOR is a very strong, well-loved organization, and it is building a very strong future from a very solid foundation.

 

 

2014-05-16 15.38.17

 

MORE DETAILED NOTES:

This conference is huge, and I could not possibly cover all of it on my own, so I will share my notes as well as the notes and resources I can collect from other attendees. If you have any materials to share, please send them to me! The more information I am able to collect here, the better a resource it will be for people interested in AAPOR or the conference.

 

Patrick Ruffini assembled the tweets from the conference into this storify

 

Annie, the blogger behind LoveStats, had quite a few posts from the conference. I sat on a panel with Annie on the role of blogs in public opinion research (organized by Joe Murphy for the 68th annual AAPOR conference), and Annie blew me away by live-blogging the event from the stage! Clearly, she is the fastest blogger in the West and the East! Her posts from Anaheim included:

Your Significance Test Proves Nothing

Do panel companies manage their panels?

Gender bias among AAPOR presenters

What I hate about you AAPOR

How to correct scale distribution errors

What I like about you AAPOR

I poo poo on your significance tests

When is survey burden the fault of the responders?

How many survey contacts is enough?

 

My full notes are available here (please excuse any formatting irregularities). Unfortunately, they are not as extensive as I would have liked, because wifi and power were in short supply. I also wish I had settled into a better seat and covered some of the talks in greater detail, including Don Dillman’s talk, which was a real highlight of the conference!

I believe Rob Santos’ professional address will be available for viewing or listening soon, if it is not already available. He is a very eloquent speaker, and he made some really great points, so this will be well worth your time.

 

Let’s talk about data cleaning

Data cleaning has a bad rep. In fact, it has long been considered the grunt work of the data analysis enterprise. I recently came across a piece of writing in the Harvard Business Review that lamented the amount of time data scientists spend cleaning their data. The author feared that data scientists’ skills were being wasted on the cleaning process when they could be using their time for the analyses we so desperately need them to do.

I’ll admit that I haven’t always loved the process of cleaning data. But my view of the process has evolved significantly over the last few years.

As a survey researcher, my cleaning process used to begin with a tall stack of paper forms. Answers that did not make logical sense during the checking process sparked a trip to the file folders to find the form in question. The forms often held physical evidence of indecision on the part of the respondent, such as eraser marks or an explanation in the margin, which could not have been reflected properly by the data entry person. We lost this part of the process when we moved to web surveys. It sometimes felt like a web survey left the respondent no way to communicate with the researcher about their unique situations. Data cleaning lost its personalized feel and detective story luster and became routine and tedious.

Despite some of the affordances of the movement to web surveys, much of the cleaning process stayed rooted in the old techniques. Each form has its own id number, and the programmers would use those id numbers for corrections:

if id=1234567, set var1=5, set var7=62

At this point a “good programmer” would also document the changes for future collaborators:

*this person was not actually a forest ranger, and they were born in 1962
if id=1234567, set var1=5, set var7=62

Making these changes grew tedious very quickly, and the process seemed to drag on for ages. The researcher would check the data for potential errors, scour the records that could hold those errors for any kind of evidence of the respondent’s intentions, and then handle each form one at a time.

My techniques for cleaning data have changed dramatically since those days. My goal is to use id numbers as rarely as possible, but instead to ask myself questions like “how can I tell that these people are not forest rangers?” The answer to these questions evokes a subtly different technique:

* these people are not actually forest rangers
if var7=35 and var1=2 and var10 contains ‘fire fighter’, set var1=5

This technique requires honing and testing (adjusting the precision and recall), but I’ve found it to be far more efficient, faster, more comprehensive and, most of all, more fun (oh hallelujah!). It makes me wonder whether we have perpetually undercut the quality of the data cleaning we do simply because we hold the process in such low esteem.
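For readers who want to try this, here is a minimal Python sketch of the rule above, along with one way to check its precision and recall against a small hand-checked sample. The variable names follow the pseudocode, but the records, the meaning of the codes (35, 2 and 5) and the hand-checked ids are all invented for illustration.

```python
def is_misclassified_ranger(record):
    """Hypothetical rule mirroring the pseudocode; the codes are illustrative."""
    return (
        record["var7"] == 35
        and record["var1"] == 2
        and "fire fighter" in record["var10"].lower()
    )


def apply_rule(records):
    """Return corrected copies of the records plus the ids the rule flagged."""
    cleaned, flagged = [], set()
    for record in records:
        record = dict(record)  # copy, so the raw data stays untouched
        if is_misclassified_ranger(record):
            record["var1"] = 5
            flagged.add(record["id"])
        cleaned.append(record)
    return cleaned, flagged


def precision_recall(flagged, truly_wrong):
    """Compare the ids the rule flagged with ids known (by hand) to be wrong."""
    true_positives = len(flagged & truly_wrong)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_wrong) if truly_wrong else 0.0
    return precision, recall


# Invented records; ids 1 and 3 were judged to be errors by hand.
records = [
    {"id": 1, "var1": 2, "var7": 35, "var10": "Fire fighter, county"},
    {"id": 2, "var1": 2, "var7": 35, "var10": "park ranger"},
    {"id": 3, "var1": 2, "var7": 35, "var10": "firefighter"},  # one word: rule misses it
    {"id": 4, "var1": 5, "var7": 35, "var10": "fire fighter"},  # already coded 5
]
hand_checked_errors = {1, 3}

cleaned, flagged = apply_rule(records)
precision, recall = precision_recall(flagged, hand_checked_errors)
print(flagged, precision, recall)
```

In this toy sample the rule is precise but misses the one-word spelling "firefighter", which is exactly the kind of gap the honing and testing step is meant to surface.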

So far I have not discussed data cleaning for other types of data. I’m currently working on a corpus of Twitter data, and I don’t see much of a difference in the cleaning process. The data types and programming statements I use are different, but the process is very close. It’s an interesting and challenging process that involves detective work, a better and growing understanding of the intricacies of the dataset, a growing set of programming skills, and a growing understanding of the natural language use in your dataset. The process mirrors the analysis to such a degree that I’m not really sure why it would be such a bad thing for analysts to be involved in data cleaning.

I’d be interested to hear what my readers have to say about this. Is our notion of the value and challenge of data cleaning antiquated? Is data cleaning a burden that an analyst should bear? And why is there so little talk about data cleaning, when we could all stand to learn so much from each other in the way of data structuring code and more?

A Postcard from Japan

Hi all,

This week I returned from a 10-day trip to Japan, and I figured I would share some pictures with you.

The first pictures were taken on the plane ride over. We flew over the frozen Midwestern US and Canada and over the Bering Strait, and the view was breathtaking:

2014-04-08 15.42.39

2014-04-08 15.53.39

2014-04-08 16.02.53

2014-04-08 19.37.02

And finally we were over Japan!

2014-04-09 02.02.07

 

Our home base in Japan was a place called Nobi, which is on the Miura Peninsula, west of Yokohama and Yokosuka but not all the way to Miurakaigan:

2014-04-10 10.59.02

2014-04-10 09.56.23

2014-04-10 09.56.16

We spent some time exploring the Miura Peninsula, which included Yokosuka, home of the Japanese and American naval bases:

2014-04-13 18.30.35

2014-04-17 10.37.39

 

and Yokohama, second largest city in Japan, home of a famously large Chinatown with a few nice temples inside:

2014-04-11 16.43.30

2014-04-11 16.26.14

2014-04-11 16.07.05

2014-04-11 13.15.58

as well as many natural wonders, including Miurakaigan beach and Jogashima island:

2014-04-16 12.19.23

2014-04-16 12.36.07

2014-04-16 12.48.16

2014-04-16 14.22.06

2014-04-16 14.22.16

2014-04-16 14.34.20

Kamakura is also on the Miura Peninsula. Kamakura has many beautiful shrines, great shopping and food, and the third largest Buddha in Japan, which was hollow (we were able to step inside).

2014-04-18 10.02.10

2014-04-18 10.09.44

2014-04-18 13.36.54

2014-04-18 13.44.20

2014-04-18 13.52.17

Tokyo is north of Miura and full of many kinds of wonders, from gardens, shrines and temples to buildings, nightlife and neighborhoods with very distinct characters. We explored many of the different areas of Tokyo:

2014-04-17 18.24.14

2014-04-17 18.21.51

2014-04-17 18.21.12

2014-04-17 18.11.06

2014-04-17 18.01.41

2014-04-17 15.40.48

2014-04-17 15.24.32

2014-04-15 16.32.11

2014-04-15 15.49.31

2014-04-15 15.46.02

2014-04-15 13.43.14

2014-04-15 13.05.13

2014-04-15 13.04.58

2014-04-15 12.31.54

2014-04-14 16.27.13

2014-04-14 15.37.06

2014-04-14 14.25.23

2014-04-14 14.12.45

2014-04-14 12.55.57

We also attended a drum festival in the town of Narita, which most people only know for the large international airport. This was a truly amazing experience! As we walked from the subway to the big temple we passed many shops, ate amazing street food and saw smaller drum performances. The main performance was on the steps of the big temple, and we were able to explore the grounds and gardens and return to see drumming by fire at sunset. Once the performance ended we followed the main road back to the city, but now it was dark outside, the shop lights were low, and the shopkeepers had set candles out to line the path.

2014-04-12 16.55.55

2014-04-12 17.13.17

2014-04-12 17.16.29

2014-04-12 17.42.31

2014-04-12 17.50.48

2014-04-12 18.17.00

2014-04-12 18.30.04

2014-04-12 18.58.11

2014-04-12 18.58.22

2014-04-12 18.58.36

Truly an amazing experience. Thank you for sharing!

 

 

Professional Identity: Who am I? And who are you?

Last night I acted as a mentor at the annual Career Exploration Expo sponsored by my graduate program. Many of the students had questions about developing a professional identity. This makes sense, of course, because graduate school is an important time for discovering and developing a professional identity.

People enter our program (and many others) with a wide variety of backgrounds and interests. They choose from a variety of classes that fit their interests and goals. And then they try to map their experience onto job categories. But boxes are difficult to climb into and out of, and students soon discover that none of the boxes is a perfect fit.

I experienced this myself. I entered the program with an extensive and unquestioned background in survey research. Early in my college years (while I was studying and working in neuropsychology) I began to manage a clinical dataset in SPSS. Working with patients and patient files was very interesting, but to my surprise working with data using statistical software felt right to me much in the way that Ethiopian meals include injera and Japanese meals include rice (IC 2006 (1997) Ohnuki Tierney Emiko). I was actually teased by my friends about my love of data! This affinity served me well, and I enjoyed working with a variety of data sets while moving across fields and statistical programming languages.

But my graduate program blew my mind. I felt like I had spent my life underwater and then discovered the sky and continents. I discovered many new kinds of data and analytic strategies, all of which were challenging and rewarding. These discoveries inspired me to start this blog and have inspired me to attend a wide variety of events and read some very interesting work that I never would have discovered on my own. Hopefully followers of this blog have enjoyed this journey as much as I have!

As a recent graduate, I sometimes feel torn between worlds. I still work as a survey researcher, but I’m inspired by research methods that are beyond the scope of my regular work. Another recent graduate of our program who is involved in market research framed her strategy in a way that really resonated with me: “I give my customers what they want and something else, and they grow to appreciate the ‘something else.’” That sums up my current strategy. I do the survey management and analysis that is expected of me in a timely, high-quality way. But I am also using my newly acquired knowledge to incorporate text analysis into our data cleaning process in order to streamline it, increasing both the speed and the quality of the process and making it better equipped to handle the data from future surveys. I do the traditional quantitative analyses, but I supplement them with analyses of the open-ended responses that use more flexible text analytic strategies. These analyses spark more quantitative analyses and make for much better (richer, more readable and more inspired) reports.

Our goal as professionals should be to find a professional identity that best capitalizes on our unique knowledge, skills and abilities. There is only one professional identity that does all of that, and it is the one you have already chosen and continue to choose every day. We are faced with countless choices about what classes to take, what to read, what to attend, what to become involved in, and what to prioritize, and we make countless assessments about each. Was it worthwhile? Did I enjoy it? Would I do it again? Each of these choices constitutes your own unique professional self, a self which you are continually manufacturing. You are composed of your past, your present, and your future, and your future will undoubtedly be a continuation of your past and present. The best career coach you have is inside of you.

Now your professional identity is much more uniquely or narrowly focused than the generic titles and fields that you see in the professional marketplace. Keep in mind that each job listing that you see represents a set of needs that a particular organization has. Is this a set of needs that you are ready to fill? Is this a set of needs that you would like to fill? You are the only one who knows the answers to these questions.

Because it turns out that you are your best career coach, and you have been all along.

In praise of getting things wrong and working toward better

“An expert is a man who has made all the mistakes which can be made in a very narrow field” -Niels Bohr

I’ve been reading “In the Plex,” a book about the history of Google by Steven Levy. I highly recommend this book, because as I read it I am increasingly aware of the ways in which Google’s constant presence invisibly shapes our daily lives. Levy makes a point in the book of attributing some of Google’s constant evolution to its obsession with failure. In search terms, isolating failures is relatively easy: if people soon return to the search page, reframe their query, or continue down through lower-ranked results, their search was a relative failure. Failures are identified and isolated by Google and then obsessed over until the PageRank algorithm can be appropriately tweaked in a way that passes rigorous testing protocols.

In this way, Google is similar to an increasing number of failure-focused initiatives, including some of the engineering-based models that have been applied to healthcare and more. These voices are increasingly the source of innovations that are continually shaping and reshaping our future. But the rhetoric of failure and success of its evangelizers can be hard for us to wrap our heads around, as people who naturally fear, avoid and focus on failure in a negative way.

Over the weekend, while I was practicing Yoga, I told one of my kids my favorite part of the practice (note: not a good time for chatting). I love that Yoga is a process. One day you will be able to do something that you may or may not be able to do the next day, and vice versa. My practice involves quite a bit of balancing on one foot, and there are days when that balance feels effortless and days when that balance feels impossible. But the effortless days only come because I continue to practice despite the disappointments of my wobblier days. Yoga instructors sometimes talk about the power of intentions and working in ways that align with our intentions. One of my kids pointed out that the wobbly days, as I call them, are exactly the reason why she hates Yoga. She believes that she’s no good at it, and because of her assessment she will avoid it. You can probably guess that this conversation is far from over between us.

We see attitudes like these affecting people (including ourselves) every day. Some people theorize that the lower representation of women in STEM (Science, Technology, Engineering and Math) fields is due to a larger proportion of women than men who doubt their abilities or judge their abilities more harshly. We hear about graduate students who experience what is sometimes called the ‘imposter syndrome.’ I remember some students in my graduate classes who chose not to participate in class for fear they would sound stupid. I’ve heard of medical practitioners who were so worried that they would make another mistake that they were afraid to practice. As a writer, I know that the power of self-doubt can cause writer’s block, but I also know how much easier it is to edit or rewrite.

I would encourage all of you to embrace your failures, your mistakes, your shortcomings, your missteps and your errors and see them as part of a process and not an endpoint. These stumbling points are the key points of growth, the key moments for us to learn and to redirect our actions to better suit our intentions. To err is human, but to learn from our missteps is surely something greater.

Great description of a census at Kakuma refugee camp

It’s always fun for a professional survey researcher to stumble upon a great pop cultural reference to a survey. Yesterday I heard a great description of a census taken at Kakuma refugee camp in Kenya. The description was in the book I’m currently reading: What Is the What by Dave Eggers (great book, I highly recommend it!). The book itself is fiction, loosely based on a true story, so this account likely stems from a combination of observation and imagination. The account reminds me of some of the field reports and ethnographic findings in other intercultural survey efforts, both national (US census) and inter or multinational.

To set the stage, Achak is the main character and narrator of the story. He is one of the “lost boys” of Sudan, and he found his way to Kakuma after a long and storied escape from his war-ravaged hometown. At Kakuma he was taken in by another Dinka man, named Gop, who is acting as a kind of father to Achak.

What is the What by Dave Eggers


“The announcement of the census was made while Gop was waiting for the coming of his wife and daughters, and this complicated his peace of mind. To serve us, to feed us, the UNHCR and Kakuma’s many aid groups needed to know how many refugees were at the camp. Thus, in 1994 they announced they would count us. It would only take a few days, they said. To the organizers I am sure it seemed a very simple, necessary, and uncontroversial directive. But for the Sudanese elders, it was anything but.

—What do you think they have planned? Gop Chol wondered aloud.

I didn’t know what he meant by this, but soon I understood what had him, and the majority of Sudanese elders, greatly concerned. Some learned elders were reminded of the colonial era, when Africans were made to bear badges of identification on their necks.

—Could this counting be a pretext of a new colonial period? Gop mused.—It’s very possible.
Probable even!

I said nothing.

At the same time, there were practical, less symbolic, reasons to oppose the census, including the fact that many elders imagined that it would decrease, not increase, our rations. If they discovered there were fewer of us than had been assumed, the food donations from the rest of the world would drop. The more pressing and widespread fear among young and old at Kakuma was that the census would be a way for the UN to kill us all. These fears were only exacerbated when the fences were erected.

The UN workers had begun to assemble barriers, six feet tall and arranged like hallways. The fences would ensure that we would walk single file on our way to be counted, and thus counted only once. Even those among us, the younger Sudanese primarily, who were not so worried until then, became gravely concerned when the fences went up. It was a malevolent-looking thing, that maze of fencing, orange and opaque. Soon even the best educated among us bought into the suspicion that this was a plan to eliminate the Dinka. Most of the Sudanese my age had learned of the Holocaust, and were convinced that this was a plan much like that used to eliminate the Jews in Germany and Poland. I was dubious of the growing paranoia, but Gop was a believer. As rational a man as he was, he had a long memory for injustices visited upon the people of Sudan.

—What isn’t possible, boy? he demanded.—See where we are? You tell me what isn’t possible at this time in Africa!

But I had no reason to distrust the UN. They had been feeding us at Kakuma for years. There was not enough food, but they were the ones providing for everyone, and thus it seemed nonsensical that they would kill us after all this time.

—Yes, he reasoned,—but see, perhaps now the food has run out. The food is gone, there’s no more money, and Khartoum has paid the UN to kill us. So the UN gets two things: they get to save food, and they are paid to get rid of us.

—But how will they get away with it?

—That’s easy, Achak. They say that we caught a disease only the Dinka can get. There are always illnesses unique to certain people, and this is what will happen. They’ll say there was a Dinka plague, and that all the Sudanese are dead. This is how they’ll justify killing every last one of us.
—That’s impossible, I said.

—Is it? he asked.—Was Rwanda impossible?

I still thought that Gop’s theory was unreliable, but I also knew that I should not forget that there were a great number of people who would be happy if the Dinka were dead. So for a few days, I did not make up my mind about the head count. Meanwhile, public sentiment was solidifying against our participation, especially when it was revealed that the fingers of all those counted, after being counted, would be dipped in ink.

—Why the ink? Gop asked. I didn’t know.

—The ink is a fail-safe measure to ensure the Sudanese will be exterminated.

I said nothing, and he elaborated. Surely if the UN did not kill us Dinka while in the lines, he theorized, they would kill us with this ink on the fingers. How could the ink be removed? It would, he thought, enter our bodies when we ate.

—This seems very much like what they did to the Jews, Gop said.

People spoke a lot about the Jews in those days, which was odd, considering that a short time before, most of the boys I knew thought the Jews were an extinct race. Before we learned about the Holocaust in school, in church we had been taught rather crudely that the Jews had aided in the killing of Jesus Christ. In those teachings, it was never intimated that the Jews were a people still inhabiting the earth. We thought of them as mythological creatures who did not exist outside the stories of the Bible. The night before the census, the entire series of fences, almost a mile long, was torn down. No one took responsibility, but many were quietly satisfied.

In the end, after countless meetings with the Kenyan leadership at the camp, the Sudanese elders were convinced that the head count was legitimate and was needed to provide better services to the refugees. The fences were rebuilt, and the census was conducted a few weeks later. But in a way, those who feared the census were correct, in that nothing very good came from it. After the count, there was less food, fewer services, even the departure of a few smaller programs. When they were done counting, the population of Kakuma had decreased by eight thousand people in one day.

How had the UNHCR miscounted our numbers before the census? The answer is called recycling.

Recycling was popular at Kakuma and is favored at most refugee camps, and any refugee anywhere in the world is familiar with the concept, even if they have a different name for it. The essence of the idea is that one can leave the camp and re-enter as a different person, thus keeping his first ration card and getting another when he enters again under a new name. This means that the recycler can eat twice as much as he did before, or, if he chooses to trade the extra rations, he can buy or otherwise obtain anything else he needs and is not being given by the UN—sugar, meat, vegetables. The trading resulting from extra ration cards provided the basis for a vast secondary economy at Kakuma, and kept thousands of refugees from anemia and related illnesses. At any given time, the administrators of Kakuma thought they were feeding eight thousand more people than they actually were. No one felt guilty about this small numerical deception.

The ration-card economy made commerce possible, and the ability of different groups to manipulate and thrive within the system led soon enough to a sort of social hierarchy at Kakuma.”