Let’s talk about data cleaning

Data cleaning has a bad rep. In fact, it has long been considered the grunt work of the data analysis enterprise. I recently came across a piece of writing in the Harvard Business Review that lamented the amount of time data scientists spend cleaning their data. The author feared that data scientists’ skills were being wasted on the cleaning process when they could be using their time for the analyses we so desperately need them to do.

I’ll admit that I haven’t always loved the process of cleaning data. But my view of the process has evolved significantly over the last few years.

As a survey researcher, my cleaning process used to begin with a tall stack of paper forms. Answers that did not make logical sense during the checking process sparked a trip to the file folders to find the form in question. The forms often held physical evidence of indecision on the part of the respondent, such as eraser marks or an explanation in the margin, which could not have been reflected properly by the data entry person. We lost this part of the process when we moved to web surveys. It sometimes felt like a web survey left the respondent no way to communicate with the researcher about their unique situations. Data cleaning lost its personalized feel and detective story luster and became routine and tedious.

Despite some of the affordances of the movement to web surveys, much of the cleaning process stayed rooted in the old techniques. Each form had its own id number, and the programmers would use those id numbers for corrections:

if id=1234567, set var1=5, set var7=62

At this point a “good programmer” would also document the changes for future collaborators:

*this person was not actually a forest ranger, and they were born in 1962
if id=1234567, set var1=5, set var7=62

Making these changes grew tedious very quickly, and the process seemed to drag on for ages. The researcher would check the data for potential errors, scour the records that could hold those errors for any kind of evidence of the respondent’s intentions, and then handle each form one at a time.

My techniques for cleaning data have changed dramatically since those days. My goal is to use id numbers as rarely as possible, and instead to ask myself questions like “how can I tell that these people are not forest rangers?” The answers to these questions call for a subtly different technique:

* these people are not actually forest rangers
if var7=35 and var1=2 and var10 contains ‘fire fighter’, set var1=5

This technique requires honing and testing (adjusting the precision and recall), but I’ve found it to be far more efficient, faster, more comprehensive and, most of all, more fun (oh hallelujah!). It makes me wonder whether we have perpetually undercut the quality of the data cleaning we do simply because we hold the process in such low esteem.
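To make the contrast concrete, here is a minimal Python sketch of the rule-based approach. It is only an illustration: the records are invented, the variable names (var1, var7, var10) are borrowed from the pseudocode above without knowing what the codes mean, and the case-insensitive text match is my own assumption rather than part of the original rule.

```python
# Hypothetical survey records; the values are invented for illustration.
records = [
    {"id": 1234567, "var1": 2, "var7": 35, "var10": "Fire fighter, county"},
    {"id": 1234568, "var1": 2, "var7": 35, "var10": "forest ranger"},
    {"id": 1234569, "var1": 2, "var7": 35, "var10": "volunteer firefighter"},
    {"id": 1234570, "var1": 3, "var7": 35, "var10": "fire fighter"},
]


def is_misclassified_ranger(rec):
    """Rule from the post: var7 is 35, var1 is 2, and var10 contains 'fire fighter'.

    Lowercasing the text before matching is an added assumption.
    """
    text = rec["var10"].lower()
    return rec["var7"] == 35 and rec["var1"] == 2 and "fire fighter" in text


def apply_rule(recs):
    """Return corrected copies of the records plus the ids that were changed."""
    cleaned, changed = [], []
    for rec in recs:
        rec = dict(rec)  # copy, so the raw data stays untouched
        if is_misclassified_ranger(rec):
            rec["var1"] = 5
            changed.append(rec["id"])
        cleaned.append(rec)
    return cleaned, changed


cleaned, changed_ids = apply_rule(records)
print(changed_ids)  # ids caught by the rule, kept as documentation of the change
```

Note that the record with “volunteer firefighter” written as one word slips past the rule. That is the honing and testing part: each rule has to be checked against the cases it catches wrongly (precision) and the cases it misses (recall), and then adjusted.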

So far I have not discussed data cleaning for other types of data. I’m currently working on a corpus of Twitter data, and I don’t see much of a difference in the cleaning process. The data types and programming statements I use are different, but the process is very close. It’s an interesting and challenging process that involves detective work, a better and growing understanding of the intricacies of the dataset, a growing set of programming skills, and a growing understanding of the natural language use in your dataset. The process mirrors the analysis to such a degree that I’m not really sure why it would be such a bad thing for analysts to be involved in data cleaning.

I’d be interested to hear what my readers have to say about this. Is our notion of the value and challenge of data cleaning antiquated? Is data cleaning a burden that an analyst should bear? And why is there so little talk about data cleaning, when we could all stand to learn so much from each other in the way of data structuring code and more?

Great description of a census at Kakuma refugee camp

It’s always fun for a professional survey researcher to stumble upon a great pop cultural reference to a survey. Yesterday I heard a great description of a census taken at Kakuma refugee camp in Kenya. The description was in the book I’m currently reading: What Is the What by Dave Eggers (great book, I highly recommend it!). The book itself is fiction, loosely based on a true story, so this account likely stems from a combination of observation and imagination. The account reminds me of some of the field reports and ethnographic findings in other intercultural survey efforts, both national (US census) and inter or multinational.

To set the stage, Achak is the main character and narrator of the story. He is one of the “lost boys” of Sudan, and he found his way to Kakuma after a long and storied escape from his war-ravaged hometown. At Kakuma he was taken in by another Dinka man, named Gop, who is acting as a kind of father to Achak.

What is the What by Dave Eggers

“The announcement of the census was made while Gop was waiting for the coming of his wife and daughters, and this complicated his peace of mind. To serve us, to feed us, the UNHCR and Kakuma’s many aid groups needed to know how many refugees were at the camp. Thus, in 1994 they announced they would count us. It would only take a few days, they said. To the organizers I am sure it seemed a very simple, necessary, and uncontroversial directive. But for the Sudanese elders, it was anything but.

—What do you think they have planned? Gop Chol wondered aloud.

I didn’t know what he meant by this, but soon I understood what had him, and the majority of Sudanese elders, greatly concerned. Some learned elders were reminded of the colonial era, when Africans were made to bear badges of identification on their necks.

—Could this counting be a pretext of a new colonial period? Gop mused.—It’s very possible.
Probable even!

I said nothing.

At the same time, there were practical, less symbolic, reasons to oppose the census, including the fact that many elders imagined that it would decrease, not increase, our rations. If they discovered there were fewer of us than had been assumed, the food donations from the rest of the world would drop. The more pressing and widespread fear among young and old at Kakuma was that the census would be a way for the UN to kill us all. These fears were only exacerbated when the fences were erected.

The UN workers had begun to assemble barriers, six feet tall and arranged like hallways. The fences would ensure that we would walk single file on our way to be counted, and thus counted only once. Even those among us, the younger Sudanese primarily, who were not so worried until then, became gravely concerned when the fences went up. It was a malevolent-looking thing, that maze of fencing, orange and opaque. Soon even the best educated among us bought into the suspicion that this was a plan to eliminate the Dinka. Most of the Sudanese my age had learned of the Holocaust, and were convinced that this was a plan much like that used to eliminate the Jews in Germany and Poland. I was dubious of the growing paranoia, but Gop was a believer. As rational a man as he was, he had a long memory for injustices visited upon the people of Sudan.

—What isn’t possible, boy? he demanded.—See where we are? You tell me what isn’t possible at this time in Africa!

But I had no reason to distrust the UN. They had been feeding us at Kakuma for years. There was not enough food, but they were the ones providing for everyone, and thus it seemed nonsensical that they would kill us after all this time.

—Yes, he reasoned,—but see, perhaps now the food has run out. The food is gone, there’s no more money, and Khartoum has paid the UN to kill us. So the UN gets two things: they get to save food, and they are paid to get rid of us.

—But how will they get away with it?

—That’s easy, Achak. They say that we caught a disease only the Dinka can get. There are always illnesses unique to certain people, and this is what will happen. They’ll say there was a Dinka plague, and that all the Sudanese are dead. This is how they’ll justify killing every last one of us.
—That’s impossible, I said.

—Is it? he asked.—Was Rwanda impossible?

I still thought that Gop’s theory was unreliable, but I also knew that I should not forget that there were a great number of people who would be happy if the Dinka were dead. So for a few days, I did not make up my mind about the head count. Meanwhile, public sentiment was solidifying against our participation, especially when it was revealed that the fingers of all those counted, after being counted, would be dipped in ink.

—Why the ink? Gop asked. I didn’t know.

—The ink is a fail-safe measure to ensure the Sudanese will be exterminated.

I said nothing, and he elaborated. Surely if the UN did not kill us Dinka while in the lines, he theorized, they would kill us with this ink on the fingers. How could the ink be removed? It would, he thought, enter our bodies when we ate.

—This seems very much like what they did to the Jews, Gop said.

People spoke a lot about the Jews in those days, which was odd, considering that a short time before, most of the boys I knew thought the Jews were an extinct race. Before we learned about the Holocaust in school, in church we had been taught rather crudely that the Jews had aided in the killing of Jesus Christ. In those teachings, it was never intimated that the Jews were a people still inhabiting the earth. We thought of them as mythological creatures who did not exist outside the stories of the Bible. The night before the census, the entire series of fences, almost a mile long, was torn down. No one took responsibility, but many were quietly satisfied.

In the end, after countless meetings with the Kenyan leadership at the camp, the Sudanese elders were convinced that the head count was legitimate and was needed to provide better services to the refugees. The fences were rebuilt, and the census was conducted a few weeks later. But in a way, those who feared the census were correct, in that nothing very good came from it. After the count, there was less food, fewer services, even the departure of a few smaller programs. When they were done counting, the population of Kakuma had decreased by eight thousand people in one day.

How had the UNHCR miscounted our numbers before the census? The answer is called recycling.

Recycling was popular at Kakuma and is favored at most refugee camps, and any refugee anywhere in the world is familiar with the concept, even if they have a different name for it. The essence of the idea is that one can leave the camp and re-enter as a different person, thus keeping his first ration card and getting another when he enters again under a new name. This means that the recycler can eat twice as much as he did before, or, if he chooses to trade the extra rations, he can buy or otherwise obtain anything else he needs and is not being given by the UN—sugar, meat, vegetables. The trading resulting from extra ration cards provided the basis for a vast secondary economy at Kakuma, and kept thousands of refugees from anemia and related illnesses. At any given time, the administrators of Kakuma thought they were feeding eight thousand more people than they actually were. No one felt guilty about this small numerical deception.

The ration-card economy made commerce possible, and the ability of different groups to manipulate and thrive within the system led soon enough to a sort of social hierarchy at Kakuma.”

Spam, Personal histories and Language competencies

Over the recent holiday, I spent some time sorting through many boxes of family memorabilia. Some of you have probably done this with your families. It is fascinating, sentimental and mind-boggling. Highlights include both the things that strike a chord and things that can be thrown away. It’s a balance of efficiency and sap.

 

I’m always amazed by the way family memorabilia tells both private, personal histories and larger public ones. The boxes I dealt with last week were my mom’s, and her passion was politics. Even the Christmas cards she saved give pieces of political histories. Old thank you cards provide unknown nuggets of political strategy. She had even saved stirrers and plastic cups from an inauguration!

 

Campaign button found in the family files

My mom continued to work in politics throughout her life, but the work that she did more recently is understandably fresher and more tangible for me. I remember looking through printed Christmas cards from politicians and wondering why she held on to them. In her later years I worried about her tendency to hold on to mail merged political letters. I wondered if her tendency to personalize impersonal documents made her vulnerable to fraud. To me, her belief in these documents made no sense.

 

Flash forward one year to me sorting through boxes of handwritten letters from politicians that mirror the spam she held on to. For many years she received handwritten letters from elected politicians in Washington. At some point, the handwritten letters evolved into typed letters that were hand-corrected and included handwritten sections. These evolved into typed letters on which the only handwriting was the signature. Eventually, even the signatures became printed. But the intention and function of these letters remained the same, even as their typography evolved. She believed in these letters because she had been receiving them for many decades. She believed they were personal because she had seen more of them that were personal than not. The phrases that I believe to be formulaic and spammy were once handwritten, intentional, personal and probably even heartfelt.

 

 

There are a few directions I could go from here:

 

– I better understand why older people complain about the impersonalization of modern society and wax poetic about the old letter writing tradition. I could include a few anecdotes about older family members.

 

– I’m amazed that people would take the time to write long letters using handwriting that may never have been deciphered

 

– I could wax poetic about some of the cool things I found in the storage facility

 

 

But I won’t. Not in this blog. Instead, I’ll talk about competencies.

 

Spam is a manifestation of language competencies, although we often dismiss it as a total lack of language competence. In my linguistics studies, we were quickly taught the mantra “difference, not deficiency.” In fact it takes quite a bit of skill to develop spam letters. In survey research, the survey invitation letters that people so often dismiss have been heavily researched and optimized to yield a maximum response rate. In his book The Sociolinguistics of Globalization, Jan Blommaert details the many competencies necessary to create the Nigerian bank scam letters that were so heavily circulated a few years ago. And now I’ve learned that the political letters that I’m so quick to dismiss as thoughtless mail merges are actually part of a deep tradition of political action. Will that be enough for me to hold on to them? No. But I am saving the handwritten stuff. Boxes and boxes of it!

 

 

One day last week, as I drove to the storage facility, I heard an interview with Michael Pollan about food literacy. Pollan’s point was that the food deserts in some urban areas are not just a function of access. (Food deserts are areas where fresh food is difficult to obtain and grocery stores are few and far between, if they’re available at all.) Pollan believes that even if there were grocery stores available, the people in these neighborhoods lack the basic cooking skills to prepare the food. He cited a few basic cooking skills which are not basic to me (partly because I’m a vegetarian, and partly because of the cooking traditions I learned from) as a part of his argument.

 

As a linguist, I find it very interesting to hear the baggage that people attach to language metaphorically carried over to food (“food illiteracy”). I wonder what value the “difference, not deficiency” mantra holds here. I’m not ready to believe that people in food deserts are indeed kitchen illiterate. But I wouldn’t hesitate to agree that their food cultures probably differ significantly from Pollan’s. The basic staples and cooking methods probably differ significantly. Pollan could probably make a lot more headway with his cause if, instead of assuming that the people he is trying to help lack any basic cooking skills, he advocated for a culture change that included access, attainability, and the potential to learn different practical cooking skills. It’s a subtle shift, but an important one.

 

As a proud uncook, I’m a huge fan of any kind of food preparation that is two steps or less, cheap, easy and fresh. Fast food for me involves putting a sweet potato in the microwave and pressing “potato,” grabbing for an apple or carrots and peanut butter, or tossing chickpeas into a dressing. Slow food involves the basic sautéing, roasting, etc. that Pollan advocates. I imagine that the skills he advocates are more practical and enjoyable for him than they are for people like me, whose mealtimes are usually limited and chaotic. What he calls basic is impractical for many of us. And the differences in time and money involved in uncooking and “basics” add up quickly.

 

 

 

So I’ve taken this post in quite a few directions, but it all comes together under one important point. Different language skills are not a lack of language skills altogether. Similarly, different survival skills are not a total lack of survival skills. We all carry unique skillsets that reflect our personal histories with those skills as well as the larger public histories that our personal histories help to compose. We, as people, are part of a larger public. The political spam I see doesn’t meet my expectations of valuable, personal communication, but it is in fact part of a rich political history. The people who Michael Pollan encounters have ways of feeding themselves that differ from Pollan’s expectations, but they are not without important survival skills. Cultural differences are not an indication of an underlying lack of culture.

Representativeness, qual & quant, and Big Data. Lost in translation?

My biggest challenge in coming from a quantitative background to a qualitative research program was representativeness. I came to class firmly rooted in the principle of Representativeness, and my classmates seemed not to have any idea why it mattered so much to me. Time after time I would get caught up in my data selection. I would pose the wider challenge of representativeness to a colleague, and they would ask “representative of what? why?”

 

In the survey research world, the researcher begins with a population of interest and finds a way to collect a representative sample of the population for study. In the qualitative world that accompanies survey research, units of analysis are generally people, and people are chosen for their representativeness. Representativeness is often constructed from demographic characteristics. If you’ve read this blog before, you know of my issues with demographics. Too often, demographic variables are used as a knee-jerk choice instead of better considered variables that are more relevant to the analysis at hand. (Maybe the census collects gender and not program availability, for example, but just because a variable is available and somewhat correlated doesn’t mean that it is in fact a relevant variable, especially when the focus of study is a population for whom gender is such an integral societal difference.)

 

And yet I spent a whole semester studying 5 minutes of conversation between 4 people. What was that representative of? Nothing but itself. It couldn’t have been exchanged for any other 5 minutes of conversation. It was simply a conversation that this group had and forgot. But over the course of the semester, this piece of conversation taught me countless aspects of conversation research. Every time I delved back into the data, it became richer. It was my first step into the world of microanalysis, where I discovered that just about anything can be a rich dataset if you use it carefully. A snapshot of people at a lecture? Well, how are their bodies oriented? A snapshot of video? A treasure trove of gestures and facial expressions. A piece of graffiti? Semiotic analysis! It goes on. The world of microanalysis is built on the practice of layered noticing. It goes deeper than wide.

 

But what is it representative of? How could a conversation be representative? Would I need to collect more conversations, but restrict the participants? Collect conversations with more participants, but in similar contexts? How much or how many would be enough?

 

In the world of microanalysis, people and objects constantly create and recreate themselves. You consistently create and recreate yourself, but your recreations generally fall into a similar range that makes you different from your neighbors. There are big themes in small moments. But what are the small moments representative of? Themselves. Simply, plainly, nothing more and nothing else. Does that mean that they don’t matter? I would argue that there is no better way to understand the world around us in deep detail than through microanalysis. I would also argue that macroanalysis is an important part of discovering the wider patterns in the world around us.

 

Recently a NY Times blog post by Quentin Hardy has garnered quite a bit of attention.

Why Big Data is Not Truth: http://bits.blogs.nytimes.com/2013/06/01/why-big-data-is-not-truth/

This post has really struck a chord with me, because I have had a hard time understanding Hardy’s complaint. Is big data truth? Is any data truth? All data is what it is: a collection of some sort, collected under a specific set of circumstances. Even data that we hope to be more representative has sampling and contextual limitations. Responsible analysts should always be upfront about what their data represents. Is big data less truthful than other kinds of data? It may be less representative than, say, a systematically collected political poll. But it is what it is: different data, collected under different circumstances in a different way. It shouldn’t be equated with other data that was collected differently. One true weakness of many large scale analyses is the blindness to the nature of the data, but that is a byproduct of the training algorithms that are used for much of the analysis. The algorithms need large training datasets, from anywhere. These sets are often developed through massive web crawlers. Here, context gets dicey. How does a researcher represent the data properly when they have no idea what it is? Hopefully researchers in this context will be wholly aware that, although their data has certain uses, it also has certain [huge] limitations.

 

I suspect that Hardy’s complaint is with the representation of massive datasets collected by web crawlers as a complete truth on which any analysis could be run and from which all of the greater truths of the world could be revealed. On this note, Hardy is exactly right. Data simply is what it is, nothing more and nothing less. And any analysis that focuses on an unknown dataset is just that: an analysis without context. Which is not to say that all analyses need to be representative, but rather that all responsible analyses of good quality need to be self-aware. If you do not know what the data represents and when and how it was collected, then you cannot begin to discuss the usefulness of any analysis of it.

Marissa Mayer, Motherhood and the Public Sphere

Marissa Mayer has made quite a few waves recently. First, she was appointed CEO of Yahoo, an internet mainstay with an identity crisis, in a bold act of assertion and experimentation. She made headlines for her history of making waves in internet companies, and she made waves for being female. The headlines about her gender had barely receded before the news broke that she was pregnant. This was huge news, news that had barely receded before she again made headlines for talking about motherhood.

As a working parent, I’m moved to say a few things about motherhood and work.

The first point is an obvious one. Pregnancy is a state that primarily affects the pregnant. As a working, pregnant woman I remember the absolute fear of my coworkers discovering that my pregnant head was a blurred and rearranged rendition of its normal self. There is no public acceptance or public discussion of pregnancy brain, only private, informal corroboration of the phenomenon. I felt sympathy for Mayer as she navigated the intense public scrutiny of her every move within her new position while potentially mucking through “pregnancy brain.” Granted, we are adaptable creatures. I navigated the world of working while pregnant with a minimum of catastrophic errors, mostly through subtle adjustments in my work patterns, and I’m sure that she did as well. And, just as my consuming hatred of onions disappeared during the labor process, my thinking again found clarity. Motherhood does not necessarily affect a woman’s ability to work, and when it does, the effect is not necessarily negative. In fact, we found in a survey of physicists the world over that mothers often discover that they work much more efficiently than they did before they gave birth (may be covered in this report). The wider sense of context and greater array of responsibilities can significantly improve worklife.

The assumption underlying fears about motherhood and work life is that the mother is the sole or primary source of childcare. This is not universally true. Parents make parenting work with whatever tools they have. Some parents have partners with varying degrees of involvement, and some don’t. Some live with family, some don’t. Some have dependable community networks or friends who help out with childcare. Some have financial means that can be used to find help with childcare or to help with other areas of life, in order to make more time for childcare. There are no assumptions that one could make straight out of the gate about a person’s childcare options or preferences.

Mayer made a point of calling her child “easy.” I don’t really see how this should bear on anyone else. I have often called my own kids easy. Maybe a more technical way of saying this would be that their actions generally jibe with my own needs and preferences. There are times when that description couldn’t be further from the truth (like when they scream, or puke at dramatically inconvenient times, or cover me in spaghetti at lunchtime in the office cafeteria!), and times when it seems gloriously true, like the times I couldn’t find a reliable childcare option and my quiet second child hung out in my office all day without most of my coworkers noticing. The truth is that children are people, with a full set of emotions and physical states that they are just learning to reconcile. They start out very dependent and grow independent before we’re ready for them to. Sometimes they are a lot of work, or a lot of frustration. And sometimes they enrich our lives in ways we never could have imagined.

The part of Mayer’s statement that most struck me was that she spoke about her enjoyment of motherhood. Motherhood has plenty of little joys, plenty of cute moments, and plenty of little smiles. They don’t often get as much press as the frustrations, especially from moms who are working on their careers. In a professional world that seems to be eyeing moms for any sign that motherhood is negatively influencing worklife, moms are often thrust into this dynamic, where being a parent is thought to be at odds with a career. I have often been caught in this dynamic. I enjoy my career and enjoy motherhood and in fact enjoy being able to do both. I rarely hear this sentiment echoed, almost as if we need to choose a side and stick to it. But the need to pick sides is an old one, one where the mother is thought to be the primary caretaker, leaving her child adrift if she chooses to also pursue other activities.

Instead, I try to use my enthusiasm for work and school as a model for my kids. I want them to know that it is possible to find and pursue work that they find interesting. And I work to let them know how much I care about them and enjoy spending time with them. And I rely heavily on the networks available to me for help. My system is far from perfect. In November, after being out of town for a few days, I returned to a severely wheezing child. Nothing can make you feel worse as a parent than taking a child to the doctor when they are seriously ill and you know nothing about the history of their illness. And now the wave of finals is creeping over us like an ominous tsunami, threatening to swallow our homelife whole in its voraciousness.

Motherhood is complicated. Parenthood is complicated. Mayer may be moved to characterize parenthood one way in one interview and then in a completely different way fifteen minutes later. As a fellow parent, I would like to issue my full support for her speaking publicly about parenthood at all. I wish her a string of lights in the thicket of family life and work life and a cord to hold it together in the points between.

And I wish her press about her professional life that isn’t defined by her personal life.

What do all of these polling strategies add up to?

Yesterday was a big first for research methodologists across many disciplines. For some of the newer methods, it was the first election that they could be applied to in real time. For some of the older methods, this election was the first to bring competing methodologies, and not just methodological critiques.

Real time sentiment analysis from sites like this summarized Twitter’s take on the election. This paper sought to predict electoral turnout using Google searches. InsideFacebook attempted to use Facebook data to track voting. And those are just a few of a rapid proliferation of data sources, analytic strategies and visualizations.

One could ask, who are the winners? Some (including me) were quick to declare a victory for the well-honed craft of traditional pollsters, who showed that they were able to repeat their studies with little noise, and that their results were predictive of a wider real world phenomenon. Some could call a victory for the emerging field of Data Science. Obama’s Chief Data Scientist is already beginning to be recognized. Comparisons of analytic strategies will spring up all over the place in the coming weeks. The election provided a rare opportunity where so many strategies and so many people were working in one topical area. The comparisons will tell us a lot about where we are in the data horse race.

In fact, most of these methods were successful predictors in spite of their complicated underpinnings. The Google searches took into account searches for variations of “vote,” which worked as a kind of reliable predictor but belied the complicated web of naturalistic search terms (which I alluded to in an earlier post about the natural development of hashtags, as explained by Rami Khater of Al Jazeera’s The Stream, a social network generated newscast). I was a real-world example of this methodological complication. Before I went to vote, I googled “sample ballot.” Similar intent, but I wouldn’t have been caught in the analyst’s net.
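A small Python sketch shows how a keyword net of this kind lets related searches slip through. Everything in it is hypothetical: the queries are invented, and the pattern is my guess at what “variations of vote” might look like, not the method the paper actually used.

```python
import re

# Invented example queries; none of these come from the paper.
queries = [
    "where do I vote",
    "voting hours",
    "sample ballot",
    "polling place near me",
    "voter registration",
]

# An assumed pattern for "variations of vote": vote, votes, voted, voter(s), voting.
VOTE_PATTERN = re.compile(r"\bvot(e|es|ed|er|ers|ing)\b", re.IGNORECASE)


def mentions_vote(query):
    """True if the query contains one of the assumed variations of 'vote'."""
    return bool(VOTE_PATTERN.search(query))


caught = [q for q in queries if mentions_vote(q)]
missed = [q for q in queries if not mentions_vote(q)]

print("caught:", caught)
print("missed:", missed)
```

My “sample ballot” search lands in the missed list along with “polling place near me,” even though the intent behind both is the same as the searches the pattern catches.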

If you look deeper at the Sentiment Analysis tools that allow you to view the specific tweets that comprise their categorizations, you will quickly see that, although the overall trends were in fact predictive of the election results, the data coding was messy, because language is messy.

And the victorious predictive ability of traditional polling methods belies the complicated nature of interviewing as a data collection technique. Survey methodologists work hard to standardize research interviews in order to maximize the reliability of the interviews. Sometimes these interviews are standardized to the point of recording. Sometimes the interviews are so scripted that interviewers are not allowed to clarify questions, only to repeat them. Critiques of this kind of standardization are common in survey methodology, most notably from Nora Cate Schaeffer, who has raised many important considerations within the survey methodology community while still strongly supporting the importance of interviewing as a methodological tool. My reading assignment for my ethnography class this week is a chapter by Charles Briggs from 1986 (Briggs – Learning how to ask) that proves that many of the new methodological critiques are in fact old methodological critiques. But the critiques are rarely heeded, because they are difficult to apply.

I am currently working on a project that demonstrates some of the problems with standardizing interviews. I am revising a script we used to call a representative sample of U.S. high schools. The script was last used four years ago in a highly successful effort that led to an admirable 98% response rate. But to my surprise, when I went to pull up the old script I found instead a system of scripts. What was an online and phone survey had spawned fax and e-mail versions. What was intended to be a survey of principals now had a set of potential respondents from the schools, each with their own strengths and weaknesses. Answers to common questions from school staff were loosely scripted on an addendum to the original script. A set of tips for phonecallers included points such as “make sure to catch the name of the person who transfers you, so that you can specifically say that Ms X from the office suggested I talk to you” and “If you get transferred to the teacher, make sure you are not talking to the whole class over the loudspeaker.”

Heidi Hamilton, chair of the Georgetown Linguistics department, often refers to conversation as “climbing a tree that climbs back.” In fact, we often talk about meaning as mutually constituted between all of the participants in a conversation. The conversation itself cannot be taken outside of the context in which it lives. The many documents I found from the phonecallers show just how relevant these observations can be in an applied research environment.

The big question that arises from all of this is one of practical strategy. In particular, I had to figure out how best to address the interview campaign that we had actually run when preparing to rerun the campaign we had intended to run. My solution was to integrate the feedback from the phonecallers and loosen up the script. But I suspect that this tactic will work differently with different phonecallers. I’ve certainly worked with a variety of phonecallers, from those who preferred a script to those who preferred to talk off the cuff. Which makes the best phonecaller? Neither. Both. The ideal phonecaller works with the situation that is presented to them nimbly and professionally while collecting complete and relevant data from the most reliable source. As much of the time as possible.

At this point, I’ve strayed pretty far afield from my original point, which is that all of these competing predictive strategies have complicated underpinnings.

And what of that?

I believe that the best research is conscious of its strengths and weaknesses and not afraid to work with other strategies in order to generate the most comprehensive picture. As we see comparisons and horse races develop between analytic strategies, I think the best analyses we’ll see will be the ones that fit the results of each of the strategies together, simultaneously developing a fuller breakdown of the election and a fuller picture of our new research environment.

Data Journalism, like photography, “involves selection, filtering, framing, composition and emphasis”

Beautiful:

“Creating a good piece of data journalism or a good data-driven app is often more like an art than a science. Like photography, it involves selection, filtering, framing, composition and emphasis. It involves making sources sing and pursuing truth – and truth often doesn’t come easily.” -Jonathan Gray

Whole article:

http://www.guardian.co.uk/news/datablog/2012/may/31/data-journalism-focused-critical

Truly, at a time when the buzz about big data is at such a peak, it is nice to hear a voice of reason and temper! Folks: big data will not do all that it is talked up to do. It will, in fact, do something surprising and different. And that something will come from the interdisciplinary thought leaders in fields like natural language processing and linguistics. That *something,* not the data itself, will be the new oil.