Data Storytelling

In the beginning of our Ethnography of Communication class, one of the students asked about the kinds of papers one writes about an ethnography. It seemed like a simple question at the time. In order to report on ethnographic data, the researcher chooses a theme and then pulls out the parts of their data that fit the theme. Now that I’m at the point in my ethnography where I’m choosing what to report, I can safely say that this question is not one with an easy answer.

At this point, I’ve gathered together a tremendous amount of data about DC taxi drivers. I’ve already given my final presentation for my class, and written most of my final paper. But the data gathering phase hasn’t ended yet. I have been wondering whether I have enough data gathered together to write a book, and I probably could write a book, but that still doesn’t make my project feel complete. I don’t feel like the window I’ve carved is large enough to do this topic any justice.

The story that I set out to tell about the drivers is one of their absence in the online public sphere. As the wife of a DC driver, I was sick and tired of seeing blog posts and newspaper articles with seemingly unending streams of offensive, ignorant, or simply one sided comments. This story turns out to be one with many layers, one that goes far beyond issues of internet access, delves deeply into matters of differential use of technology, and one that strikes fractures into the soil of the grand potential of participatory democracy. It is also a story grounded in countless daily interactions, involving a large number of participants and situations. The question is large, the data abundant, and the paths to the story many. Each more narrow path begs a depth that is hungry for more data and more analysis. Each answer is defined by more questions. More specifically, do I start with the rides? With a specific ride? With the drivers? With a specific driver? With a specific piece of legislation? With one online discussion or theme? How can I make sure that my analysis is grounded and objective? How far do I trace the story, and which parts of the story does it leave out? What happens with the rest of the story? What is my responsibility and to whom?

This paper will clearly not be the capstone to the ethnography, just one story told through the data I’ve gathered together in the past few months. More stories can be told, and will be told with the data. Specifically, I’m hoping to delve more deeply into the driver’s social networks, for their role in information exchange. And the fallout from stylistic differences in online discussions. And, more prescriptively, into ways that drivers voices can be better represented in the public sphere. And maybe more?

It feels strange to write a paper that isn’t descriptive of the data as a whole. Every other project that I’ve worked on has led to a single publication that summarized the whole set. It seems strange, coming from a quantitative perspective where the data strongly confines the limits of what can and cannot be said in the report and what is more or less important to include in the report, to have a choice of data, and, more importantly, a choice of story to tell. Instead of pages of numbers to look through, compare and describe, I’m entering the final week of this project with the same cloud of ambiguity that has lingered throughout. And I’m looking for ways that my data can determine what can and cannot be reported on and what stories should be told. Where, in this sea of data, is my life raft of objectivity? (Hear that note of drama? That comes from the lack of sleep and heightened anxiety that finals bring about- one part of formal education that I will not miss!!)

I have promised to share my paper here once it has been written. I might end up making some changes before sharing it, but I will definitely share it. My biggest hope is that it will inspire some fresh, better informed conversation on the taxi situation in DC and on what it means to be represented in a participatory democracy.

What do all of these polling strategies add up to?

Yesterday was a big first for research methodologists across many disciplines. For some of the newer methods, it was the first election that they could be applied to in real time. For some of the older methods, this election was the first to bring competing methodologies, and not just methodological critiques.

Real time sentiment analysis from sites like this summarized Twitter’s take on the election. This paper sought to predict electoral turnout using google searches. InsideFacebook attempted to use Facebook data to track voting. And those are just a few of a rapid proliferation of data sources, analytic strategies and visualizations.

One could ask, who are the winners? Some (including me) were quick to declare a victory for the well honed craft of traditional pollsters, who showed that they were able to repeat their studies with little noise, and that their results were predictive of a wider real world phenomena. Some could call a victory for the emerging field of Data Science. Obama’s Chief Data Scientist is already beginning to be recognized. Comparisons of analytic strategies will spring up all over the place in the coming weeks. The election provided a rare opportunity where so many strategies and so many people were working in one topical area. The comparisons will tell us a lot about where we are in the data horse race.

In fact, most of these methods were successful predictors in spite of their complicated underpinnings. The google searches took into account searches for variations of “vote,” which worked as a kind of reliable predictor but belied the complicated web of naturalistic search terms (which I alluded to in an earlier post about the natural development of hashtags, as explained by Rami Khater of Al Jezeera’s The Stream, a social network generated newscast). I was a real-world example of this methodological complication. Before I went to vote, I googled “sample ballot.” Similar intent, but I wouldn’t have been caught in the analyst’s net.

If you look deeper at the Sentiment Analysis tools that allow you to view the specific tweets that comprise their categorizations, you will quickly see that, although the overall trends were in fact predictive of the election results, the data coding was messy, because language is messy.

And the victorious predictive ability of traditional polling methods belies the complicated nature of interviewing as a data collection technique. Survey methodologists work hard to standardize research interviews in order to maximize the reliability of the interviews. Sometimes these interviews are standardized to the point of recording. Sometimes the interviews are so scripted that interviewers are not allowed to clarify questions, only to repeat them. Critiques of this kind of standardization are common in survey methodology, most notably from Nora Cate Schaeffer, who has raised many important considerations within the survey methodology community while still strongly supporting the importance of interviewing as a methodological tool. My reading assignment for my ethnography class this week is a chapter by Charles Briggs from 1986 (Briggs – Learning how to ask) that proves that many of the new methodological critiques are in fact old methodological critiques. But the critiques are rarely heeded, because they are difficult to apply.

I am currently working on a project that demonstrates some of the problems with standardizing interviews. I am revising a script we used to call a representative sample of U.S. high schools. The script was last used four years ago in a highly successful effort that led to an admirable 98% response rate. But to my surprise, when I went to pull up the old script I found instead a system of scripts. What was an online and phone survey had spawned fax and e-mail versions. What was intended to be a survey of principals now had a set of potential respondents from the schools, each with their own strengths and weaknesses. Answers to common questions from school staff were loosely scripted on an addendum to the original script. A set of tips for phonecallers included points such as “make sure to catch the name of the person who transfers you, so that you can specifically say that Ms X from the office suggested I talk to you” and “If you get transferred to the teacher, make sure you are not talking to the whole class over the loudspeaker.”

Heidi Hamilton, chair of the Georgetown Linguistics department, often refers to conversation as “climbing a tree that climbs back.” In fact, we often talk about meaning as mutually constituted between all of the participants in a conversation. The conversation itself cannot be taken outside of the context in which it lives. The many documents I found from the phonecallers show just how relevant these observations can be in an applied research environment.

The big question that arises from all of this is one of a practical strategy. In particular, I had to figure out how to best address the interview campaign that we had actually run when preparing to rerun the campaign we had intended to run. My solution was to integrate the feedback from the phonecallers and loosen up the script. But I suspect that this tactic will work differently with different phonecallers. I’ve certainly worked with a variety of phonecallers, from those that preferred a script to those that preferred to talk off the cuff. Which makes the best phonecaller? Neither. Both. The ideal phonecaller works with the situation that is presented to them nimbly and professionally while collecting complete and relevant data from the most reliable source. As much of the time as possible.

At this point, I’ve come pretty far afield of my original point, which is that all of these competing predictive strategies have complicated underpinnings.

And what of that?

I believe that the best research is conscious of its strengths and weaknesses and not afraid to work with other strategies in order to generate the most comprehensive picture. As we see comparisons and horse races develop between analytic strategies, I think the best analyses we’ll see will be the ones that fit the results of each of the strategies together, simultaneously developing a fuller breakdown of the election and a fuller picture of our new research environment.

“Not everything that can be counted counts”

“Not everything that counts can be counted, and not everything that can be counted counts” – sign in Einstein’s Princeton office

This quote is from one of my favorite survey reminder postcards of all time, along with an image from from the Emilio Segre visual archives. The postcard layout was an easy and pleasant decision made in association with a straightforward survey we have conducted for nearly a quarter century. …If only social media analysis could be so easy, pleasant or straightforward!

I am in the process of conducting an ethnography of DC taxi drivers. I was motivated to do this study because of the persistent disconnect between the experiences and reports of the taxi drivers and riders I hear from regularly and the snarky (I know this term does not seem technical, but it is absolutely data motivated!) riders who dominate participatory media sources online. My goal at this point of the project is to chase down the disconnect in media participation and see how it maps to policy deliberations and offline experiences. This week I decided to explore ways of quantifying the disconnect.

Inspired by this article in jedem (the eJournal of eDemocracy and Open Government), I decided to start my search using framework based in Social Network Analysis (SNA), in order to use elements of connectedness, authority and relevance as a base. Fortunately, SNA frameworks are widely available to analysts on a budget in the form of web search engines! I went through the first 22 search results for a particular area of interest to my study: the mandatory GPS policy. Of these 22 sites, only 11 had active web 2.0 components. Across all of these sites, there were just two comments from drivers. Three of the sites that didn’t have any comments from drivers did have one post each that sympathized with or defended DC taxi drivers. The remaining three sites had no responses from taxi drivers and no sympathetic responses in defense of the drivers. Barring a couple of comments that were difficult to divine, the rest of the comments were negative comments about DC taxi drivers or the DC taxi industry. This matched my expectations, and, predictably, didn’t match any of my interviews or offline investigations.

The question at this point is one of denominator.

The easiest denominator to use, and, in fact, the least complicated was the number of sites. Using this denominator, only one quarter of the sites had any representation from a DC taxi driver. This is significant, given that the discussions were about aspects of their livelihood, and the drivers will be the most closely affected by the regulatory changes. This is a good, solid statistic from which to investigate the influence of web 2.0 on local policy enactment. However, it doesn’t begin to show the lack of representation the way that a denominator such as number of posts, number of posters, or number of opinions would have. But each one of these alternative denominators has its own set of headaches. Does it matter if one poster expresses an opinion once and another expresses another, slightly different opinion more than once? If everyone agrees, what should the denominator be? What about responses that contain links that are now defunct or insider references that aren’t meaningful to me? Should I consider measures of social capital, endorsements, social connectedness, or the backgrounds of individual posters?

The simplest figure also doesn’t show one of the most striking aspects of this finding; the relative markedness of these posts. In the context of predominantly short, snarky and clever responses, one of the comments began with a formal “Dear DC city councilmembers and intelligent  taxpayers,” and the other spread over three dense, winding posts in large paragraph form.

This brings up an important aspect of social media; that of social action. If every comment is a social action with social intentions, what are the intentions of the posters and how can these be identified? I don’t believe that the majority of posts left were intended as a voice in local politics, but the comments from the drivers clearly were. The majority of posts represent attempts to warrant social capital using humor, not attempts to have a voice in local politics. And they repeated phrases that are often repeated in web 2.0 discussions about the DC taxi situation, but rarely repeated elsewhere. This observation, of course, is pretty meaningless without being anchored to the data itself, both quantitatively and qualitatively. And it makes for some interesting ‘next steps’ in a project that is certainly not short of ‘next steps.’

The main point I want to make here is about the nature of variables in social media research. Compared to a survey, where you ask a question, determined in advance, and have a set of answers to work with in your analysis, you are free to choose your own variables for your analysis. Each choice brings with it a set of constraints and advantages, and some fit your data better than others. But the path to analysis can be a more difficult path to take, and more justification about the choices you make is important. To augment this, a quantitative analysis, which can sometimes have very arbitrary or less clear choices included in it, is best supplemented with a qualitative analysis that delves into the answers themselves and why they fit the coding structure you have imposed.

In all of this, I have quite a bit of work out ahead of me.

Repeating language: what do we repeat, and what does it signal?

Yesterday I attended a talk by Jon Kleinberg entitled “Status, Power & Incentives in Social Media” in Honor of the UMD Human-Computer Interaction Lab’s 30th Anniversary.


This talk was dense and full of methods that are unfamiliar to me. He first discussed logical representations of human relationships, including orientations of sentiment and status, and then he ventured into discursive evidence of these relationships. Finally, he introduced formulas for influence in social media and talked about ways to manipulate the formulas by incentivizing desired behavior and deincentivizing less desired behavior.


In Linguistics, we talk a lot about linguistic accommodation. In any communicative event, it is normal for participant’s speech patterns to converge in some ways. This can be through repetition of words or grammatical structures. Kleinberg presented research about the social meaning of linguistic accommodation, showing that participants with less power tend to accommodate participants with more power more than participants with more power accommodate participants with less power. This idea of quantifying social influence is a very powerful notion in online research, where social influence is a more practical and useful research goal than general representativeness.


I wonder what strategies we use, consciously and unconsciously, when we accommodate other speakers. I wonder whether different forms of repetition have different underlying social meanings.


At the end of the talk, there was some discussion about both the constitution of iconic speech (unmarked words assembled in marked ways) and the meaning of norm flouting.


These are very promising avenues for online text research, and it is exciting to see them play out.

Getting to know your data

On Friday, I had the honor of participating in a microanalysis video discussion group with Fred Erickson. As he was introducing the process to the new attendees, he said something that really caught my attention. He said that videos and field notes are not data until someone decides to use them for research.

As someone with a background in survey research, the question of ‘what is data?’ was never really on my radar before graduate school. Although it’s always been good practice to know where your data comes from and what it represents in order to glean any kind of validity from your work, data was unquestioningly that which you see in a spreadsheet or delimited file, with cases going down and variables going across. If information could be formed like this, it was data. If not, it would need some manipulation. I remember discussing this with Anna Trester a couple of years ago. She found it hard to understand this limited framework, because, for her, the world was a potential data source. I’ve learned more about her perspective in the last couple of years, working with elements that I never before would have characterized as data, including pictures, websites, video footage of interactions, and now fieldwork as a participant observer.

Dr Erickson’s observation speaks to some frustration I’ve had lately, trying to understand the nature of “big data” sets. I’ve seen quite a bit of people looking for data, any data, to analyze. I could see the usefulness of this for corpus linguists, who use large bodies of textual data to study language use. A corpus linguist is able to use large bodies of text to see how we use words, which is a systematically patterned phenomena that goes much deeper than a dictionary definition could. I could also see the usefulness of large datasets in training programs to recognize genre, a really critical element in automated text analysis.

But beyond that, it is deeply important to understand the situated nature of language. People don’t produce text for the sake of producing text. Each textual element represents an intentioned social action on the part of the writer, and social goals are accomplished differently in different settings. In order for studies of textual data to produce valid conclusions with social commentary, contextual elements are extremely important.

Which leads me to ask if these agnostic datasets are being used solely as academic exercises by programmers and corpus linguists or if our hunger for data has led us to take any large body of information and declare it to be useful data from which to excise valid conclusions? Worse, are people using cookie cutter programs to investigate agnostic data sets like this without considering the wider validity?

I urge anyone looking to create insight from textual data to carefully get to know their data.

The Bones of Solid Research?

What are the elements that make research “research” and not just “observation?” Where are the bones of the beast, and do all strategies share the same skeleton?

Last Thursday, in my Ethnography of Communication class, we spent the first half hour of class time taking field notes in the library coffee shop. Two parts of the experience struck me the hardest.

1.) I was exhausted. Class came at the end of a long, full work day, toward the end of a week that was full of back to school nights, work, homework and board meetings. I began my observation by ordering a (badly needed) coffee. My goal as I ordered was to see how few words I had to utter in order to complete the transaction. (In my defense, I am usually relatively talkative and friendly…) The experience of observing and speaking as little as possible reminded me of one of the coolest things I’d come across in my degree study: Charlotte Linde, SocioRocketScientist at NASA

2.) Charlotte Linde, SocioRocketScientist at NASA. Dr Linde had come to speak with the GU Linguistics department early in my tenure as a grad student. She mentioned that her thesis had been about the geography of communication- specifically: How did the layout of an (her?) apartment building help shape communication within it?

This idea had struck me, and stayed with me, but it didn’t really make sense until I began to study Ethnography of Communication. In the coffee shop, I structured my fieldnotes like a map and investigated it in terms of zones of activities. Then I investigated expectations and conventions of communication in each zone. As a follow-up to this activity, I’ll either return to the same shop or head to another coffee shop to do some contrastive mapping.

The process of Ethnography embodies the dynamic between quantitative and qualitative methods for me. When I read ethnographic research, I really find myself obsessing over ‘what makes this research?’ and ‘how is each statement justified?’ Survey methodology, which I am still doing every day at work, is so deeply structured that less structured research is, by contrast, a bit bewildering or shocking. Reading about qualitative methodology makes it seem so much more dependable and structured than reading ethnographic research papers does.

Much of the process of learning ethnography is learning yourself; your priorities, your organization, … learning why you notice what you do and evaluate it the way you do… Conversely, much of the process of reading ethnographic research seems to involve evaluation or skepticism of the researcher, the researcher’s perspective and the researcher’s interpretation. As a reader, the places where the researcher’s perspective varies from mine is clear and easy to see, as much as my own perspective is invisible to me.

All of this leads me back to the big questions I’m grappling with. Is this structured observational method the basis for all research? And how much structure does observation need to have in order to qualify as research?

I’d be interested to hear what you think of these issues!

Unlocking patterns in language

In linguistics study, we quickly learn that all language is patterned. Although the actual words we produce vary widely, the process of production does not. The process of constructing baby talk was found to be consistent across kids from 15 different languages. When any two people who do not speak overlapping languages come together and try to speak, the process is the same. When we look at any large body of data, we quickly learn that just about any linguistic phenomena is subject to statistical likelihood. Grammatical patterns govern the basic structure of what we see in the corpus. Variations in language use may tweak these patterns, but each variation is a patterned tweak with its own set of statistical likelihoods. Variations that people are quick to call bastardizations are actually patterned departures from what those people consider to be “standard” english. Understanding “differences not defecits” is a crucially important part of understanding and processing language, because any variation, even texting shorthand, “broken english,” or slang, can be better understood and used once its underlying structure is recognized.

The patterns in language extend beyond grammar to word usage. The most frequent words in a corpus are function words such as “a” and “the,” and the most frequent collocations are combinations like “and the” or “and then it.” These patterns govern the findings of a lot of investigations into textual data. A certain phrase may show up as a frequent member of a dataset simply because it is a common or lexicalized expression, and another combination may not appear because it is more rare- this could be particularly problematic, because what is rare is often more noticeable or important.

Here are some good starter questions to ask to better understand your textual data:

1) Where did this data come from? What was it’s original purpose and context?

2) What did the speakers intend to accomplish by producing this text?

3) What type of data or text, or genre, does this represent?

4) How was this data collected? Where is it from?

5) Who are the speakers? What is their relationship to eachother?

6) Is there any cohesion to the text?

7) What language is the text in? What is the linguistic background of the speakers?

8) Who is the intended audience?

9) What kind of repetition do you see in the text? What about repetition within the context of a conversation? What about repetition of outside elements?

10) What stands out as relatively unusual or rare within the body of text?

11) What is relatively common within the dataset?

12) What register is the text written in? Casual? Academic? Formal? Informal?

13) Pronoun use. Always look at pronoun use. It’s almost always enlightening.

These types of questions will take you much further into your dataset that the knee-jerk question “What is this text about?”

Now, go forth and research! …And be sure to report back!

A fleet of research possibilities and a scattering of updates

Tomorrow is my first day of my 3rd year as a Masters student in the MLC program at Georgetown University. I’m taking the slowwww route through higher ed, as happens when you work full-time, have two kids and are an only child who lost her mother along the way.

This semester I will [finally] take the class I’ve been borrowing pieces from for the past two years: Ethnography of Communication. I’ve decided to use this opportunity do an ethnography of DC taxi drivers. My husband is a DC taxi driver, so in essence this research will build on years of daily conversations. I find that the representation of DC taxi drivers in the news never quite approximates what I’ve seen, and that is my real motivation for the project. I have a couple of enthusiastic collaborators: my husband and a friend whose husband is also a DC taxi driver and who has been a vocal advocate for DC taxi drivers.

I am really eager to get back into linguistics study. I’ve been learning powerful sociolinguistic methods to recognize and interpret patterning in discourse, but it is a challenge not to fall into the age old habit of studying aboutness or topicality, which is much less patterned and powerful.

I have been fortunate enough to combine some of my new qualitative methods with my more quantitative work on some of the reports I’ve completed over the summer. I’m using the open ended responses that we usually don’t fully exploit in order to tell more detailed stories in our survey reports. But balancing quantitative and qualitative methods is very difficult, as I’ve mentioned before, because the power punch of good narrative blows away the quiet power of high quality, representative statistical analysis. Reporting qualitative findings has to be done very carefully.

Over the summer I had the wonderful opportunity to apply my sociolinguistics education to a medical setting. Last May, while my mom was on life support, we were touched by a medical error when my mom was mistakenly declared brain dead. Because she was an organ donor, her life support was not withdrawn before the error was recognized. But the fallout from the error was tremendous. The problem arose because two of her doctors were consulting by phone about their patients, and each thought they were talking about a different patient. In collaboration with one of the doctors involved, I’ve learned a great amount about medical errors and looked at the role of linguistics in bringing awareness to potential errors of miscommunication in conversation. This project was different from other research I’ve done, because it did not involve conducting new research, but rather rereading foundational research and focusing on conversational structure.

In this case, my recommendations were for an awareness of existing conversational structures, rather than an imposition of a new order or procedure. My recommendations, developed in conjunction with Dr Heidi Hamilton, the chair of our linguistics department and medical communication expert, were to be aware of conversational transition points, to focus on the patient identifiers used, and to avoid reaching back or ahead to other patients while discussing a single patient. Each patient discussion must be treated as a separate conversation. Conversation is one of the largest sources of medical error and must be approached carefully is critically important. My mom’s doctor and I hope to make a Grand Rounds presentation out of this effort.

On a personal level, this summer has been one of great transitions. I like to joke that the next time my mom passes away I’ll be better equipped to handle it all. I have learned quite a bit about real estate and estate law and estate sales and more. And about grieving, of course. Having just cleaned through my mom’s house last week, I am beginning this new school year more physically, mentally and emotionally tired than I have ever felt. A close friend of mine has recently finished an extended series of chemo and radiation, and she told me that she is reveling in her strength as it returns. I am also reveling in my own strength, as it returns. I may not be ready for the semester or the new school year, but I am ready for the first day of class tomorrow. And I’m hopeful. For the semester, for the research ahead, for my family, and for myself. I’m grateful for the guidance of my newest guardian angel and the inspiration of great research.

A snapshot from a lunchtime walk

In the words of Sri Aurobindo, “By your stumbling the world is perfected”

Rethinking demographics in research

I read a blog post on the LoveStats blog today that referred to one of the most widely regarded critiques of social media research: the lack of demographic information.

In traditional survey research, demographic information is a critically important piece of the analysis. We often ask questions like “Yes 50% of the respondents said they had encountered gender harassment, but what is the breakdown by gender?” The prospect of not having this demographic information is a large enough game changer to cast the field of social media research into the shade.

Here I’d like to take a sidestep and borrow a debate from linguistics. In the linguistic subfield of conversation analysis, there are two main streams of thought about analysis. One believes in gathering as much outside data as possible, often through ethnographic research, to inform a detailed understanding of the conversation. The second stream is rooted in the purity of the data. This stream emphasizes our dynamic construction of identity over the stability of identity. The underlying foundation of this stream is that we continually construct and reconstruct the most important and relevant elements of our identity in the process of our interaction. Take, for example, a study of an interaction between a doctor and a patient. The first school would bring into the analysis a body of knowledge about interactions between doctors and patients. The second would believe that this body of knowledge is potentially irrelevant or even corrupting to the analysis, and if the relationship is in fact relevant it will be constructed within the excerpt of study. This begs the question: are all interactions between doctors and patients primarily doctor patient interactions? We could address this further through the concept of framing and embedded frames (a la Goffman), but we won’t do that right now.

Instead, I’ll ask another question:
If we are studying gender discrimination, is it necessary to have a variable for gender within our datasouce?

My kneejerk reaction to this question, because of my quantitative background, is yes. But looking deeper: is gender always relevant? This does strongly depend on the datasource, so let’s assume for this example that the stimulus was a question on a survey that was not directly about discrimination, but rather more general (e.g. “Additional Comments:”).

What if we took that second CA approach, the purist approach, and say that where gender is applicable to the response it will be constructed within that response. The question now becomes ‘how is gender constructed within a response?’ This is a beautiful and interesting question for a linguist, and it may be a question that much better fits the underlying data and provides deeper insight into the data. It also turns the age old analytic strategy on its head. Now we can ask whether a priori assumptions that the demographics could or do matter are just rote research or truly the productive and informative measures that we’ve built them up to be?

I believe that this is a key difference between analysis types. In the qualitative analysis of open ended survey questions, it isn’t very meaningful to say x% of the respondents mentioned z, and y% of the respondents mentioned d, because a nonmention of z or d is not really meaningful. Instead we go deeper into the data to see what was said about d or z. So the goal is not prevalence, but description. On the other hand, prevalence is a hugely important aspect of quantitative analysis, as are other fun statistics which feed off of demographic variables.

The lesson in all of this is to think carefully about what is meaningful information that is relevant to your analysis and not to make assumptions across analytic strategies.

To go big, first think small

We use language all of the time. Because of this, we are all experts in language use. As native speakers of a language, we are experts in the intricacies of that language.

Why, then, do people study linguistics? Aren’t we all linguists?

Absolutely not.

We are experts in *using* language, but we are not experts in the methods we employ. Believe it or not, much of the process of speaking and hearing is not conscious. If it was, we would be sensorally overwhelmed with the sheer volume of words around us. Instead, listening comprehension involves a process of merging what we expect to hear with what we gauge to be the most important elements of what we do hear. The process of speaking involves merging our estimates of what the people we communicate with know and expect to hear with our understanding of the social expectations surrounding our words and our relationships and distilling these sources into a workable expression. The hearer will reconstruct elements of this process using cues that are sometimes conscious and sometimes not.

We often think of language as simple and mechanistic, but it is not simple at all. As conversational analysts, our job is to study conversation that we have access to in an attempt to reconstruct the elements that constituted the interaction. Even small chunks of conversation encode quite a bit of information.

The process of conversation analysis is very much contrary to our sense of language as regular language users. This makes the process of explaining our research to people outside our field difficult. It is difficult to justify the research, and it is difficult to explain why such small pieces of data can be so useful, when most other fields of research rely on greater volumes of data.

In fact, a greater volume of data can be more harmful than helpful in conversation analysis. Conversation is heavily dependent on its context; on the people conversing, their relationship, their expectations, their experiences that day, the things on their mind, what they expect from each other and the situation, their understanding of language and expectations, and more. The same sentence can have greatly different meanings once those factors are taken into account.

At a time when there is so much talk of the glory of big data, it is especially important to keep in mind the contributions of small data. These contributions are the ones that jeopardize the utility and promise of big data, and if these contributions can be captured in creative ways, they will be the true promise of the field.

Not what language users expect to see, but rather what we use every day, more or less consciously.