“Not everything that can be counted counts”

“Not everything that counts can be counted, and not everything that can be counted counts” – sign in Einstein’s Princeton office

This quote is from one of my favorite survey reminder postcards of all time, along with an image from the Emilio Segrè Visual Archives. The postcard layout was an easy and pleasant decision made in association with a straightforward survey we have conducted for nearly a quarter century. …If only social media analysis could be so easy, pleasant or straightforward!

I am in the process of conducting an ethnography of DC taxi drivers. I was motivated to do this study because of the persistent disconnect between the experiences and reports of the taxi drivers and riders I hear from regularly and the snarky (I know this term does not seem technical, but it is absolutely data motivated!) riders who dominate participatory media sources online. My goal at this point of the project is to chase down the disconnect in media participation and see how it maps to policy deliberations and offline experiences. This week I decided to explore ways of quantifying the disconnect.

Inspired by this article in JeDEM (the eJournal of eDemocracy and Open Government), I decided to start my search using a framework based in Social Network Analysis (SNA), in order to use elements of connectedness, authority and relevance as a base. Fortunately, SNA frameworks are widely available to analysts on a budget in the form of web search engines! I went through the first 22 search results for a particular area of interest to my study: the mandatory GPS policy. Of these 22 sites, only 8 had active web 2.0 components. Across all of these sites, there were just two comments from drivers. Three of the sites that didn’t have any comments from drivers did have one post each that sympathized with or defended DC taxi drivers. The remaining three sites had no responses from taxi drivers and no sympathetic responses in defense of the drivers. Barring a couple of comments that were difficult to divine, the rest of the comments were negative comments about DC taxi drivers or the DC taxi industry. This matched my expectations, and, predictably, didn’t match any of my interviews or offline investigations.

The question at this point is one of denominator.

The easiest denominator to use, and, in fact, the least complicated was the number of sites. Using this denominator, only one quarter of the sites had any representation from a DC taxi driver. This is significant, given that the discussions were about aspects of their livelihood, and the drivers will be the most closely affected by the regulatory changes. This is a good, solid statistic from which to investigate the influence of web 2.0 on local policy enactment. However, it doesn’t begin to show the lack of representation the way that a denominator such as number of posts, number of posters, or number of opinions would have. But each one of these alternative denominators has its own set of headaches. Does it matter if one poster expresses an opinion once and another expresses another, slightly different opinion more than once? If everyone agrees, what should the denominator be? What about responses that contain links that are now defunct or insider references that aren’t meaningful to me? Should I consider measures of social capital, endorsements, social connectedness, or the backgrounds of individual posters?
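To make the denominator problem concrete, here is a minimal Python sketch. The records are invented for illustration (they are not my data), and the names `comments` and `driver_share` are hypothetical. It also assumes that a poster name identifies the same person across sites, which is itself one of the headaches described above.

```python
# Hypothetical comment records: (site, poster, is_driver).
# These rows are made up for illustration; they are NOT the study's data.
comments = [
    ("site_a", "rider1", False),
    ("site_a", "rider2", False),
    ("site_a", "driver1", True),
    ("site_b", "rider1", False),
    ("site_b", "rider3", False),
    ("site_c", "rider4", False),
    ("site_d", "driver2", True),
    ("site_d", "driver2", True),
    ("site_d", "rider5", False),
]


def driver_share(records, unit):
    """Share of units (sites, posters, or posts) that include a driver's voice."""
    if unit == "posts":
        # Every comment counts once, so prolific posters weigh more.
        total = len(records)
        with_driver = sum(1 for _, _, is_driver in records if is_driver)
    elif unit in ("sites", "posters"):
        # Collapse comments to unique sites or unique poster names.
        position = 0 if unit == "sites" else 1
        all_units = {record[position] for record in records}
        driver_units = {record[position] for record in records if record[2]}
        total, with_driver = len(all_units), len(driver_units)
    else:
        raise ValueError(f"unknown unit: {unit}")
    return with_driver / total


for unit in ("sites", "posters", "posts"):
    print(f"{unit:8s} {driver_share(comments, unit):.2f}")
```

The same nine comments give three different answers depending on the unit chosen, which is why the choice of denominator needs to be justified rather than assumed.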

The simplest figure also doesn’t show one of the most striking aspects of this finding: the relative markedness of these posts. In the context of predominantly short, snarky and clever responses, one of the comments began with a formal “Dear DC city councilmembers and intelligent taxpayers,” and the other spread over three dense, winding posts in large paragraph form.

This brings up an important aspect of social media: that of social action. If every comment is a social action with social intentions, what are the intentions of the posters and how can these be identified? I don’t believe that the majority of posts left were intended as a voice in local politics, but the comments from the drivers clearly were. The majority of posts represent attempts to warrant social capital using humor, not attempts to have a voice in local politics. And they repeated phrases that are often repeated in web 2.0 discussions about the DC taxi situation, but rarely repeated elsewhere. This observation, of course, is pretty meaningless without being anchored to the data itself, both quantitatively and qualitatively. And it makes for some interesting ‘next steps’ in a project that is certainly not short of ‘next steps.’

The main point I want to make here is about the nature of variables in social media research. In a survey, you ask a question, determined in advance, and have a set of answers to work with in your analysis. In social media research, by contrast, you are free to choose your own variables. Each choice brings with it a set of constraints and advantages, and some fit your data better than others. But this freedom makes the path to analysis more difficult, and it becomes more important to justify the choices you make. For that reason, a quantitative analysis, which can sometimes include arbitrary or unclear choices, is best supplemented with a qualitative analysis that delves into the answers themselves and why they fit the coding structure you have imposed.

In all of this, I have quite a bit of work ahead of me.

I think I’m using “big data” incorrectly

I think I’m using the term “big data” incorrectly. When I talk about big data, I’m referring to the massive amount of freely available information that researchers can collect from the internet. My expectation is that the researchers must choose which firehose best fits their research goals, collect and store the data, and groom it to the point of usability before using it to answer targeted questions or examining it for answers in need of a question.

The first element of this that makes it “big data” to me, is that the data is freely available and not subject to any privacy violations. It can be difficult to collect and store, because of its sheer size, but it is not password protected. For this reason, I would not consider Facebook to be a source for “big data.” I believe that the overwhelming majority of Facebook users impose some privacy controls, and the resulting, freely available information cannot be assigned any kind of validity. There are plenty of measures of inclusion for online research, and ignorance about privacy rules or sheer exhibitionism are not target qualities by any of these standards.

The second crucial element to my definition of “big data” is structure. My expectation is that it is in any researcher’s interest to understand the genesis and structure of their data as much as possible, both for the sake of grooming, and for the sake of assigning some sense of validity to their findings. Targeted information will be laid out and signaled very differently in different online environments, and the researcher must work to develop both working delimiters to find probable working targets and a sense of context for the data.

The third crucial element is representativeness. What do these findings represent? Under what conditions? “Big data” has a wide array of answers to these questions. First, it is crucial to note that it is not representative of the general population. It represents only the networked members of a population who were actively engaging with an online interface within the captured window of time in a way that left a trace or produced data. Because of this, we look at individual people by their networks, and not by their representativeness. Who did they influence, and to what degree could they influence those people? And we look at other units of analysis, such as the website that the people were contributing on, the connectedness of that website, and the words themselves, and their degree of influence, both directly and indirectly.

Given those elements of understanding, we are able to provide a framework from which the analysis of the data itself is meaningful and useful.

I’m aware that my definition is not the generally accepted definition. But for the time being I will continue to use it for two reasons:

1. Because I haven’t seen any other terms that better fit
2. Because I think that it is critically important that any talk about data use is tied to measures that encourage the researcher to think about the meaning and value of their data

It’s my hope that this is a continuing discussion. In the meantime, I will trudge on in idealistic ignorance.

Adventures in Digital Puberty

My digital enthusiasm hit a roadblock lately. My oldest daughter discovered the addictive world of social gaming. What began with her checking out an ad on TV for a gaming website soon evolved into pops of smuggled light in a dark room after bedtime. I looked into this gaming website, and I was able to read all kinds of horror stories about it. Parents told tales of bullying, of graphic talk and advances in chatrooms, and of kids receiving points for dating.

Once you consider some features of this site- the chatrooms, the constant clothes changing (into mostly skimpy outfits), the pursuit of cash and fame, and the encouragement to “date,” it’s easy to see this place as a playground for the perverted. It didn’t help that my first questions about the site were answered by my daughter with a speech about the site’s value as a teaching tool. Apparently they give quizzes, and they give you the answers if you get the questions wrong. So, for example, she learned from this site who Brad Pitt is married to. Although I am a big fan of learning tools, I’m not sure I’d characterize celebrity gossip as useful or necessary knowledge…

I know that some parents would (& do) forbid their kids from going to the site. My first reaction was to limit her time there as much as possible. But today I swallowed my prejudice and jumped in.

The truth is that if I did just dismiss this site altogether, she would still find ways to visit it. I would much rather that she not hide her activity, but instead have me to talk to about what she encounters on the site. So I told her about my experiences trying out chatrooms when I was younger and about what I’d read about this site. We talked in detail about the different features of her site. I offered to sit down with her any time she wanted to talk about things she saw. We talked about bullying, we talked about the possibility of people not being who they say they are, and we talked about making connections online. We talked about her favorite parts of the site and the parts that made her uncomfortable. She told me about the friends she made and what brought them together. And I pledged to talk to her about it again any time she wanted.

She was full of questions and of stories and examples, and I was really struck that I never would have heard any of it had I not gotten over my initial set of worries and discussed this with her. And what would that have meant? She wouldn’t have had a chance to vet her strategies for safety and bullying with me, and she wouldn’t feel comfortable sharing some of her stranger encounters. She would be left without my guidance when determining what was acceptable to her.

From time to time, we parents need a kick in the pants to remind us that raising kids isn’t about creating copies of ourselves, but about providing guidance and safety for them as they develop. She is a different person, growing among a different set of influences. And that’s okay with me.

I did, however, discuss all of this with her as we headed out to the woods to take a gadget free walk among the fall colors!

Repeating language: what do we repeat, and what does it signal?

Yesterday I attended a talk by Jon Kleinberg entitled “Status, Power & Incentives in Social Media” in honor of the UMD Human-Computer Interaction Lab’s 30th Anniversary.

This talk was dense and full of methods that are unfamiliar to me. He first discussed logical representations of human relationships, including orientations of sentiment and status, and then he ventured into discursive evidence of these relationships. Finally, he introduced formulas for influence in social media and talked about ways to manipulate the formulas by incentivizing desired behavior and disincentivizing less desired behavior.

In Linguistics, we talk a lot about linguistic accommodation. In any communicative event, it is normal for participants’ speech patterns to converge in some ways. This can be through repetition of words or grammatical structures. Kleinberg presented research about the social meaning of linguistic accommodation, showing that participants with less power tend to accommodate participants with more power more than participants with more power accommodate participants with less power. This idea of quantifying social influence is a very powerful notion in online research, where social influence is a more practical and useful research goal than general representativeness.
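One crude way to put a number on this kind of accommodation is to ask what share of a reply's words echo the turn it responds to. The sketch below is my own toy illustration in Python, not the measure Kleinberg presented; the function names `tokenize` and `echo_rate` and the two-turn exchange are made up.

```python
import re


def tokenize(text):
    """Lowercase a turn and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def echo_rate(previous_turn, reply):
    """Fraction of the reply's tokens that also appear in the previous turn."""
    previous_words = set(tokenize(previous_turn))
    reply_words = tokenize(reply)
    if not reply_words:
        return 0.0
    echoed = sum(1 for word in reply_words if word in previous_words)
    return echoed / len(reply_words)


# Invented exchange between a higher-power and a lower-power speaker.
boss = "We should probably revisit the budget before Friday"
staff = "Yes, we should revisit the budget"

print(echo_rate(boss, staff))  # how much the staff member echoes the boss
print(echo_rate(staff, boss))  # how much the boss echoes the staff member
```

A serious version would need to control for turn length and for function words that everyone repeats anyway, but even this rough ratio shows how an asymmetry between speakers could be compared.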

I wonder what strategies we use, consciously and unconsciously, when we accommodate other speakers. I wonder whether different forms of repetition have different underlying social meanings.

At the end of the talk, there was some discussion about both the constitution of iconic speech (unmarked words assembled in marked ways) and the meaning of norm flouting.

These are very promising avenues for online text research, and it is exciting to see them play out.

Getting to know your data

On Friday, I had the honor of participating in a microanalysis video discussion group with Fred Erickson. As he was introducing the process to the new attendees, he said something that really caught my attention. He said that videos and field notes are not data until someone decides to use them for research.

As someone with a background in survey research, the question of ‘what is data?’ was never really on my radar before graduate school. Although it’s always been good practice to know where your data comes from and what it represents in order to glean any kind of validity from your work, data was unquestioningly that which you see in a spreadsheet or delimited file, with cases going down and variables going across. If information could be formed like this, it was data. If not, it would need some manipulation. I remember discussing this with Anna Trester a couple of years ago. She found it hard to understand this limited framework, because, for her, the world was a potential data source. I’ve learned more about her perspective in the last couple of years, working with elements that I never before would have characterized as data, including pictures, websites, video footage of interactions, and now fieldwork as a participant observer.

Dr Erickson’s observation speaks to some frustration I’ve had lately, trying to understand the nature of “big data” sets. I’ve seen quite a few people looking for data, any data, to analyze. I could see the usefulness of this for corpus linguists, who use large bodies of textual data to study language use. A corpus linguist is able to use large bodies of text to see how we use words, which is a systematically patterned phenomenon that goes much deeper than a dictionary definition could. I could also see the usefulness of large datasets in training programs to recognize genre, a really critical element in automated text analysis.

But beyond that, it is deeply important to understand the situated nature of language. People don’t produce text for the sake of producing text. Each textual element represents an intentioned social action on the part of the writer, and social goals are accomplished differently in different settings. In order for studies of textual data to produce valid conclusions with social commentary, contextual elements are extremely important.

Which leads me to ask: are these agnostic datasets being used solely as academic exercises by programmers and corpus linguists, or has our hunger for data led us to take any large body of information and declare it to be useful data from which to extract valid conclusions? Worse, are people using cookie cutter programs to investigate agnostic data sets like this without considering the wider validity?

I urge anyone looking to create insight from textual data to carefully get to know their data.

A brave new vision of the future of social science

I’ve been typing and organizing my notes from yesterday’s DC-AAPOR event on the past, present and future of survey research (which I still plan to share soon, after a little grooming). The process has been a meditative one.

I’ve been thinking about how I would characterize these same phases- the past, present and future… and then I had a vision of sorts on the way home today that I’d like to share. I’m going to take a minute to be a little post apocalyptic and let the future build itself. You can think of it as a daydream or thought experiment…

The past, I would characterize as the grand discovery of surveys as a tool for data collection; the honing and evolution of that tool in conjunction with its meticulous scientific development and the changing landscape around it; and the growth to dominance and proliferation of the method. The past was an era of measurement, of the total survey error model, of social Science.

The present I would characterize as a rapid coming together, or a perfect storm that is swirling data and ideas and disciplines of study and professions together in a grand sweeping wind. I see the survey folks trudging through the wind, waiting for the storm to pass, feet firmly anchored to solid ground.

The future is essentially the past, turned on its head. The pieces of the past are present, but mixed together and redistributed. Instead of examining the ways in which questions elicit usable data, we look at the data first and develop the questions from patterns in the data. In this era, data is everywhere, of various quality, character and genesis, and the skill is in the sense making.

This future is one of data driven analytic strategies, where research teams intrinsically need to be composed of a spectrum of different, specialized skills.

The kings of this future will be the experts in natural language processing, those with the skill of finding and using patterns in language. All language is patterned. Our job will be to find those patterns and then to discover their social meaning.

The computer scientists and coders will write the code to extract relevant subsets of data, and describe and learn patterns in the data. The natural language processing folks will hone the patterns by grammar and usage. The netnographers will describe and interpret the patterns, the data visualizers will make visual or interactive sense of the patterns, the sociologists will discover constructions of relative social groupings as they emerge and use those patterns. The discourse analysts will look across wider patterns of language and context dependency. The statisticians will make formulas to replicate, describe and evaluate the patterns, and models to predict future behaviors. Data science will be a crucial science built on the foundations of traditional and nontraditional academic disciplines.

How many people does it take to screw in this lightbulb? It depends on the skills of the people or person on the ladder.

Where do surveys fit into this scheme? To be honest, I’m not sure. The success of surveys seems to rest in part on the failure of faster, cheaper methods with a great deal more inherent error.

This is not the only vision possible, but it’s a vision I saw while commuting home at the end of a damned long week… it’s a vision where naturalistic data is valued and experimentation is an extension of research, where diversity is a natural assumption of the model and not a superimposed dynamic, where the data itself and the patterns within it determine what is possible from it. It’s a vision where traditional academics fit only precariously; a future that could just as easily be ruled out by the constraints of the past as it could be adopted unintentionally, where meaning makers rush to be the rigs in the newest gold rush and theory is as desperately pursued as water sources in a drought.

Unlocking patterns in language

In the study of linguistics, we quickly learn that all language is patterned. Although the actual words we produce vary widely, the process of production does not. The process of constructing baby talk was found to be consistent across kids from 15 different languages. When any two people who do not speak overlapping languages come together and try to speak, the process is the same. When we look at any large body of data, we quickly learn that just about any linguistic phenomenon is subject to statistical likelihood. Grammatical patterns govern the basic structure of what we see in the corpus. Variations in language use may tweak these patterns, but each variation is a patterned tweak with its own set of statistical likelihoods. Variations that people are quick to call bastardizations are actually patterned departures from what those people consider to be “standard” English. Understanding “differences not deficits” is a crucially important part of understanding and processing language, because any variation, even texting shorthand, “broken English,” or slang, can be better understood and used once its underlying structure is recognized.

The patterns in language extend beyond grammar to word usage. The most frequent words in a corpus are function words such as “a” and “the,” and the most frequent collocations are combinations like “and the” or “and then it.” These patterns govern the findings of a lot of investigations into textual data. A certain phrase may show up as a frequent member of a dataset simply because it is a common or lexicalized expression, and another combination may not appear because it is rarer. This could be particularly problematic, because what is rare is often more noticeable or important.
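You can watch function words take over a frequency list with just the Python standard library. This is a minimal sketch, assuming a very naive tokenizer (lowercase runs of letters and straight apostrophes) and an invented two-sentence sample. The helper names `tokenize`, `top_words` and `top_bigrams` are my own, not from any text analysis package.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase runs of letters and straight apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(tokens, n=5):
    """The n most frequent single words."""
    return Counter(tokens).most_common(n)


def top_bigrams(tokens, n=5):
    """The n most frequent adjacent word pairs (a simple stand-in for collocations)."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)


# Invented sample text, for illustration only.
sample = (
    "The driver took the long way and the meter kept running. "
    "And then the driver said the GPS was wrong, and the rider laughed."
)
tokens = tokenize(sample)

print(top_words(tokens, 3))
print(top_bigrams(tokens, 3))
```

Even in two sentences, “the” and “and the” rise to the top, which is exactly why raw frequency on its own says little about what a text is doing.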

Here are some good starter questions to ask to better understand your textual data:

1) Where did this data come from? What was its original purpose and context?

2) What did the speakers intend to accomplish by producing this text?

3) What type of data or text, or genre, does this represent?

4) How was this data collected? Where is it from?

5) Who are the speakers? What is their relationship to each other?

6) Is there any cohesion to the text?

7) What language is the text in? What is the linguistic background of the speakers?

8) Who is the intended audience?

9) What kind of repetition do you see in the text? What about repetition within the context of a conversation? What about repetition of outside elements?

10) What stands out as relatively unusual or rare within the body of text?

11) What is relatively common within the dataset?

12) What register is the text written in? Casual? Academic? Formal? Informal?

13) Pronoun use. Always look at pronoun use. It’s almost always enlightening.

These types of questions will take you much further into your dataset than the knee-jerk question “What is this text about?”
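A couple of these questions (pronoun use, and what is rare) can be given a rough first pass in a few lines of Python. This is a sketch under loud assumptions: the pronoun list is a small hand-picked set rather than a complete inventory, "rare" is approximated as "occurs exactly once," and the sample text and the names `pronoun_counts` and `hapaxes` are invented for illustration.

```python
import re
from collections import Counter

# Hand-picked, incomplete list of English pronouns (an assumption).
PRONOUNS = {
    "i", "me", "my", "we", "us", "our", "you", "your",
    "he", "him", "his", "she", "her", "they", "them", "their", "it", "its",
}


def tokenize(text):
    """Naive tokenizer: lowercase runs of letters and straight apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def pronoun_counts(tokens):
    """How often each listed pronoun appears."""
    return Counter(token for token in tokens if token in PRONOUNS)


def hapaxes(tokens):
    """Words that occur exactly once, as a rough proxy for what is rare."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)


# Invented sample text, for illustration only.
sample = "We told them our side. They never asked us. We drive these streets every day."
tokens = tokenize(sample)

print(pronoun_counts(tokens))
print(hapaxes(tokens))
```

Counts like these only point you toward places to read closely. The interpretation still has to come from the context questions in the list.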

Now, go forth and research! …And be sure to report back!

Could our attitude toward marketing determine our field’s future?

In our office, we call it the “cocktail party question:” What do you do for a living? For those of us who work in the area of survey research, this can be a particularly difficult question to answer. Not only do people rarely know much about our work, but they rarely have a great deal of interest in it. I like to think of myself as a survey methodologist, but it is easier in social situations to discuss the focus of my research than my passion for methodology. I work at the American Institute of Physics, so I describe my work as “studying people who study physics.” Usually this description is greeted with an uncomfortable laugh, and the conversation progresses elsewhere. Score!

But the wider lack of understanding of survey research can have larger implications than simply awkward social situations. It can also cause tension with clients who don’t understand our work, our process, or where and how we add expertise to the process. Toward this end, I once wrote a guide for working with clients that separated out each stage in the survey process and detailed what expertise the researcher brings to the stage and what expertise we need from the client. I hoped that it would be a way of both separating and affirming the roles of client and researcher and advertising our firm and our field. I have not yet had the opportunity to use this piece, because of the nature of my current projects, but I’d be happy to share it with anyone who is interested in using or adapting it.

I think about that piece often as I see more talk about big data and social media analysis. Data seems to be everywhere and free, and I wonder what effect this buzz will have on a body of research consumers who might not have respected the role of the researchers from the get-go. We worried when Survey Monkey and other automated survey tools came along, but the current bevy of tools and attitudes could have an exponentially larger impact on our practice.

Survey researchers often thumb their nose at advertising, despite the heavy methodological overlap. Oftentimes there is a knee-jerk reaction against marketing speak. Not only do survey methodologists often thumb their/our noses at the goal and importance of advertising, but they/we often thumb their/our nose at what appears to be evidence of less rigorous methodology. This has led us to a ridiculous point where data and analyses have evolved quickly with the demand and heavy use of advertising and market researchers and evolved strikingly little in more traditional survey areas, like polling and educational research. Much of the rhetoric about social media analysis, text analysis, social network analysis and big data is directed at the marketing and advertising crowd. Translating it to a wider research context and communicating it to a field that is often not eager to adapt to it can be difficult. And yet the exchange of ideas between the sister fields has never been more crucial to our mutual survival and relevance.

One of the goals of this blog has been to approach the changing landscape of research from a methodologically sound, interdisciplinary perspective that doesn’t suffer from the artificial walls and divisions. As I’ve worked on the blog, my own research methodology has evolved considerably. I’m relying more heavily on mixed methods and trying to use and integrate different tools into my work. I’ve learned quite a bit from researchers with a wide variety of backgrounds, and I often feel like I’m belted into a car with the windows down, hurtling down the highways of progress at top speed and trying to control the airflow. And then I often glimpse other survey researchers out the window, driving slowly, sensibly along the access road alongside the highway. I wonder if my mentors feel the change of landscape as viscerally as I do. I wonder how to carry forward the anchors and quality controls that led to such high quality research in the survey realm. I wonder about the future. And the present. About who’s driving, and who in what car is talking to whom? Using what GPS?

Mostly I wonder: could our negative attitude toward advertising and market research drive us right into obscurity? Are we too quick to misjudge the magnitude of the changes afoot?

This post is meant to be provocative, and I hope it inspires some good conversation.

Rethinking demographics in research

I read a blog post on the LoveStats blog today that referred to one of the most widely regarded critiques of social media research: the lack of demographic information.

In traditional survey research, demographic information is a critically important piece of the analysis. We often ask questions like “Yes 50% of the respondents said they had encountered gender harassment, but what is the breakdown by gender?” The prospect of not having this demographic information is a large enough game changer to cast the field of social media research into the shade.

Here I’d like to take a sidestep and borrow a debate from linguistics. In the linguistic subfield of conversation analysis, there are two main streams of thought about analysis. One believes in gathering as much outside data as possible, often through ethnographic research, to inform a detailed understanding of the conversation. The second stream is rooted in the purity of the data. This stream emphasizes our dynamic construction of identity over the stability of identity. The underlying foundation of this stream is that we continually construct and reconstruct the most important and relevant elements of our identity in the process of our interaction. Take, for example, a study of an interaction between a doctor and a patient. The first school would bring into the analysis a body of knowledge about interactions between doctors and patients. The second would believe that this body of knowledge is potentially irrelevant or even corrupting to the analysis, and if the relationship is in fact relevant it will be constructed within the excerpt of study. This raises the question: are all interactions between doctors and patients primarily doctor-patient interactions? We could address this further through the concept of framing and embedded frames (a la Goffman), but we won’t do that right now.

Instead, I’ll ask another question:
If we are studying gender discrimination, is it necessary to have a variable for gender within our datasource?

My knee-jerk reaction to this question, because of my quantitative background, is yes. But looking deeper: is gender always relevant? This does strongly depend on the datasource, so let’s assume for this example that the stimulus was a question on a survey that was not directly about discrimination, but rather more general (e.g. “Additional Comments:”).

What if we took that second CA approach, the purist approach, and said that where gender is applicable to the response it will be constructed within that response? The question now becomes ‘how is gender constructed within a response?’ This is a beautiful and interesting question for a linguist, and it may be a question that much better fits the underlying data and provides deeper insight into the data. It also turns the age-old analytic strategy on its head. Now we can ask whether a priori assumptions that the demographics could or do matter are just rote research or truly the productive and informative measures that we’ve built them up to be.

I believe that this is a key difference between analysis types. In the qualitative analysis of open ended survey questions, it isn’t very meaningful to say x% of the respondents mentioned z, and y% of the respondents mentioned d, because a nonmention of z or d is not really meaningful. Instead we go deeper into the data to see what was said about d or z. So the goal is not prevalence, but description. On the other hand, prevalence is a hugely important aspect of quantitative analysis, as are other fun statistics which feed off of demographic variables.

The lesson in all of this is to think carefully about what is meaningful information that is relevant to your analysis and not to make assumptions across analytic strategies.

Do you ever think about interfaces? Because I do. All the time.

Did you ever see the movie Singles? It came out in the early 90s, shortly before the alternative scene really blew up and I dyed [part of] my hair blue and thought seriously about piercings. Singles was a part of the growth of the alternative movement. In the movie, there is a moment when one character says to another “Do you ever think about traffic? Because I do. All the time.” I spent quite a bit of time obsessing over that line, about what it meant, and, more deeply, what it signaled.

I still think about that line. As I drove toward the turnoff to my mom’s street during our 4th of July vacation, I saw what looked like the turn lane for her street, but it was actually an intersection-less, left-turning split immediately preceding the real left turn lane for her street. It threw me off every time, and I kept remembering that romantic moment in Singles when the two characters were getting to know each other’s quirks, and the man was talking about traffic. And it was okay, even cool, to be quirky and think or talk about traffic, even during a romantic moment.

I don’t think about traffic often. But I am no less quirky. Lately, I tend to think about interfaces. Before my first brush with NLP (Natural Language Processing), I thought quite a bit about alternatives to e-mail. Since I discovered the world of text analytics, I have been thinking quite a bit about ways to integrate the knowledge across different fields about methods for text analysis and the needs of quantitative and qualitative researchers. I want to think outside of the sentiment box, because I believe that sentiment analysis does not fully address the underlying richness of textual data. I want to find a way to give researchers what they need, not what they think they want. Recently, my thinking on this topic has flipped. Instead of thinking from the data end, or the analytic possibilities end, or about what programs already exist and what they do, I have started to think about interfaces. This feels like a real epiphany. Once we think about the problem from an interface, or user experience perspective, we can better utilize existing technology and harness user expectations.

Have you read the new Imagine book about how creativity works? I believe that this strategy is the natural step after spending time zoning out on the web, thinking, or not thinking, about research. The more time you cruise, the better feel you develop for what works and what doesn’t, the more you learn what to expect. Interfaces are simply the masks we put on datasets of all sorts. The data could be the world wide web as a whole, results from a site or time period, a database of merchandise, or even a set of open ended survey responses. The goal is to streamline the searching interface and then make it available for use on any number of datasets. We use NLP every day when we search the internet, or shop. We understand it intuitively. Why don’t we extend that understanding to text analysis?

I find myself thinking about what this interface should look like and what I want this program to do.

Not traffic, not as romantic. But still quirky and all-encompassing.