Rethinking demographics in research

I read a blog post on the LoveStats blog today that referred to one of the most widely regarded critiques of social media research: the lack of demographic information.

In traditional survey research, demographic information is a critically important piece of the analysis. We often ask questions like “Yes 50% of the respondents said they had encountered gender harassment, but what is the breakdown by gender?” The prospect of not having this demographic information is a large enough game changer to cast the field of social media research into the shade.

Here I’d like to take a sidestep and borrow a debate from linguistics. In the linguistic subfield of conversation analysis, there are two main streams of thought about analysis. One believes in gathering as much outside data as possible, often through ethnographic research, to inform a detailed understanding of the conversation. The second stream is rooted in the purity of the data. This stream emphasizes our dynamic construction of identity over the stability of identity. The underlying foundation of this stream is that we continually construct and reconstruct the most important and relevant elements of our identity in the process of our interaction. Take, for example, a study of an interaction between a doctor and a patient. The first school would bring into the analysis a body of knowledge about interactions between doctors and patients. The second would believe that this body of knowledge is potentially irrelevant or even corrupting to the analysis, and if the relationship is in fact relevant it will be constructed within the excerpt of study. This raises the question: are all interactions between doctors and patients primarily doctor-patient interactions? We could address this further through the concept of framing and embedded frames (a la Goffman), but we won’t do that right now.

Instead, I’ll ask another question:
If we are studying gender discrimination, is it necessary to have a variable for gender within our datasource?

My kneejerk reaction to this question, because of my quantitative background, is yes. But looking deeper: is gender always relevant? This does strongly depend on the datasource, so let’s assume for this example that the stimulus was a question on a survey that was not directly about discrimination, but rather more general (e.g. “Additional Comments:”).

What if we took that second CA approach, the purist approach, and said that where gender is applicable to the response, it will be constructed within that response? The question now becomes ‘how is gender constructed within a response?’ This is a beautiful and interesting question for a linguist, and it may be a question that much better fits the underlying data and provides deeper insight into the data. It also turns the age-old analytic strategy on its head. Now we can ask whether a priori assumptions that the demographics could or do matter are just rote research or truly the productive and informative measures that we’ve built them up to be.

I believe that this is a key difference between analysis types. In the qualitative analysis of open ended survey questions, it isn’t very meaningful to say x% of the respondents mentioned z, and y% of the respondents mentioned d, because a nonmention of z or d is not really meaningful. Instead we go deeper into the data to see what was said about d or z. So the goal is not prevalence, but description. On the other hand, prevalence is a hugely important aspect of quantitative analysis, as are other fun statistics which feed off of demographic variables.

The lesson in all of this is to think carefully about what is meaningful information that is relevant to your analysis and not to make assumptions across analytic strategies.

Do you ever think about interfaces? Because I do. All the time.

Did you ever see the movie Singles? It came out in the early 90s, shortly before the alternative scene really blew up and I dyed [part of] my hair blue and thought seriously about piercings. Singles was a part of the growth of the alternative movement. In the movie, there is a moment when one character says to another “Do you ever think about traffic? Because I do. All the time.” I spent quite a bit of time obsessing over that line, about what it meant, and, more deeply, what it signaled.

I still think about that line. As I drove toward the turnoff to my mom’s street during our 4th of July vacation, I saw what looked like the turn lane for her street, but it was actually an intersection-less left-turning split immediately preceding the real left turn lane for her street. It threw me off every time, and I kept remembering that romantic moment in Singles when the two characters were getting to know each other’s quirks, and the man was talking about traffic. And it was okay, even cool, to be quirky and think or talk about traffic, even during a romantic moment.

I don’t think about traffic often. But I am no less quirky. Lately, I tend to think about interfaces. Before my first brush with NLP (Natural Language Processing), I thought quite a bit about alternatives to e-mail. Since I discovered the world of text analytics, I have been thinking quite a bit about ways to integrate the knowledge across different fields about methods for text analysis and the needs of quantitative and qualitative researchers. I want to think outside of the sentiment box, because I believe that sentiment analysis does not fully address the underlying richness of textual data. I want to find a way to give researchers what they need, not what they think they want. Recently, my thinking on this topic has flipped. Instead of thinking from the data end, or the analytic possibilities end, or about what programs already exist and what they do, I have started to think about interfaces. This feels like a real epiphany. Once we think about the problem from an interface, or user experience perspective, we can better utilize existing technology and harness user expectations.

Have you read the new Imagine book about how creativity works? I believe that this strategy is the natural step after spending time zoning out on the web, thinking, or not thinking, about research. The more time you cruise, the better feel you develop for what works and what doesn’t, the more you learn what to expect. Interfaces are simply the masks we put on datasets of all sorts. The data could be the world wide web as a whole, results from a site or time period, a database of merchandise, or even a set of open ended survey responses. The goal is to streamline the searching interface and then make it available for use on any number of datasets. We use NLP every day when we search the internet, or shop. We understand it intuitively. Why don’t we extend that understanding to text analysis?

I find myself thinking about what this interface should look like and what I want this program to do.

Not traffic, not as romantic. But still quirky and all-encompassing.

Question Writing is an Art

As a survey researcher, I like to participate in surveys with enough regularity to keep current on any trends in methodology. As a web designer, an aspect of successful design is a seamlessness with the visitor’s expectations. So if the survey design realm has moved toward submit buttons on the upper right hand corner of individual pages, your idea (no matter how clever) to put a submit button on the upper left can result in a disconnect on the part of the user that will affect their behavior on the page. In fact, the survey design world has evolved quite a bit in the last few years, and it is easy to design something that reflects poorly on the quality of your research endeavor. But these design concerns are less of an issue than they have been, because most researchers are using templates.

Yet there is still value in keeping current.

And sometimes we encounter questions that lend themselves to an explanation of the importance of question writing. These questions are a gift for a field that is so difficult to describe in terms of knowledge and skills!

Here is a question I encountered today (I won’t reveal the source):

How often do you purchase potato chips when you eat out at any quick service and fast food restaurants?

2x a week or more
1x a week
1x every 2-3 weeks
1x a month
1x every 2-3 months
Less than 1x every 3 months
Never

This is a prime example of a double-barreled question, and it is also an especially difficult question to answer. In my case, I rarely eat at quick service restaurants, especially sandwich places, like this one, that offer potato chips. When I do eat at them, I am tempted to order chips. About half the time I will give in to the temptation with a bag of SunChips, which I’m pretty sure are not made of potato.

In bigger firms that have more time to work through their questions, this information would come out in the process of a cognitive interview or think-aloud during the pretesting phase. Many firms, however, have staunchly resisted these important steps in the surveying process, because of their time and expense. It is important to note that the time and expense involved with trying to make usable answers out of poorly written questions can be immense.

I have spent some time thinking about alternatives to cognitive testing, because I have some close experience with places that do not use this method. I suspect that this is a good place for text analytics, because of the power of reaching people quickly and potentially cheaply (depending on your embedded TA processes). Although oftentimes we are nervous about web analytics because of their representativeness, the bar for representativeness is significantly lower in the pretesting stage than in the analysis phase.

But, no matter what pretesting model you choose, it is important to look closely at the questions that you are asking. Are you asking a single question, or would these questions be better separated out into a series?

How often do you eat at quick service sandwich restaurants?

When you eat at quick service restaurants, do you order [potato] chips?

What kind of [potato] chips do you order?

The lesson of all of this is that question writing is important, and the questions we write in surveys will determine the kind of survey responses we receive and the usability of our answers.

To go big, first think small

We use language all of the time. Because of this, we are all experts in language use. As native speakers of a language, we are experts in the intricacies of that language.

Why, then, do people study linguistics? Aren’t we all linguists?

Absolutely not.

We are experts in *using* language, but we are not experts in the methods we employ. Believe it or not, much of the process of speaking and hearing is not conscious. If it were, we would be sensorially overwhelmed with the sheer volume of words around us. Instead, listening comprehension involves a process of merging what we expect to hear with what we gauge to be the most important elements of what we do hear. The process of speaking involves merging our estimates of what the people we communicate with know and expect to hear with our understanding of the social expectations surrounding our words and our relationships and distilling these sources into a workable expression. The hearer will reconstruct elements of this process using cues that are sometimes conscious and sometimes not.

We often think of language as simple and mechanistic, but it is not simple at all. As conversational analysts, our job is to study conversation that we have access to in an attempt to reconstruct the elements that constituted the interaction. Even small chunks of conversation encode quite a bit of information.

The process of conversation analysis is very much contrary to our sense of language as regular language users. This makes the process of explaining our research to people outside our field difficult. It is difficult to justify the research, and it is difficult to explain why such small pieces of data can be so useful, when most other fields of research rely on greater volumes of data.

In fact, a greater volume of data can be more harmful than helpful in conversation analysis. Conversation is heavily dependent on its context; on the people conversing, their relationship, their expectations, their experiences that day, the things on their mind, what they expect from each other and the situation, their understanding of language and expectations, and more. The same sentence can have greatly different meanings once those factors are taken into account.

At a time when there is so much talk of the glory of big data, it is especially important to keep in mind the contributions of small data. These contributions are the ones that jeopardize the utility and promise of big data, and if these contributions can be captured in creative ways, they will be the true promise of the field.

Not what language users expect to see, but rather what we use every day, more or less consciously.

Data Journalism, like photography, “involves selection, filtering, framing, composition and emphasis”

Beautiful:

“Creating a good piece of data journalism or a good data-driven app is often more like an art than a science. Like photography, it involves selection, filtering, framing, composition and emphasis. It involves making sources sing and pursuing truth – and truth often doesn’t come easily. ” -Jonathan Gray

Whole article:

http://www.guardian.co.uk/news/datablog/2012/may/31/data-journalism-focused-critical

Truly, at a time when the buzz about big data is at such a peak, it is nice to hear a voice of reason and temper! Folks: big data will not do all that it is talked up to do. It will, in fact, do something surprising and different. And that something will come from the interdisciplinary thought leaders in fields like natural language processing and linguistics. That *something,* not the data itself, will be the new oil.

Facebook Measures Happiness in Status Updates?

From FlowingData:

http://flowingdata.com/2009/10/05/facebook-measures-happiness-in-status-updates/

Does anyone have a link to the original report?

I really wish I had more of a window into the methodology of this one!

A couple of questions:

What is happiness?

How can it be measured or signaled? What kinds of data are representing happiness? Is this just an expanded or open ended sentiment analysis? Is the technology such that this would be a valid study?

Are Facebook statuses a sensible place to investigate happiness?

What is this study representing? Constituting? Perpetuating?


Edited to Add: http://blog.facebook.com/blog.php?post=150162112130

What is Data? The answer might surprise you

I like to compare my discovery of Sociolinguistics to my love of swimming. I like to consider myself a competent swimmer, and I love being underwater. But discovering sociolinguistics was like coming up for air and noticing air and dry land. A fundamental element that led to this feeling is the difference in data.

In survey research, we rarely think about what data looks like, unless we are training new hires for jobs like data entry. Data can be visualized as a spreadsheet. Each line is a case, and each column is a variable. The variables can be numeric or character and vary in size. We analyze the numbers using statistics and the character variables using qualitative analysis. Or, we can try quantitative techniques on character fields.
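To make the spreadsheet picture concrete, here is a minimal sketch in Python. The cases, variable names, and values are all made up for illustration. Each dictionary is one case (a row), each key is a variable (a column), and the variables sort into numeric and character fields that we would then treat differently.

```python
# Hypothetical survey data: each dict is one case (row), each key a variable (column).
cases = [
    {"id": 1, "age": 34, "satisfaction": 4, "comments": "The wait was too long."},
    {"id": 2, "age": 51, "satisfaction": 2, "comments": ""},
    {"id": 3, "age": 27, "satisfaction": 5, "comments": "Friendly staff, fast service."},
]


def split_variables(rows):
    """Sort variable names into numeric and character groups by their values."""
    numeric, character = [], []
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            numeric.append(name)
        else:
            character.append(name)
    return numeric, character


numeric_vars, character_vars = split_variables(cases)

# Numeric variables invite statistics...
mean_satisfaction = sum(row["satisfaction"] for row in cases) / len(cases)

# ...while a character field needs reading, or at least a crude quantitative
# technique such as counting the words in each response.
comment_lengths = {row["id"]: len(row["comments"].split()) for row in cases}

print(numeric_vars, character_vars)
print(round(mean_satisfaction, 2), comment_lengths)
```

The word count is only a stand-in for "quantitative techniques on character fields"; it says nothing about what the comments mean.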

The field of survey research has been feeling out its edges increasingly in the past few years. This has led us to consider new data sources, particularly data sources that do not come from surveys. Two factors shape this exploration:

1.) Consideration for the genesis and representativeness of the new data. What is it, and what does it represent?

2.) A sense of what data should look like. We expect new data to resemble old data. We think in terms of joining files; collating, concatenating, merging, aggregating and disaggregating. New data should look and work like this. And so our questions are more along the lines of: how can we make new data look like (or work with) old data?

Sociolinguistics could not be more different, in terms of data. In sociolinguistics, everything is data. Look around you: you’re looking at data. Listen: you’re listening to data. The signs that you passed on your way into work? Data. The tv shows you watch when you get home? Data. Cooking with recipes? Data. Talking on the phone? Data. Attending a meeting? Institutional discourse!

In sociolinguistics, we call analytic methods our ‘toolkit,’ and we pride ourselves on being able to analyze any kind of data with that toolkit. We include ethnographic methods, visual semiotics, discourse methods, action-based studies, as well as traditional linguistic means and measures. Each of these methods can be addressed quantitatively or qualitatively. The best studies use a combination of quantitative and qualitative methods. To me, these methods and data sources are nothing short of mind blowing, and they redefine the prospect of social science research.

Memory is incomplete experience

Today’s quote on my zen calendar is perfect.

“Memory is incomplete experience.” J. Krishnamurti

This is a great reminder for researchers and for people in general, because we all forget and keep forgetting how incomplete our memories and the memories of people we come into contact with are. How many survey questions could be better written with this advice? How much better is ethnography when we base our observations on repeated viewings, rather than trying to reconstruct a vague memory? How many arguments could be avoided, if we could just remember that memories are incomplete?

I sometimes participate in a video discussion group. I am amazed that each viewing of a short segment of video brings a different set of interpretations, and I am amazed that the other participants continually notice different aspects of the video. This experience really drives the point home about how little we see in our everyday lives. We are so inundated with information that we simply couldn’t, and wouldn’t want to, process it all.

Research is the process of recovering and reconstructing. Of observing carefully. Of noticing things that we would never or could never have accessed through normal observation and we absolutely could never access through our memories. Being a researcher does not make us any more able to analyze that which we experience in a single pass; we’re still human. Being a researcher simply means that we have the capacity to observe and investigate things more closely.

More rundown on Academedia

So I promised more on Academedia (note: they will add more video and visual resources to the Academedia website in the next few days)…

First, some insightful gems from Robert Cannon (employed with the FCC and a member of Panel B, “New Media: A closer look at what works”):

Re: internet “a participatory market of free speech”

Re: kids & social media “It’s not a question of whether kids are writing. Kids are writing all the time. It’s whether parents understand that.”

“The issue is not whether to use Wikipedia, but how to use Wikipedia”

Next, the final panel, “Digital Tools for Communication:” http://gnovis-conferences.com/panel-c/

Hitlin (Pew Project for Excellence in Journalism)

People communicate differently about issues on different kinds of media sources.

Re: Trayvon Martin case –> largest issue by media source

  •      Twitter: 21% Outrage @ Zimmerman
  •      Cable & Talk radio: 17% Gun control legislation
  •      Blogs: 15% Role of race

Re: Crimson Hexagon
Pew is different, because they’re in a partnership with Crimson Hexagon to measure trends in traditional media sources. Also because their standard of error is much higher, and they have a team of hand coders available.

Crimson Hexagon is different, because it combines human coding with machine learning to develop algorithms. It may actually overlap pretty intensely with some of the traditional qualitative coding programs that allow for some machine learning. I can imagine that this feature would appeal especially to researchers who are reluctant to fully embrace machine coding, which is understandable, given the current state of the art. I wonder if, by hosting their users instead of distributing programs, they’re able to store and learn from the codes developed by the users?

CH appears to measure two main domains: topic volume over time and topic sentiment over time. Users get a sense of recall and precision in action as they work with the program, by seeing the results of additions and subtractions to a search lexicon. Through this process, Hitlin got a sense of the meat of the problems with text analysis. He said that it was difficult to find examples that neatly fit into boxes, and that the computer didn’t have an eye for subtlety or things that fit into multiple categories. What he was commenting about was the nature of language in action, or what sociolinguists call Discourse! Through the process of categorizing language, he could sense how complicated it is. Here I get to reiterate one of the main points of this blog: these problems are the reason why linguistics is a necessary aspect of this process. Linguistics is the study of patterns in language, and the patterns we find are inherently different from the patterns we expect to find. Linguistics is a small field, one that people rarely think of. But it is critically essential to a high quality analysis of communication. In fact, we find, when we look for patterns in language, that everything in language is patterned, from its basic morphology and syntax, to its many variations (which are more systematic than we would predict), to methods like metaphor use and intertextuality, and more.
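For readers who have not met recall and precision before, here is a minimal sketch of how they shift as terms are added to a search lexicon. It is not how Crimson Hexagon works internally, which I have not seen. The posts, the hand-coded labels, and the lexicons are all hypothetical.

```python
import re

# Hypothetical posts, each hand-coded by a human: is it about the snowstorm?
posts = [
    ("Snow is falling here already", True),
    ("The storm knocked out our power", True),
    ("Flakes are starting to come down", True),
    ("Taking the kids to see Snow White", False),
    ("A storm of emails this morning", False),
]


def matches(text, lexicon):
    """True if any whole word in the text appears in the lexicon."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return bool(words & lexicon)


def precision_recall(coded_posts, lexicon):
    """Compare lexicon hits against the human codes."""
    tp = fp = fn = 0
    for text, on_topic in coded_posts:
        hit = matches(text, lexicon)
        if hit and on_topic:
            tp += 1  # retrieved and relevant
        elif hit:
            fp += 1  # retrieved but not relevant
        elif on_topic:
            fn += 1  # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


narrow = {"snow"}
broad = {"snow", "storm", "flakes"}

narrow_scores = precision_recall(posts, narrow)
broad_scores = precision_recall(posts, broad)
print(narrow_scores, broad_scores)
```

In this toy set the broader lexicon catches every on-topic post, but it also pulls in “Snow White” and the “storm of emails.” Those are exactly the posts that do not fit neatly into boxes.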

Linguistics is a key, but it’s not a simple fit. Language is patterned in so many ways that linguistics is a huge field. Unfortunately, the subfields of linguistics divide quickly into political and educational camps. It is rare to find a linguist trained in cognitive linguistics, applied linguistics and discourse analysis, for example. But each of these fields is a necessary part of text analysis.

Just as this blog is devoted to knocking down borders in research methods, it is devoted to knocking down borders between subfields and moving forward with strategic intellectual partnerships.

This next speaker in the panel thoroughly blew my mind!

Rami Khater from Al Jazeera English talked about the generation of ‘The Stream,’ an Al Jazeera program that is entirely driven by social media analysis.

Rami can be found on Twitter: @ramisms , and he shared a bit.ly with resources from his talk: bit.ly/yzST1d

The goal of The Stream is to be “a voice of the voiceless,” by monitoring how the hyperlocal goes global. Rami gave a few examples of things we never would have heard about without social media. He showed how hashtags evolve, by starting with competing tags, evolving and changing, and eventually converging into a trend (incidentally, Rami identified the Kony 2012 trend as synthetic from the get-go by pointing out that there was no organic hashtag evolution. It simply started and ended as #Kony2012). He used TrendsMap to show a quick global map of currently trending hashtags. I put a link to TrendsMap on the tools section of the links on this blog, and I strongly encourage you to experiment with it. My daughter and I spent some time looking at it today, and we found an emerging conversation in South Africa about black people on the Titanic. We followed this up with another tool, Topsy, which allowed us to see what the exact conversation was about. Rami gets to know the emerging conversations and then uses local tools to isolate the genesis of the trend and interview people at its source. Instead, my daughter and I looked at WhereTweeting to see what the people around us are tweeting about. We saw some nice words of wisdom from Iyanla Vanzant that were drowning in what appeared to me to be “a whole bunch of crap!” (“Mom-mmy, you just used the C word!”)

Anyway, the tools that Rami shared are linked over here —->

I encourage you to play around with them, and I encourage you and me both to go check out the recent Stream interview with Ai Weiwei!

The final speaker on the panel was Karine Megerdoomian from MITRE. I have encountered a few people from MITRE recently at conferences, and I’ve been impressed with all of them! Karine started with some words that made my day:

“How helpful a word cloud is is basically how much work you put into it”

Exactly! Great point, Karine! And she showed a particularly great word cloud that combined useful words and phrases into a single image. Niiice!

Karine spoke a bit about MITRE’s efforts to use machine learning to identify age and gender among internet users. She mentioned that older users tended to use noses in their smilies :-) and younger users did not :) . She spoke of how older Iranian users tended to use Persian morphology when creating neologisms, and younger users tended to use English, and she spoke about predicting revolutions and seeing how they are propagated over time.
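To show how small a signal like this can be, here is a minimal sketch that counts smilies with and without noses in a piece of text. It is only an illustration of one feature a classifier might use, not MITRE’s method. The patterns cover just a handful of common smilies, and the example messages are made up.

```python
import re

# Eyes, a hyphen nose, then a mouth, e.g. :-) ;-) :-D
NOSED = re.compile(r"[:;]-[)(DP]")
# Eyes followed directly by a mouth, e.g. :) ;) :D
NOSELESS = re.compile(r"[:;][)(DP]")


def smiley_counts(text):
    """Count nosed and noseless smilies in a piece of text."""
    return {
        "nosed": len(NOSED.findall(text)),
        "noseless": len(NOSELESS.findall(text)),
    }


with_noses = smiley_counts("Great talk :-) see you soon ;-)")
without_noses = smiley_counts("lol :) that was fun :D :(")
print(with_noses, without_noses)
```

On its own a count like this says nothing certain about anyone’s age. It would be one weak cue among many.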

After this point, the floor was opened up for questions. The first question was a critically important one for researchers. It was about representativeness.

The speakers pointed out that social media has a clear bias toward English speakers, western educated people, white, male, liberal, US & UK. Every network has a different set of flaws, but every network has flaws. It is important not to just use these analyses as though they were complete. You simply have to go deeper in your analysis.


There was a bit more great discussion, but I’m going to end here. I hope that others will cover this event from other perspectives. I didn’t even mention the excellent discussions about education and media!

It’s not WHETHER they use it, but how they ENGAGE with it

Yesterday I attended an excellent event, Academedia, sponsored by the Gnovis Journal and the Communication, Culture and Technology program at Georgetown. I plan to post a fuller summary of the event soon, but I wanted to jump right in with some commentary about an exchange at the event that really weighed heavily on me.

One of the attendees was lamenting his child’s lack of engagement with traditional media sources (particularly news magazines) and worrying about the deeper societal implications of all of the fluff that garners more attention (and spreads faster) than larger scale news events do online. I would characterize this concern as what my professor Mima Dedaic calls “technopanic,” and I believe that his concerns demonstrate a lack of understanding of the nature of social media.

I have mentioned Pew’s report on the Kony 2012 viral video phenomenon. One of the main findings of that report was that younger people tend to engage differently with media than older people tend to. Whereas older people were more likely to find out about the video from traditional news sources, younger people were more likely to have heard about it, and heard about it sooner, from social media sources. They were also less likely to have heard about it through traditional media sources, and more likely to have actually seen the video.

In the past media model, the news was composed of a distinct set of entities that could be avoided. I know of quite a few people who prefer not to watch the news or read the newspapers. This orientation has always existed. But in the age of social media, it is much harder to achieve.

When it snows, I know when the flakes begin to fall and the general swath of the storm, even if it’s not local, from my friends who complain about the storm and post pictures of its aftermath. I heard about Michael Jackson’s tragic passing before it was announced in the news. When Egyptians gathered in Tahrir Square, I knew about it from my friend in Egypt. I kept updated on the conflict and on her safety in a community of concerned friends and relatives on her Facebook page. I hear about American political ads from people who see them and comment about them. I know what aspects of politicians my friends with different political orientations orient to. For me, social media can provide a faster news source, and often a more balanced news source, than traditional media (although I am an avid consumer of all kinds of media).

News is no longer a distinct entity that must be sought out. It is personalized. It is discussed from many angles from a variety of perspectives with a great deal of frequency from people who have various degrees of knowledge and a variety of attitudes toward it. The man in the audience’s daughter may spend most of her time giggling over memes or making fan pages, but she is surely also orienting toward the larger world around her in a collaborative and alocative (location independent) way.

As a survey researcher, I like to participate in surveys. Some of these surveys ask about where I heard about something. I’m often very frustrated by the response options, because they are incongruent with the ways that I, and many people I know, learn about things on the internet. Googling is sometimes represented as a process of typing a search term into the box and choosing the first option that pulls up. But how often is that the way we use search engines?

There are two distinct ways that I can think of offhand that I google. One is for a direct, known piece of information, like an address, phone number or a picture of something I am already familiar with. The other is more exploratory. An exploratory search takes some term adjustment, and it requires reading through matches until a contextual understanding can be developed. I have noticed that some people can search far more efficiently than others. There are many tools available on the internet, and a working knowledge of the usefulness and potential of these tools can lead to a much different outcome than a passing use can.

There was a representative from the FCC on the panel who shared some great insights, many of which I will cover later. He spoke about kids being taught in schools that technology is bad (disruptive, disobedient, minimally insightful, …), instead of being taught how to use the technological tools available to them.

He said, it’s not WHETHER they use Wikipedia, but how they ENGAGE with Wikipedia.

This is a crucial point. The more we embrace the usefulness of these tools, the better our capabilities will be.

The other side of technopanic is a fear that engaging online means NOT engaging offline. Data on this topic show quite the opposite. People who engage online are also MORE likely to engage offline. Technology need not replace anything. But it can be an excellent tool when approached without unhelpful prejudices.