Upcoming DC Event: Online Research Offline Lunch

ETA: Registration for this event is now CLOSED. If you have already signed up, you will receive a confirmation e-mail shortly. Any sign-ups after this date will be stored as a contact list for any future events. Thank you for your interest! We’re excited to gather with such a diverse and interesting group.

—–

Are you in or near the DC area? Come join us!

Although DC is a great meeting place for specific areas of online research, there are few opportunities for interdisciplinary gatherings of professionals and academics. This lunch will provide an informal opportunity for a diverse set of online researchers to listen and talk respectfully about our interests and our work and to see our endeavors from new, valuable perspectives.

Date & Time: August 6, 2013, 12:30 p.m.

Location: Near Gallery Place or Metro Center. Once we have a rough headcount, we’ll choose an appropriate location. (Feel free to suggest a place!)

Please RSVP using this form:


Fitness for Purpose, Representativeness and the perils of online reviews

Have you ever planned a trip online? In January, when I traveled to Amsterdam, I did all of the legwork online and ended up in a surprising place.

Amsterdam City Center is extremely easy to navigate. From the train station (a quick ride from the airport and a quick ride from most of The Netherlands), the canals extend outward like spokes. Each canal is flanked by streets. Then the city has a number of concentric rings emanating from the train station. Not only is the underlying map easy to navigate, there is a traveler station at the center and maps posted periodically. English-speaking tourists will find that not only do many people speak English, but Dutch has enough overlap with English to be comprehensible after even a short exposure.

But the city center experience was not as smooth for me. I studied map after map in the city center without finding my hotel. I asked for directions, and no one had heard of the hotel or the street it was on. The traveler center seemed flummoxed as well. Eventually I found someone who could help and found myself on a long commuter tram ride well outside the city center and tourist areas. The hotel had received great reviews and recommendations from many travelers. But clearly, the travelers who boasted about it were not quite the typical travelers, who likely would have ended up in one of the many hotels I saw from the tram window.

Have you ever discovered a restaurant online? I recently went to a nice, local restaurant that I’d been reading about for years. I ordered the truffle fries (fries with truffle salt and some kind of fondue sauce), because people had really raved about them, only to discover once they arrived that they were fundamentally french fries (totally not my bag- I hate fried food).

These review sites are not representative of anything. And yet we/I repeatedly use them as if they were reliable sources of information. One could easily argue that they may not be representative, but they are good enough for their intended use (fitness for purpose, the big, controversial notion from a recent AAPOR task force report on Nonprobability Sampling). I would argue that they are clearly not excellent for their intended use. But does that invalidate them altogether? They often provide the only window that we have into whatever it is that we intend them for.

Truffle fries aside, the restaurant was great. And location aside, the hotel was definitely an interesting experience.

Toilet capsule in hotel room (with frosted glass rotating pane for some degree of privacy)


Representativeness, qual & quant, and Big Data. Lost in translation?

My biggest challenge in coming from a quantitative background to a qualitative research program was representativeness. I came to class firmly rooted in the principle of representativeness, and my classmates seemed not to have any idea why it mattered so much to me. Time after time I would get caught up in my data selection. I would pose the wider challenge of representativeness to a colleague, and they would ask, "Representative of what? Why?"


In the survey research world, the researcher begins with a population of interest and finds a way to collect a representative sample of that population for study. In the qualitative world that accompanies survey research, units of analysis are generally people, and people are chosen for their representativeness. Representativeness is often constructed from demographic characteristics. If you’ve read this blog before, you know of my issues with demographics. Too often, demographic variables are used as a knee-jerk default instead of better-considered variables that are more relevant to the analysis at hand. (Maybe the census collects gender and not program availability, for example, but just because a variable is available and somewhat correlated doesn’t mean that it is in fact a relevant variable, especially when the focus of study is a population for whom gender is such an integral societal difference.)


And yet I spent a whole semester studying 5 minutes of conversation between 4 people. What was that representative of? Nothing but itself. It couldn’t have been exchanged for any other 5 minutes of conversation. It was simply a conversation that this group had and forgot. But over the course of the semester, this piece of conversation taught me countless aspects of conversation research. Every time I delved back into the data, it became richer. It was my first step into the world of microanalysis, where I discovered that just about anything can be a rich dataset if you use it carefully. A snapshot of people at a lecture? Well, how are their bodies oriented? A snippet of video? A treasure trove of gestures and facial expressions. A piece of graffiti? Semiotic analysis! It goes on. The world of microanalysis is built on the practice of layered noticing. It goes deep rather than wide.


But what is it representative of? How could a conversation be representative? Would I need to collect more conversations, but restrict the participants? Collect conversations with more participants, but in similar contexts? How much or how many would be enough?


In the world of microanalysis, people and objects constantly create and recreate themselves. You consistently create and recreate yourself, but your recreations generally fall into a similar range that makes you different from your neighbors. There are big themes in small moments. But what are the small moments representative of? Themselves. Simply, plainly, nothing more and nothing else. Does that mean that they don’t matter? I would argue that there is no better way to understand the world around us in deep detail than through microanalysis. I would also argue that macroanalysis is an important part of discovering the wider patterns in the world around us.


Recently a NY Times blog post by Quentin Hardy has garnered quite a bit of attention.

Why Big Data is Not Truth: http://bits.blogs.nytimes.com/2013/06/01/why-big-data-is-not-truth/

This post has really struck a chord with me, because I have had a hard time understanding Hardy’s complaint. Is big data truth? Is any data truth? All data is what it is: a collection of some sort, gathered under a specific set of circumstances. Even data that we hope to be more representative has sampling and contextual limitations. Responsible analysts should always be upfront about what their data represents. Is big data less truthful than other kinds of data? It may be less representative than, say, a systematically collected political poll. But it is what it is: different data, collected under different circumstances in a different way. It shouldn’t be equated with other data that was collected differently. One true weakness of many large-scale analyses is blindness to the nature of the data, but that is a byproduct of the training algorithms that are used for much of the analysis. The algorithms need large training datasets, from anywhere. These sets are often developed through massive web crawls. Here, context gets dicey. How does a researcher represent the data properly when they have no idea what it is? Hopefully researchers in this context will be wholly aware that, although their data has certain uses, it also has certain [huge] limitations.
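
To make that concrete, here is a minimal sketch (mine, not Hardy's or anything from the post) of the kind of supervised text pipeline described above, assuming scikit-learn is available; the documents and labels are invented for illustration. Notice that nothing in the pipeline records where the training data came from or whom it represents.

```python
# A minimal, hypothetical sketch of a supervised text pipeline trained on
# documents "from anywhere." The data below is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Imagine these documents came from a massive web crawl of unknown provenance.
training_docs = [
    "great hotel, friendly staff, would stay again",
    "terrible service and a noisy room",
    "the fries were amazing, best meal of the trip",
    "cold food, slow kitchen, never going back",
]
training_labels = ["positive", "negative", "positive", "negative"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(training_docs)

model = MultinomialNB()
model.fit(X, training_labels)

# The model will happily score any new text, whether or not it resembles
# the (unknown) population the training data was drawn from.
new_docs = ["quiet room and helpful staff"]
print(model.predict(vectorizer.transform(new_docs)))
```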


I suspect that Hardy’s complaint is with the representation of massive datasets collected from web crawlers as a complete truth from which any analysis could be run and all of the greater truths of the world could be revealed. On this note, Hardy is exactly right. Data simply is what it is, nothing more and nothing less. And any analysis that focuses on an unknown dataset is just that: an analysis without context. Which is not to say that all analyses need to be representative, but rather that all responsible analyses of good quality need to be self-aware. If you do not know what the data represents and when and how it was collected, then you cannot begin to discuss the usefulness of any analysis of it.

What is the role of Ethnography and Microanalysis in Online Research?

There is a large disconnect in online research.

The largest, highest-profile, highest-value, and most widely practiced side of online research was created out of a high demand to analyze the large amount of consumer data that is constantly being created and is largely publicly available. This tremendous demand led to research methods that were created in relative haste. Math and programming skills thrived in a realm where social science barely made a whisper. The notion of atheoretical research grew. The level of programming and mathematical competence required to do this work grows higher every day, as the fields of data science and machine learning become continually more nuanced.

The lower-profile, lower-value, and increasingly practiced side of online research is academic research. Turning academia toward online research has been like turning a massive ocean liner. For a while online research was not well respected. At this point it is increasingly well respected, thriving in a variety of fields and in a much-needed interdisciplinary way, and driven by a search for a better understanding of online behavior and better theories to drive analyses.

I see great value in the intersection between these areas. I imagine that the best programmers have a big appetite for any theory they can use to drive their work in useful and productive ways. But I don’t see this value coming to bear on the market. Hiring is almost universally focused on programmers and data scientists, and the microanalytic work that is done seems largely invisible to the larger entities out there.

It is common to consider quantitative and qualitative research methods as two separate languages with few bilinguals. At the AAPOR conference in Boston last week, Paul Lavrakas mentioned a book he is working on with Margaret Roller which expands the Total Survey Error model to both quantitative and qualitative research methodology. I spoke with Margaret Roller about the book, and she emphasized the importance of qualitative researchers being able to talk more fluently and openly about methodology and quality controls. I believe that this is a very important step for qualitative research, albeit a huge challenge in wording and framing, in part because quality frameworks lend credibility to qualitative research in the eyes of a wider research community. I wish this book a great deal of success, and I hope that it is able to find an audience and a frame outside the realm of survey research (although survey research has a great deal of foundational research, it is not well known outside of the field, and this book will merit a wider audience).

But outside of this book, I’m not quite sure where or how the work of bringing these two distinct areas of research can or will be done.

Also at the AAPOR conference last week, I participated in a panel on The Role of Blogs in Public Opinion Research (intro here and summary here). Blogs serve a special purpose in the field of research. Academic research is foundational and important, but the publication rate for papers is low, and the burden of proof is high. Articles that are published are crafted as an argument. But what of the bumps along the road? The meditations on methodology that arise? Blogs provide a way for researchers to work through challenges and to publish their failures. They provide an experimental space where fields and ideas that previously hadn’t mixed can come together. They provide a space for finding, testing, and crossing boundaries.

Beyond this, they are a vehicle for dissemination. They are accessible and informally advertised. The time frame to publish is short, the burden lower (although I’d like to believe that you have to earn your audience with your words). They are a public face to research.

I hope that we will continue to test these boundaries, and to cross over barriers, like the one between quantitative and qualitative, that are unhelpful and obtrusive. I hope that we will be able to see that we all need each other as researchers, and that the quality research we all want will only be achieved through that mutual recognition.

Digital Democracy Remixed

I recently transitioned from my study of the many reasons why the voice of DC taxi drivers is largely absent from online discussions into a study of the powerful voice of the Kenyan people in shaping their political narrative using social media. I discovered a few interesting things about digital democracy and social media research along the way, and the contrast between the groups was particularly useful.

Here are some key points:

  • The methods of sensemaking that journalists use in social media are similar to other methods of social media research, except for a few key factors, the most important of which is that the bar for verification is higher.
  • The search for identifiable news sources is important to journalists and stands in contrast with research methods that are built on anonymity. This means that the input that journalists ultimately use will be on a smaller scale than the automated analyses of large datasets widely used in social media research.
  • The ultimate information sources for journalists will be small, but the phenomena that capture their attention will likely be big. Although journalists need to dig deep into information, something in the large expanse of social media conversation must first capture or flag their attention.
  • It takes some social media savvy to catch the attention of journalists. This social media savvy outweighs linguistic correctness in the ultimate process of getting noticed. Journalists act as intermediaries between social media participants and a larger public audience, and part of that intermediary process is language correcting.
  • Social media savvy is not just about being online. It is about participating in social media platforms in a publicly accessible way, on publicly relevant topics, using the patterned dialogic conventions of the platform on a scale that can ultimately draw attention. Many people and publics go online but do not do this.

The analysis of social media data for this project was particularly interesting. My data source was the comments following this posting on the Al Jazeera English Facebook feed.


It evolved quite organically. After a number of rounds of coding I noticed that I kept drawing diagrams in the margins of some of the comments. I combined the diagrams into this framework:

[Diagram: the coding framework]

Once this framework was built, I looked closely at the ways in which participants used it. Sometimes participants made distinct discursive moves between these levels. But when I tried to map the participants’ movements on their individual diagrams, I noticed that my depictions of their movements rarely matched when I returned to a diagram. Although my coding of the framework was very reliable, my coding of the movements was not at all. This led me to notice that oftentimes the frames were being used more indexically. Participants were indexing levels of the frame, and this indexical process created powerful frame shifts. So, on the level of Kenyan politics exclusively, Uhuru’s crimes had one meaning. But juxtaposed against the crimes of other national leaders, Uhuru’s crimes had a dramatically different meaning. Similarly, when the legitimacy of the ICC was questioned, the charges took on a dramatically different meaning. When Uhuru’s crimes were embedded in the postcolonial East vs. West dynamic, they shrank to the degree that the indictments seemed petty and hypocritical. And, ultimately, when religion was invoked, the persecution of one man seemed wholly irrelevant and sacrilegious.

These powerful frame shifts enable the Kenyan public to have a powerful, narrative changing voice in social media. And their social media savvy enables them to gain the attention of media sources that amplify their voices and thus redefine their public narrative.


Instagram is changing the way I see

I recently joined Instagram (I’m late, I know).

I joined because my daughter wanted to, because her friends had, to see what it was all about. She is artistic, and we like to talk about things like color combinations and camera angles, so Instagram is a good fit for us. But it’s quickly changing the way I understand photography. I’ve always been able to set up a good shot, and I’ve always had an eye for color. But I’ve never seriously followed up on any of it. It didn’t take long on Instagram to learn that an eye for framing and color is not enough to make for anything more than accidental great shots. The great shots that I see are the ones that pick deeper patterns or unexpected contrasts out of seemingly ordinary surroundings. They don’t simply capture beauty, they capture an unexpected natural order or a surprising contrast, or they tell a story. They make you gasp or they make you wonder. They share a vision, a moment, an insight. They’re like the beginning paragraph of a novel or the sketch outline of a poem. Realizing that, I have learned that capturing the obvious beauty around me is not enough. To find the good shots, I’ll need to leave my comfort zone, to feel or notice differently, to wonder what or who belongs in a space and what or who doesn’t, and why any of it would capture anyone’s interest. It’s not enough to see a door. I have to wonder what’s behind it. To my surprise, Instagram has taught me how to think like a writer again, how to find hidden narratives, how to feel contrast again.


Sure this makes for a pretty picture. But what is unexpected about it? Who belongs in this space? Who doesn’t? What would catch your eye?

This kind of change has a great value, of course, for a social media researcher. The kinds of connections that people forge on social media, the different ways in which people use platforms and the ways in which platforms shape the way we interact with the world around us, both virtual and real, are vitally important elements in the research process. In order to create valid, useful research in social media, the methods and thinking of the researcher have to follow closely with the methods and thinking of the users. If your sensemaking process imitates the sensemaking process of the users, you know that you’re working in the right direction, but if you ignore the behaviors and goals of the users, you have likely missed the point altogether. (For example, if you think of Twitter hashtags simply as an organizational scheme, you’ve missed the strategic, ironic, insightful and often humorous ways in which people use hashtags. Or if you think that hashtags naturally fall into specific patterns, you’re missing their dialogic nature.)

My current research involves the cycle between social media and journalism, and it runs across platforms. I am asking questions like ‘what gets picked up by reporters, and why?’ and ‘what is designed for reporters to pick up?’ Some of these questions lead me to examine the differences between funny memes, which circulate like wildfire through Twitter and lead to trends and a wider stage, and the more in-depth conversation on public Facebook pages, which cannot trend as easily and is far less punchy and digestible. What role does each play in the political process and in constituting news?

Of course, my current research asks more questions than these, but it’s currently under construction. I’d rather not invite you into the workzone until some of the pulp and debris have been swept aside…

Is there Interdisciplinary hope for Social Media Research?

I’ve been trying to wrap my head around social media research for a couple of years now. I don’t think it would be as hard to understand from any one academic or professional perspective, but, from an interdisciplinary standpoint, the variety of perspectives and the disconnects between them are stunning.

In the academic realm:

There is the computer science approach to social media research. From this standpoint, we see the fleshing out of machine learning algorithms in a stunning horserace of code development across a few programming languages. This is the most likely to be opaque, proprietary knowledge.

There is the NLP or linguistic approach, which overlaps to some degree with the CS approach, although it is often more closely tied to grammatical rules. In this case, we see grammatical parsers, dictionary development, and APIs or shared programming modules, such as NLTK or GATE. Linguistics is divided as a discipline, and many of these divisions have filtered into NLP.
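
As a rough illustration of this resource-driven side of NLP, here is a minimal sketch using NLTK's tokenizer and part-of-speech tagger (the sentence is my own, and the relevant NLTK data downloads are assumed); parsers and hand-built dictionaries are typically layered on top of output like this.

```python
# A small illustration of tokenization and POS tagging with NLTK.
# Assumes the standard NLTK tokenizer and tagger resources have been
# downloaded via nltk.download().
import nltk

sentence = "The canals extend outward from the train station like spokes."
tokens = nltk.word_tokenize(sentence)   # tokenization
tagged = nltk.pos_tag(tokens)           # part-of-speech tagging

print(tagged)
# A grammatical parser or a hand-built dictionary (e.g., a sentiment lexicon)
# would typically be layered on top of output like this.
```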

Both the NLP and CS approaches can be fleshed out, trained, or used on just about any data set.

There are the discourse approaches. Discourse is an area of linguistics concerned with meaning above the level of the sentence. This type of research can follow more of a strict Conversation Analysis approach or a kind of Netnography approach. This school of thought is more concerned with context as a determiner or shaper of meaning than the two approaches above.

For these approaches, the dataset cannot just come from anywhere. The analyst should understand where the data came from.

One could divide these traditions by programming skills, but there are enough of us who do work on both sides that the distinction is superficial. Although, generally speaking, the deeper one’s programming or qualitative skills, the less likely one is to cross over to the other side.

There is also a growing tradition of data science, which is primarily quantitative. Although I have some statistical background and work with quantitative data sets every day, I don’t have a good understanding of data science as a discipline. I assume that the growing field of data visualization would fall into this camp.

In the professional realm:

There are many companies in horseraces to develop the best systems first. These companies use catchphrases like “big data” and “social media firehose” and often focus on sentiment analysis or topic analysis (usually topics are gleaned through keywords). These companies primarily market to the advertising industry and market researchers, often with inflated claims of accuracy, which are possible because of the opacity of their methods.
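
To show how little machinery a keyword-driven approach can involve, here is a deliberately naive sketch of keyword-based sentiment scoring; the word lists and weights are invented, and real products use much larger lexicons and more machinery, but the opacity problem described above is the same.

```python
# A deliberately naive, hypothetical keyword-based sentiment scorer.
# The word lists below are invented for illustration only.
POSITIVE = {"love", "great", "amazing", "raved"}
NEGATIVE = {"hate", "terrible", "awful", "disappointing"}

def keyword_sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(keyword_sentiment("I love the truffle fries"))                 # positive
print(keyword_sentiment("I hate fried food"))                        # negative
print(keyword_sentiment("Yeah, I just love waiting an hour for fries"))
# The last one also scores "positive": sarcasm is invisible to the keyword net.
```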

There is the realm of market research, which is quickly becoming dependent on fast, widely available knowledge. This knowledge is usually gleaned through companies involved in the horserace, without much awareness of the methodology. There is an increasing need for companies to be aware of their brand’s mentions and interactions online, in real time, and as they collect this information it is easy, convenient and cost effective to collect more information in the process, such as sentiment analyses and topic analyses. This field has created an astronomically high demand for big data analysis.

There is the traditional field of survey research. This field is methodical and error-focused. Knowledge is created empirically and evaluated critically. Every aspect of the survey process is highly researched and understood in great depth, so new methods are greeted with a natural skepticism. Although they have traditionally been the anchors of good professional research methods and the leaders in the research field, survey researchers are largely outside of the big data rush. Survey researchers tend to value accuracy over timeliness, so the big, fast world of big data, with its dubious ability to create representative samples, holds little allure or relevance.

The wider picture

In the wider picture, we have discussions of access and use. We see a growing proportion of the population coming online on an ever greater variety of devices. On the surface, the digital divide is fast shrinking (albeit still significant). Some of the digital access debate has been expanded into an understanding of differential use- essentially that different people do different activities while online. I want to take this debate further by focusing on discursive access or the digital representation of language ideologies.

The problem

The problem with such a wide spread of methods, needs, focuses and analytic traditions is that there isn’t enough crossover. It is very difficult to find work that spreads across these domains. The audiences are different, the needs are different, the abilities are different, and the professional visions are dramatically different across traditions. Although many people are speaking, it seems like people are largely speaking within silos or echo chambers, and knowledge simply isn’t trickling across borders.

This problem has rapidly grown because the underlying professional industries have quickly calcified. Sentiment analysis is not the revolutionary answer to the text analysis problem, but it is good enough for now, and it is skyrocketing in use. Academia is moving too slowly for the demands of industry and is not addressing industry’s needs, so other analytic techniques are not being adopted.

Social media analysis would best be accomplished by a team of people, each with different training. But it is not developing that way. And that, I believe, is a big (and fast growing) problem.

Storytelling about correlation and causation

Many researchers have great war stories to tell about the perilous waters between correlation and causation. Here is my personal favorite:

In the late 90’s, I was working with neurosurgery patients in a medical psychology clinic in a hospital. We gave each of the patients a battery of cognitive tests before their surgery and then administered the same battery 6 months after the surgery. Our goal was to check for cognitive changes that may have resulted from the surgery. One researcher from outside the clinic focused on our strongest finding: a significant reduction of anxiety from pre-op to post-op. She hypothesized that this dramatic finding was evidence that the neural basis for anxiety was affected by the surgery. Had she only taken a minute to explain her hypothesis in plain terms to a layperson, especially one that could imagine the anxiety a patient could potentially experience hours before brain surgery, she surely would have withdrawn her request for our data and slipped quietly out of our clinic.

“Correlation does not imply causation” is a research catchphrase that is drilled into practitioners from internhood and intro classes onward. It is particularly true when working with language, because all linguistic behavior is highly patterned behavior. Researchers from many other disciplines would kill to have chi square tests as strong as linguists’ chi squares. In fact, linguists have to reach deeper into their statistical toolkits, because the significance levels alone can be misleading or inadequate.

People who use language but don’t study linguistics usually aren’t aware of the degree of patterning that underlies the communication process. Language learning has statistical underpinnings, and language use has statistical underpinnings. It is because of this patterning that linguistic machine learning is possible. But linguistic patterning is a double-edged sword: potentially helpful in programming and harmful in analysis. Correlations abound, and they’re mostly real correlations, although, statistically speaking, some will be products of peculiarities in a dataset. But outside of any context or theory, these findings are meaningless. They don’t speak to the underlying relationship between the variables in any way.
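
As a toy illustration of this point, here is a sketch (numbers invented) of a contingency table that yields a very strong chi-square while telling us nothing about why the usage differs, assuming SciPy is available.

```python
# A toy illustration with invented numbers: word choice across two forums
# produces a very strong chi-square, but the test is silent on causes.
from scipy.stats import chi2_contingency

#                uses "whom"   uses "who"
observed = [[180, 820],    # forum A
            [ 40, 960]]    # forum B

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")
# A tiny p-value here only shows that usage is patterned differently across
# the two forums; it cannot tell us whether the cause is education, age,
# topic, register, moderation policy, or something else entirely.
```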

A word of caution to researchers whose work centers around the discovery of correlations. Be careful with your findings. You may have found evidence that shows that a correlation may exist. But that is all you have found. Take your next steps carefully. First, step back and think about your work in layman’s terms. What did you find, and is that really anything meaningful? If your findings still show some prospects, double down further and dig deeper. Try to get some better idea of what is happening. Get some context.

Because a correlation alone is no gold nugget. You may think you’ve found some fashion, but your emperor could very well still be naked.

What do all of these polling strategies add up to?

Yesterday was a big first for research methodologists across many disciplines. For some of the newer methods, it was the first election that they could be applied to in real time. For some of the older methods, this election was the first to bring competing methodologies, and not just methodological critiques.

Real-time sentiment analysis from sites like this summarized Twitter’s take on the election. This paper sought to predict electoral turnout using Google searches. InsideFacebook attempted to use Facebook data to track voting. And those are just a few of a rapid proliferation of data sources, analytic strategies and visualizations.

One could ask, who are the winners? Some (including me) were quick to declare a victory for the well-honed craft of traditional pollsters, who showed that they were able to repeat their studies with little noise, and that their results were predictive of wider real-world phenomena. Some could call a victory for the emerging field of Data Science. Obama’s Chief Data Scientist is already beginning to be recognized. Comparisons of analytic strategies will spring up all over the place in the coming weeks. The election provided a rare opportunity in which so many strategies and so many people were working in one topical area. The comparisons will tell us a lot about where we are in the data horse race.

In fact, most of these methods were successful predictors in spite of their complicated underpinnings. The Google searches took into account searches for variations of “vote,” which worked as a kind of reliable predictor but belied the complicated web of naturalistic search terms (which I alluded to in an earlier post about the natural development of hashtags, as explained by Rami Khater of Al Jazeera’s The Stream, a social network generated newscast). I was a real-world example of this methodological complication. Before I went to vote, I googled “sample ballot.” Similar intent, but I wouldn’t have been caught in the analyst’s net.

If you look deeper at the Sentiment Analysis tools that allow you to view the specific tweets that comprise their categorizations, you will quickly see that, although the overall trends were in fact predictive of the election results, the data coding was messy, because language is messy.

And the victorious predictive ability of traditional polling methods belies the complicated nature of interviewing as a data collection technique. Survey methodologists work hard to standardize research interviews in order to maximize the reliability of the interviews. Sometimes these interviews are standardized to the point of recording. Sometimes the interviews are so scripted that interviewers are not allowed to clarify questions, only to repeat them. Critiques of this kind of standardization are common in survey methodology, most notably from Nora Cate Schaeffer, who has raised many important considerations within the survey methodology community while still strongly supporting the importance of interviewing as a methodological tool. My reading assignment for my ethnography class this week is a chapter by Charles Briggs from 1986 (Briggs – Learning how to ask) that proves that many of the new methodological critiques are in fact old methodological critiques. But the critiques are rarely heeded, because they are difficult to apply.

I am currently working on a project that demonstrates some of the problems with standardizing interviews. I am revising a script we used to call a representative sample of U.S. high schools. The script was last used four years ago in a highly successful effort that led to an admirable 98% response rate. But to my surprise, when I went to pull up the old script I found instead a system of scripts. What was an online and phone survey had spawned fax and e-mail versions. What was intended to be a survey of principals now had a set of potential respondents from the schools, each with their own strengths and weaknesses. Answers to common questions from school staff were loosely scripted on an addendum to the original script. A set of tips for phonecallers included points such as “make sure to catch the name of the person who transfers you, so that you can specifically say that Ms X from the office suggested I talk to you” and “If you get transferred to the teacher, make sure you are not talking to the whole class over the loudspeaker.”

Heidi Hamilton, chair of the Georgetown Linguistics department, often refers to conversation as “climbing a tree that climbs back.” In fact, we often talk about meaning as mutually constituted between all of the participants in a conversation. The conversation itself cannot be taken outside of the context in which it lives. The many documents I found from the phonecallers show just how relevant these observations can be in an applied research environment.

The big question that arises from all of this is one of a practical strategy. In particular, I had to figure out how to best address the interview campaign that we had actually run when preparing to rerun the campaign we had intended to run. My solution was to integrate the feedback from the phonecallers and loosen up the script. But I suspect that this tactic will work differently with different phonecallers. I’ve certainly worked with a variety of phonecallers, from those that preferred a script to those that preferred to talk off the cuff. Which makes the best phonecaller? Neither. Both. The ideal phonecaller works with the situation that is presented to them nimbly and professionally while collecting complete and relevant data from the most reliable source. As much of the time as possible.

At this point, I’ve come pretty far afield of my original point, which is that all of these competing predictive strategies have complicated underpinnings.

And what of that?

I believe that the best research is conscious of its strengths and weaknesses and not afraid to work with other strategies in order to generate the most comprehensive picture. As we see comparisons and horse races develop between analytic strategies, I think the best analyses we’ll see will be the ones that fit the results of each of the strategies together, simultaneously developing a fuller breakdown of the election and a fuller picture of our new research environment.

I think I’m using “big data” incorrectly

I think I’m using the term “big data” incorrectly. When I talk about big data, I’m referring to the massive amount of freely available information that researchers can collect from the internet. My expectation is that the researchers must choose which firehose best fits their research goals, collect and store the data, and groom it to the point of usability before using it to answer targeted questions or examining it for answers in need of a question.

The first element of this that makes it “big data” to me is that the data is freely available and not subject to any privacy violations. It can be difficult to collect and store, because of its sheer size, but it is not password protected. For this reason, I would not consider Facebook to be a source for “big data.” I believe that the overwhelming majority of Facebook users impose some privacy controls, and the resulting, freely available information cannot be assigned any kind of validity. There are plenty of measures of inclusion for online research, and ignorance about privacy rules or sheer exhibitionism are not target qualities by any of these standards.

The second crucial element to my definition of “big data” is structure. My expectation is that it is in any researcher’s interest to understand the genesis and structure of their data as much as possible, both for the sake of grooming and for the sake of assigning some sense of validity to their findings. Targeted information will be laid out and signaled very differently in different online environments, and the researcher must work to develop both working delimiters to find probable targets and a sense of context for the data.
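
As a small, hypothetical example of what that grooming step can look like in practice, here is a sketch that strips markup and platform boilerplate from scraped text; the patterns are invented, and real delimiters have to be worked out per platform, which is exactly why knowing where the data came from matters.

```python
# A minimal, hypothetical grooming step for scraped pages. The boilerplate
# patterns are invented for illustration; real ones differ per platform.
import re

def groom(raw_html: str) -> str:
    text = re.sub(r"<script.*?</script>", " ", raw_html, flags=re.S)  # drop scripts
    text = re.sub(r"<[^>]+>", " ", text)                              # drop remaining tags
    text = re.sub(r"(Share|Like|Log in to comment)", " ", text)       # platform boilerplate (hypothetical)
    return re.sub(r"\s+", " ", text).strip()                          # normalize whitespace

raw = "<div><p>Great post!</p><span>Share</span><script>var x=1;</script></div>"
print(groom(raw))   # -> "Great post!"
```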

The third crucial element is representativeness. What do these findings represent? Under what conditions? “Big data” has a wide array of answers to these questions. First, it is crucial to note that it is not representative of the general population. It represents only the networked members of a population who were actively engaging with an online interface within the captured window of time in a way that left a trace or produced data. Because of this, we look at individual people by their networks, and not by their representativeness. Who did they influence, and to what degree could they influence those people? And we look at other units of analysis, such as the website that the people were contributing on, the connectedness of that website, and the words themselves and their degree of influence, both directly and indirectly.
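
Here is a short sketch of what looking at data by network position rather than by representativeness might look like, using networkx; the reply edges are invented for illustration.

```python
# A short sketch of asking "who influenced whom" within a slice of public
# comment data, rather than asking whether the commenters mirror a population.
# The edges below are invented for illustration.
import networkx as nx

# Directed edges: (commenter, person or post they replied to)
G = nx.DiGraph()
G.add_edges_from([
    ("amina", "al_jazeera_post"),
    ("brian", "al_jazeera_post"),
    ("brian", "amina"),
    ("chen",  "amina"),
    ("dalia", "amina"),
])

# In-degree centrality: who drew the most direct responses within this slice.
influence = nx.in_degree_centrality(G)
for node, score in sorted(influence.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.2f}")
```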

Given those elements of understanding, we are able to provide a framework from which the analysis of the data itself is meaningful and useful.

I’m aware that my definition is not the generally accepted definition. But for the time being I will continue to use it for two reasons:

1. Because I haven’t seen any other terms that better fit
2. Because I think that it is critically important that any talk about data use is tied to measures that encourage the researcher to think about the meaning and value of their data

It’s my hope that this is a continuing discussion. In the meantime, I will trudge on in idealistic ignorance.