Data science can be pretty badass, but…

Every so often I’m reminded of the power of data science. Today I attended a talk entitled “Spatiotemporal Crime Prediction Using GPS & Time-tagged Tweets” by Matt Gerber of the UVA PTL. The talk was a UMD CLIP event (great events! Go if you can!).

Gerber began by introducing a few of the PTL projects, which include:

  • Developing automatic detection methods for extremist recruitment in the Dark Net
  • Turning medical knowledge from large bodies of unstructured texts into medical decision support models
  • Many other cool initiatives

He then introduced the research at hand: developing predictive models for criminal activity. The control model in this case used police report data from a given period of time to map incidents onto a map of Chicago using latitude and longitude. He then superimposed a grid on the map and collapsed incidents down into a binary presence vs. absence model. Each square in the grid would either have one or more crimes (1) or not have any crimes (-1). This was his training data. He built a binary classifier, used logistic regression to compute probabilities and layered a kernel density estimator on top. He used this control model to compare with a model built from unstructured text. The unstructured text consisted of GPS-tagged Twitter data (roughly 3% of tweets) from the Chicago area. He drew the same grid using longitude and latitude coordinates and tossed all of the tweets from each “neighborhood” (during the same one-month training window) into the boxes. Then, treating each box as one document for a document-based classifier, he subjected each document to topic modeling (using LDA & MALLET). He focused on crime-related words and topics to build models to compare against the control models. He found that the predictive value of both models was similar when compared against actual crime reports from days within the subsequent month.
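
For readers who like to see the mechanics, here is a rough sketch of the gridding step described above. It is not Gerber's code: the bounding box, the 2x2 grid, the sample points and the function names (cell_of, label_cells, tweets_to_documents) are all made up for illustration, and the classifier, kernel density estimate and topic modeling are left out entirely.

```python
from collections import defaultdict


def cell_of(lat, lon, bounds, n_rows, n_cols):
    """Return the (row, col) grid cell containing a point, or None if outside."""
    lat_min, lat_max, lon_min, lon_max = bounds
    if not (lat_min <= lat < lat_max and lon_min <= lon < lon_max):
        return None
    row = int((lat - lat_min) / (lat_max - lat_min) * n_rows)
    col = int((lon - lon_min) / (lon_max - lon_min) * n_cols)
    # Guard against floating-point rounding right at the upper edge.
    return min(row, n_rows - 1), min(col, n_cols - 1)


def label_cells(incidents, bounds, n_rows, n_cols):
    """Collapse incident points into presence (1) vs. absence (-1) per cell."""
    labels = {(r, c): -1 for r in range(n_rows) for c in range(n_cols)}
    for lat, lon in incidents:
        cell = cell_of(lat, lon, bounds, n_rows, n_cols)
        if cell is not None:
            labels[cell] = 1
    return labels


def tweets_to_documents(tweets, bounds, n_rows, n_cols):
    """Join the text of all tweets that fall in a cell: one box = one document."""
    texts_by_cell = defaultdict(list)
    for lat, lon, text in tweets:
        cell = cell_of(lat, lon, bounds, n_rows, n_cols)
        if cell is not None:
            texts_by_cell[cell].append(text)
    return {cell: " ".join(texts) for cell, texts in texts_by_cell.items()}


# Hypothetical bounding box (lat_min, lat_max, lon_min, lon_max) and toy data.
BOUNDS = (41.0, 42.0, -88.0, -87.0)
incidents = [(41.1, -87.9), (41.15, -87.95), (41.9, -87.1)]
tweets = [
    (41.1, -87.9, "late night downtown"),
    (41.12, -87.88, "traffic again"),
    (41.6, -87.4, "game day"),
]

labels = label_cells(incidents, BOUNDS, 2, 2)
documents = tweets_to_documents(tweets, BOUNDS, 2, 2)
```

In this toy setup, labels would be the training target for the control model, and documents would be the per-box texts handed to a topic model.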

This is a basic model. The layering can be further refined and better understood (there was some discussion about the word “turnup,” for example). Many more interesting layers can be built into it in order to improve its predictive power, including more geographic features, population densities, some temporal modeling to accommodate the periodic nature of some crimes (e.g. most robberies happen during the work week, while people are away from their homes), a better accommodation for different types of crime, and a host of potential demographic and other variables.

I would love to dig into this data to better understand the conversation underlying the topic models. I imagine there is a wealth of information to be gained, as well as a clearer sense of what kind of work the models are doing. It strikes me that each assumption and calculation has a heavy social load attached to it. Each variable and each layer that is built into the model and roots out correlations may be working to reinforce certain stereotypes and anoint them with the power of massive data. Some questions need to be asked. Who has access to the internet? What type of access? How are they using the internet? Are there substantive differences between tweets with and without geotagging? What varieties of language are the tweeters using? Do classifiers take into account language variation? Are the researchers simply building a big data model around the old “bad neighborhood” notions?

Data is powerful, and the predictive power of data is fascinating. Calculations like these raise questions in new ways, remixing old assumptions into new correlations. Let’s not forget to question new methods, put them into their wider sociocultural contexts and delve qualitatively into the data behind the analyses. Data science can be incredibly powerful and interesting, but it needs a qualitative and theoretical perspective to keep it rooted. I hope to see more, deeper interdisciplinary partnerships soon, working together to build powerful, grounded, and really interesting research!

 

Rethinking Digital Democracy- More reflections from #SMSociety13

What does digital democracy mean to you?

I presented this poster: Rethinking Digital Democracy v4 at the Social Media and Society conference last weekend, and it demonstrated only one of many images of digital democracy.

Digital democracy was portrayed at this conference as:

having a voice in the local public square (Habermas)

making local leadership directly accountable to constituents

having a voice in an external public sphere via international media sources

coordinating or facilitating a large scale protest movement

the ability to generate observable political changes

political engagement and/or mobilization

a working partnership between citizenry, government and emergency responders in crisis situations

a systematic archival of government activity brought to the public eye. “Archives can shed light on the darker places of the national soul” (Wilson 2012)

One presenter had the most systematic representation of digital democracy. Regarding the recent elections in Nigeria, he summarized digital democracy this way: “social media brought socialization, mobilization, participation and legitimization to the Nigerian electoral process.”
Not surprisingly, different working definitions brought different measures. How do you know that you have achieved digital democracy? What constitutes effective or successful digital democracy? And what phenomena are worthy of study and emulation? The scope of this question and answer varies greatly among some of the examples raised during the conference, which included:

citizens in the recent Nigerian election

citizens who tweet during a natural disaster or active crisis situation

citizens who changed the international media narrative regarding the recent Kenyan elections and ICC indictment

Arab Spring actions, activities and discussions
“The power of the people is greater than the people in power” a perfect quote related to Arab revolutions on a slide from Mona Kasra

the recent Occupy movement in the US

tweets to, from and about the US congress

and many more that I wasn’t able to catch or follow…

In the end, I don’t have a suggestion for a working definition or measures, and my coverage here really only scratches the surface of the topic. But I do think that it’s helpful for people working in the area to be aware of the variety of events, people, working definitions and measures at play in wider discussions of digital democracy. Here are a few questions for researchers like us to ask ourselves:

What phenomenon are we studying?

How are people acting to affect their representation or governance?

Why do we think of it as an instance of digital democracy?

Who are “the people” in this case, and who is in a position of power?

What is our working definition of digital democracy?

Under that definition, what would constitute effective or successful participation? Is this measurable, codeable or a good fit for our data?

Is this a case of internal or external influence?

And, for fun, a few interesting areas of research:

There is a clear tension between the ground-up perception of the democratic process and the degree of cohesion necessary to effect change (e.g. Occupy & the anarchist framework)

Erving Goffman’s participant framework is also further ground for research in digital democracy (author/animator/principal <– think online petition and e-mail drives, for example, and the relationship between reworded messages, perceived efficacy and the reception that the e-mails receive).

It is clear that social media helps people have a voice and connect in ways that they haven’t always been able to. But this influence has yet to take any firm shape either among researchers or among those who are practicing or interested in digital democracy.

I found this tweet particularly apt, so I’d like to end on this note:

“Direct democracy is not going to replace representative government, but supplement and extend representation” #YES #SMSociety13

— Ray MacLeod (@RayMacLeod) September 14, 2013

 

 

Reflections on Digital Dualism & Social Media Research from #SMSociety13

I am frustrated by both Digital Dualism and the fight against Digital Dualism.

Digital dualism is the belief that online and offline are different worlds. It shows up relatively harmlessly when someone calls a group of people who are on their devices “antisocial,” but it is much more harmful in the way it pervades the language we use about online communication (e.g. “real” vs. “virtual”).

Many researchers have done important work countering digital dualism. For example, at the recent Social Media & Society conference, Jeffrey Keefer briefly discussed his doctoral work in which he showed that the support that doctoral students offered each other online was both very real and very helpful. I think it’s a shame that anyone ever doubted the power of a social network during such a challenging time, and I’m happy to see that argument trounced! Wooooh, go Jeffrey! (now a well-deserved Dr Keefer!)

Digital dualism is a false distinction, but it is built in part on a distinction that is also very real and very important. Online space and offline space are different spaces. People can act in either to achieve their goals in very real ways, but, although both are very real, they are very different. The set of qualities with which the two overlap and differ and even blur into each other changes every day. For example, “real name” branding online and GPS enabled in-person gaming across college campuses continue to blur boundaries.

But the private and segmented aspects of online communication are important as well. Sometimes criticism of online space is based on this segmentation, but communities of interest are longstanding phenomena. A book club is expected to be a club for people with a shared interest in books. A workplace is a place for people with shared professional interests. A swim team is for people who want to swim together. And none of these relationships would be confused with the longstanding close personal relationships we share with friends and family. When online activities are compared with offline ones, often people are falsely comparing interest related activities online with the longstanding close personal ties we share with friends and family. In an effort to counter this, some have taken steps to make online communication more unified and holistic. But they do this at the expense of one of the greatest strengths of online communication.

Let’s discuss my recent trip to Halifax for this conference as an example.

My friends and family saw this picture:

Voila! Rethinking Digital Democracy! More of a "Hey mom, here's my poster!" shot than a "Read and engage with my argument!" shot

My dad saw this one:

Not bad for airport fare, eh?

This picture showed up on Instagram:

It’s a glass wall, but it looks like water!

People on Spotify might have followed the music I listened to, and people on Goodreads may have followed my inflight reading.

My Twitter followers and those following the conference online saw this:

Talking about remix culture! Have I landed in heaven? #SMSociety13 #heaveninhalifax #niiice

— Casey Langer Tesfaye (@FreeRangeRsrch) September 15, 2013

And you have been presented with a different account altogether.

This fractioning makes sense to me, because I wouldn’t expect any one person to share this whole set of interests. I am able to freely discuss my area of interest with others who share the same interests.

Another presenter gave an example of LGBT youth on Facebook. The lack of anonymity can make it very hard for people who want to experiment or speak freely about a taboo topic to do so without it being taken out of context. Private and anonymous spaces that used to abound online are increasingly hard to find.

In my mind this harkens back a little to the early days of social media research, when research methods were deeply tied to descriptions of platforms and online activity on them. As platforms rose and fell, this research became increasingly useless. Researchers had to move their focus to online actions without trying to root them in platform or offline activity. Is social media research being hindered in similar ways, by answering old criticisms instead of focusing on current and future potential? Social media research needs to move away from these artificial roots. Instead of countering silly claims about social media being antisocial or anything less than real communication, we should focus our research activities on the ways in which people communicate online and the situated social actions and behaviors in online situations. This means: don’t try to ferret out people from usernames, or sort out who is behind a username. Don’t try to match across platforms. Don’t demand real names.

Honestly, anyone who is subjected to social feeds that contain quite a few posts outside their area of interest should be grateful to refocus and move on! People of abstract Instagram should be thrilled not to have seen a bowl of seafood chowder, and my family and friends should be thrilled not to have to hear me ramble on about digital dualism or context collapse!

I would love to discuss this further. If you’ve been waiting to post a comment on this blog, this is a great time for you to jump in and join the conversation!

Reflections on Social Network Analysis & Social Media Research from #SMSociety13

A dispatch from a quantitative side of social media research!

Here are a few of my reflections from the Social Media & Society conference in Halifax and my Social Network Analysis class.

I should first mention that I was lucky in two ways.

  1. I finished the James Bond movie ‘Skyfall’ as my last Air Canada flight was landing. (Ok, I didn’t have to mention that)
  2. I finished my online course on Social Network Analysis hours before leaving for a conference that kicked off with an excellent talk about networks and diffusion. And then on the second day of the conference I was able to manipulate a network visualization with my hands using a 96 inch touchscreen at the Dalhousie University Social Media Lab (great lab, by the way, with some very interesting and freely available tools).

 

This picture doesn't do this screen justice. This is *data heaven*

Social networks are networks built to describe human action in social media environments. They contain nodes (dots), which could represent people, usernames, objects, etc., and edges, lines joining nodes that represent some kind of relationship (friend, follower, contact, or a host of other quantitative measures). The course was a particularly great introduction to Social Network Analysis, because it included a book that was clear and interesting, a set of YouTube videos and a website, all of which were built to work together. The instructor (Dr Jen Golbeck, also the author of the book and materials) has a distinctive interest in SNA which gives the class an important added dimension. Her focus is on operational definitions and quantitative measures of trust, and because of this we were taught to carefully consider the role of the edges and edge weights in our networks.

Sharad Goel’s plenary at #SMSociety13 was a very different look at networks. He questioned the common notion of viral diffusion online by looking at millions of cases of diffusion. He discovered that very few diffusions actually resemble any kind of viral model. Instead, most diffusion happens on a very small scale. He used Justin Bieber as an example of diffusion. Bieber has the largest number of followers on Twitter, so when he posts something it has a very wide reach (“the Bieber effect”). However, people don’t share content as often as we imagine. In fact, only a very small proportion of his followers share it, and only a small proportion of their followers share it. Overall, the path is wide and shallow, with fewer vertical layers than we had previously envisioned.
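
To make “wide and shallow” concrete, here is a small sketch that measures the shape of a single sharing cascade. This is only an illustration, not Goel's method: the account names and reshare pairs are invented, and cascade_shape is a hypothetical helper that counts how many accounts sit at each hop from the original poster.

```python
from collections import defaultdict


def cascade_shape(root, reshares):
    """Return (size, depth, widths) for one cascade.

    reshares is a list of (sharer_seen, resharer) pairs.
    widths[i] is the number of accounts i hops away from the root.
    """
    children = defaultdict(list)
    for parent, child in reshares:
        children[parent].append(child)

    widths = []
    seen = {root}
    level = [root]
    while level:
        widths.append(len(level))
        next_level = []
        for node in level:
            for child in children[node]:
                if child not in seen:  # count each account once
                    seen.add(child)
                    next_level.append(child)
        level = next_level
    return len(seen), len(widths) - 1, widths


# Toy cascade: four followers reshare the original post, one of their followers reshares again.
reshares = [("star", "a"), ("star", "b"), ("star", "c"), ("star", "d"), ("a", "e")]
size, depth, widths = cascade_shape("star", reshares)
```

Here the cascade reaches six accounts but is only two hops deep, with almost all of the sharing happening at the first hop.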

Goel’s research is an example of Big Data in action. He said that Big Data methods are important when the phenomenon you want to study happens very infrequently (e.g. one in a million), as is the case for actual instances of viral diffusion.

His conclusions were big, and this line of research is very informative and useful for anyone trying to communicate on a large scale.

Sidenote: the term ‘ego network’ came up quite a few times during the conference, but not everyone knew what an ego network is. An ego network begins with a single node and is measured by degrees. A one degree social network looks a bit like an asterisk- it simply shows all of the nodes that are directly connected to the original node. A 1.5 degree network would include the first degree connections as well as the connections between them. A two degree network contains all of the first degree connections to these nodes that were in the one degree network. And so on.
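
The three flavors of ego network described above can be sketched in a few lines. This is a minimal illustration under my own assumptions: the graph is a made-up undirected network stored as a dictionary of neighbor sets, and ego_network is a hypothetical helper, not a function from any SNA package.

```python
def ego_network(graph, ego, degree):
    """Return the edges of an ego network as a set of frozensets.

    graph maps each node to the set of its neighbors (undirected).
    degree is 1, 1.5 or 2, following the description in the text.
    """
    neighbors = graph[ego]
    # Degree 1: the "asterisk" of edges between the ego and its direct connections.
    edges = {frozenset((ego, n)) for n in neighbors}
    if degree >= 1.5:
        # Degree 1.5: add the connections among the first-degree nodes.
        edges |= {frozenset((a, b)) for a in neighbors for b in graph[a] if b in neighbors}
    if degree >= 2:
        # Degree 2: add every connection of the first-degree nodes.
        edges |= {frozenset((a, b)) for a in neighbors for b in graph[a]}
    return edges


# Toy network: "me" knows ann and bob, who know each other; bob knows cat; cat knows dan.
graph = {
    "me": {"ann", "bob"},
    "ann": {"me", "bob"},
    "bob": {"me", "ann", "cat"},
    "cat": {"bob", "dan"},
    "dan": {"cat"},
}

one = ego_network(graph, "me", 1)
one_and_a_half = ego_network(graph, "me", 1.5)
two = ego_network(graph, "me", 2)
```

In the toy network, the cat and dan connection stays out of even the two degree network, because neither of them is directly connected to the ego.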

One common research strategy is to compare across ego networks.

My next post will move on from SNA to more qualitative aspects of the conference.

Source: https://twitter.com/JeffreyKeefer/status/378921564281921537/photo/1
This was the backdrop for a qualitative panel. It says “Every time you say ‘data driven decision’ a fairy dies.”

MOOCs, Libraries, Online learning and Thirsting for knowledge

Let me begin by telling you a story.

The story began when I was in high school searching for the right college. My mom and I took a road trip the summer after my junior year of high school. We took our time and covered quite a bit of ground. I discovered Hot97 in New York and Pepto Bismol in North Carolina. I fell in love with upstate NY. After our return, I began the application and interview process. The most memorable moment came during my interview with a representative from Cornell. She asked if I had any burning questions, and I decided to go ahead and ask her a question that had really been nagging at me: What is the difference between a class at Cornell and a class at a community college? She was shocked and deeply offended. She told me that anyone could get a great education anywhere they could find a library, and obviously I wasn’t right for Cornell.

This exchange has haunted me ever since. I do love to read, sure, but a library alone could never create the magic that a classroom can create. And the most magical classes happen when the students are engaged, interested, attentive, involved, participating, excited and following through with the homework. Part of this magic comes from the teacher. A great teacher can cultivate this kind of environment with ease, but most really struggle when it doesn’t happen organically.

I’m not sure everyone would agree that classrooms can be magical. I may have been spoiled with great classes. I’ve just finished a master’s program where I loved the classes, loved the reading and loved the assignments, but I’m not sure that every student would approach school with as much relish. I love learning.

Tomorrow I begin an educational experiment. I will start a course in Social Network Analysis from Statistics.com. This is a paid course, and I’ve chosen to be held responsible for my work (you can choose whether or not to submit homework for grading). Next month, the experiment will deepen when I begin my first MOOC. The MOOC is a data analysis class that teaches R. I’m very eager to learn R and to revisit some statistical methods that I haven’t been able to use much. The experiment will not be pure, because three of my coworkers have decided to attend the class as well. We’ll be fortunate enough to experience part of the course in-person.

I’m not sure how I feel about distance education before beginning this experiment. Learning is something that I really love to do in-person. But so many things that happen online can be evaluated the same way. I recently read articles and commentary about a controversial paper on Twitter research (SSRN-id2235423). The research is fodder for some great discussion, but many commenters on the news articles simply chose to trash Twitter. They bemoaned the 140 character limit so strongly that one would think that Twitter is a land of Paris Hiltons and cats. I’d like the critics to know that yes, you can find Paris Hilton and cats on Twitter or just about anywhere else online. But you can also find something deeper, something that interests you. I recently introduced my nephew to Twitter. He’s a news junkie of sorts, and he was fascinated to see how much emerging news and quality commentary was available. The weekly #wjchats alone are reason to follow Twitter (#wjchat is a weekly methods chat between social media journalists). The reach of people on Twitter is unparalleled, and the ability to follow specific areas of interest in deeply engaged ways is also unparalleled. When used correctly, Twitter is a powerful tool.

Online learning has the potential to be a powerful tool as well. But it will require engagement from the people involved. We will need to suspend our natural hesitancy and develop the necessary competencies. I really hope that my classmates will be willing to embrace the experience!

Upcoming DC Event: Online Research Offline Lunch

ETA: Registration for this event is now CLOSED. If you have already signed up, you will receive a confirmation e-mail shortly. Any sign-ups after this date will be stored as a contact list for any future events. Thank you for your interest! We’re excited to gather with such a diverse and interesting group.

—–

Are you in or near the DC area? Come join us!

Although DC is a great meeting place for specific areas of online research, there are few opportunities for interdisciplinary gatherings of professionals and academics. This lunch will provide an informal opportunity for a diverse set of online researchers to listen and talk respectfully about our interests and our work and to see our endeavors from new, valuable perspectives.

Date & Time: August 6, 2013, 12:30 p.m.

Location: Near Gallery Place or Metro Center. Once we have a rough headcount, we’ll choose an appropriate location. (Feel free to suggest a place!)

Please RSVP using this form:

Spam, Personal histories and Language competencies

Over the recent holiday, I spent some time sorting through many boxes of family memorabilia. Some of you have probably done this with your families. It is fascinating, sentimental and mind-boggling. Highlights include both the things that strike a chord and things that can be thrown away. It’s a balance of efficiency and sap.

 

I’m always amazed by the way family memorabilia tells both private, personal histories and larger public ones. The boxes I dealt with last week were my mom’s, and her passion was politics. Even the Christmas cards she saved give pieces of political histories. Old thank you cards provide unknown nuggets of political strategy. She had even saved stirrers and plastic cups from an inauguration!

 

Campaign button found in the family files

 

 

My mom continued to work in politics throughout her life, but the work that she did more recently is understandably fresher and more tangible for me. I remember looking through printed Christmas cards from politicians and wondering why she held on to them. In her later years I worried about her tendency to hold on to mail merged political letters. I wondered if her tendency to personalize impersonal documents made her vulnerable to fraud. To me, her belief in these documents made no sense.

 

Flash forward one year to me sorting through boxes of handwritten letters from politicians that mirror the spam she held on to. For many years she received handwritten letters from elected politicians in Washington. At some point, the handwritten letters evolved into typed letters that were hand-corrected and included handwritten sections. These evolved into typed letters on which the only handwriting was the signature. Eventually, even the signatures became printed. But the intention and function of these letters remained the same, even as their typography evolved. She believed in these letters because she had been receiving them for many decades. She believed they were personal because she had seen more of them that were personal than not. The phrases that I believe to be formulaic and spammy were once handwritten, intentional, personal and probably even heartfelt.

 

 

There are a few directions I could go from here:

 

– I better understand why older people complain about the impersonalization of modern society and wax poetic about the old letter writing tradition. I could include a few anecdotes about older family members.

 

– I’m amazed that people would take the time to write long letters using handwriting that may never have been deciphered

 

– I could wax poetic about some of the cool things I found in the storage facility

 

 

But I won’t. Not in this blog. Instead, I’ll talk about competencies.

 

Spam is a manifestation of language competencies, although we often dismiss it as a total lack of language competence. In my Linguistics study, we were quickly taught the mantra “difference, not deficiency.” In fact it takes quite a bit of skill to develop spam letters. In survey research, the survey invitation letters that people so often dismiss have been heavily researched and optimized to yield a maximum response rate. In his book The Sociolinguistics of Globalization, Jan Blommaert details the many competencies necessary to create the Nigerian bank scam letters that were so heavily circulated a few years ago. And now I’ve learned that the political letters that I’m so quick to dismiss as thoughtless mail merges are actually part of a deep tradition of political action. Will that be enough for me to hold on to them? No. But I am saving the handwritten stuff. Boxes and boxes of it!

 

 

One day last week, as I drove to the storage facility, I heard an interview with Michael Pollan about Food Literacy. Pollan’s point was that the food deserts in some urban areas are not just a function of access. (Food deserts are areas where fresh food is difficult to obtain and grocery stores are few and far between, if they’re available at all.) Pollan believes that even if there were grocery stores available, the people in these neighborhoods lack the basic cooking skills to prepare the food. He cited a few basic cooking skills which are not basic to me (partly because I’m a vegetarian, and partly because of the cooking traditions I learned from) as a part of his argument.

 

As a linguist, it is very interesting to hear the baggage that people attach to language metaphorically carried over to food (“food illiteracy”). I wonder what value the “difference, not deficiency” mantra holds here. I’m not ready to believe that people in food deserts are indeed kitchen illiterate. But I wouldn’t hesitate to agree that their food cultures probably differ significantly from Pollan’s. The basic staples and cooking methods probably differ significantly. Pollan could probably make a lot more headway with his cause if, instead of assuming that the people he is trying to help lack any basic cooking skills, he advocated toward a culture change that included access, attainability, and the potential to learn different practical cooking skills. It’s a subtle shift, but an important one.

 

As a proud uncook, I’m a huge fan of any kind of food preparation that is two steps or less, cheap, easy and fresh. Fast food for me involves putting a sweet potato in the microwave and pressing “potato,” grabbing for an apple or carrots and peanut butter, or tossing chickpeas into a dressing. Slow food involves the basic sautéing, roasting, etc. that Pollan advocates. I imagine that the skills he advocates are more practical and enjoyable for him than they are for people like me, whose mealtimes are usually limited and chaotic. What he calls basic is impractical for many of us. And the differences in time and money involved in uncooking and “basics” add up quickly.

 

 

 

So I’ve taken this post in quite a few directions, but it all comes together under one important point. Different language skills are not a lack of language skills altogether. Similarly, different survival skills are not a total lack of survival skills. We all carry unique skillsets that reflect our personal histories with those skills as well as the larger public histories that our personal histories help to compose. We, as people, are part of a larger public. The political spam I see doesn’t meet my expectations of valuable, personal communication, but it is in fact part of a rich political history. The people who Michael Pollan encounters have ways of feeding themselves that differ from Pollan’s expectations, but they are not without important survival skills. Cultural differences are not an indication of an underlying lack of culture.

 

Fitness for Purpose, Representativeness and the perils of online reviews

Have you ever planned a trip online? In January, when I traveled to Amsterdam, I did all of the legwork online and ended up in a surprising place.

Amsterdam City Center is extremely easy to navigate. From the train station (a quick ride from the airport and a quick ride around The Netherlands), the canals extend outward like spokes. Each canal is flanked by streets. Then the city has a number of concentric rings emanating from the train station. Not only is the underlying map easy to navigate, there is a traveler station at the center and maps available periodically. English speaking tourists will see that not only do many people speak English, but Dutch has enough overlap with English to be comprehensible after even a short exposure.

But the city center experience was not as smooth for me. I studied map after map in the city center without finding my hotel. I asked for directions, and no one had heard of the hotel or the street it was on. The traveler center seemed flummoxed as well. Eventually I found someone who could help and found myself on a long commuter tram ride well outside the city center and tourist areas. The hotel had received great reviews and recommendations from many travelers. But clearly, the travelers who boasted about it were not quite the typical travelers, who likely would have ended up in one of the many hotels I saw from the tram window.

Have you ever discovered a restaurant online? I recently went to a nice, local restaurant that I’d been reading about for years. I ordered the truffle fries (fries with truffle salt and some kind of fondue sauce), because people had really raved about them, only to discover once they arrived that they were fundamentally french fries (totally not my bag- I hate fried food).

These review sites are not representative of anything. And yet we/I repeatedly use them as if they were reliable sources of information. One could easily argue that they may not be representative, but they are good enough for their intended use (fitness for purpose <– big, controversial notion from a recent AAPOR task force report on Nonprobability Sampling). I would argue that they are clearly not excellent for their intended use. But does that invalidate them altogether? They often provide the only window that we have into whatever it is that we intend them for.

Truffle fries aside, the restaurant was great. And location aside, the hotel was definitely an interesting experience.

Toilet capsule in hotel room (with frosted glass rotating pane for some degree of privacy)

Representativeness, qual & quant, and Big Data. Lost in translation?

My biggest challenge in coming from a quantitative background to a qualitative research program was representativeness. I came to class firmly rooted in the principle of Representativeness, and my classmates seemed not to have any idea why it mattered so much to me. Time after time I would get caught up in my data selection. I would pose the wider challenge of representativeness to a colleague, and they would ask “representative of what? why?”

 

In the survey research world, the researcher begins with a population of interest and finds a way to collect a representative sample of the population for study. In the qualitative world that accompanies survey research, units of analysis are generally people, and people are chosen for their representativeness. Representativeness is often constructed by demographic characteristics. If you’ve read this blog before, you know of my issues with demographics. Too often, demographic variables are used as knee-jerk variables instead of better-considered variables that are more relevant to the analysis at hand. (Maybe the census collects gender and not program availability, for example, but just because a variable is available and somewhat correlated doesn’t mean that it is in fact a relevant variable, especially when the focus of study is a population for whom gender is such an integral societal difference.)

 

And yet I spent a whole semester studying 5 minutes of conversation between 4 people. What was that representative of? Nothing but itself. It couldn’t have been exchanged for any other 5 minutes of conversation. It was simply a conversation that this group had and forgot. But over the course of the semester, this piece of conversation taught me countless aspects of conversation research. Every time I delved back into the data, it became richer. It was my first step into the world of microanalysis, where I discovered that just about anything can be a rich dataset if you use it carefully. A snapshot of people at a lecture? Well, how are their bodies oriented? A snapshot of video? A treasure trove of gestures and facial expressions. A piece of graffiti? Semiotic analysis! It goes on. The world of microanalysis is built on the practice of layered noticing. It goes deep rather than wide.

 

But what is it representative of? How could a conversation be representative? Would I need to collect more conversations, but restrict the participants? Collect conversations with more participants, but in similar contexts? How much or how many would be enough?

 

In the world of microanalysis, people and objects constantly create and recreate themselves. You consistently create and recreate yourself, but your recreations generally fall into a similar range that makes you different from your neighbors. There are big themes in small moments. But what are the small moments representative of? Themselves. Simply, plainly, nothing more and nothing else. Does that mean that they don’t matter? I would argue that there is no better way to understand the world around us in deep detail than through microanalysis. I would also argue that macroanalysis is an important part of discovering the wider patterns in the world around us.

 

Recently a NY Times blog post by Quentin Hardy has garnered quite a bit of attention.

Why Big Data is Not Truth: http://bits.blogs.nytimes.com/2013/06/01/why-big-data-is-not-truth/

This post has really struck a chord with me, because I have had a hard time understanding Hardy’s complaint. Is big data truth? Is any data truth? All data is what it is: a collection of some sort, collected under a specific set of circumstances. Even data that we hope to be more representative has sampling and contextual limitations. Responsible analysts should always be upfront about what their data represents. Is big data less truthful than other kinds of data? It may be less representative than, say, a systematically collected political poll. But it is what it is: different data, collected under different circumstances in a different way. It shouldn’t be equated with other data that was collected differently. One true weakness of many large-scale analyses is the blindness to the nature of the data, but that is a byproduct of the training algorithms that are used for much of the analysis. The algorithms need large training datasets, from anywhere. These sets are often developed through massive web crawlers. Here, context gets dicey. How does a researcher represent the data properly when they have no idea what it is? Hopefully researchers in this context will be wholly aware that, although their data has certain uses, it also has certain [huge] limitations.

 

I suspect that Hardy’s complaint is with the representation of massive datasets collected by web crawlers as a complete truth from which any analysis could be run and all of the greater truths of the world could be revealed. On this note, Hardy is exactly right. Data simply is what it is, nothing more and nothing less. And any analysis that focuses on an unknown dataset is just that: an analysis without context. Which is not to say that all analyses need to be representative, but rather that all responsible analyses of good quality need to be self-aware. If you do not know what the data represents and when and how it was collected, then you cannot begin to discuss the usefulness of any analysis of it.

What is the role of Ethnography and Microanalysis in Online Research?

There is a large disconnect in online research.

The largest, highest-profile, highest-value and most widely practiced side of online research was created out of a high demand to analyze the large amount of consumer data that is constantly being created and is largely publicly available. This tremendous demand led to research methods that were created in relative haste. Math and programming skills thrived in a realm where social science barely made a whisper. The notion of atheoretical research grew. The level of programming and mathematical competence required to do this work continues to grow higher every day, as the fields of data science and machine learning become continually more nuanced.

The smaller, lower-profile, lower-value and increasingly practiced side of online research is academic research. Turning academia toward online research has been like turning a massive ocean liner. For a while online research was not well respected. At this point it is increasingly well respected, thriving in a variety of fields and in a much needed interdisciplinary way, and driven by a search for a better understanding of online behavior and better theories to drive analyses.

I see great value in the intersection between these areas. I imagine that the best programmers have a big appetite for any theory they can use to drive their work in useful and productive ways. But I don’t see this value coming to bear on the market. Hiring is almost universally focused on programmers and data scientists, and the microanalytic work that is done seems largely invisible to the larger entities out there.

It is common to consider quantitative and qualitative research methods as two separate languages with few bilinguals. At the AAPOR conference in Boston last week, Paul Lavrakas mentioned a book he is working on with Margaret Roller which expands the Total Survey Error model to both quantitative and qualitative research methodology. I spoke with Margaret Roller about the book, and she emphasized the importance of qualitative researchers being able to talk more fluently and openly about methodology and quality controls. I believe that this is, albeit a huge challenge in wording and framing, a very important step for qualitative research, in part because quality frameworks lend credibility to qualitative research in the eyes of a wider research community. I wish this book a great deal of success, and I hope that it is able to find an audience and a frame outside the realm of survey research. (Although survey research has a great deal of foundational research, it is not well known outside of the field, and this book will merit a wider audience.)

But outside of this book, I’m not quite sure where or how the work of bringing these two distinct areas of research together can or will be done.

Also at the AAPOR conference last week, I participated in a panel on The Role of Blogs in Public Opinion Research (intro here and summary here). Blogs serve a special purpose in the field of research. Academic research is foundational and important, but the publish rate on papers is low, and the burden of proof is high. Articles that are published are crafted as an argument. But what of the bumps along the road? The meditations on methodology that arise? Blogs provide a way for researchers to work through challenges and to publish their failures. They provide an experimental space where fields and ideas can come together that previously hadn’t mixed. They provide a space for finding, testing, and crossing boundaries.

Beyond this, they are a vehicle for dissemination. They are accessible and informally advertised. The time frame to publish is short, the burden lower (although I’d like to believe that you have to earn your audience with your words). They are a public face to research.

I hope that we will continue to test these boundaries, to cross over barriers like quantitative and qualitative that are unhelpful and obtrusive. I hope that we will be able to see that we all need each other as researchers, and the quality research that we all want to work for will only be achieved through the mutual recognition that we need.