My biggest challenge in coming from a quantitative background to a qualitative research program was representativeness. I came to class firmly rooted in the principle of representativeness, and my classmates seemed not to have any idea why it mattered so much to me. Time after time I would get caught up in my data selection. I would pose the wider challenge of representativeness to a colleague, and they would ask, “Representative of what? Why?”
In the survey research world, the researcher begins with a population of interest and finds a way to collect a representative sample of the population for study. In the qualitative world that accompanies survey research, units of analysis are generally people, and people are chosen for their representativeness. Representativeness is often constructed by demographic characteristics. If you’ve read this blog before, you know of my issues with demographics. Too often, demographic variables are used as a knee-jerk choice instead of better-considered variables that are more relevant to the analysis at hand. (Maybe the census collects gender and not program availability, for example, but just because a variable is available and somewhat correlated doesn’t mean that it is in fact a relevant variable, especially when the focus of study is a population for whom gender is such an integral societal difference.)
And yet I spent a whole semester studying 5 minutes of conversation between 4 people. What was that representative of? Nothing but itself. It couldn’t have been exchanged for any other 5 minutes of conversation. It was simply a conversation that this group had and forgot. But over the course of the semester, this piece of conversation taught me countless aspects of conversation research. Every time I delved back into the data, it became richer. It was my first step into the world of microanalysis, where I discovered that just about anything can be a rich dataset if you use it carefully. A snapshot of people at a lecture? Well, how are their bodies oriented? A snapshot of video? A treasure trove of gestures and facial expressions. A piece of graffiti? Semiotic analysis! It goes on. The world of microanalysis is built on the practice of layered noticing. It goes deep rather than wide.
But what is it representative of? How could a conversation be representative? Would I need to collect more conversations, but restrict the participants? Collect conversations with more participants, but in similar contexts? How much or how many would be enough?
In the world of microanalysis, people and objects constantly create and recreate themselves. You consistently create and recreate yourself, but your recreations generally fall into a similar range that makes you different from your neighbors. There are big themes in small moments. But what are the small moments representative of? Themselves. Simply, plainly, nothing more and nothing else. Does that mean that they don’t matter? I would argue that there is no better way to understand the world around us in deep detail than through microanalysis. I would also argue that macroanalysis is an important part of discovering the wider patterns in the world around us.
Recently a NY Times blog post by Quentin Hardy has garnered quite a bit of attention.
This post has really struck a chord with me, because I have had a hard time understanding Hardy’s complaint. Is big data truth? Is any data truth? All data is what it is: a collection of some sort, collected under a specific set of circumstances. Even data that we hope to be more representative has sampling and contextual limitations. Responsible analysts should always be upfront about what their data represents. Is big data less truthful than other kinds of data? It may be less representative than, say, a systematically collected political poll. But it is what it is: different data, collected under different circumstances in a different way. It shouldn’t be equated with other data that was collected differently. One true weakness of many large-scale analyses is the blindness to the nature of the data, but that is a byproduct of the training algorithms that are used for much of the analysis. The algorithms need large training datasets, from anywhere. These sets often are developed through massive web crawlers. Here, context gets dicey. How does a researcher represent the data properly when they have no idea what it is? Hopefully researchers in this context will be wholly aware that, although their data has certain uses, it also has certain [huge] limitations.
I suspect that Hardy’s complaint is with the representations of massive datasets collected from web crawlers as a complete truth from which any analyses could be run and all of the greater truths of the world could be revealed. On this note, Hardy is exactly right. Data simply is what it is, nothing more and nothing less. And any analysis that focuses on an unknown dataset is just that: an analysis without context. Which is not to say that all analyses need to be representative, but rather that all responsible analyses of good quality need to be self-aware. If you do not know what the data represents and when and how it was collected, then you cannot begin to discuss the usefulness of any analysis of it.