On Friday, I had the honor of participating in a microanalysis video discussion group with Fred Erickson. As he was introducing the process to the new attendees, he said something that really caught my attention. He said that videos and field notes are not data until someone decides to use them for research.
As someone with a background in survey research, the question of ‘what is data?’ was never really on my radar before graduate school. Although it’s always been good practice to know where your data comes from and what it represents in order to glean any kind of validity from your work, data was unquestioningly that which you see in a spreadsheet or delimited file, with cases going down and variables going across. If information could be formed like this, it was data. If not, it would need some manipulation. I remember discussing this with Anna Trester a couple of years ago. She found it hard to understand this limited framework, because, for her, the world was a potential data source. I’ve learned more about her perspective in the last couple of years, working with elements that I never before would have characterized as data, including pictures, websites, video footage of interactions, and now fieldwork as a participant observer.
Dr Erickson’s observation speaks to some frustration I’ve had lately, trying to understand the nature of “big data” sets. I’ve seen quite a bit of people looking for data, any data, to analyze. I could see the usefulness of this for corpus linguists, who use large bodies of textual data to study language use. A corpus linguist is able to use large bodies of text to see how we use words, which is a systematically patterned phenomena that goes much deeper than a dictionary definition could. I could also see the usefulness of large datasets in training programs to recognize genre, a really critical element in automated text analysis.
But beyond that, it is deeply important to understand the situated nature of language. People don’t produce text for the sake of producing text. Each textual element represents an intentioned social action on the part of the writer, and social goals are accomplished differently in different settings. In order for studies of textual data to produce valid conclusions with social commentary, contextual elements are extremely important.
Which leads me to ask if these agnostic datasets are being used solely as academic exercises by programmers and corpus linguists or if our hunger for data has led us to take any large body of information and declare it to be useful data from which to excise valid conclusions? Worse, are people using cookie cutter programs to investigate agnostic data sets like this without considering the wider validity?
I urge anyone looking to create insight from textual data to carefully get to know their data.