In the study of linguistics, we quickly learn that all language is patterned. Although the actual words we produce vary widely, the process of production does not. The process of constructing baby talk was found to be consistent across children speaking 15 different languages. When any two people who share no common language come together and try to communicate, the process is the same. When we look at any large body of data, we quickly learn that just about any linguistic phenomenon is subject to statistical likelihood. Grammatical patterns govern the basic structure of what we see in the corpus. Variations in language use may tweak these patterns, but each variation is a patterned tweak with its own set of statistical likelihoods. Variations that people are quick to call bastardizations are actually patterned departures from what those people consider to be “standard” English. Understanding “differences not deficits” is a crucially important part of understanding and processing language, because any variation, even texting shorthand, “broken English,” or slang, can be better understood and used once its underlying structure is recognized.
The patterns in language extend beyond grammar to word usage. The most frequent words in a corpus are function words such as “a” and “the,” and the most frequent collocations are combinations like “and the” or “and then it.” These patterns shape the findings of many investigations into textual data. A certain phrase may show up frequently in a dataset simply because it is a common or lexicalized expression, while another combination may not surface at all because it is rarer. This can be particularly problematic, because what is rare is often more noticeable or important.
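To see this effect in your own data, a quick frequency count is usually enough. The following is a minimal sketch in Python, not a full corpus pipeline: the function names (tokenize, top_words, top_bigrams), the sample sentence, and the very simple tokenizer are all illustrative assumptions, and a real project would likely want a more careful tokenizer.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and straight apostrophes.

    Assumption: this crude rule is good enough for a first look at English text.
    """
    return re.findall(r"[a-z']+", text.lower())


def top_words(tokens, n=5):
    """Return the n most frequent single words with their counts."""
    return Counter(tokens).most_common(n)


def top_bigrams(tokens, n=5):
    """Return the n most frequent adjacent word pairs with their counts."""
    pairs = zip(tokens, tokens[1:])
    return Counter(pairs).most_common(n)


# Hypothetical sample text, used only to illustrate the calls.
sample = "The cat sat on the mat and the dog sat on the rug and then it slept."
tokens = tokenize(sample)

print(top_words(tokens, 3))
print(top_bigrams(tokens, 3))
```

Even in a sentence this short, the function word “the” tops the list, which is the pattern described above. The words and pairs that occur only once are the ones a raw frequency ranking will bury.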
Here are some good starter questions to ask to better understand your textual data:
1) Where did this data come from? What was its original purpose and context?
2) What did the speakers intend to accomplish by producing this text?
3) What type of data or text, or genre, does this represent?
4) How was this data collected? Where is it from?
5) Who are the speakers? What is their relationship to each other?
6) Is there any cohesion to the text?
7) What language is the text in? What is the linguistic background of the speakers?
8) Who is the intended audience?
9) What kind of repetition do you see in the text? What about repetition within the context of a conversation? What about repetition of outside elements?
10) What stands out as relatively unusual or rare within the body of text?
11) What is relatively common within the dataset?
12) What register is the text written in? Casual? Academic? Formal? Informal?
13) Pronoun use. Always look at pronoun use. It’s almost always enlightening.
These types of questions will take you much further into your dataset than the knee-jerk question “What is this text about?”
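A few of the questions above, such as pronoun use and what is common or rare, lend themselves to a quick count. Here is a rough sketch in Python under stated assumptions: the PRONOUNS set is a small hand-picked list of English pronouns rather than a complete inventory, and pronoun_counts and words_seen_once are hypothetical helper names chosen for this example.

```python
import re
from collections import Counter

# Assumption: a small, hand-picked set of English pronouns, not a complete list.
PRONOUNS = {
    "i", "me", "my", "we", "us", "our",
    "you", "your",
    "he", "him", "his", "she", "her",
    "they", "them", "their", "it", "its",
}


def words(text):
    """Lowercase the text and split it into runs of letters.

    Note: contractions are split, so "it's" is counted as "it" plus "s".
    """
    return re.findall(r"[a-z]+", text.lower())


def pronoun_counts(text):
    """Count how often each pronoun in PRONOUNS appears in the text."""
    return Counter(w for w in words(text) if w in PRONOUNS)


def words_seen_once(text):
    """Return the words that occur exactly once, in order of first appearance."""
    counts = Counter(words(text))
    return [w for w, c in counts.items() if c == 1]


# Hypothetical sample text, used only to illustrate the calls.
sample = "We told them our plan, and they said it was their plan too."

print(pronoun_counts(sample))
print(words_seen_once(sample))
```

Counts like these will not answer the questions for you, but they point to the places in the text worth reading closely, for example where a speaker shifts from “we” to “they.”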
Now, go forth and research! …And be sure to report back!