Reliability and validity are the two most important principles in social science research. They are the measures that maintain the integrity, quality and ultimate usefulness of our work.
Reliability refers to the replicability of findings, which is a crucial element in any scientific process. Repeatability under varying conditions helps to establish the consistency and boundaries of a phenomenon. We often like to compare ourselves to scientists and emphasize our take on the scientific method. I would argue that one of the most important lessons we can learn from science is basic doubt (inherent in statistics as well, thanks to the null hypothesis): no single study can create knowledge; it can only suggest knowledge that leads to further testing and potential verification. Research is inherently tied to its underlying questions, and a well-conducted study based on one question could easily lead to very different findings than a study based on another, even slightly different, question. Reliability cannot be valued highly enough in social science research.
Validity refers to the value of our findings. What do these findings mean? What do they refer to? In survey research, understanding the validity of a finding often entails retracing what question was asked to whom under what circumstances and anchoring any conclusions to those basic truths instead of extrapolating to a wider principle that we would like to have observed.
In text analytics, there are two more anchors that guide research: precision and recall. Recall is the share of all correct instances of a phenomenon that you were able to isolate in your programming, and precision is the percentage of the instances you collected that were indeed instances of your target phenomenon. Text analysis is a dance of queries, toggling between collecting correct matches and dropping incorrect matches. It is within the context of this dance that language seems most staggering. Here we see how little people say what they mean or mean what they say, how much context matters, how often people refer to subjects by proxy, how dependent on ongoing conversation new elements are, …
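To make the two measures concrete, here is a minimal Python sketch. It assumes you have a hand-labeled set of the texts that truly contain your target phenomenon, which is the only way to know what "correct" means. The function name and the example IDs are hypothetical, invented purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the texts the query matched (hypothetical input)
    relevant:  IDs of the texts that truly contain the phenomenon,
               according to hand labels (hypothetical input)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    # Correct matches: texts the query collected that are truly relevant.
    true_positives = len(retrieved & relevant)
    # Precision: what fraction of what we collected was correct?
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the correct instances did we collect?
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: the query matched texts 1-4, but hand labels say
# texts 2-6 are the true instances of the phenomenon.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case, three of the four collected texts are correct, but two true instances were missed. Broadening the query would likely recover the missed texts at the cost of more incorrect matches, which is exactly the toggling described above.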
As users of language, we are constantly inundated with words. We cope with this by focusing only on selected elements. We focus on metamessages, not mechanics. It is easy to assume, from our perspective, that language is straightforward. Indeed, if it were straightforward, text analytics would be easy and misunderstandings would not be so rampant! All of the linguistic data that we are flooded with would be harnessed regularly, and society would reflect that data better through ubiquitously increased customization.
But the reality of language stands in stark contrast to what we assume it to be, and that reality is the lifeblood of linguistics.