The surprising unpredictability of language in use

This morning I received an e-mail from an international professional association that I belong to. The e-mail was in English, but it was not written by an American. As a linguist, I recognized the differences in formality and word use as signs that the person who wrote the e-mail is speaking from a set of experiences with English that differ from my own. Nothing in the e-mail was grammatically incorrect (although as a linguist I am hesitant to judge any linguistic differences as correct or incorrect, especially out of context).

Then later this afternoon I saw a tweet from Twitter on the correct use of Twitter abbreviations (RT, MT, etc.). If the growth of new Twitter users has indeed leveled off, then Twitter is lucky, because the more Twitter grows, the less it will be able to influence the language use of its base.

Language is a living entity that grows, evolves and takes shape based on individual experiences and individual perceptions of language use. If you think carefully about your experiences with language learning, you will quickly see that single exposures and dictionary definitions teach you little, but repeated viewings across contexts teach you much more about language.

Language use is patterned. Every word combination has a likelihood of appearing together, and that likelihood varies based on a host of contextual factors. Language use is complex. We use words in a variety of ways across a variety of contexts. These facts make language interesting, but they also obscure language use from casual understanding. The complicated nature of language in use interferes with analysts who build assumptions about language into their research strategies without realizing that their assumptions would not stand up to careful observation or study.

I would advise anyone involved in the study of language use (either as a primary or secondary aspect of their analysis) to take language use seriously. Fortunately, linguistics is fun and language is everywhere. So hop to it!

Reporting on the AAPOR 69th national conference in Anaheim #aapor

Last week AAPOR held its 69th annual conference in sunny (and hot) Anaheim, California.

Palm Trees in the conference center area

My biggest takeaway from this year’s conference is that AAPOR is a very healthy organization. AAPOR attendees were genuinely happy to be at the conference, enthusiastic about AAPOR and excited about the conference material. Many participants consider AAPOR their intellectual and professional home base and really relished the opportunity to be around kindred spirits (often socially awkward professionals who are genuinely excited about our niche). All of the presentations I saw firsthand or heard about were solid and dense, and the presenters were excited about their work and their findings. Membership, conference attendance, journal and conference submissions and volunteer participation are all quite strong.

 

At this point in time, the field of survey research is encountering a set of challenges. Nonresponse is a growing challenge, and other forms of data and analysis are increasingly en vogue. I was really excited to see that AAPOR members are greeting these challenges and others head on. For this particular write-up, I will focus on these two challenges. I hope that others will address some of the other main conference themes and add their notes and resources to those I’ve gathered below.

 

As survey nonresponse becomes more of a challenge, survey researchers are moving from traditional measures of response quality (e.g. response rates) to newer measures (e.g. nonresponse bias). Researchers are increasingly anchoring their discussions about survey quality within the Total Survey Error framework, which offers a contextual basis for understanding the problem more deeply. Instead of focusing on an across-the-board rise in response rates, researchers are strategizing their resources with the goal of reducing nonresponse bias. This includes understanding response propensity (Who is likely not to respond to the survey? Who is most likely to drop out of a panel study? What are some of the barriers to survey participation?), looking for substantive measures that correlate with response propensity (e.g. Are small, rural private schools less likely to respond to a school survey? Are substance users less likely to respond to a survey about substance abuse?), and continuous monitoring of paradata during the collection period (e.g. developing differential strategies by disposition code, focusing the most successful interviewers on the most reluctant cases, or concentrating collection strategies where they are expected to be most effective). This area of strategizing emerged in AAPOR circles a few years ago with discussions of nonresponse propensity modeling, a process which is surely much more accessible than it sounds, but it has really evolved into a practical and useful tool that can help a research shop of any size increase survey quality and lower costs.
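To make the idea of response propensity a little more concrete, here is a minimal sketch in Python. It is only a descriptive stand-in: it compares response rates across subgroups rather than fitting an actual propensity model, and the case records, subgroup names and the "below the overall rate" cutoff are all invented for illustration, not taken from any conference presentation.

```python
from collections import defaultdict

# Hypothetical case records: (school type, whether the case responded).
# The categories and outcomes are invented for illustration only.
cases = [
    ("small rural private", False),
    ("small rural private", False),
    ("small rural private", True),
    ("small rural private", False),
    ("large urban public", True),
    ("large urban public", True),
    ("large urban public", False),
    ("large urban public", True),
]


def response_rates(records):
    """Return {subgroup: share of cases that responded}."""
    counts = defaultdict(lambda: [0, 0])  # subgroup -> [responded, total]
    for subgroup, responded in records:
        counts[subgroup][0] += int(responded)
        counts[subgroup][1] += 1
    return {group: done / total for group, (done, total) in counts.items()}


rates = response_rates(cases)
overall = sum(responded for _, responded in cases) / len(cases)

# Subgroups responding below the overall rate are candidates for extra effort
# (an assumed rule of thumb, not an established threshold).
follow_up = sorted(group for group, rate in rates.items() if rate < overall)

print(rates)
print(overall, follow_up)
```

In practice the same comparison would probably be run repeatedly during the collection period, alongside paradata such as disposition codes, so that effort can be shifted while the survey is still in the field.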

 

Another big takeaway for me was the volume of discussions and presentations that spoke to the fast-emerging world of data science and big data. Many people spoke of the importance of our voice in the realm of data science, particularly with our professional focus on understanding and mitigating errors in the research process. A few practitioners applied error frameworks to analyses of organic data, and some talks were based on analyses of organic data. This year AAPOR also sponsored a research hack to investigate the potential for Instagram as a research tool for Feed the Hungry. These discussions, presentations and activities made it clear that AAPOR will continue to have a strong voice in the changing research environment, and the task force reports and initiatives from both the membership and education committees reinforced AAPOR’s ability to be right on top of the many changes afoot. I’m eager to see AAPOR’s changing role take shape.

“If you had asked social scientists even 20 years ago what powers they dreamed of acquiring, they might have cited the capacity to track the behaviors, purchases, movements, interactions, and thoughts of whole cities of people, in real time.” – N.A. Christakis. 24 June 2011. New York Times, via Craig Hill (RTI)

 

AAPOR is a very strong, well-loved organization, and it is building a very strong future from a very solid foundation.

 

 

 

MORE DETAILED NOTES:

This conference is huge, and I could not possibly cover all of it on my own, so I will share my notes as well as the notes and resources I can collect from other attendees. If you have any materials to share, please send them to me! The more information I am able to collect here, the better a resource it will be for people interested in AAPOR or the conference.

 

Patrick Ruffini assembled the tweets from the conference into this storify

 

Annie, the blogger behind LoveStats, had quite a few posts from the conference. I sat on a panel with Annie on the role of blogs in public opinion research (organized by Joe Murphy for the 68th annual AAPOR conference), and Annie blew me away by live-blogging the event from the stage! Clearly, she is the fastest blogger in the West and the East! Her posts from Anaheim included:

Your Significance Test Proves Nothing

Do panel companies manage their panels?

Gender bias among AAPOR presenters

What I hate about you AAPOR

How to correct scale distribution errors

What I like about you AAPOR

I poo poo on your significance tests

When is survey burden the fault of the responders?

How many survey contacts is enough?

 

My full notes are available here (please excuse any formatting irregularities). Unfortunately, they are not as extensive as I would have liked, because wifi and power were in short supply. I also wish I had settled into a better seat and covered some of the talks in greater detail, including Don Dillman’s talk, which was a real highlight of the conference!

I believe Rob Santos’ professional address will be available for viewing or listening soon, if it is not already available. He is a very eloquent speaker, and he made some really great points, so this will be well worth your time.

 

Let’s talk about data cleaning

Data cleaning has a bad rep. In fact, it has long been considered the grunt work of the data analysis enterprise. I recently came across a piece of writing in the Harvard Business Review that lamented the amount of time data scientists spend cleaning their data. The author feared that data scientists’ skills were being wasted on the cleaning process when they could be using their time for the analyses we so desperately need them to do.

I’ll admit that I haven’t always loved the process of cleaning data. But my view of the process has evolved significantly over the last few years.

As a survey researcher, my cleaning process used to begin with a tall stack of paper forms. Answers that did not make logical sense during the checking process sparked a trip to the file folders to find the form in question. The forms often held physical evidence of indecision on the part of the respondent, such as eraser marks or an explanation in the margin, which could not have been reflected properly by the data entry person. We lost this part of the process when we moved to web surveys. It sometimes felt like a web survey left the respondent no way to communicate with the researcher about their unique situations. Data cleaning lost its personalized feel and detective story luster and became routine and tedious.

Despite some of the affordances of the movement to web surveys, much of the cleaning process stayed rooted in the old techniques. Each form has its own id number, and the programmers would use those id numbers for corrections:

if id=1234567, set var1=5, set var7=62

At this point a “good programmer” would also document the changes for future collaborators:

*this person was not actually a forest ranger, and they were born in 1962
if id=1234567, set var1=5, set var7=62

Making these changes grew tedious very quickly, and the process seemed to drag on for ages. The researcher would check the data for potential errors, scour the records that could hold those errors for any kind of evidence of the respondent’s intentions, and then handle each form one at a time.

My techniques for cleaning data have changed dramatically since those days. My goal is to use id numbers as rarely as possible, and instead to ask myself questions like “how can I tell that these people are not forest rangers?” The answer to these questions evokes a subtly different technique:

* these people are not actually forest rangers
if var7=35 and var1=2 and var10 contains ‘fire fighter’, set var1=5

This technique requires honing and testing (adjusting the precision and recall), but I’ve found it to be far more efficient, faster, more comprehensive and, most of all, more fun (oh hallelujah!). It makes me wonder whether we have perpetually undercut the quality of the data cleaning we do simply because we hold the process in such low esteem.
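For readers who want to try the rule-based approach, here is a minimal sketch in Python. It is not the code I used on the survey. The records are invented, the variable names simply mirror the pseudocode above, treating var1 as an occupation code is an assumption, and the "hand-checked" set stands in for a small sample that someone has reviewed by eye.

```python
# Hypothetical survey records. var1 is assumed to be an occupation code,
# var7 is left uninterpreted, and var10 is an open-ended job description.
records = [
    {"id": 1, "var1": 2, "var7": 35, "var10": "Fire fighter, county dept."},
    {"id": 2, "var1": 2, "var7": 35, "var10": "forest ranger"},
    {"id": 3, "var1": 2, "var7": 35, "var10": "volunteer fire fighter"},
    {"id": 4, "var1": 5, "var7": 35, "var10": "fire fighter"},
    {"id": 5, "var1": 2, "var7": 35, "var10": "firefighter"},
]


def is_not_really_a_ranger(rec):
    """The rule from the post: var7=35 and var1=2 and var10 contains 'fire fighter'."""
    return (
        rec["var7"] == 35
        and rec["var1"] == 2
        and "fire fighter" in rec["var10"].lower()
    )


def apply_rule(recs):
    """Return corrected copies of the records plus the ids that were changed."""
    cleaned, changed = [], []
    for rec in recs:
        rec = dict(rec)  # copy, so the raw data stays untouched
        if is_not_really_a_ranger(rec):
            rec["var1"] = 5
            changed.append(rec["id"])
        cleaned.append(rec)
    return cleaned, changed


def precision_recall(flagged, truly_wrong):
    """Compare the ids the rule flagged with ids a person judged to be wrong."""
    flagged, truly_wrong = set(flagged), set(truly_wrong)
    hits = len(flagged & truly_wrong)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(truly_wrong) if truly_wrong else 0.0
    return precision, recall


cleaned, changed = apply_rule(records)

# Hypothetical hand-checked sample: a reviewer decided these ids are miscoded.
hand_checked_wrong = [1, 3, 5]
precision, recall = precision_recall(changed, hand_checked_wrong)
print(changed, precision, recall)
```

In this toy sample the rule never flags a correct record, but it misses the respondent who wrote "firefighter" as one word. That kind of miss is what the honing step is for: loosen the text match, rerun, and check again against the hand-checked cases.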

So far I have not discussed data cleaning for other types of data. I’m currently working on a corpus of Twitter data, and I don’t see much of a difference in the cleaning process. The data types and programming statements I use are different, but the process is very close. It’s an interesting and challenging process that involves detective work, a better and growing understanding of the intricacies of the dataset, a growing set of programming skills, and a growing understanding of the natural language use in your dataset. The process mirrors the analysis to such a degree that I’m not really sure why it would be such a bad thing for analysts to be involved in data cleaning.

I’d be interested to hear what my readers have to say about this. Is our notion of the value and challenge of data cleaning antiquated? Is data cleaning a burden that an analyst should bear? And why is there so little talk about data cleaning, when we could all stand to learn so much from each other in the way of data structuring code and more?

Professional Identity: Who am I? And who are you?

Last night I acted as a mentor at the annual Career Exploration Expo sponsored by my graduate program. Many of the students had questions about developing a professional identity. This makes sense, of course, because graduate school is an important time for discovering and developing a professional identity.

People enter our program (and many others) with a wide variety of backgrounds and interests. They choose from a variety of classes that fit their interests and goals. And then they try to map their experience onto job categories. But boxes are difficult to climb into and out of, and students soon discover that none of the boxes is a perfect fit.

I experienced this myself. I entered the program with an extensive and unquestioned background in survey research. Early in my college years (while I was studying and working in neuropsychology) I began to manage a clinical dataset in SPSS. Working with patients and patient files was very interesting, but to my surprise working with data using statistical software felt right to me much in the way that Ethiopian meals include injera and Japanese meals include rice (IC 2006 (1997) Ohnuki Tierney Emiko). I was actually teased by my friends about my love of data! This affinity served me well, and I enjoyed working with a variety of data sets while moving across fields and statistical programming languages.

But my graduate program blew my mind. I felt like I had spent my life underwater and then discovered the sky and continents. I discovered many new kinds of data and analytic strategies, all of which were challenging and rewarding. These discoveries inspired me to start this blog and have inspired me to attend a wide variety of events and read some very interesting work that I never would have discovered on my own. Hopefully followers of this blog have enjoyed this journey as much as I have!

As a recent graduate, I sometimes feel torn between worlds. I still work as a survey researcher, but I’m inspired by research methods that are beyond the scope of my regular work. Another recent graduate of our program who is involved in market research framed her strategy in a way that really resonated with me: “I give my customers what they want and something else, and they grow to appreciate the ‘something else.’” That sums up my current strategy. I do the survey management and analysis that is expected of me in a timely, high-quality way. But I am also using my newly acquired knowledge to incorporate text analysis into our data cleaning process in order to streamline it, increasing both the speed and the quality of the process and making it better equipped to handle the data from future surveys. I do the traditional quantitative analyses, but I supplement them with analyses of the open-ended responses that use more flexible text analytic strategies. These analyses spark more quantitative analyses and make for much better (richer, more readable and more inspired) reports.

Our goal as professionals should be to find a professional identity that best capitalizes on our unique knowledge, skills and abilities. There is only one professional identity that does all of that, and it is the one you have already chosen and continue to choose every day. We are faced with countless choices about what classes to take, what to read, what to attend, what to become involved in, and what to prioritize, and we make countless assessments about each. Was it worthwhile? Did I enjoy it? Would I do it again? Each of these choices constitutes your own unique professional self, a self which you are continually manufacturing. You are composed of your past, your present, and your future, and your future will undoubtedly be a continuation of your past and present. The best career coach you have is inside of you.

Now your professional identity is much more uniquely or narrowly focused than the generic titles and fields that you see in the professional marketplace. Keep in mind that each job listing that you see represents a set of needs that a particular organization has. Is this a set of needs that you are ready to fill? Is this a set of needs that you would like to fill? You are the only one who knows the answers to these questions.

Because it turns out that you are your best career coach, and you have been all along.

Reflections and Notes from the Sentiment Analysis Symposium #SAS14

The Sentiment Analysis Symposium took place in NY this week in the beautiful offices of the New York Academy of Sciences. The Symposium was framed as a transition into a new era of sentiment analysis, an era of human analytics or humetrics.

The view from the New York Academy of Sciences is really stunning!

Two main points struck me during the event. One is that context is extremely important for developing high quality analytics, but the actual shape that “context” takes varies greatly. The second is a seeming disconnect between the product developers, who are eagerly developing new and better measures, and the customers, who want better usability, more customer support, more customized metrics that fit their preexisting analytic frameworks and a better understanding of why social media analysis is worth their time, effort and money.

Below is a summary of some of the key points. My detailed notes from each of the speakers can be viewed here. I attended both the more technical Technology and Innovation Session and the Symposium itself.

Context is in. But what is context?

The big takeaway from the Technology and Innovation session, which was then carried into the second day of the Sentiment Analysis Symposium, was that context is important. But context was defined in a number of different ways.

 

New measures are coming, and old measures are improving.

The innovative new strategies presented at the Symposium made for really amazing presentations. New measures include voice intonation, facial expressions via remote video connections, measures of galvanic skin response, self-tagged sentiment data from social media sharing sites, a variety of measures from people who have embraced the “quantified self” movement, metadata from cellphone connections (including location, etc.), behavioral patterning on the individual and group level, and quite a bit of network analysis. Some speakers showcased systems that involved a variety of linked data or highly visual analytic components. Each of these measures increases the accuracy of preexisting measures and complicates their implementation, bringing new sets of challenges to the industry.

Here is a networked representation of the emotion transition dynamics of 'Hopeful'

This software package is calculating emotional reactions to a YouTube video that is both funny and mean

Meanwhile, traditional text-based sentiment analyses are also improving. Both the quality of machine learning algorithms and the quality of rule based systems are improving quickly. New strategies include looking at text data pragmatically (e.g. What are common linguistic patterns in specific goal directed behavior strategies?), gaining domain level specificity, adding steps for genre detection to increase accuracy and looking across languages. New analytic strategies are integrated into algorithms and complementary suites of algorithms are implemented as ensembles. Multilingual analysis is a particular challenge to ML techniques, but can be achieved with a high degree of accuracy using rule based techniques. The attendees appeared to agree that rule based systems are much more accurate than machine learning algorithms, but the time and expertise involved has caused them to fall out of vogue.

 

“The industry as a whole needs to grow up”

I suspect that Chris Boudreaux of Accenture shocked the room when he said “the industry as a whole really needs to grow up.” Speaking off the cuff, without his slides after a mishap and adventure, Boudreaux gave the customer point of view toward social media analytics. He said that social media analysis needs to be more reliable, accessible, actionable and dependable. Companies need to move past the startup phase to a new phase of accountability. Tools need to integrate into preexisting analytic structures and metrics, to be accessible to customers who are not experts, and to come better supported.

Boudreaux spoke of the need for social media companies to better understand their customers. Instead of marketing tools to their wider base of potential customers, the tools seem to be developed and marketed solely to market researchers. This has led to a more rapid adoption among the market research community and a general skepticism or ambivalence across other industries, who don’t see how using these tools would benefit them.

The companies who truly value and want to expand their customer base will focus on the usability of their dashboards. This is an area ripe for a growing legion of usability experts and usability testing. These dashboards cannot restrict API access and understanding to data scientist experts. They will develop, market and support these dashboards through productive partnerships with their customers, generating measures that are specifically relevant to them and personalized dashboards that fit into preexisting metrics and are easy for the customers to understand and react to in a very practical and personalized sense.

Some companies have already started to work with their customers in more productive ways. Crimson Hexagon, for example, employs people who specialize in using their dashboard. These employees work with customers to better understand and support their use of the platform and run studies of their own using the platform, becoming an internal element in the quality feedback loop.

 

Less Traditional fields for Social Media Analysis:

There was a wide spread of fields represented at the Symposium. I spoke with someone involved in text analysis for legal reasons, including jury analyses. I saw an NYPD name tag. Financial services were well represented. Publishing houses were present. Some health related organizations were present, including neuroscience specialists, medical practitioners interested in predicting early symptoms of diseases like Alzheimer’s, medical specialists interested in helping improve the lives of people with diseases like Autism (e.g. with facial emotion recognition devices), pharmaceutical companies interested in understanding medical literature on a massive scale as well as patient conversation about prescriptions and participation in medical trials. There were traditional market research firms, and many new startups with a wide variety of focuses and functions. There were also established technology companies (e.g. IBM and Dell) with innovation wings and many academic departments. I’m sure I’ve missed many of the entities present or following remotely.

The better research providers can understand the potential breadth of applications of their research, the more they can improve the specific areas of interest to these communities.

 

Rethinking the Public Image of Sentiment Analysis:

There was some concern that “social” is beginning to have too much baggage to be an attractive label, causing people to think immediately of top platforms such as Facebook and Twitter and belying the true breadth of the industry. This prompted a movement toward other terms at the symposium, including human analytics, humetrics, and measures of human engagement.

 

Accuracy

Accuracy tops out at about 80%, because that’s the limit of inter-rater reliability in sentiment analysis. Understanding the more difficult data is an important challenge for social media analysts. It is important for there to be honesty with customers and with each other about the areas where automated tagging fails. This particular area was a kind of elephant in the room- always present, but rarely mentioned.
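One way to see why agreement between human raters acts as a ceiling is to compute it directly. The sketch below is purely illustrative: the labels are invented, simple percent agreement is used rather than a chance-corrected statistic, and none of it comes from the tools shown at the Symposium.

```python
def agreement(labels_a, labels_b):
    """Share of items on which two label sequences agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must be the same length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical sentiment tags for the same ten posts.
rater_1 = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg", "pos", "neu"]
rater_2 = ["pos", "neg", "neu", "pos", "neu", "neu", "pos", "neg", "neg", "neu"]
machine = ["pos", "neg", "pos", "pos", "neg", "neu", "pos", "neu", "pos", "neu"]

human_agreement = agreement(rater_1, rater_2)
machine_vs_rater_1 = agreement(machine, rater_1)
machine_vs_rater_2 = agreement(machine, rater_2)

# Posts the humans disagree on are the "difficult data": there is no single
# right answer for an automated tagger to match.
disputed = [i for i, (a, b) in enumerate(zip(rater_1, rater_2)) if a != b]

print(human_agreement, machine_vs_rater_1, machine_vs_rater_2, disputed)
```

In this made-up sample the two raters agree on 8 of 10 posts, and the machine's score swings from 0.8 to 0.6 depending on which rater is treated as the truth. The disputed posts are the ones worth flagging honestly to customers.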

Although an 80% accuracy rate is really fantastic compared to no measure at all, and it is an amazing accomplishment given the financial constraints that analysts encounter, it is not an accuracy rate that works across industries and sectors. It is important to consider the “fitness for use” of an analysis. For some industries, an error is not a big deal. If a company is able to respond to 80% of the tweets directed at them in real-time, they are doing quite well. But when real people or weightier consequences are involved, this kind of error rate is blatantly unacceptable. These are the areas where human involvement in the analysis is absolutely critical. Where, honestly speaking, are algorithms performing fantastically, and where are they falling short? In the areas where they fall short, human experts should be deployed, adding behavioral and linguistic insight to the analysis.

One excellent example of Fitness for Use was the presentation by Capital Market Exchange. This company operationalizes sentiment as expert opinion. They mine a variety of sources for expert opinions about investing, and then format the commonalities in an actionable way, leading to a substantial improvement above market performance for their investors. They are able to gain a great deal of market traction that pure sentiment analysts have not by valuing the preexisting knowledge structures in their industry.

 

Targeting the weaknesses

It is important that the field look carefully at areas where algorithms do and do not work. The areas where they don’t represent whole fields of study, many of which have legions of social media analysts at the ready. This includes less traditional areas of linguistics, such as Sociolinguistics, Conversation Analysis (e.g. looking at expected pair parts) and Discourse Analysis (e.g. understanding identity construction), as well as Ethnography (with fast growing subfields, such as Netnography), Psychology and Behavioral Economics. Time to think strategically to better understand the data from new perspectives. Time to more seriously evaluate and invest in neutral responses.

 

Summing Up

Social media data analysis, large scale text analysis and sentiment analysis have enjoyed a kind of honeymoon period. With so many new and fast growing data sources, a plethora of growing needs and applications, and a competitive and fast growing set of analytic strategies, the field has been growing at an astronomical rate. But this excitement has to be balanced out with the practical needs of the marketplace. It is time for growing technologies to better listen to and accommodate the needs of the customer base. This shift will help ensure the viability of the field and free developers up to embrace the spirit of intellectual creativity.

This is an exciting time for a fast growing field!

Thank you to Seth Grimes for organizing such a great event.

 

Free Range Research will cover the Sentiment Symposium in NYC next week #SAS14

Next week Free Range Research will be in NYC to cover the Sentiment Symposium and Innovation session, and I can’t tell you how excited I am about it!

The development of useful analytics hinges on constant innovation and experimentation, and binary positive/negative measures don’t come close to describing the full potential of social media data. This year’s symposium is an effort to confront the limitations of calcified measures of sentiment head on by introducing new measures and new perspectives.

As a programmer, a quantitative and qualitative analyst, a recent academic, and a fervent believer in the power of mixed methods and interdisciplinary research, I am eager to cover the Symposium as both an enthusiastic and a critical voice. The new directions that will be represented are exciting and interesting, and I expect to gain a better feel for many cutting-edge analytic practices. But the proprietary and competitive nature of the social media marketplace has led to countless overblown claims. I do not plan to simply be a conduit for these. My goal will be to share as much as possible of what I learn at the Symposium in a grounded and accessible way, as promptly as possible, offering counterpoints and data-driven examples when possible, on both my blog and through my Twitter handle @FreeRangeRsrch

I hope you’ll join me!

 

today in research & zen: “What is known as ‘realizing the mystery’ is nothing more than breaking through to grab an ordinary person’s life” Te-Shan

Planning another Online Research, Offline lunch

I’m planning another Online Research, Offline lunch for researchers in the Washington DC area later this month. The specific date and location are TBA, but it will be toward the end of February near Metro Center.

These lunches are designed to welcome professionals and students involved in online research across a variety of disciplines, fields and sectors. Past attendees have had a wide array of interests and specialties, including usability and interface design, data science, natural language processing, social network analysis, social media monitoring, discourse analysis, netnography, digital humanities and library science.

The goal of this series is to provide an informal venue for a diverse set of researchers to talk with each other and gain a wider context for understanding their work. The lunches are an informal and flexible way for researchers to meet each other, talk and learn. Although Washington DC is a great meeting place for specific areas of online research, there are few informal opportunities for interdisciplinary gatherings of professionals and academics.

Here is a form that can be used to add new people to the list. If you’re already on the list you do not need to sign up again. Please feel free to share the form with anyone else who may be interested: