Searching for Social Meanings in Social Media

This next CLIP event looks really fantastic!


Please join us on Wednesday at 11AM in AV Williams room 3258 for the University of Maryland Computational Linguistics and Information Processing (CLIP) colloquium!


May 2: Jacob Eisenstein: Searching for social meanings in social media


Social interaction is increasingly conducted through online platforms such as Facebook and Twitter, leaving a recorded trace of millions of individual interactions. While some have focused on the supposed deficiencies of social media with respect to more traditional communication channels, language in social media features the same rich connections with personal and group identity, style, and social context. However, social media’s unique set of linguistic affordances causes social meanings to be expressed in new and perhaps surprising ways. This talk will describe research that builds on large-scale social media corpora using analytic tools from statistical machine learning. I will focus on some of the ways in which social media data allow us to go beyond traditional sociolinguistic methods, but I will also discuss lessons from the sociolinguistics literature that the new generation of “big data” research might do well to heed.


This research includes collaborations with David Bamman, Brendan O’Connor, Tyler Schnoebelen, Noah A. Smith, and Eric P. Xing.


Bio: Jacob Eisenstein is an Assistant Professor in the School of Interactive Computing at Georgia Tech. He works on statistical natural language processing, focusing on social media analysis, discourse, and non-verbal communication. Jacob was a Postdoctoral researcher at Carnegie Mellon and the University of Illinois. He completed his Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation award.


Location of AV Williams:


Webpage for CLIP events:



Facebook Measures Happiness in Status Updates?

From FlowingData:

Does anyone have a link to the original report?

I really wish I had more of a window into the methodology of this one!

A couple of questions:

What is happiness?

How can it be measured or signaled? What kinds of data are representing happiness? Is this just an expanded or open ended sentiment analysis? Is the technology such that this would be a valid study?

Are Facebook statuses a sensible place to investigate happiness?

What is this study representing? Constituting? Perpetuating?


Edited to Add:

What is Data? The answer might surprise you

I like to compare my discovery of Sociolinguistics to my love of swimming. I like to consider myself a competent swimmer, and I love being underwater. But discovering sociolinguistics was like coming up for air and noticing air and dry land. A fundamental element that led to this feeling is the difference in data.

In survey research, we rarely think about what data looks like, unless we are training new hires for jobs like data entry. Data can be visualized as a spreadsheet. Each line is a case, and each column is a variable. The variables can be numeric or character and vary in size. We analyze the numbers using statistics and the character variables using qualitative analysis. Or, we can try quantitative techniques on character fields.
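For readers who have never looked at a survey file, here is a minimal Python sketch of the "spreadsheet" view of data. The file contents and variable names (case_id, age, satisfaction, comment) are made up for illustration; the only point is that each line is a case, each column is a variable, and numeric and character variables go down different analytic paths.

```python
import csv
import io
from statistics import mean

# Hypothetical survey file: each line is a case, each column is a variable.
raw = """case_id,age,satisfaction,comment
1,34,4,Liked the service
2,51,2,Too slow
3,27,5,
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Numeric variables are analyzed with statistics.
mean_satisfaction = mean(int(r["satisfaction"]) for r in rows)

# Character variables are set aside for qualitative review
# (here, simply the non-empty comments).
comments = [r["comment"] for r in rows if r["comment"]]

print(len(rows), mean_satisfaction, comments)
```

Everything a sociolinguist would call data (the sign, the phone call, the meeting) has to be squeezed into rows and columns like these before this kind of workflow can touch it.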

The field of survey research has been feeling out its edges increasingly in the past few years. This has led us to consider new data sources, particularly data sources that do not come from surveys. Two factors shape this exploration:

1.) Consideration for the genesis and representativeness of the new data. What is it, and what does it represent?

2.) A sense of what data should look like. We expect new data to resemble old data. We think in terms of joining files; collating, concatenating, merging, aggregating and disaggregating. New data should look and work like this. And so our questions are more along the lines of: how can we make new data look like (or work with) old data?

Sociolinguistics could not be more different, in terms of data. In sociolinguistics, everything is data. Look around you: you’re looking at data. Listen: you’re listening to data. The signs that you passed on your way into work? Data. The tv shows you watch when you get home? Data. Cooking with recipes? Data. Talking on the phone? Data. Attending a meeting? Institutional discourse!

In sociolinguistics, we call analytic methods our ‘toolkit,’ and we pride ourselves on being able to analyze any kind of data with that toolkit. We include ethnographic methods, visual semiotics, discourse methods, action-based studies, as well as traditional linguistic means and measures. Each of these methods can be addressed quantitatively or qualitatively. The best studies use a combination of quantitative and qualitative methods. To me, these methods and data sources are nothing short of mind blowing, and they redefine the prospect of social science research.

Memory is incomplete experience

Today’s quote on my zen calendar is perfect.

“Memory is incomplete experience.” J. Krishnamurti

This is a great reminder for researchers and for people in general, because we all forget and keep forgetting how incomplete our memories and the memories of people we come into contact with are. How many survey questions could be better written with this advice? How much better is ethnography when we base our observations on repeated viewings, rather than trying to reconstruct a vague memory? How many arguments could be avoided, if we could just remember that memories are incomplete?

I sometimes participate in a video discussion group. I am amazed that each viewing of a short segment of video brings a different set of interpretations, and I am amazed that the other participants continually notice different aspects of the video. This experience really drives the point home about how little we see in our everyday lives. We are so inundated with information that we simply couldn’t, and wouldn’t want to, process it all.

Research is the process of recovering and reconstructing. Of observing carefully. Of noticing things that we would never or could never have accessed through normal observation and we absolutely could never access through our memories. Being a researcher does not make us any more able to analyze that which we experience in a single pass- we’re still human. Being a researcher simply means that we have the capacity to observe and investigate things more closely.

Academic register: Are we smarter than a 5th grader?

Another important area of study in Linguistics is register. Among other things, register refers to the degree of formality with which we communicate. We speak differently with our friends and family than we do at work. We speak differently in a courtroom than in a courtroom lobby. And we speak differently in academia than we do elsewhere.

In an interview about Betty Friedan’s ‘The Feminine Mystique,’ Naomi Wolf discusses the groundbreaking impact of Friedan’s classic work. She praises Friedan for having the courage to release countless hours of research to a wider audience, rather than an academic audience. To do this, Friedan sacrificed the academic recognition that could have accompanied her work in order to reach a broader population who could potentially benefit from her work. The choice to write in a less respected register opened Friedan to criticism from academics, but led to a broad, longstanding appeal.

Wolf takes this point a step further by suggesting that academics write in such a way that we don’t even understand each other (!!).

This was a surprising admission to see from an academic. Academics often really embrace the large words and complicated nature of their findings. They manage to encode large amounts of information and complicated ideas in relatively small amounts of space. But are the conclusions and information that we publish limited in their usefulness by the academic register itself?

I’ve mentioned before that Linguistics is a very broad area of study. Coming to the field with absolutely no prior background, I was really struck by the different definitions of terms I used regularly in my profession, like reliability, validity, sampling, representative sample, … It took a while for me to adjust to the different context of those terms, and to the different lexicon and areas of focus in linguistics. And the more areas of linguistics I study, the more I find words and concepts from fields I have little experience with. I remember reading and rereading papers in my conversation analysis class, trying to understand what they were doing and why- and it took the whole semester for me to be able to imitate that academic genre and understand its power.

Clearly, the more experience we have with specific words and methods, the more easily we can understand a specific genre of academic writing. This is the academic genre at its best, and it enables us to reach complicated conclusions that we might not be able to make otherwise. But it is also quite restrictive. Nobody can be an expert in all fields, and research could potentially benefit from feedback from a much wider variety of fields than it often receives. This enforced linguistic segregation represents the academic genre at its worst.

Yesterday I attended a talk that involved areas and methods of study that I had never encountered before. I heard a talk about textual features that evoke emotion. The talk was heady, showing logical expressions and cognitive space diagrams, and involving some of what I believe is called semantic formalism. The text examples were mostly poems, which naturally add to the complexity of the analysis. The main point was that the use of complexity and negation in text adds to the emotional wallop of a body of text. He used hypotheticals as an example of constructed negation that evokes emotion. After trying and trying to wrap my head around his points and their wider applicability, I thought of funerals and memorial services. I thought of how we use hypotheticals to make us cry and help the grieving process. I mentioned to the speaker that, as difficult as it was to wrap my head around his talk, I realized that we use these devices as tools to evoke emotion regularly in those situations.

In my mind, research is of the best quality when it is anchored in something palpable or readily accessible. As a poet, I have a distinct sense of trying to create contrasts and develop layers of complication as poetic devices. But that sense isn’t as visceral or accessible as grieving communication is.

I wonder what his research would have gained by borrowing from other registers. In my own research, I believe that explaining my work to my family or friends is a critical part of my research process. It helps me to make grounded conclusions, and it guides my research questions and methods. For yesterday’s speaker, surely it would help him to generate better, more wide ranging feedback from a wider variety of people?

At the end of the day, I went out for dinner with my kids. I mentioned the talk to my 10 year old, who loves to discuss emotions. I was surprised to see that not only did she ‘get it’ in one or two sentences of explanation, but she was able to generate some really excellent examples of these devices in a 5th grade register.

More rundown on Academedia

So I promised more on Academedia (note: they will add more video and visual resources to the Academedia website in the next few days)…

First, some of Robert Cannon’s (employed with the FCC and a member of Panel B “New Media: A closer look at what works”) insightful gems

Re: internet “a participatory market of free speech”

Re: kids & social media “It’s not a question of whether kids are writing. Kids are writing all the time. It’s whether parents understand that.”

“The issue is not whether to use Wikipedia, but how to use Wikipedia”

Next, the final panel, “Digital Tools for Communication:”

Hitlin (Pew Project for Excellence in Journalism)

People communicate differently about issues on different kinds of media sources.

Re: Trayvon Martin case –> largest issue by media source

  •      Twitter: 21% Outrage @ Zimmerman
  •      Cable & Talk radio: 17% Gun control legislation
  •      Blogs: 15% Role of race

Re: Crimson Hexagon
Pew is different, because they’re in a partnership with Crimson Hexagon to measure trends in Traditional media sources. Also because their standard of error is much higher, and they have a team of hand coders available.

Crimson Hexagon is different, because it combines human coding with machine learning to develop algorithms. It may actually overlap pretty intensely with some of the traditional qualitative coding programs that allow for some machine learning. I can imagine that this feature would appeal especially to researchers who are reluctant to fully embrace machine coding, which is understandable, given the current state of the art. I wonder if, by hosting their users instead of distributing programs, they’re able to store and learn from the codes developed by the users?

CH appears to measure two main domains: topic volume over time and topic sentiment over time. Users get a sense of recall and precision in action as they work with the program, by seeing the results of additions and subtractions to a search lexicon. Through this process, Hitlin got a sense of the meat of the problems with text analysis. He said that it was difficult to find examples that neatly fit into boxes, and that the computer didn’t have an eye for subtlety or things that fit into multiple categories. What he was commenting about was the nature of language in action, or what sociolinguists call Discourse! Through the process of categorizing language, he could sense how complicated it is. Here I get to reiterate one of the main points of this blog: these problems are the reason why linguistics is a necessary aspect of this process. Linguistics is the study of patterns in language, and the patterns we find are inherently different from the patterns we expect to find. Linguistics is a small field, one that people rarely think of. But it is critically essential to a high quality analysis of communication. In fact, we find, when we look for patterns in language, that everything in language is patterned, from its basic morphology and syntax, to its many variations (which are more systematic than we would predict), to methods like metaphor use and intertextuality, and more.
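To make "recall and precision in action" a little more concrete, here is a minimal Python sketch. The posts, the hand-coded labels and both lexicons are invented for illustration. This is not how Crimson Hexagon or Pew works internally; it only shows what happens to the two numbers when terms are added to a search lexicon.

```python
def matches(text, lexicon):
    """Return True if any lexicon term appears as a whole word in the text."""
    words = set(text.lower().split())
    return any(term in words for term in lexicon)


def precision_recall(posts, lexicon):
    """Compare lexicon matches against hand-coded labels.

    posts is a list of (text, is_on_topic) pairs, where is_on_topic is the
    human coder's judgment.
    """
    true_pos = sum(1 for text, label in posts if matches(text, lexicon) and label)
    flagged = sum(1 for text, _ in posts if matches(text, lexicon))
    relevant = sum(1 for _, label in posts if label)
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


# Hypothetical hand-coded posts about a snowstorm (True = on topic).
posts = [
    ("so much snow outside", True),
    ("the storm knocked out our power", True),
    ("blizzard warning until noon", True),
    ("a storm of angry tweets about the game", False),
    ("making dinner tonight", False),
]

narrow = {"snow"}
broad = {"snow", "storm", "blizzard"}

print(precision_recall(posts, narrow))  # precise, but misses most on-topic posts
print(precision_recall(posts, broad))   # finds them all, plus a metaphorical "storm"
```

The broader lexicon catches every on-topic post, but it also sweeps in the metaphorical storm. That one false match is the kind of subtlety Hitlin was describing.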

Linguistics is a key, but it’s not a simple fit. Language is patterned in so many ways that linguistics is a huge field. Unfortunately, the subfields of linguistics divide quickly into political and educational camps. It is rare to find a linguist trained in cognitive linguistics, applied linguistics and discourse analysis, for example. But each of these fields is a necessary part of text analysis.

Just as this blog is devoted to knocking down borders in research methods, it is devoted to knocking down borders between subfields and moving forward with strategic intellectual partnerships.

This next speaker in the panel thoroughly blew my mind!

Rami Khater from Al Jazeera English talked about the generation of ‘The Stream,’ an Al Jazeera program that is entirely driven by social media analysis.

Rami can be found on Twitter: @ramisms, and he shared a link with resources from his talk:

The goal of The Stream is to be “a voice of the voiceless,” by monitoring how the hyperlocal goes global. Rami gave a few examples of things we never would have heard about without social media. He showed how hash tags evolve, by starting with competing tags, evolving and changing, and eventually converging into a trend (incidentally, Rami identified the Kony 2012 trend as synthetic from the get go by pointing out that there was no organic hashtag evolution. It simply started and ended as #Kony2012). He used TrendsMap to show a quick global map of currently trending hashtags. I put a link to TrendsMap on the tools section of the links on this blog, and I strongly encourage you to experiment with it. My daughter and I spent some time looking at it today, and we found an emerging conversation in South Africa about black people on the Titanic. We followed this up with another tool, Topsy, which allowed us to see what the exact conversation was about. Rami gets to know the emerging conversations and then uses local tools to isolate the genesis of the trend and interview people at its source. Instead, my daughter and I looked at WhereTweeting to see what the people around us are tweeting about. We saw some nice words of wisdom from Iyanla Vanzant that were drowning in what appeared to me to be “a whole bunch of crap!” (“Mom-mmy, you just used the C word!”)

Anyway, the tools that Rami shared are linked over here —->

I encourage you to play around with them, and I encourage you and me both to go check out the recent Stream interview with Ai Wei Wei!

The final speaker on the panel was Karine Megerdoomian from MITRE. I have encountered a few people from MITRE recently at conferences, and I’ve been impressed with all of them! Karine started with some words that made my day:

“How helpful a word cloud is is basically how much work you put into it”

Exactly! Great point, Karine! And she showed a particularly great word cloud that combined useful words and phrases into a single image. Niiice!

Karine spoke a bit about MITRE’s efforts to use machine learning to identify age and gender among internet users. She mentioned that older users tended to use noses in their smilies :-) and younger users did not :) . She spoke of how older Iranian users tended to use Persian morphology when creating neologisms, and younger users tended to use English, and she spoke about predicting revolutions and seeing how they are propagated over time.
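The nose observation is easy to play with. Below is a small Python sketch that counts smilies with and without a hyphen nose. The pattern and the sample messages are my own assumptions, not MITRE's method. A real age classifier would treat a count like this as one feature among many, and this simple pattern would miss plenty of emoticon styles.

```python
import re

# Eyes (: ; =), an optional hyphen nose, then a mouth.
# This is a deliberately small, illustrative pattern.
SMILEY = re.compile(r"[:;=](-?)[)(DPp]")


def nose_counts(text):
    """Return (with_nose, without_nose) counts of smilies in the text."""
    with_nose = 0
    without_nose = 0
    for match in SMILEY.finditer(text):
        if match.group(1):  # the captured nose is non-empty
            with_nose += 1
        else:
            without_nose += 1
    return with_nose, without_nose


print(nose_counts("Great talk :-) see you soon ;-)"))
print(nose_counts("omg :) that was so good :D"))
```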

After this point, the floor was opened up for questions. The first question was a critically important one for researchers. It was about representativeness.

The speakers pointed out that social media has a clear bias toward English speakers, western educated people, white, male, liberal, US & UK. Every network has a different set of flaws, but every network has flaws. It is important not to just use these analyses as though they were complete. You simply have to go deeper in your analysis.


There was a bit more great discussion, but I’m going to end here. I hope that others will cover this event from other perspectives. I didn’t even mention the excellent discussions about education and media!

It’s not WHETHER they use it, but how they ENGAGE with it

Yesterday I attended an excellent event, Academedia, sponsored by the Gnovis Journal and the Communication, Culture and Technology program at Georgetown. I plan to post a fuller summary of the event soon, but I wanted to jump right in with some commentary about an exchange at the event that really weighed heavy on me.

One of the attendees was lamenting his child’s lack of engagement with traditional media sources (particularly news magazines) and worrying about the deeper societal implications of all of the fluff that garners more attention (and spreads faster) than larger scale news events do online. I would characterize this concern as what my professor Mima Dedaic calls “technopanic,” and I believe that his concerns demonstrate a lack of understanding of the nature of social media.

I have mentioned Pew’s report on the Kony 2012 viral video phenomenon. One of the main findings of that report was that younger people tend to engage differently with media than older people tend to. Whereas older people were more likely to find out about the video from traditional news sources, younger people were more likely to have heard about it, and heard about it sooner, from social media sources. They were also less likely to have heard about it through traditional media sources, and more likely to have actually seen the video.

In the past media model, the news was composed of a distinct set of entities that could be avoided. I know of quite a few people who prefer not to watch the news or read the newspapers. This orientation has always existed. But in the age of social media, it is much harder to achieve.

When it snows, I know when the flakes begin to fall and the general swath of the storm, even if it’s not local, from my friends who complain about the storm and post pictures of its aftermath. I heard about Michael Jackson’s tragic passing before it was announced in the news. When Egyptians gathered in Tahrir Square, I knew about it from my friend in Egypt. I kept updated on the conflict and on her safety in a community of concerned friends and relatives on her facebook page. I hear about American political ads from people who see them and comment about them. I know what aspects of politicians my friends with different political orientations orient to. For me, social media can provide a faster news source, and often a more balanced news source, than traditional media (although I am an avid consumer of all kinds of media).

News is no longer a distinct entity that must be sought out. It is personalized. It is discussed from many angles from a variety of perspectives with a great deal of frequency from people who have various degrees of knowledge and a variety of attitudes toward it. The man in the audience’s daughter may spend most of her time giggling over memes or making fan pages, but she is surely also orienting toward the larger world around her in a collaborative and alocative (location independent) way.

As a survey researcher, I like to participate in surveys. Some of these surveys ask about where I heard about something. I’m often very frustrated by the response options, because they are incongruent with the ways that I, and many people I know, learn about things on the internet. Googling is sometimes represented as a process of typing a search term into the box and choosing the first option that pulls up. But how often is that the way we use search engines?

There are two distinct ways that I can think of offhand that I google. One is for a direct, known piece of information, like an address, phone number or a picture of something I am already familiar with. The other is more exploratory. An exploratory search takes some term adjustment, and it requires reading through matches until a contextual understanding can be developed. I have noticed that some people can search far more efficiently than others. There are many tools available on the internet, and a working knowledge of the usefulness and potential of these tools can lead to a much different outcome than a passing use can.

There was a representative from the FCC on the panel who shared some great insights, many of which I will cover later. He spoke about kids being taught in schools that technology is bad (disruptive, disobedient, minimally insightful, …), instead of being taught how to use the technological tools available to them.

He said, it’s not WHETHER they use Wikipedia, but how they ENGAGE with Wikipedia.

This is a crucial point. The more we embrace the usefulness of these tools, the better our capabilities will be.

The other side of technopanic is a fear that engaging online means NOT engaging offline. Data on this topic show quite the opposite. People who engage online are also MORE likely to engage offline. Technology need not replace anything. But it can be an excellent tool when approached without unhelpful prejudices.

Groves Tapped for GU Provost- Leaving Census?

Is Groves going to leave Census for the Provost position?

Here is more information from the Census blog:

…Aaaand some explanation via the Washington Post:

“I’m an academic at heart,” Groves said in a telephone interview Tuesday, explaining his decision to leave the Census Bureau. “This was the kind of position that’s kind of hard to pass up.”

Can they just get along? Situated Cognition and Survey Response

Finally, I’m going to take a moment to talk about Norbert Schwarz’s JPSM Distinguished Lecture on March 30! I’ve attended a few events and had a few experiences lately that I’m eager to blog about, but sometimes life has plans for us that don’t involve blogging. Today, I would say, is no different, except that I woke up thinking about this lecture!

Ok, enough about me, more about Schwarz.

I should start by saying that I am a longtime fan of Schwarz. In Fall 2009, I had just discovered the MLC program and finished what was a whirlwind application process, and I was first trying to wrap my head around the field of sociolinguistics and its intersection with my career in survey methodology. I had attended a presentation of an ethnography of communication pilot study to the McDonough School of Business, and, to my great shock, I came across a survey methodology paper that spoke of the Logic of Conversation and the role of Gricean maxims in survey responses. This fantastic piece is the work of Norbert Schwarz, and I’ve kept it nearby ever since. In it, Schwarz addresses the conversational expectations of survey respondents and shows how they respond not only to the question at hand, but also to these expectations.

It’s common in every survey to look at some of the responses and wonder how in the world they could have come about. I addressed this in an earlier blog post, where one researcher had gone as far as to call respondents stupid. Oftentimes we think of respondents “getting it right” or “getting it wrong.” But there is a larger phenomenon underlying what appear to be strange responses, and it’s something that we experience when we attempt to respond to surveys.

We write survey questions with a mechanistic expectation, that if we ask a question, we will hear back the answer to that question, but we neglect to consider the fact that communication is not mechanistic. Of course, we are not necessarily aware of this. We’re aware of misunderstandings, but we’re not often aware of the tiny sphere of focus and interpretive frames that we apply to every utterance we hear and utter. This is no fault of our own. This is a survival tool. We simply cannot process all of the information that we’re constantly inundated with.

In survey research, we’re aware that small differences in question format can influence responses. We’re aware that changing a scale will change the numeric range of the responses. We see that changing labels on a scalar question changes the results. We’re aware that sometimes answers appear to be absolute contradictions and seem to us to be impossible. These are especially large challenges for us, and they are the purview of linguistics.

Schwarz, however, is not a linguist. He is a cognitive scientist. And his lecture was not about the linguistic basis behind apparently wonky response phenomena. Instead, he spoke about situated cognition.

Situated cognition makes a lot of intuitive sense. It is a proven psychological phenomenon: we don’t hold attitudes, beliefs and responses at a certain location in our mind. Instead, we create or recreate them each time. This process allows for much more of an influence from “what’s on our mind,” making situational or contextual factors much more important, and decreasing the reliability, or repeatability, of survey responses. This is not a hard egg for someone (me) with a background in cognitive science and sociolinguistics to swallow, but the effect on the audience was remarkable. How does someone from a field that thrives on the mechanistic nature of responses take the suggestion that what they’re measuring is not a distinctly measurable entity so much as a complicated, potentially unreliable act of nature?

One of the discussants used a couple that he was not very fond of as an example of a stable opinion. I believe that this example lends itself well to further exploration. If he had just met the couple, and he had had a negative experience with them, his evaluation of his opinion toward the couple would depend on the degree of negativity of the experience, his predisposition to give or not give them the benefit of the doubt, and his degree of concern about expressing a negative opinion to the interviewer or survey researchers. After this point, these factors will be increasingly influenced by his further experiences with the people and the degree of negativity, positivity or neutrality of the experiences, and the recency and saliency of the experiences. Essentially, his response would reflect a complicated underlying equation and be the output of situated cognition.

But what is a survey researcher supposed to do with this information?

It would be easy at this point to throw the baby out with the bathwater and cast doubt on the whole survey and response process. But that’s not necessary, and that’s not the point.

The point is that each method of analysis has its own unique set of strengths and weaknesses. It is important to know the strengths and weaknesses of your methods in order to better understand what exactly you are finding and what your findings mean. And it also behooves us to supplement across methodologies. A reliable survey response is a strong finding, but it can mask underlying factors that can be accessed through other methodologies. As Pew demonstrated in their Kony 2012 report, mixing methodologies can lead to a more clear, nuanced narrative than any single method could yield.

It would be easy to dismiss Schwarz’s reporting, or to dismiss survey methodology. But dismissing either would be foolish, rash and unnecessary. Instead, let’s build on both. A wider foundation can build a better house, but the best house will need to take down some old walls and rethink its floorplan.

When Code Is Hot

Excellent article on TechCrunch by Jon Evans, “When Code is Hot”


“That first cited piece above begins with “Parlez-vous Python?”, a cutesy bit that’s also a pet peeve. Non-coders tend to think of different programming languages as, well, different languages. I’ve long maintained that while programming itself — “computational thinking”, as the professor put it — is indeed very like a language, “programming languages” are mere dialects; some crude and terse, some expressive and eloquent, but all broadly used to convey the same concepts in much the same way.

Like other languages, though, or like music, it’s best learned by the young. I am skeptical of the notion that many people who start learning to code in their 30s or even 20s will ever really grok the fundamental abstract notions of software architecture and design.

Stross quotes Michael Littman of Rutgers: “Computational thinking should have been covered in middle school, and it isn’t, so we in the C.S. department must offer the equivalent of a remedial course.” Similarly, the Guardian recently ran an excellent series of articles on why all children should be taught how to code. (One interesting if depressing side note there: the older the students, the more likely it is that girls will be peer-pressured out of the technical arena.)”