Unlocking patterns in language

In the study of linguistics, we quickly learn that all language is patterned. Although the actual words we produce vary widely, the process of production does not. The process of constructing baby talk was found to be consistent across kids from 15 different languages. When any two people who do not speak overlapping languages come together and try to speak, the process is the same. When we look at any large body of data, we quickly learn that just about any linguistic phenomenon is subject to statistical likelihood. Grammatical patterns govern the basic structure of what we see in the corpus. Variations in language use may tweak these patterns, but each variation is a patterned tweak with its own set of statistical likelihoods. Variations that people are quick to call bastardizations are actually patterned departures from what those people consider to be “standard” English. Understanding “differences, not deficits” is a crucially important part of understanding and processing language, because any variation, even texting shorthand, “broken English,” or slang, can be better understood and used once its underlying structure is recognized.

The patterns in language extend beyond grammar to word usage. The most frequent words in a corpus are function words such as “a” and “the,” and the most frequent collocations are combinations like “and the” or “and then it.” These patterns govern the findings of a lot of investigations into textual data. A certain phrase may show up as a frequent member of a dataset simply because it is a common or lexicalized expression, and another combination may not appear because it is rarer; this can be particularly problematic, because what is rare is often more noticeable or important.
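To see this for yourself, a few lines of Python are enough. This is a minimal sketch (not any particular corpus tool), counting words and adjacent word pairs in a toy corpus; as expected, function words dominate both lists.

```python
from collections import Counter

def top_terms(text, n=3):
    """Count the most frequent words and adjacent word pairs (bigrams)
    in a small corpus. Function words rise to the top, as expected."""
    tokens = text.lower().split()
    words = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return words.most_common(n), bigrams.most_common(n)

corpus = "the cat sat on the mat and the dog sat by the door"
words, bigrams = top_terms(corpus)
print(words)  # 'the' tops the list, well ahead of any content word
```

Running the same tally on any real dataset is a quick sanity check before reading too much into a “frequent” phrase.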

Here are some good starter questions to ask to better understand your textual data:

1) Where did this data come from? What was its original purpose and context?

2) What did the speakers intend to accomplish by producing this text?

3) What type of data or text, or genre, does this represent?

4) How was this data collected? Where is it from?

5) Who are the speakers? What is their relationship to each other?

6) Is there any cohesion to the text?

7) What language is the text in? What is the linguistic background of the speakers?

8) Who is the intended audience?

9) What kind of repetition do you see in the text? What about repetition within the context of a conversation? What about repetition of outside elements?

10) What stands out as relatively unusual or rare within the body of text?

11) What is relatively common within the dataset?

12) What register is the text written in? Casual? Academic? Formal? Informal?

13) Pronoun use. Always look at pronoun use. It’s almost always enlightening.

These types of questions will take you much further into your dataset than the knee-jerk question “What is this text about?”
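Question 13 in the list above can even be operationalized in a few lines. This is a rough sketch, with a deliberately incomplete pronoun inventory that you would extend for real data; shifts in the first/second/third-person profile often signal shifts in stance or group identity.

```python
import re
from collections import Counter

# A rough pronoun inventory by grammatical person; extend as needed.
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "we", "us", "our", "ours"},
    "second": {"you", "your", "yours"},
    "third": {"he", "him", "his", "she", "her", "hers",
              "they", "them", "their", "theirs", "it", "its"},
}

def pronoun_profile(text):
    """Tally pronouns by person across a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for person, forms in PRONOUNS.items():
            if tok in forms:
                counts[person] += 1
    return counts

print(pronoun_profile("I think you and I should ask them what they want."))
```

Comparing these profiles across speakers, or across sections of one text, is one quick way to make pronoun use visible.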

Now, go forth and research! …And be sure to report back!

A fleet of research possibilities and a scattering of updates

Tomorrow is my first day of my 3rd year as a Masters student in the MLC program at Georgetown University. I’m taking the slowwww route through higher ed, as happens when you work full-time, have two kids and are an only child who lost her mother along the way.

This semester I will [finally] take the class I’ve been borrowing pieces from for the past two years: Ethnography of Communication. I’ve decided to use this opportunity to do an ethnography of DC taxi drivers. My husband is a DC taxi driver, so in essence this research will build on years of daily conversations. I find that the representation of DC taxi drivers in the news never quite approximates what I’ve seen, and that is my real motivation for the project. I have a couple of enthusiastic collaborators: my husband and a friend whose husband is also a DC taxi driver and who has been a vocal advocate for DC taxi drivers.

I am really eager to get back into linguistics study. I’ve been learning powerful sociolinguistic methods to recognize and interpret patterning in discourse, but it is a challenge not to fall into the age-old habit of studying aboutness or topicality, which is much less patterned and powerful.

I have been fortunate enough to combine some of my new qualitative methods with my more quantitative work on some of the reports I’ve completed over the summer. I’m using the open ended responses that we usually don’t fully exploit in order to tell more detailed stories in our survey reports. But balancing quantitative and qualitative methods is very difficult, as I’ve mentioned before, because the power punch of good narrative blows away the quiet power of high quality, representative statistical analysis. Reporting qualitative findings has to be done very carefully.

Over the summer I had the wonderful opportunity to apply my sociolinguistics education to a medical setting. Last May, while my mom was on life support, we were touched by a medical error when my mom was mistakenly declared brain dead. Because she was an organ donor, her life support was not withdrawn before the error was recognized. But the fallout from the error was tremendous. The problem arose because two of her doctors were consulting by phone about their patients, and each thought they were talking about a different patient. In collaboration with one of the doctors involved, I’ve learned a great amount about medical errors and looked at the role of linguistics in bringing awareness to potential errors of miscommunication in conversation. This project was different from other research I’ve done, because it did not involve conducting new research, but rather rereading foundational research and focusing on conversational structure.

In this case, my recommendations were for an awareness of existing conversational structures, rather than an imposition of a new order or procedure. My recommendations, developed in conjunction with Dr. Heidi Hamilton, the chair of our linguistics department and a medical communication expert, were to be aware of conversational transition points, to focus on the patient identifiers used, and to avoid reaching back or ahead to other patients while discussing a single patient. Each patient discussion must be treated as a separate conversation. Conversation is one of the largest sources of medical error, and it must be approached carefully. My mom’s doctor and I hope to make a Grand Rounds presentation out of this effort.

On a personal level, this summer has been one of great transitions. I like to joke that the next time my mom passes away I’ll be better equipped to handle it all. I have learned quite a bit about real estate and estate law and estate sales and more. And about grieving, of course. Having just cleaned through my mom’s house last week, I am beginning this new school year more physically, mentally and emotionally tired than I have ever felt. A close friend of mine has recently finished an extended series of chemo and radiation, and she told me that she is reveling in her strength as it returns. I am also reveling in my own strength, as it returns. I may not be ready for the semester or the new school year, but I am ready for the first day of class tomorrow. And I’m hopeful. For the semester, for the research ahead, for my family, and for myself. I’m grateful for the guidance of my newest guardian angel and the inspiration of great research.

A snapshot from a lunchtime walk

In the words of Sri Aurobindo, “By your stumbling the world is perfected.”

Could our attitude toward marketing determine our field’s future?

In our office, we call it the “cocktail party question:” What do you do for a living? For those of us who work in the area of survey research, this can be a particularly difficult question to answer. Not only do people rarely know much about our work, but they rarely have a great deal of interest in it. I like to think of myself as a survey methodologist, but it is easier in social situations to discuss the focus of my research than my passion for methodology. I work at the American Institute of Physics, so I describe my work as “studying people who study physics.” Usually this description is greeted with an uncomfortable laugh, and the conversation progresses elsewhere. Score!

But the wider lack of understanding of survey research can have larger implications than simply awkward social situations. It can also cause tension with clients who don’t understand our work, our process, or where and how we add expertise to the process. Toward this end, I once wrote a guide for working with clients that separated out each stage in the survey process and detailed what expertise the researcher brings to the stage and what expertise we need from the client. I hoped that it would be a way of both separating and affirming the roles of client and researcher and advertising our firm and our field. I have not yet had the opportunity to use this piece, because of the nature of my current projects, but I’d be happy to share it with anyone who is interested in using or adapting it.

I think about that piece often as I see more talk about big data and social media analysis. Data seems to be everywhere and free, and I wonder what effect this buzz will have on a body of research consumers who might not have respected the role of the researchers from the get-go. We worried when Survey Monkey and other automated survey tools came along, but the current bevy of tools and attitudes could have an exponentially larger impact on our practice.

Survey researchers often thumb their noses at advertising, despite the heavy methodological overlap. Oftentimes there is a knee-jerk reaction against marketing speak. Not only do survey methodologists thumb our noses at the goals and importance of advertising, but we often thumb our noses at what appears to be evidence of less rigorous methodology. This has led us to a ridiculous point where data and analyses have evolved quickly under the demand and heavy use of advertising and market researchers, and evolved strikingly little in more traditional survey areas, like polling and educational research. Much of the rhetoric about social media analysis, text analysis, social network analysis and big data is directed at the marketing and advertising crowd. Translating it to a wider research context and communicating it to a field that is often not eager to adapt can be difficult. And yet the exchange of ideas between the sister fields has never been more crucial to our mutual survival and relevance.

One of the goals of this blog has been to approach the changing landscape of research from a methodologically sound, interdisciplinary perspective that doesn’t suffer from artificial walls and divisions. As I’ve worked on the blog, my own research methodology has evolved considerably. I’m relying more heavily on mixed methods and trying to use and integrate different tools into my work. I’ve learned quite a bit from researchers with a wide variety of backgrounds, and I often feel like I’m belted into a car with the windows down, hurtling down the highways of progress at top speed and trying to control the airflow. And then I often glimpse other survey researchers out the window, driving slowly, sensibly along the access road beside the highway. I wonder if my mentors feel the change of landscape as viscerally as I do. I wonder how to carry forward the anchors and quality controls that led to such high quality research in the survey realm. I wonder about the future. And the present. About who’s driving, and who in what car is talking to whom? Using what GPS?

Mostly I wonder: could our negative attitude toward advertising and market research drive us right into obscurity? Are we too quick to misjudge the magnitude of the changes afoot?

 

This post is meant to be provocative, and I hope it inspires some good conversation.

Rethinking demographics in research

I read a blog post on the LoveStats blog today that referred to one of the most frequently cited critiques of social media research: the lack of demographic information.

In traditional survey research, demographic information is a critically important piece of the analysis. We often ask questions like “Yes 50% of the respondents said they had encountered gender harassment, but what is the breakdown by gender?” The prospect of not having this demographic information is a large enough game changer to cast the field of social media research into the shade.

Here I’d like to take a sidestep and borrow a debate from linguistics. In the linguistic subfield of conversation analysis, there are two main streams of thought about analysis. One believes in gathering as much outside data as possible, often through ethnographic research, to inform a detailed understanding of the conversation. The second stream is rooted in the purity of the data. This stream emphasizes our dynamic construction of identity over the stability of identity. The underlying foundation of this stream is that we continually construct and reconstruct the most important and relevant elements of our identity in the process of our interaction. Take, for example, a study of an interaction between a doctor and a patient. The first school would bring into the analysis a body of knowledge about interactions between doctors and patients. The second would believe that this body of knowledge is potentially irrelevant or even corrupting to the analysis, and that if the relationship is in fact relevant, it will be constructed within the excerpt of study. This raises the question: are all interactions between doctors and patients primarily doctor-patient interactions? We could address this further through the concept of framing and embedded frames (a la Goffman), but we won’t do that right now.

Instead, I’ll ask another question:
If we are studying gender discrimination, is it necessary to have a variable for gender within our data source?

My knee-jerk reaction to this question, because of my quantitative background, is yes. But looking deeper: is gender always relevant? This does strongly depend on the data source, so let’s assume for this example that the stimulus was a question on a survey that was not directly about discrimination, but rather more general (e.g. “Additional Comments:”).

What if we took that second CA approach, the purist approach, and said that where gender is applicable to the response, it will be constructed within that response? The question now becomes “How is gender constructed within a response?” This is a beautiful and interesting question for a linguist, and it may be a question that much better fits the underlying data and provides deeper insight into the data. It also turns the age-old analytic strategy on its head. Now we can ask whether a priori assumptions that the demographics could or do matter are just rote research or truly the productive and informative measures that we’ve built them up to be.

I believe that this is a key difference between analysis types. In the qualitative analysis of open-ended survey questions, it isn’t very meaningful to say x% of the respondents mentioned z and y% of the respondents mentioned d, because a nonmention of z or d is not really meaningful. Instead we go deeper into the data to see what was said about d or z. So the goal is not prevalence, but description. On the other hand, prevalence is a hugely important aspect of quantitative analysis, as are other fun statistics which feed off of demographic variables.

The lesson in all of this is to think carefully about what is meaningful information that is relevant to your analysis and not to make assumptions across analytic strategies.

Do you ever think about interfaces? Because I do. All the time.

Did you ever see the movie Singles? It came out in the early 90s, shortly before the alternative scene really blew up and I dyed [part of] my hair blue and thought seriously about piercings. Singles was a part of the growth of the alternative movement. In the movie, there is a moment when one character says to another “Do you ever think about traffic? Because I do. All the time.” I spent quite a bit of time obsessing over that line, about what it meant, and, more deeply, what it signaled.

I still think about that line. As I drove toward the turnoff to my mom’s street during our 4th of July vacation, I saw what looked like the turn lane for her street, but it was actually an intersection-less left-turn split immediately preceding the real left-turn lane for her street. It threw me off every time, and I kept remembering that romantic moment in Singles when the two characters were getting to know each other’s quirks, and the man was talking about traffic. And it was okay, even cool, to be quirky and think or talk about traffic, even during a romantic moment.

I don’t think about traffic often. But I am no less quirky. Lately, I tend to think about interfaces. Before my first brush with NLP (Natural Language Processing), I thought quite a bit about alternatives to e-mail. Since I discovered the world of text analytics, I have been thinking quite a bit about ways to integrate the knowledge across different fields about methods for text analysis and the needs of quantitative and qualitative researchers. I want to think outside of the sentiment box, because I believe that sentiment analysis does not fully address the underlying richness of textual data. I want to find a way to give researchers what they need, not what they think they want. Recently, my thinking on this topic has flipped. Instead of thinking from the data end, or the analytic possibilities end, or about what programs already exist and what they do, I have started to think about interfaces. This feels like a real epiphany. Once we think about the problem from an interface, or user experience perspective, we can better utilize existing technology and harness user expectations.

Have you read the new Imagine book about how creativity works? I believe that this strategy is the natural step after spending time zoning out on the web, thinking, or not thinking, about research. The more time you cruise, the better feel you develop for what works and what doesn’t, the more you learn what to expect. Interfaces are simply the masks we put on datasets of all sorts. The data could be the world wide web as a whole, results from a site or time period, a database of merchandise, or even a set of open ended survey responses. The goal is to streamline the searching interface and then make it available for use on any number of datasets. We use NLP every day when we search the internet, or shop. We understand it intuitively. Why don’t we extend that understanding to text analysis?

I find myself thinking about what this interface should look like and what I want this program to do.

Not traffic, not as romantic. But still quirky and all-encompassing.

Question Writing is an Art

As a survey researcher, I like to participate in surveys with enough regularity to keep current on any trends in methodology. As a web designer, I know that an aspect of successful design is seamlessness with the visitor’s expectations. So if the survey design realm has moved toward submit buttons on the upper right hand corner of individual pages, your idea (no matter how clever) to put a submit button on the upper left can result in a disconnect on the part of the user that will affect their behavior on the page. In fact, the survey design world has evolved quite a bit in the last few years, and it is easy to design something that reflects poorly on the quality of your research endeavor. But these design concerns are less of an issue than they have been, because most researchers are using templates.

Yet there is still value in keeping current.

And sometimes we encounter questions that lend themselves to an explanation of the importance of question writing. These questions are a gift for a field that is so difficult to describe in terms of knowledge and skills!

Here is a question I encountered today (I won’t reveal the source):

How often do you purchase potato chips when you eat out at any quick service and fast food restaurants?

2x a week or more
1x a week
1x every 2-3 weeks
1x a month
1x every 2-3 months
Less than 1x every 3 months
Never

This is a prime example of a double-barreled question, and it is also an especially difficult question to answer. In my case, I rarely eat at quick service restaurants, especially sandwich places, like this one, that offer potato chips. When I do eat at them, I am tempted to order chips. About half the time I will give in to the temptation with a bag of SunChips, which I’m pretty sure are not made of potato.

In bigger firms that have more time to work with, this information would come out in the process of a cognitive interview or think-aloud during the pretesting phase. Many firms, however, have staunchly resisted these important steps in the surveying process because of their time and expense. It is important to note that the time and expense involved with trying to make usable answers out of poorly written questions can be immense.

I have spent some time thinking about alternatives to cognitive testing, because I have some close experience with places that do not use this method. I suspect that this is a good place for text analytics, because of the power of reaching people quickly and potentially cheaply (depending on your embedded TA processes). Although oftentimes we are nervous about web analytics because of their representativeness, the bar for representativeness is significantly lower in the pretesting stage than in the analysis phase.

But, no matter what pretesting model you choose, it is important to look closely at the questions that you are asking. Are you asking a single question, or would these questions be better separated out into a series?

How often do you eat at quick service sandwich restaurants?

When you eat at quick service restaurants, do you order [potato] chips?

What kind of [potato] chips do you order?

The lesson of all of this is that question writing is important, and the questions we write in surveys will determine the kind of survey responses we receive and the usability of our answers.

To go big, first think small

We use language all of the time. Because of this, we are all experts in language use. As native speakers of a language, we are experts in the intricacies of that language.

Why, then, do people study linguistics? Aren’t we all linguists?

Absolutely not.

We are experts in *using* language, but we are not experts in the methods we employ. Believe it or not, much of the process of speaking and hearing is not conscious. If it were, we would be sensorially overwhelmed with the sheer volume of words around us. Instead, listening comprehension involves a process of merging what we expect to hear with what we gauge to be the most important elements of what we do hear. The process of speaking involves merging our estimates of what the people we communicate with know and expect to hear with our understanding of the social expectations surrounding our words and our relationships, and distilling these sources into a workable expression. The hearer will reconstruct elements of this process using cues that are sometimes conscious and sometimes not.

We often think of language as simple and mechanistic, but it is not simple at all. As conversational analysts, our job is to study conversation that we have access to in an attempt to reconstruct the elements that constituted the interaction. Even small chunks of conversation encode quite a bit of information.

The process of conversation analysis is very much contrary to our sense of language as regular language users. This makes the process of explaining our research to people outside our field difficult. It is difficult to justify the research, and it is difficult to explain why such small pieces of data can be so useful, when most other fields of research rely on greater volumes of data.

In fact, a greater volume of data can be more harmful than helpful in conversation analysis. Conversation is heavily dependent on its context; on the people conversing, their relationship, their expectations, their experiences that day, the things on their mind, what they expect from each other and the situation, their understanding of language and expectations, and more. The same sentence can have greatly different meanings once those factors are taken into account.

At a time when there is so much talk of the glory of big data, it is especially important to keep in mind the contributions of small data. These contributions are the ones that jeopardize the utility and promise of big data, and if these contributions can be captured in creative ways, they will be the true promise of the field.

Not what language users expect to see, but rather what we use every day, more or less consciously.

Data Journalism, like photography, “involves selection, filtering, framing, composition and emphasis”

Beautiful:

“Creating a good piece of data journalism or a good data-driven app is often more like an art than a science. Like photography, it involves selection, filtering, framing, composition and emphasis. It involves making sources sing and pursuing truth – and truth often doesn’t come easily.” -Jonathan Gray

Whole article:

http://www.guardian.co.uk/news/datablog/2012/may/31/data-journalism-focused-critical

Truly, at a time when the buzz about big data is at such a peak, it is nice to hear a voice of reason and temper! Folks: big data will not do all that it is talked up to do. It will, in fact, do something surprising and different. And that something will come from the interdisciplinary thought leaders in fields like natural language processing and linguistics. That *something,* not the data itself, will be the new oil.

Searching for Social Meanings in Social Media

This next CLIP event looks really fantastic!

 

Please join us on Wednesday at 11AM in AV Williams room 3258 for the University of Maryland Computational Linguistics and Information Processing (CLIP) colloquium!

 

May 2: Jacob Eisenstein: Searching for social meanings in social media

 

Social interaction is increasingly conducted through online platforms such as Facebook and Twitter, leaving a recorded trace of millions of individual interactions. While some have focused on the supposed deficiencies of social media with respect to more traditional communication channels, language in social media features the same rich connections with personal and group identity, style, and social context. However, social media’s unique set of linguistic affordances causes social meanings to be expressed in new and perhaps surprising ways. This talk will describe research that builds on large-scale social media corpora using analytic tools from statistical machine learning. I will focus on some of the ways in which social media data allow us to go beyond traditional sociolinguistic methods, but I will also discuss lessons from the sociolinguistics literature that the new generation of “big data” research might do well to heed.

 

This research includes collaborations with David Bamman, Brendan O’Connor, Tyler Schnoebelen, Noah A. Smith, and Eric P. Xing.

 

Bio: Jacob Eisenstein is an Assistant Professor in the School of Interactive Computing at Georgia Tech. He works on statistical natural language processing, focusing on social media analysis, discourse, and non-verbal communication. Jacob was a Postdoctoral researcher at Carnegie Mellon and the University of Illinois. He completed his Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation award.

 

Location of AV Williams:

http://www.umd.edu/CampusMaps/bld_detail.cfm?bld_code=AVW

http://maps.google.com/maps?q=av+williams

 

Webpage for CLIP events:

https://wiki.umiacs.umd.edu/clip/index.php/Events#Colloquia

 

More rundown on Academedia

So I promised more on Academedia (note: they will add more video and visual resources to the Academedia website in the next few days)…

First, some of Robert Cannon’s (employed with the FCC and a member of Panel B “New Media: A closer look at what works”) insightful gems

Re: internet “a participatory market of free speech”

Re: kids & social media “It’s not a question of whether kids are writing. Kids are writing all the time. It’s whether parents understand that.”

“The issue is not whether to use Wikipedia, but how to use Wikipedia”
Next, the final panel, “Digital Tools for Communication:” http://gnovis-conferences.com/panel-c/
Hitlin (Pew Project for Excellence in Journalism)
People communicate differently about issues on different kinds of media sources.
Re: Trayvon Martin case -> largest issue by media source

  •      Twitter: 21% Outrage @ Zimmerman
  •      Cable & Talk radio: 17% Gun control legislation
  •      Blogs: 15% Role of race

Re: Crimson Hexagon
Pew is different because they’re in a partnership with Crimson Hexagon to measure trends in traditional media sources, because their standards for error are much higher, and because they have a team of hand coders available.

Crimson Hexagon is different, because it combines human coding with machine learning to develop algorithms. It may actually overlap pretty intensely with some of the traditional qualitative coding programs that allow for some machine learning. I can imagine that this feature would appeal especially to researchers who are reluctant to fully embrace machine coding, which is understandable, given the current state of the art. I wonder if, by hosting their users instead of distributing programs, they’re able to store and learn from the codes developed by the users?
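Crimson Hexagon’s actual algorithms aren’t public, so as a rough illustration of the general idea, hand-coded examples training a machine learner, here is a bare-bones Naive Bayes sketch. The categories and example texts are hypothetical, and real tools use far more sophisticated models; this just shows the shape of the human-coding-plus-learning loop.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Learn word counts per human-assigned label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        label_counts[label] += 1
        for tok in text.lower().split():
            word_counts[label][tok] += 1
            vocab.add(tok)
    return word_counts, label_counts, vocab

def classify(text, word_counts, label_counts, vocab):
    """Pick the label with the highest log posterior (add-one smoothing)."""
    total = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical hand-coded training examples, in the spirit of the
# media-coverage categories mentioned above.
coded = [("gun control laws now", "legislation"),
         ("this verdict is an outrage", "outrage"),
         ("outrage over the shooting", "outrage"),
         ("new legislation on gun control", "legislation")]
model = train(coded)
print(classify("outrage at the verdict", *model))  # -> outrage
```

Each new batch of hand codes retrains the model, which is roughly the add-examples-and-watch-the-results cycle described below.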

CH appears to measure two main domains: topic volume over time and topic sentiment over time. Users get a sense of recall and precision in action as they work with the program, by seeing the results of additions and subtractions to a search lexicon. Through this process, Hitlin got a sense of the meat of the problems with text analysis. He said that it was difficult to find examples that neatly fit into boxes, and that the computer didn’t have an eye for subtlety or things that fit into multiple categories. What he was commenting about was the nature of language in action, or what sociolinguists call Discourse! Through the process of categorizing language, he could sense how complicated it is. Here I get to reiterate one of the main points of this blog: these problems are the reason why linguistics is a necessary aspect of this process. Linguistics is the study of patterns in language, and the patterns we find are inherently different from the patterns we expect to find. Linguistics is a small field, one that people rarely think of. But it is critically essential to a high quality analysis of communication. In fact, we find, when we look for patterns in language, that everything in language is patterned, from its basic morphology and syntax, to its many variations (which are more systematic than we would predict), to methods like metaphor use and intertextuality, and more.

Linguistics is a key, but it’s not a simple fit. Language is patterned in so many ways that linguistics is a huge field. Unfortunately, the subfields of linguistics divide quickly into political and educational camps. It is rare to find a linguist trained in cognitive linguistics, applied linguistics and discourse analysis, for example. But each of these fields are necessary parts of text analysis.

Just as this blog is devoted to knocking down borders in research methods, it is devoted to knocking down borders between subfields and moving forward with strategic intellectual partnerships.

This next speaker in the panel thoroughly blew my mind!

Rami Khater from Al Jazeera English talked about the generation of ‘The Stream,’ an Al Jazeera program that is entirely driven by social media analysis.

Rami can be found on Twitter: @ramisms , and he shared a bit.ly with resources from his talk: bit.ly/yzST1d

The goal of The Stream is to be “a voice of the voiceless,” by monitoring how the hyperlocal goes global. Rami gave a few examples of things we never would have heard about without social media. He showed how hashtags evolve, by starting with competing tags, evolving and changing, and eventually converging into a trend (incidentally, Rami identified the Kony 2012 trend as synthetic from the get-go by pointing out that there was no organic hashtag evolution. It simply started and ended as #Kony2012). He used TrendsMap to show a quick global map of currently trending hashtags. I put a link to TrendsMap on the tools section of the links on this blog, and I strongly encourage you to experiment with it. My daughter and I spent some time looking at it today, and we found an emerging conversation in South Africa about black people on the Titanic. We followed this up with another tool, Topsy, which allowed us to see what the exact conversation was about. Rami gets to know the emerging conversations and then uses local tools to isolate the genesis of the trend and interview people at its source. Instead, my daughter and I looked at WhereTweeting to see what the people around us are tweeting about. We saw some nice words of wisdom from Iyanla Vanzant that were drowning in what appeared to me to be “a whole bunch of crap!” (“Mom-mmy, you just used the C word!”)
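The organic convergence Rami described is easy to check for yourself. This is a minimal sketch (stdlib only, invented example tweets): tally hashtags in early and late batches of tweets and compare; organic trends start with competing tags and converge, while a synthetic campaign shows one tag from the start.

```python
import re
from collections import Counter

def hashtag_counts(tweets):
    """Tally (case-folded) hashtags in a batch of tweets."""
    tags = []
    for tweet in tweets:
        tags.extend(t.lower() for t in re.findall(r"#\w+", tweet))
    return Counter(tags)

# Invented examples: early tweets use competing tags...
early = ["down at the park #OccupyWallSt",
         "big crowd #occupywallstreet",
         "day one #ows"]
# ...while later tweets have converged on one.
late = ["#ows general assembly", "march today #ows", "#ows updates"]

print(hashtag_counts(early))  # three competing tags
print(hashtag_counts(late))   # converged on #ows
```

Comparing these tallies window by window over time gives a crude version of the evolution-vs-synthetic diagnostic.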

Anyway, the tools that Rami shared are linked over here in the sidebar.

I encourage you to play around with them, and I encourage you and me both to go check out the recent Stream interview with Ai Wei Wei!

The final speaker on the panel was Karine Megerdoomian from MITRE. I have encountered a few people from MITRE recently at conferences, and I’ve been impressed with all of them! Karine started with some words that made my day:

“How helpful a word cloud is is basically how much work you put into it”

Exactly! Great point, Karine! And she showed a particularly great word cloud that combined useful words and phrases into a single image. Niiice!

Karine spoke a bit about MITRE’s efforts to use machine learning to identify age and gender among internet users. She mentioned that older users tended to use noses in their smilies :-) and younger users did not :) . She spoke of how older Iranian users tended to use Persian morphology when creating neologisms, and younger users tended to use English, and she spoke about predicting revolutions and seeing how they are propagated over time.
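The smiley-nose cue can be turned into a tiny feature extractor. The regex below is my own rough approximation, not MITRE’s actual feature set; a real age or gender model would combine hundreds of features like this one.

```python
import re

# Smilies with a "nose" character, e.g. :-) , vs. noseless ones, e.g. :)
NOSED = re.compile(r"[:;]-[)D(P]")
NOSELESS = re.compile(r"[:;][)D(P]")

def smiley_features(text):
    """Count nosed vs. noseless smilies, one small stylistic feature
    a learner might use alongside many others."""
    return {"nosed": len(NOSED.findall(text)),
            "noseless": len(NOSELESS.findall(text))}

print(smiley_features("thanks :-) lol :)"))  # {'nosed': 1, 'noseless': 1}
```

Feeding counts like these into a classifier (per user, over many posts) is the general shape of the age-prediction approach described above.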

After this point, the floor was opened up for questions. The first question was a critically important one for researchers. It was about representativeness.

The speakers pointed out that social media has a clear bias toward English speakers and Western-educated, white, male, liberal, US & UK users. Every network has a different set of flaws, but every network has flaws. It is important not to use these analyses as though they were complete. You simply have to go deeper in your analysis.

 

There was a bit more great discussion, but I’m going to end here. I hope that others will cover this event from other perspectives. I didn’t even mention the excellent discussions about education and media!