Is there Interdisciplinary hope for Social Media Research?

I’ve been trying to wrap my head around social media research for a couple of years now. I don’t think it would be as hard to understand from any one academic or professional perspective, but, from an interdisciplinary standpoint, the variety of perspectives and the disconnects between them are stunning.

In the academic realm:

There is the computer science approach to social media research. From this standpoint, we see the fleshing out of machine learning algorithms in a stunning horserace of code development across a few programming languages. This is the most likely to be opaque, proprietary knowledge.

There is the NLP or linguistic approach, which overlaps to some degree with the CS approach, although it is often more closely tied to grammatical rules. In this case, we see grammatical parsers, dictionary development, and APIs or shared programming toolkits, such as NLTK or GATE. Linguistics is divided as a discipline, and many of these divisions have filtered into NLP.

Both the NLP and CS approaches can be fleshed out, trained, or used on just about any data set.

There are the discourse approaches. Discourse is an area of linguistics concerned with meaning above the level of the sentence. This type of research can follow more of a strict Conversation Analysis approach or a kind of Netnography approach. This school of thought is more concerned with context as a determiner or shaper of meaning than the two approaches above.

For these approaches, the dataset cannot just come from anywhere. The analyst should understand where the data came from.

One could divide these traditions by programming skills, but there are enough of us who do work on both sides that the distinction is superficial. Generally speaking, though, the deeper one’s programming or qualitative skills, the less likely one is to cross over to the other side.

There is also a growing tradition of data science, which is primarily quantitative. Although I have some statistical background and work with quantitative data sets every day, I don’t have a good understanding of data science as a discipline. I assume that the growing field of data visualization would fall into this camp.

In the professional realm:

There are many companies in horseraces to develop the best systems first. These companies use catchphrases like “big data” and “social media firehose” and often focus on sentiment analysis or topic analysis (usually topics are gleaned through keywords). These companies primarily market to the advertising industry and market researchers, often with inflated claims of accuracy, which are possible because of the opacity of their methods.
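To make the keyword point concrete, here is a minimal sketch of what keyword-based topic tagging and word-list sentiment scoring can look like. It is my own toy illustration, not any vendor's actual method: the keyword lists, the word lists, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword and sentiment word lists, invented for illustration only.
TOPIC_KEYWORDS = {
    "price": {"price", "cost", "expensive", "cheap"},
    "service": {"service", "support", "staff", "helpful"},
}
POSITIVE = {"love", "great", "helpful", "cheap"}
NEGATIVE = {"hate", "terrible", "expensive", "rude"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tag_topics(text):
    """Return the sorted topics whose keyword lists overlap the text."""
    tokens = set(tokenize(text))
    return sorted(topic for topic, words in TOPIC_KEYWORDS.items() if tokens & words)


def sentiment_score(text):
    """Count positive words minus negative words; context is ignored entirely."""
    counts = Counter(tokenize(text))
    positive = sum(counts[word] for word in POSITIVE)
    negative = sum(counts[word] for word in NEGATIVE)
    return positive - negative


print(tag_topics("The staff were rude and the price was expensive"))
print(sentiment_score("The staff were rude and the price was expensive"))
print(sentiment_score("Oh great, another price increase. I just love it."))
```

The first post is tagged with both topics and scores -2. The last post is plainly sarcastic, yet it scores +2 because the counter only sees "great" and "love". That gap between word counts and meaning is exactly where context-blind methods and accuracy claims part ways.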

There is the realm of market research, which is quickly becoming dependent on fast, widely available knowledge. This knowledge is usually gleaned through companies involved in the horserace, without much awareness of the methodology. There is an increasing need for companies to be aware of their brand’s mentions and interactions online, in real time, and as they collect this information it is easy, convenient and cost effective to collect more information in the process, such as sentiment analyses and topic analyses. This field has created an astronomically high demand for big data analysis.

There is the traditional field of survey research. This field is methodical and error focused. Knowledge is created empirically and evaluated critically. Every aspect of the survey process is highly researched and understood in great depth, so new methods are greeted with a natural skepticism. Although they have traditionally been the anchors of good professional research methods and the leaders in the research field, survey researchers are largely outside of the big data rush. Survey researchers tend to value accuracy over timeliness, so the big, fast world of big data, with its dubious ability to create representative samples, holds little allure or relevance.

The wider picture

In the wider picture, we have discussions of access and use. We see a growing proportion of the population coming online on an ever greater variety of devices. On the surface, the digital divide is fast shrinking (albeit still significant). Some of the digital access debate has been expanded into an understanding of differential use: essentially, that different people do different activities while online. I want to take this debate further by focusing on discursive access or the digital representation of language ideologies.

The problem

The problem with such a wide spread of methods, needs, focuses and analytic traditions is that there isn’t enough crossover. It is very difficult to find work that spreads across these domains. The audiences are different, the needs are different, the abilities are different, and the professional visions are dramatically different across traditions. Although many people are speaking, it seems like people are largely speaking within silos or echo chambers, and knowledge simply isn’t trickling across borders.

This problem has rapidly grown because the underlying professional industries have quickly calcified. Sentiment analysis is not the revolutionary answer to the text analysis problem, but it is good enough for now, and it is skyrocketing in use. Academia is moving too slowly for the demands of industry and is not addressing its needs, so other analytic techniques are not being adopted.

Social media analysis would best be accomplished by a team of people, each with different training. But it is not developing that way. And that, I believe, is a big (and fast growing) problem.

Notes on the Past, Present and Future of Survey Methodology from #dcaapor

I had wanted to write these notes up into paragraphs, but I think the notes will be more timely, relevant and readable if I share them as they are. This was a really great conference, very relevant and timely, based on a really great issue of Public Opinion Quarterly. As I was reminded at the DC African Festival (a great festival, lots of fun, highly recommended) on Saturday, “In order to understand the future you must embrace the past.”

DC AAPOR Annual Public Opinion Quarterly Special Issue Conference

75th Anniversary Edition

The Past, Present and Future of Survey Methodology and Public Opinion Research

Look out for slides from the event here:


Note: Of course, I took more notes in some sessions than others…

Peter Miller:

–       Adaptive design (tracking changes in estimates across mailing waves and tracking response bias) is becoming standard practice at Census

–       Check out Howard Schuman’s article tracking attitudes toward Christopher Columbus

  • Ended up doing some field research in the public library, reading children’s books

Stanley Presser:

–       Findings have no meaning independent of the method with which they were collected

–       Balance of substance and method make POQ unique (this was a repeated theme)

Robert Groves:

–       The survey was the most important invention in Social Science in the 20th century – quote credit?

–       3 eras of survey research (boundaries somewhat arbitrary)

  • 1930-1960
    • Foundation laid, practical development
  • 1960-1990
    • Founders pass on their survey endeavors to their protégés
    • From face to face to phone and computer methods
    • Emergence & Dominance of Dillman method
    • Growth of methodological research
    • Total Survey Error perspective dominates
    • Big increase in federal surveys
    • Expansion of survey centers & private sector organizations
    • Some articles say survey method dying because of nonresponse and inflating costs. This is a perennial debate. Groves speculated that around every big election time, someone finds it in their interest to doubt the polls and assigns a jr reporter to write a piece calling the polls into question.
  • 1990→
    • Influence of other fields, such as social cognitive psychology
    • Nonresponse up, costs up → volunteer panels
    • Mobile phones decrease cost effectiveness of phone surveys
    • Rise of internet only survey groups
    • Increase in surveys
    • Organizational/ business/ management skills more influential than science/ scientists
    • Now: software platforms, culture clash with all sides saying “Who are these people? Why do they talk so funny? Why don’t they know what we know?”
    • Future
      • Rise of organic data
      • Use of administrative data
      • Combining data sets
      • Proprietary data sets
      • Multi-mode
      • More statistical gymnastics

Mike Brick:

  • Society’s demand for information is Insatiable
  • Re: Heckathorn/ Respondent Driven samples
    • Adaptive/ indirect sampling is better
    • Model based methods
      • Missing data problem
      • Cost the main driver now
      • Estimation methods
      • Future
        • Rise of multi-frame surveys
        • Administrative records
        • Sampling theory w/nonsampling errors at design & data collection stages
          • Sample allocation
          • Responsive & adaptive design
          • Undercoverage bias can’t be fixed at the back end
            • *Biggest problem we face*
            • Worse than nonresponse
            • Doug Rivers (2007)
              • Math sampling
              • Web & volunteer samples
              • 1st shot at a theory of nonprobability sampling
            • Quota sampling failed in 2 high profile examples
              • Problem: sample from interviews/ biased
              • But that’s FIXABLE
            • Observational
              • Case control & eval studies
              • Focus on single treatment effect
              • “tougher to measure everything than to measure one thing”
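One way to read the point about undercoverage is through the standard coverage bias identity: the bias of the covered-population mean equals the uncovered share times the gap between covered and uncovered means. The sketch below is my own illustration of that identity, not something from the talk, and the numbers are hypothetical.

```python
def population_mean(covered_mean, uncovered_mean, uncovered_share):
    """Mean of the full population as a mix of covered and uncovered groups."""
    return (1 - uncovered_share) * covered_mean + uncovered_share * uncovered_mean


def coverage_bias(covered_mean, uncovered_mean, uncovered_share):
    """Bias of the covered-population mean relative to the full-population mean."""
    return uncovered_share * (covered_mean - uncovered_mean)


# Hypothetical example: 25% of the population is missing from the frame,
# and 60% of covered people vs. 40% of uncovered people hold some opinion.
covered, uncovered, share = 0.60, 0.40, 0.25
print(round(population_mean(covered, uncovered, share), 3))
print(round(coverage_bias(covered, uncovered, share), 3))
```

With these made-up numbers the true figure is about 0.55, but a sample drawn from the frame estimates 0.60, a bias of about 0.05. Sample size appears nowhere in the formula, so collecting more interviews from the same frame does nothing to shrink it.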

Mick Couper:

–       Mode an outdated concept

  • Too much variety and complexity
  • Modes are multidimensional
    • Degree of interviewer involvement
    • Degree of contact
    • Channels of communication
    • Level of privacy
    • Technology (used by whom?)
    • Synchronous vs. asynchronous
  • More important to look at dimensions other than mode
  • Mode is an attribute of a respondent or item
  • Basic assumption of mixed mode is that there is no difference in responses by mode, but this is NOT true
    • We know of many documented, nonignorable, nonexplainable mode differences
    • Not “the emperor has no clothes” but “the emperor is wearing suggestive clothes”
    • Dilemma: differences not Well understood
      • Sometimes theory comes after facts
      • That’s where we are now- waiting for the theory to catch up (like where we are on nonprobability sampling)

–       So, the case for mixed mode collection so far is mixed

  • Mail w/web option has been shown to have a lower response rate than mail only across 24-26 studies, at least!!
    • (including Dillman, JPSM, …)
    • Why? What can we do to fix this?
    • Sequential modes?
      • Evidence is really mixed
      • The impetus for this is more cost than response rate
      • No evidence that it brings in a better mix of people

–       What about Organic data?

  • Cheap, easily available
  • But good?
  • Disadvantages:
    • One var at a time
    • No covariates
    • Stability of estimates over time?
    • Potential for mischief
      • E.g. open or call-in polls
      • My e.g. #muslimrage
  • Organic data wide, thin
  • Survey data narrow, deep

–       Face to face

  • Benchmark, gold standard, increasingly rare

–       Interviewers

  • Especially helpful in some cases
    • Nonobservation
    • Explaining, clarifying

–       Future

  • Technical changes will drive dev’t
  • Modes and combinations of modes will proliferate
  • Selection bias The Biggest Threat
  • Further proliferation of surveys
    • Difficult for us to distinguish our work from “any idiot out there doing them”

–       Surveys are tools for democracy

  • Shouldn’t be restricted to tools for the elite
  • BUT
  • There have to be some minimum standards

–       “Surveys are tools and methodologists are the toolmakers”

Nora Cate Schaeffer:

–       Jen Dykema read & summarized 78 design papers- her summary is available in the appendix of the paper

–       Dynamic interactive displays for respondent in order to help collect complex data

–       Making decisions when writing questions

  • See flow chart in paper
    • Some decisions are nested
  • Question characteristics
    • E.g. presence or absence of a feature
      • E.g. response choices

Sunshine Hillygus:

–       Political polling is “a bit of a bar trick”

  • The best value in polls is in understanding why the election went the way it did

–       Final note: “The things we know as a field are going to be important going forward, even if it’s not in the way they’ve been used in the past”

Lori Young and Diana Mutz:

–       Biggest issues:

  • Diversity
  • Selective exposure
  • Interpersonal communication

–       2 kinds of search, influence of each

  • Collaborative filter matching, like Amazon
    • Political targeting
    • Contentious issue: 80% of people said that if they knew a politician was targeting them they wouldn’t vote for that candidate
      • My note: interesting to think about people’s relationships with their superficial categories of identity. It’s taken for granted so much in social science research, yet not by the people within the categories

–       Search engines: the new gatekeepers

  • Page rank & other algorithms
  • No one knows what influence personalization of search results will have
  • Study on search learning: gave systematically different input to train engines (given same start point); results changed Fast and Substantively

Rob Santos:

–       Necessity mother of invention

  • Economic pressure
  • Reduce costs
  • Entrepreneurial spirit
  • Profit
  • Societal changes
    • Demographic diversification
      • Globalization
      • Multi-lingual
      • Multi-cultural
      • Privacy concerns
      • Declining participation

–       Bottom line: we adapt. Our industry Always Evolves

–       We’re “in the midst of a renaissance, reinventing ourselves”

  • Me: That’s framing for you! Wow!

–       On the rise:

  • Big Data
  • Synthetic Data
    • Transportation industry
    • Census
    • Simulation studies
      • E.g. How many people would pay x amount of income tax under y policy?
  • Bayesian Methods
    • Apply to probability and nonprobability samples
  • New generation
    • Accustomed to and EXPECT rapid technological turnover
    • Fully enmeshed in social media

–       3 big changes:

  • Non-probability sampling
    • “Train already left the station”
    • Level of sophistication varies
    • Model based inference
    • Wide public acceptance
    • Already a proliferation
  • Communication technology
    • Passive data collection
      • Behaviors
        • E.g. pos (point of service) apps
        • Attitudes or opinions
      • Real time collection
        • Prompted recall (apps)
        • Burden reduction
          • Gamification
  • Big Data
    • What is it?
    • Data too big to store
      • (me: think “firehoses”)
      • Volume, velocity, variety
      • Fuzzy inferences
      • Not necessarily statistical
      • Coarseness of insights

–       We need to ask tough questions

  • (theme of next AAPOR conference is just that)
  • We need to question probability samples, too
    • Flawed designs abound
    • High nonresponse & noncoverage
    • Can’t just scrutinize nonprobability samples
  • Nonprobability designs
    • Some good, well accepted methods
    • Diagnostics for measurement
      • How to measure validity?
      • What are the clues?
      • How to create a research agenda to establish validity?
  • Expanding the players
    • Multidisciplinary
      • Substantive scientists
      • Math stats
      • Modelers
      • Econometricians
  • We need
    • Conversations with practitioners
    • Better listening skills

–       AAPOR’s role

  • Create forum for conversation
  • Encourage transparency
  • Engage in outreach
  • Understanding limitations but learning approaches

–       We need to explore the utility of nonprobability samples

–       Insight doesn’t have to be purely from statistical inferences

–       The biggest players in big data to date include:

  • Computational scientists
  • Modelers/ synthetic data’ers

–       We are not a “one size fits all” society, and our research tools should reflect that

My big questions:

–       “What are the borders of our field?”

–       “What makes us who we are, if we don’t do surveys even primarily?”

Linguistic notes:

–       Use of we/who/us

–       Metaphors: “harvest” “firehose”

–       Use of specialized vocabulary

–       Use of the word “comfortable”

–       Interview as a service encounter?

Other notes:

–       This reminds me of Colm O’Muircheartaigh, from that old JPSM distinguished lecture

  • Embracing diversity
  • Allowing noise
  • Encouraging mixed methods

I wish his voice was a part of this discussion…

Remotely following AAPOR conference #aapor

The AAPOR 2012 conference began today in sunny Orlando, Florida. This is my favorite conference of the year, and I am sorry to miss it. Fortunately, the Twitter stream is bringing a lot of the action to home viewers like us!

I will keep retweeting some of the action. For those of you who may be concerned that this represents a new era of heavy tweeting for me, rest assured, it won’t!

And for anyone who has been wondering what happened to me and my blog, please stay tuned. I am working on an exciting new project that I will eagerly share in due time.