Statistical Text Analysis for Social Science: Learning to Extract International Relations from the News

I attended another great CLIP event today, Statistical Text Analysis for Social Science: Learning to Extract International Relations from the News, by Brendan O’Connor, CMU. I’d love to write it up, but I decided instead to share my notes. I hope they’re easy to follow. Please feel free to ask any follow-up questions!

Computational Social Science

– Then: the 1890 census tabulator, a hand-cranked punch card tabulator

– Now: automated text analysis

Goal: develop methods for predicting (etc.) conflicts

– events = data

– extracting events from news stories

– information extraction from large scale news data

– goal: time series of country-country interactions

– who did what to whom? in what order?

Long history of manual coding of this kind of data for this kind of purpose

– more recently: rule based pattern extraction, TABARI

– → developing event types (diplomatic events, aggressions, …) from verb patterns

– TABARI: 15,000 hand-engineered coding patterns built over the course of 2 decades → very difficult, validity issues, changes over time

– all developed by political scientists (Schrodt 1994), in the MUC (Message Understanding Conference) days

– still a common poli sci methodology

– GDELT project: software, etc. w/ pre- & postprocessing

– http://gdelt.utdallas.edu

– Sources: mainstream media news, English language, select sources

THIS research

– automatic learning of event types

– extract events/ political dynamics

→ use Bayesian probabilistic methods

– using social context to drive unsupervised learning about language

– data: Gigaword corpus (news articles) – a few extra sources (end result mostly AP articles)

– named entities: dictionary of country names

– news biases difficult to take into account (inherent complication of the dataset)(future research?)

– main-verb-based dependency path (so data is POS tagged & then subject/object tagged)

– 3 components: source (acting country)/ recipient (recipient country)/ predicate (dependency path)

– loosely Dowty 1990

– International Relations (IR) is heavily concerned with reciprocity- that affects/shapes coding, goals, project dynamics (e.g. timing less important than order, frequency, symmetry)

– parsing: CoreNLP

– filters (e.g. Georgia country vs. Georgia state) (manual coding statements)

– analysis more focused on verb than object (e.g. text following “said that” excluded)

– 50% accuracy finding the main verb (did I hear that right? ahhh POS taggers and their many joys…)

– verb: “reported that” – complicated: who is a valid source? reported events not necessarily verified events

– verb: “know that” another difficult verb
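To make the source / recipient / predicate structure concrete, here is a minimal sketch of how one already-parsed sentence might be turned into an event tuple. Everything named here is an assumption for illustration: the tiny `COUNTRY_ALIASES` dictionary, the `EXCLUDED_PREDICATES` set, and the idea that the predicate arrives as a plain string rather than a real dependency path. It is not the speaker's pipeline.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mini-dictionary of country names; a real one would be far larger.
COUNTRY_ALIASES = {
    "israel": "ISR", "israeli": "ISR",
    "palestinians": "PSE", "palestinian": "PSE",
    "russia": "RUS",
}

# Assumption: predicates whose complements are skipped (the notes mention "said that").
EXCLUDED_PREDICATES = {"said that"}


@dataclass(frozen=True)
class EventTuple:
    source: str      # acting country
    recipient: str   # recipient country
    predicate: str   # stand-in for the dependency path between the two


def make_event(subject: str, predicate: str, obj: str) -> Optional[EventTuple]:
    """Return an event tuple, or None if the sentence does not qualify."""
    source = COUNTRY_ALIASES.get(subject.lower())
    recipient = COUNTRY_ALIASES.get(obj.lower())
    # Both ends must be known countries, and they must differ.
    if source is None or recipient is None or source == recipient:
        return None
    if predicate in EXCLUDED_PREDICATES:
        return None
    return EventTuple(source, recipient, predicate)


print(make_event("Israel", "accuse", "Palestinians"))
```

Names missing from the dictionary (an ambiguous "Georgia", for example) simply yield no event here, which is a much cruder filter than the manually coded ones described above.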

The models:

– dyads = country pairs

– each w/ timesteps

– for each country pair a time series

– deduping necessary for multiple news coverage (normalizing)

– more than one article can cover a single event

– effect of this mitigated because measurement in the model focuses on the timing of events more than the number of events
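The dyad time series idea can be sketched in a few lines. This is only an illustration under stated assumptions: events arrive as `(day, source, recipient, predicate)` tuples, identical tuples count as duplicate coverage, and the timestep is an ISO week. The talk's actual deduplication rule and time-slice length are not in my notes.

```python
from collections import defaultdict
from datetime import date


def dyad_time_series(events):
    """Count deduplicated events per directed dyad and ISO week.

    events: iterable of (day, source, recipient, predicate) tuples.
    """
    unique = set(events)  # crude dedup: identical reports collapse to one
    series = defaultdict(lambda: defaultdict(int))
    for day, source, recipient, predicate in unique:
        week = tuple(day.isocalendar()[:2])  # (ISO year, ISO week number)
        # The dyad is kept directed so that reciprocity stays visible.
        series[(source, recipient)][week] += 1
    return series


events = [
    (date(2013, 10, 7), "ISR", "PSE", "accuse"),
    (date(2013, 10, 7), "ISR", "PSE", "accuse"),   # second article, same event
    (date(2013, 10, 8), "ISR", "PSE", "meet with"),
    (date(2013, 10, 8), "PSE", "ISR", "accuse"),
]
series = dyad_time_series(events)
print(dict(series[("ISR", "PSE")]))
```

Keeping (ISR, PSE) separate from (PSE, ISR) is a deliberate choice in this sketch, since order and symmetry matter more than raw counts for the reciprocity questions mentioned earlier.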

1st model

– independent contexts

– time slices

– figure for expected frequency of events (talking most common, e.g.)

2nd model

– temporal smoothing: assumes a smoothness in event transitions

– possible to put coefficients that reflect common dynamics- what normally leads to what? (opportunity for more research)

– blocked Gibbs sampling

– learned event types

– positive valence

– negative valence

– “say” ← some noise

– clusters: verbal conflict, material conflict, war terms, …
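To give a feel for what "temporal smoothing" could mean, here is a small simulation, not the model from the talk. It assumes each dyad has a vector of event-type scores that drifts as a Gaussian random walk and is mapped to probabilities with a softmax. The function names, the `step_sd` parameter, and the random-walk form are all my assumptions, and no inference (blocked Gibbs or otherwise) is performed.

```python
import math
import random


def softmax(logits):
    """Map real-valued scores to a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_smoothed_dyad(n_steps, n_types, step_sd=0.3, seed=0):
    """Simulate event-type distributions that change smoothly over time."""
    rng = random.Random(seed)
    logits = [rng.gauss(0.0, 1.0) for _ in range(n_types)]
    path = []
    for _ in range(n_steps):
        path.append(softmax(logits))
        # Small Gaussian step: the next timestep stays close to this one.
        logits = [x + rng.gauss(0.0, step_sd) for x in logits]
    return path


path = simulate_smoothed_dyad(n_steps=5, n_types=3)
for dist in path:
    print([round(p, 3) for p in dist])
```

A smaller `step_sd` makes adjacent timesteps more alike. Dropping the dependence on the previous step entirely would correspond, loosely, to the independent-contexts setup of the 1st model.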

How to evaluate?

– need more checks of reasonableness, more input from poli sci & international relations experts

– project end goal: do political sci

– one evaluative method: qualitative case study (face validity)

– used the most common dyad: Israeli-Palestinian

– event class over time

– e.g. diplomatic actions over time

– where are the spikes, what do they correspond with? (essentially precision & recall)

– another event class: police action & crime response

– Great point from the audience about face validity: “my model says x,” then you go to the data. You can’t develop labels from the data you evaluate on; labels should come from training data, not testing data

– Now let’s look at a small subset of words to go deeper

– semantic coherence?

– does it correlate with conflict?

– quantitative

– lexical scale evaluation

– compare against TABARI (lucky to have that as a comparison!!)

– another element in TABARI: expert assigned scale scores – very high or very low

– validity debatable, but it’s a comparison of sorts

– granularity invariance

– lexical scale impurity
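One plausible reading of "lexical scale impurity", sketched under assumptions: take the expert scale scores for the verbs in each learned cluster and average the absolute score difference over all same-cluster pairs. Lower would mean clusters that are more coherent on the conflict/cooperation scale. The scores below are made up for illustration and are not TABARI's, and the exact definition used in the talk may differ.

```python
from itertools import combinations


def scale_impurity(clusters, scale_scores):
    """Mean absolute scale-score gap between verbs sharing a cluster."""
    diffs = []
    for verbs in clusters:
        # Only verbs that have an expert score can be compared.
        scored = [scale_scores[v] for v in verbs if v in scale_scores]
        diffs.extend(abs(a - b) for a, b in combinations(scored, 2))
    return sum(diffs) / len(diffs) if diffs else 0.0


# Made-up scores: positive = cooperative, negative = conflictual.
scores = {"praise": 3.0, "thank": 3.0, "attack": -9.0, "bomb": -10.0}

coherent = [["praise", "thank"], ["attack", "bomb"]]
mixed = [["praise", "attack"], ["thank", "bomb"]]

print(scale_impurity(coherent, scores))  # 0.5
print(scale_impurity(mixed, scores))     # 12.5
```

The same function can score any clustering of the same verbs, so in principle it could be applied to the comparison clusterings as well as to the learned event types.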

Comparison sets

– WordNet – has synsets – some verb clusters

– WordNet is low performing, generic

– WordNet is a better bar than beating random clusters

– this model should perform better because of topic specificity

“Gold standard” method- rarely a real gold standard- often gold standards themselves are problematic

– in this case: the Militarized Interstate Dispute (MID) dataset (wow, lucky to have that, too!)

Looking into semi-supervision, to create a better model

speaker website:

http://brenocon.com

Q&A:

developing a user model

– user testing

– evaluation from users & not participants or collaborators

– terror & protest more difficult linguistic problems

more complications to this project:

– Taiwan, Palestine, Hezbollah- diplomatic actors, but not countries per se

8 thoughts on “Statistical Text Analysis for Social Science: Learning to Extract International Relations from the News”


Nice to see Dowty getting utilized by NLP. Would love to see more Jackendoff too.

freerangeresearch says:

October 10, 2013 at 5:34 pm

Honestly, I haven’t come across either. Do you have a good introduction to each to recommend?

- TheLousyLinguist says:
  
  October 11, 2013 at 9:50 am
  
  For a quick intro to Dowty-style semantic decomposition, try Van Valin’s intro to RRG http://linguistics.buffalo.edu/people/faculty/vanvalin/rrg/RRGsummary.pdf (ignore the first part about trees, way too messy).
  
  The sections ‘The lexical representation of verbs’, ‘Semantic roles’ and ‘The lexicon’ (pages 9-17) are all derived from Dowty 1979 (an impossible paper to find these days, sadly) and Jackendoff’s work.
  
  For Jackendoff, try this handout: Lexical-conceptual structure (http://people.brandeis.edu/~smalamud/ling130/LCS.pdf )
- freerangeresearch says:
  
  October 11, 2013 at 10:02 am
  
  That’s great. Thank you!


Free Range Research

An aspiring postdisciplinarian surfs through the ebbs and flows of the changing research environment
