I attended another great CLIP event today, Statistical Text Analysis for Social Science: Learning to Extract International Relations from the News, by Brendan O’Connor, CMU. I’d love to write it up, but I decided instead to share my notes. I hope they’re easy to follow. Please feel free to ask any follow-up questions!
Computational Social Science
– Then: the 1890 census tabulator, a hand-cranked punch card tabulator
– Now: automated text analysis
Goal: develop methods for predicting conflicts, etc.
– events = data
– extracting events from news stories
– information extraction from large scale news data
– goal: time series of country-country interactions
– who did what to whom? in what order?
Long history of manual coding of this kind of data for this kind of purpose
– more recently: rule-based pattern extraction, e.g. TABARI
– → developing event types (diplomatic events, aggressions, …) from verb patterns
– TABARI: ~15,000 hand-engineered coding patterns, built over two decades → very difficult, validity issues, changes over time; all developed by political scientists (Schrodt 1994, in the MUC days)
– still a common poli sci methodology
– GDELT project: software, etc., with pre- & post-processing
– http://gdelt.utdallas.edu
– Sources: mainstream media news, English language, select sources
THIS research
– automatic learning of event types
– extract events/ political dynamics
→ use Bayesian probabilistic methods
– using social context to drive unsupervised learning about language
– data: Gigaword corpus (news articles), plus a few extra sources (the end result is mostly AP articles)
– named entities: a dictionary of country names
– news biases are difficult to take into account (an inherent complication of the dataset) (future research?)
– main-verb-based dependency path (so data is POS tagged & then subject/object tagged)
– 3 components: source (acting country) / recipient (recipient country) / predicate (dependency path) (see the sketch after this list)
– loosely following Dowty 1990
– International Relations (IR) is heavily concerned with reciprocity; that affects/shapes the coding, goals, and project dynamics (e.g. exact timing matters less than order, frequency, symmetry)
– parsing: core NLP
– filters (e.g. Georgia the country vs. Georgia the US state) (manual coding statements)
– analysis focuses more on the verb than the object (e.g. text following “said that” is excluded)
– 50% accuracy finding the main verb (did I hear that right? ahhh, POS taggers and their many joys…)
– verb: “reported that” is complicated: who is a valid source? reported events are not necessarily verified events
– verb: “know that” is another difficult verb
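Here’s a rough sketch of how I picture the source / recipient / predicate extraction working. The talk didn’t show code, so spaCy and the tiny country dictionary below are my own stand-ins, and the real system keeps the full dependency path rather than just the verb lemma:

```python
# Rough sketch: pull (source country, recipient country, predicate) tuples out of
# a sentence. spaCy stands in for whatever parser the project actually uses, and
# COUNTRIES is a toy stand-in for the real country-name dictionary.
import spacy

nlp = spacy.load("en_core_web_sm")

COUNTRIES = {"israel": "ISR", "palestine": "PSE", "georgia": "GEO"}  # toy dictionary

def extract_tuples(sentence):
    doc = nlp(sentence)
    tuples = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        # look for a subject and an object attached to the verb
        subj = next((c for c in token.children if c.dep_ == "nsubj"), None)
        obj = next((c for c in token.children if c.dep_ in ("dobj", "obj")), None)
        if subj is None or obj is None:
            continue
        src = COUNTRIES.get(subj.text.lower())
        rcp = COUNTRIES.get(obj.text.lower())
        if src and rcp:
            # the real system uses the whole dependency path as the predicate;
            # the verb lemma is enough to convey the idea here
            tuples.append((src, rcp, token.lemma_))
    return tuples

print(extract_tuples("Israel criticized Palestine on Monday."))
# e.g. [('ISR', 'PSE', 'criticize')]
```

(The Georgia-country-vs.-state filters would sit on top of the dictionary lookup.)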
The models:
– dyads = country pairs
– each w/ timesteps
– for each country pair a time series
– deduping is necessary because of multiple news coverage (normalizing)
– more than one article can cover a single event
– the effect of this is mitigated because measurement in the model focuses on the timing of events more than on the number of events (see the sketch below)
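A quick sketch of the dyad bookkeeping as I understood it, using the tuples from the earlier sketch. The dedup rule here (one count per dyad, predicate, and timestep) is my guess at what “normalizing” means, not necessarily the project’s exact rule:

```python
# Sketch: bucket extracted (source, recipient, predicate) tuples into per-dyad
# time series, deduplicating repeated coverage of the same underlying event.
# The "one count per (dyad, predicate, timestep)" rule is an illustrative guess.
from collections import defaultdict

def build_dyad_series(events):
    """events: iterable of (date, source, recipient, predicate) tuples."""
    seen = set()
    series = defaultdict(lambda: defaultdict(list))  # dyad -> timestep -> predicates
    for date, src, rcp, pred in events:
        timestep = date[:7]                  # e.g. monthly timesteps: "2009-01"
        key = (src, rcp, pred, timestep)
        if key in seen:                      # many articles, one underlying event
            continue
        seen.add(key)
        series[(src, rcp)][timestep].append(pred)
    return {dyad: dict(ts) for dyad, ts in series.items()}

events = [
    ("2009-01-03", "ISR", "PSE", "criticize"),
    ("2009-01-04", "ISR", "PSE", "criticize"),  # duplicate coverage -> deduped
    ("2009-02-11", "PSE", "ISR", "accuse"),
]
print(build_dyad_series(events))
```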
1st model
– independent contexts
– time slices
– figure showing the expected frequency of event types (e.g. talking is most common) (toy version below)
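My reading of “independent contexts”: each (dyad, time slice) gets its own event-type distribution with nothing shared across slices. A toy version with raw frequencies; the actual model is Bayesian, so this is just the independence idea, not the inference:

```python
# Toy version of the first model's independence assumption: estimate an
# event-type distribution separately for every (dyad, timestep) slice.
# The real model puts a prior over these rather than using raw frequencies.
from collections import Counter

def slice_distributions(series):
    """series: dyad -> timestep -> list of predicates (as in the previous sketch)."""
    dists = {}
    for dyad, timesteps in series.items():
        for t, preds in timesteps.items():
            counts = Counter(preds)
            total = sum(counts.values())
            dists[(dyad, t)] = {p: c / total for p, c in counts.items()}
    return dists
```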
2nd model
– temporal smoothing: assumes smoothness in event transitions (toy illustration after this list)
– possible to add coefficients that reflect common dynamics: what normally leads to what? (an opportunity for more research)
– blocked Gibbs sampling
– learned event types
– positive valence
– negative valence
– “say” ← some noise
– clusters: verbal conflict, material conflict, war terms, …
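To make the smoothness assumption concrete, here’s a toy illustration: instead of estimating each time slice on its own, blend it with the previous slice’s estimate. The real model encodes smoothness as a prior and infers it with blocked Gibbs sampling; this exponential-style smoother is only meant to convey the idea, not the actual method:

```python
# Toy illustration of temporal smoothing: blend each slice's event-type
# proportions with the previous timestep's smoothed estimate. The real model
# expresses smoothness as a prior and uses blocked Gibbs sampling instead.
def smooth_series(per_step_dists, event_types, alpha=0.5):
    """per_step_dists: list of {event_type: proportion} dicts, in time order."""
    smoothed = []
    prev = {e: 1.0 / len(event_types) for e in event_types}   # uniform start
    for dist in per_step_dists:
        cur = {e: alpha * dist.get(e, 0.0) + (1 - alpha) * prev[e]
               for e in event_types}
        smoothed.append(cur)
        prev = cur
    return smoothed
```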
How to evaluate?
– need more checks of reasonableness, more input from poli sci & international relations experts
– project end goal: do political science
– one evaluative method: qualitative case study (face validity)
– used the most common dyad: Israeli-Palestinian
– event class over time
– e.g. diplomatic actions over time
– where are the spikes, what do they correspond with? (essentially precision & recall)
– another event class: police action & crime response
– Great point from the audience on face validity: my model says x, then I go to the data to check; but you can’t develop labels from the data itself. Labels should come from training data, not testing data.
– Now let’s look at a small subset of words to go deeper
– semantic coherence?
– does it correlate with conflict?
– quantitative
– lexical scale evaluation
– compare against TABARI (lucky to have that as a comparison!!)
– another element in TABARI: expert-assigned scale scores, very high or very low
– validity is debatable, but it’s a comparison of sorts
– granularity invariance
– lexical scale impurity (toy sketch after this list)
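My paraphrase of lexical scale impurity: if a learned event type lumps together verbs whose expert-assigned scale scores are far apart, it’s “impure.” A toy version as the average within-cluster gap in scale scores (the paper’s exact weighting may differ, and the scores below are made up for illustration):

```python
# Toy lexical scale impurity: average absolute difference in expert-assigned
# scale scores between pairs of verbs placed in the same learned cluster.
# Lower is better; the actual weighting may differ.
from itertools import combinations

def scale_impurity(clusters, scale):
    """clusters: list of verb lists; scale: verb -> expert scale score."""
    diffs = []
    for cluster in clusters:
        scored = [scale[v] for v in cluster if v in scale]
        diffs.extend(abs(a - b) for a, b in combinations(scored, 2))
    return sum(diffs) / len(diffs) if diffs else 0.0

# hypothetical scores, just for illustration
scale = {"praise": 3.4, "agree": 3.0, "accuse": -2.2, "attack": -8.3}
print(scale_impurity([["praise", "agree"], ["accuse", "attack"]], scale))  # 3.25
```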
Comparison sets
– WordNet has synsets, including some verb clusters (see the NLTK peek after this list)
– WordNet is low-performing and generic
– WordNet is a better bar than beating random clusters
– this model should perform better because of topic specificity
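The WordNet comparison, as I understood it: verb synsets stand in as generic “clusters” to compare against the learned event types, with no IR-specific knowledge. A minimal peek via NLTK (assuming the wordnet corpus is downloaded):

```python
# Minimal look at WordNet verb synsets as a generic clustering baseline:
# each synset's lemmas form one "cluster" of verbs.
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet") first

def wordnet_verb_clusters(verbs):
    clusters = []
    for verb in verbs:
        for syn in wn.synsets(verb, pos=wn.VERB):
            clusters.append(sorted(set(syn.lemma_names())))
    return clusters

for cluster in wordnet_verb_clusters(["accuse", "attack"]):
    print(cluster)
```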
“Gold standard” methods: there is rarely a real gold standard; often the gold standards themselves are problematic
– in this case: the Militarized Interstate Dispute (MID) dataset (wow, lucky to have that, too!)
Looking into semi-supervision, to create a better model
speaker website:
http://brenocon.com
Q & A:
developing a user model
– user testing
– evaluation from users & not participants or collaborators
– terror & protest are more difficult linguistic problems
more complications to this project:
– Taiwan, Palestine, Hezbollah: diplomatic actors, but not countries per se