Every so often I’m reminded of the power of data science. Today I attended a talk entitled “Spatiotemporal Crime Prediction Using GPS & Time-tagged Tweets” by Matt Gerber of the UVA PTL. The talk was a UMD CLIP event (great events! Go if you can!).
Gerber began by introducing a few of the PTL projects, which include:
- Developing automatic detection methods for extremist recruitment in the Dark Net
- Turning medical knowledge from large bodies of unstructured texts into medical decision support models
- Many other cool initiatives
He then introduced the research at hand: developing predictive models for criminal activity. The control model in this case used police report data from a given period of time to place incidents on a map of Chicago by latitude and longitude. He then superimposed a grid on the map and collapsed incidents down into a binary presence vs. absence model: each square in the grid either had one or more crimes (1) or had none (-1). This was his training data. He built a binary classifier and then used logistic regression to compute probabilities and layered a kernel density estimator on top.
He used this control model as a point of comparison for a model built from unstructured text. The unstructured text consisted of GPS-tagged Twitter data (roughly 3% of tweets) from the Chicago area. He drew the same grid using latitude and longitude coordinates and tossed all of the tweets from each “neighborhood” (during the same one-month training window) into the boxes. Then, treating each box as essentially one document for a document-based classifier, he subjected each document to topic modeling (using LDA & MALLET). He focused on crime-related words and topics to build models to compare against the control model. He found that the predictive value of both models was similar when compared against actual crime reports from days within the subsequent month.
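To make the gridding step concrete, here is a minimal Python sketch of how I understand it. This is my own illustration, not Gerber's code: the bounding box, cell size, and function names (`cell_of`, `label_cells`, `tweets_to_documents`) are hypothetical, and the sample points are made up. It only covers the binning and labeling, not the classifier, the kernel density estimator, or the topic modeling.

```python
import math
from collections import defaultdict

# Hypothetical bounding box and cell size, chosen only for illustration.
LAT_MIN, LON_MIN = 41.6, -87.8
CELL_DEG = 0.1
N_ROWS, N_COLS = 3, 3


def cell_of(lat, lon):
    """Map a latitude/longitude pair to a (row, col) grid cell."""
    row = math.floor((lat - LAT_MIN) / CELL_DEG)
    col = math.floor((lon - LON_MIN) / CELL_DEG)
    return row, col


def label_cells(incidents):
    """Label every cell +1 if it holds at least one incident, else -1."""
    hit = {cell_of(lat, lon) for lat, lon in incidents}
    return {
        (r, c): (1 if (r, c) in hit else -1)
        for r in range(N_ROWS)
        for c in range(N_COLS)
    }


def tweets_to_documents(tweets):
    """Concatenate all tweet texts in a cell into one document per cell."""
    docs = defaultdict(list)
    for lat, lon, text in tweets:
        docs[cell_of(lat, lon)].append(text)
    return {cell: " ".join(texts) for cell, texts in docs.items()}


# Made-up sample points, not real crime reports or tweets.
incidents = [(41.65, -87.75), (41.66, -87.74), (41.85, -87.55)]
tweets = [
    (41.65, -87.75, "stuck in traffic again"),
    (41.67, -87.73, "great pizza tonight"),
    (41.85, -87.55, "heading to the game"),
]

labels = label_cells(incidents)          # training targets, one per cell
documents = tweets_to_documents(tweets)  # one text "document" per cell
```

In a setup like this, `labels` would serve as the targets for the binary classifier, and `documents` would be what gets handed to the topic model. Notice how much is already decided by the choice of cell size alone.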
This is a basic model. The layering can be further refined and better understood (there was some discussion about the word “turnup,” for example). Many more interesting layers can be built into it in order to improve its predictive power, including more geographic features, population densities, some temporal modeling to accommodate the periodic nature of some crimes (e.g. most robberies happen during the work week, while people are away from their homes), a better accommodation for different types of crime, and a host of potential demographic and other variables.
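As an aside on the temporal piece: the talk did not say how periodicity would be modeled, but one common approach is to encode the hour and the day of the week as points on a circle, so that 11pm sits next to midnight and Sunday sits next to Monday. A rough sketch, with a hypothetical function name and feature names of my own choosing:

```python
import math
from datetime import datetime


def cyclic_time_features(ts):
    """Encode hour-of-day and day-of-week as sine/cosine pairs."""
    hour_angle = 2 * math.pi * ts.hour / 24
    day_angle = 2 * math.pi * ts.weekday() / 7  # Monday is 0
    return {
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "dow_sin": math.sin(day_angle),
        "dow_cos": math.cos(day_angle),
        "is_weekday": 1 if ts.weekday() < 5 else 0,
    }


# Arbitrary example timestamps: a Monday morning and a Saturday night.
monday_morning = cyclic_time_features(datetime(2013, 1, 7, 6, 0))
saturday_night = cyclic_time_features(datetime(2013, 1, 12, 23, 0))
```

Features like these could sit alongside the spatial and text features for each grid cell, though whether they would actually help is an empirical question.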
I would love to dig into this data to better understand the conversation underlying the topic models. I imagine there is a wealth of information to be gained, along with a clearer sense of what kind of work the models are doing. It strikes me that each assumption and calculation has a heavy social load attached to it. Each variable and each layer that is built into the model and roots out correlations may be working to reinforce certain stereotypes and anoint them with the power of massive data. Some questions need to be asked. Who has access to the internet? What type of access? How are they using the internet? Are there substantive differences between tweets with and without geotagging? What varieties of language are the tweeters using? Do classifiers take into account language variation? Are the researchers simply building a big data model around the old “bad neighborhood” notions?
Data is powerful, and the predictive power of data is fascinating. Calculations like these raise questions in new ways, remixing old assumptions into new correlations. Let’s not forget to question new methods, put them into their wider sociocultural contexts and delve qualitatively into the data behind the analyses. Data science can be incredibly powerful and interesting, but it needs a qualitative and theoretical perspective to keep it rooted. I hope to see more, deeper interdisciplinary partnerships soon, working together to build powerful, grounded, and really interesting research!