Inspiration

Events that impact an entire society, like political and humanitarian crises, mass migrations, disease outbreaks, economic instability, resource shortages, large-scale protests and worker strikes, civil unrest, riots, political violence, state-sponsored ethnic violence, terrorism, economic sanctions, military action, wars and invasions, are often preceded by population-level changes in mass communication. In today's world this means coverage of and commentary on political events by traditional news media like newspapers and news magazines, news radio and television stations, and corporate news websites. It also means coverage and commentary by independent and individual-run news websites, aggregators, blogs, forums, journals, newsletters, and essay and article writing sites, as well as individual posts by political actors and their supporters and antagonists on social media sites like Facebook (Meta), Twitter et al. which achieve society-wide dissemination. One of the defining characteristics of the present generation of the WWW is the parity between news content created and curated by professional journalists, commentators and corporations, and user-generated news content, reporting and commentary disseminated mainly over social networking platforms.

Conflict in societies can be naturally modelled as graphs and networks. For example, consider the graph diagram below:

[Figure: cg]

The diagram captures a tiny part of the conflict that erupted in the U.S. in the summer of 2020 during the presidency of Donald J. Trump. Organizations and movements like BLM and Antifa staged violent protests over the police killing of George Floyd, in response to which President Trump criticized these organizations and ordered further mobilization of police and other security forces.

The larger vertices outlined in orange represent events, while the smaller vertices represent actors. Each event has spatio-temporal attributes and is coded using a standard classification like 1823 KILL BY PHYSICAL ASSAULT or 1453 ENGAGE IN VIOLENT PROTESTS TO DEMAND RIGHTS. Directed edges connect actors with events, and each event is connected to a dyad, or pair, of actors where one actor is the source of the event action and the other is the target. Using this model we observe:

  • Actors initiate and receive event actions and events connect to other events only through actor vertices.
  • If an event A is a possible cause of an event B, then A must happen before B, and a path must exist from B to A passing only through event vertices that also precede B.
  • A sequence or chain of events leading to a violent event may show increasing levels of intensity, e.g. 1453 ENGAGE IN VIOLENT PROTESTS TO DEMAND RIGHTS -> 153 MOBILIZE OR INCREASE POLICE POWER -> 1823 KILL BY PHYSICAL ASSAULT. The event coding reflects this increase numerically in the leading two digits of each code (14, 15, 18).
  • A typical news day globally would comprise a network of tens of thousands of event vertices and edges like these.
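The observations above can be sketched in a few lines of Python. This is only an illustrative sketch, not the osiris implementation: the `Event` fields, the integer `day` timestamp, the actor names and the decision to ignore edge direction when searching for a path are all assumptions made for the example, as is reading the first two digits of a code as its top-level category.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    code: str    # event classification code, e.g. "1823"
    day: int     # simplified timestamp (hypothetical field)
    source: str  # actor initiating the event action
    target: str  # actor receiving the event action


def possible_cause(events, a_id, b_id):
    """True if A precedes B and B reaches A via shared actors,
    visiting only events that also precede B (direction ignored)."""
    by_id = {e.event_id: e for e in events}
    a, b = by_id[a_id], by_id[b_id]
    if a.day >= b.day:
        return False
    # Events connect to other events only through actor vertices.
    by_actor = defaultdict(list)
    for e in events:
        by_actor[e.source].append(e)
        by_actor[e.target].append(e)
    seen = {b.event_id}
    queue = deque([b])
    while queue:
        current = queue.popleft()
        for actor in (current.source, current.target):
            for nxt in by_actor[actor]:
                if nxt.event_id in seen or nxt.day >= b.day:
                    continue
                if nxt.event_id == a_id:
                    return True
                seen.add(nxt.event_id)
                queue.append(nxt)
    return False


def root_code(code):
    """Top-level category, assumed to be the first two digits."""
    return int(code[:2])


def escalates(chain):
    """True if the root codes along a chain of events never decrease."""
    roots = [root_code(e.code) for e in chain]
    return all(x <= y for x, y in zip(roots, roots[1:]))


# Illustrative data only; actor assignments are made up for the example.
events = [
    Event("e1", "1823", 1, "US_LAW_ENFORCEMENT", "CIVILIAN"),
    Event("e2", "1453", 2, "ANTIFA", "US_LAW_ENFORCEMENT"),
    Event("e3", "153", 3, "US_PRESIDENT", "ANTIFA"),
    Event("e4", "1823", 4, "ANTIFA", "US_LAW_ENFORCEMENT"),
    Event("e5", "153", 0, "ACTOR_X", "ACTOR_Y"),
]
```

With this toy data, `possible_cause(events, "e2", "e4")` holds because the two events share an actor and e2 is earlier, while the isolated event e5 cannot be a possible cause of e4 even though it precedes it.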

This global event data may be harvested from the massive amount of news stories published and available online each day, where a story will mention a particular event on a particular date together with the actors involved. A schema in TigerGraph for this model of events and news would look like the following:

[Figure: gdelt_event_schema]

It seemed clear to everyone at the time that President Trump's actions inflamed the situation and likely led to more violence and retaliation against the police. The event connecting Antifa to U.S. Law Enforcement with dashed edges is a hypothetical event. An interesting question that a model like this, fed with sufficient data, might be able to answer is:

What is the probability of a fatal event linking protestors and rioters to police or vice-versa?

Humans have been trying to figure out what causes conflict in our societies for millennia, and this will only continue for millennia more. But one thing that has become clear is that we do not need a full theoretical treatment of a phenomenon in order to make predictions about it. Using vast amounts of data and computing power, machines can discover and learn parameters from existing correlations that allow them to predict sequences of events with surprising accuracy, in a way that appears to emulate intelligence and understanding.

But even with the evolutionary advances in our knowledge and technology and in how we communicate, disseminate and consume news, there is a fundamental approach to forecasting societal conflict and violence which has not changed since ancient times: observe historical parallels and monitor the words and actions of political actors and their supporters and antagonists.

[Figure: pol]

Can events like these be reliably forecast without a theoretical model of their causes?

The predominance of open-source software over commercial and closed-source software over the past decades has been accompanied by a massive increase in the availability of open-source indicators and intelligence, as well as by an increase in the availability of big data computing resources to academic and government researchers and to ordinary interested citizens. Open-source in this context refers to data that is publicly available to anyone without requiring special credentials, privileges or payments. Whereas in the past the kind of population-level data required to attempt technical forecasting of conflict was carefully guarded behind academic and government barriers, today much of this data is available to the public.

In 2011 the U.S. Intelligence Advanced Research Projects Activity (IARPA) began the OSI R&D program:

The OSI Program aims to develop methods for continuous, automated analysis of publicly available data in order to anticipate and/or detect significant societal events, such as political crises, humanitarian crises, mass violence, riots, mass migrations, disease outbreaks, economic instability, resource shortages, and responses to natural disasters. Performers will be evaluated on the basis of warnings that they deliver about real-world events. If successful, OSI methods will “beat the news” by fusing early indicators of events from multiple publicly available data sources and types.

[Figure: iarpa_osi]

It seems that the enormous amounts of open-source human-initiated conflict data could be most efficiently stored and analyzed by a scalable graph database.

What it does

The osiris project provides a graph-oriented data-processing environment that facilitates research into technical conflict event forecasting using massive open-source intelligence datasets like those of the GDELT project. It is a Python data-processing framework for analyzing how population-level changes in social and traditional news communication and coverage can predict societal events and conflicts, using open-source intelligence and graph analysis.

Challenges we ran into

Google BigQuery is really fast at querying hundreds of gigabytes of data, so I learned pretty quickly that you will blow through your free monthly query quota just by tweaking and testing ETL scripts on huge tables. You should always create small snapshot tables to debug your ETL scripts on.
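One way to build such a snapshot is sketched below. It is a minimal example, not the script used in osiris: the helper name `sample_table_sql` and both table names are hypothetical, and it assumes BigQuery's `TABLESAMPLE SYSTEM` clause is acceptable for your debugging, since it reads a random subset of the table's storage blocks instead of scanning the whole table.

```python
def sample_table_sql(source_table: str, snapshot_table: str,
                     percent: float = 1.0) -> str:
    """Build a CREATE TABLE ... AS SELECT statement that copies a small
    random sample of source_table into snapshot_table."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return (
        f"CREATE OR REPLACE TABLE `{snapshot_table}` AS\n"
        f"SELECT * FROM `{source_table}` "
        f"TABLESAMPLE SYSTEM ({percent:g} PERCENT)"
    )


# Hypothetical table names, for illustration only.
sql = sample_table_sql(
    "my-project.gdelt.events",
    "my-project.osiris_dev.events_sample",
    percent=0.5,
)
print(sql)
```

The generated statement would be run once (from the BigQuery console or a client library), after which the ETL scripts can be pointed at the small snapshot table while they are being debugged.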

The graphistry TigerGraph package does not support token authorization when calling the REST++ endpoints of installed queries. I made changes to support this in the osiris fork of this package.
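For context, a token-authorized call to an installed query might be built roughly as follows. This is a sketch under stated assumptions, not the code in the osiris fork: it assumes the REST++ server listens on port 9000, that installed queries are exposed under `/query/{graph}/{query_name}`, and that the token is sent as a bearer token in the `Authorization` header. The function name, host, graph name and query name are all hypothetical.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_query_request(host, graph, query_name, token,
                        params=None, port=9000):
    """Build (but do not send) a GET request for an installed query,
    carrying the token in the Authorization header."""
    url = f"{host}:{port}/query/{quote(graph)}/{quote(query_name)}"
    if params:
        url += "?" + urlencode(params)
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Hypothetical host, graph, query and token, for illustration only.
req = build_query_request(
    "https://example.i.tgcloud.io",
    "osiris",
    "recent_events",
    "abc123",
    params={"limit": 10},
)
print(req.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send the request; it is kept separate here so the URL and header can be inspected without a running TigerGraph instance.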

What's next for osiris

There are several sources of machine-coded conflict event data from other quantitative conflict forecasting programs and automatic event coding programs that have been open-sourced. I plan to add these sources to complement the GDELT data source:

  • UCDP The Uppsala Conflict Data Program is "the world’s main provider of data on organized violence and the oldest ongoing data collection project for civil war, with a history of almost 40 years." UCDP runs the Violence Early Warning System (ViEWS). A list of all UCDP datasets is here.
  • ICEWS The Integrated Conflict Early Warning System from DARPA is another conflict forecasting system. The most recent data update is April 2022. See here for a comparison with GDELT and here for an analysis of the ICEWS dataset.
  • Phoenix The Phoenix automatic coding pipeline for scraped news stories by the Open Event Data Alliance, a research group including Phil Schrodt "committed to facilitating the development, adoption and analysis of social and political event data." The OEDA maintains a list of conflict datasets here.