Final Tutorial

by Joaquim Malcampo, Drayton Hoffman and Liam Y Lehr

Introduction

This project is intended to showcase our understanding of the data science pipeline by applying it to real world data. We accomplish this by looking at data on terrorist activity in the United States provided by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland.

START website: https://www.start.umd.edu/gtd/

The START website provides a codebook that explains how the data is collected and maintained as well as how to interpret each data element for your own use.

NBC News article: https://www.nbcnews.com/news/us-news/white-nationalism-fueled-violence-rise-fbi-slow-call-it-domestic-n1039206

The above article outlines an issue with that the FBI has recently been misclassifying attacks so that they are not treated as terrorist activity. This affects the investigation process in such a way that protects white supremacists from legal repurcusions. Our goal is to create a model to take terrorist attack data for which the perpetrator is unknown to determine the likelihood it is related to a white supremacist group.

Data Curation, Parsing, and Management

We decided to look at data provided by the University of Maryland's own National Consortium for the Study of Terrorism and Responses to Terrorism (START). The global terrorism dataset we are using has in depth data for terrorist attacks between 1970 and 2018.

First, we need to import the python libraries we intend to use.

In [48]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
!pip install folium
import folium

import statsmodels.api as sm
import statsmodels.formula.api as smf

print("Success!")
Requirement already satisfied: folium in /opt/conda/lib/python3.8/site-packages (0.11.0)
Requirement already satisfied: numpy in /opt/conda/lib/python3.8/site-packages (from folium) (1.19.1)
Requirement already satisfied: jinja2>=2.9 in /opt/conda/lib/python3.8/site-packages (from folium) (2.11.2)
Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from folium) (2.24.0)
Requirement already satisfied: branca>=0.3.0 in /opt/conda/lib/python3.8/site-packages (from folium) (0.4.1)
Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.8/site-packages (from jinja2>=2.9->folium) (1.1.1)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->folium) (1.25.10)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->folium) (2020.6.20)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->folium) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.8/site-packages (from requests->folium) (3.0.4)
Success!

Then, we read in the dataset and proceed to tidy it.

In [49]:
try:
  if data is None:
    data = pd.DataFrame()
except:
  data = pd.DataFrame()

# database description (for columns) page 13 of codebook
if data.empty:
  data = pd.read_excel('globalterrorismdb_0919dist.xlsx')

We process the data so that we are only looking at terrorist atttacks that have taken place in the United States. Then, we try to find the terrorist groups that have been active since 2015 (inclusive). With that information, we combine the two data elements to find the historical data for only groups that have been active in the United States since 2015.

We then separate that data into three tables by the perpetrator group responsible: Known White Supremacists, Known Non-White Supremacists, and Unknown Terrorist group.

https://gazette.com/news/white-supremacist-graffiti-and-propaganda-found-in-colorado-springs-and-denver-suburb/article_3f4bd0ae-389e-11eb-9613-bb31df3c7452.html

This article gives a little background as to which key words we were looking for in group names to classify them as White Supremacists. The buzzwords essentially came down to: 'White supremacists/nationalists', 'Ku Klux Klan', and 'Neo-Nazi extremists'.

In [50]:
# get all attacks in the United States
usa = data[data.country_txt == 'United States']
# make list of all groups that have attacked the USA since 2015 (inclusive)
activeGroups = list(usa[usa.iyear >= 2015].gname.unique())

# get info for all the attacks from active groups
usaActive = usa[usa.gname.isin(activeGroups)]
usaActive = usaActive[['iyear', 'region_txt', 'latitude', 'longitude', 'attacktype1', 'attacktype1_txt', 'targtype1', 'targtype1_txt', 'gname', 'nkill']]
# remove NAs from latitude and longitude columns (for mapping)
usaActive = usaActive[usaActive.latitude.notna()][usaActive.longitude.notna()]
# rename the columns to be more meaningful
usaActive.rename(columns={'iyear': 'year', 'region_txt': 'region', 
                        'attacktype1': 'attackTypeCode', 'attacktype1_txt': 'attackTypeText',
                        'targtype1': 'targetTypeCode', 'targtype1_txt': 'targetTypeText',
                        'gname': 'groupName', 'nkill': 'fatalityCount'}, inplace = True)

# separate attacks where the group is unknown from groups that have been identified
usaActiveUnknown = usaActive[usaActive.groupName == 'Unknown']
usaActive = usaActive[usaActive.groupName != 'Unknown']

# separate White Supremacist groups from Non-White Supremacist groups
uaWS = usaActive[(usaActive.groupName == 'White supremacists/nationalists') | (usaActive.groupName == 'Ku Klux Klan')| (usaActive.groupName == 'Neo-Nazi extremists')]
uaNWS = usaActive[(usaActive.groupName != 'White supremacists/nationalists') | (usaActive.groupName != 'Ku Klux Klan')| (usaActive.groupName != 'Neo-Nazi extremists')]

Exploratory Data Analysis

With the separate data tables for Known White Supremacist, Known Non-White Supremacist, and Unknown Terrorist groups, we decided to first look at the number of terrorist attacks per year in each grouping.

To do this, we first get the counts for each year and put them into new data frames so that we can clean the data separately and represent them in their own indvidual plots. We decided to plot this data with a line graph to compare the trends as to when there are influxes of attacks.

In [51]:
# white supremacists
g1 = pd.DataFrame(uaWS.year.value_counts()) # make counts for attacks by year into dataframe
g1['count'] = g1['year'] # change name to counts because value_counts() makes year index
g1['year'] = g1.index # reset year to be the index values due to value_counts() behavior
g1=g1.sort_values(by=['year']) # sort the values by year
g1=g1.reset_index(drop=True) # reset the indeces so that they start from 0 and are contiguous

plt.figure(figsize=(10, 5))
plt.plot(g1['year'], g1['count'], c='blue')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Number of White Supremacy Related Terrorist Attacks by Year')
plt.show()
In [52]:
# non white supremacists
g2 = pd.DataFrame(uaNWS.year.value_counts()) # make counts for attacks by year into dataframe
g2['count'] = g2['year'] # change name to counts because value_counts() makes year index
g2['year'] = g2.index # reset year to be the index values due to value_counts() behavior
g2=g2.sort_values(by=['year']) # sort the values by year
g2=g2.reset_index(drop=True) # reset the indeces so that they start from 0 and are contiguous

plt.figure(figsize=(10, 5))
plt.plot(g2['year'], g2['count'], c='red')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Number of Non-White Supremacy Related Terrorist Attacks by Year')
plt.show()
In [53]:
# unknown groups
g3 = pd.DataFrame(usaActiveUnknown.year.value_counts()) # make counts for attacks by year into dataframe
g3['count'] = g3['year'] # change name to counts because value_counts() makes year index
g3['year'] = g3.index # reset year to be the index values due to value_counts() behavior
g3=g3.sort_values(by=['year']) # sort the values by year
g3=g3.reset_index(drop=True) # reset the indeces so that they start from 0 and are contiguous

plt.figure(figsize=(10, 5))
plt.plot(g3['year'], g3['count'], c='green')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Number of Terrorist Attacks by Unknown Groups by Year')
plt.show()

From the above plots, it appears as though the White Supremacy related attacks graph spikes dramatically at similar locations as does the graph for attacks by groups that are unknown. With that said, the non-white supremacist group graph does not appear to have the same similarities. For this reason, we reasonably suspect that many of the attacks for which the perpetrator is unknown, could still be related to white supremacist groups.

Next, we similarly summarize the numbers of different types of attacks for each of the groupings. We express these values on separate bar charts to best show the relative attack tendencies within each grouping.

In [54]:
g1 = pd.DataFrame(uaWS.attackTypeText.value_counts()) # make counts for attacks into dataframe
g1['count'] = g1['attackTypeText'] # change name of column to counts
g1['attackTypeText'] = g1.index # reset attackTypeText to be the index values due to value_counts() behavior
g1=g1.sort_values(by=['attackTypeText']) # sort the values by attack type
g1=g1.reset_index(drop=True) # reset the indeces so that they start from 0 and are contiguous

fig = plt.figure(figsize=(10, 5))
ax = fig.add_axes([0,0,1,1])
lab = g1['attackTypeText']
dat = g1['count']
f1 = ax.bar(lab,dat, color=['blue'])
plt.xticks(rotation=45)

# label the plot and display it
plt.xlabel('Type of Attack')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Overview of White Supremacy Related Terrorist Attacks by Type of Attack')
plt.show()
In [55]:
g2 = pd.DataFrame(uaNWS.attackTypeText.value_counts()) # make counts for attacks into dataframe
g2['count'] = g2['attackTypeText'] # change name of column to counts
g2['attackTypeText'] = g2.index # reset attackTypeText to be the index values due to value_counts() behavior
g2=g2.sort_values(by=['attackTypeText']) # sort the values by attack type
g2=g2.reset_index(drop=True) # reset the indeces so that they start from 0 and are contiguous

fig = plt.figure(figsize=(10, 5))
ax = fig.add_axes([0,0,1,1])
lab = g2['attackTypeText']
dat = g2['count']
f1 = ax.bar(lab,dat, color=['red'])
plt.xticks(rotation=45)

# label the plot and display it
plt.xlabel('Type of Attack')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Overview of Non-White Supremacy Related Terrorist Attacks by Type of Attack')
plt.show()
In [56]:
g3 = pd.DataFrame(usaActiveUnknown.attackTypeText.value_counts()) # make counts for attacks into dataframe
g3['count'] = g3['attackTypeText'] # change name of column to counts
g3['attackTypeText'] = g3.index # reset attackTypeText to be the index values due to value_counts() behavior
g3=g3.sort_values(by=['attackTypeText']) # sort the values by attack type
g3=g3.reset_index(drop=True) # reset the indeces so that they start from 0 and are contiguous
g3

fig = plt.figure(figsize=(10, 5))
ax = fig.add_axes([0,0,1,1])
lab = g3['attackTypeText']
dat = g3['count']
f3 = ax.bar(lab,dat, color=['green'])
plt.xticks(rotation=45)

# label the plot and display it
plt.xlabel('Type of Attack')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Overview of Terrorist Attacks for which the Group is Unknown by Type of Attack')
plt.show()

It appears that the most common attack types over the three groupings are: bombings/explosions, facility/infrastructure attacks, armed assaults and assassinations. With that said, the different groupings show different tendencies in their arrangment of these main attack types. The white supremacist groups are mostly recorded for armed assaults with bombings/explosions and facility/infrastructure attacks having similar counts after that. Non-white supremacists groups tend to mostly be recorded for facility/infrastructure attacks. The unknown portion of the data mostly records bombings and explosions with facility/infrastructure attacks being next up at about half the number of attacks. From this, the clear differences in trends between white supremacist attack types to non-white supremacist attack types indicates that this is a strong feature to differentiate the two.

We next look at the target type counts in comparison between the three groupings. We place these on the same bar chart to look at the trends between the three groupings as to what targets they hit most frequently.

In [57]:
g1 = pd.DataFrame(uaWS.targetTypeText.value_counts()) # make counts for attacks by targetType into dataframe
g1['WScount'] = g1['targetTypeText'] # change name of column to counts
g1['targetTypeText'] = g1.index # reset targetTypeText to be the index values due to value_counts() behavior
g1=g1.sort_values(by=['WScount'], ascending=False) # sort the values by target type
g1=g1.reset_index(drop=True) # reset the indeces so that they start from 0 and are contiguous

g2 = pd.DataFrame(uaNWS.targetTypeText.value_counts()) # make counts for attacks by targetType into dataframe
g2['NWScount'] = g2['targetTypeText'] # change name of column to counts
g2['targetTypeText'] = g2.index # reset targetTypeText to be the index values due to value_counts() behavior
g2=g2.sort_values(by=['NWScount'], ascending=False) # sort the values by target type
g2=g2.reset_index(drop=True) # reset the indeces so that they start from 0 and are contiguous

g3 = pd.DataFrame(usaActiveUnknown.targetTypeText.value_counts()) # make counts for attacks by targetType into dataframe
g3['Ucount'] = g3['targetTypeText'] # change name of column to counts
g3['targetTypeText'] = g3.index # reset targetTypeText to be the index values due to value_counts() behavior
g3=g3.sort_values(by=['Ucount'], ascending=False) # sort the values by target type
g3=g3.reset_index(drop=True) # reset the indeces so that they start from 0 and are contiguous

g = g1.set_index('targetTypeText').join(g2.set_index('targetTypeText')).join(g3.set_index('targetTypeText')) # join the two count tables
g = g.head(5) # truncate the tables to only display the top 5 target types
g

ax = g.plot.bar(color=["blue", "red", "green"], figsize=(10,10))
plt.xticks(rotation=45)
plt.xlabel('Type of Target')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Overview of Terrorist Attacks by Target Type (for White Supremacist top 5 Target Types)')
plt.legend(['White Supremacist Group Counts', 'Non-White Supremacist Group Counts', 'Unknown Group Counts']);
plt.show()

From the chart, it is visible that white supremacist groups and non-white supremacist groups follow a similar trend on the top five targets for white-supremacist groups. This is except for the fact that for non-white supremacist groups, there is a huge number of attacks on businesses that deviates from the previously mentioned trend.

Here we look at the number of fatalities in attacks in the different groupings by year.

In [58]:
plt.figure(figsize=(10, 5))
plt.scatter(uaWS['year'], uaWS['fatalityCount'], c='blue')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Fatality Count')
plt.title('Scatter Plot of Fatality Counts from White Supremacy Related Terrorist Attacks distributed by Year')
plt.show()

plt.figure(figsize=(10, 5))
plt.scatter(uaNWS['year'], uaNWS['fatalityCount'], c='red')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Fatality Count')
plt.title('Scatter Plot of Fatality Counts from Non-White Supremacy Related Terrorist Attacks distributed by Year')
plt.show()

plt.figure(figsize=(10, 5))
plt.scatter(usaActiveUnknown['year'], usaActiveUnknown['fatalityCount'], c='green')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Fatality Count')
plt.title('Scatter Plot of Fatality Counts from Terrorist Attacks for which the Group is Unknown distributed by Year')
plt.show()

This is not very conclusive besides the fact that white supremacist groups have been more deadly in recent years.

To continue exploring, we create a map of the locations of terrorist attacks from the three data groupings. Blue indicates White Supremacist groups; Red, Non-White Supremacist groups; Green, unknown terrorist groups.
We use the folium package to loop through all of the rows in the three tables, plotting the points in different colors using the latittude and longitude fields.

In [59]:
map = folium.Map(location=[42, -95], zoom_start=4)
# loop through the table rows
for _,row in usaActiveUnknown.iterrows():
    folium.Circle([row.loc['latitude'], row.loc['longitude']], 
                      radius=20, popup = row.groupName,
                      color='green').add_to(map)

for _,row in uaNWS.iterrows():
    folium.Circle([row.loc['latitude'], row.loc['longitude']], 
                      radius=20, popup = row.groupName,
                      color='red').add_to(map)

for _,row in uaWS.iterrows():
    folium.Circle([row.loc['latitude'], row.loc['longitude']], 
                      radius=20, popup = row.groupName,
                      color='blue').add_to(map)

map
Out[59]:
Make this Notebook Trusted to load map: File -> Trust Notebook

The map did not enlighten us to much in terms of the creation of our model but it did emphasize the issue at hand. This is because, the locations for unknown attacks tend to the same exact locations for white supremacist attacks. This resemblance could be indicative that much of the entries that recorded unknown perpetrators could likely be white supremacists.

Machine Learning to Provide Analysis

Now after investigating possible features, we have concluded that we will use a multiple linear regression model that takes into account the year, attack type and target type to project whether or not a terrorist attack was carrried out by a White Supremacist group.

To do this, we create a dataframe of dummy variables for the attack type and target type. A dummy variable represents a categorical variable as a binary value so that it can be accounted for in a model.

In [60]:
# make dummy column for usa known active attacks to indicate 1 for white supremacist and 0 for non
nrow = list()
for g in usaActive['groupName']:
  if (g == 'White supremacists/nationalists') or (g == 'Ku Klux Klan') or (g == 'Neo-Nazi extremists'):
    nrow.append(1)
  else:
    nrow.append(0)
# add the column to the usaActive table
usaActive["WS"] = nrow

# make dataframes of dummy variables for attack type and target type
attackTypeDummy = pd.get_dummies(usaActive.attackTypeText)
targetTypeDummy = pd.get_dummies(usaActive.targetTypeText)

# add the dummy variables to one table with the year and "White Supramcist" dummy field
usaActiveDummy = pd.concat([usaActive,attackTypeDummy], axis= 1)
usaActiveDummy = pd.concat([usaActive,targetTypeDummy], axis= 1)

# remove superfluous columns
usaActiveDummy = usaActiveDummy.drop(['region', 'latitude', 'longitude', 'attackTypeCode', 'attackTypeText', 'targetTypeCode', 'targetTypeText', 'groupName', 'fatalityCount'], axis=1)
usaActiveDummy
Out[60]:
year WS Abortion Related Airports & Aircraft Business Educational Institution Food or Water Supply Government (Diplomatic) Government (General) Journalists & Media ... Police Private Citizens & Property Religious Figures/Institutions Telecommunication Terrorists/Non-State Militia Tourists Transportation Unknown Utilities Violent Political Party
24 1970 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
29 1970 1 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
35 1970 1 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
48 1970 1 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
49 1970 1 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
190108 2018 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
190109 2018 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
190206 2018 1 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
190291 2018 1 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
190628 2018 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0

665 rows × 22 columns

Now we feed the result vector and the dummy variables into the model for which we get an R squared scored of ~0.47

In [61]:
# create ols model of White Supremacy data taking into account the dummy table data
mod = sm.OLS(usaActiveDummy['WS'], usaActiveDummy.drop(['WS'], axis=1))
# fit model, save coefficients and print the parameters
res = mod.fit() 
coef = res.params 
# print(res.params) 
res.summary()
Out[61]:
OLS Regression Results
Dep. Variable: WS R-squared: 0.478
Model: OLS Adj. R-squared: 0.462
Method: Least Squares F-statistic: 29.47
Date: Mon, 21 Dec 2020 Prob (F-statistic): 2.16e-77
Time: 20:03:01 Log-Likelihood: -137.69
No. Observations: 665 AIC: 317.4
Df Residuals: 644 BIC: 411.9
Df Model: 20
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
year -0.0135 0.001 -15.160 0.000 -0.015 -0.012
Abortion Related 26.8279 1.769 15.161 0.000 23.353 30.303
Airports & Aircraft 26.9378 1.785 15.087 0.000 23.432 30.444
Business 27.0740 1.778 15.225 0.000 23.582 30.566
Educational Institution 27.2935 1.771 15.413 0.000 23.816 30.771
Food or Water Supply 27.5161 1.775 15.502 0.000 24.030 31.002
Government (Diplomatic) 27.1083 1.814 14.948 0.000 23.547 30.669
Government (General) 27.1261 1.782 15.226 0.000 23.628 30.624
Journalists & Media 27.1227 1.779 15.246 0.000 23.629 30.616
Military 27.1538 1.782 15.235 0.000 23.654 30.654
NGO 27.4187 1.771 15.485 0.000 23.942 30.896
Police 27.1746 1.791 15.176 0.000 23.658 30.691
Private Citizens & Property 27.4223 1.775 15.446 0.000 23.936 30.908
Religious Figures/Institutions 27.3556 1.781 15.356 0.000 23.858 30.854
Telecommunication 27.5161 1.762 15.615 0.000 24.056 30.976
Terrorists/Non-State Militia 26.7853 1.793 14.943 0.000 23.265 30.305
Tourists 27.1487 1.816 14.948 0.000 23.582 30.715
Transportation 27.3064 1.780 15.341 0.000 23.811 30.802
Unknown 27.5612 1.798 15.330 0.000 24.031 31.091
Utilities 27.0006 1.790 15.088 0.000 23.486 30.515
Violent Political Party 27.5497 1.764 15.615 0.000 24.085 31.014
Omnibus: 104.960 Durbin-Watson: 1.300
Prob(Omnibus): 0.000 Jarque-Bera (JB): 173.359
Skew: 0.991 Prob(JB): 2.27e-38
Kurtosis: 4.527 Cond. No. 1.35e+06


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.35e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

"The correct R2 value depends on your study area. Different research questions have different amounts of variability that are inherently unexplainable. Case in point, humans are hard to predict. Any study that attempts to predict human behavior will tend to have R-squared values less than 50%. However, if you analyze a physical process and have very good measurements, you might expect R-squared values over 90%. There is no one-size fits all best answer for how high R-squared should be."

https://statisticsbyjim.com/regression/how-high-r-squared

Given that terrorist behavior is human behavior, according to this blog post, a R squared score of ~0.47 is acceptable since we are modeling human behavipr which is highly variable by nature.

Curation of a Message or Messages Covering Insights Learned During the Tutorial

With this tutorial, we wanted to be able to classify the Unknown perpetrators in the data set to either be a white supremacist group or not. We used a reputable data source (data taken by the University of Maryland) so we are confident that the data we use for analysis and processing have merit to be able to make some reputable conclusions from it. What resulted from it is a model with an R squared of almost 50%.

Ultimately, the question we were trying to answer is a question of trying to predict human behavior. Terrorist groups ultimately rely on irrational thought with little regards to empathy and life. Having an R squared that large is honestly impressive. Of course, given that, it is still safer to be skeptical with the results of this model. In the real world, using this model to predict the unknown terrorist groups can be dangerous if the results leads to tangible policy changes. We can update our model with more features to potentially get a better model but this can still lead to unpredictable results.

We hope you enjoyed this tutorial and liked this example of the data science pipeline!

In [ ]: