This project is intended to showcase our understanding of the data science pipeline by applying it to real-world data. We do this by looking at data on terrorist activity in the United States provided by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland.
START website: https://www.start.umd.edu/gtd/
The START website provides a codebook that explains how the data is collected and maintained as well as how to interpret each data element for your own use.
NBC News article: https://www.nbcnews.com/news/us-news/white-nationalism-fueled-violence-rise-fbi-slow-call-it-domestic-n1039206
The article above describes how the FBI has recently been misclassifying attacks so that they are not treated as terrorist activity, which shapes the investigation process in a way that shields white supremacists from legal repercussions. Our goal is to build a model that takes data on terrorist attacks with unknown perpetrators and estimates the likelihood that each attack is related to a white supremacist group.
We look at data provided by the University of Maryland's National Consortium for the Study of Terrorism and Responses to Terrorism (START). The Global Terrorism Database we are using contains in-depth data on terrorist attacks between 1970 and 2018.
First, we import the Python libraries we intend to use.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
!pip install folium
import folium
import statsmodels.api as sm
import statsmodels.formula.api as smf
print("Success!")
Then, we read in the dataset and proceed to tidy it.
# only (re)load the large spreadsheet if it is not already in memory
try:
    if data is None:
        data = pd.DataFrame()
except NameError:
    data = pd.DataFrame()
# database description (for columns): see page 13 of the codebook
if data.empty:
    data = pd.read_excel('globalterrorismdb_0919dist.xlsx')
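As a side note, the full GTD spreadsheet is large, so reading it can take a while. A minimal shortcut (a sketch, assuming you only need the columns this tutorial uses; the names come from the codebook, and gtdCols is just an illustrative name) is to restrict the read to those columns:
# optional: read only the columns this tutorial uses (names from the GTD codebook);
# gtdCols is an illustrative name for this hypothetical shortcut
gtdCols = ['iyear', 'country_txt', 'region_txt', 'latitude', 'longitude',
           'attacktype1', 'attacktype1_txt', 'targtype1', 'targtype1_txt',
           'gname', 'nkill']
# data = pd.read_excel('globalterrorismdb_0919dist.xlsx', usecols=gtdCols)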
We filter the data so that we are only looking at terrorist attacks that have taken place in the United States. We then find the groups that have been active since 2015 (inclusive) and combine the two pieces of information to get the historical data for only the groups that have attacked the United States since 2015.
We then separate that data into three tables by the perpetrator group responsible: known white supremacists, known non-white supremacists, and unknown terrorist groups.
The article linked above gives some background on which keywords we looked for in group names to classify them as white supremacist. The keywords came down to: 'White supremacists/nationalists', 'Ku Klux Klan', and 'Neo-Nazi extremists'.
# get all attacks in the United States
usa = data[data.country_txt == 'United States']
# make list of all groups that have attacked the USA since 2015 (inclusive)
activeGroups = list(usa[usa.iyear >= 2015].gname.unique())
# get info for all the attacks from active groups
usaActive = usa[usa.gname.isin(activeGroups)]
usaActive = usaActive[['iyear', 'region_txt', 'latitude', 'longitude', 'attacktype1', 'attacktype1_txt', 'targtype1', 'targtype1_txt', 'gname', 'nkill']]
# remove NAs from latitude and longitude columns (for mapping)
usaActive = usaActive[usaActive.latitude.notna() & usaActive.longitude.notna()]
# rename the columns to be more meaningful
usaActive.rename(columns={'iyear': 'year', 'region_txt': 'region',
                          'attacktype1': 'attackTypeCode', 'attacktype1_txt': 'attackTypeText',
                          'targtype1': 'targetTypeCode', 'targtype1_txt': 'targetTypeText',
                          'gname': 'groupName', 'nkill': 'fatalityCount'}, inplace=True)
# separate attacks where the group is unknown from groups that have been identified
usaActiveUnknown = usaActive[usaActive.groupName == 'Unknown']
usaActive = usaActive[usaActive.groupName != 'Unknown']
# separate White Supremacist groups from Non-White Supremacist groups
wsNames = ['White supremacists/nationalists', 'Ku Klux Klan', 'Neo-Nazi extremists']
uaWS = usaActive[usaActive.groupName.isin(wsNames)]
uaNWS = usaActive[~usaActive.groupName.isin(wsNames)]
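As a quick sanity check (an extra step, not required for the analysis), we can print how many attacks fall into each of the three groupings:
# quick sanity check: how many attacks ended up in each grouping
print('White supremacist attacks:', len(uaWS))
print('Non-white supremacist attacks:', len(uaNWS))
print('Unknown-perpetrator attacks:', len(usaActiveUnknown))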
With the separate data tables for known white supremacist, known non-white supremacist, and unknown terrorist groups, we first look at the number of terrorist attacks per year in each grouping.
To do this, we count the attacks in each year and put the counts into new data frames so that we can clean each grouping separately and give each its own plot. We plot the counts as line graphs so we can compare when each grouping sees spikes in attacks.
# white supremacists
# count white supremacist attacks per year and sort chronologically
g1 = uaWS.year.value_counts().rename_axis('year').reset_index(name='count')
g1 = g1.sort_values(by='year').reset_index(drop=True)
plt.figure(figsize=(10, 5))
plt.plot(g1['year'], g1['count'], c='blue')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Number of White Supremacy Related Terrorist Attacks by Year')
plt.show()
# non white supremacists
# count non-white supremacist attacks per year and sort chronologically
g2 = uaNWS.year.value_counts().rename_axis('year').reset_index(name='count')
g2 = g2.sort_values(by='year').reset_index(drop=True)
plt.figure(figsize=(10, 5))
plt.plot(g2['year'], g2['count'], c='red')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Number of Non-White Supremacy Related Terrorist Attacks by Year')
plt.show()
# unknown groups
# count unknown-perpetrator attacks per year and sort chronologically
g3 = usaActiveUnknown.year.value_counts().rename_axis('year').reset_index(name='count')
g3 = g3.sort_values(by='year').reset_index(drop=True)
plt.figure(figsize=(10, 5))
plt.plot(g3['year'], g3['count'], c='green')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Number of Terrorist Attacks by Unknown Groups by Year')
plt.show()
From the plots above, the white supremacy related attacks spike dramatically in roughly the same years as the attacks by unknown groups, while the non-white supremacist series does not show the same pattern. For this reason, we suspect that many of the attacks with an unknown perpetrator could still be related to white supremacist groups.
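To put a rough number on that visual impression (a supplementary check using the yearly count frames g1 and g3 built above, limited to years that appear in both), we can correlate the yearly counts of white supremacist and unknown-perpetrator attacks:
# rough check: correlation of yearly counts for white supremacist (g1) and
# unknown-perpetrator (g3) attacks, over the years present in both frames
overlap = g1.merge(g3, on='year', suffixes=('_ws', '_unknown'))
print(overlap[['count_ws', 'count_unknown']].corr())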
Next, we similarly summarize the counts of each attack type for the three groupings. We show these on separate bar charts to highlight the relative attack tendencies within each grouping.
# count white supremacist attacks by attack type
g1 = uaWS.attackTypeText.value_counts().rename_axis('attackTypeText').reset_index(name='count')
g1 = g1.sort_values(by='attackTypeText').reset_index(drop=True)
fig = plt.figure(figsize=(10, 5))
ax = fig.add_axes([0,0,1,1])
lab = g1['attackTypeText']
dat = g1['count']
f1 = ax.bar(lab,dat, color=['blue'])
plt.xticks(rotation=45)
# label the plot and display it
plt.xlabel('Type of Attack')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Overview of White Supremacy Related Terrorist Attacks by Type of Attack')
plt.show()
# count non-white supremacist attacks by attack type
g2 = uaNWS.attackTypeText.value_counts().rename_axis('attackTypeText').reset_index(name='count')
g2 = g2.sort_values(by='attackTypeText').reset_index(drop=True)
fig = plt.figure(figsize=(10, 5))
ax = fig.add_axes([0,0,1,1])
lab = g2['attackTypeText']
dat = g2['count']
f1 = ax.bar(lab,dat, color=['red'])
plt.xticks(rotation=45)
# label the plot and display it
plt.xlabel('Type of Attack')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Overview of Non-White Supremacy Related Terrorist Attacks by Type of Attack')
plt.show()
# count unknown-perpetrator attacks by attack type
g3 = usaActiveUnknown.attackTypeText.value_counts().rename_axis('attackTypeText').reset_index(name='count')
g3 = g3.sort_values(by='attackTypeText').reset_index(drop=True)
g3
fig = plt.figure(figsize=(10, 5))
ax = fig.add_axes([0,0,1,1])
lab = g3['attackTypeText']
dat = g3['count']
f3 = ax.bar(lab,dat, color=['green'])
plt.xticks(rotation=45)
# label the plot and display it
plt.xlabel('Type of Attack')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Overview of Terrorist Attacks for which the Group is Unknown by Type of Attack')
plt.show()
The most common attack types across the three groupings are bombings/explosions, facility/infrastructure attacks, armed assaults, and assassinations, but the groupings arrange these types differently. White supremacist groups are mostly recorded for armed assaults, with bombings/explosions and facility/infrastructure attacks at similar counts behind that. Non-white supremacist groups are mostly recorded for facility/infrastructure attacks. The unknown grouping mostly records bombings/explosions, with facility/infrastructure attacks next at about half as many. The clear difference between the white supremacist and non-white supremacist attack-type profiles suggests that attack type is a strong feature for differentiating the two.
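Because the three groupings differ greatly in size, it can also help to compare attack types as proportions rather than raw counts (a supplementary check, not part of the charts above):
# share of each attack type within each grouping (proportions rather than counts)
attackShares = pd.concat([
    uaWS.attackTypeText.value_counts(normalize=True).rename('WS share'),
    uaNWS.attackTypeText.value_counts(normalize=True).rename('NWS share'),
    usaActiveUnknown.attackTypeText.value_counts(normalize=True).rename('Unknown share')
], axis=1).fillna(0).round(3)
attackShares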
Next, we compare the target type counts across the three groupings. We place them on a single bar chart to see which targets each grouping hits most frequently.
# count white supremacist attacks by target type, most frequent first
g1 = uaWS.targetTypeText.value_counts().rename_axis('targetTypeText').reset_index(name='WScount')
g1 = g1.sort_values(by='WScount', ascending=False).reset_index(drop=True)
# count non-white supremacist attacks by target type, most frequent first
g2 = uaNWS.targetTypeText.value_counts().rename_axis('targetTypeText').reset_index(name='NWScount')
g2 = g2.sort_values(by='NWScount', ascending=False).reset_index(drop=True)
# count unknown-perpetrator attacks by target type, most frequent first
g3 = usaActiveUnknown.targetTypeText.value_counts().rename_axis('targetTypeText').reset_index(name='Ucount')
g3 = g3.sort_values(by='Ucount', ascending=False).reset_index(drop=True)
g = g1.set_index('targetTypeText').join(g2.set_index('targetTypeText')).join(g3.set_index('targetTypeText')) # join the three count tables on target type
g = g.head(5) # truncate the tables to only display the top 5 target types
g
ax = g.plot.bar(color=["blue", "red", "green"], figsize=(10,10))
plt.xticks(rotation=45)
plt.xlabel('Type of Target')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Overview of Terrorist Attacks by Target Type (for White Supremacist top 5 Target Types)')
plt.legend(['White Supremacist Group Counts', 'Non-White Supremacist Group Counts', 'Unknown Group Counts']);
plt.show()
From the chart, white supremacist and non-white supremacist groups follow a similar trend across the top five white supremacist target types, with one exception: non-white supremacist groups show a far larger number of attacks on businesses.
Here we look at the number of fatalities in attacks in the different groupings by year.
plt.figure(figsize=(10, 5))
plt.scatter(uaWS['year'], uaWS['fatalityCount'], c='blue')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Fatality Count')
plt.title('Scatter Plot of Fatality Counts from White Supremacy Related Terrorist Attacks distributed by Year')
plt.show()
plt.figure(figsize=(10, 5))
plt.scatter(uaNWS['year'], uaNWS['fatalityCount'], c='red')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Fatality Count')
plt.title('Scatter Plot of Fatality Counts from Non-White Supremacy Related Terrorist Attacks distributed by Year')
plt.show()
plt.figure(figsize=(10, 5))
plt.scatter(usaActiveUnknown['year'], usaActiveUnknown['fatalityCount'], c='green')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Fatality Count')
plt.title('Scatter Plot of Fatality Counts from Terrorist Attacks for which the Group is Unknown distributed by Year')
plt.show()
These plots are not very conclusive, beyond showing that white supremacist attacks have become deadlier in recent years.
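To back up that observation (a quick supplementary check), we can total the fatalities per year for the two known groupings and compare the most recent years:
# total fatalities per year for the most recent years in each known grouping
print(uaWS.groupby('year')['fatalityCount'].sum().tail(5))
print(uaNWS.groupby('year')['fatalityCount'].sum().tail(5))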
To continue exploring, we create a map of the locations of terrorist attacks from the three data groupings. Blue indicates White Supremacist groups; Red, Non-White Supremacist groups; Green, unknown terrorist groups.
We use the folium package to loop through all of the rows in the three tables, plotting each attack in its grouping's color using the latitude and longitude fields.
attackMap = folium.Map(location=[42, -95], zoom_start=4)
# loop through the rows of each table and draw one circle per attack
for _, row in usaActiveUnknown.iterrows():
    folium.Circle([row.loc['latitude'], row.loc['longitude']],
                  radius=20, popup=row.groupName,
                  color='green').add_to(attackMap)
for _, row in uaNWS.iterrows():
    folium.Circle([row.loc['latitude'], row.loc['longitude']],
                  radius=20, popup=row.groupName,
                  color='red').add_to(attackMap)
for _, row in uaWS.iterrows():
    folium.Circle([row.loc['latitude'], row.loc['longitude']],
                  radius=20, popup=row.groupName,
                  color='blue').add_to(attackMap)
attackMap
The map did not reveal much for building our model, but it did emphasize the issue at hand: the attacks with unknown perpetrators cluster in many of the same locations as the white supremacist attacks. This resemblance suggests that many of the entries recorded with an unknown perpetrator could well be white supremacist attacks.
Having investigated possible features, we conclude that we will use a multiple linear regression model that takes the year, attack type, and target type into account to estimate whether or not a terrorist attack was carried out by a white supremacist group.
To do this, we create a dataframe of dummy variables for the attack type and target type. A dummy variable encodes a categorical value as a set of 0/1 indicator columns, one per category, so that it can be used in a regression model.
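As a toy illustration (the values below are made up, not from the GTD), pd.get_dummies turns one categorical column into a set of 0/1 columns, one per category:
# toy example of dummy variables: one 0/1 column per category (made-up values)
example = pd.Series(['Armed Assault', 'Bombing/Explosion', 'Armed Assault'])
print(pd.get_dummies(example))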
# make a dummy column for the known active attacks: 1 for white supremacist groups, 0 otherwise
usaActive["WS"] = usaActive.groupName.isin(
    ['White supremacists/nationalists', 'Ku Klux Klan', 'Neo-Nazi extremists']).astype(int)
# make dataframes of dummy variables for attack type and target type
attackTypeDummy = pd.get_dummies(usaActive.attackTypeText)
targetTypeDummy = pd.get_dummies(usaActive.targetTypeText)
# combine both sets of dummy variables into one table with the year and "White Supremacist" dummy field
usaActiveDummy = pd.concat([usaActive, attackTypeDummy, targetTypeDummy], axis=1)
# remove superfluous columns
usaActiveDummy = usaActiveDummy.drop(['region', 'latitude', 'longitude', 'attackTypeCode', 'attackTypeText', 'targetTypeCode', 'targetTypeText', 'groupName', 'fatalityCount'], axis=1)
usaActiveDummy
Now we feed the outcome column and the dummy variables into the model, which gives an R-squared score of ~0.47.
# create ols model of White Supremacy data taking into account the dummy table data
mod = sm.OLS(usaActiveDummy['WS'], usaActiveDummy.drop(['WS'], axis=1))
# fit model, save coefficients and print the parameters
res = mod.fit()
coef = res.params
# print(res.params)
res.summary()
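If you only want the R-squared value rather than the full summary table, it is available directly on the fitted results object:
# the R-squared of the fitted model (reported in the summary above)
print(res.rsquared)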
"The correct R2 value depends on your study area. Different research questions have different amounts of variability that are inherently unexplainable. Case in point, humans are hard to predict. Any study that attempts to predict human behavior will tend to have R-squared values less than 50%. However, if you analyze a physical process and have very good measurements, you might expect R-squared values over 90%. There is no one-size fits all best answer for how high R-squared should be."
https://statisticsbyjim.com/regression/how-high-r-squared
Given that terrorist behavior is human behavior, this blog post suggests that an R-squared score of ~0.47 is acceptable, since we are modeling human behavior, which is highly variable by nature.
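Finally, here is a sketch of how this model could be applied to the attacks with unknown perpetrators, which is the use case that motivated this tutorial. This is an illustrative follow-up step rather than part of the analysis above; it assumes the unknown attacks are encoded with the same dummy columns as the training data, with categories that never appear among the unknown attacks filled in as zeros:
# sketch: score the unknown-perpetrator attacks with the fitted model
unknownDummy = pd.concat([usaActiveUnknown[['year']],
                          pd.get_dummies(usaActiveUnknown.attackTypeText),
                          pd.get_dummies(usaActiveUnknown.targetTypeText)], axis=1)
# align columns with the training design matrix; missing categories become zero columns
trainCols = usaActiveDummy.drop(['WS'], axis=1).columns
unknownDummy = unknownDummy.reindex(columns=trainCols, fill_value=0).astype(float)
# higher predicted values suggest an attack looks more like the white supremacist pattern
wsScore = res.predict(unknownDummy)
usaActiveUnknown.assign(wsScore=wsScore).sort_values('wsScore', ascending=False).head()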
With this tutorial, we wanted to classify the attacks with unknown perpetrators in the dataset as either related to a white supremacist group or not. We used a reputable data source (data collected and maintained by the University of Maryland's START), so we are confident the data underlying our analysis and processing can support meaningful conclusions. The result is a model with an R-squared of almost 50%.
Ultimately, the question we were trying to answer is one of predicting human behavior. Terrorist groups rely on irrational thinking with little regard for empathy or life, so an R-squared that large is notable. Even so, it is safer to be skeptical of this model's results. In the real world, using it to label attacks by unknown perpetrators could be dangerous if the results led to tangible policy changes. We could add more features to potentially improve the model, but its predictions would still carry substantial uncertainty.
We hope you enjoyed this tutorial and liked this example of the data science pipeline!