This project is intended to showcase our understanding of the data science pipeline by applying it to real-world data. We do this by looking at data on terrorist activity in the United States provided by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland.
START website: https://www.start.umd.edu/gtd/
The START website provides a codebook that explains how the data is collected and maintained as well as how to interpret each data element for your own use.
NBC News article: https://www.nbcnews.com/news/us-news/white-nationalism-fueled-violence-rise-fbi-slow-call-it-domestic-n1039206
The article above describes how the FBI has recently been misclassifying attacks so that they are not treated as terrorist activity, which shapes the investigation process in a way that shields white supremacists from legal repercussions. Our goal is to build a model that takes data on terrorist attacks with unknown perpetrators and estimates the likelihood that each attack is related to a white supremacist group.
We look at data provided by the University of Maryland's National Consortium for the Study of Terrorism and Responses to Terrorism (START). The Global Terrorism Database we are using contains in-depth data on terrorist attacks between 1970 and 2018.
First, we import the Python libraries we intend to use.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
!pip install folium
import folium
import statsmodels.api as sm
import statsmodels.formula.api as smf
print("Success!")
Then, we read in the dataset and proceed to tidy it.
# only (re)load the large spreadsheet if it is not already in memory
try:
    if data is None:
        data = pd.DataFrame()
except NameError:
    data = pd.DataFrame()
# database description (for columns): see page 13 of the codebook
if data.empty:
    data = pd.read_excel('globalterrorismdb_0919dist.xlsx')
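As a side note, the full GTD spreadsheet is large, so reading it can take a while. A minimal shortcut (a sketch, assuming you only need the columns this tutorial uses; the names come from the codebook, and gtdCols is just an illustrative name) is to restrict the read to those columns:
# optional: read only the columns this tutorial uses (names from the GTD codebook);
# gtdCols is an illustrative name for this hypothetical shortcut
gtdCols = ['iyear', 'country_txt', 'region_txt', 'latitude', 'longitude',
           'attacktype1', 'attacktype1_txt', 'targtype1', 'targtype1_txt',
           'gname', 'nkill']
# data = pd.read_excel('globalterrorismdb_0919dist.xlsx', usecols=gtdCols)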
We filter the data so that we are only looking at terrorist attacks that have taken place in the United States. We then find the groups that have been active since 2015 (inclusive) and combine the two pieces of information to get the historical data for only the groups that have attacked the United States since 2015.
We then separate that data into three tables by the perpetrator group responsible: known white supremacists, known non-white supremacists, and unknown terrorist groups.
The article linked above gives some background on which keywords we looked for in group names to classify them as white supremacist. The keywords came down to: 'White supremacists/nationalists', 'Ku Klux Klan', and 'Neo-Nazi extremists'.
# get all attacks in the United States
usa = data[data.country_txt == 'United States']
# make list of all groups that have attacked the USA since 2015 (inclusive)
activeGroups = list(usa[usa.iyear >= 2015].gname.unique())
# get info for all the attacks from active groups
usaActive = usa[usa.gname.isin(activeGroups)]
usaActive = usaActive[['iyear', 'region_txt', 'latitude', 'longitude', 'attacktype1', 'attacktype1_txt', 'targtype1', 'targtype1_txt', 'gname', 'nkill']]
# remove NAs from latitude and longitude columns (for mapping)
usaActive = usaActive[usaActive.latitude.notna() & usaActive.longitude.notna()]
# rename the columns to be more meaningful
usaActive.rename(columns={'iyear': 'year', 'region_txt': 'region',
                          'attacktype1': 'attackTypeCode', 'attacktype1_txt': 'attackTypeText',
                          'targtype1': 'targetTypeCode', 'targtype1_txt': 'targetTypeText',
                          'gname': 'groupName', 'nkill': 'fatalityCount'}, inplace=True)
# separate attacks where the group is unknown from groups that have been identified
usaActiveUnknown = usaActive[usaActive.groupName == 'Unknown']
usaActive = usaActive[usaActive.groupName != 'Unknown']
# separate White Supremacist groups from Non-White Supremacist groups
wsNames = ['White supremacists/nationalists', 'Ku Klux Klan', 'Neo-Nazi extremists']
uaWS = usaActive[usaActive.groupName.isin(wsNames)]
uaNWS = usaActive[~usaActive.groupName.isin(wsNames)]
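As a quick sanity check (an extra step, not required for the analysis), we can print how many attacks fall into each of the three groupings:
# quick sanity check: how many attacks ended up in each grouping
print('White supremacist attacks:', len(uaWS))
print('Non-white supremacist attacks:', len(uaNWS))
print('Unknown-perpetrator attacks:', len(usaActiveUnknown))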
With the separate data tables for known white supremacist, known non-white supremacist, and unknown terrorist groups, we first look at the number of terrorist attacks per year in each grouping.
To do this, we count the attacks in each year and put the counts into new data frames so that we can clean each grouping separately and give each its own plot. We plot the counts as line graphs so we can compare when each grouping sees spikes in attacks.
# white supremacists
# count white supremacist attacks per year and sort chronologically
g1 = uaWS.year.value_counts().rename_axis('year').reset_index(name='count')
g1 = g1.sort_values(by='year').reset_index(drop=True)
plt.figure(figsize=(10, 5))
plt.plot(g1['year'], g1['count'], c='blue')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Number of White Supremacy Related Terrorist Attacks by Year')
plt.show()
# non white supremacists
# count non-white supremacist attacks per year and sort chronologically
g2 = uaNWS.year.value_counts().rename_axis('year').reset_index(name='count')
g2 = g2.sort_values(by='year').reset_index(drop=True)
plt.figure(figsize=(10, 5))
plt.plot(g2['year'], g2['count'], c='red')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Number of Non-White Supremacy Related Terrorist Attacks by Year')
plt.show()
# unknown groups
# count unknown-perpetrator attacks per year and sort chronologically
g3 = usaActiveUnknown.year.value_counts().rename_axis('year').reset_index(name='count')
g3 = g3.sort_values(by='year').reset_index(drop=True)
plt.figure(figsize=(10, 5))
plt.plot(g3['year'], g3['count'], c='green')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Number of Terrorist Attacks by Unknown Groups by Year')
plt.show()
From the plots above, the white supremacy related attacks spike dramatically in roughly the same years as the attacks by unknown groups, while the non-white supremacist series does not show the same pattern. For this reason, we suspect that many of the attacks with an unknown perpetrator could still be related to white supremacist groups.
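To put a rough number on that visual impression (a supplementary check using the yearly count frames g1 and g3 built above, limited to years that appear in both), we can correlate the yearly counts of white supremacist and unknown-perpetrator attacks:
# rough check: correlation of yearly counts for white supremacist (g1) and
# unknown-perpetrator (g3) attacks, over the years present in both frames
overlap = g1.merge(g3, on='year', suffixes=('_ws', '_unknown'))
print(overlap[['count_ws', 'count_unknown']].corr())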
Next, we similarly summarize the counts of each attack type for the three groupings. We show these on separate bar charts to highlight the relative attack tendencies within each grouping.
# count white supremacist attacks by attack type
g1 = uaWS.attackTypeText.value_counts().rename_axis('attackTypeText').reset_index(name='count')
g1 = g1.sort_values(by='attackTypeText').reset_index(drop=True)
fig = plt.figure(figsize=(10, 5))
ax = fig.add_axes([0,0,1,1])
lab = g1['attackTypeText']
dat = g1['count']
f1 = ax.bar(lab,dat, color=['blue'])
plt.xticks(rotation=45)
# label the plot and display it
plt.xlabel('Type of Attack')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Overview of White Supremacy Related Terrorist Attacks by Type of Attack')
plt.show()
# count non-white supremacist attacks by attack type
g2 = uaNWS.attackTypeText.value_counts().rename_axis('attackTypeText').reset_index(name='count')
g2 = g2.sort_values(by='attackTypeText').reset_index(drop=True)
fig = plt.figure(figsize=(10, 5))
ax = fig.add_axes([0,0,1,1])
lab = g2['attackTypeText']
dat = g2['count']
f1 = ax.bar(lab,dat, color=['red'])
plt.xticks(rotation=45)
# label the plot and display it
plt.xlabel('Type of Attack')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Overview of Non-White Supremacy Related Terrorist Attacks by Type of Attack')
plt.show()
# count unknown-perpetrator attacks by attack type
g3 = usaActiveUnknown.attackTypeText.value_counts().rename_axis('attackTypeText').reset_index(name='count')
g3 = g3.sort_values(by='attackTypeText').reset_index(drop=True)
g3
fig = plt.figure(figsize=(10, 5))
ax = fig.add_axes([0,0,1,1])
lab = g3['attackTypeText']
dat = g3['count']
f3 = ax.bar(lab,dat, color=['green'])
plt.xticks(rotation=45)
# label the plot and display it
plt.xlabel('Type of Attack')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Overview of Terrorist Attacks for which the Group is Unknown by Type of Attack')
plt.show()
The most common attack types across the three groupings are bombings/explosions, facility/infrastructure attacks, armed assaults, and assassinations, but the groupings arrange these types differently. White supremacist groups are mostly recorded for armed assaults, with bombings/explosions and facility/infrastructure attacks at similar counts behind that. Non-white supremacist groups are mostly recorded for facility/infrastructure attacks. The unknown grouping mostly records bombings/explosions, with facility/infrastructure attacks next at about half as many. The clear difference between the white supremacist and non-white supremacist attack-type profiles suggests that attack type is a strong feature for differentiating the two.
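Because the three groupings differ greatly in size, it can also help to compare attack types as proportions rather than raw counts (a supplementary check, not part of the charts above):
# share of each attack type within each grouping (proportions rather than counts)
attackShares = pd.concat([
    uaWS.attackTypeText.value_counts(normalize=True).rename('WS share'),
    uaNWS.attackTypeText.value_counts(normalize=True).rename('NWS share'),
    usaActiveUnknown.attackTypeText.value_counts(normalize=True).rename('Unknown share')
], axis=1).fillna(0).round(3)
attackShares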
Next, we compare the target type counts across the three groupings. We place them on a single bar chart to see which targets each grouping hits most frequently.
# count white supremacist attacks by target type, most frequent first
g1 = uaWS.targetTypeText.value_counts().rename_axis('targetTypeText').reset_index(name='WScount')
g1 = g1.sort_values(by='WScount', ascending=False).reset_index(drop=True)
# count non-white supremacist attacks by target type, most frequent first
g2 = uaNWS.targetTypeText.value_counts().rename_axis('targetTypeText').reset_index(name='NWScount')
g2 = g2.sort_values(by='NWScount', ascending=False).reset_index(drop=True)
# count unknown-perpetrator attacks by target type, most frequent first
g3 = usaActiveUnknown.targetTypeText.value_counts().rename_axis('targetTypeText').reset_index(name='Ucount')
g3 = g3.sort_values(by='Ucount', ascending=False).reset_index(drop=True)
g = g1.set_index('targetTypeText').join(g2.set_index('targetTypeText')).join(g3.set_index('targetTypeText')) # join the three count tables on target type
g = g.head(5) # truncate the tables to only display the top 5 target types
g
ax = g.plot.bar(color=["blue", "red", "green"], figsize=(10,10))
plt.xticks(rotation=45)
plt.xlabel('Type of Target')
plt.ylabel('Number of Terrorist Attacks')
plt.title('Overview of Terrorist Attacks by Target Type (for White Supremacist top 5 Target Types)')
plt.legend(['White Supremacist Group Counts', 'Non-White Supremacist Group Counts', 'Unknown Group Counts']);
plt.show()
From the chart, white supremacist and non-white supremacist groups follow a similar trend across the top five white supremacist target types, with one exception: non-white supremacist groups show a far larger number of attacks on businesses.
Here we look at the number of fatalities in attacks in the different groupings by year.
plt.figure(figsize=(10, 5))
plt.scatter(uaWS['year'], uaWS['fatalityCount'], c='blue')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Fatality Count')
plt.title('Scatter Plot of Fatality Counts from White Supremacy Related Terrorist Attacks distributed by Year')
plt.show()
plt.figure(figsize=(10, 5))
plt.scatter(uaNWS['year'], uaNWS['fatalityCount'], c='red')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Fatality Count')
plt.title('Scatter Plot of Fatality Counts from Non-White Supremacy Related Terrorist Attacks distributed by Year')
plt.show()
plt.figure(figsize=(10, 5))
plt.scatter(usaActiveUnknown['year'], usaActiveUnknown['fatalityCount'], c='green')
# label the plot and display it
plt.xlabel('Year')
plt.ylabel('Fatality Count')
plt.title('Scatter Plot of Fatality Counts from Terrorist Attacks for which the Group is Unknown distributed by Year')
plt.show()
These plots are not very conclusive, beyond showing that white supremacist attacks have become deadlier in recent years.
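To back up that observation (a quick supplementary check), we can total the fatalities per year for the two known groupings and compare the most recent years:
# total fatalities per year for the most recent years in each known grouping
print(uaWS.groupby('year')['fatalityCount'].sum().tail(5))
print(uaNWS.groupby('year')['fatalityCount'].sum().tail(5))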
To continue exploring, we create a map of the locations of terrorist attacks from the three data groupings. Blue indicates White Supremacist groups; Red, Non-White Supremacist groups; Green, unknown terrorist groups.
We use the folium package to loop through all of the rows in the three tables, plotting each attack in its grouping's color using the latitude and longitude fields.
attackMap = folium.Map(location=[42, -95], zoom_start=4)
# loop through the rows of each table and draw one circle per attack
for _, row in usaActiveUnknown.iterrows():
    folium.Circle([row.loc['latitude'], row.loc['longitude']],
                  radius=20, popup=row.groupName,
                  color='green').add_to(attackMap)
for _, row in uaNWS.iterrows():
    folium.Circle([row.loc['latitude'], row.loc['longitude']],
                  radius=20, popup=row.groupName,
                  color='red').add_to(attackMap)
for _, row in uaWS.iterrows():
    folium.Circle([row.loc['latitude'], row.loc['longitude']],
                  radius=20, popup=row.groupName,
                  color='blue').add_to(attackMap)
attackMap
The map did not reveal much for building our model, but it did emphasize the issue at hand: the attacks with unknown perpetrators cluster in many of the same locations as the white supremacist attacks. This resemblance suggests that many of the entries recorded with an unknown perpetrator could well be white supremacist attacks.
Having investigated possible features, we conclude that we will use a multiple linear regression model that takes the year, attack type, and target type into account to estimate whether or not a terrorist attack was carried out by a white supremacist group.
To do this, we create a dataframe of dummy variables for the attack type and target type. A dummy variable encodes a categorical value as a set of 0/1 indicator columns, one per category, so that it can be used in a regression model.
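As a toy illustration (the values below are made up, not from the GTD), pd.get_dummies turns one categorical column into a set of 0/1 columns, one per category:
# toy example of dummy variables: one 0/1 column per category (made-up values)
example = pd.Series(['Armed Assault', 'Bombing/Explosion', 'Armed Assault'])
print(pd.get_dummies(example))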
# make a dummy column for the known active attacks: 1 for white supremacist groups, 0 otherwise
usaActive["WS"] = usaActive.groupName.isin(
    ['White supremacists/nationalists', 'Ku Klux Klan', 'Neo-Nazi extremists']).astype(int)
# make dataframes of dummy variables for attack type and target type
attackTypeDummy = pd.get_dummies(usaActive.attackTypeText)
targetTypeDummy = pd.get_dummies(usaActive.targetTypeText)
# combine both sets of dummy variables into one table with the year and "White Supremacist" dummy field
usaActiveDummy = pd.concat([usaActive, attackTypeDummy, targetTypeDummy], axis=1)
# remove superfluous columns
usaActiveDummy = usaActiveDummy.drop(['region', 'latitude', 'longitude', 'attackTypeCode', 'attackTypeText', 'targetTypeCode', 'targetTypeText', 'groupName', 'fatalityCount'], axis=1)
usaActiveDummy
Now we feed the outcome column and the dummy variables into the model, which gives an R-squared score of ~0.47.
# create ols model of White Supremacy data taking into account the dummy table data
mod = sm.OLS(usaActiveDummy['WS'], usaActiveDummy.drop(['WS'], axis=1))
# fit model, save coefficients and print the parameters
res = mod.fit()
coef = res.params
# print(res.params)
res.summary()
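If you only want the R-squared value rather than the full summary table, it is available directly on the fitted results object:
# the R-squared of the fitted model (reported in the summary above)
print(res.rsquared)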
"The correct R2 value depends on your study area. Different research questions have different amounts of variability that are inherently unexplainable. Case in point, humans are hard to predict. Any study that attempts to predict human behavior will tend to have R-squared values less than 50%. However, if you analyze a physical process and have very good measurements, you might expect R-squared values over 90%. There is no one-size fits all best answer for how high R-squared should be."
https://statisticsbyjim.com/regression/how-high-r-squared
Given that terrorist behavior is human behavior, this blog post suggests that an R-squared score of ~0.47 is acceptable, since we are modeling human behavior, which is highly variable by nature.
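Finally, here is a sketch of how this model could be applied to the attacks with unknown perpetrators, which is the use case that motivated this tutorial. This is an illustrative follow-up step rather than part of the analysis above; it assumes the unknown attacks are encoded with the same dummy columns as the training data, with categories that never appear among the unknown attacks filled in as zeros:
# sketch: score the unknown-perpetrator attacks with the fitted model
unknownDummy = pd.concat([usaActiveUnknown[['year']],
                          pd.get_dummies(usaActiveUnknown.attackTypeText),
                          pd.get_dummies(usaActiveUnknown.targetTypeText)], axis=1)
# align columns with the training design matrix; missing categories become zero columns
trainCols = usaActiveDummy.drop(['WS'], axis=1).columns
unknownDummy = unknownDummy.reindex(columns=trainCols, fill_value=0).astype(float)
# higher predicted values suggest an attack looks more like the white supremacist pattern
wsScore = res.predict(unknownDummy)
usaActiveUnknown.assign(wsScore=wsScore).sort_values('wsScore', ascending=False).head()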
With this tutorial, we wanted to classify the attacks with unknown perpetrators in the dataset as either related to a white supremacist group or not. We used a reputable data source (data collected and maintained by the University of Maryland's START), so we are confident the data underlying our analysis and processing can support meaningful conclusions. The result is a model with an R-squared of almost 50%.
Ultimately, the question we were trying to answer is one of predicting human behavior. Terrorist groups rely on irrational thinking with little regard for empathy or life, so an R-squared that large is notable. Even so, it is safer to be skeptical of this model's results. In the real world, using it to label attacks by unknown perpetrators could be dangerous if the results led to tangible policy changes. We could add more features to potentially improve the model, but its predictions would still carry substantial uncertainty.
We hope you enjoyed this tutorial and liked this example of the data science pipeline!