Finding Clusters of NYPD Officers In CCRB Complaint Data
Why?
Complaints filed against police officers by the public are often the first and only warning sign that a cop might be on a course of escalating violence.
In the deaths of George Floyd and Eric Garner their killers had a documented history of complaints filed against them. Unfortunately nothing was done to disrupt their pattern of abuse, and both cases ended in those officers killing members of the public they had sworn to protect.
George Floyd
Chauvin, who was fired, has said through his attorney that his handling of Floyd’s arrest was a reasonable use of authorized force. But he was the subject of at least 22 complaints or internal investigations during his more than 19 years at the department, only one of which resulted in discipline. These new interviews show not only that he may have used excessive force in the past, but that he had used startlingly similar techniques. 'That could have been me': The people Derek Chauvin choked before George Floyd
The officer convicted of murdering George Floyd had at least 22 complaints against him. The officer who put Eric Garner in a chokehold and killed him had 7 complaints filed against him.
Eric Garner
Before he put Garner in the chokehold, the records show, he had seven disciplinary complaints and 14 individual allegations lodged against him. Four of those allegations were substantiated by an independent review board. The disturbing secret history of the NYPD officer who killed Eric Garner
Of the 14 individual allegations against Garner's killer, 5 are for force: "hit against inanimate object", "physical force", and a single complaint in 2014 that would foreshadow the behavior that would eventually end the Officer's career: "Force - Chokehold".
I am documenting my analysis in detail for a few reasons:
- So that other people who may want to perform similar analysis for other Police Departments can understand and recreate my analysis
- So that every step is documented, and any mistake can be easily caught and fixed by the infinite supply of people on the internet who are smarter than me
- To maybe inspire people to use computers to investigate the things in the world that are important to them, and share the tools I use to do that
Network visualization prior work / inspiration
You may have seen network analysis like this before.
Jacob Silver used it to look at the spread of anti-vax material.
Adi Cohen has pioneered a method of combining Gephi with CrowdTangle to analyze the network of groups and pages sharing links.
Provenance
WNYC/Gothamist received the data in response to a Freedom of Information Law request and provided me an excel file for analysis.
The Dataset
Differences from data previously released by ProPublica
ProPublica released and covered similar data in July of 2020.
They chose to only publish data for "active-duty officers who’ve had at least one allegation against them substantiated by the CCRB".
The dataset we are working with today contains every complaint and officer, even those with no substantiated allegations.
It also contains officers who were listed as witnesses on complaints, including complaints found as "unsubstantiated" or "unfounded" by the CCRB.
This makes it a "noisier" dataset. In our case this can be an advantage since we are looking to visualize the network of officers.
Being named with another officer on a complaint, even if that complaint is unfounded, is a signal that those officers interacted in a way that was noticed by the public. Being the subject of an unfounded complaint together might even cause officers to form a tighter relationship. Because of that, I will incorporate witness data into our analysis.
NYPD internal structure as it relates to our data
The NYPD is divided into coverage areas within the 5 boroughs known as precincts. When I lived in Brooklyn, I lived in the 81st Precinct which covers Bed-Stuy.
The NYPD also has a number of units, like the Warrant Squad or Narcotics that span different precincts. An officer might report to a numbered precinct, but their command is Brooklyn Narcotics, and they are interacting with other officers in their unit more than the precinct they work out of. Our data reflects this.
Analysis
Overview exploration / metadata
The source dataset is an 81.2MB excel file that I received as FOIL2021-00167_Dataset.xlsx
. It has 3 tabs. The first has some general notes1 about the dataset.
The first tab has the title of OfficerAllegationHistory
and has 181,627 entries and 472 columns.
The second tab has the title of OfficersInvolvedInComplaints
and has 239,608 entries and 183 columns.
Analyzing our data with Datasette / SQLite
Once we convert our CSV files into SQLite .db
files we can use Datasette to get a sense of the data and slice off pieces for further analysis.
The first thing we might want to look at is the top commands that received complaints since 2010.
- 075 PCT: 320
- 046 PCT: 251
- 047 PCT: 226
- 120 PCT: 199
- BX IRT : 196
Or the top 10 commands whose complaints ended in penalties.
- WARRSEC: 91
- PBBX : 84
- 081 PCT: 71
- 079 PCT: 70
- 075 PCT: 68
After using Datasette to get a sense for the shape of our dataset, we can use it to filter out a slice of the data to use to feed into our next tool and begin doing our network analysis.
Ian Johnson helped me explore the data a bit in Observable as well.
We looked at when officers received complaints. Here each year is split into a bar. We are showing the officers who received the most complaints in the entire dataset.
Or stacking the bars- the darker they are, the more recent the complaints occurred.
You can also look at the CCRB outcomes and see how rarely cases are substantiated or end in discipline.
We also looked at which commands received the most complaints across the entire dataset.
Filtering out "Exonerated" and "Unfounded" complaints for our network
To get our network closer to a representation of officers who are receiving complaints for misconduct we want to filter out any of the cases in which the officer was Exonerated
or the CCRB's disposition was that it was Unfounded.
So to get every complaint filed since 2010 that wasn't marked as exonerated
or unfounded
we'll write a query like
Which gives us a slightly more manageable 65,401 rows. We'll save these results off as a .csv for further analysis.
I had to do some funky stuff3 to fix the dates in SQLite, but once I did that, it was easy to filter by the date column.
We can also export new CSVs for all of the incidents since 2010 for a particular precinct:
These precinct-specific slices are a bit smaller and more manageable for analysis in Observable.
We'll also want to export another CSV with ONLY complaints that were substantiated
Analyzing our data with Neo4J
I first encountered Neo4J when I was working with Ben Popken on an NBC News analysis of tweets tied to Senate Intelligence-identified Russian Twitter Bots where Neo4J provided analysts who were crucial to understanding the shape of our data.
It is an incredibly useful tool for generating and analyzing networks, and I was excited to have another dataset that would let me use its considerable power.
Importing our CSV with Cypher
To import our .csv
into a network of node and relationships in Neo4J, we will use the Cypher query language, which makes this process really easy and the code is relatively readable and easy to follow.
Special thanks to David Allen at Neo4J for his guidance in writing queries and designing these relationships.
Basically we take the CSV files we exported from Datasette (when we filtered our Exonerated, Unfounded, and everything before 2010) and go through every row and push it into our network.
Creating officer nodes
First we tell Neo4J to use officer.id as a unique constraint (this makes things faster, I think?) and create a node for each officer from one CSV.
Then I loop through every line of the .csv and create a new Officer node for every new officer I see. I use MERGE
instead of CREATE
to make sure I don't duplicate officer nodes.
This creates 29,915 unique officer nodes.
Then we bind more data into it from our other CSV
Now we have 113,265 unique officer nodes.
An officer node looks like this:
Finally, we need to add a boolean to denote if an officer has ever had a charge substantiated. We'll use this later to please some lawyers. You'll see.
Now we've marked 4,768 of New York's ~36,000 (13%) finest as having a substantiated complaint in the last 10 years.
Let's NOT do the same thing for OfficersInvolved - because that file contains officers who were merely witnesses to substantiated complaints, and we don't want to accidentally label a witness to a substantiated case.
Creating officer labels
Now we need to set labels for our nodes depending on whether they have ever had a complaint substantiated. We don't want to label nodes with names for any officers who may have complaints but have never had any substantiated. I have been told that lawyers think this is a good idea.
First we set every officer label to their unique ID
Then we look for officers with substantiated complaints and set their label to their full name.
Now we have our officers created, we need to create our incidents.
Creating incident nodes
We are going to continue to use our CSV which filtered out incidents before 2010 or that were unfounded or exonerated.
First we tell Neo4J that we have unique incident IDs
Then we create an incident for every row we see in OfficerAllegationHistory. We'll make note of the precinct the incident occurred in, what the specific allegation was, and what date the incident occurred.
Now let's do the same for OfficersInvolvedInComplaints.
Now let's create labels for our incidents, which is going to be the allegation.
Now that we have our Incidents and our Officers we need to create our relationships between them.
Creating relationships between incidents and officers
Now for the fun part.
We are going to create a new relationship called INVOLVED_IN
, and officers can be INVOLVED_IN
one or many incidents. Incidents may have one or many officers that were INVOLVED_IN
it, either as witness or subject officers.
First we create our relationships from OfficersInvolved:
Which creates 94,323 relationships.
Then from OfficerAllegationHistory:
Now we have 159,671 relationships. Sick.
Next we can flatten out our graph a little bit and remove incidents if we want.
We will create a new type of relationship that only occurs between two officers called CO_OCCURANCE
- we will only make one of these between each officer, but the weight of that link will be decided by how many complaints those officers appear together on.
So officers who appear on 3 complaints together have a CO_OCCURANCE
relationship with a weight of 3. This allows us to do some weighted degree analysis when we are making our layout, deciding how large to make nodes, and when we are detecting communities.
Officers without connections
Eigenvector analysis on our network
Let's run a standard centrality analysis algorithm called "Eigenvector Centrality"
CALL gds.alpha.eigenvector.write({
nodeProjection: 'Officer',
relationshipProjection: 'CO_OCCURANCE',
relationshipProperties: 'weight',
relationshipWeightProperty: 'weight',
writeProperty: 'eigenvector'
})
YIELD nodes, iterations, dampingFactor, writeProperty
Now every Officer node has an eigenvector
value that represents its centrality across our entire NYPD-wide network. The larger the value, the more central that node is.
Analyzing our data with Gephi
Neo4J is cool for processing and analyzing tons of data, but I want to draw thousands of circles and lines now and start untangling the hairball of our network.
I am going to use Gephi, which I have a love-hate relationship with, but is unrivaled when it comes to network visualization. Plus, I already know how to use it.
We are going to stream our data from Neo4J to Gephi in order to leverage Neo4J's power to handle huge amounts of data (way more than Gephi) but still get to use Gephi's layout algorithms and analysis techniques.
Flattened co-occurance network
To get our flattened network, which removed incident nodes:
This streams 100,000 edges and 25,064 nodes into Gephi.
We'll run the Force Atlas 2 layout algorithm in Gephi to have the nodes arrange themselves into some sort of sense.
We can use Gephi's modularity algorithm to color by "community" within our network, and we'll tweak our layout algorithm to separate things out a bit.
The communities that Gephi detects often mirror real-world precincts. As one might expect, officers appear on complaints with other officers in their precinct because they are working together most often.
I like that the algorithm detects communities that resemble precincts, and it actually gives me confidence that the community detection is working.
Then we can add some labels and we've made a map of the network of officers who appeared on CCRB complaints together, and there appear to be a number of closely-knit clusters and different communities within our network.
Precinct-specific networks including incidents (un-flattened network)
Let's put it all together and stream all the officers from a single precinct using only incidents since 2010.
We can also make a network that is un-flattened, that is, we can see nodes for both officers as well as incidents. This way, we can see the patterns in the way incidents tie officers together.
Every blue circle here is an incident, and every red circle is an officer (both filtered by the 75th precinct, since 2010).
Now let's size the circles by their degree (the number of connections they have) and run a layout algorithm.
Some big nodes start to pop up, like Martinez, Radoncic, and Grieco. What is causing these officers to co-appear on so many different complaints with so many different officers?
When exploring this network, large nodes pop up, and I became naturally curious what the careers of those officers looked like. One of the first ones that jumped out to me was a big node that represented an officer named Nicholas Rios.
I googled his name, and one of the first results was a harrowing story of a federal civil rights case that named him.
In the reporting that George Joseph did for Gothamist/WNYC, he found that a number of large influencers that appear in this network analysis also acted as influencers on the ground.
He looked at another large node in the network, Adnan Radoncic:
Atunbi asserted that Radoncic was a catalyst for a group assault on the street that day. “As soon as he grabbed me, all the officers was hands on,” he said. “It’s like they just followed his lead.”
Potential Next Steps
There are a few different directions for further analysis that I didn't have time for, but may result in interesting findings.
- Explorable NYPD-wide network
- Looking at protest complaints
- Looking at veterans influencing rookies
- Looking at the effects of NYPD discipline
- Officer career-specific visualization
- Analysis of length/outcomes of CCRB investigations
- Geographic analysis
Flotsam & Jetsam
Hire me to do work like this
I do freelance data exploration and visualization for clients who aren't evil. If you'd like to hire me to take a look at a dataset for you, just get in touch at [email protected]
Footnotes
-
The notes say, basically: these are complaints received in or after the year 2000. Cases that are mediated or were attempted to be mediated are excluded. ↩
-
OfficerAllegationHistory columns:
1: As Of Date
2: Allegation Record Identity
3: Unique Officer Id
4: Active Per Last Reported Status
5: Last Reported Active Date
6: Officer First Name
7: Officer Last Name
8: Officer Race
9: Officer Gender
10: Current Rank Abbreviation
11: Current Rank
12: Current Command
13: Shield No
14: Complaint Id
15: Incident Date
16: CCRB Received Date
17: Close Date
18: Officer Rank Abbreviation At Incident
19: Officer Rank At Incident
20: Officer Command At Incident
21: Officer Days On Force At Incident
22: Borough Of Incident Occurrence
23: Precinct Of Incident Occurrence
24: Location Type Of Incident
25: Reason for Police Contact
26: Outcome Of Police Encounter
27: Victim Age At Incident
28: Victim Race
29: Victim Gender
30: FADO Type
31: Allegation
32: CCRB Allegation Disposition
33: Board Discipline Recommendation
34: Non-APU Penalty Report Date
35: Officer Is APU
36: APU CCRB Trial Recommended Penalty
37: APU Trial Commissioner Recommended Penalty
38: APU Plea Agreed Penalty
39: APU Case Status
40: APU Closing Date
41: NYPD Allegation Disposition
42: NYPD Officer Penalty
43: Reconsideration Requested
44: Reconsideration Request Withdrawn
45: Reconsideration Request Rejected
46: Reconsideration Occurred
47: Reconsideration Decision Pending
:3 OfficersInvolvedInComplaints columns:
1: As Of Date
2: Officer Status
3: Unique Officer Id
4: Active Per Last Reported Status
5: Last Reported Active Date
6: Officer First Name
7: Officer Last Name
8: Officer Race
9: Officer Gender
10: Current Rank Abbreviation
11: Current Rank
12: Current Command
13: Shield No
14: Complaint Id
15: Complaint Disposition
16: Incident Date
17: CCRB Recieved Date
18: Close Date
-
I ran this in the CLI SQLite client to chop up the date string and re-write it in the way SQLite wants it:
↩ ↩2 ↩3
One-time Support
Every bit matters. Thank you!