For the data collection process, I gathered multiple datasets: arrest records from the Illinois Criminal Justice Information Authority (ICJIA) Arrest Explorer, exoneration data from the National Registry of Exonerations, incarceration demographics from the Prison Policy Initiative, and geospatial information. This collection supports an in-depth examination of the systemic racial inequities that drive the over-policing of marginalized communities, as well as a critical analysis of racial patterns in exoneration data and their connection to broader societal structures. Together, these datasets provide a framework for understanding the interconnected dynamics of racial injustice in the criminal justice system.
Goals & Motivation
My goal in the data collection process was to gather the datasets needed to examine systemic racial disparities in policing and exonerations. I focused on sourcing reliable arrest records, exoneration data, incarceration demographics, and geospatial information. Ultimately, I aimed to compile data that can stand alone or work together to analyze over-policing trends, uncover racial patterns in exonerations, and explore the geographic dynamics connected to these issues.
I’m driven by the need to ground these systemic inequities in hard numbers and statistical analysis because too often, people refuse to acknowledge the reality of racism without concrete evidence. By putting real data behind these patterns, I hope to demonstrate the scope and impact of these injustices in a way that cannot be ignored.
Objectives
Identify and collect datasets needed to analyze racial disparities in over-policing and wrongful convictions in Illinois, including arrest records, exoneration data, incarceration demographics, and geospatial information.
Ensure datasets are accurate, up-to-date, and aligned with the project’s goals by prioritizing reliable sources like official registries and well-documented repositories.
Structure the data to support analysis of racial patterns in policing and exonerations, allowing for both independent and interconnected evaluations.
Document all steps in the data collection process, including sourcing methods and preprocessing workflows, to ensure transparency and reproducibility.
Collect and preprocess geocoded data and Illinois county shapefiles to enable geographic exploratory data analysis and visualizations.
Methods
To compile a dataset capable of critically examining systemic racial inequities in policing and wrongful convictions, I employed a multi-step methodology that combined direct downloads, web scraping, and API-based geocoding. This approach facilitated the collection of diverse data sources while providing the spatial and demographic context necessary for comprehensive analysis.
Foundation: Exoneration and Arrest Datasets
The exoneration and arrest datasets formed the backbone of the project, each bringing a unique perspective to the analysis of racial disparities in the criminal justice system.
To ensure privacy and confidentiality, ICJIA applied the following modifications:
Counts under 10 are approximated (e.g., 1 for counts 0–4, 6 for counts 5–9),
Subtotals, such as arrests by race or county, are accurate within +1/-1, and
Statewide totals align exactly with the Criminal History Record Information (CHRI) database at the time of retrieval, which occurs twice annually.
Further, the dataset excludes juvenile arrests, Class C misdemeanors, and cases with missing demographic details. For this project, the data was first filtered by race, county, and year, and then downloaded directly to examine patterns relevant to my analysis [2].
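The small-count approximation described above can be illustrated with a short sketch. The function name `mask_count` and the sample counts are hypothetical and are not part of any ICJIA tooling; the code only mirrors the stated rule (0–4 becomes 1, 5–9 becomes 6) to show what the published values look like.

```python
def mask_count(true_count: int) -> int:
    """Apply the stated small-count rule to one arrest count (illustrative only)."""
    if true_count < 0:
        raise ValueError("counts cannot be negative")
    if true_count <= 4:
        return 1  # counts 0-4 are published as 1
    if true_count <= 9:
        return 6  # counts 5-9 are published as 6
    return true_count  # counts of 10 or more are published unchanged


# Hypothetical true counts for a few small demographic groups
true_counts = [0, 3, 5, 9, 10, 226]
published = [mask_count(c) for c in true_counts]
print(published)
```

Under this rule a published 1 does not distinguish "no arrests" from "four arrests", which matters when reading the many 1s in the preview of the arrest table.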
import pandas as pd

arrest_df = pd.read_csv("../../data/raw-data/illinois_arrest_explorer_data.csv")
arrest_df.head(3)
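|   | Year | race | county_Adams | county_Alexander | county_Bond | county_Boone | county_Brown | county_Bureau | county_Calhoun | county_Carroll | ... | county_Wabash | county_Warren | county_Washington | county_Wayne | county_White | county_Whiteside | county_Will | county_Williamson | county_Winnebago | county_Woodford |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | African American | 226 | 147 | 25 | 18 | 18 | 48 | 6 | 12 | ... | 6 | 46 | 25 | 6 | 16 | 128 | 3000 | 75 | 2509 | 42 |
| 1 | 2001 | Asian | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 14 | 1 | 28 | 1 |
| 2 | 2001 | Hispanic | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

3 rows × 106 columns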
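Because the arrest table stores one column per county, county-level comparisons are usually easier after reshaping it to one row per year, race, and county. The following is a minimal sketch, assuming a small stand-in DataFrame (`wide_df`, with a few values copied from the preview) rather than the full file; the names `long_df`, `county`, and `arrests` are my own choices for illustration.

```python
import pandas as pd

# Tiny stand-in for the wide arrest table (values copied from the preview)
wide_df = pd.DataFrame(
    {
        "Year": [2001, 2001],
        "race": ["African American", "Asian"],
        "county_Adams": [226, 1],
        "county_Will": [3000, 14],
    }
)

# Melt the county_* columns into one row per (Year, race, county)
long_df = wide_df.melt(
    id_vars=["Year", "race"],
    var_name="county",
    value_name="arrests",
)

# Drop the "county_" prefix so names line up with other county-level tables
long_df["county"] = long_df["county"].str.replace("county_", "", regex=False)

print(long_df)
```

The long layout makes it straightforward to join arrests to other county-keyed tables on the county name.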
Exoneration Data
The exoneration dataset was downloaded directly from the National Registry of Exonerations, a collaborative initiative of the Newkirk Center for Science and Society at the University of California, Irvine, the University of Michigan Law School, and Michigan State University College of Law. Established in 2012 by Rob Warden, then Executive Director of the Center on Wrongful Convictions at Northwestern University’s Pritzker School of Law, and Samuel R. Gross, a law professor at the University of Michigan, the Registry collects and publishes comprehensive, searchable statistical data and detailed case records for exonerations of innocent criminal defendants in the United States dating back to 1989 [3].
The Registry defines exonerations as cases where a person, following new evidence of innocence, is officially cleared through actions like factual declarations of innocence, pardons, or the dismissal of or acquittal on the charges [4].
To access the exoneration dataset, a spreadsheet request form was submitted, and the dataset was provided under conditions ensuring its proper use. These conditions include restrictions on retransmission, a requirement for advance notice of publication, and the obligation to report any identified errors or missing data.
To contextualize these datasets within broader racial and geographic dynamics, I scraped incarceration demographics for Illinois counties from the Prison Policy Initiative’s website. Using Python libraries like requests and BeautifulSoup, I extracted and processed incarceration rates by race for geographic analysis. This provided insight into racial overrepresentation within incarceration systems and illuminated trends at the county level [5].
from io import StringIO

import requests
from bs4 import BeautifulSoup

# Retrieve the HTML content of the target webpage
html_url = "https://www.prisonpolicy.org/racialgeography/counties.html"
result = requests.get(html_url)

# Parse the HTML content with BeautifulSoup, naming the parser explicitly
soup = BeautifulSoup(result.text, "html.parser")

# Locate the table element containing the data
table_elt = soup.find("table")

# Convert the HTML table into a Pandas DataFrame
table_sio = StringIO(str(table_elt))
county_df = pd.read_html(table_sio)[0]

# Clean the column names by removing unnecessary characters
county_df.columns = [
    c.replace("", "").replace("", "").strip() for c in county_df.columns
]

# Filter for Illinois
il_df = county_df[county_df["State"] == "Illinois"].copy()

# Export to CSV
il_df.to_csv("../../data/raw-data/representation_by_county_raw.csv", index=False)
print("Data saved to 'representation_by_county_raw.csv'")
il_df.head()
Data saved to 'representation_by_county_raw.csv'
|   | County | State | Total Population | Total White Population | Total Black Population | Total Latino Population | Incarcerated Population | Incarcerated White Population | Incarcerated Black Population | Incarcerated Latino Population | Non-incarcerated Population | Non-incarcerated White Population | Non-Incarcerated Black Population | Non-Incarcerated Latino Population | Ratio of Overrepresentation of Whites Incarcerated Compared to Whites Non-Incarcerated | Ratio of Overrepresentation of Blacks Incarcerated Compared to Blacks Non-Incarcerated | Ratio of Overrepresentation of Latinos Incarcerated Compared to Latinos Non-Incarcerated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 595 | Adams County | Illinois | 67103 | 62414 | 2331 | 776 | 110 | 73 | 36 | 0 | 66993 | 62341 | 2295 | 776 | 0.71 | 9.54 | 0.00 |
| 596 | Alexander County | Illinois | 8238 | 4983 | 2915 | 155 | 411 | 89 | 242 | 79 | 7827 | 4894 | 2673 | 76 | 0.35 | 1.72 | 19.82 |
| 597 | Bond County | Illinois | 17768 | 15797 | 1080 | 547 | 1542 | 500 | 657 | 304 | 16226 | 15297 | 423 | 243 | 0.34 | 16.32 | 13.14 |
| 598 | Boone County | Illinois | 54165 | 40757 | 1064 | 10967 | 71 | 38 | 12 | 21 | 54094 | 40719 | 1052 | 10946 | 0.71 | 8.71 | 1.46 |
| 599 | Brown County | Illinois | 6937 | 5191 | 1280 | 402 | 2059 | 419 | 1267 | 367 | 4878 | 4772 | 13 | 35 | 0.21 | 227.91 | 24.76 |
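The overrepresentation columns can be read as a ratio of two shares. The sketch below is my own reading of the column names, not a formula documented by the Prison Policy Initiative: it divides a group's share of the incarcerated population by its share of the non-incarcerated population. The function name `overrepresentation_ratio` is hypothetical.

```python
def overrepresentation_ratio(
    group_incarcerated: int,
    total_incarcerated: int,
    group_non_incarcerated: int,
    total_non_incarcerated: int,
) -> float:
    """Group's share among incarcerated people divided by its share among everyone else.

    Assumes both totals and the non-incarcerated group count are nonzero.
    """
    incarcerated_share = group_incarcerated / total_incarcerated
    non_incarcerated_share = group_non_incarcerated / total_non_incarcerated
    return incarcerated_share / non_incarcerated_share


# Adams County, white population (figures from the table)
adams_white = overrepresentation_ratio(73, 110, 62341, 66993)
# Alexander County, Black population (figures from the table)
alexander_black = overrepresentation_ratio(242, 411, 2673, 7827)

print(round(adams_white, 2), round(alexander_black, 2))
```

These two cases round to the 0.71 and 1.72 shown in the table. Some other cells differ slightly in the second decimal place, so the published ratios may have been computed from rounded percentages or with a somewhat different definition.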
Geographical Insights: API-Based Geocoding
To strengthen the exploratory data analysis (EDA), I geocoded the Illinois counties, adding latitude, longitude, and a full address for each. This addition makes it possible to map systemic racial disparities across the state: comparing counties, identifying geographic clusters of exoneration cases, and examining regional variation in systemic factors.
The geocoding process was conducted using GeoPy, a Python library that serves as an interface to geocoding APIs, specifically the Nominatim API from OpenStreetMap. GeoPy simplifies the retrieval of geographic details like latitude, longitude, and full address from place names by sending requests to the Nominatim API and processing the responses.
# Import the Nominatim geocoder for converting location names into geographic coordinates:
from geopy.geocoders import Nominatim

# Initialize the geolocator with a custom user agent, which Nominatim requires to identify the application:
geolocator = Nominatim(user_agent="illinois_exoneration_geocode")

# Clean the 'County' column by removing the word "County" and any extra spaces:
il_df["County"] = il_df["County"].str.replace("County", "").str.strip()

# Copy the county and state columns into a new DataFrame of Illinois counties:
illinois_counties = il_df[["County", "State"]].copy()


# Define a function to geocode counties and return geographic details:
def geocode_county(row):
    """
    Takes a row containing 'County' and 'State' columns.
    Uses the geolocator to find the full address, latitude, and longitude.
    Returns a dictionary with the geocoded data or None if geocoding fails.
    """
    try:
        # Combine county and state into a query string for geocoding:
        location = geolocator.geocode(f"{row['County']}, {row['State']}, USA")
        # If a location is found, return the geocoded details:
        if location:
            return {
                "address": location.address,  # The full geocoded address
                "latitude": location.latitude,  # The latitude coordinate
                "longitude": location.longitude,  # The longitude coordinate
            }
        else:
            # Return None if no geocoding result is found:
            return None
    except Exception:
        # Treat any error during geocoding as a failed lookup:
        return None


# Apply the geocoding function to each Illinois county:
geocoded_results = illinois_counties.apply(geocode_county, axis=1)

# Extract the geocoding results into separate columns for address, latitude, and longitude:
illinois_counties["geocode_address"] = geocoded_results.apply(
    lambda x: x["address"] if isinstance(x, dict) and "address" in x else None
)
illinois_counties["latitude"] = geocoded_results.apply(
    lambda x: x["latitude"] if isinstance(x, dict) and "latitude" in x else None
)
illinois_counties["longitude"] = geocoded_results.apply(
    lambda x: x["longitude"] if isinstance(x, dict) and "longitude" in x else None
)

# Display the first few rows of the DataFrame with geocoded data:
illinois_counties.head()
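|   | County | State | geocode_address | latitude | longitude |
|---|---|---|---|---|---|
| 595 | Adams | Illinois | None | None | None |
| 596 | Alexander | Illinois | None | None | None |
| 597 | Bond | Illinois | None | None | None |
| 598 | Boone | Illinois | None | None | None |
| 599 | Brown | Illinois | None | None | None |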
# Save the geocoded results to a CSV file
illinois_counties.to_csv("../../data/raw-data/geocoded_population_counties.csv", index=False)
print("Geocoded county data saved to 'geocoded_population_counties.csv'")
Geocoded county data saved to 'geocoded_population_counties.csv'
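The preview of the geocoded table shows None for every county. Because geocode_county returns None both when there is no match and when an exception is raised, the output alone does not reveal which happened. One possible cause is request throttling, since the public Nominatim service asks clients to stay at or below one request per second. The sketch below is a hypothetical helper (`geocode_with_retry` is my own name, not a GeoPy function) that pauses between calls, retries on errors, and reports why a lookup failed. It accepts any geocoding callable, so it runs offline here with a stand-in; with GeoPy, `geolocator.geocode` would be passed instead.

```python
import time
from typing import Callable, Optional, Tuple


def geocode_with_retry(
    query: str,
    geocode: Callable[[str], Optional[object]],
    min_delay_seconds: float = 1.0,
    max_retries: int = 2,
) -> Tuple[Optional[object], Optional[str]]:
    """Call geocode(query) with a pause before each attempt.

    Returns (location, None) on success or (None, reason) on failure.
    """
    last_error = None
    for _ in range(max_retries + 1):
        time.sleep(min_delay_seconds)  # pause before every request
        try:
            location = geocode(query)
        except Exception as exc:  # e.g. timeouts or rate-limit errors
            last_error = f"{type(exc).__name__}: {exc}"
            continue  # try again until retries run out
        if location is None:
            return None, "no match found"
        return location, None
    return None, last_error


# Stand-in geocoder so the sketch runs without network access
def fake_geocode(query: str):
    if query.startswith("Adams"):
        return {"latitude": 0.0, "longitude": 0.0}  # placeholder coordinates
    return None


print(geocode_with_retry("Adams County, Illinois, USA", fake_geocode, min_delay_seconds=0))
print(geocode_with_retry("Nowhere County, Illinois, USA", fake_geocode, min_delay_seconds=0))
```

Keeping the failure reason alongside each result makes it possible to tell a county with no match apart from a request that never succeeded, which the all-None table cannot show.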
Mapping: Illinois County
To visualize the geocoded data and perform geographic exploratory data analysis (EDA), I needed a shapefile of Illinois county boundaries. While the Census Bureau provides Illinois shapefiles, these files did not work as expected for my purposes due to compatibility issues.
Instead, I sourced a shapefile directly from the Illinois State Geological Survey (ISGS), which included county boundaries in a format compatible with the GIS tools I used. The ISGS shapefile provides the geographic foundation for mapping trends across Illinois counties, adding a crucial layer of context to the datasets [6].
Summary
Challenges
The foundation for meaningful data analysis is accurate and reliable crime data, yet long-standing challenges in criminal history record systems continue to undermine their precision and dependability [7]. As outlined in the Use and Management of Criminal History Record Information report (1993), issues such as incomplete data reporting, delays in recording arrests and case dispositions, and inconsistent fingerprint submissions have plagued the system for decades [7]. Criminal History Record Information (CHRI), the backbone for arrest datasets like those from the Illinois Criminal Justice Information Authority (ICJIA), is vulnerable to these shortcomings. Dispositions, or the outcomes of arrests, are often missing or delayed, leaving critical gaps in the data that make it difficult to analyze systemic trends. While advances in technology, such as digital fingerprinting, have improved some processes, data fragmentation and insufficient oversight remain persistent barriers to comprehensive and accurate reporting.
Fast forward thirty years, and many of these challenges remain unresolved, now further complicated by uneven implementation of modern systems. The Marshall Project highlights a striking example: in 2022, the FBI’s shift to the National Incident-Based Reporting System (NIBRS) created a significant data gap in national crime statistics [8]. Over 6,000 police agencies failed to submit their data, representing nearly one-third of all agencies and leaving vast portions of the U.S. population unaccounted for. This includes major departments like the NYPD and LAPD, alongside countless smaller agencies [8]. These gaps reflect broader systemic failures (inconsistent adoption of updated systems, lack of adequate funding, and minimal oversight) that echo the same issues identified decades ago.
In short, crime data, even when sourced from official systems, remains inherently flawed and incomplete. Whether due to outdated processes, inconsistent reporting practices, privacy-driven modifications, or gaps in modern collection systems, crime data often falls short of providing a fully accurate or comprehensive picture. As a result, any analysis relying on this data must account for these imperfections, recognizing that while the data can uncover critical trends and systemic disparities, it is rarely a perfect representation of reality.
Arrest Data Precision
The Illinois arrest dataset obtained from the ICJIA Arrest Explorer highlights the trade-offs between privacy and precision, adding to the broader challenges of criminal justice data. To protect confidentiality, counts under 10 are approximated: values between 0 and 4 are replaced with 1, while counts from 5 to 9 are replaced with 6. Though this approach is necessary to safeguard sensitive data, it can introduce distortions, particularly in smaller counties or demographic groups, where even slight approximations can significantly skew trends and reduce the accuracy of analysis. Compounding these issues, the lack of oversight in reporting further undermines data consistency and reliability, reflecting the broader systemic limitations that continue to plague many criminal justice datasets.
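To make the size of this distortion concrete, the sketch below works backwards from published values to the range of true counts they could represent. It is an illustration under the stated rule only; the function name `true_count_bounds` is hypothetical, and the input is the run of eight 1s that appears for the first eight counties in the Asian row of the 2001 arrest preview.

```python
from typing import List, Tuple


def true_count_bounds(published: int) -> Tuple[int, int]:
    """Return the (lowest, highest) true count consistent with one published cell."""
    if published == 1:
        return (0, 4)  # a published 1 stands for any count from 0 to 4
    if published == 6:
        return (5, 9)  # a published 6 stands for any count from 5 to 9
    if published >= 10:
        return (published, published)  # larger counts are published unchanged
    raise ValueError(f"{published} cannot occur under the approximation rule")


def total_bounds(published_cells: List[int]) -> Tuple[int, int]:
    """Sum the per-cell bounds to bracket the true total across several cells."""
    lows, highs = zip(*(true_count_bounds(p) for p in published_cells))
    return (sum(lows), sum(highs))


# Eight published cells, each shown as 1
cells = [1] * 8
low, high = total_bounds(cells)
print(sum(cells), low, high)
```

Here the published cells sum to 8, while the true total could be anywhere from 0 to 32, which is why trends for small groups in small counties should be read with particular caution.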
Benchmarks
The National Registry of Exonerations is a trusted and widely used resource, frequently cited in academic and legal research. Unlike government databases, the Registry is run by a team of researchers and academics, which brings a level of precision and thoroughness often missing in state-managed systems [9]. Its foundation in meticulous documentation and independent research makes it less vulnerable to political or institutional biases. As a result, the Registry stands out as a more reliable and comprehensive tool for understanding wrongful convictions.
Conclusion and Future Steps
The data collected from various sources provide a foundation for examining systemic racial disparities in over-policing and wrongful convictions. However, challenges such as the lack of precision in Illinois arrest datasets and broader gaps in national crime reporting underscore the need to approach findings with caution. Even with these limitations, the use of reliable resources like the National Registry of Exonerations adds credibility to the analysis. Moving forward, future research could focus on compiling crime data from a single state across local, state, and federal agencies to better understand inconsistencies and uncover patterns of underreporting. Expanding this work to compare data collection practices across states could also reveal regional differences in crime reporting and highlight systemic disparities in how data is recorded and shared.
3. University of California Irvine Newkirk Center for Science & Society, University of Michigan Law School, & Michigan State University College of Law. (2024). The National Registry of Exonerations - Exoneration Registry. In The National Registry of Exonerations. https://www.law.umich.edu/special/exoneration/Pages/about.aspx
4. University of California Irvine Newkirk Center for Science & Society, University of Michigan Law School, & Michigan State University College of Law. (2024). The National Registry of Exonerations - Glossary. In The National Registry of Exonerations. https://www.law.umich.edu/special/exoneration/Pages/glossary.aspx
9. University of California Irvine Newkirk Center for Science & Society, University of Michigan Law School, & Michigan State University College of Law. (2024). The National Registry of Exonerations - Our Mission. In The National Registry of Exonerations. https://www.law.umich.edu/special/exoneration/Pages/mission.aspx