Data Collection

Overview

For the data collection process, I gathered multiple datasets, including arrest records from the ICJIA Arrest Explorer, exoneration data from the National Registry of Exonerations, incarceration demographics from the Prison Policy Initiative, and geospatial information. This diverse collection enabled an in-depth examination of systemic racial inequities that drive the over-policing of marginalized communities while also facilitating a critical analysis of nuanced racial patterns in exoneration data, exploring their connection to broader societal structures. Together, these datasets provide a framework for understanding the interconnected dynamics of racial injustice in the criminal justice system.

Goals & Motivation

My goal in the data collection process was to gather the datasets needed to examine systemic racial disparities in policing and exonerations. I focused on sourcing reliable arrest records, exoneration data, incarceration demographics, and geospatial information. Ultimately, I aimed to compile data that can stand alone or work together to analyze over-policing trends, uncover racial patterns in exonerations, and explore the geographic dynamics connected to these issues.

I’m driven by the need to ground these systemic inequities in hard numbers and statistical analysis because too often, people refuse to acknowledge the reality of racism without concrete evidence. By putting real data behind these patterns, I hope to demonstrate the scope and impact of these injustices in a way that cannot be ignored.

Objectives

Identify and collect datasets needed to analyze racial disparities in over-policing and wrongful convictions in Illinois, including arrest records, exoneration data, incarceration demographics, and geospatial information.
Ensure datasets are accurate, up-to-date, and aligned with the project’s goals by prioritizing reliable sources like official registries and well-documented repositories.
Structure the data to support analysis of racial patterns in policing and exonerations, allowing for both independent and interconnected evaluations.
Document all steps in the data collection process, including sourcing methods and preprocessing workflows, to ensure transparency and reproducibility.
Collect and preprocess geocoded data and Illinois county shapefiles to enable geographic exploratory data analysis and visualizations.

Methods

To compile a dataset capable of critically examining systemic racial inequities in policing and wrongful convictions, I used a multi-step methodology that combined direct downloads, web scraping, and API-based geocoding. This approach facilitated the collection of diverse data sources while providing the spatial and demographic context necessary for comprehensive analysis.

Foundation: Exoneration and Arrest Datasets

The exoneration and arrest datasets formed the backbone of the project, each bringing a unique perspective to the analysis of racial disparities in the criminal justice system.

Illinois Arrest Data

The Illinois arrest dataset was sourced from the Illinois Criminal Justice Information Authority’s (ICJIA) Arrest Explorer, a platform providing aggregate arrest data from the Criminal History Record Information (CHRI) system—a statewide resource for demographic and offense-related variables.¹

To ensure privacy and confidentiality, ICJIA applied the following modifications:

Counts under 10 are approximated (e.g., 1 for counts 0–4, 6 for counts 5–9),
Subtotals, such as arrests by race or county, are accurate within +1/-1, and
Statewide totals align exactly with the CHRI database at the time of retrieval, which occurs twice annually.
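
As a minimal sketch of the approximation rule listed above, the function below maps a true count to the value that would be published under the stated thresholds. The name `mask_small_count` is hypothetical and is not part of any ICJIA tooling; the code only restates the rule as described.

```python
def mask_small_count(true_count: int) -> int:
    """Return the published value for a true count under the rule described above."""
    if true_count < 0:
        raise ValueError("counts cannot be negative")
    if true_count <= 4:
        return 1  # counts 0-4 are published as 1
    if true_count <= 9:
        return 6  # counts 5-9 are published as 6
    return true_count  # counts of 10 or more are left unchanged


examples = [0, 3, 4, 5, 9, 10, 226]
published = [mask_small_count(n) for n in examples]
print(published)
```

Under this rule a published 1 can stand for anything from zero to four arrests, so rows of 1s in the data preview further down should not be read as exact counts.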

Further, the dataset excludes juvenile arrests, class C misdemeanors, and cases with missing demographic details. For this project, the data was first filtered by race, county, and year, and then downloaded directly to examine patterns relevant to my analysis.²

import pandas as pd

arrest_df = pd.read_csv("../../data/raw-data/illinois_arrest_explorer_data.csv")
arrest_df.head(3)

	Year	race	county_Adams	county_Alexander	county_Bond	county_Boone	county_Brown	county_Bureau	county_Calhoun	county_Carroll	...	county_Wabash	county_Warren	county_Washington	county_Wayne	county_White	county_Whiteside	county_Will	county_Williamson	county_Winnebago	county_Woodford
0	2001	African American	226	147	25	18	18	48	6	12	...	6	46	25	6	16	128	3000	75	2509	42
1	2001	Asian	1	1	1	1	1	1	1	1	...	1	1	1	1	1	1	14	1	28	1
2	2001	Hispanic	1	1	1	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1

3 rows × 106 columns
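
The preview above is in wide format, with one column per county. One way to make it easier to filter and join is to reshape it into one row per year, race, and county. The sketch below assumes the `Year`, `race`, and `county_*` column names shown in the preview. The output column names `county` and `arrests` are illustrative choices, and the small DataFrame is a stand-in built from a few of the previewed values.

```python
import pandas as pd

# Tiny stand-in for the wide table; values are copied from the preview above
wide_df = pd.DataFrame(
    {
        "Year": [2001, 2001],
        "race": ["African American", "Asian"],
        "county_Adams": [226, 1],
        "county_Will": [3000, 14],
    }
)

# Turn every county_* column into a (county, arrests) pair
long_df = wide_df.melt(
    id_vars=["Year", "race"],
    var_name="county",
    value_name="arrests",
)

# Drop the "county_" prefix so county names can line up with other sources
long_df["county"] = long_df["county"].str.replace("county_", "", regex=False)

print(long_df)
```

The same `melt` call would apply to the full `arrest_df`, since every column other than `Year` and `race` is a county column in the preview.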

Exoneration Data

The exoneration dataset was downloaded directly from the National Registry of Exonerations, a collaborative initiative by the Newkirk Center for Science and Society at the University of California, Irvine, the University of Michigan Law School, and Michigan State University College of Law. The Registry was established in 2012 by Rob Warden, then Executive Director of the Center on Wrongful Convictions at Northwestern University’s Pritzker School of Law, and Samuel R. Gross, a law professor at the University of Michigan. It collects and publishes comprehensive, searchable statistical data and detailed case records for exonerations of innocent criminal defendants in the United States dating back to 1989.³

The Registry defines exonerations as cases where a person, following new evidence of innocence, is officially cleared through actions like factual declarations of innocence, pardons, or the dismissal/acquittal of charges.⁴

To access the exoneration dataset, a spreadsheet request form was submitted, and the dataset was provided under conditions ensuring its proper use. These conditions include restrictions on retransmission, a requirement for advance notice of publication, and the obligation to report any identified errors or missing data.

exoneration_df = pd.read_csv("../../data/raw-data/US_exoneration_data.csv")
exoneration_df.head(3)

	Last Name	First Name	Age	Race	Sex	State	County	Tags	Worst Crime Display	Sentence	...	F/MFE	FC	ILD	P/FA	DNA	MWID	OM	Date of Exoneration	Date of 1st Conviction	Date of Release
0	Abbitt	Joseph	31.0	Black	Male	North Carolina	Forsyth	CV;#IO;#SA	Child Sex Abuse	Life	...	NaN	NaN	NaN	NaN	DNA	MWID	NaN	9/2/09	6/22/95	9/2/09
1	Abbott	Cinque	19.0	Black	Male	Illinois	Cook	CIU;#IO;#NC;#P	Drug Possession or Sale	Probation	...	NaN	NaN	NaN	P/FA	NaN	NaN	OM	2/1/22	3/25/08	3/25/08
2	Abdal	Warith Habib	43.0	Black	Male	New York	Erie	IO;#SA	Sexual Assault	20 to Life	...	F/MFE	NaN	NaN	NaN	DNA	MWID	OM	9/1/99	6/6/83	9/1/99

3 rows × 22 columns
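
Because the project focuses on Illinois and the date columns arrive as text, a first preprocessing pass might filter by state and parse the dates. The sketch below assumes the `State`, `Date of Exoneration`, and `Date of 1st Conviction` columns shown in the preview, with dates written as month/day/two-digit year. The derived column `years_to_exoneration` is a hypothetical name, and the sample rows are copied from the preview so the example runs on its own.

```python
import pandas as pd

# Stand-in rows copied from the preview above (only the columns used here)
sample = pd.DataFrame(
    {
        "Last Name": ["Abbitt", "Abbott", "Abdal"],
        "State": ["North Carolina", "Illinois", "New York"],
        "Date of Exoneration": ["9/2/09", "2/1/22", "9/1/99"],
        "Date of 1st Conviction": ["6/22/95", "3/25/08", "6/6/83"],
    }
)

# Parse the month/day/two-digit-year strings into real dates
for col in ["Date of Exoneration", "Date of 1st Conviction"]:
    sample[col] = pd.to_datetime(sample[col], format="%m/%d/%y")

# Keep only Illinois cases
il_cases = sample[sample["State"] == "Illinois"].copy()

# Approximate years between first conviction and exoneration
elapsed = il_cases["Date of Exoneration"] - il_cases["Date of 1st Conviction"]
il_cases["years_to_exoneration"] = elapsed.dt.days / 365.25

print(il_cases)
```

One caveat with two-digit years: the `%y` directive maps 69 through 99 to the 1900s and 00 through 68 to the 2000s, so any conviction dated before 1969 would need separate handling.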

Adding Context: Web Scraping and Demographic Data

To contextualize these datasets within broader racial and geographic dynamics, I scraped incarceration demographics for Illinois counties from the Prison Policy Initiative’s county-level table of overrepresentation ratios. Using the Python libraries requests and BeautifulSoup, I extracted incarcerated and non-incarcerated population counts by race, along with the published overrepresentation ratios, for geographic analysis. This provided insight into racial overrepresentation within incarceration systems and illuminated trends at the county level.⁵

from io import StringIO
import requests
from bs4 import BeautifulSoup

# Retrieve the HTML content of the target webpage
html_url = "https://www.prisonpolicy.org/racialgeography/counties.html"
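result = requests.get(html_url, timeout=30)
result.raise_for_status()  # Stop early if the request failed

# Parse the HTML content with BeautifulSoup, naming the parser explicitly
soup = BeautifulSoup(result.text, "html.parser")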

# Locate the table element containing the data
table_elt = soup.find("table")

# Convert the HTML table into a Pandas DataFrame
table_sio = StringIO(str(table_elt))
county_df = pd.read_html(table_sio)[0]

# Clean the column names by collapsing stray whitespace, including
# non-breaking spaces and line breaks (assumed to be the unnecessary characters)
county_df.columns = [" ".join(str(c).split()) for c in county_df.columns]

# Filter for Illinois
il_df = county_df[county_df["State"] == "Illinois"].copy()

# Export to CSV
il_df.to_csv("../../data/raw-data/representation_by_county_raw.csv", index=False)
print("Data saved to 'representation_by_county_raw.csv'")
il_df.head()

Data saved to 'representation_by_county_raw.csv'

	County	State	Total Population	Total White Population	Total Black Population	Total Latino Population	Incarcerated Population	Incarcerated White Population	Incarcerated Black Population	Incarcerated Latino Population	Non-incarcerated Population	Non-incarcerated White Population	Non-Incarcerated Black Population	Non-Incarcerated Latino Population	Ratio of Overrepresentation of Whites Incarcerated Compared to Whites Non-Incarcerated	Ratio of Overrepresentation of Blacks Incarcerated Compared to Blacks Non-Incarcerated	Ratio of Overrepresentation of Latinos Incarcerated Compared to Latinos Non-Incarcerated
595	Adams County	Illinois	67103	62414	2331	776	110	73	36	0	66993	62341	2295	776	0.71	9.54	0.00
596	Alexander County	Illinois	8238	4983	2915	155	411	89	242	79	7827	4894	2673	76	0.35	1.72	19.82
597	Bond County	Illinois	17768	15797	1080	547	1542	500	657	304	16226	15297	423	243	0.34	16.32	13.14
598	Boone County	Illinois	54165	40757	1064	10967	71	38	12	21	54094	40719	1052	10946	0.71	8.71	1.46
599	Brown County	Illinois	6937	5191	1280	402	2059	419	1267	367	4878	4772	13	35	0.21	227.91	24.76
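
To make the ratio columns easier to interpret, the sketch below recomputes one of them from the counts in the same row. It assumes the ratio is a group's share of the incarcerated population divided by its share of the non-incarcerated population. That definition is an assumption inferred from the column names, not something confirmed by the source, and the function name is hypothetical.

```python
def overrepresentation_ratio(group_incarcerated, total_incarcerated,
                             group_non_incarcerated, total_non_incarcerated):
    """Group share among incarcerated people divided by its share among
    non-incarcerated people (assumed definition of the published ratio)."""
    if total_incarcerated == 0 or total_non_incarcerated == 0:
        return float("nan")  # shares are undefined without a population
    incarcerated_share = group_incarcerated / total_incarcerated
    non_incarcerated_share = group_non_incarcerated / total_non_incarcerated
    if non_incarcerated_share == 0:
        return float("nan")  # ratio is undefined without a comparison group
    return incarcerated_share / non_incarcerated_share


# Adams County, Black population, using the counts in the table above
adams_black = overrepresentation_ratio(36, 110, 2295, 66993)
print(round(adams_black, 2))
```

For this row the result lands close to, though not exactly on, the published 9.54, so the assumed definition looks roughly right, but the published figures may be rounded or computed differently.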

Geographical Insights: API-Based Geocoding

To strengthen the exploratory data analysis (EDA), I utilized geocoded data, adding latitude, longitude, and full addresses for Illinois counties. This addition enabled the visualization of systemic racial disparities across geographic areas, revealing trends and disparities among counties, identifying geographic clusters of exoneration cases, and examining regional variations in systemic factors.

The geocoding process was conducted using GeoPy, a Python library that serves as an interface to geocoding APIs, specifically the Nominatim API from OpenStreetMap. GeoPy simplifies the retrieval of geographic details like latitude, longitude, and full address from place names by sending requests to the Nominatim API and processing the responses.

# Import the Nominatim geocoder for converting location names into geographic coordinates:
from geopy.geocoders import Nominatim

# Initialize the geolocator with a descriptive user agent, which Nominatim's usage policy requires:
geolocator = Nominatim(user_agent="illinois_exoneration_geocode")

# Clean the 'County' column by removing the word "County" and any extra spaces:
il_df["County"] = il_df["County"].str.replace("County", "").str.strip()

# Copy the county and state columns into a new DataFrame for geocoding:
illinois_counties = il_df[["County", "State"]].copy()


# Define a function to geocode counties and return geographic details:
def geocode_county(row):
    """
    Takes a row containing 'County' and 'State' columns.
    Uses the geolocator to find the full address, latitude, and longitude.
    Returns a dictionary with the geocoded data or None if geocoding fails.
    """
    try:
        # Combine county and state into a query string for geocoding:
        location = geolocator.geocode(f"{row['County']}, {row['State']}, USA")

        # If a location is found, return the geocoded details:
        if location:
            return {
                "address": location.address,  # The full geocoded address
                "latitude": location.latitude,  # The latitude coordinate
                "longitude": location.longitude,  # The longitude coordinate
            }
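        else:  # Print a message if no geocoding result is found:
            print(f"No geocoding result for {row['County']}, {row['State']}")
            return None
    except Exception as e:  # Handle errors during geocoding and print them:
        print(f"Geocoding failed for {row['County']}, {row['State']}: {e}")
        return None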


# Apply the geocoding function to each Illinois county:
geocoded_results = illinois_counties.apply(geocode_county, axis=1)

# Extract the geocoding results into separate columns for address, latitude, and longitude:
illinois_counties["geocode_address"] = geocoded_results.apply(
    lambda x: x["address"] if isinstance(x, dict) and "address" in x else None
)
illinois_counties["latitude"] = geocoded_results.apply(
    lambda x: x["latitude"] if isinstance(x, dict) and "latitude" in x else None
)
illinois_counties["longitude"] = geocoded_results.apply(
    lambda x: x["longitude"] if isinstance(x, dict) and "longitude" in x else None
)

# Display the first few rows of the DataFrame with geocoded data:
illinois_counties.head()

	County	State	geocode_address	latitude	longitude
595	Adams	Illinois	None	None	None
596	Alexander	Illinois	None	None	None
597	Bond	Illinois	None	None	None
598	Boone	Illinois	None	None	None
599	Brown	Illinois	None	None	None

# Save the geocoded results to a CSV file
illinois_counties.to_csv(
    "../../data/raw-data/geocoded_population_counties.csv", index=False
)
print("Geocoded county data saved to 'geocoded_population_counties.csv'")

Geocoded county data saved to 'geocoded_population_counties.csv'
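
The preview above shows None for every county, which suggests the lookups either failed or returned no match. One possible variant, sketched here under assumptions, paces the requests and reports each failure. The function `geocode_counties`, the `FakeLocation` class, and the placeholder coordinates are all hypothetical and exist only so the sketch runs offline. Adding the word "County" to the query is also an assumption about what helps Nominatim match a county rather than a town.

```python
import time


def geocode_counties(counties, geocode, delay_seconds=1.0):
    """Geocode (county, state) pairs one at a time, pausing between requests.

    `geocode` is any callable that takes a query string and returns an object
    with .address, .latitude and .longitude, or None when nothing is found.
    """
    rows = []
    for county, state in counties:
        query = f"{county} County, {state}, USA"
        record = {"County": county, "State": state,
                  "geocode_address": None, "latitude": None, "longitude": None}
        try:
            location = geocode(query)
            if location is None:
                print(f"No result for {query!r}")
            else:
                record["geocode_address"] = location.address
                record["latitude"] = location.latitude
                record["longitude"] = location.longitude
        except Exception as exc:  # report the failure instead of hiding it
            print(f"Geocoding failed for {query!r}: {exc}")
        rows.append(record)
        time.sleep(delay_seconds)  # stay under the service's request rate
    return rows


class FakeLocation:
    """Stand-in for a geocoder result, used so the sketch runs offline."""

    def __init__(self, address, latitude, longitude):
        self.address = address
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(query):
    """Simulate one success, one error, and one empty result."""
    if query.startswith("Adams"):
        return FakeLocation("placeholder address", 1.0, 2.0)  # not real data
    if query.startswith("Bond"):
        raise RuntimeError("simulated timeout")
    return None


pairs = [("Adams", "Illinois"), ("Bond", "Illinois"), ("Boone", "Illinois")]
results = geocode_counties(pairs, fake_geocode, delay_seconds=0)
print(results)
```

With geopy, `geolocator.geocode` could be passed in place of `fake_geocode`. geopy also provides `geopy.extra.rate_limiter.RateLimiter`, which wraps a geocode function and enforces a delay through its `min_delay_seconds` argument.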

Mapping: Illinois County Boundaries

To visualize the geocoded data and perform geographic EDA, I required a shapefile of Illinois county boundaries. While the Census Bureau provides Illinois shapefiles, these files did not work as expected for my purposes due to compatibility issues.

Instead, I sourced a shapefile directly from the Illinois State Geological Survey (ISGS), which included county boundaries in a format compatible with the GIS tools I used. The ISGS shapefile provides the geographic foundation for mapping and visualizing trends across Illinois counties, adding a crucial layer of context to the datasets.⁶
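
Mapping the datasets onto county boundaries depends on county names matching across sources, and the previews above show three labeling styles: `county_Adams` in the arrest data, `Adams County` in the scraped table, and bare names such as `Cook` in the exoneration data. The helper below is a minimal sketch of one way to reduce these to a common join key. The name `county_key` is hypothetical, and the shapefile's own attribute names are not shown in the source, so they are left out.

```python
def county_key(name: str) -> str:
    """Reduce a county label to a bare, lower-case name for joining."""
    key = name.strip()
    if key.startswith("county_"):
        key = key[len("county_"):]   # arrest-data column style
    if key.endswith(" County"):
        key = key[: -len(" County")]  # scraped-table style
    return key.strip().lower()


labels = ["county_Adams", "Adams County", "Adams", " Cook "]
keys = [county_key(label) for label in labels]
print(keys)
```

Names that differ in internal spacing or capitalization between sources would need extra handling beyond this sketch.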

Summary

Challenges

The foundation for meaningful data analysis is accurate and reliable crime data, yet long-standing challenges in criminal history record systems continue to undermine its precision and dependability.⁷ As outlined in the Use and Management of Criminal History Record Information report (1993), issues such as incomplete data reporting, delays in recording arrests and case dispositions, and inconsistent fingerprint submissions have plagued the system for decades.⁷ Criminal History Record Information (CHRI), the backbone for arrest datasets like those from the Illinois Criminal Justice Information Authority (ICJIA), is vulnerable to these shortcomings. Dispositions, or the outcomes of arrests, are often missing or delayed, leaving critical gaps in the data that make it difficult to analyze systemic trends. While advances in technology, such as digital fingerprinting, have improved some processes, data fragmentation and insufficient oversight remain persistent barriers to comprehensive and accurate reporting.

Fast forward thirty years, and many of these challenges remain unresolved, now further complicated by uneven implementation of modern systems. The Marshall Project highlights a striking example: in 2022, the FBI’s shift to the National Incident-Based Reporting System (NIBRS) created a significant data gap in national crime statistics.⁸ Over 6,000 police agencies failed to submit their data, representing nearly one-third of all agencies and leaving vast portions of the U.S. population unaccounted for. This includes major departments like the NYPD and LAPD, alongside countless smaller agencies.⁸ These gaps reflect broader systemic failures (inconsistent adoption of updated systems, lack of adequate funding, and minimal oversight) that echo the same issues identified decades ago.

In short, crime data, even when sourced from official systems, remains inherently flawed and incomplete. Whether due to outdated processes, inconsistent reporting practices, privacy-driven modifications, or gaps in modern collection systems, crime data often falls short of providing a fully accurate or comprehensive picture. As a result, any analysis relying on this data must account for these imperfections, recognizing that while the data can uncover critical trends and systemic disparities, it is rarely a perfect representation of reality.

Arrest Data Precision

The Illinois arrest dataset obtained from the ICJIA Arrest Explorer highlights the trade-offs between privacy and precision, adding to the broader challenges of criminal justice data. To protect confidentiality, counts under 10 are approximated—values between 0 and 4 are replaced with 1, while counts from 5 to 9 are replaced with 6. Though this approach is necessary to safeguard sensitive data, it can introduce distortions, particularly in smaller counties or demographic groups, where even slight approximations can significantly skew trends and reduce the accuracy of analysis. Compounding these issues, the lack of oversight in reporting further undermines data consistency and reliability, reflecting the broader systemic limitations that continue to plague many criminal justice datasets.
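
To give a sense of how much uncertainty the approximation can introduce, the sketch below computes the lowest and highest true totals consistent with a set of published cells. It applies when individual cells are summed by hand rather than taken from ICJIA's own subtotals, which are stated to be accurate within one. The function name `true_count_range` is hypothetical.

```python
def true_count_range(published_values):
    """Return the (lowest, highest) true total consistent with published cells."""
    low = high = 0
    for value in published_values:
        if value == 1:      # stands for a true count of 0-4
            high += 4
        elif value == 6:    # stands for a true count of 5-9
            low += 5
            high += 9
        else:               # counts of 10 or more are published exactly
            low += value
            high += value
    return low, high


ten_masked_cells = true_count_range([1] * 10)
mixed_cells = true_count_range([226, 1, 6])
print(ten_masked_cells, mixed_cells)
```

Ten cells published as 1 could hide anywhere from 0 to 40 arrests, which illustrates why totals built from small counties or small demographic groups deserve wide error margins.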

Benchmarks

The National Registry of Exonerations is a trusted and widely used resource, frequently cited in academic and legal research. Unlike government databases, the Registry is run by a team of researchers and academics, which brings a level of precision and thoroughness often missing in state-managed systems.⁹ Its foundation in meticulous documentation and independent research makes it less vulnerable to political or institutional biases. As a result, the Registry stands out as a more reliable and comprehensive tool for understanding wrongful convictions.

Conclusion and Future Steps

The data collected from various sources provide a foundation for examining systemic racial disparities in over-policing and wrongful convictions. However, challenges such as the lack of precision in Illinois arrest datasets and broader gaps in national crime reporting underscore the need to approach findings with caution. Even with these limitations, the use of reliable resources like the National Registry of Exonerations adds credibility to the analysis. Moving forward, future research could focus on compiling crime data from a single state across local, state, and federal agencies to better understand inconsistencies and uncover patterns of underreporting. Expanding this work to compare data collection practices across states could also reveal regional differences in crime reporting and highlight systemic disparities in how data is recorded and shared.

References

1. Illinois Criminal Justice Information Authority. (2024). Overview arrest explorer. In Arrest Explorer. https://icjia.illinois.gov/arrestexplorer/docs/#what-data-is-available

2. Illinois Criminal Justice Information Authority. (2024). Arrests by race, county, and year. In Arrest Explorer. https://icjia.illinois.gov/arrestexplorer/

3. University of California Irvine Newkirk Center for Science & Society, University of Michigan Law School, & Michigan State University College of Law. (2024). The National Registry of Exonerations - Exoneration Registry. In The National Registry of Exonerations. https://www.law.umich.edu/special/exoneration/Pages/about.aspx

4. University of California Irvine Newkirk Center for Science & Society, University of Michigan Law School, & Michigan State University College of Law. (2024). The National Registry of Exonerations - Glossary. In The National Registry of Exonerations. https://www.law.umich.edu/special/exoneration/Pages/glossary.aspx

5. Prison Policy Initiative. (2024). Appendix A. Counties – Ratios of Overrepresentation. https://www.prisonpolicy.org/racialgeography/counties.html

6. Illinois State Geological Survey. (1984). Illinois county boundaries (2.0 ed.). Illinois State Geological Survey. https://clearinghouse.isgs.illinois.edu/data/reference/illinois-county-boundaries-polygons-and-lines

7. Woodard, P. L., & Belair, R. R. (1993). Use and Management of Criminal History Record Information: A Comprehensive Report. Bureau of Justice Statistics. https://bjs.ojp.gov/library/publications/use-and-management-criminal-history-record-information-comprehensive-report

8. Li, W., & Ricard, J. (2023). Many Large U.S. Police Agencies Are Missing from FBI Crime Data. In The Marshall Project. https://www.themarshallproject.org/2023/07/13/fbi-crime-rates-data-gap-nibrs

9. University of California Irvine Newkirk Center for Science & Society, University of Michigan Law School, & Michigan State University College of Law. (2024). The National Registry of Exonerations - Our Mission. In The National Registry of Exonerations. https://www.law.umich.edu/special/exoneration/Pages/mission.aspx