DSAN-5000


Data Collection

Overview

For the data collection process, I gathered multiple datasets, including arrest records from the ICJIA Arrest Explorer, exoneration data from the National Registry of Exonerations, incarceration demographics from the Prison Policy Initiative, and geospatial information. This diverse collection enabled an in-depth examination of systemic racial inequities that drive the over-policing of marginalized communities while also facilitating a critical analysis of nuanced racial patterns in exoneration data, exploring their connection to broader societal structures. Together, these datasets provide a framework for understanding the interconnected dynamics of racial injustice in the criminal justice system.

Goals & Motivation

My goal in the data collection process was to gather the datasets needed to examine systemic racial disparities in policing and exonerations. I focused on sourcing reliable arrest records, exoneration data, incarceration demographics, and geospatial information. Ultimately, I aimed to compile data that can stand alone or work together to analyze over-policing trends, uncover racial patterns in exonerations, and explore the geographic dynamics connected to these issues.

I’m driven by the need to ground these systemic inequities in hard numbers and statistical analysis because too often, people refuse to acknowledge the reality of racism without concrete evidence. By putting real data behind these patterns, I hope to demonstrate the scope and impact of these injustices in a way that cannot be ignored.

Objectives

  • Identify and collect datasets needed to analyze racial disparities in over-policing and wrongful convictions in Illinois, including arrest records, exoneration data, incarceration demographics, and geospatial information.
  • Ensure datasets are accurate, up-to-date, and aligned with the project’s goals by prioritizing reliable sources like official registries and well-documented repositories.
  • Structure the data to support analysis of racial patterns in policing and exonerations, allowing for both independent and interconnected evaluations.
  • Document all steps in the data collection process, including sourcing methods and preprocessing workflows, to ensure transparency and reproducibility.
  • Collect and preprocess geocoded data and Illinois county shapefiles to enable geographic exploratory data analysis and visualizations.

Methods

To compile a dataset capable of critically examining systemic racial inequities in policing and wrongful convictions, I employed a multi-step methodology that integrated direct downloads, web scraping, and API-based geocoding. This approach facilitated the collection of diverse data sources while providing the spatial and demographic context necessary for comprehensive analysis.

Foundation: Exoneration and Arrest Datasets

The exoneration and arrest datasets formed the backbone of the project, each bringing a unique perspective to the analysis of racial disparities in the criminal justice system.

Illinois Arrest Data

The Illinois arrest dataset was sourced from the Illinois Criminal Justice Information Authority’s (ICJIA) Arrest Explorer, a platform providing aggregate arrest data from the Criminal History Record Information (CHRI) system—a statewide resource for demographic and offense-related variables.1

To ensure privacy and confidentiality, ICJIA applied the following modifications:

  • Counts under 10 are approximated (e.g., 1 for counts 0–4, 6 for counts 5–9),
  • Subtotals, such as arrests by race or county, are accurate within +1/-1, and
  • Statewide totals align exactly with the CHRI database at the time of retrieval, which occurs twice annually.

Further, the dataset excludes juvenile arrests, class C misdemeanors, and cases with missing demographic details. For this project, the data was first filtered by race, county, and year, and then downloaded directly to examine patterns relevant to my analysis.2
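The approximation rule for small counts described above can be sketched in a few lines. This is only an illustration of the published rule, not ICJIA's actual implementation: the function name `mask_count` is hypothetical, counts of 10 or more are assumed to pass through unchanged, and the ±1 adjustment to subtotals is not modeled.

```python
def mask_count(true_count: int) -> int:
    """Apply the small-count approximation described for Arrest Explorer.

    Assumption: counts of 10 or more are published unchanged.
    """
    if true_count < 0:
        raise ValueError("arrest counts cannot be negative")
    if true_count <= 4:
        return 1  # counts 0-4 are shown as 1
    if true_count <= 9:
        return 6  # counts 5-9 are shown as 6
    return true_count


# Hypothetical true counts for five small counties
true_counts = [0, 3, 5, 9, 12]
masked = [mask_count(c) for c in true_counts]
print(masked)  # [1, 1, 6, 6, 12]
```

Because a true zero is also published as 1, a value of 1 in the downloaded table does not by itself show that any arrest occurred.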

import pandas as pd

arrest_df = pd.read_csv("../../data/raw-data/illinois_arrest_explorer_data.csv")
arrest_df.head(3)
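| | Year | race | county_Adams | county_Alexander | county_Bond | county_Boone | county_Brown | county_Bureau | county_Calhoun | county_Carroll | ... | county_Wabash | county_Warren | county_Washington | county_Wayne | county_White | county_Whiteside | county_Will | county_Williamson | county_Winnebago | county_Woodford |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | African American | 226 | 147 | 25 | 18 | 18 | 48 | 6 | 12 | ... | 6 | 46 | 25 | 6 | 16 | 128 | 3000 | 75 | 2509 | 42 |
| 1 | 2001 | Asian | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 14 | 1 | 28 | 1 |
| 2 | 2001 | Hispanic | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |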

3 rows × 106 columns

Exoneration Data

The exoneration dataset was downloaded directly from the National Registry of Exonerations, a collaborative initiative by the Newkirk Center for Science and Society at the University of California, Irvine, the University of Michigan Law School, and Michigan State University College of Law. Established in 2012 by Rob Warden, then Executive Director of the Center on Wrongful Convictions at Northwestern University’s Pritzker School of Law, and Samuel R. Gross, a law professor at the University of Michigan, the Registry collects and publishes comprehensive, searchable statistical data and detailed case records for exonerations of innocent criminal defendants in the United States dating back to 1989.3

The Registry defines exonerations as cases where a person, following new evidence of innocence, is officially cleared through actions like factual declarations of innocence, pardons, or the dismissal/acquittal of charges.4

To access the exoneration dataset, a spreadsheet request form was submitted, and the dataset was provided under conditions ensuring its proper use. These conditions include restrictions on retransmission, a requirement for advance notice of publication, and the obligation to report any identified errors or missing data.

exoneration_df = pd.read_csv("../../data/raw-data/US_exoneration_data.csv")
exoneration_df.head(3)
| | Last Name | First Name | Age | Race | Sex | State | County | Tags | Worst Crime Display | Sentence | ... | F/MFE | FC | ILD | P/FA | DNA | MWID | OM | Date of Exoneration | Date of 1st Conviction | Date of Release |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbitt | Joseph | 31.0 | Black | Male | North Carolina | Forsyth | CV;#IO;#SA | Child Sex Abuse | Life | ... | NaN | NaN | NaN | NaN | DNA | MWID | NaN | 9/2/09 | 6/22/95 | 9/2/09 |
| 1 | Abbott | Cinque | 19.0 | Black | Male | Illinois | Cook | CIU;#IO;#NC;#P | Drug Possession or Sale | Probation | ... | NaN | NaN | NaN | P/FA | NaN | NaN | OM | 2/1/22 | 3/25/08 | 3/25/08 |
| 2 | Abdal | Warith Habib | 43.0 | Black | Male | New York | Erie | IO;#SA | Sexual Assault | 20 to Life | ... | F/MFE | NaN | NaN | NaN | DNA | MWID | OM | 9/1/99 | 6/6/83 | 9/1/99 |

3 rows × 22 columns
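The date columns in this file use two-digit years (for example, 6/22/95), which is worth handling carefully before any time-based analysis. Python's `%y` directive maps 00–68 to 2000–2068, so a conviction from the 1960s would be read as a future date. The sketch below shows one way to guard against that. The helper name `parse_registry_date` and the cutoff date are assumptions made for illustration and do not come from the Registry's documentation.

```python
from datetime import date, datetime


def parse_registry_date(text: str, latest: date) -> date:
    """Parse an m/d/yy string and push impossible future dates back a century.

    `latest` is an assumed cutoff (for example, the download date): no
    conviction, release, or exoneration in the file can fall after it.
    """
    parsed = datetime.strptime(text, "%m/%d/%y").date()
    if parsed > latest:
        # %y read the year as 20xx, but the event must have been in 19xx
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


cutoff = date(2024, 12, 31)  # hypothetical download date
print(parse_registry_date("6/22/95", cutoff))  # 1995-06-22
print(parse_registry_date("9/2/09", cutoff))   # 2009-09-02
print(parse_registry_date("3/1/65", cutoff))   # 1965-03-01 (invented date)
```

The same check could be applied column by column once the file is loaded, for instance to "Date of 1st Conviction", where pre-1969 dates are most likely to appear.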

Adding Context: Web Scraping and Demographic Data

To contextualize these datasets within broader racial and geographic dynamics, I scraped incarceration demographics for Illinois counties from the Prison Policy Initiative’s county-level appendix of overrepresentation ratios. Using the Python libraries requests and BeautifulSoup, I extracted incarcerated and non-incarcerated population counts by race, along with the published overrepresentation ratios, for geographic analysis. This provided insight into racial overrepresentation within incarceration systems and illuminated trends at the county level.5

from io import StringIO
import requests
from bs4 import BeautifulSoup

# Retrieve the HTML content of the target webpage
html_url = "https://www.prisonpolicy.org/racialgeography/counties.html"
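result = requests.get(html_url, timeout=30)
result.raise_for_status()  # Stop early if the request failed

# Parse the HTML content with BeautifulSoup, naming the parser explicitly
soup = BeautifulSoup(result.text, "html.parser")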

# Locate the table element containing the data
table_elt = soup.find("table")

# Convert the HTML table into a Pandas DataFrame
table_sio = StringIO(str(table_elt))
county_df = pd.read_html(table_sio)[0]

# Clean the column names by removing unnecessary characters
county_df.columns = [
    c.replace("", "").replace("", "").strip() for c in county_df.columns
]

# Filter for Illinois
il_df = county_df[county_df["State"] == "Illinois"].copy()

# Export to CSV
il_df.to_csv("../../data/raw-data/representation_by_county_raw.csv", index=False)
print("Data saved to 'representation_by_county_raw.csv'")
il_df.head()
Data saved to 'representation_by_county_raw.csv'
| | County | State | Total Population | Total White Population | Total Black Population | Total Latino Population | Incarcerated Population | Incarcerated White Population | Incarcerated Black Population | Incarcerated Latino Population | Non-incarcerated Population | Non-incarcerated White Population | Non-Incarcerated Black Population | Non-Incarcerated Latino Population | Ratio of Overrepresentation of Whites Incarcerated Compared to Whites Non-Incarcerated | Ratio of Overrepresentation of Blacks Incarcerated Compared to Blacks Non-Incarcerated | Ratio of Overrepresentation of Latinos Incarcerated Compared to Latinos Non-Incarcerated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 595 | Adams County | Illinois | 67103 | 62414 | 2331 | 776 | 110 | 73 | 36 | 0 | 66993 | 62341 | 2295 | 776 | 0.71 | 9.54 | 0.00 |
| 596 | Alexander County | Illinois | 8238 | 4983 | 2915 | 155 | 411 | 89 | 242 | 79 | 7827 | 4894 | 2673 | 76 | 0.35 | 1.72 | 19.82 |
| 597 | Bond County | Illinois | 17768 | 15797 | 1080 | 547 | 1542 | 500 | 657 | 304 | 16226 | 15297 | 423 | 243 | 0.34 | 16.32 | 13.14 |
| 598 | Boone County | Illinois | 54165 | 40757 | 1064 | 10967 | 71 | 38 | 12 | 21 | 54094 | 40719 | 1052 | 10946 | 0.71 | 8.71 | 1.46 |
| 599 | Brown County | Illinois | 6937 | 5191 | 1280 | 402 | 2059 | 419 | 1267 | 367 | 4878 | 4772 | 13 | 35 | 0.21 | 227.91 | 24.76 |
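As a sanity check on the scraped table, the overrepresentation ratios can be recomputed from the population counts. The sketch below assumes, based only on the column names, that each ratio is a group's share of the incarcerated population divided by its share of the non-incarcerated population. The Prison Policy Initiative's exact method, including any rounding, is not confirmed here, so small differences from the published values should be expected. The function name is hypothetical.

```python
from typing import Optional


def overrepresentation_ratio(
    group_incarcerated: int,
    total_incarcerated: int,
    group_non_incarcerated: int,
    total_non_incarcerated: int,
) -> Optional[float]:
    """Group share of incarcerated people / group share of non-incarcerated people."""
    if total_incarcerated == 0 or total_non_incarcerated == 0:
        return None
    incarcerated_share = group_incarcerated / total_incarcerated
    non_incarcerated_share = group_non_incarcerated / total_non_incarcerated
    if non_incarcerated_share == 0:
        return None  # ratio is undefined when the group is absent outside prison
    return incarcerated_share / non_incarcerated_share


# Adams County counts taken from the first row of the table above
white_ratio = overrepresentation_ratio(73, 110, 62341, 66993)
black_ratio = overrepresentation_ratio(36, 110, 2295, 66993)
print(round(white_ratio, 2), round(black_ratio, 2))
```

For Adams County this gives roughly 0.71 and 9.55, against published values of 0.71 and 9.54, which is consistent with the assumed definition up to rounding.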

Geographical Insights: API-Based Geocoding

To strengthen the exploratory data analysis (EDA), I geocoded Illinois counties, adding a latitude, longitude, and full address for each. This made it possible to map racial disparities across the state, compare counties, identify geographic clusters of exoneration cases, and examine regional variation in systemic factors.

The geocoding process was conducted using GeoPy, a Python library that serves as an interface to geocoding APIs, specifically the Nominatim API from OpenStreetMap. GeoPy simplifies the retrieval of geographic details like latitude, longitude, and full address from place names by sending requests to the Nominatim API and processing the responses.

# Import the Nominatim geocoder for converting location names into geographic coordinates:
from geopy.geocoders import Nominatim

# Initialize the geolocator with a custom user agent, which Nominatim's usage policy requires:
geolocator = Nominatim(user_agent="illinois_exoneration_geocode")

# Clean the 'County' column by removing the word "County" and any extra spaces:
il_df["County"] = il_df["County"].str.replace("County", "").str.strip()

# Copy the county and state columns into a new DataFrame for geocoding:
illinois_counties = il_df[["County", "State"]].copy()


# Define a function to geocode counties and return geographic details:
def geocode_county(row):
    """
    Takes a row containing 'County' and 'State' columns.
    Uses the geolocator to find the full address, latitude, and longitude.
    Returns a dictionary with the geocoded data or None if geocoding fails.
    """
    try:
        # Combine county and state into a query string for geocoding:
        location = geolocator.geocode(f"{row['County']}, {row['State']}, USA")

        # If a location is found, return the geocoded details:
        if location:
            return {
                "address": location.address,  # The full geocoded address
                "latitude": location.latitude,  # The latitude coordinate
                "longitude": location.longitude,  # The longitude coordinate
            }
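        else:  # Print a message if no geocoding result is found:
            print(f"No geocoding result for {row['County']}, {row['State']}")
            return None
    except Exception as e:  # Handle errors during geocoding and print them:
        print(f"Geocoding failed for {row['County']}, {row['State']}: {e}")
        return None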


# Apply the geocoding function to each Illinois county:
geocoded_results = illinois_counties.apply(geocode_county, axis=1)

# Extract the geocoding results into separate columns for address, latitude, and longitude:
illinois_counties["geocode_address"] = geocoded_results.apply(
    lambda x: x["address"] if isinstance(x, dict) and "address" in x else None
)
illinois_counties["latitude"] = geocoded_results.apply(
    lambda x: x["latitude"] if isinstance(x, dict) and "latitude" in x else None
)
illinois_counties["longitude"] = geocoded_results.apply(
    lambda x: x["longitude"] if isinstance(x, dict) and "longitude" in x else None
)

# Display the first few rows of the DataFrame with geocoded data:
illinois_counties.head()
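| | County | State | geocode_address | latitude | longitude |
|---|---|---|---|---|---|
| 595 | Adams | Illinois | None | None | None |
| 596 | Alexander | Illinois | None | None | None |
| 597 | Bond | Illinois | None | None | None |
| 598 | Boone | Illinois | None | None | None |
| 599 | Brown | Illinois | None | None | None |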
# Save the geocoded results to a CSV file
illinois_counties.to_csv(
    "../../data/raw-data/geocoded_population_counties.csv", index=False
)
print("Geocoded county data saved to 'geocoded_population_counties.csv'")
Geocoded county data saved to 'geocoded_population_counties.csv'
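The output above shows None for every county, which means each lookup either raised an error or returned no match. The cause is not identified here. Two things are worth ruling out: Nominatim's public service asks for at most one request per second, and a query such as "Adams, Illinois, USA" does not say that a county is wanted. The sketch below is one possible way to address both, using geopy's `RateLimiter` and a query that spells out "County". The helper names, the `timeout` value, and the query wording are assumptions, and the demonstration uses a stand-in geocoder with invented coordinates so that it runs offline.

```python
from types import SimpleNamespace
from typing import Callable


def make_rate_limited_geocoder() -> Callable:
    """Build a Nominatim geocoder that waits at least one second between calls.

    geopy is imported here so the helper below can be tried without it.
    """
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent="illinois_exoneration_geocode", timeout=10)
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)


def geocode_counties(counties, state, geocode) -> tuple:
    """Geocode each county name and report the ones that returned nothing."""
    records, failures = [], []
    for county in counties:
        # Assumption: spelling out "County" steers the search to the county itself
        location = geocode(f"{county} County, {state}, USA")
        if location is None:
            failures.append(county)
            continue
        records.append(
            {
                "County": county,
                "geocode_address": location.address,
                "latitude": location.latitude,
                "longitude": location.longitude,
            }
        )
    return records, failures


# Offline demonstration with a stand-in geocoder and invented coordinates
def fake_geocode(query: str):
    if query.startswith("Adams"):
        return SimpleNamespace(address=query, latitude=40.0, longitude=-91.0)
    return None


records, failures = geocode_counties(["Adams", "Bond"], "Illinois", fake_geocode)
print(len(records), failures)  # 1 ['Bond']
```

With network access, passing `make_rate_limited_geocoder()` as the `geocode` argument would run the real lookups, and any names left in `failures` could then be retried or checked by hand before the CSV is written.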

Mapping: Illinois County

To visualize the geocoded data and perform geographic exploratory data analysis (EDA), I needed a shapefile of Illinois county boundaries. The Census Bureau provides Illinois shapefiles, but compatibility issues kept them from working as expected for this analysis.

Instead, I sourced a shapefile directly from the Illinois State Geological Survey (ISGS), which included county boundaries in a format compatible with the GIS tools I used. The ISGS shapefile provides the geographic foundation for mapping trends across Illinois counties, adding a crucial layer of context to the datasets.6

Summary

Challenges

The foundation for meaningful data analysis is accurate and reliable crime data, yet long-standing challenges in criminal history record systems continue to undermine their precision and dependability.7 As outlined in the Use and Management of Criminal History Record Information Report (1993), issues such as incomplete data reporting, delays in recording arrests and case dispositions, and inconsistent fingerprint submissions have plagued the system for decades.7 Criminal History Record Information (CHRI), the backbone for arrest datasets like those from the Illinois Criminal Justice Information Authority (ICJIA), is vulnerable to these shortcomings. Dispositions, or the outcomes of arrests, are often missing or delayed, leaving critical gaps in the data that make it difficult to analyze systemic trends. While advances in technology, such as digital fingerprinting, have improved some processes, data fragmentation and insufficient oversight remain persistent barriers to comprehensive and accurate reporting.

Fast forward thirty years, and many of these challenges remain unresolved, now further complicated by uneven implementation of modern systems. The Marshall Project highlights a striking example: in 2022, the FBI’s shift to the National Incident-Based Reporting System (NIBRS) created a significant data gap in national crime statistics.8 Over 6,000 police agencies failed to submit their data, representing nearly one-third of all agencies and leaving vast portions of the U.S. population unaccounted for. This includes major departments like the NYPD and LAPD, alongside countless smaller agencies.8 These gaps reflect broader systemic failures (inconsistent adoption of updated systems, lack of adequate funding, and minimal oversight) that echo the same issues identified decades ago.

In short, crime data, even when sourced from official systems, remains inherently flawed and incomplete. Whether due to outdated processes, inconsistent reporting practices, privacy-driven modifications, or gaps in modern collection systems, crime data often falls short of providing a fully accurate or comprehensive picture. As a result, any analysis relying on this data must account for these imperfections, recognizing that while the data can uncover critical trends and systemic disparities, it is rarely a perfect representation of reality.

Arrest Data Precision

The Illinois arrest dataset obtained from the ICJIA Arrest Explorer highlights the trade-offs between privacy and precision, adding to the broader challenges of criminal justice data. To protect confidentiality, counts under 10 are approximated—values between 0 and 4 are replaced with 1, while counts from 5 to 9 are replaced with 6. Though this approach is necessary to safeguard sensitive data, it can introduce distortions, particularly in smaller counties or demographic groups, where even slight approximations can significantly skew trends and reduce the accuracy of analysis. Compounding these issues, the lack of oversight in reporting further undermines data consistency and reliability, reflecting the broader systemic limitations that continue to plague many criminal justice datasets.
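One way to keep this uncertainty visible is to carry a range rather than a single number for each published value. The sketch below inverts the approximation rule to find the smallest and largest true totals consistent with a set of published counts. The function names are hypothetical, values of 10 or more are assumed to be exact, and the ±1 adjustment to subtotals is ignored.

```python
def true_count_bounds(published: int) -> tuple:
    """Return the (lowest, highest) true count consistent with a published value."""
    if published == 1:
        return (0, 4)  # 1 stands in for any count from 0 to 4
    if published == 6:
        return (5, 9)  # 6 stands in for any count from 5 to 9
    return (published, published)  # assumed exact for counts of 10 or more


def total_bounds(published_counts) -> tuple:
    """Sum the per-cell bounds to bracket the true total."""
    bounds = [true_count_bounds(c) for c in published_counts]
    return (sum(lo for lo, _ in bounds), sum(hi for _, hi in bounds))


# Ten published county counts that were all approximated to 1
print(total_bounds([1] * 10))  # (0, 40)
# A mix of approximated and exact values
print(total_bounds([1, 6, 6, 128]))  # (138, 150)
```

For a group whose county counts are mostly published as 1, the range can be wide relative to the published sum, which is why trends for small counties and small demographic groups need cautious interpretation.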

Benchmarks

The National Registry of Exonerations is a trusted and widely used resource, frequently cited in academic and legal research. Unlike government databases, the Registry is run by a team of researchers and academics, which brings a level of precision and thoroughness often missing in state-managed systems.9 Its foundation in meticulous documentation and independent research makes it less vulnerable to political or institutional biases. As a result, the Registry stands out as a more reliable and comprehensive tool for understanding wrongful convictions.

Conclusion and Future Steps

The data collected from various sources provide a foundation for examining systemic racial disparities in over-policing and wrongful convictions. However, challenges such as the lack of precision in Illinois arrest datasets and broader gaps in national crime reporting underscore the need to approach findings with caution. Even with these limitations, the use of reliable resources like the National Registry of Exonerations adds credibility to the analysis. Moving forward, future research could focus on compiling crime data from a single state across local, state, and federal agencies to better understand inconsistencies and uncover patterns of underreporting. Expanding this work to compare data collection practices across states could also reveal regional differences in crime reporting and highlight systemic disparities in how data is recorded and shared.

References

1. Illinois Criminal Justice Information Authority. (2024). Overview arrest explorer. In Arrest Explorer. https://icjia.illinois.gov/arrestexplorer/docs/#what-data-is-available
2. Illinois Criminal Justice Information Authority. (2024). Arrests by race, county, and year. In Arrest Explorer. https://icjia.illinois.gov/arrestexplorer/
3. University of California Irvine Newkirk Center for Science & Society, University of Michigan Law School, & Michigan State University College of Law. (2024). The National Registry of Exonerations - Exoneration Registry. In The National Registry of Exonerations. https://www.law.umich.edu/special/exoneration/Pages/about.aspx
4. University of California Irvine Newkirk Center for Science & Society, University of Michigan Law School, & Michigan State University College of Law. (2024). The National Registry of Exonerations - Glossary. In The National Registry of Exonerations. https://www.law.umich.edu/special/exoneration/Pages/glossary.aspx
5. Prison Policy Initiative. (2024). Appendix A. Counties – Ratios of Overrepresentation. https://www.prisonpolicy.org/racialgeography/counties.html
6. Illinois State Geological Survey. (1984). Illinois county boundaries (2.0 ed.). Illinois State Geological Survey. https://clearinghouse.isgs.illinois.edu/data/reference/illinois-county-boundaries-polygons-and-lines
7. Woodard, P. L., & Belair, R. R. (1993). Use and Management of Criminal History Record Information: A Comprehensive Report. In Bureau of Justice Statistics. https://bjs.ojp.gov/library/publications/use-and-management-criminal-history-record-information-comprehensive-report
8. Li, W., & Ricard, J. (2023). Many Large U.S. Police Agencies Are Missing from FBI Crime Data. In The Marshall Project. https://www.themarshallproject.org/2023/07/13/fbi-crime-rates-data-gap-nibrs
9. University of California Irvine Newkirk Center for Science & Society, University of Michigan Law School, & Michigan State University College of Law. (2024). The National Registry of Exonerations - Our Mission. In The National Registry of Exonerations. https://www.law.umich.edu/special/exoneration/Pages/mission.aspx