Counterfactual Data Balancing

Introduction and Motivation

In this project, I set out to create a balanced dataset that would support supervised learning models for predicting the factors linked to exonerations. At the heart of this process is counterfactual balancing: building a dataset that includes exonerated individuals alongside a comparable group of non-exonerated individuals, drawn to reflect the broader incarcerated population in Illinois. This balance is critical—it allows the model to make fair and meaningful comparisons when identifying patterns and predictors of exoneration outcomes.

Why Use Counterfactual Data?

Counterfactual data is a necessity when access to complete prison population records is unavailable. Since I don’t have access to a full dataset of all incarcerated individuals in Illinois and their exoneration statuses (e.g., exonerated, not exonerated), I relied on counterfactuals to bridge the gap and construct a balanced dataset.

Counterfactuals allow us to ask “what if?” questions. What if an exonerated person had not been exonerated? Would their characteristics look similar to those of non-exonerated individuals? Counterfactual data helps isolate these comparisons by holding everything else constant except the hypothetical condition, which in this case is exoneration.

As explained in Pearl’s primer on counterfactuals [1], a counterfactual statement operates on an unrealized “if” condition. The “if” portion, also known as the antecedent, frames the comparison: exonerated individuals versus those who weren’t. This approach is meant to reduce bias by training the model on data that is balanced and representative of the incarcerated population.

Acknowledgments

The implementation of this counterfactual data balancing relied heavily on expert guidance and code contributions from Professor Jeff Jacobs. His insights and support were invaluable in refining the methodology and making this process possible.

Narrowing to Incarcerated Population

To focus on the incarcerated population in Illinois, the dataset was filtered to include only relevant columns that captured key demographic details, such as total incarcerated populations broken down by race—White, Black, and Latino. This step ensured that the precise subset of data needed for balancing was used while also laying the groundwork for simulating representative draws from the Illinois incarcerated population.

import pandas as pd
import numpy as np
from tqdm import (
    tqdm,
)  # Adds progress bars to loops and other iterable processes for better visualization.

tqdm.pandas()  # Allows progress bars to appear during DataFrame operations.
il_df = pd.read_csv("../../data/processed-data/representation_by_county.csv")
il_df = il_df[il_df["state"] == "Illinois"].copy()
il_df.head(3)
county state total_population total_white_population total_black_population total_latino_population incarcerated_population incarcerated_white_population incarcerated_black_population incarcerated_latino_population non-incarcerated_population non-incarcerated_white_population non-incarcerated_black_population non-incarcerated_latino_population ratio_of_overrepresentation_of_whites_incarcerated_compared_to_whites_non-incarcerated ratio_of_overrepresentation_of_blacks_incarcerated_compared_to_blacks_non-incarcerated ratio_of_overrepresentation_of_latinos_incarcerated_compared_to_latinos_non-incarcerated
0 Adams Illinois 67103 62414 2331 776 110 73 36 0 66993 62341 2295 776 0.71 9.54 0.00
1 Alexander Illinois 8238 4983 2915 155 411 89 242 79 7827 4894 2673 76 0.35 1.72 19.82
2 Bond Illinois 17768 15797 1080 547 1542 500 657 304 16226 15297 423 243 0.34 16.32 13.14

Columns are renamed to streamline the analysis, removing unnecessary verbosity while retaining clarity.

rename_map = {
    "county": "county",
    "state": "state",
    "incarcerated_population": "Total",
    "incarcerated_white_population": "White",
    "incarcerated_black_population": "Black",
    "incarcerated_latino_population": "Latino",
}

# Keep only the cols in the rename_map
cols_to_keep = list(rename_map.keys())
il_df = il_df[cols_to_keep].copy()

# And do the renaming
il_df.rename(columns=rename_map, inplace=True)
il_df.head()
county state Total White Black Latino
0 Adams Illinois 110 73 36 0
1 Alexander Illinois 411 89 242 79
2 Bond Illinois 1542 500 657 304
3 Boone Illinois 71 38 12 21
4 Brown Illinois 2059 419 1267 367

To align the data with the exoneration registry, county names were cleaned so that any trailing word “County” (e.g., “Cook County”) is removed, since the registry uses the bare name (like “Cook”). This keeps county names consistent across the two datasets.

A state_prop column was then added to represent the proportion of all Illinois inmates coming from each county. This was calculated by dividing each county’s total incarcerated population (Total) by the sum of Total across all counties. Sorting the values in descending order highlighted the counties with the largest share of the state’s incarcerated population.

# Since the Exoneree project uses just the county name (like "Cook"), we'll remove any trailing " County" (so, e.g., "Cook County" will turn into just "Cook"):
il_df["county"] = il_df["county"].str.replace(r"\s+County$", "", regex=True)

# Compute a state_prop column representing the % of all Illinois inmates contained in each county:
il_df["state_prop"] = il_df["Total"] / il_df["Total"].sum()
il_df.sort_values(by="state_prop", ascending=False).head()
county state Total White Black Latino state_prop
15 Cook Illinois 11649 1769 8369 1468 0.164469
98 Will Illinois 3902 811 2528 538 0.055091
78 Randolph Illinois 3571 934 2250 377 0.050418
53 Logan Illinois 3060 963 1705 389 0.043203
52 Livingston Illinois 2798 905 1577 294 0.039504

From the output, Cook County stands out, contributing roughly 16% of Illinois’ incarcerated individuals, followed by Will, Randolph, Logan, and Livingston counties. This helps identify where most of the incarcerated population is concentrated, which will be key for balancing comparisons in the analysis.

# To avoid confusing the state_prop value with the sampled proportion that we compute below, we can drop state_prop now:
il_df = il_df.drop(columns=["state_prop"])
# Since they're only tracking three racial groups, the sum of the three race counts need not equal the total incarcerated population. Let's check:
il_df["three_cat_total"] = il_df["Black"] + il_df["White"] + il_df["Latino"]
il_df.head()
county state Total White Black Latino three_cat_total
0 Adams Illinois 110 73 36 0 109
1 Alexander Illinois 411 89 242 79 410
2 Bond Illinois 1542 500 657 304 1461
3 Boone Illinois 71 38 12 21 71
4 Brown Illinois 2059 419 1267 367 2053

To ensure the sample accurately represents the county-by-county distributions, the difference between three_cat_total and Total was used to construct the “Other” category.

il_df["Other"] = il_df["Total"] - il_df["three_cat_total"]
il_df.head()
county state Total White Black Latino three_cat_total Other
0 Adams Illinois 110 73 36 0 109 1
1 Alexander Illinois 411 89 242 79 410 1
2 Bond Illinois 1542 500 657 304 1461 81
3 Boone Illinois 71 38 12 21 71 0
4 Brown Illinois 2059 419 1267 367 2053 6

The data source doesn’t provide much documentation, but it seems like some counties might be double-counting individuals who report more than one race. This assumption comes from the fact that, in some cases, the three_cat_total values (sum of White, Black, and Latino counts) are higher than the overall Total population for those counties.

il_df[il_df["three_cat_total"] > il_df["Total"]]
county state Total White Black Latino three_cat_total Other
13 Clinton Illinois 1599 486 917 199 1602 -3
16 Crawford Illinois 1230 310 782 141 1233 -3
25 Fayette Illinois 1527 467 933 129 1529 -2
40 Jefferson Illinois 1857 827 812 224 1863 -6
50 Lawrence Illinois 2358 486 1490 393 2369 -11
59 Madison Illinois 14 0 11 14 25 -11
60 Marion Illinois 114 69 37 10 116 -2
91 Vermilion Illinois 2084 536 1236 319 2091 -7
95 Wayne Illinois 2 0 2 2 4 -2
96 White Illinois 72 35 19 36 90 -18

Most of these discrepancies are small relative to the county totals. Madison County and White County are notable exceptions, but resolving them would require direct input from correctional facilities, which is beyond the scope of this project. In all of these cases, the negative “Other” value was set to 0.

# Negative values arise where the three race counts exceed Total; set them to 0
il_df["Other"] = il_df["Other"].clip(lower=0)

# Drop three_cat_total, since we only needed that in order to form the other count:
il_df.drop(columns=["three_cat_total"], inplace=True, errors="ignore")

#  Store these names in a list for future use (to ensure consistency in naming throughout):
race_category_names = ["White", "Black", "Latino", "Other"]
il_df.head()
county state Total White Black Latino Other
0 Adams Illinois 110 73 36 0 1
1 Alexander Illinois 411 89 242 79 1
2 Bond Illinois 1542 500 657 304 81
3 Boone Illinois 71 38 12 21 0
4 Brown Illinois 2059 419 1267 367 6

Illinois Exoneree Counts/Demographics

The Illinois exoneration data was loaded, and the total number of exonerated individuals was calculated by taking the length of the dataframe using len(exon_il_df). The result: 548 exonerations. This serves as the starting point for understanding the scope of exoneration cases in Illinois.

exon_il_df = pd.read_csv("../../data/processed-data/illinois_exoneration_data.csv")
exon_il_df.head(3)
last_name first_name age race sex state county latitude longitude worst_crime_display ... child_welfare_worker_misconduct withheld_exculpatory_evidence misconduct_that_is_not_withholding_evidence knowingly_permitting_perjury witness_tampering_or_misconduct_interrogating_co_defendant misconduct_in_interrogation_of_exoneree perjury_by_official prosecutor_lied_in_court tag_sum geocode_address
0 Abbott Cinque 19.0 Black male Illinois Cook 41.819738 -87.756525 Drug Possession or Sale ... 0 1 1 0 0 0 0 0 7 Cook County, Illinois, United States
1 Abernathy Christopher 17.0 White male Illinois Cook 41.819738 -87.756525 Murder ... 0 1 1 0 0 1 0 0 10 Cook County, Illinois, United States
2 Abrego Eruby 20.0 Hispanic male Illinois Cook 41.819738 -87.756525 Murder ... 0 1 1 0 1 1 1 0 9 Cook County, Illinois, United States

3 rows × 49 columns

num_il = len(exon_il_df)
num_il
548

The value_counts() function was applied to the race column with normalize=True to calculate the proportion of exonerated individuals by race in Illinois. The results highlight significant disparities:

  • Black individuals make up the majority of exonerations at 76.3%.
  • Hispanic individuals account for 14.8%, while White individuals represent only 8.6%.
  • The remaining categories, including Asian and Native American, each comprise less than 0.2% of exonerations.
exon_il_df["race"].value_counts(normalize=True)
race
Black              0.762774
Hispanic           0.147810
White              0.085766
Asian              0.001825
Native American    0.001825
Name: proportion, dtype: float64

Since the Prison Policy Initiative demographic data only includes Black, White, Latino, and Other as race categories, “Hispanic” was first renamed to “Latino” for consistency. “Asian” and “Native American” were then combined into the “Other” category. To preserve the original race data, it was saved into a new column called Race_orig for future reference if needed.

recode_map = {
    "Black": "Black",
    "Hispanic": "Latino",
    "White": "White",
    "Asian": "Other",
    "Native American": "Other",
}
exon_il_df["Race_orig"] = exon_il_df["race"]
exon_il_df["race"] = exon_il_df["race"].apply(lambda x: recode_map[x])
exon_il_df["race"].value_counts(normalize=True)
race
Black     0.762774
Latino    0.147810
White     0.085766
Other     0.003650
Name: proportion, dtype: float64

Sampling from the Incarcerated Population

Draw Representative Samples

The first step in the simulation is to draw a representative sample of 548 “people” from the Illinois prison population. To achieve this, a weighted random sample with replacement was performed from the il_df dataset. Sampling weights were determined based on each county’s total incarcerated population, ensuring that counties with larger populations contributed proportionally more to the sample.

A random seed (random_state=5000) was set to ensure the results are replicable. This step produces a valid population-weighted sample where the only known characteristic of each “person” is their county.

il_sample_df = il_df.sample(
    num_il,
    replace=True,
    weights=il_df["Total"],
    random_state=5000,
).copy()
il_sample_df.head()
county state Total White Black Latino Other
15 Cook Illinois 11649 1769 8369 1468 43
36 Henry Illinois 301 172 108 21 0
72 Perry Illinois 2323 561 1398 352 12
15 Cook Illinois 11649 1769 8369 1468 43
53 Logan Illinois 3060 963 1705 389 3
il_sample_df["county"].value_counts(normalize=True).head()
county
Cook        0.142336
Will        0.060219
Randolph    0.056569
Perry       0.040146
Logan       0.040146
Name: proportion, dtype: float64
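
A quick way to sanity-check a weighted sample like this one is to put the sampled shares next to the population shares. The sketch below does that on a toy table; the county names, counts, sample size, and seed are all made up for illustration and are not the Illinois data.

```python
import pandas as pd

# Toy county table (hypothetical values, not the Illinois data)
toy_df = pd.DataFrame({"county": ["A", "B", "C"], "Total": [700, 200, 100]})

# Same call pattern as above: sample with replacement, weighted by Total
toy_sample = toy_df.sample(
    5000, replace=True, weights=toy_df["Total"], random_state=0
)

# Line up each county's population share with its share of the sample
comparison = pd.DataFrame(
    {
        "population_share": toy_df.set_index("county")["Total"]
        / toy_df["Total"].sum(),
        "sample_share": toy_sample["county"].value_counts(normalize=True),
    }
)
comparison["abs_diff"] = (
    comparison["sample_share"] - comparison["population_share"]
).abs()
print(comparison)
```

With only 548 draws, as in the real sample, gaps of a couple of percentage points between the two columns are to be expected.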

Simulating Racial Distribution

To replicate the racial makeup of the incarcerated population, racial counts for each county were used to create a probability distribution for race. For each row in il_sample_df (which represents a sampled county), a distribution was formed based on the race-specific counts, and a single “person” was drawn from that distribution.

This process was done row by row using the choice() method of a NumPy random Generator. The generator (rng) was created with a fixed seed (5000) so that the results remain consistent and replicable across runs.

rng = np.random.default_rng(seed=5000)


def draw_race_sample(row):
    race_counts = [row[cur_val] for cur_val in race_category_names]
    total_count = sum(race_counts)
    race_probs = [cur_count / total_count for cur_count in race_counts]
    # And now we have a probability distribution! We can use rng.choice() to sample from it
    sampled_vals = rng.choice(race_category_names, size=1, p=race_probs)
    # We only sampled 1 value here, so we use [0] to extract it
    sampled_val = list(sampled_vals)[0]
    return sampled_val

Before sampling, the function was tested by drawing multiple samples for a specific county—Cook County, in this case. To verify its accuracy, the expected proportions for sampling N inmates from Cook were first computed.

# Copy the row so that adding the *_prop entries below does not touch il_df
cook_row = il_df[il_df["county"] == "Cook"].iloc[0].copy()
for cname in race_category_names:
    cook_row[f"{cname}_prop"] = cook_row[cname] / cook_row["Total"]
cook_row
county             Cook
state          Illinois
Total             11649
White              1769
Black              8369
Latino             1468
Other                43
White_prop     0.151859
Black_prop     0.718431
Latino_prop    0.126019
Other_prop     0.003691
Name: 15, dtype: object

This means that if the draw_race_sample() function is working correctly, it should generate “White” 15.2% of the time, “Black” 71.8% of the time, and so on. To confirm this, a sample of size N=5000 was generated from Cook County to check whether the proportions align with the expected values.

N = 5000
cook_samples = [draw_race_sample(cook_row) for _ in range(N)]
cook_sample_df = pd.DataFrame(cook_samples, columns=["Race"])
cook_sample_df["Race"].value_counts(normalize=True)
Race
Black     0.7186
White     0.1518
Latino    0.1260
Other     0.0036
Name: proportion, dtype: float64

The results look good and are very close to the expected proportions, which confirms that the draw_race_sample() function is working as intended. With this validation, the function can now be used to sample a race value for each row in il_sample_df.

This step also introduces the tqdm library, which is useful for tracking progress when running simulations like this. It helps monitor how long the code takes per row, ensuring the simulation remains efficient.

il_sample_df["Race"] = il_sample_df.progress_apply(draw_race_sample, axis=1)
100%|██████████| 548/548 [00:00<00:00, 8289.38it/s]
sample_cols_to_keep = ["county", "state", "Race"]
il_sample_df = il_sample_df[sample_cols_to_keep].copy()
il_sample_df
county state Race
15 Cook Illinois Black
36 Henry Illinois White
72 Perry Illinois Black
15 Cook Illinois Black
53 Logan Illinois Black
... ... ... ...
51 Lee Illinois White
10 Christian Illinois Black
25 Fayette Illinois Black
44 Kane Illinois White
52 Livingston Illinois Black

548 rows × 3 columns
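
For larger simulations, the row-by-row draw could be replaced by a single vectorized pass. The sketch below is one possible approach and not the method used on this page: the function name draw_races_vectorized and the toy counts are hypothetical, and because it consumes random numbers differently it would not reproduce the exact draws shown above. It assumes every row has at least one positive count.

```python
import numpy as np
import pandas as pd

race_category_names = ["White", "Black", "Latino", "Other"]


def draw_races_vectorized(df, categories, rng):
    """Draw one category per row, with probability proportional to the row's counts."""
    counts = df[categories].to_numpy(dtype=float)
    cum_counts = np.cumsum(counts, axis=1)
    totals = cum_counts[:, -1]
    # One uniform draw per row, scaled to that row's total count
    u = rng.random(len(df)) * totals
    # Number of cumulative counts at or below the draw = index of the chosen category
    idx = (u[:, None] >= cum_counts).sum(axis=1)
    # Guard against floating-point edge cases at the upper boundary
    idx = np.minimum(idx, len(categories) - 1)
    return pd.Series(np.array(categories)[idx], index=df.index)


# Toy rows where all the mass sits in one category, so the result is deterministic
toy_counties = pd.DataFrame(
    {"White": [10, 0, 0], "Black": [0, 0, 3], "Latino": [0, 5, 0], "Other": [0, 0, 0]}
)
toy_draws = draw_races_vectorized(
    toy_counties, race_category_names, np.random.default_rng(5000)
)
print(toy_draws.tolist())  # ['White', 'Latino', 'Black']
```

At 548 rows the apply-based version is already fast, so this only matters if the simulation is scaled up or repeated many times.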

Let’s take a look at the racial distribution of the Cook County subset from our sample to see how it turned out:

cook_sample_df = il_sample_df[il_sample_df["county"] == "Cook"].copy()
cook_sample_df["Race"].value_counts(normalize=True)
Race
Black     0.743590
Latino    0.166667
White     0.089744
Name: proportion, dtype: float64

The Cook County subset shows more Latinos and fewer Whites than the population proportions would predict, and no one in the “Other” category was drawn. While this might seem odd, it is an expected consequence of the sampling process: only 78 of the 548 sampled rows (14.2%) come from Cook, and a subset that small can drift noticeably from the population proportions by chance. The goal here is to simulate a simplified model of the Exoneration Registry, in which the exonerees are treated as a size-548 subset of people incarcerated in Illinois. This allows for a direct comparison with another size-548 subset drawn from those still incarcerated in the state.
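
To get a rough sense of whether these gaps are within ordinary sampling noise, each observed Cook proportion can be compared with its expected value in units of the binomial standard error, sqrt(p(1 - p) / n). The sketch below uses the proportions printed above; the subset size of 78 is derived from Cook's 14.2336% share of the 548 draws, and the helper name z_score is my own.

```python
import math

# Cook rows in the sample: 0.142336 * 548 = 78 (derived from the output above)
n_cook = 78

# Expected (population) and observed (sample) proportions for Cook County
expected = {"White": 0.151859, "Black": 0.718431, "Latino": 0.126019}
observed = {"White": 0.089744, "Black": 0.743590, "Latino": 0.166667}


def z_score(p_expected, p_observed, n):
    """Gap between observed and expected proportion, in binomial standard errors."""
    standard_error = math.sqrt(p_expected * (1 - p_expected) / n)
    return (p_observed - p_expected) / standard_error


z_scores = {
    race: z_score(expected[race], observed[race], n_cook) for race in expected
}
for race, z in z_scores.items():
    print(f"{race}: {z:+.2f} standard errors from the expected proportion")
```

All three gaps are smaller than two standard errors, which is consistent with chance variation in a subset of this size rather than a problem with the sampling function.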

With this step completed, the 548 rows from il_sample_df can now be combined with the 548 rows in exon_il_df, creating a balanced DataFrame with a total of 1,096 rows. Half of these rows represent exonerated individuals from Illinois, and the other half represent non-exonerated individuals, sampled to be statistically representative of Illinois’ incarcerated population as a whole.

Constructing the Final Balanced Dataset

To prepare the final balanced dataset, a new Label column was added to distinguish between exonerated and non-exonerated individuals:

  • The Label column in exon_il_df was set to “Exonerated”.
  • The Label column in il_sample_df was set to “Non-Exonerated”.

To avoid confusion when combining datasets, the county column in il_sample_df was renamed to County. With the labels in place and columns aligned, both datasets were combined into a single DataFrame using pd.concat().

Next, a race mapping was applied so that both halves of the dataset use the same four race categories as the Prison Policy Initiative data:

  • “Hispanic” maps to “Latino”.
  • “Asian” and “Native American” map to “Other”.
  • “Black” and “White” are kept as-is.

Because the race column in exon_il_df was already recoded earlier, the mapping also has to pass “Latino” and “Other” through unchanged; any label missing from the mapping becomes NaN under .map().

To clean up, the race and Race columns were combined, prioritizing non-NaN values to ensure no data was lost. The original race column was then dropped. Similarly, the county and County columns were merged, and the original county column was removed to streamline the final DataFrame.

Finally, the resulting Race and County columns were checked to confirm the expected values, and the first few rows of the balanced dataset were displayed to verify everything was in place.

# Construct our new label: exonerated vs. non-exonerated
exon_il_df["Label"] = "Exonerated"
il_sample_df["Label"] = "Non-Exonerated"
il_sample_df = il_sample_df.rename(
    columns={"county": "County"}
)  # Rename to distinguish when combining datasets

# And combine!
balanced_df = pd.concat([exon_il_df, il_sample_df], axis=0)
# Define the mapping for 'race'.
# 'race' in exon_il_df was already recoded above (Hispanic -> Latino; Asian and
# Native American -> Other), so the recoded labels must be included here too:
# any label missing from the mapping becomes NaN under .map().
race_mapping = {
    "Black": "Black",
    "White": "White",
    "Latino": "Latino",
    "Other": "Other",
    "Hispanic": "Latino",
    "Asian": "Other",
    "Native American": "Other",
}


# Map the 'race' column
balanced_df["race"] = balanced_df["race"].map(race_mapping)

# Combine 'race' and 'Race' columns, prioritizing non-NaN values
balanced_df["Race"] = balanced_df["race"].combine_first(balanced_df["Race"])

# Drop the old 'race' column
balanced_df.drop(columns=["race"], inplace=True)

# Combine 'county' and 'County' columns, prioritizing non-NaN values
balanced_df["County"] = balanced_df["county"].combine_first(balanced_df["County"])

# Drop the old 'county' column
balanced_df.drop(columns=["county"], inplace=True)

# Verify the final Race column
print(balanced_df["Race"].value_counts())
print(balanced_df["County"].value_counts())
balanced_df.head()
Race
Black     709
White     207
Latino    173
Other       7
Name: count, dtype: int64
County
Cook           552
Will            37
Randolph        31
Jefferson       23
Logan           22
Perry           22
Livingston      22
Fulton          21
Johnson         21
Tazewell        19
Lawrence        18
Montgomery      17
Bond            17
Vermilion       16
DuPage          15
Winnebago       15
Lake            14
St. Clair       14
Clinton         14
La Salle        14
Fayette         13
Lee             13
Brown           12
Kane            12
Knox            11
Peoria          10
Morgan          10
Rock Island     10
Macon            9
Crawford         9
McHenry          7
Christian        6
Williamson       6
Champaign        5
McLean           4
Sangamon         4
Henry            3
Kankakee         3
Stephenson       2
Woodford         2
Edgar            2
Effingham        2
Iroquois         2
Adams            2
Richland         1
Menard           1
Pope             1
Madison          1
Boone            1
Jackson          1
Cumberland       1
Washington       1
Dupage           1
Dekalb           1
Moultrie         1
LaSalle          1
De Witt          1
Name: count, dtype: int64
last_name first_name age sex state latitude longitude worst_crime_display sentence sentence_in_years ... witness_tampering_or_misconduct_interrogating_co_defendant misconduct_in_interrogation_of_exoneree perjury_by_official prosecutor_lied_in_court tag_sum geocode_address Race_orig Label County Race
0 Abbott Cinque 19.0 male Illinois 41.819738 -87.756525 Drug Possession or Sale Probation 0.0 ... 0.0 0.0 0.0 0.0 7.0 Cook County, Illinois, United States Black Exonerated Cook Black
1 Abernathy Christopher 17.0 male Illinois 41.819738 -87.756525 Murder Life without parole 100.0 ... 0.0 1.0 0.0 0.0 10.0 Cook County, Illinois, United States White Exonerated Cook White
2 Abrego Eruby 20.0 male Illinois 41.819738 -87.756525 Murder 90 years 90.0 ... 1.0 1.0 1.0 0.0 9.0 Cook County, Illinois, United States Hispanic Exonerated Cook Latino
3 Adams Demetris 22.0 male Illinois 41.819738 -87.756525 Drug Possession or Sale 1 year 1.0 ... 0.0 0.0 0.0 0.0 7.0 Cook County, Illinois, United States Black Exonerated Cook Black
4 Adams Kenneth 22.0 male Illinois 41.819738 -87.756525 Murder 75 years 75.0 ... 1.0 0.0 0.0 0.0 11.0 Cook County, Illinois, United States Black Exonerated Cook Black

5 rows × 51 columns
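
The County counts above include a few names that appear under more than one spelling (for example “DuPage” and “Dupage”, or “La Salle” and “LaSalle”), most likely because the two source datasets format county names differently. One way to surface such variants is to group the names by a normalized key. This is only a sketch on a hand-picked list of names; the normalization rule (lowercase, letters only) is an assumption and may merge names that should stay distinct.

```python
import pandas as pd

# A few county names taken from the value counts above
counties = pd.Series(
    ["DuPage", "Dupage", "La Salle", "LaSalle", "Cook", "De Witt", "St. Clair"]
)

# Normalized key: lowercase with everything except letters removed
key = counties.str.lower().str.replace(r"[^a-z]", "", regex=True)

# Collect the distinct spellings that share a key
variants = counties.groupby(key).unique()
inconsistent = variants[variants.map(len) > 1]
print(inconsistent)
```

Any key that maps to more than one spelling is a candidate for harmonizing before counties are compared across the exonerated and non-exonerated halves.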

balanced_df.to_csv("../../data/processed-data/exonerees_balanced.csv", index=False)
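
Before relying on the saved file, it can help to run a few structural checks: that the two labels are equally represented and that the key columns contain no missing values. The helper below is a hypothetical sketch (check_balanced is not part of the original pipeline) and is demonstrated on a tiny made-up frame rather than on balanced_df.

```python
import pandas as pd


def check_balanced(df, label_col="Label", required_cols=("Race", "County")):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    label_counts = df[label_col].value_counts()
    # Expect exactly two labels with the same number of rows each
    if len(label_counts) != 2 or label_counts.nunique() != 1:
        problems.append(f"unbalanced labels: {label_counts.to_dict()}")
    # Expect no missing values in the key columns
    for col in required_cols:
        n_missing = int(df[col].isna().sum())
        if n_missing > 0:
            problems.append(f"{n_missing} missing values in {col}")
    return problems


# Toy frame (made-up rows) with one missing Race value
toy_balanced = pd.DataFrame(
    {
        "Label": ["Exonerated", "Exonerated", "Non-Exonerated", "Non-Exonerated"],
        "Race": ["Black", None, "White", "Latino"],
        "County": ["Cook", "Cook", "Will", "Logan"],
    }
)
toy_problems = check_balanced(toy_balanced)
print(toy_problems)  # ['1 missing values in Race']
```

Calling check_balanced(balanced_df) just before to_csv() would flag any rows whose Race or County was lost while the two datasets were being combined.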

Summary and Next Steps

The final balanced dataset now consists of 1,096 rows, split evenly between exonerated and non-exonerated individuals. Key steps included creating consistent labels, standardizing race categories, and combining the datasets while ensuring no critical data was lost. The resulting DataFrame provides a clean and structured foundation for further analysis.

Next Steps

This balanced dataset can now be used for supervised learning tasks, such as:

  • Predicting Exoneration Factors: Training machine learning models to identify the characteristics most associated with exoneration outcomes.
  • Comparative Analysis: Exploring differences in demographics, geographic distribution, or other variables between exonerated and non-exonerated individuals.
  • Visualization and Insights: Mapping trends or disparities across counties and racial groups to better understand systemic patterns in wrongful convictions.

With this dataset, models and analyses can provide deeper insights into the factors driving exonerations while ensuring fairness and balance in comparisons.

References

1. Pearl, J. (2016). Counterfactuals and their applications. In Causal inference in statistics: A primer. John Wiley & Sons. https://bayes.cs.ucla.edu/PRIMER/ch4-preview.pdf