Counterfactual Data Balancing

Introduction and Motivation

In this project, I set out to create a balanced dataset that would support supervised learning models for predicting the factors linked to exonerations. At the heart of this process is counterfactual balancing: building a dataset that includes exonerated individuals alongside a comparable group of non-exonerated individuals, drawn to reflect the broader incarcerated population in Illinois. This balance is critical—it allows the model to make fair and meaningful comparisons when identifying patterns and predictors of exoneration outcomes.

Why Use Counterfactual Data?

Counterfactual data is a necessity when access to complete prison population records is unavailable. Since I don’t have access to a full dataset of all incarcerated individuals in Illinois and their exoneration statuses (e.g., exonerated, not exonerated), I relied on counterfactuals to bridge the gap and construct a balanced dataset.

Counterfactuals allow us to ask “what if?” questions. What if an exonerated person had not been exonerated? Would their characteristics look similar to those of non-exonerated individuals? Counterfactual data helps isolate these comparisons by holding everything else constant except the hypothetical condition, which in this case is exoneration.

As explained in Pearl’s primer on counterfactuals, a counterfactual statement operates on an unrealized “if” condition. The “if” portion, also known as the antecedent, frames the comparison: exonerated individuals versus those who weren’t. This approach is useful because it helps reduce bias and supports training the model on data that is balanced and representative.¹

Acknowledgments

The implementation of this counterfactual data balancing relied heavily on expert guidance and code contributions from Professor Jeff Jacobs. His insights and support were invaluable in refining the methodology and making this process possible.

Narrowing to Incarcerated Population

To focus on the incarcerated population in Illinois, the dataset was filtered to include only relevant columns that captured key demographic details, such as total incarcerated populations broken down by race—White, Black, and Latino. This step ensured that the precise subset of data needed for balancing was used while also laying the groundwork for simulating representative draws from the Illinois incarcerated population.

import pandas as pd
import numpy as np
from tqdm import tqdm  # Adds progress bars to loops and other iterable processes.

tqdm.pandas()  # Registers progress_apply() so DataFrame operations can show a progress bar.

il_df = pd.read_csv("../../data/processed-data/representation_by_county.csv")
il_df = il_df[il_df["state"] == "Illinois"].copy()
il_df.head(3)

	county	state	total_population	total_white_population	total_black_population	total_latino_population	incarcerated_population	incarcerated_white_population	incarcerated_black_population	incarcerated_latino_population	non-incarcerated_population	non-incarcerated_white_population	non-incarcerated_black_population	non-incarcerated_latino_population	ratio_of_overrepresentation_of_whites_incarcerated_compared_to_whites_non-incarcerated	ratio_of_overrepresentation_of_blacks_incarcerated_compared_to_blacks_non-incarcerated	ratio_of_overrepresentation_of_latinos_incarcerated_compared_to_latinos_non-incarcerated
0	Adams	Illinois	67103	62414	2331	776	110	73	36	0	66993	62341	2295	776	0.71	9.54	0.00
1	Alexander	Illinois	8238	4983	2915	155	411	89	242	79	7827	4894	2673	76	0.35	1.72	19.82
2	Bond	Illinois	17768	15797	1080	547	1542	500	657	304	16226	15297	423	243	0.34	16.32	13.14

Columns are renamed to streamline the analysis, removing unnecessary verbosity while retaining clarity.

rename_map = {
    "county": "county",
    "state": "state",
    "incarcerated_population": "Total",
    "incarcerated_white_population": "White",
    "incarcerated_black_population": "Black",
    "incarcerated_latino_population": "Latino",
}

# Keep only the cols in the rename_map
cols_to_keep = list(rename_map.keys())
il_df = il_df[cols_to_keep].copy()

# And do the renaming
il_df.rename(columns=rename_map, inplace=True)
il_df.head()

	county	state	Total	White	Black	Latino
0	Adams	Illinois	110	73	36	0
1	Alexander	Illinois	411	89	242	79
2	Bond	Illinois	1542	500	657	304
3	Boone	Illinois	71	38	12	21
4	Brown	Illinois	2059	419	1267	367

To align the data with the exoneration registry, a small adjustment was made to clean up the county names. The original dataset listed counties with the trailing word “County” (e.g., “Cook County”), whereas the registry uses simplified names (like “Cook”), so the suffix was removed to keep the names consistent across datasets.

A state_prop column was then added to represent the proportion of all Illinois inmates held in each county. This was calculated by dividing each county’s incarcerated population (Total) by the sum of Total across all counties. Sorting the values in descending order highlighted the counties with the largest share of the state’s incarcerated population.

# Since the Exoneree project uses just the county name (like "Cook"), we'll remove the trailing " County" (so, e.g., "Cook County" will turn into just "Cook").
# The pattern is capitalized to match the data and anchored to the end of the name:
il_df["county"] = il_df["county"].str.replace(r"\s+County$", "", regex=True)

# Compute a state_prop column representing the % of all Illinois inmates contained in each county:
il_df["state_prop"] = il_df["Total"] / il_df["Total"].sum()
il_df.sort_values(by="state_prop", ascending=False).head()

	county	state	Total	White	Black	Latino	state_prop
15	Cook	Illinois	11649	1769	8369	1468	0.164469
98	Will	Illinois	3902	811	2528	538	0.055091
78	Randolph	Illinois	3571	934	2250	377	0.050418
53	Logan	Illinois	3060	963	1705	389	0.043203
52	Livingston	Illinois	2798	905	1577	294	0.039504

From the output, Cook County stands out, contributing roughly 16% of Illinois’ incarcerated individuals, followed by Will, Randolph, Logan, and Livingston counties. This helps identify where most of the incarcerated population is concentrated, which will be key for balancing comparisons in the analysis.

# To avoid confusing the state_prop value with the sampled proportion that we compute below, we can drop state_prop now:
il_df = il_df.drop(columns=["state_prop"])

# Since only three racial groups are tracked, the sum of the three race counts need not equal the total incarcerated population. Let's check:
il_df["three_cat_total"] = il_df["Black"] + il_df["White"] + il_df["Latino"]
il_df.head()

	county	state	Total	White	Black	Latino	three_cat_total
0	Adams	Illinois	110	73	36	0	109
1	Alexander	Illinois	411	89	242	79	410
2	Bond	Illinois	1542	500	657	304	1461
3	Boone	Illinois	71	38	12	21	71
4	Brown	Illinois	2059	419	1267	367	2053

To ensure the sample accurately represents the county-by-county distributions, the difference between three_cat_total and Total was used to construct the “Other” category.

il_df["Other"] = il_df["Total"] - il_df["three_cat_total"]
il_df.head()

	county	state	Total	White	Black	Latino	three_cat_total	Other
0	Adams	Illinois	110	73	36	0	109	1
1	Alexander	Illinois	411	89	242	79	410	1
2	Bond	Illinois	1542	500	657	304	1461	81
3	Boone	Illinois	71	38	12	21	71	0
4	Brown	Illinois	2059	419	1267	367	2053	6

The data source doesn’t provide much documentation, but it seems like some counties might be double-counting individuals who report more than one race. This assumption comes from the fact that, in some cases, the three_cat_total values (sum of White, Black, and Latino counts) are higher than the overall Total population for those counties.

il_df[il_df["three_cat_total"] > il_df["Total"]]

	county	state	Total	White	Black	Latino	three_cat_total	Other
13	Clinton	Illinois	1599	486	917	199	1602	-3
16	Crawford	Illinois	1230	310	782	141	1233	-3
25	Fayette	Illinois	1527	467	933	129	1529	-2
40	Jefferson	Illinois	1857	827	812	224	1863	-6
50	Lawrence	Illinois	2358	486	1490	393	2369	-11
59	Madison	Illinois	14	0	11	14	25	-11
60	Marion	Illinois	114	69	37	10	116	-2
91	Vermilion	Illinois	2084	536	1236	319	2091	-7
95	Wayne	Illinois	2	0	2	2	4	-2
96	White	Illinois	72	35	19	36	90	-18

Since most of these discrepancies are small relative to the county totals, the negative “Other” values were set to 0. Madison County and White County are notable exceptions: they are anomalous, but resolving them is beyond what can be addressed without direct input from correctional facilities.

il_df["Other"] = il_df["Other"].apply(lambda x: 0 if x < 0 else x)

# Drop three_cat_total, since we only needed that in order to form the other count:
il_df.drop(columns=["three_cat_total"], inplace=True, errors="ignore")

# Store these names in a list for future use (to ensure consistency in naming throughout):
race_category_names = ["White", "Black", "Latino", "Other"]
il_df.head()

	county	state	Total	White	Black	Latino	Other
0	Adams	Illinois	110	73	36	0	1
1	Alexander	Illinois	411	89	242	79	1
2	Bond	Illinois	1542	500	657	304	81
3	Boone	Illinois	71	38	12	21	0
4	Brown	Illinois	2059	419	1267	367	6
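The zero-floor step can also be written without a lambda. The following is a minimal sketch on a small made-up frame (toy_df and its counts are illustrative assumptions, not rows from the real data) showing how the residual “Other” count can be computed and floored at 0 with Series.clip():

```python
import pandas as pd

# Toy counts for illustration only; these are not rows from the real dataset.
toy_df = pd.DataFrame(
    {
        "county": ["A", "B", "C"],
        "Total": [100, 50, 20],
        "White": [60, 30, 10],
        "Black": [30, 15, 8],
        "Latino": [9, 5, 5],
    }
)

# Sum of the three tracked groups per county.
three_cat_total = toy_df[["White", "Black", "Latino"]].sum(axis=1)

# Residual category, floored at 0 where the three groups sum past Total.
toy_df["Other"] = (toy_df["Total"] - three_cat_total).clip(lower=0)
print(toy_df)
```

On the real data this would be equivalent to the apply() call above; clip() simply states the intent (no negative counts) more directly.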

Illinois Exoneree Counts/Demographics

The Illinois exoneration data was loaded, and the total number of exonerated individuals was calculated by taking the length of the dataframe using len(exon_il_df). The result: 548 exonerations. This serves as the starting point for understanding the scope of exoneration cases in Illinois.

exon_il_df = pd.read_csv("../../data/processed-data/illinois_exoneration_data.csv")
exon_il_df.head(3)

	last_name	first_name	age	race	sex	state	county	latitude	longitude	worst_crime_display	...	withheld_exculpatory_evidence	misconduct_that_is_not_withholding_evidence	witness_tampering_or_misconduct_interrogating_co_defendant	misconduct_in_interrogation_of_exoneree	perjury_by_official	tag_sum	geocode_address
0	Abbott	Cinque	19.0	Black	male	Illinois	Cook	41.819738	-87.756525	Drug Possession or Sale	...	1	1	0	0	0	7	Cook County, Illinois, United States
1	Abernathy	Christopher	17.0	White	male	Illinois	Cook	41.819738	-87.756525	Murder	...	1	1	0	1	0	10	Cook County, Illinois, United States
2	Abrego	Eruby	20.0	Hispanic	male	Illinois	Cook	41.819738	-87.756525	Murder	...	1	1	1	1	1	9	Cook County, Illinois, United States

3 rows × 49 columns

num_il = len(exon_il_df)
num_il

The value_counts() function was applied to the race column with normalize=True to calculate the proportion of exonerated individuals by race in Illinois. The results highlight significant disparities:

- Black individuals make up the majority of exonerations at 76.3%.
- Hispanic individuals account for 14.8%, while White individuals represent only 8.6%.
- The remaining categories, Asian and Native American, each comprise less than 0.2% of exonerations.

exon_il_df["race"].value_counts(normalize=True)

race
Black              0.762774
Hispanic           0.147810
White              0.085766
Asian              0.001825
Native American    0.001825
Name: proportion, dtype: float64

Since the Prison Policy Initiative demographic data only includes Black, White, Latino, and Other as race categories, “Hispanic” was first renamed to “Latino” for consistency. “Asian” and “Native American” were then combined into the “Other” category. To preserve the original race data, it was saved into a new column called Race_orig for future reference if needed.

recode_map = {
    "Black": "Black",
    "Hispanic": "Latino",
    "White": "White",
    "Asian": "Other",
    "Native American": "Other",
}
exon_il_df["Race_orig"] = exon_il_df["race"]
exon_il_df["race"] = exon_il_df["race"].apply(lambda x: recode_map[x])
exon_il_df["race"].value_counts(normalize=True)

race
Black     0.762774
Latino    0.147810
White     0.085766
Other     0.003650
Name: proportion, dtype: float64
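Recoding is a step where labels can be lost without notice: recode_map[x] raises a KeyError on an unseen label, while Series.map() quietly returns NaN for it. As a hedged sketch, a small helper can check for unmapped labels before mapping. The name recode_strict and the toy labels are assumptions made for illustration; they are not part of the original analysis.

```python
import pandas as pd

recode_map = {
    "Black": "Black",
    "Hispanic": "Latino",
    "White": "White",
    "Asian": "Other",
    "Native American": "Other",
}


def recode_strict(series, mapping):
    """Map labels, failing loudly if a non-missing label is absent from the mapping."""
    unmapped = set(series.dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(f"Unmapped labels: {sorted(unmapped)}")
    return series.map(mapping)


# Toy labels for illustration only.
toy_race = pd.Series(["Black", "Hispanic", "Asian", "White"])
recoded = recode_strict(toy_race, recode_map)
print(recoded.tolist())
```

Applying the same helper to a column that has already been recoded (so that it contains “Latino” rather than “Hispanic”) would raise an error instead of producing missing values.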

Sampling from the Incarcerated Population

Draw Representative Samples

The first step in the simulation is to draw a representative sample of 548 “people” from the Illinois prison population. To achieve this, a weighted random sample with replacement was performed from the il_df dataset. Sampling weights were determined based on each county’s total incarcerated population, ensuring that counties with larger populations contributed proportionally more to the sample.

A random seed (random_state=5000) was set to ensure the results are replicable. This step produces a valid population-weighted sample where the only known characteristic of each “person” is their county.

il_sample_df = il_df.sample(
    num_il,
    replace=True,
    weights=il_df["Total"],
    random_state=5000,
).copy()
il_sample_df.head()

	county	state	Total	White	Black	Latino	Other
15	Cook	Illinois	11649	1769	8369	1468	43
36	Henry	Illinois	301	172	108	21	0
72	Perry	Illinois	2323	561	1398	352	12
15	Cook	Illinois	11649	1769	8369	1468	43
53	Logan	Illinois	3060	963	1705	389	3

il_sample_df["county"].value_counts(normalize=True).head()

county
Cook        0.142336
Will        0.060219
Randolph    0.056569
Perry       0.040146
Logan       0.040146
Name: proportion, dtype: float64
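The sampled county shares should track each county’s share of Total, up to sampling noise; Cook, for example, holds about 16.4% of the incarcerated population and makes up 14.2% of this sample. The sketch below illustrates the comparison on made-up county totals (toy_df, n_draws, and the counts are assumptions for illustration), using the same DataFrame.sample() arguments as above:

```python
import pandas as pd

# Toy county totals for illustration only.
toy_df = pd.DataFrame({"county": ["A", "B", "C"], "Total": [6000, 3000, 1000]})

n_draws = 20_000
toy_sample = toy_df.sample(
    n_draws,
    replace=True,
    weights=toy_df["Total"],  # pandas normalizes the weights to sum to 1
    random_state=5000,
)

# Put the population share and the sampled share side by side, aligned on county.
comparison = pd.DataFrame(
    {
        "population_share": toy_df.set_index("county")["Total"] / toy_df["Total"].sum(),
        "sampled_share": toy_sample["county"].value_counts(normalize=True),
    }
)
comparison["abs_diff"] = (
    comparison["population_share"] - comparison["sampled_share"]
).abs()
print(comparison)
```

With many draws the two columns should nearly coincide; with only 548 draws, gaps of a couple of percentage points are to be expected.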

Simulating Racial Distribution

To replicate the racial makeup of the incarcerated population, racial counts for each county were used to create a probability distribution for race. For each row in il_sample_df (which represents a sampled county), a distribution was formed based on the race-specific counts, and a single “person” was drawn from that distribution.

This process was done row-by-row using the choice() method of a NumPy random Generator. The generator (rng) was created with a fixed seed so that the results remain consistent and replicable across runs.

rng = np.random.default_rng(seed=5000)


def draw_race_sample(row):
    race_counts = [row[cur_val] for cur_val in race_category_names]
    total_count = sum(race_counts)
    race_probs = [cur_count / total_count for cur_count in race_counts]
    # And now we have a probability distribution! We can use rng.choice() to sample from it
    sampled_vals = rng.choice(race_category_names, size=1, p=race_probs)
    # We only sampled 1 value here, so we use [0] to extract it
    sampled_val = list(sampled_vals)[0]
    return sampled_val

Before sampling, the function was tested by drawing multiple samples for a specific county—Cook County, in this case. To verify its accuracy, the expected proportions for sampling N inmates from Cook were first computed.

cook_row = il_df[il_df["county"] == "Cook"].iloc[0].copy()  # copy, so adding the *_prop fields never touches il_df
for cname in race_category_names:
    cook_row[f"{cname}_prop"] = cook_row[cname] / cook_row["Total"]
cook_row

county             Cook
state          Illinois
Total             11649
White              1769
Black              8369
Latino             1468
Other                43
White_prop     0.151859
Black_prop     0.718431
Latino_prop    0.126019
Other_prop     0.003691
Name: 15, dtype: object

This means that if the draw_race_sample() function is working correctly, it should generate “White” 15.2% of the time, “Black” 71.8% of the time, and so on. To confirm this, a sample of size N=5000 was generated from Cook County to check whether the proportions align with the expected values.

N = 5000
cook_samples = [draw_race_sample(cook_row) for _ in range(N)]
cook_sample_df = pd.DataFrame(cook_samples, columns=["Race"])
cook_sample_df["Race"].value_counts(normalize=True)

Race
Black     0.7186
White     0.1518
Latino    0.1260
Other     0.0036
Name: proportion, dtype: float64

The results look good and are very close to the expected proportions, which confirms that the draw_race_sample() function is working as intended. With this validation, the function can now be used to sample a race value for each row in il_sample_df.
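“Very close” can be made a little more concrete by expressing each gap in standard errors. The following is a rough sketch under the assumption that a normal approximation to the binomial is adequate here (it is weakest for the rare “Other” category). It uses the Cook County counts from the table above and draws all N values in one rng.choice() call rather than through draw_race_sample():

```python
import numpy as np

race_category_names = ["White", "Black", "Latino", "Other"]
cook_counts = np.array([1769, 8369, 1468, 43])  # Cook County counts shown above
expected = cook_counts / cook_counts.sum()

rng = np.random.default_rng(seed=5000)
N = 5000
draws = rng.choice(race_category_names, size=N, p=expected)

# Observed share of each category among the N draws.
observed = np.array([(draws == name).mean() for name in race_category_names])

# Binomial standard error of a proportion, and the gap measured in those units.
std_err = np.sqrt(expected * (1 - expected) / N)
z_scores = (observed - expected) / std_err

for name, exp, obs, z in zip(race_category_names, expected, observed, z_scores):
    print(f"{name:<7} expected={exp:.4f} observed={obs:.4f} z={z:+.2f}")
```

Gaps of no more than two or three standard errors are what a correctly working sampler would typically produce.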

This step also uses the tqdm library imported at the top: progress_apply() displays a progress bar and the per-row throughput, which is useful for monitoring how long simulations like this take.

il_sample_df["Race"] = il_sample_df.progress_apply(draw_race_sample, axis=1)

100%|██████████| 548/548 [00:00<00:00, 8289.38it/s]

sample_cols_to_keep = ["county", "state", "Race"]
il_sample_df = il_sample_df[sample_cols_to_keep].copy()
il_sample_df

	county	state	Race
15	Cook	Illinois	Black
36	Henry	Illinois	White
72	Perry	Illinois	Black
15	Cook	Illinois	Black
53	Logan	Illinois	Black
...	...	...	...
51	Lee	Illinois	White
10	Christian	Illinois	Black
25	Fayette	Illinois	Black
44	Kane	Illinois	White
52	Livingston	Illinois	Black

548 rows × 3 columns

Let’s take a look at the racial distribution of the Cook County subset from our sample to see how it turned out:

cook_sample_df = il_sample_df[il_sample_df["county"] == "Cook"].copy()
cook_sample_df["Race"].value_counts(normalize=True)

Race
Black     0.743590
Latino    0.166667
White     0.089744
Name: proportion, dtype: float64

The results show an oversample of Latinos (16.7% versus an expected 12.6%), an undersample of Whites (9.0% versus 15.2%), and no “Other” draws at all. While this might seem odd, it is ordinary sampling variability: only 78 of the 548 sampled rows (14.2%) fall in Cook County, so subgroup proportions fluctuate from draw to draw. It is also a feature of this sampling process. The goal is to mirror a simplified model of the Exoneration Registry, in which the 548 exonerees are one size-548 subset of people incarcerated in Illinois. That subset can then be compared directly with another size-548 subset of those still incarcerated in the state.
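How far a subgroup share can drift in a subset this small can be worked out by hand. The sketch below plugs in the expected Cook proportions and the sampled proportions reported above, with n = 78 (0.142336 × 548) as the number of sampled Cook rows. The binomial standard-error formula is a simplifying assumption used only as a rough yardstick.

```python
import math

n_cook = round(0.142336 * 548)  # sampled rows that fall in Cook County

# Expected shares (Cook population) and observed shares (Cook rows in the sample).
expected = {"White": 0.151859, "Black": 0.718431, "Latino": 0.126019}
observed = {"White": 0.089744, "Black": 0.743590, "Latino": 0.166667}

z_scores = {}
for race, p in expected.items():
    std_err = math.sqrt(p * (1 - p) / n_cook)  # binomial standard error
    z_scores[race] = (observed[race] - p) / std_err
    print(f"{race:<7} std_err={std_err:.3f} z={z_scores[race]:+.2f}")
```

All three gaps come out within two standard errors, so the Cook subset is consistent with the population proportions despite looking lopsided.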

With this step completed, the 548 rows from il_sample_df can now be combined with the 548 rows in exon_il_df, creating a balanced DataFrame with a total of 1,096 rows. Half of these rows represent exonerated individuals from Illinois, and the other half represent non-exonerated individuals, sampled to be statistically representative of Illinois’ incarcerated population as a whole.

Constructing the Final Balanced Dataset

To prepare the final balanced dataset, a new label column was added to distinguish between exonerated and non-exonerated individuals. Specifically:
- The Label column in exon_il_df was set to “Exonerated”.
- The Label column in il_sample_df was set to “Non-Exonerated”.

To avoid confusion when combining datasets, the county column in il_sample_df was renamed to County. With the labels in place and columns aligned, both datasets were combined into a single DataFrame using pd.concat().

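Next, a race mapping was applied to standardize the race categories across datasets. Because the race column in exon_il_df was already recoded earlier, its values at this point are “Black,” “White,” “Latino,” and “Other.” The mapping must contain every one of these labels as a key, since .map() returns NaN for any value it does not find:
- “Black,” “White,” “Latino,” and “Other” map to themselves.
- Any remaining “Hispanic” value maps to “Latino,” and “Asian” or “Native American” to “Other.”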

To clean up, the race and Race columns were combined, prioritizing non-NaN values to ensure no data was lost. The original race column was then dropped. Similarly, the county and County columns were merged, and the original county column was removed to streamline the final DataFrame.

Finally, the resulting Race and County columns were checked to confirm the expected values, and the first few rows of the balanced dataset were displayed to verify everything was in place.

# Construct our new label: exonerated vs. non-exonerated
exon_il_df["Label"] = "Exonerated"
il_sample_df["Label"] = "Non-Exonerated"
il_sample_df = il_sample_df.rename(
    columns={"county": "County"}
)  # Rename to distinguish when combining datasets

# And combine! ignore_index=True gives every row a unique index label
# (the sampled rows carry repeated county indices, such as 15 for Cook).
balanced_df = pd.concat([exon_il_df, il_sample_df], axis=0, ignore_index=True)

# Define the mapping for 'race'. The column was already recoded above, so its
# values are Black, White, Latino, and Other. Each of them must appear as a key,
# because .map() returns NaN for any value missing from the mapping.
race_mapping = {
    "Black": "Black",
    "White": "White",
    "Latino": "Latino",
    "Other": "Other",
    "Hispanic": "Latino",
    "Asian": "Other",
    "Native American": "Other",
}


# Map the 'race' column
balanced_df["race"] = balanced_df["race"].map(race_mapping)

# Combine 'race' and 'Race' columns, prioritizing non-NaN values
balanced_df["Race"] = balanced_df["race"].combine_first(balanced_df["Race"])

# Drop the old 'race' column
balanced_df.drop(columns=["race"], inplace=True)

# Combine 'county' and 'County' columns, prioritizing non-NaN values
balanced_df["County"] = balanced_df["county"].combine_first(balanced_df["County"])

# Drop the old 'county' column
balanced_df.drop(columns=["county"], inplace=True)

# Verify the final Race column
print(balanced_df["Race"].value_counts())
print(balanced_df["County"].value_counts())
balanced_df.head()

Race
Black     709
White     207
Latino    173
Other       7
Name: count, dtype: int64
County
Cook           552
Will            37
Randolph        31
Jefferson       23
Logan           22
Perry           22
Livingston      22
Fulton          21
Johnson         21
Tazewell        19
Lawrence        18
Montgomery      17
Bond            17
Vermilion       16
DuPage          15
Winnebago       15
Lake            14
St. Clair       14
Clinton         14
La Salle        14
Fayette         13
Lee             13
Brown           12
Kane            12
Knox            11
Peoria          10
Morgan          10
Rock Island     10
Macon            9
Crawford         9
McHenry          7
Christian        6
Williamson       6
Champaign        5
McLean           4
Sangamon         4
Henry            3
Kankakee         3
Stephenson       2
Woodford         2
Edgar            2
Effingham        2
Iroquois         2
Adams            2
Richland         1
Menard           1
Pope             1
Madison          1
Boone            1
Jackson          1
Cumberland       1
Washington       1
Dupage           1
Dekalb           1
Moultrie         1
LaSalle          1
De Witt          1
Name: count, dtype: int64

	last_name	first_name	age	sex	state	latitude	longitude	worst_crime_display	sentence	sentence_in_years	...	witness_tampering_or_misconduct_interrogating_co_defendant	misconduct_in_interrogation_of_exoneree	perjury_by_official	tag_sum	geocode_address	Race_orig	Label	County	Race
0	Abbott	Cinque	19.0	male	Illinois	41.819738	-87.756525	Drug Possession or Sale	Probation	0.0	...	0.0	0.0	0.0	7.0	Cook County, Illinois, United States	Black	Exonerated	Cook	Black
1	Abernathy	Christopher	17.0	male	Illinois	41.819738	-87.756525	Murder	Life without parole	100.0	...	0.0	1.0	0.0	10.0	Cook County, Illinois, United States	White	Exonerated	Cook	White
2	Abrego	Eruby	20.0	male	Illinois	41.819738	-87.756525	Murder	90 years	90.0	...	1.0	1.0	1.0	9.0	Cook County, Illinois, United States	Hispanic	Exonerated	Cook	Latino
3	Adams	Demetris	22.0	male	Illinois	41.819738	-87.756525	Drug Possession or Sale	1 year	1.0	...	0.0	0.0	0.0	7.0	Cook County, Illinois, United States	Black	Exonerated	Cook	Black
4	Adams	Kenneth	22.0	male	Illinois	41.819738	-87.756525	Murder	75 years	75.0	...	1.0	0.0	0.0	11.0	Cook County, Illinois, United States	Black	Exonerated	Cook	Black

5 rows × 51 columns

balanced_df.to_csv("../../data/processed-data/exonerees_balanced.csv", index=False)
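Before relying on the saved file, a few sanity checks on the combined frame are worthwhile. The sketch below reproduces the concat and combine_first steps on toy stand-ins (toy_exon, toy_sample, and their rows are assumptions for illustration) and then asserts that the two classes are the same size and that the merged Race and County columns contain no missing values:

```python
import pandas as pd

# Toy stand-ins for exon_il_df and il_sample_df (illustrative rows only).
toy_exon = pd.DataFrame(
    {"race": ["Black", "Latino", "Other"], "county": ["Cook", "Cook", "Lake"]}
)
toy_sample = pd.DataFrame(
    {"Race": ["White", "Black", "Black"], "County": ["Will", "Cook", "Logan"]}
)
toy_exon["Label"] = "Exonerated"
toy_sample["Label"] = "Non-Exonerated"

# Stack the two frames, then merge the lower-case and capitalized columns.
toy_balanced = pd.concat([toy_exon, toy_sample], axis=0, ignore_index=True)
toy_balanced["Race"] = toy_balanced["race"].combine_first(toy_balanced["Race"])
toy_balanced["County"] = toy_balanced["county"].combine_first(toy_balanced["County"])
toy_balanced = toy_balanced.drop(columns=["race", "county"])

# Checks: equal class sizes and no missing values in the merged columns.
label_counts = toy_balanced["Label"].value_counts()
assert label_counts["Exonerated"] == label_counts["Non-Exonerated"]
assert toy_balanced[["Race", "County"]].notna().all().all()
print(toy_balanced)
```

Running the same two assertions on balanced_df would flag any rows whose race or county label was dropped along the way.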

Summary and Next Steps

The final balanced dataset now consists of 1,096 rows, split evenly between exonerated and non-exonerated individuals. Key steps included creating consistent labels, standardizing race categories, and combining the datasets while ensuring no critical data was lost. The resulting DataFrame provides a clean and structured foundation for further analysis.

Next Steps

This balanced dataset can now be used for supervised learning tasks, such as:
- Predicting Exoneration Factors: Training machine learning models to identify the characteristics most associated with exoneration outcomes.
- Comparative Analysis: Exploring differences in demographics, geographic distribution, or other variables between exonerated and non-exonerated individuals.
- Visualization and Insights: Mapping trends or disparities across counties and racial groups to better understand systemic patterns in wrongful convictions.

With this dataset, models and analyses can provide deeper insights into the factors driving exonerations while ensuring fairness and balance in comparisons.
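As one possible starting point for the comparative analysis, a row-normalized cross-tabulation shows the racial composition within each label group. This is only a sketch on made-up rows (toy_balanced is an illustrative assumption); on the real data the same call would take balanced_df["Label"] and balanced_df["Race"].

```python
import pandas as pd

# Toy rows standing in for balanced_df (illustrative only).
toy_balanced = pd.DataFrame(
    {
        "Label": ["Exonerated"] * 4 + ["Non-Exonerated"] * 4,
        "Race": [
            "Black", "Black", "Latino", "White",
            "Black", "White", "White", "Latino",
        ],
    }
)

# Share of each race within each label group (every row sums to 1).
race_by_label = pd.crosstab(
    toy_balanced["Label"], toy_balanced["Race"], normalize="index"
)
print(race_by_label)
```

Because both groups have the same number of rows, differences between the two rows of the table reflect composition rather than group size.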

References

1. Pearl, J. (2016). Counterfactuals and their applications. In Causal inference in statistics: A primer. John Wiley & Sons. https://bayes.cs.ucla.edu/PRIMER/ch4-preview.pdf