Unsupervised Learning

Introduction and Motivation

This analysis employs unsupervised learning techniques—including Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), K-Means, DBSCAN, and Hierarchical Clustering—to examine the Illinois exoneration dataset. The primary objective is to identify patterns and hidden structures within the data, particularly focusing on how case characteristics, demographic variables (such as race and county), and the number of years lost to wrongful convictions intersect.

The analysis is structured as follows:
1. Dimensionality Reduction: Methods such as PCA and t-SNE are utilized to project high-dimensional data into lower-dimensional spaces, simplifying the visualization of complex relationships while preserving key structural and variance-based insights.

Clustering: Clustering techniques—K-Means, DBSCAN, and Hierarchical Clustering—are applied to uncover natural groupings within the dataset and assess whether these clusters align with demographic features like race or case-related factors.
Evaluation and Interpretation: The performance of each method is evaluated, and clustering results are compared to draw meaningful interpretations. Visualizations are integrated throughout the analysis to enhance clarity and support findings.

The motivation for this analysis stems from the critical need to uncover systemic patterns in wrongful conviction data. By applying unsupervised learning methods, the investigation aims to reveal relationships and disparities between demographic factors and case outcomes that are not immediately apparent. These insights contribute to a deeper understanding of biases and inequities within exoneration cases and support broader efforts for justice system reform.

Data Preprocessing

The data preprocessing stage prepares the Illinois exoneration dataset for dimensionality reduction and clustering. This step involves selecting relevant features, encoding categorical variables, and standardizing numerical features to ensure compatibility with unsupervised learning algorithms.

Feature Selection

The features chosen for this analysis include a combination of numerical and categorical variables. The numerical variables—age, sentence in years, and years lost—were selected to provide quantitative insights into exoneration cases, such as the age at conviction, the length of imprisonment, and the total number of years lost. The categorical variable race was included to capture demographic patterns and be used as a color vector during visualization to assess how the identified clusters align with racial groupings.

The selected features are as follows:
- Numerical: age, sentence_in_years, years_lost
- Categorical: race

Standardization and Encoding

To ensure that numerical features contribute equally to the analysis, they were standardized using StandardScaler. Standardization adjusts each numerical feature to have a mean of 0 and a standard deviation of 1, preventing variables like sentence lengths from dominating the clustering process. The categorical variable race was encoded using Label Encoding, which assigns a numerical value to each category (e.g., race). This transformation ensures compatibility with algorithms such as K-Means, DBSCAN, and Hierarchical Clustering that require numerical inputs for processing.

Code

import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("../../data/processed-data/illinois_exoneration_data.csv")

# Select features of interest
features = ["age", "sentence_in_years", "years_lost", "race", "county"]
df = df[features].dropna()

# Encode categorical variables
le_race = LabelEncoder()
le_county = LabelEncoder()
df["race_encoded"] = le_race.fit_transform(df["race"])

# Standardize numerical variables
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[["age", "sentence_in_years", "years_lost"]])

X = scaled_features  # Already standardized

Dimensionality Reduction

The objective of this section is to explore and demonstrate the effectiveness of PCA and t-SNE in reducing the dimensionality of complex data while preserving essential information and improving visualization.

PCA (Principal Component Analysis)

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of high-dimensional datasets. It achieves this by identifying the most significant features, known as principal components, through linear transformations. These components capture the maximum variance in the data, allowing for a simplified yet informative representation of complex datasets.

Code

import plotly.io as pio
pio.renderers.default = "notebook_connected"
import plotly.graph_objects as go
import numpy as np

race_color_map = {
    "Black": "#ff5f66",
    "Hispanic": "#ffa600",
    "White": "#594e90",
    "Asian": "#003f5c",
    "Native American": "#bc4c96",
}


def plot_2D(X, color_vector, plot_title, label_col=None):
    color_vector = np.array(color_vector)
    fig = go.Figure()
    if label_col is not None:
        unique_vals = sorted(set(zip(color_vector, label_col)))
        for enc, race in unique_vals:
            mask = color_vector == enc
            fig.add_trace(
                go.Scatter(
                    x=X[mask, 0],
                    y=X[mask, 1],
                    mode="markers",
                    name=str(race),
                    marker=dict(
                        color=race_color_map.get(str(race), "#888"),
                        symbol="circle",
                        size=7,
                        opacity=0.7,
                    ),
                )
            )
    else:
        unique_vals = sorted(set(color_vector))
        cluster_colors = [
            "#ff5f66",
            "#594e90",
            "#ffa600",
            "#003f5c",
            "#bc4c96",
            "#59A14F",
            "#EDC948",
            "#FF9DA7",
        ]
        for i, val in enumerate(unique_vals):
            mask = color_vector == val
            fig.add_trace(
                go.Scatter(
                    x=X[mask, 0],
                    y=X[mask, 1],
                    mode="markers",
                    name=f"Cluster {val}" if val >= 0 else "Noise",
                    marker=dict(
                        color=cluster_colors[i % len(cluster_colors)],
                        symbol="circle",
                        size=6,
                        opacity=0.6,
                    ),
                )
            )
    fig.update_layout(
        title=dict(text=plot_title, x=0.5, font=dict(size=16)),
        xaxis=dict(title="Component 1", gridcolor="#e5e5e5"),
        yaxis=dict(title="Component 2", gridcolor="#e5e5e5"),
        plot_bgcolor="white",
        paper_bgcolor="white",
        font=dict(family="Arial", size=13),
        width=750,
        height=500,
        margin=dict(t=70, b=60, l=70, r=40),
    )
    fig.show()
    return fig


def plot_variance_explained(pca):
    explained = pca.explained_variance_ratio_
    cumulative = np.cumsum(explained)
    components = list(range(1, len(explained) + 1))
    fig = go.Figure()
    fig.add_trace(
        go.Scatter(
            x=components,
            y=explained,
            mode="lines+markers",
            name="Explained Variance",
            line=dict(color="#594e90", width=2.5),
            marker=dict(size=7),
        )
    )
    fig.update_layout(
        title=dict(
            text="Explained Variance Ratio by Component", x=0.5, font=dict(size=16)
        ),
        xaxis=dict(title="Number of Components", gridcolor="#e5e5e5"),
        yaxis=dict(title="Explained Variance Ratio", gridcolor="#e5e5e5"),
        plot_bgcolor="white",
        paper_bgcolor="white",
        font=dict(family="Arial", size=13),
        width=700,
        height=450,
        margin=dict(t=70, b=60, l=70, r=40),
    )
    fig.show()
    fig2 = go.Figure()
    fig2.add_trace(
        go.Scatter(
            x=components,
            y=cumulative,
            mode="lines+markers",
            name="Cumulative Variance",
            line=dict(color="#ff5f66", width=2.5),
            marker=dict(size=7),
        )
    )
    fig2.update_layout(
        title=dict(
            text="Cumulative Explained Variance by Component", x=0.5, font=dict(size=16)
        ),
        xaxis=dict(title="Number of Components", gridcolor="#e5e5e5"),
        yaxis=dict(title="Cumulative Explained Variance", gridcolor="#e5e5e5"),
        plot_bgcolor="white",
        paper_bgcolor="white",
        font=dict(family="Arial", size=13),
        width=700,
        height=450,
        margin=dict(t=70, b=60, l=70, r=40),
    )
    fig2.show()

Explained Variance Ratio

The explained variance ratio in PCA indicates the proportion of the total variance in the dataset that is captured by each principal component (PC). This metric is essential for determining the optimal number of components to retain during dimensionality reduction. Each principal component captures a fraction of the total variance, with the first principal component (PC-1) explaining the largest share, followed by the second component (PC-2), and so forth. By analyzing the distribution of variance across the components, it becomes possible to identify which components contribute the most meaningful information to the dataset. The cumulative variance** is calculated by summing the explained variance ratios of successive components. This cumulative measure helps determine how many components are necessary to retain a significant portion of the total variance, such as 90% or 95%. Retaining fewer components simplifies the data representation, making it more computationally efficient, while still preserving most of the underlying structure and variability of the original dataset.

Code

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X)

# Print variance explained and cumulative variance by each principal component
print("Variance explained by each principal component:")
print(pca.explained_variance_ratio_[:10])

print("\nCumulative variance explained by each principal component:")
print(np.cumsum(pca.explained_variance_ratio_)[:10])

# Plot the variance explained
plot_variance_explained(pca)

Variance explained by each principal component:
[0.59465697 0.29997716 0.10536587]

Cumulative variance explained by each principal component:
[0.59465697 0.89463413 1.        ]

The first plot, “Explained Variance Ratio by Component,” shows the proportion of variance captured by each principal component. The steep decline in this plot indicates that the first principal component (PC-1) explains the largest portion of the variance, followed by the second component (PC-2). After these first two components, the additional variance explained by subsequent components decreases significantly. This behavior suggests that the majority of the dataset’s structure can be captured by the first two components.

The second plot, “Cumulative Explained Variance by Component,” illustrates the total variance explained as additional components are added. The curve rises sharply at the start, with the first two components capturing approximately 90% of the total variance. Beyond the second component, the curve begins to flatten, indicating diminishing returns. This flattening demonstrates that including more components contributes little new information to the overall representation of the data.

Together, these plots emphasize the significance of the first two principal components. By focusing on these components, the dimensionality of the data can be effectively reduced while retaining most of its variance. This reduction simplifies computations, decreases model complexity, and enhances the interpretability of visualizations, all without sacrificing critical information.

Code

pca = PCA(n_components=2)
X_pca_2 = pca.fit_transform(X)

plot_2D(
    X_pca_2,
    df["race_encoded"],
    "Principal Component Analysis Results",
    label_col=df["race"],
)

The plot, “Principal Component Analysis Results,” presents the dataset reduced to two dimensions—PC-1 and PC-2—the two principal components that capture the most variance. Each point represents an individual data observation, with colors corresponding to the race_encoded variable.

PC-1 (x-axis) captures the largest portion of the variance, approximately 60%.
PC-2 (y-axis) explains the next largest portion, around 30%.

The distribution of points shows that much of the variance is concentrated along the PC-1 axis, suggesting that this direction captures the most meaningful structure in the dataset. The spread along the PC-2 axis provides additional separation, though to a lesser extent. The presence of vertical “striping” and overlapping points indicates that the race_encoded variable (color-coded) does not perfectly align with the variance explained by the first two components. This observation suggests that the numerical features—age, sentence_in_years, and years_lost—alone may not fully differentiate racial categories.

In sum, PCA effectively reduces the dataset to two dimensions, capturing a significant portion of the variance. However, the clustering patterns observed suggest that race-based groupings may not be strongly linear within the numerical features.

t-SNE (t-distributed Stochastic Neighbor Embedding)

T-distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction technique designed for visualizing high-dimensional data in a low-dimensional space. It preserves local relationships within the data, making it particularly effective for identifying clusters and patterns that may not be visible in higher dimensions.

Code

from sklearn.manifold import TSNE

perplexity_values = [5, 30, 50, 100]
for perplexity in perplexity_values:
    print(f"Running t-SNE with perplexity={perplexity}")
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    X_tsne = tsne.fit_transform(X)
    plot_2D(
        X_tsne,
        df["race_encoded"],
        f"t-SNE Results (Perplexity={perplexity})",
        label_col=df["race"],
    )

Running t-SNE with perplexity=5

Running t-SNE with perplexity=30

Running t-SNE with perplexity=50

Running t-SNE with perplexity=100

The t-SNE visualizations were generated using perplexity values of 5, 30, 50, and 100 to examine how this parameter influences the clustering structure. Perplexity determines the balance between local and global relationships within the data, where lower values emphasize small neighborhoods and higher values capture broader patterns.

At perplexity = 5, the plot reveals fragmented and overly localized clusters. While small neighborhoods are highlighted, the data appears disjointed, making it difficult to identify coherent global groupings. This behavior suggests that a perplexity of 5 is too low to capture meaningful structure.

At perplexity = 30, the visualization becomes more organized, striking a balance between local and global structure. Clear regional groupings emerge, with smaller clusters visible alongside broader trends. This representation provides an interpretable and balanced view of the data.

At perplexity = 50, the clustering appears more cohesive and distinct. The structure of the data is well-defined, and groupings are clearer compared to perplexity = 30. This value maintains a strong balance between fine-grained patterns and global structure, making it ideal for visualizing race-based patterns in the data.

At perplexity = 100, the plot emphasizes global structure but sacrifices local details. Clusters become stretched horizontally, and smaller, fine-grained groupings are smoothed out. While broad relationships are highlighted, important insights from localized clusters are diminished.

In conclusion, a perplexity value of 50 was selected as it produces the clearest and most cohesive clusters. This value preserves both local details and global structure, providing the optimal balance for identifying meaningful groupings and visualizing patterns related to race.

Comparison of PCA and t-SNE Results

Code

import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=50, random_state=42)
X_tsne = tsne.fit_transform(X)

race_list = df["race"].tolist()
race_encoded = df["race_encoded"].tolist()

fig = make_subplots(rows=1, cols=2, subplot_titles=("PCA Results", "t-SNE Results"))

for race, color in race_color_map.items():
    mask = [r == race for r in race_list]
    pca_x = [X_pca_2[i, 0] for i in range(len(mask)) if mask[i]]
    pca_y = [X_pca_2[i, 1] for i in range(len(mask)) if mask[i]]
    tsne_x = [X_tsne[i, 0] for i in range(len(mask)) if mask[i]]
    tsne_y = [X_tsne[i, 1] for i in range(len(mask)) if mask[i]]
    if not pca_x:
        continue
    fig.add_trace(
        go.Scatter(
            x=pca_x,
            y=pca_y,
            mode="markers",
            name=race,
            marker=dict(
                color=color,
                symbol="circle",
                size=7,
                opacity=0.7,
                line=dict(width=0.5, color="white"),
            ),
            legendgroup=race,
            showlegend=True,
        ),
        row=1,
        col=1,
    )
    fig.add_trace(
        go.Scatter(
            x=tsne_x,
            y=tsne_y,
            mode="markers",
            name=race,
            marker=dict(
                color=color,
                symbol="circle",
                size=7,
                opacity=0.7,
                line=dict(width=0.5, color="white"),
            ),
            legendgroup=race,
            showlegend=False,
        ),
        row=1,
        col=2,
    )

fig.update_xaxes(title_text="PC-1", gridcolor="#e5e5e5", row=1, col=1)
fig.update_xaxes(title_text="t-SNE 1", gridcolor="#e5e5e5", row=1, col=2)
fig.update_yaxes(title_text="PC-2", gridcolor="#e5e5e5", row=1, col=1)
fig.update_yaxes(title_text="t-SNE 2", gridcolor="#e5e5e5", row=1, col=2)
fig.update_layout(
    plot_bgcolor="white",
    paper_bgcolor="white",
    font=dict(family="Arial", size=13),
    width=1100,
    height=520,
    margin=dict(t=70, b=60, l=70, r=40),
)


import os

img_paths = [
    "../../images/dimensionality_reduction.png",
    "../../docs/images/dimensionality_reduction.png",
    "../../multiclass-portfolio-website/projects/dsan-5000/_site/images/dimensionality_reduction.png",
    "../../multiclass-portfolio-website/_site/projects/dsan-5000/_site/images/dimensionality_reduction.png",
]
for path in img_paths:
    try:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        fig.write_image(path, scale=2)
    except Exception as e:
        print(f"Could not save {path}: {e}")
try:
    fig.write_html(
        "../../report/figures/dimensionality_reduction.html", include_plotlyjs="cdn"
    )
except Exception as e:
    print(f"Could not save html: {e}")
fig.show()

Resorting to unclean kill browser.

The two plots above compare the results of Principal Component Analysis (PCA) and t-SNE (perplexity = 50) when applied to the same dataset. In the PCA Results, the data is reduced to two principal components that capture the directions of maximum variance. The points appear somewhat aligned along the vertical axis (PC-1), indicating that the majority of the variance lies in that direction. However, the plot does not reveal clear or well-defined clusters, suggesting that PCA effectively captures global variance patterns but struggles to preserve local neighborhood structures. The visual separation by race (color-coded) is not particularly distinct in the PCA output. Alternatively, t-SNE Results provide a more nuanced and detailed visualization. By balancing local and global relationships, t-SNE produces more distinct clusters and patterns with greater cohesion. The clusters are clearer and better separated, indicating that t-SNE excels at preserving the local structure of the data. Although some overlap remains, the t-SNE output reveals a structure that is significantly clearer compared to PCA, particularly when visualized using race-based color encoding. Overall, t-SNE (with perplexity = 50) outperforms PCA in uncovering patterns and potential clusters within the dataset. While PCA captures the global variance effectively, it fails to separate groups as clearly. This comparison highlights the advantage of t-SNE for visualizing complex, high-dimensional data where local relationships are particularly important.

Clustering Methods

Clustering is an unsupervised learning technique used to identify natural groupings, or clusters, within a dataset. In clustering, the data is unlabeled, meaning there are no predefined classes or categories. The primary goal is to discover groups of data points that are similar to each other based on a defined measure of similarity or distance.

In this analysis, K-Means, DBSCAN, and Hierarchical Clustering are applied to the dataset to:
- Explore the structure of the data and identify meaningful clusters.
- Compare the performance and outcomes of each clustering technique.
- Interpret the results to gain insights into groupings and relationships within the data.

By leveraging these methods, the analysis seeks to uncover patterns and hidden structures that may not be immediately apparent, providing a deeper understanding of the data.

K-Means

K-Means is a foundational unsupervised clustering algorithm that partitions data into a predefined number of clusters, denoted as K. K-Means begins by randomly selecting K initial cluster centroids and assigns each data point to the closest centroid based on the Euclidean distance. The centroids are then recalculated as the mean of all data points within their respective clusters. This process of assignment and centroid adjustment repeats until the centroids stabilize or a convergence criterion is reached. The goal of K-Means is to group data into clusters that are both cohesive and well-separated—where distances within each cluster are minimized, and distances between clusters are maximized. While K-Means performs effectively when clusters are spherical and well-defined, it requires specifying the number of clusters (K) in advance, which introduces the need for hyperparameter tuning.

Elbow Method

To identify the optimal number of clusters, the Elbow Method and Silhouette Score are employed. The Elbow Method involves plotting the point where the rate of inertia reduction slows, indicating the optimal number of clusters, as adding more clusters beyond it provides diminishing returns in variance reduction. The Silhouette Score offers an additional measure to assess clustering quality by evaluating how similar each data point is to its own cluster compared to other clusters. Higher Silhouette Scores indicate clusters that are well-defined and more cohesive, providing a clearer structure within the data.

Code

import plotly.graph_objects as go
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings

warnings.filterwarnings("ignore", category=UserWarning)

k_values = list(range(2, 10))
wcss = []
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    wcss.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X, labels))

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=k_values,
        y=wcss,
        mode="lines+markers",
        name="WCSS",
        line=dict(color="#594e90", width=2.5),
        marker=dict(size=8),
    )
)
fig.update_layout(
    title=dict(text="Elbow Method for K-Means", x=0.5, font=dict(size=16)),
    xaxis=dict(title="Number of Clusters (K)", gridcolor="#e5e5e5", dtick=1),
    yaxis=dict(title="Within-Cluster Sum of Squares (WCSS)", gridcolor="#e5e5e5"),
    plot_bgcolor="white",
    paper_bgcolor="white",
    font=dict(family="Arial", size=13),
    width=750,
    height=480,
    showlegend=False,
    margin=dict(t=70, b=60, l=80, r=40),
)
fig.show()

In the Elbow Method plot above, the inertia decreases sharply up to around K = 4 or K = 5, after which the curve begins to flatten. This behavior suggests that the optimal number of clusters lies between 4 and 5. By combining insights from both the Elbow Method and the Silhouette Scores, the best value for K can be confidently selected. Visualizing the clustering results further allows for assessing their alignment with meaningful patterns within the data.

Visualize Clusters for Optimal K

Code

import plotly.graph_objects as go
from sklearn.cluster import KMeans

optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
labels_kmeans = kmeans.fit_predict(X)

cluster_colors = ["#ff5f66", "#594e90", "#ffa600", "#003f5c", "#bc4c96", "#59A14F"]

fig = go.Figure()
for cluster in range(optimal_k):
    mask = labels_kmeans == cluster
    fig.add_trace(
        go.Scatter(
            x=X[mask, 0],
            y=X[mask, 1],
            mode="markers",
            name=f"Cluster {cluster}",
            marker=dict(color=cluster_colors[cluster], size=6, opacity=0.65),
        )
    )

fig.update_layout(
    title=dict(text="K-Means Clustering Results", x=0.5, font=dict(size=16)),
    xaxis=dict(title="Component 1", gridcolor="#e5e5e5"),
    yaxis=dict(title="Component 2", gridcolor="#e5e5e5"),
    plot_bgcolor="white",
    paper_bgcolor="white",
    font=dict(family="Arial", size=13),
    width=800,
    height=550,
    margin=dict(t=70, b=60, l=70, r=40),
)


import os

img_paths = [
    "../../images/k_means.png",
    "../../docs/images/k_means.png",
    "../../multiclass-portfolio-website/projects/dsan-5000/_site/images/k_means.png",
    "../../multiclass-portfolio-website/_site/projects/dsan-5000/_site/images/k_means.png",
]
for path in img_paths:
    try:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        fig.write_image(path, scale=2)
    except Exception as e:
        print(f"Could not save {path}: {e}")
try:
    fig.write_html("../../report/figures/k_means.html", include_plotlyjs="cdn")
except Exception as e:
    print(f"Could not save html: {e}")
fig.show()

Resorting to unclean kill browser.

The visualization above displays the results of K-Means clustering applied with an optimal value of K = 4, as determined using the Elbow Method. The data has been projected onto the first two principal components (PC-1 and PC-2) for visualization, with each point color-coded based on its cluster assignment.

The clusters appear well-separated and exhibit distinct patterns along the two principal components:

The yellow cluster occupies the far right of the PC-1 axis, indicating that this group has unique characteristics that distinguish it from the others.
The purple cluster is concentrated at the bottom of the plot, suggesting that it shares common features that set it apart, particularly along PC-2.
The teal and blue clusters are more centered with slight overlap, reflecting some shared attributes while still maintaining discernible boundaries.

The choice of K = 4 aligns well with the structure of the exoneration dataset, where race emerged as a key factor during Exploratory Data Analysis (EDA). The racial groups initially considered included Black, Hispanic, White, Native American, and Asian. However, the Asian group was excluded due to its negligible presence in the dataset, making four clusters a logical and meaningful choice. This outcome mirrors the underlying data distribution, where the remaining racial groups are clearly represented in the clustering results.

These findings suggest that K-Means effectively partitions the data into four meaningful clusters based on the selected features. The clear separation of clusters along PC-1 and PC-2 highlights that these principal components successfully capture the variance in the data, enabling the differentiation of racial groupings. The slight overlap between clusters may stem from shared attributes across groups or limitations in the chosen features.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is an unsupervised clustering algorithm that groups data points based on density. Unlike K-Means, which requires specifying the number of clusters in advance, DBSCAN identifies clusters by locating dense regions in the data while marking points in less dense areas as noise. This makes it particularly effective for identifying clusters of arbitrary shapes and handling outliers. DBSCAN relies on two key parameters:

eps: The maximum distance between two points for them to be considered part of the same neighborhood.
min_samples: The minimum number of points required to form a dense region (a cluster).

DBSCAN begins with an unvisited point and determines its neighborhood within the radius eps. If the number of points in the neighborhood meets or exceeds min_samples, a cluster is initiated. The cluster is then expanded by iteratively including points within eps distance of other points already in the cluster. Points that do not meet the density requirement are labeled as noise (outliers).

One of the key strengths of DBSCAN is its ability to detect clusters of varying shapes and handle datasets with noisy or irregular boundaries. Additionally, it does not force every point into a cluster, which allows for the identification of outliers—a feature particularly useful for understanding anomalies within the data. However, selecting appropriate values for eps and min_samples is critical to DBSCAN’s performance. A common strategy involves experimenting with different eps values and assessing the results using metrics such as the Silhouette Score.

By leveraging DBSCAN, the analysis can uncover nuanced structures in the dataset that may not be apparent with algorithms like K-Means, offering deeper insights into hidden patterns and systemic disparities.

Eps Hyper-Parameter Tuning

Code

from sklearn.cluster import DBSCAN

eps_values = [0.5, 1.0, 1.5, 2.0]
for eps in eps_values:
    dbscan = DBSCAN(eps=eps, min_samples=5)
    labels_dbscan = dbscan.fit_predict(X)
    try:
        sil_score = silhouette_score(X, labels_dbscan)
        print(f"EPS: {eps}, Silhouette Score: {sil_score:.4f}")
    except:
        print(f"EPS: {eps}, Silhouette Score: Undefined (noise present)")
    plot_2D(X, labels_dbscan, f"DBSCAN Clustering (EPS={eps})")

EPS: 0.5, Silhouette Score: 0.1970

EPS: 1.0, Silhouette Score: 0.7809

EPS: 1.5, Silhouette Score: 0.7809

EPS: 2.0, Silhouette Score: 0.7809

The eps values for DBSCAN were selected by testing a range of values (0.5, 1.0, 1.5, and 2.0) to analyze the algorithm’s sensitivity to this key parameter. DBSCAN uses eps (neighborhood radius) and min_samples to define dense regions, where eps controls the size of these regions. Smaller values of eps result in fragmented clusters, while larger values produce fewer, broader clusters. The chosen values allow for an incremental evaluation of how clustering changes to identify the best balance between fragmentation and cohesion.

EPS = 0.5
- At eps = 0.5, the clusters are highly fragmented, with a significant number of points labeled as noise (not assigned to any cluster). The neighborhood radius is too small to form large, cohesive clusters.
- The Silhouette Score is 0.197, indicating poor clustering performance with high intra-cluster variance. The results lack meaningful structure and cohesion.
EPS = 1.0
- With eps = 1.0, clustering performance improves significantly. More points are grouped into clusters, dense regions become apparent, and the number of noise points is reduced.
- The Silhouette Score rises to 0.78, reflecting well-defined clusters and clear separation between groups. At this value, the algorithm strikes a good balance between cohesive clusters and noise reduction.
EPS = 1.5
- At eps = 1.5, the clustering results remain largely similar to those observed at eps = 1.0. Most data points are grouped into a single large cluster, with only a small number of points remaining on the periphery.
- The Silhouette Score remains stable at 0.78, but expanding the neighborhood radius further does not uncover additional structure in the data.
EPS = 2.0
- When eps = 2.0, nearly all points are assigned to a single large cluster. While this eliminates noise points, it oversimplifies the data and removes meaningful structural separation.
- Although the Silhouette Score remains consistent, the clustering lacks distinct groupings, indicating that eps = 2.0 is too large for this dataset.

The analysis of DBSCAN results demonstrates that eps = 1.0 provides the best clustering performance. At this value, the clusters are well-defined, noise is minimized, and the Silhouette Score achieves a peak of 0.78. Smaller eps values, such as 0.5, lead to fragmented clusters with excessive noise, while larger values, such as 1.5 and 2.0, smooth the data excessively, reducing meaningful separation. Thus, eps = 1.0 emerges as the optimal choice, balancing noise reduction with cohesive and interpretable clustering results.

Hierarchical Clustering

Hierarchical Clustering is an unsupervised machine learning algorithm that builds a hierarchy of clusters through an iterative process. Unlike K-Means, which requires specifying the number of clusters in advance, hierarchical clustering produces a dendrogram—a tree-like structure that shows how data points are grouped at different levels of granularity. The algorithm can follow two main approaches: agglomerative (bottom-up) and divisive (top-down). In agglomerative clustering, each data point starts as its own cluster. Clusters are progressively merged based on their similarity (distance), defined by the linkage method. This analysis uses Ward’s linkage, which minimizes intra-cluster variance at each step, resulting in well-balanced and cohesive clusters. The dendrogram serves as a visual tool to identify the optimal number of clusters by “cutting” the tree at a height where clusters are most distinct. This makes hierarchical clustering particularly effective for EDA, as it does not require prior knowledge of the number of clusters.

To determine the optimal clusters, the dendrogram was examined for large vertical distances, indicating well-separated groups. Ward’s linkage ensures minimal intra-cluster variance, producing compact and interpretable groupings—an essential property for datasets like the exoneration data, where clear separations between groups are critical. Hierarchical clustering with Ward’s linkage offers a flexible and interpretable approach for uncovering the dataset’s structure. By analyzing the dendrogram and selecting the appropriate height, meaningful clusters were identified. Compared to K-Means and DBSCAN, hierarchical clustering provides the additional benefit of visualizing relationships between clusters, making it a valuable method for validating and interpreting clustering results.

Code

import plotly.figure_factory as ff
import plotly.graph_objects as go
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import AgglomerativeClustering
import numpy as np

sample_idx = np.random.choice(len(X), size=min(200, len(X)), replace=False)
X_sample = X[sample_idx]

linkage_matrix = linkage(X_sample, method="ward")

fig = ff.create_dendrogram(
    X_sample, linkagefun=lambda x: linkage(x, method="ward"), color_threshold=0
)
fig.update_layout(
    title=dict(text="Hierarchical Clustering Dendrogram", x=0.5, font=dict(size=16)),
    xaxis=dict(title="Samples", showticklabels=False),
    yaxis=dict(title="Distance", gridcolor="#e5e5e5"),
    plot_bgcolor="white",
    paper_bgcolor="white",
    font=dict(family="Arial", size=13),
    width=950,
    height=500,
    margin=dict(t=70, b=60, l=70, r=40),
)
fig.show()

n_clusters = 4
agglom = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
labels_agglom = agglom.fit_predict(X)
plot_2D(X, labels_agglom, "Hierarchical Clustering Results")

Hierarchical Clustering Results Interpretation

The dendrogram above visually represents the clustering hierarchy generated using Ward’s linkage in hierarchical clustering. The dendrogram shows how data points are progressively merged into clusters at varying distances. By cutting the dendrogram at an appropriate height, four clusters are identified, which align with the patterns observed in the earlier K-Means results.

The Hierarchical Clustering Results plot projects these clusters onto the first two principal components (PC-1 and PC-2), with data points color-coded based on their cluster assignments:
1. Purple Cluster: Positioned at the top of the PC-2 axis, this group stands out distinctly, indicating unique features that strongly separate it along PC-2.
2. Yellow Cluster: Spread vertically across the middle region of PC-1, this cluster shows moderate cohesion while spanning a wide range along PC-2.
3. Teal Cluster: Concentrated toward the bottom-right, this group is more compact and well-defined along PC-1, suggesting strong intra-cluster similarity.
4. Blue Cluster: Located in the bottom-left region, this group appears dense and cohesive, with relatively low variation along PC-2.

The dendrogram supports the selection of four clusters, evidenced by clear separations into branches at a distance of approximately 20 units. These clusters are well-defined and consistent with the natural groupings observed in the data, particularly along PC-1 and PC-2. Compared to other clustering methods, hierarchical clustering offers a notable advantage by providing a hierarchical structure for cluster exploration. This allows for the analysis of groupings at multiple levels of granularity, enhancing the interpretability of the results. Overall, hierarchical clustering confirms the patterns identified in the K-Means results, while providing additional insights through the dendrogram. The use of Ward’s linkage ensures minimal intra-cluster variance, resulting in cohesive and interpretable clusters. These clusters align closely with the underlying structure of the data, reinforcing the conclusions drawn from the PCA and K-Means analyses.

Discussion

The clustering analysis using K-Means, DBSCAN, and Hierarchical Clustering reveals meaningful patterns within the exoneration dataset. Each method identifies groupings that align with the underlying racial categories—Black, Hispanic, White, and Native American—highlighted in the earlier exploratory data analysis (EDA).

K-Means effectively grouped the data into four clusters, which correspond closely with the primary racial groups observed. The Elbow Method confirmed that four clusters were optimal, with clear separations visible along PC-1 and PC-2. This method produced cohesive and well-defined clusters with minimal noise, reflecting the structure of the data. The results demonstrate a strong alignment with the racial breakdown, as the groupings were particularly distinct along PC-1, capturing critical variance in the dataset.

DBSCAN yielded varying results depending on the value of the eps parameter. At eps = 1.0, the method achieved the best balance between cluster cohesion and noise reduction, with a Silhouette Score of 0.78. At this setting, the clusters were distinguishable, and the algorithm successfully identified dense regions within the data. However, increasing eps beyond 1.0 caused most data points to merge into a single cluster, diminishing DBSCAN’s ability to detect finer groupings. Despite this limitation, DBSCAN proved particularly useful for identifying sparse regions and outliers, which may correspond to underrepresented racial groups or anomalies within the dataset.

Hierarchical Clustering using Ward’s linkage provided additional insights through its dendrogram, offering a clear visualization of the clustering process. The dendrogram supported the selection of four clusters, which closely matched the groupings identified by K-Means. Projecting the clusters onto the first two principal components revealed well-defined and interpretable separations. Ward’s linkage ensured minimal intra-cluster variance, producing compact clusters that effectively reflect the structure of the data.

While all three clustering methods revealed meaningful patterns, K-Means and Hierarchical Clustering produced the clearest and most interpretable results. Both consistently identified four clusters that align with the racial categories analyzed earlier. K-Means offered computational efficiency and well-separated groupings, while Hierarchical Clustering provided the added benefit of a dendrogram, which validated the results by illustrating the relationships between clusters. DBSCAN, while effective for detecting outliers and handling irregular data distributions, struggled to produce distinct groupings beyond specific parameter settings.

The clustering results demonstrate that the exoneration data can be effectively grouped into four distinct clusters that align with the racial categories of Black, Hispanic, White, and Native American. These findings reinforce the observations from the EDA, where negligible representation of certain racial groups, such as Asians, led to their exclusion from the analysis. By revealing clear and consistent patterns within the data, this analysis highlights systemic disparities in exoneration outcomes that are closely tied to racial identity. Understanding these patterns provides critical insights into racial biases within the criminal justice system, emphasizing the need for informed policy changes and further investigations.