Exploratory Data Analysis

Definition

The initial, open-ended investigation of a dataset to understand its structure, distributions, missing values, outliers, and relationships before modelling.

Intuition

EDA is how you build intuition about data before committing to any model. It catches data quality issues early, suggests feature engineering ideas, reveals class imbalance, and uncovers surprising relationships that can reshape the problem formulation.

Formal Description

Univariate Analysis

Examine each feature in isolation.

For numerical features:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
 
df.describe()             # count, mean, std, min, quartiles, max
df[col].hist(bins=50)     # distribution shape
df[col].skew()            # > 1 or < -1: consider log transform
df[col].kurtosis()        # excess kurtosis vs Gaussian

For categorical features:

df[col].value_counts()                       # frequency table
df[col].value_counts(normalize=True)         # proportions
df[col].nunique()                            # cardinality

High cardinality (> 50 unique values) signals potential encoding challenges: one-hot encoding becomes impractical, so consider target encoding, hashing, or embeddings.
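Cardinality can be scanned across all categorical columns at once; a minimal sketch on a hypothetical frame (the column names user_id and plan are illustrative):

```python
import pandas as pd

# Hypothetical toy frame: 'user_id' is high-cardinality, 'plan' is not
df = pd.DataFrame({
    'user_id': [f'u{i}' for i in range(100)],
    'plan': ['free', 'pro'] * 50,
})

# Flag categorical columns whose cardinality exceeds the threshold
THRESHOLD = 50
cardinality = df.select_dtypes(include='object').nunique()
high_card = cardinality[cardinality > THRESHOLD].index.tolist()
print(high_card)  # ['user_id']
```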

Missing Value Analysis

missing = df.isnull().mean().sort_values(ascending=False)
# Threshold: > 40–50% missing → consider dropping the column
# Inspect whether missingness is MCAR, MAR, or MNAR
  • MCAR (Missing Completely At Random): safe to ignore (but lose samples).
  • MAR (Missing At Random): conditional on observed variables; imputable.
  • MNAR (Missing Not At Random): missingness carries information; missingness indicator as a feature is often useful.
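Under MNAR the missingness itself is predictive, so a common move is to keep an indicator column alongside a simple imputation. A minimal sketch on a hypothetical income column:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an MNAR-suspect column
df = pd.DataFrame({'income': [50_000, np.nan, 72_000, np.nan, 61_000]})

# Missingness indicator as a feature (often useful under MNAR)
df['income_missing'] = df['income'].isna().astype(int)

# Simple median imputation alongside the indicator
df['income'] = df['income'].fillna(df['income'].median())
print(df)
```

The indicator preserves the "was missing" signal that imputation would otherwise erase.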

Bivariate and Multivariate Analysis

Numerical vs numerical:

# Correlation matrix
corr = df.corr(method='pearson', numeric_only=True)   # linear relationship, numeric columns only
sns.heatmap(corr, annot=True, cmap='coolwarm')
 
# Scatter matrix
pd.plotting.scatter_matrix(df[num_cols], alpha=0.2, figsize=(12,12))

Note: correlation ≠ causation; check for non-linear relationships with scatter plots.

Numerical vs target:

# For classification target: feature distribution per class
df.groupby('target')[num_col].describe()
df.boxplot(column=num_col, by='target')

# For regression target: scatter against the target
df.plot.scatter(x=num_col, y='target', alpha=0.3)

Categorical vs target:

df.groupby(cat_col)['target'].mean()    # mean target per category
pd.crosstab(df[cat_col], df['target'], normalize='index')

Outlier Detection

IQR method:

Q1, Q3 = df[col].quantile([0.25, 0.75])
IQR = Q3 - Q1
outliers = df[(df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)]

Z-score method: flag observations with |z| = |(x − μ) / σ| > 3, i.e. more than three standard deviations from the mean.

Outliers should be investigated, not automatically removed — they may be errors, or genuine extreme cases (e.g., large insurance claims).
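The z-score method can be sketched as follows, on a hypothetical series with one extreme point:

```python
import pandas as pd

# Hypothetical series: inliers near 10, one extreme value
s = pd.Series([9, 10, 11] * 7 + [100])

# Standardise, then flag |z| > 3
z = (s - s.mean()) / s.std()
outliers = s[z.abs() > 3]
print(outliers.tolist())  # [100]
```

Note that with very few observations the sample standard deviation is inflated by the outlier itself, so |z| may never exceed 3; robust variants (median/MAD) help there.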

Class Imbalance

df['target'].value_counts(normalize=True)

Imbalanced classes (< 10% minority): evaluate with PR-AUC or F1, not accuracy. Consider stratified sampling, SMOTE, class weights.
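Balanced class weights can be derived directly from the value counts; a minimal sketch using the common n_samples / (n_classes × count_per_class) formula, on a hypothetical 5% minority target:

```python
import pandas as pd

# Hypothetical imbalanced target: 5% minority class
y = pd.Series([0] * 95 + [1] * 5)

# Proportions per class
props = y.value_counts(normalize=True)

# Balanced class weights: n_samples / (n_classes * count_per_class)
counts = y.value_counts()
weights = len(y) / (len(counts) * counts)
print(weights.to_dict())  # minority class gets the larger weight
```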

Target Leakage Detection

Leakage occurs when a feature contains information that would not be available at prediction time, causing unrealistically high offline performance.

Symptoms:

  • A feature with Pearson |r| or mutual information far above all others.
  • Model performance that is suspiciously high (AUC > 0.99 on an unstructured problem).
  • A feature that is derived from or recorded after the event being predicted.

Detection:

# Flag suspiciously high correlations with target
corr_with_target = df.corr(numeric_only=True)['target'].abs().sort_values(ascending=False)
high_corr = corr_with_target[corr_with_target > 0.9]
print("Potential leakage candidates:\n", high_corr)

Common leakage patterns:

  1. Temporal leakage: a feature computed using data from after the prediction timestamp.
  2. Group leakage: an aggregate feature (e.g., mean target per group) computed over the full dataset, leaking future information into the past.
  3. Proxy leakage: a feature that encodes the target indirectly (e.g., claim_settled_flag in a model predicting claim_filed).
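Pattern 2 can be avoided by computing group aggregates over past rows only. A minimal sketch contrasting a leaky full-dataset mean with a shift-then-expanding mean (column names are hypothetical; rows are assumed time-ordered within each group):

```python
import pandas as pd

# Hypothetical event log, already sorted by time within each group
df = pd.DataFrame({
    'group':  ['a', 'a', 'a', 'b', 'b'],
    'target': [1,    0,   1,   0,   1],
})

# Leaky: the full-dataset group mean includes future rows
df['leaky_enc'] = df.groupby('group')['target'].transform('mean')

# Leak-free: shift(1) so each row sees only *past* rows of its group
df['safe_enc'] = (
    df.groupby('group')['target']
      .transform(lambda s: s.shift(1).expanding().mean())
)
print(df)
```

The first row of each group is NaN by construction, since no history exists yet; in practice it is imputed with a global prior.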

Temporal EDA

For time-series and event-based data, standard EDA must be supplemented with time-aware analysis.

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
 
# Sort and set index
df = df.sort_values('timestamp').set_index('timestamp')
 
# Plot the series
df['value'].plot(figsize=(12, 4), title='Time Series Overview')
 
# Autocorrelation (ACF) and Partial Autocorrelation (PACF)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(df['value'].dropna(), lags=40, ax=axes[0])
plot_pacf(df['value'].dropna(), lags=40, ax=axes[1])
plt.tight_layout()

Key checks:

  • Trend: linear or non-linear upward/downward drift → differencing may be needed.
  • Seasonality: repeating cycle (weekly, annual) → visible in ACF at lag = season length.
  • Stationarity: ADF test (statsmodels.tsa.stattools.adfuller) — p < 0.05 → reject the unit-root null, i.e. the series is likely stationary.
  • Missing timestamps: check for gaps in the index; irregular series require special models.
  • Train/test distribution shift: compare feature statistics of training and holdout sets — significant divergence is a red flag.
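The missing-timestamps check reduces to comparing the actual index against the expected regular grid; a pure-pandas sketch assuming a daily frequency (data is hypothetical):

```python
import pandas as pd

# Hypothetical daily series with one missing day
idx = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-04'])
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# Expected regular index at daily frequency
full = pd.date_range(s.index.min(), s.index.max(), freq='D')
gaps = full.difference(s.index)
print(gaps)  # the missing timestamps
```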

Distribution Shape

  • Right skew: skew() > 1 → log or square-root transform.
  • Heavy tails: high kurtosis, QQ plot vs Normal → robust estimators, quantile regression.
  • Multimodality: histogram with multiple peaks → consider subgroup modelling.
  • Truncation: values cut at a hard boundary → be aware of censoring.
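The right-skew case can be demonstrated on a synthetic log-normal feature (the exact skew values depend on the random draw, so only the direction is stated):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature: log-normal draws
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

print(x.skew())          # well above 1: strongly right-skewed
print(np.log(x).skew())  # near 0: the log transform symmetrises it
```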

Applications

  • Discovering that a numeric feature is actually a categorical code (all integers 0-4).
  • Finding that a date feature has anomalous spikes (data collection errors).
  • Identifying target leakage: a feature with suspiciously high correlation with the target.

Trade-offs

  • EDA is open-ended — time-box it. Use automated tools (e.g., ydata-profiling) for initial scans, then dig deeper into anomalies manually.
  • Do not use the test set during EDA; restrict to training data only.

References

  • Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
  • Breck et al. (2019). “Data Validation for Machine Learning.” SysML.
  • Kohavi & Longbotham (2017). “Online Controlled Experiments and A/B Testing.” Encyclopedia of Machine Learning and Data Mining. Springer.