Exploratory Data Analysis
Definition
The initial, open-ended investigation of a dataset to understand its structure, distributions, missing values, outliers, and relationships before modelling.
Intuition
EDA is how you build intuition about data before committing to any model. It catches data quality issues early, suggests feature engineering ideas, reveals class imbalance, and uncovers surprising relationships that can reshape the problem formulation.
Formal Description
Univariate Analysis
Examine each feature in isolation.
For numerical features:
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df.describe()          # count, mean, std, min, quartiles, max
df[col].hist(bins=50)  # distribution shape
df[col].skew()         # > 1 or < -1: consider log transform
df[col].kurtosis()     # excess kurtosis vs Gaussian
```

For categorical features:
```python
df[col].value_counts()                # frequency table
df[col].value_counts(normalize=True)  # proportions
df[col].nunique()                     # cardinality
```

High cardinality (> 50 unique values) flags potential encoding challenges.
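As a sketch of how the cardinality check might drive an encoding decision (the 50-value threshold comes from the text; the toy columns and strategy names below are illustrative assumptions):

```python
import pandas as pd

# Toy frame: one low- and one high-cardinality categorical (illustrative data)
df = pd.DataFrame({
    "colour": ["red", "blue", "green"] * 40,
    "user_id": [f"u{i}" for i in range(120)],
})

CARDINALITY_THRESHOLD = 50  # rule of thumb from the text

for col in df.select_dtypes(include="object"):
    n = df[col].nunique()
    strategy = "one-hot encode" if n <= CARDINALITY_THRESHOLD else "target/hash encode"
    print(f"{col}: {n} unique values -> {strategy}")
```

One-hot encoding a 120-level column would add 120 sparse features, which is why high-cardinality columns usually get target or hash encoding instead.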
Missing Value Analysis
```python
missing = df.isnull().mean().sort_values(ascending=False)
# Threshold: > 40–50% missing → consider dropping the column
# Inspect whether missingness is MCAR, MAR, or MNAR
```

- MCAR (Missing Completely At Random): safe to ignore (but lose samples).
- MAR (Missing At Random): conditional on observed variables; imputable.
- MNAR (Missing Not At Random): missingness carries information; missingness indicator as a feature is often useful.
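One way to act on the MNAR point is to keep a missingness indicator alongside the imputed value, so the model can still see that the value was absent. A minimal sketch (the `income` column and median imputation are illustrative choices):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50_000, np.nan, 72_000, np.nan, 61_000]})

# Indicator first: preserves the information that the value was missing (MNAR case)
df["income_missing"] = df["income"].isnull().astype(int)

# Then impute; the median is used here, but any imputer would do
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```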
Bivariate and Multivariate Analysis
Numerical vs numerical:
```python
# Correlation matrix
corr = df.corr(method='pearson')  # linear relationship
sns.heatmap(corr, annot=True, cmap='coolwarm')

# Scatter matrix
pd.plotting.scatter_matrix(df[num_cols], alpha=0.2, figsize=(12, 12))
```

Note: correlation ≠ causation; check for non-linear relationships with scatter plots.
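To follow up on the non-linearity caveat: comparing Pearson with Spearman (rank) correlation is a quick screen, since a monotonic but non-linear relationship scores higher on Spearman than on Pearson. A minimal sketch on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)
y = np.exp(10 * x)  # monotonic but strongly non-linear in x

df = pd.DataFrame({"x": x, "y": y})
pearson = df["x"].corr(df["y"], method="pearson")
spearman = df["x"].corr(df["y"], method="spearman")

print(f"Pearson:  {pearson:.3f}")   # understates the relationship
print(f"Spearman: {spearman:.3f}")  # captures the monotonic link
```

A large gap between the two is a hint to inspect the scatter plot and consider a transform.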
Numerical vs target:
```python
# For a classification (discrete) target
df.groupby('target')[num_col].describe()
df.boxplot(column=num_col, by='target')

# For a regression (continuous) target, use a scatter plot instead
df.plot.scatter(x=num_col, y='target', alpha=0.3)
```

Categorical vs target:
```python
df.groupby(cat_col)['target'].mean()  # mean target per category
pd.crosstab(df[cat_col], df['target'], normalize='index')
```

Outlier Detection
IQR method:
```python
Q1, Q3 = df[col].quantile([0.25, 0.75])
IQR = Q3 - Q1
outliers = df[(df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)]
```

Z-score method: flag observations with |z| > 3, where z = (x − μ) / σ.
Outliers should be investigated, not automatically removed — they may be errors, or genuine extreme cases (e.g., large insurance claims).
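The z-score method mentioned above can be sketched in the same style as the IQR snippet, using the conventional |z| > 3 cutoff (the toy series is illustrative):

```python
import pandas as pd

# Mostly well-behaved values plus one planted outlier
s = pd.Series([10, 11, 12, 13] * 8 + [250])

z = (s - s.mean()) / s.std()   # sample z-scores
outliers = s[z.abs() > 3]      # conventional |z| > 3 cutoff

print(outliers)
```

Note that extreme outliers inflate the mean and standard deviation themselves, which can mask other outliers; robust variants substitute the median and MAD.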
Class Imbalance
```python
df['target'].value_counts(normalize=True)
```

Imbalanced classes (minority class < 10%): evaluate with PR-AUC or F1, not accuracy. Consider stratified sampling, SMOTE, or class weights.
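The class-weight option can be computed by hand with the usual "balanced" heuristic, n_samples / (n_classes × count_per_class); this is a sketch (most libraries also offer it built in, e.g. `class_weight='balanced'` in scikit-learn):

```python
import pandas as pd

y = pd.Series(["neg"] * 90 + ["pos"] * 10)  # 10% minority class

counts = y.value_counts()
weights = len(y) / (len(counts) * counts)   # "balanced" heuristic

print(weights)  # minority class receives proportionally larger weight
```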
Target Leakage Detection
Leakage occurs when a feature contains information that would not be available at prediction time, causing unrealistically high offline performance.
Symptoms:
- A feature with Pearson |r| or mutual information far above all others.
- Model performance that is suspiciously high (AUC > 0.99 on an unstructured problem).
- A feature that is derived from or recorded after the event being predicted.
Detection:
```python
# Flag suspiciously high correlations with the target
corr_with_target = df.corr()['target'].abs().sort_values(ascending=False)
high_corr = corr_with_target[corr_with_target > 0.9]
print("Potential leakage candidates:\n", high_corr)
```

Common leakage patterns:
- Temporal leakage: a feature computed using data from after the prediction timestamp.
- Group leakage: an aggregate feature (e.g., mean target per group) computed over the full dataset, leaking future information into the past.
- Proxy leakage: a feature that encodes the target indirectly (e.g., `claim_settled_flag` in a model predicting `claim_filed`).
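To avoid the group-leakage pattern, aggregate features such as per-category target means must be fitted on training rows only and merely applied elsewhere. A minimal sketch with toy data and a simple train/holdout split (full out-of-fold encoding follows the same principle fold by fold):

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["a", "a", "b", "b", "a", "b"],
    "target": [1, 0, 1, 1, 0, 0],
})
train, holdout = df.iloc[:4], df.iloc[4:]

# Fit the aggregate on TRAIN ONLY ...
city_means = train.groupby("city")["target"].mean()

# ... then apply it to the holdout; holdout targets never enter the mapping
holdout_encoded = holdout["city"].map(city_means)

print(holdout_encoded)
```

Computing `city_means` over the full `df` instead would blend holdout targets into the feature, the exact leak described above.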
Temporal EDA
For time-series and event-based data, standard EDA must be supplemented with time-aware analysis.
```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Sort and set index
df = df.sort_values('timestamp').set_index('timestamp')

# Plot the series
df['value'].plot(figsize=(12, 4), title='Time Series Overview')

# Autocorrelation (ACF) and Partial Autocorrelation (PACF)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(df['value'].dropna(), lags=40, ax=axes[0])
plot_pacf(df['value'].dropna(), lags=40, ax=axes[1])
plt.tight_layout()
```

Key checks:
- Trend: linear or non-linear upward/downward drift → differencing may be needed.
- Seasonality: repeating cycle (weekly, annual) → visible in ACF at lag = season length.
- Stationarity: ADF test (`statsmodels.tsa.stattools.adfuller`): p < 0.05 rejects the unit-root null, i.e. evidence of stationarity.
- Missing timestamps: check for gaps in the index; irregular series require special models.
- Train/test distribution shift: compare feature statistics of training and holdout sets — significant divergence is a red flag.
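The differencing remedy mentioned under the trend check can be demonstrated on a synthetic series (the slope, seed, and date range are arbitrary; in practice the ADF test above would confirm the result):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=200, freq="D")
series = pd.Series(0.5 * np.arange(200) + rng.normal(0, 1, 200), index=idx)

# The raw series has a linear trend: its mean drifts strongly over time
first_half_mean = series[:100].mean()
second_half_mean = series[100:].mean()

# First differencing removes the linear trend, leaving a roughly constant mean
diffed = series.diff().dropna()

print(f"raw means: {first_half_mean:.1f} vs {second_half_mean:.1f}")
print(f"diff mean: {diffed.mean():.2f}")  # close to the slope, 0.5
```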
Distribution Shape
| Characteristic | Detection | Action |
|---|---|---|
| Right skew | skew() > 1 | Log or square-root transform |
| Heavy tails | High kurtosis, QQ plot vs Normal | Robust estimators, quantile regression |
| Multimodality | Histogram with multiple peaks | Consider subgroup modelling |
| Truncation | Values cut at hard boundary | Be aware of censoring |
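The first row of the table can be demonstrated directly: a log transform pulls in a right-skewed tail (synthetic lognormal data, so the log exactly undoes the skew):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.lognormal(mean=0, sigma=1, size=5000))  # right-skewed

print(f"skew before: {s.skew():.2f}")           # well above the +1 rule of thumb
print(f"skew after:  {np.log(s).skew():.2f}")   # near zero: log of lognormal is normal
```

On real data `np.log1p` is the safer choice, since it tolerates zeros.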
Applications
- Discovering that a numeric feature is actually a categorical code (all integers 0-4).
- Finding that a date feature has anomalous spikes (data collection errors).
- Identifying target leakage: a feature with suspiciously high correlation with the target.
Trade-offs
- EDA is open-ended — time-box it. Use automated tools (e.g., `ydata-profiling`) for initial scans, then dig deeper into anomalies manually.
- Do not use the test set during EDA; restrict to training data only.
References
- Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
- Breck et al. (2019). “Data Validation for Machine Learning.” SysML.
- Kohavi & Longbotham (2017). “Online Controlled Experiments and A/B Testing.” Encyclopedia of Machine Learning.