<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rupesh Bharambe</title>
    <description>The latest articles on Forem by Rupesh Bharambe (@rupesh24).</description>
    <link>https://forem.com/rupesh24</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3865174%2F0ee9edb5-b608-42a5-b3cb-fda11a2050c1.jpg</url>
      <title>Forem: Rupesh Bharambe</title>
      <link>https://forem.com/rupesh24</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rupesh24"/>
    <language>en</language>
    <item>
      <title>From Raw CSV to Model Comparison in 3 Lines of Python</title>
      <dc:creator>Rupesh Bharambe</dc:creator>
      <pubDate>Wed, 08 Apr 2026 10:54:55 +0000</pubDate>
      <link>https://forem.com/rupesh24/from-raw-csv-to-model-comparison-in-3-lines-of-python-3hdd</link>
      <guid>https://forem.com/rupesh24/from-raw-csv-to-model-comparison-in-3-lines-of-python-3hdd</guid>
      <description>&lt;p&gt;&lt;em&gt;A hands-on tutorial with dissectml — the library that combines deep EDA with model comparison.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Let me show you something. This is how most data scientists start a project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ydata_profiling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProfileReport&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GradientBoostingClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LabelEncoder&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confusion_matrix&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shap&lt;/span&gt;
&lt;span class="c1"&gt;# ... 150 more lines of boilerplate
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this is the same thing with dissectml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dissectml&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;

&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same output. Same depth. Three lines. Let me walk you through what happens under the hood.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dissectml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this tutorial, we'll use the built-in Titanic dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dissectml&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_titanic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dataset: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows × &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; columns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Dataset: 891 rows × 8 columns
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Stage 1: Deep EDA
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns instantly — dissectml uses lazy evaluation, so nothing computes until you ask for it. Now let's explore:&lt;/p&gt;
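&lt;p&gt;Lazy evaluation here most likely means cached properties: each section is computed on first access and memoized. A minimal sketch of that pattern (an assumption about the design, not the actual implementation):&lt;/p&gt;

```python
# Sketch of lazy, memoized report sections (assumed pattern, not dissectml's code).
import pandas as pd
from functools import cached_property

class Explorer:
    def __init__(self, df):
        self.df = df  # just stored; nothing is computed yet

    @cached_property
    def correlations(self):
        # Runs once, on first access; the result is cached afterwards
        return self.df.corr()

ex = Explorer(pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6]}))
print(ex.correlations.loc["a", "b"])  # first access triggers the computation
```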

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;overview&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This auto-detects column types (numeric, categorical, boolean, datetime, high-cardinality, constant), shows memory usage, and generates a type distribution chart.&lt;/p&gt;

&lt;h3&gt;
  
  
  Correlations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;correlations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unlike basic &lt;code&gt;df.corr()&lt;/code&gt;, this computes a &lt;strong&gt;unified correlation matrix&lt;/strong&gt; that handles mixed types: Pearson for numeric-numeric, Cramér's V for categorical-categorical, and correlation ratio (eta) for numeric-categorical pairs. All in one heatmap.&lt;/p&gt;
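&lt;p&gt;For intuition, here is what the two non-Pearson measures compute, sketched by hand (an illustration of the statistics, not dissectml's implementation):&lt;/p&gt;

```python
# Hand-rolled versions of the two mixed-type association measures.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V for two categorical series (0 = independent, 1 = perfect)."""
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    r, k = confusion.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

def correlation_ratio(categories, values):
    """Eta: how much of a numeric column's variance the category groups explain."""
    values = np.asarray(values, dtype=float)
    groups = [values[categories == g] for g in np.unique(categories)]
    grand_mean = values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((values - grand_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total)

df = pd.DataFrame({
    "sex":  ["m", "f", "m", "f", "m", "f", "m", "f"],
    "deck": ["A", "B", "A", "B", "A", "B", "B", "A"],
    "fare": [10.0, 50.0, 12.0, 48.0, 9.0, 52.0, 47.0, 11.0],
})
print(round(cramers_v(df["sex"], df["deck"]), 3))
print(round(correlation_ratio(df["deck"].to_numpy(), df["fare"]), 3))
```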

&lt;h3&gt;
  
  
  Missing Data Intelligence
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This goes beyond "column X has 20% missing." It analyzes the &lt;em&gt;pattern&lt;/em&gt; of missingness — is it Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR)? This determines which imputation strategy you should use.&lt;/p&gt;
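&lt;p&gt;One way to probe this yourself, using a simple dependence test rather than dissectml's internals: check whether a column's missingness indicator depends on the other columns. If it does, the data is not MCAR.&lt;/p&gt;

```python
# Probing MCAR vs. MAR by hand (an assumed approach, not the library's code).
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
fare = rng.exponential(30, size=500)
age = rng.normal(35, 10, size=500)
cheap = np.less(fare, 15)  # fares below 15
# Knock out 'age' mostly for cheap tickets, i.e. MAR rather than MCAR
age[cheap] = np.where(rng.random(cheap.sum()) > 0.4, np.nan, age[cheap])
df = pd.DataFrame({"age": age, "fare": fare})

missing = df["age"].isna()
stat, p = mannwhitneyu(df.loc[missing, "fare"], df.loc[~missing, "fare"])
print(f"p = {p:.3g}")  # a tiny p means missingness depends on fare: MAR, not MCAR
```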

&lt;h3&gt;
  
  
  Outlier Detection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outliers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Runs three methods simultaneously — IQR, Z-score, and Isolation Forest — and shows a consensus view. Points flagged by all three methods are the most confident outliers.&lt;/p&gt;
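&lt;p&gt;The consensus idea is easy to reproduce by hand (assumed voting logic; the library may weight the three methods differently):&lt;/p&gt;

```python
# Consensus outlier voting with IQR fences, z-scores, and Isolation Forest.
import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0, 1, 200), [8.0, -9.0, 10.0]])  # 3 planted outliers

# Method 1: IQR fences
q1, q3 = np.percentile(x, [25, 75])
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
iqr_flag = np.less(x, lower) | np.greater(x, upper)

# Method 2: absolute z-score above 3
z_flag = np.greater(np.abs(stats.zscore(x)), 3)

# Method 3: Isolation Forest
iso = IsolationForest(random_state=0).fit(x.reshape(-1, 1))
iso_flag = iso.predict(x.reshape(-1, 1)) == -1

votes = iqr_flag.astype(int) + z_flag.astype(int) + iso_flag.astype(int)
print("flagged by all three methods:", np.where(votes == 3)[0])
```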

&lt;h3&gt;
  
  
  Statistical Tests
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normality&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;independence&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Automated Shapiro-Wilk normality tests for all numeric columns, chi-square independence tests for categorical pairs, and ANOVA/Kruskal-Wallis for group comparisons against the target.&lt;/p&gt;
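&lt;p&gt;For intuition, the same battery can be run by hand with scipy (illustrative only):&lt;/p&gt;

```python
# Shapiro-Wilk normality and chi-square independence, done manually with scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_col = rng.normal(size=300)
skewed_col = rng.exponential(size=300)

# Shapiro-Wilk: null hypothesis = the sample is normally distributed
_, p_normal = stats.shapiro(normal_col)
_, p_skewed = stats.shapiro(skewed_col)
print(f"normal col p={p_normal:.3f}, skewed col p={p_skewed:.2e}")

# Chi-square independence on a 2x2 contingency table (e.g. sex vs. survived counts)
table = np.array([[90, 10], [30, 70]])
chi2, p_ind, dof, expected = stats.chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p_ind:.2e}")
```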

&lt;h3&gt;
  
  
  Cluster Discovery
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatter_2d&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Automatically runs K-Means and DBSCAN, finds the optimal number of clusters, and visualizes them with PCA projection. Reveals hidden structure in your data before you even start modeling.&lt;/p&gt;
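&lt;p&gt;Roughly what that looks like under the hood, sketched with scikit-learn (the library's selection logic is presumably more elaborate):&lt;/p&gt;

```python
# Pick k by silhouette score, then project to 2-D with PCA for plotting.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=7)
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print("best k:", best_k)

coords = PCA(n_components=2).fit_transform(X)  # 2-D view for the scatter plot
print("projection shape:", coords.shape)
```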




&lt;h2&gt;
  
  
  Stage 2: Pre-Model Intelligence
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;intel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze_intelligence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Data Readiness Score
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;intel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readiness&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Data Readiness: 96/100 (Grade A)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A composite score from 0 to 100 based on missing values, class imbalance, multicollinearity, outlier prevalence, and feature quality. To my knowledge, no other library surfaces a single readiness number like this.&lt;/p&gt;
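&lt;p&gt;In spirit, such a score is a weighted penalty sum. A toy version with invented weights and thresholds (the library's actual formula may well differ):&lt;/p&gt;

```python
# Hypothetical readiness score: start at 100, subtract weighted penalties.
import numpy as np
import pandas as pd

def readiness_score(df, target):
    score = 100.0
    # Penalty: overall missingness fraction
    score -= 50 * df.isna().mean().mean()
    # Penalty: class imbalance (0 for balanced, grows as the majority dominates)
    counts = df[target].value_counts(normalize=True)
    score -= 30 * (counts.max() - 1 / len(counts))
    # Penalty: any near-duplicate numeric pair (multicollinearity)
    num = df.select_dtypes("number").drop(columns=[target], errors="ignore")
    if num.shape[1] >= 2:
        corr = num.corr().abs().to_numpy()
        np.fill_diagonal(corr, 0)
        score -= 20 * (corr.max() > 0.95)
    return max(0.0, round(score))

df = pd.DataFrame({"a": [1, 2, 3, 4] * 25,
                   "b": [1.0, 2.0, 3.0, 4.1] * 25,  # nearly duplicates 'a'
                   "y": [0, 1] * 50})
print(readiness_score(df, "y"))
```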

&lt;h3&gt;
  
  
  Target Leakage Detection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;intel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;leakage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four-pronged leakage scan: suspiciously high correlations, look-ahead bias in temporal features, near-perfect predictors, and data contamination patterns. Catches issues that silently inflate your metrics.&lt;/p&gt;
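&lt;p&gt;The near-perfect-predictor prong can be approximated by scoring each feature on its own: anything with a suspiciously high single-feature score deserves a second look. A sketch with an invented 0.99 threshold (not dissectml's exact heuristic):&lt;/p&gt;

```python
# Score each column alone; a near-perfect single-feature AUC smells like leakage.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
n = 400
y = rng.integers(0, 2, n)
df = pd.DataFrame({
    "honest_feature": rng.normal(size=n) + 0.5 * y,       # weakly predictive
    "leaky_feature": y + rng.normal(scale=0.01, size=n),  # basically the target
})

aucs = {}
for col in df.columns:
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    aucs[col] = cross_val_score(model, df[[col]], y, cv=5, scoring="roc_auc").mean()
    flag = "  !! suspicious" if aucs[col] > 0.99 else ""
    print(f"{col}: AUC={aucs[col]:.3f}{flag}")
```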

&lt;h3&gt;
  
  
  Algorithm Recommendations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;intel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Based on your data characteristics (size, non-linearity, cardinality, sparsity), recommends which algorithm families to prioritize. Small dataset with non-linear relationships? Trees and ensembles rank high, neural nets rank low.&lt;/p&gt;
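&lt;p&gt;The recommendation engine is presumably a rule table keyed on dataset traits. A toy sketch with invented rules, purely to show the shape of the logic:&lt;/p&gt;

```python
# Hypothetical rule-based recommender; every rule here is invented for illustration.
def recommend(n_rows, has_nonlinear, high_cardinality):
    recs = []
    if has_nonlinear:
        recs.append("tree ensembles")      # capture interactions cheaply
    if 5_000 >= n_rows:
        recs.append("regularized linear")  # low variance on small data
        recs.append("avoid deep nets")     # too few samples to fit them
    if n_rows >= 100_000 and not high_cardinality:
        recs.append("neural nets")
    return recs

# Titanic-sized input: small and non-linear
print(recommend(n_rows=891, has_nonlinear=True, high_cardinality=False))
```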




&lt;h2&gt;
  
  
  Stage 3: Model Battle
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;battle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;leaderboard&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This trains &lt;strong&gt;19 classifiers&lt;/strong&gt; in parallel with cross-validation and returns a sorted leaderboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;                     &lt;span class="k"&gt;model&lt;/span&gt;         &lt;span class="k"&gt;accuracy&lt;/span&gt;    &lt;span class="k"&gt;f&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;weighted&lt;/span&gt;    &lt;span class="k"&gt;train&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;time&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;s&lt;/span&gt;
&lt;span class="mf"&gt;0&lt;/span&gt;   &lt;span class="k"&gt;GradientBoostingClassifier&lt;/span&gt;     &lt;span class="mf"&gt;0.8260&lt;/span&gt;       &lt;span class="mf"&gt;0.8245&lt;/span&gt;         &lt;span class="mf"&gt;5.01&lt;/span&gt;
&lt;span class="mf"&gt;1&lt;/span&gt;   &lt;span class="k"&gt;RandomForestClassifier&lt;/span&gt;         &lt;span class="mf"&gt;0.8080&lt;/span&gt;       &lt;span class="mf"&gt;0.8062&lt;/span&gt;         &lt;span class="mf"&gt;3.90&lt;/span&gt;
&lt;span class="mf"&gt;2&lt;/span&gt;   &lt;span class="k"&gt;LogisticRegression&lt;/span&gt;             &lt;span class="mf"&gt;0.7970&lt;/span&gt;       &lt;span class="mf"&gt;0.7958&lt;/span&gt;         &lt;span class="mf"&gt;0.84&lt;/span&gt;
&lt;span class="err"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each model is automatically paired with appropriate preprocessing — tree-based models skip scaling, linear models get StandardScaler, categorical features get encoded based on cardinality.&lt;/p&gt;
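&lt;p&gt;That pairing rule is easy to picture with plain scikit-learn. A sketch assuming hypothetical Titanic column names (dissectml infers these from the data automatically):&lt;/p&gt;

```python
# Per-model preprocessing: scale for linear models, passthrough for trees.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "fare"]       # hypothetical numeric columns
low_card_cols = ["sex", "embarked"]  # low cardinality: one-hot is cheap

def paired_pipeline(model, needs_scaling):
    # Linear models get scaling; tree splits are scale-invariant, so skip it
    numeric_step = StandardScaler() if needs_scaling else "passthrough"
    pre = ColumnTransformer([
        ("num", numeric_step, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), low_card_cols),
    ])
    return Pipeline([("pre", pre), ("model", model)])

linear = paired_pipeline(LogisticRegression(max_iter=1000), needs_scaling=True)
forest = paired_pipeline(RandomForestClassifier(), needs_scaling=False)
```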

&lt;p&gt;Want only specific models?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Filter by family
&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;battle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;families&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tree&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;linear&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Or pick specific models
&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;battle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                    &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LogisticRegression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;XGBClassifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Stage 4: Full Pipeline
&lt;/h2&gt;

&lt;p&gt;Now let's run everything together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This chains all stages: EDA → Intelligence → Battle → Compare. The returned report object gives you access to everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Text summary
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# === DissectML Analysis Report ===
# Task: classification  |  Target: survived
# Dataset: 891 samples × 7 features
# Data Readiness: 96/100 (Grade A)
# Best Model: GradientBoostingClassifier (accuracy=0.8260)
&lt;/span&gt;
&lt;span class="c1"&gt;# Access any sub-result
&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;correlations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;leaderboard&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;intelligence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readiness&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Export interactive HTML report
&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HTML report is a single self-contained file with interactive Plotly charts, collapsible sections, a sidebar table of contents, and narrative summaries. Open it in any browser, share it with stakeholders, attach it to an email.&lt;/p&gt;




&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# View current settings
&lt;/span&gt;&lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Customize for this session
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv_folds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Installation Options
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Core (sklearn + plotly only)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dissectml

&lt;span class="c"&gt;# With XGBoost, LightGBM, CatBoost&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dissectml[boost]

&lt;span class="c"&gt;# With SHAP explainability&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dissectml[explain]

&lt;span class="c"&gt;# Everything&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dissectml[full]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What Makes This Different
&lt;/h2&gt;

&lt;p&gt;I've used PyCaret, LazyPredict, and YData Profiling extensively. They're great tools. But each one covers only part of the workflow:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What You Need&lt;/th&gt;
&lt;th&gt;Old Way&lt;/th&gt;
&lt;th&gt;dissectml&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Understand your data&lt;/td&gt;
&lt;td&gt;YData Profiling&lt;/td&gt;
&lt;td&gt;&lt;code&gt;dml.explore(df)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Check for leakage/issues&lt;/td&gt;
&lt;td&gt;Manual code&lt;/td&gt;
&lt;td&gt;&lt;code&gt;dml.analyze_intelligence(df)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compare models&lt;/td&gt;
&lt;td&gt;PyCaret/LazyPredict&lt;/td&gt;
&lt;td&gt;&lt;code&gt;dml.battle(df)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explain why models differ&lt;/td&gt;
&lt;td&gt;SHAP + matplotlib&lt;/td&gt;
&lt;td&gt;&lt;code&gt;report.compare&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Share findings&lt;/td&gt;
&lt;td&gt;Copy-paste into slides&lt;/td&gt;
&lt;td&gt;&lt;code&gt;report.export("report.html")&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;All of the above&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5 libraries, 200 lines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3 lines&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: these stages shouldn't be independent tools. Your EDA findings should inform your model preprocessing. Your model comparison should include statistical significance tests. Your report should contain both data insights and model insights in one place.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;pip install dissectml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/rupeshbharambe24/dissectML" rel="noopener noreferrer"&gt;github.com/rupeshbharambe24/dissectML&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/dissectml/" rel="noopener noreferrer"&gt;pypi.org/project/dissectml&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this saves you time, drop a ⭐ on GitHub — it genuinely helps with discoverability.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Rupesh Bharambe — ML Engineer &amp;amp; Open Source Developer&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Analyzed 26 ML Libraries and Found a Gap Nobody Fills - So I Built It</title>
      <dc:creator>Rupesh Bharambe</dc:creator>
      <pubDate>Tue, 07 Apr 2026 07:18:20 +0000</pubDate>
      <link>https://forem.com/rupesh24/i-analyzed-26-ml-libraries-and-found-a-gap-nobody-fills-so-i-built-it-kad</link>
      <guid>https://forem.com/rupesh24/i-analyzed-26-ml-libraries-and-found-a-gap-nobody-fills-so-i-built-it-kad</guid>
      <description>&lt;h2&gt;
  
  
  &lt;em&gt;How I built dissectml, the missing middle layer between EDA and AutoML.&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Every data science project starts the same way.&lt;/p&gt;

&lt;p&gt;You load your dataset. You run &lt;code&gt;df.describe()&lt;/code&gt;. You open YData Profiling for a quick report. Then you switch to PyCaret or LazyPredict to screen a bunch of models. Then you pull in SHAP for explainability. Then matplotlib for custom comparison plots. By the time you actually understand your data &lt;em&gt;and&lt;/em&gt; your models, you've imported five libraries, written 200 lines of glue code, and it's been three hours.&lt;/p&gt;

&lt;p&gt;I kept asking myself: &lt;strong&gt;why isn't there one library that does the full journey?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So I researched every tool in the space. Thoroughly. And then I built the one that was missing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Research That Started Everything
&lt;/h2&gt;

&lt;p&gt;I spent weeks doing deep market research on two categories: &lt;strong&gt;Auto-EDA tools&lt;/strong&gt; (libraries that explore your data) and &lt;strong&gt;AutoML/model comparison tools&lt;/strong&gt; (libraries that train and compare models).&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto-EDA landscape (10+ libraries):
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;YData Profiling&lt;/strong&gt; (13K+ GitHub stars) — the king of one-line profiling reports. Great for stats and correlations, but no model insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DataPrep&lt;/strong&gt; — Dask-powered, 10x faster. But stops at data profiling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SweetViz&lt;/strong&gt; — beautiful HTML reports with target analysis. But static and shallow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D-Tale&lt;/strong&gt; — Flask+React interactive GUI. Impressive, but no ML integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoViz&lt;/strong&gt;, &lt;strong&gt;Lux&lt;/strong&gt;, &lt;strong&gt;klib&lt;/strong&gt;, &lt;strong&gt;Missingno&lt;/strong&gt; — each does one thing well but nothing end-to-end.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AutoML landscape (16+ frameworks):
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyCaret&lt;/strong&gt; (9K+ stars) — low-code model comparison with &lt;code&gt;compare_models()&lt;/code&gt;. But no deep EDA, no statistical significance tests between models, no cross-model error analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LazyPredict&lt;/strong&gt; — trains 30 models in 2 lines. But zero depth: no plots, no tuning, no explanations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoGluon&lt;/strong&gt; (AWS) — wins competitions via stacking. But it's a black box focused on prediction, not understanding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLJAR&lt;/strong&gt; — per-model SHAP reports. But reports are per-model, not comparative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FLAML&lt;/strong&gt; (Microsoft), &lt;strong&gt;H2O&lt;/strong&gt;, &lt;strong&gt;TPOT&lt;/strong&gt;, &lt;strong&gt;EvalML&lt;/strong&gt; — all focused on &lt;em&gt;finding the best model&lt;/em&gt;, not &lt;em&gt;understanding why&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The gap I found:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;YData&lt;/th&gt;
&lt;th&gt;PyCaret&lt;/th&gt;
&lt;th&gt;LazyPredict&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Nobody&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deep EDA with statistical tests&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Train 20+ models in one call&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-model error analysis&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Statistical significance between models&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target leakage detection&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data readiness score&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EDA insights informing model selection&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End-to-end: EDA → Models → Report&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Look at the &lt;strong&gt;Nobody&lt;/strong&gt; column: six of those eight capabilities exist in no tool at all. Not a single library bridges the full journey from "What is my data?" to "Which model is best, and WHY?"&lt;/p&gt;

&lt;p&gt;That's not an AutoML gap. It's an &lt;strong&gt;Auto-Analysis&lt;/strong&gt; gap.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built: dissectml
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;dissectml&lt;/strong&gt; is a Python library that unifies deep EDA with comparative model analysis in a single, coherent pipeline. It has five stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deep EDA&lt;/strong&gt; — auto-detect types, distributions, correlations (Pearson + Spearman + Cramér's V), missing data patterns (MCAR/MAR/MNAR), outlier detection (IQR + Z-score + Isolation Forest), statistical tests (Shapiro-Wilk, chi-square, ANOVA), cluster discovery, feature interactions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pre-Model Intelligence&lt;/strong&gt; — target leakage detection, multicollinearity (VIF), data readiness score (0-100 with letter grade), algorithm recommendations based on data characteristics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model Battle&lt;/strong&gt; — parallel cross-validation across 19 classifiers or 19 regressors. Supports XGBoost, LightGBM, CatBoost as optional extras.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Comparative Analysis&lt;/strong&gt; — side-by-side metrics, ROC/PR curves, confusion matrices, cross-model error analysis, McNemar/corrected paired t-tests for statistical significance, accuracy vs speed Pareto front.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HTML Report&lt;/strong&gt; — self-contained interactive report with Plotly charts, collapsible sections, and narrative summaries.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
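&lt;p&gt;To make the statistical-significance step in stage 4 concrete, here is a minimal, dependency-free sketch of the exact McNemar test on two classifiers' predictions. The function name and shape are illustrative, not dissectml's actual API:&lt;/p&gt;

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact McNemar test on the disagreements between two classifiers.

    b = cases where A is right and B is wrong; c = the reverse.
    Under H0 (equal error rates) the disagreement counts follow
    Binomial(b + c, 0.5), so an exact two-sided p-value is available.
    """
    b = sum(1 for t, a, bb in zip(y_true, pred_a, pred_b) if a == t and bb != t)
    c = sum(1 for t, a, bb in zip(y_true, pred_a, pred_b) if a != t and bb == t)
    n = b + c
    if n == 0:
        return 1.0                      # the models never disagree
    k = min(b, c)
    # two-sided exact p-value: 2 * P(X <= k) for X ~ Binomial(n, 0.5), capped at 1
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# 9 disagreements favour model A, 1 favours model B:
p = mcnemar_exact([1] * 10, [1] * 9 + [0], [0] * 9 + [1])  # ~0.0215
```

&lt;p&gt;With a 9-vs-1 split in the disagreements, the p-value lands around 0.02, so the two models' error rates differ at the usual 5% level.&lt;/p&gt;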

&lt;h3&gt;
  
  
  The API is 3 lines:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dissectml&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_titanic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Five stages. One function call. One interactive report.&lt;/p&gt;

&lt;p&gt;Or use any stage independently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Just EDA
&lt;/span&gt;&lt;span class="n"&gt;eda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;correlations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outliers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normality&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Just model comparison
&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;battle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;leaderboard&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Architecture Decisions
&lt;/h2&gt;

&lt;p&gt;A few choices I'm proud of:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lazy evaluation everywhere.&lt;/strong&gt; &lt;code&gt;dml.explore()&lt;/code&gt; returns instantly. Computation only happens when you access a sub-module like &lt;code&gt;eda.correlations&lt;/code&gt;. This means you never wait for analysis you don't need.&lt;/p&gt;
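&lt;p&gt;The lazy pattern is easy to sketch with the standard library's &lt;code&gt;functools.cached_property&lt;/code&gt;. This is a toy version of the idea, not dissectml's actual implementation:&lt;/p&gt;

```python
from functools import cached_property

class Explore:
    """Toy lazily-evaluated EDA handle: constructing it is free; each
    sub-analysis runs once, on first attribute access, then is cached."""

    def __init__(self, df):
        self.df = df
        self.computed = []               # records what actually ran, for illustration

    @cached_property
    def correlations(self):
        self.computed.append("correlations")
        # ...expensive pairwise-correlation work would go here...
        return {"fare~pclass": -0.55}    # placeholder result

    @cached_property
    def outliers(self):
        self.computed.append("outliers")
        return {"fare": [512.33]}        # placeholder result

eda = Explore(df=None)   # instant: nothing has been computed yet
eda.correlations         # first access runs the analysis
eda.correlations         # second access hits the cache
# eda.computed is now ["correlations"]; outliers never ran
```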

&lt;p&gt;&lt;strong&gt;EDA informs model training.&lt;/strong&gt; The intelligence stage detects your data characteristics (non-linearity, sparsity, cardinality) and feeds that into the battle stage's preprocessing. Tree-based models skip scaling. High-cardinality categoricals get target encoding instead of one-hot. The pipeline adapts to your data.&lt;/p&gt;
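&lt;p&gt;In spirit, that adaptive step looks something like the sketch below. The names and the cardinality threshold are illustrative, not the library's internals:&lt;/p&gt;

```python
TREE_MODELS = {"RandomForest", "XGBoost", "LightGBM", "CatBoost"}

def preprocessing_plan(model_name, cat_cardinalities, high_card_threshold=15):
    """Pick preprocessing steps from the model family plus data characteristics."""
    plan = []
    if model_name not in TREE_MODELS:
        plan.append("standard_scale")            # distance/margin models need scaling
    for col, cardinality in cat_cardinalities.items():
        if cardinality > high_card_threshold:
            plan.append(f"target_encode:{col}")  # one-hot would explode dimensions
        else:
            plan.append(f"one_hot:{col}")
    return plan

preprocessing_plan("XGBoost", {"sex": 2})
# -> ["one_hot:sex"]  (tree model: no scaling step)
preprocessing_plan("LogisticRegression", {"cabin": 147, "sex": 2})
# -> ["standard_scale", "target_encode:cabin", "one_hot:sex"]
```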

&lt;p&gt;&lt;strong&gt;Optional dependencies done right.&lt;/strong&gt; Core package needs only sklearn + plotly. XGBoost/LightGBM/CatBoost install with &lt;code&gt;pip install dissectml[boost]&lt;/code&gt;. SHAP with &lt;code&gt;[explain]&lt;/code&gt;. If an optional model isn't installed, it's silently skipped — no crashes.&lt;/p&gt;
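&lt;p&gt;The standard pattern behind that behavior is a guarded import at catalog-build time. The mapping below is a sketch of the idea, not dissectml's internal table:&lt;/p&gt;

```python
import importlib

# Optional model backends: package name and the estimator class to pull from it.
OPTIONAL_MODELS = {
    "XGBoost": ("xgboost", "XGBClassifier"),
    "LightGBM": ("lightgbm", "LGBMClassifier"),
    "CatBoost": ("catboost", "CatBoostClassifier"),
}

def available_optional_models():
    """Return only the optional models whose package imports cleanly.

    A missing package means the entry is skipped, not an exception,
    so the battle stage simply runs with fewer contenders.
    """
    found = {}
    for name, (module, cls_name) in OPTIONAL_MODELS.items():
        try:
            module_obj = importlib.import_module(module)
        except ImportError:
            continue               # optional extra not installed: skip silently
        found[name] = getattr(module_obj, cls_name)
    return found
```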

&lt;p&gt;&lt;strong&gt;Modular plugin architecture.&lt;/strong&gt; Each EDA sub-module, each model entry, each comparison method is a self-contained unit. Want to add a custom model? Register it with the model registry. Want to add a custom EDA analysis? Extend the base class.&lt;/p&gt;
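&lt;p&gt;A registry of that shape can be as small as a decorated dict. The names below are a sketch of the pattern, not dissectml's public API:&lt;/p&gt;

```python
MODEL_REGISTRY = {}

def register_model(name, task="classification"):
    """Decorator that plugs a model factory into the catalog
    without touching library internals."""
    def wrap(factory):
        MODEL_REGISTRY[(task, name)] = factory
        return factory
    return wrap

@register_model("MajorityClass")
def make_majority():
    # A deliberately trivial estimator-like object, just to show the hook.
    class Majority:
        def fit(self, X, y):
            self.label_ = max(set(y), key=list(y).count)
            return self
        def predict(self, X):
            return [self.label_] * len(X)
    return Majority()

# The battle stage would now see ("classification", "MajorityClass") in the catalog.
```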




&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;11,000+ lines of source code&lt;/strong&gt; across 67 files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;600+ tests&lt;/strong&gt;, all passing, 82% coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0 lint issues&lt;/strong&gt; (ruff-clean)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;19 classifiers + 19 regressors&lt;/strong&gt; in the model catalog&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10 EDA sub-modules&lt;/strong&gt;: overview, univariate, bivariate, correlations, missing, outliers, statistical tests, clusters, interactions, target analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;148KB wheel&lt;/strong&gt; on PyPI&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It Now
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dissectml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dissectml&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;

&lt;span class="c1"&gt;# Load the built-in Titanic dataset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_titanic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Full pipeline: EDA → Intelligence → Battle → Compare → Report
&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/rupeshbharambe24/InsightML" rel="noopener noreferrer"&gt;github.com/rupeshbharambe24/InsightML&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/dissectml/" rel="noopener noreferrer"&gt;pypi.org/project/dissectml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you find this useful, a ⭐ on GitHub means a lot — it's what helps open-source projects get discovered.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v0.2&lt;/strong&gt;: Polars backend for 10x EDA speed on large datasets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.3&lt;/strong&gt;: Deep learning models (PyTorch MLP, TabNet)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.4&lt;/strong&gt;: PDF export and branded report templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.5&lt;/strong&gt;: LLM-powered narrative insights (natural language summaries of findings)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I built this because I was tired of stitching together five libraries every time I started a new ML project. If you feel the same way, give dissectml a try and let me know what you think.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🚀 Try it now (no install needed):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://colab.research.google.com/github/rupeshbharambe24/InsightML/blob/master/notebooks/dissectml_demo.ipynb" rel="noopener noreferrer"&gt;Run in Google Colab&lt;/a&gt; — full demo, runs in your browser in 60 seconds&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://www.kaggle.com/code/YOUR_KAGGLE_USERNAME/titanic-dissectml" rel="noopener noreferrer"&gt;Kaggle Notebook&lt;/a&gt; — with rendered outputs&lt;/p&gt;

&lt;p&gt;👉 &lt;code&gt;pip install dissectml&lt;/code&gt; — install locally&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt; &lt;a href="https://github.com/rupeshbharambe24/InsightML" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://pypi.org/project/dissectml/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; · &lt;a href="https://insightml.readthedocs.io" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/p&gt;





&lt;p&gt;&lt;em&gt;Rupesh Bharambe — AI/ML Engineer &amp;amp; Open Source Developer&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Find me on &lt;a href="https://github.com/rupeshbharambe24" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




</description>
      <category>python</category>
      <category>opensource</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
