<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rupesh Bharambe</title>
    <description>The latest articles on Forem by Rupesh Bharambe (@rupesh24).</description>
    <link>https://forem.com/rupesh24</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3865174%2F0ee9edb5-b608-42a5-b3cb-fda11a2050c1.jpg</url>
      <title>Forem: Rupesh Bharambe</title>
      <link>https://forem.com/rupesh24</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rupesh24"/>
    <language>en</language>
    <item>
      <title>From Raw CSV to Model Comparison in 3 Lines of Python</title>
      <dc:creator>Rupesh Bharambe</dc:creator>
      <pubDate>Wed, 08 Apr 2026 10:54:55 +0000</pubDate>
      <link>https://forem.com/rupesh24/from-raw-csv-to-model-comparison-in-3-lines-of-python-3hdd</link>
      <guid>https://forem.com/rupesh24/from-raw-csv-to-model-comparison-in-3-lines-of-python-3hdd</guid>
      <description>&lt;p&gt;&lt;em&gt;A hands-on tutorial with dissectml — the library that combines deep EDA with model comparison.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Let me show you something. This is how most data scientists start a project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ydata_profiling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProfileReport&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GradientBoostingClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LabelEncoder&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confusion_matrix&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shap&lt;/span&gt;
&lt;span class="c1"&gt;# ... 150 more lines of boilerplate
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this is the same thing with dissectml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dissectml&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;

&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same output. Same depth. Three lines. Let me walk you through what happens under the hood.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dissectml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this tutorial, we'll use the built-in Titanic dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dissectml&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_titanic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dataset: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows × &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; columns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Dataset: 891 rows × 8 columns
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Stage 1: Deep EDA
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns instantly — dissectml uses lazy evaluation, so nothing computes until you ask for it. Now let's explore:&lt;/p&gt;
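&lt;p&gt;Lazy evaluation here most likely means cached properties: each section is computed on first access and memoized. A minimal sketch of that pattern (an assumption about the design, not the actual implementation):&lt;/p&gt;

```python
# Sketch of lazy, memoized report sections (assumed pattern, not dissectml's code).
import pandas as pd
from functools import cached_property

class Explorer:
    def __init__(self, df):
        self.df = df  # just stored; nothing is computed yet

    @cached_property
    def correlations(self):
        # Runs once, on first access; the result is cached afterwards
        return self.df.corr()

ex = Explorer(pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6]}))
print(ex.correlations.loc["a", "b"])  # first access triggers the computation
```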

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;overview&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This auto-detects column types (numeric, categorical, boolean, datetime, high-cardinality, constant), shows memory usage, and generates a type distribution chart.&lt;/p&gt;

&lt;h3&gt;
  
  
  Correlations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;correlations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unlike basic &lt;code&gt;df.corr()&lt;/code&gt;, this computes a &lt;strong&gt;unified correlation matrix&lt;/strong&gt; that handles mixed types: Pearson for numeric-numeric, Cramér's V for categorical-categorical, and correlation ratio (eta) for numeric-categorical pairs. All in one heatmap.&lt;/p&gt;
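&lt;p&gt;For intuition, here is what the two non-Pearson measures compute, sketched by hand (an illustration of the statistics, not dissectml's implementation):&lt;/p&gt;

```python
# Hand-rolled versions of the two mixed-type association measures.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V for two categorical series (0 = independent, 1 = perfect)."""
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    r, k = confusion.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

def correlation_ratio(categories, values):
    """Eta: how much of a numeric column's variance the category groups explain."""
    values = np.asarray(values, dtype=float)
    groups = [values[categories == g] for g in np.unique(categories)]
    grand_mean = values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((values - grand_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total)

df = pd.DataFrame({
    "sex":  ["m", "f", "m", "f", "m", "f", "m", "f"],
    "deck": ["A", "B", "A", "B", "A", "B", "B", "A"],
    "fare": [10.0, 50.0, 12.0, 48.0, 9.0, 52.0, 47.0, 11.0],
})
print(round(cramers_v(df["sex"], df["deck"]), 3))
print(round(correlation_ratio(df["deck"].to_numpy(), df["fare"]), 3))
```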

&lt;h3&gt;
  
  
  Missing Data Intelligence
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This goes beyond "column X has 20% missing." It analyzes the &lt;em&gt;pattern&lt;/em&gt; of missingness — is it Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR)? This determines which imputation strategy you should use.&lt;/p&gt;
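&lt;p&gt;One way to probe this yourself, using a simple dependence test rather than dissectml's internals: check whether a column's missingness indicator depends on the other columns. If it does, the data is not MCAR.&lt;/p&gt;

```python
# Probing MCAR vs. MAR by hand (an assumed approach, not the library's code).
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
fare = rng.exponential(30, size=500)
age = rng.normal(35, 10, size=500)
cheap = np.less(fare, 15)  # fares below 15
# Knock out 'age' mostly for cheap tickets, i.e. MAR rather than MCAR
age[cheap] = np.where(rng.random(cheap.sum()) > 0.4, np.nan, age[cheap])
df = pd.DataFrame({"age": age, "fare": fare})

missing = df["age"].isna()
stat, p = mannwhitneyu(df.loc[missing, "fare"], df.loc[~missing, "fare"])
print(f"p = {p:.3g}")  # a tiny p means missingness depends on fare: MAR, not MCAR
```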

&lt;h3&gt;
  
  
  Outlier Detection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outliers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Runs three methods simultaneously — IQR, Z-score, and Isolation Forest — and shows a consensus view. Points flagged by all three methods are the most confident outliers.&lt;/p&gt;
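&lt;p&gt;The consensus idea is easy to reproduce by hand (assumed voting logic; the library may weight the three methods differently):&lt;/p&gt;

```python
# Consensus outlier voting with IQR fences, z-scores, and Isolation Forest.
import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0, 1, 200), [8.0, -9.0, 10.0]])  # 3 planted outliers

# Method 1: IQR fences
q1, q3 = np.percentile(x, [25, 75])
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
iqr_flag = np.less(x, lower) | np.greater(x, upper)

# Method 2: absolute z-score above 3
z_flag = np.greater(np.abs(stats.zscore(x)), 3)

# Method 3: Isolation Forest
iso = IsolationForest(random_state=0).fit(x.reshape(-1, 1))
iso_flag = iso.predict(x.reshape(-1, 1)) == -1

votes = iqr_flag.astype(int) + z_flag.astype(int) + iso_flag.astype(int)
print("flagged by all three methods:", np.where(votes == 3)[0])
```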

&lt;h3&gt;
  
  
  Statistical Tests
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normality&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;independence&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Automated Shapiro-Wilk normality tests for all numeric columns, chi-square independence tests for categorical pairs, and ANOVA/Kruskal-Wallis for group comparisons against the target.&lt;/p&gt;
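&lt;p&gt;For intuition, the same battery can be run by hand with scipy (illustrative only):&lt;/p&gt;

```python
# Shapiro-Wilk normality and chi-square independence, done manually with scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_col = rng.normal(size=300)
skewed_col = rng.exponential(size=300)

# Shapiro-Wilk: null hypothesis = the sample is normally distributed
_, p_normal = stats.shapiro(normal_col)
_, p_skewed = stats.shapiro(skewed_col)
print(f"normal col p={p_normal:.3f}, skewed col p={p_skewed:.2e}")

# Chi-square independence on a 2x2 contingency table (e.g. sex vs. survived counts)
table = np.array([[90, 10], [30, 70]])
chi2, p_ind, dof, expected = stats.chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p_ind:.2e}")
```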

&lt;h3&gt;
  
  
  Cluster Discovery
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatter_2d&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Automatically runs K-Means and DBSCAN, finds the optimal number of clusters, and visualizes them with PCA projection. Reveals hidden structure in your data before you even start modeling.&lt;/p&gt;
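&lt;p&gt;Roughly what that looks like under the hood, sketched with scikit-learn (the library's selection logic is presumably more elaborate):&lt;/p&gt;

```python
# Pick k by silhouette score, then project to 2-D with PCA for plotting.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=7)
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print("best k:", best_k)

coords = PCA(n_components=2).fit_transform(X)  # 2-D view for the scatter plot
print("projection shape:", coords.shape)
```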




&lt;h2&gt;
  
  
  Stage 2: Pre-Model Intelligence
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;intel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze_intelligence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Data Readiness Score
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;intel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readiness&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Data Readiness: 96/100 (Grade A)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A composite score from 0 to 100 based on missing values, class imbalance, multicollinearity, outlier prevalence, and feature quality. To my knowledge, no other library surfaces a single readiness number like this.&lt;/p&gt;
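&lt;p&gt;In spirit, such a score is a weighted penalty sum. A toy version with invented weights and thresholds (the library's actual formula may well differ):&lt;/p&gt;

```python
# Hypothetical readiness score: start at 100, subtract weighted penalties.
import numpy as np
import pandas as pd

def readiness_score(df, target):
    score = 100.0
    # Penalty: overall missingness fraction
    score -= 50 * df.isna().mean().mean()
    # Penalty: class imbalance (0 for balanced, grows as the majority dominates)
    counts = df[target].value_counts(normalize=True)
    score -= 30 * (counts.max() - 1 / len(counts))
    # Penalty: any near-duplicate numeric pair (multicollinearity)
    num = df.select_dtypes("number").drop(columns=[target], errors="ignore")
    if num.shape[1] >= 2:
        corr = num.corr().abs().to_numpy()
        np.fill_diagonal(corr, 0)
        score -= 20 * (corr.max() > 0.95)
    return max(0.0, round(score))

df = pd.DataFrame({"a": [1, 2, 3, 4] * 25,
                   "b": [1.0, 2.0, 3.0, 4.1] * 25,  # nearly duplicates 'a'
                   "y": [0, 1] * 50})
print(readiness_score(df, "y"))
```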

&lt;h3&gt;
  
  
  Target Leakage Detection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;intel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;leakage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four-pronged leakage scan: suspiciously high correlations, look-ahead bias in temporal features, near-perfect predictors, and data contamination patterns. Catches issues that silently inflate your metrics.&lt;/p&gt;
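&lt;p&gt;The near-perfect-predictor prong can be approximated by scoring each feature on its own: anything with a suspiciously high single-feature score deserves a second look. A sketch with an invented 0.99 threshold (not dissectml's exact heuristic):&lt;/p&gt;

```python
# Score each column alone; a near-perfect single-feature AUC smells like leakage.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
n = 400
y = rng.integers(0, 2, n)
df = pd.DataFrame({
    "honest_feature": rng.normal(size=n) + 0.5 * y,       # weakly predictive
    "leaky_feature": y + rng.normal(scale=0.01, size=n),  # basically the target
})

aucs = {}
for col in df.columns:
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    aucs[col] = cross_val_score(model, df[[col]], y, cv=5, scoring="roc_auc").mean()
    flag = "  !! suspicious" if aucs[col] > 0.99 else ""
    print(f"{col}: AUC={aucs[col]:.3f}{flag}")
```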

&lt;h3&gt;
  
  
  Algorithm Recommendations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;intel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recommendations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Based on your data characteristics (size, non-linearity, cardinality, sparsity), recommends which algorithm families to prioritize. Small dataset with non-linear relationships? Trees and ensembles rank high, neural nets rank low.&lt;/p&gt;
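&lt;p&gt;The recommendation engine is presumably a rule table keyed on dataset traits. A toy sketch with invented rules, purely to show the shape of the logic:&lt;/p&gt;

```python
# Hypothetical rule-based recommender; every rule here is invented for illustration.
def recommend(n_rows, has_nonlinear, high_cardinality):
    recs = []
    if has_nonlinear:
        recs.append("tree ensembles")      # capture interactions cheaply
    if 5_000 >= n_rows:
        recs.append("regularized linear")  # low variance on small data
        recs.append("avoid deep nets")     # too few samples to fit them
    if n_rows >= 100_000 and not high_cardinality:
        recs.append("neural nets")
    return recs

# Titanic-sized input: small and non-linear
print(recommend(n_rows=891, has_nonlinear=True, high_cardinality=False))
```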




&lt;h2&gt;
  
  
  Stage 3: Model Battle
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;battle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;leaderboard&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This trains &lt;strong&gt;19 classifiers&lt;/strong&gt; in parallel with cross-validation and returns a sorted leaderboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;                     &lt;span class="k"&gt;model&lt;/span&gt;         &lt;span class="k"&gt;accuracy&lt;/span&gt;    &lt;span class="k"&gt;f&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;weighted&lt;/span&gt;    &lt;span class="k"&gt;train&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;time&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;s&lt;/span&gt;
&lt;span class="mf"&gt;0&lt;/span&gt;   &lt;span class="k"&gt;GradientBoostingClassifier&lt;/span&gt;     &lt;span class="mf"&gt;0.8260&lt;/span&gt;       &lt;span class="mf"&gt;0.8245&lt;/span&gt;         &lt;span class="mf"&gt;5.01&lt;/span&gt;
&lt;span class="mf"&gt;1&lt;/span&gt;   &lt;span class="k"&gt;RandomForestClassifier&lt;/span&gt;         &lt;span class="mf"&gt;0.8080&lt;/span&gt;       &lt;span class="mf"&gt;0.8062&lt;/span&gt;         &lt;span class="mf"&gt;3.90&lt;/span&gt;
&lt;span class="mf"&gt;2&lt;/span&gt;   &lt;span class="k"&gt;LogisticRegression&lt;/span&gt;             &lt;span class="mf"&gt;0.7970&lt;/span&gt;       &lt;span class="mf"&gt;0.7958&lt;/span&gt;         &lt;span class="mf"&gt;0.84&lt;/span&gt;
&lt;span class="err"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each model is automatically paired with appropriate preprocessing — tree-based models skip scaling, linear models get StandardScaler, categorical features get encoded based on cardinality.&lt;/p&gt;
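&lt;p&gt;That pairing rule is easy to picture with plain scikit-learn. A sketch assuming hypothetical Titanic column names (dissectml infers these from the data automatically):&lt;/p&gt;

```python
# Per-model preprocessing: scale for linear models, passthrough for trees.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "fare"]       # hypothetical numeric columns
low_card_cols = ["sex", "embarked"]  # low cardinality: one-hot is cheap

def paired_pipeline(model, needs_scaling):
    # Linear models get scaling; tree splits are scale-invariant, so skip it
    numeric_step = StandardScaler() if needs_scaling else "passthrough"
    pre = ColumnTransformer([
        ("num", numeric_step, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), low_card_cols),
    ])
    return Pipeline([("pre", pre), ("model", model)])

linear = paired_pipeline(LogisticRegression(max_iter=1000), needs_scaling=True)
forest = paired_pipeline(RandomForestClassifier(), needs_scaling=False)
```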

&lt;p&gt;Want only specific models?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Filter by family
&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;battle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;families&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tree&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;linear&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Or pick specific models
&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;battle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                    &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LogisticRegression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;XGBClassifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Stage 4: Full Pipeline
&lt;/h2&gt;

&lt;p&gt;Now let's run everything together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This chains all stages: EDA → Intelligence → Battle → Compare. The returned report object gives you access to everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Text summary
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# === DissectML Analysis Report ===
# Task: classification  |  Target: survived
# Dataset: 891 samples × 7 features
# Data Readiness: 96/100 (Grade A)
# Best Model: GradientBoostingClassifier (accuracy=0.8260)
&lt;/span&gt;
&lt;span class="c1"&gt;# Access any sub-result
&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;correlations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;leaderboard&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;intelligence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readiness&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Export interactive HTML report
&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HTML report is a single self-contained file with interactive Plotly charts, collapsible sections, a sidebar table of contents, and narrative summaries. Open it in any browser, share it with stakeholders, attach it to an email.&lt;/p&gt;




&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# View current settings
&lt;/span&gt;&lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Customize for this session
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv_folds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Installation Options
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Core (sklearn + plotly only)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dissectml

&lt;span class="c"&gt;# With XGBoost, LightGBM, CatBoost&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dissectml[boost]

&lt;span class="c"&gt;# With SHAP explainability&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dissectml[explain]

&lt;span class="c"&gt;# Everything&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;dissectml[full]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What Makes This Different
&lt;/h2&gt;

&lt;p&gt;I've used PyCaret, LazyPredict, and YData Profiling extensively. They're great tools. But each one covers only part of the workflow:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What You Need&lt;/th&gt;
&lt;th&gt;Old Way&lt;/th&gt;
&lt;th&gt;dissectml&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Understand your data&lt;/td&gt;
&lt;td&gt;YData Profiling&lt;/td&gt;
&lt;td&gt;&lt;code&gt;dml.explore(df)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Check for leakage/issues&lt;/td&gt;
&lt;td&gt;Manual code&lt;/td&gt;
&lt;td&gt;&lt;code&gt;dml.analyze_intelligence(df)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compare models&lt;/td&gt;
&lt;td&gt;PyCaret/LazyPredict&lt;/td&gt;
&lt;td&gt;&lt;code&gt;dml.battle(df)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explain why models differ&lt;/td&gt;
&lt;td&gt;SHAP + matplotlib&lt;/td&gt;
&lt;td&gt;&lt;code&gt;report.compare&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Share findings&lt;/td&gt;
&lt;td&gt;Copy-paste into slides&lt;/td&gt;
&lt;td&gt;&lt;code&gt;report.export("report.html")&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;All of the above&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5 libraries, 200 lines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3 lines&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: these stages shouldn't be independent tools. Your EDA findings should inform your model preprocessing. Your model comparison should include statistical significance tests. Your report should contain both data insights and model insights in one place.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;pip install dissectml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/rupeshbharambe24/dissectML" rel="noopener noreferrer"&gt;github.com/rupeshbharambe24/dissectML&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/dissectml/" rel="noopener noreferrer"&gt;pypi.org/project/dissectml&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this saves you time, drop a ⭐ on GitHub — it genuinely helps with discoverability.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Rupesh Bharambe — ML Engineer &amp;amp; Open Source Developer&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Analyzed 26 ML Libraries and Found a Gap Nobody Fills - So I Built It</title>
      <dc:creator>Rupesh Bharambe</dc:creator>
      <pubDate>Tue, 07 Apr 2026 07:18:20 +0000</pubDate>
      <link>https://forem.com/rupesh24/i-analyzed-26-ml-libraries-and-found-a-gap-nobody-fills-so-i-built-it-kad</link>
      <guid>https://forem.com/rupesh24/i-analyzed-26-ml-libraries-and-found-a-gap-nobody-fills-so-i-built-it-kad</guid>
      <description>&lt;h2&gt;
  
  
  &lt;em&gt;How I built dissectml, the missing middle layer between EDA and AutoML.&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Every data science project starts the same way.&lt;/p&gt;

&lt;p&gt;You load your dataset. You run &lt;code&gt;df.describe()&lt;/code&gt;. You open YData Profiling for a quick report. Then you switch to PyCaret or LazyPredict to screen a bunch of models. Then you pull in SHAP for explainability. Then matplotlib for custom comparison plots. By the time you actually understand your data &lt;em&gt;and&lt;/em&gt; your models, you've imported five libraries, written 200 lines of glue code, and it's been three hours.&lt;/p&gt;

&lt;p&gt;I kept asking myself: &lt;strong&gt;why isn't there one library that does the full journey?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So I researched every tool in the space. Thoroughly. And then I built the one that was missing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Research That Started Everything
&lt;/h2&gt;

&lt;p&gt;I spent weeks doing deep market research on two categories: &lt;strong&gt;Auto-EDA tools&lt;/strong&gt; (libraries that explore your data) and &lt;strong&gt;AutoML/model comparison tools&lt;/strong&gt; (libraries that train and compare models).&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto-EDA landscape (10+ libraries):
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;YData Profiling&lt;/strong&gt; (13K+ GitHub stars) — the king of one-line profiling reports. Great for stats and correlations, but no model insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DataPrep&lt;/strong&gt; — Dask-powered, 10x faster. But stops at data profiling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SweetViz&lt;/strong&gt; — beautiful HTML reports with target analysis. But static and shallow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D-Tale&lt;/strong&gt; — Flask+React interactive GUI. Impressive, but no ML integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoViz&lt;/strong&gt;, &lt;strong&gt;Lux&lt;/strong&gt;, &lt;strong&gt;klib&lt;/strong&gt;, &lt;strong&gt;Missingno&lt;/strong&gt; — each does one thing well but nothing end-to-end.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AutoML landscape (16+ frameworks):
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyCaret&lt;/strong&gt; (9K+ stars) — low-code model comparison with &lt;code&gt;compare_models()&lt;/code&gt;. But no deep EDA, no statistical significance tests between models, no cross-model error analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LazyPredict&lt;/strong&gt; — trains 30 models in 2 lines. But zero depth: no plots, no tuning, no explanations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoGluon&lt;/strong&gt; (AWS) — wins competitions via stacking. But it's a black box focused on prediction, not understanding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLJAR&lt;/strong&gt; — per-model SHAP reports. But reports are per-model, not comparative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FLAML&lt;/strong&gt; (Microsoft), &lt;strong&gt;H2O&lt;/strong&gt;, &lt;strong&gt;TPOT&lt;/strong&gt;, &lt;strong&gt;EvalML&lt;/strong&gt; — all focused on &lt;em&gt;finding the best model&lt;/em&gt;, not &lt;em&gt;understanding why&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The gap I found:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;YData&lt;/th&gt;
&lt;th&gt;PyCaret&lt;/th&gt;
&lt;th&gt;LazyPredict&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Nobody&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deep EDA with statistical tests&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Train 20+ models in one call&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-model error analysis&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Statistical significance between models&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target leakage detection&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data readiness score&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EDA insights informing model selection&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End-to-end: EDA → Models → Report&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Look at the &lt;strong&gt;Nobody&lt;/strong&gt; column: six of those eight capabilities exist in no tool at all. Not a single library bridges the full journey from "What is my data?" to "Which model is best, and WHY?"&lt;/p&gt;

&lt;p&gt;That's not an AutoML gap. It's an &lt;strong&gt;Auto-Analysis&lt;/strong&gt; gap.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built: dissectml
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;dissectml&lt;/strong&gt; is a Python library that unifies deep EDA with comparative model analysis in a single, coherent pipeline. It has five stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deep EDA&lt;/strong&gt; — auto-detect types, distributions, correlations (Pearson + Spearman + Cramér's V), missing data patterns (MCAR/MAR/MNAR), outlier detection (IQR + Z-score + Isolation Forest), statistical tests (Shapiro-Wilk, chi-square, ANOVA), cluster discovery, feature interactions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pre-Model Intelligence&lt;/strong&gt; — target leakage detection, multicollinearity (VIF), data readiness score (0-100 with letter grade), algorithm recommendations based on data characteristics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model Battle&lt;/strong&gt; — parallel cross-validation across 19 classifiers or 19 regressors. Supports XGBoost, LightGBM, CatBoost as optional extras.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Comparative Analysis&lt;/strong&gt; — side-by-side metrics, ROC/PR curves, confusion matrices, cross-model error analysis, McNemar/corrected paired t-tests for statistical significance, accuracy vs speed Pareto front.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HTML Report&lt;/strong&gt; — self-contained interactive report with Plotly charts, collapsible sections, and narrative summaries.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
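&lt;p&gt;To make the statistical-significance step in stage 4 concrete, here is a minimal, dependency-free sketch of the exact McNemar test on two classifiers' predictions. The function name and shape are illustrative, not dissectml's actual API:&lt;/p&gt;

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact McNemar test on the disagreements between two classifiers.

    b = cases where A is right and B is wrong; c = the reverse.
    Under H0 (equal error rates) the disagreement counts follow
    Binomial(b + c, 0.5), so an exact two-sided p-value is available.
    """
    b = sum(1 for t, a, bb in zip(y_true, pred_a, pred_b) if a == t and bb != t)
    c = sum(1 for t, a, bb in zip(y_true, pred_a, pred_b) if a != t and bb == t)
    n = b + c
    if n == 0:
        return 1.0                      # the models never disagree
    k = min(b, c)
    # two-sided exact p-value: 2 * P(X <= k) for X ~ Binomial(n, 0.5), capped at 1
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# 9 disagreements favour model A, 1 favours model B:
p = mcnemar_exact([1] * 10, [1] * 9 + [0], [0] * 9 + [1])  # ~0.0215
```

&lt;p&gt;With a 9-vs-1 split in the disagreements, the p-value lands around 0.02, so the two models' error rates differ at the usual 5% level.&lt;/p&gt;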

&lt;h3&gt;
  
  
  The API is 3 lines:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dissectml&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_titanic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Five stages. One function call. One interactive report.&lt;/p&gt;

&lt;p&gt;Or use any stage independently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Just EDA
&lt;/span&gt;&lt;span class="n"&gt;eda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;correlations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outliers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;eda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normality&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Just model comparison
&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;battle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;leaderboard&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Architecture Decisions
&lt;/h2&gt;

&lt;p&gt;A few choices I'm proud of:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lazy evaluation everywhere.&lt;/strong&gt; &lt;code&gt;dml.explore()&lt;/code&gt; returns instantly. Computation only happens when you access a sub-module like &lt;code&gt;eda.correlations&lt;/code&gt;. This means you never wait for analysis you don't need.&lt;/p&gt;
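&lt;p&gt;The lazy pattern is easy to sketch with the standard library's &lt;code&gt;functools.cached_property&lt;/code&gt;. This is a toy version of the idea, not dissectml's actual implementation:&lt;/p&gt;

```python
from functools import cached_property

class Explore:
    """Toy lazily-evaluated EDA handle: constructing it is free; each
    sub-analysis runs once, on first attribute access, then is cached."""

    def __init__(self, df):
        self.df = df
        self.computed = []               # records what actually ran, for illustration

    @cached_property
    def correlations(self):
        self.computed.append("correlations")
        # ...expensive pairwise-correlation work would go here...
        return {"fare~pclass": -0.55}    # placeholder result

    @cached_property
    def outliers(self):
        self.computed.append("outliers")
        return {"fare": [512.33]}        # placeholder result

eda = Explore(df=None)   # instant: nothing has been computed yet
eda.correlations         # first access runs the analysis
eda.correlations         # second access hits the cache
# eda.computed is now ["correlations"]; outliers never ran
```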

&lt;p&gt;&lt;strong&gt;EDA informs model training.&lt;/strong&gt; The intelligence stage detects your data characteristics (non-linearity, sparsity, cardinality) and feeds that into the battle stage's preprocessing. Tree-based models skip scaling. High-cardinality categoricals get target encoding instead of one-hot. The pipeline adapts to your data.&lt;/p&gt;
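&lt;p&gt;In spirit, that adaptive step looks something like the sketch below. The names and the cardinality threshold are illustrative, not the library's internals:&lt;/p&gt;

```python
TREE_MODELS = {"RandomForest", "XGBoost", "LightGBM", "CatBoost"}

def preprocessing_plan(model_name, cat_cardinalities, high_card_threshold=15):
    """Pick preprocessing steps from the model family plus data characteristics."""
    plan = []
    if model_name not in TREE_MODELS:
        plan.append("standard_scale")            # distance/margin models need scaling
    for col, cardinality in cat_cardinalities.items():
        if cardinality > high_card_threshold:
            plan.append(f"target_encode:{col}")  # one-hot would explode dimensions
        else:
            plan.append(f"one_hot:{col}")
    return plan

preprocessing_plan("XGBoost", {"sex": 2})
# -> ["one_hot:sex"]  (tree model: no scaling step)
preprocessing_plan("LogisticRegression", {"cabin": 147, "sex": 2})
# -> ["standard_scale", "target_encode:cabin", "one_hot:sex"]
```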

&lt;p&gt;&lt;strong&gt;Optional dependencies done right.&lt;/strong&gt; Core package needs only sklearn + plotly. XGBoost/LightGBM/CatBoost install with &lt;code&gt;pip install dissectml[boost]&lt;/code&gt;. SHAP with &lt;code&gt;[explain]&lt;/code&gt;. If an optional model isn't installed, it's silently skipped — no crashes.&lt;/p&gt;
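&lt;p&gt;The standard pattern behind that behavior is a guarded import at catalog-build time. The mapping below is a sketch of the idea, not dissectml's internal table:&lt;/p&gt;

```python
import importlib

# Optional model backends: package name and the estimator class to pull from it.
OPTIONAL_MODELS = {
    "XGBoost": ("xgboost", "XGBClassifier"),
    "LightGBM": ("lightgbm", "LGBMClassifier"),
    "CatBoost": ("catboost", "CatBoostClassifier"),
}

def available_optional_models():
    """Return only the optional models whose package imports cleanly.

    A missing package means the entry is skipped, not an exception,
    so the battle stage simply runs with fewer contenders.
    """
    found = {}
    for name, (module, cls_name) in OPTIONAL_MODELS.items():
        try:
            module_obj = importlib.import_module(module)
        except ImportError:
            continue               # optional extra not installed: skip silently
        found[name] = getattr(module_obj, cls_name)
    return found
```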

&lt;p&gt;&lt;strong&gt;Modular plugin architecture.&lt;/strong&gt; Each EDA sub-module, each model entry, each comparison method is a self-contained unit. Want to add a custom model? Register it with the model registry. Want to add a custom EDA analysis? Extend the base class.&lt;/p&gt;
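&lt;p&gt;A registry of that shape can be as small as a decorated dict. The names below are a sketch of the pattern, not dissectml's public API:&lt;/p&gt;

```python
MODEL_REGISTRY = {}

def register_model(name, task="classification"):
    """Decorator that plugs a model factory into the catalog
    without touching library internals."""
    def wrap(factory):
        MODEL_REGISTRY[(task, name)] = factory
        return factory
    return wrap

@register_model("MajorityClass")
def make_majority():
    # A deliberately trivial estimator-like object, just to show the hook.
    class Majority:
        def fit(self, X, y):
            self.label_ = max(set(y), key=list(y).count)
            return self
        def predict(self, X):
            return [self.label_] * len(X)
    return Majority()

# The battle stage would now see ("classification", "MajorityClass") in the catalog.
```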




&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;11,000+ lines of source code&lt;/strong&gt; across 67 files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;600+ tests&lt;/strong&gt;, all passing, 82% coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0 lint issues&lt;/strong&gt; (ruff-clean)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;19 classifiers + 19 regressors&lt;/strong&gt; in the model catalog&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10 EDA sub-modules&lt;/strong&gt;: overview, univariate, bivariate, correlations, missing, outliers, statistical tests, clusters, interactions, target analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;148KB wheel&lt;/strong&gt; on PyPI&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It Now
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dissectml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dissectml&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;

&lt;span class="c1"&gt;# Load the built-in Titanic dataset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_titanic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Full pipeline: EDA → Intelligence → Battle → Compare → Report
&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;survived&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/rupeshbharambe24/InsightML" rel="noopener noreferrer"&gt;github.com/rupeshbharambe24/InsightML&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/dissectml/" rel="noopener noreferrer"&gt;pypi.org/project/dissectml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you find this useful, a ⭐ on GitHub means a lot — it's what helps open-source projects get discovered.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v0.2&lt;/strong&gt;: Polars backend for 10x EDA speed on large datasets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.3&lt;/strong&gt;: Deep learning models (PyTorch MLP, TabNet)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.4&lt;/strong&gt;: PDF export and branded report templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.5&lt;/strong&gt;: LLM-powered narrative insights (natural language summaries of findings)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I built this because I was tired of stitching together five libraries every time I started a new ML project. If you feel the same way, give dissectml a try and let me know what you think.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🚀 Try it now (no install needed):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://colab.research.google.com/github/rupeshbharambe24/InsightML/blob/master/notebooks/dissectml_demo.ipynb" rel="noopener noreferrer"&gt;Run in Google Colab&lt;/a&gt; — full demo, runs in your browser in 60 seconds&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://www.kaggle.com/code/YOUR_KAGGLE_USERNAME/titanic-dissectml" rel="noopener noreferrer"&gt;Kaggle Notebook&lt;/a&gt; — with rendered outputs&lt;/p&gt;

&lt;p&gt;👉 &lt;code&gt;pip install dissectml&lt;/code&gt; — install locally&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt; &lt;a href="https://github.com/rupeshbharambe24/InsightML" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://pypi.org/project/dissectml/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; · &lt;a href="https://insightml.readthedocs.io" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/p&gt;





&lt;p&gt;&lt;em&gt;Rupesh Bharambe — AI/ML Engineer &amp;amp; Open Source Developer&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Find me on &lt;a href="https://github.com/rupeshbharambe24" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




</description>
      <category>python</category>
      <category>opensource</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
