<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rajeev Srivastava</title>
    <description>The latest articles on Forem by Rajeev Srivastava (@rajeevsrivastava).</description>
    <link>https://forem.com/rajeevsrivastava</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3784543%2Fac2f242f-25ee-4aa9-a4bc-86ff74dd64ee.jpg</url>
      <title>Forem: Rajeev Srivastava</title>
      <link>https://forem.com/rajeevsrivastava</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rajeevsrivastava"/>
    <language>en</language>
    <item>
      <title>Machine Learning Based Intelligent Test Selection for Faster CI/CD Pipelines</title>
      <dc:creator>Rajeev Srivastava</dc:creator>
      <pubDate>Sun, 08 Mar 2026 23:52:33 +0000</pubDate>
      <link>https://forem.com/rajeevsrivastava/machine-learning-based-intelligent-test-selection-for-faster-cicd-pipelines-2ela</link>
      <guid>https://forem.com/rajeevsrivastava/machine-learning-based-intelligent-test-selection-for-faster-cicd-pipelines-2ela</guid>
      <description>&lt;p&gt;CI pipelines become slow as regression suites grow. In many teams, every commit triggers full test execution even when only a few components changed.&lt;/p&gt;

&lt;p&gt;In this project, I built a practical prototype that predicts impacted Playwright tests using machine learning.&lt;/p&gt;

&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;When all tests run on every commit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;feedback is delayed&lt;/li&gt;
&lt;li&gt;compute cost increases&lt;/li&gt;
&lt;li&gt;developer productivity drops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For large systems, this creates a release bottleneck.&lt;/p&gt;

&lt;h2&gt;The Idea&lt;/h2&gt;

&lt;p&gt;Use historical data from CI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;changed files in each commit&lt;/li&gt;
&lt;li&gt;tests that were impacted (failed, flaky, or behaviorally affected)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Train a model that maps file-change patterns to impacted test files.&lt;/p&gt;

&lt;p&gt;Then in CI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;detect changed files&lt;/li&gt;
&lt;li&gt;predict relevant tests&lt;/li&gt;
&lt;li&gt;run only the selected tests first&lt;/li&gt;
&lt;li&gt;keep a full-suite fallback/nightly run for safety&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Example&lt;/h2&gt;

&lt;p&gt;Commit touches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;src/services/inventory.js&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model predicts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tests/playwright/tests/inventory.spec.js&lt;/li&gt;
&lt;li&gt;tests/playwright/tests/order.spec.js&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives much faster feedback compared to running all tests.&lt;/p&gt;
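&lt;p&gt;The train-and-predict loop behind this example can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the toy commit history, file paths, and the 0.3 threshold are all invented for the example.&lt;/p&gt;

```python
# Sketch: multi-label test selection with OneVsRest + LogisticRegression.
# The history, paths, and threshold below are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Historical commits: changed files mapped to impacted spec files.
history = [
    (["src/services/inventory.js"], ["inventory.spec.js", "order.spec.js"]),
    (["src/services/cart.js"], ["cart.spec.js", "order.spec.js"]),
    (["src/ui/header.js"], ["header.spec.js"]),
    (["src/services/inventory.js", "src/ui/header.js"],
     ["inventory.spec.js", "order.spec.js", "header.spec.js"]),
]

# Treat each commit's changed-file list as a "document" of path tokens.
vectorizer = CountVectorizer(analyzer=lambda paths: paths)
X = vectorizer.fit_transform([files for files, _ in history])

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform([tests for _, tests in history])

model = OneVsRestClassifier(LogisticRegression()).fit(X, y)

def select_tests(changed_files, threshold=0.3):
    """Return predicted spec files; fall back to the full suite if empty."""
    probs = model.predict_proba(vectorizer.transform([changed_files]))[0]
    selected = [t for t, p in zip(binarizer.classes_, probs) if p >= threshold]
    return selected or list(binarizer.classes_)  # safe fallback: run everything

print(select_tests(["src/services/inventory.js"]))
```

&lt;p&gt;In a real pipeline, the fitted vectorizer and model would be persisted after training and loaded by the CI step that exports the selected specs.&lt;/p&gt;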

&lt;p&gt;Tech Stack&lt;/p&gt;

&lt;p&gt;Playwright for test execution&lt;br&gt;
Python + scikit-learn for model training/inference&lt;br&gt;
GitHub Actions for CI integration&lt;br&gt;
Implementation Summary&lt;/p&gt;

&lt;p&gt;The repository includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a synthetic commit-impact dataset generator&lt;/li&gt;
&lt;li&gt;a multi-label classifier (OneVsRest + LogisticRegression)&lt;/li&gt;
&lt;li&gt;a prediction utility with a threshold and safe fallback&lt;/li&gt;
&lt;li&gt;a CI script that exports SELECTED_TESTS&lt;/li&gt;
&lt;li&gt;a Playwright runner that executes only the selected spec files&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Why This Matters&lt;/h2&gt;

&lt;p&gt;Intelligent test selection is a practical way to improve CI throughput. With good historical data and a conservative fallback strategy, teams can achieve significant speedups while preserving confidence.&lt;/p&gt;

&lt;p&gt;In many repositories this can reduce per-commit test time by 70-80%.&lt;/p&gt;

&lt;h2&gt;Repository&lt;/h2&gt;

&lt;p&gt;GitHub - intelligent-test-selection-ml&lt;/p&gt;

&lt;p&gt;Natural next steps for production hardening include coverage guards, risk bands, a retraining cadence, and drift monitoring.&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>devops</category>
      <category>machinelearning</category>
      <category>testing</category>
    </item>
    <item>
      <title>Detecting Flaky Tests in CI/CD Using Machine Learning: A Research Approach</title>
      <dc:creator>Rajeev Srivastava</dc:creator>
      <pubDate>Sun, 22 Feb 2026 03:09:34 +0000</pubDate>
      <link>https://forem.com/rajeevsrivastava/test-flakiness-prediction-using-machine-learning-in-cicd-pipelines-177j</link>
      <guid>https://forem.com/rajeevsrivastava/test-flakiness-prediction-using-machine-learning-in-cicd-pipelines-177j</guid>
      <description>&lt;h1&gt;
  
  
  Detecting Flaky Tests in CI/CD Using Machine Learning: A Research Approach
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;In modern CI/CD environments, automated tests are expected to provide fast and reliable feedback. However, flaky tests — tests that pass and fail intermittently without code changes — introduce instability into the pipeline.&lt;/p&gt;

&lt;p&gt;A flaky test may:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pass locally but fail in CI
&lt;/li&gt;
&lt;li&gt;Fail due to timing issues or race conditions
&lt;/li&gt;
&lt;li&gt;Fail because of shared state or environment dependencies
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Over time, flaky tests reduce trust in automation and slow down engineering velocity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It Damages CI/CD Velocity
&lt;/h2&gt;

&lt;p&gt;When a test fails, engineers must decide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this a real regression?
&lt;/li&gt;
&lt;li&gt;Or just another flaky failure?
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This uncertainty causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repeated pipeline reruns
&lt;/li&gt;
&lt;li&gt;Increased build time
&lt;/li&gt;
&lt;li&gt;Delayed releases
&lt;/li&gt;
&lt;li&gt;Developer frustration
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In high-frequency deployment environments, flaky tests silently become productivity killers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Traditional Approaches Fail
&lt;/h2&gt;

&lt;p&gt;Several mitigation strategies are commonly used:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Reruns
&lt;/h3&gt;

&lt;p&gt;Automatically rerunning failed tests may hide instability but does not eliminate the root cause.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Retry Logic
&lt;/h3&gt;

&lt;p&gt;Retrying tests reduces visible failures but increases pipeline time and masks systemic issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Manual Tagging
&lt;/h3&gt;

&lt;p&gt;Marking tests as flaky requires human intervention and constant maintenance.&lt;/p&gt;

&lt;p&gt;All these methods are reactive rather than predictive.&lt;/p&gt;




&lt;h2&gt;
  
  
  Proposed Machine Learning Approach
&lt;/h2&gt;

&lt;p&gt;Instead of reacting to flaky behavior, we can attempt to predict it.&lt;/p&gt;

&lt;p&gt;The idea is to model test instability using historical execution data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature Engineering
&lt;/h3&gt;

&lt;p&gt;Potential predictive signals include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Historical failure frequency
&lt;/li&gt;
&lt;li&gt;Time between failures
&lt;/li&gt;
&lt;li&gt;Execution duration variance
&lt;/li&gt;
&lt;li&gt;Commit correlation patterns
&lt;/li&gt;
&lt;li&gt;Environment-specific behavior
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These features can be extracted from CI execution logs.&lt;/p&gt;
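&lt;p&gt;As a rough sketch of what that extraction could look like (the run records and field names below are invented for illustration, not a real CI log schema):&lt;/p&gt;

```python
# Minimal sketch: per-test predictive signals computed from CI run records.
# Each record is (test name, passed, duration in seconds), oldest first.
from statistics import mean, pvariance

runs = [
    ("login.spec", True, 4.1), ("login.spec", False, 9.8),
    ("login.spec", True, 4.3), ("login.spec", False, 10.2),
    ("search.spec", True, 2.0), ("search.spec", True, 2.1),
    ("search.spec", True, 2.0), ("search.spec", True, 1.9),
]

def extract_features(test_runs):
    """Failure frequency, pass/fail flip rate, and duration variance."""
    outcomes = [passed for _, passed, _ in test_runs]
    durations = [d for _, _, d in test_runs]
    flips = sum(1 for a, b in zip(outcomes, outcomes[1:]) if a != b)
    return {
        "failure_rate": outcomes.count(False) / len(outcomes),
        "transition_rate": flips / max(len(outcomes) - 1, 1),
        "duration_variance": pvariance(durations),
        "mean_duration": mean(durations),
    }

# Group the flat log by test, then compute features per test.
by_test = {}
for name, passed, dur in runs:
    by_test.setdefault(name, []).append((name, passed, dur))

features = {name: extract_features(rs) for name, rs in by_test.items()}
print(features["login.spec"])
```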

&lt;h3&gt;
  
  
  Labeling Strategy
&lt;/h3&gt;

&lt;p&gt;A test can be labeled as &lt;strong&gt;flaky&lt;/strong&gt; if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It alternates between pass and fail without related code changes
&lt;/li&gt;
&lt;li&gt;Failure patterns show inconsistency over multiple builds
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This labeling enables supervised learning.&lt;/p&gt;
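&lt;p&gt;A minimal version of this labeling rule might look like the following. The per-build code-change flag is an assumed input here; deriving it reliably from commit data is part of the work:&lt;/p&gt;

```python
# Illustrative labeling rule: mark a test flaky when it flips between pass
# and fail across builds with no related code change in between.
def label_flaky(history, min_flips=2):
    """history: list of (passed, code_changed) per build, oldest first."""
    flips_without_change = 0
    for (prev_pass, _), (cur_pass, cur_changed) in zip(history, history[1:]):
        if prev_pass != cur_pass and not cur_changed:
            flips_without_change += 1
    return flips_without_change >= min_flips

# Alternates with no code changes: flaky.
print(label_flaky([(True, False), (False, False), (True, False)]))
# Fails only right after a code change: likely a real regression, not flaky.
print(label_flaky([(True, False), (False, True), (False, False)]))
```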

&lt;h3&gt;
  
  
  Model Selection
&lt;/h3&gt;

&lt;p&gt;Initial models for experimentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logistic Regression
&lt;/li&gt;
&lt;li&gt;Random Forest
&lt;/li&gt;
&lt;li&gt;Gradient Boosting
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These models can classify tests into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stable
&lt;/li&gt;
&lt;li&gt;Potentially flaky
&lt;/li&gt;
&lt;/ul&gt;
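&lt;p&gt;For illustration, all three candidates can be fitted on a handful of structured feature rows; the feature values and stable/flaky labels below are synthetic, not from the experiments:&lt;/p&gt;

```python
# Sketch: fit the three candidate models on per-test features and classify
# stable (0) vs potentially flaky (1). Rows are synthetic for illustration.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Columns: failure_rate, transition_rate, duration_variance
X = [
    [0.00, 0.0, 0.01], [0.05, 0.0, 0.02], [0.00, 0.1, 0.03],  # stable
    [0.50, 0.8, 4.00], [0.40, 0.6, 6.50], [0.60, 0.9, 3.20],  # flaky
]
y = [0, 0, 0, 1, 1, 1]

models = {
    "logreg": LogisticRegression(),
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "gb": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    # Score an unstable-looking test: high failure and transition rates.
    print(name, model.predict([[0.55, 0.7, 5.0]])[0])
```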




&lt;h2&gt;
  
  
  Initial Experimental Setup
&lt;/h2&gt;

&lt;p&gt;To ensure this research remains independent and reproducible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test framework:&lt;/strong&gt; Playwright
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI data source:&lt;/strong&gt; Synthetic execution logs
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset:&lt;/strong&gt; Artificially generated instability patterns
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No proprietary or company data is used.&lt;/p&gt;

&lt;p&gt;The dataset simulates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random intermittent failures
&lt;/li&gt;
&lt;li&gt;Timing-based instability
&lt;/li&gt;
&lt;li&gt;Controlled failure injection
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Preliminary Results
&lt;/h2&gt;

&lt;p&gt;In early synthetic experiments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy:&lt;/strong&gt; ~82%
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision:&lt;/strong&gt; Moderate
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall:&lt;/strong&gt; Strong for frequently unstable tests
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Observations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Historical variance in execution duration is a strong indicator
&lt;/li&gt;
&lt;li&gt;Tests with environment-dependent patterns show higher unpredictability
&lt;/li&gt;
&lt;li&gt;Simpler models perform surprisingly well with structured features
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These results suggest feasibility, though real-world validation is required.&lt;/p&gt;




&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Future improvements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collecting real-world open-source CI datasets
&lt;/li&gt;
&lt;li&gt;Improving feature selection
&lt;/li&gt;
&lt;li&gt;Exploring time-series modeling
&lt;/li&gt;
&lt;li&gt;Integrating predictions directly into CI pipelines
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The long-term goal is proactive CI reliability — identifying unstable tests before they disrupt delivery.&lt;/p&gt;




&lt;p&gt;🔗 &lt;strong&gt;GitHub Repository:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/srivastava-rajeev/flaky-test-prediction-ml" rel="noopener noreferrer"&gt;https://github.com/srivastava-rajeev/flaky-test-prediction-ml&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Update (Feb 22, 2026): Experimental Results from Reproducible Pipeline
&lt;/h2&gt;

&lt;p&gt;I ran the end-to-end pipeline from this repository:&lt;br&gt;
&lt;a href="https://github.com/srivastava-rajeev/flaky-test-prediction-ml" rel="noopener noreferrer"&gt;https://github.com/srivastava-rajeev/flaky-test-prediction-ml&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Latest Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Logistic Regression: ROC-AUC 0.944, Precision@0.5 0.966, Recall@0.5 0.929&lt;/li&gt;
&lt;li&gt;Random Forest: ROC-AUC 0.950, Precision@0.5 0.966, Recall@0.5 0.929&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CI Threshold Simulation (Logistic Regression)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;t=0.30 -&amp;gt; estimated policy cost 548.00&lt;/li&gt;
&lt;li&gt;t=0.50 -&amp;gt; estimated policy cost 548.00&lt;/li&gt;
&lt;li&gt;t=0.70 -&amp;gt; estimated policy cost 569.33 (+21.33)&lt;/li&gt;
&lt;/ul&gt;
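&lt;p&gt;The policy-cost idea behind this sweep can be sketched as a weighted sum of false positives (needless triage) and false negatives (escaped flaky failures). The probabilities and cost weights below are illustrative, not the repository's numbers:&lt;/p&gt;

```python
# Toy threshold sweep: cost = FP * triage cost + FN * escaped-flake cost.
# Probabilities, labels, and cost weights are invented for illustration.
def policy_cost(probs, labels, threshold, fp_cost=1.0, fn_cost=10.0):
    flagged = [p >= threshold for p in probs]
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    return fp * fp_cost + fn * fn_cost

probs  = [0.9, 0.6, 0.4, 0.2, 0.8, 0.1]  # model scores per test
labels = [1,   1,   1,   0,   0,   0]    # 1 = actually flaky
for t in (0.3, 0.5, 0.7):
    print(f"t={t:.2f} cost={policy_cost(probs, labels, t):.2f}")
```

&lt;p&gt;Because missed flaky tests are weighted far more heavily than false alarms, raising the threshold drives cost up in this toy setup, which mirrors the jump the simulation above reports at t=0.70.&lt;/p&gt;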

&lt;h3&gt;
  
  
  Key Takeaway
&lt;/h3&gt;

&lt;p&gt;Model quality is important, but CI impact depends heavily on threshold policy and false-negative cost trade-offs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reproducible Artifacts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;data/processed/sample_features.csv&lt;/li&gt;
&lt;li&gt;models/results/baseline_metrics.json&lt;/li&gt;
&lt;li&gt;ci_integration/threshold_scenarios.csv&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>testing</category>
      <category>devops</category>
      <category>machinelearning</category>
      <category>cicd</category>
    </item>
  </channel>
</rss>
