<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rajeev Srivastava</title>
    <description>The latest articles on Forem by Rajeev Srivastava (@rajeevsrivastava).</description>
    <link>https://forem.com/rajeevsrivastava</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3784543%2Fac2f242f-25ee-4aa9-a4bc-86ff74dd64ee.jpg</url>
      <title>Forem: Rajeev Srivastava</title>
      <link>https://forem.com/rajeevsrivastava</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rajeevsrivastava"/>
    <language>en</language>
    <item>
      <title>Machine Learning Based Intelligent Test Selection for Faster CI/CD Pipelines</title>
      <dc:creator>Rajeev Srivastava</dc:creator>
      <pubDate>Sun, 08 Mar 2026 23:52:33 +0000</pubDate>
      <link>https://forem.com/rajeevsrivastava/machine-learning-based-intelligent-test-selection-for-faster-cicd-pipelines-2ela</link>
      <guid>https://forem.com/rajeevsrivastava/machine-learning-based-intelligent-test-selection-for-faster-cicd-pipelines-2ela</guid>
      <description>&lt;p&gt;CI pipelines become slow as regression suites grow. In many teams, every commit triggers full test execution even when only a few components changed.&lt;/p&gt;

&lt;p&gt;In this project, I built a practical prototype that predicts impacted Playwright tests using machine learning.&lt;/p&gt;

&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;When all tests run on every commit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;feedback is delayed&lt;/li&gt;
&lt;li&gt;compute cost increases&lt;/li&gt;
&lt;li&gt;developer productivity drops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For large systems, this creates a release bottleneck.&lt;/p&gt;

&lt;h2&gt;The Idea&lt;/h2&gt;

&lt;p&gt;Use historical data from CI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;changed files in each commit&lt;/li&gt;
&lt;li&gt;tests that were impacted (failed, flaky, or behaviorally affected)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Train a model that maps file-change patterns to impacted test files.&lt;/p&gt;

&lt;p&gt;Then in CI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;detect changed files&lt;/li&gt;
&lt;li&gt;predict relevant tests&lt;/li&gt;
&lt;li&gt;run only the selected tests first&lt;/li&gt;
&lt;li&gt;keep a full-suite fallback/nightly run for safety&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Example&lt;/h2&gt;

&lt;p&gt;Commit touches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;src/services/inventory.js&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model predicts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tests/playwright/tests/inventory.spec.js&lt;/li&gt;
&lt;li&gt;tests/playwright/tests/order.spec.js&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives much faster feedback compared to running all tests.&lt;/p&gt;
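&lt;p&gt;The train-and-predict loop behind this example can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the toy commit history, file paths, and the 0.3 threshold are all invented for the example.&lt;/p&gt;

```python
# Sketch: multi-label test selection with OneVsRest + LogisticRegression.
# The history, paths, and threshold below are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Historical commits: changed files mapped to impacted spec files.
history = [
    (["src/services/inventory.js"], ["inventory.spec.js", "order.spec.js"]),
    (["src/services/cart.js"], ["cart.spec.js", "order.spec.js"]),
    (["src/ui/header.js"], ["header.spec.js"]),
    (["src/services/inventory.js", "src/ui/header.js"],
     ["inventory.spec.js", "order.spec.js", "header.spec.js"]),
]

# Treat each commit's changed-file list as a "document" of path tokens.
vectorizer = CountVectorizer(analyzer=lambda paths: paths)
X = vectorizer.fit_transform([files for files, _ in history])

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform([tests for _, tests in history])

model = OneVsRestClassifier(LogisticRegression()).fit(X, y)

def select_tests(changed_files, threshold=0.3):
    """Return predicted spec files; fall back to the full suite if empty."""
    probs = model.predict_proba(vectorizer.transform([changed_files]))[0]
    selected = [t for t, p in zip(binarizer.classes_, probs) if p >= threshold]
    return selected or list(binarizer.classes_)  # safe fallback: run everything

print(select_tests(["src/services/inventory.js"]))
```

&lt;p&gt;In a real pipeline, the fitted vectorizer and model would be persisted after training and loaded by the CI step that exports the selected specs.&lt;/p&gt;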

&lt;p&gt;Tech Stack&lt;/p&gt;

&lt;p&gt;Playwright for test execution&lt;br&gt;
Python + scikit-learn for model training/inference&lt;br&gt;
GitHub Actions for CI integration&lt;br&gt;
Implementation Summary&lt;/p&gt;

&lt;p&gt;The repository includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a synthetic commit-impact dataset generator&lt;/li&gt;
&lt;li&gt;a multi-label classifier (OneVsRest + LogisticRegression)&lt;/li&gt;
&lt;li&gt;a prediction utility with a threshold and safe fallback&lt;/li&gt;
&lt;li&gt;a CI script that exports SELECTED_TESTS&lt;/li&gt;
&lt;li&gt;a Playwright runner that executes only the selected spec files&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Why This Matters&lt;/h2&gt;

&lt;p&gt;Intelligent test selection is a practical way to improve CI throughput. With good historical data and a conservative fallback strategy, teams can achieve significant speedups while preserving confidence.&lt;/p&gt;

&lt;p&gt;In many repositories this can reduce per-commit test time by 70-80%.&lt;/p&gt;

&lt;h2&gt;Repository&lt;/h2&gt;

&lt;p&gt;GitHub - intelligent-test-selection-ml&lt;/p&gt;

&lt;p&gt;Natural next steps for production hardening include coverage guards, risk bands, a retraining cadence, and drift monitoring.&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>devops</category>
      <category>machinelearning</category>
      <category>testing</category>
    </item>
    <item>
      <title>Detecting Flaky Tests in CI/CD Using Machine Learning: A Research Approach</title>
      <dc:creator>Rajeev Srivastava</dc:creator>
      <pubDate>Sun, 22 Feb 2026 03:09:34 +0000</pubDate>
      <link>https://forem.com/rajeevsrivastava/test-flakiness-prediction-using-machine-learning-in-cicd-pipelines-177j</link>
      <guid>https://forem.com/rajeevsrivastava/test-flakiness-prediction-using-machine-learning-in-cicd-pipelines-177j</guid>
      <description>&lt;h1&gt;
  
  
  Detecting Flaky Tests in CI/CD Using Machine Learning: A Research Approach
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;In modern CI/CD environments, automated tests are expected to provide fast and reliable feedback. However, flaky tests — tests that pass and fail intermittently without code changes — introduce instability into the pipeline.&lt;/p&gt;

&lt;p&gt;A flaky test may:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pass locally but fail in CI
&lt;/li&gt;
&lt;li&gt;Fail due to timing issues or race conditions
&lt;/li&gt;
&lt;li&gt;Fail because of shared state or environment dependencies
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Over time, flaky tests reduce trust in automation and slow down engineering velocity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It Damages CI/CD Velocity
&lt;/h2&gt;

&lt;p&gt;When a test fails, engineers must decide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this a real regression?
&lt;/li&gt;
&lt;li&gt;Or just another flaky failure?
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This uncertainty causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repeated pipeline reruns
&lt;/li&gt;
&lt;li&gt;Increased build time
&lt;/li&gt;
&lt;li&gt;Delayed releases
&lt;/li&gt;
&lt;li&gt;Developer frustration
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In high-frequency deployment environments, flaky tests silently become productivity killers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Traditional Approaches Fail
&lt;/h2&gt;

&lt;p&gt;Several mitigation strategies are commonly used:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Reruns
&lt;/h3&gt;

&lt;p&gt;Automatically rerunning failed tests may hide instability but does not eliminate the root cause.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Retry Logic
&lt;/h3&gt;

&lt;p&gt;Retrying tests reduces visible failures but increases pipeline time and masks systemic issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Manual Tagging
&lt;/h3&gt;

&lt;p&gt;Marking tests as flaky requires human intervention and constant maintenance.&lt;/p&gt;

&lt;p&gt;All these methods are reactive rather than predictive.&lt;/p&gt;




&lt;h2&gt;
  
  
  Proposed Machine Learning Approach
&lt;/h2&gt;

&lt;p&gt;Instead of reacting to flaky behavior, we can attempt to predict it.&lt;/p&gt;

&lt;p&gt;The idea is to model test instability using historical execution data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature Engineering
&lt;/h3&gt;

&lt;p&gt;Potential predictive signals include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Historical failure frequency
&lt;/li&gt;
&lt;li&gt;Time between failures
&lt;/li&gt;
&lt;li&gt;Execution duration variance
&lt;/li&gt;
&lt;li&gt;Commit correlation patterns
&lt;/li&gt;
&lt;li&gt;Environment-specific behavior
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These features can be extracted from CI execution logs.&lt;/p&gt;
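&lt;p&gt;As a rough sketch of what that extraction could look like (the run records and field names below are invented for illustration, not a real CI log schema):&lt;/p&gt;

```python
# Minimal sketch: per-test predictive signals computed from CI run records.
# Each record is (test name, passed, duration in seconds), oldest first.
from statistics import mean, pvariance

runs = [
    ("login.spec", True, 4.1), ("login.spec", False, 9.8),
    ("login.spec", True, 4.3), ("login.spec", False, 10.2),
    ("search.spec", True, 2.0), ("search.spec", True, 2.1),
    ("search.spec", True, 2.0), ("search.spec", True, 1.9),
]

def extract_features(test_runs):
    """Failure frequency, pass/fail flip rate, and duration variance."""
    outcomes = [passed for _, passed, _ in test_runs]
    durations = [d for _, _, d in test_runs]
    flips = sum(1 for a, b in zip(outcomes, outcomes[1:]) if a != b)
    return {
        "failure_rate": outcomes.count(False) / len(outcomes),
        "transition_rate": flips / max(len(outcomes) - 1, 1),
        "duration_variance": pvariance(durations),
        "mean_duration": mean(durations),
    }

# Group the flat log by test, then compute features per test.
by_test = {}
for name, passed, dur in runs:
    by_test.setdefault(name, []).append((name, passed, dur))

features = {name: extract_features(rs) for name, rs in by_test.items()}
print(features["login.spec"])
```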

&lt;h3&gt;
  
  
  Labeling Strategy
&lt;/h3&gt;

&lt;p&gt;A test can be labeled as &lt;strong&gt;flaky&lt;/strong&gt; if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It alternates between pass and fail without related code changes
&lt;/li&gt;
&lt;li&gt;Failure patterns show inconsistency over multiple builds
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This labeling enables supervised learning.&lt;/p&gt;
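&lt;p&gt;A minimal version of this labeling rule might look like the following. The per-build code-change flag is an assumed input here; deriving it reliably from commit data is part of the work:&lt;/p&gt;

```python
# Illustrative labeling rule: mark a test flaky when it flips between pass
# and fail across builds with no related code change in between.
def label_flaky(history, min_flips=2):
    """history: list of (passed, code_changed) per build, oldest first."""
    flips_without_change = 0
    for (prev_pass, _), (cur_pass, cur_changed) in zip(history, history[1:]):
        if prev_pass != cur_pass and not cur_changed:
            flips_without_change += 1
    return flips_without_change >= min_flips

# Alternates with no code changes: flaky.
print(label_flaky([(True, False), (False, False), (True, False)]))
# Fails only right after a code change: likely a real regression, not flaky.
print(label_flaky([(True, False), (False, True), (False, False)]))
```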

&lt;h3&gt;
  
  
  Model Selection
&lt;/h3&gt;

&lt;p&gt;Initial models for experimentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logistic Regression
&lt;/li&gt;
&lt;li&gt;Random Forest
&lt;/li&gt;
&lt;li&gt;Gradient Boosting
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These models can classify tests into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stable
&lt;/li&gt;
&lt;li&gt;Potentially flaky
&lt;/li&gt;
&lt;/ul&gt;
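&lt;p&gt;For illustration, all three candidates can be fitted on a handful of structured feature rows; the feature values and stable/flaky labels below are synthetic, not from the experiments:&lt;/p&gt;

```python
# Sketch: fit the three candidate models on per-test features and classify
# stable (0) vs potentially flaky (1). Rows are synthetic for illustration.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Columns: failure_rate, transition_rate, duration_variance
X = [
    [0.00, 0.0, 0.01], [0.05, 0.0, 0.02], [0.00, 0.1, 0.03],  # stable
    [0.50, 0.8, 4.00], [0.40, 0.6, 6.50], [0.60, 0.9, 3.20],  # flaky
]
y = [0, 0, 0, 1, 1, 1]

models = {
    "logreg": LogisticRegression(),
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "gb": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    # Score an unstable-looking test: high failure and transition rates.
    print(name, model.predict([[0.55, 0.7, 5.0]])[0])
```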




&lt;h2&gt;
  
  
  Initial Experimental Setup
&lt;/h2&gt;

&lt;p&gt;To ensure this research remains independent and reproducible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test framework:&lt;/strong&gt; Playwright
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI data source:&lt;/strong&gt; Synthetic execution logs
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset:&lt;/strong&gt; Artificially generated instability patterns
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No proprietary or company data is used.&lt;/p&gt;

&lt;p&gt;The dataset simulates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random intermittent failures
&lt;/li&gt;
&lt;li&gt;Timing-based instability
&lt;/li&gt;
&lt;li&gt;Controlled failure injection
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Preliminary Results
&lt;/h2&gt;

&lt;p&gt;In early synthetic experiments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy:&lt;/strong&gt; ~82%
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision:&lt;/strong&gt; Moderate
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall:&lt;/strong&gt; Strong for frequently unstable tests
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Observations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Historical variance in execution duration is a strong indicator
&lt;/li&gt;
&lt;li&gt;Tests with environment-dependent patterns show higher unpredictability
&lt;/li&gt;
&lt;li&gt;Simpler models perform surprisingly well with structured features
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These results suggest feasibility, though real-world validation is required.&lt;/p&gt;




&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Future improvements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collecting real-world open-source CI datasets
&lt;/li&gt;
&lt;li&gt;Improving feature selection
&lt;/li&gt;
&lt;li&gt;Exploring time-series modeling
&lt;/li&gt;
&lt;li&gt;Integrating predictions directly into CI pipelines
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The long-term goal is proactive CI reliability — identifying unstable tests before they disrupt delivery.&lt;/p&gt;




&lt;p&gt;🔗 &lt;strong&gt;GitHub Repository:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/srivastava-rajeev/flaky-test-prediction-ml" rel="noopener noreferrer"&gt;https://github.com/srivastava-rajeev/flaky-test-prediction-ml&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Update (Feb 22, 2026): Experimental Results from Reproducible Pipeline
&lt;/h2&gt;

&lt;p&gt;I ran the end-to-end pipeline from this repository:&lt;br&gt;
&lt;a href="https://github.com/srivastava-rajeev/flaky-test-prediction-ml" rel="noopener noreferrer"&gt;https://github.com/srivastava-rajeev/flaky-test-prediction-ml&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Latest Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Logistic Regression: ROC-AUC 0.944, Precision@0.5 0.966, Recall@0.5 0.929&lt;/li&gt;
&lt;li&gt;Random Forest: ROC-AUC 0.950, Precision@0.5 0.966, Recall@0.5 0.929&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CI Threshold Simulation (Logistic Regression)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;t=0.30 -&amp;gt; estimated policy cost 548.00&lt;/li&gt;
&lt;li&gt;t=0.50 -&amp;gt; estimated policy cost 548.00&lt;/li&gt;
&lt;li&gt;t=0.70 -&amp;gt; estimated policy cost 569.33 (+21.33)&lt;/li&gt;
&lt;/ul&gt;
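&lt;p&gt;The policy-cost idea behind this sweep can be sketched as a weighted sum of false positives (needless triage) and false negatives (escaped flaky failures). The probabilities and cost weights below are illustrative, not the repository's numbers:&lt;/p&gt;

```python
# Toy threshold sweep: cost = FP * triage cost + FN * escaped-flake cost.
# Probabilities, labels, and cost weights are invented for illustration.
def policy_cost(probs, labels, threshold, fp_cost=1.0, fn_cost=10.0):
    flagged = [p >= threshold for p in probs]
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    return fp * fp_cost + fn * fn_cost

probs  = [0.9, 0.6, 0.4, 0.2, 0.8, 0.1]  # model scores per test
labels = [1,   1,   1,   0,   0,   0]    # 1 = actually flaky
for t in (0.3, 0.5, 0.7):
    print(f"t={t:.2f} cost={policy_cost(probs, labels, t):.2f}")
```

&lt;p&gt;Because missed flaky tests are weighted far more heavily than false alarms, raising the threshold drives cost up in this toy setup, which mirrors the jump the simulation above reports at t=0.70.&lt;/p&gt;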

&lt;h3&gt;
  
  
  Key Takeaway
&lt;/h3&gt;

&lt;p&gt;Model quality is important, but CI impact depends heavily on threshold policy and false-negative cost trade-offs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reproducible Artifacts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;data/processed/sample_features.csv&lt;/li&gt;
&lt;li&gt;models/results/baseline_metrics.json&lt;/li&gt;
&lt;li&gt;ci_integration/threshold_scenarios.csv&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>testing</category>
      <category>devops</category>
      <category>machinelearning</category>
      <category>cicd</category>
    </item>
  </channel>
</rss>
