Forem: Kipngeno Gregory

How a Data Pipeline and Machine Learning Can Modernize DNA Paternity Testing in Kenya

Kipngeno Gregory — Mon, 11 May 2026 04:42:44 +0000

A Pre-Research Article | Data Science, Data Pipelines & Genomic Informatics

Abstract

Paternity testing is one of the most legally and emotionally consequential applications of forensic science. In Kenya, demand for DNA paternity testing has grown sharply over the past decade, with laboratories such as the Bioinformatics Institute of Kenya now processing upwards of 125 cases every month, many feeding directly into court proceedings around child support, custody, and inheritance. Yet the infrastructure underpinning these tests remains largely manual, time-intensive, and constrained by limited laboratory capacity and a near-total absence of a national forensic DNA database.
This article makes the case that machine learning (ML), combined with a well-engineered data pipeline, offers a credible and practical path toward faster, more scalable, and more consistent paternity determination. It surveys the existing research landscape, identifies the specific gaps that remain unaddressed in the African genomic context, and outlines the architecture of a data pipeline that could serve as the foundation for an ML-assisted paternity testing system designed for the Kenyan environment.

1. The Problem: Paternity Disputes in the Kenyan Legal System

Paternity testing accounts for over 90% of all DNA tests conducted in Kenya. Cases span a wide range of contexts: child maintenance and custody battles in family courts, inheritance disputes, immigration documentation, and increasingly, personal verification outside any legal process.
The bottleneck is not awareness; demand is clearly there and growing. The bottleneck is capacity. Until recently, Kenya had only one institution authorised to conduct forensic DNA testing: the Government Chemist, a government-run laboratory that has historically struggled with case backlog. A second public laboratory was later established at the Kenya Medical Research Institute (KEMRI), and a small number of private providers, notably the Bioinformatics Institute of Kenya (BIK) and EasyDNA Kenya, now operate in the market.
Despite this growth, several structural problems persist:

No national forensic DNA database exists. Bodies are disposed of without DNA records. Cold cases cannot be cross-referenced against stored profiles.
Manual interpretation remains the norm. Short Tandem Repeat (STR) profiles are compared by trained analysts, introducing potential for human error and analyst inconsistency.
Turnaround times are slow. Legal-grade tests can take days to weeks, delaying court proceedings.
Population-specific allele frequency data for East African populations is sparse. Most STR allele frequency databases are built on European, East Asian, or American reference panels, which affects the statistical accuracy of paternity index calculations.

2. How DNA Paternity Testing Currently Works

Before discussing machine learning applications, it is worth briefly describing how the current process works.
Modern paternity testing uses Short Tandem Repeats (STRs), sections of the genome where a short sequence of base pairs (the "repeat unit") is repeated a variable number of times from person to person. Because the number of repeats at each STR locus is highly variable across individuals, and because a child inherits one allele at each locus from each parent, comparing STR profiles across child, mother, and alleged father allows analysts to determine whether the father's alleles are present in the child's DNA.
The output of this process is a Combined Paternity Index (CPI), a likelihood ratio that expresses how much more likely it is that the tested man is the biological father versus a random unrelated man from the same population. A CPI above 10,000 (corresponding to a probability of paternity above 99.99%) is typically required for a legal determination.
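The arithmetic behind these numbers can be sketched in a few lines. This is only an illustration: the per-locus paternity indices below are made up, and the 0.5 prior is the conventional neutral assumption, not a value taken from any Kenyan laboratory.

```python
from math import prod

def combined_paternity_index(per_locus_pi):
    """Multiply per-locus paternity indices (assumes the loci are independent)."""
    return prod(per_locus_pi)

def probability_of_paternity(cpi, prior=0.5):
    """Bayes' rule: posterior probability of paternity given the CPI and a prior."""
    return (cpi * prior) / (cpi * prior + (1.0 - prior))

# Hypothetical per-locus paternity indices for a small six-locus panel
example_pis = [2.5, 4.0, 10.0, 1.6, 8.0, 12.5]
cpi = combined_paternity_index(example_pis)

print(f"CPI = {cpi:,.0f}")
print(f"Probability of paternity = {probability_of_paternity(cpi):.6f}")
print(f"At CPI = 10,000: {probability_of_paternity(10_000):.6f}")
```

With a 0.5 prior the probability simplifies to CPI / (CPI + 1), which is why a CPI of 10,000 sits just above the 99.99% mark.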
The standard in Kenya involves 24 genetic markers. The Bioinformatics Institute of Kenya, for example, offers a 24-marker test that it describes as superior to the panels used by most law enforcement laboratories in the country and across Africa.
The weakness in this process is two-fold: it requires skilled human analysts at every step, and the statistical power of the CPI calculation depends on accurate, population-specific allele frequency tables, which, for East African populations, are still being developed.

3. What the Research Says: Machine Learning and DNA Kinship Analysis

A growing body of peer-reviewed research now demonstrates that machine learning can meaningfully contribute to DNA-based kinship and paternity analysis. The work spans several different approaches.

3.1 Deep Neural Networks on STR Data
A 2023 paper published in the Journal of Intelligent Systems proposed replacing manual STR matching with a Deep Neural Network (DNN) trained on 15-locus STR data. The researchers created a synthetic familial dataset, augmented it to increase sample size, and trained a DNN to predict paternity. This was among the first studies to directly position deep learning as a substitute for, rather than a supplement to, manual forensic interpretation.
The paper explicitly acknowledged that in developing countries, conventional kinship analysis techniques result in inadequate accuracy when dealing with large STR datasets, largely because of the human labour required for profile-by-profile comparison.

3.2 Random Forest and SVM on mtDNA Sequences
A study indexed on PubMed (NIH) applied four machine learning classifiers, Support Vector Machines (SVM), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Random Forest (RF), to mitochondrial DNA hypervariable region I sequences. The data covered African, Asian, and Caucasian samples.
The results were encouraging: a Bag-of-Words + PCA + Random Forest combination achieved 94.4% accuracy in predicting genetic relatedness, outperforming all other configurations. Critically, this study is one of the few to explicitly include African DNA samples, making its findings directly relevant to the Kenyan context.

3.3 SNP-Based Kinship Panels with Supervised ML
A 2024 paper in Expert Systems with Applications introduced a novel panel of 4,849 Single Nucleotide Polymorphisms (SNPs) and applied supervised machine learning to classify kinship relationships across more than 150,000 simulated pairs. The panel was designed to overcome the limitations of STR-based methods for detecting second-degree and more distant relationships.
A key feature of this study was its transparency: the full codebase was made publicly available on GitHub, making it an accessible starting point for researchers who want to replicate or extend the work.

3.4 Dynamic Programming and ML for DNA Sequence Classification
A two-part research series by Dr. Ernest Bonat and colleagues, published on Medium in 2024, explored using dynamic programming (specifically the Smith-Waterman and Needleman-Wunsch sequence alignment algorithms) in combination with machine learning classifiers for paternity DNA sequence classification. Part 2 of the series moved into feature engineering, DNA natural language processing (treating nucleotide sequences as text for embedding and classification), and model deployment strategies.
This work is notable for its practical orientation; it was designed with real-world deployment in hospital and laboratory settings in mind, using efficient, low-cost hardware platforms.

3.5 AI-Assisted Allele Calling in Forensic DNA Analysis
A 2025 preprint on bioRxiv demonstrated that deep learning models can outperform traditional rule-based systems for allele calling in forensic DNA electropherogram analysis. The researchers showed that deep learning eliminates much of the manual inspection currently required to classify electrophoresis signals into categories such as alleles, stutter artefacts, and baseline noise, which is the core step in any STR-based DNA test.

4. The Gap That Remains: East African Population Data

Despite this research momentum, a significant gap persists. The overwhelming majority of ML models for DNA kinship analysis have been trained on datasets drawn from European, East Asian, or broadly American populations. Allele frequency distributions vary meaningfully across ethnic groups, and a paternity index calculated using a European allele frequency table applied to a Luo, Kikuyu, or Kalenjin profile will produce an inaccurate probability estimate.
This is not a minor technical footnote; it is a potential source of serious legal error.
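To make the size of the effect concrete, here is a minimal sketch for the simplest textbook case: mother and child are typed, the child's paternal allele is unambiguous, and mutation is ignored. Under those assumptions the per-locus paternity index is the alleged father's chance of transmitting the allele divided by the allele's population frequency. The two frequencies are hypothetical and do not come from any published panel.

```python
def paternity_index(paternal_allele_freq, father_copies):
    """Per-locus PI when the child's paternal allele is unambiguous.

    father_copies: copies of that allele carried by the alleged father (0, 1 or 2).
    Mutation is ignored, so 0 copies gives PI = 0 (an exclusion at this locus).
    """
    if not 0.0 < paternal_allele_freq <= 1.0:
        raise ValueError("allele frequency must be in (0, 1]")
    transmission_prob = father_copies / 2.0  # chance the father passes the allele on
    return transmission_prob / paternal_allele_freq

# Hypothetical frequencies of the same allele in two reference panels
freq_non_local_panel = 0.20
freq_local_panel = 0.05

pi_non_local = paternity_index(freq_non_local_panel, father_copies=1)
pi_local = paternity_index(freq_local_panel, father_copies=1)
print(f"PI with non-local panel: {pi_non_local:.2f}")
print(f"PI with local panel:     {pi_local:.2f}")
```

Because the combined index is a product of per-locus values, a discrepancy like this at several loci compounds rather than averages out.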
Addressing it requires two things: first, building and publishing a properly annotated STR/SNP allele frequency reference panel for major Kenyan and East African population groups; and second, retraining or fine-tuning kinship ML models on that East African reference data.
This is the specific research contribution that has not yet been made and that this project intends to address.

5. Proposed Data Pipeline Architecture

The data pipeline proposed for this research is designed to take raw STR genotype data as input and produce a paternity probability as output.
The pipeline is structured across five stages:

Stage 1: Data Ingestion and Standardisation
Raw genotype data from laboratory electrophoresis systems is ingested in standard file formats (e.g., .csv, FASTA, or vendor-specific formats from Applied Biosystems or equivalent instruments). Data is standardised to a consistent allele notation format, and metadata (sample ID, locus names, collection date, chain-of-custody flags) is attached.
Tools: Python (pandas, biopython), validation schemas, format converters.
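A minimal sketch of what standardisation could look like is shown below. It uses the standard-library csv module instead of pandas so that it runs anywhere, and the column names (sample_id, locus, allele_1, allele_2) are hypothetical; real instrument exports will differ.

```python
import csv
import io

# Hypothetical long-format export: one row per sample per locus
RAW_CSV = """sample_id,locus,allele_1,allele_2
C001, d8s1179 ,13,14
C001,TH01,9.3,6
F001,D8S1179,14,15
"""

def normalise_allele(value):
    """Return the allele as a canonical string, e.g. '13.0' -> '13', '9.3' -> '9.3'."""
    number = float(value.strip())
    return str(int(number)) if number.is_integer() else str(number)

def ingest(csv_text):
    """Parse rows into {sample_id: {locus: (allele, allele)}} with alleles sorted numerically."""
    profiles = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        sample = row["sample_id"].strip()
        locus = row["locus"].strip().upper()  # consistent locus naming
        alleles = sorted(
            (normalise_allele(row["allele_1"]), normalise_allele(row["allele_2"])),
            key=float,
        )
        profiles.setdefault(sample, {})[locus] = tuple(alleles)
    return profiles

profiles = ingest(RAW_CSV)
print(profiles["C001"])
```

Metadata such as collection date and chain-of-custody flags would travel alongside these profiles; they are left out here to keep the sketch short.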

Stage 2: Allele Frequency Reference Construction
For each STR locus in the panel, allele frequencies are computed from a reference population dataset. Where East African population data is available (from published studies, KEMRI archives, or ethically sourced anonymised samples), it is used as the primary reference. Where gaps exist, published African population data from sources such as the 1000 Genomes Project is used as a fallback, clearly flagged.
Output: A locus-by-allele frequency matrix, stratified by population group where sufficient data is available.
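As a rough sketch of this stage, allele frequencies can be estimated by simple counting. The four reference individuals below are invented for illustration; a real panel needs far more samples and usually some handling of very rare alleles, which is not shown.

```python
from collections import Counter, defaultdict

def allele_frequencies(reference_profiles):
    """Estimate per-locus allele frequencies by counting.

    reference_profiles: iterable of {locus: (allele, allele)} dicts, one per person.
    Returns {locus: {allele: frequency}}.
    """
    counts = defaultdict(Counter)
    for profile in reference_profiles:
        for locus, genotype in profile.items():
            counts[locus].update(genotype)  # each person contributes two alleles
    return {
        locus: {allele: n / sum(counter.values()) for allele, n in counter.items()}
        for locus, counter in counts.items()
    }

# Hypothetical reference sample of four unrelated individuals
reference = [
    {"TH01": ("6", "9.3")},
    {"TH01": ("6", "7")},
    {"TH01": ("7", "9.3")},
    {"TH01": ("6", "6")},
]
freqs = allele_frequencies(reference)
print(freqs["TH01"])
```

Stratifying by population group would mean running the same function once per group and keeping the resulting tables side by side.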

Stage 3: Feature Engineering
Each trio of profiles (child, mother, alleged father) is transformed into a structured feature vector. Features include:

Per-locus allele match/mismatch flags between child and alleged father
Per-locus likelihood ratios (using the allele frequency reference from Stage 2)
Combined Paternity Index (CPI) computed using the classical formula
Encoded nucleotide sequences (using k-mer or one-hot encoding, for deep learning branches of the pipeline)
Population group label (where known)

Tools: scikit-learn (preprocessing), numpy, custom STR feature extraction functions.
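The first feature in the list, the per-locus match flag, could be computed along these lines. The function name and output keys are hypothetical, and the mother's profile is deliberately left out of this simplified sketch.

```python
def case_features(child, alleged_father, loci):
    """Build per-locus shared-allele flags for one case.

    child / alleged_father: {locus: (allele, allele)} dicts.
    Returns a flat dict with one 0/1 flag per locus plus a mismatch count.
    """
    features = {}
    mismatches = 0
    for locus in loci:
        shares_allele = bool(set(child[locus]) & set(alleged_father[locus]))
        features[f"{locus}_shared"] = int(shares_allele)
        mismatches += not shares_allele
    features["n_mismatched_loci"] = mismatches
    return features

# Hypothetical two-locus profiles
child = {"D8S1179": ("13", "14"), "TH01": ("6", "9.3")}
father = {"D8S1179": ("14", "15"), "TH01": ("7", "8")}

features = case_features(child, father, loci=["D8S1179", "TH01"])
print(features)
```

A flat dictionary like this converts directly into one row of a feature table, with the likelihood ratios and CPI added as further columns.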

Stage 4: Model Training and Evaluation
Multiple model families are trained and benchmarked:

Baseline: Logistic Regression and Gradient Boosted Trees (for interpretability and legal transparency)
Intermediate: Random Forest (strong performance in existing literature on genetic relatedness)
Advanced: Deep Neural Network trained on locus-by-allele matrix representations
Sequence model (experimental): Transformer-based model treating nucleotide sequences as text, for cases where raw sequence data is available

All models are evaluated using stratified cross-validation. Primary metrics are accuracy, F1-score, and, critically, calibration (how well predicted probabilities correspond to actual paternity likelihoods). For a legal application, a poorly calibrated model that outputs overconfident probabilities is more dangerous than a slightly less accurate but well-calibrated one.
Tools: scikit-learn, tensorflow/keras, xgboost, matplotlib and shap for interpretability.
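The point about calibration can be illustrated with the Brier score, one simple score that penalises overconfident mistakes. The sketch below is plain Python with invented predictions for two hypothetical models; scikit-learn offers the same metric as brier_score_loss.

```python
def accuracy(predicted_probs, outcomes, threshold=0.5):
    """Fraction of cases where the thresholded prediction matches the outcome."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(predicted_probs, outcomes))
    return hits / len(outcomes)

def brier_score(predicted_probs, outcomes):
    """Mean squared gap between predicted probability and the 0/1 outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(predicted_probs, outcomes)) / len(outcomes)

# Hypothetical held-out cases: 1 = true father, 0 = not the father
outcomes = [1, 1, 1, 0, 0, 1, 0, 1]

# Two hypothetical models that make the same single mistake (the fourth case)
overconfident = [0.99, 0.99, 0.99, 0.99, 0.01, 0.99, 0.01, 0.99]
cautious = [0.90, 0.90, 0.90, 0.60, 0.10, 0.90, 0.10, 0.90]

for name, probs in (("overconfident", overconfident), ("cautious", cautious)):
    print(f"{name}: accuracy={accuracy(probs, outcomes):.3f}, "
          f"brier={brier_score(probs, outcomes):.4f}")
```

Both models have identical accuracy, but the one that reported 0.99 on its wrong call scores noticeably worse, which is the behaviour a court-facing system should be penalised for.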

Stage 5: Output and Reporting
The pipeline produces a structured report containing:

Predicted paternity probability with confidence interval
Per-locus CPI breakdown (for transparency and legal review)
SHAP-based feature importance plot (showing which loci drove the prediction)
A plain-language summary suitable for inclusion in a court document

6. Why This Matters: The Broader Argument

This pipeline is not intended to remove DNA analysts from the paternity testing process. In a legal context, the chain-of-custody requirements, the professional accountability of a qualified scientist, and the right of courts to examine and cross-examine expert witnesses all require that human expertise remain central.
What the pipeline addresses is the bottleneck before the analyst signs off. The manual steps of computing per-locus paternity indices, constructing the CPI, and interpreting the result against a population reference are repetitive, time-consuming, and, when allele frequency tables are mismatched to the test population, potentially inaccurate. Automating and standardising those steps with a well-validated ML model makes the analyst's job faster, frees up laboratory capacity, and, if the underlying allele frequency data is East African, makes the result more statistically appropriate for the populations actually being tested in Kenyan courts.
There is also a longer-term argument. Kenya currently has no national forensic DNA database. As paternity and forensic DNA testing scales, the data generated by each test represents a potential building block for such a database. A well-designed data pipeline, built with data governance and privacy protections from the ground up, could eventually support population-level allele frequency studies, cold case investigations, and missing persons identification, all areas where current Kenyan capacity is severely limited.

7. Research Questions and Next Steps
This pre-article identifies the following research questions that the forthcoming full paper and associated project will address:

1. How accurately can an ML model trained on East African STR data predict paternity, compared to the classical CPI-based statistical method?
2. Which model architecture (logistic regression, random forest, DNN) offers the best balance between predictive accuracy and interpretability for court-admissible use?
3. How significant is the degradation in paternity index accuracy when non-African allele frequency references are applied to East African DNA profiles?
4. Can a data pipeline for paternity testing be built that meets Kenya's legal chain-of-custody requirements while significantly reducing per-case analyst time?

The next steps are:

Complete a systematic literature review extending the sources identified here
Establish a data acquisition plan (simulated datasets for training; anonymised, ethically sourced real STR profiles where possible)
Develop and test the five-stage pipeline described above
Write and submit the full research paper

8. Resources and References

Peer-Reviewed Papers

Research Articles

GitHub Repositories

Kenya-Specific Context

9. Closing Note
This project is an attempt to begin filling that gap. The pipeline described here is a starting point, not an endpoint. The research paper that follows this pre-article will provide a more formal treatment of the methods, a detailed experimental evaluation, and, where the results support it, a clear argument for why ML-assisted paternity testing should be considered as a complement to existing forensic laboratory practice in Kenya.

Pre-research article. All research questions and pipeline architecture are prospective. Full methodology and results will be published in the forthcoming research paper.

by Kipngeno Gregory Data and Software Engineer

ANOVA or Analysis Of Variance

Kipngeno Gregory — Thu, 16 Oct 2025 06:47:33 +0000

ANOVA: Comparing Multiple Groups Efficiently

Analysis of Variance (ANOVA) is a statistical method used to determine if there are significant differences between the means of three or more groups. Instead of running multiple t-tests (which inflates the overall chance of a false positive), ANOVA provides a single test for overall group differences.
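To see why multiple t-tests are a problem, here is a small sketch of how the chance of at least one false positive grows with the number of pairwise comparisons. It assumes the tests are independent, which is only an approximation for pairwise comparisons that share groups.

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes independent tests, each run at significance level alpha.
    """
    n_comparisons = comb(n_groups, 2)
    return n_comparisons, 1 - (1 - alpha) ** n_comparisons

for k in (3, 4, 6):
    m, fwer = familywise_error_rate(k)
    print(f"{k} groups -> {m} pairwise t-tests -> familywise error rate ~ {fwer:.3f}")
```

With four groups there are already six pairwise tests, and the chance of at least one spurious "significant" result is roughly 26% rather than 5%.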

Practical Example: Marketing Campaign Testing

A company tests four different advertising strategies (A, B, C, D) to see which generates the most sales:

Campaign A: 45, 48, 50, 47, 46 sales
Campaign B: 52, 55, 53, 54, 51 sales
Campaign C: 60, 58, 62, 59, 61 sales
Campaign D: 40, 42, 38, 41, 39 sales

ANOVA answers: Are these sales differences statistically significant, or just random variation?

How It's Used:

Research: Compare multiple treatments in medicine
Business: Test different pricing strategies
Education: Evaluate teaching methods
Manufacturing: Compare production methods

code

import scipy.stats as stats

# Sample data: Sales from 4 marketing campaigns
campaign_a = [45, 48, 50, 47, 46]
campaign_b = [52, 55, 53, 54, 51]
campaign_c = [60, 58, 62, 59, 61]
campaign_d = [40, 42, 38, 41, 39]

# Perform one-way ANOVA
f_stat, p_value = stats.f_oneway(campaign_a, campaign_b, campaign_c, campaign_d)

print(f"F-statistic: {f_stat:.2f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Result: significant differences exist between campaigns")
else:
    print("Result: no significant differences between campaigns")

Output Interpretation:

P-value < 0.05: Significant differences exist
P-value ≥ 0.05: No significant differences
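For readers curious about what the F-statistic actually measures, the sketch below recomputes it by hand for the same campaign data: the variance between group means divided by the variance within groups. It is plain Python and leaves the p-value to scipy.

```python
def one_way_anova_f(*groups):
    """Return the one-way ANOVA F-statistic computed from sums of squares."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    # Between-group variation: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: spread of observations around their own group mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

campaign_a = [45, 48, 50, 47, 46]
campaign_b = [52, 55, 53, 54, 51]
campaign_c = [60, 58, 62, 59, 61]
campaign_d = [40, 42, 38, 41, 39]

f_manual = one_way_anova_f(campaign_a, campaign_b, campaign_c, campaign_d)
print(f"F-statistic (manual): {f_manual:.2f}")
```

The result is about 129, meaning the spread between campaign means is far larger than the spread inside any one campaign.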

When to Use ANOVA:

Comparing 3+ groups
Testing one categorical variable
Meeting assumptions: normal distribution, equal variances

resources :github

by gregory.tech

Understanding Skewness and Kurtosis

Kipngeno Gregory — Mon, 29 Sep 2025 05:12:31 +0000

When analyzing datasets, it’s not enough to know measures of central tendency (mean, median, mode) and variability (variance, standard deviation).

Skewness: The Measure of Asymmetry

Definition: Skewness measures the degree and direction of asymmetry in a distribution around its mean.
Formula:

$$\text{Skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$$

Real-Life Example:

Income distribution → Often positively skewed because most people earn average wages, but a small number of high earners stretch the tail to the right.
Exam scores → If most students score high but a few fail badly, the distribution is negatively skewed.

Kurtosis: The Measure of Tailedness

Definition: Kurtosis measures the heaviness of tails in a distribution compared to a normal distribution.
Formula:

$$\text{Kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{2}}$$

A normal distribution has kurtosis ≈ 3 (called mesokurtic).
To make interpretation easier, analysts often use excess kurtosis = kurtosis – 3.
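Both measures are ratios of central moments, so they are easy to compute by hand. The sketch below uses plain Python and two tiny made-up datasets; with their default settings, scipy.stats.skew and scipy.stats.kurtosis use the same population-style formulas.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over the data."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)

def skewness(values):
    """Third central moment divided by the variance to the power 3/2."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment divided by the squared variance, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 1, 10]  # one large value stretches the right tail

print(f"symmetric:    skew={skewness(symmetric):.2f}, "
      f"excess kurtosis={excess_kurtosis(symmetric):.2f}")
print(f"right-skewed: skew={skewness(right_skewed):.2f}, "
      f"excess kurtosis={excess_kurtosis(right_skewed):.2f}")
```

The symmetric data has zero skewness, while the single large value in the second dataset pushes its skewness clearly positive.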

Real-Life Example:

Stock returns → Usually leptokurtic (heavy-tailed). This means extreme ups and downs occur more frequently than in a normal curve.
Heights of people → Typically close to mesokurtic, since extreme deviations are rare.
Uniform distribution → Platykurtic (light-tailed): its excess kurtosis is about −1.2, so extreme values are rarer than under a normal curve.

Key Difference

Skewness → Tells us about the direction of data spread (left, right, or symmetric).
Kurtosis → Tells us about the intensity of tails (normal, heavy, or light).

Why It Matters

Understanding skewness and kurtosis helps analysts:
Detect outliers and anomalies.
Choose suitable statistical models (many assume normality).
Improve preprocessing before applying machine learning.

by gregory.tech

Are We Really Ready for the Next Silicon Savannah?

Kipngeno Gregory — Tue, 23 Sep 2025 02:15:20 +0000

Kenya has proudly worn the crown of the “Silicon Savannah” for over a decade, fueled by innovations like M-Pesa and a vibrant startup scene. But as we enter a new era, one driven by Big Data, Web3, and AI, it is worth asking: are we truly ready to lead the next wave of global technological transformation?

The Data Dilemma: Are We Owners or Just Sources?

Kenya’s data powers global AI, but who owns the results? Are we shaping the future, or just supplying the raw material while others cash in?

If we don’t invest in local talent, sovereign data repositories, and home-grown AI, we’ll stay data laborers in someone else’s revolution. It’s time to decide: do we build, or just get mined?

Web3: Decentralised Promise, Centralised Risks?

Web3 promises decentralization and inclusion, but in Kenya, it’s mostly about crypto trading. Where are the startups using blockchain to fix land fraud, or smart contracts to secure farmer supply chains?

If we focus only on speculation, Web3 risks becoming a bubble for the few. True readiness means clear regulation, blockchain skills development, and building tools that solve real problems, not just pump coins.

The Employment Paradox: Will AI Create or Displace?

AI introduces a critical dilemma for Kenya's workforce: while promising new opportunities, it directly threatens entry-level tech roles in data entry and customer service through automation. The urgent question is whether our education system is pivoting fast enough to train for future roles like AI ethics and prompt engineering, rather than obsolete skills. Success requires a radical shift towards fostering critical thinking and creativity to ensure Kenyans become pilots of AI, not its passengers.

The Green Dilemma: Can a Tech Boom be Eco-Conscious?

The digital world has a very real physical cost. Training large AI models consumes enough energy to power hundreds of homes for a year. Blockchain networks, depending on their model, can be incredibly energy-intensive.

As a nation celebrated for its commitment to renewable energy and natural beauty, how do we reconcile this with the carbon footprint of advanced tech? Pushing for AI and Web3 without a "Green Tech" policy is unsustainable, which raises the following questions:

1. How can we power our data centers with our abundant geothermal and solar energy?
2. What are our policies for the e-waste that will inevitably come from the hardware required for this tech leap?
3. Are we incentivizing startups that use AI and data to solve environmental challenges, such as optimizing water use or predicting climate impact on agriculture?

The Silicon Savannah cannot be a success if it turns into a digital wasteland. Our tech evolution must be inherently sustainable.

Readiness now means:

1. Building Data Sovereignty: Treating data as a strategic national asset.

2. Fostering Deep-Tech Innovation: Moving beyond consumer apps to fund and support foundational AI and Web3 projects.

3. Future-Proofing Education: Radically reshaping our curriculum to prepare for an AI-augmented world.

4. Embedding Sustainability: Making green principles non-negotiable in our tech policy.

What do you think? Is Kenya ready to lead the next wave of tech innovation? Share your thoughts.

by gregory.tech

Beyond Functional: Writing Professional and Performant SQL Queries

Kipngeno Gregory — Sun, 21 Sep 2025 13:03:36 +0000

Structured Query Language (SQL) is one of the most widely used languages for interacting with databases, yet even experienced developers often make subtle mistakes that affect performance, readability, and security. Writing high-quality SQL queries is critical for scalability, maintainability, and efficiency.

1. The Siren Call of `SELECT *` Instead of Explicit Columns

Problem: SELECT * retrieves every column from a table, even those not needed. This increases network load, slows down queries, and can break applications if the schema changes.

Solution: Always specify the exact columns you need:

SELECT id, name, created_at FROM users;

-- Instead of:
SELECT * FROM orders;

-- Use:
SELECT order_id, customer_id, order_date, total_amount
FROM orders;

This improves performance and keeps your queries predictable.

2. Neglecting the Power of Indexes

Problem: Queries that filter large tables without proper indexes often result in full table scans, significantly slowing down performance.

The Mistake: Writing predicates that prevent index usage.

Common culprits include:

Wrapping a column in a function: WHERE YEAR(order_date) = 2023
Using a wildcard at the beginning of a LIKE pattern: WHERE customer_name LIKE '%Smith%'
Using OR conditions on different columns without appropriate indexes.

The Impact: The query forces a full table scan, whose cost grows in proportion to the size of the table.

Solution: Ensure that columns used in WHERE, JOIN, and ORDER BY clauses are indexed where appropriate. Always analyze query execution plans to confirm indexes are being used.
code

-- Instead of (non-sargable):
SELECT * FROM orders WHERE YEAR(order_date) = 2023;

-- Use (sargable - Search ARGument ABLE):
SELECT * FROM orders WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01';
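One way to check this for yourself is to ask the database for its query plan. The sketch below uses Python's built-in sqlite3 module with a throwaway in-memory table; SQLite has no YEAR() function, so strftime('%Y', ...) stands in for it, and the index name is made up. Other engines expose the same idea through their own EXPLAIN output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, total_amount REAL);
    CREATE INDEX idx_orders_order_date ON orders (order_date);
""")

def query_plan(sql):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text

# Function wrapped around the column: the index cannot narrow the search
non_sargable = query_plan(
    "SELECT order_id FROM orders WHERE strftime('%Y', order_date) = '2023'"
)
# Plain range on the column: the index can be searched directly
sargable = query_plan(
    "SELECT order_id FROM orders "
    "WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'"
)
print("function on column:", non_sargable)
print("range on column:   ", sargable)
```

In the plan text, SCAN means every row (or index entry) is visited, while SEARCH means the index is used to jump to the matching range.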

3. Mishandling `NULL` Values

Problem: Failing to account for NULL values can produce unexpected results, especially with comparison operators.

Solution: Use IS NULL and IS NOT NULL to test for NULLs, and COALESCE() (or IFNULL() where your database supports it) to substitute a default value:

SELECT COALESCE(email, 'kipngenogregory@gmail.com') AS safe_email FROM users;

-- Instead of (will not find NULL phone numbers):
SELECT * FROM customers WHERE phone_number = NULL;

-- Use:
SELECT * FROM customers WHERE phone_number IS NULL;

-- To safely perform calculations:
SELECT product_id, price * COALESCE(quantity, 0) AS estimated_value FROM order_items;
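The difference between = NULL and IS NULL is easy to demonstrate. This small sketch uses Python's built-in sqlite3 module with three made-up customers, two of whom have no phone number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, phone_number TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ann", "555-0100"), ("Ben", None), ("Cy", None)],  # None is stored as NULL
)

# Comparing with = NULL never evaluates to true, so no rows match
equals_null = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE phone_number = NULL"
).fetchone()[0]

# IS NULL is the correct test
is_null = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE phone_number IS NULL"
).fetchone()[0]

print("rows matched by '= NULL': ", equals_null)
print("rows matched by 'IS NULL':", is_null)
```

The first query silently returns nothing, which is exactly the kind of bug that goes unnoticed until someone asks why the report is empty.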

4. Improper Filtering in `GROUP BY` and `HAVING`

A confusion between the WHERE and HAVING clauses leads to inefficient queries.

The Mistake: Using the HAVING clause to filter rows before aggregation.
The Impact: The HAVING clause filters groups after they have been aggregated. Filtering individual rows first is the job of the WHERE clause, which is far more efficient as it reduces the working data set before the costly aggregation operation.
The Professional Approach: Use WHERE to filter rows. Use HAVING to filter groups.

-- Inefficient:
SELECT customer_id, COUNT(order_id) AS order_count
FROM orders
GROUP BY customer_id
HAVING customer_id > 100; -- Filtering on a single row *after* grouping

-- Efficient:
SELECT customer_id, COUNT(order_id) AS order_count
FROM orders
WHERE customer_id > 100  -- Filter rows *before* grouping
GROUP BY customer_id;

Conclusion

By eschewing SELECT *, respecting indexes, handling NULLs correctly, using explicit joins, and filtering strategically, you elevate your code from merely functional to truly exceptional. This leads to systems that are faster, more reliable, and easier to debug, a hallmark of a true data professional.

by gregory.tech

The Best "Man" Wins: Why the Vibe Coder vs. Engineer Debate is Over

Kipngeno Gregory — Wed, 17 Sep 2025 11:39:53 +0000

You’ve seen the convo. It’s on tech spaces, in Slack channels, and over coffee. On one side: the Vibe Coder, the modern hacker who intuits, iterates, and ships with an almost artistic flow. On the other: the Engineer, the disciplined architect for whom structure, tests, and scalability are non-negotiable.

The dialogue is often framed as a battle of chaos versus order, a holy war for the soul of development.

But it’s a false dichotomy. The winner isn’t one or the other. The winner is the one who knows which tool to use, and when.

The Rise of the Vibe

Let’s be clear: the Vibe Coder is ascendant for a reason. Their toolkit is a product of our time: Copilot, ChatGPT, Claude, and Serverless. They don’t just code; they conduct. They describe an intention, a vibe, and the AI pair programmer translates it into functional code.

This isn’t laziness. It’s leverage.

They are the engine of the MVP, the solo founder validating an idea over a weekend, the creative dev building breathtaking interactive art. Their currency is speed and instinct, and in the right context, it’s a superpower.

The Inevitable Crash

But every superpower has a kryptonite. The unchecked Vibe Coder’s kryptonite is Scale.

We’ve all seen it or inherited it. The prototype that “just worked” becomes a production nightmare. The repository becomes a museum of clever hacks that no one understands. The lack of tests, architecture, or documentation creates a technical debt that cripples progress. When the vibe checks out, all that’s left is the chaos.

The Bedrock of the Engineer

This is where the Engineer, often unfairly maligned as slow or rigid, becomes indispensable. They are the anti-chaos agents. They build the platforms, databases, and frameworks that the Vibe Coder relies upon. They are the ones who scale the successful MVP into a robust system that can handle millions.

Their domain is mission-critical systems: finance, aviation, healthcare. Places where “it works on my machine” isn’t just a meme; it’s a catastrophic failure point. Their value isn’t in raw speed, but in predictable, lasting power.

The Synthesis: The Context-Aware Developer

So, who wins? The Vibe Coder or the Engineer?

The best “man” wins. And the best man is the developer who refuses to be just one thing.

The modern tech landscape doesn’t demand you pick a side. It demands you master both modes.

1. Vibe Mode: For ideation, prototyping, exploration, and personal projects.

2. Engineer Mode: For refactoring, scaling, working in legacy systems, and writing critical path code.

The most valuable player on any team is the one who can pivot from a frenetic, AI-powered coding session to methodically diagramming a system architecture and understand why both are essential.

The debate was never about which style is better. It was about context. The future belongs not to the pure Vibe Coder or the pure Engineer, but to the hybrid, the developer agile enough to harness intuition and discipline in equal measure.

The one who uses the right tool for the job.

That’s who wins.

by gregory.tech

Similarities Between Stored Procedures and Python Functions

Kipngeno Gregory — Mon, 08 Sep 2025 05:39:27 +0000

While residing in different technological layers—SQL in the database and Python in the application layer—stored procedures and Python functions are fundamental constructs that share a common philosophical goal: **modularity and reuse.**

1. Encapsulation of Logic
Stored Procedure: Encapsulates one or more SQL statements, along with procedural logic, into a single executable unit within the database. This hides the complexity of the underlying SQL and database schema from the application code.

Python Function: Encapsulates a block of Python code that performs a specific task. This promotes the DRY (Don't Repeat Yourself) principle and isolates functionality.

2. Parameterization
Stored Procedure: Defines input (IN), output (OUT), and input-output (INOUT) parameters.

example

CREATE PROCEDURE GetEmployee(IN emp_id INT)
BEGIN
    SELECT * FROM employees WHERE id = emp_id;
END;

Python Function: Defines parameters in its signature, which can be positional, keyword, or have default values.
example

def get_employee(emp_id):
    # ... code to fetch employee ...
    return employee_data
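For a version that actually runs, here is a minimal sketch of the same function backed by Python's built-in sqlite3 module. The table layout and the sample employee are made up for illustration; the point is that the emp_id parameter plays the same role as the IN parameter of the stored procedure.

```python
import sqlite3

def get_employee(conn, emp_id):
    """Fetch one employee row by id, or None if no such employee exists."""
    cursor = conn.execute(
        "SELECT id, name FROM employees WHERE id = ?",  # '?' binds emp_id safely
        (emp_id,),
    )
    return cursor.fetchone()

# Throwaway in-memory database with one hypothetical employee
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO employees (id, name) VALUES (1, 'Wanjiru')")

print(get_employee(conn, 1))
print(get_employee(conn, 99))
```

Passing the value as a bound parameter rather than formatting it into the SQL string keeps the function safe from SQL injection, much as a stored procedure's typed parameters do.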

3. Reusability and Maintainability
Reusability: A single well-defined procedure or function eliminates code duplication. A change need only be made in one place.

Maintainability: Fixing a bug or optimizing logic requires modification only within the procedure or function, not in every location where the logic was previously duplicated. This reduces errors and simplifies testing.

Conclusion:

Stored procedures and Python functions are conceptual cousins. They both champion the software engineering principles of modularity, encapsulation, and reuse. A stored procedure is essentially the database's equivalent of a function: a specialized function designed for optimal, secure, and efficient data manipulation within the database engine.

Understanding SQL Constructs: Subqueries, CTEs, and Stored Procedures

Kipngeno Gregory — Mon, 08 Sep 2025 05:29:24 +0000

In the realm of SQL and relational database management, efficiently retrieving and manipulating data is paramount.
This article elucidates the differences between subqueries, Common Table Expressions (CTEs), and stored procedures.

1. Subquery (Nested Query)
A subquery, or nested query, is a SQL query embedded within the WHERE, FROM, or SELECT clause of another SQL query. Its primary role is to return a result set that the outer query uses for its execution.

Characteristics:
Purpose: To compute a value or set of values for use in a filter, calculation, or as a derived table within a single, primary query.

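Scope & Lifetime: A correlated subquery (one that references columns of the outer query) is conceptually evaluated once per row of the outer query, while an uncorrelated subquery can be evaluated just once. In both cases the result exists only for the duration of the main query's execution and cannot be reused by other statements.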

Readability: Can quickly become complex and difficult to read, especially when nested multiple levels deep (often called "nested hell").
Use Case: Find all employees whose salary is above the company average.

SELECT employee_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
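
The query above can be tried without a database server. The following is a minimal sketch that runs it against an in-memory SQLite database through Python's standard `sqlite3` module. The four employee rows are hypothetical values chosen so that the average is easy to check by hand, and the `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_name TEXT, salary REAL)")

# Hypothetical sample rows; their average salary is 67500.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Achieng", 90000), ("Kamau", 60000), ("Wanjiru", 75000), ("Otieno", 45000)],
)

# The inner SELECT is uncorrelated, so it yields one value that the
# outer WHERE clause compares against every row.
above_average = conn.execute(
    """
    SELECT employee_name, salary
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    ORDER BY salary DESC
    """
).fetchall()

for name, salary in above_average:
    print(name, salary)
```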

2. Common Table Expression (CTE)
A CTE, defined using the WITH clause, is a temporary named result set that exists only within the scope of a single SELECT, INSERT, UPDATE, or DELETE statement. It is primarily a tool for improving query organization and readability.
Characteristics:
Purpose: To break down complex queries into simpler, logical, and reusable parts. CTEs make queries more readable and maintainable, and they enable recursive queries, which are impossible with standard subqueries.

Scope & Lifetime: The CTE is defined at the beginning of a statement and can be referenced multiple times within that same statement. It is discarded after the statement executes.

Readability: Significantly improves readability by allowing a modular, "step-by-step" approach to building queries.
Use Case: List the regions whose total sales exceed 1,000,000.

WITH RegionalSales AS (
    SELECT region_id, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region_id
)
SELECT region_name, total_sales
FROM regions r
JOIN RegionalSales rs ON r.id = rs.region_id
WHERE rs.total_sales > 1000000;
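
The recursive capability mentioned under Purpose is worth seeing once. The sketch below is an illustration only: it assumes a hypothetical `staff` table with a self-referencing `manager_id` column and uses SQLite (through Python's standard `sqlite3` module) to walk the reporting chain from the top down. Other engines use very similar `WITH RECURSIVE` syntax, but details can differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)

# Hypothetical hierarchy: each person reports to the one above.
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [(1, "Director", None), (2, "Manager", 1), (3, "Analyst", 2), (4, "Intern", 3)],
)

rows = conn.execute(
    """
    WITH RECURSIVE chain(id, name, depth) AS (
        -- Anchor member: start from the person with no manager.
        SELECT id, name, 0 FROM staff WHERE manager_id IS NULL
        UNION ALL
        -- Recursive member: add everyone who reports to a row already found.
        SELECT s.id, s.name, c.depth + 1
        FROM staff s
        JOIN chain c ON s.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth
    """
).fetchall()

for name, depth in rows:
    print("  " * depth + name)
```

The CTE refers to itself by name inside its own definition, which is the step a plain subquery has no way to express.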

3. Stored Procedure
A stored procedure is a named collection of SQL statements and optional procedural logic (variables, conditionals, loops) stored within the database itself. Depending on the engine, it is either precompiled or compiled and cached on first use. It is executed as a single unit, often to encapsulate a business logic operation.
Characteristics:
Purpose: To encapsulate complex operations, promote code reuse, enhance security, and improve performance. They are used for data manipulation, data definition, and administrative tasks.

Scope & Lifetime: Stored procedures are permanent database objects (like a table or view). They are stored on the server side and persist beyond a single query session.

Readability: Encapsulates business logic, keeping application code cleaner. The logic is centralized within the database.
Use Case: Record a new order and its line item as a single transaction.

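DELIMITER //
CREATE PROCEDURE PlaceNewOrder (
    IN p_customer_id INT,
    IN p_product_id INT,
    IN p_quantity INT
)
BEGIN
    -- Undo both inserts if either one fails, then re-raise the error
    DECLARE EXIT HANDLER FOR SQLEXCEPTION
    BEGIN
        ROLLBACK;
        RESIGNAL;
    END;

    START TRANSACTION;
    INSERT INTO orders (customer_id, order_date) VALUES (p_customer_id, NOW());
    -- LAST_INSERT_ID() returns the auto-generated id of the order just inserted
    INSERT INTO order_items (order_id, product_id, quantity) VALUES (LAST_INSERT_ID(), p_product_id, p_quantity);
    COMMIT;
END //
DELIMITER ;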
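
For comparison, the same unit of work can be sketched as a Python function. This is a minimal illustration under stated assumptions: SQLite has no stored procedures, so the logic lives in application code, and the table definitions (including the `CHECK` constraint used to force a failure) are hypothetical. The `with conn:` block commits on success and rolls back if an exception is raised, and `cursor.lastrowid` stands in for `LAST_INSERT_ID()`.

```python
import sqlite3


def place_new_order(conn, customer_id, product_id, quantity):
    """Insert an order and its line item atomically; return the new order id."""
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        cur = conn.execute(
            "INSERT INTO orders (customer_id, order_date) VALUES (?, datetime('now'))",
            (customer_id,),
        )
        order_id = cur.lastrowid  # counterpart of LAST_INSERT_ID()
        conn.execute(
            "INSERT INTO order_items (order_id, product_id, quantity) VALUES (?, ?, ?)",
            (order_id, product_id, quantity),
        )
    return order_id


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)
conn.execute(
    "CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, "
    "quantity INTEGER CHECK (quantity > 0))"
)

order_id = place_new_order(conn, customer_id=7, product_id=42, quantity=3)
items = conn.execute("SELECT order_id, product_id, quantity FROM order_items").fetchall()
print(order_id, items)

# A failing second insert (quantity 0 violates the CHECK) undoes the first one too.
try:
    place_new_order(conn, customer_id=7, product_id=42, quantity=0)
except sqlite3.IntegrityError:
    pass
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(order_count)
```

The trade-off is where the logic lives: in the database, every client shares one definition; in application code, the logic is versioned and tested alongside the rest of the program.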

ANALYSIS OF KENYAN CROPS DATASET: AGRICULTURAL INSIGHTS, TRENDS, AND OPPORTUNITIES

Kipngeno Gregory — Sun, 24 Aug 2025 06:05:29 +0000

Key Insights:

Most Profitable Crop: Potatoes led as the most profitable crop, followed closely by cassava and rice.
Land Utilization: Nairobi and Machakos counties allocated the highest land area to crop production.
Revenue Leaders by Crop Type: Cassava, potatoes, and rice generated the highest revenue.
County Performance: Machakos and Nyeri counties dominated in total revenue, particularly from potatoes, cassava, and rice.
Seasonal Insights: Dry seasons (35%) and long rains (33%) contributed the most to revenue, while short rains accounted for 31%.
Yield Patterns: Potatoes and rice recorded the highest average yields.
Irrigation Methods: Drip irrigation drove the highest profitability (43%), followed by flood (32%) and sprinkler irrigation (24%).
Soil & Fertilizer Use: Clay, loamy, and sandy soils supported the most revenue. Fertilizers CAN and DAP were most used and correlated with higher revenue.
Monthly Revenue Trends: Revenue peaked in March, April, and May, but dropped in August due to pest infestations, delayed harvests, and weather challenges.
Profit Trends Over Time: Profits fluctuated seasonally, but showed resilience in high-demand crops.

Overall Insight:
Potatoes, cassava, and rice are the backbone of Kenya’s agricultural profitability. Investments in drip irrigation, fertilizer optimization, and pest management can significantly improve yields and stabilize revenues.

Analysis by Kipngeno Gregory

an article on Excel’s Strengths, Weaknesses and the Role of Excel in Predictive Analysis

Kipngeno Gregory — Sun, 10 Aug 2025 09:13:56 +0000

an article on Excel’s Strengths and Weaknesses in Predictive Analysis and the Role of Excel in Making Data-Driven Business Decisions

Introduction

Microsoft Excel remains a fundamental tool for business analytics, offering accessible predictive capabilities despite its limitations.

Strengths

User-Friendly – Intuitive interface with no coding required for basic analysis.
Built-in Tools – Forecast Sheet, regression (via the Analysis ToolPak), and What-If Analysis.
Rapid Visualization – Charts, trendlines, and PivotTables simplify pattern recognition.
Integration – Works with Power BI, SQL databases, and other enterprise systems.

Weaknesses

Limited Computational Power – Excel struggles with large datasets, typically beyond a few hundred thousand rows (a worksheet holds at most 1,048,576 rows).
Basic Statistical Capabilities – While Excel can handle simple regressions and forecasts, it lacks the advanced machine learning algorithms and statistical methods found in tools like Python and R.
Error-Prone – Manual processes increase the risk of formula mistakes.
Static Data – No real-time analytics without manual refreshes.

The Role of Excel in Data-Driven Business Decisions

Exploratory Analysis – Quick insights for SMEs and non-technical teams.
Scenario Modeling – Tests business strategies (e.g., pricing, budgets).
Transition Tool – Bridges the gap between manual analysis and advanced BI platforms.
Trend Forecasting – Businesses can use Excel to project sales, expenses, or market growth using historical data.
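
To show what a simple trend forecast involves, and how it looks in one of the tools Excel is compared with, here is a minimal Python sketch of a least-squares linear projection. It is conceptually similar to fitting a linear trendline in Excel. The function name `linear_forecast` and the sales figures are hypothetical, and the sketch assumes evenly spaced periods.

```python
def linear_forecast(history, periods_ahead=1):
    """Project a value by fitting a least-squares line to evenly spaced history."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two historical points")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    # slope = covariance(x, y) / variance(x)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    # The last observed period is x = n - 1, so project beyond it.
    return intercept + slope * (n - 1 + periods_ahead)


# Hypothetical monthly sales figures, for illustration only.
monthly_sales = [100.0, 110.0, 120.0, 130.0]
print(linear_forecast(monthly_sales))     # projection for the next month
print(linear_forecast(monthly_sales, 3))  # projection three months ahead
```

A straight line is the simplest possible model; real sales data with seasonality would need something richer, which is where the limitations listed under Weaknesses start to matter.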

Conclusion

Excel may not compete with advanced machine learning platforms for highly complex predictive analysis, but its accessibility, flexibility, and visualization capabilities make it an indispensable tool in many organizations.

"Data is the new oil." By Kipngeno Gregory

How to Install and Set Up PostgreSQL on a Linux Server (Ubuntu)

Kipngeno Gregory — Sat, 02 Aug 2025 11:05:02 +0000

Assignment 0 LUXDEV COHORT 4 JULY INTAKE

1️⃣ Step 1: Update Your System

sudo apt update
sudo apt upgrade -y

2️⃣ Step 2: Install PostgreSQL

sudo apt install postgresql postgresql-contrib -y

postgresql: Core PostgreSQL database system.
postgresql-contrib: Additional useful tools and extensions.

3️⃣ Step 3: Verify PostgreSQL Installation

sudo systemctl status postgresql

To start/stop/restart PostgreSQL:

sudo systemctl start postgresql
sudo systemctl stop postgresql
sudo systemctl restart postgresql

4️⃣ Step 4: Switch to PostgreSQL User

sudo -i -u postgres

5️⃣ Step 5: Access PostgreSQL

psql

To exit:
\q

6️⃣ Step 6: Secure PostgreSQL

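Inside the psql prompt (reopen it with psql if you exited in Step 5), set a password for the postgres role:

\password postgres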

7️⃣ Step 7: Create a New Database

CREATE DATABASE myappdb;
CREATE USER myappuser WITH ENCRYPTED PASSWORD 'mypassword';
GRANT ALL PRIVILEGES ON DATABASE myappdb TO myappuser;

8️⃣ Step 8: Test Connection Locally

psql -h localhost -U myappuser -d myappdb
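
Applications usually connect with a connection string rather than individual psql flags. The following is a minimal sketch, assuming the database, user, and password from Step 7 and PostgreSQL's default port 5432, that builds a `postgresql://` URI using only the Python standard library. The helper name `build_postgres_uri` is hypothetical, and the sketch only constructs the string; it does not open a connection.

```python
from urllib.parse import quote


def build_postgres_uri(user, password, host, dbname, port=5432):
    """Build a postgresql:// URI, percent-encoding user, password, and database."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(dbname, safe='')}"
    )


# Values taken from Step 7 of this guide; replace the password in real use.
uri = build_postgres_uri("myappuser", "mypassword", "localhost", "myappdb")
print(uri)
```

psql also accepts such a URI in place of the separate -h, -U, and -d options. Percent-encoding matters when a password contains characters such as @ or /, which would otherwise be read as part of the URI structure.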

You have now installed and configured PostgreSQL on your Linux server 🙂

by Kipngeno Gregory
