Forem: Emmanuel Chima

Cell-to-Sentence (C2S): LLM-Powered scRNA-seq Annotation with Gemma 4

Emmanuel Chima — Sun, 24 May 2026 16:38:51 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

Cell-to-Sentence (C2S) is an AI-powered annotation engine for single-cell RNA sequencing (scRNA-seq) data. It eliminates one of the most expensive bottlenecks in modern genomics: manually labelling what each cluster of cells is and does.
After computationally clustering cells, a trained bioinformatician must inspect marker gene lists, cross-reference databases like CellMarker and PanglaoDB, and formulate a biological interpretation of each cluster's identity and functional state. For a typical dataset this takes 4–8 hours and is highly dependent on domain expertise. C2S reduces this to under 2 minutes.

How it works:
Each cell's transcriptomic profile is converted into a "Cell Sentence": a rank-ordered string of the most highly expressed gene symbols (e.g. CD8A GZMB PRF1 IFNG PDCD1 ...). This natural-language representation is then passed to Gemma 4, which uses its biomedical knowledge and structured chain-of-thought reasoning to return:

Cell type: e.g., CD8+ Cytotoxic T Cell
Functional state: e.g., Activated / Effector
Active pathways: e.g., T cell receptor signaling, Cytokine-mediated signaling

All pathway claims are validated against the Gene Ontology (GO) database to ensure scientific grounding, then the annotations are projected back onto a UMAP for publication-ready visualization.
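As a rough illustration of the Cell Sentence step described above, the sketch below ranks one cell's genes by expression and joins the top symbols into a string. The function name `cell_sentence`, the `top_k` cutoff, the alphabetical tie-break, and the toy counts are all assumptions made for this example; they are not taken from the project's notebook.

```python
def cell_sentence(expression, top_k=5):
    """Return the top_k expressed gene symbols, highest first, as one string.

    `expression` maps gene symbol -> expression value for a single cell.
    Genes with zero expression are dropped; ties are broken alphabetically
    so the output is deterministic.
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    ranked = sorted(expressed, key=lambda gv: (-gv[1], gv[0]))
    return " ".join(gene for gene, _ in ranked[:top_k])


# Toy counts for one cell (made-up numbers, not real data).
cell = {"GZMB": 7.0, "CD8A": 9.0, "ALB": 0.0, "PRF1": 5.0, "IFNG": 3.0, "PDCD1": 1.0}
sentence = cell_sentence(cell)
print(sentence)
```

The resulting string is what would be placed into the prompt; how many genes to keep per sentence is a tuning choice this sketch does not settle.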

What makes this different from existing tools?
Prior Cell-to-Sentence tools like CeLLama convert cell sentences into embedding vectors and find the nearest known-cell neighbour. That approach is fast but purely classificatory: it tells you what a cell is but not why, cannot flag uncertainty, and cannot describe a cell's functional state. C2S uses Gemma 4's reasoning to explain a cell's phenotype, surface uncertainty explicitly, and ground every biological claim in the Gene Ontology.

Demo

demo video

Code

kaggle notebook

How I Used Gemma 4

I chose Gemma 4 4B MoE (E4B) because the mixture-of-experts architecture gives it a far larger effective knowledge base than a 4B dense model, which matters enormously for biomedical reasoning. Recognising obscure gene symbols, understanding pathway crosstalk, and distinguishing cell states requires breadth that a small dense model simply lacks.
Critically, Gemma 4's <|thought|> structured reasoning was the deciding factor. When a cell sentence is ambiguous, for example, a cluster co-expressing both exhausted and effector T cell markers, Gemma 4 reasons through the tension explicitly before committing to an annotation. This is not possible with embedding-based approaches. The model's reasoning trace also serves as an audit trail, making the annotation scientifically defensible in a way that black-box classification cannot be.
The pipeline feeds each cell sentence as a structured prompt requesting a JSON response containing cell_type, functional_state, active_pathways, and an uncertainty_flag. This output is then parsed and validated against GO terms before being written back to the AnnData object.
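The parse-and-validate step can be sketched roughly as follows. This is a minimal example under stated assumptions: the four key names come from the description above, but the function `parse_annotation`, the extra `verified_pathways` / `unverified_pathways` fields, and the tiny `known` list standing in for a real Gene Ontology lookup are hypothetical.

```python
import json

REQUIRED_KEYS = {"cell_type", "functional_state", "active_pathways", "uncertainty_flag"}


def parse_annotation(raw, known_pathways):
    """Parse a model reply and split its pathways into verified / unverified.

    Returns None when the reply is not valid JSON, is not an object,
    lacks a required key, or has a non-list `active_pathways`.
    `known_pathways` is a stand-in for a real Gene Ontology lookup.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    pathways = data["active_pathways"]
    if not isinstance(pathways, list):
        return None
    known = {p.lower() for p in known_pathways}
    data["verified_pathways"] = [p for p in pathways if str(p).lower() in known]
    data["unverified_pathways"] = [p for p in pathways if str(p).lower() not in known]
    return data


# Hypothetical reply and a two-entry stand-in for the GO term list.
raw = (
    '{"cell_type": "CD8+ Cytotoxic T Cell", '
    '"functional_state": "Activated / Effector", '
    '"active_pathways": ["T cell receptor signaling", "Made-up pathway"], '
    '"uncertainty_flag": false}'
)
known = ["T cell receptor signaling", "Cytokine-mediated signaling"]
result = parse_annotation(raw, known)
print(result["verified_pathways"], result["unverified_pathways"])
```

Counting how often `parse_annotation` returns a result, and how many pathways land in the verified list, gives figures of the same kind as the JSON parse rate and GO verify rate reported below.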

Results
Against the CellTypist Pan_Immune v2 ground truth on a 70,000-cell PBMC dataset:

Gemma 4 Base (zero-shot): ARI=0.266, NMI=0.376, JSON parse rate=100%, GO verify rate=65.9%
Gemma 4 Fine-Tuned (QLoRA C2S): Improved GO verify rate=67.2%; ARI/NMI recovering post-fix
Top GO-verified pathways: T Cell Receptor Signaling (17 clusters), Cell Cycle Regulation (12), Plasma Cell Differentiation (11)

The confusion matrix in the notebook shows strong recall for monocytes (0.82), DCs (1.00), and the dominant "Other" T-cell blob (0.74), with NK and Platelet recall failing due to the coarse-mapping bugs now resolved.
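For readers unfamiliar with the ARI figure quoted above, here is a small pure-Python sketch of the adjusted Rand index computed from pair counts. It is included only to show what the number measures; it is not the notebook's code, and the two toy labelings are invented for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items.

    1.0 means identical partitions (label names do not matter),
    values near 0 mean agreement no better than chance.
    """
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if n < 2:
        return 1.0
    # Pairs of items that share a cluster in both labelings, in each one alone.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_cells - expected) / (max_index - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, crossed)
```

Because the score is adjusted for chance, a relabelled but identical partition scores 1.0 while a partition that cuts across every true cluster can score below zero.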

Credits to my teammate Andrew.

Autoencoders and Representation Learning in Vision

Emmanuel Chima — Wed, 22 Apr 2026 14:28:41 +0000

Autoencoders are a type of neural network that compress data into a lower-dimensional space and then reconstruct the original input from that compressed representation.

If you've ever encountered Principal Component Analysis (PCA), then you already have an intuition for how this works. The key difference is that PCA is a linear projection method, while autoencoders use neural networks, allowing them to learn non-linear structure in the data.

In theory, a linear autoencoder with a single hidden layer, trained with a squared-error loss, recovers the same subspace as PCA. But once we introduce depth and non-linearity, the model begins to learn richer representations that go far beyond linear subspaces.

How does the Autoencoder work?

The Autoencoder follows a two-stage component design.

1. The Encoder

The encoder is the first component of the autoencoder. It compresses the input data by projecting it into a lower-dimensional latent space.

The objective is to extract the most informative features required to represent the original data efficiently, while discarding redundancy and noise.

Formally:

z = f_{\theta}(x)

Where:

$x$ = input data
$f_{\theta}$ = encoder network (parameterized by $\theta$)
$z$ = latent representation

This is analogous to PCA in the linear case, where the model projects data onto principal components. However, unlike PCA, the encoder learns non-linear representations.

2. The Decoder

The decoder is the second component of the autoencoder. It reconstructs the original input from the compressed latent representation.

\hat{x} = g_{\phi}(z)

Where:

$g_{\phi}$ = decoder network
$\hat{x}$ = reconstructed output

The full pipeline becomes:

\hat{x} = g_{\phi}(f_{\theta}(x))

The goal of the autoencoder is to minimize the reconstruction error:

\mathcal{L} = \| x - \hat{x} \|_2^2

So the model is explicitly trained to preserve information needed for reconstruction while discarding everything else.

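import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAE(nn.Module):
    """Minimal fully connected autoencoder: 784 -> 64 -> 784."""

    def __init__(self):
        super().__init__()
        # Encoder f_theta: compress a 784-d input (e.g. a flattened 28x28 image)
        # into a 64-d latent vector z.
        self.enc = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 64)
        )
        # Decoder g_phi: map z back to 784 values; Sigmoid keeps outputs in [0, 1].
        self.dec = nn.Sequential(
            nn.Linear(64, 256),
            nn.ReLU(),
            nn.Linear(256, 784),
            nn.Sigmoid()
        )

    def forward(self, x):
        z = self.enc(x)      # latent representation
        return self.dec(z)   # reconstruction x_hat

model = TinyAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(500):
    # Random stand-in batch in [0, 1); replace with real flattened images.
    x = torch.rand(16, 784)

    recon = model(x)
    loss = F.mse_loss(recon, x)  # mean squared reconstruction error

    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 100 == 0:
        print(step, loss.item())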

What is Representation Learning?

Often in computer vision, we want to know what kind of internal structure a model learns about the world when it is forced to predict missing information. Studying this is the task of representation learning. To answer the question, engineers use different self-supervised learning techniques, and the choice of technique determines whether a model learns textures, local continuity, or true semantics. This matters especially in fields like medical imaging, where meaning is encoded not in individual pixels but in the 3D configuration of anatomy.
At a technical level, representation learning asks:

What structure does the latent space (z) actually encode about the input?

Formally:

z = f_\theta(x)

We want (z) to:

discard noise
preserve semantic structure
generalize to downstream tasks (segmentation, detection, classification)

All reconstruction-based methods share this same task: reconstruct what is missing from the inputs, but the nature of what is missing completely changes the learning dynamics.

To understand this better, let us take a look at three important concepts.

Three Levels of Reconstruction Difficulty

1. Naive reconstruction (identity learning)

Here, the model reconstructs the full input. If nothing is removed from the input and only compression takes place, can the model reconstruct the input with minimal error? This is the trivial case of representation learning: compression of the identity.

g_\phi(f_\theta(x)) \approx x

The typical behaviour to expect is simple. The model learns an identity mapping and memorizes the relationships between different pixels. It does not learn any abstraction.
This is not exactly representation learning. It is compression without constraint.

2. Random masking (weak structure learning)

In random masking, we remove individual pixels independently and ask the model to rebuild the image with those pixels restored. This forces interpolation: the model fills in missing values from neighboring pixels, and so learns local texture, smoothing, and short-range continuity.

x_{masked} = x \odot (1 - m), \quad m_i \sim \text{Bernoulli}(p)

Each pixel is removed independently with probability $p$ (here $m_i = 1$ marks a removed pixel). This has a limitation: because the missingness is unstructured, the model can rely on:

nearby pixels
local gradients
smoothing priors

So it never needs global reasoning.
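A minimal sketch of random pixel masking follows. It is written with NumPy to keep the example dependency-light (the article's own code uses PyTorch), and the name `random_pixel_mask`, the `seed` argument, and the all-ones test image are assumptions for illustration. In this sketch `mask == 1` marks a removed pixel.

```python
import numpy as np


def random_pixel_mask(x, p=0.5, seed=0):
    """Zero each pixel independently with probability p.

    Returns (masked image, mask), where mask == 1 marks a removed pixel.
    """
    rng = np.random.default_rng(seed)
    mask = (rng.random(x.shape) < p).astype(x.dtype)
    return x * (1 - mask), mask


# An all-ones image makes it easy to see which pixels were removed.
x = np.ones((1, 1, 32, 32), dtype=np.float32)
x_masked, mask = random_pixel_mask(x, p=0.5)
print("fraction removed:", float(mask.mean()))
```

Because the holes are scattered single pixels, almost every removed pixel still has visible neighbours, which is why interpolation alone goes a long way on this task.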

3. Block masking (structural reasoning)

Block masking changes the game entirely by removing entire regions instead of individual pixels. The model is then forced to reconstruct a missing region from incomplete data. This type of masking is very relevant in medical imaging such as CT, where pathology is region-based rather than pixel-based.

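def block_mask(x, patch=8, ratio=0.5):
    """Zero out random patch x patch tiles; returns (masked x, mask).

    mask == 1 marks removed positions. The same tiles are removed for
    every sample and channel in the batch.
    """
    B, C, H, W = x.shape
    mask = torch.zeros_like(x)

    for i in range(0, H, patch):
        for j in range(0, W, patch):
            # Drop this tile with probability `ratio`.
            if torch.rand(1).item() < ratio:
                mask[:, :, i:i+patch, j:j+patch] = 1

    return x * (1 - mask), mask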

The key idea is to remove entire regions. This forces the model to learn the structure of objects at a high level, and it encourages spatial coherence and inference from global context.
This is especially important in CT imaging where:

organs are contiguous 3D structures
pathology spans regions, not pixels

How the input is corrupted is therefore central to this question. Two major families of autoencoders take different approaches:

Masked Autoencoder (MAE)
Denoising Autoencoder (DAE)

1. Denoising Autoencoder (DAE)

The DAE approaches this problem by first corrupting the inputs with noise (Gaussian white noise). Every pixel is slightly corrupted, yet the structure is still visible. The issue with the DAE is that it can easily remove the noise and leave everything else unchanged, which means it never really learns to reconstruct missing parts. It acts like a complex filter rather than a structure learner.

x_{noisy} = x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)

What this means

Every pixel is slightly perturbed, but the overall structure of the image remains fully visible. The model then learns to remove noise and recover the original signal. The limitation is structural: because structure is preserved, the model can succeed by smoothing local variations and averaging out noise.

So instead of learning structure, it often behaves like a learned denoising filter.

In other words:

It learns how to clean, not how to understand.
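To make the point concrete, the sketch below builds one DAE training pair and checks how much of the clean signal survives the corruption. It uses NumPy rather than the article's PyTorch, and `make_dae_pair`, `sigma=0.1`, and the gradient test image are illustrative assumptions.

```python
import numpy as np


def make_dae_pair(x, sigma=0.1, seed=0):
    """Return (noisy input, clean target) for denoising-autoencoder training."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=x.shape).astype(x.dtype)
    return x + noise, x


# A smooth 8x8 intensity ramp stands in for an image with visible structure.
x = np.linspace(0.0, 1.0, 64, dtype=np.float32).reshape(8, 8)
x_noisy, target = make_dae_pair(x)

# Correlation between the corrupted input and the clean target.
corr = float(np.corrcoef(x_noisy.ravel(), target.ravel())[0, 1])
print("input/target correlation:", corr)
```

The noisy input stays strongly correlated with the target, so most of the answer is already present in the input and the network only has to smooth it.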

2. Masked Autoencoder (MAE)

The masked autoencoder works by removing information entirely, a process called masking. Large parts of the input are completely missing, and reconstruction is performed from sparse context.
The removal is expressed with a binary mask:

x_{masked} = x \odot (1 - m), \quad m \in \{0,1\}^p

What this means

Large portions of the input are entirely removed (they are set to zero or omitted).

Unlike in a DAE, no degraded copy of the missing values remains to hint at them. The model must infer the missing regions from context alone.

The model's objective is to reconstruct missing structure using only partial observations. This forces global reasoning and structural inference, and it encourages learning long-range dependencies.
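One way to express this objective is to score the reconstruction only where the input was removed. The article does not say which positions its loss covers, so treat this as an assumption of the sketch; `masked_mse` and the 2x2 toy arrays are likewise hypothetical, and NumPy is used in place of PyTorch for brevity.

```python
import numpy as np


def masked_mse(recon, target, mask):
    """Mean squared error over removed positions only (mask == 1)."""
    removed = mask.astype(bool)
    if not removed.any():
        raise ValueError("mask removes nothing, so there is nothing to score")
    diff = recon[removed] - target[removed]
    return float(np.mean(diff ** 2))


target = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[1, 0], [0, 1]])            # top-left and bottom-right were removed
recon = np.array([[0.0, 2.5], [3.0, 2.0]])   # hypothetical model output

loss = masked_mse(recon, target, mask)
print(loss)
```

Errors at visible positions (the 0.5 miss at the top-right here) do not contribute, so the model gets no credit for copying what it can already see.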

Block Masking as the Key Middle Ground

Block masking is a structured version of MAE-style corruption where contiguous regions are removed instead of random pixels.

The block_mask function shown earlier implements exactly this: it zeroes whole patch-by-patch tiles instead of individual pixels.

Why this matters

Block masking forces the model to:

reconstruct missing regions, not pixels
infer object-level structure
rely on global context
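The loop-based block_mask shown earlier can also be written without Python loops by sampling one decision per tile and expanding it to pixel resolution. The version below is a NumPy sketch under that assumption (the article's code is PyTorch); `block_mask_grid` and its `seed` argument are hypothetical names. Like the original, it removes the same tiles for every sample and channel.

```python
import numpy as np


def block_mask_grid(x, patch=8, ratio=0.5, seed=0):
    """Zero out random patch x patch tiles of a (B, C, H, W) array.

    Returns (masked x, mask), where mask == 1 marks removed positions.
    """
    B, C, H, W = x.shape
    rng = np.random.default_rng(seed)
    # One keep/drop decision per tile (ceil division covers ragged edges).
    gh, gw = -(-H // patch), -(-W // patch)
    coarse = rng.random((gh, gw)) < ratio
    # Expand each decision to a full tile, then crop back to H x W.
    full = np.repeat(np.repeat(coarse, patch, axis=0), patch, axis=1)[:H, :W]
    mask = np.broadcast_to(full, x.shape).astype(x.dtype)
    return x * (1 - mask), mask


x = np.ones((2, 3, 32, 32), dtype=np.float32)
x_masked, mask = block_mask_grid(x, patch=8, ratio=0.5)
print("fraction removed:", float(mask.mean()))
```

Every removed pixel now sits inside a fully removed tile, so its immediate neighbours are usually missing too and local interpolation is no longer enough.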

Why MAE beats Denoising Autoencoders

Denoising Autoencoder (DAE)

DAEs corrupt inputs with noise:

x_{noisy} = x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)

The objective this rewards is essentially “remove noise but keep everything else unchanged”, so the model behaves like a sophisticated smoothing filter.

Masked Autoencoder (MAE)

MAE removes information entirely:

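x_{masked} = x \odot (1 - m), \quad m \in \{0,1\}^p

With nothing left to filter, the model has to infer the missing content from context, and this is what drives it toward true representation learning.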

Summary Table

Property	DAE	MAE
Corruption	noise	missing regions
Visibility	full structure	partial structure
Learning signal	weak	strong
Shortcut learning	easy	hard
Representation	local	global

When MAE Fails

Despite its strength, MAE is not universally optimal.

1. Over-masking collapse

If the mask ratio is too high, the model sees too little context and reconstruction becomes ambiguous. This makes the training signal noisy.

2. Low-resolution or small objects

If the object is small relative to the mask blocks, the entire object may be removed and reconstruction becomes guesswork.

This is common in lesion detection and micro-structures in CT.
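A quick back-of-the-envelope calculation shows how likely this is. Assuming each block is masked independently with probability `ratio`, as in the block_mask code above, an object that touches k blocks disappears completely with probability ratio^k. The helper names and the lesion sizes below are made up for illustration.

```python
def blocks_touched(top, left, height, width, patch):
    """Number of patch x patch blocks overlapped by an object's bounding box."""
    rows = (top + height - 1) // patch - top // patch + 1
    cols = (left + width - 1) // patch - left // patch + 1
    return rows * cols


def prob_fully_masked(n_blocks, ratio):
    """Chance that all n_blocks are masked, with independent per-block masking."""
    if n_blocks < 1 or not 0.0 <= ratio <= 1.0:
        raise ValueError("need n_blocks >= 1 and 0 <= ratio <= 1")
    return ratio ** n_blocks


# A 6x6 lesion that fits inside one 8x8 block vs. one straddling four blocks.
inside = blocks_touched(top=1, left=1, height=6, width=6, patch=8)
straddling = blocks_touched(top=5, left=5, height=6, width=6, patch=8)
p_inside = prob_fully_masked(inside, ratio=0.75)
p_straddling = prob_fully_masked(straddling, ratio=0.75)
print(inside, p_inside, straddling, p_straddling)
```

At a 0.75 mask ratio, the lesion that sits inside a single block is wiped out three times out of four, while the one that straddles four blocks vanishes far less often.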

3. Distribution shift sensitivity

MAE learns strong priors about structure.

If test data differs significantly from the training data, the learned priors can mislead reconstruction and the model may hallucinate incorrect structure.

4. Compute inefficiency (3D case)

In volumetric data, decoder cost scales with the full reconstruction volume, and memory usage becomes a bottleneck.

This is why many 3D MAE systems require:

patch-based decoding
latent-space reconstruction
or hybrid CNN-transformer designs
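The scale of the problem is easy to see by counting patches. The sketch below is plain arithmetic: the volume shape, patch sizes, and 0.75 mask ratio are illustrative assumptions, and `patch_token_counts` is a hypothetical helper.

```python
def patch_token_counts(volume_shape, patch, mask_ratio):
    """Return (total patches, visible patches) for a cubic-patch 3D volume."""
    if any(side % patch for side in volume_shape):
        raise ValueError("each side must be divisible by the patch size")
    depth, height, width = volume_shape
    total = (depth // patch) * (height // patch) * (width // patch)
    masked = int(total * mask_ratio)
    return total, total - masked


coarse_total, coarse_visible = patch_token_counts((128, 128, 128), patch=16, mask_ratio=0.75)
fine_total, fine_visible = patch_token_counts((128, 128, 128), patch=8, mask_ratio=0.75)
print(coarse_total, coarse_visible, fine_total, fine_visible)
```

Halving the patch size multiplies the patch count by eight in 3D. Since reconstruction has to cover every position, the decoder's workload follows the total count rather than the visible count.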

Summary

Autoencoders provide a simple but powerful framework for learning representations without labels. By compressing input data into a latent space and reconstructing it, they force a model to discover what information is essential and what can be discarded.

However, how we formulate the reconstruction task determines what the model learns.

Naive reconstruction leads to identity learning: the model memorizes rather than understands.
Random masking pushes the model toward local interpolation: learning textures and short-range continuity.
Block masking forces true reasoning: the model must infer missing structure from global context.

This is where the distinction between Denoising Autoencoders (DAE) and Masked Autoencoders (MAE) becomes critical:

DAE operates under corruption → information is degraded but still present
MAE operates under removal → information is absent and must be inferred

Because MAE removes large portions of the input, it creates a higher-uncertainty learning problem, which discourages shortcut solutions and encourages the emergence of semantic, structural representations.

In domains like computer vision and medical imaging, this difference is not just theoretical; it is decisive. Real-world signals (e.g., CT scans) are defined more by spatial relationships and global structure than by local pixel values. MAE aligns naturally with this requirement, making it a stronger foundation for downstream tasks.

That said, MAE is not universally perfect. It can fail when:

masking is too aggressive,
data lacks global structure,
or the decoder becomes too powerful and bypasses the encoder.

Ultimately, the key insight is:

Representation learning is not about reconstruction alone; it is about designing the right information bottleneck.

Turning Research Papers into Executable Code

Emmanuel Chima — Fri, 13 Feb 2026 22:16:36 +0000

This is a submission for the GitHub Copilot CLI Challenge

What I Built

I built MathPilot, a command-line tool designed to transform complex research papers into runnable code. As someone coming from a machine learning background, I know firsthand how intimidating it can be to implement algorithms directly from papers. Understanding the logic is one thing, but translating it into code that actually runs is often a barrier for students and researchers alike.

MathPilot aims to bridge that gap. It reads a research paper, helps scaffold the algorithm, and generates executable code that you can run immediately. For future iterations, I plan to integrate functionality that pushes the code directly to Google Colab, allowing resource-intensive algorithms to run on the cloud seamlessly. This means that even very complex ML models can be tested and iterated without worrying about local compute limitations.

Demo

Demo video
Project Link

My Experience with GitHub Copilot CLI

GitHub Copilot CLI was a game-changer for this project. It helped me:

Plan the architecture of MathPilot and break down the research-to-code pipeline.
Scaffold algorithms from complex papers quickly, giving me a runnable base to work from.
Write tests to validate that generated code works as expected.
Identify and fix bugs automatically during development.
Deploy the project to GitHub and manage commits efficiently.

While I made small tweaks to the generated algorithms for accuracy, Copilot handled the heavy lifting in generating the initial code and structuring the workflow. This made development much faster and allowed me to focus on refining the tool rather than getting stuck on boilerplate or low-level implementation details. Below are screenshots of Copilot CLI in action.

Credits to Rennagade

MathPilot started as a tool to solve a problem I personally faced: turning dense research papers into something tangible and runnable. As a student, I’ve always believed that understanding an idea truly begins when you can implement it. This project is my attempt to lower that barrier and make complex research more accessible, testable, and alive.
P.S. We hit a rate limit during the video demo, but we had to push it that way.