In the fast-evolving world of AI, domain-specific language models are unlocking new possibilities for scientific discovery. I’ve built an 882-line Python pipeline, Main_2.py, that transforms raw academic data into a clean, tokenized corpus for training models like NEXA-MOE-MINI, a 110 million parameter Mixture-of-Experts (MoE) model tailored for physics, biology, and materials science. This post dives into the stack’s architecture, its key features, and how it empowers scientific AI—from data generation to public sharing on Hugging Face. Whether you’re building LLMs or just curious about scientific AI, let’s explore how this pipeline works and how you can adapt it for your own projects.
Why This Pipeline Matters
General-purpose LLMs like GPT-4 excel at broad tasks but often falter in specialized domains where precision and context are critical. For scientific tasks like generating hypotheses or designing methodologies, you need high-quality, domain-specific datasets and models optimized for those niches. My pipeline addresses this by:
Curating a ~325M token scientific corpus from arXiv, PubMed, and FineWeb-Edu.
Distilling raw data into instruction-ready formats for tasks like hypothesis generation.
Training a sparse MoE model with domain-specialized experts.
Sharing the dataset publicly on Hugging Face for reproducibility and collaboration.
This isn’t a one-off script—it’s a reusable “research OS” for scientific AI, built with minimal resources and designed to scale.
The Stack: A Technical Deep Dive
The pipeline integrates data generation, model training, and dataset sharing into a modular, end-to-end system. Here’s how it breaks down:
- Data Generation Engine (Main_2.py)
The core is an 882-line Python script that builds a scientific corpus from academic sources. Its key components are:
Data Sources (a combined fetching sketch follows this list):
arXiv: Fetches up to 9,000 papers using the arxiv library, querying subcategories like physics* (astrophysics, quantum physics), q-bio* (biology), and cond-mat.mtrl-sci (materials science). Collects metadata: titles, abstracts, authors, publication dates, and arXiv IDs.
PubMed: Retrieves 3,000 biology abstracts via Biopython’s Entrez API, using MeSH-based queries (e.g., (methods[Title/Abstract]) AND (biology[MeSH Terms])). Returns titles, abstracts, and PMIDs.
FineWeb-Edu: Streams 30,000 samples from Hugging Face’s FineWeb-Edu dataset (sample-10BT, train split), selecting explanatory educational content.
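To make the sourcing concrete, here is a minimal fetching sketch for all three sources. The category list, query strings, and result limits are illustrative stand-ins for the fuller queries in Main_2.py:

```python
import os
import arxiv                       # pip install arxiv
from Bio import Entrez             # pip install biopython
from datasets import load_dataset  # pip install datasets

# arXiv: metadata for a few physics/biology/materials subcategories
search = arxiv.Search(
    query="cat:quant-ph OR cat:q-bio.BM OR cat:cond-mat.mtrl-sci",
    max_results=100,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)
arxiv_papers = [
    {"title": r.title, "abstract": r.summary, "arxiv_id": r.get_short_id()}
    for r in arxiv.Client().results(search)
]

# PubMed: MeSH-filtered biology abstracts via Biopython's Entrez API
Entrez.email = os.environ["ENTREZ_EMAIL"]
id_list = Entrez.read(
    Entrez.esearch(db="pubmed",
                   term="(methods[Title/Abstract]) AND (biology[MeSH Terms])",
                   retmax=100)
)["IdList"]
pubmed_text = Entrez.efetch(db="pubmed", id=id_list,
                            rettype="abstract", retmode="text").read()

# FineWeb-Edu: stream samples without downloading the whole dataset
fineweb = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                       split="train", streaming=True)
fineweb_samples = list(fineweb.take(100))
```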
Preprocessing Pipeline:
Cleaning: Normalizes text with clean_text, removing special characters, redundant whitespace, and boilerplate (e.g., acknowledgments).
Segmentation: Splits full-text into paragraphs using segment_paragraphs, preserving semantic coherence.
Tokenization: Converts text into tokens with QLoRAPreprocessor, optimized for scientific vocabulary and MoE training.
Semantic Tagging: Assigns metadata labels:
Domain Tags: [PHYS], [BIO], [MAT] for physics, biology, materials science.
Task Tags: [HYP], [MTH], [EXP] for hypothesis, methodology, experiment tasks.
Routing Tags: [GEN] for general routing, [SPEC:<subdomain>] for specialized routing (e.g., [SPEC:QuantumPhysics]).
Entropy-Based Filtering: Uses EntropyRanker to compute Shannon entropy for each sample, discarding low-information content. This distills ~500M raw tokens into ~325M clean tokens, plus ~300k instruction-format samples for hypothesis/methodology tasks (a minimal filtering sketch follows this list).
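EntropyRanker's internals aren't shown here, but the idea reduces to computing Shannon entropy over a sample's token distribution and dropping anything below a threshold. A minimal sketch, assuming simple whitespace tokenization and an illustrative cutoff:

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Entropy (bits per word) of the text's word-frequency distribution."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def filter_low_information(samples, threshold=6.0):
    """Keep samples whose full_text clears the (hand-picked) entropy threshold."""
    return [s for s in samples if shannon_entropy(s["full_text"]) >= threshold]

# Repetitive boilerplate scores low; a dense abstract scores higher.
print(shannon_entropy("we thank the reviewers we thank the funders"))
print(shannon_entropy("We derive an upper bound on entanglement entropy for rotating black holes."))
```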
Output Formats (a conversion sketch follows this list):
JSONL (~15GB): Line-delimited JSON objects, each with fields like title, abstract, full_text, domain_tag, and provenance. Ideal for debugging and analysis.
Arrow (~3.13GB): Compressed columnar format, sharded (500MB max per shard) for ML frameworks like Hugging Face Datasets.
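A short sketch of producing both formats with the Hugging Face datasets library, using the field names from the example record further below; the 500MB shard cap mirrors the figure above:

```python
import json
from datasets import Dataset

records = [{
    "title": "Quantum Entanglement in Black Holes",
    "abstract": "We explore quantum entanglement properties...",
    "domain_tag": "[PHYS]",
    "task_tag": "[HYP]",
}]

# JSONL: one JSON object per line, easy to grep and debug
with open("corpus_sample.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Arrow: load the JSONL, then save as a sharded columnar dataset
ds = Dataset.from_json("corpus_sample.jsonl")
ds.save_to_disk("corpus_arrow", max_shard_size="500MB")
```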
Efficiency (sketched after this list):
Processes data in chunks (default: 1,000 samples) to manage memory.
Parallelizes filtering with concurrent.futures.ThreadPoolExecutor (8 workers default).
Saves checkpoints (e.g., arxiv_papers.jsonl) for fault tolerance.
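A rough sketch of the chunk-and-parallelize pattern with checkpointing; the helper names are illustrative, not the actual functions in Main_2.py:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CHUNK_SIZE = 1_000   # samples per chunk (default in the text)
MAX_WORKERS = 8      # filtering threads (default in the text)

def chunked(items, size):
    """Yield fixed-size chunks so memory usage stays bounded."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def process_corpus(samples, keep_fn, checkpoint_path="arxiv_papers.jsonl"):
    """Filter each chunk in parallel and append survivors to a checkpoint file."""
    with Path(checkpoint_path).open("a", encoding="utf-8") as ckpt, \
         ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        for chunk in chunked(samples, CHUNK_SIZE):
            for sample, keep in zip(chunk, pool.map(keep_fn, chunk)):
                if keep:
                    ckpt.write(json.dumps(sample) + "\n")
            # A crash now loses at most the current chunk; reruns resume from the file.
```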
Scalability:
Modular design supports new sources (e.g., Semantic Scholar) and larger corpora (up to 650M tokens for future models like ULTRAMAX). Configurable via CorpusConfig (e.g., max_arxiv_papers=9000, max_workers=8); a possible shape for this config is sketched below.
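CorpusConfig itself isn't listed in the post, so here is one plausible shape for it; only max_arxiv_papers and max_workers come from the text, the remaining fields are assumptions:

```python
from dataclasses import dataclass

@dataclass
class CorpusConfig:
    """Illustrative config; fields beyond max_arxiv_papers/max_workers are guesses."""
    max_arxiv_papers: int = 9_000
    max_pubmed_abstracts: int = 3_000   # assumed field name
    max_fineweb_samples: int = 30_000   # assumed field name
    chunk_size: int = 1_000             # assumed field name
    max_workers: int = 8
    output_path: str = "scientific_corpus_325M.jsonl"  # assumed field name

config = CorpusConfig(max_arxiv_papers=9000, max_workers=8)
```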
Example Output (JSONL Line):
```json
{
  "title": "Quantum Entanglement in Black Holes",
  "abstract": "We explore quantum entanglement properties...",
  "domain_tag": "[PHYS]",
  "section_tag": "[ABSTRACT]",
  "task_tag": "[HYP]",
  "routing_tag": "[SPEC:QuantumPhysics]",
  "provenance": {"arxiv_id": "2305.12345"}
}
```
- Automated Upload Tooling (hf_upload.py)
The uploader script shares the corpus on Hugging Face, ensuring accessibility and reproducibility:
Compression:
Converts JSONL to Arrow using datasets.Dataset.from_json and save_to_disk, reducing size from ~15GB to ~3.13GB.
Large File Handling:
Splits files larger than 10MB into ~10MB chunks for Git LFS compatibility (see the splitting sketch after this list).
Tracks files with git lfs track "*.jsonl" "*.arrow".
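A minimal sketch of the splitting step; the .partNNNN naming is an assumption, hf_upload.py may name its chunks differently:

```python
from pathlib import Path

CHUNK_BYTES = 10 * 1024 * 1024  # ~10MB per part

def split_file(path: str) -> list[Path]:
    """Split a large file into numbered ~10MB parts and return their paths."""
    src = Path(path)
    parts = []
    with src.open("rb") as f:
        for index, chunk in enumerate(iter(lambda: f.read(CHUNK_BYTES), b"")):
            part = src.with_suffix(src.suffix + f".part{index:04d}")
            part.write_bytes(chunk)
            parts.append(part)
    return parts

parts = split_file("scientific_corpus_325M.jsonl")
print(f"Wrote {len(parts)} parts")
```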
Dynamic README:
Generates a README.md with metadata (sources, token count, formats), ensuring Hugging Face compliance.
Hugging Face Integration (see the upload sketch below):
Uses huggingface_hub.HfApi and Repository to manage Allanatrix/Scientific_Research_Tokenized.
Implements retries (max: 3, 30s backoff) for network failures seen on a 1 Gbit/s Ethernet connection.
Supports resumable uploads via Git LFS.
Commits with versioned messages (e.g., Upload dataset 2025-06-06T15:41:00).
Error Handling:
Validates tokens with HfApi.whoami.
Catches HTTPError, URLError, OSError, and cleans up temporary chunks.
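Putting the integration and error handling together: the script itself drives a Git LFS Repository, but the same flow can be sketched with HfApi alone. The retry count and backoff mirror the figures above; everything else is illustrative:

```python
import time
from huggingface_hub import HfApi
from requests import HTTPError

REPO_ID = "Allanatrix/Scientific_Research_Tokenized"
MAX_RETRIES = 3
BACKOFF_SECONDS = 30

def upload_dataset(folder: str, token: str) -> None:
    api = HfApi(token=token)
    api.whoami()  # fail fast on an invalid token
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            api.upload_folder(folder_path=folder, repo_id=REPO_ID,
                              repo_type="dataset",
                              commit_message=f"Upload dataset (attempt {attempt})")
            return
        except (HTTPError, OSError) as err:
            if attempt == MAX_RETRIES:
                raise
            print(f"Upload failed ({err}); retrying in {BACKOFF_SECONDS}s")
            time.sleep(BACKOFF_SECONDS)
```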
Future Plans:
Offload uploads to a cloud-based backend to bypass local network constraints.
Example README Snippet:
```markdown
# Scientific Research Tokenized

This dataset contains ~325M tokens (~300k samples) for scientific ML tasks.

- Sources: arXiv, PubMed, FineWeb-Edu
- Formats: JSONL, Arrow
```
- NEXA-MOE-MINI Training
The pipeline trains NEXA-MOE-MINI, a 110M parameter MoE model, as the first in a family of scientific LLMs:
Architecture:
Four experts: a BERT-based router (~110M parameters, shared layers) and three T5-based specialists (~60M each) for biology, physics, and materials science.
Soft routing with top-k selection (k=1), driven by semantic tags (see the routing sketch after this list).
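In code, top-1 selection is just a softmax over per-expert router scores followed by topk(k=1); in this pipeline the semantic tags supply the routing signal. A toy sketch, not the actual NEXA-MOE-MINI router:

```python
import torch

def top1_route(router_logits: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Soft routing with hard top-1 selection: softmax over experts, keep the best."""
    probs = torch.softmax(router_logits, dim=-1)   # soft routing weights
    weight, expert_idx = probs.topk(k=1, dim=-1)   # top-k with k=1
    return expert_idx.squeeze(-1), weight.squeeze(-1)

# Scores for the [physics, biology, materials] experts for one sample
logits = torch.tensor([[2.1, 0.3, -1.0]])
idx, w = top1_route(logits)
print(idx.item(), round(w.item(), 3))  # 0 (physics expert) and its routing weight
```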
Training:
Fine-tunes with QLoRA (4-bit/8-bit quantization, adapter layers) on ~325M tokens (~300k instructions); a configuration sketch appears at the end of this section.
Hardware: Intel i5 vPro (1.9–6.0 GHz, 16GB RAM), dual NVIDIA T4 GPUs (16GB VRAM).
Optimizations: Mixed precision (FP16/BF16), gradient checkpointing, torch.distributed for tensor parallelism.
Optimizers: Adam (Optuna-tuned), transitioning to AzureSky Optimizer (Stochastic Approximation + Adam hybrid) with RL fine-tuning.
Stages:
Easy: Basic STEM problems (e.g., physics equations).
Moderate: Complex tasks (e.g., astrophysics simulations).
Hard: Multi-step reasoning (e.g., CFD + alloy modeling).
Metrics: ~21 GFLOPS at ~60% utilization on two T4 GPUs.
Output: Weights in .pt or .onnx, versioned for traceability.
Tasks: Hypothesis generation, methodology design, literature summarization.
Role: Distiller for reasoning/retrieval, bootstrap for larger models.
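For the QLoRA setup mentioned in the Training list, here is a configuration sketch with transformers, peft, and bitsandbytes. The base checkpoint, LoRA rank, and target modules are illustrative, not the exact NEXA-MOE-MINI settings:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization keeps the T5-sized experts within T4 VRAM budgets.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-small",          # placeholder base checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# Low-rank adapters on the attention projections; only these weights train.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q", "v"], task_type="SEQ_2_SEQ_LM")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```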
- Full Pipeline: A Research OS
The stack integrates:
Data Generation: Modular, extensible to new sources.
Training Infra: Plug-and-play expert swapping, dynamic routing.
Sharing: Public datasets/models on Hugging Face.
Compute:
CPU: Intel i5 vPro for preprocessing.
GPU: Dual T4s for training/inference.
Software: PyTorch, Hugging Face Transformers, Biopython, arxiv.
Metrics:
Processes ~500M tokens in ~10–12 hours.
Trains 110M parameters in ~40 hours (Kaggle GPU).
Uploads ~3.13GB in ~1–2 hours.
Roadmap:
NEXA-COD: Chain-of-thought model, ~425–500M tokens.
SCOUT: Exploratory reasoning for novel hypotheses.
ULTRAMAX: 2.2B parameters, 20,000-token context, ~600–650M tokens.
Key Features in Action
Modularity: Add new sources (e.g., OpenAlex) by updating CorpusConfig and queries.
Resilience:
Retries API failures with exponential backoff.
Saves checkpoints to recover from crashes.
Handles interrupts (SIGINT/SIGTERM) gracefully, as sketched below.
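A sketch of the resilience pattern: exponential backoff around flaky API calls, plus a signal handler that records a checkpoint before exiting. Function names are illustrative:

```python
import signal
import sys
import time

def with_backoff(fn, retries=3, base_delay=2.0):
    """Retry fn with exponential backoff (2s, 4s, 8s, ...) between attempts."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception as err:
            if attempt == retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({err}); sleeping {delay:.0f}s")
            time.sleep(delay)

def handle_interrupt(signum, frame):
    """On SIGINT/SIGTERM, flush the current checkpoint and exit cleanly."""
    print("Interrupt received; checkpoint flushed, exiting.")
    sys.exit(0)

signal.signal(signal.SIGINT, handle_interrupt)
signal.signal(signal.SIGTERM, handle_interrupt)
```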
Efficiency:
Chunks data (1,000 samples) to manage memory.
Parallelizes filtering with 8 workers.
Quality:
Filters low-value content with EntropyRanker.
Tags samples for precise MoE routing.
Example Workflow
Generate the Corpus:
```bash
export ENTREZ_EMAIL="your.email@example.com"
python Main_2.py
```
Output: scientific_corpus_325M.jsonl (~15GB).
Upload to Hugging Face:
```bash
python hf_upload.py
```
Enter your Hugging Face token, and the script compresses to Arrow (~3.13GB), splits large files, and uploads to Allanatrix/Scientific_Research_Tokenized.
Train NEXA-MOE-MINI:
Use the dataset to fine-tune the MoE model with QLoRA:
```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="nexa_moe_mini", fp16=True)  # illustrative settings
trainer = Trainer(model=moe_model, args=training_args, train_dataset=dataset)
trainer.train()
```
Share Results:
Publish model weights and dataset on Hugging Face.
Sample Report:
```
============================================================
SCIENTIFIC CORPUS BUILD REPORT
============================================================
SOURCE METRICS:
----------------------------------------
ARXIV       : 9000 papers  | 2 errors | 120.50s
PUBMED      : 3000 papers  | 1 errors | 80.30s
FINEWEB_EDU : 15000 papers | 3 errors | 200.75s
OVERALL METRICS:
----------------------------------------
Total Papers: 27,000
Total Tokens: 324,500,000
Total Time: 401.55s
Success Rate: 99.98%
============================================================
```
Challenges and Solutions
Git LFS Bottlenecks: Uploading ~3.13GB over a 1 Gbit/s Ethernet connection kept hitting errors. Solution: Split files into ~10MB chunks and retry with backoff. Future: Cloud-based backend.
Data Quality: Tuning EntropyRanker thresholds balanced precision/recall for high-signal data.
Compute Limits: Training on modest hardware (Intel i5, T4 GPUs) required 4-bit quantization, mixed precision, and gradient checkpointing.
Why It’s Exciting
This pipeline unlocks:
Rapid Prototyping: Build datasets for any scientific domain in hours.
Specialized Models: Train MoEs for niche tasks like hypothesis generation.
Community Impact: Share high-quality datasets/models publicly.
Scalability: Ready for billion-parameter models and massive corpora.
It’s a foundation for accelerating scientific discovery with AI, built with no institutional support.
Get Involved: https://github.com/DarkStarStrix/DataVolt/blob/master/Tokenization/Main_2.py