<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Muhammad Zubair Bin Akbar</title>
    <description>The latest articles on Forem by Muhammad Zubair Bin Akbar (@zubairakbar).</description>
    <link>https://forem.com/zubairakbar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3874077%2F58ef1a6a-88ea-4af9-925d-f9a18ea98939.jpeg</url>
      <title>Forem: Muhammad Zubair Bin Akbar</title>
      <link>https://forem.com/zubairakbar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/zubairakbar"/>
    <language>en</language>
    <item>
      <title>Writing Your First Slurm Job Script</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Tue, 14 Apr 2026 11:33:31 +0000</pubDate>
      <link>https://forem.com/zubairakbar/writing-your-first-slurm-job-script-8</link>
      <guid>https://forem.com/zubairakbar/writing-your-first-slurm-job-script-8</guid>
      <description>&lt;p&gt;If you are new to High Performance Computing, one of the first things you will do is submit a job using Slurm.&lt;/p&gt;

&lt;p&gt;At first, it can feel confusing. But once you understand the basics, it becomes very straightforward.&lt;/p&gt;

&lt;p&gt;Let’s walk through how to write your first Slurm job script.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Slurm Job Script
&lt;/h2&gt;

&lt;p&gt;A Slurm job script is just a simple shell script that tells the scheduler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What resources you need&lt;/li&gt;
&lt;li&gt;How long your job will run&lt;/li&gt;
&lt;li&gt;What command should be executed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of running your program directly, you submit this script to Slurm, and it handles everything for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic Structure of a Job Script
&lt;/h2&gt;

&lt;p&gt;A typical Slurm script looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --job-name=test_job&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --output=output.log&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --error=error.log&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --time=00:10:00&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --ntasks=1&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --cpus-per-task=1&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --mem=1G&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Hello from Slurm!"&lt;/span&gt;
&lt;span class="nb"&gt;hostname&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s break this down.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the SBATCH Directives
&lt;/h2&gt;

&lt;p&gt;Lines starting with &lt;code&gt;#SBATCH&lt;/code&gt; are instructions for Slurm.&lt;/p&gt;

&lt;h3&gt;
  
  
  Job Name
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#SBATCH --job-name=test_job&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is just a label to identify your job.&lt;/p&gt;

&lt;h3&gt;
  
  
  Output and Error Files
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#SBATCH --output=output.log&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --error=error.log&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;output.log → stores normal output&lt;/li&gt;
&lt;li&gt;error.log → stores errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Very useful for debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time Limit
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#SBATCH --time=00:10:00&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means your job can run for 10 minutes max.&lt;/p&gt;

&lt;p&gt;If it exceeds this, Slurm will stop it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tasks and CPUs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#SBATCH --ntasks=1&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --cpus-per-task=1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;ntasks → number of processes&lt;/li&gt;
&lt;li&gt;cpus-per-task → CPU cores per process&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For simple jobs, keep both set to 1.&lt;/p&gt;
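&lt;p&gt;For a multi-threaded program (for example one using OpenMP), you would keep ntasks at 1 but raise cpus-per-task. A minimal sketch, assuming a program that reads OMP_NUM_THREADS; the job name and limits here are made up:&lt;/p&gt;

```shell
#!/bin/bash
#SBATCH --job-name=omp_job
#SBATCH --ntasks=1            # one process
#SBATCH --cpus-per-task=4     # four cores for that process
#SBATCH --mem=4G
#SBATCH --time=00:30:00

# Match the thread count to what Slurm allocated
# (falls back to 4 when run outside Slurm, where the variable is unset)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-4}
echo "Running with $OMP_NUM_THREADS threads"
```

&lt;p&gt;Run directly with bash, the SBATCH lines are plain comments, so you can sanity-check the script before submitting it.&lt;/p&gt;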

&lt;h3&gt;
  
  
  Memory
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#SBATCH --mem=1G&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This requests 1 GB of RAM.&lt;/p&gt;

&lt;p&gt;If your job needs more and you don’t request it, it may fail.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Goes Inside the Script
&lt;/h3&gt;

&lt;p&gt;After the SBATCH lines, you add the commands you want to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Hello from Slurm!"&lt;/span&gt;
&lt;span class="nb"&gt;hostname&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In real use cases, this could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running a Python script&lt;/li&gt;
&lt;li&gt;Executing a simulation&lt;/li&gt;
&lt;li&gt;Launching an MPI job&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python my_script.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Submitting the Job
&lt;/h2&gt;

&lt;p&gt;Once your script is ready, save it as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;job.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then submit it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sbatch job.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will get a job ID like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Submitted batch job 12345&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Checking Job Status
&lt;/h2&gt;

&lt;p&gt;To see if your job is running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;squeue &lt;span class="nt"&gt;-u&lt;/span&gt; your_username
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get more details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scontrol show job 12345
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Viewing Output
&lt;/h2&gt;

&lt;p&gt;After the job finishes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;output.log
&lt;span class="nb"&gt;cat &lt;/span&gt;error.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where you check results or debug issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Beginner Mistakes
&lt;/h2&gt;

&lt;p&gt;A few things that often go wrong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requesting too little memory → job fails&lt;/li&gt;
&lt;li&gt;Setting very short time limits → job gets killed&lt;/li&gt;
&lt;li&gt;Running heavy jobs on login node instead of using Slurm&lt;/li&gt;
&lt;li&gt;Forgetting to check error logs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Writing your first Slurm job script might seem small, but it is the foundation of everything you do in HPC.&lt;/p&gt;

&lt;p&gt;Once you understand this, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run bigger workloads&lt;/li&gt;
&lt;li&gt;Scale across multiple nodes&lt;/li&gt;
&lt;li&gt;Work with GPUs and parallel jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start simple, test small, and build from there.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>beginners</category>
    </item>
    <item>
      <title>“It works on my machine” is not a success metric.

Different OS, env vars, or dependencies = different behavior.

Use containers, lock dependencies, automate setup.

Don’t aim for working locally.
Aim for working everywhere. 🚀</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Mon, 13 Apr 2026 18:55:36 +0000</pubDate>
      <link>https://forem.com/zubairakbar/it-works-on-my-machine-is-not-a-success-metric-different-os-env-vars-or-dependencies--g6o</link>
      <guid>https://forem.com/zubairakbar/it-works-on-my-machine-is-not-a-success-metric-different-os-env-vars-or-dependencies--g6o</guid>
      <description></description>
    </item>
    <item>
      <title>AI Generated Code Joins the Linux Kernel, But Humans Stay in Charge</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Mon, 13 Apr 2026 12:07:42 +0000</pubDate>
      <link>https://forem.com/zubairakbar/ai-generated-code-joins-the-linux-kernel-but-humans-stay-in-charge-568o</link>
      <guid>https://forem.com/zubairakbar/ai-generated-code-joins-the-linux-kernel-but-humans-stay-in-charge-568o</guid>
      <description>&lt;p&gt;The discussion around AI generated code has been building for a while. Now the Linux kernel community has finally made its stance clear.&lt;/p&gt;

&lt;p&gt;With Linux 7.0, a new guideline explains how AI tools can be used when contributing to the kernel. It is not a green light for everything, but it is definitely not a ban either.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Changed?
&lt;/h2&gt;

&lt;p&gt;The kernel now includes a document called AI Coding Assistants in its contribution process.&lt;/p&gt;

&lt;p&gt;The idea behind it is straightforward. AI can help, but a human is always responsible.&lt;/p&gt;

&lt;p&gt;In practice, this means AI assisted code is allowed, but fully machine generated submissions are not accepted. Every contribution must still follow licensing rules, and someone must stand behind the code.&lt;/p&gt;

&lt;p&gt;So yes, AI can write parts of the code, but it cannot take responsibility for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  A New Way to Credit AI
&lt;/h2&gt;

&lt;p&gt;One interesting addition is a new tag called &lt;code&gt;Assisted-by&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you used an AI tool while preparing a patch, you are expected to mention it. This keeps things transparent without making the process complicated.&lt;/p&gt;

&lt;p&gt;Earlier, some maintainers felt this could just be mentioned in the changelog, but the community has now agreed on using a proper tag.&lt;/p&gt;
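&lt;p&gt;To make the idea concrete, a trailer block on a patch might look like the following. This is a hedged sketch: the subject line, tool name, and author are invented, and the exact expected format is whatever the kernel documentation specifies:&lt;/p&gt;

```shell
# A hypothetical commit message carrying the new trailer (all names invented)
msg='mm: clarify comment in example_function

Assisted-by: ExampleAI code assistant
Signed-off-by: Jane Developer &lt;jane@example.com&gt;'
echo "$msg"
```

&lt;p&gt;The tag sits alongside the usual Signed-off-by line, so tooling that already parses trailers can pick it up without any new machinery.&lt;/p&gt;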

&lt;h2&gt;
  
  
  Responsibility Still Matters
&lt;/h2&gt;

&lt;p&gt;The Developer Certificate of Origin is still as important as ever.&lt;/p&gt;

&lt;p&gt;When you submit code, you are confirming that it follows the rules and that you take full responsibility for it. That does not change just because AI was involved.&lt;/p&gt;

&lt;p&gt;Even if most of the code came from a tool, the final responsibility still belongs to the person submitting it.&lt;/p&gt;

&lt;h2&gt;
  
  
  This Is Already Happening
&lt;/h2&gt;

&lt;p&gt;This is not just a theoretical policy. It is already being used in real workflows.&lt;/p&gt;

&lt;p&gt;Maintainers have been using AI tools to detect bugs and analyze code. These tools can highlight potential issues, but humans still review everything before any fix is accepted.&lt;/p&gt;

&lt;p&gt;That balance is exactly what this policy is trying to achieve. AI helps with discovery, while humans handle validation and decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Different Projects, Different Choices
&lt;/h2&gt;

&lt;p&gt;Not every open source project is taking the same approach.&lt;/p&gt;

&lt;p&gt;Some projects have completely blocked AI generated contributions due to concerns around licensing, quality, and ethics. Others treat such code with extra caution and require special approvals.&lt;/p&gt;

&lt;p&gt;Linux is taking a more practical path by allowing AI usage but putting strict responsibility on developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is This the Right Direction?
&lt;/h2&gt;

&lt;p&gt;This approach feels realistic.&lt;/p&gt;

&lt;p&gt;AI tools are already part of everyday development, and ignoring them is not really an option. Instead of resisting, the Linux community is focusing on transparency and accountability.&lt;/p&gt;

&lt;p&gt;The real challenge is not the tools themselves. It is whether developers take the responsibility seriously.&lt;/p&gt;

&lt;p&gt;In the end, the quality of the kernel will always depend on the people reviewing and maintaining it, not the tools they use.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
    <item>
      <title>What Really Powers HPC Clusters: A Look at the Hardware Behind the Network</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Sun, 12 Apr 2026 12:58:55 +0000</pubDate>
      <link>https://forem.com/zubairakbar/what-really-powers-hpc-clusters-a-look-at-the-hardware-behind-the-network-1hfc</link>
      <guid>https://forem.com/zubairakbar/what-really-powers-hpc-clusters-a-look-at-the-hardware-behind-the-network-1hfc</guid>
      <description>&lt;p&gt;When people talk about High Performance Computing, the conversation usually goes straight to software. You hear about MPI, job schedulers, or parallel algorithms. But honestly, none of that matters if the hardware underneath is not built properly.&lt;/p&gt;

&lt;p&gt;The real backbone of any HPC cluster is its network. That is what decides whether your jobs finish in minutes or take forever.&lt;/p&gt;

&lt;p&gt;Let's walk through what actually makes HPC networking so powerful.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an HPC Cluster Actually Is
&lt;/h2&gt;

&lt;p&gt;At a basic level, an HPC cluster is just a group of computers working together. These computers are called nodes, and each one has its own CPU, memory, and sometimes GPUs.&lt;/p&gt;

&lt;p&gt;But here is the important part. These nodes are not useful on their own in this setup. They need to communicate constantly.&lt;/p&gt;

&lt;p&gt;That communication layer is the network, and that is where things get interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Network Matters So Much
&lt;/h2&gt;

&lt;p&gt;In normal systems, the network is just there to move files or handle requests. In HPC, the network is part of the computation itself.&lt;/p&gt;

&lt;p&gt;Nodes exchange data continuously. If that exchange is slow, the entire system slows down.&lt;/p&gt;

&lt;p&gt;So the network needs to be built for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very low latency&lt;/li&gt;
&lt;li&gt;Very high bandwidth&lt;/li&gt;
&lt;li&gt;Stable and predictable performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even tiny delays can create serious bottlenecks when thousands of processes are involved.&lt;/p&gt;

&lt;h2&gt;
  
  
  InfiniBand and Why It Is So Popular
&lt;/h2&gt;

&lt;p&gt;If you look at most serious HPC systems, you will see InfiniBand being used.&lt;/p&gt;

&lt;p&gt;The reason is simple. It is extremely fast and very efficient.&lt;/p&gt;

&lt;p&gt;InfiniBand allows something called RDMA, which lets one machine access the memory of another machine directly. The CPU does not have to get involved much, which saves time and reduces overhead.&lt;/p&gt;

&lt;p&gt;This is especially useful for workloads where processes need to constantly exchange small pieces of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ethernet Is Catching Up
&lt;/h2&gt;

&lt;p&gt;Ethernet is also used in HPC, especially with newer technologies.&lt;/p&gt;

&lt;p&gt;With things like RDMA over Converged Ethernet (RoCE), Ethernet can now deliver very high performance as well.&lt;/p&gt;

&lt;p&gt;It is often easier to integrate and sometimes more cost effective. But it needs careful setup. If the network is not tuned correctly, performance can drop quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Network Cards Are Smarter Than You Think
&lt;/h2&gt;

&lt;p&gt;In a typical computer, a network card just sends and receives data. In HPC, it does much more than that.&lt;/p&gt;

&lt;p&gt;Modern network interface cards can handle communication tasks on their own. They can manage RDMA operations and reduce the load on the CPU.&lt;/p&gt;

&lt;p&gt;Some can even work directly with GPUs, which helps in AI and simulation workloads.&lt;/p&gt;

&lt;p&gt;So these cards are not just hardware components. They actively improve performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Nodes Are Connected Matters
&lt;/h2&gt;

&lt;p&gt;The way nodes are connected to each other plays a huge role in performance.&lt;/p&gt;

&lt;p&gt;There are a few common designs.&lt;/p&gt;

&lt;p&gt;Fat tree is widely used because it is reliable and scales well, though it can be expensive.&lt;/p&gt;

&lt;p&gt;Mesh or torus layouts connect nodes in a grid pattern. These are more cost friendly but can introduce delays when data has to travel far.&lt;/p&gt;

&lt;p&gt;Dragonfly is a more modern approach that tries to reduce the number of steps data has to take between nodes.&lt;/p&gt;

&lt;p&gt;Each design has its own tradeoffs, and the right choice depends on the workload and budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency Versus Bandwidth
&lt;/h2&gt;

&lt;p&gt;Two terms you will hear a lot are latency and bandwidth.&lt;/p&gt;

&lt;p&gt;Latency is how quickly a message starts arriving.&lt;/p&gt;

&lt;p&gt;Bandwidth is how much data can be transferred over time.&lt;/p&gt;

&lt;p&gt;In many HPC applications, latency is actually more important. Small delays repeated thousands of times can slow everything down.&lt;/p&gt;
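&lt;p&gt;The relationship can be approximated as total time = latency + message size / bandwidth. A quick back-of-the-envelope calculation; the numbers below are illustrative assumptions, not measurements from any real fabric:&lt;/p&gt;

```shell
# Estimated time to deliver one message: latency + size / bandwidth
latency_us=2          # per-message latency in microseconds (assumed)
bandwidth_gbps=100    # link bandwidth in gigabits per second (assumed)
size_bytes=4096       # message size in bytes

# (bytes * 8) bits divided by (Gbit/s * 1000) bits-per-microsecond
total_us=$(awk -v l="$latency_us" -v bw="$bandwidth_gbps" -v s="$size_bytes" \
  'BEGIN { printf "%.2f", l + (s * 8) / (bw * 1000) }')
echo "estimated total: ${total_us} us"
```

&lt;p&gt;For a 4 KB message the wire transfer takes only about 0.33 microseconds, so the fixed 2 microsecond latency dominates. That is why small-message workloads care about latency first.&lt;/p&gt;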

&lt;h2&gt;
  
  
  Switches Do More Than You Expect
&lt;/h2&gt;

&lt;p&gt;Switches in HPC are built differently from the ones used in regular networks.&lt;/p&gt;

&lt;p&gt;They are designed to move data as quickly as possible with very little delay. Some of them can start forwarding data before the full message is even received.&lt;/p&gt;

&lt;p&gt;They also support a large number of high speed connections. In big clusters, the way switches are arranged can affect congestion and overall performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Physical Setup Still Matters
&lt;/h2&gt;

&lt;p&gt;It is easy to focus only on performance numbers, but the physical side of things is just as important.&lt;/p&gt;

&lt;p&gt;HPC hardware generates a lot of heat, so cooling becomes critical.&lt;/p&gt;

&lt;p&gt;Cable management also plays a role, especially in large clusters. Poor layout can make maintenance difficult and even affect airflow.&lt;/p&gt;

&lt;p&gt;Everything from rack design to airflow direction can impact how well the system runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Future Looks Like
&lt;/h2&gt;

&lt;p&gt;HPC networking is still evolving.&lt;/p&gt;

&lt;p&gt;New technologies are pushing more intelligence into the network itself. Devices are becoming better at handling communication without involving the CPU.&lt;/p&gt;

&lt;p&gt;There is also a lot of work being done to reduce power usage and improve efficiency.&lt;/p&gt;

&lt;p&gt;Technologies that connect memory and networking more closely are also starting to appear, which could change how clusters are designed in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;It is easy to assume that faster processors automatically mean better performance. But in HPC, that is not the full picture.&lt;/p&gt;

&lt;p&gt;If the network is slow, even the best processors will spend time waiting.&lt;/p&gt;

&lt;p&gt;A well designed network allows everything to work together smoothly. That is what unlocks the real power of parallel computing.&lt;/p&gt;

&lt;p&gt;So next time you think about HPC performance, do not just look at compute power. Look at how the system is connected.&lt;/p&gt;

&lt;p&gt;That is where the real difference is made.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>hpc</category>
      <category>networking</category>
    </item>
    <item>
      <title>Understanding the Hardware Behind an HPC Cluster</title>
      <dc:creator>Muhammad Zubair Bin Akbar</dc:creator>
      <pubDate>Sat, 11 Apr 2026 21:07:31 +0000</pubDate>
      <link>https://forem.com/zubairakbar/understanding-the-hardware-behind-an-hpc-cluster-3l4i</link>
      <guid>https://forem.com/zubairakbar/understanding-the-hardware-behind-an-hpc-cluster-3l4i</guid>
      <description>&lt;p&gt;High Performance Computing often sounds complex, but once you break it down, it is really a collection of specialized machines working together as one powerful system. Each component has a clear role, and understanding them makes everything from troubleshooting to optimization much easier.&lt;/p&gt;

&lt;p&gt;Let us walk through the key hardware components you will find in a typical HPC cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Head Node
&lt;/h2&gt;

&lt;p&gt;The head node is the brain of the cluster. It is responsible for managing everything behind the scenes.&lt;/p&gt;

&lt;p&gt;This is where the scheduler runs, user jobs are coordinated, and cluster level services are controlled. Tools like Slurm usually live here, deciding which job runs where and when.&lt;/p&gt;

&lt;p&gt;Users usually do not run heavy workloads on this node. Instead, it acts as the control center that keeps the entire cluster organized.&lt;/p&gt;

&lt;h2&gt;
  
  
  Login Node
&lt;/h2&gt;

&lt;p&gt;The login node is the front door to the cluster.&lt;/p&gt;

&lt;p&gt;This is where users connect using SSH, write job scripts, compile code, and prepare their workloads. It is designed to handle multiple users at the same time, but not heavy computations.&lt;/p&gt;

&lt;p&gt;Think of it as a workspace rather than a workhorse. Running large jobs here can impact other users, so it is best to use it only for preparation tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compute Nodes
&lt;/h2&gt;

&lt;p&gt;Compute nodes are where the real work happens, whether on CPUs or GPUs.&lt;/p&gt;

&lt;p&gt;These nodes execute the jobs submitted by users. They are optimized for performance and usually come with powerful CPUs, large memory, and sometimes GPUs.&lt;/p&gt;

&lt;p&gt;When you submit a job through the scheduler, it gets assigned to one or more compute nodes depending on the requirements. These nodes work either independently or together for parallel workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory Optimized Nodes
&lt;/h2&gt;

&lt;p&gt;Some workloads need more memory than standard compute nodes can provide.&lt;/p&gt;

&lt;p&gt;That is where memory optimized nodes come in. These machines are built with a much higher RAM capacity, making them ideal for simulations, large datasets, and in memory processing tasks.&lt;/p&gt;

&lt;p&gt;They are especially useful in fields like computational biology, weather modeling, and large scale data analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU Nodes
&lt;/h2&gt;

&lt;p&gt;GPU nodes are designed for workloads that need massive parallel processing.&lt;/p&gt;

&lt;p&gt;Unlike CPUs, which handle tasks sequentially with a few powerful cores, GPUs have thousands of smaller cores that can process many operations at the same time. This makes them ideal for specific types of workloads.&lt;/p&gt;

&lt;p&gt;You will typically use GPU nodes for machine learning, deep learning, scientific simulations, and rendering tasks. Frameworks like PyTorch or TensorFlow rely heavily on GPUs to speed up training and computation.&lt;/p&gt;

&lt;p&gt;In a cluster, GPU nodes are usually limited and shared resources, so jobs requesting GPUs are scheduled carefully. Users specify how many GPUs they need, and the scheduler assigns the job to a node with available GPU resources.&lt;/p&gt;

&lt;p&gt;These nodes often also come with high memory and fast interconnects to keep up with the data demands of GPU workloads.&lt;/p&gt;
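&lt;p&gt;With Slurm, requesting GPUs usually happens through a generic resource (GRES) request. A minimal sketch; the job name, counts, and limits are assumptions, and the exact GRES syntax depends on how your cluster is configured:&lt;/p&gt;

```shell
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --gres=gpu:2          # ask for two GPUs on one node
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00

# Inside a real allocation Slurm exports SLURM_GPUS_ON_NODE;
# outside Slurm this falls back to the placeholder text.
gpus="${SLURM_GPUS_ON_NODE:-none allocated}"
echo "GPUs on node: ${gpus}"
# Site-specific steps (loading a CUDA module, launching the
# training or simulation program) would go here.
```

&lt;p&gt;The scheduler then places the job on a node with enough free GPUs, which is exactly the careful sharing described above.&lt;/p&gt;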

&lt;h2&gt;
  
  
  Storage Systems
&lt;/h2&gt;

&lt;p&gt;Storage is a critical part of any HPC cluster. It is not just about saving files, it is about moving data quickly and efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parallel File Systems
&lt;/h3&gt;

&lt;p&gt;A parallel file system allows multiple compute nodes to read and write data at the same time. This is essential for high performance workloads.&lt;/p&gt;

&lt;p&gt;BeeGFS is a popular example. It distributes data across multiple storage servers, allowing high throughput and scalability. This means jobs do not get stuck waiting for data access.&lt;/p&gt;

&lt;p&gt;Other systems like Lustre and GPFS follow similar ideas, focusing on speed, reliability, and scalability.&lt;/p&gt;

&lt;h2&gt;
  
  
  High Speed Network
&lt;/h2&gt;

&lt;p&gt;All these components are connected through a high speed network.&lt;/p&gt;

&lt;p&gt;Technologies like InfiniBand or Omni Path ensure low latency and high bandwidth communication between nodes. This is especially important for tightly coupled parallel applications where nodes need to exchange data frequently.&lt;/p&gt;

&lt;p&gt;Without a fast network, even the best compute nodes would struggle to perform efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;An HPC cluster is not just a collection of powerful machines. It is a carefully designed system where each component plays a specific role.&lt;/p&gt;

&lt;p&gt;The head node manages, the login node prepares, the compute nodes execute, memory nodes handle heavy data loads, and the storage system ensures fast data access. All of this is tied together with a high speed network.&lt;/p&gt;

&lt;p&gt;Once you understand this structure, working with HPC systems becomes far less intimidating and much more logical.&lt;/p&gt;

</description>
      <category>highperformancecomputing</category>
      <category>scientificcomputing</category>
      <category>slurm</category>
      <category>parallelcomputing</category>
    </item>
  </channel>
</rss>
