Forem: ufraaan

BitTorrent Internals

ufraaan — Sun, 03 May 2026 11:39:48 +0000

BitTorrent is a decentralized peer-to-peer (P2P) file-sharing protocol designed for fast, efficient distribution of large files over the internet.

Let's first see how we classically download files from the internet, and why we even need something like BitTorrent.

The client requests a file from the server, the server has the file and responds. But things get interesting when your download size is a bit larger.

Server bandwidth is limited, so as more clients connect, speed slows down.
Speed of data transfer is capped by the server's upload capacity.

Say the server (Bob) can upload at 60 Mbps. No matter how fast the client (Alice) can download, the transfer cannot exceed 60 Mbps.
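To put numbers on that cap, here is a minimal sketch. It assumes the server splits its upload capacity evenly across connected clients, which is a simplification, and the function name is just something I made up for illustration.

```python
def client_server_speed(client_down_mbps, server_up_mbps, clients=1):
    """Per-client download speed when a single server feeds every client."""
    # Assumption: the server shares its upload capacity evenly.
    fair_share = server_up_mbps / clients
    # A client can never receive faster than its own download capacity.
    return min(client_down_mbps, fair_share)


for n in (1, 10, 100):
    print(n, "clients ->", client_server_speed(100, 60, n), "Mbps each")
```

With one client the 60 Mbps upload is the limit; with more clients, each one only gets a slice of it.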

Peer-to-Peer Network

In a P2P network, every party participating in the network has the exact same capabilities: they are all equal peers and can initiate conversations with each other.

The main highlight of P2P: even if a few nodes crash or are removed, the network keeps serving its purpose. No single point of failure.

This isn't just about outages: it also applies to the core service the network provides. For example, if the network's job is to serve files, even if one machine goes down, other machines would still share those files with whoever needs them. There are no "system interruptions" as long as the network is stable enough.

P2P networks come in two flavors:

Pure P2P: No central entity. Every node can connect to every other node.
Hybrid P2P: Has a central entity that shares metadata about the data with peers, but never the data itself.

Note: If the central entity goes down, the network and its services are affected. This hybrid P2P architecture is what powers BitTorrent.

BitTorrent has a central entity called a tracker. Peers talk to each other, but to know who to talk to, they first consult the tracker.

Core Idea

The core idea of BitTorrent is to download a file from multiple machines concurrently.

We saw that download speed is limited by the upload capacity of the sender, whether that is a user, a server, or anything else. If you can download at 100 Mbps but the sender can only upload at 60 Mbps, you'll max out at 60 Mbps.

But what if instead of downloading from one machine, we distributed the file across the network and connected to 50 different clients simultaneously to download in parallel? That's the idea behind BitTorrent.

Faster downloads.
Upload load is distributed among peers. Every peer may hold some fragment of the file and can serve it to others. You still get high download speeds, but the upload burden is shared across the network.
A large number of downloads puts only a small load on each peer, because it's highly distributed.
Breaking a file into smaller chunks boosts concurrency.
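The same back-of-the-envelope math works for a swarm. This is a rough sketch, assuming every peer can give you its full listed upload rate (real peers split their upload across many downloaders); the names and numbers are illustrative.

```python
def swarm_speed(client_down_mbps, peer_up_mbps):
    """Download speed when pulling from several peers in parallel."""
    # Upload contributions add up, but the client's own link is still a cap.
    return min(client_down_mbps, sum(peer_up_mbps))


single_sender = swarm_speed(100, [60])      # one sender: capped by its upload
fifty_peers = swarm_speed(100, [2] * 50)    # 50 peers giving 2 Mbps each
print(single_sender, fifty_peers)
```

Fifty peers contributing only 2 Mbps each are enough to saturate a 100 Mbps download link, and no single peer carries much load.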

A Simplified Download Flow

When a user wants to download a file, they sniff around the network to find peers that have the pieces. For this, they use a tracker.

The user goes to the tracker and says "I want this file." The tracker responds with a list of peers that have it. The user then connects directly to those peers and downloads the file.

Let's say a user wants a file that has 4 chunks. They go to the tracker, the tracker responds with the list of machines for each chunk, the user talks to those peers, downloads each chunk, and concatenates them locally to get the full file.
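Here is that simplified flow as a toy in-memory model. There is no networking, and every name (PEERS, tracker_lookup, download) is hypothetical. It follows the simplified picture above; in the real protocol the tracker only returns peers, and the peers tell each other which pieces they hold.

```python
# Toy swarm: each peer holds some chunks of a 4-chunk file.
PEERS = {
    "peer-a": {0: b"Bit", 1: b"Tor"},
    "peer-b": {1: b"Tor", 2: b"ren"},
    "peer-c": {2: b"ren", 3: b"t!"},
}


def tracker_lookup(num_chunks):
    """For each chunk index, list the peers that hold it."""
    return {
        index: [name for name, chunks in PEERS.items() if index in chunks]
        for index in range(num_chunks)
    }


def download(num_chunks):
    """Fetch every chunk from some peer and concatenate them in order."""
    locations = tracker_lookup(num_chunks)
    chunks = []
    for index in range(num_chunks):
        peer = locations[index][0]  # simply take the first peer that has it
        chunks.append(PEERS[peer][index])
    return b"".join(chunks)


print(download(4))
```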

Nomenclature & Terminologies

These terms are useful when analyzing BitTorrent’s behavior and algorithms.

1. Pieces and Blocks

A file shared on the BitTorrent network is divided into pieces.
Each piece is further subdivided into blocks.
Data transfer happens at the block level (one block per request).
Example:
- A 16 MiB piece → 1,024 blocks of 16 KiB each.
A piece is complete once all its blocks are received, and valid only if its SHA1 hash matches the one stored in the .torrent file.
The client reconstructs the original file by concatenating all pieces.
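To make the piece/block split concrete, here is a small sketch that lists the block requests needed for one piece. The function name is my own, and the 16 KiB block size is taken from the example above.

```python
BLOCK_SIZE = 16 * 1024  # 16 KiB, as in the example above


def block_requests(piece_length, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that together cover one piece."""
    requests = []
    offset = 0
    while offset < piece_length:
        # The final block may be shorter than block_size.
        length = min(block_size, piece_length - offset)
        requests.append((offset, length))
        offset += length
    return requests


print(len(block_requests(16 * 1024 * 1024)))
```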

2. Peer Set

The peer set is the list of peers a node can connect to for uploading/downloading.
Typically obtained from a tracker.
Example:
- If peer A receives {C, E} from the tracker, it exchanges data only with C and E.

3. Active Peer Set

A subset of the peer set used for active data transfer.
Not all peers are connected simultaneously.
Example:
- Out of 50 peers received, only ~10 may be actively connected.
Purpose:
- Limits bandwidth usage.
- Reduces network congestion.
- Improves stability of connections.

4. Seeders & Leechers

Seeder:
- A peer that has the complete file.
- Uploads pieces to others.
Leecher:
- A peer that is still downloading.
- May also upload already downloaded pieces.

Impact on Performance

More seeders → higher availability → faster downloads.
Few seeders → bottleneck (resembles client-server model).
If leechers ≫ seeders:
- Increased contention.
- Slower download speeds.

BitTorrent is Popularity-Friendly

New and popular files will have many seeders and download faster. Old or unpopular files have fewer seeders and download slower.

For example, when a new version of an operating system is released, there's a very high chance many people want to download it. Ubuntu and Debian offer official torrent distributions, and there will be many seeders: so whoever wants to download gets fast speeds.

Applications of BitTorrent

Downloading Linux distributions (often faster than pulling from a single FTP or HTTP server), and large software, movies, games, etc.
Sending patches to users (e.g., security patches). You can run a small BitTorrent-based system where you drop a file into one node and it automatically distributes across every machine in your network, which can then run the patches. Massive data centers use this to power security patch distribution.
Facebook uses this to power massive deployments and distribute build artifacts across servers. Instead of thousands of servers all downloading a binary from one source, it splits the file across multiple places. The network gradually converges and every node ends up with the full file.

The Torrent File

To download or upload any file from the torrent network, you need a .torrent file. This file holds metadata about the file you want to download.

For example, if you want to download Ubuntu from the torrent network, the Ubuntu ISO would have a corresponding .torrent file. You download it, which contains all the metadata, and then use it to fetch the actual file from the network.

Lifecycle of a Torrent File

Seeders are seeding data in the network, and as long as at least 1 seeder is serving the file, the torrent is alive. Otherwise, the torrent is dead.

It's therefore very important to have at least 1 seeder: otherwise nobody can download the file.

What separates BitTorrent from a classic blockchain/cryptocurrency use case is that there's no incentive for anyone to join and stay as a seeder. Cryptocurrency incentivizes participation in the network: BitTorrent doesn't.

What Does the Torrent File Hold?

The torrent file is static: no matter when you download it, it will always have the same content.

It holds metadata about the file, not the actual data.

A torrent file is essentially a dictionary of key-value pairs:

announce: URL of the tracker. This tells your torrent client which tracker to contact to find peers in the network.
created by: Name and version of the program that created the torrent.
creation date: Creation time as a Unix timestamp (seconds since the epoch).
encoding: Encoding used for strings in the info dictionary. Defaults to UTF-8.
comment: Optional comment from the author.
info: A dictionary describing the file(s) of the torrent. For example, if you're downloading Ubuntu, it would contain information about the Ubuntu image itself.

BitTorrent supports two types of downloads: single-file and multi-file. Depending on the type, the structure of the info dictionary varies.

File Data Information

The info dictionary also stores information about the pieces:

piece length: Number of bytes in each piece.
pieces: 20-byte SHA1 hash values for each piece, concatenated together.

Since a file is split into equal-size pieces (only the last piece may be shorter), piece length tells you how big each one is.

For example, a 1 GB file with a piece size of 1 MB would have 1024 pieces. The torrent file doesn't store the actual piece data: instead, for each piece it stores a 20-byte SHA1 hash and concatenates all of them together.
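Here is a minimal sketch of how the pieces field could be built and later used to verify a downloaded piece. It uses Python's hashlib; the helper names are mine, not part of any BitTorrent library.

```python
import hashlib

HASH_SIZE = 20  # a SHA1 digest is 20 bytes


def split_pieces(data, piece_length):
    """Cut the file into pieces; only the last one may be shorter."""
    return [data[i:i + piece_length] for i in range(0, len(data), piece_length)]


def build_pieces_field(data, piece_length):
    """Concatenate the SHA1 digest of every piece."""
    return b"".join(hashlib.sha1(p).digest() for p in split_pieces(data, piece_length))


def verify_piece(index, piece, pieces_field):
    """Compare a downloaded piece against its stored 20-byte hash."""
    expected = pieces_field[index * HASH_SIZE:(index + 1) * HASH_SIZE]
    return hashlib.sha1(piece).digest() == expected
```

For the 1 GB example, 1024 pieces × 20 bytes means the pieces field is 20,480 bytes long, no matter how large the file itself is.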

Torrent File Format: Bencoding

Torrent files use a custom encoding format called bencoding: not JSON.

When you open a .torrent file in a client like qBittorrent, the client first decodes the bencoded file to extract the metadata. The component that does this is called a bencoding decoder.

Bencoding Specification

Every torrent file is a single bencoded dictionary. The bencoding specification supports only 4 data types: byte strings, integers, lists, and dictionaries.

I wrote a decoder in Go and understood the format far better for it. (https://ufraan.dev/projects/bencode-foo)
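For a feel of the format, here is a compact decoder sketch in Python (this is not the Go project linked above, just an illustration). Integers look like i42e, strings like 4:spam, lists like l...e, and dictionaries like d...e. Error handling is deliberately thin.

```python
def bdecode(data):
    """Decode one bencoded value from a bytes object."""
    value, pos = _decode(data, 0)
    if pos != len(data):
        raise ValueError("trailing data after bencoded value")
    return value


def _decode(data, pos):
    lead = data[pos:pos + 1]
    if lead == b"i":                      # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                      # list: l<items>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = _decode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                      # dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = _decode(data, pos)
            value, pos = _decode(data, pos)
            result[key] = value
        return result, pos + 1
    if lead.isdigit():                    # byte string: <length>:<bytes>
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError(f"unexpected byte at position {pos}")
```

Strings come back as bytes rather than str, because bencoded strings are raw bytes (the pieces field, for example, is binary hash data).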

The BitTorrent Architecture

The BitTorrent architecture consists of 4 entities:

.torrent file
Trackers
Seeders
Leechers

Pieces

Whenever a file is shared on the BitTorrent network, it's not shared in its entirety. It's first broken into pieces, which become the unit of transmission.

The downloader gets these pieces and concatenates them locally to form the complete file. All pieces are the same length, except possibly the last one, which may be shorter.

For example, a 3 MB file with a piece size of 1 MB creates 3 pieces: p1, p2, p3.

When you join the network and download a piece from a seeder, you immediately announce it to the peers you are connected to: "I have this piece now. If anyone needs it, come to me instead."

As each peer downloads any piece, they inform everyone else. This is the power of P2P.

Torrent File

A metafile that holds static information about the file: filename, size, piece information, etc. It does not hold the actual data.

One critical field it holds is the announce URL: the tracker URL. The tracker is the only central entity in the BitTorrent architecture, acting as a metadata store where you can get information about other peers and the torrent.

Each torrent is uniquely identified by an infohash: the SHA1 hash of the bencoded info dictionary inside the .torrent file. The .torrent file itself is typically downloaded through a regular HTTP web server.
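Here is a sketch of how an infohash can be computed: bencode the info dictionary, then SHA1 the result. The tiny encoder below is my own illustration, and the sample info values are made up. Real clients usually hash the exact info bytes as they appear in the file rather than re-encoding them.

```python
import hashlib


def bencode(value):
    """Encode ints, strings/bytes, lists, and dicts as bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def info_hash(info):
    """SHA1 of the bencoded info dictionary, as a hex string."""
    return hashlib.sha1(bencode(info)).hexdigest()


# Hypothetical single-file info dictionary (values are made up).
example_info = {"name": "a.txt", "length": 3, "piece length": 16384, "pieces": b"\x00" * 20}
print(info_hash(example_info))
```

Sorting the keys matters: it guarantees that the same dictionary always encodes to the same bytes, and therefore to the same infohash.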

Tracker

The tracker is the only central entity in this P2P network, and it's very lightweight.

For a given torrent, the .torrent file contains the tracker URL. Every peer in the network connects to this tracker to get metadata about who else is in the network.

Many trackers exist across the BitTorrent ecosystem, but for a given .torrent file you connect to the tracker named in its announce field.

Note: The tracker does not download or transfer files. It only holds information about peers and their distribution: that's why it's so lightweight.

The core jobs of a tracker:

Keep track of peers that hold the file.
Keep track of peers that are downloading.
Help peers find other peers to download content from.

A tracker is essentially a simple HTTP server that:

Hands out peer information to the network.
Periodically collects stats from peers.

When you have a .torrent file, you first extract info from it, then contact the tracker saying "I want to join your network." The tracker responds with roughly 50 peers that are part of this network.

The tracker doesn't just send info to users. Peers in the network also periodically report back to the tracker with their downloaded amount, uploaded amount, which torrent they're part of, etc.
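Since the tracker is just an HTTP server, the "I want to join" message is an HTTP GET with query parameters. The sketch below only builds the URL and sends nothing. The parameter names follow the standard tracker request (numwant and compact are widely supported optional extras); the tracker URL, peer ID, and function name are hypothetical.

```python
from urllib.parse import urlencode


def build_announce_url(announce, info_hash, peer_id, port, uploaded, downloaded, left):
    """Build the tracker request URL for one torrent."""
    params = {
        "info_hash": info_hash,    # 20 raw bytes; urlencode percent-escapes them
        "peer_id": peer_id,        # 20 bytes identifying this client
        "port": port,              # port this peer listens on
        "uploaded": uploaded,      # stats the peer reports back
        "downloaded": downloaded,
        "left": left,              # bytes still missing
        "numwant": 50,             # how many peers we would like back
        "compact": 1,
    }
    return announce + "?" + urlencode(params)


url = build_announce_url(
    "http://tracker.example/announce",   # hypothetical tracker
    bytes(20),                           # placeholder infohash
    b"-XX0001-123456789012",             # made-up 20-byte peer ID
    6881, 0, 0, 1000,
)
print(url)
```

The same request doubles as the periodic stats report: a peer repeats it with updated uploaded, downloaded, and left values.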

How Twitter Served 300,000 Timelines Per Second

ufraaan — Fri, 01 May 2026 12:59:25 +0000

This post is a plain breakdown of how Twitter (circa 2013) handled timeline scale, why the first approach broke, and why the final solution is a hybrid.

Based on concepts from Designing Data Intensive Applications.

The Problem

When I strip Twitter down to the basics, there are really only two product operations that matter:

Post Tweet: a user publishes a new message to their followers
Home Timeline: a user views tweets from the people they follow

Here are the numbers Twitter published (Nov 2012):

Operation	Average	Peak
Post Tweet (writes)	4,600 req/sec	12,000 req/sec
Home Timeline (reads)	300,000 req/sec	-

The read-to-write ratio is roughly 65x. People read way more than they write. That asymmetry is the whole story here.

What surprised me the first time I studied this: 12,000 writes/sec is not the scary part. A solid relational setup can handle that. The hard part is fan-out: one tweet may need to show up in millions of home timelines almost immediately.

The Schema

At the core, Twitter's model has three tables.

tweets: the content

id	sender_id	text	timestamp
1	12	"Excited to announce"	1000
2	5	"Grateful. Humbled. Dehydrated."	1001
3	12	"I didn't come this far to only come this far"	1002
4	8	"Rejected 12 times. Hired once. Now I speak at conferences."	1003

Every tweet lives here. Notice sender_id = 12 appears twice: rows 1 and 3 are both from alice. This table is just tweet content, nothing else.

users: the profiles

id	screen_name	profile_image
5	bob	bob.jpg
8	charlie	charlie.jpg
12	alice	alice.jpg

This is profile data only. No tweets, no follow graph. If I see sender_id = 12 in tweets, I resolve id = 12 here and get "alice".

follows: the relationships

follower_id	followee_id	meaning
100	5	User 100 follows bob
100	12	User 100 follows alice
101	12	User 101 follows alice

This table stores the follow graph. One row = one relationship.

Follower vs. Followee

Follower: the person doing the following. If you follow alice, you are the follower.
Followee: the person being followed. alice, in this case, is the followee.

So when alice tweets, the question is: "Who has followee_id = 12?" That result is everyone whose timeline might need an update.

Approach 1: Query at Read Time

Twitter's original approach was straightforward: writes go into tweets, and timelines are computed on demand at read time.

SELECT tweets.*, users.*
FROM tweets
  JOIN users   ON tweets.sender_id = users.id
  JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = :current_user_id  -- bind parameter: id of the user viewing the timeline
ORDER BY tweets.timestamp DESC;

Why this broke at scale

300,000 timeline reads/sec means hammering this multi-join query constantly. Even with indexes, this gets expensive fast. That pushed Twitter to move work away from reads.
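To see the read-time approach end to end, here is a small sketch using Python's built-in sqlite3 with the sample rows from the tables above. SQLite is only a stand-in here; it says nothing about what Twitter actually ran.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users   (id INTEGER PRIMARY KEY, screen_name TEXT, profile_image TEXT);
CREATE TABLE tweets  (id INTEGER PRIMARY KEY, sender_id INTEGER, text TEXT, timestamp INTEGER);
CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);
""")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (5, "bob", "bob.jpg"), (8, "charlie", "charlie.jpg"), (12, "alice", "alice.jpg"),
])
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", [
    (1, 12, "Excited to announce", 1000),
    (2, 5, "Grateful. Humbled. Dehydrated.", 1001),
    (3, 12, "I didn't come this far to only come this far", 1002),
    (4, 8, "Rejected 12 times. Hired once. Now I speak at conferences.", 1003),
])
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(100, 5), (100, 12), (101, 12)])


def home_timeline(conn, user_id):
    """Compute a timeline on demand by joining all three tables."""
    return conn.execute("""
        SELECT tweets.id, users.screen_name, tweets.text
        FROM tweets
          JOIN users   ON tweets.sender_id = users.id
          JOIN follows ON follows.followee_id = users.id
        WHERE follows.follower_id = ?
        ORDER BY tweets.timestamp DESC
    """, (user_id,)).fetchall()


for row in home_timeline(conn, 100):
    print(row)
```

Every single timeline view runs both joins and the sort, which is exactly the cost that piles up at 300,000 reads per second.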

Approach 2: Fan-Out on Write

The key idea: if reads outnumber writes by 65x, pay more at write time so reads are cheap.

Instead of building timelines on demand, build them when tweets are created. Each user gets a precomputed cached timeline (like a mailbox), ready to read.

The mailbox analogy

This is what happens when alice tweets:

The tweet is saved to the global tweets table
A background worker queries all of alice's followers
For each follower, it prepends the new tweet to their cached timeline

When I open the app, the timeline is basically a cache fetch. No heavy joins in the hot read path.
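Here is the mailbox idea as a toy in-memory sketch. Plain Python dicts and deques stand in for the cache, the per-timeline cap is a number I picked for illustration, and the follower lists reuse the sample follows table.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 800                      # hypothetical cap per cached timeline
followers = {12: [100, 101], 5: [100]}    # followee id -> follower ids
tweets = {}                               # the global tweets "table"
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))


def post_tweet(tweet_id, sender_id, text):
    """Write path: store the tweet, then push it into every follower's mailbox."""
    tweets[tweet_id] = (sender_id, text)
    for follower_id in followers.get(sender_id, []):
        timelines[follower_id].appendleft(tweet_id)   # newest first


def home_timeline(user_id):
    """Read path: a plain cache lookup, no joins."""
    return [tweets[tweet_id] for tweet_id in timelines[user_id]]
```

Calling post_tweet(1, 12, "Excited to announce") puts tweet 1 at the front of the timelines of users 100 and 101; reading either timeline is then a single lookup.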

The math

The average Twitter user has approximately 75 followers, so each tweet fans out to about 75 timelines.

4,600 tweets/sec × 75 followers = 345,000 cache writes/sec

A cache write is a list prepend in Redis, which takes microseconds and involves no disk I/O. A JOIN query scans indexed B-trees, merges results, and sorts them, which takes milliseconds and is often disk-bound.
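The arithmetic is simple enough to keep around as a tiny script, using only the published numbers quoted above. The variable and function names are mine.

```python
TWEETS_PER_SEC = 4_600
TIMELINE_READS_PER_SEC = 300_000
AVG_FOLLOWERS = 75


def fan_out_rate(tweets_per_sec, avg_followers):
    """Cache writes per second when every tweet is copied to each follower."""
    return tweets_per_sec * avg_followers


read_write_ratio = TIMELINE_READS_PER_SEC / TWEETS_PER_SEC
cache_writes_per_sec = fan_out_rate(TWEETS_PER_SEC, AVG_FOLLOWERS)

print(f"read/write ratio: about {read_write_ratio:.0f}x")
print(f"cache writes/sec: {cache_writes_per_sec:,}")
```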

The Full Data Pipeline

Each follower has their own dedicated timeline cache (e.g., User 1: T7→T5→T3→T1, User 2: T8→T6→T5). The tweet IDs differ per user because each user follows a different set of people. When you request your timeline, it's served directly from your pre-built cache. Reads are fast because all the work happened at write time.

The tradeoff: one new tweet triggers N cache updates where N is the number of followers.

The Celebrity Problem and the Hybrid Solution

Fan-out on write has a ceiling. Celebrity accounts have tens of millions of followers. If one celebrity tweet triggers 30-80 million fan-out writes, queues back up and latency spikes.

So the production answer is a hybrid:

Normal users: fan-out on write (Approach 2)
Celebrities: no fan-out. Their tweets stay in the global tweets table.

When you request your home timeline, the system merges:

Your pre-built cache (tweets from normal users you follow)
A small real-time query for tweets from celebrities you follow (Approach 1)

This stays manageable because most users follow only a small number of celebrity accounts.
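The read-time merge can be sketched in a few lines. It assumes both inputs are lists of (timestamp, tweet_id) pairs already sorted newest first; the function name and sample data are made up.

```python
import heapq
from itertools import islice


def hybrid_timeline(cached, celebrity_tweets, limit=50):
    """Merge the pre-built cache with celebrity tweets fetched at read time."""
    merged = heapq.merge(
        cached, celebrity_tweets,
        key=lambda item: item[0],   # order by timestamp
        reverse=True,               # inputs are sorted newest first
    )
    return [tweet_id for _, tweet_id in islice(merged, limit)]


cached = [(1003, "t4"), (1001, "t2")]        # from fan-out on write
celebrity = [(1002, "c1"), (1000, "c0")]     # queried on demand
print(hybrid_timeline(cached, celebrity))
```

Because both lists are already sorted, the merge is a single linear pass and never has to re-sort the whole timeline.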

The Core Principle

Do more work at write time so the common path is trivially cheap.

Twitter optimized for reads because reads dominated writes.

When I design systems now, I start with one question: what is my read/write ratio, and where should I pay cost? That one answer often determines the rest of the architecture.

Git Under the Hood: What Actually Happens When You Commit

ufraaan — Fri, 01 May 2026 12:51:31 +0000

I used to just memorize git commands without understanding what was going on behind the scenes. Add, commit, push, and hope it works. Then one day I actually opened the .git folder and everything clicked.

This post covers the basics of how Git works internally, how to configure it properly, and how branches and merging actually function. If you are tired of blindly typing commands and want to understand what Git is actually doing, this is for you.

Setting Up Git Properly

When you first install Git, it needs to know who you are. Every commit gets tagged with a name and email, so Git stores these in a file called .gitconfig.

You can set these globally, which means every repository on your machine uses the same identity. Or you can set them locally per project. Most people go global for their name and email since those do not change.

You can also change your default editor. Git loves opening Vim for commit messages, which is fine if you know Vim, but most beginners would rather just use VSCode. Run this to fix that:

git config --global core.editor "code --wait"

The --wait flag is important. It tells Git to pause and wait until you close the editor window before it continues.

Then set your name and email:

git config --global user.name "ufraan"
git config --global user.email "ufraan1@gmail.com"

You can check what you set by running the same commands without the values. These settings live in your .gitconfig file. On Linux or macOS it is at ~/.gitconfig. On Windows it is at C:\Users\<YourUsername>\.gitconfig. Open it up and you will see your settings. SSH keys for talking to GitHub or GitLab are not stored here; they live in your ~/.ssh folder.

The .gitignore File

Git will track anything you stage, and git add . stages everything in the folder. That means build artifacts, environment files with your API keys, node_modules, and random OS metadata all get picked up. You do not want any of that in your repository.

Create a .gitignore file in your project root to tell Git what to ignore:

node_modules/
*.log
.DS_Store
*.pyc
.env

Quick breakdown of what these cover:

node_modules/ gets massive and you can always reinstall it with npm install
*.log files are just program output
.DS_Store is macOS metadata that has nothing to do with your code
*.pyc are compiled Python files generated automatically
.env usually has secrets and should never be committed

If you are starting a new project and do not know what to put in there, search for "gitignore generator" online. Pick your tech stack and it will give you a ready-made template.

Once your .gitignore is set up:

git add .
git commit -m "Initial commit"
git log

The git add . command automatically respects your .gitignore file, so it will not stage anything you told it to ignore.

What Happens Behind the Scenes

Run git log --oneline and you will see short commit hashes and your messages. You might also see something like HEAD -> master. But what is actually happening?

Every commit stores a reference to the previous commit (its parent), except the very first one, which has no parent. Each commit also gets a unique hash that acts like a fingerprint for that snapshot. This creates a chain of commits going all the way back to the beginning.
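Here is a toy model of that chain. The header mirrors Git's "<type> <size>" plus NUL-byte prefix for objects, but the body is heavily simplified (real commits also record a tree, author, and dates), so these hashes will not match what Git produces. All names are hypothetical.

```python
import hashlib

objects = {}  # toy stand-in for .git/objects: hash -> commit data


def make_commit(message, parent=None):
    """Create a commit whose hash depends on its message and its parent."""
    lines = []
    if parent is not None:
        lines.append(f"parent {parent}")
    lines += ["", message]
    body = "\n".join(lines).encode()
    header = b"commit %d\x00" % len(body)
    commit_id = hashlib.sha1(header + body).hexdigest()
    objects[commit_id] = {"message": message, "parent": parent}
    return commit_id


def walk(commit_id):
    """Follow parent links back to the first commit, like git log."""
    history = []
    while commit_id is not None:
        history.append(objects[commit_id]["message"])
        commit_id = objects[commit_id]["parent"]
    return history


first = make_commit("add index file")
second = make_commit("update code for index file", parent=first)
print(walk(second))
```

Because the parent's hash is part of what gets hashed, changing any earlier commit would change the hash of every commit after it.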

You can see this for yourself by peeking into the .git folder:

ls -la
cd .git
ls

You will find a few important things inside:

HEAD points to the branch you are currently on. Open it and you will see something like ref: refs/heads/master
objects/ stores all your actual data, including commits and file contents
refs/ stores branch pointers
logs/ keeps a history of changes to HEAD and other references

So when you run git commit, Git creates a new commit object in the objects directory, updates the branch pointer in refs, and HEAD follows the current branch. That is the whole process.
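You can follow the same pointers from code. This sketch reads HEAD and then the branch file it names. It assumes the branch is stored as a loose file under refs/ (Git may instead keep refs in a packed-refs file, which this ignores), and the function name is mine.

```python
from pathlib import Path


def resolve_head(git_dir):
    """Return (ref_name, commit_hash) for the current HEAD."""
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]            # e.g. refs/heads/master
        ref_file = git_dir / ref
        # A brand-new repo has no commit yet, so the ref file may not exist.
        commit = ref_file.read_text().strip() if ref_file.exists() else None
        return ref, commit
    return None, head                         # detached HEAD: a raw commit hash
```

Calling resolve_head(".git") from a repository root returns the branch HEAD points at and the commit that branch currently points at.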

Understanding Git Branches

Branches are basically parallel timelines for your project. You can work on a feature in isolation without touching the main codebase. Git creates a default branch called master (or main on newer setups), which is usually where the stable version lives.

Let us walk through how this actually works. Create a new folder and initialize a repo:

mkdir gittwo
cd gittwo
git init
git status

After git init, you will see "On branch master". Create a file and track it:

# Create an index.html file with some basic HTML
git status
git add index.html
git commit -m "add index file"
git branch

The output shows * master. The asterisk means that is where HEAD is pointing.

Now make some changes and commit again:

git add index.html
git commit -m "update code for index file"

Let us say you need to work on a navigation bar without affecting master. Create a branch:

git branch nav-bar
git branch

You will see nav-bar listed below master, and the asterisk is still on master. You created the branch but you are not on it yet. Switch over:

git checkout nav-bar
# or the newer command:
git switch nav-bar

git branch

Now the asterisk moved to nav-bar. Create a file on this branch:

# Create nav-bar.html
git add nav-bar.html
git commit -m "add navbar to codebase"

Now switch back to master:

git checkout master

Notice that nav-bar.html disappeared from your editor. It is not deleted. It just lives on the nav-bar branch now. This is the key thing about branches: there is only one working directory, and Git rewrites it to match whichever branch you check out. Switch back to nav-bar and the file comes right back.

You can always check where HEAD points:

git branch
git log --oneline
# Shows HEAD -> master or HEAD -> nav-bar

A few useful branch commands to keep around:

git branch                  # list all branches
git branch bugfix           # create a new branch without switching
git switch bugfix           # switch to a branch
git switch -c dark-mode     # create and switch in one step
git checkout -b pink-mode   # same thing, older syntax
git branch -d branch-name   # delete a branch (refuses if unmerged; -D forces)

One thing to remember: commit your changes before switching branches. Uncommitted work either follows you to the new branch or, if the switch would overwrite it, makes Git refuse to switch. Commit first, or set the work aside with git stash.

Merging Branches

Once you are done on a branch, you need to bring those changes back into master. That is what merging does.

Fast-Forward Merge

This is the simple case. If master has not changed since you created your branch, Git just moves the master pointer forward to your latest commit. No extra merge commit needed.

git checkout master
git merge nav-bar

Non-Fast-Forward Merge

If both master and your branch have new commits since you split off, Git cannot just fast-forward. It has to create a merge commit that combines both histories. You can also force this behavior even when a fast-forward is possible:

git checkout master
git merge --no-ff nav-bar

The --no-ff flag creates an explicit merge commit. This is useful if you want to keep a clear record of when a feature branch was integrated, even if a fast-forward was possible.
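The difference between the two merge types comes down to one question: is the target branch's commit an ancestor of the source branch's commit? Here is a toy model of that decision. Commits are just IDs with parent lists, file contents and conflicts are ignored, and all names are hypothetical.

```python
def is_ancestor(commits, ancestor, descendant):
    """True if `ancestor` is reachable from `descendant` via parent links."""
    stack, seen = [descendant], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(commits[current])
    return False


def merge(commits, branches, target, source, no_ff=False):
    """Merge `source` into `target`, updating the branch pointers in place."""
    t, s = branches[target], branches[source]
    if is_ancestor(commits, s, t):
        return "already up to date"
    if is_ancestor(commits, t, s) and not no_ff:
        branches[target] = s                 # just move the pointer forward
        return "fast-forward"
    merge_id = f"merge({t},{s})"
    commits[merge_id] = [t, s]               # a merge commit has two parents
    branches[target] = merge_id
    return "merge commit"


commits = {"A": [], "B": ["A"], "C": ["B"]}
branches = {"master": "B", "nav-bar": "C"}
print(merge(commits, branches, "master", "nav-bar"))
```

With no_ff=True the same call records a merge commit instead, which is the behavior --no-ff asks for.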

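Let us try merging in practice. In this walkthrough master has no new commits since nav-bar was created, so this is a fast-forward and the -m message goes unused: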

git checkout master
git merge nav-bar -m "merge navbar"
git log --oneline

The nav-bar commits are now part of master. You can see nav-bar.html in your editor alongside the other files. Once merged, you can clean up the branch:

git branch -d nav-bar

Do it again with another branch:

git checkout -b footer

# Create footer.html
git add footer.html
git commit -m "add footer section to codebase"

git checkout master
git merge footer

Now footer.html appears in master as well.

Resolving Merge Conflicts

Conflicts happen when two branches modify the same part of the same file. Git does not know which version to keep, so it stops and asks you to decide.

Here is how to trigger and fix one:

# On master, modify index.html
# Add "footer added" inside the body tag
git add index.html
git commit -m "add footer in index file"

# Switch to footer branch
git checkout footer

# Modify the same file differently
# Add "footer was added successfully" inside the body tag
git add index.html
git commit -m "update index file with footer code"

# Switch back to master and try to merge
git checkout master
git merge footer

Git will stop and tell you there is a conflict. Open index.html and you will see something like:

<<<<<<< HEAD
footer added
=======
footer was added successfully
>>>>>>> footer

The section between <<<<<<< HEAD and ======= is your current branch. The section between ======= and >>>>>>> footer is the incoming branch. You need to pick one, combine them, or write something entirely different. Then delete the conflict markers completely.
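If it helps to see the markers as structure rather than noise, here is a small sketch that pulls the two sides out of a conflicted file. It assumes the default two-way marker style shown above; the function name is mine, and it is not how Git itself resolves anything.

```python
def split_conflict(text):
    """Return (ours, theirs): the lines from each side of the conflict markers."""
    ours, theirs, target = [], [], None
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            target = ours                    # current branch side begins
        elif line.startswith("=======") and target is ours:
            target = theirs                  # incoming branch side begins
        elif line.startswith(">>>>>>> "):
            target = None                    # conflict block ends
        elif target is not None:
            target.append(line)
    return ours, theirs


conflicted = """<body>
<<<<<<< HEAD
footer added
=======
footer was added successfully
>>>>>>> footer
</body>"""

print(split_conflict(conflicted))
```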

Once the file looks right:

git add index.html
git commit -m "resolve conflict in index file"

That is it. The conflict is resolved and both branches are merged into master.

My Git Aliases

I use a few aliases to save myself from typing the same commands over and over. Here is what I have in my .zshrc:

alias gs='git status'
alias ga='git add'
alias gcm='git commit -m'
alias gp='git push'
alias gpl='git pull'
alias gc='git clone'
alias gb='git branch'

They are nothing fancy, but gcm and gs alone save a ton of keystrokes over the course of a day.

Wrapping Up

If you want to go deeper into how Git actually works, the Pro Git book is the best resource out there. It is free to read online at https://git-scm.com/book/en/v2 and covers everything from basics to advanced internals. I highly recommend it if you really want to master Git.

Git stops being scary once you understand what is happening under the hood. Commits are just snapshots linked in a chain. Branches are pointers to those snapshots. Merging either moves a branch pointer forward or adds a commit that joins two histories. The .git folder is not some black box. It is just files and references that you can look at whenever you want.

Adios! :)