<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Joseph Boone</title>
    <description>The latest articles on Forem by Joseph Boone (@tavari).</description>
    <link>https://forem.com/tavari</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3827905%2Fa1ad1a92-e5a4-4110-8b34-80c191d448f0.gif</url>
      <title>Forem: Joseph Boone</title>
      <link>https://forem.com/tavari</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tavari"/>
    <language>en</language>
    <item>
      <title>When Stability Improves Performance (Threading)</title>
      <dc:creator>Joseph Boone</dc:creator>
      <pubDate>Sat, 09 May 2026 17:33:56 +0000</pubDate>
      <link>https://forem.com/tavari/when-stability-improves-performance-threading-3e2p</link>
      <guid>https://forem.com/tavari/when-stability-improves-performance-threading-3e2p</guid>
      <description>&lt;p&gt;The common assumption in concurrent systems is that stability and performance pull in opposite directions. You add safety mechanisms, locks, routing constraints, and you pay for them in throughput. This post is about a case where that assumption turned out to be wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Premise
&lt;/h2&gt;

&lt;p&gt;TokenGate is a token-managed concurrency system. Decorated functions return tokens instead of executing immediately. Those tokens are admitted through a wrapped decorator, routed to per-core mailboxes by weight class, and executed on thread pool workers.  &lt;/p&gt;

&lt;p&gt;The system also enforces a clean separation between async coordination and threaded execution, a common source of complexity in concurrent systems.&lt;/p&gt;

&lt;p&gt;TokenGate aims to ease this by using tokens as a bridge between the async event loop and a thread pool. The async event loop manages the routing and coordination of tokens, while the thread pool handles the execution. The routing model assigns tokens to cores by weight and storage speed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Operations are handled separately, keeping them distinct at every stage.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The weight classification determines which cores a task may run on. This&lt;br&gt;
enables "front-back-fill" scheduling patterns, where light work occupies the&lt;br&gt;
later cores while heavy work starts on the first core and spills toward&lt;br&gt;
the others as load increases.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each weight class has a defined core range:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;HEAVY&lt;/code&gt; → All Cores&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MEDIUM&lt;/code&gt; → Core 2+&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LIGHT&lt;/code&gt; → Core 3+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Within those ranges, a staggered position counter distributes tokens across workers in FIFO order, with the ability to interleave retry tokens.&lt;/p&gt;
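&lt;p&gt;The routing above can be sketched roughly as follows (core indices, class names, and the counter mechanism are illustrative assumptions, not TokenGate's actual internals):&lt;/p&gt;

```python
import itertools

# Hypothetical sketch of the weight-class core ranges described above,
# assuming a 4-core machine. Indices here are 0-based; the post's
# "Core 2+" / "Core 3+" map to cores 1+ and 2+ respectively.
CORE_RANGES = {
    "heavy": [0, 1, 2, 3],   # all cores
    "medium": [1, 2, 3],     # core 2 onward
    "light": [2, 3],         # core 3 onward
}

class StaggeredRouter:
    """Distribute tokens in FIFO order across a weight class's core range."""
    def __init__(self):
        # one cycling cursor per weight class approximates the stagger
        self._cursors = {w: itertools.cycle(cores) for w, cores in CORE_RANGES.items()}

    def route(self, weight: str) -> int:
        return next(self._cursors[weight])

router = StaggeredRouter()
cores = [router.route("light") for _ in range(4)]  # cycles through the light range
```

&lt;p&gt;The real system interleaves retry tokens into this order; the cursor above only shows the base FIFO stagger.&lt;/p&gt;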

&lt;h2&gt;
  
  
  What Was Built
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sticky Token Registry
&lt;/h3&gt;

&lt;p&gt;Tokens are marked when they are seen to have related args: tokens with matching &lt;code&gt;(operation_type, args)&lt;/code&gt; keys are pinned to the core that first receives them, keeping data locality clean.&lt;/p&gt;

&lt;p&gt;When a token arrives, &lt;code&gt;sticky_registry.mark()&lt;/code&gt; creates a sticky anchor that automatically groups these related tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task_token_guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;operation_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_op&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sticky_anchor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_operation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is reactive: it catches collisions as they arrive. It handles the case where two tokens with identical logical &lt;em&gt;identity&lt;/em&gt; are submitted concurrently and would otherwise spread across different core domains.&lt;/p&gt;
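&lt;p&gt;A minimal sketch of the idea, assuming hypothetical names (&lt;code&gt;StickyRegistry&lt;/code&gt; here is illustrative, not the real implementation): the first core to see an identity key wins, and later marks of the same identity return that same core.&lt;/p&gt;

```python
class StickyRegistry:
    """Pin (operation_type, args) identity keys to the first core that sees them."""
    def __init__(self):
        self._anchors = {}

    def mark(self, operation_type, args, core):
        # setdefault keeps the first core and ignores later candidates,
        # so concurrent duplicates collapse onto one domain
        key = (operation_type, args)
        return self._anchors.setdefault(key, core)

reg = StickyRegistry()
first = reg.mark("my_op", (42,), core=1)   # first arrival anchors to core 1
second = reg.mark("my_op", (42,), core=3)  # same identity: stays on core 1
```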

&lt;h3&gt;
  
  
  Hash Conductor
&lt;/h3&gt;

&lt;p&gt;The second layer is proactive. Instead of waiting to see whether a collision will happen, the conductor anchors an entire call chain to a domain before any child token is even routed. It's a heavier pattern, but it reliably ensures all related data is processed on the same core.&lt;/p&gt;

&lt;p&gt;When a lead token is decorated with &lt;code&gt;external_calls&lt;/code&gt;, a SHA-256 seed is&lt;br&gt;
generated from the token ID and the call list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;seed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;SHA&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="nx"&gt;token_id&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;freeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;external_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The seed is the full 64-character hex digest. The token ID is included so two leads with&lt;br&gt;
identical call lists still get independent domains.&lt;/p&gt;
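&lt;p&gt;A hedged sketch of that derivation in Python, assuming &lt;code&gt;freeze&lt;/code&gt; simply canonicalizes the call list (the helper name and exact encoding are assumptions, not the actual implementation):&lt;/p&gt;

```python
import hashlib

def conductor_seed(token_id, external_calls):
    """Derive a domain seed from the token ID and its declared call list."""
    # freezing as a tuple repr is an assumed canonicalization
    frozen = repr(tuple(external_calls))
    return hashlib.sha256(f"{token_id}:{frozen}".encode()).hexdigest()

# Two leads with the same call list still get distinct seeds,
# because the token ID participates in the hash.
seed = conductor_seed("tok-1", ["child_op"])
```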

&lt;p&gt;That seed is pinned to whichever core the lead lands on. Any token spawned&lt;br&gt;
during the lead's execution inherits the seed and is routed to the same core automatically. No configuration is needed inside call sites. No explicit passing. The seed propagates through a thread-local variable, set in the executor thread before the lead function runs and read during token routing.&lt;/p&gt;

&lt;p&gt;The most important thing to note is that workers within the core domain may still execute the tokens in parallel, respecting the workers' staggered routing. This isn't a "nerf" to the system's capability: under saturated load conditions, noticeable performance gains were observed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task_token_guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;operation_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lead_op&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;external_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;child_op&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lead_operation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# These children inherit the seed and land on the same core
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;child_op&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pending count for a lead operation starts at 1, increments for each external call registered at creation time, and decrements on every completion. When it reaches zero, the seed is released.&lt;/p&gt;
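&lt;p&gt;That lifecycle can be sketched as a small reference count (names here are illustrative, not the actual API):&lt;/p&gt;

```python
class SeedLease:
    """Release a domain seed once the lead and all its children finish."""
    def __init__(self, seed):
        self.seed = seed
        self.pending = 1       # starts at 1 for the lead itself
        self.released = False

    def spawn(self):
        # each registered external call bumps the count at creation time
        self.pending += 1

    def complete(self):
        # every completion decrements; zero means nothing still owns the seed
        self.pending -= 1
        if self.pending == 0:
            self.released = True

lease = SeedLease("abc123")
lease.spawn(); lease.spawn()       # two children registered
lease.complete(); lease.complete() # children finish
lease.complete()                   # lead finishes last; seed is released
```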

&lt;h2&gt;
  
  
  The Benchmark
&lt;/h2&gt;

&lt;p&gt;15 doubling waves. 131,068 tokens total.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Wave   Tokens    Tok/s     Lat(ms)   Overlap
1      4         1386.2    0.721     1.44×
2      8         2391.2    0.418     2.48×
3      16        2744.8    0.364     4.82×
4      32        2812.7    0.356     11.32×
5      64        2880.0    0.347     22.01×
6      128       2907.6    0.344     29.78×
7      256       2846.8    0.351     37.98×
8      512       2811.5    0.356     41.81×
9      1024      2813.9    0.355     44.18×
10     2048      2644.3    0.378     44.86× ← peak overlap
11     4096      2816.3    0.355     38.34×
12     8192      2819.9    0.355     32.64×
13     16384     2765.0    0.362     27.92× ← better sustained performance
14     32768     2707.7    0.369     24.96× ←
15     65536     2789.5    0.358     24.21× ←

Zero failures. Avg latency 0.386ms/token.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The previous ceiling was around 17× overlap after saturation. This &lt;br&gt;
run hit 44.86× at wave 10 and descended gracefully from there.&lt;br&gt;
Latency moved 0.04ms across the entire run from wave 3 to wave 15.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Stability Produced More Concurrency
&lt;/h2&gt;

&lt;p&gt;The 17× previous ceiling wasn't a capacity ceiling. It was a friction ceiling.&lt;/p&gt;

&lt;p&gt;At that overlap level the old routing was generating cross-domain&lt;br&gt;
traffic. Related tokens landing on different cores meant cache lines being&lt;br&gt;
written back and refilled across the interconnect. The scheduler was spending a growing proportion of its time on coordination rather than execution.&lt;/p&gt;

&lt;p&gt;Domain anchoring removed that friction. Tokens that belong together stay together. The cache lines loaded for a lead token's data are still warm when its children execute on the same core. The cross-core traffic that was growing with overlap now barely exists for conducted chains. The scheduler has more headroom for execution relative to coordination cost. The overlap ceiling rises.&lt;/p&gt;

&lt;p&gt;This is why the overlap column stays above 24× after saturation and holds latency flat while doing it. The system isn't working harder. It's working cleaner and scaling better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Calling Production Crews
&lt;/h2&gt;

&lt;p&gt;If you run concurrent Python workloads (task queues, async pipelines, anything with related operations that currently route freely), I'd like to know what you see here; any poking at my work is helpful.&lt;/p&gt;

&lt;p&gt;As a self-taught developer, I'm open to criticism and would love to learn from trained or learned folks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Registering calls for anchoring&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The sticky registry and hash conductor are opt-in. Existing code routes normally.&lt;/p&gt;

&lt;p&gt;Hashed domain anchoring and sticky tokens aim to be the first "production-ready" features for this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The cases I'm most interested in:&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Anything that hits an unexpected concurrency ceiling without obvious cause.&lt;/p&gt;

&lt;p&gt;The repo is public. Issues and observations welcome. &lt;a href="https://github.com/TavariAgent/Py-TokenGate" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Leave some feedback! &lt;a href="https://tavari.online" rel="noopener noreferrer"&gt;Tavari&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;TokenGate represents my journey through nearly 4,000 hours of hobbyist coding. Times change; I'm now opening up as a business. Stay tuned for the future of Tavari.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>performance</category>
      <category>security</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>What Code is About (IMO)</title>
      <dc:creator>Joseph Boone</dc:creator>
      <pubDate>Wed, 06 May 2026 19:38:58 +0000</pubDate>
      <link>https://forem.com/tavari/what-code-is-about-imo-3i0g</link>
      <guid>https://forem.com/tavari/what-code-is-about-imo-3i0g</guid>
      <description>&lt;p&gt;I want to talk about what code is about and what it did to my life and mind, not as a career or hobby, but as a way of thinking that I didn't expect and won't let go of.&lt;/p&gt;

&lt;p&gt;I've been building &lt;a href="https://github.com/TavariAgent/Py-TokenGate" rel="noopener noreferrer"&gt;TokenGate&lt;/a&gt;, and I learned what I know from many prototype applications over thousands of hours of code. I started working with LLMs, mostly Claude Sonnet, to help me understand things I didn't know, debug things I couldn't see, and make decisions about architecture I had no formal training for. And somewhere in that process, something shifted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code is not a language. It's a lens.
&lt;/h2&gt;

&lt;p&gt;People talk about code like it's a skill, like typing or driving, or something you learn. But that framing misses something important. Code is closer to a way of perceiving. Once it starts working on you, you start seeing structure. Systems, dependencies, state changes, feedback loops, and contracts. &lt;/p&gt;

&lt;p&gt;You start asking "what are the rules here?" about things that have nothing to do with a terminal. Things that don't obviously involve feedback loops, ordinary parts of life, start to gain structure.&lt;/p&gt;

&lt;p&gt;I'd call this structural truthiness. Not truth like a fact, but truth like: does this actually hold together? Is the logic real, or am I just convincing myself? Code is ruthless about this in a way that most thinking isn't. It either runs or it doesn't. The compiler doesn't care about your vibes.&lt;/p&gt;

&lt;p&gt;That ruthlessness, weirdly, becomes freeing. Because once you trust the structure, you can go anywhere inside it. I've found myself reading about physics, economics, biology, Lagrangian formulas, fields I never studied explicitly but got to "taste" the essence of, actually following the logic, because code taught me to look for it. Code gave me back something I had lost as a person: perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  Working with AI made it weirder and better
&lt;/h2&gt;

&lt;p&gt;Working with an LLM like Sonnet changes the dynamic in an interesting way. You're not just writing code, you're describing what you want, and then negotiating with something that can tell you if your description is coherent. If you're fuzzy, the output is fuzzy. If you're precise, it's surgical.&lt;/p&gt;

&lt;p&gt;That forced me to get better at knowing what I actually wanted. Not what I thought I wanted. Not what sounded right. What I could actually articulate as an intention with a structure behind it. That's a skill that transfers everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing that actually matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Here's the point I keep coming back to:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A calculator computes what must be. A computer equates what you want.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A calculator is deterministic and bounded. You give it inputs, it gives you the only possible output. There's no room for intention, only execution of fixed rules. A computer is different. A computer runs your model of something. It doesn't know what you're trying to do. It just faithfully executes whatever you describe which means the quality of what comes out is a direct reflection of the quality of your thinking going in.&lt;/p&gt;

&lt;p&gt;That's not a technical distinction. That's a philosophical one. Code is the medium where intention becomes testable. Where you stop saying "I think this is how it works" and start saying "let's find out". You're not looking up an answer, you're building a small version of your understanding and seeing if it holds.&lt;/p&gt;

&lt;p&gt;For some people - myself included, that's the clearest and most satisfying way to think. Not because it's easier, but because it's honest in a way that feels like you're challenging your ability to see the truth in the structure.&lt;/p&gt;

&lt;p&gt;If you're on the fence about going deeper into code: Don't think of it as learning a tool. Think of it as picking up a new way to think and challenge what you know.&lt;/p&gt;

&lt;p&gt;It's worth it.&lt;/p&gt;

&lt;p&gt;What's next for me: The Gemma 4 challenge kicks off today on dev.to. I'll be using the Gemma API to build something. Best of luck to everyone who enters, happy coding!&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>programming</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Maximum Concurrency (&amp; Sub-Quadratic Scaling) By TokenGate</title>
      <dc:creator>Joseph Boone</dc:creator>
      <pubDate>Sun, 26 Apr 2026 20:00:58 +0000</pubDate>
      <link>https://forem.com/tavari/maximum-concurrency-sub-quadratic-scaling-by-tokengate-14pc</link>
      <guid>https://forem.com/tavari/maximum-concurrency-sub-quadratic-scaling-by-tokengate-14pc</guid>
      <description>&lt;h2&gt;
  
  
  What is TokenGate?
&lt;/h2&gt;

&lt;p&gt;TokenGate is a beta Python concurrency system built around a token-managed execution model. Instead of managing threads directly, you decorate synchronous functions and the system handles routing, admission, and worker assignment automatically.&lt;/p&gt;

&lt;p&gt;The core idea is simple: every function call becomes a token. That token moves through a lifecycle — created, waiting, admitted, executing, completed — while the coordinator manages a pool of core-pinned workers underneath. You interact with the public API, TokenGate handles everything else.&lt;/p&gt;
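&lt;p&gt;That lifecycle can be sketched as a simple state enum (the enum itself is illustrative; only the state names come from the post):&lt;/p&gt;

```python
from enum import Enum, auto

# Illustrative sketch of the token lifecycle described above.
class TokenState(Enum):
    CREATED = auto()
    WAITING = auto()
    ADMITTED = auto()
    EXECUTING = auto()
    COMPLETED = auto()

# The order a token moves through while the coordinator manages workers.
LIFECYCLE = [
    TokenState.CREATED,
    TokenState.WAITING,
    TokenState.ADMITTED,
    TokenState.EXECUTING,
    TokenState.COMPLETED,
]
```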

&lt;p&gt;As of v0.2.2.0 tokens are natively awaitable. That's what made this test possible to write cleanly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The decorated functions stay synchronous. The orchestrator stays async. TokenGate sits between them and keeps the pipeline full.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I did
&lt;/h2&gt;

&lt;p&gt;I dispatched 65,536 task tokens simultaneously and completed all of them in under 30 seconds. Zero failures. Here's what the numbers actually showed.&lt;/p&gt;

&lt;p&gt;What I found was an unexpected goldilocks zone — a region of task sustainability that not only exceeded worker capacity but held stable all the way to 65,536 simultaneous submissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;  RESULTS SUMMARY

  Wave   Tokens   OK    Fail      Time    Tok/s   Lat(ms)    Conc   Overlap   ΣTask(ms)
  -------------------------------------------------------------------------------------
  1      4        4     0       0.002s   1718.7    0.582ms   1.00×     2.39×       5.56ms
  2      8        8     0       0.002s   3781.8    0.264ms   2.20×     3.79×       8.01ms
  3      16       16    0       0.004s   4067.5    0.246ms   2.37×     5.42×      21.31ms
  4      32       32    0       0.007s   4392.7    0.228ms   2.56×    11.90×      86.71ms
  5      64       64    0       0.015s   4268.0    0.234ms   2.48×    23.01×     345.10ms
  6      128      128   0       0.029s   4475.1    0.223ms   2.60×    38.72×    1107.58ms
  7      256      256   0       0.059s   4331.8    0.231ms   2.52×    52.89×    3125.85ms
  8      512      512   0       0.113s   4532.8    0.221ms   2.64×    60.30×    6810.70ms
  9      1024     1024  0       0.246s   4165.0    0.240ms   2.42×    61.87×   15211.02ms
  10     2048     2048  0       0.505s   4052.4    0.247ms   2.36×    33.82×   17091.60ms
  11     4096     4096  0       0.980s   4181.1    0.239ms   2.43×    33.15×   32476.06ms
  12     8192     8192  0       2.163s   3786.8    0.264ms   2.20×    30.14×   65197.90ms
  13     16384    16384 0       4.327s   3786.7    0.264ms   2.20×    23.91×  103450.88ms
  14     32768    32768 0       8.754s   3743.3    0.267ms   2.18×    18.72×  163876.79ms
  15     65536    65536 0      17.314s   3785.2    0.264ms   2.20×    17.16×  297095.85ms
  -------------------------------------------------------------------------------------
  TOTAL  131068   131068 0      86.845s

  Avg latency across waves  : 0.268 ms/token
  Peak concurrency ratio    : 2.64×
  Peak overlap ratio        : 61.87×
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What "Sub-Quadratic Scaling" Really Means
&lt;/h3&gt;

&lt;p&gt;This isn't a special property of AI or machine learning; it's basic math that shows up anywhere work can be parallelized. A calculator running two operations simultaneously instead of sequentially is doing the same thing at a smaller scale, with the benefit depending on the return times. The term just describes a curve: double the input, less than double the cost.&lt;/p&gt;

&lt;p&gt;The overlap ratio — the rightmost metric — measures Σ(individual task execution times) divided by wave elapsed time. If every task ran back-to-back on a single thread it would read 1.0×. Values above 1× mean real parallel execution is happening. Values well above worker count mean something more interesting is going on.&lt;/p&gt;

&lt;p&gt;Workers that finish a short task don't wait for the wave to end. They immediately pull the next token. So a worker that cycles through eight short tasks within one wave window contributes eight task-durations to the overlap sum while only occupying one worker-slot in wall time. The total concurrent activity compounds.&lt;/p&gt;
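&lt;p&gt;The arithmetic behind the overlap column can be sketched directly (the numbers here are illustrative, not taken from the benchmark):&lt;/p&gt;

```python
def overlap_ratio(task_durations_ms, wave_elapsed_ms):
    """Sum of individual task execution times divided by wave wall time."""
    return sum(task_durations_ms) / wave_elapsed_ms

# Back-to-back on a single thread: the ratio reads 1.0
sequential = overlap_ratio([0.25] * 4, 1.0)

# One worker cycling through eight short tasks inside a 0.5 ms window
# contributes eight task-durations to the sum while occupying one slot,
# so the ratio climbs well above 1x
cycling = overlap_ratio([0.25] * 8, 0.5)
```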

&lt;p&gt;This is sub-quadratic scaling: doubling the token count costs less than double the time, because additional tokens slide into gaps that already exist in the schedule rather than adding full sequential cost. The system does more work per unit time as load increases, up to a point, and that point is wave 9.&lt;/p&gt;

&lt;p&gt;At 1024 tokens the overlap ratio peaks at 61.87×. Wave 10 transitions — workers saturate fully, rapid cycling slows, and the ratio settles near the hardware worker count. From there it holds. Wave 15 at 65,536 tokens: 17.16× sustained overlap, flat throughput, zero failures. The system found its floor and stayed there even when heavily overloaded.&lt;/p&gt;

&lt;h2&gt;
  
  
  What do these tasks look like?
&lt;/h2&gt;

&lt;p&gt;These are deliberately varied and non-trivial — a prime sieve, string manipulation, list sorting, and an iterative SHA-256 chain. All four are plain synchronous functions. The decorator is the only thing that makes them token-aware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task_token_guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operation_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cpu_crunch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;light&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cpu_crunch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Sum primes up to n — lightweight CPU-bound work.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
            &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;


&lt;span class="nd"&gt;@task_token_guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operation_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;string_ops&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;light&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;string_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate and mangle a string — lightweight string work.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ascii_letters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;E&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@task_token_guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operation_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data_transform&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;data_sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Sort a random list — medium CPU work.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@task_token_guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operation_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hash_compute&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;heavy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hash_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Iterative SHA-256 chain — heavier CPU-bound work.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hex&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  The entire orchestrator is async and touches no internal controls:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;submit_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_exceptions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Submit, await, report.&lt;/p&gt;
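&lt;p&gt;For readers who want to poke at the submit/await shape without the library, here is a minimal stand-in sketch, assuming a hypothetical &lt;code&gt;make_token&lt;/code&gt; helper built on &lt;code&gt;asyncio.to_thread&lt;/code&gt; (this is not TokenGate's implementation, just the pattern):&lt;/p&gt;

```python
import asyncio

# Hypothetical stand-in for the token model: each call returns an awaitable
# "token" that resolves on a worker thread once gathered.
def make_token(fn, *args):
    return asyncio.to_thread(fn, *args)

def square(n):
    return n * n

def submit_batch(target):
    # Submitting creates tokens; nothing executes until they are awaited.
    return [make_token(square, i) for i in range(target)]

async def main():
    tokens = submit_batch(4)
    results = await asyncio.gather(*tokens, return_exceptions=True)
    return results

print(asyncio.run(main()))  # prints [0, 1, 4, 9]
```

&lt;p&gt;The point of the sketch is the shape: token creation is cheap and synchronous, and all scheduling decisions happen at the single &lt;code&gt;gather&lt;/code&gt; call.&lt;/p&gt;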

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The architecture stabilizes under load rather than degrading. That's not an accident; it's a consequence of how token admission and worker pinning interact at scale.&lt;/p&gt;

&lt;p&gt;Your sweet spot will land in a different place than mine depending on your hardware. The test is in demo/: run it and see where your system peaks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tavari.online/" rel="noopener noreferrer"&gt;https://tavari.online/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>performance</category>
      <category>code</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built a Threading Engine - I Need Results (Feedback)</title>
      <dc:creator>Joseph Boone</dc:creator>
      <pubDate>Thu, 23 Apr 2026 04:26:37 +0000</pubDate>
      <link>https://forem.com/tavari/i-built-a-threading-engine-i-need-results-feedback-4enf</link>
      <guid>https://forem.com/tavari/i-built-a-threading-engine-i-need-results-feedback-4enf</guid>
      <description>&lt;p&gt;I've been building &lt;strong&gt;TokenGate&lt;/strong&gt; - an experimental Python concurrency engine that uses a token-based model to manage threaded tasks. No manual thread management, no futures, no ThreadPoolExecutor. Just a 3 line coordinator and single line decorators to manage all threading.&lt;/p&gt;

&lt;p&gt;On my machine (Ryzen 7800 / RTX 4070 Super) I'm seeing &lt;strong&gt;7.25x concurrency across 8 tasks of mixed work&lt;/strong&gt; and &lt;strong&gt;6.01x on sustained high variety workloads&lt;/strong&gt; - but that's just one setup. I want to know what it does on yours.&lt;/p&gt;

&lt;p&gt;Concurrency is measured across batches of 8 tasks in my testing scenarios to match my core count, and the reported ratios are normalized to that batch size. Higher concurrency ratios are possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm asking
&lt;/h2&gt;

&lt;p&gt;Try the demos (or make an app!), paste your results, that's it.&lt;/p&gt;

&lt;p&gt;The whole demo suite takes about 5 minutes to set up and the results &lt;br&gt;
tell me a lot about how the engine scales across different hardware.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to get started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Option 1 - Direct download (fastest):&lt;/strong&gt;&lt;br&gt;
Grab the beta zip from &lt;a href="https://tavari.online" rel="noopener noreferrer"&gt;tavari.online&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 2 - Clone the repo:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/TavariAgent/Py-TokenGate
&lt;span class="nb"&gt;cd &lt;/span&gt;Py-TokenGate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then check &lt;a href="https://github.com/TavariAgent/Py-TokenGate/blob/trunk/DOCS/BETA.md" rel="noopener noreferrer"&gt;BETA.md&lt;/a&gt; &lt;br&gt;
for the quick start.&lt;/p&gt;

&lt;h2&gt;
  
  
  What TokenGate actually does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Decorates synchronous functions with &lt;code&gt;@task_token_guard&lt;/code&gt; - one line, done&lt;/li&gt;
&lt;li&gt;Routes tasks through a token-managed thread pool automatically&lt;/li&gt;
&lt;li&gt;Built-in DoS protection to prevent the system from overwhelming itself&lt;/li&gt;
&lt;li&gt;Live telemetry via WebSocket GUI at &lt;code&gt;localhost:5000&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Works standalone or with the WebSocket dashboard&lt;/li&gt;
&lt;/ul&gt;
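&lt;p&gt;To make the decorator idea concrete, here is a hedged sketch of how a guard in this style &lt;em&gt;could&lt;/em&gt; wrap a sync function so each call yields an awaitable token instead of running immediately. This is my own illustrative stand-in, not TokenGate's actual source:&lt;/p&gt;

```python
import asyncio
import functools

# Illustrative sketch, not TokenGate's implementation: a guard decorator
# that turns each call into an awaitable "token" backed by a thread.
def task_token_guard(operation_type, tags=None):
    def wrap(fn):
        @functools.wraps(fn)
        def submit(*args, **kwargs):
            # The "token": work is dispatched to a thread, awaited later.
            return asyncio.to_thread(fn, *args, **kwargs)
        return submit
    return wrap

@task_token_guard(operation_type='demo', tags={'weight': 'light'})
def double(n):
    return n * 2

async def main():
    token = double(21)   # returns a token; nothing executes yet
    return await token   # resolves on a worker thread

print(asyncio.run(main()))  # prints 42
```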

&lt;h2&gt;
  
  
  What I want to hear back
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Your hardware (CPU model, core count)&lt;/li&gt;
&lt;li&gt;Your experience (how did it work for you?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop results here, open an Issue on GitHub or leave me feedback on &lt;a href="https://tavari.online" rel="noopener noreferrer"&gt;tavari.online&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is an active beta - rough edges are expected. &lt;br&gt;
I'm self-taught, and this is the work my learning experience produced. &lt;br&gt;
If it's useful or interesting to you, I would appreciate the feedback.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/TavariAgent/Py-TokenGate" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>code</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>1m Tokens (&amp; WebSocket)</title>
      <dc:creator>Joseph Boone</dc:creator>
      <pubDate>Thu, 19 Mar 2026 21:32:25 +0000</pubDate>
      <link>https://forem.com/tavari/1m-tokens-websocket-1f0c</link>
      <guid>https://forem.com/tavari/1m-tokens-websocket-1f0c</guid>
      <description>&lt;p&gt;Greetings readers, I made a threading engine with many optimizations (including ML) and WebSocket task controls per operation.  &lt;/p&gt;

&lt;p&gt;Even when computing a slow-converging series like Leibniz π at 1 million token executions, the tasks all resolved as expected in ~200 seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ── LAYER 0: TERM TOKENS ──────────────────────────────────────────────────────
&lt;/span&gt;&lt;span class="nd"&gt;@task_token_guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operation_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pi_term&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;light&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_pi_term&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Compute a single Leibniz term: (-1)^n / (2n + 1)
    Returns as string to preserve Decimal precision across token boundary.
    Light weight — 1,000,000 of these fire simultaneously.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;getcontext&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;prec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DECIMAL_PRECISION&lt;/span&gt;
    &lt;span class="n"&gt;sign&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
    &lt;span class="n"&gt;term&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sign&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;term&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ── LAYER 1: CHUNK TOKENS ─────────────────────────────────────────────────────
&lt;/span&gt;&lt;span class="nd"&gt;@task_token_guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operation_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pi_chunk&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;light&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sum_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;term_strings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Sum a batch of Leibniz terms.
    Receives resolved term strings from Layer 0 tokens.
    Light weight — 1,000 of these, each summing 1,000 terms.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;getcontext&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;prec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DECIMAL_PRECISION&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;term_strings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ── LAYER 2: PARTIAL TOKENS ───────────────────────────────────────────────────
&lt;/span&gt;&lt;span class="nd"&gt;@task_token_guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operation_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pi_partial&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sum_partial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_strings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Sum a batch of chunk sums.
    Receives resolved chunk strings from Layer 1 tokens.
    Medium weight — 10 of these, each summing 100 chunks.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;getcontext&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;prec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DECIMAL_PRECISION&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunk_strings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The Leibniz series was chosen deliberately because it is among the slowest-converging series for π: it needs ~10 million terms for 7 correct digits. That makes it a good stress test: maximum token volume, minimum mathematical payoff.&lt;/p&gt;
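&lt;p&gt;A quick back-of-envelope check of that convergence claim (ordinary sequential Python, not the token pipeline): the error of the Leibniz partial sum for π is roughly 1/N after N terms, so millions of terms buy only a handful of digits.&lt;/p&gt;

```python
from decimal import Decimal, getcontext

# Sequential reference computation, not TokenGate code: Leibniz partial sum.
def leibniz_pi(n_terms):
    getcontext().prec = 50
    total = Decimal(0)
    sign = Decimal(1)
    for n in range(n_terms):
        total += sign / Decimal(2 * n + 1)
        sign = -sign
    return 4 * total

# After 10,000 terms the error is on the order of 1/10,000, i.e. only
# about 4 correct digits; ~10 million terms are needed for 7 digits.
print(leibniz_pi(10_000))
```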

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnk3ei5arzovfacfrunv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnk3ei5arzovfacfrunv7.png" alt=" " width="742" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Note: 64 workers with SMT enabled is only ~7% faster on a 7800X3D — more workers doesn't always mean more throughput, especially for micro-ops where execution port contention becomes the real ceiling.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwc3m5s20g23ene0obdk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwc3m5s20g23ene0obdk6.png" alt=" " width="706" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tokens move through async admission and resolve on pinned workers: CPU-heavy tasks stay on core 1, while light tasks distribute across the rest. Failure nets, duplication safety, and WebSocket controls prevent runaway work at the process level.&lt;/p&gt;
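&lt;p&gt;The pinning rule can be sketched in a few lines. This is my reading of the description above, with hypothetical names (&lt;code&gt;mailboxes&lt;/code&gt;, &lt;code&gt;route&lt;/code&gt;), not the engine's source: heavy tokens pin to one core's mailbox, light tokens round-robin across the remaining cores.&lt;/p&gt;

```python
from collections import deque
from itertools import cycle

# Toy routing sketch (illustrative names, not TokenGate's internals):
# heavy tokens pin to core 0, light tokens round-robin over cores 1..7.
CORES = 8
mailboxes = {core: deque() for core in range(CORES)}
light_cores = cycle(range(1, CORES))

def route(token_weight, payload):
    core = 0 if token_weight == 'heavy' else next(light_cores)
    mailboxes[core].append(payload)
    return core

print(route('heavy', 'hash_chain'))  # prints 0
print(route('light', 'pi_term'))     # prints 1
print(route('light', 'pi_term'))     # prints 2
```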

&lt;p&gt;Take a look at the repo: &lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/TavariAgent" rel="noopener noreferrer"&gt;
        TavariAgent
      &lt;/a&gt; / &lt;a href="https://github.com/TavariAgent/Py-TokenGate" rel="noopener noreferrer"&gt;
        Py-TokenGate
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Beta Python concurrency model using token-managed routing
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;TokenGate&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;Welcome to the TokenGate repository.&lt;/p&gt;




&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;What it is:&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;A small experimental system for routing decorated synchronous functions&lt;br&gt;
through a token-managed concurrency model. It is intended to operate as&lt;br&gt;
its own concurrency workflow rather than alongside normal threading patterns.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;What it is not:&lt;/h3&gt;
&lt;/div&gt;
&lt;p&gt;It is not presented as production code.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Overview:&lt;/h3&gt;

&lt;/div&gt;
&lt;p&gt;TokenGate is an exploration of token-managed concurrency: a&lt;br&gt;
concept for coordinating async orchestration with thread-backed&lt;br&gt;
work in a structured way.&lt;/p&gt;
&lt;p&gt;This repository is &lt;strong&gt;a proof of concept, not a finished product&lt;/strong&gt;.&lt;br&gt;
It is experimental, still evolving, and shared in the spirit of&lt;br&gt;
exploration.&lt;/p&gt;
&lt;p&gt;If you'd like the fuller overview, please start here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/TavariAgent/Py-TokenGate/./DOCS/proof-of-concept.md" rel="noopener noreferrer"&gt;Proof of Concept&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If anything here is useful, interesting, or sparks an&lt;br&gt;
idea, that already makes this project worthwhile.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;How to Use (Two Versions, Two Decorators)&lt;/h2&gt;

&lt;/div&gt;
&lt;blockquote&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Note: Do not attempt to decorate an async function.&lt;/h3&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h4 class="heading-element"&gt;&lt;em&gt;The token decorator uses asyncio, but the decorated function itself should&lt;/em&gt;&lt;/h4&gt;…&lt;/div&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/TavariAgent/Py-TokenGate" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>programming</category>
      <category>python</category>
      <category>webdev</category>
      <category>performance</category>
    </item>
    <item>
      <title>Threading Async Together</title>
      <dc:creator>Joseph Boone</dc:creator>
      <pubDate>Mon, 16 Mar 2026 22:39:11 +0000</pubDate>
      <link>https://forem.com/tavari/threading-async-together-hf1</link>
      <guid>https://forem.com/tavari/threading-async-together-hf1</guid>
      <description>&lt;p&gt;Hello readers,&lt;/p&gt;

&lt;p&gt;I built a proof-of-concept application I call TokenGate. It’s a high-performance async/threaded event bus with control mechanisms designed to be extremely minimalist.&lt;/p&gt;

&lt;p&gt;The core concept is to produce parallelism in concurrent operations through async token gathering and coordinated threading workers.&lt;/p&gt;

&lt;p&gt;Here's what "TokenGate" uses to thread an operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# -- Python 3.12 -- #
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;token_system&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;task_token_guard&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;operations_coordinator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OperationsCoordinator&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Decorated standard synchronous function for threading
&lt;/span&gt;&lt;span class="nd"&gt;@task_token_guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operation_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;string_ops&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;light&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;string_operation_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# This function is now threaded
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Starts the coordinator (through a running loop)
&lt;/span&gt;&lt;span class="n"&gt;coordinator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OperationsCoordinator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;coordinator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 3. finally or an exception stops on close
&lt;/span&gt;&lt;span class="n"&gt;coordinator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Task tokens are generated by a wrapped decorator.&lt;/p&gt;

&lt;p&gt;Here are some test results for operations in a "release mechanism" that dispatches batches of mixed tasks incrementally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CONCURRENCY BURST: Medium x8 | release 1464 (8 tasks)
======================================================================
  Submit spread (barrier jitter): 0.19ms
  Overall wall-clock:             0.009045s
  Min task duration:              0.007818s
  Max task duration:              0.008432s
  Mean task duration:             0.008148s
  Stdev (clustering indicator):   0.000218s

  Duration per task (tight clustering = true concurrency):
    Task 00: 0.007928s  
    Task 01: 0.008000s  
    Task 02: 0.008136s  
    Task 03: 0.008209s  
    Task 04: 0.008432s  
    Task 05: 0.008300s  
    Task 06: 0.008362s  
    Task 07: 0.007818s  

  Serial estimate (sum):  0.065186s
  Actual wall-clock:      0.009045s
  Concurrency ratio:      7.21x  (concurrent)

CONCURRENCY BURST [Medium x8 | release 1464] PASSED
======================================================================
CONCURRENCY WINDOW: Sustained mixed releases (30s)
======================================================================
  Releases:                       1484
  Total tasks:                    11872
  Overall wall-clock:             30.070291s
  Min task duration:              0.001157s
  Max task duration:              0.105874s
  Mean task duration:             0.014970s
  Stdev (clustering indicator):   0.025983s

  Serial estimate (sum):          177.728067s
  Actual wall-clock:              30.070291s
  Sustained concurrency ratio:    5.91x  (concurrent)

CONCURRENCY WINDOW [Sustained mixed releases (30s)] PASSED

CONCURRENCY SUITE COMPLETE.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Concurrency ratios of up to 7.21x were observed on an 8-core CPU with ~32 dynamic workers &lt;em&gt;in ideal conditions&lt;/em&gt;, which is roughly 90% of the 8x concurrent operation ceiling.)&lt;/p&gt;
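&lt;p&gt;For clarity on how that ratio is derived (my reading of the report above, using its own numbers): the serial estimate is the sum of the per-task durations, and the ratio divides it by the measured wall clock.&lt;/p&gt;

```python
# Per-task durations and wall clock copied from the burst report above.
durations = [0.007928, 0.008000, 0.008136, 0.008209,
             0.008432, 0.008300, 0.008362, 0.007818]
wall_clock = 0.009045

serial_estimate = sum(durations)          # what one core doing them in
ratio = serial_estimate / wall_clock      # sequence would have cost

print(f"{ratio:.2f}x")  # prints 7.21x
```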

&lt;p&gt;I've tested a wide variety of normally threaded operations, with results delivered as expected.&lt;/p&gt;

&lt;p&gt;It's still just a proof of concept, but I've used it in various side projects with good results.&lt;/p&gt;

&lt;p&gt;For anyone interested here's my project on GitHub (with proofs):&lt;/p&gt;

&lt;p&gt;Repo link - &lt;a href="https://github.com/TavariAgent/Py-TokenGate" rel="noopener noreferrer"&gt;https://github.com/TavariAgent/Py-TokenGate&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>opensource</category>
      <category>code</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
