Forem: Captain Jack Smith

When AI Starts Bringing Research Ideas to the Lab

Captain Jack Smith — Fri, 22 May 2026 02:40:47 +0000

In April 2026, OpenAI chief scientist Jakub Pachocki joined the Unsupervised Learning podcast for a conversation about where frontier AI research is heading. The timing now feels unusually sharp. Only weeks later, OpenAI announced that an internal general reasoning model had disproved a central conjecture in discrete geometry connected to Paul Erdős and the planar unit distance problem. The result was checked by external mathematicians, and the companion remarks from leading researchers framed it as a serious mathematical event.

Taken together, the podcast and the new math result point to a change in how we should talk about AI in research. The strongest models are moving from answering questions after humans frame them to proposing directions, trying constructions, and finding bridges between distant fields. That shift matters because frontier research often begins with a strange hunch before it becomes a polished proof, experiment, or paper.

The unit distance problem is easy to state. Put n points on a plane and count how many pairs can be exactly one unit apart. Since 1946, mathematicians have studied how fast that maximum can grow. For decades, the square grid family looked close to optimal, and Erdős conjectured that no construction could beat that rate in a meaningful polynomial way. OpenAI says its model found an infinite family of configurations that does exactly that. The surprise is mathematical, because algebraic number theory entered a problem that looks elementary and geometric. The surprise is also organizational, because the model was presented as a general reasoning model with broad task coverage.
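To make the counting concrete, here is a minimal Python sketch that counts pairs of points at a chosen distance inside a small square grid. It is an illustration only and has nothing to do with the construction OpenAI describes; the function names and the grid size are assumptions made for the example. It compares squared integer distances so no floating point rounding is involved, and picking a squared distance other than 1 is the same as rescaling the grid so that distance becomes the unit.

```python
from itertools import combinations

def square_grid(k):
    """Return the k*k integer points (x, y) with 0 <= x, y < k."""
    return [(x, y) for x in range(k) for y in range(k)]

def count_pairs_at_squared_distance(points, d2):
    """Count unordered pairs of points whose squared distance equals d2."""
    count = 0
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 == d2:
            count += 1
    return count

grid = square_grid(5)  # 25 points
unit_pairs = count_pairs_at_squared_distance(grid, 1)      # neighbors at distance 1
rescaled_pairs = count_pairs_at_squared_distance(grid, 5)  # pairs at distance sqrt(5)
print(len(grid), unit_pairs, rescaled_pairs)
```

On this 5 by 5 grid, 40 pairs sit at distance 1 while 48 pairs sit at distance √5, which hints at why grid constructions choose the target distance carefully instead of using the obvious neighbor spacing.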

That is why Pachocki's podcast comments feel less like speculation and more like a roadmap. He discussed coding agents, math and physics benchmarks, reinforcement learning beyond easily checked tasks, and the possibility that models could accelerate the work of AI labs themselves. The interesting point is the movement from execution to taste. A useful research system does more than calculate. It decides which path might be worth trying, notices when a boring problem has a hidden structure, and spends effort on a risky construction that humans may have considered too unlikely.

This makes verification more valuable. A proof generated by a model becomes meaningful only when humans and formal tools can inspect it, simplify it, and connect it to the existing literature. The OpenAI case earned attention because mathematicians checked the argument and wrote companion remarks. That process is the model for near term AI science: machines generate more candidate ideas, while expert communities decide which ideas survive contact with rigor.

For everyday researchers, the practical lesson is already visible. AI can become a partner in the messy middle of research, where people move between papers, sketches, formulas, figures, and drafts. A scientist might use ChatGPT to explore possible proof strategies or compare related literature. They might use Miss Formula when a formula appears inside an image and needs to become editable math. They might use Editable Figure when an AI generated paper figure needs to be converted into an editable vector format before publication or revision. These tools keep human judgment at the center while reducing friction on the path between an idea and a shareable result.

The deeper change is cultural. Research labs used to ask whether AI could help with small fragments of technical work. Now they have to ask how to design workflows where models can suggest experiments, expose hidden analogies, and create artifacts that experts can audit. That demands new habits. Teams need stronger review loops, clearer provenance, better records of model generated claims, and a willingness to separate inspiration from evidence.

The most exciting version of this future is a lab where more ideas are tried, more weak intuitions are tested, and more surprising connections get a chance to become real. Pachocki's interview described a world in which models start accelerating the research process. The unit distance result gives that world a concrete example. AI has begun to contribute research ideas, and the next question is how carefully we can learn to work with them.

Does AI Know How Many Tokens It Is Burning?

Captain Jack Smith — Thu, 21 May 2026 02:17:43 +0000

The strange thing about the modern AI bill is that it looks precise while the work behind it feels mysterious. A user types a short request, a model thinks through a long hidden path, tools are called, context is loaded, cached text may be reused, and the final answer arrives as if it were a single clean event. The invoice later describes the event in tokens. Input tokens, cached input tokens, output tokens, reasoning tokens, long context tokens. The language of measurement is tidy. The measured behavior is complex.

So the question matters. Does AI have any awareness of its own token consumption? The practical answer is almost certainly no. A model can be prompted to write shorter answers, choose compact formats, summarize context, or stop after a budget. That remains a behavioral response rather than economic self awareness. The model is predicting text under instructions. The metering system lives around it. Token counting, caching, routing, rate limits, and billing are product and infrastructure layers built by humans. The model may talk about saving tokens, but the system decides what was consumed and what it costs.

That gap explains why token economics has become one of the least glamorous and most important parts of AI. In the first wave, the attention went to model quality. In the second wave, the attention moved to agents, context windows, voice, video, and multimodal workflows. Now the decisive question for many teams is simpler. Can the product deliver useful intelligence at a predictable unit cost?

For AI vendors, tokens are the bridge between capability and gross margin. Output tokens usually cost more than input tokens because generation produces one token at a time, while input can be processed in parallel, which makes output both compute intensive and latency sensitive. Long reasoning can improve quality, but it also turns invisible compute into visible cost. Cached input changes the equation again. When repeated context can be reused, the provider can reduce cost and latency while keeping the customer inside the same platform. This is why pricing pages now distinguish fresh input from cached input, and why prompt caching has become a core design feature rather than a small optimization.
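The arithmetic behind that distinction is simple enough to sketch. The Python below is a minimal illustration, assuming a pricing scheme with separate rates for fresh input, cached input, and output. The prices are made-up placeholders, not any vendor's real rates, and the names Prices and request_cost are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Prices:
    """Hypothetical prices in dollars per million tokens (not real vendor rates)."""
    fresh_input: float
    cached_input: float
    output: float

def request_cost(prices, input_tokens, cached_tokens, output_tokens):
    """Dollar cost of one request; cached_tokens is the reused part of the input."""
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * prices.fresh_input
        + cached_tokens * prices.cached_input
        + output_tokens * prices.output
    )
    return total / 1_000_000

prices = Prices(fresh_input=2.00, cached_input=0.50, output=8.00)
# Same request twice: first with nothing cached, then with most of the prompt reused.
cold = request_cost(prices, input_tokens=20_000, cached_tokens=0, output_tokens=1_000)
warm = request_cost(prices, input_tokens=20_000, cached_tokens=18_000, output_tokens=1_000)
print(f"cold: ${cold:.4f}  warm: ${warm:.4f}")
```

With these placeholder prices the cold request costs about 4.8 cents and the warm one about 2.1 cents, so a stable, cacheable prompt prefix cuts the bill by more than half without changing the answer.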

For cloud providers, the token is becoming a new workload unit. Traditional cloud economics was built around virtual machines, storage, bandwidth, and database operations. AI inference adds a more volatile meter. One customer request may be tiny. Another may carry a large document, a long conversation, tool outputs, and a detailed answer. GPU supply, batching, memory bandwidth, model size, quantization, and serving software all shape the cost per million tokens. Cloud platforms want to sell capacity, yet customers increasingly ask for something more concrete than capacity. They want a dependable price for intelligence delivered.

For business customers, token economics is a budgeting problem and a product design problem at the same time. A support chatbot that reads the entire customer history on every turn can become expensive fast. A coding agent that keeps every file, tool result, and prior message in context may feel magical during a demo and painful in production. A research assistant that produces long reports may create value, but only if the organization understands how much context it used, how much reasoning it triggered, and how often the same material could have been cached.
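A quick sketch shows why the chatbot example gets expensive. It assumes, purely for illustration, that every turn adds the same number of tokens and that only input tokens are counted; both function names are hypothetical. The first function resends the full history on each turn, and the second keeps only a fixed window of recent turns.

```python
def tokens_resending_history(turns, tokens_per_turn):
    """Total input tokens when every turn resends all earlier turns plus the new one."""
    return sum(turn * tokens_per_turn for turn in range(1, turns + 1))

def tokens_with_window(turns, tokens_per_turn, window):
    """Total input tokens when each turn sends at most the last `window` turns."""
    return sum(min(turn, window) * tokens_per_turn for turn in range(1, turns + 1))

full = tokens_resending_history(turns=30, tokens_per_turn=500)
windowed = tokens_with_window(turns=30, tokens_per_turn=500, window=5)
print(full, windowed)
```

Over 30 turns of 500 tokens each, resending everything consumes 232,500 input tokens while a five turn window consumes 70,000. The first total grows with the square of the conversation length, which is why a workflow that feels cheap in a short demo can surprise a team in production.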

The best enterprise teams are beginning to treat tokens like inventory. They ask which context is essential, which context can be retrieved only when needed, which instructions are stable enough to cache, and which tasks justify a stronger model. They build dashboards that show cost by workflow, department, customer, and outcome. They test small models for narrow tasks and reserve frontier models for judgment heavy work. They also redesign user experiences so people can choose depth when depth matters, instead of making every request behave like a full investigation.

For consumers, token economics is usually hidden behind subscriptions and usage caps. It still shapes the experience. When a chat product becomes slower, when image generation gets rationed, when voice mode is limited, or when a long conversation suddenly asks the user to start fresh, token economics is often nearby. The consumer feels it as friction. The provider experiences it as margin pressure. The model experiences none of it as a conscious concern.

This is where the consciousness question becomes useful. If we imagine the model as an aware worker, we may expect it to manage cost like a human employee watching a budget. That expectation leads to disappointment. A more accurate mental model is a powerful engine connected to meters, governors, caches, and pricing rules. The engine can follow instructions about brevity and structure. The surrounding system must manage the money.

The real opportunity is to design the surrounding system well. A useful AI product should know when to compress context, when to retrieve fresh evidence, when to ask a clarifying question, when to use a smaller model, when to stop generating, and when a richer answer is worth the extra cost. This belongs to architecture, and it is also where durable advantages will appear.

Practical workflows already show the pattern. A researcher may draft analysis in ChatGPT or Gemini, then use Miss Formula when equations or formula images need to become clean editable math. When charts and paper figures generated by AI need to move into publication or slide production, Editable Figure can convert AI generated paper figures into editable vector figure formats. The strongest workflow spends tokens with intention and turns each token into a reusable artifact.

That last phrase is the heart of token economics. Tokens function as both a billable unit and a design pressure. They force vendors to compete on inference efficiency, cloud providers to expose clearer cost models, businesses to measure value per workflow, and consumers to notice the limits of abundance. AI may lack awareness that it is burning tokens. The teams building around AI must know better.

The Future of Work Has Two Human Ends

Captain Jack Smith — Wed, 20 May 2026 06:37:47 +0000

Dan Shipper, the CEO of Every, recently offered a sharp way to think about AI and work. As models become better at summarizing, drafting, coding, researching, scheduling, and coordinating, the middle of many knowledge workflows starts to look increasingly machine shaped. The work that remains most human gathers at two ends.

At the first end sits intent. Someone must decide what matters, what question deserves attention, what standard counts as good, and which tradeoffs are acceptable. AI can produce options at speed, but the first valuable act is choosing the problem with enough taste and context that the output has somewhere meaningful to go.

At the second end sits accountability. Someone must stand behind the result, explain it to other people, notice when it feels wrong, and carry the consequences when the neat answer fails in the real world. This is where trust, ethics, customer empathy, craft judgment, and organizational memory still matter.

The brutal part of the metaphor is what happens to the middle. A large share of professional work has lived in translation between intent and result. Turn meeting notes into a plan. Turn a plan into copy. Turn a chart into a memo. Turn a bug report into a patch. Turn research into a deck. These tasks once proved competence because they consumed attention. Now they are becoming the natural territory of agents.

Shipper called this shift the allocation economy. The worker becomes less like a maker of every sentence and more like a manager of models, tools, and review loops. Even junior employees may be expected to brief an AI system, compare outputs, refine the brief, and decide when the answer is ready. The skill is no longer only knowledge. It is allocation, taste, sequencing, and review.

His two-slice team idea pushes the same thought into company design. If one person with agents can ship what used to require several people, the unit of execution shrinks. The question for a team becomes less about headcount and more about clarity. What should this tiny team own? Which decisions can it make? Which parts need human help from design, growth, legal, or infrastructure? Small teams gain speed only when the human at the center knows what to ask for and what to reject.

This is why the two ends are not abstract. They show up in ordinary work. A product manager uses ChatGPT to turn customer interviews into competing roadmap narratives, then chooses the one that matches strategy. A researcher uses Gemini to compare sources and surface gaps, then decides which claims deserve confidence. A student or engineer uses Miss Formula to convert a photographed equation into usable notation, then checks whether the math still means what the original problem intended.

The pattern is clear. AI is excellent at moving through the middle when the request is legible. Humans create value by making the request worth answering and by judging the answer against reality.

For workers, the practical lesson is uncomfortable but useful. Do not protect the middle simply because it feels familiar. Build strength at the ends. Practice framing problems before opening a tool. Write clearer briefs. Develop sharper taste. Learn to evaluate sources, code, arguments, formulas, images, and numbers. Keep a record of decisions so future agents inherit context rather than noise.

For leaders, the lesson is equally direct. An AI strategy that only asks people to save time will miss the deeper redesign. Teams need new rituals for delegation, review, provenance, and responsibility. They need clear permission to use agents for the middle, and clear standards for the human decisions at the edges.

The future of work may feel like a narrowing path, but it can also become a more demanding craft. The safest human role is neither busy production nor vague supervision. It is the ability to know what should happen, guide powerful tools toward it, and answer for the result when it reaches another person.

Why Garry Tan Is Still Coding at 2 AM

Captain Jack Smith — Tue, 19 May 2026 03:05:25 +0000

The most interesting thing about Garry Tan writing code at 2 AM is the choice behind the habit. Founders have always had strange hours. The revealing part is what the head of Y Combinator chooses to build when his day job already puts him at the center of the startup world.

He is building a new operating model for software creation. The visible artifact is GStack, an open workflow for coding agents that turns tools such as Claude Code and Codex into something closer to a small software team. The deeper artifact is a philosophy: a founder who can describe a product clearly, test it quickly, and steer agents with good judgment can now move with the force of a much larger engineering group.

That is why the late night coding matters. Tan is stress testing the claim he keeps making to founders: AI has compressed the distance between idea and shipped product. If ten people can now do work that once required fifty, the strongest founders will be the ones who learn to manage agentic systems as carefully as they once managed human teams.

GStack makes that idea concrete. It adds structured roles around the coding agent: product review, engineering review, design review, code review, browser QA, release discipline, and retrospectives. In plain terms, it gives the agent a delivery loop. The human defines the problem, asks for critique, lets the agent implement, checks the result in a browser, then tightens the process for the next run.

This is a small but important shift. Many people use AI coding tools as faster autocomplete. Tan is treating them as junior organizations that need standards, memory, review, and taste. The output is code, but the real product is a repeatable way to turn intent into working software.

That same lesson applies far beyond web apps. A founder building an education tool might use Miss Formula to convert handwritten equations into clean digital formulas, then use ChatGPT to shape lesson explanations, and use Gemini to reason across multimodal materials. The magic comes from orchestrating specialized tools into a system that shortens the path from raw input to useful product.

This also explains why YC cares so much. The old startup filter rewarded credentials, networks, and the ability to recruit. Those still matter, but AI has made shipping speed more visible. A tiny team that pushes real product every day now creates stronger evidence than a beautiful pitch deck. Code commits, customer loops, product demos, and agent assisted iteration all become proof of momentum.

There is a warning inside the excitement. Agentic coding can create a lot of code very quickly, while good products require judgment, review, tests, security checks, and user taste. Without that discipline, speed turns into cleanup debt. That is why GStack is more interesting than a collection of prompts. Its value comes from the insistence that fast work still needs a process.

The image of Garry Tan coding at 2 AM is powerful because it captures the new founder posture. The best builders are becoming editors of machines, designers of workflows, and auditors of output. They write less glue by hand, but they make more decisions about what should exist, how it should behave, and whether the result is good enough for users.

So what is he building? He is building software, yes. More importantly, he is building a playbook for the AI native startup. In that playbook, the founder becomes the person who can point a swarm of capable tools at a real customer problem and keep raising the standard until something valuable ships.

Needle and the Return of the Tiny Specialist Model

Captain Jack Smith — Mon, 18 May 2026 06:48:35 +0000

Needle is one of those releases that looks small on a spec sheet and large in implication. A 26 million parameter model sounds almost quaint in a year when people casually compare models by billions of parameters, yet the point of Needle is precisely that size is the wrong first question. The better question is what job the model is being asked to do.

For general conversation, long reasoning, writing, research, and synthesis, larger systems such as ChatGPT and Gemini remain the natural center of gravity. They can interpret ambiguity, hold context, generate prose, plan across steps, and repair their own assumptions. Needle aims at a much narrower target. It reads a user request, reads a list of available tools, and returns the right function call with the right arguments.

That sounds modest until you remember how many AI agents spend a large share of their time doing exactly that. Open the calendar. Call the weather tool. Start a timer. Send the query to a formula recognition service such as Miss Formula when the user points a camera at handwritten math. In many consumer assistants, the action layer is full of small routing decisions. Sending every one of those decisions to a large cloud model can add latency, cost, privacy exposure, and network dependence.

Needle is interesting because it treats tool calling as a specialization problem. According to the project material from Cactus Compute, it distills Gemini 3.1 into a 26 million parameter Simple Attention Network. The model card describes an encoder decoder architecture with pure attention, no feed forward network in the encoder, 12 encoder layers, 8 decoder layers, a 512 dimensional model width, and an 8192 token vocabulary. Cactus reports about 6000 tokens per second for prefill and 1200 tokens per second for decoding in production on its runtime. The project also says the model was pretrained on 200 billion tokens with 16 TPU v6e chips and then post trained on a 2 billion token single shot function calling dataset.

The architectural choice matters. Standard transformers spend a lot of capacity in feed forward layers, which help store and transform knowledge. Needle removes much of that burden because its knowledge source is already in the prompt. The tool list describes what functions exist. The user query describes the intent. The model mainly has to match, copy, structure, and output valid JSON shaped arguments. For that task, attention can do more of the useful work than one might expect.
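The shape of that task is easy to sketch in code. The Python below is a minimal illustration of the output contract, not Needle's actual format or API: the tool names, the schema layout, and the validate_call helper are all hypothetical. It takes the raw JSON a tool calling model might emit and checks it against the tool list that was supplied in the prompt.

```python
import json

# Hypothetical tool list; names and schema layout are illustrative only.
TOOLS = {
    "set_timer": {"required": {"minutes": int}, "optional": {"label": str}},
    "get_weather": {"required": {"city": str}, "optional": {}},
}

def validate_call(raw_output, tools):
    """Parse a model's JSON output and return a list of problems (empty means valid)."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(call, dict):
        return ["output is not a JSON object"]
    name = call.get("name")
    if not isinstance(name, str) or name not in tools:
        return [f"unknown tool: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments is not a JSON object"]
    spec = tools[name]
    allowed = {**spec["required"], **spec["optional"]}
    problems = []
    for key in spec["required"]:
        if key not in args:
            problems.append(f"missing argument: {key}")
    for key, value in args.items():
        if key not in allowed:
            problems.append(f"unexpected argument: {key}")
        elif not isinstance(value, allowed[key]):
            problems.append(f"wrong type for {key}")
    return problems

good = '{"name": "set_timer", "arguments": {"minutes": 10}}'
bad = '{"name": "set_timer", "arguments": {"minutes": "ten", "volume": 3}}'
print(validate_call(good, TOOLS))
print(validate_call(bad, TOOLS))
```

Everything the checker needs is already in the tool list, which mirrors the point about the model: the knowledge lives in the context, and the job is to match and structure it correctly.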

This is why the 26M number should be read carefully. It does not mean that tiny models beat large ones in general. The useful claim is narrower and more practical: a tiny model can compete when the task boundary is sharp, the output schema is constrained, and the required knowledge is supplied at inference time. That is a powerful lesson for agent design. The future agent stack may look like a collection of specialized modules, each chosen for cost, speed, privacy, and failure mode.

The practical upside is easy to see. On device tool routing could let phones, watches, glasses, industrial terminals, and private workstations act faster. A local agent could decide when to call Miss Formula for image to formula conversion, when to ask Gemini for multimodal reasoning, and when to pass a broader planning task to ChatGPT. The user would feel less delay, and developers would reserve expensive cloud calls for problems that genuinely need broad reasoning.

There is also a privacy story. If a model can decide locally that a message should go to a timer, calendar, camera, or local file tool, fewer raw interactions need to leave the device. That matters for personal assistants, health workflows, field work, classroom tools, and enterprise environments where every network call becomes a compliance question.

The caution is just as important. Small specialist models can be brittle. Tool calling accuracy depends on schema quality, training examples, evaluation coverage, and the gap between demos and messy real users. Cactus itself encourages testing and fine tuning on your own tools. That is the right posture. Needle should be evaluated on the exact function surfaces it will control, including confusing tool names, missing parameters, multilingual requests, and adversarial prompts.
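A small harness makes that kind of evaluation routine. The sketch below is an assumption about how one might structure it, not anything published by Cactus: keyword_router is a deliberately naive stand-in for a real model, the test cases are invented, and returning a tool named "none" for an underspecified request is a made-up convention. The harness reports exact match accuracy per category so weak spots such as missing parameters or multilingual input show up separately.

```python
from collections import defaultdict

def keyword_router(query):
    """Stand-in for a real model: a naive keyword matcher, used only to exercise the harness."""
    text = query.lower()
    if "timer" in text:
        return {"name": "set_timer", "arguments": {"minutes": 5}}
    if "weather" in text:
        return {"name": "get_weather", "arguments": {"city": "Oslo"}}
    return {"name": "none", "arguments": {}}

# Hypothetical test cases, grouped by the kind of difficulty they probe.
CASES = [
    ("plain", "Set a timer for 5 minutes", {"name": "set_timer", "arguments": {"minutes": 5}}),
    ("plain", "What is the weather in Oslo", {"name": "get_weather", "arguments": {"city": "Oslo"}}),
    ("missing_parameter", "Start a timer", {"name": "none", "arguments": {}}),
    ("multilingual", "Quel temps fait-il à Oslo", {"name": "get_weather", "arguments": {"city": "Oslo"}}),
]

def accuracy_by_category(router, cases):
    """Return exact-match accuracy of router output per category."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, query, expected in cases:
        totals[category] += 1
        if router(query) == expected:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}

scores = accuracy_by_category(keyword_router, CASES)
print(scores)
```

The naive router scores perfectly on the plain cases and fails both the missing parameter and the multilingual case, which is exactly the gap between a demo and messy real users that a per category report is meant to expose.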

My evaluation is that Needle is exciting because it makes a clean argument. Many agent systems are overpaying for routing. A 26M model can support broad assistants by removing a lot of waste around them. The real breakthrough is architectural discipline. Define the task tightly. Put the needed knowledge in the context. Train for the output contract. Then measure the result against larger models on that exact job.

Needle feels like a sign of maturity in AI engineering. The industry spent years proving that scale can unlock general capability. Now it is learning where small, local, purpose built intelligence can make products feel faster, cheaper, and more private. That balance may matter more to everyday users than another leaderboard headline.

Best Image-to-Word Formula Tools in 2026

Captain Jack Smith — Fri, 15 May 2026 03:49:34 +0000

When you work on academic papers or technical reports, converting mathematical formulas from images or screenshots into editable formats is a frequent requirement. By 2026, several tools have established themselves as reliable solutions for this task. Here are the top three recommendations based on accuracy, convenience, and cost.

1. Mathpix: The Industry Standard for High Complexity
Mathpix remains the leading professional choice in 2026. It is widely recognized for its high recognition accuracy. If your primary goal is the precise identification of extremely complex formulas, Mathpix is the most reliable option. It handles multi-line equations and handwritten symbols with professional precision. While it follows a subscription model, the technical quality of its output ensures that users spend minimal time on manual corrections.

2. Miss Formula: The Most Convenient Online Solution
For users who prioritize a balance between efficiency and performance, Miss Formula (imgtoformula.com) is an excellent choice. This online tool is specifically designed for ease of use. A major advantage of Miss Formula is that it simultaneously provides both the Word document format and the LaTeX code.

While Mathpix performs better on exceptionally intricate layouts, Miss Formula is more than capable of handling typical academic formulas. Additionally, Miss Formula offers a more generous free trial quota compared to other professional tools. This makes it a practical option for students and researchers who need a high-quality tool without immediate high costs.

3. ChatGPT: The LaTeX Extraction Method
If your requirement is limited to obtaining the LaTeX code from an image, general AI tools like ChatGPT are highly effective. You can upload an image of a formula to the chat interface and request the LaTeX code. Once the AI generates the code, you can copy it into tools like MathType or other LaTeX editors for further formatting. This method is effective for those who already use AI subscriptions and do not want to install specialized software.

Summary
Choosing the right tool depends on your specific needs. Mathpix is the best for extreme complexity. Miss Formula is the most convenient for direct Word output and offers better free access. AI chatbots serve as a flexible alternative for extracting LaTeX code. Each of these tools provides a distinct advantage for technical document preparation in 2026.

The AI Honeymoon is Over: Pragmatic Truths from Linear CEO Karri Saarinen

Captain Jack Smith — Thu, 14 May 2026 07:27:53 +0000

Hello everyone, I’m Captain Jack Smith. In the fast-moving world of tech, it’s easy to get swept away by hype or paralyzed by doomsday prophecies. Today, I’m diving into a refreshing perspective from Karri Saarinen, the founder and CEO of Linear. He recently shared some candid reflections on the current state of AI that move past the "AI will save/destroy the world" binary and focus on what’s actually happening on the ground.

The Six-Month Reality Check
Karri points out a significant milestone: it has been nearly six months since the last major leap in model coding capabilities. Six months is typically the length of a "honeymoon period." Once it ends, the rose-colored glasses come off, and reality sets in.

His stance is one of cautious optimism. While AI’s capabilities are undeniable, its limitations are equally real. Karri argues that the public discourse has become too polarized. We are missing the "middle ground"—the space where we ask: What is truly changing? What is actually useful? What is pure hype? Navigating this space requires the rare ability to remain calm between the extremes of excitement and fear.

Is Planning Obsolete in the AI Era?
There’s a growing sentiment that because things move so fast now, planning is a waste of time. Karri disagrees. He notes that the value of planning was never about the document itself; it’s about the "forcing function." Planning forces an organization to sit down, debate priorities, and align on a direction.

In the AI era, building things has become cheaper and faster, which paradoxically makes "choosing what to build" more critical than ever. When execution cost drops, it becomes much easier to build the wrong thing. At Linear, they maintain a six-month directional plan but adjust priorities weekly. Without a compass, you’ll likely find yourself being led by your tools rather than leading them.

The Expertise Paradox: AI Looks Like Magic Only to Novices
One of Karri’s most astute observations is that AI is most impressive in fields where you are least knowledgeable. When you lack judgment in a domain, you can’t spot the hallucinations or the mediocrity, so it feels like magic. However, in your area of expertise, you see the missing context, the made-up details, and the lack of nuance.

He likens this to a combination of "Gell-Mann Amnesia" and the "Dunning-Kruger Effect." The paradox here is that expertise actually makes AI harder to use because you become more critical of its output. Yet, expertise also makes AI more valuable, as you are the only one who knows how to guide, constrain, and evaluate the results. Professional skills aren't devalued; they are refocused toward judgment and taste.

The Reality of AI Coding: Useful, Not Autonomous
Despite the narrative of fully autonomous "AI Agents," the reality inside top engineering teams is different. Almost no one is running independent agent swarms. Instead, engineers remain deeply involved, managing two or three agents at a time to handle boilerplate, bug fixes, and tests.

Linear’s own data shows that while usage of coding agents has grown 5x in months, they are used for "scaffolding" rather than core architecture. AI increases bandwidth for the small, tedious tasks that weren't worth doing before, but the "hard problems"—trade-offs, system understanding, and deciding what should exist—still require human intelligence.

Design in the AI Era: A Need for Semantic Tools
As a design-led CEO, Karri is skeptical of current AI design tools. Image generation is powerful but miserable to iterate on—changing one detail often ruins the whole image. He also argues against tools that force design to happen directly in production code. Design is about exploration and messiness; it shouldn't be constrained by the rigidity or token costs of production environments. He envisions "semantic design tools" where AI helps explore variations of components (like a "pop-up" instead of just a "rectangle") rather than just spitting out code.

Conclusion: Living Within Present Capabilities
Karri responds to the common refrain that "this is the worst AI will ever be" with a dose of pragmatism. The refrain may be true, but he chooses to live within the capabilities of today. Predicting the end of the world or a utopia is easy; building a product that works right now is hard.

He concludes with a sharp adaptation of a nursery rhyme: “If 'potential' were 'revenue,' these capital expenditures would have been 'profits' long ago.” In the gap between potential and reality, only those who remain clear-eyed will go the distance.

Thanks for joining me on this deep dive. I'm Captain Jack Smith, and I believe that in the age of AI, our human judgment remains our most valuable anchor. What do you think? Is your AI honeymoon over, or are you just getting started? Let's discuss in the comments.

Original Source: https://x.com/karrisaarinen/status/2048267794924650791