<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: toydev</title>
    <description>The latest articles on Forem by toydev (@toydev).</description>
    <link>https://forem.com/toydev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3658500%2F0dc4dbcf-29d0-43ed-be80-aa7d5a9520d7.png</url>
      <title>Forem: toydev</title>
      <link>https://forem.com/toydev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/toydev"/>
    <language>en</language>
    <item>
      <title>Not Magic, Just Diligent Thinking — Peeking into LLM Reasoning</title>
      <dc:creator>toydev</dc:creator>
      <pubDate>Sun, 18 Jan 2026 15:04:50 +0000</pubDate>
      <link>https://forem.com/toydev/not-magic-just-diligent-thinking-peeking-into-llm-reasoning-43j1</link>
      <guid>https://forem.com/toydev/not-magic-just-diligent-thinking-peeking-into-llm-reasoning-43j1</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Have you heard of the LLM Reasoning feature?&lt;br&gt;
I had no idea about it until recently — and it's impressive!&lt;/p&gt;

&lt;p&gt;Reasoning is a feature where the LLM "thinks" step-by-step before generating an answer.&lt;/p&gt;

&lt;p&gt;Normally, achieving step-by-step reasoning like Chain of Thought means implementing a thinking loop of multiple API calls in your application.&lt;br&gt;
Models with a Reasoning feature do this automatically within a single request.&lt;/p&gt;

&lt;p&gt;In this article, I'll peek into Reasoning's thinking process using the following maze experiment as an example.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/toydev/2d-spatial-recognition-with-local-llm-comparing-prompt-strategies-4h64"&gt;2D Spatial Recognition with Local LLM: Comparing Prompt Strategies&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The experiment above investigated local LLMs' 2D spatial recognition ability using mazes, primarily with gpt-oss:20b on Ollama.&lt;br&gt;
Testing multiple models revealed the following differences in maze-solving ability:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Reasoning&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gpt-oss:20b&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;80%+ accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;deepseek-r1:14b&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Good (limited testing due to time)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma3:12b&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;~50% accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I hypothesized that the difference might be due to the presence of the Reasoning feature.&lt;br&gt;
So I decided to actually look at what kind of thinking is happening.&lt;br&gt;
Seeing is believing — I gained a clear understanding of how the Reasoning feature works.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. How to Use Reasoning
&lt;/h2&gt;

&lt;p&gt;Before looking at the maze experiment results, let's confirm the basic usage of Reasoning.&lt;/p&gt;

&lt;p&gt;Below are code examples for viewing Reasoning's thinking content.&lt;br&gt;
I prefer Node.js, but Python works just as well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node.js&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Ollama&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ollama&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ollama&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Ollama&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-oss:20b&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Alice is older than Bob, and Bob is older than Charlie. Who is older, Alice or Charlie?&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="c1"&gt;// gpt-oss:20b uses 'low', 'medium', 'high'. Cannot be disabled.&lt;/span&gt;
  &lt;span class="c1"&gt;// deepseek-r1:14b uses true / false.&lt;/span&gt;
  &lt;span class="na"&gt;think&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;medium&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;=== thinking ===&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;thinking&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;=== content ===&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gpt-oss:20b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Alice is older than Bob, and Bob is older than Charlie. Who is older, Alice or Charlie?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="c1"&gt;# gpt-oss:20b uses 'low', 'medium', 'high'. Cannot be disabled.
&lt;/span&gt;    &lt;span class="c1"&gt;# deepseek-r1:14b uses True / False.
&lt;/span&gt;    &lt;span class="n"&gt;think&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=== thinking ===&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=== content ===&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's an example output from running the above sample:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== thinking ===
The user asks: "Alice is older than Bob, and Bob is older than Charlie. Who is older, Alice or Charlie?" So we have Alice &amp;gt; Bob &amp;gt; Charlie. Therefore Alice is older than Charlie. The answer is Alice.
=== content ===
Alice is older.
Since Alice &amp;gt; Bob &amp;gt; Charlie, Alice is the oldest of the three.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. Peeking into Reasoning During Maze Experiments
&lt;/h2&gt;

&lt;p&gt;Now let's look at Reasoning in the maze experiments.&lt;/p&gt;

&lt;p&gt;I'll show Reasoning result examples using the following maze.&lt;br&gt;
S is Start, G is Goal, # is a wall.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#####
#S#G#
# # #
#   #
#####
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The original article compared multiple prompt strategies.&lt;br&gt;
Here I'll show results from three of them to illustrate how the prompt strategy affects Reasoning.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt Strategy&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Original Article Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;list&lt;/td&gt;
&lt;td&gt;List of walkable coordinates&lt;/td&gt;
&lt;td&gt;◎ Fastest &amp;amp; most accurate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;graph&lt;/td&gt;
&lt;td&gt;Adjacency list format&lt;/td&gt;
&lt;td&gt;△ Only effective for small sizes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;simple&lt;/td&gt;
&lt;td&gt;ASCII visualization&lt;/td&gt;
&lt;td&gt;✗ Slow at all sizes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The strategy I first thought of was &lt;code&gt;simple&lt;/code&gt; — it seemed the most human-readable, yet it performed the worst.&lt;br&gt;
Conversely, &lt;code&gt;list&lt;/code&gt; — unintelligible to a human as a maze — turned out to be the fastest and most accurate.&lt;/p&gt;

&lt;p&gt;Let's look at the prompts and the thinking content for each strategy.&lt;br&gt;
They reveal a lot about why these results occurred.&lt;/p&gt;

&lt;p&gt;Note: The examples shown were selected from multiple runs for clarity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reasoning Result for &lt;code&gt;list&lt;/code&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Prompt
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a bot in a 2D maze. Your goal is to find the path from Start to Goal.

Walkable positions: ["(1,1)","(3,1)","(1,2)","(3,2)","(1,3)","(2,3)","(3,3)"]

Positions:
- Start: (1,1)
- Goal: (3,1)
- Current: (2,3)

You have visited the following positions in order:
(1,1) -&amp;gt; (1,2) -&amp;gt; (1,3) -&amp;gt; (2,3)

What is your next move from your current position?

Note: In this coordinate system, y increases downward.
- up: y-1
- down: y+1
- left: x-1
- right: x+1

Return your answer as a JSON object with a "move" key, which can be one of "up", "down", "left", or "right".
Example: {"move": "up"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Thinking Content
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;We have a maze positions: (1,1),(3,1),(1,2),(3,2),(1,3),(2,3),(3,3). Start at (1,1) visited sequence: (1,1)-&amp;gt;(1,2)-&amp;gt;(1,3)-&amp;gt;(2,3). Current (2,3). Goal is (3,1). We need to decide next move. We have visited positions in order; need to find path to goal. Let's examine adjacency.

Allowed moves: up/down/left/right if resulting coordinate in walkable list.

From (2,3): possible moves: up: (2,2). Not in list. Down: (2,4). Not. Left: (1,3). That is in list. Right: (3,3). That is in list. We have visited (1,3) already. So next should go to (3,3) perhaps to get closer to goal.

From (3,3): possible moves: up (3,2) (walkable). left (2,3) visited. right (4,3) not. down (3,4) not.

From (3,2): possible moves: up (3,1) goal. left (2,2) not. right (4,2) not. down (3,3) visited.

So path: (2,3)-&amp;gt;(3,3)-&amp;gt;(3,2)-&amp;gt;(3,1).

Thus next move from (2,3) should be right to (3,3). So answer: {"move": "right"}.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
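&lt;p&gt;The procedure the model follows here, checking each of the four neighbors against the walkable list and skipping visited cells, can be mirrored in a few lines of ordinary code. A minimal sketch (the function name and data shapes are mine, not from the experiment repository):&lt;/p&gt;

```javascript
// Candidate next moves: neighbors that are walkable and not yet visited.
// walkable and visited use the prompt's "(x,y)" string format.
function nextMoves(walkable, current, visited) {
  const deltas = { up: [0, -1], down: [0, 1], left: [-1, 0], right: [1, 0] };
  const walkSet = new Set(walkable);
  const seen = new Set(visited);
  const moves = [];
  for (const [move, delta] of Object.entries(deltas)) {
    const pos = "(" + (current[0] + delta[0]) + "," + (current[1] + delta[1]) + ")";
    if (walkSet.has(pos)) {
      if (!seen.has(pos)) moves.push(move);
    }
  }
  return moves;
}

// From (2,3) in the maze above, the only walkable unvisited neighbor is (3,3):
const moves = nextMoves(
  ["(1,1)", "(3,1)", "(1,2)", "(3,2)", "(1,3)", "(2,3)", "(3,3)"],
  [2, 3],
  ["(1,1)", "(1,2)", "(1,3)", "(2,3)"]
);
console.log(moves); // ["right"]
```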



&lt;h3&gt;
  
  
  Reasoning Result for &lt;code&gt;graph&lt;/code&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Prompt
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;... omitted ...

Graph (adjacency list): each key "x,y" maps to adjacent walkable positions.
{
  "1,1": [ "1,2" ],
  "3,1": [ "3,2" ],
  "1,2": [ "1,1", "1,3" ],
  "3,2": [ "3,1", "3,3" ],
  "1,3": [ "1,2", "2,3" ],
  "2,3": [ "1,3", "3,3" ],
  "3,3": [ "3,2", "2,3" ]
}

... omitted ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Thinking Content
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;We need to find next move from current position (2,3). Maze positions: graph edges show adjacency. Current at (2,3) has adjacent: "1,3" and "3,3". We have visited: (1,1)-&amp;gt;(1,2)-&amp;gt;(1,3)-&amp;gt;(2,3). So visited positions set includes those. We want next move that hasn't been visited yet? Usually path search avoiding revisiting. So from (2,3) options: (1,3) visited, (3,3) unvisited. So next move: go right to (3,3). That will bring closer to goal (3,1). So answer: {"move":"right"}. Ensure JSON.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
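&lt;p&gt;With the adjacency list available, the model is essentially doing a graph search in prose. The same search in application code is a textbook breadth-first search (a sketch; the function name is mine):&lt;/p&gt;

```javascript
// Shortest path over an adjacency list keyed by "x,y" strings.
function bfsPath(graph, start, goal) {
  const queue = [[start]];
  const seen = new Set([start]);
  while (queue.length) {
    const path = queue.shift();
    const node = path[path.length - 1];
    if (node === goal) return path;
    for (const next of graph[node] || []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(path.concat(next));
      }
    }
  }
  return null; // goal unreachable
}

// The adjacency list from the prompt above:
const graph = {
  "1,1": ["1,2"],
  "3,1": ["3,2"],
  "1,2": ["1,1", "1,3"],
  "3,2": ["3,1", "3,3"],
  "1,3": ["1,2", "2,3"],
  "2,3": ["1,3", "3,3"],
  "3,3": ["3,2", "2,3"]
};
const path = bfsPath(graph, "2,3", "3,1");
console.log(path); // the same route the model found: 2,3 then 3,3 then 3,2 then 3,1
```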



&lt;h3&gt;
  
  
  Reasoning Result for &lt;code&gt;simple&lt;/code&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Prompt
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;... omitted ...

Legend: 'S' = Start, 'G' = Goal, '#' = Wall, ' ' = Path, 'C' = Current position

Maze:
#####
#S#G#
# # #
# C #
#####

... omitted ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Thinking Content
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;We need to parse the maze and determine positions. The maze:

Line1: "#####"
Line2: "#S#G#"
Line3: "# # #"
Line4: "# C #"
Line5: "#####"

Coordinates: origin (0,0) at top-left? Let's interpret: The coordinate system with x horizontal increasing to right, y increasing downward. Maze width 5, height 5. Provided visited positions: (1,1) -&amp;gt; (1,2) -&amp;gt; (1,3) -&amp;gt; (2,3). They start at (1,1) presumably start S? Let's map.

Let's create grid mapping. Let's index rows (y) from 0 to 4. Row 0: "#####": columns 0-4. Row 1: "#S#G#". Row2: "# # #". Row3: "# C #". Row4: "#####".

Now find coordinates: (1,1) is x=1, y=1: That is Row1, col1: 'S' indeed. So start at S. (1,2) x=1, y=2: Row2, col1: That's '#'? Wait row2 is "# # #": columns: 0 '#',1 ' ',2 '#',3 ' ',4 '#'. So col1 is space. So path. Good. (1,3) x=1, y=3: Row3, col1: Row3 "# C #": columns: 0 '#',1 ' ',2 'C',3 ' ',4 '#'. So col1 is space. Good. (2,3) x=2, y=3: Row3, col2: That's 'C' current position. So yes current position is at (2,3). Maze positions: Walls '#'.

Goal G is at (3,1) because row1 col3: Let's check: Row1 "#S#G#": columns 0 '#',1 'S',2 '#',3 'G',4 '#'. So goal at (3,1). Starting at (1,1). We visited positions as given. Currently at (2,3). We need to decide next move to continue path to goal. Let's check adjacency: from (2,3), we can move up to (2,2), down to (2,4), left to (1,3), right to (3,3). Need to avoid walls '#'.

Check each coordinate:

(2,2) Row2 col2: Row2 "# # #": col2 '#', so wall. Not allowed.

(2,4) Row4 col2: Row4 "#####": col2 '#', wall.

(1,3) Row3 col1: space. We visited that earlier. But we can consider revisiting? Usually you shouldn't revisit but maybe allowed. But path may need to go left? But left to (1,3) leads back to where we were.

(3,3) Row3 col3: Row3 "# C #": col3 ' '. So open.

Thus only feasible move that is not wall: left (back) or right. But we need to find path to G at (3,1). From (2,3), right to (3,3) leads us closer horizontally. From there, we could go up to (3,2) row2 col3? Let's check row2 col3: Row2 "# # #": col3 ' '. So we could go up. Then from (3,2) up to (3,1) which is G. So path: right, up, up? Wait check from (3,3) up to (3,2) is row2 col3: space. Then up to (3,1) is G. So sequence: right, up, up. So the next move should be right.

Let's double-check if any other path might be considered better. If we go left back to (1,3) we could then go up to (1,2) which we visited, then right to (2,2) which is wall. So that fails. So the only viable route is right. So answer: {"move":"right"}.

Return JSON.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
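&lt;p&gt;The character-by-character parsing the model performs above is exactly the kind of work application code could do up front. A sketch of such a parser (assuming S, G, C, and spaces all count as walkable; names are mine):&lt;/p&gt;

```javascript
// Convert an ASCII maze into the coordinate representation.
function parseMaze(lines) {
  const walkable = [];
  let start = null;
  let goal = null;
  let current = null;
  lines.forEach(function (row, y) {
    Array.from(row).forEach(function (ch, x) {
      if (ch !== "#") walkable.push("(" + x + "," + y + ")");
      if (ch === "S") start = [x, y];
      if (ch === "G") goal = [x, y];
      if (ch === "C") current = [x, y];
    });
  });
  return { walkable: walkable, start: start, goal: goal, current: current };
}

const maze = parseMaze(["#####", "#S#G#", "# # #", "# C #", "#####"]);
console.log(maze.walkable); // identical to the walkable list the list strategy sends
console.log(maze.start, maze.goal, maze.current); // [1,1] [3,1] [2,3]
```

&lt;p&gt;Doing this conversion in code and handing the model the &lt;code&gt;list&lt;/code&gt; form is effectively why &lt;code&gt;list&lt;/code&gt; is fast: the model skips the parsing stage entirely.&lt;/p&gt;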






&lt;h2&gt;
  
  
  4. Not Magic, Just Diligent Thinking
&lt;/h2&gt;

&lt;p&gt;Until I saw Reasoning's thinking content, I mistakenly believed some kind of magic was happening.&lt;br&gt;
But now that I've seen it, Reasoning is not magic.&lt;br&gt;
It's clearly a feature that diligently thinks through text to derive results.&lt;/p&gt;

&lt;p&gt;Let's review the thinking content for each strategy.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;list&lt;/code&gt; uses the list of walkable coordinates to determine passage in all four directions.&lt;br&gt;
&lt;code&gt;graph&lt;/code&gt; has adjacency information, making the judgment concise.&lt;br&gt;
&lt;code&gt;simple&lt;/code&gt; parses the maze string character by character, constructs the 2D coordinate system itself, then starts solving.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;simple&lt;/code&gt; thinking is particularly striking.&lt;br&gt;
While humans can understand the maze structure at a glance, the LLM diligently parses it as a string.&lt;br&gt;
I've also seen expressions like "U-shaped" in other runs, suggesting it does have some sense of shapes.&lt;br&gt;
However, since it thinks by putting things into words, I feel it's different from how humans think while looking at shapes.&lt;/p&gt;

&lt;p&gt;Reasoning isn't magic — it's diligent thinking using words.&lt;br&gt;
Once you understand this, it becomes easier to predict what information you need to give an LLM.&lt;br&gt;
Of course, things won't always go as expected.&lt;br&gt;
When that happens, observing the thinking content as we did here should yield insights.&lt;/p&gt;

&lt;p&gt;By choosing a model with Reasoning capability, you can accomplish fairly advanced tasks without implementing your own thinking logic.&lt;/p&gt;

&lt;p&gt;We're now in an era where you can easily try Reasoning with local LLMs.&lt;br&gt;
I run Ollama's gpt-oss:20b on an AMD Ryzen 7 7700 / GeForce RTX 4070 (12GB VRAM) setup.&lt;br&gt;
With 24% CPU / 76% GPU offload settings, it runs fast enough for experimental use.&lt;/p&gt;

&lt;p&gt;The maze experiment code used in this article is available in the following repository.&lt;br&gt;
If you're interested, try running it and peek into Reasoning's thinking content.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/toydev/llm-maze-solver" rel="noopener noreferrer"&gt;https://github.com/toydev/llm-maze-solver&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Digression
&lt;/h2&gt;

&lt;p&gt;By actually viewing the thinking content, I gained a deeper understanding of how prompt strategies affect results and why &lt;code&gt;list&lt;/code&gt; is the fastest and most accurate.&lt;/p&gt;

&lt;p&gt;I conducted maze experiments to examine 2D spatial recognition, but finding that the core is just diligent word-based exploration was somewhat disappointing — because I had expected something more magical.&lt;/p&gt;

&lt;p&gt;On the other hand, knowing this might open possibilities for designing prompt representations suited to specific purposes.&lt;br&gt;
Also, for simple route finding to destinations, you can provide a pathfinding algorithm as a tool (Function Calling / Tool Use).&lt;br&gt;
Depending on the purpose, other tools could be provided as well.&lt;br&gt;
Spending LLM capacity on plain search is wasteful; what we really want is flexible judgment adapted to the situation.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>reasoning</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>2D Spatial Recognition with Local LLM: Comparing Prompt Strategies</title>
      <dc:creator>toydev</dc:creator>
      <pubDate>Mon, 12 Jan 2026 15:15:19 +0000</pubDate>
      <link>https://forem.com/toydev/2d-spatial-recognition-with-local-llm-comparing-prompt-strategies-4h64</link>
      <guid>https://forem.com/toydev/2d-spatial-recognition-with-local-llm-comparing-prompt-strategies-4h64</guid>
      <description>&lt;h1&gt;
  
  
  1. Introduction
&lt;/h1&gt;

&lt;p&gt;My GPU was occupied by LLM experiments throughout the 2025-2026 winter break, but it's finally free now!&lt;/p&gt;

&lt;p&gt;I investigated the &lt;strong&gt;2D spatial recognition ability&lt;/strong&gt; of a local LLM (gpt-oss:20b) using mazes as the subject.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Motivation: Wanted to understand how LLMs perceive space for autonomous navigation&lt;/li&gt;
&lt;li&gt;Method: Ask "which direction next?" for each cell with structured output&lt;/li&gt;
&lt;li&gt;Prompts: Tried multiple strategies since I didn't know what works best&lt;/li&gt;
&lt;li&gt;Source &amp;amp; Results: Published on &lt;a href="https://github.com/toydev/llm-maze-solver/tree/article-2026-01" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: The prompt I initially thought of turned out to be the worst.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local LLM can handle 2D spatial recognition&lt;/strong&gt; - gpt-oss:20b achieved sufficient accuracy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt strategy makes a big difference&lt;/strong&gt; - Response times differ severalfold between strategies&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  2. Experiment Setup
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OS: Windows 11 / WSL2 (Ubuntu)&lt;/li&gt;
&lt;li&gt;CPU: AMD Ryzen 7 7700&lt;/li&gt;
&lt;li&gt;GPU: GeForce RTX 4070 (12GB VRAM)&lt;/li&gt;
&lt;li&gt;LLM Runtime: Ollama&lt;/li&gt;
&lt;li&gt;Experiment Code: Node.js + TypeScript + @langchain/ollama 1.1.0&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Model
&lt;/h2&gt;

&lt;p&gt;Used &lt;a href="https://ollama.com/library/gpt-oss" rel="noopener noreferrer"&gt;gpt-oss:20b&lt;/a&gt;. Recommended VRAM is 16GB, but it runs on 12GB with CPU offloading (24% CPU / 76% GPU).&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Strategies
&lt;/h2&gt;

&lt;p&gt;Compared 4 strategies (see links for prompt output examples):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/toydev/llm-maze-solver/blob/article-2026-01/src/prompt/strategies/simple.test.ts" rel="noopener noreferrer"&gt;simple&lt;/a&gt;&lt;/strong&gt; - ASCII visualization of maze&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#####
#S#G#
# # #
#   #
#####
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/toydev/llm-maze-solver/blob/article-2026-01/src/prompt/strategies/matrix.test.ts" rel="noopener noreferrer"&gt;matrix&lt;/a&gt;&lt;/strong&gt; - Binary matrix for walls/paths&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[[1,1,1,1,1],[1,0,1,0,1],[1,0,1,0,1],[1,0,0,0,1],[1,1,1,1,1]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/toydev/llm-maze-solver/blob/article-2026-01/src/prompt/strategies/list.test.ts" rel="noopener noreferrer"&gt;list&lt;/a&gt;&lt;/strong&gt; - List of walkable coordinates&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;["(1,1)","(3,1)","(1,2)","(3,2)","(1,3)","(2,3)","(3,3)"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/toydev/llm-maze-solver/blob/article-2026-01/src/prompt/strategies/graph.test.ts" rel="noopener noreferrer"&gt;graph&lt;/a&gt;&lt;/strong&gt; - Adjacency list format&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"1,1":["1,2"],"1,2":["1,1","1,3"],"1,3":["1,2","2,3"],...}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
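&lt;p&gt;All four representations can be derived mechanically from the ASCII form. For example, &lt;strong&gt;matrix&lt;/strong&gt; is a short transformation (a sketch, assuming the S and G cells count as path):&lt;/p&gt;

```javascript
// ASCII maze to binary matrix: 1 = wall, 0 = path (S and G treated as path).
function toMatrix(lines) {
  return lines.map(function (row) {
    return Array.from(row).map(function (ch) {
      return ch === "#" ? 1 : 0;
    });
  });
}

const matrix = toMatrix(["#####", "#S#G#", "# # #", "#   #", "#####"]);
console.log(JSON.stringify(matrix));
// [[1,1,1,1,1],[1,0,1,0,1],[1,0,1,0,1],[1,0,0,0,1],[1,1,1,1,1]]
```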



&lt;h2&gt;
  
  
  Mazes
&lt;/h2&gt;

&lt;p&gt;Used 4 sizes (5x5 to 15x15) x 2 categories (&lt;a href="https://github.com/toydev/llm-maze-solver/tree/article-2026-01/mazes" rel="noopener noreferrer"&gt;maze list&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;corridor&lt;/strong&gt; - Walled passages&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;straight&lt;/th&gt;
&lt;th&gt;branch&lt;/th&gt;
&lt;th&gt;dead-end&lt;/th&gt;
&lt;th&gt;loop&lt;/th&gt;
&lt;th&gt;spiral&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd75ved9frji9n9jj0oml.png" alt="straight" width="33" height="33"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zm1t3olryrs1vzt2fqd.png" alt="branch" width="33" height="33"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcp2zwgivplkco4yym9b.png" alt="dead-end" width="33" height="33"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffgg69niydqgf35srvxn.png" alt="loop" width="33" height="33"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3igy2vx8f47i4im9bvs0.png" alt="spiral" width="33" height="33"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;open&lt;/strong&gt; - Open spaces with obstacles&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;empty&lt;/th&gt;
&lt;th&gt;pass&lt;/th&gt;
&lt;th&gt;detour&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wbszthinfkhxscpx5ov.png" alt="empty" width="33" height="33"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dmm47ldxjshh54fq3ff.png" alt="pass" width="33" height="33"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxn3itksqheqtclbx94lf.png" alt="detour" width="33" height="33"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Black=wall, White=path, Green=Start, Red=Goal&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  History Option
&lt;/h2&gt;

&lt;p&gt;History refers to the path taken to reach the current cell (e.g., &lt;code&gt;(1,1) -&amp;gt; (1,2) -&amp;gt; (2,2)&lt;/code&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With: Include history in prompt&lt;/li&gt;
&lt;li&gt;Without: Exclude history from prompt&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Evaluation Method
&lt;/h2&gt;

&lt;p&gt;For each cell in the maze, ask "which direction should I go next?" and record success/failure and response time.&lt;/p&gt;

&lt;p&gt;A correct answer is defined as &lt;strong&gt;any direction that gets closer to the goal&lt;/strong&gt;. It doesn't need to be the shortest route.&lt;/p&gt;
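&lt;p&gt;"Gets closer to the goal" can be made precise with a distance map computed by BFS from the goal: a move is correct if it lands on a walkable cell whose distance is one less than the current cell's. A sketch of such a checker (not the repository's actual code; names and data shapes are mine):&lt;/p&gt;

```javascript
// Distance of every walkable cell from the goal, via BFS on the 4-neighborhood.
function goalDistances(walkSet, goal) {
  const dist = new Map([[goal.join(","), 0]]);
  const queue = [goal];
  const deltas = [[0, -1], [0, 1], [-1, 0], [1, 0]];
  while (queue.length) {
    const p = queue.shift();
    for (const d of deltas) {
      const k = (p[0] + d[0]) + "," + (p[1] + d[1]);
      if (walkSet.has(k)) {
        if (!dist.has(k)) {
          dist.set(k, dist.get(p.join(",")) + 1);
          queue.push([p[0] + d[0], p[1] + d[1]]);
        }
      }
    }
  }
  return dist;
}

// All directions from the current cell that strictly reduce the goal distance.
function correctMoves(walkable, goal, current) {
  const dist = goalDistances(new Set(walkable), goal);
  const here = dist.get(current.join(","));
  const deltas = { up: [0, -1], down: [0, 1], left: [-1, 0], right: [1, 0] };
  const good = [];
  for (const [move, d] of Object.entries(deltas)) {
    const k = (current[0] + d[0]) + "," + (current[1] + d[1]);
    if (dist.get(k) === here - 1) good.push(move);
  }
  return good;
}

// In the 5x5 example maze, only "right" gets closer from (2,3):
const good = correctMoves(
  ["1,1", "3,1", "1,2", "3,2", "1,3", "2,3", "3,3"],
  [3, 1],
  [2, 3]
);
console.log(good); // ["right"]
```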




&lt;h1&gt;
  
  
  3. Results
&lt;/h1&gt;

&lt;p&gt;Each combination was run once. Consider this as reference data for observing trends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scale Verification
&lt;/h2&gt;

&lt;p&gt;Results from testing all sizes x all strategies on representative mazes (corridor_straight / open_empty), with history enabled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy (%)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;simple&lt;/th&gt;
&lt;th&gt;matrix&lt;/th&gt;
&lt;th&gt;list&lt;/th&gt;
&lt;th&gt;graph&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5x5&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7x7&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11x11&lt;/td&gt;
&lt;td&gt;82&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15x15&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Response Time (sec/cell)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;simple&lt;/th&gt;
&lt;th&gt;matrix&lt;/th&gt;
&lt;th&gt;list&lt;/th&gt;
&lt;th&gt;graph&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5x5&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7x7&lt;/td&gt;
&lt;td&gt;77&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11x11&lt;/td&gt;
&lt;td&gt;313&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15x15&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;41&lt;/td&gt;
&lt;td&gt;190&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;The 15x15 runs for matrix and simple were abandoned due to time constraints.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;list is the fastest and most accurate strategy&lt;/strong&gt;, and the gap widens as maze size increases. simple degraded to 313 sec/cell (over 5 minutes per cell) at 11x11.&lt;/p&gt;

&lt;h2&gt;
  
  
  Effect of History
&lt;/h2&gt;

&lt;p&gt;Comparing history on/off with list strategy at 11x11 (category averages).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy (%)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;No History&lt;/th&gt;
&lt;th&gt;With History&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;corridor&lt;/td&gt;
&lt;td&gt;82&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;open&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Response Time (sec/cell)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;No History&lt;/th&gt;
&lt;th&gt;With History&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;corridor&lt;/td&gt;
&lt;td&gt;230&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;open&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For corridor types, &lt;strong&gt;history enabled is about 2x faster&lt;/strong&gt;. Open types show little difference.&lt;/p&gt;




&lt;h1&gt;
  
  
  4. Conclusion
&lt;/h1&gt;

&lt;h2&gt;
  
  
  gpt-oss:20b's 2D Spatial Recognition Ability
&lt;/h2&gt;

&lt;p&gt;gpt-oss:20b has sufficient 2D spatial recognition ability to navigate mazes.&lt;br&gt;
With 80%+ accuracy, it can reach the goal within about 1.5x the shortest route.&lt;/p&gt;

&lt;p&gt;Response time with the list strategy is around 30 sec/cell at 11x11.&lt;br&gt;
That rules out real-time processing, but it is practical enough for casual use with a local LLM.&lt;/p&gt;

&lt;p&gt;For comparison, I briefly tested two other models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Impression&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gemma3:12b&lt;/td&gt;
&lt;td&gt;~50% accuracy, not practical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;deepseek-r1:14b&lt;/td&gt;
&lt;td&gt;Not as good as gpt-oss:20b, but promising&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I believe the Reasoning capability common to gpt-oss:20b and deepseek-r1:14b plays a significant role.&lt;/p&gt;

&lt;p&gt;Interestingly, in my environment deepseek-r1:14b runs at 100% GPU utilization, yet gpt-oss:20b, running at only 76% GPU, is both faster and more accurate.&lt;/p&gt;

&lt;h2&gt;
  
  
  list + history = best
&lt;/h2&gt;

&lt;p&gt;Prompt strategy significantly affects both accuracy and response time.&lt;/p&gt;

&lt;p&gt;My first strategy was simple: I expected a format that is intuitive for humans to work well, but it turned out to be the worst.&lt;/p&gt;

&lt;p&gt;Next I tried graph, a structured format for pathfinding.&lt;br&gt;
It was faster than simple for small mazes, giving me hope, but response time degraded as size increased.&lt;br&gt;
I believe this is due to the increase in adjacency information.&lt;/p&gt;

&lt;p&gt;matrix is a structured version of simple, but didn't produce good results either.&lt;/p&gt;

&lt;p&gt;I never expected list, a bare coordinate list that humans can't even read as a maze, to come out on top.&lt;/p&gt;
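&lt;p&gt;For illustration, converting a character-grid maze into such a coordinate list might look like the sketch below. The exact prompt format used in the experiment is in the linked repository; the cell labels here are hypothetical.&lt;/p&gt;

```python
def maze_to_list(rows):
    # rows: strings using '#' for wall, '.' for path, 'S' start, 'G' goal.
    # Emits one "(x,y): kind" line per cell; the labels are hypothetical,
    # not the exact prompt format from the experiment repository.
    kinds = {"#": "wall", ".": "path", "S": "start", "G": "goal"}
    lines = []
    for y, row in enumerate(rows):
        for x, ch in enumerate(row):
            lines.append(f"({x},{y}): {kinds[ch]}")
    return "\n".join(lines)
```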

&lt;p&gt;Regarding history, I think knowing "how I got here" simply helps the model decide the next direction.&lt;br&gt;
It is especially helpful for corridor types.&lt;br&gt;
However, history also means more tokens to process; if only the last few steps actually matter, there may be room for optimization.&lt;/p&gt;




&lt;p&gt;This article and code were created in collaboration with Claude Code.&lt;br&gt;
I also asked Claude Code to create the mazes for the experiment, but it struggled to create them accurately in simple format, so I had to manually adjust them quite a bit.&lt;br&gt;
If even Claude Code struggles with simple format, perhaps it's no surprise that local LLMs do too.&lt;/p&gt;

&lt;p&gt;Source code and experiment data are available on &lt;a href="https://github.com/toydev/llm-maze-solver/tree/article-2026-01" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Feel free to try it out if you're interested.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>promptengineering</category>
      <category>langchain</category>
    </item>
    <item>
      <title>Eclipse WTP: JaCoCo Coverage Not Recognized When Running Tomcat in Debug Mode</title>
      <dc:creator>toydev</dc:creator>
      <pubDate>Fri, 12 Dec 2025 11:11:51 +0000</pubDate>
      <link>https://forem.com/toydev/eclipse-wtp-jacoco-coverage-not-recognized-when-running-tomcat-in-debug-mode-4d00</link>
      <guid>https://forem.com/toydev/eclipse-wtp-jacoco-coverage-not-recognized-when-running-tomcat-in-debug-mode-4d00</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;While using &lt;strong&gt;Eclipse WTP&lt;/strong&gt; to run a dynamic web application on &lt;strong&gt;Tomcat&lt;/strong&gt;, I encountered an issue where &lt;strong&gt;JaCoCo coverage results depended on how the server was launched&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Specifically, coverage was collected correctly when Tomcat was started with &lt;em&gt;Run&lt;/em&gt;, but not when started with &lt;em&gt;Debug&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This article documents the investigation, the findings, and the practical conclusion drawn from them. The goal is not only to fix the issue, but also to clarify &lt;strong&gt;why collecting coverage in Debug mode is inherently fragile&lt;/strong&gt; in this environment.&lt;/p&gt;




&lt;h1&gt;
  
  
  Preconditions
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;A dynamic web application is launched on Tomcat using &lt;strong&gt;Eclipse WTP&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Tomcat is started from Eclipse using &lt;strong&gt;Run As&lt;/strong&gt; or &lt;strong&gt;Debug As&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;JaCoCo is attached manually via JVM options (EclEmma coverage launch is not available for WTP servers)&lt;/li&gt;
&lt;li&gt;Execution data is imported into Eclipse Coverage view or processed via JaCoCo Ant tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  JaCoCo agent configuration
&lt;/h2&gt;

&lt;p&gt;Example VM argument used for Tomcat startup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-javaagent:${project_loc:PROJECT_NAME}/lib/jacoco/jacocoagent.jar=destfile=${project_loc:PROJECT_NAME}/target/jacoco/jacoco.exec,append=false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Windows 11&lt;/li&gt;
&lt;li&gt;Eclipse (as of 2025-12-12)

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Eclipse IDE for Enterprise Java and Web Developers&lt;/strong&gt; 2025-12 (4.38.0)&lt;/li&gt;
&lt;li&gt;Tomcat 10 (Java 21)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h1&gt;
  
  
  Observed behavior
&lt;/h1&gt;

&lt;p&gt;Coverage recognition differs depending on how Tomcat is launched:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Run As → Run on Server&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JaCoCo coverage is recognized correctly&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Debug As → Debug on Server&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JaCoCo coverage is &lt;em&gt;not&lt;/em&gt; recognized&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h1&gt;
  
  
  Investigation 1: Class file comparison
&lt;/h1&gt;

&lt;p&gt;To determine whether this was a WTP-specific issue, the same JaCoCo setup was tested with a normal Java application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In a non-WTP Java application, coverage works correctly in both &lt;em&gt;Run&lt;/em&gt; and &lt;em&gt;Debug&lt;/em&gt; modes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, the actual class files used by WTP were compared.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compared class files under:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.metadata/.plugins/org.eclipse.wst.server.core/tmp*/wtpwebapps/&amp;lt;project&amp;gt;/WEB-INF/classes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Class files are identical between Run and Debug&lt;/li&gt;
&lt;li&gt;No timestamp differences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To investigate further, JaCoCo's &lt;code&gt;classdumpdir&lt;/code&gt; option was used and dumped classes were analyzed with &lt;code&gt;javap -v&lt;/code&gt;.&lt;/p&gt;
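&lt;p&gt;For reference, &lt;code&gt;classdumpdir&lt;/code&gt; is a standard JaCoCo agent option that writes out every class exactly as the agent saw it at load time. The earlier VM argument can be extended like this (the &lt;code&gt;target/jacoco/classes&lt;/code&gt; dump path is my choice, not required):&lt;/p&gt;

```
-javaagent:${project_loc:PROJECT_NAME}/lib/jacoco/jacocoagent.jar=destfile=${project_loc:PROJECT_NAME}/target/jacoco/jacoco.exec,append=false,classdumpdir=${project_loc:PROJECT_NAME}/target/jacoco/classes
```

&lt;p&gt;Running &lt;code&gt;javap -v&lt;/code&gt; on a dumped class file then shows whether a &lt;code&gt;SourceDebugExtension&lt;/code&gt; attribute is present.&lt;/p&gt;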

&lt;h2&gt;
  
  
  Finding
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;In &lt;strong&gt;Debug mode&lt;/strong&gt;, class files contain &lt;strong&gt;SourceDebugExtension (SMAP)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;Run mode&lt;/strong&gt;, SMAP is not present&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Interpretation
&lt;/h1&gt;

&lt;p&gt;This strongly suggests that, &lt;strong&gt;when launching a WTP server in Debug mode&lt;/strong&gt;, class files are modified by Eclipse (or related tooling) &lt;em&gt;before or around class loading&lt;/em&gt;, independently of JaCoCo instrumentation.&lt;/p&gt;

&lt;p&gt;JaCoCo assumes that the class definition at execution time and the class definition used for analysis are &lt;strong&gt;structurally identical&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The presence of SMAP breaks this assumption.&lt;/p&gt;




&lt;h1&gt;
  
  
  Investigation 2: Eclipse configuration
&lt;/h1&gt;

&lt;p&gt;Based on the hypothesis above, Eclipse debug-related settings were reviewed.&lt;/p&gt;

&lt;p&gt;The issue disappears when the following setting is disabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Preferences → Java → Debug → Use advanced source lookup (JRE 1.5 and higher)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After disabling this option:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Coverage is recognized correctly even in Debug mode&lt;/li&gt;
&lt;li&gt;Java source debugging still works&lt;/li&gt;
&lt;li&gt;JSP debugging still works&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Notes on SMAP (JSR-045)
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;SMAP (Source Map) is defined by &lt;strong&gt;JSR-045&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;It is primarily used to map generated code (e.g. JSP) back to original source&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In Tomcat, SMAP generation for JSPs is handled by &lt;code&gt;JspServlet&lt;/code&gt; and enabled by default&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It can be disabled via &lt;code&gt;suppressSmap&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
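&lt;p&gt;On the JSP side, &lt;code&gt;suppressSmap&lt;/code&gt; is an init parameter of Jasper's &lt;code&gt;JspServlet&lt;/code&gt;, declared in Tomcat's &lt;code&gt;conf/web.xml&lt;/code&gt;. A minimal sketch, with the other init parameters omitted:&lt;/p&gt;

```xml
<servlet>
    <servlet-name>jsp</servlet-name>
    <servlet-class>org.apache.jasper.servlet.JspServlet</servlet-class>
    <init-param>
        <!-- Suppress SMAP (JSR-045) generation for compiled JSPs -->
        <param-name>suppressSmap</param-name>
        <param-value>true</param-value>
    </init-param>
    <load-on-startup>3</load-on-startup>
</servlet>
```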

&lt;p&gt;Based on observed behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SMAP for Java-originated classes appears to be influenced by Eclipse debug configuration&lt;/li&gt;
&lt;li&gt;SMAP for JSP-originated classes is handled by Tomcat&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Disabling &lt;strong&gt;"Use advanced source lookup"&lt;/strong&gt; is a &lt;em&gt;possible workaround&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;However, the broader conclusion is more important:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;JaCoCo relies on class definition consistency between execution and analysis.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Debug configurations in IDEs may introduce &lt;strong&gt;implicit bytecode modifications&lt;/strong&gt; (such as SMAP) that violate this assumption.&lt;/p&gt;

&lt;p&gt;Therefore:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Collect JaCoCo coverage using Run mode, and reserve Debug mode strictly for investigation and diagnosis.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Collecting coverage in Debug mode should generally be avoided in Eclipse WTP environments.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final remark
&lt;/h1&gt;

&lt;p&gt;This article intentionally focuses on &lt;em&gt;why&lt;/em&gt; the issue occurs, rather than treating it as a configuration glitch.&lt;/p&gt;

&lt;p&gt;The same problem is likely to reappear in future environments if the underlying assumptions are forgotten.&lt;/p&gt;

&lt;p&gt;Writing this down is primarily for my future self — but if it helps someone else avoid the same trap, even better.&lt;/p&gt;

</description>
      <category>java</category>
      <category>eclipse</category>
      <category>jacoco</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
