Forem: Steven Hur

Escaping Localhost

Steven Hur — Fri, 12 Dec 2025 22:12:18 +0000

For a long time, my development life existed within the predictable world of my local machine. I wrote code, it ran, and that was the extent of my world.

A few months ago, I had the chance to step outside of my comfort zone and dive into the world of open source. If I had to describe the feeling of that first moment, I would point to a specific scene from the Disney movie "Ralph Breaks the Internet".

Picture from the movie "Wreck-It Ralph"
The first time Ralph and Vanellope walk into the world of the Internet

Like Ralph and Vanellope standing on that balcony, gazing wide-eyed at the endless, futuristic skyline, I felt completely small. In the movie, the Internet is depicted as a sprawling, infinite metropolis, bustling with flying vehicles and towering skyscrapers that represent the giants of the web.

Coming from the quiet, controlled environment of my local machine, the Open Source ecosystem felt like that futuristic city. The towering buildings weren't Amazon or Google, but massive repositories with millions of lines of code. The flying cars weren't just traffic. They were the large stream of Pull Requests, Issues, and Discussions happening in real-time across the world. People were continuously building, rebuilding, breaking and fixing the projects.

It was terrifying, yes. But just like Ralph looking out at that horizon, I realized the potential of this limitless world.

My Contribution Highlights
Driven by this excitement, I didn't want to just be a tourist in this new city. It was intimidating, but I am incredibly proud to say that I have successfully contributed to some of the foundational pillars of the Python data ecosystem.

I have had PRs merged into:

Scikit-learn
NumPy
Pandas
Dagster

Seeing my code become part of tools that millions of developers rely on was an exciting experience.

Why I Fell for Dagster
This realization explains why I fell so deeply for Dagster.

While exploring it, I was amazed by its core philosophy of Software-defined Assets. The concept of treating data not just as a byproduct, but as a first-class asset, was very interesting. Treating data as assets shifts the focus from managing execution tasks to maintaining the freshness of the actual data products. This approach automatically generates clear lineage graphs, allowing you to easily understand dependencies and track how data flows through the system. As a result, debugging and collaboration become significantly more efficient because you are interacting with defined data outcomes rather than abstract code logic.
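To make the lineage idea concrete, here is a toy sketch in plain Python. It is not Dagster's API: the asset names and both helper functions are made up for illustration. It only shows how declaring what each asset depends on is enough to derive a materialization order and an upstream lineage.

```python
from graphlib import TopologicalSorter

# Hypothetical asset names; each asset maps to the upstream assets it depends on.
ASSET_DEPS = {
    "raw_orders": set(),
    "cleaned_orders": {"raw_orders"},
    "daily_revenue": {"cleaned_orders"},
    "revenue_report": {"daily_revenue", "cleaned_orders"},
}


def materialization_order(deps):
    """Return asset names in an order where every upstream asset comes first."""
    return list(TopologicalSorter(deps).static_order())


def upstream_lineage(deps, asset):
    """Collect every asset the given asset depends on, directly or indirectly."""
    seen = set()
    stack = list(deps[asset])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(deps[current])
    return seen


order = materialization_order(ASSET_DEPS)
lineage = upstream_lineage(ASSET_DEPS, "revenue_report")
print(order)
print(sorted(lineage))
```

In this sketch the dependency declarations are the only input; the ordering and the lineage both fall out of them, which is the part of the asset-centric approach I found so appealing.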

Reading the Dagster source code didn't feel like studying. I found myself mentally visualizing the entire process: how the data flows, how the assets are materialized, and how the engine handles dependencies. Simulating these complex data journeys in my head was incredibly fun and engaging.

Stepping out of my local machine and jumping into the open-source world brought lots of changes. It helped me realize my passion for data management systems. This was a fantastic and fun experience, and I will be continuing this journey.

Continuous Journey through Dagster - bugs and testing

Steven Hur — Tue, 09 Dec 2025 21:25:14 +0000

Lately, I've been diving deep into open-source contributions for Dagster. I think I am getting a bit more comfortable with the codebase, which has sped up my work (placebo?). Today, I want to share the issues I've tackled recently and talk about a significant roadblock I'm currently facing.

My Recent Contributions
I focused on fixing several bugs and improving stability across different parts of Dagster. Here is a breakdown of the issues I worked on:

Fixing ECS Pipes Client Execution

The Issue: Users were encountering an IndexError when launching tasks using the PipesECSClient. This caused pipelines to crash unexpectedly in ECS environments.

The Fix: I added proper exception handling and bounds checking to ensure the client launches tasks smoothly without crashing on index errors.

Issue #32936
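As an illustration of the bounds-checking idea, here is a minimal sketch. It is not the code from my PR: first_task_or_raise is a hypothetical helper, and it assumes the IndexError came from indexing an empty "tasks" list in a run_task-style response, which carries "tasks" and "failures" lists.

```python
def first_task_or_raise(response):
    """Return the first launched task, or raise a descriptive error if none started.

    `response` is assumed to look like an ECS run_task response: a dict with
    a "tasks" list and a "failures" list.
    """
    tasks = response.get("tasks") or []
    if not tasks:
        failures = response.get("failures") or []
        reasons = ", ".join(f.get("reason", "unknown") for f in failures)
        raise RuntimeError(
            f"ECS did not start any task: {reasons or 'no failure details'}"
        )
    return tasks[0]


# A made-up response where the launch failed and no task was created.
failed_response = {"tasks": [], "failures": [{"reason": "RESOURCE:MEMORY"}]}
try:
    first_task_or_raise(failed_response)
except RuntimeError as exc:
    message = str(exc)
    print(message)
```

Checking the list before indexing turns an opaque IndexError into an error that says why nothing was launched.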

Resolving Asset Specs Mapping Dependencies

The Issue: There was a logic error in AssetsDefinition.map_asset_specs that caused failures when attempting to add dependencies while input definitions were already set.

The Fix: I adjusted the core logic to correctly handle the mapping of asset specs even when inputs are pre-configured.

Issue #32913

[WIP] Correcting Asset Sensor Event Processing

The Issue: The asset_sensor had a critical bug where it would only process the last materialization event if multiple partitions materialized simultaneously. This issue stems from a race condition, making it notoriously difficult to reproduce and debug in a local environment.

The Fix: This is still a work in progress, but as a first step I modified the sensor logic to ensure every single materialization event is captured and processed, regardless of concurrency. Further progress will require a precise approach and careful testing.

Issue #32853
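The general pattern behind "process every event, not just the last one" can be sketched with a cursor. This is a toy model, not Dagster's sensor implementation or my actual patch: MaterializationEvent, event_id, and both functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaterializationEvent:
    event_id: int  # hypothetical, monotonically increasing id
    partition: str


def events_since(events, cursor):
    """Return all events newer than the cursor, oldest first."""
    newer = (e for e in events if e.event_id > cursor)
    return sorted(newer, key=lambda e: e.event_id)


def process_new_events(events, cursor):
    """Handle every unseen event and return (handled partitions, new cursor)."""
    handled = []
    for event in events_since(events, cursor):
        handled.append(event.partition)
        cursor = event.event_id  # advance only after the event is handled
    return handled, cursor


# Three partitions materialize between two sensor ticks.
log = [
    MaterializationEvent(3, "2025-12-03"),
    MaterializationEvent(1, "2025-12-01"),
    MaterializationEvent(2, "2025-12-02"),
]
handled, cursor = process_new_events(log, cursor=0)
print(handled, cursor)
```

A sensor that only looks at the latest event would have seen just the third partition here; iterating from the cursor picks up all three.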

[WIP] Implementing Merge Support for Polars & Delta Lake

The Use Case: Currently, the dagster-deltalake I/O manager allows writing data, but it lacks out-of-the-box support for the merge operation when using Polars.

The Implementation: I am working on updating dagster_deltalake/handler.py to support merge mode. The logic checks whether the write mode is set to merge. If so, instead of calling the standard write_deltalake() function, it creates a DeltaTable object and executes the merge operation.

Issue #32644

The CI
While fixing the code was satisfying, getting the Pull Requests (PRs) merged has been a different story. I am currently stuck in a loop regarding CI tests.

The Situation:

I run the unit tests in my local environment, and everything passes perfectly.
I push the code to GitHub, and the CI pipeline fails.
Because of this, I can't get a proper code review from the maintainers.

It is frustrating because I cannot reproduce the errors locally. It could be an environment configuration mismatch, a linting rule that strictly applies in CI, or a hidden dependency issue.
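One way to start narrowing down an environment mismatch is to print the same "fingerprint" locally and in a CI step, then compare the two. This is a generic sketch, not something taken from Dagster's CI setup; environment_fingerprint and the package names passed to it are my own placeholders.

```python
import platform
import sys
from importlib import metadata


def environment_fingerprint(packages):
    """Collect interpreter, OS, and package versions for side-by-side comparison."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "executable": sys.executable,
        "packages": versions,
    }


# Placeholder package names; swap in the ones the failing tests depend on.
fingerprint = environment_fingerprint(["pip", "definitely-not-installed-pkg-xyz"])
for key, value in fingerprint.items():
    print(f"{key}: {value}")
```

If the two outputs differ in Python version, operating system, or a dependency version, that difference is the first thing worth trying to reproduce locally.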

Next Steps
I plan to reach out to the Dagster team and the community for guidance. I need to understand how their CI environment differs from a standard local setup so I can replicate the failure and fix it. Sometimes, reading thousands of lines of code and fixing errors is easier than testing.

Deepening My Roots in the Data Ecosystem - Choosing Depth Over Breadth

Steven Hur — Thu, 04 Dec 2025 23:17:53 +0000

The Original Plan vs. Reality
In my previous post, I planned to step into LLM orchestration by contributing to LangChain or to dive into full-stack development with Django. However, digging into the LangChain codebase made me realize a distinct difference in engineering styles.

The library relies heavily on abstraction layers to wrap various LLMs. While this is architecturally impressive, I found that I didn't get the same satisfaction as I did when working with scikit-learn or Dagster. It wasn't just about complexity; it was about the nature of the code. I realized that I prefer the logic of data pipelines and algorithms over the integration-heavy nature of LLM wrappers.

Rediscovering the Joy of Data Engineering
Naturally, I shifted my focus back to Dagster. Scanning the issue tab, I found myself drawn to problems that dealt with strict data flow and orchestration logic.

It wasn't just because Dagster was familiar; it was because the challenges were genuinely more stimulating. For instance, working on a feature that required learning Polars was exciting, even though it was a completely new library for me. This confirmed my preference:

"I enjoy the process when working on the concrete logic of data processing rather than the abstraction layers of LLM applications."

Choosing Depth Over Breadth
I made a strategic decision. Instead of making surface-level contributions in a new repository, I decided to double down on Dagster. This allowed me to move beyond minor patches and focus on high-impact work.
I focused on:

Resolving Core Issues: Diving deep into the internal logic to fix bugs that were blocking other users.
Expanding Functionality: Implementing new features that enhance the tool's usability.

Leveraging my previous experience with the codebase allowed me to use the time more efficiently. I could navigate the source code with intuition, enabling me to tackle complex architectural problems that would have been out of my reach just a few months ago.

Finding My Path
This journey took an unexpected turn, but it taught me a valuable lesson. Being a skilled developer isn't about following the latest trends; it's about recognizing your strengths and doubling down on them. Instead of spreading myself thin, I chose to deepen my expertise in the data ecosystem.

Stepping Out of the Comfort Zone - Plan for the Final Stretch

Steven Hur — Fri, 28 Nov 2025 22:10:21 +0000

The Journey So Far
Over the past few months, my journey through open-source development has been a deep dive into the Python data ecosystem. In previous releases (0.1 through 0.3), I focused heavily on data engineering and machine learning libraries. I had the opportunity to contribute to Dagster, scikit-learn, and NumPy.

These experiences were invaluable. I learned how to navigate complex C extensions in NumPy, understood the orchestration logic in Dagster, and worked my way through the strict code standards of scikit-learn. However, I felt it was time to step outside the box once more and push myself into a new world.

Bridging Data and Application
One of my main goals is to suggest or contribute to a new feature.

Before jumping into anything, I asked myself: Where do I want to be as a developer?

I have some background in data processing, but I want to strengthen my skills in building the applications that utilize this data. I want to bridge the gap between "backend logic" and "user-facing functionality." Therefore, for this final step, I plan to move towards the LLM (Large Language Model) orchestration or web framework domain.

The Target Project: LangChain, Django
After researching potential projects, I have found two interesting open source projects, LangChain and Django.

Why LangChain? With the explosion of Generative AI, LangChain has become the framework for building LLM applications. Since I have already contributed to scikit-learn and understand the fundamentals of ML pipelines, moving into LLM orchestration feels like the natural next step. It allows me to apply my Python skills to a high-impact technology.

Why Django? Django is one of the most robust web frameworks in existence. While my previous contributions were in data libraries, I want to explore the world of full-stack development. Contributing to Django will give me the chance to deal with different types of challenges, such as ORM optimizations and security, which are crucial for my career growth.

Moving from scientific libraries like NumPy to application frameworks like LangChain and Django is a shift in mindset. It’s a move from optimizing calculation to architecting functionality. It makes me nervous, but that’s exactly why I need to do it.

I am giving my final push to close out my 3 years of study. Stay tuned for my progress update next week.

My First Python Package Release on PyPI: repo-code-packager

Steven Hur — Fri, 21 Nov 2025 22:49:23 +0000

For OSD600 Lab 9, I took on the challenge of releasing my open-source project to the world. My goal was to package my code so that it can be installed anywhere with a single command.

I chose to package my Python project, repo-code-packager, and publish it to PyPI (Python Package Index).

The Tools
Since I am working within the Python ecosystem, I used the standard industry tools for packaging:

PyPI: The official third-party software repository for Python.
build: A standard tool to create distribution packages.
twine: A utility for publishing Python packages to PyPI securely.
pyproject.toml: The modern configuration file for defining package metadata and build system requirements.

The Process

Preparing the Package
The first step was organizing my project structure and creating the pyproject.toml file. This file is the heart of the package, containing the name, version, author info, and dependencies. I had to ensure my source code was properly structured in a src directory with __init__.py files to make it importable.
Building and Tagging
I used the python -m build command to generate the distribution artifacts. Before releasing, I practiced using Git Tags, marking my repository with v0.9.0 to simulate a pre-release state. Once I was ready for the official launch, I bumped the version to v1.0.0 and pushed the tags to GitHub.
Publishing to PyPI
Uploading was surprisingly straightforward using twine. I generated an API Token from PyPI for security and used it to authenticate during the upload process.

python -m twine upload dist/*

Seeing my package live on PyPI for the first time was an exciting moment.

Unexpected Challenges
However, the road to a stable release wasn't smooth. I learned that publishing is easy, but publishing correctly is hard.

The Case Sensitivity Trap
My biggest problem came from directory naming. My source folder was named Repo_Code_Packager, but I used repo_code_packager for the project name. When I released the package, I realized that users had to import it exactly as the folder was named:

# correct
from Repo_Code_Packager.content_packager import ContentPackager

# wrong
from repo_code_packager.content_packager import ContentPackager

This taught me the importance of adhering to naming conventions before starting a project. For this release, I updated the documentation to clearly instruct users to use the capitalized import.
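The behavior is easy to reproduce in isolation. The sketch below builds a throwaway package in a temporary directory and tries both spellings; Demo_Cased_Pkg is a made-up stand-in for Repo_Code_Packager, and try_import is a helper written just for this demo.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def try_import(name):
    """Return True if the module imports, False if Python cannot find it."""
    importlib.invalidate_caches()
    try:
        importlib.import_module(name)
        return True
    except ModuleNotFoundError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    # Create a package whose folder name contains capital letters.
    pkg = Path(tmp) / "Demo_Cased_Pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("VALUE = 1\n")

    sys.path.insert(0, tmp)
    try:
        exact_ok = try_import("Demo_Cased_Pkg")  # matches the folder name
        lower_ok = try_import("demo_cased_pkg")  # same letters, wrong case
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("Demo_Cased_Pkg", None)

print(exact_ok, lower_ok)
```

The import name follows the folder on disk, not the project name in pyproject.toml, and Python matches it case-sensitively by default even on filesystems that ignore case.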

Missing Dependencies and Class Structures
In my initial v1.0.0 release, I forgot to declare Pygments as a dependency in pyproject.toml. This meant users could install my package, but it crashed immediately when they ran it. I also realized my code was a collection of loose functions, which was hard for users to integrate. I quickly refactored the code into a proper ContentPackager class, added the missing dependency, and released patches v1.0.1 and v1.0.2 to fix these issues.

User Testing
To verify my release, I asked my cousin, who is also a software developer, to test the package. This session was incredibly valuable.

I provided him with the PyPI link and my README.md.

The Install: He successfully installed it using pip install repo-code-packager.
The Confusion: He instinctively tried to import it using lowercase and hit an ImportError. I had to point out that the directory name required uppercase.
The Fix: Seeing him struggle with the import confirmed that I needed to update my documentation immediately to highlight the case-sensitive import statement.

It was a great reminder that documentation is just as important as the code itself.

This lab taught me that software release is an iterative process. It's rare to get v1.0.0 perfect on the first try, and that's okay. Tools like twine and semantic versioning allow us to fix and improve our packages continuously.

How I Fixed a Confusing Bug in NumPy

Steven Hur — Fri, 21 Nov 2025 18:09:15 +0000

Contributing to a massive open-source project like NumPy can feel intimidating. You imagine complex C code, advanced math, and scary build processes. But sometimes, a bug is just a simple logic error hiding in plain sight.

I just submitted a Pull Request to NumPy to fix a bug that was causing misleading error messages in numpy.convolve. Here’s the story of the bug, the fix, and how I verified it.

"Wait, What?"
Imagine you are using numpy.convolve. You accidentally pass an empty array as your first argument, but your second argument is perfectly fine.

import numpy as np

a = np.array([])      # Empty!
v = np.array([1, 2])  # Not empty!

np.convolve(a, v)

You would expect an error saying a cannot be empty, right? Instead, NumPy screams at you:

ValueError: v cannot be empty

Wait... what? I know v isn't empty. I just double-checked it! This is the kind of error message that sends developers down a rabbit hole for an hour, debugging the wrong variable.

Keep Calm, just Find the Bug
I searched through the NumPy source code in numpy/_core/numeric.py to see what was happening under the hood. The logic looked something like this:

# The original buggy logic
def convolve(a, v, mode='full'):
    # ...

    if (len(v) > len(a)):
        a, v = v, a  # <--- The SWAP happens here!

    # Validation
    if len(a) == 0:
        raise ValueError('a cannot be empty')
    if len(v) == 0:
        raise ValueError('v cannot be empty') # <--- The error triggers here

Do you see the problem?

The function sees that v is longer than a.
It decides to swap them for performance reasons.
Now, internally, variable v holds the empty array.
The check if len(v) == 0 triggers, raising ValueError: v cannot be empty.

The function was swapping the contents of the variables, but the error message was hardcoded to the variable name. It was basically gaslighting the user.

Check First, Optimize Later
The fix was simple. We just needed to ensure the input validation happens before any internal swapping takes place.

I changed the order of operations:

# The fixed logic
def convolve(a, v, mode='full'):
    # ...

    # 1. Check for empty inputs FIRST
    if len(a) == 0:
        raise ValueError('a cannot be empty')
    if len(v) == 0:
        raise ValueError('v cannot be empty')

    # 2. THEN perform the optimization swap
    if (len(v) > len(a)):
        a, v = v, a

Now, if a is empty, it gets caught immediately, and the user gets the correct error message, a cannot be empty.

This was a small change, just moving a few lines of code, but it significantly improves the developer experience. No one likes misleading error messages.
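To check the reordered logic without building NumPy, the validation can be modeled on plain lists. This is a stand-in for illustration only, not NumPy's source and not the test from my PR; validate_then_order and error_message are hypothetical names.

```python
def validate_then_order(a, v):
    """Validate both inputs under their original names, then put the longer one first."""
    # 1. Check for empty inputs while `a` and `v` still mean what the caller passed.
    if len(a) == 0:
        raise ValueError("a cannot be empty")
    if len(v) == 0:
        raise ValueError("v cannot be empty")
    # 2. Only then swap so that the first sequence is the longer one.
    if len(v) > len(a):
        a, v = v, a
    return a, v


def error_message(a, v):
    """Return the ValueError message for the given inputs, or None if they are valid."""
    try:
        validate_then_order(a, v)
    except ValueError as exc:
        return str(exc)
    return None


print(error_message([], [1, 2]))   # empty first argument
print(error_message([1, 2], []))   # empty second argument
print(error_message([1], [1, 2]))  # valid inputs
```

With the checks ahead of the swap, each message names the argument the caller actually got wrong.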

It was a great reminder that you don't need to be a math genius to contribute to libraries like NumPy. Sometimes, you just need to spot a logic bug and move some if statements around.

My PR is up! Fingers crossed for the merge.

Debugging Windows Race Conditions in Dagster

Steven Hur — Tue, 18 Nov 2025 21:13:07 +0000

Okay, another PR on Dagster. I tackled a deceptively complex issue. Specifically, I focused on the dagster-dbt integration.

At first glance, the Pull Request might look small. However, getting to that one line required diving deep into Windows filesystem internals and race conditions.

Here is the story of how I diagnosed and fixed a nondeterministic crash that was haunting Windows users.

"It works, until it doesn't"
The issue appeared simple. When a user reloads their dbt project definitions in Dagster on Windows, the process crashes with a FileExistsError: [WinError 183].

The traceback pointed to this logic in dbt_project_manager.py:

# The original code
shutil.rmtree(local_dir, ignore_errors=True)
local_dir.mkdir()

On paper, this logic seems flawless. It deletes the directory and then creates it. So, why was Python complaining that the file already exists right after deleting it?

The Root Cause - The Windows Race Condition
This is where the complexity lies. Unlike Linux or macOS, the Windows filesystem behaves differently regarding file locking and deletion latency.

When shutil.rmtree() is called, it requests the OS to delete the directory. However, on Windows, if a file inside that directory is briefly locked, the deletion doesn't happen instantaneously.

Python executes rmtree. Often Windows starts deleting but lags slightly due to a lock.
Because ignore_errors=True was set, rmtree returns silently without finishing the job.
Python immediately executes mkdir().
CRASH!: The directory is essentially a "zombie" - it’s flagged for deletion but still technically exists.

This is a classic race condition. The code assumed instant deletion, but Windows proved otherwise.

Aligning Intent with Reality
I didn't want to simply suppress the error. I needed to make the creation step robust.

The original author used ignore_errors=True for deletion, implying a design philosophy of "Availability over Atomicity", which means that if cleanup fails, the program should try to continue rather than crash. However, the mkdir() step was strict, breaking this philosophy.

This is my suggestion:

local_dir.mkdir(parents=True, exist_ok=True)

By adding exist_ok=True, I ensured that even if the "zombie" directory lingers due to OS latency, the program proceeds gracefully. The subsequent sync() operation then handles the data consistency by overwriting files, ensuring that no stale data causes issues.
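The difference between the strict and the tolerant call can be seen on any platform by creating the directory twice. This sketch does not reproduce the Windows race itself; the pre-existing directory simply stands in for one that survived an incomplete rmtree.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    local_dir = Path(tmp) / "project"
    local_dir.mkdir()  # stands in for a directory that survived cleanup

    # Strict creation: fails because the directory is still there.
    try:
        local_dir.mkdir()
        strict_failed = False
    except FileExistsError:
        strict_failed = True

    # Tolerant creation: a pre-existing directory is accepted.
    local_dir.mkdir(parents=True, exist_ok=True)
    tolerant_ok = local_dir.is_dir()

print(strict_failed, tolerant_ok)
```

Note that exist_ok=True only tolerates an existing directory; if the path exists as a regular file, mkdir still raises FileExistsError.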

The Dilemma - Logic over Observation
This presented a significant engineering dilemma for me. As a developer, I crave the validation of seeing a test fail before I fix it. I wanted to witness the crash with my own eyes to confirm the bug.

However, my local test environment worked too well. Because I was testing with a relatively small dataset, the rmtree operation on my Windows machine finished instantaneously, beating the race condition every time. No matter how many times I reloaded, the crash wouldn't trigger.

I decided to trust my static analysis of the code over my local observation. I looked closely at the existing code's intention:

shutil.rmtree(local_dir, ignore_errors=True)

The original code explicitly ignores errors during deletion. It proved that the original authors anticipated that deletion might fail or be incomplete. They designed the system to tolerate a messy cleanup.

However, the very next line, mkdir(), was strict and intolerant of pre-existing directories. This was a logical contradiction in the code's design philosophy.

I concluded that my fix, adding exist_ok=True, was necessary to align the creation logic with the deletion logic. Trusting this architectural logic, I submitted the Pull Request.

PR, issue-32841

Setting up CI/CD with GitHub Actions

Steven Hur — Thu, 13 Nov 2025 21:09:56 +0000

Welcome to my reflection on my CI/CD experience, where the core objective was to move beyond local testing and integrate a CI pipeline into my project using GitHub Actions. This lab was an interesting exercise in managing project complexity and in understanding a fundamental concept of collaborative software development.

1. The ci.yml Blueprint
The first step was setting up the automation pipeline. Since my project is based on Python and uses Pytest for testing, I configured a workflow to automatically run tests whenever code was pushed or a Pull Request (PR) was opened.

-The GitHub Actions Workflow-
The configuration was defined in the .github/workflows/ci.yml file, ensuring a consistent testing environment.

What does the YAML file do?
This workflow automates four key steps:

1. Checkout the code
2. Set up the required Python environment
3. Install project dependencies
4. Execute all unit tests using pytest

This process helps ensure that new changes do not break existing, tested functionality before they are merged.

2. Mastering the CI Cycle: Pass, Fail, Pass
The most important part of setting up CI was running the full test cycle. This proved that the CI system itself works as expected.

Initial Pass (Success): I created a PR, and the CI successfully ran all existing tests, resulting in a green checkmark.
Intentional Fail (Failure): I then committed a change that caused a test to fail. The CI automatically re-ran and immediately reported a red X.
Final Pass (Recovery): After reverting the breaking change, the CI successfully ran again, showing a green checkmark.

This cycle confirmed that the CI is functioning properly.

3. The Cross-Project Collaboration Challenge
Afterwards, I had to find a partner and contribute a new test case to their repository. My partner's project was also Python-based, utilizing Pytest and a similar src/tests directory structure.

The experience of writing tests for external code highlighted the need for strong documentation and well-encapsulated functions. I chose to write tests for the ArgParser class, focusing on its custom logic for loading TOML configuration files.

The main challenge I faced was an unexpected ModuleNotFoundError when trying to run tests locally, even though the file structure was clear. This was because Python's default import path does not automatically include the sibling src/ directory.

The solution required manually inserting the src folder's absolute path into the system's path configuration within the test file.

# The Fix for ModuleNotFoundError:
import sys
import os
src_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'src'))
sys.path.insert(0, src_dir)
from arg_parser import ArgParser

This experience was a reminder that while CI automates testing, basic project setup is essential for any developer joining the project.

Having successfully set up CI and used it in a real-world scenario, I now strongly believe that CI is indispensable.

Before this lab, running tests felt optional. However, proper automated testing reduces human error and ensures that every code change, no matter how small, is immediately validated against the entire suite of existing tests.

CI is not just about testing. It's about reducing integration friction and providing immediate feedback, which is key to maintaining a stable and scalable codebase.

Adding Automated Testing to My Project

Steven Hur — Thu, 06 Nov 2025 23:23:03 +0000

For Lab 7, I added automated testing to my Repo Code Packager project. This tool analyzes Git repositories and generates formatted output for sharing with LLMs. Before this lab, I had no automated tests, which made it risky to add new features or refactor code. This lab taught me how to set up a testing framework and write test cases.

Choosing a Testing Framework
Since my project is written in Python, I researched several testing frameworks:

unittest: Python's built-in testing framework
nose: An extension of unittest (but no longer actively maintained)
pytest: The most popular modern Python testing framework

I chose pytest for several reasons:

Simple syntax: Uses plain assert statements instead of self.assertEqual()
Great documentation: Easy to find examples and tutorials
Powerful features: Parametrization and well-written error messages

Setting Up the Testing Environment
Following the lab instructions, I created a testing branch and added a tests directory:

git checkout -b testing

├── tests/
│   ├── __init__.py
│   ├── test_file_utils.py
│   ├── test_content_packager.py
│   └── test_git_utils.py

The tests/__init__.py file is empty; its only job is to mark the directory as a regular Python package.

Writing My First Tests
I started with file_utils.py because it contains small functions with no external dependencies, which makes it perfect for learning unit testing.
The is_recently_modified() function checks if a file was modified within a specified time window. Here's what I tested:

def test_nonexistent_file_returns_false(self):
    """Non-existent file should return False"""
    result = is_recently_modified("nonexistent_file.txt")
    assert result == False

def test_recently_created_file_returns_true(self, tmp_path):
    """Recently created file should return True"""
    test_file = tmp_path / "recent.txt"
    test_file.write_text("test content")

    result = is_recently_modified(str(test_file), days=7)
    assert result == True

I discovered pytest's tmp_path fixture, which creates a temporary directory for each test. This was incredibly useful because tests don't leave files on my system and each test is isolated.

Testing Edge Cases
The most interesting test was checking old files:

def test_file_modified_beyond_time_window(self, tmp_path):
    """File modified beyond the time window should return False"""
    test_file = tmp_path / "old.txt"
    test_file.write_text("old content")

    # Set modification time to 10 days ago
    ten_days_ago = time.time() - (10 * 86400)
    os.utime(str(test_file), (ten_days_ago, ten_days_ago))

    result = is_recently_modified(str(test_file), days=7)
    assert result == False

I learned about os.utime() which allows you to manipulate file timestamps. It is very useful for testing time-based functionality.
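For context, here is one way a function with this behavior could be written. The real is_recently_modified() in my project is not shown in this post, so treat this as a guess that is merely consistent with the three tests above; the demo at the bottom uses tempfile instead of pytest's tmp_path so it runs on its own.

```python
import os
import tempfile
import time

SECONDS_PER_DAY = 86400


def is_recently_modified(path, days=7):
    """Return True if the file at `path` was modified within the last `days` days."""
    try:
        modified_at = os.path.getmtime(path)
    except OSError:
        # Missing or unreadable files are treated as "not recent".
        return False
    return (time.time() - modified_at) <= days * SECONDS_PER_DAY


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w") as handle:
        handle.write("content")
    fresh = is_recently_modified(path, days=7)

    # Backdate both access and modification time by ten days.
    ten_days_ago = time.time() - 10 * SECONDS_PER_DAY
    os.utime(path, (ten_days_ago, ten_days_ago))
    stale = is_recently_modified(path, days=7)

# The temporary directory is gone now, so this path no longer exists.
missing = is_recently_modified(path)
print(fresh, stale, missing)
```

Backdating with os.utime() avoids having to sleep or mock the clock to test the "old file" branch.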

Bugs Discovered Through Testing
Bug #1: Missing Error Handler
When I wrote tests for git_utils.py, I discovered a bug:

NotADirectoryError: [WinError 267] The directory name is invalid

The problem was in my exception handling. The original code only caught subprocess.CalledProcessError and FileNotFoundError, but Windows raises NotADirectoryError for invalid paths.
Fix:

except (subprocess.CalledProcessError, FileNotFoundError, NotADirectoryError, IndexError, OSError):
    return "Not a git repository"

This was actually my first time discovering a cross-platform issue through a formal testing procedure. When subprocess tries to run a git command with an invalid path on Windows, it raises NotADirectoryError instead. This wasn't being caught, causing the test to crash. Different operating systems can raise different exceptions for the same error condition. By adding NotADirectoryError and the more general OSError, my code now handles edge cases better.
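A side note on the exception hierarchy: FileNotFoundError and NotADirectoryError are both subclasses of OSError, while subprocess.CalledProcessError is not, so a shorter except clause behaves the same way. The sketch below checks this; describe_repo and run_git are hypothetical names, not the functions in my git_utils.py.

```python
import subprocess

# Which of the caught exceptions are already covered by OSError?
os_error_subclasses = {
    "FileNotFoundError": issubclass(FileNotFoundError, OSError),
    "NotADirectoryError": issubclass(NotADirectoryError, OSError),
    "CalledProcessError": issubclass(subprocess.CalledProcessError, OSError),
}


def describe_repo(run_git):
    """Call `run_git` (a hypothetical callable) and map failures to a fallback string."""
    try:
        return run_git()
    except (subprocess.CalledProcessError, OSError, IndexError):
        # OSError also covers FileNotFoundError and NotADirectoryError.
        return "Not a git repository"


def raise_not_a_directory():
    """Simulate the failure seen on Windows for an invalid path."""
    raise NotADirectoryError(20, "Not a directory")


fallback = describe_repo(raise_not_a_directory)
print(os_error_subclasses)
print(fallback)
```

Listing the specific subclasses next to OSError is harmless, but catching OSError alone is enough to cover both of them.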

Open Source Journey

Steven Hur — Sat, 01 Nov 2025 16:09:32 +0000

Looking back at my previous blog posts feels incredibly humbling. I can see how much I've grown through this journey and honestly, it's been one of the most fun experiences I've had in my academic career. Why? Because I got to browse through some absolutely brilliant and amazing projects that are actually used in production.
Let me take you through the key lessons I learned from contributing to four different open source projects.

Communication Over Confidence
Project: BEHAVIOR-1K
My first contribution taught me the most fundamental lesson of open source.
I spent three full days just setting up the project and understanding the codebase. When I finally identified the issue, I faced a dilemma. There was a line of code that seemed very important, but I had to remove it to fix the issue. The function returned False if it found anything other than True in a list, but there was also an assert all(...), child_values has NoneTypes line checking for NoneType values.
Should I remove it or keep it?
Instead of making assumptions, I created a Pull Request with a [WIP] tag to open a conversation with the reviewers. This turned out to be the right call. In open source, especially as a newcomer, communication is the golden key. Nobody expects you to be perfect, but they do expect you to be thoughtful. Don't be afraid to ask questions. Maintainers would much rather answer your questions than deal with a poor PR.

Start Simple, Build Confidence
Project: Scikit-learn
After the intense first experience with BEHAVIOR-1K, I needed something more approachable. I went straight to Scikit-learn's good first issue label and found a task that seemed manageable: changing relative imports to absolute imports in Cython files.
From this

from ...utils._typedefs cimport float64_t, float32_t

To this

from sklearn.utils._typedefs cimport float64_t, float32_t

Was it a simple task? Yes. But I learned something out of it.
This was my first real encounter with Cython, and I discovered how Python libraries achieve C-level performance. I learned what cimport means, why float64_t exists, and how type definitions help optimize the code. Even a simple task in a well-structured project teaches you something new.
Simple contributions are not lesser contributions. They're opportunities to learn the project's architecture and tooling. Furthermore, they build your confidence for tackling harder issues later.

Embrace Challenges
Project: Dagster
After building confidence with Scikit-learn, I wanted something more challenging. Dagster, a data orchestration platform used by real companies in production, had an interesting bug.

Callable objects with custom signatures were crashing the type hints resolution system.

The problem was technical:

import inspect

class MyWrapper:
    def __init__(self, fn):
        self.__signature__ = inspect.signature(fn)

    def __call__(self, **kwargs):
        ...

This would crash with TypeError: <callable object> is not a module, class, method, or function.
At first, I thought this was too complex for me. But I kept trying and managed to find the solution: instead of passing the object to typing.get_type_hints(), extract the type information directly from the __signature__ object.
I learned a couple of things while working on this issue:

Python's signature protocol and the __signature__ attribute
The importance of comprehensive testing in production systems
How to read and understand complex codebases with decorator systems and dependency injection

Don't be afraid to tackle issues that seem slightly beyond your current skill level. The struggle is where the real learning happens.

Contributing to Tools You Use
Project: Optuna
By my fourth contribution, I felt much more comfortable with the open source process. I chose Optuna, a hyperparameter optimization framework that I had heard about while studying machine learning. I found an issue asking to modernize the code by replacing .format() with f-strings.
Old way

"{cls}({kwargs})".format(cls=..., kwargs=...)

New way

f"{cls}({kwargs})"

It was a fairly easy issue, but I wanted to contribute to a tool I actually use and understand. Working on Optuna felt much more comfortable than my first Scikit-learn contribution because I had context about what the library does and why it matters.
Contributing to projects you actually use is always a good way to get started. Not only does being part of an amazing project feel good, but it also makes you proud to have contributed to a big community.

The Joy of Exploration
One of the most unexpected pleasures of this journey was simply browsing through brilliant projects. Each repository I explored, whether I contributed to it or not, taught me something about software architecture, testing, or documentation. It's like getting a behind-the-scenes tour of how professional software is built.

I'm having a lot of fun doing this, and I hope that comes through in my writing. Open source contribution isn't just about adding lines to your resume. It's about being part of a big community and contributing to tools that developers around the world rely on.
Looking back at these four contributions, I'm proud of what I've accomplished. Looking forward, I'm excited about what comes next. Each contribution has taught me new technologies and new possibilities.
If you're thinking about contributing to open source, just start. Find a project you use, look for a good first issue, and take that first step. The open source community is very welcoming, and who knows? You might just have fun doing it.

Optuna f-string Refactoring

Steven Hur — Wed, 29 Oct 2025 05:38:58 +0000

Hello! Just submitted my 4th PR to open source. This time it's Optuna.

What is Optuna?
Optuna is a hyperparameter optimization framework for machine learning. Basically when you're training ML models, you have tons of parameters to tune - learning rate, batch size, number of layers, etc. Optuna automates this process using smart algorithms instead of random guessing.
What makes it interesting is the define-by-run API. You can dynamically construct search spaces, which is way more flexible than traditional grid search or random search. It's used by a lot of ML practitioners and has integrations with PyTorch, TensorFlow, XGBoost, and basically every major ML library.
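To make "define-by-run" a bit more concrete, here is a minimal sketch in plain Python. It is not Optuna's implementation: ToyTrial, objective, and toy_search are hypothetical names made up for illustration, the sampling is plain random search, and only the suggest_* method naming echoes the style of Optuna's trial API. The point it tries to show is that the search space is built while the objective runs, so one parameter can exist only when another takes a certain value.

```python
import random


class ToyTrial:
    """Hypothetical stand-in for a trial object. NOT Optuna's implementation."""

    def __init__(self, rng):
        self._rng = rng
        self.params = {}  # records every parameter suggested during this run

    def suggest_float(self, name, low, high):
        value = self._rng.uniform(low, high)
        self.params[name] = value
        return value

    def suggest_int(self, name, low, high):
        value = self._rng.randint(low, high)  # inclusive on both ends
        self.params[name] = value
        return value

    def suggest_categorical(self, name, choices):
        value = self._rng.choice(choices)
        self.params[name] = value
        return value


def objective(trial):
    # The search space is defined as this function executes.
    model = trial.suggest_categorical("model", ["linear", "mlp"])
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-1)
    score = abs(lr - 0.01)  # made-up score, lower is better
    if model == "mlp":
        # This parameter only exists on the "mlp" branch.
        n_layers = trial.suggest_int("n_layers", 1, 3)
        score += 0.001 * n_layers
    return score


def toy_search(n_trials, seed=0):
    """Random search over the dynamically defined space (illustration only)."""
    rng = random.Random(seed)
    best_score, best_params = float("inf"), {}
    for _ in range(n_trials):
        trial = ToyTrial(rng)
        score = objective(trial)
        if score < best_score:
            best_score, best_params = score, trial.params
    return best_score, best_params


best_score, best_params = toy_search(20)
print(best_score, best_params)
```

With a fixed grid you would have to declare n_layers up front even for the linear model. Here it simply never gets suggested on that branch.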

What I Did
Found this issue asking to replace old .format() calls with f-strings. A fairly simple refactoring.
issue-6305

They wanted this:

Old way (ugly)
"{cls}({kwargs})".format(cls=..., kwargs=...)

New way (clean)
f"{cls}({kwargs})"

The Code Change
Changed this older .format() style:

def __repr__(self) -> str:
    return "{cls}({kwargs})".format(
        cls=self.__class__.__name__,
        kwargs=", ".join(
            "{field}={value}".format(
                field=field if not field.startswith("_") else field[1:],
                value=repr(getattr(self, field)),
            )
            for field in self.__dict__
        )
        + ", value=None",
    )

Into this:

def __repr__(self) -> str:
    kwargs = ", ".join(
        f"{field if not field.startswith('_') else field[1:]}={getattr(self, field)!r}"
        for field in self.__dict__
    ) + ", value=None"
    return f"{self.__class__.__name__}({kwargs})"

Key Points
The main change was replacing all .format() calls with f-strings, which have been available since Python 3.6 and are the modern way to format strings. I also used !r instead of calling repr() directly because that's more pythonic in f-strings. The issue specifically asked for one file per PR to make reviews easier, so I only touched this single file.

Why This is Easy
This is just syntax conversion with no logic changes at all. The output stays exactly the same, just written differently. The issue had clear examples showing exactly what they wanted, so there was zero guesswork involved. Best part is tests won't break because the functionality is identical, just cleaner code.
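If you want to convince yourself that the output really is identical, a quick check like the one below works. This is only a sketch: Sample and the two helper functions are hypothetical names I made up to mirror the before and after versions shown above, not code from Optuna.

```python
def repr_with_format(obj) -> str:
    """Mirrors the older str.format() version."""
    return "{cls}({kwargs})".format(
        cls=obj.__class__.__name__,
        kwargs=", ".join(
            "{field}={value}".format(
                field=field if not field.startswith("_") else field[1:],
                value=repr(getattr(obj, field)),
            )
            for field in obj.__dict__
        )
        + ", value=None",
    )


def repr_with_fstring(obj) -> str:
    """Mirrors the f-string version, using !r in place of repr()."""
    kwargs = ", ".join(
        f"{field if not field.startswith('_') else field[1:]}={getattr(obj, field)!r}"
        for field in obj.__dict__
    ) + ", value=None"
    return f"{obj.__class__.__name__}({kwargs})"


class Sample:
    """Hypothetical object with one public-looking and one underscore field."""

    def __init__(self) -> None:
        self.name = "lr"
        self._low = 0.001
        self.log = True


sample = Sample()
assert repr_with_format(sample) == repr_with_fstring(sample)
print(repr_with_fstring(sample))  # Sample(name='lr', low=0.001, log=True, value=None)
```

The leading underscore is stripped from _low in both versions, and !r gives the same quoted 'lr' that repr() did.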

This issue was somewhat easier than what I've been doing, but I wanted to contribute to this project because Optuna is a framework that I've been studying recently. It does feel much more comfortable compared to the first contribution that I made to Scikit-learn. I guess I am improving in some way through this process.

Fixing Type Hints for Callable Objects with Custom Signatures in Dagster

Steven Hur — Tue, 28 Oct 2025 20:49:06 +0000

So... it's been an interesting week. After my last contribution to Scikit-learn (which was honestly pretty straightforward), I wanted to find something a bit more challenging. Something that would actually make me think, maybe?

I've been getting more into machine learning (ML) lately, especially pipelines and orchestration stuff. That's when I found Dagster.

What is Dagster?
If you're not familiar, Dagster is a data orchestration platform. Think of it like this: when you're building ML pipelines or data workflows, you need something to coordinate all the different steps, such as fetching data, transforming it, training models, deploying them, and so on. Dagster helps you organize all of that work into something manageable.
What caught my attention is that it is actually used in production by real companies. This isn't some hobby project. Plus, it has a really active community and the codebase is actually pretty readable.

Finding the Issue
While browsing through their GitHub issues, I found Issue #32574: "Callable object custom signatures are resolved incorrectly."
Issue-32574
At first glance, I thought "Oh cool, this looks easy." But then I read the details and realized this was actually pretty interesting.

The Problem
Here's the deal: Python has this cool feature where you can create callable objects (basically, classes with a __call__ method) that act like functions. You can even give them custom signatures using the __signature__ attribute. This is super useful for decorators and wrappers that need to preserve type information.
But Dagster's get_type_hints() function wasn't handling this correctly. When you had something like this:

import inspect

class MyWrapper:
    def __init__(self, fn):
        # Set custom signature
        self.__signature__ = inspect.signature(fn)

    def __call__(self, **kwargs):
        # Generic signature
        ...

The code would crash with:

TypeError: <callable object> is not a module, class, method, or function.

Why? Because the code was trying to pass the callable instance directly to Python's typing.get_type_hints(), which doesn't know how to handle arbitrary objects. It only works with actual functions, classes, and modules.
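To see the custom-signature idea in isolation, here is a small standalone sketch using only the standard library. The greet function and the _fn attribute are hypothetical additions for illustration. It shows that inspect.signature() reports the __signature__ set on the instance, even though __call__ itself only declares **kwargs.

```python
import inspect


def greet(name: str, times: int = 1) -> str:
    """Hypothetical function whose signature we want to preserve."""
    return ", ".join([f"hello {name}"] * times)


class MyWrapper:
    def __init__(self, fn):
        self._fn = fn  # kept so the wrapper can actually delegate
        # Advertise the wrapped function's signature on the instance.
        self.__signature__ = inspect.signature(fn)

    def __call__(self, **kwargs):
        return self._fn(**kwargs)


wrapped = MyWrapper(greet)

# inspect.signature() picks up the instance's __signature__ attribute.
custom_sig = inspect.signature(wrapped)
# The real __call__ only exposes the generic (self, **kwargs) shape.
generic_sig = inspect.signature(MyWrapper.__call__)

print(list(custom_sig.parameters))   # ['name', 'times']
print(list(generic_sig.parameters))  # ['self', 'kwargs']
```

So the type information is there. It just lives on the signature object rather than in the places typing.get_type_hints() looks.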

The Solution
The fix was actually straightforward once I understood the problem. Instead of passing the object to typing.get_type_hints(), you should extract the type information directly from the __signature__ object.

if hasattr(fn, "__signature__"):
    sig = fn.__signature__
    hints = {}
    for param_name, param in sig.parameters.items():
        if param.annotation != inspect.Parameter.empty:
            hints[param_name] = param.annotation
    if sig.return_annotation != inspect.Signature.empty:
        hints['return'] = sig.return_annotation
    return hints  # Return immediately!

The signature object already has all the type information you need which means that you can simply extract it.
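Here is the same idea as a self-contained, runnable sketch. It is not Dagster's actual code: hints_from_signature, train, and this MyWrapper are hypothetical names for illustration, and returning None when there is no custom signature is my own assumption so a caller could fall back to its usual path.

```python
import inspect


def hints_from_signature(obj):
    """Return {name: annotation} read from obj.__signature__, or None if absent."""
    sig = getattr(obj, "__signature__", None)
    if not isinstance(sig, inspect.Signature):
        return None  # assumption: caller falls back to its normal resolution
    # Keep only parameters that actually carry an annotation.
    hints = {
        name: param.annotation
        for name, param in sig.parameters.items()
        if param.annotation is not inspect.Parameter.empty
    }
    if sig.return_annotation is not inspect.Signature.empty:
        hints["return"] = sig.return_annotation
    return hints


def train(data: list, epochs: int, verbose=False) -> float:
    """Hypothetical function; 'verbose' is deliberately unannotated."""
    return 0.0


class MyWrapper:
    def __init__(self, fn):
        self.__signature__ = inspect.signature(fn)

    def __call__(self, **kwargs):
        ...


hints = hints_from_signature(MyWrapper(train))
print(hints)  # {'data': <class 'list'>, 'epochs': <class 'int'>, 'return': <class 'float'>}
```

Unannotated parameters like verbose are skipped, and a plain function (which has no __signature__ attribute of its own) gets None back.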

Testing
One of the most important parts of open source contribution is testing. I created test_sensor_invocation_resources_callable_with_custom_signature(), which basically does exactly what the issue described:

Creates a callable object with a custom __signature__
Verifies that Dagster can now correctly read the type hints
Confirms that resources are properly recognized

Once the code passed the test, I ran the entire test suite to make sure I didn't break anything. All 52 tests in test_sensor_invocation.py passed. That's always a good feeling.

What I Learned
This contribution taught me way more than just "fix this bug".

Python's Signature Protocol: I had no idea Python had such a sophisticated system for custom signatures. The __signature__ attribute is recognized by the standard library's inspect.signature() and is specifically designed for cases like this.
Testing is Critical: In a production system like Dagster, you can't just "fix it and good to go." I had to make sure my change didn't break any existing functionality. The test suite is your safety net.
Reading Complex Codebases: This required understanding how Dagster's decorator system works, how it resolves resources, and how the whole dependency injection mechanism functions. It was challenging but super rewarding.

I'm really enjoying this open source contribution journey. Each project teaches me something new. If you're thinking about contributing to open source, my advice is, don't be afraid to tackle issues that seem a bit over your head. You'll realize you are not as stupid as you think you are. Just make sure you understand the problem and the project before you start coding.