<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Philipp Burckhardt</title>
    <description>The latest articles on Forem by Philipp Burckhardt (@planeshifter).</description>
    <link>https://forem.com/planeshifter</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1146542%2F4d9bba70-6bf9-4dee-b5bc-23f4a2ccc85d.jpeg</url>
      <title>Forem: Philipp Burckhardt</title>
      <link>https://forem.com/planeshifter</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/planeshifter"/>
    <language>en</language>
    <item>
      <title>Using AI in the development of stdlib</title>
      <dc:creator>Philipp Burckhardt</dc:creator>
      <pubDate>Thu, 17 Jul 2025 19:18:41 +0000</pubDate>
      <link>https://forem.com/stdlib/using-ai-in-the-development-of-stdlib-48aa</link>
      <guid>https://forem.com/stdlib/using-ai-in-the-development-of-stdlib-48aa</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Feeling fast, but working slow? A reflection on stdlib's participation in the recent METR study on AI's impact on open-source developer productivity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I read the results of the recent METR study on &lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;"Impact of Early-2025 AI on Experienced Open-Source Developer Productivity"&lt;/a&gt; with great interest for two reasons. Firstly, I have been an early adopter of LLM tools: in 2020, I was lucky enough to get access to the private beta of the OpenAI API from then-CTO Greg Brockman and explored the use of AI for education at Carnegie Mellon University. Secondly, because &lt;a href="https://github.com/stdlib-js/stdlib" rel="noopener noreferrer"&gt;stdlib&lt;/a&gt; participated in the METR study, I was personally involved, working on randomized &lt;a href="https://github.com/stdlib-js/metr-issue-tracker" rel="noopener noreferrer"&gt;issues&lt;/a&gt; over several months and being allowed to use AI for some tasks while forbidden from using it for others.&lt;/p&gt;

&lt;p&gt;Given that &lt;a href="https://github.com/stdlib-js/stdlib" rel="noopener noreferrer"&gt;stdlib&lt;/a&gt;'s involvement is central to my perspective, it's worth providing some context on the project. stdlib is a comprehensive open-source standard library for JavaScript and Node.js, with a specific and ambitious goal: to be the fundamental library for numerical and scientific computing on the web. It is a large-scale project comprising well over 5 million source lines of JavaScript, C, Fortran, and WebAssembly across thousands of independently consumable packages, bringing the rigor of high-performance mathematics, statistics, and machine learning to the JavaScript ecosystem. Think of it as a foundational layer for data-intensive applications, similar to the roles NumPy and SciPy serve in the Python ecosystem. In short, stdlib isn't your average JavaScript project.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Word of Thanks
&lt;/h2&gt;

&lt;p&gt;Before diving into my reflection, I want to take the opportunity to thank the METR team, and especially Nate Rush, for giving stdlib the chance to participate in this study with two core stdlib developers, &lt;a href="https://github.com/headlessNode" rel="noopener noreferrer"&gt;Muhammad Haris&lt;/a&gt; and &lt;a href="https://github.com/Planeshifter" rel="noopener noreferrer"&gt;myself&lt;/a&gt;. It was a great experience to work with the METR team, and I am eager to see the future studies they conduct. It is my conviction that, with the entire tech industry gripped by an AI gold rush, it is incredibly valuable to have a non-profit research institute like METR conduct studies that cut through the noise with actual data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Slowdown
&lt;/h2&gt;

&lt;p&gt;The results of the METR study are surprising, clashing with some previously published and very optimistic findings on the impact of generative AI (e.g., see GitHub and Accenture's &lt;a href="https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-in-the-enterprise-with-accenture/#:~:text=Since%20bringing%20GitHub%20Copilot%20to,world%2C%20large%20engineering%20organizations" rel="noopener noreferrer"&gt;2023 study on the impact of Copilot on developer productivity&lt;/a&gt;). Quoting from the Core Result section of the &lt;a href="https://METR.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;METR study page&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When developers are allowed to use AI tools, they take 19% longer to complete issues—a significant slowdown that goes against developer beliefs and expert forecasts. This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Rather predictably, the results have led to a lot of discussion on &lt;a href="https://news.ycombinator.com/item?id=44522772" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt; and other social channels, with parties on both sides lining up with their pitchforks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Perception Gap
&lt;/h2&gt;

&lt;p&gt;I am part of the group of developers who estimated during the study's exit interview that they were sped up by 20%-30%. While I like to believe that my productivity didn't suffer while using AI for my tasks, it's quite possible that AI didn't help me as much as I anticipated, or that it even hampered my efforts.&lt;/p&gt;

&lt;p&gt;But how can that be? Every day, we read about how AI is already revolutionizing the workplace or making software engineers redundant, with companies like Salesforce &lt;a href="https://www.theregister.com/2025/02/27/salesforce_misses_revenue_guidance/" rel="noopener noreferrer"&gt;announcing&lt;/a&gt; that they won't be hiring for software engineering positions anymore or online lender Klarna &lt;a href="https://www.forbes.com/sites/quickerbettertech/2024/03/13/klarnas-new-ai-tool-does-the-work-of-700-customer-service-reps/" rel="noopener noreferrer"&gt;announcing&lt;/a&gt; that it was shuttering its entire human customer support operation in favor of AI.&lt;/p&gt;

&lt;p&gt;Many of these stories have turned out to be more hyperbole than reality. Klarna &lt;a href="https://www.independent.co.uk/news/business/klarna-ceo-sebastian-siemiatkowski-ai-job-cuts-hiring-b2755580.html" rel="noopener noreferrer"&gt;still has&lt;/a&gt; human support, and Salesforce still has many engineering &lt;a href="https://careers.salesforce.com/en/jobs/?search=engineer&amp;amp;pagesize=20#results" rel="noopener noreferrer"&gt;job listings&lt;/a&gt;. Sadly, some of these stories appear influenced by ulterior motives, such as Klarna's strategic positioning as an "AI-native" company to capture premium valuations ahead of its IPO amid the current AI wave.&lt;/p&gt;

&lt;p&gt;However, I have been using AI tools daily for the past three years, both at work and outside, and find them immensely useful. How do I square these benefits with the study results?&lt;/p&gt;

&lt;h2&gt;
  
  
  On Study Design
&lt;/h2&gt;

&lt;p&gt;When confronted with results that run counter to one's expectations, it is a natural instinct to attack the study and identify holes that explain away the result. For example, one could point to the small sample size of 16 developers. There is also the argument that the study was conducted in a very specific context, with experienced developers working on projects they are intimately familiar with.&lt;/p&gt;

&lt;p&gt;There might also have been a subtle selection effect in the tasks themselves: since project maintainers proposed their own task lists, it is possible that those more experienced with AI subconsciously selected issues they believed were more amenable to an agentic workflow. One could also argue that the developers were subject to the &lt;a href="https://en.wikipedia.org/wiki/Hawthorne_effect" rel="noopener noreferrer"&gt;Hawthorne effect&lt;/a&gt;, altering their behavior simply because they knew they were being video-recorded, perhaps over-relying on the AI tools for the sake of the experiment. &lt;/p&gt;

&lt;p&gt;Finally, and perhaps most importantly, the experimental setup of requiring screen recordings and active time tracking for a single task enforced a synchronous workflow. This effectively locked developers into what I call "supervision mode", where they had to watch the agent work rather than being free to context-switch to another problem.&lt;/p&gt;

&lt;p&gt;Some of these critiques, particularly the enforced "supervision" workflow, could directly contribute to the observed slowdown. But others, such as selecting "AI-friendly" tasks or over-relying on the tool to impress researchers, should have biased the results toward a speedup. This makes the final outcome even more notable. The direction of various potential biases is ambiguous at best, which is why we must look at the study's core design.&lt;/p&gt;

&lt;p&gt;As a randomized controlled trial, the study follows the gold-standard experimental design for detecting causality. By randomizing individual tasks to "AI-allowed" or "AI-disallowed", the study isolates the effect of AI tooling. Instead of comparing one group of developers against a control group (where differences in skill could skew the results), it compares each developer against themselves. This "within-subjects" design controls for individual characteristics, from typing speed to experience with the project. With such a study design, results are harder to write off as mere statistical noise, even with a smaller sample size.&lt;/p&gt;

&lt;p&gt;Crucially, the tasks were defined before this randomization. This avoids a common pitfall where AI might simply produce more verbose code or encourage developers to break tasks into smaller pull requests, which can inflate some productivity metrics without representing more work getting done. &lt;/p&gt;

&lt;p&gt;Sixteen developers from several open-source projects might not sound like much, but, in total, we completed 246 tasks. To give a sense of the work &lt;a href="https://github.com/stdlib-js/metr-issue-tracker/issues?q=sort%3Aupdated-desc%20is%3Aissue" rel="noopener noreferrer"&gt;involved&lt;/a&gt;, the tasks Haris and I worked on were not trivial, while still being hand-scoped to be completable in a few hours or less. They were a mix of core feature development (such as adding new array, string, and BLAS functions), creating custom ESLint rules to enforce project-specific coding standards, enhancing our CI/CD pipelines with new automation, and fixing bugs from our issue tracker.&lt;/p&gt;

&lt;p&gt;And while a single developer's performance on one task is likely correlated with their performance on another, making the estimates less precise than the raw task count would suggest, it is quite notable that the effect was in the opposite direction from what economists, ML experts, and the developers themselves predicted (with the former two groups being more in the range of a 40% speedup). Moreover, the effect is quite large in magnitude. A quick back-of-the-envelope calculation reveals that if the true effect were a 40% speedup, the probability of observing a result this far in the opposite direction would be astronomically low.&lt;/p&gt;

&lt;p&gt;In light of this, I have no reason to doubt the internal validity of the study and would venture that the effect measured is real within the context of the experiment. If one believed the chatter on social media and the hype merchants who two years ago were all shilling cryptocurrency (and maybe still are!) but have meanwhile all switched over to extolling the amazing speedup AI offers, then increases of 100%, 5x, or even 10x should have been in the cards. But this is definitively not what the study observed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Embracing Agentic Development
&lt;/h2&gt;

&lt;p&gt;The more important consideration for squaring my own experience with these results is external validity: how generalizable are the study's findings? The paper is a great read and touches on many possible criticisms and threats to external validity, and I won't belabor any of the points raised therein.&lt;/p&gt;

&lt;p&gt;Instead, I will solely focus on my experience as a study participant and how I have been leveraging AI with success. I will also share my own hypotheses for why the performance of the developers in this sample was overall negatively affected by the use of AI. &lt;/p&gt;

&lt;p&gt;To give some context, my main way of incorporating LLMs into my work before participating in this study was twofold. As something of an early adopter, I had used GitHub Copilot for auto-completion and inline suggestions and made heavy use of ChatGPT and Anthropic Claude web apps by assembling relevant context, writing detailed prompts, and copying results back into my editor. Tools such as &lt;a href="https://repomix.com/" rel="noopener noreferrer"&gt;Repomix&lt;/a&gt; helped streamline the process of incorporating LLMs into my daily development workflow. This general approach allowed me to review changes quickly, iterate on them by asking questions, and have the LLM make follow-up edits directly in a chat interface.&lt;/p&gt;
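&lt;p&gt;As a rough sketch of that copy-paste workflow, a single Repomix invocation can pack the relevant parts of a repository into one file to paste into a chat session (the glob pattern here is illustrative, and the flags may differ across Repomix versions):&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;# Pack a subset of the repository into a single context file:
npx repomix --include "lib/node_modules/@stdlib/string/**" --output context.txt

# Paste context.txt into the chat together with a detailed prompt.
&lt;/code&gt;&lt;/pre&gt;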

&lt;p&gt;The METR study subsequently provided an excuse for me to delve into agentic programming and make Cursor an integral part of my workflow. I had used it briefly some time before but didn't find the AI-generated results compelling enough to let it loose on any codebase I was working on. But Claude 3.7 Sonnet had since come out, which remains one of the most powerful models for coding tasks. Due to some very encouraging results during early testing, I was eager to put it to work on a backlog of tooling that we wanted to build for stdlib, alongside various refactorings and bug fixes.&lt;/p&gt;

&lt;p&gt;One of my first impressions with Cursor this time around was the underlying LLM's rather impressive ability to follow the very specific coding standards and conventions of the project and, when placed in agent mode, to automatically and reliably fix lint errors and iteratively resolve errors in unit tests. This felt like another step change in capabilities, much like when OpenAI released GPT-3 Davinci in June 2020, which suddenly made feasible many use cases that previously would have broken down in any realistic scenario.&lt;/p&gt;

&lt;p&gt;While I no longer use Cursor and have since switched to Claude Code (more on that later), I found Cursor straightforward to use, especially given that it is a fork of VSCode, which has been my IDE of choice for many years. I strongly doubt that inexperience with Cursor, which I shared with roughly half of the developers in the study, played a major role in the results. While I didn't have an extensive &lt;code&gt;.cursorrules&lt;/code&gt; setup (which has since been deprecated in favor of &lt;a href="https://docs.cursor.com/context/rules#project-rules" rel="noopener noreferrer"&gt;project rules&lt;/a&gt;), I did add basic instructions and context about the project and made sure to index the stdlib codebase. Aside from that, further customization was neither possible nor necessary, as the Cursor Agent was able to automatically pull in other files, look up function call signatures, and perform other operations for assembling context.&lt;/p&gt;

&lt;p&gt;My experience of Cursor during the study was largely positive. As an example, I ended up working on several Bash scripts for our CI/CD pipeline, and Cursor definitely sped up my development workflow by sparing me from looking up the man page of &lt;code&gt;jq&lt;/code&gt; for the eleventh time, given that I use this command-line tool for manipulating JSON only once in a blue moon. With the AI agent's help, I could quickly generate a function like this one to check if a GitHub issue has a specific label:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if an issue has the "Tracking Issue" label.&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# $1 - Issue number&lt;/span&gt;
is_tracking_issue&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;issue_number&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;response

    debug_log &lt;span class="s2"&gt;"Checking if issue #&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;issue_number&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; is a tracking issue"&lt;/span&gt;
    &lt;span class="c"&gt;# Get the issue:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nv"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;github_api &lt;span class="s2"&gt;"GET"&lt;/span&gt; &lt;span class="s2"&gt;"/repos/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;repo_owner&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;repo_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/issues/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;issue_number&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Warning: Failed to fetch issue #&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;issue_number&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;amp;2
        &lt;span class="k"&gt;return &lt;/span&gt;1
    &lt;span class="k"&gt;fi&lt;/span&gt;

    &lt;span class="c"&gt;# ...&lt;/span&gt;

    &lt;span class="c"&gt;# Check if the issue has the "Tracking Issue" label:&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$response&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.labels[].name'&lt;/span&gt; 2&amp;gt;/dev/null | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"Tracking Issue"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;debug_log &lt;span class="s2"&gt;"Issue #&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;issue_number&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; is a tracking issue"&lt;/span&gt;
        &lt;span class="k"&gt;return &lt;/span&gt;0
    &lt;span class="k"&gt;else
        &lt;/span&gt;debug_log &lt;span class="s2"&gt;"Issue #&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;issue_number&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; is not a tracking issue"&lt;/span&gt;
        &lt;span class="k"&gt;return &lt;/span&gt;1
    &lt;span class="k"&gt;fi&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent correctly assembled the &lt;code&gt;jq -r '.labels[].name'&lt;/code&gt; filter to extract the label names from the JSON response, something that would otherwise have sent me to a documentation page for a few minutes. Each such lookup is only a small speed bump, but these moments add up. The AI handled the rote task of recalling obscure syntax, letting me focus on the actual logic.&lt;/p&gt;

&lt;p&gt;My first takeaway is this: current LLMs are very powerful for tasks in domains with which you are not intimately familiar, allowing you to move much more quickly. Agentic tools such as Cursor and Claude Code are also very helpful for quickly navigating and learning your way around a large codebase, allowing you to ask questions and explore in a natural way. Leveraging "deep research" provides another means of exploring a problem space more exhaustively, in a way that the search engines of old simply cannot match.&lt;/p&gt;

&lt;p&gt;On the other hand, some tasks were very frustrating. For example, the Cursor agent wrote one ESLint rule almost fully in one shot, but on another it ran in circles, unable to figure out the correct algorithm. Multiple attempts to prompt it into fixing the bug were unsuccessful. It would have been better not to fall prey to the &lt;a href="https://en.wikipedia.org/wiki/Sunk_cost" rel="noopener noreferrer"&gt;sunk cost fallacy&lt;/a&gt; and instead throw away the code and then either give the agent another shot or write it myself.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Cursor does have a neat checkpoint feature that allows you to stop the agent at any time and revert to a prior state, something I wholeheartedly recommend using. It is a great way to avoid getting stuck in a loop of the agent trying to fix a bug that it cannot figure out.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I freely admit that I may have been a bit overeager about using AI for all of the AI-enabled tasks, partly due to my desire to learn to use Cursor productively but also due to my general amazement at what these new technologies unlock. If anything, the METR study suggests that the question of whether a task can be more efficiently completed by AI, or whether one would be better off completing it by hand, is far from settled.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Blank Slate Problem
&lt;/h2&gt;

&lt;p&gt;Aside from occasional inefficiencies and outright mistakes in the generated code, coding agents do not have access to all the implicit knowledge and conventions of a large, mature project, which often might not be written down. In &lt;a href="https://johnwhiles.com/posts/mental-models-vs-ai-tools" rel="noopener noreferrer"&gt;his reflections&lt;/a&gt; on the study, John Whiles identifies a core conflict: an expert engineer's primary value isn't just writing code; it's holding a complete, evolving mental model of the entire system in their head. The agent does not have such a mental model. Every interaction starts from a blank slate.&lt;/p&gt;

&lt;p&gt;It is possible that some of this can be mitigated with better, more targeted instructions. As usual, there is no free lunch: one has to actively invest in making one's codebase more accessible to coding agents. More generally, memory and learning remain unsolved problems for transformer-based LLMs, and changing that will likely require fundamental architectural advances.&lt;/p&gt;

&lt;p&gt;The necessity of auditing the agent's code for mistakes created two major sources of friction: the cognitive drain of 'babysitting' the AI and the time spent waiting for and reviewing its output. For every minute the agent spent running in circles on that ESLint rule, I was blocked, my attention monopolized by the need to supervise its flawed process. This synchronous, blocking workflow is exhausting and inefficient. It's the digital equivalent of shoulder-surfing an overconfident junior developer who has memorized everything there is to know about programming but cannot be trusted and who will make subtle mistakes that are hard to spot.&lt;/p&gt;

&lt;p&gt;My advice: stay in the driver's seat during such pair programming and use the AI as a sparring partner to bounce ideas back and forth instead of yielding agency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delegate, Don't Supervise
&lt;/h2&gt;

&lt;p&gt;Partly based on my experiences in the study, my workflow has evolved, and I have subsequently switched to using Anthropic's Claude Code. This has changed my interaction model from synchronous supervision to asynchronous delegation. I can now define a complex task via Claude Code's planning mode and then have the agent work on the task in the background. I can then turn my full attention elsewhere, be it attending a meeting, reviewing a colleague's code, or simply thinking through the next problem without interruption. Claude's work happens in parallel and is not a blocker to my own. The cognitive cost of babysitting is replaced by the much lower cost of reviewing a completed proposal later; if it didn't work out, I might just throw away the code and have the model try again, instead of engaging in a fruitless back and forth.&lt;/p&gt;

&lt;p&gt;Claude Sonnet 4 and Opus 4 were not released at the time the METR study was conducted, and, while they mark another improvement, especially with regard to tool use by the model, the dynamics haven't fundamentally changed. The models still make mistakes and do not always implement things in an optimal or sound way, but they are now much better at following instructions and can work uninterrupted for longer periods of time.&lt;/p&gt;

&lt;p&gt;In contrast to those who dismiss coding agents as mere "stochastic parrots", I find myself absolutely amazed that we now have a technology that, despite its warts and hiccups, is able to take a set of instructions and generate a fully-formed pull request that correctly implements logic, adheres to style guidelines, and has a passing test suite. And, in the best cases, this can happen without any human intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  The First 80 Percent
&lt;/h2&gt;

&lt;p&gt;We still need to reconcile the observed performance decrease with the fact that many developers, including myself, have been leveraging AI to get tasks done in a fraction of the time, tasks that would previously have taken them hours or days. I believe that the &lt;a href="https://en.wikipedia.org/wiki/Pareto_principle" rel="noopener noreferrer"&gt;Pareto Principle&lt;/a&gt; is a helpful yardstick. Named after the Italian economist Vilfredo Pareto, it is commonly referred to as the 80/20 rule and posits that roughly 80% of effects come from 20% of causes. Coding agents can now generate code that mostly works but that might fall short if the goal is 100%.&lt;/p&gt;

&lt;p&gt;In many instances, coding agents can easily accomplish the first 80% of a programming task: generating boilerplate, scaffolding logic, implementing core functionality, and writing a test suite. However, the final 20% of the task, from handling tricky edge cases and adhering to unwritten architectural conventions to ensuring optimal performance and avoiding code duplication by reusing existing utilities, is where the complexity lies. This last mile still requires the developer's deep, stateful mental model of the project. The rub here is that, by using the AI agent, one may bypass all the little steps that are necessary for building that mental model.&lt;/p&gt;

&lt;p&gt;But does it matter? When working on a crucial piece of a larger, complex system, it definitely does, and I would be hesitant with generative AI. But when working on a well-defined, isolated piece of code with expected behavior for inputs and outputs, why bother? The marginal cost of writing code (long recognized as only a small part of software engineering) is going to zero. In the event that there is a problem with the code, it can simply be thrown away and rewritten. The code that AI agents now generate is of decent quality, well-documented, and capable of adhering to one's coding conventions.&lt;/p&gt;

&lt;p&gt;This brings to mind the following quote by &lt;a href="https://tidyfirst.substack.com/p/90-of-my-skills-are-now-worth-0" rel="noopener noreferrer"&gt;Kent Beck&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The value of 90% of my skills just dropped to $0. The leverage for the remaining 10% went up 1000x. I need to recalibrate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI as a force multiplier is why I am long on AI, even though the METR study is a good reminder that we all can easily fall prey to cognitive biases. &lt;/p&gt;

&lt;p&gt;In &lt;a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow" rel="noopener noreferrer"&gt;&lt;em&gt;Thinking, Fast and Slow&lt;/em&gt;&lt;/a&gt;, Daniel Kahneman gives a classic example for biases driven by the &lt;a href="https://en.wikipedia.org/wiki/Availability_heuristic" rel="noopener noreferrer"&gt;availability heuristic&lt;/a&gt;: people overestimate plane crash risks due to vivid media coverage, making such events more "available" to memory than statistically riskier, yet routine, car crashes. Our judgment is swayed not by data, but by the ease of recall. In the case of working with AI agents, observing them build fully-functioning tools in seconds is a very memorable and visceral experience. On the other hand, the slow, frustrating "death by a thousand cuts" experience of auditing, debugging, and correcting the AI's subtle mistakes is the equivalent of the mundane car crash. It's a distributed cost with no single dramatic moment.&lt;/p&gt;

&lt;p&gt;Nevertheless, I have no reason to believe that this technology will not continue to improve, and I, for one, am excited about the possibilities. For any big and ambitious project, the number of tickets to complete, features to implement, and bugs to fix vastly outstrips the available time and human bandwidth to work on them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Future Studies Should Tell Us
&lt;/h2&gt;

&lt;p&gt;It remains to be seen whether the results of the METR study can be replicated. However, the study clearly demonstrated that experts and developers were overly optimistic about the impact of AI on productivity. This is an important insight that should inform future research.  &lt;/p&gt;

&lt;p&gt;In some ways, the study raises more questions than it answers. It looked at a very particular situation: seasoned experts working in the familiar territory of their own large, mature projects. Future studies by METR and others could vary these conditions. What happens when we throw developers into unfamiliar codebases, where, at least per my anecdotal experience, AI agents shine? Or what about junior developers or new contributors to an established open-source codebase? Under what conditions can AI act as a great equalizer, compressing the skill gap and providing a speed boost rather than a slowdown?&lt;/p&gt;

&lt;p&gt;Furthermore, the current study centered on completion time, but faster isn't always better. One possible follow-up would be a blinded study where human experts review pull requests without knowing if AI was involved. We could then measure things like the number of review cycles, the time spent in review, and the long-term maintainability of the code. This might shed light on when and how AI-assisted development trades short-term speed for long-term technical debt.&lt;/p&gt;

&lt;p&gt;Finally, the field of AI is still evolving at a rapid pace. The synchronous workflow that the study's setup encouraged could be fundamentally suboptimal. Exploring different interaction models, such as the asynchronous delegation workflow that I've moved to, could yield very different results.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Work With AI Now
&lt;/h2&gt;

&lt;p&gt;What follows are my current recommendations for using AI in your daily workflow based on my experiences and the METR study.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adopt an Asynchronous Workflow
&lt;/h3&gt;

&lt;p&gt;The biggest drain from using AI is the cognitive load of "babysitting" it. Instead of watching the agent work, adopt an asynchronous model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Define one or more tasks (e.g., running a set of commands to audit a codebase for lint errors and documentation mistakes), let AI agents work on them in the background (e.g., in separate Git worktrees of your repository; see the sketch after this list), and turn your attention elsewhere.&lt;/li&gt;
&lt;li&gt;  Review the completed task(s) later. If the output is flawed, it's often better to discard it and have the model try again with a better prompt rather than engaging in a frustrating back-and-forth.&lt;/li&gt;
&lt;/ul&gt;
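&lt;p&gt;Here is a minimal sketch of what this can look like in practice, assuming the headless &lt;code&gt;claude -p&lt;/code&gt; invocation of recent Claude Code versions (the branch name, prompt, and log path are illustrative):&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;# Give the agent an isolated checkout so it cannot disturb your working tree:
git worktree add ../stdlib-lint-audit -b lint-audit
cd ../stdlib-lint-audit

# Kick off the task in the background, logging the output for later review:
nohup claude -p "Audit the codebase for lint errors and documentation mistakes and fix them." &amp;gt; ../lint-audit.log 2&amp;gt;&amp;amp;1 &amp;amp;

# ...turn your attention elsewhere; once the run completes, inspect the log
# and the diff:
tail ../lint-audit.log
git diff
&lt;/code&gt;&lt;/pre&gt;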

&lt;h3&gt;
  
  
  Know What to Delegate
&lt;/h3&gt;

&lt;p&gt;AI can now handle the first 80% of many programming tasks, but the final 20% often requires deep context. The key is to choose the right tasks for AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;"Vibe Code" and Prototypes:&lt;/strong&gt; use AI for mock-ups or small, isolated tools that can be thrown away. This is where the technology's speed offers a distinct advantage.
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Verifiable Code:&lt;/strong&gt; AI is excellent for tasks that can be fully verified against an existing, robust test suite. The tests act as a safety net to catch the subtle mistakes the AI might make.
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Boilerplate Code:&lt;/strong&gt; AI can quickly generate boilerplate code, such as REST API endpoints or form validation, and can do so in a way that follows project conventions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Learning and Navigation:&lt;/strong&gt; use AI to quickly learn your way around a large codebase, document previously undocumented code, or to get help with tools you use infrequently. Asking LLMs questions can be much faster than hunting through documentation, particularly if that documentation is split across multiple resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use and Customize Claude Code
&lt;/h3&gt;

&lt;p&gt;For tools such as Claude Code, customization is a helpful means of writing down any implicit knowledge about the project that is not readily accessible from the code alone.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Provide Proper Context:&lt;/strong&gt; drag and drop relevant files (this can include images!) into the Claude Code window for the model to use as context for the task at hand. One approach I have found useful is to add TODO comments in the codebase with the required changes, and then have Claude Code work on them. Use the planning mode to have the model think through the task and generate a plan that you can approve before it jumps into implementation.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Project Memory:&lt;/strong&gt; use &lt;code&gt;CLAUDE.md&lt;/code&gt; files to give the model project-specific &lt;a href="https://docs.anthropic.com/en/docs/claude-code/memory#how-claude-looks-up-memories" rel="noopener noreferrer"&gt;memory&lt;/a&gt;, specifically on its architecture and unwritten knowledge. You can have multiple &lt;code&gt;CLAUDE.md&lt;/code&gt; files in different project sub-directories, and the model will intelligently pick up the most relevant one based on your current context.&lt;/p&gt;
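&lt;p&gt;A minimal, hypothetical sketch of such a file (the conventions shown are illustrative, not stdlib's actual guidelines):&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Notes for Claude&lt;/span&gt;

&lt;span class="p"&gt;-&lt;/span&gt; Every package ships its own docs, examples, tests, and benchmarks.
&lt;span class="p"&gt;-&lt;/span&gt; Run the project linters before committing; never disable lint rules inline.
&lt;span class="p"&gt;-&lt;/span&gt; Prefer existing @stdlib utilities over writing new helper functions.
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;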
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Automate Repetitive Actions:&lt;/strong&gt; create &lt;a href="https://docs.anthropic.com/en/docs/claude-code/slash-commands#custom-slash-commands" rel="noopener noreferrer"&gt;custom slash commands&lt;/a&gt; for recurring routine tasks. Below is an example &lt;code&gt;stdlib:review-changed-packages&lt;/code&gt; command that I run to flag possible errors in PRs recently merged to our development branch:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;-&lt;/span&gt; Pull down the latest changes from the develop branch of the stdlib repository.
&lt;span class="p"&gt;-&lt;/span&gt; Get all commits from the past $ARGUMENTS day(s) that were merged to the develop branch
&lt;span class="p"&gt;-&lt;/span&gt; Extract a list of @stdlib packages touched by those commits
&lt;span class="p"&gt;-&lt;/span&gt; Review the packages for any typos, bugs, violations of the stdlib style guidelines, or inconsistencies introduced by the changes.
&lt;span class="p"&gt;-&lt;/span&gt; Fix any issues found during the review.
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Build Custom Tooling:&lt;/strong&gt; use the Claude CLI to build small, automated tools, such as a review bot that flags typos as a daily cron job. For fuzzy tasks such as pointing out typos or inconsistencies in a PR, it's best to let Claude generate output that can be verified by a human. For well-defined tasks that can be fully automated, it is better to have Claude produce code that runs deterministically and can be verified.&lt;/p&gt;
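&lt;p&gt;A minimal sketch of such a bot, again assuming the headless &lt;code&gt;claude -p&lt;/code&gt; invocation with piped input (the prompt and file names are illustrative):&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# Run daily via cron, e.g.: 0 6 * * * /path/to/review-bot.sh

# Collect everything merged to develop over the past day:
git fetch origin develop
diff="$(git log --since='1 day ago' -p origin/develop)"

# Have Claude flag typos and inconsistencies for a human to verify:
echo "${diff}" | claude -p 'Review this diff and list any typos or inconsistencies as a markdown checklist.' &amp;gt; typo-report.md
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;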
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set up Hooks to Automate Actions:&lt;/strong&gt; &lt;a href="https://docs.anthropic.com/en/docs/claude-code/hooks" rel="noopener noreferrer"&gt;hooks&lt;/a&gt; are a powerful new feature of Claude Code that allows you to run scripts and commands at different points in Claude's agentic lifecycle.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;It's natural to attack a study whose results you don't like. A better response is to ask what they might be telling you. For me, it tells me there is still a lot to learn about how to use this new, powerful, but often deeply weird and unpredictable technology. One mistake is treating it as the driver in a pair programming session that requires your constant attention. Instead, treat it like a batch process for grunt work, freeing you to focus on the problems that actually require a human brain.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/stdlib-js/stdlib" rel="noopener noreferrer"&gt;stdlib&lt;/a&gt; is an open source software project dedicated to providing a comprehensive suite of robust, high-performance libraries to accelerate your project's development and give you peace of mind knowing that you're depending on expertly crafted, high-quality software.&lt;/p&gt;

&lt;p&gt;If you've enjoyed this post, give us a star 🌟 on &lt;a href="https://github.com/stdlib-js/stdlib" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and consider &lt;a href="https://opencollective.com/stdlib" rel="noopener noreferrer"&gt;financially supporting&lt;/a&gt; the project. Your contributions and continued support help ensure the project's long-term success and are greatly appreciated!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
      <category>ai</category>
    </item>
    <item>
      <title>GSoC 2025 Projects Announced</title>
      <dc:creator>Philipp Burckhardt</dc:creator>
      <pubDate>Fri, 09 May 2025 02:30:04 +0000</pubDate>
      <link>https://forem.com/stdlib/gsoc-2025-projects-announced-53al</link>
      <guid>https://forem.com/stdlib/gsoc-2025-projects-announced-53al</guid>
      <description>&lt;p&gt;Today, we are grateful to announce that stdlib, the fundamental numerical library for JavaScript, was awarded five slots in this year's Google's Summer of Code (GSoC). We participated in the program last year for the first time, and had four talented students working on a variety of projects. It was a resounding success, which we hope to surpass this year given all &lt;a href="https://blog.stdlib.io/reflecting-on-gsoc-2024/" rel="noopener noreferrer"&gt;that we have learned&lt;/a&gt; over the past year and a half.&lt;/p&gt;

&lt;p&gt;This achievement comes after a tremendously productive start to 2025. Since January 1st of this year, the stdlib community has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Opened 2,000 PRs, with 1,377 successfully merged.&lt;/li&gt;
&lt;li&gt;  Welcomed contributions from 88 different contributors.&lt;/li&gt;
&lt;li&gt;  Added 3,452 commits to the repository.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For GSoC, we received 99 excellent applications from enthusiastic students. Ranking the proposals involved tough decisions, and we would have loved to accept a few more projects. We are grateful to everyone who applied and encourage those not selected this year to stay connected, continue contributing to the project, and apply again next year! In fact, one of this year's accepted contributors was a repeat applicant, demonstrating how persistence and continued engagement can pay off.&lt;/p&gt;

&lt;p&gt;The accepted projects are listed below. Each project addresses key areas that will expand JavaScript's potential for technical and scientific applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/opJzlQTz" rel="noopener noreferrer"&gt;&lt;strong&gt;Add LAPACK bindings and implementations for linear algebra&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Contributor:&lt;/strong&gt; &lt;a href="https://github.com/aayush0325" rel="noopener noreferrer"&gt;Aayush Khanna&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal of Aayush's project is to develop JavaScript and C implementations of LAPACK (&lt;strong&gt;L&lt;/strong&gt;inear &lt;strong&gt;A&lt;/strong&gt;lgebra &lt;strong&gt;Pack&lt;/strong&gt;age) routines. This project aims to extend conventional LAPACK APIs by borrowing ideas from BLIS, thus ensuring easy compatibility with stdlib ndarrays and adding support for both row-major (C-style) and column-major (Fortran-style) storage layouts. This work will help overcome LAPACK's column-major limitation and thus make advanced linear algebra operations more accessible and efficient in JavaScript environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/JYSuqCBs" rel="noopener noreferrer"&gt;&lt;strong&gt;Expanding array-based statistical computation in stdlib&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Contributor:&lt;/strong&gt; &lt;a href="https://github.com/gururaj1512" rel="noopener noreferrer"&gt;Gururaj Gurram&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gururaj will advance statistical operations in stdlib by introducing convenience array wrappers for all existing strided APIs, thus improving developer ergonomics for common use cases. Additionally, he will develop specialized ndarray statistical kernels with the aim of facilitating efficient statistical reductions across multi-dimensional data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/Td3c9qv2" rel="noopener noreferrer"&gt;&lt;strong&gt;Implement base special mathematical functions in JavaScript and C&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Contributor:&lt;/strong&gt; &lt;a href="https://github.com/anandkaranubc" rel="noopener noreferrer"&gt;Karan Anand&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Karan will implement and enhance lower-level scalar kernels for special mathematical functions in stdlib. The goal is to complete missing C implementations for existing double-precision packages, develop new single-precision versions, and ensure consistency, accuracy, and IEEE 754 compliance. These enhancements will provide developers with the most comprehensive set of high-precision mathematical tools for scientific computing in JavaScript.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/lKDCoGBz" rel="noopener noreferrer"&gt;&lt;strong&gt;Achieve ndarray API parity with built-in JavaScript arrays&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Contributor:&lt;/strong&gt; &lt;a href="https://github.com/headlessNode" rel="noopener noreferrer"&gt;Muhammad Haris&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Haris will extend stdlib's ndarray capabilities by implementing familiar JavaScript array methods like &lt;code&gt;concat&lt;/code&gt;, &lt;code&gt;find&lt;/code&gt;, &lt;code&gt;flat&lt;/code&gt;, &lt;code&gt;includes&lt;/code&gt;, &lt;code&gt;indexOf&lt;/code&gt;, &lt;code&gt;reduce&lt;/code&gt;, and &lt;code&gt;sort&lt;/code&gt; for multi-dimensional arrays. The project will develop high-performance C implementations with Node.js native add-ons for compute-intensive operations. These enhancements will allow JavaScript developers to work with multi-dimensional arrays as easily as built-in arrays, significantly expanding JavaScript's capabilities for scientific and numerical computing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/NJC5LuLO" rel="noopener noreferrer"&gt;&lt;strong&gt;Add BLAS bindings and implementations for linear algebra&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Contributor:&lt;/strong&gt; &lt;a href="https://github.com/ShabiShett07" rel="noopener noreferrer"&gt;Shabareesh Shetty&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Shabareesh will expand stdlib's BLAS (&lt;strong&gt;B&lt;/strong&gt;asic &lt;strong&gt;L&lt;/strong&gt;inear &lt;strong&gt;A&lt;/strong&gt;lgebra &lt;strong&gt;S&lt;/strong&gt;ubprograms) support by implementing missing Level 2 (vector-matrix) and Level 3 (matrix-matrix) operations in JavaScript, C, Fortran, and WebAssembly. The project will focus on key dependencies for LAPACK routines and create performance-optimized APIs that work in both browser and server environments. These enhancements will provide essential building blocks for developing high-performance machine learning and statistical analysis applications on the web.&lt;/p&gt;

&lt;p&gt;We're excited to see these projects develop over the coming months. Each contribution will significantly enhance stdlib's capabilities and make advanced mathematical and statistical operations more accessible to the JavaScript community. The work done by these talented contributors will help bridge the gap between traditional scientific computing environments and JavaScript, furthering our mission to create a comprehensive, high-performance standard library for JavaScript.&lt;/p&gt;

&lt;p&gt;We'd like to extend thanks to Google for their continued support of open-source development through the Summer of Code program, and we look forward to sharing updates as the above projects progress over the course of this summer. In addition to watching for more posts on this blog, you can follow development by joining our &lt;a href="https://app.gitter.im/#/room/#stdlib-js_stdlib:gitter.im" rel="noopener noreferrer"&gt;community chat&lt;/a&gt;. We also hold regular &lt;a href="https://github.com/stdlib-js/meetings/issues?q=sort%3Aupdated-desc%20is%3Aissue%20is%3Aopen%20label%3A%22Office%20Hours%22" rel="noopener noreferrer"&gt;office hours&lt;/a&gt; over video conferencing, which is a great opportunity to ask questions, share ideas, and engage directly with the stdlib team.&lt;/p&gt;

&lt;p&gt;We hope that you'll join us in our mission to advance cutting-edge scientific computation in JavaScript. Start by showing your support and starring the project on GitHub today: &lt;a href="https://github.com/stdlib-js/stdlib" rel="noopener noreferrer"&gt;https://github.com/stdlib-js/stdlib&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/stdlib-js/stdlib" rel="noopener noreferrer"&gt;stdlib&lt;/a&gt; is an open source software project dedicated to providing a comprehensive suite of robust, high-performance libraries to accelerate your project's development and give you peace of mind knowing that you're depending on expertly crafted, high-quality software.&lt;/p&gt;

&lt;p&gt;If you've enjoyed this post, give us a star 🌟 on &lt;a href="https://github.com/stdlib-js/stdlib" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and consider &lt;a href="https://opencollective.com/stdlib" rel="noopener noreferrer"&gt;financially supporting&lt;/a&gt; the project. Your contributions and continued support help ensure the project's long-term success and are greatly appreciated!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Reflecting on GSoC 2024</title>
      <dc:creator>Philipp Burckhardt</dc:creator>
      <pubDate>Fri, 04 Oct 2024 03:14:52 +0000</pubDate>
      <link>https://forem.com/stdlib/reflecting-on-gsoc-2024-5doa</link>
      <guid>https://forem.com/stdlib/reflecting-on-gsoc-2024-5doa</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Achievements, Lessons, and Tips for Future Success&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An exciting summer has come to a close for stdlib with our first participation in Google Summer of Code (GSoC). GSoC is an annual program run by Google and a highlight within the open source community. It brings together passionate contributors and mentors to collaborate on open source projects. Selected contributors receive a stipend for their hard work, while organizations benefit from new features, improved project visibility, and the potential to cultivate long-term contributors.&lt;/p&gt;

&lt;p&gt;stdlib (/ˈstændərd lɪb/ "standard lib") is a fundamental numerical library for JavaScript. Our mission is to create a scientific computing ecosystem for JavaScript and TypeScript, similar to what NumPy and SciPy are for Python. This year, we were granted four slots in GSoC, marking a significant milestone for us as a first-time participating organization.&lt;/p&gt;

&lt;p&gt;The purpose of this post is to share our GSoC experiences to help future organizations and contributors prepare more effectively. We aim to provide insights into what worked well, what challenges we faced, and advice for making the most out of this incredible program.&lt;/p&gt;

&lt;h2&gt;
  
  
  Highlights of the Program
&lt;/h2&gt;

&lt;p&gt;While we certainly encountered bumps along the way (more on that in a second), overall, our participation in GSoC was packed with standout moments. Our accepted contributors successfully completed their four GSoC projects.&lt;/p&gt;

&lt;p&gt;To illustrate the impact of our participation, here are some key statistics and accomplishments from our community since the GSoC organization announcement in February:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Over 1,000 PRs opened&lt;/li&gt;
&lt;li&gt;  More than 100 unique PR contributors&lt;/li&gt;
&lt;li&gt;  Over 2,000 new commits to the codebase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We had a range of successful contributions that significantly advanced stdlib. Specifically, our four GSoC contributors worked on the following projects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Aman Bhansali worked on BLAS bindings, overcoming the challenge of integrating complex numerical libraries into JavaScript.&lt;/li&gt;
&lt;li&gt;  Gunj Joshi developed C implementations for special mathematical functions, significantly improving the performance of our library.&lt;/li&gt;
&lt;li&gt;  Jaysukh Makvana added support for Boolean arrays, enhancing the library's functionality and usability and paving the way for NumPy-like array indexing in JavaScript.&lt;/li&gt;
&lt;li&gt;  Snehil Shah worked on enhancing the stdlib REPL for scientific computing in Node.js, making it easier for users to interact with our library and perform data analysis in their terminals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each project addressed critical areas in our mission to create a comprehensive numerical library for JavaScript and the web platform. &lt;/p&gt;

&lt;p&gt;Finally, we already see a glimpse of the project attracting long-term contributors from both GSoC participants and the broader community.&lt;/p&gt;

&lt;h2&gt;
  
  
  An Unexpected Challenge
&lt;/h2&gt;

&lt;p&gt;Despite the many positives, our journey wasn't without its share of challenges. Early on, we faced an unexpected incident that seemed straight out of a movie plot. A prospective contributor tried to sabotage a fellow applicant by impersonating them on Gitter, the open source instant messaging and chat room service where we engage with the community. After signing up via a fake Twitter/X account, this person started sending unhinged messages to several of the project's core contributors. While it quickly became clear that we were communicating with an impersonator, it was an unsettling experience nonetheless. The impersonator even ended up copying the real applicant's proposal and later attempted to claim the work as their own on GitHub after the conclusion of GSoC.&lt;/p&gt;

&lt;p&gt;In light of this experience, we advise any organizations participating in GSoC to keep in mind that competition for slots can be fierce, and that some individuals may be tempted to use subterfuge or actively jeopardize others' applications. One must be vigilant and expect the unexpected. We also recommend having a Code of Conduct (CoC) in place to address such unethical behavior and raising awareness among GSoC contributors of its existence, such as having a CoC acknowledgment checkbox on pull requests and when submitting proposals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned and Advice for Future Participants
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Engage Early with the Community
&lt;/h3&gt;

&lt;p&gt;First and foremost, it is crucial to encourage potential contributors to start interacting with the community and codebase well before the application period. This helps build familiarity and commitment. Although we were aware of this, we could have done more to encourage early engagement and provide clearer guidance on how to get started. Going through all onboarding steps afresh may help uncover outdated information in documentation or other inconsistencies.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Community Outreach:&lt;/strong&gt; Actively promote your participation through social media, blogs, and coding forums. Use platforms like X/Twitter, LinkedIn, and relevant forums to announce your participation and engage with potential contributors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Handling Community Queries
&lt;/h3&gt;

&lt;p&gt;After our participation was announced, we were quickly bombarded with a seemingly non-stop barrage of messages on Gitter and other communication channels, and with dozens of PRs opened each day. As the core stdlib team is not working on the project full-time, it was very challenging to keep up. We learned that it's essential to set clear expectations and boundaries early on to manage the influx of new contributors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managing the Onboarding Process
&lt;/h3&gt;

&lt;p&gt;Answering the same questions repeatedly can be time-consuming, so frequently asked questions (FAQs) and a well-documented onboarding process prove invaluable. We also started a weekly office hour for people to drop by. This had a decent turnout and proved valuable: typically, only individuals who were genuinely interested in the project attended, which helped weed out those who were just making "drive-by" contributions. In addition to the weekly office hours, we also held two sessions during the application period that served as informational sessions specifically focused on GSoC, so we could answer all the questions that prospective contributors had.&lt;/p&gt;

&lt;p&gt;After the conclusion of GSoC, we have continued to hold weekly office hours, which have been a great way to keep the community engaged!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Communication Channels:&lt;/strong&gt; Clearly outline the primary communication channels (e.g., mailing lists, chat platforms like Gitter, etc.) and how to use them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Good First Issues
&lt;/h3&gt;

&lt;p&gt;What worked less well were the "good first issues" we had opened and labeled as such on GitHub. We found that issues we thought were good first ones, such as updating documentation and examples, resulted in a very high number of low-quality submissions, often suffering from hallucinated content due to AI generation or other issues, which caused more work for reviewers. On the other hand, other tasks, such as refining TypeScript definitions, were often too complex and challenging for newcomers.&lt;/p&gt;

&lt;p&gt;We learned that the best first issues are those that are well-scoped, have clear instructions, and are easy to test and verify. Having a bunch of trivial issues provides weak signal; you want to see contributors progressively tackle more complicated tasks as they become more acquainted with the project. To aid in this progression, one would be well served to have enough issues of varying difficulty that prospective contributors can tackle. If possible, it may be ideal to have issues build on top of each other and take the contributor on a journey toward mastery. Similarly, it may be good to create open issues that are related to each of the potential GSoC project topics, so that contributors can get familiar with the parts of the codebase they would be working on during the GSoC program. And lastly, consider creating issue templates specifically for GSoC participants, which include detailed instructions, links to relevant documentation, and expected outcomes. This reduces ambiguity and helps set clear expectations for newcomers.&lt;/p&gt;

&lt;p&gt;Going forward, we plan to focus on creating well-defined, incremental issues that serve as stepping stones for new contributors to build familiarity and gradually take on more complex tasks.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Starter Issues and Mini-Projects:&lt;/strong&gt; Offer beginner-friendly issues and smaller tasks early on to help newcomers familiarize themselves with the codebase. Fixing existing bugs or writing tests can be a good starting point.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Role of AI
&lt;/h3&gt;

&lt;p&gt;I think it's fair to conclude that generative AI has emerged as both a blessing and a curse in the world of open source contributions. Personally, I am an avid user of LLMs and am glad for the innovation they have sparked in the developer tooling space. They can assist non-native English speakers in communicating their ideas, provide a conversation partner equipped with vast knowledge of even quite remote topics, and increase developer productivity through code completion and generation. However, AI has also led to a flood of low-quality PRs generated by AI tools, often filled with hallucinated code or content that doesn't align with the project's actual requirements. While writing code can feel more rewarding than the often tedious task of reviewing it, especially when the code isn't your own, reviewer fatigue becomes a real issue when faced with a barrage of poorly constructed or misaligned PRs.&lt;/p&gt;

&lt;p&gt;Contributors must recognize that AI is an assistant, not a replacement for personal responsibility and craftsmanship. We have by now invested significant effort in automation to filter out low-effort submissions before they even reach the review stage. Besides workflows that close PRs which don't adhere to basic contribution conventions, we have added jobs that post helpful comments on how to set up a development environment or that remind contributors that they must accept the project's contribution guidelines before their PR can be reviewed. This significantly reduces the burden on reviewers and ensures contributors are aware of expectations from the beginning.&lt;/p&gt;
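&lt;p&gt;As a minimal sketch of what such a job can look like, the following GitHub Actions workflow posts a reminder comment on each newly opened PR via &lt;code&gt;actions/github-script&lt;/code&gt;. The file name and comment text are hypothetical, not stdlib's actual configuration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# .github/workflows/greet_new_prs.yml (hypothetical example)
name: Greet new pull requests

on:
  pull_request_target:
    types: [ opened ]

permissions:
  pull-requests: write

jobs:
  greet:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/github-script@v7
        with:
          script: |
            // Post a welcome comment pointing the contributor to setup docs and guidelines:
            await github.rest.issues.createComment({
              'owner': context.repo.owner,
              'repo': context.repo.repo,
              'issue_number': context.payload.pull_request.number,
              'body': 'Thanks for your contribution! Before a review can take place, please make sure you have read the contribution guidelines and followed the development environment setup instructions.'
            });
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Using &lt;code&gt;pull_request_target&lt;/code&gt; rather than &lt;code&gt;pull_request&lt;/code&gt; lets the job comment on PRs opened from forks, since it runs with the base repository's permissions.&lt;/p&gt;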

&lt;h3&gt;
  
  
  Contributor Triage
&lt;/h3&gt;

&lt;p&gt;Another important takeaway is to watch out for contributors claiming multiple issues without completing them. We found that it's best to avoid assigning issues to anyone via GitHub's assignment feature and to instead focus on encouraging quality contributions over sheer quantity. Additionally, be prepared to manage contributors who place unrealistic demands on review times, such as insisting on immediate feedback.&lt;/p&gt;

&lt;p&gt;Reviewer time is a limited resource, and it's simply not feasible to provide equal attention to every contributor. One has to be ruthless in prioritizing contributions. This approach ensures that contributors who show genuine interest and effort receive the attention they deserve, leading to higher-quality interactions and outcomes for both the project and the contributor.&lt;/p&gt;

&lt;p&gt;At the end of the day, contributors must invest the time necessary to familiarize themselves with a project's conventions, guidelines, and best practices. If they don't meet this minimum threshold and don't show genuine effort, it's not worth allocating the finite resources of the core team. This may sound harsh, but it's necessary to ensure there is enough time to focus on high-quality contributions. Otherwise, you end up in a position where everybody is unhappy with your responsiveness. This may be less of an issue for organizations in specialized niches, which may not attract as wide an audience as a JavaScript library does.&lt;/p&gt;

&lt;h3&gt;
  
  
  Provide Clear Documentation
&lt;/h3&gt;

&lt;p&gt;Ensure that your project documentation is comprehensive and up-to-date. This includes installation guides, contribution guidelines, and a clear roadmap. Poor documentation can be a significant barrier to entry. During the community bonding period, we found that our documentation was outdated in some areas and that our setup instructions did not work on all operating systems. Providing a &lt;code&gt;devcontainer&lt;/code&gt; setup for Visual Studio Code helped to mitigate these issues and streamline the onboarding process.&lt;/p&gt;
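&lt;p&gt;For reference, a dev container can be as simple as a single JSON file in the repository. The following is a minimal sketch, with an illustrative image, install command, and editor extensions rather than stdlib's actual configuration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// .devcontainer/devcontainer.json (hypothetical example)
{
  "name": "project-dev",

  // Prebuilt image with Node.js and common tooling installed:
  "image": "mcr.microsoft.com/devcontainers/javascript-node:20",

  // Install project dependencies after the container is created:
  "postCreateCommand": "npm install",

  // Editor extensions every contributor should have:
  "customizations": {
    "vscode": {
      "extensions": [ "dbaeumer.vscode-eslint", "EditorConfig.EditorConfig" ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With such a file in place, contributors on any operating system, Windows included, can open the repository in a container with identical tooling, either locally through Visual Studio Code or in the browser through GitHub Codespaces.&lt;/p&gt;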

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Contribution Guides:&lt;/strong&gt; Providing detailed guides on setting up the development environment, navigating the codebase, and submitting contributions is crucial.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Mentor Selection and Training
&lt;/h2&gt;

&lt;p&gt;Choose experienced and committed mentors who can provide guidance and support throughout the program. Consider providing mentor training sessions and setting clear expectations around time commitments and responsibilities to better prepare mentors for their roles. Expect mentoring to be more demanding than you envision.&lt;/p&gt;

&lt;p&gt;We found that having weekly stand-ups allowed contributors to get to know each other and share their progress. Early on, we had also decided to hold weekly 1:1s between contributors and mentors, combined with active conversations on PRs, RFC issues, and our project-internal Slack. All these channels helped to keep the communication flowing and ensure that everyone was on the same page. However, it's crucial to be responsive. Personally, I could have been better at responding to PRs and questions; the time flies by, and GSoC is over before you know it!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Encourage mentors to actively communicate with each other about their experiences and challenges, so they can offer consistent advice and collaborate on strategies for effectively supporting contributors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Post-GSoC Engagement Strategies
&lt;/h2&gt;

&lt;p&gt;After GSoC ends, it's essential to keep contributors engaged in order to build a sustainable community. Continue holding regular office hours, offer additional project ideas, or even invite selected GSoC contributors to mentor the next round of participants. This will go a long way toward creating a sense of belonging and long-term commitment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Pitfalls to Avoid
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt; &lt;strong&gt;Overwhelming Newcomers:&lt;/strong&gt; Don't assign tasks that are too complex or inadequately documented.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Inadequate Support:&lt;/strong&gt; Ensure mentors are available and can provide adequate guidance.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Poor Documentation:&lt;/strong&gt; Avoid outdated or incomplete documentation which can create barriers to entry. &lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Insufficient Community Interaction:&lt;/strong&gt; Foster a sense of community and two-way communication.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To provide an illustrative example of where we fell prey to the pitfalls above, a number of contributors working on Windows machines initially struggled to set up their local development environment. Because the core stdlib team primarily develops on macOS and Linux, we were largely unaware of the needs and constraints of Windows users, and our contributing guidelines reflected that blind spot. Needless to say, telling people to just use an Ubuntu shell was not sufficient. We could have saved ourselves a lot of back and forth by (a) providing preconfigured dev containers, (b) investing the time necessary to create more comprehensive documentation, and (c) having a quick onboarding session over a higher-bandwidth medium than chat.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advice for Contributors
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt; &lt;strong&gt;Early Engagement:&lt;/strong&gt; Interact with the community and start working on beginner-friendly issues early on. If you start contributing before the application period and show your commitment to the project, you will stand out as a proactive candidate during the selection process. This is probably the biggest hack to get selected for GSoC.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Invest in Project Familiarity Early On&lt;/strong&gt;: Before contributing code, take time to read through old issues, PR discussions, and any architectural documentation available. Understanding the project's historical context can help avoid misunderstandings and improve the relevance of your contributions.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Prioritize Code Quality and Documentation:&lt;/strong&gt; Don't rush to make as many contributions as possible. Take your time to write high-quality code and back it up with sufficient documentation and test cases. In stdlib especially, we place a high priority on consistency throughout the codebase, so the more your contributions look and feel like stdlib, the more likely they are to be accepted. This attention to detail will set you apart from others who focus solely on quantity and ignore project conventions.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Clear Communication:&lt;/strong&gt; Don't hesitate to ask questions and seek guidance from mentors and the community. Organizations may be overwhelmed with applications, so stepping up and answering questions on the community forums can help you stand out as well.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Ask for Feedback:&lt;/strong&gt; Throughout the GSoC program, ask for and incorporate feedback from project mentors. During the application phase, contributors who clearly demonstrate an ability to receive and act on feedback stand out. It can be frustrating for reviewers to repeat the same feedback across multiple PRs, especially concerning project style and conventions. Make it a goal to reduce the number of reviewer comments on each PR. Clean PRs requiring little-to-no review feedback significantly improve your odds of setting yourself apart from the pack.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Respect Maintainer Time:&lt;/strong&gt; Be respectful of maintainer time. GSoC can be highly competitive, and, for many, GSoC acceptance is a meaningful resumé item. Recognize, however, that maintainers often have obligations and jobs outside of their open source work. Sometimes it just isn't possible to immediately review your PR or answer your question, especially toward the end of the GSoC application period. You can significantly improve the likelihood of a response if you heed the advice above; namely, invest in project familiarity early on, prioritize code quality and documentation, and incorporate feedback. Maintainers are human, and the more you show you care about them, the more likely they are to invest in you.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Time Management:&lt;/strong&gt; Plan your time effectively to meet project milestones and deadlines. The time will fly by, and you don't want to be scrambling to complete your project at the last minute. Break down your project into smaller tasks, and set realistic goals for each week. Where possible, be strategic in your planning, such that, if one task becomes blocked, you can continue making progress by working on other tasks in parallel. If you encounter obstacles, reach out for help sooner rather than later. Being proactive not only ensures you stay on track but also demonstrates your commitment and initiative.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Participate Beyond Code:&lt;/strong&gt; Engage in discussions beyond code contributions.  Once you have familiarized yourself with the project, gotten up to speed on how to contribute, and successfully made contributions to the codebase, help other newcomers by participating in community channels, answering questions, and directing them to appropriate resources. Not only does this show that you are invested in the community, but it also helps reduce maintainer burden—something which is unlikely to go unnoticed.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Be Adaptive and Open to Change:&lt;/strong&gt; Sometimes your initial project plan may not work out as expected. Be flexible and willing to adjust your project scope or approach based on feedback and evolving project priorities.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Remember that valuable contributions aren't limited to code alone. Participating in community discussions, improving documentation, and offering support to other newcomers are all meaningful ways to contribute and demonstrate commitment to the project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Acknowledgments
&lt;/h2&gt;

&lt;p&gt;Our heartfelt thanks go out to everyone involved in this year's GSoC, from the mentors and contributors to the broader community, and last but not least, to Google.  We're excited to build on the momentum from this summer and look forward to seeing what the future holds for stdlib!&lt;/p&gt;

&lt;p&gt;If you're interested in becoming a part of our growing community or exploring the opportunities GSoC can provide, visit our &lt;a href="https://github.com/stdlib-js/google-summer-of-code" rel="noopener noreferrer"&gt;Google Summer of Code&lt;/a&gt; repository and join the conversation on our community channels. We're always excited to welcome new contributors!&lt;/p&gt;

&lt;p&gt;And if you're just generally interested in contributing or staying updated, be sure to check out the project &lt;a href="https://github.com/stdlib-js/stdlib" rel="noopener noreferrer"&gt;repository&lt;/a&gt;. Don't be shy, and come say hi. We'd love for you to be a part of our community!&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/stdlib-js/stdlib" rel="noopener noreferrer"&gt;stdlib&lt;/a&gt; is an open source software project dedicated to providing a comprehensive suite of robust, high-performance libraries to accelerate your project's development and give you peace of mind knowing that you're depending on expertly crafted, high-quality software.&lt;/p&gt;

&lt;p&gt;If you've enjoyed this post, give us a star 🌟 on &lt;a href="https://github.com/stdlib-js/stdlib" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and consider &lt;a href="https://opencollective.com/stdlib" rel="noopener noreferrer"&gt;financially supporting&lt;/a&gt; the project. Your contributions and continued support help ensure the project's long-term success and are greatly appreciated!&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>webdev</category>
      <category>node</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
