<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nicholas Synovic</title>
    <description>The latest articles on Forem by Nicholas Synovic (@nicholassynovic).</description>
    <link>https://forem.com/nicholassynovic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F730471%2F2a348b4d-cfa8-45b8-8224-2aab58e918cd.jpg</url>
      <title>Forem: Nicholas Synovic</title>
      <link>https://forem.com/nicholassynovic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nicholassynovic"/>
    <language>en</language>
    <item>
      <title>Use Your Tokens Before You Lose Your Tokens</title>
      <dc:creator>Nicholas Synovic</dc:creator>
      <pubDate>Wed, 04 Mar 2026 03:37:10 +0000</pubDate>
      <link>https://forem.com/nicholassynovic/use-your-tokens-before-you-lose-your-tokens-kdp</link>
      <guid>https://forem.com/nicholassynovic/use-your-tokens-before-you-lose-your-tokens-kdp</guid>
      <description>&lt;p&gt;If you have the privilege of a GitHub Copilot Education license or a workplace-wide Google AI Plus subscription, I have one primary piece of advice: burn through your credits. These institutional offerings provide a unique “sandbox” where you can fail for free. My recommendation for mastering these agents is to start small but think critically.&lt;/p&gt;

&lt;p&gt;Begin with a project you already know inside and out. Take a well-documented method and ask the agent to document it from scratch. Does it capture the nuance? Does it understand the “why” behind the logic? Now, expand the scope: provide the agent with the entire class or module and repeat the task. Observe how the quality of the output shifts as you provide more project context. This exercise isn’t just about documentation; it’s about learning the “contextual threshold” of the model you are using.&lt;/p&gt;

&lt;p&gt;Once you understand the agent’s baseline, move into active validation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tooling Audit: Ask the agent to identify all the explicit and implicit configuration options within your codebase. See if it can find the “ghosts” in your architecture.&lt;/li&gt;
&lt;li&gt;Security &amp;amp; Memory Loops: Ask the agent to generate a memory-safe implementation of a function, then validate that code against a tool like &lt;code&gt;valgrind&lt;/code&gt;. If it fails, pass the valgrind error log back into the agent. Watching an agent respond to a debugger’s output is the best way to understand its ability to “reason” through technical constraints.&lt;/li&gt;
&lt;li&gt;Planning vs. Execution: Use the Plan Mode to have the agent tackle a specific GitHub Issue. Evaluate it not just on the code it writes, but on the logic of the steps it proposes.&lt;/li&gt;
&lt;/ul&gt;
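
&lt;p&gt;As a concrete sketch of that second loop (assuming &lt;code&gt;gcc&lt;/code&gt; and &lt;code&gt;valgrind&lt;/code&gt; are installed; &lt;code&gt;demo.c&lt;/code&gt; is a hypothetical stand-in for whatever file the agent generates for you):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# demo.c stands in for the agent-generated implementation
echo 'int main(void) { return 0; }' | tee demo.c

# Compile with debug symbols so valgrind can report file/line locations
gcc -g -O0 -o demo demo.c

# --error-exitcode=1 makes failures visible to scripts;
# --log-file captures the report you can paste back into the agent
valgrind --leak-check=full --error-exitcode=1 --log-file=valgrind.log ./demo
cat valgrind.log
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;If valgrind reports errors, feed &lt;code&gt;valgrind.log&lt;/code&gt; back to the agent as the next prompt and repeat until the report comes back clean.&lt;/p&gt;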

&lt;p&gt;We are at a unique juncture where LLMs trained on code are only going to become more pervasive and more capable. Use the opportunity your institution has provided to become a leader in understanding what these agents can—and cannot—do. Identify the patterns that lead to failure and the strategies that lead to success.&lt;/p&gt;

&lt;p&gt;It is a tall order to stay ahead of this curve, but as students, scientists, and engineers, we are built for this challenge. Burn the tokens, make the mistakes, and break the models now. These agents are here to stay, and the best time to learn their limitations is while someone else is picking up the tab.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is a section of a larger blog post I made on my website. Feel free to read the full post for free &lt;a href="https://nicholassynovic.github.io/blog_posts/2026-03-03.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Thanks!&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>beginners</category>
      <category>learning</category>
    </item>
    <item>
      <title>Sensible Chuckle: The First `git commit` Message Of The Git Version Control System</title>
      <dc:creator>Nicholas Synovic</dc:creator>
      <pubDate>Wed, 14 May 2025 15:01:38 +0000</pubDate>
      <link>https://forem.com/nicholassynovic/sensible-chuckle-the-first-git-commit-message-of-the-git-version-control-project-3aef</link>
      <guid>https://forem.com/nicholassynovic/sensible-chuckle-the-first-git-commit-message-of-the-git-version-control-project-3aef</guid>
      <description>&lt;p&gt;As part of a side project, I was interested in exploring the first &lt;code&gt;git commit&lt;/code&gt; message of the Git Version Control System project.&lt;/p&gt;

&lt;p&gt;It was made by Linus Torvalds on April 7, 2005, and it reads: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;Initial revision of "git", the information manager from hell&lt;/code&gt;&lt;/p&gt;
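
&lt;p&gt;If you'd like to look this up yourself, the same commands work in any clone of &lt;a href="https://github.com/git/git" rel="noopener noreferrer"&gt;git.git&lt;/a&gt;. The sketch below demonstrates them on a throwaway repository so it runs anywhere:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create a throwaway repository with two commits
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "first commit"
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "second commit"

# Hashes of any parentless "root" commits (git.git has several,
# since gitk and git-gui were merged in as subtrees)
git rev-list --max-parents=0 HEAD

# Subject line of the oldest commit; in a clone of git.git this
# should print Linus' message above
git log --reverse --format=%s | head -n 1
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;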

</description>
      <category>git</category>
      <category>humor</category>
      <category>watercooler</category>
    </item>
    <item>
      <title>KiSSES: Keep Static Site Examples Simple</title>
      <dc:creator>Nicholas Synovic</dc:creator>
      <pubDate>Tue, 04 Mar 2025 20:21:32 +0000</pubDate>
      <link>https://forem.com/nicholassynovic/kisses-keep-static-site-examples-simple-mh3</link>
      <guid>https://forem.com/nicholassynovic/kisses-keep-static-site-examples-simple-mh3</guid>
      <description>&lt;p&gt;I don't know about you, but every time that I check out a static site generator's example GitHub page, I'm both over and underwhelmed at the same time. On one hand, the Github page often has great technical details and depth to allow me to leverage and extend the example to fit my needs. On the other hand, feature's such as GitHub Action integration, or deploying to GitHub pages is often left to the engineer to figure out. And in some cases, the example site is not longer in line with current revisions of the tool!&lt;/p&gt;

&lt;p&gt;And look, I know that every project is different, and that your preferred static site generator probably has better documentation and examples than what I've seen. But in the projects I have seen, GitHub Pages deployment and recommended repository layouts are sidelined in favor of technical documentation. &lt;/p&gt;

&lt;p&gt;Is this good or bad? I don't know. Am I too inexperienced to work within these constraints? Maybe. But I can't be the only engineer to have faced these issues. And for projects aimed at rapidly creating websites from lightweight markup documents (e.g., Markdown, reStructuredText), I'd expect features such as starter or template GitHub repositories to be more common.&lt;/p&gt;

&lt;p&gt;Because of my frustrations, I've released two example GitHub repositories for two popular static site generators: &lt;a href="https://www.mkdocs.org/" rel="noopener noreferrer"&gt;MkDocs&lt;/a&gt; and &lt;a href="https://www.sphinx-doc.org/en/master/index.html" rel="noopener noreferrer"&gt;Sphinx&lt;/a&gt;. Each repository focuses on a minimal project that builds into a Read the Docs-theme-compatible website and provides supporting tooling for formatting the underlying markup language. Each also provides the tooling needed to deploy to GitHub Pages, both from the command line and via GitHub Actions (both powered by the &lt;a href="https://pypi.org/project/ghp-import/" rel="noopener noreferrer"&gt;&lt;code&gt;ghp-import&lt;/code&gt;&lt;/a&gt; project).&lt;/p&gt;

&lt;p&gt;Now, I understand that my examples are not going to be complete for everyone, so I'd like to open my issue boards to the community to suggest how to improve them. I think it's a real shame that better examples of minimal static sites don't exist, and projects like mine address that low-hanging fruit.&lt;/p&gt;

&lt;p&gt;MkDocs example site: &lt;a href="https://github.com/NicholasSynovic/example_mkdocs" rel="noopener noreferrer"&gt;https://github.com/NicholasSynovic/example_mkdocs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sphinx example site: &lt;a href="https://github.com/NicholasSynovic/example_sphinx" rel="noopener noreferrer"&gt;https://github.com/NicholasSynovic/example_sphinx&lt;/a&gt; &lt;/p&gt;
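
&lt;p&gt;For reference, the command-line deployment path that both repositories use looks roughly like this. This is a sketch using a fresh MkDocs project (the &lt;code&gt;demo-site&lt;/code&gt; name is hypothetical; the exact commands live in each repository):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Scaffold and build a demo MkDocs project (output goes to ./site)
pip install mkdocs ghp-import
mkdocs new demo-site
cd demo-site
mkdocs build

# Inside a repository with an 'origin' remote, publish ./site to gh-pages:
# -n adds a .nojekyll file, -p pushes the branch to origin
# ghp-import -n -p site
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;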

</description>
      <category>webdev</category>
      <category>watercooler</category>
      <category>github</category>
    </item>
    <item>
      <title>DeepSeek w/ Ollama + Open WebUI</title>
      <dc:creator>Nicholas Synovic</dc:creator>
      <pubDate>Tue, 28 Jan 2025 14:57:41 +0000</pubDate>
      <link>https://forem.com/nicholassynovic/deepseek-w-ollama-open-webui-1jih</link>
      <guid>https://forem.com/nicholassynovic/deepseek-w-ollama-open-webui-1jih</guid>
      <description>&lt;h2&gt;
  
  
  DeepSeek R1 Exists
&lt;/h2&gt;

&lt;p&gt;It's the latest exciting open-source LLM and, to my knowledge, the first open-source &lt;em&gt;reasoning&lt;/em&gt; model. While I'm unfamiliar with the intricacies of reasoning models, the gist is that these LLMs "think through" the problem before responding. In other words, as part of the output you get from your prompt, you also get the chain of thought that supports the reasoning behind the model's output. This provides context as to why the model generated its final output. &lt;/p&gt;

&lt;p&gt;To be clear, I wouldn't call these models self-explaining; at the end of the day, LLMs are still considered black boxes that generate text based on statistical and mathematical computations. Just because DeepSeek "thinks through" a problem does not mean that it is truly sentient, accurate, or correct. There is still a need for human-in-the-loop (i.e., human reviewer) style usage when leveraging these models.&lt;/p&gt;

&lt;p&gt;With the context and clarification out of the way, how can you leverage DeepSeek R1 locally? And more broadly, how do you do so with any open-source LLM?&lt;/p&gt;

&lt;h2&gt;
  
  
  Ollama
&lt;/h2&gt;

&lt;p&gt;You leverage Ollama, an open-source &lt;em&gt;inference engine&lt;/em&gt; that is designed to work with &lt;em&gt;quantized LLMs&lt;/em&gt; via the &lt;em&gt;GGUF file format&lt;/em&gt; or hosted on the &lt;em&gt;Ollama Model Hub&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6kpwb841fh9lm2s3wpq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6kpwb841fh9lm2s3wpq.gif" alt="https://media3.giphy.com/media/v1.Y2lkPTc5MGI3NjExOHJuYXpkeHpvN2N1YXFqbmphZTJmaHJpcm1uM2JoNzJ2d3dtdzVzZyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/3WmWdBzqveXaE/giphy.gif" width="480" height="321"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;In short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An inference engine is a utility that runs machine and deep learning models efficiently by optimizing the model's underlying computational graph.

&lt;ul&gt;
&lt;li&gt;The computational graph is similar to a program's call graph (the order in which instructions are executed), but for mathematical operations&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Quantized LLMs are large language models whose computations use a reduced number of bits, representing values as lower-precision floating-point numbers or integers.

&lt;ul&gt;
&lt;li&gt;Deep learning models are often trained using 32-bit (or higher) floating-point representations to capture the nuances of the data. Reducing the bit width or changing the representation (e.g., floating point to integer) often improves the model's throughput (measured in tokens per second) at the cost of some accuracy.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;The GGUF file format is not important for this discussion, but you can learn more about it &lt;a href="https://github.com/ggerganov/ggml/blob/master/docs/gguf.md" rel="noopener noreferrer"&gt;here&lt;/a&gt;.
&lt;/li&gt;

&lt;li&gt;The &lt;a href="https://ollama.com/search" rel="noopener noreferrer"&gt;Ollama Model Hub&lt;/a&gt; hosts quantized LLMs ready for downstream consumption via the &lt;a href="https://github.com/ollama/ollama" rel="noopener noreferrer"&gt;&lt;code&gt;ollama&lt;/code&gt; command line utility&lt;/a&gt;.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Ollama provides a very simple interface to get started with using LLMs locally. Alternatives do exist (e.g., &lt;a href="https://vllm.ai" rel="noopener noreferrer"&gt;&lt;code&gt;vllm&lt;/code&gt;&lt;/a&gt;), but the tooling surrounding Ollama is extensive and well-documented, so it is my preferred choice for running LLMs locally.&lt;/p&gt;

&lt;p&gt;As Ollama is a command-line utility, it can be difficult to leverage features such as document and image reasoning, web search, retrieval-augmented generation (RAG), and multi-modal data analysis without developing your own interface. This is where GUIs such as Open WebUI fill the gap. &lt;/p&gt;

&lt;h2&gt;
  
  
  Open WebUI
&lt;/h2&gt;

&lt;p&gt;Open WebUI is a self-hostable application that communicates with Ollama via Ollama's HTTP REST API. It provides a ChatGPT-like interface that I find familiar while exposing existing ChatGPT features such as image generation, document reasoning, RAG, and web search. It also supports new features, like the ability to chain multiple models together: provide one model with a prompt, then automatically pass that model's response into a second or third LLM for post-processing! I think it's a neat project and an exemplar of the Ollama ecosystem. You can find more information about it &lt;a href="https://github.com/open-webui/open-webui" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Having gone through all of this now, how can we install these tools?&lt;/p&gt;

&lt;p&gt;If you are on an M-series Mac, you should install Ollama locally and ignore all references to the Ollama Docker installation hereafter. This is because Ollama via Docker does not support GPU acceleration on M-series Macs, but the compiled binary does. You can read about it &lt;a href="https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For everyone else, I recommend installing Ollama and Open WebUI with Docker Compose using the following YAML file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ai&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ollama&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/ollama:0.5.7&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ollama-network&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;11434:11434"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ollama:/root/.ollama&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;reservations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia&lt;/span&gt;
              &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;open-webui&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;open-webui&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/open-webui/open-webui:0.5.7&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
    &lt;span class="na"&gt;extra_hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host.docker.internal:host-gateway"&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ollama-network&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:8080"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;open-webui:/app/backend/data&lt;/span&gt;

&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ollama-network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;external&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ollama&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;open-webui&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy this to a &lt;code&gt;docker-compose.yml&lt;/code&gt; file and then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nt"&gt;--file&lt;/span&gt; ./docker-compose.yml create
docker compose &lt;span class="nt"&gt;--file&lt;/span&gt; ./docker-compose.yml start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This installs Ollama at its latest version (as of writing) with NVIDIA GPU acceleration support across all GPUs. It also installs the latest version of Open WebUI (as of writing). The Ollama HTTP REST API is exposed on port 11434 and Open WebUI is exposed on port 3000.&lt;/p&gt;

&lt;p&gt;If you don't have NVIDIA GPU support for Docker, are using a different GPU vendor, or intend to run this on CPU, see this post from &lt;a href="https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Once installed, run the following command to pull DeepSeek R1 from Ollama's Model Hub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nt"&gt;--file&lt;/span&gt; ./docker-compose.yml &lt;span class="nb"&gt;exec &lt;/span&gt;ollama ollama pull deepseek-r1:7b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then refresh your browser's connection to Open WebUI (via &lt;a href="http://localhost:3000" rel="noopener noreferrer"&gt;http://localhost:3000&lt;/a&gt;) and you should be able to start using DeepSeek R1 locally!&lt;/p&gt;
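
&lt;p&gt;You can also sanity-check the backend directly. This is a quick sketch, assuming the Compose stack above is running on the same machine and exposing Ollama's REST API on its default port:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Lists the models Ollama has pulled locally, as JSON;
# deepseek-r1:7b should appear after the pull above
curl -s http://localhost:11434/api/tags
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;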

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu17kp8sbp9l3qttjsdxr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu17kp8sbp9l3qttjsdxr.png" alt="DeepSeek R1 7B running on my system through Open WebUI with Ollama as the backend inference server" width="800" height="1022"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: All computers are different and unique snowflakes. For context, I have deployed this on a system running Pop!_OS with an NVIDIA 3060 GPU. While I've done my best to make the deployment of Ollama, Open WebUI, and DeepSeek R1 repeatable and reproducible, your system might need additional tinkering to work right. Please see both the Ollama and Open WebUI documentation and GitHub Issue boards for support.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>docker</category>
      <category>opensource</category>
      <category>watercooler</category>
    </item>
    <item>
      <title>Polyglot: Lua (Part 1)</title>
      <dc:creator>Nicholas Synovic</dc:creator>
      <pubDate>Mon, 13 Jan 2025 00:06:28 +0000</pubDate>
      <link>https://forem.com/nicholassynovic/polyglot-lua-part-1-opk</link>
      <guid>https://forem.com/nicholassynovic/polyglot-lua-part-1-opk</guid>
      <description>&lt;p&gt;In my previous post, I talked about the reasons why I want to learn more programming language, the Lua programming language, and the developer tooling for Lua. Now it's time to actually code in Lua!&lt;/p&gt;

&lt;p&gt;For this post, I'll be completing several basic Rosetta Code tasks. Nothing crazy, but enough to get me familiar with the language and its syntax. As Lua has a fairly minimal and straightforward syntax, I'll post the code snippets and output here, but I won't explain the implementation. For the complete source code, you can see my GitHub repository &lt;a href="https://github.com/NicholasSynovic/example_lua" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Template
&lt;/h2&gt;

&lt;p&gt;I created a GitHub Template to bootstrap my Lua projects going forward. You can find it &lt;a href="https://github.com/NicholasSynovic/template_lua" rel="noopener noreferrer"&gt;here&lt;/a&gt;. As I find tooling to improve my Lua experience, I'll update the template.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rosetta Code Problems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://rosettacode.org/wiki/Arithmetic/Integer" rel="noopener noreferrer"&gt;Integer Arithmetic&lt;/a&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Outcome: Taught me how to take in user input and declare functions&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;difference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;int_quotient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="c1"&gt;-- Rounds to negative infinity&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;remainder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;exponentiation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nb"&gt;io.write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"First number: "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;-- Use with io.read for single line input&lt;/span&gt;
    &lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;io.read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"n"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;-- Captures user input&lt;/span&gt;

    &lt;span class="nb"&gt;io.write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Second number: "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;io.read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"n"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"==="&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Sum: "&lt;/span&gt; &lt;span class="o"&gt;..&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;-- ".." syntax used to concatenate&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Difference: "&lt;/span&gt; &lt;span class="o"&gt;..&lt;/span&gt; &lt;span class="n"&gt;difference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Product: "&lt;/span&gt; &lt;span class="o"&gt;..&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;"Integer Quotient (rounds to negative infinity): "&lt;/span&gt; &lt;span class="o"&gt;..&lt;/span&gt; &lt;span class="n"&gt;int_quotient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Remainder"&lt;/span&gt; &lt;span class="o"&gt;..&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Exponentiation: "&lt;/span&gt; &lt;span class="o"&gt;..&lt;/span&gt; &lt;span class="n"&gt;exponentiation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;a href="https://rosettacode.org/wiki/Compare_length_of_two_strings" rel="noopener noreferrer"&gt;String Length Comparison&lt;/a&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Outcome: Learned that all objects (including arrays) are tables, how to sort tables, and how to index over them with a &lt;code&gt;for&lt;/code&gt; loop&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nb"&gt;io.write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"First string: "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;io.read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"l"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nb"&gt;io.write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Second string: "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;io.read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"l"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nb"&gt;io.write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Third string: "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;io.read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"l"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"==="&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;-- Loads strings into an array (implemented as a table)&lt;/span&gt;

    &lt;span class="nb"&gt;table.sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="n"&gt;foo&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;-- Sort array based on string length&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;ipairs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;-- Print string size then string content&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Lua wasn't that hard to get a basic grasp of. Granted, I did not cover aspects such as loops, control flow, or binary operations, but reading the &lt;a href="https://www.lua.org/manual/5.4/" rel="noopener noreferrer"&gt;manual&lt;/a&gt; and &lt;a href="https://www.lua.org/pil/contents.html" rel="noopener noreferrer"&gt;book&lt;/a&gt; provided enough context for me to grasp the core concepts.&lt;/p&gt;

&lt;p&gt;I'd like to thank the Rosetta Code community for their problems and solutions. Without them, it would have been far more difficult for me to understand these core language features.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>beginners</category>
      <category>lua</category>
      <category>learning</category>
    </item>
    <item>
      <title>Polyglot: Lua (Part 0)</title>
      <dc:creator>Nicholas Synovic</dc:creator>
      <pubDate>Sun, 12 Jan 2025 21:39:05 +0000</pubDate>
      <link>https://forem.com/nicholassynovic/polyglot-lua-part-0-3ppg</link>
      <guid>https://forem.com/nicholassynovic/polyglot-lua-part-0-3ppg</guid>
      <description>&lt;p&gt;I've been interested in expanding my toolkit of programming languages for some time now. I would currently say that I am proficient in Java, C, and C++ and have expertise in Python. But this clearly isn't the full range of programming languages or experiences out there. For example, I have very little knowledge of functional or embedded languages.&lt;/p&gt;

&lt;p&gt;To encourage me to write more posts, I'm going to start documenting my experience learning different programming languages and the projects that I write with them. To start this series, I will begin with the Lua scripting language.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Lua?
&lt;/h2&gt;

&lt;p&gt;Lua is an &lt;a href="https://www.lua.org/about.html" rel="noopener noreferrer"&gt;"efficient, lightweight, embeddable scripting language"&lt;/a&gt; in active development since 1993. It claims to be fast, but most importantly the interpreter is very small: only 552 KB for the latest (5.4.7) binary.&lt;/p&gt;

&lt;p&gt;Personally, this doesn't matter a whole lot to me. Binary size and speed matter less than whether I can glean a new technique or experience from using the language. But I also don't want to waste time learning a dead language. So every language that I learn needs to meet the following criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must have a package manager&lt;/li&gt;
&lt;li&gt;Must have a way to test code&lt;/li&gt;
&lt;li&gt;Must have development tooling (e.g., LSP support, code formatting, linting)&lt;/li&gt;
&lt;li&gt;(Optional) Should support static typing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lua supports most of this, primarily through community packages. &lt;a href="https://luarocks.org/" rel="noopener noreferrer"&gt;&lt;code&gt;luarocks&lt;/code&gt;&lt;/a&gt; is the Lua package manager. Lua does not ship with a unit testing framework by default, but the community seems to have settled on &lt;a href="https://luarocks.org/modules/bluebird75/luaunit" rel="noopener noreferrer"&gt;&lt;code&gt;luaunit&lt;/code&gt;&lt;/a&gt; as the de facto testing library. LSP and linting support are provided by the &lt;a href="https://luals.github.io/" rel="noopener noreferrer"&gt;&lt;code&gt;lua-language-server&lt;/code&gt;&lt;/a&gt;, and code formatting is handled by &lt;a href="https://github.com/JohnnyMorganz/StyLua" rel="noopener noreferrer"&gt;&lt;code&gt;stylua&lt;/code&gt;&lt;/a&gt;. However, I can't find tooling similar to Python's &lt;a href="https://github.com/PyCQA/bandit" rel="noopener noreferrer"&gt;&lt;code&gt;bandit&lt;/code&gt;&lt;/a&gt; for performing security audits. I believe this to be an open area of Lua library development.&lt;/p&gt;

&lt;p&gt;Lua does not support static typing. But given Lua's minimal set of keywords and language features, the community has built alternative interpreters and languages that compile to Lua and add static typing. &lt;a href="https://github.com/andremm/typedlua" rel="noopener noreferrer"&gt;&lt;code&gt;typedlua&lt;/code&gt;&lt;/a&gt; seemed promising, as it layers a type system on top of Lua (like TypeScript), but it hasn't received a commit in five years. &lt;a href="https://github.com/dibyendumajumdar/ravi" rel="noopener noreferrer"&gt;&lt;code&gt;ravi&lt;/code&gt;&lt;/a&gt; also seemed promising, but it leverages a modified Lua VM, which breaks compatibility with some Lua libraries. I would prefer the TypeScript-like approach, as it does not break compatibility with existing libraries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learning Lua
&lt;/h2&gt;

&lt;p&gt;... Will have to wait for the next post. Compiling all of my sources for this post took longer than expected. As a sneak peek, I intend to release a GitHub Lua template following my &lt;a href="https://github.com/NicholasSynovic?tab=repositories&amp;amp;q=template&amp;amp;type=&amp;amp;language=&amp;amp;sort=" rel="noopener noreferrer"&gt;other templates&lt;/a&gt;, and another repo focused on solving code kata from &lt;a href="https://rosettacode.org/wiki/Rosetta_Code" rel="noopener noreferrer"&gt;Rosetta Code&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>beginners</category>
      <category>lua</category>
      <category>learning</category>
    </item>
    <item>
      <title>GitHub Templates on templates on templates</title>
      <dc:creator>Nicholas Synovic</dc:creator>
      <pubDate>Thu, 02 Jan 2025 02:20:30 +0000</pubDate>
      <link>https://forem.com/nicholassynovic/github-templates-on-templates-on-templates-84k</link>
      <guid>https://forem.com/nicholassynovic/github-templates-on-templates-on-templates-84k</guid>
      <description>&lt;p&gt;Did you know that you can create a new repository from an already existing repository on GitHub? This allows you to inherit both the history and contents of the repository. But what if you want the contents?&lt;/p&gt;

&lt;p&gt;GitHub allows you to create template repositories: repositories whose contents, but not histories, are inherited when creating a new repository from them. This simple feature is compelling for bootstrapping new projects. It allows you to define a generic repository with all your config files ready to go, rather than copying and committing them after instantiation. Furthermore, depending on how you architect your templates, you can have templates that inherit from other templates.&lt;/p&gt;

&lt;p&gt;I've found this particularly useful when creating &lt;em&gt;per language&lt;/em&gt; templates. I have a generic repository that contains my GitHub-specific files, generic tooling config files, and other supporting documents. Each programming-language template then inherits from that generic template. Finally, each project inherits from the language template most relevant to it.&lt;/p&gt;

&lt;p&gt;I have found this to be an extreme time saver in my day-to-day work and personal projects. For an example template repository, you can see &lt;a href="https://github.com/NicholasSynovic/template_base" rel="noopener noreferrer"&gt;my generic template&lt;/a&gt; and &lt;a href="https://github.com/NicholasSynovic/template_python" rel="noopener noreferrer"&gt;my Python template&lt;/a&gt; repositories.&lt;/p&gt;

</description>
      <category>github</category>
    </item>
    <item>
      <title>Introducing acolor: A small utility to print ANSI color codes</title>
      <dc:creator>Nicholas Synovic</dc:creator>
      <pubDate>Tue, 31 Dec 2024 17:26:23 +0000</pubDate>
      <link>https://forem.com/nicholassynovic/introducing-acolor-a-small-utility-to-print-ansi-color-codes-1b03</link>
      <guid>https://forem.com/nicholassynovic/introducing-acolor-a-small-utility-to-print-ansi-color-codes-1b03</guid>
      <description>&lt;p&gt;In my previous post, I wrote about a tool I wanted to create to print ANSI color codes to the console. I currently need a this as I am "prettifying" my shell prompt at the moment and figured it would just be faster to leverage this tool over Googling the necessary shell codes.&lt;/p&gt;

&lt;p&gt;So I created &lt;code&gt;acolor&lt;/code&gt;, an open-source Python utility built on top of &lt;code&gt;colorist&lt;/code&gt; to provide a convenient way to output ANSI color codes to the terminal. Currently, only named color codes are supported (e.g., red, green, blue). Hex, HSL, VGA, and RGB color codes are not yet supported, but &lt;code&gt;acolor&lt;/code&gt; can easily be extended to include them.&lt;/p&gt;

&lt;p&gt;You can view the source code &lt;a href="https://github.com/NicholasSynovic/acolor" rel="noopener noreferrer"&gt;here&lt;/a&gt;. You can install it with &lt;code&gt;pipx&lt;/code&gt; via:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pipx install git+https://github.com/NicholasSynovic/acolor&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here are the current command line options of the application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;acolor &lt;span class="nt"&gt;--help&lt;/span&gt;

Usage: acolor &lt;span class="o"&gt;[&lt;/span&gt;OPTIONS]

Options:
  &lt;span class="nt"&gt;-c&lt;/span&gt;, &lt;span class="nt"&gt;--color&lt;/span&gt; TEXT  Color name to generate ANSI code
  &lt;span class="nt"&gt;-r&lt;/span&gt;, &lt;span class="nt"&gt;--reset&lt;/span&gt;       Print ANSI reset code
  &lt;span class="nt"&gt;--help&lt;/span&gt;            Show this message and exit.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's an example usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;acolor &lt;span class="nt"&gt;--color&lt;/span&gt; red
&lt;span class="s1"&gt;'\x1b[31m'&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;acolor &lt;span class="nt"&gt;--reset&lt;/span&gt;
&lt;span class="s1"&gt;'\x1b[0m'&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;acolor &lt;span class="nt"&gt;--color&lt;/span&gt; &lt;span class="nb"&gt;test
test &lt;/span&gt;is not a valid color: dict_keys&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;'BLACK'&lt;/span&gt;, &lt;span class="s1"&gt;'RED'&lt;/span&gt;, &lt;span class="s1"&gt;'GREEN'&lt;/span&gt;, &lt;span class="s1"&gt;'YELLOW'&lt;/span&gt;, &lt;span class="s1"&gt;'BLUE'&lt;/span&gt;, &lt;span class="s1"&gt;'MAGENTA'&lt;/span&gt;, &lt;span class="s1"&gt;'CYAN'&lt;/span&gt;, &lt;span class="s1"&gt;'WHITE'&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
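&lt;p&gt;Since my goal here is prompt prettification, the emitted codes can go straight into shell output. A minimal sketch with the codes hard-coded from the output above (no &lt;code&gt;acolor&lt;/code&gt; required):&lt;/p&gt;

```shell
# Capture the same ANSI codes acolor prints for red and reset.
RED="$(printf '\033[31m')"
RESET="$(printf '\033[0m')"

# Wrap text in the color code and reset afterwards.
printf '%sThis text is red%s\n' "$RED" "$RESET"
```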



</description>
      <category>linux</category>
      <category>python</category>
      <category>tooling</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Install Tailscale With Ansible</title>
      <dc:creator>Nicholas Synovic</dc:creator>
      <pubDate>Sat, 28 Dec 2024 22:59:47 +0000</pubDate>
      <link>https://forem.com/nicholassynovic/install-tailscale-with-ansible-3962</link>
      <guid>https://forem.com/nicholassynovic/install-tailscale-with-ansible-3962</guid>
      <description>&lt;p&gt;I recently found out about &lt;a href="https://tailscale.com/" rel="noopener noreferrer"&gt;Tailscale&lt;/a&gt; from the &lt;a href="https://www.youtube.com/watch?v=UyczOQTx5Gg" rel="noopener noreferrer"&gt;Level1Tech's interview with its founder&lt;/a&gt;. After trying it out, I can say that I am more than satisfied with its performance, ease of use, and ability to network all of my devices together across different intranets. &lt;/p&gt;

&lt;p&gt;As someone who prefers to configure their computer using infrastructure-as-code (IaC) practices, I decided to write an Ansible play for installing Tailscale. The following is the play that I created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Tailscale&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myhosts&lt;/span&gt;
  &lt;span class="na"&gt;become&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Download Tailscale GPG Key&lt;/span&gt;
      &lt;span class="na"&gt;ansible.builtin.uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;dest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/usr/share/keyrings/tailscale-archive-keyring.gpg&lt;/span&gt;
        &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://pkgs.tailscale.com/stable/ubuntu/jammy.noarmor.gpg&lt;/span&gt; 

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Tailscale repository&lt;/span&gt;
      &lt;span class="na"&gt;ansible.builtin.uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;dest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/etc/apt/sources.list.d/tailscale.list&lt;/span&gt;
        &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://pkgs.tailscale.com/stable/ubuntu/jammy.tailscale-keyring.list&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Tailscale&lt;/span&gt;
      &lt;span class="na"&gt;ansible.builtin.apt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tailscale&lt;/span&gt;
        &lt;span class="na"&gt;update_cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;present&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This play is my attempt at a direct translation from the &lt;a href="https://tailscale.com/download/linux" rel="noopener noreferrer"&gt;Tailscale download instructions&lt;/a&gt;. For those who are more familiar with Ansible, let me know how I can improve upon this play.&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;

</description>
      <category>linux</category>
      <category>ansible</category>
      <category>network</category>
      <category>watercooler</category>
    </item>
    <item>
      <title>Back To Basics: git</title>
      <dc:creator>Nicholas Synovic</dc:creator>
      <pubDate>Sat, 28 Dec 2024 01:44:45 +0000</pubDate>
      <link>https://forem.com/nicholassynovic/back-to-basics-git-478m</link>
      <guid>https://forem.com/nicholassynovic/back-to-basics-git-478m</guid>
      <description>&lt;h2&gt;
  
  
  Not to brag, but...
&lt;/h2&gt;

&lt;p&gt;I can use &lt;code&gt;git&lt;/code&gt; (like everyone else). I've been using &lt;code&gt;git&lt;/code&gt; since ~2016, and it's been my primary VCS tooling throughout university. I've also &lt;a href="https://arxiv.org/abs/2207.11767" rel="noopener noreferrer"&gt;published research&lt;/a&gt; that leverages &lt;code&gt;git&lt;/code&gt; and GitHub to derive project insights. I've been very fond of the technology, but I have come to realize that I'm not adequately leveraging either tool, and thus hindering my progress.&lt;/p&gt;

&lt;p&gt;So for today's post, I want to optimize my &lt;code&gt;git&lt;/code&gt; config to maximize my productivity when using the tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mr. Worldwide
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;git&lt;/code&gt; can be configured globally, system-wide, or on a per-project basis. When configured globally, it affects every project for that particular user. For me, this is the preferred option, as I'm often working on solo projects.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;git&lt;/code&gt; documentation for &lt;code&gt;git config&lt;/code&gt; can be viewed &lt;a href="https://git-scm.com/docs/git-config#Documentation/git-config.txt-alias" rel="noopener noreferrer"&gt;here&lt;/a&gt;. If you are following along, this is the file stored at &lt;code&gt;~/.gitconfig&lt;/code&gt;. Each option can be configured with &lt;code&gt;git config --global --add KEY VALUE&lt;/code&gt;, but I'll be displaying the output from the file itself. To start, we'll configure &lt;code&gt;git blame&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  "It's Your Fault!"
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;git blame&lt;/code&gt; reports who contributed each line in a given file. This is particularly useful when identifying who contributed a specific feature, introduced a bug, or maliciously tampered with a file. There isn't much to configure here, but I will turn on repeated-line coloring (for repeated lines contributed in a commit), use the UNIX epoch as the date format, and report author email addresses instead of names.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[blame]
    coloring = repeatedLines
    date = unix
    showEmail = true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
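&lt;p&gt;The same options can be set from the command line rather than by editing the file. A sketch that uses a throwaway &lt;code&gt;HOME&lt;/code&gt; so the real &lt;code&gt;~/.gitconfig&lt;/code&gt; is untouched (drop the override to apply the settings for real):&lt;/p&gt;

```shell
# Point HOME at a scratch directory so --global writes go to a temp file.
export HOME="$(mktemp -d)"

git config --global blame.coloring repeatedLines
git config --global blame.date unix
git config --global blame.showEmail true

# Confirm the settings landed (keys are printed lowercased).
git config --global --list | grep '^blame'
```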



&lt;h2&gt;
  
  
  Color Makes It Cooler
&lt;/h2&gt;

&lt;p&gt;I typically work in terminals that support ANSI color codes, so anytime that I can add a splash of color to my development experience is pleasant. I've made &lt;code&gt;git&lt;/code&gt; output most of its UI in color if possible using the &lt;code&gt;ui.color&lt;/code&gt; config option set to  &lt;code&gt;auto&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[color]
    ui = auto
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  All My Ducks In A Column
&lt;/h2&gt;

&lt;p&gt;Some of &lt;code&gt;git&lt;/code&gt;'s commands can format their output in columns. The documentation doesn't spell out exactly which commands support this, but listing commands such as &lt;code&gt;git branch&lt;/code&gt; and &lt;code&gt;git tag&lt;/code&gt; do, and since I like standardized output I'm going to set it to always be on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[column]
    ui = always
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
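&lt;p&gt;A quick way to see the effect before committing to the setting is a one-off &lt;code&gt;-c&lt;/code&gt; override in a scratch repository; a sketch (the scratch repo and branch names are made up for the demo):&lt;/p&gt;

```shell
# Build a scratch repo with a few branches to list.
cd "$(mktemp -d)"
git init -q .
git config user.name "Demo" && git config user.email "demo@example.com"
git commit -q --allow-empty -m "init"
git branch alpha && git branch beta && git branch gamma

# column.ui=always columnizes listing commands such as git branch.
git -c column.ui=always branch
```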



&lt;h2&gt;
  
  
  Signing Off
&lt;/h2&gt;

&lt;p&gt;I wrote a Dev.to post on why &lt;a href="https://dev.to/nicholassynovic/why-sign-commits-1nlb"&gt;you should sign your commits with GPG&lt;/a&gt;, and I still stand by that post today. While tedious to set up and maintain across workstations, it does provide a layer of collaborator authentication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[commit]
    gpgSign = true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
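&lt;p&gt;One caveat worth knowing: with signing forced on globally, a machine without your key can still commit by overriding the option for a single invocation. A sketch in a scratch repository:&lt;/p&gt;

```shell
# Scratch repo with commit.gpgSign enabled locally.
cd "$(mktemp -d)"
git init -q .
git config user.name "Demo" && git config user.email "demo@example.com"
git config commit.gpgSign true

# --no-gpg-sign skips signing for just this commit (no key required).
git commit -q --allow-empty --no-gpg-sign -m "unsigned commit"
git log --oneline
```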



&lt;h2&gt;
  
  
  Speed Demon
&lt;/h2&gt;

&lt;p&gt;Some of the work that I do involves assessing the quality of software repositories longitudinally. Thus, I'm often checking out many commits sequentially in a &lt;code&gt;git&lt;/code&gt; repository. Therefore, when I heard about the &lt;code&gt;core.fsmonitor&lt;/code&gt; config option, I was ecstatic. This option, "can speed up Git commands that need to refresh the Git index (e.g. git status) in a working directory with many files. The built-in monitor eliminates the need to install and maintain an external third-party tool" (&lt;a href="https://git-scm.com/docs/git-config#Documentation/git-config.txt-corefsmonitor" rel="noopener noreferrer"&gt;Source&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In my testing, checking out 500 commits sequentially from the &lt;a href="https://github.com/numpy/numpy" rel="noopener noreferrer"&gt;&lt;code&gt;numpy&lt;/code&gt; repository&lt;/a&gt; took 13.8 seconds on average (across 10 runs) with this feature disabled, and 11.2 seconds on average with it enabled. Not an astounding difference, but if &lt;code&gt;core.fsmonitor&lt;/code&gt; can save me 2.6 seconds per 500 commits, then on a project with 37,775 commits that adds up to a projected savings of roughly 196 seconds, or about 3 minutes and 16 seconds! More testing is needed to confirm that this saving scales linearly, but for now I will keep it on and use version 1 of the hook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[core]
        fsmonitor = true
        fsmonitorHookVersion = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
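&lt;p&gt;The projection is straightforward linear scaling of the measured difference; a quick sketch of the arithmetic:&lt;/p&gt;

```shell
# 13.8 s - 11.2 s saved per 500 checkouts, scaled to numpy's 37,775 commits.
awk 'BEGIN { printf "%.1f seconds\n", (13.8 - 11.2) / 500 * 37775 }'
# prints "196.4 seconds"
```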



&lt;h2&gt;
  
  
  Core Defaults
&lt;/h2&gt;

&lt;p&gt;In addition to the &lt;code&gt;fsmonitor&lt;/code&gt; config, I also leverage &lt;code&gt;nvim&lt;/code&gt; and &lt;code&gt;less&lt;/code&gt; as my editor and pager of choice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[core]
        fsmonitor = true
        fsmonitorHookVersion = 1
        editor = nvim
        pager = less
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Optimizing Nodes And Edges
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;git&lt;/code&gt; can store a graph of how every commit relates to one another in a commit-graph file, which speeds up commit-walking operations such as &lt;code&gt;git log --graph&lt;/code&gt;. However, this file normally has to be written manually with &lt;code&gt;git commit-graph write&lt;/code&gt;. We can automate some of this by writing the commit graph every time &lt;code&gt;git fetch&lt;/code&gt; is called.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[fetch]
    writeCommitGraph = true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Don't Forget About The User!
&lt;/h2&gt;

&lt;p&gt;Finally, I'll configure my name and email for &lt;code&gt;git&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[user]
    name = Nicholas M. Synovic
    email = ***
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;I know that I've skipped over many different configuration options that &lt;code&gt;git&lt;/code&gt; has to offer. So consider this post and my config a jumping off point that you can extend.&lt;/p&gt;

&lt;p&gt;My full config is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[blame]
        coloring = repeatedLines
        date = unix
        showEmail = true
[color]
        ui = auto
[column]
        ui = always
[commit]
        gpgSign = true
[core]
        fsmonitor = true
        fsmonitorHookVersion = 1
        editor = nvim
        pager = less
[fetch]
        writeCommitGraph = true
[user]
        name = Nicholas M. Synovic
        email = ***
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>git</category>
      <category>github</category>
      <category>tooling</category>
      <category>linux</category>
    </item>
    <item>
      <title>Submitting GPU jobs to Slurm @ Loyola University Chicago</title>
      <dc:creator>Nicholas Synovic</dc:creator>
      <pubDate>Sun, 08 Dec 2024 01:06:03 +0000</pubDate>
      <link>https://forem.com/nicholassynovic/submitting-gpu-jobs-to-slurm-loyola-university-chicago-41pd</link>
      <guid>https://forem.com/nicholassynovic/submitting-gpu-jobs-to-slurm-loyola-university-chicago-41pd</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Slurm logo taken from [0]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;The Computer Science department at Loyola University Chicago [1] utilizes a high-performance computing cluster to support both research and teaching initiatives, with particular interest in running HPC and AI applications efficiently and effectively. As our department's needs have grown and changed over time, it has become clear that we require a more structured approach to allocating computational resources to individual projects. Our current method of running scripts as background processes, thereby sharing resources across all users, has limitations and does not scale effectively for ongoing projects. Specifically, our reliance on shared resources often results in computational bottlenecks, as multiple concurrent jobs compete for limited system resources.&lt;/p&gt;

&lt;p&gt;To effectively manage the execution of jobs, resource allocation, and job order within our department, we are exploring the use of a job scheduler [2] as a solution. Specifically, Slurm [3] is currently under consideration due to its ability to meet our computational needs. However, given that not all members of the department have experience with job scheduling, and no formal training process currently exists, this blog post aims to serve as a brief, informal introduction to the technology and its applications, providing a foundation for readers to pursue further research into the applications of job schedulers and their benefits.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Slurm?
&lt;/h2&gt;

&lt;p&gt;Slurm is an open-source job scheduler software package that enables efficient management of workload execution on shared computing resources, such as cluster computers [3]. A job scheduler like Slurm manages the order in which programs or applications (referred to as "jobs") are executed on these resources. By queuing jobs and allocating access to computational resources on a managed basis (e.g., first-in-first-out, last-in-first-out, or when specific hardware becomes available), Slurm ensures that each job has exclusive access to the required resources, preventing conflicts between multiple concurrent processes.&lt;/p&gt;

&lt;p&gt;More information about Slurm can be found here [3]. The user guide and documentation for Slurm can be found here [4]. Slurm's source code is available on GitHub here [5].  &lt;/p&gt;

&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;Our department aims to integrate AI methods into our research and teaching programs, with a current focus on batch inferencing, training, and fine-tuning large language models (LLMs). To achieve this, we require access to significant GPU resources. However, our current setup limits individual users from fully utilizing the available GPUs for these computationally intensive tasks, as multiple users are often competing for simultaneous access to the same or related resources.&lt;/p&gt;

&lt;p&gt;Our cluster computer has the necessary hardware and software infrastructure to execute AI and HPC codes efficiently. However, due to shared resource allocation among multiple users, these codes often take longer than expected to complete. In some cases, they may even stall or be terminated by the system, as it prioritizes freeing resources for other users over allowing a single task to run for an extended period.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;With Slurm, we can utilize a set of fully available computational resources to schedule jobs efficiently. Users can configure their jobs to take advantage of specific resources and allocate a specified number of each resource as needed. Additionally, if a job does not require exclusive access to system capabilities, Slurm enables parallel execution by running multiple jobs simultaneously on separate hardware units.&lt;/p&gt;

&lt;p&gt;The rest of this post is a tutorial that provides a step-by-step guide on how to submit jobs to Slurm. We'll use a real-world example: training a simple convolutional neural network (CNN) on the MNIST dataset using TensorFlow [6] and Keras [7] in Python.&lt;/p&gt;

&lt;p&gt;While our focus is on using Slurm, it's essential to write your code with concurrency and parallelism in mind. This means designing your program to take advantage of multiple computational resources simultaneously. If your code isn't optimized for concurrent execution, scaling its performance will be challenging. We assume prior knowledge of writing high-performance computing (HPC) codes and focus on using Slurm to manage and execute them efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: The code used in this tutorial is not optimized for running on multiple GPUs by default and will not scale with additional resources. To scale it, you'll need to extend the code to support a multi-GPU, distributed training strategy as outlined in [8].&lt;/p&gt;

&lt;h3&gt;
  
  
  Tutorial
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;This tutorial will guide you through submitting jobs to Slurm in a series of easy-to-follow steps. Important notes and considerations will be highlighted in block quotes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Connect to the cluster computer.&lt;/li&gt;
&lt;li&gt;Clone your code from GitHub to a directory on the cluster computer.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;You are using &lt;code&gt;git&lt;/code&gt; and GitHub to keep track of versions, right?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Configure, build, and test your software.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Here is where the tutorial begins. I will be using code provided by the TensorFlow team for training a CNN model on the MNIST dataset [9].&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Create a &lt;code&gt;bash&lt;/code&gt; script called &lt;code&gt;job.bash&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;touch job.bash&lt;/code&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can name this file whatever you want, but it must be a &lt;code&gt;bash&lt;/code&gt; script.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Add the following code to &lt;code&gt;job.bash&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="c"&gt;#SBATCH --gres=gpu:1&lt;/span&gt;

module load python/3.10

srun python train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Here's what the code is doing line-by-line:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;#!/bin/bash&lt;/code&gt;: shebang to inform the operating system what interpreter to use&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;#SBATCH --gres=gpu:1&lt;/code&gt;: This defines an &lt;code&gt;sbatch&lt;/code&gt; directive to set Slurm to use a general resource (&lt;code&gt;--gres&lt;/code&gt;) of a single GPU (&lt;code&gt;gpu:1&lt;/code&gt;). If multiple GPUs are required, you would replace 1 with the number of GPUs needed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;module load python/3.10&lt;/code&gt;: Configure the user environment to use &lt;code&gt;python3.10&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;srun python train.py&lt;/code&gt;: Launch the training command (&lt;code&gt;python train.py&lt;/code&gt;) as a job step within the resources allocated by the aforementioned directives (see 2).&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
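&lt;p&gt;Beyond &lt;code&gt;--gres&lt;/code&gt;, &lt;code&gt;sbatch&lt;/code&gt; accepts many other directives for shaping a job. The version of &lt;code&gt;job.bash&lt;/code&gt; below adds a few common ones; the values are illustrative placeholders, so check your cluster's limits and the &lt;code&gt;sbatch&lt;/code&gt; documentation [10] before reusing them.&lt;/p&gt;

```shell
#!/bin/bash

#SBATCH --job-name=mnist-cnn   # readable name in squeue output
#SBATCH --gres=gpu:1           # one GPU, as before
#SBATCH --cpus-per-task=4      # CPU cores for the input pipeline
#SBATCH --mem=16G              # RAM for the job
#SBATCH --time=01:00:00        # wall-clock limit (HH:MM:SS)
#SBATCH --output=slurm-%j.out  # %j expands to the job ID

module load python/3.10

srun python train.py
```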

&lt;ul&gt;
&lt;li&gt;Run &lt;code&gt;sbatch job.bash&lt;/code&gt; to queue the job.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;squeue&lt;/code&gt; to see the queued Slurm jobs.&lt;/li&gt;
&lt;li&gt;Wait for the job to execute. A &lt;code&gt;slurm-&amp;lt;JOB_NUMBER&amp;gt;.out&lt;/code&gt; file will be created with any standard output or error piped into it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;And that's it! You now have a basic understanding of how to use Slurm for running GPU-related jobs. For a comprehensive guide on directives, configuration options, and more advanced usage, please refer to the official Slurm documentation [10-12].&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[0] &lt;a href="https://slurm.schedmd.com/slurm_logo.png" rel="noopener noreferrer"&gt;https://slurm.schedmd.com/slurm_logo.png&lt;/a&gt;&lt;br&gt;
[1] &lt;a href="https://www.luc.edu/cs/" rel="noopener noreferrer"&gt;https://www.luc.edu/cs/&lt;/a&gt;&lt;br&gt;
[2] &lt;a href="https://en.wikipedia.org/wiki/Job_scheduler" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Job_scheduler&lt;/a&gt;&lt;br&gt;
[3] &lt;a href="https://www.schedmd.com/slurm" rel="noopener noreferrer"&gt;https://www.schedmd.com/slurm&lt;/a&gt;&lt;br&gt;
[4] &lt;a href="https://slurm.schedmd.com/documentation.html" rel="noopener noreferrer"&gt;https://slurm.schedmd.com/documentation.html&lt;/a&gt;&lt;br&gt;
[5] &lt;a href="https://github.com/SchedMD/slurm" rel="noopener noreferrer"&gt;https://github.com/SchedMD/slurm&lt;/a&gt;&lt;br&gt;
[6] &lt;a href="https://www.tensorflow.org/" rel="noopener noreferrer"&gt;https://www.tensorflow.org/&lt;/a&gt;&lt;br&gt;
[7] &lt;a href="https://keras.io/" rel="noopener noreferrer"&gt;https://keras.io/&lt;/a&gt;&lt;br&gt;
[8] &lt;a href="https://www.tensorflow.org/guide/distributed_training" rel="noopener noreferrer"&gt;https://www.tensorflow.org/guide/distributed_training&lt;/a&gt;&lt;br&gt;
[9] &lt;a href="https://github.com/keras-team/keras-io/blob/master/examples/vision/mnist_convnet.py" rel="noopener noreferrer"&gt;https://github.com/keras-team/keras-io/blob/master/examples/vision/mnist_convnet.py&lt;/a&gt;&lt;br&gt;
[10] &lt;a href="https://slurm.schedmd.com/sbatch.html" rel="noopener noreferrer"&gt;https://slurm.schedmd.com/sbatch.html&lt;/a&gt;&lt;br&gt;
[11] &lt;a href="https://slurm.schedmd.com/srun.html" rel="noopener noreferrer"&gt;https://slurm.schedmd.com/srun.html&lt;/a&gt;&lt;br&gt;
[12] &lt;a href="https://slurm.schedmd.com/squeue.html" rel="noopener noreferrer"&gt;https://slurm.schedmd.com/squeue.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>slurm</category>
      <category>backend</category>
      <category>hpc</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Creating an arXiv DB</title>
      <dc:creator>Nicholas Synovic</dc:creator>
      <pubDate>Sun, 01 Sep 2024 00:27:10 +0000</pubDate>
      <link>https://forem.com/nicholassynovic/creating-an-arxiv-db-940</link>
      <guid>https://forem.com/nicholassynovic/creating-an-arxiv-db-940</guid>
      <description>&lt;p&gt;As a Ph.D. student studying Deep Learning (DL) from the perspective of a Software Engineer, I rely upon academic resources to learn about DL models, techniques, and methods. &lt;a href="https://arxiv.org" rel="noopener noreferrer"&gt;arXiv&lt;/a&gt; is arguably the largest host of the latest academic (but not peer-reviewed) DL manuscripts.  &lt;/p&gt;

&lt;p&gt;However, as the service relies upon community donations, it comes with limitations. One of them is that only the last week of manuscripts is browsable at any given time; the rest are only searchable.&lt;/p&gt;

&lt;p&gt;As someone who checks the service often for the latest information, it can become irritating when I'm casually browsing the site, find an interesting manuscript, forget (for one reason or another) to bookmark it, and then can't find the paper later because I can't nail down the exact keywords to search for it. Additionally, I'd like to leverage the data on the site for other projects, like testing retrieval augmented generation (RAG) techniques for finding information from manuscripts.&lt;/p&gt;

&lt;p&gt;To support users like me, the arXiv team releases the metadata of all papers submitted to the platform weekly on &lt;a href="https://www.kaggle.com/datasets/Cornell-University/arxiv" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt; as JSON. So for today's blog post, let's convert the JSON file into a queryable SQLite3 database!&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Setup
&lt;/h2&gt;

&lt;p&gt;I'll use Python 3.10 and bash for this project, primarily for the &lt;a href="https://pypi.org/project/pandas/" rel="noopener noreferrer"&gt;&lt;code&gt;pandas&lt;/code&gt; library&lt;/a&gt;. &lt;code&gt;pandas&lt;/code&gt; provides convenient &lt;code&gt;read_json&lt;/code&gt; and &lt;code&gt;to_sql&lt;/code&gt; methods for reading JSON files and writing to SQL databases, respectively.&lt;/p&gt;
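&lt;p&gt;Before touching the real dataset, here is a minimal, self-contained sketch (with made-up records) of that round trip: &lt;code&gt;read_json&lt;/code&gt; pulls JSON Lines into a DataFrame, and &lt;code&gt;to_sql&lt;/code&gt; writes it to a SQLite3 database.&lt;/p&gt;

```python
# Round-trip sketch: JSON Lines -> DataFrame -> SQLite3 table.
# The two records below are fabricated examples, not real arXiv data.
import sqlite3
from io import StringIO

import pandas

# Stand-in for a tiny slice of the arXiv metadata file
jsonl = StringIO(
    '{"id": "hep-th/9901001", "title": "Paper A"}\n'
    '{"id": "hep-th/9901002", "title": "Paper B"}\n'
)

df = pandas.read_json(path_or_buf=jsonl, lines=True)

# A DBAPI2 connection works directly with to_sql for SQLite3
with sqlite3.connect(":memory:") as conn:
    df.to_sql(name="documents", con=conn, index=False)
    rows = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]

print(rows)  # 2
```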

&lt;p&gt;To start, I created a GitHub repository based on &lt;a href="https://github.com/NicholasSynovic/template_python" rel="noopener noreferrer"&gt;my Python template repo&lt;/a&gt;. You can find all the project code &lt;a href="https://github.com/NicholasSynovic/tool_arXiv-db" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting And Cleaning The Data
&lt;/h2&gt;

&lt;p&gt;As the arXiv Dataset is hosted on Kaggle, we can use their &lt;a href="https://pypi.org/project/kaggle/" rel="noopener noreferrer"&gt;&lt;code&gt;kaggle&lt;/code&gt;&lt;/a&gt; Python library to download and unzip the data. Wrapping this as a bash script, we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

kaggle datasets download &lt;span class="nt"&gt;--unzip&lt;/span&gt; Cornell-University/arxiv &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;$1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where the &lt;code&gt;--unzip&lt;/code&gt; argument decompresses the data, and the &lt;code&gt;-p&lt;/code&gt; argument specifies a path to download the data. We can improve this further by leveraging &lt;a href="https://github.com/nk412/optparse" rel="noopener noreferrer"&gt;&lt;code&gt;optparse&lt;/code&gt;&lt;/a&gt; to provide command-line arguments for our script. All said and done, we have a download script that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nb"&gt;source &lt;/span&gt;optparse.bash
&lt;span class="c"&gt;# Name the variable DATA_PATH: "variable=PATH" would clobber the shell's $PATH&lt;/span&gt;
optparse.define &lt;span class="nv"&gt;short&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;p &lt;span class="nv"&gt;long&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;path &lt;span class="nv"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Directory to store dataset"&lt;/span&gt; &lt;span class="nv"&gt;variable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;DATA_PATH &lt;span class="nv"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"."&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt; optparse.build &lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;ABS_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;realpath&lt;/span&gt; &lt;span class="nv"&gt;$DATA_PATH&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

kaggle datasets download &lt;span class="nt"&gt;--unzip&lt;/span&gt; Cornell-University/arxiv &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;$ABS_PATH&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have the data in JSON format, we can further optimize it by converting it into &lt;a href="https://jsonlines.org/" rel="noopener noreferrer"&gt;JSON Lines&lt;/a&gt; (JL) format. JL stores JSON data with one object per line, which removes the top-level array of objects in which the arXiv Dataset is stored. With the data in JL format, pandas can read the file in chunks, reducing memory overhead by loading only portions of the data into memory at a time.&lt;/p&gt;
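&lt;p&gt;To make the difference concrete, here is what two made-up records look like in each format, sketched in Python:&lt;/p&gt;

```python
# Standard JSON stores the records as one top-level array; JSON Lines
# stores one compact object per line. (Records are fabricated examples.)
import json

records = [
    {"id": "hep-th/9901001", "title": "Paper A"},
    {"id": "hep-th/9901002", "title": "Paper B"},
]

# Standard JSON: a single array holding every object
as_json = json.dumps(records)

# JSON Lines: each object serialized on its own line
as_jsonl = "\n".join(json.dumps(record) for record in records)

print(as_jsonl)
```

&lt;p&gt;A reader of the JL form can process one line at a time instead of parsing the entire array up front, which is exactly what lets &lt;code&gt;pandas&lt;/code&gt; chunk the file.&lt;/p&gt;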

&lt;p&gt;We can leverage &lt;a href="https://github.com/jqlang/jq" rel="noopener noreferrer"&gt;&lt;code&gt;jq&lt;/code&gt;&lt;/a&gt; to do the conversion with this script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nb"&gt;source &lt;/span&gt;optparse.bash

optparse.define &lt;span class="nv"&gt;short&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;i &lt;span class="nv"&gt;long&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;input &lt;span class="nv"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Input JSON file"&lt;/span&gt; &lt;span class="nv"&gt;variable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;inputPath
optparse.define &lt;span class="nv"&gt;short&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;o &lt;span class="nv"&gt;long&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;output &lt;span class="nv"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Output JSON Lines file"&lt;/span&gt; &lt;span class="nv"&gt;variable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;outputPath

&lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt; optparse.build &lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="nv"&gt;$inputPath&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"No input provided."&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi

if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="nv"&gt;$outputPath&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"No output provided."&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nv"&gt;absInputPath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;realpath&lt;/span&gt; &lt;span class="nv"&gt;$inputPath&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;absOutputPath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;realpath&lt;/span&gt; &lt;span class="nv"&gt;$outputPath&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

jq &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nv"&gt;$absInputPath&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$absOutputPath&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that, our data is finally in a format where we can start loading it into a database!&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating The Database
&lt;/h2&gt;

&lt;p&gt;We will define our database schema using &lt;a href="https://www.sqlalchemy.org/" rel="noopener noreferrer"&gt;SQLAlchemy&lt;/a&gt;. To start, we will store a subset of the information in a single table called &lt;code&gt;documents&lt;/code&gt;; this lets us verify that our database configuration is correct while avoiding nested data for now. Creating a SQLite3 database with SQLAlchemy is fairly simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MetaData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PrimaryKeyConstraint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DB&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqlite:///&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MetaData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MetaData&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;documentTable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createTables&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;createTables&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;documentTable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;submitter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;journal-ref&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report-no&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;categories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;license&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abstract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DateTime&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;PrimaryKeyConstraint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;checkfirst&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this code and checking the database schema we see that the table and columns have been created successfully:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6ofhguzzn00nf2ei1lq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff6ofhguzzn00nf2ei1lq.png" width="586" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will extend this later by adding tables and relationships between nested values and the documents table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inserting Data Into The Database
&lt;/h2&gt;

&lt;p&gt;With Pandas, we can read the data in from the JL file as chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;readJSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunksize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;path_or_buf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;chunksize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunksize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ujson&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;Iterator[DataFrame]&lt;/code&gt; object lazily reads the file into memory one chunk at a time, and we can consume it with a &lt;code&gt;for&lt;/code&gt; loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;loadData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dfs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DB&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dfs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;quit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As our current database schema doesn't include every field captured in the JSON objects, we need to select from our DataFrame only the columns the schema does capture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;getDocuments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;documentsDF&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;submitter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;journal-ref&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report-no&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;categories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;license&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abstract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;documentsDF&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;documentsDF&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;documentsDF&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we can load the document data into the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;loadData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dfs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DB&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dfs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;documentsDF&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getDocuments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;documentsDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;documentTable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;quit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checking our testing database, we can see that the first set of documents was loaded correctly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5biwn7c5jxkklplrjrof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5biwn7c5jxkklplrjrof.png" width="800" height="62"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, when we try to import the entire dataset into the database, we get a &lt;code&gt;sqlalchemy.exc.IntegrityError&lt;/code&gt; because some primary keys are duplicated in the JL file. Rather than handling this while converting the data, we can extend our &lt;code&gt;DB&lt;/code&gt; class to write DataFrames to a table, filtering out duplicate rows should an error arise:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy.exc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;IntegrityError&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DB&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;toSQL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tableName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;IntegrityError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;param&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;param&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tableName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If an error occurs, the rows whose primary keys were reported by the &lt;code&gt;IntegrityError&lt;/code&gt; are filtered out of the DataFrame, and a second attempt is made to insert the remaining rows into the database. Additionally, the method now returns the number of rows committed to the database.&lt;/p&gt;
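&lt;p&gt;As a minimal sketch of that filtering step (with hypothetical ids and data), dropping the offending primary keys before retrying the insert looks like this:&lt;/p&gt;

```python
# Minimal sketch of the duplicate-filtering step in DB.toSQL, using
# hypothetical data: drop the rows whose primary keys the database
# rejected, so the remainder is safe to reinsert.
import pandas

df = pandas.DataFrame(
    {"id": ["a1", "a2", "a3", "a4"], "title": ["w", "x", "y", "z"]}
)

# Pretend the IntegrityError reported "a2" and "a4" as duplicates.
duplicate_ids = ["a2", "a4"]

# Keep only the rows whose id is NOT in the offending list.
filtered = df[~df["id"].isin(values=duplicate_ids)]

print(filtered["id"].tolist())  # → ['a1', 'a3']
```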

&lt;p&gt;Our updated &lt;code&gt;loadData&lt;/code&gt; function now looks like this (with a &lt;code&gt;Spinner&lt;/code&gt; from the &lt;a href="https://pypi.org/project/progress/" rel="noopener noreferrer"&gt;&lt;code&gt;progress&lt;/code&gt;&lt;/a&gt; library to report progress):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;loadData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dfs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DB&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;Spinner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading data into &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;... &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;spinner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dfs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;documentsDF&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getDocuments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toSQL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tableName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;documentTable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;documentsDF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;spinner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Wrapping Up&lt;/h2&gt;

&lt;p&gt;Now that the basic structure of the application has been created, all that's left is to add the other tables.&lt;/p&gt;

&lt;p&gt;For example, we can create a table called &lt;code&gt;authors&lt;/code&gt; to store each author of a document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;authorTable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;PrimaryKeyConstraint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;ForeignKeyConstraint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;refcolumns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then extract just the authors from the DataFrame with this function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;getAuthors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idIncrement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;authorsDF&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authors_parsed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="n"&gt;authorsDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;authorsDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authors_parsed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ignore_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;authorsDF&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;authorsDF&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authors_parsed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;authorsDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;authorsDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authors_parsed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;authorsDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;idIncrement&lt;/span&gt;
    &lt;span class="n"&gt;authorsDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;authorsDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;authorsDF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;authorsDF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;authorsDF&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we can use the &lt;code&gt;DB.toSQL()&lt;/code&gt; method to write it to the database.&lt;/p&gt;
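&lt;p&gt;One detail worth noting is the &lt;code&gt;idIncrement&lt;/code&gt; parameter: because the data arrives in chunks, each chunk's author ids must be offset by a running total so they stay unique across the whole table. A sketch of that bookkeeping, with illustrative data and a simplified helper, might look like this:&lt;/p&gt;

```python
# Hedged sketch of keeping ids unique across chunks, mirroring the
# idIncrement parameter of getAuthors. Data and helper are illustrative.
import pandas

def number_rows(df: pandas.DataFrame, id_increment: int = 0) -> pandas.DataFrame:
    # The same index-offset trick as getAuthors: shift the index by the
    # running total, then surface it as an "id" column.
    df = df.reset_index(drop=True)
    df.index += id_increment
    return df.reset_index().rename(columns={"index": "id"})

chunk1 = pandas.DataFrame({"author": ["Knuth, D.", "Lamport, L."]})
chunk2 = pandas.DataFrame({"author": ["Ritchie, D."]})

id_increment: int = 0
frames = []
for chunk in (chunk1, chunk2):
    numbered = number_rows(chunk, id_increment)
    id_increment += numbered.shape[0]  # advance the offset by rows numbered
    frames.append(numbered)

result = pandas.concat(frames)
print(result["id"].tolist())  # → [0, 1, 2]
```

&lt;p&gt;In the real pipeline, the running total could advance by the row count that &lt;code&gt;DB.toSQL()&lt;/code&gt; returns, since the number of rows committed may be smaller than the chunk after duplicates are dropped.&lt;/p&gt;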

&lt;p&gt;The final database schema is as follows (generated with &lt;a href="https://www.schemacrawler.com/" rel="noopener noreferrer"&gt;SchemaCrawler&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Felvuus7uhhg49tsov9hh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Felvuus7uhhg49tsov9hh.png" alt=" " width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the schema shows, storing this data involves additional complexity. The &lt;code&gt;versions&lt;/code&gt; table undergoes similar transformations.&lt;/p&gt;

&lt;p&gt;If you are interested in how the &lt;code&gt;versions&lt;/code&gt; table is created, or want to try this tool yourself, please visit the &lt;a href="https://github.com/NicholasSynovic/tool_arXiv-db" rel="noopener noreferrer"&gt;GitHub project page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks for taking the time to read this post. I hope to be posting more in the future. &lt;/p&gt;

</description>
      <category>arxiv</category>
      <category>database</category>
      <category>softwareengineering</category>
      <category>computerscience</category>
    </item>
  </channel>
</rss>
