<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rohit Sah</title>
    <description>The latest articles on Forem by Rohit Sah (@sahrohit).</description>
    <link>https://forem.com/sahrohit</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1131304%2F4caeb306-51f8-47b0-af9b-e2d7e4455f96.jpeg</url>
      <title>Forem: Rohit Sah</title>
      <link>https://forem.com/sahrohit</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sahrohit"/>
    <language>en</language>
    <item>
      <title>Per Segment Plane Sweep Line Segment Intersection on the GPU Paper Summary</title>
      <dc:creator>Rohit Sah</dc:creator>
      <pubDate>Fri, 27 Jun 2025 18:06:51 +0000</pubDate>
      <link>https://forem.com/sahrohit/per-segment-plane-sweep-line-segment-intersection-on-the-gpu-paper-summary-13fd</link>
      <guid>https://forem.com/sahrohit/per-segment-plane-sweep-line-segment-intersection-on-the-gpu-paper-summary-13fd</guid>
      <description>&lt;p&gt;This blog is originally post on &lt;a href="https://blog.sahrohit.com.np/posts/plane-sweep-parallel" rel="noopener noreferrer"&gt;https://blog.sahrohit.com.np/posts/plane-sweep-parallel&lt;/a&gt; and distributed across platforms for more reach. The original paper of this paper summary is available &lt;a href="https://doi.org/10.1145/3557915.3561003" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The paper presents two algorithms: the Hardware Materialized Active List (HMAL) algorithm and the Per Segment Sweep Line (PSSL) algorithm. HMAL is limited by its synchronization requirement and by the number of threads allowed in a single GPU thread block. PSSL addresses these drawbacks of HMAL by using dynamic parallelism, and it is fast, memory efficient, and easier to implement than competing approaches.&lt;/p&gt;

&lt;p&gt;HMAL uses the processing cores themselves as a hardware representation of the active list data structure, which contains the line segments currently encountered by the sweep line. Its drawback is the synchronization requirement: the algorithm is limited by the maximum number of threads allowed in a single GPU thread block (currently 1024). The second algorithm in the paper, the Per Segment Sweep Line (PSSL) algorithm, addresses this by eliminating the need for synchronization primitives in the GPU kernels, and thus the constraint of confining all threads to a single thread block. It achieves these improvements by taking advantage of dynamic parallelism. PSSL is easy to implement, fast, and memory efficient compared to competing approaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Serial Plane Sweep Algorithm
&lt;/h3&gt;

&lt;p&gt;When the sweep line encounters the left endpoint of a line segment, the segment is added to the active list; when its right endpoint is encountered, the segment is removed from the active list.&lt;/p&gt;
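To make the active-list idea concrete, here is a minimal serial sketch (not the paper's code). It is simplified in one way: an entering segment is tested against every currently active segment rather than only its sorted neighbors, and only proper crossings are counted.

```cpp
#include <algorithm>
#include <vector>

struct Seg { double x1, y1, x2, y2; };  // assumed stored with x1 <= x2

static double cross(double ax, double ay, double bx, double by) {
    return ax * by - ay * bx;
}

// Orientation-based test for a proper (strict) crossing of two segments.
static bool intersects(const Seg& a, const Seg& b) {
    double d1 = cross(b.x2 - b.x1, b.y2 - b.y1, a.x1 - b.x1, a.y1 - b.y1);
    double d2 = cross(b.x2 - b.x1, b.y2 - b.y1, a.x2 - b.x1, a.y2 - b.y1);
    double d3 = cross(a.x2 - a.x1, a.y2 - a.y1, b.x1 - a.x1, b.y1 - a.y1);
    double d4 = cross(a.x2 - a.x1, a.y2 - a.y1, b.x2 - a.x1, b.y2 - a.y1);
    return ((d1 > 0) != (d2 > 0)) && ((d3 > 0) != (d4 > 0));
}

// Sweep left to right: a segment enters the active list at its left
// endpoint and leaves at its right endpoint; each entering segment is
// tested against the segments currently active.
int countIntersections(const std::vector<Seg>& segs) {
    struct Event { double x; bool left; int id; };
    std::vector<Event> events;
    for (int i = 0; i < (int)segs.size(); ++i) {
        events.push_back({segs[i].x1, true, i});
        events.push_back({segs[i].x2, false, i});
    }
    // At equal x, process left endpoints before right endpoints.
    std::sort(events.begin(), events.end(), [](const Event& a, const Event& b) {
        return a.x < b.x || (a.x == b.x && a.left && !b.left);
    });

    std::vector<int> active;
    int count = 0;
    for (const Event& e : events) {
        if (e.left) {
            for (int j : active)
                if (intersects(segs[e.id], segs[j])) ++count;
            active.push_back(e.id);
        } else {
            active.erase(std::find(active.begin(), active.end(), e.id));
        }
    }
    return count;
}
```

The neighbors-only refinement (the actual Bentley-Ottmann approach) would keep the active list ordered by y at the sweep position; the sketch keeps only the enter/leave discipline.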

&lt;h3&gt;
  
  
  Grid based Algorithm
&lt;/h3&gt;

&lt;p&gt;The algorithm treats each grid cell as a separate unit of work for a processing unit, testing each line segment for intersections against all other line segments in the same grid cell. Polygons with a high line segment count in a relatively small area (high density) result in grid cells requiring large numbers of comparisons, leading to unbalanced workloads across processing units.&lt;/p&gt;
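The binning step behind this approach can be sketched as follows (an illustrative sketch, not the paper's code; the cell size and bounding-box binning are assumptions):

```cpp
#include <algorithm>
#include <cmath>
#include <map>
#include <utility>
#include <vector>

struct Seg { double x1, y1, x2, y2; };

// Bin each segment into every grid cell its bounding box overlaps.
// Each cell then becomes an independent unit of work whose segments
// are tested pairwise for intersections.
std::map<std::pair<int,int>, std::vector<int>>
binSegments(const std::vector<Seg>& segs, double cell) {
    std::map<std::pair<int,int>, std::vector<int>> grid;
    for (int i = 0; i < (int)segs.size(); ++i) {
        int cx0 = (int)std::floor(std::min(segs[i].x1, segs[i].x2) / cell);
        int cx1 = (int)std::floor(std::max(segs[i].x1, segs[i].x2) / cell);
        int cy0 = (int)std::floor(std::min(segs[i].y1, segs[i].y2) / cell);
        int cy1 = (int)std::floor(std::max(segs[i].y1, segs[i].y2) / cell);
        for (int cx = cx0; cx <= cx1; ++cx)
            for (int cy = cy0; cy <= cy1; ++cy)
                grid[{cx, cy}].push_back(i);  // segment index joins the cell
    }
    return grid;
}
```

A dense cluster of segments lands in one cell's vector, which is exactly the workload-imbalance problem described above.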

&lt;h3&gt;
  
  
  Two-level Grid based Algorithm
&lt;/h3&gt;

&lt;p&gt;In this algorithm, a grid cell is subdivided into sub-grids whenever the number of line segments it contains exceeds a user-specified threshold.&lt;/p&gt;

&lt;h3&gt;
  
  
  Neighbor finding Reduction Variation of Parallel Sweep Line Algorithm
&lt;/h3&gt;

&lt;p&gt;This algorithm generates events for each endpoint of a line segment and each found intersection point. The initial endpoint events are processed in parallel first. The intersections that are found during the processing of the original endpoints generate new events.&lt;/p&gt;

&lt;h3&gt;
  
  
  Line Segment Filtering PSCMBR
&lt;/h3&gt;

&lt;p&gt;PSCMBR is a combination of PolySketch (used to find a Minimum Bounding Rectangle (MBR) around a predetermined number of line segments, called a tile) and CMBR (Common Minimum Bounding Rectangle). PolySketch filters for overlapping tiles, and the CMBR of the tiles refines the search for candidate line segments that intersect.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;These are five existing types of line segment intersection algorithms. Their results and time complexities vary, and they are used for benchmarking and comparison against the proposed algorithms.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Preliminaries
&lt;/h2&gt;

&lt;p&gt;The input is a list of line segments forming complex regions, which the algorithm sorts along the X-axis. If the segments do not form complex regions (self-intersecting regions or regions without a closed loop), the algorithm will still run to completion. The line segments are generated using a Python script. They are not generated in sorted order, because sorting is part of the algorithm and its time complexity is included in the final analysis.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The input data is not sorted and time taken to sort is also considered a part of the algorithm.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Hardware Materialized Active List
&lt;/h2&gt;

&lt;p&gt;The main difference lies in how the active list is handled. In traditional sweep line algorithms, when the left endpoint of a segment is encountered it is entered into the active list, and when the right endpoint is encountered it is removed from the active list.&lt;/p&gt;

&lt;p&gt;Since the list is sorted, a newly inserted line segment only needs to be checked against its neighboring segments in the active list for intersections.&lt;/p&gt;

&lt;p&gt;An imaginary sweep line moves across the plane from left to right. As the sweep line encounters new line segments, they are added to the hardware active list and processed for intersections and topology.&lt;/p&gt;

&lt;p&gt;When the left endpoint of a segment is encountered, a thread is assigned to it. The newly added segment is then checked for intersection against the segments held by all existing threads in parallel. The active list is thus materialized directly in hardware, which removes the logarithmic lookup time of a central tree-based active list. Care must be taken that the number of active line segments does not exceed the number of available CUDA processor cores.&lt;/p&gt;

&lt;p&gt;If the x value of a newly encountered left endpoint is larger than the x value of the right endpoint of a segment in the hardware active list, that segment's thread notices and frees itself, automatically removing the segment from the hardware active list.&lt;/p&gt;

&lt;p&gt;At the end of each loop iteration, the processing threads must be synchronized to ensure that all threads are testing the current line segment for intersections.&lt;/p&gt;

&lt;p&gt;The implementation is limited to a maximum active list size of 1024 due to the block-level synchronization. For N line segments, the active list size is roughly sqrt(N), so the algorithm can process about 1M total line segments across two polygons.&lt;/p&gt;
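As a rough CPU emulation of this mechanism (not the paper's CUDA implementation), the active list can be modeled as an array of slots standing in for threads: an entering segment claims a slot, is tested against all occupied slots in parallel, and a slot frees itself once the sweep passes its segment's right endpoint. Names and structure here are illustrative.

```cpp
#include <algorithm>
#include <vector>

struct Seg { double x1, y1, x2, y2; };  // assumed stored with x1 <= x2

static double cross(double ax, double ay, double bx, double by) {
    return ax * by - ay * bx;
}

static bool intersects(const Seg& a, const Seg& b) {
    double d1 = cross(b.x2 - b.x1, b.y2 - b.y1, a.x1 - b.x1, a.y1 - b.y1);
    double d2 = cross(b.x2 - b.x1, b.y2 - b.y1, a.x2 - b.x1, a.y2 - b.y1);
    double d3 = cross(a.x2 - a.x1, a.y2 - a.y1, b.x1 - a.x1, b.y1 - a.y1);
    double d4 = cross(a.x2 - a.x1, a.y2 - a.y1, b.x2 - a.x1, b.y2 - a.y1);
    return ((d1 > 0) != (d2 > 0)) && ((d3 > 0) != (d4 > 0));
}

// Slots stand in for the CUDA threads that materialize the active list;
// -1 marks a free slot, mirroring the 1024-thread block limit.
int hmalEmulation(std::vector<Seg> segs, int maxSlots = 1024) {
    std::sort(segs.begin(), segs.end(),
              [](const Seg& a, const Seg& b) { return a.x1 < b.x1; });
    std::vector<int> slot(maxSlots, -1);
    int count = 0;
    for (int s = 0; s < (int)segs.size(); ++s) {
        double x = segs[s].x1;
        int found = 0;
        // One iteration of the loop: every occupied slot self-frees if the
        // sweep has passed it, otherwise tests the incoming segment.
        #pragma omp parallel for reduction(+:found)
        for (int t = 0; t < maxSlots; ++t) {
            if (slot[t] < 0) continue;
            if (segs[slot[t]].x2 < x) { slot[t] = -1; continue; }  // self-free
            if (intersects(segs[s], segs[slot[t]])) ++found;
        }
        count += found;
        for (int t = 0; t < maxSlots; ++t)  // incoming segment claims a slot
            if (slot[t] < 0) { slot[t] = s; break; }
    }
    return count;
}
```

The serial loop over segments plays the role of the sweep line; the barrier at the end of the parallel loop corresponds to the block-level synchronization that caps the active list at 1024.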

&lt;blockquote&gt;
&lt;p&gt;The active list is actually made of threads: when the left endpoint of a line segment is encountered, it is assigned a thread, which checks for intersections against all items of the active list on their respective threads and stores the results. When the right endpoint of the line segment is encountered, the thread is cleared. This ensures that as the sweep line moves, intersections are only checked within the active list. The limitation of this approach is the maximum size of the active list: due to the block-level synchronization it can be at most 1024, so the algorithm can process roughly 1M line segments of two polygons at once.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Per Segment Sweep Line Algorithm
&lt;/h2&gt;

&lt;p&gt;This algorithm tests line segments as if each were encountered by its own per-segment sweep line. The intersection tests themselves are deferred; the pairing of line segments is computed first.&lt;/p&gt;

&lt;p&gt;Once the pairing of the line segments has been calculated, separate dynamic threads are launched to perform the intersection tests. Since pairing and intersection testing happen separately, the algorithm is not constrained to a single thread block as HMAL was.&lt;/p&gt;

&lt;p&gt;The endpoints are sorted by their x-value, y-value, segment ID, and left-right flag using a standard parallel sort. The left-right flag indicates whether the endpoint is the left endpoint (value 1) or the right endpoint (value 0) of its line segment.&lt;/p&gt;
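The endpoint records and their sort keys can be sketched like this (illustrative only; `std::sort` stands in for the parallel sort used on the GPU, and segments are assumed stored with x1 &lt;= x2):

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

struct Seg { double x1, y1, x2, y2; };

// One record per segment endpoint: x, y, segment ID, and the left-right
// flag (1 = left endpoint, 0 = right), mirroring the sort keys above.
struct Endpoint { double x, y; int segId, isLeft; };

std::vector<Endpoint> sortedEndpoints(const std::vector<Seg>& segs) {
    std::vector<Endpoint> pts;
    for (int i = 0; i < (int)segs.size(); ++i) {
        pts.push_back({segs[i].x1, segs[i].y1, i, 1});  // left endpoint
        pts.push_back({segs[i].x2, segs[i].y2, i, 0});  // right endpoint
    }
    // Lexicographic sort on (x, y, segId, left-right flag).
    std::sort(pts.begin(), pts.end(), [](const Endpoint& a, const Endpoint& b) {
        return std::tie(a.x, a.y, a.segId, a.isLeft) <
               std::tie(b.x, b.y, b.segId, b.isLeft);
    });
    return pts;
}
```

After this sort, scanning the endpoint array left to right is equivalent to advancing the sweep line, which is what the TCV computation in the next step walks over.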

&lt;p&gt;Test Count Values (TCVs), the number of intersection tests for each line segment, are then calculated.&lt;/p&gt;

&lt;p&gt;Threads are launched based on the worklist created in the previous step. Possible intersections are then calculated based on the left-most endpoint, which prevents repeated calculation of the same intersection point. Intersections of the types in Fig. (a, b, c) are calculated for line segment A. The Fig. (d) type is calculated under line segment B. The Fig. (e) type is also calculated by line segment B, because the rightmost endpoint of line segment A falls within line segment B. The Fig. (f) type is likewise calculated by line segment B, because A and C do not fully cross segment B.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrycgb7kka64em8nw05d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrycgb7kka64em8nw05d.png" alt="Types of Intersections" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the intersection points are found, the line segment will be broken down and the interior above and interior below data tags will be updated on each line segment.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;First the TCVs are calculated by sorting each line by features and also number of intersections are calculated. Then worker threads are launched based on the number of intersections. Then based on the type of intersection the intersections are calculated in order to prevent the repeated calculations. Then the line segments are broken at intersection points and the data is updated.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Algorithm Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  HMAL Time Complexity
&lt;/h3&gt;

&lt;p&gt;Sorting of Line Segments: O (&lt;em&gt;N log N&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;Detecting Intersections: O(&lt;em&gt;N&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;Sorting the intersections: O(&lt;em&gt;k log k&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;Breaking of Line Segment: O(&lt;em&gt;k&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;Total: O(&lt;em&gt;N log N&lt;/em&gt; + &lt;em&gt;k log k&lt;/em&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  PSSL Time Complexity
&lt;/h3&gt;

&lt;p&gt;Creation of stencils and determination of intersection tests: O(&lt;em&gt;N&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;PSSL Algorithm: O(&lt;em&gt;N&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;Breaking of Line Segment: O(&lt;em&gt;k log k&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;Update Topology Flags: O(&lt;em&gt;k&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;Total: O(&lt;em&gt;N log N&lt;/em&gt; + &lt;em&gt;k log k&lt;/em&gt;).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The time complexity of both algorithms is similar, at O(&lt;em&gt;N log N&lt;/em&gt; + &lt;em&gt;k log k&lt;/em&gt;). The main difference is that HMAL is limited by the maximum CUDA thread block size, whereas PSSL is not.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Experimental Results
&lt;/h2&gt;

&lt;p&gt;Each test is performed 5 times and the average is taken. The PS-ACC algorithm was not tested because its source code was not available (despite being quadratic, PS-ACC reportedly performs well).&lt;/p&gt;

&lt;p&gt;By far the best performance is given by PSSL-CPU, because the datasets are small (10K to 80K lines). PSSL-GPU lags behind by an order of magnitude due to data transfer to/from the GPU and kernel launch overheads. This suggests that PSSL-CPU could be used in database applications like PostGIS.&lt;/p&gt;

&lt;p&gt;The authors used their own implementation of the serial sweep line (SSL) algorithm in order to address its known sensitivity to numerical precision and rounding errors. On the Parks/Counties dataset, SSL enters an infinite loop due to arithmetic precision errors.&lt;/p&gt;

&lt;p&gt;PSSL-GPU scales linearly or sublinearly because the maximum size of the active list never exceeds the number of processing cores.&lt;/p&gt;

&lt;p&gt;PSSL did not struggle on either the CPU or the GPU, but it is sensitive to the density of intersections.&lt;/p&gt;

&lt;p&gt;PSCMBR uses geometric overlay rather than joins to find polygon intersections. It cannot be directly compared with PSSL, since PSSL performs a join operation while PSCMBR performs polygon overlay. In the results, PSCMBR performs better on high-density datasets, where PSSL struggles.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Each test is run 5 times and the average time is reported. The algorithms are tested on both generated and real-world datasets. The HMAL algorithm could not run on all tests due to its block synchronization requirement. PSSL performed better across the tests, with some struggle on dense datasets. The PSSL algorithm is also compared with PSCMBR.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>Parallelizing Sorting Algorithms using OpenMP</title>
      <dc:creator>Rohit Sah</dc:creator>
      <pubDate>Sun, 13 Oct 2024 23:00:06 +0000</pubDate>
      <link>https://forem.com/sahrohit/parallelizing-sorting-algorithms-using-openmp-1hec</link>
      <guid>https://forem.com/sahrohit/parallelizing-sorting-algorithms-using-openmp-1hec</guid>
      <description>&lt;p&gt;This blog is originally post on &lt;a href="https://blog.sahrohit.com.np/posts/parallel-sorting-algorithm" rel="noopener noreferrer"&gt;https://blog.sahrohit.com.np/posts/parallel-sorting-algorithm&lt;/a&gt; and distributed across platforms for more reach.&lt;/p&gt;

&lt;p&gt;This blog investigates parallel programming using OpenMP, particularly in the context of sorting algorithms. The goal is to compare the performance of serial and parallel implementations of three well-known sorting algorithms (Bubble Sort, Quick Sort, and Merge Sort) on randomly generated numbers, providing insight into how parallel programming impacts computational efficiency.&lt;/p&gt;

&lt;p&gt;The blog involves creating both serial (S) and parallel (P) implementations for each of the three sorting algorithms, alongside a reference implementation using STL’s std::sort to benchmark the results. The performance of these algorithms is evaluated by measuring the execution times across varying input sizes, and the effects of hyper-threading are also studied.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware Overview
&lt;/h3&gt;

&lt;p&gt;The hardware setup for the systems used in this blog includes two servers. The first, radish, has 4 cores with 2 threads per core, resulting in 8 virtual CPUs (vCPUs). The second, jalapeno, has 8 cores with 2 threads per core, providing 16 vCPUs; hyper-threading is enabled on both. These systems provide a range of hardware configurations for testing the parallel efficiency of sorting algorithms with OpenMP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sorting Algorithm Overview
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bubble Sort (O(n^2))
&lt;/h3&gt;

&lt;p&gt;Bubble Sort is a simple comparison-based sorting algorithm with quadratic time complexity that works by comparing each item with its adjacent item and swapping to sort. Parallelizing Bubble Sort is challenging due to its inherent sequential nature.&lt;/p&gt;

&lt;p&gt;The parallel version of this algorithm alternates between two phases to ensure all elements are correctly positioned. Each thread operates on distinct portions of the data during these phases, improving efficiency by reducing the number of redundant comparisons. The shared state of the array is maintained consistently across threads by using synchronization techniques, such as reduction, to ensure proper sorting without conflicts or race conditions. This approach enables the system to leverage available computational resources fully, speeding up the sorting process, especially for large datasets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;bubbleSortParallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;sorted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sorted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="cp"&gt;#pragma omp parallel private(tmp)
&lt;/span&gt;        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Even phase&lt;/span&gt;
            &lt;span class="cp"&gt;#pragma omp for reduction(&amp;amp;&amp;amp;:sorted)
&lt;/span&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;tmp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
                    &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
                    &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="n"&gt;sorted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="c1"&gt;// Odd phase&lt;/span&gt;
            &lt;span class="cp"&gt;#pragma omp for reduction(&amp;amp;&amp;amp;:sorted)
&lt;/span&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;tmp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
                    &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
                    &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="n"&gt;sorted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Quick Sort (O(n log n))
&lt;/h3&gt;

&lt;p&gt;Quick Sort is a widely used, efficient divide-and-conquer algorithm. By recursively partitioning the array, it achieves O(n log n) performance in the average case. Quick Sort can be parallelized by assigning different partitions to different threads.&lt;/p&gt;

&lt;p&gt;In the parallel version of Quick Sort, the divide-and-conquer approach is further optimized by assigning different array partitions to separate threads. After partitioning the array around a pivot element, the left and right subarrays are handled independently. This independence allows for parallel execution, where multiple threads can simultaneously sort different parts of the array, greatly reducing the overall sorting time. By controlling the depth of recursion, tasks are conditionally created to avoid excessive parallelization overhead, ensuring that computational resources are utilized efficiently. As a result, parallel Quick Sort can handle large datasets more effectively than its sequential counterpart.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;quickSortParallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;low&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;low&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;low&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="cp"&gt;#pragma omp task shared(arr) if(depth &amp;lt; 3)
&lt;/span&gt;        &lt;span class="n"&gt;quickSortParallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;low&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="cp"&gt;#pragma omp task shared(arr) if(depth &amp;lt; 3)
&lt;/span&gt;        &lt;span class="n"&gt;quickSortParallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="cp"&gt;#pragma omp taskwait
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the source code, parallelization is achieved through the #pragma omp task directives, which create parallel tasks for sorting the left and right subarrays after partitioning. Specifically, after calculating the pivot index (pi), two separate tasks are spawned: one for sorting the subarray to the left of the pivot (quickSortParallel(arr, low, pi - 1, depth + 1)) and another for the subarray to the right of the pivot (quickSortParallel(arr, pi + 1, high, depth + 1)). These tasks are executed concurrently by different threads.&lt;/p&gt;

&lt;p&gt;The condition if(depth &amp;lt; 3) ensures that task creation is limited to the first few recursive levels, which prevents excessive overhead from creating too many tasks in deep recursion. The #pragma omp taskwait directive ensures that the program waits for both sorting tasks to complete before moving forward, maintaining correctness in the overall sorting process. This structured approach efficiently parallelizes the Quick Sort algorithm while balancing task creation and resource usage.&lt;/p&gt;
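One detail the snippet leaves implicit: OpenMP tasks only execute concurrently when the recursion is seeded from inside a parallel region, typically with a single thread creating the first tasks. A minimal driver under that assumption (the `sortInParallel` wrapper and this `partition` are illustrative, not the article's exact code) looks like:

```cpp
#include <algorithm>
#include <vector>

// Lomuto partition around the last element, as commonly paired with
// this style of parallel Quick Sort.
int partition(int arr[], int low, int high) {
    int pivot = arr[high], i = low - 1;
    for (int j = low; j < high; ++j)
        if (arr[j] < pivot) std::swap(arr[++i], arr[j]);
    std::swap(arr[i + 1], arr[high]);
    return i + 1;
}

void quickSortParallel(int arr[], int low, int high, int depth = 0) {
    if (low < high) {
        int pi = partition(arr, low, high);

        #pragma omp task shared(arr) if(depth < 3)
        quickSortParallel(arr, low, pi - 1, depth + 1);

        #pragma omp task shared(arr) if(depth < 3)
        quickSortParallel(arr, pi + 1, high, depth + 1);

        #pragma omp taskwait
    }
}

// Tasks need an enclosing parallel region; one thread seeds the task
// tree and the team's idle threads pick up the spawned tasks.
void sortInParallel(std::vector<int>& v) {
    #pragma omp parallel
    {
        #pragma omp single
        quickSortParallel(v.data(), 0, (int)v.size() - 1);
    }
}
```

Without the `parallel`/`single` pair, the task pragmas are effectively no-ops and the sort runs serially, which is a common pitfall when benchmarking task-based OpenMP code.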

&lt;h3&gt;
  
  
  Merge Sort (O(n log n))
&lt;/h3&gt;

&lt;p&gt;Merge Sort is another divide-and-conquer algorithm, with guaranteed O(n log n) performance in the worst case. It is particularly suitable for parallelization since different subarrays can be sorted concurrently, making it an ideal candidate for OpenMP.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;mergeSortParallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="cp"&gt;#pragma omp task shared(arr) if(depth &amp;lt; 3)
&lt;/span&gt;        &lt;span class="n"&gt;mergeSortParallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="cp"&gt;#pragma omp task shared(arr) if(depth &amp;lt; 3)
&lt;/span&gt;        &lt;span class="n"&gt;mergeSortParallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="cp"&gt;#pragma omp taskwait
&lt;/span&gt;        &lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the parallel version of Merge Sort, the array is recursively divided into smaller subarrays that can be sorted independently. The key opportunity for parallelization lies in the recursive calls: the two halves can be sorted simultaneously in separate threads, while the merge of each pair of sorted halves remains a sequential step. Running the independent subarray sorts in parallel significantly reduces the overall runtime, especially for large datasets. The recursion depth is capped to prevent excessive overhead, ensuring that only the upper levels of recursion are parallelized, where the benefits of parallelism are most significant.&lt;/p&gt;

&lt;p&gt;In the source code, parallelization is implemented with the #pragma omp task directive, which creates parallel tasks for sorting the left and right halves of the array (mergeSortParallel(arr, l, m, depth + 1) and mergeSortParallel(arr, m + 1, r, depth + 1)). The if(depth &amp;lt; 3) clause restricts task creation to the upper levels of recursion, balancing parallelism against task-creation overhead. The #pragma omp taskwait directive ensures that both sorting tasks complete before the merge step, which is essential for correctness. Note that for the tasks to actually run in parallel, the initial call to mergeSortParallel must be made from inside a #pragma omp parallel region (typically under #pragma omp single), so that a thread team exists to execute the tasks.&lt;/p&gt;
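The merge step itself runs sequentially inside each task. A minimal sketch of the two-pointer merge that mergeSortParallel relies on (the repository's exact implementation may differ):

```cpp
#include <vector>

// Merge the sorted halves arr[l..m] and arr[m+1..r] back into arr.
// Each task operates on a disjoint index range, so no locking is needed.
void merge(int arr[], int l, int m, int r) {
    std::vector<int> tmp;
    tmp.reserve(r - l + 1);
    int i = l, j = m + 1;
    while (i <= m && j <= r)
        tmp.push_back(arr[i] <= arr[j] ? arr[i++] : arr[j++]);
    while (i <= m) tmp.push_back(arr[i++]);   // drain the left half
    while (j <= r) tmp.push_back(arr[j++]);   // drain the right half
    for (int k = 0; k < (int)tmp.size(); ++k)
        arr[l + k] = tmp[k];
}
```

Using `arr[i] <= arr[j]` (rather than `<`) keeps the merge stable when the halves contain equal keys.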

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Experimental Setup
&lt;/h3&gt;

&lt;p&gt;To evaluate performance, each program generates a random array of integers whose size was varied from small to large (2,500 to 500,000 integers). The array size and random seed were supplied as command-line arguments. Each sorting algorithm, in both its serial and parallel (16-thread) versions, was compiled into a separate executable, and its execution time was measured.&lt;/p&gt;

&lt;p&gt;Each executable took two command-line arguments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cpp &lt;span class="o"&gt;[&lt;/span&gt;executable name] &lt;span class="o"&gt;[&lt;/span&gt;number of random integers to generate] &lt;span class="o"&gt;[&lt;/span&gt;seed value &lt;span class="k"&gt;for &lt;/span&gt;random number generation]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, a &lt;a href="https://github.com/sahrohit/parallel-sorting/blob/main/run.sh" rel="noopener noreferrer"&gt;bash script&lt;/a&gt; was written to build and run all of the executables and write each runtime into a single CSV file.&lt;/p&gt;
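Inside each executable, the two arguments might be consumed along these lines (a sketch; makeInput is an illustrative name, not necessarily the repository's):

```cpp
#include <cstdlib>
#include <vector>

// Build the input array from the two CLI arguments: size and seed.
// Seeding the generator makes runs reproducible, so every executable
// in a benchmark round sorts an identical array.
std::vector<int> makeInput(int size, unsigned seed) {
    std::srand(seed);
    std::vector<int> v(size);
    for (int i = 0; i < size; ++i) v[i] = std::rand();
    return v;
}
```

In main this would be called as makeInput(std::atoi(argv[1]), std::atoi(argv[2])).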

&lt;p&gt;Parallelization was achieved using OpenMP, with each sorting algorithm parallelized according to its structure. Execution times were measured with the C++ chrono library. Performance was compared across three variants of each algorithm: serial, parallel, and STL’s std::sort as a reference.&lt;/p&gt;
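The chrono-based measurement pattern looks roughly like this (a sketch; time_sort is an illustrative name):

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Time one sort call and return elapsed wall-clock seconds,
// the value each executable writes out for the CSV.
double time_sort(std::vector<int>& v) {
    auto start = std::chrono::steady_clock::now();
    std::sort(v.begin(), v.end());      // the algorithm under test
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

steady_clock is the safer choice for interval measurement, since it is guaranteed to be monotonic.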

&lt;h3&gt;
  
  
  Metrics for Comparison
&lt;/h3&gt;

&lt;p&gt;The primary metric used for comparison was execution time in seconds. Tests were performed across a range of input sizes to observe how performance scales with increasing data. The results were plotted in graphs to visualize the difference between serial, parallel, and reference implementations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bubble Sort Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Graph 1: Bubble Sort (bbs, bbp, reference)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrjsksg5k2yc2ofscsdd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrjsksg5k2yc2ofscsdd.png" alt="Bubble Sort Runtime over Input Size" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F461oz82zofj18glz9ztg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F461oz82zofj18glz9ztg.png" alt="Bubble Sort Runtime (log) over Input Size" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These graphs compare the execution times of three sorting implementations (the C++ STL sort, the parallel bubble sort, and the serial bubble sort) across a range of input sizes. The parallel version is considerably slower on smaller arrays because of the overhead of thread management, synchronization, and cache coherency. For larger inputs, however, its execution time improves considerably relative to the serial version: CPU cores are utilized more effectively, memory access patterns are more efficient, and hardware prefetching is better exploited, even though both bubble sort implementations remain O(n²).&lt;/p&gt;

&lt;h3&gt;
  
  
  Explanation of Graph
&lt;/h3&gt;

&lt;p&gt;For smaller datasets, the overhead associated with managing threads in the parallel version (bbp) leads to slower performance compared to the serial version (bbs). However, as the dataset size increases, the parallel implementation starts to outperform the serial one by distributing the workload across multiple cores. The reference STL sort algorithm consistently surpasses both bbs and bbp due to its sophisticated, optimized design, which includes considerations for memory access patterns, CPU pipelining, and more advanced sorting algorithms with better time complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Analysis
&lt;/h3&gt;

&lt;p&gt;The weaker performance of parallel bubble sort on smaller arrays stems from the considerable overhead involved in thread creation and synchronization. Additionally, bubble sort’s inherent memory access patterns introduce frequent synchronization barriers due to data dependencies. However, on larger datasets, the increased workload enables parallelism to be more effective, resulting in better performance for bbp compared to its serial counterpart. Nevertheless, both bubble sort variants remain slower than the STL sort, which leverages advanced algorithms and optimizations tailored for modern CPU architectures.&lt;/p&gt;
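A common way to parallelize bubble sort, and a plausible shape for bbp, is odd-even transposition sort: alternating phases compare disjoint pairs, so each phase can be split across threads with a simple parallel for. A sketch under that assumption (the repository's actual bbp may differ):

```cpp
#include <utility>

// Odd-even transposition sort: each phase touches disjoint index
// pairs, so the inner-loop iterations are race-free and can run in
// parallel. n phases guarantee a fully sorted array.
void bubbleSortParallel(int arr[], int n) {
    for (int phase = 0; phase < n; ++phase) {
        int first = phase % 2;  // even phase: (0,1),(2,3)...; odd: (1,2),(3,4)...
        #pragma omp parallel for
        for (int i = first; i < n - 1; i += 2)
            if (arr[i] > arr[i + 1])
                std::swap(arr[i], arr[i + 1]);
    }
}
```

Compile with -fopenmp; without it the pragma is ignored and the sort still runs correctly, just serially.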

&lt;h2&gt;
  
  
  Quick Sort Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Graph 2: Quick Sort (qss, qsp, reference)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft17koipnhg4w6xtpcfvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft17koipnhg4w6xtpcfvc.png" alt="Quick Sort Runtime over Input Size" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnommb8d4eyhp5zkwm5y3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnommb8d4eyhp5zkwm5y3.png" alt="Quick Sort Runtime (log) over Input Size" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The graphs compare the execution times of three different quick sort implementations: serial (qss), parallel (qsp with 16 threads), and the reference (STL sort). Surprisingly, both the serial and parallel implementations outperform the std::sort, which can be attributed to key differences in the algorithmic approach and optimizations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Explanation of Graph
&lt;/h3&gt;

&lt;p&gt;For smaller datasets, the serial Quick Sort (qss) performs comparably to the parallel Quick Sort (qsp). As input sizes increase, the parallel version takes advantage of multi-core processing, drastically reducing execution times compared to the serial version. Interestingly, despite the std::sort being highly optimized, both qss and qsp demonstrate faster runtimes. This outcome suggests that the overhead introduced by std::sort's additional optimizations becomes detrimental, especially for the data distributions tested here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Analysis
&lt;/h3&gt;

&lt;p&gt;The slower performance of std::sort in this comparison is likely due to the overhead of its more complex optimizations, such as the 3-partition quicksort and the switch to insertion sort for small partitions. While these optimizations are designed for handling worst-case scenarios and duplicate-heavy datasets, they can add unnecessary overhead when such conditions are not present. In contrast, the simpler Lomuto partition scheme used in qss and qsp proves more efficient for the input sizes and data distributions tested here, leading to better performance.&lt;/p&gt;
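The Lomuto scheme mentioned above, combined with the same depth-limited task pattern as the merge sort example, gives a quicksort sketch along these lines (names and the depth cutoff are assumptions; the repository's qss/qsp may differ):

```cpp
#include <utility>

// Lomuto partition: take the last element as the pivot and sweep
// left to right, growing the "< pivot" prefix.
int partition(int arr[], int lo, int hi) {
    int pivot = arr[hi];
    int i = lo - 1;
    for (int j = lo; j < hi; ++j)
        if (arr[j] < pivot) std::swap(arr[++i], arr[j]);
    std::swap(arr[i + 1], arr[hi]);
    return i + 1;
}

// After partitioning, the two halves are fully independent, so they
// can run as OpenMP tasks just like the merge sort halves.
void quickSortParallel(int arr[], int lo, int hi, int depth = 0) {
    if (lo < hi) {
        int p = partition(arr, lo, hi);

        #pragma omp task shared(arr) if(depth < 3)
        quickSortParallel(arr, lo, p - 1, depth + 1);

        #pragma omp task shared(arr) if(depth < 3)
        quickSortParallel(arr, p + 1, hi, depth + 1);

        #pragma omp taskwait
    }
}
```

Unlike merge sort, the partition happens before the tasks are spawned, so the taskwait is only needed to keep the parent from returning while its subtree is still being sorted.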

&lt;h2&gt;
  
  
  Merge Sort Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Graph 3: Merge Sort (mss, msp, reference)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvz9wysd4jkd4rogeih7l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvz9wysd4jkd4rogeih7l.png" alt="Merge Sort Runtime over Input Size" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vdqpaf08o3o3awx9ws7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vdqpaf08o3o3awx9ws7.png" alt="Merge Sort Runtime (log) over Input Size" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The third graph presents the performance of Merge Sort, comparing the serial version (mss), the parallel version (msp), and the reference (STL sort). Merge Sort is well-suited for parallelization, making it reasonable to expect the parallel version (msp) to outperform the serial counterpart (mss) when tested across dataset sizes. The graph illustrates how the parallel version takes advantage of multi-threading, particularly for larger inputs, while the reference STL sort only performs slightly better than serial merge sort.&lt;/p&gt;

&lt;h3&gt;
  
  
  Explanation of Graph
&lt;/h3&gt;

&lt;p&gt;The parallel Merge Sort (msp) consistently outperforms the serial version (mss) for larger input sizes. By utilizing 16 threads, msp efficiently splits the array and merges the sections concurrently, significantly improving execution time compared to the serial version (mss). However, for smaller datasets, the overhead of managing multiple threads diminishes msp's advantage. On the other hand, STL sort, while not parallelized, leverages highly efficient algorithms, including a 3-partition quicksort, insertion sort, and heap sort fallback, allowing it to perform slightly better than the serial Merge Sort (mss), particularly on small datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Analysis
&lt;/h3&gt;

&lt;p&gt;Merge Sort is highly suited for parallelization, as its divide-and-conquer approach naturally allows independent sections of an array to be merged simultaneously. The parallel Merge Sort (msp) significantly reduces execution time for larger datasets by distributing the workload across 16 threads. While the STL sort benefits from optimized memory management and CPU cache utilization, it cannot leverage parallelism, making it slower than msp on large datasets. However, it still manages to outperform the serial Merge Sort (mss) in small data scenarios due to its efficient hybrid algorithms that avoid the worst-case time complexities typical of quicksort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hyper-threading Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hyper-threading Results
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu80r5cdyscr92fuod165.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu80r5cdyscr92fuod165.png" alt="Bubble Sort Hyperthreaded" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4tjbfde58oihyx23fhq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4tjbfde58oihyx23fhq.png" alt="Bubble Sort Hyperthreaded (log)" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh7a5kqntgyx6hfo42ps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh7a5kqntgyx6hfo42ps.png" alt="Quick Sort Hyperthreaded" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnq7qg5e4rb0zqghn7jz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnq7qg5e4rb0zqghn7jz.png" alt="Quick Sort Hyperthreaded (log)" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik9m7dli2f5ebo64w2n4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik9m7dli2f5ebo64w2n4.png" alt="Merge Sort Hyperthreaded" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38euypgmjise98cr8vh5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38euypgmjise98cr8vh5.png" alt="Merge Sort Hyperthreaded (log)" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The hyper-threading experiment was conducted on all of the above algorithms. With hyper-threading enabled, execution time was measured for varying input sizes. The graphs show that runtimes decrease in every case: all three algorithms perform better with hyper-threading.&lt;/p&gt;

&lt;h3&gt;
  
  
  Explanation of Results
&lt;/h3&gt;

&lt;p&gt;Hyper-threading shows a small performance improvement across input sizes, particularly when the workload is large enough to saturate the logical cores. For smaller datasets, hyper-threading brings minimal benefit, as the threads do not fully utilize the available resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Analysis
&lt;/h3&gt;

&lt;p&gt;Hyper-threading can improve performance by allowing more threads to run simultaneously, but this is only beneficial when the workload is large enough to keep all logical cores busy. When the workload is too small, the overhead of hyper-threading can actually degrade performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The results demonstrate that while parallel implementations of sorting algorithms can provide performance improvements, the benefits are heavily dependent on the size of the dataset and the specific algorithm. For smaller datasets, the overhead of managing threads often negates the benefits of parallelism. However, for larger datasets, parallel versions outperform their serial counterparts. The analysis also shows that hyper-threading can provide performance boosts, but only under certain conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/sahrohit/parallel-sorting.git" rel="noopener noreferrer"&gt;https://github.com/sahrohit/parallel-sorting.git&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sahrohit/parallel-sorting/blob/main/combined.csv" rel="noopener noreferrer"&gt;https://github.com/sahrohit/parallel-sorting/blob/main/combined.csv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>performance</category>
      <category>algorithms</category>
      <category>cpp</category>
      <category>compiling</category>
    </item>
    <item>
      <title>Automated Latex Resume with GitHub Action</title>
      <dc:creator>Rohit Sah</dc:creator>
      <pubDate>Sun, 12 May 2024 17:49:44 +0000</pubDate>
      <link>https://forem.com/sahrohit/automated-latex-resume-with-github-action-e0p</link>
      <guid>https://forem.com/sahrohit/automated-latex-resume-with-github-action-e0p</guid>
      <description>&lt;p&gt;This blog is originally post on &lt;a href="https://blog.sahrohit.com.np/posts/automated-latex-resume" rel="noopener noreferrer"&gt;https://blog.sahrohit.com.np/posts/automated-latex-resume&lt;/a&gt; and distributed across platforms for more reach.&lt;/p&gt;

&lt;p&gt;Resumes are a crucial tool for developers. They are often the first impression you make on a potential employer, so it's important to have a well-formatted and ATS-friendly resume. However, keeping your resume updated across multiple platforms can be a hassle. This blog post explores how to use Latex and GitHub Actions to automate the resume update process.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Finding ATS-friendly resume templates is easy. The challenge lies in keeping your resume updated across various platforms, including your portfolio, GitHub README, and downloadable links. Manually updating each location is time-consuming and error-prone, and quickly leads to a versioning hell of resume copies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf277ighzdgxfatfal9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf277ighzdgxfatfal9z.png" alt="Versioning Hell of my Resume’s" width="711" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Versioning hell of my resumes&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Attempts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Version Control with Git
&lt;/h3&gt;

&lt;p&gt;I've always wanted to use Git to manage versions of my resume, but dealing with .docx files in Git isn't very enjoyable. My breakthrough with Git version control came when I switched to creating my resume in LaTeX. Not only is LaTeX ATS-friendly, but it's also easy to maintain since I can track changes in my Git repository. LaTeX does have a learning curve, but you can start editing in minutes by referring to guides on platforms like &lt;a href="https://www.overleaf.com" rel="noopener noreferrer"&gt;Overleaf&lt;/a&gt;, one of the best online LaTeX editors. This solved one of my problems, but updating links everywhere was still an issue.&lt;/p&gt;

&lt;p&gt;Resume.tex - &lt;a href="https://github.com/sahrohit/sahrohit/blob/master/resume.tex" rel="noopener noreferrer"&gt;https://github.com/sahrohit/sahrohit/blob/master/resume.tex&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Overleaf Link -  &lt;a href="https://www.overleaf.com/read/qxxqjqgxcfzn#5da12c" rel="noopener noreferrer"&gt;https://www.overleaf.com/read/qxxqjqgxcfzn#5da12c&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Actions for Automation
&lt;/h3&gt;

&lt;p&gt;I decided to create a separate GitHub repository to manage my resume, but realized that the profile repository (the one named after my username, whose README shows on my profile) would be the best place to keep it. I used GitHub Actions to automate the build process, generating the PDF and committing it to the same repository. This setup serves two purposes: it verifies that the LaTeX file builds successfully, and it keeps the PDF of my resume in my profile repository. The action runs only when the resume.tex file changes, committing the generated PDF back to the same repository.&lt;/p&gt;

&lt;p&gt;Additionally, the generated PDF is uploaded as an artifact and later downloaded to be committed to the portfolio repository. Currently, my portfolio is built with Next.js and includes the resume in the public folder, making it downloadable via the website.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and Commit Resume&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;master&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;resume.tex&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;compile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Compile resume PDF&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check out Resume Repository&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sahrohit/sahrohit&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run the build process with Docker&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;addnab/docker-run-action@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;thomasweise/docker-texlive-full:latest&lt;/span&gt;
          &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-v ${{ github.workspace }}:/data&lt;/span&gt;
          &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;cd data&lt;/span&gt;
            &lt;span class="s"&gt;pdflatex resume.tex&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload PDF artifact&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/upload-artifact@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;resume-pdf&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;resume.pdf&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Commit and push changes&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stefanzweifel/git-auto-commit-action@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;commit_message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Updated Resume PDF&lt;/span&gt;

  &lt;span class="na"&gt;commit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Commit resume to Portfolio&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;compile&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check out portfolio repo&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sahrohit/portfolio&lt;/span&gt;
          &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ACCESS_TOKEN }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Download PDF artifact&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/download-artifact@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;resume-pdf&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Move PDF to public directory&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;mv resume.pdf public/"Rohit's-Resume.pdf"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Commit and push changes&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stefanzweifel/git-auto-commit-action@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;commit_message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Updated Resume PDF&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Upon changes to the &lt;a href="https://github.com/sahrohit/sahrohit/blob/master/resume.tex" rel="noopener noreferrer"&gt;resume.tex&lt;/a&gt; file in the &lt;a href="https://github.com/sahrohit/sahrohit" rel="noopener noreferrer"&gt;sahrohit/sahrohit&lt;/a&gt; repository, the workflow generates a PDF and commits it to the same repository. It then commits the updated resume to my portfolio (&lt;a href="https://github.com/sahrohit/portfolio" rel="noopener noreferrer"&gt;sahrohit/portfolio&lt;/a&gt;). Since &lt;a href="https://github.com/sahrohit/portfolio" rel="noopener noreferrer"&gt;sahrohit/portfolio&lt;/a&gt; is private, you'll need an ACCESS_TOKEN in the project’s secrets to authorize the commit.&lt;/p&gt;

&lt;p&gt;At the end of the day, what it does is save me a few manual commits and updates. But more than that, it gives me the satisfaction of updating the resume.tex file and initiating the entire process with a single commit. It's fulfilling to see how the resume updates across multiple places. It's these small satisfactions that keep us going and looking forward to another day.&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>automation</category>
      <category>latex</category>
      <category>resume</category>
    </item>
  </channel>
</rss>
