<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Shankar</title>
    <description>The latest articles on Forem by Shankar (@shankar_t).</description>
    <link>https://forem.com/shankar_t</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1034432%2F631b4b74-d02d-4233-abc4-29fa2349f603.jpeg</url>
      <title>Forem: Shankar</title>
      <link>https://forem.com/shankar_t</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/shankar_t"/>
    <language>en</language>
    <item>
      <title>Adding API Gateway to My Cloud Resume</title>
      <dc:creator>Shankar</dc:creator>
      <pubDate>Sun, 12 Apr 2026 05:03:50 +0000</pubDate>
      <link>https://forem.com/shankar_t/adding-api-gateway-to-my-cloud-resume-3bbn</link>
      <guid>https://forem.com/shankar_t/adding-api-gateway-to-my-cloud-resume-3bbn</guid>
      <description>&lt;h1&gt;
  
  
  Five Failures in One Evening: Adding API Gateway to My Cloud Resume
&lt;/h1&gt;

&lt;p&gt;In my &lt;a href="https://medium.com/@tiwarishankart/how-i-built-a-serverless-resume-on-aws-using-terraform" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, I documented migrating my Cloud Resume from ClickOps to Terraform. The system worked: S3 + CloudFront for the frontend, a Lambda Function URL for the visitor counter, DynamoDB for persistence, and GitHub Actions for CI/CD.&lt;/p&gt;

&lt;p&gt;But the Lambda Function URL had a problem. It was a bare endpoint with no throttling, no API key, and no usage tracking. Anyone with the URL could call it a million times and I'd be paying for a million DynamoDB writes.&lt;/p&gt;

&lt;p&gt;I added API Gateway in front of the Lambda. Five things broke.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I added
&lt;/h2&gt;

&lt;p&gt;Three new Terraform modules:&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;api-gateway&lt;/code&gt; module (&lt;code&gt;modules/api-gateway/&lt;/code&gt;) with a REST API exposing GET and POST on &lt;code&gt;/visitors&lt;/code&gt;, both requiring an API key. It includes a usage plan with rate limiting (5 requests/sec, burst of 10), a monthly quota of 10,000 requests, a MOCK integration for CORS preflight, and CloudWatch access logging on the prod stage.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;vpc&lt;/code&gt; module (&lt;code&gt;modules/vpc/&lt;/code&gt;) with a 10.0.0.0/16 network, two public subnets and two private subnets across us-east-1a and us-east-1b. I skipped the NAT Gateway because that's $32/month I don't need yet. This is prep for Phase 2 when I add containers or RDS.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;dns&lt;/code&gt; module (&lt;code&gt;modules/dns/&lt;/code&gt;) for ACM certificate and API Gateway custom domain mapping to &lt;code&gt;api.arlingtonhood21.work&lt;/code&gt;, gated behind a feature flag (&lt;code&gt;enable_custom_domain = false&lt;/code&gt;) since it requires manual DNS validation.&lt;/p&gt;

&lt;p&gt;The frontend JavaScript changed from calling the Lambda Function URL directly to calling the API Gateway with an &lt;code&gt;x-api-key&lt;/code&gt; header.&lt;/p&gt;
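&lt;p&gt;The gate is easy to verify from the terminal. A quick sketch, with the URL and key as placeholders (the real values come from &lt;code&gt;terraform output&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;API_URL="https://EXAMPLE.execute-api.us-east-1.amazonaws.com/prod/visitors"
API_KEY="EXAMPLE-KEY"

# No key: API Gateway answers 403 Forbidden before the Lambda is ever invoked
curl -s -o /dev/null -w "%{http_code}\n" "$API_URL"

# With the key: the request goes through
curl -s -H "x-api-key: $API_KEY" "$API_URL"

# Burst past the limit: after ~10 rapid requests, expect 429 Too Many Requests
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code} " -H "x-api-key: $API_KEY" "$API_URL"
done; echo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;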

&lt;h2&gt;
  
  
  Failure 1: Resources already exist
&lt;/h2&gt;

&lt;p&gt;I pushed everything to main. The backend CI ran &lt;code&gt;terraform apply&lt;/code&gt; and failed with four errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ResourceAlreadyExistsException: CloudWatch Logs log group /aws/apigateway/visitor-api already exists
EntityAlreadyExists: Role with name api-gateway-cloudwatch-role already exists
ResourceConflictException: The statement id (AllowAPIGatewayGetInvoke) provided already exists
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I had created these resources locally with &lt;code&gt;terraform apply&lt;/code&gt; before the CI existed. The CI's import step only covered root-level resources (DynamoDB, Lambda, SNS). The new API Gateway module resources weren't in the import list.&lt;/p&gt;

&lt;p&gt;The fix was adding six &lt;code&gt;terraform import&lt;/code&gt; commands for the module resources: the CloudWatch log group, IAM role, role policy attachment, both Lambda permissions, and the API Gateway account.&lt;/p&gt;
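&lt;p&gt;The shape of those commands, sketched with hypothetical resource addresses (the module paths and Lambda function name are illustrative; the import ID formats are the documented ones for each resource type):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Resource addresses below are illustrative; match them to your own module
terraform import module.api_gateway.aws_cloudwatch_log_group.api_gw /aws/apigateway/visitor-api
terraform import module.api_gateway.aws_iam_role.cloudwatch api-gateway-cloudwatch-role

# Attachments import as role-name/policy-arn
terraform import module.api_gateway.aws_iam_role_policy_attachment.cloudwatch \
  api-gateway-cloudwatch-role/arn:aws:iam::aws:policy/service-role/AmazonAPIGatewayPushToCloudWatchLogs

# Lambda permissions import as function-name/statement-id
terraform import module.api_gateway.aws_lambda_permission.get visitor-counter/AllowAPIGatewayGetInvoke
terraform import module.api_gateway.aws_lambda_permission.post visitor-counter/AllowAPIGatewayPostInvoke

# The account-level API Gateway settings import under a fixed placeholder ID
terraform import module.api_gateway.aws_api_gateway_account.this api-gateway-account
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;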

&lt;h2&gt;
  
  
  Failure 2: CI didn't trigger
&lt;/h2&gt;

&lt;p&gt;I pushed the import fix to &lt;code&gt;.github/workflows/backend-cicd.yml&lt;/code&gt;. Nothing happened.&lt;/p&gt;

&lt;p&gt;The workflow trigger only watched &lt;code&gt;resume-backend/**&lt;/code&gt;. The workflow file itself lives at &lt;code&gt;.github/workflows/backend-cicd.yml&lt;/code&gt;, outside that path. GitHub Actions path filters are literal glob matches. If the file you change doesn't match the path filter, the workflow doesn't run.&lt;/p&gt;

&lt;p&gt;I added the workflow file to its own trigger and threw in &lt;code&gt;workflow_dispatch&lt;/code&gt; for manual runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;resume-backend/**'&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.github/workflows/backend-cicd.yml'&lt;/span&gt;
&lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Failure 3: New API Gateway, old URL
&lt;/h2&gt;

&lt;p&gt;After the CI succeeded, the visitor counter showed "--" instead of a number.&lt;/p&gt;

&lt;p&gt;The import step had only imported resources with globally unique identifiers (IAM roles, CloudWatch log groups, Lambda permissions). The REST API itself, its methods, integrations, stages, and deployment weren't imported because they don't have globally unique names. Terraform created a brand new API Gateway with a different ID.&lt;/p&gt;

&lt;p&gt;My frontend was still pointing at the old URL. And Terraform had updated the Lambda permissions to reference the new API Gateway, so the old URL lost its ability to invoke the Lambda. Both URLs were broken.&lt;/p&gt;

&lt;p&gt;I pulled the new URL and API key from &lt;code&gt;terraform output&lt;/code&gt;, updated &lt;code&gt;index.js&lt;/code&gt;, pushed, and invalidated the CloudFront cache.&lt;/p&gt;

&lt;p&gt;Still "--".&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure 4: CORS preflight mismatch
&lt;/h2&gt;

&lt;p&gt;I tested with curl and got &lt;code&gt;{"count": 162}&lt;/code&gt; back. The API worked. The browser was blocking it.&lt;/p&gt;

&lt;p&gt;I sent an OPTIONS request mimicking the browser's preflight check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-D-&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; OPTIONS &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://pb5rav4teh.execute-api.us-east-1.amazonaws.com/prod/visitors"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Origin: https://shankar-resume.arlingtonhood21.work"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Access-Control-Request-Method: POST"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;Access-Control-Allow-Origin: https://arlingtonhood21.work
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My site loads from &lt;code&gt;https://shankar-resume.arlingtonhood21.work&lt;/code&gt;. The browser does an exact string match. &lt;code&gt;arlingtonhood21.work&lt;/code&gt; does not equal &lt;code&gt;shankar-resume.arlingtonhood21.work&lt;/code&gt;. Preflight rejected, POST blocked.&lt;/p&gt;

&lt;p&gt;The problem was in how API Gateway handles CORS. The actual POST request flows through the Lambda proxy integration, where my Python code checks the &lt;code&gt;Origin&lt;/code&gt; header dynamically and returns the matching origin. But the OPTIONS preflight never reaches the Lambda. It hits a MOCK integration that returns a hardcoded, static value. I had set that static value to the apex domain instead of the subdomain.&lt;/p&gt;
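&lt;p&gt;If you're unsure what the MOCK integration is actually returning, you can read it straight off the deployed API (the IDs below are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Shows the static responseParameters for the OPTIONS preflight,
# including the hardcoded Access-Control-Allow-Origin value
aws apigateway get-integration-response \
  --rest-api-id EXAMPLE123 \
  --resource-id abc123 \
  --http-method OPTIONS \
  --status-code 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;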

&lt;p&gt;Curl doesn't send preflight requests, which is why it worked from the terminal. Browsers always send OPTIONS first for cross-origin POST requests with custom headers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure 5: The state kept disappearing
&lt;/h2&gt;

&lt;p&gt;I fixed the CORS config, pushed, and the CI created a &lt;em&gt;third&lt;/em&gt; API Gateway. &lt;code&gt;Apply complete! Resources: 35 added, 4 changed, 4 destroyed.&lt;/code&gt; 35 new resources for a one-line change. Something was destroying the Terraform state between runs.&lt;/p&gt;

&lt;p&gt;I checked S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;aws s3 &lt;span class="nb"&gt;ls &lt;/span&gt;s3://shankar-resume-2025/ &lt;span class="nt"&gt;--recursive&lt;/span&gt;
2026-04-11  404.html
2026-04-11  favicon.svg
2026-04-11  index.html
2026-04-11  index.js
2026-04-11  style.css
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No &lt;code&gt;resume-backend/terraform.tfstate&lt;/code&gt;. Gone.&lt;/p&gt;

&lt;p&gt;My Terraform backend stores state in the same S3 bucket that hosts the frontend. The frontend CI pipeline runs &lt;code&gt;aws s3 sync . s3://bucket --delete&lt;/code&gt; on every push. That &lt;code&gt;--delete&lt;/code&gt; flag removes any S3 object not present in the source directory. The source directory has five HTML/CSS/JS files. It does not have &lt;code&gt;resume-backend/terraform.tfstate&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Every frontend deploy deleted the Terraform state. Every backend CI run started from zero, imported a subset of resources, and created everything else from scratch. That's why the API Gateway URL kept changing.&lt;/p&gt;

&lt;p&gt;The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;aws s3 sync . s3://${{ secrets.AWS_S3_BUCKET_NAME }} --delete --exclude "resume-backend/*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After adding the exclude flag, I triggered one final backend CI run, updated the frontend with the fourth and final API Gateway URL, and confirmed the state file survived the next frontend deploy.&lt;/p&gt;
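&lt;p&gt;One habit worth adopting after this: run destructive syncs with &lt;code&gt;--dryrun&lt;/code&gt; first. It prints every upload and delete the command &lt;em&gt;would&lt;/em&gt; perform without touching the bucket:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Prints "(dryrun) delete: s3://..." for anything --delete would remove
aws s3 sync . s3://shankar-resume-2025 --delete --exclude "resume-backend/*" --dryrun
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;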

&lt;h2&gt;
  
  
  What I'd change next time
&lt;/h2&gt;

&lt;p&gt;Terraform state and application assets should not share a bucket. I stored infrastructure state in the same S3 bucket as the website. One &lt;code&gt;--delete&lt;/code&gt; flag on a sync command was all it took to wipe the state on every deploy. If I were starting over, the state bucket would be its own resource with no other purpose.&lt;/p&gt;
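&lt;p&gt;Standing up that dedicated bucket is two CLI calls, and with versioning on, even a bad delete of the state file is recoverable (bucket name illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3api create-bucket --bucket my-tfstate-bucket --region us-east-1

# Versioning keeps every prior revision of terraform.tfstate
aws s3api put-bucket-versioning --bucket my-tfstate-bucket \
  --versioning-configuration Status=Enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;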

&lt;p&gt;CORS on API Gateway has two paths, and they don't share configuration. The Lambda handles CORS dynamically for actual requests. The MOCK integration returns static headers for preflight. If those two don't agree on the allowed origin, the browser blocks everything. Curl won't catch this because it skips preflight entirely. I should have tested with browser DevTools instead of curl. A 200 from curl tells you nothing about whether a browser can reach your API.&lt;/p&gt;

&lt;p&gt;The import-on-every-run pattern is a workaround, not a design. It exists because I deployed manually before CI existed. If the first deploy had gone through CI, I would never have needed imports. For new projects: set up CI first, then deploy through it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current state
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser (shankar-resume.arlingtonhood21.work)
  |
  +-- Static files: CloudFront -&amp;gt; S3
  |
  +-- Visitor API: POST /prod/visitors
        |
        API Gateway (EDGE, api-key-required)
          Rate limit: 5 req/sec, burst 10
          Monthly quota: 10,000
          CloudWatch access logging
          |
          Lambda (Python 3.9) -&amp;gt; DynamoDB

Kill Switch:
  AWS Budget ($2/mo) -&amp;gt; SNS -&amp;gt; Lambda -&amp;gt; disables CloudFront
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three API Gateways were created and destroyed in the process. The fourth one stuck.&lt;/p&gt;

&lt;p&gt;Live site: &lt;a href="https://shankar-resume.arlingtonhood21.work" rel="noopener noreferrer"&gt;shankar-resume.arlingtonhood21.work&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>apigateway</category>
      <category>terraform</category>
      <category>devops</category>
    </item>
    <item>
      <title>My K3s Pi Cluster Died After a Reboot: A Troubleshooting Story</title>
      <dc:creator>Shankar</dc:creator>
      <pubDate>Thu, 30 Oct 2025 17:11:36 +0000</pubDate>
      <link>https://forem.com/shankar_t/my-k3s-pi-cluster-died-after-a-reboot-a-troubleshooting-war-story-m93</link>
      <guid>https://forem.com/shankar_t/my-k3s-pi-cluster-died-after-a-reboot-a-troubleshooting-war-story-m93</guid>
      <description>&lt;p&gt;I have a Raspberry Pi homelab running k3s, all managed perfectly with FluxCD and SOPS for secrets. It was stable for weeks.&lt;/p&gt;

&lt;p&gt;Then, I had to reboot my router.&lt;/p&gt;

&lt;p&gt;When it came back up, my Pi was assigned a new IP address (it went from &lt;code&gt;192.168.1.9&lt;/code&gt; to &lt;code&gt;192.168.1.10&lt;/code&gt;). Suddenly, my entire cluster was gone.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;kubectl get nodes&lt;/code&gt; from my laptop gave me the dreaded:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The connection to the server 192.168.1.9:6443 was refused...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ah, I thought. "Easy fix." I updated my &lt;code&gt;~/.kube/config&lt;/code&gt; to point to the new IP, &lt;code&gt;192.168.1.10&lt;/code&gt;.&lt;/p&gt;
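&lt;p&gt;As an aside, &lt;code&gt;kubectl&lt;/code&gt; can make that edit for you; k3s names its cluster &lt;code&gt;default&lt;/code&gt; in the kubeconfig it generates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Point the existing kubeconfig entry at the Pi's new address
kubectl config set-cluster default --server=https://192.168.1.10:6443
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;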

&lt;p&gt;I ran &lt;code&gt;kubectl get nodes&lt;/code&gt; again... and got the &lt;em&gt;same error&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The connection to the server 192.168.1.10:6443 was refused...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This meant the &lt;code&gt;k3s&lt;/code&gt; service itself wasn't running on the Pi. This post is the story of the troubleshooting journey that followed, and the &lt;em&gt;three&lt;/em&gt; major "fixes" it took to get it all back online.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: The Crash Loop (On the Pi)
&lt;/h2&gt;

&lt;p&gt;I SSH'd into the Pi to see what was wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh shankarpi@192.168.1.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, I checked the service status. This is the #1 thing to do.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status k3s.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The service was in a permanent crash loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;● k3s.service - Lightweight Kubernetes
     Active: activating &lt;span class="o"&gt;(&lt;/span&gt;auto-restart&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;Result: exit-code&lt;span class="o"&gt;)&lt;/span&gt; ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means k3s is starting, failing, and systemd is trying to restart it over and over. Time to check the logs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; k3s.service &lt;span class="nt"&gt;-f&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And there it was. The first smoking gun:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;level=fatal msg="Failed to start networking: unable to initialize network policy controller: error getting node subnet: failed to find interface with specified node ip"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This was a double-whammy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;K3s was starting before the &lt;code&gt;wlan0&lt;/code&gt; (Wi-Fi) interface had time to connect and get its &lt;code&gt;192.168.1.10&lt;/code&gt; IP. This is a classic race condition on reboot.&lt;/li&gt;
&lt;li&gt;K3s was still configured internally to use the old IP.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Part 2: The Service File Fixes
&lt;/h2&gt;

&lt;p&gt;The fix was to edit the systemd service file to (1) wait for the network and (2) force k3s to use the new IP for everything.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the Pi&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/systemd/system/k3s.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I made four critical changes to the &lt;code&gt;[Service]&lt;/code&gt; section:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Added &lt;code&gt;ExecStartPre&lt;/code&gt;:&lt;/strong&gt; forces systemd to wait until the &lt;code&gt;wlan0&lt;/code&gt; interface actually has the IP address &lt;code&gt;192.168.1.10&lt;/code&gt; before trying to start k3s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Added &lt;code&gt;--node-ip&lt;/code&gt;:&lt;/strong&gt; tells k3s what IP to use internally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Added &lt;code&gt;--node-external-ip&lt;/code&gt;:&lt;/strong&gt; tells k3s what IP to advertise externally (this was the fix for the IP conflict).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Added &lt;code&gt;--flannel-iface&lt;/code&gt;:&lt;/strong&gt; tells the flannel CNI which network interface to use.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;[Service]&lt;/code&gt; section now looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;Service]
...
&lt;span class="nv"&gt;Restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;always
&lt;span class="nv"&gt;RestartSec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5s

&lt;span class="c"&gt;# FIX 1: Wait for the wlan0 interface to have the correct IP&lt;/span&gt;
&lt;span class="nv"&gt;ExecStartPre&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'while ! ip addr show wlan0 | grep -q "inet 192.168.1.10"; do sleep 1; done'&lt;/span&gt;

&lt;span class="c"&gt;# FIX 2: Hard-code the new IP and interface for k3s&lt;/span&gt;
&lt;span class="nv"&gt;ExecStart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/bin/k3s &lt;span class="se"&gt;\&lt;/span&gt;
    server &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--node-ip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;192.168.1.10 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--node-external-ip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;192.168.1.10 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--flannel-iface&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;wlan0
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I reloaded systemd and restarted the service, full of confidence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart k3s.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;...and it still went into a crash loop.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: The "Aha!" Moment (The Corrupted Database)
&lt;/h2&gt;

&lt;p&gt;I was stumped. The service file was perfect. The IP was correct. The Pi was waiting for the network. Why was it still crashing?&lt;/p&gt;

&lt;p&gt;I watched the logs again (&lt;code&gt;sudo journalctl -u k3s.service -f&lt;/code&gt;) and saw something I'd missed.&lt;/p&gt;

&lt;p&gt;The service would start, run for about 15 seconds... and then crash. In that 15-second window, I saw this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I1030 20:37:56 ... "Successfully retrieved node IP(s)" IPs=["192.168.1.9"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It was still finding the old IP!&lt;/p&gt;

&lt;p&gt;This was the "Aha!" moment. The k3s.service config flags were correct, but k3s was loading its old database, which was still full of references to the old .9 IP (like for the Traefik load balancer). It was seeing a conflict between its new config (.10) and its old database (.9), and crashing.&lt;/p&gt;

&lt;p&gt;The database was corrupted with stale data.&lt;/p&gt;

&lt;p&gt;The Real Fix: Nuke the database and let k3s rebuild it from scratch using the new, correct config.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the Pi&lt;/span&gt;

&lt;span class="c"&gt;# 1. Stop k3s&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop k3s.service

&lt;span class="c"&gt;# 2. Delete the old, corrupted database&lt;/span&gt;
&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/rancher/k3s/server/db/

&lt;span class="c"&gt;# 3. Start k3s&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start k3s.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I checked the status one more time, and...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;● k3s.service - Lightweight Kubernetes
     Active: active &lt;span class="o"&gt;(&lt;/span&gt;running&lt;span class="o"&gt;)&lt;/span&gt; since Thu 2025-10-30 20:55:03 IST&lt;span class="p"&gt;;&lt;/span&gt; 1min 9s ago
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It was stable. It worked. The cluster was back.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: The GitOps Restoration (Flux is Gone!)
&lt;/h2&gt;

&lt;p&gt;I went back to my Mac. &lt;code&gt;kubectl get nodes&lt;/code&gt; worked!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;NAME        STATUS   ROLES                  AGE   VERSION
shankarpi   Ready    control-plane,master   1m    v1.33.5+k3s1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But when I ran &lt;code&gt;flux get kustomizations&lt;/code&gt;, I got a new error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;✗ unable to retrieve the &lt;span class="nb"&gt;complete &lt;/span&gt;list of server APIs: kustomize.toolkit.fluxcd.io/v1: no matches &lt;span class="k"&gt;for&lt;/span&gt; ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course. When I deleted the database, I deleted everything—including the FluxCD installation and all its API definitions (CRDs).&lt;/p&gt;

&lt;p&gt;The cluster was healthy, but empty.&lt;/p&gt;

&lt;p&gt;Luckily, with a GitOps setup, this is the easiest fix in the world. I just had to re-bootstrap Flux.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On my Mac&lt;/span&gt;

&lt;span class="c"&gt;# 1. Set my GitHub Token&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ghp_..."&lt;/span&gt;

&lt;span class="c"&gt;# 2. Re-run the bootstrap command&lt;/span&gt;
flux bootstrap github &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--owner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tiwari91 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--repository&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pi-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./clusters/staging &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--personal&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This re-installed Flux, and it immediately started trying to deploy my apps.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 5: The Final "Gotcha" (The Missing SOPS Secret)
&lt;/h2&gt;

&lt;p&gt;I was so close. I ran &lt;code&gt;flux get kustomizations&lt;/code&gt; one last time. This is what I saw:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;NAME                READY   MESSAGE
apps                False   decryption failed &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="s1"&gt;'tunnel-credentials'&lt;/span&gt;: ...
flux-system         True    Applied revision: main@sha1:784af83f
infra...            False   decryption failed &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="s1"&gt;'renovate-container-env'&lt;/span&gt;: ...
monitoring-configs  False   decryption failed &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="s1"&gt;'grafana-tls-secret'&lt;/span&gt;: ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My &lt;code&gt;flux-system&lt;/code&gt; was running, but all my other apps were failing with &lt;code&gt;decryption failed&lt;/code&gt;. Why?&lt;/p&gt;

&lt;p&gt;When I reset the cluster, I also deleted the &lt;code&gt;sops-age&lt;/code&gt; secret that Flux uses to decrypt my files.&lt;/p&gt;

&lt;p&gt;The solution was to put that secret back.&lt;/p&gt;

&lt;p&gt;On my Mac, I deleted the (possibly stale) secret just in case.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete secret sops-age &lt;span class="nt"&gt;-n&lt;/span&gt; flux-system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I re-created the secret from my local private key file (mine was named &lt;code&gt;age.agekey&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;age.agekey | kubectl create secret generic sops-age &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;flux-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;age.agekey&lt;span class="o"&gt;=&lt;/span&gt;/dev/stdin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I told Flux to try one last time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux reconcile kustomization apps &lt;span class="nt"&gt;--with-source&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Success! Flux found the key, decrypted the manifests, and all my namespaces and pods (linkding, audiobookshelf, monitoring) started spinning up.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR: The 3-Step Fix for a Dead k3s Pi
&lt;/h2&gt;

&lt;p&gt;If your k3s Pi cluster dies after an IP change:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix &lt;code&gt;k3s.service&lt;/code&gt;:&lt;/strong&gt; SSH into the Pi. Edit &lt;code&gt;/etc/systemd/system/k3s.service&lt;/code&gt; to add the &lt;code&gt;ExecStartPre&lt;/code&gt; line to wait for your network, and add the &lt;code&gt;--node-ip&lt;/code&gt;, &lt;code&gt;--node-external-ip&lt;/code&gt;, and &lt;code&gt;--flannel-iface&lt;/code&gt; flags with your new static IP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reset the Database:&lt;/strong&gt; The old IP is still in the database. Stop k3s, delete the DB, and restart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop k3s.service
&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/rancher/k3s/server/db/
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start k3s.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Restore GitOps:&lt;/strong&gt; Your cluster is now empty.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run &lt;code&gt;flux bootstrap ...&lt;/code&gt; again to re-install Flux.&lt;/li&gt;
&lt;li&gt;Re-create your &lt;code&gt;sops-age&lt;/code&gt; secret: &lt;code&gt;cat age.agekey | kubectl create secret generic sops-age -n flux-system ...&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Force a reconcile: &lt;code&gt;flux reconcile kustomization apps --with-source&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
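&lt;p&gt;To watch the recovery converge, Flux can stream status until every kustomization reports ready:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Re-renders the status table as each kustomization moves to Ready
flux get kustomizations --watch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;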

&lt;p&gt;And just like that, my cluster was back from the dead.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>raspberrypi</category>
      <category>gitops</category>
      <category>flux</category>
    </item>
    <item>
      <title>From Zero to GitOps: Building a k3s Homelab on a Raspberry Pi with Flux &amp; SOPS</title>
      <dc:creator>Shankar</dc:creator>
      <pubDate>Thu, 30 Oct 2025 11:14:12 +0000</pubDate>
      <link>https://forem.com/shankar_t/from-zero-to-gitops-building-a-k3s-homelab-on-a-raspberry-pi-with-flux-sops-55b7</link>
      <guid>https://forem.com/shankar_t/from-zero-to-gitops-building-a-k3s-homelab-on-a-raspberry-pi-with-flux-sops-55b7</guid>
      <description>&lt;p&gt;This post documents the end-to-end process for setting up a &lt;code&gt;k3s&lt;/code&gt; Kubernetes cluster on a Raspberry Pi, managing it remotely from a Mac, and deploying applications securely using GitOps with FluxCD and SOPS encryption. We'll cover everything from OS install to deploying encrypted secrets and tackling common troubleshooting hurdles.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Initial Pi Setup &amp;amp; OS Installation
&lt;/h2&gt;

&lt;p&gt;This phase covers preparing the Raspberry Pi hardware and operating system.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Install OS:&lt;/strong&gt; Use the &lt;strong&gt;Raspberry Pi Imager&lt;/strong&gt; to write &lt;strong&gt;Raspberry Pi OS (64-BIT)&lt;/strong&gt; to an SD card.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;OS Configuration:&lt;/strong&gt; In the Imager's advanced settings, pre-configure:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hostname:&lt;/strong&gt; &lt;code&gt;k3s-node&lt;/code&gt; (or your preferred name)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Username and Password:&lt;/strong&gt; e.g., &lt;code&gt;pi-admin&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wireless LAN:&lt;/strong&gt; Your Wi-Fi SSID and password.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Set a Static IP:&lt;/strong&gt; To ensure a stable connection, set a &lt;strong&gt;DHCP Reservation&lt;/strong&gt; for the Pi in your home router's settings, linking the Pi's MAC address to a specific IP (e.g., &lt;code&gt;192.168.1.100&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  2. Kubernetes (k3s) Installation
&lt;/h2&gt;

&lt;p&gt;We installed &lt;code&gt;k3s&lt;/code&gt;, a lightweight Kubernetes distribution, directly onto the Pi's operating system.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Enable Cgroups (Critical Fix):&lt;/strong&gt; The &lt;code&gt;k3s&lt;/code&gt; service will crash on startup without this Linux kernel feature.

&lt;ul&gt;
&lt;li&gt;SSH into the Pi: &lt;code&gt;ssh pi-admin@k3s-node.local&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Edit the boot config file: &lt;code&gt;sudo nano /boot/firmware/cmdline.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;cgroup_memory=1 cgroup_enable=memory&lt;/code&gt; to the end of the single line in the file.&lt;/li&gt;
&lt;li&gt;Save (&lt;code&gt;Ctrl+X&lt;/code&gt;, &lt;code&gt;Y&lt;/code&gt;, &lt;code&gt;Enter&lt;/code&gt;) and reboot: &lt;code&gt;sudo reboot&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Install k3s:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the Pi&lt;/span&gt;
curl &lt;span class="nt"&gt;-sfL&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;https://get.k3s.io]&lt;span class="o"&gt;(&lt;/span&gt;https://get.k3s.io&lt;span class="o"&gt;)&lt;/span&gt; | sh -
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify Service:&lt;/strong&gt; Ensure the &lt;code&gt;k3s&lt;/code&gt; service is stable and running.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the Pi&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status k3s.service
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;The output must show &lt;code&gt;Active: active (running)&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
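&lt;p&gt;The &lt;code&gt;cmdline.txt&lt;/code&gt; edit in step 1 can also be scripted instead of done in &lt;code&gt;nano&lt;/code&gt;. Below is a minimal sketch, assuming GNU &lt;code&gt;sed&lt;/code&gt;; &lt;code&gt;CMDLINE&lt;/code&gt; is a placeholder that defaults to a local demo file here, so on the Pi you would point it at &lt;code&gt;/boot/firmware/cmdline.txt&lt;/code&gt; and run the edit with &lt;code&gt;sudo&lt;/code&gt;:&lt;/p&gt;

```shell
# Append the cgroup flags to the single-line boot config, idempotently.
# CMDLINE is a stand-in; the demo file created below only illustrates the edit.
CMDLINE="${CMDLINE:-./cmdline.txt}"
[ -f "$CMDLINE" ] || echo 'console=tty1 root=PARTUUID=example rootfstype=ext4 rootwait' > "$CMDLINE"
# Only append if the flags are not already present, so re-running is safe
grep -q 'cgroup_enable=memory' "$CMDLINE" || \
  sed -i '$ s/$/ cgroup_memory=1 cgroup_enable=memory/' "$CMDLINE"
cat "$CMDLINE"
```

&lt;p&gt;The &lt;code&gt;grep&lt;/code&gt; guard keeps the file on one line and makes the edit safe to re-run.&lt;/p&gt;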




&lt;h2&gt;
  
  
  3. Remote Management from macOS
&lt;/h2&gt;

&lt;p&gt;To manage the Pi's cluster from your Mac, you need to copy its configuration.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Update Kubeconfig File:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;On the Pi, copy the config: &lt;code&gt;sudo cat /etc/rancher/k3s/k3s.yaml &amp;gt; k3s_config.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Edit the file (&lt;code&gt;nano k3s_config.yaml&lt;/code&gt;) and change the &lt;code&gt;server&lt;/code&gt; address from &lt;code&gt;https://127.0.0.1:6443&lt;/code&gt; to the Pi's static IP (e.g., &lt;code&gt;https://192.168.1.100:6443&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Copy to Mac:&lt;/strong&gt; From your Mac's terminal, copy the file to your local kubeconfig location. &lt;strong&gt;Warning:&lt;/strong&gt; This overwrites your default config. If you manage multiple clusters, merge this file's contents manually.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scp pi-admin@k3s-node.local:~/k3s_config.yaml ~/.kube/config
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Test Connection:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;You should see your Pi node (&lt;code&gt;k3s-node&lt;/code&gt;) listed.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
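&lt;p&gt;The &lt;code&gt;server&lt;/code&gt; address edit in step 1 can likewise be done non-interactively. A sketch, assuming GNU &lt;code&gt;sed&lt;/code&gt;; &lt;code&gt;K3S_CFG&lt;/code&gt; and &lt;code&gt;PI_IP&lt;/code&gt; are placeholders, and the demo file created below only mimics the relevant line of &lt;code&gt;k3s.yaml&lt;/code&gt;:&lt;/p&gt;

```shell
# Rewrite the API server address in a copied kubeconfig non-interactively.
K3S_CFG="${K3S_CFG:-./k3s_config.yaml}"
PI_IP="${PI_IP:-192.168.1.100}"   # your Pi's static IP
# Demo stand-in for the copied /etc/rancher/k3s/k3s.yaml (illustration only)
[ -f "$K3S_CFG" ] || printf 'clusters:\n- cluster:\n    server: https://127.0.0.1:6443\n' > "$K3S_CFG"
sed -i "s|https://127\.0\.0\.1:6443|https://${PI_IP}:6443|" "$K3S_CFG"
grep 'server:' "$K3S_CFG"
```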




&lt;h2&gt;
  
  
  4. GitOps Setup with Flux &amp;amp; SOPS
&lt;/h2&gt;

&lt;p&gt;This phase automates deployments and configures secret encryption.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Bootstrap Flux:&lt;/strong&gt; Install Flux on the cluster and configure it to watch your Git repository.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On your Mac&lt;/span&gt;
&lt;span class="c"&gt;# (Ensure GITHUB_USER is set in your env)&lt;/span&gt;
flux bootstrap github &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--owner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GITHUB_USER&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--repository&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pi-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./clusters/staging &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--personal&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generate &lt;code&gt;age&lt;/code&gt; Keypair:&lt;/strong&gt; Create a new keypair for encryption.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On your Mac&lt;/span&gt;
age-keygen &lt;span class="nt"&gt;-o&lt;/span&gt; age.agekey
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;This creates &lt;code&gt;age.agekey&lt;/code&gt; (your private key) and shows your public key (starts &lt;code&gt;age1...&lt;/code&gt;). Keep the private key safe!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add Private Key to Cluster:&lt;/strong&gt; Create a Kubernetes secret in the &lt;code&gt;flux-system&lt;/code&gt; namespace containing your private key. This allows Flux's controllers to decrypt files.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On your Mac&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;age.agekey | kubectl create secret generic sops-age &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;flux-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;age.agekey&lt;span class="o"&gt;=&lt;/span&gt;/dev/stdin
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure SOPS Rules:&lt;/strong&gt; Create a &lt;code&gt;.sops.yaml&lt;/code&gt; file in &lt;code&gt;clusters/staging/&lt;/code&gt; to tell SOPS which public key to use for encrypting files.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In clusters/staging/.sops.yaml&lt;/span&gt;
&lt;span class="na"&gt;creation_rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path_regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.*.yaml&lt;/span&gt;
    &lt;span class="na"&gt;encrypted_regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;^(data|stringData)$&lt;/span&gt;
    &lt;span class="na"&gt;age&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;PASTE_YOUR_PUBLIC_AGE_KEY_HERE&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure Flux for Decryption:&lt;/strong&gt; Edit &lt;code&gt;clusters/staging/flux-system/kustomization.yaml&lt;/code&gt; to tell Flux to use the &lt;code&gt;sops-age&lt;/code&gt; secret.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In clusters/staging/flux-system/kustomization.yaml&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
  &lt;span class="na"&gt;decryption&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sops&lt;/span&gt;
    &lt;span class="na"&gt;secretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sops-age&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commit &amp;amp; Push&lt;/strong&gt; your new &lt;code&gt;.sops.yaml&lt;/code&gt; and modified &lt;code&gt;kustomization.yaml&lt;/code&gt; to Git.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
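&lt;p&gt;When filling in the &lt;code&gt;age:&lt;/code&gt; field of &lt;code&gt;.sops.yaml&lt;/code&gt;, you can pull the public recipient back out of the key file instead of retyping it. A sketch, assuming the standard &lt;code&gt;age-keygen&lt;/code&gt; file format; the demo key written below is fake and exists only so the snippet runs on its own:&lt;/p&gt;

```shell
# age-keygen stores the public key as a comment line: "# public key: age1..."
AGE_KEY_FILE="${AGE_KEY_FILE:-./age.agekey}"
# Fake demo key so the snippet is self-contained (never commit a real one!)
[ -f "$AGE_KEY_FILE" ] || printf '# created: 2024-01-01\n# public key: age1exampledemo\nAGE-SECRET-KEY-1EXAMPLE\n' > "$AGE_KEY_FILE"
PUBKEY=$(grep 'public key:' "$AGE_KEY_FILE" | awk '{print $NF}')
echo "$PUBKEY"
```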




&lt;h2&gt;
  
  
  5. Deploying Encrypted Secrets via GitOps
&lt;/h2&gt;

&lt;p&gt;This is the process of creating encrypted secret files and adding them to your Git repository for Flux to deploy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Deploy the Cloudflare Secret
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generate Secret YAML:&lt;/strong&gt; Create a YAML manifest from your Cloudflare credential file (&lt;code&gt;&amp;lt;tunnel_id&amp;gt;.json&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret generic tunnel-credentials &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;credentials.json&lt;span class="o"&gt;=&lt;/span&gt;./&amp;lt;tunnel_id&amp;gt;.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; &amp;lt;your-app-namespace&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;client &lt;span class="nt"&gt;-o&lt;/span&gt; yaml &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; cloudflare-secret.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Encrypt the File:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sops &lt;span class="nt"&gt;--config&lt;/span&gt; clusters/staging/.sops.yaml &lt;span class="nt"&gt;--encrypt&lt;/span&gt; &lt;span class="nt"&gt;--in-place&lt;/span&gt; cloudflare-secret.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Move and Rename:&lt;/strong&gt; Move the encrypted secret into your application's directory (e.g., &lt;code&gt;apps/base/linkding/secret-cloudflare.sops.yaml&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 2: Deploy the Linkding Superuser Secret
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generate Secret YAML:&lt;/strong&gt; Create a secret with the environment variables &lt;code&gt;linkding&lt;/code&gt; expects.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret generic linkding-superuser &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;LD_SUPERUSER_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-user &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;LD_SUPERUSER_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YourSecurePassword &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; &amp;lt;your-app-namespace&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;client &lt;span class="nt"&gt;-o&lt;/span&gt; yaml &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; secret-superuser.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Encrypt and Move:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sops &lt;span class="nt"&gt;--config&lt;/span&gt; clusters/staging/.sops.yaml &lt;span class="nt"&gt;--encrypt&lt;/span&gt; &lt;span class="nt"&gt;--in-place&lt;/span&gt; secret-superuser.yaml
&lt;span class="nb"&gt;mv &lt;/span&gt;secret-superuser.yaml apps/base/linkding/secret-superuser.sops.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Update &lt;code&gt;kustomization.yaml&lt;/code&gt;:&lt;/strong&gt; Edit &lt;code&gt;apps/base/linkding/kustomization.yaml&lt;/code&gt; to tell Flux to deploy these new secret files.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;namespace.yaml&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deployment.yaml&lt;/span&gt;
  &lt;span class="c1"&gt;# ... other resources&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;secret-cloudflare.sops.yaml&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;secret-superuser.sops.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Update &lt;code&gt;deployment.yaml&lt;/code&gt;:&lt;/strong&gt; Modify &lt;code&gt;apps/base/linkding/deployment.yaml&lt;/code&gt; to use the superuser secret.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In deployment.yaml, inside the container spec:&lt;/span&gt;
&lt;span class="na"&gt;envFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;linkding-superuser&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;
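&lt;p&gt;The &lt;code&gt;--dry-run=client -o yaml&lt;/code&gt; output stores values base64-encoded under &lt;code&gt;data:&lt;/code&gt;. Before encrypting, you can spot-check that a value decodes to what you expect; the string below is just the example password from step 1:&lt;/p&gt;

```shell
# base64 round-trip check for a Secret data value
echo 'WW91clNlY3VyZVBhc3N3b3Jk' | base64 -d; echo
# prints: YourSecurePassword
```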

&lt;h3&gt;
  
  
  Final Step: Commit and Reconcile
&lt;/h3&gt;

&lt;p&gt;After adding the files and updating the Kustomizations, commit everything to Git. Flux will automatically sync the changes, decrypt the secrets, and deploy them to your cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git add &lt;span class="nb"&gt;.&lt;/span&gt;
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Add encrypted secrets for Cloudflare and Linkding"&lt;/span&gt;
git push origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  6. Troubleshooting Common Issues
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;ImagePullBackOff&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Kubernetes can't download the container image. You see this status when you run &lt;code&gt;kubectl get pods&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cause 1: Wrong Architecture.&lt;/strong&gt; You're trying to run an &lt;code&gt;amd64&lt;/code&gt; (standard PC/server) image on your &lt;code&gt;arm64&lt;/code&gt; Raspberry Pi.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Solution 1:&lt;/strong&gt; Find a multi-arch image or an &lt;code&gt;arm64&lt;/code&gt;/&lt;code&gt;aarch64&lt;/code&gt; specific version. Look for tags like &lt;code&gt;-arm64&lt;/code&gt;, &lt;code&gt;-aarch64&lt;/code&gt;, or check image descriptions on Docker Hub/GHCR. &lt;code&gt;lscr.io&lt;/code&gt; (LinuxServer.io) often provides good multi-arch images. Update the &lt;code&gt;image:&lt;/code&gt; tag in your &lt;code&gt;deployment.yaml&lt;/code&gt; and &lt;code&gt;git push&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cause 2: Pi Network/DNS Issues.&lt;/strong&gt; The Pi itself can't reach the container registry.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Solution 2:&lt;/strong&gt; SSH into the Pi.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Test basic connectivity: &lt;code&gt;ping google.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; Test DNS: &lt;code&gt;nslookup ghcr.io&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; If DNS fails, try setting static DNS servers: edit &lt;code&gt;/etc/dhcpcd.conf&lt;/code&gt; (&lt;code&gt;sudo nano /etc/dhcpcd.conf&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt; Add a line: &lt;code&gt;static domain_name_servers=8.8.8.8 1.1.1.1&lt;/code&gt; (Google/Cloudflare DNS).&lt;/li&gt;
&lt;li&gt; Save and reboot (&lt;code&gt;sudo reboot&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
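&lt;p&gt;A quick way to tell the two causes apart from the Pi itself, before editing any manifests. A sketch; &lt;code&gt;ghcr.io&lt;/code&gt; stands in for whichever registry hosts your image:&lt;/p&gt;

```shell
# 1) Which architecture is this node? (arm64/aarch64 on a Pi 4/5)
uname -m
# 2) Can the node resolve the registry? getent uses the system resolver,
#    so it reflects what the container runtime will see.
if getent hosts ghcr.io > /dev/null; then echo "DNS OK"; else echo "DNS FAILED"; fi
```

&lt;p&gt;If the architecture check shows &lt;code&gt;aarch64&lt;/code&gt; but your image only ships &lt;code&gt;amd64&lt;/code&gt;, it's Cause 1; if the DNS check fails, it's Cause 2.&lt;/p&gt;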




&lt;h3&gt;
  
  
  Pod Stuck in &lt;code&gt;Pending&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; The pod stays in the &lt;code&gt;Pending&lt;/code&gt; state and never starts. Running &lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt;&lt;/code&gt; shows an event like &lt;code&gt;failed to bind volume&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; The pod is waiting for a &lt;code&gt;PersistentVolumeClaim&lt;/code&gt; (PVC), but no suitable &lt;code&gt;PersistentVolume&lt;/code&gt; (PV) is available to fulfill it (e.g., wrong size, access mode, storage class, or no PVs exist).&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Solution (Simple &lt;code&gt;hostPath&lt;/code&gt; PV for Homelab):&lt;/strong&gt; Define a &lt;code&gt;PersistentVolume&lt;/code&gt; in your GitOps repo that uses a directory on the Pi's filesystem. &lt;strong&gt;Warning:&lt;/strong&gt; &lt;code&gt;hostPath&lt;/code&gt; ties the data to that specific Pi node.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Create a directory on the Pi:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/data/my-app-data &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo chown &lt;/span&gt;nobody:nogroup /mnt/data/my-app-data
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create a &lt;code&gt;pv.yaml&lt;/code&gt; manifest in your GitOps repo (e.g., alongside the PVC):&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-app-data-pv # Unique name
spec:
  capacity:
    storage: 5Gi # Must be &amp;gt;= the PVC request
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce # Must match the PVC
  persistentVolumeReclaimPolicy: Retain # Keep data if the PV is deleted
  storageClassName: manual # Give it a name
  hostPath:
    path: "/mnt/data/my-app-data" # Path on the Pi node
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add &lt;code&gt;pv.yaml&lt;/code&gt; to your Kustomization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Update your application's &lt;strong&gt;PVC&lt;/strong&gt; &lt;code&gt;spec.storageClassName&lt;/code&gt; to &lt;code&gt;manual&lt;/code&gt; (or whatever name you chose) so it binds to this PV.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;code&gt;Connection Refused&lt;/code&gt; / &lt;code&gt;ServiceUnavailable&lt;/code&gt; (from remote &lt;code&gt;kubectl&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Running &lt;code&gt;kubectl get nodes&lt;/code&gt; from your Mac or remote machine fails with &lt;code&gt;ServiceUnavailable&lt;/code&gt; or &lt;code&gt;connection refused&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cause:&lt;/strong&gt; The &lt;code&gt;k3s&lt;/code&gt; service on the Pi is down, restarting, or unstable. This is &lt;em&gt;almost always&lt;/em&gt; caused by:

&lt;ol&gt;
&lt;li&gt; Forgetting the &lt;strong&gt;&lt;code&gt;cgroups&lt;/code&gt; fix&lt;/strong&gt; (critical for &lt;code&gt;k3s&lt;/code&gt; on Raspberry Pi OS).&lt;/li&gt;
&lt;li&gt; The Pi is out of resources (memory/CPU).&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solution:&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt; SSH into the Pi.&lt;/li&gt;
&lt;li&gt; Check the &lt;code&gt;k3s&lt;/code&gt; service status: &lt;code&gt;sudo systemctl status k3s.service&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; If it's not &lt;code&gt;active (running)&lt;/code&gt;, check the logs for crash reasons: &lt;code&gt;sudo journalctl -u k3s.service -f&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; Confirm &lt;code&gt;/boot/firmware/cmdline.txt&lt;/code&gt; includes the &lt;code&gt;cgroup_memory=1 cgroup_enable=memory&lt;/code&gt; flags (and reboot if you had to add them).&lt;/li&gt;
&lt;li&gt; Check resource usage with &lt;code&gt;htop&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>kubernetes</category>
      <category>gitops</category>
      <category>raspberrypi</category>
      <category>k3s</category>
    </item>
    <item>
      <title>Troubleshooting k3s on Raspberry Pi (Fixing the Auto-Restart Crash Loop)</title>
      <dc:creator>Shankar</dc:creator>
      <pubDate>Thu, 30 Oct 2025 11:05:56 +0000</pubDate>
      <link>https://forem.com/shankar_t/troubleshooting-k3s-on-raspberry-pi-fixing-the-auto-restart-crash-loop-jd3</link>
      <guid>https://forem.com/shankar_t/troubleshooting-k3s-on-raspberry-pi-fixing-the-auto-restart-crash-loop-jd3</guid>
      <description>&lt;p&gt;So you've installed &lt;code&gt;k3s&lt;/code&gt; (the lightweight Kubernetes distribution) on your Raspberry Pi, but your &lt;code&gt;kubectl&lt;/code&gt; commands from your main computer are failing with "connection refused"? You SSH into the Pi, check the service status (&lt;code&gt;sudo systemctl status k3s.service&lt;/code&gt;), and see it stuck in &lt;code&gt;activating (auto-restart)&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;You're likely facing a very common issue, especially on Raspberry Pi OS. Let's diagnose it and fix it step-by-step.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Symptoms
&lt;/h2&gt;

&lt;p&gt;You'll typically see two related problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;On your Raspberry Pi:&lt;/strong&gt; The &lt;code&gt;k3s&lt;/code&gt; Kubernetes service itself is crashing and getting stuck in a restart loop.

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sudo systemctl status k3s.service&lt;/code&gt; shows &lt;code&gt;Active: activating (auto-restart)&lt;/code&gt; or mentions &lt;code&gt;code=exited, status=1/FAILURE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Running &lt;code&gt;kubectl get nodes&lt;/code&gt; &lt;em&gt;on the Pi&lt;/em&gt; (with &lt;code&gt;sudo k3s kubectl&lt;/code&gt;) might intermittently work or show &lt;code&gt;The connection to the server 127.0.0.1:6443 was refused&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;On your Remote Machine (e.g., Mac/Linux/Windows):&lt;/strong&gt; Your &lt;code&gt;kubectl&lt;/code&gt; commands targeting the Pi fail consistently with connection errors (like &lt;code&gt;connection refused&lt;/code&gt; or timeouts) because the &lt;code&gt;k3s&lt;/code&gt; API server on the Pi isn't reliably available.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The core issue we need to solve is &lt;strong&gt;on the Raspberry Pi&lt;/strong&gt;: why is &lt;code&gt;k3s&lt;/code&gt; crashing?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: &lt;code&gt;k3s&lt;/code&gt; is Crashing
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;systemctl status k3s.service&lt;/code&gt; output tells the story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Active: activating (auto-restart)&lt;/code&gt;&lt;/strong&gt;: The service manager (&lt;code&gt;systemd&lt;/code&gt;) is trying to start &lt;code&gt;k3s&lt;/code&gt;, it fails, and &lt;code&gt;systemd&lt;/code&gt; automatically tries again, repeatedly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;(code=exited, status=1/FAILURE)&lt;/code&gt;&lt;/strong&gt;: This confirms the main &lt;code&gt;k3s&lt;/code&gt; process crashed with an error.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;connection refused&lt;/code&gt; errors happen because &lt;code&gt;kubectl&lt;/code&gt; tries to talk to the &lt;code&gt;k3s&lt;/code&gt; API server while it's down during one of these crashes or restarts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Most Likely Cause on Raspberry Pi
&lt;/h2&gt;

&lt;p&gt;For Raspberry Pi setups, this crash-restart loop is almost &lt;em&gt;always&lt;/em&gt; due to &lt;strong&gt;missing Linux kernel features required by &lt;code&gt;k3s&lt;/code&gt;&lt;/strong&gt;, specifically related to &lt;strong&gt;Control Groups (cgroups)&lt;/strong&gt; for memory management. The &lt;code&gt;k3s&lt;/code&gt; installation script often warns about this, but it's an easy step to miss.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix: Enable Cgroups and Reinstall k3s
&lt;/h2&gt;

&lt;p&gt;We'll ensure the kernel is configured correctly and then perform a clean re-installation of &lt;code&gt;k3s&lt;/code&gt; to fix any potentially corrupted state from the failed starts.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Enable Cgroups on the Raspberry Pi
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;SSH into your Raspberry Pi using its hostname or IP address:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh &amp;lt;your_pi_user&amp;gt;@&amp;lt;your_pi_hostname_or_ip&amp;gt;
&lt;span class="c"&gt;# e.g., ssh pi-admin@k3s-node.local&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Edit the boot configuration file using &lt;code&gt;sudo&lt;/code&gt; permissions:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /boot/firmware/cmdline.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This file contains a &lt;strong&gt;single, long line&lt;/strong&gt; of text. Use your arrow keys to navigate to the &lt;strong&gt;very end&lt;/strong&gt; of that line.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add a single &lt;strong&gt;space&lt;/strong&gt;, and then paste the following text:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cgroup_memory=1 cgroup_enable=memory
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;em&gt;(Make absolutely sure the entire file content remains on one single line!)&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Save the file by pressing &lt;code&gt;Ctrl + X&lt;/code&gt;, then &lt;code&gt;Y&lt;/code&gt;, then &lt;code&gt;Enter&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  2. Cleanly Uninstall k3s
&lt;/h3&gt;

&lt;p&gt;Let's remove the current (likely broken) installation.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Run the official uninstall script (this stops the service and removes &lt;code&gt;k3s&lt;/code&gt; files):&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run this command if it exists&lt;/span&gt;
/usr/local/bin/k3s-uninstall.sh

&lt;span class="c"&gt;# If the above gives "command not found", try this one:&lt;/span&gt;
&lt;span class="c"&gt;# /usr/local/bin/k3s-agent-uninstall.sh&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Now, &lt;strong&gt;reboot&lt;/strong&gt; the Pi to apply the kernel changes from Step 1:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;reboot
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  3. Reinstall &lt;code&gt;k3s&lt;/code&gt; and Verify
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; Wait for the Pi to reboot, then SSH back in.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run the installation script again:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sfL&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;https://get.k3s.io]&lt;span class="o"&gt;(&lt;/span&gt;https://get.k3s.io&lt;span class="o"&gt;)&lt;/span&gt; | sh -
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Give it a minute to start up, then check the service status again:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status k3s.service
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;You should now see the glorious green text: &lt;strong&gt;&lt;code&gt;Active: active (running)&lt;/code&gt;&lt;/strong&gt;. If it's stable and running, you've fixed the main issue!&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Final Step: Update Your Remote Kubeconfig
&lt;/h2&gt;

&lt;p&gt;Because &lt;code&gt;k3s&lt;/code&gt; was reinstalled, it has generated new security certificates. The old configuration file on your Mac is now invalid. You need to copy the new one over.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On the Pi:&lt;/strong&gt; Copy the new config to your home directory:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo cat&lt;/span&gt; /etc/rancher/k3s/k3s.yaml &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$HOME&lt;/span&gt;/k3s_config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;On the Pi:&lt;/strong&gt; &lt;strong&gt;Edit&lt;/strong&gt; the copied file (&lt;code&gt;nano $HOME/k3s_config.yaml&lt;/code&gt;) and change the &lt;code&gt;server:&lt;/code&gt; address from &lt;code&gt;https://127.0.0.1:6443&lt;/code&gt; to use your Pi's &lt;strong&gt;static IP address&lt;/strong&gt; (e.g., &lt;code&gt;https://192.168.1.100:6443&lt;/code&gt;). Save and exit.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On your Mac:&lt;/strong&gt; Use &lt;code&gt;scp&lt;/code&gt; to copy the updated file, replacing your old config. (Remember: back up &lt;code&gt;~/.kube/config&lt;/code&gt; first if you have other cluster contexts in it!)&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On your Mac - use your Pi's user/hostname/IP&lt;/span&gt;
scp &amp;lt;your_pi_user&amp;gt;@&amp;lt;your_pi_hostname_or_ip&amp;gt;:~/k3s_config.yaml ~/.kube/config
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Test Again:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On your Mac&lt;/span&gt;
kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Your &lt;code&gt;kubectl&lt;/code&gt; commands should now connect successfully and consistently!&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>kubernetes</category>
      <category>k3s</category>
      <category>raspberrypi</category>
      <category>homelab</category>
    </item>
    <item>
      <title>How to Use kubectl Directly on Your Raspberry Pi k3s Node</title>
      <dc:creator>Shankar</dc:creator>
      <pubDate>Thu, 30 Oct 2025 11:02:49 +0000</pubDate>
      <link>https://forem.com/shankar_t/how-to-use-kubectl-directly-on-your-raspberry-pi-k3s-node-1o9</link>
      <guid>https://forem.com/shankar_t/how-to-use-kubectl-directly-on-your-raspberry-pi-k3s-node-1o9</guid>
      <description>&lt;p&gt;You've set up a &lt;code&gt;k3s&lt;/code&gt; Kubernetes cluster on your Raspberry Pi and deployed an application. While managing it remotely with &lt;code&gt;kubectl&lt;/code&gt; from your main computer is great, sometimes you need to quickly check pod status or logs directly on the Pi itself.&lt;/p&gt;

&lt;p&gt;You might notice that just typing &lt;code&gt;kubectl get pods&lt;/code&gt; on the Pi gives you a connection error. That's because the standard &lt;code&gt;kubectl&lt;/code&gt; command doesn't automatically know where to find the &lt;code&gt;k3s&lt;/code&gt; cluster configuration or have the right permissions.&lt;/p&gt;

&lt;p&gt;Luckily, &lt;code&gt;k3s&lt;/code&gt; provides a handy wrapper command! Here's how to use it:&lt;/p&gt;




&lt;h2&gt;
  
  
  Steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;SSH into your Raspberry Pi:&lt;/strong&gt;&lt;br&gt;
Connect to your Pi using its hostname or IP address.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh &amp;lt;your_pi_user&amp;gt;@&amp;lt;your_pi_hostname_or_ip&amp;gt;
&lt;span class="c"&gt;# Example: ssh pi-admin@k3s-node.local&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Run the &lt;code&gt;k3s kubectl&lt;/code&gt; Command:&lt;/strong&gt;&lt;br&gt;
Prefix your usual &lt;code&gt;kubectl&lt;/code&gt; commands with &lt;code&gt;sudo k3s kubectl&lt;/code&gt;. This special command automatically uses the correct admin configuration (&lt;code&gt;/etc/rancher/k3s/k3s.yaml&lt;/code&gt;) and runs with the necessary permissions.&lt;/p&gt;

&lt;p&gt;To check your running pods:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the Pi&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;k3s kubectl get pods &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="c"&gt;# -A shows pods in all namespaces&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Or, if you know the namespace (e.g., &lt;code&gt;default&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the Pi&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;k3s kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; default
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check Pod Logs (Optional but useful):&lt;/strong&gt;&lt;br&gt;
First, get the full name of the pod you're interested in from the &lt;code&gt;get pods&lt;/code&gt; command above (it will look something like &lt;code&gt;my-app-deployment-xxxxxxxxxx-xxxxx&lt;/code&gt;). Then, view its logs:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the Pi - Replace &amp;lt;your-pod-name&amp;gt; and &amp;lt;namespace&amp;gt;&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;k3s kubectl logs &lt;span class="nt"&gt;-f&lt;/span&gt; &amp;lt;your-pod-name&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &amp;lt;namespace&amp;gt;
&lt;span class="c"&gt;# Example: sudo k3s kubectl logs -f my-app-deployment-7f8c9d4b4f-g2hjl -n default&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;The &lt;code&gt;-f&lt;/code&gt; flag follows the logs in real-time, showing you the latest output from your application's container directly in the Pi's terminal.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
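
&lt;p&gt;One optional convenience (not required by &lt;code&gt;k3s&lt;/code&gt;, just a suggestion): if typing the full prefix gets tedious, you can define a shell alias on the Pi. The alias name &lt;code&gt;k&lt;/code&gt; here is arbitrary.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the Pi - add to ~/.bashrc to make it permanent&lt;/span&gt;
&lt;span class="nb"&gt;alias &lt;/span&gt;&lt;span class="nv"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'sudo k3s kubectl'&lt;/span&gt;
k get pods &lt;span class="nt"&gt;-A&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;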




&lt;p&gt;That's all there is to it! Using &lt;code&gt;sudo k3s kubectl&lt;/code&gt; is the straightforward way to interact with your &lt;code&gt;k3s&lt;/code&gt; cluster directly on the node it's running on.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>k3s</category>
      <category>raspberrypi</category>
      <category>devops</category>
    </item>
    <item>
      <title>Setting Up GitOps with Flux on a Kubernetes Cluster</title>
      <dc:creator>Shankar</dc:creator>
      <pubDate>Thu, 30 Oct 2025 11:00:10 +0000</pubDate>
      <link>https://forem.com/shankar_t/setting-up-gitops-with-flux-on-a-kubernetes-cluster-5d8l</link>
      <guid>https://forem.com/shankar_t/setting-up-gitops-with-flux-on-a-kubernetes-cluster-5d8l</guid>
      <description>&lt;p&gt;Ready to automate your Kubernetes deployments? GitOps is the way to go, and FluxCD is a fantastic tool to make it happen. This guide walks you through the initial setup: installing Flux on your cluster and connecting it to your GitHub repository. Let's get started!&lt;/p&gt;




&lt;h2&gt;
  
  
  What's GitOps, Anyway?
&lt;/h2&gt;

&lt;p&gt;In simple terms, GitOps means using a Git repository as the &lt;strong&gt;single source of truth&lt;/strong&gt; for your desired infrastructure and application state. Flux is the operator that runs in your Kubernetes cluster, constantly comparing the cluster's live state to the state defined in your Git repo. If they differ, Flux automatically makes changes to the cluster to match the repo. Magic! &lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before we begin, make sure you have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;A Kubernetes Cluster:&lt;/strong&gt; Any cluster will do (like &lt;code&gt;k3s&lt;/code&gt; on a Raspberry Pi, &lt;code&gt;minikube&lt;/code&gt;, or a cloud provider's offering). Ensure &lt;code&gt;kubectl&lt;/code&gt; is configured to access it.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;A GitHub Account:&lt;/strong&gt; We'll use GitHub to host our configuration repository.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;A GitHub Personal Access Token (PAT):&lt;/strong&gt; Flux needs this to create deploy keys and potentially commit manifests back to your repository during the bootstrap process.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Create a GitHub Personal Access Token (PAT)
&lt;/h2&gt;

&lt;p&gt;Flux needs permissions to interact with your repository.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Go to your GitHub &lt;strong&gt;Settings&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; Navigate to &lt;strong&gt;Developer settings&lt;/strong&gt; (usually near the bottom left).&lt;/li&gt;
&lt;li&gt; Click on &lt;strong&gt;Personal access tokens&lt;/strong&gt; -&amp;gt; &lt;strong&gt;Tokens (classic)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; Click &lt;strong&gt;Generate new token&lt;/strong&gt; -&amp;gt; &lt;strong&gt;Generate new token (classic)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; Give it a descriptive &lt;strong&gt;Note&lt;/strong&gt; (e.g., "flux-bootstrap").&lt;/li&gt;
&lt;li&gt; Set an &lt;strong&gt;Expiration&lt;/strong&gt; (e.g., 90 days - &lt;em&gt;remember to rotate it!&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt; Select the &lt;strong&gt;&lt;code&gt;repo&lt;/code&gt;&lt;/strong&gt; scope. This grants permissions needed for Flux to manage repository configuration.&lt;/li&gt;
&lt;li&gt; Click &lt;strong&gt;Generate token&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Immediately copy the generated token!&lt;/strong&gt; You won't see it again.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Store the Token Securely (Temporarily in ENV):&lt;/strong&gt; For the bootstrap command, export it as an environment variable in your terminal. &lt;strong&gt;Never commit this token to Git!&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;paste-your-token-here&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Export Your GitHub Username:&lt;/strong&gt; Flux also needs your username.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GITHUB_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;your-github-username&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;
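
&lt;p&gt;Before running the bootstrap, a quick optional sanity check confirms both variables are set in the current shell, without printing the token itself:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GITHUB_USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="c"&gt;# Print only the token's length as a set/unset check&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;${#GITHUB_TOKEN}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;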




&lt;h2&gt;
  
  
  2. Install the Flux CLI
&lt;/h2&gt;

&lt;p&gt;You'll need the &lt;code&gt;flux&lt;/code&gt; command-line tool to interact with Flux. Installation methods vary by OS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;macOS (Homebrew):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;fluxcd/tap/flux
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Linux/Other (Curl):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;https://fluxcd.io/install.sh]&lt;span class="o"&gt;(&lt;/span&gt;https://fluxcd.io/install.sh&lt;span class="o"&gt;)&lt;/span&gt; | &lt;span class="nb"&gt;sudo &lt;/span&gt;bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;(Check the official Flux documentation for other methods)&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Verify the installation:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
bash
which flux
flux --version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
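
&lt;p&gt;With the CLI installed and &lt;code&gt;kubectl&lt;/code&gt; pointing at your cluster, you can also run Flux's built-in pre-flight check to confirm the cluster meets its requirements before bootstrapping:&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;flux check &lt;span class="nt"&gt;--pre&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;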

</description>
      <category>kubernetes</category>
      <category>gitops</category>
      <category>fluxcd</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
