<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Meng Lin</title>
    <description>The latest articles on Forem by Meng Lin (@menglinmaker).</description>
    <link>https://forem.com/menglinmaker</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1180248%2Fc5ae3971-a4e5-44a3-b0ef-2c00db9fcad4.png</url>
      <title>Forem: Meng Lin</title>
      <link>https://forem.com/menglinmaker</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/menglinmaker"/>
    <language>en</language>
    <item>
      <title>Building NewHomie property analytics tool — Part 1</title>
      <dc:creator>Meng Lin</dc:creator>
      <pubDate>Thu, 16 Apr 2026 22:07:48 +0000</pubDate>
      <link>https://forem.com/menglinmaker/building-newhomie-property-analytics-tool-part-1-57oe</link>
      <guid>https://forem.com/menglinmaker/building-newhomie-property-analytics-tool-part-1-57oe</guid>
      <description>&lt;h2&gt;
  
  
  From Scrappy Scraper to Production Pipeline
&lt;/h2&gt;

&lt;p&gt;It all started with a question.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How am I supposed to afford a house?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmu93izdbgd7w1nklf6w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmu93izdbgd7w1nklf6w.jpeg" alt="captionless image" width="687" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So I set out to transfigure my anxiety into a software product.&lt;/p&gt;

&lt;p&gt;But first, I needed data I could trust.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;So I could build this insanity 😉:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fam5ycck0l9xppwgryabj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fam5ycck0l9xppwgryabj.gif" alt="Cross-highlighting tens of thousands of properties across charts and attributes in real time." width="720" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Validate scrapeability
&lt;/h2&gt;

&lt;p&gt;There is no point building a scraper locally only to discover it breaks the moment you deploy it.&lt;/p&gt;

&lt;p&gt;I chose to scrape &lt;a href="http://www.domain.com.au" rel="noopener noreferrer"&gt;Domain&lt;/a&gt; for Australian property data, since &lt;a href="https://www.realestate.com.au/" rel="noopener noreferrer"&gt;Realestate&lt;/a&gt; would probably send endless waves of lawyers after anyone bypassing its picky &lt;a href="https://www.kasada.io/" rel="noopener noreferrer"&gt;Kasada bot protection&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Before writing much pipeline code, I wanted to validate two things: whether the site was practically scrapeable, and whether it could be scraped reliably in a deployed environment.&lt;/p&gt;

&lt;p&gt;Scrapeability factors to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend framework&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Domain uses Next.js client-side rendering.&lt;/li&gt;
&lt;li&gt;Next.js embeds a JSON payload in a &lt;code&gt;&amp;lt;script id="__NEXT_DATA__" type="application/json"&amp;gt;&lt;/code&gt; tag.&lt;/li&gt;
&lt;li&gt;Extracting that JSON is much more stable and structured than parsing rendered HTML, because it is closer to the frontend’s source data.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Bot detection&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Domain uses Akamai, which blocks simple HTTP requests unless they come from a real browser.&lt;/li&gt;
&lt;li&gt;I did not observe stronger protections like IP blocking, fingerprinting or human cursor detection.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;To test quickly, I deployed &lt;a href="https://playwright.dev/" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt; on &lt;a href="https://aws.amazon.com/lambda/" rel="noopener noreferrer"&gt;AWS Lambda&lt;/a&gt; using &lt;a href="https://sst.dev/" rel="noopener noreferrer"&gt;SST&lt;/a&gt; so I could iterate against a live environment. Against stronger bot protection, I would consider something like Camoufox, at a performance cost.&lt;/p&gt;
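
&lt;p&gt;&lt;em&gt;The &lt;code&gt;__NEXT_DATA__&lt;/code&gt; extraction above can be sketched as follows (Python for brevity; this finds the payload by brace matching, which assumes no braces inside JSON strings):&lt;/em&gt;&lt;/p&gt;

```python
import json

def extract_next_data(html):
    # Locate the __NEXT_DATA__ script payload and parse the JSON object
    # by brace matching (a sketch; assumes no braces inside JSON strings).
    start = html.index('__NEXT_DATA__')
    start = html.index('{', start)  # first brace after the marker
    depth = 0
    for i in range(start, len(html)):
        if html[i] == '{':
            depth += 1
        elif html[i] == '}':
            depth -= 1
            if depth == 0:
                return json.loads(html[start:i + 1])
    raise ValueError('unterminated __NEXT_DATA__ payload')
```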

&lt;h2&gt;
  
  
  2. Validate scraped data at the boundary
&lt;/h2&gt;

&lt;p&gt;Once I knew I could scrape the site, I needed a way to enforce data integrity.&lt;/p&gt;

&lt;p&gt;Validating data early in the scrape process simplified the rest of the pipeline. My first pass used LLM-generated scraper code, which duplicated validation logic in multiple places and made the transformation layer harder to reason about.&lt;/p&gt;

&lt;p&gt;Introducing a schema validator at the boundary fixed a lot of that. Using &lt;a href="https://zod.dev/" rel="noopener noreferrer"&gt;Zod&lt;/a&gt; made it easier to distinguish expected data from unexpected data, catch bad assumptions early, and keep the downstream transformation logic much simpler.&lt;/p&gt;

&lt;p&gt;After deploying, I noticed some localities were not being scraped because certain paths in the extracted JSON did not exist. It turned out those pages did not exist at all. Once I confirmed that, I added a validator to detect the 404 page shape so the pipeline could handle it gracefully instead of failing deeper in the system.&lt;/p&gt;

&lt;p&gt;Boundary validation made debugging easier and simplified the downstream transformation logic.&lt;/p&gt;
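
&lt;p&gt;&lt;em&gt;A minimal sketch of that boundary check, in the spirit of Zod’s safeParse (Python for brevity; the JSON paths below are hypothetical):&lt;/em&gt;&lt;/p&gt;

```python
def classify_page(payload):
    # Boundary check: decide up front whether the extracted JSON is a
    # listings page, a known 404 shape, or something unexpected, so the
    # downstream transformation logic never sees junk.
    page_props = payload.get('props', {}).get('pageProps', {})  # hypothetical paths
    if page_props.get('statusCode') == 404:
        return ('not_found', None)
    listings = page_props.get('listings')
    if isinstance(listings, list):
        return ('ok', listings)
    return ('invalid', payload)
```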

&lt;h2&gt;
  
  
  3. Insert data with raw SQL
&lt;/h2&gt;

&lt;p&gt;SQL abstractions can hide important insertion logic and encourage assumptions that only surface later as dirty data.&lt;/p&gt;

&lt;p&gt;I was tempted to use &lt;a href="https://kysely.dev/" rel="noopener noreferrer"&gt;Kysely&lt;/a&gt; query builder for inserts. It reduced boilerplate, but it also made it too easy to assume every table shared the same conflict-handling logic. In practice, each table needed different upsert and deduplication behaviour.&lt;/p&gt;

&lt;p&gt;That mismatch introduced bad data which only became obvious later when I started exploring the dataset. Cleaning it up was expensive. I had to write careful migration scripts to transform or delete rows that should never have been inserted in the first place.&lt;/p&gt;

&lt;p&gt;One example was property listings that shared the same address and overwrote each other. Another was duplicate listings with different prices, including oddly precise unrounded numbers. These cases occurred less than 1% of the time, but they still produced enough junk data to cost me one to two weeks of cleanup work.&lt;/p&gt;

&lt;p&gt;For this part of the pipeline, raw SQL was more verbose, but it made conflict handling explicit.&lt;/p&gt;
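
&lt;p&gt;&lt;em&gt;A runnable sketch of per-table conflict handling, using sqlite3 and a hypothetical schema (the real pipeline targets a different database, but the ON CONFLICT clause reads the same):&lt;/em&gt;&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE sale_listing (listing_id TEXT PRIMARY KEY, price INTEGER, scraped_at TEXT)'
)

# This table's explicit rule: a re-scraped listing updates price and timestamp.
# Other tables may need DO NOTHING or a different conflict key instead.
upsert = '''
    INSERT INTO sale_listing (listing_id, price, scraped_at)
    VALUES (?, ?, ?)
    ON CONFLICT (listing_id) DO UPDATE SET
        price = excluded.price,
        scraped_at = excluded.scraped_at
'''
conn.execute(upsert, ('L1', 640000, '2025-09-15'))
conn.execute(upsert, ('L1', 680000, '2025-09-16'))  # same listing, newer scrape
```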

&lt;h2&gt;
  
  
  4. Add observability and iterate before scaling
&lt;/h2&gt;

&lt;p&gt;Observability on &lt;a href="https://grafana.com/docs/opentelemetry/docker-lgtm/" rel="noopener noreferrer"&gt;Grafana LGTM&lt;/a&gt; made it possible to see where the pipeline was slow, fragile, or built on bad assumptions. That sped up architectural iteration by exposing bottlenecks and clarifying the real requirements.&lt;/p&gt;

&lt;p&gt;Once the scraper moved into a real deployed environment, I added an SQS queue in front of the workers and started tracking a few key signals:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worker duration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Workers are split by locality.&lt;/li&gt;
&lt;li&gt;  This made it easier to see whether cold starts and setup overhead were dominating runtime.&lt;/li&gt;
&lt;li&gt;  In practice, that pushed me toward batching work where possible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj38gmziz8t8lr89p5a29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj38gmziz8t8lr89p5a29.png" alt="captionless image" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU and memory usage&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  I wanted workers to use available CPU efficiently rather than sit idle.&lt;/li&gt;
&lt;li&gt;  Higher utilization generally meant better cost efficiency, as long as memory stayed within safe limits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosnhbiybpqoordw1f148.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosnhbiybpqoordw1f148.png" alt="captionless image" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipeline fragility:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  I wanted to know where the pipeline failed most often, so I added class and method names to the &lt;a href="https://opentelemetry.io/docs/specs/semconv/registry/attributes/code/#code-function-name" rel="noopener noreferrer"&gt;OTEL code_function_name attribute&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  I also logged ambiguous or partially extracted data so bad assumptions were visible earlier.&lt;/li&gt;
&lt;li&gt;  Error occurrence patterns over time and space can be visualized on a heat map.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5ytge5ao43c8gkpymqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5ytge5ao43c8gkpymqn.png" alt="captionless image" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anomalies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Unusual worker duration.&lt;/li&gt;
&lt;li&gt;  Unusual resource usage.&lt;/li&gt;
&lt;li&gt;  Unexpected failure patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These signals made the pipeline easier to iterate on. Instead of guessing where the bottlenecks and fragile spots were, I could observe them directly and improve the system from there. Once failure modes were better understood and error rates came down, I could scale the pipeline with much more confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Design for iteration speed first
&lt;/h2&gt;

&lt;p&gt;To validate scrapeability quickly, I needed infrastructure that could deploy fast. I intentionally traded long-term flexibility for iteration speed. At this stage, the main requirements were simple: keep scrape speed within rate limits, keep SQL inserts idempotent, and retry workers on failure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7v97pf235xmqck9htze9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7v97pf235xmqck9htze9.png" alt="The first version used a cron job to trigger scrape work through an SQS queue." width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the scraper worked reliably on 3 localities, I scaled it to 362 localities near the CBD. At that scale, rarer problems started to appear in the &lt;a href="https://menglinmaker.grafana.net/dashboard/snapshot/wBuTNydyAvyK4VNUV9YjPakuzfWzlkzN?orgId=1&amp;amp;from=2025-09-15T14%3A30%3A00.000Z&amp;amp;to=2025-09-15T20%3A00%3A00.000Z&amp;amp;timezone=browser&amp;amp;var-heatmap_interval=30s&amp;amp;var-log_level=fatal&amp;amp;var-service_name=function-scrape_locality&amp;amp;refresh=1m" rel="noopener noreferrer"&gt;observability dashboard&lt;/a&gt; (showcasing 367 errors):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browser process was hanging&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This showed up in the heat map as repeated failures in &lt;code&gt;ScrapeController.tryExtractSuburbPage&lt;/code&gt; across many localities after scrape attempts. Eventually the browser process would restart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5ytge5ao43c8gkpymqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5ytge5ao43c8gkpymqn.png" alt="Once browser hangs, it affects other workers due to Lambda shared context." width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;TimeoutError&lt;/code&gt; — &lt;code&gt;BrowserService.getHTML&lt;/code&gt;
Navigation timeout of 10000 ms exceeded&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;ProtocolError&lt;/code&gt; — &lt;code&gt;ScrapeController.tryExtractSuburbPage&lt;/code&gt;
&lt;code&gt;Target.createTarget&lt;/code&gt; timed out. Increase the &lt;code&gt;protocolTimeout&lt;/code&gt; setting in launch/connect calls for a higher timeout if needed.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;ProtocolError&lt;/code&gt; — &lt;code&gt;ScrapeController.tryExtractRentsPage&lt;/code&gt;
&lt;code&gt;Target.createTarget&lt;/code&gt; timed out. Increase the &lt;code&gt;protocolTimeout&lt;/code&gt; setting in launch/connect calls for a higher timeout if needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  I reworked the browser error-handling and retry logic so failures triggered a full browser restart instead of leaving the process in a bad state. This greatly decreased the average scrape worker duration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F962dufjqpkwg8epktepb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F962dufjqpkwg8epktepb.png" alt="Workers with ProtocolError wasted about 3 minutes of compute each." width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;
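
&lt;p&gt;&lt;em&gt;The restart-on-failure behaviour, reduced to a sketch (Python; the real worker drives a headless browser):&lt;/em&gt;&lt;/p&gt;

```python
def scrape_with_restart(browser_factory, scrape, attempts=3):
    # On failure, close and relaunch the browser rather than reuse a
    # possibly hung process (sketch; cleanup on success is elided).
    browser = browser_factory()
    for attempt in range(attempts):
        try:
            return scrape(browser)
        except Exception:
            browser.close()
            if attempt == attempts - 1:
                raise
            browser = browser_factory()
```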

&lt;p&gt;&lt;strong&gt;Price extraction logic was flawed, but caught by database constraint&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;Warn&lt;/code&gt; — &lt;code&gt;DomainListingsService.tryTransformSalePrice&lt;/code&gt;
no price in &lt;code&gt;listing.listingModel.price&lt;/code&gt; — &lt;code&gt;"2bedroom + 1bedroom (study)"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;Warn&lt;/code&gt; — &lt;code&gt;ScrapeModel.tryUpdateSaleListing&lt;/code&gt;
value &lt;code&gt;"640000680000"&lt;/code&gt; is out of range for type integer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  I added tests from production logs so more valid price strings are accepted while invalid price strings are rejected.&lt;/li&gt;
&lt;/ul&gt;
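
&lt;p&gt;&lt;em&gt;A sketch of the kind of price parser those tests pin down (Python; the bounds are illustrative):&lt;/em&gt;&lt;/p&gt;

```python
import re

def try_parse_sale_price(text):
    # Accept dollar amounts and return the lower bound of a range;
    # reject free text and bare digit runs, which is how
    # '640000680000'-style concatenations slip through.
    amounts = []
    for digits in re.findall(r'\$\s?(\d[\d,]*)', text):
        value = int(digits.replace(',', ''))
        if value in range(10_000, 100_000_000):  # illustrative sanity bounds
            amounts.append(value)
    return min(amounts) if amounts else None
```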

&lt;p&gt;&lt;strong&gt;Non-existent pages were being scraped without enough context in the logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;ZodError&lt;/code&gt; — &lt;code&gt;DomainSuburbService.tryExtractProfile&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;ZodError&lt;/code&gt; — &lt;code&gt;DomainListingsService.tryExtractListings&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The error logs needed to include the locality that caused the failure. Once I added that context, I found that some of the localities did not exist at all.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This version of the architecture did what it needed to do: it validated scrapeability quickly and exposed real failure modes early. Its weaknesses only became obvious once scale increased, which was acceptable for a design optimized for learning speed.&lt;/p&gt;

&lt;p&gt;Fast validation was the right trade at the start, because it exposed real failure modes before the architecture was worth hardening.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Design for observed requirements next
&lt;/h2&gt;

&lt;p&gt;Once the pipeline became stable with fewer than 50 errors per run, iteration speed was no longer the main priority. At that point, I could trade some of it away to meet the requirements the system had actually revealed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Orchestration for pre- and post-processing.&lt;/li&gt;
&lt;li&gt;  Smaller blast radius when scrape workers failed.&lt;/li&gt;
&lt;li&gt;  Independent workflow execution.&lt;/li&gt;
&lt;li&gt;  Scrape workers that were easier to test and deploy.&lt;/li&gt;
&lt;li&gt;  Timeouts beyond Lambda’s 15-minute limit.&lt;/li&gt;
&lt;li&gt;  Full workflow completion within 1 day.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next architecture centered on Step Functions for orchestration. This made the workflow easier to reason about, and the Step Functions visualizer was especially useful for debugging and manually retrying failed runs.&lt;/p&gt;

&lt;p&gt;My original plan was to use Fargate workers running the scraper in Docker containers. However, I could not work out how to inject environment variables into &lt;a href="https://sst.dev/docs/component/aws/task/#containers-environment" rel="noopener noreferrer"&gt;Fargate tasks&lt;/a&gt; from &lt;a href="https://sst.dev/docs/component/aws/step-functions/" rel="noopener noreferrer"&gt;Step Functions&lt;/a&gt; using SST, so I temporarily kept Lambda workers despite their limitations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14ilnzye58wiqlxhahu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14ilnzye58wiqlxhahu7.png" alt="Ideal architecture: Cron job → Step Functions workflow → Fargate task" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/service-quotas.html" rel="noopener noreferrer"&gt;Step Functions introduced its own constraints&lt;/a&gt;. The 256 KB message limit and 25,000 event history limit added complexity to larger runs. The simplest workaround I found at the time was to trigger two workflows in parallel.&lt;/p&gt;
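
&lt;p&gt;&lt;em&gt;The fan-out workaround amounts to partitioning localities before execution (a sketch; the real split is wired up in the orchestration layer):&lt;/em&gt;&lt;/p&gt;

```python
def split_for_workflows(localities, workflows=2):
    # Round-robin the localities so each Step Functions execution stays
    # under the event-history and payload limits.
    chunks = [[] for _ in range(workflows)]
    for index, locality in enumerate(localities):
        chunks[index % workflows].append(locality)
    return chunks
```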

&lt;p&gt;Once this architecture looked stable in a preview branch, I scaled it from 362 to 4,491 localities. That was the point where Step Functions began to hit its practical limits and forced a temporary redesign.&lt;/p&gt;

&lt;p&gt;Once the system revealed its real constraints, the architecture had to evolve around them rather than the assumptions that shaped the first version.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Optimize for cost last
&lt;/h2&gt;

&lt;p&gt;After observing the cost of the previous design, I realized it was more expensive than expected, partly because Lambda’s free tier had hidden some of the true cost earlier on.&lt;/p&gt;

&lt;p&gt;At this stage, I wanted the cheapest compute that still fit the workload. In theory, that meant &lt;a href="https://aws.amazon.com/ec2/spot/pricing/" rel="noopener noreferrer"&gt;spot instances&lt;/a&gt; or &lt;a href="https://aws.amazon.com/fargate/pricing/#Fargate_Spot_Pricing_for_Amazon_ECS" rel="noopener noreferrer"&gt;Fargate Spot&lt;/a&gt; on Arm64. In practice, that would reduce scrape worker availability and increase the chance of interrupted runs and forced restarts.&lt;/p&gt;

&lt;p&gt;My target was to keep the overhead of batch locality scraping below 10%. Since AWS Batch on Fargate adds roughly 1 minute of provisioning overhead, each worker needed to run for around 10 minutes to make that overhead acceptable. Based on a median scrape time of 15 seconds per locality, I designed each worker to handle 50 localities.&lt;/p&gt;
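
&lt;p&gt;&lt;em&gt;The batch-sizing arithmetic, spelled out:&lt;/em&gt;&lt;/p&gt;

```python
PROVISION_SECONDS = 60        # approx AWS Batch on Fargate startup overhead
MEDIAN_SCRAPE_SECONDS = 15    # median scrape time per locality
LOCALITIES_PER_WORKER = 50

work_seconds = LOCALITIES_PER_WORKER * MEDIAN_SCRAPE_SECONDS  # 750 s per worker
overhead_fraction = PROVISION_SECONDS / (PROVISION_SECONDS + work_seconds)
# roughly 0.074, comfortably under the 10% target
```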

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucheehnniqnfefav9mdc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucheehnniqnfefav9mdc.png" alt="Conceptually simple architecture with gnarly IaC definition." width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although spot instances were cheap, they introduced additional costs and complexity, including &lt;a href="https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/" rel="noopener noreferrer"&gt;public IPv4 charges&lt;/a&gt; and &lt;a href="https://github.com/anomalyco/sst/issues/6244#issuecomment-3704733886" rel="noopener noreferrer"&gt;more complex IaC&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Cost only became worth optimizing once the workload was understood well enough to separate real savings from premature complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;What began as a scrappy scraper gradually became a production pipeline. Each stage exposed a different class of problem: scrapeability, data integrity, insertion rules, observability, and finally architecture itself.&lt;/p&gt;

&lt;p&gt;The main lesson was to validate assumptions early, especially through observability. Production data exposed where those assumptions failed, and each redesign became a response to that reality rather than guesswork.&lt;/p&gt;

&lt;p&gt;The result is a scraper that has been in production since October 2025 and has been operating with an average of fewer than 100 warnings and errors per run.&lt;/p&gt;

&lt;p&gt;The biggest benefit was being able to explore tens of thousands of properties with a single SQL query instead of being constrained by the limited interfaces of property listing websites. That made the engineering effort worthwhile: it turned messy public listing data into something I could reason about quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;I could use &lt;a href="https://aws.amazon.com/fis/" rel="noopener noreferrer"&gt;AWS FIS&lt;/a&gt; to test the resilience of the pipeline by deliberately injecting faults in the spirit of Netflix-style chaos engineering.&lt;/p&gt;

&lt;p&gt;The next obvious step is to explore the data itself, but that is a separate problem.&lt;/p&gt;

&lt;p&gt;That opens the door to a different set of questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  How to design an interactive UX for exploring large property datasets?&lt;/li&gt;
&lt;li&gt;  How much complexity does local-first caching introduce?&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>aws</category>
      <category>softwareengineering</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Reconnecting websockets on AWS</title>
      <dc:creator>Meng Lin</dc:creator>
      <pubDate>Mon, 09 Oct 2023 01:11:32 +0000</pubDate>
      <link>https://forem.com/menglinmaker/reconnecting-websockets-on-aws-2gk1</link>
      <guid>https://forem.com/menglinmaker/reconnecting-websockets-on-aws-2gk1</guid>
      <description>&lt;p&gt;AWS API Gateway supports websockets.&lt;/p&gt;

&lt;p&gt;Unfortunately, their service does not provide the ability to persist connection data and reconnect on flaky internet sessions. Nor could I find any example projects with those features. In this article, I will explore a &lt;a href="https://github.com/MengLinMaker/websocket-lambda" rel="noopener noreferrer"&gt;potential solution using Lambda and DynamoDB&lt;/a&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  Firstly, how do API Gateway websockets work?
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vy6avivo5z7l2q7ho3h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vy6avivo5z7l2q7ho3h.png" alt="AWS websocket chat application example" width="760" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS uses routes to execute different actions. Consider a chat application workflow, assuming Lambda is used for compute:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Client connects to websocket, firing the &lt;code&gt;$connect&lt;/code&gt; route and associated Lambda.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Client sends JSON payload &lt;code&gt;{action: 'sendmessage'}&lt;/code&gt;, firing the &lt;code&gt;sendmessage&lt;/code&gt; route.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Server can send data to client by specifying a &lt;code&gt;socketUrl&lt;/code&gt; with &lt;code&gt;ConnectionId&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If client sends JSON payload without action, the &lt;code&gt;$default&lt;/code&gt; route is fired.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Client disconnects, firing the &lt;code&gt;$disconnect&lt;/code&gt; route.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Limitations with API Gateway websocket:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Every time the &lt;code&gt;$connect&lt;/code&gt; route is called, a new &lt;code&gt;ConnectionId&lt;/code&gt; is created. To persist connection on flaky internet, the client must store an ID.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lambda is ephemeral, so a database like DynamoDB is required to persist connection data — connection data shouldn’t be stored in memory in case of failure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The 10-minute idle timeout can be avoided with a ping/pong request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Websocket connections last at most 2 hours, after which a new connection session is required.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  A solution to reconnecting websockets on AWS
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygffam779qvrlj5uuag5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygffam779qvrlj5uuag5.png" alt=" " width="800" height="648"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This solution uses custom actions instead of &lt;code&gt;$connect&lt;/code&gt; and &lt;code&gt;$disconnect&lt;/code&gt; to manage connections, which shifts connection management to the client instead of AWS. DynamoDB will persist connection data and provide an event-driven architecture for returning messages as well as interfacing with external services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case 1: Ideal connection
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;After opening a websocket connection, the client will send an empty socketId to the &lt;code&gt;open&lt;/code&gt; route, which generates a new socketId for the client. The socketUrl, socketId and current &lt;code&gt;ConnectionId&lt;/code&gt; are stored in DynamoDB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Client calls external service “ping”, providing &lt;code&gt;socketUrl&lt;/code&gt; and socketId. These details are used to update the DynamoDB table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Updating DynamoDB will trigger the &lt;code&gt;message&lt;/code&gt; Lambda to post any stored messages in the updated row to the client.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Client is responsible for triggering the &lt;code&gt;close&lt;/code&gt; route, which will delete the associated row in DynamoDB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If client fails to close the connection, a &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/howitworks-ttl.html" rel="noopener noreferrer"&gt;TTL (time to live)&lt;/a&gt; data deletion timer can be specified for DynamoDB.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Case 2: Flaky connection
&lt;/h3&gt;

&lt;p&gt;If the internet connection is poor, the websocket connection may close on the client side.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;When the client is offline, new messages will be stored in DynamoDB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A “back online” event listener on the client will create a new websocket connection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Client will send an existing socketId to the &lt;code&gt;open&lt;/code&gt; route, updating DynamoDB with the new &lt;code&gt;ConnectionId&lt;/code&gt;. This triggers an event to send the last message back to the client.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
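
&lt;p&gt;&lt;em&gt;The &lt;code&gt;open&lt;/code&gt; route behaviour across both cases, sketched with a dict standing in for the DynamoDB table (Python for brevity; the real handlers are Lambdas):&lt;/em&gt;&lt;/p&gt;

```python
import uuid

connections = {}  # stand-in for the DynamoDB connections table

def handle_open(socket_id, connection_id, socket_url):
    # Empty socketId: first connection, mint a durable id (Case 1).
    # Existing socketId: reconnection, refresh the ConnectionId so stored
    # messages can be flushed to the new websocket (Case 2).
    if not socket_id:
        socket_id = str(uuid.uuid4())
        connections[socket_id] = {'connectionId': connection_id,
                                  'socketUrl': socket_url,
                                  'messages': []}
    else:
        connections[socket_id]['connectionId'] = connection_id
        connections[socket_id]['socketUrl'] = socket_url
    return socket_id
```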

&lt;h3&gt;
  
  
  Case 3: Long connection
&lt;/h3&gt;

&lt;p&gt;Disconnecting and reconnecting at intervals of under 10 minutes avoids this issue.&lt;/p&gt;




&lt;h1&gt;
  
  
  Limitations to consider
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;DynamoDB stream adds latency for message updates to client.&lt;/li&gt;
&lt;li&gt;Connection data will not persist if client is refreshed. Also, storing socketId in local storage is not ideal when multiple tabs are used.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>dynamodb</category>
      <category>lambda</category>
    </item>
    <item>
      <title>Fast piano transcription on AWS - Part 2</title>
      <dc:creator>Meng Lin</dc:creator>
      <pubDate>Mon, 09 Oct 2023 00:59:53 +0000</pubDate>
      <link>https://forem.com/menglinmaker/fast-piano-transcription-on-aws-part-2-4ei7</link>
      <guid>https://forem.com/menglinmaker/fast-piano-transcription-on-aws-part-2-4ei7</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/menglinmaker/fast-piano-transcription-on-aws-part-1-3jhg"&gt;Previously&lt;/a&gt;, I tested a few architectures for music transcription. I concluded that separating the preprocessing, inferencing and postprocessing steps for parallel processing is crucial to improving model speed — refer to the architecture below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fencxkr8ql0uhey7f3w8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fencxkr8ql0uhey7f3w8f.png" alt="Implemented architecture" width="703" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, I will explore further optimisations.&lt;/p&gt;




&lt;h1&gt;
  
  
  Improving inference speed with ONNX
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Replacing PyTorch with ONNX improves inference speed by 2x.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The benchmark metric I will use is the average inference time for a 3-second audio frame (FIT)&lt;/em&gt;&lt;/strong&gt; — the FIT for PyTorch CPU on an M1 Pro is around 1.1 seconds, using around 3 CPU cores.&lt;/p&gt;
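&lt;p&gt;As a rough sketch, FIT can be measured with a small timing harness like the one below, where the &lt;code&gt;run_inference&lt;/code&gt; callable is a placeholder for the actual model call:&lt;/p&gt;

```python
import time

def average_inference_time(run_inference, n_runs=10, warmup=2):
    """Average wall-clock time of run_inference() over n_runs, after warmup calls."""
    # Warmup runs let caches and lazy initialisation settle before timing
    for _ in range(warmup):
        run_inference()
    start = time.perf_counter()
    for _ in range(n_runs):
        run_inference()
    return (time.perf_counter() - start) / n_runs
```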

&lt;p&gt;The 172 MB PTH model weight file can be reduced to a 151 MB ONNX file, which also embeds some of the preprocessing steps, like generating a spectrogram. Running the ONNX file on Microsoft’s &lt;a href="https://onnxruntime.ai" rel="noopener noreferrer"&gt;ONNXruntime&lt;/a&gt; reduces FIT to 0.56 seconds using 3 CPU cores — a 2x improvement!&lt;/p&gt;

&lt;p&gt;Converting PTH to ONNX only takes a couple of lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch

pytorch_model = ...
dummy_input = ...
modelPath = ...
input_names = ['input']
output_names = [...]
torch.onnx.export(pytorch_model, dummy_input, modelPath,
  input_names=input_names,
  output_names=output_names
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running ONNX on ONNXruntime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import onnxruntime as ort

# Inference with CPU
onnx_path = ...
model = ort.InferenceSession(onnx_path, providers=['CPUExecutionProvider'])

# Outputs will be a list of tensors
input = ...
outputs = model.run(None, {'input': input})

# You can convert the outputs to a dict keyed by output name
output_names = [...]
output_dict = {}
for i in range(len(output_names)):
    output_dict[output_names[i]] = outputs[i]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quantising the model is also possible. This reduces the model to 72 MB using float16 calculations. However, it increased FIT to 0.65 seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import onnx
from onnxconverter_common import float16

onnx_path = ...
onnx_model = onnx.load(onnx_path)

onnx_f16_path = ...
onnx_model_f16 = float16.convert_float_to_float16(onnx_model)
onnx.save(onnx_model_f16, onnx_f16_path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ONNXruntime &lt;a href="https://github.com/microsoft/onnxruntime/issues/10038" rel="noopener noreferrer"&gt;ARM Lambda issues&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Ensure the CPU architecture and OS of your CI/CD and cloud servers match. This avoids incompatibilities between the development environment and the cloud environment.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At the time of writing this article, ARM Lambda doesn’t provide CPU info from “/sys/devices” for ONNXruntime calculation optimisation. I suspect that CPU information is not supported for the proprietary Graviton Chip.&lt;/p&gt;

&lt;p&gt;Currently, x86_64 Lambdas do work with ONNXruntime. A CI/CD workflow on an x86_64 server is an easy way to deploy x86_64 Lambdas from Apple Silicon — or any ARM machine.&lt;br&gt;
Some Python libraries like ONNXruntime depend on binaries, so where the code is deployed from matters.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why not GPU acceleration?
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;For on-demand, parallelised inferencing, consider Step Function Maps with CPU.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;GPU acceleration seemed like an obvious solution. However, the bottleneck for inferencing in this application is concurrency. Scaling up hundreds of GPU inferences quickly is challenging. Furthermore, GPUs are expensive.&lt;/p&gt;

&lt;p&gt;A quantised ONNX model on x86_64 Lambda with 1.7 vCPUs achieves a FIT of 2.3 seconds. The cold start FIT is around 10 seconds.&lt;/p&gt;

&lt;p&gt;On Google Colab, a V100 GPU achieves a FIT of around 2.5 seconds. Combined with the GPU scaling issues above, serverless CPU inferencing is the more suitable solution.&lt;/p&gt;
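&lt;p&gt;For reference, the parallel fan-out can be expressed as a Step Functions &lt;code&gt;Map&lt;/code&gt; state. This is a minimal sketch; the state names and Lambda ARN are hypothetical:&lt;/p&gt;

```json
{
  "TranscribeSegments": {
    "Type": "Map",
    "ItemsPath": "$.segments",
    "MaxConcurrency": 0,
    "Iterator": {
      "StartAt": "InferSegment",
      "States": {
        "InferSegment": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:infer-segment",
          "End": true
        }
      }
    },
    "End": true
  }
}
```

&lt;p&gt;&lt;code&gt;MaxConcurrency: 0&lt;/code&gt; places no limit on parallel iterations, so each audio segment runs through its own Lambda invocation.&lt;/p&gt;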


&lt;h1&gt;
  
  
  From Docker to zipping static binaries
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Use Docker images for rapid prototyping, then zipped files for optimisation.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Docker speeds up deployments since only changed layers need to be uploaded. But the downside is massive images that increase cold start time. My PyTorch Docker image was 1.9 GB and the ONNX image was 1.1 GB.&lt;/p&gt;

&lt;p&gt;In contrast, zipped deployments are small, leading to small cold start times. However, installing Python packages, zipping and uploading… takes so much time, especially when you are debugging.&lt;/p&gt;
&lt;h3&gt;
  
  
  Packaging static FFmpeg
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Package static binaries instead of apt download where possible to reduce size.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The naive method is to install FFmpeg through apt, which adds over 300 MB even with recommended installs disabled and multi-stage builds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Adds over 300 Mb to Docker file (Debian)
RUN apt update –no-install-recommends &amp;amp;&amp;amp; \
    apt install -y ffmpeg &amp;amp;&amp;amp; \
    apt clean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While &lt;a href="https://pypi.org/project/static-ffmpeg/" rel="noopener noreferrer"&gt;lazy-loaded FFmpeg libraries&lt;/a&gt; do exist, Lambda restricts file writing to the “/tmp” folder. An alternative strategy is to use an &lt;a href="https://johnvansickle.com/ffmpeg/" rel="noopener noreferrer"&gt;FFmpeg static binary&lt;/a&gt; tailored to the OS and CPU architecture. Simply invoke the binary by its path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 300 Mb apt install is equivalent to 78 Mb FFmpeg binary
path/to/ffmpeg/binary ... arg1 arg2

# To use the 'ffmpeg' command, create an alias, assuming it's not taken
echo 'alias ffmpeg="path/to/ffmpeg/binary"' &amp;gt; ~/.bashrc
source ~/.bashrc

ffmpeg ... arg1 arg2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The binary is around 78 MB, so be sure to maximise its usage. For example, there is no need to write audio downsampling code when FFmpeg can downsample much faster.&lt;/p&gt;
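&lt;p&gt;For example, the static binary can be invoked from Python for downsampling. This is a minimal sketch; the binary path and arguments are assumptions, not the project's actual code:&lt;/p&gt;

```python
import subprocess

FFMPEG = "/opt/ffmpeg"  # hypothetical path to the bundled static binary

def downsample_cmd(src, dst, rate=16000):
    """Build an FFmpeg command that downsamples audio to mono at the given rate."""
    return [FFMPEG, "-y", "-i", src, "-ar", str(rate), "-ac", "1", dst]

def downsample(src, dst, rate=16000):
    """Run the bundled FFmpeg binary to downsample src into dst."""
    subprocess.run(downsample_cmd(src, dst, rate), check=True)
```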




&lt;h1&gt;
  
  
  Double/float64 is faster than float32 or float16
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Prefer a type that has native hardware support.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Float16 is much smaller than float64, so it should be faster? — NOT NECESSARILY! Whatever the hardware natively supports is faster. For instance, Lambda x86_64 runs on Intel Xeon and AMD EPYC chips, which natively support float64 calculations and matrix dot products.&lt;/p&gt;

&lt;p&gt;To illustrate, postprocessing example.mp3 completes in 0.107 seconds using np.float32 on M1 Pro. This is reduced to 0.014 seconds using np.float64. Float64 turned out to be 7.5x faster than float32 or float16.&lt;/p&gt;

&lt;p&gt;Even when deployed to the cloud, an 8-minute audio file is transcribed in 60 seconds using float16. Float64 not only has higher precision but also transcribes it in 28 seconds.&lt;/p&gt;
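&lt;p&gt;A quick way to test this on your own hardware is to time a matrix product at each dtype. This sketch assumes NumPy is available; the matrix size and repeat count are arbitrary:&lt;/p&gt;

```python
import timeit
import numpy as np

def dot_time(dtype, n=256, repeats=10):
    """Average time of an n x n matrix product at the given dtype."""
    rng = np.random.default_rng(0)
    a = rng.random((n, n)).astype(dtype)
    b = rng.random((n, n)).astype(dtype)
    return timeit.timeit(lambda: a @ b, number=repeats) / repeats

# Compare dtypes on the current machine
for dtype in (np.float16, np.float32, np.float64):
    print(dtype.__name__, dot_time(dtype))
```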




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;The most important lesson is to measure performance rather than blindly listen to others. For all you know, GPU inferencing may be faster for one use case and CPU for another. Questioning assumptions and believing that something can be improved eventually led to a 2x inferencing improvement — 5x faster than inferencing on an M1 Pro for an 8-minute audio file!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>lambda</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Fast piano transcription on AWS - Part 1</title>
      <dc:creator>Meng Lin</dc:creator>
      <pubDate>Mon, 09 Oct 2023 00:34:14 +0000</pubDate>
      <link>https://forem.com/menglinmaker/fast-piano-transcription-on-aws-part-1-3jhg</link>
      <guid>https://forem.com/menglinmaker/fast-piano-transcription-on-aws-part-1-3jhg</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;"The best way to learn jazz is to listen to it.”— Oscar Peterson&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sometimes I marvel at beautiful music, wondering how to reproduce it. This process is called “Playing by Ear” or “Transcription”. Just as a child learns to speak by listening first, music can be learnt the same way.&lt;/p&gt;

&lt;h3&gt;
  
  
  But there’s a problem…
&lt;/h3&gt;

&lt;p&gt;Transcribing music is time-consuming. It takes a professional musician roughly 4–60 minutes to transcribe 1 minute of music. But beginning musicians who benefit most from transcriptions may not have the skills to transcribe.&lt;/p&gt;

&lt;h3&gt;
  
  
  We need a computational transcription method
&lt;/h3&gt;

&lt;p&gt;To illustrate the transcription process, I will use &lt;a href="https://www.youtube.com/watch?v=Ie52xH8V2L4" rel="noopener noreferrer"&gt;Bach’s Passacaglia C Minor (BWV 582)&lt;/a&gt; as an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmq4i64c3f04j77c822u4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmq4i64c3f04j77c822u4.png" alt="Modified image from David Abbot’s Understanding Sound" width="700" height="67"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can identify the energy of each frequency over time with a spectrogram. But the actual notes played are shown in green. Somehow we need to identify the fundamental frequency of each note played.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcr2gm9f800retyq5tlb5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcr2gm9f800retyq5tlb5.png" alt="Modified image from David Abbot’s Understanding Sound" width="700" height="377"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  AI to the rescue!
&lt;/h1&gt;

&lt;p&gt;Luckily some employees from ByteDance (TikTok’s parent company) were working on a piano transcription algorithm. The result is a Python module for inferencing that converts audio files to midi: &lt;a href="https://github.com/qiuqiangkong/piano_transcription_inference" rel="noopener noreferrer"&gt;https://github.com/qiuqiangkong/piano_transcription_inference&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How does this module work?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Preprocess:&lt;br&gt;
– Downsample the original audio.&lt;br&gt;
– Separate the audio into segments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inference:&lt;br&gt;
– Generate a spectrogram from each audio segment.&lt;br&gt;
– Perform CNN inferencing on the spectrogram.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Postprocess:&lt;br&gt;
– Stitch the rough midi output together.&lt;br&gt;
– Perform regression to find the most likely midi events.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
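&lt;p&gt;The segmentation step in preprocessing can be sketched as follows. This is a minimal illustration, not the module's actual implementation:&lt;/p&gt;

```python
def split_segments(samples, segment_len):
    """Split a list of audio samples into fixed-length segments,
    zero-padding the final segment."""
    segments = []
    for start in range(0, len(samples), segment_len):
        chunk = samples[start:start + segment_len]
        # Pad the last chunk so every segment has the same length
        chunk = chunk + [0] * (segment_len - len(chunk))
        segments.append(chunk)
    return segments
```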

&lt;h1&gt;
  
  
  Core backend architecture
&lt;/h1&gt;

&lt;p&gt;An ideal piano transcription service would be &lt;em&gt;blazingly fast&lt;/em&gt;. So this would be the main focus of architecture exploration below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Local server on M1 Pro — v0
&lt;/h3&gt;

&lt;p&gt;A few observations when running locally:&lt;br&gt;
– Enabling torch multicore processing speeds up inferencing.&lt;br&gt;
– Up to 3 cores are used; beyond that there is no improvement in speed or utilisation.&lt;br&gt;
– Transcription rate of 0.5 seconds per 1 second of audio.&lt;/p&gt;

&lt;p&gt;Running locally gives some idea of how performance could be improved before applying to the cloud. I find that iterating on the cloud takes much longer than local development.&lt;/p&gt;

&lt;h3&gt;
  
  
  Naive server — v1
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14kjg7a2la5ba2keuga9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14kjg7a2la5ba2keuga9.png" alt="Naive server architecture" width="284" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although this works, there are many limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;API Gateway limits payloads to 6 MB — equivalent to roughly a 6-minute mp3 file.&lt;/li&gt;
&lt;li&gt;Lambda’s computation speed is slow — 3 seconds per 1 second of audio.&lt;/li&gt;
&lt;li&gt;API Gateway enforces a 30-second timeout — can transcribe up to 10 seconds of audio.&lt;/li&gt;
&lt;/ol&gt;
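&lt;p&gt;The 6 MB figure corresponds to roughly 6 minutes of audio, assuming a 128 kbps mp3. A quick back-of-envelope check:&lt;/p&gt;

```python
payload_mb = 6
bitrate_kbps = 128

# 6 MB of payload = 48,000,000 bits; at 128,000 bits/second that is 375 seconds
seconds = payload_mb * 8 * 1_000_000 / (bitrate_kbps * 1000)
print(seconds / 60)  # 6.25 minutes
```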

&lt;h3&gt;
  
  
  Monolithic server — v2
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F509o1j7p3idm64iydnue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F509o1j7p3idm64iydnue.png" alt="Monolithic server" width="703" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Presigned POST URLs can upload files of up to 5 GB to S3. Lambda can download (to the /tmp folder only) and upload to S3 using the AWS SDK.&lt;/p&gt;

&lt;p&gt;Lambda has a 900-second timeout — it can transcribe up to 5 minutes of audio.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step Function server — v3
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgtx9kuocueoj1609ped.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgtx9kuocueoj1609ped.png" alt="Step Function server" width="703" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Step Function orchestrator can run inference lambdas in parallel — this prevents lambda timeouts, so over 1 hour of audio can be transcribed:&lt;br&gt;
– Inference takes 12.5 seconds for a 5-second segment.&lt;br&gt;
– Can achieve a transcription rate of 0.25 seconds per 1 second of audio.&lt;/p&gt;




&lt;h1&gt;
  
  
  Credits:
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.meetup.com/en-AU/aws-aus/" rel="noopener noreferrer"&gt;Melbourne AWS User Group&lt;/a&gt; — inspired me to use Step Functions.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@menglinmaker/piano-transcription-saas-on-aws-ed73ac9c51d" rel="noopener noreferrer"&gt;Melbourne Serverless Meetup&lt;/a&gt; — exposed me to serverless architecture.&lt;/li&gt;
&lt;li&gt;Figma — image editor + &lt;a href="https://www.figma.com/community/file/989585391556898521/aws-cloud-diagram-template" rel="noopener noreferrer"&gt;AWS icons&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>stepfunctions</category>
      <category>serverless</category>
    </item>
  </channel>
</rss>
