Forem: somyabhalani

Tile Extractor

somyabhalani — Sat, 23 May 2026 15:03:13 +0000

Parsing the Unparsable: Building a Layout-Aware Computer Vision Pipeline for 50,000+ Stone SKUs

Executive Summary

The stone and marble industry operates on visual catalogs. Manufacturers publish hundreds of pages of PDF catalogs showing marble slabs, tile patterns, texture variations, and dimension tables. For digital inventory platforms and wholesalers, extracting these products to populate databases is a massive bottleneck.

Standard OCR (Optical Character Recognition) tools break down on these catalogs because the pages are highly visual: they contain complex grid structures in which product images are only loosely aligned with text descriptions, dimensions, and SKU codes. Ananta Labs was hired to design a layout-aware computer vision and text parsing pipeline that ingests multi-page catalogs, segments individual product tiles, extracts the text belonging to each tile, and outputs clean, database-ready JSON arrays. The target was 95%+ accuracy across a database of 50,000+ unique marble and stone SKUs.

The Architecture: Segmentation-First Parsing

Traditional text extraction tools parse documents top-to-bottom, left-to-right. In a product catalog, this approach merges the text of Slab A with the dimensions of Slab B.

To prevent data mismatch, we implemented a segmentation-first approach. Instead of reading the document as text, we treat each catalog page as an image canvas, locate the individual physical grid cells (tiles), isolate them, and then run OCR within the boundaries of each isolated cell.

Project Metrics & Impact

Throughput: Processing a standard 100-page catalog (roughly 1,200 product variations) took under 180 seconds, or less than 2 seconds per page.
Accuracy: Out of 50,000+ processed stone tiles, our layout segmentation maintained an extraction accuracy of 96.4%.
Human Verification: Manual data entry time dropped by 94%, shifting the operator's role from transcription to reviewing extracted records on a visual validation screen in the admin UI.

Step 1: Document Rasterization and Pre-processing

We use PyMuPDF to rasterize each incoming PDF page into a 300 DPI PNG so that fine print stays legible. Pages are converted one at a time. Rendering above the PDF baseline of 72 DPI (a zoom factor of roughly 4.2x) gives the OCR engine larger, cleaner characters to work with.
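A minimal sketch of this rasterization step, assuming PyMuPDF is installed and importable as fitz. The function names, the output file naming scheme, and the output directory are illustrative assumptions, not the pipeline's actual code.

```python
from pathlib import Path

PDF_BASE_DPI = 72  # PDF user space is defined at 72 points per inch


def zoom_for_dpi(target_dpi: int) -> float:
    """Scale factor that turns the 72 DPI PDF baseline into target_dpi."""
    if target_dpi <= 0:
        raise ValueError("target_dpi must be positive")
    return target_dpi / PDF_BASE_DPI


def rasterize_pdf(pdf_path: str, out_dir: str, dpi: int = 300) -> list[Path]:
    """Render every page of pdf_path to a PNG file and return the paths."""
    import fitz  # PyMuPDF; imported lazily so zoom_for_dpi works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # same zoom on both axes

    paths = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            # alpha=False yields an opaque RGB image, which OCR engines expect
            pix = page.get_pixmap(matrix=matrix, alpha=False)
            target = out / f"page_{index + 1:04d}.png"  # hypothetical naming
            pix.save(str(target))
            paths.append(target)
    return paths
```

Calling rasterize_pdf("catalog.pdf", "pages") would write one PNG per page. Rendering page by page keeps memory bounded, since only one pixmap is alive at a time.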

Step 2: Contour Detection & Grid Cell Isolation

Catalog pages usually group slab images and SKU data inside visual grid cells or boxes. We use computer vision (OpenCV) to detect these bounding boxes:

Binarization: Convert the page image to grayscale and apply adaptive thresholding to isolate boundaries.
Morphological Operations: Apply long, thin horizontal and vertical kernels so that only solid horizontal and vertical grid lines survive, producing a clean binary mask of the catalog layout.
Contour Extraction: Find contours on the grid mask and filter out shapes that are too small (noise) or too large (page borders).
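The three steps above can be sketched with OpenCV roughly as follows. This is an illustration under assumptions rather than the production code: the threshold block size, the kernel lengths (1/30 of the page dimension), and the area limits in filter_cells are hypothetical values that would need tuning per catalog.

```python
def filter_cells(boxes, page_w, page_h, min_frac=0.005, max_frac=0.5):
    """Keep (x, y, w, h) boxes whose area is a plausible share of the page.

    min_frac drops noise, max_frac drops page borders. Both are assumed values.
    """
    page_area = page_w * page_h
    kept = [
        b for b in boxes
        if min_frac * page_area <= b[2] * b[3] <= max_frac * page_area
    ]
    # Rough reading order: top to bottom, then left to right
    return sorted(kept, key=lambda b: (b[1], b[0]))


def detect_grid_cells(page_bgr):
    """Return candidate cell boxes for a page given as a BGR NumPy array."""
    import cv2  # imported lazily so filter_cells can be used without OpenCV

    # Binarization: dark grid lines become white (255) on a black background
    gray = cv2.cvtColor(page_bgr, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY_INV, 15, 10
    )

    # Morphological opening with long, thin kernels keeps only long lines
    h, w = gray.shape
    horiz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (max(w // 30, 1), 1))
    vert_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, max(h // 30, 1)))
    horiz = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horiz_kernel)
    vert = cv2.morphologyEx(binary, cv2.MORPH_OPEN, vert_kernel)
    grid_mask = cv2.bitwise_or(horiz, vert)

    # RETR_CCOMP also returns the holes enclosed by grid lines (the cells).
    # Indexing with [-2] works for both OpenCV 3.x and 4.x return shapes.
    contours = cv2.findContours(
        grid_mask, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE
    )[-2]
    boxes = [cv2.boundingRect(c) for c in contours]
    return filter_cells(boxes, w, h)
```

Separating the size filter from the OpenCV calls makes the thresholds easy to unit test and adjust without re-running detection.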

Step 3: Isolated OCR and Data Normalization

Once we have the coordinates (x, y, w, h) of each tile cell, we crop the image of the stone slab from the top half of the cell, crop the text area from the bottom half, and run OCR exclusively on the cropped text area.

Because OCR runs on a small, isolated box rather than the whole page, the extracted SKU, finish (polished/honed), and size can only come from the same cell as the cropped slab image, which prevents text from one product being attached to another.
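As a rough illustration of the per-cell crop, the sketch below splits a cell into an image box and a text box and runs OCR only on the text box. The post does not name the OCR engine, so pytesseract is an assumption here, as are the Box type, the helper names, and the 50/50 split ratio.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x: int
    y: int
    w: int
    h: int


def split_cell(cell: Box, image_ratio: float = 0.5) -> tuple[Box, Box]:
    """Split a cell into (image_box, text_box), image on top, text below."""
    split = int(cell.h * image_ratio)
    image_box = Box(cell.x, cell.y, cell.w, split)
    text_box = Box(cell.x, cell.y + split, cell.w, cell.h - split)
    return image_box, text_box


def crop(page, box: Box):
    """Crop a NumPy-style image array indexed as [row, column]."""
    return page[box.y:box.y + box.h, box.x:box.x + box.w]


def read_cell(page, cell: Box):
    """Return (slab_image, raw_text) for one cell of the page."""
    import pytesseract  # assumed OCR engine; the source does not specify one

    image_box, text_box = split_cell(cell)
    slab_image = crop(page, image_box)
    # OCR sees only the text region of this cell, never neighbouring cells
    raw_text = pytesseract.image_to_string(crop(page, text_box))
    return slab_image, raw_text.strip()
```

Since both crops derive from the same cell coordinates, the slab image and its text stay paired by construction.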

Key Engineering Challenges Solved

1. The Borderless Grid Problem

Some catalogs do not have visible grid lines; they display product images floating on a white page with text underneath. When morphological grid detection returns zero cells, the pipeline switches to a clustering-based layout analyzer. We use projection profiles (scanning rows and columns for white-space gaps) to compute virtual grid lanes, and the intersections of those lanes become the bounding boxes for individual products.
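A simplified, dependency-free sketch of the projection-profile idea follows. It assumes the page has already been reduced to a 2D grid of 0/1 values (1 for a non-white pixel), and it uses one global column profile, whereas a real catalog would probably need a column profile per row band. The function names and the min_gap value are illustrative.

```python
def content_runs(profile, min_gap=3):
    """Return (start, end) ranges where the profile is non-zero.

    Runs separated by fewer than min_gap empty positions are merged, so
    small gaps inside one product do not split it into two lanes.
    """
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def virtual_grid(ink):
    """Compute (x, y, w, h) zones from a 2D list of 0/1 ink values."""
    row_profile = [sum(row) for row in ink]        # ink per scanline
    col_profile = [sum(col) for col in zip(*ink)]  # ink per pixel column
    rows = content_runs(row_profile)
    cols = content_runs(col_profile)
    # Every row lane crossed with every column lane gives one virtual cell
    return [(x0, y0, x1 - x0, y1 - y0) for y0, y1 in rows for x0, x1 in cols]
```

With NumPy available, the two profiles could be computed as sums along each axis of a binary mask; the lane logic stays the same.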

2. Text-to-Data Normalization

OCR returns raw strings such as "Volacas Wt (Pol) 60x120cm - SKU9087". We pass this output through a regex parser and a lightweight local dictionary-matching layer. The parser strips punctuation, standardizes measurements (for example, 600x1200mm and 60x120 are both converted to the same standard metric floats), and maps stone colors and finishes to database-ready enumerations (Material: Marble, Color: White, Finish: Polished).
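The normalization layer might look something like the sketch below. The regular expressions, the tiny lookup dictionaries, the output field names, and the choice of millimetres as the standard unit are all assumptions for illustration. In particular, treating a bare size such as 60x120 as centimetres is a guess based on the example in the text.

```python
import re

# Hypothetical lookup tables; a real dictionary layer would be much larger
FINISHES = {"pol": "Polished", "polished": "Polished",
            "hon": "Honed", "honed": "Honed"}
COLORS = {"wt": "White", "white": "White"}
UNIT_TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}

SIZE_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*[x×]\s*(\d+(?:\.\d+)?)(?:\s*(mm|cm|m)\b)?",
    re.IGNORECASE,
)
SKU_RE = re.compile(r"\bSKU\s*[-:#]?\s*(\d+)\b", re.IGNORECASE)
FINISH_RE = re.compile(r"\(\s*([A-Za-z]+)\.?\s*\)")


def parse_label(raw: str) -> dict:
    """Turn one raw OCR string into a normalized record."""
    record = {"sku": None, "color": None, "finish": None,
              "width_mm": None, "height_mm": None}

    sku = SKU_RE.search(raw)
    if sku:
        record["sku"] = f"SKU{sku.group(1)}"

    finish = FINISH_RE.search(raw)
    if finish:
        record["finish"] = FINISHES.get(finish.group(1).lower())

    # Dictionary match on alphabetic tokens for the color abbreviation
    tokens = re.findall(r"[A-Za-z]+", raw.lower())
    record["color"] = next((COLORS[t] for t in tokens if t in COLORS), None)

    size = SIZE_RE.search(raw)
    if size:
        unit = (size.group(3) or "cm").lower()  # assumption: bare sizes are cm
        factor = UNIT_TO_MM[unit]
        record["width_mm"] = float(size.group(1)) * factor
        record["height_mm"] = float(size.group(2)) * factor
    return record
```

Fields that cannot be parsed stay None rather than being guessed, which gives the human reviewer an obvious signal about which records need attention.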

Conclusion

Parsing highly visual document layouts requires moving beyond raw character recognition. By combining traditional computer vision techniques (adaptive thresholding, morphological operations, contour detection) with targeted, localized OCR, Tile Extractor turns chaotic catalogs into clean, standardized, database-ready data. Building systems that bridge the gap between unstructured visual media and structured databases is at the core of what we do at Ananta Labs.

How We Automated Catalog Image Extraction using Computer Vision & FastAPI

somyabhalani — Tue, 19 May 2026 19:24:39 +0000

For businesses in the stone, marble, and interior design industries, managing digital catalog assets is a massive headache.

When a new product catalog arrives as a 100-page PDF, design teams spend hours manually cropping out individual tile samples to upload to their websites or inventory sheets.

To automate this, we built Tile Extractor—a high-performance, automated parsing engine designed specifically to isolate tile samples from raw catalog documents.

How it Works (Under the Hood)

PDF Ingestion: The system uses a FastAPI backend to ingest multi-page PDFs. We process the pages using PyMuPDF to extract raw page vectors and high-res layout structures.
Object Detection & Border Cleaning: Instead of relying on slow, expensive cloud Vision APIs, we use local Pillow and OpenCV-based spatial algorithms. The engine analyzes:
- Edge density to isolate individual tile boundaries.
- Aspect ratios to filter out page noise (like page numbers or logos).
- Color distributions using RGB histograms.
Lossless Cropping: Once a tile is identified and classified, the engine crops it directly from the PDF's high-resolution asset stream, so the output keeps the full source resolution.
Batch ZIP Packaging: The isolated tile PNGs are packaged into a single ZIP file and returned to the user instantly.
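Two of the steps above, the aspect-ratio filter and the ZIP packaging, can be sketched with the Python standard library alone. The ratio thresholds and function names are hypothetical, and the tile names are placeholders.

```python
import io
import zipfile


def plausible_tile(w: int, h: int, min_ratio: float = 0.5,
                   max_ratio: float = 2.0) -> bool:
    """Reject boxes whose shape looks like a page number, rule, or logo strip."""
    if w <= 0 or h <= 0:
        return False
    return min_ratio <= w / h <= max_ratio


def package_tiles(tiles: dict[str, bytes]) -> bytes:
    """Pack {filename: png_bytes} into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    # PNG data is already compressed, so store entries without recompressing
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        for name, png_bytes in sorted(tiles.items()):
            archive.writestr(name, png_bytes)
    return buffer.getvalue()
```

In a FastAPI endpoint, the returned bytes could be sent back with Response(content=..., media_type="application/zip"). Building the archive in memory avoids temporary files, which is reasonable for a batch of tile-sized PNGs.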

Why it Matters for B2B Automation

What used to take a human designer 2 hours now takes our engine 5 seconds. By running localized computer vision algorithms instead of cloud APIs, we eliminate usage fees and keep client data fully private.

If your business manages product catalogs, you can try the tool for free here:

👉 Try Tile Extractor: https://tile-extractor.onrender.com
👉 Explore our work: https://anantalabs.app/

How We Built a Contactless Digital Signature App inside the Browser (No Servers, 100% Private)

somyabhalani — Tue, 19 May 2026 19:21:46 +0000

Traditional digital signature platforms have two major issues: privacy and cost.

To sign a document, you have to upload sensitive agreements to a third-party server. And as a developer, running server-side document rendering and signatures can lead to heavy API bills and database management overhead.

At Ananta Labs, we wanted to see if we could build a completely secure, contactless alternative that runs entirely on the client side using browser-native AI.

Here is how we built AirSign.

The Architecture: 100% Client-Side

Instead of hosting heavy machine learning models on a GPU server, we compiled our hand-tracking models to run locally inside the user's browser.

Gesture Capture: We use MediaPipe's hand-landmarker models compiled to WebAssembly. This lets the browser track 21 3D hand landmarks in real time at 30 FPS using a standard webcam.
Contactless Canvas: Using WebGL, we map the index fingertip coordinate to an HTML5 canvas. We implemented a custom interpolation algorithm to smooth out hand jitter and render a fluid, realistic signature line.
Local PDF Stamp: Once the user finishes drawing their signature in the air, we generate the final document. The signature coordinate vector is parsed and stamped onto the PDF using a client-side library.
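The app itself runs in the browser, and the custom interpolation algorithm is not described in detail. To illustrate the general idea of jitter smoothing, here is a small Python sketch of one common technique, an exponential moving average over fingertip positions. It is a stand-in chosen for clarity, not AirSign's actual algorithm, and the alpha value is an assumption.

```python
def smooth_path(points, alpha=0.35):
    """Exponentially smooth a sequence of (x, y) fingertip positions.

    alpha close to 1 follows the raw input tightly; smaller values
    suppress more jitter at the cost of a slight lag.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")

    smoothed = []
    for x, y in points:
        if not smoothed:
            smoothed.append((x, y))  # first point has nothing to blend with
        else:
            px, py = smoothed[-1]
            # Move a fraction alpha of the way from the last smoothed point
            smoothed.append((px + alpha * (x - px), py + alpha * (y - py)))
    return smoothed
```

For example, smooth_path([(0, 0), (10, 0), (10, 10)], alpha=0.5) pulls each new point halfway toward the raw position, which rounds off the sharp corner in the input.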

Why this is the Future of AI Integration

By moving the computation from the server to the client:

Absolute Privacy: No video frames, coordinate points, or document bytes are transmitted to any server or database.
Zero Server Overhead: The hosting cost for this app is exactly $0 since it runs on the user's CPU/GPU.
Zero Network Latency: Signature interpolation runs locally, so no network round-trip sits between the hand movement and the rendered line.

Try it Yourself

AirSign is completely open and free to test. We’d love to hear your feedback on the hand-tracking latency and mobile performance:

👉 Try AirSign: https://airsign-red.vercel.app/
👉 Explore our work: https://anantalabs.app/