Forem: Cara Jung

Building a Unified Korean Entertainment Database from 10 APIs and Web Scrapers

Cara Jung — Sat, 09 May 2026 20:56:22 +0000

Korean entertainment has become a global phenomenon with shows such as Squid Game breaking records and K-dramas topping global charts. And yet, the data infrastructure behind it is fragmented.

Getting complete data on a single Korean show or film — cast, ratings (Korean and international), episode viewership numbers, where to stream it, what awards it won, its OST albums — requires hopscotching different websites.

The issue is that dominant platforms like NAVER and Melon lack English-first APIs. As Session Zero points out in this article, Korean data is heavily underserved in MCP ecosystems because when Western developer tools and AI systems are built, Korean platforms are often invisible by default.

The data exists. But it’s trapped behind language barriers, undocumented endpoints, JavaScript-rendered pages, and closed ecosystems. So while AI agents can easily retrieve structured information about Hollywood movies, Spotify charts, or IMDb ratings, asking the same systems about Korean dramas, OSTs, or Korean audience sentiment often returns incomplete results or nothing at all.

So I decided to build a unified database to fix it.

The Data Landscape

Korean entertainment data splits along two axes: language (Korean vs. English sources) and type (official vs. community vs. streaming).

English-language sources

TMDB is the closest thing to a comprehensive English-language database for Korean content. It has structured data on tens of thousands of Korean films and shows, a stable API, and community ratings. But it lacks Korea-specific data: no verified Korean audience scores, no Nielsen viewership ratings, no Korean box office data, no OST information.

MyDramaList fills a critical gap that TMDB misses entirely: community tags. MDL users have tagged Korean dramas with labels such as "Bromance", "Time Travel", "CEO Male Lead", and "Found Family." No official database captures that taxonomy. MDL also tracks airing status more accurately than TMDB for Korean dramas.

HanCinema has the deepest historical coverage of Korean content in English, including films from the 1950s through 1990s that TMDB barely covers.

JustWatch is the most reliable real-time source for streaming availability. TMDB's streaming data lags reality by weeks. JustWatch checks 364 services daily.

Wikipedia has rich content for major Korean films and shows, including detailed plot summaries, production history, and cultural reception sections that no English entertainment database captures.

Korean-language sources

Here's where things get interesting and painful.

NAVER is Korea's dominant search engine and entertainment portal. Search for any Korean film on NAVER and you'll get a rich information card with two ratings that don't exist anywhere else:

실관람객 평점 (Verified ticket buyer rating): Only people who purchased cinema tickets through affiliated platforms can rate. This is Korea's equivalent of a verified purchase review.
네티즌 평점 (Netizen rating): Korean general public rating.

These ratings often diverge significantly from international scores. Parasite has a 9.08 verified buyer rating on NAVER versus 8.5 on TMDB. The Korean audience who saw it in theaters rated it exceptionally highly.

NAVER also has per-episode Nielsen Korea viewership ratings for TV dramas, the official broadcast ratings that Korean media report on weekly. No English-language source has this data structured and queryable.

The critical catch: NAVER has no public API for any of this. Their entertainment data is rendered dynamically in JavaScript, served through their search interface, and entirely undocumented. Every data point requires a browser.

KOBIS (Korean Film Council) is the exception. It has an official government API that provides authoritative weekly and daily box office rankings. It's the only Korean government data source with a proper REST API.

Building the Scrapers

The Playwright Problem

Most of the Korean data sources render content through JavaScript. Static HTML requests return empty shells. This meant nearly every scraper needed a real browser.

To address this, I used Playwright with Chromium headless across all JS-rendered sources. The setup is consistent:

import time

from playwright.sync_api import sync_playwright

def _get_page_html(url: str, wait_selector: str = "body") -> str:
    """Render a JS-heavy page in headless Chromium and return its HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
            locale="ko-KR",
        )
        page = context.new_page()
        page.goto(url, wait_until="domcontentloaded")
        page.wait_for_selector(wait_selector)
        time.sleep(2)  # let lazy content settle
        html = page.content()
        browser.close()
    return html

The locale="ko-KR" matters for NAVER since it ensures Korean content is served rather than any region-specific variant.

The NAVER Genre Problem

One of the more unexpected parsing challenges came from NAVER's movie information card. Genre, country, and runtime appeared concatenated: 공포대한민국95분 (Horror South Korea 95min). They were in a single dd tag separated by invisible span.cm_bar_info elements.

The fix was to replace the separator spans with pipe characters before splitting:

first_dd = info_groups[0].select_one("dd")
if first_dd:
    for span in first_dd.select("span"):
        span.replace_with("|")
    segments = [s.strip() for s in first_dd.get_text().split("|") if s.strip()]
    result["genre"] = segments[0] if segments else None
    result["country"] = segments[1] if len(segments) > 1 else None

Extracting Nielsen Ratings from SVG

The trickiest scraping problem was NAVER's episode viewership chart. The data is rendered as an interactive SVG chart where viewership percentages, episode numbers, and air dates are all inside SVG text elements.

from bs4 import BeautifulSoup

def _parse_episode_chart(soup: BeautifulSoup) -> list[dict]:
    # Rating values from bb-text elements inside the SVG
    rating_texts = soup.select("g.bb-texts-rank text.bb-text")
    ratings = []
    for t in rating_texts:
        val = t.get_text(strip=True)
        try:
            f = float(val)
            if f > 0:
                ratings.append(f)
        except ValueError:
            pass

    # Episode numbers and dates from x-axis ticks
    x_ticks = soup.select("g.bb-axis-x g.tick")
    ep_labels = []
    for tick in x_ticks:
        tspans = tick.select("tspan")
        if len(tspans) >= 2:
            ep_num = _parse_episode_num(tspans[0].get_text(strip=True))
            date_text = tspans[1].get_text(strip=True)
            if ep_num and date_text:
                ep_labels.append({"episode": ep_num, "date": date_text})

    return [
        {"episode": ep["episode"], "air_date": ep["date"], "rating": ratings[i]}
        for i, ep in enumerate(ep_labels)
        if i < len(ratings)
    ]

For Goblin, this gives us per-episode Nielsen ratings:

Ep 1 (12.02.): 6.3%
Ep 8 (12.24.): 12.3%
Ep 16 (01.21.): 20.5%

No English-language API has this data.

The JustWatch Shadow DOM Problem

JustWatch uses Web Components with Shadow DOM for their streaming offer cards. The score and provider data that appears in the browser is inside <slot> elements that don't render in server-side HTML:

<div class="score-wrap">
  <div class="critics-score-wrap">
    <slot name="critics-score"></slot>  <!-- empty in scraped HTML -->
  </div>
</div>

However, the streaming offers themselves (provider names, prices, monetization types) render in the regular DOM inside div.buybox-selector a.offer elements. The key insight was that the offers were accessible even though the score slots weren't.

Extracting the actual streaming URLs from JustWatch's redirect links required parsing the r= parameter:

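from urllib.parse import parse_qs, unquote, urlparse

def _extract_redirect_url(href: str) -> str:
    """Return the target URL carried in the r= query parameter, else href."""
    parsed = urlparse(href)
    params = parse_qs(parsed.query)  # parse_qs already percent-decodes once
    r = params.get("r", [None])[0]
    # unquote() is a second decode; it only matters if r was double-encoded
    return unquote(r) if r else href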

Awards Parsing: Five Ceremonies, Three Formats

Korean drama and film awards span five major ceremonies, each with slightly different HTML structure. I scraped all of them from AsianWiki plus the official Baeksang Awards site.

The unexpected challenge was that award categories use different winner formats depending on ceremony type:

Drama awards: Person ("Show Title") → links[0] is person, links[1] is show
Film awards: "Film Title" → title-only, no person/show split
Blue Dragon Series: "Title" (Platform) → title plus streaming platform

The format detection:

value_text = item_text.replace(bold_text, "").strip().lstrip(":").strip()
if value_text.startswith('"'):
    # Film/series format: title only
    current_category["winner"] = links[0].get_text(strip=True)
    current_category["winner_show"] = None
else:
    # Drama format: person + show
    current_category["winner"] = links[0].get_text(strip=True)
    if len(links) > 1:
        current_category["winner_show"] = links[1].get_text(strip=True)

The search function also deduplicates winners who appear in the nominees list. AsianWiki repeats the winner among the nominees, so a naive search returns duplicate entries:

# Skip if this is the same person/title as winner (dedup)
if nom_name == winner_name and winner_matches:
    continue
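
To make the three winner formats concrete, here is a standalone sketch that classifies the plain-text value with regular expressions instead of link positions. It is not the scraper's actual code: the `parse_winner` name, the `platform` field, and the assumption that titles are wrapped in straight double quotes are all illustrative.

```python
import re
from typing import Optional, TypedDict

class Winner(TypedDict):
    winner: str
    winner_show: Optional[str]
    platform: Optional[str]

# "Title" optionally followed by (Platform): film and Blue Dragon Series formats
_TITLE_FIRST = re.compile(r'^"(?P<title>[^"]+)"\s*(?:\((?P<platform>[^)]+)\))?$')
# Person ("Show Title"): drama award format
_PERSON_FIRST = re.compile(r'^(?P<person>[^("]+?)\s*\(\s*"(?P<show>[^"]+)"\s*\)$')

def parse_winner(value_text: str) -> Winner:
    """Classify a winner string into one of the three formats (hypothetical)."""
    text = value_text.strip()
    m = _TITLE_FIRST.match(text)
    if m:
        return {"winner": m["title"], "winner_show": None, "platform": m["platform"]}
    m = _PERSON_FIRST.match(text)
    if m:
        return {"winner": m["person"].strip(), "winner_show": m["show"], "platform": None}
    # Unknown shape: keep the raw text rather than guessing
    return {"winner": text, "winner_show": None, "platform": None}

# Placeholder strings, not real award data
print(parse_winner('Actor Name ("Show Title")'))
print(parse_winner('"Series Title" (Netflix)'))
```

Falling back to the raw text for unrecognized shapes keeps a new ceremony format from silently producing a wrong person/show split.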

The Wikipedia Section Problem

Wikipedia articles don't have standardized section names. "Plot" might be called "Synopsis", "Series overview", "Story", or "Episodes" depending on who wrote the article and when.

I built a section alias system:

SECTION_ALIASES = {
    "Plot": ["Plot", "Synopsis", "Story", "Series overview", "Premise", "Episodes"],
    "Cast": ["Cast", "Cast and characters", "Characters", "Main cast"],
    "Ratings": ["Ratings", "Viewership ratings", "Television ratings", "Viewership"],
    "Reception": ["Reception", "Critical response", "Critical reception"],
}

The smart lookup tries each alias in order until it finds content. Crash Landing on You uses "Episodes" for its episode list and "Viewership" for its ratings section — both non-standard names that the alias system handles automatically.
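
The lookup itself can be sketched in a few lines. This is an illustration rather than the project's code: `find_section` is a hypothetical name, and it assumes the article has already been split into a dict mapping section headings to their text.

```python
from typing import Optional

SECTION_ALIASES = {
    "Plot": ["Plot", "Synopsis", "Story", "Series overview", "Premise", "Episodes"],
    "Cast": ["Cast", "Cast and characters", "Characters", "Main cast"],
    "Ratings": ["Ratings", "Viewership ratings", "Television ratings", "Viewership"],
    "Reception": ["Reception", "Critical response", "Critical reception"],
}

def find_section(sections: dict[str, str], canonical: str) -> Optional[str]:
    """Return the text of the first alias that exists and is non-empty."""
    # Normalize headings so "viewership" and "Viewership " both match
    lowered = {name.strip().lower(): text for name, text in sections.items()}
    for alias in SECTION_ALIASES.get(canonical, [canonical]):
        text = lowered.get(alias.lower(), "")
        if text.strip():
            return text
    return None

# Example with made-up section text
article = {"Synopsis": "A story.", "Viewership": "Ratings table", "Plot": ""}
print(find_section(article, "Plot"))     # empty "Plot" is skipped in favor of "Synopsis"
print(find_section(article, "Ratings"))
```

Because aliases are tried in list order, putting the most specific names first controls which section wins when an article has several candidates.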

Cross-Source ID Management

One of the harder database design decisions was how to link data across sources. A single show like Crash Landing on You has:

TMDB ID: 94796
MDL ID: 70 (from slug 70-crash-landing-on-you)
Naver show OS ID: 3522952
JustWatch slug: crash-landing-on-you
Wikipedia title: Crash Landing on You

The database stores all of these as columns on the tv_shows table. No single ID links all sources, so the TMDB ID serves as the primary cross-source key because TMDB has the broadest coverage and most stable IDs. The row's own primary key is a surrogate UUID.

create table tv_shows (
  id              uuid primary key default uuid_generate_v4(),
  tmdb_id         text unique,
  mdl_id          text unique,
  mdl_slug        text,
  naver_show_id   text,
  justwatch_slug  text,
  wikipedia_title text,
  -- ... ratings, metadata, etc.
);

The pipeline links sources progressively: TMDB runs first to seed core titles, then MDL enriches with ratings and tags, then NAVER TV adds episode ratings using the Korean title stored by TMDB.

Rating Field Naming Convention

With eight different rating fields spanning different sources, audiences, and methodologies, naming discipline was essential:

tmdb_rating              # Global community (0-10)
mdl_rating               # International K-drama fans (0-10)
naver_audience_rating    # Korean verified ticket buyers (0-10)
naver_netizen_rating     # Korean general public (0-10)
naver_latest_rating      # Nielsen Korea latest episode (%)
naver_highest_rating     # Nielsen Korea peak episode (%)
rt_tomatometer           # Western professional critics (0-100)
rt_audience_score        # Western RT users (0-100)

These are never stored as a generic "rating" field. The naming makes the source and audience type explicit at the schema level, preventing any ambiguity in downstream queries or API responses.

What This Unlocks

The unified database enables queries that weren't previously possible:

Find dramas that Korean audiences loved but Western critics were lukewarm about:

SELECT title_english, naver_audience_rating, rt_tomatometer
FROM tv_shows
WHERE naver_audience_rating > 8.5
AND rt_tomatometer < 60;

Find all content with OST albums by IU:

SELECT ts.title_english, oa.album_name, oa.vibe_url
FROM ost_albums oa
JOIN tv_shows ts ON oa.show_id = ts.id
WHERE oa.artist ILIKE '%IU%' OR oa.artist ILIKE '%아이유%';

Find award winners available on Netflix US:

SELECT DISTINCT ts.title_english, a.category, a.year
FROM awards a
JOIN tv_shows ts ON a.show_id = ts.id
JOIN streaming_availability sa ON sa.show_id = ts.id
WHERE sa.provider = 'Netflix'
AND sa.region = 'us'
AND sa.monetization_type = 'Subscription'
AND a.won = true;

Find the drama with the biggest episode-to-episode rating jump:

SELECT show_id, episode_number, nielsen_rating,
       nielsen_rating - LAG(nielsen_rating) OVER (PARTITION BY show_id ORDER BY episode_number) AS jump
FROM episodes
ORDER BY jump DESC NULLS LAST
LIMIT 10;

None of these queries are possible against any single existing source.

What's Next

The goal of this project is to build an MCP server for Korean entertainment data that makes Korean movies and TV shows accessible to AI agents and developer tooling in a structured, searchable way.

Instead of forcing developers to manually scrape different sites just to answer a basic query, the MCP server will expose unified tools that support natural language requests like “Find me a political thriller Korean audiences loved that’s available on Netflix and maintained strong episode ratings throughout its run.”

Under the hood, that means reconciling fragmented metadata across 10,000+ titles from APIs, scrapers, streaming providers, audience ratings, box office systems, and community-driven sources.

Part 2 will dive into the pipeline architecture itself: the automated sync system, GitHub Actions orchestration, incremental updates, scraper failures, rate limits, deduplication headaches, and the very questionable debugging decisions made at 2AM.

Pick Your Auth: An Interactive Guide

Cara Jung — Mon, 13 Apr 2026 16:00:00 +0000

Most auth tutorials focus on how authentication works: how to drop in a component, spin up a dev server, and get a login screen running. There's no shortage of guides that tell you which method to use for your use case. What's missing is the hands-on part: actually experiencing each flow the way your users do, so you can feel the friction, see the session it produces, and make an informed decision from the ground up.

Magic link or passkey? Social login or OTP? The answer changes depending on whether you're building a consumer app, a fintech product, a B2B SaaS, or an internal tool. The choice is a product decision that affects activation, security posture, compliance, and long-term maintainability.

To tackle this dilemma, I built an interactive demo called Auth Decision Kit that lets you try three Descope auth flows live: magic link, social login, and passkey. This demo focuses on how each approach fits different product contexts and the tradeoffs you need to consider.

Demo: auth-decision-kit.vercel.app
GitHub: github.com/carasjung/auth-decision-kit

Each method has five tabs:

01 Auth Flow
Authenticate for real using a live Descope integration. See the actual UX users experience.

02 Session Inspector
After authenticating, inspect every claim in your JWT payload. Each field is annotated with what it means, why it matters, and when you'd use it.

03 Decision Matrix
Green / yellow / red ratings across product contexts: B2B, consumer app, developer tool, internal tool, fintech, SaaS, and mobile-first.

04 Failure Simulator
Trigger each failure mode and see the Descope error code and the correct handling code.

05 Code
Copy-ready implementation snippets for Next.js.

The Session Inspector: JWT Breakdown

One of the most useful things I learned building this is how different the JWT payload looks depending on which auth method you used, and why those differences matter for your backend logic.

After a magic link auth, your session contains authenticationMethod: "magiclink" and verifiedEmail: true. The email verification is implicit: clicking the link is proof of inbox access. This is a meaningful signal for risk scoring, but it also shows that magic link is a single factor (access to your inbox). For products that require two factors, such as healthcare and fintech, magic link on its own won’t suffice.

After social login, you get the provider's access token nested under oauth.google.accessToken (or whichever provider). You also get externalIds.google, a stable provider-specific user ID that won't change even if the user changes their email address on Google's side. That's the field you want for account linking. Since you get free profile data, this is effective for consumer and developer tools.

After a passkey auth, the amr (Authentication Methods References) claim contains "hwk" (hardware key) and "user". This is the claim compliance teams care about. It's proof that a hardware-bound credential was used, not just a password or a link. Passkey is also the only method here where the private key never leaves the user’s device. Even a full Descope breach couldn’t expose user credentials.
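
On the backend, those differences can drive branching logic. The sketch below is written in Python for illustration (the demo itself is TypeScript) and uses only the claim names quoted in this post. `decode_jwt_payload` and `describe_auth` are hypothetical helpers, and the decoder deliberately skips signature verification, so it is suitable for inspection only, never for authorization.

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    """Decode a JWT's payload segment WITHOUT verifying the signature.

    Inspection only: a real backend must validate the token first,
    for example through the auth provider's SDK.
    """
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))

def describe_auth(claims: dict) -> str:
    """Classify a session by the claims described above (hypothetical labels)."""
    if "hwk" in claims.get("amr", []):
        return "passkey"  # hardware-bound credential was used
    if claims.get("authenticationMethod") == "magiclink":
        return "magic_link"  # single factor: inbox access
    providers = sorted(claims.get("oauth", {}))
    if providers:
        return "social:" + ",".join(providers)
    return "unknown"

print(describe_auth({"amr": ["hwk", "user"]}))
print(describe_auth({"authenticationMethod": "magiclink", "verifiedEmail": True}))
```

A function like this is where a product could require step-up authentication when the label isn't "passkey".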

The Decision Matrix

Here's a condensed version of what I found after thinking through six product contexts for each method:

Magic link is the sweet spot for B2B SaaS and early-stage products. Zero password management, implicit email verification, and simple implementation. However, it falls apart on mobile (context switch to email app kills conversion) and in high-security contexts where email as a sole factor isn't enough.

Social login is the fastest path to activation for consumer and developer tools. GitHub login in particular gives you free org and repo data via the OAuth token, which is useful for developer-focused products. Avoid it for fintech and banking where regulations often require you to own the identity directly.

Passkey is genuinely the best option for mobile-first and high-security contexts. Phishing-proof by design: the private key never leaves the device. The catch: users still need education on what a passkey is and you need a fallback for older browsers and lost devices.

Most products should offer at least two methods: one as the default and the other as an alternative. For instance, use magic link as the default and passkey as the upgrade path once users are comfortable.

The Failure Simulator

Auth flows break in predictable ways. Understanding those failure points from day one lets you design a seamless recovery experience so users can continue without friction and avoid escalating to support.

The failure simulator surfaces these scenarios using real Descope error codes and responses. While it doesn’t make live network calls, it replays actual API error outputs so you can explore failure cases without having to intentionally break a real session.

Magic links expire (Descope's default is 2 minutes). When they do, the onError callback fires with error code E011303. Your UI should catch this and offer to send a new link, not show a generic error message.

Social login gets cancelled. Users click "Continue with Google," see the permissions screen, and hit Cancel. That fires E062503. The right response is to return the user silently to the login screen and treat a cancellation as a choice, not an error.

Passkeys on new devices fire E083002 (WebAuthn NotAllowedError). The recovery flow is: fall back to magic link or OTP to verify identity, then offer to enroll a passkey on the new device. This is also why you should never make passkey the only auth method since you always need a fallback for device loss.
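
The three recovery paths above reduce to a lookup from error code to next step. The sketch below uses the error codes quoted in this post. Everything else (the `Recovery` type, the action names, and the default) is hypothetical, and it is written in Python for illustration while the demo itself is TypeScript.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Recovery:
    silent: bool  # True: no error message, just continue the flow
    action: str   # hypothetical name for the next UI step

RECOVERY_BY_CODE = {
    # Magic link expired: offer to send a new one
    "E011303": Recovery(silent=False, action="offer_resend_magic_link"),
    # Social login cancelled: a choice, not an error
    "E062503": Recovery(silent=True, action="return_to_login"),
    # Passkey unavailable on this device: verify another way, then enroll
    "E083002": Recovery(silent=False, action="fallback_then_enroll_passkey"),
}

DEFAULT_RECOVERY = Recovery(silent=False, action="show_generic_error")

def recovery_for(code: str) -> Recovery:
    """Pick the recovery step for an error code, with a generic fallback."""
    return RECOVERY_BY_CODE.get(code, DEFAULT_RECOVERY)

print(recovery_for("E062503"))
```

Keeping the mapping in data rather than in nested conditionals makes it easy to audit that every known failure has a deliberate recovery path.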

Stack and Setup

Next.js 15 with App Router
Descope Next.js SDK (@descope/nextjs-sdk) for auth flows and session management
Framer Motion for tab transitions
Tailwind CSS for layout

The entire setup is about 800 lines of TypeScript across nine files. All core data (steps, session highlights, decision matrix scores, failure scenarios, and code snippets) lives in a single lib/auth.ts file. Adding a new auth method requires only a single entry point, keeping the system easy to extend.

To run it yourself:

git clone https://github.com/carasjung/auth-decision-kit
cd auth-decision-kit
npm install
cp .env.local.example .env.local
# add your NEXT_PUBLIC_DESCOPE_PROJECT_ID
npm run dev

You'll need a free Descope account. Once you’ve created your account, grab your Project ID from app.descope.com/settings/project and configure a sign-up-or-in flow with whichever methods you want to test.

From Demo to Decision

There are plenty of great auth demos that show how things work. This one focuses on how to choose between them.

Auth is infrastructure, and like many infrastructure decisions, the cost of getting it wrong rarely shows up immediately. It appears later through conversion drop-offs, security tradeoffs, compliance constraints, and migrations.

While modern tools make it easier to support multiple methods and evolve your approach over time, the decision of what to use and when still requires good judgement upfront. This project is designed to help make that choice more intentional.

The live demo is at auth-decision-kit.vercel.app and the full source is on GitHub.

What Predicts a Hit? I Trained 3 ML Models to Find Out

Cara Jung — Mon, 06 Apr 2026 07:00:00 +0000

In many entertainment adaptation decisions, content selections are still instinct-driven. Maybe a producer was vibing with a story or overheard their Gen Alpha nephew mentioning a GOAT title. This subjective approach has often led to expensive missteps and wasted resources for studios when the feature or show turns into a flop.

As someone who has worked in the breeding ground of popular webcomics, I asked: what if there was a system that could measure “success potential” of IPs based on real user behavior? Using ML, I wanted to see if I could build a forecasting model that could rank unadapted titles by their predicted commercial success.

The Data

For my endeavor, I worked with three datasets:

Source material metadata of roughly 1,500 titles that included engagement metrics such as views, likes, subscribers, genre, release schedule, and creator usernames
Produced show metadata of 1,977 titles including ratings, watcher counts, genre, episode count, and cast
Historical webcomic adaptation records of 424 cross-referenced titles that went from source material to screen, with data pulled from both sides

Before any modeling, I ran exploratory data analysis on all three and found a few things:

Engagement metrics (likes, views, subscribers) were strongly correlated with each other and overall popularity
Genre and tags correlated with watcher counts in the produced show data
Creator frequency showed no statistically significant impact on adaptation success, which directly contradicted what studios commonly assume

Engineering the Target Variable

One hurdle I ran into was that I couldn't directly measure adaptation “success” from the source material side alone. So I engineered a composite Popularity Score by normalizing and combining views, likes, and subscribers into a single metric representing audience appeal, which became the target variable for prediction.

For the produced show data, I created a parallel score using rating and watcher count.

Since correlation analysis confirmed that source popularity and show popularity moved together in historical adaptations, I used source popularity as a proxy target.
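
As an illustration of what such a composite score can look like, here is a minimal sketch. The post doesn't specify the normalization or the weights, so min-max scaling and equal weights are assumptions, and the function names are hypothetical.

```python
def min_max(values: list[float]) -> list[float]:
    """Scale values to the 0-1 range; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def popularity_scores(
    views: list[float],
    likes: list[float],
    subscribers: list[float],
    weights: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),  # assumed equal
) -> list[float]:
    """Combine three normalized engagement metrics into one score per title."""
    columns = [min_max(views), min_max(likes), min_max(subscribers)]
    return [
        sum(w * col[i] for w, col in zip(weights, columns))
        for i in range(len(views))
    ]

# Toy numbers for three titles, not real data
print(popularity_scores([100, 200, 300], [10, 20, 30], [1, 2, 3]))
```

Normalizing first keeps a metric with large raw values, such as views, from dominating the score just because of its scale.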

Simple vs Complex Models

I implemented three models: Random Forest, XGBoost, and Ridge Regression. If you’ve worked with ML models, you might expect the more complex models to win. That wasn’t the case here. Ridge Regression was the unexpected underdog that won.

I cross-validated all three models to reduce overfitting and validate stability.
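
For readers unfamiliar with the mechanics, here is a dependency-free sketch of k-fold cross-validation around a ridge fit. It is a deliberate simplification: it handles a single feature, uses interleaved rather than shuffled folds, and the function names are hypothetical. The actual project would more plausibly use a library such as scikit-learn with all features.

```python
from statistics import mean

def ridge_fit_1d(xs: list[float], ys: list[float], alpha: float) -> tuple[float, float]:
    """Fit y ~ b + w*x minimizing squared error + alpha * w**2.

    The intercept is left unpenalized. Returns (w, b).
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w = sxy / (sxx + alpha)  # alpha > 0 shrinks the slope toward zero
    return w, y_bar - w * x_bar

def k_fold_mse(xs: list[float], ys: list[float], alpha: float, k: int = 5) -> float:
    """Average held-out mean squared error over k interleaved folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th point is held out
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        w, b = ridge_fit_1d(train_x, train_y, alpha)
        fold_errors.append(mean((ys[i] - (b + w * xs[i])) ** 2 for i in test_idx))
    return mean(fold_errors)

# Toy data on an exact line
xs = [float(i) for i in range(20)]
ys = [2 * x + 1 for x in xs]
print(k_fold_mse(xs, ys, alpha=0.0), k_fold_mse(xs, ys, alpha=1000.0))
```

Each model is scored only on data it did not see during fitting, which is what makes the comparison between simple and complex models fair.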

Likes = Success

Using standardized coefficients for feature importance in the Ridge model, the ranking was as follows:

Likes (strongest predictor by a significant margin)
Views
Subscribers

The factors that studios often focus on such as creator reputation, genre, rating, and engagement rate showed weak or no statistical significance.

I validated this further using Mann-Whitney U tests comparing adapted titles against the general pool. Adapted titles showed significantly higher “likes” than non-adapted ones and the difference was meaningful.

Feature importance for Ridge Regression (standardized coefficients)

So why “likes”?

One interpretation is that likes are intentional. A view can be passive while a subscription can be habitual. But giving a “like” is an act of emotional investment and this behavior is exactly what translates from IP to screen.

The Output

The final model produced a ranked list of the top 10 unadapted webcomic titles by predicted success, along with contextual signals for each including genre appeal, subscriber trends, engagement consistency, and creator track record.

Qualitative review of the top 10 confirmed alignment with the engagement patterns seen in historically successful adaptations. Cliff's Delta calculations showed that the predicted top titles had significantly higher likes than past adaptations.
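
Cliff's Delta is simple enough to compute by hand. It compares every value in one group against every value in the other and reports how often one side is larger. A minimal sketch with toy numbers (the function name and data are illustrative, not the project's code):

```python
def cliffs_delta(xs: list[float], ys: list[float]) -> float:
    """Effect size in [-1, 1]: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Toy like counts, not real data
predicted_top = [5, 6, 7]
past_adaptations = [1, 2, 3]
print(cliffs_delta(predicted_top, past_adaptations))
```

A value near 1 means almost every title in the first group has more likes than any title in the second, 0 means the groups overlap completely, and negative values reverse the direction. The statistic is closely related to the Mann-Whitney U test: delta equals 2U/(nm) - 1, where n and m are the group sizes.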

Limitations on the Model

Part of doing good data work is being honest about the limitations. There were a few things that fell short:

Small adaptation dataset. 424 entries is workable, but more data would reduce overfitting risk and improve generalization.
Proxy target variable. Using source popularity instead of actual show performance is a justified simplification, but it means the model can't fully capture real-world production quality, casting, or distribution reach.
Categorical features dropped. Creator and genre have too many levels and their coefficients dominated the model without adding significance. Excluding them improved interpretability but at the cost of losing some nuance.

What I'd Do Next

If I extended this project, I'd rethink how signal is captured and focus on the following:

Use NLP for deeper context
- Synopsis embeddings or sentiment analysis on reader reviews could capture thematic richness that raw engagement metrics miss.
Take a hybrid ranking approach
- Combining regression with a learning-to-rank algorithm could improve recommendation quality at the top of the list, where small differences actually matter.
Longitudinal validation
- The real test is tracking what happens when predicted titles actually get produced. Building a feedback loop into the model would sharpen it over time.

Final Thoughts

The core insight here doesn’t apply only to entertainment. It applies to any decision made by intuition or legacy practice. As the models showed, behavioral signals from real users outperform assumptions about what will succeed.

Likes beat creator prestige. Engagement beat genre conventions. The audience’s preferences, not those of industry decision makers, predicted outcomes more reliably.

Whether you're choosing which content to produce, which features to build, or which markets to enter, the same principle applies. The answers are within the data, but we often overlook the right signals.