A0mineTV

Posted on May 21

Building a Rich Movie & Social Knowledge Graph with Neo4j and Python

#python #neo4j #graphql #tutorial

In this deep-dive tutorial you’ll learn how to connect to Neo4j exclusively from Python, model a non-trivial schema, ingest multi-domain data (movies, people, characters, companies, reviews, social links, release history), and run Graph Data Science algorithms—all in one script.

🚀 1. Why a Knowledge Graph?

Natural model for connected data: nodes (entities) + relationships.
Cypher: a declarative, SQL-style graph query language.
Use cases: recommendations, fraud detection, social networks, knowledge graphs, taxonomy management.

🛠 2. Prerequisites

Neo4j v5+ running locally (Community or via Docker), with the Graph Data Science (GDS) plugin installed for the algorithms in section 10.
Python 3.8+ virtual environment.
Install packages:

pip install neo4j pandas

📦 3. Script Overview

This script (complex_kg.py) orchestrates the entire lifecycle of your movie‐social knowledge graph:

Establish Connection
- Creates a driver and opens a Bolt session to Neo4j using the official Python driver.
Schema Setup
- Creates uniqueness constraints on key node labels (Movie.title, Person.name, etc.).
- Builds a property index and a full-text index for fast lookups.
Data Ingestion
- Genres & Companies: loads genre tags and production studios with their founding years and countries.
- Movies & Genre Links: imports each movie node and attaches it to its genres.
- Characters & Roles: defines character nodes (with archetypes), each keyed by its name and movie title.
- People (Actors, Directors, Writers): creates Person nodes, then establishes ACTED_AS and DIRECTED relationships (the script shown here does not load writers).
- Reviews & Social Graph: inserts Review nodes, connects them to users and movies, and builds a FOLLOWS network (with timestamps).
- Temporal Releases & Versions: models per-region release dates and version nodes (e.g., remasters) with RELEASED_IN and HAS_VERSION edges.
Graph Data Science
- Social PageRank: projects the User + FOLLOWS subgraph and computes influence scores.
- Movie Similarity: builds a movie-genre projection and streams top-N similar movie pairs.
Results Output
- Formats and prints PageRank scores and similarity pairs as pandas DataFrames.

🔧 4. Configuration & Connection

from neo4j import GraphDatabase, basic_auth
import pandas as pd

URI      = "bolt://localhost:7687"
USER     = "neo4j"
PASSWORD = "rootderoot"

driver = GraphDatabase.driver(URI, auth=basic_auth(USER, PASSWORD))

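This creates a driver that manages a connection pool to your local Neo4j. Connections are opened lazily, so a wrong URI or password only surfaces when the first query runs.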
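Hardcoding credentials is fine for a local demo, but you may prefer to read them from the environment. The sketch below is one way to do that; the variable names NEO4J_URI, NEO4J_USER and NEO4J_PASSWORD and the helper load_neo4j_config are assumptions made for this example, not something the driver reads on its own.

```python
import os

def load_neo4j_config(env=None):
    """Return (uri, user, password) taken from environment variables.

    NEO4J_URI, NEO4J_USER and NEO4J_PASSWORD are hypothetical names
    chosen for this sketch; rename them to fit your setup.
    """
    env = os.environ if env is None else env
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    user = env.get("NEO4J_USER", "neo4j")
    password = env.get("NEO4J_PASSWORD")
    if not password:
        # Fail early instead of sending an empty password to the server
        raise RuntimeError("Set NEO4J_PASSWORD before running the script")
    return uri, user, password
```

You could then write URI, USER, PASSWORD = load_neo4j_config() before creating the driver, and call driver.verify_connectivity() right after to fail fast on bad settings.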

🔐 5. Constraints & Indexes

def create_constraints_and_indexes(tx):
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (m:Movie)     REQUIRE m.title IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (p:Person)    REQUIRE p.name IS UNIQUE")
    # Composite uniqueness works on Community Edition; NODE KEY needs Enterprise
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (c:Character) REQUIRE (c.name, c.movie) IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (co:Company)  REQUIRE co.name IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (g:Genre)     REQUIRE g.name IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (u:User)      REQUIRE u.name IS UNIQUE")

    # The index name goes before IF NOT EXISTS
    tx.run("CREATE INDEX movie_year IF NOT EXISTS FOR (m:Movie) ON (m.released)")
    tx.run("""
      CREATE FULLTEXT INDEX movie_text IF NOT EXISTS
      FOR (m:Movie) ON EACH [m.title, m.tagline]
    """)

A composite uniqueness constraint on Character allows only one node per (name, movie) pair. A NODE KEY constraint would additionally require both properties to exist, but it is an Enterprise Edition feature.
A full-text index on movies lets you search titles or taglines.
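To show what the full-text index is for, here is a minimal sketch of a search helper. It relies on the built-in db.index.fulltext.queryNodes procedure and the movie_text index defined above; the function name search_movies and its limit parameter are illustrative choices, not part of the original script.

```python
def search_movies(tx, text, limit=5):
    """Search movie titles and taglines through the movie_text index.

    `tx` is a transaction as passed by session.execute_read.
    Returns a list of {"title": ..., "score": ...} dicts, best match first.
    """
    result = tx.run(
        """
        CALL db.index.fulltext.queryNodes('movie_text', $text)
        YIELD node, score
        RETURN node.title AS title, score
        ORDER BY score DESC
        LIMIT $limit
        """,
        text=text,
        limit=limit,
    )
    return result.data()
```

Inside a session you would call it as s.execute_read(search_movies, "mind"). The search text uses Lucene query syntax, so a term such as crime matches the Inception tagline.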

🌐 6. Loading Static Domain Data

6.1 Genres & Companies

def load_genres_and_companies(tx):
    genres = ['Action','Sci-Fi','Thriller','Drama']
    for name in genres:
        tx.run("MERGE (:Genre {name:$name})", name=name)

    companies = [
        {"name":"Warner Bros.", "founded":1923, "country":"US"},
        {"name":"Paramount Pictures", "founded":1912, "country":"US"}
    ]
    for c in companies:
        tx.run("""
          MERGE (co:Company {name:$name})
          SET co.founded = $founded, co.country = $country
        """, **c)

6.2 Movies & Genre Links

def load_movies(tx):
    movies = [
        {"title":"Inception","released":2010,"tagline":"Your mind is the scene of the crime","genres":["Thriller","Sci-Fi"]},
        {"title":"Interstellar","released":2014,"tagline":"Mankind’s next step will be our greatest","genres":["Sci-Fi","Drama"]}
    ]
    for m in movies:
        tx.run("MERGE (mv:Movie {title:$title}) SET mv.released=$released, mv.tagline=$tagline", **m)
        for g in m["genres"]:
            tx.run("""
              MATCH (mv:Movie {title:$title}), (g:Genre {name:$genre})
              MERGE (mv)-[:IN_GENRE]->(g)
            """, title=m["title"], genre=g)

MERGE makes the load idempotent: re-running it matches the existing nodes and relationships instead of duplicating them.
We attach two genres per movie.
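The loop above sends one query per movie and one per genre link, which is fine for two movies. For larger inputs, a common alternative is to pass the whole list as a single parameter and let Cypher iterate with UNWIND. The sketch below assumes the same dictionary shape as the movies list; load_movies_batched is a hypothetical name.

```python
def load_movies_batched(tx, movies):
    """Load all movies and their genre links with a single query.

    `movies` is a list of dicts with title, released, tagline and genres
    keys. Genres are matched, not created, so they must already exist.
    """
    tx.run(
        """
        UNWIND $rows AS row
        MERGE (mv:Movie {title: row.title})
        SET mv.released = row.released, mv.tagline = row.tagline
        WITH mv, row
        UNWIND row.genres AS genre
        MATCH (g:Genre {name: genre})
        MERGE (mv)-[:IN_GENRE]->(g)
        """,
        rows=movies,
    )
```

It is called like the other loaders, with the list as an extra argument: s.execute_write(load_movies_batched, movies).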

🎭 7. Characters, Actors, Directors, Writers

def load_people_and_roles(tx):
    # Characters
    characters = [
        {"name":"Cobb","movie":"Inception","archetype":"Hero"},
        {"name":"Murph","movie":"Interstellar","archetype":"Protege"}
    ]
    for ch in characters:
        tx.run("""
          MERGE (c:Character {name:$name, movie:$movie})
          SET c.archetype=$archetype
        """, **ch)

    # Actors & ACTED_AS
    actors = [
        {"name":"Leonardo DiCaprio","born":1974,"nationality":"US","character":"Cobb","movie":"Inception","year":2010},
        {"name":"Jessica Chastain","born":1977,"nationality":"US","character":"Murph","movie":"Interstellar","year":2014}
    ]
    for a in actors:
        tx.run("""
          MERGE (p:Person {name:$name})
          SET p.born=$born, p.nationality=$nationality
        """, **a)
        # Match the character by its (name, movie) key, not by its name twice
        tx.run("""
          MATCH (p:Person {name:$name}), (c:Character {name:$character, movie:$movie})
          MERGE (p)-[:ACTED_AS {roles:[$character], year:$year}]->(c)
        """, name=a["name"], character=a["character"], movie=a["movie"], year=a["year"])

    # Directors
    directors = [
        {"director":"Christopher Nolan","movie":"Inception","year":2010},
        {"director":"Christopher Nolan","movie":"Interstellar","year":2014}
    ]
    for d in directors:
        tx.run("""
          MERGE (p:Person {name:$director})
          MERGE (m:Movie {title:$movie})
          MERGE (p)-[:DIRECTED {year:$year}]->(m)
        """, **d)

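We model characters separately from people.
A Person is linked to a Character through ACTED_AS and to a Movie through DIRECTED. Writers are not loaded in this function, but a Person-to-Movie WROTE relationship could be added the same way as DIRECTED.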
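Once people and movies are connected, reading the graph back is a short pattern match. As an illustration, the following sketch lists the movies a person directed; the helper name films_directed_by is made up for this example.

```python
def films_directed_by(tx, name):
    """Return the movies directed by `name`, oldest first.

    Uses the Person, Movie and DIRECTED names from the loader above.
    """
    return tx.run(
        """
        MATCH (p:Person {name: $name})-[d:DIRECTED]->(m:Movie)
        RETURN m.title AS title, d.year AS year
        ORDER BY year
        """,
        name=name,
    ).data()
```

With the sample data, s.execute_read(films_directed_by, "Christopher Nolan") should return Inception (2010) followed by Interstellar (2014).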

📝 8. Reviews & Social Follows

def load_reviews_and_social(tx):
    # Users
    for u in ["Alice","Bob","Carol"]:
        tx.run("MERGE (:User {name:$name})", name=u)

    # Reviews
    reviews = [
        {"user":"Alice","movie":"Inception","rating":5,"date":"2021-01-01","comment":"Mind-blowing!"},
        {"user":"Bob","movie":"Interstellar","rating":4,"date":"2021-02-02","comment":"Epic visuals."}
    ]
    for r in reviews:
        tx.run("""
          MATCH (u:User {name:$user}), (m:Movie {title:$movie})
          CREATE (rev:Review {rating:$rating, date:date($date), comment:$comment})
          MERGE (u)-[:WROTE]->(rev)
          MERGE (rev)-[:FOR_MOVIE]->(m)
        """, **r)

    # Follows
    follows = [("Alice","Bob","2021-03-01"),("Bob","Carol","2021-03-05")]
    for fr,to,date in follows:
        tx.run("""
          MATCH (a:User {name:$fr}), (b:User {name:$to})
          MERGE (a)-[f:FOLLOWS]->(b)
          ON CREATE SET f.since = date($date)
        """, fr=fr, to=to, date=date)

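We create Review nodes with a rating, a date, and a comment. Because they are added with CREATE rather than MERGE, re-running the script inserts duplicate reviews.
Users are linked to their reviews through WROTE and to one another through FOLLOWS.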
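The value of storing reviews and follows in the same graph is that one pattern can cross both. The sketch below fetches the reviews written by the people a given user follows; reviews_from_followed is a hypothetical helper name, while the labels and relationship types are the ones used in the loader.

```python
def reviews_from_followed(tx, name):
    """Return reviews written by users that `name` follows, newest first."""
    return tx.run(
        """
        MATCH (me:User {name: $name})-[:FOLLOWS]->(friend:User)
              -[:WROTE]->(rev:Review)-[:FOR_MOVIE]->(m:Movie)
        RETURN friend.name AS friend, m.title AS movie,
               rev.rating AS rating, rev.comment AS comment
        ORDER BY rev.date DESC
        """,
        name=name,
    ).data()
```

With the sample data, calling it for Alice should surface Bob's review of Interstellar, since Alice follows Bob.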

📆 9. Temporal Releases & Versions

def load_temporal_and_versions(tx):
    # Releases by region
    releases = [
        {"movie":"Inception","region":"US","date":"2010-07-16"},
        {"movie":"Inception","region":"FR","date":"2010-07-21"}
    ]
    for r in releases:
        tx.run("MERGE (rel:Release {region:$region, date:date($date)})", **r)
        tx.run("""
          MATCH (m:Movie {title:$movie}), (rel:Release {region:$region, date:date($date)})
          MERGE (m)-[:RELEASED_IN {region:$region, date:date($date)}]->(rel)
        """, **r)

    # Versions / Remasters
    versions = [{"movie":"Interstellar","label":"4K Remaster","releaseDate":"2020-11-01"}]
    for v in versions:
        tx.run("""
          MERGE (ver:Version {label:$label})
          SET ver.releaseDate=date($releaseDate)
        """, **v)
        tx.run("""
          MATCH (m:Movie {title:$movie}), (ver:Version {label:$label})
          MERGE (m)-[:HAS_VERSION {releaseDate:date($releaseDate)}]->(ver)
        """, **v)

Each Release captures a region and a date.
Version nodes let you track director’s cuts or remasters over time.

🔬 10. Graph Data Science

10.1 PageRank on Social Graph

def run_gds(tx):
    tx.run("CALL gds.graph.drop('social', false)").consume()
    tx.run("""
      CALL gds.graph.project('social','User',{FOLLOWS:{orientation:'NATURAL'}})
    """).consume()

    return tx.run("""
      CALL gds.pageRank.stream('social')
      YIELD nodeId, score
      RETURN gds.util.asNode(nodeId).name AS user, round(score,3) AS pr
      ORDER BY pr DESC
    """).data()

10.2 Movie Similarity via Genres

def run_movie_similarity(tx):
    # Drop any earlier projection (false = do not fail if it is missing)
    tx.run("CALL gds.graph.drop('movieGenre', false)").consume()
    # Project the bipartite Movie -> Genre graph; IN_GENRE points from movie to genre
    tx.run("""
      CALL gds.graph.project('movieGenre', ['Movie','Genre'], 'IN_GENRE')
    """).consume()

    # Node similarity compares movies by the overlap of their genre sets;
    # node1 < node2 keeps each pair once instead of in both directions
    return tx.run("""
      CALL gds.nodeSimilarity.stream('movieGenre',{similarityCutoff:0.2})
      YIELD node1,node2,similarity
      WHERE node1 < node2
      RETURN gds.util.asNode(node1).title AS A,
             gds.util.asNode(node2).title AS B,
             round(similarity,3)             AS sim
      ORDER BY sim DESC LIMIT 5
    """).data()

PageRank reveals the most “influential” users in your social network.
Node similarity finds the top 5 most similar movie pairs based on shared genres.
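To get a feel for the numbers these algorithms should return on the tiny sample graph, it can help to recompute them in plain Python. The sketch below is not how GDS is implemented; it assumes node similarity uses its default Jaccard metric and that PageRank follows the unnormalized formula score = (1 - d) + d * sum(score(src) / out_degree(src)) with a damping factor d of 0.85 and no redistribution from users who follow nobody.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pagerank(nodes, edges, damping=0.85, iterations=20):
    """Unnormalized PageRank by repeated updates (assumed formula, see above).

    `edges` is a list of (source, target) pairs, e.g. follower -> followed.
    """
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1
    scores = {n: 1.0 - damping for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:
            # Each node shares its score equally among the nodes it points to
            incoming[dst] += scores[src] / out_degree[src]
        scores = {n: (1.0 - damping) + damping * incoming[n] for n in nodes}
    return scores

sim = jaccard(["Thriller", "Sci-Fi"], ["Sci-Fi", "Drama"])
ranks = pagerank(["Alice", "Bob", "Carol"], [("Alice", "Bob"), ("Bob", "Carol")])
print(round(sim, 3), {user: round(score, 4) for user, score in ranks.items()})
```

Under these assumptions the two movies share one of three distinct genres, giving a similarity of about 0.333, and Carol ends up with the highest PageRank because she sits at the end of the FOLLOWS chain.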

▶️ 11. Putting It All Together

def main():
    with driver.session() as s:
        s.execute_write(create_constraints_and_indexes)
        s.execute_write(load_genres_and_companies)
        s.execute_write(load_movies)
        s.execute_write(load_people_and_roles)
        s.execute_write(load_reviews_and_social)
        s.execute_write(load_temporal_and_versions)

        pr_scores = s.execute_read(run_gds)
        sim_pairs = s.execute_read(run_movie_similarity)

    print("PageRank scores:\n", pd.DataFrame(pr_scores))
    print("\nTop movie similarities:\n", pd.DataFrame(sim_pairs))
    driver.close()

if __name__ == "__main__":
    main()

Run:

python complex_kg.py

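Example output (values are approximate and assume GDS defaults such as a damping factor of 0.85). Carol ranks first because she sits at the end of the FOLLOWS chain, and Alice last because nobody follows her. The two movies share one of three distinct genres, hence a similarity of about 0.333:

PageRank scores:
     user     pr
0  Carol  0.386
1    Bob  0.278
2  Alice  0.150

Top movie similarities:
           A             B    sim
0  Inception  Interstellar  0.333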

📈 12. Next Steps

Expose an API (Flask/FastAPI) that runs parameterized Cypher.
Load real data from CSV, TMDB or Wikidata.
Add the neosemantics (n10s) plugin to import and export RDF.
Visualize with Neo4j Bloom or NeoDash.

You now have a comprehensive Python-driven workflow: schema definition, data ingestion, analytics, all in a single, reproducible script. Happy graphing!

DEV Community