DEV Community

A0mineTV

Building a Rich Movie & Social Knowledge Graph with Neo4j and Python

In this deep-dive tutorial you’ll learn how to connect to Neo4j exclusively from Python, model a non-trivial schema, ingest multi-domain data (movies, people, characters, companies, reviews, social links, release history), and run Graph Data Science algorithms—all in one script.


🚀 1. Why a Knowledge Graph?

  • Natural model for connected data: nodes (entities) + relationships.
  • Cypher: a declarative, SQL-style graph query language.
  • Use cases: recommendations, fraud detection, social networks, knowledge graphs, taxonomy management.

🛠 2. Prerequisites

  1. Neo4j v5+ running locally (Community or via Docker), with the Graph Data Science (GDS) plugin installed for the algorithms in section 10.
  2. Python 3.8+ virtual environment.
  3. Install packages:
pip install neo4j pandas

📦 3. Script Overview

This script (complex_kg.py) orchestrates the entire lifecycle of your movie‐social knowledge graph:

  1. Establish Connection

    • Opens a Bolt session to Neo4j using the official Python driver.
  2. Schema Setup

    • Creates unique constraints on key node labels (Movie.title, Person.name, etc.).
    • Builds property indexes and a full‐text index for fast lookups.
  3. Data Ingestion

    • Genres & Companies: loads genre tags and production studios with their founding years and countries.
    • Movies & Genre Links: imports each movie node and attaches it to its genres.
    • Characters & Roles: defines character nodes (with archetypes), each keyed by its name and movie title.
    • People (Actors, Directors, Writers): creates Person nodes, then establishes ACTED_AS and DIRECTED relationships (the listing in section 7 does not load writers).
    • Reviews & Social Graph: inserts Review nodes, connects them to users and movies, and builds a FOLLOWS network (with timestamps).
    • Temporal Releases & Versions: models per‐region release dates and version nodes (e.g., remasters) with RELEASED_IN and HAS_VERSION edges.
  4. Graph Data Science

    • Social PageRank: projects the User + FOLLOWS subgraph and computes influence scores.
    • Movie Similarity: builds a movie–genre projection and streams the top‐N similar movie pairs.
  5. Results Output

    • Formats and prints PageRank scores and similarity pairs as pandas DataFrames.

🔧 4. Configuration & Connection

from neo4j import GraphDatabase, basic_auth
import pandas as pd

URI      = "bolt://localhost:7687"
USER     = "neo4j"
PASSWORD = "rootderoot"

driver = GraphDatabase.driver(URI, auth=basic_auth(USER, PASSWORD))

This creates a driver object that manages a connection pool to your local Neo4j. Connections are opened lazily, when the first session runs a query.
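Hard-coded credentials are fine for a local demo, but a common alternative is to read them from environment variables. The sketch below is not part of the original script: the variable names NEO4J_URI, NEO4J_USER and NEO4J_PASSWORD and the helpers load_settings and connect are assumptions chosen for illustration, with defaults that mirror the values above. It also calls the driver's verify_connectivity() so a wrong URI or password fails early.

```python
import os

# Defaults mirror the values hard-coded in the script above.
DEFAULTS = {
    "NEO4J_URI": "bolt://localhost:7687",
    "NEO4J_USER": "neo4j",
    "NEO4J_PASSWORD": "rootderoot",
}


def load_settings(env=None):
    """Return connection settings, preferring environment variables.

    The variable names are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}


def connect(settings=None):
    """Create a driver and fail fast if the server cannot be reached."""
    # Imported here so load_settings() works even without the driver installed.
    from neo4j import GraphDatabase

    settings = settings or load_settings()
    driver = GraphDatabase.driver(
        settings["NEO4J_URI"],
        auth=(settings["NEO4J_USER"], settings["NEO4J_PASSWORD"]),
    )
    driver.verify_connectivity()  # raises if the URI or credentials are wrong
    return driver
```

With this in place, `driver = connect()` could replace the module-level driver creation.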


🔐 5. Constraints & Indexes

def create_constraints_and_indexes(tx):
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (m:Movie)     REQUIRE m.title IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (p:Person)    REQUIRE p.name IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (c:Character) REQUIRE (c.name, c.movie) IS NODE KEY")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (co:Company)  REQUIRE co.name IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (g:Genre)     REQUIRE g.name IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (u:User)      REQUIRE u.name IS UNIQUE")

    tx.run("CREATE INDEX IF NOT EXISTS movie_year FOR (m:Movie) ON (m.released)")
    tx.run("""
      CREATE FULLTEXT INDEX IF NOT EXISTS movie_text
      FOR (m:Movie) ON EACH [m.title, m.tagline]
    """)
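  • The node key on Character enforces uniqueness per (name, movie) and requires both properties to exist. Node key constraints are an Enterprise Edition feature; on Community Edition, REQUIRE (c.name, c.movie) IS UNIQUE is the closest substitute.
  • A full-text index on movies lets you search titles or taglines.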
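To show how the full-text index might be used once data is loaded, here is a minimal sketch of a read function built on the db.index.fulltext.queryNodes procedure. The function name search_movies and its limit parameter are assumptions for illustration; movie_text is the index name created above.

```python
def search_movies(tx, text, limit=5):
    """Query the `movie_text` full-text index and return title/score rows."""
    result = tx.run(
        """
        CALL db.index.fulltext.queryNodes('movie_text', $text)
        YIELD node, score
        RETURN node.title AS title, score
        ORDER BY score DESC
        LIMIT $limit
        """,
        text=text,
        limit=limit,
    )
    # .data() turns the result into a list of plain dictionaries.
    return result.data()
```

Inside a session this could be called as `s.execute_read(search_movies, "crime")`.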

🌐 6. Loading Static Domain Data

6.1 Genres & Companies

def load_genres_and_companies(tx):
    genres = ['Action','Sci-Fi','Thriller','Drama']
    for name in genres:
        tx.run("MERGE (:Genre {name:$name})", name=name)

    companies = [
        {"name":"Warner Bros.", "founded":1923, "country":"US"},
        {"name":"Paramount Pictures", "founded":1912, "country":"US"}
    ]
    for c in companies:
        tx.run("""
          MERGE (co:Company {name:$name})
          SET co.founded = $founded, co.country = $country
        """, **c)

6.2 Movies & Genre Links

def load_movies(tx):
    movies = [
        {"title":"Inception","released":2010,"tagline":"Your mind is the scene of the crime","genres":["Thriller","Sci-Fi"]},
        {"title":"Interstellar","released":2014,"tagline":"Mankind’s next step will be our greatest","genres":["Sci-Fi","Drama"]}
    ]
    for m in movies:
        tx.run("MERGE (mv:Movie {title:$title}) SET mv.released=$released, mv.tagline=$tagline", **m)
        for g in m["genres"]:
            tx.run("""
              MATCH (mv:Movie {title:$title}), (g:Genre {name:$genre})
              MERGE (mv)-[:IN_GENRE]->(g)
            """, title=m["title"], genre=g)
  • MERGE ensures idempotent loads.
  • We attach two genres per movie.
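Issuing one query per row is fine for two movies, but larger loads are usually sent in batches with UNWIND. The following is a sketch of that pattern, not the article's code: chunked, BATCH_QUERY, load_movies_batched and the default batch size are hypothetical names and values, and the query assumes the same row shape as the movies list above.

```python
def chunked(rows, size):
    """Yield successive slices of `rows` with at most `size` items each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# One query handles a whole batch: the outer UNWIND iterates over movies,
# the inner UNWIND over each movie's genre names.
BATCH_QUERY = """
UNWIND $rows AS row
MERGE (mv:Movie {title: row.title})
SET mv.released = row.released, mv.tagline = row.tagline
WITH mv, row
UNWIND row.genres AS genre
MATCH (g:Genre {name: genre})
MERGE (mv)-[:IN_GENRE]->(g)
"""


def load_movies_batched(tx, movies, batch_size=1000):
    """Send movies to Neo4j in batches instead of one query per row."""
    for batch in chunked(movies, batch_size):
        tx.run(BATCH_QUERY, rows=batch).consume()
```

Because the query still relies on MERGE, re-running it stays idempotent.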

🎭 7. Characters, Actors, Directors, Writers

def load_people_and_roles(tx):
    # Characters
    characters = [
        {"name":"Cobb","movie":"Inception","archetype":"Hero"},
        {"name":"Murph","movie":"Interstellar","archetype":"Protege"}
    ]
    for ch in characters:
        tx.run("""
          MERGE (c:Character {name:$name, movie:$movie})
          SET c.archetype=$archetype
        """, **ch)

    # Actors & ACTED_AS
    actors = [
        {"name":"Leonardo DiCaprio","born":1974,"nationality":"US","character":"Cobb","year":2010},
        {"name":"Jessica Chastain","born":1977,"nationality":"US","character":"Murph","year":2014}
    ]
    for a in actors:
        tx.run("""
          MERGE (p:Person {name:$name})
          SET p.born=$born, p.nationality=$nationality
        """, **a)
        tx.run("""
          MATCH (p:Person {name:$name}), (c:Character {name:$character, movie:$character})
          MERGE (p)-[:ACTED_AS {roles:[$character], year:$year}]->(c)
        """, name=a["name"], character=a["character"], year=a["year"])

    # Directors
    directors = [
        {"director":"Christopher Nolan","movie":"Inception","year":2010},
        {"director":"Christopher Nolan","movie":"Interstellar","year":2014}
    ]
    for d in directors:
        tx.run("""
          MERGE (p:Person {name:$director})
          MERGE (m:Movie {title:$movie})
          MERGE (p)-[:DIRECTED {year:$year}]->(m)
        """, **d)
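  • We model characters separately from people; each Character is keyed by its name and movie title.
  • A Person is linked to a Character via ACTED_AS and to a Movie via DIRECTED. The listing above does not load writer data.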
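The overview mentions WROTE relationships for writers, but the listing above only loads actors and directors. A minimal sketch of how writers could be added in the same style follows. The WRITERS list and load_writers function are assumptions for illustration, and the Person-to-Movie direction of WROTE is one possible modelling choice.

```python
# Hypothetical writer data in the same shape as the directors list.
WRITERS = [
    {"writer": "Christopher Nolan", "movie": "Inception"},
    {"writer": "Jonathan Nolan", "movie": "Interstellar"},
    {"writer": "Christopher Nolan", "movie": "Interstellar"},
]


def load_writers(tx, writers=WRITERS):
    """MERGE each writer and a WROTE relationship to an existing movie."""
    for w in writers:
        tx.run(
            """
            MERGE (p:Person {name:$writer})
            WITH p
            MATCH (m:Movie {title:$movie})
            MERGE (p)-[:WROTE]->(m)
            """,
            **w,
        )
```

It could run after the movie load, for example with `s.execute_write(load_writers)`.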

📝 8. Reviews & Social Follows

def load_reviews_and_social(tx):
    # Users
    for u in ["Alice","Bob","Carol"]:
        tx.run("MERGE (:User {name:$name})", name=u)

    # Reviews
    reviews = [
        {"user":"Alice","movie":"Inception","rating":5,"date":"2021-01-01","comment":"Mind-blowing!"},
        {"user":"Bob","movie":"Interstellar","rating":4,"date":"2021-02-02","comment":"Epic visuals."}
    ]
    for r in reviews:
        tx.run("""
          MATCH (u:User {name:$user}), (m:Movie {title:$movie})
          CREATE (rev:Review {rating:$rating, date:date($date), comment:$comment})
          MERGE (u)-[:WROTE]->(rev)
          MERGE (rev)-[:FOR_MOVIE]->(m)
        """, **r)

    # Follows
    follows = [("Alice","Bob","2021-03-01"),("Bob","Carol","2021-03-05")]
    for fr,to,date in follows:
        tx.run("""
          MATCH (a:User {name:$fr}), (b:User {name:$to})
          MERGE (a)-[f:FOLLOWS]->(b)
          ON CREATE SET f.since = date($date)
        """, fr=fr, to=to, date=date)
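  • We create Review nodes with rating, date, and comment. Because they are inserted with CREATE rather than MERGE, re-running the script adds duplicate reviews.
  • Users are linked to their reviews via WROTE and to one another via FOLLOWS.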
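If the review load should be as repeatable as the MERGE-based loaders, one option is to give each review a deterministic key and MERGE on it. This is a sketch under assumptions: the id property, the review_key format and the load_review helper are hypothetical and not part of the original schema.

```python
def review_key(user, movie, date):
    """Build a deterministic id so the same review always maps to one node."""
    return f"{user}|{movie}|{date}"


# MERGE on the synthetic id, then set the remaining properties.
REVIEW_QUERY = """
MATCH (u:User {name:$user}), (m:Movie {title:$movie})
MERGE (rev:Review {id:$id})
SET rev.rating = $rating, rev.date = date($date), rev.comment = $comment
MERGE (u)-[:WROTE]->(rev)
MERGE (rev)-[:FOR_MOVIE]->(m)
"""


def load_review(tx, review):
    """Upsert one review; re-running with the same input adds no duplicate."""
    key = review_key(review["user"], review["movie"], review["date"])
    params = dict(review, id=key)
    tx.run(REVIEW_QUERY, **params).consume()
```

The trade-off is that a user can then hold only one review per movie per day, which may or may not suit your data.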

📆 9. Temporal Releases & Versions

def load_temporal_and_versions(tx):
    # Releases by region
    releases = [
        {"movie":"Inception","region":"US","date":"2010-07-16"},
        {"movie":"Inception","region":"FR","date":"2010-07-21"}
    ]
    for r in releases:
        tx.run("MERGE (rel:Release {region:$region, date:date($date)})", **r)
        tx.run("""
          MATCH (m:Movie {title:$movie}), (rel:Release {region:$region, date:date($date)})
          MERGE (m)-[:RELEASED_IN {region:$region, date:date($date)}]->(rel)
        """, **r)

    # Versions / Remasters
    versions = [{"movie":"Interstellar","label":"4K Remaster","releaseDate":"2020-11-01"}]
    for v in versions:
        tx.run("""
          MERGE (ver:Version {label:$label})
          SET ver.releaseDate=date($releaseDate)
        """, **v)
        tx.run("""
          MATCH (m:Movie {title:$movie}), (ver:Version {label:$label})
          MERGE (m)-[:HAS_VERSION {releaseDate:date($releaseDate)}]->(ver)
        """, **v)
  • Each Release captures a region and a date.
  • Version nodes let you track director’s cuts or remasters over time.
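The loaders above pass date strings straight to Cypher's date() function, which is expected to reject malformed input and abort the transaction. A small client-side check can catch bad rows first. The helper below is a hypothetical addition that assumes rows shaped like the releases list, with an ISO-8601 date string under the "date" key.

```python
from datetime import date


def validate_releases(rows):
    """Split rows into (valid, rejected) based on their ISO-8601 `date` strings."""
    valid, rejected = [], []
    for row in rows:
        try:
            date.fromisoformat(row["date"])  # raises ValueError on malformed input
        except (KeyError, TypeError, ValueError):
            rejected.append(row)
        else:
            valid.append(row)
    return valid, rejected


valid, rejected = validate_releases([
    {"movie": "Inception", "region": "US", "date": "2010-07-16"},
    {"movie": "Inception", "region": "FR", "date": "2010-13-40"},  # invalid month and day
])
print(len(valid), len(rejected))  # 1 1
```

Only the valid rows would then be handed to the loader, and the rejected ones logged for review.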

🔬 10. Graph Data Science

10.1 PageRank on Social Graph

def run_gds(tx):
    tx.run("CALL gds.graph.drop('social', false)").consume()
    tx.run("""
      CALL gds.graph.project('social','User',{FOLLOWS:{orientation:'NATURAL'}})
    """).consume()

    return tx.run("""
      CALL gds.pageRank.stream('social')
      YIELD nodeId, score
      RETURN gds.util.asNode(nodeId).name AS user, round(score,3) AS pr
      ORDER BY pr DESC
    """).data()
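To build intuition for the scores, the sketch below runs the textbook, unnormalized PageRank recurrence in plain Python on the three-user FOLLOWS chain from section 8. It is an illustration rather than the GDS implementation: the damping factor of 0.85 and the 20 iterations are assumed to match the GDS defaults, and the real procedure may differ in details such as tolerance or scaling.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Iterate PR(n) = (1 - d) + d * sum(PR(m) / outdeg(m)) over in-neighbours m."""
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    scores = {n: 1.0 - damping for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:
            incoming[dst] += scores[src] / out_degree[src]
        scores = {n: (1.0 - damping) + damping * incoming[n] for n in nodes}
    return scores


# Alice follows Bob, Bob follows Carol (same data as the follows list).
follows = [("Alice", "Bob"), ("Bob", "Carol")]
scores = pagerank(follows)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['Carol', 'Bob', 'Alice']
```

Carol ends up on top because rank flows along the direction of FOLLOWS: being followed raises a user's score, while following others does not.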

10.2 Movie Similarity via Genres

def run_movie_similarity(tx):
    tx.run("CALL gds.graph.drop('movieGenre', false)").consume()
    # Project the bipartite Movie -> Genre graph: node similarity compares
    # movies by the sets of genres they point to.
    tx.run("""
      CALL gds.graph.project('movieGenre', ['Movie','Genre'], 'IN_GENRE')
    """).consume()

    return tx.run("""
      CALL gds.nodeSimilarity.stream('movieGenre',{similarityCutoff:0.2})
      YIELD node1,node2,similarity
      WHERE node1 < node2  // keep each pair once
      RETURN gds.util.asNode(node1).title AS A,
             gds.util.asNode(node2).title AS B,
             round(similarity,3)             AS sim
      ORDER BY sim DESC LIMIT 5
    """).data()
  • PageRank reveals the most “influential” users in your social network.
  • Node similarity finds the top 5 most similar movie pairs based on shared genres.
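Node Similarity uses the Jaccard metric by default, comparing the neighbour sets of two nodes. As a sanity check on what to expect for the sample data, this plain-Python sketch computes the same metric over the genre lists from section 6.2. The jaccard helper and the movie_genres dictionary are written for this illustration only.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Genre lists copied from the movies loaded earlier.
movie_genres = {
    "Inception": ["Thriller", "Sci-Fi"],
    "Interstellar": ["Sci-Fi", "Drama"],
}

pairs = [
    (m1, m2, round(jaccard(movie_genres[m1], movie_genres[m2]), 3))
    for m1, m2 in combinations(sorted(movie_genres), 2)
]
print(pairs)  # [('Inception', 'Interstellar', 0.333)]
```

The two movies share one genre out of three distinct ones, so the similarity is 1/3, which clears the 0.2 cutoff used in the query.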

▶️ 11. Putting It All Together

def main():
    with driver.session() as s:
        s.execute_write(create_constraints_and_indexes)
        s.execute_write(load_genres_and_companies)
        s.execute_write(load_movies)
        s.execute_write(load_people_and_roles)
        s.execute_write(load_reviews_and_social)
        s.execute_write(load_temporal_and_versions)

        pr_scores = s.execute_read(run_gds)
        sim_pairs = s.execute_read(run_movie_similarity)

    print("PageRank scores:\n", pd.DataFrame(pr_scores))
    print("\nTop movie similarities:\n", pd.DataFrame(sim_pairs))
    driver.close()

if __name__ == "__main__":
    main()

Run:

python complex_kg.py

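Example output (values are approximate and depend on your GDS version and settings). Carol ranks highest because she sits at the end of the follow chain, and the two movies share one of three distinct genres, giving a Jaccard similarity of 1/3:

PageRank scores:
     user     pr
0  Carol  0.386
1    Bob  0.278
2  Alice  0.150

Top movie similarities:
            A             B    sim
0  Inception  Interstellar  0.333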

📈 12. Next Steps

  • Expose an API (Flask/FastAPI) that runs parameterized Cypher.
  • Load real data from CSV, TMDB or Wikidata.
  • Add neosemantics (n10s) plugin to export RDF/SPARQL.
  • Visualize with Neo4j Bloom or Neodash.

You now have a comprehensive Python-driven workflow: schema definition, data ingestion, analytics, all in a single, reproducible script. Happy graphing!
