Forem: Vlasta Pavicic

Memgraph vs. TigerGraph

Vlasta Pavicic — Fri, 18 Aug 2023 06:43:44 +0000

In today's data-driven world, the need to process and interpret complex relationships within massive datasets has organizations searching for a go-to graph database and leaving traditional relational options behind. After an initial look at DB-Engines, two names commonly come up in conversations: TigerGraph and Memgraph.

Background on both solutions

Founded in 2012 by Dr. Yu Xu, TigerGraph's core objective is to provide a scalable and efficient graph database platform that enables organizations to leverage the power of interconnected data, supporting applications ranging from fraud detection to AI and machine learning.

Memgraph is an in-memory, open-source graph database with roots in the UK and Croatia. Founded by Marko Budiselic and Dominik Tomicevic in 2016, and backed by American investors, Memgraph prioritizes high performance and developer accessibility. With a robust community edition, the platform offers a blend of ease of use and practical functionality, all presented through clear and uncomplicated licensing, making it the backbone of many cybersecurity solutions.

Memgraph vs. TigerGraph differences

Although both TigerGraph and Memgraph have been developed in C++ and aim to provide performant solutions for real-time data analytics, there exist some important differences between the two platforms that set them apart. Let’s check what those are.

Query language

The choice of query language plays a significant role in the overall user experience.

GSQL, TigerGraph's proprietary query language, does offer an expressive, Turing-complete language tailored for graph pattern-matching and analytics functions but might present a steeper learning curve for those new to graphs. It has been specifically designed for TigerGraph, and the skillset may not be easily transferred from or to other graph database platforms.

In contrast, Cypher is an open, declarative query language known for its user-friendly syntax. Its human-readable style has made it a de facto standard for querying graph databases. It was developed by Neo4j, opened up through the openCypher project, and is used by various systems, including Memgraph. Thanks to its simplicity and broad community support, it is a preferred choice for many developers, who know their applications will need minimal changes if they ever switch to another database vendor.

Data storage

TigerGraph and Memgraph offer distinctive approaches to handling data in their graph databases, each reflecting a unique strategy to balance performance, scalability, and flexibility.

TigerGraph employs a hybrid memory-disk approach, leveraging RAM for storing frequently accessed data and disk storage for large graphs that may exceed available memory. This hybrid model allows TigerGraph to achieve real-time analytics, where active datasets are immediately available, while also scaling to handle massive datasets without being constrained by RAM.

In contrast, Memgraph's architecture has been built natively for in-memory data analysis and storage, focusing on lightning-fast data processing. Being ACID compliant, it ensures consistency and reliability in its core design. However, Memgraph also offers flexibility. An analytical storage mode that bypasses ACID compliance is available, accelerating analytics and data import operations when absolute consistency is not required. Additionally, an on-disk storage option allows users to weigh performance against budget constraints, thus achieving a balance tailored to specific needs.

While TigerGraph's hybrid approach offers a comprehensive solution for both speed and scalability, Memgraph's focus on in-memory processing with adaptable options reflects a commitment to performance with versatility to suit various requirements. The distinction between these two models shows the innovation in graph database technology, catering to diverse needs in data management, analysis, and storage.

Data isolation level

TigerGraph employs a read-committed isolation level, meaning that a transaction can see any data committed before or during its execution.
For example, two identical READ queries inside one transaction can return different results if another transaction commits between them.

On the other hand, Memgraph uses snapshot isolation by default: each transaction operates on a consistent snapshot of the data taken at its start. The isolation level can be changed, but snapshot isolation provides a more consistent view of the data and reduces the chance of reading partial or uncommitted changes. The result is more predictable query results and a smoother transaction experience, which is why snapshot isolation is generally considered the more reliable approach in many scenarios.
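To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is a toy in-memory model and says nothing about how either database is implemented; the `ToyStore` and `ToyTransaction` classes are hypothetical names introduced only for illustration.

```python
import copy


class ToyStore:
    """Toy key-value store holding only committed data (illustration only)."""

    def __init__(self):
        self.committed = {}

    def commit(self, key, value):
        # Stands in for some other transaction committing a write.
        self.committed[key] = value

    def begin(self, isolation):
        return ToyTransaction(self, isolation)


class ToyTransaction:
    def __init__(self, store, isolation):
        self.store = store
        self.isolation = isolation
        # Snapshot isolation: freeze a copy of the committed data at start.
        self.snapshot = (
            copy.deepcopy(store.committed) if isolation == "snapshot" else None
        )

    def read(self, key):
        if self.isolation == "snapshot":
            return self.snapshot.get(key)
        # Read committed: always see the latest committed value.
        return self.store.committed.get(key)


store = ToyStore()
store.commit("balance", 100)

rc = store.begin("read_committed")
si = store.begin("snapshot")

first_rc, first_si = rc.read("balance"), si.read("balance")
store.commit("balance", 50)  # another transaction commits in between
second_rc, second_si = rc.read("balance"), si.read("balance")

print(first_rc, second_rc)  # 100 50
print(first_si, second_si)  # 100 100
```

The read-committed transaction sees two different values for the same read, while the snapshot transaction keeps seeing the value that was committed when it started.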

Pricing models and support

TigerGraph has a free version that allows users to work with up to 50GB of data, making it suitable for small projects or initial exploration. Memgraph offers something different with its Community Edition, which is not only free but also open-source and packed with features.

For example, both TigerGraph and Memgraph offer high availability features to ensure that data is consistently accessible and resistant to failures, but Memgraph's replication is available even in the Community Edition of the product. This means that the Community Edition is not "crippleware" but a fully functional version that allows users to "kick the tires" on the product and properly test it to ensure it meets requirements before deploying it in a production environment.

By offering this, and a plethora of other features in the Community Edition, Memgraph not only shows a commitment to performance and reliability but also to accessibility, empowering users to explore and validate the capabilities of the software without barriers.

Due to the lack of complicated layers of management that some larger companies might have, in Memgraph you can talk directly to engineers if you have questions or need help. It's a more hands-on, direct way of working that puts you closer to the people who built the product, and it can make working with Memgraph a more pleasant and efficient experience.

Takeaways on both graph databases

Memgraph and TigerGraph both offer graph database solutions, but Memgraph's native in-memory design sets it apart. Built for speed without losing stability or ACID compliance, Memgraph provides efficient real-time querying and analytics. Although TigerGraph claims to be "The World’s Fastest Graph Analytics Platform for the Enterprise", clients have reported increased performance after switching to Memgraph. If speed, reliability, direct interaction, and support from engineers are key priorities, Memgraph may be the more appealing choice.

Check the performance of Memgraph on your own dataset using Benchgraph, a graph database performance benchmark, and feel free to contact us about making the switch.

Graph Search Algorithms: Developer's Guide

Vlasta Pavicic — Thu, 15 Jun 2023 14:32:38 +0000

Graph search algorithms form the backbone of many applications, from social network analysis and route planning to data mining and recommendation systems. In this developer's guide, we will delve into the world of graph search algorithms, exploring their definition, significance, and practical applications.

At its core, a graph search algorithm is a technique used to traverse a graph, which is a collection of nodes connected by relationships. In various domains such as social networks, web pages, or biological networks, graph theory offers a powerful way to model complex interconnections.

The significance of graph search algorithms lies in their ability to efficiently explore and navigate these intricate networks. By traversing the graph, these algorithms can uncover hidden patterns, discover the shortest paths, and identify clusters.

One of the primary benefits of graph search algorithms is their adaptability to a wide array of applications. Whether you're developing a social media platform, designing a logistics system, or building a recommendation engine, by understanding the foundations of graph search algorithms and learning how to leverage them effectively, you'll unlock new possibilities for solving complex problems and advancing your development skills.

Types of graphs

Graphs are versatile data structures that can represent various types of relationships between objects. Understanding the different types of graphs is essential for effectively applying graph search algorithms. In this chapter, we will explore the most common types of graphs, their characteristics, and applications, along with visual examples to aid comprehension.

A directed graph, also known as a digraph, consists of nodes connected by directed relationships. In a directed relationship, there is a specific direction from one node (the source) to another node (the target). A directed graph is used to model relationships with a sense of direction, such as web pages with hyperlinks, dependencies between tasks, or social media following relationships.

An undirected graph is a graph in which relationships have no specified direction. Undirected graphs are commonly used to represent relationships that are bidirectional, such as friendships in a social network or connections between web pages.

Weighted graphs assign numerical values, known as weights, to relationships to represent the strength, distance, or cost between nodes. These weights can influence the behavior of graph search algorithms, allowing for more specific optimizations or finding the shortest or cheapest paths. Weighted graphs find applications in areas like network routing, resource allocation, or finding optimal solutions in various domains.

A bipartite graph is a graph whose nodes can be divided into two disjoint sets, with every relationship connecting a node from one set to a node from the other. Bipartite graphs are often used to model relationships between two distinct types of entities, such as users and products, students and courses, or employees and skills. They are particularly useful in applications like recommendation systems or matching problems.

Cyclic graphs contain at least one cycle, which is a path that starts and ends at the same node. Acyclic graphs, as the name suggests, do not contain any cycles. Cyclic graphs can represent scenarios with repeating or circular relationships, while acyclic graphs are often used in tasks like topological sorting or modeling hierarchical structures.

Understanding the different types of graphs provides developers with a foundation for choosing the appropriate graph representation for their specific problem. Each type has its own characteristics and applications, and selecting the right graph structure can significantly impact the efficiency and accuracy of graph search algorithms.
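As a rough illustration of how these graph types differ in code, the sketch below stores a weighted graph as a nested dictionary and builds both a directed and an undirected version from the same list of relationships. The `build_graph` helper and the sample data are assumptions made for this example, not part of any particular library.

```python
from collections import defaultdict


def build_graph(edges, directed=False):
    """Build an adjacency map {node: {neighbor: weight}} from (u, v, weight) triples."""
    adjacency = defaultdict(dict)
    for source, target, weight in edges:
        adjacency[source][target] = weight
        adjacency[target]  # make sure the target exists even without outgoing edges
        if not directed:
            # An undirected relationship can be followed in both directions.
            adjacency[target][source] = weight
    return dict(adjacency)


edges = [("A", "B", 4), ("B", "C", 1), ("A", "C", 7)]

directed = build_graph(edges, directed=True)
undirected = build_graph(edges, directed=False)

print(directed["C"])    # {}
print(undirected["C"])  # {'B': 1, 'A': 7}
```

In the directed version, node C has no outgoing relationships, while in the undirected version it can reach both B and A.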

Basic graph search algorithms

In the realm of graph search algorithms, two fundamental techniques stand out: Breadth-first search (BFS) and Depth-first search (DFS). These algorithms provide crucial building blocks for traversing and exploring graphs in different ways. While BFS focuses on exploring the breadth of a graph by systematically visiting all neighboring nodes before moving deeper, DFS delves into the depths of a graph, exhaustively exploring one branch before backtracking.

BFS and DFS Basics

Depth-first search is an algorithm that explores a graph by traversing as far as possible along each branch before backtracking. The key working principle of DFS is to visit a node and then recursively explore its unvisited neighbors until there are no more unvisited nodes in that branch. This depth-first exploration strategy can be implemented using a stack or recursion. By utilizing a stack, DFS ensures that the most recently discovered nodes are visited first, delving deeper into the graph.

Breadth-first search, or BFS, systematically explores a graph by visiting all the nodes of a given level before moving to the next level. The core working principle of BFS is to use a queue to maintain a level-wise exploration order. The algorithm starts with a source node, enqueues its neighbors, and continues this process until all reachable nodes have been visited. The breadth-first traversal strategy ensures that nodes are visited in increasing order of their distance from the source, which makes it possible to find the shortest path in unweighted graphs.

Both DFS and BFS maintain a record of visited nodes to prevent revisiting the same node repeatedly. This visited node management is crucial to ensure that the algorithms terminate correctly and avoid infinite loops in cyclic graphs. Typically, a boolean array or hash set is used to keep track of visited nodes, marking them as visited when they are encountered during graph traversal.

In terms of time and space complexities, the time complexity of both algorithms is O(V + E), where V represents the number of vertices (nodes) and E represents the number of edges (relationships) in the graph. The space complexity for both algorithms is O(V) since DFS requires a stack to keep track of nodes, while BFS requires a queue to store nodes during traversal.
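The claim that BFS visits nodes in increasing order of distance can be checked with a short Python sketch. It assumes the graph is given as a dictionary of neighbor lists; the function name `bfs_distances` and the sample graph are made up for this example.

```python
from collections import deque


def bfs_distances(graph, start):
    """Return {node: number of hops from start} for every reachable node."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, []):
            if neighbor not in distances:  # doubles as the visited set
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(bfs_distances(graph, "A"))  # {'A': 0, 'B': 1, 'C': 1, 'D': 2, 'E': 3}
```

Each node and each relationship is handled a constant number of times, which is where the O(V + E) running time comes from.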

Use cases

DFS and BFS are powerful graph search algorithms that find wide-ranging applications across various domains.

BFS is often employed in web crawling, where it helps in systematically exploring and indexing web pages starting from a given source page. By utilizing BFS, web crawlers can visit neighboring pages before moving to deeper levels, ensuring comprehensive coverage of a website or a set of interconnected websites. BFS-based crawling allows search engines to index web pages efficiently, enabling users to find relevant information quickly.

BFS plays a vital role in route planning algorithms, especially in unweighted graphs or maps. It helps in finding the shortest path between two locations, ensuring that the traversal explores neighboring locations before expanding the search to farther areas. By utilizing BFS, route planning applications can provide optimal directions, whether it's for driving, walking, or public transportation.

DFS and BFS are valuable tools in recommendation engines, helping to discover relevant content or items for users. DFS can be used to find similar users or items based on shared characteristics, facilitating collaborative filtering approaches. BFS can explore the graph of user-item interactions to recommend items that are highly connected or related to the user's preferences.

In social networks, DFS can be used to identify connected components, that is, groups of individuals who can all reach each other. BFS is useful for determining the shortest path between two individuals, showcasing the most efficient connections within the network. These algorithms also aid in social network analysis, such as identifying influential individuals or detecting communities and clusters.
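Finding connected components with DFS could look roughly like the Python sketch below. It assumes an undirected graph stored as a dictionary of neighbor lists; the names in the sample network are invented.

```python
def connected_components(graph):
    """Group the nodes of an undirected graph into connected components."""
    visited = set()
    components = []
    for node in graph:
        if node in visited:
            continue
        # Run an iterative DFS from every node that has not been reached yet.
        component, stack = [], [node]
        visited.add(node)
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbor in graph[current]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


friendships = {
    "Ana": ["Ben"],
    "Ben": ["Ana", "Cleo"],
    "Cleo": ["Ben"],
    "Dan": ["Eve"],
    "Eve": ["Dan"],
}
print(connected_components(friendships))  # [['Ana', 'Ben', 'Cleo'], ['Dan', 'Eve']]
```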

The applications mentioned above provide just a glimpse into the broad spectrum of possibilities that DFS and BFS offer. These algorithms are versatile tools that can be adapted to various domains and problem spaces, allowing developers to tackle complex challenges efficiently.

Implementation

Implementing DFS and BFS requires careful consideration and adherence to best practices to ensure correct and efficient implementations.

As a first step, understand the implications of DFS and BFS traversal orders and select the appropriate order for your specific use case.

For DFS, decide whether to explore the graph in a pre-order, in-order, or post-order manner. In pre-order DFS, a node is processed (visited or printed) before traversing its children. The traversal starts at the root node and then recursively explores the left and right subtrees in pre-order. The pre-order strategy is often used in tree-based traversals.

In in-order DFS, a node is processed between the traversal of its left and right subtrees. In other words, the left subtree is explored first, followed by processing the current node and then exploring the right subtree. In-order DFS is primarily used in binary tree traversals and is especially relevant for binary search trees to visit nodes in ascending order.

In post-order DFS, a node is processed after traversing its children. The traversal starts at the root node and recursively explores the left and right subtrees in post-order before finally processing the current node. Post-order DFS is commonly used in tree-based algorithms that require processing child nodes before processing the parent node.

Pre-order is often useful for constructing a copy of the tree or encoding the tree structure. In-order is suitable for tasks like sorting elements in a binary search tree. Post-order is valuable for tasks such as calculating expressions in a parse tree.
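The three orders differ only in where the current node is processed relative to its subtrees. A small Python sketch on a binary search tree shows the effect; the `TreeNode` class and the sample tree are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TreeNode:
    value: int
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None


def pre_order(node):
    """Process the node, then its left subtree, then its right subtree."""
    if node is None:
        return []
    return [node.value] + pre_order(node.left) + pre_order(node.right)


def in_order(node):
    """Process the left subtree, then the node, then the right subtree."""
    if node is None:
        return []
    return in_order(node.left) + [node.value] + in_order(node.right)


def post_order(node):
    """Process both subtrees first and the node last."""
    if node is None:
        return []
    return post_order(node.left) + post_order(node.right) + [node.value]


# Binary search tree:   4
#                      / \
#                     2   6
#                    / \
#                   1   3
root = TreeNode(4, TreeNode(2, TreeNode(1), TreeNode(3)), TreeNode(6))

print(pre_order(root))   # [4, 2, 1, 3, 6]
print(in_order(root))    # [1, 2, 3, 4, 6]
print(post_order(root))  # [1, 3, 2, 6, 4]
```

Note that the in-order result comes out sorted, which is the property binary search trees rely on.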

For BFS, ensure that the traversal explores nodes in the order dictated by the queue, maintaining the level-wise exploration.

If the graph contains cycles, consider incorporating cycle detection mechanisms or break conditions to avoid infinite loops during traversal. Implement a mechanism to handle backtracking in DFS to ensure that previously visited nodes are not revisited unnecessarily.

Be aware of the time and space complexities of DFS and BFS to gauge their efficiency and performance for different graph sizes and structures.

Consider the following pseudocode examples for implementing DFS and BFS.

Depth-First Search (DFS):


function dfs(graph, start):
    stack = empty stack
    visited = empty map      // nodes that have already been discovered
    stack.push(start)
    visited[start] = true

    while stack is not empty:
        current = stack.pop()
        process(current)

        for each neighbor in graph.adjacentNodes(current):
            if neighbor is not visited:
                stack.push(neighbor)
                visited[neighbor] = true

Breadth-First Search (BFS):


function bfs(graph, start):
    queue = empty queue
    visited = empty map      // nodes that have already been discovered
    queue.enqueue(start)
    visited[start] = true

    while queue is not empty:
        current = queue.dequeue()
        process(current)

        for each neighbor in graph.adjacentNodes(current):
            if neighbor is not visited:
                queue.enqueue(neighbor)
                visited[neighbor] = true
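For readers who want something runnable, the two pseudocode functions above translate to Python roughly as follows. This is a sketch that assumes the graph is a dictionary of neighbor lists and that "processing" a node simply means recording it; the sample graph is invented.

```python
from collections import deque


def dfs(graph, start):
    """Iterative DFS; returns nodes in the order they are processed."""
    visited = {start}
    stack = [start]
    order = []
    while stack:
        current = stack.pop()  # last in, first out
        order.append(current)
        for neighbor in graph[current]:
            if neighbor not in visited:
                visited.add(neighbor)
                stack.append(neighbor)
    return order


def bfs(graph, start):
    """BFS; returns nodes in the order they are processed."""
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        current = queue.popleft()  # first in, first out
        order.append(current)
        for neighbor in graph[current]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order


graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}
print(dfs(graph, "A"))  # ['A', 'C', 'E', 'B', 'D']
print(bfs(graph, "A"))  # ['A', 'B', 'C', 'D', 'E']
```

The only difference between the two functions is the data structure: the stack makes DFS follow one branch to its end, while the queue makes BFS finish a whole level first.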

Advanced graph search algorithms

In addition to the basic graph search algorithms DFS and BFS, advanced graph search algorithms offer powerful solutions to more complex problems. Two such algorithms, Dijkstra's and Bellman-Ford algorithms, play a crucial role in finding the shortest paths in weighted graphs. In this section, we will introduce Dijkstra's Algorithm and Bellman-Ford Algorithm, highlighting their significance in optimizing route planning, network routing, and other applications that involve finding the most efficient paths in graphs with weighted edges.

Dijkstra's and Bellman-Ford basics

Dijkstra's algorithm is a widely used algorithm for finding the shortest path between a source node and all other nodes in a weighted graph. It works on graphs with non-negative relationship weights and is particularly suitable for scenarios like route planning, network routing, or finding optimal paths in transportation networks.

Dijkstra's algorithm employs a greedy approach, iteratively selecting the node with the smallest tentative distance and updating the distances of its neighboring nodes until all nodes have been visited. A priority queue (min-heap) lets it extract the node with the minimum distance efficiently. With a simple array scan over an adjacency matrix, the time complexity is O(V^2). With an adjacency list and a binary heap, it drops to O((V + E) log V), where E is the number of edges (relationships) and V is the number of vertices (nodes) in the graph.

Bellman-Ford is another important algorithm for finding the shortest paths, and it also works on graphs that contain negative relationship weights. Unlike Dijkstra's algorithm, which assumes non-negative weights, the Bellman-Ford algorithm can handle graphs with negative weights as long as there are no negative cycles. It is commonly used in scenarios such as network routing, distance vector routing protocols, or detecting negative cycles in graphs. The Bellman-Ford algorithm iterates through all relationships in the graph repeatedly, updating the distances of nodes based on the relaxation principle. It maintains an array to track the distances of nodes and iterates through all relationships |V| - 1 times to ensure convergence. The time complexity of the Bellman-Ford algorithm is O(V · E), where V represents the number of vertices and E represents the number of relationships in the graph.

Use cases

Dijkstra's and Bellman-Ford algorithms excel in solving real-world problems related to route planning, network optimization, resource allocation, and various other domains.

Both algorithms are extensively used in route planning applications to find the shortest path between locations in transportation networks, such as roads, railways, or flight routes. These algorithms enable navigation systems to calculate optimal routes for drivers, public transportation, and logistics planning.

In computer networks, these graph search algorithms play a crucial role in determining the most efficient paths for routing packets or establishing network connections. They help optimize the flow of data, reduce latency, and improve network performance. Dijkstra's Algorithm is commonly used in link-state routing protocols like OSPF (Open Shortest Path First), while the Bellman-Ford algorithm is employed in distance-vector routing protocols like RIP (Routing Information Protocol).

Both Dijkstra's algorithm and Bellman-Ford algorithm find applications in resource allocation problems, such as allocating resources in cloud computing, scheduling tasks, or optimizing supply chains. These algorithms assist in identifying the most cost-effective or time-efficient paths to allocate resources and optimize resource utilization. The Bellman-Ford algorithm is especially useful with negative relationship weights (without negative cycles) in scenarios where resource costs or constraints are involved.

The algorithms also aid in managing critical infrastructure and facilities, such as power grids, telecommunications networks, or water distribution systems, where they can help identify optimal paths for maintenance crews, detect faults or disruptions, and optimize resource allocation for efficient operation.

They can also help optimize supply chain and logistics operations, such as inventory management, order fulfillment, or delivery route optimization by determining the most efficient paths for transporting goods, minimizing costs, and improving customer satisfaction.

The applications mentioned above are just a glimpse of the diverse possibilities that Dijkstra's and Bellman-Ford algorithms offer. Both are versatile and can be adapted to various problem domains requiring optimization and efficient path-finding in graphs.

Implementation

When implementing graph search algorithms, such as Dijkstra's algorithm and Bellman-Ford algorithm, be sure to provide proper handling of relationship weights according to the requirements of the algorithms.

In the case of using Dijkstra's algorithm, validate that the relationship weights are non-negative. For the Bellman-Ford algorithm, ensure that the graph does not contain any negative cycles to maintain correct behavior.

Initialize data structures and variables appropriately before starting the algorithm. In both Dijkstra's and the Bellman-Ford algorithm, set the distance of the source node to 0 and the distances of all other nodes to infinity. Initialize other auxiliary data structures as required, such as priority queues or arrays.

Understand the relaxation step in each algorithm and implement it correctly. Update the distances and other relevant information whenever a shorter path is found during the traversal.

Implement appropriate termination conditions for the algorithms to ensure they terminate correctly. For Dijkstra's algorithm, terminate when the destination node is reached or when all reachable nodes have been visited. In the Bellman-Ford algorithm, terminate when no further updates can be made, indicating that the distances have converged.

Consider the following pseudocode examples for implementing Dijkstra's and Bellman-Ford algorithms. The pseudocode assumes the existence of appropriate data structures like priorityQueue for Dijkstra's Algorithm and arrays like distances and visited to store necessary information.

Dijkstra's algorithm:


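function dijkstra(graph, start):
    // Every node starts at an unknown (infinite) distance, except the source
    for each node in graph.nodes:
        distances[node] = infinity
        visited[node] = false
    distances[start] = 0
    priorityQueue.enqueue(start, 0)

    while priorityQueue is not empty:
        current = priorityQueue.dequeue()
        if visited[current]:
            continue    // stale entry for a node that is already finalized
        visited[current] = true

        for each neighbor in graph.adjacentNodes(current):
            if not visited[neighbor]:
                distance = distances[current] + graph.edgeWeight(current, neighbor)
                if distance < distances[neighbor]:
                    distances[neighbor] = distance
                    priorityQueue.enqueue(neighbor, distance)

    return distances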
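A runnable Python counterpart of the Dijkstra pseudocode might look like the sketch below. It uses the standard `heapq` module as the priority queue and assumes the graph is a nested dictionary `{node: {neighbor: weight}}` in which every node appears as a key; the sample road network is invented.

```python
import heapq


def dijkstra(graph, start):
    """Shortest distances from start in a graph with non-negative weights."""
    distances = {node: float("inf") for node in graph}
    distances[start] = 0
    heap = [(0, start)]
    visited = set()
    while heap:
        distance, current = heapq.heappop(heap)
        if current in visited:
            continue  # stale entry left over from an earlier, longer path
        visited.add(current)
        for neighbor, weight in graph[current].items():
            if weight < 0:
                raise ValueError("Dijkstra's algorithm requires non-negative weights")
            candidate = distance + weight
            if candidate < distances[neighbor]:
                distances[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return distances


roads = {
    "A": {"B": 4, "C": 1},
    "B": {"D": 1},
    "C": {"B": 2, "D": 5},
    "D": {},
}
print(dijkstra(roads, "A"))  # {'A': 0, 'B': 3, 'C': 1, 'D': 4}
```

Because `heapq` has no decrease-key operation, the sketch pushes a new entry whenever a shorter path is found and skips outdated entries when they are popped.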

Bellman-Ford algorithm:


function bellmanFord(graph, start):
    for each node in graph.nodes:
        distances[node] = infinity
    distances[start] = 0

    // Relax every edge |V| - 1 times
    for i = 1 to |V| - 1:
        for each edge in graph.edges:
            source = edge.source
            destination = edge.destination
            weight = edge.weight

            if distances[source] + weight < distances[destination]:
                distances[destination] = distances[source] + weight

    // One more pass: any further improvement means a negative cycle
    for each edge in graph.edges:
        source = edge.source
        destination = edge.destination
        weight = edge.weight

        if distances[source] + weight < distances[destination]:
            // Negative cycle detected
            // Handle the presence of negative cycles accordingly
            report "graph contains a negative cycle"

    return distances
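In Python, the same idea could be sketched as follows. The function takes the node list and a list of `(source, destination, weight)` triples, which is an assumed input format for this example, and it raises an error when a negative cycle is found. It also stops early once a full pass makes no updates, a common optimization.

```python
def bellman_ford(nodes, edges, start):
    """Shortest distances from start; edges are (source, destination, weight) triples."""
    distances = {node: float("inf") for node in nodes}
    distances[start] = 0

    for _ in range(len(nodes) - 1):
        updated = False
        for source, destination, weight in edges:
            if distances[source] + weight < distances[destination]:
                distances[destination] = distances[source] + weight
                updated = True
        if not updated:
            break  # distances have converged

    # One extra pass: any further improvement means a negative cycle.
    for source, destination, weight in edges:
        if distances[source] + weight < distances[destination]:
            raise ValueError("graph contains a negative cycle reachable from start")
    return distances


nodes = ["A", "B", "C", "D"]
edges = [("A", "B", 4), ("A", "C", 5), ("C", "B", -3), ("B", "D", 2)]
print(bellman_ford(nodes, edges, "A"))  # {'A': 0, 'B': 2, 'C': 5, 'D': 4}
```

The negative weight on the relationship from C to B makes the detour through C cheaper than the direct route from A to B, which Dijkstra's algorithm is not designed to handle.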

Performance analysis and optimization

Efficient performance is crucial when applying graph search algorithms to real-world problems. Runtime grows with the number of nodes and relationships in the graph, and denser graphs are more expensive to search.

Efficiency can be increased by incorporating domain-specific knowledge or rules that heuristically guide the search towards more promising paths and prioritize certain nodes or edges. Another way is pruning, that is, eliminating branches or subtrees from further exploration based on certain criteria. Pruning techniques, such as alpha-beta pruning in game-playing scenarios, discard portions of the search space that are deemed irrelevant, significantly reducing the algorithm's runtime. A third technique is memoization, which involves storing previously computed results to avoid redundant calculations. In graph search algorithms, memoization can cache intermediate results, such as computed distances or paths, which pays off when there are overlapping subproblems.
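As a small illustration of memoization, the Python sketch below counts the distinct paths between two nodes of a directed acyclic graph. The graph and the `count_paths` function are made up for this example; `functools.lru_cache` from the standard library does the caching.

```python
from functools import lru_cache

# Hypothetical directed acyclic graph: each node maps to its successors.
dag = {
    "A": ("B", "C"),
    "B": ("D",),
    "C": ("D",),
    "D": ("E", "F"),
    "E": ("G",),
    "F": ("G",),
    "G": (),
}


@lru_cache(maxsize=None)
def count_paths(node, target):
    """Number of distinct paths from node to target in the DAG."""
    if node == target:
        return 1
    return sum(count_paths(successor, target) for successor in dag[node])


print(count_paths("A", "G"))            # 4
print(count_paths.cache_info().misses)  # 7
```

Node D is reached through both B and C, but its subproblem is computed only once; the second request is answered from the cache, so each of the seven nodes is evaluated a single time.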

BFS and DFS can benefit from performance optimizations by applying pruning techniques to skip unnecessary relationships or paths during traversal. Additionally, using appropriate data structures for maintaining visited nodes and tracking the traversal order can improve efficiency.

The performance of Dijkstra's algorithm can be optimized by utilizing efficient data structures like priority queues or heaps to retrieve nodes with the minimum distance more quickly. Heuristics can also be incorporated to prioritize the exploration of nodes that are more likely to yield shorter paths, which is the idea behind the A* algorithm.

The Bellman-Ford algorithm can benefit from early termination if no updates are made during an iteration, indicating that the distances have converged. This can save unnecessary iterations, improving runtime.

Wrap up

In this comprehensive guide, we have explored the fundamental concepts and practical applications of various graph search algorithms. Starting with the basic algorithms like BFS and DFS, we gained insights into their working principles and traversal strategies. We also delved into Dijkstra's and Bellman-Ford algorithms, which excel in solving complex problems involving weighted graphs. We explored their specific use cases, advantages over basic algorithms, and discussed optimization techniques to improve their efficiency.

Understanding the nuances of graph search algorithms empowers the developer community to solve a wide range of problems across diverse domains, ultimately contributing to the advancement of technology and society.

If you want to know more about various graph algorithms, download the Graph Algorithms for Beginners whitepaper that also deals with centrality and machine learning algorithms.

Why are nodes with a high betweenness centrality score high maintenance

Vlasta Pavicic — Thu, 16 Feb 2023 10:52:15 +0000

Each process or network has very important resources you need to take extra care of - whether it’s that crucial piece of data, the unicorn dev you hired, or an expensive piece of infrastructure. Consequently, every graph dataset has very important nodes. Some nodes are important in the way they are crucial to the successful performance of your system, but some (sometimes those exact same) nodes are important because if they fail, they can wreak havoc.

To find out which nodes in a network are important based on the topological structure of the network and thus relevant to its success or ruin, run a centrality analysis of the nodes in your graph database.

Centrality analysis measures

Centrality analysis can be done using measures that examine either node degrees or shortest paths.

A degree is the number of relationships a certain node has. When only the relationships of a specific node are important, the analysis is done using degree centrality. But, if the degree of surrounding nodes is also included in the equation, the analysis is done using the eigenvector centrality.

Just like in real life, sometimes it’s important to have a lot of friends, but other times it’s important to have friends who have many other friends (politicians have cracked this one).
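Degree centrality is simple enough to compute by hand. The Python sketch below normalizes each node's degree by the maximum possible degree, n - 1, which is one common convention; the friendship network is invented for this example.

```python
def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


# Hypothetical undirected friendship network.
friends = {
    "Ana": ["Ben", "Cleo"],
    "Ben": ["Ana", "Cleo"],
    "Cleo": ["Ana", "Ben", "Dan"],
    "Dan": ["Cleo", "Eve", "Finn"],
    "Eve": ["Dan"],
    "Finn": ["Dan"],
}
print(degree_centrality(friends))
# {'Ana': 0.4, 'Ben': 0.4, 'Cleo': 0.6, 'Dan': 0.6, 'Eve': 0.2, 'Finn': 0.2}
```

Eigenvector centrality would go one step further and also weigh each friend by that friend's own score.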

The other way of looking at the importance of a node is by examining the number of shortest paths a node is a part of. To find out which node spreads the information the quickest because it’s the closest one to many other nodes, analyze the graph using the closeness centrality (and find that one person that’s friends with everybody else and knows everything).

In the graph above, both C and E nodes have high closeness centrality as they can access all other nodes the fastest.
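Closeness centrality can be sketched in Python with one BFS per node. The version below uses the common definition of (n - 1) divided by the sum of shortest-path distances to all other nodes and assumes a connected, unweighted graph. The friendship network is a separate, invented example and not the graph from the figure.

```python
from collections import deque


def closeness_centrality(graph):
    """(n - 1) / sum of shortest-path distances, for a connected unweighted graph."""
    scores = {}
    for source in graph:
        # BFS gives the number of hops from source to every other node.
        distances = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbor in graph[current]:
                if neighbor not in distances:
                    distances[neighbor] = distances[current] + 1
                    queue.append(neighbor)
        scores[source] = (len(graph) - 1) / sum(distances.values())
    return scores


# Hypothetical undirected friendship network.
friends = {
    "Ana": ["Ben", "Cleo"],
    "Ben": ["Ana", "Cleo"],
    "Cleo": ["Ana", "Ben", "Dan"],
    "Dan": ["Cleo", "Eve", "Finn"],
    "Eve": ["Dan"],
    "Finn": ["Dan"],
}
closeness = closeness_centrality(friends)
print(round(closeness["Cleo"], 3), round(closeness["Dan"], 3))  # 0.714 0.714
print(closeness["Ana"])  # 0.5
```

Cleo and Dan sit in the middle of this network, so they reach everyone else in the fewest hops and share the highest score.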

To identify the nodes that control the passing of information, you need to find out the nodes’ betweenness centrality.

Apart from torturing you with spelling, betweenness centrality discovers nodes that have considerable influence over the network because they play a bridging role. A node's betweenness centrality is computed by taking every pair of other nodes, working out the fraction of shortest paths between them that pass through the node, and summing those fractions over all pairs.

It’s the person who brings many different people together. But when that bridging person is out of town, some of their friends are left spending the night alone in front of the TV.

In the graph above, node E has the highest betweenness centrality score because, if it stopped working, it would disconnect the highest number of nodes from the rest of the graph.

In other words, the moment a node with a high betweenness centrality score fails to do whatever it was designed to do, it’s time to start fixing issues, because some nodes are no longer attached to the network. So let’s look at betweenness centrality more closely to learn how to avoid problems.
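For small graphs, betweenness centrality can be computed with a compact Python sketch of Brandes' algorithm, a well-known method for this measure. The version below assumes an unweighted, undirected graph and returns unnormalized scores, that is, the number of node pairs whose shortest paths run through each node (counting fractions when several shortest paths tie). The friendship network is invented and is not the graph from the figure.

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    scores = {node: 0.0 for node in graph}
    for source in graph:
        # Phase 1: BFS from source that also counts shortest paths.
        order = []
        predecessors = {node: [] for node in graph}
        path_count = {node: 0 for node in graph}
        distance = {node: -1 for node in graph}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            current = queue.popleft()
            order.append(current)
            for neighbor in graph[current]:
                if distance[neighbor] < 0:
                    distance[neighbor] = distance[current] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[current] + 1:
                    path_count[neighbor] += path_count[current]
                    predecessors[neighbor].append(current)
        # Phase 2: walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in graph}
        for node in reversed(order):
            for predecessor in predecessors[node]:
                share = path_count[predecessor] / path_count[node]
                dependency[predecessor] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    # Each undirected pair was counted once from each endpoint.
    return {node: score / 2 for node, score in scores.items()}


# Hypothetical undirected friendship network.
friends = {
    "Ana": ["Ben", "Cleo"],
    "Ben": ["Ana", "Cleo"],
    "Cleo": ["Ana", "Ben", "Dan"],
    "Dan": ["Cleo", "Eve", "Finn"],
    "Eve": ["Dan"],
    "Finn": ["Dan"],
}
print(betweenness_centrality(friends))
# {'Ana': 0.0, 'Ben': 0.0, 'Cleo': 6.0, 'Dan': 7.0, 'Eve': 0.0, 'Finn': 0.0}
```

Dan scores highest here: if Dan drops out, Eve and Finn are cut off from everyone else, which is exactly the kind of node that needs extra attention.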

Betweenness centrality use cases

Betweenness centrality can help discover pain points in networks and knowledge graphs built around various industries.

Network Optimization

Maybe the most important usage of this algorithm is in transportation. In a complex urban transportation network, betweenness centrality measures can reveal the main bottlenecks and congestion points within the system. It can help organize the infrastructure of a big city and decrease time spent optimizing routes.

It can also be used to identify the most relevant landing points of the global network of submarine internet cables. Keeping a close eye on the cables between those points can avoid havoc if sharks attack and damage the cables… again.

In energy management systems, betweenness centrality is an excellent indicator of weak points that could help prevent power outages between two distinct areas in the country or analyze critical points in the supply chain pipeline.

Data Lineage

In the data lineage graph, the betweenness centrality algorithm will identify nodes that are the main sources of data for other assets. Any mistakes found in those origin assets could impact the reliability of all the other connected data nodes.

Also, nodes with high betweenness centrality scores can have a large number of data sources, making them susceptible to frequent changes which propagate to other data entities. It would be wise to check the reliability of those nodes often to make sure they source their data correctly.

Fraud Detection

When analyzing a known fraud organization, betweenness centrality helps identify nodes that act as bridges between clients as they are more likely to commit fraud. Once suspicions are confirmed, data can be fed to the machine learning model to identify suspicious behaviors and clusters on larger datasets or predict fraud.

While on the topic of fraud, lots of information, goods, and activities flow through the important nodes in a criminal network, especially through those with high betweenness centrality measures. These nodes could be particularly interesting to disrupt. Identifying and stopping the most effective fraudster could slow down the activity of the entire criminal network.

Identity and Access Management

The betweenness centrality measure can pinpoint which resources are controlled by a limited number of individuals. Access limited through only a few key persons might slow down the flow of information if some of those resources or people become unavailable.

The worst-case scenario is if the flow is stopped completely. Just as PageRank can warn about excessive privileges in identity and access management systems, betweenness centrality can alert about inadequate privileges.

Cyber Security

Some cyber-attacks aim to remove the most important node in the network topology, detaching its adjacent relationships and thus creating the most damage. By calculating the importance of the nodes in the network, you can direct your efforts to the most important parts of your infrastructure.

Also, as a cyberattack is a series of actions that are carried out in a certain order (path) against certain assets in the network (nodes), you can calculate through which node the most possible attack paths go. If each hop in the attack path is given a probability measure, you can use the graph as a vulnerability tree to predict which attacks are most likely to succeed. And come up with a plan to prevent them.
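
A minimal sketch of that idea follows. The asset names and hop probabilities are invented for illustration, and multiplying the probabilities along a path assumes that the hops succeed independently of each other.

```python
# Hypothetical attack graph: each relationship carries an assumed probability
# that an attacker succeeds in making that hop.
ATTACK_GRAPH = {
    "internet": {"web_server": 0.5, "vpn": 0.2},
    "web_server": {"database": 0.5},
    "vpn": {"database": 0.9},
    "database": {},
}


def attack_paths(graph, source, target, path=None, probability=1.0):
    """Yield (path, probability) for every simple path from source to target."""
    path = (path or []) + [source]
    if source == target:
        yield path, probability
        return
    for neighbor, hop_probability in graph[source].items():
        if neighbor not in path:  # never revisit a node on the same path
            yield from attack_paths(
                graph, neighbor, target, path, probability * hop_probability
            )


# Most likely attack paths first.
ranked = sorted(
    attack_paths(ATTACK_GRAPH, "internet", "database"),
    key=lambda item: item[1],
    reverse=True,
)
for path, probability in ranked:
    print(" -> ".join(path), round(probability, 3))
```

Sorting the paths by probability shows which route deserves attention first; counting how often each node appears across the likeliest paths would point to the assets worth hardening.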

Recommendation Engines

This is sort of a bonus use case, as it’s more targeted towards you succeeding rather than avoiding issues. If you are struggling with what products to recommend, think of the shopping process as a path of several crucial steps for your business (adding to basket, paying and similar actions) and steps that are mostly browsing and checking stuff out.

A high betweenness centrality measure indicates that people bought certain items without too much wandering and overthinking: they saw it, added it to the basket, checked out and paid. They took the shortest buying path, and their paths might cross on a single product. Those are the products you should recommend as the must-haves.

Implementation in Memgraph

The calculation of betweenness centrality is not standardized, and there are many ways to solve it. The algorithm implemented in Memgraph is described in the paper "A Faster Algorithm for Betweenness Centrality" by Ulrik Brandes of the University of Konstanz.

Memgraph has implemented betweenness centrality using C++, which makes it ideal for use cases where performance is highly valuable. The graph can be both directed and undirected, and the algorithm doesn’t take relationships' weight into account.

Default arguments are:

directed: boolean (default=True) ➡ If False, the direction of relationships is ignored.
normalized: boolean (default=True) ➡ If True, the betweenness values are normalized by 2/((n-1)(n-2)) for undirected graphs, and 1/((n-1)(n-2)) for directed graphs, where n is the number of nodes.
threads: integer (default=number of concurrent threads supported by the implementation) ➡ The number of threads used to calculate betweenness centrality.

The function returns the node and its betweenness centrality measure.
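
To make the effect of the directed and normalized arguments concrete, here is a plain Python sketch that follows the two phases of Brandes' algorithm (shortest-path counting, then dependency accumulation). It is an illustration only, not Memgraph's C++ implementation: the function signature and the five-node chain are assumptions made for this example, the threads argument is left out, and duplicate relationships are not handled.

```python
from collections import deque


def betweenness_centrality(nodes, edges, directed=True, normalized=True):
    """Brandes-style betweenness centrality for an unweighted graph."""
    adjacency = {node: [] for node in nodes}
    for source, target in edges:
        adjacency[source].append(target)
        if not directed:
            adjacency[target].append(source)  # ignore relationship direction

    scores = {node: 0.0 for node in nodes}
    for start in nodes:
        # Phase 1: breadth-first search that counts shortest paths from start.
        visit_order = []
        predecessors = {node: [] for node in nodes}
        path_count = {node: 0 for node in nodes}
        path_count[start] = 1
        distance = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            visit_order.append(current)
            for neighbor in adjacency[current]:
                if neighbor not in distance:
                    distance[neighbor] = distance[current] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[current] + 1:
                    path_count[neighbor] += path_count[current]
                    predecessors[neighbor].append(current)

        # Phase 2: walk back from the farthest nodes and accumulate dependencies.
        dependency = {node: 0.0 for node in nodes}
        for node in reversed(visit_order):
            for predecessor in predecessors[node]:
                share = path_count[predecessor] / path_count[node]
                dependency[predecessor] += share * (1.0 + dependency[node])
            if node != start:
                scores[node] += dependency[node]

    node_count = len(nodes)
    if not directed:
        # Every undirected pair was counted once from each endpoint.
        scores = {node: value / 2.0 for node, value in scores.items()}
    if normalized and node_count > 2:
        factor = (1.0 if directed else 2.0) / ((node_count - 1) * (node_count - 2))
        scores = {node: value * factor for node, value in scores.items()}
    return scores


# Hypothetical five-node chain A-B-C-D-E, made up for this illustration.
NODES = ["A", "B", "C", "D", "E"]
EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]

undirected = betweenness_centrality(NODES, EDGES, directed=False)
raw = betweenness_centrality(NODES, EDGES, directed=False, normalized=False)
as_directed = betweenness_centrality(NODES, EDGES, directed=True)
```

In the undirected chain, four pairs of nodes can only reach each other through C, so its raw score is 4 and its normalized score is 4 × 2/((5-1)(5-2)) = 2/3.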

To call betweenness centrality in Memgraph, use the following query:

CALL betweenness_centrality.get() 
YIELD node, betweenness_centrality 
RETURN node, betweenness_centrality;

You can try it out on Playground, in the Sandbox of the Protein-protein interaction network dataset. Proteins found in human tissue actually make an interaction network. By calculating the betweenness centrality of proteins, it was found that there is a correlation between the APP protein and Alzheimer’s disease, which could be interpreted as a connection between essential proteins and diseases in general.

Check the protein with the highest betweenness centrality measure that could be connected to certain diseases:

CALL betweenness_centrality.get() 
YIELD node, betweenness_centrality 
RETURN node, betweenness_centrality
ORDER BY betweenness_centrality DESC
LIMIT 10;

As with all the other algorithms in the MAGE open-source library, you can run betweenness centrality only on a specific group of nodes with the project() function. Save the sub-graph in a variable, then provide it as a first argument of the algorithm:

MATCH p=(n:SpecificLabel)
WITH project(p) AS subgraph
CALL betweenness_centrality.get(subgraph)
YIELD node, betweenness_centrality
RETURN node, betweenness_centrality;

If your application is highly time-sensitive and nodes and relationships are arriving in a short period of time, use the dynamic betweenness centrality which allows the preservation of the previously processed state. When entities are updated, or new ones arrive in the graph, instead of restarting the algorithm over the whole graph, only the neighborhood objects of that arriving entity are processed at a constant time.

Conclusion

Betweenness centrality will help you identify nodes that are ticking time bombs. If they fail, your network could be in trouble, as the flow between all nodes in the graph will be stopped until you come up with a fix. Keep the betweenness centrality of your nodes low, and if there is no other choice, take extra special care of them.

If you need help identifying those weak points, a welcoming community at Memgraph’s Discord server will be more than happy to help you integrate graphs and algorithms into your system. We all want you to succeed!

PageRank Algorithm for Graph Databases

Vlasta Pavicic — Mon, 30 Jan 2023 14:34:44 +0000

The most interesting and famous application of PageRank is certainly the one that actually sparked its creation. Google founders Larry Page and Sergey Brin needed an algorithm to rank pages and provide users with the best possible search results.

Using the PageRank algorithm, each page receives a ranking based on the number and importance of other pages that link to it. Pages with a higher rank increase the ranking of the pages they link to more than pages with a lower rank do.

In graph database terminology, the PageRank algorithm is used to measure the importance of each node based on the number of incoming relationships and the rank of the related source nodes. What the PageRank algorithm actually outputs is a probability distribution that represents the likelihood of visiting any particular node by randomly traversing the graph.

So, it’s basically a node popularity contest.

A widely used type of PageRank is Personalized PageRank, which is extremely useful in recommendation systems. With Personalized PageRank, you can restrict the random walk by allowing it to start only from one of the nodes in a given set, and jump only to one of the nodes in a given set. This type of PageRank brings out central nodes from the perspective of that set of specific nodes. For example, Twitter uses Personalized PageRank to recommend who to follow online.
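
The restriction described above can be sketched with a small power iteration in Python. The function name, the seeds parameter, the fixed iteration count and the tiny follow graph are all assumptions made for this illustration; this is not how Twitter or any specific library implements it.

```python
def personalized_pagerank(edges, seeds, damping_factor=0.85, iterations=100):
    """Power-iteration PageRank whose random jumps land only on the seed nodes."""
    seeds = set(seeds)
    nodes = sorted({node for edge in edges for node in edge} | seeds)
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # Jump distribution: uniform over the seeds, zero everywhere else.
    jump = {node: (1.0 / len(seeds) if node in seeds else 0.0) for node in nodes}
    rank = dict(jump)  # the walk also starts from the seeds

    for _ in range(iterations):
        updated = {node: (1.0 - damping_factor) * jump[node] for node in nodes}
        for node in nodes:
            if out_links[node]:
                share = damping_factor * rank[node] / len(out_links[node])
                for target in out_links[node]:
                    updated[target] += share
            else:
                # Dead end: send the walker back to the seed set.
                for seed in seeds:
                    updated[seed] += damping_factor * rank[node] * jump[seed]
        rank = updated
    return rank


# Hypothetical "follows" graph, made up for this illustration.
FOLLOWS = [("alice", "bob"), ("bob", "alice"), ("carol", "alice")]
ranks = personalized_pagerank(FOLLOWS, seeds={"alice"})
for user in sorted(ranks):
    print(user, round(ranks[user], 3))
```

Because the walk starts from alice and only ever jumps back to alice, carol ends up with a rank of zero: nobody reachable from alice follows her, so from alice's perspective she is not a central node.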

The animation below shows the results of PageRank on a simple network. A sequel of a well-liked movie will automatically be more popular than just a random new title because it already has an established fan base. In graph terms, the biggest node pointing to an adjacent node makes it more important.

PageRank is a measure of influence that can be applied to a variety of use cases, not just website and movie rankings.

PageRank use cases

If a social network or a search engine is not the product you are developing, check out how you can utilize PageRank in various other use cases or knowledge graphs built to infer knowledge in these niches.

Recommendation Engines

In recommendation engines, the PageRank algorithm can be utilized to recommend products that match the target user's preferences or are currently trending among all the other users. The algorithm considers the number of purchases and the reliability of the users who bought or reviewed the product.

A reliable user has a valid usage history and reviews, while unreliable users are fake customers whose purpose is to artificially inflate the metrics of certain products to make them appear more desirable.

Data Lineage

Knowing the importance of documents in the data lineage graph has two important applications: impact analysis and system reliability.

When adding a new data property, migrating data, or performing major updates, such as merging data sources after an acquisition, impact analysis can help assess the upstream and downstream impacts of such changes.

PageRank can also help identify high-impact nodes that are required to remain highly reliable because they are used in many other places throughout the organization.

Fraud Detection

In fraud detection, PageRank can be used as an additional feature (input) to a machine learning algorithm, to improve classification and reduce the false positives.

Users who are involved in fraudulent transactions with shared cards are more likely to be fraudsters. So the node ranks involved in these particular transactions can be a piece of valuable information that can be used in machine learning models to predict and detect fraud among individuals that have connections with known fraudsters in the network.

Nodes can also be ranked based on how much money flows through each one to flag transactions that move much more money than what is average for a specific user.

Identity and Access Management

While managing permission, it is important to restrict access to sensitive assets, as their exploitation could cause expensive damage to the company. In many systems, due to a lack of time and resources, high permissions are often given to people that don’t actually need them.

PageRank can help identify which sensitive assets are accessible by many users to determine who, in fact, requires access and remove permissions for the rest of the users.

Network Optimization

Critical infrastructures are systems that can be represented as a network of highly interdependent nodes and relationships. Due to their nature, failure in one node may result in a cascade of failures in other nodes. PageRank can help identify nodes likely to fail and if they would cascade to other nodes in the network.

As energy infrastructure is also a network, using PageRank to identify vulnerabilities in the topology is invaluable and can save time, money and frustration for both companies and users.

Cyber Security

As it is not feasible to remove absolutely every threat in the system, PageRank can help calculate the probabilities of certain malignant events causing severe attacks. Just as PageRank's original purpose was to determine which sites are more likely to be randomly clicked on because many other sites point to them, in a security system it can be used to point out which attacks are more likely to be performed, and which attacks will have more severe consequences.

Implementation in Memgraph

Memgraph has implemented PageRank using C++ which makes it ideal for use cases where performance is highly valuable. The graph needs to be directed, and the algorithm doesn’t take relationships' weight into account.

Default arguments are the same as in the NetworkX PageRank implementation, so if you are a NetworkX user it will be smooth sailing:

max_iterations: integer (default = 100) ➡ The maximum number of iterations within the PageRank algorithm.
damping_factor: double (default = 0.85) ➡ PageRank's damping factor. This is the probability that the random walk continues along a relationship instead of jumping to a random node within the graph.
stop_epsilon: double (default = 1e-5) ➡ Value used to terminate the iterations of PageRank. If the change from one iteration to another is lower than stop_epsilon, execution is stopped.
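
To show how these three arguments interact, here is a plain Python power-iteration sketch. It borrows the argument names from the list above, but it is not Memgraph's C++ implementation: measuring the change as the sum of absolute rank differences, spreading the rank of dead-end nodes evenly over the graph, and the small example graph are all assumptions made for this illustration.

```python
def pagerank(edges, max_iterations=100, damping_factor=0.85, stop_epsilon=1e-5):
    """Power-iteration PageRank on a directed, unweighted graph."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    node_count = len(nodes)
    rank = {node: 1.0 / node_count for node in nodes}
    for _ in range(max_iterations):
        # Rank held by dead-end nodes is spread evenly over the whole graph.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping_factor + damping_factor * dangling) / node_count
        updated = {node: base for node in nodes}
        for node in nodes:
            for target in out_links[node]:
                updated[target] += damping_factor * rank[node] / len(out_links[node])

        # Stop early once the ranks barely move between two iterations.
        change = sum(abs(updated[node] - rank[node]) for node in nodes)
        rank = updated
        if change < stop_epsilon:
            break
    return rank


# Hypothetical directed graph in which three nodes point at C.
EDGES = [("A", "C"), ("B", "C"), ("D", "C"), ("C", "A")]
ranks = pagerank(EDGES)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 3))
```

A smaller stop_epsilon or a larger max_iterations trades running time for precision, while damping_factor controls how much of a node's rank is passed along its outgoing relationships.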

To call PageRank in Memgraph use the following query:

CALL pagerank.get()
YIELD node, rank
RETURN node, rank;

You can try it out on Playground, in the Sandbox of the Europe gas pipelines dataset. Check the nodes with the highest value (that could cause problems if they fail) with the following query:

CALL pagerank.get()
YIELD node, rank
RETURN node, rank
ORDER BY rank DESC;

As with all the other algorithms in the MAGE open-source library, you can run PageRank only on a specific group of nodes with the project() function. Save the sub-graph in a variable, then provide it as a first argument of the algorithm:

MATCH p=(n:SpecificLabel)
WITH project(p) AS subgraph
CALL pagerank.get(subgraph)
YIELD node, rank
RETURN node, rank;

If your application is highly time-sensitive and nodes and relationships are arriving in a short period of time, use the Dynamic PageRank which allows the preservation of the previously processed state. When entities are updated or new ones arrive in the graph, instead of restarting the algorithm over the whole graph, only the neighborhood objects of that arriving entity are processed at a constant time.

Conclusion

PageRank is a mature graph algorithm that hasn’t yet lost its relevance. Even more so, with the rise of graph database usage it will surely find its place in many management systems. For more examples of using PageRank check Memgraph’s blog posts on the same topic, or explore more graph algorithms.

If you ever have any doubts about whether you are using PageRank correctly, or you need help with implementing it into your use case, a welcoming community at Memgraph’s Discord server will be more than happy to help you integrate graphs and algorithms into your system.