memgraph/docs/user_technical/concepts/graph_algorithms.md

157 lines
7.3 KiB
Markdown
Raw Normal View History

## Graph Algorithms
### Introduction
The graph is a mathematical structure used to describe a set of objects in which
some pairs of objects are "related" in some sense. Generally, we consider
those objects as abstractions named `nodes` (also called `vertices`).
Aforementioned relations between nodes are modelled by an abstraction named
`edge` (also called `relationship`).
It turns out that a lot of real-world problems can be successfully modeled
using graphs. Some natural examples would contain railway networks between
cities, computer networks, piping systems and Memgraph itself.
This article outlines some of the most important graph algorithms
that are internally used by Memgraph. We believe that advanced users could
significantly benefit from obtaining basic knowledge about those algorithms.
The users should also note that this article does not contain an in-depth
analysis of algorithms and their implementation details since those are
well documented in the appropriate literature and, in our opinion, go well out
of scope for user documentation. That being said, we will include the relevant
information for using Memgraph effectively and efficiently.
Contents of this article include:
* [Breadth First Search (BFS)](#breadth-first-search)
* [Weighted Shortest Path (WSP)](#weighted-shortest-path)
### Breadth First Search
[Breadth First Search](https://en.wikipedia.org/wiki/Breadth-first_search)
is a way of traversing a graph data structure. The
traversal starts from a single node (usually referred to as source node) and,
during the traversal, breadth is prioritized over depth, hence the name of the
algorithm. More precisely, when we visit some node, we can safely assume that
we have already visited all nodes that are fewer edges away from a source node.
An interesting side-effect of traversing a graph in BFS order is the fact
that, when we visit a particular node, we can easily find a path from
the source node to the newly visited node with the least number of edges.
Since in this context we disregard the edge weights, we can say that BFS is
a solution to an unweighted shortest path problem.
The algorithm itself proceeds as follows:
* Keep around a set of nodes that are equidistant from the source node.
Initially, this set contains only the source node.
* Expand to all not yet visited nodes that are a single edge away from that
set. Note that the set of those nodes is also equidistant from the source
node.
* Replace the set with a set of nodes obtained in the previous step.
* Terminate the algorithm when the set is empty.
The order of visited nodes is nicely visualized in the following animation from
Wikipedia. Note that each row contains nodes that are equidistant from the
source and thus represents one of the sets mentioned above.
![visualization](https://upload.wikimedia.org/wikipedia/commons/5/5d/Breadth-First-Search-Algorithm.gif)
The standard BFS implementation skews from the above description by relying on
a FIFO (first in, first out) queue data structure. Nevertheless, the
functionality is equivalent and its runtime is bounded by `O(|V| + |E|)` where
`V` denotes the set of nodes and `E` denotes the set of edges. Therefore,
it provides a more efficient way of finding unweighted shortest paths than
running [Dijkstra's algorithm](#weighted-shortest-path) on a graph
with edge weights equal to `1`.
### Weighted Shortest Path
In [graph theory](https://en.wikipedia.org/wiki/Graph_theory), weighted shortest
path problem is the problem of finding a path between two nodes in a graph such
that the sum of the weights of edges connecting nodes on the path is minimized.
#### Dijkstra's algorithm
One of the most important algorithms for finding weighted shortest paths is
[Dijkstra's algorithm](https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm).
Our implementation uses a modified version of this algorithm that can handle
length restriction. The length restriction parameter is optional and when it's
not set it could increase the complexity of the algorithm. It is important to
note that the term "length" in this context denotes the number of traversed
edges and not the sum of their weights.
The algorithm itself is based on a couple of greedy observations and could
be expressed in natural language as follows:
* Keep around a set of already visited nodes along with their corresponding
shortest paths from source node. Initially, this set contains only the
source node with the shortest distance of `0`.
* Find an edge that goes from a visited node to an unvisited one such that the
shortest path from source to the visited node increased by the weight of
that edge is minimized. Traverse that edge and add a newly visited node with
appropriate distance to the set of already visited nodes.
* Repeat the process until the destination node is visited.
The described algorithm is nicely visualized in the following animation from
Wikipedia. Note that edge weights correspond to the Euclidean distance between
nodes which represent points on a plane.
![visualization](https://upload.wikimedia.org/wikipedia/commons/e/e4/DijkstraDemo.gif)
Using appropriate data structures the worst-case performance of our
implementation can be expressed as `O(|E| + |V|log|V|)` where `E` denotes
a set of edges and `V` denotes the set of nodes.
A sample query that finds a shortest path between two nodes looks as follows:
```opencypher
MATCH (a {id: 723})-[edge_list *wShortest 10 (e, n | e.weight) total_weight]-(b {id: 882}) RETURN *
```
This query has an upper bound length restriction set to `10`. This means that no
path that traverses more than `10` edges will be considered as a valid result.
##### Upper Bound Implications
Since the upper bound parameter is optional, we can have different results based
on this parameter.
Consider the following graph and sample queries.
![sample-graph](../data/graph.png)
```opencypher
MATCH (a {id: 0})-[edge_list *wShortest 3 (e, n | e.weight) total_weight]-(b {id: 5}) RETURN *
```
```opencypher
MATCH (a {id: 0})-[edge_list *wShortest (e, n | e.weight) total_weight]-(b {id: 5}) RETURN *
```
The first query will try to find the weighted shortest path between nodes `0`
and `5` with the restriction on the path length set to `3`, and the second query
will try to find the weighted shortest path with no restriction on the path
length.
The expected result for the first query is `0 -> 1 -> 4 -> 5` with the total
cost of `12`, while the expected result for the second query is
`0 -> 2 -> 3 -> 4 -> 5` with the total cost of `11`. Obviously, the second
query can find the true shortest path because it has no restrictions on the
length.
To handle cases when the length restriction is set, *weighted shortest path*
algorithm uses both node and distance as the state. This causes the search
space to increase by the factor of the given upper bound. On the other hand, not
setting the upper bound parameter, the search space might contain the whole
graph.
Because of this, one should always try to narrow down the upper bound limit to
be as precise as possible in order to have a more performant query.
### Where to next?
For some real-world application of WSP we encourage you to visit our article
on [exploring the European road network](../tutorials/exploring_the_european_road_network.md)
which was specially crafted to showcase our graph algorithms.