Shortest Path

At Folding@home, often we use Markov state models (MSMs) to represent how molecules (usually proteins) move around in a simulation. These models help simplify large amounts of simulation data and ultimate yield a more concise network representation. Because of this, we can borrow tools normally used in fields such as graph theory to study how biological molecules behave.

One such tool is Dijkastra’s algorithm, which finds an approximate to the shortest path between any two nodes in a (weakly connected) graph. At each iteration, the algorithm finds the neighboring node that will minimize the total distance needed to traverse the graph. This method can be extremely powerful, and has become as ubiquitous to our society as getting driving directions from Google Maps.

In social networks, Dijkastra’s algorithm is what you would use to compute an Erdös number or how many degrees of separation you share with Kevin Bacon. In protein folding (my line of work), Dijkastra’s algorithm is useful for identifying the possible steps (e.g. confomational changes) that lead to a folded protein. This information can help scientists better understand the underlying physics of biology, which could lead to new therapeutic drugs.

In this tutorial, I’ll go over how to use NumPy and NetworkX to construct a random graph and calculate the shortest path between two nodes. I’ll also be using matplotlib and Seaborn to visualize the results. Optional: If you’d like to make interactive graphs, as the ones below, the mpld3 package is fairly handy to convert static plots into interactive JavaScript.

Installing Packages

Assuming you already have Python (and pip) installed on your computer, in your terminal type:

$ pip install conda

This will install the conda package manager, which will make installation of the other required Python packages relatively painless:

$ conda install numpy matplotlib networkx

Seaborn can be installed using pip:

$ pip install seaborn

Generating Data

To start, we will need a dataset which can be respresented as a directed graph. There are a ton of ways to create this sort of dataset, but in this tutorial I’ll be using NetworkX to generate an Erdös–Rényi graph with directed edges. This type of graph is constructed using a binomial distribution to determine whether an edge is formed between any two nodes. We can do this pretty easily using the erdos_renyi_graph function in NetworkX, which samples the number of edges for each node from a binomial distribution, $B(n = N, p)$ . Let’s try $n = 25$ and $p = 0.12$ for our binomial:

import networx as nx

# Number of nodes
N = 25
# Generate Erdös–Rényi graph
G = nx.erdos_renyi_graph(N, 0.12, directed=True)

Visualizing the Graph

Before we go on to the shortest path calculation, let’s plot the graph to get a sense about what it looks like. In the code below, I’ve used the scatter function in matplotlib.pyplot to plot nodes, with node sizes representing PageRank scores, and node color representing in-degree, or the number of incoming edges. We can get a decent layout to position the nodes using nx.spring_layouts, which creates a list of xy-coordinates for each node by quickly simulating the graph as a set of masses connected by springs. Edges are simply lines connecting the nodes using the plot function, with the thicker gray ticks denoting an incoming edge to a node.

import matplotlib.pyplot as pp
import seaborn as sns
import numpy as np

# Add a dark background
sns.set_style('dark')

# Initialize figure
fig = pp.figure(figsize = (8, 8))
ax  = fig.add_subplot(1, 1, 1)

# Create force-directed layout for the graph
pos = nx.spring_layout(G)
pos_array = np.array(pos.values()).T # Format into a NumPy array

# Define node colors based on in-degree
node_color = np.array([len(G.predecessors(v)) for v in G], dtype=float)

# Define node sizes to be proportional to PageRank values
sizes = 1E4*nx.pagerank(G).values()

# Plot each edge
for i in G:
    for j in G[i]:
        # Compute in-marker size
        # The next few lines just resize the
        # incoming markers to look decent
        # with the varying node sizes
        p = pos_array[:, j] - pos_array[:, i]
        s = -.4*(1 + .4*np.log(sizes[j]/sizes.min()))
        p = s*p/np.linalg.norm(p) + pos_array[:, j]

        # Plot markers
        ax.plot([pos_array[0, j], p[0]], [pos_array[1, j], p[1]], '-k', alpha = .5, linewidth = 2, zorder = 9)

        #Plot edges
        ax.plot(pos_array[0, [i, j]], pos_array[1, [i, j]], '-w')

# Plot nodes
ax.scatter(pos_array[0], pos_array[1], sizes, color = cm.coolwarm(node_color/node_color.max()), zorder = 10)

# Remove axis ticks
pp.xticks([])
pp.yticks([])

Computing the Shortest Path

Now that we have an idea of how to plot a network graph, let’s finally get to calculating the shortest path between a pair of nodes. For this exercise, let’s find the path between the nodes with the lowest and highest PageRank scores. In terms of social networks, this might represent the distance in social hierachy between a second year grad student (low PageRank) and someone like Neil deGrasse Tyson (high PageRank). In protein folding, it could model the steps taken from an unfolded state (low PageRank) to a folded one (high PageRank).

In our example graph from above, it turns out that nodes 0 and 7 are the lowest and highest PageRank scorers, respectively. We can find this by applying np.argmin and np.argmax to sizes, the array PageRank scores we calculated earlier. With this, we can finally calculate the shortest path using nx.shortest_path like so:

path = nx.shortest_path(G, source = 0, target = 7)

Since we are interested in going from node 0 to node 7, we set the must set the arguments source and target accordingly. The only other input to this function is our graph, G. In this example, our shortest path only requires 3 steps, passing through nodes 22 and 2 along the way. Below you can find modified versions of the previous code and plot, which have been changed to highlight the nodes that form the shortest path between 0 and 7. You can even hover over the nodes to check the labels for yourself!

sns.set_style('dark')
fig = pp.figure(figsize = (8, 8))
ax  = fig.add_subplot(1, 1, 1, frameon = False)

#Plots edges (shortest path edgest will be thicker)
for i in G:
    for j in G[i]:
        if (i, j) in [(path[k], path[k + 1]) for k in range(len(path) - 1)]:
            alpha = 1
            linewidth = 4.0
        else:
            alpha = .4
            linewidth = 1.5
        p = pos_array[:, j] - pos_array[:, i]
        s = -.4*(1 + .4*np.log(sizes[j]/sizes.min()))
        p = s*p/np.linalg.norm(p) + pos_array[:, j]
        ax.plot([pos_array[0, j], p[0]], [pos_array[1, j], p[1]], '-k', alpha = .5, linewidth = linewidth, zorder = 9)
        ax.plot(pos_array[0, [i, j]], pos_array[1, [i, j]], '-w', alpha = alpha, linewidth = linewidth)

#Plots nodes
ax.scatter(pos_array[0], pos_array[1], sizes, color = 'lightgrey', zorder = 10)

#Plot nodes along path in red
for node in path:
    ax.scatter(pos_array[0, path], pos_array[1, path], sizes[path], color = 'darkred', alpha=.6, zorder=11)

_=pp.xticks([])
_=pp.yticks([])

On a parting note, what should be taken away from this toy model is the not-so-intuitive fact that even in a somewhat sparse network (i.e. low probability of sharing a connection)– although this example is a bit exaggerated in sparsity – things are fairly closely connected. We took the two furthest objects in a directed graph of 25 nodes and were still able to cross it in only 3 steps! Such an effect is made even more apparent when analyzing other graph metrics, such as average path length, on real-life social networks as done in Milgram’s Small-world experiment. Milgram found that on average, any two people in the United States are only 6 degrees of separation away. So just think about that next time you’re alone, watching TV– you could totally become BFFs with someone like Shaq!

He's waiting!