At Folding@home, often we use Markov state models (MSMs) to represent how molecules (usually proteins) move around in a simulation. These models help simplify large amounts of simulation data and ultimate yield a more concise network representation. Because of this, we can borrow tools normally used in fields such as graph theory to study how biological molecules behave.
One such tool is Dijkastra’s algorithm, which finds an approximate to the shortest path between any two nodes in a (weakly connected) graph. At each iteration, the algorithm finds the neighboring node that will minimize the total distance needed to traverse the graph. This method can be extremely powerful, and has become as ubiquitous to our society as getting driving directions from Google Maps.
In social networks, Dijkastra’s algorithm is what you would use to compute an Erdös number or how many degrees of separation you share with Kevin Bacon. In protein folding (my line of work), Dijkastra’s algorithm is useful for identifying the possible steps (e.g. confomational changes) that lead to a folded protein. This information can help scientists better understand the underlying physics of biology, which could lead to new therapeutic drugs.
In this tutorial, I’ll go over how to use
NetworkX to construct a random graph and calculate the shortest path between two nodes. I’ll also be using
Seaborn to visualize the results. Optional: If you’d like to make interactive graphs, as the ones below, the
Assuming you already have Python (and
pip) installed on your computer, in your terminal type:
$ pip install conda
This will install the
conda package manager, which will make installation of the other required Python packages relatively painless:
$ conda install numpy matplotlib networkx
Seaborn can be installed using
$ pip install seaborn
To start, we will need a dataset which can be respresented as a directed graph. There are a ton of ways to create this sort of dataset, but in this tutorial I’ll be using
NetworkX to generate an Erdös–Rényi graph with directed edges. This type of graph is constructed using a binomial distribution to determine whether an edge is formed between any two nodes. We can do this pretty easily using the
erdos_renyi_graph function in
NetworkX, which samples the number of edges for each node from a binomial distribution, . Let’s try and for our binomial:
import networx as nx # Number of nodes N = 25 # Generate Erdös–Rényi graph G = nx.erdos_renyi_graph(N, 0.12, directed=True)
Visualizing the Graph
Before we go on to the shortest path calculation, let’s plot the graph to get a sense about what it looks like. In the code below, I’ve used the
scatter function in
matplotlib.pyplot to plot nodes, with node sizes representing PageRank scores, and node color representing in-degree, or the number of incoming edges. We can get a decent layout to position the nodes using
nx.spring_layouts, which creates a list of xy-coordinates for each node by quickly simulating the graph as a set of masses connected by springs. Edges are simply lines connecting the nodes using the
plot function, with the thicker gray ticks denoting an incoming edge to a node.
import matplotlib.pyplot as pp import seaborn as sns import numpy as np # Add a dark background sns.set_style('dark') # Initialize figure fig = pp.figure(figsize = (8, 8)) ax = fig.add_subplot(1, 1, 1) # Create force-directed layout for the graph pos = nx.spring_layout(G) pos_array = np.array(pos.values()).T # Format into a NumPy array # Define node colors based on in-degree node_color = np.array([len(G.predecessors(v)) for v in G], dtype=float) # Define node sizes to be proportional to PageRank values sizes = 1E4*nx.pagerank(G).values() # Plot each edge for i in G: for j in G[i]: # Compute in-marker size # The next few lines just resize the # incoming markers to look decent # with the varying node sizes p = pos_array[:, j] - pos_array[:, i] s = -.4*(1 + .4*np.log(sizes[j]/sizes.min())) p = s*p/np.linalg.norm(p) + pos_array[:, j] # Plot markers ax.plot([pos_array[0, j], p], [pos_array[1, j], p], '-k', alpha = .5, linewidth = 2, zorder = 9) #Plot edges ax.plot(pos_array[0, [i, j]], pos_array[1, [i, j]], '-w') # Plot nodes ax.scatter(pos_array, pos_array, sizes, color = cm.coolwarm(node_color/node_color.max()), zorder = 10) # Remove axis ticks pp.xticks() pp.yticks()
Computing the Shortest Path
Now that we have an idea of how to plot a network graph, let’s finally get to calculating the shortest path between a pair of nodes. For this exercise, let’s find the path between the nodes with the lowest and highest PageRank scores. In terms of social networks, this might represent the distance in social hierachy between a second year grad student (low PageRank) and someone like Neil deGrasse Tyson (high PageRank). In protein folding, it could model the steps taken from an unfolded state (low PageRank) to a folded one (high PageRank).
In our example graph from above, it turns out that nodes 0 and 7 are the lowest and highest PageRank scorers, respectively. We can find this by applying
sizes, the array PageRank scores we calculated earlier. With this, we can finally calculate the shortest path using
nx.shortest_path like so:
path = nx.shortest_path(G, source = 0, target = 7)
Since we are interested in going from node 0 to node 7, we set the must set the arguments
target accordingly. The only other input to this function is our graph,
G. In this example, our shortest path only requires 3 steps, passing through nodes 22 and 2 along the way. Below you can find modified versions of the previous code and plot, which have been changed to highlight the nodes that form the shortest path between 0 and 7. You can even hover over the nodes to check the labels for yourself!
sns.set_style('dark') fig = pp.figure(figsize = (8, 8)) ax = fig.add_subplot(1, 1, 1, frameon = False) #Plots edges (shortest path edgest will be thicker) for i in G: for j in G[i]: if (i, j) in [(path[k], path[k + 1]) for k in range(len(path) - 1)]: alpha = 1 linewidth = 4.0 else: alpha = .4 linewidth = 1.5 p = pos_array[:, j] - pos_array[:, i] s = -.4*(1 + .4*np.log(sizes[j]/sizes.min())) p = s*p/np.linalg.norm(p) + pos_array[:, j] ax.plot([pos_array[0, j], p], [pos_array[1, j], p], '-k', alpha = .5, linewidth = linewidth, zorder = 9) ax.plot(pos_array[0, [i, j]], pos_array[1, [i, j]], '-w', alpha = alpha, linewidth = linewidth) #Plots nodes ax.scatter(pos_array, pos_array, sizes, color = 'lightgrey', zorder = 10) #Plot nodes along path in red for node in path: ax.scatter(pos_array[0, path], pos_array[1, path], sizes[path], color = 'darkred', alpha=.6, zorder=11) _=pp.xticks() _=pp.yticks()
On a parting note, what should be taken away from this toy model is the not-so-intuitive fact that even in a somewhat sparse network (i.e. low probability of sharing a connection)– although this example is a bit exaggerated in sparsity – things are fairly closely connected. We took the two furthest objects in a directed graph of 25 nodes and were still able to cross it in only 3 steps! Such an effect is made even more apparent when analyzing other graph metrics, such as average path length, on real-life social networks as done in Milgram’s Small-world experiment. Milgram found that on average, any two people in the United States are only 6 degrees of separation away. So just think about that next time you’re alone, watching TV– you could totally become BFFs with someone like Shaq!