I made a new, much more efficient and effective script that makes the one described in this post obsolete. See my post on the new script here.
In my PhD research project I reconstruct processes as sequences of events. In a recent article I visualize these sequences of events and their relationships as network graphs. This was inspired by Abell’s (1987) The Syntax of Social Life. In the network graphs the events are represented by nodes. The arcs (arrows) between the events indicate which events contributed to the conditions under which later events occurred. The layout of the events on the horizontal axis represents the order of occurrence of events. The layout on the vertical axis has no other purposes than clarity of presentation. The resulting graph is always a directed, a-cyclic graph. The graph is directed because of the way I defined the relationship (“X made a contribution to the conditions under which Y happened”), which is a one-way relationship. The graph is a-cyclic because the arcs can never point back in time.
In my current work I have to go a few steps further because of the questions that I ask (I won’t go into the detail of these questions). One of the steps that I have to take in answering those questions is to find all possible paths in the sequences of events. There are multiple algorithms available for finding shortest paths between the nodes of a graph, but I specifically needed an algorithm that is capable of finding all paths. For this reason (and because I like the exercise) I decided to write my own algorithm. In this case I use R as my programming language. The resulting code can be found at the bottom of this post.
The function that runs the algorithm
To run the algorithm in R (see instructions at the bottom of this post) the user uses the following function:
Paths <- FindPaths(edgeList, 1, 128)
Here, the results of the algorithm are stored as a matrix in an object called Paths. The first argument to the function is an edge list (I called it edgeList here, but the user can use any name she/he likes), which is basically a list of all the relationships that constitute the network (see below for more details). The other two arguments indicate the pair of events for which the paths are to be reconstructed. Thus, if one wants to find the paths between multiple pairs of events, then the algorithm should be run separately for all these pairs.
The nice thing about an edge list is that it uses a very simple structure to describe networks that can be rather complex. Also, an edge list can be easily exported from my favorite network visualization program: Gephi. An edge list looks something like this:
The algorithm only works if the events are represented by numbers that indicate the order of events, which is information that the algorithm requires. To import an edge list, first store one as (for example) a comma-delimited file (.csv) and then import it into gephi as a matrix. For example, the following command could be used (some arguments that should be used, such as the separator [sep], depend on your system settings):
edgeList <- as.matrix(read.table(“MyEdgeList.csv”, sep = ‘;’, header = T))
Additional preparations (optional)
If you have stored your directed, a-cyclic graph as an edge list it is possible to just keep it as it is and import it into R without any further changes. However, depending on the complexity of the graphs (in terms of the number of nodes, the number of arcs, as well as the number of points of convergence and divergence) it may be helpful to make some further preparations before using the algorithm. First of all, for complex graphs it is usually best to first remove all the nodes from our graph of which we know that we don’t need them, because this will save the algorithm a lot of time. As I indicate in the previous section, the algorithm always looks for the paths between two specific events. Let’s call them the origin event and the terminal event. We already know that we don’t need all the events that are not in the past of the terminal event. Removing these events might save the algorithm some time because it won’t start looking in directions that turn out to be dead ends (see picture below).
The easiest way to remove the unnecessary nodes from the graphs (that I currently know of) is to use an algorithm that is available in the NetworkX library for python. To run NetworkX you’ll need to have Python installed to your machine, and you’ll also need a Python interpreter, such as IDLE (see this website). If the NetworkX libraries for python are installed to your machine, you can import the NetworkX library by running the following command in a python interpreter:
import networkx as nx
We can then import our graph as an gexf-file (you can export your graph as an gexf-file from Gephi). Use the following command in your python interpreter:
G = nx.read_gexf(“Path-to-your-gexf-file/example.gexf”)
This command will store the graph in an object that we called G. We can then use an algorithm that finds all ancestors of a given node. Use the following command.
A = nx.ancestors(G, ’10’)
Here we find all ancestors of event 10 in the graph G, the graph that we just created. The results are stored in an object that we named A. This object A that we just created is an object of the type set. We will want to export this set as a csv-file. For this we first need to import the csv library for Python (which is included as one of the standard libraries). Use the following command:
The next couple of commands can be used to export the results. Make sure that in the second command the delimiter argument corresponds with the delimiter that your system uses for csv-files (this is the same thing as the separator [sep] argument that you pass to R-functions).
myfile = open(“/Path-to-export-folder/Ancestors_of_10.csv”, ‘wb’)
wr = csv.writer(myfile, quoting = csv.QUOTE_ALL, delimiter = ‘;’)
The exported file will now have all the ancestors of event 10 in a single column. It is useful to turn this list into a nodes list that can be imported into Gephi. First add the terminal event itself to the list, because it is not in there yet. Then give the column a header that you call Id. You can import the list as a nodes list in the Gephi data laboratory. Next, you can import the original edge list. However (and this is very important), make sure that in the import settings you uncheck the box that indicates whether missing nodes should be created as well. Export the resulting edge list as a csv-file, and you’ll have an edge list that can be used to find paths to the terminal event of interest much more efficiently. I realize that the whole process can be a bit time consuming if you have multiple terminal nodes that you want to consider, but at the moment this is the best option that I can think of. Also, depending on the complexity of your original graph, these steps may actually save you a lot of time.
What the algorithm does
A detailed description of different steps of the algorithm is offered in the comments that accompany the source code itself (see bottom of post). The algorithm will first go through the edge list that was passed to the function and remove all edges that lie in the past of the origin event, or in the future of the terminal event. The algorithm then makes a copy of the edge list, and then basically iteratively expands that copy by adding new rows and columns to it. We’ll call this copy the paths list.
At the first iteration of the algorithm the paths list will simply be an exact copy of the original edge list. The algorithm will compare the entries in the second column of the paths list with those in the first column of the edge list. If it finds a match, then it knows that the matching edges together constitute a larger chain of events. If such a match is found, then the algorithm looks in the second column of the edge list to find out what the next event in the chain should be. This event is added to the corresponding column of the paths list. In the picture below, the cells that are shaded yellow and green indicate where the algorithm has found a match, and the orange and blue shaded cells indicate where the algorithm found and added the corresponding events to the paths to which they belong.
If the algorithm has finished its iteration for the current column of the paths list, then it will move on to the next column in the next iteration and repeat the process. If the algorithm finds points of divergence (multiple events follow from the event that is currently inspected), then it will add a new row to the list in which the new path is recorded. The algorithm will also recognize any paths that are actually a subsequence of a larger path and remove these from the paths list. These operations are illustrated in the picture below. The yellow, green and purple shaded cells indicate matches that the algorithm found. The orange, blue and red shaded cells indicate the events that have been added to the chains of events. For event 2, a new row had to be created, because two events follow from event 2. Two columns will be removed from the paths list during this iteration, because they contain paths that are already included in other paths.
That is basically it. After the algorithm finishes all iterations, it will return the resulting paths list.
There is one quite serious problem that I wasn’t able to solve. The number of possible paths in a graph increases very quickly if there are many points of convergence (nodes with a high indegree) and divergence (nodes with a high outdegree) in the graph. In those cases the algorithm will take a long time to finish, even on a relatively fast computer.
I also haven’t figured out a way to predict how much iterations the algorithm must run before finishing the task (to be honest, I haven’t thought about it too much). The algorithm will indicate which iteration it is currently doing, but it won’t indicate how many iterations it still has to go. If you’re running the algorithm on a large and/or complex graph, expect long waiting times, especially on slower machines.
You can download the code here. The zip-file contains a file called FindPaths.R. To use it in R, move the file to your current working directory for R and then type source(“FindPaths.R”) in your R-interface. This will make available the FindPaths function. Open the file itself with a text editor, or with R’s script editor to find more instructions on how to use the function.
I’m relatively new to R and it is highly likely that parts of the code (or perhaps even all of it) are less efficient than they could be. I’m always open to suggestions for improvement.