Category Archives: Software

Bi-dynamic line graphs

Two days ago Frank Boons and I had the opportunity to present some of our ongoing work during a session of the Mitchell Centre Seminar Series. In addition to receiving some very useful feedback on our work and meeting interesting people, we were introduced to some of the ideas that members of the Mitchell Centre are working on. One of these ideas immediately grabbed our attention: the idea of representing social processes in bi-dynamic line graphs (BDLGs), as laid out in this paper by Chiara Brocatelli, Martin Everett, and Johan Koskinen. These graphs can be used to visualise the (joint) involvement of actors in events, showing which actors worked together over time and allowing one to trace the involvement of actors throughout the process.

The basic ideas behind BDLGs

A BDLG can be understood as a special projection of a two-mode network of actors and events. In matrix form, a two-mode network is usually presented as an incidence matrix with actors in rows, and events in columns, like so (this example was taken from the paper mentioned earlier):

In Social Network Analysis, it is common to convert two-mode networks to one-mode networks that indicate which actors participated in the same events (and how many times they participated in events together). In fact, this is what I usually do myself, and it is the central idea underlying the Dynamic Networks Tool. This conversion is done by turning the incidence matrix into a weighted adjacency matrix, which is achieved by multiplying the incidence matrix by its transpose. If we do that for the incidence matrix above, we get the following adjacency matrix:
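As a quick illustration of this projection, here is a minimal R sketch with a small, made-up incidence matrix (not the one shown above):

# Hypothetical incidence matrix: actors in rows, events in columns
inc <- matrix(c(1, 1, 0, 1,
                1, 0, 1, 0,
                0, 1, 1, 1), nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), paste0("event", 1:4)))

adj <- inc %*% t(inc)   # weighted actor-by-actor adjacency matrix
diag(adj) <- 0          # optional: the diagonal counts each actor's own events
adj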

Although this gives insight into the ties that emerge from the events in which actors participate, the drawback is that all temporal information is lost; only a static network remains. This can be partially solved by making separate adjacency matrices for different segments of the incidence matrix, but the choice of segment size is always arbitrary, and information about changes within each segment is still lost (see Brocatelli et al. 2016 for a discussion of this).

Brocatelli et al. (2016) propose BDLGs as an alternative projection of two-mode networks that have a temporal dimension. In a BDLG the edges between actors and events of the original two-mode graph are ‘converted’ to nodes. For example, the edge between actor A and event 1 in the original two-mode graph becomes the node A1 in the BDLG. Two types of relationships can exist between these nodes. Mutual (or reciprocal) edges exist only between nodes representing actors that are involved in the same event (e.g., A1<->B1). Edges with only one direction indicate the continued presence of actors throughout events (e.g., A1->A2->A4). To prevent the inclusion of redundant information in the BDLG, Brocatelli et al. (2016) also introduce the rule that transitive paths are always reduced to a single path. For example, if actor A was involved in events 1, 2, and 4, it is only necessary to create the edges A1->A2 and A2->A4. The edge A1->A4 introduces redundant information about the continued involvement of actor A, and can therefore be excluded. This rule leads to more elegant presentations that allow researchers to trace actors’ involvement in the studied process. The matrix form of the BDLG thus becomes as follows (rows with only 0 values, as well as their corresponding columns, have been removed from the matrix):

A new tool

When I read the paper that outlines these ideas, I was immediately enthusiastic, because the approach fits well with the approach to studying social processes that Frank Boons and I have been developing ourselves, and because it offers a new (to me) and interesting way to study network dynamics in social processes. I therefore set out to write a script for R that can be used to convert incidence matrices into BDLG matrices. I quickly decided to go a bit further, and write a standalone tool for this in C++.

The benefit of the tool I wrote is that, in addition to converting incidence matrices into BDLG matrices, it can export a node list and an edge list that can be imported into a visualisation program like Gephi. Node and edge lists also allow you to include variables, such as variables that hold information on the temporal dimension of the nodes.
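To give an idea of the conversion logic, here is a rough R sketch under my own assumptions (it is not the tool's actual code): it turns a hypothetical incidence matrix, with its columns ordered in time, into a BDLG edge list.

# Build a BDLG edge list from an incidence matrix 'inc'
# (actors in rows, events in columns, columns ordered in time)
bdlg_edges <- function(inc) {
  actors <- rownames(inc)
  events <- colnames(inc)
  edges <- data.frame(source = character(0), target = character(0),
                      type = character(0))

  # Reciprocal edges between actors involved in the same event
  for (e in events) {
    present <- actors[inc[, e] == 1]
    if (length(present) > 1) {
      pairs <- t(combn(present, 2))
      for (i in seq_len(nrow(pairs))) {
        a <- paste0(pairs[i, 1], e)
        b <- paste0(pairs[i, 2], e)
        edges <- rbind(edges, data.frame(source = c(a, b), target = c(b, a),
                                         type = "reciprocal"))
      }
    }
  }

  # Directed edges for continued presence, linking consecutive participations
  # only, so that transitive paths are reduced to a single path
  for (a in actors) {
    participated <- events[inc[a, ] == 1]
    if (length(participated) > 1) {
      for (i in seq_len(length(participated) - 1)) {
        edges <- rbind(edges, data.frame(
          source = paste0(a, participated[i]),
          target = paste0(a, participated[i + 1]),
          type = "directed"))
      }
    }
  }
  edges
}

# Hypothetical example with actors A-C and events 1-4
inc <- matrix(c(1, 1, 0, 1,
                1, 0, 1, 0,
                0, 1, 1, 1), nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), as.character(1:4)))
bdlg_edges(inc)

The reciprocal edges connect co-participants within an event, and the directed edges link only consecutive participations of the same actor, which is one way of implementing the transitive-reduction rule described above.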

I wrote a layout plugin for Gephi, called the Event Graph Layout, some years ago. This layout is great for creating BDLG visualisations, because it allows one to lay out nodes in order of occurrence from left to right, and uses a force-based algorithm for the layout of the nodes on the vertical axis. Thus, by using the BDLG matrix converter tool in combination with the Event Graph Layout, one can make useful visualisations in a matter of minutes.

Getting the tool

The tool can be downloaded for Windows and for Linux from this web page. No additional software is required to use the tool, but I highly recommend using it in combination with Gephi and the Event Graph Layout plugin for Gephi. The tool is completely free, and the source code of the tool can be found here.

New piece of software: CSV-parser for Gephi

I wrote a new program, primarily intended for personal use, but probably useful to others as well. To explain why I wrote it I have to give you some background:

In my own research I tend to work with so-called Event Sequence Datasets (see my thesis), which I at some point store in a CSV format, because that is a format that many of the software packages I work with can import. One characteristic of the files that I work with is that cells can have multiple values that are separated by some delimiter character (usually semicolons). Often, these values are codes that I assigned to some entity in the dataset. I like to visualise the contents of my datasets with Gephi, and Clement Levallois wrote a very useful plugin for Gephi that allows you to import data from files that have multiple values in one cell. However, I wanted to have a bit more control over the process of parsing the data from my files into something that can be read by Gephi (as well as other software, such as the Neo4J CSV-importer). This includes the possibility of adding properties to the nodes list, based on information included in the input file. Also, because I potentially want to use the output files for other software as well, I wanted something that works independently of Gephi. I therefore decided to write my own CSV-parser, and chose to do it in C++, making heavy use of the Qt4 GUI library.

Basically, the program can be used to import a csv file that contains data on entities that you want to visualise as a network. For example, the csv file may have entities in different columns that you want to relate to each other, or you may have multiple entities in a single column that you want to relate to each other (e.g., a co-authorship network). The program allows you to select the appropriate source and target node, assign properties to these nodes, indicate the type of relationship between them (directed or undirected) and assign a label to the relationship. After specifying the desired settings, it is possible to export a nodes file, as well as an edges file, which can be imported easily into Gephi (via the data laboratory) or into other software packages.
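As a rough R sketch of what such a parsing step boils down to (this is not the program itself; the file name "records.csv" and the "authors" column are made up), consider a hypothetical input file with a column whose cells hold multiple semicolon-separated values:

# Read a hypothetical file and build an undirected co-occurrence edge list
records <- read.csv("records.csv", stringsAsFactors = FALSE)

edges <- do.call(rbind, lapply(records$authors, function(cell) {
  values <- trimws(strsplit(cell, ";")[[1]])      # split the multi-value cell
  if (length(values) < 2) return(NULL)            # nothing to relate
  pairs <- t(combn(values, 2))                    # all pairs within the cell
  data.frame(Source = pairs[, 1], Target = pairs[, 2], Type = "Undirected")
}))

nodes <- data.frame(Id = unique(trimws(unlist(strsplit(records$authors, ";")))))

write.csv(nodes, "nodes.csv", row.names = FALSE)  # import via Gephi's data laboratory
write.csv(edges, "edges.csv", row.names = FALSE)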

Currently, the program does not support the creation of dynamic networks directly, which is something Clement’s plugin does support. This feature is not currently included because I do not immediately need it myself, although I might still add it in the future. I am also thinking about creating an export-to-gexf function.

The program can be downloaded for Windows and Linux here. The source code is open, and can be found here.

Additional functions for directed, a-cyclic graphs

Earlier, I posted R-scripts that include functions for finding paths in directed, a-cyclic graphs. I (and some colleagues) use this type of graph to model processes (sequences of events) as networks. We refer to these graphs as event graphs (Boons et al. 2014; Spekkink and Boons, in press): events are represented by nodes, and the relationships between events are represented by arcs and/or edges. The underlying principles of this approach are based primarily on the work of Abell (1987) on the Syntax of Social Life, where he introduces narrative graphs.

In addition to the path-finding algorithm, I recently created a plugin for Gephi, called Lineage, that identifies all the ancestors and descendants of a node (chosen by the user) in directed graphs. Based on the logic that I used for the Lineage plugin I also did a complete rewrite of the function (for R) that I designed for finding paths in directed, a-cyclic graphs (see this post). I decided to use the same logic to write some additional functions that explore ancestors and descendants of nodes in directed a-cyclic graphs. I wrote four new functions for R:

  • GetAncestors(arcs, node): This function returns all the ancestors of a given node, which is submitted by the user by passing it as an argument of the function (node). The other argument (arcs) should be an edge list, structured as a 2-column matrix. The first column should report ‘source’ nodes, and the second column should report the ‘target’ nodes that the ‘source’ nodes link to (see table below for an example of such an edge list). This edge list represents the network under investigation.
  • GetDescendants(arcs, node): This function does the same as the GetAncestors() function, but returns the descendants of the input node, rather than its ancestors.
  • CommonAncestors(arcs, nodes): This function finds all the common ancestors of a pair of nodes, or multiple pairs of nodes. In this case too, the arcs argument should be an edge list, as described earlier. The nodes argument can either be a vector of two nodes, or a matrix with multiple pairs of nodes (similar to the edge list displayed in the table below). The function writes a file in which it is reported whether the submitted pairs of nodes have ancestors in common, how many ancestors they have in common, and what these ancestors are. The file is placed in the current working directory of the user.
  • CommonDescendants(arcs, nodes): Same as the CommonAncestors() function, but in this case the function returns the common descendants of the pairs of nodes submitted by the user.

All four functions assume that the events are labelled with numbers; if the edge lists submitted by the user do not contain numerical node labels, the functions won't work as they should.

Source Target
1 2
1 3
2 4
2 5
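A hypothetical usage example, assuming the script with the functions has been loaded with source() and using the edge list shown above:

arcs <- matrix(c(1, 2,
                 1, 3,
                 2, 4,
                 2, 5), ncol = 2, byrow = TRUE)

GetAncestors(arcs, 4)            # should return nodes 2 and 1
GetDescendants(arcs, 1)          # should return nodes 2, 3, 4 and 5
CommonAncestors(arcs, c(4, 5))   # writes a file reporting the shared ancestors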

So what are the functions useful for? I assume that different people may come up with different uses, but I use them specifically as a step in the identification of emergent linkages between events (see Spekkink and Boons, in press). It will take me some paragraphs to explain this.

In my PhD research I introduced a distinction between intentional linkages between events, and emergent linkages between events. An intentional linkage exists between two events A and B if in event B actors intentionally respond to conditions created in event A. For example, event A might concern the development of a plan, and event B may concern a feasibility study of that plan. I used the idea of intentional linkages to identify different streams of events in a process, which may, for example, represent different projects that unfold independently of each other. In my thesis, I used intentional linkages to identify independent projects that were later assembled into a larger collaborative process, and can thus be understood to serve as building blocks of the collaborative process. The figure below shows a phase model that I present in the conclusions of my thesis, which visualises the idea of different building blocks existing independently at first, and being assembled into a collaborative process at a later stage.

[Figure: FigureProcess — phase model of building blocks being assembled into a collaborative process]

Emergent linkages refer to another type of relationship between events. I found that actors involved in independent events often address similar (or the same) issues in their activities. It might, for example, be the case that different actors are working on different projects that are all related to biobased economy, but that actors do not react to each other’s projects. They may not even be aware of each other’s activities. To capture these similarities between events (i.e., the similarities of issues addressed) I introduced the notion of emergent linkages. I defined emergent linkages such that they refer to similarities between events, but that they can only exist between events that are not connected by a path of intentional linkages. The reason for the latter condition is that I was only interested in coincidental similarities. If similarities exist between events that are also connected by a path of intentional linkages, then there is a very strong chance that the similarity is also intentional. To explain what I mean, let me go back to my earlier example, concerning the intentional linkage between event A (a plan is developed), and event B (a feasibility study is performed). Say that the plan concerns the construction of a biofuel factory. In this case it will be logical that both the plan and the feasibility study address issues related to biofuel production. These similarities are not emergent, in my view, and that is why my definition of emergent linkages excludes them.

There are different approaches to identifying similarities between events. The approach that I used is described in Spekkink and Boons (in press). In short, in this approach the similarities between events are calculated as correlations between the column profiles of an incidence matrix, where the rows represent issues, and the columns represent events (a ‘1’ in a given cell indicates that the corresponding issue was addressed in the corresponding event). The problem with this approach (and other approaches) is that the similarities between all events are calculated this way. This includes the similarities that I do not consider to be emergent. Thus, I need some way of filtering out the non-emergent linkages from the complete list of similarities between events. Based on my definition, I used to filter out the similarities between events that are also on a path of intentional linkages together (this was the reason why I wrote the function for finding paths in directed, a-cyclic graphs). Lately, I have been thinking about expanding this definition, because there are cases where two events are not on a path together, but do have one or more common ancestors. Similarities between these events may also be considered intentional in some cases. For example, consider the development of a large program (event A), the details of which are worked out in multiple, independent working groups (events B, C, and D). Events B, C and D may not be intentionally linked, but they do have a common ancestor (event A), and they all inherit issues from that ancestor. I considered introducing a more restrictive definition of emergent linkages where not only events that share a path of intentional linkages are excluded, but also events that have common ancestors. That is the reason why I wrote the CommonAncestors() function. I included the other three functions because it only took small adaptations to make them, and they might be useful for someone.
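As a minimal sketch of the similarity step (with a small, made-up issue-by-event matrix rather than data from the study):

# Hypothetical incidence matrix: issues in rows, events in columns
issues <- matrix(c(1, 1, 0, 1,
                   1, 0, 1, 1,
                   0, 1, 1, 0,
                   0, 0, 0, 1), nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("issue", 1:4), paste0("event", 1:4)))

sims <- cor(issues)                       # correlations between column profiles
sims[lower.tri(sims, diag = TRUE)] <- NA  # keep each pair of events only once
candidates <- which(sims > 0.5, arr.ind = TRUE)  # arbitrary cut-off, for illustration
candidates
# The remaining step would be to drop pairs that lie on a shared path of
# intentional linkages (e.g., using FindPaths() or CommonAncestors()).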

You can download the script with the functions here.

New algorithm for finding paths

In an earlier post I described an R-script I wrote for finding paths in directed, a-cyclic graphs. That script had some problems, including efficiency problems. I made a completely new script for finding paths in directed, a-cyclic graphs, and replaced the old script with the new one. The new script also appears to find paths that the old algorithm did not identify, so I must have missed something in the old script.

The logic of the new script is quite simple. It contains one function with a number of parameters (these are also described in the script itself). The function and its possible arguments are as follows:

FindPaths(arcs, origin = 0, originsOnly = F, endsOnly = F, writeToDisk = F)

The first argument (arcs) should be an edge list with two columns, where the first column contains the source nodes, and the second column contains the target nodes of arcs (see table below for an example). All nodes should be represented by numbers for the function to work.

Source Target
1 2
1 3
2 4
2 5

The first step in the function is to identify all origins in the list of arcs (nodes with 0 indegree) and all endpoints (nodes with 0 outdegree). The arguments originsOnly or endsOnly can be set to TRUE (T) to have the function return the list of origins or endpoints respectively. No paths are identified if one of these options is chosen.
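This first step can be sketched very compactly in R (a sketch, not the script's actual code), assuming arcs is the two-column edge list:

origins   <- setdiff(arcs[, 1], arcs[, 2])   # nodes that never appear as a target
endpoints <- setdiff(arcs[, 2], arcs[, 1])   # nodes that never appear as a source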

The user may also choose a specific origin node (origin), which tells the function to only look for paths that flow from that specific node. This is a wise choice when the graph under investigation is very large and/or complex. If origin is left at its default value of 0, then the function will attempt to identify all paths in the graph.

The function first identifies an initial list of paths with only two nodes. Then this list of paths is fed into a while-loop, in which the paths are completed further (also see comments in script for more detailed descriptions).
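A compact sketch of that growing logic (again, not the downloadable script; the function name grow_paths is made up) could look like this:

# Start from the arcs leaving the origins and extend every path until no
# path can be extended any further. Assumes the graph is a-cyclic.
grow_paths <- function(arcs) {
  origins <- setdiff(arcs[, 1], arcs[, 2])
  paths <- lapply(which(arcs[, 1] %in% origins), function(i) arcs[i, ])
  repeat {
    grown <- FALSE
    new_paths <- list()
    for (p in paths) {
      nexts <- arcs[arcs[, 1] == p[length(p)], 2]  # events that follow the last event
      if (length(nexts) == 0) {
        new_paths[[length(new_paths) + 1]] <- p          # path is complete
      } else {
        for (n in nexts) {
          new_paths[[length(new_paths) + 1]] <- c(p, n)  # extend and/or branch
        }
        grown <- TRUE
      }
    }
    paths <- new_paths
    if (!grown) break
  }
  paths
}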

The user can choose to write the files to the disk immediately, by setting the function argument writeToDisk to TRUE (T). This creates a file called Paths.csv. If such a file already exists, then new paths will be appended to the existing file. This allows you to easily add paths to the file by using the function in steps (choosing a different origin point at each run), without having to manually merge the files later.

If this argument is set to FALSE (F) (default setting), then the function returns a list of all paths.
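A hypothetical usage example, assuming the script has been loaded with source():

arcs <- matrix(c(1, 2,
                 1, 3,
                 2, 4,
                 2, 5), ncol = 2, byrow = TRUE)

FindPaths(arcs, originsOnly = T)               # only list the origins (here: node 1)
FindPaths(arcs, origin = 1)                    # return all paths starting at node 1
FindPaths(arcs, origin = 1, writeToDisk = T)   # append the paths to Paths.csv instead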

The new function can be downloaded here.

See the earlier post here.

 

Small change to Dynamic Networks tool

For those who have already downloaded the Dynamic Networks tool: I made a small change to it. I noticed that, when using a sliding frame for dynamic networks, the time intervals would lead to weird behaviour in the visualisations. I therefore changed the way that time intervals are recorded for sliding frames. I uploaded an updated version of the tool (for Windows, Linux and Mac) to the software page. If you downloaded the tool before reading this post, I advise downloading the updated version.

Using R to find paths in a directed a-cyclic graph

Update

I made a new, much more efficient and effective script that makes the one described in this post obsolete. See my post on the new script here.

Introduction

In my PhD research project I reconstruct processes as sequences of events. In a recent article I visualise these sequences of events and their relationships as network graphs. This was inspired by Abell’s (1987) The Syntax of Social Life. In the network graphs the events are represented by nodes. The arcs (arrows) between the events indicate which events contributed to the conditions under which later events occurred. The layout of the events on the horizontal axis represents the order of occurrence of events. The layout on the vertical axis has no other purpose than clarity of presentation. The resulting graph is always a directed, a-cyclic graph. The graph is directed because of the way I defined the relationship (“X made a contribution to the conditions under which Y happened”), which is a one-way relationship. The graph is a-cyclic because the arcs can never point back in time.

In my current work I have to go a few steps further because of the questions that I ask (I won’t go into the details of these questions). One of the steps that I have to take in answering those questions is to find all possible paths in the sequences of events. There are multiple algorithms available for finding shortest paths between the nodes of a graph, but I specifically needed an algorithm that is capable of finding all paths. For this reason (and because I like the exercise) I decided to write my own algorithm. In this case I use R as my programming language. The resulting code can be found at the bottom of this post.

The function that runs the algorithm

To run the algorithm in R (see instructions at the bottom of this post), the user calls the following function:

Paths <- FindPaths(edgeList, 1, 128)

Here, the results of the algorithm are stored as a matrix in an object called Paths. The first argument to the function is an edge list (I called it edgeList here, but the user can use any name she/he likes), which is basically a list of all the relationships that constitute the network (see below for more details). The other two arguments indicate the pair of events for which the paths are to be reconstructed. Thus, if one wants to find the paths between multiple pairs of events, then the algorithm should be run separately for all these pairs.

The nice thing about an edge list is that it uses a very simple structure to describe networks that can be rather complex. Also, an edge list can be easily exported from my favorite network visualization program: Gephi. An edge list looks something like this:

Source Target
1 2
1 3
2 4
2 5

The algorithm only works if the events are represented by numbers that indicate the order of events, which is information that the algorithm requires. To import an edge list, first store one as (for example) a comma-delimited file (.csv) and then import it into R as a matrix. For example, the following command could be used (some of the arguments, such as the separator [sep], depend on your system settings):

edgeList <- as.matrix(read.table("MyEdgeList.csv", sep = ';', header = T))

Additional preparations (optional)

If you have stored your directed, a-cyclic graph as an edge list, it is possible to just keep it as it is and import it into R without any further changes. However, depending on the complexity of the graph (in terms of the number of nodes, the number of arcs, as well as the number of points of convergence and divergence) it may be helpful to make some further preparations before using the algorithm. First of all, for complex graphs it is usually best to first remove all the nodes that we know we don’t need, because this will save the algorithm a lot of time. As I indicated in the previous section, the algorithm always looks for the paths between two specific events. Let’s call them the origin event and the terminal event. We already know that we don’t need any of the events that are not in the past of the terminal event. Removing these events might save the algorithm some time because it won’t start looking in directions that turn out to be dead ends (see picture below).

[Figure: Paths — illustration of removing events that are not in the past of the terminal event]

The easiest way to remove the unnecessary nodes from the graph (that I currently know of) is to use an algorithm that is available in the NetworkX library for Python. To run NetworkX you’ll need to have Python installed on your machine, and you’ll also need a Python interpreter, such as IDLE (see this website). If the NetworkX library for Python is installed on your machine, you can import it by running the following command in a Python interpreter:

import networkx as nx

We can then import our graph as a gexf-file (you can export your graph as a gexf-file from Gephi). Use the following command in your Python interpreter:

G = nx.read_gexf("Path-to-your-gexf-file/example.gexf")

This command will store the graph in an object that we called G. We can then use an algorithm that finds all ancestors of a given node. Use the following command.

A = nx.ancestors(G, '10')

Here we find all the ancestors of event 10 in the graph G that we just created. The results are stored in an object that we named A, which is of the type set. We will want to export this set as a csv-file. For this we first need to import the csv library for Python (which is included as one of the standard libraries). Use the following command:

import csv

The next couple of commands can be used to export the results. Make sure that in the second command the delimiter argument corresponds with the delimiter that your system uses for csv-files (this is the same thing as the separator [sep] argument that you pass to R-functions).

myfile = open("/Path-to-export-folder/Ancestors_of_10.csv", 'wb')

wr = csv.writer(myfile, quoting = csv.QUOTE_ALL, delimiter = ';')

wr.writerows(zip(A))

myfile.close()

The exported file will now have all the ancestors of event 10 in a single column. It is useful to turn this list into a nodes list that can be imported into Gephi. First add the terminal event itself to the list, because it is not in there yet. Then give the column a header that you call Id. You can import the list as a nodes list in the Gephi data laboratory. Next, you can import the original edge list. However (and this is very important), make sure that in the import settings you uncheck the box that indicates whether missing nodes should be created as well. Export the resulting edge list as a csv-file, and you’ll have an edge list that can be used to find paths to the terminal event of interest much more efficiently. I realize that the whole process can be a bit time consuming if you have multiple terminal nodes that you want to consider, but at the moment this is the best option that I can think of. Also, depending on the complexity of your original graph, these steps may actually save you a lot of time.

What the algorithm does

A detailed description of the different steps of the algorithm is offered in the comments that accompany the source code itself (see bottom of post). The algorithm will first go through the edge list that was passed to the function and remove all edges that lie in the past of the origin event, or in the future of the terminal event. The algorithm then makes a copy of the edge list and iteratively expands that copy by adding new rows and columns to it. We’ll call this copy the paths list.

At the first iteration of the algorithm the paths list will simply be an exact copy of the original edge list. The algorithm will compare the entries in the second column of the paths list with those in the first column of the edge list. If it finds a match, then it knows that the matching edges together constitute a larger chain of events. If such a match is found, then the algorithm looks in the second column of the edge list to find out what the next event in the chain should be. This event is added to the corresponding column of the paths list. In the picture below, the cells that are shaded yellow and green indicate where the algorithm has found a match, and the orange and blue shaded cells indicate where the algorithm found and added the corresponding events to the paths to which they belong.

[Figure: ExamplePaths — example of matching entries and extending paths]

If the algorithm has finished its iteration for the current column of the paths list, then it will move on to the next column in the next iteration and repeat the process. If the algorithm finds points of divergence (multiple events follow from the event that is currently inspected), then it will add a new row to the list in which the new path is recorded. The algorithm will also recognize any paths that are actually a subsequence of a larger path and remove these from the paths list. These operations are illustrated in the picture below. The yellow, green and purple shaded cells indicate matches that the algorithm found. The orange, blue and red shaded cells indicate the events that have been added to the chains of events. For event 2, a new row had to be created, because two events follow from event 2. Two rows will be removed from the paths list during this iteration, because they contain paths that are already included in other paths.

[Figure: ExamplePaths2 — example of divergence and removal of sub-paths]

That is basically it. After the algorithm finishes all iterations, it will return the resulting paths list.

Unsolved problems

There is one quite serious problem that I wasn’t able to solve. The number of possible paths in a graph increases very quickly if there are many points of convergence (nodes with a high indegree) and divergence (nodes with a high outdegree) in the graph. In those cases the algorithm will take a long time to finish, even on a relatively fast computer.

I also haven’t figured out a way to predict how many iterations the algorithm must run before finishing the task (to be honest, I haven’t thought about it too much). The algorithm will indicate which iteration it is currently performing, but it won’t indicate how many iterations it still has to go. If you’re running the algorithm on a large and/or complex graph, expect long waiting times, especially on slower machines.

The code

You can download the code here. The zip-file contains a file called FindPaths.R. To use it in R, move the file to your current working directory for R and then type source("FindPaths.R") in your R-interface. This will make the FindPaths function available. Open the file itself with a text editor, or with R’s script editor, to find more instructions on how to use the function.

I’m relatively new to R and it is highly likely that parts of the code (or perhaps even all of it) are less efficient than they could be. I’m always open to suggestions for improvement.