
Second example of MDS, using travelling time between cities as distance

Introduction

A while ago, I wrote a post on Multidimensional Scaling in R and Gephi, using an example based on geographical distances between some Dutch cities. As I mention in that post, the example is a bit silly, because reproducing the geographical distances between the cities in an MDS configuration has little added value. I therefore decided to write a new post, in which I make use of the same set of cities, but use a different kind of distance between them. The distance that I used is travelling time when using public transport (e.g., train, bus, tram, etc.). As in the first post on this topic, I use R and Gephi in the demonstration. For a brief introduction to MDS, please see my first post on this topic, referenced above (I also mention several academic references there).

The travelling distances

As I mentioned in the introduction, I use travelling times (using public transport) between a set of cities in the Netherlands for this example. Of course, these travelling times are not absolute: they depend on the time of day at which you travel, possible disruptions along the way, and so on. To determine the travelling times, I made use of the website 9292.nl, which is the website that you can use to plan journeys with public transport in the Netherlands. For each journey, I assumed that we would want to arrive at our destination at around 09:00 on Monday October 24th, 2016. Had I picked another time of day, or another day of the week, the results of this exercise could have been different.

When searching for possible journeys, the website 9292.nl will typically give you four options that will bring you to your destination around the requested time (see the screenshot below), and it will (among other things) indicate the time it will take to reach the destination.

Screenshot of 9292.nl

I always picked the option that would bring us to the destination quickest. I also assumed that travelling from city A to city B would take the same amount of time as travelling from city B to city A. I tested this assumption for a few cities, and it does not actually hold for all of them, although the differences are small enough to be considered negligible. The picture below shows the distance matrix that I constructed this way, where the distances are travelling times in minutes when using public transport.

Travelling Distances

Running the MDS analysis in R

I performed the MDS analysis in R (see my first post for a more detailed discussion of how this works). As I mentioned in my earlier post on MDS, I embedded R in Emacs, using Emacs Speaks Statistics (just in case you are wondering). I saved the matrix shown above as a csv-file, using semicolons as the column delimiters. I then started a session of R, and loaded the vegan package, which is the package that contains the MDS algorithm that I want to use (see the screenshot below for a full log of the commands that I used). I imported the csv-file, loaded it into a data frame, and ran the metaMDS algorithm, requesting a solution with 2 dimensions. As the log below shows, I got a solution with a stress value of about 0.0917, which indicates a good fit (see my first post for some rules of thumb on stress values). I plotted the space created by the solution, which is also shown in the screenshot below.

Running the MDS in R

Visualising the result

To make a nicer visualisation, I exported the coordinates of the MDS configuration to a new csv-file. For this, I made two vectors (dim1 and dim2), and put these in a data frame as two columns (see the screenshot below for the commands I used). I then wrote the resulting data frame to the disk, with the write.table() command.

Exporting the coordinates
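For convenience, here is roughly what the full sequence of commands in the two screenshots amounts to (a sketch only: the file names are my own inventions, and I am assuming that the matrix was saved with the city names in the first column):

library(vegan)

# read the semicolon-delimited matrix of travelling times
travelMatrix <- as.matrix(read.table("TravelTimes.csv", sep = ";", header = TRUE, row.names = 1))

# run non-metric MDS, requesting a 2-dimensional solution
travelMDS <- metaMDS(as.dist(travelMatrix), k = 2)
travelMDS$stress   # this came out at about 0.0917

# plot the configuration, and export the coordinates for Gephi
plot(travelMDS$points)
dim1 <- travelMDS$points[, 1]
dim2 <- travelMDS$points[, 2]
write.table(data.frame(dim1, dim2), file = "CoordinatesTravel.csv", sep = ";", col.names = NA)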

The file that I created can be seen in the screenshot below. Before I imported this file in Gephi, I made some small changes. The original file has three columns: 1 column with the city names, and 2 columns with coordinates on 2 dimensions. I renamed the first column to Id, then copied the column entirely to a new column that I called Label. These are columns that are typically used in a Gephi nodes list, which is exactly how I am going to import the data. I did not change anything in the other two columns.

Adjusted coordinate matrix

I started up Gephi, and I imported the data from the data laboratory (see my first post for details). In this step it is especially important to import the coordinate data as double values. To reproduce the layout in Gephi, I used the MDS Layout that I wrote for Gephi some time ago. A visual inspection of the original plot that I made with R already showed me that the MDS configuration places the cities in a layout similar to the one we normally use in geographical maps, with the exception that the x-axis (dimension 1) corresponds to the north-south axis on geographical maps, and the y-axis (dimension 2) corresponds to the west-east axis. For a more intuitive map, we want this the other way around. In the MDS Layout menu, I therefore selected dimension 2 as my x-axis, and dimension 1 as my y-axis. After running the algorithm, I get the layout as shown below (I only re-positioned the labels in Inkscape).

Gephi Layout

Interpreting the result

So what do we have here? In the 2-dimensional configuration that we have created, the distances between the cities are proportional to the time it takes to travel between these cities, using public transport. In the previous example, we used geographical distances between the cities as distances, which meant that we could plot the resulting configuration on a blank map of the Netherlands, and thereby create a quite accurate geographical map. That was possible because that configuration (based on geographical distances) and the blank map we pasted it on both model ‘geographical space’. However, our new configuration, which is based on travelling times as distances, models a different kind of space. Let us call this space the ‘travelling time space’ for now.

In principle these two spaces are not really comparable. However, I thought it would still be interesting to plot both of them on a blank map of the Netherlands. Therefore, I took the configuration that I produced in my previous example (again, based on geographical distances), and copied the coordinates of that configuration into the file of the new configuration (based on travelling time). Thus, I now have a file with 4 coordinate columns, 2 of which hold the coordinates of the cities in our space of geographical distances, and 2 of which hold the coordinates of the cities in our space of travelling times (I switched some of the dimensions around to make sure that the configurations are oriented in an intuitive way). The resulting file can be seen below.

Both configurations - data file

I imported this file into Gephi, and I plotted and exported the two configurations separately. I then used Inkscape to integrate the two plots, making two different versions of the integration. In one version I used the city of Utrecht as my reference point (making sure that the position of Utrecht overlaps in both plots), because it is a central and major public transport hub in the Netherlands. In the other version I used Rotterdam as my reference point, because this led to a result in which the cities in the two configurations lie much closer to each other (also, because I have lived in Rotterdam for quite some time, I am a bit biased, and think that Rotterdam is the best reference point that you can have in the Netherlands). I gave the cities that serve as reference points a green colour to emphasise their special role in the plots, I gave the cities in the geographical space a blue colour, and I gave the cities in the travel time space an orange colour. I then pasted the results on top of a blank map of the Netherlands, and made sure that the positions of the cities in the geographical space (blue) are more or less correct. It should be noted that integrating the two spaces is, in principle, not possible, but by doing so we can make some simple comparisons of the relative distances between the cities in the two types of spaces. Using the blank map of the Netherlands mostly makes this exercise less boring, but (as we will see below) may also help us explain some observations. The resulting plots can be seen below.

Combined map A

Combined map B

What is immediately obvious is that in the space of travelling times the relative distances between the cities are different from their relative distances in the geographical space. Some cities seem to be closer to each other (e.g., Utrecht and Zwolle, Leeuwarden and Groningen, Rotterdam and Breda), while other cities are further away from each other (Utrecht and Maastricht, Utrecht and Rotterdam, Zwolle and Den Helder, Rotterdam and Middelburg). The interpretations I give below should really be taken with a grain of salt, because they are all based on intuitions, rough guesses, and pairwise comparisons of cities’ positions, where it is probably much more helpful to consider the configuration as a whole (but I don’t want to break my brain; I am going to need it for a bit longer).

In both combined plots, Den Helder and Middelburg are further away from all the other cities in the travel time space than in the geographical space. Intuitively, I think this makes sense, because due to their geographical locations, these two cities seem relatively hard to reach by public transport (or even in general). For example, there is a natural barrier between Den Helder on the one hand, and Leeuwarden, Groningen and Zwolle on the other hand. It is possible to reach Den Helder from Leeuwarden more or less directly by bus (I suspect this bus crosses the “Afsluitdijk”, indicated on the map by the line between the land mass of Den Helder, and the land mass of Leeuwarden and Groningen), but it is not possible to make that journey by train (you would have to travel via Utrecht). The estuary in which Middelburg is located can also be understood to act as a kind of natural barrier. According to a map of Dutch train tracks, there is only one train track that leads to Middelburg, coming (more or less) from the direction of Breda.

Other observations are a bit harder to explain. For example, Utrecht seems to have moved quite a bit to the northeast (this can be seen most easily in the plot where Rotterdam is the anchoring point). It might be that this is because Utrecht is less affected by the natural barrier formed by the “IJsselmeer”, and is therefore relatively close to both Den Helder and Groningen and Leeuwarden, but that does not explain, for example, why the distance between Utrecht and Rotterdam has increased. Here, it may have something to do with the exact route taken by trains that run between these two cities, and the number of stops they have to make on the way. However, it may be necessary to consider the distances between a multitude of cities to fully understand these observations. For example, Rotterdam and Breda are pulled towards each other because there is a high speed train service between these cities that does not run between most other cities. At the same time, Breda and Utrecht are probably pushed away from each other because they are relatively poorly (or indirectly) connected. Thus, these can be understood as pulling (Rotterdam-Breda) and pushing (Utrecht-Breda) forces that act against each other, while also interacting with other pulling and pushing forces, finally leading to a rather complex picture. As I mentioned before, offering a good interpretation of these distances, and especially of the differences between the geographical and the travelling time distances, is a bit tricky.

Closing statements

This was another simple example of performing MDS in R and Gephi, and I hope you enjoyed it. As promised in the first post, I will try to write posts with additional examples in the future, probably focusing more on “social distances”, thereby moving further away from the more intuitive types of distances such as geographical distance and travelling time. I hope these future examples will make clear that the concepts of space and distance can be generalised in ways that we would not consider in our daily lives, and that they can show us things that are interesting and relevant from a social scientific point of view (for example, some of Bourdieu’s most famous work makes heavy use of forms of ‘social distance’). For now, I say goodbye.

Simple example of Multidimensional Scaling with R and Gephi

Introduction

During my PhD project I developed a fascination for geometric approaches to exploratory data analysis, such as Multidimensional Scaling (MDS) and Correspondence Analysis. These methods have some very interesting applications. For example, they can help you to drastically reduce the dimensionality of datasets, capturing the most essential patterns in your data. I especially like the ways in which these methods can help you to visually explore concepts that can be captured in terms of ‘distances,’ such as (dis)similarities between people, social groups, events, etcetera. I thought it would be nice to write a few posts in which I offer basic examples of the methods in use (without going into the technical details), using only open source software. In this post, I focus specifically on MDS, and I offer a basic demonstration of the method, using R and Gephi, two amazing pieces of software.

A very brief introduction to MDS

Let me first say that I am not going to give an in-depth methodological introduction to MDS. I like to keep things at an intuitive level for this post (and I am likely to do so in future posts on this topic). There are plenty of books that offer a great technical introduction to MDS, including Kruskal and Wish (1978) (I love those little green books on Quantitative Applications in the Social Sciences) and Gatrell (1983) (oh, the smell of old books). I should also say that there is more than one type of MDS. For example, there is the distinction between metric and non-metric MDS. I am not going into all these nuances either, but I can at least share with you that, in the example below, I am applying non-metric MDS.

Basically, what MDS can do for you is to take a matrix in which the distances between a set of entities are recorded (i.e., a distance matrix), and try to create a low-dimensional spatial configuration in which the same entities are located in such a way that the distances between them are proportional to the distances in the original distance matrix. This may sound trivial, but keep in mind that some types of distances may be very difficult to accurately reproduce in a low-dimensional space. For example, I have experimented with using MDS to visualize the (dis)similarities between social events, based on the actors and issues (e.g., problem and solution definitions) that the events have in common. I usually ended up needing at least 6 dimensions to accurately express these (dis)similarities. Unfortunately, it is impossible to visualize any space with more than three dimensions in one picture, and trying to do so is a serious threat to your mental health.

At this point, some of you might already wonder how we can decide the number of dimensions that we need to accurately represent the contents of a given distance matrix. MDS comes with a measure called ‘stress,’ which you can interpret as a measure of fit between the spatial configuration produced, and the original distance matrix. The higher the stress, the worse the fit. In other words: The higher the stress, the less confidence you should have that the spatial configuration you produced accurately represents the distances in your original distance matrix. Kruskal and Wish (1978) offer an insightful discussion on stress, and on different rules of thumb on how to interpret stress values. A very rough rule of thumb is that stress values below 0.10 usually indicate a good fit, and that values above 0.20 are unacceptable (but don’t cite me on this!).

Our example of today

The example that I will be using is a bit silly, but it serves quite well to show how powerful MDS can be at creating spaces that accurately reflect the distances that you feed into the analysis. For this example, I have taken 12 cities in the Netherlands, and I looked up the distances between all of them. To find the distances, I used this website (thank you, Google). See the distance matrix that I produced below (some of the larger pictures in this post may be easier to read if you click on them).

Screen_0

There is no particular reason why I picked these cities, except for the fact that they are spread out quite nicely throughout the Netherlands. I already know that it will be very easy to represent these distances in a 2-dimensional space, because that is what we do all the time with geographical maps. So the results will not be very surprising, but I still think they are cool enough.

Let’s get to work!

So, let’s do this! The first thing I did was to save the matrix shown above as a comma-delimited file (.csv extension), because it is easy to import such files into R. If you don’t know how to do this, go to ‘save as’ in Excel, and find comma-delimited in the file options. Comma-delimited files are basically text files that use commas, or other symbols, to separate columns. I always use semicolons ‘;’ as separators, but you can use whatever you like.

Next, we start up R. My R-session may look a bit weird to you, because I embedded R in Emacs, using Emacs Speaks Statistics, and that is just because I am a nerd. You can of course use the standard GUI for R if you like. R comes with built-in functions that can do MDS, but I like to work with a package called ‘vegan.’ I already had the package installed, but if you don’t, then you can install the package by typing install.packages("vegan") into the R-console. After that, you can activate the package by typing library(vegan).

After starting R, I imported the csv-file as a matrix object (see picture further below for a log of all the commands that I used). I called the matrix geoMatrix. After importing the file, I immediately turned the matrix into a so-called dist object (called geoDist), which is something that R uses to store distances. This is also the object that I fed into the MDS analysis.

The ‘vegan’ package comes with a function called metaMDS(). The function takes a lot of arguments, but most of them have default values, and for our example these default values will work just fine. There are two arguments that I would like to draw your attention to: The first argument to the function should be the distance object that you want to use as the input for the MDS analysis. There is another argument k, which indicates the number of dimensions that you want the spatial configuration to have. The default value for k is 2, which is fine for us. However, if you find that with this default value you end up with high stress values, then you might want to consider increasing the value of k. You would then, for example, use the function with the following arguments: metaMDS(geoDist, k = 3).

I use the function with the geoDist object as the only argument, and I tell R to assign the results of the function to an object called geoMDS. The function will start an iterative process in which it attempts to come up with increasingly accurate spatial configurations, and it will spit out the stress value that it ends up with at each run. By default, this function will make up to 20 attempts to come up with a decent configuration (this is also something you can change by changing the arguments given to the function), but in our case the function already reaches a very low stress value at its second run (see log below).

I immediately plotted the resulting configuration, using the standard plot() function. To assign labels to the points in the visualized configuration, I used the function text(). See the screen shot below for a log of all the commands I used (including the arguments given to the different functions) on the right, and the plot of the resulting configuration on the left (if you look below the visualization, you’ll see that I was checking out a discussion on calculating stress in R, just before doing the analysis).
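If the log in the screen shot is hard to read, the commands were essentially the following (a sketch: the file name is made up, and I am assuming that the city names sit in the first column of the csv-file):

library(vegan)

# import the semicolon-delimited csv-file as a matrix
geoMatrix <- as.matrix(read.table("GeoDistances.csv", sep = ";", header = TRUE, row.names = 1))

# turn the matrix into a dist object, and feed it into the MDS analysis
geoDist <- as.dist(geoMatrix)
geoMDS <- metaMDS(geoDist)   # k = 2 by default

# plot the configuration, and add the city names as labels
plot(geoMDS$points)
text(geoMDS$points, labels = rownames(geoMatrix), pos = 3)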

Screen_1

After a visual inspection of the results, I decided that the distances seem pretty close to the actual distances you will find on a map. However, the layout of the cities is a bit counterintuitive. Also, the orientation of the space is different from what we are used to with geographical maps, so I decided to import the data into Gephi, and change the layout there, without changing the distances.

Importing the data into Gephi

To do this, I first had to export the coordinates of all the cities in the new space that the MDS analysis created for them. These coordinates are stored somewhere in the geoMDS object. This object has a rather complicated structure that I do not want to get into, and I do not expect you to immediately understand the trick that I use to extract the coordinates from the object. However, let me tell you what I did: I created two new vectors (one for each dimension). The first vector holds all the coordinates in dimension 1, and the second vector holds all the coordinates in dimension 2. In the geoMDS-object, these coordinates are stored in what can be understood as a ‘sub-object’ of geoMDS, which is called ‘points,’ and which is basically a two-column matrix (1 column for each dimension). All coordinates in dimension 1 are stored in the first column, and all coordinates in dimension 2 are stored in the second column. So, if you understand the basics of R, you may also understand that we can access the coordinates in each dimension with the commands geoMDS$points[,1] (for dimension 1) and geoMDS$points[,2] (for dimension 2). After putting the coordinates in two separate vectors, I assembled the vectors into a data.frame. Then I wrote the data.frame to the disk, using the write.table() function. I wrote the results to a file that I named “CoordinatesGeo.csv.” See the screen shot below to see the commands that I used.

Screen_2
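In plain text, the commands in the screen shot come down to this (object and file names as described above):

dim1 <- geoMDS$points[, 1]   # all coordinates on dimension 1
dim2 <- geoMDS$points[, 2]   # all coordinates on dimension 2

# the city names travel along as the row names of the data frame
coordinates <- data.frame(dim1, dim2)
write.table(coordinates, file = "CoordinatesGeo.csv", sep = ";", col.names = NA)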

Before we can import the resulting file in Gephi, it needs some minor changes. Without changes, “CoordinatesGeo.csv” is a table with three columns: One unnamed column with the names of the cities, and two named columns with the coordinates that we created with our MDS analysis. I turned this table into a Gephi nodes list by naming the column with city names “Id”, and by copying this column to a second column that I named “Label” (see screen shot below).
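As an aside, if you would rather skip the manual editing, the same columns could also be created in R before writing the file (a sketch, not what I did myself):

cities <- rownames(geoMDS$points)
nodesList <- data.frame(Id = cities, Label = cities, dim1, dim2)
write.table(nodesList, file = "CoordinatesGeo.csv", sep = ";", row.names = FALSE)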

Screen_3

Now we start up Gephi. In Gephi, we start a new project, and we go to the data laboratory. There, we use the ‘import spreadsheet’ function (see below).

Screen_4

This will open a new dialog, where we can indicate how we want to import our data. One thing is very important here, and that is that we want to import the coordinates as numerical data. I usually import numerical data as doubles (see below).

Screen_5

If the settings are as shown in the screen shot above, you should be okay. We can go back to the overview in Gephi, and the nodes will be randomly distributed in the plot screen. Gephi doesn’t come with a built-in layout algorithm for nodes that have coordinates, but you can use a plugin that I created some time ago, called MDS Layout. You should have this plugin installed before you can proceed with the next steps.

I selected the MDS Layout from the layout menu, and the plugin automatically recognizes the variables ‘dim1’ and ‘dim2’ as appropriate dimensions. The layout that is created if you run the plugin is the same as what we saw when we plotted the results in R (see below).

Screen_7

I wanted the cities to be oriented the same way as in commonly used geographical maps, and for that I used Gephi’s built-in layout algorithm ‘Clockwise rotate’ (interestingly, this algorithm rotates the nodes in the counter-clockwise direction). I set the algorithm to rotate the layout by 90 degrees. After doing that, I immediately saw that the layout of the cities created by the MDS is like a mirror image of the layout that you’ll find on a geographical map (Maastricht is in the West, while it should be in the East, and Middelburg is in the East, while it should be in the West). This is something that we cannot fix in Gephi, but we can easily fix it in Excel.

I reopened the nodes list we created earlier, and after inspecting the coordinates I saw that the coordinates in dimension 2 are the coordinates that are mirrored. To fix this, I created a new column with coordinates for dimension 2. The values in this column are the values of the original column multiplied by -1 (see below).

Screen_8
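Incidentally, the same flip takes only one line in R, in case you prefer to avoid Excel (a sketch, assuming the coordinates still live in a data frame like the one created earlier):

coordinates$dim2 <- -coordinates$dim2   # mirror the configuration on dimension 2
write.table(coordinates, file = "CoordinatesGeo.csv", sep = ";", col.names = NA)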

I imported the adjusted nodes list into Gephi, using the same approach as before. I used the MDS Layout again to create my initial layout. In this case too, I had to use the “Clockwise rotate” layout to rotate the configuration by 90 degrees. I also increased the size of the nodes a bit and gave them a color, just for aesthetic reasons. See the result below. If you know the map of the Netherlands, then you’ll see that this is already more like it!

Screen_10

As an ultimate test (okay, this is a bit of an exaggeration) I decided to plot the points on top of a blank map of the Netherlands. I found such a map easily enough through Google Images. I exported the visualization from Gephi as an SVG-file, such that I could adjust the positions of the labels (the city names) in Inkscape (another wonderful piece of open source software). After adjusting the labels, I exported the result as a PNG image. I then opened the blank map of the Netherlands, as well as the PNG file with the cities, in Gimp (yes, yes, open source) and pasted the cities on the blank map. It took some rescaling and additional rotating to get the layout right, but ladies and gentlemen, the result looks very, very nice.

BlankMap

Just compare the results to an actual map of the Netherlands, and you’ll see that the MDS analysis did a very good job at creating a spatial configuration that accurately represents the distances that we fed into it. The main difference between the layout of the cities in the MDS space, and the layout of the cities in ‘actual’ geographic space is that the layout of the cities on one dimension was mirrored. It reminds me of those crappy Sci-Fi stories in which mirrors are portals to alternative dimensions, but as I mentioned before: I am a nerd.

Like I said, this is a bit of a silly example, but I plan to come up with more interesting ones in the future, where I will explore, for example, the application of MDS to ‘social stuff.’

Hope you enjoyed this!

BTW: Some time ago I created a plugin for Gephi that does MDS, using the path distances between the nodes of a graph as its input. The plugin produces coordinates that can be used to create layouts with the MDS layout algorithm. See here for more details.

 

Simulating a neighborhood network (not serious science)

Intro

Just to warn you: This is not an attempt at serious science. It is just something I made for fun, based on a project that two students are working on.

Background

Let me start by giving some background to what I’ve done. I’m involved in the European GLAMURS project as a postdoctoral researcher. Among other things, we investigate what role bottom-up citizen initiatives can have in sustainability transitions. In the Netherlands, one of the initiatives we work together with is Vogelwijk Energie(k), which is a very interesting local energy initiative in The Hague. One of the things that makes the initiative special is that it is situated in a neighborhood with exceptionally strong investment power, and relatively many connections to important professional networks. Since its early stages (it started in 2009), Vogelwijk Energie(k) has had around 250 members (there are over 2000 households in the neighborhood). One of the ambitions of the board of the initiative is to mobilize a ‘second wave’ of people within and outside the neighborhood, to have them invest in sustainable development. This doesn’t mean that Vogelwijk Energie(k) wants to attract more members; the idea is to raise awareness about sustainability among a broader group of people, and stimulate them to act on that awareness.

We had the opportunity to write an assignment on this problem for a course on Agent Based Modeling (ABM). Two students have been working on that assignment for some time now. They are basically attempting to develop an ABM that models the process through which the mobilization of the ‘second wave’ could take place, allowing for the exploration of different scenarios. Within a few weeks, the two students had developed some very interesting ideas for the model. Their work so far seems to be based on the implicit theory that people can develop a stronger awareness about sustainability by talking about the subject with their neighbors, and that increased awareness may at some point trigger an exploratory process that may or may not lead to the decision to start investing in sustainability. Their conceptual model is actually more complex, but this is the gist of it. One of the reasons why I like their ideas is that, without knowing it, they included some kind of awareness-behavior gap in their model (increased awareness does not immediately lead to changes in behavior).

The problem

One of the operational problems that the students are currently facing is how to simulate the interactions between neighbors. Intuitively, we can already say that the likelihood of neighbors interacting with each other is not equal for all pairs of neighbors. Any person is likely to interact frequently with only a small number of his/her neighbors. Who these neighbors are will also depend (intuitively speaking) on how close they live to each other, and how many opportunities they have to meet (e.g., their children go to the same school, they go to the same supermarket, they are members of the same associations). You’ll get the idea.

One approach to modeling the patterns of interactions between neighbors is the social networks approach. We can visualize neighbors as nodes, and we can visualize frequent interactions between neighbors by drawing edges between those neighbors. In this context, the students had a question for me: “What do you think this network should look like?” This is a rather difficult question to answer without any empirical observations on the actual interactions that take place in the neighborhood. Ideally, we would do a survey on this among the residents of the neighborhood, but that is well beyond the scope of the students’ assignment. I told the students I needed to think about this problem for a while.

Creating a random network based on distances

One of my initial thoughts on the problem described above is that the structure of a social network in a neighborhood will at least depend on the geographical proximity of the neighbors in the network. I realize that there are many other mechanisms that will influence the structure of the network, but it is the geographical dimension that I focused on this evening.

One of the things that I wanted to do is to place neighbors in a space that more or less corresponds with the geographical boundaries of the Vogelwijk neighborhood. I first took a look at a map of the neighborhood on Google Maps. In the visualization below, the neighborhood is marked by the red area.

Vogelwijk

As you can see, the neighborhood includes two large green areas that have no houses. I used Google Maps to mark the ‘inhabited’ areas, and make a rough calculation of their combined surface area (see below).

VogelwijkSurface

Google Maps tells us that the area with the black outlines is about 1.12 square kilometers, but to keep things simple, I decided to assume that the area is a rectangular area that is 2.5 kilometers wide, and 0.5 kilometers high, which brings us to a surface area of 1.25 square kilometers.

I also decided to focus on households, rather than neighbors, and I found that there are over 2000 households in the neighborhood. I decided to round the number of households down to 2000.

I created a so-called nodes list in which I listed 2000 households that I simply numbered from 1 to 2000. I randomly assigned X-coordinates and Y-coordinates to each household. The X-coordinates are a random number between -1250 and 1250, and the Y-coordinates are a random number between -250 and 250. See a screen shot of part of the nodes list below.

Nodes
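A minimal sketch of how such a nodes list can be generated in R (the seed and the file name are my own additions):

set.seed(2016)   # only to make the random coordinates reproducible
households <- data.frame(Id = 1:2000,
                         X = runif(2000, min = -1250, max = 1250),
                         Y = runif(2000, min = -250, max = 250))
write.table(households, file = "Nodes.csv", sep = ";", row.names = FALSE)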

I wrote a script in R with several functions. One of the functions calculates the Euclidean distances between all the households in the neighborhood, based on the randomly assigned coordinates. The function returns a distance matrix that reports the distances between all pairs of households. Another function normalizes these distances (all distances are converted to proportional values between 0 and 1), and inverts them to create proximities (1-distance). The resulting proximity matrix became the basis for the simulation of the network.
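My script is not reproduced here, but these two functions essentially boil down to the following steps (a sketch of my reading of the procedure described above):

# Euclidean distances between all pairs of households, based on their coordinates
distances <- as.matrix(dist(households[, c("X", "Y")]))

# normalize the distances to values between 0 and 1, and invert them into proximities
proximities <- 1 - (distances / max(distances))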

I wrote a simple function in R to simulate the network in the neighborhood, using the proximities of the households as a basis. The logic of the function is very simple: The function considers each pair of households in turn. The proximity of the households (a number between 0 and 1) is multiplied by a random number, for example a number between 0 and 1 (which is used to simulate other influences; I know it is very naive). The resulting number is then compared to a threshold. If the number is below that threshold, then there is no tie between the two households. If the number is equal to, or above the threshold, then a tie between the households is created. I made sure that it is possible for the user of the function to set the threshold, as well as boundaries for the random number that the proximities are multiplied with. Different parameters for this will also lead to different networks. I ran the function numerous times, each time with different parameters. Below, I visualize 2 examples.
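In code, the logic of that function could look roughly like this (the function name, the default threshold, and the default boundaries for the random number are all my own choices):

simulateNetwork <- function(proximities, threshold = 0.8, low = 0, high = 1) {
  n <- nrow(proximities)
  adjacency <- matrix(0, nrow = n, ncol = n)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      # multiply the proximity by a random number to simulate other influences
      score <- proximities[i, j] * runif(1, min = low, max = high)
      if (score >= threshold) {   # a tie is only created if the score reaches the threshold
        adjacency[i, j] <- 1
        adjacency[j, i] <- 1
      }
    }
  }
  return(adjacency)
}

network <- simulateNetwork(proximities, threshold = 0.85)

With 2000 households this means almost two million pairs, so the double loop takes a moment to run, but it does the job.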

Both examples are visualized with Gephi. To visualize a network I opened its adjacency matrix in Gephi, and I imported the nodes list with the household coordinates separately. I used the MDS Layout algorithm to lay out the households. Below is a first example (clicking the picture should enlarge it). The size of the nodes is based on their degree. The colors indicate communities, which I identified using Gephi’s built-in modularity algorithm.

Example_3

In the picture you can see that ties between the households are relatively sparse. This network has an average degree of 2.063. In the picture below you can see that most households have one or two connections with other households in the neighborhood. This may not seem like much, but from the few papers on neighborhood networks that I have scanned so far, I understood that ties within a neighborhood tend to be sparse (people are typically more strongly connected with people outside their neighborhood). The visualization also nicely shows the effect of the simple simulation function that I wrote: The connections only exist between households that are relatively proximate.

DegreeDistribution_Example3

The pictures below show another example, based on a network that I generated using other parameters. This network has more connections, which can also be seen in the graph of the degree distribution. In this case, most households seem to have connections with around 17 other households in their neighborhood, which intuitively sounds unrealistic to me, but it sure creates a pretty picture. I also like how the neighborhood divides up nicely into different communities.

Example_4

DegreeDistribution_Example4

Closing

So, that was it. I know this is not a very serious simulation of neighborhood networks, and indeed it is based on intuitions, and not on good science. But it was fun to do!

Additional functions for directed, acyclic graphs

Earlier, I posted R-scripts that include functions for finding paths in directed, acyclic graphs. I (and some colleagues) use this type of graph to model processes (sequences of events) as networks. We refer to graphs of this type as event graphs (Boons et al. 2014; Spekkink and Boons, in press), where events are represented by nodes, and the relationships between events are represented by arcs and/or edges. The underlying principles of this approach were based primarily on the work of Abell (1987) on the Syntax of Social Life, where he introduces narrative graphs.

In addition to the path-finding algorithm, I recently created a plugin for Gephi, called Lineage, that identifies all the ancestors and descendants of a node (chosen by the user) in directed graphs. Based on the logic that I used for the Lineage plugin, I also did a complete rewrite of the function (for R) that I designed for finding paths in directed, acyclic graphs (see this post). I decided to use the same logic to write some additional functions that explore ancestors and descendants of nodes in directed, acyclic graphs. I wrote four new functions for R:

  • GetAncestors(arcs, node): This function returns all the ancestors of a given node, which is submitted by the user by passing it as an argument of the function (node). The other argument (arcs) should be an edge list, structured as a 2-column matrix. The first column should report ‘source’ nodes, and the second column should report the ‘target’ nodes that the ‘source’ nodes link to (see table below for an example of such an edge list). This edge list represents the network under investigation.
  • GetDescendants(arcs, node): This function does the same as the GetAncestors() function, but returns the descendants of the input node, rather than its ancestors.
  • CommonAncestors(arcs, nodes): This function finds all the common ancestors of a pair of nodes, or multiple pairs of nodes. In this case too, the arcs argument should be an edge list, as described earlier. The nodes argument can either be a vector of two nodes, or a matrix with multiple pairs of nodes (similar to the edge list displayed in the table below). The function writes a file in which it is reported whether the submitted pairs of nodes have ancestors in common, how many ancestors they have in common, and what these ancestors are. The file is placed in the current working directory of the user.
  • CommonDescendants(arcs, nodes): Same as the CommonAncestors() function, but in this case the function returns the common descendants of the pairs of nodes submitted by the user.

All four functions assume that the events are labelled with numbers. Unless the edge lists that are submitted by the user have numerical variables, the functions won’t work as they should.

Source Target
1 2
1 3
2 4
2 5
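As a quick illustration, this is how the functions could be called on the edge list shown above (a sketch, assuming the script linked at the bottom of this post has been sourced):

arcs <- matrix(c(1, 2,
                 1, 3,
                 2, 4,
                 2, 5), ncol = 2, byrow = TRUE)

GetAncestors(arcs, 4)            # should return nodes 1 and 2
GetDescendants(arcs, 1)          # should return nodes 2, 3, 4 and 5
CommonAncestors(arcs, c(4, 5))   # writes a file reporting the common ancestors of 4 and 5 (nodes 1 and 2)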

So what are the functions useful for? I assume that different people may come up with different uses, but I use them specifically as a step in the identification of emergent linkages between events (see Spekkink and Boons, in press). It will take me some paragraphs to explain this.

In my PhD research I introduced a distinction between intentional linkages between events, and emergent linkages between events. An intentional linkage exists between two events A and B if in event B actors intentionally respond to conditions created in event A. For example, event A might concern the development of a plan, and event B may concern a feasibility study of that plan. I used the idea of intentional linkages to identify different streams of events in a process that may, for example, represent different projects that unfold independently of each other. In my thesis, I used intentional linkages to identify independent projects that were later assembled into a larger collaborative process, and can thus be understood to serve as building blocks of the collaborative process. The figure below shows a phase model that I present in the conclusions of my thesis, and that aims to visualize the idea of different building blocks existing independently at first, and being assembled into a collaborative process at a later stage.

FigureProcess

Emergent linkages refer to another type of relationship between events. I found that actors involved in independent events often address similar (or the same) issues in their activities. It might, for example, be the case that different actors are working on different projects that are all related to biobased economy, but that actors do not react to each other’s projects. They may not even be aware of each other’s activities. To capture these similarities between events (i.e., the similarities of issues addressed) I introduced the notion of emergent linkages. I defined emergent linkages such that they refer to similarities between events, but that they can only exist between events that are not connected by a path of intentional linkages. The reason for the latter condition is that I was only interested in coincidental similarities. If similarities exist between events that are also connected by a path of intentional linkages, then there is a very strong chance that the similarity is also intentional. To explain what I mean, let me go back to my earlier example, concerning the intentional linkage between event A (a plan is developed), and event B (a feasibility study is performed). Say that the plan concerns the construction of a biofuel factory. In this case it will be logical that both the plan and the feasibility study address issues related to biofuel production. These similarities are not emergent, in my view, and that is why my definition of emergent linkages excludes them.

There are different approaches to identifying similarities between events. The approach that I used is described in Spekkink and Boons (in press). In short, in this approach the similarities between events are calculated as correlations between the column profiles of an incidence matrix, where the rows represent issues, and the columns represent events (a ‘1’ in a given cell indicates that the corresponding issue was addressed in the corresponding event). The problem with this approach (and other approaches) is that the similarities between all events are calculated this way. This includes the similarities that I do not consider to be emergent. Thus, I need some way of filtering out the non-emergent linkages from the complete list of similarities between events. Based on my definition, I used to filter out the similarities between events that are also on a path of intentional linkages together (this was the reason why I wrote the function for finding paths in directed, acyclic graphs). Lately, I have been thinking about expanding this definition, because there are cases where two events are not on a path together, but do have one or more common ancestors. Similarities between these events may also be considered intentional in some cases. For example, consider the development of a large program (event A), the details of which are worked out in multiple, independent working groups (events B, C, and D). Events B, C and D may not be intentionally linked, but they do have a common ancestor (event A), and they all inherit issues from that ancestor. I considered introducing a more restrictive definition of emergent linkages where not only nodes that share a path of intentional linkages are excluded, but also events that have common ancestors. That is the reason why I wrote the CommonAncestors() function. I included the other three functions because it only took small adaptations to make them, and they might be useful for someone.
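To give a rough idea of the first step, the similarity calculation can be sketched in a few lines of R (the miniature incidence matrix below is made up purely for illustration; the actual procedure is described in Spekkink and Boons, in press):

# rows represent issues, columns represent events;
# a 1 indicates that the issue was addressed in the event
incidence <- matrix(c(1, 1, 0,
                      0, 1, 1,
                      1, 0, 1), nrow = 3, byrow = TRUE)

# similarities as correlations between the column profiles
similarities <- cor(incidence)

Similarities between events that share a path of intentional linkages (or, under the more restrictive definition, one or more common ancestors) would then have to be filtered out of this matrix.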

You can download the script with the functions here.

New algorithm for finding paths

In an earlier post I described an R-script I wrote for finding paths in directed, acyclic graphs. That script had some problems, including efficiency problems. I made a completely new script for finding paths in directed, acyclic graphs, and replaced the old script with the new one. The new script also appears to find paths that the old algorithm did not identify, so I must have missed something in the old script.

The logic of the new script is quite simple. It contains one function with a number of parameters (these are also described in the script itself). The function and its possible arguments are as follows:

FindPaths(arcs, origin = 0, originsOnly = F, endsOnly = F, writeToDisk = F)

The first argument (arcs) should be an edge list with two columns, where the first column contains the source nodes, and the second column contains the target nodes of arcs (see table below for an example). All nodes should be represented by numbers for the function to work.

Source Target
1 2
1 3
2 4
2 5

The first step in the function is to identify all origins in the list of arcs (nodes with 0 indegree) and all endpoints (nodes with 0 outdegree). The arguments originsOnly or endsOnly can be set to TRUE (T) to have the function return the list of origins or endpoints respectively. No paths are identified if one of these options is chosen.

The user may also choose a specific origin node (origin), which tells the function to only look for paths that flow from that specific node. This is a wise choice when the graph under investigation is very large and/or complex. If origin is left at its default value of 0, then the function will attempt to identify all paths in the graph.

The function first identifies an initial list of paths with only two nodes. Then this list of paths is fed into a while-loop, in which the paths are completed further (also see comments in script for more detailed descriptions).

The user can choose to write the files to the disk immediately, by setting the function argument writeToDisk to TRUE (T). This creates a file called Paths.csv. If such a file already exists, then new paths will be appended to the existing file. This allows you to easily add paths to the file by using the function in steps (choosing a different origin point at each run), without having to manually merge the files later.

If this argument is set to FALSE (F) (default setting), then the function returns a list of all paths.
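To illustrate, this is how the function might be called on the small edge list shown above (a sketch, assuming the script has been sourced):

arcs <- matrix(c(1, 2,
                 1, 3,
                 2, 4,
                 2, 5), ncol = 2, byrow = TRUE)

FindPaths(arcs, originsOnly = T)              # only return the origins (here: node 1)
paths <- FindPaths(arcs)                      # identify all paths in the graph
FindPaths(arcs, origin = 1, writeToDisk = T)  # write all paths flowing from node 1 to Paths.csv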

The new function can be downloaded here.

See the earlier post here.

 

Using R to find paths in a directed, acyclic graph

Update

I made a new, much more efficient and effective script that makes the one described in this post obsolete. See my post on the new script here.

Introduction

In my PhD research project I reconstruct processes as sequences of events. In a recent article I visualize these sequences of events and their relationships as network graphs. This was inspired by Abell’s (1987) The Syntax of Social Life. In the network graphs the events are represented by nodes. The arcs (arrows) between the events indicate which events contributed to the conditions under which later events occurred. The layout of the events on the horizontal axis represents the order of occurrence of events. The layout on the vertical axis has no other purpose than clarity of presentation. The resulting graph is always a directed, acyclic graph. The graph is directed because of the way I defined the relationship (“X made a contribution to the conditions under which Y happened”), which is a one-way relationship. The graph is acyclic because the arcs can never point back in time.

In my current work I have to go a few steps further because of the questions that I ask (I won’t go into the detail of these questions). One of the steps that I have to take in answering those questions is to find all possible paths in the sequences of events. There are multiple algorithms available for finding shortest paths between the nodes of a graph, but I specifically needed an algorithm that is capable of finding all paths. For this reason (and because I like the exercise) I decided to write my own algorithm. In this case I use R as my programming language. The resulting code can be found at the bottom of this post.

The function that runs the algorithm

To run the algorithm in R (see instructions at the bottom of this post), the user calls the following function:

Paths <- FindPaths(edgeList, 1, 128)

Here, the results of the algorithm are stored as a matrix in an object called Paths. The first argument to the function is an edge list (I called it edgeList here, but the user can use any name she/he likes), which is basically a list of all the relationships that constitute the network (see below for more details). The other two arguments indicate the pair of events for which the paths are to be reconstructed. Thus, if one wants to find the paths between multiple pairs of events, then the algorithm should be run separately for all these pairs.

The nice thing about an edge list is that it uses a very simple structure to describe networks that can be rather complex. Also, an edge list can be easily exported from my favorite network visualization program: Gephi. An edge list looks something like this:

Source Target
1 2
1 3
2 4
2 5

The algorithm only works if the events are represented by numbers that indicate the order of events, which is information that the algorithm requires. To import an edge list, first store one as (for example) a comma-delimited file (.csv) and then import it into R as a matrix. For example, the following command could be used (some arguments that should be used, such as the separator [sep], depend on your system settings):

edgeList <- as.matrix(read.table("MyEdgeList.csv", sep = ";", header = T))

Additional preparations (optional)

If you have stored your directed, acyclic graph as an edge list it is possible to just keep it as it is and import it into R without any further changes. However, depending on the complexity of the graph (in terms of the number of nodes, the number of arcs, as well as the number of points of convergence and divergence) it may be helpful to make some further preparations before using the algorithm. First of all, for complex graphs it is usually best to first remove all the nodes that we already know we don’t need, because this will save the algorithm a lot of time. As I indicated in the previous section, the algorithm always looks for the paths between two specific events. Let’s call them the origin event and the terminal event. We already know that we don’t need the events that are not in the past of the terminal event. Removing these events might save the algorithm some time because it won’t start looking in directions that turn out to be dead ends (see picture below).

Paths

The easiest way to remove the unnecessary nodes from the graphs (that I currently know of) is to use an algorithm that is available in the NetworkX library for Python. To run NetworkX you’ll need to have Python installed on your machine, and you’ll also need a Python interpreter, such as IDLE (see this website). If the NetworkX libraries for Python are installed on your machine, you can import the NetworkX library by running the following command in a Python interpreter:

import networkx as nx

We can then import our graph as a gexf-file (you can export your graph as a gexf-file from Gephi). Use the following command in your Python interpreter:

G = nx.read_gexf("Path-to-your-gexf-file/example.gexf")

This command will store the graph in an object that we called G. We can then use an algorithm that finds all ancestors of a given node. Use the following command.

A = nx.ancestors(G, '10')

Here we find all ancestors of event 10 in the graph G, the graph that we just created. The results are stored in an object that we named A. This object A that we just created is an object of the type set. We will want to export this set as a csv-file. For this we first need to import the csv library for Python (which is included as one of the standard libraries). Use the following command:

import csv

The next couple of commands can be used to export the results. Make sure that in the second command the delimiter argument corresponds with the delimiter that your system uses for csv-files (this is the same thing as the separator [sep] argument that you pass to R-functions).

myfile = open("/Path-to-export-folder/Ancestors_of_10.csv", 'wb')

wr = csv.writer(myfile, quoting = csv.QUOTE_ALL, delimiter = ';')

wr.writerows(zip(A))

myfile.close()

The exported file will now have all the ancestors of event 10 in a single column. It is useful to turn this list into a nodes list that can be imported into Gephi. First add the terminal event itself to the list, because it is not in there yet. Then give the column a header that you call Id. You can import the list as a nodes list in the Gephi data laboratory. Next, you can import the original edge list. However (and this is very important), make sure that in the import settings you uncheck the box that indicates whether missing nodes should be created as well. Export the resulting edge list as a csv-file, and you’ll have an edge list that can be used to find paths to the terminal event of interest much more efficiently. I realize that the whole process can be a bit time consuming if you have multiple terminal nodes that you want to consider, but at the moment this is the best option that I can think of. Also, depending on the complexity of your original graph, these steps may actually save you a lot of time.

What the algorithm does

A detailed description of different steps of the algorithm is offered in the comments that accompany the source code itself (see bottom of post). The algorithm will first go through the edge list that was passed to the function and remove all edges that lie in the past of the origin event, or in the future of the terminal event. The algorithm then makes a copy of the edge list, and then basically iteratively expands that copy by adding new rows and columns to it. We’ll call this copy the paths list.

At the first iteration of the algorithm the paths list will simply be an exact copy of the original edge list. The algorithm will compare the entries in the second column of the paths list with those in the first column of the edge list. If it finds a match, then it knows that the matching edges together constitute a larger chain of events. If such a match is found, then the algorithm looks in the second column of the edge list to find out what the next event in the chain should be. This event is added to the corresponding column of the paths list. In the picture below, the cells that are shaded yellow and green indicate where the algorithm has found a match, and the orange and blue shaded cells indicate where the algorithm found and added the corresponding events to the paths to which they belong.

ExamplePaths

If the algorithm has finished its iteration for the current column of the paths list, then it will move on to the next column in the next iteration and repeat the process. If the algorithm finds points of divergence (multiple events follow from the event that is currently inspected), then it will add a new row to the list in which the new path is recorded. The algorithm will also recognize any paths that are actually a subsequence of a larger path and remove these from the paths list. These operations are illustrated in the picture below. The yellow, green and purple shaded cells indicate matches that the algorithm found. The orange, blue and red shaded cells indicate the events that have been added to the chains of events. For event 2, a new row had to be created, because two events follow from event 2. Two rows will be removed from the paths list during this iteration, because they contain paths that are already included in other paths.

ExamplePaths2

That is basically it. After the algorithm finishes all iterations, it will return the resulting paths list.

Unsolved problems

There is one quite serious problem that I wasn’t able to solve. The number of possible paths in a graph increases very quickly if there are many points of convergence (nodes with a high indegree) and divergence (nodes with a high outdegree) in the graph. In those cases the algorithm will take a long time to finish, even on a relatively fast computer.

I also haven’t figured out a way to predict how many iterations the algorithm must run before finishing the task (to be honest, I haven’t thought about it too much). The algorithm will indicate which iteration it is currently doing, but it won’t indicate how many iterations it still has to go. If you’re running the algorithm on a large and/or complex graph, expect long waiting times, especially on slower machines.

The code

You can download the code here. The zip-file contains a file called FindPaths.R. To use it in R, move the file to your current working directory for R and then type source("FindPaths.R") in your R-interface. This will make the FindPaths function available. Open the file itself with a text editor, or with R’s script editor, to find more instructions on how to use the function.

I’m relatively new to R and it is highly likely that parts of the code (or perhaps even all of it) are less efficient than they could be. I’m always open to suggestions for improvement.