In the summer, I started a series of posts on using the LangChain framework. This is the second post in that series. In the first post, I introduced the idea of retrieval augmentation and vector stores, and I explained the Python scripts that I use to create and update vector stores. In this post, I go into the actual use of retrieval augmentation. The example I focus on here is using retrieval augmentation to chat with OpenAI’s Large Language Models (LLMs) about the literature in my Zotero library.
The initial LangChain tools that I built were simple command-line tools. I soon discovered a framework called Chainlit, which allows you to use your browser as an interface for your LangChain apps, and it comes with several other goodies. To be able to follow along with the examples below, you will need to install the Chainlit Python module. Chainlit is in development and future updates might break some of the things that I showcase in this post. The version of Chainlit that I have installed at the time of writing this post is 0.7.1.
Without going into too much detail, a short recap of the idea of retrieval augmentation might come in useful. Our goal is to chat with LLMs about the contents of the literature in our Zotero library. This is useful for multiple things, such as quickly consulting our literature on concepts that we are interested in and finding the papers in which these concepts are described, but that we may have forgotten about. Given that we can have a conversation with an LLM about our literature, and that the LLM to some extent memorizes what has been said earlier in the conversation, we can even chat with LLMs about the relationships between concepts. I use this tool, for example, to quickly create notes on concepts that I can integrate in my writings. I also use it when preparing for teaching, for example to quickly compile lecture notes with additional background on the concepts and theories that I teach about. We can do this with the help of retrieval augmentation. When we ask a question, our question gets embedded, that is, it gets converted into a coordinate in a semantic space that we have populated with fragments of text from the papers we wish to chat about. These fragments are stored in our vector store (see my previous post on this topic). Our tool retrieves fragments of text whose content is semantically similar to our question. It then includes these fragments as context in our question, allowing the LLM we interact with to use this knowledge to answer our question.
The script that I detail below assumes that you have already created a vector store that contains the literature from your Zotero library (again, see my previous post on this topic).
Let us start with the folder structure for our tool. In our main folder, we create two subfolders: an ‘answers’ folder and a ‘vectorstore’ folder.
One thing that we can do with retrieval augmentation is record the actual fragments of text that our tool retrieved, as well as details of the sources that these fragments came from. I find this extremely useful, because it allows us to have our tool cite its sources, so that we can double-check its output. To this end, I have my tool write log files into the answers folder. Whenever I start a new chat session, a new log file is created in which the tool records the questions I ask, the answers it gives, the fragments of text that it used to arrive at each answer, and the sources that these fragments are from.
The ‘vectorstore’ folder simply contains the vector store with our literature.
In the main folder we also have a few files: the ask_cl.py script that contains our tool, the constants file in which we store our OpenAI API key, and the chainlit.md file that serves as the readme shown in the browser interface.
You might have several other files in the folder, such as the scripts that I described in my first post of this series. However, the ones listed above are the only ones we really need to make the examples below work.
Let us go over the main script now: the ask_cl.py script. We’ll first import all the modules that our tool will make use of.
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory
from langchain.document_transformers import LongContextReorder, EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.retrievers import ContextualCompressionRetriever
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
import openai
from dotenv import load_dotenv
import os
from langchain.vectorstores import FAISS
from langchain.callbacks import OpenAICallbackHandler
from datetime import datetime
import chainlit as cl
from chainlit.input_widget import Select, Slider
import textwrap
import constants
from langchain.prompts import (
    ChatPromptTemplate,
    PromptTemplate,
    SystemMessagePromptTemplate,
    AIMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage,
)
I will go through the various things that we import from top to bottom.
The ConversationalRetrievalChain is a type of chain (imported from LangChain) that does most of the heavy lifting for us when it comes to retrieval augmentation. We can include a vector store in this chain, which will be used to retrieve fragments of information that we want to include as context in our questions to the LLM. There is also a RetrievalQA chain that does just that. What the ConversationalRetrievalChain adds is a chat history, so that we can actually have a conversation with an LLM that memorizes earlier parts of the conversation. See the LangChain docs on the ConversationalRetrievalChain here.
To enable an LLM to memorize our conversation, we also need a memory object that we include in our chain. The ConversationBufferWindowMemory is one of several memory objects that LangChain offers. We need one of these to store the chat history of our conversation, so that the LLM that we interact with has access to that history. The ConversationBufferWindowMemory is a kind of sliding-window memory that memorizes a limited part of the conversation (see the docs here). This allows our tool to remember the most recent interactions without the memory growing so large that it exceeds the available context window of the model.
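The sliding-window behaviour can be illustrated in plain Python (a conceptual sketch, not LangChain's actual implementation):

```python
from collections import deque

# Sketch of a sliding-window memory with k=3: as new exchanges arrive,
# the oldest ones fall out of the window.
k = 3
window = deque(maxlen=k)

for exchange in ["Q1/A1", "Q2/A2", "Q3/A3", "Q4/A4"]:
    window.append(exchange)

# Only the three most recent exchanges remain available to the LLM.
print(list(window))  # ['Q2/A2', 'Q3/A3', 'Q4/A4']
```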
We also import various modules with utilities that we use to process the fragments that our ConversationalRetrievalChain retrieves. The background to the LongContextReorder object could be the topic of its own blog post. It goes back to a finding, discussed in this paper, that LLMs that are given a long context (basically the information included with the question) tend to get ‘lost in the middle’: They pay more attention to information at the beginning or the end of the context, while information in the middle is not given as much consideration. The LongContextReorder object reorders the retrieved fragments such that the fragments considered most important tend to be at the beginning or the end of the collection. The EmbeddingsRedundantFilter filters out redundant fragments when multiple fragments are semantically highly similar. Given that we only have a limited context window to work with, this object helps ensure that the window is not filled with redundant information. The DocumentCompressorPipeline allows us to combine these different types of filters in a pipeline, and the ContextualCompressionRetriever lets us integrate that pipeline in our retrieval chain. It is the retriever that we will plug into our ConversationalRetrievalChain.
To be able to access OpenAI’s chat models, we import the ChatOpenAI module from the LangChain framework, along with LangChain’s OpenAI LLM module. Given that we are working with OpenAI models, we also need to import the openai module from OpenAI itself. We then import the constants module, which is simply the other Python file in our main folder; the file in which we store our OpenAI API key. The dotenv module allows us to set environment variables, which we use in our script to set our OpenAI API key as an environment variable. We also need the os module for setting that environment variable.
The literature that we wish to chat about is recorded in a FAISS vector store, so we have to import the corresponding module to be able to use it. We also need to import the OpenAIEmbeddings module, because when we retrieve fragments from our vector store, we need to indicate the embeddings model that we used to create the embeddings.
To be honest, I do not know a lot about callback handlers. They are apparently functions that are executed in response to specific events or conditions in asynchronous or event-driven programs. To the best of my knowledge, we need a callback handler because our Chainlit-powered app will make use of asynchronous programming. In our case, we specifically need the OpenAICallbackHandler, so we import the corresponding module.
The datetime module allows us to get the current date and time, which we will use in the log files that we write to our answers folder for fact checking (see the section on folder structure above).
Since we are building a Chainlit-powered app, we also import the chainlit module. We then import various input widgets from the chainlit module that allow us to set various settings for our app in the browser interface. We will not be doing that much with these settings in this particular example, but in examples that I will discuss in future posts, the ability to change settings on the fly is important.
The remaining imports are all related to prompt templates. As described in the LangChain documentation, prompt templates “are pre-defined recipes for generating prompts for language models”. As you’ll see in the example below, they allow us to predefine a prompt for the LLMs that we interact with. In a prompt template we can also include variables, which allows us to ‘dynamically’ insert things in our prompt, such as the question that we asked and the context for that question, which consists of the fragments of text we retrieve from our vector store. Since we want to develop a chat app, we need to use templates that have been specifically designed for chat purposes. These templates make use of various schemas that we also need to import.
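Conceptually, a prompt template behaves much like Python string formatting: a fixed recipe with named slots that get filled in at question time. A simplified sketch (not LangChain's actual classes):

```python
# Simplified stand-in for a prompt template with two variables,
# {context} and {question}, filled in when a question is asked.
template = (
    "You are a knowledgeable professor working in academia.\n"
    "Context: {context}\n"
    "Question: {question}"
)

prompt = template.format(
    context="Fragment 1... Fragment 2...",
    question="What are chains of action?",
)
print(prompt.splitlines()[-1])  # Question: What are chains of action?
```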
The remainder of the script consists of four chunks: a chunk that sets everything up, a chunk that defines the chat settings, a chunk that (re)creates the chain whenever the settings are changed, and a chunk that handles the messages sent during a chat session.
Let’s start with the setting up chunk:
# Cleanup function for source strings
def string_cleanup(string):
"""A function to clean up strings in the sources from unwanted symbols"""
return string.replace("{","").replace("}","").replace("\\","").replace("/","")
# Set OpenAI API Key
load_dotenv()
os.environ["OPENAI_API_KEY"] = constants.APIKEY
openai.api_key = constants.APIKEY
# Load FAISS database
embeddings = OpenAIEmbeddings()
db = FAISS.load_local("./vectorstore/", embeddings)
# Set up callback handler
handler = OpenAICallbackHandler()
# Set memory
memory = ConversationBufferWindowMemory(memory_key="chat_history", input_key='question', output_key='answer', return_messages=True, k = 3)
# Customize prompt
system_prompt_template = (
'''
You are a knowledgeable professor working in academia.
Using the provided pieces of context, you answer the questions asked by the user.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
"""
Context: {context}
"""
Please try to give detailed answers and write your answers as an academic text, unless explicitly told otherwise.
Use references to literature in your answer and include a bibliography for citations that you use.
If you cannot provide appropriate references, tell me by the end of your answer.
Format your answer as follows:
One or multiple sentences that constitutes part of your answer (APA-style reference)
The rest of your answer
Bibliography:
Bulleted bibliographical entries in APA-style
''')
system_prompt = PromptTemplate(template=system_prompt_template,
                               input_variables=["context"])
system_message_prompt = SystemMessagePromptTemplate(prompt = system_prompt)
human_template = "{question}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])
# Set up retriever
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
reordering = LongContextReorder()
pipeline = DocumentCompressorPipeline(transformers=[redundant_filter, reordering])
retriever = ContextualCompressionRetriever(
    base_compressor=pipeline,
    base_retriever=db.as_retriever(search_type="similarity_score_threshold",
                                   search_kwargs={"k": 20, "score_threshold": .75}))
# Set up source file
now = datetime.now()
timestamp = now.strftime("%Y%m%d_%H%M%S")
filename = f"answers/answers_{timestamp}.org"
with open(filename, 'w') as file:
    file.write("#+OPTIONS: toc:nil author:nil\n")
    file.write(f"#+TITLE: Answers and sources for session started on {timestamp}\n\n")
You’ll see that this chunk starts with the definition of a function that removes various characters from strings that you pass it. I wrote this function to clean up the strings that the app writes to the log in which it reports its sources (the log that is stored in the answers folder). The part of the script where these strings are created is shown further below. Basically, these strings are reconstructed from the metadata that the ConversationalRetrievalChain extracts from the fragments that it retrieves. These bits of metadata originate from the BibTeX entries of my Zotero library (see my first post). Given that these bits of metadata often contain characters that make them look messy, I created the string_cleanup function to tidy them up a bit.
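For instance, a brace-wrapped BibTeX author field comes out clean (reproducing the function from the script, with a made-up input value):

```python
def string_cleanup(string):
    """A function to clean up strings in the sources from unwanted symbols"""
    return string.replace("{", "").replace("}", "").replace("\\", "").replace("/", "")

# BibTeX fields often wrap values in braces; the cleanup strips them.
print(string_cleanup("{Schatzki}, Theodore R."))  # Schatzki, Theodore R.
```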
After defining this function we do some basic setting up. We first set our OpenAI API key as an environment variable and make the openai module aware of our API key as well. We then load the OpenAIEmbeddings model, which is currently the text-embedding-ada-002 model. We need to pass this model as an argument when loading our vector store; by doing so, we clarify what type of embeddings are stored in the vector store, so that we can make use of them. We then load the actual vector store and store it in an object that we simply call db. Next, we set up our callback handler. Finally, we set up our ConversationBufferWindowMemory. We need to tell it the keys by which we identify our chat history, our questions, and the output of the model after answering our questions. We can set the window size of the memory with the k argument, which in this example is set to three, meaning that the memory holds up to the three most recent interactions of the conversation.
After this we write out our prompt template. You can see that we tell the LLM to assume a certain role and that we offer instructions on how to respond. I do not have a lot of experience with prompt engineering yet, so this prompt template can probably be improved. Also notice that I include one variable in this template, called context. This context consists of the retrieved fragments of text associated with our question. After writing out the prompt template, we create the actual template object and relate our context variable to it. We then specify that this is the system message template. We create a separate template for the human messages, which simply consists of our question variable. We then create a chat prompt from the system message template and the human message template. We will later include this chat prompt in our ConversationalRetrievalChain.
Next, we set up our retriever. We first set up the redundancy filter and then the LongContextReorder object. We combine these in a pipeline, which is itself included in a ContextualCompressionRetriever, along with the vector store, which acts as our ‘base retriever’. It is possible to use the vector store as a retriever directly, but then we would not have the benefits that the redundancy filter and the LongContextReorder object give us. We pass several arguments to our base retriever: the search type (similarity with a score threshold), the maximum number of fragments to retrieve (the k argument, set to twenty), and the minimum similarity score that a fragment must have to be retrieved (set to 0.75). Twenty fragments is a pretty large number of fragments to retrieve, given that we work with a limited context window and given that longer contexts have the unfortunate consequence that not all of the context will be equally considered by the LLM. However, I have set this relatively high number because we also do some filtering and reordering as part of our pipeline, which compensates somewhat for the downsides of retrieving many fragments. We will later integrate our ContextualCompressionRetriever in our ConversationalRetrievalChain.
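The combined effect of the k and score-threshold arguments can be sketched in plain Python: keep only fragments whose similarity meets the threshold, rank them, and cap the result at k. This is a conceptual sketch with made-up scores, not what FAISS does internally:

```python
# Hypothetical similarity scores for candidate fragments.
scored_fragments = [
    ("fragment A", 0.91),
    ("fragment B", 0.82),
    ("fragment C", 0.74),  # below the 0.75 threshold, will be dropped
    ("fragment D", 0.88),
]

def retrieve(scored, k=20, score_threshold=0.75):
    """Keep the top-k fragments whose score meets the threshold."""
    kept = [(text, s) for text, s in scored if s >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept[:k]]

print(retrieve(scored_fragments))  # ['fragment A', 'fragment D', 'fragment B']
```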
The last thing we do in this chunk of the script is to set up the log file that we wish to write to the answers folder. I opted to use org files for this, since I work a lot with org files in general. We give the file a filename that includes a timestamp and we write a header and a title to the file itself. We will populate it further during our conversation with the LLM.
The next chunk of our script is a short one:
# Prepare settings
@cl.on_chat_start
async def start():
    settings = await cl.ChatSettings(
        [
            Select(
                id="Model",
                label="OpenAI - Model",
                values=["gpt-3.5-turbo-16k", "gpt-4", "gpt-4-32k"],
                initial_index=0,
            ),
            Slider(
                id="Temperature",
                label="OpenAI - Temperature",
                initial=0,
                min=0,
                max=2,
                step=0.1,
            ),
        ]
    ).send()
    await setup_chain(settings)
In this chunk we define the chat settings that the user can change on the fly, which is a feature of Chainlit apps. In this case we allow the user to select different models that OpenAI offers through its API. I default to the gpt-3.5-turbo-16k model, because it is still relatively cheap and has a longer context window than the gpt-3.5-turbo model. I have found that, with the number of fragments that I retrieve, this longer context window is often necessary. The user can also set the temperature of the model, which controls how deterministic its answers will be: A higher temperature allows for more variability in answers.
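The role of temperature can be illustrated with the softmax function used to turn model scores into token probabilities: dividing the scores by the temperature before normalizing means a low temperature makes the top token dominate, while a high temperature flattens the distribution. This is a conceptual sketch with made-up scores; the API applies this internally (and treats a temperature of 0 as effectively greedy rather than dividing by zero):

```python
import math

def softmax(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
low = softmax(logits, temperature=0.1)   # near-deterministic: one token dominates
high = softmax(logits, temperature=2.0)  # flatter: more variability when sampling
```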
We could include more settings if we wanted. For example, in another, more elaborate tool that I made, I also allow the user to set the number of text fragments retrieved by the base retriever. It should also be possible to control the size of the memory window using chat settings.
The next chunk of the script is called whenever one of the above chat settings is changed, but it is also run at the outset, when a new chat is started:
# When settings are updated
@cl.on_settings_update
async def setup_chain(settings):
    # Set llm
    llm = ChatOpenAI(
        temperature=settings["Temperature"],
        model=settings["Model"],
    )
    # Set up conversational chain
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        chain_type="stuff",
        return_source_documents=True,
        return_generated_question=True,
        combine_docs_chain_kwargs={'prompt': chat_prompt},
        memory=memory,
        condense_question_llm=ChatOpenAI(temperature=0, model='gpt-3.5-turbo'),
    )
    cl.user_session.set("chain", chain)
Here, we first specify the LLM that we will be conversing with. As you can see, the parameters of this model are retrieved from the chat settings, namely the model name and its temperature.
We then finally get to setting up the ConversationalRetrievalChain. We clarify what LLM we wish to converse with. We then tell the chain what retriever we will be using (our ContextualCompressionRetriever). We also define how our documents are integrated into the context variable of our prompts, for which the LangChain framework offers several options. I use the simplest one, ‘stuff’, which basically stuffs all retrieved fragments into the context. The other options usually involve iterating over our fragments in different ways. This is more time consuming and often involves additional calls to LLMs, which makes these options more expensive. So far, I have not seen great benefit from using any of them. We tell the chain to also return the ‘source documents’, which allows us to access the actual fragments of text that our retriever retrieves. We need to do this if we want our tool to report its sources in the log file that we have it create. For similar reasons, we also tell the chain to return the question that it generated. We then specify the prompt that we want the chain to use, which is the chat prompt we created earlier. We also specify the memory object that the chain can use to memorize our conversation, such that we can have an actual conversation with the LLM. Finally, in this case we also specify the model that the chain can use to condense questions (which is something it apparently always does). By default, it will use the model that we set with the llm parameter, but I force it to use the gpt-3.5-turbo model, because it is unnecessary to use a more expensive model for this.
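The ‘stuff’ strategy itself is simple: all retrieved fragments are concatenated into a single context string that fills the context slot of the prompt. A simplified sketch of the idea, with made-up fragments:

```python
# Fragments as they might come back from the retriever (hypothetical).
fragments = [
    "Chains of action are series of performances...",
    "Chains transpire in the plenum of practices...",
]

# 'Stuffing' simply joins every fragment into one context block.
context = "\n\n".join(fragments)
prompt = f"Context: {context}\n\nQuestion: What are chains of action?"
```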
So now we have our ConversationalRetrievalChain all set up!
The last chunk of our script basically handles what happens when messages are being sent to an LLM:
@cl.on_message
async def main(message: str):
    chain = cl.user_session.get("chain")
    cb = cl.LangchainCallbackHandler()
    cb.answer_reached = True
    res = await cl.make_async(chain)(message.content, callbacks=[cb])
    question = res["question"]
    answer = res["answer"]
    answer += "\n\n Sources:\n\n"
    sources = res["source_documents"]
    print_sources = []
    with open(filename, 'a') as file:
        file.write("* Query:\n")
        file.write(question)
        file.write("\n")
        file.write("* Answer:\n")
        file.write(res['answer'])
        file.write("\n")
        counter = 1
        for source in sources:
            reference = "INVALID REF"
            if source.metadata.get('ENTRYTYPE') == 'article':
                reference = (
                    string_cleanup(source.metadata.get('author', "")) + " (" +
                    string_cleanup(source.metadata.get('year', "")) + "). " +
                    string_cleanup(source.metadata.get('title', "")) + ". " +
                    string_cleanup(source.metadata.get('journal', "")) + ", " +
                    string_cleanup(source.metadata.get('volume', "")) + " (" +
                    string_cleanup(source.metadata.get('number', "")) + "): " +
                    string_cleanup(source.metadata.get('pages', "")) + ".")
            elif source.metadata.get('ENTRYTYPE') == 'book':
                author = ""
                if 'author' in source.metadata:
                    author = string_cleanup(source.metadata.get('author', "NA"))
                elif 'editor' in source.metadata:
                    author = string_cleanup(source.metadata.get('editor', "NA"))
                reference = (
                    author + " (" +
                    string_cleanup(source.metadata.get('year', "")) + "). " +
                    string_cleanup(source.metadata.get('title', "")) + ". " +
                    string_cleanup(source.metadata.get('address', "")) + ": " +
                    string_cleanup(source.metadata.get('publisher', "")) + ".")
            elif source.metadata.get('ENTRYTYPE') == 'incollection':
                reference = (
                    string_cleanup(source.metadata.get('author', "")) + " (" +
                    string_cleanup(source.metadata.get('year', "")) + "). " +
                    string_cleanup(source.metadata.get('title', "")) + ". " +
                    "In: " +
                    string_cleanup(source.metadata.get('editor', "")) +
                    " (Eds.), " +
                    string_cleanup(source.metadata.get('booktitle', "")) + ", " +
                    string_cleanup(source.metadata.get('pages', "")) + ".")
            else:
                author = ""
                if 'author' in source.metadata:
                    author = string_cleanup(source.metadata.get('author', "NA"))
                elif 'editor' in source.metadata:
                    author = string_cleanup(source.metadata.get('editor', "NA"))
                reference = (
                    author + " (" +
                    string_cleanup(source.metadata.get('year', "")) + "). " +
                    string_cleanup(source.metadata.get('title', "")) + ".")
            if source.metadata['source'] not in print_sources:
                print_sources.append(source.metadata['source'])
                answer += '- '
                answer += reference
                answer += '\n'
            file.write(f"** Document_{counter}:\n- ")
            file.write(reference)
            file.write("\n- ")
            file.write(os.path.basename(source.metadata['source']))
            file.write("\n")
            file.write("*** Content:\n")
            file.write(source.page_content)
            file.write("\n\n")
            counter += 1
    await cl.Message(content=answer).send()
This chunk is pretty lengthy, but much of it is a somewhat convoluted way of having the tool report its sources, both in its responses to the user and in the log file that we write to the answers folder.
The chunk more or less starts with a specification of the chain that we are using (the one we just created). We then define our callback handler. The res object is what we store the response of the LLM in. It consists of several parts, including the question that we asked (remember that we told our ConversationalRetrievalChain to return the question), the answer to our question, and the source documents.
As you can see, we extend the original answer of the model with a list of our sources. Most of what you see in the remainder of the chunk is a set of different approaches to formatting these sources, depending on the type of source. We retrieve various metadata from our sources to format the actual references. As mentioned before, we also include these references in our log file, along with our question and the answer that the LLM gave us.
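As an example of this formatting logic, the article branch roughly produces a reference like the following from a fragment's BibTeX-derived metadata (a simplified sketch of the script's logic, using the metadata of one of the sources shown in the log file later in this post):

```python
metadata = {
    "ENTRYTYPE": "article",
    "author": "Schatzki, Theodore R.",
    "year": "2016",
    "title": "Keeping Track of Large Phenomena",
    "journal": "Geographische Zeitschrift",
    "volume": "104",
    "number": "",
    "pages": "4--24",
}

# Concatenate the metadata fields into an APA-ish reference string.
reference = (
    f"{metadata['author']} ({metadata['year']}). {metadata['title']}. "
    f"{metadata['journal']}, {metadata['volume']} ({metadata['number']}): "
    f"{metadata['pages']}."
)
print(reference)
# Schatzki, Theodore R. (2016). Keeping Track of Large Phenomena. Geographische Zeitschrift, 104 (): 4--24.
```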
Now that we have our complete script, let us actually use it.
We can start our chainlit app by going to its main folder and typing the following command:
chainlit run ask_cl.py
This will open a browser window in which we are greeted with a default readme file (the chainlit.md file). As mentioned previously, we can change this file as we wish. My version of the tool looks as shown in the picture below.
We can now start asking our app questions. In the example shown below, I asked the app what “chains of action” are, a concept used by Theodore Schatzki in his version of social practice theories. The answer that we get is pretty good. Also notice how the app reports the sources it consulted, which are papers that I have in my Zotero library.
Let us also take a look at part of the answer file that has been created and populated in the meantime:
#+OPTIONS: toc:nil author:nil
#+TITLE: Answers and sources for session started on 20231023_222003
* Query:
What are chains of action?
* Answer:
Chains of action refer to a series of performances or actions in which each member of the chain responds to the preceding activity, what was said, or a change in the world brought about by the previous activity. These chains link individuals and their actions, forming a temporal thread that connects their pasts and presents. Chains of action can occur within and between bundles and constellations, and they play a crucial role in shaping social life and bringing about changes in practices, arrangements, and bundles. (Schatzki, 2016)
Bibliography:
Schatzki, T. R. (2016). Chains of action and the plenum of practices. In The Timespace of Human Activity: On Performance, Society, and History as Indeterminate Teleological Events (pp. 67-68). Franz Steiner Verlag.
** Document_1:
- Schatzki, Theodore R. (2016). Keeping Track of Large Phenomena. Geographische Zeitschrift, 104 (): 4--24.
- Schatzki_2016_Geographische Zeitschrift.txt
*** Content:
(7) Interactions, as noted, form one sort of action chain, namely, those encompassing reciprocal reactions by two or more people. Subtypes include exchange,
teamwork, conversation, communication, and the transmission of information. (These concepts can also be used to name nexuses of chains: compare two
people exchanging presents in person to two tribes exchanging gifts over several months.) Other types of chain are imitation (in a more intuitive sense than
Tarde’s ubiquitous appropriation) and governance (intentional shaping and influencing). Beyond these and other named types of chains, social life houses an
immense number of highly contingent and haphazard chains of unnamed types
that often go uncognized and unrecognized yet build a formative unfolding rhizomatic structure in social life.
4. Chains of Action and the Plenum of Practices
Individualists can welcome the idea of action chains. Indeed, unintentional consequences of action, the existence of which is central to the individualist outlook, can
be construed as results of action chains. Contrary to individualists, however, practice
theorists do not believe that action chains occur in a vacuum, less metaphorically, that
they occur only in a texture formed by more individuals and their actions. Rather,
chains transpire in the plenum of practices. This implies that they propagate within
and between bundles and constellations.
** Document_2:
- Schatzki, Theodore R. (2010). The Timespace of Human Activity: On Performance, Society, and History as Indeterminate Teleological Events. Lanham, Md: Lexington Books.
- Schatzki_2010_The timespace of human activity.txt
*** Content:
The second type of sinew embraces chains of action. Lives hang together
when chains of action pass through and thereby link them. A chain of ac-
tion is a series of performances, each member of which responds either to
the preceding activity in the series, to what the previous activity said, or to a
change in the world that the preceding activity brought about. For example,
when a person taking a horse farm tour drops a map on the ground, and the
tour leader picks it up and puts it in a trash receptacle, their lives are linked
Activity Timespace and Social Life 67
by a chain of action (which also connects them to the people who installed
the receptacle). Conversations, to take another example, are chains of action
in which people respond to what others say or to the saying of it. Chains of
action are configurations of interwoven temporality. For responding to an
action, to something said, or to a change in the world is the past dimension of
activity. Each link in a chain of action thus involves some person’s past, and
a chain comprising multiple links strings together the pasts and presents of
different people.
You can see that it lists my question, the answer that I was given, and the various fragments of text on which the answer was based. This not only allows us to double-check the answers we get, but also to quickly identify parts of different papers that we might want to look into further.
I hope this post is useful to people who would like to build something similar themselves. The app described in this post builds on one of the first LangChain tools that I developed (I did do a lot of fine-tuning of it over time). It has been incredibly useful for me, but it has more or less become redundant since I started developing an agent that I can use not only to chat about my literature, but also for various other things. This includes retrieving information from empirical sources (e.g., news archives) and then relating conceptual knowledge to that empirical information. I will go into the use of agents in future posts.
I use multiple internet browsers. My default browser is qutebrowser, but for some things that qutebrowser doesn’t handle well I switch to Brave. I am also experimenting with Nyxt, which I started exploring as a possible alternative to qutebrowser. However, it is not yet stable enough for me to make the switch. Also, some things, like Netflix, do not work on Nyxt yet, and I have no idea if there is a way to fix that.
One minor inconvenience when switching between browsers is that my bookmarks are not synchronized across them. I have converted my qutebrowser quickmarks to Nyxt bookmarks multiple times, but that is not something that I would like to keep doing, because it is time-consuming.
I had a vague memory of Luke Smith talking about a simple solution for universal bookmarking that he uses. By ‘universal bookmarking’ I mean one bookmarking system that can be used across different browsers. Luke Smith discusses this in a YouTube video. He makes use of lightweight tools, such as xdotool, xclip and dmenu. I also found this repo with a script that expands on Luke’s idea a bit. I played around with this expanded solution and then tweaked it further to have something that I am happy with.
My version, at its core, consists of two scripts. The first script is basically the command for inserting bookmarks that Luke shows in his videos (with minor edits). Luke binds this command directly to a keybind in his DWM configuration. I found it more convenient to keep the command as a separate script and just call the script with a similar keybind. This allows me to modify the script without having to rebuild DWM.
#!/bin/sh
# Pick a bookmark with dmenu and type its URL into the focused text field.
xdotool type "$(grep -v '^#' ~/.local/share/bookmarks | dmenu -l 20 -F | cut -d' ' -f1)"
The script finds all the lines in my bookmarks file (a plain text file) that do not start with a # (a comment), pipes these into dmenu so that I can select one of the bookmarks recorded in the file, and the selection then gets typed into whatever text field I have focused at the time (using xdotool type). I include titles and tags with my bookmarks, so dmenu’s output needs to be reduced to the URL itself, which is what the cut -d' ' -f1 is for.
Below is an example of a bookmark as it is recorded in my bookmarks file.
https://mynoise.net/noiseMachines.php - MyNoise - Audio
I also have a script for creating new bookmarks. Unlike Luke, I opted not to bookmark whatever text is currently selected, but went for something where you can type or paste in a URL and then edit it further (following the example of the earlier mentioned repo). I also decided to drop the part of the script that pastes in the currently selected text altogether. I think that part makes sense when you typically use browsers with normal URL bars, but qutebrowser and Nyxt don’t have one. With those browsers, it makes more sense to copy the currently visited URL (yy in qutebrowser) and then paste it into the dmenu prompt (the default keybind for pasting in dmenu is C-Y).
My version of the script thus opens an empty dmenu prompt where you can paste in the url you want to bookmark and type anything else that you want to include alongside it. For example, I will typically type in a title and some tags as shown in the example above.
The script then checks if an entry with the url already exists in the bookmark file, ignoring any titles or tags that might be associated with it. If the url already exists, no new bookmark will be created.
#!/bin/sh
file="$XDG_DATA_HOME/bookmarks"
# Open an empty dmenu prompt; paste or type the URL plus optional title and tags.
bookmark=$(:| dmenu -p "edit bookmark:")
if [ -z "$bookmark" ]
then
    notify-send -e "bookmark creation cancelled."
    exit 1
else
    # The URL is everything up to the first space.
    bookmarkUrl="${bookmark%% *}"
    if grep -q "^$bookmarkUrl" "$file"
    then
        notify-send -e "already bookmarked."
    else
        notify-send -e "bookmark successfully added as $bookmark."
        echo "$bookmark" >> "$file"
    fi
fi
I like this simple solution. It uses lightweight tools, it keeps the bookmarks in a simple text file that we can easily edit, and it relies on scripts that we can easily adapt or update if we want to. Also, the bookmarks can be typed in anywhere, not just URL bars. For example, I could also use it to include links that I’ve bookmarked in blog posts. My bookmarking troubles are over.
This is going to be the first post in a series of posts on using the LangChain framework. With LangChain you can develop apps that are powered by Large Language Models (LLMs). I primarily use LangChain to build applications for chatting about literature in my Zotero library and other text data on my computer. My intention was to write a blog post that explains how to build these applications and how they work. However, there is too much ground to cover for a single post, so I decided to break it down into multiple posts.
This first post covers:
When ChatGPT was first released, I barely took notice. I heard some people say impressive things about ChatGPT, but I didn’t immediately feel the urge to try it out. Eventually, a few months ago, I decided to give it a try. I was mind-blown. For an entire week, I had ChatGPT spit out crazy, nonsensical stories. There was one about a talking horse that specialized in public-private partnerships and saved a village by helping to create new infrastructure. I remember one about a hero who rode his horse backwards because he was afraid of being followed by purple frogs. There was also another one about someone who put themselves in orbit around the Earth by pulling themselves up by their own hair. The funniest part was that ChatGPT added a disclaimer by the end, stating that it was purely fictional and that you cannot actually put yourself in orbit in this way.
I quickly started trying out things that might be useful for my work in academia. For example, I had ChatGPT come up with an assignment about analyzing policy interventions from a behavioral perspective. Although I didn’t actually use it, I could have with just a few tweaks. I also entered into discussions with ChatGPT about theories and philosophy. I found ChatGPT to be a useful conversational partner on many topics, as long as you already know your stuff and can spot the things that ChatGPT gets wrong (which happens frequently; at some point, I got fed up with ChatGPT constantly apologizing for getting things wrong). I even tried a hybrid of storytelling and conversation on theories, having ChatGPT tell a fictional story about an academic and then querying ChatGPT about the contents of the papers written by this fictional academic.
I don’t remember exactly when I started using ChatGPT for code writing, but its co-pilot capabilities are another aspect that blew me away and changed the way I write code. I recently read a post on Hacker News about how traffic on Stack Overflow has declined recently. I have a strong suspicion that ChatGPT has contributed to this.
While I continue to be mind-blown to this day (in a positive way), I would also like to note that, like many others, I have occasionally felt uncertain and worried about where this will all lead. I am certainly no expert on AI, so please take anything I say on this with a grain of salt. That being said, I am not that concerned about the ‘AI going rogue’ scenario, because I think that tools like ChatGPT give the strong appearance of being intelligent, but in reality are as dumb as a bag of rocks. What I am more afraid of is what humans might do with powerful tools like LLMs (or whatever comes next). Also, I feel somewhat uncomfortable with the fact that progress in this area is driven almost entirely by business interests. I think it is important that we think of alternative models for the further development of AI, such as the ‘Island’ idea put forward in this article of the FT Magazine. It is also encouraging to see initiatives such as the development of the BLOOMZ model and Petals (my GPU is now hosting one block of a BLOOMZ model through Petals; admittedly an almost meaningless contribution), which are both initiatives of BigScience. Yet, in my limited experience OpenAI’s GPT models blow models such as BLOOMZ out of the water when it comes to the quality of their output. The debate on how this AI revolution should unfold and be governed is an important one, but it is not a debate I want to engage with in these posts. I would like to focus on different ways in which we can make LLMs useful for academic work.
I'll reiterate that I am no expert on AI, and given that many people who are experts on the topic are worried, I am probably overlooking or misunderstanding something. Feel free to write me a message to educate me on this.
As described above, my first introduction to LLMs was through ChatGPT, which I believe is the case for many others. While I had a lot of fun with ChatGPT alone, things became even more interesting after I discovered LangChain. I was introduced to LangChain by a YouTube video. In the video, Techlead demonstrates how LangChain allows you to chat with LLMs about data stored on your own computer. Techlead also provides a simple example on his GitHub repository, which can help you get started even if you don’t fully understand how LangChain works. As mentioned in the introduction, you can use LangChain to develop LLM-powered apps. These LLMs can run on your own computer or be accessed via APIs. The apps I have created using LangChain so far make use of the OpenAI API, which provides access to chat models like gpt-3.5-turbo and gpt-4, as well as the text-embedding-ada-002 embedding model (using the OpenAI API is not free).
As the name suggests, LangChain utilizes chains, which the documentation defines as sequences of calls to components. These components are abstractions for working with language models, along with various implementations for each abstraction. In simple terms, LangChain offers a set of tools that allow you to interact with LLMs in different ways, and it offers an easy way to chain these tools together. This enables you to accomplish complex tasks with minimal code. LangChain comes with a wide variety of pre-built chains, which means that you can build useful tools quickly.
LangChain also allows the creation of agents, which are LLMs that can choose actions based on user prompts. Agents simulate reasoning processes to determine which actions to take and in what order. These actions often involve using tools that we provide to the agent, which are different types of chains powered by LLMs. In simple terms, an agent is an LLM that can use other instances of LLMs for different tasks, depending on the specific needs. There are undoubtedly many different kinds of useful applications that you can build with this framework, but I was drawn primarily to the idea of ‘chatting with your own data’. This involves something that is called retrieval augmentation.
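Stripped of the framework's abstractions, the chaining idea boils down to composing calls: the output of one component becomes the input of the next. A toy sketch in plain Python (this is deliberately not LangChain's actual API; fake_llm is a stand-in for a real model call):

```python
# Toy illustration of the 'chain' idea: a prompt template feeding a model call.
def prompt_template(question: str) -> str:
    # First component: wrap the user's question in a prompt.
    return f"Answer concisely: {question}"

def fake_llm(prompt: str) -> str:
    # Second component: a real chain would call an LLM here; we just echo.
    return f"[answer to: {prompt}]"

def chain(question: str) -> str:
    # A chain is a sequence of calls: template -> model (-> parser, etc.)
    return fake_llm(prompt_template(question))

print(chain("What is retrieval augmentation?"))
```

LangChain's pre-built chains add a lot on top of this (memory, retries, parsing), but the sequence-of-calls structure is the same.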
With retrieval augmentation, you extract information from documents and include that information as context in the messages that you send to an LLM. This allows the LLM to not only make use of the knowledge that it obtained during training (parametric knowledge), but also of ‘external’ knowledge that you extract from the documents (source knowledge). Supposedly, this helps to combat so-called hallucinations (or check this link if you don’t have a Medium account). That in itself is useful, but I was primarily enthusiastic about the idea of chatting with an LLM about the literature that I have collected in my Zotero library.
Retrieval augmentation is thus something different from training or fine-tuning an LLM with your own data. It doesn't actually alter the model weights, as would be the case with fine-tuning and training. Instead, the model just temporarily 'learns' the external knowledge included in your messages.
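In code, the augmentation step amounts to little more than pasting the retrieved fragments into the prompt. A minimal sketch (the prompt wording and the example fragments here are made up for illustration):

```python
def build_augmented_prompt(question, retrieved_chunks):
    """Combine retrieved text fragments with the user's question.

    The LLM sees the fragments as source knowledge alongside its
    parametric knowledge; nothing about the model itself changes.
    """
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

chunks = [
    "Colligation refers to the grouping of events under a description...",
    "Event duration can be measured in several ways...",
]
print(build_augmented_prompt("How does Abbott define colligation?", chunks))
```

The hard part, covered in the rest of this post, is getting the right fragments into retrieved_chunks in the first place.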
While the idea of extracting information from documents to include them as context in your messages to LLMs is simple enough, there are some challenges we need to overcome:
First, it is not practical if we have to manually find and extract the relevant information from our documents. We might not even know exactly which information from which documents is relevant to our query in the first place. Obviously, this part of the process is something we want to automate, which fortunately is easy using retrieval augmentation.
Second, there are limits to how much context we can include in our messages to LLMs. Every LLM has something called a context window, which refers to the number of tokens we can use in a single interaction with an LLM, including both the input (our query) and the output (the LLM’s answer) of that interaction. Different models have differently sized context windows. For example, the gpt-3.5-turbo model has a context window of 4,096 tokens. The slightly more expensive gpt-3.5-turbo-16k model, which I now use as my default, has a context window of 16,384 tokens. The gpt-4-32k model has a context window of 32,768 tokens, but it is much more expensive than the gpt-3.5 models. Anthropic’s Claude 2, currently only available in the US and the UK, has an impressive context window of 100k tokens! Regardless, the length of the text that you include in your messages as context is limited by the model’s context window. If we want to ask questions about our literature, we cannot simply dump our entire library of papers into our messages.
Third, we might not want to dump our entire library in our messages, or even an entire book or paper, for another reason: Not all information in a given paper will be relevant to the question we are asking the LLM. It would be preferable to include only the relevant bits of information in our messages and exclude anything that might distract from our question. Fortunately, this can be easily achieved using tools provided by the LangChain framework. I will now discuss some of these tools.
Vector stores are perhaps the most important tools in the process of retrieval augmentation. A vector store is a kind of database in which you can store documents in two forms:
1. As plain text: the fragments of text themselves.
2. As embeddings: vectors that represent the position of each fragment in a semantic space. The text-embedding-ada-002 model turns documents into vectors with 1,536 dimensions.
LangChain supports a variety of vector stores. The one I chose to use is the FAISS vector store, mainly because it runs locally and its index can simply be saved to and loaded from local files (as you’ll see in the scripts below). Another type of vector store that offers similar functionality is ChromaDB, which also seems to be popular. I advise you to explore the different available types of vector stores in LangChain before picking one to use yourself.
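Under the hood, retrieval from any of these vector stores is nearest-neighbor search over embedding vectors, typically using cosine similarity. A pure-Python sketch with tiny made-up three-dimensional 'embeddings' (real text-embedding-ada-002 vectors have 1,536 dimensions, and FAISS uses optimized index structures rather than a linear scan):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Tiny fake embeddings standing in for stored chunks.
store = {
    "chunk about event sequences": [0.9, 0.1, 0.0],
    "chunk about Bayesian stats":  [0.1, 0.9, 0.2],
}
query_embedding = [0.8, 0.2, 0.1]  # the embedded question

best = max(store, key=lambda k: cosine_similarity(store[k], query_embedding))
print(best)  # → chunk about event sequences
```

This is also why retrieval works across languages: two fragments with similar meanings end up with nearby vectors regardless of how the text is spelled.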
To store our documents in a vector store we need to take multiple steps (I’ll walk through these in more detail in the remainder of this blog post):
1. Convert our PDFs to plain text. I use a bash script that relies on the pdftotext and pdfimages command-line tools. You might want to simply make use of the built-in tools that LangChain provides instead. I opted for the bash script because it makes it easier to check the results of the conversion process.
2. Load the text files into LangChain documents. We’ll use a DirectoryLoader, as we’ll be loading multiple documents from a directory.
3. Split the documents into chunks with the RecursiveCharacterTextSplitter.
4. Create embeddings for the chunks and store them in the vector store. I use OpenAI’s text-embedding-ada-002. Again, there are other options available, but I haven’t had the chance to experiment with these yet and I’m quite happy with the results I have achieved with the OpenAI solution.
Let’s now go through these steps in more detail.
I’ll briefly explain the logic of the bash script (which you can find below) that I use to convert the literature in my Zotero library (all PDFs) to text.
The bash script finds all the PDFs included in my Zotero storage folder, which has sub-folders for each publication. For each file it checks if the filename is already mentioned in a text file that I use to keep track of every document that I have already ingested. I frequently update my Zotero library, and if I want to update my vector store by adding new publications, I don’t want to re-convert all files that have already been ingested. Keeping track of the files that have already been ingested allows me to skip these in the conversion process.
#!/bin/bash
# One file to keep the papers that I have already ingested
# One dir to store newly added papers
# A temporary dir for image-based pdfs.
existing_file="/home/wouter/Documents/LangChain_Projects/Literature/data/ingested.txt"
output_dir="/home/wouter/Documents/LangChain_Projects/Literature/data/new"
temp_dir="/home/wouter/Documents/LangChain_Projects/Literature/data/temp"

counter=0
total=$(find /home/wouter/Tools/Zotero/storage/ -type f -name "*.pdf" | wc -l)

find /home/wouter/Tools/Zotero/storage -type f -name "*.pdf" | while read -r file
do
    base_name=$(basename "$file" .pdf)
    if grep -Fxq "$base_name.txt" "$existing_file"; then
        echo "Text file for $file already exists, skipping."
    else
        pdftotext -enc UTF-8 "$file" "$output_dir/$base_name.txt"
        pdfimages "$file" "$temp_dir/$base_name"
    fi
    counter=$((counter + 1))
    echo -ne "Processed $counter out of $total PDFs.\r"
done
I have the bash script convert all PDFs to text with pdftotext, but I also convert the same files with pdfimages, since some of the PDFs contain images rather than text (these are the PDFs in which you cannot select the text in a regular PDF reader). The images are stored in a temporary folder. After converting the files, I inspect the resulting text files and try to identify the ones that pdftotext was not able to convert successfully (usually these are just a few bytes in size). For all files that were converted successfully, I delete the image files in the temporary folder. The remaining image files are converted with another bash script (shown below), which makes use of tesseract.
#!/bin/bash
output_dir="/home/wouter/Documents/LangChain_Projects/Literature/data/new/"
pbm_directory="/home/wouter/Documents/LangChain_Projects/Literature/data/temp"

# Create an associative array
declare -A base_names

# Handle filenames with spaces by changing the Internal Field Separator (IFS)
oldIFS="$IFS"
IFS=$'\n'

# Go through each file in the PBM directory
for file in "$pbm_directory"/*.pbm "$pbm_directory"/*.ppm
do
    # Get the base name from the path
    base_name=$(basename "$file" | rev | cut -d- -f2- | rev)
    # Add the base name to the associative array
    base_names["$base_name"]=1
done

# Restore the original IFS
IFS="$oldIFS"

# Go through each unique base name
for base_name in "${!base_names[@]}"
do
    # Remove any existing text file for this base name
    rm -f "$output_dir/$base_name.txt"
    # Go through each PBM file for this base name, handling spaces in filenames
    for ext in pbm ppm
    do
        find "$pbm_directory" -name "$base_name-*.$ext" -print0 | while read -r -d $'\0' file
        do
            # OCR the file and append the results to the text file
            echo "Converting $file"
            tesseract "$file" stdout >> "$output_dir/$base_name.txt"
        done
    done
done
After doing this, I have all documents stored as plain text files in one folder.
The remaining steps that I discuss in this post are all implemented in Python. I will share a Python script that I call indexer.py
, which I use to create a new vector store for the literature in my Zotero library. I’ll be going through the script in steps. Let’s start with some basic ‘housekeeping’ stuff, such as imports, loading our OpenAI API Key (without it, we cannot use the OpenAI models), and setting some paths that we’ll be using throughout the script.
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DirectoryLoader, TextLoader
import langchain
import bibtexparser
import os
from dotenv import load_dotenv
import openai
import constants
# Set OpenAI API Key
load_dotenv()
os.environ["OPENAI_API_KEY"] = constants.APIKEY
openai.api_key = constants.APIKEY
# Set paths
source_path = './data/new/'
store_path = './vectorstore/'
destination_file = './data/ingested.txt'
bibtex_file_path = '/home/wouter/Tools/Zotero/bibtex/library.bib'
You’ll notice that I import my API key from a file called constants.py, which just defines one variable, APIKEY: a string containing my API key. If you don’t have an OpenAI API key yet, you can create one on the OpenAI platform. It is important that you don’t share your API key with anyone.
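For reference, constants.py contains nothing more than a single assignment (with a placeholder key here, obviously):

```python
# constants.py -- keep this file out of version control!
APIKEY = "sk-...your-key-here..."
```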
In the snippet of Python code above we set a couple of paths:
- The source_path, which contains all the text files we created in the previous step.
- The store_path, where we will keep our vector store.
- The destination_file, to which we’ll write the names of all the files we’ve successfully ingested later on.
- The bibtex_file_path, where we store our Zotero-generated bibtex file. We will access this file to retrieve the bibliographical metadata that we want to include with our documents.
The next step is to actually load our documents, which we can easily accomplish with LangChain’s DirectoryLoader. Before chunking our documents we will also want to add the metadata to them, so that the metadata is associated with the relevant chunks.
We simply set up our DirectoryLoader, passing our source_path as its first argument and then setting a few options that help ensure a smooth process (the show_progress=True argument is not strictly necessary). To add our metadata, we go through our bibtex file, using the bibtexparser library, and match the names of our documents to the filenames recorded in the bibtex file (Zotero conveniently records these names along with the bibliographical details). After extracting the metadata, we go through our list of imported documents and add the metadata.
# Load documents
print("===Loading documents===")
loader = DirectoryLoader(source_path,
                         show_progress=True,
                         use_multithreading=True,
                         loader_cls=TextLoader,
                         loader_kwargs={'autodetect_encoding': True})
documents = loader.load()

# Add metadata based on bibliographic information
print("===Adding metadata===")

# Read the BibTeX file
with open(bibtex_file_path) as bibtex_file:
    bib_database = bibtexparser.load(bibtex_file)

# Get a list of all text file names in the directory
text_file_names = os.listdir(source_path)

metadata_store = []

# Go through each entry in the BibTeX file
for entry in bib_database.entries:
    # Check if the 'file' key exists in the entry
    if 'file' in entry:
        # Extract the file name from the 'file' field and remove the extension
        pdf_file_name = os.path.basename(entry['file']).replace('.pdf', '')
        # Check if there is a text file with the same name
        if f'{pdf_file_name}.txt' in text_file_names:
            # If a match is found, append the metadata to the list
            metadata_store.append(entry)

# Attach the matched metadata to the corresponding documents
for document in documents:
    for entry in metadata_store:
        doc_name = os.path.basename(document.metadata['source']).replace('.txt', '')
        ent_name = os.path.basename(entry['file']).replace('.pdf', '')
        if doc_name == ent_name:
            document.metadata.update(entry)
Just for reference, I include below an example of what an entry in my bibtex file looks like.
@article{Abbott1984,
title = {Event Sequence and Event Duration: Colligation and Measurement [in Medicine].},
shorttitle = {Event Sequence and Event Duration},
author = {Abbott, Andrew},
year = {1984},
journal = {Historical methods},
volume = {17},
number = {4},
eprint = {11620185},
eprinttype = {pubmed},
pages = {192--204},
issn = {0161-5440},
doi = {10.1080/01615440.1984.10594134},
isbn = {0161-5440},
pmid = {11620185},
keywords = {Historiography,History of Medicine,History- Ancient,History- Early Modern 1451-1600,History- Medieval,History- Modern 1601-,Medicine,United States},
file = {/home/wouter/Tools/Zotero/storage/7BM53SZ6/Abbott1984.pdf}
}
Now that we have our documents, including metadata, we can go on and split them into chunks. As mentioned previously, we can use the RecursiveCharacterTextSplitter for this, which is very good at splitting texts into chunks of the size that we desire, while keeping semantically meaningful structures (e.g., paragraphs) intact as much as possible.
We need to decide what the size of our chunks will be. I believe a popular choice is to go with chunks of 1,000 characters (note that, since we pass Python’s len as the length function, the splitter measures chunk size in characters here, not tokens). I opted for 1,500 because it slightly increases the chances that parts of the text that belong together also end up in the same chunk. We can also set an overlap for our chunks, which I set to 150.
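The effect of chunk_size and chunk_overlap can be illustrated with a naive character-based splitter. This is a simplification: the RecursiveCharacterTextSplitter additionally tries to break on paragraph and sentence boundaries, but the stepping logic is the same:

```python
def naive_split(text, chunk_size=1500, chunk_overlap=150):
    # Step forward by chunk_size - chunk_overlap, so consecutive
    # chunks share chunk_overlap characters at their boundary.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_split("a" * 3000, chunk_size=1500, chunk_overlap=150)
print(len(chunks))     # → 3
print(len(chunks[0]))  # → 1500
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from at least one of the two chunks.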
# Splitting text
print("===Splitting documents into chunks===")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150,
    length_function=len,
    add_start_index=True,
)
split_documents = text_splitter.split_documents(documents)
The final steps are to create embeddings for our chunks of texts and then store them, alongside the chunks themselves, in our vector store. These embeddings are what we actually use later when we want to retrieve information from our vector store (discussed in more detail in a future post). Basically, when we ask our LLM a question, the question will be embedded as well and its vectorized form will then be used to find entries in our vector store that are similar in meaning. This approach to finding relevant information is much more accurate than finding relevant information purely based on matches between the texts themselves.
One cool benefit of storing documents in their vectorized form is that the language in which the documents were written becomes less relevant. Two documents that are written in different languages, but have similar meanings, will end up in similar positions in the semantic space when they are embedded.
As mentioned previously, we use the text-embedding-ada-002 model to create our embeddings. This is the default model of LangChain’s OpenAIEmbeddings() class.
Creating the embeddings is the most time-consuming part of this process. I started out with a library of about 1,750 documents (before chunking), for which I believe it took about an hour to create the embeddings (this is a guess, because I didn’t consciously keep track of time). It is also a relatively expensive part of the process, since we’ll be sending a lot of tokens through the OpenAI API. This is one of the reasons why it is useful to have a setup where you don’t have to recreate these embeddings repeatedly (see the comments on updating our vector store at the end of this post).
You will probably also frequently see warnings about hitting OpenAI’s rate limits. Fortunately, LangChain has built-in functions that delay further requests until we’re ready to resume the process, so we don’t need to worry about this.
After the embeddings have been created, you can create your vector store as shown in the snippet. We immediately save our vector store in the folder that we defined for it earlier.
The last thing that we do is to write the filenames of the ingested documents to the file that we use to keep track of all ingested documents, allowing us to skip these when updating the vector store.
After the script finishes its run, I manually delete the text files from the folder from which we sourced them.
# Embedding documents
print("===Embedding text and creating database===")
embeddings = OpenAIEmbeddings(
    show_progress_bar=True,
    request_timeout=60,
)
db = FAISS.from_documents(split_documents, embeddings)
db.save_local(store_path, "index")

# Record what we have ingested
print("===Recording ingested files===")
with open(destination_file, 'w') as f:
    for document in documents:
        f.write(os.path.basename(document.metadata['source']))
        f.write('\n')
As mentioned above, creating embeddings for documents is relatively expensive, both in terms of time and in terms of actual money spent on using the OpenAI API. Therefore, we do not want to create embeddings for any given document more than once. I already explained how the bash script that I use to convert PDFs skips documents that we’ve already ingested. If I add new papers to my Zotero library, and I run the conversion script, only the PDFs of the newly added papers will be converted and eventually end up in the folder from which we source the documents to be ingested in the vector store.
To add these new papers to my existing vector store, I use a script that I named updater.py (see below). This script is identical to the indexer.py script, except for the last part, where I:
1. Create a new vector store from the newly added documents, just as in the indexer.py script,
2. Load the existing vector store from disk,
3. Merge the new vector store into the existing one, and
4. Save the merged vector store back to disk.
This process requires me to create embeddings only for the new papers that I added to my library.
print("===Embedding text and creating database===")
new_db = FAISS.from_documents(split_documents, embeddings)

print("===Merging new and old database===")
old_db = FAISS.load_local(store_path, embeddings)
old_db.merge_from(new_db)
old_db.save_local(store_path, "index")

# Record the files that we have added
print("===Recording ingested files===")
with open(destination_file, 'a') as f:
    for document in documents:
        f.write(os.path.basename(document.metadata['source']))
        f.write('\n')
This is all that I wanted to share in this particular post. What we have done now is to create a vector store that includes (for example) literature in our Zotero library, which allows us to then use that literature as context in chat sessions with LLMs. How we actually set up these chat sessions and how we can use the vector stores in them is something I will discuss in a future post.
I have recently migrated from Doom emacs to a vanilla emacs config. What I like about the vanilla config is that I am more in control of what it does. Doom emacs, in some respects, was a bit of a black box to me. Also, Doom emacs does a lot of things I do not actually need or want. In my original post, I linked to my dotfiles, which included my Doom emacs config. The Doom config files are still there, but now in the old_doom_config branch instead of the master branch. My vanilla config maintains (I think) all of what I describe in my post. However, the config looks different in most places, for example because I could no longer make use of the Doom emacs macros.
After upgrading my Doom emacs installation today, I noticed that org-roam-bibtex was broken. This has to do with the fact that org-roam-bibtex has recently switched over to org-roam V2, which has been in development for a while now and was released in July. Doom emacs now also supports org-roam V2, and one of the GitHub pages for Doom emacs describes how you can make the switch. You have the option to stick with org-roam V1, but that version is no longer actively maintained. Making the switch to org-roam V2 basically involves changing a flag in your init.el file from +roam to +roam2. However, for org-roam-bibtex to work, you will need to make additional changes to your config (the config.el file). It is a good idea to check the new README at the org-roam-bibtex repository, as well as the manual. One of the more important changes is that org-roam-bibtex now uses an org-roam-capture template, instead of its own capture templates.
You will also find that org-roam-server (a plugin that I did not discuss in the post below, but that I did occasionally use) does not work with org-roam V2. However, the author of org-roam-server points to an excellent alternative, called org-roam-ui. Make sure that you carefully follow the installation instructions for Doom emacs that are provided in the README of the repository.
You will find a working configuration in the doom folder of my dotfiles.
I also came across a blog post in which org-cite was announced. This is something I have not delved into yet, but when I briefly scanned the post, it seemed like org-cite is a more powerful ‘org-ecosystem’ citation solution. I have also read that the author of the org-ref mode started porting org-ref to org-cite with org-ref-cite. This is exciting stuff that I still need to look into.
I spent some hours in the past couple of days writing a paper on Q-SoPrA (finally), and I wanted a break from that. However, I am not very good at doing nothing, so my break consists of simply writing something else; something a bit lighter. I had thought about writing some posts about my workflow(s), so I thought I could use my break to start working on that. And then I thought it would perhaps be nice to write something about how I actually go about writing papers nowadays. That is what you’ll read about in the below.
I wrote this in a bit of a rush. I will revisit this post a couple of times to fix inevitable language mistakes and typos. It is also likely that some parts need further clarification.
The idea to write a post like this is not original. In fact, there is already a very helpful blog post on writing academic papers with org-mode that has helped me develop my own workflow for paper writing. At the same time, there is not much else that I could find. There were some things in the example linked above that did not really work for me and that I decided to do slightly differently. For some of this I had to cobble together some ideas from various pages where org-mode is discussed. One such page is of course the org-mode manual. Other pages include discussions I found on Stack Overflow and LaTeX Stack Exchange, but I don’t remember all the specific pages I consulted on this.
I wanted to write this post to provide another source of inspiration on how to use org-mode for academic paper writing, hopefully providing some additional snippets of information that you couldn’t already find tied together in another post.
I imagine that there are four possible types of audiences for the text below. I assume all these types consist of people that either work in academia or have an interest in academia. Within that range, the four types of audiences I imagine would be interested in reading this post are:
Okay, I guess there is actually a fifth audience, which would be the people that know emacs, but dislike it because it is bloat. If you belong to that audience, you could skim through the next bit, where I dedicate about 2 sentences to this discussion, and then direct your hate mail to my email address.
Let’s start with emacs. I find it tricky to accurately describe what emacs is, because, in a way, it is many things at the same time, depending on how you set it up and use it. However, I think I would settle on a description of emacs as a lisp interpreter, as well as an ecosystem of tools that together constitute an integrated development environment. Emacs can be configured and extended with a dialect of lisp, which is a programming and scripting language (you’ll see plenty of that in snippets below). I use emacs to write software, to read my email, to organize my to-do lists (actually, this is an org-mode thing that I won’t discuss here), to keep notes on papers, to keep a journal with ideas that I am afraid I would otherwise forget about, and several other things. It is usually the first thing that I load up after booting into Linux and probably the software that I spend most time ‘in’. Emacs has been around for a long time and is still being actively developed today, which I believe is testimony to its incredible power.
There are plenty of Linux-enthusiasts that don’t like emacs at all, for example because they feel it goes against the UNIX philosophy that software tools should “do one thing and do it well”. In other words, in their opinion emacs is bloat. As I mentioned before, emacs does indeed try to be a lot of things at the same time. However, in the above, I consciously described emacs as “an ecosystem of tools”, because I think emacs can be seen as a collection of lisp-based tools that each do one thing (and do it well), but that can interact in various ways to make them do whatever you want emacs to do. I think you could even use emacs as a kind of Operating System if you wanted (I am not sure why you would, though). The benefit of having all these tools work together in one ecosystem is that they often work together well ‘out of the box’.
Emacs is highly configurable, so there are many different ways in which Emacs can be set up. Obviously, I am not going to discuss all of that in detail. What I do want to note here is that I use a particular ‘flavor’ of emacs that is known as Doom emacs. Doom emacs has a number of benefits over ‘vanilla emacs’, such as a tight package management system and a pre-configured setup that just does a lot of things ‘right’. I also like that it makes use of evil-mode, which more or less means that it works with vim keybindings. However, that is a story for another time.
Org-mode is a popular extension of emacs which revolves around a flexible structured plain text format. You can use it to manage to-do lists, for agenda and/or project management, to keep notes, as well as a wide range of writing applications (among others!). Org-mode has an ecosystem of its own, which means that it is highly extensible itself. For example, in the below I write a few things about org-ref, org-noter and org-roam, which all play a role in my writing process. I combine these tools with others that are not specifically org-mode-related. For example, to read pdfs in emacs, I make use of pdf-tools, and I use helm-bibtex (in combination with Zotero) to navigate my library of publications and find citations to insert. I will discuss all of that in more detail further below.
But let’s take a few steps back first. So, org-mode is a flexible structured plain text format, one that includes a markup language similar to that of markdown (which is the markup language in which I am writing this blog post right now). What does that mean? Well, I can do things like encapsulating text in asterisks (*) if I want to make that text bold and encapsulating text in forward slashes (/) if I want to make that text italic. I can make headings by starting a line with an asterisk followed by a space. If I want to make a heading at a lower level, I simply add more asterisks. I can also make (numbered) lists and these can be nested. And this is only the simple stuff. As an example of something more advanced, org-mode offers very powerful support for tables. You can even use formulas in tables, similar to what you can do in Microsoft Excel or LibreOffice Calc.
All of that in plain text.
A typical org file might look something like the following:
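For illustration, here is a small made-up example that uses the markup described above (headings, bold and italic text, nested lists, and a table):

```org
#+TITLE: An example org file

* A top-level heading
Some text with *bold* and /italic/ words.

** A second-level heading
1. A numbered list item
2. Another item
   - A nested, unnumbered item

* Another top-level heading
| Item | Price | Quantity | Total |
|------+-------+----------+-------|
| Foo  |  2.50 |        4 | 10.00 |
```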
After writing a document in org-mode, you’ll often want to export it to some other format. Org-mode in emacs provides a powerful exporting engine for this, but you could also use pandoc. With emacs’ export engine, you can, for example, export an org-mode document to a beautifully formatted pdf. In the process, the document will first be exported to TeX as an intermediary step, so if you are familiar with LaTeX, you’ll know what nice outputs you could make with org-mode in this way.
Some of you might think: “What’s the point? What benefit does this give you over Microsoft Word or LibreOffice Writer, in which you can do most of these things?” You might also think that plain text looks ugly (I don’t; I’ve come to love it) and you’d rather see what you get, as you do in Microsoft Word, rather than first having to export to some other format before you see what it’ll eventually look like.
Know that there are certain benefits to working in a plain text format like the org format. First, it makes version control much easier. For example, I use GitHub for version control, which means that each time I add bits and pieces to a paper (usually at the end of the day), I push a commit to GitHub. This allows me to keep track of all the changes I make between versions, without having to explicitly save a new version of my document whenever I want to make a new version, and without the risk of accidentally overwriting changes that I might later regret. If you want to know more about this approach to version control, there are plenty of places to read more about it, such as here. One thing to note is that this approach to version control only works well if you work in plain text.
Note that, with "version control", I mean something different than simply saving changes to your document on a regular basis. You should of course always do that to ensure that you don't accidentally lose a lot of work. My purpose with versioning of papers is not to keep track of each and every small change that I make, but to keep track of the larger chunks of text that I change over time. I might decide that a certain idea doesn't work well and that I want to revert to an alternative idea that I tried earlier. In that case it is nice to know that I have a version of that somewhere on my Github repository.
Second, as I mentioned previously, you can export org-formatted documents to many other formats, including odt, docx, TeX, and (via TeX) pdf. You don’t get this kind of flexibility with Microsoft Word or LibreOffice Writer. Moreover, because the text is plain text with some simple markup added to it, it is in any case relatively easy to reuse the text in another format. This could just be a matter of slightly changing the markup. However, I would just use pandoc for this most of the time.
Third, since you’re working in plain text, you don’t need a dedicated editor. Sure, writing org-mode in emacs makes a lot of things easier, but in principle there is nothing to stop you from using any other text editor. This means that you don’t depend on one or a few software packages. Also, because (again) we are working in plain text, it is unlikely that at some point you won’t be able to open your document anymore because it was written with outdated software. I guarantee you that plain text editors will be around for a loooooooong time.
That being said, my purpose with this post is not to convince you to use org-mode. In fact, there are plenty of downsides to it too. For me, one of the most important ones is that none of my colleagues works with org-mode; I doubt they have even heard of emacs or org-mode. This is of course unproblematic if you mostly write by yourself, but it is problematic if you frequently write with others, which is true for me. This downside is compounded by the fact that my university more or less forces you to work with Windows and Office365, unless you choose to work on your own machine. The result is that everybody is socialized into using the same set of tools, and trying to work around this quickly becomes cumbersome.
However, despite these downsides, I find myself using org-mode whenever I can. I like the fact that I can just focus on writing without too many distractions. I also like the fact that, with a few keystrokes, I can…
There might be things I forget right now. More generally speaking, I somehow find working in emacs and org-mode much more satisfying than working in Word, and I often find myself looking for ways to work around Office-based programs more than I already do.
Some of the things I outline above are things that others use LaTeX for. I already mentioned you can make beautifully formatted text with LaTeX and org-mode exports to pdf via TeX. So why wouldn’t you just work in LaTeX directly?
Well, I love LaTeX, and I worked in it for a while, but LaTeX markup is quite elaborate and, in my opinion, a bit distracting. I often found myself going back to tex documents that I wrote earlier, for example to remember how I got my figures to come out the way I liked. Org-mode’s markup is much cleaner and more intuitive. Also, if you really want to just insert ‘plain’ LaTeX in some places, you can easily do that. You could say that org-mode makes working with LaTeX a lot easier.
Also, exporting to pdf from org-mode takes just one command. In LaTeX you would need to run a LaTeX compiler, then typically BibTeX to resolve the linkages to citations, then the LaTeX compiler again… When you export to pdf from org-mode, all that needs to be done as well, and you actually add additional steps because it first needs to export from org to tex. However, emacs does all of this for you in the background. You don’t need to worry about it.
Org-mode does not provide the possibility to directly export to docx. You can of course export to odt and then save your file as docx. However, I've noticed that exporting from org-mode to odt does not work well if you want to include a bibliography. I think the easiest way to convert from org to docx is to use pandoc. If you use the `--citeproc` argument, you can also get the bibliography exported properly.
So let’s get into the details of my setup. I’ll (roughly) discuss the following:
Please note that some snippets of my emacs config file below will only work for you if you use Doom emacs. My config file makes use of macros that are specific to Doom emacs, such as `after!`. However, if you are familiar with emacs, and you don't want to use Doom emacs, it should be relatively easy to adjust my examples.
I made a template for papers that I can reuse. The template looks more or less like the one below.
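Here is a sketch of such a template; the title, author names, and affiliation below are placeholders, and the individual headers are explained in the remainder of this section:

```org
#+TITLE: Title of the paper
#+OPTIONS: toc:nil author:nil
#+LATEX_CLASS: apa6
#+LATEX_HEADER: \usepackage{breakcites}
#+LATEX_HEADER: \usepackage{paralist}
#+LATEX_HEADER: \let\itemize\compactitem
#+LATEX_HEADER: \let\description\compactdesc
#+LATEX_HEADER: \let\enumerate\compactenum
#+LATEX_HEADER: \usepackage{natbib}
#+LATEX_HEADER: \author{Author Name}
#+LATEX_HEADER: \affiliation{Author Affiliation}
#+LATEX_HEADER: \leftheader{Author Name}
#+LATEX_HEADER: \shorttitle{Short title}
#+LATEX_HEADER: \abstract{The abstract goes here.}
#+LATEX_HEADER: \keywords{keyword one, keyword two}
```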
So, what do we have here? At the top of the file we see a kind of preamble. Org documents often include some export settings in their preamble. I call this a preamble, but you can actually put these export settings anywhere in your file.
Note that a lot of my export settings are actually LaTeX headers. This is the kind of stuff that you would also put in your LaTeX preamble, which org-mode can handle for you in this case. I use these LaTeX headers to set the document type (apa6 class article, in this case), some options related to the types of citations I use, some options that are apa6 class specific, and some options to change the default behavior of the LaTeX formatting process.
Some of these options were adopted from the example set here, such as the use of `\usepackage{breakcites}` to allow citations to word wrap, `\usepackage{paralist}` to make lists more compact (the three lines directly below it are related to this too), and `\usepackage{apa6}` for apa-compliant citations. I added `\usepackage{natbib}` to allow for natbib-style citation markup (see further below).
The apa6 LaTeX class specification uses the `\abstract{}` command to include an abstract and the `\keywords{}` command to include keywords. The class also natively supports multiple ways to report authors and affiliations. In the template, you’ll see what you’d need to include if you are the sole author of the paper. If you have two authors, you could instead use `\twoauthors{Author One}{Author Two}`. For more authors, you follow the same logic (`\threeauthors{}{}{}` and so on; the class currently supports up to six authors). The same goes for affiliations: for two authors you can use `\twoaffiliations{}{}`. I noticed that the left header only shows the first author by default. I therefore included the `\leftheader{}` command so I can customize it. The `\shorttitle{}` will be shown as the right header.

You can also specify the author name with org-mode’s own `#+AUTHOR` export option, but I found that it quickly becomes a pain to handle authors and affiliations properly when there is more than one author. I therefore use the LaTeX headers for this instead. In that case, it is necessary to include `author:nil` in `#+OPTIONS`, because emacs may default to using the username specified in your config file as the author name, even if you didn’t include `#+AUTHOR` in your file. Without `author:nil`, you will always be included as a co-author, alongside any other authors that you already specified in the LaTeX headers.

I also set `toc:nil` to prevent a table of contents from being included.
You should make sure to include your bibliography by the end of your document. The bibliography will appear right where you put the following lines:
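With org-ref, these are bibliography links; a minimal sketch (the style and the path are placeholders for your own choices):

```org
bibliographystyle:apalike
bibliography:~/bibliography/library.bib
```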
One thing that org-mode makes very easy to do is to add figures and tables to your paper. As I mentioned before, with LaTeX I often had to look up how to do those things properly. Org-mode is just much more intuitive in its use here.
To include a figure, you basically just need to include a link to it. In org-mode, a link is always placed between double brackets (as shown below). Above the link to the figure, you can add a caption, as well as a name that you can use to reference the figure. If you are using a multi-column document class (like apa6) and you want figures to span the full width of the page, instead of just one column, you also want to add the third option (`#+ATTR_LATEX: :float multicolumn`) that I have added below.
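For example (the file name and label below are made up):

```org
#+CAPTION: A network of events.
#+NAME: fig:network
#+ATTR_LATEX: :float multicolumn
[[./figures/network.png]]
```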
The name variable comes in useful whenever you want to refer to the figure in your text. This is something else that I find painful to do in (for example) Word documents. If you refer to your figures by numbers in your text, you have to be careful to update these numbers if something changes in the order of figures. In org-mode, you can instead refer to your figures by their label, using an internal link, and this link will automatically be resolved to a number upon exporting. So, for example, you could write something like the following in your text:
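For example, given a (hypothetical) figure named `fig:network`:

```org
Please see figure [[fig:network]] below.
```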
And if that figure happens to be the third figure in your document, then in the exported pdf document this will resolve to:
“Please see figure 3 below.”
For tables, you can simply use org-mode’s built-in table functionality, and when exporting, tables will be formatted according to the specifications of whatever you’re exporting to.
There are various ways in which you can configure the way tables come out too. As with figures, you’d place these export options above your table.
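For example, a caption, a name, and some (illustrative) LaTeX attributes placed above a made-up table:

```org
#+CAPTION: An example table.
#+NAME: tab:example
#+ATTR_LATEX: :align lr :placement [h]
| Item | Total |
|------+-------|
| Foo  | 10.00 |
| Bar  |  2.50 |
```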
With the information above you already know most of the basics required for writing a document in org-mode that can be exported as a pdf (via LaTeX).
For the exporting via LaTeX to work properly, you’ll need to set up the LaTeX classes that you plan to use, which you do in your emacs config.el file. This is just something I shamelessly copied from the example that I’ve mentioned a few times before, and it works well for me.
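The snippet is along these lines (my reconstruction of that example; the sectioning pairs are the standard ones for org’s LaTeX export):

```elisp
(after! ox-latex
  ;; register the apa6 class so org headings map onto LaTeX sectioning commands
  (add-to-list 'org-latex-classes
               '("apa6"
                 "\\documentclass{apa6}"
                 ("\\section{%s}" . "\\section*{%s}")
                 ("\\subsection{%s}" . "\\subsection*{%s}")
                 ("\\subsubsection{%s}" . "\\subsubsection*{%s}")
                 ("\\paragraph{%s}" . "\\paragraph*{%s}")
                 ("\\subparagraph{%s}" . "\\subparagraph*{%s}"))))
```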
From what I understood from the explanation offered here, this maps the org-mode headings and lists (and so on) to their LaTeX equivalents.
In the above we have the basics more or less covered, such as including export settings and knowing how to format a basic org file, including figures and tables. We also know how to include our bibliography, but an important topic that we haven’t covered yet is how to include citations in your document in the first place.
The org-mode ecosystem includes a wonderful tool for this, which is org-ref. Org-ref depends on helm-bibtex, a wonderful tool that you can use to browse your library of references from within emacs. If you use a citation management tool like Zotero, it is easy to keep a library of references that also have pdfs linked to them. You can export the library as a .bib file, which can then be read by helm-bibtex. Org-ref uses the helm-bibtex menu to search for citations that you want to insert. If you have org-ref installed, you can use the `C-c ]` keybinding to insert a citation at point. You will then be shown a list of the references included in the .bib file that you pointed helm-bibtex to (see below). The list is filtered as you type. When you press enter on any reference, it will be included in your paper. For an example, see the citation at the end of the snippet below.
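For example, a (made-up) sentence ending in an org-ref citation looks like this in the org buffer, where `doe_2020` is a hypothetical cite key from the .bib file:

```org
Social practices have been theorised in various ways citep:doe_2020.
```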
This gives a good idea of how org-ref citations are formatted. They basically use the natbib format. For example:
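A few variants, again with made-up cite keys:

```org
citep:doe_2020
citet:doe_2020
citep:doe_2020,smith_2021
```

On export these become `\citep{doe_2020}`, `\citet{doe_2020}`, and `\citep{doe_2020,smith_2021}`, which render roughly as “(Doe, 2020)”, “Doe (2020)”, and “(Doe, 2020; Smith, 2021)”.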
And so on…
What the snippet above doesn’t show, is that the citations in org-mode behave a bit like links. You can place your cursor on them, press return, and you’ll be shown a helm-bibtex menu as shown below (click the picture to view a larger version of it):
You’ll see that you can do a number of things related to this citation, such as opening the pdf to re-read something if you need to. You can also access your notes on this paper from here (see more on that further below), you can add pre- or post-text, and a number of other things.
For org-ref and helm-bibtex to work properly, you’ll need to put some things in your emacs config.el file. See my configuration below (and notice the comments):
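A hedged sketch of that configuration (the .bib path is a placeholder, and the exact variable names may differ slightly between org-ref/helm-bibtex versions):

```elisp
;; NOTE: the bibliography path below is a placeholder for your own library.
(use-package! helm-bibtex
  :config
  (setq bibtex-completion-bibliography "~/bibliography/library.bib"
        ;; (1) the .bib field that holds the path to the pdf file
        bibtex-completion-pdf-field "file"
        ;; open pdfs inside emacs rather than in an external viewer
        bibtex-completion-pdf-open-function #'find-file))

(use-package! org-ref
  :config
  (setq org-ref-default-bibliography '("~/bibliography/library.bib")
        ;; (2) let org-ref find the pdf of the citation at point via helm-bibtex
        org-ref-get-pdf-filename-function #'org-ref-get-pdf-filename-helm-bibtex))
```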
Some of this configuration involves pointing helm-bibtex and org-ref to my default library.bib file. I also had to add several lines to make sure that (1) helm-bibtex knows where to find the path to pdf files in the library.bib file, (2) org-ref is able to find the pdf file associated with the citation that I have my cursor on and (3) org-ref is then able to open that file. I am not entirely sure, but I believe the default behavior was for org-ref to open the pdf in an external viewer. My configuration also makes org-ref open pdfs in emacs instead.
Since I’ve mentioned the possibility to open pdfs a couple of times, this is as good a place as any to bring up that I use pdf-tools to read pdfs instead of emacs’ default doc-viewer. In pdf-tools, pdfs look much better than in doc-viewer. In addition, pdf-tools comes with some annotation tools, although I don’t use those myself.
To use pdf-tools as the default viewer, you can add the following lines to your config (keep in mind that the below is specific to Doom emacs configs):
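In Doom emacs you would typically just enable the `pdf` module in init.el, but a minimal config.el sketch looks like this:

```elisp
(use-package! pdf-tools
  :config
  ;; build/activate pdf-tools and register it as the default pdf viewer
  (pdf-tools-install))
```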
So far, so good. We now know how to write basic documents in org-mode, and we’re able to use org-ref (with helm-bibtex) to insert and alter references, and to open the pdf files associated with them.
Let’s stick with org-ref and helm-bibtex for a bit longer. Let’s say that you’ve kept notes on the papers that you want to cite in your own paper. You might want to consult those notes while writing. Here too, I find it convenient to be able to do this within the program that I am doing my writing in. That is why I use org-noter to keep notes on my papers, and I use org-roam-bibtex (orb) to link org-noter and helm-bibtex together. This is a topic that is discussed in detail in this blog post, so I won’t cover everything in detail here.
Very briefly then: org-noter is an emacs package that you can use to keep notes on pdfs in an org document. It works together quite well with the pdf-tools package that I mention above. Basically, with org-noter you can read a pdf, and have your notes document open beside it. You can write notes that are associated with particular pages in the pdf document, as well as notes that are associated with particular locations on pages. If you then browse through your notes, org-noter will automagically bring up the page (or location) of the pdf document that the note is associated with. It looks a bit like this:
In the image, you see the pdf on the left, and you see the notes I wrote on this pdf on the right. Each of those notes is related to a particular location in the document, which I can find back easily by jumping back and forth between my notes.
See my org-noter configuration below:
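A sketch along these lines (the notes path is a placeholder, and the keybindings are my own choices, shown here as examples):

```elisp
(use-package! org-noter
  :config
  ;; default location for notes; mine points at my org-roam folder
  (setq org-noter-notes-search-path '("~/org-roam")
        org-noter-always-create-frame nil)
  ;; example keybindings for inserting notes while reading a pdf
  (map! :map pdf-view-mode-map
        :n "i" #'org-noter-insert-note
        :n "I" #'org-noter-insert-precise-note))
```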
The most important thing is that you need to tell org-noter what the default location of your notes is. In my case, I include my notes in my org-roam folder (we’ll discuss org-roam soon). I also added a bunch of keybinds that I use to insert and navigate notes.
Org-noter doesn’t integrate with helm-bibtex out of the box. That is what you can use orb for. One very important thing that orb does is to tell helm-bibtex to use org-noter as the default note-keeping package. It also integrates org-noter with org-roam (again, we’re getting to org-roam soon).
See my config snippet for orb below. It is an altered version of what I copied from another post about this.
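The snippet is roughly as follows (adapted from the orb documentation; this assumes the org-roam v1-era API, and the template contents are illustrative):

```elisp
(use-package! org-roam-bibtex
  :after org-roam
  ;; hook orb into org-roam so helm-bibtex uses org-noter/org-roam for notes
  :hook (org-roam-mode . org-roam-bibtex-mode)
  :config
  ;; keywords that can be expanded in the template below
  (setq orb-preformat-keywords
        '("citekey" "title" "url" "author-or-editor" "keywords" "file")
        orb-process-file-keyword t
        orb-file-field-extensions '("pdf"))
  ;; the template used for new note files on a reference
  (setq orb-templates
        '(("r" "ref" plain (function org-roam-capture--get-point)
           ""
           :file-name "${citekey}"
           :head "#+TITLE: ${title}\n#+ROAM_KEY: ${ref}\n- keywords :: ${keywords}\n\n* Notes\n:PROPERTIES:\n:Custom_ID: ${citekey}\n:URL: ${url}\n:AUTHOR: ${author-or-editor}\n:NOTER_DOCUMENT: ${file}\n:END:\n"
           :unnarrowed t))))
```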
The first few lines are basically required to get orb to work with org-roam.
The most important lines are the ones below that. The `orb-templates` variable basically defines the templates of the files that you use to record your notes on pdfs. I only have one template, and at the bottom of the snippet you’ll see what is included in it (e.g., a url to the file, the author). You have to define the keywords included in the template with the `orb-preformat-keywords` variable.
With these settings, if I navigate to a reference in helm-bibtex, and then tell helm-bibtex to open my notes on that reference, it will create a new file (if a file on that reference does not already exist) which is preformatted according to the template defined above.
In other words, whenever I am writing a paper, I can easily consult my notes on another paper by opening up helm-bibtex, navigating to the paper I want to consult the notes on, and then pressing F9 (or simply choosing ‘add notes’ from the helm-bibtex menu). Combined with the things I mentioned earlier: For any citation in my paper I can quickly open the pdf of that citation as well as my notes on that paper within a few seconds, and all within the same program.
Not all notes that are relevant to the paper I am writing are necessarily notes on ‘another paper’. They might just be notes on an idea that I wrote up a while ago. They might also be more elaborate notes that I kept on a certain topic and that relate to multiple papers at the same time. I might want to consult these notes too.
In addition, I might want to be able to link my notes on one paper to notes that I made on another paper, or to notes that I wrote on a broader topic.
This is (more or less) where org-roam comes in. There is a lot to be said about org-roam itself, but I don’t want to get lost in the details here. Briefly, org-roam allows you to build a kind of ‘second brain’. It is an approach to taking notes that makes use of the Zettelkasten method. The idea behind that method is more or less that you keep relatively short notes that you explicitly link to other short notes. Together, the notes form a horizontal network of notes that can be quickly navigated.
So, whenever you are writing a note on something, you will want to link it to notes on other ideas that you think are associated to the topic you are currently writing a note on. An excellent discussion of that principle is offered here.
Org-roam itself provides a powerful infrastructure that implements this approach to note-taking. A key element of it is the use of a buffer that gives you an overview of all the notes that are linked to the note you are currently inspecting (see below for an example; the overview buffer is on the right).
If you keep this approach to note-taking up for a while, you’re slowly but surely building a network of notes. Basically, you are building your own wiki.
My network of notes includes notes on various broader topics (such as the notes on social practice theory of which I included a screenshot above), notes on projects that I am working on, notes that I took on a particular day (e.g., ideas I didn’t want to forget about or perhaps notes on a meeting I had on that day), and of course my notes on papers.
What makes org-roam so powerful is that you can quickly (re)trace the associations between ideas that you made notes on. You might open a document with notes that you made on a particular paper, then follow the links to other notes that might seem interesting for whatever you are currently writing on, and so on and so forth.
This could even be a viable approach to outlining your initial ideas for a new paper, similar to what is shown in this video, which uses the software that inspired org-roam.
See my basic org-roam config below.
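A sketch of that config (again assuming the org-roam v1-era API; the directory path is a placeholder):

```elisp
(use-package! org-roam
  ;; make sure org-roam runs on emacs startup
  :hook (after-init . org-roam-mode)
  :config
  ;; a single flat folder of notes; dailies live in one sub-folder
  (setq org-roam-directory "~/org-roam"
        org-roam-dailies-directory "daily/")
  ;; general notes: timestamped file name plus a short title; file contains only a title
  (setq org-roam-capture-templates
        '(("d" "default" plain (function org-roam-capture--get-point)
           "%?"
           :file-name "%<%Y%m%d%H%M%S>-${slug}"
           :head "#+TITLE: ${title}\n"
           :unnarrowed t)))
  ;; dailies: the date as file name, the date as title
  (setq org-roam-dailies-capture-templates
        '(("d" "default" entry (function org-roam-capture--get-point)
           "* %?"
           :file-name "daily/%<%Y-%m-%d>"
           :head "#+TITLE: %<%Y-%m-%d>\n"))))
```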
What I do here is to first point to my org-roam directory, which is a folder without any sub-folders (except for one subfolder in which I keep my ‘daily’ notes, which are structured more like a journal) where I keep all my notes (including the ones made with org-noter). I make sure that org-roam runs on emacs startup, and then I define two kinds of templates. These templates define the names of the files to keep your notes in, as well as the basic contents of these files.
The first template is the template for my general notes. These notes are kept in files with the date and time of their creation, followed by a short title that I type when creating the note. The file itself only includes a title, and nothing else.
The second template is the template for my org-roam-dailies, which is basically a journal-type note, associated with a particular date. These files simply have the date on which they were made as their filename, and in this case too, the file itself includes nothing but the title of the file.
Whenever I am writing notes in one of these files, I can use simple keystrokes to insert links to other notes. If I insert a link to a note that doesn’t already exist, then a new file is automatically created for that note, ready to be filled at some point.
This lengthy blog post discusses some of the key elements involved in my current paper writing process. In short, org-mode serves as its basis, but this basis is enriched with powerful tools that allow me to:
I hope you found something useful / interesting in this post!
The title of this blog post is a bit overly dramatic. Yet, it has been almost three years since my last blog post, so it does to some extent feel to me as if I am finally giving a sign of life after a long absence.
There are multiple reasons why I have not updated my website (except for some minor things) in a long time. One of them is that I became a father in 2018, which of course upended all my routines. Another reason is that I changed jobs (and countries) in 2019. There are more reasons, but they are not all that interesting or important.
What is more important is that I would like to pick up the routine of blogging again. As before, I will probably blog quite a bit about Q-SoPrA, the software that I have been working on for a while now. The development of Q-SoPrA is another thing that got slowed down due to the various changes in my personal life. However, I have done quite some work on it, and I feel like it is probably ready for a public beta release. However, I am not going to actually release it until after I finish a publication on the software (writing publications, another thing that got slowed down). In the meantime there have been a few people using it (including myself, of course). This includes a PhD student who recently graduated, as well as a master student who also recently graduated. One of my own projects in which I used Q-SoPrA was the evaluation of the development of the Rotterdam Climate Agreement. We recently finished a report on this project (only available in Dutch) that includes various visualizations that I made with Q-SoPrA. Last year I was involved in a H2020 bid in which we also included Q-SoPrA as one of our main tools. The bid got excellent reviews, but when only 3 out of more than 50 bids can get funded, it becomes a bit of a lottery.
Anyway, I expect to be using Q-SoPrA and its underlying methodology in various future projects and I still hope that Q-SoPrA will be useful to others too. I hope that picking up this blog again also gives a boost to my efforts to work towards a public release.
As an aside, I am also thinking about blogging on other topics related to my research, although I have to admit I find that much harder to do. Maybe those posts will just be much shorter.
Anyway, I wanted to keep this one short as well. Let me close with an overview of a few (probably not all) things that I changed in Q-SoPrA in the meantime:
There is probably a lot of stuff that I forget now, because it has been a while since I have been able to actively work on Q-SoPrA.
There is also some bigger stuff that I am actively working on in separate branches of the code repository. One of these branches implements something based on truth tables that are based on event graphs. This is related to a paper that I am collaborating on with someone who is into set theory. Anyway, hopefully I can write more on that later too.
Actually, while writing this I realize there is plenty of stuff that could be interesting to discuss further in future blog posts.
Okay, this was a short one, but my main purpose was to make a start again, and do it quick. Until the next one!
I wrote a non-technical post on relationships in Q-SoPrA. One of the things I discuss there is plotting parallel edges, that is, multiple edges between the same pair of nodes. This is something I definitely wanted to be able to do, since I am interested in looking at social arrangements (of people, places, things, etc.) that can be related to each other in various ways. If I want to look at multiple relationships at the same time, being able to visualise parallel edges is a necessity. In this post I discuss some of the details of visualising parallel edges using Qt’s tools for visualisation. For a visual impression of what parallel edges look like in Q-SoPrA, see the screenshot below.
In this post, I will be focusing almost exclusively on the drawing of edges, but I just want to briefly touch upon some of the other basic things we need in order to visualise them. Below you see a screenshot of the Network Graph Visualisation widget with all the menus unfolded. In the middle of this screenshot you see the space where the drawing actually happens. The drawing space itself is an object of the `QGraphicsView` class, although I should add that in Q-SoPrA I use a sub-classed version in which I re-implemented many of the member functions of this class (the same goes for most other classes I discuss in this post). The `QGraphicsView` object visualises objects that are included in another object of the `QGraphicsScene` class (a `QGraphicsScene` object needs to be assigned to the `QGraphicsView` object). These classes are quite well documented in the Qt documentation, so I will not go into details here (also see this page to read more about the Graphics View Framework of Qt).
Also, I recently read this blog post, which seems to suggest that Qt Quick is a potential replacement for the Graphics View Framework.
In the screenshot we also see that we have drawn objects in the drawing space. We can see nodes, we can see edges, and we can see labels. All these objects are items that are currently included (and visible) in the `QGraphicsScene` object. The nodes are objects of a sub-classed version of the `QGraphicsItem` class, the edges are objects of a sub-classed version of the `QGraphicsLineItem` class, and the labels are objects of a sub-classed version of the `QGraphicsTextItem` class.
So, the QGraphicsView, the QGraphicsScene and the QGraphicsItem are the three basic types of objects that you need to make visualisations like the ones included in Q-SoPrA. The QGraphicsItems are the things we want to visualise, the QGraphicsScene contains and manages these items, and the QGraphicsView visualises the contents of the QGraphicsScene.
Before I get into some specific challenges related to drawing parallel edges, it is useful to briefly discuss what goes into drawing a basic edge. There are a few basic properties of edges that we need to take into account: where the edge starts and where it ends, the fact that the line should stop just short of the node it points to (so that it does not overlap with the node), and the arrowhead that indicates the edge's direction.
In Qt’s documentation you can find the Diagram Scene example that achieves almost exactly this (see especially the header file and the cpp file of the Arrow class used in this example). The objects that Q-SoPrA uses to draw edges are inspired by this example, although the classes ended up looking quite different due to various specific requirements I had for my own class. The class I developed is called DirectedEdge, and it is what I will focus upon in the remainder of this post.
Let us first take a look at the constructor of the DirectedEdge class. Here is a code snippet from which I have removed some details that are not important for the examples in this post.
As you can see, we are passing pointers to objects of the NetworkNode class to the constructor of DirectedEdge. The NetworkNode class is a sub-classed version I created of the QGraphicsItem class, and I use this class to draw the nodes of my network diagrams. When we add NetworkNodes (or any other QGraphicsItems) to a QGraphicsScene object, they will be assigned a scene position, that is, a point in the scene that is defined by an x-coordinate and a y-coordinate. We can access the scene position of a NetworkNode (or other types of QGraphicsItems) by using the scenePos() member function. This will return a QPointF object that contains the item’s coordinates. So, if we want to know from where to where to draw a certain edge, the obvious thing to do would be to first find the scene positions of its start node - start->scenePos() - and end node - end->scenePos() - and then draw the line between those two positions.
As I mentioned before, we actually want our line to stop shortly before it reaches its end point. What we could do is create a QLineF object, passing our start and end points as parameters, and then use the setLength() function to make the line slightly shorter: myLine.setLength(myLine.length() - 18).
Then we still need to add our arrowhead. For this, I simply followed the Diagram Scene example provided in the online documentation for Qt. Basically, this involves creating a QPolygonF object with the shape of our arrowhead, and having this object drawn near the end point of our line.
We should do most of the above in the paint() function of our edge, because that is the function where we determine where and how the edge is drawn. I have included a code snippet below to illustrate what our paint() function might look like in this scenario.
And that should do the trick. At least, it would if we were only interested in drawing edges as straight lines. However, what if we want to draw parallel edges, that is, multiple edges between the same pair of nodes? If we just used straight lines, the edges would overlap, and we would not actually be able to see that multiple edges exist between our nodes. In this case, it is better to draw edges as curved lines, and to change the strength of the curve for each additional edge that we add between a given pair of nodes. The remainder of this post will be about how we can do this, and what challenges we will face.
Drawing a curved line with the Qt library is not difficult to do. One of the easiest ways to paint() a curved line is by creating a QPainterPath object and using its quadTo() function. This function takes two arguments: one of the arguments is the end point that we want to draw the curved line to, and the other argument is a so-called control point that we will use to determine how the line will be curved (how strong the curve will be and which direction it will curve in). We do not give the function a starting point. Instead, we should move the QPainterPath object to our starting point using its moveTo() function before calling the quadTo() function.
So, what about this control point? Consider the image I have linked to below (found through this Stack Overflow discussion). The image shows nicely how the control point works. It is a point somewhere ‘above’ the line, and the line will curve towards it. In the image you see that if we place the control point somewhere above the middle of the line, then the line will also curve around its middle point. This is exactly what I wanted for my parallel edges. In the image you also see that the curve will change if we move the control point closer to the starting point or the end point. This is something that I want to avoid.
So far so good. Assume that our starting point is at coordinates (0, 5) in the scene, and our end point is at coordinates (10, 5) in the scene. We can first simply calculate the point that lies exactly in between them: x = (0 + 10) / 2 = 5 and y = (5 + 5) / 2 = 5, giving us the point at coordinates (5, 5). Then we still need to set the ‘height’ of the curve, which, in this case, we can do simply by adding some constant to the y-coordinate of our middle point, giving us, for example, the point at coordinates (5, 25). If we then pass this point to the quadTo() function, we will get a nice curved line.
This example was relatively simple, because the slope of the straight line between our starting point and our end point is 0. This makes finding the control point relatively straightforward. However, consider now that we have a line that starts at coordinates (0, 5) and ends at coordinates (10, 10). Finding the point that lies exactly in the middle is still quite easy: x = (0 + 10) / 2 = 5 and y = (5 + 10) / 2 = 7.5. However, how do we now find the control point, somewhere ‘above’ this middle point? We cannot simply add a constant value to the y-coordinate of our middle point, because that would place the control point somewhere to the right of the middle of the line, and create a curve that skews to the left (similar to the left part of the picture above, where the line is skewed to the right).
There are multiple possible solutions here. One thing we could do is to simply (1) calculate the distance between the start point and end point of the edge (using the Pythagorean theorem), (2) draw an imaginary straight line from our start point with a slope of 0 (that is, parallel to our x-axis), and with a length that equals the distance we measured in step 1, (3) create a curved edge from the start and end point of our imaginary straight line, using the procedure described above, (4) calculate the angle between our imaginary straight line and the sloped line from our original starting point to our original end point, and then (5) rotate() the painter by the number of degrees of that angle before drawing our edge. In effect, we are just drawing a curved edge between two points on a horizontal line, and then rotating that edge before it is drawn, thereby changing its end point. This is the solution I used for a while, because it is relatively simple to implement. However, it does cause some complications for calculating the bounding rect and the shape of the edge, because we are essentially working in multiple coordinate systems. I will not discuss these complications in detail here, but it is important to know that they can cause unwanted behaviour in visualisations.
There is another, better solution that I eventually switched to after experiencing issues with my first solution. This solution starts with drawing an imaginary straight line between the start and the end points of our edge. We then calculate the midpoint of our line, as before. Then we draw a straight line perpendicular to our first line that crosses our midpoint, and we pick a point on that line as our control point. This sounds relatively straightforward, but it took me some time to figure out how to properly implement the formula for setting the control point.
Rather than including all these steps in the paint() function of our edge, I wrote a separate calculate() function that makes the necessary calculations, and that is called by the paint() function (as well as by other functions that require knowledge of the control point’s position). See the two functions in the code snippet below.
In addition to finding the control point (which I explained above how to do), there are a few other things we need to do to make sure that we end up with nice looking curved edges. First of all, we might have 3 or more parallel edges between the same nodes. If we want to make all edges visible, we need to increase the strength of the curve for each parallel edge that we add (to prevent them from overlapping). That is what the height scaling factor in the snippet above is used for. This height has to be set explicitly, for which I wrote a very simple function. Then it is simply a matter of keeping track of what edges we already have in our scene and making sure that the heights of the curves of our edges are adjusted accordingly. This is something that needs to be handled at a higher level, and I will not discuss it any further here.
We also need to make sure that our arrowhead points in the right direction. If we attached our arrowhead to the original straight line that we draw from the start point to the end point, it would make an awkward angle, as shown in the screenshot below.
This is relatively easy to correct with the resources that we already have. In the code snippet that I included above, you will see that I created an object that I called ghostLine, which is a line that runs from the control point that we calculated for the Bézier curve to the end point of the edge (minus a small distance to prevent the line from overlapping with the node). This ghostLine is useful for determining where the curved edge should end, as well as for determining the angle that the arrowhead should point in. Essentially, what we can do is attach the arrowhead to the ghostLine and draw the arrowhead, but not draw the line itself. For illustrative purposes, I included a screenshot below where the ghostLine is drawn, so that you can get an idea of how it helps to determine the angle of the arrowhead.
And that is it. The code snippets above contain the most essential ingredients for drawing curved edges within the Qt Framework. Of course, there is plenty of stuff that goes on around this that needs to be implemented for all of this to work. Later this year, the source code for Q-SoPrA will be made open, which will give you the opportunity to examine the code in more detail.
As the name suggests, Q-SoPrA (Qualitative Software for Process Analysis) is focused first and foremost on the qualitative analysis of social processes. However, I invested a lot of time and energy in also creating features that can be used to study how these processes relate to structures. In my opinion, this is a somewhat obvious thing to do, since so many social theories posit some kind of relationship between process and structure (often couched in terms of agency and structure). I can use my own application of Q-SoPrA as an example. I am currently using Q-SoPrA in a study that, conceptually, builds on Schatzki’s concepts of social practices (in short, organised doings and sayings) and social arrangements of people, artefacts, places, and other types of entities. I use event graphs to reconstruct practices as networks of activities (Schatzki also writes about chains of action in this context), and I use more traditional network graphs to reconstruct arrangements as networks of relationships between entities. One aspect of my investigation is to study how unfolding practices relate to (changes in) arrangements.
I believe that many other theoretical perspectives are compatible with Q-SoPrA, although Q-SoPrA probably fits best with perspectives that assume primacy of process over structure (see Rescher). In other words, Q-SoPrA connects best with the assumption that (social) life is essentially a flux, and that the emergence and persistence of structures are generally accidental properties of processes, even if these accidental properties can be widespread (and I think they are). This assumption can be opposed to the assumption that reality is fundamentally structural, and that change and development are accidental properties of structures.
I should perhaps add that, in my view, this does not necessarily mean that processes should always have priority in the explanation of social phenomena. I think it is perfectly reasonable to assume that structures emerge from process, but are then capable of shaping or inducing processes in radical ways, and are therefore of great explanatory value, depending on the specific research questions being asked.
That being said, from a process perspective it would then still be obvious to also ask what processes lead to the (re)production of that structure.
So how does Q-SoPrA assume primacy of social process? Well, by using incidents as indicators for relationships (although it is of course equally possible to think of incidents as enactments of relationships without changing the general approach). More specifically, Q-SoPrA allows you to define relationships (I discuss the details below), and then assign these to incidents in the same way that you would assign attributes. This also creates the benefit that it becomes possible to study networks of relationships that were indicated by incidents in a particular episode of the process, thereby providing a rudimentary way to look at changes in networks of relationships over time, as I discuss further below.
In the past, I used to do something similar by assigning actors to events (as attributes), and then looking at the co-participation (or co-affiliation) of actors in events over time (also see my post about bi-dynamic line graphs). Two important limitations of this approach are that (1) co-participation is basically an abstract summary of many different ways in which actors can be related to each other through events, and (2) not all relationships that are indicated by events are necessarily captured by their co-participation. To offer an example of the first limitation, imagine that we have two events in which the same pair of actors interacted with each other, but that in one event the interaction concerned the joint organisation of an activity, and in the other event the interaction concerned one actor providing financial support to the other actor. If we simply model both occasions as co-participation, then we lose the ability to distinguish between the two situations. To offer an example of the second limitation, imagine that we have an activity in which an individual is acting as a representative of a certain group. If we want to capture this fact as a membership relation, it would be awkward to capture it as co-participation of the individual and the group in the activity. Instead, it would make more sense if we could simply define a membership relationship, and say that the activity is an indication of the individual being a member of the group.
The way I implemented relationship coding in Q-SoPrA is basically an attempt to take away these limitations. In addition, I made sure that Q-SoPrA is able to visualise networks of relationships that have multiple modes (for an introduction into two-mode networks, see this blog post; Q-SoPrA allows you to define more than two modes), and multiple types of relationships. Parallel edges (multiple edges between the same pair of entities) are drawn with curved lines to make sure that multiple types of relationships can be visualised in one graph. In addition, it is possible to perform multi-mode transformations to infer ‘latent’ relationships from observed ones. I will explain all of this in detail in the remainder of this post.
Before I proceed, I should note that the example data set that I use below is based on a narrative that Peter Abell provided in one of his papers on Comparative Narratives. Abell’s theory and method of Comparative Narratives are major sources of inspiration for the ideas implemented in Q-SoPrA.
In the screenshot below you see what the relationships widget looks like. If you’ve seen my earlier post on attributes in Q-SoPrA, you’ll notice that the relationships widget and attributes widget look very similar. Each incident in the data set can be inspected individually, with the information available on the incident displayed in the left half of the screen. With the navigation buttons at the bottom-left you can go through the previous or next incident as they appear in the overall chronological order (which you set in the data widget), you can jump to the previous or next incident that is marked (incidents can be marked and unmarked with the Toggle mark button), or you can jump to a specific incident by using the Jump to button.
In the right half of the screen you will see a relationships tree (when you start a new data set, this tree will be blank). This tree works a bit differently from the attributes tree. One important difference is that the relationship tree can only have two levels. The first level of the tree shows relationship types, and the children of each relationship type are instances of this type. As we can see in the screenshot below, we have two instances of the conceived of relationship type, one of which captures that the entity “Cooperative Manager” conceived of the entity “Moratorium Plan”, while the other captures that the entity “External members” conceived of the entity “General assembly proposal”.
We can learn more about the definition of this relationship type by hovering our mouse over it (or over one of its instances). A tool tip with a description of the relationship will appear, as shown in the screenshot below. The description will always start with an indication of the directedness of the relationship, which may be set to Directed or Undirected. This is followed by the definition of the relationship that was created by the user.
The directedness of relationship types is also visualised in the labels that represent instances of that relationship type. As you can see in the screenshot below, the labels of the two visible instances have a single arrow pointing from left to right, meaning that the relationship is directed from the entity on the left to the entity on the right. With undirected relationship types, there would be a double arrow pointing in both directions.
I assume that you are familiar with the idea of directedness in relationships (in the context of network analysis). If not, this probably means that you still need to familiarise yourself with the basics of social network analysis, and I would suggest taking a look at the book by Wasserman and Faust, which I think works great as an introduction, as well as an in-depth work of reference.
So how do we define relationship types in Q-SoPrA? This can be done by clicking the Add relationship type button. This will open a new dialog where the details of the relationship type can be written. The screenshot below shows an example of this, using the conceived of relationship type. As you can see, we can create a label for the relationship type, which is the label that is shown in the relationships tree. We are also required to offer a description of the relationship type, and we have to set its directedness. The same dialog will appear if you select an existing relationship type and click Edit relationship type in the relationships widget, but in this case the details of the existing relationship type will be shown in the dialog.
As you can see in the screenshot above, in the definition of this relationship type I have also indicated what types of entities can enter into this relationship (and in what role). In this case, I indicated that the source of the relationship should always be a (type of) actor, and the target should always be a (type of) plan. This implies that actors and plans are different types of entities in our data set, and in Q-SoPrA we can actually define different types based on attributes that we assign to entities. I’ll get back to this point further below.
After we have defined a new relationship type, no instances of this relationship type will exist yet. For this, we have to explicitly create new relationships. This can be done by first selecting a relationship type in the relationships tree, and then clicking the Add relationship button. This will open another dialog (see below). In the dialog we see a list of entities (this list can be filtered with the Source filter), and several controls.
The list of entities will be empty if you have not defined any entities yet. Let us assume for a moment that this is the case. Before we can actually create a new relationship, we need to define entities that can enter into that relationship. This can be done by clicking the Define new entity button. This will open another dialog where a new entity can be created, as shown below. You are always required to provide a name and a description for your entity. In addition, we can use an attributes tree to assign attributes to the entity. This works almost exactly the same as assigning attributes to incidents, except that we cannot associate any ‘raw text’ with entity attributes.
New attributes can be created in this screen as well. For this we would click the New attribute button. This will open yet another dialog (see below), but you’ll be pleased to know that we won’t open any other dialogs from here. The new dialog is a simple dialog where a label and description for the new attribute can be created. After creating the new attribute, it will appear in the attributes tree of the previous dialog. For this example, I have decided not to create the entity “Toby, the magic purple elephant” anyway, because it doesn’t really contribute anything to my case study.
If we have defined at least two entities, we can assign them to the relationship that we are creating (in Q-SoPrA, entities cannot enter into a relationship with themselves). If you look at the screenshot of the relationships dialog below, you’ll see that under the list of entities, there are two buttons: use as source and use as target. Below these buttons you’ll see a description of the relationship as it is currently defined. When no entities have been assigned yet, in place of the Source and the Target, it will simply say “-Unselected-“. Entities can be assigned by selecting them in the list, and then clicking one of the use as… buttons. This will also change the description of the relationship (see below).
After we have assigned entities to our relationship, we can save it. It will now appear as one of the instances of the selected relationship type in the relationships tree. Each relationship can only be defined once. I should note here that undirected relationships with the source and target switched around are treated as identical. So, a relationship like Wouter<–has contact with–>Toby is treated as being identical to Toby<–has contact with–>Wouter.
In addition to defining new entities, we can also edit existing ones from this dialog. This can be done by selecting an entity in the list and clicking the Edit highlighted entity button, or by clicking the Edit left assigned entity or the Edit right assigned entity buttons to edit entities that were already assigned to the currently inspected relationship. This will open the same dialog that is used for defining new entities, but with the details of the selected entity already filled out.
Assigning a relationship to an incident works the same as assigning attributes. In short, you select a relationship in the relationships tree, and then click the Assign relationship button to associate the relationship with the incident that is currently being inspected. It is also possible to associate a fragment of raw text (in the Raw field in the left half of the screen) with the relationship, by highlighting the text before clicking the Assign relationship button (this can also be done after the relationship is already assigned). If you want to disassociate a fragment of text from an assigned relationship, you can either select this fragment of text in the Raw field and then click the Remove text button, or you can click the Reset texts button, which will remove all fragments of text associated with the selected relationship and incident.
As with attributes, it is possible to navigate incidents via relationships that are assigned to them. This can be done by clicking a relationship, and then clicking the Previous coded or the Next coded buttons, which will jump to the previous/next incident that has the selected relationship assigned to it.
I think it is likely that you’ll create quite a large number of relationships if you’re going to make use of the relationships widget. This means that the relationships tree quickly becomes heavily populated. By grouping relationships under different types, it should be possible to keep an overview relatively easily. However, often it will be easier to simply filter the relationships by using the Filter relationships field. For example, if you’re looking for a relationship that involves a particular entity, you can type the entity’s name in this field, and Q-SoPrA will filter out all relationships that do not include this entity.
You can also add comments to relationships. These comments are associated with the relationship itself, not with specific relationship-incident pairs. This is just like writing a comment/memo, but in this case the comment/memo applies specifically to a relationship. This can be achieved by selecting the relationship in the list, then typing the comment in the Comment field, and clicking the Set comment button afterwards.
In the previous sections, I discussed how relationships can be defined, and then assigned to incidents. After you have identified relationships in your data, the more interesting thing to do is indeed to visualise them. I created the network graph visualisation widget for this purpose.
When you switch to this widget, it will initially be blank. To start plotting a network, you have to select a relationship type from a drop-down menu in the top left of the screen. In this menu, all relationship types that you have created are listed (see above). If you select one of them, you can then click the Plot new button. This will open a dialog that you can use to assign a colour to the relationship type you wish to plot (see below). By assigning different colours to different relationship types, we can distinguish between them in the visualisation. The colours belonging to different relationship types are listed in the legend, which can be opened by clicking the Toggle legend button at the bottom right of the network graph widget’s screen.
When you plot the network (I have chosen to plot the has contact with relationship, using the default black colour for the edges/relationships), it will appear in the draw screen (see below). The graph will initially just have a collection of unlabeled nodes (a selection of our entities) and the relationships between them. Q-SoPrA will always only show entities that are in a relationship that is currently visible (this also means that there can never be isolates in networks plotted by Q-SoPrA).
The default layout by Q-SoPrA is a ‘spring-like layout’. It is basically an intuitive layout algorithm that I quickly implemented as a temporary placeholder when I was creating this widget. However, with some improvements over time it actually turned out to function quite well, so I never bothered to find another algorithm to replace it with. I did add a second layout algorithm, which is the circular layout. As the name suggests, this will simply lay out the nodes in a circle.
The circular layout does in fact do a little bit more than that. If you have modes assigned to your network (discussed further below), the nodes in the circular layout will be sorted by mode.
It is possible to expand or contract the layout by using the appropriate controls (these can be found in the Controls menu, which can be opened by clicking the Toggle controls button). It is possible to drag around nodes by clicking and dragging them with your mouse cursor. If you select multiple nodes, you can drag them around as a group by holding the CTRL button while clicking and dragging. I implemented very basic collision detection to make sure that nodes push each other away when they bump into each other.
In our current plot, it is impossible for us to clearly identify our entities. We can improve the visualisation a bit by adding labels to the graph. This is also done from the Controls menu, where you will find a button Toggle labels, which can be used to show/hide node labels. Like the nodes, node labels can be dragged around individually by clicking and dragging them with your mouse cursor. However, wherever a node label is located relative to its ‘parent node’, it will always mimic the movements of that parent node. This allows you to change the position of the labels to make the graph more readable (see below).
There is also another way to see more details about the nodes. We can hover our mouse cursor over nodes to see a tool tip with their name and description, or we can open the Details menu (by clicking the Toggle details button) to see the details we have available on the nodes that are currently selected, including attributes assigned to them (see below).
The network we are currently visualising is quite dense. This is because nearly all actors have been in contact with each other at least once in the process. In this visualisation, all the moments that we observed that actors were in contact with each other are aggregated. It can be interesting to filter out certain episodes in the process, to see what the network looked like during a specific episode of the process.
Q-SoPrA does this in a rudimentary way, by allowing you to set upper and lower bounds for the incidents that should be included in the visualisation. Indeed, the incidents themselves are not directly visible in the graph, but the relationships that were assigned to incidents are. By changing the upper and lower bounds in the Controls menu, we can thus manipulate the visualisation to only show relationships that were assigned to incidents that fall within those bounds. In the screenshot below you can see that our network becomes sparser if we filter out some of the later incidents in our data set.
Filters can also be disabled for individual relationship types. This can be done by selecting the relationship type in the legend, and clicking the Filter off button. This means that all relationships of this type will be shown, no matter what bounds you have set in the Controls menu, effectively allowing you to filter some relationship types, while keeping others fixed. To set the filter on again for a relationship type, you select the relationship type in the legend, and click the Filter on button.
It is also possible to temporarily hide relationships of a certain type altogether, by selecting the relationship type in the legend and clicking the Hide button. Hidden relationships can be revealed again by clicking the Show button.
There are some basic ways to change the visualisation, in addition to a few more advanced ones that I discuss further below. A basic change we can make is to change the colour of the nodes, labels, as well as the background of the plot screen. This is all done using the appropriate controls in the Controls menu. Using one of the colour controls will open the colour-picking dialog that we have seen in an earlier screenshot.
We can also change the colour of the current relationship type by double clicking its colour in the legend. This will also open the colour-picking dialog, where you can select the colour you would like to change to. In the screenshot below, I have changed the colour of the nodes to red, I have changed the colour of the labels to blue, and I have changed the colour of the relationship type has contact with to orange. There is no other consequence of these changes beyond the visual ones.
Now let’s make things a little bit more interesting. I will add another relationship type, in this case the conceived of relationship type that we saw earlier. I can do this by selecting this relationship type from the drop-down menu in the top left of the screen, and by clicking the Add relationship type button (Clicking Plot new would just overwrite the current plot). As the colour of this relationship, I pick dark blue in this case.
Q-SoPrA will add the relationship type, as well as any new entities that this relationship type introduces to the network. The upper and lower bounds of the network filter will be reset, as well as the layout. The result can be seen in the screenshot below (I did adjust the position of the nodes and the labels already).
You can see that our legend now has two entries. You may also notice that two new entities have been added to the plot, namely the Moratorium plan and the General assembly proposal. If we inspected their details, we would see that I assigned the attribute Plan to both these entities.
So now we have a network with two quite different types of nodes, but we haven’t made this explicit yet in the visualisation. We can do this by assigning modes to the network (see this blog post if you don’t know what modes in a network are). In Q-SoPrA, modes can be assigned to nodes based on attributes that we associated with entities. One attribute that I used while creating entities is the Actor attribute. If I want to create a new mode based on this attribute, I can click the Create mode button near the top of the Legend menu. This will open a screen with our tree of entity attributes, where we can select an attribute, as well as a node colour, and a label colour to be associated with that mode (see below).
In the example above, I have selected to create a mode based on the Actor attribute, and I set the node colour to light blue, and the label colour to black. I used the same procedure to create a second mode, using the Plan attribute, but here I set the node colour to green, and the label colour to black. The results of these operations are shown in the screenshot below.
You can see that Q-SoPrA has automatically changed the colours of the nodes and the labels. Q-SoPrA does this by looking through all entities that were assigned the attribute that serves as the basis for a mode, and then classifying the matching entities in the appropriate mode. You will also see that we now have another legend, which lists the modes that we have just created (the top part of the Legend menu).
It is of course possible that an entity could be classified in multiple modes, based on the attributes assigned to it. However, any node in the network is allowed to be in only one mode at the same time. Q-SoPrA decides which mode a given entity should be assigned to based on the order in which the modes appear in the mode legend. Modes to the top of the list are always assigned first, which means that modes lower in the list may overwrite those higher in the list.
For example, if we now decide to create yet another mode, using one of the children of the Actor attribute (like Individual actor), we will see that some nodes that were previously in the Actor mode will now be in the Individual actor mode (see screenshot below).
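The legend-order logic can be sketched as follows. This is an illustrative reconstruction in Python, assuming a simple mapping from nodes to their attributes; Q-SoPrA's actual implementation may differ:

```python
# Sketch of attribute-based mode assignment (illustrative names and structures).
# Modes are applied in legend order; modes lower in the list overwrite earlier ones.
def assign_modes(node_attributes, mode_order):
    """node_attributes: dict mapping node -> set of assigned attributes.
    mode_order: list of (attribute, colour) pairs, top of the legend first."""
    modes = {}
    for attribute, colour in mode_order:
        for node, attrs in node_attributes.items():
            if attribute in attrs:
                modes[node] = (attribute, colour)  # may overwrite an earlier mode
    return modes

nodes = {"Older worker members": {"Actor", "Collective actor"},
         "John": {"Actor", "Individual actor"},
         "Moratorium plan": {"Plan"}}
order = [("Actor", "light blue"), ("Plan", "green"), ("Individual actor", "red")]
modes = assign_modes(nodes, order)
# "John" ends up in the "Individual actor" mode, since that mode is lower in the list.
```

Moving a mode to the bottom of the `order` list is exactly what the Up and Down buttons described below let you do.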
By now, we have a three-mode network, although technically we don't. Yes, that is confusing...
The reason why I say this is that in graph theory, two-mode networks are typically only considered to be two-mode networks if they are structurally bipartite. A network is bipartite if relationships only exist between nodes of two different modes, and never between nodes of the same mode.
In other words, whether or not a network is bipartite is essentially a structural question; we could determine that a network is bipartite by only looking at the patterns of relationships in the network, and without knowing anything about attributes that were assigned to the nodes.
It is of course possible that the graphs you create in Q-SoPrA are also bipartite (or maybe even tripartite) in this structural sense, but this makes no difference for how modes are interpreted in Q-SoPrA, which is purely attribute-based.
Since my last post, on attributes in Q-SoPrA, I have actually implemented a new feature, which allows you to change the order in which modes are assigned in the network graph widget and the event graph widget (discussed briefly in my last post). We can do this by selecting a mode, and then using the Up or Down buttons to change its position in the list. So, if we move the Actor mode to the bottom of the list, this means that the Individual Actor mode will be overwritten, as shown below. This allows you to control in a bit more detail how modes are assigned to nodes in the network.
This might be a good place to also write something about mode transformations, but with our current graph, there would not be an interesting transformation to look at. Let us therefore first add a little bit more complexity to our graph.
Q-SoPrA allows you to visualise parallel edges between nodes. These exist when multiple relationships exist between the same pair of nodes. To demonstrate this, I will add three additional relationship types to the graph, showing which actors have (1) shown support for certain plans, (2) offered resistance against them, or (3) shown a somewhat neutral interest in plans (that is, not being explicitly supportive, but also not dismissive of a plan).
The resulting network is shown below. The graph is quite complex, since all relationships observed over the entire case study period are visualised. Some actors changed their stance towards plans over time. For example, older worker members were first resistant against the Moratorium plan, but later on started showing mild interest, before finally giving the plan full support. To see this development, we could simply change the lower and upper bounds of the network visualisation to filter the network, but in this case I want to demonstrate how parallel edges are visualised.
You will see that three edges are directed from the Older worker members entity to the Moratorium plan entity, capturing the three different attitudes that the former have had towards the latter. Q-SoPrA will automatically increase the ‘height’ of edge curves if another edge between the corresponding pair of nodes is already visible.
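The idea of increasing curve heights for parallel edges can be sketched like this (an illustrative Python reconstruction, not Q-SoPrA's actual rendering code):

```python
# Sketch of how parallel edges could be given increasing curve heights
# (illustrative; the real rendering logic may differ).
from collections import defaultdict

def curve_heights(edges, step=20):
    """Assign each edge a curve height; parallel edges between the same
    (unordered) pair of nodes get successively larger heights."""
    seen = defaultdict(int)
    heights = []
    for source, target in edges:
        pair = frozenset((source, target))
        heights.append(seen[pair] * step)
        seen[pair] += 1
    return heights

edges = [("Older worker members", "Moratorium plan"),
         ("Older worker members", "Moratorium plan"),
         ("Older worker members", "Moratorium plan"),
         ("Union", "Moratorium plan")]
print(curve_heights(edges))  # [0, 20, 40, 0]
```

The three edges between the same pair get heights 0, 20 and 40, so each curve arcs above the previous one.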
Now that we have a slightly more complex network, more interesting options for performing mode transformations arise. I think of mode transformations as a way of inferring ‘latent’ (not empirically observed) relationships from empirically observed ones.
For example, imagine that we have a case study, in which we empirically observe (1) that some actors communicate with each other, and also (2) that some actors organise activities together. It is possible that we did not empirically observe some actors that co-organise activities also communicating with each other. However, it would actually make sense to assume that they did, because how will you co-organise something without getting in touch with each other?
We could of course solve this by assuming that co-organisation entails communication, and thus assigning both relationship types to an incident whenever that incident describes an instance of co-organisation of activities. However, to keep things simple, I like to stick to relationships that I can observe more or less directly in my data, and it can be quite difficult to keep the level of concentration required for this type of double-coding. Therefore, I instead choose to tell Q-SoPrA that whenever two actors co-organised something, that must mean that they have been communicating as well. This is what mode transformations can be used for.
In the example case study that we have here, we’ll look at a slightly different situation. We now have a network with several actors who have relationships to each other, but also to plans that have been conceived during the process of interest. One thing we could do is try to identify coalitions around plans, based on the attitudes that actors have towards them.
One way to do this would be to say that actors are in a coalition if they support the same plans, or if one of them conceived of a plan, and the other supports it. Creating a new relationship based on these situations will involve multiple transformations.
Let us first simplify our network a bit by removing some of the relationships that we won’t be working with, that is, only keeping the conceived of and supports relationships. We can do this by selecting them in the edge legend, and then clicking the Remove button at the bottom of the Legend menu. See the result below.
Our next step could be to create a new relationship type based on situations where one actor conceived of a plan, and another actor supports that plan. After that we will create yet another relationship type for situations where two actors support the same plan. We will later merge these relationships into one. Since it is not possible to create a new relationship type with a label that is already taken by an existing one, I will first create the two new relationship types, and only then merge them.
As you will see, merging two relationships will actually remove their original versions from the plot (but not from the data set, unless the relationships were created through transformations, as is the case in this example).
So let’s start with the first new relationship type. To create it, I click the Multimode trans. button (short for Multimode Transformation). This will open the dialog shown below (where I have already chosen my settings).
This dialog can be quite confusing when you first use it, or when you are unfamiliar with mode transformations (also, I take a somewhat different approach than is commonly used in other software, possibly adding to the confusion).
We first have to select the modes that will be part of this transformation. We want to create a new relationship type between nodes of the mode Actor, based on the relationships that these nodes have to nodes of the mode Plan. The first mode to select in this dialog (Mode one) should always be the mode among which the new relationship type can exist. So, in this case we set Mode one to Actor. The second mode should always be the mode to which nodes of the first mode might have a shared relationship (this is exactly like co-affiliation), so in this case we select Plan for Mode two.
We are not finished yet. We also have to set the relationship types that are to be considered for this transformation. It is possible that our nodes in the Actor mode have different types of relationships to nodes in the Plan mode, which is actually true in this case: Some actors conceive of plans, while others support them. In this case, we set Relationship ego to supports, and we set Relationship alter to conceived of. Why does it matter which relationship type we assign to ego or alter? Well, in this particular case it actually doesn’t matter. However, imagine that we were creating a relationship type called is supportive of, which is a directed relationship from actors that support plans to actors whose plans they are supportive of. In this case, we would thus want to create a directed relationship type from ego to alter, and then it matters which relationship types are set for ego and alter in the multimode transformation dialog.
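The transformation itself can be thought of as a small projection over actor-plan ties. The sketch below is an illustrative Python reconstruction under assumed data structures, not Q-SoPrA's internals:

```python
# Sketch of a multimode transformation (illustrative, not Q-SoPrA's actual code).
# We infer an actor-actor tie whenever an ego with the 'supports' relationship
# and an alter with the 'conceived of' relationship point at the same plan.
def multimode_transform(ties, rel_ego, rel_alter, directed=False):
    """ties: list of (actor, relationship, plan) triples.
    Returns the set of inferred (ego, alter) pairs sharing a plan."""
    inferred = set()
    for ego, r1, plan1 in ties:
        for alter, r2, plan2 in ties:
            if ego != alter and plan1 == plan2 and r1 == rel_ego and r2 == rel_alter:
                # For undirected ties, store the pair in a canonical order.
                pair = (ego, alter) if directed else tuple(sorted((ego, alter)))
                inferred.add(pair)
    return inferred

ties = [("Union", "supports", "Moratorium plan"),
        ("Older worker members", "supports", "Moratorium plan"),
        ("Management", "conceived of", "Moratorium plan")]
coalition = multimode_transform(ties, "supports", "conceived of")
```

Setting both `rel_ego` and `rel_alter` to `"supports"` would instead yield ties between actors that support the same plan, which is the second transformation we perform in this example; with identical relationship types on both sides, a directed version would indeed make no sense.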
If this is not clear, then simply trying out different settings will probably clarify things.
When we have set the relationship types, we get to the easy part: We need to create a label for the new relationship type, as well as a description. Finally, we need to choose the directedness of the relationship, which in this case we set to Undirected. Now we can save our new relationship type (we will be asked to pick a colour for the new relationship type), and it will be added to the graph. See the result below.
The multimode transformation dialog will only allow you to pick between modes and relationship types that are visible in the current network. This helps to simplify the procedure somewhat.
So, the first step is done. You will see that I have added a new relationship type with a pink-ish colour, and which exists between pairs of actors where one of those actors conceived of a plan, and the other actor supported that plan.
Now let’s add the other relationship type. We again open the multimode transformation dialog. We set the modes in the same way we did the last time, but now we set the relationships of both ego and alter to supports (see below).
As you will see in the screenshot of the dialog above, with these settings the options for choosing the directedness of the relationship are not available. Since we have set the relationship type for both ego and alter to the same type, it would not make sense to create a directed relationship based on this transformation. Essentially, we do not even have a way to really distinguish between an ego and an alter in this case.
For our second relationship type, I have chosen the colour light blue. The result of our second transformation is shown in the screenshot below.
I think the graph is quite pretty like this, but it is not really easy to read. Also, our two relationship types still exist separately from each other. To merge the two relationships, we can use the Merge option near the bottom of the Legend menu. This will open the dialog shown below.
In this case we are shown a list of check boxes, where each check box represents one of the currently visible relationship types. We can use the check boxes to select the relationship types that we wish to merge (we can select more than two). It is not possible to merge relationships that have a different directedness (directed vs. undirected).
We are also asked to provide a label and a description for the new relationship type. Then, after saving the new relationship, we are again asked to pick a colour (I picked the colour black this time), and the new relationship will be added in the graph. At the same time, the relationships that we merged will be removed. See the result below.
You can see that our network has already become much simpler after the merger. If we are interested in looking specifically at coalitions, we can hide the other relationship types to simplify things even further. We select them in the edge legend, and click the Hide button for each.
We can now see our new relationship type more clearly. We can see a coalition around the General assembly proposal in the bottom, and we can see a coalition around the Moratorium plan in the top. We can also see that there is one actor that appears in both coalitions (Older worker members).
While making these transformations, we did actually also lose some information. For example, the newly created relationship types are no longer filterable, because they are not themselves associated with any particular incidents. Thus, if you want to create such networks for a more specific episode of the process, you will have to filter the network from the very beginning. If you wish to compare the network at different points in time, you will have to repeat this procedure several times.
We have now discussed quite a few details about how relationships can be used in Q-SoPrA. One thing that we haven’t discussed is the possibility to calculate network metrics. I will immediately say that it is not possible to do this in Q-SoPrA… yet. I do have long-term plans to support the calculation of network metrics, but there are a couple of difficulties that are hard to overcome. For example, in Q-SoPrA you can create quite complex networks, with multiple types of relationships that may differ in directedness. It will be a challenge to create an interface for network metrics that takes all these nuances into account. I want to prevent a situation in which one calculates network metrics that actually do not make a lot of sense for the type of network structure under consideration (for example, certain network measures for one-mode networks have to be adapted before they can be applied to two-mode networks).
For now, I work with a much simpler solution: I allow the user to export network data from Q-SoPrA, which can then be imported into other software packages for further analysis. I have chosen to allow exports of node lists and edge lists that are structured in such a way that they can be imported directly into Gephi. I have two main reasons for this. First, Gephi is open source software, and I prefer to support an open source solution over a commercial one. Second, data that is imported to Gephi can be exported in many different formats. Thus, I see Gephi as a kind of gateway to other software packages. In addition, the edge list format for Gephi is quite close to the edge list formats used by some other software packages.
Exporting network data can be done from the Controls menu, using the Export nodes button and the Export edges button. The latter option will immediately open a dialog where you are asked to select a location and a name for your edge list, which is exported in CSV-format. If you click the Export nodes button, you will first be shown a table that shows the node list that will be exported. This list will include the Id of the nodes, the Label of the nodes (the labels are actually identical to the Ids, but this has something to do with how I typically structure the node lists I import into Gephi), the Description of the entities associated with the nodes, and the Mode that the nodes are in (this will be blank if no modes were assigned).
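The exported files are plain CSV with Gephi's expected column names. A minimal Python sketch of what such an export might look like (the exact columns Q-SoPrA writes may differ slightly; the data here is invented for illustration):

```python
# Sketch of writing Gephi-style node and edge lists as CSV files.
# Column names follow Gephi's import conventions; everything else is illustrative.
import csv

nodes = [("Union", "Union", "A labour union", "Actor"),
         ("Moratorium plan", "Moratorium plan", "A plan", "Plan")]
edges = [("Union", "Moratorium plan", "Undirected", "supports")]

with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Label", "Description", "Mode"])  # Label mirrors Id
    writer.writerows(nodes)

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Type", "Label"])  # Type: Directed/Undirected
    writer.writerows(edges)
```

Gephi can read both files directly through its spreadsheet import, and from there the data can be re-exported in many other formats.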
In this screen, you can add additional variables to the node list by clicking the Add attribute button. Attributes can be added as a boolean, or as a valued variable. Boolean variables simply indicate whether or not a given attribute was assigned to a node. When you select to add an attribute as a boolean, then a 1 will be inserted in the new column if (a) the selected attribute was assigned, or (b) one of the children of the selected attribute was assigned. A 0 will be inserted in all other cases. If you select the option to add valued variables, then only the values assigned to the selected attribute will be considered (and not its children). Assigning values to attributes is not something I discussed in this post, but it works exactly the same as explained in my earlier post on attributes.
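The boolean logic, including the check on children of the selected attribute, can be sketched as a simple reachability test over the attribute hierarchy (illustrative Python, with hypothetical data structures):

```python
# Sketch of the boolean-attribute logic (illustrative, not Q-SoPrA's code).
# A node scores 1 if the selected attribute, or any of its descendants,
# was assigned to it; 0 otherwise.
def descendants(attribute, parents):
    """parents maps each attribute to its parent ('NONE' for top level).
    Returns the attribute itself plus all its (grand)children."""
    found = {attribute}
    changed = True
    while changed:
        changed = False
        for child, parent in parents.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

def boolean_value(node_attrs, attribute, parents):
    return 1 if node_attrs & descendants(attribute, parents) else 0

parents = {"Actor": "NONE", "Individual actor": "Actor", "Collective actor": "Actor"}
print(boolean_value({"Individual actor"}, "Actor", parents))  # 1
print(boolean_value({"Plan"}, "Actor", parents))              # 0
```

A node assigned only Individual actor still scores 1 on an Actor boolean, because Individual actor is a child of Actor.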
In the screenshot below, you see an example where I have added two attributes to the node list that indicate, respectively, whether a node has the attribute collective actor and whether a node has the attribute Individual actor. After adding attributes, the final node list can be exported as a CSV-file by clicking the Export button. This will also automatically close this dialog.
It is also possible to export the visualisation of the network graph you have created. This can be done by clicking the Export svg button in the Controls menu. This will open a dialog where you are asked to provide a name and location for the SVG-file. The SVG-file can be opened with Inkscape, which is another wonderful open source tool. In Inkscape you will be able to manipulate your graph further (each element of the graphic can still be manipulated individually), and then export it as (for example) a PNG file or a PDF, with several options to determine the resolution of the exported graphic.
This concludes my discussion of relationships in Q-SoPrA. I hope it has given you some ideas about how Q-SoPrA can be used for (mostly qualitative) network analysis.
One of the tasks that you are likely to use Q-SoPrA for is the qualification of incident data, by assigning codes to them (see this earlier post for an explanation of what incidents are). The tools you will use for this are not very different from those that you would typically use to assign codes to interviews or the contents of documents when using CAQDAS software. In Q-SoPrA, the codes that are used to qualify incident data are referred to as attributes.
You can use attributes to capture any information included in incident data that is meaningful for your study. For example, you could use attributes to distinguish between different types of activities, to identify actors or other entities involved in incidents, to record changes in variables associated with the incidents, and so on. What exactly you will use attributes for will depend heavily on your research question(s), and the theoretical basis of your research. You might of course also use attributes in a ‘grounded research’ approach, where you develop attributes on the fly, and develop these further as you proceed with your project. In this post I try to be mostly agnostic to different ways in which attributes might be used, while discussing various ways in which attributes (can) play a role when using Q-SoPrA in research projects.
In this post I discuss four main topics. I first discuss some basic, somewhat technical details of what attributes in Q-SoPrA are, how they can be organised, and how they are stored, and so on (but this is really quite simple). I then discuss different ways in which attributes can be assigned to incidents (and more abstract events; more on this later). This is followed by a brief discussion of various other simple ways to interact with attributes. Finally, I discuss some ways in which attributes can be used in visualisation and analysis.
Attributes have to be defined by the user. There are three different widgets in which this can be done, but the most obvious ‘place’ to define new attributes is the Attributes Widget (see screenshot below).
On the left side of this screen you see some fields that show the data associated with the incident that is currently being inspected (the timing, source, description, raw source text, and other comments/memos written by the analyst). On the right side of the screen you can see the attributes view. When you create a new database, this view will be empty. In the bottom-right of the screen you see all controls that are specifically associated with attributes. For example, you find controls to define new attributes, to edit existing ones, to assign the currently selected attribute to the present incident, and so on.
We will focus first on the creation of new attributes. Attributes in Q-SoPrA can be understood to have three main properties: a label, a description, and (optionally) a parent attribute.
To create a new attribute, click the New Attribute button. This will open the dialog shown below.
You are always required to provide a label and a description for your attributes. If one of these is missing, then Q-SoPrA won’t allow you to save the attribute. Also, the name of the attribute has to be unique. The idea behind forcing you to provide a description is to make you think immediately about the definition of your attribute, and about the phenomena that your attribute is supposed to capture.
I mentioned above that attributes can have parents. That is, attributes in Q-SoPrA can be hierarchically organised by assigning parents/children to them. A parent is automatically assigned to a new attribute if you select an existing one before clicking the New Attribute button. In this case, the new attribute will be created as a child of the existing attribute you selected. You can re-parent attributes by dragging and dropping them onto other attributes. If you drag an attribute to an empty space in the attribute view, the attribute will be ‘orphaned’, and appear in the highest level of the attribute hierarchy. In this way, you can create as many hierarchical levels of attributes as you like.
The position of an attribute in the hierarchy has important consequences. An attribute that has a parent is always considered as a sub-type of that parent. For example, if you have an attribute called Activities, and you have another attribute called Night-time activities that is a child of Activities, then Q-SoPrA assumes that Night-time activities are a kind of Activities. In this way, any attribute can serve as a sort of category for other attributes.
The specific organisation of your attributes hierarchy will be important in, among other things, the identification of ‘modes’ in your event graph (modes are discussed in a bit more detail further below): When you create a new mode based on the attribute Activities, then all incidents and events that were assigned an attribute that is a child of Activities (or grandchild, or great-grandchild, or great-great… Well, you get the idea) will also be considered to belong to that mode.
Like all other data imported into, or created with, Q-SoPrA, attributes are stored in SQL tables. One of these tables simply lists the names, descriptions and parents of attributes. The parent of an attribute is either (1) one of the other attributes, or (2) a string called 'NONE', which tells Q-SoPrA that the attribute exists at the highest level of the attributes hierarchy.
After you create an attribute, you can assign the attribute to incidents in your data set. The attributes widget lets you scroll through all incidents in your data set, and assign attributes to them. Assigning an attribute to an incident can be done by first selecting it in your attributes tree, and then clicking the Assign attribute button.
If an attribute has been assigned to the incident that you are currently inspecting, it will show up in the attributes tree with a bold font. If an attribute has a child that has been assigned to the currently inspected incident, then it will show up in the attributes tree with an italic font. If an attribute is assigned to the incident, and also has a child that was assigned to the same incident, it will show up with a font that is both italic and bold.
After assigning an attribute, it is also possible to give the attribute a value. This value can be anything from a numeric value to a string of words. Assigning a value can only be done after assigning the attribute itself. The value can be typed into the appropriate field (the Value field below the attributes tree), and then stored by clicking the Store value button (which is greyed out in the screenshot above).
Combinations of attributes and incidents are stored in a separate SQL table. Each row in this table records (1) the name of the attribute, (2) the ID of the incident that the attribute was assigned to, and (3) the value of the attribute (if one was assigned).
You may have noticed in the screenshot above that a small fragment of the text in the Raw field is highlighted (underlined and bolded). As I explained in a previous post, the Raw field is used to record any fragments of text from your sources (for example, interviews, documents, news items, and so on) that were the basis for creating the incident. Q-SoPrA allows you to associate fragments of text in the Raw field with attributes that you assign to incidents (you can also assign an attribute without highlighting text). Associating a fragment of text with an attribute can be achieved by (1) selecting the attribute, (2) selecting a piece of text with your mouse cursor, and (3) then clicking the Assign attribute button. Even when an attribute has already been assigned, you can assign additional fragments of text to the attribute by using this procedure. If fragments of text are assigned to an attribute, these will be highlighted whenever the attribute is selected.
The fragments of text assigned to combinations of attributes and incidents are stored in yet another SQL table. In this case the rows of the table record (1) the name of the attribute, (2) the name of the incident that the attribute was assigned to, and (3) the fragment of text that is associated with the attribute-incident combination. Multiple fragments of text may exist per attribute-incident combination.
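Putting the three tables together, the storage scheme might look roughly like the following. The table and column names here are my own illustrative guesses, not Q-SoPrA's actual schema:

```python
# Sketch of the three SQL tables described above, using an in-memory SQLite
# database. Table and column names are illustrative, not Q-SoPrA's real schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE attributes (
    name TEXT PRIMARY KEY,
    description TEXT,
    parent TEXT       -- another attribute's name, or the string 'NONE'
);
CREATE TABLE attribute_assignments (
    attribute TEXT,
    incident INTEGER,
    value TEXT        -- optional value; may be empty
);
CREATE TABLE attribute_texts (
    attribute TEXT,
    incident INTEGER,
    fragment TEXT     -- multiple rows allowed per attribute-incident pair
);
""")
con.execute("INSERT INTO attributes VALUES ('Actor', 'Any actor', 'NONE')")
con.execute("INSERT INTO attributes VALUES ('Individual actor', 'A person', 'Actor')")
```

The 'NONE' parent marks top-level attributes, while any other parent value places the attribute one level down in the hierarchy.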
There are two other ‘places’ in Q-SoPrA where attributes can be assigned to incidents (or events). One of these places is the Event Graph Widget. I won’t go into the details of the Event Graph Widget here, but, briefly, the Event Graph Widget is what you will use to visualise (1) incidents and/or events, and (2) relationships between these incidents/events. An event graph gives an abstract overview of the process of interest, and how different developments in the process are related.
You can also use the Event Graph Widget to build more abstract events from your incident data. I will dedicate a future post to an explanation of how exactly this works, but the basic idea underlying the creation of abstract events is that you take a group of inter-related incidents, and say that these together constitute some larger event. For example, we might have a few incidents that capture meetings between actors ‘X’ and ‘Y’ that took place over time, and group these together in an event that we describe as ‘X and Y meet over an extended period of time.’ Abstract events are different from your incidents, because they are constituted from incidents, and thus exist at an additional level above your data (or even more than one level, because you can also use abstract events as building blocks for yet more abstract events).
In the screenshot below you see an example of an event graph that includes incidents, as well as an abstract event that has been built from incidents.
In the screenshot above, notice how the labels of most nodes are numbers, whereas there is one node (the wide ellipse) that has the label "P-3". This node is an abstract event, and the labels of abstract events are based on what type of constraints were used when creating them. I will discuss the details of the possible constraints in a future post.
If you are curious, check out Peter Abell's approach to abstracting narrative graphs, as outlined in his book "The Syntax of Social Life" (1987), which is unfortunately very hard to get hold of for a reasonable price (I got lucky one time). I based two types of constraints on Abell's approach: (1) path based constraints (which are quite severe) and (2) semi-path based constraints (which are far less severe). The label of an abstract event will always start with a "P-" if it was created with path based constraints, and it will always start with an "S-" if it was created with semi-path based constraints.
In the left part of the screenshot, you see some details of the currently selected incident (the same details you would see in the Attributes Widget), as well as an attributes tree. In the Event Graph Widget, you can select any incident, and then assign attributes to them in the same way you would do this in the Attributes Widget.
One important difference is that, in this case, you can also assign attributes to abstract events. This works largely in the same way (you select an abstract event in the event graph, after which you can assign attributes), with the exception that you cannot associate a particular fragment of text with the attribute-event combination. This is because the abstract events don’t have ‘raw’ data directly associated with them (indeed, ‘raw’ data can still be indirectly associated, through the incidents that constitute abstract events). Abstract events therefore don’t have a Raw field (see the screenshot below).
Say that you have created an event graph in which you have abstracted numerous events. In the event graph, the incidents and events that were made into components of more abstract events are no longer visible themselves. This is a problem when you want to inspect those incidents and events, or when you want to change something in the attributes that you assigned to them. This is why I introduced Hierarchy Graphs to Q-SoPrA. Hierarchy graphs visualise the complete hierarchy of a given abstract event in your visible event graph. A hierarchy graph for a given abstract event can be inspected by selecting the event in your event graph, and then clicking the Components button (see screenshot below).
Clicking this button will cause Q-SoPrA to switch to the Hierarchy Graph widget, where the hierarchy of the currently selected event will be visualised.
In the screenshot above, we are inspecting the hierarchy of abstract event “P-3”, which consists of two other abstract events “P-1” and “P-2”, which, in turn, each consist of three incidents (P-1: 14, 15, 16, and P-2: 17, 19, 20).
In this screen, we can select all visible events individually. As you can see, we again have the attributes tree available to the left of the screen, and we can assign attributes in the same way we would in the Event Graph Widget. Hierarchy graphs are thus the third ‘place’ where you can assign attributes to incidents (and events).
Of course, assigning attributes to incidents is not the only type of interaction with attributes that we need. Occasionally, we will want to edit the name or the description of an attribute, which can be done by selecting the attribute in the attributes tree, and clicking the Edit attribute button. This will simply open the dialog that we have seen before (when creating a new attribute), but with the details of the currently selected attribute already filled in.
If we want to unassign an attribute, we just select it in the attributes tree, and click the Unassign attribute button. We can also remove all attributes that are not currently being used by clicking the Remove unused attributes button.
I did not include the possibility to simply remove an attribute by clicking some kind of delete button. I thought that having this option is risky, because you might delete an attribute that you assigned to some other incident for good reasons that you have since forgotten about. The only way you can thus get rid of attributes for good is by unassigning them from all incidents, and then clicking the Remove unused attributes button.
If there are fragments of ‘raw’ texts associated with the attribute, we can either select one of these fragments and click the Remove text button to remove an individual fragment, or we can simply click the Reset texts button to remove all fragments associated with the selected attribute (of course, this will only remove fragments that are associated with the combination of this attribute and the currently inspected incident). Any fragments of ‘raw’ text associated with an attribute will also be removed when you unassign the attribute.
We can also navigate incidents via attributes, by selecting an attribute, and clicking the Previous coded or the Next coded button, which will jump to the previous/next incident that was assigned to the selected attribute (or one of its children).
One other thing you’ll notice in the screenshot above is the tool tip that is displayed when you hover your mouse over an attribute in the attributes tree. The tool tip will show the description that you gave the attribute, making it easy for you to read the description without having to open another dialog.
There are a few other widgets in which attributes appear. These are tables that, for your convenience, present information that we have already seen in a somewhat different form. For example, in the screenshot below you see a table of attributes, the incidents they were assigned to, and the fragments of ‘raw’ texts associated with them.
This particular table is mostly useful for a comparison of the various fragments of ‘raw’ text that are associated with an attribute. Based on such comparisons you can make decisions about, for example, the appropriateness of the name and description that you assigned to a given attribute. This is particularly useful in more grounded approaches to coding, where you will be revising your coding scheme (or, in this case, attribute scheme) repeatedly as you progress with your study.
In this table, incidents are identified by a number that corresponds to the order in which incidents appear (this is used as the primary way to identify incidents to the user throughout Q-SoPrA). Admittedly, this number alone won’t tell the user very much, but as the screenshot above demonstrates, the description that was given to incidents can be revealed by hovering the mouse cursor over them.
It is possible to filter this table by using the filter field below the table. You can apply this filter to any of the visible table columns. The table can also be sorted by double clicking one of the column headers, and the order of columns can be changed by dragging columns to another place. Fragments of text can be removed by selecting a row in the table and clicking the Remove selected button. Attributes can be edited by selecting a row in the table and clicking the Edit attribute button. Finally, the table currently being shown can be exported by clicking the Export visible table button.
Another table in which attributes appear is shown in the screenshot below. In this case, the table shows attributes, the incidents that they were assigned to, and the values that were given to the attributes. The screenshot shows an example where I am inspecting an attribute that I used to map projects to which the activities recorded in incidents are related. To do this, I simply created an attribute called project, and I gave values to the attribute to indicate which project a given incident (activity) belongs to.
Like the previous table, this table can be filtered. Values that were assigned to attribute-incident combinations can be changed in this table by clicking the Edit value button. It is also possible to export a matrix of incident-attribute combinations. The exported matrix will be an incidence matrix, where the incidents appear in the rows, and the attributes appear in the columns. If you choose to use the Export normal matrix option, the cells of this table will show a 1 if the corresponding incident-attribute combination exists, and a 0 if it doesn’t. If you choose to use the Export valued matrix option, the cells will instead show the values that were assigned to incident-attribute combinations, wherever these are available. If no value was given, then the table will again show either a 1 or a 0, depending on whether the corresponding incident-attribute combination was observed.
The final table I will discuss here is the one shown in the screenshot below. This is actually not a table in which attributes appear. Instead, it is a table that shows all incidents that do not have an attribute assigned to them. When you are coding a large data set, you are likely to at some point overlook something. For example, you may have inserted a new incident somewhere in your data set at some point, but you have forgotten to assign attributes to it. The table shown below makes it easier to spot any incidents that you may have forgotten about. In the example offered in the screenshot, the incidents shown are simply the incidents that I haven’t gotten to yet in the coding process.
This table has one special feature, which is the last column of the table, called Marked. In Q-SoPrA all incidents can be marked, which simply makes it easier to find them again when using one of the coding widgets (in these widgets, you have the possibility to jump back and forth between incidents that are marked, skipping all other incidents). In this table, you can mark an incident by clicking its corresponding check box in the last column.
So, now we get to the last major part of this post, which is about how to use attributes in visualisations, which can be one step in your analysis of the data.
We have already seen the Event Graph Widget and how we can assign attributes to incidents/events there. Attributes can also be used to manipulate the visualisation of event graphs. One thing we can do is to create modes in the network based on attributes. In Q-SoPrA, a mode can be understood as a special identifier for nodes in the event graph (modes also appear in network graphs, but this is yet another topic that I will need to discuss in a future post). Any node (incident or abstract event) in the event graph can only have one mode at a time. What mode a given node belongs to is made visible by giving it a distinct colour.
In order to achieve this, the user has to create a mode. This can be done from the legend menu (see screenshot below). In the screenshot the legend is currently empty, because no modes have been created yet. A new mode can be created by clicking the Create mode button (in the bottom-right area of the screenshot). This will open a dialog with the attributes tree, as well as a few buttons that can be used to choose a colour for the mode (colours can be chosen for the nodes and for the labels).
After choosing an attribute and the colours to be associated with it, we can save the new mode. Q-SoPrA will immediately identify all nodes in the graph that were assigned the selected attribute, or one of its children. I emphasise the last bit, because it is important to realise that modes in the event graph are inclusive in the following sense: A mode is identified by an attribute, as well as by all children of that attribute. Thus, in the screenshot below, all the red coloured nodes can be understood as incidents in which some kind of meeting took place, without specifying what type of meeting we are referring to. A label identifying the newly created mode will also be shown in the legend, to the right of the screen.
Modes are, in a sense, also hierarchical. In deciding what mode an incident or abstract event should be assigned to, Q-SoPrA will always work its way from the top of the list of modes (the legend) to the bottom. Any modes that appear near the bottom of the list will overrule modes that appear above them. For example, we could add an additional mode that refers to a more specific kind of meeting (a child of the Meetings attribute).
In the screenshot above, you will see that two nodes have been switched to another mode (incidents 22 and 24). In this case, the attribute Knowledge-oriented meetings overruled the attribute Meetings in deciding which mode should be assigned to these incidents.
So what happens if we add a mode that is associated with a parent of the Meetings attribute, rather than a child? Let’s test this by creating a mode based on the Organizational activities attribute, which is the parent of the Meetings attribute.
As you see in the screenshot shown above, this causes the earlier created modes to be overruled. Imagine that, at this point, we realise that we have made a mistake; that it was never our intention to overrule our first two modes. We can easily recover from this situation by selecting the last mode we created, and clicking the Remove mode button. This will remove the mode from the event graph, and Q-SoPrA will automatically reassign the earlier created modes, basically reverting the graph to the version we saw earlier.
I plan to make playing around with the modes even easier, by adding the option to move modes up or down in the hierarchy. This will be implemented in a future version of Q-SoPrA.
These modes are not just for visualisation purposes. One feature of Q-SoPrA that makes use of modes is the creation of transition matrices. These matrices show how often an event of a given mode has a relationship to an event of another given mode. Making such matrices can tell you a lot about how different types of events in your graph tend to be related. For example, it can give you crude measures of how likely an event of a given type is to be followed by various other types of events.
In the screenshot below, I have assigned modes that cover nearly all of the incidents visible in the current screen.
What I can do now, is to click the button Export transitions. This will open a new dialog (shown below), where I can choose what kind of transition matrix I want to export.
Now, the first two options shown here (modes vs. attributes) require a bit of explanation. In the above, we have learned that any incident can have only one mode at a time. However, an incident can of course have multiple attributes, and sometimes we will want to take all attributes into account when calculating transitions, rather than taking into account only the currently assigned modes. This is basically what you can do by selecting the Attributes based option in the dialog shown above. In this example, we’ll go with the Modes based option instead.
The other two options are much more straightforward: You can either (1) export a transition matrix with the raw counts of the observed transitions, or (2) export a transition matrix where these counts were divided by the row marginals (the number of times the source event of the transition appears), which converts the value into a crude measure of probability. We’ll go with the latter option.
Q-SoPrA will now ask us what we want to name our file (which is a csv-file), and where we want to store it. After saving the file, we can open it with a spreadsheet editor, like LibreOffice Calc or Microsoft Excel. In the screenshot below, I show the matrix I exported from the last version of the event graph. I slightly edited the matrix before making the screenshot, adding bold fonts to the rows and columns, making the final row and column italic font, and formatting the transition probabilities to show 4 decimal numbers.
As you will see below, the transition matrices that Q-SoPrA exports always contain a column with the row marginals. Thus, if you export a transition matrix with the raw counts, then you can easily calculate probabilities yourself by dividing these counts by the row marginals.
In addition to the row marginals, the transition matrices will always include a row with the number of times the different event types participated in transitions (as the source event).
One of the things we see in this table is that, based on the observed event graph (which I filtered to only show the first 26 incidents), there is a probability of 0.6 that an event of the mode Meetings followed an event of the same mode. Given that the row marginal for this mode is 10, this must mean that this particular transition was observed 6 times (6/10 = 0.6). We also see that it is as likely that an event of the mode Planning follows an event of the mode Communication as it is to follow an event of the mode Group formation (0.5 probability in both cases). You can do a visual check of our screenshot of the event graph to confirm this.
Of course, with this small number of observations, these numbers are unlikely to be very meaningful. However, I hope the example demonstrates how transition matrices can be useful in summarising patterns in your data. Patterns like these will be especially hard to detect without transition matrices once you start considering larger numbers of incidents/events.
The last thing I will discuss here is the use of attributes in the creation of Occurrence Graphs. Occurrence graphs visualise which attributes co-occur at which points in time. Attributes can co-occur, for example, because they are assigned to the same incident, because they are assigned to incidents/events that were grouped in the same abstract event, or because they are assigned to incidents that occurred in parallel (incidents and events can be made parallel in the Event Graph Widget).
Occurrence graphs are basically a slightly adapted variant of Bi-Dynamic Line Graphs (BDLGs), which, I should emphasise, is not a type of graph I came up with myself (see the link to my earlier post on BDLGs to learn more about them). I will discuss occurrence graphs in more detail in another post, and only touch on some basic details here.
In an occurrence graph, attribute-incident (or attribute-event) combinations are visualised as nodes. The position of the nodes on the horizontal axis is based on the chronological ordering of the incidents/events (this ordering can also be matched to that of customised event graphs). In the occurrence graphs edges only appear between nodes that refer to the same attribute, and they always point to the next attribute-incident (or attribute-event) combination in which that attribute occurs. When attributes co-occur, their attribute-incident (or attribute-event) nodes appear at the same coordinate of the horizontal axis.
Occurrence graphs can thus be understood to show the ‘history’ of one or more attributes’ appearance/involvement in the process of interest. Imagine that you have a set of attributes that identify different actors that participate in the process you are studying. An occurrence graph can give you a quick overview of (1) when these actors were active, (2) how frequently they were active, and (3) which actors tend to be involved in the same incidents/events. For another example, imagine that you have a set of attributes that identify different types of activities that occur in incidents/events. An occurrence graph can give you a quick overview of (1) when these activities occur, (2) how frequently these activities occur, and (3) which activities are typically carried out in combination.
Q-SoPrA has a separate widget for the visualisation of occurrence graphs. An occurrence graph is built by adding attributes one by one, as I will demonstrate in the example below.
In this example, I will use an occurrence graph to study a specific issue of interest to me. The data set that I use in this example records the emergence and development of a community sustainability initiative, somewhere in the UK. One of the ambitions of the initiative is to stimulate people in the community to adopt more sustainable practices (that is, what the members of the initiative understand to be sustainable practices, like growing your own food, commuting by bicycle rather than by car, crafting your own products from locally sourced materials, and so on). I have various attributes that refer to these different kinds of practices, and these all share a parent that I called Sustainability practices. What I want to know is what other activities the members of the initiative use to promote sustainable practices. One way to find out is to study what other activities the Sustainability practices attribute co-occurs with.
The first thing that I do is to add the Sustainability practices attribute to my occurrence graph. As was the case with modes in the event graphs, this will create nodes for all incidents that have the Sustainability practices attribute assigned to them, or one of its children.
Q-SoPrA will plot the nodes that refer to the same attribute on a straight line, where the position of the nodes on the horizontal axis respects their position in the chronological order of incidents. A small portion of the graph is shown in the screenshot above. Not terribly exciting, is it?
Let’s add some other attributes. One of the things that I learned is that members of the community sustainability initiative I studied like to make use of workshops to teach other members of the community certain skills associated with Sustainability practices. I therefore add an attribute that I used to capture the occurrence of workshops, called Initiative workshop.
Quite a few co-occurrences immediately appear in the graph (see the screenshot above for a fragment of the graph). I made sure that Q-SoPrA initially plots all attribute-incident combinations on their own line, and that combinations that co-occur appear close to each other, so that co-occurrences are more pronounced (of course, you can move the nodes around a bit to adapt the visualisation).
What I unfortunately cannot show you without flooding this post with screenshots is that the occurrence graph shows that workshops occur quite often, but mostly in a particular episode of the process. Thus, it is interesting to explore some other types of activities that might be used to promote Sustainability practices to see if these are perhaps bound to certain episodes as well. The next attribute I’ll add is Competition, which captures those occasions where members of the community sustainability initiative organised some kind of competition to engage people in certain practices (e.g., cycling competition, a competition for who builds the nicest planter, and more).
The screenshot above shows just a small segment of the graph, but it shows one example where a Competition co-occurs with Sustainability practices. Actually, these co-occurrences appear in various parts of the graph, but overall they appear more sporadically than those of Sustainability practices and Workshops.
For this example, let me add just one more attribute. In this case I added the attribute Door-to-door advice, which is an activity in which members of the initiative went door-to-door in their community to provide community members advice about energy saving measures that can be taken at home.
As the screenshot above shows, I apparently only used this attribute once. There are no edges going into, or out from, the yellow node shown in the screenshot, indicating that the node only appears once in this graph. The fact that this activity only occurs once is easily explained. The activity was carried out as part of a project for which the community sustainability initiative obtained funding, and which was dedicated to implementing and promoting energy saving measures in the community. As part of the project, some members of the initiative were trained to provide personalised energy advice, and these members went door-to-door in their community. Of course, this is a relatively expensive operation for a small initiative, and it is not something you would expect to happen on a regular basis. The organisation of workshops, for example, typically requires lower investments of time and money, which perhaps makes it a more attractive type of activity for the initiative to make use of.
If we really want to make a more in-depth analysis, we would of course not stop here, and we would probably repeatedly go back and forth between our underlying data, the occurrence graph, and other visualisations and tools, but I think this small example gives you a basic idea of what you can do with attributes and occurrence graphs.
There is just one other thing I want to demonstrate. If you are familiar with CAQDAS, then you are probably also familiar with the concept of co-occurrence of codes. Basically, the occurrence graphs visualise a certain type of co-occurrence, a type that is sensitive to temporal order. However, it can also be useful to just have an aggregate measure of co-occurrence, like you would have in most CAQDAS. After creating an occurrence graph in Q-SoPrA it is possible to export a (co-)occurrence matrix that records aggregate measures of (co-)occurrence. The screenshot below shows an example of such a matrix.
The diagonal of the matrix simply shows how often a given attribute occurs overall. The other cells of the matrix show co-occurrences of the attributes associated with the corresponding rows and columns. This matrix basically gives us similar information to the occurrence graphs we studied earlier. For example, it shows that, when workshops occur, they typically co-occur with sustainability practices. It also shows that workshops were used more often than competitions, and that door-to-door advice occurred only once. Of course, what the matrix does not show us is when in the process all these activities (co-)occurred, so with the matrix we do lose some potentially relevant information.
There are of course more things that can be done with attributes. It is relatively easy to export data from Q-SoPrA, and to import it into some other software package for further analysis. For example, you could try to import the incidence matrix of incidents and attributes into some network visualisation tool, and make a network graph of what attributes are related to what incidents, giving you an overview that is otherwise perhaps hard to obtain.
It is very likely that I will also be adding additional features to Q-SoPrA over time. For now, I hope that the above demonstrates how attributes in Q-SoPrA can support the analysis of social processes.
This post is on an issue that I struggled with very recently, while working on Q-SoPrA. What I wanted to achieve was relatively simple: I wanted to have tables that fetch data from sql databases, and in which one column shows check boxes to set/unset a boolean variable. The screenshot below shows an example.
What you see in the screenshot is a QTableView widget that shows data that it fetches from a QSqlTableModel that interfaces with a table of a sqlite database. This post is about how to create the interactive check boxes shown in the right-most column. There are two main hurdles in getting the QTableView widget to work with check boxes:

1. The boolean variable is stored in the sql table as an integer set to 0 (for false) or 1 (for true), but you will of course need to do a bit of extra work to make the QSqlTableModel properly treat it as a boolean (or something that can be switched on or off) in read & write operations. This can be done by sub-classing the QSqlTableModel, and re-implementing its flags(), data() and setData() functions, as suggested here.
2. By default, the check boxes will be aligned to the extreme left of their column. To change this, you need to sub-class a QStyledItemDelegate, and re-implement its paint() function that determines how the column is visualised. A somewhat outdated example of how to do that is offered in the Qt FAQ.

For good results in my use-case, where I want to have check boxes that (1) are able to handle a ‘pretend boolean’ variable (a boolean that is actually an integer) from a sqlite database, and (2) are not all aligned to the extreme left of their column, you have to combine the two solutions mentioned above. I haven’t really encountered a worked out example of this combination, which is why I decided to provide one in this post.
In the below, I briefly outline how you could achieve a result like the one shown in the screenshot above.
Qt5 is a popular library for C++ that can be used for the development of Graphical User Interfaces (GUIs). It also includes a module that allows for easy interfacing of your program with sql databases. I make heavy use of the Qt5 library for the development of Q-SoPrA, including the possibilities it offers for interfacing with sqlite databases. Qt5 comes packed with a number of great sql database classes, such as QSqlDatabase, QSqlQueryModel, QSqlTableModel, QSqlRelationalTableModel, and QSqlQuery.
For the development of Q-SoPrA, I make use of most of these classes, although often in a sub-classed version in which I re-implemented some of their member functions.
I think a very common setup for interfacing with sql databases is to have a QSqlTableModel (or QSqlRelationalTableModel) that fetches data from a table in your sql database, a QSortFilterProxyModel that reads from the QSqlTableModel and acts as a 'filtering layer', and a QTableView that reads from the filter and displays the results on the user's screen. This is also the kind of setup that I use for several widgets that I included in Q-SoPrA, although I typically subclass the QSqlTableModel / QSqlRelationalTableModel to change some of the ways in which the data is presented to the user (such as tool tip behaviour).
In the below I make one main assumption about your ‘pretend boolean’ variable, which is that this variable is stored in one of the tables of your sql database as an integer, and that you programmatically set this integer to 0 or 1 whenever necessary. I tend to use the QSqlQuery class for this, but as Doug Forester pointed out in the comment section below, this is a bit inefficient. Instead, it is enough to subclass the QSqlTableModel and re-implement some of its functions as discussed further below.

Before showing how this works, I should add that in the particular use case discussed here, the pretend boolean is stored in a particular column of the sql table that my QSqlTableModel interfaces with. In the code snippets below you will regularly see a conditional that checks if the current column index is 7, because that is the column that holds the ‘pretend boolean’ in my specific example. You would of course need to adapt this column index to your own specific use case.
If we want user interaction with our ‘pretend boolean’ variable to be handled by a check box, we need to sub-class the QSqlTableModel that interfaces between the sql table and the QTableView that visualises the data for the user. More specifically, we need to re-implement the flags(), data() and setData() member functions, and make our ‘pretend boolean’ (an integer set to 0 or 1) behave as an item that can be checked and unchecked (I believe it was this discussion that led me to this insight). The flags() function basically determines how items recorded in the QSqlTableModel can be manipulated (see this list of possible flags). We want to re-implement this function to make sure that items that correspond to our ‘pretend boolean’ can be checked or unchecked by the user. This can be achieved quite easily:
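Below is a minimal sketch of what this re-implementation can look like. The class name QueryModel is a placeholder for your own subclass of QSqlTableModel, and the column index 7 is specific to my example; adapt both to your own setup.

```cpp
// In the subclass's .cpp file (QueryModel is a placeholder name).
Qt::ItemFlags QueryModel::flags(const QModelIndex &index) const
{
  if (index.column() == 7)
    {
      // Make items in the 'pretend boolean' column user-checkable,
      // while keeping the flags that are set by default.
      return QSqlTableModel::flags(index) | Qt::ItemIsUserCheckable;
    }
  return QSqlTableModel::flags(index);
}
```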
As mentioned previously, in the code snippet above we assume that our ‘pretend boolean’ is recorded in the seventh column of our sql table. We therefore check whether the current index being accessed exists in the seventh column. If yes, then we communicate to the program that the item at this index should be checkable by the user (we also make sure that we return all flags that are set by default). If no, then we revert to the default behaviour of the QSqlTableModel::flags() member function.
Our sub-classed version of QSqlTableModel still won’t understand how to handle our ‘pretend boolean’ properly. For that, we also need to re-implement the data() and setData() functions. The data() and setData() functions are used to read data from, and write data to, the sql table with which the QSqlTableModel is interfacing.
We should keep in mind here that in the QSqlTableModel, data are stored under different ‘roles’ (see an overview of these roles here). For example, data stored under the Qt::DisplayRole are the data that are actually shown to the user in the QTableView, data stored under the Qt::ToolTipRole are the data that are shown when the user hovers his/her mouse cursor over an entry in the table, and data stored under the Qt::EditRole are the data that the user can manipulate in an editor. In the list of roles you will also find the Qt::CheckStateRole, and this is the role that we want to (re-)implement for our ‘pretend boolean’.
Another thing we should keep in mind is that we don’t want all data in our sql table to be treated as ‘pretend booleans.’ We will want to keep the default behaviour of the data() and setData() functions in most cases. We can do that in the same way as we did with the flags() function: We create a special case for the column that holds our pretend boolean variable, and we revert to the default implementation of QSqlTableModel::data() and QSqlTableModel::setData() in all other cases.
Actually, in the example below we have one other exception, which is the special case in which the user is hovering the mouse cursor over a cell, which will cause the QSqlTableModel to return data under the Qt::ToolTipRole.
Let’s start with our data() function. See a snippet with its re-implemented version below. EDIT: The updated version below implements a suggestion made by Doug Forester in the comments. It is more efficient than what I originally came up with.
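A sketch of such a re-implementation could look as follows (again with QueryModel as a placeholder subclass name and the ‘pretend boolean’ in column 7):

```cpp
QVariant QueryModel::data(const QModelIndex &index, int role) const
{
  if (index.column() == 7)
    {
      if (role == Qt::CheckStateRole)
        {
          // Fetch the stored integer and translate it to a check state.
          int checked = QSqlTableModel::data(index, Qt::DisplayRole).toInt();
          return (checked == 1) ? Qt::Checked : Qt::Unchecked;
        }
      // Show nothing but the check box itself in this column.
      return QVariant();
    }
  if (role == Qt::ToolTipRole)
    {
      // Show the visible contents of the cell as a tool tip.
      return QSqlTableModel::data(index, Qt::DisplayRole);
    }
  return QSqlTableModel::data(index, role);
}
```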
So, the basic structure of this function is quite simple. We first check if the column of the sql table that is being accessed is the column with our pretend boolean. If yes, then we check whether the data are being accessed under the Qt::CheckStateRole. If the answer is yes again, we can fetch the current value of our ‘pretend boolean’ by accessing the data stored in the current index, and we store this value in int checked. The function will return Qt::Checked if checked == 1, and it will return Qt::Unchecked if checked == 0.
There are a few other situations that the re-implemented function handles. If we are accessing data in the column with our pretend boolean, but we are not accessing the data under the Qt::CheckState role, then the function simply returns an empty QVariant(), effectively returning nothing. I did this to make sure that the corresponding column in the QTableView only shows a check box that visualises the current check state, and nothing else. If we are accessing data in any other column, the function first checks whether we are accessing data under the Qt::ToolTipRole. If yes, then we treat it as another special case, in which the user gets shown a tool tip that simply contains the visible contents of the cell currently being hovered over with the mouse cursor. If we are not accessing the data under the Qt::ToolTipRole (in all other cases), we just revert to the default implementation of the QSqlTableModel::data() function.
If you have re-implemented the data() function in this way, then your QTableView should already show check boxes in the corresponding column. So far, so good. However, there are still a few problems. One problem is that the check boxes are aligned to the extreme left of the column, which makes the table look ugly. We will deal with this later. A more urgent problem is that checking / unchecking the check boxes won’t actually do anything meaningful with the underlying data, unless we also re-implement the setData() function. Let’s do that next.
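The setData() snippet is also missing here; a minimal sketch of what it might look like follows, with the same assumed class name and column index as before.

```cpp
// Sketch of a re-implemented setData() function. "EventTableModel"
// and column 6 are assumptions, as above.
bool EventTableModel::setData(const QModelIndex &index,
                              const QVariant &value, int role)
{
  if (index.column() == 6 && role == Qt::CheckStateRole)
    {
      // Translate the check state back to a 0/1 value in the sql table.
      int checked = (value.toInt() == Qt::Checked) ? 1 : 0;
      return QSqlTableModel::setData(index, checked, Qt::EditRole);
    }
  // Revert to the default implementation in all other situations.
  return QSqlTableModel::setData(index, value, role);
}
```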
We have a similar kind of check to the one we had in the data() function: We check whether the column we are writing to is the column that contains our pretend boolean variable. In this case, we check at the same time whether we are trying to write data under the Qt::CheckStateRole. If yes, we write either a 1 or a 0 to the corresponding index, depending on the value that was passed to the function. In all other situations we simply revert to the default implementation of the QSqlTableModel::setData() function.
After re-implementing the setData() function this way, checking / unchecking the check boxes in our QTableView will actually do something meaningful with the data in the underlying sql table.
If you don’t care about the alignment of the check boxes in their corresponding column of the QTableView, then you’re done. However, I think that the table looks much nicer with the check boxes aligned to the centre of their column. How to achieve this is what I will discuss next.
It turns out that the only way to align the check boxes to the centre of their column is to use a QStyledItemDelegate. Well, there is another approach that uses layouts, but that only works if you (1) manually append another column to your QSqlTableModel, (2) explicitly create QCheckBox objects, (3) assign these to a parent QWidget object, and (4) apply a centralised layout to that parent widget. This is quite an expensive procedure that can significantly reduce the performance of your program, and it only works well if you don’t update your QSqlTableModel frequently.
This is because every time that you update your QSqlTableModel the manually appended column will disappear. You can of course solve this by writing a function that creates the extra column, and fills it with your manually created check boxes, and run this function whenever the QSqlTableModel's select() function gets called. However, this will slow down your program quite a bit, and I also found some other drawbacks to this approach that are a bit outside the scope of this discussion.
The approach that uses a QStyledItemDelegate is actually what is recommended in Qt’s FAQ section. An example code snippet is offered there as well, although it is a bit outdated (it uses some functions that were deprecated in Qt5). Essentially, what you are required to do is to create your own sub-class of QStyledItemDelegate, and re-implement its paint() and editorEvent() functions. The paint() function determines how the check box is visualised to the user, and the editorEvent() function determines how the user can interact with the check box.
I included the snippet with my slightly adapted version of the re-implemented functions below. I am not going to pretend that I grasp every detail of what happens in these functions. I mostly replaced some of the deprecated functions in the example offered in Qt’s FAQ.
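The snippet is missing here; below is a sketch along the lines of the Qt FAQ pattern that the text describes. The column index 6 is an assumption, and details such as the exact margins may differ from the original adaptation.

```cpp
#include <QStyledItemDelegate>
#include <QApplication>
#include <QMouseEvent>
#include <QPainter>

class CheckBoxDelegate : public QStyledItemDelegate
{
public:
  explicit CheckBoxDelegate(QObject *parent = nullptr)
    : QStyledItemDelegate(parent) {}

  // Draw the check box centred in its cell by shifting the rectangle
  // that is passed to the default paint() implementation.
  void paint(QPainter *painter, const QStyleOptionViewItem &option,
             const QModelIndex &index) const override
  {
    QStyleOptionViewItem viewItemOption(option);
    const int textMargin = QApplication::style()
      ->pixelMetric(QStyle::PM_FocusFrameHMargin) + 1;
    viewItemOption.rect = QStyle::alignedRect(
      option.direction, Qt::AlignCenter,
      QSize(option.decorationSize.width() + 5, option.decorationSize.height()),
      QRect(option.rect.x() + textMargin, option.rect.y(),
            option.rect.width() - (2 * textMargin), option.rect.height()));
    QStyledItemDelegate::paint(painter, viewItemOption, index);
  }

  // Toggle the check state when the user clicks inside the check box
  // rectangle or presses the space bar.
  bool editorEvent(QEvent *event, QAbstractItemModel *model,
                   const QStyleOptionViewItem &option,
                   const QModelIndex &index) override
  {
    Qt::ItemFlags flags = model->flags(index);
    if (!(flags & Qt::ItemIsUserCheckable) || !(flags & Qt::ItemIsEnabled))
      return false;
    QVariant value = index.data(Qt::CheckStateRole);
    if (!value.isValid())
      return false;
    if (event->type() == QEvent::MouseButtonRelease)
      {
        const int textMargin = QApplication::style()
          ->pixelMetric(QStyle::PM_FocusFrameHMargin) + 1;
        QRect checkRect = QStyle::alignedRect(
          option.direction, Qt::AlignCenter, option.decorationSize,
          QRect(option.rect.x() + (2 * textMargin), option.rect.y(),
                option.rect.width() - (2 * textMargin), option.rect.height()));
        if (!checkRect.contains(static_cast<QMouseEvent*>(event)->pos()))
          return false;
      }
    else if (event->type() == QEvent::KeyPress)
      {
        int key = static_cast<QKeyEvent*>(event)->key();
        if (key != Qt::Key_Space && key != Qt::Key_Select)
          return false;
      }
    else
      {
        return false;
      }
    Qt::CheckState state =
      (static_cast<Qt::CheckState>(value.toInt()) == Qt::Checked)
      ? Qt::Unchecked : Qt::Checked;
    // This ends up in our re-implemented setData() function.
    return model->setData(index, state, Qt::CheckStateRole);
  }
};
```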
The re-implemented version of the paint() function basically just changes the position at which the check box is drawn, by changing the rectangle within which the paint event takes place, and then calling the default version of the paint() function with an option parameter that includes the altered rectangle.
The re-implemented version of the editorEvent() function handles the various events through which the user might manipulate the current state of the check box. If an event meets the required conditions (e.g., a QEvent::MouseButtonRelease took place within the rectangle where the check box is drawn), then we call the setData() function (the one we re-implemented earlier) with the appropriate parameters.
There is actually one other thing we need to do if we want our QStyledItemDelegate to work. We need to tell our QTableView object to use the delegate in its seventh column. In my case, the QTableView object is named tableView, and the sub-classed version of the QStyledItemDelegate I created is named CheckBoxDelegate. I set tableView to use the CheckBoxDelegate with the following function:
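The call in question is presumably a single line, using QAbstractItemView’s standard setItemDelegateForColumn() function (the seventh column has index 6):

```cpp
// Use the CheckBoxDelegate for the seventh column (index 6).
tableView->setItemDelegateForColumn(6, new CheckBoxDelegate(tableView));
```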
And that should do the trick! Now you should have a QTableView with a nice-looking column of check boxes through which the user can interact with ‘pretend boolean’ variables in your sql database.
As I mentioned in my previous post, I am currently working on an integrated software package for the qualitative study of social processes, called Q-SoPrA (Qualitative Social Process Analysis). The program can be used for various tasks, including data management, qualitative coding, and visualisation. It is still under construction, and I have not opened the source or distributed any binaries yet (but I will once a more or less fully-featured and stable version is ready). In addition to some more technical posts about the software (such as the previous one), I intend to write several non-technical posts, discussing the features of Q-SoPrA and how to use them. Eventually, most of this information will also be available from the GitHub wiki of Q-SoPrA, but I think having some less condensed discussions of features is also useful.
I understand that having posts like this one will be more useful once the software is actually released, but for me writing this post is a way to prepare for the release, as well as a way to document some thoughts that might help me decide how to improve things further.
In this post, I want to discuss the basics of how to get started with Q-SoPrA. I start with a discussion of the logic behind the kind of data that Q-SoPrA works with. Then, I discuss the creation and loading of data sets, as well as data management (e.g., entering data, editing data, importing data).
I usually refer to the data sets that Q-SoPrA works with as event data sets, because it gives an intuitive sense of the kind of data stored in them, but the name incident data sets is actually more appropriate. The data sets consist of chronologically ordered incidents, which are bracketed, qualitative descriptions of interest to the process under study. The concept of incidents was, as far as I can tell, invented during the Minnesota Innovation Research Program (also see this book for detailed instructions on how to construct data sets like those used during the program, and this book for some empirical results of the program).
The operational definition of an incident that I use, and that I implemented in Q-SoPrA, is slightly different from that of the original inventors of the concept. In Q-SoPrA, an incident is assumed to include the following information:
I discuss each of these elements in more detail below.
Incidents are supposed to capture activities of importance to the process under study. Indeed, what exactly these activities are depends on various factors, including characteristics of the process, the nature of the available data (e.g., granularity of data), the research questions of the analyst, the theoretical angle chosen for the study, and so on (see the earlier mentioned references, and also see this, this (and other) work by Abbott for good discussions on this matter).
Moreover, not all social processes are best reconstructed at the same level of abstraction. Some studies may be about processes that unfold over a period of several years (e.g., innovation processes), whereas other processes unfold over a period of just a few hours (e.g., a meeting), or even just a few minutes (e.g., a conversation). The level of detail available in the source data, and the desired level of detail of incident descriptions is quite different across different types of processes.
The activities of interest are captured in brief descriptions (typically just a few sentences), created by the researcher. These descriptions typically mention at least what activity was performed, and by whom. Providing such descriptions is mandatory when using Q-SoPrA.
Writing incidents is an important step in the analytic process, and the descriptions used for incidents tell us a lot about the analyst’s interpretations of the raw data. As Poole and his colleagues write, incidents can be understood to be “one step above the raw material for analysis” (page 140). When parsing the raw data into incidents, the analyst makes various decisions about, among other things, (1) what activities are important to record explicitly, (2) the granularity of the activities (e.g., are we just interested in the occurrence of a political debate, or are we interested in the specific questions raised during the debate, and how people responded to these?), and (3) what aspects of the described activity are important (do we only need to record the activity itself, or perhaps also, for example, where the activity took place and what instruments were used in the performance of the activity?).
In the Minnesota Innovation Research Program, the researchers agreed on a set of explicit decision rules that were applied across the different cases that they studied. I believe that making explicit decision rules is valuable for the transparency of the process, but I also believe that an important part of the interpretive process is to reconsider these choices repeatedly, and to make adjustments to the data set accordingly. It is important to record such decisions in some kind of research diary (Q-SoPrA includes a Journal Widget that can be used for this purpose).
As I mentioned before, incidents are chronologically ordered in data sets. The order is to be determined by the analyst. I did not implement automatic sorting of incidents by date in Q-SoPrA, since the precise date of an incident may not be known, and a more descriptive indication of the time of occurrence is then required. For example, of some incidents we may only know that they happened before or after a certain other incident. In addition, I believe that determining the appropriate order of incidents is an important and informative task from which the analyst can learn a lot about the process of interest.
That being said, Q-SoPrA requires the analyst to always provide an indication of the time of occurrence. This may be a very precise time stamp (e.g., “At 15:30 on 25-12-2017”), or it may be a very vague indication of the time of occurrence (e.g., “Somewhere in 2013”). It may of course also be some form of duration, rather than a discrete point in time (e.g., “from March to April 2016”).
Q-SoPrA also requires you to always mention the sources of data. For most academics, this is a bit of a no-brainer. The specific sources of data used will depend on the design of the research project. In the studies included in the Minnesota Innovation Research Program, which were carried out as the innovations of interest unfolded, sources of data include observations, records of phone calls, notes of meetings, interview transcripts, and others.
I typically work with archival data, including newspaper articles, documents produced by people involved in the process of interest (e.g., minutes of meetings or project documents), as well as web sites (I make heavy use of The Internet Archive). I store files on these sources on my disk, and typically give them simple labels, such as “Webpage_1”, “Document_20”, “News_101” and so on. In the data set, I refer to these sources by their labels.
As much as possible, I also like to include fragments from raw sources of data whenever I create incidents. I believe that including these ‘raw’ fragments alongside my own descriptions of the activities contributes to the transparency of the interpretive process. Also, when I assign codes to the incidents, these fragments of ‘raw’ data are my main pieces of evidence. In fact, in the coding widgets that I have created for Q-SoPrA I have also included the possibility to highlight particular fragments of text that are to be associated with the codes (see below). Anyone who has used software like RQDA, Atlas.ti, NVivo, or MaxQDA will understand how this works.
Q-SoPrA does not require the analyst to include fragments of raw data, because there is no guarantee that these are available in the form of text. For example, some of my raw sources of data are images. However, I would encourage anyone to include raw fragments as much as possible, if only because such fragments are a great help in the qualitative coding process. I have personally noticed that I have become much more precise in my coding after implementing the possibility to highlight fragments of text.
Finally, in Q-SoPrA the analyst has the option to include comments, which may be anything from personal observations or thoughts on the incident to reminders that certain aspects of the incident should be double checked (Q-SoPrA also offers the possibility to mark incidents, and to jump back and forth between incidents that were marked).
People that have experience with qualitative research (and particularly grounded theory) are probably familiar with the notion of memos. In Q-SoPrA, the comment field can be used to write such memos. Comment fields are the only fields related to incidents that can be altered at any time (the other incident-related fields, i.e., the indication of timing, the description, the sources of data, and the ‘raw’ data can only be changed from the Data Widget).
The structure of Q-SoPrA’s input datasets is quite simple: A data set is a table with 7 columns. Only 5 of these columns are visible to the user (see below), which are the columns that record the Timing, the Description, the Raw data, the Comments, and the Source(s) associated with incidents. The information included in these columns should be supplied by the analyst, by writing new incidents, or by importing data sets from external csv files (also see my previous post for a technical discussion on importing data from csv files).
The other two columns of the table are hidden from the user, and are always created and filled automatically. The first of these columns holds a unique ID for the incidents. The unique incident IDs are of crucial importance, because Q-SoPrA uses these IDs to refer to incidents in other tables of Q-SoPrA’s database. A unique ID is created whenever an incident is added to a data set. The first incident in the data set has the ID ‘0’, the second has the ID ‘1’, and so on. To ensure that we never confuse incidents, Q-SoPrA never reuses an ID number that has already been used before in the same data set.
Q-SoPrA uses sqlite databases to store all information that the user creates while working on a data set. These databases currently have 40 different tables (more may be added as I expand the program), but the only table that the user interacts with more or less directly is the table of incidents. Other tables are used, for example, to store the attributes that the user creates while coding, to store information on which attributes were assigned to which incidents, to store information on the fragments of text that have been associated with certain pairs of incidents and attributes, and so on. Obviously, it is important that Q-SoPrA knows which specific incidents we are referring to in these different tables, which is why we need unique IDs for all incidents.
The second column that is hidden from the user contains an order variable, which simply records the current position of an incident in the chronological order (the table of the data widget does show a row number, which always has the same value as the order variable). The value of this variable determines the position of an incident as it is shown to the user in the data widget (see the screenshot above). The value of this variable is changed whenever an incident is moved up or down in the table, which can be done from the data widget. Whenever the analyst performs such an operation, Q-SoPrA automatically calculates the new values of the order variable for the affected incidents. The values of the order variable also determine how the data are presented to the user in other widgets, such as the qualitative coding widgets. In addition, in the visualisation widgets the order variable can be used to filter visualisations to only visualise data from a particular episode in the process (I will discuss these topics in more detail in future posts).
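To make the structure concrete, the incidents table described above (five visible columns plus the hidden ID and order columns) might be created along these lines. The table and column names here are purely illustrative; Q-SoPrA’s actual schema is not shown in this post.

```cpp
#include <QSqlQuery>

// Hypothetical sketch of the incidents table; names are assumptions.
void createIncidentsTable()
{
  QSqlQuery query;
  query.exec("CREATE TABLE IF NOT EXISTS incidents ("
             "id INTEGER PRIMARY KEY, "    // hidden unique ID, never reused
             "ch_order INTEGER, "          // hidden chronological order
             "timestamp TEXT NOT NULL, "   // 'Timing' (free-form indication)
             "description TEXT NOT NULL, " // 'Description'
             "raw TEXT, "                  // 'Raw' data fragments (optional)
             "comment TEXT, "              // 'Comments' / memos (optional)
             "source TEXT NOT NULL)");     // 'Source(s)'
}
```

Note that a column literally named “order” would clash with the SQL keyword, which is why the sketch uses ch_order for the order variable.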
When you start up Q-SoPrA, you will always be shown a Welcome Dialog first (see screenshot below).
The Welcome Dialog gives you three options: (1) You can create a new database, (2) you can open an existing database, or (3) you can exit the program. If you select the first option, Q-SoPrA will open a file dialog, which you can then use to select the name and location for your new database. After doing so, a new sqlite database is created. Essentially, this database is just a large collection of (initially empty) tables, but nearly all information created by, and presented to, the user while using Q-SoPrA is stored in these tables. If the user chooses to create a new database, and then selects an existing database in the file dialog, the existing database will be overwritten. This action cannot be undone, but the user will be shown a warning dialog before proceeding.
The user can of course also choose to open an existing database. In this case, Q-SoPrA will not attempt to create a new database, but any tables that are missing from the existing database will be added.
After choosing to create a new database, or to open an existing one, Q-SoPrA’s main window will be opened, and the data widget will be shown. All other widgets can be accessed from the main window, using the window’s menu bar (see screenshot below). I will discuss these other widgets in future posts.
If the user selected to create a new dataset, then the data widget will simply show an empty table, as well as a few buttons that can be used to interact with the data set. From this point, the analyst can either start creating new incidents from scratch, or import incidents from an external csv file, which is done by selecting the appropriate option from the File menu in the menu bar.
Imported csv files should meet two main requirements: (1) the file should have the column headers that Q-SoPrA works with, and they should exist in the same order (“Timing”, “Description”, “Raw”, “Comments”, and “Source”), (2) cells of the “Timing”, “Description” and “Source” columns are not allowed to be empty. If the user attempts to import files that do not meet these requirements, an error dialog will be shown to inform the user that the data will not be imported and why.
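The header check described above could be sketched as follows. This is not Q-SoPrA’s actual import code, just an illustration of the first requirement; a real implementation would also need proper csv parsing (quoted fields, embedded commas) and the per-cell emptiness checks.

```cpp
#include <QFile>
#include <QTextStream>
#include <QStringList>

// Illustrative sketch: verify that a csv file starts with the
// expected column headers, in the expected order.
bool hasValidHeader(const QString &fileName)
{
  const QStringList expected{"Timing", "Description", "Raw",
                             "Comments", "Source"};
  QFile file(fileName);
  if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
    return false;
  QTextStream in(&file);
  // Naive split; real csv parsing should handle quoted fields.
  QStringList header = in.readLine().split(',');
  return header == expected;
}
```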
If the user wishes to create new incidents from scratch, the option Append incident can be used to add a new incident to the end of the data set. The user will then be shown a new dialog that can be used to enter information on the incident (see screenshot below). The user is always required to enter information into the “Timing”, “Description” and “Source” fields. If these fields are empty, the user is not allowed to create the incident, and will be shown an error dialog that explains why.
After saving the incident, it will be shown in the table of the data widget. The fields of the incident can now be manipulated by double clicking cells and changing the details. However, when the user clears a cell that is not allowed to be empty, Q-SoPrA will simply reset the cell. An incident can also be edited by selecting it, and then using the Edit incident option. This will open the same dialog that is used for the creation of new incidents, but the fields will be filled with the information that was already present.
The table in the data widget has various other features. For example, the information contained in a cell can be shown in a popup dialog by hovering the mouse cursor over it. This allows the user to display information on incidents without having to open another dialog. To the left of the table, the user will see a column with row numbers. The lower border of the cells in this column can be dragged up or down to decrease / increase the height of rows. The lower border of these cells can also be double clicked to maximise a row's height. If the user wishes to restore a row's height to its default value, the user should simply double click the corresponding cell with the row number. The user can also increase or decrease the table's font size by holding the CTRL button and scrolling the mouse wheel up or down.
Once the user has filled the table with some data, (s)he also might want to insert new incidents in between two existing ones, rather than appending them to the end of the table. For this reason, the user also has the option to insert a new incident before or after an existing one. These options work the same as appending a new incident (e.g., a dialog will be shown that can be used to enter information on the new incident), but the new incident will be inserted at the location chosen by the user.
Sometimes it also makes sense to duplicate an existing incident, and make only small adjustments to the duplicate, rather than having to enter all information again. For example, I often use this option if I decide to split up an existing incident into multiple incidents. This can be achieved by selecting an existing incident, and using the Duplicate incident option. The result will be similar to editing an incident, but in this case a new incident is created when saving the changes. Duplicates will always be inserted immediately after the incident that they are a duplicate of.
The user will occasionally want to change the order of the incidents (for example, after creating a duplicate). This can be done by selecting an incident, and using the Move up or Move down options. This will simply switch the selected incident with the one prior to it or the one following it.
Finally, the user can remove an incident from the table by using the Remove incident option. Removing an incident will also remove all information that has been associated with it, such as assigned attributes (not the attributes themselves), and linkages that have been created with other incidents (I discuss this in a future post). The user will always be shown a warning dialog when using this option, since it cannot be undone.
In conclusion, getting started with Q-SoPrA is easy. You simply create a new data set, and start filling the data widget’s table with incidents. In my view, by building your data set you are basically building your case.
Indeed, while analysing the data you will often find yourself going back to the data widget to add more incidents, to edit, split or merge existing ones, or to change the order of the incidents. In my experience you will continue to do so throughout most of the analytic process, as your understanding of the case of interest gets more refined.
This is exactly the reason why the data widget was integrated into Q-SoPrA. In fact, the main reason I had for building Q-SoPrA was to have a program in which the tasks of data management and data coding are integrated (but things escalated quickly from there, as I kept adding other features to Q-SoPrA). Before building Q-SoPrA, I built several standalone tools that perform some of the tasks that Q-SoPrA performs (although Q-SoPrA performs these tasks much better, which is simply a result of my increased experience with coding software). One of the main problems with these standalone tools was that I often found myself having to re-import data sets after making small changes. Since this was quite a time-consuming and error prone process, I also often found myself postponing changes or ignoring certain flaws in the data set, simply because I did not want to waste too much time. One of the main ideas underlying Q-SoPrA is that management of the data set is actually a crucial part of the analytic process, and therefore I have attempted to make the process of going back and forth between your data and your analysis as painless as possible.
Of course, data management is only one part of what Q-SoPrA does, and the data widget is primarily a support for the various other tasks that can be performed with Q-SoPrA. I discuss these other tasks in future posts.