If you remember, at the end of my previous post I raised a set of questions. Two were related to the use of colors in the graphs I showed:
- Are the four levels of colors for the activity visualization enough?
- Should we have some color coding on the nodes and use different layouts?
Since I had limited time lately to push on the other questions I thought I would do something about the colors at least. :-)
So let's revisit our "whole year 2017 for all of KDEPIM" (that is the parts in KDE Applications, in Extragear and in Playground) with more colors!
Firstly, this gives us the weekly activity using the "Magma" palette and a linear interpolation of the colors between the minimum and maximum commit counts:
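To make "linear interpolation between the minimum and maximum commit counts" concrete, here is a minimal sketch in plain Python. The palette is a short dark-to-light stand-in (not the exact "Magma" values), and the function name and the sample counts are hypothetical. bokeh ships a `LinearColorMapper` for this kind of mapping; this sketch only illustrates the idea.

```python
# Illustrative dark-to-light stand-in; these are NOT the exact "Magma" values.
PALETTE = ["#000004", "#3b0f70", "#8c2981", "#de4968", "#fe9f6d", "#fcfdbf"]


def linear_color(count, low, high, palette=PALETTE):
    """Map `count` linearly from [low, high] onto one color of `palette`."""
    if high == low:
        # Degenerate range: every value gets the first color.
        return palette[0]
    # Clamp so that out-of-range values still get a valid color.
    fraction = min(max((count - low) / (high - low), 0.0), 1.0)
    # The maximum lands exactly on len(palette), so cap it at the last index.
    index = min(int(fraction * len(palette)), len(palette) - 1)
    return palette[index]


# Hypothetical commit counts for one week.
weekly_commits = {"alice": 2, "bob": 40, "carol": 21}
low = min(weekly_commits.values())
high = max(weekly_commits.values())
colors = {name: linear_color(n, low, high) for name, n in weekly_commits.items()}
```

With a linear mapping like this, a single very active committer stretches the scale and pushes everyone else toward the dark end, which is one reason a high commit count stands out so clearly.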
Like before, we see Laurent Montel and Christian Mollekopf as the most consistent committers throughout the year 2017. That being said, the new palette also allows us to see more things. First, it's not only that Laurent Montel is engaged every week, it's also that he has more commits than anyone else almost every week! We can also now see more clearly the unusual spikes of activity from Volker Krause and from Daniel Vrátil, both on week 37.
I think that next I'll try to investigate why Laurent's commit count is so high and what happened on week 37 last year. But that will be for a next installment, stay tuned!
Secondly, this also gives us a new contributor network graph. I adjusted it in several ways: I removed all visual clues on where the (0;0) point is, since it's highly irrelevant but people seem to cling to it and try to interpret it; I also switched back to "Kamada & Kawai" for the force-directed layout since it's the one which helps the most to perceive the graph topology; finally I color coded the nodes depending on their centrality. For that last part I used the "Magma" palette again, linearized on the full scale of centrality values of the graph. And since there is more than one definition of centrality, I used the Degree Centrality. This is the simplest one and is defined for a given node as "the fraction of nodes it is connected to". Since we're still in the mindset of finding out the contributors who collaborate with the most people through the files they commit to, it's very suitable.
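For reference, the degree centrality definition quoted above fits in a few lines of plain Python. This is a minimal sketch with hypothetical contributor names and edges. For graphs with at least two nodes, networkx's `degree_centrality` uses the same degree / (n - 1) normalization.

```python
from collections import defaultdict


def degree_centrality(nodes, edges):
    """Return, for each node, the fraction of the other nodes it is connected to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops, they are not collaborations
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(nodes)
    if n <= 1:
        # Convention chosen for this sketch: a lone node has centrality 0.
        return {node: 0.0 for node in nodes}
    return {node: len(neighbors[node]) / (n - 1) for node in nodes}


# Hypothetical network: an edge means two contributors touched a common file.
nodes = ["alice", "bob", "carol", "dave"]
edges = [("alice", "bob"), ("alice", "carol"), ("alice", "dave"), ("bob", "carol")]
centrality = degree_centrality(nodes, edges)
```

Here "alice" is connected to all three other contributors and gets 1.0, while "dave" only shares files with one person and gets 1/3.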
This time we don't even need to zoom in to spot the core KDEPIM contributors in 2017. With the color coding, we see right away again that Laurent Montel, Daniel Vrátil and Volker Krause are the core contributors. It's much less guesswork than the last time: we're backed by the color coded centrality metric now. We can also better see that Allen Winter, Sandro Knauß and David Faure are very central too, something that we missed the last time.
Now what about Christian Mollekopf, who is our other consistent committer in the activity graph? With a better view of the topology, like we have thanks to "Kamada & Kawai", we manage to find him on one of the outer rings (he's the pale orange node on the top left of the graph). He's indeed not very central considering the "degree centrality", but we can see that he seems to act as a bridge between the core contributors and contributors with a very low centrality.
This is an interesting new finding probably worth chasing further.
As you can see I'm not done exploring this data set! More questions are showing up before I can move to another area of KDE I think.
As you might remember our dear Paul Adams decided to retire, this is a loss because of the person... But he was also providing a very nice service in the form of community data visualization. He was famously known among us for his "green blobs (turned blue blobs) and contributor network graphs".
Note that he just took the "green blobs" idea from Adriaan de Groot and later on turned them blue... He might have made them popular in the process but it's unclear if that's due to the color change or his prose. ;-)
Anyway, he was doing that for communities other than KDE as well, but he has almost stopped now. For instance, he did it only once for Habitat in all of 2017. Luckily he published the scripts he was using in his git-viz repository so not all the knowledge was lost.
Earlier this year, I decided to take the torch and try to get into community data analytics myself. I got in touch with Paul to talk a bit about my plans. My first step was to try to modernize his scripts while staying true to his original visualization.
It turned into an almost complete rewrite, which I didn't quite expect. At the same time I wanted this modernization to be a good base for other visualizations and also general data analytics. The most prominent part remaining is his git log parsing code, although I extended it to work properly across repositories and not just on a single one. But next to that I'm now using pandas, networkx and bokeh for all the data processing and visualization descriptions. This resulted in nice, concise and maintainable code.
So you might wonder... What's possible now? Well, fairly similar visualizations to before, but now they can span more than one repository and they are fully interactive! No more fixed resolution pictures: we generate fully dynamic HTML code.
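To give an idea of what parsing git logs across several repositories can look like, here is a minimal sketch. It is not the actual code from the scripts. It assumes the log text was produced with something like `git log --pretty=format:'@@%H|%an|%aI' --name-only`; the `@@` marker, the function name and the sample data are all hypothetical choices for this illustration.

```python
from datetime import datetime

MARKER = "@@"  # hypothetical prefix used to recognize commit header lines


def parse_log(repo, text):
    """Turn the log text of one repository into a list of commit records."""
    commits = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(MARKER):
            sha, rest = line[len(MARKER):].split("|", 1)
            author, stamp = rest.rsplit("|", 1)
            # Normalize a possible "Z" suffix so older Pythons can parse it.
            when = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
            commits.append({"repo": repo, "sha": sha, "author": author,
                            "date": when, "files": []})
        elif commits:
            # Prefix paths with the repository name so that files from
            # different repositories never collide.
            commits[-1]["files"].append(f"{repo}/{line}")
    return commits


# Hypothetical log excerpts for two repositories.
logs = {
    "kmail": "@@a1|Alice|2017-09-11T10:00:00+02:00\nsrc/main.cpp\n",
    "akonadi": "@@b2|Bob|2017-09-12T09:30:00+02:00\nsrc/server.cpp\nREADME\n",
}
all_commits = [c for repo, text in logs.items() for c in parse_log(repo, text)]
```

Keeping the repository name in each record (and in each file path) is what lets the later steps treat many repositories as one data set.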
To validate the scripts I used them on the whole year 2017 for all of KDEPIM (that is the parts in KDE Applications, in Extragear and in Playground).
Firstly, this gives us the infamous blue blobs diagram to show contributor weekly activity in all those repositories in 2017:
Clearly we can spot Christian Mollekopf and Laurent Montel as the most consistent committers throughout the year 2017. It should come as no surprise since they are almost single-handedly maintaining Kube/Sink and the rest of KDEPIM respectively. Daniel Vrátil, maintainer of Akonadi, is also very active and noticeable.
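The data behind such a weekly activity diagram boils down to counting commits per contributor and per week. Here is a minimal sketch in plain Python, assuming ISO week numbering (an assumption of this sketch) and using hypothetical names and dates. With pandas this would presumably become a group-by on author and week.

```python
from collections import Counter
from datetime import date


def weekly_activity(commits):
    """Count commits per (author, ISO year, ISO week)."""
    counts = Counter()
    for author, day in commits:
        iso_year, iso_week, _weekday = day.isocalendar()
        counts[(author, iso_year, iso_week)] += 1
    return counts


# Hypothetical (author, commit date) pairs.
commits = [
    ("alice", date(2017, 9, 11)),
    ("alice", date(2017, 9, 13)),
    ("bob", date(2017, 9, 17)),
    ("bob", date(2017, 9, 18)),
]
activity = weekly_activity(commits)
```

Each resulting count corresponds to one blob in the diagram, with the color chosen from the count.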
Secondly, this also gives us back the contributor network graphs. Here I made a small exception and used "Fruchterman & Reingold" for the force-directed layout instead of the "Kamada & Kawai" one. This is simply due to a personal preference. I find that in practice "Fruchterman & Reingold" is a bit more aggressive at keeping the center for the cluster of most connected (core) contributors (although it sacrifices a bit of readability). So for all the KDEPIM repositories in 2017, we obtain the following network:
Surprisingly we can spot two disconnected nodes. Those two contributors touched files no one else touched in 2017. Nothing out of the ordinary: after investigating, those two turned out to be very self-contained one-off contributions for default SPAM settings and for improved wording in the GUI. They are valuable, but indeed don't necessarily require very deep integration in the core contributors network.
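To illustrate why such disconnected nodes appear, here is a minimal sketch of how a contributor network can be derived from commits: two contributors are linked when they touched the same file, so someone who only touched files nobody else touched ends up isolated. The function names and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def contributor_network(touches):
    """Build (authors, edges) from (author, file path) pairs."""
    authors = set()
    authors_by_file = defaultdict(set)
    for author, path in touches:
        authors.add(author)
        authors_by_file[path].add(author)
    edges = set()
    for file_authors in authors_by_file.values():
        # Every pair of people who touched the same file gets an edge.
        for a, b in combinations(sorted(file_authors), 2):
            edges.add((a, b))
    return authors, edges


def isolated(authors, edges):
    """Return the contributors who share no file with anyone else."""
    connected = {name for edge in edges for name in edge}
    return sorted(authors - connected)


# Hypothetical touches: carol only changed a file nobody else changed.
touches = [
    ("alice", "kmail/src/main.cpp"),
    ("bob", "kmail/src/main.cpp"),
    ("carol", "kmail/data/spam.rc"),
]
authors, edges = contributor_network(touches)
loners = isolated(authors, edges)
```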
Then if we zoom in, we can easily spot the core KDEPIM contributors in 2017: Laurent Montel, Daniel Vratil and Volker Krause. They are the ones who connected most to other contributors via their commits last year. Of course this is a bit of a visual check and as such not very scientific.
Which leads me to the "what's next?" question.
Now I plan to build on that work and add more tools and analysis. Paul's scripts and graphs were an excellent start, which is why I did my best to stick to them. But now it's time to add more! There are various questions which can be pursued:
- Are the four levels of colors for the activity visualization enough? Could we get better insights with a different palette? Should we get closer to a heat map?
- Could we get some more insights from using a different frequency than weekly for the activity graph? Can we learn from daily activity on shorter periods? Or from monthly activity on a longer scale?
- Should we really look at the contributor network as is? Could we plot it over time and see it evolve dynamically? Are there insights there or would it just look pretty? Should we have some color coding on the nodes and use different layouts?
- Should we have a higher level view on the contributor network? Maybe we would get more information from finding cliques and plotting their relationships?
- Or should we ditch the contributor network representation altogether? Should we instead plot metrics on the graph structure itself over time (like the density or connectivities)?
- Of course we can come up with even more visualizations and analysis departing further from activity and contributor network (for instance, I suspect we could ease copyright attribution from files a bit, to make it easier to contact contributors in case of license changes... I had to do it once for a couple of files and it's a fairly manual and error-prone process for now).
- And can we process more than the git commits? What about collaborations on reviews? What about interactions on public mailing lists? For sure there are extra insights hiding in there, which would also open up non-development activities. At some point I'd like to get an idea of the promo people's and designers' activities too!
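As one concrete illustration of the "metrics on the graph structure over time" idea from the list above, here is a minimal sketch computing the density of an undirected network (the number of edges divided by the number of possible edges) for a few monthly snapshots. The snapshot numbers are made up for the example.

```python
def density(node_count, edge_count):
    """Density of a simple undirected graph: 2m / (n * (n - 1))."""
    if node_count < 2:
        return 0.0  # no pair of nodes, so no possible edge
    return 2 * edge_count / (node_count * (node_count - 1))


# Hypothetical monthly snapshots as (number of contributors, number of edges).
snapshots = {"2017-01": (4, 4), "2017-02": (5, 4), "2017-03": (5, 10)}
density_over_time = {month: density(n, m) for month, (n, m) in snapshots.items()}
```

A density of 1.0 means everyone shares at least one file with everyone else. Plotting such a value month after month would be one way to watch the network tighten or loosen without drawing the whole graph.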
Of course, there are other people doing such community analysis work out there, like GrimoireLab, Gitential and more... They are more providing off-the-shelf solutions than what I'm after. But probably some inspiration can be taken from them too!
The scripts I've been using for the visualizations above are available in my ComDaAn repository. Of course I hope to get them to evolve and to have new ones appear due to the questions listed in this post.
As you can see, it opens up a very large field and I'd like to explore more of those questions in the future and also try to apply them on other communities for which I likely have less preconceived knowledge and biases than KDE.