As you might remember, our dear Paul Adams decided to retire. This is a loss first of all because of the person, but he was also providing a very nice service in the form of community data visualization. He was famously known among us for his "green blobs" (later turned blue blobs) and contributor network graphs.
Note that he just took the "green blobs" idea from Adriaan de Groot and later on turned them blue... He might have made them popular in the process but it's unclear if that's due to the color change or his prose. ;-)
Anyway, he was doing that for communities other than KDE as well, but he has almost stopped now. For instance, he did it only once for Habitat in all of 2017. Luckily he published the scripts he was using in his git-viz repository, so not all the knowledge was lost.
Earlier this year, I decided to take up the torch and try to get into community data analytics myself. I got in touch with Paul to talk a bit about my plans. My first step was to try to modernize his scripts while staying true to his original visualization.
It turned into an almost complete rewrite, which I didn't quite expect. At the same time I wanted this modernization to be a good base for other visualizations and also for general data analytics. The most prominent remaining part is his git log parsing code, although I extended it to work properly across repositories and not just on a single one. Next to that, I'm now using pandas, networkx and bokeh for all the data processing and visualization descriptions. This resulted in nice, concise and maintainable code.
So you might wonder... What's possible now? Well, fairly similar visualizations to before, but now they can span more than one repository and they are fully interactive! No more fixed-resolution pictures: we generate fully dynamic HTML code.
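To make "parsing git log across repositories" a bit more concrete, here is a minimal sketch of the idea. It is only an illustration, not the code from the actual scripts: the log format string, the function names and the record fields are all assumptions.

```python
import subprocess
from datetime import datetime

# Hypothetical format: one "@@sha|author|date" header line per commit,
# followed by the files it touched (printed by --name-only).
# Assumption: author names do not contain the "|" separator.
LOG_FORMAT = "--pretty=format:@@%H|%an|%aI"


def read_git_log(repo_path):
    """Run git log in one repository and return its raw text output."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", LOG_FORMAT],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_git_log(text, repo_name):
    """Turn raw log text into one record per (commit, file) pair.

    Commits that list no files (e.g. many merge commits) yield no record.
    """
    records = []
    commit = None
    for line in text.splitlines():
        if line.startswith("@@"):
            sha, author, date = line[2:].split("|", 2)
            commit = {
                "sha": sha,
                "author": author,
                "date": datetime.fromisoformat(date),
            }
        elif line.strip() and commit is not None:
            records.append({**commit, "repo": repo_name, "file": line.strip()})
    return records


def load_repositories(repo_paths):
    """Concatenate the records of several repositories into one list."""
    records = []
    for path in repo_paths:
        records.extend(parse_git_log(read_git_log(path), repo_name=path))
    return records
```

A list of records like this can be handed to `pandas.DataFrame(records)`, with the `repo` field keeping track of where each commit came from.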
To validate the scripts I used them on the whole year 2017 for all of KDEPIM (that is the parts in KDE Applications, in Extragear and in Playground).
Firstly, this gives us the infamous blue blobs diagram to show contributor weekly activity in all those repositories in 2017:
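The data behind such a diagram is essentially a table of commit counts per contributor and per week. A minimal sketch with pandas could look like the following. It assumes a table with `author` and `date` columns (hypothetical names) and is not taken from the actual scripts.

```python
import pandas as pd


def weekly_activity(commits, freq="W"):
    """Count commits per author for each period (weekly by default).

    Dates are assumed to be naive or all in the same timezone.
    Returns a table with one row per author and one column per period;
    periods in which nobody committed at all do not appear as columns.
    """
    frame = commits.copy()
    frame["date"] = pd.to_datetime(frame["date"])
    counts = (
        frame.groupby(["author", pd.Grouper(key="date", freq=freq)])
        .size()
        .unstack(fill_value=0)
    )
    return counts
```

Each cell of the resulting table can then be mapped to a blob whose color depends on the count. Passing another `freq` (for instance `"D"` for daily) changes the granularity.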
Clearly we can spot Christian Mollekopf and Laurent Montel as the most consistent committers throughout 2017. This should come as no surprise, since they are almost single-handedly maintaining Kube/Sink and the rest of KDEPIM respectively. Daniel Vratil, maintainer of Akonadi, is also very active and noticeable.
Secondly, this also gives us back the contributor network graphs. Here I made a small exception and used "Fruchterman & Reingold" for the force-directed layout instead of "Kamada & Kawai". This is simply a personal preference: I find that in practice "Fruchterman & Reingold" is a bit more aggressive at keeping the center for the cluster of most connected (core) contributors, although it sacrifices a bit of readability. So for all the KDEPIM repositories in 2017, we obtain the following network:
Surprisingly, we can spot two disconnected nodes. Those two contributors touched files no one else touched in 2017. After investigating those two, it turned out to be nothing out of the ordinary: they were very self-contained one-off contributions for default SPAM settings and for improved wording in the GUI. These are valuable, but indeed don't necessarily require very deep integration in the core contributors network.
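For those who want to compare the two layouts themselves, networkx ships both. The helper below is a hypothetical sketch (the function name and the `method` values are made up for illustration), not necessarily how the graphs in this post were produced.

```python
import networkx as nx


def contributor_layout(graph, method="fruchterman_reingold", seed=42):
    """Return a dict mapping each node to its 2D position."""
    if method == "fruchterman_reingold":
        # spring_layout is networkx's Fruchterman-Reingold implementation;
        # the seed makes the otherwise random initial positions repeatable.
        return nx.spring_layout(graph, seed=seed)
    if method == "kamada_kawai":
        # This layout needs SciPy to be installed.
        return nx.kamada_kawai_layout(graph)
    raise ValueError(f"unknown layout method: {method}")
```

Calling it twice on the same graph, once per method, and plotting the positions side by side is an easy way to judge which one keeps the core contributors more readable.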
Then if we zoom in, we can easily spot the core KDEPIM contributors in 2017: Laurent Montel, Daniel Vratil and Volker Krause. They are the ones who connected most to other contributors via their commits last year. Of course this is a bit of a visual check and as such not very scientific.
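To go beyond the visual check, the same network can be queried directly. The sketch below assumes that two contributors are linked whenever they touched the same file, which matches the description of the disconnected nodes above; the function names and the input shape are hypothetical.

```python
from itertools import combinations

import networkx as nx


def build_contributor_network(touches):
    """Link two authors whenever they both touched the same file.

    `touches` is an iterable of (author, file) pairs.
    """
    authors_by_file = {}
    graph = nx.Graph()
    for author, path in touches:
        graph.add_node(author)
        authors_by_file.setdefault(path, set()).add(author)
    for authors in authors_by_file.values():
        graph.add_edges_from(combinations(sorted(authors), 2))
    return graph


def most_connected(graph, count=3):
    """Return the `count` authors with the highest degree centrality."""
    centrality = nx.degree_centrality(graph)
    return sorted(centrality, key=centrality.get, reverse=True)[:count]
```

With such a graph, `nx.isolates(graph)` lists the contributors who share no file with anyone else, and `most_connected(graph)` gives a first, crude ranking of the core contributors.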
Which leads me to the "what's next?" question.
Now I plan to build on that work and add more tools and analyses. Paul's scripts and graphs were an excellent start, which is why I did my best to stick to them. But now it's time to add more! There are various questions which can be pursued:
- Are the four levels of colors for the activity visualization enough? Could we get better insights with a different palette? Should we get closer to a heat map?
- Could we get some more insights from using a different frequency than weekly for the activity graph? Can we learn from daily activity over shorter periods? Or from monthly activity over longer ones?
- Should we really look at the contributor network as is? Could we plot it over time and see it evolve dynamically? Are there insights there, or would it just look pretty? Should we have some color coding on the nodes and use different layouts?
- Should we have a higher level view on the contributor network? Maybe we would get more information from finding cliques and plotting their relationships?
- Or should we ditch the contributor network representation altogether? Should we instead plot metrics on the graph structure itself over time (like the density or connectivities)?
- Of course we can come up with even more visualizations and analyses departing further from activity and contributor networks. For instance, I suspect we could ease copyright attribution from files a bit, to make it easier to contact contributors in case of license changes. I had to do it once for a couple of files and it's a fairly manual and error-prone process for now.
- And can we process more than the git commits? What about collaboration on reviews? What about interactions on public mailing lists? For sure there are extra insights hiding in there, which would also open things up to non-development activities. At some point I'd like to get an idea of the promo people's and designers' activities too!
Of course, there are other people doing such community analysis work out there, like GrimoireLab, Gitential and more. They provide more of an off-the-shelf solution than what I'm after, but some inspiration can probably be taken from them too!
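As an illustration of the "metrics over time" idea from the list above, here is a minimal sketch computing the density of the contributor network for each period. The input shape and the function name are assumptions, and the network is again built by linking contributors who touched the same file within a period.

```python
from itertools import combinations

import networkx as nx


def density_per_period(touches):
    """Compute the contributor network density for each period.

    `touches` is an iterable of (period, author, file) triples, where
    `period` is any sortable label such as "2017-01".
    """
    authors = {}  # period -> file -> set of authors
    for period, author, path in touches:
        authors.setdefault(period, {}).setdefault(path, set()).add(author)

    densities = {}
    for period, by_file in sorted(authors.items()):
        graph = nx.Graph()
        for group in by_file.values():
            graph.add_nodes_from(group)
            graph.add_edges_from(combinations(sorted(group), 2))
        densities[period] = nx.density(graph)
    return densities
```

The resulting dictionary can be plotted as a simple line chart. For the clique idea, `nx.find_cliques` on each period's graph would be a natural starting point.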
The scripts I've been using for the visualizations above are available in my ComDaAn repository. Of course I hope to get them to evolve and to have new ones appear due to the questions listed in this post.
As you can see, this opens up a very large field. I'd like to explore more of those questions in the future, and also try to apply them to other communities for which I likely have less preconceived knowledge and fewer biases than for KDE.