Tags: KDE, Community, Data Analytics
At the end of my previous post we concluded with yet another question. Indeed, on the 2017 KDEPIM contributor network we found out that Christian Mollekopf, while being a very consistent committer, didn’t appear as central as we would expect. Yet from the topology he seemed to act as a bridge between the core contributors and contributors with a very low centrality. This time we’ll try to look into this and figure out what might be going on.
My first attempt at this was to try to look into the contributor network on a different time period and see how it goes. If we take two snapshots of the network for the two semesters of 2017, how would it look? Well, easy to do with my current scripts so let’s see!
Alright, it still looks similar to the picture we got for the full 2017… Christian is still on the outer rings of our network and bridging toward low centrality nodes. The only difference is that he has a slightly higher centrality value than over the whole year. Needless to say, that first semester alone doesn’t teach us much. Time to look at the second semester then.
Ah-ha! Now we see something new: Christian is now mostly disconnected from the network! He is part of a clique containing him and Michael Bohlender. Looking further at their activity, they are indeed focusing almost exclusively on Kube. Michael was in fact one of those low centrality nodes Christian was bridging to previously.
So what are we looking at? It seems to be the birth of an insular sub-team in the KDEPIM community. It’s technically not a fork since they’re working on a specific piece of software, but this clique configuration indicates three things: they moved their focus there, they didn’t attract the rest of the KDEPIM community to contribute (yet?), and they stopped contributing completely to the wider KDEPIM effort (at least for the time frame we’ve been looking at). The community got split there.
Now we could leave it at that and consider it a detail… or… if you’re like me and want not only to produce those graphs and metrics but also wonder whether some of those things could be turned into useful tools for community stewardship in general and the Community Working Group in particular, you won’t stop there.
From the two networks above and the one I produced last time, it’s clear that we need to deal with time… With a single network we freeze time and get a configuration for a given period. If we ever want to detect something like the clique we saw appearing here, we need a less static view.
For the time being, we will look at the individual centrality of a contributor over time. For that we will get their monthly centrality value in the network over a three-month sliding window (previous month, current month and next month). Since it’s also interesting to have an idea of the contributor’s activity over the time period, we’ll also plot their normalized monthly activity. Finally, since centrality depends on the team size, we’ll plot the normalized team size over the period.
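To make that recipe a bit more concrete, here is a minimal Python sketch of the three series. It is not the actual script behind the plots, and it rests on several assumptions of mine: commits are available as hypothetical `(month, author, path)` records, two contributors are connected when they touched the same file within the window, centrality is plain degree centrality, and the team size is counted over the same three-month window.

```python
from collections import defaultdict
from itertools import combinations


def month_index(month):
    """Turn 'YYYY-MM' into a running month number so windows can cross years."""
    year, mon = month.split("-")
    return int(year) * 12 + int(mon) - 1


def degree_centrality(commits):
    """Degree centrality per author for a list of (month, author, path) records.

    Assumption: two authors are linked when they touched the same path.
    """
    authors_by_path = defaultdict(set)
    for _, author, path in commits:
        authors_by_path[path].add(author)
    neighbours = {author: set() for _, author, _ in commits}
    for authors in authors_by_path.values():
        for a, b in combinations(sorted(authors), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {author: 0.0 for author in neighbours}
    return {author: len(peers) / (n - 1) for author, peers in neighbours.items()}


def normalise(values):
    """Scale a series so that its largest value becomes 1."""
    top = max(values, default=0)
    return [v / top if top else 0.0 for v in values]


def sliding_series(commits, author, months):
    """Return (centrality, activity, team_size) lists, one value per month."""
    centrality, activity, team_size = [], [], []
    for month in months:
        centre = month_index(month)
        # Previous month, current month and next month.
        window = [c for c in commits if abs(month_index(c[0]) - centre) <= 1]
        centrality.append(degree_centrality(window).get(author, 0.0))
        activity.append(sum(1 for m, a, _ in commits if m == month and a == author))
        team_size.append(len({a for _, a, _ in window}))
    return centrality, normalise(activity), normalise(team_size)
```

Calling `sliding_series(commits, "some author", months)` with months given as `"YYYY-MM"` strings yields three lists of equal length that can be plotted on top of each other.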
Regarding that last plot, a few more words, because it’s a fairly important one. Volker Krause helped me realize this during the KDEPIM sprint, through his own plots and our discussions about them. Unfortunately it’s also what makes the centrality tricky to read. The centrality value of a node is a value between 0 and 1: if a node is not connected at all it gets a 0, if a node is connected to all other nodes it gets a 1. So obviously, if the team is large you need way more connections to get a high centrality than in a small team.
A corollary of the point above is that variations in centrality values are meaningful only while the team size is stable. If we’re in a period of decreasing or increasing team size, the centrality of a node can vary even though it has maintained the exact same connections! And that’s why we have the third plot on the team size in the graphs below: to get an idea of how much trust we can put in the variations of the centrality plot.
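A tiny numeric illustration of that effect, assuming the centrality in question is degree centrality (connections divided by the number of other team members). The figures are made up for the example and do not come from the KDEPIM data.

```python
def centrality_for(connections, team_size):
    """Degree centrality: share of the other team members one is connected to."""
    if team_size < 2:
        return 0.0
    return connections / (team_size - 1)


# Hypothetical contributor who keeps exactly 6 connections
# while the team shrinks from 40 to 20 to 10 people.
for team_size in (40, 20, 10):
    print(team_size, round(centrality_for(6, team_size), 3))
```

With unchanged connections the value more than quadruples between a team of 40 and a team of 10, which is exactly the kind of variation that should not be read as a change in behaviour.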
Alright, with that out of the way (although it’ll keep haunting us while reading those plots), it’s time to explore those plots. We won’t look at only one; I think it’s a good idea to look at more than one contribution pattern before coming back to Christian. To get there and keep those plots somewhat comparable, I’ll drastically expand the time period we’ll look at: instead of looking at 2017 only, we’ll go all the way back to 2007! This way we can see more of KDEPIM’s history and also get patterns from old timers. Let’s start!
So first things first, we see the evolution of the team size in KDEPIM over the last ten years. Interestingly we see the decrease that Paul Adams was pointing out in his last Akademy talk… but it’s not reaching the ground and it looks like it has been stable at least since 2014. Is it the case for the whole of KDE? Does the commit activity look the same globally? Clearly questions I’ll have to investigate as well, it never ends! :-)
In any case this variation on team size seems to indicate that we can look at the centrality variations from 2007 to end of 2009 or from 2014 to the end of 2017 somewhat safely. Of course the team size keeps varying but it’s more noise than a real trend so it should be fine overall.
With that in mind, what we can see from Till is a former core contributor who slowly stopped contributing. This is crystal clear just from his activity plot, and the centrality plot follows the same pattern as expected. It’s indeed less correlated with his activity in the 2010 to 2012 period, but that’s to be expected with the downward trend in team size.
This second graph is now about Volker Krause. We can see the peak activity he had during 2009, and because the team size was large at the time it took such a high activity for his centrality to spike as well. The mystery spike of September 2016 is what prompted the display of the team size plot. He had only very little activity that month, yet it generated a surge in centrality… well, it turns out that even though he did only a handful of commits, some of them were on build system files, which tend to be touched by others, and because the team was smaller than in 2009 the variation gets amplified.
So now that we’re done with our two “example” core contributors… let’s get back in the territory of the very active contributors of the past year…
Let’s look at Laurent in this third graph. Clearly he has been contributing to KDEPIM for a long time, but overall not at a very high volume. It really started to increase around 2012 so I guess that’s when he slowly took over maintainership of KMail. As expected that’s when we see his centrality rising as well, as he was getting involved with more and more components of KDEPIM. Of course it’s slightly amplified by the decrease in team size over the 2012 - 2014 period, but he kept getting more central even after that.
And finally, this fourth graph gets us back to Christian. Clearly he joined KDEPIM at the end of 2010; from that point on he looks like any other future contributor, with increased activity correlating with increased centrality (watch out for the decrease in team size until 2014, which amplifies the effect a bit over that period). Then during 2014 we have a somewhat stable centrality and activity. Some noise but nothing out of the ordinary over a year. It gets interesting after that though. During 2015 we see his activity increasing again but at the same time his centrality drops a first time. It then stays somewhat stable while his activity spikes. And toward the end of 2017 his centrality drops completely. This is a very different pattern from all the other contributors we looked at.
In my opinion, the interesting observation is that in the contributor network we see the clique appearing only in the second semester of 2017, but on the centrality graph we see this pattern of increasing activity with decreasing centrality starting in 2015! That’s two years before the community split becomes visible.
Now comes the question I have, and I think it’ll be a tough one so I might leave it unanswered for a little while. Could we detect this kind of pattern early? Could we detect it without too many false positives (even though there will always be some of them)?
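Just to give the question a shape, here is a deliberately naive Python sketch of what a first detector might look like. It is only a thought experiment under my own assumptions, not a validated method: it takes two monthly series for one contributor (hypothetical `activity` and `centrality` lists), fits a straight line over a trailing window, and flags the months where activity trends up while centrality trends down. The window length and the use of a simple least-squares slope are arbitrary choices.

```python
def trend(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def diverging_months(activity, centrality, window=12):
    """Indices of the months closing a window where activity rises
    while centrality falls."""
    flagged = []
    for end in range(window, min(len(activity), len(centrality)) + 1):
        rising = trend(activity[end - window:end]) > 0
        falling = trend(centrality[end - window:end]) < 0
        if rising and falling:
            flagged.append(end - 1)
    return flagged
```

As written it ignores the team size entirely, so it would happily raise false positives whenever the team grows; any serious attempt would at least have to restrict itself to periods of stable team size.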
I think it’s important to think about that because in this particular case, assuming we’d had such a tool, the Community Working Group would have been warned of a team split to come and could maybe have stepped in to see if they could save the situation. Currently our Community Working Group is mostly working in reactive mode, since they talk to people when a conflict emerges. With such a tool they could also try to be proactive and check on a team if the “increasing activity with decreasing centrality” pattern emerges. I think it would be nice if they could do this and talk to people before too many feelings were hurt.
It’ll take time to get there, if at all. But I think it’s worth looking into.