If you are interested in community data analytics, you will have several opportunities to discuss them during Akademy.
Firstly, there will be my talk titled Bringing Community Data Analysis Back to KDE (why the hell did I use "Analysis" there... I only used "Analytics" everywhere so far, odd). It will happen on Saturday at 15:30 in room IE7. The slot is a bit small for the topic, but I'll try my best to create interest. Of course you can also catch me between talks to chat about it, and...
Secondly, there will be a BoF "Discussing Community Data Analytics" on Monday at 10:30 in room 127. We hope to see people coming up with interesting questions to explore or willing to lend a hand in those explorations. See you there!
And of course it also means I'm on my way to Vienna, see you all during Akademy 2018!
In my previous post I played with the team size and activity metrics on several communities to see what would come out of it. To me this wasn't necessarily the most interesting thing I posted (it's rather basic in what it presents) but somehow it's the one which triggered the most comments, especially in the KDE community. Looks like I struck a nerve. :-)
Anyway, it got quite a lot of good comments, so I thought it deserved a follow-up post with a different tone. For the record, I generally try to avoid putting too much of my own personal opinion in posts where I present metrics. I think it's sane to try to shield facts on the data from my biased position. It's obviously super hard, if not impossible. Indeed, at a minimum I'm forced to mention potential events in the time frame considered (if I know them)... it's risky, but still I do it because otherwise things would be just very dry and super annoying to read! And I think that's why the previous post struck a nerve, but more on that below.
In the rest of this post, I'll pick extracts from the comments I got, in no particular order, each followed by my own opinion. So contrary to my usual "data posts", the cursor between factual presentation and opinion piece will very much point toward "opinion piece". Be warned! ;-)
More Information About the Rust Community
I got a very nice comment from Florian Gilcher, I won't address it all, but I'll add my reactions to some of the extracts.
This doesn't mean that we have an explanation for everything. But I can certainly say that the slowdown in 2015 is a couple of people going on holidays and taking some time off for once. There was a lot of exhaustion and also tension in the community at that time.
Thanks a lot for the confirmation! When looking at the history of the project from the outside it seemed the most logical conclusion, glad to see I wasn't far off for once. ;-)
While I agree that we will probably not hold this growth forever, we are doing a lot of intentional work to make that happen. One of them is constantly reorganising the project and actively pulling people in.
That's what I got from attending a couple of RustFests. I was very much impressed by the efforts going into growing the community. It's very proactive and welcoming, while KDE has a more passive stance even though it's a very welcoming community as well.
We take a very clear stance that we want to recognise all work that people have done for the project and invest time in that, for example with projects like https://thanks.rust-lang.org/.
Wow, didn't know about that page! It's a great idea. I wonder if KDE has something similar, it's probably worth producing. Each application has an about box with some people, but there's nothing structured for the frameworks, also I like the breakdown per version used by Rust. Probably something I could script, I'll look into it I think.
We try to keep all processes as lean as possible and especially don't try to impose a huge process on first-time or small contributors. Contribution effort should scale with the size of it.
This is an interesting concern: scaling the process with the size of the contribution instead of going for a one-size-fits-all contribution model. To be kept in mind for sure.
Also, I don't want to say that we do everything right - the contrary. But we have gotten our project used to reflection and change.
Right, I think it's important to maintain a culture which allows both passing on the wisdom of older contributors and being open to change.
Debating the Impact of Tooling
From Florian Gilcher still, we got a point about tooling:
Talking a little about your readings here: In contrast to other comments, I don't find it unreasonable to attribute the bump in KDE to the change of tooling. [...] Free contribution also means that leaving is easy. And some people don't want to pay the mental tax of learning new tooling.
Of course I agree with that, and I think that's why KDE lost some of the old guard and quite a bit of the drive-by contributions (Git was really painful at the time). Also, on top of the mental tax of learning the new tooling there's Conway's law to take into account: a switch means organizational changes, which in turn mean quite some communication generated by it, updating our wiki pages for onboarding, team building and so on.
And we got a similar stance from Anton Latukha as well:
I strongly agree on "documented process of 100 steps or to follow" for contribution versus "documented process of 5 steps". It is so much, that I wanted, but really never bothered to contribute. It is just too much rules and info to fit-in myself for me to make drive-by commits.
I strongly agree with that point of course! It's not related to the "was the dip caused by the switch to git" debate, since most of the reasons why it's hard to contribute to KDE predate that switch anyway. But it's a very important thing to keep in mind if we want to improve. We don't get as many drive-by contributions as we could and it's unfortunate. People's expectations are that it should be much easier than it is. That's why at last year's Akademy we gave the talk Looking at the Application Developer Story with David Faure.
Of course there's also another position in that debate, which considers the tooling irrelevant in the contribution history of KDE.
First from Luke Parry:
In short, the excitement has gone. KDE has just become a utility that works pretty well for a DE. Yet, It is not pushing any boundaries of how we interact and work with a desktop.
It's indeed another factor to take into account and I agree with that. For some parts, KDE is just a commodity: people are glad it's around, but since it's not exciting enough anymore they don't feel compelled to contribute. Note this is worrying though because it means the community (not the software) has fallen into a kind of Tragedy of the Commons trap. That's why I think it's important to hear the comments from Florian above, they show a path forward: the KDE Community should be consciously groomed like a proper commons (aka we're doing great on the software side, not so much on the people side).
Then we got comments from Martin Flöser, which are very much on the "switch to git is irrelevant" side:
Personally I don't believe in the git theory for the kde community. It just doesn't fit. KDE is a highly technical community, our code has partially a superb quality. Why should the community metrics change just because the introduction of git?
I would say: just see Florian's and Anton's comments above. There's a cognitive cost to such changes, especially when git was really a pain to use.
Instead I would rephrase the question: what resulted in the 2010 peak? For this I see two main events in the KDE community: KDE 4 and Nokia buying Qt.
I agree about KDE 4 resulting in more activity, but not all the way to 2010. As for Nokia, well... I really don't think it brought much more to KDE than extra sponsorship. One particular project saw extra contributions due to Nokia, but it wouldn't account for a big increase in activity by a long shot.
After the release of KDE 4 new developers were brought in.
That's the thing, we always had a strong contributor influx. Except during KDE 4 preparation (it was very hard for people to join because everything moved all the time as the increased activity indicates) and after 2010... which is where we're debating the reasons still. :-)
If I remember correctly you still were a student at that time (or did you already do your phd?)
I already had my PhD at that time. Besides, I don't know if it goes like this for all PhDs everywhere, but it was really a full time job in my case. I was definitely not working like a student anymore through my PhD. Anyway, very specific and not very relevant so I won't dwell on this more. ;-)
So what did happen around 2010 that we did not get the new students in? My answer to that is Bologna and Android. From talking with Bachelor students around that time I got the impression that they don't have the time anymore to do things as open source development next to their studies. The second thing I mention is Android. I think for students we were no longer the interesting and attractive community to join. Why hack on the old thing if you could do Android apps?
I think that's where we touch the crux of the issue in this debate. I very much agree with the KDE 4 impact and the commodification of the KDE products. Now, I tend to ignore them because we can't act on them, and that's why I talk much more about the tooling: I know we can act on that part!
And that's where you disagree, you think that the tooling change had no impact on the existing community. Well, I'm pretty sure at least part of the "previous developers" were driven away both because of personal reasons and the change of tooling. The two collided: learning new, half-done, annoying-to-use tooling while having less spare time? No way they'd do that. Remember, git back then was really very difficult to learn with all its quirks.
Second reason why the tooling matters: the students changed. It's not only about the time available and making Android apps being more hip. Post-PhD you might remember I was involved in a university program setting up student projects. They had the time, they had cool things to pick from, so only stuff they were motivated to work on. Still, over the years I could see it was less and less natural for them to contribute to KDE. There has been a strong cultural shift I witnessed over the course of a few years.
It became much easier for them to use git, but the paradox for me is that it became much harder for them to use the rest of our tools. Somehow as various generations of students became more skilled in Git, they also became more influenced (or brainwashed, pick your position) by the GitHub contribution model. And nowadays it's for them the only true way of contributing. Our current processes for contributing are thus looking very alien and preposterous to your average student. I don't like that, but that's what I witnessed.
Also Dominik points out git is a standard:
But git is the de-facto standard, it's not to blame for the decrease - well, at least not in the last 5 years.
Yes definitely. I was thinking of "git back then", which is in part why we lost a fair share of the old guard as mentioned above. Nowadays git is fine, but as I mentioned above the other tools seem to get in the way for new contributors.
About Comparing Projects
We got a couple of interesting points from Boudewijn as well revolving around comparing projects:
I think you should superimpose the rust graph on KDE's graph --- put the current 2018 location on KDE's 2010 location. Everything has a hype curve, skyrockets, drops then settles.
Yes, I definitely agree there and I mentioned that at the end of my previous post: the Rust curves currently look similar to KDE's early days. That being said, even if we do what you propose with the curves, there's still a difference between KDE and Rust in my opinion. Rust's team size and commit count variations are much more strongly correlated than KDE's. The only time KDE had it in a similar fashion is roughly its first five years, as far as I can tell after quickly checking. Rust has kept that up for almost twice as long now. This is interesting in itself I think.
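To put a number on how strongly the two curves move together, one option is a Pearson correlation coefficient over the weekly samples. This is only a sketch: the post reads the correlation off the plots, so the choice of Pearson and the pearson helper name are assumptions, and the sample values in the usage note are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series.

    Returns a value in [-1, 1]; values close to 1 mean the two series
    rise and fall together.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(spread_x * spread_y)
```

Feeding it the weekly team sizes and the weekly commit counts of a project (for instance pearson([10, 12, 15], [80, 95, 130]) on made-up numbers) gives a single figure that can be compared between projects or between periods of the same project.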
It would also be interesting, but controversial, to make this graph for larger kde projects -- libs/frameworks, plasma, krita/calligra (that doesn't matter much, krita was always the larger part of koffice/calligra development), kdenlive, digikam, pim.
I don't think this would be controversial. And that's actually one of the gazillion things I'd like to do... so many angles to look at those things, I can keep myself busy for the next ten or twenty years I think. I kind of touched it a bit for PIM in my previous posts by the way.
On the Definition of Activity
Sho opened the debate on the commit count metric itself:
I think our commit count going down has a lot to do with how we do code review - which we didn't do before.
This could be a factor for the 2010 drop, I somehow doubt it is a strong one because I seem to recall we were already doing reviews in 2009. Now I agree that we likely increased the amount of reviews at some point, but it didn't take 6 years to do that. That'd account for some of the decreasing trend but not all.
This makes it difficult to assess activity by comment count alone without looking at diff sizes too.
Yes, this is something I'd like to explore better. Currently I use commit count for the activity, but I know it's a poor proxy for it: not all commits are born equal, and reviews are also work and collaboration. For the reviews, I can likely get a partial history through the Phabricator API, but that will never go back many years. For the "value" of a commit, I still didn't find an approach I liked from what I drafted... still looking for one.
About the Choice of Phabricator
Finally, I got an unexpected comment from Anton Latukha regarding Phabricator:
Phabricator community/development is already virtually dead: https://www.openhub.net/p/phabricator
Obviously a very important point... It's really a shame because I personally like Phabricator quite a bit despite the fact it looks foreign to a fair share of our contributor base. Of course it makes things concerning since it became a very central piece of KDE's infrastructure.
Really it pains me (did I say I like Phabricator?) but between Phabricator's declining contributor base and what I've seen with students who consider it confusing, I wonder if we should reconsider it. I hate gratuitous tooling changes (see my points on their price above) but it looks like the price of staying with that particular choice might sooner or later increase by a lot... and it's already high (see my points above about it looking alien to today's students). :-/
If you remember my previous installment I raised a couple more questions which I pointed out as tougher to address and which I'd keep on the side for a while. Well, I decided to look at something simpler in the meantime... which took more time than expected.
First I thought I'd try to reproduce the cohesion graph from Paul's Akademy 2014 talk... but it looks like we have a reproducibility issue on that one. However hard I try I don't manage to reproduce it. What I get is very different, so either there's a bug in my attempt at a script, or there was a bug in Paul's script, or somehow the input data is different. So one more mystery to explore, I'm at a loss about what's going on with that one so far.
Then I realized that already more than a month had passed fiddling with that particular issue so I looked for something else. I still wanted something simple. That's why I went for weekly activity and team size for the whole history of some repositories. Sure everyone does that, but I looked at how we could make this kind of thing more readable to have a better idea of trends. Indeed most of the graphs I look at tend to be noisy. For instance if we look at Thiago's graphs about the Qt community, there's clearly valuable information but the ones covering the history "since the beginning" don't convey much about the trends. That's what I'll try to improve in the current installment. Similar data, just a different way to look at it.
The Transparent KDE Community
Let's start with the community closest to home: KDE.
First a note on the production of the graph itself. This is done by looking at all the available history of all the git repositories I could clone from KDE (I'm using kdesrc-build for that). The small dots are the absolute data points sampled per week. Since they (obviously) exhibit quite some noise, I then apply to them a low-pass filter which gives us the final line plots on top of the data points. They give us less accurate absolute value but better clues on the actual trends. Also note that the two curves are on different scales (one on the left the other on the right) so don't get confused about that. Again it's not about comparing absolute values between the curves but trends.
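To make the sampling and filtering steps concrete, here is a minimal sketch in Python. The post doesn't say which low-pass filter is used, so the exponential smoothing below and its alpha parameter are assumptions, and the weekly_counts and low_pass names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def weekly_counts(commits):
    """Count commits per ISO week.

    commits is an iterable of (day, author) pairs where day is a
    datetime.date. Returns {(iso_year, iso_week): commit_count}.
    """
    counts = defaultdict(int)
    for day, _author in commits:
        iso = day.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


def low_pass(values, alpha=0.1):
    """Single-pole low-pass filter (exponential smoothing).

    A smaller alpha gives a smoother curve which reacts more slowly
    to changes in the raw weekly values.
    """
    smoothed = []
    previous = None
    for value in values:
        if previous is None:
            previous = value  # start the filter at the first sample
        else:
            previous = alpha * value + (1 - alpha) * previous
        smoothed.append(previous)
    return smoothed
```

The raw weekly values would be drawn as the small dots and the output of low_pass (applied to the values sorted by week) as the line on top of them. A centered moving average would be another reasonable choice of filter.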
Before really focusing on the data in the graph above, a note on the team size. Here we consider the size of the "team" producing commits for a given week. It's not telling us the whole size of the community, it's not even telling us the whole size of the developer community. It just tells us how many developers from the community have been active that particular week. So obviously the community size is larger than that. We have a simple model here, evaluating the community size would require something much more complex, with heuristics on when someone can be considered as not participating anymore. I'll sound like a broken record but this simple model is still relevant for showing trends in the activity level of a community so I'll stick with it for now.
With all of that in mind, it's interesting to see that the active part of the community grew steadily until around 2010. This is clearly the tipping point and so after 2010 the community started to be less active. My gut feeling is that it's been also shrinking but the graph is no proof of that. At its peak there were around 200 people active each week, now it seems to be around 100 on average (yes I'm rounding aggressively here). The good thing is that from the plots, it seems it has stabilized since 2016, but only time will tell if it stays stable or somehow grows again.
Also interesting to note is that the peak and beginning of the shrink is around the same period as the drop in cohesion pointed out by Paul in his Akademy 2014 talk.
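The team size model described here boils down to counting distinct commit authors per week. A minimal sketch, assuming commits are available as (day, author) pairs and that author identities are already deduplicated; the weekly_team_size name is hypothetical.

```python
from collections import defaultdict
from datetime import date


def weekly_team_size(commits):
    """Number of distinct authors with at least one commit per ISO week.

    commits is an iterable of (day, author) pairs where day is a
    datetime.date. Returns {(iso_year, iso_week): team_size}.
    """
    active = defaultdict(set)
    for day, author in commits:
        iso = day.isocalendar()
        active[(iso[0], iso[1])].add(author)
    return {week: len(authors) for week, authors in active.items()}
```

Someone who committed ten times in a week counts once, and someone who didn't commit that week doesn't count at all, which is why this measures the active team rather than the community size.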
So why the shrink? What happened around 2010? The only thing which comes to mind is our change of tooling... it's in fact the only thing which can explain both a reduction in cohesion and in size. Clearly we lost something with the switch to git, existing contributors were perhaps less motivated and newcomers were perhaps not joining as much.
It might sound surprising now that git is a big deal and extremely popular... but at the time it wasn't really a walk in the park. KDE has been an early adopter of git and for people with limited spare time it was yet another thing to learn. Now git is fashionable and learned by most but contributors are not coming back to us, so something is still amiss. Maybe our tooling is still too fragmented? Indeed people are used to very uniform platforms like GitHub or GitLab nowadays with a very organized view on the code and having to deal almost exclusively with git commands and a bit of web interface, we're nowhere near that. Maybe building our projects is still too complicated? Indeed people are used to grabbing the code, running a single build command and have the thing built and ready to run and hack on... we're nowhere near that.
That's why it's important that the onboarding of new contributors is now one of KDE's goals. Hopefully it will make sure that we don't just stabilize as we did now but start to grow again. I'm slightly concerned that it seems to focus mainly on documenting the status quo without necessarily improving the tooling. Don't get me wrong, documenting how to join us is super important! It's just that it needs to be made simpler as well. Following a documented process of 100 steps doesn't have the same impact on people interested in joining as following a documented process of 5 steps.
Now that I rambled about KDE... what about other communities?
The Dual-Licensed Qt
Let's check on our friends from the Qt project! It's more of an industry type of community, plenty of people paid to contribute, the project is backed by a commercial owner. For that community I looked only at "Qt itself" which is not as easy to define as you would think. I basically went for the two main products, Qt and Qt Creator: the qt repository (containing the Qt4 history), the qt5 repository and all its sub-modules, and the qt-creator repository. That covers fairly well what you get if you install an SDK with only the Free Software components. Note that the history isn't as complete as in the KDE case so it's not going back to before the governance of Qt became open. This means we won't see all the way to Qt's creation and it's likely that the beginning of the curves won't be reliable since they won't follow the right commit patterns and show instead big bulks of code in a limited number of commits by a limited number of people.
What a surprise! We can see it is very slowly getting less and less activity over time. Both the number of commits and the number of people active in a given week have been stagnating or going down since 2010. Again before 2010 the numbers can't be trusted, but the graph reads as a decrease in activity as soon as the governance got opened. And there's no way to know if that was the trend already before opening the governance, we can't even gauge the correlation there.
Another surprise is the surge on the team size plot around 2012. It created a small surge in commits too but didn't change the overall trend on the commit count. This period would require an investigation of its own to get a clearer picture on its cause. My current theory after checking the "per employer activity" graph done by Thiago is that it seems to correlate both with KDAB getting much more involved in the development and KDE's effort toward KDE Frameworks creation.
As for the overall trend toward less activity, should we start to worry about Qt's health? Well, it depends what you are considering. Qt as a product, I wouldn't worry yet. If we look at absolute numbers it still clocks around 200 commits with around 50 contributors each week. For such a product it doesn't strike me as a very low maintenance level. Qt as a community on the other hand... if the numbers are indeed correct (remember how I defined the corpus leading to the history we're looking at: it might have a blind spot), I see no way to spin it positively. It is clearly a shrinking community (much like KDE as we've seen above).
Still, the Qt Company is around, they're in business and they currently seem to be trying to hire. So it's likely that there is a slow shift from the main repositories to other repositories (potentially non public). Not necessarily bad news for the product since it'll likely mean new features getting in down the line, etc. Even so, it's not ideal community-wise since it's harder to contribute.
The Crazy Multimedia People From VLC
And now what about VLC? It is after all one of the most successful pieces of Free Software out there. It's very specialized in its domain though (multimedia), so maybe it's not showing the same activity profile as others? That's what we're going to look into.
Note that here I'm focusing only on vlc itself, but the VideoLAN Organization does more than just VLC. So please don't compare the plots below to KDE, it'd be a slightly unfair comparison to both. It'd be as if I was plotting only one product from KDE.
Still I was curious about VLC itself since it's the least arcane of the projects from VideoLAN. Also I didn't find the time to produce an extensive and definitive list of the VideoLAN repositories. Something I'd like to do later though to have a more complete picture.
At a glance we can see a very different profile compared to the previous two and as such it was worth producing those plots. It seems to have a very stable community. The trends are clear, since 2003 we got a fairly stable team size and mostly stable commit count. That being said there are two points worth noticing.
First, we can see a five-year period between 2007 and 2012 where the commit count is much more of a bumpy ride. Similarly on that period we see more of the community active at the same time, then dropping again. It seems to match with a period of new ports of VLC on more platforms and the work leading up toward VLC 2.0. Surprisingly we don't see a similar pattern toward the preparation of VLC 3.0.
Second, despite a mostly stable team size since 2012, we can see a constant increase in commits over time. So it looks like the path leading up to VLC 3.0 had a different pattern than the one leading up to VLC 2.0. The activity increased but mostly in commits, with roughly the same number of people each week. That means the number of commits per person was higher during that period than before. This is a clear change of pattern since VLC 2.0. My current theory would be that it could be caused by the creation of Videolabs which is a company created in 2012 and employing mostly VLC developers. This company provides services around VLC and multimedia. It is the only event that I know of which would explain that plot. That being said and as mentioned above I have only a partial view of the VideoLAN history here, so take that theory with caution.
The Very Successful Rust
And last but not least, I wanted to take a very quick peek at Rust. It's very different from our previous cases, no application or frameworks in the traditional sense but a language. It seems very popular among developers using it, and I'm personally interested in it, hence why it is in that post.
Due to its nature it's even harder to choose a corpus of repositories to define it... Should I take just the compiler? Other tooling? Documentation? Should I try to reach toward the whole ecosystem since it's a language? I decided to go for compiler, tooling and documentation (that is mostly code coming from rust-lang and rust-lang-nursery). It made sense to go for those because they really are an integral part of the "Rust experience" if you look at it as a coherent product. Just the compiler would be clearly too small, and the whole ecosystem would drown us in data which would just tell us how popular Rust is to its users which is not what we're after here.
First word which comes to mind: wow! Indeed, it's very successful and clearly skyrocketing currently. There's just a slowdown in 2015 for which I have no good explanation, maybe people were tired after releasing 1.0? If someone who knows the Rust community intimately has another theory I'm very eager to hear it.
Anyway, apart from 2015, both the team size and the commit count are still correlated and they just go up, and up and up. Clearly they are doing something right and are very successful at attracting contributors to Rust itself. I'd say it's not just a fad with people playing with it and making libraries or apps with it. It looks like they manage to convert users into contributors very successfully. Well done!
Now of course it's a much younger project, so only time will tell when it will plateau and if it starts shrinking again. For now, it's clearly looking similar to the first years of KDE. Maybe the KDE community should look more at Rust and find ideas on how to be so popular again.
At the end of my previous post we concluded with yet another question. Indeed, on the 2017 KDEPIM contributor network we found out that Christian Mollekopf, while being a very consistent committer, didn't appear as central as we would expect. Yet from the topology he seemed to act as a bridge between the core contributors and contributors with a very low centrality. This time we'll try to look into this and figure out what might be going on.
My first attempt at this was to try to look into the contributor network on a different time period and see how it goes. If we take two snapshots of the network for the two semesters of 2017, how would it look? Well, easy to do with my current scripts so let's see!
Alright, it still looks similar to the picture we got for the full 2017... Christian is still on the outer rings of our network and bridging toward low centrality nodes. Only difference is that he has a slightly higher centrality value than during the whole year. Needless to say, just that first semester doesn't teach us much. Time to look at the second semester then.
Ah-Ah! Now we see something new, Christian is now mostly disconnected from the network! He is part of a clique containing him and Michael Bohlender. Looking further at their activity they are indeed focusing almost exclusively on Kube. Michael was in fact one of those low centrality nodes Christian was bridging to previously.
So what are we looking at? It seems to be the birth of an insular sub-team in the KDEPIM community. It's technically not a fork since they're working on a specific piece of software, but this clique configuration indicates they moved their focus there, they didn't attract the rest of the KDEPIM community to contribute (yet?) and they stopped contributing completely to the wider KDEPIM effort (at least for the time frame we've been looking at). The community got split there.
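One simple way to spot this kind of split automatically is to look for small groups of contributors with no connection at all to the rest of the network. The sketch below computes connected components with a plain graph traversal; it is only an illustration, not the method used for the networks in this post, and it would miss a sub-team which is "mostly" but not fully disconnected. The connected_components name and the contributor names in the usage note are hypothetical.

```python
def connected_components(nodes, edges):
    """Split an undirected contributor network into connected components.

    nodes is an iterable of contributor names and edges an iterable of
    (a, b) pairs, both ends being in nodes. Returns a list of sets,
    largest component first.
    """
    nodes = list(nodes)
    neighbours = {node: set() for node in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)
```

With nodes a to e and edges (a, b), (b, c) and (d, e), this returns the main group {a, b, c} followed by the isolated pair {d, e}. Any component other than the first one is a candidate insular sub-team for the period considered.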
Now we could leave it at that and consider it like a detail... or... if you're like me and want not only to produce those graphs and metrics but wonder if some of those things could be turned into useful tools for community stewardship in general and the Community Working Group in particular, you won't stop there.
From the two networks above and the one I produced the last time it's clear that we need to deal with time... From a single network we freeze the time and get a configuration for a given period. If we ever want to see that something like the clique we saw appearing here can be detected we need a less static view.
For the time being, we will look at the individual centrality of a contributor over time. For that we will get their monthly centrality value in the network over a three-month sliding window (previous month, current month and next month). Since it's also interesting to have an idea of the activity of the contributor over the time period, we'll also plot the normalized monthly activity of the contributor. Finally, since centrality is dependent on the team size, we'll plot the normalized team size over the period.
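The sliding window can be sketched as follows. Several things here are assumptions and not confirmed by the post: two contributors are considered connected when they touched the same file within the window, the centrality is the normalized degree centrality (neighbours divided by team size minus one), and the input is a list of ((year, month), author, path) records. The shift_month and window_centrality names are hypothetical.

```python
from collections import defaultdict


def shift_month(year, month, delta):
    """Return the (year, month) pair delta months away from (year, month)."""
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1


def window_centrality(changes, year, month, contributor):
    """Centrality of a contributor over a three-month sliding window.

    changes is an iterable of ((year, month), author, path) records.
    The window covers the previous, current and next month.
    """
    window = {shift_month(year, month, delta) for delta in (-1, 0, 1)}

    # Group the authors active in the window by the files they touched.
    authors_by_file = defaultdict(set)
    for when, author, path in changes:
        if when in window:
            authors_by_file[path].add(author)

    team = set().union(*authors_by_file.values())
    if contributor not in team or len(team) < 2:
        return 0.0

    # Assumption: sharing a file within the window creates a connection.
    neighbours = set()
    for authors in authors_by_file.values():
        if contributor in authors:
            neighbours |= authors - {contributor}
    return len(neighbours) / (len(team) - 1)
```

Calling window_centrality for every month of the period gives one centrality curve per contributor, and the size of the team set for each window gives the matching team size curve.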
A few more words regarding that last plot, because it's a fairly important one. Volker Krause helped me realize that during the KDEPIM sprint, through his own plots and our discussions about them. Unfortunately it's also what makes the centrality tricky to read. The centrality value of a node is a value between 0 and 1: if a node is not connected at all it gets a 0, if a node is connected to all other nodes it gets a 1. So obviously, if the team is large you need way more connections to get a high centrality than in a small team.
A corollary of the point above is that variations in centrality values are meaningful only while the team size is stable. If we're in a period of decreasing or increasing team size, the centrality of a node can vary even though it maintained the exact same connections! And that's why we have the third plot on the team size in the graphs below: to get an idea of how much trust we can put in the variations of the centrality plot.
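The team size effect is easy to reproduce on a toy network. The sketch below assumes a normalized degree centrality (number of neighbours divided by team size minus one), which matches the 0 to 1 description above but isn't confirmed as the exact measure behind the plots; the degree_centrality name and the single-letter contributors are hypothetical.

```python
def degree_centrality(nodes, edges):
    """Normalized degree centrality of every node of an undirected network.

    nodes is an iterable of names and edges an iterable of (a, b) pairs,
    both ends being in nodes. A node without connections gets 0.0, a node
    connected to every other node gets 1.0.
    """
    nodes = list(nodes)
    neighbours = {node: set() for node in nodes}
    for a, b in edges:
        if a != b:  # ignore self loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    others = len(nodes) - 1
    return {
        node: (len(neighbours[node]) / others if others > 0 else 0.0)
        for node in nodes
    }


# Same connections for "a", two different team sizes.
edges = [("a", "b"), ("a", "c")]
small_team = degree_centrality(["a", "b", "c", "d"], edges)
large_team = degree_centrality(["a", "b", "c", "d", "e", "f", "g"], edges)
```

Here "a" keeps exactly the same two connections, yet its centrality is 2/3 in the team of four and 1/3 in the team of seven. Read in the other direction, a shrinking team makes the centrality go up without the contributor doing anything different.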
Alright, with that out of the way (although it'll keep haunting us while reading those plots), now it's time to explore those plots. We won't look only at one, I think it's a good idea to look at more than one contribution pattern before coming back to Christian. To get there and keep those plots somewhat comparable I'll drastically expand the time period we'll look at, instead of looking at 2017 only, we'll go all the way back to 2007! This way we can see more of KDEPIM's history and get patterns also from old timers. Let's start!
So first things first: we see the evolution of the team size in KDEPIM over the last ten years. Interestingly we see the decrease that Paul Adams was pointing out in his last Akademy talk... but it's not reaching the ground and it looks like it has been stable since at least 2014. Is it the case for the whole of KDE? Does the commit activity look the same globally? Clearly these are questions I'll have to investigate as well, it never ends! :-)
In any case this variation in team size seems to indicate that we can look at the centrality variations from 2007 to the end of 2009 or from 2014 to the end of 2017 somewhat safely. Of course the team size keeps varying but it's more noise than a real trend so it should be fine overall.
With that in mind, what we see with Till is a former core contributor who slowly stopped contributing. This is crystal clear just from his activity plot, and the centrality plot, as expected, follows the same pattern. It's indeed less correlated with his activity in the 2010 to 2012 period but that's to be expected with the downward trend in team size.
This second graph is now about Volker Krause. We can see the peak of activity he had during 2009; because the team size was large at the time, it took such a high activity for his centrality to spike as well. The mystery spike of September 2016 is what prompted the display of the team size plot. He had only a very tiny activity that month, yet it generated a surge in centrality... well, it turns out that even though he did only a handful of commits, some of them were on build system files, which tend to be touched by others, and because the team size was smaller than in 2009 the variation gets amplified.
So now that we're done with our two "example" core contributors... let's get back in the territory of the very active contributors of the past year...
Let's look at Laurent in this third graph. Clearly he has been contributing to KDEPIM for a long time but overall not at a very high volume. It really started to increase around 2012 so I guess that's when he slowly took over maintainership of KMail. As expected that's when we see his centrality rising as well, as he was getting involved with more and more components of KDEPIM. Of course it's slightly amplified by the decrease of team size over the 2012 - 2014 period but he kept getting more central even after that.
And finally, this fourth graph gets us back to Christian. Clearly he joined KDEPIM at the end of 2010. From that point on he looks like any other future contributor, with increased activity correlating with increased centrality (watch out for the decrease of team size until 2014, which amplifies the effect a bit over that period). Then during 2014 we have a somewhat stable centrality and activity. Some noise but nothing out of the ordinary over a year. It gets interesting after that though. During 2015 we see his activity increasing again but at the same time his centrality starts dropping a first time. It then stays somewhat stable while his activity spikes. And toward the end of 2017 his centrality completely drops. This is a very different pattern from all the other contributors we looked at.
In my opinion, the interesting observation is that by looking at the contributor network, we see the clique only appearing in the second half of 2017, but on the centrality graph we see this pattern of increasing activity with decreasing centrality starting in 2015! Two years before the community split is visible.
Now for the question I have. I think it'll be a tough one, so I might leave it unanswered for a little while. Could we detect this kind of pattern early? Could we detect it without too many false positives (even though there will always be some)?
I think it's important to think about that because in that particular case, assuming we had such a tool, the Community Working Group would have been warned of a team split to come and could maybe have stepped in to see if they could save the situation. Currently our Community Working Group is mostly working in reactive mode since they talk to people when a conflict emerges. With such a tool they could also try to be proactive and check on a team if the "increasing activity with decreasing centrality" pattern emerges. I think it would be nice if they could do this and talk to people before too many feelings get hurt.
It'll take time to get there, if at all. But I think it's worth looking into.
If you remember, at the end of my previous post I raised a set of questions. Two were related to the use of colors in the graphs I showed:
- Are the four levels of colors for the activity visualization enough?
- Should we have some color coding on the nodes and use different layouts?
Since I had limited time lately to push on the other questions I thought I would do something about the colors at least. :-)
So let's revisit our "whole year 2017 for all of KDEPIM" (that is the parts in KDE Applications, in Extragear and in Playground) with more colors!
Firstly, this gives us the weekly activity using the "Magma" palette and a linear interpolation of the colors between the minimum and maximum commit counts:
Like before we see Laurent Montel and Christian Mollekopf as the most consistent committers throughout the year 2017. That being said, the new palette also allows us to see more things. First, it's not only that Laurent Montel is engaged every week, it is also that he has more commits than anyone else almost every week! We can also now more clearly see unusual spikes of activity from Volker Krause and from Daniel Vrátil, both on week 37.
I think that next I'll try to investigate why Laurent's commit count is so high and what happened on week 37 last year. But that will be for a next installment, stay tuned!
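For the curious, the linear interpolation between the minimum and maximum commit counts mentioned above could look roughly like the sketch below. The function name and the sample counts are made up, and the palette is passed in as a plain sequence of colors: with bokeh installed, `bokeh.palettes.Magma256` would be a natural candidate, but that is an assumption and not necessarily how the actual graph was produced.

```python
def color_for(count, lowest, highest, palette):
    """Pick the palette entry matching `count` on a linear scale from lowest to highest."""
    if highest == lowest:
        return palette[0]
    fraction = (count - lowest) / (highest - lowest)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp out-of-range counts
    return palette[round(fraction * (len(palette) - 1))]


# Stand-in palette; a real one (e.g. 256 Magma colors) would be used the same way.
palette = ["shade0", "shade1", "shade2", "shade3", "shade4"]
weekly_commits = [0, 2, 4, 8]  # made-up counts
colors = [color_for(count, min(weekly_commits), max(weekly_commits), palette)
          for count in weekly_commits]
```

A linear scale keeps things simple, but a single very busy week stretches the scale and pushes every other week toward the low end of the palette.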
Secondly, this also gives us a new contributor network graph. I adjusted it in several ways: I removed all visual cues on where the (0;0) point is, since it's highly irrelevant but people seem to cling to it and try to interpret it; I also switched back to "Kamada & Kawai" for the force-directed layout since it's the one which helps the most to perceive the graph topology; finally I color coded the nodes depending on their centrality. For that last part I used the "Magma" palette again, linearized on the full scale of centrality values of the graph. And since there is more than one definition of centrality, I used the Degree Centrality. This is the simplest one and is defined for a given node as "the fraction of nodes it is connected to". Since we're still in the mindset of finding out the contributors who collaborate with the most people through the files they commit to, it's very suitable.
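As a sketch of what this means in practice, the snippet below derives a contributor network from made-up `(author, path)` records, linking two contributors whenever they touched the same file, and then computes the degree centrality by hand. It is plain Python for the sake of being self-contained; networkx's `degree_centrality` gives the same values on the equivalent graph. The record format and names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def contributor_edges(records):
    """Link every pair of authors who touched the same file."""
    authors_by_file = defaultdict(set)
    for author, path in records:
        authors_by_file[path].add(author)
    edges = set()
    for authors in authors_by_file.values():
        edges.update(combinations(sorted(authors), 2))
    return edges


def degree_centrality(records):
    """Fraction of the other contributors each contributor is connected to."""
    authors = {author for author, _ in records}
    degree = dict.fromkeys(authors, 0)
    for a, b in contributor_edges(records):
        degree[a] += 1
        degree[b] += 1
    others = len(authors) - 1
    return {a: (d / others if others > 0 else 0.0) for a, d in degree.items()}


# Made-up commit records, for illustration only.
records = [
    ("alice", "CMakeLists.txt"), ("bob", "CMakeLists.txt"), ("carol", "CMakeLists.txt"),
    ("alice", "src/mail.cpp"), ("dave", "src/mail.cpp"),
    ("erin", "doc/README"),
]
centrality = degree_centrality(records)
ranking = sorted(centrality, key=centrality.get, reverse=True)
```

Here "alice" comes out on top with 0.75 (connected to three of the four other contributors), while "erin", who touched a file nobody else touched, gets 0.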
This time we don't even need to zoom in to spot the core KDEPIM contributors in 2017. With the color coding, we see right away again that Laurent Montel, Daniel Vratil and Volker Krause are the core contributors. It's much less guesswork than the last time: we're backed by the color coded centrality metric now. We can also better see that Allen Winter, Sandro Knauß and David Faure are very central too, something that we missed the last time.
Now what about Christian Mollekopf, who is our other consistent committer in the activity graph? With a better view of the topology, like the one we have thanks to "Kamada & Kawai", we manage to find him on one of the outer rings (he's the pale orange node on the top left of the graph). He's indeed not very central considering the "degree centrality" but we can see that he seems to act as a bridge between the core contributors and contributors with a very low centrality.
This is an interesting new finding probably worth chasing further.
As you can see I'm not done exploring this data set! More questions are showing up before I can move to another area of KDE I think.
As you might remember, our dear Paul Adams decided to retire. This is a loss because of the person... But he was also providing a very nice service in the form of community data visualization. He was famously known among us for his "green blobs (turned blue blobs) and contributor network graphs".
Note that he just took the "green blobs" idea from Adriaan de Groot and later on turned them blue... He might have made them popular in the process but it's unclear if that's due to the color change or his prose. ;-)
Anyway, he was doing that for other communities than KDE, but he almost stopped now. For instance, he did it only once for Habitat in all of 2017. Luckily he published the scripts he was using in his git-viz repository so not all the knowledge was lost.
Earlier this year, I decided to take the torch and try to get into community data analytics myself. I got in touch with Paul to talk a bit about my plans. My first step was to try to modernize his scripts while staying true to his original visualization.
It turned into an almost complete rewrite, which I didn't quite expect. At the same time I wanted this modernization to be a good base for other visualizations and also general data analytics. The most prominent part remaining is his git log parsing code, although I extended it to work properly across repositories and not just on a single one. But next to that I'm now using pandas, networkx and bokeh for all the data processing and visualization descriptions. This resulted in nice, concise and maintainable code.
So you might wonder... What's possible now? Well, visualizations fairly similar to the previous ones, but now they can span more than one repository and they are fully interactive! No more fixed resolution pictures: we generate fully dynamic HTML code.
To validate the scripts I used them on the whole year 2017 for all of KDEPIM (that is the parts in KDE Applications, in Extragear and in Playground).
Firstly, this gives us the infamous blue blobs diagram to show contributor weekly activity in all those repositories in 2017:
Clearly we can spot Christian Mollekopf and Laurent Montel as the most consistent committers throughout the year 2017. It should come as no surprise since they are almost single-handedly maintaining Kube/Sink and the rest of KDEPIM respectively. Daniel Vratil, maintainer of Akonadi, is also very active and noticeable.
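Under the hood, such a weekly activity diagram boils down to counting commits per contributor and per week. Here is a minimal sketch of that counting step in plain Python, assuming hypothetical `(author, date)` records and ISO week numbering (how weeks are actually delimited in the scripts is not stated here, and the scripts themselves rely on pandas rather than on this hand-rolled version).

```python
from collections import Counter
from datetime import date


def weekly_activity(commits):
    """Count commits per (author, ISO year, ISO week)."""
    counts = Counter()
    for author, day in commits:
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(author, iso[0], iso[1])] += 1
    return counts


# Made-up commits, for illustration only.
commits = [
    ("alice", date(2017, 9, 11)),
    ("alice", date(2017, 9, 15)),
    ("alice", date(2017, 9, 18)),
    ("bob", date(2017, 9, 17)),
]
activity = weekly_activity(commits)
```

Each `(author, year, week)` entry then corresponds to one blob of the diagram, with the count driving its color.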
Secondly, this also gives us back the contributor network graphs. Here I made a small exception and used "Fruchterman & Reingold" for the force-directed layout instead of the "Kamada & Kawai" one. This is simply due to a personal preference. I find that in practice "Fruchterman & Reingold" is a bit more aggressive at conserving the center for the cluster of most connected (core) contributors (although it sacrifices a bit in readability). So for all the KDEPIM repositories in 2017, we obtain the following network:
Surprisingly we can spot two disconnected nodes. Those two contributors touched files no one else touched in 2017. Nothing out of the ordinary: after investigating, those two turned out to be very self-contained one-off contributions, for default SPAM settings and for improved wording in the GUI. Valuable, but they indeed don't necessarily require very deep integration in the core contributors network.
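Spotting such disconnected nodes doesn't need the graph to be drawn at all. As a small sketch, assuming made-up `(author, path)` records and the "connected if they touched the same file" rule, the contributors in question are simply those who never share a file with anyone:

```python
from collections import defaultdict


def isolated_contributors(records):
    """Authors whose files were touched by nobody else in the period."""
    authors_by_file = defaultdict(set)
    for author, path in records:
        authors_by_file[path].add(author)

    everyone = {author for author, _ in records}
    connected = set()
    for authors in authors_by_file.values():
        if len(authors) > 1:
            connected |= authors
    return everyone - connected


# Made-up commit records, for illustration only.
records = [
    ("alice", "CMakeLists.txt"), ("bob", "CMakeLists.txt"),
    ("alice", "src/mail.cpp"),
    ("erin", "data/spam-defaults.conf"),
]
isolated = isolated_contributors(records)
```

In this toy example only "erin" is reported, since "alice" shares `CMakeLists.txt` with "bob" even though she is alone on `src/mail.cpp`.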
Then if we zoom in, we can easily spot the core KDEPIM contributors in 2017: Laurent Montel, Daniel Vratil and Volker Krause. They are the ones who connected most to other contributors via their commits last year. Of course this is a bit of a visual check and as such not very scientific.
Which leads me to the "what's next?" question.
Now I plan to build upon that work and add more tools and analysis. Paul's scripts and graphs were an excellent start, which is why I did my best to stick to them. But now it's time to add more! There are various questions which can be pursued:
- Are the four levels of colors for the activity visualization enough? Could we get better insights with a different palette? Should we get closer to a heat map?
- Could we get some more insights from using a different frequency than weekly for the activity graph? Can we learn from daily activity over shorter periods? Or from monthly activity on a longer scale?
- Should we really look at the contributor network as is? Could we plot it over time and see it evolve dynamically? Are there insights there or would it just look pretty? Should we have some color coding on the nodes and use different layouts?
- Should we have a higher level view on the contributor network? Maybe we would get more information from finding cliques and plotting their relationships?
- Or should we ditch the contributor network representation altogether? Should we instead plot metrics on the graph structure itself over time (like the density or connectivities)?
- Of course we can come up with even more visualizations and analysis departing further from activity and contributor network (for instance, I suspect we could ease copyright attribution from files a bit, to make it easier to contact contributors in case of license changes... I had to do it once for a couple of files and it's a fairly manual and error-prone process for now).
- And can we process more than the git commits? What about collaborations on reviews? What about interactions on public mailing lists? For sure there are extra insights hiding in there, which would also open up non-development activities. At some point I'd like to get an idea of the promo people's and designers' activities too!
Of course, there are other people doing such community analysis work out there, like GrimoireLab, Gitential and more... They provide more off-the-shelf solutions than what I'm after. But probably some inspiration can be taken from them too!
The scripts I've been using for the visualizations above are available in my ComDaAn repository. Of course I hope to get them to evolve and to have new ones appear due to the questions listed in this post.
As you can see, it opens up a very large field and I'd like to explore more of those questions in the future and also try to apply them on other communities for which I likely have less preconceived knowledge and biases than KDE.