Tags: foss, kde, libreoffice, kernel
At Akademy 2024, during my talk “KDE to Make Wines”, I promised there would be a companion blog post focusing more on the technical details of what we did. This is the article in question.
This is a piece I also wrote for the enioka blog, so there is a French version available.
After years of working in the service industry, one thing that never ceases to amaze me is the variety of needs our customers have and how we can still be surprised by them. They sometimes lead us in unexpected directions. This is obviously something I particularly like.
Today I’ll tell you the tale of a nice relationship enioka Haute Couture built with a customer down under.
It all started with a mail on a KDE mailing list a couple of years ago. People were looking for help. It was spotted by my colleague Benjamin Port, who reached out. It turned out those people were from De Bortoli Wines, an Australian winemaking company. They are known for using Free Software quite a bit and contributing when they can. They even got interviewed on KDE’s Dot ten years ago!
Not much really happened after our first contact but we kept the communication open… Until last year when they reached out to us for some help with Okular, the universal document viewer made by KDE.
They wanted to get rid of Acrobat Reader for Linux in favor of Okular. They had one issue though: the overprint preview support was visibly broken in Okular. This is essential when interacting with a reprographics company which will send back PDFs based on how they layer colors (or overprinting). You need the overprint preview to simulate what the final rendering will be. We were of course motivated to help them get rid of Acrobat Reader for Linux since it is an old and stale piece of proprietary software.
This looked at first like an easy task. Poppler (used for rendering PDFs) was exposing some API for the overprinting preview but Okular didn’t make use of it. We quickly made a patch to use the overprint preview API.
Alas, doing so uncovered another issue. As soon as we turned on the overprinting preview support the application would crash. We tracked the crash one level down: it was hiding in the Poppler-Qt bindings.
After further exploration, it was due to the binding wrongly determining the row size of the raster images to generate. There’s a color space conversion occurring between the initial memory representation and the target raster image. The code was getting this row size before the transform occurred… and then ended up stuffing the wrong amount of data in the target raster. This couldn’t go well. Another patch was thus produced to address this.
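To make the row size problem concrete, here is a minimal sketch. It is not the Poppler-Qt code: the function names and the pixel sizes (8 bytes per pixel before the conversion, 4 after) are assumptions chosen for illustration. The only point is that the destination stride has to be derived from the pixel size after the transform.

```python
def row_size(width_px: int, bytes_per_pixel: int, alignment: int = 4) -> int:
    """Bytes needed for one row, padded up to a multiple of `alignment`."""
    raw = width_px * bytes_per_pixel
    return (raw + alignment - 1) // alignment * alignment


def convert_rows(src: bytes, width_px: int, height_px: int,
                 src_bpp: int, dst_bpp: int) -> bytearray:
    """Copy a raster into a buffer that uses a different pixel size.

    The "conversion" is a placeholder that keeps the leading bytes of each
    pixel; what matters here is which row size is used for which buffer.
    """
    src_stride = row_size(width_px, src_bpp)  # layout before the transform
    dst_stride = row_size(width_px, dst_bpp)  # layout after the transform
    keep = min(src_bpp, dst_bpp)
    dst = bytearray(dst_stride * height_px)
    for y in range(height_px):
        for x in range(width_px):
            s = y * src_stride + x * src_bpp
            d = y * dst_stride + x * dst_bpp
            dst[d:d + keep] = src[s:s + keep]
    return dst
```

Sizing or filling the destination with `src_stride` instead of `dst_stride` would push twice as many bytes per row as the target image can hold in this example, which is the kind of mismatch described above.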
The good news is that we managed to fix the issue in less time than the budget the customer allocated to us. So we gave them the choice between stopping here or using the remaining budget to address something else.
We used the Ghent Workgroup PDF Output suite to validate our work beyond the samples provided by our contact. While doing so we noticed Poppler was failing at properly rendering some other cases. So we proposed to investigate those as well.
After spending some time on those… we made tentative fixes but unfortunately they led to some regressions. So in agreement with the customer, we wrapped up and created a detailed bug report instead, so as not to waste their budget. This helped the Poppler community figure out the problem and produce a fix. That’s when we realized we had come really close to the right fix at some point. Clearly an expert view on how the various PDF color spaces work was required, so it was a good call to create the detailed report.
With some budget still left, our contact suggested that we also bring the overprint preview support to Okular printing. This was initially left out of the scope but is necessary when you want to print your preview on a regular laser printer. This required adjusting Okular and adjusting Poppler-Qt once more. It was ultimately done within budget.
Since the customer was satisfied with the work, they came back for more. We set up a budget line for them to come up with issues to fix throughout the year.
Around that time, their focus moved more to CIFS mounts which they use extensively for their remote office branches. As active users of that kernel feature, they encounter issues in user facing software that you would otherwise not suspect.
They were affected by a bug preventing Okular from saving on CIFS mounts. It is one of those bugs which had been lingering for a bit more than a year without a solution in sight. Some applications could modify and save a file opened on a CIFS mount but somehow not Okular.
It turned out to be due to some code in KIO itself (the KDE Frameworks API used for network transparent file operations) interacting in an unwanted way with CIFS mounts.
Indeed, the behavior of unlink() (file deletion) on CIFS mounts can be a bit “interesting”. If the file one tries to delete is opened by another process then the operation is claimed to have succeeded but the filename is still visible in the file hierarchy until the last handle is closed. This is unlike the usual UNIX behavior: outside of CIFS mounts the file wouldn’t be visible in the hierarchy anymore. We thus were seeing the issue because Okular does keep a file handle opened on the file.
Now, KIO rightfully attempts to write under a temporary name, delete the original file and rename to the final name during its file copy operation. This would then fail as the unlink() call would succeed, but the rename would unexpectedly fail due to the lingering file in the hierarchy.
So we proposed a patch for KIO which would do a direct copy for files on CIFS mounts. Files being directly overwritten succeed and so the bug experienced with Okular was solved.
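The two save strategies can be sketched as follows. This is not the KIO implementation: the function names are made up for the example, and the `on_cifs` flag is simply passed in by the caller rather than detected.

```python
import os
import tempfile


def save_via_rename(path: str, data: bytes) -> None:
    """Write under a temporary name, delete the original, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        if os.path.exists(path):
            # On a CIFS mount the name may linger while another handle is open.
            os.unlink(path)
        # This is the step the article describes as failing on CIFS.
        os.rename(tmp_path, path)
    except OSError:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def save_in_place(path: str, data: bytes) -> None:
    """Overwrite the existing file directly, with no unlink and no rename."""
    with open(path, "wb") as f:
        f.write(data)


def save(path: str, data: bytes, on_cifs: bool) -> None:
    """Pick a strategy; `on_cifs` is an input of this sketch, not detected."""
    if on_cifs:
        save_in_place(path, data)
    else:
        save_via_rename(path, data)
```

The trade-off is that the direct overwrite gives up the safety of never having a half-written file under the final name, which is why it only makes sense where the rename-based approach cannot work.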
This wasn’t the end of the issues with CIFS mounts (far from it). They were also experiencing performance problems when listing folders. Interestingly, they would experience it only in the details view of the KDE file dialog.
At its core the issue was due to requesting too much information. When listing a folder known to be remote by KIO (e.g. going through an smb:// URL) the view would limit the amount of information it’d request about the sub-folders. In particular it wouldn’t try to determine the number of files in the sub-folder. This operation is fast nowadays on modern disks, but incurs extra traffic and latency over the network.
This sounded like a simple fix… but in fact it was a bit more work than expected.
Unsurprisingly we quickly found that the code would decide to go for more or less details based solely on the URL. Since CIFS mounts get file:/ URLs, they’d end up treated as local files… so the view went on querying them a bit more aggressively. There is an isSlow() method on the items in the detail view tree which we extended to check for CIFS mounts.
This wasn’t enough though: we immediately realized that the new bottleneck was the calls to isSlow() itself. It would lead to several statfs calls which would be expensive as well. The way out was thus to cache the information in the items and for children to query the cache in their parent. Indeed, if the parent is considered slow, we decided to consider the children in the folder as slow as well. This heuristic allowed us to remove all the subsequent isSlow() calls after the one on the mountpoint folder itself.
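The caching heuristic can be sketched like this. The `Item` class and the `probe` callback are hypothetical stand-ins (the real code lives in KIO and is C++); `probe` plays the role of the expensive filesystem check.

```python
from typing import Callable, Optional


class Item:
    """A tree item; `probe` stands in for the expensive filesystem check."""

    def __init__(self, name: str, parent: Optional["Item"] = None,
                 probe: Optional[Callable[[str], bool]] = None) -> None:
        self.name = name
        self.parent = parent
        if probe is not None:
            self._probe = probe
        elif parent is not None:
            self._probe = parent._probe   # children reuse the parent's probe
        else:
            self._probe = lambda _path: False
        self._slow: Optional[bool] = None  # cached answer, None means unknown

    def path(self) -> str:
        if self.parent is None:
            return self.name
        return self.parent.path().rstrip("/") + "/" + self.name

    def is_slow(self) -> bool:
        if self._slow is None:
            if self.parent is not None and self.parent.is_slow():
                # Heuristic: anything below a slow folder is slow too.
                self._slow = True
            else:
                self._slow = self._probe(self.path())
        return self._slow
```

With this shape, once the mountpoint folder has been probed and found slow, every item below it answers from the cache and the probe is never called again for that subtree.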
This was a very old piece of code we touched there, so some time was also used to clean things up a bit, refactor and rename things to align them better with other KIO parts.
They are such active users of CIFS mounts that they found yet another issue! This time with an old version of LibreOffice. In their version it would show up only if the KDE integration was used. I can tell you we were a bit surprised by this. The investigation wasn’t that easy but we managed to track it down.
The issue was showing up after opening a file sitting on a CIFS mount with LibreOffice. If you did a change to the file, then clicked “Save As…” and selected the same file to overwrite, you would get a “Could not create a backup copy” error and the file wouldn’t be saved if the KDE integration was active. All would be fine without this integration though.
What would be different with and without the KDE integration? Well, in one case there is an extra process! When the file dialog opens, if it is the KDE file dialog, the listing is delegated to a KIO Worker. In this particular case this would matter. Indeed, we figured that LibreOffice keeps an open file descriptor on the opened file. Not only this, it also holds a read lock on the file. The KIO Worker is being forked from LibreOffice, so it too has the open file descriptor with a lock.
This made us realize that there was a file descriptor leak, which is in itself not a good thing. So we changed KIO to clean up open file descriptors when spawning workers; it’s always a good idea to be tidy. Also, as soon as we removed the file descriptor leak the issue was gone. Nice, this was an easy fix.
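To illustrate the kind of leak involved, here is a small POSIX-only sketch. It has nothing to do with how KIO spawns its workers; it only shows, using Python’s `subprocess` module, that a child process can end up with a descriptor it never asked for, and that closing descriptors at spawn time prevents it. The helper names are made up for the example.

```python
import os
import subprocess
import sys
import tempfile
from typing import Tuple

# Child program: report whether the descriptor number given in argv is open.
CHILD = (
    "import os, sys\n"
    "try:\n"
    "    os.fstat(int(sys.argv[1]))\n"
    "    print('open')\n"
    "except OSError:\n"
    "    print('closed')\n"
)


def child_sees_fd(fd: int, close_fds: bool) -> bool:
    """Spawn a child and ask it whether `fd` is open on its side."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(fd)],
        close_fds=close_fds, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == "open"


def demo() -> Tuple[bool, bool]:
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        os.set_inheritable(fd, True)  # simulate a descriptor that can leak
        leaked = child_sees_fd(fd, close_fds=False)
        tidy = child_sees_fd(fd, close_fds=True)
    return leaked, tidy
```

In the first call the child inherits the descriptor; in the second it does not, because everything above the standard streams is closed before the child program starts.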
Just to be thorough, we tried again but this time with the latest LibreOffice. And even with the patched KIO we would get the “Could not create a backup copy” error! This time it would show up also without the KDE integration. And so back to hunting… without getting too much into the internals of LibreOffice, it turned out the more recent version had extra code activated to produce said backup version, and it has two file descriptors open on the same file. So we were back to the problem of two file descriptors and read locks being involved. But this time it was not due to a leak towards another process, and the architecture of LibreOffice didn’t make it easy to reuse the original file descriptor created when opening the file.
The only thing we could do at this point was to simply not lock when the file is on a CIFS mount. It is not a very satisfying solution but it did the job while fitting the allocated time.
Somehow we couldn’t leave things like this though. If you know the much-criticized POSIX file locks well (they have extra challenges over CIFS), something still doesn’t feel quite right. We have two processes with a file descriptor on the file to save, and yet a single process holding a read lock. During the save, when the failure happens, LibreOffice is still making a “backup copy”: it is not writing to the file yet, only reading it… for sure this should be allowed!
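The expectation behind that reasoning can be checked on a local filesystem with a short sketch. This says nothing about how CIFS or LibreOffice behave; it only shows the baseline semantics of POSIX advisory locks, where a read lock held by one process does not stop another process from taking its own read lock, while a write lock is refused. The helper names are made up for the example.

```python
import fcntl
import subprocess
import sys
import tempfile
from typing import Tuple

# Child program: try a non-blocking read or write lock on the given file.
CHILD = (
    "import fcntl, sys\n"
    "mode = fcntl.LOCK_SH if sys.argv[2] == 'read' else fcntl.LOCK_EX\n"
    "with open(sys.argv[1], 'r+b') as f:\n"
    "    try:\n"
    "        fcntl.lockf(f, mode | fcntl.LOCK_NB)\n"
    "        print('granted')\n"
    "    except OSError:\n"
    "        print('denied')\n"
)


def other_process_can_lock(path: str, kind: str) -> bool:
    """Ask a separate process to take a 'read' or 'write' lock on `path`."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, path, kind],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == "granted"


def demo() -> Tuple[bool, bool]:
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"content")
        f.flush()
        fcntl.lockf(f, fcntl.LOCK_SH)  # this process holds a read lock
        can_read = other_process_can_lock(f.name, "read")
        can_write = other_process_can_lock(f.name, "write")
    return can_read, can_write
```

On a local filesystem the second reader is granted its lock, which is why a failure while merely reading the file for a backup copy looked suspicious.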
We thus started to suspect a problem with the kernel itself… and our contact was willing to explore it further. After investigation, it was confirmed to be a kernel bug. For that the customer hooked us up with Andrew Bartlett from Catalyst as they knew he could help us flesh out ideas in this space. This proved valuable indeed. Thanks to tests we wrote previously and conversations with Andrew, we quickly figured out that, depending on the options you pass at mount time, the CIFS driver would handle the locks properly or not. We discussed a couple of fixes with the maintainer of the CIFS driver. They were merged last week (end of August 2024).
It’s really a joy to have a customer like this. As can be seen from their willingness to dig deeper into what others would consider obscure issues, they demonstrate they are thinking long term. It also means they come with interesting and challenging issues… and they’re appreciative of what we achieved for them!
Indeed, we had the pleasure of receiving this by email during our conversations:
We very much appreciate the work you’ve done, and difficulty surrounding the challenges you’re working through. Your approach/results are spectacular, and we are very grateful to be able to be a part of it.
Of course, if you have projects involving Free Software communities or issues closer to the system, feel free to reach out, and we’ll see what we could do to help you. Maybe you too can be a forward thinking customer who sparks joy!