Thoughts on Case-Insensitive File Systems
As some folks know, I’m a contributor to Git. I also answer various questions about Git on Stack Overflow and elsewhere, and a lot of those questions are from users on Windows or macOS, since those are the two most common platforms.
Invariably, there are questions about how Git handles branch names or files that differ only in case on those systems, and the answer is “poorly.” This isn’t really Git’s fault (well, the branch name handling is), but a limitation of the file systems involved.
The Problem
Back in the days before Unicode, there were folks who thought it would be a good idea to have a case-insensitive file system. DOS didn’t distinguish between upper and lower case in file names, and so to preserve compatibility Windows didn’t, either. What Windows did bring to file names, however, was Unicode.
When we look up files on a modern file system, we don’t look through the directory in sorted order. In fact, we don’t keep the directory in sorted order, because that’s expensive when you add a new file. Instead, we hash the file name using some good hash function and use that hash value to look up the name in the directory table. This is much faster for large directories and it scales much better. However, in order to look up files correctly, we need to consistently hash the file name in the same way every time. That means that we need a canonical form for file names.
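Here is a minimal sketch of that idea in Python, assuming a toy in-memory directory rather than any real file system's on-disk format. The `Directory` class and its `canonical` parameter are hypothetical names for illustration. The table is keyed on a canonical form of the name (`str.casefold` here, picked only as a stand-in), while the spelling used at creation time is remembered for listings.

```python
class Directory:
    """Toy case-insensitive directory; not a real file system layout."""

    def __init__(self, canonical=str.casefold):
        # The canonicalization function is an assumption for illustration.
        self._canonical = canonical
        # Python's dict is a hash table, standing in for the directory table.
        self._entries = {}

    def create(self, name, data):
        key = self._canonical(name)
        if key in self._entries:
            raise FileExistsError(name)
        # Keep the original spelling so listings preserve case.
        self._entries[key] = (name, data)

    def lookup(self, name):
        # Always hash the canonical form, never the raw name.
        return self._entries[self._canonical(name)][1]

    def names(self):
        return [stored for stored, _ in self._entries.values()]


d = Directory()
d.create("README", b"hello")
print(d.lookup("readme"))  # found, because both spellings share one key
print(d.names())           # the stored spelling is preserved
```

The whole scheme stands or falls with the `canonical` function: two names are "the same file" exactly when it maps them to the same key, which is why the choice of folding rules matters so much in what follows.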
If we know that all of our users are using a given single-byte code page, we can usually fold case easily: our code page will represent one language or a set of languages that have consistent case folding behavior. This works for many languages, including English, Spanish, and French, but it fails for German.
Traditionally, in German, there is a character called the eszett (ß). Nowadays, in Unicode, there is also a capital version of this character (ẞ), but for a long time, people wrote the capital version of an eszett as “SS”. This presents a problem for us: should we fold “GROSSE” to “grosse” or “große”?
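As a quick illustration, here is what Python's built-in, locale-independent string methods do with these characters. This only shows Python's behavior, which follows the Unicode case mappings; a given file system may use different tables.

```python
upper = "große".upper()      # the full uppercase mapping expands ß to SS
lower = "GROSSE".lower()     # lowercasing cannot bring the ß back
capital = "ẞ".lower()        # capital eszett (U+1E9E) lowercases to ß
same = "große".casefold() == "GROSSE".casefold()

print(upper)    # GROSSE
print(lower)    # grosse
print(capital)  # ß
print(same)     # True: casefold() turns ß into ss
```

Note that the round trip is lossy: once “große” has been uppercased, nothing in the string says whether the original had an eszett or a double s.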
Even if we simply agree to fold the eszetts together for our file system purposes (as unfriendly as this might be to German speakers), we have to consider another fact about case folding in file systems: it takes place in the kernel. Consequently, we have to pick a case-folding system that works in a way that is independent of the locale, since locale systems are not in the kernel. Even if we could push locale systems into the kernel, modern operating systems are multi-user systems where different users may use different locales. We know that we need a consistent form for file names, so locale-independent it is. There’s just one problem: we can’t provide a locale-independent case folding operation for both English and Turkish at the same time.
Turkish and other Turkic languages contain a dotted I (İ, i) and a dotless I (I, ı). These are different vowels and using one or the other can change the meaning of a word. In English, we fold the upper case “I” letter (which is dotless) to “i” (which is dotted), and vice versa. In Turkish, we fold the upper case “I” letter (which is dotless) to “ı” (which is dotless), and “İ” (which is dotted) to “i” (which is dotted), and vice versa. Therefore, there is no locale-independent way to fold characters that produces correct results in all languages.
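To make the conflict concrete, here is a small Python sketch. The built-in methods are locale-independent and treat “I” the English way; `turkish_lower` is a hypothetical helper written for this example, not a standard library function, and it covers only the two letters discussed here.

```python
# Built-in, locale-independent behavior: "I" and "i" are a pair.
print("I".lower() == "i")              # True
print("\u0131".upper() == "I")         # True: dotless ı uppercases to plain I
# U+0130 (İ) lowercases to "i" plus a combining dot above, two code points.
print("\u0130".lower() == "i\u0307")   # True

# Hypothetical Turkish-aware lowercasing: map the two special capitals
# first, then let the default rules handle everything else.
TURKISH_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})

def turkish_lower(text):
    return text.translate(TURKISH_LOWER).lower()

word = "ILIK"
print(word.lower())         # ilik  (English-style folding)
print(turkish_lower(word))  # ılık  (Turkish-style folding)
```

The same input yields two different lower-case strings depending on which rules apply, so a file system that must pick one canonical form has to pick one set of rules for everyone.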
We can, of course, get around this problem by forcing users to use a standard case-folding algorithm that handles most languages correctly and ignores Turkish. This does, however, smack of Anglophone exceptionalism: why shouldn’t Turkish people have a fully functional operating system that meets their needs?
The Consequences
So we’ve determined that we can’t really produce a file system that correctly folds case in a locale-aware way. However, maybe we have backwards compatibility to worry about, so Turkish people are just going to have to suck it up. Let’s see what kind of problems this causes in real life.
Git stores its references in the file system. The special name HEAD, which refers to the current branch or commit, can sometimes be referenced as head (or any number of other spellings) on Windows and macOS. Except that sometimes it can’t.
This is super confusing for new users and leads to bad habits where users “learn” that case doesn’t matter (except when it does) and hard-to-reproduce problems when trying to replicate behavior across systems.
Most open-source software projects also have a variety of file names in the root of their repository. Common names here include some variants of README, LICENSE, and VERSION. All of this is fine, except that C++20 defines a header file called version. Now, when a source file writes #include <version> and the repository root is on the include path, a case-insensitive file system hands the compiler the project’s VERSION file instead of the standard header, and the code doesn’t build on those systems.
For those folks on macOS who do use a case-sensitive file system, there is now a variety of software that just doesn’t work, because companies have gotten used to case-insensitive file systems.
There are also software packages (like the well-known vim-colorschemes) that can’t be checked out correctly on a case-insensitive file system, since they contain multiple files that differ only in case. The color schemes darkblue.vim and darkBlue.vim are, in fact, substantively different files, but on a case-insensitive system, only one can be checked out at a time.
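Collisions like this are easy to check for ahead of time. The sketch below uses a hypothetical helper, `case_collisions`, which groups paths by their case-folded form and reports any group with more than one member. It assumes `str.casefold` is a good enough approximation; real file systems use their own folding tables, so treat the result as a hint rather than a guarantee. The `colors/` directory in the sample list is made up for the example.

```python
from collections import defaultdict


def case_collisions(paths):
    """Return groups of paths that differ only in case."""
    groups = defaultdict(list)
    for path in paths:
        # Paths with the same folded form would share one directory entry.
        groups[path.casefold()].append(path)
    return [sorted(names) for names in groups.values() if len(names) > 1]


paths = [
    "colors/darkblue.vim",
    "colors/darkBlue.vim",
    "README",
    "LICENSE",
]
print(case_collisions(paths))
# [['colors/darkBlue.vim', 'colors/darkblue.vim']]
```

The list of paths could come from anywhere, for example the output of git ls-files, which makes a check like this a reasonable fit for a CI job in a project that wants to stay usable on Windows and macOS.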
Conclusion
There are a variety of reasons why case-insensitive file systems have problems, as we’ve seen. The solution is to simply admit defeat and use a case-sensitive file system, since it lacks all of these problems (except the incompatible software one).
Users who are used to case-insensitive file systems may do well to learn a secret that Unix users have long known: write all file names in lower case. This is easier—there’s no need for the Shift or Caps Lock key—and it’s consistent, so you don’t need to remember which case you used.
To see more about case folding and why it’s difficult, I recommend this article by James Bennett, which gives a pithy summary of the perils of case folding in Unicode.