Microkernels are slow and Elvis didn’t do no drugs
It’s quite telling, though, that most microkernel detractors tend not to be kernel developers.
Microkernel hatred is a peculiar phenomenon. Sheltered users with no background in much beyond Windows and some flavor of free monolithic Unix will, despite a general apathy toward or ignorance of the relevant subjects, hold strong opinions on the allegedly dreadful performance and impracticality of “microkernels,” however they define the term (and we shall see that a lot of people have some baffling impressions of what a microkernel is supposed to be). Quite often, these negative views are the result of various remarks made by Linus Torvalds and a general hero worship of his character, of a misrepresentation of an old Usenet flame war between AST and Torvalds that was somehow “won” and supposedly proved that microkernels are nothing but a toy of ivory-tower academics, or of a rehash of quarter-century-old benchmarks on CMU’s Mach that were unfavorable. The presence of Linus’s character in much of this is no coincidence. It strikes me that anti-microkernel sentiment most vocally originates as a sort of tribal affiliation mechanism by Linux users to ward off insecurity.
In any event, this article will be a concise tour of microkernel myths and misconceptions throughout the ages.
#1: Microkernels are defined by their small size
If this were true, DOS, CP/M and V6 Unix would all be microkernels.
Microkernels are actually defined by the narrow scope of their responsibilities: they isolate the multiplexing of hardware resources from the higher-level system interfaces that give an operating system its personality. Hence, even the earlier and relatively bulky Mach was in effect agnostic of any semantics that would traditionally be associated with Unix, VMS or similar, leaving these instead to the policy decisions of higher-level servers. Although Mach’s abstractions specifically were biased towards implementing Unix-likes, the team at CMU famously managed to port a fully compatible MS-DOS environment, down to the BIOS calls, running in a dedicated thread.
Gernot Heiser describes the genesis of contemporary microkernels thus:
1) an abstraction of hardware address spaces, because they are the fundamental mechanism for memory protection. Whether you call that “tasks” (as in early versions of L4) or “address spaces” (as in current versions) or virtualised physical memory is a matter of nomenclature, nothing more. The point is that it’s a minimal wrapper around hardware, enough to allow secure manipulation by user-level code. It isn’t a “task” in the Chorus sense, which is a heavyweight object;
2) an abstraction of execution on the CPU, because that is required for time sharing. Whether you call that “thread”, “scheduler activation” or “virtual CPU” may be notation or maybe a small difference in approach, but not much more, as long as it’s minimal;
3) a mechanism for communication across protection boundaries. This can be in the form of a (light-weight, unlike Mach or Chorus) IPC mechanism, a domain-crossing mechanism (as in “lightweight RPC”, Pebble or our Mungi system) or a virtual inter-VM interrupt. There are semantic differences between those options, but as long as it’s really minimal, any of them is a valid option. A virtual network interface (as offered as a primitive by some hypervisors) is not minimal, as it requires network-like protocols;
4) memory contexts? At the level of a (proper) microkernel, that’s just the same as a “task”—an abstraction of memory protection. Hence, there should not be separate primitives for tasks and memory contexts.
In summary, the microkernel provides mechanisms corresponding to hardware features. It doesn’t provide services, just fundamental mechanisms. In particular, it does not duplicate OS services. This misunderstanding was one of the causes for the failure of first-generation microkernels. The OS community understands this.
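To make Heiser’s taxonomy concrete, here is a minimal sketch in C of the complete system-call surface such a kernel would expose. Every name below is invented for illustration, not taken from any real kernel; the point is the count, roughly ten calls for the entire interface.

```c
/* Sketch: the complete system-call surface implied by Heiser's three
 * mechanisms. All names are illustrative, not from any real kernel. */
enum syscall_id {
    /* 1) address spaces: a thin wrapper over hardware memory protection */
    SYS_AS_CREATE, SYS_AS_MAP, SYS_AS_UNMAP, SYS_AS_GRANT,
    /* 2) execution on the CPU: just enough to time-share it */
    SYS_THREAD_CREATE, SYS_THREAD_EXIT, SYS_THREAD_YIELD,
    /* 3) protected communication across address-space boundaries */
    SYS_IPC_CALL, SYS_IPC_REPLY_WAIT, SYS_IRQ_WAIT,
    SYSCALL_COUNT  /* size of the entire kernel interface */
};

/* everything else (files, networking, drivers) lives in user-level servers */
int syscall_surface(void) { return SYSCALL_COUNT; }
```

Everything an operating system’s “personality” requires sits in user-level servers built on top of these primitives, which is exactly the contrast with the ~140 system calls of Mach 3 noted further down.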
#2: Microkernels are unperformant
From “An Architectural Overview of QNX”, 1992:
At the system performance level, for IPC, pipe I/O, and disk I/O, we see that QNX outperformed the UNIX system by a substantial margin. In fact, the QNX system was able to deliver virtually all of the raw device throughput to the application, while the SVR4 system fell far short. For disk I/O, QNX was substantially faster than SVR4. As faster peripheral devices appear, the ability to deliver the full performance of that hardware will make possible a class of applications that the kernel overhead of UNIX will not be able to accommodate without much larger investments in processor power. In the network case, the QNX Net process and its drivers deliver very nearly the entire cable bandwidth to the application, even with only moderately powerful machines.
From “Experiences with the Amoeba Distributed Operating System”, 1990:
The interesting comparisons in these tables are the comparisons of pure Amoeba RPC and pure SunOS RPC both for short communications, where delay is critical, and long ones, where bandwidth is the issue. A 4-byte Amoeba RPC takes 1.1 msec, vs. 6.7 msec for Sun RPC. Similarly, for 8 Kbyte RPCs, the Amoeba bandwidth is 721 Kbytes/sec, vs. only 325 Kbytes for the Sun RPC. The conclusion is that Amoeba’s delay is 6 times better and its throughput is twice as good.
To compare the Amoeba results with the Sun NFS file system, we have measured reading and creating files on a Sun 3/60 using a remote Sun 3/60 file server with 16 Mbytes of memory running SunOS 4.0.3. The file server had the same type of disk as the Bullet server, so the hardware configurations were, with the exception of the extra memory for NFS, identical to those used to measure Amoeba. The measurements were made at night under a light load. To disable local caching on the Sun 3/60 we locked the file using the Sun UNIX lockf primitive while doing the read test. The timing of the read test consisted of repeated measurement of an lseek followed by a read system call. The write test consisted of consecutively executing creat, write and close. (The creat has the effect of deleting the previous version of the file.) The results are depicted in Fig. 8.
Observe that reading and creating 1 Mbyte files results in lower bandwidths than for reading and creating 16 Kbyte files. The Bullet file server’s performance for read operations is two to three times better than the Sun NFS file server. For create operations, the Bullet file server has a constant overhead for producing capabilities, which gives it a relatively better performance for large files.
What about userspace file systems in an otherwise monolithic environment? Well, it’s quite likely you’re using FUSE in one way or another these days, if even just to mount your NTFS-formatted drive. We also saw Bullet outperforming NFS back in its day above. But, firstly, a general statement:
The inefficiency of moving I/O out to user space is also somewhat self-inflicted. A lot of that inefficiency has to do with data copies, but let’s consider the possibility that there might be fewer such copies if there were better ways for user-space code to specify actions on buffers that it can’t actually access directly. We actually implemented some of these at Revivio, and they worked. Why aren’t such things part of the mainline kernel? Because the gatekeepers don’t want them to be. Linus’s hatred of microkernels and anything like them is old and well known. Many other kernel developers have similar attitudes. If they think a feature only has one significant use case, and it’s a use case they oppose for other reasons, are they going to be supportive of work to provide that feature? Of course not. They’re going to reject it as needless bloat and complexity, which shouldn’t be allowed to affect the streamlined code paths that exist to do things the way they think things should be done. There’s not actually anything wrong with that, but it does mean that when they claim that user-space filesystems will incur unnecessary overhead they’re not expressing an essential truth about user-space filesystems. They’re expressing a truth about their support of user-space filesystems in Linux, which is quite different.
A lot of user-space filesystems, perhaps even a majority, really are toys. Then again, is anybody using kernel-based exofs or omfs more seriously than Argonne is using PVFS? If you make something easier to do, more people will do it. Not all of those people will be as skilled as those who would have done it The Hard Way. FUSE has definitely made it easier to write filesystems, and a lot of tyros have made toys with it, but it’s also possible for serious people to make serious filesystems with it. Remember, a lot of people once thought Linux and the machines it ran on were toys. Many still are, even literally. I always thought that broadening the community and encouraging experimentation were supposed to be good things, without which Linux itself wouldn’t have succeeded. Apparently I’m misguided.
What about a benchmark of FUSE, specifically? From “Performance and Extension of User Space File Systems”, 2010:
Our benchmarks show that whether FUSE is a feasible solution depends on the expected workload for a system. It achieves performance comparable to in-kernel file systems when large, sustained I/O is performed. On the other hand, the overhead of FUSE is more visible when the workload consists of a relatively large number of metadata operations, such as those seen in Web servers and other systems that deal with small files and a very large number of clients. FUSE is certainly an adequate solution for personal computers and small-scale servers, especially those that perform large I/O transfers, such as Grid applications and multimedia servers.
Not bad for a bridge to an in-kernel VFS.
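At its core, the FUSE idea is just forwarding each VFS operation to a callback table registered by a user-space daemon. The following toy model shows the shape of it, serving a single in-memory file; all names are illustrative, and real FUSE marshals these requests over /dev/fuse, paying the copies and mode switches discussed above.

```c
/* Toy model of the FUSE idea: the kernel's VFS forwards each operation
 * to a user-space daemon's callback table. Illustrative only. */
#include <string.h>
#include <stddef.h>
#include <sys/types.h>

struct fs_ops {                       /* the daemon's callback table */
    int (*read)(const char *path, char *buf, size_t len, off_t off);
};

/* "user-space file system": serves one in-memory file */
static const char DATA[] = "hello from user space\n";
static int ram_read(const char *path, char *buf, size_t len, off_t off) {
    if (strcmp(path, "/hello") != 0) return -1;
    if (off >= (off_t)sizeof(DATA) - 1) return 0;
    size_t n = sizeof(DATA) - 1 - (size_t)off;
    if (n > len) n = len;
    memcpy(buf, DATA + off, n);       /* the data copy FUSE critics point at */
    return (int)n;
}

static const struct fs_ops ram_ops = { .read = ram_read };

/* "kernel side": a VFS read becomes an upcall into the daemon */
int vfs_read(const struct fs_ops *ops, const char *path,
             char *buf, size_t len, off_t off) {
    return ops->read(path, buf, len, off);
}

int demo_read_len(void) {             /* convenience wrapper for a quick check */
    char buf[64];
    return vfs_read(&ram_ops, "/hello", buf, sizeof buf, 0);
}
```

Note that the overhead the benchmarks above measure lives entirely in the crossing between the two halves, not in the callback dispatch itself.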
From “CHORUS Distributed Operating Systems”, 1988:
Initial performance measurements on a Bull SPS 7/300 computer, based on a MC68020 (with MC68851) processor running at 16MHz (with 2 wait-states) are given on Tables 8 and 9. Performance measurements of the CHORUS UNIX subsystem are related to SPIX, the Bull System V implementation.
Although made on a CHORUS system which is far from being optimized, these figures show a similar level of performance with a traditional UNIX implemented on a standalone machine.
Even half-assed demos from 1988 were about as good as the Unixes of the day.
One of the less successful Mach-like microkernels, Spring, had its authors write a decent rebuttal of some age-old IPC performance fears. From “The Spring nucleus, a microkernel for objects”, 1993:
In traditional IPC systems, such as Berkeley sockets, there are three major costs: First, there is the cost of making scheduling and dispatching decisions. Typically, the thread that issued the request will go to sleep, and the OS kernel will make a careful and objective decision about which thread to run next. With a little luck this will actually be the thread that will execute the request. This thread will then run and upon completion of the call, it will wake up the caller thread and put itself to sleep. Once again, the OS kernel will make a careful and scholarly scheduling decision and will, hopefully, run the caller thread.
Second, there is the cost of performing the context switch between the calling address space and the called address space. At its worst, this will involve a full save of the CPU registers (including floating point registers), a stack switch, and then a restoration of CPU state.
Third, there is the cost of moving the argument and result data between the caller and the callee. Traditional kernels tend to use buffer management schemes that are optimized for passing moderately large amounts of data. Recent microkernel IPC systems have pushed back on all three of these costs. By assuming an RPC call and return model, it is possible to perform a direct transfer of control between the caller and the callee threads, bypassing the scheduler. Given such a direct transfer, it is also possible to avoid the full costs of a context switch and only save the minimal information that is necessary for an RPC return. By exploiting the fact that most argument sets are small (or that if they are large then they are passed through shared memory), it is possible to avoid buffer management costs.
Different systems vary in the degree to which they have succeeded in minimizing these costs. For example, the Taos LRPC system modelled a cross-domain call as a single thread of execution that crossed between address spaces, thereby avoiding any dispatching or scheduling costs. However, both Mach and NT model the callee threads as autonomous threads which simply happen to be waiting for a cross-process call. This leads to a certain amount of dispatcher activity when a cross-process call or return occurs. The Spring nucleus has attempted to minimize all three costs. The nucleus’ dispatcher works in terms of shuttles, which do not change during cross-domain calls. There are, therefore, no scheduling or dispatching costs during a cross-domain call. Only an absolutely minimal amount of CPU state is saved (basically a return program counter and stack pointer). We do not attempt to save the current register window set, or attempt to switch to a different kernel stack.
The fast-path is optimized for the case where all the arguments are passed in registers or shared memory, so the nucleus need not concern itself with buffering arguments or results. In addition, the fast-path code is executed directly in low-level trap handlers and avoids the normal system call costs. A more mundane factor in our IPC performance is that our nucleus data structures have been tailored to optimize their performance for cross-domain calls. For example, during a cross-domain call there is no need to check that the target door identifier is valid. Instead, a simple mask and indirect load is performed. If the target door identifier was invalid we will get a pointer to a special nucleus door entry point which always returns an error code. Similarly there was an effort to concentrate the number of flags that the fast-path call and return code would need to test into a single per-thread “anomalous” flag.
If we were prepared to ignore security or debugging issues, we could probably shave several more microseconds off our fast-path time. For example, we have to pay several instructions to prevent the callee thread tampering with register windows belonging to the caller thread. Similarly, during both call and return we are prepared to cope with threads that have been stopped for debugging. However, for our desired semantics, we believe we are fairly close to the minimum time required for a cross-domain call.
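The three costs Spring identifies can be captured in a toy cost model. The cycle weights below are invented placeholders, not measurements from any paper; they exist only to show why eliminating scheduler trips, full state saves, and buffer copies dominates everything else on the IPC path.

```c
/* Back-of-envelope model of the three IPC costs Spring attacks.
 * All numbers are made-up placeholders to illustrate the structure
 * of the argument, not real measurements. */
struct ipc_path {
    int sched_decisions;   /* trips through the scheduler per round trip */
    int state_saved_words; /* CPU state saved/restored per switch */
    int buffer_copies;     /* kernel copies of the argument data */
};

static const struct ipc_path traditional   = { 2, 64, 2 }; /* sockets-style */
static const struct ipc_path direct_switch = { 0,  2, 0 }; /* Spring-style  */

/* illustrative cost weights, in "cycles" */
int ipc_cost(const struct ipc_path *p) {
    return p->sched_decisions  * 500
         + p->state_saved_words * 10
         + p->buffer_copies    * 300;
}
```

Under any plausible weights the direct-switch path wins by one to two orders of magnitude, which is the whole point of second-generation IPC design.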
As such, all performance issues must be discussed within the context of a specific microkernel. As Jochen Liedtke noted in “Towards Real Microkernels”, 1996:
Performance-related constraints seem to be disappearing. The problem of first-generation microkernels was the limitation of the external-pager concept hardwiring a policy inside the kernel. This limitation was largely removed by L4’s address-space concept, which provides a pure mechanism interface. Instead of offering a policy, the kernel’s role is confined to offering the basic mechanisms required to implement the appropriate policies. These basic mechanisms permit implementation of various protection schemes and even of physical memory management on top of the microkernel.
Further, the common canards related to context switch and TLB flush overhead (as employed in, for instance, this horrible piece of propaganda from 2002 by Miles Nordin that was somehow up to the evidently low editorial standards of the Linux Journal), are false:
The deficiency analysis of the early microkernels identified user-kernel-mode switches, address-space switches, and memory penalties as primary sources of disappointing performance. Regarded superficially, this analysis was correct, because it was supported by detailed performance measurements.
Surprisingly, a deeper analysis shows that the three points— user-kernel-mode switches, address-space switches, and memory penalties—are not the real problems; the hardware-inherited costs of mode and address-space switching are only 3%–7% of the measured costs (see Figure 4). A detailed discussion can be found in .
The situation was strange. On the one hand, we knew the kernels could run at least 10 times faster; on the other, after optimizing microkernels for years, we no longer saw new significant optimization possibilities. This contradiction suggested the efficiency problem was caused by the basic architecture of these kernels.
Indeed, most early microkernels evolved step by step from monolithic kernels, remaining rich in concepts and large in code size. For example, Mach 3 offers approximately 140 system calls and needs more than 300 Kbytes of code. Reducing a large monolithic kernel may not lead to a real microkernel.
#3: Microkernels are a diversion, because a userland server failure will be just as catastrophic as a kernel failure anyway
The only way someone could make such a statement is if they have no idea about the first step of building any fault-tolerant system whatsoever: isolating failures to a local state space. Having something be a userland process with well-defined communication boundaries, instead of an informal boundary inside a monolithic kernel address space, makes a huge difference. Explicit componentization and isolation also make specific reliability regimes like fail-safe, fail-secure and fail-passive much simpler to implement as matters of policy, since the primitives are all present.
First, the MINIX 3 developers note the main reliability improvements gained by the multiserver componentization alone:
Monolithic operating systems (e.g., Windows, Linux, BSD) have millions of lines of kernel code. There is no way so much code can ever be made correct. In contrast, MINIX 3 has about 4000 lines of executable kernel code. We believe this code can eventually be made fairly close to bug free.
In monolithic operating systems, device drivers reside in the kernel. This means that when a new peripheral is installed, unknown, untrusted code is inserted in the kernel. A single bad line of code in a driver can bring down the system. This design is fundamentally flawed. In MINIX 3, each device driver is a separate user-mode process. Drivers cannot execute privileged instructions, change the page tables, perform I/O, or write to absolute memory. They have to make kernel calls for these services and the kernel checks each call for authority.
In monolithic operating systems, a driver can write to any word of memory and thus accidentally trash user programs. In MINIX 3, when a user expects data from, for example, the file system, it builds a descriptor telling who has access and at what addresses. It then passes an index to this descriptor to the file system, which may pass it to a driver. The file system or driver then asks the kernel to write via the descriptor, making it impossible for them to write to addresses outside the buffer.
Dereferencing a bad pointer within a driver will crash the driver process, but will have no effect on the system as a whole. The reincarnation server will restart the crashed driver automatically. For some drivers (e.g., disk and network) recovery is transparent to user processes. For others (e.g., audio and printer), the user may notice. In monolithic systems, dereferencing a bad pointer in a (kernel) driver normally leads to a system crash.
If a driver gets into an infinite loop, the scheduler will gradually lower its priority until it becomes the idle process. Eventually the reincarnation server will see that it is not responding to status requests, so it will kill and restart the looping driver. In a monolithic system, a looping driver hangs the system.
MINIX 3 uses fixed-length messages for internal communication, which eliminates certain buffer overruns and buffer management problems. Also, many exploits work by overrunning a buffer to trick the program into returning from a function call using an overwritten stacked return address pointing into the overrun buffer. In MINIX 3, this attack does not work because instruction and data space are split and only code in (read-only) instruction space can be executed.
Device drivers obtain kernel services (such as copying data to users' address spaces) by making kernel calls. The MINIX 3 kernel has a bit map for each driver specifying which calls it is authorized to make. In monolithic systems every driver can call every kernel function, authorized or not.
The kernel also maintains a table telling which I/O ports each driver may access. As a result, a driver can only touch its own I/O ports. In monolithic systems, a buggy driver can access I/O ports belonging to another device.
Not every driver and server needs to communicate with every other driver and server. Accordingly, a per-process bit map determines which destinations each process may send to.
A special process, called the reincarnation server, periodically pings each device driver. If the driver dies or fails to respond correctly to pings, the reincarnation server automatically replaces it by a fresh copy. The detection and replacement of nonfunctioning drivers is automatic, without any user action required. This feature does not work for disk drivers at present, but in the next release the system will be able to recover even disk drivers, which will be shadowed in RAM. Driver recovery does not affect running processes.
When an interrupt occurs, it is converted at a low level to a notification sent to the appropriate driver. If the driver is waiting for a message, it gets the interrupt immediately; otherwise it gets the notification the next time it does a RECEIVE to get a message. This scheme eliminates nested interrupts and makes driver programming easier.
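Two of the mechanisms above, fixed-length messages and the per-driver bit map of permitted kernel calls, are simple enough to sketch directly. The struct layout and call numbers below are illustrative, not MINIX 3’s actual ones.

```c
/* Sketch of two MINIX 3 mechanisms: fixed-length IPC messages, and a
 * per-driver bit map restricting which kernel calls a driver may make.
 * Layout and call numbers are illustrative, not MINIX 3's real ones. */
#include <stdint.h>

typedef struct {
    int source;            /* sending process */
    int type;              /* request code */
    uint32_t payload[8];   /* fixed size: no variable-length parsing */
} message;

enum { KC_SAFECOPY, KC_IRQCTL, KC_DEVIO, KC_SETGRANT, KC_COUNT };

struct priv {              /* per-process privilege record */
    uint32_t call_mask;    /* bit i set => kernel call i is authorized */
};

/* the kernel checks every call against the caller's bit map */
int kernel_call_allowed(const struct priv *p, int call) {
    if (call < 0 || call >= KC_COUNT) return 0;
    return (p->call_mask >> call) & 1;
}

/* e.g. a disk driver gets safe-copy and device I/O, nothing else */
static const struct priv disk_priv = {
    .call_mask = (1u << KC_SAFECOPY) | (1u << KC_DEVIO)
};
```

The contrast with a monolithic kernel, where every driver can call every kernel function, is exactly this one table lookup.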
Nonetheless, there is cause for skepticism. Linus Torvalds writes:
The problem there is that most device drivers don’t crash the system by following wild pointers.
They crash the system by just not doing the right thing to the hardware (or the hardware itself being buggy). The system ends up crashing because the data off the harddisk is corrupt, which is REALLY BAD, but has nothing to do with protection domains, and everything to do with the fact that hardware is often flaky and complex and badly documented.
Let’s take a look at the hard disk example, then. Is it really hopeless in such a scenario?
MINIX 3 developers write in “Dealing with Driver Failures in the Storage Stack”, 2009:
In this work, we have extended a multiserver operating system with a filter-driver framework and used it to improve MINIX 3’s ability to deal with driver failures in the storage stack. The filter driver operates transparently to both the file-system server and block-device driver and does not require any changes to either component. This flexibility is typically not found in other approaches and proved to be very useful to implement quickly and experiment with different protection strategies.
In particular, we have used checksumming and mirroring in order to provide end-to-end integrity for file-system data. In addition, we have instrumented the filter with a semantic model of the driver so that it can detect driver-protocol violations and proactively recover. By building on MINIX 3’s ability to dynamically start and stop drivers, our filter driver can provide recovery for many failures that would be fatal in other systems. For example, if the block-device driver exhibits a failure because of aging bugs, the filter driver can request the driver manager to replace the faulty block-device driver with a fresh copy.
We have evaluated our ideas by running several experiments on a prototype implementation. Fault-injection testing demonstrates the filter’s ability to detect and recover from both data-integrity problems and driver-protocol violations. Performance measurements show that the average overhead for various benchmarks ranges from 0% to 28%, which seems an acceptable trade-off for improved dependability in safety-critical applications.
The framework’s flexibility also greatly facilitates future experimentation. For example, one logical extension would be to use cryptographic hashes and data encryption in order to deal with not only buggy, but also malicious drivers. Another interesting option may be to provide the filter driver with information about key file-system data structures in order to implement more fine-grained protection strategies.
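The checksumming strategy is easy to sketch. Below, a hypothetical filter sits between the file-system server and the block driver, stamping each 512-byte block with a Fletcher-style checksum on write and verifying it on read; a mismatch is the signal to fall back to a mirror copy or ask the driver manager for a fresh driver. Everything here is a simplified stand-in for the paper’s actual implementation.

```c
/* Sketch of a checksumming filter driver sitting between file system
 * and block driver. Simplified stand-in for the paper's design. */
#include <stdint.h>
#include <stddef.h>

/* Fletcher-style checksum over bytes (simplified; not true Fletcher-32) */
static uint32_t fletcher_sum(const uint8_t *d, size_t n) {
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < n; i++) { a = (a + d[i]) % 65535; b = (b + a) % 65535; }
    return (b << 16) | a;
}

struct filtered_block {
    uint8_t data[512];
    uint32_t sum;          /* stored alongside: end-to-end integrity */
};

void filter_write(struct filtered_block *blk, const uint8_t *src) {
    for (int i = 0; i < 512; i++) blk->data[i] = src[i];
    blk->sum = fletcher_sum(blk->data, 512);
}

/* returns 0 on success, -1 if corruption is detected; on -1 the filter
 * would read the mirror or ask the driver manager for a fresh driver */
int filter_read(const struct filtered_block *blk, uint8_t *dst) {
    if (fletcher_sum(blk->data, 512) != blk->sum) return -1;
    for (int i = 0; i < 512; i++) dst[i] = blk->data[i];
    return 0;
}
```

The key property is that neither the file-system server nor the block driver needs to know the filter exists, which is what makes this kind of experimentation cheap in a multiserver system.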
But there is more to be gained. How about full-system live upgrade driven by compiler instrumentation, with rollback-recovery on failure?
Alright, how about more intricate issues concerning internal state corruption that may be propagated? Besides the checkpoint solutions above, from “CuriOS: Improving Reliability through Operating System Structure”, 2008, a layout is described through which:
Through simple fault injection experiments with various systems, we gain insights into properties that are essential for successful client-transparent recovery of OS services. We have described a design for structuring an OS that preserves these properties. CuriOS minimizes error propagation and persists client information using distributed and isolated OS service state to enhance the transparent restartability of several system components. Restricted memory access permissions prevent erroneous OS services from corrupting arbitrary memory locations. Our experimental results show that it is possible to isolate and recover core OS services from a significant percentage of errors with acceptable performance.
It is through a carefully crafted microkernel and capability system that KeyKOS (and its successors EROS and Coyotos) exploits its single-level store where virtual memory and the storage layer are blurred to create a fully orthogonally persistent and coherent OS. See “The Checkpoint Mechanism in KeyKOS”, 1992.
#4: Microkernels turn well understood memory-protection problems into poorly understood communication problems
Someone better inform the computer networking community that they’re all a bunch of jackasses who have no hope of ever decoding the pernicious “communication problems”. Oh shit, wait. The only reason you’re even reading this article is because of the massive research that has been undertaken into solving precisely those very same “communication problems”. Microkernel IPC is no less enhanced by all this work than any other transmission mechanism.
In fact, one of the major downsides of monolithic kernels is that lots of communication takes place which is implicit and unaccountable, since the whole kernel is one large blob residing in its own address space, with all communication boundaries being purely informal and ready to be stomped on by any misbehaving kernel module or non-modular subsystem. There is mostly no separation or componentization (beyond the minimum of loadable modules for the latter) to speak of, other than where the entry points are called in the initialization code.
It also must be underlined that memory protection is not a solved problem by any stretch. The great effort being constantly put into stronger and stronger exploit mitigations is a big counterexample to this claim. There’s also been relatively little attention paid to capability-based addressing. In addition, #9 points out areas of research in strengthening monolithic kernels influenced by microkernel research that all aim to resolve these MP deficiencies. It’s therefore an open issue.
See #7 below for more elucidation on a similar complaint as #4, concerning data sharing.
#5: Microkernels are a bourgeois plot to undermine free software
Yes, this is a seriously held belief.
I am 100% on the side of Linus Torvalds when it comes to microkernels. […] I am not the least bit excited about any progress made in microkernels, I feel that it can only result in much more closed systems that are easier to implement in ways that make them harder to modify. This is why I wish for Hurd to continue to fail.
– nickysielicki, making Alex Jones seem cool and rational in comparison
Right off the bat: the architecture of the Hurd, a multiserver microkernel, was explicitly chosen by the GNU project at the strong insistence of RMS, precisely because of the greater user freedom it offers! Perhaps one might believe that GNU and the FSF are actually deep-cover agents for the Business Software Alliance and SCO, which would make the above quote consistent, but this would be a heterodox view, to say the least.
nickysielicki rationalizes their delusions thusly:
Let’s go to an alternative universe where Hurd was successful in the 90’s and it reached common usage to the extent that Linux has today.
You’re Western Digital in 2008 and you’re making a TV set-top box called the WDTV-Live. I own one of these in real life universe. It runs linux, which is awesome, because that means that I can SSH into it. It runs an apache server in my home. It can download from usenet or torrents. I can control it via SSH instead of using the remote control.
In this alternative universe, WDLX is going to use Hurd instead of Linux, because for this small device it will certainly have better performance on their underpowered MIPS chip. And they’re not going to ship anything besides what they have to, because this is a small embedded computer.
What happens to that homebrew community when they ship a microkernel with proprietary servers for everything, and nothing else? It’s going to be profoundly difficult to develop on this. You might already see this if you own a chromebook or a WDTV– missing kernel modules means that you simply can’t do anything without compiling your kernel. Couple this with secureboot and you’re locked in.
I’m no expert on these things, most of this is based on brief research from years ago. If you think that I’m wrong, please tell me why, I’d love to be proven wrong. But for the time being, I believe widespread implementations of microkernels would be very anti-general-purpose computing.
Above we see a profound inability to tell the difference between a kernel, its servers/subsystems and application programs. The commentators (linked above) refute this line of thinking.
And just to be clear, what does every microkernel hater’s beloved Linus Torvalds think on the subject?:
You bought their design. It was your choice.
And yes, you own the hardware, and you can hack it any which way you like (modulo laws and any other contracts you signed when you bought it). But they had the right to design it certain ways, and part of that design may be making it harder for you to hack.
For example, they may have used glue to put the thing together rather than standard phillips screws. Or poured resin over some of the chips. All of which has been done (not necessarily with Linux, but this really is an issue that has nothing to do with Linux per se). Making the firmware or hardware harder to access or modify is their choice.
Your choice is whether you buy it, despite the fact that you know it’s not necessarily all that easy to hack.
The “when I buy it, I own it” argument is a favourite of the GPLv3 shills, but it’s irrelevant. The design was done long before you bought it, and yes, Tivo had the right to design and build it, any which way they wanted to.
You are missing the picture. Sure, you can do whatever you want to (within any applicable laws) after you bought it. But that doesn’t take away the right from the manufacturer to design it his way…
Well, so much for that.
#6: Hah, those stupid fucks are running Linux on top of their microkernel! What happened to microkernels being so great, fags?
Er, so bypassing the microkernel for the vast majority of your work is a vindication of the “microkernels are just better” line is it?
– ris, in another display of fractal wrongness
This is a profound misunderstanding of what a microkernel is and can be used for.
The quote above refers to L4Linux, whose specific goal regarding paravirtualization the author misses. Quoting from Jonathan Shapiro in 2006, with the obsolete parts dotted out:
Paravirtualization is an important idea, both in its own right and as a partial solution for reliability. It is going to be critical in the success of the Intel and AMD hardware virtualization support.
The credit for this idea and its development properly belongs with the Xen team at Cambridge University (see: Xen and the Art of Virtualization), not with the L4 team in Dresden. The idea of adapting paravirtualization to a microkernel for use within an operating system is, frankly, silly. Several existing microkernels, including L4.sec (though not L3 or L4), KeyKOS, EROS, Coyotos, and CapROS, have isolation and communication mechanisms that are already better suited to this task. My list intentionally does not include Minix-3, where IPC is unprotected. The major change between L4 and L4.sec is the migration to protected IPC.
In practice, the reason that the L4 team looked at paravirtualization was to show that doing it on a microkernel was actually faster than doing it on Xen. This was perfectly obvious to anybody who had actually read the research literature: the L4Linux performance was noticeably better than the Xen/Linux performance. The only question was: how well would L4Linux scale when multiple copies of Linux were run. The answer was: very well.
[…] L4 has unquestionably demonstrated reliability, but only in situations where the applications are not hostile. L4 has not demonstrated practical success in security or fault isolation. This is the new push in the L4 community. It is why L4.sec (a new research project centered at Dresden) has adopted some fairly substantial architectural evolution in comparison to the L3 and L4 architectures.
There are several reasons why you’d want to run Linux on top of L4. One is the classical virtualization use case of processor consolidation: running multiple systems side-by-side on the same processor, using virtual-machine encapsulation to isolate one from the other. This is the configuration that is indeed used on some of the phones that ship with L4.
The second reason is legacy support: If you think you can just introduce a new OS API (even if it’s POSIX-compliant) and the world will adapt all its software, you’re dreaming. People want to not only keep re-using their old software, they even want to build new software to the old environments. But that shouldn’t stop you from providing a better environment for safety- and security-critical components.
This trend is indeed very strong in the embedded world, even (or particularly) in the domain of safety- or security-critical devices. I rarely get to talk to someone considering deploying seL4 who doesn’t have a need for a Linux or Windows environment. The important point is that a highly trustworthy microkernel underneath allows this to co-exist safely with the critical functionality.
At its core, a microkernel need not be more than a minimal hardware multiplexing base (though earlier, first-generation microkernels took on more than that, to the detriment of composability and flexibility; performance, contrary to legend, was not the casualty), which can then be used to implement virtual machine monitors, separation kernels for multi-level secure environments or full end-user operating systems. In turn, the latter can be either single-server (the whole OS runs as one large server built on top of the microkernel primitives; examples include LITES for 4.4BSD and Sprite over Mach) or multiserver, which is what most people think of when they hear “microkernel,” and indeed what reaps the most benefits.
As Gernot Heiser describes it, regarding microkernels versus hypervisors:
I get asked this question a lot: what is the difference between a hypervisor and a microkernel? Frequently the question is accompanied by competitor-planted bullshit such as: isn’t it better to use a hypervisor for virtualization, as it is specifically designed for that, while a microkernel isn’t? But the question also pops up at scientific meetings, such as this week’s IIES workshop.
The short answer is that a microkernel is a possible implementation of a hypervisor (the right implementation, IMHO), but can do much more than just providing virtual machines.
For the long answer we have to dig a bit deeper, as the two have different motivations:
1) A hypervisor, also called a virtual-machine monitor, is the software that implements virtual machines. It is designed for the sole purpose of running de-privileged “guest” operating systems on top (except for the deceptive pseudo-virtualizers). As such it is (or contains) a kernel (defined as software running in the most privileged mode of the hardware).
2) A microkernel is a minimal base for building arbitrary systems (including virtual machines). It is characterised as containing the minimal amount of code that must run in the most privileged mode of the hardware in order to build arbitrary (yet secure) systems.
#7: Microkernels are shared-nothing and limit scalability due to keeping all data structures server-local
Absolutely not. The whole point of having an IPC mechanism, in fact, is to make data sharing an explicit and accountable operation. Example: Mach messages can carry whole VM regions as out-of-line (OOL) data, remapped into the receiver’s address space rather than copied byte-by-byte.
That said, shared-nothing is not at all a bad idea, and microkernel designers are converging towards that ideal. What, do you think the increasing popularity of actor models and CSP over traditional shared-state concurrency is some sort of coincidence? The recent renaissance of functional programming has turned “mutable state” into a profanity for many.
Quoting from Shapiro in 2006 again, in response to Linus Torvalds:
Linus makes some statements that are (mostly) true, but he draws the wrong conclusions.
“… It’s ludicrous how microkernel proponents claim that their system is ‘simpler’ than a traditional kernel. It’s not. It’s much much more complicated, exactly because of the barriers that it has raised between data structures.
The fundamental result of [address] space separation is that you can’t share data structures. That means that you can’t share locking, it means that you must copy any shared data, and that in turn means that you have a much harder time handling coherency.”
The last sentence is obviously wrong: when you do not share data structures, there is no coherency problem by definition. Technically, it is possible to share memory in microkernel-based applications, but the statement is true in the sense that this practice is philosophically discouraged.
I don’t think that experienced microkernel advocates have ever argued that a microkernel system is simpler overall. Certainly, no such argument has appeared in the literature. The components are easier to test and engineer, but Linus makes a good point when he says The fact that each individual piece is simple and secure does not make the aggregate … simple (he adds: or secure, which is wrong). I don’t think that any of us would claim that large systems are simple, but this complexity is an intrinsic attribute of large systems. It has nothing to do with software construction.
What modern microkernel advocates claim is that properly component-structured systems are engineerable, which is an entirely different issue. There are many supporting examples for this assertion in hardware, in software, in mechanics, in construction, in transportation, and so forth. There are no supporting examples suggesting that unstructured systems are engineerable. In fact, the suggestion flies in the face of the entire history of engineering experience going back thousands of years. The triumph of 21st century software, if there is one, will be learning how to structure software in a way that lets us apply what we have learned about the systems engineering (primarily in the fields of aeronautics and telephony) during the 20th century.
Linus argues that certain kinds of systemic performance engineering are difficult to accomplish in component-structured systems. At the level of drivers this is true, and this has been an active topic of research in the microkernel community in recent years. At the level of applications, it is completely false. The success of things like GNOME and KDE rely utterly on the use of IDL-defined interfaces and separate component construction. Yes, these components share an address space when they are run, but this is an artifact of implementation. The important point here is that these applications scale because they are component structured.
Ultimately, Linus is missing the point. The alternative to structured systems is unstructured systems. The type of sharing that Linus advocates is the central source of reliability, engineering, and maintenance problems in software systems today. The goal is not to do sharing efficiently. The goal is to structure a system in such a way that sharing is minimized and carefully controlled. Shared-memory concurrency is extremely hard to manage. Consider that thousands of bugs have been found in the Linux kernel in this area alone. In fact, it is well known that this approach cannot be engineered for robustness, and shared memory concurrency is routinely excluded from robust system designs for this reason.
Yes, there are areas where shared memory interfaces are required for performance reasons. These are much fewer than Linus supposes, but they are indeed hard to manage (see: Vulnerabilities in Synchronous IPC Designs). The reasons have to do with resource accountability, not with system structure.
When you look at the evidence in the field, Linus’s statement “the whole argument that microkernels are somehow ‘more secure’ or ‘more stable’ is also total crap” is simply wrong. In fact, every example of stable or secure systems in the field today is microkernel-based. There are no demonstrated examples of highly secure or highly robust unstructured (monolithic) systems in the history of computing.
The essence of Linus’s argument may be restated as “Microkernel-based systems make it very hard to successfully use a design approach that is known to be impossible to engineer robustly.”
I agree completely.
AST also weighs in:
Linus' basic point is that microkernels require distributed algorithms and they are nasty. I agree that distributed algorithms are hell on wheels, although together with Maarten van Steen I wrote a book dealing with them. I have also designed, written and released two distributed systems in the past decade, Amoeba (for LANs) and Globe (for WANs). The problem with distributed algorithms is lack of a common time reference along with possible lost messages and uncertainty as to whether a remote process is dead or merely slow. None of these issues apply to microkernel-based operating systems on a single machine. So while I agree with Linus that distributed algorithms are difficult, that is not germane to the discussion at hand.
Besides, most of the user-space components are drivers, and they have very straightforward interactions with the servers. All character device drivers obey pretty much the same protocol (they read and write byte streams) and all block device drivers obey pretty much the same protocol (they read and write blocks). The number of user-space servers is fairly small: a file server, a process server, a network server, a reincarnation server, a data store, and a few more. Each has a well-defined job to do and a well-defined interaction with the rest of the system. The data store, for example, provides a publish/subscribe service to allow a loose coupling between servers when that is useful. The number of servers is not likely to grow very much in the future. The complexity is quite manageable. This is not speculation. We have already implemented the system, after all. Go install MINIX 3 and examine the code yourself.
Linus also made the point that shared data structures are a good idea. Here we disagree. If you ever took a course on operating systems, you no doubt remember how much time in the course and space in the textbook was devoted to mutual exclusion and synchronization of cooperating processes. When two or more processes can access the same data structures, you have to be very, very careful not to hang yourself. It is exceedingly hard to get this right, even with semaphores, monitors, mutexes, and all that good stuff.
My view is that you want to avoid shared data structures as much as possible. Systems should be composed of smallish modules that completely hide their internal data structures from everyone else. They should have well-defined ‘thin’ interfaces that other modules can call to get work done. That’s what object-oriented programming is all about: hiding information, not sharing it. I think that hiding information (à la Dave Parnas) is a good idea. It means you can change the data structures, algorithms, and design of any module at will without affecting system correctness, as long as you keep the interface unchanged. Every course on software engineering teaches this. In effect, Linus is saying the past 20 years of work on object-oriented programming is misguided. I don’t buy that.
Once you have decided to have each module keep its grubby little paws off other modules' data structures, the next logical step is to put each one in a different address space to have the MMU hardware enforce this rule. When applied to an operating system, you get a microkernel and a collection of user-mode processes communicating using messages and well-defined interfaces and protocols. Makes for a much cleaner and more maintainable design. Naturally, Linus reasons from his experience with a monolithic kernel and has arguably been less involved in microkernels or distributed systems. My own experience is based on designing, implementing, and releasing multiple such operating systems myself. This gives us different perspectives about what is hard and what is not.
A lot of you are probably tempted to call AST out on his remark about OOP and encapsulation, perhaps saying that OOP has been a failure. Yet the Linux kernel is absolutely full of object-oriented design patterns. Neil Brown wrote an excellent series on LWN describing them in detail.
(I would also wager many of the more naive and zealous OO critics have never read Bertrand Meyer or used Eiffel and Smalltalk, but that’s a tangential point.)
Or consider Joe Duffy’s account of his work on Microsoft Research’s Midori project, in the context of full-system asynchronicity:
A key to achieving asynchronous everything was ultra-lightweight processes. This was possible thanks to software isolated processes (SIPs), building upon the foundation of safety described in an earlier post.
The absence of shared, mutable static state helped us keep processes small. It’s surprising how much address space is burned in a typical program with tables and mutable static variables. And how much startup time can be spent initializing said state. As I mentioned earlier, we froze most statics as constants that got shared across many processes. The execution model also resulted in cheaper stacks (more on that below) which was also a key factor. The final thing here that helped wasn’t even technical, but cultural. We measured process start time and process footprint nightly in our lab and had a “ratcheting” process where every sprint we ensured we got better than last sprint. A group of us got in a room every week to look at the numbers and answer the question of why they went up, down, or stayed the same. We had this culture for performance generally, but in this case, it kept the base of the system light and nimble.
Code running inside processes could not block. Inside the kernel, blocking was permitted in select areas, but remember no user code ever ran in the kernel, so this was an implementation detail. And when I say “no blocking,” I really mean it: Midori did not have demand paging, which, in a classical system, means that touching a piece of memory may physically block to perform IO. I have to say, the lack of page thrashing was such a welcome that, to this day, the first thing I do on a new Windows system is disable paging. I would much rather have the OS kill programs when it is close to the limit, and continue running reliably, than deal with paging madness.
#8: If microkernels are so great, why is nobody using them?
Except they are.
By some accounts, the most ubiquitous operating system in the world is TRON, a Japanese commodity microkernel specification with many variants.
By early 2012, deployments of OKL4 had surpassed 1.5 billion, particularly in GSM baseband processors.
INTEGRITY, particularly its variant INTEGRITY-178B, has an Evaluation Assurance Level of 6 and is compliant with the DO-178B standard for avionics software (hence the name); it is used in the B-2, F-16, F-22, F-35, Airbus A380 and others.
Nokia’s Symbian, the dominant mobile operating system for over a decade, used a microkernel, disproving any notion that microkernels are unsuitable for such devices.
QNX is used in military hardware, industrial automation, vehicles, medical equipment and networking hardware. It’s particularly famous for its use in the automotive industry, where a 2011 report estimated it held roughly 60% of the infotainment and telematics market.
PikeOS is used by the likes of Airbus, Thales, Continental, Raytheon, Samsung, Rheinmetall, Rockwell-Collins, B. Braun, Miele and Rohde & Schwarz. Most famously it is in the Airbus A350 and Airbus A400M.
#9: I’ve barely had my Debian boxes crash on me, therefore monolithic kernels are highly robust!
It’s statements like the above that make “software engineering” sound like an oxymoron. Let’s all ignore that the gold standard for life-critical systems is secure RTOSes, most of them microkernels. No, it’s the sharded MangoDB clusters sustaining containerized Baboontoo images from the Crocker registry that are the apex of engineering.
But monolithic kernels are not highly robust. Be it nested kernels, VirtuOS, microdrivers, the Rx mechanism, microrebooting/recursive restartability, failure-oblivious computing, surviving misbehaved kernel modules and much more, research into retroactively bolting microkernel-like reliability features onto widely used monolithic kernels is an active and desperate area with evident demand.
#10: All of the good ideas from microkernels have already been incorporated into the mainstream work
None of the above have been. In addition, there is nothing like the type-safe modules of SPIN, Spring’s name server being the root of all objects on the system and extensible at runtime, the persistence and crash-resistance of Grasshopper and EROS, any of MINIX 3’s reliability features, or the DIY composability of Fluke. There is so much left to incorporate.