William Lee Irwin III recently announced on the lkml that he’d successfully gotten Linux running on a 64GB x86 server. His posts included two different boot message logs, one without his page clustering patch, and one with. In the latter case, his patch overcomes the 1GB mem_map virtual space limitation imposed by x86 32-bit servers, without which the kernel over-runs allowable memory space. Read the report at KernelTrap.
Page Clustering, Booting Linux On A 64GB x86
2003-03-30 Linux 16 Comments
Linux has been booting and running on 64GiB PAE x86 machines for ages – these patches simply increase the efficiency of such.
I tried to make it clear both that this involves the caveat of not changing the kernel/user split and that Hugh Dickins had already written a patch to do this for 2.4.6/2.4.7, but maybe it wasn’t stated prominently enough. Still, I’m quite excited about getting things working.
Is 64GB the limit, or will the system be able to handle further memory increases as the hardware improves?
This technique for increasing the amount of memory it’s feasible to manage is somewhat specific to i386 PAE’s situation because its kernel virtualspace is fixed at 1GB despite the growth of physical memory. 64-bit machines have already been working quite well on systems with 64GB of RAM or more without any modification like this. If i386 is extended to do greater than 36-bit physical addressing this will continue to fix the issue of the coremap’s space consumption, but other issues, such as growth of software-page-sized reserved virtual areas and internal fragmentation will arise and need to be addressed.
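To make the coremap pressure described above concrete, here is a rough back-of-the-envelope sketch. It assumes one page descriptor per kernel page and a descriptor size of 40 bytes; that size is purely illustrative and is not taken from the post or from any particular kernel tree. The function names are hypothetical as well.

```python
# Rough coremap (mem_map) sizing sketch.
# ASSUMPTION: STRUCT_PAGE_BYTES is an illustrative value, not a real
# kernel figure; the helper names below are hypothetical.
GIB = 1 << 30
MIB = 1 << 20
STRUCT_PAGE_BYTES = 40  # assumed size of one per-page descriptor


def coremap_bytes(phys_bytes, page_bytes, desc_bytes=STRUCT_PAGE_BYTES):
    """Bytes needed to hold one descriptor per page of physical memory."""
    return (phys_bytes // page_bytes) * desc_bytes


def coremap_fraction(phys_bytes, page_bytes, kernel_virt_bytes=GIB):
    """Share of a fixed kernel virtualspace that the coremap would occupy."""
    return coremap_bytes(phys_bytes, page_bytes) / kernel_virt_bytes


if __name__ == "__main__":
    # Compare 4KiB pages with a clustered 32KiB page on a 64GiB machine.
    for page_bytes in (4096, 32768):
        size = coremap_bytes(64 * GIB, page_bytes)
        share = coremap_fraction(64 * GIB, page_bytes)
        print(f"{page_bytes:6d}-byte pages: {size // MIB} MiB, "
              f"{share:.1%} of a 1GiB kernel virtualspace")
```

Under these assumptions the descriptor array shrinks in direct proportion to the growth in page size, which is the effect the patch relies on.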
AFAIK, yes; you could boot those 64GB machines with Linux, but doing *anything* with them under Linux was a *seriously* silly option. Is there anybody who made that mistake (running 64GB x86 production boxes with Linux kernels on them)?
I find it good that Linux can now do this.
Bill, if/when this is merged, will it have any regressions for small machines? (You know, not everybody can run 64GB machines.)
And congrats. This must have been a HUGE amount of work.
I really like this quote:
“until it is done it will have severe performance problems on small memory machines (say, less than 16GB)”
It shouldn’t have regressions for small machines when both complete and activated. Today’s standards of “small machines” are far above that of the days of the VAX and so on when the physical memory was in such short supply that there was a very immediate tradeoff between userspace and coremap space consumption. Page clustering on more modern systems does not see anywhere near the internal fragmentation of userspace (when implemented properly) but rather benefits from physical contiguity of memory for io and also larger filesystem blocksizes.
The current issues with respect to small memory machines are a result of an incomplete implementation. The final phases of the implementation (which I am actually currently working on) are specifically dedicated to resolving those fragmentation and/or poor anonymous memory utilization issues as well as cleanups of the code (which at the moment is quite dirty, at least by my standards).
Also, you mention that the machines were bootable but it wasn’t feasible to do much on them. That also needs some qualification, as the use of 2/2 split patches enabled them to support several kinds of workloads, but as it was an ABI violation, prevented some from running. Also, there are others (database workloads feasible to run with only 2GB of virtualspace, which are rare, but exist) whose performance depends critically on the amount of virtualspace granted to userspace, and which would very much like to expand userspace to 3.5GB; expanding userspace to 3.5GB on Linux on a 64GB x86 machine is probably not feasible given the number of kernel allocations the workloads would generally require but some investigation is probably worthwhile.
Don’t take this too far out of context. It is an incomplete implementation and its incompleteness created a severe performance pathology. The “16GB” number is a guesstimate of the amount of physical memory required to absorb the detrimental effects of the pathology I’ve not resolved yet.
Not only will a complete implementation not have this pathology at all, it should benefit performance for various reasons. Furthermore, in the event it should for some reason be found undesirable, it’s perfectly feasible to configure PAGE_MMUSHIFT to 0, making PAGE_SIZE == PAGE_MMUSIZE and so having zero impact on the kernel vs. the kernel prior to page clustering.
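The relationship between the three constants named above can be sketched as follows. This assumes PAGE_SIZE is simply PAGE_MMUSIZE shifted left by PAGE_MMUSHIFT, which the shift-of-0 remark suggests but the post does not spell out. The helper names are hypothetical.

```python
# Sketch of how the clustering constants might relate.
# ASSUMPTION: PAGE_SIZE == PAGE_MMUSIZE << PAGE_MMUSHIFT; the post only
# states the PAGE_MMUSHIFT == 0 case explicitly.
PAGE_MMUSIZE = 4096  # hardware (MMU) page size on i386


def page_size(page_mmushift, page_mmusize=PAGE_MMUSIZE):
    """Kernel page size for a given clustering shift."""
    return page_mmusize << page_mmushift


def mmu_pages_per_page(page_mmushift):
    """Number of hardware pages grouped into one kernel page."""
    return 1 << page_mmushift


if __name__ == "__main__":
    for shift in range(4):
        print(f"PAGE_MMUSHIFT={shift}: PAGE_SIZE={page_size(shift)}, "
              f"{mmu_pages_per_page(shift)} MMU page(s) per kernel page")
```

With a shift of 0 the two sizes coincide, matching the "zero impact" configuration described above.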
Again, this performance issue has nothing to do with page clustering itself. It’s essentially an unfixed “problem” (you might even call it a bug) in a very early, very prototypical implementation (even though this same codebase will also serve as the shipping version, “prototypical” is a statement of its testing level).
I have tried to run several versions of Red Hat 8’s bigmem kernel on a dual-Xeon machine with 8 GB RAM, and it just kept crashing for no obvious reason (the equivalent smp kernels run without any problems, if one can consider ignoring half the RAM “no problems”).
Has anyone _positive_ experience with the bigmem kernel? Details?
looks like Linux still has issues on bigmem
and I thought BeOS with its 1GB limit was late
(same problem as Linux had sometime before, physical mem mapped into the kernel virtual space :^)
I can’t really speak to Red Hat’s bigmem kernel. I know it does not include my patch (and it wouldn’t be appropriate for them to do so until it’s much stabler). I think resolving your instabilities would be better served by reporting to the RH bugzilla; this is against mainline 2.5, and RH is operating on a very large diff vs. 2.4.9 or thereabouts for AS and 2.4.18 or so for rawhide and other non-AS kernels. I’m sorry to say that I’m literally so unfamiliar with RH’s tree that I’m not even sure which version you have in mind. You’re really going to have to ask them about it.
If RH were to utilize this technique at all for their 2.4.x kernels, it would likely be a better strategy to use Hugh Dickins’ original patches than mine. Mine are very much 2.5.x-based, and I intend to continue this divergence in order to both exploit 2.5.x-specific functionality and obtain a cleaner codebase by introducing 2.5.x-specific notions around both highpte and rmap during the cleanup phase. That phase involves deriving an abstraction layer to reduce the amount of code (C source linewise) involved in the generic fault handlers.
I doubt this is the same affair. The notion of memory outside the kernel address space was introduced before 2.4.0, and PAE, which has the pagetable format and so on needed for over 4GB of RAM, was introduced at more or less the same time.
Linux has been running on x86 machines with RAM sizes as large as 16GB with few, if any issues for several years. 32GB and 64GB x86 machines (which were initially released at least 5 years ago) are the ones truly affected by the design limitation I’m addressing, even though there are features and general performance improvements that also follow from this same technique.
This is actually some evidence of Linux’ conservatism with respect to design; if the hardware “fad” didn’t take off, it is likely the kernel would have been polluted by features that don’t apply to the general case if measures had immediately been taken to accommodate it. I believe a good portion of the popularity of this technique is due not only to the fact it addresses this specific design limitation, but also to its other aspects of increasing functionality and also that people have been able to think about whether it’s a good thing to do.
A good part of this is a direct result of Hugh Dickins having done it once for 2.4.x; without his original work a very large amount of the guidance I had writing this would have been totally absent, and also without it kernel maintainers would have been unable to evaluate its desirability for general use. Despite my name being associated with this later (and incomplete!) patch, he really deserves a very significant amount of credit, and (of course) none of the bugreports. I have to grudgingly admit his code was of much higher quality than mine is now, and it will take some time for me to shore up my code to match his original code in any sense.
It’s very nice to see the developer himself reply in the forums, how refreshing. 🙂 Keep up the good work!
I am curious, however, how high Linux can scale with respect to memory usage on 64-bit hardware. Has anyone booted Linux with 1 terabyte of RAM or more, and were there any performance issues?
Well, I read the same stuff everyone else does. I suspect the thread being about something I did effectively motivates me to comment.
I don’t honestly know the upper limits of how far anyone’s tried to stretch Linux on 64-bit hardware. There’s no obvious feasibility issue so there’s no reason why over a TB of RAM couldn’t have been involved. Chances are that if anyone has tried it, SGI has.
I’d suspect TLB coverage and linear algorithms manipulating sets of pages would be issues. Lock contention is also likely to occur on cpu counts typical of machines with that memory size.
64-bit hardware has a natural limit of millions of TBs? Am I right? (16777216 TB, i.e. 2^64 bytes, to be precise)
Well, 64-bit is actually somewhat vague. Different architectures implement different amounts of virtualspace and physical addressing. IPF/IA64 has the largest physical address space that I know of, topping out at 52 bits or thereabouts. AIUI most of the virtual address space limitations arise either from OS limitations or architectural reservation of address bits.
For instance, IPF/IA64 uses the top few bits of a virtual address to denote a “region”. This is something like a backward i386 MTRR: it starts from the bits of virtualspace (instead of physical), and the region registers then define the attributes of the predefined regions.
The region configuration limits the amount of available user address space.
On MIPS, virtualspace and physicalspace are pre-partitioned into the XKSEG, XKPHYS, XSSEG, and XUSEG regions, and the “compatibility space” consisting of KSEG0, KSEG1, SSEG, KSEG3, and USEG. The boundaries are sort of artificial but they are architected, and limit the user address space.
By and large these limits are harmless, as they are “impossibly large”. But the point stands that you don’t actually get a full 64 bits out of the things (though you don’t need it anyway).
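The figures in this exchange follow from simple powers of two. The sketch below converts a count of address bits into TB (taken here as 2^40 bytes). The function name is hypothetical, and the 52-bit input is just the approximate number mentioned above, not a verified architectural limit.

```python
# Convert a number of address bits into addressable terabytes.
# ASSUMPTION: "TB" is read as 2**40 bytes; addressable_tb is a
# hypothetical helper name.
TB = 1 << 40


def addressable_tb(address_bits):
    """Addressable memory, in units of 2**40 bytes, for the given bit count."""
    return (1 << address_bits) / TB


if __name__ == "__main__":
    for bits in (36, 52, 64):
        print(f"{bits} address bits -> {addressable_tb(bits)} TB")
```

A full 64 bits gives the 16777216 TB quoted in the question, while 36 bits gives 0.0625 TB, i.e. the 64GB PAE limit discussed earlier in the thread.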
Core Linux code imposes some hassles when trying to implement larger user address spaces: page tables are required to be 3-level radix trees with PAGE_SIZE-sized nodes. This currently limits several 64-bit architectures to less user address space than they could otherwise grant. For instance, AMD64 is limited to 2TB or so of user address space, because using more entries of its root pagetable node would require awareness of its 4-level hardware page tables.
In principle page clustering, by raising PAGE_SIZE, could widen the nodes and so enable more user address space to be used for 64-bit software refill TLB architectures (whose pagetable structures are largely arbitrary), but at the very least AMD64 needs the 4-level pagetable support and at the moment there appears to be very little interest in utilizing page clustering to widen pagetable radix tree nodes.
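As a rough illustration of why wider nodes buy more address space, the sketch below counts the virtual address bits a radix-tree pagetable can cover. It assumes 8-byte entries, fully populated nodes at every level, and power-of-two sizes. The function and parameter names are hypothetical. It does not try to reproduce the specific AMD64 figure quoted above, which depends on how the root node's entries are used.

```python
# Virtual address bits covered by a radix-tree pagetable.
# ASSUMPTIONS: 8-byte entries, every level uses a full node, and all
# sizes are powers of two. Names here are illustrative only.


def log2_exact(n):
    """Base-2 logarithm of a power of two."""
    assert n > 0 and n & (n - 1) == 0, "expected a power of two"
    return n.bit_length() - 1


def radix_va_bits(node_bytes, mapped_page_bytes, levels=3, entry_bytes=8):
    """Bits of virtual address covered by `levels` levels of nodes."""
    index_bits = log2_exact(node_bytes // entry_bytes)
    return levels * index_bits + log2_exact(mapped_page_bytes)


if __name__ == "__main__":
    print("3 levels, 4KiB nodes :", radix_va_bits(4096, 4096))
    print("3 levels, 16KiB nodes:", radix_va_bits(16384, 4096))
    print("4 levels, 4KiB nodes :", radix_va_bits(4096, 4096, levels=4))
```

Under these assumptions, widening the nodes from 4KiB to 16KiB adds two index bits per level, while adding a fourth level of 4KiB nodes adds nine bits in one step.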