A page is the granularity at which an operating system manages memory. Most CPUs today support a 4 KB page size and so the Android OS and applications have historically been built and optimized to run with a 4 KB page size. ARM CPUs support the larger 16 KB page size. When Android uses this larger page size, we observe an overall performance boost of 5-10% while using ~9% additional memory.
In order to improve the operating system performance overall and to give device manufacturers an option to make this trade-off, Android 15 can run with 4 KB or 16 KB page sizes.
↫ Steven Moreland
Android 15 has been reworked to be page-size agnostic, meaning a single binary can run on either 4 KB or 16 KB versions of Android. Any assumptions about page size have been removed from Android itself; the EROFS and F2FS file systems, as well as UFS, are now compatible with 16 KB pages; and a whole lot more has been changed and refactored to make this transition as effortless as possible.
Application developers do need to do a few things, though. They’ll need to recompile their binaries with 16 KB alignment, and then test them on a device or emulator running a 16 KB build of Android. To make this possible, starting with Android 15 QPR1, the Pixel 8 and Pixel 8 Pro will get a new developer option that reboots the device in 16 KB mode. In addition, Android Studio will gain a 16 KB emulator target as well. The 16 KB page size is an ARM-only feature, so people running the emulator on x86 machines will emulate the 16 KB page size, in which “the Kernel runs in 4 KB mode, but all addresses exposed to applications are aligned to 16 KB”.
Of course, Google urges Android developers to test for 16 KB page sizes as soon as possible.
So this requires new executable alignment, meaning no existing programs will run? And it’s not possible to JIT compile x86 user code to run on such a system, since the kernel cannot implement 4Kb page size semantics?
malxau,
I agree with you, it’s really odd. Here is the relevant quote:
Most local data/stack memory is aligned to 16 bytes and not the page size, be it 4KB, 16KB, 2MB or whatever. I don’t think page size makes any difference at all to local allocations in typical user space software. We can change the page size on linux and it doesn’t require recompiling all the software.
The only functions that should care are those that explicitly allocate pages. sbrk accepts any byte length and does not depend on page size because the kernel can just round up.
https://linux.die.net/man/2/sbrk
mmap seems to be the main culprit here:
https://www.man7.org/linux/man-pages/man2/mmap.2.html
Unless mmap is specifically instructed to use the specified address via MAP_FIXED (normal software shouldn’t need this), the kernel will always return a page-aligned address. No problem there. The length does have to be a multiple of the page size though, so mmap lengths may need to change. But correct code and memory allocators were already required to do that per the spec anyway.
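To illustrate (a minimal sketch, assuming an ordinary POSIX system, nothing Android-specific): query the page size at runtime with sysconf and round the mmap length up to it, rather than baking in 4K.

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* Query the page size at runtime instead of assuming 4 KB. */
    long page = sysconf(_SC_PAGE_SIZE);
    size_t want = 100000;                 /* arbitrary, not page-aligned */

    /* Round the length up to a multiple of the page size
     * (page sizes are powers of two, so the mask trick works). */
    size_t len = (want + page - 1) & ~((size_t)page - 1);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("page size %ld, mapped %zu bytes at %p\n", page, len, p);
    munmap(p, len);
    return 0;
}

Code written this way shouldn’t care whether the kernel uses 4K or 16K pages.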
Testing the software is definitely warranted, but correct code should require no changes at all. I don’t do android development though; is there something about the android SDK specifically that depends on 4K pages? Anyone know?
Most unix software is dynamically linked to standard libraries that include allocators. These can also be updated without recompiling the software. Does android not do this? That seems a bit weird.
Ah, this link contains more specific details!
https://developer.android.com/guide/practices/page-sizes#16-kb-impact
Apparently the recompile is needed to change the alignment specified in the ELF sections. Specifying a 4kB page size in the ELF could cause the loader to use an unaligned address on a 16kB architecture or configuration. Although I still don’t see why they couldn’t just fix the loader to use a larger alignment anyway. It seems to me that if they did that, the vast majority of native android software would still work fine, and that would be worth doing. Am I missing something?
Ehhh… it affects the granularity at which page permissions can be configured. I have some simple code to detect buffer overruns that places allocations on one page and marks the next page invalid, which means it needs to know the page size. If the page size is larger than the program expects, it ends up marking the page containing the allocation invalid, which immediately crashes. And my logic for detecting the page size is… uhh…
https://github.com/malxau/yori/blob/f982c24fd31439a7a108e79561c41f9393e30c65/lib/osver.c#L638
Gosh, I’m glad that’s worked for you. I tried it on PowerPC before (which supports configurable page sizes) and the result didn’t boot. I think though we’re starting to see why…
I was looking at PE with the same questions. Each executable contains multiple sections, and sections need different page permissions. But the problem is that the relative alignment within the executable is fixed – the read only data is at a specified offset relative to the start of the code, etc. The compiled code is expecting to find its global variables at that location. Increasing the page size requires the layout of the image in memory to change, which requires relocations that aren’t compiled into the binary.
With PE, the file on disk is 512 byte aligned, so increasing the in-memory alignment to 16Kb doesn’t take any extra disk space. It would force any 4Kb system to “space out” sections in memory though. There doesn’t seem to be any serious drawback to using a 64Kb (or whatever) section alignment, which should still work with a 4Kb page size.
malxau,
Yes, I can see that being the case if you hardcode the page size like that. If you are writing kernel code, hard coding might be ok, but normal linux software is meant to call “sysconf(_SC_PAGE_SIZE)” as documented in the man pages. This complies with the POSIX.1 standard.
https://www.man7.org/linux/man-pages/man3/sysconf.3.html
I’d expect for Android software to be using the same syscalls, but admittedly it’s an assumption on my part.
I don’t know the specifics of PE file relocation, though what you say makes sense. On unix the mmap call isn’t even obligated to honor the input address, so I suspect the linux loader is tolerant of that. Maybe things are platform specific. I might need to look closely at the code to get a clear answer.
malxau,
For fun, I checked some old code of mine to see how it handled page sizes. It used guard pages too! The guard page doesn’t really need to be allocated, though; committing it could waste a lot of memory with huge page sizes!
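Something like this is what I mean (a minimal POSIX sketch, not my actual old code). Note the guard is PROT_NONE, so the kernel never has to back it with physical memory:

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Place an allocation so it ends flush against a PROT_NONE guard
 * page; any overrun past the end faults immediately. */
static void *guarded_alloc(size_t size) {
    long page = sysconf(_SC_PAGE_SIZE);          /* don't assume 4 KB */
    size_t data = (size + page - 1) & ~((size_t)page - 1);

    unsigned char *base = mmap(NULL, data + page, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Revoke all access to the final page: that's the guard. */
    if (mprotect(base + data, page, PROT_NONE) != 0) {
        munmap(base, data + page);
        return NULL;
    }
    return base + data - size;   /* buffer ends at the guard boundary */
}

int main(void) {
    char *buf = guarded_alloc(100);
    if (!buf)
        return 1;
    memset(buf, 0, 100);   /* fine */
    buf[100] = 'x';        /* lands on the guard page: SIGSEGV */
    return 0;
}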
Here are some more links down the rabbit hole, haha..
https://dram.page/p/relative-relocs-explained/
https://medium.com/@boutnaru/linux-security-aslr-in-statically-linked-elfs-55556d13adc
I’m not sure if all architectures are able to handle position independent executables. A dynamic linker is normally able to relocate sections, but maybe not all code is compiled to be relocatable.
For the record, normal Windows software is meant to call GetSystemInfo. I doubt I’m the only one to not call it though, because I don’t think we’ve ever seen a page size change on an existing architecture before.
The Windows kernel exposed a PAGE_SIZE constant, but… do you want to be able to load older drivers on newer CPUs? I think the answer is “yes.” If it becomes common to change something like a page size, we’ll need quite a bit of ecosystem adjustment to work with it. Even this thread is about supporting 16Kb, not supporting arbitrary values or supporting future changes.
What I’m saying/seeing is the code can be relocated but expects to be together – ASLR implies loading things anywhere, but still keeping relative offsets unchanged. This also seems to be explicitly called out in the article you linked to:
Supporting arbitrary page sizes implies that any global variable access needs to understand a run-time defined offset to the beginning of the global variable area, and that offset can’t just be a global variable for obvious reasons. The normal way relocations work is to have a table of pointers whose values can change, but relocating sections means the location of that table of pointers needs to be able to change – it’s an extra layer of relocation. ASLR implies that location can’t be well known, but also can’t be at the same relative location. There’s probably some cute solution here but I’m not immediately seeing it.
This would have been easier in a segmented architecture like DOS, where code is relative to CS and data is relative to DS, so the loader can position them independently and the code is identical. Flat memory is simpler, but…well, it’s simpler.
In other words, my hardcoding of the page size doesn’t seem that uncommon or unusual, since it looks like the executable formats and run time loaders of the world did the same 😉
malxau,
Yes, I do concede that it’s possible developers won’t use the page size function. I just don’t think it’s that common for application developers to write their own paging code rather than use a standard library.
I get that. Things are a bit different on the linux side due to the fact that every module needs to be recompiled for every kernel build anyway.
Yes, I read that too. However, after way too much testing, the article does not seem to agree with what linux is doing, or maybe it’s just glossing over the details. The main ELF binary’s data & code are randomized to one address; the heap is randomized to another. So far so good. But I’m also finding that shared library code & data are randomized to yet another address that is NOT at a constant offset from the main code. All the shared libraries have a common offset with respect to each other, but not with respect to the main binary. To be honest though, I have no idea why it works this way. Why doesn’t ASLR take advantage of randomizing the offsets between shared libraries too?
Testing shows that the offset between shared libraries is deterministic and constant on subsequent runs. However if I replace one of the shared libraries with a larger version of itself, the loader is forced to make more space for it. This effectively bumps the addresses of all subsequent shared library sections. The relative offsets between libraries before and after this change are therefore different. It makes sense to me why this has to work, but then why doesn’t ASLR randomize it? My searches failed to come up with an answer.
(I’ve read that ASLR on windows and linux are different, so maybe nothing I’m saying applies to windows).
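A quick way to see it (a sketch for x86-64 linux with glibc; build with gcc -pie -fPIE test.c -lm and run it a few times):

#include <math.h>
#include <stdio.h>

int main(void) {
    /* One address from the main binary, one from libc, one from libm. */
    printf("main   (main binary): %p\n", (void *)main);
    printf("printf (libc)       : %p\n", (void *)printf);
    printf("cos    (libm)       : %p\n", (void *)cos);
    return 0;
}

Across runs, the printf and cos addresses keep a constant distance from each other while main lands somewhere unrelated, which matches what I described above.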
I don’t know the exact mechanics used for -pie under the hood. But my understanding is that the addresses are known to the dynamic linker. It knows not only where the binaries are being randomly positioned, but also the compile-time symbol offsets into each binary. Adding these two values should give you the final memory address. I’m not exactly sure how -pie conveys this information to the code, but it must come from the dynamic linker. I’d have to study this more to fill the gaps in my knowledge 🙂
Even though ELF files have provisions for the compiler to assign addresses, I think they’re treated more as hints. Obviously if they overlap or whatever the dynamic linker is going to place them where it wants to. Without -pie this is easier to comprehend (or is at least what I am more familiar with).
malxau,
BTW, what an awesome discussion! Thank you 🙂
I think this is confusing shared libraries with sections. What I mean by section is “individual part of a binary that requires specific page permissions.”
Running “link /dump /headers msvcr80.dll” as an example:
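(abbreviated to the section names and virtual addresses; the numbers are representative, shown 4Kb-aligned as they appear in real binaries)

SECTION HEADER #1
   .text name
    1000 virtual address
SECTION HEADER #2
   .rdata name
   5B000 virtual address
SECTION HEADER #3
   .data name
   74000 virtual address
SECTION HEADER #4
   .reloc name
   76000 virtual address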
This is showing a .data section (will be on RW pages), a .rdata section (will be on RO pages), a .reloc section (for relocation information), and a .text section (will be on executable pages.) Each of these things needs to be on a page size boundary, because they get different page permissions. Each section is given a virtual address, which is really relative to the base address that the DLL loads at (it’s not absolute.)
Here, it’s easy to see that this binary implicitly assumes 4Kb pages, because all of the virtual addresses are 4Kb aligned. Loading this on a 16Kb page size means the virtual addresses of each section need to change relative to each other – not across shared libraries, but within a single shared library.
Normal relocation information has a fixed virtual address, in this case 0x76000 from the start of the DLL. But if the page size were 16Kb, it can’t be at that offset, so now the challenge is we need some relocation information that tells us where the relocation information is, etc.
That’s why the comment here says all native Android code needs to be relinked with 16Kb alignment to work. But it’s also why those binaries still won’t work on a 64Kb page size machine, etc. Nobody seems to be discussing how to build an executable that can run on an arbitrary page size, just realigning one to work at 16Kb. PE is actually nice here though, because the virtual address layout is not tied to the file layout, so internally aligning at 256Kb or somesuch is quite feasible, but it still requires all binaries in the universe to be relinked.
That’s why having APIs to query page size seems a bit…superfluous. Executables are already hardcoding page size assumptions into their layout. The page size can only be successfully queried if the executable can load, and it can only load if it is a multiple of the system’s real page size. In today’s world, it really only loads if the executable matches the system’s page size.
malxau,
Yes, I understand that.
I was not positive this was the case, but it does look like you are right. It would be possible to make these relocatable, but it doesn’t seem to be done, at least not by default.
In theory it’s the same process that adjusts the addresses between shared libraries. It’s a matter of getting the compiler to generate a list of addresses that need to be adjusted. For example, here is PIE code that loads and saves a global variable relative to RIP (really sorry about the formatting):
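(objdump-style listing; “counter” is a made-up global at a made-up address)

  4011e7: 8b 05 13 2e 00 00    mov 0x2e13(%rip),%eax    # 404000 <counter>
  4011ed: 05 01 00 00 00       add $0x1,%eax
  4011f2: 89 05 08 2e 00 00    mov %eax,0x2e08(%rip)    # 404000 <counter>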
A relocation table with addresses 0x4011e7 and 0x4011f2 would be all that’s strictly needed to adjust the relative offsets. The loader can see which section the address points to and add in a new offset corresponding to space that needs to be added between sections.
I disagree with this; it still seems like the right thing to do per the spec.
I’d like to do more tests to learn more about the implementation. Technically it should be doable if the compiler and loader support it. I guess the main question is whether it’s worth doing.
(Sorry.)
What I’m struggling with is, “where is that relocation table?” and “how does the code find it?”
It can’t be relative to rip if the “gap” between the code and relocation information can change.
It’s undesirable to have it at a fixed address.
Maybe it’s possible to guarantee all modules are loaded at a certain address alignment, allowing code to find the “beginning” of the module via some kind of “& 0xFFFF00000000” type thing, and from there, find a global relocation. That seems to both limit module size and lessen the effectiveness of ASLR though.
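A sketch of that idea (purely hypothetical; no real loader guarantees this alignment, and it assumes a 64-bit address space):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical scheme: if every module were loaded at a 4Gb-aligned
 * base, code could recover its own module base by masking its own
 * address, then find relocation data at a known header offset. */
#define MODULE_ALIGN ((uintptr_t)1 << 32)

static void where_am_i(void) {
    uintptr_t pc   = (uintptr_t)where_am_i;
    uintptr_t base = pc & ~(MODULE_ALIGN - 1);   /* the "& 0xFFFF00000000" idea */
    printf("code at %#jx, module base %#jx\n",
           (uintmax_t)pc, (uintmax_t)base);
}

int main(void) { where_am_i(); return 0; }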
malxau,
Haha, I don’t know. Maybe an existing ELF relocation structure could be used, but I’m not familiar enough with the implementation to say for sure. I did not see evidence that relocating individual sections actually works; rather, I was making deductions about how it would have to work.
Why not?
Say there are two sections, aligned to 4k, with code in section 2 referencing an address in section 1. Note I don’t know the actual mov opcode length; I use 6 bytes as an example…
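(made-up addresses and encodings, just to make the mechanics concrete)

  section 1 (.data) at 0x401000: variable X lives here
  section 2 (.text) at 0x402000:
    402000: 8b 05 fa ef ff ff    mov -0x1006(%rip),%eax    # 401000 <X>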
Now say we need to realign the sections to 1M. Despite the sections moving and the addressing being RIP-relative, the adjustment is straightforward: the distance increased by 0xff000, and all the offsets can be adjusted accordingly:
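  section 1 (.data) stays at 0x401000
  section 2 (.text) moves to 0x501000:
    501000: 8b 05 fa ff ef ff    mov -0x100006(%rip),%eax    # 401000 <X>

The displacement field grew by exactly that 0xff000; nothing else in the instruction changed.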
Hopefully I didn’t misunderstand you, but did my example make sense? I think it solves the issue for page alignments (assuming the compiler produces the location of the offsets). I assume that every process on the system would be using the same page size. This would be important because in order for multiple processes to share the same code memory they would obviously need to share the same offsets in the code. If the offsets were different in every process, then we wouldn’t be able to share the code memory between them.
Hypothetically then, with respect to ASLR, we could randomize the section addresses too on initialization, but the random offsets between sections would have to become finalized once loaded. We couldn’t randomize the section addresses again without invalidating the offsets in other processes.
I think that’s where things are going, with two caveats.
First, note this supports page sizes that are 1Mb or an even power of 2 less than 1Mb. That is, it doesn’t support arbitrary page sizes; it supports a range of sizes up to a maximum. It sounds like what’s being proposed is to do this with 16Kb alignment, which works for 16Kb pages and 4Kb pages, but not 64Kb pages. I’d claim this support is insufficient to be future-proof, and we’ll end up with another ABI incompatibility in a decade or so.
Second, the “vibe” I’m getting, although it doesn’t seem explicitly stated and I’m not certain, is that ELF is physically laying out the file with those 16Kb gaps. That’s why they’re not jumping to 64Kb or 1Mb, because there’s a disk space cost to making that number larger. In PE this should work with no additional disk space, although I’m hesitant to do it because it means testing these new binaries with all previous versions of the PE loader to check that they will operate correctly.
malxau,
That’s my understanding as well. I think arbitrary sizes would be possible as discussed, but that’s not what they are doing.
You’re right that linux binaries have an awful lot of empty bytes, which will get worse with larger pages like 512k. This won’t be too noticeable with larger binaries (pages on average will be more full), but a small 20k program might still require a few 512k pages: with, say, three sections needing distinct permissions, that’s 1.5M of address space for 20k of program. That’s a lot of overhead 🙁
As if Android didn’t already waste too much memory, this will make it even worse… 🙁
Minuous,
It’s true. Garbage collected languages tend to use ~3X more memory than unmanaged languages, depending on circumstances of course. But in this day and age I personally think switching to 16kB pages is justified…
https://developer.android.com/guide/practices/page-sizes#benefits
16 KB pages may not increase RAM usage that much. But performance, especially for matrix calculations (think AI…), will be improved.