DirectStorage API now available on Windows

Thom Holwerda 2022-03-14 Windows 13 Comments

Starting today, Windows games can ship with DirectStorage. This public SDK release begins a new era of fast load times and detailed worlds in PC games by allowing developers to more fully utilize the speed of the latest storage devices. In September 2020, we announced DirectStorage would be coming to Windows, and after collecting feedback throughout our developer preview, we are making this API available to all of our partners to ship with their games. Check out the announcement blog for an in-depth exploration of the inspiration for DirectStorage and how it will benefit Windows games.

This technology brings the fast storage features of the Playstation 5 and Xbox Series X/S to Windows gaming. I’m curious to see if this feature can make its way to Linux, but I wonder how e.g. games running through Proton would possibly make use of it.

About The Author

Thom Holwerda

Follow me on Mastodon @[email protected]

13 Comments

2022-03-14 11:03 pm
sukru
This is just but one piece in the new generation of game design. Another important one include sampler feedback streaming which allows targeting only required asserts to be loaded into the GPU memory.
https://www.tweaktown.com/news/84845/forspoken-will-tap-gpu-power-to-unlock-full-nvme-speeds-on-pc/index.html
Together, once all the pieces fit into place, “loading” would be a thing of the past (or at least assets like textures).
This would be similar to how we load ELF files in Linux. The kernel does not actually “load” the binary, but instead just mmaps it into the new process space: https://www.bodunhu.com/blog/posts/program-loading-and-memory-mapping-in-linux/
That would be all around beneficial. NVMe will essentially become an “L4” cache for the GPU. However that requires significant code changes to existing game engines. That is why many DirectX 12 titles will currently run slower than DirectX 11 parts, which has more “hand holding”.
These being said, computer graphics is not my primary area. Please feel free to correct / amend what I said.

2022-03-15 3:39 am
Alfman verbose=1
sukru,
This is just but one piece in the new generation of game design. Another important one include sampler feedback streaming which allows targeting only required asserts to be loaded into the GPU memory.
https://www.tweaktown.com/news/84845/forspoken-will-tap-gpu-power-to-unlock-full-nvme-speeds-on-pc/index.html
Together, once all the pieces fit into place, “loading” would be a thing of the past (or at least assets like textures).
This would be similar to how we load ELF files in Linux. The kernel does not actually “load” the binary, but instead just mmaps it into the new process space: https://www.bodunhu.com/blog/posts/program-loading-and-memory-mapping-in-linux/
Nvidia’s unified memory already supports “swapping” between GPU and system RAM today. So you can allocate much more ram for the GPU than is physically available on the GPU. But since you mention memory mapped files specifically, I don’t think that’s going to be supported soon.
Maybe this is just nvidia specific, but their implementation of unified memory requires the memory to be pinned on the host. One reason to do this is that the GPU cannot simply access physical memory directly, it needs to map pages from logical to physical yet it doesn’t read the CPU’s page tables to do so. It uses it’s own map. It is imperative to keep these maps in sync and naturally the easiest way to stop any possibility of getting out of sync and accessing the wrong ram is to simply pin the memory and make the logical->physical address map permanent for the duration of a page allocation. This renders swapping/memory mapped files infeasible.
In principal it could be possible. The GPU would fault in a page and then cascade the fault into the CPU kernel’s page fault handler, thereby reading from disk. But I suspect it would involve some fairly tight kernel coupling and due to the lack of ABI stability on linux could result in more frequent driver breakages.
I suspect we’ll see an RTX-IO path first, even though I kind of prefer your idea to use memory mapped files. They’re very similar, but memory mapped files would make better use of the operating system facilities in an intuitive way. Maybe we’ll see it some day.

2022-03-15 8:42 pm
sukru
Alfman,
I am not sure how nvidia’s propriety solution will fit in all these. As far as I know AMD holds some patents in this, so does Microsoft on some of the key parts. And Intel supports Direct Storage with Sample Feedback Streaming:
https://youtu.be/VDDbrfZucpQ?t=940
(This video describes how it works, much better than I can)
They show a scene which can run with only 230MB texture RAM, while using 350 GB of source data (0.06% resident mem usage).
I am guessing / hoping, they will come up with some amicable cross-licensing agreement. And games will have parity across AMD and nVidia cards. Xbox Series already supports this (it is a hybrid SoC with some Microsoft tech).
The only holdout would be the PS5, which does not support several next-gen features. But that unfortunately comes into “console wars” territory, and technical discussions are not handled well by the fans: https://www.reddit.com/r/PS5/comments/hue8hc/can_sampler_feedback_streaming_actually_make_up/

2022-03-17 1:12 pm
Xanady Asem
It’s just the decompression stage being moved from the CPU to the GPU, which sort of makes more sense, since you eliminate the CPU middleman when moving assets into the GPU.
I wouldn’t say the NVMe becomes an L4 cache, as much as it just allows more efficient DMA transfers between PCIe devices.
Loading times will still be there since decompression of large assets will always induce some latency. But they will be severely reduced.

2022-03-14 11:56 pm
ssokolow (Hey, OSNews, U2F/WebAuthn is broken on Firefox!)
but I wonder how e.g. games running through Proton would possibly make use of it.
Probably the same way they’ve historically gotten faster disk I/O out of EXT2/3/4 than NTFS. Typically with these sorts of things, the code to set up the transfer has to go through the translation layer and then the transfer itself proceeds at native speeds because it’s still just DMA or whatever on the hardware side and still just a chunk of memory with data in it on the application side.
The reason things like esync and fsync need improvements to the Linux kernel is that Windows was ahead of Linux on certain kinds of ways to batch up a bunch of little requests to wait on something into a single syscall.

2022-03-15 12:53 am
Alfman verbose=1
ssokolow,
Probably the same way they’ve historically gotten faster disk I/O out of EXT2/3/4 than NTFS. Typically with these sorts of things, the code to set up the transfer has to go through the translation layer and then the transfer itself proceeds at native speeds because it’s still just DMA or whatever on the hardware side and still just a chunk of memory with data in it on the application side.
I don’t think it would take that long to rig something up assuming nvidia provided a scatter/gather API,
After all it’s trivial to get a file’s sectors…
/sbin/hdparm –fibmap /usr/bin/krita
byte_offset begin_LBA end_LBA sectors
0 44040192 44091183 50992
Although there could be major security implications if a non-root program could do this since it effectively bypasses conventional FS security. Therefor it might require special root helpers. Also have to watch out for race conditions.

2022-03-15 12:27 am
Alfman verbose=1
Thom Holwerda,
This technology brings the fast storage features of the Playstation 5 and Xbox Series X/S to Windows gaming. I’m curious to see if this feature can make its way to Linux, but I wonder how e.g. games running through Proton would possibly make use of it.
That is unclear. Nvidia calls this feature “RTX IO”…
https://samagame.com/blog/en/nvidias-rtx-io-will-give-pc-capabilities-comparable-to-ps5-ssds/
Nvidia has designed a new I / O architecture, called RTX IO, which will allow you to achieve effects on PCs comparable to those that are to be possible thanks to a modern SSD disk in the PlayStation 5 console (and a special I / O controller developed by Sony for its needs).
Technology will allow you to decompress and load files from the NVMe SSD drive directly to the GPU memory – bypassing the CPU. As a result, the central unit will be loaded up to twenty times less, and the efficiency of data loading can be even a hundred times higher than in the case of traditional disks and old interfaces.
But despite searching I haven’t seen any linux commitments from nvidia about it. Even once linux GPU drivers support it, it will probably take a very long time to see applications make use of it.
I am pretty sure that linux can already support the GPU based decompression via CUDA. The one piece that’s missing is the DMA operation, meaning that it has to be loaded into RAM before the GPU.
NVME -> GPU
NVME -> RAM -> GPU
There may actually be scenarios when it is better NOT to use the RTX IO direct path. and to deliberately load things into RAM. After all, RAM makes for extremely fast cache, about a magnitude faster even than NVME.
Ideally games would just opportunistically preload content right before it is needed instead of when it’s needed. This is often feasible but it requires developers to implement game-specific prefetch logic. Most devs aren’t going to do that.

2022-03-15 12:44 pm
th22
[quote]
Ideally games would just opportunistically preload content right before it is needed instead of when it’s needed. This is often feasible but it requires developers to implement game-specific prefetch logic. Most devs aren’t going to do that.
[/quote]
Is this not how it is working now? There seems to be a reason they developed the new tech to load it directly into the gpu ram.
Can you name such a scenario where it is better to not load it directly into the gpu ram? What would be the benefit of it?
how do I quote correctly btw?

2022-03-15 1:10 pm
Alfman verbose=1
th22,
Is this not how it is working now? There seems to be a reason they developed the new tech to load it directly into the gpu ram.
To clarify, what I meant was preloading resources before they need to be used/rendered and not wait for the moment they need to be used before loading them in.
For example some games might have a world boundary or boss where the game pauses for a brief moment before loading in. If game programmers can predict the resources that will be needed a few seconds before they are actually needed, then those resources can appear in game instantly without delay.
This is hard to solve generically because a game engine doesn’t necessarily know what that the game will need to spawn in new resources/bosses when user enters a room for example.
Two simpler strategies are for games to load everything up front with a loading screen, or just load everything on demand causing brief interruptions and/or incomplete models during the game (kind of like the way minecraft world chunks that are missing suddenly pop into view).
Can you name such a scenario where it is better to not load it directly into the gpu ram? What would be the benefit of it?
If you’ve got enough GPU ram, there would be no benefit at all. But if there isn’t enough GPU ram then regular RAM is the next best thing.
The GPU needs to hold uncompressed textures (and possibly mipmaps in addition to that), but since the system RAM doesn’t have to actually use the textures it can actually hold the compressed versions. So as a fictitious example: 8GB of system ram might hold the equivalent of 16GB of GPU texture memory.
how do I quote correctly btw?
<blockquote> Quote me </blockquote>

2022-03-15 10:01 pm
sukru
[once again from the other thread above]
https://youtu.be/VDDbrfZucpQ?t=940
The GPU tells which exact resources and mip-map levels are required in the scene. They mention this could be done with at most a single frame delay, which implies it can be done during rendering of the scene itself.
Each scene has multiple passes. So the first pass will just give a list of all textures and mip levels needed (along with shader quality hints for effects performance). And then the CPU can orchestrate passing those resources onto the GPU RAM. (I am not sure whether there is any solution that bypasses the CPU, but it can obviously bypass main RAM with DMA).
Worst case, one scene will be rendered with low quality resources. If you preload a 1×1 version for all textures, there will be nothing “empty” in the scene. In practice it will be dropping only one level of quality, though. (512×512 instead of 1024×1024, etc).
Another thread will garbage collect those resources. If done correctly, VRAM will only act as a cache, with practically “unlimited” texture RAM offloaded to the NVME (as long as they can be retrieved fast enough).

2022-03-15 10:11 pm
sukru
Forgot to mention: NVME bandwidth has practical limits about 1.25GB/s to 4GB/s (which is from the limit of PCIe).
For a 60 frames per second title, that means about 20MB/s being reliably loaded each frame. Or using about 5 seconds to fill an entire 8GB RAM.
So, this is not a “miracle cure”, but a very good tool to make better use of the existing hardware.
2022-03-16 1:23 am
Alfman verbose=1
sukru,
The GPU tells which exact resources and mip-map levels are required in the scene. They mention this could be done with at most a single frame delay, which implies it can be done during rendering of the scene itself.
Interesting. It doesn’t even seem like a complex feature to add, but of course it’s up to nvidia (and others) to add it to vulkan and/or opengl.
Each scene has multiple passes. So the first pass will just give a list of all textures and mip levels needed (along with shader quality hints for effects performance). And then the CPU can orchestrate passing those resources onto the GPU RAM. (I am not sure whether there is any solution that bypasses the CPU, but it can obviously bypass main RAM with DMA).
I don’t know why two passes are needed? The render could just take note whenever a texture could have been used. I think they said the loaded texture doesn’t get loaded in immediately anyways (they use a lower quality texture while the high quality texture loads). If this were a software render this would be easy to add because we control the render process. In DX/OpenGL features are more hard coded.
I’ve been using Blender and it supports procedural texture shaders and vertex shaders. A talented artist can create amazing shaders that can produce very good results procedurally without requiring any memory for textures. You can procedurally generate things like wood, cement, carpet, etc and do so with far less RAM than would be needed for high quality texture mapping. A side benefit is that procedural algorithms can provide infinite variety whereas textures that repeat can ruin the authenticity of a scene. Procedural generation of leaves, trees, etc makes for impressive results without looking repetitive.
I think most games are built by 3d designers who are more familiar with texture mapping. IMHO there’s a lot of merit in using more procedural in game assets. A 8GB game might fit in a few hundred MB without the textures, haha.
2022-03-17 1:12 pm
sukru
Alfman,
The feedback mechanism can also be used to optimize the procedural overhead (i.e.: shader quality). There was another technical video that described using lighting feedback for this. They would down-res parts that are dim or away from the eye and focus more on the highly visible parts.
Overall, this requires a lot of changes to existing game design ideas. DirectX 12 (and equivalent parts in Vulkan, and others) has much less hand holding and hard coded paths compared to before. (Might be a stretch, but the leap could be compared to moving from fixed function pipelines to custom shader ones).