“Ulrich Drepper [the GNU libc project leader] recently approached us [LWN] asking if we would be interested in publishing a lengthy document he had written on how memory and software interact. Memory usage is often the determining factor in how software performs, but good information on how to avoid memory bottlenecks is hard to find. This article is the first in a series of articles (the original has over 100 pages) that will be published on LWN weekly. Once the entire series is out, Ulrich will be releasing the full text.”
Title should read “What Every GNU/Linux Programmer Should Know About Memory, Part 1”
Not really. Even if the author says he is only interested in Linux, the first part is mostly concerned with hardware principles, which are general enough to be useful for other people.
If we’re going to try to be pedantic, that should read “What Every Glibc-using Programmer Should Know About Memory, Part 1”. Glibc runs on at least four different kernels, at my last count.
I’ve read the first two installments, and they are both quite generic. Nothing at all glibc-specific, or even close. As someone who uses glibc, I’m anxiously *waiting* for the more specific segments!
Yes, I wouldn’t expect it to be Glibc specific. I was just out-pedenting a pedent.
Not to be pedantic, but that would be “out-pedanting a pedant”.
But you know that, since you got it right in your first post. 🙂
I suggest the title “Details that no programmer needs to know about memory” instead. Hopefully the later parts will cover useful topics.
… I try to play memory games to increase my memory, so that I do not have memory leaks or overflows.
It’s not just that, but managing memory while realising there are limitations on the hardware front: the size of the FSB, the memory latencies, etc. It’s like context switching on x86 – too much of it can be a performance hit.
I can’t wait for the rest of the paper to be published. This was a very interesting read. Memory management is essential to any good programmer; it’s what separates the hard-core coder from the weekend enthusiasts who slap stuff together in .NET and call themselves a “programmer”.
Definitely a great read!
I would like to mention that assembler coding makes you think quite a lot about memory latencies, throughput, and processor cycles.
I have not had much time for assembly lately, but learning it once gave me some insight into the difference between code performance and algorithm performance. As a rule of thumb: try to get the best algorithm in C or F90 working before even considering reprogramming the time-critical subroutines in assembler.
“As a rule of thumb: try to get the best algorithm in C or F90 working before even considering reprogramming the time-critical subroutines in assembler.”
It seems that this rule is not very well known, because we have too many resources these days: CPUs that are too fast, too much RAM, hard disks that are too big… who cares about efficient programming anyway? 🙂
I really enjoyed the article. Very interesting content, presented in an educationally valuable way. Worth having a printed copy on the system shelf.
Or you could separate the serious .NET programmer from the hobbyist one. Using Java or .NET doesn’t mean that you have no control over the underlying system. Memory is still accessed in the same cached manner, with the same latencies, as in native code. If you know how to structure your data (prefer contiguous allocation over scattered data; try to keep data that’s used together close in memory), you get the same benefits from this knowledge in a managed language as in an unmanaged one.
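To make that concrete, here is a minimal C sketch (the names and sizes are mine, not from the article) comparing a contiguous array against the same values boxed behind one allocation per item – the second pattern is roughly what a naive object graph degenerates into, managed language or not:

    #include <stdio.h>
    #include <stdlib.h>

    #define N (1u << 20)

    int main(void)
    {
        /* Contiguous: one allocation, sequential access. */
        int *flat = malloc(N * sizeof *flat);
        /* Scattered: one small allocation per item. */
        int **boxed = malloc(N * sizeof *boxed);
        if (!flat || !boxed)
            return 1;

        for (size_t i = 0; i < N; i++) {
            flat[i] = (int)i;
            boxed[i] = malloc(sizeof **boxed);
            *boxed[i] = (int)i;
        }

        long sum_flat = 0, sum_boxed = 0;
        for (size_t i = 0; i < N; i++)
            sum_flat += flat[i];     /* streams through consecutive cache lines */
        for (size_t i = 0; i < N; i++)
            sum_boxed += *boxed[i];  /* chases a pointer per item: far more misses */

        printf("%ld %ld\n", sum_flat, sum_boxed);

        for (size_t i = 0; i < N; i++)
            free(boxed[i]);
        free(boxed);
        free(flat);
        return 0;
    }

Both loops compute the same sum; the difference you would time between them is purely the memory-layout effect described above.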
A GCed language has one major advantage though: with a compacting collector like the CLR’s generational GC, the runtime will automatically pack surviving small items together upon collection, so you get better cache locality between items automatically, without writing your own custom allocator in C or C++.
Snarky comments against managed runtimes do not demonstrate that you are elite. Just like with any language system, there are skilled tweakers who spend time squeezing performance out of .NET and there are people who just want to solve a one-off problem or play around on the weekends.
already sent it to the Firefox team…
Notabug. WontAdmit. WontFix. 😉
Send it to the OpenOffice team, to be honest. Firefox isn’t a speed demon, but it doesn’t make me wish I still had hair to pull. OpenOffice does.
Extremely opinionated, biased pro-Intel anti-AMD, mostly correct but in some cases deceptively wrong.
Wrong: read a few paragraphs and you start seeing comments like “this setup will introduce a NUMA architecture and its negative effects”. He also likes talking about FSB speeds to memory – an FSB runs CPU-to-northbridge, a memory bus runs northbridge-to-RAM, and modern NUMA systems clock the two buses independently. (It is useful to use different speeds – “spread spectrum” – because memory and I/O devices have different clocking requirements.)
Biases: for NUMA, many people consider it a good thing because NUMA scales much better than shared-FSB designs. He touts FB-DIMMs as superior to DDR3 without mentioning that the buffering in FB-DIMMs increases latency – in reality, FB-DIMMs do better than DDR on Intel chips with fat caches, and *cheaper* DDR chips do better than FB-DIMMs on AMD chips with more sensitivity to latency.
There are a lot of real engineering tradeoffs here; Ulrich isn’t describing the tradeoff, he’s just saying “A is better” without justification. Frankly, this piece is so biased towards Intel’s choices that it looks like it was commissioned by Intel PR.
Then write a comprehensive series of articles that are of higher quality and publish them. We’d be glad to critique them for you.
He has asked for feedback, so why not write to him?
I agree with your points. You could also add that the buffers in FB-DIMMs use too much power, so FB-DIMMs are going to die quite soon, at least at Intel. Sun is using FB-DIMMs in their new computers; they need the bandwidth to feed their multicore processors.
Although I felt some of his statements were opinion pieces rather than fact, this was the most interesting article I’ve seen on here in a long time!
Thank you!
…memory is free. Use as much of it as you want… and don’t bother freeing it up. It will free up when your application stops.
I AM KIDDING!
Ever the optimist! Huh, Tuishimi?
Well, it’ll free up on reboot, anyway. 😉
Haha, what an article: “In the early days computers were much simpler.” Well, that article is basically not relevant to me, since the systems I mainly program on (when not at work) are the ones they are talking about “in the early days”.
I was thinking: if addressing is such a huge cost, why not avoid it entirely?
What if the memory module just cycled through its rows as fast as it could, letting any interested device do its reading or writing at the right time?
Waiting for the memory to cycle through would be murder on latency though. But in an era of multicores and parallel execution maybe the right blend of multitasking and ingenious algorithms for memory allocation in the OS could make it viable.
What do you think?
Memory remains a huge bottleneck in systems, yet coders have been moving toward larger and larger routines to do the same thing – in the name of speeding things up.
About a year ago a coworker had some code that ran almost instantly when built with Borland C, but took about a second per iteration when built with GCC – which, over a few hundred iterations, dragged a realtime program down to unusable – and he couldn’t figure out why. GCC makes notoriously slow code (the price of supporting multiple targets), but this was above and beyond the norm.
Even stranger, after some playing he found that turning compiler optimizations OFF made it run even faster… that’s when he got hold of me and I dragged out the disassembler.
With optimizations on, GCC was compiling what should have been a simple loop with three memory references (which could have been stored in registers) into code built around a single MMX opcode.
The problem was that the overhead of setting up for that MMX opcode involved allocating 128 bytes of memory for two matrices. If he had been performing the same general operation back to back, the MMX version would have been faster because it unrolled the loop, but the setup overhead not only kept the code from fitting in the L1 cache, it didn’t even get cache hits because each iteration was too different. Basically it allocated 128 bytes of memory each pass (and released it each pass) and used about 512 bytes of machine language for what we were able to quickly rewrite as maybe ten lines of ASM (some 15–20 bytes of code), entirely using registers inside the loop and only passing three dword values on the stack.
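Purely as illustration – this is not the actual code from that project, just a hypothetical C sketch of the shape being described: three dword-sized values passed in, and a counted loop whose entire working state can live in registers, with no allocation or memory traffic inside the loop body:

    #include <stdio.h>

    /* Hypothetical sketch only: three dword inputs; the loop state is in
     * locals the compiler can keep in registers, with no loads or stores
     * inside the loop. */
    static unsigned tight_loop(unsigned a, unsigned b, unsigned count)
    {
        unsigned acc = a;          /* accumulator stays in a register */
        while (count--)            /* simple counted loop */
            acc = acc * b + 1;     /* arithmetic only: no memory references */
        return acc;
    }

    int main(void)
    {
        printf("%u\n", tight_loop(1, 3, 100));
        return 0;
    }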
Programmers, especially those writing compilers, seem to have forgotten one of the most basic rules of writing code – the less code you use, the faster the program and the less code there is to BREAK. You’ll hear the argument repeatedly that many of these new techniques are faster and the older, smaller code is slower – and it’s utter and total bull MOST (but not all) of the time. You see this attitude in almost all forms of programming these days, where one way of doing things is completely thrown out in favor of another, the new method amounting to trying to shove a square peg into a round hole.
When the choice is between small, tight code and code that multiplies memory accesses by a factor of ten, adds memory accesses inside the loop, takes cache misses, and doesn’t even fit the code and values inside the L1 cache, I know which way I lean.
Minimalist code can often take the memory-bus concerns inside loops out of the equation – especially if you arrange your code to make use of the pipelining capabilities of newer processors… because while the CPU is off grabbing memory, you can keep it executing other work.
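As a rough sketch of that overlap idea, assuming GCC or Clang (__builtin_prefetch is a real builtin in both; the function name and the prefetch distance of eight elements are my own guesses and would need tuning per machine):

    #include <stddef.h>
    #include <stdio.h>

    /* Sketch: request a line that will be needed soon, then keep the CPU
     * busy with the current element while that fetch is in flight. */
    static long sum_with_prefetch(const long *data, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&data[i + 8]);  /* start the fetch early */
            sum += data[i];                        /* compute in the meantime */
        }
        return sum;
    }

    int main(void)
    {
        long data[64];
        for (size_t i = 0; i < 64; i++)
            data[i] = (long)i;
        printf("%ld\n", sum_with_prefetch(data, 64));
        return 0;
    }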