I can't wait for the rest of the paper to be published. This was a very interesting read. Memory management is essential to any good programmer; it's what separates the hard-core coder from the weekend enthusiasts who slap stuff together in .NET and call themselves "programmers".
Definitely a great read!
I would like to mention that assembler coding makes you think about memory latencies, throughputs and processor cycles quite a lot.
I have not had much time for assembler lately, but learning it once gave me some insight into the difference between code performance and algorithm performance. As a rule of thumb: Try to get the best algorithm in C or F90 working before even considering reprogramming the time-critical subroutines in assembler.
"As a rule of thumb: Try to get the best algorithm in C or F90 working before even considering reprogramming the time-critical subroutines in assembler."
It seems that this rule is not very well known, because we have too many resources these days: CPUs that are too fast, too much RAM, hard disks that are too big... who cares about efficient programming anyway? :-)
I really enjoyed the article. Very interesting content, presented in an educationally valuable way. Worth having a printed copy on the system shelf.
Or you could separate the serious .NET programmer and the hobbyist one. Using Java or .NET doesn't mean that you have no control over the underlying system. Memory still is accessed in the cached manner with the same latencies as in native code. If you know how to structure your data (prefer contiguous allocation over spread data, try to keep data that's used together close in memory), you get the same benefits of this knowledge in a managed language as in an unmanaged one.
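To make the "contiguous over spread" point concrete, here is a minimal sketch in C. The `particle` and `node` types and both function names are made up for illustration; the same layout choice exists in a managed language (an array of value types versus a chain of individually allocated objects).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record type, used only for illustration. */
struct particle {
    float x, y, z;
    float mass;
};

/* Contiguous layout: one block of memory, neighbours share cache lines. */
float total_mass_array(const struct particle *p, size_t n)
{
    assert(p != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p[i].mass;            /* sequential walk through memory */
    return sum;
}

/* Spread layout: every element lives in its own node, linked by pointers. */
struct node {
    struct particle value;
    struct node *next;
};

float total_mass_list(const struct node *head)
{
    float sum = 0.0f;
    for (const struct node *it = head; it != NULL; it = it->next)
        sum += it->value.mass;       /* each hop may land on a cold cache line */
    return sum;
}
```

Both functions compute the same result; the difference is only in how many cache lines have to be fetched to get there, which is something to measure on the target machine rather than assume.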
A GCed language has one major advantage though: if you're using a compacting collector like the one in the CLR, the runtime will automatically move small surviving objects together upon collection, so you get better cache locality between items automatically without writing your own custom allocator in C or C++.
Snarky comments against managed runtimes do not demonstrate that you are elite. Just like with any language system, there are skilled tweakers who spend time squeezing performance out of .NET and there are people who just want to solve a one-off problem or play around on the weekends.
Extremely opinionated, biased pro-Intel anti-AMD, mostly correct but in some cases deceptively wrong.
Wrong: read a few paragraphs and you start seeing comments like "this setup will introduce a NUMA architecture and its negative effects". He also likes talking about FSB speeds to memory - an FSB is CPU-NB, a memory bus is NB-RAM, and modern NUMA systems clock the two buses independently. (It is useful to use different speeds - "spread spectrum" - because memory and I/O devices have different clocking requirements.)
Biases: for NUMA, many people consider it a good thing because NUMA scales much better than shared-FSB designs. He touts FB-DIMMs as superior to DDR3 without mentioning that the buffering in FB-DIMMs increases latency - in reality, FB-DIMMs do better than DDR on Intel chips with fat caches, and *cheaper* DDR chips do better than FB-DIMMs on AMD chips with more sensitivity to latency.
There are a lot of real engineering tradeoffs here; Ulrich isn't describing the tradeoff, he's just saying "A is better" without justification. Frankly, this piece is so biased towards Intel's choices that it looks like it was commissioned by Intel PR.
He has asked for feedback, so why not write to him?
I agree with your points. You could also add that the buffer in FB-DIMMs uses too much power, so FB-DIMMs are going to die quite soon, at least at Intel. Sun is using FB-DIMMs in their new computers; they need the bandwidth to feed their multicore machines.
I was thinking: if addressing is such a huge cost, why not avoid it entirely?
What if the memory modules just cycled through their rows as fast as they could, letting any interested device do its reading or writing at the right time?
Waiting for the memory to cycle through would be murder on latency though. But in an era of multicores and parallel execution maybe the right blend of multitasking and ingenious algorithms for memory allocation in the OS could make it viable.
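A back-of-the-envelope model of that latency cost, as a sketch only: the function names, the row count and the time per row are all assumptions, not figures for any real memory module. If the module sweeps its rows in a fixed order, a request has to wait until the sweep comes around to the row it wants.

```c
#include <assert.h>

/*
 * Toy model (all parameters are assumptions): the module visits rows
 * 0..rows-1 in order, over and over. A request for row `target` that
 * arrives while the sweep is at row `current` waits this many row times.
 */
unsigned wait_rows(unsigned current, unsigned target, unsigned rows)
{
    assert(rows > 0 && current < rows && target < rows);
    return (target + rows - current) % rows;
}

/* Mean wait in nanoseconds, averaged over every (current, target) pair. */
double mean_wait_ns(unsigned rows, double ns_per_row)
{
    assert(rows > 0);
    unsigned long long total = 0;
    for (unsigned c = 0; c < rows; c++)
        for (unsigned t = 0; t < rows; t++)
            total += wait_rows(c, t, rows);
    return (double)total / ((double)rows * rows) * ns_per_row;
}
```

With N rows the mean comes out at (N - 1) / 2 row times, so the average request waits about half a sweep. That is the cost the multitasking and allocation tricks would have to hide.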
What do you think?
Memory remains a huge bottleneck in systems, yet coders have been moving toward larger and larger routines to do the same thing - in the name of speeding things up.
About a year ago a coworker had some code that ran almost instantly when built with Borland C but took about a second per iteration with GCC, and he couldn't figure out why (over a few hundred iterations that dragged a realtime program down to unusable). GCC makes notoriously slow code (the price of multiple targets), but this was above and beyond the norm.
Even stranger, after some playing he found that with compiler optimizations turned OFF it ran even faster... that's when he got hold of me and I dragged out the disassembler.
With optimizations on, GCC was turning what should have been a simple LOOP, with three memory references that could have been held in registers, into code built around a single MMX opcode.
The problem was, the overhead to set up for that MMX opcode involved allocating 128 bytes of memory for two matrices... If he had been performing the same general operation back to back, the MMX version would have been faster because it unrolled the loop, but the overhead of setting up for it not only made the code not fit in the L1 cache, it also never got a cache hit because each iteration was too different. Basically it allocated 128 bytes of memory each pass (and released it each pass) and used about 512 bytes of machine language for what we were able to quickly rewrite as maybe ten lines of ASM (therein 15-20 bytes of code), entirely using registers inside the loop and only passing three dword values on the stack.
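To illustrate the shape of the problem (this is not the original code or GCC's actual output, just a minimal sketch with made-up function names and arithmetic), compare a loop that sets up and tears down scratch memory on every pass with one that keeps its three values in locals the compiler can hold in registers:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { SCRATCH_BYTES = 128 };   /* size taken from the story above */

/* Variant A: scratch memory is allocated and released on every pass. */
long step_with_scratch(long a, long b, long c, int passes)
{
    for (int i = 0; i < passes; i++) {
        long *scratch = malloc(SCRATCH_BYTES);   /* setup cost, every pass */
        assert(scratch != NULL);
        memset(scratch, 0, SCRATCH_BYTES);
        scratch[0] = a;
        scratch[1] = b;
        scratch[2] = c;
        a = scratch[0] + scratch[1] * scratch[2] + i;
        free(scratch);                           /* teardown cost, every pass */
    }
    return a;
}

/* Variant B: same arithmetic, nothing but three values and a counter. */
long step_in_registers(long a, long b, long c, int passes)
{
    for (int i = 0; i < passes; i++)
        a = a + b * c + i;
    return a;
}
```

Both variants return the same value; the first one just drags an allocator call, a memset and extra loads and stores through every iteration.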
Programmers, especially those writing compilers, seem to have forgotten one of the most basic rules of writing code - the less code you use, the faster the program and the less code there is to BREAK. You'll hear the argument repeatedly that many of these new techniques are faster and the older smaller code is slower - and it's utter and total bull MOST (but not all) of the time. You see this attitude in almost all forms of programming these days, where one way of doing things is completely thrown out in favor of another, the new method amounting to trying to shove that square peg into the round hole.
When it's the difference between small tight code and multiplying memory access by a factor of ten, increasing memory access inside the loop, getting cache misses and not even fitting the code and values inside the L1 cache, I know which way I lean on this.
Minimalist code can often remove the memory bus concerns inside loops from the equation - especially if you arrange your code to make use of the pipelining capabilities of newer processors... because while it's off grabbing memory you can keep it executing other stuff.
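One common way to arrange code so the processor has "other stuff" to execute while a load is in flight is to split a dependent chain of operations into several independent ones. A minimal sketch, with hypothetical function names; whether it actually helps depends on the compiler and the CPU (an optimizing compiler may already do this itself), so measure before relying on it:

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every add has to wait for the previous add. */
long sum_single(const int *v, size_t n)
{
    assert(v != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Four independent accumulators: work on one chain can proceed
 * while another chain is still waiting on its load. */
long sum_four_way(const int *v, size_t n)
{
    assert(v != NULL || n == 0);
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += v[i];
    return s0 + s1 + s2 + s3;
}
```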
Edited 2007-10-07 16:31




