Paul Otellini showed an interesting slide at the NY analyst meeting last night showing how fast Intel hopes to move to multicore processors. Otellini said: “What’s next is something more profound, moving our product line from logical to physical parallelism. Parallelism is computer speak for taking a series of tasks and doing them together. You need parallelism in the hardware and operating systems and apps that are aware the machine can handle multiple threads”.
if you can’t get people to buy dual-proc systems .. just sell them ‘multi-core’ single-proc systems. Gotta love how that works.
Kind of interested in what the real performance differences will be between dual-proc, HT, and multi-core.
Very exciting though; running dual-proc here and loving every usec of it.
Dual CPU mobos require more watts in the Power Supply, correct?
Is that the same with dual cores?
I am looking forward to dual-core Athlon 64s, since Opterons will have dual cores soon as well.
“if you cant get people to buy dual proc systems .. gota love how that works.”
Idiot. Multi-core easily has various advantages over dual-processor: lower inter-processor latency, shared cache, lower production costs, smaller footprint … etc.
HT is a half-baked approach to multi-core.
As for the software advantages: there are automatic advantages in a multitasking system, but whether a specific application can make use of multiple processors depends upon the operating system and upon the application itself (does the app use multiple processes? does it use multiple threads? are the threads independently schedulable on different processors? etc.)
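To make those questions concrete, here is a minimal sketch (Python used purely for illustration; the function names are made up) of what “independently schedulable threads” looks like: the work is split into independent chunks, each handled by its own thread, which an MP-aware OS can then place on either processor.

```python
import threading

def count_primes(lo, hi):
    """CPU-bound work that one thread can do independently."""
    total = 0
    for n in range(lo, hi):
        if n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1)):
            total += 1
    return total

def count_primes_threaded(limit, workers=2):
    """Split the range into independent chunks, one per thread.
    Each thread is a separately schedulable unit, so an MP-aware
    OS is free to run them on different processors."""
    results = [0] * workers
    chunk = limit // workers

    def worker(i):
        lo = i * chunk
        hi = limit if i == workers - 1 else (i + 1) * chunk
        results[i] = count_primes(lo, hi)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

The key property is that the threads share no mutable state while working, so nothing forces them onto the same processor.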
Welcome INTEL!
Didn’t Be say this about 10 years ago?
All in all it is funny how the big players can grab statements that smaller companies made years ago and sell them as their own wisdom.
The same goes for things like the “MHz Myth”. From one minute to the next, INTEL pointed out that MHz is not important anymore and voila – the world changed. Forgotten are all the years AMD and Apple had to fight against INTEL’s marketing machinery.
— “multi-core easily has various advantages over dual processor: inter-processor latency, shared cache, lower production costs, lower footprint … etc.”
Don’t forget lower power consumption and less heat.
It’s just amazing how they keep ramping up processing power like this. No matter what limitations they run into, they always find a way.
“I am looking forward to dual core Athlon64s soon, since they have that already with Opterons. Well soon they will.”
AMD doesn’t have any multi-core product yet; they have presented some architecture diagrams of their future dual-core Opteron, but it’s still not done, neither for the Opteron nor for the Athlon 64 chips.
And it’s not even a real multi-core architecture but rather two CPUs put in the same package. According to the diagram, the two CPUs have their own L2 caches, as opposed to a real multi-core design, which uses a shared L2 cache. This will affect overall performance because of the need for cache coherency: in AMD’s design, each CPU has to keep both L2 caches coherent on every modification, which is a big loss of time.
Here is the chip diagram presented by AMD…
http://www.onversity.com/image/actual/amd_dualcoreg.jpg
Is it bad that the caches are not shared?
Sun has UltraSPARC IV in the field, which is a multicore processor, and INTEL is just starting to recognise that this might be a good idea. That almost seems like a bit of a turnaround.
Seriously though, …
You actually do get some nice advantages if you do things like shared cache and processor groups for threads that are likely to use the same memory; especially if you have a machine with a number of such multicore processors.
When you actually combine virtual processors with multiple cores you start seeing stuff like Niagara, which is going to make life very interesting indeed, and as we saw last week, Niagara has just been taped out.
Not quite true. The thing with the multi-core Opterons is that the K8 is explicitly designed for SMP configurations. Unless they suddenly go braindead, they’ll be using a modified high-speed HyperTransport link between the two processors, offering a latency greater than Intel’s dual-core offering, admittedly, but far less than conventional Intel SMP.
The distinction with the Intel version, as you said, is that they share the same bus, the same memory and the same cache. That’s a nightmare to manage well in terms of keeping the cache properly loaded (which processor wants which data and when???). It also massively reduces bus throughput per processor (especially if they ever choose to go 4P on a die) and needs enhanced memory management.
AMD’s HyperTransport does impose a slight latency between processors, but not much. They can still use the other processor’s cache (due to AMD’s NUMA efforts) and the cache on other processors. The same happens with memory. Except, in this situation, you have a discrete interface to work with other processors’ memory; you don’t all just start using the same pool. It’s far cleaner and easier to manage. Moreover, it scales far better with larger numbers of processors, and best of all, the on-die memory controller really does come into its own.
The majority of multi-threaded situations involve either unique application-level threads, or two threads which segment data and process it individually before reintegrating. There are very few SMP situations where you want to work with the same cache at the same time, or even have large amounts of inter-thread messaging. The key issue is that it amplifies cache misses, and if you are messaging heavily, there is a high potential for stalls and deadlocks. Using the same cache has locking problems as well. Where you actually might want to work on the same data all at the same time, SIMD instructions will take care of most of the strain. Logically, you only need two processors for MISD operations in such a processor configuration.
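For what it’s worth, “inter-thread messaging” in the software sense looks like the sketch below (Python, illustrative only; the real SMP cost shows up at the cache and bus level, which this obviously doesn’t model): a producer hands every item to a consumer through a queue, paying a synchronisation cost per message.

```python
import queue
import threading

def sum_via_messages(values):
    """Producer sends each value as a message; a consumer thread
    reintegrates them. Heavy per-item messaging like this is the
    pattern argued above to be rare (and costly) in real SMP work."""
    q = queue.Queue()
    total = [0]  # single-slot box the consumer writes into

    def consumer():
        while True:
            v = q.get()
            if v is None:   # sentinel: no more work
                break
            total[0] += v

    t = threading.Thread(target=consumer)
    t.start()
    for v in values:
        q.put(v)            # one synchronised hand-off per item
    q.put(None)
    t.join()
    return total[0]
```

Contrast this with the segment-and-reintegrate pattern, where each thread works on its own chunk and synchronises only once at the end.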
You also have to remember that the majority of code is not parallelised, and most likely never will be. As a result, it is very possible that two chips on the same die will be running entirely different processes, at which point the limitations of having a shared memory bus and a shared bus to the northbridge come into play.
Ironically, where Intel have been telling us that media encoding is a great way to test a processor (because the P4 memory subsystem was fast and large), dual-chip-on-a-die Intel processors without a redesigned bus (and they are renovated P3s, not new processors, after all) will only really be able to keep a single processor supplied with data. More’s the pity, because encoding is ideally suited to being an MP task, and one that will probably favour the K8 far more, if both are at equivalent performance levels, thanks to the discrete busses, the on-die memory controller, and HyperTransport.
IIRC, reading back, there was a move to release a dual-core Opteron CPU around 2005 (I don’t have the exact date); that will be in line with the market, as software developers deliver software that takes advantage of the dual-core design.
With that being said, AMD is going for volume, and judging by the failure of Itanium, compatibility has proven to be a greater asset to customers than fancy new technology ideas. The fact is, x86, for all its quirks, is an alright ISA. AMD grabbed it, cleaned it up, added some extra extensions and voila: you have an architecture with most of the things people hate removed from the ISA. Customers are happy, software companies are happy, we’re all happy. Too bad Intel is still acting like a child by creating a half-baked copy without the necessary NX feature, which would help with system security.
I’m betting that this is not a new technology.
The majority of multi-threaded situations involve either unique application-level threads, or two threads which segment data and process individually, before reintegrating.
What does this mean? Can you elaborate a little?
There are very few SMP situations where you want to work with the same cache at the same time, or even have large amounts of inter-thread messaging. The key issue, is that it amplifies cache misses
What do you mean there are very few SMP situations where you want to work with the same cache at the same time? Caches are primarily meant for locality of reference; two threads in a process can do a lot with the same cache.
What on earth is inter-thread messaging? Two threads in a process have access to the same address space; there is no need to message. That’s the whole idea behind threading an app.
Yes, if the OS schedules threads without understanding shared caches, you will tend to aggravate cache misses.
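To make the shared-address-space point concrete, a minimal sketch (illustrative names, Python purely for demonstration): two threads in one process mutate the very same object, coordinated by a lock rather than by messages.

```python
import threading

def shared_append(n):
    """Two threads in one process share the same address space:
    both append to the same list object directly. A lock serialises
    the writes; no copying or message passing is involved."""
    shared = []
    lock = threading.Lock()

    def worker(tag):
        for i in range(n):
            with lock:                 # guard the shared structure
                shared.append((tag, i))

    a = threading.Thread(target=worker, args=("a",))
    b = threading.Thread(target=worker, args=("b",))
    a.start()
    b.start()
    a.join()
    b.join()
    return shared
```

Both threads see every element the other appended, because there is only one list in one address space.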
if you are messaging heavily, there is a high potential for stalls and deadlocks. Using the same cache has locking problems as well.
If you have deadlocks, that is because the code is not written properly; there is nothing inherent in dual-core shared-cache CPUs that will cause deadlocks. What locking problems will using the same cache pose? Cache lines can be atomically locked, or, when you hit a lock in the cache, invalidate the cache line and perform an atomic bus op. What is the problem?
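On the “written properly” point: the classic discipline that makes deadlock impossible is a global lock order. A small sketch (the order-by-`id` rule is just one illustrative convention, not any standard API):

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def with_both(first, second, action):
    """Acquire two locks in a single global order (here: by id),
    regardless of the order the caller named them in. Since every
    thread takes locks in the same order, no cyclic wait can form."""
    lo, hi = sorted((first, second), key=id)
    with lo:
        with hi:
            return action()
```

Two threads that name the locks in opposite orders would deadlock with naive nested acquisition; with the ordering rule, both finish.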
You also have to remember that the majority of code is not parallelised, and most likely never will. As a result, it is very possible that two chips on the same die will be running entirely different processes, and the limitations of having a shared memory bus and shared bus to the northbridge come into play.
Which century are you in? Most OSes are threaded, now with 1:1 user-thread-to-kernel-thread libraries in Linux and Solaris. Multi-core CPUs improve system throughput. Most OSes are multitasking today, and having two CPUs will improve system throughput. Maybe you should run top, Task Manager, or whatever the process-statistics tool on your favorite OS is, to see how many basic tools are already threaded.
I don’t understand how two CPUs on a die are more limited than two CPUs sharing a single bus to the northbridge; in fact, most cache coherency protocols do not need to chew up bus bandwidth.
If all you are trying to say is that multi-core CPUs will not improve single-threaded performance, then you are right. But the world wants throughput. With Vanderpool and the many virtualization technologies, you will need multi-core CPUs to be able to carve a physical machine into many partitions. That is where the future is heading, at least according to the roadmaps of all the chip manufacturers.
They are dumping Prescott and its offspring in favor of the more efficient Pentium M.
Going multi-core with it, in fact.
So the future is multi-core Pentium M chips that are 64-bit and run no faster than 2 GHz per core.
I’m betting that this is not a new technology.
Umm, the 386 (maybe the 286) supported it, but only with segmentation, not paging, which is not very useful today. I’m guessing that mainframes of the ’60s probably had it; they always do the advanced stuff first. It seems pretty silly, if you’re going to go to all the trouble of advanced hardware protection features, to allow execution on the stack!
Most OSes are multitasking today and having two CPUs will improve system throughput…
Yes, you will halve the number of context switches (read: wasted CPU time).
All the OS needs is to be aware that there are 2 CPUs and to have the appropriate process scheduler to deal with it.
A program, however, needs multi-threading for that application to take advantage of 2 procs; but as far as the OS goes, if it is multi-proc aware, it can send jobs to either processor.
Oh, and there are diminishing returns involved with multi-proc systems. For most user computing, 2 CPUs is more than enough; 4 would be overkill and not give any net benefit.
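Those diminishing returns have a classic formalisation in Amdahl’s law; a quick sketch (the 50%-parallel figure below is purely illustrative, not a measurement of any real workload):

```python
def amdahl_speedup(parallel_fraction, n_cpus):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n),
    where p is the fraction of the work that can run in parallel."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_cpus)

# With half the work parallelisable, a 2nd CPU buys ~1.33x overall,
# but going from 2 CPUs to 4 adds only ~0.27x more.
two_cpu = amdahl_speedup(0.5, 2)   # ~1.33
four_cpu = amdahl_speedup(0.5, 4)  # 1.6
```

The serial fraction caps the benefit no matter how many processors you add, which is exactly why the second pair of CPUs buys less than the first.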
All the OS needs is to be aware that there are 2 CPUs and to have the appropriate process scheduler to deal with it.
A program, however, needs multi-threading for that application to take advantage of 2 procs; but as far as the OS goes, if it is multi-proc aware, it can send jobs to either processor.
Hmmm… I am not sure if I gave the impression that an OS has to be multithreaded to take advantage of MP.
However, if an OS is multithreaded, it can better utilize MP. Case in point: Solaris, where every user process or thread is bound to one kernel thread. The scheduler uses threads as the basic scheduling unit, so Solaris can utilize an MP system more effectively even if the user program isn’t multithreaded. Also, a threaded OS scales better.
When you have a single-threaded OS, the kernel can only be executing on one CPU at a time and has to hold a big lock to guard the entire kernel. This is a horribly inefficient way of doing MP.
Well, that is true, but if there are not a lot of simultaneous system calls going on, like on a home desktop, a single-threaded OS will be fine. Where you find multi-threaded OSes important is in servers and mainframes.
Well, that is true, but if there are not a lot of simultaneous system calls going on, like on a home desktop, a single-threaded OS will be fine. Where you find multi-threaded OSes important is in servers and mainframes.
There are always system calls going on, even in desktop OSes. Anything that needs a kernel service requires a system call: reading a file, requesting memory, accessing driver services.
System calls are just a vectored entry into kernel/supervisor mode to perform a service. Interrupts are always happening too, so the kernel is entered more often than is obvious to an end user on any system. Multi-threadedness can improve the response time of an interactive system like a desktop.
Yes, I did not dispute that, but the number of simultaneous calls on a home desktop is far lower than on a server or mainframe.
always happening and happening a lot are different.
Yes, I did not dispute that, but the number of simultaneous calls on a home desktop is far lower than on a server or mainframe.
I doubt you can make a generalization like that without empirical data. Home desktops have graphics-intensive apps and a lot of multimedia, all of which require kernel services. You can profile certain apps and systems under certain loads, but you can’t with any certainty take it as a given that servers have more simultaneous system calls than desktops.
I can appreciate all this shared vs. private L2 discussion, but what I find even more interesting is that they say they will release their 64-bit enabled chips when Win64 is released (or reasonably close to it).
What about Linux? 64-bit versions of it are out, and have been for some time now. I thought Intel was really into Linux. Well, I thought so anyway. So, if I read it right, this says they don’t care what Linux users need at the moment. Because, really, what’s the difference between enabling the feature now or later, if it doesn’t cost them anything? Is it simply that they don’t care THAT much about Linux, or is it to save more face about AMD doing it first?
I’m guessing we can expect a pretty good-sized marketing blitz from Intel when Win64 does come out. So much so that us consumers will all forget that Intel didn’t invent it.
PS: I’m an avid Windows user, and I’m just pointing out what I found most interesting.
I don’t know about mainframes, but I believe the sun4 had per-page execution permissions (because segment-based execution permissions just don’t count).
Frankly, I don’t think it is easy to say whether a shared L2 is better or worse than separate L2s.
Pros of a shared L2:
2 times the size available to a single core
No snooping necessary (cache coherency)
Pros of separate L2s:
Faster (smaller usually means faster)
Less contention (not shared; though this is offset by the cache-coherency logic that must be added)
If your workload is small enough to fit in a separate L2 cache, the separate L2 will probably be better; otherwise, if the workload fits in the shared L2 but not in a separate one, the shared cache will be faster.
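That trade-off can be sketched as a toy rule of thumb (the 512 KB-per-core figure is just an assumed example for illustration, not any shipping part):

```python
def best_cache_layout(workload_kb, per_core_l2_kb=512, n_cores=2):
    """Toy model of the working-set argument above: a workload
    that fits in a private L2 prefers separate caches (smaller
    usually means faster); one that only fits in the combined
    capacity prefers a shared L2; beyond that, neither helps."""
    shared_l2_kb = per_core_l2_kb * n_cores
    if workload_kb <= per_core_l2_kb:
        return "separate"
    if workload_kb <= shared_l2_kb:
        return "shared"
    return "neither fits"
```

In other words, the verdict depends entirely on where your working set falls relative to the two capacities, which is why neither design wins outright.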