Many companies backing Intel’s Itanium processor are planning to announce a new alliance in September to try to make it easier for customers to adopt systems using the high-end chip. The group, called the Itanium Solutions Alliance, has several plans to make Itanium more useful, said a source involved with the outfit.
This dog and pony show looks more like a face-saving measure on the part of Intel and Intel’s sock puppets (MS, Red Hat, Unisys, SGI, etc.). Fact is, Itanic is on its last legs and even HP is already contemplating dumping this useless piece of overhyped kit:
http://www.theinquirer.net/?article=25573
If you want a solid high-end platform, forget about Itanic; just go to good old IBM and Sun.
Actually the Montecito chip looks quite promising (and was designed by a team composed of ex-Alpha architects, which bodes well!). The main thing that scares me about Itanium is the *huge* amount of on-die cache.
Microsoft?
You think Microsoft is an Intel “sock puppet”?
Oh wait! Sorry I didn’t realise you were trolling – I thought you were being serious for a minute.
Fact is that Itanium is one of the fastest CPUs on the market and you are just spreading standard anti-Intel FUD.
The problem is that “one of the fastest” is not enough to establish this platform…
Well, it is one of the roughly equal two fastest in floating point performance.
Not great at integer, but not among the worst.
Itanic is only fast on SPEC FP benchmarks. It’s hot, slow and overpriced for everything else.
They killed the Alpha, the best processor in the world, and then that Fiorina imbecile killed the PA-RISC series, another great architecture. Now, years later, they realise that Itanium is a friggin piece of shit. HP has become another Dell. Intel at least has the new Pentium M with EM64T; HP has no plan B that I know of besides massive layoffs.
They can use MIPS64 chipsets from PMC-Sierra and produce MIPS64 desktops with Linux/NetBSD. That would be remarkable. They can even market ODW from Genesi and G5s from Merkury.
If not, they are just another Dell.
Fact is that Itanium is one of the fastest CPUs on the market and you are just spreading standard anti-Intel FUD.
Where are those facts? Probably an Intel-funded study?
I’d like to take all the Itaniums of the world and melt them down, along with the verbiage, into slag.
Alpha has been and will continue to be the best of breed in processors for years to come. Watt for watt, cycle for cycle, it is head and shoulders above the rest. Even after all the monumental mistakes made in marketing and management, it’s still technically the best.
“Fact is Itanic is on its last legs.” THANK GOD. NOT A MOMENT TOO SOON. AMD beat Intel to the punch with 32-bit compatibility. Itanic is a stupid idea whose time is OVER. They need to bury it deeper than NextGen.
NOTE: Price Check on Aisle ONE: “Intel Corp. Intel -1 x Intel Itanium 2 1.6 GHz ( 400 MHz ) – L (SKU: BX80543KC1600G) Price Range: $912.53 from 1 Seller ”
NINE HUNDRED TWELVE DOLLARS AND FIFTY-THREE CENTS FOR A 1.6GHz processor. Who are they kidding?
I’m not convinced current Itaniums are great value but you’re missing some important points.
The fact that it’s 1.6GHz doesn’t tell you much on its own – the Itanium does a decent amount of work per cycle. It doesn’t need a high clock rate to run fast.
It also has *huge* caches (9MB on chip for the current Madisons, multiples of that for the coming Montecito). Large caches are really expensive because of the chip area…
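To put some very rough numbers on that (back-of-the-envelope only, ignoring tags, ECC and control logic): 9MB of cache is about 9 × 1024 × 1024 × 8 ≈ 75 million bits, and at the usual six transistors per SRAM cell that’s on the order of 450 million transistors spent purely on cache cells – the cache dwarfs the logic of the core itself.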
Itanium’s market is ultra-high-end servers. We’re talking systems that require a really good Mean Time Between Failures (MTBF). Achieving a good MTBF costs money and costs you some clock rate (compare the clock rate of IBM’s POWER4 to that of the PPC970, which uses a closely related core microarchitecture).
Finally, the Montecito chip was designed by Alpha architects. Since the basic premise of IA64 seems sane (to me), Intel/HP have good funding and fabrication tech, and the Alpha architects are generally agreed to be great, I hope we’ll really see Itanium shine with this new chip.
The huge cache is not a feature. It’s a no-choice requirement imposed by the nature of the EPIC architecture. Different processors/compilers do things differently. In the case of Itanium, it requires more cache. Please do not give a misleading impression.
> The huge cache is not a feature. It’s a no-choice
> requirement imposed by the nature of the EPIC architecture.
Why? What’s your argument for this? Instruction density? Instruction size? Do you have any numbers for this? I’d be really interested to see a quantitative study of this sort of thing. In anticipation, here’s my reasoning.
First, the I-cache PoV:
OK, the instruction size of EPIC is 16 bits larger than the 32-bit instructions of most RISC instruction sets, so the I-cache could reasonably be expected to need to be half as large again.
OTOH, the 48 bits can give you improved efficiency in encoding information into the instructions, so it’s a *good* thing for the execution core.
EPIC doesn’t incur the high NOP ratio that traditional VLIW does, so this isn’t wasting cache space. I’m not sure that you can attribute the size of Itanium’s I-caches purely to instruction encoding…
Data cache PoV:
You would want a decent sized data cache to avoid memory access stalls in Itanium (although since the Montecito uses multithreading that’s less of an issue). OTOH, IA64 provides explicit prefetching operations that a compiler can use to get stuff into the cache before it’s needed, which is a neat optimisation to avoid this problem.
Also, a big data cache can cache more data. Even if Itanium *needs* a big d-cache to avoid memory stalls, that same big d-cache will *still* benefit you on large data-sets, so it’s not like it’s a wasted resource.
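To make the prefetching point concrete, here’s roughly the shape of what I mean, written as portable C using GCC’s generic __builtin_prefetch hint rather than actual IA64 prefetch instructions – the loop, the function name and the prefetch distance are all just made up for illustration:

    #include <stddef.h>

    /* Sum an array, hinting the hardware a few iterations ahead so the
       data is (hopefully) already in cache by the time we need it.
       The prefetch distance of 16 is an invented tuning parameter. */
    double sum(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 3); /* read, keep in cache */
            s += a[i];
        }
        return s;
    }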
> Different processors/compilers do things differently.
> In the case of Itanium, it requires more cache.
My original response was to the comment about the high price-point – that it is there for a reason.
> Please do not give a misleading impression.
Let’s keep the discussion technical / economic.
In the EPIC architecture, it’s about explicitly parallelising instructions. The CPU is executing every branch that it possibly can. Hence the larger instructions and cache required. I’m not saying the cache is a waste of resources. I’m saying you cannot conclude that just because Itanium has more cache than Opteron/POWER/Xeon/Niagara, it is superior in this area.
There are many different techniques to design a system, or even a processor, and a processor cache is just one component of it. For example, just because Niagara has 32 threads (compared to 1 in Itanium), it does not mean it’s much better. We must evaluate at a higher level. That’s what I meant.
> In the EPIC architecture, it’s about explicitly
> parallelising instructions. The CPU is executing
> every branch that it possibly can. Hence the larger
> instructions and cache required. I’m not saying the
> cache is a waste of resources. I’m saying you cannot
> conclude that just because Itanium has more cache
> than Opteron/POWER/Xeon/Niagara, it is superior
> in this area.
Well… Being pedantic I’d say Itanium isn’t executing all possible *branches*, it’s just parallelising the existing instruction stream. The compiler may choose to insert speculation regarding branches if it has appropriate information.
The instructions don’t actually need to be bigger for purposes of parallelism in IA64 (the “bundles” they’re grouped into are). I assume that the instructions are 48 bit because the Intel people felt that was a good encoding length for some other reason… The 128-bit bundles are just fetch groups.
Bear in mind that all the other processors you mentioned execute multiple instructions concurrently too, so they also need to have them in the instruction cache. Itanium doesn’t necessarily have any more parallelism, it’s just chosen explicitly by the compiler rather than the hardware. The other CPUs’ dynamic execution hardware may also be speculatively executing instructions after a conditional branch before the real outcome is known.
In terms of the data, more cache is still going to benefit you on large data sets. (whether this makes up for the other drawbacks of Itanium is another debate)
I certainly think the larger data cache is an advantage of current Itaniums. I’m not convinced the instruction cache really does need to be bigger to support the parallelism but I could perhaps be convinced…
> There are many different techniques to design a
> system, or even a processor, and a processor cache
> is just one component of it. For example, just
> because Niagara has 32 threads (compared to 1 in
> Itanium), it does not mean it’s much better. We must
> evaluate at a higher level. That’s what I meant.
[side note: Montecito is multithreaded]
Absolutely – no argument here 🙂
Plus it’s also a case of “horses for courses”. Niagara should do really well for online transaction processing, database apps, etc (read: common corporate apps). IA64 may well still beat it for scientific computation but I doubt that’ll be such a lucrative market.
IBM used to have two separate (related) lines of CPU – one for scientific work and one for commercial workloads. That was an interesting approach but they eventually merged the two lines to create POWER.
The instructions don’t actually need to be bigger for purposes of parallelism in IA64
They do, because an in-order architecture requires more registers and more operands than an out-of-order one with register renaming.
But whether having 128 registers really gains you much compared to the RISC standard of 32 is another question.
I assume that the instructions are 48 bit because the Intel people felt that was a good encoding length for some other reason…
They’re 41 bits actually. Three of those plus a 5-bit so-called template make up a 128-bit bundle. The template encodes what instructions can be executed in parallel.
When no suitable instructions are available, the compiler is forced to put NOPs into instruction slots, thus wasting further instruction bits.
And being a load/store architecture, it usually requires more instructions to do the same thing as x86.
The net result is that Itanium binaries can end up as much as three times as big as x86 binaries, so you need a lot more instruction cache to hold the same amount of code.
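Just to put numbers on that: three 41-bit instructions plus the 5-bit template is 3 × 41 + 5 = 128 bits, i.e. about 42.7 bits per instruction slot, versus 32 bits on a classic RISC and an average of maybe 3-4 bytes on x86. Add the NOP padding and the extra instructions a load/store ISA needs, and the factors compound towards that 2-3x range.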
Itanium doesn’t necessarily have any more parallelism, it’s just chosen explicitly by the compiler rather than the hardware.
Yes, and it will often have less because a compiler is lacking information that a dynamic scheduler has. E.g. it can’t predict instruction latencies (unless you optimise for a particular IA64 implementation) or memory latencies.
I certainly think the larger data cache is an advantage of current Itaniums.
It requires a bigger data cache to make up for its inability to work around cache misses, but
it’s difficult to quantify how much bigger.
>>The instructions don’t actually need to be bigger
>>for purposes of parallelism in IA64
> They do, because an in-order architecture requires
> more registers and more operands than an
> out-of-order one with register renaming.
That’s a good point actually, I hadn’t thought of that!
>>I assume that the instructions are 48 bit because
>>the Intel people felt that was a good encoding
>>length for some other reason…
> They’re 41 bits actually. Three of those plus a
>5-bit so-called template make up a 128-bit bundle.
>The template encodes what instructions can be
>executed in parallel.
I should have done my maths and worked out I was remembering that number wrong 😉
> When no suitable instructions are available, the
> compiler is forced to put NOPs into instruction
> slots, thus wasting further instruction bits.
But that’s (supposedly) a comparatively rare occurrence, right? The instructions are parallelised in terms of *groups* which can extend across bundle boundaries. If a group ends *within* a bundle then the template bits are used to encode that fact and instructions from the next group begin. It’s kind of a neat way of getting round the VLIW nop pollution and avoiding dependences on particular CPU issue widths.
I can’t remember the template encoding – is it good enough to always avoid nop padding?
>> Itanium doesn’t necessarily have any more
>> parallelism, it’s just chosen explicitly by the
>> compiler rather than the hardware.
> Yes, and it will often have less because a compiler
> is lacking information that a dynamic scheduler has.
> E.g. it can’t predict instruction latencies (unless
> you optimise for a particular IA64 implementation)
> or memory latencies.
*nod* of course, the tradeoff is that your execution core is simpler and therefore (theoretically) clocks faster and allows you to spend resources on other hardware. How well this is paying off for them is not at all clear, though 😉
>> I certainly think the larger data cache is an
>> advantage of current Itaniums.
> It requires a bigger data cache to make up for its
> inability to work around cache misses, but
> it’s difficult to quantify how much bigger.
*nod* You still don’t need it *if* the compiler can insert speculative loads in the right place, though (but that’s not possible for lots of applications). The answer to anything in CPU design is always “it depends on your workload”; I guess for most things they need the large d-cache as a workaround (although they could just have upped the associativity, given their clock rate isn’t exactly stellar) but for the scientific results (the ones that give them good benchmarks, probably) the large d-cache is probably still a win outright.
A neat trick is the fine-grained thread switching on Montecito; I think it’s a neat solution to latency hiding but it’ll be interesting to see how it’s handled by the OS.
But that’s (supposedly) a comparatively rare occurrence, right?
Yes, NOPs are rarer than with a simple VLIW.
I can’t remember the template encoding – is it good enough to always avoid nop padding?
No, it’s quite restrictive actually, e.g. you can only have one floating-point instruction per bundle and the only templates with a stop in the middle are M_MI and MI_I (where M and I indicate memory and integer instructions). And of course branches only target whole bundles.
A neat trick is the fine-grained thread switching on Montecito; I think it’s a neat solution to latency hiding but it’ll be interesting to see how it’s handled by the OS.
Yep, and they might have been better off following this strategy from the start. Simultaneous multi-threading seems a much more efficient way than caches to spend the transistors they saved through EPIC.
With its in-order architecture Itanium was always going to fight a losing battle on single-thread performance, even more so when executing x86 code.
>> I can’t remember the template encoding – is it good
>>enough to always avoid nop padding?
> No, it’s quite restrictive actually, e.g. you can
> only have one floating-point instruction per bundle
> and the only templates with a stop in the middle are
> M_MI and MI_I (where M and I indicate memory and
> integer instructions). And of course branches only
> target whole bundles.
OK, I vaguely remember this stuff but it’s all a bit hazy these days 🙂 Thanks! Having 5 bits for the template does seem a bit restrictive really. I guess some of the limitations are also due to the way the opcodes are plumbed through in the microarchitecture – i.e. there might be structural limitations on what instructions can be encoded in a bundle.
>> A neat trick is the fine-grained thread switching on
>> Montecito; I think it’s a neat solution to latency
>> hiding but it’ll be interesting to see how it’s
>> handled by the OS.
> Yep, and they might have been better off following
>this strategy from the start. Simultaneous
> multi-threading seems a much more efficient way than
> caches to spend the transistors they saved through
> EPIC.
It’s not actually SMT they’re doing on Itanium (as I guess you probably know): the architecture just isn’t well suited to it (whereas on x86 they had almost all the hardware there already :-D). But I agree: for something that just can’t cope with load latency, having multithreading makes *lots* more sense.
Although I believe they get decent performance out of their large caches, it does always worry me that *they* think it’s a good idea!
It’s not actually SMT they’re doing on Itanium
What are they doing then? I’ve only read that Montecito will have two cores running two threads each.
the architecture just isn’t well suited to it
Why not?
Granted, it wouldn’t work with traditional VLIW, where each instruction slot directly corresponds to an execution unit.
But on Itanium the mapping of instructions to execution units is up to the processor, thereby allowing for implementations with different numbers of execution units.
So I’d have thought mapping instructions from two (or more) threads shouldn’t be too much of a problem. Or am I overlooking something?
>> It’s not actually SMT they’re doing on Itanium
> What are they doing then? I’ve only read that Montecito
> will have two cores running two threads each.
It’ll be fine-grained multithreading, not simultaneous. The processor will have two contexts but will only execute instructions from one thread per cycle (whereas in SMT you can have instructions from multiple threads in the same pipe stage). It’ll context switch when one of the threads would otherwise stall.
I’m not sure how this is presented to the OS – it’d be nice if it’d occasionally alternate them anyway, to give the illusion of two equal virtual CPUs. (for comparison the Cray Tera switches between a large number of threads every cycle, so that it basically never stalls but has low single-thread throughput).
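If it helps, here’s a toy model of the switch-on-stall idea in C – entirely my own sketch, nothing to do with how the real Montecito hardware is built: one issue slot per cycle, two thread contexts, and we flip to the other context whenever the active thread stalls. An SMT core could instead mix instructions from both threads into the *same* cycle.

    #include <stdio.h>
    #include <stdbool.h>

    int main(void)
    {
        /* made-up stall patterns: true = this thread stalls this cycle */
        bool stalls[2][10] = {
            { false, false, true,  true,  false, false, true,  false, false, false },
            { false, true,  false, false, false, true,  false, false, true,  false }
        };
        int active = 0;

        for (int cycle = 0; cycle < 10; cycle++) {
            if (stalls[active][cycle])
                active ^= 1;                     /* switch contexts on a stall */
            if (stalls[active][cycle])
                printf("cycle %2d: both threads stalled, bubble\n", cycle);
            else
                printf("cycle %2d: issue from thread %d\n", cycle, active);
        }
        return 0;
    }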
>> the architecture just isn’t well suited to it
> Why not?
> Granted, it wouldn’t work with traditional VLIW, where
> each instruction slot directly corresponds to an
> execution unit.
<snipped some insightful comments for brevity>
I *was* going to say “register renaming” but that doesn’t really apply since you’ll need two copies of the architectural state in the CPU anyhow. As you point out, the CPU can also map instructions itself…
The trouble (as I see it) is that x86 (being dynamic execution) had most of the hardware necessary to manage multiple threads anyhow. Itanium is explicitly designed to avoid this complexity, so they’d need to add relatively more complexity to make it work well. I could probably think of some examples if I think about it some more…
It’ll be fine-grained multithreading, not simultaneous. The processor will have two contexts but will only execute instructions from one thread per cycle (whereas in SMT you can have instructions from multiple threads in the same pipe stage). It’ll context switch when one of the threads would otherwise stall.
Thanks, that’s very interesting, if a bit disappointing.
So what will happen with instructions already in the pipeline when a thread gets stuck? Are they simply discarded and redone after the next context switch?
And what about the two threads on the Xenon or the Cell’s PPE? They’re in-order as well, do they follow a similar model?
> So what will happen with instructions already in the
> pipeline when a thread gets stuck? Are they simply
> discarded and redone after the next context switch?
It can cope with instructions from different threads being at different stages in the pipe. It just can’t issue from both threads in the same cycle.
The Hyperthreading / SMT on the Pentium 4 solves the “spatial” problem of filling all the issue slots each cycle (by issuing from both threads) and the “temporal” problem of filling the pipeline on a stall (by continuing to issue from the other thread). The FGMT on the IA64 just solves the latter problem but may still waste issue slots within one cycle.
I know Intel people have done hypothetical studies on SMT Itaniums but I don’t think they’re going for full SMT right now.
> And what about the two threads on the Xenon or the
> Cell’s PPE? They’re in-order as well, do they follow
> a similar model?
The Xenon is the X-Box one right? I’d put very good money on them also following fine-grained switching vs. true SMT, since their goal is multiple simple cores. I’d also put money on the Sun Niagara using this model.
[side note on why I think SMT is a cool hack: The neat thing with a dynamic execution processor is that you already have a model of “rename registers, then throw the instructions into a big pot and use a data-flow model, then tack on an in-order commit phase at the end”. For SMT it doesn’t really matter that the instructions come from multiple threads: only the initial and final stages of the pipeline need to do this mapping, the rest can (more or less) just work on the internal register file.]
It can cope with instructions from different threads being at different stages in the pipe. It just can’t issue from both threads in the same cycle.
I found a paper on it: as you already mentioned Intel calls it temporal multi-threading (TMT) and its implementation is surprisingly simple.
Montecito has dual latches for separating its pipeline stages, so switching the active thread is just a question of selecting the corresponding set of latches, without any delay. Should be straightforward to extend to more threads, although of course the large register sets need to be replicated too.
The Xenon is the X-Box one right?
Yep.
I’d put very good money on them also following fine-grained switching vs. true SMT
Seems likely. Thinking about it, implementing true SMT where instructions from different threads can overtake each other isn’t all that straightforward and loses some of the simplicity of an in-order design.
side note on why I think SMT is a cool hack: …
I agree, with all the out-of-order stuff already there, SMT almost comes for free. I was quite surprised Intel omitted it from their new x86 design.
> I found a paper on it: as you already mentioned
> Intel calls it temporal multi-threading (TMT) and
> its implementation is surprisingly simple.
Heh 🙂 I didn’t know they actually *called* it that – could you post a link please? I’d be interested in seeing it.
> Montecito has dual latches for separating its
> pipeline stages,
Ah, cute 🙂 That is rather simple! Replicating all the state the Itanium has must be painful.
> Seems likely. Thinking about it, implementing true
> SMT where instructions from different threads can
> overtake each other isn’t all that straightforward
> and loses some of the simplicity of an in-order
> design.
I think simplicity is the key here. I guess the problem you’re referring to is interlocks? i.e. that you need more checks in hardware to ensure in-order semantics, given that the rate of execution changes depending on what the other thread’s doing?
I hadn’t thought of it like that – thanks for straightening my head out!
> I agree, with all the out-of-order stuff already
> there, SMT almost comes for free. I was quite
> surprised Intel omitted it from their new x86
> design.
I get the impression that although it allows them to fix the spatial scheduling problem too, the main reason they wanted SMT was for the temporal problem. I heard a quote that with a shorter pipeline (and still having OoO execution), they just didn’t find they needed it.
could you post a link please?
Here you go:
http://www.ewh.ieee.org/r5/denver/sscs/Presentations/2005.03.Naffzi…
I guess the problem you’re referring to is interlocks?
I guess so. When an instruction gets stalled, the interlocking logic would have to discriminate between instructions of the same thread, which do need to be stopped, and instructions of other threads, which can continue to progress. Ideally, you’d want instructions of other threads to leap past a stalled instruction, even within one execution unit.
Thanks very much – I’ll check that out.
It’s been a fun discussion!
> It also has *huge* caches (9MB on chip for the current Madisons, multiples of that for the coming Montecito). Large caches are really expensive because of the chip area…
Unrelated to architecture…
> Itanium’s market is ultra-high-end servers. We’re talking systems that require a really good Mean Time Between Failures (MTBF).
Again, unrelated to architecture.
Itanium only makes sense as long as the VLIW architecture fulfills its promise. Otherwise it has no advantage over already-established platforms like x86-64 or POWER.
Was the original article architecture-focused in some way I missed?
>> It also has *huge* caches (9MB on chip for the current Madisons, multiples of that for the coming Montecito). Large caches are really expensive because of the chip area…
> Unrelated to architecture…
I know; I mentioned it because it’s related to the high price point mentioned by the o.p.: you’re actually paying high prices for an expensive-to-produce product.
> Itanium’s market is ultra-high-end servers. We’re talking systems that require a really good Mean Time Between Failures (MTBF).
> Again, unrelated to architecture.
Yes, again it’s related to the price comments of the original poster. If you’re paying money for a better MTBF then it’s not a gratuitously high price.
>> Itanium only makes sense as long as the VLIW architecture fulfills its promise. Otherwise it has no advantage over already-established platforms like x86-64 or POWER.
Actually, it’s not pure VLIW, it’s an evolution of that. I like the *concept* of the EPIC evolution of VLIW – I think it’s a sane approach. As you point out, what architecture you’re using makes no difference unless it’s a good deal from a business standpoint. I’m not claiming that switching to it right now makes economically good sense (short of some very canny future technical and business work by Intel), but I like the concept from an engineering perspective.
NINE HUNDRED TWELVE DOLLARS AND FIFTY-THREE CENTS FOR A 1.6GHz processor. Who are they kidding?
Heh heh.
It’s expensive because to get any performance at all out of it they had to put acres of cache memory on it – to get as much performance on realistic workloads as everybody else’s 500MHz processors.
Talking of itanic, has anyone actually ever seen a real one, up and running?
It’s expensive because to get any performance at all out of it they had to put acres of cache memory on it
If you try to fill your Itanium server to its maximum capacity of memory, the price of the processor itself is more or less the taxes you pay to get the memory…
Talking of itanic, has anyone actually ever seen a real one, up and running?
– 3 Workstations in my office
– A dozen servers in the room next to me
I have even played with the upcoming Montecito.
>> Talking of itanic, has anyone actually ever seen a
>> real one, up and running?
>
> – 3 Workstations in my office
> – A dozen servers in the room next to me
>
> I have even played with the upcoming Montecito.
We had a (donated) Merced but it died after a few years (power regulator I think). Merceds were a bit dodgy – more a software development vehicle than a serious server. (These were the slow ones.)
As a replacement, we’re due to get a donation of a couple of pre-release Montecitos for our development work. I’m waiting in excited anticipation 🙂
> You think Microsoft is an Intel “sock puppet”?
Pardon my overgeneralization. Intel and M$ are reciprocal whores for each other, so at times it is safe to say that M$ is a sock puppet for Intel and vice versa. Why do you think there is a “Wintel” designation for all the Intel+M$ crap out there? M$ is so deeply in bed with Intel, it is not funny.
First, analyse this graphic:
http://blogs.sun.com/roller/resources/JeffV/itanium_rev_ext.gif
Itanium is an incredible thing, in that it is unbelievable such a horrible failure can still exist. It is possible that Intel paid IDC or whatever technology puppets more than it earned from the chip, just to get grossly wrong predictions that kept the project alive for almost 10 years.
Is it a bad processor? No. But it is definitely not a match for cheaper alternatives apart from some very limited areas (like high-end supercomputing).
The Itanium CPU has some really interesting features that are good for tasks that can be parallelized.
Interesting features include, but are not limited to:
Software Pipelining
Register Rotation
Predication (see the sketch below)
etc.
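As a rough illustration of the predication point (this is only a sketch of the concept in C, not the output of any particular compiler): a branchy function like the first one below can be “if-converted” so that both results are computed and a predicate selects between them, leaving no branch for the predictor to get wrong.

    /* Branchy form: normally compiled to a compare and a conditional branch. */
    int clamp(int x, int limit)
    {
        if (x > limit)
            x = limit;
        return x;
    }

    /* If-converted form, conceptually: the condition becomes a "predicate"
       that decides which value is kept.  On IA-64 the compare would set a
       predicate register qualifying the move; in portable C it is a select. */
    int clamp_predicated(int x, int limit)
    {
        int p = (x > limit);
        return p ? limit : x;
    }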
Why is the Itanium deemed a failure by some?
Most of that comes from the fact that it puts a lot more work on the compiler than architectures like x86 do. The IA-64 ISA is great for academic compiler research, and the only thing Intel and HP mispredicted while working on the Itanium is how difficult it would be to make a good compiler for it.
Overall I commend Intel for going as far as they have done with the Itanium. It may not give them lots and lots of money today, but in a few years all those investments could really begin to pay off for them.
The way you are accounting for the need for cache is OK as long as the effect of a cache miss is equal on both architectures you compare.
But Itanium, with a lower frequency, can be competitive with other architectures as long as it can issue the maximum number of instructions in parallel within one cycle. On fine-tuned computations, you can easily get between 3 and 4 instructions per cycle, if not more.
Here, a 1-cycle stall does not prevent just 1 instruction from being executed, but around 3.
This is one of the reasons to have such a large cache: not only to have an equivalent amount of storage, but to try to get an equivalent penalty as well.
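To give some purely illustrative numbers: if main memory is around 100ns away, that is roughly 150 cycles at 1.5GHz, and at 3 to 4 useful instructions per cycle a single miss that is not hidden costs on the order of 450-600 issue slots. That is exactly the kind of hole the big cache (or a prefetch, or another thread) is there to plug.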
> But Itanium, with a lower frequency, can be
> competitive with other architectures as long as it
> can issue the maximum number of instructions in
> parallel within one cycle. On fine-tuned
> computations, you can easily get between 3 and 4
> instructions per cycle, if not more.
True… The flip side is that since the Itanium is lower freq you lose *fewer* cycles on a stall than on a narrower, faster architecture, so you don’t necessarily waste more issue slots in total.
I’d argue the *real* penalty for the Itanium is not the issue width but that it can’t use dynamic execution to schedule around these stalls.
Not having dynamic execution hardware saves you some transistors but ironically quite probably not as much as the cache costs. Of course, eliminating the dynamic execution hardware also gives you a simpler, (potentially) faster core…
You do have a very good point here. The Itanium architects have picked up on this – to a certain extent. There are some neat features in the instruction set for explicit cache prefetching, which (used correctly) can prevent such stalls entirely (even a large cache won’t prevent a first-access stall).
Interestingly, the Montecito architects seem to have picked up on this too and have introduced fine-grained multithreading. Unlike the hyperthreading model, the threads aren’t truly concurrent but it does mean that when one thread stalls (e.g. on cache miss) the other one can execute and keep the hardware busy.
> This is one of the reasons to have such a large
> cache: not only to have an equivalent amount of
> storage, but to try to get an equivalent penalty as
> well.
Yep, that’s true. But (referring back to the earlier poster) although the Itanium needs a large cache to reach its potential issue rate, that doesn’t mean the cache isn’t worth something if the overall throughput is suitably high. The key economic question (IMHO) is whether it’s better to have an architecture like IA64 and push scheduling decisions into the compiler or whether it’s better to just have dynamic execution to hide latencies etc.
Itanium and its history show that massive research does not always guarantee massive profits. Itaniums will be sold 10 years from now, and in increasing numbers. You cannot look at it from the low-end point of view… Intel has already spent the money; having the Itanium in the arsenal is only an advantage at this point. It is like a research lab that also sells its product to the marketplace. It is true that those who bought some of the early Itaniums were subsidizing the research costs and getting poorer value than they could have gotten, but then again, it is not that simple, and the engineers programming for IA64 have gained valuable skills for the next-generation Itaniums.
I don’t know much about the Itanium as a chip other than that it apparently handled 80x86-like 64-bit processing in a new and different way (or at least, such that 32-bit 80x86 code wasn’t nice and fast compared to AMD or 32-bit 80x86 processors).
Maybe it DOES suck for other reasons, but what with the Alpha and PA-RISC dying, Apple abandoning the PowerPC and all, I don’t like this trend of processor architectures being discarded because they’re not 80x86. OK, so it’s a good architecture. But is it really the be-all and end-all?
At least there’s SPARC and PowerPC for servers/mainframes and ARM for smaller systems, and PowerPC for gaming… but it seems to be shrinking. 80x86-compatible or die?
Yes, I’m probably unduly worried. Like I said, I know practically nothing about the comparative merits of these architectures and Itanium could have been a fetid piece of… well, yeah. I just have that bad feeling.
80x86. OK, so it’s a good architecture.
No, it’s not. It wasn’t great when it first came out with all its special-purpose registers and brain-dead segmenting model, and generation after generation of kludgy add-ons have just added to the mess.
But it’s adequate.
More importantly though, it’s backwards compatible, and it seems the market just can’t get enough of that, no matter whether anyone really still runs all that old software.
I’m not convinced the instruction cache really does need to be bigger to support the parallelism but I could perhaps be convinced…
I may be wrong, but if I remember correctly, Montecito will have separate caches for instructions and data at L2 (and not only L1), to limit the amount of instructions in the cache. This is consistent with what you think.
Interesting… I think a split L2 is a relatively rare architectural feature. IIRC, the instruction cache portion is a bit bigger than the data portion (can’t remember the exact numbers). You’d expect it to be at least a bit bigger due to the larger instruction encoding… I guess for Montecito it’s also not so expensive to miss in L2, given the L3 is so mind-bogglingly huge 😀
Alpha is crap and should be melted down for scrap.
Itanium is just the fastest processor out there.
Yeah, those SPECmarks are just propaganda and don’t mean anything.
Keep spreading your lies and keep astroturfing, you marketing FUD fools.