Linked by Thom Holwerda on Mon 31st Mar 2014 21:35 UTC
Apple

AnandTech on Apple's A7 processor:

I suspect Apple has more tricks up its sleeve than that however. Swift and Cyclone were two tocks in a row by Intel's definition, a third in 3 years would be unusual but not impossible (Intel sort of committed to doing the same with Saltwell/Silvermont/Airmont in 2012 - 2014).

Looking at Cyclone makes one thing very clear: the rest of the players in the ultra mobile CPU space didn't aim high enough. I wonder what happens next round.

This is one area where Apple really took everyone by surprise recently. When people talk about Apple losing its taste for disruption, they usually disregard the things they do not understand - such as hardcore processor design.

RE: Comment by judgen
by Alfman on Wed 2nd Apr 2014 06:18 UTC in reply to "Comment by judgen"

judgen,


Could anyone explain why Apple went with a much longer pipeline design? I mean, standard ARMv8 Cortex designs have 8 pipeline stages and the Apple A7 has 14. Would this not mean that when a mispredicted branch flushes the pipeline, it takes longer to refill, causing performance decreases?


This enters very "opinionated" territory ;)

There's obviously a trade-off between lengthening the pipeline to increase parallelism and increasing the risk and cost of branch misprediction. The article suggests this architecture increases the misprediction penalty by 0-19%, and it says nothing about the misprediction frequency (which depends a lot on the software in use). The idea is for these negatives to be offset by the additional parallelism.
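
To make that trade-off concrete, here's a minimal back-of-envelope model. All the workload numbers are made up for illustration; only the pipeline depths (8 vs. 14) come from the question above:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical workload parameters, purely illustrative. */
        double branch_freq = 0.20;      /* ~1 in 5 instructions is a branch */
        double mispredict  = 0.05;      /* 5% of those branches mispredict  */
        int    depths[]    = { 8, 14 }; /* 8-stage vs. 14-stage pipeline    */

        for (int i = 0; i < 2; i++) {
            /* Assume a flush costs roughly one cycle per pipeline stage. */
            double stall = branch_freq * mispredict * depths[i];
            printf("%2d stages: ~%.2f stall cycles per instruction\n",
                   depths[i], stall);
        }
        return 0;
    }

Under these made-up numbers the deeper pipeline loses about 0.14 versus 0.08 cycles per instruction to flushes, and that gap is exactly what better prediction and wider issue have to buy back.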

I think the compiler could probably do a better job of scheduling execution units, even beyond the CPU's pipeline, and with fewer mispredictions: the CPU is forced to schedule on the fly, while the compiler is far less constrained and can do a much more comprehensive analysis. The transistor savings from removing that complexity would translate into either lower power consumption or more parallel execution units, depending on how you want to look at it. Either way it's a win!
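
As a rough sketch of what compiler-side scheduling looks like at the source level (the function names and the four-way split are just illustrative choices), restructuring a loop into independent chains hands the parallelism to the compiler instead of the hardware:

    #include <stddef.h>

    /* Naive reduction: every add depends on the previous one, so any
     * parallelism has to be rediscovered by the hardware at runtime. */
    double sum_naive(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Same reduction restructured into four independent chains that a
     * compiler can map onto parallel execution units ahead of time.
     * (Assumes n is divisible by 4; note that reassociating floating-
     * point adds can change rounding slightly.) */
    double sum_unrolled(const double *a, size_t n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (size_t i = 0; i < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }

The catch is that the right split depends on how many execution units the target core actually has, which is exactly the distribution problem below.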


However, a pretty big problem with this is the way we distribute software in practice: generically precompiled, and expected to run unmodified on different versions of a CPU. Since a static schedule is specific to one CPU model, that leaves very little room for future CPUs to add execution units and for existing code to take advantage of them. Competing CPUs would be problematic too, since code would end up optimized for one or the other, but not both at the same time.

One way to get around this is to distribute all software in an intermediary form and compile it on the target machine, exactly for that machine's execution units and with exactly the right schedule. But for better or worse, CPUs evolved to the long pipelines we have now.
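
Short of full intermediary-form distribution, a cruder workaround that exists today is to ship a few precompiled variants of the hot code and pick one at startup. A minimal sketch, where cpu_has_wide_backend() is a hypothetical stand-in for a real feature probe (e.g. getauxval(AT_HWCAP) on Linux/ARM):

    #include <stdio.h>

    /* Hypothetical probe; a real one would query the OS or the CPU's
     * feature registers. Hard-wired here just to make the sketch run. */
    static int cpu_has_wide_backend(void) { return 1; /* placeholder */ }

    /* Two builds of the same kernel, each statically scheduled for a
     * different microarchitecture. */
    static void kernel_generic(void) { puts("conservative schedule"); }
    static void kernel_wide(void)    { puts("schedule for a wide core"); }

    int main(void) {
        /* Choose the matching variant once at startup, instead of
         * relying on the hardware to reschedule generic code. */
        void (*kernel)(void) =
            cpu_has_wide_backend() ? kernel_wide : kernel_generic;
        kernel();
        return 0;
    }

This helps, but it still only covers the CPU models the vendor anticipated, which is the crux of the problem above.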
