Linked by Thom Holwerda on Fri 8th Apr 2016 01:11 UTC
Hardware, Embedded Systems

RISC-V is a new general-purpose instruction-set architecture (ISA) that's BSD licensed, extensible, and royalty free. It's clean and modular with a 32-, 64-, or 128-bit integer base and various optional extensions (e.g., floating point). RISC-V is easier to implement than some alternatives - minimal RISC-V cores are roughly half the size of equivalent ARM cores - and the ISA has already gathered some support from the semiconductor industry.

Can I try this?
by ThomasFuhringer on Fri 8th Apr 2016 08:27 UTC
ThomasFuhringer
Member since:
2007-01-25

I could not quite figure out whether there is any hardware under construction, or maybe even already available, that implements it.
What would be the best emulator to try it? QEMU?

Reply Score: 2

RE: Can I try this?
by christian on Fri 8th Apr 2016 09:41 UTC in reply to "Can I try this?"
christian Member since:
2005-07-06

I could not quite figure out whether there is any hardware under construction, or maybe even already available, that implements it.
What would be the best emulator to try it? QEMU?


JavaScript :)
http://riscv.org/angel-simulator/

Reply Score: 2

RE: Can I try this?
by JLF65 on Fri 8th Apr 2016 17:51 UTC in reply to "Can I try this?"
JLF65 Member since:
2005-07-06

If you want to try it in hardware, you can put VectorBlox's ORCA core on a Cyclone FPGA.

https://github.com/VectorBlox/orca

Reply Score: 2

The times we live
by peskanov on Fri 8th Apr 2016 09:04 UTC
peskanov
Member since:
2006-01-15

This architecture is a good example of the times we live in.
Instead of learning from the past and opening some paths toward the future, they just replicate the old MIPS architecture with a few changes and adopt some of the most poisonous trends of the present.

I am talking about that "mix your own architecture" thing. Do you want mul/div in your CPU? Do you want 64- or 128-bit floats?

The biggest roadblock of modern computing is software. You are not helping anybody by leaving half your d***ed architecture undefined and forcing all the tool makers to work their asses off on silly flavours of the same ISA.

Mul and div should be defined in the standard. Even in a minimalist implementation for embedded use, adding slow mul/div instructions costs next to nothing in terms of area/energy.

About the ISA:
- The biggest lesson of the last 20 years is that you want/benefit from branch predication. Even the simplest form of it. Where is branch predication here?

- Why not learn from modern hw/sw problems and shortcomings and try to implement some improvements? The RISC architectures were born following that philosophy.
One operation most modern software performs is array bounds checking. Software does it by adding lots of instructions, but hw could do it without significant cost.
Add bounds checking to memory r/w instructions; let's move the computing world forward, not backward!

There are lots of things that can be done in an open ISA to move forward; let's hope we get some visionaries again, like Cray in the 60s or the ARM team in the 80s.

Reply Score: 4

RE: The times we live
by mmrezaie on Fri 8th Apr 2016 10:17 UTC in reply to "The times we live"
mmrezaie Member since:
2006-05-09

Does anyone know how OpenPOWER could be adopted by an open community, so that it can be used by outsiders and not just heavyweights like IBM? And if it can, why is there so little interest in adopting these ISAs?

Reply Score: 1

RE[2]: The times we live
by Dasher42 on Fri 8th Apr 2016 18:21 UTC in reply to "RE: The times we live"
Dasher42 Member since:
2007-04-05

There's the Tyan Openpower (http://www.tyan.com/solutions/tyan_openpower_system.html) systems, but they're rackmount servers. The POWER architecture is focused on the data center thus far, but I agree, I'd love to see some consumer-grade riffs on it.

Reply Score: 3

RE[3]: The times we live
by Drumhellar on Fri 8th Apr 2016 19:59 UTC in reply to "RE[2]: The times we live"
Drumhellar Member since:
2005-07-12

Or even some workstations!

Reply Score: 2

RE[4]: The times we live
by kokara4a on Mon 11th Apr 2016 07:18 UTC in reply to "RE[3]: The times we live"
kokara4a Member since:
2005-09-16

Or even some workstations!


Like this one?

https://raptorengineeringinc.com/TALOS/prerelease.php

Reply Score: 1

RE: The times we live
by christian on Fri 8th Apr 2016 10:43 UTC in reply to "The times we live"
christian Member since:
2005-07-06

This architecture is a good example of the times we live in.
Instead of learning from the past and opening some paths toward the future, they just replicate the old MIPS architecture with a few changes and adopt some of the most poisonous trends of the present.

I am talking about that "mix your own architecture" thing. Do you want mul/div in your CPU? Do you want 64- or 128-bit floats?

The biggest roadblock of modern computing is software. You are not helping anybody by leaving half your d***ed architecture undefined and forcing all the tool makers to work their asses off on silly flavours of the same ISA.

Mul and div should be defined in the standard. Even in a minimalist implementation for embedded use, adding slow mul/div instructions costs next to nothing in terms of area/energy.


While I partially agree, things like mul/div don't have zero cost. The problem with mul/div is that they're either slow, multi-cycle operations (with the cycle count depending on the operands), or faster operations with a predictable cycle count that require lots of transistors. The former are difficult to pipeline, the latter take up lots of die space, and if mul/div is a marginal operation in a low-end chip, then that's a hit that may not be worth taking. The same goes for MMUs: some applications just don't need the expense of a complex paged MMU.


About the ISA:
- The biggest lesson of the last 20 years is that you want/benefit from branch predication. Even the simplest form of it. Where is branch predication here?


Because it takes up instruction encoding space? With modern static and dynamic branch prediction, the pipeline bubbles that predication prevents are not as much of a problem, and with technologies such as hyper-threading, pipelines can be kept plenty busy anyway.

ARM paid a price for its predicate bits: it only has 16 registers available in the instruction encoding. While 16 registers are better than the 8 x86 traditionally had, the 32 registers used by other RISC ISAs have arguably proved more valuable, and indeed 64-bit ARM, with its 31 registers, has mostly dropped predication.


- Why not learn from modern hw/sw problems and shortcomings and try to implement some improvements? The RISC architectures were born following that philosophy.
One operation most modern software performs is array bounds checking. Software does it by adding lots of instructions, but hw could do it without significant cost.
Add bounds checking to memory r/w instructions; let's move the computing world forward, not backward!



Bounds checking is a function of the language definition and its implementation at the compiler level. Compilers can easily do it with either complex or simple instructions, and require no help from the ISA.

Array bounds checking is the sort of operation that goes against the RISC philosophy. The operation itself requires you to provide not only the index you want, but also the bounds and the unit size. While the latter is easy for built-in types, it's not so easy for complex structures.

RISC resulted when people realised that single-instruction CISC operations, like bounds-checked indexed addressing modes, resulted in complex hardware that functioned no faster than the multiple simpler instructions that achieve the same thing on a RISC. RISC gets a lot of benefit from the instruction encoding being simple and fixed length, and while those benefits were certainly more prevalent in mainstream CPUs of the 80s/90s, the same benefits are just as valid at the low end now. In fact, even CISC processors often perform better using multiple simple instructions rather than single complex instructions. That's why, for example, x86 compilers no longer use instructions such as enter/leave for the function prologue/epilogue, and instead do the equivalent operations using simple mov/add.

Modern x86/x64, which is now basically the only prevalent CISC architecture still in use (we'll ignore low-volume architectures like IBM z Systems), gets the best of both worlds in that it has the dense CISC instruction encoding but breaks it down into more RISC-like pipeline operations on the fly. It pays a big price for this in decoding and execution complexity, though. I'd love to see what the big ARM vendors could do with an R&D and manufacturing budget like Intel's.



There are lots of things that can be done in an open ISA to move forward; let's hope we get some visionaries again, like Cray in the 60s or the ARM team in the 80s.


RISC-V has corrected many of the original RISC mistakes. The main one IMO is the removal of those goddam instruction delay slots. Delay slots were an implementation detail that leaked into the ISA, and were a big mistake in the long run. In fact, I dare say they were regretted even before the end of the 80's, when pipelines started getting longer than the 3/4/5 stages that work for delay slots.

And register windows! Coming from Berkeley, I worried that RISC-V might have register windows like Berkeley RISC did. But they've learnt that lesson as well, and defined the standard set of fixed registers, leaving the ABI and compilers to get the most out of those registers.

Reply Score: 12

RE[2]: The times we live
by tylerdurden on Fri 8th Apr 2016 19:10 UTC in reply to "RE: The times we live"
tylerdurden Member since:
2009-03-17



RISC-V has corrected many of the original RISC mistakes. The main one IMO is the removal of those goddam instruction delay slots. Delay slots were an implementation detail that leaked into the ISA, and were a big mistake in the long run. In fact, I dare say they were regretted even before the end of the 80's, when pipelines started getting longer than the 3/4/5 stages that work for delay slots.

And register windows! Coming from Berkeley, I worried that RISC-V might have register windows like Berkeley RISC did. But they've learnt that lesson as well, and defined the standard set of fixed registers, leaving the ABI and compilers to get the most out of those registers.


A poster on this site with even a remote understanding of computer architecture? Thank you, thank you, thank you!

I would also like to add that, at this point, the RISC/CISC monikers are mostly irrelevant/useless distinctions. It is also time for people to stop coupling/confusing the ISA with the microarchitecture.

Reply Score: 3

RE[2]: The times we live
by peskanov on Fri 8th Apr 2016 19:54 UTC in reply to "RE: The times we live"
peskanov Member since:
2006-01-15

christian,

While I partially agree, things like mul/div don't have zero cost…


Well, nothing has zero cost. The thing is, you always need a mechanism to stall the pipeline anyway, just for memory access. Stalling it for a multi-cycle mul/div does not have a significant cost... especially compared with NOT having these instructions!
Even worse, when I work on embedded projects, even very small ones, if I don't have mul/div I end up wasting memory on algorithms and/or tables to compute them.
I don't think any remaining embedded 32-bit CPU ships today without mul. And ARM ships without div (except the Cortex-M4) for some strange religious reason...
Because it takes up instruction encoding space? With modern static and dynamic branch

It doesn't have to. Not all predication has to be implemented the ARM way; just look at the Thumb way, for example.
Also, those 4 bits spent on the predicate in the ARM ISA (which is an extreme case) are not the reason ARM used only 16 registers. The shifter operands take up a lot more encoding space, for example!
I am pretty sure the 16-register decision was based on gate/area budgets.

About dynamic branching: did you know that in the videogame industry (not so much in embedded) we simulate predication using ">>31" and "&"?
That's because dynamic branching is never as good as simple, forward-calculated expressions, and also because by using predication you unload the branch predictors from trivial operations (like max, min, abs...) and let them work on the more important branches.
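
To make that concrete, here is a minimal sketch of the trick in C++ (the function names are mine; it assumes 32-bit two's-complement ints and an arithmetic right shift, which is only implementation-defined in C++, as discussed further down the thread):

#include <cstdint>

// Branchless min using the ">>31 and &" trick: (a - b) >> 31 propagates the sign
// bit into an all-ones or all-zero mask (assuming an arithmetic right shift).
int32_t branchless_min(int32_t a, int32_t b) {
    int32_t diff = a - b;      // negative iff a < b (ignoring overflow)
    int32_t mask = diff >> 31; // all ones if a < b, zero otherwise
    return b + (diff & mask);  // b + (a - b) = a when a < b, else b
}

// Branchless abs with the same mask idea.
int32_t branchless_abs(int32_t x) {
    int32_t mask = x >> 31;    // all ones if x is negative
    return (x ^ mask) - mask;  // two's-complement negation when negative
}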


Bounds checking is a function of the language definition and its implementation at the compiler level. Compilers can easily do it with either complex or simple instructions, and require no help from the ISA.

You can do anything using simple instructions, even the "+" operation. That's not the question here. Bounds checking is such an important operation nowadays that it should be optimized at the hw level.
Improving sw quality requires improving determinism. Right now many C/C++ programs behave differently depending on compiler settings, platform, phase of the moon, etc... and you know the reason: every read and write to a wrong address has completely unintended consequences.
And people insist on using those half-dead languages because the ones that implement bounds checking, like Java, are simply slower.
Is there any sane reason for that? I don't think so, when it could be accelerated so easily in hw.

RISC resulted when people realised that single-instruction CISC operations, like bounds-checked indexed addressing modes, resulted in complex hardware that functioned no faster than the multiple simpler instructions that achieve the same thing on a RISC.

I completely disagree. RISC comes from realizing that operations which don't involve memory can be processed in a nice pipeline with few stalls, mostly because there are no forced load-operate-store cycles.

In fact, an ARM instruction like addlt r0, r1, r2, asl #2 does more work than many CISC instructions. But it can be nicely pipelined.
Other instructions, like rlwimi on PowerPC or "compare and branch" on MIPS, prove my point.

RISC-V has corrected many of the original RISC mistakes. The main one IMO is the removal of those goddam instruction delay slots.


Delay slots were very rare in the RISC world. I can only remember the original MIPS and the Intel i860 having them.
PowerPC, ARM, Hitachi SH, 88k, Alpha... I don't think any of those architectures used delay slots.
So I don't think that's a big improvement, because delay slots were already marginal.

And register windows! Coming from Berkeley, I worried that RISC-V might have register windows like Berkeley RISC did.

I don't have any idea how this could be improved, but it should be. Fast register context switching is very important today. As people know less and less about optimization, they perform function calls inside inner loops very often, and this trend is only growing.
Any modern ISA should take this problem very seriously.

Edited 2016-04-08 19:58 UTC

Reply Score: 2

RE[3]: The times we live
by Carewolf on Fri 8th Apr 2016 20:50 UTC in reply to "RE[2]: The times we live"
Carewolf Member since:
2005-09-08

Alpha had delay slots, but only in the architecture-specific micro-code you could upload to Alpha CPUs. The standard ISA did not have them, and that micro-code was always architecture specific anyway, but it did mean a lot of NOPs in the microcode.

Reply Score: 2

RE[3]: The times we live
by viton on Fri 8th Apr 2016 23:15 UTC in reply to "RE[2]: The times we live"
viton Member since:
2005-08-09

And ARM ships without div (except Cortex M4) for some strange religious reason...
Cortex-A7/A15 and their successors have integer divide.
https://community.arm.com/docs/DOC-8059

Things are simpler without divide.
http://www.righto.com/2015/12/reverse-engineering-arm1-ancestor-of....

just check the ARM thumb way for example.
It is a kludge.

About dynamic branching: do you know in the videogame industry (not so much in embedded) we simulate predication using ">>31" and "&"?

On PPC? =)
x86/64 has CMOV. But it can degrade the perf of your code.

BTW mask propagation (int >> 31) is undefined behaviour in C++. Your code could break.

Because dynamic branching is never as good as simple, forward-calculated expressions.
Usually it is.

And also because by using predication you unload the branch predictors from trivial operations (like max, min, abs...) and let them work on the more important branches.
In normal circumstances values will be in range, so min/max (in a loop) could be better with extra branches.

Bounds checking is such an important operation nowadays that it should be optimized at the hw level.
No, thanks.
Elbrus has bounds checking, if you're interested.

Is there any sane reason for it? I don't think so, when it could be accelerated so easily in hw.
Most of the time the index will be limited/valid. The problem is mainly with ancient languages like C, etc.

Delay slots were very rare in the RISC world. I can only remember the original MIPS and the Intel i860.
The pioneering architectures RISC I/II had delay slots, as did SPARC, PA-RISC, SuperH and various DSPs.

Quote from Wikipedia:

All the other features associated with RISC—branch delay slots, separate instruction and data caches, load/store architecture, large register set, etc


Fast register context switching is very important today.
OoO will handle it. It is not a problem.

Reply Score: 2

RE[4]: The times we live
by peskanov on Fri 8th Apr 2016 23:41 UTC in reply to "RE[3]: The times we live"
peskanov Member since:
2006-01-15

viton,

Cortex-A7/A15 and their successors have integer divide.

I know, I was referring to embedded ARM. The only one I have seen is the Cortex-M4 (if my memory doesn't fail me).
Things are simpler without divide


Yes, the pipeline looks delightful, sure. But in any case, all 32-bit CPUs have to implement multiplication, and then, if they have some common sense, division, because these kinds of CPUs are used to solve problems that normally involve some basic math. And the software solution for mul/div looks much worse than a little irregularity in the hw.

In fact the original ARM was designed without mul, but after a short period they noticed it was a bad idea and added it. It was an afterthought, though, and it does not accept the shifter operand, which is a pity (you cannot do something as common as multiplying by a constant).
David Braben used to tell a story about this, I think.

x86/64 has CMOV. But it can degrade the perf of your code.

CMOV yes, but not the ">>31, &" sequence. That has a quite predictable (and low) cost. A typical OoO core can do 1 or 2 of those sequences per cycle on average, if you can run several in parallel.

BTW mask propagation (int >> 31) is undefined behaviour in C++. Your code could break.

That's not undefined. The size of int is undefined; is that what you meant?
If the int is 32 bits long, >>31 is a perfectly predictable operation.
Elbrus has bound checking if you're interested

I had no idea, thanks. I will check.
Most of the time index will be limited / valid. The problem is mainly in ancient languages like C, etc.


We cannot agree here. A better language will check the bounds, yes, at the cost of adding several instructions and filling several registers.
What's so terrible about the idea of standardizing safe mechanisms and predictable behaviour for something as basic in computing as a random memory access?

OoO will handle it. It is not a problem.


Well, I have been optimizing code for a living and I don't agree with you. Your OoO CPU serves you much better computing useful instructions than shuffling senseless data around all the time.
Everything has a cost; spending hardware on redundant code does too.

Reply Score: 1

RE[5]: The times we live
by viton on Sun 10th Apr 2016 09:42 UTC in reply to "RE[4]: The times we live"
viton Member since:
2005-08-09

I know, I was referring to embedded ARM.
The -R series has divide. But I don't see the point of hw divide in the -M series, which needs to be cheap. A software divider is fine. Dividing by an arbitrary value is a very rare operation.

In fact the original ARM was designed without mul, but after a short period, they noticed it was a bad idea and added it.
Bad idea? ARM1 was designed by two engineers who had never done a CPU before:

we had to keep things simple. And they all worked. We designed a complicated set of chips that all worked first time, and we did that largely by thinking about things very, very carefully beforehand

CMOV yes, but not the ">>31, &" sequence.
It is a replacement for CMOV for older CPUs.

That's not undefined. The size of int is undefined; is that what you meant?
The C++ standard does not define right shifts of negative int values.
OK, to be accurate, it is implementation-defined, not undefined behavior.
Of course, no sane compiler developer will break that informal standard, but it doesn't mean such shifts are guaranteed safe. (I'm using mask propagation too.)

Well, I have been optimizing code for a living and I don't agree with you.
Same for me.

Edited 2016-04-10 09:42 UTC

Reply Score: 2

RE[6]: The times we live
by peskanov on Sun 10th Apr 2016 12:57 UTC in reply to "RE[5]: The times we live"
peskanov Member since:
2006-01-15

viton,

-R series has divide. But I don't see the point of hw divide in -M series. It should be cheap. Software divider is fine. Divide by arbitrary value is a very rare operation


I don't feel the software divider is fine in any ARM flavour. Even most of the A series lack div; I worked with the A8 & A9, which feel like fast CPUs to me until they hit a division. A silly omission which works against ARM in most benchmarks.
Embedded is not so different; many projects perform some heavy math these days, and adding basic division to a Cortex-Mx costs next to nothing.

ARM was the first 32-bit ISA I saw without div. When I saw the sw div implementations linked in automatically by GCC I was quite surprised.
In the beginning I felt omitting div was a good decision, as programmers would be more aware of the high cost of a division.
After a decade, I no longer think the same. Current programmers never check the assembler output, and the divide op just pops up too often when a programmer is not trying to avoid it.
A minimal, hw-implemented integer division would cost 15-30 cycles and take what, a thousand gates?

I can live fine without hw div, but I just think it is the wrong decision. Having a software divide routine written to flash (alongside your program) is no better than having a basic one in hw, and in the end it saves nothing.

Bad idea? ARM1 was designed by 2 engineers who didn't do CPU before

"Bad idea" was an unfortunate expression. I love the work the ARM team did in their ISA; there are many aspects I dislike or find bizarre, but in general terms I love it.
They worked on a low budget of time and transistor count, so I can understand the omission of mul. But that's one of the first thing they added, so I think that shows how important this feature is.

C++ Standard does not define shift of negative int values.
OK, to be accurate it is implementation-defined, not undefined behavior.
Of course, no sane compiler-developer will broke informal standard, but that doesn't mean shifts are safe. (I'm using mask propagation too)


Is it the same in the C99 & C11 standards?
Jesus, this crap never fails to surprise me. I didn't even know the standard allowed for one's-complement implementations of the language.

Same for me.


Cool; let me explain why I said that.
I don't work on optimization anymore, but when I did I used to hit memory speed limits very often. When I finished my work, my inner loops were usually memory bound and further optimization of the implementation was useless.
From time to time I still measure and optimize a few things, mostly in the ARM space, but also on current x86/x64 systems (I don't look at or touch the asm there, though).

I see current ARM CPUs showing times quite similar to my previous experience, but some of my tests on an Intel i5 seem to me more computation bound than memory bound.
For some toy inner loops I reach an IPC of 5-6 (which I think is very high). And there is no redundant code there; all instructions produce work.
If I take any of these efficient loops and add cruft, like calling a function instead of inlining it, etc., efficiency goes down very fast.
It seems to me that every instruction counts in these cases, unlike on the memory-starved CPUs I used to work with.

Reply Score: 2

RE[5]: The times we live
by Megol on Sun 10th Apr 2016 15:27 UTC in reply to "RE[4]: The times we live"
Megol Member since:
2011-04-11


"Things are simpler without divide


Yes, the pipeline looks delightful, sure. But, in every case, all 32 bits CPU have to implement a multiplication. And then, if they have some common sense, a division. Because these kinds of CPUs are used to solve problems that normally imply some basic math. And the software solution to mul/div looks much worse than a little irregularity in the hw.
"

Most divisions can be done with reciprocals, and division by constants is common. If not, the shift-subtract method is easy to program and use.
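
As an illustration (my own sketch, not anyone's library code), a restoring shift-subtract divider for unsigned 32-bit operands looks like this in C++:

#include <cstdint>

// Restoring shift-subtract division: one quotient bit per iteration.
// Assumes divisor != 0; the remainder is returned through 'rem'.
uint32_t shift_sub_div(uint32_t dividend, uint32_t divisor, uint32_t &rem) {
    uint32_t quotient = 0;
    uint64_t remainder = 0;                                    // wide enough for the shifted partial remainder
    for (int i = 31; i >= 0; --i) {
        remainder = (remainder << 1) | ((dividend >> i) & 1);  // bring down the next dividend bit
        if (remainder >= divisor) {                            // trial subtraction succeeds
            remainder -= divisor;
            quotient |= 1u << i;
        }
    }
    rem = static_cast<uint32_t>(remainder);
    return quotient;
}

This is essentially what the compiler's runtime helper (for example GCC's __aeabi_uidiv on ARM targets without a hardware divider) has to do in software, one bit per loop iteration.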


In fact the original ARM was designed without mul, but after a short period, they noticed it was a bad idea and added it. It was an afterthought and it does not include the shifter operand though, that's a pity (cannot do something as common as multiply by a constant).
David Braben used to tell an history about this question, I think.


The multiplication algorithm of the ARM2 used the shifter which would have complicated the use of a shifted operand.


"x86/64 has CMOV. But it can degrade the perf of your code.

CMOV yes, but not the ">>31, &" sequence. That has a quite predictable (an low) cost. A typical OOO can do 1 or 2 of those sequences per cycle as average, if you can run several in parallel.
"

Sign extraction + AND can of course degrade performance if misused, just as using CMOVcc can!

Even on Intel processors with slow CMOVcc execution it is often faster than using bit hacks.
The big problem is code that isn't arranged correctly.

Current Intel processors have a latency of 1 and a throughput of 2 per cycle for CMOVcc. AMD K7-K10 supported CMOVcc with a latency of 1 and a throughput of 3 ops/cycle, and current AMD processors still execute it well.


"BTW mask propagation (int >> 31) is undefined behaviour in C++. Your code could break.

That's not undefined. The size of int is undefined; is that what you meant?
If the int has length of 32 bits, >>31 is a perfectly predictable operation.
Elbrus has bound checking if you're interested

I had no idea, thanks. I will check.
Most of the time index will be limited / valid. The problem is mainly in ancient languages like C, etc.


We cannot agree here. A better language will check the boundary, yes. At a cost of adding several instructions and filling several registers.
What's so terrible about the idea of standarizing safe mechanisms and predictable behaviours for something that is SO basic in computing as a random memory access.
"

It's not terrible but not critical either.

RISC type:

bound_check dest, lower, upper, check_value
branch_cond dest, address

So the pipeline will need to support 3 inputs and 1 output - it isn't free. Until recently, Intel executed e.g. ADC (add with carry) as two µops because they didn't want to add 3-input µops even for that limited case.

Or perhaps (not as RISCy):

bound_check lower, upper, check_value
branch_bounds_check address

This would require some type of condition code, something that isn't free. Still needs 3 inputs but could reuse a standard instruction format (with one input replacing the destination register).

Or even (really non-RISC):

bound_check_branch lower, upper, check_value

Where a failed bounds check would branch to an address specified elsewhere.

But you perhaps have another idea how to do it?


"OoO will handle it. It is not a problem.


Well, I have been optimizing code for a living and I don't agree with you. Your OOO cpu serves you much better computing useful instructions than moving senseless data all the time.
Everything has a cost; using hardware for redundant code too.
"

Adding hardware and instructions isn't free.

Reply Score: 1

RE[6]: The times we live
by peskanov on Sun 10th Apr 2016 21:52 UTC in reply to "RE[5]: The times we live"
peskanov Member since:
2006-01-15

Megol,

Most divisions can be done with reciprocals, and division by constants is common. If not, the shift-subtract method is easy to program and use.


I know; in the early nineties I even tried the old "work in logarithms" method for div and mul.
The question is: is there a sane reason for it? I say there is not, and nearly all 32-bit CPU producers from the late eighties until now seem to agree.

Everything can be done in sw if your machine is Turing complete. That does not mean it makes any sense.
Mul and div belong in hw in a 32-bit RISC-type CPU design. They just fit the footprint and the applications.

The multiplication algorithm of the ARM2 used the shifter which would have complicated the use of a shifted operand

Yeah, but the advantages completely bury the disadvantages. It's a pity they didn't make the extra effort to implement a div, no matter how slow.

Current Intel processors have a latency of 1 and throughput of 2 for CMOVcc. AMD K7-K10 supported CMOVcc with latency 1 and a throughput of 3 ops/cycle, current AMD processors still execute them well.

Given enough registers, a modern Intel OoO core should eat >>/& sequences like peanuts. However, as I said in another post, I have little real experience with x86 optimization.
In any case, what I would really like (better than predication based on CR flags) is some kind of "cmp_and_select" instruction that could do what we (poorly) accomplish with shifting and masking.
But you perhaps have another idea how to do it?

Yes, I suggested an implementation a few posts ago. Check it out:

http://www.osnews.com/thread?627165
(sorry, I don't know how to link here)

Checking would be implemented in the memory access instruction itself.
It could be something like this:
load_word -> unchecked access
check_load_word -> checked access, using a special-format pointer. Failure in the check would return 0 as the memory value or raise an exception, depending on configuration.

Aside from the Elbrus a previous poster referenced, I know some modern DSPs perform bounds checking in hw, but I can't remember the design I saw. It would be nice to find it again.

Adding hardware and instructions isn't free

Nothing is; software divide is costly in many respects, mostly in time.
I align with the CPU producers here: div and mul instructions belong to the set of tasks 32-bit RISC CPUs solve, and implementing them makes a better product for everybody.

Edited 2016-04-10 22:06 UTC

Reply Score: 1

RE[3]: The times we live
by cjcoats on Mon 11th Apr 2016 11:38 UTC in reply to "RE[2]: The times we live"
cjcoats Member since:
2006-04-16

christian,
[snip...]
You can do anything using simple instructions, even the "+" operation. That's not the question here. Bounds checking is such an important operation nowadays that it should be optimized at the hw level.
[snip...]

Wrong.
In Real Code® there will usually be many related bounds to check, and it is well worthwhile optimizing the common parts.

Reply Score: 1

RE[4]: The times we live
by peskanov on Mon 11th Apr 2016 19:33 UTC in reply to "RE[3]: The times we live"
peskanov Member since:
2006-01-15

cjcoats,
I am just a common programmer, and I only work on "Real Code".

Infinite hours have been spent by me and my colleagues chasing silly memory corruptions. C++ usually suffers less from them, but boy, when a heavily templated, complex C++ program corrupts memory or works with uninitialized data, prepare some coffee because you are working late that night.

You should explain to me why you consider bounds checking unworthy of improvement/acceleration, because I cannot read your mind.
I would happily pay to have Go run at the same speed as C, so I could bury that crap forever.

Reply Score: 1

RE: The times we live
by tylerdurden on Fri 8th Apr 2016 15:33 UTC in reply to "The times we live"
tylerdurden Member since:
2009-03-17


- The biggest lesson of the last 20 years is that you want/benefit from branch predication. Even the simplest form of it. Where is branch predication here?


In the words of Pauli: "not even wrong..."

Reply Score: 3

RE[2]: The times we live
by peskanov on Fri 8th Apr 2016 19:58 UTC in reply to "RE: The times we live"
peskanov Member since:
2006-01-15

tylerdurden,
Aside from trolling, do you have anything to argue?

Some form of predication is included in nearly every new commercial architecture out there, because it's really an efficient and straightforward way of implementing simple dichotomies, one that works well in all microarchitectures.

Edited 2016-04-08 19:59 UTC

Reply Score: 1

RE[3]: The times we live
by Carewolf on Fri 8th Apr 2016 20:53 UTC in reply to "RE[2]: The times we live"
Carewolf Member since:
2005-09-08

No, it is actually deprecated these days.

Predication has costs, like stalling the pipeline while waiting on all dependent registers and not being able to use branch prediction. On modern processors, normal jump branches are faster in all cases except those that are weighted exactly 50/50 on each branch.

Edited 2016-04-08 20:53 UTC

Reply Score: 2

RE[4]: The times we live
by peskanov on Fri 8th Apr 2016 21:10 UTC in reply to "RE[3]: The times we live"
peskanov Member since:
2006-01-15

Carewolf,
I am defending the general concept of predication here, especially for arithmetic instructions, which can be implemented without any special register file or adding any extra stall possibility to a pipeline.
This is a common mechanism in modern DSPs, and even in GPUs.

I don't know why you say predication causes stalls; maybe you are referring to a specific microarchitecture?
Link us some reference for further discussion.

Reply Score: 1

RE[4]: The times we live
by Megol on Sun 10th Apr 2016 14:27 UTC in reply to "RE[3]: The times we live"
Megol Member since:
2011-04-11

No, it is actually deprecated these days.

Predication has costs, like stalling the pipeline while waiting on all dependent registers and not being able to use branch prediction. On modern processors, normal jump branches are faster in all cases except those that are weighted exactly 50/50 on each branch.


Really? Remember that a mispredicted branch has an overhead of ~16 cycles, so even if the prediction hardware is correct 75% of the time the average overhead per branch is 0.25*16 = 4 cycles. In many cases this is worse than an optimized branchless routine.
50% correct prediction would mean an average overhead of 8 cycles, which for a realistic core and code (superscalar execution with an IPC of 2) means at least 16 instructions' worth of work lost just to the overhead!

Reply Score: 1

RE[5]: The times we live
by tylerdurden on Sun 10th Apr 2016 23:02 UTC in reply to "RE[4]: The times we live"
tylerdurden Member since:
2009-03-17

Prediction and Predication are not the same thing.
Most dynamic HW branch predictors in modern microarchitectures have significantly higher hit rates on average (95% +) FWIW.

Edited 2016-04-10 23:09 UTC

Reply Score: 2

RE[6]: The times we live
by Alfman on Mon 11th Apr 2016 00:46 UTC in reply to "RE[5]: The times we live"
Alfman Member since:
2011-01-28

tylerdurden,

Most dynamic HW branch predictors in modern microarchitectures have significantly higher hit rates on average (95% +) FWIW.



I'm curious about this, do you have a link?

In this paper they got Intel XScale branch prediction rates of 86-94%; obviously it depends on the algorithms, so I'm trying to find more sources to see if they agree with each other.
http://www.ics.uci.edu/~alexv/Papers/iccd03.pdf


The first answer to this SO question goes into some detail demonstrating that the negative consequences of misprediction can be severe on a Core i7 when unpredictable random data is used.

http://stackoverflow.com/questions/11227809/why-is-processing-a-sor...

His benchmarks show the predicated version using a conditional move is 6 times faster than the branching version when the data is random (not predictable), which supports peskanov's assertion that predicates can be beneficial. However, I'm still looking for more papers on the topic. Many people have a "gut feeling", but I'd like to find a paper that publishes benchmarks for many typical workloads.

Reply Score: 2

RE[7]: The times we live
by Alfman on Mon 11th Apr 2016 02:06 UTC in reply to "RE[6]: The times we live"
Alfman Member since:
2011-01-28

His benchmarks show the predicate version using a conditional move is 6 times faster than the branching version when the data is random (not predictable),


Oops, I should have said about 4 times faster.

Anyone following along with this discussion might find these papers interesting:

This next paper suggests that indirect branch prediction accuracy (i.e. when jumping through pointers instead of to immediate addresses) is worse than for direct branches.
http://www.eecg.toronto.edu/~steffan/carg/readings/optimizing-indir...


This paper has painstaking details about branch prediction on x86 CPUs.
http://www.agner.org/optimize/microarchitecture.pdf

Though it should not be surprising that this is possible, they show a worst-case example where branch prediction can mispredict alternating branches 100% of the time (i.e. worse than random). I'm still looking for more research that reveals prediction rates in "typical" programs.

Reply Score: 2

RE[7]: The times we live
by tylerdurden on Mon 11th Apr 2016 17:20 UTC in reply to "RE[6]: The times we live"
tylerdurden Member since:
2009-03-17

Sure thing, check out the data on some relatively modern predictor like COLT:

https://pdfs.semanticscholar.org/895e/8e11f3ba1fa1356e2696ceea4dd0bc...

In computer architecture, you are supposed to optimize for the common case. There are indeed some very bad corner cases that can trash your predictor, but on average the predictor works out as a positive element in the microarchitecture.

Reply Score: 2

RE[3]: The times we live
by tylerdurden on Fri 8th Apr 2016 21:27 UTC in reply to "RE[2]: The times we live"
tylerdurden Member since:
2009-03-17

For me to be trolling, you should have written something that was remotely correct.

Branch predication has been proven to be a REALLY BAD IDEA in terms of power consumption and complexity for general architectures, especially when implemented fully. E.g. predication is one of the main reasons why Itaniums ended up being furnaces.

I concede that very limited forms of predication have been used in a few speculative microarchitectures. But even ARM is dropping predication at this point; it's not worth the complexity for the little performance improvement it offers in the real world.

So, given the areas of application for this project (embedded and SoCs), predication makes little sense.


FWIW, there were a lot of misconceptions in your post regarding microarchitecture; I just used Pauli's succinct words that get to the point.

Reply Score: 2

BP dismissed...
by dionicio on Sat 9th Apr 2016 13:37 UTC in reply to "RE[3]: The times we live"
dionicio Member since:
2006-07-12

Until Power micro-management evolves.

Reply Score: 2

RE: The times we live
by Alfman on Fri 8th Apr 2016 17:35 UTC in reply to "The times we live"
Alfman Member since:
2011-01-28

peskanov,

Mul and div should be defined in the standard. Even in a minimalist implementation for embedded use, adding slow mul/div instructions costs next to nothing in terms of area/energy.


I agree, these operations are so fundamental and cheap these days they ought to be supported everywhere.

One operation most modern software performs is array bounds checking. Software does it by adding lots of instructions, but hw could do it without significant cost.
Add bounds checking to memory r/w instructions; let's move the computing world forward, not backward!



I think that's really more of a programming language problem. C/C++ are especially guilty of failing to address bounds checking and overflow problems; C doesn't even expose the overflow flags the CPU gives us. However, not all languages have these shortcomings. Rust, for example, has zero-cost bounds checking because the checks are proven through static code analysis at compile time; runtime bounds checking is only needed when the compiler doesn't know the bounds at compile time.


If we really wanted to have special ISA support for bounds checking, what exactly would that look like? Every single memory access could have a start and end range, but where would these ranges come from? The CPU doesn't know a thing about data structures. Even simple things like strings can have an unknown length that is stored in the data structure itself. The only practical thing to do would be for the compiler to convey structure bounds information to the CPU; however, if the compiler knows the correct bounds then it really has no business generating code that exceeds those bounds in the first place. So I'm having trouble envisioning a scenario where ISA bounds checking wouldn't be redundant.



About the ISA:
- The biggest lesson of the last 20 years is that you want/benefit from branch predication. Even the simplest form of it. Where is branch predication here?


I agree adding branch prediction can help performance, but it also adds significant complexity. The moment you decide to go this route, you need more silicon to keep track of branch statistics, more silicon to implement transactional registers that can be rolled back, more silicon to detect branch misprediction, etc. It comes with a lot more complexity and bug potential.

The performance benefits might still justify all the additional complexity, but usually a reference implementation is all about correctness and simplicity. Is there a compelling reason to have this complexity in a reference implementation? Optimized implementations are still possible as long as correctness is maintained.



Edit: I hadn't read christian's post when I posted, oh well, we say many of the same things.

Edited 2016-04-08 17:44 UTC

Reply Score: 2

RE[2]: The times we live
by kwan_e on Fri 8th Apr 2016 18:14 UTC in reply to "RE: The times we live"
kwan_e Member since:
2007-02-18

C/C++ are especially guilty of failing to address bounds checking and overflow problems; C doesn't even expose the overflow flags the CPU gives us. However, not all languages have these shortcomings. Rust, for example, has zero-cost bounds checking because the checks are proven through static code analysis at compile time; runtime bounds checking is only needed when the compiler doesn't know the bounds at compile time.


That's not really much different from C++. If you use a std::array and std::get, you get compile-time bounds checking through the type system without even static code analysis. You also get the same with std::tuple for index and type access. If you pass an array reference to a templated function that deduces array references, then the compiler will also warn you if you access something outside of the deduced range.
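
For example, a quick sketch of that compile-time checking (C++11 or later; the variable names are just for illustration):

#include <array>
#include <tuple>

int main() {
    std::array<int, 4> a{1, 2, 3, 4};

    int ok = std::get<3>(a);     // fine: the index is checked against the array size at compile time
    // int bad = std::get<4>(a); // does not compile: an out-of-bounds index is rejected by the type system

    std::tuple<int, double> t{1, 2.0};
    double d = std::get<1>(t);   // tuple access by index is likewise checked at compile time

    return ok + static_cast<int>(d);
}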

The fact that Rust still needs runtime bounds checking for dynamically sized blobs means it's still no better than the languages before it in that regard, since no compiler for any language can do compile-time bounds checking for dynamically sized blobs.

Edited 2016-04-08 18:18 UTC

Reply Score: 2

RE[3]: The times we live
by Alfman on Fri 8th Apr 2016 21:50 UTC in reply to "RE[2]: The times we live"
Alfman Member since:
2011-01-28

kwan_e,

Well, obviously I agree that C/C++ can do everything that Rust can. The difference is that with safe languages you get safety automatically 100% of the time, even across threads. In unsafe languages it's up to the developer to make sure all the code written and used is safe. History shows that as complexity increases, our ability to write and maintain safe code decreases. Thank goodness for tools like Valgrind!

Reply Score: 2

RE[4]: The times we live
by bnolsen on Sat 9th Apr 2016 23:37 UTC in reply to "RE[3]: The times we live"
bnolsen Member since:
2006-01-06

When I ditched Java-style C++ object-oriented programming and started using the STL more, my memory problems pretty much vanished. That was about 15 years ago. The legitimate problems I have now are pretty much due to threading: accidentally passing a loop temporary to a thread, race conditions, improper protection of shared data.

This other safety stuff is way overblown and frankly not really a concern for me, and hasn't been for a very long time.

Reply Score: 2

RE[5]: The times we live
by moondevil on Sun 10th Apr 2016 07:37 UTC in reply to "RE[4]: The times we live"
moondevil Member since:
2005-07-08

Apparently others are not as good as you with their C and C++ skills.

http://www.cve.mitre.org/cgi-bin/cvekey.cgi?keyword=memory

Reply Score: 2

RE[6]: The times we live
by kwan_e on Sun 10th Apr 2016 14:44 UTC in reply to "RE[5]: The times we live"
kwan_e Member since:
2007-02-18

Apparently others are not as good as you with their C and C++ skills.

http://www.cve.mitre.org/cgi-bin/cvekey.cgi?keyword=memory


Quite a few of those are C bugs, not C++. Most people keep writing C++98 or C-with-classes, either through laziness or because someone decided from on high to eschew the latest C++ safety features: mostly those who view C++ as C with classes, but also those who for one reason or another are tied to a particular older compiler version. And there's a lot of C code out there that hasn't been updated, which cannot be the fault of modern C++ features since they weren't available then.

Reply Score: 2

RE[7]: The times we live
by moondevil on Mon 11th Apr 2016 06:59 UTC in reply to "RE[6]: The times we live"
moondevil Member since:
2005-07-08

If they are written in ANSI C++-compatible code that is accepted by any compliant C++ compiler, they are C++ bugs as well, no matter how you rephrase it.

The very fact that those constructs are still valid code in modern C++, in spite of the improvements it brings in terms of security, is a security concern.

Also, 30 years of C and C++ security exploits have more than proven that the average C and C++ developer is not up to the task of writing safe code, no matter how many people keep posting that they are good and only the others write bad code.

Except for the time I spent at CERN, I have never worked at an enterprise where the quality of the C and C++ code was given any value.

Reply Score: 2

RE[2]: The times we live
by peskanov on Fri 8th Apr 2016 20:16 UTC in reply to "RE: The times we live"
peskanov Member since:
2006-01-15

Alfman,
you can't blame everything on the language. A memory-safe language does bear a significant cost in actively checking memory bounds, so that's a tough decision to make.

How would I design a memory bounds checker in hw?
It surely can be done in a lot of ways; here goes my suggestion (there is a rough sketch after the list):

- Implement a pointer+boundary format. It could be a 64-bit one, i.e. 40+24 bits of base/size. For anything bigger than that, let the compiler use sw checking.
- Implement a "soft" reaction to invalid accesses. This would still leave wrong programs wrong, but at least it would be deterministic and wouldn't create havoc in memory. It could work this way:
+ Invalid reads return 0
+ Invalid writes are not performed
+ Any invalid read or write raises a flag for later testing
- Implement a "hard" exception, optional for the user. An exception is what hw already does today, so let's apply the same to all bad memory accesses.

I agree adding branch prediction can help performance. It also adds significant complexity though


This completely depends on the implementation. If you wish, the result of comparisons can be put in common registers and used like any other operand.

Something like:

cmp_set r0,r1,r2 ; compare r0,r1, set resulting flags in r2
mv_gt r0,r1,r2 ; move r1 to r0 depending on r2

That has exactly the same implications for a pipeline as a "sub, mov" sequence.

Reply Score: 1

RE[3]: The times we live
by Alfman on Fri 8th Apr 2016 22:40 UTC in reply to "RE[2]: The times we live"
Alfman Member since:
2011-01-28

peskanov,

you can't blame everything on the language. A memory-safe language does bear a significant cost in actively checking memory bounds, so that's a tough decision to make.



It depends on whether the safety is enforced at run time (like an interpreted language) or at compile time. If you call a class method that does bounds checking and that call gets inlined, the compiler may be able to prove that the bounds check is invariant with regard to the context it was called from. Therefore the check becomes redundant and unnecessary.

For example (sorry about the formatting):
class S {
public:
    unsigned int len;
    char *data;
    // many members
    unsigned int Length() {
        return len;
    }
    char &operator[](unsigned int index) {
        if (index >= len) throw new Exception(...);
        return data[index];
    }
};

S s = "...";
for (int j = 0; j < s.Length(); j++) {
    cout << s[j] << endl;
}


Procedural inlining produces this logic:
for (int j = 0; j < s.Length(); j++) {
    if (j >= s.Length()) throw new Exception();
    cout << s.data[j] << endl;
}


GCC is smart enough to process the bounds check at compile time and determine that it will always pass, therefore when you examine the binary (with optimizations turned on) you'll see that the logical bounds check is completely omitted, because omitting it doesn't change the program semantics. The compiler was able to prove that the bounds check succeeds on each iteration without actually having to run it.


It surely can be done in a lot of ways, here goes my suggestion:...


Your example raises the question: how is it different from a normal segmentation fault? We already get exceptions on bad memory accesses at that level. Make a program that reads or writes a random pointer and you'll see what I mean.


Something like:
cmp_set r0,r1,r2 ; compare r0,r1, set resulting flags in r2
mv_gt r0,r1,r2 ; move r1 to r0 depending on r2



I see. Sorry, I misread predication as prediction as well. Only a few opcodes in x86 use predication. I don't know how much benefit this offers, I'll need to research it further.

Edited 2016-04-08 22:49 UTC

Reply Score: 2

RE[4]: The times we live
by peskanov on Fri 8th Apr 2016 23:16 UTC in reply to "RE[3]: The times we live"
peskanov Member since:
2006-01-15

Alfman,
yes, sequential memory access can be checked with little cost, and a good HL language will do it for you.

But most heavy-duty inner loops rely on one or several random accesses, i.e. table reads, tree traversals… And random accesses often have to be checked in every single case. The language can't help here.
I consider the operation important enough to warrant some hw support.



Your example raises the question: how is it different from a normal segmentation fault? We already get exceptions on bad memory accesses at that level. Make a program that reads or writes a random pointer and you'll see what I mean.


Easy; you have this data declared in your C program:

static int hash[512] = {0};
static int hash_init = 0;

At some point, at runtime, your statement "hash[i] = first" runs with a value of i = 512.
At that moment, you are possibly destroying the hash_init variable.
Of course, that depends on how your compiler arranges the data segment. You can destroy nearly anything with a random access in C/C++.

A segmentation fault does not protect every single data entity in your program. You can happily write outside boundaries. Sometimes an exception will catch the event, but again, that depends on where you "peek" or "poke". Segmentation faults work at a higher level, covering large areas of memory. What you can read or write inside those areas is basically "total freedom".

Some tools like Valgrind can help you catch unsafe data accesses, but I guarantee you even Valgrind lets a lot of memory errors go unnoticed.

Java/the JVM solves this by just inserting instructions which check the boundaries constantly.
What I suggest: let the CPU provide a fast mechanism to do that. MMU blocks and segmentation faults are not the way to solve this problem.

Reply Score: 1

RE[5]: The times we live
by Alfman on Sat 9th Apr 2016 01:25 UTC in reply to "RE[4]: The times we live"
Alfman Member since:
2011-01-28

peskanov,

yes, sequential memory access can be checked with little cost, and a good HL language will do it for you.

But most heavy-duty inner loops rely on one or several random accesses, i.e. table reads, tree traversals… And random accesses often have to be checked in every single case. The language can't help here.


You say hardware bounds checking would yield a benefit in more complex examples, but I'm not seeing how. Correct code will never go out of bounds and would not benefit from additional bounds checking. Theoretically incorrect code could benefit, and there's no denying a lot of code is incorrect. But incorrect code is inherently a consequence of garbage in, garbage out. How does the CPU obtain the correct bounds when the compiler had the wrong bounds?

I'm having trouble coming up with examples where this would be useful. It would really help me understand if you could provide a specific example.


At some point, at runtime, your statement "hash[i] = first" runs with a value of i = 512.
At that moment, you are possibly destroying the hash_init variable.
Of course, that depends on how your compiler arranges the data segment. You can destroy nearly anything with a random access in C/C++.


Yes, but this is because C is unsafe. A bounds checking ISA doesn't help C unless we also update C's semantics to be bounds checked.


Java/the JVM solves this by just inserting instructions which check the boundaries constantly.
What I suggest: let the CPU provide a fast mechanism to do that.


I would expect that the boundaries are logically checked for every single access, but that they are optimized away in the machine code due to invariant conditions, exactly like my earlier example above. Bounds-check optimizations should still be possible regardless of when the code actually gets compiled. I'm curious, do you have evidence this is not the case?

Reply Score: 2

RE[6]: The times we live
by peskanov on Sun 10th Apr 2016 09:19 UTC in reply to "RE[5]: The times we live"
peskanov Member since:
2006-01-15

Alfman,

I would expect that the boundaries are logically checked for every single access, but that they are optimized away in the machine code due to invariant conditions, exactly like my earlier example above. Bounds-check optimizations should still be possible regardless of when the code actually gets compiled. I'm curious, do you have evidence this is not the case?


You set up a trivial example, a sequential memory access algorithm. I agree, hw bounds checking does not help much there (although it would still be worthwhile imo, for checking short sequences).

Continuous bounds checking is necessary when accessing tables randomly. Something like this (pseudocode):

read_from_disk(mesh_indices)
read_from_disk(vertices)

foreach index in mesh_indices
print (vertices[index])

You cannot know beforehand that all the indices in the mesh_indices array are correctly bounded to the vertices array.

In other words: random table access, which is a basic computing tool, must be checked on every use if you want to ensure deterministic behavior.
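
In C++ terms (a rough sketch with invented names), that means the check sits inside the loop, once per access, and cannot be hoisted out the way a sequential bound can:

#include <cstddef>
#include <cstdio>
#include <vector>

// Every index comes from external data, so every single access needs its own check;
// the compiler cannot prove up front that the accesses are in range.
void print_mesh(const std::vector<int> &mesh_indices,
                const std::vector<float> &vertices) {
    for (int index : mesh_indices) {
        if (index < 0 || static_cast<std::size_t>(index) >= vertices.size()) {
            continue;  // out-of-range index read from disk: skip it instead of corrupting memory
        }
        std::printf("%f\n", vertices[index]);
    }
}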

Reply Score: 1

RE[7]: The times we live
by Alfman on Sun 10th Apr 2016 21:01 UTC in reply to "RE[6]: The times we live"
Alfman Member since:
2011-01-28

peskanov,

I agree, hw bounds checking does not help much there (although it would still be worthwhile imo, for checking short sequences).

Continuous bounds checking is necessary when accessing tables randomly. Something like this (pseudocode):

read_from_disk(mesh_indices)
read_from_disk(vertices)

foreach index in mesh_indices
print (vertices[index])

You cannot know beforehand that all the indices in the mesh_indices array are correctly bounded to the vertices array.



Obviously input should always be bounds checked; that's a given. What I said before applies here too: correct code will never exceed those bounds. Incorrect code can exceed them, but then we know such code is incorrect. How is the CPU supposed to know the correct bounds? Obviously they need to come from the program, which needs to be fixed, but once you've done that it will no longer generate out-of-bounds accesses. So it's not at all clear to me what we'd gain by adding bounds checking in the CPU.

Not having an example of a h/w bounds check that isn't logically redundant in correct code makes me question the need for it.

Reply Score: 2

RE[2]: The times we live
by tylerdurden on Fri 8th Apr 2016 21:29 UTC in reply to "RE: The times we live"
tylerdurden Member since:
2009-03-17

peskanov,

"Mul and div should be defined in the standard. Even in a minimalist implementacion for embedded, adding slow mul/div instructions cost nothing in terms of area/energy.


I agree, these operations are so fundamental and cheap these days they ought to be supported everywhere.
"

https://www.youtube.com/watch?v=31g0YE61PLQ

Reply Score: 2

RE: The times we live
by JLF65 on Fri 8th Apr 2016 18:08 UTC in reply to "The times we live"
JLF65 Member since:
2005-07-06

I am talking about that "mix your our architecture" thing. Dou you want mul/div in your CPU? Do you want 64 o 128 floats?

The biggest roadblock of modern computing is software. You are not helping anybody letting all your d***ed arch. undefined, and forcing all tool makers to work their asses on silly flavours of the same ISA.

Mul and div should be defined in the standard. Even in a minimalist implementacion for embedded, adding slow mul/div instructions cost nothing in terms of area/energy.


Not every application needs hardware mul/div, and quite a few don't need floating point at all. It's not the optional hardware that makes a modular architecture a pain, it's not having a standard way to detect what is and is not available. RISC-V makes that easy to determine. The rest is just software, and is normally handled automatically by the compiler.
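
For example, a minimal sketch of that detection (assuming code running at a privilege level that can read the misa CSR; ordinary application code would instead rely on the compiler's -march selection, and misa is permitted to read as zero on some implementations):

/* returns nonzero if the M (mul/div) extension is reported present */
static inline int has_muldiv(void)
{
    unsigned long misa;
    /* misa encodes one bit per extension letter: 'A' = bit 0 ... 'Z' = bit 25 */
    __asm__ volatile ("csrr %0, misa" : "=r"(misa));
    return (misa >> ('M' - 'A')) & 1;   /* bit 12 = M extension */
}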

About the ISA:
- The biggest lesson of the last 20 years is that you want/benefit from branch predication. Even the most simplest form of it. Where is branch predication here?


Branch prediction is on the IMPLEMENTATION side, not the specification side.

One operation most modern software does, is array bound checking. Software perform it adding lost of inst., but hw could do it without significant cost.
Add bound checking to memory r/w instructions; let's move the computer world forward and not backward!


Someone else made a great reply to this, so not gonna bother other than to repeat that this is silly to implement in hardware and wouldn't make a difference other than to the price and complexity of the hardware.

Reply Score: 2

RE[2]: The times we live
by peskanov on Fri 8th Apr 2016 20:24 UTC in reply to "RE: The times we live"
peskanov Member since:
2006-01-15

JLF65,
show me a use for a 32-bit architecture that does not benefit *immensely* from having mul/div.
Better, show me some 32-bit architecture in use today which does not have *at least* a mul instruction.

Branch prediction is on the IMPLEMENTATION side, not the specification side.


Branch prediction and "predication" are not the same, smart guy. Try to understand what you read.

Someone else made a great reply to this, so not gonna bother other than to repeat that this is silly to implement in hardware and wouldn't make a difference other than to the price and complexity of the hardware


Yeah; emulating ridiculously large CISC instructions (which x86 compilers do not even produce anymore) is a perfectly reasonable use of hw.
But memory bounds checking, a basic and integral part of modern computing, is not. That would be a waste. Because a subtraction and a mask operation (all that's needed for it) would be too costly.
Ok.

Reply Score: 1

RE[3]: The times we live
by JLF65 on Fri 8th Apr 2016 20:41 UTC in reply to "RE[2]: The times we live"
JLF65 Member since:
2005-07-06

No need for mul/div? A toaster, a washing machine, a simple clock... any number of common items that today use a simple chip instead of gears or crap.

And sorry, I read that as branch prediction every time, even when proof-reading my reply. Weird. The primary advantage of branch PREDICATION (got it this time ;) ) is making pipelining and caching slightly easier/better. In other words, it's an IMPLEMENTATION thing again. You don't want implementation-related features creeping into your specs, or later improvements in implementation become harder or render parts of your spec unusable. It SEEMS like a good idea, but only for certain kinds of implementations.

Reply Score: 2

RE[4]: The times we live
by peskanov on Fri 8th Apr 2016 20:56 UTC in reply to "RE[3]: The times we live"
peskanov Member since:
2006-01-15

JLF65,
the topic of this thread is the definition of a 32/64/128-bit ISA.
Please tell me which toaster needs a 32-bit ISA.

You use a 32-bit ISA to solve problems with a minimum of complexity, which nearly always involves mul and div.
I don't care that a PIC lacks mul; but a CPU with a 32-bit ISA should have it!
And I do work in embedded (not full time), so I should know.

About predication: you don't understand predication. It's defined at the instruction level, not at the implementation level.

This is predication:

; if (a<0) a = -a

cmp r0, #0        ; compare a with zero, setting the condition flags
rsblt r0, r0, #0  ; executes only when "less than": r0 = 0 - r0

Do you see the "lt" suffix? That is the part of the instruction which performs the predication.
It's defined in the ISA; that's predication.
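
The same idiom in C, for reference (a sketch; a compiler targeting an ISA with predication, or a conditional-move instruction, can lower this without emitting a branch):

/* absolute value written as a conditional expression rather than an if/branch */
int abs_int(int a)
{
    return (a < 0) ? -a : a;
}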

Reply Score: 1

RE[5]: The times we live
by JLF65 on Sat 9th Apr 2016 00:21 UTC in reply to "RE[4]: The times we live"
JLF65 Member since:
2005-07-06

I fully understand predication. CMOV is a common example. It's part of the ISA, but a BAD part. You're making implementation specific optimizations part of the instruction set, and what happens when those optimizations are obsolete? Or another optimization makes it harder to implement? Again, it SEEMS like a good idea on the surface, but it's moving implementation specific design into the instruction set where it has no place.

And who cares if 32 bits is overkill for a toaster? It's a TOTALLY OPEN SOURCE processor you can stick into an ASIC or FPGA along with everything else you want your toaster to do without having to pay anyone else a dime. For example, a PIC chip would handle a toaster, but then you're paying them for their processor. You could use a 6800 MPU, but then you're paying Freescale for their processor. If an ASIC is cheaper, this gives you a TOTALLY free CPU to use.

Reply Score: 2

RE[6]: The times we live
by agentj on Sat 9th Apr 2016 08:17 UTC in reply to "RE[5]: The times we live"
agentj Member since:
2005-08-19

Why would a 32-bit core be overkill? 32-bit CPUs are so cheap and so well supported by good-quality toolchains and IDEs these days. If you need to use assembly code to squeeze the last byte out of some chip (for reasons not related to pricing, power or performance constraints), you chose the wrong chip. They are also easily available in tiny BGA packages.

Edited 2016-04-09 08:18 UTC

Reply Score: 2

RE[6]: The times we live
by peskanov on Sun 10th Apr 2016 09:11 UTC in reply to "RE[5]: The times we live"
peskanov Member since:
2006-01-15

JLF65,

I fully understand predication. CMOV is a common example. It's part of the ISA, but a BAD part. You're making implementation specific optimizations part of the instruction set, and what happens when those optimizations are obsolete? Or another optimization makes it harder to implement? Again, it SEEMS like a good idea on the surface, but it's moving implementation specific design into the instruction set where it has no place.

I have no clue what you mean here. Intel never removes instructions from their ISA. If you think new instructions should not be used, that's up to you, at least in this CPU family. Good luck convincing the compilers, though; they could have a different opinion.

Following your reasoning, nobody should compile FP code for SSE and should instead stick to the original stack-based FPU, which is a total disaster.
BTW, I will readily acknowledge I am no x86/x64 expert. My defense of a predication mechanism or equivalents is not tied to the CMOV instruction, which I have never used and barely know.

And who cares if 32 bits is overkill for a toaster? It's a TOTALLY OPEN SOURCE processor you can stick into an ASIC or FPGA along with everything else you want your toaster to do without having to pay anyone else a dime. For example, a PIC chip would handle a toaster, but then you're paying them for their processor.


You cannot have it both ways.
Either you care about minimal footprint and cost (no 32-bit CPU in the toaster), or you don't and will accept anything at a reasonable price that solves the task.
If you already have a big 32-bit register bank and a RISC pipeline, integrating a minimal mul/div adds little to the footprint, and the functionality is worth it; history shows it.
That's a fact, and good luck finding a 32-bit CPU without mul on the market.
Why insist on defining a product that nobody wants, open source or not?

You design your ISA with some kind of computing load in mind; the tasks you address with a 32-bit CPU usually involve * and / at a minimum, even in embedded (PIDs or filtering come to mind).
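
As a small illustration (a fixed-point sketch, not tied to any particular chip), even a one-pole low-pass filter update needs a multiply on every sample:

#include <stdint.h>

/* y += alpha * (x - y), with alpha as a Q15 fixed-point coefficient */
int16_t lowpass_step(int16_t x, int16_t y, int16_t alpha_q15)
{
    int32_t diff = (int32_t)x - y;
    return (int16_t)(y + ((diff * alpha_q15) >> 15));
}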

Reply Score: 1

RE[7]: The times we live
by JLF65 on Sun 10th Apr 2016 18:54 UTC in reply to "RE[6]: The times we live"
JLF65 Member since:
2005-07-06

Intel does all kinds of crazy things that they're then stuck with forever to maintain backwards compatibility. It took AMD to make a usable 64-bit extension for the x86 family. I don't hold Intel out as an example of anything other than how to succeed by throwing billions of bucks at a problem rather than engineering a good solution.

And 32-bit is the new 8-bit. Again, who cares if it's overpowered - it's the new minimum standard processor. And RISC-V gives you a nice well designed 32-bit base for free. Hell, even 8-bit is overpowered for something like a toaster, but people used it because it was the minimum standard.

Reply Score: 2

RE[7]: The times we live
by atomic on Mon 11th Apr 2016 04:05 UTC in reply to "RE[6]: The times we live"
atomic Member since:
2014-12-02

IMO any chip manufacturer with the market share of Intel and x86 would take the same approach, which is not to remove any older instructions and things like the A20 gate, just to keep the massive amount of older code running and customers happy. People often talk about finding a better engineering solution for x86, although any other manufacturer, including ARM, would likely do exactly the same thing as Intel in a comparable market position.

Reply Score: 1

Pro-Competition
Member since:
2007-08-20

The RISC-V project seems to be gaining steam. This is a very welcome development IMO. It would be really nice to have a fully open hardware option!

I am curious about how RISC-V compares to other open (or semi-open) RISC ISAs, like MIPS, SPARC, Power, etc. Has anyone done a comparison of all (or some) of these?

(The Genode team's recent report on their efforts with RISC-V in its current state was informative: http://www.osnews.com/story/29138/How_Genode_came_to_the_RISC-V_CPU... .)

Reply Score: 3

Software/Hardware Instructions
by Earl C Pottinger on Fri 8th Apr 2016 21:15 UTC
Earl C Pottinger
Member since:
2008-07-12

What happened to the CPUs where the entire instruction set was defined, but when the hardware was not present in a particular model/version of a CPU, a well-defined subroutine location was called to do the same function?

IE. If you use the multiply instruction and the hardware is there, the hardware does the function; but if the hardware is not there, execution goes to a pre-defined location holding a subroutine that does the function in software.

Done properly, there is not a single bit of difference in the code you run, but the speed it runs at depends on what hardware is present in the CPU.
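
A minimal sketch of that idea for RISC-V (hypothetical names; the trap_frame layout and the way the handler gets invoked are assumptions for illustration, not a real kernel API): an illegal-instruction trap handler decodes the faulting MUL, does the multiply in software, writes the result back, and resumes.

#include <stdint.h>

struct trap_frame {
    uint32_t regs[32];   /* x0..x31 captured at the trap */
    uint32_t pc;         /* address of the faulting instruction */
};

/* called on an illegal-instruction trap (RV32I core without the M extension) */
void handle_illegal_instruction(struct trap_frame *tf)
{
    uint32_t insn = *(const uint32_t *)(uintptr_t)tf->pc;

    /* MUL rd, rs1, rs2: opcode 0x33, funct3 0, funct7 0x01 */
    if ((insn & 0x7f) == 0x33 &&
        ((insn >> 12) & 0x7) == 0 &&
        ((insn >> 25) & 0x7f) == 0x01) {
        uint32_t rd  = (insn >> 7)  & 0x1f;
        uint32_t rs1 = (insn >> 15) & 0x1f;
        uint32_t rs2 = (insn >> 20) & 0x1f;
        if (rd != 0)   /* x0 is hardwired to zero */
            tf->regs[rd] = tf->regs[rs1] * tf->regs[rs2];
        tf->pc += 4;   /* skip the emulated instruction and resume */
        return;
    }
    /* ...emulate other instructions here, or report a genuine fault... */
}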

Reply Score: 1

tylerdurden Member since:
2009-03-17

That sounds like the PAL code in the old Alphas. A special ROM had the subroutines to "emulate" the instructions not implemented in the system chip.

SPARC also did something similar, but it was purely software; some SPARC instructions were optional, and the CPU would simply throw an exception (and tell the OS to figure it out) if it encountered an optional instruction which it could not decode (not implemented in that revision).

I could be wrong, I haven't touched those architectures in forever. And they were slightly before my time in school.

Reply Score: 2

JLF65 Member since:
2005-07-06

The ColdFire processor was derived from the 680x0 processor family by Motorola. They started from the 68050 and threw out most everything that kept them from using a RISC core. Unimplemented 680x0 instructions/addressing modes generate an exception, and a library allows those instructions to be done in software, allowing most 680x0 code to run on ColdFire processors. Some Atari ST/TT folks make ColdFire accelerators for their old ST/TTs.

Reply Score: 2

JLF65 Member since:
2005-07-06

Correction, 68060. They skipped over the 68050, and a number of folks have been trying to make something they call a 68050 in FPGA.

Reply Score: 2

Let me ask a layman's question
by MyNameIsNotImportant on Sun 10th Apr 2016 22:18 UTC
MyNameIsNotImportant
Member since:
2013-01-02

There are a lot of experts in this thread as I can see.

As this is an open source processor design, is there any way to actually create your own processors without having access to or owning a fab?

Can this design be put into an FPGA and work efficiently?

Reply Score: 1

tylerdurden Member since:
2009-03-17

Yup, you can most definitely put RISC-V on an FPGA; in fact that's how most of the development has happened thus far.

Here's an example of a RISC-V fpga environment:

http://www.lowrisc.org/docs/tagged-memory-v0.1/fpga/

Reply Score: 2