‘RISC-V offers simple, modular ISA’

Thom Holwerda 2016-04-08 Hardware 78 Comments

RISC-V is a new general-purpose instruction-set architecture (ISA) that’s BSD licensed, extensible, and royalty free. It’s clean and modular with a 32-, 64-, or 128-bit integer base and various optional extensions (e.g., floating point). RISC-V is easier to implement than some alternatives – minimal RISC-V cores are roughly half the size of equivalent ARM cores – and the ISA has already gathered some support from the semiconductor industry.

About The Author

Thom Holwerda

Follow me on Mastodon @[email protected]

78 Comments

2016-04-08 8:27 am

ThomasFuhringer
Could not quite figure out if there is any hardware under construction or maybe even available implementing it.

What would be the best emulator to try it? QEMU?

2016-04-08 9:41 am

christian
Could not quite figure out if there is any hardware under construction or maybe even available implementing it.

What would be the best emulator to try it? QEMU?

Javascript:)

http://riscv.org/angel-simulator/
2016-04-08 5:51 pm

JLF65
If you want to try it as hardware, you can put VectorBlox on a Cyclone FPGA.

https://github.com/VectorBlox/orca

2016-04-08 9:04 am

peskanov
This architecture is a good example of the times we live.

Instead of learning from the past and opening some paths toward the future, they just replicate the old MIPS architecture with few changes and adopt some of the most poisonous trends of the present.

I am talking about that “mix your our architecture” thing. Dou you want mul/div in your CPU? Do you want 64 o 128 floats?

The biggest roadblock of modern computing is software. You are not helping anybody letting all your d***ed arch. undefined, and forcing all tool makers to work their asses on silly flavours of the same ISA.

Mul and div should be defined in the standard. Even in a minimalist implementacion for embedded, adding slow mul/div instructions cost nothing in terms of area/energy.

About the ISA:

– The biggest lesson of the last 20 years is that you want/benefit from branch predication. Even the most simplest form of it. Where is branch predication here?

– Why not learning from modern hw/sf problems and shortcomings and try to implement some improvements? The RISC architectures were born following that philosophy.

One operation most modern software does, is array bound checking. Software perform it adding lost of inst., but hw could do it without significant cost.

Add bound checking to memory r/w instructions; let’s move the computer world forward and not backward!

There are lot’s of thing that can be done in an open ISA to move forward; let’s hope we will get some visionaries again, like Cray in the 60’s or the ARM team in the 80’s.

2016-04-08 10:17 am

mmrezaie
Does anyone know how openpower can be adopted by some open community in order to be used by outsiders not heavyweights like IBM. If that’s so then why there is no interest of adopting these ISAs!

2016-04-08 6:21 pm

Dasher42
There’s the Tyan Openpower (http://www.tyan.com/solutions/tyan_openpower_system.html) systems, but they’re rackmount servers. The POWER architecture is focused on the data center thus far, but I agree, I’d love to see some consumer-grade riffs on it.

2016-04-08 7:59 pm

Drumhellar
Or even some workstations!

2016-04-11 7:18 am

kokara4a
Or even some workstations!

Like this one?

<a href=”https://raptorengineeringinc.com/TALOS/prerelease.php“>

2016-04-08 10:43 am

christian

This architecture is a good example of the times we live.

Instead of learning from the past and opening some paths toward the future, they just replicate the old MIPS architecture with few changes and adopt some of the most poisonous trends of the present.

I am talking about that “mix your our architecture” thing. Dou you want mul/div in your CPU? Do you want 64 o 128 floats?

The biggest roadblock of modern computing is software. You are not helping anybody letting all your d***ed arch. undefined, and forcing all tool makers to work their asses on silly flavours of the same ISA.

Mul and div should be defined in the standard. Even in a minimalist implementacion for embedded, adding slow mul/div instructions cost nothing in terms of area/energy.

[/q]

While I partially agree, things like mul/div don’t have zero cost. The problem with mul/div is that they’re either slow, multi-cycle (cycle count depending on the operands) operations, or faster but more predictable cycle count operations which require lots of transistors. The former is difficult to pipeline, the latter takes up lots of die space, and if mul/div is a marginal operation in a low end chip, then that’s a hit that may not be worth taking. Same for MMUs, some applications just don’t need the expense of a complex paged MMU.

About the ISA:

– The biggest lesson of the last 20 years is that you want/benefit from branch predication. Even the most simplest form of it. Where is branch predication here?

Because it takes up instruction encoding space? With modern static and dynamic branch prediction, the pipeline bubbles that predication prevent are not as much of a problem, and with technologies such as hyper-threading, pipelines can be kept plenty busy anyway.

ARM paid a price for predicate bits, it only has 16 registers available in the instruction encodings. While 16 registers are better than the 8 x86 traditionally has, the 32 registers used by other RISC ISA has arguably proved more valuable, and indeed, 64-bit ARM with it’s 31 registers has mostly dropped predication.

– Why not learning from modern hw/sf problems and shortcomings and try to implement some improvements? The RISC architectures were born following that philosophy.

One operation most modern software does, is array bound checking. Software perform it adding lost of inst., but hw could do it without significant cost.

Add bound checking to memory r/w instructions; let’s move the computer world forward and not backward!

Bounds checking is a function on the language definition and implementation at the compiler level. Compilers can easily do that with complex or simple instructions, and require no help from the ISA.

Array bounds checking is the sort of operation that goes against RISC philosophy. The operation itself requires you to provide not only the index you require, but also the bounds and the unit size. While the latter is easy for built in types, it’s not so easy for complex structures.

RISC resulted when people realised that the single instruction CISC operations like bounds checked indexed addressing modes resulted in complex hardware that functioned no faster than the same multiple instructions to achieve the same thing in RISC. RISC gets a lot of benefits from instruction encoding being simple and fixed length, and while those benefits were certainly more prevalent in the 80’s/90’s in mainstream CPUs, the same benefits are just as valid in the low end now. In fact, even CISC processors often perform better using multiple simple instructions rather than complex single instructions. That’s why, for example, x86 compilers no longer use instructions such as enter/leave for function prologue/epilogue, and instead do the equivalent operations using simple mov/add.

Modern x86/x64, which is now basically the only prevalent CISC architecture still being used (we’ll ignore low volume architectures like IBM z Systems) gets the best of both worlds in that it has the dense CISC instruction encoding, but breaks that down into more RISC like pipeline operations on the fly, but it pays a big price in doing so in decoding and execution complexity. I’d love to see what the big ARM vendors could do with a R&D and manufacturing budget like Intel’s.

[q]

There are lot’s of thing that can be done in an open ISA to move forward; let’s hope we will get some visionaries again, like Cray in the 60’s or the ARM team in the 80’s.

RISC-V has corrected many of the original RISC mistakes. The main one IMO is the removal of those goddam instruction delay slots. Delay slots were an implementation detail that leaked into the ISA, and were a big mistake in the long run. In fact, I dare say they were regretted even before the end of the 80’s, when pipelines started getting longer than the 3/4/5 stages that work for delay slots.

And register windows! Coming from Berkeley, I worried that RISC-V might have register windows like Berkeley RISC did. But they’ve learnt that lesson as well, and defined the standard set of fixed registers, leaving the ABI and compilers to get the most out of those registers.

2016-04-08 7:10 pm

tylerdurden

RISC-V has corrected many of the original RISC mistakes. The main one IMO is the removal of those goddam instruction delay slots. Delay slots were an implementation detail that leaked into the ISA, and were a big mistake in the long run. In fact, I dare say they were regretted even before the end of the 80’s, when pipelines started getting longer than the 3/4/5 stages that work for delay slots.

And register windows! Coming from Berkeley, I worried that RISC-V might have register windows like Berkeley RISC did. But they’ve learnt that lesson as well, and defined the standard set of fixed registers, leaving the ABI and compilers to get the most out of those registers.

A poster in this site with a remote understanding of computer architecture? Thank you, thank you, thank you!

I also like to add that at this point, the RISC/CISC monikers are mainly irrelevant/useless distinctions. Also it is time for people to stop coupling/confusing ISA and/with Microarchitecture.
2016-04-08 7:54 pm

peskanov
christian,

While I partially agree, things like mul/div don’t have zero costâ€¦ [/q]

Well, nothing has zero cost. Thing is, you always need mechanism to stall the pipeline, just for memory access. Stopping it for a multicycle mul/div does not have a significant cost…especially compared with NOT having these instructions!

Even worse, when I work in embedded project, even very small ones, if I don’t have mul/div I end wasting memory in algorithms and/or tables to calculate them.

I don’t thing any remaining embedded 32 bits CPU ships today without mul. And ARM ships without div (except Cortex M4) for some strange religious reason…

Because it takes up instruction encoding space? With modern static and dynamic branch

It doesn’t have too. Not all predication has to be implemented in the ARM way; just check the ARM thumb way for example.

Also, those 4 bits spent in the predicate of the ARM ISA(which is an extreme case) are not the reason ARM used only 16 registers. The shifter operands take a lot more for example!

I am pretty sure that the 16 registers decision was based in gate/area budgets.

About dynamic branching: do you know in the videogame industry (not so much in embedded) we simulate predication using “>>31” and “&”?

Because dynamic branching is never as good as simple, forward calculation expressions. And also because using predication, you unload the branch caches from trivial operations (like max, min, abs…) and let it work on more important branches.

Bounds checking is a function on the language definition and implementation at the compiler level. Compilers can easily do that with complex or simple instructions, and require no help from the ISA.

You can do anything using simple instructions, even the “+” operation. That’s not the question here. Bound checking is such an important operation nowadays, that it should be optimized at hw level.

Improving sw quality need improving on determinism. Right now many C/C++ programs work in different ways depending on compiler settings, platform, phase of the moon etc…and you know the reason, every read and write in wrong addresses has completely unintended consequences.

And people insist on using those half dead languages because those who implement bound checking, like Java, are simply slower.

Is there any sane reason for it? I don’t think so, when it could be accelerated so easily in hw.

RISC resulted when people realised that the single instruction CISC operations like bounds checked indexed addressing modes resulted in complex hardware that functioned no faster than the same multiple instructions to achieve the same thing in RISC.

completely disagree. RISC comes from realizing operations that don’t involve memory can be processed in a nice pipeline, with little stalls. Mostly due to no load-operate-store forced cycles.

In fact, an ARM instruction like addlt r0,r1,r2 asl #2 does more work than many CISC instructions. But it can be nicely pipelined.

Other inst. like rlwimi on PowerPC or “compare and branch” in MIPS prove my point.

RISC-V has corrected many of the original RISC mistakes. The main one IMO is the removal of those goddam instruction delay slots.

Delay slots were very rare in the RISC world. I can only remember the original MIPS and the Intel i860.

PowerPC, ARM, Hitachi SH, 88k, Alpha…I don’t any of those arch. used delay slots.

So I don’t thing that’s a big improvement, because that was already marginal.

[q]

And register windows! Coming from Berkeley, I worried that RISC-V might have register windows like Berkeley RISC did.

I don’t have any idea about how this task could be improved, but it should. Fast context of registers change is very important today. As people know less and less about optimization, they perform function calls inside inner loops very often; and this trend is just growing.

Any modern ISA should take this problem very seriously.

Edited 2016-04-08 19:58 UTC

2016-04-08 8:50 pm

Carewolf
Alpha had delay slots, but only in the architecture-specific micro-code you could upload to Alpha cpus. The standard ISA had not, and it was always architecture specific, but it meant a lot of NOPs in microcode.
2016-04-08 11:15 pm

viton
And ARM ships without div (except Cortex M4) for some strange religious reason…

Cortex-A7/A15 and their successors have integer divide.

https://community.arm.com/docs/DOC-8059

Things are simpler without divide.

http://www.righto.com/2015/12/reverse-engineering-arm1-ancestor-of….

just check the ARM thumb way for example.

It is a kludge.

About dynamic branching: do you know in the videogame industry (not so much in embedded) we simulate predication using “>>31” and “&”?

On PPC? =)

x86/64 has CMOV. But it can degrade the perf of your code.

BTW mask propagation (int >> 31) is undefined behaviour in C++. Your code could break.

Because dynamic branching is never as good as simple, forward calculation expressions.

Usually it is.

And also because using predication, you unload the branch caches from trivial operations (like max, min, abs…) and let it work on more important branches.

In normal circumstances values will be in range, so min/max (in a loop) could be better with extra branches.

Bound checking is such an important operation nowadays, that it should be optimized at hw level.

No, thanks.

Elbrus has bound checking if you’re interested.

Is there any sane reason for it? I don’t think so, when it could be accelerated so easily in hw.

Most of the time index will be limited / valid. The problem is mainly in ancient languages like C, etc.

Delay slots were very rare in the RISC world. I can only remember the original MIPS and the Intel i860.

Pioneering architectures – RISC I/II had delay slots, as well as Sparc, PA-RISC, SuperH, DSPs.

Quote from wikipedia

All the other features associated with RISCâ€”branch delay slots, separate instruction and data caches, load/store architecture, large register set, etc

Fast context of registers change is very important today.

OoO will handle it. It is not a problem.

2016-04-08 11:41 pm

peskanov
viton,

Cortex-A7/A15 and their successors have integer divide. [/q]
I know, I was referring to embedded ARM. The only one I have seen is the Cortex M4 (if my memory does not fool me).

Things are simpler without divide

Yes, the pipeline looks delightful, sure. But, in every case, all 32 bits CPU have to implement a multiplication. And then, if they have some common sense, a division. Because these kinds of CPUs are used to solve problems that normally imply some basic math. And the software solution to mul/div looks much worse than a little irregularity in the hw.

In fact the original ARM was designed without mul, but after a short period, they noticed it was a bad idea and added it. It was an afterthought and it does not include the shifter operand though, that’s a pity (cannot do something as common as multiply by a constant).

David Braben used to tell an history about this question, I think.

x86/64 has CMOV. But it can degrade the perf of your code.

CMOV yes, but not the “>>31, &” sequence. That has a quite predictable (an low) cost. A typical OOO can do 1 or 2 of those sequences per cycle as average, if you can run several in parallel.

BTW mask propagation (int >> 31) is undefined behaviour in C++. Your code could break.

That’s not undefined. The size of int is undefined; is that what you meant?

If the int has length of 32 bits, >>31 is a perfectly predictable operation.

Elbrus has bound checking if you’re interested

I had no idea, thanks. I will check.

Most of the time index will be limited / valid. The problem is mainly in ancient languages like C, etc.

We cannot agree here. A better language will check the boundary, yes. At a cost of adding several instructions and filling several registers.

What’s so terrible about the idea of standarizing safe mechanisms and predictable behaviours for something that is SO basic in computing as a random memory access.

[q]OoO will handle it. It is not a problem.

Well, I have been optimizing code for a living and I don’t agree with you. Your OOO cpu serves you much better computing useful instructions than moving senseless data all the time.

Everything has a cost; using hardware for redundant code too.
2016-04-10 9:42 am

viton
I know, I was referring to embedded ARM.

-R series has divide. But I don’t see the point of hw divide in -M series. It should be cheap. Software divider is fine. Divide by arbitrary value is a very rare operation.

In fact the original ARM was designed without mul, but after a short period, they noticed it was a bad idea and added it.

Bad idea? ARM1 was designed by 2 engineers who didn’t do CPU before:

we had to keep things simple. And they all worked. We designed a complicated set of chips that all worked first time, and we did that largely by thinking about things very, very carefully beforehand

CMOV yes, but not the “>>31, &” sequence.

It is a replacement for CMOV for older CPUs.

That’s not undefined. The size of int is undefined; is that what you meant?

C++ Standard does not define shift of negative int values.

OK, to be accurate it is implementation-defined, not undefined behavior.

Of course, no sane compiler-developer will broke informal standard, but that doesn’t mean shifts are safe. (I’m using mask propagation too)

Well, I have been optimizing code for a living and I don’t agree with you.

Same for me.

Edited 2016-04-10 09:42 UTC
2016-04-10 12:57 pm

peskanov
viton,

-R series has divide. But I don’t see the point of hw divide in -M series. It should be cheap. Software divider is fine. Divide by arbitrary value is a very rare operation [/q]

I don’t feel the software divider is fine in any ARM flavour. Even most of the A series lack div; I worked with A8 & A9, which feel like fast CPUs to me until they hit a division. A silly omission which works against ARM in most benchmarks.

Embbeded is not so different; many projects perform some heavy math these days, and adding basic division to a Cortex-Mx cost nothing.

ARM was the first 32 bits ISA I saw without div. When I saw the sw div implementations linked automatically by GCC I was quite surprised

In the beginning, I felt ignoring div was a good decision, as programmers would be more aware of the high cost of a division.

After a decade, I don’t think the same. Current programmers never check assembler outputs, and the divide op. just pops too often when a programmer is not trying to avoid it.

A minimal, hw implemented integer vision will cost 15-30 cycles and takes what, a thousand gates?

I can live fine without hw div, but I just think it is a wrong decision. Having a software division written in the flash (along your program) is not better that having a basic one in hw and in the end it saves nothing.

Bad idea? ARM1 was designed by 2 engineers who didn’t do CPU before

“Bad idea” was an unfortunate expression. I love the work the ARM team did in their ISA; there are many aspects I dislike or find bizarre, but in general terms I love it.

They worked on a low budget of time and transistor count, so I can understand the omission of mul. But that’s one of the first thing they added, so I think that shows how important this feature is.

C++ Standard does not define shift of negative int values.

OK, to be accurate it is implementation-defined, not undefined behavior.

Of course, no sane compiler-developer will broke informal standard, but that doesn’t mean shifts are safe. (I’m using mask propagation too)

Is that the same for the C99 & C11 standard?

Jesus, this crap never fails to surprise me. I didn’t even know the standard allowed for one’s complement implementations of the language.

[q]Same for me.

Cool; let me explain why did I say that.

I don’t work on optimization anymore, but when I did I used to hit memory speed limits very often. When I finished my work, my inner loops were usually memory bound and further optimization of the implementation was useless.

From time to time I still measure and optimize a few things, mostly in the ARM space. But also in current x86/x64 systems (don’t look or touch the asm there, tough).

I see current ARM cpus showing times quite similar to my previous experiences, but some of my test in an intel i5 seems to me more computation bound than memory bound.

For some toy inner loops I reach an ipc of 5-6 (which I think it’s very high). And there is no redundant code there, all inst. produce work.

If I take any of these efficient loops and add cruft, like calling a function instead of inlining it etc…efficiency goes down very fast.

It seems to me every inst. counts in these cases, unlike memory starved cpu’s I used to work.
2016-04-10 3:27 pm

Megol

Things are simpler without divide

Yes, the pipeline looks delightful, sure. But, in every case, all 32 bits CPU have to implement a multiplication. And then, if they have some common sense, a division. Because these kinds of CPUs are used to solve problems that normally imply some basic math. And the software solution to mul/div looks much worse than a little irregularity in the hw.

[/q]

Most divisions can be done with reciprocals, division with constants are common. If not the shift-subtract method is easy to program and use.

In fact the original ARM was designed without mul, but after a short period, they noticed it was a bad idea and added it. It was an afterthought and it does not include the shifter operand though, that’s a pity (cannot do something as common as multiply by a constant).

David Braben used to tell an history about this question, I think.

The multiplication algorithm of the ARM2 used the shifter which would have complicated the use of a shifted operand.

[q]x86/64 has CMOV. But it can degrade the perf of your code.

CMOV yes, but not the “>>31, &” sequence. That has a quite predictable (an low) cost. A typical OOO can do 1 or 2 of those sequences per cycle as average, if you can run several in parallel.

[/q]

Sign extraction + and can of course degrade performance if misused, just as using CMOVcc can!

Even when using Intel processors with slow CMOVcc execution it is often faster than using bithacks.

The big problem is code not arranged correctly

Current Intel processors have a latency of 1 and throughput of 2 for CMOVcc. AMD K7-K10 supported CMOVcc with latency 1 and a throughput of 3 ops/cycle, current AMD processors still execute them well.

[q]BTW mask propagation (int >> 31) is undefined behaviour in C++. Your code could break.

That’s not undefined. The size of int is undefined; is that what you meant?

If the int has length of 32 bits, >>31 is a perfectly predictable operation.

Elbrus has bound checking if you’re interested

I had no idea, thanks. I will check.

Most of the time index will be limited / valid. The problem is mainly in ancient languages like C, etc.

We cannot agree here. A better language will check the boundary, yes. At a cost of adding several instructions and filling several registers.

What’s so terrible about the idea of standarizing safe mechanisms and predictable behaviours for something that is SO basic in computing as a random memory access.

[/q]

It’s not terrible but not critical either.

RISC type:

bound_check dest, lower, upper, check_value

branch_cond dest, address

So the pipeline will need to to support 3 inputs, 1 output – it isn’t free. Intel until recently executed e.g. ADC (add with carry) with two Âµops as they didn’t want to add 3 input Âµops even for that limited case.

Or perhaps (not as RISCy):

bound_check lower, upper, check_value

branch_bounds_check address

This would require some type of condition code, something that isn’t free. Still needs 3 inputs but could reuse a standard instruction format (with one input replacing the destination register).

Or even (really non-RISC):

bound_check_branch lower, upper, check_value

Where a failed bounds check would branch to an address specified elsewhere.

But you perhaps have another idea how to do it?

[q]OoO will handle it. It is not a problem.

Well, I have been optimizing code for a living and I don’t agree with you. Your OOO cpu serves you much better computing useful instructions than moving senseless data all the time.

Everything has a cost; using hardware for redundant code too.

Adding hardware and instructions aren’t free.
2016-04-10 9:52 pm

peskanov
Megol,

Most divisions can be done with reciprocals, division with constants are common. If not the shift-subtract method is easy to program and use [/q]

I know; in the early nineties I even tried the old “work in logarithms” method for div and mul.

The question is: is there a sane reason for it? I say there is not, and nearly all the 32 bits CPU producers from the late eighties to now seem to agree with that.

Everything can be done in sw if your machine is Turing complete. That does not mean it makes any sense.

Mul and div belong to hw in a 32 bits RISC type CPU design. It just fit the footprint and applications.

The multiplication algorithm of the ARM2 used the shifter which would have complicated the use of a shifted operand

Yeah, but the advantages completely bury the disadvantages. It’s pity they didn’t make the extra effort to implement a div, no matter how slow.

Current Intel processors have a latency of 1 and throughput of 2 for CMOVcc. AMD K7-K10 supported CMOVcc with latency 1 and a throughput of 3 ops/cycle, current AMD processors still execute them well.

Given enough registers, a modern Intel OOO should eat >>/& sequences like peanuts. However, as I said in other post, I have little real experience with x86 optimization.

In any case, what I would really like (better that predication based in CR flags) is some kind of “cmp_and_select” instruction that could perform what we (poorly) accomplish with shifting and masking.

But you perhaps have another idea how to do it?

Yes, I suggested an implementation a few post ago. Check it out:

http://www.osnews.com/thread?627165

(sorry,don’t know how to link here)

Checking would be implemented in the memory access inst. itself.

Could be something like that:

load_word -> unchecked access

check_load_word ->checked access, using a especial format pointer. Failure in the check would return 0 as memory value or raise an exception, depending on configuration

Aside from the Elbrus a previous poster referenced, I know some modern DSPs perform bound checking by hw. But I can’t remember the design I saw. Would be nice to find one again.

[q]Adding hardware and instructions aren’t free

Nothing is; software divide is costly in many aspects, mostly in time.

I align with the CPU producers here: div and mul inst. pertain to the set of the task 32 bits RISC CPUs solve, and implementing them makes a better product for everybody.

Edited 2016-04-10 22:06 UTC

2016-04-11 11:38 am

cjcoats
christian,

[snip…]

You can do anything using simple instructions, even the “+” operation. That’s not the question here. Bound checking is such an important operation nowadays, that it should be optimized at hw level.

[snip…]

Wrong.

In Real Code ® there will usually be many related bounds to check and it is well worthwhile optimizing the common parts.

2016-04-11 7:33 pm

peskanov
cjcoat,

I am just a common programmer, and only work in “Real Code”.

Infinite hours have been spent by me or work colleagues chasing silly memory corruptions. C++ usually suffer less from them, but boy, when a heavily templated, complex C++ program corrupts memory or work with uninitialized data, prepare some coffee because you are working late that night.

You should explain to me why do you consider bound checking unworthy of improvement/acceleration, because I can not read your mind.

I would happily pay to have GO running at the same speed of C, so I can bury that crap forever.

2016-04-08 3:33 pm

tylerdurden

– The biggest lesson of the last 20 years is that you want/benefit from branch predication. Even the most simplest form of it. Where is branch predication here?

In the words of Pauli “not even wrong…”

2016-04-08 7:58 pm

peskanov
tylerdurden,

Aside from trolling, do you have anything to argue?

Some form of predication is included in nearly any new commercial architecture out there. Because it’s really an efficient and straightforward way of implementing simple dichotomies, which work well in all microarchitectures.

Edited 2016-04-08 19:59 UTC

2016-04-08 8:53 pm

Carewolf
No it is actually deprecated these days.

Predication has cost like stalling the pipeline on waiting on all dependent registers and not being able to use branch-prediction. On modern processors, normal jump branches are faster in all cases except those that are weighted exactly 50/50 on each branch.

Edited 2016-04-08 20:53 UTC

2016-04-08 9:10 pm

peskanov
Carewolf,

I am defending here the general concept of predication, especially for arithmetic inst., which can be implemented without any especial register file or add any extra stall posibility in a pipeline.

This is a common mechanism in modern DSPs, an even GPUs.

I don’t know why do you say predication causes stalls; maybe you are referring to an specific microarchitecture?

Link us some reference for further discussion.
2016-04-10 2:27 pm

Megol
No it is actually deprecated these days.

Predication has cost like stalling the pipeline on waiting on all dependent registers and not being able to use branch-prediction. On modern processors, normal jump branches are faster in all cases except those that are weighted exactly 50/50 on each branch.

Really? Remember that mispredicted branch have a overhead of ~16 cycles so even if the prediction hardware is correct 75% of the time the overhead per branch is 0.25*16=4 cycles. In many cases this is worse than an optimized branchless routine.

50% correct prediction would mean a fixed overhead of 8 cycles which for a realistic core and code (superscalar execution with IPC of 2) means at least 16 instructions just for the overhead costs!
2016-04-10 11:02 pm

tylerdurden
Prediction and Predication are not the same thing.

Most dynamic HW branch predictors in modern microarchitectures have significantly higher hit rates on average (95% +) FWIW.

Edited 2016-04-10 23:09 UTC
2016-04-11 12:46 am

Alfman verbose=1
tylerdurden,

Most dynamic HW branch predictors in modern microarchitectures have significantly higher hit rates on average (95% +) FWIW.

I’m curious about this, do you have a link?

In this paper they got intel xscale branch prediction rates of 86-94%, obviously it depends on the algorithms, so I’m trying to find more sources to see if they agree with each other.

http://www.ics.uci.edu/~alexv/Papers/iccd03.pdf

The first answer of this SO question goes into some detail demonstrating the negative consequences of misprediction can be severe on a Core i7 when non-predictable random data is used.

http://stackoverflow.com/questions/11227809/why-is-processing-a-sor…

His benchmarks show the predicate version using a conditional move is 6 times faster than the branching version when the data is random (not predictable), which supports peskanov’s assertion that predicates can be beneficial. However I’m still looking for more papers on the topic. Many people have a “gut feeling”, but I’d like to find a paper that publishes benchmarks for many typical workloads.
2016-04-11 2:06 am

Alfman verbose=1
His benchmarks show the predicate version using a conditional move is 6 times faster than the branching version when the data is random (not predictable),

Oops, I should have said about 4 times faster.

Anyone following along with this discussion might find these papers interesting:

This next paper suggests indirect branch prediction accuracy (ie using pointers instead of immediate jumps) is worse than for direct branches.

http://www.eecg.toronto.edu/~steffan/carg/readings/optimizing-indir…

This paper has painstaking details about branch prediction on x86 cpus.

http://www.agner.org/optimize/microarchitecture.pdf

Though it should not surprising that this is possible, they show a worse case example where branch prediction can mispredict alternating branches 100% of the time (ie worse than random). I’m still looking for more research that reveals prediction rates in “typical” programs.
2016-04-11 5:20 pm

tylerdurden
Sure thing, check out the data on some relatively modern predictor like COLT:

https://pdfs.semanticscholar.org/895e/8e11f3ba1fa1356e2696ceea4dd0bc…

In computer architecture, you are supposed to optimize for the common case. There are indeed some very bad corner cases that can trash your predictor, but on the average the predictor works out as a positive element in the microarchitectuer.

2016-04-08 9:27 pm

tylerdurden
For me to be trolling, you should have written something that was remotely correct.

Branch predication has been proven to be a REALLY BAD IDEA in terms of power consumption and complexity, for general architectures. Specially when implemented fully. E.g. Predication is one of the main reasons why Itaniums ended up being furnaces.

I concede that very limited forms of predication have been used, in a few speculative microarchitectures. But even ARM is dropping out predication at this point; It’s not worth the complexity, for the little performance improvement it offers in the real world.

So, given the areas of application for this project (embeeded and SOCs), predication makes little sense.

FWIW There were a lot of misconceptions in your post, regarding microarchitecture, I just used Pauli’s succinct words that get to the point.

2016-04-09 1:37 pm

dionicio
Until Power micro-management evolves.

2016-04-08 5:35 pm

Alfman verbose=1

peskanov,

Mul and div should be defined in the standard. Even in a minimalist implementacion for embedded, adding slow mul/div instructions cost nothing in terms of area/energy. [/q]

I agree, these operations are so fundamental and cheap these days they ought to be supported everywhere.

One operation most modern software does, is array bound checking. Software perform it adding lost of inst., but hw could do it without significant cost.

Add bound checking to memory r/w instructions; let’s move the computer world forward and not backward!

I think that’s really more of a programming language problem. C/C++ are especially guilty for failing to address bounds checking and overflow problems. C doesn’t even expose the overflow flags the CPU gives us. However not all languages have these shortcomings. Rust for example has zero cost bounds checking because the bounds checking is proven through static code analysis at compile time, runtime bounds checking is only needed when the compiler doesn’t know bounds information at compile time.

If we really wanted to have special ISA support for bounds checking, what exactly would that look like? Every single memory access could have a start and end range, but where would these ranges come from? The CPU doesn’t know a thing about data structures. Even simple things like strings can have unknown length that are set in the data structure itself. The only practical thing to do would be for the compiler to convey structure bound information to the CPU, however if the compiler knows the correct bounds then it really has no business generating code that exceeds those bounds in the first place. So I’m having troubling envisioning a scenario where ISA bounds checking wouldn’t be redundant.

[q]About the ISA:

– The biggest lesson of the last 20 years is that you want/benefit from branch predication. Even the most simplest form of it. Where is branch predication here?

I agree adding branch prediction can help performance. It also adds significant complexity though. The moment you decide to go this route, you need more silicon to keep track of branch statistics, you need more silicon to implement transnational registers that can be rolled back, you need more silicon to detect branch mis-prediction, etc. It comes with a lot more complexity and bug potential.

The performance benefits might still justify all the additional complexity, but usually a reference implementation is all about correctness and simplicity. Is there a compelling reason to have this complexity in a reference implementation? Optimized implementations are still possible as long as correctness is maintained.

Edit: I hadn’t read christian’s post when I posted, oh well, we say many of the same things.

Edited 2016-04-08 17:44 UTC

2016-04-08 6:14 pm

kwan_e
C/C++ are especially guilty for failing to address bounds checking and overflow problems. C doesn’t even expose the overflow flags the CPU gives us. However not all languages have these shortcomings. Rust for example has zero cost bounds checking because the bounds checking is proven through static code analysis at compile time, runtime bounds checking is only needed when the compiler doesn’t know bounds information at compile time.

That’s not really much different from C++. If you use a std::array and std::get, you get compile-time bounds checking through the type system without even static code analysis. You also get the same with std::tuple for index and type access. If you pass an array reference to a templated function that deduces array references, then the compiler will also warn you if you access something outside of the deduced range.

The fact that Rust still needs bounds checking for dynamically sized blobs means it’s still no better than other languages before it in that regard, since no compiler for any language can do compile time bounds checking for dynamically sized blobs.

Edited 2016-04-08 18:18 UTC

2016-04-08 9:50 pm

Alfman verbose=1
kwan_e,

Well obviously I agree that C/C++ can do everything that rust can. The difference is with safe languages, you get safety automatically 100% of the time, even across threads. In unsafe languages it’s up to the developer have to make sure all the code written and used is safe. History shows that as complexity increases, our ability to write and maintain safe code decreases. Thank goodness for tools like valgrind!

2016-04-09 11:37 pm

bnolsen
When I ditched java style c++ object oriented programming and started using the STL more my memory problems pretty much vanished. That was about 15 years ago. Legitimate problems I have now are pretty much due to threading. Accidentally passing a loop temporary to a thread, race conditions, improper protection of shared data.

This other safety stuff is way overblown and frankly not really a concern for me, and hasn’t been for a very long time.
2016-04-10 7:37 am

moondevil
Apparently others are not so good as you with their C and C++ skills.

http://www.cve.mitre.org/cgi-bin/cvekey.cgi?keyword=memory
2016-04-10 2:44 pm

kwan_e
Apparently others are not so good as you with their C and C++ skills.

http://www.cve.mitre.org/cgi-bin/cvekey.cgi?keyword=memory

Quite a few of those are C bugs, not C++. Most people keep writing C++89 or C with classes either through laziness or someone decided from on high to eschew with the latest C++ safety features mostly those who view C++ as C with classes, but also those for some reason or another tied to a particular older version of a compiler. And there’s a lot of C code out there that hasn’t been updated which cannot be the fault of modern C++ features since they weren’t available then.
2016-04-11 6:59 am

moondevil
If they are written in ANSI C++ compatible code, that is accepted by any C++ compliant compiler, they are C++ bugs as well, regardless how you rephrase it.

The very fact that those features as still valid code in modern C++, in spite of the improvements it brings in terms of security, is a security concern.

Also 30 years of history in C and C++ security exploits have more than proven that the average C and C++ developer is not up to the task of writing safe code, regardless of people keep posting how good they are and only the others are writing bad code.

Except for the time I spent at CERN, I never worked in a enterprise where the quality of C and C++ code was of any value.
2016-04-11 7:21 am

kwan_e
The very fact that those features as still valid code in modern C++, in spite of the improvements it brings in terms of security, is a security concern.

And yet also because of such backwards compatibility, legacy code can be analyzed by modern C++ compilers and fixed, which is better than rewriting in language flavour of the month.

But all this is by-the-by from the main point that it is possible to write new safe code in the modern version on par with any “safe” language without extra effort. In fact, often with less effort.
2016-04-11 8:18 am

moondevil
Unless you are doing code reviews, compiling warnings as errors, enabling all possible warnings, forbid the developers to use C style code with static analyzers that break the integration tests, have strict policies for STL usage,…. Exploits will keep on happening.

Except for companies like Microsoft, Google, Apple and similar I haven’t seen any company going to the effort of writing safe code in C++.

Just take an hour of your time to watch CppCon 2015 talk from Herb Sutter, which I imagine you know who he is.

Among the audience of top C++ developers that managed to get a CppCon 2015 ticket, only 1% assumed using some form of static analysis tooling.

It might be possible, but if it isn’t enforced, people won’t do it.
2016-04-11 3:43 pm

Alfman verbose=1
kwan_e,

But all this is by-the-by from the main point that it is possible to write new safe code in the modern version on par with any “safe” language without extra effort. In fact, often with less effort. [/q]

You are offhandedly dismissing alot of the frustrations stem from unsafe languages. Yes, of course I agree that to the extent that stl already has templates to do what you need, you can reuse and avoid some grief. But the stl doesn’t defines 100% of the data structures everyone will ever need. There have been times I’ve started out using stl structures and encountered limitations that forced me to roll my own. The reality is that some structures are more specialized than what the stl can provide. Even when you don’t need specialized classes, there can still be a problem because stl classes have no protection from concurrent access.

http://www.sgi.com/tech/stl/thread_safety.html

The SGI implementation of STL is thread-safe only in the sense that simultaneous accesses to distinct containers are safe, and simultaneous read accesses to to shared containers are safe. If multiple threads access a single container, and at least one thread may potentially write, then the user is responsible for ensuring mutual exclusion between the threads during the container accesses.

This is the only way to ensure full performance for containers that do not need concurrent access. Locking or other forms of synchronization are typically expensive and should be avoided when not necessary.

Re-entrancy/concurrency can be a problem even in single threaded code since the STL algorithms are chosen to be efficient at the expense of safety.

http://stackoverflow.com/questions/32017363/do-the-iterator-invalid…

If you violate the above rules, the C++ std library does not stop you. It just states there is a race condition — aka, undefined behavior. Anything can happen. In C++ std library, if you don’t need something, you don’t pay for it: and a single-threaded use of those containers would not need the weight of synchronization. So you don’t get it by default.

Even one of the most fundamental classes, std::string has undefined behavior for certain unsafe operations. We could argue that these limitations are obvious or that programmers would never use them incorrectly, but the truth is as code becomes more and more complex, things start to slip by us. We make bad assumptions and our code becomes vulnerable to classes of errors that don’t happen in safe languages.

http://www.cplusplus.com/reference/string/string/operator%5B]/

If pos is less than the string length, the function never throws exceptions (no-throw guarantee).

If pos is equal to the string length, the const-version never throws exceptions (no-throw guarantee).

Otherwise, it causes undefined behavior.

Note that using the reference returned to modify elements that are out of bounds (including the character at pos) also causes undefined behavior.

While I could find more, the point isn’t to criticize the STL. It is good, people should use it. STL definitely makes C++ better. But the fact remains that as long as the responsibility of verifying code safety is left to the programmer, then C++ is inherently less safe than a language where the compiler is responsible for code safety.

This is more of a gripe rather than a safety issue, but c++ stl template errors are some of the most atrocious errors I’ve ever seen come out of a compiler. I’m not one to be intimidated by a challenging bug, but I really hate when this happens in C++ because of a dumb little typo.

https://bytes.com/topic/c/answers/893220-compile-error-templated-stl…

[q]% /opt/aCC/bin/aCC -AA -c -g -I. test.cc

Error 226: “/opt/aCC-3.30.01/include_std/rw/tree.cc”, line 492 # No appropriate function found for call of ‘operator ()’. Last viable

candidate was “bool std::less<int>::operator ()(const int &,const int &) const” [“/opt/aCC-3.30.01/include_std/functional”, line

167]. Argument of type ‘const std::basic_string<char,std::char_traits<char>,std: :allocator<char> > &’ could not be converted to

‘const int &’.

if (!_C_key_compare(__x->_C_key(), __k))

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Error 226: “/opt/aCC-3.30.01/include_std/rw/tree.cc”, line 500 # No appropriate function found for call of ‘operator ()’. Last viable

candidate was “bool std::less<int>::operator ()(const int &,const int &) const” [“/opt/aCC-3.30.01/include_std/functional”, line

167]. Argument of type ‘const std::basic_string<char,std::char_traits<char>,std: :allocator<char> > &’ could not be converted to

‘const int &’.

|| _C_key_compare(__k, ((_C_tree_iter&)__j)._C_node->_C_key())) ? e

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^

Error 556: “./ExclusiveMap.h”, line 30 # Unable to generate specialization “__rw::__rw_tree_iter<std::pair<const

std::basic_string<char,std::char_traits<char>,std: :allocator<char> >,int>,long,std::pair<const int,int> *,std::pair<const int,int>

&,__rw::__rw_rb_tree_node<std::allocator<std::pair <const int,int> >,std::pair<const

std::basic_string<char,std::char_traits<char>,std: :allocator<char>

>,int>,std::basic_string<char,std::char_traits<cha r>,std::allocator<char> >,__rw::__select1st<std::pair<const

std::basic_string<char,std::char_traits<char>,std: :allocator<char>

>,int>,std::basic_string<char,std::char_traits<cha r>,std::allocator<char> > > > >

__rw::__rb_tree<std::basic_string<char,std::char_t raits<char>,std::allocator<char> >,std::pair<const

std::basic_string<char,std::char_traits<char>,std: :allocator<char> >,int>,__rw::__select1st<std::pair<const

std::basic_string<char,std::char_traits<char>,std: :allocator<char>

>,int>,std::basic_string<char,std::char_traits<cha r>,std::allocator<char> > >,std::less<int>,std::allocator<std::pair<const

int,int> > >::find(const std::basic_string<char,std::char_traits<char>,std: :allocator<char> > &)” due to errors during generation.

typename map<C2, C1>::iterator map2Iter = _map2.find(c2);

…

2016-04-08 8:16 pm

peskanov
Alfman,

you can’t blame everything on the language. A memory safe language does bear an important cost in actively checking memory bounds, so that’s a thought decision to make.

How would I design a memory bound checker in hw?

It surely can be done in a lot of ways, here goes my suggestion:

– Implement a format of pointer+boundary. It could a 64 bits one, ie. 40+24 base/size. Anything bigger than that, let the compiler use sw checking.

– Implement a “soft” reaction to invalid access. This would make wrong programs wrong, but at least it would deterministic and wouldn’t create havoc in memory. It could work that way:

+ Invalid reads return 0

+ Invalid writes are not performed

+ Any invalid read/or write raises a flag for posterior testing

– Implement a “hard” exception, optional for the user. An exception is what hw already does today, so let’s apply the same for all bad memory accesses.

I agree adding branch prediction can help performance. It also adds significant complexity though

This is completely depend on implementation. If you wish, the result of comparisons can set on common registers and used as any other operand.

Something like:

cmp_set r0,r1,r2 ; compare r0,r1, set resulting flags in r2

mv_gt r0,r1,r2 ; move r1 to r0 depending on r2

Has exactly the same implications for a pipeline than a “sub, mov” sequence.

2016-04-08 10:40 pm

Alfman verbose=1
peskanov,

you can’t blame everything on the language. A memory safe language does bear an important cost in actively checking memory bounds, so that’s a thought decision to make. [/q]

It depends on if the safety is enforced at run-time (like an interpreted language) or compile-time. If you call a class method that does bounds checking and that call gets inlined, the compiler may be able to prove that the bounds check is invariant with regards to the context it was called from. Therefor the check becomes redundant and unnecessary.

For example (sorry about the formatting):

class S {

unsigned int len;

char*data;

// many members

unsigned int Length() {

return len;

};

char&operator[](unsigned int index) {

if (index>=len) throw new Exception(…);

return data[index];

};

};

S s=”…”;

for(int j=0; j < s.Length(); j++) {

cout < < s[j] < < endl;

}

Procedural inlining produces this logic:

for(int j=0; j < s.Length(); j++) {

if (j >= s.Length()) throw new Exception();

cout < < s.data[ j ] < < endl;

}

GCC is smart enough to process the bounds check at compile time and determine that it will always pass, therefor when you examine the binary code (with optimizations turned on) you’ll see that the logical bounds check is completely omitted because it didn’t change the program semantics. The compiler was logically able to prove that the bounds check succeeded each iteration without actually having to run it.

It surely can be done in a lot of ways, here goes my suggestion:…

Your example begs the question, how is it different from a normal segmentation fault? We already get exceptions on bad memory accesses at that level. Make a program that reads or writes a random pointer, you’ll see what I mean.

[q]Something like:

cmp_set r0,r1,r2 ; compare r0,r1, set resulting flags in r2

mv_gt r0,r1,r2 ; move r1 to r0 depending on r2

I see. Sorry, I misread predication as prediction as well. Only a few opcodes in x86 use predication. I don’t know how much benefit this offers, I’ll need to research it further.

Edited 2016-04-08 22:49 UTC

2016-04-08 11:16 pm

peskanov
Alfman,

yes, sequential memory access can be checked with little cost, and a good HL language will do it for you.

But most heavy-duty inner loops rely in one or several random access; ie table read, tree traverseâ€¦ And random access often has to be checked for every single case. The language can’t help here.

I consider the operation to be important enough to guarantee some hw support.

Your example begs the question, how is it different from a normal segmentation fault? We already get exceptions on bad memory accesses at that level. Make a program that reads or writes a random pointer, you’ll see what I mean.

Easy; you have this data declared in your C program:

static int hash[512] = {};

static int hash_init = 0;

at some point, in runtime, your command “hash[i] = first” is run with a value of i = 512.

At that moment, you are possibly destroying the hash_init var.

Of course, that depends on how your compiler arranges the data segment. You can destroy nearly anything with a random access in C/C++.

Segmentation fault does not protect all single data entities of your program. You can happily write outside boundaries. Sometimes an exception will catch the event, but again, that depends on where did you “peek” or “poke”. Segmentation fault works on a higher level, covering large areas of memory. What you can read or write inside these areas is basically “total freedom”.

Some tools like valgrind can help you catching unsafe data accessing, but I guarantee you even valgrind lets a lot of memory error go without notice.

Java/JVM solves this just inserting instructions which check boundaries constantly.

What I suggest: let the cpu provide fast mechanism to perform that. MMU blocks and segmentation faults are not the way to solve this problem.
2016-04-09 1:25 am

Alfman verbose=1
peskanov,

yes, sequential memory access can be checked with little cost, and a good HL language will do it for you.

But most heavy-duty inner loops rely in one or several random access; ie table read, tree traverseâ€¦ And random access often has to be checked for every single case. The language can’t help here. [/q]

You say hardware bound checking would yield benefit in more complex examples, but I’m not seeing how. Correct code will never be out of bounds and would not benefit from additional bounds checking. Theoretically incorrect code could benefit, and there’s no denying alot of code is incorrect. But incorrect code is inherently the consequence of Garbage-In-Garbage-Out. How does the CPU obtain the correct bounds given that the compiler had the wrong bounds?

I’m having trouble coming up with examples where this would be useful. It would really help me understand if you could provide a specific example.

at some point, in runtime, your command “hash = first” is run with a value of i = 512.

At that moment, you are possibly destroying the hash_init var.

Of course, that depends on how your compiler arranges the data segment. You can destroy nearly anything with a random access in C/C++.

Yes, but this is because C is unsafe. A bounds checking ISA doesn’t help C unless we also update C’s semantics to be bounds checked.

[q]Java/JVM solves this just inserting instructions which check boundaries constantly.

What I suggest: let the cpu provide fast mechanism to perform that.

I would expect that the boundaries are [i]logically checked for every single access, but they optimize away in machine code due to invariant conditions exactly like my earlier example above. Bounds check optimizations should still be possible regardless of when the code actually gets compiled. I’m curious, do you have evidence this is not the case?
2016-04-10 9:19 am

peskanov
Alfman,

I would expect that the boundaries are [i]logically checked for every single access, but they optimize away in machine code due to invariant conditions exactly like my earlier example above. Bounds check optimizations should still be possible regardless of when the code actually gets compiled. I’m curious, do you have evidence this is not the case?

You did set up a trivial example, a sequential memory access algorithm. I agree, hw bound checking does not help much here (although it would still be worthy imo, for short sequences checking).

Continuous bound checking is necessary accessing tables in random access. Something like this (pseudocode):

read_from_disk(mesh_indices)

read_from_disk(vertices)

foreach index in mesh_indices

print (vertices[index])

You can not know beforehand that all indices in the mes_indices array are correctly bounded to vertices arrays.

In other words: random tables access, which is a basic computing tool, must be checked on every use if you want to ensure deterministic behavior.
2016-04-10 9:01 pm

Alfman verbose=1
peskanov,

I agree, hw bound checking does not help much here (although it would still be worthy imo, for short sequences checking).

Continuous bound checking is necessary accessing tables in random access. Something like this (pseudocode):

read_from_disk(mesh_indices)

read_from_disk(vertices)

foreach index in mesh_indices

print (vertices[index])

You can not know beforehand that all indices in the mes_indices array are correctly bounded to vertices arrays.

Obviously input should always be bounds checked, that’s a given. What I said before applies here too: Correct code will never exceed those bounds. Incorrect code can exceed those bounds, but then we know such code is incorrect. How is the CPU supposed to know the correct bounds? Obviously it needs to come from the program, which needs to be fixed, but once you’ve done that it will no longer generate incorrect bounds. So it’s not at all clear to me what we’d gain by adding bounds checking in the CPU.

Not having an example of h/w bounds check that isn’t logically redundant in correct code makes me question the need for it.
2016-04-10 9:36 pm

Andre
I can imagine

char* buffer = malloc(64);

buffer[64] = 1;

where the malloc function adds some metadata to buffer describing the size. But this would be rather an extension to the C language. However, if every buffer would get managed by the memory manager implemented in hardware (so, per buffer rather then per process). Interesting concept, but could it work?
2016-04-10 10:05 pm

peskanov
Alfman,

Not having an example of h/w bounds check that isn’t logically redundant in correct code makes me question the need for it

Let me return your question to you.

I have not worked in Java for 5 or 6 years, professionally, it seems I am forever stuck in the zombie land of C/C++.

It seems I am not able to produce some assembler output from Jave for you. The flags -XX are not working for some reason and don’t have the time to investigate further.

So, if you are working in a safe language like Java or Go, why don’t you produce some assembler output so we can check it out, for a function that perform a few table read given an index?

I have a hard time believing current implementations are able to follow the life of table index and ensuring they never exceed their potential uses inside the table.
2016-04-10 11:52 pm

Alfman verbose=1
peskanov,

Let me return your question to you.

I have not worked in Java for 5 or 6 years, professionally, it seems I am forever stuck in the zombie land of C/C++.

Actually, I don’t mind an example in C/C++.

Do you have an example where C/C++ programs could benefit from hw bounds checks without changing the code?
2016-04-11 10:46 am

peskanov
Actually, I don’t mind an example in C/C++.

Do you have an example where C/C++ programs could benefit from hw bounds checks without changing the code?

Of course. Any random table access where the index is not trivial (plenty of them).

int is_empty (char* collision_map, int w, int x, int y)

{

int offs = x / 100 + w * (y / 100);

return collision_map [offs]

}

This code is amateur and obviously unsafe, no guards of any kind. But the language allows it and it’s pretty common.

The problem is, being completely unbounded, any wrong parameter will return unpredictable results.

That means it can work ok for a year, nobody noticing it, and then crash horribly in a port version or a compiler version change.
2016-04-11 1:19 pm

Alfman verbose=1
peskanov,

int is_empty (char* collision_map, int w, int x, int y)

{

int offs = x / 100 + w * (y / 100);

return collision_map [offs]

}

This code is amateur and obviously unsafe, no guards of any kind. But the language allows it and it’s pretty common.

The problem is, being completely unbounded, any wrong parameter will return unpredictable results.

That means it can work ok for a year, nobody noticing it, and then crash horribly in a port version or a compiler version change.

Yes, it’s possible that invalid values are input. But this is a problem with unsafe languages like C/C++. I’ll repeat my point one last time because it still applies, adding h/w bounds checking CAN NOT fix things like this BECAUSE the cpu does not know what what the correct bounds are.
2016-04-11 7:25 pm

peskanov
Yes, it’s possible that invalid values are input. But this is a problem with unsafe languages like C/C++. I’ll repeat my point one last time because it still applies, adding h/w bounds checking CAN NOT fix things like this BECAUSE the cpu does not know what what the correct bounds are.

I think there is a basic misunderstanding here. I don’t the cpu to FIX anything.

I want to ACCELERATE what your safe language is ALREADY doing.

You can do what I posted above in Java, or GO, ADA or any other safe language.

The difference with C is this: those languages pass extra information and add extra instructions to perform bound checking. Therefore, the code is somewhat slower, but safer.

Those inst. can be a lot of extra workload depending on the nature of our inner loops an table accesses.

What do I want?

Many programmer keeps attached to their gcc compilers because safe languages consistently score slower than C/C++.

To me, bound checking is a light operation that could be performed transparently by hw, with little cost.

My wet dream is a hl language + isa combo which perform at speeds close to C, without the horrible safety penalty we all C & C++ coders are already paying.

I don’t want any fix or patch for C or C++, I consider both languages complete dead ends and despise the attempts to modernize them and prolong their lives.
2016-04-11 11:06 pm

Alfman verbose=1
peskanov,

I think there is a basic misunderstanding here. I don’t the cpu to FIX anything.

I want to ACCELERATE what your safe language is ALREADY doing. [/q]

Ok, so going forward lets assume that all code is correct because that’s the case you are talking about (this wasn’t very clear to me).

You can do what I posted above in Java, or GO, ADA or any other safe language.

The difference with C is this: those languages pass extra information and add extra instructions to perform bound checking. Therefore, the code is somewhat slower, but safer.

You’ve said this about java, but you haven’t confirmed it. I think this assumption needs to be confirmed. Safe languages like rust and even unsafe languages like C can eliminate bounds checks entirely when they can logically prove that the bounds check will always pass (which is often true in correct code).

I think most of the time bounds checks can be eliminated by logical elimination in correct code, which is why I keep asking for a specific counter example to show me why I’m wrong. In every example I can think of, explicit h/w bounds checking becomes synonymous/redundant with generic instructions that do the exact same thing, so I’m not seeing where there would be a big advantage.

[q]You should explain to me why do you consider bound checking unworthy of improvement/acceleration, because I can not read your mind.

It’s not that bounds checking isn’t worthy of acceleration, but it’s that bounds checking that can be logically eliminated already has ZERO COST. In such cases, adding h/w bounds checking would increase program size and code pipeline requirements, all for no benefit whatsoever.

For bounds checking that cannot be logically eliminated, of course it HAS to be done. But then why create new bounds checking instructions to test a bounds and branch somewhere else when we already have generic instructions to do that?

Would you mind provide specific details of how these new opcodes might work?

Edited 2016-04-11 23:21 UTC
2016-04-12 6:02 am

dpJudas
I think most of the time bounds checks can be eliminated by logical elimination in correct code, which is why I keep asking for a specific counter example to show me why I’m wrong. In every example I can think of, explicit h/w bounds checking becomes synonymous/redundant with generic instructions that do the exact same thing, so I’m not seeing where there would be a big advantage. [/q]

OK, here’s a slightly more elaborate example of what was already originally mentioned in this thread:

File f(“model.mesh”);

int numFaces = f.readInt32();

int numControlPoints = f.readInt32();

int[] indices = f.readInt32Array(numFaces * 3);

float[] vertices = f.readFloat32Array(numControlPoints * 3);

foreach (int index in indices)

{

float x = vertices[index * 3];

float y = vertices[index * 3 + 1];

float z = vertices[index * 3 + 2];

Console.WriteLine(“vertex: {0}, {1}, {2}”, x, y, z);

}

There is no logical elimination possible in this kind of code since the lookup is not known at compile time. The compiler therefore has no real choice but to insert bounds checks inside the inner loop.

[q]For bounds checking that cannot be logically eliminated, of course it HAS to be done. But then why create new bounds checking instructions to test a bounds and branch somewhere else when we already have generic instructions to do that?

Would you mind provide specific details of how these new opcodes might work?

He already did. Read up in the thread.
2016-04-12 7:39 am

Alfman verbose=1
dpJudas,

File f(“model.mesh”);

int numFaces = f.readInt32();

int numControlPoints = f.readInt32();

int[] indices = f.readInt32Array(numFaces * 3);

float[] vertices = f.readFloat32Array(numControlPoints * 3);

foreach (int index in indices)

{

float x = vertices[index * 3];

float y = vertices[index * 3 + 1];

float z = vertices[index * 3 + 2];

Console.WriteLine(“vertex: {0}, {1}, {2}”, x, y, z);

}

There is no logical elimination possible in this kind of code since the lookup is not known at compile time. The compiler therefore has no real choice but to insert bounds checks inside the inner loop. [/q]

While I don’t know what code .net actually produces, assuming it produces the best possible bounds checking code inside the loop, it might do this (and probably optimize the index*3 as well…)

…

foreach (int index in indices) {

if ((unsigned int)index >= numControlPoints ) {invalid…};

float x = vertices[index * 3];

float y = vertices[index * 3 + 1];

float z = vertices[index * 3 + 2];

Console.WriteLine(“vertex: {0}, {1}, {2}”, x, y, z);

}

As input, the bounds check above needs the value being checked(index), the range of permissible values(length), and the instruction pointer if it fails (invalid). Observe the casting, this tests for both negative and positive values of index in one go.

In x86 this might translate into something like “cmp EAX, EBX; jae invalid”.

2 bytes for cmp, and (most likely) 2 for jae.

I apologize that I’m not terribly familiar with the proposed RISCV ISA, but it looks like it might translate into something like this “BGE X1, X2, invalid” with 4 bytes.

Now, please explain to me what a hardware optimized bounds test would do that’s any better than using generic instructions such as these?

[q]He already did. Read up in the thread.

Not for nothing but I am an experienced assembly language programmer, and I still don’t really see strong evidence that new bounds checking instructions would be that useful in typical software. If I’m wrong then it should be easy to come up with common examples to show it right?

The X86 actually had a “bound” instruction that could compare a register to an upper and lower bound in memory and call an interrupt handler on failure. It got removed in x86-64 since it wasn’t all that useful and compilers never used it.

http://www.fermimn.gov.it/linux/quarta/x86/bound.htm

Edited 2016-04-12 07:55 UTC
2016-04-12 10:46 am

peskanov
alfman,

Now, please explain to me what a hardware optimized bounds test would do that’s any better than using generic instructions such as these?

At least, you acknowledge that bound check is normally performed with extras instructions and registers…that’s something.

This system has 3 extra loads:

– Extra general register (registers are very valuable, so that’s a significant loss).

– Extra entry in the branch cache.

– Extra instruction.

What I proposed is to include bound checking in the read or write inst.

So no extra instructions needed. Bounds could integrated with the address using a bitfield format, or set on especial registers (no need to use general ones, as their duty would be reduced).
2016-04-12 2:07 pm

Alfman verbose=1
peskanov,

What I proposed is to include bound checking in the read or write inst. [/q]

Ok, but I don’t feel this has been thought through to it’s logical completion. I keep asking where the bounds come from, passing this information to the CPU is not free, the compiler MUST convey it somehow.

Here are all the logical possibilities:

1. The bounds information can be encoded with the instruction (this would only be applicable to a static buffer with constant length, but in many cases such as a variable length string you could not hard code a bounds in the instruction).

2. The bounds information can be loaded from a memory address or cache (this might work but is much less efficient than a register. You don’t want to take a double hit to read the bounds information every time you read/write a memory address).

3. The bounds information can be loaded from one or more register(s) (this can be efficient, but you’ve merely shifted the burden of loading the registers elsewhere, which means you’ll still need more instructions to perform a bounds check).

Now your criticisms of the generic instructions:

This system has 3 extra loads:

– Extra general register (registers are very valuable, so that’s a significant loss).

Since possibility #1 isn’t an option for dynamic structures and possibility #2 is costly because it relies on the memory subsystem, then a register actually makes the most sense. You could use a special dedicated register for bounds checking, but this just implies your going to have to end up loading it from memory or another general purpose register, so it makes sense to skip indirection and just use a general purpose register.

If general purpose registers are a bottleneck, then maybe a better solution is just to add more.

- Extra entry in the branch cache.

This is true, but I’m not sure it’s such a bad thing for the bound branch to be predicted like any other branch. To the extent that you really don’t want bounds checks in the branch cache for some reason, I’d argue a much more consistent solution would be to solve the problem generically with a generic branch hinting mechanism to specify which branches you want in the cache.

[q]- Extra instruction.

That’s true. But the solution you are proposing implicitly executes this extra instruction every single ram access even if it’s redundant with the program’s logic. Using generic instructions you don’t end up with any of this computational redundancy. Also, since you still need to load the bounds register from somewhere, you’ll also need some extra instructions to do that.

I realize you’d want to optimize bounds computation as much as possible by making it run in parallel. But we mustn’t overlook that on a pipelined architecture you also get the benefit of running the extra generic instruction in parallel, so in the rare case that an explicit bound is needed it’s still not obvious that bound reads/writes would produce a net benefit.

One last point is that generic instructions are very flexible with regards to the way we can handle constraints. Only multiples of 4 are valid? No problem with generic instructions. You want bounds to wrap around using a mask? No problem with generic instructions. You needs to perform the bounds check of several values before performing any memory reads/writes? No problem with generic instructions. But if you’ve hardcoded a specific kind of bounds check into memory operations, you still might end up having to use generic instructions anyways just to get what you want.

And so, maybe you can agree that the need for h/w bounds checking isn’t so obvious? I guess we might not end up agreeing, but thanks for the discussion anyways, it’s been fun.

Edited 2016-04-12 14:13 UTC

2016-04-08 9:29 pm

tylerdurden
peskanov,

Mul and div should be defined in the standard. Even in a minimalist implementacion for embedded, adding slow mul/div instructions cost nothing in terms of area/energy.

I agree, these operations are so fundamental and cheap these days they ought to be supported everywhere.

https://www.youtube.com/watch?v=31g0YE61PLQ

2016-04-08 6:08 pm

JLF65

I am talking about that “mix your our architecture” thing. Dou you want mul/div in your CPU? Do you want 64 o 128 floats?

The biggest roadblock of modern computing is software. You are not helping anybody letting all your d***ed arch. undefined, and forcing all tool makers to work their asses on silly flavours of the same ISA.

Mul and div should be defined in the standard. Even in a minimalist implementacion for embedded, adding slow mul/div instructions cost nothing in terms of area/energy. [/q]

Not every function needs hardware mul/div, and quite a few don’t need floating point at all. It’s not the optional hardware that makes modular architecture a pain, it’s not having a standard way to detect what is and is not available. RISC-V makes it easy to determine. The rest is just software, and normally handled automatically by the compiler.

About the ISA:

– The biggest lesson of the last 20 years is that you want/benefit from branch predication. Even the most simplest form of it. Where is branch predication here?

Branch prediction is on the IMPLEMENTATION side, not the specification side.

[q]One operation most modern software does, is array bound checking. Software perform it adding lost of inst., but hw could do it without significant cost.

Add bound checking to memory r/w instructions; let’s move the computer world forward and not backward!

Someone else made a great reply to this, so not gonna bother other than to repeat that this is silly to implement in hardware and wouldn’t make a difference other than to the price and complexity of the hardware.

2016-04-08 8:24 pm

peskanov
JLF65,

show me an use for a 32 bits architecture, who does not benefit *inmensely* from having mul/div.

Better, show me some 32 bit architecture in use today, which does not have *at least* a mul instruction.

Branch prediction is on the IMPLEMENTATION side, not the specification side. [/q]

Branch prediction and “predication” is not the same, smart guy. Try to understand what you read.

[q]Someone else made a great reply to this, so not gonna bother other than to repeat that this is silly to implement in hardware and wouldn’t make a difference other than to the price and complexity of the hardware

Yeah; emulating ridiculous large CISC instructions (which x86 compilers does not even produce anymore) is a perfectly reasonable use of hw.

But memory bound checking, a basic and integral part of modern computing, is not. That would be a waste. Because a subtraction and a mask operation (all that’s needed for it) would be too costly.

Ok.

2016-04-08 8:41 pm

JLF65
No need for mul/div? A toaster, a washing machine, a simple clock… any number of common items that today use a simple chip instead of gears or crap.

And sorry, I read that as branch prediction every time, even when proof-reading my reply. Weird. The primary advantage of branch PREDICATION (got it this time ) is making pipelining and cacheing slightly easier/better. In other words, it’s an IMPLEMENTATION thing again. You don’t want implementation related features creeping into your specs or later improvements in implementation are harder or render parts of your spec unusable. It SEEMS like a good idea, but only for certain kinds of implementations.

2016-04-08 8:56 pm

peskanov
JLF65,

the topic of this thread is the definition of an 32/64/128 bits ISA.

Please tell me which toaster needs a 32 bits ISA.

You use a 32 bits ISA to solve problems with a minimum of complexity. Which nearly always involves mul and div.

I don’t care a PIC lacks mul; but a CPU with a 32 bits ISA should!

And I do work in embedded (not full time), so I should now.

About predication: you don’t understand predication. It’s defined at instruction level, not at implementation level.

This is predication:

; if (a<0) a = -a

cmp r0,#0

rsblt r0,r0,r0

Do you see the “lt” sufix. This is the part of the instruction which performs the predication.

It’s defined in the ISA, that’s predication.
2016-04-09 12:21 am

JLF65
I fully understand predication. CMOV is a common example. It’s part of the ISA, but a BAD part. You’re making implementation specific optimizations part of the instruction set, and what happens when those optimizations are obsolete? Or another optimization makes it harder to implement? Again, it SEEMS like a good idea on the surface, but it’s moving implementation specific design into the instruction set where it has no place.

And who cares if 32 bits is overkill for a toaster? It’s a TOTALLY OPEN SOURCE processor you can stick into an ASIC or FPGA along with everything else you want your toaster to do without having to pay anyone else a dime. For example, a PIC chip would handle a toaster, but then you’re paying them for their processor. You could use a 6800 MPU, but then you’re paying Freescale for their processor. If an ASIC is cheaper, this gives you a TOTALLY free CPU to use.
2016-04-09 8:17 am

agentj
Why 32bit core is overkill ? 32bit CPUs are so cheap and easy to support by good quality toolchains and IDEs these days. If you need to use assembly code in order to squeeze last byte of some chip (not related with pricing, power or performance constraints), you chose wrong chip. They are also easily available in tiny BGA packages.

Edited 2016-04-09 08:18 UTC
2016-04-10 9:11 am

peskanov
JLF65,

I fully understand predication. CMOV is a common example. It’s part of the ISA, but a BAD part. You’re making implementation specific optimizations part of the instruction set, and what happens when those optimizations are obsolete? Or another optimization makes it harder to implement? Again, it SEEMS like a good idea on the surface, but it’s moving implementation specific design into the instruction set where it has no place. [/q]
I have no clue what you mean here. Intel never removes instr. from their ISA. If you think new inst. should not be used, that’s up to you, at least in this CPU family. Good luck convincing the compilers, though. They could have a different opinion.

Following your reasoning, nobody should compile the FP code for SSE and restrict to the original stack based FPU, which is a total disaster.

BTW, I will readily acknowledge I am no x86/x64 expert. My defense of predication mechanism or equivalents is not attached to the CMOV inst., which I have never used and barely know.

[q]And who cares if 32 bits is overkill for a toaster? It’s a TOTALLY OPEN SOURCE processor you can stick into an ASIC or FPGA along with everything else you want your toaster to do without having to pay anyone else a dime. For example, a PIC chip would handle a toaster, but then you’re paying them for their processor.

You cannot have it both ways.

Either you care for minimal footprint and cost (no 32bits CPU in toaster) or you don’t and will accept anything with reasonable price that solves the task.

If you already have a big 32 bits register bank and a RISC design pipeline, integrating minimal mul/div adds little to the footprint and this functionality is worthy; history shows it.

That’s a fact, and good luck finding a 32 bits cpu without mul in the market.

Why insisting in defining a product that nobody wants, open source or not?

You design your ISA with some kind of computing load in mind; the tasks you address with a 32 bits CPU usually involve * and / at minimum, even in embedded (PIDs or filtering comes to mind).
2016-04-10 6:54 pm

JLF65
Intel does all kinds of crazy things that they’re then stuck with forever to maintain backwards compatibility. It took AMD to make a usable 64-bit extension for the x86 family. I don’t hold Intel out as an example of anything other than how to succeed by throwing billions of bucks at a problem rather than engineering a good solution.

And 32-bit is the new 8-bit. Again, who cares if it’s overpowered – it’s the new minimum standard processor. And RISC-V gives you a nice well designed 32-bit base for free. Hell, even 8-bit is overpowered for something like a toaster, but people used it because it was the minimum standard.
2016-04-10 7:07 pm

Andre
Yeah…. intel stuck with backwards compatibility all the way back to the 8088. Now the architecture is shifting towards UEFI in stead of BIOS, with all the odd things like the gate A20 and such. Backwards compatible with a quick and dirty design be to on the market, planned to do a better architecture later, but that never happened.

But well… 32 bit being overpowered? If you equip it with a few KB of RAM and ROM, running at a few MHz…. wouldn’t be that overpowered,right? Just one question, what’s the impact for the code size?
2016-04-10 9:57 pm

peskanov
JLF65,

I am not sure why, but we are not communicating very well. I guess you already noticed my English is not very good.

I think we agree a 32 bits CPU is nice to have even in the lowest embedded application. I also don’t care if it’s overkill for a toaster, I am all for it. I would pay for not touching a PIC again.

I am just saying: go with the full system, 32 bits RISC with mul and div. The hw is barely bigger and the software is much cleaner and it’s faster.

So, don’t bother designing the ISA without these important operations. That’s my whole point.
2016-04-10 10:56 pm

Alfman verbose=1
JLF65,

Intel does all kinds of crazy things that they’re then stuck with forever to maintain backwards compatibility. It took AMD to make a usable 64-bit extension for the x86 family. I don’t hold Intel out as an example of anything other than how to succeed by throwing billions of bucks at a problem rather than engineering a good solution. [/q]

I agree, x86 dominates the desktop market due to sheer force rather than great engineering.

[q]Again, who cares if it’s overpowered – it’s the new minimum standard processor. And RISC-V gives you a nice well designed 32-bit base for free. Hell, even 8-bit is overpowered for something like a toaster, but people used it because it was the minimum standard.

I agree with this too. One might ask why use 32bit architecture when an 8bit architecture will do? Well the 8bit CPU would save you tons of transistors, but that’s not really a big factor these days, having parity with the rest of the programming world makes development a whole lot easier.

It is why I feel modern CPUs designed in 2016+ should support div/mul. A CPU should be able to handle public key crypto and decoding mp3s in realtime without struggling because mul/div are emulated.

In the end it’s an arbitrary line, we can save some transistors by making these instructions optional. But sometimes the benefits of standardizing on a common 32bit platform outweigh the costs of fragmentation. IMHO we should try to avoid fragmentation right off the bat.
2016-04-11 4:05 am

atomic
IMO any chip-manufacturer, with the market share of Intel and x86, would have the same approach, which is not to remove any older instructions and things like A20, just to keep the massive amount of older code running and customers happy. People often talk about finding a better engineering solution for x86, altough any other manufacturer, including ARM, would likely do exactly the same thing as Intel in comparable market-position.

2016-04-08 4:39 pm

Pro-Competition

The RISC-V project seems to be gaining steam. This is a very welcome development IMO. It would be really nice to have a fully open hardware option!

I am curious about how RISC-V compares to other open (or semi-open) RISC ISAs, like MIPS, SPARC, Power, etc. Has anyone done a comparison of all (or some) of these?

(The Genode team’s recent report on their efforts with RISC-V in its current state was informative: http://www.osnews.com/story/29138/How_Genode_came_to_the_RISC-V_CPU… .)

2016-04-08 9:15 pm

Earl C Pottinger

What happen to the CPUs where the entire instructions set was defined, but when the hardware was not in a particular model/version of a CPU a well defined subroutine location was called to do the same function.

IE. If you used the multiple instruction and the hardware is done the hardware does the function, but is the hardware is not then a pre-defined location to get a subroutine call that does the function in software.

Done proper there is not a single bit difference in the code you run, but the speed it runs depends on what hardware is present in the CPU.

2016-04-09 3:14 am

tylerdurden
That sounds like the PAL code in the old Alphas. A special ROM had the subroutines to “emulate” the instructions not implemented in the system chip.

SPARC also did something similar, but it was purely software; some SPARC instructions were optional and the CPU would simply throw an exception (and tell the OS to figure it out) if it encounter an optional instruction which it could not decode (not implemented in that revision).

I could be wrong, I haven’t touched those architectures in forever. And they were slightly before my time in school.

2016-04-09 5:06 am

JLF65
The ColdFire processor was derived from the 680×0 processor family by Motorola. They started from the 68050 and threw out most everything that kept them from using a RISC core. Unimplemented 680×0 instructions/addressing modes generate an exception, and a library allows those instructions to be done in software, allowing most 680×0 code to run on ColdFire processors. Some Atari ST/TT folks make ColdFire accelerators for their old ST/TTs.

2016-04-09 4:52 pm

JLF65
Correction, 68060. They skipped over the 68050, and a number of folks have been trying to make something they call a 68050 in FPGA.

2016-04-10 10:18 pm

MyNameIsNotImportant

There are a lot of experts in this thread as I can see.

As this is an open source processor design, is there any way to actually create your own processors without having access to or owning a fab?

Can this design be put into an FPGA and work efficiently?

2016-04-11 5:23 pm

tylerdurden
Yup, you can most definitively put RISC-V on a FPGA, in fact that’s how most of the development has happened thus far.

Here’s an example of a RISC-V fpga environment:

http://www.lowrisc.org/docs/tagged-memory-v0.1/fpga/