RISC Is Fundamentally Unscalable
Today, there was an announcement about a new RISC-V chip, which has got a lot of people excited. I wish I could also be excited, but to me, this is just a reminder that RISC architectures are fundamentally unscalable, and inevitably stop being RISC as soon as they need to be fast. People still call ARM a “RISC” architecture despite ARMv8.3-A adding an FJCVTZS
instruction, which is “Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero”. Reduced instruction set, my ass.
The reason this keeps happening is that the laws of physics ensure no RISC architecture can scale under load. The problem is that a modern CPU is so fast that just accessing the L1 cache takes anywhere from 3 to 5 cycles. This is part of the reason modern CPUs rely so much on register renaming, allowing them to have hundreds of internal registers that are used to make things go fast, as opposed to the paltry 90 registers actually exposed, 40 of which are just floating-point registers for vector operations. The fundamental issue that CPU architects run into is that the speed of light isn’t getting any faster. Even getting an electrical signal from one end of a CPU to the other now takes more than one cycle, which means the physical layout of your CPU now has a significant impact on how long operations take. Worse, the faster the CPU gets, the more this lag becomes a problem, so unless you shrink the entire CPU or redesign it so your L1 and L2 caches are physically closer to the transistors that need them, the latency from accessing those caches can only go up, not down. The CPU might be getting faster, but the speed of light isn’t.
Now, obviously RISC CPUs are very complicated architectures that do all sorts of insane pipelining to try and execute as many instructions at the same time as possible. This is necessary because, unless your data is already loaded into registers, you might spend more cycles loading data from the L1 cache than doing the actual operation! If you hit the L2 cache, that will cost you 13-20 cycles by itself, and L3 cache hits are 60-100 cycles. This is made worse by the fact that complex floating-point operations can almost always be performed faster by encoding the operation in hardware, often in just one or two cycles, whereas manually implementing the same operation would take 8 or more cycles. The FJCVTZS
instruction mentioned above even sets a condition flag on certain edge cases so that a conditional branch can be issued immediately afterwards, again to minimize hitting the cache.
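To make that concrete, here is a rough C sketch (my own approximation, not ARM’s reference pseudocode) of the JavaScript ToInt32 conversion that FJCVTZS collapses into a single instruction: truncate toward zero, wrap modulo 2^32, and map NaN and infinity to 0.

```c
#include <math.h>
#include <stdint.h>

/* Rough software equivalent of what FJCVTZS does in one instruction
 * (illustrative sketch only, not ARM's pseudocode). */
static int32_t js_to_int32(double x)
{
    if (isnan(x) || isinf(x))
        return 0;                      /* JS edge case: NaN and +/-Inf become 0 */
    double t = trunc(x);               /* round toward zero */
    double m = fmod(t, 4294967296.0);  /* wrap modulo 2^32 */
    if (m < 0.0)
        m += 4294967296.0;
    uint32_t u = (uint32_t)m;
    return (int32_t)u;                 /* two's-complement reinterpretation */
}
```

Doing all of that in software is a pile of branches and floating-point operations; in hardware it’s one instruction plus a flag you can branch on.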
All of this leads us to the single instruction, multiple data (SIMD) vector instructions common to almost all modern CPUs. Instead of doing a complex operation on a single float, they do a simple operation on many floats at once. The CPU can perform operations on 4, 8, or even 16 floating-point numbers at the same time, in just 3 or 4 cycles, even though doing this for each individual float would have cost 2 or 3 cycles apiece. Even loading an array of floats into a large register will be faster than loading each float individually. There is no escaping the fact that attempting to run instructions one by one, even with fancy pipelining, will usually result in a CPU that’s simply not doing anything most of the time. In order to make things go fast, you have to do things in bulk. This means having instructions that do as many things as possible, which is the exact opposite of how RISC works.
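Here’s what that looks like with x86 AVX intrinsics, as one concrete sketch (the function name and the multiple-of-8 length assumption are mine, just to keep it short): one load grabs eight floats, one add does eight additions.

```c
#include <immintrin.h>   /* x86 AVX intrinsics; compile with -mavx */

/* Add two float arrays eight lanes at a time.
 * For brevity this sketch assumes n is a multiple of 8. */
void add_arrays(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  /* one load: 8 floats */
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vs = _mm256_add_ps(va, vb);   /* one instruction: 8 adds */
        _mm256_storeu_ps(out + i, vs);       /* one store: 8 results */
    }
}
```

A scalar version of the same loop issues roughly eight times as many loads, adds, and stores for the same work, which is exactly the “do things in bulk” point.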
Now, this does not mean CISC is the future. We already invented a solution to this problem, which is VLIW - Very Long Instruction Word. This is what Itanium was, because researchers at HP anticipated this problem 30 years ago and teamed up with Intel to create what eventually became Itanium. In Itanium, or any VLIW architecture, you can tell the CPU to do many things at once. This means that, instead of having to build massive vector processing instructions or other complex specialized instructions, you can build your own mega-instructions out of a much simpler instruction set. This is great, because it simplifies the CPU design enormously while sidestepping the pipelining issues of RISC. The problem is that this is really fucking hard to compile, and that’s what Intel screwed up. Intel assumed that compilers in 2001 could extract the instruction-level parallelism necessary to make VLIW work, but in reality we’ve only very recently figured out how to reliably do that. 20 years ago, we weren’t even close, so nobody could compile fast code for Itanium, and now Itanium is dead, even though it was specifically designed to solve our current predicament.
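To show what “extracting instruction-level parallelism” actually means, here’s a toy C sketch (the functions and the bundling described in the comments are invented for illustration, not taken from any real VLIW ISA). The first function is trivially packable into one wide instruction word; the second is the kind of serial pointer chase Itanium-era compilers could do nothing with.

```c
/* Four independent operations: a VLIW compiler can schedule all of them
 * into the slots of a single wide instruction word, because none of them
 * depends on another's result. */
void easy_to_bundle(int *r, const int *x)
{
    r[0] = x[0] + 1;
    r[1] = x[1] * 2;
    r[2] = x[2] - 3;
    r[3] = x[3] ^ 5;
}

struct node { struct node *next; int value; };

/* A serial dependency chain: every load needs the result of the previous
 * one, so there is nothing to pack and the other slots in each bundle sit
 * empty. The compiler has to find parallelism elsewhere or eat the waste. */
int hard_to_bundle(const struct node *n, int hops)
{
    while (hops--)
        n = n->next;     /* pointer chase: zero instruction-level parallelism */
    return n->value;
}
```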
With that said, the MILL instruction set uses VLIW along with several other innovations designed to compensate for a lot of the problems discussed here, like having deferred load instructions to account for the lag time between requesting a piece of data and actually being able to use it (which, incidentally, also makes MILL immune to Spectre because it doesn’t need to speculate). Sadly, MILL is currently still vaporware, having not materialized any actual hardware despite its promising performance gains. One reason for this might be that any VLIW architecture has a highly unique instruction set. We’re used to x86, which is so high-level it has almost nothing to do with the underlying CPU implementation. This is nice, because everyone implements the same instruction set and your programs all work on it, but it means the way instructions interact is hard to predict, much to the frustration of compiler optimizers. With VLIW, you would very likely have to recompile your program for every single unique CPU, which is a problem MILL has spent quite a bit of time on.
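The deferred-load idea is easier to see in source form. The following is only a C-level analogy of mine, not MILL assembly: kick off the load for the next iteration before you need it, then do useful work while its latency is still in flight. A MILL-style deferred load bakes that “the result shows up N cycles from now” contract directly into the instruction.

```c
/* C-level analogy for a deferred load: request data early, use it later,
 * and hide the cache latency behind useful work in between. */
float sum_with_hoisted_loads(const float *a, int n)
{
    if (n == 0)
        return 0.0f;
    float sum = 0.0f;
    float cur = a[0];            /* issue the first load as early as possible */
    for (int i = 1; i < n; i++) {
        float next = a[i];       /* start the next load before using cur */
        sum += cur;              /* do work while that load is in flight */
        cur = next;
    }
    return sum + cur;
}
```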
MILL, and perhaps VLIW in general, may have a saving grace with WebAssembly, precisely because it is a low-level assembly language that can be efficiently compiled to any architecture. It wouldn’t be a problem to have unique instruction sets for every single type of CPU, because if you ship WebAssembly, you can simply compile the program for whatever CPU it happens to be running on. A lot of people miss this benefit of WebAssembly, even though I think it will be critical in allowing VLIW instruction sets to eventually proliferate. Perhaps MILL will see the light of day after all, or maybe someone else can come up with a VLIW version of RISC-V that’s open-source. Either way, we need to stop pretending that pipelining RISC is going to work. It hasn’t ever worked and it’s not going to work; it’ll just turn into another CISC with a JavaScript floating-point conversion instruction.
Every. Single. Time.
No, VLIW is not the solution to the problem. There's a good reason why the industry moved away from it.
And there's also no reason why RISC-V can't scale. The RISC/CISC distinction in 2019 is dumb, as no popular design is really either one of those.
Actually, CISC is wrapped around RISC cores today, so RISC has become an industry standard for all practical purposes despite all its shortcomings.
CISC ≈ instruction decoding + RISC
See this:
An ex-ARM engineer critiques RISC-V
https://gist.github.com/erincandescent/8a10eeeea1918ee4f9d9982f7618ef68
Autogeneration of RISC for ASICs based on templates/parameters is the solution from RISC-V/SiFive. Autogeneration of accelerators based on some analysis and simulation is my proposal for enhancing such an ASIC-RISC.
A long, long time ago, when the first RISC processors appeared, they were really RISC, and fast, and fun to program in assembly. Since then, the centuries have changed their version numbers, the world has become more complex, and new limits have appeared. Every time a limit appears, the discussions about the future run the same code; only the data are upgraded. The old and the new aren't the good and the bad. The good is to search for what's better than this eternal mix of old and new, like this ARMv8.3 or the hybrid cars which will soon be forgotten.
I hope you can read my English.
Good night, morning... :)
You mischaracterize RISC: if you go to the original Hennessy-Patterson papers, it's the idea that ISA design should quantify performance and only include instructions that help fast execution. This was done in reaction to ISAs that included very complicated instructions that were actually slower in practice than similar functionality compiled from simple instructions - EVEN ON THE SAME CHIP, WHICH DIDN'T OPTIMIZE THOSE SIMPLE INSTRUCTIONS. H-P realized that you can then simplify the ISA and make the simple instructions faster.
There's nothing in the original RISC that prohibits complex instructions if they can be shown to be fast - examples include floating point, encryption, SIMD, and even the crazy JavaScript conversion, if it is indeed faster as a dedicated instruction.
Note, finally, that complicated architectural features, such as a humongous register set or large caches, incur a performance cost due to, e.g., decoding complexity that slows down the critical path present in every instruction. Implementation tricks such as register renaming and multilevel caches are just practical ways of dealing with this problem.
1. RISC long ago lost the meaning you're still operating under: 'reduced instruction set' has little to do with instruction count or complexity these days (or for the past 20 years, for that matter). Instead, RISC has much more to do with instruction *decoding* complexity and the load/store paradigm (thoroughly adopted by CISC uarchs, naturally).
2. There's a good reason VLIW was phased out even from GPUs, where conditions were most favorable to that paradigm. We don't have compilers that can solve run-time problems ahead of time, and I doubt we ever will. A 'defer count' solves nothing when you don't know how much to defer at the time you generate the instruction.
We cannot keep everything in technology static; OSes, ISAs, applications, fabrication, etc. can never stay the same forever, so please do not take the R in RISC for Religion, which in my opinion also needs to change. Forgive me.