Moose Hole wrote:How can parallelization be even better? I was thinking maybe there could be some sort of future-processing technique.
Imagine a program with a conditional statement that takes a few cycles to evaluate. While it is being evaluated by one processor, two other processors can start working on the next bit of instructions: One on the true and one on the false. (...) Where's my honorary doctorate/millions of dollars, bitch?
The problem with doing that in a literal sense is that you need a very large amount of fast memory, and a CPU and OS combo that can issue and communicate between threads very quickly, even with a programming language that doesn't limit you, as speed increases. Every conditional would create another set of instructions to run, so for every linear n speed increase over the previous speed s, and with an average main memory latency of m, all normalized to average branches processed before a successful memory read (hence having m), you could need close to 2
(s+n-m)-2
(s-m) more memory and computational units, to keep the increase in actual performance linear, and would need to keep the clock latency of the memory the same as before, as well
(close to, because not every branch will need 2 units, but off-hand, I don't know what the average base number would be, nor do I know how exactly how computed branches would be handled). Basically, needed cache and instruction units would need be increased by 1.5-2x for every additional branch processed, else any speed gains will either be nullified by waiting on memory, or will be eaten up by having more branches to process than you have instruction units to process them. With typical main memory access around 100 cycles
(often less, but it's not absolutely predictable w/ DRAM), and caches generally ranging from 10-50 cycles, that would be quite a bitch to achieve.
It is not entirely without merit, but it is far more efficient to have prediction that will occasionally fail, which is what our CPUs do. For high level evaluations, where the evaluation of the condition will take some time, it is sometimes done, at a code level; it's just rare that it is worthwhile to do, unless the branch is effectively of the 'do
something complicated or do
nothing and move right along' variety, and even then, the condition itself would need to take a long time to evaluate, without the evaluation itself being used for an input of the chosen path, which would also make the opportunity for it fairly rare.
As to the OP's supposition, the ultimate problem which killed any such designs was memory latency not keeping up (there were attempts to make such computers, and a few met with modest success). It used to be CPUs were slow, and memory was fast. However, CPU performance increased at a higher rate than memory, both SRAM and DRAM. Even though there are some things that would be far slower with many simple CPU cores,
if we had gone down that path, much of what we do today would utilize them well enough to get good enough performance
(IoW, we would have developed SW and HW differently, so the problems we have today w/ fine parallelism would not exist, and we'd be facing different performance challenges, instead).
With such high memory latencies, nothing remotely like a crappy uC (4004), no matter how many, could handle anything worthwhile. The biggest problem we have today for CPUs is memory. It's not that 99% of users need all the bandwidth of even a single channel of DDR3, but that it takes so many cycles to hit L1 cache, so many to hit L2, so many hit L3
(if it exists), and so many to hit main RAM
(and gets worse, the more outstanding IOs there are). Even in cases where multiple RAM channels are handy, it's because of the added RAM IO resources, not just bandwith
(IE, 2 64-bit channels is better than 1 128-bit channel). During that time, if all was not predicted perfectly, the CPU is going to sit there and twiddle its thumbs. Multithreading can help, but it increases the strain on memory resources, which are often your bottleneck in the first place. We would need processors way more complicated than 4004s.
That said, if memory had not gone on such a track of getting slower and slower over time, or had done so at a much slower rate, we might very well have many scalar cores working for our main needs. All it would have taken would be for multiple cores on a die to be cheaper than making one CPU faster, for a given performance increase, back in the late 80s, when all of this crap was still being pioneered, instead of finally happening in the early 00s, after ISA, OS, and language wars were pretty much over.
So, let us assume that what Intel faced trying to get P4s to >=4GHz within a decent thermal envelope had instead happened to everybody just as they were getting to the low hundreds of MHz, stalling the ability for deeper pipelines, good buffers, good speculation, and cheap trickses to extract more performance by more speed. In my view, there would be the following main commonalities and differences:
Common: cache prefetching, branch prediction, speculative execution, vector extensions that use existing scalar units for arithmetic. These all take very few xtors, and very little power, relative to their performance benefit. Even if DRAM and SRAM were faster, these technologies would need to exist, they just wouldn't present quite the bottleneck/ace-in-the-hole they do, today
(except for vector ISA extensions).
Different: the above would have tuning options, even explicit controls, by the software running, or possibly be programmable coprocessors.
Very different: we would have subthreads, which would share the same memory context
(however crude, threads existing as they do has gotten us through several decades now, and it took Java taking off for enough research to be done to prove software could do the job just as well, and we still aren't doing that outside of research, so I figure we'd still end up handling memory protection this way), and would begin by copying the parent thread's registers at the point where a split command is issued
(IE, copy registers, but then one's counter goes to address X, the other to address Y), and then have some subthread management options, as well. Multithreading would be highly desirable for this. Branches might even be able to be speculatively split, so that high confidence branches take a chance, and lower confidence ones get both sides evaluated. Also, it would be highly advantageous, in this scenario, to be able to have local memory that is coherent among subthreads using it, but not coherent with main memory. With normal caches, and being able to designate some of that as local memory, such local local memory could be kept as a fairly small and efficient scratch space
(FI, in main thread A, subthread group 1 may use 32KB local memory, subthread group 2 may use 12KB, etc., for small-scale memory isolation--only temporary shared writable memory for the subthread group would need to be used, which could be allocated by the routines in that group, then freed once they return, so it could remain pretty small at any given time--this all assumes that common OO languages would have abstracted support for memory models that could be implemented this way, too). Scratch space using cache has been used in some CPUs with quite a bit of success, and partial implementations of the rest have been done in some more obscure chips, both from the get-go and as cheap and dirty programmer trickses
(hey, it's enough for plausibility, right?).
Also very different: to make it fast enough without eating die space, we'd need efficient RISC ISAs, that would include, as part of either instruction words or memory tables, information about instruction dependency. Since compilers can extract peak IPC
(but not actual IPC) from some set of code, I see no reason why this would not be doable. It would free up many xtors, and improve CPI, while only sacrificing a minimal amount of real IPC
(I don't see it doing as well, except in corner cases of assembly programming by skilled humans, as modern hardware IPC extraction--it would be reduce the space/xtor bloat of decode, reorder, issue, and retirement, so as to fit more computational units into a chip, while having pretty good IPC per unit). With all of that, much of what our CPUs do today in hardware, could be implemented partially in software, and thus we could get good enough overall performance. Parallel tasks would perform better than what we have today, while serial would perform a little worse. It would be good enough, because many small-scale fine-grained extractions of parallelism that are a PITA today would be dead nuts simple, likely even intuitive to implement, while high-throughput serial workloads
(where our current CPUs and software development excels) would be a bit more troublesome.
Finally, very different: C and C-alike languages would be nearly dead. The world would not have moved to functional languages or anything, but all commonly-used languages would have time-efficient side effect management built in
(time-efficient based on the programmer's time screwing with it), and would generally avoid making it easy for the programmer to unknowingly create code that could cause side effects. So, maybe it would be better to say that C-like default memory models would be dead.
There's no reason many weaker units
could not match a small number of more powerful units,
overall, IMO. However, the history we have gave us pressures that sent us down our current path, and the R&D needed to even
try something like I tried to come up with above would likely cost tens of trillions of dollars, and take 10-20 years, with no guarantee that it would be worthwhile, compared to just doing what we ended up doing.