Ultra-parallel computer

userxp
Posts: 436
Joined: Thu Jul 09, 2009 12:40 pm UTC

Ultra-parallel computer

Postby userxp » Fri Apr 15, 2011 11:05 pm UTC

Imagine a computer that, instead of having one Central Processing Unit and separate RAM memory (Von Neumann architecture), has a million small processors, each with a small amount of processing power (let's say the same as an Intel 4004) and a small amount of memory (say 5 kB). All the processors run their own code individually, but each one is connected to a few other processors, and they can share information with very low latency.

Could you use that computer for the same things we use our current computers for? Which things would be easier and which would be more difficult? I think that many common tasks could be solved in parallel. For example, in order to render a document or a vector image, you'd just send each element to a processor, which would then send the rendered bitmap to another processor (or group of processors) that would combine the results. Programming would be less like typing and more like building networks of nodes.
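
Just to make the "network of nodes" picture concrete, here is a rough sketch in ordinary C++ threads (the Element/Bitmap types and the "rendering" are made up, and of course a machine like this wouldn't actually be running C++ threads): each node loops on its own little mailbox and forwards its result to a combiner node.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical "element" and "bitmap" types -- stand-ins for real document pieces.
struct Element { int id; std::string text; };
struct Bitmap  { int id; std::string pixels; };

// A minimal thread-safe mailbox so nodes can pass messages to each other.
template <typename T>
class Channel {
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void send(T v) {
        { std::lock_guard<std::mutex> lock(m); q.push(std::move(v)); }
        cv.notify_one();
    }
    T receive() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !q.empty(); });
        T v = std::move(q.front());
        q.pop();
        return v;
    }
};

// Each "processor" renders one element and forwards the result to the combiner.
void render_node(Channel<Element>& in, Channel<Bitmap>& out) {
    Element e = in.receive();
    out.send(Bitmap{e.id, "[" + e.text + "]"});  // pretend this is rasterisation
}

int main() {
    const int n = 4;  // four tiny "processors"
    std::vector<Channel<Element>> inputs(n);
    Channel<Bitmap> combiner;

    std::vector<std::thread> nodes;
    for (int i = 0; i < n; ++i)
        nodes.emplace_back(render_node, std::ref(inputs[i]), std::ref(combiner));

    // Scatter the elements, one per node.
    for (int i = 0; i < n; ++i)
        inputs[i].send(Element{i, "element " + std::to_string(i)});

    // The combiner gathers the rendered pieces back together (in whatever order they finish).
    for (int i = 0; i < n; ++i) {
        Bitmap b = combiner.receive();
        std::cout << "got bitmap for element " << b.id << ": " << b.pixels << "\n";
    }
    for (auto& t : nodes) t.join();
}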

So, can you find any creative ways to do things with this computer? Do you think we might see different types of processors in the near future?

ATCG
Posts: 471
Joined: Sat Aug 18, 2007 3:44 am UTC
Location: Straight up the jω axis

Re: Ultra-parallel computer

Postby ATCG » Fri Apr 15, 2011 11:40 pm UTC

Sounds just a bit like the transputer. See also systolic arrays.
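
If it helps make the reference concrete, here is a toy software simulation of the systolic idea: a little grid of cells that only ever talk to their immediate neighbours while a matrix product flows through them. Just an illustration in C++, not how a real array is clocked or programmed.

#include <iostream>
#include <optional>

// Toy cycle-by-cycle simulation of an output-stationary systolic array
// multiplying two small matrices. Cells only ever talk to their immediate
// neighbours: A values flow in from the left, B values flow down from the
// top, and each cell accumulates one entry of the product in place.
int main() {
    const int N = 2, K = 3, M = 2;  // C (NxM) = A (NxK) * B (KxM)
    int A[N][K] = {{1, 2, 3}, {4, 5, 6}};
    int B[K][M] = {{7, 8}, {9, 10}, {11, 12}};
    int C[N][M] = {};

    // Per-cell registers holding the value last received from the left / from above.
    std::optional<int> a_reg[N][M] = {}, b_reg[N][M] = {};

    // Enough ticks for the last (skewed) operand to reach the bottom-right cell.
    for (int t = 0; t < N + M + K; ++t) {
        std::optional<int> a_next[N][M] = {}, b_next[N][M] = {};
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < M; ++j) {
                // What arrives this tick: from a (skewed) edge feeder, or from the
                // neighbour's register as it stood at the end of the previous tick.
                std::optional<int> a_in, b_in;
                if (j == 0) { int k = t - i; if (k >= 0 && k < K) a_in = A[i][k]; }
                else        { a_in = a_reg[i][j - 1]; }
                if (i == 0) { int k = t - j; if (k >= 0 && k < K) b_in = B[k][j]; }
                else        { b_in = b_reg[i - 1][j]; }

                if (a_in && b_in) C[i][j] += (*a_in) * (*b_in);  // multiply-accumulate
                a_next[i][j] = a_in;  // pass A to the right next tick
                b_next[i][j] = b_in;  // pass B downward next tick
            }
        }
        // "Clock edge": every cell latches its outputs simultaneously.
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < M; ++j) {
                a_reg[i][j] = a_next[i][j];
                b_reg[i][j] = b_next[i][j];
            }
    }

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < M; ++j) std::cout << C[i][j] << " ";
        std::cout << "\n";  // expect: 58 64 / 139 154
    }
}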
"The age of the universe is 100 billion, if the units are dog years." - Sean Carroll

Yakk
Poster with most posts but no title.
Posts: 11083
Joined: Sat Jan 27, 2007 7:27 pm UTC
Location: E pur si muove

Re: Ultra-parallel computer

Postby Yakk » Sat Apr 16, 2011 2:11 am UTC

Note that GPUs are, in some sense, "large arrays of microprocessors".
One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision - BR

Last edited by JHVH on Fri Oct 23, 4004 BCE 6:17 pm, edited 6 times in total.

letterX
Posts: 535
Joined: Fri Feb 22, 2008 4:00 am UTC
Location: Ithaca, NY

Re: Ultra-parallel computer

Postby letterX » Sat Apr 16, 2011 4:37 am UTC

In the same way that computer scientists have identified classes like P and NP, they've done similar things for parallel computers. You might want to have a look at NC, for example.
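
To give a taste of the kind of algorithm NC is about: prefix sums look inherently sequential, but the Hillis-Steele scan does them in about log2(n) rounds, and every update within a round is independent of the others. A quick sketch (the inner loop is written serially here, but it is exactly the part you would hand to n processors):

#include <iostream>
#include <vector>

// Inclusive prefix sums in ~log2(n) rounds (the Hillis-Steele scan). Within a
// round, every element's update is independent of the others, so with n
// processors each round is one parallel step -- the log-depth flavour of
// computation the class NC captures.
std::vector<long long> prefix_sums(std::vector<long long> a) {
    const std::size_t n = a.size();
    std::vector<long long> b(n);
    for (std::size_t stride = 1; stride < n; stride *= 2) {  // ~log2(n) rounds
        for (std::size_t i = 0; i < n; ++i)                  // all independent
            b[i] = (i >= stride) ? a[i] + a[i - stride] : a[i];
        a.swap(b);
    }
    return a;
}

int main() {
    std::vector<long long> v{3, 1, 4, 1, 5, 9, 2, 6};
    for (long long x : prefix_sums(v)) std::cout << x << " ";  // 3 4 8 9 14 23 25 31
    std::cout << "\n";
}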

Moose Hole
Posts: 398
Joined: Fri Jul 09, 2010 1:34 pm UTC

Re: Ultra-parallel computer

Postby Moose Hole » Mon Apr 18, 2011 5:37 pm UTC

How can parallelization be even better? I was thinking maybe there could be some sort of future-processing technique.

Imagine a program with a conditional statement that takes a few cycles to evaluate. While it is being evaluated by one processor, two other processors can start working on the next bit of instructions: One on the true and one on the false. These tracks would also delegate paths when they encounter a conditional. Whenever a conditional is evaluated, the track that shouldn't be used is scrapped while the one that has already been started is used.

Example:
if (hard thing to solve) // processor 1 evaluates this for a while
{
do thing; // processor 2 evaluates this while processor 1 is working
}
else
{
if (harder thing to solve) // processor 3 evaluates this while processor 1 is working
{
do thing; // processor 4 evaluates this while processor 3 is working
}
other stuff to do; // processor 5 evaluates this while processor 3 is working
}


In this example, processor 5's stuff might even be able to be saved if processor 3's statement is true, as long as processor 4's code doesn't modify processor 5's track somehow. Where's my honorary doctorate/millions of dollars, bitch?
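
For the curious: when the condition and both branch bodies are pure computations, you can literally write this by hand today with futures. A rough C++ sketch, with made-up stand-ins for the "hard thing to solve" and the two branches:

#include <future>
#include <iostream>

// Made-up stand-ins for the "hard thing to solve" and the two branch bodies.
// This only works because all three are pure computations with no side effects.
bool hard_condition(long long n) {
    long long x = 0;
    for (long long i = 0; i < n; ++i) x += i % 7;  // pretend this is expensive
    return x % 2 == 0;
}
int then_branch() { return 42; }
int else_branch() { return 7; }

int main() {
    // Kick off the condition and *both* branch bodies at the same time...
    auto cond = std::async(std::launch::async, hard_condition, 100000000LL);
    auto t    = std::async(std::launch::async, then_branch);
    auto f    = std::async(std::launch::async, else_branch);

    // ...then keep whichever result the condition says we should have taken;
    // the other branch's work is simply wasted, which is the whole trade-off.
    int result = cond.get() ? t.get() : f.get();
    std::cout << result << "\n";
}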

Meteorswarm
Posts: 979
Joined: Sun Dec 27, 2009 12:28 am UTC
Location: Ithaca, NY

Re: Ultra-parallel computer

Postby Meteorswarm » Mon Apr 18, 2011 6:29 pm UTC

Moose Hole wrote:How can parallelization be even better? I was thinking maybe there could be some sort of future-processing technique.

Imagine a program with a conditional statement that takes a few cycles to evaluate. While it is being evaluated by one processor, two other processors can start working on the next bit of instructions: One on the true and one on the false. These tracks would also delegate paths when they encounter a conditional. Whenever a conditional is evaluated, the track that shouldn't be used is scrapped while the one that has already been started is used.

Example:
if (hard thing to solve) // processor 1 evaluates this for a while
{
do thing; // processor 2 evaluates this while processor 1 is working
}
else
{
if (harder thing to solve) // processor 3 evaluates this while processor 1 is working
{
do thing; // processor 4 evaluates this while processor 3 is working
}
other stuff to do; // processor 5 evaluates this while processor 3 is working
}


In this example, processor 5's stuff might even be able to be saved if processor 3's statement is true, as long as processor 4's code doesn't modify processor 5's track somehow. Where's my honorary doctorate/millions of dollars, bitch?


The problem with this is that in most programming languages (like C, whose syntax you're using), all of those statements can have side effects, which foul things up because you wouldn't want the side effects from something to happen if you weren't supposed to evaluate it.

This can be gotten around by using side-effect-free languages, like many functional languages. For example, Haskell enforces a rigid separation between side-effecting and side-effect-free code, which means that, in principle, a compiler could have the program do exactly what you suggest (or do you one better: if you have something like "return e1 + e2", evaluate e1 and e2 in parallel), but this is somewhat tricky to implement in practice.

I hear rumors about it being done in Haskell, though. Can anybody verify?

Edit: oh, lookie: http://www.haskell.org/haskellwiki/GHC/Data_Parallel_Haskell -- this isn't quite what you want, but it's partway there. A quote I found while searching: "automatic parallelization is easy, efficient parallelization is hard." Spinning off a thread for every sub-expression has overhead, so determining what coarse chunks you want to break off is a difficult problem.
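
To illustrate the granularity point in something I can dash off here (C++ rather than Haskell): the usual trick is to recurse in parallel only while a chunk is big enough to be worth a task, and fall back to plain serial code below a cutoff. The cutoff below is a made-up number you would have to tune.

#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// Sum a range, splitting into two async halves only while the chunk is big
// enough to be worth a task. The cutoff is the whole trick: spawn a task per
// element and the overhead eats you; spawn too few and cores sit idle.
long long parallel_sum(const long long* first, const long long* last,
                       std::size_t cutoff) {
    const std::size_t n = static_cast<std::size_t>(last - first);
    if (n <= cutoff)
        return std::accumulate(first, last, 0LL);  // small chunk: stay serial
    const long long* mid = first + n / 2;
    auto right = std::async(std::launch::async, parallel_sum, mid, last, cutoff);
    long long left = parallel_sum(first, mid, cutoff);  // recurse in this thread
    return left + right.get();
}

int main() {
    std::vector<long long> v(1 << 20, 1);
    std::cout << parallel_sum(v.data(), v.data() + v.size(), 1 << 16) << "\n";  // 1048576
}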
The same as the old Meteorswarm, now with fewer posts!

Moose Hole
Posts: 398
Joined: Fri Jul 09, 2010 1:34 pm UTC

Re: Ultra-parallel computer

Postby Moose Hole » Mon Apr 18, 2011 7:42 pm UTC

That is awesome. I figured the compiler could figure out the threading using optimization of some sort, I just used C syntax to illustrate the point. I wasn't aware that functional languages worked that way, so now I know.

Meteorswarm
Posts: 979
Joined: Sun Dec 27, 2009 12:28 am UTC
Location: Ithaca, NY

Re: Ultra-parallel computer

Postby Meteorswarm » Mon Apr 18, 2011 7:46 pm UTC

Moose Hole wrote:That is awesome. I figured the compiler could figure out the threading using optimization of some sort, I just used C syntax to illustrate the point. I wasn't aware that functional languages worked that way, so now I know.


Note that not all functional languages work this way. OCaml is really promiscuous with its side effects; you can slap them in anywhere. This has some benefits, but obviously some downsides.
The same as the old Meteorswarm, now with fewer posts!

letterX
Posts: 535
Joined: Fri Feb 22, 2008 4:00 am UTC
Location: Ithaca, NY

Re: Ultra-parallel computer

Postby letterX » Tue Apr 19, 2011 5:14 am UTC

Moose Hole wrote:How can parallelization be even better? I was thinking maybe there could be some sort of future-processing technique.

Imagine a program with a conditional statement that takes a few cycles to evaluate. While it is being evaluated by one processor, two other processors can start working on the next bit of instructions: One on the true and one on the false. These tracks would also delegate paths when they encounter a conditional. Whenever a conditional is evaluated, the track that shouldn't be used is scrapped while the one that has already been started is used.


So, what you've just suggested happens to already exist in a couple of forms. This is basically the definition of speculative execution, except that we don't usually split off to multiple processors (if you try to do this at every branch point, the number of processors you need grows exponentially, so you'll run out of true concurrency pretty quickly).

On the instruction-by-instruction level, modern processors already do this. Everything in a modern CPU is pipelined, so there are several sequential instructions in flight at any one time. A branching instruction is a problem, because it is not clear which branch we should feed into the pipeline: normally we start feeding in the next instruction before the test for the branch has completed. So we try to predict which branch will actually be executed, feed the instructions from that branch into the pipeline, and if we guessed wrong, we scrap the intermediate results and start over on the correct one.
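
The guessing part of that story can be surprisingly small. Here is a toy two-bit saturating counter, roughly the textbook branch predictor (real predictors are much fancier; this just shows the idea):

#include <iostream>
#include <vector>

// The guessing machinery in miniature: a 2-bit saturating counter. States 0-1
// predict "not taken", states 2-3 predict "taken"; each real outcome nudges the
// counter one step, so a branch has to surprise it twice in a row to flip the
// prediction.
struct TwoBitPredictor {
    int state = 2;  // start out weakly predicting "taken"
    bool predict() const { return state >= 2; }
    void update(bool taken) {
        if (taken) { if (state < 3) ++state; }
        else       { if (state > 0) --state; }
    }
};

int main() {
    // A loop-style branch: taken nine times, then not taken once at the end.
    std::vector<bool> outcomes(9, true);
    outcomes.push_back(false);

    TwoBitPredictor p;
    int mispredicts = 0;
    for (bool actual : outcomes) {
        if (p.predict() != actual) ++mispredicts;  // this is when a real pipeline flushes
        p.update(actual);
    }
    std::cout << mispredicts << " mispredictions out of " << outcomes.size() << "\n";  // 1 of 10
}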

As for speculative execution at a higher level, there have been some interesting research systems which do speculation* on filesystem operations. This is based on the observation that we can predict the correct result of most filesystem operations, and can continue processing based on those predictions until the long-latency operations actually return. It's a pretty neat system, but it doesn't seem to have made it into any production-level systems that I'm aware of. So it could well be that the cost of making side effects rollback-able isn't worth the gains from speculative execution.


* Paper may be behind a paywall, depending on whether you're at a university, etc.

cerbie
Posts: 934
Joined: Sat Jul 05, 2008 5:14 am UTC
Location: USA

Re: Ultra-parallel computer

Postby cerbie » Tue Apr 19, 2011 3:03 pm UTC

Moose Hole wrote:How can parallelization be even better? I was thinking maybe there could be some sort of future-processing technique.

Imagine a program with a conditional statement that takes a few cycles to evaluate. While it is being evaluated by one processor, two other processors can start working on the next bit of instructions: One on the true and one on the false. (...) Where's my honorary doctorate/millions of dollars, bitch?

The problem with doing that in a literal sense is that you need a very large amount of fast memory, and a CPU and OS combo that can issue and communicate between threads very quickly, even with a programming language that doesn't limit you, as speed increases. Every conditional would create another set of instructions to run, so for every linear speed increase n over the previous speed s, with an average main memory latency of m, all normalized to average branches processed before a successful memory read (hence having m), you could need close to 2^(s+n-m) - 2^(s-m) more memory and computational units to keep the increase in actual performance linear, and you would need to keep the clock latency of the memory the same as before, as well (close to, because not every branch will need 2 units, but off-hand, I don't know what the average base number would be, nor exactly how computed branches would be handled). Basically, the needed cache and instruction units would need to be increased by 1.5-2x for every additional branch processed, else any speed gains will either be nullified by waiting on memory, or will be eaten up by having more branches to process than you have instruction units to process them. With typical main memory access around 100 cycles (often less, but it's not absolutely predictable w/ DRAM), and caches generally ranging from 10-50 cycles, that would be quite a bitch to achieve.
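
To put rough numbers on that blow-up (the branch rate and the per-path state size below are made-up round figures, not measurements):

#include <iostream>

// Back-of-the-envelope illustration of the blow-up: every unresolved
// conditional doubles the number of live paths, and each path needs its own
// registers and scratch state. Ten branches in flight and 1 KiB of state per
// path are made-up round numbers, not measurements.
int main() {
    const int branches_in_flight = 10;        // e.g. one branch every ~10 cycles
                                              // while waiting ~100 cycles on DRAM
    const long long state_per_path = 1024;    // bytes of registers + scratch (assumed)

    long long paths = 1LL << branches_in_flight;
    std::cout << paths << " live paths, ~" << (paths * state_per_path) / 1024
              << " KiB of speculative state to keep around\n";  // 1024 paths, ~1024 KiB
}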

It is not entirely without merit, but it is far more efficient to have prediction that will occasionally fail, which is what our CPUs do. For high level evaluations, where the evaluation of the condition will take some time, it is sometimes done, at a code level; it's just rare that it is worthwhile to do, unless the branch is effectively of the 'do something complicated or do nothing and move right along' variety, and even then, the condition itself would need to take a long time to evaluate, without the evaluation itself being used for an input of the chosen path, which would also make the opportunity for it fairly rare.



As to the OP's supposition, the ultimate problem which killed any such designs was memory latency not keeping up (there were attempts to make such computers, and a few met with modest success). It used to be that CPUs were slow and memory was fast. However, CPU performance increased at a higher rate than memory, both SRAM and DRAM. Even though there are some things that would be far slower with many simple CPU cores, if we had gone down that path, much of what we do today would utilize them well enough to get good enough performance (IoW, we would have developed SW and HW differently, so the problems we have today w/ fine parallelism would not exist, and we'd be facing different performance challenges instead).

With such high memory latencies, nothing remotely like a crappy uC (4004), no matter how many, could handle anything worthwhile. The biggest problem we have today for CPUs is memory. It's not that 99% of users need all the bandwidth of even a single channel of DDR3; it's that it takes so many cycles to hit L1 cache, so many to hit L2, so many to hit L3 (if it exists), and so many to hit main RAM (and it gets worse the more outstanding IOs there are). Even in cases where multiple RAM channels are handy, it's because of the added RAM IO resources, not just bandwidth (IE, 2 64-bit channels is better than 1 128-bit channel). During that time, if everything was not predicted perfectly, the CPU is going to sit there and twiddle its thumbs. Multithreading can help, but it increases the strain on memory resources, which are often your bottleneck in the first place. We would need processors way more complicated than 4004s.
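
If anyone wants to see those latencies for themselves, a crude pointer chase does it: make every load depend on the previous one, so neither the caches nor the prefetcher can hide anything. A rough sketch that measures wall time per dependent load, nothing more precise than that:

#include <chrono>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Chase pointers through a buffer much bigger than cache, in a random order,
// so every load depends on the previous one and the prefetcher can't help.
// This only measures wall time per dependent load -- ballpark, nothing more.
int main() {
    const std::size_t n = 1 << 24;  // 16M slots of 8 bytes: far bigger than any cache
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), 0);

    // Sattolo's algorithm: a random permutation that is one single big cycle,
    // so the chase can't get stuck in a short, cache-resident loop.
    std::mt19937_64 rng{42};
    for (std::size_t k = n - 1; k > 0; --k)
        std::swap(next[k], next[rng() % k]);

    const std::size_t hops = 10000000;
    std::size_t i = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t h = 0; h < hops; ++h) i = next[i];  // each load waits on the last
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
    std::cout << ns << " ns per dependent load (chain ended at " << i << ")\n";
}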

That said, if memory had not gone on such a track of getting slower and slower over time, or had done so at a much slower rate, we might very well have many scalar cores working for our main needs. All it would have taken would be for multiple cores on a die to be cheaper than making one CPU faster, for a given performance increase, back in the late 80s, when all of this crap was still being pioneered, instead of finally happening in the early 00s, after ISA, OS, and language wars were pretty much over.

So, let us assume that what Intel faced trying to get P4s to >=4GHz within a decent thermal envelope had instead happened to everybody just as they were getting to the low hundreds of MHz, stalling out deeper pipelines, good buffers, good speculation, and the other cheap tricks for extracting more performance from more clock speed. In my view, there would be the following main commonalities and differences:

Common: cache prefetching, branch prediction, speculative execution, vector extensions that use existing scalar units for arithmetic. These all take very few xtors, and very little power, relative to their performance benefit. Even if DRAM and SRAM were faster, these technologies would need to exist, they just wouldn't present quite the bottleneck/ace-in-the-hole they do, today (except for vector ISA extensions).

Different: the above would have tuning options, even explicit controls, by the software running, or possibly be programmable coprocessors.

Very different: we would have subthreads, which would share the same memory context (however crude, threads existing as they do has gotten us through several decades now, and it took Java taking off for enough research to be done to prove software could do the job just as well, and we still aren't doing that outside of research, so I figure we'd still end up handling memory protection this way), and would begin by copying the parent thread's registers at the point where a split command is issued (IE, copy registers, but then one's counter goes to address X, the other to address Y), and then have some subthread management options, as well. Multithreading would be highly desirable for this. Branches might even be able to be speculatively split, so that high-confidence branches take a chance, and lower-confidence ones get both sides evaluated. Also, it would be highly advantageous, in this scenario, to be able to have local memory that is coherent among subthreads using it, but not coherent with main memory. With normal caches, and being able to designate some of that as local memory, such local memory could be kept as a fairly small and efficient scratch space (FI, in main thread A, subthread group 1 may use 32KB local memory, subthread group 2 may use 12KB, etc., for small-scale memory isolation--only temporary shared writable memory for the subthread group would need to be used, which could be allocated by the routines in that group, then freed once they return, so it could remain pretty small at any given time--this all assumes that common OO languages would have abstracted support for memory models that could be implemented this way, too). Scratch space using cache has been used in some CPUs with quite a bit of success, and partial implementations of the rest have been done in some more obscure chips, both from the get-go and as cheap and dirty programmer tricks (hey, it's enough for plausibility, right?).

Also very different: to make it fast enough without eating die space, we'd need efficient RISC ISAs that would include, as part of either the instruction words or memory tables, information about instruction dependency. Since compilers can extract peak IPC (but not actual IPC) from some set of code, I see no reason why this would not be doable. It would free up many xtors and improve CPI, while only sacrificing a minimal amount of real IPC (I don't see it doing as well as modern hardware IPC extraction, except in corner cases of assembly programming by skilled humans--it would reduce the space/xtor bloat of decode, reorder, issue, and retirement, so as to fit more computational units into a chip, while having pretty good IPC per unit). With all of that, much of what our CPUs do today in hardware could be implemented partially in software, and thus we could get good enough overall performance. Parallel tasks would perform better than what we have today, while serial would perform a little worse. It would be good enough, because many small-scale fine-grained extractions of parallelism that are a PITA today would be dead nuts simple, likely even intuitive, to implement, while high-throughput serial workloads (where our current CPUs and software development excel) would be a bit more troublesome.

Finally, very different: C and C-like languages would be nearly dead. The world would not have moved to functional languages or anything, but all commonly-used languages would have time-efficient side effect management built in (time-efficient in terms of the programmer's time spent screwing with it), and would generally avoid making it easy for the programmer to unknowingly create code that could cause side effects. So, maybe it would be better to say that C-like default memory models would be dead.

There's no reason many weaker units could not match a small number of more powerful units, overall, IMO. However, the history we have gave us pressures that sent us down our current path, and the R&D needed to even try something like I tried to come up with above would likely cost tens of trillions of dollars, and take 10-20 years, with no guarantee that it would be worthwhile, compared to just doing what we ended up doing.
DSenette: (...) on the whole, even a trained killer cow is kind of stupid.

LakatosIstvan
Posts: 87
Joined: Tue Jun 03, 2008 5:54 pm UTC
Location: Sector ZZ9 Plural Z Alpha

Re: Ultra-parallel computer

Postby LakatosIstvan » Sun Apr 24, 2011 9:32 am UTC

Moose Hole wrote:How can parallelization be even better? I was thinking maybe there could be some sort of future-processing technique.

Imagine a program with a conditional statement that takes a few cycles to evaluate. While it is being evaluated by one processor, two other processors can start working on the next bit of instructions: One on the true and one on the false. These tracks would also delegate paths when they encounter a conditional. Whenever a conditional is evaluated, the track that shouldn't be used is scrapped while the one that has already been started is used.

Example:
if (hard thing to solve) // processor 1 evaluates this for a while
{
do thing; // processor 2 evaluates this while processor 1 is working
}
else
{
if (harder thing to solve) // processor 3 evaluates this while processor 1 is working
{
do thing; // processor 4 evaluates this while processor 3 is working
}
other stuff to do; // processor 5 evaluates this while processor 3 is working
}


In this example, processor 5's stuff might even be able to be saved if processor 3's statement is true, as long as processor 4's code doesn't modify processor 5's track somehow. Where's my honorary doctorate/millions of dollars, bitch?


This isn't a new thing; people thought about this a looong time ago (since the 70s, I think). There were even computers built with this technique. But it turns out that it is too much trouble to go through, because you have to worry about all the side effects that the process might create which can't be reversed, so you have to create copies of the memory, manage them correctly, etc., etc. It would be too much work, and even worse performance than we have now.
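
Here is the "copies of the memory" problem in miniature (a C++ sketch with a made-up state and workload): the speculative branch gets its own private copy, and you only commit that copy if the condition says the branch was the right one. All the cost is in the copying and in deciding when it is safe to merge it back.

#include <future>
#include <iostream>
#include <map>
#include <string>

// The speculative branch runs against a private copy of the state, and the
// copy is only committed if the branch turns out to be the one we should have
// taken. The "state" and the work done on it are made up; the point is just
// that every speculative write needs somewhere reversible to go.
using State = std::map<std::string, int>;

bool slow_condition() {  // pretend this takes ages to evaluate
    long long x = 0;
    for (long long i = 0; i < 50000000; ++i) x += i % 3;
    return x % 2 == 0;
}

State speculative_branch(State copy) {  // works on its own copy: side effects are contained
    copy["result"] = 42;
    copy["log_lines"] += 1;
    return copy;
}

int main() {
    State state{{"result", 0}, {"log_lines", 0}};

    auto cond = std::async(std::launch::async, slow_condition);
    auto spec = std::async(std::launch::async, speculative_branch, state);  // the copy travels with it

    if (cond.get())
        state = spec.get();  // commit: the speculative writes become real
    // else: just drop spec's result; nothing ever touched the real state

    std::cout << "result=" << state["result"] << " log_lines=" << state["log_lines"] << "\n";
}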
Sir_Elderberry wrote:Cords are not just bendy cylinders. They are flexible, animate beings possessed by spirits whose intentions are beyond our ken. They are Cthulu's tentacles intruding on our reality.

EvanED
Posts: 4331
Joined: Mon Aug 07, 2006 6:28 am UTC
Location: Madison, WI

Re: Ultra-parallel computer

Postby EvanED » Sun Apr 24, 2011 9:46 am UTC

LakatosIstvan wrote:But it turns out that it is too much trouble to go through, because you have to worry about all the side effects that the process might create which can't be reversed, so you have to create copies of the memory, manage them correctly, etc., etc. It would be too much work, and even worse performance than we have now.

In general I would categorize that amongst the "speculative multithreading" techniques, which is still an ongoing area of research AFAIK. (Don't expect that Wikipedia page to be very helpful unless you like reading conference papers.)

akashra
Posts: 503
Joined: Tue Jan 01, 2008 6:54 am UTC
Location: Melbourne, AU

Re: Ultra-parallel computer

Postby akashra » Mon May 23, 2011 9:06 am UTC

userxp wrote:Imagine a computer that, instead of having one Central Processing Unit and separate RAM memory (Von Neumann architecture), has a million small processors, each with a small amount of processing power (let's say the same as an Intel 4004) and a small amount of memory (say 5 kB). All the processors run their own code individually, but each one is connected to a few other processors, and they can share information with very low latency.

We have one of these downstairs from my desk at work - It's called the IBM BlueGene/C. We are also taking delivery of another newer one later this year, called the IBM BlueGene/Q.

It consists of thousands of low-speed multi-thread, multi-core (4 threads per 16 cores per proc) processors (~500MHz), on compute cards (2 cpus per card), which go in drawers (32 cards per drawer), which go into racks (32 drawers per rack).
I think the one we've ordered is about 30 racks, but I'm not 100% sure. Cost is ~$100m.
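
Multiplying out those rough figures (reading "4 threads per 16 cores per proc" as 4 hardware threads on each of 16 cores; these are my approximations, not official specs), the scale comes out to something like this:

#include <iostream>

// Multiplying out the rough per-rack numbers from the post above. They are the
// poster's approximations (and one reading of the threads-per-core figure), not
// official specs; this is only about getting a feel for the scale.
int main() {
    const long long threads_per_core = 4;
    const long long cores_per_cpu    = 16;
    const long long cpus_per_card    = 2;
    const long long cards_per_drawer = 32;
    const long long drawers_per_rack = 32;
    const long long racks            = 30;   // "about 30, not 100% sure"

    long long cpus    = cpus_per_card * cards_per_drawer * drawers_per_rack * racks;
    long long cores   = cpus * cores_per_cpu;
    long long threads = cores * threads_per_core;
    std::cout << cpus << " CPUs, " << cores << " cores, "
              << threads << " hardware threads\n";  // 61440 CPUs, ~983k cores, ~3.9M threads
}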

As for the current /C that we have there: yeah, users can compile and run their stuff on their normal systems, which they do before writing software to run on it.

I've often wondered why we're spending $100m on the new one when you can get equivalent performance out of a massive array of Tesla compute cores for ~$17m. I'm guessing cooling and power usage are a major factor, and BlueGene is designed with these in mind whereas Tesla, not so much.
( find / -name \*base\* -exec chown us : us { } \ ; )

