Concurrent programming

A place to discuss the implementation and style of computer programs.

Moderators: phlip, Moderators General, Prelates

User avatar
eds01
Posts: 109
Joined: Tue Apr 10, 2007 12:34 am UTC

Concurrent programming

Postby eds01 » Tue Nov 10, 2009 7:42 pm UTC

So I'm going to be writing a program for an embarrassingly parallel problem. I'd like it to scale linearly with the number of cores, so the program will get much better when we start to get computers with dozens of cores.

The one thing is that it should be running tens of thousands of these computations per second. The program is an AI: it plays out random games and picks the move with the best winrate, essentially. I already have something in Lisp which can play random playouts, although it's unacceptably slow and needs a more intelligent rewrite (in particular, my representation of the board is inefficient; it will be switched to a bit vector). This works with a sufficiently large number of randomly played games, along with some intelligent pruning at the beginning. Obviously, the speed of the playout is important, but so is the overhead of setting up the playout itself.

I haven't done much parallel programming before (I did a little a few years ago). Can anyone give me suggestions on what to learn to implement this? One idea I had was writing the logic for the playout in C and doing everything else in Erlang (which I've looked at a very little bit). Just using fork() in C was another idea, suggested by a friend.

User avatar
Berengal
Superabacus Mystic of the First Rank
Posts: 2707
Joined: Thu May 24, 2007 5:51 am UTC
Location: Bergen, Norway
Contact:

Re: Concurrent programming

Postby Berengal » Tue Nov 10, 2009 8:35 pm UTC

From your description I assume that by concurrent you actually mean parallel (i.e. SIMD), not concurrent (i.e. multiple threads of execution). They're pretty different (though the terminology is still in flux).

Haskell does pretty decently on parallelism, providing a couple of high-level constructs that hide all the business with threads for very little overhead. Basically, it has two different ways of doing parallelism: semi-explicit and vectors.

The vector parallelism is still experimental, but I've read some reports saying good things about it, despite the lack of stability. Vector parallelism is basically just your regular map-filter-reduce triumvirate on steroids, working on (fixed size) vectors instead of linked lists. The new thing about them is that you can nest the vector operations and so don't have to manually flatten everything. The compiler and runtime work together to divide the work evenly among a set of worker threads specified on startup, so you don't have to worry about where to parallelize.

Semi-explicit parallelism looks very much like regular programming, except you annotate values that might be beneficial to evaluate in parallel. When the annotation is encountered, a "spark" is created. A set of worker threads instantiates these sparks, which evaluates the value they annotated. If a value annotated by a spark falls out of scope, the spark is garbage collected, and if the main thread requires the value before the spark is instantiated the spark is discarded and the main thread evaluates the value just as it normally would. The nice thing about this kind of thread-less parallelism is that the semantics of the program remain unchanged, so you could write the program normally then go over it and annotate with parallelism later.
Sparks are extremely lightweight, both in size and time, but they're not completely free. The challenge in using this kind of parallelism is deciding where to spark. Too fine a granularity and the overhead will kill you; too coarse and you'll have idle threads. Also, you don't want to spark off values the main thread is immediately going to claim. Their cheapness makes it possible to be aggressive though, and GHC 6.12 has new profiling options for parallel and concurrent programs that look very promising.

Erlang has a nice concurrency model, but it's not made for parallelism. C just isn't made for either, but there might be some helpful libraries.
It is practically impossible to teach good programming to students who are motivated by money: As potential programmers they are mentally mutilated beyond hope of regeneration.

User avatar
headprogrammingczar
Posts: 3072
Joined: Mon Oct 22, 2007 5:28 pm UTC
Location: Beaming you up

Re: Concurrent programming

Postby headprogrammingczar » Tue Nov 10, 2009 8:47 pm UTC

http://en.wikipedia.org/wiki/Thread_pool_pattern

Basically, you create a "pool" of worker threads, which are threads that sit around doing nothing until they are given something to do. You tell the pool to do something, which adds it to a shared queue of tasks that each thread takes from as it finishes other tasks. The whole thing incurs very little overhead, especially if you have the program make a system call to determine how many concurrent threads the runtime/computer supports, and only create that many worker threads. If what you are trying to solve will require the whole power of the computer it is running on, you should also look into OpenCL, which gives you access to the GPU.


Aaaaand Berengal's post eclipses mine completely.
<quintopia> You're not crazy. you're the goddamn headprogrammingspock!
<Weeks> You're the goddamn headprogrammingspock!
<Cheese> I love you

User avatar
cerbie
Posts: 934
Joined: Sat Jul 05, 2008 5:14 am UTC
Location: USA

Re: Concurrent programming

Postby cerbie » Tue Nov 10, 2009 10:20 pm UTC

OpenCL may be the ticket, given good CPU scaling, GPU use (if you have a modern enough Radeon), and being explicitly parallel with no hardware or OS knowledge needed. While I think catering to C programmers is a bad habit (if they're decent, they can learn something else, and in the long-term, wouldn't that be better?), it's pretty well abstracted.

The downside to OpenCL is that your best option is beta software, AFAICT.
http://developer.amd.com/GPU/ATISTREAMS ... fault.aspx

I've been meaning to tinker with it, but haven't gotten around to it...

Aside: why are so many papers on Brook and OpenCL written with GNU style, wrt code? It makes me want to tear my eyes out.
DSenette: (...) on the whole, even a trained killer cow is kind of stupid.

User avatar
eds01
Posts: 109
Joined: Tue Apr 10, 2007 12:34 am UTC

Re: Concurrent programming

Postby eds01 » Tue Nov 10, 2009 11:32 pm UTC

Is there support for bit vectors or something like that in Haskell? The representation of the board should ideally be small and fast. The game is played on an n-by-n board (generally 19 by 19, although 9 by 9 and 13 by 13 boards are also used), and each intersection can have 3 states: empty, black or white, kind of like in Othello.

Accessing a random point on the board should be O(1); my current Lisp implementation is O(n). (Apparently, representing a board as a list of empty, black and white points is inefficient, and the inefficiency multiplies when you're generating "groups" of stones.)

User avatar
lulzfish
Posts: 1214
Joined: Tue Dec 16, 2008 8:17 am UTC

Re: Concurrent programming

Postby lulzfish » Wed Nov 11, 2009 12:24 am UTC

I would use C or C++ along with SDL's threads library. It works on Windows, Linux, and Mac OS, and it's not very complex, but it does everything I've ever asked it to do. I wrote a multithreaded fractal renderer in it.

You could also just use fork(), although managing multiple processes is probably much harder than managing multiple threads.

User avatar
headprogrammingczar
Posts: 3072
Joined: Mon Oct 22, 2007 5:28 pm UTC
Location: Beaming you up

Re: Concurrent programming

Postby headprogrammingczar » Wed Nov 11, 2009 12:47 am UTC

Processes also don't communicate well with each other. On Windows, they don't share their memory space, which forces you to use the OS as a middleman. If you are looking for performance, multi-process programming is a terrible idea.

User avatar
Berengal
Superabacus Mystic of the First Rank
Posts: 2707
Joined: Thu May 24, 2007 5:51 am UTC
Location: Bergen, Norway
Contact:

Re: Concurrent programming

Postby Berengal » Wed Nov 11, 2009 1:13 am UTC

eds01 wrote:Is there support for bit vectors or something like that in Haskell? The representation of the board should ideally be small and fast. The game is played on an n-by-n board (generally 19 by 19, although 9 by 9 and 13 by 13 boards are also used), and each intersection can have 3 states: empty, black or white, kind of like in Othello.

Accessing a random point on the board should be O(1), my current implementation in Lisp has it being O(n) (apparently, representing a board using a list of empty, black and white points is inefficient, and the inefficiency multiplies when you're generating "groups" of stones).

Not sure if there are bit vectors directly, but there are various byte arrays and vectors available. If you want lots of speed and you're going to do individual addressing, I think a byte vector is preferable, especially since you'll only be using 361 bytes per board, so size shouldn't be an issue unless you need lots and lots of boards in the L1 cache at the same time.

Linked lists are very slow if you're doing lots of lookups, since they're just a bunch of pointers under the hood, quite possibly pointing all over the heap. (I saw a video talk by Guy Steele recently, where he presented various versions of a tree structure representing lists, for use in parallel programming. The title was something like "cons considered (slightly) harmful").

User avatar
cerbie
Posts: 934
Joined: Sat Jul 05, 2008 5:14 am UTC
Location: USA

Re: Concurrent programming

Postby cerbie » Wed Nov 11, 2009 1:27 am UTC

Was it this?

P.S. You know you're hopeless, when you go, ooo! upon seeing, "Model Based Testing of Data Constraints," in the side list...
Last edited by cerbie on Wed Nov 11, 2009 7:57 am UTC, edited 1 time in total.

nteon
Posts: 26
Joined: Sun Oct 07, 2007 9:15 pm UTC

Re: Concurrent programming

Postby nteon » Wed Nov 11, 2009 3:15 am UTC

You could use C with libdispatch (otherwise known as Apple's Grand Central Dispatch) if you're on FreeBSD or Mac (Linux isn't supported yet).
http://libdispatch.macosforge.org

Learning OpenCL is another interesting option.

Finally, Clojure is a fast Lisp built for concurrency, so you could look into that since you're already Lisping it up.
It's 106 miles to Chicago, we got a full tank of gas, half a pack of cigarettes, it's dark, and we're wearing sunglasses. Hit it.

User avatar
Cleverbeans
Posts: 1378
Joined: Wed Mar 26, 2008 1:16 pm UTC

Re: Concurrent programming

Postby Cleverbeans » Wed Nov 11, 2009 8:22 pm UTC

eds01 wrote:The game is played on an n-by-n board (generally 19 by 19, although 9 by 9 and 13 by 13 boards are also used), and each intersection can have 3 states: empty, black or white, kind of like in Othello.


Is it Go? Just guessing based on your description and avatar.
"Labor is prior to, and independent of, capital. Capital is only the fruit of labor, and could never have existed if labor had not first existed. Labor is the superior of capital, and deserves much the higher consideration." - Abraham Lincoln

User avatar
eds01
Posts: 109
Joined: Tue Apr 10, 2007 12:34 am UTC

Re: Concurrent programming

Postby eds01 » Wed Nov 11, 2009 8:52 pm UTC

Cleverbeans wrote:Is it Go? Just guessing based on your description and avatar.


Yeah, it's a Go AI. A professor I know wrote a couple of heuristic functions which can do things like calculate influence to rule out stupid moves, but do no reading of a position. They have the advantage of being really fast, but they'll suggest moves which don't actually do much good for tactical reasons (e.g. strengthening a lost cause). The Monte Carlo bit will basically do the reading. I'm interested in how it will compare to programs that use UCT (an extension of the Monte Carlo idea).

headprogrammingczar wrote:Processes also don't communicate well with each other. On Windows, they don't share their memory space, which forces you to use the OS as a middleman. If you are looking for performance, multi-process programming is a terrible idea.


They don't really need to communicate much - each playout runs on its own, and only has to communicate whether it was a win or a loss. How bad is it in that case?



Also, Berengal, I was looking up SIMD, and that doesn't quite look like what I'm looking for. While the rules of the game are the same for all of the playouts, the calculations will take slightly different steps (for example, finding all of the stones in a group recurses deeper the larger the group is). All of the games are randomly played out, so they will all be different. The main concern is just running all of them concurrently, on different cores (because more cores means more games played out, which means stronger moves, at least in theory).

shieldforyoureyes
Posts: 342
Joined: Sat Apr 19, 2008 9:00 am UTC
Contact:

Re: Concurrent programming

Postby shieldforyoureyes » Wed Nov 11, 2009 9:12 pm UTC

Sounds to me like you should split it up at the process level. The bonus there is that you can distribute the load over separate computers. Don't thread if you don't have to.

rabuf
Posts: 15
Joined: Thu Mar 20, 2008 2:30 pm UTC

Re: Concurrent programming

Postby rabuf » Thu Nov 12, 2009 1:44 am UTC

Further information on what systems you can access would be useful.

OpenMPI: If you already have the code written in C, OpenMPI can be used (a little less easily, since you deal more with its API calls) to distribute these tasks across a cluster. I used this on a (dumb) raytracer program once, and it was fairly straightforward. I didn't deal much in optimizing data structures, and the scenes themselves were rather straightforward (yours will be even less complex). My initial version just bolted a simple scatter/gather onto my original code. In my main loop over every pixel, instead of processing them all in one process, I shoved off #Pixels/#Systems pixels to each system in the cluster that I'd requested. Thanks to the simplicity of the algorithm I got a linear speedup on the cluster available (from 1 to 32 processors).

OpenMP: If you already have the code written in C, OpenMP can be used to (fairly easily) parallelize it to run on multiple cores/processors in a shared memory configuration. Similar to the MPI project above, I added a few pragmas (no need for API calls in my sole use of it) and the main for loop was distributed across whatever number of cores I made available. Again, a linear speedup on my system for that algorithm (only 4 cores available, though).

OpenCL/CUDA: If you have access to any Nvidia card since late 2006 (GeForce 8xxx and up), or ATI cards from (someone else can answer this), then OpenCL may be an option (CUDA is Nvidia-exclusive). Benefits of OpenCL include being able to run the same code on different graphics cards and even on standard processors. So even if you don't have a GPU to test it on, you can still try it out on your local system, though you may still run into trouble when you do get it onto a GPU. Note: OpenCL/CUDA are actually C-derived languages with C/C++ APIs. You can run these programs either as full programs written in OpenCL/CUDA, or as kernels (think the inner part of a for loop) managed by some other language via the APIs (C, Python, probably others).

All three of the above can be combined with each other in some fashion, should you have access to a cluster supporting it. For instance, if you have multiple GPUs, you could use OpenMP to spawn one thread to handle the output from each GPU and collect the results at the end. Or a cluster of systems with GPUs could be used with OpenMPI distributing the tasks, the tasks running on separate GPUs with OpenCL, and OpenMPI collecting the results.

Erlang with HiPE (High Performance Erlang) allows you to make use of multiple cores. Along with its (relatively) straightforward method of communicating between nodes, it could be useful for distributing this kind of task, especially if you're not dealing heavily with numerical computations (a traditional weakness; anyone know if this is still true?). It's also possible to have it call out to C code, so even if you need something more efficient doing the computations, you can use Erlang to handle the communications and C to do the heavy lifting.

