OpenMP vs. MPI vs. Hadoop (MapReduce) #27
First of all, OpenMP is exclusively for shared memory environments. MPI and Hadoop were designed with distributed memory environments in mind (though both also work on a single, shared memory machine). MPI stands for "Message Passing Interface" and, in a way, that's really the only thing it has to offer: at the basic level it provides routines for sending messages (data) from one process to another.

Hadoop was specifically designed with data in mind, in particular with the Map-Reduce operation and distributed file storage / management. That is a much more specific operation than the generic "send a message from one machine to another" and can therefore be optimized. One primary difference from MPI is that the Hadoop developers took a lot of time to make sure their operations are fault-tolerant, whereas MPI's approach leans more towards the "trust that the developer knows what they're doing" end of things. (I don't have much personal experience with Hadoop, so this is perhaps where I should stop.)
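Just as a rough illustration of what "message passing" means in practice, here is a minimal sketch (my own toy example, not from any particular codebase): rank 0 sends a contiguous buffer to rank 1.

```c
/* Toy sketch of the most basic MPI pattern: one rank sending a
 * contiguous buffer of data to another.  Compile with an MPI wrapper
 * compiler (e.g. mpicc) and run with mpirun/mpiexec -n 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double data[4] = {1.0, 2.0, 3.0, 4.0};

    if (rank == 0) {
        /* Send the buffer to rank 1 with message tag 0. */
        MPI_Send(data, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f %f %f %f\n",
               data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}
```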
Hadoop's specialty is data and it was designed with that in mind. MPI is a general-purpose message passing system. You can definitely implement something like Map-Reduce in MPI (see the sketch below), but since Hadoop focuses on that operation it will most likely have better performance than whatever we could cook up in an afternoon. Still, after this class is over I might want to spend some time trying that out!
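For illustration only, here is roughly what a toy "map-reduce" could look like in MPI: each rank "maps" over its own slice of the work locally, and the partial results are "reduced" onto rank 0 with `MPI_Reduce`. This is a sketch of the general pattern, not a claim about how Hadoop does it.

```c
/* Toy map-reduce sketch in MPI: each rank squares ("maps") its local
 * share of the indices, then the partial sums are combined ("reduced")
 * onto rank 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* "Map" phase: each rank processes its own slice of the problem. */
    double local_sum = 0.0;
    for (int i = rank; i < 1000; i += size)
        local_sum += (double)i * (double)i;

    /* "Reduce" phase: combine the partial sums onto rank 0. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of squares 0..999 = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```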
I'm unaware of any "built-in" MPI file management tools, so I believe such things need to be done by the developer. Each message passing call involves a contiguous data buffer, though multiple disjoint buffers can be sent. Each process would have to manage some sort of file buffer / pointer and periodically perform reads/writes to disk. There's a whole world of optimization considerations here, since disk access is roughly 1000x slower than RAM access.
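To make the "developer manages the files" point concrete, here is a hedged sketch where each rank batches its data in memory and writes it to its own file; the `out.<rank>` naming is just an example I made up.

```c
/* Illustrative sketch only: without a distributed file system, each
 * MPI rank manages its own file, e.g. writing its local buffer to a
 * file named "out.<rank>". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buffer[256];
    for (int i = 0; i < 256; i++)
        buffer[i] = rank + 0.001 * i;   /* stand-in for computed data */

    char fname[64];
    snprintf(fname, sizeof fname, "out.%d", rank);

    /* Write the in-memory buffer to disk in one batch; since disk is so
     * much slower than RAM, writes should be grouped like this rather
     * than done element by element. */
    FILE *fp = fopen(fname, "wb");
    if (fp) {
        fwrite(buffer, sizeof(double), 256, fp);
        fclose(fp);
    }

    MPI_Finalize();
    return 0;
}
```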
Yes! Compilation: I haven't actually tried it myself yet, but I think you just need to include the OpenMP header and compile with your compiler's OpenMP flag (e.g. -fopenmp for GCC); the MPI wrapper compiler should pass it through.
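Something like the following, as an untested sketch (the exact flag depends on the compiler; -fopenmp is the GCC/Clang spelling):

```c
/* Minimal OpenMP example.  Sketch only -- compile with the compiler's
 * OpenMP flag, e.g.:
 *   gcc -fopenmp hello_omp.c -o hello_omp
 * For mixed MPI + OpenMP code, the MPI wrapper passes the flag through:
 *   mpicc -fopenmp hybrid.c -o hybrid
 */
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```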
The overhead for creating and communicating between processes is larger than that of threads. If each MPI process runs in a single shared memory environment (e.g. each process runs on a different machine on the network and each machine has 16 cores, say), then that MPI process can spawn multiple threads via OpenMP. Some care needs to be taken over which thread is allowed to execute MPI communication calls. Typically, one would want to separate the multi-threaded code from the communication code and only have the master thread (pre-FORK or post-JOIN) perform things like sends and receives; a sketch of this pattern is below.

I can't think of any reason why you would want to use MPI instead of OpenMP on a single machine, since most machines are shared memory environments. The code we're writing in class is mostly for demonstration purposes since (a) there is additional setup involved in communicating with multiple machines and (b) not everyone has access to a network cluster. Here is a talk I found on the OpenMP website about the topic: http://openmp.org/sc13/HybridPP_Slides.pdf
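Here is a hedged sketch of that "only the master thread talks to MPI" pattern, assuming the MPI_THREAD_FUNNELED threading level (which is the promise that only the thread that initialized MPI will make MPI calls):

```c
/* Hybrid MPI + OpenMP sketch: OpenMP threads do the shared-memory work,
 * and only the master thread (outside the parallel region, i.e. post-JOIN)
 * makes MPI calls.  MPI_THREAD_FUNNELED tells the MPI library that only
 * the thread that called MPI_Init_thread will ever call MPI. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Multi-threaded region: OpenMP threads share this process's memory. */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + i);

    /* Back on the master thread: communication between processes. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```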
Thanks!
Would you give some insight into which technology to use for which problems among the three?
OpenMP vs. MPI vs. Hadoop
It seems that Hadoop provides an automated way to distribute the data across distributed memory, saving us the work of distributing the data ourselves as we do in MPI.
Does MPI also provide a way to store your files such that they are chunked and stored in different memory blocks?
Is it possible to use OpenMP and MPI in the same code? (How would we compile that?) Would it be beneficial to run different processes on different servers (assuming we have a cluster of servers) and then, within each server, use OpenMP to use multiple cores, thus saving the MPI overhead? Or would it be better to use different communicators within a single server? Can you actually define which processes run on which machines?