|
RDBMS, to rewrite or not to rewrite… I got confused… |
| February 19th, 2008 under Devel, Algorithms, Distributed, rengolin, Computers, Software. [ Comments: none ]
|
|
Mike Stonebraker (Ingres/Postgres) seems to be confused as well…
First he said Google’s Map/Reduce was “Missing most of the features that are routinely included in current DBMS”, but earlier he said to ditch RDBMS anyway because “modern use of computers renders many features of mainstream DBMS obsolete”.
So, what’s the catch? Should we still use RDBMS or not? Or should we keep developing technologies based on relational databases while Mike himself develops the technology of the new era? Maybe that was the message anyway…
My opinion:
MapReduce is not a step backwards: there are times when indexing is actually slower than brute force. And I’m not talking about insert time, when the indexes have to be updated and so on. I’m talking about the actual search for information: if the index is too complex (or too big), it might take more time to search through the index, compute the location of the data (which might be anywhere in a range of thousands of machines), retrieve the data and then sort or search on the remaining fields.
MapReduce can effectively do everything in one step, while the data is still on the machine that holds it, and return fewer values per search (as opposed to doing primary key searches first). Therefore less data is sent over the network and less time is taken.
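To make the idea concrete, here is a minimal sketch in plain shell, not Google's implementation. The shard files, the (user, status) record layout and the function names map_shard and reduce_counts are all made up for the illustration. Each shard is scanned by brute force where it lives, only the matching keys leave the "machine", and a single reduce step sums them up.

```shell
#!/bin/sh
# Hypothetical data: two "shards" of (user, status) records, one file per machine.
workdir=$(mktemp -d)
printf 'alice ok\nbob fail\nalice fail\n' > "$workdir/shard1"
printf 'bob fail\ncarol ok\nalice fail\n' > "$workdir/shard2"

# Map: brute-force scan of one shard, emitting "key 1" only for matching rows,
# so only the small filtered result has to travel over the network.
map_shard() {
    awk -v want="$1" '$2 == want { print $1, 1 }' "$2"
}

# Reduce: sum the counts per key (sorted, because awk's array order is unspecified).
reduce_counts() {
    awk '{ total[$1] += $2 } END { for (k in total) print k, total[k] }' | sort
}

# Count the failures per user across all shards.
result=$(for shard in "$workdir"/shard*; do map_shard fail "$shard"; done | reduce_counts)
echo "$result"
rm -rf "$workdir"
```

In a real cluster the loop over shards would run on thousands of machines at once. Here it is sequential, just to show the shape of the computation.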
Of course, MapReduce (like any other brute-force method) is hungry for resources. You need a very large cluster to make it really effective (1800 machines is enough :)) but that’s a step towards something different from RDBMS. In the distributed world, RDBMS won’t work at all, something has to be done and Google just took the first step.
Did we wait for warp-speed to land on the moon?! No, we got a flying foil crap and landed on it anyway.
Next steps? Many… we can continue with brute-force and do a MapReduce on the index and use the index to retrieve in an even larger cluster, or use automata to iteratively search and store smaller “views” somewhere else, or do statistical indexes (quantum indexes) and get the best result we can get instead of all results… The possibilities are endless…
Let’s wait and see how it goes, but yelling DO IT and then later DON’T is just useless…
UPDATE:
This is not a rant against Stonebraker. I share his ideas about the relational model being far too outdated and the need for something new. What I don’t agree with, though, is that MapReduce is a step backwards. Maybe it’s not even a step forward, probably sideways.
The whole point is that the relational model is the thesis and there are lots of antitheses, we just haven’t come up with the synthesis yet.
|
|
LSF, Make and NFS 2 |
| November 27th, 2007 under Unix/Linux, Distributed, rengolin, Computers. [ Comments: none ]
|
|
I recently posted this entry about how the NFS cache was playing tricks on me and how sleep 1 kinda solved the issue.
The problem got worse, of course. I raised the sleep to 5 seconds and in some cases it was still not enough, then I learnt from the admins that the NFS cache timeout was 60 seconds!! I couldn’t sleep 60 on all of them, so I had to come up with a script:
timeout=60
slept=0
while [ ! -s "$file" ] && (( slept < timeout )); do sleep 5; slept=$((slept+5)); done
In a way it’s not as ugly as it may seem… First, the alternative is to change the configuration (either disable the cache or reduce the timeout) for the whole filesystem, and that would affect others. Second, I now wait for (almost) the right amount of time, and only when I need to (the first -s test finds the file straight away if there is no problem).
At least, sleep 60 on everything would be much worse! 
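The same loop can be wrapped in a small function so that every rule can call it. This is only a sketch: the name wait_for_file and its default values are my own choices for the example, and the demo at the end fakes the NFS delay with a background job that creates the file a second later.

```shell
#!/bin/sh
# wait_for_file FILE [TIMEOUT] [STEP]
# Poll until FILE exists and is non-empty, giving up after TIMEOUT seconds.
# Returns 0 as soon as the file is visible, 1 on timeout.
wait_for_file() {
    _file=$1
    _timeout=${2:-60}
    _step=${3:-5}
    _slept=0
    while [ ! -s "$_file" ]; do
        [ "$_slept" -ge "$_timeout" ] && return 1
        sleep "$_step"
        _slept=$((_slept + _step))
    done
    return 0
}

# Demo: the file shows up after a short delay, as if the NFS cache were catching up.
demo_dir=$(mktemp -d)
( sleep 1; echo data > "$demo_dir/bar" ) &
wait_for_file "$demo_dir/bar" 10 1 && echo "bar is visible"
wait
rm -rf "$demo_dir"
```

Inside a script, a call like wait_for_file bar 60 5 followed by the real command would then take the place of the plain sleep.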
|
|
Sam is dead |
| November 9th, 2007 under Distributed, rengolin. [ Comments: none ]
|
|
I regret to announce - this is the end. After a long life in service (and even longer in coma), Samwise Gamgee is dead as a parrot.
This summer, Sam contracted a weird disease where all characters on its screen were misplaced and some new ones were added (it kinda looked like Dutch), and its boot process was then interrupted. It had nothing to do with Linux, the AMI BIOS screen was bogus (or Dutch) too.
Some say that when you’re dying you start to go back to your childhood and unlearn everything. I believe Samwise was not from Middle Earth but in fact from Holland.
In spite of all attempts to bring him out of the coma…

he was proclaimed dead at precisely 21:09 on 9th November 2007 and was properly unplugged.

Now, he rests in peace with his friends (still on duty) in his very own place of origin (not Holland, I mean).

|
|
LSF, Make and NFS |
| October 17th, 2007 under Unix/Linux, Algorithms, Distributed, rengolin. [ Comments: 2 ]
|
|
I use LSF at work, a very good job scheduler. To parallelize my jobs I use Makefiles (with the -j option) and inside every rule I run the command with the job scheduler. Some commands call other Makefiles, cascading the spawning of jobs even further. Sometimes I reach 200+ jobs in parallel.
Our shared disk, a BlueArc, is also very good, with access times quite often faster than my local disk. Yet, for almost two years I’ve seen some odd behaviour when putting all of them together.
I’ve reported random failures in processes that had worked until then and that, without any modification, worked ever after. But not long ago I figured out what the problem was… NFS refresh speed vs. LSF spawn speed using Makefiles.
When your Makefile looks like this:
bar.gz:
	$(my_program) foo > bar
	gzip bar
There isn’t any problem because as soon as bar is created gzip can run and create the gz file. Plain Makefile behaviour, nothing to worry about. But then, when I changed to:
bar.gz:
	$(lsf_submit) $(my_program) foo > bar
	$(lsf_submit) gzip bar
Things started to go crazy. Once every few months, one of my hundreds of Makefiles just finished saying:
bar: No such file or directory
make: *** [bar.gz] Error 1
And what’s even weirder, the file WAS there!
During the period when these magical problems were happening (fortunately I was streamlining the Makefiles every day, so I could just restart the whole thing and it went as planned), I had another problem, quite common when using NFS: NFS stale handle.
I have my CVS tree under the NFS filesystem and when testing some perl scripts between AMD Linux and Alpha OSF machines I used to get these errors (the NFS cache was being updated) and had to wait a bit or just try again in most cases.
It was then that I figured out what the big random problem was: NFS stale handle! Because the Makefile rules were running on different computers, the NFS cache took a few milliseconds to update and the LSF spawner, berserk for performance, started the new job way before NFS could reorganize itself. This is why the file was there after all: it was on its way, and the Makefile crashed before it arrived.
The solution? Quite stupid:
bar.gz:
	$(lsf_submit) "$(my_program) foo > bar" && sleep 1
	$(lsf_submit) gzip bar
I’ve put it on all rules that have more than one command being spawned by LSF and never had this problem again.
The smart reader will probably tell me that it’s not just ugly, it doesn’t cover all cases at all, and you’re right, it doesn’t. An NFS stale handle can take more than one second to update, single-command rules can break on the next hop, etc. But because there is some processing between them (rule calculations are quite costly, run make -d and you’ll know what I’m talking about) the probability is too low for our computers today… maybe in ten years I’ll have to put sleep 1 on all rules…
|
|
Yet another supercomputer |
| October 2nd, 2007 under Unix/Linux, Algorithms, Distributed, rengolin, Computers. [ Comments: none ]
|
|
SiCortex is to launch their cluster-in-a-(lunch)-box with promo video and everything. Seems pretty nice but some things worry me a bit…
Of course a highly interconnected backplane and some smart shortest-path routing algorithms (probably not as good as Feynman’s) are much faster (and more reliable?) than gigabit ethernet (Myrinet also?). Of course, all-in-one-chip technology is much faster, safer and more economical than any HP or IBM 1U node money can buy.
There is also some eye candy, like a pretty nice external case, dynamic resource partitioning (like VMS), a native parallel filesystem, MPI-optimized interconnection and so on… but do you remember the Cray-1? It had wonderful vector machines but in the end it was so complex and monolithic that everyone got stuck with it and never used it anymore.
Is assembling a 1024-node Linux cluster with PC nodes, Gigabit, PVFS, MPI etc. hard? Of course it is, but the day Intel stops selling PCs you can use AMD (and vice-versa), and you won’t have to stop using the old machines until you have a whole bunch of new ones up and running, transparently integrated with your old cluster. If you do it right you can have a single Beowulf cluster running Alphas, Intel, AMD, Suns etc. Just take care of the paths and the rest is done.
I’m not saying it’s easier, nor cheaper (the costs of air conditioning, cabling and power can be huge), but being locked to a vendor is not my favourite state of mind… Maybe it would be better if they had smaller machines (say 128 nodes) that could be assembled into a cluster and still allowed external hardware to be connected, with intelligent algorithms to understand the cost of migrating processes to external nodes (based on network bandwidth and latency). Maybe it could even ease their entry into existing clusters…
|
|
Middle Earth: Proxy |
| May 8th, 2007 under Technology, Distributed, rengolin. [ Comments: none ]
|
|
When updating the nodes I have to download the same packages several times (N times for N nodes), so a good idea is to have a proxy that does it for me once, with all nodes getting the local copy. For that we have the good old squid.
On the Master node:
$ sudo apt-get install squid
Then edit the config file. It’s rather huge, but search for acl localhost and add the lines below:
acl cluster src 192.168.2.0/24
http_access allow cluster
assuming your cluster is on that subnet.
Now, on each node (including the Master) set the environment variables (in .bashrc):
export http_proxy="http://master-node:3128/"
export ftp_proxy="http://master-node:3128/"
Also, a good idea is to increase the maximum cached object size from 4MB to, say, 400MB, because the idea is to cache deb packages and not webpages. You can also limit the global size of the cache (to something like 1GB) so old packages will be deleted.
# Per object (400MB)
maximum_object_size 409600 KB
minimum_object_size 64 KB
# Global (1GB)
cache_dir ufs /var/spool/squid 1000 16 256
Restart squid and you’re ready to go:
$ sudo /etc/init.d/squid restart
|
|
Middle Earth: shared disk |
| December 18th, 2006 under Technology, Distributed, rengolin. [ Comments: none ]
|
|
To stop copying everything all the time I needed a shared disk. Parallel Virtual File System was my parallel FS of choice, but I also needed something quick to set up for tests, even if it’s not so fast or reliable. For that, I chose NFS. Later I can install PVFS if I need to.
Well, installing NFS on Ubuntu is VERY simple!
Server:
Install the packages:
sudo apt-get install nfs-user-server nfs-common
Then, edit the /etc/exports file in the server:
/scratch/global frodo(rw) sam(rw) merry(rw) pippin(rw)
Create the directory, with permission to the group users:
mkdir /scratch/global/
chmod g+ws /scratch/global/
chgrp users /scratch/global/
and start the service:
sudo /etc/init.d/nfs-user-server restart
Client:
Install the package:
sudo apt-get install nfs-common
Edit the /etc/fstab and add the mount point:
gandalf:/scratch/global /scratch/global nfs rw 0 0
Create the directory and mount it:
sudo mkdir /scratch/global/
sudo mount /scratch/global/
That’s just it… really.
|
|
Open MPI |
| September 8th, 2006 under Technology, Distributed, rengolin. [ Comments: none ]
|
|
Open MPI is the new trend in MPI applications. It promises to deliver a high quality MPI1 and MPI2 compliant implementation, replacing all other implementations to date.
Of course, this is far too much to assume for new software, even for such a big project. It not only lacks documentation and a step-by-step guide to using the system, but it’s also not MPI2 compliant yet and there are still many basic bugs unfixed.
But don’t think it’s bad, because it’s not. The architecture was quite well planned, the code is being carefully written as far as I could see, and it has many options for debugging the server and running MPI programs. It also has a component system where you can add new functionality without patching the main code, which is a great deal for programs that aim to be the standard one day.
LAM is being deprecated because most of their team is working on Open MPI, which is almost what happened to Mozilla and Firefox. But they make a statement on their pages that’s not true: “Since it’s an MPI implementation, you should be able to simply recompile and re-link your applications to Open MPI — they should ‘just work.’”
Talking to a friend (the one who found code that didn’t compile straight away) I found out that MPICH2 is still far better for performance and MPI2 compliance. Also, installing and running LAM here showed me that LAM is still more stable and easier to use than Open MPI.
Let time play its part and see what comes out of it…
|
|
Middle Earth: MPI |
| August 28th, 2006 under Technology, Distributed, rengolin. [ Comments: none ]
|
|
MPI stands for Message Passing Interface and is a system to execute programs across nodes in a cluster using a message passing library to enable communication among nodes. It’s a very powerful library and is now the standard for parallel programs.
Normally I’d choose LAM MPI as I always did in the past but I wanted to test MPICH, another very famous MPI implementation.
But what I found out was that the MPICH version for Ubuntu is rather old, the online documentation is completely different from what I had, and there was no documentation at all in any Ubuntu package I could find (for instance, my config file was apache-like and the new one is XML, so I couldn’t even start the service).
Well, I guess that the best always wins, and that’s the third time I’ve chosen LAM over MPICH because of exactly the same problem: installation and documentation.
Installing LAM MPI was very simple. On the master node (gandalf) I installed:
$ sudo apt-get install lam-runtime lam4c2 lam4-dev
And on the execution nodes, just the runtime:
$ sudo apt-get install lam-runtime lam4c2
MPEasy
A while ago I developed a set of scripts to help run and sync a LAM MPI cluster when you don’t have a shared disk to use within the cluster yet (still my case). So it’s specially designed for home clusters and for the early days of a more serious cluster, when you haven’t had time to set up a shared disk yet.
So, installing MPEasy is easy: download the tarball, unpack it into some directory and set the environment variables in your startup script:
On .bashrc:
export MPEASY=~/mpeasy
export PATH=$PATH:$MPEASY/bin
On .cshrc:
setenv MPEASY ~/mpeasy
setenv PATH $PATH:$MPEASY/bin
And put the node list, one per line, in $MPEASY/conf/lam_hosts. After that, just running:
$ bw_start
should start your mpi cluster. After that you can start some MPI tests. Go to the $MPEASY/devel/test directory and compile the hello.c.
$ mpicc -o hello hello.c
Then, you need to sync the current devel directory to all nodes:
$ bw_sync
And run:
$ bw_run 10 $MPEASY/devel/test/hello
You should be able to do the same with all the other programs in it. Just remember to sync before running, otherwise you’ll have an outdated version on the nodes and you’ll have problems. In a shared disk environment it wouldn’t be a problem, of course.
|
|
PI Monte Carlo - Distributed version |
| August 28th, 2006 under Algorithms, Distributed. [ Comments: none ]
|
|
In April I published an entry explaining how to calculate PI using the most basic Monte Carlo algorithm and now, using the Middle Earth cluster, I can do it in parallel.
Parallelizing Monte Carlo is a very simple task because of its random and independent nature, and this basic Monte Carlo is even easier. I can just run exactly the same routine as before on all nodes and at the end sum everything and divide by the number of nodes. To achieve that, I just changed the main.cc file to use MPI, quite simple indeed.
The old main.cc just called the function and returned the value:
area_disk(pi, max_iter, delta);
cout << "PI = " << pi << endl;
But now, the new version has to know whether it’s the main node or a computing node. After that, all computing nodes calculate the area and the main node gathers and sums the results.
/* Nodes, compute and return */
if (myrank) {
    area_disk(tmp, max_iter, delta);
    MPI_Send( &tmp, 1, MPI_DOUBLE, 0, 17, MPI_COMM_WORLD );
/* Main, receive all areas and sum */
} else {
    for (int i=1; i < size; i++) {
        MPI_Recv( &tmp, 1, MPI_DOUBLE, i, 17, MPI_COMM_WORLD, &status );
        pi += tmp;
    }
    pi /= (size-1);
    cout << "PI = " << pi << endl;
}
In MPI, myrank tells you your node number (rank) and size gives the total number of nodes. In the most basic MPI program, if myrank is zero you’re the main node, otherwise you’re a computing node.
All computing nodes calculate the area and MPI_Send the result to the main node. The main node waits for all responses in the main loop, adds each temporary result tmp to pi and at the end divides by the number of computing nodes.
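If you don’t have an MPI cluster at hand, the same compute-then-average pattern can be sketched with background processes in plain shell. This is not the code from main.cc: the function estimate_pi, the seeds and the iteration count are made up for the example, awk’s rand() stands in for the real random number generator, and a pipe plays the role of MPI_Send/MPI_Recv.

```shell
#!/bin/sh
# One "computing node": estimate PI from ITER random points in the unit square,
# counting how many fall inside the quarter disk. Each node gets its own seed.
estimate_pi() {
    awk -v seed="$1" -v iter="$2" 'BEGIN {
        srand(seed)
        for (i = 0; i < iter; i++) {
            x = rand(); y = rand()
            if (x * x + y * y <= 1) hits++
        }
        printf "%.6f\n", 4 * hits / iter
    }'
}

nodes=4        # number of computing nodes (size - 1 in the MPI version)
iter=100000    # iterations per node

# Start all nodes in the background, wait for them, and let the "main node"
# (the final awk) sum the partial results and divide by the number of nodes.
pi=$({
    rank=1
    while [ "$rank" -le "$nodes" ]; do
        estimate_pi "$rank" "$iter" &
        rank=$((rank + 1))
    done
    wait
} | awk '{ sum += $1 } END { printf "%.4f\n", sum / NR }')

echo "PI = $pi"
```

The exact digits depend on the awk implementation and the seeds, so don’t expect the same value on every machine.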
Benefits:
This Monte Carlo is extremely basic and very easy to parallelize. As a copy is run on each of N computing nodes and there’s no dependency between them, you should achieve an increase in speed close to N times the non-parallel one.
Unfortunately, this algorithm is so slow and inaccurate that even running on 9 computing nodes (i.e. 9 times faster) it’s still wrong on the third digit.
The slowness is due to the algorithm’s stupidity but the inaccuracy is due to the lack of a really good standard random number generator. Almost all machines yielded results far from the first five digits of the M_PI macro in the C library’s math.h, and the combined result was also far from it. Also, there are so many other ways of calculating PI that are so much faster that this wouldn’t ever be a good approach!
The good thing is that it was just to show a distributed Monte Carlo algorithm working…
|
|
|