Testing a cluster

Topics: Solving, 4.2


10 Replies   |   Last post: October 7, 2011 12:25pm UTC
Vu Le

August 27, 2011 9:30am UTC

Testing a cluster

Hi,

I just set up COMSOL on a Red Hat cluster and have been testing the models that came with the installation. So far, every model I have tried runs about as fast on 1 node as on 12 nodes. Does anyone know of a model I can test on my cluster that will clearly show the gains of running COMSOL on a cluster?

Thank you in advance.


James Freels

August 27, 2011 3:59pm UTC in response to Vu Le

Re: Testing a cluster

If you search for my last name on the COMSOL web site, you will find several COMSOL Conference papers in which I have included parallel-processing performance results from our RHEL cluster here. We purchased the cluster essentially for COMSOL (though we use it for other purposes as well). I have included a couple of the links below. The first paper only shows results for shared-memory parallel processing, from before COMSOL could exercise distributed parallel processing. Some DPP results are shown in the second link. We are now exercising the parallel capability on our "real" problem and will soon be using the new iterative-solver parallel capability in v4.2.

www.comsol.com/shared/download..._safety_related_procedures.pdf

www.comsol.com/papers/7970/download/freels_presentation.pdf

There is a keynote by Dr. Darrel Pepper at this year's conference, and I am looking forward to hearing about his experiences on (perhaps) a larger cluster. I hope to try out COMSOL soon on a larger cluster here at ORNL. So far, with 12 nodes, I have seen no rollover in the speedup. I suspect COMSOL can scale to a fairly large number of compute nodes, provided the communication link is also very fast (InfiniBand or better).


Dragos Constantin

September 15, 2011 9:11pm UTC in response to James Freels

Re: Testing a cluster

Hi James,
I have looked over your slides and have a question. In the second presentation you show memory usage for distributed parallel processing. Your plot makes sense to me, but what I observe in my own case is not what I expected.

I am increasing the number of nodes in my cluster without seeing a decrease in the memory used per node (each node has 8 virtual cores). In other words, a problem that needs 500GB of memory on 10 nodes will use 1TB of memory on 20 nodes. Any idea why I see this behaviour?
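In numbers, it looks as if each node keeps a roughly constant footprint instead of sharing the total. A quick sketch (the 50 GB/node figure is just the 500 GB example divided out, not a separate measurement):

```python
# Two scalings: ideal distributed memory (total stays fixed, the
# per-node share shrinks) vs. what I observe (per-node usage stays
# flat, so the total grows with the node count).
PER_NODE_GB = 50  # illustrative: 500 GB spread over 10 nodes

def total_ideal(total_gb, nodes):
    """Ideal: the problem is partitioned; total memory stays put."""
    return total_gb

def total_observed(nodes):
    """Observed: each node uses ~50 GB regardless of the node count."""
    return PER_NODE_GB * nodes

print(total_observed(10), total_observed(20), total_ideal(500, 20))
```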

Thank you in advance for your help.
Dragos





Dragos Constantin

September 15, 2011 10:01pm UTC in response to Dragos Constantin

Re: Testing a cluster

Dear All,
I have read the documentation again....

For memory-hungry models one should use -nn 10 -np 8 rather than -nn 80.

For small models, -nn 80 gives the best gain.

In the last couple of days I went nuts over this. I can report that the simulation time drops from 2655 s to 755 s and the memory from ~50 GB/node to ~16 GB/node.
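For reference, the two launch styles side by side (a sketch, assuming the standard comsol batch entry point; the model file name is a placeholder for my actual input file):

```shell
# The two launch styles discussed above; model.mph is a placeholder.
MODEL=model.mph

# Memory-hungry model: one COMSOL process per physical machine,
# each process using 8 cores via shared memory (10 machines x 8 cores).
CMD_BIG="comsol batch -nn 10 -np 8 -inputfile $MODEL"

# Small model: one distributed process per core, 80 in total.
CMD_SMALL="comsol batch -nn 80 -inputfile $MODEL"

echo "$CMD_BIG"
echo "$CMD_SMALL"
```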

James, I suppose your memory-usage plot uses the first configuration, where you specify the number of processors (cores) per node. Am I right?

Thanks,
Dragos



James Freels

September 15, 2011 10:14pm UTC in response to Dragos Constantin

Re: Testing a cluster

Yes, that is correct. In nearly every case I can remember, I have specified both -nn and -np on the command line. This is reinforced now because our cluster has some nodes with more cores than others: our older nodes have 8 cores per node, but our new nodes have 16 cores per node. It seems to me that COMSOL requires the same number of cores per node in order to balance the load.

My issue now is that the load does not balance during the periods the solver spends in the finite-element assembly process. During the actual solve and matrix factorization it is fairly well balanced, but not during assembly. The memory utilization is fairly balanced, but not linear with the number of nodes, as the surface plots show.

I have not run on a large cluster yet. I expect to do that soon. I am interested in your performance on 80 nodes.

One of the keynote talks at the conference this year (Pepper) will discuss performance on large clusters.


Dragos Constantin

September 15, 2011 11:02pm UTC in response to James Freels

Re: Testing a cluster

Hi James,
So, let's say you have two nodes, one with 8 cores and one with 16 cores. Would you use -nn 3 -np 8 to try to balance the load? I will try to boot instances with half the number of cores and see how they behave. In any case, I can see the cores are not working all the time as they did when I specified -nn 80 instead of -nn 10 -np 8, but at least I am no longer running out of memory.

Thanks,
Dragos




James Freels

September 16, 2011 2:24am UTC in response to Dragos Constantin

Re: Testing a cluster

No, I would use -nn 2 -np 8.

I interpret the switches as:

-nn: number of compute nodes

-np: number of processors per compute node (i.e., the total number of cores per compute node).

So, in this situation, you end up with 8 cores on the 16-core node that are not used by COMSOL and are free to be used by another application.

I think in your case, when you set -nn 80, it was filling up the first set of compute nodes in your cluster, then starting on another set, and so forth. So if you have a 10-node cluster, you probably had 8 instances of COMSOL running on each compute node. I am amazed it ran at all; it must have been writing a lot to swap space on your disk drives.
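A toy sketch of that filling behavior (this is my assumption about how the ranks get packed onto machines, not COMSOL's documented scheduler):

```python
# Toy model: -nn processes are packed onto physical machines, each
# machine filling up to its core count before the next one is used.
def pack(nn, machines, cores_per_machine=8):
    """Return how many COMSOL processes land on each machine."""
    counts = [0] * machines
    m = 0
    for _ in range(nn):
        counts[m] += 1
        if counts[m] >= cores_per_machine:
            m = (m + 1) % machines
    return counts

# -nn 80 on a 10-machine, 8-core cluster: 8 instances on every machine.
print(pack(80, 10))
```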

How many compute nodes do you have in your cluster, how much memory on each, and how many cores on each? I can then recommend settings to use. How big a problem are you running?


Dragos Constantin

September 16, 2011 10:09pm UTC in response to James Freels

Re: Testing a cluster

Hi James,
If you want to use all the cores in your cluster, I encourage you to try -nn 3 -np 8; even -nn 6 -np 4 should work (for the example with two physical machines with 16 and 8 cores, respectively). MPD should be able to distribute the nodes and processors accordingly. I think it also depends on how you start mpd, but I am not positive here. In my case I start mpd on each individual machine and use --ncpus to specify the number of cores on each. This way I end up with a pool of 80 cores, and it is up to me to choose the right combination of nodes and processors. The only constraint is

nodes*processors<=80

I do not think of the number of nodes (-nn) as the number of physical machines, but rather as the number of logical compute cluster nodes. This way I can choose -nn from 8 up to 80.

I have run an example with the following combinations:

-nn 10 -np 8 (one compute cluster node per machine)
-nn 20 -np 4 (two compute cluster nodes per machine)
-nn 40 -np 2 (four compute cluster nodes per machine)
-nn 80 (no -np) (8 compute cluster nodes per machine)

MPD is smart enough to distribute the compute nodes uniformly across the cluster. The documentation states that a big model should be run with the first configuration, whereas a small model benefits from the last one. Also, I am not at all worried about large-scale deployments, as MPI should work with a really huge pool of processors. I can tell you I have solved a problem using -nn 160 and had no issues.
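A trivial check for that constraint (just the arithmetic; nothing COMSOL-specific):

```shell
# Verify that an -nn/-np combination fits the 80-core MPD pool
# (the nodes*processors <= 80 constraint above).
POOL=80

fits() {
    nn=$1
    np=$2
    if [ $((nn * np)) -le "$POOL" ]; then
        echo "ok: -nn $nn -np $np"
    else
        echo "exceeds pool: -nn $nn -np $np"
    fi
}

fits 10 8   # one worker per machine
fits 80 1   # one worker per core
fits 16 8   # oversubscribes the pool
```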

Thanks,
Dragos




James Freels

September 17, 2011 3:41pm UTC in response to Dragos Constantin

Re: Testing a cluster

I was not aware that you could do this sort of thing, but it makes sense. I wonder, though: if you break up a physical compute node into smaller compute nodes, doesn't that mean the data transfer for distributed parallel processing takes place across Ethernet (or InfiniBand in our case) instead of through shared memory? Is MPI smart enough to detect that it could transfer data within the bandwidth of the motherboard (faster) instead of over the cables (slower)?

Also, I wonder: if you break a physical processor into virtual processors (for lack of a better word) by using a larger number for the -nn switch than the number of compute nodes you physically have, does the efficiency within the processor go up or down? For example, for my currently running job, COMSOL is on average using 6/8 of the cores (looking at the 15-minute average over an hour, day, or week with the Ganglia tool) on some nodes, but only 2/8 cores on another node. COMSOL is not dividing the load very well during the finite-element assembly process. So, if you break it down into smaller virtual compute nodes, does this efficiency go up or down?

In other words, for a given job, will this make it run faster or slower ?


Dragos Constantin

September 20, 2011 4:35am UTC in response to James Freels

Re: Testing a cluster

Hi James,
If you have a machine with more than one worker, MPI uses the loopback interface to communicate between workers on the same machine. However, as you increase the number of workers, the demand for memory increases. So, let's say you have 10 physical machines, each with 8 cores. From my experience, if you want to reduce the amount of memory used, you will launch the job with the

-nn 10 -np 8

configuration, i.e. one worker per physical machine. My models are big, so I monitor the memory load. I can report that, besides the unbalanced CPU load you have observed, the memory is not equally distributed among the 10 physical machines either. If your models are small you can do a

-nn 80 -np 1

in which case the memory requirement is more than double that of the previous case. However, your cores will work ALL the time, because you have assigned one worker per core.

You will have to experiment with a real cluster and see whether there is a speed gain or not. I can tell you I did not see a definitive speedup, but that might be because my cluster is virtual. In my case the average ping time between nodes is between 0.27 ms and 0.35 ms.

Please let me know if you observe any change in speedup for different -nn/-np configurations.

Thanks,
Dragos




Alain Glière

October 7, 2011 12:25pm UTC in response to Vu Le

Re: Testing a cluster

Coming back to the initial post by Vu Le, I would like to share some timing results obtained on our Red Hat Linux cluster, which point out the importance of the choice of test case.


For some time, I got the same kind of disappointing results as Vu Le. The test case I used was a linear diffusion equation with a very fine mesh and 8 coupled unknowns, 6.7 MDoFs altogether. The solver used was MUMPS, adapted to shared-memory computation. The test case looked fine, but the results showed no improvement at all beyond a single node:

nodes  cores  time (s)
  1      1      479
  1      8      269
  2      8      284
  3      8      284


As this agreed neither (i) with my expectations nor (ii) with the results presented by James Freels at the COMSOL Conference, I contacted him. On his advice I built another test case, designed to spend more time in the matrix factorization part, where parallelism is efficiently implemented. The case is smaller than the previous one (1.06 MDoFs) and the 3D geometry is basic, but the coupled system of diffusion equations is now nonlinear. The speed-up is still not that obtained in Monte Carlo computations, but this is finite elements:

nodes  cores  time (s)  memory (GB)
  1      1     9059        32
  1      4     2530        32
  1      8     1493        32
  2      8      923        18
  3      8      725        16
  4      8      573        11
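For reference, the speed-up and parallel efficiency implied by the second table (computed from the times above, with the 1-node, 1-core run as baseline):

```python
# Speed-up and parallel efficiency derived from the timing table above
# (nonlinear 1.06 MDoF case).
timings = [  # (nodes, cores per node, wall time in seconds)
    (1, 1, 9059),
    (1, 4, 2530),
    (1, 8, 1493),
    (2, 8, 923),
    (3, 8, 725),
    (4, 8, 573),
]

t_serial = timings[0][2]
for nodes, cores, t in timings:
    total_cores = nodes * cores
    speedup = t_serial / t
    print(f"{total_cores:2d} cores: speed-up {speedup:5.2f}, "
          f"efficiency {speedup / total_cores:.2f}")
```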


Best regards,

Alain Glière


