Note: This discussion is about an older version of the COMSOL Multiphysics® software. The information provided may be out of date.
Cluster computing got stuck: external progress scheduling
Posted Oct 17, 2012, 8:49 a.m. EDT Installation & License Management, Studies & Solvers Version 4.3 13 Replies
model working directory shared.
However, when I try to submit a batch job using the cluster model from the model library, as described in the user guide, it gets stuck at "external progress: scheduling" (I found this in the .mph.log file).
The situation is this: if I select only one node, the head node, comsolclusterbatch.exe runs perfectly and responds correctly. If I select only the second node, the compute node, comsolclusterbatch.exe again works and returns the result correctly.
But if I set the number of nodes to 2 so that both are used, a job requiring two nodes is submitted to the HPC Job Manager and comsolclusterbatch.exe appears on both nodes, consuming some memory (ca. 60 MB), but neither process uses any CPU and the progress stays at 0.
The log is stuck at "external progress 1: scheduling".
It is very strange that COMSOL fails to run in parallel on two nodes.
I am running Windows HPC Server 2008 R2 on both the head node and the compute node.
The head node has an i7 CPU and the compute node has an AMD B28 CPU. Both have 16 GB of memory.
Looking forward to your reply.
But honestly, the person in the office next to mine runs a Linux cluster: six Dell computers connected as a cluster, running Ubuntu. Maybe Linux is the more popular platform.
If I cannot get this to run properly, I may have to switch to a Linux platform.
We do have a cluster system based on Windows HPC Server 2008 R2. It has worked quite well with COMSOL since maybe 4.2 or so (it got better with each release).
If you have any detailed questions, please feel free to send a PM.
Regards
Matthias
It could be anything from a problem in the model to a setup problem with your cluster. If you post a simple model, I can try to run it on our Windows HPC cluster to see if it works here.
Regards
Matthias
Update: We of course also run COMSOL on that cluster system...
If you continue to be stuck, before trying your neighbor's Linux cluster, you might ask COMSOL tech support for help in case it is an installation problem. Do you have other test cases that come with the Microsoft HPC cluster to make sure your cluster is working correctly?
The situation is this: I installed COMSOL on the head node, with a license containing a cluster node serial.
I shared the installation directory on the head node to make it accessible to all compute nodes, and I set up a new directory, with full permissions for everyone, to store the *.mph and *.mph.log files.
The external COMSOL installation folder points to the shared COMSOL working directory on my head node:
External working directory path: \\headnode\COMSOL43\
External file storing path: \\headnode\test1\
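A quick way to rule out a share or permission problem is to run a small check on every node before submitting the job. The following is only a sketch, not part of COMSOL: the helper name `check_shared_dir` is made up for illustration, and the two UNC paths are the ones quoted in this post. It tests that each share is reachable and, for the storage directory, that a scratch file can be created there.

```python
import os
import tempfile


def check_shared_dir(path, need_write=False):
    """Return a list of problems found with a shared directory (empty if none)."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path}: not reachable or not a directory")
        return problems
    if need_write:
        try:
            # Create and delete a scratch file to prove write permission.
            with tempfile.NamedTemporaryFile(dir=path):
                pass
        except OSError as exc:
            problems.append(f"{path}: not writable ({exc})")
    return problems


if __name__ == "__main__":
    # UNC paths taken from the post; adjust them to your own shares.
    shares = [
        (r"\\headnode\COMSOL43", False),  # shared COMSOL installation
        (r"\\headnode\test1", True),      # directory for .mph / .mph.log files
    ]
    for share, need_write in shares:
        for problem in check_shared_dir(share, need_write):
            print(problem)
```

If the script prints nothing on both the head node and the compute node, the shares themselves are probably not the cause, and the MPI or license setup is the next thing to look at.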
I am using the model from the model library, following the user guide "cluster_install_win_43.pdf".
I don't really understand what a Floating Network License is. I just want my model to be solved simultaneously by all my nodes, but it gets stuck at "external progress: scheduling".
I tried submitting a job with COMSOL commands manually in the HPC Cluster Manager, assigning two nodes including the head node and using 8 cores. Eight comsolclusterbatch.exe processes appeared in Task Manager, occupying a lot of memory and CPU without returning anything.
When I submit a job through COMSOL using the configuration and paths above, comsolclusterbatch.exe appears on both the head node and the compute node and runs for several seconds, then stops working. The process does not terminate automatically and returns nothing, and in the COMSOL GUI the progress is stuck at "external progress 1: scheduling".
I think it may be an installation problem, as my cluster passed all the MPI diagnostics.
There should be a .status and a .log file in the directory where you put your .mph file for the cluster calculation. Could you have a look at those and post them?
Regards
Matthias
Now I can run COMSOL in parallel, and the batch process terminates normally. The problem is that the batch job returns nothing. I assigned two nodes to run this batch job. The result is saved, but the progress is not updated in the COMSOL Desktop.
This is the log file, which I checked just now.
*******************************************
***COMSOL 4.3.0.151 progress output file***
*******************************************
Mon Oct 22 19:33:49 CST 2012
---------- Current Progress: 100 %
Memory: 428/450 563/582
Stationary Solver 1 in Solver 1 started at 22-Oct-2012 19:34:42.
Current Progress: 0 %
Memory: 445/469 565/591
Nonlinear solver
Number of degrees of freedom solved for: 15040.
Nonsymmetric matrix found.
Scales for dependent variables:
mod1.u: 0.0037
mod1.p: 0.01
Iter ErrEst Damping Stepsize #Res #Jac #Sol
1 31 0.0100000 32 2 1 2
- Current Progress: 10 %
Memory: 476/503 607/633
2 5.4 0.1000000 6 3 2 4
-- Current Progress: 20 %
Memory: 462/504 592/646
3 0.031 1.0000000 0.61 4 3 6
-------- Current Progress: 88 %
Memory: 483/504 616/646
4 0.0011 1.0000000 0.029 5 4 8
---------- Current Progress: 100 %
Memory: 469/504 599/646
5 4e-005 1.0000000 0.0024 6 5 10
Node 1:
Nonlinear solver
Number of degrees of freedom solved for: 15040.
Nonsymmetric matrix found.
Scales for dependent variables:
mod1.u: 0.0037
mod1.p: 0.01
Iter ErrEst Damping Stepsize #Res #Jac #Sol
1 31 0.0100000 32 2 1 2
2 5.4 0.1000000 6 3 2 4
3 0.031 1.0000000 0.61 4 3 6
4 0.0011 1.0000000 0.029 5 4 8
5 4e-005 1.0000000 0.0024 6 5 10
Stationary Solver 1 in Solver 1: Solution time: 21 s.
Current Progress: 0 %
Memory: 505/506 660/661
---------- Current Progress: 100 %
Memory: 498/561 658/721
Stationary Solver 2 in Solver 1 started at 22-Oct-2012 19:35:15.
Current Progress: 0 %
Memory: 549/572 664/721
Nonlinear solver
Number of degrees of freedom solved for: 67521.
Nonsymmetric matrix found.
Scales for dependent variables:
mod1.c: 27
Iter ErrEst Damping Stepsize #Res #Jac #Sol
1 0.64 0.0100000 0.64 2 1 2
- Current Progress: 10 %
Memory: 604/733 723/872
2 0.6 0.0722125 0.65 3 2 4
-- Current Progress: 20 %
Memory: 733/738 867/880
3 0.48 0.7221251 1.6 4 3 6
---- Current Progress: 44 %
Memory: 733/740 867/885
4 0.12 1.0000000 7.7 5 4 8
------- Current Progress: 73 %
Memory: 607/740 725/885
5 0.066 1.0000000 0.17 6 5 10
------- Current Progress: 70 %
Memory: 606/740 724/885
6 0.027 1.0000000 0.055 7 6 12
-------- Current Progress: 80 %
Memory: 607/740 724/885
7 0.0084 1.0000000 0.024 8 7 14
-------- Current Progress: 88 %
Memory: 737/740 871/885
8 0.0025 1.0000000 0.0079 9 8 16
--------- Current Progress: 94 %
Memory: 608/740 726/885
9 0.00074 1.0000000 0.0018 10 9 18
---------- Current Progress: 100 %
Memory: 584/740 700/885
Node 1:
Nonlinear solver
Number of degrees of freedom solved for: 67521.
Nonsymmetric matrix found.
Scales for dependent variables:
mod1.c: 27
Iter ErrEst Damping Stepsize #Res #Jac #Sol
1 0.64 0.0100000 0.64 2 1 2
2 0.6 0.0722125 0.65 3 2 4
3 0.48 0.7221251 1.6 4 3 6
4 0.12 1.0000000 7.7 5 4 8
5 0.066 1.0000000 0.17 6 5 10
6 0.027 1.0000000 0.055 7 6 12
7 0.0084 1.0000000 0.024 8 7 14
8 0.0025 1.0000000 0.0079 9 8 16
9 0.00074 1.0000000 0.0018 10 9 18
Stationary Solver 2 in Solver 1: Solution time: 104 s. (1 minute, 44 seconds)
Run time: 159 s.
Saving: \\headnode\samples\123.mph
Save time: 11 s.
Total time: 201 s.
And here is the status file, which thankfully says done:
1350905830192
Done
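For anyone who wants to follow such a job without the GUI, the two files can be read with a few lines of Python. This is a hedged sketch: the function names are made up, and it assumes that the first line of the status file is a timestamp in milliseconds since the Unix epoch. That assumption is an inference, but it fits the log: 1350905830192 ms corresponds to 11:37:10 UTC on 22 Oct 2012, which is 19:37:10 if the "CST" in the log header is read as UTC+8, roughly 201 s after the log start time.

```python
import re
from datetime import datetime, timedelta, timezone

SOLUTION_RE = re.compile(r"^(?P<name>.+): Solution time: (?P<sec>\d+) s\.", re.M)
TOTAL_RE = re.compile(r"^Total time: (\d+) s\.", re.M)


def parse_status(text):
    """Split status-file text into (UTC timestamp, state string).

    Assumes line 1 is milliseconds since the Unix epoch (an inference).
    """
    fields = text.split()
    seconds, millis = divmod(int(fields[0]), 1000)
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp + timedelta(milliseconds=millis), fields[1]


def summarize_log(text):
    """Return ([(solver name, solution time in s), ...], total time in s or None)."""
    solvers = [(m["name"], int(m["sec"])) for m in SOLUTION_RE.finditer(text)]
    total = TOTAL_RE.search(text)
    return solvers, (int(total.group(1)) if total else None)


if __name__ == "__main__":
    stamp, state = parse_status("1350905830192\nDone\n")
    print(stamp.isoformat(), state)
```

Calling `summarize_log` on the log text above should pick up the two "Solution time" lines (21 s and 104 s) and the total of 201 s.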
The log file looks absolutely OK. I believe the status file should say "0", but I cannot verify this at the moment.
Have you opened the saved .mph file to see if there are results inside? From the log file, they should be there.
I see the same behavior from time to time as well: jobs run fine, but the GUI does not come back. I have no clue about the reason yet. Nevertheless, I always follow jobs from the HPC Cluster Manager (or Job Manager) as well, so I see what's going on.
So maybe you could run some more tests!
Regards
Matthias
Now the progress bar sometimes works normally and the job returns the expected result for the samples in the model library. At other times the result is saved but there is no notification to open it.
Now I am working on my own model, a 2D grating of ca. 93750 + 56250 = 150000 sq um, meshed with free triangular elements with a maximum element size of 0.2, giving ca. 50 million triangles and consuming ca. 9 GB of memory.
It is a little upsetting that COMSOL consumes so much memory. One of my compute nodes returned an MPI error, which terminated my calculation.
I have 16 GB on my head node and 18 GB on two compute nodes. Should I add more compute nodes or upgrade my head node?
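As a rough sanity check on the mesh size, the smallest possible number of triangles can be estimated from the area and the maximum element size. This is a back-of-the-envelope sketch with a made-up function name, assuming the 0.2 is in um like the area: a triangle whose edges are all at most h covers the most area when it is equilateral with edge h, so dividing the domain area by that area gives a lower bound.

```python
import math


def min_triangle_count(area, h_max):
    """Lower bound on the number of triangles covering `area` when no
    triangle edge is longer than `h_max` (both in the same length unit)."""
    largest = math.sqrt(3) / 4 * h_max ** 2  # equilateral triangle with edge h_max
    return math.ceil(area / largest)


if __name__ == "__main__":
    area_um2 = 93750 + 56250  # total area from the post, in sq um
    print(min_triangle_count(area_um2, 0.2))  # h_max = 0.2 um is an assumption
```

Under these assumptions the bound comes out at several million triangles, so a count in the tens of millions means the mesher is using elements well below the maximum size in much of the domain.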
Do you use the head node as a compute node as well?
Upgrade: What type of network does your cluster have? Gigabit Ethernet, 10G Ethernet, InfiniBand? What type of models would you like to run? Large models on lots of nodes, or parametric sweeps of smaller models?
We chose to have big compute nodes (96 GB RAM, 12 cores each) and a rather slow network (1G Ethernet) because we usually use the cluster for parametric studies, and the models easily fit onto one node. However, if you plan to run really huge models, you should go for large memory and the fastest network at the same time.
Regards
Matthias
There is still a small problem with our model. We are studying a certain photonic crystal structure,
and we are focusing on the energy gaps of the crystal in k-space (wave-vector space).
We found a model in version 3.5a, the band gap of a photonic crystal in the RF Module,
but I do not have COMSOL 3.5a. I found the PDF describing the model,
and I made some progress in studying the eigenfrequency.
What is strange is that the PDF states that the eigenfrequency is around 4.22e14, while we obtained a value of 4.3e14 with the direct solver in the eigenfrequency study. I set up the same variables and constants as the PDF describes, and for the integration over the whole domain I defined A as intop1(1) with unit m^2 and nEz as intop1(Ez*conj(Ez)/A) with unit (V/m)^2. Is there anything wrong with this? I found that this model no longer exists in version 4.3.
Another problem is that version 3.5a has a Harmonic Propagation selection in the solver parameters, which I could not find in version 4.3.
I am trying to rebuild this model in 4.3, and I really need some help.
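Written out, and assuming that intop1 is an integration operator over the whole domain (called Ω here), the two definitions in this post and the size of the discrepancy are:

```latex
A = \int_{\Omega} 1 \,\mathrm{d}A ,
\qquad
n_{E_z} = \int_{\Omega} \frac{E_z \,\overline{E_z}}{A} \,\mathrm{d}A
        = \frac{1}{A} \int_{\Omega} |E_z|^2 \,\mathrm{d}A ,
\qquad
\frac{\left| 4.3\times10^{14} - 4.22\times10^{14} \right|}{4.22\times10^{14}} \approx 1.9\,\% .
```

So nEz is the domain average of |Ez|^2, and the computed eigenfrequency differs from the documented value by about 2 %.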