where did all that time go?...

where did all that time go?...

Hello MTL-ers,I am running 4 simultaneous jobs on 1 server: acano02,each are limited to 13Gb memory and 10 cpus. Like so:[sels@acano01 Debug]$ qstat -f 14756 | grep comment comment = Job run at Sun Aug 28 at 01:56 on (acano02:ncpus=10:mem=13631488k[sels@acano01 Debug]$ qstat -f 14757 | grep comment comment = Job run at Sun Aug 28 at 01:56 on (acano02:ncpus=10:mem=13631488k[sels@acano01 Debug]$ qstat -f 14758 | grep comment comment = Job run at Sun Aug 28 at 01:56 on (acano02:ncpus=10:mem=13631488k[sels@acano01 Debug]$ qstat -f 14759 | grep comment comment = Job run at Sun Aug 28 at 01:56 on (acano02:ncpus=10:mem=13631488k[sels@acano01 Debug]$qsub correctly deduced that they can run together.I then qstat with about 94 seconds in between to see where they went.[sels@acano01 Debug]$ dateSun Aug 28 02:12:31 PDT 2011[sels@acano01 Debug]$ qstatJob id Name User Time Use S Queue---------------- ---------------- ---------------- -------- - -----14756.acaad01 retimegurobi sels 00:05:13 R workq 14757.acaad01 retimegurobi sels 00:01:21 R workq 14758.acaad01 retimegurobi sels 00:02:34 R workq 14759.acaad01 retimegurobi sels 00:03:21 R workq 14760.acaad01 retimegurobi sels 0 Q workq [sels@acano01 Debug]$ dateSun Aug 28 02:14:05 PDT 2011[sels@acano01 Debug]$ qstatJob id Name User Time Use S Queue---------------- ---------------- ---------------- -------- - -----14756.acaad01 retimegurobi sels 00:05:15 R workq 14757.acaad01 retimegurobi sels 00:01:24 R workq 14758.acaad01 retimegurobi sels 00:02:37 R workq 14759.acaad01 retimegurobi sels 00:03:25 R workq 14760.acaad01 retimegurobi sels 0 Q workq [sels@acano01 Debug]$So it seems 2 seconds went to job14756, 3 to14757, 3 to14758 and 4 to14759.Where did the other 94 - 12 = 82 seconds go?Oke, maybe it coudl be that the Time Use column is only updated in larger time steps.So 10 minutes later I checked.$ dateSun Aug 28 02:23:21 PDT 2011[sels@acano01 Debug]$[sels@acano01 Debug]$ qstatJob id Name User Time Use S Queue---------------- ---------------- ---------------- -------- - -----14756.acaad01 retimegurobi sels 00:05:40 R workq -> +25s14757.acaad01 retimegurobi sels 00:06:02 R workq -> +5min42s14758.acaad01 retimegurobi sels 00:02:58 R workq -> +21s14759.acaad01 retimegurobi sels 00:03:43 R workq -> +18s14760.acaad01 retimegurobi sels 0 Q workq $So in total, my 4 jobs used 6min 46 s. So where did the other 3 min 14 s go?Also: this Time Use column, should be the accumulated time over all threads right?Since each job uses ncpus=10 (and internally I use 4 threads maximum),this time should even advance with a factor 4 faster than the wallclcok time, right?...Then the lost time problem, seems to be 4 times worse, or not?Any ideas?Peter

5 Beiträge / 0 neu
Letzter Beitrag
Nähere Informationen zur Compiler-Optimierung finden Sie in unserem Optimierungshinweis.

Oh: 12 minutes later, suddenly, job 14757 is done,but job 14757 suddenly got 49 minutes extra time use (which is all the time considering 4 threads)while the other jobs barely got any extra time.Ah: this can be explained because all my processes first start with 1 thread and then change into 4 thread gear...But still the other jobs (probably still at 1 thread operation, barely get time. What could be the reason?Should it not be fairely balanced or does the first one still have priority?Or would there be a memory wall problem?[sels@acano01 ~]$ dateSun Aug 28 02:35:16 PDT 2011[sels@acano01 ~]$ qstatJob id Name User Time Use S Queue---------------- ---------------- ---------------- -------- - -----14757.acaad01 retimegurobi sels 00:50:16 R workq 14758.acaad01 retimegurobi sels 00:03:19 R workq 14759.acaad01 retimegurobi sels 00:04:07 R workq 14760.acaad01 retimegurobi sels 00:01:30 R workq [sels@acano01 ~]$

Hello MTL-ers,Maybe this also helps other loggers:I think the wallclock time that progresses, but still is not reflected in the Time Use column,in any process - so the lost time of the subject -
must be due to the (thousands) of small (15k average) log files I write out, one threaded, before(and after) I start on the 4 threaded part.This is mainly disk access time I suppose, and this kind of time is not counted in the cpu cores Time Use column. From the files' time stamps, I indeed see writing these log files takes 5 or more minutes.I'll eliminate the loggin and see if that's right.best regards,Peter

Hi all,Yes, when removing the extensive loggin to disc,, the Time Use column suddenly made sens again.No lost time anymore. On my laptop writing to disc took a lot less time, probably because it is an SMD disk.Are the server discs a bit slow maybe?...So that's cleared up then. Just don't log more than you immediately need here. It takes time. :)best regards,Peter

I meant SSD (Solid State Drive) io SMD (Surface Mount Device?). :)Peter

Kommentar hinterlassen

Bitte anmelden, um einen Kommentar hinzuzufügen. Sie sind noch nicht Mitglied? Jetzt teilnehmen