Unresolved contention for Intel Fortran RTL global resource

Dogbite

As I have slogged forward (I hope) in the process of OMPizing my large legacy program, I have already encountered one of OMP's most endearing traits: the program now crashes multiple times in multiple places in a single run. (Of course, this is progress compared to the program simply stopping without explanation or error message.)

I am hoping someone can provide insight into this particular message beyond what the documentation provides:

forrtl: severe (152) unresolved contention for Intel Fortran RTL global resource

This seems to indicate that the system is having trouble handling multiple calls to IO routines from the individual threads -- that is, the routines are not capable of handling threaded IO calls. This reminds me of something I read about using CRITICAL declarations to isolate IO calls.

I am wondering about the level of granularity I need to use for this protection. Is it sufficient to isolate calls to an individual file, or must I isolate all IO calls to all files? Or does it depend on the call: do I need to isolate all READ calls from each other and all WRITE calls from each other, but can permit Thread 0 to READ while Thread 3 WRITEs?

And are there other considerations I'm missing?

Steve Lionel (Intel)

What compile options are you using? Is this a single EXE or does it use a DLL or static library with other Fortran code?

The Fortran RTL can protect itself against access from multiple threads, assuming that the /reentrancy:threaded option is used (I'm pretty sure /Qopenmp enables that). However, the language disallows starting an I/O on a given unit while another one is in progress on that unit, hence the usual advice to put I/O in a critical section (or a single thread).

Steve
Dogbite
Quoting - Steve Lionel (Intel) What compile options are you using? Is this a single EXE or does it use a DLL or static library with other Fortran code?

The Fortran RTL can protect itself against access from multiple threads, assuming that the /reentrancy:threaded option is used (I'm pretty sure /Qopenmp enables that). However, the language disallows starting an I/O on a given unit while another one is in progress on that unit, hence the usual advice to put I/O in a critical section (or a single thread).

Thanks, Steve.

The program is a single EXE and uses the IMSL ('link_fnl_static.h') library and the OMP ('libiomp5md.lib') library.
The command line I've been using reads:

/nologo /debug:full /Od /gen-interfaces /Qopenmp /Qdiag-error-limit:50 /Qopenmp_report:2 /warn:interfaces /Qauto /align:commons /assume:byterecl /fpe:0 /names:uppercase /module:"x64Debug" /object:"x64Debug" /asmattr:source /asmfile:"x64Debug" /traceback /check:bounds /libs:static /threads /dbglibs /c

I set the /reentrancy:threaded option, but that produced similar results: one /0 error, one unresolved contention, and one "internal consistency check failure, file for_exit_handler.c, line 359".

Which leads me to another question: what happens after a fatal error on one thread? Does the system terminate them (unless they suicide first)? Or do the survivors get to run as long as they can?

Steve Lionel (Intel)

I'm not sure what happens if one thread dies. It had been my understanding that a fatal error would end the application. I'll have to ask our RTL developers about the contention error.

Steve
Steve Lionel (Intel)

The information I received said that this can happen if two threads are trying to do I/O to the same unit number at the same time, and the RTL is having difficulty getting access to the data structure. This would be an error anyway, hence my suggestion to put I/O in a critical section. Is this happening after some other error occurs in the program?

Steve
Dogbite
Quoting - Steve Lionel (Intel) The information I received said that this can happen if two threads are trying to do I/O to the same unit number at the same time, and the RTL is having difficulty getting access to the data structure. This would be an error anyway, hence my suggestion to put I/O in a critical section. Is this happening after some other error occurs in the program?

In the most recent case, three (of four) threads end with out-of-bound subscript errors. The RTL throws up an address stack for each error, and the three threads stop printing their "here I am" messages. The fourth thread putters on contentedly, spitting out messages and writing to the TX file. (This led to my question about how the RTL handles shutting down threads.) The last thread then croaks with errmsg #25, record number outside range, while writing to the IX file. The file was opened with the default "unlimited" number of records, so the applicability of this message seems, um, uncertain. I have no problem with this being an artifact of the RTL shutting the thread down, but it would be nice to know I don't have to worry about errmsg #25.

Steve Lionel (Intel)

I have no answer for you regarding error 25. Is this using direct access to write records?

Steve
Dogbite
Quoting - Steve Lionel (Intel) I have no answer for you regarding error 25. Is this using direct access to write records?

Yes, sorry, should have mentioned that.

I expect that RTL is in the process of shutting down thread four, but thread four is prosecuting output. So while RTL's left hand kills the thread, RTL's right hand goes to do output but fails because the left hand wiped some needed parameter. Thread four did nothing wrong, but its demise rippled.

onkelhotte

Quote:

Steve Lionel (Intel) wrote:

The information I received said that this can happen if two threads are trying to do I/O to the same unit number at the same time, and the RTL is having difficulty getting access to the data structure. This would be an error anyway, hence my suggestion to put I/O in a critical section. Is this happening after some other error occurs in the program?


Hi Steve,

I have a similar problem:
http://software.intel.com/en-us/forums/topic/335328

What do you mean by "putting I/O in a critical section"?

Thanks in advance,
Markus

mecej4

"What do you mean by "putting I/O in a critical section"?"

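It imposes a definite set of limitations and requirements on what threads can and should do while the code in the section is being executed: only one thread at a time may execute the enclosed code, and any other thread that reaches it waits its turn.

This is covered in the OpenMP manuals.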

Dogbite

Markus,

The solution I arrived at was to enclose each call to a file within a named OMP Critical region. This way only one thread can access the file at a time. If another thread wants to access the same file, OMP will not permit access until the first thread has exited the critical region. In this sense, the critical region acts as a lock upon the file.

In the snippet below, two files (RX and EX) are protected by named critical statements.

      IF ( FPCABS .GT. 1 .AND. BBUDGET .EQ. 0 ) THEN ! Budget Off & subsequent FP
!$OMP CRITICAL(RX1C1)
        CALL RXREAD(SEQNO,FPERHD,L1HOLR,I2HOLR,I4HOLR,R4HOLR,NRECS)
        IF ( NRECS .GT. 0 )
     &    CALL RXSET(FPERHD,L1HOLR,I2HOLR,I4HOLR,R4HOLR,NRECS)
!$OMP END CRITICAL(RX1C1)
!$OMP CRITICAL(EX1C1)
        CALL EXREAD(SEQNO,NRECE,R4BUFE,R4HOLE,IMPHLE,FPRHLE)
        CALL EXSET(NRECE,R4BUFE,R4HOLE,IMPHLE,FPRHLE)
!$OMP END CRITICAL(EX1C1)

I used named regions so that access is only dependent upon the use of that particular file -- as there are five files to be protected, naming creates five file-specific locks. If I didn't name the regions, then a thread wanting to use, say, the RX file would have to wait while another thread used the EX file.
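For readers who want something they can compile, here is a minimal free-form sketch of the same idea: two named critical regions, each acting as a lock for one file. The program name, file names, unit variables and region names (rx_lock, ex_lock) are all hypothetical and are not taken from the original program. It assumes the code is built with OpenMP enabled (/Qopenmp with Intel Fortran on Windows, -qopenmp on Linux, or -fopenmp with gfortran).

```fortran
program named_critical_demo
  ! Hypothetical example: file names, unit variables and region names are
  ! illustrative only, not taken from the original program.
  use omp_lib
  implicit none
  integer :: rx_unit, ex_unit, i

  open(newunit=rx_unit, file='rx_demo.txt', status='replace', action='write')
  open(newunit=ex_unit, file='ex_demo.txt', status='replace', action='write')

  !$omp parallel do private(i)
  do i = 1, 8
     ! Only one thread at a time may be inside the rx_lock region ...
     !$omp critical (rx_lock)
     write(rx_unit, '(a,i0,a,i0)') 'item ', i, ' from thread ', omp_get_thread_num()
     !$omp end critical (rx_lock)

     ! ... but another thread may be inside ex_lock at the same moment,
     ! because differently named regions do not exclude each other.
     !$omp critical (ex_lock)
     write(ex_unit, '(a,i0)') 'item ', i
     !$omp end critical (ex_lock)
  end do
  !$omp end parallel do

  close(rx_unit)
  close(ex_unit)
end program named_critical_demo
```

Every statement that touches a given unit has to sit inside a region with the same name; a single unprotected READ or WRITE elsewhere in the program defeats the lock.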

I hope that helps.

onkelhotte

Thanks dogbite,
I've found it in the OpenMP reference.

Markus

onkelhotte

I still get this error, even when the I/O is inside a CRITICAL section:

subroutine writeReportTemperatures()
!
 use globaleVariablen
 use EA_Telegramm
 implicit none
 character*1 seperator
 character*512 tempstring
 integer(kind=4) iUnit
!$OMP CRITICAL (TEMP_ACCESS)
 seperator=';'
 write(tempstring,'(3(a,i2.2),8(a,f6.1))') &
 seperator,EA_Telegramm_849%DBX340,':',EA_Telegramm_849%DBX360,':',EA_Telegramm_849%DBX380, &
 seperator,tempStrang(1,1,nzTempGiessmaschine), &
 seperator,measuredValues849%tempGiessmaschine, &
 seperator,tempStrang(1,1,nzTempMRGS), &
 seperator,measuredValues849%tempMRGS, &
 seperator,tempStrang(1,1,nzTempTreiber1), &
 seperator,measuredValues849%tempTreiber1, &
 seperator,tempStrang(1,1,nzTempTreiber2), &
 seperator,measuredValues849%tempTreiber2
 call replaceChar(tempstring,'.',',')
 open(NEWUNIT=iUnit, file=trim(fileNameReportTemperatures), status='old', access='append')
 write(iUnit,'(a)') trim(tempstring)
 close(iUnit)
!$OMP END CRITICAL (TEMP_ACCESS)
end subroutine writeReportTemperatures

The "forrtl: severe (152): unresolved contention for Intel Fortran RTL global resource" error still occurs, sometimes in the longer write statement to tempstring, sometimes in the write statement to iUnit and sometimes in the close statement.

When I replace the newunit=iunit with a fixed unit number, the same thing happens. This file is only used in this subroutine and in an initReportTemperatures subroutine; they can't be called at the same time. Performing a flush(iUnit) was useless too.

Markus

IanH

Does the error happen on its own?  Most of the time I see this when my program is in the process of crashing and burning due to other unrelated incompetence on my part.

p.s. filenames for open are "auto-trimmed" - trailing blanks are not used.

jimdempseyatthecove

How was tempstring opened? In particular, was asynchronous used?

Jim Dempsey

www.quickthreadprogramming.com
onkelhotte

Quote:

IanH wrote:Does the error happen on its own?  Most of the time I see this when my program is in the process of crashing and burning due to other unrelated incompetence on my part.

p.s. filenames for open are "auto-trimmed" - trailing blanks are not used.

The program doesn't crash when I don't call the writeReportTemperatures subroutine. I have had it running for nearly 30 minutes; with the call it never ran longer than 2 minutes.

Markus

PS: Thanks for the trim information in the open statement.

onkelhotte

Quote:

jimdempseyatthecove wrote: How was tempstring opened? In particular, was asynchronous used?

Jim Dempsey

I don't quite understand your question. tempstring is a local character*512 variable.

If you mean whether the file was opened with asynchronous='yes': this was not the case. Opening the file with asynchronous='yes' leads to some kind of deadlock. At varying points in the run the program simply stops doing anything. But it is not frozen, and it is not shown as "not responding" in Windows Task Manager.

Markus

jimdempseyatthecove

If you read some of the other threads you will note that there is (was) an issue where, when the number of pending asynchronous I/Os exceeds an internal limit, an I/O error is induced. You are not using asynchronous='yes', so this eliminates that problem (this is why I asked). There was another old issue where NEWUNIT= crapped out after some number of iterations. You indicate that you still experience the crash when using a fixed unit number, so NEWUNIT= is not the issue. You've demonstrated that these two potential issues are not the problem. What do we have left....

1) This is an embarrassing reason: the .obj containing writeReportTemperatures was not compiled with -openmp. This would make the !$OMP CRITICAL ineffective. Note that if this subroutine is in a module, it is not unusual for the module not to get compiled (or to have compilation errors), and/or for the module path used for linking to differ from the module path used for outputs. In these cases an out-of-date or wrong .obj is linked. Verify this. Also note that verification in a Debug build is not verification in a Release build.

2) There is a resource error. What happens when each thread uses a unique filenameReportTemperatures file name? Careful: omp_get_thread_num is not necessarily unique per thread when using nested parallel regions; use the thread ID instead (or a unique number obtained and saved in a thread-private variable). Should this fix the issue, then the problem lies within the O/S.

3) There is a resource error (2). What happens if the open for filenameReportTemperatures is performed once for all threads, and then you use only the WRITE within your subroutine? Windows has a lazy write, which may be causing an issue when performing open, append, close, open, append, close...

4) Is the output file for filenameReportTemperatures in one of the Windows folders (or sub folder)? If so, create a different folder (path) that is not under a Windows system folder.
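Checks 1) and 2) can be tried together in a small stand-alone test. The sketch below is only an illustration under stated assumptions: the program name, the openmp_on flag and the report_thread_NNN.txt file-name pattern are hypothetical, and it assumes a single, non-nested parallel region, so that omp_get_thread_num is unique per thread (the caveat in point 2 applies otherwise). It also relies on the runtime library being thread-safe for simultaneous I/O on different units.

```fortran
program per_thread_files
  ! Hypothetical diagnostic sketch; the file-name pattern is illustrative.
  !$ use omp_lib
  implicit none
  integer :: tid, iunit
  character(len=32) :: fname
  logical :: openmp_on

  ! Lines that start with the !$ sentinel are compiled only when OpenMP is
  ! enabled, so this reports whether the directives are being honoured.
  openmp_on = .false.
  !$ openmp_on = .true.
  print '(a,l1)', 'OpenMP directives active: ', openmp_on

  !$omp parallel private(tid, iunit, fname)
  tid = 0
  !$ tid = omp_get_thread_num()
  ! One file per thread, so no two threads ever share a unit or a file.
  write(fname, '(a,i3.3,a)') 'report_thread_', tid, '.txt'
  open(newunit=iunit, file=fname, status='replace', action='write')
  write(iunit, '(a,i0)') 'written by thread ', tid
  close(iunit)
  !$omp end parallel
end program per_thread_files
```

If the first line of output shows F, the source was compiled without OpenMP and any !$OMP CRITICAL in it is being treated as a plain comment.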

Jim Dempsey

www.quickthreadprogramming.com
onkelhotte

1.) The subroutine is part of the main program and not inside a module. Clean and Rebuild, or deleting the .obj by hand, doesn't help.

2.) The creation of the file (in a subroutine initReportTemperatures) happens before the first call of writeReportTemperatures. They are not in different threads; they are being called in the same subroutine:

! ... some other code
if(bReportTemperaturen.eqv..true.) then
    call initReportTemperatures()
end if
do while (programRuns.eqv..true.)
!... some other code
if(bReportTemperaturen.eqv..true.) then
    call writeReportTemperatures()
end if
! ... some other code

3.) To prevent an open - append - write - close - open ... I/O problem I tried flush(iUnit), but this didn't help either.

4.) The file is in my C:\User\ directory so I have full access to that file.

I think that the error is somewhere else in my code and has nothing to do with these statements...

Markus

jimdempseyatthecove

RE 2)

Lift the OPEN and CLOSE for filenameReportTemperatures out of subroutine writeReportTemperatures, using say iUnitReportTemperatures, keeping only the WRITE(iUnitReportTemperatures, ... in subroutine writeReportTemperatures. You may want to add flush(iUnitReportTemperatures) after WRITE if desired. IOW:

program, open/init, append, append, append, ..., append, close, end program
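A minimal sketch of that layout, assuming the unit number lives in a module so that all three routines can see it. The module, routine, region and file names here are hypothetical stand-ins, not the poster's actual code; only the open-once, write-many, close-once shape is taken from the suggestion above.

```fortran
module report_io
  ! Hypothetical module; names are illustrative, not the poster's code.
  implicit none
  integer :: iUnitReport = -1
contains
  subroutine init_report(fname)
    character(len=*), intent(in) :: fname
    ! Opened exactly once, before any parallel work starts.
    open(newunit=iUnitReport, file=fname, status='replace', action='write')
  end subroutine init_report

  subroutine write_report(line)
    character(len=*), intent(in) :: line
    ! Only the WRITE stays here; the unit is shared, so serialize access.
    !$omp critical (report_access)
    write(iUnitReport, '(a)') trim(line)
    flush(iUnitReport)
    !$omp end critical (report_access)
  end subroutine write_report

  subroutine close_report()
    ! Closed exactly once, after all writers have finished.
    close(iUnitReport)
  end subroutine close_report
end module report_io

program open_once_demo
  use report_io
  implicit none
  integer :: i
  character(len=64) :: line

  call init_report('temperatures_demo.csv')
  !$omp parallel do private(i, line)
  do i = 1, 100
     write(line, '(a,i0)') 'sample;', i
     call write_report(line)
  end do
  !$omp end parallel do
  call close_report()
end program open_once_demo
```

The flush is optional; it only narrows the window in which buffered records could be lost if the program dies before the final close.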

Have you run your code in a Debug build, with OpenMP enabled, and with all runtime checks enabled (uninitialized variables, array index checking, etc.)? I've experienced situations where such coding errors cause seemingly unrelated errors in different code.

www.quickthreadprogramming.com
Dogbite

I just have to ask:  Shouldn't iUnit be assigned a value at some point?  At this point I wouldn't be sure that the two routines are writing to the same file.

jimdempseyatthecove

In my original response 2) you would use different unit numbers (for different files), and a critical section is not required. However, in the alternate method from my second response 2) (program, open/init, append, append, append, ..., append, close, end program), only one unit is required and a critical section is required.

I should have been clearer (you had to read between the lines).

Jim Dempsey

www.quickthreadprogramming.com
IanH

Quote:

Dogbite wrote:I just have to ask:  Shouldn't iUnit be assigned a value at some point?  At this point I wouldn't be sure that the two routines are writing to the same file.

The code uses F2008's best-thing-since-sliced-bread NEWUNIT= specifier in the OPEN statement to do that.
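For anyone who has not met the specifier yet, a minimal sketch (the program and file names are hypothetical): the variable is never assigned by the program itself, because the OPEN statement stores a unit number chosen by the runtime into it.

```fortran
program newunit_demo
  implicit none
  integer :: iUnit   ! deliberately not assigned before the OPEN

  ! NEWUNIT= makes the runtime pick an unused unit number and store it
  ! in iUnit; the standard requires that number to be negative, so it
  ! cannot clash with unit numbers chosen by hand.
  open(newunit=iUnit, file='newunit_demo.txt', status='replace', action='write')
  print '(a,i0)', 'runtime chose unit ', iUnit
  write(iUnit, '(a)') 'hello'
  close(iUnit)
end program newunit_demo
```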

app4619

 Quote:

@IanH: The code uses F2008's best-thing-since-sliced-bread NEWUNIT= specifier in the OPEN statement to do that. 

+1  Now that is really useful (I never rated sliced bread much though) ....!!!

onkelhotte

Yesterday I rewrote my program, now it has an open, a write and a close subroutine like Jim suggested. It was a lot of work because the final close statement could be in many branches in my program, so I tried to avoid that by using a create and an open - write - close subroutine. Now it works, thanks!

But why does the "forrtl: severe (152): unresolved contention for Intel Fortran RTL global resource" error come up? The open - write - close subroutine is being called every 0.8 to 1 second; that should be enough time for Windows 7 to do three simple I/O operations...

The NEWUNIT specifier is simple and a great way to handle I/O operations. That's more than just +1 :-)

Markus

jimdempseyatthecove

I think that this is a Windows-related quirk. I will refrain from saying error, because I am certain MS will say it is not an error. Some background on my suspicions:

I have a utility program written in C++ back in 1997 that is used to synchronize paths. When the destination path is empty, it becomes a copy. Prior to Windows 7, synchronization to an empty folder across a network path would always work. Since Windows 7, some of the files fail to copy (or to delete and copy when the target file is out of date), thus requiring a subsequent synchronization. Upon examination of the problem, the "error" codes returned are not "hard errors" (e.g. file in use) but fall into the soft error category, such as insufficient resources. The problem occurs less frequently with synchronizations using local disks. The MSDN articles tend to say something like: When receiving 0x.... retry the operation.

My guess is: IVF is receiving one of these soft errors (one for which MS suggests an immediate retry), and instead of retrying, IVF hard-errors out ("unresolved ... RTL global resource").
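If that guess is right, a program could in principle do the retry itself. The following is a speculative sketch only: whether error 152 is actually delivered through IOSTAT= rather than aborting the program is an assumption that nothing in this thread confirms, and the program name, file name and retry limit are hypothetical. It uses the standard position='append' rather than the access='append' extension seen in the earlier code.

```fortran
program retry_open_demo
  ! Hypothetical workaround sketch. Whether error 152 is reported through
  ! IOSTAT= (instead of terminating the program) is an assumption.
  implicit none
  integer, parameter :: max_tries = 5
  integer :: iunit, ios, attempt
  character(len=256) :: msg

  do attempt = 1, max_tries
     open(newunit=iunit, file='retry_demo.txt', status='unknown', &
          position='append', action='write', iostat=ios, iomsg=msg)
     if (ios == 0) exit          ! success: stop retrying
     print '(a,i0,a,i0,2a)', 'open attempt ', attempt, &
          ' failed, iostat=', ios, ': ', trim(msg)
  end do

  if (ios /= 0) error stop 'giving up after repeated OPEN failures'

  write(iunit, '(a)') 'one record'
  close(iunit)
end program retry_open_demo
```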

A reproducer might be handy for Intel to investigate this problem: your init routine, writeReportTemperatures, and a loop to write ~1GB of data, plus a dummy module that declares the variables used. Submit it with details (IVF version, O/S version, IVF options, placement of the output file: local/remote/type).

The workaround I proffered reduces the probability of this error.

Happy for you to get working...

Jim Dempsey

www.quickthreadprogramming.com
