IO issues on Windows Server 2012

We recently added a Dell PowerEdge R420 to our compute server mix. Because of the newer hardware we had to install Windows Server 2012 STD, to which we added the HPC add-on to make it similar to our other compute servers running Windows Server 2003 Compute Cluster Edition.

Our existing software, built with Intel Visual Fortran Compiler XE 13.1.2.190, runs without problems on the older setup. On the new configuration we get an I/O error at what seem to be random times (possibly load dependent). If the application writes a very large file, closes it, and then opens it again, we get IOSTAT=30. The writes go to a local RAID 1 disk on a controller with 1 GB of cache, which is similar to our older compute servers.

Is there a compatibility issue between Visual Fortran and Windows Server 2012?

Steve Lionel (Intel):

We have seen that issue with many different Windows versions - it seems that Windows doesn't fully close a file until "sometime after" it returns back to the code that called CloseHandle. Our usual recommendation here is to add a SLEEPQQ call after the CLOSE for, say, 500ms. That usually takes care of it.

Steve

Already tried that with the following code.

100 continue
        open(unit=11,file=szFile,IOSTAT=ios999,buffered='yes',form='unformatted',action='write')
        if (ios999.ne.30 .and. ios999.ne.0) then
            goto 666
        endif
        openCount = openCount + 1
        if (ios999.eq.30) then
            if (openCount.gt.300) then ! try for 5 minutes
                goto 666
            endif
            ! wait 1 second between attempts
            iWait = 1000
            call SLEEPQQ(iWait) ! wait (integer in milliseconds)
            goto 100
        endif

This only happens on the Server 2012 box; we never needed this on the 2003 Compute Cluster boxes.
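For readers more at home in C++, the bounded retry idea in the Fortran loop above can be sketched roughly as follows. This is only an illustration under stated assumptions: `open_with_retry` and `OpenResult` are hypothetical names, and because `std::ofstream` cannot tell a sharing violation apart from any other open failure, the sketch retries on every failure rather than only on the equivalent of IOSTAT=30.

```cpp
#include <cassert>
#include <chrono>
#include <fstream>
#include <string>
#include <thread>

// Outcome of a bounded retry: whether the file opened and how many attempts it took.
struct OpenResult {
    bool opened;
    int attempts;
};

// Try to open `path` for writing, pausing `wait` between failed attempts.
// Gives up after `max_attempts` tries, like the openCount limit in the Fortran loop.
// Assumption: any open failure is retried, since std::ofstream does not expose
// the underlying operating system error code.
OpenResult open_with_retry(std::ofstream& out, const std::string& path,
                           int max_attempts, std::chrono::milliseconds wait) {
    assert(max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        out.clear();  // reset the failbit left by a previous failed open
        out.open(path, std::ios::binary | std::ios::trunc);
        if (out.is_open()) {
            return {true, attempt};
        }
        if (attempt < max_attempts) {
            std::this_thread::sleep_for(wait);  // pause before the next attempt
        }
    }
    return {false, max_attempts};
}
```

A call such as `open_with_retry(out, "deck.dat", 300, std::chrono::seconds(1))` would correspond to the five-minute limit used above; the file name is just a placeholder.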

Steve Lionel (Intel):

I don't know what to tell you, then. We did test on that version of Windows and did not see any problems. But there isn't much we can do here - if the call to CloseHandle (the Windows API used to close a file) returns success, we have to assume it is closed. It might be useful to call the Windows API routine GetLastError (the one from KERNEL32, not the one in module IFPORT) and see what the Windows status is after the failed open.

Steve

iliyapolak:

It seems that the handle to the file is not closed, so from Windows' point of view it is still in use. As suggested, try calling the GetLastError Win32 API function and inspect its return value. A more advanced option is to use WinDbg to scan for handles that have not been released. You can also use Application Verifier for the same purpose.

I added a C call to GetLastError; it seems to always return zero. Do I need to try something else to get more granular information than IOSTAT=30?

I am going to add a call that executes a script when this happens, using the Sysinternals handle utility to see who has the file open.

Steve Lionel (Intel):

Which GetLastError did you call? It needs to be the one from kernel32.

Steve

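I used the following C code:

#include <Windows.h>

// Wrapper callable from Fortran. GetLastError (exported by KERNEL32) returns
// the calling thread's last-error code, which later API calls can overwrite,
// so it should be called immediately after the failing operation.
extern "C"
{
    void __stdcall GET_LAST_ERROR(int *err)
    {
        DWORD thisError = GetLastError();
        *err = (int)thisError;
    }
}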

From Fortran added:

call Get_Last_Error(lastError)

iliyapolak:

Sorry for the late answer. Try using Process Explorer to look for open handles in your code. I think there is only one function named GetLastError, and it is exported by Kernel32.DLL. Regarding the result, try converting it to an HRESULT. Here is the link: http://msdn.microsoft.com/en-us/library/windows/desktop/ms680746(v=vs.85).aspx

Not sure how this will help since GetLastError is returning a zero. Just in case I redid my code as follows.

#include <Windows.h>
#include <WinError.h>

// Same wrapper, but converting the Win32 error code to an HRESULT.
// HRESULT_FROM_WIN32(0) is still 0, so a zero last-error code stays zero.
extern "C"
{
    void __stdcall GET_LAST_ERROR(int *err)
    {
        HRESULT result = HRESULT_FROM_WIN32(GetLastError());
        *err = (int)result;
    }
}

iliyapolak:

In this case please use either Handle.exe or Process Explorer to look for orphaned handles.

It looks like the low-level Fortran OPEN code has a bug: when it gets an IOSTAT=30 type error, it does not close the open handle it has.

I modified the wait code to close the unit, and things are working.

100 continue
        open(unit=11,file=szFile,IOSTAT=ios999,buffered='yes',form='unformatted',action='write')
        if (ios999.ne.30 .and. ios999.ne.0) then
            goto 666
        endif
        openCount = openCount + 1
        if (ios999.eq.30) then
            if (openCount.gt.300) then ! try for 5 minutes
                goto 666
            endif
            ! wait 1 second between attempts
            iWait = 1000
            call SLEEPQQ(iWait) ! wait (integer in milliseconds)
            close(11) ! add for bug in fortran open
            goto 100
        endif

Steve Lionel (Intel):

There is no bug in the OPEN code - you get error 30 only if the CreateFile API returns -1 as the handle, indicating that the operation failed. There is no path through the code where you'd get error 30 and a handle was successfully returned to the OPEN code.

The CLOSE you added would have no effect. The SLEEPQQ is what helps, as I suggested earlier.

Steve

Sorry, you're wrong: without the close I get IOSTAT=30 300+ times and then the program quits.

The same thing happens in another place in the program. Without the close it would loop 240 times and then give the IOSTAT=30 error message. Remember, this only seems to be happening on Server 2012.

! Random unit for output
5 continue
outFile = 12
call GETTIM (i,j,k,l)
call SEED(i+j+k+l)
call RANDOM(xFile)
outFile = (xFile * 70) + 20.0
if (outFile.lt.20.or.outFile.gt.99) then
    goto 5
endif
close(outFile)

! Check files
inquire(file=TemplateFile,EXIST=EXISTS)
if (.not.EXISTS) then
    goto 1300
endif
! open files
open(unit=11,file=TemplateFile,err=1000,form='formatted',    &
    share='DENYNONE',action='read',IOSTAT=ios999)
100 Continue

! make sure unit not in use
inquire(unit=outFile,OPENED=UNITINUSE)
if (UNITINUSE) then
    goto 5
endif

open(unit=outFile,file=DeckFile,form='FORMATTED',    &
     action='WRITE',STATUS='REPLACE',IOSTAT=ios999)
if (ios999.ne.30 .and. ios999.ne.0) then
    goto 1100
endif
openCount = openCount + 1
if (ios999.eq.30) then
    if (openCount.gt.240) then ! try for 4 minutes
        call Get_Last_Error(lastError)
        goto 1100
    endif
    ! wait 1 second between attempts
    iWait = 1000
    call SLEEPQQ(iWait) ! wait (integer in milliseconds)
    close(outFile) ! fix for Fortran bug in failed open
    goto 100
endif

I forgot to add that I wrote a little routine that makes a SYSTEM call to run handle.exe on the file that gets the IOSTAT=30 and writes the output to a log file. It always shows that the running program has a handle to the file when the error happens.

jimdempseyatthecove:

There used to be an issue in IVF around V8.nn where IOSTAT=ios999 would not update ios999 on success. As a test for this lingering bug, insert ios999=0 in front of the open, remove the close(11), then see if you get a successful open after multiple retries. If this changes the behavior (open works), then please report this back to Steve.

BTW, I've seen a similar issue in Windows with C++ when the O/S is buffering writes. Note that when you goto around the writes, you have effectively bypassed any buffering. Try inserting a FLUSH(11) prior to the close, though I suspect the buffered write will still be in effect.

Jim Dempsey

www.quickthreadprogramming.com

iliyapolak:

If CloseHandle is called and the operation fails, the return value will be zero, and in that case a detailed error code should be available from GetLastError. Maybe your file handle is somehow being closed twice; that will also produce an error.

I tried setting ios999=0 just before the open; it seems to have no effect. I have been using Resource Monitor to watch the exe, and it looks like there may be a close problem. Does the normal close(11) throw a runtime error if it fails? I'm seeing a handle for a file I know should be closed.

There is definitely an issue with the Fortran close on our Server 2012 box. Watching the same program with Process Explorer on Server 2003 and Server 2012, you can see orphan handles accumulate on 2012. In this case we have a read-only shared file that gets opened 5 times by 5 calls to a subroutine, which writes out 5 separate files to 5 directories. After the first pass there is a handle to the file left over on 2012 but not on 2003. Over time (for example overnight) you end up with 4 or 5 extra handles. The problem comes with the output files: if an output file name is repeated in the same directory enough times, eventually a handle to it gets left over as well. That causes IOSTAT=30, since the file cannot be overwritten while the handle is open. In all cases, even error conditions, the file units are closed before exiting the subroutine.

Steve Lionel (Intel):

This tells you that it is Windows holding on to the handle after the close, not anything that Fortran is doing.

Steve

What can we do about it?

Steve Lionel (Intel):

To the best of my knowledge, adding the SLEEP call after the close has worked for everyone else who has reported this. If you find adding another CLOSE helps, that is certainly not harmful, but I am skeptical that it is doing anything. I do recommend putting a SLEEP call after the initial close, not waiting for the OPEN to fail.

But do you really need to close and reopen the file immediately? Why not use REWIND?

Steve

jimdempseyatthecove:

>> Why not use REWIND?

If the same app is run multiple times (e.g. from batch), and using the same name temp file, then rewind will not help. However, you might consider STATUS='SCRATCH', which presumably works out name collisions with temporary file names.

Jim Dempsey

www.quickthreadprogramming.com

We are not closing and reopening immediately. Let's see if I can explain the application better. The main app is a form of scheduler that takes a set of working directories (5 in my test case) and sets of template files, one of which is a primary file that uses includes for the other files. The app processes the template files, writing a set into each directory with different parameters substituted in. It then spawns a control app that runs a reservoir simulator in that directory using the files. When the simulation finishes, the control app notifies the main app that the directory is free, and the main app repeats the process of building new files from template files and parameters, followed by another run. The primary file name usually changes each time, but the other include file names may be the same between runs; just the data is different. The spawned app opens and closes the files just fine and does not show any leftover handles. It's only the template files and the same-named output files that seem to build up leftover handles. It's as if the same name being used frequently in the same directory causes the problem.

iliyapolak:

The problem is probably related to the Windows OS, as you mentioned the growing count of orphaned handles. For more advanced troubleshooting you can use Application Verifier with WinDbg and issue the !htrace extension command. Also enable the Application Verifier option that tracks invalid handle usage.

jimdempseyatthecove:

This suggestion is a programming change specifically to work around this problem.

In the main app (control app), do not close the output files (nor the input files that will be reopened on next trip through loop).
In the main app, in CreateProcess call, set the bInheritHandles TRUE, and instead of passing file names on lpCommandLine, pass the text of the hex of the file handle of that file.
In the spawned process, change the file open code, such that if a file name begins with "0x" that it converts the hex code residing in the place of a file name, into a handle, and uses that instead of performing an open (CreateFile). Your Fortran code can do this by modifying the OPEN(...) to add USEROPEN=YourOpenRoutine.
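
The hand-off described above depends on the child recognizing a "0x" prefix and turning the text back into a handle value. A minimal, portable C++ sketch of just that text round trip follows. The names `format_handle_arg` and `parse_handle_arg` are hypothetical, and the Windows-specific parts (making the handle inheritable, calling CreateProcess with bInheritHandles set to TRUE, and casting the parsed value back to a HANDLE inside the USEROPEN routine) are assumed rather than shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Parent side: render a handle value as "0x..." text for the command line.
std::string format_handle_arg(std::uintptr_t handle) {
    static const char digits[] = "0123456789ABCDEF";
    std::string hex;
    do {
        hex.insert(hex.begin(), digits[handle & 0xF]);
        handle >>= 4;
    } while (handle != 0);
    assert(hex.size() <= 2 * sizeof(std::uintptr_t));  // at most two hex digits per byte
    return "0x" + hex;
}

// Child side: if `arg` is "0x" followed only by hex digits, return the handle
// value; otherwise return std::nullopt so the caller treats `arg` as a file name.
std::optional<std::uintptr_t> parse_handle_arg(const std::string& arg) {
    if (arg.size() < 3 || arg[0] != '0' || (arg[1] != 'x' && arg[1] != 'X')) {
        return std::nullopt;
    }
    std::uintptr_t value = 0;
    for (std::size_t i = 2; i < arg.size(); ++i) {
        const char c = arg[i];
        unsigned digit = 0;
        if (c >= '0' && c <= '9') {
            digit = static_cast<unsigned>(c - '0');
        } else if (c >= 'a' && c <= 'f') {
            digit = static_cast<unsigned>(c - 'a' + 10);
        } else if (c >= 'A' && c <= 'F') {
            digit = static_cast<unsigned>(c - 'A' + 10);
        } else {
            return std::nullopt;  // not pure hex, so treat it as an ordinary name
        }
        if (value > (UINTPTR_MAX >> 4)) {
            return std::nullopt;  // would overflow the handle type
        }
        value = (value << 4) | digit;
    }
    return value;
}
```

One consequence of this scheme is that a real file whose name happens to start with "0x" and contains only hex digits would be misread as a handle, so the prefix convention has to be agreed between parent and child.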

Jim Dempsey

www.quickthreadprogramming.com

I will try the Application Verifier suggestions. I can't change the spawned program; it's third party, so I am unable to try that.

jimdempseyatthecove:

>> Can't change the spawned program it's third party so unable to try that.

Then the next possibility to ponder is to create a new folder/directory each time you run the spawned program. Still perform the delete files, then as a later cleanup pass remove the temporary folders/directories.
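
A rough C++17 sketch of that fresh-directory idea, using only std::filesystem, might look like the following. The function names, the `run_` prefix, and the six-digit counter are illustrative assumptions, not anything taken from the application discussed in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Create base/run_000001, base/run_000002, ... and return the first directory
// that did not exist yet, so each spawned run gets names nobody has used before.
fs::path make_run_directory(const fs::path& base, unsigned& counter) {
    assert(!base.empty());
    fs::create_directories(base);
    for (;;) {
        ++counter;
        std::string number = std::to_string(counter);
        if (number.size() < 6) {
            number = std::string(6 - number.size(), '0') + number;  // zero-pad
        }
        const fs::path dir = base / ("run_" + number);
        // create_directory returns false when the directory already exists;
        // in that case move on to the next counter value.
        if (fs::create_directory(dir)) {
            return dir;
        }
    }
}

// Later cleanup pass: delete run_* directories under `base`. Directories that
// cannot be removed yet (for example because a file inside is still held open)
// are skipped and can be retried on the next pass. Returns the number removed.
std::size_t cleanup_run_directories(const fs::path& base) {
    std::vector<fs::path> candidates;
    std::error_code ec;
    for (fs::directory_iterator it(base, ec), end; !ec && it != end; it.increment(ec)) {
        const std::string name = it->path().filename().string();
        if (it->is_directory(ec) && name.rfind("run_", 0) == 0) {
            candidates.push_back(it->path());
        }
    }
    std::size_t removed = 0;
    for (const auto& dir : candidates) {
        std::error_code remove_ec;
        fs::remove_all(dir, remove_ec);
        if (!remove_ec) {
            ++removed;
        }
    }
    return removed;
}
```

Collecting the candidate paths before deleting avoids modifying the directory while it is being iterated.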

Hmm....

I do not know if this applies to Windows Server 2012

Assure that file indexing is turned off for the folder in which you perform your writes.

You also might look at FlushFileBuffers (http://msdn.microsoft.com/en-us/library/windows/desktop/aa364439(v=vs.85).aspx)

Jim Dempsey

www.quickthreadprogramming.com

Having major problems installing Application Verifier on Server 2012. The older standalone one, the one in Windows 7 SDK and the one in Windows 8 SDK will not install. Will keep trying but running short of ideas.

iliyapolak:

I did not have any problems installing Application Verifier on Win 8, but I must admit that I have never used that program under Win 8 or Win Server 2012.

Do you get any error message during the installation, or does it simply hang?

My schedule finally let me get back to this problem.

I followed the suggestion about using USEROPEN and found that if I captured the handle from the open and used it after the normal Fortran close, I got rid of the leftover handles. It is not very elegant or general purpose, but it lets us run this app on Server 2012.

The problem exists on all our Server 2012 boxes with all the recent versions of Fortran. Using handle.exe on Server 2003 and Server 2012 with the same app and data shows that Fortran has had issues with handles for a long time. On Server 2003 it leaves handles to folders around, and on Server 2012 it leaves handles to folders and files around. I have uploaded the results of running handle.exe on Server 2003 and Server 2012 to show the problem. In both cases the app ran for multiple days, opening and closing files in five work directories. The Server 2012 version had the USEROPEN patch so it would not crash, so the only files showing are files that used normal opens and closes.

Since we often run multiple versions of this app on a server, the large number of handles that accumulate may account for why some of the servers need reboots because they become unresponsive after a week or so of work.

iliyapolak:

>>>Since we many times run multiple versions of this app on a server, the large number of handles that accumulate may account for why some of the servers need reboots because they get unresponsive after a week or so of working.>>>

Do you mean globally unresponsive (the whole system is frozen)?

iliyapolak:

Another suggestion is to use WinDbg and its handle-related commands. You should at least be able to investigate what is leaking the handles.

>> Do you mean globally unresponsive (the whole system is frozen)?

RDP logins take forever, and in Windows Explorer clicking on Network spins forever and shows nothing but the activity ring.

>> Another suggestion is to use WinDbg and its handle-related commands. You should at least be able to investigate what is leaking the handles.

handle.exe and the Resource Monitor in Server 2012 work fine and show that the low-level code used by Fortran has not kept up with OS changes and is not always deleting handles as it should on close. We have other programs written in C#, including one that runs as a service for days, that do very similar things, and they always have only a minimum number of handles.

 

Steve Lionel (Intel):

I can't figure out what you think needs "keeping up with" - a CloseHandle is a CloseHandle.

Steve

Whatever Fortran is doing, it is not always closing the handle when close is called. Also, even on Server 2003 there are multiple handles to folders left around, even though all the program does is set the current directory to the folder, i.e. no opens etc. for folders. I'm sure it's something Microsoft changed, particularly in Server 2012, that the code in Fortran was never updated for.

Steve Lionel (Intel):

I don't think that's a correct interpretation. Fortran closes the handle, but Windows doesn't actually do the close until sometime later. We will run some experiments and see what we can find out.

Steve

Thanks

Do you need any config information for hardware or software?

Steve Lionel (Intel):

Well, it would really help if you could provide a self-contained test program. The snippets you provided earlier are missing some context.

Steve

Ok that will take a day or so.

iliyapolak:

>>>handle.exe and the Resource Monitor in Server 2012 work fine and show that the low level code used by Fortran has not kept up with OS changes and is not always deleting handles as it should on close.  We have other programs written in C# including one that runs as a service for days and do very similar things and they always have only a minimum number of handles.>>>

I agree with you, but handle.exe does not have the low-level capabilities that WinDbg has. Handles are created and managed by kernel-mode code. WinDbg has excellent logic and heuristics that can help you track handle leaks down to thread and function granularity.

I am still working on an example. The first simple one I put together, with the same write subroutines as the main application and some do loops, did not show a problem. I am updating the sample to be a full Windows application using a Windows message pump like the main application. I suspect that the problem may relate to running inside a Windows message pump.

I did not give up. All my examples, even the full Windows app, failed to show the handle issue. So after a lot of trial and error I stripped the real application down to the essentials, and it still shows the handle issue. To summarize, the application creates a huge number of orphan handles when run on Windows Server 2008R2 HPC and Windows Server 2012 STD with or without HPC, and also on Windows 8.1. It does not on Windows Server 2003 Compute Cluster Edition, Windows XP, and Windows 7. All versions of the application have the same problem: 32-bit/x64 and Debug/Release.

Attached is Lynx_new - stripped.zip, a Visual Studio 2010 solution with everything needed to build each version; x64 debug is already built. Test.zip contains the test files. Instructions.docx has some simple instructions for running. The key routine is DeckModProcessor.f90, which opens a template file, creates a new file in a working directory, copies the template with changes to that file, and closes it.

Give it a try on one of your test servers. All my tests were on Dell servers varying from old PowerEdge 1950s to new PowerEdge R420s. It looks like it's not directly related to hardware but to operating systems from Server 2008R2 (Windows 8) on.

Steve Lionel (Intel):

I don't see the attachment.

Steve

They show up under my files tab, so I selected them and did another submit.

Steve Lionel (Intel):

Ok, thanks. We'll see what we can find.

Steve
