Problems using MKL with threaded C# on 32-bit

Hi

I'm trying to use the MKL to act as a calculation library underneath some threaded C# and having some problems on 32-bit machines.

Using MKL 10.2.5.035, I've built a custom DLL as described in the knowledge base article. If I generate a 64-bit DLL and use that in a 64-bit application then I have no problem. If I switch to 32-bit I do.

If I leave the "threading=" argument off the makefile command line (as I do for 64-bit), then when the application tries to call into the MKL I get the message

"OMP: Error #134: Cannot set thread affinity mask.
OMP: System error #87: The parameter is incorrect."

If instead I set "threading=single" and use mkl_sequential.dll then I get an access violation with a call stack of

ChildEBP RetAddr Args to Child
08dcf8f8 7710272c 087407d8 087407d8 7709b76a ntdll!RtlpBreakPointHeap+0x23 (FPO: [Non-Fpo])
08dcf914 77140b37 093e0000 087407d8 7709b76a ntdll!RtlpValidateHeapEntry+0x16d (FPO: [Non-Fpo])
08dcf95c 770fa967 093e0000 50000063 087407e0 ntdll!RtlDebugFreeHeap+0x9a (FPO: [Non-Fpo])
08dcfa50 770a32f2 087407d8 087407e0 00000003 ntdll!RtlpFreeHeap+0x5d (FPO: [Non-Fpo])
08dcfa70 766314d1 093e0000 00000000 087407e0 ntdll!RtlFreeHeap+0x142 (FPO: [Non-Fpo])
08dcfa84 0f4d1316 093e0000 00000000 087407e0 KERNEL32!HeapFree+0x14 (FPO: [Non-Fpo])
08dcfae8 0f4d1cfb 0f4d0000 08dcfb14 770a97c0 mkl!_vmlFreeThreadLocalData+0x22
08dcfaf4 770a97c0 0f4d0000 00000003 00000000 mkl!_DllMainCRTStartup+0x1e (FPO: [Non-Fpo]) (CONV: stdcall) [f:\\dd\\vctools\\crt_bld\\self_x86\\crt\\src\\crtdll.c @ 476]
08dcfb14 770c20bb 0f4d1cdd 0f4d0000 00000003 ntdll!LdrpCallInitRoutine+0x14
08dcfbb8 770c22a2 00000000 00000000 08dcfbd4 ntdll!LdrShutdownThread+0xe6 (FPO: [Non-Fpo])
08dcfbc8 7663367e 00000000 08dcfc14 770a9d72 ntdll!RtlExitUserThread+0x2a (FPO: [Non-Fpo])
08dcfbd4 770a9d72 005ac470 60c777a3 00000000 KERNEL32!BaseThreadInitThunk+0x15 (FPO: [Non-Fpo])
08dcfc14 770a9d45 69f459c0 005ac470 00000000 ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
08dcfc2c 00000000 69f459c0 005ac470 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])

This seems to occur about 10 seconds after the calculation we are doing completes (which is when the last call to the MKL returns).

The calls to the MKL are being made inside an operation parallelised by the .NET 4 Task Parallel Library. The MKL calls are done via P/Invoke with a typical declaration of

[DllImport(MKL_DLL, CallingConvention = CallingConvention.Cdecl, ExactSpelling = true, SetLastError = false)]
public static extern void cblas_dgemv(CBLAS_ORDER order, CBLAS_TRANSPOSE TransA, int M, int N, double alpha, double[] A, int lda, double[] X, int incX, double beta, double[] Y, int incY);

One observation that seems reasonably solid is that if the calculation is run in a single thread first and subsequently run multithreaded then the crash doesn't occur. If it is run multithreaded the first time, the crash occurs pretty reliably.

I've tried turning on the Microsoft managed debugging assistants to force garbage collections as we cross the P/Invoke boundary, but these seem to force the calculation to become single threaded, so the crash stops (although the slowdown is so dramatic that I haven't been able to run enough samples to be sure it's gone rather than just become rare).

Am I doing something obviously wrong? Any ideas for where to look to track down the problem?

Thanks in advance
Ian

Some further information.

I've shrunk the test case for the crash I'm seeing down to a smaller C# app with source code below. I'm running this against a custom MKL dll built using the command line

nmake /f"C:\\Program Files\\Intel\\MKL\\10.2.5.035\\Tools\\Builder\\makefile" ia32 export=MKLfunctions.lst buf_lib= MKLRoot="C:\\Program Files\\Intel\\MKL\\10.2.5.035" name=MKL threading=sequential

where MKLfunctions.lst contains:

	vdCdfNormInv
	vdCeil
	vdDiv
	vdErf
	vdErfInv
	vdErfc
	vdExp
	vdExpm1
	vdFloor
	vdInv
	vdInvCbrt
	vdInvSqrt
	vdMul
	vdRound
	vdSub
	vdTrunc
	vmlClearErrStatus
	vmlClearErrorCallBack
	vmlGetErrStatus
	vmlGetErrorCallBack
	vmlGetMode
	vmlSetErrStatus
	vmlSetErrorCallBack
	vmlSetMode

Reducing the included VML functions further seems to make the crash intermittent or disappear (despite the fact that the test code only calls vdCdfNormInv).

/DEBUG has also been added to the arguments of the Link command in the makefile in order to get PDB symbols.

When the C# is built as 32-bit, the number of threads is left as 4 and the calculation is triggered, it tends to crash a while (30 seconds or so) after the calculation completes. If the number of threads is reduced to 2 it doesn't crash. If it is run with 2 threads followed by 4 then it also appears not to crash.

There appears to be an operating system element as the crash has been observed on every machine it's been tried on that ran Windows 7 (32 and 64 bit tried) or Vista (only 64 bit tried) but has not occurred on any of the Windows XP machines tried (both 32-bit and 64-bit).

C# Test code

using System;
using System.Runtime.InteropServices;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace TestThread
{
    public class FormMain : Form
    {
        public FormMain()
        {
            InitializeComponent();
        }

        private void calculateButton_Click(object sender, EventArgs e)
        {
            var result = Parallel.For(0, (int)this.numThreads.Value, Calculate);
            this.textBox.Text = result.IsCompleted.ToString();
        }

        private static void Calculate(int x)
        {
            Vector_MKL v = new Vector_MKL(10);
            Random random = new Random();

            for (int i = 0; i < v.Count; ++i)
            {
                v[i] = random.NextDouble();
            }

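            Vector_MKL vout = new Vector_MKL(v.Count);
            // Note: as declared below, CdfNormInv(A) reads from A and writes into this
            // instance, so this call reads vout (all zeros) and overwrites v.
            v.CdfNormInv(vout);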
        }

        private NumericUpDown numThreads;
        private Label label1;

        private System.ComponentModel.IContainer components = null;

        protected override void Dispose(bool disposing)
        {
            if (disposing && (components != null))
            {
                components.Dispose();
            }
            base.Dispose(disposing);
        }

        private void InitializeComponent()
        {
            this.calculateButton = new System.Windows.Forms.Button();
            this.textBox = new System.Windows.Forms.TextBox();
            this.numThreads = new System.Windows.Forms.NumericUpDown();
            this.label1 = new System.Windows.Forms.Label();
            ((System.ComponentModel.ISupportInitialize)(this.numThreads)).BeginInit();
            this.SuspendLayout();

            this.calculateButton.Location = new System.Drawing.Point(182, 12);
            this.calculateButton.Name = "calculateButton";
            this.calculateButton.Size = new System.Drawing.Size(75, 23);
            this.calculateButton.TabIndex = 0;
            this.calculateButton.Text = "Calculate";
            this.calculateButton.UseVisualStyleBackColor = true;
            this.calculateButton.Click += new System.EventHandler(this.calculateButton_Click);

            this.textBox.Location = new System.Drawing.Point(12, 41);
            this.textBox.Multiline = true;
            this.textBox.Name = "textBox";
            this.textBox.Size = new System.Drawing.Size(245, 104);
            this.textBox.TabIndex = 1;

            this.numThreads.Location = new System.Drawing.Point(122, 15);
            this.numThreads.Maximum = new decimal(new int[] { 8, 0, 0, 0});
            this.numThreads.Minimum = new decimal(new int[] { 1, 0, 0, 0});
            this.numThreads.Name = "numThreads";
            this.numThreads.Size = new System.Drawing.Size(38, 20);
            this.numThreads.TabIndex = 2;
            this.numThreads.Value = new decimal(new int[] { 4, 0, 0, 0});

            this.label1.AutoSize = true;
            this.label1.Location = new System.Drawing.Point(12, 17);
            this.label1.Name = "label1";
            this.label1.Size = new System.Drawing.Size(46, 13);
            this.label1.TabIndex = 3;
            this.label1.Text = "Threads";

            this.AutoScaleDimensions = new System.Drawing.SizeF(6F, 13F);
            this.AutoScaleMode = System.Windows.Forms.AutoScaleMode.Font;
            this.ClientSize = new System.Drawing.Size(269, 157);
            this.Controls.Add(this.label1);
            this.Controls.Add(this.numThreads);
            this.Controls.Add(this.textBox);
            this.Controls.Add(this.calculateButton);
            this.Name = "FormMain";
            this.Text = "FormMain";
            ((System.ComponentModel.ISupportInitialize)(this.numThreads)).EndInit();
            this.ResumeLayout(false);
            this.PerformLayout();
        }

        private Button calculateButton;
        private TextBox textBox;
    }

    class Program
    {
        [STAThread]
        static void Main()
        {
            Application.EnableVisualStyles();
            Application.SetCompatibleTextRenderingDefault(false);
            Application.Run(new FormMain());
        }
    }

    static class MKL
    {
        // The VML math functions are declared void in the MKL headers, so the import is void too.
        [DllImport("mkl.dll", CallingConvention = CallingConvention.Cdecl, ExactSpelling = true, SetLastError = false)]
        public static extern void vdCdfNormInv(int N, [In] double[] A, [Out] double[] Y);
    }

    internal abstract class Vector
    {
        public int Count { get; protected set; }
        public double this[int i] { get { return Items[i]; } set { Items[i] = value; } }
        public double[] Items = null;

        public Vector(int aCount)
        {
            Items = new double[aCount];
            Count = aCount;
        }
        
        public abstract void CdfNormInv(Vector A);
    }

    internal class Vector_MKL : Vector
    {
        public Vector_MKL(int aCount) : base(aCount) { }

        public override void CdfNormInv(Vector A) { MKL.vdCdfNormInv(Count, A.Items, Items); }        
    }
}
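A side note on the thread counts discussed above: Parallel.For(0, n, body) schedules n iterations but does not promise n distinct threads, so the "number of threads" in the test form is really an iteration count. A minimal sketch, assuming nothing about MKL, for checking how many managed threads actually entered the loop body. The ThreadCensus class and its method name are hypothetical helpers, not part of the original test code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: counts the distinct managed threads that execute
// the body of a Parallel.For loop.
public static class ThreadCensus
{
    public static int CountDistinctThreads(int iterations, Action<int> body)
    {
        var seen = new ConcurrentDictionary<int, bool>();

        Parallel.For(0, iterations, i =>
        {
            // Record the id of the thread running this iteration.
            seen.TryAdd(Thread.CurrentThread.ManagedThreadId, true);
            body(i);
        });

        return seen.Count;
    }
}
```

In the test app, the body would be the existing Calculate method, and the returned count could be shown in the text box next to IsCompleted. The result can be anything from 1 up to the iteration count.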

Hi Ian,

Could you tell us how you build your test code (build command and compiler) so we can investigate the problem?

I tried to build the C# test code from the MSVC 2008 32-bit command line:

Setting environment for using Microsoft Visual Studio 2008 x86 tools.

C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC>cd C:\Intel_MKL_C#_Examples\Uvsl

C:\Intel_MKL_C#_Examples\Uvsl>nmake ia32 MKLROOT="C:\Program Files\Intel\MKL\10.2.5.035"

Microsoft Program Maintenance Utility Version 9.00.30729.01
Copyright (C) Microsoft Corporation. All rights reserved.

Add path of the MKL libraries to the lib environment variable
set lib=%MKLROOT%\ia32\lib;%lib%
MKL entries for custom dll
Workaround for pardiso
Microsoft 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.

_fseeki64.c
Build MKL custom dll
nmake mkl.dll MACHINE=IX86 MKL_LIB="mkl_intel_c_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib" MSLIB=user32.lib

Microsoft Program Maintenance Utility Version 9.00.30729.01
Copyright (C) Microsoft Corporation. All rights reserved.

'mkl.dll' is up-to-date
Add path of the mkl.dll and path of the MKL binaries to the path environment variable
set path=C:\Intel_MKL_C#_Examples\Uvsl;%MKLROOT%\ia32\bin;%path%
Build and run examples
nmake /a vsl.exe

Microsoft Program Maintenance Utility Version 9.00.30729.01
Copyright (C) Microsoft Corporation. All rights reserved.

Compile vsl.cs
csc .\vsl.cs
Microsoft Visual C# 2008 Compiler version 3.5.30729.1
for Microsoft .NET Framework version 3.5
Copyright (C) Microsoft Corporation. All rights reserved.

vsl.cs(3,24): error CS0234: The type or namespace name 'Tasks' does not exist in the namespace 'System.Threading' (are you missing an assembly reference?)
NMAKE : fatal error U1077: 'C:\Windows\Microsoft.NET\Framework\v3.5\csc.EXE' : return code '0x1'
Stop.
NMAKE : fatal error U1077: '"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\BIN\nmake.EXE"' : return code '0x2'
Stop.

Regards,
Ying

Hi Ying,

You won't be able to compile the code using Visual Studio 2008: it uses the Task Parallel Library, which is part of .NET 4.0 and so requires Visual Studio 2010.

There were no other special settings in the project file (apart from compiling as x86 rather than "Any CPU").

I have done some further investigation and I think I have a clearer idea of why this is happening.

The crash always occurs below _vmlFreeThreadLocalData with mismatched parameters to HeapFree. Having looked at the disassembly of the function, it looks to me like the pointer passed to HeapFree is stored in thread local storage (via TlsGetValue/TlsSetValue) but the heap passed as the first parameter appears to be stored as a global.

Having looked at _vmlGetThreadLocalData, where this allocation is created, it appears not only to do the allocation if needed but also to use HeapCreate to create the heap. However, it looks like the calls to HeapCreate do not have any locking around them, so there's a race condition if all the threads try to get their local data at the same time, leading to multiple heaps being created but only the address of one being kept. (I'm not hugely experienced at reading assembler, so I might be wrong.) This would explain most of the symptoms (though admittedly not all).
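
To make the hypothesis concrete, here is a minimal C# sketch of the same check-then-create pattern. It is an illustration only and not MKL's actual code: SharedResource, its handle field (standing in for the global heap handle) and both method names are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustration only: "handle" stands in for a global heap handle that is
// created lazily on first use.
public sealed class SharedResource
{
    private readonly object gate = new object();
    private object handle;
    private int created;

    // Number of handles created so far.
    public int Created { get { return created; } }

    // Racy: two threads can both see null and both create a handle;
    // only the last store survives, so earlier callers hold a stale one.
    public object GetUnsynchronised()
    {
        if (handle == null)
        {
            Interlocked.Increment(ref created);
            handle = new object();
        }
        return handle;
    }

    // Safe: the check and the creation happen under one lock, so exactly
    // one handle is ever created and every caller sees the same one.
    public object GetSynchronised()
    {
        lock (gate)
        {
            if (handle == null)
            {
                Interlocked.Increment(ref created);
                handle = new object();
            }
            return handle;
        }
    }
}

public static class RaceDemo
{
    // Returns how many handles were created when all workers use the lock.
    public static int CreateCountWithLock(int workers)
    {
        var resource = new SharedResource();
        Parallel.For(0, workers, i => { resource.GetSynchronised(); });
        return resource.Created;
    }
}
```

With the unsynchronised version the creation count can exceed one under contention, which would match a pointer allocated from one heap being freed against another heap's handle. Whether it does so on a given run is timing dependent.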

Does this explanation make sense? Is this a result of compiling using libsequential? My impression from reading table 5-2 of the user guide was that this should work if using Win32 threading.

Many thanks,
Ian

Hi Ian,

Thanks, I can now build the test case using MSVS 2010 and the 64-bit MKL library under Windows 7 (64-bit). It runs fine under Windows 7 (64-bit).

But I have an issue building it with the 32-bit MKL library on the 64-bit Windows 7 machine, so I have not investigated further.

Just to clarify:
1) The crash occurs only with the 32-bit MKL sequential library under 32-bit Windows Vista, right? Does it run OK under Windows XP?

2) Both the 64-bit MKL sequential and the 64-bit MKL parallel library work under 64-bit Windows Vista and 64-bit Windows 7, right?

If it is convenient for you, could you please try MKL 10.3 beta (Intel Math Kernel Library 10.3 Beta) and see if anything changes?

(I want to determine whether the problem is
1. OS-related,
2. a mixed threading model conflict (MKL unmanaged parallel mode and C# managed parallel mode), or
3. a TLS limitation in VML.)

Thanks
Ying

1) The crash occurs only with the 32-bit MKL sequential library under 32-bit Windows Vista, right? Does it run OK under Windows XP?

The crash occurs running as a 32-bit app with 32-bit MKL sequential on 32-bit Windows 7, 64-bit Windows 7, 64-bit Vista (there's no 32-bit Vista machine available to test it on). The crash has not been reproduced on XP (32 or 64-bit).

If I compile the app and MKL as 64-bit multithreaded, it works on the 64-bit operating systems, as you say. I haven't tried the 64-bit sequential build because the 64-bit multithreaded one works. (32-bit multithreaded didn't crash but reported the error mentioned in my initial post; solving that problem would be equally useful.)

Thanks

Ian

Hi,

I have now tried the MKL 10.3 beta with the same test code and build procedure and am seeing the same failure. I have not dug in as much depth but all the symptoms point to it being the same failure.

Thanks,
Ian

Hi Ian
I tried to build the test case with 32-bit MKL threaded on 3 machines: 32-bit Vista, 64-bit Windows 7 and 64-bit Vista. We ran into some problems when running 32-bit MKL apps on a 64-bit OS, but couldn't reproduce the problem you reported.

I attach the code and makefile. Please try them on your machine (especially 32-bit Windows 7) and let me know the result.

Some notes:
1) Building the application

To run the examples on the IA32 or 64-bit platform, open the command window:
Start -> All Programs -> Microsoft Visual Studio 2010 -> Visual Studio Tools -> Visual Studio 2010 Command Prompt (32-bit or 64-bit, as appropriate)

Change directory to the one where examples are located and type:
nmake ia32 MKLROOT="C:\Program Files\Intel\MKL\10.2.5.035" on ia32 platform
nmake em64t MKLROOT="C:\Program Files\Intel\MKL\10.2.5.035" on em64t platform

2) About cross-platform use: 32-bit apps running on a 64-bit platform
It seems the app with 64-bit MKL works on a 64-bit OS, and the app with 32-bit MKL works on a 32-bit OS (at least on our machines; please check whether the attached code works on 32-bit Windows 7).

But when I build and run the app with 32-bit MKL on a 64-bit OS, I get the error:
System.BadImageFormatException: at TestThread.mkl.vdcdfNormInv(.)

We investigated the problem and found that it is because the .NET JIT compiles managed code for the host system architecture (that is, 64-bit) by default, while MKL.DLL is built for 32-bit.

The application crashes when trying to import the DLL, with a wrong image format error code.

To fix that, we forced VSL.EXE to run in 32-bit mode with the CorFlags.exe tool (which ships with the .NET SDK):

CorFlags.exe /32BIT+ vsl.exe

From this point on, VSL.EXE runs fine on a 64-bit system.

You may try this and see whether the code can be built and run on your 64-bit Windows 7 and 64-bit Windows Vista machines.
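
As a quick way to confirm which mode the process actually ended up in before the first P/Invoke call, a check along these lines may help. It is a minimal sketch: ProcessBitness and its members are hypothetical names, while IntPtr.Size and Environment.Is64BitOperatingSystem are standard .NET 4 members. Building with csc /platform:x86 (or the x86 platform target in Visual Studio) should have the same effect as the CorFlags step.

```csharp
using System;

// Hypothetical helper for checking that the process bitness matches the
// bitness of the native DLL about to be loaded.
public static class ProcessBitness
{
    // 32 in a 32-bit process, 64 in a 64-bit process.
    public static int PointerBits()
    {
        return IntPtr.Size * 8;
    }

    // True when a native DLL built for dllBits can be loaded into this process.
    public static bool Matches(int dllBits)
    {
        return PointerBits() == dllBits;
    }

    public static string Describe()
    {
        return string.Format("{0}-bit process on {1}-bit OS",
            PointerBits(), Environment.Is64BitOperatingSystem ? 64 : 32);
    }
}
```

For example, calling ProcessBitness.Matches(32) before the first call into a 32-bit mkl.dll and logging ProcessBitness.Describe() on failure gives a clearer message than the BadImageFormatException.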

Regards,
Ying

Attachment: Uvsl.zip (application/zip, 3.24 KB)

Hi

I find the same as you. If I build and run the code in the zip file then I can no longer reproduce the crash I was seeing.

It appears to be down to the custom MKL library: if I swap in the MKL.dll I had previously compiled in place of the one produced by your makefile, then I can still see a crash (intermittently). I note that the DLL built by your makefile is significantly smaller (100 KB rather than 2500 KB) than the one I was getting with the makefile from 10.2.5.035.

I will investigate further and see if I can understand the difference in behaviour.

Thanks,
Ian

Just a quick comment. The main difference should be that the custom MKL builder uses the MKL static libraries, while the C# sample uses the MKL dynamic libraries.

Regards,
Ying

Hi,

I haven't been able to reproduce any problems using the test program when I build the custom DLL using the makefile you give in the zip. However when I scale back up to using the full list of functions and call it from the application I'm still getting an error message:

OMP: Error #134: Cannot set thread affinity mask.
OMP: System error #87: The parameter is incorrect.

This pops up in a console window; when that window is closed, the process exits.

This seems to be a race condition: if I run one or two threads it doesn't occur. If I run with 4 it usually, but not always, occurs. If I run with 8 it occurs every time. (Tested on a machine with two quad-core Xeon processors.) It appears to be related to initialisation: if the calculation succeeds the first time, then it continues to succeed on future attempts even if the number of threads is increased.

I haven't been able to extract a call stack for the calls to SetThreadAffinityMask yet but the calls to SetProcessAffinityMask have a call stack:

(C# calling code)
mkl!vdCdfNormInv
mkl!mkl_vml_service_threader_d_min
mkl!mkl_vml_service_threader_s_min+0x158 (no symbol available for this function)
kernel32!SetProcessAffinityMask

In general, is it supposed to be safe to call the VML functions from multiple threads simultaneously? Or should we be calling a specific function on a single thread first to ensure safe initialisation?
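
For reference, the "call it once on a single thread first" idea could be sketched as below. This is a workaround sketch under the assumption that the problem is confined to first-use initialisation, not a confirmed fix. OneTimeWarmUp and WarmUpDemo are hypothetical names, and the counter stands in for what would be one small native call (for example a one-element vdCdfNormInv).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Runs a caller-supplied "first call" exactly once. Concurrent callers
// block until that first call has finished.
public sealed class OneTimeWarmUp
{
    private readonly Lazy<bool> warmUp;

    public OneTimeWarmUp(Action firstCall)
    {
        warmUp = new Lazy<bool>(
            () => { firstCall(); return true; },
            LazyThreadSafetyMode.ExecutionAndPublication);
    }

    // Returns once the first call has completed on some thread.
    public bool Ensure()
    {
        return warmUp.Value;
    }
}

public static class WarmUpDemo
{
    // Returns how many times the warm-up action ran across all workers.
    public static int Run(int workers)
    {
        int calls = 0;

        // Stand-in for the real warm-up, which would be one small native call.
        var warmUp = new OneTimeWarmUp(() => Interlocked.Increment(ref calls));

        Parallel.For(0, workers, i =>
        {
            warmUp.Ensure();
            // The real per-task native calls would go here.
        });

        return calls;
    }
}
```

The same effect can be had by making the small native call once at application start-up, before any Parallel.For runs.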

Hi Ian,

Have you tried MKL 10.3 beta?
(With MKL 10.3 beta, as MKL provides a run-time library in that version, you can call mkl_rt.dll directly instead of building the custom DLL mkl.dll.)

1) About your question on threading in the VML functions: yes, VML has a global variable, and this caused TLS issues in early MKL versions. That is the main reason the custom DLL build tool is provided in the MKL distribution package. But the problem has seldom occurred in recent versions (including MKL 10.2.x; at least, there have been no such reports this year). Anyway, if possible, please update to MKL 10.3 beta and let me know the result.

Regarding the error message
OMP: Error #134: Cannot set thread affinity mask.
OMP: System error #87: The parameter is incorrect.
2) Could you please check whether there are other versions of libiomp5md.lib and libiomp5md.dll on your machine? (If so, rename them and keep only the one associated with your MKL version.)

3) Or please use mkl_sequential_dll.lib instead of mkl_intel_thread_dll.lib in the makefile and try that with your large application.

Regards,
Ying

Hi,

There were no other versions of libiomp5md.lib on the system.

However, switching to mkl_sequential_dll.lib seems to have solved the problem: we have not had a crash in the last week or so of using it.

I wonder if you can put my mind at ease, though. I'm currently a little nervous of this fix, as we seem to be using the same code as we started with, just with the custom DLL linking to the MKL dynamically rather than statically. Given that it seemed to be a race condition in the first place, I can't see how this could have solved the problem fundamentally, as opposed to just letting us be consistently lucky. Are you able to explain?

Many thanks for your help,
Ian

Hi Ian,

I'm sorry for missing your follow-up, and for stopping the search for the root cause as we couldn't reproduce the problem. Do you have a chance to revisit the problem with MKL 10.3.9 (the latest version) and let me know how it works?

Thanks
Ying

Quoting Ying H (Intel): "...we can't reproduce the problem. Do you have a chance to revisit the problem with MKL 10.3.9 (the latest version) and let me know how it works?"

Hi everybody,

I'd like to bring the attention of Intel software engineers to the OpenMP library instead of MKL. During the last 10 days this is the 2nd case with these two errors:

...
OMP: Error #134: Cannot set thread affinity mask.
OMP: System error #87: The parameter is incorrect.
...

1. It looks like there is a problem with the OpenMP library

2. Please take a look at a similar post at:

http://software.intel.com/en-us/forums/showthread.php?t=103306&o=a&s=lr

3. Did you try to set a higher number of OpenMP threads, 4096 for example, with these two environment variables:

OMP_NUM_THREADS = 4096
OMP_THREAD_LIMIT = 4096

Best regards,
Sergey
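
If it is easier to set these from the C# side than in the system settings, a minimal sketch follows. OpenMpEnvironment is a hypothetical helper. It assumes the variables are set before the first call that loads the MKL/OpenMP DLLs, and whether the OpenMP runtime then picks them up is an assumption that has not been verified here.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: sets the two OpenMP variables for the current process.
// It must run before the first P/Invoke call that loads the native DLLs;
// whether the OpenMP runtime honours values set this way is an assumption.
public static class OpenMpEnvironment
{
    public static void Apply(int numThreads, int threadLimit)
    {
        Environment.SetEnvironmentVariable(
            "OMP_NUM_THREADS", numThreads.ToString(CultureInfo.InvariantCulture));
        Environment.SetEnvironmentVariable(
            "OMP_THREAD_LIMIT", threadLimit.ToString(CultureInfo.InvariantCulture));
    }
}
```

In the test app this would be a call such as OpenMpEnvironment.Apply(4096, 4096) at the top of Main, before Application.Run.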

Hi Sergey,

Thanks a lot for the suggestion. I have posted the question to the Intel C++ compiler forum: http://software.intel.com/en-us/forums/showthread.php?t=103501.

Let's see if there is any news from there.

Thanks
Ying H.

Hello,

The fix should be available in Composer XE 2011 update 9 or MKL 10.3.9.

Hope this helps,
--Vladimir

Dear All,

Some comments from Intel OpenMP engineers:

We could not reproduce the problem (we tried the C# example with a higher number of OpenMP threads; if possible, could anyone provide a test case?). But the error may happen when there is a call to SetProcessAffinityMask during library initialization and, after that, another call to SetThreadAffinityMask (either explicit or implicit). We changed the behaviour in the 20110823 libiomp5md.lib library: it now works silently, because the current version ignores the affinity setting if SetThreadAffinityMask fails.

So we suggest trying the 20110823 or later OpenMP RTL.

Best Regards,
Ying H.
