Hyperthreading, registers

Posted by alex-telitsine

I have following random problem with hyper-threading/OpenMP.

Thread 1   | Thread 2
Function B | Function B
Function A | Function A

Occasionally, the registers in function A get screwed up, generating an access violation. If I restart function A in the debugger, it loads the registers correctly and the function completes.

It seems that during hyperthreading the registers are somehow corrupted. The stack seems to be intact, and there are no static variables. Has anybody seen such a problem, or does anyone have advice? Maybe it's cache related?

Thanks,
Alex Telitsine
alex@streambox.com
Streambox Inc

Posted by quocanle

I'm not sure I understand your diagram with regard to Function B and Function A. Some pseudo code might help.

You might have a race condition within your code. Check out the Intel threading tools.

These tools can help you with the development of OpenMP applications for Hyper-Threading technology.

An Le
Intel

Posted by alex-telitsine

What is a race condition?

Below is pseudo code:

FunctionB(Start, End)
{
    for (all data blocks from Start to End) {
        FunctionA(data block)
    }
}
//--------------------------------
#pragma omp parallel
#pragma omp sections
{
    #pragma omp section
    FunctionB(0, N/2);
    #pragma omp section
    FunctionB(N/2, N);
}
#pragma omp barrier
//--------------------------------

Thanks,
Alex

Posted by Community Admin

>> What is a race condition?
Picture this: a word-processor program creates two threads, one to read a file and another to write to it. Everything is fine at first: the reading thread waits for the writing thread to finish before reading the file. The two threads work happily, and everything is fine as long as the writing thread always writes first. But one dark day the reading thread reads the file before the writing thread has written it, and the program fails. This is known as a race condition, because both threads race to finish their file operation; the program fails whenever the reading thread wins the race. A race condition is the result of bad synchronization.

Posted by alex-telitsine

An, thanks for the reply.
There is no I/O inside the functions.
I'm using hyperthreading (OpenMP) in three main portions of the application:

Motion search
DCT/Quantization
Bit parsing/arithmetic coding.
It gives me about a 15% gain in execution speed.

Motion search is the most intensive part of it, and it's pure data moving/comparison. The register corruption happens with hyperthreading only, and only after 4-12 hours.
On the access violation, I could move the PC back to the start of the function, and then it comes through fine, after the registers are loaded with data from the stack. It seems like the registers get screwed up somehow on a context switch. I cannot isolate the conditions under which it happens. Could it be a problem with the way threads/memory are handled in the Intel 7.0 OpenMP implementation?

Another strange behavior of the Intel compiler:
I have code like this:
if (pCI->MultipleCPU) {
    // OpenMP section
} else {
    // serial section
}

Even if MultipleCPU is set to zero, some OpenMP code is executed at the start of the function anyway. It seems that if a function has a #pragma omp parallel inside, OpenMP setup code is inserted at the beginning of the function regardless, which is not correct in my opinion.

Thanks,
Alex




Posted by Community Admin

Alex,


A stack corruption error in the OpenMP runtime library would be very unusual. OpenMP is typically used in High Performance Computing (HPC), where applications also run for hours at a time.

Intel has a tool coming out soon called the Intel Thread Checker. It looks for coding and logic errors. It is available only on Windows* and IA-32 systems.

If you are interested in getting access to the beta version of this tool, please send me an email. I will forward your request to the appropriate person.

An

Posted by aaron-coday (Intel)

>
> FunctionB(Start,End)
> {
> for (all data blocks from Start to End){
> FunctionA(data block)
> }
> }

1. One common and important error is in the for loop above: are you writing something like

for (i = Start; i < End; i++)

Sometimes people accidentally use "<=" instead of "<".

> //--------------------------------
> #pragma omp parallel
> #pragma omp sections
> {
> #pragma omp section
> FunctionB(0,N/2);
> #pragma omp section
> FunctionB(N/2,N);
> }
> #pragma omp barrier
> //--------------------------------

2. Next some general tips
* Do you get the same problem with Release and Debug mode?

* Try without the second section, i.e. something like:

#pragma omp parallel
#pragma omp sections
{
#pragma omp section
FunctionB(0,N);
}
#pragma omp barrier

Doing this will launch the computation in a separate thread from the master thread. If you still see the error here, then the problem is somehow in the way you communicate data.

* Next, with your OpenMP version, set the OMP_NUM_THREADS environment variable to 1 and run your program. This runs with one thread but still through the OpenMP runtime. It's a good test for communication problems.

* Good old code inspection :)

* Again the Intel Thread Checker will likely find the problem for you :)

Hope this helps!

Posted by alex-telitsine

The problem was related to overclocking.
Somehow overclocking caused random crashes, but only with hyperthreading enabled and only in the hyperthreaded sections.
The problem disappeared after going back to the original 3.06 GHz; it has been running fine for many days now.
I needed to overclock during development to get it to work with real-time video samples.

Posted by Brandon Hewitt (Intel)

I'm not sure you really understood what a race condition is, based on your response that there is no I/O inside the functions. A race condition is just as likely (more likely, in fact) to involve memory as files (which An used in his example). Any time you have a situation where one thread is reading data that can be written by another thread, you have a potential race condition. There are ways to prevent these (such as OMP barrier statements), but the programmer needs to be careful.

The new observation that overclocking appeared to be a catalyst for the crashes really makes me suspect a race condition. If this code is going to have more than a handful of users, I would recommend reviewing this code in depth as advised by Aaron to make sure there are no problems. Race conditions can be almost impossible to reproduce sometimes, so it's best to diagnose and fix them when you have the chance.

Brandon
Intel Software Network Support

For on-line assistance: http://support.intel.com/support/performancetools
For product support information: http://www.intel.com/software/products/support
* Intel and Pentium are registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries
* Other names and brands may be claimed as the property of others

Message Edited by intel.software.network.support on 12-09-2005 09:51 AM

Brandon Hewitt Technical Consulting Engineer Tools Knowledge Base: "http://software.intel.com/en-us/articles/tools" Software Product Support info: "http://www.intel.com/software/support"
Posted by alex-telitsine

Brandon,

I've checked the code with the Intel threading plugin for VTune. It's a very useful tool, but quite slow. Anyway, it identified one spot in the code with a race condition (plus a lot in system DLLs). I fixed it, ran the application with the threading tools to make sure, and then tried to overclock the CPU again.
No luck: after a day of execution the application crashed again in the most CPU-intensive part of the code, the motion search. The reason for the crash was the same: registers are corrupted, while the data on the stack that the registers were loaded from is OK. I want to point out that the motion search doesn't have, and never had, a race condition; no data is written to the same area by different threads.
What I don't understand is how a race condition could cause a problem like mine at all. Obviously, the two threads have separate stacks, and my guess is the problem occurs during a register load from the stack, like:
mov eax, [esp+xxx]
The stacks are in different memory regions for the two threads, by definition. Without overclocking everything runs OK, for at least the week I was able to test.

I really would like to pinpoint the problem, since the new systems, with HT enabled, will be used by TV broadcasters, and 100% uptime is a requirement.

Thanks,
Alex

Posted by bronx


Alex,

Your code snippet looks very much like code prone to L1 d-cache aliasing. A very simple trick can minimize the DL1 conflict misses, and you will enjoy some % speedup, more safely than if you resort to overclocking ;-)

Problem:
* DL1 64 KB alias conflicts + branch-predictor issues with speculative data at addresses equal modulo 1 MB
* Under current MS Windows, thread stack frames are allocated on multiples of 1 MB boundaries by default

Work-around:
Adjust the initial thread stack address by placing a call to _alloca at the beginning of a wrapper block around the main thread procedure. Example:

{
    _alloca(threadNumber * offset);
    MainThreadProc();
}

1 KB offsets can be enough, but the best fit is application dependent. So experiment with other values and real-world measurements to find the optimal offset in your own case.

You just have to add 3-4 lines of code. Also have a look at the link below.


About overclocking:

As you know, GPR moves are executed with 0.5-clock latency by the Rapid Execution Engine (fast ALUs that are double-pumped at more than 6 GHz on your 3.06 GHz part!), and loads from the L1 d-cache are speculative. I suppose all of this is very sensitive to signal integrity and requires clock skew strictly within the specified range. So it's certainly not the best idea to overclock your server if you need 24/7 availability.


DL1 "antialiasing"


Posted by bronx


http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/ia32/xeon/threading/20437.htm


Posted by Brandon Hewitt (Intel)

Alex,

Just to respond about how your code could have race conditions. First, remember that I've only seen a tiny bit of your code, so I can only speculate based on what I've seen. However, race conditions occur with data shared between threads; if threads couldn't share data, their usefulness would be limited. The fact that they have separate stacks is irrelevant, because shared data generally lives in globals or on the heap, not on any one thread's stack.

If I have code like the following:

#pragma omp section
FunctionB(0, N/2, 0); // thread 1
#pragma omp section
FunctionB(N/2, N, 1); // thread 2

then say I define FunctionB like so:

int global_array[N+1];

void FunctionB(int a, int b, int c) {
    for (int i = a; i <= b; i++)
        global_array[i] = c;
}

So what happens if thread 1 does all its work, then thread 2 does all its work? You get global_array[N/2] containing 1. However, what if thread 2 does its work before thread 1? Then global_array[N/2] contains 0 at the end. Hence, the value of global_array[N/2] is entirely dependent on the order of execution of the threads. This is a race condition that could result in an error. The diabolical part of this is that you may never know this error exists if the threads always happen to schedule in the "right" way.

In any event, I can't tell you for sure whether the behavior you are seeing is due to a race condition or not. The fact that the VTune analyzer pointed one out but didn't identify anything else makes me suspect this could instead be due to what bronx is talking about. If you still suspect the compiler might be causing the problem, submit an issue at https://premier.intel.com.

Brandon
Intel Software Network Support



Posted by bronx

Brandon,

More than once I've noticed code snippets getting mangled in this forum, like your "global_array" indexation by "i", which had apparently disappeared, right? Maybe something can be done in the future to improve that.


>The diabolical part of this is that you may never know

Spurious crashes at customers' sites and never in-house, and so on; *diabolical* is the word indeed.

Posted by Brandon Hewitt (Intel)

It's probably the same thing that italicized by last two paragraphs. Yes, global_array should be indexed by i.

Brandon

Posted by Brandon Hewitt (Intel)

However, "by" should be "my" and that is strictly user error :-)

Brandon

Posted by bigbearking

I wrote a simple memory-copy program (below) and compiled it with the Intel C compiler using the -axN option to get vectorized code (with non-temporal writes). In one run I removed the initialization loop over A, and in another run I kept it. The memory copy between the two arrays A and B is repeated 10 times.



I can understand that in the first iteration the reported memory bandwidth would be quite different due to paging, but the results of the other iterations should be the same. However, I got 6.x cycles per float element in the second run and 3.x cycles per float element in the first run.



I don't really understand this. I am using a 2-processor Xeon 2.4 GHz system with a 533 MHz front-side bus. Can anyone explain this? Thanks!



//mysecond.c

#include <sys/time.h>

double mysecond()
{
    struct timeval tp;
    struct timezone tzp;
    int i;

    i = gettimeofday(&tp, &tzp);
    return ((double) tp.tv_sec + (double) tp.tv_usec * 1.e-6);
}

//copy.c

#include <stdio.h>

#define RATE 2.4e9
#define N 1024*1024*64
#define NTIMES 10

extern double mysecond();

__declspec(align(1024)) float a[N], b[N];

int main () {

    int i, j, k;
    int kk;
    double t_1;

    for (kk = 0; kk < N; kk++) {
        a[kk] = 1;
    }

    for (k = 0; k < NTIMES; k++) {

        t_1 = mysecond();

        for (kk = 0; kk < N; kk++)
            b[kk] = a[kk];

        t_1 = mysecond() - t_1;

        printf("cycles/element = %lf, bandwidth = %lf\n",
               t_1 * RATE / (N * 1.0), N * 1.0 * 4 * 1.0 / t_1);
    }
}
