Performance loss when converting statics to class members

A few years ago I started working on a project for which I created (among others) one C (not C++) file with a large number (hundreds) of static and global variables, plus functions using those variables. A few dozen of them are arrays of exactly 32 or 64 kB, all allocated with __declspec(align(64)); the rest are mainly base types, sometimes small (size 2) arrays of base types.

I'm now trying to convert this file to a class, mainly because I need to be able to have multiple instances of it.

What I did:
1. Converted all functions to class methods.
2. Converted all the static and global variables to class members (removing 'static').

I have overloaded the constructor of the class to make sure it's always aligned at 64 bytes, and I've checked that the 32/64 kB arrays are also still aligned at 64 bytes.

At first I got a really big performance drop (more than 13%). After checking the pointer values I discovered that my old implementation with statics caused memory to be allocated at more-or-less random locations; after converting, many of the arrays were exactly 64 kB apart, which of course causes caching issues. So I added some 'fillers' (0x1100 bytes each) between them to get rid of that. This nearly completely restored the performance.
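The filler idea can be sketched roughly as follows. This is only an illustration, not the original code: the struct names, array names and the helper `aliases64k` are hypothetical, `alignas(64)` stands in for `__declspec(align(64))`, and a 4-byte `float` is assumed. Only the 64 kB array size, the 64-byte alignment and the 0x1100-byte filler come from the post.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout without fillers: the two arrays start exactly 64 kB apart.
struct BuffersNoFiller {
    alignas(64) float a[16384];   // 64 kB, assuming 4-byte float
    alignas(64) float b[16384];   // 64 kB
};

// Same layout with a filler between the arrays.
struct Buffers {
    alignas(64) float a[16384];
    char filler[0x1100];          // 0x1100 is a multiple of 64, so b stays 64-byte aligned
    alignas(64) float b[16384];
};

// True when two offsets differ by an exact multiple of 64 kB. On a cache whose
// way size divides 64 kB, element i of both arrays then maps to the same set.
inline bool aliases64k(std::size_t offA, std::size_t offB) {
    std::size_t d = offA > offB ? offA - offB : offB - offA;
    return d % 0x10000 == 0;
}
```

Calling `aliases64k(offsetof(Buffers, a), offsetof(Buffers, b))` on both layouts shows whether a given filler size actually breaks the 64 kB spacing.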

But now that I have added all the variables to the class, I'm seeing a 4% drop in performance. This new class is only a control layer with some simple calculations; most of the work is done elsewhere in other classes (and partially by IPP).

I'm using compiler option /Qipo, due to which almost everything gets inlined into one big function, which makes it difficult to analyse what is causing the changes. (I would have to wade through a few MB's of assembly output).

4% may not seem much, but this is a real-time application, which is consuming quite a lot of processing power as it is. So I really want to get rid of the extra overhead.

Are there more things (like the different memory locations) that I should be aware of when performing this conversion?

A portion of the 4% could be the result of code alignment and/or size changes. I've seen as high as 3% when adding a function that is never called.

Another potential cause is additional register "pressure". The object pointer, when it is not already in a register, has to be loaded (which may require pushing the current this pointer onto the stack), and the offset to the member variable is then obtained from the instruction stream. Static objects do not require a context switch of the this pointer.

I suggest you compile both ways (assuming your old code is selectable for conditional compilation), and run a profiler. I recommend VTune as opposed to Parallel Amplifier since you can see the disassembly code with VTune (if a disassembly view is available in PA, then please tell me how to get to it). If VTune isn't available, then try AMD CodeAnalyst. It works on Intel processors using timer-driven sampling.

Jim Dempsey

www.quickthreadprogramming.com

Using VTune isn't really an option (I've tried, but due to the /Qipo option everything gets folded into one huge function, the output .ASM file contains a single function of a few MB in size. And VTune lists 70% of the CPU usage in that function.)

But I've now followed your suggestion and changed the code to compile conditionally, and I've moved all the static declarations to a single location in the file instead of scattered through the entire file (fortunately that turned out to have no effect on the performance).

I'm now stepwise converting statics to non-statics, and testing the performance at every step. So far I've been able to move about half of the variables (including all the arrays, surprisingly) with no performance effect at all (less than 0.5% anyway). I'll continue with the rest of the variables, until I encounter variables that affect the performance. But if I can single them out I can probably also fix the performance effects.

Note: About the this pointer. That was also my initial hunch. But as it turns out, statics (at least the static arrays, haven't checked basic types) are apparently allocated on the fly - scattered through memory. So they need to be looked up as well. And that should even be harder - for the class members there's just this + a constant value which can be hard-coded. For statics a real look-up must be performed. So I would expect the statics to be even slower...

By the way, Jim is right about Amplifier not being able to show asm; Amplifier is a lightweight performance tuning tool.

So after removing the static, everything will be on the stack. Do you have "__declspec(align(64))" on the class definition?

Jennifer

I have overloaded 'new' to make sure the class is allocated at a multiple of 64 bytes. And I've checked the array pointers in it, they are all at 64 byte boundaries.
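For readers following along, a class-specific `operator new` that returns 64-byte-aligned storage might look like the sketch below. It is not the poster's implementation: the class name `Engine`, its member and the helper `isAligned64` are hypothetical, and the over-allocate-and-round-up scheme is just one portable way to do it (on MSVC, `_aligned_malloc` would be an alternative).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

struct alignas(64) Engine {              // hypothetical class
    float state[16];

    static void* operator new(std::size_t size) {
        // Over-allocate so that a 64-byte boundary and a back-pointer always fit.
        void* raw = std::malloc(size + 64 + sizeof(void*));
        if (!raw) throw std::bad_alloc();
        std::uintptr_t p = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
        p = (p + 63) & ~static_cast<std::uintptr_t>(63);   // round up to a multiple of 64
        reinterpret_cast<void**>(p)[-1] = raw;             // remember what malloc returned
        return reinterpret_cast<void*>(p);
    }

    static void operator delete(void* ptr) noexcept {
        if (ptr) std::free(static_cast<void**>(ptr)[-1]);  // free the original block
    }
};

// Helper to check the alignment of any pointer, e.g. of the arrays inside the object.
inline bool isAligned64(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```

Note that this only covers heap instances; static and stack instances rely on the alignment attribute of the class itself.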

I have converted all statics to static class members: No effect on the performance.
Now I'm converting them, step by step, to normal members. I have now converted about 60% of the variables (including ALL the arrays) and I'm seeing a performance loss of about 0.5%. If I convert the rest also this increases to 4%, so most of the performance loss is in the part that I haven't converted yet.

But for now I'm going to use another solution: 95% of the users of my application only need a single instance of this class, and I know in advance whether they do. So I'm going to build 2 different versions of it. Fortunately, because I converted everything to static class members, this won't have any effect on the rest of the code, so I can achieve this by using a few #defines in the class I'm converting.

Maybe at a later point I'll continue the conversion to clean up the code, but for now I can reach maximum performance for 95% of my users with just a few lines of code - so for now I'm taking that route.
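A minimal sketch of such a two-build switch, under the assumption that a single macro selects the variant. The macro names `SINGLE_INSTANCE` and `MEMBER_STORAGE`, the class `Processor` and its members are all hypothetical and only illustrate the idea:

```cpp
#include <cassert>

// Define SINGLE_INSTANCE (hypothetical name) for builds that only ever
// need one instance; leave it undefined for the multi-instance build.
#ifdef SINGLE_INSTANCE
#  define MEMBER_STORAGE static
#else
#  define MEMBER_STORAGE
#endif

class Processor {                       // hypothetical class
public:
    MEMBER_STORAGE int   gain;
    MEMBER_STORAGE float level;

    // The method bodies are identical in both builds.
    void reset() { gain = 1; level = 0.0f; }
};

#ifdef SINGLE_INSTANCE
// Static data members need exactly one out-of-class definition.
int   Processor::gain;
float Processor::level;
#endif
```

Code that uses `Processor` through an object (`p.reset()`, `p.gain`) compiles unchanged in both variants, which is why the rest of the program is not affected.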

Both Amplifier and VTune can show you which statements within the function are incurring the largest overhead.

Depending on what you are looking at in either tool, if you double click on the function name, the source code will come in to view. If this doesn't work, then poke around the tool bars to find "view source" or something to that effect.

What you might want to do is disable Qipo and run the profiling there. Although this code might not run as fast, the profile might isolate the area of concern.

Jim

www.quickthreadprogramming.com

Have you checked to see if your new code (the slow code) is performing many new/delete or ctor/dtor operations due to stack instantiation of your class? If this is the case, then consider making these objects/arrays persistent (allocate and/or construct once), e.g. maintain a pool of these objects: on the first iteration, when the pool is empty, perform new/ctor; then at the end of the iteration, instead of delete, link the object into a list of like objects in your pool(s).

Jim Dempsey

www.quickthreadprogramming.com

One of the problems with ipo is it folds everything into one huge function.

While this removes function calls (with their argument passing overhead), it can also make your code sequence larger than what fits into your L1 instruction cache. In some cases ipo can result in slower code.

How does your code perform without Qipo?

How does your code perform when optimized favoring size?

Jim Dempsey

www.quickthreadprogramming.com

I have tried a lot of different options, and /Qipo makes my code a few percent faster, at least it did when I was still using statics. I haven't really checked 'favour small size' yet.

I can try what VTune shows when I disable /Qipo (I'll first have to check if the 4% difference is still there).

The code is intended as real-time code, so there are no memory (de)allocations during processing, only at startup/shutdown. The class itself is instantiated only once at startup.

>>The code is intended as real-time code, so there are no memory (de)allocations during processing, only at startup/shutdown. The class itself is instantiated only once at startup.

and in your first post

>> multiple instances of...

Have you checked to see if you could place your static objects onto the stack? With /Qipo and aggressive inline optimizations, the stack layer in which the relocated (formerly static) objects are declared is directly visible and accessible using EBP (RBP) relative addressing. In this circumstance, the static code and the EBP-relative code are essentially the same. One uses an offset of xxx from Virtual Address 0, the other uses an offset of yyy from EBP.

Depending on where the static image was loaded in your Virtual Address space, the number of bytes necessary to represent the offset may differ. The address in the statically compiled version might have consumed 1, 2, 4 (or 8) bytes in the instruction sequence. Under favorable optimization, when /Qipo folds the code into one function and the former static objects are placed on the stack, placing them earlier on the stack could reduce the number of bytes in the instruction stream required to access these objects, e.g. place the state variables first, and your larger arrays/buffers later. In this manner the instructions will consume fewer bytes. Fewer instruction bytes may improve instruction pipeline performance.

Jim Dempsey

www.quickthreadprogramming.com

Ok, I'll try that, sounds like it should work - thanks!

About the multiple instances: There are 3 'tastes' of my program: Stand alone, Winamp plugin and VST plugin. For the stand alone and Winamp plugin versions, there's always only 1 instance of the class - if an application using the Winamp plugin needs multiple instances it needs to load the DLL multiple times (the Winamp plugin interface does not allow creating multiple instances, it's a C-based interface).

In case of a VST plugin, programs can open multiple instances based on a single loaded DLL (the VST plugin interface is class-based). Opening and closing instances does not happen during processing.

I added the VST version later, and last week I received a bug report about strange behavior when multiple VST plugin instances were used - only then I discovered that this is an issue.

This is really odd: I've tried making the object static, and I'm still getting a 4% performance loss, if I make the class members non-static. I declared the class with __declspec(align(64)).

eg. static __declspec(align(64)) MyClass myObject;

If in MyClass all the class members are static, the code is fast, if they aren't, the code is still 4% slower.

I'm really going to give up on this for now (I don't have time to keep working on this, especially since the difference only affects a small portion of my users and it's not really big).

Static member variables do not require the use of the this pointer:

class MyClass
{
...
static double foo;
...
};
...
static __declspec(align(64)) MyClass myObject;

a = myObject.foo;

The above code will not load the address of myObject into the register used for the this pointer.
The above would be true even without "static" on the "__declspec(align(64)) MyClass myObject;" (due to static being on the member variable "double foo;" within the MyClass class).

Static on foo would mean you would only have one instance of MyClass::foo regardless of the number of MyClass objects within your program.
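The language-level difference can be checked with a few lines. This is only a sketch: the member `bar` and the function `demoSharedStatic` are hypothetical additions, and it shows the semantics (one shared copy, no per-object storage), not the generated machine code.

```cpp
#include <cassert>
#include <cstddef>

struct MyClass {
    static double foo;   // one copy for the whole program, not stored in the object
    double bar;          // hypothetical non-static member: one copy per instance
};
double MyClass::foo = 0.0;

inline void demoSharedStatic() {
    MyClass a, b;
    a.foo = 1.5;                      // equivalent to MyClass::foo = 1.5
    a.bar = 2.0;
    b.bar = 3.0;
    assert(&a.foo == &b.foo);         // both names refer to the same variable
    assert(b.foo == 1.5);
    assert(a.bar != b.bar);           // per-instance members are independent
    assert(sizeof(MyClass) == sizeof(double));  // foo takes no space in the object
}
```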

Jim

www.quickthreadprogramming.com

The situation that I've tested is the opposite: myObject is static and foo is not. But shouldn't the compiler treat 'foo' as static because myObject is static?

>>The situation that I've tested is the opposite: myObject is static and foo is not. But shouldn't the compiler treat 'foo' as static because myObject is static?

In your post prior to the above quote:

>>If in MyClass all the class members are static, the code is fast, if they aren't, the code is still 4% slower.

This seems to imply (to me) that the member variables are declared static too.

Jim

www.quickthreadprogramming.com

They are now, but that's exactly what I'm trying to get rid of.

I expected that your proposed solution (declaring the object itself as static) would automatically make all the class members be declared static. And hence give me back the 4% performance that I lost when removing that 'static' before all the class members.

Make the instance(s) at compile time as opposed to at run time (by way of new/malloc). These can be static, global or stack based instances (without requiring a this pointer load).

struct __declspec( align( 64) ) foo // optional alignment
{
... your variables here
};
// instantiate outside scope of function

foo fooA; // global
static foo fooB; // within view of current compile time module
int main()
{
...
}
// or on stack
int yourFunc(...)
{
foo fooC; // within view of yourFunc
foo* fooPtrD = new foo; // heap allocated: reached through a pointer that must be loaded first
...
}

The first three instances should generate similar code: fooA and fooB are addressed relative to no register (i.e. relative to Virtual Address 0), the third (fooC) relative to the stack base pointer (ebp or rbp). The last one, fooPtrD, requires loading the pointer (fooPtrD) into a register (the this pointer) prior to dereferencing.

If you need multiple instances, your first attempt (incurring the 4% overhead) might be to use the pointer to the variable from within your processing function.

void yourFunc(foo* aFoo);

You may find that relocating the function yourFunc into the struct/class as a member function yields better optimization opportunities (although it could be worse too).

An alternate means to manipulate multiple instances of your objects is to use more code:

foo fooA;
foo fooB;
#define fooX fooA
#include "fooFunc.cpp"
#undef fooX
#define fooX fooB
#include "fooFunc.cpp"
#undef fooX
...
fooFuncfooA();
..
fooFuncfooB();
...
-------------

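// fooFunc.cpp
// ** not compiled separately
// ** only included (multiple times) into main.cpp
// ## only pastes tokens inside a macro, so the name is built via helper macros
#ifndef FOO_PASTE
#define FOO_PASTE2(a, b) a##b
#define FOO_PASTE(a, b) FOO_PASTE2(a, b)
#endif
void FOO_PASTE(fooFunc, fooX)(...)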
{
...
fooX.var = 123456.;
}

Using a profiler capable of drilling down to ASM code would be better than rolling the dice with your code.

Jim

www.quickthreadprogramming.com

You should be able to write the code using templates in place of the #include method above.
Templates will be easier to debug.
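One way the template variant could look, as a sketch only: the object is passed as a non-type template parameter of reference type, so each instantiation refers to one fixed global, much like each #include pass above. The struct contents are hypothetical and the parameter list of the original `fooFunc(...)` is left out.

```cpp
#include <cassert>

struct foo { double var; };   // stand-in for "your variables here"

foo fooA;   // globals with external linkage can be used as template arguments
foo fooB;

// One function body, instantiated once per object. Inside each instantiation
// the address of Instance is known at compile/link time; no pointer is passed.
template <foo& Instance>
void fooFunc() {
    Instance.var = 123456.;
}
```

The calls then become `fooFunc<fooA>();` and `fooFunc<fooB>();`. As with the #include method, every instantiation duplicates the code, and the set of instances has to be known at compile time.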

Jim

www.quickthreadprogramming.com

Unless I'm missing something obvious, this is exactly what I'm doing.

This is fast ("new" overloaded to allocate at 64 byte multiple):

class X
{
  static int a;
  static int b;
  void func();
};

__declspec(align(64)) X x1; // fast
__declspec(align(64)) static X x2; // fast
X* x3 = new X(); // fast

This is slow ("new" overloaded to allocate at 64 byte multiple):

class X
{
  int a;
  int b;
  void func();
};

__declspec(align(64)) X x1; // slow
__declspec(align(64)) static X x2; // slow
X* x3 = new X(); // slow

Using #defines or templates would work, but the function that gets slow becomes REALLY big when unfolded (think hundreds of kB's), and I don't know in advance how many instances will be used.

In the second (slow) implementation, this may indicate a compiler optimization performance problem. The compiler should know the virtual memory offset of x, and then the fixed relative offsets of the member variables a and b, and thus should have known the fixed virtual memory address of a and b, and thus should not have produced code any different than for the first case.

Is there something (not obvious to you) that you are not telling us?

In the 1st case (static int a... inside class), the static variables a and b might be aligned differently than in the 2nd case. IOW in the 1st case a and b might not be contiguous (each being aligned on an integral boundary specified as a compiler/linker option). In the second case, a and b are likely contiguous, but may not be due to #pragma align(n) or compiler option for specifying default alignment of member variables.

This is to say, what is not being shown in the code snips above, may be affecting the performance.
Or the compiler may be manipulating a this pointer when it need not (inefficient optimization).

Can you produce the assembler output file, and then copy the sections of code representing a statement or two that run slower? Use the assembler option that includes the source as comments.

Jim

www.quickthreadprogramming.com

Hi Jim,

There are no #pragma aligns or compiler options that change the alignment. I'm using __declspec(align(64)) for a number of arrays, and the alignment of those arrays is indeed different (which at first caused a 13% performance drop when I made them non-static, but that's solved now by adding some fillers).

If I make all the arrays static, there's currently little performance difference between dynamic or static allocation (dynamic is 0.5-1% slower).

The remaining 2.5-3% performance loss comes from basic types: bool, int, float. While I was converting things to non-statics I have moved them around a lot (I've moved variables that are used at the same time together), which did not seem to have any impact on the performance.

(Question: Would the compiler re-organize basic types, not arrays, if they are declared static?)

The problem with the compiler (assembly) output is that it's huge (multiple MB's for the function where I'm losing performance), and that the total code is only a few % slower. So it's not easy to find which instruction(s) are causing it, or even to compare the assembly output files of the two. (If it was easy, I would have compared the assembly output myself and tried to find a workaround, instead of asking here if there are known issues with static vs. dynamic class members.) So I cannot post anything useful here. I'll try to compare the 2 outputs using WinMerge later to see if there's anything obvious (now that I have made it possible to switch between static members and dynamic members, the difference in assembly output might be a lot smaller than before).

For now I'm assuming that it's a compiler optimization problem. If I find anything useful when comparing the assembly output I'll post it here.

The alignment of your struct/class member variables is subject not only to the alignment of the struct/class itself, but also to the struct/class packing requirements. These are affected by

#pragma pack(...)

and/or

/Zpn

static member variables within a struct/class are scoped within the struct/class but are global (one instance), occupy no space within the struct/class, and are aligned by compiler option/default and generally default to sizeof(intptr_t) alignment.

non-static member variables are located within each instance of the struct/class. This struct/class may have an alignment request (your programming). The offsets to the member variables within the struct/class are affected by packing rules or alignment requests. There generally is a requirement that if a member variable within a struct/class has an alignment requirement, the struct/class itself must have an alignment that is a multiple of the alignment of every aligned member variable declared within it.
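The effect of packing on member offsets can be seen with a small example. The struct and member names below are hypothetical; `#pragma pack(push, 1)` and `#pragma pack(pop)` are accepted by MSVC, the Intel compiler, GCC and Clang.

```cpp
#include <cassert>
#include <cstddef>

struct DefaultLayout {
    char flag;
    int  count;       // natural alignment inserts padding before this member
};

#pragma pack(push, 1)
struct BytePacked {
    char flag;
    int  count;       // no padding: placed immediately after flag
};
#pragma pack(pop)

// With default packing, count starts at the alignment of int;
// with 1-byte packing it starts at offset 1 and may straddle boundaries.
static_assert(offsetof(BytePacked, count) == 1, "packed member follows flag directly");
static_assert(sizeof(BytePacked) == 1 + sizeof(int), "no padding in packed struct");
```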

By examining the machine code or by using the memory debug window and entering the address of the struct.member and then looking at the hex address produced, or by printing the address, you can observe what are the actual placement of the member variables. This will let you know the alignment and/or what variables might share a cache line.
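Printing the placement could be done along these lines. This is a sketch under assumptions: the struct `State` and its members are hypothetical, 64-byte cache lines are assumed, and `alignas(64)` stands in for `__declspec(align(64))`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

struct alignas(64) State {            // hypothetical members
    bool  enabled;
    int   mode;
    float gain;
    alignas(64) float window[16];     // starts on its own cache line
};

// Cache line index of an offset relative to the start of the object,
// assuming 64-byte lines and a 64-byte-aligned object.
inline std::size_t lineOf(std::size_t offset) { return offset / 64; }

// Print offset and cache line of each member; members with the same
// line index share a cache line.
inline void printLayout() {
    std::printf("enabled: offset %zu, line %zu\n",
                offsetof(State, enabled), lineOf(offsetof(State, enabled)));
    std::printf("mode:    offset %zu, line %zu\n",
                offsetof(State, mode), lineOf(offsetof(State, mode)));
    std::printf("gain:    offset %zu, line %zu\n",
                offsetof(State, gain), lineOf(offsetof(State, gain)));
    std::printf("window:  offset %zu, line %zu\n",
                offsetof(State, window), lineOf(offsetof(State, window)));
}
```

For static (non-member or static member) variables there is no common base object, so there one would print the addresses themselves, for example with `std::printf("%p\n", (void*)&variable)`, and compare them.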

In some cases, forcing alignment is counter-productive; in these cases dense packing tends to be more productive. I wrote an article on the Intel blog pages demonstrating this effect. In the sample program (PARSEC fluidanimate) a 30% performance advantage was gained by dense packing of cache lines as opposed to cache-line-aligned allocations (which, by the way, exhibited a net gain over unaligned allocations). Each application is different, though, but careful analysis of data relationships can make for significant differences in performance (you reported a 13% difference). The "trick" here for you is to find the best placement scheme. Part of the Art of Programming.

Jim Dempsey

www.quickthreadprogramming.com
