Mixing SSE and AVX inside an application

Hi,

I am currently in the process of adding AVX support to my application.

While the floating-point AVX port looks quite simple, the integer AVX port is not, since there are no 256-bit integer AVX instructions ( :( ). So I need to emulate those with two 128-bit AVX instructions. However, there seem to be no AVX-128 intrinsics (at least I couldn't find them). But since there is a big penalty for switching between SSE and AVX, I need the compiler to generate the 128-bit AVX integer instructions.

I know about the AVX compiler flag, but that is out of the question since the SSE code needs to stay intact so I can still run the software on platforms that don't support AVX. So the idea is to have two code paths and a branch somewhere to choose the code path fitting the CPU.

So what am I supposed to do to get the compiler to generate AVX-128 in one place and SSE instructions in another for the same source file? And why aren't there any AVX-128 intrinsics?

By the way, I am using VC2010 at the moment; using the Intel compiler would be a lot of work (I tried it and there were quite a few cases where it couldn't compile the code, so that pretty much rules itself out as well, although it might be a last resort).

Any hint would be great.
Michael


Hi Michael,
The compiler will generate AVX (128-bit) code based on the switches used. The same 128-bit intrinsics can be used for SSE and AVX: when the compiler sees the /arch:AVX switch it will generate AVX code for those intrinsics. However, you can still use 256-bit AVX instructions if you are willing to convert integers to float before processing and back before writing the results.
Please use VC2010 SP1 if you have access to MSDN.

Yes, I know about the compiler flag (and am already using the VC2010 SP1 beta). But that is exactly the problem: it makes it impossible to run the binary on older hardware, which in my opinion is an absolute showstopper for any existing application, and therefore totally out of the question. I don't understand why there are no intrinsics to explicitly access the AVX-128 instructions; in my opinion this will prevent many developers of commercial applications from using AVX at all.

Conversion to float isn't an option either, since it loses too much accuracy (23 bits won't work for us), and double precision destroys all the benefits of using SIMD at all. (Fun fact: I implemented two versions of an _mm_div_epi32 function, one which calls four normal integer divides on the components of an SSE vector, the other which converts an int vector to two double vectors, does two divides, and converts the result back. Calling four normal divides was about 20-30% faster, at least on a Conroe-based system.)
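For illustration, the four-plain-divides variant mentioned above could look roughly like this (a sketch only; `div_epi32_scalar` is a made-up name, the original function is not shown):

```cpp
#include <immintrin.h>
#include <stdint.h>

// Hypothetical sketch: emulate a packed 32-bit integer divide by
// spilling to memory and doing four scalar divides.
static inline __m128i div_epi32_scalar(__m128i a, __m128i b) {
    alignas(16) int32_t va[4], vb[4];
    _mm_store_si128((__m128i*)va, a);
    _mm_store_si128((__m128i*)vb, b);
    for (int i = 0; i < 4; ++i) va[i] /= vb[i];  // four plain integer divides
    return _mm_load_si128((const __m128i*)va);
}
```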

The only option I currently see is to have two different DLLs and decide at application start time which DLL to load. So you will at least need a build system that can compile the same code twice (or even more often) with different compiler settings, which is not really fun to set up.

Instead of 2 DLLs you can also use 2 static libraries generated from the same source (legacy SSE intrinsics), one with the AVX flag. To avoid identifier clashes, C++ namespaces come in handy, since you can use the name mangling to automatically add a prefix such as "SSE::" and "AVX::".
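A minimal sketch of this namespace trick (the macro only serves to keep the example in one file; in a real build the kernel body would sit in a shared header included from two .cpp files, one of them compiled with the AVX flag):

```cpp
// The same kernel source, stamped out twice under different namespaces.
// Name mangling then keeps the two builds apart at link time.
#define DEFINE_KERNELS(NS)                                    \
    namespace NS {                                            \
        inline float dot3(const float* a, const float* b) {   \
            return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];         \
        }                                                     \
    }

DEFINE_KERNELS(SSE)   // in the real setup: compiled without /arch:AVX
DEFINE_KERNELS(AVX)   // in the real setup: compiled with /arch:AVX
#undef DEFINE_KERNELS
```

At startup a CPU check then picks SSE::dot3 or AVX::dot3, e.g. through a function pointer.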

Ok, that might work as well. But I think I will stick with the two (or more) DLL variant. It requires quite some code restructuring, but it seems to be the best and probably the cleanest option. I just hope the speed gain from AVX will be worth the effort.

Recompiling 128-bit code just for the 3-operand AVX instructions provides nearly no speedup in real-world use cases (less than 1.05x actual speedup from hands-on experience).

Though if you have significant code portions amenable to 256-bit FP code, it will be more worth the effort. From my experiments you easily get 1.2x to 1.3x speedups vs. SSE for the cases with normal loads/stores; the best speedup I have measured so far is 1.82x for a case with fewer loads/stores than average, i.e. mostly computations with 3 register operands. So you can expect more gains overall in 64-bit mode, since you have more registers available for your constants.

??? So... what's your point? I am not talking about the 3-operand AVX instructions; actually, when I added the few SSE4.2 instructions I got about 1 percent performance increase at most, so I couldn't care less about those 3-operand instructions.
If I had a choice of what I would like to see in AVX, it would be quite a lot simpler:
all basic math (+, -, *, /), logic (and, or, andnot, xor), and comparison (<=, <, !=, ==, >, >=) operations implemented for int32, int64, float, and double alike, in a way that makes them equal to standard floating-point math.
I don't need instructions like blendvps when I can do a simple or(andnot, and) for the same result. But a missing normal integer multiply and divide just sucks. And since it took until SSE4.1 to actually add a normal integer multiplication, I have very little hope for AVX on this point.
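The or(andnot, and) replacement for blendvps mentioned here can be sketched like this (assuming the mask lanes are all-ones or all-zeros, as compare results are):

```cpp
#include <immintrin.h>

// Equivalent to _mm_blendv_ps(a, b, mask) when each mask lane is
// all-ones or all-zeros (e.g. the result of a compare).
static inline __m128 blendv_ps_emul(__m128 a, __m128 b, __m128 mask) {
    return _mm_or_ps(_mm_and_ps(mask, b),      // take b where mask is set
                     _mm_andnot_ps(mask, a));  // take a elsewhere
}
```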

About the significant code portion: actually I am adding AVX support to a commercial SSE-optimized raytracer (see www.pi-vr.de for more info). AVX support in our case means that instead of tracing just 4 rays at once, we can now trace 8 rays, so the speedup may be quite nice if everything works out, since I have roughly 300 source files full of SSE code, summing up to many thousands of lines.

Anyway, the separation of the whole code into a new DLL seems to mostly work already (I still need to fix some dependencies). So hopefully I will get some AVX results soon.

>??? So... what's your point?

Eh Michael, I'm just trying to help you get started. You were saying that

"
explicitly access the AVX 128 instructions, in my opinion this will prevent many developers of commercial applications to use AVX at all.
"

so I was thinking that you were planning to recompile your code just to get the non-destructive 3-operand instructions that AVX-128 is offering for all instructions (except the 4-operand VBLENDVPS). If you plan for AVX-256 after all, you can hope for more than 5% speedup, and in this case it will indeed be sensible to also compile legacy 128-bit code for AVX-128 so that you have no transition penalty.

>So hopefully I will get some AVX results soon.

I will be very interested to hear about the speedups you'll get, don't forget to post your findings here!
With our own realtime 3D renderer we are stuck at roughly 15% overall speedup with nearly all kernels in AVX-256 mode already; that's quite disappointing so far.

Ah, ok, sorry, I misunderstood your answer. I should have been more specific about what I am actually doing, I think.

The transition penalty is really what bothered me most, and that's why I wonder why there aren't any intrinsics to explicitly use AVX when you want AVX and SSE when you need SSE. But with two different DLLs I guess the compiler flag will take care of this now (I hope).

15% speedup indeed doesn't sound like much; I hoped it would be at least 50%. I will post results once I have some, should be next week I hope.

>But with two different DLLs I guess the compiler flag will take care of this now (I hope

At least with the Intel compiler it works well: when compiling legacy SSEn intrinsics with the "/QxAVX" flag it generates AVX-128 code that you can freely mix with AVX-256 code (using the 256-bit intrinsics) without any transition penalty.

>15% speedup indeed doesn't sound like much

Yes, but that's the overall speedup with turbo enabled and 8 threads; we get better speedups with a single thread and turbo off (i.e. our workloads are memory-bandwidth bound). Also, some individual kernels have 50% speedup or more (our best observed speedup is 82% for a loop), though a lot of loops are at 10% speedup or less. My understanding is that a key limiter to 128-bit to 256-bit scalability is the load bandwidth from the L1D cache: 32B/cycle can be used by SSE (2 loads per cycle), and AVX-256 can't do better than 32B/cycle, or 1 load per cycle sustained. The L1D$ write bandwidth (16B/cycle) is also a strong limiter for all the cases where you copy arrays or set buffers to a value; in these cases the speedup is roughly 0%.

@bronxzv:
Looks like you have a problem with scaling to multiple cores. That's usually an issue with memory bandwidth and data access patterns. You need to find a way to reduce the memory bandwidth requirement; improving data locality is the best way to accomplish that.

@michael:
If you are already writing an AVX code path that uses the 256-bit AVX FPU, not using the 3-operand syntax in the rest of that code path doesn't make any sense.

When transitioning from AVX to legacy SSE you need to use vzeroupper, if I remember correctly. Check the optimization reference manual for details on mixing AVX and SSE.
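A minimal sketch of the vzeroupper rule (the function name is hypothetical; the GCC/Clang target attribute stands in for compiling the whole file with the AVX flag, as one would do with MSVC or ICC):

```cpp
#include <immintrin.h>

// GCC/Clang: enable AVX codegen for just this function; with MSVC you
// would instead place it in a translation unit compiled with /arch:AVX.
__attribute__((target("avx")))
void add8(float* dst, const float* x, const float* y) {
    __m256 v = _mm256_add_ps(_mm256_loadu_ps(x), _mm256_loadu_ps(y));
    _mm256_storeu_ps(dst, v);
    _mm256_zeroupper();  // zero the upper 128 bits of the YMM registers
                         // before any legacy SSE code runs, avoiding the
                         // AVX<->SSE transition penalty
}
```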

--
Regards,
Igor Levicki

If you find my post helpful, please rate it and/or select it as the best answer where it applies. Thank you.

>@michael:
>If you are already writing an AVX code path that uses the 256-bit AVX FPU, not using the 3-operand syntax in the rest of that code path doesn't make any sense.

That's not what I meant or said. I just don't really care about these instructions, meaning I would be fine if they didn't exist, and I would even be very happy if instead something like _mm_div_epi32 and _mm_mullo_epi32 had been there from the very beginning. Having to emulate blendvps with a series of &, | and so on is easy and simple (and not even really slower...), but implementing a working _mm_mullo_epi32 was not so easy, and I am still missing an implementation of _mm_div_epi32 that is actually faster than just doing 4 simple divides.
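For reference, the SSE4.1 _mm_mullo_epi32 can be emulated on plain SSE2 with the well-known _mm_mul_epu32 recipe; a sketch:

```cpp
#include <immintrin.h>
#include <stdint.h>

// SSE2-only replacement for the SSE4.1 _mm_mullo_epi32 (pmulld):
// multiply the even and odd lanes with _mm_mul_epu32, then recombine
// the low 32 bits of each 64-bit product. The low half of the product
// is identical for signed and unsigned multiplication.
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b) {
    __m128i even = _mm_mul_epu32(a, b);                    // products of lanes 0 and 2
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));    // products of lanes 1 and 3
    return _mm_unpacklo_epi32(
        _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),  // low dwords of even products
        _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0))); // low dwords of odd products
}
```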

But my main issue was that there is no way to explicitly use AVX-128 instructions with intrinsics, so the only way to make an application support both AVX and SSE is to essentially compile the whole application twice and more or less let the customer decide which application to start. Just using the SSE intrinsics is a bad idea, as we are told, but since AVX lacks integer support at the moment, it is actually the only way to do it.

So just adding AVX for a part where it makes sense and leaving the rest untouched is impossible. I had to really restructure a lot of code just to be able to put all the relevant SSE/AVX code into a single DLL I can switch at startup. I am just glad it worked out at all in our case; I guess in many other cases this would just fail.

Let us leave the lack of those instructions aside for a moment.

You can write all your critical functions using AVX intrinsics in a single .cpp file which you compile with /QxAVX, and all your critical SSE functions in another .cpp file which you compile with /QxSSE2. Then you can use the Intel compiler's CPU dispatching feature, which will select the proper function variant to call at runtime.
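A hand-rolled alternative to the Intel dispatch feature, usable with any compiler, is a function pointer resolved once at startup. Everything below is a sketch: sum_sse/sum_avx are placeholder names with placeholder bodies, and in the real project they would live in translation units compiled with different /arch flags.

```cpp
// Runtime dispatch via a function pointer set once at startup.
typedef float (*sum_fn)(const float*, int);

static float sum_sse(const float* p, int n) {   // stand-in for the SSE build
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += p[i];
    return s;
}
static float sum_avx(const float* p, int n) {   // stand-in for the AVX build
    return sum_sse(p, n);
}

static sum_fn pick_sum() {
#if defined(__GNUC__)
    if (__builtin_cpu_supports("avx")) return sum_avx;  // GCC/Clang helper
#endif
    // With MSVC you would use __cpuid/_xgetbv here to test OSXSAVE+AVX.
    return sum_sse;
}

static const sum_fn sum = pick_sum();  // resolved once, before main()
```

The same pattern scales to a table of kernels, or to choosing which DLL to load at startup.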

--
Regards,
Igor Levicki


I am using the VC2010 compiler, which doesn't have the dispatching feature. Also, I am not sure how this feature behaves with inlined code (and inlining is essential for us, as it pays off in frames per second). But since the Intel compiler wouldn't compile our application the last time we checked, with errors I couldn't really decipher (meaning: code that compiles fine in MSVC and GCC and is consistent with the C++ standard), it is not really an option for us.

As I have written: I am talking about a realtime raytracing system inside a commercial application with about a million lines of code, where the raytracing part is 100% SSE code. This is not a toy application with a few expensive functions; we are actually trying to push the CPU to the limits here.

So you don't want to put roughly 50,000-100,000 lines of SSE code into a single .cpp file. In fact, you don't want to rewrite all that code for AVX at all, but use templates for it. But then you can't, since you can't call the right instructions, because the intrinsics are missing.

Anyway, we figured out a way to compile different DLLs for the different CPU platforms, and hopefully I will get the AVX port running within this week.

Quoting bronxzv:
>my understanding is that a key limiter to 128-bit to 256-bit scalability is the load bandwidth from the L1D cache: 32B/cycle can be used by SSE (2 loads per cycle), and AVX-256 can't do better than 32B/cycle, or 1 load per cycle sustained. The L1D$ write bandwidth (16B/cycle) is also a strong limiter for all the cases where you copy arrays or set buffers to a value; in these cases the speedup is roughly 0%.

In our experience, it was L2 which limited sustained load bandwidth (which is already greater on Sandy Bridge than could be attained on pre-AVX CPUs).
The lack of advantage for AVX on copy and memset operations was always a reasonably well documented "feature."

>So you don't want to put roughly 50,000-100,000 lines of SSE code into a single .cpp file. In fact, you don't want to rewrite all that code for AVX at all, but use templates for it. But then you can't, since you can't call the right instructions, because the intrinsics are missing.

I think our case is pretty much the same: we have roughly 70k lines of C++ code in around 120 .cpp files for the realtime 3D renderer and other performance-critical parts of our engine. Indeed it would be a very bad idea to put all of this code in the same file (!), and it makes no sense to duplicate all the code for each target path (it would be plain unmanageable). The best you can do IMO is to work at a higher level with wrapper classes around the intrinsics and inlined functions/operators, and simply have some specialized headers for each path (like SSE / SSE2 / AVX / AVX with FMA3 / whatever). A sensible design methodology is to have a single source code repository using high-level team conventions (ISA-agnostic); each change is directly available for all the target paths and you have very low validation costs and delays. Adding a new path or tuning the primitives of existing paths is then 100% orthogonal to the main projects.

The Intel C++ compiler usually follows the C++ standard more closely than MSVC, so in most cases the problem is with the code.

The Intel compiler also has options to compile some dubious constructs that MSVC accepts by default.

There are also some compiler bugs with templates and advanced C++ features, but those are detected pretty fast and resolved in updates.

Regarding the size of your project and the amount of SSE code in it: it has always been an unwritten rule that ~10% of the code is responsible for ~90% of the performance, and raytracing is not an exception.

Writing almost everything with SSE manually doesn't make sense when a better compiler can do that for you automatically.

--
Regards,
Igor Levicki


I ended up having overloaded base floating-point and integer classes that I choose based on a compile-time define. Our build system is capable of building the same source with different defines, so once I have fixed all the issues that come from going from 4 components to 8 components, further AVX extensions should be a matter of rewriting parts of the base classes. At least with SSE2/SSE4 switches this already works quite nicely.

About the compile problems: there were issues with some very simple functions like

inline __m128 _mm_madd_ps( const __m128& a, const __m128& b, const __m128& c)
{
return _mm_add_ps( _mm_mul_ps( a, b), c);
}

This looks pretty standard to me (and GCC thinks so as well). Maybe it was a compiler bug, but since at the moment there is no advantage for us in switching to a different compiler, it is just not an option.

About the amount of SSE code: be assured that the amount of SSE code is exactly the amount we need, and leaving the work to the compiler would make things a lot slower (after all, we are competing with GPU-based tracers here and can beat them on a dual-CPU workstation, so it looks like we are not doing things too wrong).

Current compilers are just not capable of SIMDifying ~1000 lines of C++ code for a single material shader with multiple (virtual) function calls in it. There is just no way for them to recognize that they can compute 4 or more values in parallel.

Automatic SIMDifying works fine for small loops, but for complex algorithms it just fails, always.

Intel customers (and even people who are evaluating and considering Intel C++ Compiler) usually report all the problems they find either through the forum or through the premier support. Problems get fixed that way.

Same goes for auto-vectorization -- the more auto-vectorization problems people report, the smarter the compiler becomes.

Seems that in your case you decided to take the "easy" way out and avoid participating in compiler enhancement, but at the cost of having to develop and maintain a progressively larger and larger hand-written codebase.

In other words, your short-term win may become long-term loss as the size of your project keeps growing.

--
Regards,
Igor Levicki


>Seems that in your case you decided to take an "easy" way out, and avoid participating in compiler enhancement, but at the cost of having to develop and maintain a progressively larger and larger hand-written codebase.

>In other words, your short-term win may become long-term loss as the size of your project keeps growing.

Needing to wait for a bug to get fixed instead of simply using the compiler that works for us is just a bad idea. Why should we bother using another compiler if what we use now works? For maybe 5% more performance at the cost of 5 times longer compile times? No, thank you. Why do you think you can judge what we are doing? We are developing bleeding-edge technology, running raytracing of multi-million-triangle scenes on clusters with up to 1000 cores at 8 MP resolution in realtime, faster than any competitor, and you want to tell me I am taking the easy way out because I prefer a solution that works now instead of having to wait for someone else to fix bugs in a compiler? Seriously, I have had enough trouble with GPU manufacturers and their driver bugs; I am happy if things just work like I tell them, so I can concentrate on algorithms.

I am saying you don't have a clue about what we are actually programming here, so you'd better not start to judge.

And who says handwritten SSE code is larger than handwritten standard C++ code? I think I can handle the required masks from time to time. And the code needs to be written anyway.

I am not trying to judge you, so you can hold your horses.

I have put the word "easy" in quotes because I don't think either path is really easy. What I am doing is stating some obvious facts.

Intel has an excellent track record of fixing bugs that are reported, sometimes even providing a specific fix over FTP to a customer, and there is always a workaround in the meantime. I sincerely doubt that you would get such treatment with GNU or MSVC if you hit a bug there.

Regarding SSE vs. C/C++ code size: one intrinsic usually maps to one instruction, while one line of C/C++ code can map to several instructions. I was saying that SSE code is generally larger in terms of number of lines of code, not in terms of instruction count.

Regarding the possible advantage, it may be more than 5% when you factor in the global optimizations and complex code transformations the Intel compiler is capable of, not to mention the readily available performance libraries, but it looks like you will never find out how much advantage it may bring you.

I worked on 3D image reconstruction (back projection, forward projection) for medical purposes, both on CPU and on GPU, but you are right, I really don't have a clue... Why did you come here to ask people who don't have a clue for help and advice, when you already know what is best for you? The MSDN/Technet forum might be a better place for your arrogant attitude.

--
Regards,
Igor Levicki


Igor, this is the AVX forum, not the Intel C++ forum. Why don't you simply let people who are doing actual AVX development these days talk freely together? You have no moderator credentials here (please correct me if I'm wrong) to say who can post, what to post, or who should go away because they use different methodologies than you.

>Regarding possible advantage, it may be more than 5% when you factor in global optimizations and complex code transformations Intel compiler is capable of doing, not to mention readily available performance libraries, but it looks like you will never find out how much advantage it may bring you.

We did test it a year and a half ago, and it was about 5 percent more performance while taking a lot longer to compile. MSVC is not as bad anymore as it was back in the VC6 and VC7 versions; it can do global optimizations and profile-guided optimizations as well. There is still a performance advantage for the Intel compiler, but it is not worth the additional effort we would need to put into switching compilers. And I didn't come here for any compiler discussions. Last time I checked, this was an AVX forum, not an Intel compiler forum.

>I worked on 3D image reconstruction (back projection, forward projection) for medical purposes both on CPU and on a GPU, but you are right, I really don't have a clue... why you came here to ask people who don't have a clue for help and advice, when you already know what is best for you. MSDN/Technet forum might be a better place for your arrogant attitude.

Yes, you don't have a clue about what we are doing here. I didn't say you don't have a clue about programming or your own work. But you definitely have no clue about why we chose to use SSE everywhere inside the raytracer (as I already said: because it is faster, and automatic optimization fails for our tasks, and it will probably always fail unless it implements a dynamic scheduler that can figure out which rays need to be traced and which have already terminated).

And about the arrogant attitude: I don't like being told the problem is my code (which compiles fine with MSVC AND GCC), or that I am doing stupid things because I have a 100% SSE-optimized raytracer instead of just a few SSE functions. So which one of us is more arrogant?

I came here hoping to find a solution for a problem (missing AVX-128 intrinsics). The solutions that I knew (and that were presented here) are all flawed for various reasons in my opinion, so I hoped to find an alternative. But since there didn't seem to be one, I adjusted by restructuring my code a little bit and putting everything into a DLL I can compile with different flags. Not the solution I had hoped for, but it seems to work; problem solved.

@bronxzv:

You are right, I do not have admin rights, nor would I like to have that burden on my shoulders.

However, if you re-read my posts in this thread, you will see that I was trying to be helpful as usual, and in return I was told twice that I am incompetent / don't have a clue, etc.

Furthermore, Michael was very brash and unpleasant in his replies from the beginning (even towards you), and I feel that I have the right as a senior member of the ISN community (if not as a Black Belt) to say something about it, especially when his replies have offended me.

@Michael:

I wrote "in most cases, the problem is with the code", not "the problem is in your code". For me, there is a considerable difference between the two, even though English is not my primary language.

--
Regards,
Igor Levicki


>Furthermore, Michael was very brash and unpleasant in his replies from the beginning (even towards you),

Sure, you're right, though I think we are now one step further, and I'll be very interested to hear about the speedups Michael will get with his raytracer. If he posts his findings on some MS forums I won't be aware of it, since I don't visit those very often.

I am sorry, to both of you. I didn't mean to be rude at all. I just couldn't see the relevance of the answers to my question. I was pretty clear about the options I know in my very first post, and just getting the same options I already know and have discarded for good reasons presented as answers is a bit disappointing and frustrating.

@Igor: Once again, I didn't say you are incompetent, in no way. You just don't know our source code and can't judge our decisions in any way. And I personally felt offended by something like

>Writing almost everything with SSE manually doesn't make sense when a better compiler can do that for you automatically.

So you were essentially saying that what we are doing doesn't make any sense, although you are not in any position to give a qualified judgement. And being called arrogant for saying "you have no clue" (because you don't know our code and therefore your solutions were impossible to apply to our problem) offended me quite a bit.

>I just couldn't see the relevance of the answers to my question

The title of your first post read "Mixing SSE and AVX inside an application", and my advice to compile the same files with different options, using C++ namespaces to avoid identifier collisions at link time, is arguably at least somewhat relevant. IMHO it's even the best way to deal with the issue, and it can be used with all C++ compilers. BTW, the fact that there are no VEX-128 intrinsics (fp or int) is an orthogonal issue, since the collision at link time comes from compiling the same file several times; it would not be avoided by a 128-bit variant of all intrinsics (short of duplicating all your source code). For example, in our case one way to deliver the renderer is through a freeware web player (NPAPI plugin or ActiveX):

http://www.inartis.com/Products/Kribi%203D%20Player/Default.aspx

and it's arguably better to have the whole application in a single .dll or .ocx file instead of several DLLs, which would be more difficult to install and update.

Yes, of course that had relevance, and that wasn't one of the answers I was referring to. For us, using a single DLL for the whole application wouldn't work (or it would be a 200MB DLL); the raytracing (SSE) part is only a small part of the whole application.

About the VEX-128 intrinsics and why I really miss them:

My idea was the following: every function has a template parameter specifying a version ID or whatever. So you implement the function only once and let the compiler create the functions as often as needed. The beauty of this approach would be that you would only need to write a function once for a templated base type and let the compiler instantiate it for 1-, 4-, 8-, 16-, ... component vectors. Using a few special functions you can then decide at runtime which function to call, based on the active rays in a packet, always choosing the optimal code path.

Now extending this approach to support AVX would just be a matter of implementing the base types for AVX as well and adding an entry point to call the image trace function with the required template parameter (which would just be an int). No need to reimplement any functions of the main raytracer. But this doesn't work, since VEX-128 and integer VEX-256 are missing. So using one DLL for AVX and one for SSE is ok, but it makes the build process a bit more complicated.

Anyway, I really didnt want to offend anyone, so sorry again if I did.

>The beauty about this approach would be that you would only need to write a function once for a templated base type

We do just that, though without using templates, simply by including specialized headers for each target path.
Here is a very simple example of source code:

const PFloat oSpotR(lightColor.x),oSpotG(lightColor.y),oSpotB(lightColor.z);
const ULONG count = lBunch.samplesCount;

#pragma unroll(4)
for (ULONG i=0; i<count; i++) {
const PFloat ok = PFloat(qFk+i) * PFloat(qAs+i);
MAcc(r+i,PFloat(lBunch.fColR+i),oSpotR*ok);
MAcc(g+i,PFloat(lBunch.fColG+i),oSpotG*ok);
MAcc(b+i,PFloat(lBunch.fColB+i),oSpotB*ok);
}

The headers for the SSE path define FP_VEC_WIDTH = 4; PFloat (i.e. packed float) is a wrapper class around __m128 where operator* is based on _mm_mul_ps, etc.

For AVX, FP_VEC_WIDTH = 8 and PFloat is a wrapper class around __m256 where operator* is based on _mm256_mul_ps, etc.
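A minimal sketch of such a PFloat wrapper, following the naming in the post (the real headers are not shown; only operator* and a store helper are illustrated, and USE_AVX_PATH is an assumed build define):

```cpp
#include <immintrin.h>

// One header per target path would define PFloat and FP_VEC_WIDTH;
// here both variants are shown behind a single assumed define.
#if defined(USE_AVX_PATH)
enum { FP_VEC_WIDTH = 8 };
struct PFloat {
    __m256 v;
    explicit PFloat(const float* p) : v(_mm256_loadu_ps(p)) {}
    PFloat(__m256 x) : v(x) {}
    void store(float* p) const { _mm256_storeu_ps(p, v); }
    friend PFloat operator*(PFloat a, PFloat b) {
        return PFloat(_mm256_mul_ps(a.v, b.v));
    }
};
#else
enum { FP_VEC_WIDTH = 4 };
struct PFloat {
    __m128 v;
    explicit PFloat(const float* p) : v(_mm_loadu_ps(p)) {}
    PFloat(__m128 x) : v(x) {}
    void store(float* p) const { _mm_storeu_ps(p, v); }
    friend PFloat operator*(PFloat a, PFloat b) {
        return PFloat(_mm_mul_ps(a.v, b.v));
    }
};
#endif
```

The same ISA-agnostic source then compiles to SSE or AVX depending only on which header (here: which define) is active.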

FYI, the ASM dump for the AVX path (without FMA3) is shown below, exactly as generated by the compiler; it's unrolled for the best performance based on actual timings on a 2600K-based PC.

; LOE eax esi edi ymm4 ymm5 ymm6
.B11.82: ; Preds .B11.82 .B11.81

;;; {
;;; const PFloat ok = PFloat(qFk+i) * PFloat(qAs+i);

mov ecx, DWORD PTR [16+ebx] ;449.45
vmovups ymm7, YMMWORD PTR [-888+ebp+edi] ;449.38
vmulps ymm3, ymm7, YMMWORD PTR [edi+ecx] ;449.45

;;; MAcc(r+i,PFloat(lBunch.fColR+i),oSpotR*ok);

vmulps ymm0, ymm6, ymm3 ;450.48

;;; MAcc(g+i,PFloat(lBunch.fColG+i),oSpotG*ok);

vmulps ymm7, ymm5, ymm3 ;451.48

;;; MAcc(b+i,PFloat(lBunch.fColB+i),oSpotB*ok);

vmulps ymm3, ymm4, ymm3 ;452.48
mov DWORD PTR [-904+ebp], eax ;
mov eax, DWORD PTR [160+esi] ;450.26
mov edx, DWORD PTR [28+ebx] ;450.7
vmulps ymm1, ymm0, YMMWORD PTR [eax+edi] ;450.7
vaddps ymm2, ymm1, YMMWORD PTR [edi+edx] ;450.7
vmovups YMMWORD PTR [edi+edx], ymm2 ;450.7
vmovups ymm2, YMMWORD PTR [-856+ebp+edi] ;449.38
mov eax, DWORD PTR [164+esi] ;451.26
vmulps ymm0, ymm7, YMMWORD PTR [eax+edi] ;451.7
mov eax, DWORD PTR [32+ebx] ;451.7
vaddps ymm1, ymm0, YMMWORD PTR [edi+eax] ;451.7
vmovups YMMWORD PTR [edi+eax], ymm1 ;451.7
mov eax, DWORD PTR [168+esi] ;452.26
vmulps ymm0, ymm3, YMMWORD PTR [eax+edi] ;452.7
mov eax, DWORD PTR [36+ebx] ;452.7
vaddps ymm1, ymm0, YMMWORD PTR [edi+eax] ;452.7
vmovups YMMWORD PTR [edi+eax], ymm1 ;452.7
vmulps ymm2, ymm2, YMMWORD PTR [32+edi+ecx] ;449.45
vmulps ymm3, ymm6, ymm2 ;450.48
vmulps ymm1, ymm5, ymm2 ;451.48
vmulps ymm2, ymm4, ymm2 ;452.48
mov ecx, DWORD PTR [160+esi] ;450.26
vmulps ymm7, ymm3, YMMWORD PTR [32+ecx+edi] ;450.7
vaddps ymm0, ymm7, YMMWORD PTR [32+edi+edx] ;450.7
vmovups YMMWORD PTR [32+edi+edx], ymm0 ;450.7
mov ecx, DWORD PTR [164+esi] ;451.26
vmulps ymm3, ymm1, YMMWORD PTR [32+ecx+edi] ;451.7
mov ecx, DWORD PTR [32+ebx] ;451.7
vaddps ymm7, ymm3, YMMWORD PTR [32+edi+ecx] ;451.7
vmovups YMMWORD PTR [32+edi+ecx], ymm7 ;451.7
mov ecx, DWORD PTR [168+esi] ;452.26
vmulps ymm0, ymm2, YMMWORD PTR [32+ecx+edi] ;452.7
vmovups ymm2, YMMWORD PTR [-824+ebp+edi] ;449.38
vaddps ymm1, ymm0, YMMWORD PTR [32+edi+eax] ;452.7
vmovups YMMWORD PTR [32+edi+eax], ymm1 ;452.7
mov ecx, DWORD PTR [16+ebx] ;449.45
vmulps ymm1, ymm2, YMMWORD PTR [64+edi+ecx] ;449.45
vmulps ymm3, ymm6, ymm1 ;450.48
vmulps ymm2, ymm5, ymm1 ;451.48
vmulps ymm1, ymm4, ymm1 ;452.48
mov ecx, DWORD PTR [160+esi] ;450.26
vmulps ymm7, ymm3, YMMWORD PTR [64+ecx+edi] ;450.7
vaddps ymm0, ymm7, YMMWORD PTR [64+edi+edx] ;450.7
vmovups YMMWORD PTR [64+edi+edx], ymm0 ;450.7
mov ecx, DWORD PTR [164+esi] ;451.26
vmulps ymm3, ymm2, YMMWORD PTR [64+ecx+edi] ;451.7
vmovups ymm2, YMMWORD PTR [-792+ebp+edi] ;449.38
mov ecx, DWORD PTR [32+ebx] ;451.7
vaddps ymm7, ymm3, YMMWORD PTR [64+edi+ecx] ;451.7
vmovups YMMWORD PTR [64+edi+ecx], ymm7 ;451.7
mov ecx, DWORD PTR [168+esi] ;452.26
vmulps ymm0, ymm1, YMMWORD PTR [64+ecx+edi] ;452.7
vaddps ymm1, ymm0, YMMWORD PTR [64+edi+eax] ;452.7
vmovups YMMWORD PTR [64+edi+eax], ymm1 ;452.7
mov ecx, DWORD PTR [16+ebx] ;449.45
vmulps ymm0, ymm2, YMMWORD PTR [96+edi+ecx] ;449.45
vmulps ymm3, ymm6, ymm0 ;450.48
vmulps ymm2, ymm5, ymm0 ;451.48
vmulps ymm0, ymm4, ymm0 ;452.48
mov ecx, DWORD PTR [160+esi] ;450.26
vmulps ymm7, ymm3, YMMWORD PTR [96+ecx+edi] ;450.7
vaddps ymm1, ymm7, YMMWORD PTR [96+edi+edx] ;450.7
vmovups YMMWORD PTR [96+edi+edx], ymm1 ;450.7
mov edx, DWORD PTR [164+esi] ;451.26
vmulps ymm3, ymm2, YMMWORD PTR [96+edx+edi] ;451.7
mov edx, DWORD PTR [32+ebx] ;451.7
vaddps ymm7, ymm3, YMMWORD PTR [96+edi+edx] ;451.7
vmovups YMMWORD PTR [96+edi+edx], ymm7 ;451.7
mov ecx, DWORD PTR [168+esi] ;452.26
vmulps ymm0, ymm0, YMMWORD PTR [96+ecx+edi] ;452.7
vaddps ymm1, ymm0, YMMWORD PTR [96+edi+eax] ;452.7
vmovups YMMWORD PTR [96+edi+eax], ymm1 ;452.7
add edi, 128 ;447.5
mov eax, DWORD PTR [-904+ebp] ;447.5
inc eax ;447.5
cmp eax, DWORD PTR [-916+ebp] ;447.5
jb .B11.82 ; Prob 27% ;447.5

I am not sure I understand exactly what you are doing. For example, how would you implement something like this:

An AVX ray packet with 8 rays in it hits 2 surfaces, one with material
A, the other with material B. Now let's assume one ray in the packet hits
material A and 7 hit material B. Calling material B's shade function for
the whole ray packet, with the one ray that hits something else masked
out, is fine. But calling the same shade function of material A with 7
rays masked out is just burning CPU cycles. So ideally you would like to
call a shade function that is specialized for handling a single ray.
With templates you could have something like

template<int N>
class FloatType
{...};

template<>
class FloatType<1>
{...};

template<>
class FloatType<4>
{...};

and a shade function like

template<int N>
void shade()
{
    typedef FloatType<N> floatType;
    ...
    // do work with floatType
}

The calling code would be something like (ray extraction code omitted):

switch( _mm_movemask_ps( activeRays))
{
case 1:
    shade<1>();
    break;
case 4:
    shade<4>();
    break;
}

I am not sure how you would do this with includes.
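To make the sketch above concrete, here is a compilable version of the idea (all names -- `FloatType`, `shade`, `dispatchShade` -- are illustrative stand-ins from this discussion, not a real API; the shade functions just sum their lanes as placeholder work):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>

// Hypothetical SIMD-width wrapper, specialized per packet width.
template <int N> struct FloatType;

template <> struct FloatType<1> {   // scalar specialization
    float v;
    float sum() const { return v; }
};

template <> struct FloatType<4> {   // SSE packet specialization
    __m128 v;
    float sum() const {
        float tmp[4];
        _mm_storeu_ps(tmp, v);
        return tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }
};

// Shade function specialized on packet width via the wrapper.
template <int N> float shade(const FloatType<N>& contribution) {
    return contribution.sum();      // stand-in for real shading work
}

// Dispatch on the active-ray mask: lane 0 alone takes the scalar path,
// all four lanes active take the packed path (other masks omitted here).
float dispatchShade(__m128 activeRays,
                    const FloatType<4>& packet,
                    const FloatType<1>& single) {
    switch (_mm_movemask_ps(activeRays)) {
    case 0x1: return shade<1>(single);
    case 0xF: return shade<4>(packet);
    default:  return 0.0f;          // not handled in this sketch
    }
}
```

Note that `_mm_movemask_ps` returns a bitmask of lane sign bits, so the single-active-ray cases are masks 1, 2, 4 and 8, and the all-active case is 0xF.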

The only way to get good speedups with SSE/AVX-256 is to use all 4/8 computation slots whenever possible, so calling a scalar path like you would for material A is out of the question (btw, scalar SSE vs. packed SSE will have the same throughput anyway). What you have to do is aggregate the rays for each material in a first pass (using an operation like VCOMPRESS in LRBni), then process each material in turn with as many samples/rays packed together as possible. This way you also maximize the temporal coherence of your memory accesses (for example for texture fetches).
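The aggregation pass does not need any special instructions to be expressed; a plain first pass can bin ray indices by material so each material is then shaded with full packets. A minimal sketch (the per-ray `materialId` array is a hypothetical result of the hit test, not part of any real API discussed here):

```cpp
#include <vector>
#include <cstddef>

// First pass: group ray indices by the material they hit, so that each
// material's shade function can later be called with full packets.
std::vector<std::vector<int>>
binRaysByMaterial(const std::vector<int>& materialId, int numMaterials) {
    std::vector<std::vector<int>> bins(numMaterials);
    for (std::size_t i = 0; i < materialId.size(); ++i)
        bins[materialId[i]].push_back(static_cast<int>(i));
    return bins;
}
```

Each bin is then consumed in chunks of 4 (SSE) or 8 (AVX) rays, so only the final partial chunk of each material needs masking, and texture fetches for the same material end up grouped together.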

Also note that your example with a "switch" statement will incur heavy branch prediction misses; this will be one more limiter on the scalability from SSE to AVX-256.

Of course you are right, but it is not always possible to keep the slots full, especially at high ray depths. But then, this is a limitation of the current design of our raytracing loop; I will probably clean that up so these things are easier to do.

The switch case is indeed not optimal (and actually this is more of a theoretical discussion, since I haven't really implemented it this way), but since in our case we are calling a virtual function that executes many thousands of instructions, it should not be the limiting factor. Memory access is still what makes the largest difference.

In some of my kernels branch prediction misses are a strong limiter on SSE to AVX scalability. Branches are unavoidable in optimized code, even in some vectorized loops (in some of my cases at least): the timings are worse with 100% branch elimination than with 90+% branch elimination and still a few hard-to-predict branches.

It went a bit like this in today's experiment (best performance = 100):

variant 1: branch elimination + a few branches
SSE: 87
AVX-256: 100
SSE to AVX-256 speedup = 15%

variant 2: 100% branch elimination
SSE: 72
AVX-256: 95
SSE to AVX-256 speedup = 32%

So the variant with the best scalability would put AVX in a good light (even if 32% for perfectly vectorized code is disappointing), but I had to choose the 1st variant, with its 15% speedup.

@michael:

When I said that, I incorrectly assumed that you might be over-optimizing by writing everything manually with SSE intrinsics -- I apologize if that offended you, but I have seen people do that in the past, so I wanted to rule it out as an option. Unfortunately, it came across clumsily.

Keep in mind that I was not aware of the amount of serial vs. parallel code in your application, nor that you are writing shading functions on a CPU, simply because you did not tell us exactly what you were doing.

Without more details about your project, the only option we had was to guess what might be an appropriate solution for you.

To summarize -- before I offended you with my assumption I suggested the following:

1. Using three operand syntax when you are already writing AVX code
2. Checking optimization reference manual on mixing SSE and AVX code
3. Writing two versions of critical functions, and using CPU dispatching feature of Intel Compiler
4. Re-evaluating Intel Compiler as an option

Out of those 4 suggestions, only #3 cannot be applied, and only because in your code all functions seem to be critical.

In my opinion, all those suggestions were quite reasonable given the circumstances, and definitely not the reason to get so worked up.

I will now go and stand in my corner for a while.

--
Regards,
Igor Levicki

If you find my post helpful, please rate it and/or select it as a best answer where applicable. Thank you.

You are right, I should have been more specific.

I am using intrinsics only, so I hope the compiler will figure out when it can use the three-operand syntax if I set the AVX compiler flag.
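For what it's worth, the 128-bit intrinsics themselves are encoding-neutral, which is why no separate "AVX-128 intrinsics" exist: the encoding is chosen per translation unit by the compiler flags. A trivial illustration (the function is an ordinary SSE example; the comment describes the expected code generation, which you should verify in your compiler's assembly listing):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// The encoding of this one intrinsic depends only on the compiler flags:
// by default it becomes the legacy-SSE 'addps'; under /arch:AVX (MSVC)
// or -mavx (GCC) the compiler instead emits the VEX-encoded,
// three-operand 'vaddps' on xmm registers -- i.e. AVX-128 -- with no
// source change at all.
__m128 addFour(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);
}
```

This is also why the two code paths generally have to live in separately compiled files: one built with the AVX flag, one without.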

The problem with #3 for us is that the whole raytracer is designed to work on floating-point/integer data with a specified packet width, so every function always assumes it can work on multiple values in parallel. Terminated rays are masked out, and updates are done accordingly when necessary. Therefore, dispatching would mean dispatching the whole raytracer, which currently has about 300 source files.
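Given that, a single runtime branch at the top level (choosing once between an SSE-compiled and an AVX-compiled build of the raytracer core, e.g. two DLLs or two sets of translation units) may be more practical than per-function dispatching. A sketch of the CPU check such a branch would need (assuming VC2010 SP1 for `_xgetbv`, or GCC's builtin on x86; the XGETBV step matters because AVX also requires OS support for saving YMM state):

```cpp
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Runtime AVX check. On MSVC this reads CPUID leaf 1 (ECX bit 28 = AVX,
// bit 27 = OSXSAVE) and then XGETBV to confirm that the OS actually
// saves XMM and YMM state; on GCC the builtin does the equivalent work.
bool cpuSupportsAvx() {
#if defined(_MSC_VER)
    int info[4];
    __cpuid(info, 1);
    if ((info[2] & (1 << 27)) == 0) return false;   // OSXSAVE not set
    if ((info[2] & (1 << 28)) == 0) return false;   // AVX not supported
    unsigned long long xcr0 = _xgetbv(0);
    return (xcr0 & 0x6) == 0x6;                     // XMM+YMM state enabled
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;
#endif
}
```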

#4 is really something I would like to try, but the last time we tried I gave up after 5 days because we couldn't get everything to compile and were not even able to figure out what exactly the problem was. Maybe we will give it another try at some point in the future, but for the moment we will use VC2010 and GCC only.

I am glad that we now understand each other.

If you decide to re-evaluate Intel C++ Compiler and you still have issues, remember that in addition to Premier Support, where you can submit your issues, there is also a compiler forum here with a lot of experts always willing to help.

--
Regards,
Igor Levicki

