_mm_rsqrt_ps and Intel Architecture Code Analyzer


I have written code to compute square roots using _mm_rsqrt_ps. When I run iaca -arch nehalem on this code, it shows _mm_rsqrt_ps executing on port 1, while most places I have looked (including the Intel 64 and IA-32 Architectures Optimization Reference Manual) say sqrt is on port 0. Is this right? Is there a document that explicitly lists which port each SIMD instruction is assigned to? Thanks


I think you are computing the reciprocal of the square root, not the square root. sqrt and rsqrt are two different instructions. The iaca tool is right about the port binding: I believe rsqrt goes to port 1 and, as you read in the documentation, sqrt goes to port 0.
I don't think there is a complete list of instruction-to-port mappings anywhere.

Thanks. Yes, I am doing rsqrt. In the Intel optimization reference manual (Table 2-6, page 2-26), under port 1 I only see FP_ADD; every other FP operation is listed under port 0 (including DIV and SQRT). Is it because rsqrt is implemented internally using FP additions that we should assume rsqrt is on port 1? -ravi

No, I wouldn't assume that.

So how did you know that it belongs to port 1? I could not locate it in the document. Thanks

"iaca" told you, didn't it? Or do you not believe the tool?

Yes, iaca told me. But it conflicted with what was in the document, and that is what I am trying to clarify and confirm. The real performance I get does not match what iaca predicts. Maybe it is a compiler issue (the compiler may not be optimizing the code very well). If iaca is accurate, then I have no problem following it faithfully.

Hi Ravi,
As I said in my previous post, the document is incomplete: it does not list all the instructions, and it was not written to list all this information, since it can change from processor to processor. The tool gives the complete listing.
Regarding your performance gap, you may need to look at it in more detail. If you are comfortable sharing code, you can post it here and someone can tell you why there is a performance gap.
Try a different compiler, maybe the Intel compiler if you are not already using it. That will show whether the gap is due to the compiler.

Thanks. I am clear now regarding iaca. I am using the Intel compiler (10.2). I will try the new one.

Here is the code doing the square root. It is just a simple function that computes square roots. Thanks for the help.


Attachment: main.cpp (text/x-c++src)

Hi Ravi,
I just checked, and your code is getting very good performance. I liked the way you unrolled it. What was your expectation? What are you getting?

Thanks. My estimate based on instruction count (using IACA) was a throughput of 22 cycles for 16 sqrts (the loop). The throughput I actually get is 42 cycles, almost 2x slower. That is what I am trying to understand: how can I improve it further? How much do you get?

I do understand IACA, but how did you measure your performance of 42 cycles? Can you please elaborate on that?
You are port 0 and port 1 limited.

I measured time using #include "tbb/tick_count.h". This gives me the elapsed time in seconds. Then I divided it by the number of square roots. I converted the time for 16 square roots into cycles (since I know the clock frequency of my processor, 2.67 GHz). IACA is giving me 22 cycles for every 16 sqrts. I compared those cycles to the cycles obtained using the above method.

The first thing to check is that IACA is analyzing for the same architecture as your machine. You can see this when you run it, since it prints the architecture. If IACA is analyzing for a newer architecture than your machine, the predicted and measured performance will differ.
Assuming those match: IACA is telling you the optimal throughput, but you will want to collect more from IACA with -analysis PERFORMANCE or -analysis DATA_DEPENDENCY.
Look for the instructions marked "CP"; these are the instructions on the critical path.
These analyses print the latency for each port at the beginning, which gives you a rough idea of how many cycles one loop iteration takes.
Secondly, looking at your code, your performance is limited by port 1 and also port 0. If you can somehow break that dependency or choose different instructions, you may get better performance. You may also need to look at the generated assembly: since you are using a lot of registers, there is a chance of register spills to the stack, which will add more delay. You can avoid that by reusing registers that are already defined. Compilers usually take care of this, but sometimes the compiler is not clear about the scope and keeps a register alive a little longer than needed.

Thanks. I am using Visual Studio 10 with the Intel compiler. How can I look at the generated assembly code in this setup?

You need to compile the individual file with the /FA or /FAcs setting in the output section of the project properties. It will put assembly files or a .cod file in your release/debug folder. Open those files and you can see whether the instructions start with "V" or not.
The other option is SDE, if you have it installed. It has a tool called xed which dumps disassembly. I believe it also shows whether an instruction is AVX or SSE.

I downloaded the Intel 11.1 compiler to run AVX as you suggested. I am not able to figure out how to compile in Visual Studio 2008 with AVX instructions. Where should I set the flag to use the AVX instruction set?
If I set /arch:AVX, it says the AVX architecture is not found.
Should I set the /QxAVX compiler option? How should I set it?

In the Intel-specific optimization properties you should have /QxAVX available. /arch doesn't include Intel-specific optimizations; only /arch:SSE2 and SSE3 are available. /arch:AVX is planned but may not be fully implemented yet.

Thanks for covering this, guys. I just got home and sat down trying to figure this out, and found this thread. Great stuff. Thanks again.
