We are developing and optimizing a codec on Intel architecture. We use VTune Amplifier to find the most time-consuming functions/modules and then optimize them in assembly.
I have some basic questions; please clarify:
- How can we find the stalls present in the assembly, and how can we remove them? Is re-ordering instructions the only solution?
- Is there any way to know which instructions are pipelined?
- We are unsure whether intrinsics or hand-written assembly gives better performance. Of course, if portability is required, intrinsics are the better choice, but we are looking for the best performance.
- Is IPP license-free?
- What are the basic strategies/steps for writing and optimizing an assembly function? If you have any documentation on this from the IPP implementation, please share it.
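
To make the stall question concrete, here is a minimal sketch in C of one kind of stall I have in mind: a loop-carried dependency chain. The kernel (a sum of absolute differences) and the names `sad_single` and `sad_split` are hypothetical illustrations, not taken from our codec. In the first version every addition waits on the previous one. The second version splits the work across four independent accumulators, which is a restructuring rather than a pure re-ordering. Whether this helps on a given CPU is an assumption that would have to be measured, since the compiler may already do it.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical kernel: sum of absolute differences over n bytes.
   One accumulator: each add depends on the result of the previous add. */
uint32_t sad_single(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (uint32_t)abs((int)a[i] - (int)b[i]);
    return acc;
}

/* Same result, but four independent accumulators, so consecutive adds
   do not depend on each other and can overlap in the pipeline. */
uint32_t sad_split(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;

    /* Main loop: four elements per iteration, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        acc0 += (uint32_t)abs((int)a[i]     - (int)b[i]);
        acc1 += (uint32_t)abs((int)a[i + 1] - (int)b[i + 1]);
        acc2 += (uint32_t)abs((int)a[i + 2] - (int)b[i + 2]);
        acc3 += (uint32_t)abs((int)a[i + 3] - (int)b[i + 3]);
    }

    /* Tail: remaining 0 to 3 elements. */
    for (; i < n; i++)
        acc0 += (uint32_t)abs((int)a[i] - (int)b[i]);

    return acc0 + acc1 + acc2 + acc3;
}
```

Both functions return the same value for the same input, so they can be swapped and timed against each other in the profiler to see whether the dependency chain was actually the bottleneck.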