Modification of gen assembly and querying the time stamp register

Hello

I am using OpenCL on the Gen9.5 architecture with an i7-7700K processor. I want something similar to a clock function inside my kernel, along the lines of:

__kernel void TimeMeasurementKernel(){
  //get current clock value
  t0 = clock();
  // some operations
  t1 = clock();
  time_elapsed_to_execute_operations = t1 - t0;
}

However, Intel does not seem to provide any way to read a clock value from inside an OpenCL kernel. There is a timestamp architecture register, tm0, and my goal is to query it from inside the kernel. One approach I thought of, if it is possible, is to first generate the .gen assembly file with the ioc64 offline compiler using the -asm option, modify the assembly by inserting an instruction that reads the timestamp register, and then generate a binary from the modified assembly. I could then load the modified binary through clCreateProgramWithBinary. I am not sure whether this is possible. If it is, I would very much appreciate a working example of how to do it, especially how to add the timestamp register read to the assembly generated by ioc64 and how to create the binary file from the modified assembly.

As stated, my goal is to read the timestamp register, but if the same thing can be achieved by any other method, that would work as well. All I want is to measure, at runtime, the execution time of a portion of my code. Please let me know if any further information is required. Thank you.

Hi Sankha,

This is functionality we are looking to enable, but it's not quite ready yet - stay tuned.

In the meantime, are you on Windows or Linux?  If you're on Linux you may be able to generate code that accesses the timestamp register directly, so long as you are willing to compile a "debug" or "internal" version of the compiler.

See:

https://github.com/intel/intel-graphics-compiler/blob/master/IGC/BiFModu...

https://github.com/intel/intel-graphics-compiler/blob/master/IGC/Compile...

One minor word of caution is that this might not quite do what you want it to do since the GPU Execution Units are multi-threaded, so be careful how you interpret the results.

Hope this helps!

Hello Ben,

Thank you so much for your reply. I am using Linux (Ubuntu 16.04), and working in debug mode is fine with me. Could you elaborate on how to generate code that accesses the timestamp register? Also, how do I compile the "debug" or "internal" version of the compiler, and how do I use it to generate that code? I would very much appreciate it if you could explain this in a bit more detail. Thank you again for all the information.

Sorry for the very slow reply - this got lost in the holiday shuffle.

Instructions to build the compiler may be found here:

https://github.com/intel/intel-graphics-compiler#building

In your kernel code, you'll need to declare the prototype for the function to get the timestamp register (don't forget the "overloadable" attribute!), and you should be able to call it like any other OpenCL built-in function.  I'd recommend dumping kernel ISA to ensure that it's all working correctly, following a method such as:

https://github.com/intel/opencl-intercept-layer/blob/master/docs/kernel_...

Do note that this is preview / prototype functionality and may change at any time, but it should work, and I'd love feedback if it does what you want it to do. Please let me know if you have any follow-up questions.

 

Hello Ben

Thank you so much for your reply, and I apologize for such a late response. I was working on a different problem for some time and could not get back to this one, but I need to work on it now. Following your instructions, I have built the compiler in internal mode using the -DCMAKE_BUILD_TYPE=Internal flag in cmake, and I now want to follow your instructions for using the timestamp register. However, this part of your instructions is still not very clear to me:

In your kernel code, you'll need to declare the prototype for the function to get the timestamp register (don't forget the "overloadable" attribute!), and you should be able to call it like any other OpenCL built-in function

In the comment before your last one, you provided a link showing that the intel_get_cycle_counter() function would give me the timestamp, but its usage inside my kernel code is not very clear to me. Would it be possible for you to provide some sample pseudo-code showing its usage inside a kernel? I will try it myself in the meantime and let you know if I succeed, but it would be really helpful if you could help me out with this. Thank you again for all the help.

Here is a very simple example:

ulong __attribute__((overloadable)) intel_get_cycle_counter( void );

kernel void test( global ulong* dst )
{
    dst[0] = intel_get_cycle_counter();
}

To be sure this is working, look at the generated assembler and verify that you see the timestamp register ("tm0") in your ISA.

Disclaimer: this is not officially supported functionality and it could be changed or removed at any time.  That being said, if you find this useful and it does what you want it to do, please let us know.  Thanks!
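
To connect this back to the original question, here is a rough sketch of how two reads of the counter might bracket a region of kernel code. Only intel_get_cycle_counter() comes from this thread; the kernel name time_region and the parameters cycles, sink and iters are made up for illustration. The #else branch is purely an assumption added so the logic can be checked with an ordinary host C compiler, and it fakes the counter. The compiler may also move code around the two reads, so it is worth confirming in the ISA dump that the tm0 reads really surround the loop.

```c
/* Sketch only: time a region with two reads of the cycle counter.
 * intel_get_cycle_counter() is the unsupported built-in discussed in this
 * thread; time_region, cycles, sink and iters are hypothetical names. */
#ifdef __OPENCL_VERSION__
/* Device build: declare the built-in (unsigned long is the same 64-bit
 * type as ulong in OpenCL C). */
unsigned long __attribute__((overloadable)) intel_get_cycle_counter(void);
#else
/* Host build (assumption: only for checking the logic with a plain C
 * compiler): drop the OpenCL qualifiers and fake a counter that
 * advances by 100 on every read. */
#define __kernel
#define __global
static unsigned long fake_counter = 0;
static unsigned long intel_get_cycle_counter(void) { return fake_counter += 100; }
#endif

/* Launch with a global size of 1: every work-item would write slot 0. */
__kernel void time_region(__global unsigned long *cycles,
                          __global unsigned int *sink,
                          unsigned int iters)
{
    unsigned int acc = sink[0];   /* runtime seed, so the loop cannot be folded away */

    unsigned long t0 = intel_get_cycle_counter();
    for (unsigned int i = 0; i < iters; ++i)
        acc = acc * 1664525u + 1013904223u;   /* dependent chain standing in for "some operations" */
    unsigned long t1 = intel_get_cycle_counter();

    cycles[0] = t1 - t0;   /* elapsed counter ticks for the region */
    sink[0] = acc;         /* store the result so the loop stays live */
}
```

The difference is in raw counter ticks, not seconds, and because the EUs are multi-threaded it can include time during which other hardware threads were running.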

Hello Ben 

Thank you for your response. That code was very helpful. I will try it and let you know whether I can see the timestamp register in the generated assembly. I have one follow-up question. There are some more built-in functions that I can see, such as those for getting the thread ID and slice ID, which I am also interested in. I presume I should be able to use them in a similar manner?
Also, you asked what I want to achieve, which I forgot to answer. Basically, I want to measure the time taken by pointer chasing: I would access some random addresses and measure the cached and uncached access times. There would be some similar operations as well. Thank you again for your help.
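
For the pointer-chasing experiment, one common way to prepare the buffer on the host is to fill it with a random permutation that forms a single cycle, so that following idx = next[idx] touches every slot before repeating. The sketch below uses Sattolo's algorithm for this. The function name build_chase and its parameters are hypothetical, and rand() is used only for brevity (its range, RAND_MAX, may be too small for very large buffers on some platforms).

```c
#include <stdlib.h>

/* Fill next[0..n-1] with a random permutation that is ONE cycle of length n
 * (Sattolo's algorithm), so a chase starting anywhere visits every slot. */
void build_chase(unsigned int *next, unsigned int n, unsigned int seed)
{
    srand(seed);

    /* Start from the identity mapping. */
    for (unsigned int i = 0; i < n; ++i)
        next[i] = i;

    /* Walk k = n-1 down to 1 and swap slot k with a random slot j < k.
     * Excluding j == k is what guarantees a single cycle. */
    for (unsigned int k = n; k-- > 1; ) {
        unsigned int j = (unsigned int)rand() % k;   /* 0 <= j < k */
        unsigned int tmp = next[k];
        next[k] = next[j];
        next[j] = tmp;
    }
}
```

The resulting array can be copied to a device buffer, and the timed region in the kernel then becomes a loop of dependent loads of the form idx = next[idx].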

Quote:

There are some more built-in functions that I can see, such as those for getting the thread ID and slice ID, which I am also interested in. I presume I should be able to use them in a similar manner?

Yes, the other built-in functions may be used similarly, and the same disclaimer applies.  :-)

Thank you again for your reply. I have another quick question that is somewhat unrelated to this topic; please let me know if I should open a separate topic for it. I have read that L3 is the data cache, that it is accessible by all the EUs, and that, unlike L1 and L2, it is used for all computation and is coherent with the LLC. Then is the L3 cache inclusive of the LLC? Does L3 follow the inclusive principle, as on the CPU side? It would be helpful if you could provide some details about this, and some references if possible. Thank you again for your help.

Quote:

Then is the L3 cache inclusive of the LLC? Does L3 follow the inclusive principle, as on the CPU side?

Yes, in general all caches are inclusive.

The one exception is that some earlier platforms with EDRAM treated the EDRAM as a "victim cache".  Newer platforms with EDRAM treat it as a memory side cache, however, so it's inclusive.

If you haven't read our "Compute Architecture" whitepapers yet I'd highly recommend them.  Here's a link to our "Gen9" whitepaper:

https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute...

Are you going to publish the results of your study?  I'd love to see them.  Thanks!

 

 

Hello Ben

Thank you so much for your reply. I am currently working on an ongoing research project that I am looking forward to publishing, and I am going to put all my results in there. I have some initial results, but I won't be able to share them here. Let me know if you are interested in a separate discussion. I also have some follow-up questions, for which I am going to create a separate post. Thank you again for your help.

Hello

I was trying to build IGC. However, in both internal and debug mode, running make -j `nproc` produces several errors, which I have pasted here. I was wondering if these could be resolved. Thank you.
