Modification of gen assembly and querying the time stamp register

Modification of gen assembly and querying the time stamp register

Hello

I am using OpenCL on Gen9.5 architecture and I am using i7-7700k processor. I want to have something similar to clock function inside my kernel. What I want to do is something similar to

__kernel void TimeMeasurementKernel(){
  //get current clock value
  t0 = clock();
	//someoperations
  t1 = clock();
  time_elapsed_to_execute_operations = t1 - t0;
}

But it seems that intel doesn't provide any means to get the clock value from inside the OpenCL kernel. But there is a timestamp architecture register tm0 as mentioned in here. My goal is to query this timestamp register from inside the kernel. So one of the way that I thought of, if possible, then to generate the .gen assembly file first by using ioc64 offline compiler using the -asm option, modify the assembly file by introducing the timestamp register opcode and then generate the binary from the modified assembly. I would be able to load the modified binary through the  clCreateProgramWithBinary. I am not sure if this is possible. But if it is doable then I would very much appreciate if someone could provide some sort of working example as to how I can do this, specially introducing the timestamp register by modifying the generated assembly from ioc64 and then creating the binary file from the modified assembly. 

Also as I have stated my goal is to read the timestamp register, if this similar thing can be achieved by any other method then that would do as well. All I want to gather is the execution time of a portion of my code during runtime. Please let me know if any further information is required. Thank you.

25 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Hi Sankha,

This is functionality we are looking to enable, but it's not quite ready yet - stay tuned.

In the meantime, are you on Windows or Linux?  If you're on Linux you may be able to generate code that accesses the timestamp register directly, so long as you are willing to compile a "debug" or "internal" version of the compiler.

See:

https://github.com/intel/intel-graphics-compiler/blob/master/IGC/BiFModu...

https://github.com/intel/intel-graphics-compiler/blob/master/IGC/Compile...

One minor word of caution is that this might not quite do what you want it to do since the GPU Execution Units are multi-threaded, so be careful how you interpret the results.

Hope this helps!

Hello Ben,

Thank you so much for your reply. I am using linux (ubuntu 16.04). Working on the debug mode is fine with me. I was just wondering if you could elaborate your point more about generating the code to access the timestamp register. Also how can I compile the :"debug" / "internal" version of the compiler and how can I use it to generate code to access the timestamp register. I would very much appreciate if you could explain the point a bit more. Thank you again for all the information.

Quote:

Dutta, Sankha wrote:

Hello Ben,

Thank you so much for your reply. I am using linux (ubuntu 16.04). Working on the debug mode is fine with me. I was just wondering if you could elaborate your point more about generating the code to access the timestamp register. Also how can I compile the :"debug" / "internal" version of the compiler and how can I use it to generate code to access the timestamp register. I would very much appreciate if you could explain the point a bit more. Thank you again for all the information.

Sorry for the very slow reply - this got lost in the holiday shuffle.

Instructions to build the compiler may be found here:

https://github.com/intel/intel-graphics-compiler#building

In your kernel code, you'll need to declare the prototype for the function to get the timestamp register (don't forget the "overloadable" attribute!), and you should be able to call it like any other OpenCL built-in function.  I'd recommend dumping kernel ISA to ensure that it's all working correctly, following a method such as:

https://github.com/intel/opencl-intercept-layer/blob/master/docs/kernel_...

Do note that this is preview / prototype functionality and may change at any time, but it should work, and I'd love feedback if it does what you want it to do. Please let me know if you have any follow-up questions.

 

Hello Ben

Thank you so much for your reply. I apologize for such a later response. I was working on a different problem for sometime and couldn't get back to this problem. However, I need to work on this issue now. As per your instruction I have build the compiler in the internal mode using -DCMAKE_BUILD_TYPE=Internal flag in cmake and I would now follow your instruction to use the timestamp register. However, your instruction to use the timestamp register is still not very clear to me

In your kernel code, you'll need to declare the prototype for the function to get the timestamp register (don't forget the "overloadable" attribute!), and you should be able to call it like any other OpenCL built-in function

In the comment previous to the last one, you have provided me this link where the intel_get_cycle_counter() function would give me the time stamp. But it's usage inside my kernel code is not very clear to me. Will it be possible for you to provide me with some sample pseudo-code regarding it's usage inside my kernel. I would try myself in between and I will let you know if I succeed but it would be really helpful if you could help me out with this. Thank you again for all the help.

Here is a very simple example:

ulong __attribute__((overloadable)) intel_get_cycle_counter( void );

kernel void test( global ulong* dst )
{
    dst[0] = intel_get_cycle_counter();
}

To be sure this is working, look at the generated assembler and verify that you see the timestamp register ("tm0") in your ISA.

Disclaimer: this is not officially supported functionality and it could be changed or removed at any time.  That being said, if you find this useful and it does what you want it to do, please let us know.  Thanks!

Hello Ben 

Thank you for your response. That code was very much helpful. I would do it and would let you know if I am able to see the timestamp register in the generated assembly. There is one more following question I have got. There are some more built-in functions that I can see such as getting thread ID, slice ID which I am also interested into. So I should be able to use that in the similar manner I presume? 
Also you asked me about what I want to achieve which I forgot to answer. Basically I want to measure time for pointer chasing. I would access some random addresses and measure cached and uncached time. There would some similar operations as well. Thank you again for your help. 

Quote:

There are some more built-in functions that I can see such as getting thread ID, slice ID which I am also interested into. So I should be able to use that in the similar manner I presume? 

Yes, the other built-in functions may be used similarly, and the same disclaimer applies.  :-)

Thank you again for your reply. I have another quick question but kind of unrelated to the topic. Please let me know if I should open a separate topic for this. I have read that L3 is the data cache and it is accessible by all the EUs and unlike L1 and L2 it is used for  all the computation and coherent with the LLC. Then is the L3 cache is inclusive of the LLC? Does L3 follows the inclusive principle similar as the corresponding CPU side. It would be helpful if you could provide me some details about this and also some references if possible. Thank you again for your help.

Quote:

Then is the L3 cache is inclusive of the LLC? Does L3 follows the inclusive principle similar as the corresponding CPU side.

Yes, in general all caches are inclusive.

The one exception is that some earlier platforms with EDRAM treated the EDRAM as a "victim cache".  Newer platforms with EDRAM treat it as a memory side cache, however, so it's inclusive.

If you haven't read our "Compute Architecture" whitepapers yet I'd highly recommend them.  Here's a link to our "Gen9" whitepaper:

https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute...

Are you going to publish the results of your study?  I'd love to see them.  Thanks!

 

 

Hello Ben

Thank you so much for your reply. I am currently working on an ongoing research project which I am looking forward to publish. I am going to put all my results in there. I have got some initial results and I won't be able to share in here. Let me know if you are interested for a  separate discussion. I also have some following questions for which I am going to create a separate post. Thank you again for your help. 

Hello

So I was trying to build the IGC. However, in both internal and debug mode, when I am executing make -j `nproc` then there are several errors that I have pasted here. I was wondering if this could be resolved. Thank you.

Hi SankhaD,

Thanks for the feedback and the continued interest.

That link is no longer active (404) so I'll disable it. Hopefully there have been updates to the igc source tree since you encountered issues that resolve things...

If you have possible issue reports, they can go directly to IGC portal or Intel Graphics Compute Runtime portal... or event as attachments to the thread. If they're bigger files, we may be able to set up some sort of private transfer.

IGC github

runtime github

Thank you,

-MichaelC

 

Hi SankhaD,

From other thread 809059: 

Quote:

I am trying to use the intel graphics compiler  to compile my opencl programs. However, I am bit confused with the usage of it. First I would like to provide some background. I built the compiler in the debug mode and not in the release mode by using  -DCMAKE_BUILD_TYPE=Debug flag while doing cmake. The reason I wanted to built it in debug mode is mentioned in this forum post. My requirement is same as mentioned in the forum and a bit more and for that reason I had to built it in the debug mode. The newest IGC version had built errors and I revert back to previous version igc_release_2018-11-15. I also have a post in IGC git issues mentioning the errors and the problem I having with IGC. I am confused as to how can I use the IGC. My goal is to use the overloadable functions (usable in the debug mode of IGC) like intel_get_cycle_counter (and similar functions) in my opencl kernel exactly in the way as mentioned in the forum post by Ben Ashbaugh and then use intel IGC to compile and also to check the kernel ISA that the overloadable functions has been included. 

In my post in the git issues of IGC, Alex Paige pointed out to use ocloc tool that would call the IGC and compile it. But I have questions on his suggestion which is same as I have here that how can I make sure that ocloc is using the IGC that I built in the debug mode. Also how can I observe the kernel ISA that ocloc built. So I have same questions as asked in comment of issue #87. I know that this is post is similar to the comments in issue, but I thought to seek help from multiple sources while I am waiting for the response in git. I really need to use those overloaded functions inside my OpenCL kernel and I would very much appreciate if I could get this issue solved.

Compiler and OCL stakeholders are aware of your blockage. This is a case of support, documentation, and feature exposure not encompassing the scope of your deep dive. 

That being said... Some recommendations for posting on thread 87:

  • Can you post specifically what you've tried with ocloc to drive IGC? There may be forum spectators who are familiar with some of the other components that build gen isa binaries who are unfamiliar with ocloc who can help.
  • Can you post what you've tried at the commandline to launch with iga? Can you post the errors? How did you build or obtain iga? Was the gen isa targeting still for Intel® Core™ i7-7700K?
  • Did you post the example of the input source code sent through ocloc to trigger your feature?

Reporting even trivial in appearance steps can simplify reproduction or expectations.

Also:

  • it's going to be more likely to get more feedback with a more recent release tagged build. Especially as BenA indicated and AlexanderP indicated related features are being evaluated.
  • Can you keep future posts to the github thread you started or to the original thread here... ? It will be easier for spectators to follow.
  • You may want to follow thread #38 as well: There was a suggestion that ocloc needed to match the rest of the igc build in use in thread #38. Thread 87 may indicate some kind of mismatch for your build. I'm not sure if or how that would still affect your build. But it may be worth looking into.

Thanks for your interest,

-MichaelC

Hello Micheal

Thank you so much for a detailed reply. I apologize for the post in different forums. I will give answer to your queries, however, it would not be in the order as you have in your reply. 

  • Can you keep future posts to the github thread you started or to the original thread here... ? It will be easier for spectators to follow:

I will  keep my original post in here and I will put a link in the github for this post. 

  • Can you post specifically what you've tried with ocloc to drive IGC? There may be forum spectators who are familiar with some of the other components that build gen isa binaries who are unfamiliar with ocloc who can help.

Lets say I have a cl file sample.cl file. I used ocloc as follows: 

ocloc -file sample.cl -device kbl -output sampleOut -64 -llvm_text

It generates 4 files sampleOut_Gen9core.bin, sampleOut_Gen9core.gen, sampleOut_Gen9core.ll,  sampleOut_Gen9core.spv.

But ocloc only compiles successfully if I do not include the overloadable function code to read the timestamp register as mentioned in here. If I include the overloadable function code then the error is 

unknown mangling!
UNREACHABLE executed at /home/duttasankha/Desktop/SANKHA_ALL/INTEL_GRAPHICS_COMPILER/llvm_source/projects/llvm-spirv/lib/SPIRV/OCLUtil.cpp:178!
Aborted (core dumped)

But there are no errors otherwise. Also when I use the intel offline compiler ioc64 to generate .gen file from which I can read the ISA. But the sampleOut_Gen9core.gen file generated by ocloc is unreadable and so I am unable to check the ISA.

  • Can you post what you've tried at the commandline to launch with iga? Can you post the errors? How did you build or obtain iga? Was the gen isa targeting still for Intel® Core™ i7-7700K?

Though the previous version of IGC built successfully, but no executable of iga was created and so I built iga separately and iga64 binary was created under igc/visa/iga/IGAExe folder. I used the comandline iga in the following manner

./iga64 /path/to/opencl/binary/testOut_Gen9core.bin -p=9 -d

The error output is given in here. Though there is error in the output, still this is not the opencl kernel ISA that I want to observe.
I am still using the same architecture Intel® Core™ i7-7700K.

  • Did you post the example of the input source code sent through ocloc to trigger your feature?

The sample code that I am using in ocloc is as follows:

ulong __attribute__((overloadable)) intel_get_cycle_counter( void );

__kernel void testKern(__global int *buff,int numEls){
	
	int dummy1 = 0;

	ulong dst = intel_get_cycle_counter();
	
	for(int i = 0 ;i<numEls;i++)
		dummy1+=buff[i];	

	buff[0] = dummy1;
	
}
  • it's going to be more likely to get more feedback with a more recent release tagged build. Especially as BenA indicated and AlexanderP indicated related features are being evaluated.

This point is not very clear to me. I tried to build the IGC with the most recent release tag, but there was error in the built as mentioned in the original post in igc git issue post and so I used an earlier tagged release. If this is not what you meant then please let me know and I will try to answer more concretely. 

  • You may want to follow thread #38 as well: There was a suggestion that ocloc needed to match the rest of the igc build in use in thread #38. Thread 87 may indicate some kind of mismatch for your build. I'm not sure if or how that would still affect your build. But it may be worth looking into.

I have checked actually thread #38 but that didn't helped much regarding the usage of ocloc. As mentioned in here  by alex paige, I installed ocloc and all the other dependencies from this release version. So, according to my understanding, ocloc is using the igc that I downloaded from compute runtime instructions. But I am confused how can I use ocloc that would invoke the igc that I built and not the one I downloaded from the opencl software. 

Please let me know if my response to the queries is sufficient. Thank you again.

 

 

 

Hello 

I was wondering if someone could provide me if there are any developments based on the answers I posted. I apologize if this is causing any inconvenience, but the information I needed is critical for the project and it would be really helpful if someone could provide me some insight. Thanks.

Hello Sankha,

The following instructions should unblock you and only requires a "release" driver to compile your kernel above.

1) Search for the following line of code in <workspace>/igc/IGC/Compiler/Optimizer/OCLBIUtils.cpp (line 1612)

#if defined(_DEBUG) || defined(_INTERNAL)
    //Internal Debug Built-Ins
    m_CommandMap["__builtin_IB_read_cycle_counter"]  = CSimpleIntrinMapping::create(GenISAIntrinsic::GenISA_cycleCounter, false);

2) Move this line "m_CommandMap["__builtin_IB_read_cycle_counter..............." 3 lines up bringing it outside of the preprocessor directives "#if defined(_DEBUG) || defined(_INTERNAL)"

3) Wipe out your build folder and rebuild igc and the compute runtime from scratch.

4) Run your command line ocloc -file sample.cl -device kbl -output sampleOut -64 -llvm_text

 

 

Hello Alex

Thank you for your reply but my essential queries and issues are still not clear to me. I am still confused as to how ocloc is using the IGC. As I mentioned before that I am able to build the IGC with some previous version in the debug mode. So building the IGC in release/debug/internal mode is not the issue here (although I started with that). So I guess even if I keep those lines you mentioned inside the DEBUG preprocessor directive it should be fine if I compiled the IGC in debug/internal mode. But my question is how to use the ocloc with the IGC that I am building. Previously Micheal C asked me the details about my issues which I have answered in detail my queries and issues in comment #15. There I have mentioned in detail that though I am building the IGC in debug mode and everything, ocloc is giving compilation error while building the opencl kernel with the overloaded functions. It would be helpful if you could refer to the post about the issues. The main salient points in my problems are

How can I compile an opencl kernel (that would have overloaded functions inside) using ocloc that is internally going to use the IGC that I am building (be it release,debug or internal). Also I followed the installation procedure in linux in here  to install the debian files and for this reason I am seeking the information that how can I force ocloc to use IGC that I am building.

Also if you could explain that how ocloc is interfaced with IGC then that would also be helpful. 

Thanks.

 

There is one more update that I would like to provide is that I followed the steps that you mentioned and run ocloc in the command line 

 ocloc -file sample.cl -device kbl -output sampleOut -64 -llvm_text

But it didn't produce any binary. There is no output at all. Previously 3 files used to get generate (.bin, .gen and .spv). But now no file gets generated. After a successful compilation, ocloc used to show a message which is not shown this time. It seems that ocloc is not working properly. It would be very much helpful if I could get some feedback thank you.

Hello Sankha,

I'd like to step back a moment.  There may be an easier way to do what you'd like to do.

Can I assume that your usual usage model is to create your OpenCL program via clCreateProgramWithSource()?  If so, since IGC is invoked at runtime when your kernel is built, if you rename and replace the IGC on your system with the IGC you've built then you'll be able to create programs using the POC intel_get_cycle_counter function from source, without needing to go through ocloc and clCreateProgramWithBinary().

You can find your system IGC with:

locate libigc.so

By default, it will most likely be in /usr/lib/.

Simply rename this libigc.so to something else (such as libigc.so.backup) and copy the modified libigc.so that you built in its place.  You may need elevated privileges to do this – be careful!  To go back to the original libigc.so, reverse these steps.

A few other notes:

Your OpenCL driver needs to be compatible with the IGC you’ve built, so please be sure that they’re not too far apart.  You may find it helpful to find the IGC tag associated with a particular compute-runtime release:

https://github.com/intel/compute-runtime/releases

If you don’t have the privileges you need to do the renaming and replacement, it’s possible we can still make something work using LD_LIBRARY_PATH or LD_PRELOAD, but let’s cross that bridge if we come to it.

Thanks!

@ben ashbaugh 

Thank you so much for clearing up the usage methodology. Your comment was very clear and really helpful. At the very beginning let me mention that I have privilege access and I have tried what you have mentioned but there is a JIT compilation error that is showing up. Let me give you some insight on what I am doing. So I have download the IGC from source and since the current version was giving (which made me to post the issue on the first place), I changed to tag igc_release_2018-12-12 which compiled and installed fine. On the other hand, I have also followed the ubuntu installation procedure of 19.11.12599 release version of opencl runtime and installed all the debian packages.

Now there are 2 approaches that I have taken as I have discussed below:

1. The first approach is to use the libigc from the IGC  that I built:

So as per your comment, I have copied the libigc.so from my igc built (which is in dump64/igc/ directory )to /usr/lib and /usr/local/lib. After copying, when I am trying to execute my opencl program using clCreateProgramWithSource() then I am getting JIT compilation error. However, there are no details about the error that could be obtained by clGetProgramBuildInfo. 

This is the outpur from clGetProgramBuildInfo 

==========ERROR=========

=======================
2. The second approach is to install the debian files in opencl runtime release and use the prepackaged igc.

So when I am using the ubuntu installation procedure after downloading the debian packages (specially intel-igc-core_19.11.1622_amd64.deb) as mentioned in here (sudo dpkg -i intel-igc-core_19.11.1622_amd64.deb), my opencl program got compiled without any error and ran fine (without having the overloaded function inside opencl kernel). But if I include the overloaded functions as mentioned below

ulong __attribute__((overloadable)) intel_get_cycle_counter( void );
__kernel void vec_add(__global int *out, __global const int *in1, __global const int *in2) {
  int i = get_global_id(0);
  ulong dst =  intel_get_cycle_counter();
  out[i] = in1[i] + in2[i];
}

Then the error in JIT is 

unknown mangling!
UNREACHABLE executed at /home/duttasankha/Desktop/SANKHA_ALL/INTEL_GRAPHICS_COMPILER/llvm_source/projects/llvm-spirv/lib/SPIRV/OCLUtil.cpp:178!
Aborted (core dumped)

The reason for this error is understandable as here the IGC that came with the debian package is installed, this error is showing which would not be there if I use the IGC that I built. 

But again if I copy the libigc.so to /usr/lib or do sudo make install from the build folder from IGC workspace folder then it is not compiling. So I can see that whenever I am trying to use the IGC that I built is throwing a JIT compilation error, but when I am using the debian packages, it compiles file. I am not sure what is the issue in here that when I am using the libigc from my built is failing but using the debian packages from the opencl release is successful. It would be very much helpful if you could provide me some sort of idea as for the reason of this. Thank you.

 

Best Reply

Quote:

Dutta, Sankha wrote:

I changed to tag igc_release_2018-12-12 which compiled and installed fine. On the other hand, I have also followed the ubuntu installation procedure of 19.11.12599 release version of opencl runtime and installed all the debian packages.

...

when I am trying to execute my opencl program using clCreateProgramWithSource() then I am getting JIT compilation error.

My guess is that the compiler from December is incompatible with the driver from March and the incompatibility is being flagged as a JIT compilation error.  Since it sounds like you're using an older compiler due to compilation errors with newer code, could you please try a corresponding older driver?

To be on the safe side, try to get as close to December 12th as you can, for example:

https://github.com/intel/compute-runtime/releases/tag/18.51.12049

It looks like any driver older than February 19th may work, however:

https://github.com/intel/intel-graphics-compiler/commits/master/IGC/Adap...

@Ben Ashbaugh

Thank you so much for your reply. That did the job. That was really really helpful and I cannot thank you enough. Now I am able to use the overloaded functions without any compilation error. I have few follow up questions though which are as follows.

1. There is an overloaded function intel_get_eu_id which I am guessing is used to get the EU id. My GPU is gen9 which has got only 8 EUs/subslice. However, I can see EU id values of like 8, 9, 10 and 11 as I thought it should be up to 7 (0 - 7). So I am confused how to interpret this value and was wondering if you could help me with that.

2. What is the difference between  intel_get_eu_idintel_get_eu_thread_id and intel_get_hw_thread_id?

3. How can I view the .gen file of the kernel ISA that would include this overloaded functions as well. I would need to make sure that the cycle value that I am getting is indeed through tm0 register.

4. So In the Ibif_impl.cl file there are some more overloaded functions that I am intending to use if it comes out to be the same as I think it is. I was wondering if you could give me some idea about the purpose of the following overloaded functions and how can I use them.

__attribute__((always_inline)) int OVERLOADABLE intel_get_active_channel_mask( void )
__attribute__((always_inline)) uint OVERLOADABLE intel_set_dbg_register(uint dbg0_0)
__attribute__((always_inline)) uint OVERLOADABLE intel_get_grf_register( uint value )
__attribute__((always_inline)) uint OVERLOADABLE intel_get_flag_register( uint flag )
__attribute__((always_inline)) uint OVERLOADABLE intel_get_control_register

Thank you again for for all the help.

Quote:

Dutta, Sankha wrote:

Now I am able to use the overloaded functions without any compilation error.

Excellent, very happy to hear it worked!

Quote:

Dutta, Sankha wrote:

I have few follow up questions though which are as follows.

1. There is an overloaded function intel_get_eu_id which I am guessing is used to get the EU id. My GPU is gen9 which has got only 8 EUs/subslice. However, I can see EU id values of like 8, 9, 10 and 11 as I thought it should be up to 7 (0 - 7). So I am confused how to interpret this value and was wondering if you could help me with that.

2. What is the difference between  intel_get_eu_idintel_get_eu_thread_id and intel_get_hw_thread_id?

These values come straight from the HW "state registers".  For some of these IDs, they're guaranteed to be unique, but they may not necessarily be contiguous.  There's a bit more information in the state register description in the programmer's reference manual:

https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-... (around page 750)

Quote:

Dutta, Sankha wrote:

3. How can I view the .gen file of the kernel ISA that would include this overloaded functions as well. I would need to make sure that the cycle value that I am getting is indeed through tm0 register.

Can you use the Kernel ISA feature from the Intercept Layer?

https://github.com/intel/opencl-intercept-layer/blob/master/docs/kernel_...

Quote:

Dutta, Sankha wrote:

4. So In the Ibif_impl.cl file there are some more overloaded functions that I am intending to use if it comes out to be the same as I think it is. I was wondering if you could give me some idea about the purpose of the following overloaded functions and how can I use them.

We mostly added these to help with debugging.  They map more-or-less directly to the EU HW registers.  As a side note, I recall that get_grf_register and get_flag_register may not have been implemented, so I'd advise checking before relying on them.

@Ben Ashbaugh

Thank you so much for all the details. I will close the issue in IGC git repo as well. Thank you again. 

Leave a Comment

Please sign in to add a comment. Not a member? Join today