General Code Questions

Do you have questions related to the code, e.g. can I do x or y? Ask the experts here!


I have some questions about how to compile the code.

After I type make, it generates the executable file cell_clustering. If I want to use the small input file, I type the command ./cell_clustering << small.cdc. Is that correct?

If I do not change the code and just want to run the original code and use Intel VTune to find the bottleneck, can I use small.cdc as an input file directly? Or do I need to decrease the values of the parameters in small.cdc? What are appropriate values for running the original code? Could you give me a sample?

Hi,

You should not try to run the code on the login node. You should run it on the cluster compute nodes which have Xeon Phi cards. To run your executable on the cluster, you need to queue it using the syntax:

qmic “~/mydir/myapp  ~/mydir/mydata”

So, for your executable, you can run it using:

qmic "~/cell_clustering/cell_clustering ~/cell_clustering/small.cdc"

You can use qstat to check the status of the job. Once the code runs, you will find STDIN.oxxx and STDIN.exxx files generated in your home directory with the stdout and stderr outputs, respectively. Check these files for the results of your run. This procedure is described in detail in the server Readme file available under /common/.

Before running your code on the cluster though, please make sure that you have the -mmic compiler option in your Makefile. This makes sure that your code runs on the Xeon Phi. It will return an error otherwise.

And, yes, you should use small.cdc as it is to test the code even before optimization.

Hope that answers your question. Please feel free to post any other questions you may have.

Thanks,

Iman

I was wondering if we would be able to use the AVX-512 extensions available on the Xeon Phi.  I attempted to compile on my access node with the -xMIC-AVX512 option.  Instead of returning an error indicating that it was unsupported (like -xCORE-AVX512; "icpc: command line error: option '-xCORE-AVX512' not supported"), it instead returned a series of linker errors and undefined references.

icpc -mmic -o cell_clustering cell_clustering.cpp util.cpp -DCOMPILER_VERSION=\""icpc-20150815"\" -DBUILD_HOST=\"cfxcluster-x86_64-Linux-2.6.32-504.el6.x86_64\" -openmp -xMIC-AVX512 -Wall -lrt
ld: skipping incompatible /opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/mic/libimf.a when searching for -limf
ld: skipping incompatible /opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/mic/libsvml.a when searching for -lsvml
ld: skipping incompatible /opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/mic/libirng.a when searching for -lirng
ld: skipping incompatible /opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/mic/libipgo.a when searching for -lipgo
ld: skipping incompatible /opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/mic/libdecimal.a when searching for -ldecimal
ld: skipping incompatible /opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/mic/libiomp5.so when searching for -liomp5
ld: skipping incompatible /opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/mic/libiomp5.a when searching for -liomp5
ld: skipping incompatible /opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/mic/libcilkrts.so.5 when searching for libcilkrts.so.5
ld: skipping incompatible /opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/mic/libirc.a when searching for -lirc
ld: skipping incompatible /opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/mic/libsvml.a when searching for -lsvml
ld: skipping incompatible /opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/mic/libirc_s.a when searching for -lirc_s
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoAcquire'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoArenaAlignedFree'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIEngineGetIndex'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIBufferReleaseRef'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoiTargetFptrTableRegister'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoSharedMalloc'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoSharedAlignedFree'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoSharedFree'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIPerfGetCycleFrequency'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIBufferAddRef'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIPipelineStartExecutingRunFunctions'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoRelease'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoArenaAlignedMalloc'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoArenaRelease'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoiRemoteFuncRegister'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoSharedAlignedMalloc'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `COIProcessWaitForShutdown'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoiLibInit'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoiLibFini'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoArenaAcquire'
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoiMicVarTableRegister'

Was there ever any intention to allow us to use this extension?  If so, have I missed something?

Thanks

Thanks for your reply. I still have two questions. 

1. In the README file, it says 

To compile this code for Intel(R) Xeon Phi(TM), it will be necessary

to change the compiler switches in the Makefile, replacing -xHost with -mmic.

Below is the provided makefile 

 

COMPILER_VERSION := "$(CXX)-$(shell $(CXX) --version | head -n1 | cut -d' ' -f4)"
BUILD_HOST:=$(shell sh -c './BUILD-HOST-GEN')

CFLAGS = -DCOMPILER_VERSION=\"$(COMPILER_VERSION)\" -DBUILD_HOST=\"$(BUILD_HOST)\"

cell_clustering: cell_clustering.cpp util.cpp util.hpp Makefile
    $(CXX) -o $@ cell_clustering.cpp util.cpp $(CFLAGS) -Wall -lrt

clean:
    rm -rf cell_clustering

I am not sure what -xHost means. How do I replace it? After I replace it, I compile it on the access node and then submit it to the cluster, right?

 

2. I see the README file on the cluster. It seems this code will be run only on the Xeon Phi, which means only in native mode. So the offload mode is not allowed, right?

 

Thanks for Your time!

Have a nice day!

1. The wording in the README file included with the code is not exactly correct. You can add the flag -xHost to the variable CFLAGS to compile the code for the CPU architecture on which you are compiling. Alternatively, you can add the flag -mmic to CFLAGS to compile the code for the Intel Xeon Phi architecture (see the sketch below).

2. Correct, native mode only, no offload.
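A minimal sketch of how the Makefile rule from the README might look with the target flag kept in CFLAGS (this is the -mmic variant; swap in -xHost to build for the host CPU instead):

CFLAGS = -DCOMPILER_VERSION=\"$(COMPILER_VERSION)\" -DBUILD_HOST=\"$(BUILD_HOST)\" -mmic

cell_clustering: cell_clustering.cpp util.cpp util.hpp Makefile
    $(CXX) -o $@ cell_clustering.cpp util.cpp $(CFLAGS) -Wall -lrt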

I have a few quick questions:

(1) I know that the final criterion must be '1' after the code is done running, but does the 'final energy' also have to be correct (i.e. the same as in the unoptimized code)? I assume yes, but I just wanted to check.

(2) Will the code we submit be tested on the remote machine you have set up for us? Or do you plan to run benchmarks on a different machine? I just want to make sure I don't make any platform specific optimizations only to be disqualified because my code doesn't compile on a different machine.

(3) Is it possible for you to raise the wall clock time limit on the compute nodes? I really want to get a run of huge.cdc to finish. I have managed to make some pretty substantial optimizations to the code, but I still can't get huge.cdc to finish before reaching the 600-second limit.

(4) Is there someone I can talk to about internship opportunities at CERN openlab? Even if I don't win the competition, I would love to discuss a little more about what an intern might do during the summer.

Quote:

jremmons wrote:

I have a few quick questions:

(1) I know that the final criterion must be '1' after the code is done running, but does the 'final energy' also have to be correct (i.e. the same as in the unoptimized code)? I assume yes, but I just wanted to check.

(2) Will the code we submit be tested on the remote machine you have set up for us? Or do you plan to run benchmarks on a different machine? I just want to make sure I don't make any platform specific optimizations only to be disqualified because my code doesn't compile on a different machine.

(3) Is it possible for you to raise the wall clock time limit on the compute nodes? I really want to get a run of huge.cdc to finish. I have managed to make some pretty substantial optimizations to the code, but I still can't get huge.cdc to finish before reaching the 600-second limit.

(4) Is there someone I can talk to about internship opportunities at CERN openlab? Even if I don't win the competition, I would love to discuss a little more about what an intern might do during the summer.

Hi,

(1) No, the 'final energy' doesn't have to be the same as in the unoptimized code.

(2) Yes, we will test the code on the SAME remote machine provided to you to perform your tests.

(3) A discussion about it is ongoing; I'll keep you posted.

(4) I'll ask one of my colleagues to get back to you about the internship.

Hope that answers your questions.

Best regards,

Iman

 

Quote:

Kendon R. wrote:

I was wondering if we would be able to use the AVX-512 extensions available on the Xeon Phi.  I attempted to compile on my access node with the -xMIC-AVX512 option.  Instead of returning an error indicating that it was unsupported (like -xCORE-AVX512; "icpc: command line error: option '-xCORE-AVX512' not supported"), it instead returned a series of linker errors and undefined references.

icpc -mmic -o cell_clustering cell_clustering.cpp util.cpp -DCOMPILER_VERSION=\""icpc-20150815"\" -DBUILD_HOST=\"cfxcluster-x86_64-Linux-2.6.32-504.el6.x86_64\" -openmp -xMIC-AVX512 -Wall -lrt
ld: skipping incompatible /opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/mic/libimf.a when searching for -limf
...
/opt/intel/compilers_and_libraries_2016.0.109/linux/compiler/lib/intel64_lin/libioffload_target.so.5: undefined reference to `myoiMicVarTableRegister'

Was there ever any intention to allow us to use this extension?  If so, have I missed something?

Thanks

 

AVX-512 extensions are not available in the first-generation Xeon Phi coprocessors that you are using in the cluster; they will be available in the second generation. For the first generation of Xeon Phi, the instruction set is called IMCI (also known as the KNC instructions). The compiler automatically targets it when you use the compiler argument "-mmic".

Quote:

Iman Saleh (Intel) wrote:

Hi,

(1) No, the 'final energy' doesn't have to be the same as in the unoptimized code.

(2) Yes, we will test the code on the SAME remote machine provided to you to perform your tests.

(3) A discussion about it is ongoing; I'll keep you posted.

(4) I'll ask one of my colleagues to get back to you about the internship.

Hope that answers your questions.

Best regards,

Iman

I just want to follow up on (1). If the final energy doesn't need to be correct, can I just remove the computation from the code altogether? It would certainly speed up the code if I didn't need to run the getEnergy function :).

John

Where can I get help with compilation flags?

My current version runs small.cdc in 12.5s on the cluster, which is shocking as my desktop runs it in 3.7s. I've been through the options list (is that the right one?) and set everything I can find that corresponds to my environment, but it's still nearly 4x slower. I'm obviously missing something on the unfamiliar platform, as my code optimisations are proving to work. Can you provide any guidance for those unfamiliar with the toolset?

Thanks, Craig

Quote:

jremmons wrote:

I have a few quick questions:

...

(4) Is there someone I can talk to about internship opportunities at CERN openlab? Even if I don't win the competition, I would love to discuss a little more about what an intern might do during the summer.

Hi, for question #4 please take a look at the CERN openlab website for more information about the summer internship.

http://openlab.web.cern.ch/summer-student-programme 

thanks -Russ

Quote:

jremmons wrote:

I have a few quick questions:

...

(3) Is it possible for you to raise the wall clock time limit on the compute nodes? I really want to get a run of huge.cdc to finish. I have managed to make some pretty substantial optimizations to the code, but I still can't get huge.cdc to finish before reaching the 600-second limit.

...

Hi,

This is to notify you that we increased the time limit on the compute nodes.

Thanks,

Iman

Hi, 

I have a question.

If I want to use auto-vectorization, which parameter should I add in the Makefile?

When I add -vec or -vec-report3, both give errors.

 

Thanks very much!

Quote:

Craig H. wrote:

Where can I get help with compilation flags?

My current version runs small.cdc in 12.5s on the cluster, which is shocking as my desktop runs it in 3.7s. I've been through the options list (is that the right one?) and set everything I can find that corresponds to my environment, but it's still nearly 4x slower. I'm obviously missing something on the unfamiliar platform, as my code optimisations are proving to work. Can you provide any guidance for those unfamiliar with the toolset?

Thanks, Craig

Hi,

You can get more information about compiler flags and their descriptions in the C++ Compiler Reference guide.
https://software.intel.com/en-us/node/581686

Best,
Ryo

Quote:

Hi, 

I have a question.

If I want to use auto-vectorization, which parameter should I add in the Makefile?

When I add -vec or -vec-report3, both give errors.

 

Thanks very much!

Hi,

The automatic vectorization feature is enabled by default with the Intel C++ compiler, and by adding -mmic the compiler knows to use the correct instruction set (the IMCI set) for the Knights Corner Xeon Phi coprocessor. So you don't need to add any additional flags (beyond -mmic) to have the compiler try to automatically vectorize your code.

With regards to "-vec-report3": this is the old syntax for the vectorization report. The new flags for the vectorization portion of the optimization report are -qopt-report -qopt-report-phase=vec.
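For example, a possible compile line based on the Makefile shipped with the code (a sketch; with this compiler version the report is written to a .optrpt file next to each object):

icpc -mmic -qopenmp -qopt-report -qopt-report-phase=vec -o cell_clustering cell_clustering.cpp util.cpp -Wall -lrt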

Best,
Ryo

Quote:

Craig H. wrote:

Where can I get help with compilation flags?

My current version runs small.cdc in 12.5s on the cluster, which is shocking as my desktop runs it in 3.7s. I've been through the options list (is that the right one?) and set everything I can find that corresponds to my environment, but it's still nearly 4x slower. I'm obviously missing something on the unfamiliar platform, as my code optimisations are proving to work. Can you provide any guidance for those unfamiliar with the toolset?

Thanks, Craig

Hello Craig,

For small inputs, like small.cdc, your code may run faster on the Xeon. However, for larger inputs this won't be the case, and you will need to optimize your code for the best performance on the Xeon Phi.

Hope that helps.

Thanks,

Iman

Hi, 

If I want to set the number of threads (e.g. 2) on my own machine, I can change the environment variable by typing the command:

 

export OMP_NUM_THREADS=2

If I want to run it on the Xeon Phi, how can I set the number of threads?

 

Thank you!

You could use: omp_set_num_threads(2);

Hi;

I have a question. 

When I use OpenMP to optimize my code, I find that in the output file the values of number of cells in subvolume, average neighbors in subvolume, and the correctness coefficient have changed compared to the unoptimized code. I just wonder whether this is correct or not.

 

Thanks very much!

Have a nice day!

Quote:

Hi;

I have a question. 

When I use OpenMP to optimize my code, I find that in the output file the values of number of cells in subvolume, average neighbors in subvolume, and the correctness coefficient have changed compared to the unoptimized code. I just wonder whether this is correct or not.

 

Thanks very much!

Have a nice day!

Hi,

It is expected that these values change. You should make sure that the final criterion value is always 1; other values can change.

Iman

Hi,

I have a confusing question.

For example, if I use OpenMP to parallelize the outer loop in functions such as

static void runDecayStep(float**** Conc, int L, float mu) {
    runDecayStep_sw.reset();
    // computes the changes in substance concentrations due to decay
    int i1,i2,i3;
#pragma omp parallel for private(i2,i3)
    for (i1 = 0; i1 < L; i1++) {
        for (i2 = 0; i2 < L; i2++) {
            for (i3 = 0; i3 < L; i3++) {
                Conc[0][i1][i2][i3] = Conc[0][i1][i2][i3]*(1-mu);
                Conc[1][i1][i2][i3] = Conc[1][i1][i2][i3]*(1-mu);
            }
        }
    }
    runDecayStep_sw.mark();
}

 

And I set the number of threads in the main function, such as 2, and in the Makefile I add -qopenmp:

int main(int argc, char *argv[]) {
 omp_set_num_threads(2);

I find that although it speeds up, no matter what number of threads I set (1, 44, 128, or whatever), the total execution time does not change.

I feel like it always uses all threads, but I do not know where I am wrong.

 

Best Regards

Have a nice day!

 

To confirm that the number of threads you've set is being used in the parallel region(s), use: int omp_get_num_threads();

Also check that your synchronization (waiting on threads, e.g. a critical section) is not the bottleneck.

Hi,

I am running the code non-parallel. I replaced all calls to getNorm(x) with calls to cblas_snrm2(3, x, 1).

I find that, as a result, the code with "small.cdc" runs slower. I was expecting the opposite. Is there a reason for this and is it specific to the Xeon Phi?

I included the MKL header and "-mkl" compiler flag.

Thanks.

Hi, could someone officially clarify question #1 from jremmons, please? getEnergy consumes a huge chunk of the execution time and I'd like to know if I can get rid of it.

Thanks.

Quote:

Hi,

I have a confusing question.

For example, if I use OpenMP to parallelize the outer loop in functions such as

...

And I set the number of threads in the main function, such as 2, and in the Makefile I add -qopenmp:

int main(int argc, char *argv[]) {
 omp_set_num_threads(2);

I find that although it speeds up, no matter what number of threads I set (1, 44, 128, or whatever), the total execution time does not change.

I feel like it always uses all threads, but I do not know where I am wrong.

 

 

You are using the correct method to set the number of OpenMP threads. See the test below, which confirms that:

[avladim@cfxcluster ~]$ cat test-threads.cc 
#include <cstdio>
#include <omp.h>

int main() {
  omp_set_num_threads(2);
#pragma omp parallel
  {
    printf("Test #1, thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
  }

  omp_set_num_threads(4);
#pragma omp parallel
  {
    printf("Test #2, thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
  }
}
[avladim@cfxcluster ~]$ 
[avladim@cfxcluster ~]$ 
[avladim@cfxcluster ~]$ icpc -o test-threads -qopenmp -mmic test-threads.cc 
[avladim@cfxcluster ~]$ qmic "~/test-threads"
2384.cfxcluster
[avladim@cfxcluster ~]$ cat STDIN.o2384 
Test #1, thread 0 of 2
Test #1, thread 1 of 2
Test #2, thread 1 of 4
Test #2, thread 0 of 4
Test #2, thread 2 of 4
Test #2, thread 3 of 4

 

Why the performance does not change as you vary the number of threads is a different question; I am not in a position to answer that.

Quote:

Pablo G. wrote:

Hi, could someone officially clarify question #1 from jremmons, please?, getEnergy consumes a huge chunk of the execution time and I'd like to know if I can get rid of it.

Thanks.

Hi Pablo,

The optimized code needs to generate the same output as the serial version. Values may not be the same for some parameters, but you still need to calculate them.

Thanks,

Iman

Quote:

Pinaky B. wrote:

Hi,

I am running the code non-parallel. I replaced all calls to getNorm(x) with calls to cblas_snrm2(3, x, 1).

I find that, as a result, the code with "small.cdc" runs slower. I was expecting the opposite. Is there a reason for this and is it specific to the Xeon Phi?

I included the MKL header and "-mkl" compiler flag.

Thanks.

Hi Pinaky,

To clarify, are you running it on the cluster using qmic?

Thanks,

Iman

Hi,

I have two questions.

1. If I use OpenMP, where should I put -qopenmp in the Makefile?

Is it

$(CXX) -mmic -qopenmp -o $@ cell_clustering.cpp util.cpp $(CFLAGS) -Wall -lrt

or

$(CXX) -mmic  -o $@ cell_clustering.cpp util.cpp $(CFLAGS) -qopenmp -Wall -lrt

The performance is different even though I use the same code.

 

2. What is the difference between -qopenmp and -openmp?

Before, I thought -qopenmp was used on Windows and -openmp on Linux.

But it seems my code can run with both, yet the performance differs. It confuses me.

Thank you very much!

Best Regards

Have a nice day!

Yes, I am.

For the last line of STDIN.e****, I get a total time of ~150s with the original code and ~250s with the CBLAS code.

 

EDIT: I compiled using the "-mkl -mmic" flags. I did not use any OpenMP flags or add any parallel statements.

I run the binary using qmic "${BASE}/cell_clustering ${BASE}/small.cdc". I believe that this runs the process natively on the Xeon Phi. Since the code is serial, I expected the CBLAS statement to be much more efficient.

 

UPDATE: Do I need to link to a special MIC library?

I tried the Intel MKL Link Line Advisor. The results were no different. It suggested:

#Link line:
 -Wl,--start-group ${MKLROOT}/lib/mic/libmkl_intel_lp64.a ${MKLROOT}/lib/mic/libmkl_core.a ${MKLROOT}/lib/mic/libmkl_sequential.a -Wl,--end-group -lpthread -lm

#Compiler options:
-I${MKLROOT}/include -mmic

Quote:

Iman Saleh (Intel) wrote:

Hi Pinaky,

To clarify, are you running it on the cluster using qmic?

Thanks,

Iman

Is it possible for us to use OpenMP 4 features such as array notation, SIMD-enabled functions, and other SIMD pragma constructs? Also, can we use those constructs inside an already existing omp parallel region?

Quote:

Tanmay K. wrote:

Is it possible for us to use OpenMP 4 features such as array notation, SIMD-enabled functions, and other SIMD pragma constructs? Also, can we use those constructs inside an already existing omp parallel region?

Hi Tanmay,

You can use any technique you like to make the code faster. Needless to say, make sure you test it and that it works on the cluster.

Thanks,

Iman

Quote:

Pinaky B. wrote:

Yes, I am.

For the last line of STDIN.e****, I get a total time of ~150s with the original code and ~250s with the CBLAS code.

...

Hi Pinaky,

That's really a general question, so I'll just share some thoughts that may help. Some of these optimized functions are meant to be called on large arrays/values, so you may not see improvements otherwise. The reason getNorm is faster is that it is in the same translation unit and is almost certainly inlined. In general, you need to apply other optimization techniques to get real improvement in performance. Feel free to check our training on these techniques: https://software.intel.com/en-us/modern-code/training
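To illustrate the point, here is a minimal sketch (hypothetical function name, not the contest code) of why a fixed-size inline norm beats a generic library call for tiny vectors:

#include <cmath>

// For a fixed, tiny vector the compiler can inline this, keep everything
// in registers, and avoid any function-call overhead.
static inline float norm3(const float x[3]) {
    return std::sqrt(x[0]*x[0] + x[1]*x[1] + x[2]*x[2]);
}

// A generic routine such as cblas_snrm2(3, x, 1) must go through a function
// call and handle arbitrary lengths and strides, so for n = 3 the call
// overhead dominates the actual arithmetic.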

Thanks,

Iman 

Quote:

Hi,

I have two questions.

1. If I use OpenMP, where should I put -qopenmp in the Makefile?

Is it

$(CXX) -mmic -qopenmp -o $@ cell_clustering.cpp util.cpp $(CFLAGS) -Wall -lrt

or

$(CXX) -mmic  -o $@ cell_clustering.cpp util.cpp $(CFLAGS) -qopenmp -Wall -lrt

The performance is different even though I use the same code.

2. What is the difference between -qopenmp and -openmp?

...

Hi,

1- I am not sure I follow the question; your two commands look the same to me. A good practice, though, is to include compiler options within CFLAGS.

2- -qopenmp is the replacement for -openmp, which is deprecated.

Thanks,

Iman

I see. BLAS operations are slower than inline code for vectors of small size. Thanks for the help.

Quote:

Iman Saleh (Intel) wrote:

Hi Pinaky,

That's really a general question so I'll just share some thoughts with you that may help. Some of these optimized function are meant to be called for large arrays/values so you may not see optimizations otherwise. The reason getNorm is faster is that it is in the same translation unit and is certainly inlined. In general, you need to apply other optimization techniques to get real improvement in performance. Feel free to check our training on these techniques: https://software.intel.com/en-us/modern-code/training

Thanks,

Iman 

Hi;

In the vectorization report,such as :

LOOP BEGIN at cell_clustering3.cpp(575,17)

I know 575 is the line number, but what does 17 mean here?

 

Thanks very much!

Have a nice day!

 

Quote:

Hi;

In the vectorization report,such as :

LOOP BEGIN at cell_clustering3.cpp(575,17)

I know 575 is the line number, but what does 17 mean here?

 

Thanks very much!

Have a nice day!

 

Hi,

That's the column number.

Thanks,

Iman

Hi, 

When I checked the optimization report,

it says

" loop was not vectorized: compile time constraints prevent loop optimization."

What does this mean?

 

Thanks very much!

Quote:

Hi, 

When I checked the optimization report,

it says

" loop was not vectorized: compile time constraints prevent loop optimization."

What does this mean?

 

Thanks very much!

Hi,

Please check this article related to your question:

https://software.intel.com/en-us/articles/cdiag15532

Thanks,

Iman

Hi,

Is there a way to make the 2nd argument to 

_mm_malloc(size, align)

portable?

With the Xeon Phi, we know that align should be 64, which comes from the 512-bit IMCI instructions. Is there a macro constant or something that I could place there to ensure that it ports to other instruction sets like 256-bit AVX?

Quote:

Pinaky B. wrote:

Hi,

Is there a way to make the 2nd argument to 

_mm_malloc(size, align)

portable?

With the Xeon Phi, we know that align should be 64, which comes from the 512-bit IMCI instructions. Is there a macro constant or something that I could place there to ensure that it ports to other instruction sets like 256-bit AVX?

I would suggest doing this at the beginning of your file:

#ifdef __MIC__
#define ALIGN 64
#else
#define ALIGN 32
#endif

_mm_malloc(size, ALIGN);

At compile time, if the -mmic flag is used, __MIC__ will be defined; otherwise it won't. This should work for the current 512-bit IMCI vs. 256-bit AVX/AVX2.

I'm a contestant as well, someone from Intel can correct me in case there is a better way.
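For completeness, a minimal self-contained sketch of the idea (buffer name and size are made up for illustration):

#include <immintrin.h> // _mm_malloc / _mm_free (with the Intel compiler; gcc users may need <mm_malloc.h>)

#ifdef __MIC__
#define ALIGN 64 // 512-bit IMCI vectors
#else
#define ALIGN 32 // 256-bit AVX/AVX2 vectors
#endif

int main() {
    // allocate 1024 floats aligned for the target's vector width
    float *buf = (float *) _mm_malloc(1024 * sizeof(float), ALIGN);
    if (!buf) return 1;
    // ... use buf ...
    _mm_free(buf); // memory from _mm_malloc must be freed with _mm_free, not free()
    return 0;
}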

 

Hi,

Which Intel Phi model is on the cluster?

Quote:

Daniel v. wrote:

Hi,

Which Intel Phi model is on the cluster?

Intel Xeon Phi coprocessor 7120P

Hi,

When I use the small input file, the output shows:

average neighbors in subvolume: 147.982300
correctness coefficient: 0.006173

When I use the huge input file, the output file shows:

step 0
subVolMax: 0.267184
number of cells in subvolume: 8397
cells in subvolume are not well-clustered: 0.501402

step 10
...
step 490
subVolMax: 0.267184
number of cells in subvolume: 8645
cells in subvolume are not well-clustered: 0.493001

Is this result correct? I am kind of confused.

Plus: I find the final criterion is 0, so I think the result is wrong. What affects the final criterion? Time or something else?

Does a final criterion of 0 mean that the parallelism I use is wrong, or that the acceleration is not fast enough?

 

Best Regards 

 

 

 

Quote:

Hi,

When I use the small input file, the output shows:

average neighbors in subvolume: 147.982300
correctness coefficient: 0.006173

...

Plus: I find the final criterion is 0, so I think the result is wrong. What affects the final criterion? Time or something else?

Does a final criterion of 0 mean that the parallelism I use is wrong, or that the acceleration is not fast enough?

Best Regards

Hi,

For your results to be correct, the Final Criterion has to be equal to 1. This parameter indicates the correctness of your code, and it doesn't take time-to-solution into account. You should only make sure that this parameter is equal to 1; other parameters may change.

Hope that answers your question.

Thanks,

Iman

Everyone, please check out this webinar on the contest...

https://software.intel.com/en-us/videos/webinar-intel-modern-code-develo...

The webinar presents: 

- Code Walkthrough

- How to connect to the cluster and run your code

- Answers to frequent questions.

Iman

Quote:

Iman Saleh (Intel) wrote:

Quote:

boyli@clarkson.edu wrote:

...

Hi,

For your results to be correct, the Final Criterion has to be equal to 1. This parameter indicates the correctness of your code, and it doesn't take time-to-solution into account. You should only make sure that this parameter is equal to 1; other parameters may change.

Hope that answers your question.

Thanks,

Iman

 

But what confuses me is that if I use the small input file, the final criterion equals 1, but if I use the huge input file (huge.cdc), the final criterion equals 0.

What causes this problem?

Also, I tested with the original cell_clustering.cpp. Since the only difference between the small input file and the huge input file is that the parameters L and Divithreshold are different, if I just change L to 100 (in the small input it is 80), the final criterion still equals 0.

 

Best Regards

Quote:

...

But what confuses me is that if I use the small input file, the final criterion equals 1, but if I use the huge input file (huge.cdc), the final criterion equals 0.

What causes this problem?

Also, I tested with the original cell_clustering.cpp. Since the only difference between the small input file and the huge input file is that the parameters L and Divithreshold are different, if I just change L to 100 (in the small input it is 80), the final criterion still equals 0.

 

Best Regards

Hi,

There are other parameters that need to change as the problem is scaled up. Changing only L may result in incorrect output.

When you ran the code on the huge data set, did you use the provided huge.cdc, or did you increase the problem size by increasing L?

Thanks,

Iman

Quote:

Iman Saleh (Intel) wrote:

Everyone, please check out this webinar on the contest...

https://software.intel.com/en-us/videos/webinar-intel-modern-code-develo...

The webinar presents: 

- Code Walkthrough

- How to connect to the cluster and run your code

- Answers to frequent questions.

Iman

Thanks for the information :-)

In the video, it is recommended to use VTune Amplifier on our own machines to profile the code and detect where the hotspots are. However, my development machine contains an Intel i3, not a Xeon Phi (as I suppose is the case for most students). How can VTune help us in this case? Is the information given still relevant?

Thanks.

Hi Pablo,

Yes, the profiler shows you where most of the CPU time is spent in your code, so it's still relevant.
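For instance, if the VTune command-line collector is installed on your machine, a basic hotspots run might look like this (a sketch; r000hs is the default name of the first result directory):

# collect a hotspots profile of a host build of the code
amplxe-cl -collect hotspots -- ./cell_clustering small.cdc

# summarize which functions consumed the most CPU time
amplxe-cl -report hotspots -r r000hs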

Thanks,

Iman

Quote:

Iman Saleh (Intel) wrote:

Hi Pablo,

Yes, the profiler shows you where most of the CPU time is spent in your code, so it's still relevant.

Thanks,

Iman

Uhm, I still have my doubts... can't it be the case that a loop that is a bottleneck in sequential execution turns out to be cheap once parallelized or vectorized? Since we are profiling on a serial CPU with a non-vectorizing build, couldn't this be misleading?

Thanks for your help.
