Recent posts
https://software.intel.com/en-us/recent/510226
Undefined reference to "scalable_free"
https://software.intel.com/en-us/forums/topic/281650
<p>Hello</p>
<p>I have installed the individual free versions of Parallel Studio XE both on Windows (Visual Studio) and on my Linux system. I made a program that uses scalable_malloc and scalable_free inside my functor object. The program compiles and runs successfully in Visual Studio, but when I try compiling it on the Linux system (after setting up the environment), I get this error:</p>
<p>icc -openmp histogram.cc -o histogram -ltbb</p>
<p>/tmp/icc8Kgyh5.o: In function `tbb::internal::start_reduce&lt;tbb::blocked_range&lt;int&gt;, histogram_tbb, tbb::auto_partitioner const&gt;::execute()':<br />histogram.cc:(.gnu.linkonce.t._ZN3tbb8internal12start_reduceINS_13blocked_rangeIiEE13histogram_tbbKNS_16auto_partitionerEE7executeEv[.gnu.linkonce.t._ZN3tbb8internal12start_reduceINS_13blocked_rangeIiEE13histogram_tbbKNS_16auto_partitionerEE7executeEv]+0x1e7): undefined reference to `scalable_malloc'<br />/tmp/icc8Kgyh5.o: In function `tbb::internal::finish_reduce&lt;histogram_tbb&gt;::execute()':<br />histogram.cc:(.gnu.linkonce.t._ZN3tbb8internal13finish_reduceI13histogram_tbbE7executeEv[.gnu.linkonce.t._ZN3tbb8internal13finish_reduceI13histogram_tbbE7executeEv]+0x30): undefined reference to `scalable_free'</p>
<p>I wonder what the problem with scalable_allocator is here. Can anyone help, please?</p>
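For context, scalable_malloc and scalable_free are defined in TBB's separate scalable-allocator library (libtbbmalloc), not in libtbb itself, so the link line above most likely just needs one more flag:

```shell
# Likely fix: also link the TBB scalable memory allocator library
icc -openmp histogram.cc -o histogram -ltbb -ltbbmalloc
```

On Windows the TBB headers arrange linkage automatically, which would explain why the same code built there without the extra flag.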
Mon, 10 Oct 11 12:26:55 -0700 | akhal | topic 281650

malloc vs scalable_malloc
https://software.intel.com/en-us/forums/topic/281678
<p>Hi</p>
<p>I have seen "malloc" used as a kind of replacement for "new" to allocate heap memory and return a pointer to it (though I still wonder why one would use malloc instead of new), but I wonder what TBB's "scalable_malloc" does differently from "malloc", and in what situations "scalable_malloc" is the best fit.</p>
Fri, 07 Oct 11 07:49:28 -0700 | akhal | topic 281678

TBB vs OpenMP on single threads
https://software.intel.com/en-us/forums/topic/282243
<p>I have parallelized matrix multiplication and image convolution algorithms using OpenMP and TBB, and I was trying to check the scalability of these models as the number of cores goes from one to 8. I used "omp_set_num_threads(n)" for OpenMP and "task_scheduler_init TBBinit(n)" for TBB to control the number of cores. I am using the Intel compiler. For n=1, in the case of convolution, OpenMP shows no overhead and performs equally well compared to the serial version (to my surprise), while TBB performs badly and only starts getting better when I choose n > 1, which is natural.</p>
<p>The weird thing is with matrix multiplication: when I use the optimization flag "-O0", i.e. disable optimizations, TBB performs slightly worse than the serial version with n=1, which is natural overhead; but OpenMP performs exactly equal to the serial version, which means it incurs no overhead at all. For the same n=1, when I use the compiler flag "-O1", OpenMP performs better than even the serial version, while TBB still performs worse than serial for one thread. And with "-O3" optimizations, TBB is still worse than serial for n=1, but now OpenMP performs twice as fast as the serial version :) What is happening there? I am using schedule(static) in OpenMP; does that mean OpenMP programs with static scheduling have NO OVERHEAD at all? Or how can this be explained?</p>
Wed, 24 Aug 11 15:56:10 -0700 | akhal | topic 282243

Fortran spread() function in OpenCL
https://software.intel.com/en-us/forums/topic/282472
<p>Hello</p>
<p>I want to implement the Fortran spread function in an OpenCL kernel. For example, inside a loop with index k, I have the following Fortran statements that need to make (n-k) x (n-k) matrices, first from a row of matrix "a" and then from a column of matrix "a":</p>
<p>spread(a(k,k+1:n),1,n-k)<br />spread(a(k+1:n,k),2,n-k)</p>
<p>I will probably make a new matrix in each iteration of the loop inside the OpenCL kernel, and will need to spread the kth row along all rows of one matrix and the kth column along all columns of a second matrix. How could I do that in an OpenCL kernel?</p>
Fri, 05 Aug 11 08:15:44 -0700 | akhal | topic 282472

Inner loops with OpenCL
https://software.intel.com/en-us/forums/topic/282564
<p>Hello</p>
<p>I am new to OpenCL and want to parallelize some looping code that does LU factorization, with the loop structure shown by this exact code:</p>
<p> for(int k = 0; k < N-1; k++)<br /> {<br /> for(int i = k+1; i < N; i++)<br /> S[i*N + k] = S[i*N + k] / S[k*N + k];</p>
<p> for(int j = k+1; j < N; j++)<br /> for(int i = k+1; i < N; i++)<br /> S[i*N + j] -= S[i*N + k] * S[k*N + j];<br /> }</p>
<p>I have written a simple OpenCL kernel with single work items (no grouping). It is the following:</p>
<p> int IDx = get_global_id(0);<br /> int IDy = get_global_id(1);</p>
<p> for(int k = 0; k < n-1; k++)<br /> {<br /> barrier(CLK_GLOBAL_MEM_FENCE);</p>
<p> if(IDy > k && IDx == k)<br /> matrix[IDy*n + IDx] = matrix[IDy*n + IDx] / matrix[IDx*n + IDx];</p>
<p> barrier(CLK_GLOBAL_MEM_FENCE);</p>
<p> for(int j = k+1; j < n; j++)<br /> {<br /> if(IDy > k && IDx == j)<br /> matrix[IDy*n + IDx] -= matrix[IDy*n + k] * matrix[k*n + IDx];<br /> }<br /> }</p>
<p>But I don't get correct results compared to the serial code. This is my personal attempt at an OpenCL kernel, and I am still learning how the data-parallel scheme in OpenCL works. Can you point out what I am doing wrong in the kernel?</p>
Sun, 31 Jul 11 11:18:56 -0700 | akhal | topic 282564

CPU vs GPU optimizations
https://software.intel.com/en-us/forums/topic/282623
<p>Hello</p>
<p>I have implemented a straightforward naive matrix multiplication in OpenCL with the AMD SDK. I get a speedup of around 16 on just an 8-core CPU system, even though I only run it on the CPU. I have applied some popular optimizations, like using private memory and local memory, and grouping my matrix in one dimension so that I use both global and local dimension sizes. Now I get a speedup of around 24 on the same 8-core CPU.</p>
<p>First, this much speedup surprises me, because for 8 cores I normally get a speedup of around or less than 8 with OpenMP, for example. So these figures of 16 and 24 amaze me; how is that possible?</p>
<p>Second, the local + private memory and work-item grouping optimizations are ones I heard are only for GPUs, not CPUs, so I again wonder how I get such a boost in speedup when I run only on the CPU.</p>
<p>Thirdly, I wonder how local memory, private memory, and grouping are handled on CPUs such that they cause speedup; caches, processor registers, or what? Because it seems like magic to get so much speedup...</p>
<p>I also want to know what the CPU-specific optimizations in OpenCL are.</p>
<p>Please help me understand, because I am so new to OpenCL and it is giving me such big performance gains that I can't believe it. I have verified the results, and they are perfectly accurate.<br />Thanks in advance</p>
Wed, 27 Jul 11 10:37:53 -0700 | akhal | topic 282623

Changing number of threads in TBB
https://software.intel.com/en-us/forums/topic/282961
<p>Hi</p>
<p>How can one control the number of threads? For example, if I specify some number of threads through the scheduler, e.g.,</p>
<p>tbb::task_scheduler_init TBBinit(nthreads);</p>
<p>and then I want to change the available number of threads in the middle of the program, how do I do that?</p>
Mon, 04 Jul 11 12:23:02 -0700 | akhal | topic 282961

Nested For Loop: blocked_range 1D or 2D
https://software.intel.com/en-us/forums/topic/283016
<p>Hi<br />I am kind of a newbie with Intel TBB, trying to parallelize a problem that worked well with OpenMP but shows no speedup with TBB, even though the inner loop iterations are independent. I thought a 2D blocked_range might help; it does show speedup, but gives wrong calculation results. My codes are as follows:<br />[code]<br />/*----- Serial Version -----*/<br />for(k=0; k < size-1; k++)<br />{<br /> for(i=k+1; i < size; i++)<br /> {<br /> s[i][k] = s[i][k]/s[k][k];<br /> for(j=k+1; j < size; j++)<br /> s[i][j] -= s[i][k]*s[k][j];<br /> }<br />}</p>
<p>/* OpenMP version (which shows considerable speedup) */<br />#pragma omp parallel default(shared) private(k)<br />for(k=0; k < size-1; k++)<br />{<br />#pragma omp for private(i,j) schedule(static)<br /> for(i=k+1; i < size; i++)<br /> {<br /> a1[i][k] = a1[i][k]/a1[k][k];<br /> for(j=k+1; j < size; j++)<br /> a1[i][j] = a1[i][j] - a1[i][k]*a1[k][j];<br /> }<br />}</p>
<p>/* TBB version (1D blocked_range) */<br />task_scheduler_init TBBinit(nthreads);<br />for(int k=0; k < size-1; k++)<br /> parallel_for(blocked_range&lt;int&gt;(k, size, (size-k)/nthreads), my_class(a2));<br />/* setting the grainsize to that value reduced the time, but it is still a multiple of the serial execution time :( */</p>
<p>class my_class<br />{<br /> double** my_a;<br />public:<br /> my_class(double** a) : my_a(a) {}<br /> void operator() (const blocked_range&lt;int&gt;& r) const<br /> {<br /> double** a2 = my_a;<br /> int k = r.begin();<br /> for(int i=k+1; i != size; i++)<br /> {<br /> a2[i][k] = a2[i][k]/a2[k][k];<br /> for(int j=k+1; j != size; j++)<br /> a2[i][j] = a2[i][j] - a2[i][k]*a2[k][j];<br /> }<br /> }<br />}; //This 1-D version gives very poor performance</p>
<p>/*----- I tried a 2-D range as follows -----*/<br />for(int k=0; k < size-1; k++)<br /> parallel_for(blocked_range2d&lt;int&gt;(k,size,(size-k)/nthreads, k,size,(size-k)/nthreads), my_class2d(a3));<br />//Class body<br />class my_class2d<br />{<br /> double** my_a;<br />public:<br /> my_class2d(double** a) : my_a(a) {}<br /> void operator() (const blocked_range2d&lt;int&gt;& r) const<br /> {<br /> double** a3 = my_a;<br /> int k = r.rows().begin();<br /> int end = r.rows().end(); //or r.cols().end()<br /> for(int i=k+1; i != end; i++)<br /> {<br /> a3[i][k] = a3[i][k]/a3[k][k];<br /> for(int j=k+1; j != end; j++)<br /> a3[i][j] = a3[i][j] - a3[i][k]*a3[k][j];<br /> }<br /> }<br />};<br />//But this 2D attempt gives wrong results<br />[/code]</p>
<p>Is this structure even parallelizable with TBB? If yes, then with a 1D range or a 2D range? My 1D-range example gives correct results but is far slower than even the serial version, and the 2D version is fast but gives wrong results. Any help?</p>
Fri, 01 Jul 11 04:33:01 -0700 | akhal | topic 283016