<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Wed, 25 Nov 2009 11:37:39 -0800 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/feed" rel="self" type="application/rss+xml" />
    <title>Intel Software Network - <![CDATA[ Intel® AVX and CPU Instructions ]]> feed</title>
    <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>PTEST improvement?</title>
      <description><![CDATA[ Hi,<br /><br />another benchmark: after testing compare performance, the next step was to measure branching on compares, so I wanted to show the impact of ptest compared with pmovmskb + cmp. But my results show that ptest is slower in almost all cases. See the first page of <a target="_blank" href="http://vir.homelinux.org/compare.pdf">compare.pdf</a> for the results. I would understand ptest and pmovmskb showing the same speed if both instructions count as being in the "integer domain", and therefore both incur the same 1-cycle penalty for domain crossing (is this correct?).<br /><br />Am I understanding correctly that in principle ptest and pmovmskb execute equally fast, and that the cmp-jump can be optimized via macro-fusion so that both vector-branching implementations really are equivalent (except for the one additional GPR that the pmovmskb version requires)? Where then could the difference come from?<br /><br />(Yes, I will have to try out the simulator. I have not found the time to try it yet.)<br /><br />For reference:<br />(float_v::operator&lt;).isFull():<br /><br />with ptest:<br />cmpltps %xmm1,%xmm0<br />ptest  %xmm2,%xmm0<br />jae<br />(where xmm2 is 0xfffff...)<br /><br />without ptest:<br />cmpltps %xmm1,%xmm0<br />pmovmskb %xmm0,%ecx<br />cmp    $0xffff,%ecx<br />je<br /><br /><br />!(float_v::operator&lt;).isEmpty():<br /><br />with ptest:<br />cmpltps %xmm1,%xmm3<br />ptest  %xmm3,%xmm3<br />je<br /><br />without ptest:<br />cmpltps %xmm1,%xmm3<br />pmovmskb %xmm3,%ecx<br />test   %ecx,%ecx<br />je<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/70098/</link>
      <pubDate>Tue, 24 Nov 2009 00:59:42 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/70098/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Low rate on SSE2 code</title>
      <description><![CDATA[ <p>Hi!<br />Why does this scalar SSE2 code (all data in L1 cache) execute at only 1.49 flop/cycle on Core 2?</p>
<p>L10:<br /> movsd (%esi), %xmm5<br /> movsd (%ebx), %xmm4<br /> addl $4, %edi<br /> mulsd %xmm5, %xmm4<br /> addsd %xmm4, %xmm3<br /> movsd (%ecx), %xmm4<br /> mulsd %xmm5, %xmm4<br /> addsd %xmm4, %xmm2<br /> movsd (%edx), %xmm4<br /> mulsd %xmm5, %xmm4<br /> addsd %xmm4, %xmm1<br /> movsd (%eax), %xmm4<br /> mulsd %xmm5, %xmm4<br /> movsd 8(%esi), %xmm5<br /> addsd %xmm4, %xmm0</p>
<p> movsd 448(%ebx), %xmm4<br /> mulsd %xmm5, %xmm4<br /> addsd %xmm4, %xmm3<br /> movsd 448(%ecx), %xmm4<br /> mulsd %xmm5, %xmm4<br /> addsd %xmm4, %xmm2<br /> movsd 448(%edx), %xmm4<br /> mulsd %xmm5, %xmm4<br /> addsd %xmm4, %xmm1<br /> movsd 448(%eax), %xmm4<br /> mulsd %xmm5, %xmm4<br /> movsd 16(%esi), %xmm5<br /> addsd %xmm4, %xmm0</p>
<p> movsd 896(%ebx), %xmm4<br /> mulsd %xmm5, %xmm4<br /> addsd %xmm4, %xmm3<br /> movsd 896(%ecx), %xmm4<br /> mulsd %xmm5, %xmm4<br /> addsd %xmm4, %xmm2<br /> movsd 896(%edx), %xmm4<br /> mulsd %xmm5, %xmm4<br /> addsd %xmm4, %xmm1<br /> movsd 896(%eax), %xmm4<br /> mulsd %xmm5, %xmm4<br /> movsd 24(%esi), %xmm5<br /> addsd %xmm4, %xmm0</p>
<p> addl $32, %esi<br /> movsd 1344(%ebx), %xmm4<br /> addl $1792, %ebx<br /> mulsd %xmm5, %xmm4<br /> addsd %xmm4, %xmm3<br /> movsd 1344(%ecx), %xmm4<br /> addl $1792, %ecx<br /> mulsd %xmm5, %xmm4<br /> addsd %xmm4, %xmm2<br /> movsd 1344(%edx), %xmm4<br /> addl $1792, %edx<br /> mulsd %xmm5, %xmm4<br /> addsd %xmm4, %xmm1<br /> movsd 1344(%eax), %xmm4<br /> addl $1792, %eax<br /> mulsd %xmm5, %xmm4<br /> cmpl $56, %edi<br /> addsd %xmm4, %xmm0</p>
<p> jne L10</p>
<p> </p>
<p>How can this code be rewritten to get close to the theoretical peak rate (2.0 flop/cycle)?</p> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/70086/</link>
      <pubDate>Mon, 23 Nov 2009 11:39:47 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/70086/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>What information can I get to estimate DRAM bandwidth?</title>
      <description><![CDATA[ <span class="sectionHeadingText">I am working on estimating the active time of DRAM. How can I get information about the CPU's memory access times?<br />Thanks a lot!</span> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/69942/</link>
      <pubDate>Tue, 17 Nov 2009 08:17:55 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/69942/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Understanding my Benchmarks</title>
      <description><![CDATA[ Hi,<br />I wrote a benchmark to compare the possible speedup of SSE vs. scalar execution. But I don't understand the results I get.<br />The following loop:<br />loop:<br />movaps 0x10(%rax),%xmm1<br />cmpltps %xmm1,%xmm0<br />movaps 0x20(%rax),%xmm0<br />cmpltps %xmm0,%xmm1<br />movaps 0x30(%rax),%xmm1<br />cmpltps %xmm1,%xmm0<br />add $0x40,%rax<br />movaps (%rax),%xmm0<br />cmpltps %xmm0,%xmm1<br />cmp    %rax,%rbx<br />ja     loop<br />appears to require ~2 cycles per movaps+cmpltps (8 cycles per iteration) on a Nehalem processor. (The memory it iterates over is smaller than the L1 cache.)<br /><br />The generated code for the scalar case looks like this:<br />loop:<br />movss  0x4(%rax),%xmm1<br />ucomiss %xmm0,%xmm1<br />seta   %dl<br />movss  0x8(%rax),%xmm0<br />ucomiss %xmm1,%xmm0<br />seta   %dl<br />movss  0xc(%rax),%xmm1<br />ucomiss %xmm0,%xmm1<br />seta   %dl<br />add  $0x10,%rax<br />movss  (%rax),%xmm0<br />ucomiss %xmm1,%xmm0<br />seta   %dl<br />cmp    %rax,%rbx<br />ja     loop<br />This requires ~1.33 cycles per ucomiss (i.e. 5.33 cycles per iteration) on the same processor. (Same memory size, too.)<br /><br />The result is that to compare N floats with SSE I need N/2 cycles. Without SSE I need 1.33*N cycles. That's a speedup by a factor of 2.66. I expected something closer to a factor of 4...<br /><br />Now I'm trying to understand where this comes from:<br />1. The cmpps result is not used, therefore only the throughput should count, i.e. I can execute one cmpps per cycle. Do the movaps account for the second cycle? Could the movaps execute in parallel with cmpps if they used a different register?<br />2. The ucomiss instruction has a latency of 1 cycle. The result of seta is not used, therefore the instruction can run in parallel with everything else. The movss instruction can execute in parallel with the previous ucomiss and seta. 
So 1.33 looks sensible, but I can't fully understand where this comes from.<br /> Question: Does the second seta have to wait for the first one to retire because it writes to the same register?<br /><br />Can anybody help me understand instruction-level parallelism better?<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/69800/</link>
      <pubDate>Tue, 10 Nov 2009 08:13:08 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/69800/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Help on detecting stalls (identifying structural hazards) in assembly code</title>
      <description><![CDATA[ Hi All,<br /> Our project is to optimize instruction scheduling in gcc by detecting structural hazards. We are trying to come up with a test case for this: a scenario in which one instruction is stalled because the resource it needs is being used by another instruction. However, we have been unable to do so.<br /><br />1. We wrote a C program that performs floating point multiplications, divisions and additions. However, in both the 'progname.s' file and the 'progname.c.190r.sched2' file, the instructions were scheduled for execution in sequential order. We couldn't find a way to detect a stall by looking at the generated assembly code.<br />Question: How do we detect that a stall has occurred if execution is being carried out in a particular sequence?<br />We would also like to know of a tool which, given a 'progname.s' file, reports the execution time of each instruction and the clock cycle in which a stall will occur if execution is carried out in this sequence.<br /><br />2. We saw that integer operations were already being performed during compilation. Hence we were left with only floating point operations to examine for structural hazards. <br />Question: Once a stall is detected because the floating point unit is currently being used by another instruction, which instruction can be scheduled in to avoid this stall (since integer operations are performed at compile time and the floating point units are in use)?<br /><br />Target machine architecture: 686<br />Working on: Intel Pentium Dual Core processor<br /><br />Thanking you,<br />Dhiraj<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/69472/</link>
      <pubDate>Wed, 28 Oct 2009 10:18:15 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/69472/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Is there a standard format for providing architecture-specific information to software?</title>
      <description><![CDATA[ Hi All,<br /> Our project requires us to supply architecture-specific information to gcc (not via the .md files in gcc). We need to specify the number of cycles taken per instruction for the 686 architecture as information to gcc.<br /><br />Question: Is there a standard format for defining architecture-specific information for software that requires it? Is the architecture-specific information for the Pentium Dual Core architecture available in a format that can be read by any software requiring it?<br /><br />Target Architecture: 686 processor<br />Working on: Intel Pentium Dual Core processor<br /><br />Thanking You,<br /> Dhiraj.<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/69383/</link>
      <pubDate>Sun, 25 Oct 2009 16:24:53 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/69383/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>How to turn off out-of-order execution on an Intel processor</title>
      <description><![CDATA[ Hi All,<br /> Our project is to optimize instruction scheduling in gcc by detecting structural hazards. The algorithm employed requires that the processor perform no out-of-order execution.<br /><br />Question: Is there a command or mechanism to turn off out-of-order execution on an Intel processor?<br />Target Architecture: 686 processor<br />Working on: Intel Pentium Dual Core processor<br /><br />Thanking You,<br />Dhiraj. ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/69382/</link>
      <pubDate>Sun, 25 Oct 2009 14:32:29 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/69382/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Parallel instructions for detecting MSB in array of bytes</title>
      <description><![CDATA[ First time posting, not sure if this is the correct forum, but...<br /><br />I have a large array of bytes, up to 1600 at most, but mostly up to 128.<br /><br />Bytes typically carry 7 bits of information, and the MSB is used as a sentinel, so the MSB is set in a small portion of the bytes.<br /><br />Currently I'm iterating over them in a loop, but is there a better way to use SSEx to process 128 bytes in parallel and get back an SSE vector with bits set for each byte?<br /><br />Any suggestions?<br /><br />Thank you,<br />craptacus ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/69127/</link>
      <pubDate>Fri, 16 Oct 2009 01:58:32 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/69127/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Why are only CS, IP and EFLAGS saved on an interrupt?</title>
      <description><![CDATA[ I am new to assembly programming. I was reading about 386 interrupts and learned that only CS, IP and EFLAGS are saved as part of an interrupt, and that they are popped back on iret. But I am wondering why all the visible registers, segment registers, etc. are not saved as well?<br /><br />Please excuse me if I have misunderstood something.<br /><br />Thanks for your effort in helping me... ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/68619/</link>
      <pubDate>Fri, 16 Oct 2009 01:34:34 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/68619/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Out of order execution</title>
      <description><![CDATA[ Is there a simulator and/or a general procedure one can follow to predict what instructions will be executed in what order (assuming all data is in the L1 cache)? I'm having a hard time comprehending why a given instruction sequence executes much faster than another. I suspect it's due to the out of order execution and register renaming, but I've found no tangible reason yet. Any help would be appreciated. ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/69140/</link>
      <pubDate>Thu, 15 Oct 2009 23:53:15 -0700</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/69140/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
  </channel></rss>