Intel® Advanced Vector Extensions

Links to instruction documentation

Using Hardware Features in Intel® Architecture to Achieve High Performance in NFV


Communications software requires extremely high performance, with data being exchanged in a huge number of small packets. One of the tenets of developing Network Functions Virtualization (NFV) applications is that you virtualize as far as possible, but still optimize for the underlying hardware where necessary.

  • Developers
  • Linux*
  • Networking
  • NFV
  • DPDK
  • Intel® Advanced Vector Extensions
  • Vectorization

    Evaluating the Power Efficiency and Performance of Multi-core Platforms Using HEP Workloads

    As Moore’s Law drives the silicon industry towards higher transistor counts, processor designs are becoming more and more complex. Development spans core counts, execution ports, vector units, uncore architecture, and instruction sets. This increasing complexity has made access to shared memory the major limiting factor, so that feeding the cores with data is a real challenge. At the same time, the strong focus on power efficiency paves the way for power-aware computing and for less complex architectures in data centers. In this paper we examine these trends and present the results of our experiments with the Intel® Xeon® E5 v3 (code-named Haswell-EP) processor family and highly scalable High-Energy Physics (HEP) workloads.
  • Developers
  • Linux*
  • Server
  • Haswell
  • CERN
  • NUMA
  • Intel® Advanced Vector Extensions
  • Code Modernization
  • Data Center
  • Parallel Computing
  • Power Efficiency
  • Threading
  • Vectorization

    How long does a 6700K take to multiply two integers?


    I just read on Wikipedia that an IBM 1620 took 17 ms to multiply two integers, and I was wondering how long a modern CPU takes to execute the same operation.

    I hope I'm in the right forum. I found this question from 2008, which, going by Google, seems to suggest that I should ask my question here.

    Regardless, I'm looking forward to your answers.
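    One rough way to approach the question above is to time a long chain of dependent multiplies and divide by the chain length. The sketch below is only an illustration: the function names `chained_multiply` and `ns_per_multiply` are made up for this example, the result includes loop overhead and timer granularity, and it assumes `n` is greater than zero. Published instruction tables commonly list a latency of about 3 cycles for a 64-bit integer multiply on Skylake-class cores such as the 6700K, so a measurement in the region of one nanosecond would be plausible, but the number you see depends on the compiler, its flags, and the clock speed at the time.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Multiply a running value by `factor` n times. Each multiply needs the
   previous result, so the loop is limited by multiply latency rather
   than by how many multiplies the core can start per cycle.
   Unsigned overflow wraps around, which is well defined in C. */
static uint64_t chained_multiply(uint64_t seed, uint64_t factor, size_t n)
{
    uint64_t x = seed;
    for (size_t i = 0; i < n; ++i)
        x *= factor;
    return x;
}

/* Rough average CPU time per multiply in nanoseconds (n must be > 0).
   The final product is written to *sink so the compiler cannot discard
   the loop as dead code. */
static double ns_per_multiply(size_t n, uint64_t *sink)
{
    volatile uint64_t opaque = 3;   /* volatile read hides the value from the optimizer */
    uint64_t factor = opaque;

    clock_t start = clock();
    *sink = chained_multiply(1, factor, n);
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)n;
}
```

    Calling `ns_per_multiply(100000000, &sink)` a few times and taking the smallest value gives a more stable estimate than a single run, since `clock()` has coarse resolution on some systems.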

    Intel® System Studio (Samples and Tutorials)

    Intel® System Studio is a comprehensive and integrated tool suite that provides developers with advanced system tools and technologies to help accelerate the delivery of next-generation, energy-efficient, high-performance, and reliable embedded and mobile devices. We have created a list of samples that demonstrate different features of Intel System Studio, along with tutorials that show how to use those features in your applications. By downloading or copying all or any part of the sample source code, you agree to the terms of the Intel® Sample Source Code License Agreement.
  • Developers
  • Linux*
  • Microsoft Windows* 10
  • Microsoft Windows* 8.x
  • Yocto Project
  • Internet of Things
  • Windows*
  • C/C++
  • Beginner
  • Intermediate
  • Intel® System Studio
  • system studio sample
  • Intel System Studio code sample
  • system studio tutorials
  • system studio example code
  • Intel® Advanced Vector Extensions
  • Intel® Streaming SIMD Extensions
  • Debugging
  • Development Tools
  • Firmware
  • Intel® Atom™ Processors
  • Intel® Core™ Processors
  • Optimization
  • Parallel Computing
  • Vectorization

    Do Non-Temporal Loads Prefetch?

    I can't find any information on this anywhere. Do non-temporal load instructions (e.g. MOVNTDQA), which use the separate non-temporal store rather than the cache hierarchy, do any prefetching? How do their latency and bandwidth compare to those of a normal load from main memory?

    Is the right way to think about this store that it is as "close" to main memory as the L3 cache, but also as "close" to the register files as the L1 cache?

    Possible bug in SDE - jump with 16-bit operand size

    The Software Developer's Manual, and the corresponding AMD document, indicate that after the new RIP is calculated, it is then truncated to whatever the instruction's operand size is.

    To see if this was actually true, I assembled a JMP instruction with a 66 prefix, to set an operand size of 16 bits.  I would expect this to jump to a 16-bit address.

    Running this instruction on my AMD Steamroller CPU, I got a segmentation fault.

    But running it with SDE, the trace shows a jump without truncating the destination address.

    It would appear that SDE is incorrect.
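    The truncation rule described in this post can be written down as a small model. This is a sketch of the rule as the post states it, not a statement about what any particular processor or emulator does; `jmp_rel_target` and its parameters are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a near relative JMP as described above: compute the new
   instruction pointer, then truncate it to the operand size.
   next_rip     - address of the instruction that follows the JMP
   disp         - sign-extended displacement encoded in the JMP
   operand_bits - effective operand size: 16, 32 or 64 */
static uint64_t jmp_rel_target(uint64_t next_rip, int64_t disp,
                               unsigned operand_bits)
{
    assert(operand_bits == 16 || operand_bits == 32 || operand_bits == 64);

    uint64_t target = next_rip + (uint64_t)disp;  /* wraps modulo 2^64 */
    if (operand_bits == 64)
        return target;                            /* nothing to truncate */

    /* Keep only the low operand_bits bits of the target. */
    return target & ((UINT64_C(1) << operand_bits) - 1);
}
```

    Under this model, a jump from `next_rip = 0x400005` with a displacement of `0x10` lands at `0x400015` with a 64-bit operand size but at `0x0015` with a 16-bit operand size. An address that low is normally unmapped in a user process, which would be consistent with the segmentation fault reported on the hardware run.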
