The JITter Conundrum - Just in Time for Your Traffic Jam

In interpreted languages, it just takes longer to get stuff done. I earlier gave the example where the Python source code a = b + c results in a BINARY_ADD byte code that takes 78 machine instructions to do the add, whereas it is a single native ADD instruction in a compiled language like C or C++. How can we speed this up? Or, as a performance expert would say, how do I decrease pathlength while keeping CPI in check? There is one common shortcut around this traffic jam of interpreted languages: if you are going to run a line of code repeatedly, say in a loop, why not translate it into the super-efficient machine code version?
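The byte code in question is easy to see with CPython's standard dis module. A small sketch (note the opcode name varies by version: BINARY_ADD up to Python 3.10, BINARY_OP from 3.11):

```python
import dis

def add(b, c):
    a = b + c
    return a

# Show the byte code the interpreter dispatches for "a = b + c".
dis.dis(add)

# The add appears as BINARY_ADD (<= 3.10) or BINARY_OP (>= 3.11).
ops = [ins.opname for ins in dis.get_instructions(add)]
print(any(op.startswith("BINARY") for op in ops))
```

Each of those byte codes is what costs dozens of machine instructions to dispatch and execute in the interpreter loop.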

Host OS Ubuntu 14.04, Virtual CentOS 7.1

So I am trying to start a virtual machine on Ubuntu 14.04 using Virtual Machine Manager.

I used the following instructions:


Minus the kernel patch part (which may be the issue).

When the virtual machine starts, this is the status:

Writing a kernel module for Xeon Phi

Hello, I have written a simple kernel-space memory allocator module for the Xeon Phi, but I have yet to figure out how to build it.

It should compile from the host, but there is no Xeon Phi-specific build directory under /lib/modules/.

I copied one from the ramfs of a running Xeon Phi, but there the 'build' folder is not initialized.

Is there any documentation or an article to guide me through this?
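For what it's worth, out-of-tree modules for the card are normally built against the card's kernel source shipped with the MPSS package, not against anything under the host's /lib/modules/. A minimal Makefile sketch; the module name is illustrative, and KDIR and the cross-compiler prefix are assumptions you would adjust to your MPSS install:

```make
# Out-of-tree module build against the coprocessor's k1om kernel.
# KDIR is an assumption: point it at the kernel source/headers that
# ship with your MPSS release (not the host's /lib/modules/).
obj-m := phi_alloc.o            # phi_alloc.c: the allocator module (name illustrative)

KDIR  ?= /path/to/mpss/k1om-kernel-src
CROSS ?= x86_64-k1om-linux-     # MPSS cross-toolchain prefix (assumption)

all:
	$(MAKE) -C $(KDIR) M=$(CURDIR) ARCH=k1om CROSS_COMPILE=$(CROSS) modules

clean:
	$(MAKE) -C $(KDIR) M=$(CURDIR) clean
```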

Core Challenge In Speeding Up Python, PHP, HHVM, Node.js...

A traditional compiler translates a high-level computer program into machine code for the CPU you want to run it on. An interpreted language translates a high-level language into the machine code for some imaginary CPU. For historical reasons, this imaginary CPU is called a "virtual machine" and its instructions are called "byte code." One advantage of this approach is development speed: creating this byte code for an imaginary CPU is usually much faster than creating machine code for a real CPU. It's also more portable: that byte code can in theory run on a wide variety of machines. So a developer can write high-level code and debug it rapidly, rather than waiting for a long recompile.
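To make the "imaginary CPU" concrete, here is a toy dispatch loop in Python. The three-instruction set and all names are invented for illustration, but real byte-code interpreters such as CPython's have the same basic shape: a loop that fetches portable byte codes and dispatches on them:

```python
# A toy "virtual machine": the imaginary CPU understands three byte codes.
LOAD_CONST, BINARY_ADD, RETURN = 0, 1, 2

def run(code, consts):
    stack = []
    pc = 0                           # program counter into the byte code
    while pc < len(code):
        op = code[pc]; pc += 1
        if op == LOAD_CONST:         # push consts[arg] (arg is the next byte)
            stack.append(consts[code[pc]]); pc += 1
        elif op == BINARY_ADD:       # pop two operands, push their sum
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
        elif op == RETURN:           # pop and return the result
            return stack.pop()

# byte code for "return 2 + 3"
program = bytes([LOAD_CONST, 0, LOAD_CONST, 1, BINARY_ADD, RETURN])
print(run(program, consts=[2, 3]))   # → 5
```

The byte string in `program` could run unchanged on any machine with this interpreter, which is exactly the portability argument above; the dispatch overhead per instruction is the performance cost.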

offload_transfer: array of variables?


I would like to pre-allocate a number of buffers for later data transfers from CPU to MIC, using explicit offloading in C++.

It works nicely if each buffer corresponds to an explicit variable name, as in the double-buffering examples. However, I would like a configurable number of such buffers (more than two), i.e. an array of buffers. (The buffers are used for asynchronous processing on the MIC, and I need quite a few of them.)

Efficiently Use KNC Instructions on Unaligned Data

The MIC requires strict 64-byte data alignment to utilize the VPU, but why? I found that SPARC also has such a requirement, but other multi-core CPUs can handle unaligned data.

As the MIC can automatically vectorize a for loop over data (with compiler optimization), what happens if the data is unaligned? Will the auto-vectorization still work? If yes, how?


I would like to clarify my problem here.
