Utilizing Intel® AVX with Cakewalk SONAR* X1


Introduction to Intel® AVX

Intel® AVX is a new 256-bit instruction set extension to SSE designed for floating point (FP) intensive applications. It will be released in 2011 as part of the Intel® Core™ processor family and will be present in platforms ranging from notebooks to servers. Intel® AVX improves performance through wider vectors, new extensible syntax, and rich functionality, resulting in better management of data in applications such as image and audio/video processing, scientific simulation, financial analytics, and 3D modeling and analysis.

How to detect OS and hardware support for Intel® AVX

Instructions for detecting OS and hardware support for Intel® AVX are contained in the programming reference manual [1]. These checks need to be executed only once, at application initialization, to select the right code paths for the AVX support available in the hardware and OS. For more information, see section 2.2 of the Intel® Advanced Vector Extensions Programming Reference [2].
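As an illustration of that sequence, here is a minimal sketch assuming the Microsoft compiler's __cpuid and _xgetbv intrinsics (available with Visual Studio 2010 SP1); the function name is illustrative, not code from SONAR*. It checks CPUID leaf 1 for the OSXSAVE and AVX feature bits and then uses XGETBV to confirm that the OS preserves both XMM and YMM state:

#include <intrin.h>      // __cpuid
#include <immintrin.h>   // _xgetbv (Visual Studio 2010 SP1)

// Returns true if both the CPU and the OS support Intel AVX.
// Intended to run once at application startup to select the AVX or legacy code path.
static bool IsAvxSupported()
{
	int cpuInfo[4] = { 0 };
	__cpuid(cpuInfo, 1);                               // CPUID leaf 1

	bool osUsesXsave = (cpuInfo[2] & (1 << 27)) != 0;  // ECX bit 27: OSXSAVE
	bool cpuHasAvx   = (cpuInfo[2] & (1 << 28)) != 0;  // ECX bit 28: AVX

	if (!osUsesXsave || !cpuHasAvx)
		return false;

	// XGETBV: verify the OS saves/restores XMM and YMM state (XCR0 bits 1 and 2)
	unsigned __int64 xcr0 = _xgetbv(0);
	return (xcr0 & 0x6) == 0x6;
}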

Compiler support

Microsoft Visual Studio 2010* supports the /arch:AVX switch that compiles Intel® AVX intrinsic code. For more information on the compiler switch, please visit http://msdn.microsoft.com/en-us/library/7t5yh4fd.aspx

AVX compiler intrinsics can be found in the Intel header file immintrin.h, which ships with Visual Studio 2010. Windows 7* SP1 is required to execute AVX instructions on Intel® Core™ hardware.
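As a minimal, hypothetical example (not taken from SONAR*), the routine below scales a buffer of doubles with AVX intrinsics from immintrin.h; building it with /arch:AVX lets the compiler emit the corresponding VEX-encoded instructions:

#include <immintrin.h>   // Intel AVX intrinsics

// Scale an array of doubles, processing 4 elements per AVX instruction.
void ScaleBuffer(double* pBuf, int count, double scale)
{
	__m256d vScale = _mm256_set1_pd(scale);            // broadcast scale into all 4 lanes
	int i = 0;
	for (; i + 4 <= count; i += 4)
	{
		__m256d v = _mm256_loadu_pd(pBuf + i);         // unaligned 256-bit load
		_mm256_storeu_pd(pBuf + i, _mm256_mul_pd(v, vScale));
	}
	for (; i < count; ++i)                             // scalar tail for the remainder
		pBuf[i] *= scale;
}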

Background on Cakewalk SONAR* X1 and Intel® AVX applicability

SONAR* is Cakewalk's technology-leading digital audio workstation and runs natively on Intel® Core™ processors on Windows* x64 and x86 platforms. For 12 years, SONAR* has been the first workstation to support each of Microsoft's* operating systems, and the first to pioneer support for many Intel technology developments: Hyper-Threading, multiprocessor architectures, multicore processing, and 64-bit processing. In June 2009, Cakewalk* was the only audio software company named a finalist in Microsoft's Global ISV Partner* Awards [3]. In September 2009, Cakewalk* was featured by Intel in the launch of the Intel® Core™ i7 processor and in the keynote presentations at the IDF conference [4].

SONAR* is used by musicians, content creation professionals, and mix engineers worldwide. It is a scalable application that adapts to the available CPU resources, allowing the recording, mixing, and processing of large numbers of audio tracks and audio DSP plug-ins. While streaming audio in SONAR*, the bulk of the work is processing audio buffers for tasks such as mixing and DSP, as well as performing various data and bit depth conversions. The AVX instruction set found in the Intel® Core™ processor family excels at these tasks, and according to Cakewalk's* results, code written to take advantage of it outperforms the same code on earlier-generation processors. Intel® AVX instructions may be used in both 32-bit and 64-bit applications.

Savings in CPU processing can translate into real gains for customers. These processing savings reduce load while playing complex projects, allowing glitch-free and responsive performance even while playing hundreds of tracks and virtual instruments at latencies (audio buffer sizes) as low as 1 millisecond.



 

Intel® AVX applicability in audio software (Case studies)

All code examples below are samples of real use cases as implemented within Cakewalk's SONAR* X1 application; they are reproduced here with permission from Cakewalk*, Inc. The optimizations were inspired by functions identified as performance hotspots by the Intel® VTune™ performance analyzer and other profiling tools while running benchmark workloads.

A key advantage of Intel® AVX is its 256-bit YMM registers, compared to the 128-bit XMM registers of SSE2. This allows the CPU to process twice the data in a single AVX instruction, providing a performance boost when processing large numbers of audio buffers. These gains become even more appreciable when processing very small, low-latency audio buffers, since this activity consumes the most CPU resources.
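The contrast can be sketched with a simple (hypothetical) gain operation on four doubles: with SSE2 it takes two 128-bit multiplies, while a single 256-bit AVX multiply covers the same data.

#include <emmintrin.h>   // SSE2 intrinsics
#include <immintrin.h>   // Intel AVX intrinsics

// SSE2 (128-bit XMM registers): two multiplies, two doubles each.
void GainSse2(double* p, double gain)
{
	__m128d g = _mm_set1_pd(gain);
	_mm_storeu_pd(p,     _mm_mul_pd(_mm_loadu_pd(p),     g));
	_mm_storeu_pd(p + 2, _mm_mul_pd(_mm_loadu_pd(p + 2), g));
}

// Intel AVX (256-bit YMM registers): one multiply covers all four doubles.
void GainAvx(double* p, double gain)
{
	__m256d g = _mm256_set1_pd(gain);
	_mm256_storeu_pd(p, _mm256_mul_pd(_mm256_loadu_pd(p), g));
}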

Audio processing software typically performs many audio bit depth conversions while streaming audio buffers through the playback graph, rendering, and mixing. Examples of such conversions are 24-bit integer to 32-bit float, 24-bit integer to 64-bit double precision, and 32-bit float to 64-bit double precision. Some common conversions targeted for optimization are shown below in un-optimized and optimized form. The routine that follows converts 24-bit integer audio buffers to 64-bit double precision. Cakewalk* provided the following sample of un-optimized 24-bit integer to 64-bit double conversion code:

 

// Assembles a signed 24-bit sample from three little-endian bytes (BYTE is the Windows typedef for unsigned char);
// casting p[2] to int sign-extends the most significant byte.
#define INT24(p) (((int)(p[2]) << 16) | ((unsigned)(BYTE)(p[1]) << 8) | ((unsigned)(BYTE)(p[0])))

static inline void CopyInt24ToFlt64( double* pfDst, void* plSrc, int cBuf, double dScaleFactor )
{
	if (0 == cBuf)
		return;

	// Walk the buffers from the end toward the start, so the wider destination
	// does not overwrite unread source samples if the buffers overlap.
	char* pch = ((char*) plSrc) + (cBuf * 3) - 3;
	for (int ix = cBuf-1; ix >= 0; ix--)
	{
		int l = INT24(pch);
		pfDst[ix] = double( dScaleFactor * l );
		pch -= 3;
	}
}

 

This sample was optimized with the Intel® AVX instruction set to provide up to a 1.84x improvement on 32-bit binaries and 1.69x on 64-bit binaries (tests conducted on 2.2 GHz Intel® Core™ development systems internally by Intel engineers in Intel labs). For the Intel® AVX implementation, aligned stores provided more significant gains than unaligned stores. In the optimized code, we first check the alignment of the store pointer and perform serial operations until the destination is aligned; once the destination address is aligned, we switch to the AVX intrinsics. A known issue with _mm256_zeroall() in Visual Studio 2010 is that it may zero out some of the XMM registers, producing invalid results if those registers hold data needed for further processing. Visual Studio 2010 SP1 fixes this issue, so using Visual Studio 2010 SP1 is recommended.

 

static inline void CopyInt24ToFlt64AVX( double* pfDst, void* plSrc, int cBuf, double dScaleFactor ) 
{
	if (0 == cBuf)
		return;
 
	char* pch = ((char*) plSrc);

	__m128i mask = _mm_set_epi8(0x84,0x0B,0x0A,0x09,0x83,0x08,0x07,0x06,0x82,0x05,0x04,0x03,0x81,0x02,0x01,0x00); // Shuffle mask that expands each packed 24-bit sample into a 32-bit lane (bytes with the high bit set produce zeros)
	__m256d dScale = _mm256_set1_pd(dScaleFactor); // scale factor for multiplication
	
	int elem = 0;
	int i = 0;

	// In AVX, misaligned stores can have a performance impact, so we check and perform serial operations until the store address
	// is aligned to a 32-byte boundary (hence the AND with 0x1F); the loop runs while that value is non-zero
	while((i < cBuf) && (((__int64)pfDst & 0x1F) != 0))
	{
		int l = INT24(pch);
		*pfDst = double( dScaleFactor * l );
		pch += 3;
		pfDst++;
		i++;
		elem++;
	}

	int loopcnt = cBuf-elem;
	int srcsize = 0;

	// vzeroupper or vzeroall is needed to avoid the AVX/SSE transition penalty; calling either one is sufficient
	_mm256_zeroall();
	for (int ix = 0; (ix < (loopcnt)/8) && (srcsize+32 <= loopcnt*3); ix++)
	{
		__m128i src = _mm_loadu_si128((__m128i*)pch); // Load 128 bits from src
		__m128i src1 = _mm_loadu_si128((__m128i*)(pch+12)); // Load the next 128 bits from src (pch is a char array and only 12 chars are consumed per conversion)

		__m128i res1 = _mm_shuffle_epi8(src, mask); // shuffle to place each 24-bit value in a 32-bit lane, so the data becomes pch0 pch1 pch2 0
		__m128i res2 = _mm_shuffle_epi8(src1, mask);// needed because the int-to-double conversion intrinsic expects 32-bit integers
		res1 = _mm_slli_epi32(res1, 0x08); // shift left then arithmetic shift right to sign-extend while preserving the original value (sign needed for negative samples)
		res1 = _mm_srai_epi32(res1, 0x08);
		res2 = _mm_slli_epi32(res2, 0x08);
		res2 = _mm_srai_epi32(res2, 0x08);

		__m256d dRes1 = _mm256_cvtepi32_pd(res1); // int to double conversion
		__m256d dRes2 = _mm256_cvtepi32_pd(res2);
			
		dRes1 = _mm256_mul_pd(dScale, dRes1); // multiply with dScale
		dRes2 = _mm256_mul_pd(dScale, dRes2);
		
		_mm256_store_pd(pfDst, dRes1); // Store the result into dst array
		_mm256_store_pd(pfDst+4, dRes2);
		
		pfDst = pfDst+8;
		pch = pch+24;
		srcsize += 24;
		elem += 8;
	}
	
	// Handle the remaining elements (fewer than 8) that cannot be processed by the AVX loop
	for (int ix = 0; ix < (cBuf-elem); ix++)
	{
		int l = INT24(pch);
		pfDst[ix] = double( dScaleFactor * l );
		pch += 3;
	}

}

 

The routine that follows converts 32-bit float to 64-bit double precision audio buffers. This code is typically exercised while mixing a 32-bit source to a 64-bit destination in SONAR*. With SSE2, this could be optimized by concurrently processing 2 double precision values due to the 128-bit width of the XMM register. With the 256-bit YMM registers of Intel® AVX, we are able to store 4 doubles at a time, doubling the throughput of the instruction.

Original un-optimized C code:

 

static inline void CopyFlt32ToFlt64(double* dTempBuf, float* pSrc, int cSamp, int nSrcInterleave)
{
	// cSamp * nSrcInterleave is the total sample count across all interleaved channels.
	// Start past the end of both buffers and copy backwards, so the wider destination
	// does not overwrite unread source samples if the buffers overlap.
	float*  pCnvSrc = (float*)pSrc + cSamp * nSrcInterleave;
	double* pCnvDst = (double*)dTempBuf + cSamp * nSrcInterleave;

	for (int ix = 0; ix < cSamp * nSrcInterleave; ++ix)
		*(--pCnvDst) = *(--pCnvSrc);
}

 

This was optimized with the Intel® AVX instruction set (below) to provide up to a 1.40x improvement on 32-bit binaries and 1.46x on 64-bit binaries (tests conducted on 2.2 GHz Intel® Core™ development systems internally by Intel engineers in Intel labs).

 

static inline void CopyFlt32ToFlt64AVX(double* dTempBuf, float* pSrc, int cSamp, int nSrcInterleave)
{
	float*  pCnvSrc = (float*)pSrc; 
	double* pCnvDst = (double*)dTempBuf;

	int ix = 0;
	int elem = 0;

	// perform serial stores until store addresses are aligned
	while((ix++ < (cSamp * nSrcInterleave)) && (((__int64)pCnvDst & 0x1F) != 0))
	{
		*pCnvDst = *pCnvSrc;
		pCnvDst++;
		pCnvSrc++;
		elem++;
	}


	_mm256_zeroall();
	// handle 8 samples at a time (two 128-bit loads, two 256-bit stores)
	for (int iy = 0; iy < ((cSamp * nSrcInterleave)-ix)/8; iy++)
	{
		__m128 src = _mm_loadu_ps(pCnvSrc);
		__m128 src2 = _mm_loadu_ps(pCnvSrc+4);
	
		__m256d res1 = _mm256_cvtps_pd(src);
		__m256d res2 = _mm256_cvtps_pd(src2);
	
		_mm256_store_pd(pCnvDst, res1);
		_mm256_store_pd(pCnvDst+4, res2);
	
		pCnvDst = pCnvDst+8;
		pCnvSrc = pCnvSrc+8;
		elem += 8;	
	}

	// copy any leftover samples serially
	int iz = 0;
	for (iz = 0; iz < (cSamp * nSrcInterleave)-elem; iz++)
	{
		*(pCnvDst++) = *(pCnvSrc++);
		
	}
	
	
}

 

 

Measurable Intel® AVX performance gains in SONAR* X1

SONAR* X1, the latest generation of the SONAR* product family, has code optimized for Intel® AVX-enabled processors. Running on a processor from the Intel® Core™ processor family, Cakewalk* saw CPU gains while playing back projects that exercise the optimizations described earlier.

To verify the Intel® AVX gains, Intel engineers tested the following workloads on 2.2 GHz Intel® Core™ software development vehicles, configured as shown in Table 1, running at 2 ms audio latency. Performance was determined by measuring the CPU cycles spent with and without Intel® AVX; the non-AVX test was performed by disabling AVX in the BIOS.
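The exact measurement harness is not published; as a rough, illustrative sketch of the idea, the timestamp counter (via the __rdtsc intrinsic) can be read before and after a call to one of the conversion routines shown earlier:

#include <intrin.h>      // __rdtsc

// Illustrative only -- not the actual harness used by Intel or Cakewalk*.
// Returns the CPU cycles spent in one buffer-conversion call.
unsigned __int64 MeasureConversionCycles(double* pDst, void* pSrc, int cSamp, double dScale)
{
	unsigned __int64 start = __rdtsc();
	CopyInt24ToFlt64AVX(pDst, pSrc, cSamp, dScale);   // routine shown earlier
	return __rdtsc() - start;
}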

 

Table 1. Test system configuration

Hardware: Intel® Core™ processor, 2.2 GHz
OS: Windows 7, 64-bit
Memory: 4 GB
Cakewalk* software: SONAR* X1

Workload A was a project with several audio tracks and plug-ins, typical of high track count benchmarks. In this workload, CPU cycles were reduced by 1.23x, a net gain of 23% in CPU bandwidth, when running with Intel® AVX enabled as compared to AVX disabled.

Workload B was another benchmark project with several plug-ins, used for high plug-in count benchmarks. In this workload, we noted a reduction in CPU cycles of 1.44x, a net gain of 44% in CPU bandwidth, with AVX enabled.

Workload C was a real-world multi-track project from the SONAR* sample content disc. In this workload, we observed a reduction in CPU cycles of 1.10x, a net gain of 10% in CPU bandwidth, with AVX enabled.

Conclusion/ Summary

SONAR* X1 is optimized for low latency audio processing using the latest Intel® AVX instruction set. As reflected in the sample code and examples, AVX can provide a performance boost to audio applications that do a lot of iterative looped processing of buffers.

Alignment makes a significant difference in performance; it is highly recommended for applications to use aligned addresses for AVX operations.
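One way to satisfy this, sketched below assuming the Microsoft CRT's _aligned_malloc/_aligned_free (the helper names are illustrative, and SONAR*'s own allocation strategy is not shown here), is to allocate destination buffers on a 32-byte boundary so that the alignment pre-loops in the routines above fall through immediately:

#include <malloc.h>      // _aligned_malloc, _aligned_free

// Allocate an audio buffer on a 32-byte boundary, suitable for aligned
// 256-bit AVX stores such as _mm256_store_pd.
double* AllocAlignedBuffer(size_t count)
{
	return (double*)_aligned_malloc(count * sizeof(double), 32);
}

void FreeAlignedBuffer(double* p)
{
	_aligned_free(p);
}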

Windows 7* SP1 is required to execute AVX instructions on Intel® Core™ hardware.

About the Authors

Noel Borthwick (Cakewalk CTO): For over a decade Noel Borthwick has worked on advancing professional audio recording software capabilities on the Intel and Microsoft platform. As Cakewalk's CTO, he has been instrumental in embracing many technology firsts, at the intersection of Microsoft operating systems and Intel processor architectures. Being a jazz musician, he brings a musical-creative perspective to his work at Cakewalk.

Rajshree Chabukswar (Intel): Rajshree Chabukswar is a software engineer in the Software Solutions Group, working on enabling client platforms through software optimizations. Prior to working at Intel, she obtained a Master's degree in Computer Engineering from Syracuse University, NY.

For more complete information about compiler optimizations, see our Optimization Notice.