Memory Bandwidth Issue

Hello,

We have now been using the Intel Media SDK for a number of years with great success. Recently however we have run into a rather major issue that we have been unable to resolve and would appreciate your feedback on.

In short, we push your platforms pretty hard, using significant CPU and memory bandwidth for real-time video processing. We then use the Media SDK to perform hardware-assisted H.264 encoding. However, the moment we start encoding a 1920x1080 interlaced video stream, the performance of the entire system drops dramatically. All processes, even ones that are not using the Media SDK, immediately seem to use almost twice as much CPU. When we encode at lower resolutions (e.g. 720x480) we do not see the same problem.

I have attached an example CPU plot that shows the problem. If we run our application as normal, it uses some level of CPU. When we then start recording using the Intel Media SDK, even a single stream of video makes overall CPU usage jump dramatically. What cannot be seen on this plot is that the CPU usage of the Media SDK process itself is relatively low, but while it is running the "CPU time" spent in all the other processes increases a lot, presumably because the overall CPU memory bandwidth drops and so reads and writes tend to stall.

I have made a test application that demonstrates at least part of this problem. It works as follows:

1) It launches 8 threads doing memcpys and measures the throughput of each one.
2) After 30 seconds it starts writing a 1080i M4V file to disk using the Intel Media SDK with hardware encoding.
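
The first step can be sketched roughly as follows. This is a minimal illustration and not the code from the attached test application: the function names, buffer size, and timing approach are all assumptions, and the Media SDK recording step is left out entirely.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Copies `buffer_bytes` between two buffers in a loop for roughly `duration`
// and returns the achieved throughput in GB/s (1e9 bytes per second).
// Hypothetical helper, not taken from the original test application.
double measure_memcpy_gbps(std::size_t buffer_bytes,
                           std::chrono::milliseconds duration) {
    std::vector<char> src(buffer_bytes, 1), dst(buffer_bytes, 0);
    volatile char sink = 0;  // reading dst discourages the compiler from dropping the copy
    std::size_t copied = 0;
    const auto start = std::chrono::steady_clock::now();
    auto now = start;
    do {
        std::memcpy(dst.data(), src.data(), buffer_bytes);
        sink = dst.back();
        copied += buffer_bytes;
        now = std::chrono::steady_clock::now();
    } while (now - start < duration);
    (void)sink;
    const double seconds = std::chrono::duration<double>(now - start).count();
    return static_cast<double>(copied) / seconds / 1e9;
}

// Runs `thread_count` copy loops concurrently and returns one throughput
// figure per thread, indexed by thread number.
std::vector<double> run_memcpy_benchmark(unsigned thread_count,
                                         std::size_t buffer_bytes,
                                         std::chrono::milliseconds duration) {
    std::vector<double> results(thread_count, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(thread_count);
    for (unsigned i = 0; i < thread_count; ++i) {
        // Each worker writes only to its own slot, so no locking is needed.
        workers.emplace_back([&results, i, buffer_bytes, duration] {
            results[i] = measure_memcpy_gbps(buffer_bytes, duration);
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

Calling `run_memcpy_benchmark(8, 64u << 20, std::chrono::milliseconds(1000))` would give eight per-thread figures comparable in spirit to the log lines below. The buffer should be much larger than the last-level cache so that the copies actually hit main memory.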

What you see is basically the following:
Thread 1, memcpy = 1.47Gb/s
Thread 3, memcpy = 1.44Gb/s
Thread 2, memcpy = 1.43Gb/s
Thread 5, memcpy = 1.43Gb/s
Thread 0, memcpy = 1.43Gb/s
Thread 6, memcpy = 1.42Gb/s
Thread 7, memcpy = 1.36Gb/s
Thread 4, memcpy = 1.36Gb/s
[ snip ]
Starting to record to disk using Intel Media SDK ...
[ snip ]
Thread 1, memcpy = 0.99Gb/s
Thread 5, memcpy = 1.21Gb/s
Thread 0, memcpy = 0.95Gb/s
Thread 3, memcpy = 1.41Gb/s
Thread 7, memcpy = 1.07Gb/s
Thread 4, memcpy = 1.17Gb/s
Thread 2, memcpy = 1.11Gb/s
Thread 6, memcpy = 1.11Gb/s

In other words, memory performance drops by up to about 30% when using the Media SDK.
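
For reference, the arithmetic on the figures quoted above can be written out as below. The helper names are made up for illustration. With these numbers the total throughput across all eight threads falls by roughly 20%, while the hardest-hit threads (0 and 1) lose about a third.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Fractional drop from `before` to `after`; 0.25 means 25% slower.
double fractional_drop(double before, double after) {
    return (before - after) / before;
}

// Drop in total throughput summed over all threads.
double aggregate_drop(const std::vector<double>& before,
                      const std::vector<double>& after) {
    const double b = std::accumulate(before.begin(), before.end(), 0.0);
    const double a = std::accumulate(after.begin(), after.end(), 0.0);
    return fractional_drop(b, a);
}

// Largest per-thread drop; both vectors are indexed by thread number.
double worst_thread_drop(const std::vector<double>& before,
                         const std::vector<double>& after) {
    double worst = 0.0;
    const std::size_t n = std::min(before.size(), after.size());
    for (std::size_t i = 0; i < n; ++i) {
        worst = std::max(worst, fractional_drop(before[i], after[i]));
    }
    return worst;
}

// Figures from the log above, reordered so index i is thread i (0..7).
const std::vector<double> kBefore = {1.43, 1.47, 1.43, 1.44, 1.36, 1.43, 1.42, 1.36};
const std::vector<double> kAfter  = {0.95, 0.99, 1.11, 1.41, 1.17, 1.21, 1.11, 1.07};
```

`aggregate_drop(kBefore, kAfter)` and `worst_thread_drop(kBefore, kAfter)` give the two figures mentioned above. Thread 3 is barely affected, so the slowdown is not uniform across threads.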

Our best guess is that memory bandwidth across the system drops dramatically when using the hardware encoder. We would appreciate any guidance that you might have in either diagnosing or avoiding this issue.

Thank you,

Cary Tetrick on behalf of Andrew Cross.

Attachment: intelmediasdkcpuexample.png (112.26 KB)

Hi Cary,

Can you please share some more information about your system configuration such as: Processor/Platform, OS, Media SDK version, driver version.

To understand your workload, also please expand on the pipeline: Are you using Encode+VPP or just Encode, system memory or D3D(9 or 11) memory surfaces, muxing with audio?

Regards,
Petter 

Intel Media SDK System Analyzer (64 bit)

The following versions of Media SDK API are supported by platform/driver:

        Version Target  Supported       Dec     Enc
         1.0     HW      Yes             X       X       [Adapter 1]
         1.0     SW      Yes             X       X
         1.1     HW      Yes             X       X       [Adapter 1]
         1.1     SW      Yes             X       X
         1.3     HW      Yes             X       X       [Adapter 1]
         1.3     SW      Yes             X       X
         1.4     HW      Yes             X       X       [Adapter 1]
         1.4     SW      Yes             X       X
         1.5     HW      No
         1.5     SW      Yes             X       X
         1.6     HW      No
         1.6     SW      Yes             X       X

Graphics Devices:

         Name                                         Version             State
         Intel(R) HD Graphics                         9.17.10.2932        Active
         NVIDIA GeForce GT 440                        9.18.13.1106        Active

System info:

         CPU:    Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
         OS:     Microsoft Windows 7 Professional
         Arch:   64-bit

Installed Media SDK packages (be patient...processing takes some time):

         Intel(R) Media SDK 2013 (x64)
         Intel(R) Media SDK 2012 R3 (x64)
         Intel(R) Media SDK 2012 R3 (x86)
         Intel(R) Media SDK 2012 R2 (x64)

Installed Media SDK DirectShow filters:

   Intel(R) Media SDK MP3 Decoder :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\imc_mpa_dec_ds.dll
   Intel(R) Media SDK JPEG Decoder :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\jpeg_dec_filter.dll
   Intel(R) Media SDK MPEG-2 Splitter :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\imc_mp2_spl_ds.dll
   Intel(R) Media SDK H.264 Encoder :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\h264_enc_filter.dll
   Intel(R) Media SDK MVC Decoder :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\mvc_dec_filter.dll
   Intel(R) Media SDK AAC Decoder :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\imc_aac_dec_ds.dll
   Intel(R) Media SDK MPEG-2 Decoder :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\mpeg2_dec_filter.dll
   Intel(R) Media SDK MP4 Splitter :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\imc_mp4_spl_ds.dll
   Intel(R) Media SDK MPEG-2 Muxer :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\imc_mp2_mux_ds.dll
   Intel(R) Media SDK MP4 Muxer :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\imc_mp4_mux_ds.dll
   Intel(R) Media SDK H.264 Decoder :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\h264_dec_filter.dll
   Intel(R) Media SDK MP3 Encoder :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\imc_mpa_enc_ds.dll
   Intel(R) Media SDK AAC Encoder :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\imc_aac_enc_ds.dll
   Intel(R) Media SDK MPEG-2 Encoder :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\mpeg2_enc_filter.dll
   Intel(R) Media SDK VC-1 Decoder :     C:\Program Files\Intel\Media SDK 2013\samples\_bin\x64\vc1_dec_filter.dll

Installed Intel Media Foundation Transforms:

   Intel(R) Hardware VC-1 Decoder MFT : {059A5BAE-5D7A-4C5E-8F7A-BFD57D1D6AAA}
   Intel(R) Hardware H.264 Decoder MFT : {45E5CE07-5AC7-4509-94E9-62DB27CF8F96}
   Intel(R) Hardware MPEG-2 Decoder MFT : {CD5BA7FF-9071-40E9-A462-8DC5152B1776}
   Intel(R) Quick Sync Video H.264 Encoder MFT : {4BE8D3C0-0515-4A37-AD55-E4BAE19AF471}
   Intel(R) Hardware Preprocessing MFT : {EE69B504-1CBF-4EA6-8137-BB10F806B014}

This is from my dev system, which is using the 2013 SDK, but production systems have 2012 R3 and show the same symptoms.

We are using only Encode in the pipeline. Currently released systems use system memory, but I recently changed things to use D3D9 surfaces, which seems to have made a very small improvement. We use our own code to encode video, use the ffmpeg libs to encode AAC audio, and mux both.

In my testing, I was able to reproduce the problem just by running the sample encode application in a command window behind our code.

In our own code, if I bypass just the calls to encode and sync, I don't see this happening.

I've uploaded the executable mentioned in the original post. I would stay away from the link to sendspace.com.

Attachments: media-sdk-code-test.zip (11.52 MB)

Hi Cary,

I cannot run the executable project you provided since there is a DLL missing: "Codec.speed.x64.dll".

In any case, based on your description and system configuration, I'm pretty sure the observed behavior is due to the nature of recent generations of Intel Core processors. These processors have a capability called Intel Turbo Boost Technology, which in essence lets the GPU and CPU parts of the processor share the power envelope (TDP).
http://en.wikipedia.org/wiki/Intel_Turbo_Boost

For instance, if the GPU is idle and only one processor core is used by a single-threaded workload, then that single core can execute at a higher frequency. Now consider the case where the CPU is executing at high frequency and we then add a GPU-intensive workload. Since there is a max TDP and the processor resources are shared, the CPU cores must decrease frequency to allow for simultaneous GPU execution. This, in turn, leads to overall higher CPU utilization.

Regards,
Petter 

Petter, thanks for your response. I've attached the codec.  In our code and the sample code we are running threads on all cores. I tried running the turbo boost monitor while running our code and then encoding. It did not drop frequency.

Attachments: codec.zip (207.08 KB)

Hello,

Were you guys able to reproduce this? Were you able to get our test sample to work?

Thanks,

Cary

Hi Cary,

Yes, we can execute your application using the DLL you provided. Looking at the Media SDK API calls made from the application does not give me any clues as to what may be going on.

I still believe the behavior you observe is likely related to the way Intel Turbo Boost Tech. works.

Can you tell me a bit more about how the surfaces are delivered to the Media SDK encoder (copies, read from disk, backbuffer?)?

Regards,
Petter 

I downloaded the Turbo Boost Monitor (2.6). It doesn't seem to behave the way you describe (i.e., the frequency doesn't drop when running our software or this test). It stays at 3.5GHz, on a machine with a 3.4GHz rated CPU. Is there another monitor or tool to monitor this (or a performance counter)?

In the test code, we have a blank YUV frame buffer. This is passed into our writer through a memory-mapped file mechanism. In the writer, I allocate surfaces (D3D9) in advance, much like in your sample code. When I get a frame, I get one of the unused buffers and lock it (D3D lock). The frame then gets copied into the surface using one of our YUV conversion functions. These are highly optimized and use SSE2. Once filled, the surface is unlocked and is then submitted for encoding.
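
The per-frame flow described here can be sketched as below. This is only an illustration of the ordering (acquire, lock, copy, unlock, submit) and assumes the surface is unlocked again before it is handed to the encoder. `Surface`, `SurfacePool`, and `submit_frame` are hypothetical stand-ins, not Media SDK or Direct3D types, and a plain `memcpy` takes the place of the SSE2 YUV conversion.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for a pre-allocated D3D9 surface (hypothetical type).
struct Surface {
    std::vector<std::uint8_t> pixels;
    bool locked = false;  // CPU currently has write access
    bool in_use = false;  // owned by the encoder until released
};

// Fixed pool of surfaces allocated up front.
class SurfacePool {
public:
    SurfacePool(std::size_t count, std::size_t bytes_per_surface)
        : surfaces_(count) {
        for (auto& s : surfaces_) s.pixels.resize(bytes_per_surface);
    }

    // Returns a surface the encoder is not using, or nullptr if all are busy.
    Surface* acquire_free() {
        for (auto& s : surfaces_) {
            if (!s.in_use) {
                s.in_use = true;
                return &s;
            }
        }
        return nullptr;
    }

    // Called once the encoder has finished reading the surface.
    void release(Surface& s) { s.in_use = false; }

private:
    std::vector<Surface> surfaces_;
};

// Lock, copy, unlock, then hand the surface to `submit`.
// Returns false if no surface was free or the frame does not fit.
template <typename SubmitFn>
bool submit_frame(SurfacePool& pool, const std::uint8_t* frame,
                  std::size_t frame_bytes, SubmitFn submit) {
    Surface* s = pool.acquire_free();
    if (s == nullptr) return false;
    if (frame_bytes > s->pixels.size()) {
        pool.release(*s);
        return false;
    }
    s->locked = true;                                   // stand-in for the D3D lock
    std::memcpy(s->pixels.data(), frame, frame_bytes);  // stand-in for the YUV conversion
    s->locked = false;                                  // unlock before the encoder sees it
    submit(*s);                                         // stand-in for the asynchronous encode call
    return true;
}
```

In this sketch the copy into the locked surface is the only step where the CPU touches surface memory, so it is the natural place to look when asking how much memory traffic the application itself adds per frame.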

Hi Cary,

Unfortunately, I do not think the available turbo/frequency monitoring tools convey the TDP sharing between CPU and GPU accurately. You could check out the following tools, but they will likely give you the same result.
http://software.intel.com/en-us/articles/intel-power-gadget/
http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization

We are exploring the behavior of your test application a bit further. I will let you know what we find.

One thing I noted is a potential thread contention issue. It looks like the first stage of your workload (not using Media SDK) is using as many threads as logical cores. You may try changing the number of threads you use to drive the SW workload. You can also explore to see if there is any impact in changing the Media SDK parameter "NumThreads" to 1 or 0 (Media SDK decides). For HW accelerated workloads there is no point in explicit use of many threads.
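
As a rough illustration of the first suggestion, a worker count that leaves some logical cores free could be picked along these lines. The helper names and the choice of how many cores to reserve are assumptions for the sketch, and it does not touch any Media SDK setting.

```cpp
#include <thread>

// Picks a worker count that leaves `reserved` logical cores free for other
// work (for example the thread that drives the encoder). Always returns at
// least 1. Hypothetical helper for illustration.
unsigned pick_worker_count(unsigned logical_cores, unsigned reserved) {
    if (logical_cores == 0) return 1;         // core count unknown
    if (reserved >= logical_cores) return 1;  // nothing left to reserve
    return logical_cores - reserved;
}

// Same as above, using the core count reported by the standard library.
// std::thread::hardware_concurrency() may return 0 if it cannot be determined.
unsigned default_worker_count(unsigned reserved = 1) {
    return pick_worker_count(std::thread::hardware_concurrency(), reserved);
}
```

On the i7-2600 described earlier (8 logical cores), `pick_worker_count(8, 1)` gives 7 worker threads instead of 8. Whether that removes the slowdown would have to be measured.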

Regards,
Petter 
