Using hardware video decode on Mobile Internet Devices

October 12, 2009 12:00 AM PDT


Purpose and Scope

This paper gives a primer on how to take advantage of the hardware video decode provided in the Intel® Graphics Media Accelerator 500 chipset (Intel® GMA 500).

Acronyms and Abbreviations

API Application Programming Interface
AVC Advanced Video Coding. Also known as H.264
CPU Central Processing Unit
DRI Direct Rendering Infrastructure. Allows applications direct access to 3D acceleration hardware
DRI2 Direct Rendering Infrastructure 2. New design of Direct Rendering Infrastructure
DRM Direct Rendering Manager. Kernel driver part of the Direct Rendering Infrastructure
FFmpeg Collection of open source video and audio codecs
H.264 Highly efficient video codec standard. Also known as MPEG-4 AVC
IDCT Inverse Discrete Cosine Transform
IZZ Inverse Zig-Zag
HD High Definition
MID Mobile Internet Device
MOBLIN* Open source project developing an optimized Linux* based platform for MIDs (http://moblin.org)
MPEG Moving Picture Experts Group
MPEG-2 Video codec used in DVDs in particular
MPEG-4 Video codec. Very popular over Internet and on mobile devices
PDA Personal Digital Assistant
VC-1 Video codec standard developed by Microsoft*
VLD Variable Length Decode. Also known as slice level acceleration
XvMC X-Video Motion Compensation. Extension of the X video extension for the X Window System that offloads motion compensation for MPEG-2 video decode onto the graphics chipset.

Introduction

Since computers became multimedia machines, video has been one of the most popular media consumed on them. As computing becomes more and more mobile, there is a clear need to improve high-quality video playback on the move. PDAs and smartphones have so far failed to deliver that experience, both in format support and in quality. Who hasn't tried to play a video on such a device and found that it plays jerkily or not at all? A good playback experience with the main existing formats is one of the promises of Mobile Internet Devices. Let's see how this can be achieved on an Intel® Atom™ processor-based MID.



The MID promise

As mobile phones gained capabilities, they started to support some video playback. The experience remains poor because of the small screen and the limited processing power of the platform. As a result, only a narrow range of content is supported, and it is usually tailored to the device in size and in format. There is virtually no chance that the video content you play on your PC will play at all on your mobile phone. MIDs come with a new promise to mobile device users: the capability to play most of the video content that a PC plays, including full movies.

With a 4.5 to 6 inch screen offering a resolution of at least 800x480, video playback on a MID is visually very pleasant. Because the Intel® Atom™ processor is an x86 architecture, the MID supports most available codecs out of the box, including the most common video formats such as MPEG-2, MPEG-4, H.264 and VC-1. Users can therefore watch the same video content that they download from the web and play on their PC.

Nevertheless, two drawbacks remain that prevent MIDs from delivering on the full promise. First, the processing power needed to decode some video files available on the internet may exceed what the MID processor can provide: decoding heavily compressed content, which is usually encoded at relatively high definition, requires a lot of performance. Second, decoding video on the processor drains the battery significantly. On a MID, it is essential that the user can watch at least one full movie on battery power while on the move, which means several hours of video playback.

To address these challenges, Intel has integrated a hardware video decoder in the Intel® GMA 500 MID chipset. On the one hand, the electrical power needed to decode a video stream decreases significantly, so the user can watch up to two movies on one battery charge. On the other hand, the decoder can handle HD streams in the most common formats: MPEG-2, MPEG-4 Part 2, H.264 and VC-1. To use the hardware video decoder, applications must send the video stream to it. On Windows*, the hardware video decoder of the Intel® GMA 500 is accessible through the DXVA API. On Linux, it is accessed through a new public API called the VA API. The Intel® GMA 500 chipset enables the Intel® processor-based MID to offer the best-in-class video experience a user can expect on such a small device.

MID Hardware video decode

Windows and Linux drivers for the Intel® GMA 500 are available to allow applications to access the hardware video decode feature. In this paper we will focus on how to enable this capability on Linux using the VA API.

Video decoding can be represented as a pipeline. It has pure decoding stages such as Variable Length Decode, Inverse Quantization, Inverse Discrete Cosine Transform, Motion Compensation and De-blocking (if applicable), and post-processing stages such as color space conversion or scaling. The compressed data stream needs to go through all the stages to be properly decoded. The pipeline can be fed at any stage with the appropriate data. The stage at which the data is fed in is what we call an entry point.
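The entry point idea can be illustrated with a small C sketch. It is only a model: the enum and the two helper functions are names invented for this example, not VA API identifiers, and the stage list is the simplified one from the paragraph above. It shows that every stage upstream of the entry point stays on the CPU, while the entry point and everything downstream run in hardware.

```c
#include <assert.h>

/* Decode stages in pipeline order (illustrative names, not VA API symbols). */
enum decode_stage {
    STAGE_VLD,      /* variable length decode */
    STAGE_IQ,       /* inverse quantization */
    STAGE_IDCT,     /* inverse discrete cosine transform */
    STAGE_MC,       /* motion compensation */
    STAGE_DEBLOCK,  /* in-loop de-blocking, when the codec has it */
    STAGE_COUNT
};

/* Number of decode stages the application must still run in software
 * when it feeds the hardware pipeline at the given entry point. */
static int stages_on_cpu(enum decode_stage entry)
{
    assert((int)entry >= 0 && (int)entry < STAGE_COUNT);
    return (int)entry;
}

/* Number of decode stages handled by the hardware from the entry point on. */
static int stages_on_hw(enum decode_stage entry)
{
    assert((int)entry >= 0 && (int)entry < STAGE_COUNT);
    return STAGE_COUNT - (int)entry;
}
```

With a VLD entry point, stages_on_cpu(STAGE_VLD) is 0: the application only parses headers and hands slices to the hardware. With a motion compensation entry point, the three stages before it remain in software.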



The VA API is a public software API specification which provides access to graphics hardware acceleration for video processing. It is meant to enable hardware accelerated video decode at various entry points (VLD, IDCT, Motion Compensation, etc…) for the prevailing coding standards today (MPEG-2, MPEG-4/ASP/H.263, MPEG-4 AVC/H.264, and VC-1/WMV9), as well as hardware accelerated video post-processing such as de-interlacing, color adjustments, color space conversion and output scaling. The VA API provides much more functionality than the existing XvMC API. XvMC was designed to support MPEG-2 motion compensation only.

Because this is a public API, it can be implemented for many graphics chipsets. Some implementations may be partial and expose only a subset of the entry points. The Intel® GMA 500 supports VLD-level offload for all formats, which covers VLD, IDCT, Motion Compensation and in-loop de-blocking (if applicable).

MID Video acceleration driver architecture

The VA API provides an interface between a video decode application (client) and the hardware decode accelerator (server), to off-load video decode operations from the CPU. The basic operation paradigm is, for the client, to send parameters and compressed data buffers to the server using the API, while the server decompresses and post processes the bit stream it receives from the client.

From the client perspective, the server acts as a virtual decode pipeline dedicated to the application. Several applications can access the driver concurrently, resulting in multiple video decodes being offloaded to the chipset. In this situation, each application only has access to its own virtual decode pipeline. Moreover, an application can create multiple virtual decode pipelines, allowing it to decode several streams concurrently.



The core API itself is windowing system agnostic and could potentially be implemented with a variety of windowing systems. The current Linux video acceleration driver for the Intel® GMA 500 is implemented with the X Window System. The output rendering function (vaPutSurface) is to be used with X11 drawables. It is particularly well suited for standard 2D video players which need to display video in a screen-aligned pixel rectangle.

With the first driver architecture, based on DRI, it is not possible to map hardware-decoded video efficiently into a 3D environment. With the future evolution of the driver to DRI2, if the target drawable is an X pixmap, the video output could also be used as a texture in the OpenGL* pipeline through the GLX_EXT_texture_from_pixmap extension, to achieve more sophisticated visual effects.

MID Multimedia frameworks

As an application vendor, you may want to take advantage of this hardware video decode so that your application gets the best out of the MID platform. You may not need to use the VA API in your own code. Several multimedia frameworks have already been optimized to use this capability. Any application built on top of those frameworks gets the benefits of the platform without requiring any knowledge of the VA API.

Two frameworks are available on the Moblin* platform: Helix* and GStreamer*.

The Helix framework is capable of using the video decode hardware acceleration. As a result, every player built on top of the Helix framework, such as RealPlayer* for MID, benefits from this feature (http://www.helixcommunity.org).

GStreamer is a very popular multimedia framework in the open source community. Many open source media players in the Linux world are based on this framework: Totem*, Rhythmbox*, Banshee* and others. The company Fluendo* (http://www.fluendo.com) provides codecs for the GStreamer framework that are optimized for the Intel® GMA 500 chipset. By using the Fluendo optimized codecs, all these applications can benefit seamlessly from the video hardware acceleration.

An implementation of the FFmpeg codecs using the VA API has been developed by Splitted-Desktop Systems* (http://www.splitted-desktop.com). It resulted in dramatic performance improvements for video playback in MPlayer* on the current Intel® processor-based MIDs using the Intel® GMA 500 chipset. For reference, the sources are available at this location: http://www.splitted-desktop.com/en/libva/

Typical code structure

The code implementing a video decoding with the VA API must follow a certain structure.

After an initialization phase, the client negotiates a mutually acceptable configuration with the server. It locks down the profile, the entry point, and other attributes that do not vary while the stream is decoded. Once the configuration is set and accepted by the server, the client creates a decode context. This decode context can be seen as a virtualized hardware decode pipeline. The decode pipeline must be configured by passing a number of datasets.

The program is now ready to start decoding the stream. The client gets decode buffers and fills them with slice and macroblock level data. The decode buffers are sent to the server until the server is able to decode and render the frame. The client then repeats the operation with new decode buffers, frame after frame, to decode the bit stream. The flowchart below shows the typical structure of a decoder using the VA API. We will detail the different phases of the algorithm in the coming chapters.
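The call order described in this chapter can also be sketched as a small state machine. This is an illustration only: the enums and functions below are made up for this example and are not part of the VA API, and the model leaves out teardown and error handling. Each step is named after the phase it stands for (initialization, configuration, surface and context creation, then the per-frame begin/render/end loop).

```c
#include <assert.h>
#include <stddef.h>

/* Phases a decoder goes through (hypothetical names for this sketch). */
enum va_phase {
    PH_START, PH_INITIALIZED, PH_CONFIGURED, PH_SURFACES, PH_CONTEXT, PH_IN_PICTURE
};

/* Steps the client performs, one per group of VA API calls. */
enum va_step {
    ST_INITIALIZE, ST_CREATE_CONFIG, ST_CREATE_SURFACES, ST_CREATE_CONTEXT,
    ST_BEGIN_PICTURE, ST_RENDER_PICTURE, ST_END_PICTURE
};

/* Returns the phase reached after `step`, or -1 if the step is out of order. */
static int next_phase(int phase, enum va_step step)
{
    switch (step) {
    case ST_INITIALIZE:      return phase == PH_START       ? PH_INITIALIZED : -1;
    case ST_CREATE_CONFIG:   return phase == PH_INITIALIZED ? PH_CONFIGURED  : -1;
    case ST_CREATE_SURFACES: return phase == PH_CONFIGURED  ? PH_SURFACES    : -1;
    case ST_CREATE_CONTEXT:  return phase == PH_SURFACES    ? PH_CONTEXT     : -1;
    case ST_BEGIN_PICTURE:   return phase == PH_CONTEXT     ? PH_IN_PICTURE  : -1;
    case ST_RENDER_PICTURE:  return phase == PH_IN_PICTURE  ? PH_IN_PICTURE  : -1;
    case ST_END_PICTURE:     return phase == PH_IN_PICTURE  ? PH_CONTEXT     : -1;
    }
    return -1;
}

/* Returns 1 if every step of the sequence happens in a valid order, else 0. */
static int sequence_is_valid(const enum va_step *steps, int count)
{
    int phase = PH_START;
    assert(steps != NULL && count >= 0);
    for (int i = 0; i < count; i++) {
        phase = next_phase(phase, steps[i]);
        if (phase < 0)
            return 0;
    }
    return 1;
}
```

The begin/render/end steps form the only loop in the model: after the end-of-picture step the decoder is back in the context phase and can start the next frame.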



Initialization Phase

Setting display

x11_display = XOpenDisplay(NULL);
vaDisplay = vaGetDisplay(x11_display);
vaStatus = vaInitialize(vaDisplay, &major_ver, &minor_ver);


Negotiating and creating configuration

To determine the level of hardware acceleration supported on a particular platform, the client needs to check that the hardware supports the desired video profile (format) and which entry points are available for that profile. To do so, the client queries the driver capabilities using vaQueryConfigEntrypoints and takes the appropriate action depending on the answer. The following code sample shows a configuration negotiation phase.

vaQueryConfigEntrypoints(vaDisplay, VAProfileMPEG2Main, entrypoints,
                         &num_entrypoints);

for (vld_entrypoint = 0; vld_entrypoint < num_entrypoints; vld_entrypoint++) {
    if (entrypoints[vld_entrypoint] == VAEntrypointVLD)
        break;
}
if (vld_entrypoint == num_entrypoints) {
    /* VLD entry point not found */
    exit(-1);
}

/* VLD was found: find out the format for the render target */
attrib.type = VAConfigAttribRTFormat;
vaGetConfigAttributes(vaDisplay, VAProfileMPEG2Main, VAEntrypointVLD,
                      &attrib, 1);
if ((attrib.value & VA_RT_FORMAT_YUV420) == 0) {
    /* desired YUV420 RT format not found */
    exit(-1);
}

vaStatus = vaCreateConfig(vaDisplay, VAProfileMPEG2Main, VAEntrypointVLD,
                          &attrib, 1, &config_id);


Decode context

Once a decode configuration has been created, the next step is to create a decode context which represents a virtual hardware decode pipeline. This virtual decode pipeline outputs decoded pixels to a render target called "Surface". The decoded frames are stored in Surfaces and can subsequently be rendered to X drawables defined in the first phase.

The client creates two objects. It first creates a Surface object, which gathers the parameters of the render target to be created by the driver, such as picture width, height and format. The second object is a "Context" object. The Context object is bound to a Surface object when it is created. Once a surface is bound to a given context, it cannot be used to create another context. The association is removed when the context is destroyed. Both contexts and surfaces are identified by unique IDs, and their implementation-specific internals are kept opaque to the client. Any operation, whether it is a data transfer or a frame decode, is given this context ID as a parameter to determine which virtual decode pipeline is used. The following code sample shows how to set up the decode context.

/*
 * Create surfaces for the current target as well as reference frames
 */
VASurfaceID vaSurface;
vaStatus = vaCreateSurfaces(vaDisplay, surf_width, surf_height,
                            VA_RT_FORMAT_YUV420, 1, &vaSurface);
/*
 * Create a context for this decode pipe
 */
VAContextID vaContext;

vaStatus = vaCreateContext(vaDisplay, config_id,
                           CLIP_WIDTH,
                           ((CLIP_HEIGHT+15)/16)*16,
                           VA_PROGRESSIVE,
                           &vaSurface,
                           1,
                           &vaContext);
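The height expression ((CLIP_HEIGHT+15)/16)*16 passed to vaCreateContext rounds the clip height up to the next multiple of 16, the size of an MPEG-2 macroblock. The short sketch below isolates that arithmetic; the helper name and the MB_SIZE constant are introduced here for illustration only and do not come from the VA API.

```c
#include <assert.h>

#define MB_SIZE 16  /* macroblock edge in pixels */

/* Round a picture dimension up to the next multiple of the macroblock size. */
static int align_to_macroblock(int pixels)
{
    assert(pixels >= 0);
    return ((pixels + MB_SIZE - 1) / MB_SIZE) * MB_SIZE;
}
```

For the 16x16 sample clip the result is unchanged (16). A height of 480 also stays at 480, while a 1080-line picture would be rounded up to 1088.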


Decoding frames

For decoding frames, we need to feed the virtual pipeline with parameter and bit stream data so that it can decode the compressed video frames. There are several types of data to send:
  • Some configuration data such as the inverse quantization matrix buffer, the picture parameter buffer, the slice parameter buffer or other data structures required for the different formats supported. This data parameterizes the virtual pipeline before the actual data stream is sent for decode.
  • The bitstream data. It needs to be sent in a structured way so that the driver can interpret it and decode it correctly.
There is a unique data transfer mechanism that allows the client to pass both types of data to the driver.
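To make the "structured way" more concrete, here is a sketch of how the slice data of the hardcoded clip can be located. The sample program at the end of this paper passes mpeg2_clip+0x2f as slice data and sets macroblock_offset to 38 bits. The code below recovers both values from the first 52 bytes of that clip by scanning for an MPEG-2 slice start code (00 00 01 followed by a value from 0x01 to 0xAF). The function names are hypothetical, and the sketch assumes the simple slice header of this clip (no optional intra slice fields), so it is not a general bitstream parser.

```c
#include <assert.h>
#include <stddef.h>

/* First 52 bytes of the mpeg2_clip array from the sample program. */
static const unsigned char clip_head[] = {
    0x00,0x00,0x01,0xb3,0x01,0x00,0x10,0x13,0xff,0xff,0xe0,0x18,0x00,0x00,0x01,0xb5,
    0x14,0x8a,0x00,0x01,0x00,0x00,0x00,0x00,0x01,0xb8,0x00,0x08,0x00,0x00,0x00,0x00,
    0x01,0x00,0x00,0x0f,0xff,0xf8,0x00,0x00,0x01,0xb5,0x8f,0xff,0xf3,0x41,0x80,0x00,
    0x00,0x01,0x01,0x13
};

/* Slice header of this clip: 4-byte start code, 5-bit quantiser_scale_code,
 * 1-bit extra_bit_slice. Macroblock data starts right after these bits. */
enum { SLICE_HEADER_BITS = 4 * 8 + 5 + 1 };

/* Returns the byte offset of the first slice start code, or -1 if none. */
static long find_first_slice(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i + 3 < len; i++) {
        if (buf[i] == 0x00 && buf[i + 1] == 0x00 && buf[i + 2] == 0x01 &&
            buf[i + 3] >= 0x01 && buf[i + 3] <= 0xAF)
            return (long)i;
    }
    return -1;
}

/* Reads the 5-bit quantiser_scale_code that follows the 4-byte start code. */
static int slice_quantiser_scale_code(const unsigned char *slice)
{
    assert(slice != NULL);
    return (slice[4] >> 3) & 0x1f;
}
```

On the clip, the scan skips the sequence, extension, group and picture start codes and stops at offset 0x2f, the quantiser scale code read there is 2, and SLICE_HEADER_BITS evaluates to 38. These are the values hardcoded in the sample program.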

Creating Buffer
The way to send parameter and bit stream data to the driver is through "Buffers". The buffer data store is managed by the library while the client identifies each buffer with a unique Id assigned by the driver.

There are two methods to set the contents of the buffers that hold either parameters or bit stream data. The first one copies the data to the driver data store. To do this, invoke vaCreateBuffer with a non-null "data" parameter. In that case, memory is allocated in the data store on the server side and the data is copied into it. This is the method used in the sample code provided:

  static VAPictureParameterBufferMPEG2 pic_param={
  horizontal_size:16,
  vertical_size:16,
  forward_reference_picture:0xffffffff,
  backward_reference_picture:0xffffffff,
  picture_coding_type:1,
  f_code:0xffff,
  {
      {
        intra_dc_precision:0,
        picture_structure:3,
        top_field_first:0,
        frame_pred_frame_dct:1,
        concealment_motion_vectors:0,
        q_scale_type:0,
        intra_vlc_format:0,
        alternate_scan:0,
        repeat_first_field:0,
        progressive_frame:1 ,
        is_first_field:1
      },
  }
};

vaStatus = vaCreateBuffer(vaDisplay, vaContext,
                              VAPictureParameterBufferType,
                              sizeof(VAPictureParameterBufferMPEG2),
                              1, &pic_param,
                              &vaPicParamBuf);


If you call vaCreateBuffer with a null "data" parameter, the buffer object is created but no data is copied into the data store. By invoking vaMapBuffer(), the client gets access to the buffer address space in the data store, which avoids copying data from the client to the server address space. The client can then fill the buffer with data. After the buffer is filled and before it is transferred to the virtual pipeline, it must be unmapped by calling vaUnmapBuffer(). For example:

/* Create a picture parameter buffer for this frame (no initial data) */
VABufferID picture_buf;
VAPictureParameterBufferMPEG2 *picture_param;
vaCreateBuffer(vaDisplay, vaContext, VAPictureParameterBufferType,
               sizeof(VAPictureParameterBufferMPEG2), 1, NULL, &picture_buf);
/* Map the buffer to get a pointer into the driver data store */
vaMapBuffer(vaDisplay, picture_buf, (void **)&picture_param);
picture_param->horizontal_size = 720;
picture_param->vertical_size = 480;
picture_param->picture_coding_type = 1; /* I-frame */
/* Unmap the buffer before it is sent to the virtual pipeline */
vaUnmapBuffer(vaDisplay, picture_buf);


Sending the parameters and bitstream for decode

To decode frames we need to send the stream parameters first: the inverse quantization matrix buffer, the picture parameter buffer, the slice parameter buffer or other data structures required for the given format. Then the data stream can be sent to the virtual pipeline. This data is passed using the data transfer mechanism described in the previous chapter. The transfer itself is triggered by the vaRenderPicture call.
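As a side note on the inverse quantization matrix buffer: the intra_quantiser_matrix values hardcoded in the sample program at the end of this paper match the MPEG-2 default intra matrix when they are read in zig-zag scan order. The sketch below applies an inverse zig-zag (IZZ) to those 64 values to recover the familiar row-by-row layout. The function name is hypothetical and the code only illustrates the reordering; it makes no claim about what a particular driver does internally.

```c
#include <assert.h>
#include <stddef.h>

/* Standard 8x8 zig-zag scan: zigzag_scan[i] is the raster index
 * (row * 8 + column) of the i-th value in scan order. */
static const unsigned char zigzag_scan[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* intra_quantiser_matrix values exactly as listed in the sample program. */
static const unsigned char intra_matrix_zigzag[64] = {
     8, 16, 16, 19, 16, 19, 22, 22,
    22, 22, 22, 22, 26, 24, 26, 27,
    27, 27, 26, 26, 26, 26, 27, 27,
    27, 29, 29, 29, 34, 34, 34, 29,
    29, 29, 27, 27, 29, 29, 32, 32,
    34, 34, 37, 38, 37, 35, 35, 34,
    35, 38, 38, 40, 40, 40, 48, 48,
    46, 46, 56, 56, 58, 69, 69, 83
};

/* Inverse zig-zag: scatter 64 scan-ordered values into raster order. */
static void inverse_zigzag(const unsigned char *scan_order, unsigned char *raster)
{
    assert(scan_order != NULL && raster != NULL);
    for (int i = 0; i < 64; i++)
        raster[zigzag_scan[i]] = scan_order[i];
}
```

After the call, the first raster row reads 8, 16, 19, 22, 26, 27, 29, 34 and the last entry is 83.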

For each frame to render, you need to go through a vaBeginPicture/vaRenderPicture/vaEndPicture sequence. In this sequence, once the necessary parameters (the inverse quantization matrix, the picture parameter buffer, or any other parameter required by the format) are set, the data stream can be sent to the driver for decoding. The decode buffers are sent to the virtual pipeline through vaRenderPicture calls. When all the data related to the frame has been sent, the vaEndPicture() call marks the end of rendering for the picture. This is a non-blocking call, so the client can start another vaBeginPicture/vaRenderPicture/vaEndPicture sequence while the hardware is decoding the frame that has just been submitted. The vaPutSurface call sends the decoded output surface to the X drawable. It performs de-interlacing (if needed), color space conversion and scaling to the destination rectangle. The following code sample shows the decode sequence.

    vaBeginPicture(vaDisplay, vaContext, vaSurface);
    vaStatus = vaRenderPicture(vaDisplay,vaContext, &vaPicParamBuf, 1);
    vaStatus = vaRenderPicture(vaDisplay,vaContext, &vaIQMatrixBuf, 1);
    vaStatus = vaRenderPicture(vaDisplay,vaContext, &vaSliceParamBuf, 1);
    vaStatus = vaRenderPicture(vaDisplay,vaContext, &vaSliceDataBuf, 1);
    vaEndPicture(vaDisplay,vaContext);

    vaStatus = vaSyncSurface(vaDisplay, vaContext, vaSurface);
    
    if (putsurface) {
        win = XCreateSimpleWindow(x11_display, RootWindow(x11_display, 0), 0, 0,
                              win_width,win_height, 0, 0, WhitePixel(x11_display, 0));
        XMapWindow(x11_display, win);
        XSync(x11_display, True);
        
        vaStatus = vaPutSurface(vaDisplay, vaSurface, win,
                                0,0,surf_width,surf_height,
                                0,0,win_width,win_height,
                                NULL,0,0);
    }
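In the sample above, the source and destination rectangles passed to vaPutSurface have the same aspect ratio, so the picture is simply scaled. A player will often want to compute a destination rectangle that preserves the aspect ratio of the video inside a window of a different shape. The helper below is one possible way to do that. It is not part of the VA API, the struct and function names are invented for this sketch, and it assumes square pixels.

```c
#include <assert.h>

/* Destination rectangle inside the window (hypothetical helper type). */
struct rect { int x, y, w, h; };

/* Largest rectangle with the source aspect ratio that fits in the window,
 * centered, leaving bars on two sides when the shapes differ. */
static struct rect fit_preserving_aspect(int src_w, int src_h, int win_w, int win_h)
{
    struct rect r;
    assert(src_w > 0 && src_h > 0 && win_w > 0 && win_h > 0);
    /* Compare src_w/src_h with win_w/win_h by cross-multiplication. */
    if ((long)src_w * win_h >= (long)src_h * win_w) {
        /* Source is relatively wider: use the full window width. */
        r.w = win_w;
        r.h = (int)((long)src_h * win_w / src_w);
    } else {
        /* Source is relatively taller: use the full window height. */
        r.h = win_h;
        r.w = (int)((long)src_w * win_h / src_h);
    }
    r.x = (win_w - r.w) / 2;
    r.y = (win_h - r.h) / 2;
    return r;
}
```

The x, y, w and h fields would then be passed as the destination rectangle arguments of vaPutSurface. For a 1280x544 clip on an 800x480 screen, this gives an 800x340 rectangle with a 70-pixel bar above and below.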


Additional capabilities

The VA API also provides capabilities beyond decoding acceleration. It provides functions for:
  • client and library synchronization
  • subpicture blending in the decoded video stream
  • host based post-processing by retrieving image data from decoded surfaces.

You can get more details on these capabilities in the VA API specifications. The API, which is currently at version 0.29, will evolve over time, adding incremental functionality supported by future versions of the chipsets.

Performance

Let's compare the performance of video playback on the current MID platforms. The first test compares playback performance on a Compal Jax* 10 MID platform. The Intel® GMA 500 chipset used in this MID is the UL11L. The Intel® Atom™ processor is the Z500 at 800 MHz. In this test we limit ourselves to SD content, as the UL11L chipset does not support HD content decode.



The measurements below show the CPU usage of a full video playback, including audio decode. The first measurement shows the system CPU usage during playback in the Totem player with the software FFmpeg codecs (no hardware acceleration). The second one shows the system CPU usage during playback in RealPlayer for MID with the hardware accelerated codecs.

Video format | Resolution | fps | Max CPU usage, Totem + FFmpeg codecs | Max CPU usage, RealPlayer for MID with hardware accelerated codecs
MPEG-2       | 720x480    | 30  | 72%                                  | 39%
MPEG-4       | 720x480    | 22  | 50%                                  | 31%
H.264        | 640x360    | 60  | 100%                                 | 27.5%
VC-1         | 720x480    | 25  | 100%                                 | 31%

Using the VA API makes the CPU usage drop significantly when hardware video decode is used, which in turn significantly reduces the power drain on the battery. Note that when the CPU reaches 100%, the system can no longer sustain the target frame rate. The frame rate then drops to a few frames per second, giving a badly degraded experience.

In the second test we use a platform with an Intel® Atom™ processor Z530 at 1.6 GHz and a US15W GMA 500 chipset. Unlike the chipset in the previous test, this version can decode HD content. The first playback is done with the regular FFmpeg codecs (without hardware acceleration) and the second with the Fluendo codecs using hardware acceleration through the VA API. Note that here we measure pure video decode only; no audio decode takes place.

Video format | Resolution | fps | Max CPU usage, FFmpeg codecs | Max CPU usage, Fluendo video codecs with hardware acceleration
MPEG-2       | 480x576    | 25  | 22.8%                        | 18%
MPEG-4       | 640x272    | 24  | 22.4%                        | 10%
H.264        | 1280x544   | 30  | 100%                         | 13%
VC-1         | 1280x720   | 25  | 100%                         | 15.5%

The playback was started with the gst-launch-0.10 tool, using the following command line: gst-launch-0.10 filesrc location=<media file> ! decodebin ! queue ! xvimagesink. The system had Intel® Hyper-Threading Technology disabled. When CPU usage reaches 100%, the playback experience is significantly degraded because the system cannot deliver the encoded frame rate. It drops to a few frames per second, making the experience quite poor.

Summary

As MIDs become more and more widespread, video playback on these devices is seen as one of the major usage models, especially as mobile TV and Video on Demand are becoming popular. To experience video playback in optimal conditions and to extend the battery life of the device, it is essential that video players use the hardware video decode capability provided by the platform.

Independent software vendors (ISVs) can either build their players on top of multimedia frameworks optimized for such platforms, such as Helix or GStreamer, or implement the decoding themselves using the standard public API, the VA API. It's a tremendous opportunity to get into this new growing segment and bring outstanding video support to the handheld world.

Additional Resources

ISVs that are considering using hardware acceleration will benefit from the following resources:

The VA API specifications are published on the freedesktop.org site. For more information, please visit: http://www.freedesktop.org/wiki/Software/vaapi.

If you look for information on Fluendo codecs, go to http://www.fluendo.com

Information on RealPlayer for MID can be found at https://helixcommunity.org/licenses/realplayer_for_mid_faq.html

The sources of MPlayer using the VA API, provided by Splitted-Desktop Systems, are available at: http://www.splitted-desktop.com/en/libva/

For software development on the MIDs, Intel® Software Network offers technical resources at: http://software.intel.com/en-us/articles/atom/all/1

About the Author

Philippe Michelon has a long history of software optimization on numerous Intel® architectures. Philippe works as an Application Engineer in the Intel® Software and Services Group in Grenoble, France, on ISV and service enabling for Intel's new mobile client platforms. Currently his focus is on MIDs.

Philippe holds an M.S. in Computational and Mathematical Engineering and can be reached at philippe.michelon@intel.com

Acknowledgments

Special thanks to Jonathan Bian and Sengquan Yuan for their contribution to this paper.

Sample code

Sample code decoding a hardcoded MPEG-2 stream with the VA API

/*
 * Copyright (c) 2007-2008 Intel Corporation. All Rights Reserved.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the
 * "Software"), to deal in the Software without restriction, including
 * without limitation the rights to use, copy, modify, merge, publish,
 * distribute, sub license, and/or sell copies of the Software, and to
 * permit persons to whom the Software is furnished to do so, subject to
 * the following conditions:
 * 
 * The above copyright notice and this permission notice (including the
 * next paragraph) shall be included in all copies or substantial portions
 * of the Software.
 * 
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
 * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
 * IN NO EVENT SHALL PRECISION INSIGHT AND/OR ITS SUPPLIERS BE LIABLE FOR
 * ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
 * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 */
/*
 * This program shows how the VA API works.
 * It does VLD decode for a simple MPEG2 clip.
 * The clip and VA parameters are hardcoded into mpeg2vld-demo.c
 *
 * gcc -o  mpeg2vld-demo  mpeg2vld-demo.c -lva
 * ./mpeg2vld-demo  : only do decode
 * ./mpeg2vld-demo 1: decode+display
 *
 */  
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <getopt.h>
#include <X11/Xlib.h>

#include <unistd.h>

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#include <assert.h>

#include "va.h"
#include "va_x11.h"

/* Data dump of a 16x16 MPEG2 video clip,it has one I frame
 */
static unsigned char mpeg2_clip[]={
    0x00,0x00,0x01,0xb3,0x01,0x00,0x10,0x13,0xff,0xff,0xe0,0x18,0x00,0x00,0x01,0xb5,
    0x14,0x8a,0x00,0x01,0x00,0x00,0x00,0x00,0x01,0xb8,0x00,0x08,0x00,0x00,0x00,0x00,
    0x01,0x00,0x00,0x0f,0xff,0xf8,0x00,0x00,0x01,0xb5,0x8f,0xff,0xf3,0x41,0x80,0x00,
    0x00,0x01,0x01,0x13,0xe1,0x00,0x15,0x81,0x54,0xe0,0x2a,0x05,0x43,0x00,0x2d,0x60,
    0x18,0x01,0x4e,0x82,0xb9,0x58,0xb1,0x83,0x49,0xa4,0xa0,0x2e,0x05,0x80,0x4b,0x7a,
    0x00,0x01,0x38,0x20,0x80,0xe8,0x05,0xff,0x60,0x18,0xe0,0x1d,0x80,0x98,0x01,0xf8,
    0x06,0x00,0x54,0x02,0xc0,0x18,0x14,0x03,0xb2,0x92,0x80,0xc0,0x18,0x94,0x42,0x2c,
    0xb2,0x11,0x64,0xa0,0x12,0x5e,0x78,0x03,0x3c,0x01,0x80,0x0e,0x80,0x18,0x80,0x6b,
    0xca,0x4e,0x01,0x0f,0xe4,0x32,0xc9,0xbf,0x01,0x42,0x69,0x43,0x50,0x4b,0x01,0xc9,
    0x45,0x80,0x50,0x01,0x38,0x65,0xe8,0x01,0x03,0xf3,0xc0,0x76,0x00,0xe0,0x03,0x20,
    0x28,0x18,0x01,0xa9,0x34,0x04,0xc5,0xe0,0x0b,0x0b,0x04,0x20,0x06,0xc0,0x89,0xff,
    0x60,0x12,0x12,0x8a,0x2c,0x34,0x11,0xff,0xf6,0xe2,0x40,0xc0,0x30,0x1b,0x7a,0x01,
    0xa9,0x0d,0x00,0xac,0x64
};

/* hardcoded here without a bitstream parser helper
 * please see picture mpeg2-I.jpg for bitstream details
 */
static VAPictureParameterBufferMPEG2 pic_param={
  horizontal_size:16,
  vertical_size:16,
  forward_reference_picture:0xffffffff,
  backward_reference_picture:0xffffffff,
  picture_coding_type:1,
  f_code:0xffff,
  {
      {
        intra_dc_precision:0,
        picture_structure:3,
        top_field_first:0,
        frame_pred_frame_dct:1,
        concealment_motion_vectors:0,
        q_scale_type:0,
        intra_vlc_format:0,
        alternate_scan:0,
        repeat_first_field:0,
        progressive_frame:1 ,
        is_first_field:1
      },
  }
};

/* see MPEG2 spec65 for the defines of matrix */
static VAIQMatrixBufferMPEG2 iq_matrix = {
  load_intra_quantiser_matrix:1,
  load_non_intra_quantiser_matrix:1,
  load_chroma_intra_quantiser_matrix:0,
  load_chroma_non_intra_quantiser_matrix:0,
  intra_quantiser_matrix:{
         8, 16, 16, 19, 16, 19, 22, 22,
        22, 22, 22, 22, 26, 24, 26, 27,
        27, 27, 26, 26, 26, 26, 27, 27,
        27, 29, 29, 29, 34, 34, 34, 29,
        29, 29, 27, 27, 29, 29, 32, 32,
        34, 34, 37, 38, 37, 35, 35, 34,
        35, 38, 38, 40, 40, 40, 48, 48,
        46, 46, 56, 56, 58, 69, 69, 83
    },
  non_intra_quantiser_matrix:{16},
  chroma_intra_quantiser_matrix:{0},
  chroma_non_intra_quantiser_matrix:{0}
};

static VASliceParameterBufferMPEG2 slice_param={
  slice_data_size:150,
  slice_data_offset:0,
  slice_data_flag:0,
  macroblock_offset:38,/* 4byte + 6bits=38bits */
  slice_vertical_position:0,
  quantiser_scale_code:2,
  intra_slice_flag:0
};

#define CLIP_WIDTH  16
#define CLIP_HEIGHT 16

int surf_width=CLIP_WIDTH,surf_height=CLIP_HEIGHT;
int win_width=CLIP_WIDTH<<1,win_height=CLIP_HEIGHT<<1;

int main(int argc,char **argv)
{
    VAEntrypoint entrypoints[5];
    int num_entrypoints,vld_entrypoint;
    VAConfigAttrib attrib;
    VAConfigID config_id;
    VASurfaceID vaSurface;
    VAContextID vaContext;
    VABufferID vaPicParamBuf,vaIQMatrixBuf,vaSliceParamBuf,vaSliceDataBuf;
    int major_ver, minor_ver;
    Display *x11_display;
    VADisplay	vaDisplay;
    VAStatus vaStatus;
    Window win = 0;
    int putsurface=0;

    if (argc > 1)
        putsurface=1;
 
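    x11_display = XOpenDisplay(NULL);
    if (x11_display == NULL) {
        /* no X server to connect to */
        fprintf(stderr, "cannot open X display\n");
        exit(-1);
    }
    vaDisplay = vaGetDisplay(x11_display);
    vaStatus = vaInitialize(vaDisplay, &major_ver, &minor_ver);
    if (vaStatus != VA_STATUS_SUCCESS) {
        /* no usable VA driver for this display */
        fprintf(stderr, "vaInitialize failed\n");
        exit(-1);
    }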

    vaQueryConfigEntrypoints(vaDisplay, VAProfileMPEG2Main, entrypoints, 
                             &num_entrypoints);

    for	(vld_entrypoint = 0; vld_entrypoint < num_entrypoints; vld_entrypoint++) {
        if (entrypoints[vld_entrypoint] == VAEntrypointVLD)
            break;
    }
    if (vld_entrypoint == num_entrypoints) {
        /* not find VLD entry point */
        exit(-1);
    }

    /* Assuming finding VLD, find out the format for the render target */
    attrib.type = VAConfigAttribRTFormat;
    vaGetConfigAttributes(vaDisplay, VAProfileMPEG2Main, VAEntrypointVLD,
                          &attrib, 1);
    if ((attrib.value & VA_RT_FORMAT_YUV420) == 0) {
        /* not find desired YUV420 RT format */
        exit(-1);
    }
    
    vaStatus = vaCreateConfig(vaDisplay, VAProfileMPEG2Main, VAEntrypointVLD,
                              &attrib, 1,&config_id);
    
    vaStatus = vaCreateSurfaces(vaDisplay,surf_width,surf_height,
                                VA_RT_FORMAT_YUV420, 1, &vaSurface);

    /* Create a context for this decode pipe */
    vaStatus = vaCreateContext(vaDisplay, config_id,
                               CLIP_WIDTH,
                               ((CLIP_HEIGHT+15)/16)*16,
                               VA_PROGRESSIVE,
                               &vaSurface,
                               1,
                               &vaContext);
    
    vaStatus = vaCreateBuffer(vaDisplay, vaContext,
                              VAPictureParameterBufferType,
                              sizeof(VAPictureParameterBufferMPEG2),
                              1, &pic_param,
                              &vaPicParamBuf);
    vaStatus = vaCreateBuffer(vaDisplay, vaContext,
                              VAIQMatrixBufferType,
                              sizeof(VAIQMatrixBufferMPEG2),
                              1, &iq_matrix,
                              &vaIQMatrixBuf );
                
    vaStatus = vaCreateBuffer(vaDisplay, vaContext,
                              VASliceParameterBufferType,
                              sizeof(VASliceParameterBufferMPEG2),
                              1,
                              &slice_param, &vaSliceParamBuf);

    vaStatus = vaCreateBuffer(vaDisplay, vaContext,
                              VASliceDataBufferType,
                              0xc4-0x2f+1,
                              1,
                              mpeg2_clip+0x2f,
                              &vaSliceDataBuf);

    vaBeginPicture(vaDisplay, vaContext, vaSurface);
    vaStatus = vaRenderPicture(vaDisplay,vaContext, &vaPicParamBuf, 1);
    vaStatus = vaRenderPicture(vaDisplay,vaContext, &vaIQMatrixBuf, 1);
    vaStatus = vaRenderPicture(vaDisplay,vaContext, &vaSliceParamBuf, 1);
    vaStatus = vaRenderPicture(vaDisplay,vaContext, &vaSliceDataBuf, 1);
    vaEndPicture(vaDisplay,vaContext);

    vaStatus = vaSyncSurface(vaDisplay, vaContext, vaSurface);
    
    if (putsurface) {
        win = XCreateSimpleWindow(x11_display, RootWindow(x11_display, 0), 0, 0,
                              win_width,win_height, 0, 0, WhitePixel(x11_display, 0));
        XMapWindow(x11_display, win);
        XSync(x11_display, True);
        
        vaStatus = vaPutSurface(vaDisplay, vaSurface, win,
                                0,0,surf_width,surf_height,
                                0,0,win_width,win_height,
                                NULL,0,0);
    }

    printf("press any key to exit\n");
    getchar();

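    /* Destroy the context first: it is bound to the surface and the config */
    vaDestroyContext(vaDisplay,vaContext);
    vaDestroySurfaces(vaDisplay,&vaSurface,1);
    vaDestroyConfig(vaDisplay,config_id);
    vaTerminate(vaDisplay);
    XCloseDisplay(x11_display);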
    
    return 0;
}