Optimizing Multimedia Applications for Multicore
Research Area: Parallelization of Embedded Multimedia Applications and Data Accessing Frameworks
Authors: Smarak Acharya, Krishnabir Ghosh, Rajnish Mishra and Rishi Mathur
Faculty Mentor: Naveen I G (Lecturer, ECE Dept)
Name of the Institution: Sir M Visvesvaraya Institute of Technology, Bangalore
Abstract
It is becoming difficult to meet the growing processing demands of embedded multimedia applications with single-core architectures. Multicore embedded architectures have emerged as a promising solution to this problem. Faster multimedia processing is needed as consumer electronics such as mobile phones, digital TVs, and games advance.
This paper describes methods for parallelizing multimedia applications on multicore processors. It shows how understanding the data-access pattern of an application can help you use the memory and system resources of the underlying architecture effectively to develop a scalable parallel application.
Because multicore systems have complex architectures, compiler technology and development tools need more sophistication for applications to run successfully. Most parallel software is developed by converting sequential programs to parallel programs by hand, and a lack of multicore-aware development tools makes the software difficult to evaluate.
Software frameworks can provide a better starting point for developing multicore applications and thus help to reduce the development time. This article demonstrates frameworks for embedded-multimedia applications; however, you can extend the data-flow models to many other applications. The frameworks incorporate the inherent data parallelism in multimedia applications and demonstrate effective management of streaming data by efficiently using the underlying architecture.
This article discusses parallelization techniques targeting multimedia algorithms, which require high processing power and are attractive for embedded-system applications.
Background
Most multimedia applications running on single-core processors use sequential algorithms. If they are run unchanged on multicore systems, they use the system's superior architecture and processing speed inefficiently. Hence the need to parallelize and optimize the program.
There are various ways to parallelize a program. Some programs have inherent parallelism, while others exhibit complex hierarchical data-access patterns. In general, multimedia and scientific applications are easier to parallelize than games and control applications, due to their predictable data-access patterns.
The granularity of parallelism varies greatly, from a set of frames down to a macroblock of a frame. In general, the lower (finer) the granularity, the higher the level of synchronization you need between the sharing elements. Lower granularities increase parallelism and reduce network traffic; higher granularities require less synchronization but increase the network traffic. So, based on the application type and system requirements, the frameworks define different levels of parallelism.
The following dual core architecture is taken as an example to develop a parallel algorithm:
The figure shows the ADSP-BF561 architecture, which consists of separate instruction and data memory private to each of the two cores, plus a shared L2 memory and external memory. You can interface all peripherals and DMA resources to either core with configurable arbitration schemes. There are two DMA controllers, each of which consists of two sets of MDMA (memory-DMA) channels. A separate bus connects the L2 memory and each core. A shared bus connects the external memory and the two cores.
All of the frameworks described below use DMA to move the streaming data within the memory hierarchy. The alternative, cache memory, is not used to manage any of the data. You know the data-access pattern of the applications you are targeting, so you can use the DMA engine to manage data effectively. A cache suffers from nondeterministic access times, cache-miss penalties, and increased external-memory-bandwidth requirements. Using the DMA engine, you can transfer data to L1 memory before a core requests it; the system performs the transfers in the background without halting the core for a data-item request.
Applications with smaller granularities of parallelism can exploit the DMA channels and the fast access speeds of the L1 and L2 memories without accessing the slow external memory. For larger granularities of parallelism, however, there is a memory constraint because multiple data frames are present. The dependent frames need to be stored in the external memory, and independent blocks can be brought sequentially into the L1 and L2 memories.
Both cores share the L2 memory and the external-memory interface, although each memory level has its own bus. Thus, you should minimize simultaneous access by the two cores to the same memory level to avoid stalls due to contention. To minimize contention, the frameworks map code and data objects such that one core mostly accesses the L2 memory and the other core mostly accesses external memory. In this case, the core performing most of the external-memory accesses has greater memory-access latency, but the overall access latency is less than the cost of contention.
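The background transfer described above is often arranged as double (ping-pong) buffering. The following is a minimal sketch in Python, not Blackfin code: a thread stands in for the DMA engine, two list slots stand in for L1 buffers, and the names `process_stream`, `dma_fetch`, and `work` are hypothetical.

```python
import threading


def process_stream(blocks, work):
    """Process blocks with two buffers: while the 'core' works on one
    buffer, a background 'DMA' thread fills the other."""
    buffers = [None, None]  # stand-ins for two L1 buffers
    results = []

    def dma_fetch(slot, index):
        # Hypothetical DMA transfer: copy one block into an L1 buffer.
        buffers[slot] = list(blocks[index])

    if not blocks:
        return results

    dma_fetch(0, 0)  # prime the first buffer before processing starts
    for i in range(len(blocks)):
        current = i % 2
        transfer = None
        if i + 1 < len(blocks):
            # Start fetching the next block before processing this one.
            transfer = threading.Thread(
                target=dma_fetch, args=(1 - current, i + 1)
            )
            transfer.start()
        results.append(work(buffers[current]))
        if transfer is not None:
            transfer.join()  # wait for the transfer before swapping buffers
    return results
```

The core never reads the slot that the transfer is writing, which is the property that lets the transfer run without halting the core.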
If the interrupt processing time is less than the processing time of the streaming data, you can assign all peripheral interfaces to one core for ease of programming; the lower interrupt processing time will not affect the load balancing between the two cores.
Problem Statement
There are two significant challenges to producing parallel software in which the performance of a sequential application scales with the number of available cores: developing efficient parallel algorithms, and efficiently using shared resources such as the memory, DMA (direct-memory-access) channels, and interconnect network. This leads to the question:
How should one develop an efficient parallel algorithm and at the same time, minimize the usage of shared resources and increase the speed of the program?
Methodology
Development of an efficient parallel algorithm involves the following steps:
a) Identification of data access pattern of the program
b) Fitting an appropriate framework model
Identification of data access pattern of the program:
The goal of data parallelism is to find blocks of data that can be treated independently, so that each block can be fed to the processing element of a core.
For most multimedia applications, you can view the data-access pattern as a 2-D pattern (spatial domain), in which the independent blocks of data are confined to a single frame, and a 3-D pattern (temporal domain), in which the independent blocks of data span more than one frame. In the spatial domain, you can divide the frame into slices with N sequential rows and macroblocks of a video frame. In the temporal domain, you can subdivide the data flow at a frame level or a GOP (group-of-pictures) level.
Algorithms with a slice or macroblock data-access pattern require greater synchronization but have less network traffic, as the memory hierarchy needs to store only a part of the image data. In the case of a frame- or a GOP-type data-access pattern, the memory hierarchy needs to store large amounts of data but requires considerably less synchronization, as the system exhibits higher granularities of parallelism.
In short, algorithms that work on macroblocks require more synchronization but create less network traffic, because only part of the data needs to be stored before processing. GOP-type algorithms, by contrast, require less synchronization but occupy more memory, which adds to the network traffic.
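The spatial and temporal subdivisions described above can be sketched as simple index generators. This is an illustrative Python sketch only; the function names and the default 16 x 16 macroblock size are assumptions, not values taken from the frameworks.

```python
def slices(height, rows_per_slice):
    """Yield (first_row, end_row) for each slice of N sequential rows."""
    for top in range(0, height, rows_per_slice):
        yield (top, min(top + rows_per_slice, height))


def macroblocks(width, height, size=16):
    """Yield the (x, y) top-left corner of each size x size macroblock."""
    for y in range(0, height, size):
        for x in range(0, width, size):
            yield (x, y)


def gops(frame_count, gop_length):
    """Yield a range of frame indices for each group of pictures."""
    for start in range(0, frame_count, gop_length):
        yield range(start, min(start + gop_length, frame_count))
```

Each generator yields units that can be handed to a core independently; the smaller the unit, the more often the cores have to synchronize.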
Once you have identified the data accessing pattern of a program, you need to design and fit an appropriate framework to parallelize the program.
Fitting an appropriate framework model:
Taking the granularity of data access pattern as the basis, you can define one of the following frameworks to parallelize your program:
i. Line processing (spatial pattern)
ii. Macroblock processing (spatial pattern)
iii. Frame processing (temporal pattern)
iv. GOP level processing (temporal pattern)
There are also ways to integrate multiple frameworks for asymmetrical parallel processing when you have two or more processing algorithms for a data stream.
Line processing involves data that are dependent only at the line level. Core A handles inputs and core B handles outputs. Separate MDMA channels transfer data between the cores. L1 memory uses multiple buffers to avoid contention between core access and peripheral DMA access. Counting semaphores synchronize each line between the two cores. Because the lines do not pass through external memory, this framework saves external-memory bandwidth. Examples of applications that can use this framework include color conversion, histogram equalization, filtering, and sampling.
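The counting-semaphore handshake in the line-processing framework can be sketched as follows. This is a Python illustration under stated assumptions: two threads stand in for core A and core B, a small list stands in for the L1 line buffers, and `run_line_pipeline`, `stage_a`, and `stage_b` are hypothetical names. Python threads show the synchronization only, not a real speed-up.

```python
import threading


def run_line_pipeline(lines, stage_a, stage_b, num_buffers=2):
    """Pass each line from 'core A' to 'core B' through a ring of buffers
    guarded by two counting semaphores."""
    free = threading.Semaphore(num_buffers)  # buffers core A may fill
    filled = threading.Semaphore(0)          # lines ready for core B
    ring = [None] * num_buffers
    output = []

    def core_a():
        for i, line in enumerate(lines):
            free.acquire()                   # wait for an empty buffer
            ring[i % num_buffers] = stage_a(line)
            filled.release()                 # signal one more ready line

    def core_b():
        for i in range(len(lines)):
            filled.acquire()                 # wait for a ready line
            output.append(stage_b(ring[i % num_buffers]))
            free.release()                   # hand the buffer back

    workers = [threading.Thread(target=core_a),
               threading.Thread(target=core_b)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return output
```

With `num_buffers` set to 2, core A can work at most two lines ahead of core B, which mirrors the multiple-buffer arrangement in L1 memory.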
Macroblock processing has alternate macroblocks moving between the cores. The L2 memory maintains multiple slice buffers, and separate MDMA channels transfer macroblocks from L2 to L1 memory of each core. L1 memory also maintains multiple buffers to avoid contention between DMA and core access. Similar to the line-processing framework, Core A handles the input-video interface, and Core B manages the output interface; a counting semaphore achieves synchronization between the two cores. Targeted example applications for this framework include edge detection, JPEG/MPEG-encoding/decoding algorithms, and convolution encoding.
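The alternating assignment of macroblocks to cores can be sketched as a round-robin split. This Python sketch is illustrative only: worker threads stand in for cores, and `process_alternating` and `kernel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def process_alternating(blocks, kernel, cores=2):
    """Give alternate macroblocks to each 'core' and merge the results
    back into the original block order."""

    def run_core(core_id):
        # Core k handles blocks k, k + cores, k + 2 * cores, ...
        return [(i, kernel(blocks[i]))
                for i in range(core_id, len(blocks), cores)]

    with ThreadPoolExecutor(max_workers=cores) as pool:
        parts = list(pool.map(run_core, range(cores)))

    results = [None] * len(blocks)
    for part in parts:
        for index, value in part:
            results[index] = value
    return results
```

For example, `process_alternating([1, 2, 3, 4, 5], lambda b: b * b)` sends blocks 0, 2, and 4 to one worker and blocks 1 and 3 to the other.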
Frame processing involves accessing data in the form of dependent frames. The frames here are spatially linked and are stored in external memory. Data is moved in sub-blocks of frames to L1 or L2 memory. Each frame is processed by the line or macroblock method, and processing within a frame is sequential. Core A handles inputs and core B handles outputs, and semaphores synchronize the frames entering the cores.
GOP-level processing differs from frame processing in the data-access pattern of the frames, which is temporal. Here, sets of frames that have no dependency between them are processed separately by the cores, each set in a sequential manner. Blocks of frames are handled by the L1 or L2 memory. To improve efficiency, the external memory is divided into banks between the cores.
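GOP-level parallelism can be sketched with a toy decoder in which frames inside a GOP depend on each other but GOPs do not. This Python sketch rests on loud assumptions: each "frame" is a single integer, the first value in a GOP is a key frame, later values are differences from the previous frame, and `decode_gop` and `decode_stream` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_gop(gop):
    """Decode one GOP sequentially: the first value is a key frame and
    each later value is a difference from the previous decoded frame."""
    decoded = []
    previous = 0
    for delta in gop:
        previous += delta
        decoded.append(previous)
    return decoded


def decode_stream(gop_list, cores=2):
    """Decode independent GOPs on separate 'cores' (worker threads)."""
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(decode_gop, gop_list))
```

The work inside `decode_gop` cannot be split because each frame needs the previous one, whereas the GOPs themselves can be handed to different cores without any synchronization.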
In a more real-world application, multiple algorithms running within the system process the streaming data, and each of these algorithms may exhibit a different data-access pattern. In such cases, you can combine the frameworks for a particular application. To take advantage of the multiple cores you can pipeline the process to achieve parallelism.
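Pipelining two algorithms across two cores can be sketched with a bounded queue between the stages. This is an illustrative Python sketch, with threads standing in for cores; `run_pipeline`, `first_stage`, `second_stage`, and the `depth` parameter are hypothetical names.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(frames, first_stage, second_stage, depth=2):
    """Run first_stage on 'core A' and second_stage on 'core B',
    connected by a bounded hand-off queue."""
    handoff = queue.Queue(maxsize=depth)
    output = []

    def core_a():
        for frame in frames:
            handoff.put(first_stage(frame))  # blocks when the queue is full
        handoff.put(_DONE)

    def core_b():
        while True:
            item = handoff.get()
            if item is _DONE:
                break
            output.append(second_stage(item))

    workers = [threading.Thread(target=core_a),
               threading.Thread(target=core_b)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return output
```

While core B runs the second algorithm on one frame, core A can already run the first algorithm on the next, so the two algorithms overlap in time.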
Key Results
• The efficiency of a multimedia application running on a multicore system is increased by parallelizing its algorithm.
• The most important prerequisite for parallelizing an algorithm is identifying the data-access pattern of the program.
• Reducing the sequential elements of the program increases parallelization.
• The multiple cores in a processor can be exploited by assigning independent sections of the algorithm to different cores.
• The presence of caches and DMA engines can also be exploited to speed up the program.
• Minimizing contention between the cores and peripheral memory is key to efficient parallelization.
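The effect of the sequential portion on achievable speed-up can be estimated with Amdahl's law, a standard result that this paper does not name. The sketch below assumes an idealized model with no synchronization or contention overhead; `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speed-up when parallel_fraction of the run time is spread
    evenly over the given number of cores (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)
```

Under this model, a program that is 90% parallelizable gains a factor of about 1.82 on two cores, and only a fully parallel program reaches the full factor of 2.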
Discussions
The first question to address is why sound and video processing are important in multimedia development.
IMPORTANCE OF SOUND PROCESSING IN MULTIMEDIA DEVELOPMENT:
Recording sound for multimedia applications is only the first step in sound processing. Usually, all recordings must be modified to some degree to raise them to an appropriate level of quality. Some files just need excess sound trimmed that was not intended for use. Other files require audio processing to increase their amplitude or to decrease the volume at the end for a professional fade-out. To add realism to a presentation, we may have to apply an envelope to a particular sound effect so that it increases or decreases at specific times. Sound files that are recorded separately may have to be mixed together for a more realistic effect. Using several tracks allows for accurate mixing of separate files. Adding music in the background requires careful calibration of the respective sound amplitudes. Sound processing is essential to produce edited files that are coordinated with the graphical information in a multimedia application.
Regarding video processing, the various video files that are recorded separately have to be mixed together for a more professional result. The mixing of several video tracks allows for very interesting montages. Adding music in the background greatly enhances a video presentation, but it requires careful calibration of the respective sound amplitudes. Video processing is essential to produce edited files that are coordinated with the graphical information in a multimedia application.
Scope for future work
The main problem of processing on multicore systems is that the processor outperforms the memory. With shared memory and shared buses, the expected performance cannot be achieved if there is a bottleneck on the bus. This drawback has hindered the efficient running of most programs on multicore systems. However, multimedia applications are well supported by multicore systems because they are relatively easy to parallelize and make minimal use of memory and the bus. This makes them a promising area for future development. They provide an ideal platform for developing new applications that bring the same speed and efficiency to other, non-multimedia programs.
Conclusions
We have shown that by understanding the data access pattern and minimizing the usage of shared resources by implementing one of the above frameworks, a scalable, parallel multimedia application can be developed that can run efficiently on a multicore system.
Acknowledgements
We express our gratitude to the management of the Sir M Visvesvaraya Institute of Technology, Bangalore, for allowing us to carry out the research.
Our sincere thanks to our principal, Dr M S Indira, and Prof. K Pichchamatu, HOD, ECE Department, for their wholehearted cooperation, valuable suggestions, guidance and support during the research.
Our sincere gratitude to our mentor Naveen I G, lecturer, ECE Department who took pains to steer us to accomplish this paper.
We also thank each other (authors of the paper) for showing dedication in order to make this paper presentable.