Image processing and threading

In his introductory post, Rdwells mentions a focus on pipelined image processing.

Who else here has an interest and experience in that area?

Rdwells, any pressing threading questions to ask? Any experiences to share?

--D

rdwells wrote:

"I've been working with multithreaded software for about 5 years now (so I'm a relative newcomer around here, it seems!), mainly to do pipelined image processing."

Robert Schultz wrote:

Hello, I hope you (or someone similar) are still around.

I'm prepping to create a 2D image processing DLL that uses classic erosion/dilation algorithms, but on several million pictures in a kind of batch mode (64-bit with 32-bit compatibility).

I'm somewhat new to the Intel toolset, so first I need advice on which tools I really need. I've already ordered the compiler, IPP, and VTune; I suspect I also need MKL and Thread Checker. Your thoughts?

Also, I could really benefit from any advice, white papers, sample code, etc. that deal with things like erosion, dilation, and related functions.

Thanks in advance for any advice!

--Rob.

The questions might better be dealt with on the Visual Computing forum.

MKL, TBB, and IPP are included in the Intel Professional C++ compiler.

VTune is an excellent tool for performance tuning, particularly for batch-mode runs taking upwards of several seconds. Even without VTune, the OpenMP profiling library is useful for analyzing OpenMP, /Qparallel, and MKL threaded regions (I don't know about IPP).

Parallel Studio Inspector has been advocated pending a new product capable of running on Windows 7 as a successor to Thread Checker. You might try evaluations when you are ready. You don't need them unless you thread explicitly, by OpenMP or thread library calls.

jimdempseyatthecove wrote:

Rob,

You might wish to read a blog post of mine at http://www.drdobbs.com/go-parallel/ titled "Two Variations on Parallel Pipelines".

This article illustrates the benefit of using a well-crafted parallel_pipeline for image processing. You can email me if you have questions not pertinent to this forum.

Jim Dempsey

www.quickthreadprogramming.com

Thank you for sharing this.

Greetings gents,

I was going to create another thread, but this sounded similar to what I wanted to do.

I have JPEG images on disk that are around 1.9 MB each. At the moment I am reading them in sequence, one by one, then doing some image processing from image to image.

Would anyone please suggest an efficient way to read those images from disk? I have identified that this is the slowest part of the application.

I was thinking of using TBB to thread the reads, but some people mentioned that might not help.

Is there anything in IPP that would be useful? Or should I perhaps do something like image slicing to improve on this?

Sincerely

Dan

You would likely require a RAID array (let the disk system thread the reads among multiple devices) or an SSD before you could evaluate where additional performance might be gained by software parallelism.

Alexander Agathos wrote:

Erosion/dilation on millions of pictures... why not create a large canvas (background) and make a collage of giant pictures, each containing a batch of these pictures (especially if they are of the same dimensions; if not, it is easy to make them so by padding with pixels)? The problem then becomes well suited for the GPU. Some problems are more suitable for the CPU, but this one, I believe, can be dealt with more efficiently on the GPU.

Kind Regards,

Alexander Agathos.

"----Document your case----"
Alexander Agathos wrote:

Perhaps you are referring to a movie; most probably it is a movie that produces so many pictures, and then the frames do indeed all have the same dimensions. I do not know the application you have in mind, but I think that by making a collage of the frames into one giant frame on the CPU in parallel, feeding it to the GPU, returning the output, and letting the CPU dismantle the giant frame in parallel again, you can achieve real-time results with some overhead for making and dismantling the collage. This works because the structuring element (SE) operates on a local area and the operation needs only the original pixel elements, which makes the whole process completely independent. The threads in the GPU can blissfully work on each pixel without caring what the other threads are doing.

All the best in your project.

Kind Regards,

Alexander Agathos.

"----Document your case----"

Maybe you want to take a look at my multicore framework "Fiber Pool" (http://www.thinkmeta.de/en/fiberpool_overview.html).

It has a special File I/O Scheduler which uses a technique called "Parallel File Processing" for maximum CPU performance.

Quoting David Solomon (Intel)
Who else here has an interest and experience in that area?

I have interest, though no experience.

Alexander Agathos wrote:

Great framework. I also plan to fully implement this collage idea and present it. This framework can come in very handy.

Cheers,
Alexander.

"----Document your case----"
Alexander Agathos wrote:

In general there are two camps: one dealing with the GPU and one dealing with the CPU. I believe in both camps. For instance, there is a project I am creating, which will find the normals on a 3D mesh, that is going to be a blend of my Intel i7 and my nVidia GTX-275 card. People should consider the size of a thread and how complex it is. The CPU can deal with large, complex threads; the GPU can deal with less complex threads, and they should be lightweight. If you literally start heavy threads on the GPU, the driver is simply going to tell you "sorry, but I crashed," which never happens on the CPU. So please do not start a debate on these two approaches. One could suggest OpenCL to me, but I am still not convinced; I am satisfied with a direct approach for my scientific purposes.

Best,
Alexander.

"----Document your case----"
Robert Schultz wrote:

Thank you David,

Threading and concurrency are issues for the next phase of this project's development. I'm now focusing on algorithms and the like.

I will read up on Rdwells' posts, as I'm absorbing all the information I can get.

All the best!
Rob.

Robert Schultz wrote:

Thank you, Tim. I've just purchased these tools and have built my first few routines. So far so good. And man, they are FAST.

Appreciate the tips on the new evals, I'll certainly be trying those out!
--Rob.

Robert Schultz wrote:

Hi Jim,

I didn't know Dr. Dobb's had this info just for parallel programming. Many thanks for the info!

--Rob.

Robert Schultz wrote:

Hi Dan,

In my case, I must perform mathematical morphology (dilations/erosions, etc.) on several thousand images per minute (read the image from disk, process it, save to disk).

First, I acquired a good machine: dual 10K RPM SATA drives, 16 GB of RAM, and a quad-core processor.

Now I'm developing the best image algorithms I can using IPP/MKL; then, once I'm confident I have the best algorithms, I'll build a threading/concurrency model to squeeze out as much CPU utilization as possible.

Tim and Alex have some good thoughts and I'm sure I'll implement some variation of these.

IPP has some dilation and other mathematical-morphology functions, and I'm just now looking at integrating them.

I plan to share my experiences of this project. I'm going to jump into OpenMP/CnC and do some experiments there also. TBB may be all I need, but only experimentation will tell.

All the best!
--Rob.

jimdempseyatthecove wrote:

>>In my case, I must perform mathematical morphology (dilation/erosions etc.) on several thousand images per minute.

When the image files are not interrelated (i.e., they can be processed independently), it is recommended to implement coarse-grained parallelization by moving the parallelization to the outermost layer (file by file). For these situations a parallel pipeline works exceptionally well.

If this application is a production system that performs this task many times, I suggest you determine whether the bottleneck is due to I/O, processing, or both. An I/O bottleneck can be addressed by using more disks (e.g., six) and RAID 10.

If the performance issue is determined to be processing (I assume it is a blend of I/O and processing), then maybe your motherboard could accommodate a higher-performing processor.

Jim Dempsey

www.quickthreadprogramming.com
Robert Schultz wrote:

Thanks everyone for the good links and information. Too many choices, I've a headache now! ;-)

My application must analyse still pictures from radar images. The "regular" C++ code I have is good, but not good enough. We need throughput similar to a video processing app, which I think is attainable. I can see that the assembly instructions being generated by the compiler are compact, with not too much overhead at all.

So, bottom line is that this is going to require some experimentation on my part. Got the right tools, just need to figure out how best to use them.

One thought I had, which I think would be a good start, would be to actually USE the processing capability of my 4-core CPU. Like many commercial apps, mine is using one core at about 20-50% utilization; not so good.

So, bottom line, I think using IPP and parallelism (in native code) is the best way to head. (?)

This is going to be very interesting.

Thanks again! --Rob.

jimdempseyatthecove wrote:

Your radar images (my guess) are likely similar to a frame grabber's output: a series of snapshots at time intervals. These snapshots are either stored in separately named files, or stored similarly to an AVI in one big file. What you first need to answer is:

Are frames n and n+1 being processed separately, or compared with each other?

When processed separately, the parallelization can be coarse-grained (the next available core processes the next waiting frame). Each frame is processed, in parallel, essentially using your serial code. Your serial code is mostly untouched, but some changes may be required to move static (global) state variables into dynamic, per-frame (or per-thread) areas. Except for the start and end of the application, the coding changes for this type of application are quite easy to make.

When adjacent images are compared with each other (e.g., compressing, or computing the trajectory of an object in view), then you have to determine how best to separate the functional tasks. One way might be:

thread 0 working on differences between frame n and n+1
thread 1 working on differences between frame n+1 and n+2
...

As long as you are not writing annotations into the images, this too may require relatively little code change.

The above two instances would be considered coarse-grained parallelization.

Now then, when you spend a long time on each frame and the frames are interrelated, you may need to focus on parallelization of the code that processes each frame. This will require careful analysis of your program in order to make the frame processing thread-safe and efficient.

Jim Dempsey

www.quickthreadprogramming.com
