CPU affinity of a pipeline


Hello,

I have implemented an application that uses the Intel TBB pipeline pattern for parallel processing, on an Intel Xeon CPU E5420 @ 2.50GHz running RHEL 6.

The application basically consists of 8 pipelines. Each pipeline has one token (making it one thread per pipeline). Each pipeline receives data from an endpoint and processes it to completion. I ran this application and collected general exploration analysis data using VTune Amplifier. The profiler reported a high CPI in the finish_task_switch function of the vmlinux module, which suggests that the kernel is spending a lot of time performing context switches, adversely affecting the performance of the application.

What I would like to understand is why the kernel is performing so much context switching. Will each pipeline be scheduled on the same CPU? Is there a way to assign CPU affinity to each pipeline? How can I reduce this performance-impacting behavior? Please provide some optimization tips.

Thank you,

Vishal


>>Each pipeline receives data from an endpoint and processes it to completion.

Do you mean there is one end point for all 8 pipelines
or
8 end points, one for each of the 8 pipelines?

An example of the former (1 end point for all 8 pipelines) is reading a sequential file into a buffer, then processing the buffers 8 ways. In this case you should have 1 pipeline, with at least 8 tokens.

If you have the latter (a separate end point per pipeline), then I suggest you eliminate the pipeline and run it as 8 separate tasks.

Jim Dempsey

www.quickthreadprogramming.com

Thank you for the response. It is 8 end points, one for each of the 8 pipelines.

Do I have to assign CPU affinity when running as 8 separate tasks? Will the scheduler always schedule the task to the same core?

Thank you,
Vishal

You don't say anything about the filters in the pipeline (how many, what kind), and what happens if you increase the number of tokens (just 1 doesn't make for much of a pipeline)...

There are three filters per pipeline: Input, Process and Output. All the filters are serial. I have not tried increasing the tokens. The reason is that I want the processing to be similar to having 8 threads, each working on data from its own endpoint.

Regards,
Vishal

You cannot expect TBB to do its thing if you don't let it.

>>There are three filters per pipeline. Input, Process and Output.

Are Input and Output performing I/O?

If so, the fread and fwrite (or other R/W API) will introduce a thread delay, which you would like to recover for productive use.

Each pipeline's tokens are separate from the other pipelines' tokens. Use at least 3 tokens per pipeline (5 would be better; any more may exhibit diminishing returns). Three provides for each pipeline to have a working thread in the Process pipe while the input and output threads are blocked waiting for I/O completion.

Oversubscribe the thread pool in the init(..). The amount of oversubscription would depend upon the number of Reads and Writes concurrently blocked waiting for I/O completion. This is more of a tuning process than a fixed formula.

BTW, in QuickThread the Input and Output pipes would be scheduled on I/O class threads, thus avoiding oversubscription of the compute class threads. Also, the QuickThread parallel_pipeline can be set up to be NUMA as well as CPU affinity sensitive (see www.quickthreadprogramming.com).
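For reference, if you want to experiment with pinning a thread by hand on Linux, this is the kind of call involved (a sketch using the glibc affinity API; pin_to_cpu is a hypothetical helper, and the TBB pipeline will not call anything like this for you):

#define _GNU_SOURCE      // for CPU_ZERO/CPU_SET and pthread_setaffinity_np (glibc)
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single logical CPU; returns 0 on success.
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}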

Jim Dempsey

www.quickthreadprogramming.com

Yes, the input and output are performing I/O.
Increasing the tokens per pipeline to 3 seems like a good experiment, though I have questions about that.

1. If I increase the tokens to 3, then effectively I am increasing the thread count to 3*8 = 24. If I have only 8 logical cores, won't this cause cache thrashing and degrade performance?

2. The input has to be processed in the order it is received. Having 3 tokens potentially means that there could be two tokens processing their own inputs at the same time. (The processing time and the sequence of steps to process each input depend on the kind of input.) Will the order of processing be maintained? I am new to this concept and I do not understand it well enough.

In the meantime, I will start experimenting by increasing the tokens. I will also look into QuickThread.

Thank you,
Vishal Sharma.

>>If I increase the tokens to 3, then effectively I am increasing the thread count to 3*8 = 24

No, potentially you are increasing the task count to 3*8 = 24 (to be performed by the available threads).

Note, IIF (if and only if) tasks (such as your input pipe and output pipe) experience thread stalls due to I/O (as they will with Read and Write), then (and only then) consider oversubscribing the thread count by the number of threads that could be (and typically are) stalled at any point in time. You may have to experiment with the oversubscription count to find the sweet spot.

>>2. The input has to be processed in the order it is received.

Filters have three modes:

a) parallel (any order)
b) serial_out_of_order (one at a time, no particular order)
c) serial_in_order (one at a time, in sequence)

Make each of your pipes (filter stages) serial_in_order.

This way, each stage can run concurrently in different threads, with the restriction of processing tokens in order received.
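To make that concrete, here is a minimal, self-contained sketch using the classic tbb::pipeline interface (the filter classes and the token count are illustrative, not Vishal's actual code):

#include <cstdio>
#include "tbb/pipeline.h"
#include "tbb/task_scheduler_init.h"

// Every stage declares serial_in_order: each stage handles one token at a time,
// in the order the input stage produced them, yet different stages still overlap.
class OrderedInput : public tbb::filter {
    int n;
public:
    OrderedInput() : tbb::filter(tbb::filter::serial_in_order), n(0) {}
    void* operator()(void*) { return (++n <= 16) ? new int(n) : NULL; } // NULL ends the run
};

class OrderedProcess : public tbb::filter {
public:
    OrderedProcess() : tbb::filter(tbb::filter::serial_in_order) {}
    void* operator()(void* item) { *static_cast<int*>(item) *= 2; return item; }
};

class OrderedOutput : public tbb::filter {
public:
    OrderedOutput() : tbb::filter(tbb::filter::serial_in_order) {}
    void* operator()(void* item) {
        std::printf("%d\n", *static_cast<int*>(item));
        delete static_cast<int*>(item);
        return NULL;
    }
};

int main() {
    tbb::task_scheduler_init init;
    OrderedInput in;
    OrderedProcess proc;
    OrderedOutput out;
    tbb::pipeline p;
    p.add_filter(in);
    p.add_filter(proc);
    p.add_filter(out);
    p.run(3); // at most 3 tokens in flight, so Read, Process and Write can overlap
    return 0;
}

With run(3), up to 3 tokens circulate, so the stages overlap in time even though each individual stage stays strictly in order.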

Tn = Thread n
Pn = Pipeline n

P0: T0 (Input, stalled at Read file R0), T1 (processing), T2 (output, stalled at Write file W0)
P1: T3 (Input, stalled at Read file R1), T4 (processing), T5 (output, stalled at Write file W1)
...
P7: T21 (Input, stalled at Read file R7), T22 (processing), T23 (output, stalled at Write file W7)

Notes:

Tokens of each pipeline circulate from the output stage back to the input stage.
Although 3 tokens per pipeline would be minimal, you may find it beneficial to use more, because at times you may experience high seek latencies performing I/O to 16 files. Experimentation will tell.
Also, latencies will change from system to system and with the disk drive placement of the files (if you have that flexibility).

There is a difference between the TBB parallel_pipeline and QuickThread parallel_pipeline.

When a pipeline has an input filter (pipe segment), some number of processing filter(s), and an output filter, you can construct the pipeline such that the input pipe (receiving empty buffers) is run by a single I/O class thread, and the output pipe (the writing pipe), also run by a single, but different, I/O class thread, will write in collated order (collation order specified by the input pipe). This permits you to effectively have sequential input, parallel (any order) internal processing, and sequential (collated) output. This does require that your "process" function(s) be thread safe. If this is not suitable, then you can make the internal pipes serial.

If you have much work invested in TBB, I suggest you stick with TBB. If this is a new conversion of an application, then you could experiment with QuickThread (one word, no "s" at the end).

Jim Dempsey

www.quickthreadprogramming.com

Jim,

Thank you for your response. Really appreciate your help.

One question - when a pipeline is created with 3 filters and the number of tokens is set to 1, does this mean that three tasks exist and they are being processed by three threads?

Thank you,
Vishal

I'm not convinced yet.

I think I've been on the side of doing I/O and computation on the same core once before, myself, but if I remember well it was deemed of little value. I guess it depends on how much computation there really is. If the computation is trivial, then I would forget about pipelines, certainly ones where the first and the third filter are merely about I/O that might stall a thread. If the computation is CPU-intensive, then I would certainly want to know whether it makes a real difference to have the data in cache first, and even then I would prefer to dedicate some threads to I/O without involving TBB yet, sidestepping the oversubscription workaround, since it seems difficult to have a situation that's both CPU-bound and memory-bound. Then maybe you could have a stage to warm up the data (if you're so convinced it's an issue that you would dedicate the development time to experiment with that). But didn't somebody invent hyperthreading (which by any other name smells as sweet) specifically to keep CPUs busy while waiting for memory in one of their hardware threads, as long as the data motion doesn't become cumulative and degenerate into thrashing, which doesn't seem to be the case here?

If I'm mistaken, somebody please convince me otherwise.

P.S.: I think Jim and I basically gave the same advice about increasing the number of tokens (he less tersely so), so how about some feedback on the results of that?

(Edited: some trimming.)

>>When a pipeline is created with 3 filters and the number of tokens is set to 1, does this mean that three tasks exist and they are being processed by three threads?

Your question is incomplete; perhaps you are missing a critical point about Tasks, Software Threads and Hardware Threads (and pipelines).

A pipeline with 3 filters essentially represents 3 tasks. Tasks do not run until all of the following hold:

a) they are enqueued
b) a hardware thread is scheduled by the O/S to run a software thread (of TBB)
c) a software thread within the TBB thread pool takes the enqueued task request
(provided it is not busy running some other task)

A pipeline with 0 tokens (abstract picture for you):

P: (task 1 waiting for token), (task 2 waiting for token), (task 3 waiting for token)

-------------------------------------------------------

A pipeline with 1 token (abstract picture for you):

At T=1
P: task 1 (potentially) running (potentially stalled at Read), (task 2 waiting for token), (task 3 waiting for token)

At T=2
P: task 1 waiting for token, task 2 (potentially) running, task 3 waiting for token

At T=3
P: task 1 waiting for token, task 2 waiting for token, task 3 (potentially) running (potentially stalled at Write)

...back to T=1...

Note, "(potentially) running" means running iif (if and only if) a hardware thread is available .AND. a software thread is available (i.e. not running some other task).

With 1 token, only one of Input, Process, Output can (potentially) be running (or potentially stalled, in the case of the Input or Output tasks).

With 3 tokens (and after the first 2 have been read):

P: Task 1 (potentially) running (potentially stalled at Read), Task 2 (potentially) running, Task 3 (potentially) running (potentially stalled at Write).

With 3 tokens you could (potentially) have a Read in progress, concurrent with a Process, concurrent with a Write in progress.

The P: description above is but one of the possible states. You could potentially have:

P: Task 1 waiting for token, Task 2 with two tokens in queue and one token (potentially) running, Task 3 waiting for token

By having more than 3 tokens, say 5, you could (potentially) have:

P: Task 1 (potentially) running (potentially stalled at Read), Task 2 with two tokens in queue and one token (potentially) running, Task 3 (potentially) running (potentially stalled at Write)

The actual experience will vary from the above, but the above should give you a better description of what may happen.

Your application (from your sketch) has 8 input files and 8 output files (potentially 16 I/Os in flight). Should your system have but 1 spindle (one disk), you will experience large seek latencies. To reduce seek latencies, you might consider having each file read several buffers in a row, send each buffer (token) to the Process pipe, then have each file write several buffers. The TBB parallel_pipeline is not configurable to do this (neither is the QuickThread parallel_pipeline), however you can recode to do something like this sketch:

Input pipe:

// Read 1 to 4 buffers (short reads on EOF)
for (Token.nBuffers = 0; Token.nBuffers < 4; ++Token.nBuffers)
    if (Read(Token.buffer, Token.nBuffers)) break; // break on EOF

Process pipe:

if (Token.nBuffers)
{
    parallel_invoke( // or use parallel_for_each, or...
        [&](){ Process(Token.buffer, 0); },
        [&](){ if (Token.nBuffers > 1) Process(Token.buffer, 1); },
        [&](){ if (Token.nBuffers > 2) Process(Token.buffer, 2); },
        [&](){ if (Token.nBuffers > 3) Process(Token.buffer, 3); });
}

Output pipe:

for (int i = 0; i < Token.nBuffers; ++i)
    Write(Token.buffer, i);

Jim Dempsey

www.quickthreadprogramming.com

>>...why is the kernel performing high context switching?...

Did you try to check priorities of your main process and all threads?

There is a possibility that TBB "switched" priorities to HIGH_PRIORITY_CLASS for the process and THREAD_PRIORITY_HIGHEST for all threads. That would explain why there are so many context switches.

On Windows platforms, these Win32 API functions will get the priorities:

...
if( ::GetPriorityClass( ::GetCurrentProcess() ) == HIGH_PRIORITY_CLASS )
...
if( ::GetThreadPriority( ::GetCurrentThread() ) == THREAD_PRIORITY_HIGHEST )
...

Note: I just realized that you do the job in Linux...
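For the Linux case, roughly equivalent checks would look like this (a sketch using standard POSIX calls; as noted further down, TBB's priority boost turns out to be Windows-only anyway):

#include <cerrno>
#include <cstdio>
#include <sched.h>        // sched_getscheduler
#include <sys/resource.h> // getpriority

int main() {
    // Nice value of the calling process: 0 is the default, negative means boosted.
    errno = 0;
    int nice_value = getpriority(PRIO_PROCESS, 0);
    if (errno == 0)
        std::printf("nice value: %d\n", nice_value);

    // Scheduling policy: SCHED_OTHER is normal time-sharing;
    // SCHED_FIFO or SCHED_RR would indicate a real-time priority boost.
    int policy = sched_getscheduler(0);
    std::printf("policy: %s\n",
                policy == SCHED_OTHER ? "SCHED_OTHER (normal)" : "real-time or other");
    return 0;
}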

At lower priorities, processes and threads will have fewer context switches and will spend more time on processing your data.

Best regards,
Sergey

So while Vishal still hasn't done a quick test to easily eliminate one possibility, and while mainly Jim is spending a lot more of his time writing lengthy arguments, the problem may have been hiding elsewhere...

Then again, this (undocumented) priority boost is only performed "#if _MSC_VER", and the platform in question is Linux.

>>the problem may have been hiding elsewhere...

Messing around with thread priorities is often counter-productive. The natural tendency is "my program is more important than yours/someone else's", therefore I set all my threads to the highest priority. And all other programmers make the same decision.

What is not stated by Vishal is whether he is doing something he ought not to be doing (which is obvious once you know you have been bitten). Pseudo code:

main()
{
    spawn 8 threads (pthreads, _beginthread, whatever)
    { // each thread
        tbb::init(...
        parallel_pipeline(...
    }
    join
}

In the above, you will be generating 8 TBB contexts. IOW, you will be oversubscribed by 8x.

Messing around with thread priorities "can be" productive in circumstances where you know your application has oversubscribed threads .AND. you know which threads need a boost. In TBB, tasks are not in control of which thread takes the task. Therefore you have no advance way of knowing which thread's priority to boost _prior_ to it taking the task request. This means your only choice is to raise the priority of your application's entire TBB thread pool, thus getting into a shoving match with other applications (or with your own application, in the event you choose to oversubscribe).

In TBB, you can resolve this in two ways:

a) Use extra non-TBB threads to perform non-TBB task work at higher priority. Doing so introduces a domain issue of how to migrate work requests between domains (in particular, starting/resuming TBB tasks). While this is not particularly difficult to do, it is not built into the architecture of TBB.

b) Add a thread priority boost at task termination, and a thread priority reduction at task start for lower-priority tasks. This adds overhead in constantly readjusting thread priorities.

BTW, in QuickThread this is a non-issue since it has two classes of threads (a higher-priority I/O class and a compute class) and task enqueues can choose a designated class.

Jim Dempsey

www.quickthreadprogramming.com

>>...Please provide some optimization tips...

- Did you try to execute the application on another computer with a different CPU, or with a different Linux OS?

- How much RAM and VM (Virtual Memory) does it use?

- What about the size of the VM?

- It is not clear how big the data sets are.

Best regards,
Sergey

Hello,

Sorry for not being able to perform the test by changing the token count to 3 or 5. I was out sick. I am back this morning and will try that today and post the results.

I appreciate all your comments and help. It is really helpful as I learn these new concepts.

Thank you,
Vishal

Jim/et al,

Here is my pseudo code, if it helps in understanding my implementation.

Here are my definitions of the Input, Process and Output classes. Each of these is derived from the tbb::filter class.

class Input : public tbb::filter {
public:
    Input(bool serialInput, int numtokens, int destid = 0);
    virtual ~Input();
    void* operator()(void*);
protected:
    void AllocateDataHandles(int tokens);
private:
    int tokens;
    int next;
    int DestID;
    vector flowData;
};

class Process : public tbb::filter {
public:
    Process(bool serialInput);
    virtual ~Process();
    void* operator()(void* rawdata);
};

class Output : public tbb::filter {
public:
    Output(bool serialInput);
    virtual ~Output();
    void* operator()(void* processeddata);
};

Here is my flow class that instantiates the Input, Process and Output filters and the pipeline, and runs the pipeline on its own thread:

class Flows {
public:
    Flows();
    virtual ~Flows();
    void SetupFlow(int id) {
        _input = new Input(true, 1, id);
        _process = new Process(true);
        _output = new Output(true);
        _pipeline.add_filter(*_input);
        _pipeline.add_filter(*_process);
        _pipeline.add_filter(*_output);
        assert(!_thread);
        _thread = boost::shared_ptr<boost::thread>(
            new boost::thread(boost::bind(&Flows::StartFlow, this)));
        assert(_thread);
    }
    void StartFlow() {
        int numThrds = 1;
        task_scheduler_init init;
        _pipeline.run(numThrds);
    }
    void StopFlow();
private:
    boost::shared_ptr<boost::thread> _thread;
    pipeline _pipeline;
    filter* _input;
    filter* _process;
    filter* _output;
};

#define NUMDEST 8

int main() {
    vector<Flows*> flows;
    Flows* flow;
    for (int i = 0; i < NUMDEST; i++) {
        flow = new Flows();
        flow->SetupFlow(i);
        flows.push_back(flow);
    }
    ...
    return 0;
}

As you can see, I am using the default constructor to initialize the task scheduler. No thread counts are given. I am assuming that the number of tokens per pipeline will dictate the number of threads that will exist.
Please tell me if I am doing anything which is not supposed to be done when using the Pipeline pattern in TBB.
Appreciate your comments.
Thank you,
Vishal

I have changed the number of tokens to 3. With this, I ran the application and started analysis using VTune. This time I noticed a high CPI rate, Retire Stalls, Instruction Starvation, and Execution Stalls in TBB Scheduler Internals, from the function tbb::internal::custom_scheduler::receive_or_steal_task(long&, bool).

Also, the thread_return function of the Linux kernel reported high instruction starvation.

Regards,
Vishal

Here is your problem:

a) Your main() is instantiating 8 (NUMDEST) boost threads.

b) Each boost thread is, with respect to TBB, a concurrent "main()", although TBB will not be aware of any concurrency.

c) Each boost thread issues task_scheduler_init, which allocates a full set of threads. (2x E5420 = 8 cores/hw threads). Making 64 threads.

Consider using

task_scheduler_init init(3); // or 2 or 1

Keep in mind that using 3 will create 24 threads (23 + 1 for main), 16 of which will be performing I/O. If the I/O requests are the bottleneck, then you will NOT have to worry about "spinwaits". However, if the compute section is the bottleneck, then when the I/O tasks run out of tokens they will enter a spinwait waiting for an additional token, and this time will be wasted. You might consider reducing the number of TBB threads per boost thread (change init(3) to init(2)), or editing the TBB source to reduce the spinwait time (it appears to be hardwired; correct me if I am wrong).

***
A better route to take is to NOT use boost threads.

Stay within TBB.

Experiment with something like:

main()
{
    task_scheduler_init init(nHWthreads + pendingIoThreads); // ?? 8 + ?4 ??
    parallel_invoke(
        [&]() { doPipeline(0); },
        [&]() { doPipeline(1); },
        [&]() { doPipeline(2); },
        [&]() { doPipeline(3); },
        [&]() { doPipeline(4); },
        [&]() { doPipeline(5); },
        [&]() { doPipeline(6); },
        [&]() { doPipeline(7); });
}

void doPipeline(int n)
{
    Flows Flow;
    Flow.SetupFlow(n); // *** remove boost thread creation
    Flow.StartFlow();
}

Jim Dempsey

www.quickthreadprogramming.com

"Each boost thread issues task_scheduler_init, which allocates a full set
of threads. (2x E5420 = 8 cores/hw threads). Making 64 threads."
task_scheduler_init is a reference to a shared structure, so only 7 TBB worker threads will be added in total. The problem must be something else.

"A better route to take is to NOT use boost threads."
That will not provide the required concurrency.

I don't see what would cause the problem, but I haven't really looked at pipeline for a while, and it has changed somewhat since then. There's a yield operation in there that may have something to do with it, but Vishal has kept the pace of trying things extremely slow so far even without blocking (if I may callously make a TBB joke here). The second experiment, after increasing the number of tokens, would be to keep that number at one and make the filters parallel, or apply the correct setting for each filter individually and try various numbers of tokens.

Further out, something may be tried with explicitly executing a filter on a specific thread.
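(The TBB mechanism for that is tbb::thread_bound_filter. A minimal sketch, unrelated to Vishal's code, in which the main thread services the input stage while run() executes on a second thread:)

#include <cstdio>
#include "tbb/pipeline.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/tbb_thread.h"

// Input stage bound to whichever thread calls process_item() on it.
class BoundInput : public tbb::thread_bound_filter {
    int n;
public:
    BoundInput() : tbb::thread_bound_filter(tbb::filter::serial_in_order), n(0) {}
    void* operator()(void*) { return (++n <= 10) ? new int(n) : NULL; } // NULL = end of stream
};

// Ordinary filter, executed by the TBB workers.
class PrintOutput : public tbb::filter {
public:
    PrintOutput() : tbb::filter(tbb::filter::serial_in_order) {}
    void* operator()(void* item) {
        std::printf("%d\n", *static_cast<int*>(item));
        delete static_cast<int*>(item);
        return NULL;
    }
};

static void RunPipeline(tbb::pipeline* p) { p->run(4); }

int main() {
    tbb::task_scheduler_init init;
    BoundInput in;
    PrintOutput out;
    tbb::pipeline p;
    p.add_filter(in);
    p.add_filter(out);
    tbb::tbb_thread t(RunPipeline, &p); // run() must not execute on the servicing thread
    while (in.process_item() != tbb::thread_bound_filter::end_of_stream)
        continue;                       // this thread is now tied to the input stage
    t.join();
    return 0;
}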

Only afterwards would it seem useful to go deeper into the implementation to find out.

I just hope it's not some unavoidable consequence of keeping the arenas separate, which was done at least partially to avoid any entanglement between pipelines run from different threads that would destroy guaranteed concurrency.

Of course, if you don't have the time or inclination to do any of that, just use plain old threads, because I'm not sure this program is making any use of what TBB has to offer.

Hi Raf,
As mentioned in one of my earlier replies, I was out sick and could not test the approach suggested to me earlier. Second, Intel's TBB is a new concept for me. I am eager to try out the suggestions made by experts such as yourself and to learn in the process. I have described my implementation in one of my replies today. If you think that the program does not make use of what TBB has to offer, please help me understand how I can make use of TBB correctly. I am ready to provide any information that may help you make better suggestions.

Best Regards,
Vishal

From my previous posting:
"The second experiment, after increasing the number of tokens, would be
to keep that number at one and make the filters parallel, or apply the
correct setting for each filter individually and try various numbers of
tokens."

I don't have high hopes, but at least the program will look more like a real TBB program, making this more generally relevant.

I just looked up the specifications for your CPU, and I'm seeing a processor from 2007, with 4 cores with 4 threads, so that would be without hyperthreading. With 8 threads to begin with, you're already potentially oversubscribing the machine. I have to admit that I can't confidently predict what can be expected here, but context switching should not be a big surprise in the presence of oversubscription.

(Added) Even if the input and output filters should be serial_in_order where the middle filter can be parallel, also try making them all parallel with just 1 token.

To begin with, I had the following settings:

- Input filter: Serial (is_serial = true)
- Process filter: Parallel (is_serial = false)
- Output filter: Serial (is_serial = true)
- Number of pipelines = 8 (1 per destination)
- Number of tokens = 1 (per pipeline)

I built and ran the application on an Intel Xeon CPU X5680 @ 3.33GHz, which is a faster and newer processor (it has 24 hardware threads). This should eliminate a potential oversubscription issue. The VTune performance analyzer reported high CPI and instruction stalls in the finish_task_switch function of vmlinux, the same as when running on the E5420 @ 2.50GHz CPU.

Next I changed the settings as follows:

- Input filter: Serial (is_serial = true)
- Process filter: Serial (is_serial = true)
- Output filter: Serial (is_serial = true)
- Number of pipelines = 8 (1 per destination)
- Number of tokens = 3 (per pipeline)

I built and ran the application on the Intel Xeon CPU X5680 @ 3.33GHz. The VTune performance analyzer reported high CPI and instruction stalls in the TBB Scheduler internals and in the thread_return function of vmlinux, the same as when running on the E5420 @ 2.50GHz CPU.

Next I am working on the following settings:

- Input filter: Parallel (is_serial = false)
- Process filter: Parallel (is_serial = false)
- Output filter: Parallel (is_serial = false)
- Number of pipelines = 8 (1 per destination)
- Number of tokens = 1 (per pipeline)

I am building and running the application on the Intel Xeon CPU X5680 @ 3.33GHz.

Regards,
Vishal

So does the all-parallel situation run more efficiently or not? The idea is that it would run all on the same thread without stealing, and that might provide a clue, especially in comparison with a program that eliminates the pipeline and just repeatedly calls the filters in order.

Something else you could do is use task_scheduler_init with just 1 thread, to avoid the generation of any TBB worker threads that might go around stealing work.
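Concretely, that experiment is just the following (a sketch, assuming the init is moved into main() and done once):

#include "tbb/task_scheduler_init.h"

int main() {
    // 1 = the master thread only: no TBB workers exist, so nothing can steal.
    tbb::task_scheduler_init init(1);
    // ... set up and run the pipelines as before ...
    return 0;
}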

If that doesn't show anything interesting, I would have to defer to somebody with more experience running TBB from multiple user threads (I prefer a setup without any significant blocking inside TBB code), sorry. Perhaps somebody from Intel has an idea?

Vishal,

Could you make a test program using the parallel_invoke outline above? This should be relatively easy for you to do.

Note, I understand that your actual application may not be able to use the parallel_invoke due to variable numbers of pipelines and/or irregular start/finish times. The parallel_invoke method posted earlier is a quick way of making an assessment as to the route to take.

For an unknown number of pipelines (at programming time), but where all the pipeline requirements are determined at initialization time, AND where all these will be launched and run to completion (as if a single task), THEN consider using the technique from the parallel Fibonacci method (with a slight change):

void doPipeline(int n, int m)
{
    if (n == m)
    {
        Flows Flow;
        Flow.SetupFlow(n); // *** remove boost thread creation, remove init
        Flow.StartFlow();
        return;
    }
    int step = (m - n) / 2;   // split the inclusive range [n, m] in half
    parallel_invoke(
        [&](){ doPipeline(n, n + step); },
        [&](){ doPipeline(n + step + 1, m); });
}
...
int main(...)
{
    task_scheduler_init init(nHWthreads + pendingIoThreads);
    doPipeline(0, nFiles - 1); // inclusive upper bound
}

Jim Dempsey

www.quickthreadprogramming.com

Hello Jim and Raf,

It may be a while before I get to this as I have been pulled into looking at a critical issue. Please bare with me.

Thank you,
Vishal

Vishal,

In an earlier post I mentioned diminishing returns relating to increasing the number of parallel pipelines as it affects I/O times, specifically seek times. Not having your code makes it hard to assess your situation.

As a test example I used my threading toolkit, QuickThread, which has a parallel_pipeline; however, I am experimenting with something new at this time called Parallel Manifold. With Parallel Manifold you can construct the equivalent of a parallel_pipeline: file read stage, parallel process stage, file write stage. Parallel Manifolds are much more flexible than parallel_pipelines. Parallel Manifolds are a data-flow and resource-flow paradigm using the QuickThread task scheduler (similar to TBB in some respects).

The test program is a variant of the parallel_pipeline example in the first TBB book: read a large sequential file, uppercase each word, write an output file. The upcase pipe is run in parallel.

My system is a Core i7 2600K Sandy Bridge, 4 cores with HT, running Windows 7 Pro x64.

For the test, I wanted to reduce disk seek latencies as much as possible. I placed the input file (1.3GB) on D:, a Seagate ST3160811AS, and wrote the output file(s) to C:, a Seagate ST3750630AS.

The file I/O uses simple fopen, fread, fwrite. IOW, no asynchronous I/O to Windows. I/O buffers were set at a respectable 128KB.

Baseline serial version of the process:

49.8 seconds, 26.7MB/s Read, 53.37MB/s Read+Write

"Parallel pipeline" using the Parallel Manifold paradigm:

10 threads: 8 compute class threads, 2 I/O class threads
(QuickThread has two classes of threads)

10.7 seconds, 124.0MB/s Read, 241.6MB/s Read+Write, 4.65x that of serial

Note, I/O is one of the limiting factors.

Now for the part of this discussion that focuses on your problem.

Modify the above code to run eight instances of the above process (in one process, using one thread team).

All eight "parallel pipelines" used the same input file but different output files.
These pipelines (Manifolds) were not phase synchronized, so the shared input file will not necessarily be read from the same position, nor re-use the same I/O buffer. Therefore reads might be received faster than if on separate files; however, read seek latencies and disk cache eviction will not be eliminated.

24 threads were used: 8 compute class threads + 16 I/O class threads
(each pipeline has a Read and a Write thread available, same as the single pipeline earlier)

All 8 files in parallel

170.8 seconds, 62.2MB/s Read, 124.5MB/s Read+Write, 2.4x that of serial

** Note this is 8x the data.

Looking at one of the pipelines:

170.8 seconds, 7.8MB/s Read, 15.5MB/s Read+Write, 0.3x that of serial

** (1/8 of the aggregate)

The important figures are:

241.6MB/s for single pipeline is reduced to
124.5MB/s for eight concurrent pipelines

What this means is that (on this system) the ratio of I/O latency to compute time does not favor using additional parallel pipelines. *** This is not a general statement. If you have a fast RAID system, I/O latency will be (should be) lower, changing the ratio (as to whether for good or bad, your mileage will vary).

Jim Dempsey

www.quickthreadprogramming.com

"Please bare with me."
I categorically refuse!

P.S.: But I will "bear" with you. :-)

When considering alternatives, I would probably do all the I/O on a separate thread (for lack of QuickThread's I/O tasks), building data items into a concurrent_queue, and use parallel_while/do to execute work with its feeder taking work from the queue. In the separate thread, select/poll would detect what input is ready, and parallel_for would process the current batch to prevent starvation of any input.
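A rough, self-contained sketch of that idea (Item and IoThread are hypothetical stand-ins for the real endpoint input; note the blocking pop inside the body, which is precisely the sort of thing questioned below):

#include "tbb/concurrent_queue.h"
#include "tbb/parallel_do.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/tbb_thread.h"

struct Item { int payload; };

static tbb::concurrent_bounded_queue<Item*> g_queue; // bounded, to throttle the reader
static const int kChains = 4;                        // concurrent consumer chains

// Stand-in for the I/O thread: the real one would select/poll the endpoints and
// push complete messages; one NULL per chain signals end of input.
static void IoThread() {
    for (int i = 0; i < 100; ++i)
        g_queue.push(new Item());
    for (int i = 0; i < kChains; ++i)
        g_queue.push(NULL);
}

struct Body {
    void operator()(Item* item, tbb::parallel_do_feeder<Item*>& feeder) const {
        if (!item) return;     // sentinel: this chain is done
        delete item;           // stand-in for the real processing
        Item* next = NULL;
        g_queue.pop(next);     // blocking pop: ties up a TBB worker while waiting
        feeder.add(next);      // keep the parallel_do alive
    }
};

int main() {
    tbb::task_scheduler_init init;
    g_queue.set_capacity(64);  // the reader blocks once 64 items are waiting
    tbb::tbb_thread io(IoThread);
    Item* seeds[kChains];
    for (int i = 0; i < kChains; ++i)
        g_queue.pop(seeds[i]); // seed the parallel_do with a few chains
    tbb::parallel_do(seeds, seeds + kChains, Body());
    io.join();
    return 0;
}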

But I am still interested in the root cause of the perceived problem, because guaranteed concurrency between separate arenas is meant to be a feature.

>>I would probably do all the I/O on a separate thread...building data items into a concurrent_queue, and use parallel_while/do to execute work

You would potentially want multiple threads, one for each (potential) I/O.
The TBB concurrent_queue does not have a task initiator on first fill situation. Sure you can do the select/poll, but this is equivalent to a spinwait (with task stealing). One potential problem you have is when the I/O exceeds the processing capacity. In this case you may consume memory with unprocessed data. You can correct this by throttling the I/O thread (e.g. count, or starvation for buffers), but then you must insert domain communication (e.g. WaitForSingleEvent or condition variable(s)). This is unclean and open to programming error.

The Parallel Manifolds could be thought of as similar to concurrent_queue, in respect to being FIFO MPMC, but they also offer ordered output and various combinations of MP, SP, MC, SC; more importantly, they provide a task trigger mechanism, e.g. on first fill a manifold will initiate the consumer task(s). The manifolds are more complex than the simple description above, as you can have multiple input ports and/or output ports, each fed by one or more threads, with the trigger firing the consumer task only when all ports are satisfied and there is an available consumer. You can interlink the manifolds as you please (as long as it makes sense with your data/resource flow).

Jim Dempsey

www.quickthreadprogramming.com

>>P.S.: But I will "bear" with you. :-)

Good catch ;)

Perhaps this should be "beer" with me.

Jim

www.quickthreadprogramming.com

"You would potentially want multiple threads, one for each (potential) I/O."
If it doesn't scale well, why go there at all?

"The TBB concurrent_queue does not have a task initiator on first fill situation."
Indeed, that would be nice to have, I've been thinking about this myself on several occasions.

"Sure you can do the select/poll, but this is equivalent to a spinwait (with task stealing)."
How so?

"One potential problem you have is when the I/O exceeds the processing capacity. In this case you may consume memory with unprocessed data. You can correct this by throttling the I/O thread (e.g. count, or starvation for buffers), but then you must insert domain communication (e.g. WaitForSingleEvent or condition variable(s)). This is unclean and open to programming error."
True, #29 was rather rash, even without the objection that parallel_while/do would jumble the data items between input and output (a single pipeline seems better, even if it imposes a global order above the desired per-channel order, probably adversely affecting latency, and it would definitely need complete messages as input to avoid outright starvation). But I hope I would have come to my senses early enough to try the relatively new TBB flow graph with libevent instead (although I secretly wish that Vishal doesn't do so before the current problem is diagnosed and perhaps solved...).

>>"You would potentially want multiple threads, one for each (potential) I/O."
If it doesn't scale well, why go there at all?

When a process incorporates I/O, scaling is not based upon cores alone. Rather, it becomes an interrelationship amongst:

threads (not cores)
I/O subsystem
O/S driver capability

Vishal's problem description has 16 files (at least from my understanding of these forum threads).
If you do not oversubscribe the threads, then the time interval (say) between the file read statement and the return from the read (I/O completion) is stall time for the issuing thread. IOW, for each pending I/O you have one less compute thread available for doing work in your TBB thread pool (assuming TBB threads are performing the I/O). To resolve this (within TBB), you oversubscribe threads under the assumption/requirement that you will have pending I/O operations. The degree of oversubscription would depend upon the I/O subsystem and the I/O latency compared against the compute requirements. In situations where the (sum of) I/O latencies is a very small fraction of the overall run time, you would not oversubscribe. As this ratio (I/O latency : compute) increases, you may find it beneficial to add additional threads.
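In TBB terms, the knob being described is the thread count passed to task_scheduler_init (a sketch; nPendingIO is a hypothetical starting guess to be tuned by experiment):

#include "tbb/task_scheduler_init.h"

int main() {
    const int nPendingIO = 4; // hypothetical: reads/writes typically blocked at once
    tbb::task_scheduler_init init(
        tbb::task_scheduler_init::default_num_threads() + nPendingIO);
    // ... build and run the pipelines ...
    return 0;
}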

Observing the Task Manager (or the equivalent on Linux) can be used as an indicator for tuning the degree of oversubscription. In QuickThread this is a non-issue due to the dual classes of threads; as a result, the compute class can be maintained at the full complement of available hardware threads.

FWIW, when running the 8 concurrent pipelines, the I/O queue depth chart of the Resource Monitor showed a jagged 7-9 queued writes on the output disk. Only 8 could be pending from my app, so this must be an O/S buffered lazy-write issue. The input disk showed ~0 queued reads. This indicates that the entire input file had been cached, either by the disk read-ahead or by the system file cache (the system has more RAM than the file size). Had this been run without the additional threads, then either compute starves or I/O starves.

The published specification for this output drive was 120MB/s sustained writes. My single pipeline (Parallel Manifold) attained 124MB/s; the faster-than-published speed may be attributable to a sweet spot on the disk. The 8-pipeline aggregate write speed was 26.7MB/s; the drop in performance is due to the seek latencies introduced by writing to 8 different output files (i.e. 8 separate output streams with significant seek distances, as opposed to streaming to a single file with track-to-track seeking). Had multiple output drives been available, then targeting multiple output drives may have been in order.

Jim Dempsey

www.quickthreadprogramming.com

Sorry for the confusion, I meant specifically that the program architecture does not scale well to many channels if they each require a software thread. And I don't think the original problem was about files on disk, not explicitly anyway.

>>Sorry for the confusion, I meant specifically that the program architecture does not scale well to many channels if they each require a software thread. And I don't think the original problem was about files on disk, not explicitly anyway.

Raf, if you dig through my comments you will find they indicate that using more pipelines than your I/O subsystem is capable of _effectively_ using is counterproductive. The 8-way pipeline using two disks shows a drop in aggregate throughput (to about half).

Additionally, the optimal tuning for I/O threads would be such that the disk queue has no more entries than necessary to keep it from falling to 0 (during peak periods of low-latency transactions).

To illustrate the point, I will tune for 8 compute threads, 1 I/O thread, and two parallel pipelines (using Parallel Manifolds), each pipeline writing to a separate disk...

1 I/O thread: 124.77MB/s (write), virtually no difference from using 4 threads (for the dual pipeline)
2 I/O threads: 124.48MB/s

Reconfigure to use compute class threads for I/O (0 I/O threads), as one would for a TBB pipeline:

0 I/O threads: 65.71MB/s

Oversubscribe the compute class by 1 thread, 0 I/O class threads:

+1 compute, 0 I/O: 66.03MB/s

As you can observe, oversubscribing the compute class by a single thread (as with TBB) yields a slight performance gain (from 65.71MB/s to 66.03MB/s), whereas using an additional thread separated from the compute class (as in QuickThread) yields a significant performance gain (65.71MB/s to 124.77MB/s).

I fully expect that using the TBB parallel pipeline in conjunction with non-TBB threads for performing the I/O would yield a similar improvement over using TBB threads for I/O.

A little extra work can pay big dividends.

Jim Dempsey

www.quickthreadprogramming.com

OK, one last try... I proposed one I/O thread (which doesn't seem such a bad idea according to #35), you countered with "one thread per I/O" (which I interpreted as O(8) threads, basically the situation that this forum thread is really about), I objected to that (which I think you interpreted as objecting to 1 thread per disk device), and I hope we can now agree that really we more or less agreed all along. :-)

Optimal tuning for physical devices seems like a difficult problem, perhaps beyond the scope of this forum. Still, your mileage may differ with different file systems: have you seen the same degradation with concurrent pipelines using, say, ZFS, even on the same number of disks (just from hearsay, though)?
