Sorry for the long post but I hope some of you may find this unusual application of TBB interesting.
I am trying to build an event processing framework using TBB. Based on helpful comments received in another thread, I have a few questions about the right design for this system and, in fact, about whether TBB is the right solution for this problem at all. I appreciate your suggestions.
The system consists of a graph of between 10 and 50 Event Processors (EPs). Each EP receives an event, updates its state, and optionally emits another event. An event is a C++ struct (no methods) consisting of a vector of n doubles, where n is not constant. An EP can subscribe to one or more EPs and can publish its results to one or more subscribers. The only exceptions are Initial EPs (which don't subscribe to anything, e.g. random number generators) and Terminal EPs (which don't publish anything, e.g. EPs that write results to a text file). The system can have more than one initial/terminal EP. We can loosely group EPs into Stages, e.g. initial, intermediate1, intermediate2, ..., terminal.
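To make the topology concrete, the structure described above could be sketched like this (all type and member names here are illustrative, not from any existing code; real EPs are functors, but the subscription wiring is the same idea):

```cpp
#include <cstddef>
#include <vector>

// An event is a plain struct: a vector of n doubles, n not constant.
struct Event {
    std::vector<double> values;
};

// Illustrative base class showing the publish/subscribe wiring.
struct EventProcessor {
    std::vector<EventProcessor*> subscribers;  // who we publish to

    virtual ~EventProcessor() {}

    // Receive an event, update state, optionally emit another event.
    virtual void onEvent(const Event& e) = 0;

protected:
    void publish(const Event& e) {
        for (std::size_t i = 0; i < subscribers.size(); ++i)
            subscribers[i]->onEvent(e);
    }
};

// Terminal EP: subscribes but never publishes (e.g. writes to a file).
struct Printer : EventProcessor {
    std::size_t seen = 0;
    void onEvent(const Event& e) override { seen += e.values.size(); }
};

// Intermediate EP that "reduces": emits only every 10th event.
struct Sampler : EventProcessor {
    int count = 0;
    void onEvent(const Event& e) override {
        if (++count % 10 == 0) publish(e);
    }
};
```

Wiring a Sampler to a Printer and pushing 100 three-double events through it delivers 10 events (30 doubles) downstream, which is the kind of per-stage reduction described below.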
We are looking at processing millions of events. The number of events generated at each successive stage is progressively lower (since we "reduce" at each EP, e.g. sample every 10th event and emit it). To summarize, EPs are:
1. short-running and fast,
2. implemented as functors, and
3. frequently blocked (since they cannot do anything until the next event arrives).
To make life more complex, some EPs are synchronous, i.e. they cannot process the next incoming event until ALL successor EPs have processed the last event emitted. Others are asynchronous, i.e. they process an event as soon as it comes in, emit it to all their subscribers (who enqueue it until they are ready), and wait for the next event. We do not know whether an EP is synchronous until the program starts running. An EP can have a combination of sync and async subscribers. Async EPs need to handle queue overflow, but let's assume we truncate the queue on overflow.
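For the async case, each subscriber's inbox could look roughly like the bounded queue below, which truncates (drops the new event) on overflow instead of blocking the publisher. This is a TBB-free sketch using the standard library; `Inbox`, `try_push`, and `capacity` are made-up names, and with TBB one would reach for a concurrent bounded queue instead:

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Bounded inbox for an async EP. On overflow we truncate: the new
// event is dropped rather than blocking the publisher.
template <typename T>
class Inbox {
public:
    explicit Inbox(std::size_t capacity) : capacity_(capacity) {}

    // Returns false if the event was dropped because the queue is full.
    bool try_push(const T& item) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.size() >= capacity_) return false;  // truncate on overflow
        queue_.push_back(item);
        ready_.notify_one();
        return true;
    }

    // Blocks until an event is available (the EP "blocks on a pop").
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        T item = queue_.front();
        queue_.pop_front();
        return item;
    }

private:
    std::size_t capacity_;
    std::deque<T> queue_;
    std::mutex mutex_;
    std::condition_variable ready_;
};
```

A sync subscriber would use the same inbox but delay its ack until processing completes, which is where the ack queue in the loop below comes in.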
Most important of all, EPs can (and, for performance reasons, should) run in parallel: if an EP publishes to three subscribers, all three can run simultaneously. But there are situations where all of them need to be notified in sequence and, as you guessed, this is not known until runtime. The silver lining is that this is decided at startup and doesn't change during execution.
The initial solution looked like this and had horrible performance:
1. Each EP was implemented as an object with a TBB queue that blocks on a pop.
2. Start all non-initial EPs - all of which block waiting for an input event - using parallel_for.
3. Start all initial EPs in a parallel_for.
4. Each EP pops its concurrent input-event queue in an infinite loop, processes the event, dynamically creates an output event, and notifies its subscribers, i.e. enqueues the event pointer onto each subscriber's queue. The loop ends when a special END event is received.
5. For both sync and async processing, the EP waits on an ack queue. A sync subscriber won't push the ack until its processing is complete; an async subscriber pushes the ack immediately and then continues processing the event.
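Pulling steps 4 and 5 together, the per-EP worker loop looks roughly like the following. This is a deliberately single-threaded sketch (plain std::queue, no blocking), just to pin down the loop semantics; a null pointer stands in for the special END event, and `Subscriber` and `notify_all` are illustrative names:

```cpp
#include <queue>
#include <vector>

struct Event {
    std::vector<double> values;
};

// One subscriber's inbox. Whether it is synchronous is decided at
// startup and fixed thereafter.
struct Subscriber {
    std::queue<const Event*> inbox;
    bool synchronous = false;
};

// Step 4's publish: enqueue the event pointer to every subscriber.
// In the real code the publisher then waits on its ack queue (step 5):
// sync subscribers ack after processing, async ones ack immediately.
void notify_all(std::vector<Subscriber>& subs, const Event* e) {
    for (std::size_t i = 0; i < subs.size(); ++i) subs[i].inbox.push(e);
}

// Worker loop of one EP: pop, process, publish, until END (nullptr),
// then propagate END downstream. Returns the number of events processed.
int run(std::queue<const Event*>& input, std::vector<Subscriber>& subs) {
    int processed = 0;
    while (!input.empty()) {
        const Event* e = input.front();
        input.pop();
        if (e == nullptr) break;   // the special END event
        ++processed;               // "process": state update goes here
        notify_all(subs, e);       // publish to all subscribers
    }
    notify_all(subs, nullptr);     // tell subscribers to shut down too
    return processed;
}
```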
I understand that parallel_for is a bad idea here. An alternative I can think of is to have a thread pool for each Stage. Because we create millions of events dynamically, I believe I should use the scalable allocator.
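Regarding the scalable allocator: since each event owns a vector of doubles, both the event objects and their payloads can be routed through it. A sketch of the wiring, shown with std::allocator as a stand-in so it compiles without TBB (with TBB, the alias would point at tbb::scalable_allocator<double> instead):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Allocator for event payloads. With TBB available this would be
//   using PayloadAlloc = tbb::scalable_allocator<double>;
// std::allocator is a stand-in so the sketch builds standalone.
using PayloadAlloc = std::allocator<double>;

struct Event {
    std::vector<double, PayloadAlloc> values;
};

// Millions of events are created dynamically; routing allocations
// through a scalable allocator reduces contention on the global heap.
Event* make_event(std::size_t n, double fill) {
    Event* e = new Event;  // the Event itself could also come from scalable_malloc
    e->values.assign(n, fill);
    return e;
}
```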
What we have is workflow parallelism, not HPC-style parallel matrix manipulation. My questions are:
1. Is TBB the right approach for this system?
2. If yes, what can and should be done to improve the performance?