Hi, I have a C# workload built against .NET 4.6.1 using the Task Parallel Library (TPL), running on Windows 10 Fall Creators Update (.NET 4.7.1). Two IO tasks run in parallel, P/Invoking CreateFile and ReadFile to read the first 8 kB of each file in an array from an SSD. Two compute tasks then pick up these 8 kB chunks and call through a C++/CLI layer into C++ for some SIMD number crunching, staying in ring 3 throughout. The IO tasks share a C# lock statement to increment the file array index, and the compute tasks use a second, independent lock to increment through the chunks. The compute tasks contain some brief delay logic in case compute gets ahead of IO but, as IO is consistently faster, instrumentation shows it fires at most once, at the start of processing.
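To make the structure concrete, here's a minimal sketch of the pipeline shape described above (all names are illustrative, not from the real codebase; FileStream stands in for the CreateFile/ReadFile P/Invokes, DoSimdWork for the C++/CLI-to-C++ SIMD call, and chunk ordering/termination are simplified):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class Pipeline
{
    const int ChunkBytes = 8 * 1024;
    readonly string[] _files;
    readonly List<byte[]> _chunks = new List<byte[]>();
    readonly object _ioLock = new object();      // first lock: shared file-array index
    readonly object _computeLock = new object(); // second, independent lock: chunk index
    int _nextFile, _nextChunk;

    public Pipeline(string[] files) { _files = files; }

    void IoTask() // real code P/Invokes CreateFile/ReadFile instead of using FileStream
    {
        var buffer = new byte[ChunkBytes];
        while (true)
        {
            int index;
            lock (_ioLock) { index = _nextFile++; }
            if (index >= _files.Length) return;
            using (var fs = new FileStream(_files[index], FileMode.Open, FileAccess.Read))
            {
                int read = fs.Read(buffer, 0, ChunkBytes);
                var chunk = new byte[read];
                Array.Copy(buffer, chunk, read);
                lock (_computeLock) { _chunks.Add(chunk); }
            }
        }
    }

    void ComputeTask() // real code calls through C++/CLI into native SIMD here
    {
        while (true)
        {
            byte[] chunk = null;
            lock (_computeLock)
            {
                if (_nextChunk >= _files.Length) return;      // every file yields one chunk
                if (_nextChunk < _chunks.Count) chunk = _chunks[_nextChunk++];
            }
            if (chunk == null) { Thread.Sleep(1); continue; } // brief delay: compute got ahead of IO
            DoSimdWork(chunk);
        }
    }

    static void DoSimdWork(byte[] chunk) { /* placeholder for the native SIMD crunch */ }

    public void Run()
    {
        Task[] io = { Task.Run(IoTask), Task.Run(IoTask) };
        Task[] compute = { Task.Run(ComputeTask), Task.Run(ComputeTask) };
        Task.WaitAll(io);
        Task.WaitAll(compute);
    }
}
```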
As a baseline, I'm profiling a 6000-file case. Running the two IO tasks first, waiting for them to complete, and then running the two compute tasks gives 1.6 s for IO (3800 file reads/s) and 1.8 s for compute (3400 file chunks/s). Since the phases run back to back, total time is 3.4 s, for an all-up throughput of 1800 files/s. Since the test processor is dual core and hyperthreaded (i5-4200U, Haswell), one would expect running all four tasks in parallel to complete in close to the 1.8 s limiting duration of the compute phase. Unfortunately, that's not what happens. Instead, IO completes in 1.7 s (a drop to 3500 files/s) and compute degrades from 1.8 to 2.6 s, a rather precipitous drop from 3400 to only 2300 files/s. While this is still a decent improvement over the 1800 files/s of sequential operation, it leaves a 900 ms window after IO completes in which only the two compute tasks are running on four logical processors. One might reasonably expect spinning up two more compute tasks at this point to shorten that window to 450 ms, since doing so doubles the processor resources allocated to CPU-bound work. But that's not what happens either: instead of pushing overall throughput up to 2800 files/s, the compute time remains 2.6 s despite the extra processing power. Curiously, a single compute task with nothing else running also takes about 2.6 s, even though inspection of performance counters shows good load balancing between two compute tasks.
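For reference, the sequential-versus-overlapped comparison is timed with a harness along these lines (a sketch only; StartIoTasks/StartComputeTasks are hypothetical factories wrapping the pipeline described above, returning the tasks already running):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class Timing
{
    // Hypothetical: return the n IO / n compute tasks, started.
    static Task[] StartIoTasks(int n) => throw new NotImplementedException();
    static Task[] StartComputeTasks(int n) => throw new NotImplementedException();

    static double Seconds(Action run)
    {
        var sw = Stopwatch.StartNew();
        run();
        return sw.Elapsed.TotalSeconds;
    }

    static void Main()
    {
        const int files = 6000;

        // Sequential baseline: IO runs to completion, then compute.
        double io = Seconds(() => Task.WaitAll(StartIoTasks(2)));
        double compute = Seconds(() => Task.WaitAll(StartComputeTasks(2)));
        Console.WriteLine($"sequential: {io + compute:F1} s, {files / (io + compute):F0} files/s");

        // Overlapped: all four tasks in flight at once.
        double overlapped = Seconds(() =>
        {
            var ioTasks = StartIoTasks(2);
            var computeTasks = StartComputeTasks(2);
            Task.WaitAll(ioTasks);
            Task.WaitAll(computeTasks);
        });
        Console.WriteLine($"overlapped: {overlapped:F1} s, {files / overlapped:F0} files/s");
    }
}
```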
Additionally, this is a best case. Sometimes performance drops as low as 1300 files/s. Occasionally that seems attributable to other system load, but most of the time there's a drop to 1800 files/s with no other obvious load on the box. From some experiments with setting thread affinity, it appears this drop occurs when both IO tasks land on one core and both compute tasks on the other. The more typical 2300 files/s case seems to be associated with each core running one IO task and one compute task.
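The affinity experiments amounted to pinning each loop's OS thread, roughly as follows (a sketch; since TPL tasks can migrate between thread-pool threads, pinning here uses dedicated threads, and the masks assume the common Windows enumeration where logical processors 0-1 are the hyperthreads of core 0 and 2-3 of core 1):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

static class Affinity
{
    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentThread(); // pseudo-handle for the calling thread

    [DllImport("kernel32.dll")]
    static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr mask);

    // Run 'work' on a dedicated thread pinned to the logical processors in 'mask'.
    public static Thread RunPinned(Action work, uint mask)
    {
        var t = new Thread(() =>
        {
            SetThreadAffinityMask(GetCurrentThread(), (UIntPtr)mask);
            work();
        });
        t.Start();
        return t;
    }
}

// e.g. forcing the "slow" 1800 files/s configuration: both IO loops on
// core 0 (mask 0x3), both compute loops on core 1 (mask 0xC).
// Affinity.RunPinned(IoLoop, 0x3);      Affinity.RunPinned(IoLoop, 0x3);
// Affinity.RunPinned(ComputeLoop, 0xC); Affinity.RunPinned(ComputeLoop, 0xC);
```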
I've attempted to have a look in VTune (Parallel Studio XE 2018 Update 1), but it consistently BSODs the box shortly after the target executable starts, so no information is available from it. However, bandwidth here is only about 30 MB/s, so I'd not expect any pressure on L3, and all operations have sequential stride, so 4K aliasing shouldn't be an issue. Profiling in Visual Studio 2017 indicates only the expected CPU hotspots and shows no contention on either the IO or compute lock. Oddly, VS does indicate some shared handle contention at the C++/CLI-to-C++ and P/Invoke sites, but the numbers are inconsistent with the observed delays, and inspection of the release build disassembly shows no critical sections at these points, so I suspect this is just VS registering the insertion of its own contention instrumentation. Also, the codebase contains another quad-thread SIMD workload which runs through the same classes but doesn't exhibit this scalability problem (it runs about twice as fast on four logical processors as on two, as expected). The difference is that workload is initiated from a single-threaded C++/CLI transition and then invokes concurrency::parallel_for at the C++ level.
Any suggestions as to how else to take this apart and figure out what's going on? Pushing the compute tasks down into C++ or C++/CLI isn't really an option: in addition to the SIMD, they need to update data structures defined in a dependent C# assembly and make some computationally lightweight but functionally critical C# calls into managed dependencies.