My program does not actually use multithreading; it just divides the calculation among several processes, and the parent process collects the results in shared memory once they finish. All the code is CPU-bound, with no I/O and a relatively small footprint (20 MB per process).
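To make the setup concrete, here is a minimal sketch of that pattern in Python. It is not my actual code: the names `partial_sum` and `run_split` are made up for illustration, the sum of squares is only a stand-in for the real calculation, and it assumes a POSIX system where the `fork` start method is available.

```python
import multiprocessing as mp


def partial_sum(slot, start, stop, results):
    # CPU-bound stand-in for the real calculation: sum of squares over [start, stop).
    total = 0
    for i in range(start, stop):
        total += i * i
    results[slot] = total  # each child writes only to its own slot


def run_split(n_procs, n_items):
    # Assumption: POSIX system, so the "fork" start method exists.
    ctx = mp.get_context("fork")
    # Shared memory: one signed 64-bit slot per child, no lock needed
    # because no two children ever write to the same slot.
    results = ctx.Array("q", n_procs, lock=False)
    chunk = (n_items + n_procs - 1) // n_procs  # ceiling division
    procs = []
    for slot in range(n_procs):
        start = min(slot * chunk, n_items)
        stop = min(start + chunk, n_items)
        p = ctx.Process(target=partial_sum, args=(slot, start, stop, results))
        p.start()
        procs.append(p)
    # The parent waits for every child, then combines the partial results.
    for p in procs:
        p.join()
    return sum(results)
```

Calling `run_split(4, 1000)` splits the range into four chunks and should give the same total as a single-process loop over the whole range.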
I started with this approach when the first Pentium 4s with Hyper-Threading came out, but it did not improve the speed at all.
With the advent of true multicore processors (Pentium D), I got an almost linear speedup with the number of cores.
So my logic was pretty straightforward: start as many processes as there are cores, and ignore the threads. Then I tested the program on an Atom N270, which has one core and two threads, and the program was about 30% faster when I ran it with two processes. I did not know what to make of that.
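For reference, this is roughly how the "one process per core" rule could be computed. It is a sketch that assumes Linux, where sysfs exposes a `thread_siblings_list` file for each logical CPU; the function names `logical_cpus` and `physical_cores` are illustrative, and on other systems the sketch simply falls back to the logical count.

```python
import glob
import os


def logical_cpus():
    # Number of hardware threads the OS sees (cores x threads per core).
    return os.cpu_count() or 1


def physical_cores():
    # Assumption: Linux sysfs layout. Each logical CPU lists the hardware
    # threads that share its core; threads on the same core report the
    # same list, so the number of distinct lists is the number of cores.
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    groups = set()
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                groups.add(f.read().strip())
        except OSError:
            continue
    # Fall back to the logical count when the topology is not readable.
    return len(groups) if groups else logical_cpus()
```

On a one-core, two-thread CPU like the N270 I would expect `physical_cores()` to be 1 and `logical_cpus()` to be 2, which is exactly the case where my rule starts too few processes.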
But recently I tested my program on a 4-core, 8-thread processor (i7-3610QM), and I got surprising results. My best speedup was when I started 6 processes, not 4 or 8. These are the times:
Since I cannot test my software on every processor configuration, I need some reliable formula to calculate the optimal number of processes for the best speedup.
I could also run a benchmark and set the optimal value during program installation, but I dislike such a zero-knowledge approach.
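The benchmark fallback would look something like the sketch below: run a fixed amount of work split across 1, 2, ... processes and keep the fastest count. The names `burn`, `time_run`, and `best_process_count` are hypothetical, the workload is a placeholder loop rather than my real calculation, and the `fork` start method again assumes a POSIX system.

```python
import multiprocessing as mp
import os
import time


def burn(n):
    # Placeholder CPU-bound workload; the real calculation would go here.
    total = 0
    for i in range(n):
        total += i * i


def time_run(n_procs, total_work):
    # Split a fixed amount of work evenly and time until all children finish.
    ctx = mp.get_context("fork")  # assumption: POSIX system
    share = total_work // n_procs
    procs = [ctx.Process(target=burn, args=(share,)) for _ in range(n_procs)]
    t0 = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - t0


def best_process_count(total_work, max_procs=None, repeats=3):
    # Try every count from 1 up to the number of logical CPUs and keep
    # the best of several repeats to reduce timing noise.
    max_procs = max_procs or os.cpu_count() or 1
    timings = {}
    for n in range(1, max_procs + 1):
        timings[n] = min(time_run(n, total_work) for _ in range(repeats))
    best = min(timings, key=timings.get)
    return best, timings
```

This only measures; it explains nothing about why a given count wins, which is why I would rather have a formula.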
I no longer have a clear mental model of what a core vs. a thread is on the CPU. Note that I am not talking about programming models, but about the hardware capabilities of the CPU. So a thread is like half of an extra core, but not quite ;-)