Linked by Hadrien Grasland on Thu 19th May 2011 21:31 UTC
Permalink for comment 474218
To read all comments associated with this story, please click here.
To read all comments associated with this story, please click here.
Features
Linked by Thom Holwerda on 05/21/13 21:38 UTC
Linked by Thom Holwerda on 05/20/13 11:29 UTC
Linked by Thom Holwerda on 05/18/13 21:33 UTC
Linked by David Adams on 05/16/13 4:23 UTC
Linked by Thom Holwerda on 05/11/13 21:41 UTC
Linked by Thom Holwerda on 05/08/13 14:22 UTC
Linked by Thom Holwerda on 05/02/13 15:28 UTC
Linked by Thom Holwerda on 04/29/13 21:06 UTC
Linked by Thom Holwerda on 04/24/13 22:24 UTC
Linked by Thom Holwerda on 04/18/13 11:21 UTC
More Features »
Sponsored Links



Member since:
2011-01-28
Here is a short program which tests a multithreaded checksum.
Nothing impressive here. All that's interesting are the benchmarks below. (Dual-Core Intel(R) Xeon(R) CPU E3110 @ 3.00GHz)
// twiddle.c -- Compare performance of single threaded and multithreaded code.
// gcc -O3 -o twiddle -lpthread twiddle.c
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
typedef struct {
int*mem;
int len;
int ret;
} WORK;
double Now() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + (float)tv.tv_usec/1000000;
}
void* DoWork(void*work) {
// Do some arbitrary nonsense work
WORK*w = (WORK*)work;
int*mem = w->mem;
int*stop = mem + w->len;
int sum = 0;
while(mem<stop) {
sum+=*mem;
mem++;
}
w->ret = sum;
}
int main(int argc, char*args[]) {
int len=500000000; // 500M ints, or 2GB
int*mem = (int*)malloc(len * sizeof(int));
WORK work[2];
double elapse;
pthread_t thread[2];
int run;
memset(mem, 1, len * sizeof(int));
printf("Array Len = %d\n", len);
for(run=0; run<3; run++) {
// Method 1 - Do work in one shot
elapse = Now();
work[0].mem = mem;
work[0].len = len;
DoWork(work);
elapse = Now() - elapse;
printf("Run %d Method 1 - One shot = %d (%0.2fs)\n", run, work[0].ret, elapse);
// Method 2 - Split work in half
elapse = Now();
work[0].mem = mem;
work[0].len = len/2;
work[1].mem = mem+len/2;
work[1].len = len/2;
DoWork(work+0);
DoWork(work+1);
elapse = Now() - elapse;
printf("Run %d Method 2 - split single thread = %d (%0.2fs)\n", run, work[0].ret + work[1].ret, elapse);
// Method 3 - Split work in half, threaded
elapse = Now();
work[0].mem = mem;
work[0].len = len/2;
work[1].mem = mem+len/2;
work[1].len = len/2;
pthread_create(thread+0, NULL, &DoWork, work+0);
pthread_create(thread+1, NULL, &DoWork, work+1);
pthread_join(thread[0], NULL);
pthread_join(thread[1], NULL);
elapse = Now() - elapse;
printf("Run %d Method 3 - split dual thread = %d (%0.2fs)\n", run, work[0].ret + work[1].ret, elapse);
} // run
}
Output:
Run 0 Method 1 - One shot = 1345479936 (0.36s)
Run 0 Method 2 - split single thread = 1345479936 (0.36s)
Run 0 Method 3 - split dual thread = 1345479936 (0.31s)
Run 1 Method 1 - One shot = 1345479936 (0.36s)
Run 1 Method 2 - split single thread = 1345479936 (0.37s)
Run 1 Method 3 - split dual thread = 1345479936 (0.33s)
Run 2 Method 1 - One shot = 1345479936 (0.37s)
Run 2 Method 2 - split single thread = 1345479936 (0.37s)
Run 2 Method 3 - split dual thread = 1345479936 (0.31s)
Now, did this achieve the speedup we'd like? Since there is virtually no synchronization between threads to cause overhead, one might have expected 2 threads to cut the time in half, so why didn't it?
The answer is that the CPU is way under utilized, the algorithm is actually IO bound and that the CPU spends most of it's time waiting for data.
Now which is better/more scalable?
Taking up one CPU for .37s, or two CPUs for 0.31s?
If this is the only process, then admittedly yes .31s is better. However, if there are any other CPU bound processes waiting, then the .06s tradeoff is starting to look pretty expensive since I could have been running a CPU bound process simultaneously for .37s instead of tacking it on top of .31s.
Keep in mind we haven't even introduced the need for multithreaded synchronization, which is extremely expensive.
For these reasons, it's typically better to exploit parallelism at the macro-level instead of parallelizing individual steps. This is usually a good thing because it's easier to do with much less synchronization.
It also lends credence to my cherished view that async IO interfaces usually can perform better than blocking threaded ones.