Linked by Hadrien Grasland on Thu 19th May 2011 21:31 UTC
Hardware, Embedded Systems Having read the feedback resulting from my previous post on interrupts (itself resulting from an earlier OSnews Asks item on the subject), I've had a look at the way interrupts work on PowerPC v2.02, SPARC v9, Alpha and IA-64 (Itanium), and contribute this back to anyone who's interested (or willing to report any blatant flaw found in my posts). I've also tried to rework a bit my interrupt handling model to make it significantly clearer and have it look more like a design doc and less like a code draft.
Parallelization and Bit Twiddling
by Alfman on Sun 22nd May 2011 20:08 UTC
Alfman

Here is a short program which tests a multithreaded checksum.

Nothing impressive here; all that's interesting is the benchmark output below. (Dual-core Intel(R) Xeon(R) CPU E3110 @ 3.00GHz.)



// twiddle.c -- Compare performance of single threaded and multithreaded code.
// gcc -O3 -o twiddle twiddle.c -lpthread

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <sys/time.h>   // gettimeofday

typedef struct {
    int *mem;
    int len;
    int ret;
} WORK;

double Now() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}

void *DoWork(void *work) {
    // Do some arbitrary nonsense work: sum the array.
    WORK *w = (WORK *)work;
    int *mem = w->mem;
    int *stop = mem + w->len;
    int sum = 0;
    while (mem < stop) {
        sum += *mem;
        mem++;
    }
    w->ret = sum;
    return NULL;
}

int main(int argc, char *args[]) {
    int len = 500000000; // 500M ints, or 2GB
    int *mem = (int *)malloc(len * sizeof(int));
    WORK work[2];
    double elapse;
    pthread_t thread[2];
    int run;

    if (!mem) {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }
    memset(mem, 1, len * sizeof(int)); // every byte 1, so every int = 0x01010101
    printf("Array Len = %d\n", len);

    for (run = 0; run < 3; run++) {

        // Method 1 - Do all the work in one shot
        elapse = Now();
        work[0].mem = mem;
        work[0].len = len;
        DoWork(work);
        elapse = Now() - elapse;
        printf("Run %d Method 1 - One shot = %d (%0.2fs)\n", run, work[0].ret, elapse);

        // Method 2 - Split the work in half, still single threaded
        elapse = Now();
        work[0].mem = mem;
        work[0].len = len / 2;
        work[1].mem = mem + len / 2;
        work[1].len = len / 2;
        DoWork(work + 0);
        DoWork(work + 1);
        elapse = Now() - elapse;
        printf("Run %d Method 2 - split single thread = %d (%0.2fs)\n", run, work[0].ret + work[1].ret, elapse);

        // Method 3 - Split the work in half, one thread per half
        elapse = Now();
        work[0].mem = mem;
        work[0].len = len / 2;
        work[1].mem = mem + len / 2;
        work[1].len = len / 2;
        pthread_create(thread + 0, NULL, &DoWork, work + 0);
        pthread_create(thread + 1, NULL, &DoWork, work + 1);
        pthread_join(thread[0], NULL);
        pthread_join(thread[1], NULL);
        elapse = Now() - elapse;
        printf("Run %d Method 3 - split dual thread = %d (%0.2fs)\n", run, work[0].ret + work[1].ret, elapse);

    } // run

    free(mem);
    return 0;
}



Output:
Run 0 Method 1 - One shot = 1345479936 (0.36s)
Run 0 Method 2 - split single thread = 1345479936 (0.36s)
Run 0 Method 3 - split dual thread = 1345479936 (0.31s)
Run 1 Method 1 - One shot = 1345479936 (0.36s)
Run 1 Method 2 - split single thread = 1345479936 (0.37s)
Run 1 Method 3 - split dual thread = 1345479936 (0.33s)
Run 2 Method 1 - One shot = 1345479936 (0.37s)
Run 2 Method 2 - split single thread = 1345479936 (0.37s)
Run 2 Method 3 - split dual thread = 1345479936 (0.31s)


Now, did this achieve the speedup we'd like? Since there is virtually no synchronization between threads to cause overhead, one might have expected 2 threads to cut the time in half, so why didn't it?

The answer is that the CPU is heavily underutilized: the algorithm is actually memory bound, and the CPU spends most of its time waiting for data to arrive from RAM.


Now which is better and more scalable: taking up one CPU for 0.37s, or two CPUs for 0.31s?

If this is the only runnable process, then admittedly 0.31s is better. But two CPUs for 0.31s amounts to 0.62 CPU-seconds of work, versus 0.37 CPU-seconds single threaded. If any other CPU-bound process is waiting, that 0.06s of wall-clock savings starts to look pretty expensive, since the second core could have been running that process for the full 0.37s instead of tacking it on after the 0.31s.


Keep in mind we haven't even introduced the need for multithreaded synchronization, which is extremely expensive.


For these reasons, it's typically better to exploit parallelism at the macro level instead of parallelizing individual steps: coarse-grained parallelism is usually easier to achieve and needs far less synchronization.

It also lends credence to my cherished view that async IO interfaces can usually perform better than blocking threaded ones.
