Linked by Hadrien Grasland on Thu 19th May 2011 21:31 UTC
Hardware, Embedded Systems

Having read the feedback resulting from my previous post on interrupts (itself resulting from an earlier OSnews Asks item on the subject), I've had a look at the way interrupts work on PowerPC v2.02, SPARC v9, Alpha, and IA-64 (Itanium), and I'm contributing this back to anyone who's interested (or willing to report any blatant flaws in my posts). I've also reworked my interrupt handling model a bit to make it significantly clearer, so it reads more like a design doc and less like a code draft.
Thread beginning with comment 474025
Locking benchmarks
by Alfman on Fri 20th May 2011 20:34 UTC
Alfman
Member since:
2011-01-28

This site has some interesting benchmarks for a couple low level synchronization primitives.

http://www.spiral.net/software/barrier.html


The obvious locking mechanism on x86 is to use a "lock bts" instruction. According to the benchmarks above, this consumes 530 cycles on a modern dual-core processor, and 1130 cycles on a 4-core multiprocessor system.

They've proposed another algorithm, which is incidentally the same algorithm sometimes used to synchronize multiple clients over an SMB file system.

For example, Quicken Pro maintains its database on a file share without actually running a "server".

Anyway, I am surprised they were able to achieve performance gains this way: 220 cycles on a dual core and 1130 cycles on a 4-core multiprocessor system.

It relies on the cache coherency characteristics of x86; I have no idea how portable it is across architectures.

You may want to study the approach; however, I don't think it will be useful, because the algorithm depends on a fixed number of threads (and presumably it gets worse as that fixed number is increased).
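I haven't seen the spiral.net code, so this is only a guess at the family of algorithm: an array-based lock sized for a fixed thread count, where each waiter spins on its own cache line so coherency traffic stays local until the lock is handed over. A rough sketch:

```c
#include <stdatomic.h>

#define MAX_THREADS 4          /* the algorithm is sized for a fixed thread count */
#define CACHE_LINE  64

/* Anderson-style array lock (illustrative, not necessarily spiral.net's
 * algorithm): each waiter spins on a private, cache-line-padded slot. */
typedef struct {
    struct { atomic_int go; char pad[CACHE_LINE - sizeof(atomic_int)]; }
        slot[MAX_THREADS];
    atomic_uint next;          /* ticket counter */
} array_lock_t;

static unsigned array_lock(array_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next, 1) % MAX_THREADS;
    while (!atomic_load_explicit(&l->slot[me].go, memory_order_acquire))
        ;                                   /* spin on our private slot only */
    atomic_store(&l->slot[me].go, 0);       /* consume the grant */
    return me;
}

static void array_unlock(array_lock_t *l, unsigned me) {
    unsigned next = (me + 1) % MAX_THREADS;
    atomic_store_explicit(&l->slot[next].go, 1, memory_order_release);
}
```

Note the drawback the comment above predicts: the slot array is fixed at MAX_THREADS, and oversubscribing it breaks the handoff.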


Anyway, I still argue that a lock-free asynchronous model without any threads will easily outperform the "popup thread" model, particularly for simple tasks which consume 1-2K cycles.

I don't care which way you go; it probably won't matter either way, but please stop clinging to the view that threads are more scalable.

Reply Score: 1

RE: Locking benchmarks
by jal_ on Sun 22nd May 2011 13:08 in reply to "Locking benchmarks"
jal_ Member since:
2006-11-02

I still argue that a lock free asynchronous model without any threads will easily outperform the "popup thread" model, particularly for simple tasks which consume 1-2K cycles.


This is a very interesting discussion thread, but I'm wondering what kind of model you envision when you talk of a lock-free, asynchronous, thread-less model. Can you point me to a description or more elaborate discussion of such a model?

Reply Parent Score: 2

RE[2]: Locking benchmarks
by Alfman on Sun 22nd May 2011 21:11 in reply to "RE: Locking benchmarks"
Alfman Member since:
2011-01-28

jal_,

"This is a very interesting discussion thread, but I'm wondering what kind of model you envision when you talk of a lock-free, asynchronous, thread-less model. Can you point me to a description or more elaborate discussion of such a model? "

More than a vision... I've actually implemented it in userspace on Linux, and I'm working on a Windows port now, although I haven't made the project public.

Technically, my vision is to convert all calls which either block or could take a long time into asynchronous callback mechanisms.


So basically, these function calls below are from my API.
When any of the tasks complete, the registered callback functions are invoked - all within a single event-oriented thread.

/* open (or create) a file */
async_file_create(&file, &file_cb);
async_file_open(&file, "async.test", ASYNC_FILE_READ | ASYNC_FILE_WRITE | ASYNC_FILE_CREATE | ASYNC_FILE_TRUNCATE);

/* launch a child process, with callbacks for its stdout/stdin */
async_process_create(&process, &process_cb);
async_process_setup_stdout(&process, &process_out_cb);
async_process_setup_stdin(&process, &process_in_cb);
async_process_execute(&process, "ls -l /");

/* schedule a timer for 500 (presumably ms) */
async_timer_create(&timer1, &timer_callback);
async_timer_schedule(&timer1, 500);

/* asynchronous DNS lookup */
async_resolve_create(&resolve, &resolve_cb);
async_resolve_lookup_host(&resolve, "av.com", ASYNC_ADDRESS_IP4);

/* listen for TCP connections on 127.0.0.1:5555 */
async_address_create(&remote_address);
async_address_set(&remote_address, "127.0.0.1", 5555, ASYNC_ADDRESS_AUTO);
async_tcplisten_create(&lsock, &lsock_cb);
async_tcplisten_listen(&lsock, &remote_address);
async_address_destroy(&remote_address);


So, above, all those actions are performed concurrently, and each invokes its callback when complete.


One callback could look something like this:

void resolve_cb(ASYNC_RESOLVE *resolve) {
    char buffer[128];
    ASYNC_ADDRESS address;
    async_address_create(&address);
    while (async_resolve_get_address(resolve, &address)) {
        printf("Address %s\n", async_address_get_ip(&address, buffer, sizeof(buffer)));
    }
    async_address_destroy(&address);
    async_resolve_destroy(resolve);
}



The thing I find so cool about this model is that two completely independent processes (or applications, for that matter) can run together inside it without interfering with each other. Nothing (apart from the main handler) ever blocks.
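On Linux, the dispatcher underneath an API like this is typically built on epoll. A skeletal version (the names here are mine, not the API above) shows why nothing but the main handler ever blocks:

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Skeleton of a single-threaded, event-oriented dispatcher: every
 * registered object carries its own callback, so independent tasks
 * interleave without ever blocking one another. */
typedef struct watcher {
    int fd;
    void (*cb)(struct watcher *w);
} watcher_t;

static void loop_add(int epfd, watcher_t *w) {
    struct epoll_event ev = { .events = EPOLLIN, .data.ptr = w };
    epoll_ctl(epfd, EPOLL_CTL_ADD, w->fd, &ev);
}

/* Run one iteration: the epoll_wait below is the only place we block. */
static int loop_once(int epfd, int timeout_ms) {
    struct epoll_event ev[64];
    int n = epoll_wait(epfd, ev, 64, timeout_ms);
    for (int i = 0; i < n; i++) {
        watcher_t *w = (watcher_t *)ev[i].data.ptr;
        w->cb(w);                     /* dispatch the completion callback */
    }
    return n;
}

/* Demo callback (hypothetical): drains one byte and records the event. */
static int demo_fired;
static void demo_cb(watcher_t *w) {
    char b;
    (void)!read(w->fd, &b, 1);
    demo_fired = 1;
}
```

A real loop would call loop_once forever and fold timers and thread completions into the same fd set.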

Threads are supported for long-running tasks, and when the threads complete their work, they invoke a callback into the async handler much like any other blocking call.


Unfortunately, under Linux, many blocking calls lack an async interface and are implemented using threads in the kernel. This forces async userspace apps to wrap threads around certain kernel calls for no good reason other than that the kernel requires it. I consider it shortsighted, but many people would bash my head in if I said that too loudly.

Instead of burning through threads quickly, my implementation caches a pool of minimal proxy threads dedicated to this purpose.

Still, in this model proxy threads are unnecessary overhead that serves no purpose other than working around blocking calls.
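One common way to wire such proxy threads back into the loop is to have them signal completion through an eventfd, so the blocking call's result arrives as an ordinary readiness event. A sketch under that assumption (names are illustrative, not the real implementation):

```c
#include <pthread.h>
#include <sys/eventfd.h>
#include <stdint.h>
#include <unistd.h>

/* A proxy thread runs exactly one blocking call, then signals the event
 * loop through an eventfd, so the completion is delivered like any other
 * readiness event. */
typedef struct {
    int done_fd;                        /* eventfd watched by the main loop */
    void (*blocking_call)(void *arg);
    void *arg;
} proxy_job_t;

static void *proxy_main(void *p) {
    proxy_job_t *job = p;
    job->blocking_call(job->arg);       /* the only thing that blocks */
    uint64_t one = 1;
    (void)!write(job->done_fd, &one, sizeof one);   /* wake the loop */
    return NULL;
}

static int proxy_submit(proxy_job_t *job) {
    pthread_t t;
    int rc = pthread_create(&t, NULL, proxy_main, job);
    if (rc == 0) pthread_detach(t);
    return rc;
}

/* Hypothetical blocking call used only for demonstration. */
static int demo_result;
static void demo_blocking(void *arg) { (void)arg; demo_result = 42; }
```

A pool implementation would keep these threads parked on a work queue instead of creating one per call, which is what the caching above avoids.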

I think this model would work very well in the kernel itself. Instead of blocking a thread, a request could just register a callback and be done until the callback occurs.


The async model has one extremely important property which is inherently difficult for threads: well-defined cancellation behavior. I can easily cancel an async I/O request; canceling a blocked thread turns out to be extremely problematic.

In fact, pthreads has special cleanup handlers programmers should use to ensure cancellation doesn't corrupt shared state. Unfortunately for my model, this means thread cancellation itself can block and there's nothing I can safely do about it.
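For reference, the standard pthreads pattern looks like this: disable cancellation around the critical section and register a cleanup handler so the mutex is released even if the thread is cancelled at a later cancellation point.

```c
#include <pthread.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static void unlock_cleanup(void *m) {
    pthread_mutex_unlock((pthread_mutex_t *)m);
}

/* Defer cancellation while holding the lock; the cleanup handler
 * guarantees the mutex is released even if the thread is cancelled
 * at the next cancellation point after re-enabling. */
static void critical_section(void) {
    int old;
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);
    pthread_mutex_lock(&mtx);
    pthread_cleanup_push(unlock_cleanup, &mtx);
    /* ... touch shared state; no cancellation can land here ... */
    pthread_cleanup_pop(1);    /* 1 = run the handler, releasing the mutex */
    pthread_setcancelstate(old, NULL);
}
```

This is exactly the machinery that makes cancellation of a blocked thread so awkward compared to dropping an async request.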

Even to this day, this is the cause of many aggravating Linux kernel CIFS lockups where one cannot kill a hung connection, but has to wait for kernel threads to time out.

Ah well, I can't have everything.

Reply Parent Score: 1