Linked by Hadrien Grasland on Fri 27th May 2011 11:34 UTC
General Development
After having an interesting discussion with Brendan on the topic of deadlocks in threaded and asynchronous event handling systems (see the comments on this blog post), I just had something to ask the developers on OSnews: could you live without blocking API calls? Could you work with APIs where lengthy tasks like writing to a file, sending a signal, or doing network I/O are done in a nonblocking fashion, with callbacks as the only mechanism to return results and notify your software when an operation is done?
RE[3]: Mutex-less computing
by Alfman on Mon 30th May 2011 08:59 UTC in reply to "RE[2]: Mutex-less computing"

xiaokj,

"I would first like to declare that I am speaking out of my area of expertise."

I don't mind in the least.

"If you have multiple cores, then even if you dedicate one core to the server, then if, in the rare case, two other cores decide to request something, then there will be a need to care, no?"

Consider a process like a web server using async IO mechanisms. From one thread, it waits for connections and IO (with a single blocking call to the OS, like epoll_wait). As requests come in, it dispatches further IO requests on behalf of clients, but since the async thread only schedules IO and doesn't block, it returns immediately to waiting for more IO. From the userspace perspective, neither threads nor mutexes are needed to implement this model.
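
To make that concrete, here's a rough sketch of such a dispatch loop (assuming Linux epoll; accept_client() and handle_io() are hypothetical stand-ins for the real request logic, and error handling is omitted):

#include <sys/epoll.h>

#define MAX_EVENTS 64

/* Hypothetical helpers standing in for the real request logic. */
void accept_client(int epfd, int listen_fd); /* accept + register new fd */
void handle_io(int fd);                      /* nonblocking read/write */

void event_loop(int epfd, int listen_fd) {
    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* The only blocking call: wait for readiness on any descriptor. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd)
                accept_client(epfd, listen_fd);
            else
                handle_io(events[i].data.fd);
        }
        /* No mutexes: all state is touched by this one thread only. */
    }
}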

One simple way to achieve more parallelism is to run two or more instances of this async application, and there would still be no need for a userspace mutex since they'd be running in separate processes.


"On the other hand, these kinds of atomics, the compare-and-swap and the LL/SC, are hardware accelerated to be (a) non-blocking and (b) interrupt-enabled and (c) runs in one cycle."

Firstly, my assembly knowledge drops off significantly beyond the x86, so I can't generalize this to be true for other architectures.


(a) non-blocking is true in the threaded sense, but false at the hardware level.

(b) I don't know what you mean, but atomics don't interfere with interrupts.

(c) Runs in one cycle? Where do you get that from?

"Why do you claim that they are slower than CPU speed?"

The CPU asserts a LOCK signal which prevents other CPUs from using the bus, like a tiny mutex. Since this takes place at the bus level, it also means atomic operations run at bus speed rather than CPU speed. And from my hasty measurements, they take even more than one bus cycle.


But I don't expect you to take my word for it... here's a microbenchmark from my laptop. I compiled with gcc -O2 (gcc optimized the loop away into a mul, which is why I used two adds):

#include <stdio.h>

int main(void) {
    int x = 0, y = 0, i;
    for (i = 0; i < 100000000; i++) {
        x++;
        //y += x;                    // normal add = 0.1s
        __sync_add_and_fetch(&y, x); // lock add = 2.8s
    }
    printf("%d\n", y);
    return 0;
}
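
(To reproduce it, something along the lines of "gcc -O2 bench.c -o bench && time ./bench" should do, toggling which of the two add lines is commented out. The timings above are from my laptop and will vary.)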

Anyway, using "y+=x", the loop ran in 0.1s and the underlying opcode was just "add".

Using the gcc atomic function, the compiler emitted "lock add" and the loop executed in 2.8s.

The first can execute in parallel on SMP, the second becomes serialized.

Ideally I'd write a more precise assembly language test, but I think this example is sufficient to demonstrate my claim.

"Nonetheless, if you combine atomicity and MT, I cannot foresee why a good implementation will not outperform simple threaded and/or async as described."

Here are a few reasons:

Threads are best suited to longer-running tasks, where MT overhead is minimal compared to the real work. For IO, however, operations are frequently followed by more IO operations (read/write loops), so very little time is spent in the threads doing real work and the overhead of context switching CPU state becomes enormous.

The async model can handle the same IO from one thread in a queue with no context switching at all. Using an AIO interface, it's possible to batch many requests into a single syscall (see the sketch after this list).


MT is inherently penalized by synchronization overhead; async is not.

Excessive use of MT in certain designs results in unscalable stack utilization (even with enough RAM there'll be CPU cache performance issues).
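
To illustrate the batching point, here's a sketch using Linux's libaio (just one AIO interface among several; link with -laio, error checks omitted, and note that in practice Linux AIO behaves asynchronously mainly with O_DIRECT):

#include <libaio.h>

#define NREQ 32
#define BLK  4096

/* Queue NREQ reads with one io_submit() call, then reap them. */
void batch_reads(int fd, char bufs[NREQ][BLK]) {
    io_context_t ctx = 0;
    struct iocb cbs[NREQ], *ptrs[NREQ];
    struct io_event events[NREQ];
    int i;

    io_setup(NREQ, &ctx);
    for (i = 0; i < NREQ; i++) {
        io_prep_pread(&cbs[i], fd, bufs[i], BLK, (long long)i * BLK);
        ptrs[i] = &cbs[i];
    }
    io_submit(ctx, NREQ, ptrs);                  /* one syscall, 32 requests */
    io_getevents(ctx, NREQ, NREQ, events, NULL); /* wait for all completions */
    io_destroy(ctx);
}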


RE[4]: Mutex-less computing
by xiaokj on Mon 30th May 2011 16:43 in reply to "RE[3]: Mutex-less computing"

Alfman,

I kinda get what you mean. I was talking about non-blocking schemes, not blocking schemes, so I think we are a bit out of sync ourselves.

I am basing myself off Wikipedia as of right now, and I do not see it saying that CAS blocks at the hardware level. In fact, it does say that CAS is unsafe -- it does not even try to prevent the content from being overwritten twice (the ABA problem)! As such, I doubt your atomics are exactly the same as my CAS solution.

If my reading is not failing me, the CMPXCHG instruction should be hardware-non-blocking and yet atomic, which is kind of heavenly. LL/SC is even better -- it does that and avoids ABA, at a slight cost in speed. Definitely nothing that takes as long as the "lock add" you are referring to.

The (b) about interrupts is because atomics are usually implemented on single cores by simply disabling interrupts for the transaction. CAS is not much faster than that case, but scales a lot better in the many-core case.

Can you please clarify? I am still under the impression that this CAS runs in one instruction cycle, not at bus speed due to a tiny mutex. Otherwise it would kind of defeat the point of such an invention.


RE[5]: Mutex-less computing
by Alfman on Tue 31st May 2011 00:26 in reply to "RE[4]: Mutex-less computing"

xiaokj,

"I am basing myself off wikipedia as of right now, and I do not see it saying that CAS blocks at the hardware level. In fact, it does say that CAS is unsafe -- it does not even try to block the content being overwritten twice!"

cmpxchg -- no lock
lock cmpxchg -- explicit lock
xchg -- implied lock

cmpxchg (without lock) is only multithread-safe on a single processor (or perhaps for a process with affinity set).
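
For what it's worth, gcc's builtin CAS emits "lock cmpxchg" on x86, which is easy to confirm by disassembling a trivial test (a sketch; check the output with objdump -d):

#include <stdio.h>

int main(void) {
    int val = 10;
    /* Atomically: if (val == 10) set val = 13; returns nonzero on success.
       On x86 this compiles to "lock cmpxchg". */
    if (__sync_bool_compare_and_swap(&val, 10, 13))
        printf("swapped, val=%d\n", val);
    return 0;
}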


You had said:
"...On the other hand, these kinds of atomics, the compare-and-swap and the LL/SC..."

On x86, the cmpxchg instruction must have LOCK asserted in order to be atomic, and I think the Wikipedia article agrees.

There are no LL/SC opcodes on x86, so I cannot comment on them.

"The LL/SC is even better -- it does that and avoids ABA, at the slight cost of speed."

The ABA problem can also be solved by using a 64-bit cmpxchg on a 32-bit pointer plus a version counter, so that the compare can verify that the changes we intend to apply are still valid at the moment the cmpxchg executes.

I'd need to look up code to remember the details.
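
From memory, the shape of it is something like this (an untested sketch using gcc builtins; the tagged_ptr packing and the tagged_cas name are just illustrative):

#include <stdint.h>

/* Low 32 bits hold the pointer, high 32 bits a version counter. */
typedef uint64_t tagged_ptr;

/* Succeeds only if BOTH the pointer and the version still match,
   so an A->B->A change is caught by the bumped version. */
static int tagged_cas(volatile tagged_ptr *slot,
                      uint32_t old_ptr, uint32_t old_ver,
                      uint32_t new_ptr)
{
    tagged_ptr expected = ((tagged_ptr)old_ver << 32) | old_ptr;
    tagged_ptr desired  = ((tagged_ptr)(old_ver + 1) << 32) | new_ptr;
    return __sync_bool_compare_and_swap(slot, expected, desired);
}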


"The (b) about interrupts is because atomics are usually done on single cores by simply disabling interrupts for the transaction. The CAS is not much faster than that case, but scales a lot better for the many-cores case."

Ok, I see. On a preemptive single-processor system, cli can provide atomicity (assuming the CPU is in ring 0).
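
In code, the idea is roughly this (a sketch; ring-0 only, it protects only against local interrupt handlers and not against other CPUs, and the unconditional sti is a simplification -- a real kernel would save and restore the interrupt flag):

/* Uniprocessor-only "atomic" increment via interrupt masking. */
static void uniprocessor_atomic_inc(volatile int *p)
{
    __asm__ __volatile__("cli"); /* mask interrupts (requires ring 0) */
    (*p)++;                      /* plain RMW, now safe vs. handlers  */
    __asm__ __volatile__("sti"); /* re-enable interrupts              */
}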


"Can you please clarify? I am still of the impression that this CAS runs in one instruction cycle, not at bus speed due to a tiny mutex."

This is true without the lock prefix. Lock is what causes it to stall.

The x86 cache coherency model is complicated, but I believe this is what could happen without a lock.

CPU A - read mem X into cache
CPU B - read mem X into cache

mem X = 10
CPU A cache X = 10/clean
CPU B cache X = 10/clean


CPU A - cmpxchg (cmp = 10 ok, set X=13)
CPU B - cmpxchg (cmp = 10 ok, set X=15)

mem X = 13 or 15? (which got there last?)
CPU A cache X = 13/dirty or 15/clean
CPU B cache X = 15/dirty or 13/clean

Since the cache operates behind the scenes, independently of the processing circuitry, it provides no guarantees that the initial value read is still current at the time of the write.

Therefore, without a lock, the atomicity of one of the cmpxchg instructions is broken.


"Or else it would be kind of defeating the point of such an invention."

I'd argue that a slow lock is still better than no lock at all. And "slow" is relative: accessing a shared resource is inherently a serial operation which cannot be parallelized, so I don't know how reasonable it is to expect an SMP system to do it for free.
