To view parent comment, click here.
To read all comments associated with this story, please click here.
xiaokj,
"I am basing myself off wikipedia as of right now, and I do not see it saying that CAS blocks at the hardware level. In fact, it does say that CAS is unsafe -- it does not even try to block the content being overwritten twice!"
cmpxchg -- no lock
lock cmpxchg -- explicit lock
xchg -- implied lock
cmpxchg (without lock) is only multithread safe on a single processor (or perhaps a process with affinity set).
You had said:
"...On the other hand, these kinds of atomics, the compare-and-swap and the LL/SC..."
On x86, the cmpxchg instruction must have LOCK asserted in order to be atomic, and I think the wikipedia article agrees.
There are no LL/SC opcodes on x86, and I cannot comment on them.
"The LL/SC is even better -- it does that and avoids ABA, at the slight cost of speed."
The ABA problem can be solved by using 64bit cmpxchg on 32bit pointers, such that that the cmp can make sure that changes we intend to apply are still valid at the point of execution for cmpxchg.
I'd need to look up code to remember the details.
"The (b) about interrupts is because atomics are usually done on single cores by simply disabling interrupts for the transaction. The CAS is not much faster than that case, but scales a lot better for the many-cores case."
Ok, I see. On a preemptive single processor system, cli can provide atomicity (assuming cpu ring-0).
"Can you please clarify? I am still of the impression that this CAS runs in one instruction cycle, not at bus speed due to a tiny mutex."
This is true without the lock prefix. Lock is what causes it to stall.
The x86 cache coherency model is complicated, but I believe this is what could happen without a lock.
CPU A - read mem X into cache
CPU B - read mem X into cache
mem X = 10
CPU A cache X = 10/clean
CPU B cache X = 10/clean
CPU A - cmpxchg (cmp = 10 ok, set X=13)
CPU B - cmpxchg (cmp = 10 ok, set X=15)
mem X = 13 or 15? (which got there last?)
CPU A cache X = 13/dirty or 15/clean
CPU B cache X = 15/dirty or 13/clean
Since the cache is operating behind the scenes independently from the processing circuitry, it provides no guaranties that the initial value read is still current at the time of the write.
Therefor, without a lock, the atomicity of one of the cmpxchg instructions is broken.
"Or else it would be kind of defeating the point of such an invention."
I'd argue that a slow lock is still better than no lock at all. And "slow" is relative, accessing a shared resource is inherently a serial operation which cannot be parallelized - I don't know how reasonable it is to expect an SMP system to do it for free.




Member since:
2005-06-30
Alfman,
I kinda get what you mean. I was talking about non-blocking schemes, not blocking schemes, so I think we are a bit out of sync ourselves.
I am basing myself off wikipedia as of right now, and I do not see it saying that CAS blocks at the hardware level. In fact, it does say that CAS is unsafe -- it does not even try to block the content being overwritten twice! (ABA problem). As such, I doubt your atomics is exactly the same as my CAS solution.
If my reading is not failing me, the CMPXCHG instruction should be hardware-non-blocking and yet atomic, which is kinda like heavenly. The LL/SC is even better -- it does that and avoids ABA, at the slight cost of speed. Definitely nothing as long as the "lock add" you are referring to.
The (b) about interrupts is because atomics are usually done on single cores by simply disabling interrupts for the transaction. The CAS is not much faster than that case, but scales a lot better for the many-cores case.
Can you please clarify? I am still of the impression that this CAS runs in one instruction cycle, not at bus speed due to a tiny mutex. Or else it would be kind of defeating the point of such an invention.