Stratus does this in hardware. Marathon Technologies does this in software. Both take care of many other failure modes (disk, power supply, I/O, memory, and in Marathon's design, even driver faults) that are much more likely to cause a system failure. I mean, it's cool that IBM is working on this (and they have some really strong high-availability talent in house), but they need to take it to the system level, not just one component, and especially not one (the CPU) that is among the least likely to fail.
IBM has been doing this for years with their mainframes, to the point that the system swaps out the bad hardware and calls service with no user or operator intervention, and without a single bad bit hitting disk. This is how they get years between reboots on what is now called the Z series (370/390 for you old schoolers).
I think it will be great if we can get this kind of failover all the way down into SMB-sized machines, but I also agree that it needs to cover not only the processor but other components too. Add memory to the list, since we already have RAID, and we are getting close.



