A bug was recently found in the ROM for the Macintosh Classic II that should cause a crash when booting in 32-bit mode. Doug Brown found and documented the bug while playing with the MAME debugger. Why did it never show up before? It seems a quirk in Motorola’s 68030 CPU inadvertently fixes it when executing an illegal instruction that shouldn’t have been executed in the first place.
I was starting to believe something that sounded almost too crazy to be true: Apple had an out-of-bounds jump bug in the Classic II’s ROM that should have caused a Sad Mac during boot, but they had no idea the bug was there because the 68030 was accidentally fixing the value of A1 by executing an undocumented instruction. How could I prove that my theory was correct?
By buying a Classic II and hacking the ROM in order to see exactly what is happening on hardware, of course!
↫ Doug Brown
What follows is his process for investigating the ROM on emulated hardware, and then testing it on actual hardware.
It’s an interesting bug, elaborately written up. Finds like this lead to better emulation of the original hardware.
That’s all well and good, but what I find most remarkable is the commitment to debugging code from the distant past, haha. To me it’d be like debugging Windows 3.1. I’m sure we could find bugs, but where does one get the motivation to do it?
Well, in this case, the motivation was that the emulator wouldn’t boot, but the hardware did. With the essential job emulators have in the modern world, preserving “ancient” computer systems, having emulators that can act identically to real hardware is getting more and more important. If the hardware and the emulator act differently, that normally exposes a bug in the emulator, and you’ll need to track down the root cause of the behavioural variation in order to fix the emulator and make it behave properly.
Just in this case, it’s some absolute weirdness that the emulator behaves “properly” and it’s actually the hardware weirding out. Finding odd hardware bugs like this could prove invaluable in the future when trying to run some software that again behaves differently in the software sandbox compared to actual silicon.
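To make the "track down the root cause" step a bit more concrete, here is a minimal sketch of finding the first point where two execution traces disagree. The trace format (a program counter plus a register snapshot per step) and the values are made up for illustration; this is not how MAME or Doug Brown's tooling actually records state.

```python
from typing import Dict, List, Optional, Tuple

# One trace entry: (program counter, register snapshot). Hypothetical format.
TraceEntry = Tuple[int, Dict[str, int]]


def first_divergence(hw: List[TraceEntry], emu: List[TraceEntry]) -> Optional[int]:
    """Return the index of the first step where the traces differ, or None."""
    for i, (h, e) in enumerate(zip(hw, emu)):
        if h != e:
            return i
    # One trace stopping early (for example, after a crash) also counts.
    if len(hw) != len(emu):
        return min(len(hw), len(emu))
    return None


def changed_registers(
    h: TraceEntry, e: TraceEntry
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Map each differing register name to its (hardware, emulator) values."""
    names = sorted(set(h[1]) | set(e[1]))
    return {n: (h[1].get(n), e[1].get(n)) for n in names if h[1].get(n) != e[1].get(n)}


# Made-up traces: both agree for two steps, then A1 differs.
hw_trace = [(0x1000, {"A1": 0x10}), (0x1002, {"A1": 0x10}), (0x1004, {"A1": 0x20})]
emu_trace = [(0x1000, {"A1": 0x10}), (0x1002, {"A1": 0x10}), (0x1004, {"A1": 0x10})]

step = first_divergence(hw_trace, emu_trace)
diff = changed_registers(hw_trace[step], emu_trace[step])
print(step, diff)
```

The instruction executed just before the first diverging step is usually the place to start looking, whichever side turns out to be "wrong".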
The123king,
It wasn’t really unusual to have undocumented opcodes. Arguably the silicon is right and the emulator’s job is to replicate the behavior of the silicon regardless of how thoroughly it was documented.
It gets even trickier when we look at clock cycle accurate emulation. The trivial way to emulate an ISA might lead to new race conditions with existing software algorithms that never exhibited issues on original hardware. This gets even more challenging on architectures with coprocessors. Cycle inaccurate emulation could expose new software bugs. I guess emulators with cycle inaccurate emulation could actually prove very valuable in triggering race condition faults in modern software too 🙂
Sadly the 68030 has never been fully reverse engineered, mostly because it wasn’t as widely used as the 68000 (computers, consoles, embedded boards, etc.), which has complete 1:1 software and hardware (FPGA) emulation.
You’re totally right. But the joy of undocumented or unsupported opcodes is just that, they’re undocumented and unsupported. Only the most perfect of emulators will be able to replicate these errata and that requires debugging and testing. It’s weird exploratory debugging sessions like this one that make such emulators more accurate.
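As a rough illustration of why those undocumented opcodes matter, here is a toy interpreter. The opcodes, the single `A1` register and the quirk itself are all invented for this sketch and say nothing about real 68030 encodings. A "by the book" run traps on the unknown opcode, while a run given a table of quirks observed on hardware carries on.

```python
class IllegalInstruction(Exception):
    """Raised when the toy CPU meets an opcode it has no behaviour for."""


# Toy instruction set; these opcode values are invented for the example.
NOP, INC_A1, UNDOC = 0x00, 0x01, 0xFF


def run(program, quirks=None):
    """Execute a list of opcodes and return the final register state.

    `quirks` maps an undocumented opcode to a function that mutates the
    register dict, standing in for behaviour measured on real silicon.
    """
    regs = {"A1": 0}
    quirks = dict(quirks or {})
    for op in program:
        if op == NOP:
            pass
        elif op == INC_A1:
            regs["A1"] += 1
        elif op in quirks:
            quirks[op](regs)
        else:
            raise IllegalInstruction(hex(op))
    return regs


def clear_a1(regs):
    # Hypothetical side effect: the undocumented opcode resets A1.
    regs["A1"] = 0


program = [INC_A1, INC_A1, UNDOC, INC_A1]

try:
    run(program)  # documentation-only model: traps
    trapped = False
except IllegalInstruction:
    trapped = True

result = run(program, {UNDOC: clear_a1})  # model with the observed quirk
print(trapped, result)
```

Filling in a table like `quirks` is exactly the part that needs the exploratory debugging on real hardware.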
Some classics.
1. Every piece of software has a bug.
2. A bug brings a friend.
3. There is always another bug between any two bugs.
4. When you write code to fix some bugs, the new code has more bugs than what you just fixed.
5. The system is stable while the number of bugs is even.
IMO, #5 applies to the 32-bit boot attempt.
It makes me wonder just how many software/hardware bugs like this lie undiscovered, because the hardware freaks out in just the right way to work.
I’m reminded of the famous case of the first ARM chip drawing 0W, purely because Steve Furber didn’t connect VCC to it, and the chip ran fine on leakage current from the external buses. Definitely a hardware bug in the initial test setup, but still, the hardware carried on marching.
My favorite story is about the Airbus A380 and CATIA V4/V5.
For example here
https://blog.beyondsoftware.com/failed-project-series-what-went-wrong-with-a380
Consequently, in Germany, engineers used a legacy version of CATIA (a design software program) to develop the miles of wiring for the wings. In France, however, engineers used the updated version of the same software to create the wing structures. The two versions were not compatible with each other. As a result, the German-manufactured wiring did not fit into the French-manufactured wing configuration, and both elements required a complete overhaul before the wings operated correctly.