When your computer resides on another planet, service calls aren’t an option. So when BAE Systems North America Inc. needed an extremely reliable operating system for the computational subsystem it was supplying for NASA’s Mars rovers, it chose VxWorks, an industrial-strength, real-time operating system from Wind River Systems Inc., Alameda, Calif.
I thought they would choose QNX for such a task. Congratulations to Wind River if the Mars probe is successful; Beagle 2 was not.
Haha, VxWorks’ track record on Mars isn’t all that
great. 🙂
Last time there, it provided the engineering community
with a prime example of what priority inversion is.
Hope it goes better this time. 🙂
> Last time there, it provided the engineering community
> with a prime example of what priority inversion is.
> Hope it goes better this time. 🙂
If you’d care to elaborate, maybe some of us less interplanetarily aware citizens could understand. In other words, what exactly happened?
I’m waiting for the Linux and Windows advocates to proclaim Linux or Windows should have been chosen instead. Debian and Fedora Core will never see Mars. VxWorks is not a bad real time OS and I wish Wind River and BAE much success for this project.
Here is a simple explanation of priority inversion:
Consider three processes: a high priority one A, a medium priority one B, and a low priority one C. Normally, A gets precedence over B, which gets precedence over C. There is a problem, however, when this is mixed with exclusive shared resources. Say A and C share a resource Q. Normally, C locks Q for only a short time, so A never spends much time waiting for it. However, consider what can happen in a preemptive priority scheduler. B can preempt C while C is holding Q, since B has a higher priority. If A then waits on Q, you have a problem. B will continue running until it sleeps, because it has a higher priority than C. Until then, C never gets to run and release Q. Thus, you effectively have a high priority process A waiting on a medium priority process B to finish.
In the Mars Lander mission, A was a high-priority process that managed data on a message bus. B was a long-running medium-priority communications process. C was a low-priority process that used the message bus to transfer meteorological data. Q was a shared lock on the message bus. The problem was that the communications task B took a very long time. Normally, this was not an issue, because B would just run whenever A did not need the CPU. However, in the priority inverted scenario, B would prevent A from running for extremely long stretches of time. This caused the Lander’s software to be rather flaky, requiring reboots every so often when the priority inversion happened.
Some OSs (I think Solaris, among them) try to address priority inversion. They use something called priority inheritance. A lock can be given a high priority. While holding that lock, a process inherits the priority of the lock. So in the Mars Lander case, when the meteorological task locked the message bus, it would inherit the high priority of the lock, so the communications task could not preempt it, and it could complete its work and release the lock quickly.
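For anyone who wants to poke at this, here is a minimal sketch of the A/B/C scenario above as a tick-by-tick simulation in Python. Everything in it (the Task class, simulate(), the release times and work amounts) is made up for illustration and is not how VxWorks or Solaris implement anything. Note that it uses the classic form of inheritance, where the lock holder is boosted to the priority of the highest-priority task waiting on the lock, rather than a priority stored on the lock itself.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # compare tasks by identity, not by field values
class Task:
    name: str
    priority: int       # larger number = higher priority
    release: int        # tick at which the task becomes ready
    work: int           # CPU ticks the task needs in total
    needs_lock: bool    # True if the task does all its work holding Q
    done_at: int = -1   # tick at which the task finished (-1 = never)
    remaining: int = 0  # filled in by __post_init__

    def __post_init__(self):
        self.remaining = self.work


def simulate(inheritance, horizon=100):
    """Run A, B and C on one CPU; return {task name: finish tick}."""
    tasks = [
        Task("A", 3, release=2, work=2, needs_lock=True),    # bus manager
        Task("B", 2, release=1, work=10, needs_lock=False),  # long comms job
        Task("C", 1, release=0, work=3, needs_lock=True),    # meteo data
    ]
    holder = None  # task currently holding the shared lock Q
    for tick in range(horizon):
        ready = [t for t in tasks if t.release <= tick and t.remaining > 0]
        if not ready:
            break
        # A task is blocked if it needs Q while another task holds it.
        blocked = [t for t in ready
                   if t.needs_lock and holder is not None and holder is not t]

        def effective(t):
            # With inheritance, the holder runs at its waiters' top priority.
            if inheritance and t is holder and blocked:
                return max(t.priority, max(b.priority for b in blocked))
            return t.priority

        runnable = [t for t in ready if t not in blocked]
        current = max(runnable, key=effective)
        if current.needs_lock and holder is None:
            holder = current  # take Q on first run
        current.remaining -= 1
        if current.remaining == 0:
            current.done_at = tick + 1
            if holder is current:
                holder = None  # release Q on completion
    return {t.name: t.done_at for t in tasks}


if __name__ == "__main__":
    print("no inheritance:", simulate(inheritance=False))
    print("inheritance:   ", simulate(inheritance=True))
```

With these made-up numbers, A finishes at tick 15 without inheritance, stuck behind all of B’s work, and at tick 6 with it.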
I wonder what OS was installed in the Beagle 2 on-board computer.
“With volume discounts, the software can cost as little as $8,000 per copy.”
Why does it cost so much for an operating system that performs a limited set of tasks? No wonder whenever a rocket explodes on the launching pad, taxpayers become poorer to the tune of millions.
Just a small addition to an excellent explanation, rayiner: there was actually an option for enabling priority inheritance in VxWorks. The Pathfinder engineers just forgot to turn it on.
http://research.microsoft.com/~mbj/Mars_Pathfinder/Mars_Pathfinder….
Joseph,
http://www.embeddedstar.com/press/content/2003/6/embedded9122.htmlt
http://www-2.cs.cmu.edu/~personalrover/PER/
Penguins in space are already hardened, as they are used on the International Space Station, on the shuttle, and in lots of NASA’s and ESA’s work areas.
@hgm,
The Beagle 2 used (uses?) the Spacecraft Control Operating System (SCOS). A single Linux workstation is receiving and sending its commands, according to several articles.
VxWorks does not do it well all the time.
Don’t you remember the little Mars Pathfinder problems that engineers needed to debug live while the rover was on Mars?
http://www.cis.ksu.edu/~hatcliff/842/Docs/Course-Overview/pathfinde…
I am wondering if you are familiar with CSP, Occam etc., and if so, are parallel systems based on Occam-like languages that use rendezvous (also Ada) and message passing (MPI, PVM) also susceptible to priority inversion?
As far as I know about Occam systems, I don’t think so, but then Occam has very little choice in priority levels (high, low). I assume that if Occam programs model systems with mutexes, semaphores, locks etc., the problem would be right there too.
Thanks
johnjaksonATusaDOTcom
Also, for those who would even think to ask why their favourite OS or CPU chip isn’t in space: on most of the missions I know of, the CPU processing is remarkably puny compared to desktops.
I have been told that NASA engineers prefer to use distributed, very low power and relatively slow CPUs for many of their components. What counts most is that the CPU is way faster than needed, can run well very underclocked, and is available in a rad-hard version. This also means that rad-hard CPUs have to be many generations behind desktops, so a 1u CPU is perfectly fine.
A JPL friend told me in our last conversation that they were using 20-year-old 8/16-bitters. Older PPCs are now also used as the central CPU, but their workload might be only 1%. The Mars missions, though, may be more advanced.
Also, VxWorks and others of its class have been running the same sort of thing since micros first became commonplace in process control. There really isn’t much call for nVidia, super fast AGP buses, huge HDs, and the myriad of interfaces that give us driver hell. The engineering is all about keeping command, sensors and comms working together.
I also seem to recall that communications run at incredibly slow bit rates, to have even the remotest chance of being received correctly. At such large distances the amount of power received by the dishes is almost zero. I’d need to do more digging to say any more.
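To put a rough number on “almost zero”, here is a back-of-envelope sketch. It assumes a transmitter that radiates equally in all directions (real spacecraft use directional antennas, which helps a lot) and a 70 m receiving dish at 1.5e11 m, roughly one Earth-Sun distance. Both figures are illustrative assumptions, not mission data, and intercepted_fraction is a hypothetical helper.

```python
import math


def intercepted_fraction(dish_diameter_m, distance_m):
    """Fraction of isotropically radiated power that lands on a dish."""
    dish_area = math.pi * (dish_diameter_m / 2) ** 2
    sphere_area = 4 * math.pi * distance_m ** 2  # power spreads over a sphere
    return dish_area / sphere_area


if __name__ == "__main__":
    f = intercepted_fraction(70.0, 1.5e11)
    print(f"fraction of transmitted power collected: {f:.2e}")
```

Under these assumptions the dish collects on the order of 1e-20 of what was sent, and doubling the distance quarters that again, which is why the bit rates have to be so low.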
Why not qnx? Don’t know if the nasa guys are allowed
to use Canadian software 😉
JJ, yep, they use small CPUs and spread them around, which also comes in handy for redundancy. It also just makes for better packaging than trying to wire it all back to one main CPU. I think what many miss is that in the end these craft are basically a bunch of simple robots bolted together. Simple little chips with rather basic code is all it needs. They don’t need fancy, they need reliable. And simple is nice too. It will be a long time till we see Sci Fi type computers on these things.
Even things like the Shuttle use old old tech and have very little cpu power to them. But it works and the code is very well hammered out.
For a project like this you’re going to use the simplest thing you can get that has the lowest resources. Heck, I wouldn’t be surprised if there is a good deal of PICs on that thing with a small bit of assembler doing some important tasks.
First of all, very well explained.
I recall reading about Solaris’ “priority inheritance” recently, so I can identify (somewhat) with this concept. It would be interesting to see an implementation of this scheduler model that takes into account semaphores, pipes, shared memory and local TCP client/server interaction. This would remove the need to explicitly specify dependencies in application code. In turn, all existing software would benefit immediately.
One can only begin to imagine how complex an implementation of such a concept could be. The question is whether the latency introduced by increasing the complexity of the scheduler is regained in the reduced number of temporary deadlocks.
If you read the interviews/papers with the JPL folks, you’d know
that they did see this problem before launching Pathfinder, but they had other things to do. Short story is that they forgot to turn on priority inheritance. Default is off on VxWorks.
There are a lot of schemes to solve the priority inversion
problem, and priority inheritance is (just) one of them.
However, some analysis (see Yodaiken’s paper http://www.linuxdevices.com/articles/AT7168794919.html ) shows that priority inheritance does not, in fact, solve the problem.
Some OSes supply primitives for P.I. (e.g. VxWorks) but a
better solution is to use lock-free data structures (e.g. message passing) as other RTOSes do, e.g. QNX and OSE.
what exactly causes priority inversion?
VxWorks is also used in European spacecraft. RTEMS too. But such systems have really, really poor features compared to a “normal” OS. Sometimes they use a TCP/IP stack. It did not have any memory protection. They used a priority scheduler + semaphores + threads, that’s all. VxWorks is well known for its debugging tools compared to RTEMS, for example.
The top-of-the-line CPUs in Europe are the ERC32 from Atmel (SPARC, 25 MHz); the next will be the LEON at 100 MHz.
http://www.atmel.com/dyn/products/product_card.asp?part_id=3187
http://www.atmel.com/dyn/products/product_card.asp?part_id=3178
The USA could use some PPC at 200/300 MHz for space. But you can’t buy it as you want.
The cause is having interacting tasks at different
priorities running at unpredictable times. This is very
common in any complex system that uses threads and
shared locks. It is very hard not to have PI in this
architecture, especially if one shared resource is being
accessed a lot.
Real world access patterns are often
surprising. Maybe a transistor isn’t to spec and you
get a lot more interrupts and thus data
than you were expecting, for example.
VxWorks really doesn’t handle PI well, because it
restores a task’s priority only after the thread has
released the last mutex it took, not as soon as the
blocking condition has actually been cleared.
This means PI can last for a very long time,
as lower priority threads often have infinite streams
of work to perform.
In the real-time community there is a bias against message
passing as too slow, so that is why they probably use
locks. You also have queuing issues as a lock is
a form of back pressure.
Message passing still has deadlock issues, but PI
is solved. QNX and other OSs will correctly restore
priority in PI state.
That they saw this condition and thought it wasn’t
important is somewhat shocking. PI means there is
something wrong in your architecture, especially
if it happens a lot. It means your system isn’t
deterministic.
Huh?
http://flightlinux.gsfc.nasa.gov/
Interesting!
When I said lock-free data structures, message-passing
is just one of them. There are several more, read Yodaiken’s
paper above.
Anyway, there’s no intrinsic reason why message passing
would be any slower than a semaphore approach. Transfer of
execution control, either over a passed message or release of a semaphore, boils down to one context switch.
Message passing does not require copying or moving memory (that’s a common misconception) unless messages are passed
between processes living in separate memory spaces.
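A small sketch of that point in Python, assuming both tasks are threads in the same address space: the queue hands over a reference to the buffer, so nothing is copied. The names (pass_by_reference, mailbox, payload) are illustrative only.

```python
import queue
import threading


def pass_by_reference():
    """Send a buffer through a queue; report whether the receiver got the same object."""
    mailbox = queue.Queue()
    payload = bytearray(1024 * 1024)  # 1 MiB message body
    result = {}

    def receiver():
        msg = mailbox.get()  # blocks until a message arrives
        result["same_object"] = msg is payload

    worker = threading.Thread(target=receiver)
    worker.start()
    mailbox.put(payload)  # enqueues a reference, not a copy
    worker.join()
    return result["same_object"]


if __name__ == "__main__":
    print("receiver saw the sender's buffer:", pass_by_reference())
```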
I think message passing is an elegant solution to the question of data ownership.
Then, I would disagree with you on what message passing solves. Deadlocks are easy to accomplish with semaphores, not that easy with message passing. However, message passing does solve the PI problem on the process level, but it reappears on the message level: by queuing up all sorts of messages in the same queue, lower-priority messages may block higher-priority messages. Priority inversion appears on the message level.
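A minimal sketch of the message-level effect, with invented message names and priorities: the same four messages drained from a plain FIFO queue and from a queue ordered by priority. drain_fifo and drain_by_priority are hypothetical helpers, not any RTOS API.

```python
import heapq
from collections import deque
from itertools import count

# (priority, name); larger number = more urgent. Illustrative data only.
messages = [
    (1, "log line"),
    (1, "telemetry sample"),
    (9, "abort command"),
    (1, "housekeeping"),
]


def drain_fifo(msgs):
    """Handle messages strictly in arrival order."""
    pending = deque(msgs)
    order = []
    while pending:
        order.append(pending.popleft()[1])
    return order


def drain_by_priority(msgs):
    """Handle the most urgent message first; ties keep arrival order."""
    arrival = count()
    heap = [(-prio, next(arrival), name) for prio, name in msgs]
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap)[2])
    return order


if __name__ == "__main__":
    print("FIFO:    ", drain_fifo(messages))
    print("priority:", drain_by_priority(messages))
```

In the FIFO case the urgent message waits behind two low-priority ones; the priority-ordered queue handles it first.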
It’s interesting because it’s Linux?
I think what’s more interesting is that if you build something like an aircraft or space shuttle, the cost of the OS is peanuts compared to all other stuff…
And frankly, if one airplane used Linux and I had the option of another that ran QNX… the choice would be obvious! I’d choose security and my safety rather than your politics!
Thorbjörn,
I’ve never been smart enough to understand and deploy
lock free data structures in a complex environment.
I agree message passing can be efficient. I work in the
real-time embedded space and I am just passing on the
summary of many, many conversations.
Semaphores and tasks take less discipline than messages
so they tend to be used more.
Deadlocks are just as easy to accomplish with message
passing. It is a flaw in the protocol, not
the mechanism of the protocol, that is the problem.
Priority in a queue isn’t the same because the priority
isn’t inherited by the task. Priority in a queue is
mainly a latency issue. There are several ways around
this, but none are really satisfactory.
They used a hardened RS/6000 board. I am sure it is a simple COTS version of a CCT board. QNX isn’t supported on that as far as I know. The (I assume) Motorola chip is well supported by vxWorks.
Does anyone find it strange that BAE Systems wouldn’t use CsLEOS (http://platformsolutions.na.baesystems.com:8080/CsLeos/index.jsp), which is their own RTOS? Makes me wonder if CsLEOS really isn’t as much of a contender in the RTOS market as they want you to believe.
As to the concern over costs of an RTOS, one must consider the reason why they cost so much. It isn’t like buying a copy of Windows or Mac OS X. Most RTOSes are used in safety-critical applications such as aircraft, space, and military applications. Therefore the companies that produce the OS must have them certified. Certification can be quite costly.
VxWorks can also be found in high volume devices such as cell phones (I also believe that one of the original Apple Airport networking hubs ran VxWorks). According to a discussion I had with a salesman from Wind River, high volume can really bring the price down for their OS. In addition, such manufacturers buy a non-safety-critical version of the OS.
One drawback of vxWorks is the lack of MMU support. Most of the processors used in space do not have an MMU, but that is changing. I’m not sure what BAE processor is used, but if it is the BAE RAD750 there is an MMU.
Operating systems like LynxOS, QNX, Linux, and the not-too-successful vxWorks AE use the MMU to isolate user processes from the rest of the system. In vxWorks, a stray pointer could allow the code to corrupt almost any memory. If you have an MMU available, why not use it?
Also, a vxWorks alternative that I am using is RTEMS ( http://www.rtems.com ), a FREE open source real-time OS with a file system, BSD network stack and most of the features of vxWorks (still no MMU support, but good for lower end 32 bit processors).
OK, my mistake, the rovers use a RAD6000, which does not have an MMU!
The vxWorks Board Support Package comes with a RAD6K board, so using it is fairly easy. Most other OSs like RTEMS would have to be ported to the board.
Actually, a package is available for MMU support. It’s a little bit salty, but it does provide the necessary memory protection such as program text, etc. You’d be surprised at the features Wind River provides; of course for an additional fee. You just need to ask your tech rep.
How would Linux do for the Mars probe?
How would it handle priority inversion?
The article said that VX is ‘very reliable’, which I find kind of insulting, even a FUD-type statement. VX better than Linux? Or Sun?
If anyone understands U.S. government contracts, they might understand what’s really going on here. BAE and VX. You scratch my back and I scratch yours.
This article really has my attention, since we the public pay for these robots and we may not be buying the best software to go with them; contractors may be pushing inferior software.
This could be something to call the congressman about.
Again, why isn’t Linux being used in these government projects?
hcuar, there is no stack protection and there is
no per task memory (process like) memory protection
so it is of relatively little use. The vxworks
“kernel” even makes use of deleted memory which
makes life even more difficult.
You would be surprised what you can’t do with vxworks 🙂
>The article said that VX is ‘very reliable’ which i find kind of insulting , even fud type statements. VX better than linux ? or Sun ?
Yes, and yes. vxWorks is also a realtime OS. Sun(Solaris) and Linux are not.
>If anyone understands U.S. government contracts they might understand whats really going on here. Bea and Vx. You scratch my back and I scratch yours.
I for one, know a few JPL engineers that would be rather insulted by that comment.
>Again , Why isn’t Linux being used in these government projects ?
Why should it ? You use the best tool for the job you need to do.
Linux cannot do everything. Get over it.
no like these
http://www.lynuxworks.com/
http://dmoz.org/Computers/Software/Operating_Systems/Realtime/Linux…
Maybe because support for VxWorks is better and the engineers are smarter?
Maybe because government is not interested in zealotry?
Maybe because hype and facts are not the same thing?
Maybe because there are reasons beyond the license which make things appropriate to be on Mars?
Are there any technical statements or benchmarks or whatever from your side which would show any benefits of using Linux over VxWorks?
>Maybe because there are reasons beyond the license which makes things appropriate to be on Mars?
Yup. One of the things is that vxWorks is known to work. It has been on many missions elsewhere. When you play with things like this, you don’t take many chances. You use what has proven to work.
And is Linux known to work on the processors they have on the vehicles? Come on, you don’t select a chip based on which OS you need to run on missions like this. E.g. I’d like to see the x86/alpha/ppc chip that operates constantly/reliably at -50 to -100 degrees.
Thoff,
if we’re not careful, this could turn into a
flame fest.
Anyway, semaphores and semaphore protocols are
actually far from trivial. If one uses PCP or HLP,
you’d have to know ALL semaphores in the system
and which processes/tasks that use which semaphore,
to be sure that you get the protection you need
and to be sure you aren’t deadlocked. This is fine
with small systems, but, say, with 3rd party components
that you have no source to, you have no idea what routine
uses how many and which semaphores. At best the 3rd
party routines are documented for their logical function,
very rarely for their access patterns for shared resources (or their timing characteristics for that matter). Where does it say that malloc() uses a semaphore, for instance?
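To make the bookkeeping concrete, here is a minimal sketch of the first step of PCP/HLP: working out each semaphore’s ceiling, i.e. the highest priority of any task that may take it. The task names, priorities and usage map are invented for illustration; the point is that the table is only right if it lists every user of every semaphore.

```python
def priority_ceilings(task_priority, task_semaphores):
    """Return {semaphore: highest priority among the tasks that use it}."""
    ceilings = {}
    for task, semaphores in task_semaphores.items():
        for sem in semaphores:
            ceilings[sem] = max(ceilings.get(sem, 0), task_priority[task])
    return ceilings


# Hypothetical system: larger number = higher priority.
task_priority = {"bus_manager": 3, "comms": 2, "meteo": 1}
task_semaphores = {
    "bus_manager": {"bus"},
    "comms": {"heap"},
    "meteo": {"bus", "heap"},
}

if __name__ == "__main__":
    print(priority_ceilings(task_priority, task_semaphores))
```

If a library routine called from bus_manager quietly takes the heap semaphore as well (the malloc() case), the real ceiling for heap is 3 rather than 2, and a table built from the documented usage is wrong.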
Using message passing is another way of thinking
and for some systems it’s more natural than others.
Telecommunication applications are an example where
the model fits very well, where the message streams
are directly mapped to the traffic streams.
Message passing is not a solution to all problems,
but a very convenient solution to some.
You can’t beat a purpose built device with COTS. There are two options for a project like this: buy an existing system built and proven for this type of environment, or develop your own OS/hardware. Personally I think Linux, among other OSes, would attach a huge amount of unnecessary bloat. Another thing I read somewhere too (correct me if I am wrong) was that one of the reasons they are using older CPUs is the comparatively huge transistor pathways, which reduce interference from various forms of radiation in space. So with these hardware limitations, is it worth the effort trying to get Linux to run on a Motorola 6800? Just so the Linux fundamentalists can say we put penguins in space?
Thorbjörn Jemander , why a flame fest? My usual role
in these conversations is on the message passing side.
It’s just that deadlock is independent of semaphores
and message passing. And in a complex app knowing all
the semaphores and patterns of interaction is
extremely difficult. NASA was hosed with 3 tasks.
Imagine much more complex apps that do many more things.
Thoff,
about flame fest: sorry, forgot the smiley after that
comment. 🙂
I agree with you. I tried to point out what both sides
usually miss in these discussions, that msg passing actually
might suffer from a sort of PI. (Not in the traditional
sense, but nevertheless.)
As for causing latencies, that’s for sure. Introducing
queue latencies in real-time systems may be lethal for the
whole system, especially if you have processes whose
msg computation time is proportional to the queue length.
I saw such an example just a few weeks ago. Very nasty
behaviour.
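To see why that gets nasty, a toy model, assumed here purely for illustration and not taken from the system mentioned above: if handling one message costs time proportional to the number of messages currently queued, draining a backlog of n messages costs n + (n-1) + ... + 1 = n(n+1)/2 units.

```python
def drain_cost(backlog, cost_per_queued_msg=1):
    """Total work to empty a queue when each message costs one unit per queued message."""
    total = 0
    queued = backlog
    while queued:
        total += queued * cost_per_queued_msg  # cost grows with queue length
        queued -= 1                            # one message handled
    return total


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "messages ->", drain_cost(n), "units of work")
```

Ten times the backlog means roughly a hundred times the work, so a queue that falls a little behind can fall a lot further behind very quickly.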
The URL for the Embedded Star story should be:
http://www.embeddedstar.com/press/content/2003/6/embedded9122.html
The one listed in a prior comment contains an extra “t” at the end, which is generating a lot of 404 errors on our site.