A couple of months ago, at the Lightweight Languages Workshop 2002, Matthew Flatt put forward a premise in his talk: operating systems and programming languages are the same thing (at least “mathematically speaking”). I find this interesting, and there is a lot of truth in it. Both OS and PL are platforms on which other programs run. Both virtualize the underlying machine. Both make it easier for people to write applications (by providing APIs, abstractions, frameworks, etc.).
The difference between the two, Matthew continued, is that an OS focuses more on non-interference, or isolation between OS processes. The main task of a multiuser OS is to let several users use the computer simultaneously. Thus, it is important that no user can take over the machine or use up its resources permanently. Also, no process should be able to terminate other processes, peek into their resources, or do anything else that violates privacy unless the OS security policy permits it.
On the other hand, a PL focuses on expressiveness and cooperation. A PL provides high-level constructs and facilities so that one can write programs in less time and with less effort. Ten lines of higher-level PL code might be equivalent to 100 to 1000 lines of machine or lower-level language code. Additionally, a PL provides means for people to share reusable code through the concepts of modules, shared libraries, components, etc.
As time progresses, OSes are becoming more like PLs, and vice versa. OSes now provide more and more ways to cooperate and share: IPC, threads, COM, etc. PLs now provide ways to isolate: sandboxing, processes, etc.
However, none of the programming languages that I am currently using (Perl, Python, Ruby) was designed from the ground up to do isolation. Thus, none of their isolation mechanisms really works well.
This article will focus on the above three languages. It would certainly be interesting to also discuss Scheme, Smalltalk, Java, and Erlang; however, since I’m not adequately familiar with any of them, I’ll leave it to readers to give feedback on these.
Why Isolation In PL?
As people construct more and more complex systems, the need for isolation becomes apparent. Complex systems usually run untrusted user-level code that needs to be restricted. Several examples follow.
- Database systems usually provide some sort of stored procedure. A remote client can connect to the database and trigger a stored procedure to be executed. It is important that if the stored procedure crashes or loops, other clients can continue to use the database.
- Business applications usually allow users to specify business rules or constraints. Both are basically simplified high-level code. Users might specify these rules incorrectly, and the application must ensure that those errors do not have any unwanted impact.
- Web application servers usually allow pages/templates to contain code. Since generally the interpreter itself (e.g. Perl or PHP) is exposed to execute the code, the application must somehow ensure that no template can crash the application.
- Other applications might allow users to specify regular expressions. Regular expressions are actually a language, though a mini one. Overly complex regexes, whether specified accidentally or on purpose, can cause the regex engine to backtrack for an effectively unbounded amount of time.
So, in essence, a complex application is usually a platform in itself, running subprocesses/subprograms (in a single OS process). This requires that the PL has isolation mechanisms beyond those provided by the OS, such as restricting a piece of code from accessing a certain part of the filesystem, from using more than a specified amount of memory or CPU time, or from accessing certain functions, modules, or variables. Unfortunately, most PLs don’t have enough of them.
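The regular-expression case gives a feel for why the language alone is often not enough. The following is a minimal Python 3 sketch, not something taken from the talk or from any library: because Python's re module offers no way to bound matching time, the match runs in a separate interpreter process and the OS process boundary enforces a time budget. The helper name match_with_timeout and the exit-status convention are assumptions made for illustration.

```python
import subprocess
import sys

# Code run by the child interpreter: exit status 0 means "matched".
CHILD_CODE = (
    "import re, sys\n"
    "pattern, text = sys.argv[1], sys.argv[2]\n"
    "sys.exit(0 if re.search(pattern, text) else 1)\n"
)


def match_with_timeout(pattern, text, seconds=5.0):
    """Hypothetical helper: True/False for match/no match, None on timeout."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", CHILD_CODE, pattern, text],
            timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising, so the
        # runaway match cannot keep consuming CPU.
        return None
    # Any non-zero status (no match, or an invalid pattern) counts as False.
    return done.returncode == 0


if __name__ == "__main__":
    print(match_with_timeout(r"a+$", "aaa"))
    # Nested quantifiers force exponential backtracking on this input.
    print(match_with_timeout(r"(a+)+$", "a" * 40 + "b", seconds=1.0))
```

The price is exactly the one discussed in this article: starting a whole OS process per match is slow, and the isolation comes from the OS rather than from the language.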
Perl
The two main security models in Perl are tainting and safe compartments. Tainting is mainly for tracking data, so I will not discuss it here.
In Perl 5.6/5.8 there are about 400 bytecode-level instructions, called opcodes. All Perl code is eventually compiled to these opcodes. print is actually a single opcode. So are open, sysopen, mkdir, rmdir, fork, gethostbyname, etc. For the complete list of Perl opcodes, see the Opcode module documentation.
Two things are apparent. One, Perl opcodes are higher level than machine-level instructions or even Java bytecode instructions. Two, Perl is a monolithic beast. Many facilities (like directory manipulation and even DNS-related functions) are built into the language. Perl 5 is monolithic for historical reasons. Perl 6 will also be monolithic (so I heard) for speed reasons.
Every single opcode can be enabled or disabled. The check is done in the compilation step: if the compiler encounters a forbidden opcode, it refuses it and compilation fails. This has the advantage of speed: code that passes the check incurs no run-time speed penalty at all. The disadvantage: one must be careful about where code gets compiled at run time; otherwise untrusted code can be compiled with dangerous opcodes in it.
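The idea of rejecting code at compile time can be sketched in Python 3 as well, using CPython bytecode in place of Perl opcodes. This is only an analogy to Perl's opcode masks, not how Perl implements them. The names compile_with_mask, FORBIDDEN_OPNAMES and ForbiddenOpcode are made up for the illustration, and the opcode names are specific to CPython.

```python
import dis
import types

# Hypothetical mask: every form of the import statement emits IMPORT_NAME.
FORBIDDEN_OPNAMES = {"IMPORT_NAME"}


class ForbiddenOpcode(Exception):
    """Raised when compiled code contains a masked opcode."""


def _check(code, forbidden):
    for instruction in dis.get_instructions(code):
        if instruction.opname in forbidden:
            raise ForbiddenOpcode(instruction.opname)
    # Function and class bodies are nested code objects stored as constants.
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            _check(const, forbidden)


def compile_with_mask(source, forbidden=FORBIDDEN_OPNAMES):
    """Compile source, refusing it if any forbidden opcode appears."""
    code = compile(source, "<untrusted>", "exec")
    _check(code, forbidden)
    return code  # the check costs nothing once the code is running
```

As with Perl, such a mask is coarse: a call like __import__("os") compiles to an ordinary function call and passes this check, which is the same kind of hole as CORE::open() discussed below.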
Safe.pm is a standard Perl module that allows a piece of code to be compiled with a specified opcode mask (a list of opcodes that are forbidden). In addition, Safe.pm does a “namespace chroot”: it makes Safe::Root0 (or Safe::Root1 for the second compartment, and so on) the code’s main:: namespace. This means that the code in the compartment cannot access variables in the original main:: namespace, so global variables like $/ are not shared with code outside the compartment (some variables, like $_ or the _ filehandle, are shared).
That’s basically what Perl offers us for security. In practice, Safe.pm is hard to use. Choosing a reasonable set of “safe” opcodes is not always straightforward. An opcode like open can range from “rather safe” to “extremely dangerous”.
Perl’s open is very powerful and has many functions: it can open a file for reading or for writing, execute programs, open a pipe, duplicate a filehandle, etc. You can’t, for instance, make Perl allow only reads in open. Overriding open() doesn’t make it safe, because the code in the compartment can always refer to the builtin version using CORE::open(). Moreover, Perl can be told to read/write files without using any opcode at all (for example, using $^I). Thus it is not possible to restrict untrusted Perl code from accessing the filesystem. To do this, one must resort to an OS facility (like Unix’s chroot or BSD’s jail).
The show-stopper for Safe.pm: most modules don’t work under it. DBI, for example. Embperl 1.x used Safe.pm but dropped it in the 2.x versions. Virtually no other web application server uses Safe.pm these days. Even Perl experts say that Safe.pm is too broken.
Conclusion: Perl has some sort of sandbox, but it works at the compilation step only. It’s not very flexible and it’s not very useful. Perl is also monolithic, with many functions built into the interpreter, so it is harder to isolate individual pieces of functionality.
Python
The Python language design is very simple and clean. Among the security models of the three languages, Python’s is the one I like the most. Python’s security model is capability-based, meaning that if you don’t want a piece of code to be able to do something, you don’t give it a reference to the module or function that provides it. Python is also much more modular: its core functionality is much smaller than Perl’s. For example, OS-specific services live in modules such as sys and os (unlink and rmdir, for instance, are in os). This means we can more easily restrict access to those services by preventing the code from importing the appropriate modules.
Here’s Python’s execution model: each piece of code runs in a frame (“a context”). In a frame, there are two namespaces: the local and the global namespace. A namespace is a mapping between names and objects. You get a reference (= capability) to an object from a namespace. Every time a variable/function/object/module name is mentioned, Python looks for it in the namespaces. The local namespace is searched first, then the global. If the name is not found in either, Python raises a NameError.
We can manipulate a namespace easily, since it is available as a dictionary. We can even execute a piece of code and give it our custom dictionaries to be used as its local and global namespaces. This way, we can limit what objects are available to the code. That’s basically how the security model works in Python.
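A minimal sketch of this in Python 3 syntax follows (in Python 3, exec is a builtin function rather than a statement). The names double and x and the code strings are invented for the example. The empty "__builtins__" entry is there because exec would otherwise add the full builtin namespace to the globals dictionary. This shows the lookup mechanism only; on its own it is not a secure sandbox.

```python
# Namespaces handed to the untrusted code: it sees these names and nothing else.
restricted_globals = {
    "__builtins__": {},          # no builtin functions at all
    "double": lambda n: 2 * n,   # the only capability we grant
    "x": 20,
}
restricted_locals = {}

# Assignments made by the code land in the local namespace dictionary.
exec("result = double(x) + 1", restricted_globals, restricted_locals)
granted = restricted_locals["result"]  # 2 * 20 + 1

# A name that is in none of the namespaces cannot be reached.
try:
    exec("open('/etc/passwd')", restricted_globals, restricted_locals)
    denied = None
except NameError as error:
    denied = str(error)
```

The code gets exactly the references placed in the two dictionaries, which is the capability idea in its simplest form.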
Actually, there’s a third namespace that is searched when a name is not found in the local and global namespaces: the builtin namespace. The builtin namespace contains basic functions like open, exit, execfile. Most of Python’s builtin capabilities are provided through this builtin namespace. The rest are creatures like print or exec, which are statements, not functions.
rexec is the standard Python module for sandboxing. It basically does what is explained above: run the sandboxed code with custom local and global namespaces. Additionally, rexec creates a custom builtin namespace and provides safer substitutes for functions like open or __import__. This way, we can tell rexec to forbid the untrusted code from opening a file in write mode, or from importing dangerous modules.
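rexec itself is not shown here, since the module does not exist in Python 3. The sketch below only illustrates the approach described above: a custom builtin namespace whose open refuses write modes and whose __import__ consults a whitelist. ALLOWED_MODULES, restricted_open, restricted_import and run_restricted are hypothetical names, and the code is not a secure sandbox, because introspection can still reach the real builtins.

```python
import builtins

ALLOWED_MODULES = {"math", "string"}  # hypothetical whitelist


def restricted_open(path, mode="r", *args, **kwargs):
    """Allow read-only opens; reject any mode that can write."""
    if any(flag in mode for flag in "wax+"):
        raise PermissionError("write access is not allowed: %r" % mode)
    return builtins.open(path, mode, *args, **kwargs)


def restricted_import(name, globals=None, locals=None, fromlist=(), level=0):
    """Import only top-level packages that are on the whitelist."""
    if name.split(".")[0] not in ALLOWED_MODULES:
        raise ImportError("import of %r is not allowed" % name)
    return builtins.__import__(name, globals, locals, fromlist, level)


def run_restricted(source):
    """Run source with a custom builtin namespace; return its globals."""
    namespace = {
        "__builtins__": {
            "len": len,
            "range": range,
            "open": restricted_open,
            "__import__": restricted_import,
        }
    }
    exec(source, namespace)
    return namespace
```

The import statement looks up __import__ in the builtin namespace of the running frame, so replacing that one entry is enough to intercept every import made by the restricted code.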
rexec is pretty flexible and has indeed been used successfully in several applications. Guido’s web browser Grail, for instance, allows running Python applets. However, rexec seems not to be flexible or fine-grained enough, because Zope chooses not to use it. Instead, Zope uses its own home-grown module to do restricted execution.
There are several things that rexec can’t do. Resource limiting, for example: to do that you need to resort to the OS (like using Unix’s setrlimit). Also, since Python does not have private attributes, you can’t give an object to untrusted code without the fear that the code will use the Python reflection mechanism to “peek into the guts” of your object (and from there gain references to other objects). There are two separate solutions to the last problem: the Bastion module and the mxProxy C extension, which essentially provide private attributes.
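The reflection problem is easy to demonstrate. In the Python 3 sketch below (the Account class and its fields are invented for illustration), the trusted side hands out a single bound method, yet the receiving code can walk from it to the whole object and to the global namespace of the code that defined it.

```python
class Account:
    def __init__(self, owner, pin):
        self.owner = owner
        self._pin = pin  # "private" by naming convention only

    def greet(self):
        return "hello " + self.owner


account = Account("alice", "1234")

# The trusted side means to grant just this one capability.
handed_out = account.greet

# What untrusted code can do with it:
leaked_object = handed_out.__self__                # the full Account instance
leaked_pin = leaked_object._pin                    # its supposedly private data
leaked_globals = handed_out.__func__.__globals__   # the defining module's globals
```

A proxy that exposes only selected methods narrows this, but the proxy itself must then be written so that it offers no such back-references.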
Conclusion: Python has a nice and simple security model. However, rexec cannot do every kind of isolation that one might need, such as resource limiting. Guido also once said that rexec is not tested enough and might contain security holes.
Ruby
One of the main goals of Ruby seems to be “to replace Perl”. In that respect, it has copied many Perl features. Tainting is one of them. In Perl there are two running modes: tainting mode on (-T, setuid) and off (no -T). Ruby extends this concept a bit by providing four different “safe levels” on top of the default level 0 (indicated by the global variable $SAFE). The different safe levels are as follows.
Safe level 0 (default mode): no tainting is performed.
Safe level 1: tainted data cannot be used in potentially dangerous operations.
Safe level 2: in addition to the level 1 restrictions, program files cannot be loaded from globally writable locations (e.g. from /tmp).
Safe level 3: in addition to the level 2 restrictions, all newly created objects are considered tainted.
Safe level 4: in addition to the level 3 restrictions, the running program is effectively partitioned in two. Nontainted objects may not be modified. Typically, this is used to create a sandbox: the program sets up an environment using a lower $SAFE level, then resets $SAFE to 4 to prevent subsequent changes to that environment.
It’s evident that, as with tainting, the safe levels are primarily concerned with data security and are not very sandbox-like (in the sense of a sandbox that isolates subprocesses from one another). Matz confirmed this on the ruby-talk mailing list by saying that Ruby currently does not have a sandbox yet. Running code in safe level 4 is usually too restrictive to be practical, plus it does not provide enough isolation.
The problem with isolation in Ruby is that all objects are accessible from any code (including code running in safe level 4) through the ObjectSpace facility. This is of course in direct conflict with the capability concept, in which you don’t give out a reference/capability unless necessary. However, Ruby does protect an object’s attributes and has a #freeze method to make an object read-only.
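Python has a rough counterpart to this problem, which makes the point easy to show in a few lines: the gc module can enumerate the objects tracked by the garbage collector. This is an analogy, not Ruby's ObjectSpace API, and the Secret class and the function names are hypothetical.

```python
import gc


class Secret:
    def __init__(self, value):
        self.value = value


def trusted_setup():
    # The trusted side creates the object and keeps the only reference.
    return Secret("launch-code")


def untrusted_code():
    # No reference to the Secret instance is ever passed in, yet the
    # collector's object list leads straight to it.
    return [
        obj.value
        for obj in gc.get_objects()
        if type(obj).__name__ == "Secret"
    ]


vault = trusted_setup()
found = untrusted_code()
```

Any facility that can list every live object hands out references nobody granted, so a capability-style sandbox has to keep such a facility away from untrusted code.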
Conclusion: Ruby doesn’t have a sandbox (yet).
Other Languages
Java has a sandbox security model and a bytecode verifier. Tcl basically has the same. Erlang is evolutionarily more advanced in providing isolation, in that it has a notion of “PL-level processes” (a process is isolated in all ways from another).
Conclusion
As people construct more and more complex applications in PLs, PLs are required to have adequate security/isolation mechanisms. The PLs in mainstream use today do not have them, so programmers are often forced to fall back on facilities provided by the OS. This has drawbacks such as lack of portability and reduced efficiency. Perhaps new PLs will be designed with isolation as one of their main goals, or current PLs might be improved or redesigned, so hopefully this requirement of having a “multiuser PL” will be fulfilled in the future.
About the Author:
Steven is a software developer residing in Bandung, Indonesia.