Secure Coding in C and C++: Strings

Eugenia Loli 2005-12-05 General Development 34 Comments

Strings – such as command-line arguments, environment variables, and console input – are of special concern in secure programming because they comprise most of the data exchanged between an end user and a software system. This chapter covers the security issues with strings and how you can sidestep them.

About The Author

Eugenia Loli

Ex-programmer, ex-editor in chief at OSNews.com, now a visual artist/filmmaker.


34 Comments

2005-12-05 6:09 am

ma_d
Sometimes I wonder about this:

“C/C++ lack of support for a string type is why all software is insecure.”

Well, most don’t say that, just the ones pushing their pet-language. But it is a real problem. I wonder if programmers simply hadn’t thought of this possibility until 5-10 years ago, and hadn’t really had it in their head until 5 years ago.

It’s not *that* hard to sit down and say to yourself “hey, I’m writing this function, what if the input is really long? Is that possible? Ok, it seems remotely possible, how can I make this error if it’s bigger than I’m supposed to handle? Can I just handle the ginormous string?” Yea, that’s extra work, but the bear is always in the details right?
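As a rough illustration of the "what if the input is really long?" check described above, here is a minimal sketch. The helper name `copy_checked` is hypothetical and not from the article; it assumes the caller knows the size of the destination buffer and wants an explicit failure instead of silent truncation.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies `src` into `dst` only if the whole string,
// including the terminating '\0', fits into `dst_size` bytes.
// On failure it returns false and leaves `dst` as an empty string.
bool copy_checked(char* dst, std::size_t dst_size, const char* src) {
    if (dst == nullptr || src == nullptr || dst_size == 0) {
        return false;
    }
    const std::size_t len = std::strlen(src);
    if (len >= dst_size) {      // no room for the characters plus '\0'
        dst[0] = '\0';
        return false;
    }
    std::memcpy(dst, src, len + 1);   // copy the characters and the '\0'
    return true;
}
```

A caller would test the return value and report an error when it is false, which is the "make this error if it's bigger than I'm supposed to handle" branch.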

2005-12-05 2:39 pm

PrimalDK
Well, first of all, most issues with their programming languages that programmers face today were researched and some of the problems solved by academia decades ago. Object-orientation was coined in 1967 with the advent of Simula, not invented by Bjarne Stroustrup in the early 80s. Pascal had a string type in the 70s, so no, it’s not like C++ couldn’t have had one when Stroustrup started on his “C with classes”.

The reason for not including one is probably mainly that the runtime shouldn’t have to calculate string length when dynamically allocating and resizing strings, thus making string allocation and resizing an even more expensive operation, and that certain processes (functions, procedures) are easier to implement (use [pseudo-code] “while not zero do …”).

Also, when 8-bit processors (early 80s) count in a register they count to 255, thus when strings are longer than 255 characters, the compiler can’t use a register for the loop. If a null-terminated string-model is used (as it is in C/C++), strings can be arbitrary length and the generated machine code for checking the boundary is the same no matter the length, no matter the word-size of the processor, as long as a compare and a jump instruction exists.
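The two string models contrasted here can be sketched side by side. This is only an illustration: `ShortString` is a hypothetical Pascal-style layout with a one-byte length, not a type from any real library, and both functions simply count spaces so the loop shapes can be compared.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical length-prefixed string: one length byte, so at most 255 chars.
struct ShortString {
    std::uint8_t length;
    char data[255];
};

// Counted loop: the bound lives in a counter that must be wide enough
// for the longest possible string.
std::size_t count_spaces_counted(const ShortString& s) {
    std::size_t n = 0;
    for (std::uint8_t i = 0; i < s.length; ++i) {
        if (s.data[i] == ' ') ++n;
    }
    return n;
}

// Sentinel loop ("while not zero do ..."): no counter, just a compare and
// a jump per character, whatever the length of the string.
std::size_t count_spaces_terminated(const char* s) {
    std::size_t n = 0;
    while (*s != '\0') {
        if (*s == ' ') ++n;
        ++s;
    }
    return n;
}
```

The sentinel version has no built-in limit, but it also has no way to know where the buffer ends if the terminator is missing, which is the root of the overflow problems discussed in this thread.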

In the early 80s, programmers generally weren’t programming software the size of a modern operating system with millions of lines of code, so understanding the structure of the code was feasible for a single person. With the advent of programming in the large (and, of course, modern user interfaces with their widgets), object-orientation took off.

The problems that null-terminated strings (character arrays) cause are just a few in a sea of possible errors that programmers create, and they are easily handled by garbage collection and array bounds checking. Much harder problems are guaranteeing hard-realtime response on non-embedded systems, getting near-C performance from side-effect-free languages, proving anything about sufficiently large software structures, handling schedules in distributed systems etc.

I don’t think it’s feasible for anybody to handle code for structures as complex as some of today’s larger systems and make it work all the time, unless it is built on a solid logical or mathematical foundation. Python can run out of memory too, and when it starts allocating your swap memory it will slow down your computer and thus, were the software (say) flying an airplane, potentially kill people.

“C/C++ lack of support for a string type is why all software is insecure” only demonstrates knowledge of popular misconceptions, not a particularly broad view of reality.

2005-12-05 7:06 pm

ma_d
Extremely well put.

2005-12-05 6:48 am

rayiner
Well, most don’t say that, just the ones pushing their pet-language. But it is a real problem.

Yes it’s a real problem. Some studies show that half of security vulnerabilities result from buffer overflows: http://www-128.ibm.com/developerworks/security/library/s-overflows/… In languages that rigorously check array bounds and use a garbage collector, these errors just aren’t possible. In modern compilers (and with modern CPUs), these things cost so little (array-bounds checks cost on the order of 2-3%), there is almost no good reason to not use them.

I wonder if programmers simply hadn’t thought of this possibility until 5-10 years ago, and hadn’t really had it in their head until 5 years ago.

Programmers realized it decades ago. Pascal is older than many of the posters in this forum, and Lisp is older than most of the posters in this forum. However, mainstream programmers are living about two decades behind the state of the art in many respects, and concern about memory safety checking is one of those.

It’s not *that* hard to sit down and say to yourself “hey, I’m writing this function, what if the input is really long? Is that possible? Ok, it seems remotely possible, how can I make this error if it’s bigger than I’m supposed to handle?

In a good language (Python, Lisp, among others), you can bang out small functions in less than a minute. Stopping to think about details like that will absolutely kill your productivity. It’s a matter of focusing on the problem, not on the implementation. The more you can concentrate on exactly what you’re trying to do, instead of on figuring out how to get the computer to do what you want, the faster and more efficiently you can code.

This is a very well-understood and well-accepted concept in industrial engineering and management. They have people whose job consists simply of streamlining workflows as much as possible, to allow people to complete tasks without the mental overhead of thinking about things that are only incidentally related. I have absolutely no idea why programmers are so resistant to these principles.

Yea, that’s extra work, but the bear is always in the details right?

The bear is in the actual problem. I have enough on my hands figuring out how to solve the actual problem, and the fewer details to get in the way the better.

2005-12-05 3:50 pm

Wrawrat
Completely agree with you, with one exception. In my opinion, system programmers writing low-level routines (kernel, drivers) should have the burden of making these checks, since they should know what they are doing. The 2 or 3% of CPU does matter since performance is an issue. Sure, it isn’t that much, but I hate the current trend making new computers not much faster than older ones because of that philosophy.

But you’re right for normal programs. These checks are a waste of time. I am doing C++ but I am seriously investigating other languages that are less bitchy with that kind of stuff.

2005-12-05 6:53 am

timosa
I have never fully understood why buffer overflows are so common. Most toolkits offer some kind of help for programmers. GTK has GString and Qt has QString. Of course there is a performance penalty but I suppose security is usually more important for the end users.

2005-12-05 7:00 am

rayiner
Buffer overflows are common because the C standard library sucks. Not only does it not have a standard string type, but it has very poor tools for manipulating strings, and sequences in general.

Most developers tend to use whatever functions come bundled with the language. You’ll not often see someone add a dependency to GTK+, for example, unless they’re actually writing a GTK+ application. As a result, most applications that have no GUI tend to use the built-in C routines, and those suck very badly.

2005-12-05 7:18 am

evangs
Not to mention the fact that the geek community will be up in arms when they see that your program has a dependency on GTK+ when it doesn’t really use anything else apart from GString. Imagine all the cries of bloat you’ll be getting.

2005-12-05 8:44 am

miffe
Not to mention the fact that the geek community will be up in arms when they see that your program has a dependency on GTK+ when it doesn’t really use anything else apart from GString. Imagine all the cries of bloat you’ll be getting.

Which is why most non-graphical stuff in GTK+ is actually in glib. So the programmers should just link with that, not the full GTK.

2005-12-05 8:19 am

Richard James
C development (of the language, not in the language) is pretty much deprecated and replaced by C++. This is why these things have never been fixed. Someone really needs to sit down and write a new C specification and stop expecting developers to not use C because somewhere, somebody is programming in it right now.

Saying that the developers should change to another language or that they should use other libraries to make safe code is stupid, because it is not going to happen. People write in C because it works for their project, not because they have never seen another language before.

2005-12-05 9:40 am

nimble
People write in C because it works for their project

The question isn’t whether it works at all, but whether higher-level languages wouldn’t work better for them. Especially later on when the project has to be maintained and extended.

not because they have never seen another language before.

…, but because they’re most familiar with C and aren’t sure about going with something new. Perfectly natural, if regrettable.

2005-12-06 12:24 am

Richard James
The question isn’t whether it works at all, but whether higher-level languages wouldn’t work better for them. Especially later on when the project has to be maintained and extended.

For some people C is the only language that will work for their project. Higher level languages are not always a replacement, so harping on about them will not work.

2005-12-06 1:03 am

nimble
For some people C is the only language that will work for their project.

You yourself say “some”. I’d say “few”.

Higher level languages are not always a replacement, so harping on about them will not work.

If you’re gonna dismiss my comments as “harping”, at least come up with some logical counter-arguments. The fact that a few projects really do require a low-level language doesn’t mean that many other projects wouldn’t be better served with higher-level languages.

2005-12-05 10:30 am

Anonymous
C is kind of like dynamite. Yes, dynamite could blow off your hand, but that doesn’t mean dynamite shouldn’t be explosive–what good would it be if it weren’t? Dynamite has its place, just like C.

This article gives some insight about precautions that should be considered when working with C/C++. Rather than discussing the points brought up in the article, everyone seems to be attacking a low-level language for doing exactly what it is designed to do–give a developer extreme control. Whether or not everyone needs this degree of control is not the point. It exists for those that do need it.

2005-12-05 11:06 am

nimble
Read the comments again. People weren’t criticising C, but its use in projects where low-level access and (often over-estimated) efficiency should be secondary issues compared to security and maintainability.

But while we’re at it: yes, C is a decent enough low-level language, but even there it has its flaws.

In some respects it doesn’t provide enough hardware access, e.g. it doesn’t expose the carry and overflow flags, which means you have to use either fairly inefficient bit manipulation or program directly in assembler when implementing big integers.
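To make the carry-flag complaint concrete, here is a minimal sketch of multi-word addition without access to the hardware flag. The function name `add_words` and the little-endian word layout are assumptions made for this example; the carry has to be recovered by comparing each wrapped sum with an operand, where an assembler routine could use an add-with-carry instruction.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Adds two big integers stored as little-endian 32-bit words.
// Precondition (assumed for this sketch): a.size() == b.size().
// Returns true if there is a carry out of the most significant word.
bool add_words(const std::vector<std::uint32_t>& a,
               const std::vector<std::uint32_t>& b,
               std::vector<std::uint32_t>& sum) {
    sum.assign(a.size(), 0);
    std::uint32_t carry = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const std::uint32_t partial = a[i] + b[i];   // wraps modulo 2^32
        const std::uint32_t c1 = partial < a[i];     // wrapped: carry from a + b
        const std::uint32_t total = partial + carry;
        const std::uint32_t c2 = total < partial;    // wrapped: carry from carry-in
        sum[i] = total;
        carry = c1 | c2;                             // at most one of them is set
    }
    return carry != 0;
}
```

Every word costs two extra comparisons here, which is the "fairly inefficient bit manipulation" the comment refers to.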

The lack of standardised base types led to everyone introducing their own “types.h” standard, causing problems with portability and mixing code from different sources.

And surely even a low-level language could have a decent module mechanism and handle constants and inline functions, rather than rely on a dumb preprocessor for those things.
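A small sketch of why the preprocessor is a poor substitute for constants and inline functions. It is written in C++, where typed constants and `inline` are available; all names are made up for the illustration.

```cpp
// Macro version: plain text substitution, so the argument expression is
// pasted in twice and therefore evaluated twice.
#define SQUARE_MACRO(x) ((x) * (x))

// Typed, scoped alternatives (hypothetical names).
const int kMaxNameLength = 64;                 // visible to the debugger, has a type
inline int square(int x) { return x * x; }     // argument evaluated exactly once

// Helper with a side effect, used to expose the double evaluation.
int next_value(int& counter) { return ++counter; }
```

With `int c = 0;`, the expression `SQUARE_MACRO(next_value(c))` calls `next_value` twice and leaves `c` at 2, while `square(next_value(c))` calls it once.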

2005-12-05 11:13 am

Anonymous
Just use std::string like all C++ textbooks recommend. When bound to use C char array, use std::string.c_str() function. What can be easier?

2005-12-05 11:42 am

Anonymous
Just use std::string like all C++ textbooks recommend. When bound to use C char array, use std::string.c_str() function. What can be easier?

Agreed.

The author is making a lot of noise around a non-issue; at least in C++.

PS: When an article uses “C” and “C++” as if they were the same language, then you know the article does not hold a very deep knowledge of C++.

2005-12-06 11:29 am

anda_skoa
PS: When an article uses “C” and “C++” as if they were the same language, then you know the article does not hold a very deep knowledge of C++.

Exactly!

My guess is that mentioning C++ is required marketing wise.

You write a book about proper C coding but to get better sales your title has to include C++ as well.

Maybe for those “C++” programmers that are misusing C++ as C with classes.

I mean, streaming from cin into a fixed length char buffer? come on

Who does that in other languages that have string classes and then complains about getting an out of bounds exception?

2005-12-05 11:47 am

nimble
Just use std::string like all C++ textbooks recommend. When bound to use C char array, use std::string.c_str() function. What can be easier?

c_str() returns a ‘const char *’, yet C functions often expect a ‘char *’ (even if they don’t actually intend to change the string), so you’ll have to sprinkle your code with const_casts.
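The situation described here might look like the following sketch. `legacy_count_chars` is a made-up stand-in for a C function that takes `char *` although it only reads the string; the `const_cast` is acceptable only under that assumption.

```cpp
#include <cstddef>
#include <string>

// Hypothetical legacy C function: the parameter is `char*`, but the
// function never writes through it.
extern "C" std::size_t legacy_count_chars(char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}

// C++ wrapper that hides the cast in one place.
std::size_t count_chars(const std::string& s) {
    // Safe only because legacy_count_chars is known not to modify the data.
    return legacy_count_chars(const_cast<char*>(s.c_str()));
}
```

Keeping the cast inside a single wrapper at least avoids sprinkling `const_cast` through the calling code. If the C function actually writes to the buffer, this approach is not safe.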

Going the other way, a string can be constructed from a ‘char *’ easily enough, yet it does involve a copy and you still have to think about who’s going to deallocate the ‘char *’.

And none of this helps with string input, which is what the article was mainly about. operator>> on istreams doesn’t do strings.

So much bother with such a basic data type ..

2005-12-05 12:16 pm

nimble
And none of this helps with string input

But then again there’s stringbuf, which the article never mentions.

2005-12-05 1:04 pm

Anonymous
And none of this helps with string input, which is what the article was mainly about. operator>> on istreams doesn’t do strings.

Since when?

string Word;

cin >> Word;

or even

string Line;

getline(cin, Line);
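For comparison, a slightly fuller sketch of the two input styles shown above. The helpers take any `std::istream` so the same code works with `cin` or with an in-memory stream; the function names are invented for this example.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// operator>> reads one whitespace-delimited word at a time; the string
// grows as needed, so no fixed buffer size is involved.
std::vector<std::string> read_words(std::istream& in) {
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}

// std::getline reads up to (and discards) the next newline.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}

// Convenience wrapper: split an in-memory string into words.
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream in(text);
    return read_words(in);
}
```

Calling `read_lines(std::cin)` would collect console input line by line. Neither helper limits how much memory a very long input can consume, which is a separate concern from buffer overflows.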

2005-12-05 1:40 pm

nimble

string Word;

cin >> Word;

Afaik that’s a non-standard GNU extension. Correct me if I’m wrong.

2005-12-05 2:11 pm

Anonymous
“Afaik that’s a non-standard GNU extension. Correct me if I’m wrong.”

In C++ it’s now a part of the standard library.

2005-12-09 12:46 pm

Anonymous
Wrong! From Stroustrup C++ FAQ[1]:

[…]

When choosing a book, look for one that presents Standard C++ and use the standard library facilities in an integrated manner from the start. For example, reading a string from input should look something like

string s; // Standard C++ style

cin >> s;

and not like this

char s[MAX]; /* Standard C style */

scanf("%s",s);

[…]

1. http://www.research.att.com/~bs/bs_faq.html#how-to-start

2005-12-05 1:10 pm

Anonymous
c_str() returns a ‘const char *’, yet C functions often expect a ‘char *’ (even if they don’t actually intend to change the string), so you’ll have to sprinkle your code with const_casts.

Or you could fix the function definition of the ‘offending’ function (and report that to the maintainer so that it’ll be fixed in a next release).

2005-12-05 3:01 pm

Anonymous
If you’re truly interfacing with C this way from C++, and if the other poster’s recommendation to report the problem to the library maintainer in hope of a fix (which may work for open source, but not often for proprietary libraries) doesn’t work for you, you should be using std::vector, specialized on char. This can be transparently used as a character array for interfacing with C and takes little effort to interface with std::string for C++ string processing.
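A minimal sketch of the `std::vector<char>` approach suggested here, for the case where the C function really does write to its argument. `legacy_uppercase` is a hypothetical stand-in for such a function.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical legacy C-style function that modifies its argument in place.
void legacy_uppercase(char* s) {
    for (; *s != '\0'; ++s) {
        *s = static_cast<char>(std::toupper(static_cast<unsigned char>(*s)));
    }
}

// Copies the std::string into a writable, null-terminated buffer, lets the
// C function work on it, and converts the result back.
std::string uppercase_copy(const std::string& input) {
    std::vector<char> buf(input.begin(), input.end());
    buf.push_back('\0');              // C functions expect the terminator
    legacy_uppercase(buf.data());     // &buf[0] with pre-C++11 compilers
    return std::string(buf.data());   // stops at the first '\0'
}
```

The vector owns the memory, so there is nothing to free by hand. The price is two copies, and an input containing embedded '\0' characters would be cut short on the way back.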

2005-12-05 4:11 pm

rayiner
Ha ha ha! That’s a very good joke!

2005-12-05 4:17 pm

rayiner
“C/C++ lack of support for a string type is why all software is insecure” only demonstrates knowledge of popular misconceptions, not a particularly broad view of reality.

Nobody said C’s lack of a string type is why all software is insecure. What people did say is that this lack is why software is less secure than it otherwise would be. If C was bounds-checked and GC’ed, would software instantly be secure? No. However, half of those CERT advisories would not exist, and perhaps even less, given that all the time spent looking for buffer overflows could be put to better uses.

There are problem domains where the guarantees of a safe language are too constraining for the task. Most programmers do not work in those problem domains. You can, for example, happily work in real-time code using a real-time GC, as long as you can accept a 50% performance hit. Most control-type code is in this category. The only folks who cannot take advantage of these technologies are people who need both high-performance and low memory usage. Kernel folks might fall into this category, and maybe game programmers.

The set of people who actually need to be using C is very small. The set of people who would be better served by something else, but don’t use it (for whatever reason), is absolutely enormous.

2005-12-05 5:25 pm

Latem
I do agree that C strings are somewhat problematic and error prone. Or at least more care should be taken when working with strings in C.

However, as others have mentioned, if just using C++, and there is not much interfacing with C code, just use any of the available string classes. char* is really not intended to be used for strings in C++. std::string is perfectly good for straight C++. Use CString under MFC (yuck), or QString if working with Qt. There is some performance penalty, but IMHO all the convenience and features available make up for the small speed loss. I know QString uses implicit sharing to reduce memory usage, so copying only occurs when it is needed. There are methods available to get C style strings to be used when needed. Finally, QString is Unicode, and CString can be as well.

Edited 2005-12-05 17:28

2005-12-05 6:28 pm

rayiner
While using the various String classes is the only recourse C++ programmers have, it’s not a great solution technically. The basic problem is that the STL post-dates C++ by far too long. As a result, most existing code doesn’t use the STL string class, but rolls its own. The poor programmer is now stuck with a dozen different string implementations, as well as a large number of C libraries that don’t use any string implementation. What does that mean? Every time someone wants to interface with multiple libraries, he’s back to dealing with char*. This is not an academic problem. If you’re used to doing STL-style “modern” C++ coding, the fact that every single library has its own string class, instead of using std::string, becomes a massive PITA.

The basic problem is that C++’s design is broken. Sticking stuff in optional libraries, instead of including them in the language, is a theoretically pure approach, but impractical. The end result is that everyone ends up rolling their own implementations of basic functionality, and interoperability and code reuse go to hell. In comparison, if your standard library is rich enough to do most of the basic things a programmer might want to do, then at least different libraries have a usable common set of data structures and algorithms. The STL is a good step towards fixing this issue, but it’s far too little and far too late.

2005-12-05 7:11 pm

ma_d
What rayiner didn’t seem to mention to you is that: The STL makes heavy use of “const char*” which essentially means that std::string is only half useful in the STL. The next problem is that std::string is not bounds checked, at all: Try it. There’s no real point in having [] access, no bounds checking, and calling it safe; you’ll still cause yourself the same problems.

You may save yourself some debugging time with std::string, but I doubt you’ll stop many security issues based on strange inputs: A better idea may be to have a very creative person write your test cases.

You can’t test everything, but if you can see the source code you should have a good idea of what things to throw at it to break it. My natural inclination, unfortunately, is to avoid dangerous inputs; I even find myself doing it on programs for which I’ve never seen the source: It took me a long time to do any D&D, my theory being that I didn’t know if the program supported it right so why risk it! Of course, that theory was subconscious…

2005-12-05 11:03 pm

rayiner
You may save yourself some debugging time with std::string, but I doubt you’ll stop many security issues based on strange inputs:

The thing std::string saves you from is the following:

char buf[1024];

gets(buf);

do_stuff(buf);

Sure, in reality, you shouldn’t use gets(), but people do anyway. Why? Because it’s a simple function to use for a simple task, and the “proper” C equivalent is more complicated for the same functionality. In C++, you can do:

string buf;

cin >> buf;

do_stuff(buf.c_str());

Same three lines, except the latter will handle unlimited-length inputs securely. When the library makes it so that the straightforward thing is also the right thing, secure code gets written.
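To give a feel for how much more the "proper" C-style equivalent of gets() involves, here is one possible sketch built on fgets(). The function name, the result enum and the policy of rejecting over-long lines are all choices made for this example, not something prescribed by the article.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

enum ReadResult { READ_OK, READ_TOO_LONG, READ_EOF, READ_ERROR };

// Reads one line of at most size - 1 characters into buf, without the
// trailing newline. Over-long lines are discarded and reported.
ReadResult read_line_bounded(std::FILE* in, char* buf, std::size_t size) {
    if (in == nullptr || buf == nullptr || size < 2) {
        return READ_ERROR;
    }
    if (std::fgets(buf, static_cast<int>(size), in) == nullptr) {
        buf[0] = '\0';
        return READ_EOF;                  // end of input (or a read error)
    }
    const std::size_t len = std::strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';              // whole line fit: strip the newline
        return READ_OK;
    }
    int ch = std::fgetc(in);
    if (ch == EOF || ch == '\n') {
        return READ_OK;                   // line ended exactly at the limit
    }
    while (ch != EOF && ch != '\n') {     // skip the rest of the long line
        ch = std::fgetc(in);
    }
    buf[0] = '\0';                        // do not hand back truncated data
    return READ_TOO_LONG;
}
```

A call such as `read_line_bounded(stdin, buf, sizeof buf)` replaces the single `gets(buf)` line, and the caller still has to decide what to do with each result code.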

2005-12-06 2:08 am

rayiner
For some people, yes. For most people no. Many people think that C is the only language that will work for their project. Many people think they need a language that they’d write a kernel in for their project. Many people think that language performance is the primary bottleneck in their software. All of these things are untrue. Most people do not need these things, they only think so because they lack perspective.

The simple facts are thus:

1) The differences between (compiled) languages are a matter of constant factors. For many types of code, and modern compilers, the constant factors are very good.

2) The difference between algorithms can be linear or even quadratic in nature. The quality of the algorithms you can implement directly depends on how productively you can write algorithms in your chosen language.

3) The time available to implement anything is fixed. C programmers like to contemplate a world without deadlines, where the performance of your code approaches the limit of the machine. This is not the world we live in.
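Point 2 can be illustrated with a toy example: two ways to detect duplicates in a list. This is only a sketch with invented function names; the first version does a quadratic number of comparisons, the second sorts a copy and does roughly n log n work, a gap that grows with the input size and that no amount of low-level tuning of the first version closes.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Compares every pair of elements: about n * n / 2 comparisons.
bool has_duplicates_quadratic(const std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        for (std::size_t j = i + 1; j < v.size(); ++j) {
            if (v[i] == v[j]) return true;
        }
    }
    return false;
}

// Sorts a copy (taken by value), then looks for equal neighbours.
bool has_duplicates_sorted(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return std::adjacent_find(v.begin(), v.end()) != v.end();
}
```

Both functions return the same answers, so the better algorithm is a drop-in replacement.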

Consideration of the above three points suggests a few conclusions about software performance:

1) All else being equal, making software development faster improves the performance of the software, given fixed time constraints. Since better algorithms have better performance payoffs than more highly-tuned code, productive languages can be a net win for performance even in the face of worse constant factors.

2) Conversely to the above, languages that make development slow but generate more highly tuned machine code can result in worse overall performance. You only need to look at something like GTK+ to see this observation in action.

2005-12-06 11:38 am

Anonymous
What rayiner didn’t seem to mention to you is that: The STL makes heavy use of “const char*” which essentially means that std::string is only half useful in the STL.

Wrong. The parts of the C++ Standard Library which use C-style strings in their interfaces do not belong to the STL. That is, for example the Iostreams library does use char const*, for passing filenames to constructors of some classes, among other things. Or in the related Locale facilities you will find it. Mind you, using std::string for these purposes will probably be added in the next version of the standard.

The next problem is that std::string is not bounds checked, at all:

This is so wrong. You might want to actually inform yourself before posting about things.

There’s no real point in having [] access, no bounds checking, and calling it safe; you’ll still cause yourself the same problems.

Even if your first statement were true, that std::string was not bounds checked at all, this would still be false, because then there would still be the property that std::string grows automatically when needed, and that it manages its storage automatically.
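Both properties mentioned in this exchange can be shown in a few lines. In standard C++, `operator[]` on std::string is not required to check the index, while the `at()` member is checked and throws `std::out_of_range`. The helper names below are invented for the sketch.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns the character at `index`, or `fallback` if the index is out of
// range. Uses the checked accessor at() instead of operator[].
char char_at_or(const std::string& s, std::size_t index, char fallback) {
    try {
        return s.at(index);                   // throws std::out_of_range
    } catch (const std::out_of_range&) {
        return fallback;
    }
}

// The string grows and manages its own storage while being appended to,
// so there is no destination buffer to overflow.
std::string repeat(const std::string& piece, std::size_t times) {
    std::string result;
    for (std::size_t i = 0; i < times; ++i) {
        result += piece;
    }
    return result;
}
```

So bounds checking is available, but only if the code asks for it by calling `at()`.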

You may save yourself some debugging time with std::string, but I doubt you’ll stop many security issues based on strange inputs

Of course not, because input validation has nothing to do with how you store the input after you have validated it.
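Input validation, as a step separate from storage, might look like the sketch below. The policy (1 to 32 characters, letters, digits, underscore or hyphen) and the function name are purely hypothetical; the point is only that the check happens on the std::string after it has been read safely.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical policy for a user name: 1 to 32 characters, each of them an
// alphanumeric character (as classified by the current C locale), '_' or '-'.
bool is_valid_username(const std::string& name) {
    const std::size_t kMaxLength = 32;
    if (name.empty() || name.size() > kMaxLength) {
        return false;
    }
    for (std::size_t i = 0; i < name.size(); ++i) {
        const unsigned char ch = static_cast<unsigned char>(name[i]);
        if (!(std::isalnum(ch) || ch == '_' || ch == '-')) {
            return false;                 // reject anything not on the list
        }
    }
    return true;
}
```

Listing what is allowed, instead of trying to enumerate what is dangerous, tends to be the more robust choice for this kind of check.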