Closed code threatens science

Thom Holwerda 2012-04-16 In the News 32 Comments

“Modern science relies upon researchers sharing their work so that their peers can check and verify success or failure. But most scientists still don’t share one crucial piece of information – the source codes of the computer programs driving much of today’s scientific progress.” Pretty crazy this isn’t the norm yet.

32 Comments

2012-04-16 5:43 am

kwan_e
There is a danger that, once the source code is released, other scientists would use it with all its bugs, causing errors to propagate undetectably into derived research.

Ostensibly, scientists can check the source code to find bugs, but it’s never going to be complete.

There is something to be said for scientists having to recreate source code in a clean room environment, because errors in either code or hypothesis are easier to expose.

2012-04-16 6:02 am

cyrilleberger
There is a danger that, once the source code is released, other scientists would use it with all its bugs, causing errors to propagate undetectably into derived research.

Instead of that, they should write their own implementation with its own set of bugs… right? Instead of fixing the bugs in the original implementation and bringing improvements…

2012-04-16 6:27 am

kwan_e
Instead of that, they should write their own implementation with its own set of bugs… right?

Yes. It’s another way to check that the theories in the original code are correctly implemented.

Say you want to write some code to verify the Hockey Stick Graph is correct. If you use the original source code, chances are, you’re not going to spot all the bugs in the implementation and you’ll likely end up with the same graph, which does not fulfil the goal of independent verification.

We’re talking scientific formulae, not a Linux desktop environment here. The most important thing is the data.

Instead of fixing the bugs in the original implementation and bringing improvements…

The only useful improvements for scientific research are corrections to formulae and theories. That can be done outside of code, and probably better served by being outside of code.

Do you seriously think it is a good idea for logic bugs to propagate through hundreds of research projects derived from the same code?

2012-04-16 8:19 am

j-kidd
Do you seriously think it is a good idea for logic bugs to propagate through hundreds of research projects derived from the same code?

The alternative is to have hundreds of research projects derived from hundreds of different codebases, each with its own bag of logic bugs. All code is inherently buggy, and scientists, due to a lack of basic training in software engineering (e.g. code reuse, unit testing, etc.), tend to write buggier code.

Last year, I had to do some data deduplication work using string metrics such as Jaro–Winkler distance. I found three open source libraries, one in Java and two in Python. All three implement the formula differently, resulting in significantly different metrics. The good thing is that, because the code is open, I could submit patches to the maintainers. Some bugs got fixed, some did not (but the bug reports are publicly available nevertheless).

One of these libraries, Febrl (Freely Extensible Biomedical Record Linkage), was released by the Australian National University as part of research. I owe a great deal to the authors, in particular for their willingness to put the code out for scrutiny.
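
To make concrete how implementations of this metric can diverge, here is a minimal Python sketch of the commonly cited textbook definition of Jaro–Winkler similarity. It is not the code of any of the three libraries mentioned above; the function names and the defaults `p=0.1` and `max_prefix=4` are assumptions chosen for illustration.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1], following the usual textbook definition."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (assumed defaults)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

Small choices, such as how the match window is rounded or how long a prefix is rewarded, are the kind of detail where two honest implementations can end up returning different numbers.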

2012-04-16 8:53 am

kwan_e

Do you seriously think it is a good idea for logic bugs to propagate through hundreds of research projects derived from the same code?

The alternative is to have hundreds of research projects derived from hundreds of different codebases, each with its own bag of logic bugs. All code is inherently buggy, and scientists, due to a lack of basic training in software engineering (e.g. code reuse, unit testing, etc.), tend to write buggier code.

Yes, that’s the point.

The more independent implementations of the same formula you have, the better your chance of finding problems with the actual formula.

You do understand that there is a development methodology that’s used for designing and writing robust code by implementing with different languages, don’t you?

In fact, many modern CPUs have a similar thing where a calculation takes place twice and the results are compared at the end to verify the calculation was correct. What I’m suggesting is that it’s analogous to having multiple cleanroom implementations of a formula.
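
The idea of comparing independent implementations can be sketched in a few lines of Python. This is an illustration only: sample variance stands in for "the formula", and the function names and the tolerance `rel_tol=1e-9` are assumptions, not anything taken from a real study.

```python
import math


def variance_two_pass(xs):
    """Sample variance: compute the mean first, then the squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_welford(xs):
    """Sample variance via Welford's single-pass update."""
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1)


def cross_check(implementations, data, rel_tol=1e-9):
    """Run every implementation on the same data and report whether they agree."""
    results = [f(data) for f in implementations]
    agree = all(math.isclose(results[0], r, rel_tol=rel_tol) for r in results[1:])
    return agree, results


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
agree, results = cross_check([variance_two_pass, variance_welford], data)
print(agree, results)
```

Agreement between the two does not prove either is right, since both could share the same misunderstanding of the formula, but a disagreement is a cheap and clear signal that at least one of them is wrong.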

2012-04-16 7:43 am

looncraz
During peer review, the code would be checked to verify any unexpected results.

With open code, any meaningful problem would be found and solved, and old studies could be easily re-run and verified or discarded.

With closed code, the bugs are never found, and the authors have no reason to repair it if they get what they think are sound results.

–The loon

2012-04-16 8:44 am

kwan_e
With closed code, the bugs are never found, and the authors have no reason to repair it if they get what they think are sound results.

–The loon

It doesn’t matter, because it’s the published results that matter, and if the results are wrong, someone can verify that independently once they are published. If you use the original source code to verify, it’s no longer independent.

In a research organization where there are hundreds of people pulling in open source code, you cannot guarantee someone did not pull code from the original base, leading to a compromised verification of the data.
2012-04-16 3:00 pm

thomasg76
Well, it can also quite easily work the opposite way: the source code is taken with little or no review, new data are run through it, and the original result is confirmed.

I am sure that this happens. Not too long ago I ran into this issue while looking at studies done in the field of psychology. They run most of their studies through SPSS to do a factor analysis, to get something out of the data. With everybody using the same software and the same way of conducting the study, of course they confirm the results of others. Most of the conclusions drawn are just simply wrong, because less than half of the data actually supports the result.

Now since most psychologists aren’t statisticians, they just take the work of others as a template for their own. And so a wrong method or piece of software propagates.

The same is going to happen if the source code for all research is opened. If the code is critical to the research, then it should be implemented independently to confirm the results, based on the same data. If the code is auxiliary to the problem, then who cares anyway.

Also, I know of professors who stopped publishing altogether because of that requirement. Now what do you gain?

The good thing from all the published work is, we KNOW that certain things work/exist, so they can be re-discovered and independently verified.

2012-04-16 8:32 am

renox
There is something to be said for scientists having to recreate source code in a clean room environment, because errors in either code or hypothesis are easier to expose.

I’m not so sure: there was a time when a popular idea for producing safe code (for avionics or things like that) was to have several independent teams code the same software, so that each would have different bugs.

A study then discovered that independent teams had quite a few identical bugs, so it became much less popular!

2012-04-16 8:59 am

kwan_e
A study then discovered that independent teams had quite a few identical bugs, so it became much less popular!

Yes, but with scientific research spread all over the world, we can afford to have more teams than any single organization can afford.

And again, I refer people to the Climategate non-scandal. What if it turns out everyone who verified the data was using the same code, or at least derived versions of the same code? Think about the fallout from that. Even if the bugs were mostly identical, do we want to risk being wrong?

2012-04-16 6:24 am

moondevil
But if the code becomes available how can you patent it?

(I’m being sarcastic)
2012-04-16 7:09 am

Neolander
Well, sure, others can take a copy of my source code anytime they want. I wrote it with a focus on readability by future lab members anyway. But… It’s written using proprietary software and a nonstandard programming language, so I can hardly see them doing anything with it without the closed-source software I used, unless they feel highly motivated.

Also, it would take them quite a lot of time to get familiar with the codebase, whereas if they asked me about the core measurement algorithm I could likely explain it on half a page of text, skipping all the annoying details of hardware initialization, GUI code, and so on.

So, I’m not sure that source code would be much more useful than what you find in your typical scientific paper in the end.

Edited 2012-04-16 07:12 UTC

2012-04-17 1:18 am

Alfman
Neolander,

“So, I’m not sure that source code would be much more useful than what you find in your typical scientific paper in the end.”

Well I don’t read many scientific papers these days, but I would think whether or not source code is useful depends on how instrumental the source code is in making the case for the paper’s conclusions.

For example, if the data speaks for itself and doesn’t need complex software processing to be understood, then providing source code is more about convenience than an instrumental part of the paper.

If the data were transformed using atypical algorithms and there is no way to understand it directly without software analysis, then of course other scientists would be at a loss to validate the work unless they actually re-implemented the software from scratch. Even if they do, they might be left to guess about implementation details and not be able to validate the paper directly.

2012-04-16 8:55 am

lucas_maximus
This is a little simplistic.

Knowing the algorithm and having access to the data should be sufficient for replication; having the original code is not necessary.

In fact, being able to reproduce the algorithm yourself, i.e. writing your own code, is a better test than merely inspecting

However this doesn’t scale for larger programs.

2012-04-16 9:02 am

kwan_e

In fact, being able to reproduce the algorithm yourself, i.e. writing your own code, is a better test than merely inspecting

However this doesn’t scale for larger programs.

Science isn’t an economic endeavour. Accuracy should be preferred over scale or efficiency.

2012-04-16 10:48 am

lucas_maximus
No, but if you have a sufficiently large program, it might not be feasible to recreate.

2012-04-16 10:57 am

kwan_e
No, but if you have a sufficiently large program, it might not be feasible to recreate.

You can say that about every sufficiently important scientific study. Yet researchers still must verify the results independently, and they used to do it for complex studies without source code.

2012-04-16 2:51 pm

lucas_maximus
Oh right.

Well, that is interesting from a development perspective, because I know that certain Fortran code is particularly well debugged and I wonder how they recreate the accuracy of some numerical results.

2012-04-16 9:22 am

l3v1
Demanding source code has nothing to do with verification of results.

A lot of researchers provide either sources (mostly proof-of-concept Matlab code, or sometimes regular sources under some license) or executable code along with some publications. To verify results – which is almost never done, mind you – no one requires sources; a simple executable, a binary library, or basic Matlab code would be quite enough for all intents and purposes.

And, surprise surprise, a lot of researchers will provide you with either an executable or a library if you ask for it; others will run their algorithm on your data and give you the results. And yes, there are people who don’t, but that’s their prerogative.

Demanding sources for all the algorithms is much more than any of the above, and in almost all cases you simply couldn’t justify your need for it, besides saying that you want it. Well, not every day is Christmas.

What’s the reason behind it? There are several. Research and creation of proof-of-concept code for an article is _not_ software development, and it shouldn’t be – unless of course the article deals with software development. Also, producing such code is often the result of a lot of blood and sweat, and sometimes one requires more reason to share sources than kindness of the heart (still, it happens from time to time). Also, some (a lot of) researchers can’t afford to patent results, so keeping the sources is a fairly good way to make thieves’ work harder.

Not everything is black and white here, and most people can’t see that. And the title saying “Closed source threatens science” is just plain untrue. I mean come on, it threatens it now? Didn’t it threaten it for decades? What is the exact nature of that threat (because the simple unavailability of sources is simply not a threat)?

Also, about:

If I knew there was a publication requirement for my code, I probably would have done things like comment it better, kept better track of it, and generally put a bit more thought and effort into my code

…yes, and that’s exactly what I don’t want to do in a lot of cases. More often than not, the idea is far more important than the code that provides a method to test it. Implementation (yes, the code) is not the science (unless we’re in algorithm design and software implementation science); an implementation is just one way of realizing the presented idea.

Also, another important issue: sometimes the software implementation of a presented method, algorithm or idea is simply not public, is restricted, or can even be confidential – in such cases, demanding publication of the sources is simply not an option. And you’d be willing to send away good papers because of that? Right, good luck being a high impact journal.

And if an idea is so good, and there’s real marketability in it, then a lot of times it evolves into a distributable software and everyone can get their hands on it.

2012-04-16 10:54 am

kwan_e
Demanding source code has nothing to do with verification of results.

Really? ABSOLUTELY nothing?

2012-04-16 1:43 pm

tidux
Yes, absolutely all source code relevant to published science should be freely available. Otherwise, how do we know there isn’t some magic constant hidden in there compensating for systemic error? How do we even know the data’s being analyzed at all? If research that relies on computation doesn’t provide source code, it’s basically unreproducible over the long term, since no OS or hardware platform lasts forever.

2012-04-16 2:53 pm

lucas_maximus
The algorithm and any formulas used will be part of the paper. Someone with significant understanding can reproduce and verify the results.

I don’t think it is too different than working against a well written software specification.

2012-04-16 8:30 pm

tidux
Without the source code, how do you know that’s the formula they used? For all you know, their RNGs could have always returned 3.

2012-04-16 10:53 pm

kwan_e
Without the source code, how do you know that’s the formula they used? For all you know, their RNGs could have always returned 3.

Because they must publish papers describing their research and results. Then other people can reimplement parts of the paper and if the results don’t match what the paper has, then you know something is wrong.

If analyzing source code can really prove or disprove scientific results, we would already have this whole software business completely automated with software that writes itself.

2012-04-17 2:12 am

Alfman
kwan_e,

“If analyzing source code can really prove or disprove scientific results,”

Why should we assume that it cannot? If we have an accurate mathematical model of a scientific process, then software can reveal insight and even proofs for scientific processes that wouldn’t otherwise be conceived. However, in order for an implementation’s results to be valid, the authors’ source code ought to be peer reviewed.

It’s true that one might get away with publishing pseudo code or even logic charts instead of real code. But if the goal is to reduce errors and encourage 3rd parties to validate the work, the inclusion of real code is nice.

“we would already have this whole software business completely automated with software that writes itself.”

At an abstract level, I’d argue that’s what a compiler does already: it takes a specification written in one language and builds binary software implementing that specification. I guess you might have been thinking of compiling software from an English spec, but there are at least two obstacles with that.

1 – English is ambiguous and it takes a lot of words to exactly describe a process that computer languages can nail down exactly. A thought experiment would be to stick 10 qualified programmers in 10 black boxes, give them an English spec, and see how different their work is. Now if we give those same programmers specs written in one computer language, and force them to translate them into another computer language, with any luck they’d still end up with functionally identical work because computer languages are so concise. (We should play the “telephone game” with computer languages!)

2 – English speakers have shared knowledge due to shared human experiences and contextual knowledge. This enables them to gloss over a great deal of information that would otherwise lead to holes in an English spec.

Sorry my response has gone way off topic, but the point is that the lack of English-language compilers is no reason to negate the scientific merit of computer source code proofs.
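
A small Python sketch can illustrate the ambiguity described in the two points above. The spec "round each value to the nearest integer" does not say what to do with ties, so two reasonable readings give different totals. The sample values and the helper name `round_half_up` are made up for illustration.

```python
import math


def round_half_up(x):
    """Reading A: ties go upward (valid here for non-negative inputs)."""
    return math.floor(x + 0.5)


values = [0.5, 1.5, 2.5]

# Reading B: Python's built-in round() sends ties to the nearest even integer.
by_builtin = [round(v) for v in values]
by_half_up = [round_half_up(v) for v in values]

print(by_builtin, sum(by_builtin))   # [0, 2, 2] 4
print(by_half_up, sum(by_half_up))   # [1, 2, 3] 6
```

Both programmers followed the English sentence faithfully; only the source code reveals which reading each one took.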
2012-04-17 7:40 am

lucas_maximus
Do you know what sort of effort is required to prove that a computer program is mathematically correct?

We are going into Vienna Development Method territory. Which is totally irrelevant anyway.

The whole point is that the information in the paper should be sufficient to reproduce the results, … you are putting the cart before the horse.

If the formulae and relevant parts of the algorithm are described (pseudo-code is pretty good IMHO, and I write my proper code first as pseudo-code), that is more than sufficient.

2012-04-16 4:56 pm

jburnett
My academic background is computer science. I didn’t publish any papers that had algorithms so complex they could not be reimplemented. Some of the visualization and driver code was a pain, but the core algorithm I was describing was pretty tight and neat. All of computer science was that way; you almost never see a paper with an algorithm that is difficult to reproduce (aside from the really complicated math). When you do see such a paper, it is generally a hardware-specific way of doing something where the specific hardware was annoying to program. But I always thought of those as more marketing than science.

So, what is so complex in biology, chemistry, physics, psychology, etc… that it cannot be reimplemented fairly easily when given the core algorithm?

2012-04-16 5:10 pm

Neolander
So, what is so complex in biology, chemistry, physics, psychology, etc… that it cannot be reimplemented fairly easily when given the core algorithm?

Speaking for physics, not much… Except that the people who do numerical simulations may like to use open-source highly optimized math libraries, because those take a lot of time to mature.

I have heard that CERN scientists still use Fortran a lot, simply because they have gigabytes of highly optimized Fortran code around and can’t be bothered to rewrite it in something more modern like C.
2012-04-17 2:28 am

Alfman
jburnett,

“So, what is so complex in biology, chemistry, physics, psychology, etc… that it cannot be reimplemented fairly easily when given the core algorithm?”

As long as it’s described well enough (and perhaps even if it’s not) then anything can be reimplemented – that is not a point of contention.

I just think it’s more work to do so, that’s all. If the scientific community has been doing it without source code all these years, then maybe it’s not such a big deal. The main issue I have is when someone reads the paper thinking “gee, I would like to play with the numbers myself but I don’t have the time/skill to write my own software from scratch”.

2012-04-16 8:45 pm

Yamin
Academia tends to have formal specifications for formulae. They express things in mathematical formulas or specialized scientific notation.

To most scientists, software is just a tool. They’re really not interested in the program. Chances are it is messy, sporadic… they might even be embarrassed to release it.

So the output is the formula… that is what they release.

Others have mentioned this, but I personally think it increases scientific accuracy if they don’t release their test software. It would be far too easy to copy that code or just run their program to validate the formula.

Let’s remember the output is the scientific formula… not the program itself.

It is actually better if someone else codes their own test system and validates it.

It’s kind of like buying your measuring equipment from a different source than your vendor. Would you buy a Cisco router tester from Cisco? Nope, you’d want something independent from a company like Ixia or something.

As a side note, I happen to think this line should blur. That is to say, it should be possible to have the scientist write their formula in whatever academic language they want and tools should be able to output that into a library for a common language or even their own compilers.

Software / formal academic specifications do the same thing. They express algorithms….
2012-04-17 1:19 pm

jburnett
From reading the comments since my last post it seems there are two main arguments for requiring the release of source code.

1. Authors do not fully disclose their method, making their work unreproducible. I thought this was the whole point of peer review. Journal editors are supposed to review the work to make sure that, at a minimum, it could be reproduced, even if they don’t reproduce the work themselves. Otherwise you have a SCIgen http://pdos.csail.mit.edu/scigen/ situation. (sorry, I don’t know how to embed links in comments)

2. Scientists have gotten so lazy or incompetent that they are unable to reproduce results when given everything that a skilled scientist requires. Having gone to public school in the US I can believe this.

Fortunately, science is mostly merit based, so these people will quickly fall out of the ranks and go on to stimulating careers writing about science and politics, where they won’t take up any more valuable peer review time.

2012-04-17 3:18 pm

Alfman
jburnett,

I agree on your first point.

“2. Scientists have gotten so lazy or incompetent that they are unable to reproduce results when given everything that a skilled scientist requires. Having gone to public school in the US I can believe this.”

I have to take issue with your second, though: not everyone who qualifies as a competent scientist is a competent programmer. And even those who are might be too busy to spend their time reimplementing others’ work.

So including source code has some merit, but the bigger question is to what degree. The overall consensus here is that it’s not terribly important, and that’s fine.