Linked by Thom Holwerda on Mon 16th Apr 2012 02:08 UTC
Permalink for comment 514319



The alternative is to have hundreds of research projects derived from hundreds of different codebases, each with its own bag of logic bugs. All code is inherently buggy, and scientists, who often lack basic training in software engineering (e.g. code reuse, unit testing, etc.), tend to write buggier code.
Last year, I had to do some data deduplication work using string metrics such as Jaro–Winkler distance. I found three open source libraries, one in Java and two in Python. All three implemented the formula differently, resulting in significantly different scores. The good thing is that, because the code is open, I could submit patches to the maintainers. Some bugs got fixed, some did not (but the bug reports are publicly available nevertheless).
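To show where implementations can drift apart, here is a minimal sketch of the commonly cited textbook form of Jaro–Winkler similarity. It is not taken from any of the three libraries mentioned above; the function names, the 0.1 prefix scale, the 4-character prefix cap and the 0.7 boost threshold are assumptions based on the usual description of the metric. The comments mark the spots where implementations tend to disagree (match window size, how transpositions are halved, when the prefix boost applies).

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and no further apart than
    # this window. Off-by-one differences here change the score.
    window = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Compare the matched characters in order; each out-of-order pair
    # counts as half a transposition. Assumption: floor the half, which
    # some implementations do not.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1,
                 max_prefix: int = 4,
                 boost_threshold: float = 0.7) -> float:
    """Jaro similarity boosted for strings sharing a common prefix."""
    score = jaro(s1, s2)
    # Assumption: only boost scores above the threshold; some
    # implementations boost unconditionally.
    if score <= boost_threshold:
        return score

    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


if __name__ == "__main__":
    for a, b in [("MARTHA", "MARHTA"), ("DIXON", "DICKSONX")]:
        print(a, b, round(jaro(a, b), 4), round(jaro_winkler(a, b), 4))
```

Changing any one of the marked choices produces a different number for the same pair of strings, which is the kind of discrepancy that only becomes visible when the code can be read side by side.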
One of these libraries, Febrl (Freely Extensible Biomedical Record Linkage), was released by the Australian National University as part of its research. I owe a great deal to the authors, in particular for their willingness to put the code out for scrutiny.