Linked by Thom Holwerda on Mon 3rd Dec 2012 22:51 UTC
General Unix "Few tools are more indispensable to my work than Unix. Manipulating data into different formats, performing transformations, and conducting exploratory data analysis (EDA) is the lingua franca of data science. The coffers of Unix hold many simple tools, which by themselves are powerful, but when chained together facilitate complex data manipulations. Unix's use of functional composition eliminates much of the tedious boilerplate of I/O and text parsing found in scripting languages. This design creates a simple and succinct interface for manipulating data and a foundation upon which custom tools can be built. Although languages like R and Python are invaluable for data analysis, I find Unix to be superior in many scenarios for quick and simple data cleaning, idea prototyping, and understanding data. This post is about how I use Unix for EDA."
Thread beginning with comment 544421
To view parent comment, click here.
To read all comments associated with this story, please click here.
RE: But ....
by dnebdal on Thu 6th Dec 2012 15:56 UTC in reply to "But ...."
dnebdal
Member since:
2008-08-27

"I'd be curious to learn how the biologists deal with their gobs of data."

It really depends on the lab and the people there. Where I work, we mostly deal with array data, which is large but not insane: many GB, but not (yet) the TB-filling unpleasantness of large-scale sequencing. We mostly do things in R, with some of the rougher filtering done in shell scripts and simple command-line tools (I ended up writing a fast matrix transposer for tab-separated text files, just because R was taking hours for something that could be done in minutes with a bit of C).
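The commenter's C transposer isn't shown anywhere in the thread; as a rough illustration of the same idea, here is a minimal in-memory sketch in Python (the function name is made up, and a real tool for many-GB files would have to work in blocks rather than load the whole table at once):

```python
import csv

def transpose_tsv(in_path, out_path):
    # Read every row of a tab-separated file, then write the table out
    # with rows and columns swapped. zip(*rows) pairs up the i-th field
    # of each row, which is exactly the transposed layout.
    with open(in_path, newline="") as f:
        rows = list(csv.reader(f, delimiter="\t"))
    with open(out_path, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(zip(*rows))
```

The in-memory approach is what makes naive solutions slow or infeasible on large files; a C implementation that reads and writes in cache-friendly blocks is where the hours-to-minutes speedup comes from.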

Going by the tools we run into from other labs, it's roughly what you'd expect from a bunch of biologists allied with whatever IT people they've roped in: some places really love Perl, some write exclusively in Java, a few tools are C++, and there's always the odd diva company that only really supports a GUI tool for Windows if you want to interpret their raw files. There's some Python too, though it seems a bit less popular than in comparable fields.

There is an R package for everything, though. The language itself is a bit weird, which is not surprising given its long and meandering development history, and it's the first language where I've looked at the OO system and decided I'd be better off ignoring it. (If you thought stereotypical OO-fetishist Java was obfuscated, you've never looked at an R S3 class.) Still, it's by far the best language I've used for dealing with tabular data.

Edited 2012-12-06 16:01 UTC

Reply Parent Score: 1

RE[2]: But ....
by Hypnos on Thu 6th Dec 2012 16:10 in reply to "RE: But ...."
Hypnos Member since:
2008-11-19

Thanks for the summary!

If you have tabular data, it's easy to ingest it with NumPy into C arrays that you can then work on with the SciPy libraries. This is what many astronomers do, and their data sets are at least as large.
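A minimal sketch of what that ingestion looks like, assuming a tab-separated table with a header row (the column names and values here are invented for illustration):

```python
import io
import numpy as np

# Stand-in for a file on disk; a real pipeline would pass a path instead.
tsv = io.StringIO("time\tflux\n0.0\t1.20\n1.0\t1.35\n2.0\t1.10\n")

# genfromtxt parses the text into a NumPy array backed by a contiguous
# C buffer; names=True turns the header row into named fields.
data = np.genfromtxt(tsv, delimiter="\t", names=True)

# Each field is a plain float64 array, ready to hand to SciPy routines.
flux = data["flux"]
mean_flux = flux.mean()
```

From here the same arrays can be fed directly to SciPy (fitting, filtering, statistics) without any further conversion.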

Reply Parent Score: 2