Linked by Thom Holwerda on Mon 3rd Dec 2012 22:51 UTC
General Unix "Few tools are more indispensable to my work than Unix. Manipulating data into different formats, performing transformations, and conducting exploratory data analysis (EDA) is the lingua franca of data science.1 The coffers of Unix hold many simple tools, which by themselves are powerful, but when chained together facilitate complex data manipulations. Unix's use of functional composition eliminates much of the tedious boilerplate of I/0 and text parsing found in scripting languages. This design creates a simple and succinct interface for manipulating data and a foundation upon which custom tools can be built. Although languages like R and Python are invaluable for data analysis, I find Unix to be superior in many scenarios for quick and simple data cleaning, idea prototyping, and understanding data. This post is about how I use Unix for EDA."
Thread beginning with comment 544043
But ....
by Hypnos on Tue 4th Dec 2012 00:02 UTC
Hypnos
Member since:
2008-11-19

I'm sympathetic to his viewpoint, but the Unix shell is a bit of a stretch unless your data entries are one line each. It's easy for shell scripts to become spaghetti code if you have data entries that are more complex or nested; in these cases you want a programming language that is more verbose, but also more readable. And if you have huge data sets, you need to think carefully about memory management.

In particle physics, where experimenters deal in petabytes of data often structured in nested trees, ROOT has become the standard. It takes the philosophy of early optimization and runs with it ;) Pure C++, with RTTI and an interpreter tacked on.

In astronomy, where the data is more array-like, people use IRAF and IDL (or its quasi-clones, like SciPy).

I'd be curious to learn how the biologists deal with their gobs of data.

Reply Score: 6

RE: But ....
by Laurence on Tue 4th Dec 2012 00:57 in reply to "But ...."
Laurence Member since:
2007-03-26

I'm sympathetic to his viewpoint, but the Unix shell is a bit of a stretch unless your data entries are one line each. It's easy for shell scripts to become spaghetti code if you have data entries that are more complex or nested; in these cases you want a programming language that is more verbose, but also more readable. And if you have huge data sets, you need to think carefully about memory management.

That is where Perl and Python (depending on your camp) come into play. However, the point of those tutorials is to show that basic EDA can be done from the command line.

Though stuff like that comes into its own if you're a sysadmin and need to quickly churn through log files. I love working in a CLI, but even I wouldn't advocate using Bash et al for analysing large, complex datasets.

Edited 2012-12-04 00:57 UTC

Reply Parent Score: 5

RE[2]: But ....
by Hypnos on Tue 4th Dec 2012 02:19 in reply to "RE: But ...."
Hypnos Member since:
2008-11-19

To be fair, Unix log files are actually designed in a way that makes it easy to analyze them with awk or perl -- they are structured line-by-line. It's a wonderfully sensible convention, set to be discarded by the systemd folks ;)
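For example — a rough sketch assuming a syslog-style layout where the program tag is the fifth whitespace-separated field:

  # count log lines per program and list the chattiest first
  awk '{ counts[$5]++ } END { for (p in counts) print counts[p], p }' /var/log/syslog | sort -rn | head

Because every record is one line, there is hardly any parsing step at all; awk's field splitting does the work.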

Regarding Python, SciPy is a bunch of high-performance C(++) and Fortran libraries glued together by Python+NumPy. It's really becoming a viable substitute for IDL.

Reply Parent Score: 5

RE: But ....
by Soulbender on Tue 4th Dec 2012 01:59 in reply to "But ...."
Soulbender Member since:
2005-08-18

It's easy for shell scripts to become spaghetti code if you have data entries that are more complex or nested;


My favorite example of shell scripts gone off the rails: GNU autotools.

Reply Parent Score: 3

RE[2]: But ....
by Hypnos on Tue 4th Dec 2012 02:15 in reply to "RE: But ...."
Hypnos Member since:
2008-11-19

I'd like to learn of a use case where the best choice of programming language is actually m4, and not because you already have a pile of bash scripts duct taped together.

Reply Parent Score: 3

RE: But ....
by dnebdal on Thu 6th Dec 2012 15:56 in reply to "But ...."
dnebdal Member since:
2008-08-27

I'd be curious to learn how the biologists deal with their gobs of data.


It really depends on the lab and the people there. Where I work, we mostly deal with array data, which is large but not insane: many GB, but not (yet) the TB-filling unpleasantness of large-scale sequencing. We mostly do things in R, with some of the rougher filtering done in shell script and simple command-line tools (I ended up writing a fast matrix transposer for tab-separated text files just because R was taking hours while it could be done in minutes with a bit of C).
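The C tool itself isn't shown here, but roughly the same transformation can be sketched in awk, assuming the whole table fits in memory (the file names are made up):

  # transpose a tab-separated table; everything is held in memory,
  # which is exactly why a small C program pays off on multi-GB files
  awk -F'\t' -v OFS='\t' '
    { for (i = 1; i <= NF; i++) cell[i, NR] = $i; if (NF > maxf) maxf = NF }
    END {
      for (i = 1; i <= maxf; i++) {
        row = cell[i, 1]
        for (j = 2; j <= NR; j++) row = row OFS cell[i, j]
        print row
      }
    }' wide.tsv > tall.tsv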

Going by the tools we run into from other labs, it's roughly what you'd expect from a bunch of biologists allied with whatever IT people they've roped in: some places really love Perl, some write exclusively in Java, a few tools are C++, and there's always the odd diva company that only really supports a GUI tool for Windows if you want to interpret their raw files. There's some Python, too, though it seems to be a bit less popular than in comparable fields.

There is an R package for everything, though. The language itself is a bit weird, which is not surprising given its long and meandering development history, and it's the first language where I've looked at the OO system and decided I'd be better off ignoring it. (If you thought stereotypical OO-fetishist Java was obfuscated, you've never looked at an R S3 class.) Still, it's by far the best language I've used for dealing with tabular data.

Edited 2012-12-06 16:01 UTC

Reply Parent Score: 1

RE[2]: But ....
by Hypnos on Thu 6th Dec 2012 16:10 in reply to "RE: But ...."
Hypnos Member since:
2008-11-19

Thanks for the summary!

If you have tabular data, it's easy to ingest with numpy into C arrays that you can work on with SciPy libraries. This is what many astronomers do, and their data sets are at least as large.

Reply Parent Score: 2