
"Few tools are more indispensable to my work than Unix. Manipulating data into different formats, performing transformations, and conducting exploratory data analysis (EDA) is the lingua franca of data science. The coffers of Unix hold many simple tools, which by themselves are powerful, but when chained together facilitate complex data manipulations. Unix's use of functional composition eliminates much of the tedious boilerplate of I/O and text parsing found in scripting languages. This design creates a simple and succinct interface for manipulating data and a foundation upon which custom tools can be built. Although languages like R and Python are invaluable for data analysis, I find Unix to be superior in many scenarios for quick and simple data cleaning, idea prototyping, and understanding data. This post is about how I use Unix for EDA."
I'm sympathetic to his viewpoint, but the Unix shell is a bit of a stretch unless your data entries are one line each. Shell scripts easily turn into spaghetti code once the entries are more complex or nested; in those cases you want a programming language that is more verbose but also more readable. And if you have huge data sets, you need to think carefully about memory management.
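To make that concrete, here's a minimal Python sketch of what I mean, not anything from the linked post. It assumes the data arrives as one JSON object per line, each with a nested list of hits; the field names (`event`, `hits`, `detector`) and the sample records are made up for illustration. The generator reads one record at a time, so memory use stays flat no matter how big the file is.

```python
import io
import json
from collections import Counter


def iter_records(lines):
    """Yield one parsed record per non-empty line, one at a time."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_hits_by_detector(records):
    """Tally the nested 'hits' entries by their (hypothetical) 'detector' field."""
    counts = Counter()
    for record in records:
        for hit in record.get("hits", []):
            counts[hit["detector"]] += 1
    return counts


# Made-up sample data standing in for a large file on disk.
SAMPLE = io.StringIO(
    '{"event": 1, "hits": [{"detector": "A"}, {"detector": "B"}]}\n'
    '{"event": 2, "hits": [{"detector": "A"}]}\n'
    '{"event": 3, "hits": []}\n'
)

counts = count_hits_by_detector(iter_records(SAMPLE))
print(dict(counts))  # {'A': 2, 'B': 1}
```

For a real file you'd pass an open file handle instead of `SAMPLE`; the loop over the nested list is the part that gets ugly fast in a shell pipeline.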
In particle physics, where experimenters deal in petabytes of data often structured in nested trees, ROOT has become the standard. It takes the philosophy of early optimization and runs with it: pure C++, with RTTI and an interpreter tacked on.
In astronomy, where the data is more array-like, people use IRAF and IDL (or its quasi-clones, like SciPy).
I'd be curious to learn how the biologists deal with their gobs of data.