Linked by Thom Holwerda on Mon 3rd Dec 2012 22:51 UTC
General Unix "Few tools are more indispensable to my work than Unix. Manipulating data into different formats, performing transformations, and conducting exploratory data analysis (EDA) is the lingua franca of data science.1 The coffers of Unix hold many simple tools, which by themselves are powerful, but when chained together facilitate complex data manipulations. Unix's use of functional composition eliminates much of the tedious boilerplate of I/O and text parsing found in scripting languages. This design creates a simple and succinct interface for manipulating data and a foundation upon which custom tools can be built. Although languages like R and Python are invaluable for data analysis, I find Unix to be superior in many scenarios for quick and simple data cleaning, idea prototyping, and understanding data. This post is about how I use Unix for EDA."
Thread beginning with comment 544046
RE: But ....
by Laurence on Tue 4th Dec 2012 00:57 UTC in reply to "But ...."
Laurence
Member since:
2007-03-26

I'm sympathetic to his viewpoint, but the Unix shell is a bit of a stretch unless your data entries are one line each. It's easy for shell scripts to become spaghetti code if you have data entries that are more complex or nested; in these cases you want a programming language that is more verbose, but also more readable. And if you have huge data sets, you need to think carefully about memory management.

That is where Perl and Python (depending on your camp) come into play. However, the point of those tutorials is to show that basic EDA can be done from the command line.

Though stuff like that comes into its own if you're a sysadmin and need to quickly churn through log files. I love working in a CLI, but even I wouldn't advocate using Bash et al for analysing large, complex datasets.
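[A minimal sketch of the kind of log churning meant here: counting the most frequent client IPs in an access-log-style stream. The sample lines are invented for illustration; on a real system you would read the actual log file instead of the inlined `printf`.]

```shell
# Hypothetical access-log lines, inlined so the example is self-contained;
# normally you'd replace the printf with e.g. `cat access.log`.
printf '%s\n' \
  '10.0.0.1 - - [04/Dec/2012] "GET / HTTP/1.1" 200' \
  '10.0.0.2 - - [04/Dec/2012] "GET /a HTTP/1.1" 404' \
  '10.0.0.1 - - [04/Dec/2012] "GET /b HTTP/1.1" 200' |
  awk '{ print $1 }' |   # field 1 is the client IP
  sort | uniq -c |       # collapse duplicates and count them
  sort -rn | head -5     # top 5 by count
```

Because each record is one line, every stage of the pipeline composes without any parsing boilerplate.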

Edited 2012-12-04 00:57 UTC

Reply Parent Score: 5

RE[2]: But ....
by Hypnos on Tue 4th Dec 2012 02:19 in reply to "RE: But ...."
Hypnos
Member since:
2008-11-19

To be fair, Unix log files are actually designed in a way that makes it easy to analyze them with awk or perl -- they are structured line-by-line. It's a wonderfully sensible convention set to be discarded by the systemd folks ;)
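[A sketch of how that line-by-line convention pays off: counting syslog entries per program with a few lines of awk. The sample entries are made up; on a real box you'd read /var/log/syslog or similar rather than the inlined `printf`.]

```shell
# Hypothetical syslog lines, inlined so the example runs anywhere.
# Field 5 is "program[pid]:"; splitting on "[" recovers the program name.
printf '%s\n' \
  'Dec  4 02:19:01 host sshd[123]: Accepted publickey for root' \
  'Dec  4 02:19:05 host cron[456]: (root) CMD (run-parts)' \
  'Dec  4 02:20:11 host sshd[124]: Failed password for admin' |
  awk '{ split($5, a, "["); counts[a[1]]++ }
       END { for (p in counts) print counts[p], p }' |
  sort -rn
```

No binary journal reader needed; the record structure is the newline.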

Regarding Python, SciPy is a bunch of high-performance C(++) and Fortran libraries glued together by Python+NumPy. It's really becoming a viable substitute for IDL.

Reply Parent Score: 5

RE[3]: But ....
by Soulbender on Tue 4th Dec 2012 03:21 in reply to "RE[2]: But ...."
Soulbender
Member since:
2005-08-18

It's a wonderfully sensible convention set to be discarded by the systemd folks


...what? Systemd will change how logging is done? Please tell me that's not the case.

Reply Parent Score: 2