Linked by Thom Holwerda on Mon 3rd Dec 2012 22:51 UTC
General Unix "Few tools are more indispensable to my work than Unix. Manipulating data into different formats, performing transformations, and conducting exploratory data analysis (EDA) is the lingua franca of data science. The coffers of Unix hold many simple tools, which by themselves are powerful, but when chained together facilitate complex data manipulations. Unix's use of functional composition eliminates much of the tedious boilerplate of I/O and text parsing found in scripting languages. This design creates a simple and succinct interface for manipulating data and a foundation upon which custom tools can be built. Although languages like R and Python are invaluable for data analysis, I find Unix to be superior in many scenarios for quick and simple data cleaning, idea prototyping, and understanding data. This post is about how I use Unix for EDA."
But ....
by Hypnos on Tue 4th Dec 2012 00:02 UTC
Hypnos Member since:
2008-11-19

I'm sympathetic to his viewpoint, but the Unix shell is a bit of a stretch unless your data entries are one line each. It's easy for shell scripts to become spaghetti code if you have data entries that are more complex or nested; in these cases you want a programming language that is more verbose, but also more readable. And if you have huge data sets, you need to think carefully about memory management.

In particle physics, where experimenters deal in petabytes of data often structured in nested trees, ROOT has become the standard. It takes the philosophy of early optimization and runs with it ;) Pure C++, with an RTTI and an interpreter tacked on.

In astronomy, where the data is more array-like, people use IRAF and IDL (or its quasi-clones, like SciPy).

I'd be curious to learn how the biologists deal with their gobs of data.

Reply Score: 6

RE: But ....
by Laurence on Tue 4th Dec 2012 00:57 UTC in reply to "But ...."
Laurence Member since:
2007-03-26

I'm sympathetic to his viewpoint, but the Unix shell is a bit of a stretch unless your data entries are one line each. It's easy for shell scripts to become spaghetti code if you have data entries that are more complex or nested; in these cases you want a programming language that is more verbose, but also more readable. And if you have huge data sets, you need to think carefully about memory management.

That is where Perl and Python (depending on your camp) come into play. However, the point of those tutorials is to show that basic EDA can be done from the command line.

Though stuff like that comes into its own if you're a sysadmin and need to quickly churn through log files. I love working in a CLI, but even I wouldn't advocate using Bash et al. for analysing large, complex datasets.

Edited 2012-12-04 00:57 UTC

Reply Score: 5

RE[2]: But ....
by Hypnos on Tue 4th Dec 2012 02:19 UTC in reply to "RE: But ...."
Hypnos Member since:
2008-11-19

To be fair, Unix log files are actually designed in a way that makes it easy to analyze them with awk or perl -- they are structured line-by-line. It's a wonderfully sensible convention set to be discarded by the systemd folks ;)
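To make the line-by-line point concrete, here is a minimal Python sketch of the kind of one-pass analysis this convention allows. It assumes the traditional "timestamp host program[pid]: message" layout; the sample lines, the regular expression, and the `count_by_program` name are made up for illustration and are not taken from any real system.

```python
import re
from collections import Counter

# Assumed traditional syslog layout:
#   "Mon DD HH:MM:SS host program[pid]: message"
# Real log formats vary, so treat this pattern as an illustration only.
SYSLOG_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2}\s[\d:]{8})\s"
    r"(?P<host>\S+)\s"
    r"(?P<program>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)$"
)


def count_by_program(lines):
    """Count log entries per program; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_LINE.match(line)
        if match:
            counts[match.group("program")] += 1
    return counts


# Invented sample entries, one record per line.
SAMPLE_LOG = [
    "Dec  4 02:19:01 box sshd[812]: Accepted publickey for alice",
    "Dec  4 02:19:07 box cron[901]: (root) CMD (run-parts /etc/cron.hourly)",
    "Dec  4 02:20:13 box sshd[815]: Failed password for bob",
    "Dec  4 02:21:00 box kernel: eth0: link up",
]

if __name__ == "__main__":
    for program, n in count_by_program(SAMPLE_LOG).most_common():
        print(program, n)
```

Because every record is exactly one line, the whole job is a single loop with no parser state, which is what makes awk, perl, or a few lines of Python enough.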

Regarding Python, SciPy is a bunch of high-performance C(++) and Fortran libraries glued together by Python+NumPy. It's really becoming a viable substitute for IDL.

Reply Score: 5

RE[3]: But ....
by Soulbender on Tue 4th Dec 2012 03:21 UTC in reply to "RE[2]: But ...."
Soulbender Member since:
2005-08-18

It's a wonderfully sensible convention set to be discarded by the systemd folks


...what? Systemd will change how logging is done? Please tell me that's not the case.

Reply Score: 2

RE[4]: But ....
by Hypnos on Tue 4th Dec 2012 03:46 UTC in reply to "RE[3]: But ...."
Hypnos Member since:
2008-11-19

Ayyup:

https://docs.google.com/document/pub?id=1IC9yOXj7j6cdLLxWEBAGRL6wl97...

To be fair, they let you run your old logger in parallel -- all you have to do is change your tried-and-true tools to use their new super-duper interface:

http://www.freedesktop.org/wiki/Software/systemd/syslog

Are they not merciful?

Reply Score: 6

RE[5]: But ....
by Delgarde on Tue 4th Dec 2012 05:00 UTC in reply to "RE[4]: But ...."
Delgarde Member since:
2008-08-19

More than that - not only are they not stopping you from running a traditional logger if you want one, they're providing fancy tools designed specifically for running complex queries over the binary log format.

In short, they're actually making it easier to parse logging data.
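As a rough illustration of what querying structured entries looks like, here is a small Python sketch. It assumes the entries have been exported as one JSON object per line, which is the shape `journalctl -o json` emits; `_SYSTEMD_UNIT`, `PRIORITY` and `MESSAGE` are standard journal field names, but the sample records and the `select_entries` helper are invented for this example and are not part of systemd.

```python
import json

# Invented sample records, one JSON object per line. Journal JSON output
# carries field values as strings, so PRIORITY is "4" rather than 4.
SAMPLE_JOURNAL = """\
{"_SYSTEMD_UNIT": "sshd.service", "PRIORITY": "6", "MESSAGE": "Accepted publickey for alice"}
{"_SYSTEMD_UNIT": "sshd.service", "PRIORITY": "4", "MESSAGE": "Failed password for bob"}
{"_SYSTEMD_UNIT": "cron.service", "PRIORITY": "6", "MESSAGE": "(root) CMD (run-parts /etc/cron.hourly)"}
"""


def select_entries(lines, unit, max_priority):
    """Return messages from `unit` whose priority is at least as severe
    as `max_priority` (syslog convention: lower number = more severe)."""
    selected = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("_SYSTEMD_UNIT") != unit:
            continue
        # Entries without a PRIORITY field are treated as "info" (6) here;
        # that default is an assumption of this sketch.
        if int(entry.get("PRIORITY", 6)) <= max_priority:
            selected.append(entry.get("MESSAGE", ""))
    return selected


if __name__ == "__main__":
    for message in select_entries(SAMPLE_JOURNAL.splitlines(), "sshd.service", 4):
        print(message)
```

The filter works on named fields instead of on positions within a free-form line, so it keeps working when the message wording changes.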

Reply Score: 5

RE[6]: But ....
by Hypnos on Tue 4th Dec 2012 05:07 UTC in reply to "RE[5]: But ...."
Hypnos Member since:
2008-11-19

I can see how this is useful for corporate deployments with many, many machines. But it's overkill to include by default for 99% of Linux users. It should at most be a plugin.

Reply Score: 3

RE[7]: But ....
by Delgarde on Tue 4th Dec 2012 20:31 UTC in reply to "RE[6]: But ...."
Delgarde Member since:
2008-08-19

I can see how this is useful for corporate deployments with many, many machines. But it's overkill to include by default for 99% of Linux users. It should at most be a plugin.


Why not the other way around? The systemd binary logging approach by default, and the ability to install a traditional text logger (i.e., a plugin) for those who want it?

Reply Score: 3

RE[5]: But ....
by Soulbender on Tue 4th Dec 2012 05:13 UTC in reply to "RE[4]: But ...."
Soulbender Member since:
2005-08-18

Ok, not as horrible as I initially thought. syslog is a pretty shitty logging system, that I can agree with.
I would have liked to see the example use more structured data though and not a freeform "user blah logged in" message.
Don't see why this should be a feature of systemd though and not a standalone system. The arguments for forcing systemd on us are pretty bogus. Tightly integrated? Yeah, right. Good thing we don't have message passing technologies these days or something.
Also, what's with being so defensive about UUIDs? (I don't mind them, it's just fascinating how big a deal it seems to be.)

As a big fan of Upstart (and daemontools/runit/etc) I think it's about time the abomination known as SysV init is abandoned (along with runlevels) and in that respect Systemd is a step forward. I kinda wish it didn't try to weasel in everywhere though. (A GNOME dependency? WTF?)

Reply Score: 4

RE[6]: But ....
by Hypnos on Tue 4th Dec 2012 05:18 UTC in reply to "RE[5]: But ...."
Hypnos Member since:
2008-11-19

I've been pretty happy with OpenRC on Gentoo, though it does depend on /sbin/init:

https://en.wikipedia.org/wiki/OpenRC

Reply Score: 2

RE[6]: But ....
by gan17 on Tue 4th Dec 2012 11:37 UTC in reply to "RE[5]: But ...."
gan17 Member since:
2008-06-03

Don't see why this should be a feature of systemd though and not a standalone system

Because it's made by Poettering. Everything he makes wants to take over your system and eat your brains.

Reply Score: 5

RE[7]: But ....
by zima on Sat 8th Dec 2012 11:38 UTC in reply to "RE[6]: But ...."
zima Member since:
2005-07-06

Because it's made by Poettering. Everything he makes wants to take over your system and eat your brains.

Seems almost fitting... https://en.wikipedia.org/wiki/Poettering ;p

Reply Score: 2

RE: But ....
by Soulbender on Tue 4th Dec 2012 01:59 UTC in reply to "But ...."
Soulbender Member since:
2005-08-18

It's easy for shell scripts to become spaghetti code if you have data entries that are more complex or nested;


My favorite example of shell scripts gone off-the-rails: GNU autotools.

Reply Score: 3

RE[2]: But ....
by Hypnos on Tue 4th Dec 2012 02:15 UTC in reply to "RE: But ...."
Hypnos Member since:
2008-11-19

I'd like to learn of a use case where the best choice of programming language is actually m4, and not because you already have a pile of bash scripts duct taped together.

Reply Score: 3

RE[3]: But ....
by Soulbender on Tue 4th Dec 2012 03:20 UTC in reply to "RE[2]: But ...."
Soulbender Member since:
2005-08-18

I'd like to learn of a use case where the best choice of programming language is actually m4


Sendmail?
Nah, just kidding. Sendmail's use of m4 is also awful.

Reply Score: 2

RE[4]: But ....
by moondevil on Tue 4th Dec 2012 12:35 UTC in reply to "RE[3]: But ...."
moondevil Member since:
2005-07-08

Don't be mean to Sendmail, after all it allowed some companies to live mainly from Sendmail configuration projects. ;)

Reply Score: 2

RE[3]: But ....
by Neolander on Tue 4th Dec 2012 07:22 UTC in reply to "RE[2]: But ...."
Neolander Member since:
2010-03-08

I'd like to learn of a use case where the best choice of programming language is actually m4, and not because you already have a pile of bash scripts duct taped together.

A mad sysadmin's dream company network, where everything runs some bare-bones variant of UNIX and the home partition is mounted in noexec mode?

Not sure if bash would still agree to run shell scripts in the latter case, though. And even if it spontaneously would, the mad sysadmin might well have patched it by hand so that it fails instead. After all, he can patch everything he wants since he never updates anything anyway.

Edited 2012-12-04 07:29 UTC

Reply Score: 3

RE[4]: But ....
by moondevil on Tue 4th Dec 2012 12:34 UTC in reply to "RE[3]: But ...."
moondevil Member since:
2005-07-08

Having worked with commercial UNIX systems that were pretty close to System V, I am not sure if I want to replicate the experience again.

Reply Score: 2

RE[5]: But ....
by Delgarde on Tue 4th Dec 2012 20:47 UTC in reply to "RE[4]: But ...."
Delgarde Member since:
2008-08-19

Having worked with commercial UNIX systems that were pretty close to System V, I am not sure if I want to replicate the experience again.


Agreed - at work, we develop on Linux, and deploy to AIX, Solaris, and HP-UX. And of those, AIX is decent enough, Solaris is a pain in the ass, and the less said about HP-UX the better.

Reply Score: 4

RE: But ....
by dnebdal on Thu 6th Dec 2012 15:56 UTC in reply to "But ...."
dnebdal Member since:
2008-08-27

I'd be curious to learn how the biologists deal with their gobs of data.


It really depends on the lab and the people there. Where I work, we mostly deal with array data, which is large-but-not-insane: Many GB, but not (yet) the TB-filling unpleasantness of large-scale sequencing. We mostly do things in R, with some of the rougher filtering done in shell scripts and simple command line tools (I ended up writing a fast matrix transposer for tab-separated text files just because R was taking hours while it could be done in minutes with a bit of C).
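For anyone wondering what such a transposer boils down to, here is a minimal Python sketch. To be clear, this is not the C tool mentioned above: it reads the whole table into memory, so it only suits files that fit in RAM, and the `transpose_tsv` name is just made up for the example.

```python
def transpose_tsv(text):
    """Transpose a tab-separated table given as a single string.

    Assumes a rectangular table (same number of fields on every line)
    and raises ValueError otherwise.
    """
    rows = [line.split("\t") for line in text.splitlines()]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("rows have differing numbers of fields")
    # zip(*rows) yields the first field of every row, then the second, ...
    # so each original column becomes one output row.
    columns = zip(*rows)
    return "".join("\t".join(column) + "\n" for column in columns)


if __name__ == "__main__":
    sample = "gene\ts1\ts2\nA\t1\t2\nB\t3\t4\n"
    print(transpose_tsv(sample), end="")
```

A version meant for multi-GB files would have to avoid holding every row at once, for instance by making several passes over the input.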

Going by the tools we run into from other labs, it's roughly what you'd expect from a bunch of biologists allied with whatever IT people they've roped in: Some places really love Perl, some write exclusively in Java, a few tools are C++, and there's always the odd diva company that only really supports a GUI tool for Windows if you want to interpret their raw files. Some Python, though it feels a bit less popular than in comparable fields.

There is an R package for everything, though. The language itself is a bit weird, which is not surprising given the long and meandering development history, and it's the first language where I've looked at the OO system and decided I'd be better off ignoring it. (If you thought stereotypical OO-fetishist Java was obfuscated, you've never looked at an R S3 class.) Still, it's by far the best language I've used for dealing with tabular data.

Edited 2012-12-06 16:01 UTC

Reply Score: 1

RE[2]: But ....
by Hypnos on Thu 6th Dec 2012 16:10 UTC in reply to "RE: But ...."
Hypnos Member since:
2008-11-19

Thanks for the summary!

If you have tabular data, it's easy to ingest with NumPy into C arrays that you can work on with SciPy libraries. This is what many astronomers do, who have data sets at least as large.
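A minimal sketch of that ingestion step, assuming a purely numeric tab-separated table with one header row. The column names and values are invented, and `load_table` is just a helper name for this example; `numpy.loadtxt` itself is the real NumPy function.

```python
import io

import numpy as np

# Invented sample table: a header row followed by numeric rows.
SAMPLE_TSV = "id\tflux\terror\n1\t10.5\t0.3\n2\t11.0\t0.4\n3\t9.5\t0.2\n"


def load_table(text):
    """Parse tab-separated numeric text into a 2-D float64 array."""
    # skiprows=1 drops the column-name line, which loadtxt cannot
    # convert to numbers.
    return np.loadtxt(io.StringIO(text), delimiter="\t", skiprows=1)


table = load_table(SAMPLE_TSV)
flux = table[:, 1]  # second column, selected without copying the data

if __name__ == "__main__":
    print(table.shape)
    print(flux.mean())
```

Once the data is in an array like this, column-wise statistics and the SciPy routines operate on the numeric buffer directly instead of on text.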

Reply Score: 2