“Few tools are more indispensable to my work than Unix. Manipulating data into different formats, performing transformations, and conducting exploratory data analysis (EDA) is the lingua franca of data science. The coffers of Unix hold many simple tools, which by themselves are powerful, but when chained together facilitate complex data manipulations. Unix’s use of functional composition eliminates much of the tedious boilerplate of I/O and text parsing found in scripting languages. This design creates a simple and succinct interface for manipulating data and a foundation upon which custom tools can be built. Although languages like R and Python are invaluable for data analysis, I find Unix to be superior in many scenarios for quick and simple data cleaning, idea prototyping, and understanding data. This post is about how I use Unix for EDA.”
I’m sympathetic to his viewpoint, but the Unix shell is a bit of a stretch unless your data entries are one line each. It’s easy for shell scripts to become spaghetti code if your data entries are more complex or nested; in those cases you want a programming language that is more verbose, but also more readable. And if you have huge data sets, you need to think carefully about memory management.
In particle physics, where experimenters deal in petabytes of data often structured in nested trees, ROOT has become the standard. It takes the philosophy of early optimization and runs with it: pure C++, with RTTI and an interpreter tacked on.
In astronomy, where the data is more array-like, people use IRAF and IDL (or its quasi-clones, like SciPy).
I’d be curious to learn how the biologists deal with their gobs of data.
That is where Perl and Python (depending on your camp) come into play. However, the point of those tutorials is to show that basic EDA can be done from the command line.
Though stuff like that comes into its own if you’re a sysadmin and need to quickly churn through log files. I love working in a CLI, but even I wouldn’t advocate using Bash et al. for analysing large, complex datasets.
To be fair, Unix log files are actually designed in a way that makes it easy to analyze them with awk or perl — they are structured line-by-line. It’s a wonderfully sensible convention, set to be discarded by the systemd folks.
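As a minimal sketch of what that line-by-line convention buys you, assuming the traditional “timestamp host program[pid]: message” layout (the log path below is an assumption; it varies by distro), counting messages per program is only a few lines:

import re
from collections import Counter

# Traditional syslog lines look like:
#   Dec  4 00:57:01 web01 sshd[1234]: Accepted publickey for alice
# The program name sits between the hostname and the colon/bracket.
SYSLOG_RE = re.compile(r'^\w{3}\s+\d+ \S+ \S+ ([^\[:]+)')

counts = Counter()
with open("/var/log/syslog") as log:   # assumed path
    for line in log:
        match = SYSLOG_RE.match(line)
        if match:
            counts[match.group(1)] += 1

for program, n in counts.most_common(10):
    print("%8d  %s" % (n, program))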
Regarding Python, SciPy is a bunch of high-performance C(++) and Fortran libraries glued together by Python+NumPy. It’s really becoming a viable substitute for IDL.
…what? Systemd will change how logging is done? Please tell me that’s not the case.
Ayyup:
https://docs.google.com/document/pub?id=1IC9yOXj7j6cdLLxWEBAGRL6wl97…
To be fair, they let you run your old logger in parallel — all you have to do is change your tried-and-true tools to use their new super-duper interface:
http://www.freedesktop.org/wiki/Software/systemd/syslog
Are they not merciful?
More than that – not only are they not stopping you from running a traditional logger if you want one, they’re providing fancy tools designed specifically for running complex queries over the binary log format.
In short, they’re actually making logging data easier to parse.
I can see how this is useful for corporate deployments with many, many machines. But it’s overkill to include it by default for 99% of Linux users. It should at most be a plugin.
Why not the other way around? The systemd binary logging approach by default, and the ability to install a traditional text logger (i.e. a plugin) for those that want it?
It’s better software design for tools to be simpler by default, not more complex by default. This gives better maintainability and fork-ability, as well as fewer bugs and security problems.
Hypnos,
“It’s better software design for tools to be simpler by default, not more complex by default.”
I’m on the fence. Text is more open. There’s no need to reverse engineer anything or be dependent upon binary parsers, which makes using standard command line tools very practical.
On the other hand, I almost wish all data were stored in a database where we can easily build complex indexes and queries to our hearts’ content, all while maintaining relational & transactional integrity as needed. So, if they replace the text files with a real database such as sqlite or mysql, I wouldn’t object much to that.
SQL is so second nature to me, it’s much easier than trying to build scripts using bash/sed/grep/etc for anything complex. I hate escaping command line arguments for bash scripts. Searching indexed tables is so much faster than searching text files too.
This is not a defence of systemd’s proprietary tools, but I can see lots of benefits in using structured databases over text files.
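A minimal sketch of that idea using sqlite3 from Python’s standard library; the table layout and field names here are invented for illustration, not anything syslog or systemd defines:

import sqlite3

conn = sqlite3.connect("logs.db")   # hypothetical database file
conn.execute("""
    CREATE TABLE IF NOT EXISTS log_event (
        ts       TEXT,   -- ISO 8601 timestamp
        host     TEXT,
        program  TEXT,
        severity TEXT,
        message  TEXT
    )
""")
# The index is what turns "all sshd events since midnight" into a seek
# instead of a full scan of the file.
conn.execute("CREATE INDEX IF NOT EXISTS idx_prog_ts ON log_event (program, ts)")

conn.execute(
    "INSERT INTO log_event VALUES (?, ?, ?, ?, ?)",
    ("2012-12-04T00:57:01", "web01", "sshd", "info", "Accepted publickey for alice"),
)
conn.commit()

for ts, message in conn.execute(
    "SELECT ts, message FROM log_event WHERE program = ? AND ts >= ? ORDER BY ts",
    ("sshd", "2012-12-04T00:00:00"),
):
    print(ts, message)

Note that the placeholders also take care of quoting, so there is no shell-style escaping to fight with.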
Might want to look into Logstash. It’s awesome.
Speaking of logstash, how does systemd’s logging integrate with other tools? Can we easily transport the data to, say, logstash?
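For what it’s worth, journalctl can dump the journal as one JSON object per line (journalctl -o json), so an external collector never has to touch the binary format. Here is a rough sketch of shipping that stream onward; the localhost:5000 listener (say, a Logstash tcp input with a json_lines codec) and the choice of forwarded fields are assumptions, not anything systemd ships:

import json
import socket
import subprocess

# Stream journal entries as newline-delimited JSON; -f follows like `tail -f`.
journal = subprocess.Popen(
    ["journalctl", "-o", "json", "-f"],
    stdout=subprocess.PIPE,
)

# Assumed: a collector listening for newline-delimited JSON on localhost:5000.
sink = socket.create_connection(("localhost", 5000))

for raw in journal.stdout:
    entry = json.loads(raw)
    # Forward a handful of well-known journal fields to keep the payload small.
    slimmed = {
        "timestamp": entry.get("__REALTIME_TIMESTAMP"),
        "host": entry.get("_HOSTNAME"),
        "unit": entry.get("_SYSTEMD_UNIT"),
        "message": entry.get("MESSAGE"),
    }
    sink.sendall((json.dumps(slimmed) + "\n").encode("utf-8"))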
SQL isn’t that well-suited for logs though. Not much relational data to speak of in a log record.
( mysql? real database? surely you jest )
Soulbender,
“Might want to look into Logstash. It’s awesome.”
Thanks for the suggestion.
“SQL isn’t that well-suited for logs though. Not much relational data to speak of in a log record.”
I actually think the main reason we don’t store more structured & relational information in text files is that they are so poor at handling it in the first place.
In principle we should be able to do inner joins between log events & sessions & user account information. We should be able to perform aggregates, like “how many times has the user done XYZ”, how many times did it succeed, broken down by month for this year, etc… We could even join up records from separate services running on separate servers.
We can write our own tools to generate reports using scripts, as I’m sure many of us are accustomed to doing, but sometimes it’s too much effort to write a program to extract the sort of information a simple query could produce.
You are right that *as is*, simply inserting unstructured text logs into an SQL database as text wouldn’t give the full benefit SQL offers.
“mysql? real database? surely you jest ”
I said mysql? *Ahem* I meant postgre.
Hypnos,
“Do you really want to have a running SQL server just to figure out what went wrong with your mail server? It could be the SQL server!”
For one thing, I think the logger should be a separate database instance, maybe even something like sqlite that doesn’t need an external daemon. However, I can appreciate the concern over potential new failure modes: will the database logger fail in scenarios where the text daemon would have succeeded?
Logstash+Kibana
Yes, those are things we need to be able to query (God knows grep’ing log files isn’t sufficient as soon as you scale up) but SQL isn’t necessarily the right language for this.
There are few, if any, natural relations in log data even if structured. SQL was designed to handle a different kind of data, so I’m more in favor of the Logstash/Graylog2 approach, which uses ElasticSearch.
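For contrast, this is roughly what querying that kind of setup looks like over ElasticSearch’s REST API, assuming a local instance that Logstash has been writing to under its default logstash-* index naming. The field names (program, message, @timestamp) and the exact query DSL syntax vary with the Logstash pipeline and the ElasticSearch version, so treat this as a sketch:

import json
import urllib.request

# Find failed sshd logins from the last day across all logstash-* indices.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"program": "sshd"}},
                {"match": {"message": "Failed password"}},
            ],
            "filter": [
                {"range": {"@timestamp": {"gte": "now-1d"}}},
            ],
        }
    },
    "size": 20,
}

req = urllib.request.Request(
    "http://localhost:9200/logstash-*/_search",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    hits = json.load(resp)["hits"]["hits"]

for hit in hits:
    print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))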
Soulbender,
“Yes, those are things we need to be able to query (God knows grep’ing log files isn’t sufficient as soon as you scale up) but SQL isn’t necessarily the right language for this.”
Why not? I’ve parsed log data into SQL and honestly SQL’s power has worked quite well for some of my more advanced analytical requirements (… I presume the next sentence is your answer)
“There are few, if any, natural relations in log data even if structured.”
But I beg to differ: there are plenty of relationships to other datasets. I’ve already highlighted events & sessions & account info; one might also have tons of other relational data like shopping carts, purchase history, inventory, zipcodes & geolocation data, and whatever else might prove useful – it obviously depends on what you’re doing.
“SQL was designed to handle a different kind of data, so I’m more in favor of the Logstash/Graylog2 approach, which uses ElasticSearch.”
Well ok, I have no experience with those tools and it’d be good to learn them. A cursory glance seems to indicate that they’re very specifically for wrapping up log data under a nice GUI. That’s cool and all, but can they do the relational queries across non-log data sets I’m talking about? How does that aspect improve upon SQL?
Maybe you’re right about the tools, but I think you are wrong about most logs having little relational data even when they’re well structured.
As an example:
A client’s website uses friendly category/product urls with a numeric product id appended. Only the ID is significant, the rest is just descriptive text and changes from time to time. The same product can show up under different categories. Now I want a report showing how many hits his products are getting. In SQL I can easily join up the ID with the actual product record in the database and do a “group by” aggregate showing the product name and hit count and purchase count.
Now without joining into the database, we could aggregate on the entire URL, but that’d be wrong because some products might be duplicated due to variations in categories or descriptive text.
Alternatively, without joining into the database, we might aggregate on just the product id in the URL, but the report is next to useless without descriptive product names. Some fancy code might be able to aggregate all the products under the first/last descriptive text found for a given ID, but even so we still couldn’t get a purchase count on the report without joining into the database.
Hypothetically if logs did contain relational data would you still object to SQL?
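To make that report concrete, here is a minimal sqlite3 sketch; the table and column names (products, access_log, purchases, product_id) are invented for illustration and stand in for whatever the client’s actual schema looks like:

import sqlite3

conn = sqlite3.connect("shop.db")   # hypothetical database file

# Per-product hit counts from the access log, joined with the product table for
# the real name, plus a purchase count pulled from the purchases table.
report = conn.execute("""
    SELECT p.name,
           COUNT(a.product_id) AS hits,
           (SELECT COUNT(*) FROM purchases AS pu
             WHERE pu.product_id = p.id) AS purchases
    FROM access_log AS a
    JOIN products   AS p ON p.id = a.product_id
    GROUP BY p.id, p.name
    ORDER BY hits DESC
""").fetchall()

for name, hits, purchases in report:
    print("%6d hits  %5d purchases  %s" % (hits, purchases, name))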
Ok, I shouldn’t say that there are no relations in log data; the problem is that the relations are not well defined and come from many different systems with different representations of the relations. Coercing this into a coherent solution that can be effectively used with SQL is a very difficult and time-consuming task, and most of the time it’s not worth it.
Isn’t this something that is better suited for a report in the actual system though?
(Edit: This kind of report can be done with logstash/elasticsearch, afaik.)
That could certainly be a good case on a per-application basis and for very specialized systems, but not for general log management.
Soulbender,
Full agreement here. Hmm, must be a new milestone for the internet!
Do you really want to have a running SQL server just to figure out what went wrong with your mail server? It could be the SQL server!
Ok, not as horrible as I initially thought. syslog is a pretty shitty logging system, that I can agree with.
I would have liked to see the example use more structured data though and not a freeform “user blah logged in” message.
Don’t see why this should be a feature of systemd though and not a standalone system. The arguments for forcing systemd on us are pretty bogus. Tightly integrated? Yeah, right. Good thing we don’t have message passing technologies these days or something.
Also, what’s with being so defensive about UUIDs? (I don’t mind them, it’s just fascinating how big a deal it seems to be)
As a big fan of Upstart (and daemontools/runit/etc) I think it’s about time the abomination known as SysV init is abandoned (along with runlevels) and in that respect Systemd is a step forward. I kinda wish it didn’t try to weasel in everywhere though. (A GNOME dependency? WTF?)
I’ve been pretty happy with OpenRC on Gentoo, though it does depend on /sbin/init:
https://en.wikipedia.org/wiki/OpenRC
Because it’s made by Poettering. Everything he makes wants to take over your system and eat your brains.
Seems almost fitting… https://en.wikipedia.org/wiki/Poettering ;p
My favorite example of shell scripts gone off-the-rails: GNU autotools.
I’d like to learn of a use case where the best choice of programming language is actually m4, and not because you already have a pile of bash scripts duct taped together.
Sendmail?
Nah, just kidding. Sendmail’s use of m4 is also awful.
Don’t be mean to Sendmail; after all, it allowed some companies to live mainly off Sendmail configuration projects.
A mad sysadmin’s dream company network, where everything runs some bare-bones variant of UNIX and the home partition is mounted in noexec mode?
Not sure if bash would still agree to run shell scripts in the latter case, though. And even if it spontaneously would, the mad sysadmin might well have patched it by hand so that it fails instead. After all, he can patch everything he wants since he never updates anything anyway.
Having worked with commercial UNIX systems that were pretty close to System V, I am not sure if I want to replicate the experience again.
Agreed – at work, we develop on Linux, and deploy to AIX, Solaris, and HP-UX. And of those, AIX is decent enough, Solaris is a pain in the ass, and the less said about HP-UX the better.
It really depends on the lab and the people there. Where I work, we mostly deal with array data, which is large-but-not-insane: many GB, but not (yet) the TB-filling unpleasantness of large-scale sequencing. We mostly do things in R, with some of the rougher filtering done in shell script and simple command line tools (I ended up writing a fast matrix transposer for tab-separated text files just because R was taking hours for something that could be done in minutes with a bit of C).
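(The transpose itself is trivial to express; the point above is that doing it fast on many-gigabyte files took a bit of C. A naive in-memory Python version, just to show the operation, assuming the file fits in RAM:)

import csv
import sys

# Read a whole tab-separated matrix, flip rows and columns, write it back out.
# Fine when the file fits in memory; a C tool earns its keep on files that don't.
with open(sys.argv[1], newline="") as src:
    rows = list(csv.reader(src, delimiter="\t"))

writer = csv.writer(sys.stdout, delimiter="\t", lineterminator="\n")
writer.writerows(zip(*rows))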
Going by the tools we run into from other labs, it’s roughly what you’d expect from a bunch of biologists allied with whatever IT people they’ve roped in: some places really love perl, some write exclusively in java, a few tools are C++, and there’s always the odd diva company that only really supports a GUI tool for Windows if you want to interpret their raw files. Some python, though it feels a bit less popular than in comparable fields.
There is an R package for everything, though. The language itself is a bit weird, which is not surprising given its long and meandering development history, and it’s the first language where I’ve looked at the OO system and decided I’d be better off ignoring it. (If you thought stereotypical OO-fetishist Java was obfuscated, you’ve never looked at an R S3 class.) Still, it’s by far the best language I’ve used for dealing with tabular data.
Thanks for the summary!
If you have tabular data, it’s easy to ingest with numpy into C arrays that you can work on with SciPy libraries. This is what many astronomers do, and their data sets are at least as large.
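A minimal sketch of that workflow; the file name and column names ("flux", "ra", "dec") are made up for illustration, and the only assumption about the input is a tab-separated table with a header row:

import numpy as np
from scipy import stats

# Read a tab-separated table with a header row into a typed structured array.
data = np.genfromtxt("catalog.tsv", delimiter="\t", names=True)

flux = data["flux"]
print("rows:", flux.size)
print("median flux:", np.median(flux))

# Sigma-clip the flux column to drop gross outliers before any fitting.
clipped, low, high = stats.sigmaclip(flux, low=4.0, high=4.0)
print("kept %d of %d rows after 4-sigma clipping" % (clipped.size, flux.size))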