Linked by Thom Holwerda on Mon 3rd Dec 2012 22:51 UTC
General Unix "Few tools are more indispensable to my work than Unix. Manipulating data into different formats, performing transformations, and conducting exploratory data analysis (EDA) is the lingua franca of data science. The coffers of Unix hold many simple tools, which by themselves are powerful, but when chained together facilitate complex data manipulations. Unix's use of functional composition eliminates much of the tedious boilerplate of I/O and text parsing found in scripting languages. This design creates a simple and succinct interface for manipulating data and a foundation upon which custom tools can be built. Although languages like R and Python are invaluable for data analysis, I find Unix to be superior in many scenarios for quick and simple data cleaning, idea prototyping, and understanding data. This post is about how I use Unix for EDA."
Thread beginning with comment 544253
RE[9]: But ....
by Alfman on Wed 5th Dec 2012 05:49 UTC in reply to "RE[8]: But ...."
Alfman Member since:
2011-01-28

Hypnos,

"It's better software design for tools to be simpler by default, not more complex by default."

I'm on the fence. Text is more open. There's no need to reverse engineer anything or be dependent upon binary parsers, which makes using standard command line tools very practical.

On the other hand, I almost wish all data were stored in a database where we can easily build complex indexes and queries to our heart's content, all while maintaining relational & transactional integrity as needed. So, if they replace the text files with a real database such as sqlite or mysql, I wouldn't object much to that.

SQL is so second nature to me, it's much easier than trying to build scripts using bash/sed/grep/etc for anything complex. I hate escaping command line arguments for bash scripts. Searching indexed tables is so much faster than searching text files too.
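As a rough illustration of the indexed lookup described above, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, index name, and sample rows are all made up for the example; nothing here reflects how any real logger stores its data.

```python
import sqlite3

# Hypothetical schema: one row per log event, kept in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (ts TEXT, host TEXT, service TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?, ?)",
    [
        ("2012-12-05 05:40:01", "mail1", "postfix", "connection refused"),
        ("2012-12-05 05:41:17", "mail1", "sshd", "accepted password for alice"),
        ("2012-12-05 05:42:30", "web1", "postfix", "queue flushed"),
    ],
)

# With this index SQLite can look up one service without scanning every row.
conn.execute("CREATE INDEX log_service_ts ON log (service, ts)")

# Values are bound with "?" placeholders, so no shell-style escaping is needed.
rows = conn.execute(
    "SELECT ts, host, message FROM log WHERE service = ? ORDER BY ts",
    ("postfix",),
).fetchall()

for row in rows:
    print(row)
```

The equivalent grep has to read the whole file each time, whereas the index keeps lookups cheap as the table grows.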

This is not a defence for syslogd's proprietary tools, but I can see lots of benefits in using structured databases over text files.

Score: 2

RE[10]: But ....
by Soulbender on Wed 5th Dec 2012 07:28 in reply to "RE[9]: But ...."
Soulbender Member since:
2005-08-18

"On the other hand, I almost wish all data were stored in a database where we can easily build complex indexes and queries to our heart's content, all while maintaining relational & transactional integrity as needed."


Might want to look into Logstash. It's awesome.

Speaking of Logstash, how does systemd's logging integrate with other tools? Can we easily transport the data to, say, Logstash?


"SQL is so second nature to me"

SQL isn't that well-suited for logs though. Not much relational data to speak of in a log record.

( mysql? real database? surely you jest ;) )

Edited 2012-12-05 07:30 UTC

Score: 3

RE[11]: But ....
by Alfman on Wed 5th Dec 2012 15:47 in reply to "RE[10]: But ...."
Alfman Member since:
2011-01-28

Soulbender,

"Might want to look into Logstash. It's awesome."

Thanks for the suggestion.

"SQL isn't that well-suited for logs though. Not much relational data to speak of in a log record."

I actually think that the fact that text files are so poor at handling highly structured & relational information is the main reason we don't store more structured & relational information in them in the first place.

In principle we should be able to do inner joins between log events, sessions, and user account information. We should be able to perform aggregates, like "how many times has the user done XYZ", "how many times did it succeed", broken down by month for this year, etc. We could even join up records from separate services running on separate servers.
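To make the join-and-aggregate idea concrete, here is a sketch under stated assumptions: the users, sessions, and events tables, their columns, and the sample data are all hypothetical, and "upload" stands in for the action "XYZ". It uses Python's sqlite3 module purely for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables: users own sessions, and each log event belongs to a session.
conn.executescript("""
CREATE TABLE users    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sessions (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id));
CREATE TABLE events   (session_id INTEGER REFERENCES sessions(id),
                       ts TEXT, action TEXT, ok INTEGER);

INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO sessions VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO events VALUES
    (10, '2012-11-03 09:00:00', 'login',  1),
    (10, '2012-11-03 09:05:00', 'upload', 0),
    (11, '2012-12-01 14:00:00', 'upload', 1),
    (11, '2012-12-01 14:02:00', 'upload', 1),
    (20, '2012-12-02 08:30:00', 'upload', 0);
""")

# How many times did each user attempt the action, and how many attempts
# succeeded, broken down by month?
report = conn.execute("""
    SELECT u.name,
           strftime('%Y-%m', e.ts) AS month,
           COUNT(*)                AS attempts,
           SUM(e.ok)               AS succeeded
    FROM events e
    INNER JOIN sessions s ON s.id = e.session_id
    INNER JOIN users u    ON u.id = s.user_id
    WHERE e.action = ?
    GROUP BY u.name, month
    ORDER BY u.name, month
""", ("upload",)).fetchall()

for name, month, attempts, succeeded in report:
    print(name, month, attempts, succeeded)
```

Getting the same report out of flat text logs would mean writing a script that correlates session IDs across lines by hand.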


We can write our own tools to generate reports using scripts, as I'm sure many of us are accustomed to doing, but sometimes it's too much effort to write a program to extract the sort of information a simple query could produce.

You are right that *as is*, simply inserting unstructured text logs into an SQL database as text wouldn't give the full benefit SQL offers.
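A small sketch of what splitting log lines into columns before insertion might look like. The line layout, the regular expression, the sample lines, and the table are assumptions made for illustration; real syslog output varies, and lines that do not match are simply skipped here.

```python
import re
import sqlite3

# Assumed layout: "Mon DD HH:MM:SS host service[pid]: message" (pid optional).
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s"
    r"(?P<host>\S+)\s"
    r"(?P<service>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)$"
)

# Made-up sample lines in the assumed layout.
raw_lines = [
    "Dec  5 05:49:12 mail1 sshd[2211]: Failed password for alice from 10.0.0.5",
    "Dec  5 05:49:40 mail1 postfix/smtpd[1234]: connect from unknown[10.0.0.9]",
    "Dec  5 05:50:02 mail1 cron: job started",
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log (ts TEXT, host TEXT, service TEXT, pid INTEGER, message TEXT)"
)

for line in raw_lines:
    m = LINE_RE.match(line)
    if m is None:
        continue  # skip lines that do not fit the assumed layout
    # groupdict() supplies the named parameters; a missing pid becomes NULL.
    conn.execute(
        "INSERT INTO log VALUES (:ts, :host, :service, :pid, :message)",
        m.groupdict(),
    )

# Once the fields are separate columns, grouping by service is a one-liner.
per_service = conn.execute(
    "SELECT service, COUNT(*) FROM log GROUP BY service ORDER BY service"
).fetchall()
print(per_service)
```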


"mysql? real database? surely you jest ;) "

I said mysql? *Ahem* I meant postgre.


Hypnos,


"Do you really want to have a running SQL server just to figure out what went wrong with your mail server? It could be the SQL server!"

For one thing I think the logger should be a separate database instance, maybe even something like sqlite that doesn't need an external daemon. However I can appreciate the concern over potential new failure modes: will the database logger fail under scenarios the text daemon would have succeeded?
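As a sketch of the "no external daemon" idea, the following uses Python's standard logging and sqlite3 modules. The SQLiteHandler class, the table layout, and the "mailserver" logger name are hypothetical; this is one possible shape, not how any existing logger works. SQLite runs inside the logging process, so there is no separate server to be down, though the write can still fail, which is why errors are passed to the handler's standard error hook.

```python
import logging
import sqlite3


class SQLiteHandler(logging.Handler):
    """Hypothetical handler that writes each record to a local SQLite database."""

    def __init__(self, path):
        super().__init__()
        # SQLite runs in-process: the "server" is just a file (or memory).
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS log "
            "(created REAL, logger TEXT, level TEXT, message TEXT)"
        )

    def emit(self, record):
        try:
            with self.conn:  # commits on success, rolls back on error
                self.conn.execute(
                    "INSERT INTO log VALUES (?, ?, ?, ?)",
                    (record.created, record.name, record.levelname,
                     record.getMessage()),
                )
        except sqlite3.Error:
            # A new failure mode compared with appending to a text file.
            self.handleError(record)


handler = SQLiteHandler(":memory:")  # a file path would persist the log
logger = logging.getLogger("mailserver")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.warning("queue is full")

stored = handler.conn.execute("SELECT logger, level, message FROM log").fetchall()
print(stored)
```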

Score: 3

RE[10]: But ....
by Hypnos on Wed 5th Dec 2012 07:31 in reply to "RE[9]: But ...."
Hypnos Member since:
2008-11-19

Do you really want to have a running SQL server just to figure out what went wrong with your mail server? It could be the SQL server!

Score: 3