Linked by Thom Holwerda on Sat 11th May 2013 21:41 UTC
Windows "Windows is indeed slower than other operating systems in many scenarios, and the gap is worsening." That's one way to start an insider explanation of why Windows' performance isn't up to snuff. Written by someone who actually contributes code to the Windows NT kernel, the comment on Hacker News, later deleted but reposted with permission on Marc Bevand's blog, paints a very dreary picture of the state of Windows development. The root issue? Think of how Linux is developed, and you'll know the answer.
RE[11]: Too funny
by Alfman on Tue 14th May 2013 02:08 UTC in reply to "RE[10]: Too funny"

satsujinka,

"How would implementing an SQL database on top of plain text be less flexible and less accessible than SQL? That is plainly a contradiction."

That's not what I said. I said that if you were to build your own custom database on top of file system primitives, it's unlikely to be as flexible or accessible as an SQL database. The applications I know of that do use a file system database are quite limited and not even remotely close to being SQL-complete (for example, the postfix mail queue). Anyway, given that all text logging systems I know of use flat files and not a file system database, I'd like us to move past this particular issue.



"A CSV variant (i.e. DSV) is already understood by the standard tools. So considering MySQL uses CSV, there's no reason why we couldn't implement a query engine that can co-exist with the standard tools. And why not provide that compatibility if we can?"


The thing is, once you have data in a database, you wouldn't ever have a need to use the standard text tools to access the data since they're largely inferior to SQL (unless of course you didn't know SQL).

I don't object to your choice of a text database engine if you want one. CSV is often a least-common-denominator format, which is simultaneously a strength (because it's pervasive) and a weakness (because it lacks a lot of the more advanced features a database can normally provide). But the choice is yours to make.
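
Just to sketch what that coexistence would look like (hypothetical file name and schema, with sqlite as a stand-in query engine): grep and friends keep reading the CSV directly, while SQL runs over the same bytes.

import csv
import sqlite3

# Hypothetical CSV log: each line is "timestamp,level,message".
# grep/awk/cut keep working on the file; SQL runs over the same data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (ts TEXT, level TEXT, message TEXT)")

with open("app.log.csv", newline="") as f:
    conn.executemany("INSERT INTO log VALUES (?, ?, ?)", csv.reader(f))

# A query the text tools can't express in one step.
for row in conn.execute("SELECT level, COUNT(*) FROM log GROUP BY level"):
    print(row)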



"Performance issues are a different matter. For log files, there probably won't be any problem... however, as I've said already; you can do indexing on plain text. You just have to add the appropriate semantics to your text format."

How do you index a plain text file using standard tools and then go on to query your records via that index? Wouldn't you need to write customized scripts to build and query the index? It seems to me that you'd need to build custom tools every time you want to do something that SQL has built in.
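
To make that concrete, here's roughly what such a customized script ends up being (hypothetical tab-separated format, one record per line): a sidecar index of byte offsets per key, plus a matching lookup routine, both of which a database gives you built in.

# Hand-rolled index over a plain-text log (hypothetical format:
# "user<TAB>message", one record per line). A real database ships
# the equivalent of both halves of this built in.

def build_index(log_path):
    """Map each key to the byte offsets of its lines."""
    index = {}
    offset = 0
    with open(log_path, "rb") as f:
        for line in f:
            key = line.split(b"\t", 1)[0]
            index.setdefault(key, []).append(offset)
            offset += len(line)
    return index

def lookup(log_path, index, key):
    """Seek straight to the indexed lines instead of scanning."""
    with open(log_path, "rb") as f:
        for off in index.get(key, []):
            f.seek(off)
            yield f.readline()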

Reply Parent Score: 2

RE[12]: Too funny
by satsujinka on Tue 14th May 2013 07:08 in reply to "RE[11]: Too funny"

Your argument seems very confused to me, but maybe I'm misunderstanding you.

I'm going to drop the indexing discussion after this, because I'm not sufficiently studied on the topic to explain how a database does indexing. However, if we take file=table and line=row, then I would imagine we can cache rows and mark them with their table (inside the cache). But as I said, I don't know what databases do, so this is just my guess. I'm also not convinced that a log database would have performance issues, as there's really only one record type and logs don't cross-reference each other much.

Moving back to the top:

if you were to build your own custom *SQL* database on top of file system primitives, it's unlikely to be as flexible or accessible as an SQL database

The part in bold is what you're missing, and it's why you're contradicting yourself. You are literally saying that an SQL database is less flexible and accessible than an SQL database. The backend is totally unimportant for non-performance considerations.

The thing is, once you have data in a database, you wouldn't ever have a need to use the standard text tools to access the data since they're largely inferior to SQL (unless of course you didn't know SQL).

See, but there are reasons why you might not want to use a query engine. You list a trivial one (one that at least a professional system administrator should try to overcome, but not everyone is a professional system administrator). Here are some more:
* Because I want to verify that the query engine is returning the correct results. (Query engines have bugs too!)
* Because writing out a full query is more work than grepping for some keyword. (I'm lazy.)
* Because log files shouldn't exist in some magical land separate from all my other files (e.g. off in SQL land while all of my other files are in CLI land; this can also be read as "the CLI is what I reach for first").
* Because I don't want to have to hunt down a database driver just to pick some things out of my logs from within my program.
* Or, from the other side of the fence, because I don't want to have to hunt down a database driver to write some logs from my program.
* Because I want to pipe my results out to some other program (this is more a comment on most SQL query engines than a real limitation; see the sketch below).
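
To illustrate those last two points (the field layout is made up), a line-oriented log participates in a pipeline with nothing but stdin and stdout, no driver on either side:

# Toy pipeline filter (made-up layout: level is the second CSV field).
# Usage: cat app.log.csv | python filter.py ERROR | less
import csv
import sys

wanted = sys.argv[1]
out = csv.writer(sys.stdout)
for record in csv.reader(sys.stdin):
    if len(record) > 1 and record[1] == wanted:
        out.writerow(record)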

because it lacks a lot of the more advanced features a database can normally provide

And what "advanced" features would apply to a log? There's only 1 record type. CSV provides sufficient capabilities to handle that.

Consider Wikipedia's CSV page:
CSV formats are best used to represent sets or sequences of records in which each record has an identical list of fields. This corresponds to a single relation in a relational database, or to data (though not calculations) in a typical spreadsheet.


Does this not sound exactly like what an entry in a log file is?
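
For example (the field names here are just an illustration), one log entry maps one-to-one onto one CSV row:

import csv
import sys
from datetime import datetime, timezone

# One log entry == one CSV row with an identical field list
# (illustrative fields: timestamp, level, source, message).
csv.writer(sys.stdout).writerow([
    datetime.now(timezone.utc).isoformat(),
    "INFO",
    "sshd",
    "Accepted publickey for alice",
])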

Reply Parent Score: 2

RE[13]: Too funny
by Alfman on Tue 14th May 2013 15:45 in reply to "RE[12]: Too funny"

satsujinka,


"if you were to build your own custom SQL database over top file system primitives, it's unlikely to be as flexible or accessible as an SQL database"
...
"The bold part is what you're missing. And is why you're contradicting yourself. You are literally saying that an SQL database is less flexible and accessible then an SQL database. The backend is totally unimportant for non-performance considerations."

Oh, I see: you modified my quote in order to create a contradiction (please don't do that again!). This is a bit out of context for the log files we were discussing, but since you're quite adamant that text files are just as good for implementing SQL databases, I'll address why I think you are wrong.

You *could* build an SQL database on top of any format you chose. I won't discourage you from trying it, but unless you create ODBC / JDBC / native data connectors for it, you'd end up with a rather inaccessible SQL implementation. Still, you *could* build all the SQL connectors and have a fully usable SQL database.


Now, conceptually you're happy, but the implementation details are where problems begin cropping up. Almost any change to records (changing data values or columns) means re-sequencing the whole text file, which is not only inefficient in itself (particularly for large databases) but also means rebuilding all the indexes. Many years of research have also gone into the so-called ACID features you'll find in SQL databases: consider atomic transactions, foreign key integrity, etc. SQL implementations are designed to keep the data consistent even in the event of a power loss; think about what that means for flat file storage.
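
Here's a sketch of why (hypothetical one-file CSV layout): text records are variable-length, so an update that changes a field's length can't be done in place, and the whole file has to be rewritten, with nothing keeping it consistent if you lose power mid-write.

# Why an in-place update to a flat text file means rewriting it:
# lines are variable-length, so a grown field shifts every byte
# after it (and invalidates any offset-based index).
def update_field(path, line_no, field_no, new_value):
    with open(path, "r", newline="") as f:
        lines = f.readlines()                # whole file in memory
    fields = lines[line_no].rstrip("\n").split(",")
    fields[field_no] = new_value             # may change the line length
    lines[line_no] = ",".join(fields) + "\n"
    with open(path, "w", newline="") as f:   # rewrite everything;
        f.writelines(lines)                  # a crash here corrupts the file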

Another issue is that a flat text file makes efficient concurrency difficult: any change one program is making would have to block other readers/writers to avoid data corruption, and I think you'll agree that the entire text file needs one large mutex to guarantee that the textual data is in a consistent state. Although Linux has advisory file locking, I don't think standard tools like grep use it. So after all your work to make your "custom SQL database" use a text format, you still cannot safely use standard tools like grep on it without first taking the database offline.
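
For what it's worth, advisory locking looks roughly like this, and it only protects you from other programs that also call flock(), which grep does not:

import fcntl

# Advisory lock: cooperating writers block each other, but a plain
# "grep pattern app.log" never calls flock() and may still read a
# half-written record.
with open("app.log", "a") as f:
    fcntl.flock(f, fcntl.LOCK_EX)   # exclusive advisory lock
    f.write("new,log,record\n")
    f.flush()
    fcntl.flock(f, fcntl.LOCK_UN)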

So I ask you now, what is the advantage of having a text SQL engine over being able to export text from a binary SQL database?


The only criticism I can give merit to as a real fundamental problem is if you don't trust the database implementation to produce reliable results (for fun: I challenge you to find an instance of a popular database having produced unreliable query results on working hardware). For everything else you could export a text copy, and even then it's my honest belief that hardly anyone proficient with SQL would choose to use the text tools. SQL really is superior, even for simple ad-hoc queries.


"And what 'advanced' features would apply to a log? There's only 1 record type. CSV provides sufficient capabilities to handle that."

For example, on one production system I import apache logs into a database, index them, and compute per-user aggregate statistics. The database correlates these hits with user accounts and groups them by date to display monthly statistics for our users. These web statistics also get joined to the sales email statistics.
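
That's not the actual schema, but the shape is roughly this (made-up table and paths, sqlite standing in for the real database): parse each hit into a row, then let the database do the grouping.

import re
import sqlite3

# Rough shape of the setup described above (made-up schema and paths,
# sqlite standing in for the real database).
HIT = re.compile(r'(\S+) \S+ (\S+) \[(\d+)/(\w+)/(\d+)')

conn = sqlite3.connect("stats.db")
conn.execute("CREATE TABLE IF NOT EXISTS hits (ip TEXT, user TEXT, month TEXT)")

with open("access.log") as f:
    for line in f:
        m = HIT.match(line)
        if m:
            ip, user, _day, mon, year = m.groups()
            conn.execute("INSERT INTO hits VALUES (?, ?, ?)",
                         (ip, user, year + "-" + mon))
conn.commit()

# Per-user monthly totals in one declarative statement.
for row in conn.execute(
        "SELECT user, month, COUNT(*) FROM hits GROUP BY user, month"):
    print(row)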

I won't pretend most log files need this amount of analytical power; they don't. But it's still nice to have this ability without first having to write programs to parse the text files and manually compute aggregates. I can write (and have written) Perl and PHP scripts to run similar computations by hand, but the database is the clear winner IMHO. Even if I'm just browsing the data and not manipulating it, I'd rather have a tabular spreadsheet interface than a flat-file one.

I do appreciate how cleverly the text tools can be used in a Unix shell, but the more I think about it, the more I like the database approach. Maybe we need to stop thinking about it as "binary" versus "text", and think about it as different degrees of data abstraction.

Reply Parent Score: 3