Linked by Thom Holwerda on Sat 11th May 2013 21:41 UTC
Windows "Windows is indeed slower than other operating systems in many scenarios, and the gap is worsening." That's one way to start an insider explanation of why Windows' performance isn't up to snuff. Written by someone who actually contributes code to the Windows NT kernel, the comment on Hacker News, later deleted but reposted with permission on Marc Bevand's blog, paints a very dreary picture of the state of Windows development. The root issue? Think of how Linux is developed, and you'll know the answer.
Thread beginning with comment 561610
RE[14]: Too funny
by satsujinka on Wed 15th May 2013 00:17 UTC in reply to "RE[13]: Too funny"
satsujinka, Member since: 2010-03-11


"Oh I see, you modified my quote in order to create a contradiction. (Please don't do that again!)"

No, I modified your quote to show you where you were misinterpreting me. You were under the impression that I wasn't talking about a text-backed SQL query engine, so I modified your quote because apparently just saying so wasn't enough.

"You *could* build an SQL database on top of any format you chose. I won't discourage you from trying it, but unless you create ODBC / JDBC / native data connectors for it, then you'd end up with a rather inaccessible SQL implementation. Still you *could* build all the SQL connectors and have a fully usable SQL database.

This is the same issue that journald faces. Of course, I think it's reasonable to infer that you feel that journald should use an existing database (say MariaDB or SQLite or ...)"

I felt it was quite reasonable to assume that I was talking about what journald should have done (if it was going to go ahead and reinvent a database anyway).

"Now, conceptually you're happy, but the implementation details are where problems begin cropping up. Almost any change to records (changing data values or columns) means re-sequencing the whole text file, which is not only inefficient by itself (particularly for large databases), but it means rebuilding all the indexes as well. Also, many years of research have gone into the so-called ACID features you'll find in SQL databases. Consider atomic transactions, foreign key integrity, etc. SQL implementations are designed to keep the data consistent even in the event of a power loss; think about what that means for flat file storage."


All databases have to solve the same challenges. Nothing mentioned here is particularly different no matter your database backend. In short, implementing a database is hard.

Of minor note: no matter the format, changing a column value would not require re-sequencing the file. The rest of the file hasn't changed, so the representation currently cached would still be valid; there's no need to rebuild it.

"Another issue is that a flat text file makes efficient concurrency difficult; any change that one program is making would have to block other readers/writers to avoid data corruption. I think you'll agree that the entire text file needs one large mutex in order to guarantee that the textual data is in a consistent state. Although Linux has advisory file locking, I don't think standard tools like grep use it. After all your work to make your 'custom SQL database' use a text format, you still cannot safely use standard tools like grep on it without first making sure the database is taken offline."


Again, no issues that other formats don't face. However, you are mistaken about something: you never have to lock a file for reading. grep/sed/awk will always work, even if a database is currently using the file. The only time you need to lock files is when you're writing (and that only blocks writers). So since you shouldn't ever be modifying your logs, neither the database nor grep/sed/awk needs to lock the files.
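To make the advisory-locking point concrete, here's a toy Python sketch (Linux assumed; the log path and record are invented): flock only coordinates processes that ask for the lock, so grep, which never asks, reads straight through it.

    import fcntl

    # Writer takes an exclusive *advisory* lock; only cooperating
    # processes that also call flock() will ever wait on it.
    with open("app.log", "a") as log:
        fcntl.flock(log, fcntl.LOCK_EX)
        log.write("2013-05-15T00:17:00Z,sshd,info,session opened\n")
        log.flush()
        fcntl.flock(log, fcntl.LOCK_UN)
    # A concurrent `grep sshd app.log` is never blocked at any point,
    # because grep doesn't take the advisory lock.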

"So I ask you now, what is the advantage of having a text SQL engine over being able to export text from a binary SQL database?"


It's easier and more reliable to get at the data, with everything that implies.

"The only criticism I can give merit to as a real fundamental problem is if you don't trust the database implementation to produce reliable results (for fun: I challenge you to find an instance of a popular database having produced unreliable query results on working hardware). For everything else you could export a text copy, and even then I have to reiterate that it's my honest belief that for anyone who is proficient with SQL, hardly any would want to use the text tools by choice. SQL really is superior even for simple ad hoc queries."


I've had issues where MS SQL Server didn't delete rows from a table, yet they also didn't show up in queries... (they did appear in Management Studio, though.)

Anyway, I disagree: SQL is not superior for simple queries. I would much rather use grep. We probably aren't going to agree here.

"Even if I'm just browsing the data and not manipulating it, I'd rather have a tabular spreadsheet interface over a flat file one."


LibreOffice Calc can open CSV files as a spreadsheet...

"I do appreciate how cleverly the text tools can be used in a unix shell, but the more I think about it the more I like the database approach. Maybe we need to stop thinking about it as 'binary' versus 'text', and think about it as different degrees of data abstraction."


I don't really think that that's the issue. I really do think there's just some misunderstanding going on. I mean, I haven't been saying "don't use SQL" or "the relational model sucks!" I've been saying "I want to be able to read my log files with any old tool, but adding in a query engine would be cool too." The only issue that could arise with what I want is performance, and not only do I think log files are unlikely to have that issue, I think it can be solved without abandoning text.
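As a rough sketch of what I mean (Python, with an invented file name and column layout): the log stays a plain CSV file that grep can chew on, and an ad hoc query layer just imports it read-only when you want SQL.

    import csv, sqlite3

    # Load the plain-text log into a throwaway in-memory database.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE log (ts TEXT, unit TEXT, msg TEXT)")
    with open("syslog.csv", newline="") as f:
        con.executemany("INSERT INTO log VALUES (?, ?, ?)", csv.reader(f))

    # The same question two ways: `grep sshd syslog.csv` from the shell,
    # or a structured query against the imported copy.
    for ts, msg in con.execute("SELECT ts, msg FROM log WHERE unit = 'sshd'"):
        print(ts, msg)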

Reply Parent Score: 2

RE[15]: Too funny
by Alfman on Wed 15th May 2013 03:18 in reply to "RE[14]: Too funny"
Alfman, Member since: 2011-01-28

satsujinka,


"Again, no issues that other formats don't face. However, you are mistaken about something. You don't have to lock a file for reading; ever. grep/sed/awk will always work even if a database is currently using a file. The only time you need to lock files is when you're writing (and that only blocks writers.)"


The difference is that databases normally aren't designed to have their datastores read/written by external processes while they're in use, so the problem doesn't really come up at all. Nevertheless, I do want to point out that even for textual databases, readers do need to be blocked and/or intercepted in order to prevent incomplete writes and incomplete transactions from being seen by the reader.

If you don't have a database background, you might not realize that transactions can involve many non-contiguous records, such that without locking you'd end up with a race condition between the reader and writer starting and completing their work in the wrong order.
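A contrived Python sketch of that race (the two-record "transaction" is invented for illustration): a transfer is two related rows, and a reader scanning between the two appends sees only half of it.

    import threading, time

    log = []  # stand-in for the shared text file

    def writer():
        log.append("txn42,debit,alice,100")
        time.sleep(0.1)                      # a reader can sneak in here
        log.append("txn42,credit,bob,100")   # transaction "completes"

    t = threading.Thread(target=writer)
    t.start()
    time.sleep(0.05)
    # The reader sees the debit but not the matching credit:
    print([r for r in log if r.startswith("txn42")])
    t.join()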



"I've had issues where MS SQL Server doesn't delete rows from a table, but they also don't show up in queries... (they did appear in Management Studio though.)"


It's not a bug, it's a feature ;)

In the absence of a confirmed bug report, my gut feeling is that the most likely cause of your problem was an uncommitted transaction. Maybe you deleted and queried the result *inside* one transaction, then from another program you still saw the original records. You wouldn't see the deletions until the first transaction was committed. This can surprise you if you aren't expecting it, but it makes sense once you think about it.

When you program in PL/SQL, for example, by default all the changes you make (across many tables and even schemas) remain uncommitted. You can execute one or more PL/SQL programs and update statements from your IDE and then query the results, but until you hit the commit button, no one else can see your changes. The semantics of SQL guarantee that all those changes are committed atomically. It's understandable that awareness of these SQL concepts is low outside the realm of SQL practitioners, given that they don't exist in conventional file systems or programming languages.
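You can see the same visibility rule with SQLite in a few lines of Python (only an analogy, since your case was MS SQL Server, and the file name here is made up): the second connection keeps seeing the row until the first one commits its delete.

    import sqlite3

    a = sqlite3.connect("demo.db")
    a.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER)")
    a.execute("DELETE FROM t")
    a.execute("INSERT INTO t VALUES (1)")
    a.commit()

    b = sqlite3.connect("demo.db")

    a.execute("DELETE FROM t WHERE id = 1")                 # not committed yet
    print(b.execute("SELECT COUNT(*) FROM t").fetchone())   # (1,) still visible
    a.commit()
    print(b.execute("SELECT COUNT(*) FROM t").fetchone())   # (0,) now it's gone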



"I mean, I haven't been saying 'don't use SQL' or 'the relational model sucks!' I've been saying 'I want to be able to read my log files with any old tool, but adding in a query engine would be cool too.' The only issue that could arise with what I want is performance. And not only do I think log files aren't likely to have this issue; but I think that it can be solved without abandoning text."

The thing is, if your software supports logging through a database interface, you could have a simple NULL database engine that does absolutely nothing except output a text record into a text file. That's essentially a freebie for you.
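Roughly this, in Python (the interface and field names are invented, not any real logging API): the structured call is the contract, and the trivial backend just flattens each record to one CSV line.

    import csv, time

    class NullTextEngine:
        """Does nothing but append one CSV record per log call."""
        def __init__(self, path):
            self.path = path

        def insert(self, unit, level, msg):
            stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
            with open(self.path, "a", newline="") as f:
                csv.writer(f).writerow([stamp, unit, level, msg])

    db = NullTextEngine("app.log")
    db.insert("sshd", "info", "session opened")  # grep-able the moment it lands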

The inverse is not true: processes that are developed to output to text files will need additional external scripts/jobs for parsing and inserting relational records into the database. Also, there's a very real risk that the logging program will not log enough information to maintain full relational integrity, because it wasn't designed with that use case in mind. Our after-the-fact import scripts are sometimes left fuzzy-matching records based on timestamps or whatnot. If the standard logging conventions had dictated that programs use a structured database record format from the get-go, such ambiguities wouldn't have arisen.


It's probably wishful thinking that all programs could shift to more structured logging at this point since we're already entrenched in the current way of doing things. But if we had it all to do over, it would make a lot of sense to give modern database concepts a more prominent primary role in our everyday operating systems.

Edited 2013-05-15 03:22 UTC

Reply Parent Score: 2

RE[16]: Too funny
by satsujinka on Wed 15th May 2013 05:59 in reply to "RE[15]: Too funny"
satsujinka, Member since: 2010-03-11

"The difference is that databases normally aren't designed to have their datastores read/written by external processes while they're in use, so the problem doesn't really come up at all. Nevertheless, I do want to point out that even for textual databases, readers do need to be blocked and/or intercepted in order to prevent incomplete writes and incomplete transactions from being seen by the reader."


I guess, but in the case of a log this isn't really going to be an issue. And this also depends on whether or not reading incomplete transactions is an issue.

Another thing I should point out: I was rather purposefully saying "query engine" before. Logs shouldn't be written to (by anything other than the logger), so there won't be any writing being done by the database in the first place. It's just another read client (like the standard tools would be).

"If you don't have a database background, you might not realize that transactions can involve many non-contiguous records, such that without locking you'd end up with a race condition between the reader and writer starting and completing their work in the wrong order."

Again, only if the reader cares what the writer is writing. In the case of a log, in which every record is only written once, this shouldn't be an issue.

"In the absence of a confirmed bug report, my gut feeling is that the most likely cause of your problem was an uncommitted transaction. Maybe you deleted and queried the result *inside* one transaction, then from another program you still saw the original records. You wouldn't see the deletions until the first transaction was committed. This can surprise you if you aren't expecting it, but it makes sense once you think about it."

I wasn't doing the manipulations directly; however, I'm fairly certain that the delete was committed (since it was a single transaction spawned by a REST command). I will admit that it may not have been entirely MS SQL Server's fault, but it was irritating enough that I'd really just rather have direct access to the data.

"When you program in PL/SQL, for example, by default all the changes you make (across many tables and even schemas) remain uncommitted. You can execute one or more PL/SQL programs and update statements from your IDE and then query the results, but until you hit the commit button, no one else can see your changes. The semantics of SQL guarantee that all those changes are committed atomically. It's understandable that awareness of these SQL concepts is low outside the realm of SQL practitioners, given that they don't exist in conventional file systems or programming languages."

I may not work with databases all the time, but I do have some experience, so I can pretty much guarantee that I don't have any issues with the commit/rollback functionality of databases. (In point of fact, I've been trying to get approval to modify my employer's web server so it doesn't just commit everything it does right away; instead, someone decided that implementing a custom history was a good idea...)

"The thing is, if your software supports logging through a database interface, you could have a simple NULL database engine that does absolutely nothing except output a text record into a text file. That's essentially a freebie for you."


This would be fine. The logs are in text and you can use SQL. That fits my requirements just fine.

"The inverse is not true: processes that are developed to output to text files will need additional external scripts/jobs for parsing and inserting relational records into the database. Also, there's a very real risk that the logging program will not log enough information to maintain full relational integrity, because it wasn't designed with that use case in mind. Our after-the-fact import scripts are sometimes left fuzzy-matching records based on timestamps or whatnot. If the standard logging conventions had dictated that programs use a structured database record format from the get-go, such ambiguities wouldn't have arisen."


This is actually what I've been saying: the log should be structured in a record format. CSV is a record format, so that's the example I've been using (it has the added bonus of working with existing tools, but so long as the format can be read by humans I don't care). The only additional requirement I have is that the record format should also be human readable.

Hell, the log could be an .sql file for all I care.
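That requirement in miniature (Python; the line format is invented): one and the same record satisfies a dumb line scan, which is all grep does, and a structured parser.

    import csv, io

    line = "2013-05-15T05:59:00Z,sshd,info,session opened\n"

    print("grep-style :", "sshd" in line)            # plain substring scan
    ts, unit, level, msg = next(csv.reader(io.StringIO(line)))
    print("structured :", unit, level, msg)          # same line, parsed as a record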

"It's probably wishful thinking that all programs could shift to more structured logging at this point since we're already entrenched in the current way of doing things. But if we had it all to do over, it would make a lot of sense to give modern database concepts a more prominent primary role in our everyday operating systems."

I don't disagree with you. I do like the relational model, even if I'm not fond of SQL.

Reply Parent Score: 2