Linked by Thom Holwerda on Sat 11th May 2013 21:41 UTC
Windows "Windows is indeed slower than other operating systems in many scenarios, and the gap is worsening." That's one way to start an insider explanation of why Windows' performance isn't up to snuff. Written by someone who actually contributes code to the Windows NT kernel, the comment on Hacker News, later deleted but reposted with permission on Marc Bevand's blog, paints a very dreary picture of the state of Windows development. The root issue? Think of how Linux is developed, and you'll know the answer.
RE[16]: Too funny
by satsujinka on Wed 15th May 2013 05:59 UTC in reply to "RE[15]: Too funny"

"The difference is that normally databases aren't designed to have their datastores read/written by external processes as they're being used, so the problem doesn't really come up at all. Nevertheless I do want to point out that even for textual databases, readers do need to be blocked and/or intercepted in order to prevent incomplete writes & incomplete transactions from being seen by the reader."


I guess, but in the case of a log this isn't really going to be an issue. And this also depends on whether or not reading incomplete transactions is an issue.

Another thing I should point out. I was rather purposefully using "query engine" before. Logs shouldn't be written to (by anything other than the logger,) so there won't be any writing being done by the database in the first place. It's just another read client (like the standard tools would be.)

"If you don't have a database background, you might not realize that transactions can involve many non-contiguous records, such that without locking you'd end up with a race condition between the reader/writer starting and completing their work in the wrong order."

Again, only if the reader cares what the writer is writing. In the case of a log, in which every record is only written once, this shouldn't be an issue.

"In the absence of a confirmed bug report, my gut feeling is that the most likely cause of your problem was an uncommitted transaction. Maybe you deleted and queried the result *inside* one transaction, then from another program you still saw the original records. You wouldn't see the deletions until the first transaction was committed. This can surprise you if you aren't expecting it, but it makes sense once you think about it."

I wasn't doing the manipulations directly, however, I'm fairly certain that the delete was committed (since it was a single transaction spawned by a REST command.) I will admit that it may not have been entirely MS SQL Server's fault, but it was irritating enough that I'd really just rather have direct access to the data.

"When you program in pl/sql, for example, by default all the changes you make (across many datatables and even schemas) remain uncommitted. You can execute one or more pl/sql programs & update statements from your IDE and then query the results, but until you hit the commit button, no one else can see your changes. The semantics of SQL guarantee that all those changes are committed atomically. It's understandable that awareness of these SQL concepts is low outside the realm of SQL practitioners, given that they don't exist in conventional file systems or programming languages."
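
(To make that visibility rule concrete, here's a rough sketch using Go's database/sql; the driver, connection string, and log_entries table are all placeholders, and the point is only that a second session doesn't see the delete until Commit:)

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // any SQL driver would do; this one is only an example
)

func main() {
	// Connection string and table name are placeholders.
	db, err := sql.Open("postgres", "dbname=example sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	tx.Exec(`DELETE FROM log_entries WHERE id = 42`) // visible only inside the transaction

	var n int
	// The pool hands this query a different connection, so it plays the role
	// of "another program": the row is still there until the commit.
	db.QueryRow(`SELECT count(*) FROM log_entries WHERE id = 42`).Scan(&n)
	fmt.Println("before commit, other sessions see:", n) // still 1

	tx.Commit()
	db.QueryRow(`SELECT count(*) FROM log_entries WHERE id = 42`).Scan(&n)
	fmt.Println("after commit:", n) // now 0
}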

I may not work with databases all the time, but I do have some experience; so I can pretty much guarantee that I don't have any issues with the commit/rollback functionality of databases (in point of fact, I've been trying to get approval to modify my employer's web server to not just commit everything it does right away; instead someone just decided that implementing a custom history was a good idea...)

"The thing is, if your software supports logging into a database interface, you could have a simple NULL database engine that does absolutely nothing except output a text record into a text file. That's essentially a freebie for you."


This would be fine. The logs are in text and you can use SQL. That fits my requirements just fine.

"The inverse is not true: processes that are developed to output to text files will need additional external scripts/jobs for parsing and inserting relational records into the database. Also there's a very real risk that the logging program will not log enough information to maintain full relational integrity because it wasn't designed with that use case in mind. Our after-the-fact import scripts are sometimes left to fuzzy-match records based on timestamps or whatnot. If the standard logging conventions dictated that programs used a structured database record format from the get go, such ambiguities wouldn't have arisen."


This is actually what I've been saying. The log should be structured in record format. CSV is a record format so that was the example I've been using (it has the added bonus of working with existing tools, but so long as the format can be read by humans I don't care.) The only additional requirement I have is that the record format should also be human readable.

Hell, the log could be an .sql file for all I care.

"It's probably wishful thinking that all programs could shift to more structured logging at this point since we're already entrenched in the current way of doing things. But if we had it all to do over, it would make a lot of sense to give modern database concepts a more prominent role in our everyday operating systems."

I don't disagree with you. I do like the relational model, even if I'm not fond of SQL.


RE[17]: Too funny
by Alfman on Wed 15th May 2013 15:18 in reply to "RE[16]: Too funny"

satsujinka,

"I guess, but in the case of a log this isn't really going to be an issue. And this also depends on whether or not reading incomplete transactions is an issue."

We seem to keep floating between two separate topics here: 1) logging, and 2) implementing a generic SQL database using a text engine. I think I've made clear why implementing a full SQL database over text is more trouble than it's worth. I think you've made clear that it could nevertheless work for logging, since it's append only. Personally I wouldn't see the point in using a special kind of database just for logs, but I'm not denying that it could be done.


Regarding NULL DB engines:
"This would be fine. The logs are in text and you can use SQL. That fits my requirements just fine."

This would be my preferred solution.


"This is actually what I've been saying. The log should be structured in record format. CSV is a record format so that was the example I've been using (it has the added bonus of working with existing tools, but so long as the format can be read by humans I don't care.) The only additional requirement I have is that the record format should also be human readable."

This is not what I meant. For one thing, CSV's data escaping rules are non-trivial and require a state machine to generate & parse CSV character by character. Very often I've seen CSV data feeds output by trivial means that don't even escape the data fields at all. Sometimes this problem is not noticed until someone enters a comma on a production machine, causing fields to become misaligned. More importantly though, CSV would be a poor choice because records don't contain field metadata: the reader has to be programmed with some template just to know what each column means. This ambiguity is unacceptable when we try to insert records into the database. So XML would technically be better, but this isn't what I meant either.
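
(To illustrate the escaping point with Go's encoding/csv, which does the state-machine work for you; the field values here are made up:)

package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"strings"
)

func main() {
	fields := []string{"42", `Doe, John "JD"`, "login failed"}

	// The naive approach: join on commas. The embedded comma and quotes
	// silently misalign the columns for any downstream reader.
	fmt.Println(strings.Join(fields, ","))

	// encoding/csv quotes and escapes as needed, which is exactly the
	// work that trivial writers skip.
	w := csv.NewWriter(os.Stdout)
	w.Write(fields)
	w.Flush()
}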

I think all programs should be using a structured library interface directly without bothering with the text conversion at all. It could be a library call similar to printf, but it would have to be capable of maintaining field metadata. This call would not output text (optionally it could for interactive debugging); instead it would transmit the log record to the system logger.

You, as the administrator, could set up the system logger to take in the *structured data* and do whatever you please with it. You could output text files (whether csv, xml, yaml, json, or whatever suits your fancy), you could store the records in the database of your choosing, you could filter/throw them out without logging, you could even have special triggers to perform various actions as things are happening. This could be highly efficient, as there isn't a need to convert integers to text and back again or to scan text values of unknown length, as is necessary with a text format.
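
(For a sense of what I mean, here's a rough Go sketch; the package, type, and function names are all invented:)

package structlog

import "time"

// Field keeps the metadata (a name, and implicitly a type) next to the value,
// which is exactly what a printf-style text call throws away.
type Field struct {
	Name  string
	Value interface{}
}

// Record is the unit handed off to the system logger.
type Record struct {
	Time   time.Time
	Source string
	Fields []Field
}

// sink is whatever backend the administrator configured: a text file writer,
// a database insert, a filter that drops the record, a trigger, etc.
var sink = func(Record) {}

// Emit plays the role of the printf-like call. It never formats text itself;
// it just builds the record and hands it to the configured sink.
func Emit(source string, fields ...Field) {
	sink(Record{Time: time.Now(), Source: source, Fields: fields})
}

A daemon would then call something like structlog.Emit("sshd", structlog.Field{Name: "event", Value: "login-failed"}, structlog.Field{Name: "uid", Value: 1001}) and never format any text itself.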

As a programmer trying to integrate subsystems, I find this far more appealing than having daemons write out text logs and then programming monitoring scripts to parse text logs back into a structural form before being able to do something useful with them. The goal would be for all programs to build on top of interfaces which enable a higher degree of data abstraction. The lowest common denominator would get raised to a data tuple instead of a line of text, as it stands today.




This is getting long winded, but since it's relevant: I was actually discussing this topic with Neolander. The vision was not just for logging, but actually to replace all sorts of text I/O streams with data tuples. When I do "route -n" or "iptables -L", the programs could open a data tuple output stream instead of (or in addition to) a text output stream. Bash could be modified to support these structured data output streams and work with them.

Some examples:
iptables -L # dump human output to console.
iptables -L | spreadsheet # open tuples in spreadsheet
iptables -L | gencsv > file.csv # save tuples as csv
iptables -L | genxml > file.xml # save tuples as xml
iptables -L | genjson > file.json # ...
parsexml file.xml | genjson > file.json

iptables -L | mysqlinsert database.datatable # insert tuples into database



Note that in these examples, iptables doesn't care how the structured data gets used, and the receiving tools don't care what is producing the data. Unlike today, no parsing would be needed. This would all be possible if we could get away from text as the least common denominator and transition to data tuple based IO. (This is why I said it's best not to think in terms of "text" versus "binary", but in terms of data abstractions)
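
(To give a feel for the receiving side, here's a rough sketch of a "gencsv" style tool; the wire format assumed here, one JSON object per line, is only a stand-in for whatever the real tuple stream would be:)

package main

import (
	"bufio"
	"encoding/csv"
	"encoding/json"
	"os"
	"sort"
)

func main() {
	in := bufio.NewScanner(os.Stdin)
	out := csv.NewWriter(os.Stdout)
	defer out.Flush()

	var header []string
	for in.Scan() {
		var tuple map[string]string
		if err := json.Unmarshal(in.Bytes(), &tuple); err != nil {
			continue // skip malformed records in this sketch
		}
		// The first tuple's field names become the CSV header.
		if header == nil {
			for k := range tuple {
				header = append(header, k)
			}
			sort.Strings(header)
			out.Write(header)
		}
		row := make([]string, len(header))
		for i, k := range header {
			row[i] = tuple[k]
		}
		out.Write(row)
	}
}

Then "iptables -L | gencsv > file.csv" is just this program on the other end of a pipe.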


I find these ideas very inspirational and extremely compelling, but I'm not sure if there's any chance of convincing existing mainstream operating systems to change their way of doing things. If I were still working on my own OS I would certainly do it this way.


RE[18]: Too funny
by satsujinka on Wed 15th May 2013 22:09 in reply to "RE[17]: Too funny"

Okay, so moving out of the logging topic to databases in general as an organizational system for an operating system.

About CSV vs. XML: considering Golang's csv and xml packages, the two Go files for csv have a combined size smaller than three of the four Go files for xml (the fourth is approximately the same size as csv's reader.go). To me this implies that CSV doesn't have any escaping issues that are particularly harder to solve than XML's or JSON's (JSON actually has the most code dedicated to it.)

Of course, part of this is that csv provides the smallest feature set. However, comparing similar functionality leads to the same conclusion.

As for metadata: you have to provide a schema no matter what data format you choose. XML isn't better in this regard; usually you match tag to column name. CSV has a similar rule: match on column position. I know in relational theory the columns are unordered, but in practice the columns are created/displayed with an order; just use that. Optionally, you can write a schema to do the matching. This is actually a better situation than XML, which requires a schema all the time (what do we do with nesting? I can think of 3 reasonable behaviors off the top of my head.)
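
(In Go terms, "match on column position" is nothing more than applying an ordered list of names to each record by index; the column names here are just for the example:)

package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	// The schema is simply the column order.
	schema := []string{"Id", "First Name", "Last Name"}

	r := csv.NewReader(strings.NewReader("001,John,Doe\n002,Mary,Jane\n"))
	records, err := r.ReadAll()
	if err != nil {
		panic(err)
	}

	for _, rec := range records {
		row := make(map[string]string, len(schema))
		for i, name := range schema {
			row[name] = rec[i]
		}
		fmt.Println(row)
	}
}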

---

"I think all programs should be using a structured library interface directly without bothering with the text conversion at all. It could be a library call similar to printf, but it would have to be capable of maintaining field metadata. This call would not output text (optionally it could for interactive debugging); instead it would transmit the log record to the system logger."

I'm not opposed to this in principle. However, I fear figuring out what this "printf"'s interface should be will not be so simple. Does it expect meta-data co-mingled with data? Does it take a schema as a parameter? Isn't "%s:%d" a schema already (one whose fields are anonymous, but paired with scanf, you can write and retrieve records with it)? Also, what should we use for a schema format? Or should we just choose to support as many as possible?
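
(For instance, this round trip treats the format string as an anonymous-field schema; a space separator is used instead of ':' only because Go's %s verb is greedy up to whitespace:)

package main

import "fmt"

func main() {
	// Write a "record" with the format string as its schema.
	line := fmt.Sprintf("%s %d", "disk-full", 3)

	// Read it back with the same schema.
	var event string
	var count int
	fmt.Sscanf(line, "%s %d", &event, &count)
	fmt.Println(event, count) // disk-full 3
}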

"The vision was not just for logging, but actually to replace all sorts of text I/O streams with data tuples."

What would these data tuples look like? You'll need some way to mark where these tuples begin and end, where their fields break, and what relation they belong to. End can double as begin, so only 3 symbols are necessary (but the smallest binary encoding that can hold 3 symbols is 2 bits, so you may as well use 4.) If you omit table separators, then you need to include a relation field.

With 4 symbols:
Bin Txt Description
00 - ( - tuple start
01 - , - field break
10 - ) - tuple end
11 - ; - table end

Ex.
(Id,First Name,Last Name)(001,John,Doe)(002,Mary,Jane);

With 3 symbols:
00 - , - field break
01 - \n - tuple end
10 - \d - table end

Ex.
Id,First Name,Last Name
001,John,Doe
002,Mary,Jane
\d
... Hey, wait a minute that's CSV! ;)

With 2 symbols:
0 - , - field break
1 - ; - tuple break

Ex.
Person,Id,First Name,Last Name;
Person,001,John,Doe;
Person,002,Mary,Jane;
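
(For what it's worth, a writer for the 4-symbol text form above is only a few lines of Go; purely illustrative, and a real stream would also need an escaping rule for those four characters inside field values:)

package main

import (
	"fmt"
	"strings"
)

// writeTable emits one table using '(' tuple start, ',' field break,
// ')' tuple end and ';' table end.
func writeTable(rows [][]string) string {
	var b strings.Builder
	for _, row := range rows {
		b.WriteString("(")
		b.WriteString(strings.Join(row, ","))
		b.WriteString(")")
	}
	b.WriteString(";")
	return b.String()
}

func main() {
	fmt.Println(writeTable([][]string{
		{"Id", "First Name", "Last Name"},
		{"001", "John", "Doe"},
		{"002", "Mary", "Jane"},
	}))
	// Output: (Id,First Name,Last Name)(001,John,Doe)(002,Mary,Jane);
}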

Just to make it clear: I too want a standard format that everything uses. It's just that saying "use tuples" ignores the fact that we still have to parse information out of our inputs in order to do anything. You do go on to say "redesign bash to handle this". I assume you also mean "provide a library that has multiplexed stdin/stdout", as you also have to write to and read from an arbitrary number of stdins/stdouts (corresponding to the number of fields.) Alternatively, you could shift to a bytecode-powered shell (so that all programs use the same representations as the shell and can simply copy their memory structures to the shell.)
