Linked by Thom Holwerda on Sat 11th May 2013 21:41 UTC
Windows "Windows is indeed slower than other operating systems in many scenarios, and the gap is worsening." That's one way to start an insider explanation of why Windows' performance isn't up to snuff. Written by someone who actually contributes code to the Windows NT kernel, the comment on Hacker News, later deleted but reposted with permission on Marc Bevand's blog, paints a very dreary picture of the state of Windows development. The root issue? Think of how Linux is developed, and you'll know the answer.
RE[22]: Too funny
by satsujinka on Thu 16th May 2013 19:05 UTC in reply to "RE[21]: Too funny"

There's always time to discuss, even if the article gets buried. Just go to your comments and access the thread from there.

So would you have 3 length prefixes? Table, record, field? Or would you have table be a field of the record (allowing arbitrary records in 1 stream).

Ex. w/ 3 length markers; numbers are byte lengths:
2 (records in table) 2 (fields per record) 3 (3 bytes of data for rec. 1 field 1) 5 (5 bytes for rec. 1 field 2) 6 (rec. 2 field 1) 2 (rec. 2 field 2) (end of stream)

Ex. w/ 2 markers (record and field); same data, but with the table carried as a field of each record. Note that the number of fields is now required for each record. This adds (tableIdbytes + 1) * numRecords bytes to the stream:
3 5 (rec. 1: 3 fields, then the 5-byte table field) 3 (rec. 1 field 1) 5 (rec. 1 field 2) 3 5 (rec. 2: field count and table field) 6 (rec. 2 field 1) 2 (rec. 2 field 2)

Interestingly, the data could be binary or text here. The prefixes have to be read as binary, but everything else doesn't (and in most languages a read loop like while (count-- != 0) makes sense...)
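A minimal sketch of the 3-marker framing above, assuming single-byte length prefixes and a fixed field count per table (the function names and the single-byte-prefix choice are mine; a real format would likely use varints or fixed-width integers):

```python
import io

def write_table(records):
    """Frame a table as: [record count] [field count], then one
    length prefix before each field's raw bytes."""
    out = io.BytesIO()
    out.write(bytes([len(records)]))        # table marker: number of records
    out.write(bytes([len(records[0])]))     # record marker: fields per record
    for record in records:
        for field in record:
            out.write(bytes([len(field)]))  # field marker: byte length
            out.write(field)
    return out.getvalue()

def read_table(data):
    """Invert write_table: no escaping, no state machine, just counted reads."""
    buf = io.BytesIO(data)
    n_records = buf.read(1)[0]
    n_fields = buf.read(1)[0]
    return [[buf.read(buf.read(1)[0]) for _ in range(n_fields)]
            for _ in range(n_records)]

# Field lengths 3, 5, 6, 2 reproduce the "2 2 3 ... 5 ... 6 ... 2" stream above.
stream = write_table([[b"foo", b"hello"], [b"sixsix", b"ok"]])
```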

1. Quoting characters may or may not be present based on value heuristics.
2. The significance of whitespace is controversial & ambiguous between implementations (i.e. users often add whitespace to make the columns align, but that shouldn't be considered part of the data).

These are issues with arbitrary CSV-like formats. Given the opportunity to standardize (which we have), it's straightforward to mandate escapes (just as XML does) and to declare leading whitespace meaningless.
3. There are data escaping issues caused by special characters that show up inside the field (quotes, new lines, commas). These need to be quoted and escaped.

Meanwhile CSV only has to escape quotes inside quoted fields; commas can be ignored while in quotes, and newlines can be escaped with "" (the doubled quote is usually the quote escape, but I'm swapping it to mean newline because there's another perfectly good quote escape, "\"", and this way I preserve record = line.)

You keep saying that a "fancy state machine" is necessary, but XML requires one too: XML has quotes that need escaping, so you still need a state machine to parse it.
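To make the point concrete, here's roughly the state machine that standard CSV quoting forces on a parser. This is my own minimal illustration of conventional CSV rules (quotes, "" as the escaped quote, commas), not the modified scheme described above:

```python
def parse_csv_line(line):
    """Parse one CSV record. The in_quotes flag is the state machine:
    a comma only terminates a field when we are outside quotes, and a
    doubled quote inside quotes is an escaped literal quote."""
    fields, buf, in_quotes = [], [], False
    i = 0
    while i < len(line):
        c = line[i]
        if in_quotes:
            if c == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    buf.append('"')   # "" -> literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                buf.append(c)          # commas are data while quoted
        else:
            if c == '"':
                in_quotes = True
            elif c == ',':
                fields.append(''.join(buf))
                buf = []
            else:
                buf.append(c)
        i += 1
    fields.append(''.join(buf))
    return fields
```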

Now, having said that, I wouldn't use CSV in the first place. I'd use a DSV with ASCII 31 (Unit Separator) as my delimiter; since it's a control character, it has no business being in a field and can simply be banned. Newlines can be banned too, as they're a printing issue (appropriate escapes can be passed as a flag to whatever does the printing). That leaves no state machine necessary: the current field always ends at a unit separator and the current record always ends at a newline. Current tools can also manipulate this format (if they have a delimiter flag).
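A sketch of that scheme, assuming fields are simply forbidden to contain the separator or a newline (function names are mine):

```python
US = "\x1f"  # ASCII 31, Unit Separator

def encode(records):
    """No escaping needed: US and newline are banned from field values,
    so joining is trivial."""
    for rec in records:
        for field in rec:
            if US in field or "\n" in field:
                raise ValueError("US and newline are banned in fields")
    return "\n".join(US.join(rec) for rec in records) + "\n"

def decode(text):
    """No state machine: split records on newline, fields on US."""
    return [line.split(US) for line in text.splitlines()]
```

Existing tools with a delimiter flag handle this too, e.g. `awk -F "$(printf '\037')" '{ print $2 }'` selects the second field.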

Back to multiplexing:
Basically I was saying that one option is to pass tuples to n stdouts, where n is the number of fields in the tuple. These stdouts have to be synchronized, but it gives you your record's fields without having to do any escaping (as stdout n is exactly the contents of field n). So say you have a two-field tuple: when printed, it gets broken in half and sent to 2 stdouts, field 1 -> stdout 1 and field 2 -> stdout 2.

The reverse is true for reading. Basically, we're using multiple streams (multiple files) to do a single transaction (with different stream = different field.)
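A small sketch of that multiplexing idea. This is my own illustration: I stand in for the n stdouts with in-memory streams, and I add a newline terminator within each stream so a reader knows where one record's value ends (the scheme described above might instead rely purely on synchronized reads):

```python
import io

def write_tuples(tuples, streams):
    """Demultiplex: field n of every tuple goes to stream n.
    The streams stay synchronized because each record writes exactly
    one value to every stream."""
    for tup in tuples:
        for value, stream in zip(tup, streams):
            stream.write(value + "\n")  # newline = assumed record boundary

def read_tuples(streams):
    """Multiplex back: reassemble records by taking one value from
    each stream in lockstep."""
    return [tuple(line.rstrip("\n") for line in lines)
            for lines in zip(*streams)]

# Two-field tuples split across two streams (different stream = different field).
out1, out2 = io.StringIO(), io.StringIO()
write_tuples([("foo", "bar"), ("x", "y")], [out1, out2])
```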
