Linked by Thom Holwerda on Sat 11th May 2013 21:41 UTC
Windows "Windows is indeed slower than other operating systems in many scenarios, and the gap is worsening." That's one way to start an insider explanation of why Windows' performance isn't up to snuff. Written by someone who actually contributes code to the Windows NT kernel, the comment on Hacker News, later deleted but reposted with permission on Marc Bevand's blog, paints a very dreary picture of the state of Windows development. The root issue? Think of how Linux is developed, and you'll know the answer.
RE[20]: Too funny
by satsujinka on Thu 16th May 2013 06:08 UTC in reply to "RE[19]: Too funny"

No, you as a programmer would be using the higher level abstraction of the tuple without caring about the mechanics used to implement it. You keep thinking in terms of programs parsing text streams, but with the tuple abstraction you can skip the intermediary text conversions entirely. You only need code to convert the tuple to text at the point where text is the desired form of output, like in the shell or a logfile. I'm not sure you're understanding this point.

Going back to your interface/class metaphor. I'm not the guy using the interface. I'm the guy writing the interpreter/compiler/VM that dispatches to the correct class on a particular method call from an interface. In which case, I do care how things are implemented, because I want to implement them!

As it stands, there is no hardware notion of a tuple. We just have data streams. So we either have to delimit the data or we have to multiplex the stream. If there are other options then please let me know, but "use higher abstraction" is not a means to implement a system.

If you're not interested in discussing how to implement a relational I/O stream abstraction (which we both already agree would be nice), I guess there's really nothing else to talk about.

Moving back, my methodology is perfectly sound. I was not trying to show CSV is easier to parse. I was disproving your claim that CSV is particularly hard to parse. The fact that 2 common formats (1 of which you recommended) take more work to parse than CSV, in any general purpose language, is a sound disproof of your claim.

Next:
XML does not and cannot provide self-defining metadata. Consider your example of XML providing names. What metadata do those names provide? What metadata does <contains>...</contains> provide? That something is contained? What if it's a type declaration? (x contains this type) In order to answer the question of what metadata is present, we need some context, and if we need a context, then by definition our metadata is not self-defined. This is true of all data formats. Within a context it does make sense to say that some data has a meaning, but outside of it? No.

So back to what I originally said: XML is matched on name and CSV is matched on position. This is how we determine meaning for these two formats. Metadata follows this rule too. In XML we specify metadata with a known name (contains?) and in CSV we specify metadata with a known position (2nd?)
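
For a concrete (made-up) example of matching by name versus matching by position:

import xml.etree.ElementTree as ET

xml_rec = ET.fromstring('<record name="kernel" level="warn"/>')
csv_rec = "kernel,warn".split(",")

print(xml_rec.get("level"))   # meaning comes from knowing the attribute name "level"
print(csv_rec[1])             # meaning comes from knowing that column 2 is the level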


RE[21]: Too funny
by Alfman on Thu 16th May 2013 15:48 in reply to "RE[20]: Too funny"

satsujinka,

"If you're not interested in discussing how to implement a relational I/O stream abstraction (which we already both agree would be nice,) I guess there's really nothing else to talk about."

I'll play along, but I must emphasize that any implementation can be swapped out and all programs using the record I/O library would continue to work, oblivious to how the library is implemented; the internals are irrelevant to them. If you're wondering why I'm being such a stickler here, it's because I don't want to hear you criticizing an implementation because it happens to be binary. That's like criticizing C for doing calculations in binary under the hood when you want your numbers to be decimal. Binary is used under the hood because it's more efficient; it doesn't affect your program's I/O.


"As it stands, there is no hardware notion of a tuple. We just have data streams. So we either have to delimit the data or we have to multiplex the stream. If there are other options then please let me know, but 'use higher abstraction' is not a means to implement a system."

We actually have much more than data streams; we have a wealth of data structures as well that can be transferred via shared memory pages to create much more powerful IPC than simple data streams. These aren't used often because of 1) the lack of portability between platforms, 2) the lack of network transparency, and 3) the fact that many developers never learned about them. Nevertheless, I just wanted to bring it up to point out that IPC isn't limited to streams.
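
Just as a rough illustration of non-stream IPC (Python's multiprocessing shared memory; the segment name and record layout here are made up, not a proposal):

from multiprocessing import shared_memory
import struct

# Producer: place a length-prefixed record directly into a shared memory segment.
shm = shared_memory.SharedMemory(create=True, size=4096, name="demo_records")
payload = b"field1\x00field2"
shm.buf[0:4] = struct.pack("<I", len(payload))
shm.buf[4:4 + len(payload)] = payload

# Consumer (normally another process): attach by name and read the record back.
view = shared_memory.SharedMemory(name="demo_records")
(length,) = struct.unpack("<I", bytes(view.buf[0:4]))
record = bytes(view.buf[4:4 + length])

view.close()
shm.close()
shm.unlink()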


"Moving back, my methodology is perfectly sound. I was not trying to show CSV is easier to parse. I was disproving your claim that CSV is particularly hard to parse."

CSV in particular (as documented at https://en.wikipedia.org/wiki/Comma-separated_values) has several issues which your examples completely omitted. One would like to implement a trivial CSV parser this way:

- Read one line to fetch one record
- split on commas
- add fields to a name value collection (using external knowledge about field names)

This is easy, but it breaks on legitimate CSV input files.

1. Quoting characters may or may not be present, based on value heuristics.
2. The significance of whitespace is controversial & ambiguous between implementations (i.e. users often add whitespace to make the columns align, but that shouldn't be considered part of the data).
3. There are data escaping issues caused by special characters that show up inside a field (quotes, newlines, commas); these need to be quoted and escaped.
This is especially troubling because the main record-fetching logic has no idea whether a particular newline marks the end of a record or is a data byte without knowing the quoting context in which it appears, which gets even more complicated once you consider that quote characters themselves have their own rules.

Ideally you could identify record boundaries without any knowledge of field-level quotation; not being able to do so is a very serious deficiency of CSV, IMHO.
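
To make that concrete, here's a small sketch (Python, purely illustrative; the sample data is made up) of how the naive approach falls over while a quote-aware parser copes:

import csv, io

data = 'name,comment\n"Smith, John","said ""hi""\non two lines"\n'

# Naive parser: one line per record, split on commas.
naive = [line.split(",") for line in data.splitlines()]
# -> the quoted comma and the embedded newline both break the record apart incorrectly

# A quote-aware reader has to track quoting state across commas and newlines.
proper = list(csv.reader(io.StringIO(data)))
# -> [['name', 'comment'], ['Smith, John', 'said "hi"\non two lines']]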


It turns out that XML is somewhat easier to parse without a fancy state machine because the characters used in specifying the records/fields are never present inside text fields. I'm not saying XML is the right choice; its support for rich multidimensional structures makes it complicated for other reasons. But for the sake of argument just consider this subset:

<record a="a" b="b&quot;" c="b&lt;"/>

When the parser reaches a "<" the parser can ALWAYS find the end of that node by searching for ">".
When the parser reaches a quote, it can ALWAYS find the end of the string by finding the next quote. It's trivial because special characters NEVER show up in the data. This is much easier than with CSV.
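
Here's a rough sketch of that scanning approach (Python, purely illustrative; the regex and helper choices are mine, not part of any spec):

import re
import xml.sax.saxutils as sx

line = '<record a="a" b="b&quot;" c="b&lt;"/>'

# '>' never occurs inside escaped attribute data, so the node always ends at the next '>'.
node = line[line.index("<") + 1 : line.index(">")]
# The closing quote of each value is always the next quote character.
fields = dict(re.findall(r'(\w+)="([^"]*)"', node))
fields = {k: sx.unescape(v, {"&quot;": '"'}) for k, v in fields.items()}
# -> {'a': 'a', 'b': 'b"', 'c': 'b<'}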

As far as escaping and unescaping values, here's a trivial implementation of escaping:

// escape a string
str = replace(str, "&", "&amp;");
str = replace(str, "<", "&lt;");
str = replace(str, ">", "&gt;");
str = replace(str, "\"", "&quot;");

I hope this example helps clarify why CSV is awkward to implement for arbitrary data.



"You do go on to say 'redesign bash to handle this'. I assume you also mean 'provide a library that has multiplexed stdin/stdout' as you also have to write and read from an arbitrary number of stdin/stdouts (as corresponds to the number of fields.)"

Hmm, I'm not quite sure what you're saying, but I was saying that since all applications would be communicating via data records, a shell like bash could receive these data records and then output them as text. When piped into a logger, the logger would take the records and save them using whatever format it uses. The under-the-hood format and even the specific IPC mechanism used by this new record-I/O library could be left unspecified so that each OS could use the mechanism that works best for it.


Now the moment you've been waiting for... My implementation would probably use a binary format with length prefixes (aka pascal strings) so that one wouldn't need to scan through text data at all (eliminating 100% of the quoting & escaping issues). Moreover, the parser would know the length of the text right away without having to scan through its bytes first. This way it could allocate new string variables of exactly the right size. Without the length prefix, you'd have 2 options: 1) use one pass to determine the length first and then another to copy the data, or 2) guess the size of string needed and then dynamically grow it when it overflows.
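
For illustration, here's a rough sketch of what reading and writing such length-prefixed records might look like (Python; the 4-byte little-endian field-count/length layout is just one arbitrary choice, not a spec):

import struct, io

def write_record(stream, fields):
    stream.write(struct.pack("<I", len(fields)))             # field count
    for f in fields:
        data = f.encode("utf-8")
        stream.write(struct.pack("<I", len(data)))           # exact byte length up front
        stream.write(data)

def read_record(stream):
    (count,) = struct.unpack("<I", stream.read(4))
    fields = []
    for _ in range(count):
        (length,) = struct.unpack("<I", stream.read(4))
        fields.append(stream.read(length).decode("utf-8"))   # allocate exactly `length` bytes
    return fields

buf = io.BytesIO()
write_record(buf, ["May 16", "parser", 'contains "quotes", commas,\nand newlines'])
buf.seek(0)
print(read_record(buf))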


I hope this post is clear because I don't think we have much more time to discuss it.

Edited 2013-05-16 16:06 UTC


RE[22]: Too funny
by satsujinka on Thu 16th May 2013 19:05 in reply to "RE[21]: Too funny"

There's always time to discuss, even if the article gets buried. Just go to your comments and access the thread from there.

So would you have 3 length prefixes? Table, record, field? Or would you have the table be a field of the record (allowing arbitrary records in 1 stream)?

Ex w/ 3 length markers; numbers are byte lengths.
2 2 3 (3 bytes of data for field 1) 5 (5 bytes for field 2) 6 (record 2 field 1) 2 (record 2 field 2) (end of stream)

Ex w/ 2 markers. Same data, note that the number of fields is required for each record now. This adds (tableIdbytes + 1) * numRecords bytes to the stream.
3 5 (rec.1 table) 3 (rec.1 field1) 5 (rec.1 field2) 3 5 (rec.2 table) 6 (rec.2 field1) 2 (rec.2 field2)

Interestingly, the data could be binary or text here. While the prefixes have to be read as binary, everything else doesn't have to be (and in most languages while(character-- != 0) makes sense...)

"1. Quoting characters may or may not be present based on value heuristics. 2. The significance of whitespace is controversial & ambiguous between implementations (ie users often add whitespace to make the columns align, but that shouldn't be considered part of the data)."

These are issues with arbitrary CSV-like formats. Given the opportunity to standardize (which we have), it's straightforward to mandate escapes (just as XML does) and declare leading whitespace as meaningless.
"3. There are data escaping issues caused by special characters that show up inside the field (quotes, new lines, commas). These need to be quoted and escaped."

XML only has to escape quotes inside quotes; for a CSV-like format, commas can be ignored while inside quotes, and newlines can be escaped with "" (a doubled quote is usually the escape for a quote character, but I'm swapping it to mean newline, since there's another perfectly good quote escape, "\"", and this way I preserve record = line).

You keep saying that a "fancy state machine" is necessary, but XML requires one too. XML has quotes that need escaping, so you still need a state machine to parse XML.

Now having said that, I wouldn't use CSV in the first place. I'd use a DSV with ASCII 31 (Unit Separator) as my delimiter; since it's a control character, it has no business being in a field, so it can simply be banned. Newlines can be banned too, as they're a printing issue (appropriate escapes can be passed as a flag to whatever is printing). Which leaves us with no state machine necessary: the current field always ends at a unit separator and the current record always ends at a newline. Current tools are also able to manipulate this format (if they have a delimiter flag).
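
A minimal sketch of that unit-separator format (Python; it assumes US and newline really are banned from field data):

import io

US = "\x1f"   # ASCII 31, Unit Separator

def encode(record):
    return US.join(record) + "\n"

def decode(stream):
    for line in stream:
        yield line.rstrip("\n").split(US)   # no quoting state to track: a plain split is always correct

buf = io.StringIO()
buf.write(encode(["2013-05-16", "kernel", 'a message, with commas and "quotes"']))
buf.seek(0)
print(list(decode(buf)))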

Back to multiplexing:
Basically I was saying that one option is to pass tuples to n stdouts, where n is the number of fields in the tuple. These stdouts have to be synchronized, but this gives you your record's fields without having to do any escaping (as stdout n is exactly the contents of field n). So say you have
RELATION ( SOMETHING, SOMETHING_ELSE );
When printed, this gets broken in half and sent to 2 stdouts.
SOMETHING -> stdout 1
SOMETHING_ELSE -> stdout 2

The reverse is true for reading. Basically, we're using multiple streams (multiple files) to do a single transaction (with different stream = different field.)
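
A rough sketch of that idea (Python, with StringIO objects standing in for the extra synchronized stdouts; the function names are made up):

import io

def write_relation(streams, record):
    for stream, field in zip(streams, record):
        stream.write(field)               # stream n carries exactly the bytes of field n

def read_relation(streams):
    return tuple(s.getvalue() for s in streams)

streams = [io.StringIO(), io.StringIO()]
write_relation(streams, ("SOMETHING", 'SOMETHING_ELSE, with "anything,\neven newlines" in it'))
print(read_relation(streams))             # fields come back untouched, no escaping needed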
