Linked by Hadrien Grasland on Thu 19th May 2011 21:31 UTC
Hardware, Embedded Systems: Having read the feedback resulting from my previous post on interrupts (itself resulting from an earlier OSnews Asks item on the subject), I've had a look at the way interrupts work on PowerPC v2.02, SPARC v9, Alpha and IA-64 (Itanium), and I'm contributing this back to anyone who's interested (or willing to report any blatant flaw found in my posts). I've also tried to rework my interrupt handling model a bit, to make it significantly clearer and have it look more like a design doc and less like a code draft.
Thread beginning with comment 474095
RE[3]: Pop-up threads
by Neolander on Sat 21st May 2011 09:24 UTC in reply to "RE[2]: Pop-up threads"

"If those are your assumptions, then I can understand your conclusions, but your assumptions are pretty weak. Hypothetically, I could perform nearly any computation in parallel by dividing it into threads. But doing so does not imply better performance. That depends on the ratio of CPU work versus synchronization overhead."

"If I create a MT ray tracer on an 8 core processor, which will perform the best?
A) a new thread for each pixel
B) a new thread for each line
C) 8 threads, processing individual lines from a dispatch queue.
D) 8 threads, processing 1/8th of the image at a time

The answer here is obviously not A or B. For performance it never makes any sense to have more threads than cores.

C and D are probably close, but C could win since some cores might finish their work before others, leaving them idle."

Good point, I should try to avoid having more running threads than there are available CPUs. I like this ray tracer example a lot, except that I'd argue that in this case the synchronization overhead is close to zero, since IIRC ray tracers render individual pixels independently. Threading overhead here will be mostly caused by the cost of context switches and thread creation/annihilation. But that's totally nitpicking.

No, what leaves me wondering here is how I/O calls can be managed in this model.

You see, if you have 16 runnable threads on an 8-core processor and suddenly 4 of them have to do I/O, they can simply use blocking calls without worrying, because 4 of the 8 threads that weren't actually running will take their place.

Now, if you have 8 running threads and 8 tasks waiting in the dispatch queue, and suddenly 4 threads have to do I/O, then for optimal performance you need to replace these 4 threads with 4 new number-crunching threads created from the dispatch queue's tasks.

In this case, we have some overhead:
1/ Threads must synchronize with each other so that they don't start 4 new threads for the same pending task, in a textbook example of a race condition.
2/ 4 new threads must be created, which implies some context-switching overhead that could have been avoided if all threads had been created at a time when the kernel was already running.
3/ After a while, the I/O task is completed, and our 4 old threads are ready to run again. We are now dealing with 12 threads, so at this point 4 threads must be silenced for optimal performance. How do we choose them?

To solve problems 1/ and 2/, having lots of threads created from the start sounds like the most sensible option, if threads are sufficiently cheap. Well, I have to define "cheap", don't I? Let's do it: threads that are not running cost CPU time when they are created, and take some room in RAM because of OS management structures (a few bytes, negligible on modern computers) but also because of their stack. The CPU time cost can't be avoided, so creating lots of threads is only interesting for tasks which run for a sufficiently long time, as you point out. But the RAM cost can totally be avoided for threads that have never run, by having the stack allocated lazily at run time through a COW-like mechanism.
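For illustration, here is a minimal sketch of that lazy stack allocation, assuming a Linux-style mmap(); in the actual kernel, the same effect would come from the page fault handler committing stack pages the first time the thread touches them.

#include <stddef.h>
#include <sys/mman.h>

#define STACK_SIZE (64 * 1024)

/* Reserve address space for a pop-up thread's stack without committing
   physical RAM: anonymous pages only get backed by real memory once the
   thread first writes to them, so a thread that never runs costs almost
   nothing beyond its management structure. */
void *reserve_lazy_stack(void)
{
    void *base = mmap(NULL, STACK_SIZE,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                      -1, 0);
    return (base == MAP_FAILED) ? NULL : base;
}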

An interesting experiment would be a mix of my two pop-up thread models, where among the threads created, only <insert number of cores here> are allowed to run from the start, and when a thread blocks, the scheduler automatically picks another one from the queue.
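As a rough user-space sketch of that mixed model, assuming POSIX threads and semaphores (in the real design, the kernel scheduler would hand out the run slots transparently instead of having the threads cooperate like this, and the function names are only placeholders):

#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

static sem_t run_slots;                 /* one slot per CPU core */

/* Each pop-up thread competes for a run slot, so at most <core count>
   of them crunch numbers at any given time; a thread hands its slot
   back around a blocking call so that a queued thread can take over. */
static void *popup_thread(void *arg)
{
    (void)arg;
    sem_wait(&run_slots);               /* wait for a free core slot    */
    /* ... number crunching ... */
    sem_post(&run_slots);               /* give the slot up before I/O  */
    /* ... blocking I/O call ... */
    sem_wait(&run_slots);               /* take a slot again afterwards */
    /* ... more number crunching ... */
    sem_post(&run_slots);
    return NULL;
}

int spawn_popup_thread(pthread_t *tid)
{
    return pthread_create(tid, NULL, popup_thread, NULL);
}

/* Initialise the pool with one run slot per online core. */
int init_popup_pool(void)
{
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    return sem_init(&run_slots, 0, (unsigned)(cores > 0 ? cores : 1));
}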

3/ is more interesting. I think the older threads should run first, because otherwise the number of threads which have run (and, as such, have a stack allocated) keeps growing, in a waste of RAM.

"Fine, then your report should say that: the overhead of threads is outweighed by I/O bound factors."

The thing is, not all interrupts are equal from that point of view. As an example, processing keyboard or mouse interrupts is nearly pure I/O. But processing a stream of multimedia data can require significantly more number crunching, to the point where the amount of I/O is negligible, especially if DMA is being used.

"It's additionally possible to run a separate async handler on each core using CPU affinity, such that multiple async requests can run in parallel with no synchronization overhead at all."

Sorry, I did not understand this one. Can you please elaborate?

"Firstly, most I/O operations will not need threads in the first place; async is already sufficient, so why go through the overhead when it's not needed?"

Indeed, for pure I/O operations, it makes more sense to use pure async. As an example, an HDD driver is totally something which should be written in an async fashion, with a message queue.

I have to do more design work on this side.
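For what it's worth, here is roughly the shape I have in mind for such a driver; every name below is a hypothetical placeholder rather than an existing API. Since this single loop owns all of the driver's state, no locking is needed inside it.

enum msg_type { MSG_READ_REQUEST, MSG_IRQ_COMPLETION };

struct msg {
    enum msg_type type;
    unsigned long block;      /* requested block (for MSG_READ_REQUEST)  */
    int           reply_port; /* where to send the data when it's ready  */
};

/* Hypothetical primitives provided by the kernel and the controller code. */
struct msg mq_receive_blocking(void);
void disk_start_dma_read(unsigned long block, int reply_port);
void disk_finish_and_reply(void);

void hdd_driver_loop(void)
{
    for (;;) {
        struct msg m = mq_receive_blocking();
        switch (m.type) {
        case MSG_READ_REQUEST:            /* queue the DMA, don't block   */
            disk_start_dma_read(m.block, m.reply_port);
            break;
        case MSG_IRQ_COMPLETION:          /* the IRQ arrives as a message */
            disk_finish_and_reply();
            break;
        }
    }
}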

"Secondly, if you do have a CPU bound operation, then the cost of a syscall should be fairly negligible."

Very fair point.

"Thirdly, the cost of starting a microthread should be no worse in this scenario than when you start it by default (although I realize this is not strictly true for you since you're using a microkernel which is subject to additional IPC overhead)."

There is a difference in design philosophy, though: who carries the burden of creating and dispatching threads? In one case it's a third-party developer; in the other it's the operating system. I'd spontaneously believe that the second solution would result in more people considering the threaded option, since it costs them less.

"You may shave off cycles here and there, but have you ever asked the question 'do my tasks actually need threads?'"

I went at it a bit the other way around: threads sounded like the cleanest multicore-friendly way to implement the model I'm aiming at, of processes providing services to each other, and then I tried to apply this model to interrupt processing and drivers.

"Ok, but will you get the full async performance benefits if you're trying to emulate it inside a threaded model?"

If by performance benefits you mean the fact that async has only one task running at a time, and as such doesn't have to care about synchronization, and that the cost of pending tasks is kept minimal, then yes, this model may provide that. If you wonder whether the total cost of starting an interrupt-processing job is better or worse than in a pure async model, where you just send a message to the dispatcher's entry pipe, then I can't answer for sure, but I think both models have advantages and drawbacks.

Having an external dispatcher in the driver process implies a slight kernel-driver communication protocol overhead, which should make emulated async more efficient when the driver is initially idle. On the other hand, when the driver is initially busy, sending a message into a pipe is faster than creating an inactive pop-up thread.

"And it still bothers me that you characterized the async model as worse for performance."

Indeed, that part was bad. I wrote it too late in the evening, and didn't take the time to write an in-depth comparison of the two. I wasn't talking about performance, though, but about scalability.

"The thing that baffles me about your model for drivers is that it emphasizes threads which will rarely, if ever, be used without a mutex to serialize them again. If one driver forgets a mutex and accesses a device simultaneously from multiple CPUs, it is extremely likely to be a bug rather than a deliberate action."

The assumption behind this is that in many situations, the order in which things are processed only matters in the end, when the processing results are sent to higher-level layers. Like rendering a GUI: you can render individual controls in parallel and blit them in parallel as long as they don't overlap; it's only at the time of the final copy to the framebuffer that synchronization is needed.

I don't know how much it holds true for interrupt processing, though.


RE[4]: Pop-up threads
by Alfman on Sat 21st May 2011 11:14 in reply to "RE[3]: Pop-up threads"

I'm suffering from exhaustion at this point, but I'll try to respond quickly:


As to the number of threads versus cores, you could distinguish between "scheduled threads" and "active threads". To minimize thrashing, a scheduled thread should not start until there is a free core, unless of course there is a deadline/priority scheduler with a pending deadline on a higher-priority thread.

Hmm, I think you've got it figured out already.


"The thing is, not all interrupts are equal from that point of view. As an example, processing keyboard or mouse interrupts is nearly pure I/O."

I don't think the keyboard will be a problem.

"But processing a stream of multimedia data can require significantly more number crunching, to the point where the amount of I/O is negligible, especially if DMA is being used."

A surprising amount of I/O can be done without much CPU effort: copying files between disks, sending files over the network (with adapters supporting multi-address DMA and checksum offload). The CPU can get away with communicating memory addresses instead of actual bytes.
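To illustrate "addresses instead of bytes", a generic scatter-gather descriptor list could look like the sketch below; the layout is made up for illustration and doesn't match any particular adapter's format.

#include <stdint.h>

struct sg_descriptor {
    uint64_t phys_addr;   /* physical address of one buffer fragment */
    uint32_t length;      /* number of bytes in this fragment        */
    uint32_t flags;       /* e.g. "last descriptor in the list"      */
};

#define SG_LAST 0x1u

/* Build a two-fragment transfer: the CPU writes a few dozen bytes of
   descriptors while the DMA engine moves the actual payload. */
void build_transfer(struct sg_descriptor d[2],
                    uint64_t hdr, uint32_t hdr_len,
                    uint64_t payload, uint32_t payload_len)
{
    d[0] = (struct sg_descriptor){ hdr, hdr_len, 0 };
    d[1] = (struct sg_descriptor){ payload, payload_len, SG_LAST };
}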

However, in cases where the driver has to do more with the data (thousands of cycles), there is no doubt a thread is necessary to avoid adding latency to other pending requests.

Interestingly, the x86 interrupt hardware already has a form of prioritization without threads. A higher-priority (lower-numbered) IRQ can interrupt a handler even when a lower-priority IRQ hasn't been acked yet; the inverse is not true. So even long-running handlers for low-priority IRQs are theoretically OK, since they'll never interfere with high-priority IRQs.

...but the problem is that the driver code is no longer atomic, and therefore must be re-entrant. It becomes impossible to use a shared critical resource from two handlers. The second interrupt obviously cannot block, since doing so would block the first interrupt holding the critical resource as well. So, while this has its place in extremely low latency applications, I doubt it's helpful for you.


"Sorry, did not understand this one. Can you please elaborate ?"


Consider an async model on a single processor. Now multiply that by X processors running X async handlers.
Now, you can map each device to raise interrupts on a specific CPU, also give the driver affinity to that CPU. Now when an interrupt for the device occurs, the driver can safely assume an async model for all internal resources without any synchronization.
Assuming the devices are well distributed (statically) between CPUs, then they'll be serviced in parallel.

The requirement to route the IRQs to a specific CPU may not be strictly necessary, but it simplifies the example and reduces latency caused by transferring the request. This could even be true for a threaded model?
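A rough user-space flavoured sketch of this, in case it helps: pthread_setaffinity_np() is the Linux/GNU pinning call, while route_irq_to_cpu() and driver_async_loop() are made-up placeholders for the kernel-side pieces.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

void route_irq_to_cpu(int irq, int cpu);   /* hypothetical IRQ routing     */
void *driver_async_loop(void *dev);        /* hypothetical per-device loop */

/* Pin one async driver loop per device to a fixed CPU and route that
   device's IRQ to the same CPU, so the loop never shares its state
   across cores and needs no locks. */
int start_pinned_driver(int irq, int cpu, void *dev)
{
    pthread_t tid;
    cpu_set_t set;

    route_irq_to_cpu(irq, cpu);            /* interrupts now land on 'cpu' */

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    if (pthread_create(&tid, NULL, driver_async_loop, dev) != 0)
        return -1;
    /* The handler thread only ever runs on the same CPU as its IRQ. */
    return pthread_setaffinity_np(tid, sizeof(set), &set);
}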



"who carries the burden of creating and dispatching threads ? In one case, it's a third-party developer, in the other case it's the operating system. I'd spontaneously believe that the second solution would result in more people considering the threaded option, since it costs them less."

Well, there's no denying that last point. Unless you allow userspace threads which don't require a syscall?


"I've a bit went the other way around : threads sounded like the cleanest multicore-friendly way to implement the model of processes providing services to each other which I aim at, and then I've tried to apply this model to interrupt processing and drivers."

I think a clean & consistent design is a good reason to use light threads. However, part of the difficulty in providing an async userspace interface under linux (which is still a mess) was the fact that the kernel needed threads internally to handle blocking calls, even though there was no associated userspace threads to block. It's doable but not clean.


You hit some good points further down, but I'm so tired...I'd better not tackle them today.



RE[5]: Pop-up threads
by Neolander on Sat 21st May 2011 20:11 in reply to "RE[4]: Pop-up threads"

"Consider an async model on a single processor. Now multiply that by X processors running X async handlers.
Now, you can map each device to raise interrupts on a specific CPU, also give the driver affinity to that CPU. Now when an interrupt for the device occurs, the driver can safely assume an async model for all internal resources without any synchronization.
Assuming the devices are well distributed (statically) between CPUs, then they'll be serviced in parallel.

The requirement to route the IRQs to a specific CPU may not be strictly necessary, but it simplifies the example and reduces latency caused by transferring the request. This could even be true for a threaded model?"

Yup, I can't see a reason why processing IRQs on different processors couldn't just as well be done with a threaded model. The problem will be to balance the IRQ load fairly across all processors, I think. Some IRQs fire very frequently (the clock), some never happen (MachineCheck), but many interrupts fire frequently for a while and then stop happening for minutes or hours. As a silly example, sound card interrupts (buffer is full/empty) only occur when you play or record sound.

"Well, there's no denying that last point. Unless you allow userspace threads which don't require a syscall?"

Well, it's not as if I could forbid them ^^ But looking at Tanenbaum's book (my OS development bible), they seem to have several issues. Like, you know, the fact that you cannot use clock interrupts to distribute time fairly across them, or the fact that if a userspace thread blocks for I/O, all the other userspace threads are blocked at the same time, since they are all the same thread from the kernel's point of view.

"You hit some good points further down, but I'm so tired...I'd better not tackle them today."

I'd be happy to read this ;) In the meantime, I've done my homework too: here's a seriously written, in-depth explanation of why I use a microkernel, RPC as the main IPC method, etc., ending with a discussion of the threaded and asynchronous models based on our points in these comments, if you want to have a look...

http://theosperiment.wordpress.com/2011/05/21/clarifications-on-pro...



RE[4]: Pop-up threads
by Alfman on Sun 22nd May 2011 19:18 in reply to "RE[3]: Pop-up threads"

Neolander,

"If by performance benefits you mention the fact that async has only one task running at the time and as such doesn't have to care about synchronization and that pending tasks cost is kept minimal, then yes this model may provide that."

The thing is, you are going to have async and threaded drivers running side by side. This implies that even the async drivers will need to use synchronization when accessing shared resources.

I'll throw this out as an example:

Two userspace apps read from two different files. In the threaded model, this results in two user+kernel threads blocking in the file system driver, which has itself blocked in the block device driver.

Ignoring all the synchronization needed in the FS driver (and writes), the block driver (or the cache handler) must/should create a mutex around the blocks being read so that other threads requesting the same blocks are blocked until the read completes. I think such a structure would require at least two mutexes: one for the specific blocks being read, and another for the structure itself.

Therefore a thread reading a cached block would only need to synchronize on one mutex, find its data, and return immediately.

A thread reading an uncached block would synchronize on the structure mutex, latch onto a new or existing block-read mutex, queue the disk request (if new), and release the structure mutex.

This way the structure mutex is only ever held momentarily, and the read mutex is held until the disk reads the block, after which all blocked reader threads can resume.
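Something along these lines, perhaps; a sketch assuming pthreads, where the per-block "read mutex" is expressed as a condition variable (a pthread mutex can't legally be unlocked from the disk-completion thread), and cache_lookup_or_insert()/queue_disk_read() are made-up placeholders.

#include <pthread.h>
#include <stdbool.h>

struct cache_entry {
    bool            valid;     /* true once data[] holds the block        */
    pthread_cond_t  filled;    /* plays the role of the per-block "mutex" */
    unsigned char   data[4096];
};

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER; /* structure mutex */

/* Hypothetical helpers: the first one also initialises new entries
   (valid = false, condition variable ready to use). */
struct cache_entry *cache_lookup_or_insert(unsigned long block, bool *was_new);
void queue_disk_read(unsigned long block, struct cache_entry *e);

const unsigned char *read_block(unsigned long block)
{
    bool was_new;
    pthread_mutex_lock(&cache_lock);            /* held only briefly...       */
    struct cache_entry *e = cache_lookup_or_insert(block, &was_new);
    if (was_new)
        queue_disk_read(block, e);              /* first requester queues I/O */
    while (!e->valid)                           /* ...released while waiting  */
        pthread_cond_wait(&e->filled, &cache_lock);
    pthread_mutex_unlock(&cache_lock);
    return e->data;                             /* cached block: returns at once */
}

/* Called by the disk driver once the transfer for 'e' has completed. */
void block_read_completed(struct cache_entry *e)
{
    pthread_mutex_lock(&cache_lock);
    e->valid = true;
    pthread_cond_broadcast(&e->filled);         /* wake every blocked reader */
    pthread_mutex_unlock(&cache_lock);
}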

I expect that this is more or less what you'll be writing?


Now, my point about async drivers is that zero synchronization is needed. It's true that requests to this driver will be forced to be serialized.

However:
1) Many drivers, like the one described above, need a mutex to serialize requests anyway. (This could be mitigated by dividing the disk/structure into 4 regions so that concurrent threads are less likely to bump into each other, but then you need another mutex to serialize the requests to the disk, since I/O to one device cannot be executed from two CPUs simultaneously.)

2) If the driver's primary role is scheduling DMA I/O and twiddling memory structures with very little computation, then these tasks are not really suitable for parallelization, since the overhead of synchronization exceeds the cost of just doing the work serially.


"The assumption behind this is that in many situations, the order in which things are processed only matters in the end, when the processing results are sent to higher-level layers. Like rendering a GUI..."


Yes, I won't deny that some layers will benefit from parallelism. However, propagating that parallelism into drivers which are fundamentally serial in nature will make those drivers more complex and could even slow them down. These drivers will require thread synchronization when an async model could handle its state without synchronization (more on this later for the other poster).

I'd like to emphasize that I'm not arguing against the threaded model, particularly not where it makes or breaks the design of the paradigm. I'm just trying to suggest that there are cases where the best MT implementation performs worse than the best ST implementation, and device drivers may be one of those cases.


RE[5]: Pop-up threads
by Neolander on Tue 24th May 2011 09:49 in reply to "RE[4]: Pop-up threads"

Sorry for the delay, I have been too sick for complex thinking during the last few days.

"The thing is, you are going to have async and threaded drivers running side by side. This implies that even the async drivers will need to use synchronization when accessing shared resources."

Actually, you have this problem even if you have several distinct async drivers running side by side. As soon as there are shared resources, there is a synchronization overhead.

But I'm a bit curious about why separate drivers would need to share many resources with each other. It seems that you provide an example below, though, so I'm reading a bit further.

"I'll throw this out as an example:

Two userspace apps read from two different files. In the threaded model, this results in two user+kernel threads blocking in the file system driver, which has itself blocked in the block device driver."

Aggggghhhhhh... Disk I/O in drivers! u_u Banish these impure thoughts from your head before becoming a fiendish servant of Satan! You can still see the light!

Joking aside, I think that anything that requires some reactivity (and that includes most drivers) should never, ever depend on blocking file I/O. Nonblocking file I/O (like writing things in a log without caring when it's actually written to disk) is totally okay, on the other hand, but in that case we wouldn't have so many synchronization problems.

"Ignoring all the synchronization needed in the FS driver (and writes), the block driver (or the cache handler) must/should create a mutex around the blocks being read so that other threads requesting the same blocks are blocked until the read completes. I think such a structure would require at least two mutexes: one for the specific blocks being read, and another for the structure itself. (...)"

I've already admitted before that for something as I/O centric as a disk/block device driver, where little processing is involved, queuing events in an async model is probably best ;)

"Yes, I won't deny that some layers will benefit from parallelism. However, propagating that parallelism into drivers which are fundamentally serial in nature will make those drivers more complex and could even slow them down. These drivers will require thread synchronization when an async model could handle its state without synchronization (more on this later for the other poster)."

You see, this is actually something I'm wondering about. Are all drivers fundamentally serial in nature and doing little processing?

Some time ago, I was wondering about an efficient way to write an FTIR touchscreen driver. In case you've not heard about those, their sensitive layer uses infrared light trapped within a piece of glass through total internal reflection, which can only be scattered outside of the glass when something (like, say, a finger) comes in contact with it. A webcam captures the infrared image; everything from that point on must be done in software.

So we have to do a webcam driver, a blob detection/tracking system, and some kind of real time video output.

Now, I don't know what kind of data comes out of a webcam, but I guess it varies from one webcam to another. Maybe some will output raw bitmap pictures, some will output JPEG frames, and some will output MPEG-ish video with I-frames and P-frames. Most peripherals send an interrupt when their output buffer is full, so we do not know how many frames will have to be processed at once (especially for variable-sized frames like in MPEG video).

Suppose that our blob detection/tracking algorithm works with bitmap data. Then there is some amount of image conversion involved. In case the frames are sufficiently independent from each other (not guaranteed for MPEG, but certainly the case for JPEG), it's better to do the decoding in parallel, because it's essentially a CPU-intensive operation, and nothing scales better on multiple cores than independent operations.

Blob detection and tracking might work first by locating blobs in the initial pictures the brute-force way (thresholding + noise removal + locating sufficiently large clusters of pixels in the whole picture), and then by only looking for the blobs in a region around their former positions, based on their estimated speed. Since no writes are involved, no synchronization is needed, and looking for blobs in these slices of pictures can be done on a "one-thread-per-slice" basis, with communication between threads only needed when slices overlap each other and it's difficult to know to which slice a blob belongs.
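A rough sketch of that one-thread-per-slice search, assuming pthreads and a made-up find_blobs_in_rect() detector; each worker only reads the shared frame and writes its own result slot, so nothing needs locking until the per-slice results are merged.

#include <pthread.h>

#define NUM_SLICES 8

struct blob_list;   /* opaque result type, defined elsewhere */

struct slice_job {
    const unsigned char *frame;     /* shared, read-only bitmap          */
    int x, y, width, height;        /* region around a tracked blob      */
    struct blob_list *result;       /* private output slot of this slice */
};

/* Hypothetical detector: thresholding + noise removal + clustering. */
struct blob_list *find_blobs_in_rect(const unsigned char *frame,
                                     int x, int y, int w, int h);

static void *slice_worker(void *arg)
{
    struct slice_job *job = arg;
    job->result = find_blobs_in_rect(job->frame, job->x, job->y,
                                     job->width, job->height);
    return NULL;
}

void track_blobs(struct slice_job jobs[NUM_SLICES])
{
    pthread_t tid[NUM_SLICES];
    for (int i = 0; i < NUM_SLICES; i++)
        pthread_create(&tid[i], NULL, slice_worker, &jobs[i]);
    for (int i = 0; i < NUM_SLICES; i++)
        pthread_join(tid[i], NULL);
    /* Merging the slices' blob lists (the only step that needs care when
       slices overlap) would happen here. */
}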

