Linked by Hadrien Grasland on Thu 19th May 2011 21:31 UTC
Having read the feedback resulting from my previous post on interrupts (itself resulting from an earlier OSnews Asks item on the subject), I've had a look at the way interrupts work on PowerPC v2.02, SPARC v9, Alpha and IA-64 (Itanium), and am contributing this back to anyone who's interested (or willing to report any blatant flaw in my posts). I've also tried to rework my interrupt handling model a bit, to make it significantly clearer and have it look more like a design doc and less like a code draft.
RE: Pop-up threads
by Neolander on Fri 20th May 2011 22:29 UTC in reply to "Pop-up threads"

"I am concerned that you are still claiming that running threads is more scalable than a non-threaded approach. It is false. More flexible, yes, more scalable (or better performing), no."

If things which can run in parallel run in parallel, you get better scalability than with an all-sequential model like async. That's why we have threads at all. They are cheap (or more exactly can be made so). Why not use them?

"As per usual, I'll note that threads are best suited for CPU bound tasks with relatively rare synchronization. Popup/Micro-threads doing trivial tasks are typically short-lived with high overhead."

If the task is not CPU-bound, and we consider I/O waiting times that are typically orders of magnitude larger than instruction execution times, the overhead of creating a thread (pretty small on modern hardware) will be irrelevant, since thread creation is essentially a CPU/memory-bound operation.

If the task is CPU-bound, threads offer significantly better performance.

So you either lose nothing or win a lot. I don't see the problem with that.

"Thread creation and accounting cost can be reduced to the bare minimum, but synchronization primitives will always be an inherent bottleneck since they cannot operate at CPU speeds and must be serialized over the bus. As the number of micro-threads increases, the bus itself becomes a point of contention and thus the entire system loses performance.

Again, I'll refer to the asynchronous model as having solved these problems. By handling events asynchronously without threads, all of the overhead disappears since no blocking/synchronization is necessary. Since only one stack is needed per CPU, it should fit entirely within the CPU's L1 cache."

If so much synchronization is needed that it has a significant impact on interrupt processing speed, it is indeed better to use async. That's another reason (apart from the big developer laziness one) why I support both models.

"This doesn't even get into the prevalence of multi-threaded bugs and the difficulty in debugging them."

That's why I talk about ease of programming. Multi-threaded programming is harder. You have to get in the right mindset to get it right. But once you get some basic coding patterns that prevent deadlocks and races, I don't think there are so many things that can go wrong.

"If a long running thread is really needed, then the asynchronous code can spawn one off deliberately (and still without blocking)."

Which implies going back to the kernel, creating a thread, and waiting until the scheduler dares to run it. Why not have the kernel just do it right away?

"If you are using threads to achieve some type of kernel protection between stacks, then I wish you'd just say that. Please don't claim that micro threads are used for scalability when other alternatives are technically more scalable."

Again, how can processing events sequentially be any more scalable than processing them in parallel?

But yeah, you're right, separate threads also have other benefits, like a stack/CPU state cleanup each time a new event is processed, or a way to introduce "timeout" mechanisms so that if an event takes too much time to be processed, the driver simply reverts the changes and jumps to the next event (effectively reducing the impact of hung drivers).

"Anyways, since you don't mention it, how do you handle the acknowledgement of interrupts? Is this done implicitly by the global interrupt handler? Or is this done within the thread that handles the interrupt?"

Yeah, I implicitly included it in "enabling interrupts again". Drivers shouldn't be trusted for that, in my opinion.

"The reason I ask is if you ack the int right away, then the hardware could fire off another interrupt and have you create a new thread for the same device before the first thread is serviced."

Yup, that's the whole point of using a threaded model at all. If you don't want this behaviour, you use the async model and run the pop-up threads sequentially.

"Do you protect from this somehow or do you leave it up to individual drivers?"

Drivers have to explicitly make a choice between using a threaded or an async model. If they use a threaded model for an interrupt where races can occur, they know what they're doing and it's up to them to implement the relevant synchronization mechanisms.


RE[2]: Pop-up threads
by Alfman on Sat 21st May 2011 00:28 in reply to "RE: Pop-up threads"

"If things which can run in parallel run in parallel, you get better scalability than with an all-sequential model like async."


If those are your assumptions, then I can understand your conclusions, but your assumptions are pretty weak. Hypothetically I could perform nearly any computation in parallel by dividing it into threads. But doing so does not imply better performance. That depends on the ratio of CPU work versus synchronization overhead.

If I create a MT ray tracer on an 8-core processor, which will perform the best?
A) a new thread for each pixel
B) a new thread for each line
C) 8 threads, processing individual lines from a dispatch queue.
D) 8 threads, processing 1/8th of the image at a time

The answer here is obviously not A or B. For performance it never makes any sense to have more threads than cores.

C and D are probably close, but C could win since some cores might finish their work before others, leaving them idle.
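
To make option C concrete, here is a minimal sketch of the dispatch-queue variant in C with pthreads; render_line() and the image height are hypothetical stand-ins:

#include <pthread.h>
#include <stdatomic.h>

#define NWORKERS 8
#define HEIGHT   1080

static atomic_int next_line;          /* the "dispatch queue": a counter */

static void render_line(int y)        /* stand-in for per-line ray tracing */
{
    (void)y;
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int y = atomic_fetch_add(&next_line, 1);   /* claim the next line */
        if (y >= HEIGHT)
            return NULL;                           /* no lines left */
        render_line(y);           /* lines are independent: no locks here */
    }
}

int main(void)
{
    pthread_t t[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    return 0;
}

An idle core immediately pulls the next line, which is exactly why C can beat D when lines vary in cost.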

"They are cheap (or more exactly can be made so). Why not use them ?"

The point is that it doesn't make sense to use them just because you can, it makes sense to use them when the tradeoffs point in that direction.

"If the task is not CPU-bound, and we consider IO waiting times that are typically orders of magnitude larger than instruction execution times..."

Fine, then your report should say that: the overhead of threads is outweighed by I/O bound factors.

"If the task is CPU-bound, threads offer significantly better performance."

This isn't automatically true. It depends on the work/overhead ratio; small workloads especially will suffer when multithreaded, compared to async which carries no such overhead.

It's additionally possible to run a separate async handler on each core using CPU affinity, such that multiple async requests can run in parallel with no synchronization overhead at all.
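
A minimal sketch of that idea, assuming a Linux-like environment (pthread_setaffinity_np is a GNU extension); the per-core queue is abstracted behind a hypothetical handle_event():

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

static void handle_event(long core)
{
    (void)core;   /* stand-in: pop and process one request from this
                     core's private queue; no other thread touches it */
}

static void *event_loop(void *arg)
{
    long core = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* Pin this loop to its core: with one loop and one queue per core,
       requests are handled with no locks at all. */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    for (;;)
        handle_event(core);
}

int main(void)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    pthread_t t;
    for (long i = 0; i < ncores; i++)
        pthread_create(&t, NULL, event_loop, (void *)i);
    pause();   /* the event loops run forever in this sketch */
    return 0;
}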


"If so much synchronization is needed that it has a significant impact on interrupt processing speed, it is indeed better to use async."

"Significant" is a judgment call I leave to you. I just object to your claim that threaded is more scalable.


"Which implies going back to the kernel, creating a thread, and waiting until the scheduler dares to run it. Why not have the kernel just do it right away?"

Firstly, most I/O operations will not need threads in the first place; async is already sufficient, so why go through the overhead when it's not needed? Secondly, if you do have a CPU-bound operation, then the cost of a syscall should be fairly negligible. Thirdly, the cost of starting a microthread should be no worse in this scenario than when you start it by default (although I realize this is not strictly true for you, since you're using a microkernel which is subject to additional IPC overhead).


"Again, how can processing events sequentially be any more scalable than processing them in parallel?"

You can't just put it in a thread and expect it to run faster.

The overhead of many small threads adds up compared to bigger threads which do more work. If it costs 3000 cycles to send a thread to another CPU and another 3000 cycles to retrieve the results (thread creation + data transfer), that's 6000 cycles of pure overhead, and computations under 12000 cycles are probably better off being run serially on one CPU. Not only is it faster in real time, it also frees the bus for useful work on the other CPU.

You may shave off cycles here and there, but have you ever asked the question "do my tasks actually need threads?"


"Drivers have to explicitely make a choice between using a threaded or an async model."

Ok, but will you get the full async performance benefits if you're trying to emulate it inside a threaded model? And it still bothers me that you characterized the async model as worse for performance.

"Yup, that's the whole point of using a threaded model at all."

Forgive me if I'm pointing out something you already considered, but my point was that pre-emptively acknowledging the interrupt means the drivers have to protect themselves from re-entrant (or, in your model, doubly-threaded) IRQ behavior which would not otherwise be possible.

In other words, the implicit ack requires drivers to have more synchronization than would otherwise be necessary if they explicitly re-enabled their interrupt when ready. It may not be a big deal, but it's still several hundred or thousand CPU cycles wasted on every IRQ.

The thing that baffles me about your model for drivers is that it emphasizes threads which will rarely, if ever, be used without a mutex to serialize them again. If one driver forgets a mutex and accesses a device simultaneously from multiple CPUs, that is extremely likely to be a bug rather than a deliberate action.
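
For illustration, here is what a pop-up IRQ handler in this model would likely look like in practice; nic_irq_thread and the device accesses are made-up names, but the mutex is the point:

#include <pthread.h>

static pthread_mutex_t nic_lock = PTHREAD_MUTEX_INITIALIZER;

/* Entry point of the pop-up thread spawned for each NIC interrupt. Because
   a second interrupt can spawn a second thread before this one finishes,
   the device state must be locked on every entry. */
void nic_irq_thread(void *dev)
{
    (void)dev;
    pthread_mutex_lock(&nic_lock);
    /* ... read status register, drain the RX ring, ack the device ... */
    pthread_mutex_unlock(&nic_lock);
}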


RE[3]: Pop-up threads
by Neolander on Sat 21st May 2011 09:24 in reply to "RE[2]: Pop-up threads"

"If those are your assumptions, then I can understand your conclusions, but your assumptions are pretty weak. Hypothetically I could perform nearly any computation in parallel by dividing it into threads. But doing so does not imply better performance. That depends on the ratio of CPU work versus synchronization overhead.

If I create a MT ray tracer on an 8-core processor, which will perform the best?
A) a new thread for each pixel
B) a new thread for each line
C) 8 threads, processing individual lines from a dispatch queue.
D) 8 threads, processing 1/8th of the image at a time

The answer here is obviously not A or B. For performance it never makes any sense to have more threads than cores.

C and D are probably close, but C could win since some cores might finish their work before others, leaving them idle."

Good point, I should try to avoid having more running threads than there are available CPUs. I like this ray tracer example a lot, except that I'd argue that in this case the synchronization overhead is close to zero, since IIRC ray tracers render individual pixels independently. Threading overhead here will mostly come from the cost of context switches and thread creation/annihilation. But that's totally nitpicking.

No, what leaves me wondering here is how I/O calls can be managed in this model.

That's because, you see, if you have 16 threads on an 8-core processor and suddenly 4 threads have to do I/O, they can simply use blocking calls without worrying, because 4 of the 8 threads that weren't running will take their place.

Now, if you have 8 running threads and 8 tasks waiting in the dispatch queue, and suddenly 4 threads have to do I/O, then for optimal performance you need to replace these 4 threads with 4 new number-crunching threads built from the dispatch queue's tasks.

In this case, we have some overhead:
1/ Threads must synchronize with each other so that they don't start 4 new threads with the same pending task, in a textbook example of a race (see the sketch after this list).
2/ 4 new threads must be created, which implies some context-switching overhead that could have been avoided if all threads had been created at a time when the kernel was running anyway.
3/ After a while, the I/O task completes, and our 4 old threads are ready to run again. We are now dealing with 12 threads, so at this point 4 threads must be silenced for optimal performance. How do we choose them?
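
The race in 1/ is typically closed by claiming tasks under a lock; a minimal sketch, with a made-up task structure:

#include <pthread.h>
#include <stddef.h>

struct task {
    struct task *next;
    void (*run)(void *);
    void *arg;
};

static struct task *pending;       /* the dispatch queue's task list */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called by a thread looking for work; the lock guarantees that two
   waking threads can never claim the same pending task. */
struct task *claim_next_task(void)
{
    pthread_mutex_lock(&queue_lock);
    struct task *t = pending;
    if (t)
        pending = t->next;         /* unlinked before anyone else sees it */
    pthread_mutex_unlock(&queue_lock);
    return t;                      /* NULL: nothing left to start */
}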

To solve problems 1/ and 2/, having lots of threads created from the start sounds like the most sensible option, if threads are sufficiently cheap. Well, I have to define "cheap", don't I? Let's do it: threads that are not running cost CPU time when they are created, and take up room in RAM both for OS management structures (a few bytes, negligible on modern computers) and for their stack. The CPU time cost can't be avoided, so creating lots of threads is only interesting for tasks which run sufficiently long, as you point out. But the RAM cost can be avoided entirely for threads that have never run, by allocating the stack lazily at run time in a COW-like mechanism.
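
As a rough userspace illustration of the lazy-stack idea (a kernel would get the same effect from its own page-fault handler), mmap with MAP_NORESERVE reserves address space without committing physical pages until they are touched:

#include <sys/mman.h>
#include <stddef.h>

#define STACK_SIZE (64 * 1024)

/* Reserve a stack without committing physical memory: pages are only
   faulted in if the thread ever runs and actually touches them. */
void *reserve_stack(void)
{
    void *base = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return base == MAP_FAILED ? NULL : base;
}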

An interesting experiment would be a mix of my two pop-up thread models where, among the threads created, only <insert number of cores here> are allowed to run from the start, and when a thread blocks, the scheduler automatically picks another one from the queue; something like the sketch below.
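
A sketch of that experiment, assuming a counting semaphore with one slot per core (all names hypothetical):

#include <semaphore.h>

#define NCORES 8

static sem_t run_slots;   /* at boot: sem_init(&run_slots, 0, NCORES) */

/* Every pop-up thread starts here; only NCORES of them hold a slot
   (and thus actually run) at any one time. */
void popup_thread_body(void (*handler)(void *), void *event)
{
    sem_wait(&run_slots);   /* block until a core slot is free */
    handler(event);         /* a blocking handler would release and
                               reacquire its slot around the I/O call
                               (not shown) */
    sem_post(&run_slots);   /* wake the next queued thread */
}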

3/ is more interesting. I think the older threads should run first, because otherwise the number of threads which have run (and, as such, have a stack) is going to grow, wasting RAM.

"Fine, then your report should say that: the overhead of threads is outweighed by I/O bound factors."

The thing is, not all interrupts are equal from that point of view. As an example, processing keyboard or mouse interrupts is nearly pure I/O. But processing a stream of multimedia data can require significantly more number crunching, to the point where the amount of I/O is negligible, especially if DMA is being used.

"It's additionally possible to run a separate async handler on each core using CPU affinity, such that multiple async requests can run in parallel with no synchronization overhead at all."

Sorry, I did not understand this one. Can you please elaborate?

"Firstly, most I/O operations will not need threads in the first place; async is already sufficient, so why go through the overhead when it's not needed?"

Indeed, for pure I/O operations, it makes more sense to use pure async. As an example, an HDD driver is totally something which should be written in an async fashion, with a message queue.

I have to do more design work on this side.

"Secondly, if you do have a CPU-bound operation, then the cost of a syscall should be fairly negligible."

Very fair point.

"Thirdly, the cost of starting a microthread should be no worse in this scenario than when you start it by default (although I realize this is not strictly true for you, since you're using a microkernel which is subject to additional IPC overhead)."

There is a difference in design philosophy, though: who carries the burden of creating and dispatching threads? In one case it's a third-party developer, in the other it's the operating system. I'd spontaneously believe that the second solution would result in more people considering the threaded option, since it costs them less.

"You may shave off cycles here and there, but have you ever asked the question 'do my tasks actually need threads?'"

I went a bit the other way around: threads sounded like the cleanest multicore-friendly way to implement the model I'm aiming at, of processes providing services to each other, and I've then tried to apply this model to interrupt processing and drivers.

"Ok, but will you get the full async performance benefits if you're trying to emulate it inside a threaded model?"

If by performance benefits you mean the fact that async has only one task running at a time, and as such doesn't have to care about synchronization, and that the cost of pending tasks is kept minimal, then yes, this model may provide that. If you wonder whether the total cost of starting an interrupt-processing job is better or worse than in a pure async model where you just send a message to the dispatcher's entry pipe, then I can't answer for sure, but I think both models have advantages and drawbacks.

Having an external dispatcher in the driver process implies a slight kernel-driver communication protocol overhead, which should make emulated async perform better when the driver is initially idle. On the other hand, when the driver is initially busy, sending a message to a pipe is faster than creating an inactive pop-up thread.

"And it still bothers me that you characterized the async model as worse for performance."

Indeed, that part was bad. I wrote it too late in the evening and didn't take the time to write an in-depth comparison of the two. I was talking about scalability though, not performance.

"The thing that baffles me about your model for drivers is that it emphasizes threads which will rarely, if ever, be used without a mutex to serialize them again. If one driver forgets a mutex and accesses a device simultaneously from multiple CPUs, that is extremely likely to be a bug rather than a deliberate action."

The assumption behind this is that in many situations, the order in which things are processed only matters at the end, when the results are sent to higher-level layers. Like rendering a GUI: you can render individual controls in parallel and blit them in parallel as long as they don't overlap; it's only at the time of the final copy to the framebuffer that synchronization is needed.
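
As a sketch of the GUI example (render_control, blit and the control structure are hypothetical): each control renders into its own buffer in parallel, and only the copy into the shared framebuffer takes a lock:

#include <pthread.h>

#define NCONTROLS 4

static pthread_mutex_t fb_lock = PTHREAD_MUTEX_INITIALIZER;

struct control { int id; char buffer[4096]; };  /* private render target */

static void render_control(struct control *c) { (void)c; /* draw into c->buffer */ }
static void blit(struct control *c) { (void)c; /* copy c->buffer to framebuffer */ }

static void *render_thread(void *arg)
{
    struct control *c = arg;
    render_control(c);             /* parallel: touches only private memory */
    pthread_mutex_lock(&fb_lock);  /* synchronization happens only here, */
    blit(c);                       /* at the final copy to the framebuffer */
    pthread_mutex_unlock(&fb_lock);
    return NULL;
}

int main(void)
{
    static struct control ctl[NCONTROLS];
    pthread_t t[NCONTROLS];
    for (int i = 0; i < NCONTROLS; i++) {
        ctl[i].id = i;
        pthread_create(&t[i], NULL, render_thread, &ctl[i]);
    }
    for (int i = 0; i < NCONTROLS; i++)
        pthread_join(t[i], NULL);
    return 0;
}

(This version conservatively serializes all blits; non-overlapping blits could run in parallel too, as the text says.)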

I don't know how much it holds true for interrupt processing, though.


RE[2]: Pop-up threads
by Alfman on Sat 21st May 2011 04:19 in reply to "RE: Pop-up threads"

I hope I'm not annoying you too much, that's not my intent.

If you can demonstrate that threads make your paradigm better, then go for it.

My main gripe is that, after all our talks, you still used blanket statements like the following:

"The former [threaded operation] is obviously significantly better for scalability..."


RE[3]: Pop-up threads
by Neolander on Sat 21st May 2011 09:44 in reply to "RE[2]: Pop-up threads"

"I hope I'm not annoying you too much, that's not my intent."

No, no ;) Feedback like yours is extremely precious, because I'm one of those stubborn guys who will only listen to criticism with more than one ear when it comes with arguments and backed-up claims.

"If you can demonstrate that threads make your paradigm better, then go for it."

As said above, pop-up threads initially appeared to me as an extremely good option in the context of processes providing independent services to each other. What I'm trying to do here is keep the design complexity low by creating a universal yet simple paradigm that can be used to solve an insane lot of problems at once, kind of like Unix systems make everything revolve around text files and pipes.

My main gripe is that, after all our talks, you still used blanket statements like the following:

"The former [threaded operation] is obviously significantly better for scalability..."

Apologies for that one ;) As said above, I have to rewrite that part anyway, as it lacks completeness. This pop-up thread thing deserves a full dedicated article, considering how extensively I want to use it, not just a two-paragraph description in an article about interrupt handling.
