An IEEE Spectrum article outlines some interesting new OS-related research. Martin Maas, a University of California, Berkeley, PhD student who is now at Google, designed “a new type of device that relieves the CPU from its garbage collection duties.”
Maas notes that CPUs, which have traditionally been assigned garbage collection, were never specifically designed for the task. “CPUs are built to be flexible and run a wide range of applications. As a result, they are relatively large and can take up a significant amount of power,” he explains.
Instead, Maas and his colleagues created a compact accelerator unit that requires only a small amount of chip area and power. It can be added to the CPU, similar to how graphics processing units are integrated into many modern processor chips.
“While the software application is running on the CPU, this unit sits on the side and performs garbage collection for the application,” says Maas. “In principle, this means that you could build a system where the software does not have to worry about garbage collection at all and just keeps using the available memory.”
“On top of its many tasks, a CPU must do something called “garbage collection,” whereby it identifies and deletes redundant or irrelevant data from applications to free up additional memory space.”
This is very badly worded – Garbage Collection is not a task the CPU does, it is a task performed by code running on a CPU. One of many such tasks. Many systems do not use GC at all – it’s typically a language/framework issue, not a CPU issue.
Not to mention, GC is intimately tied into both the language and the host OS. Doing it in hardware means new hardware every time a minor change is made in the language or OS. This sounds like a company in search of unlimited streams of money rather than a solution to a programming issue.
This is true. Additionally, memory management in most languages with automatic memory management and a GC isn’t as simple as every object ever being new’ed and then handed over to a GC.
Why do we all need GC? If you are a good programmer you never need a garbage collector, that’s it.
The Symbolics 3600 Lisp machine implemented hardware accelerated garbage collection as a parallel subsystem some 35 years ago, it would be good to see this return.
So much stupid…
a) Most software doesn’t use GC, and nothing performance sensitive uses GC (e.g. if you care about performance but are using a language with GC then you end up using native code in a library for the performance sensitive parts).
b) For software that does use GC, often “references” are mangled (e.g. combined with something to indicate the type of object) and are not simple addresses. Different languages that use GC can/will mangle references in different ways.
c) For most operating systems there’s 3 layers of memory management (physical, virtual, then heap or GC in user-space), where it’d be extremely painful to try to do GC at the hardware level (where you can only see physical addresses) because you’d have to “unweave” the lower 2 layers of memory management.
d) Modern security features (e.g. encrypted RAM) are designed to prevent anything accessing or even seeing the true contents of RAM. If a hardware garbage collector is possible (outside of an artificially restricted “academic” setting) it will need to be placed above these security features, and will therefore have a large risk of being abused to bypass/violate the system’s security.
e) Modern systems have many things happen at once (multiple CPUs and multiple devices pounding memory at the same time). Trying to keep track of what does/doesn’t use GC (and how) while avoiding race conditions and upholding cache coherency while many things are happening would be a nightmare.
Mostly; I’d expect that someone did a “hardware GC” for one extremely restricted case (single CPU, no virtual memory, no multi-tasking, no security, single known language, …), and then a clueless reporter hyped the daylights out of it to make it sound more than it ever can be.
I agree with most of your points, but your first point is simply not true. GC has come a long way since stop-the-world, and in the modern day languages which abstract memory can sometimes beat native code.
If you trust the compiler to generate more efficient assembly code for your algorithms, why wouldn’t you trust it to make better decisions about memory usage, also?
FlyingJester,
I like this topic! To be honest, I don’t see much benefit in moving GC into dedicated hardware because we already have parallel SMP processors that are more flexible and mostly unused anyways; they can handle background tasks pretty well already.
Also, “garbage collection” is an umbrella term covering many different types of algorithms and allocators. I’d prefer not to hardcode the semantics of one specific garbage collector in CPU silicon.
However, in terms of some of the generic arguments being made against garbage collection, I think we need to have a balanced view. The truth is unmanaged code continues to be a liability for computer security and robustness. Having garbage collected managed memory is one effective way to address this (serious) problem. Also, the gut reaction for many CS people is to assume alloc/free always wins over GC, but in fact garbage collection algorithms can sometimes do less work overall compared to alloc/free algorithms, which results in GC winning on performance. The fact that GCs defer de-allocation means they will typically use more memory. The point being there are some real tradeoffs to consider.
So, IMHO there’s no reason to put down GC. It can be beneficial and it has its place, but nothing in the article provides evidence that it is better to do in hardware than in software, only lip service to that effect. Unfortunately I can’t read the paper itself as it is locked behind a paywall. 🙁
In languages that require the use of finalizers (where user functions need to be called on an object as it is being garbage collected) I wonder if the overhead of propagating the hardware GC events to software might even cause more performance loss due to interrupts and context switches.
I also prefer it wouldn’t be the case. 😉
We had an argument about it some time ago, didn’t we?
I didn’t read the article but I suppose it has the implicit premise that the GC support comes from a specialized part of the OS running in parallel all the time without disturbing the context switching of the other parts. If so, I think it would be a really nice addition.
acobar,
Hmm, I can’t find it. It’s become harder for me to find comments on WordPress using 3rd party search engines. I was able to find more using the comment search on the old site.
Anyways, I am curious how they plan to address the GC compatibility issues.
Of course some GC algorithms already run in parallel, but between all the operating systems and languages that use GC there are probably many different structures under the hood. Obviously a GC allocator can reserve extra space for its own GC accounting when making an allocation; that would need to be standardized. Sometimes a GC can take advantage of the fact that 64bit pointers have unused bits, for example to flag whether a pointer has been followed. These bits may well be unused, however if the hardware GC implementation assumes they are unused, it would conflict with software that uses those bits for something else. It’s important for a GC to identify where the pointers are in a structure; software based GCs can easily take advantage of language reflection for this, but a hardware GC has no clue about software reflection. So unless it was hardcoded for a specific language, I suspect a hardware GC implementation would require a new type reflection format, which could get ugly because now the GC requires compiler changes too.
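To illustrate the pointer-bit point, here’s a minimal C++ sketch of the kind of tagging I mean (the bit layout and the “mark” flag are hypothetical, purely for illustration):

#include <cassert>
#include <cstdint>

// Hypothetical tagged reference: the low bits of a pointer to an aligned object
// are always zero, so a software GC can stash per-reference flags there.
// A hardware GC that assumes those bits are "unused" would conflict with this.
struct alignas(8) Node { int value; };

class TaggedRef {
    static constexpr uintptr_t kMarkBit = 0x1;   // e.g. an "already traced" flag
    uintptr_t raw_;
public:
    explicit TaggedRef(Node* p) : raw_(reinterpret_cast<uintptr_t>(p)) {
        assert((raw_ & 0x7) == 0);                // relies on 8-byte alignment
    }
    Node* get() const { return reinterpret_cast<Node*>(raw_ & ~uintptr_t(0x7)); }
    bool  marked() const { return raw_ & kMarkBit; }
    void  mark()         { raw_ |= kMarkBit; }
};

int main() {
    Node* n = new Node{42};
    TaggedRef r(n);
    r.mark();                       // the flag lives inside the pointer itself
    assert(r.get()->value == 42);   // still dereferences correctly after masking
    delete n;
    return 0;
}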
I have no idea if they cover any of this in their paper, my guess is that they based their study on a specific GC from a specific language.
The right way to solve the GC memory consumption problem would be to develop a way for the OS to communicate system memory pressure to GC-enabled VMs, so that the OS can trigger a GC to reclaim memory when needed.
That should be done when new memory is a must (new app started) or is a nice to have (more cache needed).
Besides, at least in theory a managed heap results in less consumption for certain classes of memory-hungry apps because it knows how to solve the memory fragmentation problem.
dsmogor,
You’re right. In fact some of us feel that linux should be doing this, GC or not, long before it comes to the point of invoking the OOM killer. They didn’t provide the facilities to ask applications to voluntarily release resources, so unfortunately it’s up to applications to self-monitor.
https://stackoverflow.com/questions/23468823/how-to-get-notified-before-oom-liller-killing-or-memory-low-on-linux
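Here’s a minimal sketch of the kind of self-monitoring I mean, assuming a kernel with PSI enabled (i.e. /proc/pressure/memory exists); the 5% threshold is just an arbitrary example value a runtime might use to decide when to collect:

#include <fstream>
#include <iostream>
#include <string>

// Read the kernel's memory pressure-stall info (PSI) so a GC-enabled runtime
// can decide whether it's time to collect. Assumes a kernel that exposes
// /proc/pressure/memory; the threshold is an arbitrary example value.
bool memoryUnderPressure(double threshold_pct = 5.0) {
    std::ifstream psi("/proc/pressure/memory");
    std::string line;
    while (std::getline(psi, line)) {
        // Lines look like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
        if (line.compare(0, 4, "some") == 0) {
            auto pos = line.find("avg10=");
            if (pos != std::string::npos)
                return std::stod(line.substr(pos + 6)) > threshold_pct;
        }
    }
    return false;  // no PSI support: assume no pressure
}

int main() {
    if (memoryUnderPressure())
        std::cout << "memory pressure detected - a GC-enabled VM could collect now\n";
    else
        std::cout << "no significant memory pressure\n";
}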
That’s true, having a managed heap gives us de-fragmentation, although in practice you have to trade off between performance and memory efficiency.
Because a compiler makes its decisions once, and except for JIT’ed languages, makes them before the program ever runs.
If this could work at all (and I very strongly doubt that), I wonder how much extra memory bandwidth it would consume, and with what effect.
And a programmer isn’t making a decision once, before the program runs, when they write manual memory management code?
FlyingJester,
Considering most of them are implemented in native code, I would argue the native code is merely being beaten by other native code that does performance tuning on the fly. Turns out, native code written in native code can also do that, and such programs also beat non-native code which runs on native code backends.
Side note, “beat” is a bit vague. When GC languages “beat” native code, what category is it beating it in? Fast GC programs rarely approach the memory usage of reasonable native code programs.
People do. That’s why people are increasingly choosing value semantics over reference semantics, because a compiler can make much better decisions about memory usage when you aren’t allocating everything on the heap and pointing to other random places on the heap.
When you deal with values, and leave object lifetime decisions to the compiler, it turns out they can do a much better job.
>>Considering most of them are implemented in native code, I would argue the native code is merely being beaten by other native code that does performance tuning on the fly. Turns out, native code written in native code can also do that, and such programs also beat non-native code which runs on native code backends.
Binary code doesn’t come with the metadata that bytecode provides to inform the dynamic optimizer.
If you split the cost of GC into 2 parts, “finding the garbage” and “freeing the garbage that was found” it’s obvious that it’s impossible for GC to have better performance. The best case is that “finding the garbage” costs nothing and GC ends up with “equal but not better” performance. The same applies to memory footprint (GC can only achieve “equal but not better memory consumption” if it finds and frees garbage as soon as possible).
Any illusion of GC being better for performance or memory consumption is not directly related to GC itself. For example, maybe a hypothetical piece of software performs better when various pieces of data are moved closer to improve locality, and people using explicit memory management don’t do that even though they could, and (an implementation of) GC does do it, so the GC gives the illusion of better performance (but not because of GC itself).
Of course the entire point of GC is to be better at avoiding bugs, and it does achieve that (primarily, by sacrificing some performance). That’s what makes GC attractive (e.g. for finding bugs before software is released).
Brendan,
Many smart people in CS have this misconception, which is learned early on, but it’s wrong. There are a lot of clever ways GC can be optimized such that on the whole, the GC cycle costs less than the aggregate “free” calls on an unmanaged heap.
https://www.c-sharpcorner.com/UploadFile/ff2f08/depth-looks-in-garbage-collection/
“The garbage collector generates a significant performance hit and this is a major drawback of using a managed heap. But the GC only occurs when the heap is full, so that the managed heap is significantly faster than a C Runtime heap.”
Sure GC could benefit from locality, but depending on the GC algorithm it can additionally benefit from having less work to do on average than alloc/free. I don’t mean for this to be controversial, I really would like everyone to understand this, so here’s an example with an extremely simplified GC minus all the bells and whistles that get in the way of clear understanding…
The heap allocator consists of a single pointer to the next free space on the heap. It handles alloc calls by reserving space for a NULL forwarding pointer plus the size of the alloc request. The “next” pointer gets advanced and a pointer to the newly allocated region is returned. This continues until the heap runs out of space, at which time a GC cycle is needed.
The GC cycle consists of recursively scanning the entire object hierarchy and copying objects to a second heap. All objects start with a NULL forwarding pointer corresponding to an address in the second heap. For every object reached, copy it to the new heap (using the trivial allocator described above) and set the forwarding pointer to its new address. Next scan all references. If a referenced object has a forwarding pointer, simply update our reference to the object. If a referenced object does not have a forwarding pointer, then recursively process that object so that it gets copied to the second heap, then update our reference when that returns. By the end of this process all reachable objects will be copied over to the second heap, all references point into the second heap and the second heap will be 100% de-fragmented. The heaps swap roles and the second heap essentially becomes the primary heap.
I think this example is sufficient to explain why GC algorithms can outperform alloc+free in certain scenarios. The alloc function is absolutely trivial and so small it could even be inlined. While the garbage collection cycle is relatively slow due to the way it needs to scan and copy objects, the key insight here is that its performance is proportional to the size of the working set and NOT the number of objects freed since the previous cycle. The frees add literally ZERO cost! So for any given working set, the cost of GC is constant regardless of the number of allocations/frees that occurred.
Now compare this to a heap using C’s alloc/free functions: every alloc is more expensive than our example’s trivial alloc and every free adds cost. So with this in mind, we have a better idea when GC should come out ahead versus when alloc/free should. Algorithms that have a small working set (therefore low-cost GC cycles) with lots of allocations and frees are most likely to favor the GC (and vice versa). What I hope everyone can take away is that the GC’s performance advantage is NOT merely due to data locality tricks, but actual algorithmic differences.
Of course, the simplicity of this algorithm does have its cons: copying every object every cycle, memory efficiency is bad, etc. But there are more sophisticated algorithms like generational GCs that help. Anyways it’s outside the scope of the example 🙂
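For anyone who wants to play with the idea, here is a bare-bones C++ sketch of that two-space scheme (purely illustrative, not how a production collector looks: fixed-size semispaces, fixed-size objects with at most two references, no generations, none of the bells and whistles):

#include <cassert>
#include <cstddef>
#include <vector>

// Bare-bones two-space copying collector matching the description above:
// a bump-pointer allocator, plus a GC cycle that recursively copies reachable
// objects into the other semispace using forwarding pointers.
struct Obj {
    Obj* forward = nullptr;            // forwarding pointer (null until copied)
    Obj* ref[2] = {nullptr, nullptr};  // up to two references to other objects
    long payload = 0;
};

class Heap {
    static constexpr std::size_t kCap = 1 << 16;  // objects per semispace
    std::vector<Obj> space_[2];                   // the two semispaces
    std::size_t cur_ = 0, next_ = 0;              // active space, bump index
public:
    Heap() { space_[0].resize(kCap); space_[1].resize(kCap); }

    // Trivial allocator: just advance the bump pointer (collect when full).
    Obj* alloc(std::vector<Obj*>& roots) {
        if (next_ == kCap) collect(roots);
        assert(next_ < kCap && "working set exceeds semispace");
        Obj* o = &space_[cur_][next_++];
        *o = Obj{};                               // reset any stale slot contents
        return o;
    }

    // GC cycle: copy everything reachable from the roots into the other space.
    // Cost is proportional to the live set; the garbage costs literally nothing.
    void collect(std::vector<Obj*>& roots) {
        std::size_t to = 1 - cur_, to_next = 0;
        for (Obj*& r : roots) r = copy(r, to, to_next);
        cur_ = to;
        next_ = to_next;                          // everything else is garbage
    }
private:
    Obj* copy(Obj* o, std::size_t to, std::size_t& to_next) {
        if (o == nullptr) return nullptr;
        if (o->forward)   return o->forward;      // already copied: just rewire
        Obj* dst = &space_[to][to_next++];
        *dst = *o;
        dst->forward = nullptr;
        o->forward = dst;                         // leave a forwarding address
        dst->ref[0] = copy(o->ref[0], to, to_next);
        dst->ref[1] = copy(o->ref[1], to, to_next);
        return dst;
    }
};

int main() {
    Heap heap;
    std::vector<Obj*> roots(1, nullptr);
    roots[0] = heap.alloc(roots);
    // Churn out lots of short-lived garbage; only a couple of objects survive
    // each cycle, so all those implied "frees" never cost anything.
    for (long i = 0; i < 1000000; ++i) {
        Obj* tmp = heap.alloc(roots);                // may trigger a GC cycle
        tmp->payload = i;
        roots[0]->ref[0] = (i % 2) ? tmp : nullptr;  // always re-read via roots
    }
    return 0;
}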
I agree.
It’s not even just that the GC has to cost less in aggregate, but that it can be deferred, removing the synchronous cost of deallocation, and potentially converting it into an idle time asynchronous cost, when there is no other useful work to do.
This can aid in reducing latency, if the GC can be deferred to idle periods. Of course, if you have real time requirements, then this must operate a level above the GC so that real time tasks can proceed in spite of “stop the world events”, but then with a realtime task, you wouldn’t want to be (de)allocating memory anyway, you’d do it all up front, and not even require GC.
christian,
Those are good points. If you can get away with deferring free, it might reduce latency at least during the deferment period. I wouldn’t opt for GC in realtime systems myself unless of course the safety provided by managed memory was more important than the latency.
You seem to equate “performance” with “high throughput”, a mistake often made by GC proponents. In many cases, “performance” means “low latency”. Android, which is made up mostly of Java, has an audio latency that makes it unacceptable for any realtime audio apps. Proper latency introduced by the audio buffer would be 2-3 ms. Good is 4 ms. For some reason, many people are satisfied with 7 ms. Android apps often have 60 ms latency or more. These devices are “factory bricked” with regards to audio latency. With today’s hardware and software/OS architecture, you generally can’t achieve 2-3 ms latency if your software is running a GC as well.
Android apps can bypass the Java stack completely if needed, so the reason must lie somewhere else.
Brendan,
Are we sure about that? A lot of software is being written in Javascript/Node. Almost all browsers have a Javascript engine. Java is still quite big in enterprise (and Android phones). Non-trivial amounts of software written in C#. Then there’s Python, Ruby, Perl, all running on the backend and forcing companies to run huge server farms.
But considering GC languages crawl, they are sensitive to performance. Of course, the performance they’re sensitive to is on the scale of human reaction time rather than nanosecond latencies, but people for sure notice performance issues in GC languages.
kwan_e,
The oft-stated performance gap between GC and alloc+free is an assumption that isn’t always true; I wish we could all acknowledge that and move forward. As I’m sure you know by now from the last discussion, GC can sometimes be faster than alloc/free. Should I dig that up so we can discuss it more? I’m actually going on vacation with the family this weekend, so I’m not sure I want to get dragged down this path at the moment, haha. 🙂
Hey, I found the benchmarks! I redid these benchmarks to use wall time using the linux “time” command so that the timing between the gcc and java code is fair. Java starts off at a disadvantage because the JVM hotspot compiler incurs large startup overhead, but on large heap structures java manages to catch up using GC and eventually beat the c++ code using alloc/free.
TestJava.java
https://pastebin.com/QfE4Qakh
TestC.cc
https://pastebin.com/N2Z7Zk8U
GC versus Alloc/Free
https://i.postimg.cc/VvJyPS08/gcbenchmark.png
As always, YMMV, but the main point is that GC’s reputation for bad performance isn’t always justified by data.
For a fair comparison (given that your “Java with GC” didn’t free any of the garbage and just left it allocated until the process terminated) you should delete line 53 of “test.cc”.
Brendan,
No, that would leave the C++ version with a memory leak, the java version does not have a memory leak.
Nevertheless I’ve taken your feedback into account and modified the TestJava version by adding a parameter telling it to explicitly free memory after every rep, just like the C++ version does explicitly.
TestJava.java
https://pastebin.com/aNfZxDwE
I’ve rerun the benchmarks on a fixed set size varying only the number of repetitions to demonstrate the effects on memory.
GC Benchmark Graph
https://i.postimg.cc/CL24rFhh/gc-benchmark.png
Regarding performance, we notice the same pattern as before: Java’s JIT overhead has a high startup cost, but over time it overtakes C on raw speed. I tested it both with and without the forced GC cycle. You were right that adding GC cycles makes it slower; the data shows that adding a GC cycle after every rep results in a 50% performance loss compared to allowing the GC cycle to be deferred. However even with the forced GC cycles, Java was still 2.3X faster than C++ here.
Regarding memory I captured the “maximum resident set size”.
TestC started at 2.6M (ie C++ bloat with no work done).
TestJava started at 24.3M (ie Java bloat with no work done)
Coincidentally or otherwise, both consumed an additional 8MB to build the dataset.
As reps increased, TestC consumed the same 10M regardless of repetitions, which is expected.
TestJava with forced GC remained steady at 32M regardless of repetitions, which is expected.
TestJava without forced GC climbed very substantially to a steady state of about 700M. Make no mistake, I feel this is very high, however the performance achieved by deferring GC cycles is impressive. Java’s heuristics seem to let garbage accumulate up to about 10% of free RAM (roughly 7GB free in my case, hence the ~700M).
Note that at 10k reps, the benchmark generated 80G worth of data, so it is not accurate to suggest Java is faster because java is leaking data. The garbage collector IS running (ballpark math = 80G/700M=114 GC cycles). Java has tuning options that I don’t have time to look at, just know that these results are the defaults (the C++ version is compiled with -O3). I can think of some scenarios where it could make sense to trade free memory for a performance boost in enterprise applications, it’s pretty cool that garbage collectors give us this option.
Between your posts and kwan_e’s posts from last time, I’ve tried to make the comparisons fair. So thank you both for challenging my results, that’s helped me to improve them. Obviously a micro-benchmark is not the end of the story, real world application benchmarks would be interesting. Nevertheless, I’m hopeful we may finally agree that GC can beat alloc+free for some heap usage patterns (and I readily concede there are usage patterns where it does not).
Alfman,
No it wouldn’t. The C++ version will end, and the operating system reclaims the actual memory. That’s exactly the same as in Java. The program “ends” and the JVM reclaims the memory.
And so do pools. Which all non-GC languages have implementations of, and some even allow the overriding of the default allocator to use such pools for certain classes of objects. The benefit is you don’t get the runaway consumption of system memory for exactly the same performance gain.
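For example, something along these lines (a deliberately simplified, single-threaded free-list pool; real implementations also deal with alignment edge cases, thread safety, shrinking and so on):

#include <cstddef>
#include <memory>
#include <new>
#include <vector>

// Simplified fixed-size pool: instances of one class come from a pre-allocated
// free list instead of the general-purpose heap, by overriding operator new/delete.
template <typename T>
class FixedPool {
    union Slot { Slot* next_free; alignas(T) unsigned char storage[sizeof(T)]; };
    Slot* free_list_ = nullptr;
    std::vector<std::unique_ptr<Slot[]>> blocks_;

    void grow() {
        constexpr std::size_t kBlock = 1024;
        blocks_.emplace_back(new Slot[kBlock]);
        for (std::size_t i = 0; i < kBlock; ++i) {   // thread new slots onto the free list
            blocks_.back()[i].next_free = free_list_;
            free_list_ = &blocks_.back()[i];
        }
    }
public:
    void* allocate() {
        if (!free_list_) grow();
        Slot* slot = free_list_;
        free_list_ = slot->next_free;
        return slot;
    }
    void deallocate(void* p) noexcept {              // O(1), no general heap call
        auto* slot = static_cast<Slot*>(p);
        slot->next_free = free_list_;
        free_list_ = slot;
    }
};

struct Node {
    int value = 0;
    Node* next = nullptr;

    // Route this class's allocations through the pool.
    static void* operator new(std::size_t) { return pool().allocate(); }
    static void  operator delete(void* p) noexcept { pool().deallocate(p); }
private:
    static FixedPool<Node>& pool() { static FixedPool<Node> p; return p; }
};

int main() {
    Node* a = new Node;   // served from the pool
    Node* b = new Node;
    delete a;             // slot goes back on the free list, reusable immediately
    delete b;
    return 0;
}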
I don’t have much interest picking up from last time. The fact is, what I remember from last time, is that you set different rules for each side for the comparison, and the Java side was pretty much a cheat all round – cheats you wouldn’t allow for the non-Java version, even though those are common/every day coding patterns that don’t rely on any fancy trickery. Stuff like implementing a tree/graph as an array (ie, a pool) instead of linked structures.
It’s not a memory leak, it’s an intentional optimization that many people often do (although I typically do an “#ifndef FAST_EXIT” and “free()” anyway, so it’s easier to see actual leaks that weren’t intentional before a release).
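Something like this, in other words (“FAST_EXIT” being whatever name you give that build flag; purely an example):

#include <cstdio>
#include <cstdlib>

// Skip cleanup in "fast exit" builds (the OS reclaims everything at process
// exit anyway), but keep the free() compiled in otherwise so tools like
// valgrind only report the leaks that weren't intentional.
int main() {
    char* buffer = static_cast<char*>(std::malloc(64 * 1024));
    if (!buffer) return 1;
    std::snprintf(buffer, 64 * 1024, "do the real work here");
    std::puts(buffer);

#ifndef FAST_EXIT
    std::free(buffer);   // intentionally omitted when FAST_EXIT is defined
#endif
    return 0;            // process termination reclaims the memory either way
}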
Well, no. The “system.gc()” is treated as a suggestion (whichever JVM you’re using can completely or partially ignore it if it wants and there’s no guarantee that it actually frees all possible garbage and doesn’t just free the easiest/most recent garbage).
Of course the next step would be to do a benchmark of something that doesn’t allocate memory at all (with the same compilers, JVM, optimization settings and computer); to see if Java is faster (after the “JIT warmup”) than C++ for reasons that have nothing to do with GC whatsoever (e.g. for all I know, maybe you forgot to enable the C++ compiler’s optimizer).
kwan_e,
Brendan,
Not for nothing, but any other day both of you would call this an application memory leak by definition 🙂 Sure the OS has its own page allocator and it will reclaim the application’s pages when the application terminates, however that’s happening at a different level. While sometimes programmers don’t bother to clean up resources (memory, files, sockets, etc) in short lived applications, let’s be absolutely clear that eliminating “free” is not something programmers get away with for very long inside memory hungry applications. This is something that all of us already know, I shouldn’t even have to argue it. But anyways I modified TestC to leak its memory by removing the delete calls; all of us already know what’s going to happen, but here it is anyways:
It consumed all system memory plus swap, and then was killed by the OOM killer before finishing its work. So no, that’s not equivalent to java. It will work for short runs, but eventually the C++ memory leaks will result in failure.
kwan_e,
You can try pools if you want, I don’t object (no pun intended). I don’t have the time to do it myself, but I encourage you to rewrite both TestJava and TestC to use pools and see what happens. My guess is that if you do away with a language’s dynamic heap allocations in both languages, then you’ll get similar performance and memory usage in both languages.
kwan_e,
If you have more concerns as to why the latest version is unfair to C++, then please let me know. Otherwise I think it’s time for a moment of reckoning. Should we allow our preconceived notions to overrule the data? That’s not a scientific way to approach the topic. I don’t want this to end this way. Let me ask you, what data would you require to change your mind? Or is your mind set regardless of data?
I was also like you, and was in the camp that figured GC was always inherently slow, but I learned that those assumptions were wrong and I changed my opinion.
Brendan,
That would be a fair point, you are right that implementations might handle it differently, however the chart provided earlier shows that java did in fact free the unreferenced objects. The java version is freeing the objects just as the C++ version is.
Brendan,
I did use -O3, here’s my makefile entry:
Regardless, I provided the source so that you don’t have to take my word for it.
It’s certainly valid to benchmark Java and C++ on non-heap oriented tasks, however my goal here is to encourage people to question their own assumptions that alloc/free is always better than GC. Sometimes it isn’t!
I do appreciate the different points of view, however my hope is to chip away at some of the preconceived bias labeling GC as a bad performer when in fact there are times it can perform better than normal alloc+free heap calls. The deferred work in GC heaps can result in less work and therefore higher performance overall. For anyone who wants a compelling technical reason to dislike GC…well there’s still higher memory consumption and GC cycle induced jitter 🙂
Sigh. Wrote a reply, tried to submit and got a bizarre “wordpress detected a security problem” nonsense. Racing against the stupid “edit timer” to rewrite an empty test reply and losing the race.
” It’s certainly valid to benchmark Java and C++ on non-heap oriented tasks, however my goal here is to encourage people to question their own assumptions that alloc/free is always better than GC. Sometimes it isn’t! ”
No. Common sense/logic suggests that GC must always be slower in a fair “apples to apples” comparison (the extra overhead of finding the garbage, with no other differences at all, can not make software faster).
The question here is, why is Java so much faster than C++ that the extra overhead of GC is compensated for? For example; maybe Java was smart enough to realize that (in “TestJava”) all the children are the same (and essentially “pure” in a functional programming sense), so it only needs one of them and can do “count += 3 * child1.Count();”, and avoid 66% of the work, but the C++ compiler wasn’t able to realize that so it’s doing 3 times the amount of work.
Mostly; you’ve shown that (for one specific program, for one specific Java compiler and JVM and one specific C++ compiler), Java can be faster than C++ for some unknown reason; but you’ve failed to prove that the reason has anything to do with GC and failed to show that GC can be faster.
UNRELATED:
I think I’ve narrowed down the WordPress “security” problem, by submitting an almost empty post and then using “edit” to add in pre-written text (to avoid the horrifically painful “edit timer race” that I lost on the first attempt) one paragraph at a time to see where it craps its pants.
The text that triggers the WordPress bug contains a mathematical expression, like “value equals value to the power of three plus one”. Every time I try that expression I either get a “security problem detected with your request” page (if it’s a clean post) or the site fails to respond (if it was an edit to an existing post).
Alfman,
Yes I would. And that is why I’m pretty sure I also called the Java version’s memory usage a memory leak. From memory, the Java version on my machine used up 4GBs for a program that does nothing. Not calling that a memory leak would be insane.
No it’s not. The OS is the “VM” for the native program. The JVM is the “OS” for the Java program. Exactly the same level. Even more so in the enterprise world where Java programs run in application servers because they use up too many resources otherwise.
But that’s not equivalent either. You seem to forget the point of the reps. The reps are merely there to smooth out the averages. The reps are simulating a program run from start to finish. That means the programs should end every rep, instead of accumulating from previous program runs. On my end, I’ve already made a concession to allow your Java benchmark to keep the JVM up between reps, which I realize is now an unfortunate decision as it’s made you more wedded to bad data.
But you can’t do away with dynamic heap allocations in those languages. You can’t have an array of objects in Java like you can in value semantic languages, because all the individual objects are themselves individually allocated on the heap.
I already showed from last time, in my own testing, that if you called System.gc() the same amount of times the C++ program does the cleanup run, the Java version slows to a crawl. Then you made some excuse about “the real world”, completely contradicting your earlier pleas to stick to an ultra literal interpretation of the benchmark which is not “real world”.
So the question is not what data would change my mind. I already provided my data. The question is why doesn’t it change your mind? To me, it seems like because you went to the trouble of cooking up a pointless benchmark that you now have to defend it against all sense.
This is precisely why in many other comments I have often said that bad data is worse than no data. People are too wedded to any data that took some effort to gather because they don’t like the idea that their effort was all for naught. That’s why, when that happens, it is called pseudo-science. It looks like science because of a focus on data, but lacks the real spirit of science – chucking out the bad data, even if it costs billions of dollars to gather.
Brendan,
Your question has an implied assumption though. You are assuming that the GC always has extra overhead. Sometimes GC incurs more overhead, sometimes it does not. I don’t know why you ignored the example I provided earlier illustrating the logic.
Let’s assume that you are right about this: Java’s GC performs worse than C++’s allocator. Despite the overhead of Java’s GC, regardless of the reasons, you openly admit that for some algorithms Java is still able to produce more optimal code overall than C++? I’ve got to warn you, that’s sacrilegious in certain C++ circles 🙂
I’d like to get more data in response to your questions, however unfortunately I’ve run out of time before my trip. Since nobody else here is defending GC, it’ll have to wait until another time. It’s hard to tell when this topic will come up again, haha. As always, thanks for the discussion!
I’ve experienced a similar wordpress bug a few times, although I never knew what caused it. Also I’ve witnessed some caching bugs. Issues that I raised, like publishing the md5sum of our emails, are still unresolved. Alas, I’m thankful for what we have; I very much doubt osnews is interested in fixing wordpress, they chose it because they didn’t want to maintain anything and I can’t blame them.
Crazy thought, but why not use garbage collection available in modern C++ rather than Java so the app is running in the same language? Then you’re really benchmarking the GC and not any other aspects of the two.
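For example, something like the Boehm-Demers-Weiser collector can be dropped into a C++ program (a rough sketch only; it assumes libgc and its headers are installed and you link with -lgc, and on some distributions the header is <gc/gc.h>):

#include <gc.h>
#include <cstdio>

// Allocate through the conservative Boehm collector and never call free:
// unreachable nodes are reclaimed by the GC, giving the same "no explicit free"
// pattern as the Java benchmark, but staying within C++.
struct Node {
    Node* child;
    long  payload;
};

int main() {
    GC_INIT();
    Node* head = nullptr;
    for (long i = 0; i < 10000000; ++i) {
        Node* n = static_cast<Node*>(GC_MALLOC(sizeof(Node)));
        n->payload = i;
        n->child = (i % 1000 == 0) ? nullptr : head;  // most of the chain becomes garbage
        head = n;
    }
    std::printf("last payload: %ld\n", head->payload);
    return 0;
}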
As for the leak discussion, not freeing the memory in C++ is a memory leak. delete/free must be called for heap allocations, period. Not calling free is bad programming, period.
As for java, you do not need to explicitly call System.gc. it will eventually GC anything falling out of scope. You could also try different garbage collectors in java to get a sense of how they each perform and tune how frequently it will run. This will result in varied results.
For those arguing about small memory usage and short run applications, that’s a very specific case and only indicates that system tools like cp, rm, and so on should probably stay in a language like c or rust.
Truly paranoid performance freaks wouldn’t even use heap allocations in C at all. Everything would be on the stack like rust. I worked at a company that banned using malloc/free in C code due to the performance cost that their previous manager decided was too much in 1985 on sun hardware. (this was 2011 and they had switched to linux/x86) Of course the code was also pre ANSI C so there’s that.
No. For example; if you have code that intentionally skips “free(something);” and just calls “exit(0);” then it’s not a memory leak because it’s intentional, because the time it spends “unfreed” is irrelevant, and because the amount of memory left temporarily “unfreed” is bounded (not continually increasing). In this case it’s a perfectly legitimate optimization, not a problem, and not a leak.
Once you accept this, you realize that the world is not full of “black or white” naiveties. There’s a sliding scale where “free before termination” ranges from “good” to “bad” depending on multiple factors (how much memory, how much does it cost to free, how much time before the process terminates, ..).
Sure, if you are going to do this kind of thing it’s nice to make it conditional, so that (e.g. when you’re using tools like valgrind to find real problems) you don’t get hassled by “intentional non-problems”.
Sure; for C++ (where you have an unknown amount of code doing who-knows-what behind your back, ruining your ability to rationalize about what the code actually does) it’s frowned on out of habit (e.g. because some tosser is going to whine about “Oh, what if someone adds code that does something important in a destructor in 20 years time?”), but that doesn’t make it bad for all conceivable cases. Heck; I’d even assume that it is something your C++ libraries are doing behind your back (e.g. not freeing stuff created between the process starting and your “main()” being called) and that you’re using this kind of optimization whether you know it or not.
I mean honestly, do you free the memory consumed by command line args after you’ve parsed them?
put it in a pcie4 slot and ship it. like Ageia you will be bought up by a giant for billions.
edit: PhysX and S3
While I’m skeptical about a system that does need software-managed GC, I quite like the idea of the CPU handing off the GC to a heavily optimised and very efficient dedicated external device.
I suppose different users will have different views on this based on how much garbage they actually have to collect.
If anyone wants to actually read the paper this article references without having an IEEE Micro subscription you can find it here:
https://people.eecs.berkeley.edu/~krste/papers/maas-isca18-hwgc.pdf
Not my area of expertise, but after reading the paper it’s clear that this is not “an alternative to garbage collection” at all. The idea here is a fairly low complexity accelerator unit that would assist a traditional software GC in doing specific tasks (i.e. similar in nature to the page-table tree walker hardware found in most MMUs).
They make a fairly sound argument that in most modern concurrent GCs it is inefficient and far from optimal to spin off GC threads on separate cores, because such threads generally do very mundane and embarrassingly parallel work that barely utilizes most of the vast resources of a CPU… Having a memory mapped accelerator for mark and sweep operations that can just go off and do its thing directly through the memory controller without polluting caches or interfering with threads that are doing actual work doesn’t sound like a terribly bad idea. The bulk of the GC is still orchestrated through software, it just offloads certain common operations to the accelerator to ease its burden.
It might not be all that big of a deal on desktop class chips, but such an accelerator has significant performance benefits for embedded stuff like low end ARM and/or RISC-V and whatnot using slower in-order cores. They show 4x-10x improvements in overall GC performance in a Java VM while using less power to boot.
Doesn’t seem like that bad of an idea to me. It would require some significant customization of the GCs in existing languages that want to support such hardware, but the basic operations this thing accelerates are pretty universal in modern GCs. It is equally applicable to stop-the-world type GCs, incremental GCs, and concurrent GCs – although they mention they have more work to do to properly support generational GCs.
galvanash,
Thank you, it’s helpful to read the actual paper! I made some pretty good guesses anyways, haha.
With respect to energy consumption, they probably have a point that general purpose cores are less efficient than a special purpose GC processor. With respect to their performance numbers, I am extremely skeptical because a CPU should easily be able to saturate the memory bandwidth; the fact that their model does not seems wrong. They’re simulating the whole thing including the CPU and DDR on an FPGA, so something could be fishy there. Maybe I’m misunderstanding something, but it doesn’t look to me like their model accurately represents the performance of real CPU+DDR hardware, at least for intel CPUs. Are ARM CPUs exceptionally bad at saturating memory? Maybe I should conduct some tests on real hardware, haha.
It may not be a terrible idea, but their work is tailored for a specific JVM implementation. My speculation about them needing to customize the object structure was correct (see the “Bidirectional Object Layout” section). They either need to customize the hardware for a specific language (to understand the language’s reflection structure), or they need to customize the language to fit the hardware. Either way, the hardware would end up being extremely special purpose.
In practice, it would probably take the form of a new hardware standard/API for garbage collection (like cuda and opencl do for GPGPU computing). Then garbage collected applications will need to be re-written to explicitly support it. For that matter, I don’t see why one couldn’t build a hardware assisted mark/sweep GC using GPGPU hardware on the market today.
In general no, but I was more thinking about older in-order designs commonly used in embedded applications with low core counts. On these types of machines concurrent GC is very costly, so you mostly see stop-the-world or tracing GCs being used because they are overall more efficient. Think running memory intensive Java app like Lucene on a Raspberry Pi Zero – that kind of thing.
Is that common? Nope, but you do see it occasionally. It might not be quite as painful with something like this. I agree that they have a lot of work to do to make this worth bothering with for desktop class hardware though…
A computer that does not run a script language does not even know a garbage collector. How it manages heap memory is a central architectural characteristic of any program and the functionality is woven into all the program code. It is hard to see how that can be outsourced to some dedicated hardware.
> A computer that does not run a script language does not even know a garbage collector. How it manages heap memory is a central architectural characteristic of any program and the functionality is woven into all the program code.
Not even close to true. Even ignoring the fact that there are multiple GC-based memory management libraries for C and C++, there are quite a few compiled languages that are garbage-collected instead of requiring the programmer to handle all memory management themselves. On top of that, a lot of large programs that don’t use an external garbage-collection library still do actual garbage collection themselves as part of their own memory management.
Yes, that is how I read it, I figured out those debating a hardware replacement of GC code hadn’t read the article.
It’s pretty obvious the authors discuss an optimised GC device that takes those tasks off the CPU, nothing at all about replacing GC code with GC hardware.
I can imagine as memory grows this issue of clearing the garbage becomes ever more CPU intensive, especially for big data processing.
kwan_e,
How much memory do you have? On my machine java invoked the GC automatically when the heap reached about 10% of free memory (roughly 7GB free in my case, i.e. ~700M). That surprised me as well, but arguably, since free memory is not doing anything anyways, there’s not much to gain by running the GC more frequently, and that’s a function of java’s heuristics. If you don’t like the defaults, you can tune it or explicitly call System.gc (the java version is STILL faster as shown on the chart).
In any case, we already agree that java version uses more memory, but that’s not the point of contention. The GC version can still be faster than alloc+free. That’s what I’d like you to concede because it’s true. If you have enough ram, then it may make sense to choose a GC algorithm over alloc+free for performance reasons.
Are you suggesting to add “system.gc” calls throughout the code everywhere an object is unreferenced, to force its memory to be reclaimed immediately? When you program in javascript and you unreference a few DOM variables here and there, do you insist on garbage collecting them one by one? Or do you defer the GC for awhile and then release them all at once? If you absolutely insist on the former, well then yeah that’s going to perform terribly, but I was right, we don’t do that in the real world. In order to get the performance gains possible with a GC, the implicit trade off is that you have to accept deferred garbage collection.
Maybe you have a special requirement that requires objects to be freed immediately. That’s fine, however there’s tons of software where that’s not the case and many C++ programs would still work fine with deferred deletions. Brendan even endorsed the practice of “ifdef-ing” the frees to defer them until program termination.
I meant taking the dynamic heap allocations out of the equation by using a pool instead. The initial objects would still come from the dynamic heap allocator, but it would be a one time deal (in both C++ and Java).
Please post your code and your results and we see what the difference is. Are you using the same code I posted or are you injecting system.gc everywhere such that you aren’t allowing any time to pass between GC cycles? If you’re a developer bent on making GC perform poorly, then sure it makes sense to run a GC cycle as frequently as possible. However if you want GC to perform well, you allow some time to pass between cycles. If your goal is to see which can give you the best performance, then objectively you should be trying to improve the performance rather than sabotage it 🙂
Alfman,
16GB on a laptop.
Yes it is. You claim that a similar non-GC program would be classified as a leak. Then by that definition, the Java program also leaked. If you excuse the GC program of a leak, then you MUST allow the same concession to a non-GC program.
How much RAM is “enough RAM”? Most systems do not have one running program. Hell, even Java application servers don’t run one Java application. I won’t concede because you haven’t shown that there is an improvement for equal programs. You keep wanting to compare a program that leaks to one that doesn’t, and I’m no longer willing to let you off the hook for that.
I’d choose pooling algorithms over GC algorithms for performance reasons. Because alloc+free rarely counts towards performance in real world programs.
Websites that use heavy Javascript ALREADY perform terribly. Where’s your GC magic there? Why do you think WASM runs faster than Javascript on the same underlying VM? Hell, even Asm.js runs faster than Javascript. Do you know how? By implementing a machine emulator and bypassing all the fancy GC. That’s right: an emulated machine running on a Javascript VM is faster than code written directly in Javascript. Asm.js, and WASM, are direct disproofs of your claims.
I put the GC call at the end of every rep, which is what happens in the C++ version of the code – the destructors run when the object goes out of scope at the end of a rep. And it basically ran about the same as the C++ version – no log scales required.
I’m a developer bent on comparing equal programs. Allowing one to leak is not equal.
Calling system.gc in Java is silly. There isn’t really a reason to use this, ever.
Besides, I thought the whole point here was comparing automatic GC to manual cleanup. Seems silly to cripple the automated part, in that case.
You don’t want equal programs when comparing languages. You want the same tasks done with the same results, then compare how fast it went.
kwan_e,
The GC program does NOT leak, it *defers*. Omitting “free” in a non-GC program IS a leak because the program MUST terminate to reclaim lost memory. Obviously you know this, all programmers should as this is really basic stuff in computer science. GC paradigms including java can and do run in a steady state (ie not losing resources over time due to leaks), the difference being that “frees” are batched into GC cycles rather than processed immediately. To be sure, there are pros and cons, but let’s not lie to ourselves that it’s the same as omitting free/delete in C++.
It’s a tradeoff. Different GC algorithms will use different heuristics, however in general on systems with limited ram compared to the working set, the need for more frequent GC cycles is a con. However if RAM is larger than the working set (and your 16GB of RAM is significantly larger than the benchmark), then deferring frees for longer increases performance as you’ve seen. As an administrator, you could choose to tune it, or use the default heuristics. Either way though, the numbers clearly show that even with GC cycles java’s GC can still beat C++ alloc+free under the favorable conditions we talked about earlier. Not all programs will favor GC for performance (some do and some don’t), however we need to overcome the cognitive biases that lead us to reject the possibility that GC could ever perform as well as or better than alloc+free.
kwan_e,
If you want to run more tests, such as multiple instances of TestC and TestJava concurrently to see what happens, I welcome you to experiment and try those things; all I ask is that you keep an open mind and let the results drive your conclusions.
Who are you trying to convince there? Everyone can see for themselves that’s false so if that is your best argument then it’s not very compelling. The programs are the same in every way except that TestJava uses GC to batch all frees whereas TestC uses “delete” calls. Despite your repeated accusations to the contrary, the fact is that neither program is leaking memory as I’ve provided them. The data shows that both TestJava and TestC are freeing memory. It’s ok to be skeptical, but within reason. I was skeptical too, but our best arguments need to embrace and explain the data rather than contradict it, otherwise they become dogmatic. The best way to convince someone like me to change my mind is by providing data, so instead of telling me I’m wrong, provide the data that shows me I’m wrong.