This is a bit of a beginners question but I am somewhat new to C/C++. Are there good resources out there which describe the trade offs in writing custom allocators and diferrent strategies for managing memory? I have been having some difficulties finding these sorts of materials. Also are there materials that describe how the various different malloc implementations work?
This question could mean different things. Are you talking about managing memory as in garbage collection schemes, or are you talking about just how to allocate objects in memory to reduce fragmentation etc.?
Scott Meyers in "Effective C++" has an item on custom memory allocation, IIRC, that's worth a read.
Thanks for the reply. I am primarily talking about the latter, but also about alternatives to garbage collection schemes and things you would use to minimize allocations like shared memory implementations of various sorts, reference counting etc. Since I haven't done much coding in C/C++ it's kind of something I have to deal with which I am somewhat unuse to. I will go ahead and see if I can find a copy of that over here. Thanks.
The short answer is that you don't play games with the allocator unless you have rather specific allocation patterns. It's optimized to work well on average, so for a wide variety of allocation patterns. Of course, some apps allocate lots and lots of tiny objects that get destroyed very quickly; others might allocate lots of tiny objects but with significantly different lifespans.
The documentation on this isn't really C/C++ specific except for implementation details. Really anything out there on allocation patterns, allocators etc. would be appropriate.
I wouldn't think too much about implementing memory management (as in GC, refcount, etc) in C/C++ unless you really need to. I'm guessing that you're writing an engine for another language, so you aren't managing C++ itself as much as stuff in that other language.
Note that some management schemes are more expensive than just allocating more things. Of course, over-happy allocation can be a problem depending on the use case: again, it depends.
If I were you, I would use the standard allocators and only worry about it much, much later, once you have a real application to profile and see if allocations are actually hurting your performance.
Thanks. I actually did write a first version using standard STL containers but unfortunately it's hard to tell what is going on behind the scenes with these things. So I thought perhaps I could just do it by hand and have a good sense of what actually was going on. My main concern was if STL container clear() calls actually deallocates memory or not. If so it would be suboptimal since the container in question is quickly refilled and if deallocated would have to be reallocated. Also the deallocation of all items in the list happen all at once. The custom allocator in this case doesnt seem to be too difficult to write but I have been wondering if the trade offs (other than the time it takes to implement) are worth it.
This is for my LPC engine. Though I get the impression that allocation when you have alot of objects being allocated and destroyed quickly can become problematic overhead wise and that in these sorts of situations may require some thought.
The behavior of STL clear is documented for each container type, and is dependent on the specific container.
A list adds and removes list nodes as elements are added and removed, for example, so clear will delete all those list nodes (after calling their destructors).
A vector keeps a separate size and capacity. clear sets size to 0 (after calling destructors for all elements in the vector) but does not change the capacity. That is actually one of the common complaints with vector: there is in fact no way to reduce the vector's capacity (total allocated space) except for copying the vector into a temporary and then using swap.
So far as memory fragmentation, just don't worry about it. While it can lead to performance issues, those issues are simply not worth worrying about in general. If and when you start writing language runtimes meant to be used for massive programs (not a MUD) or on very resource-limited systems (mobile phones) you may need to worry about fragmentation, but for 99.9% of programs it's just not something worth wasting time thinking about. And to be blunt, if you're a beginner, you're not even close to the level where you even _could_ effectively solve fragmentation problems – every solution has its own efficiency trade-offs, and recognizing when to make those takes a lot of practice, and more importantly a lot of hardcore profiling. :)
Implementing a compacting/moving collector is not even remotely trivial. Implementing pool-based allocations is a fair bit easier, but really understanding when to use them and what parameters to use for them is something you need to work up to. And I don't think you'd see much benefit for an LPC system anyway.
Thanks. Yeah I might be jumping the gun a bit in terms of trying stuff I shouldn't perhaps be messing with too much yet… However from your description of list behavior the performance penalty versus using a custom pool type allocator sounds like it is quite substantial not just in terms of fragmentation but in terms of reallocation and needless reallocation as well. I am not sure how new/delete in the list allocators however behave (are they overridden?). i.e. is a chunk of ram reserved by the std::list or is some sort of malloc type call made for every element? This would to me seem to be glacially slow since malloc would have to continually update it's internal data structures as well. Or does it use some sort of placement new() in a preallocated memory block?
In this case I am concerned about in theory potentially 100's of structs being allocated/destroyed every second (probably a worse case scenario but it depends on how the library is implemented)- so it's a noticeable performance bottleneck. As for just doing it the standard way…. I suspect I will have to learn sometime so perhaps best to do so on some project where I am in full control and it essentially doesnt matter if the resulting custom allocation scheme is suboptimal.
I guess you must have extremely different notions of what "glacially slow" is than I do. The std::list is very fast IMO. Obviously, for the best performance for bulk update, you would preallocate the whole chunk and fill things in gradually. But that's not what std::list is for.
Hundreds of allocations a second is nothing. "Profile, don't think", as they say. It's not a noticeable bottleneck unless it actually shows up in your profiling data.
Custom allocators are hard. It's a whole field of research in computer science. People who write custom allocators – who actually come up with good results, that is, not the people who have no real idea what they're doing – tend to be rather competent, have studied the problem extensively in theory and in practice, and base their decisions on heaps upon heaps of data.
It can't be repeated enough. There is no point designing a custom allocator until you have concrete data on how your application uses the memory pool, and concrete profiling data to see exactly where time is being spent.
David, keep in mind that in C++ a "custom allocator" is totally different than a "custom memory manager." :)
Silenus, yes, if you are using a list to create and destroy hundreds or thousands of items per second, you may be better off using a pool allocator. I have to wonder why you're using a list at all, though. Whatever you're doing may be best done with a different data structure, or a different algorithm altogether.
I figured that he was talking about the memory manager, not the STL allocator, given the context, also since he was talking about something reserving a chunk, etc. But in any case, yes, we both agree that a list is probably not an appropriate data structure for this kind of behavior.
I am sort of new at this so you guys might be right. Basically what I am trying to do is implement a fork with delay (i.e. call out system in LPC is the jargon) which can handle potentially 100's of requests per second if so desired (or more) but where you cannot predict really what the maximum size of the number of call requests in the system is. The calls are basically queued in one at a time and then pulled off in groups depending on the delay which is in seconds or number of pulses.
I have looked at the relevant source code a few of the popular drivers and the performance comment comes (second hand) from someone testing one of these drivers under unusual loads. We suspect this has to do with the algorithm used and this I am changing as well- but I figure while I am at it, I might as well try to get the rest of it right rather than coming back to the code in future when I dont really remember what I was trying to do :P. I am not sure what a good data structure for this might be- but I am open to suggestions. I had considered using a min-heap exclusively but I suspect that this might not be ideal since you have alot of remove min calls for short calls which i speculate (hard to determine without profiling and then what should you profile? since this depends alot on how you design the mud library) that short delayed calls probably dominate.
What I am building which is quite similar to code in another driver algorithmically, but the requirements are slightly different and hence the code, is a combination of a min-heap for hopefully "infrequent" long calls and a chunks for short calls which are threaded in a link list like fashion to optimize for individual queuing and removal with the chunks shared between the various lists which are separately chained for different times for the short calls.
If you're dealing with a queue of events, this is almost exactly the kind of problem that really should not be solved by getting anywhere near the malloc implementation. It sounds like you have a priority queue of events, where events are sorted by time to fire (measured in seconds or pulses, irrelevant). An idea, based on no profiling whatsoever and which therefore might be completely wrong, is to use chunk-lists. A chunk-list is a linked list of fixed-size vectors. You get insertion time proportional to the chunk size instead of proportional to the number of elements: when inserting into a chunk, you need only shuffle the elements in the chunk instead of the entire collection. It is better than a vector if you need to insert at (basically) random locations due to sorting by time. Also, removing can be implemented as a single operation by treating the chunks as circular arrays. You can even index the chunks in this case, so that given a time, you can quickly look up which chunk it should be at (or before).
I would treat instant events (delay = 0) separately from timed events (delay > 0), because the former can be added to the end of a vector and then processed iteratively, followed by a single 'clear' operation.
In the end of the day, don't get too crazy. Heck, I'd even start with a plain old STL pqueue, which is as usual a good data structure optimized to the general case. If you find its performance to be unacceptable once you're operating in your real world instead of now, you can deal with it later. It's relatively easy to change implementation details of an event queue. It's very hard to work on a project if you don't start until you have the perfect memory allocator, and that without knowing what the real-world performance characteristics are like.
EDIT: looks like you already had the chunk list idea; ninja'd me there!
Thanks David. Sorry about the post edit I didn't think anyone was posting when I was fixing up the post. I actually considered using a pqueue to quickly get this implemented and actually it might still be a good idea though I think that the STL pqueue has some limitations AFAIK (could be wrong) in that I cannot remove items from the middle of the queue which though an infrequent operation does sometimes occur- but it might be good for a quick test to make sure the rest of the code is functioning correctly.
Well, if removing from the middle really is infrequent, you can always brute-force the removal by creating a whole new pqueue, and copying things over one at a time, excluding the ones you don't like. I admit that I can't think of many reasons why you'd remove from the middle of a pqueue that need to happen particularly rapidly.
If you really do need frequent add/remove from the middle of the queue as well as very low memory allocation overhead, you may be best off writing your own container.
A deque is close to what you want. It allocates chunks of memory (essentially mini-fixed-sized vectors) and tracks an independent start and end index. You can write your own that has independent start/end indexes for each chunk, allowing you to delete items from the middle of a large queue with relatively little overhead.
Before starting down that path, though, make damn sure you actually need to. It can be a fun learning exercise, but writing components you don't need is bad in general.
Well, right, it's a different abstraction, but I meant from an implementation perspective it's the same. Note however that the deque abstraction isn't right here, because you need events to be sorted by time as well. (Unless we're just talking about instantaneous events.)
Right, my point was that you could use an implementation similar to the usual deque internals to handle this. It's really the best you're going to get for random insertion/removal anywhere in the list with a minimal of memory allocations or expensive moving of large numbers of items. Even a sorted vector (using a pqueue adaptor over it) is going to have a whole lot of moving (which is only implemented as a simple memmove if the elements are PODs… and then only on some implementations of the STL) which means O(N) insertion/removal time.
Algorithms is hard. :)
I don't have a copy of the Art of Programming on me, but I wouldn't be surprised if there's a data structure in there for this specific case.
Well I still think using a min binary heap might be a good solution, since it's designed to make removing the smallest element quite fast and allows for removing stuff in the middle relatively quickly if you know which element you want to remove. In my case this will be a slow operation since I need to search through every element until i find the relevant element which is proportional to the size of the array :/. I do wonder if I should try a different sort of heap possibly like a binomial heap or fibbonaci heap though I suspect these are optimized for merging.