I’ve just read both parts of the article and I still feel like I’m left with more questions than answers.
The game is bottlenecked on memcpy so hard it takes two seconds to load each time? On a modern machine with double-digit GB/s RAM bandwidth and single-digit GB/s SSD bandwidth, when the game was released on two DVDs and thus can’t have more than a couple dozen GB of assets total[1]. How? OK, they’re doing a memcpy per image row, that’s not nice and can probably cost you an order of magnitude or so, and the assets are JPEG-compressed so it’s another order of magnitude to copy around uncompressed pixels, but still, how?
Furthermore, if it really is bottlenecked on memcpy, why does running on a modern machine not improve things? I almost want to think there’s a fixed amount of per-frame work hardcoded somewhere, and loading DDS is just accounted for incorrectly.
[1] In fact, a screenshot in part 1 shows data.m4b taking up 1.4GB, and the rest of the files shown are either video, sound, or small.
That's what the profiling hinted at, at least, but I don't know how much overhead that tool adds per function call; if you profile a lot of very small/fast functions, you basically just measure which function gets called most.
But you should not underestimate the impact of unnecessarily shoving data around in memory, even with fast RAM. CPU speed has improved much, much more than memory speed over the past decades. If your data layout sucks and you hit L3 or, even worse, actual memory, it's slow as heck relative to L1 (or, better still, no copy at all). And then there's the overhead of the plain function call itself. Since this is a third-party library, you're guaranteed that each call to this wrapped memcpy is an actual call and not inlined.
But in addition to that I'm pretty sure the decoding library used originally isn't nearly as fast as mango.
If the profiler used is a sampling profiler (and it seems to be), then unlike with instrumentation-based profilers, it doesn't add any function call overhead. It just pauses the program every few ms and records what the call stack is at that point. While this makes the data noisier compared to instrumenting all calls, it also makes the data an unbiased approximation of how the program behaves when not being profiled.
But sampling profilers do still tend to "basically just measure which function gets called most". They can tell a program spent a lot of time in a particular function, but they can't count how many times that function was called – so they can't determine whether it's a slow function, or a fast function that was called many times.
To be completely honest, it's surprising to me as well. I would expect it to be bad, but not as bad as it was. I fully expected that the slow part would be decoding, not copying. In fact, my initial plan was to convert the remaining images that couldn't be DDS to Targa, on the assumption it would decode faster. However, when I investigated the slow functions and found they were only copying, I changed tactics, because then in theory that conversion would not make a difference.
There is no fixed amount of per-frame work. After the 550ms hardcoded timer is up, it is blocking during the loading of those images, and during this phase all animations on screen are completely still. I thought to check for this, because it did occur to me that if it tried to render a frame in between loading each image to keep the app responsive, that would push it to be significantly longer, and that would be a pretty normal thing to want to do! But I found no evidence of this happening. Furthermore, I never changed anything but the actual image loading related code - if it tried to push out a frame after every image load or every x number of image loads, those x number of frames wouldn't go away just by making the images load faster, so it'd never have gotten as instant as it did without further changes.
The only explanation I can really fathom is the one I provided. The L_GetBitmapRow function has a bunch of branches at the start of it, it's a DLL export so the actual loop happens in a different DLL, and that happens row by row for 500+ images per node... I can only guess it must come down to poor CPU cache behavior; it's the only thing that makes sense given the data I got. It probably doesn't help that the images are loaded in a single-threaded fashion, either.
That said, there have been plenty of criticisms of my profiling methodology here in these comments, so it would be nice to have someone more experienced in low-level optimization back me up. At the end of the day, I'm pretty sure I'm close enough to right, at least close enough to have created a satisfactory solution :)
I absolutely did not mean to imply that you did a bad job at any point, or to discourage you. The mere fact that you reached that far into the game’s internals, achieved the speedup you were aiming for, and left it completely functional is extremely impressive to me.
And that’s part of why I’m confused. If you’d screwed up the profiling in some obvious way, I’d have chalked it up to bad profiling and been perfectly unconfused. But your methods are good as far as I can see, and with the detail you’ve gone into I feel I see sufficiently far. Also, well, whatever you did, it evidently did help. So the question of what the hell is happening is all the more poignant.
(I agree with the other commenter that you may have dismissed WaitForSingleObject too quickly; can your tools give you flame graphs? In general, though, if machine code produced by an optimizing compiler takes a minute on a modern machine, i.e. hundreds of billions of issued instructions, to process data not measured in gigabytes, then something has gone so wrong that even the most screwed-up of profiling methodologies shouldn’t miss the culprit by that much. A minute of work is bound to be a very, very target-rich environment, enough so that I’d expect even ol’ GDB & Ctrl-C to be helpful. Thus my discounting the possibility that your profiling is wrong.)