
It uses a background thread to do most of the work, and it appears the 7ns latency numbers are a little cooked:

1. The paper's 7ns-ish number is really ~8ns in microbenchmarks and ~18ns in applications. The 7ns figure, I'm guessing, comes from the microbenchmarks, and the true application-level number is probably more in the 17ns range.

2. It isn't precisely clear what that number is measuring. The paper says it is the invocation time on the logging thread. Since the thread making the log call just hands most of the work to a background thread through some kind of multi-producer single-consumer queue, this is likely the time to dump the entry into the queue.

So you really aren't logging in 7ns. The way I'm reading this is you're dumping on a queue in 17ns and letting a background thread do the actual work.

The workload is cut down by a preprocessing step that creates a dictionary of static elements to reduce the I/O cost for the thread doing the actual writing (I assume this just means taking the format strings and indexing them, which you could build at runtime, so I'm not sure the preprocessing step is really needed).

My logger that dumps binary blobs onto a ring buffer for another process to write out might be able to beat this invocation latency. This isn't really groundbreaking. I know a few places that log binary blobs and format them later. None of them do the dictionary part, but when that work is going to a background thread, I'm not sure how much it matters.
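
For flavor, here is a minimal sketch of what the hot path of such a binary-blob scheme might look like (my own illustration, not the paper's code, and all names are made up): copy a format-string pointer, a raw TSC sample, and the raw arguments into a ring-buffer slot and return; the background thread formats and writes later.

  // Sketch of the hot path only: single producer, no full-buffer check.
  #include <atomic>
  #include <cstddef>
  #include <cstdint>
  #include <x86intrin.h>   // __rdtsc

  struct LogEntry {
    const char*   fmt;      // pointer to a static format string
    std::uint64_t tsc;      // raw timestamp counter sample
    std::uint64_t args[4];  // raw argument values, decoded later
  };

  constexpr std::size_t kSlots = 1 << 16;        // power of two
  static LogEntry ring[kSlots];
  static std::atomic<std::uint64_t> head{0};     // producer cursor

  inline void log_fast(const char* fmt, std::uint64_t a0, std::uint64_t a1) {
    std::uint64_t slot = head.load(std::memory_order_relaxed);
    LogEntry& e = ring[slot % kSlots];
    e.fmt     = fmt;
    e.tsc     = __rdtsc();
    e.args[0] = a0;
    e.args[1] = a1;
    head.store(slot + 1, std::memory_order_release);  // publish to consumer
  }

The background thread then reads entries up to head, formats using e.fmt, and does the actual writing.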



Yes, the overhead in the logging thread is what this is trying to minimize. The background thread is considered "free". This sort of async logging is a common setup for some class of applications.

And yes, it boils down to writing data to a message queue. Most of the overhead is probably the call to the hardware timestamp counter.


In logging code I wrote, which is basically an SPSC ring buffer, I use a bit of RDTSC assembly, and at startup I calculate the frequency and epoch offset. It has a throughput of around 30 cycles.

That's already ~10 ns, so I'm not sure how they are getting their numbers. If they are timestamping the data when the background thread gets to it, that would be pushing even more work to it. I'm guessing they do, or else they could potentially be logging out-of-order data with multiple threads.
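
Roughly what the startup calibration mentioned above can look like, as a sketch (the names and the 100ms sleep are arbitrary, not from any particular library):

  #include <chrono>
  #include <cstdint>
  #include <thread>
  #include <time.h>        // clock_gettime
  #include <x86intrin.h>   // __rdtsc

  static double        tsc_per_ns = 0.0;
  static std::uint64_t tsc_epoch  = 0;   // TSC sample taken at ns_epoch
  static std::uint64_t ns_epoch   = 0;   // CLOCK_REALTIME at calibration

  static std::uint64_t realtime_ns() {
    timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return std::uint64_t(ts.tv_sec) * 1000000000ull + std::uint64_t(ts.tv_nsec);
  }

  // Call once at startup: derive ticks-per-ns and an epoch offset.
  void calibrate_tsc() {
    std::uint64_t t0 = __rdtsc(), n0 = realtime_ns();
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::uint64_t t1 = __rdtsc(), n1 = realtime_ns();
    tsc_per_ns = double(t1 - t0) / double(n1 - n0);
    tsc_epoch  = t0;
    ns_epoch   = n0;
  }

  // Convert a raw TSC sample to wall-clock nanoseconds.
  std::uint64_t tsc_to_ns(std::uint64_t tsc) {
    return ns_epoch + std::uint64_t(double(tsc - tsc_epoch) / tsc_per_ns);
  }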


In a similar SPSC ring buffer setup I see something like 1.8ns per 16B entry without RDTSC (our design has a cached thread-local tsc that is updated periodically, so this is vaguely reasonable) or 7.6ns per entry with RDTSC (approximate RDTSC cost of 4.8ns). This is on a VM on some sort of Skylake CPU (maybe Skylake X).


They are likely timestamping on the logging thread. Possibly they are just running at higher frequency.


You could store the current date and time in a global variable and have the producers just read it atomically. The consumer thread would then update it periodically.

Timestamps will be somewhat inaccurate but it may help performance.
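
A minimal sketch of that idea (names made up): the consumer thread refreshes a global atomic nanosecond counter every so often, and producers just load it.

  #include <atomic>
  #include <chrono>
  #include <cstdint>
  #include <thread>

  static std::atomic<std::int64_t> coarse_now_ns{0};

  // Runs on the consumer/background thread.
  void clock_updater(const std::atomic<bool>& running) {
    while (running.load(std::memory_order_relaxed)) {
      auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                    std::chrono::system_clock::now().time_since_epoch()).count();
      coarse_now_ns.store(ns, std::memory_order_relaxed);
      std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
  }

  // Producers pay only a relaxed atomic load for a timestamp.
  inline std::int64_t log_timestamp() {
    return coarse_now_ns.load(std::memory_order_relaxed);
  }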


That's what the vDSO version of clock_gettime does. If you use one of the *_COARSE clocks it will only update periodically and be much faster, but that means something like 15 milliseconds' worth of log messages will all have the same timestamp.
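
For reference, reading one of the coarse clocks looks roughly like this (Linux-specific; the resolution is the kernel tick, which you can query with clock_getres):

  #include <cstdint>
  #include <time.h>

  // CLOCK_REALTIME_COARSE returns a value the kernel updates once per tick,
  // so the vDSO call never has to touch the hardware counter.
  std::uint64_t coarse_ns() {
    timespec ts;
    clock_gettime(CLOCK_REALTIME_COARSE, &ts);
    return std::uint64_t(ts.tv_sec) * 1000000000ull + std::uint64_t(ts.tv_nsec);
  }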

The fastest option for nanosecond precision (bonus: it's even sub-nanosecond) is just to store the return value of RDTSC and let the background thread figure it all out. You don't even need to precalculate the frequency or epoch offset. Just write a couple of log messages containing both the RDTSC value and CLOCK_REALTIME, and let the post-processing figure it out.

To cut down on I/O, each log message's timestamp can just be an offset from the last event.

If you are willing to push a lot of work to the background thread and even more to the post-processing step, you really don't need to do very much.
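
A sketch of the offline half (illustrative names, not any particular tool): the log contains occasional (RDTSC, CLOCK_REALTIME) calibration records, and post-processing fits a rate and offset from any two of them to convert every raw TSC value.

  #include <cstdint>

  struct Calib { std::uint64_t tsc; std::uint64_t wall_ns; };  // one per record

  struct TscModel { double ns_per_tick; std::uint64_t tsc0, ns0; };

  // Fit rate and offset from two calibration records found in the log.
  TscModel fit(const Calib& a, const Calib& b) {
    return { double(b.wall_ns - a.wall_ns) / double(b.tsc - a.tsc),
             a.tsc, a.wall_ns };
  }

  // Convert any raw TSC sample (or accumulated TSC deltas) to wall-clock ns.
  std::uint64_t to_wall_ns(const TscModel& m, std::uint64_t tsc) {
    return m.ns0 + std::uint64_t(double(tsc - m.tsc0) * m.ns_per_tick);
  }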


> That's what the vDSO version of clock_gettime does. If you use one of the *_COARSE clocks it will only update periodically and be much faster, but that means something like 15 milliseconds' worth of log messages will all have the same timestamp.

Not sure it matters a lot to have multiple messages with the same timestamp; since they were added in order, you still know which one is older. The problem might arise when you send those logs to a remote place and the insertion order is discarded and the timestamp is used instead.

I assume that when you use a single thread with a queue / ring buffer the order of insertion is kept.


FWIW, I have relied on the accuracy of log timestamps well into the low microseconds.

If you have an event loop, it might be acceptable to sample the counter once per iteration, but any less often than that and it becomes too lossy.


Probably in this case it's important to use some kind of synthetic timestamping to preserve the ordering (for example, a 5.123 [ms] timestamp has ~1,000,000 ns of room under it, so say there are a thousand entries that need to be ordered, one can represent them as 5.123000[n] ... and the "000" part is just a silly in-band way to give a hint to someone who will later examine the logs).


Since you aren't going to be writing a message per nanosecond, you can always just do `last_nanos = max(last_nanos + 1, cur_nanos)` and then use last_nanos for the timestamp. You can even do it in RDTSC ticks and get roughly 1/3-nanosecond values. Obviously the clock isn't nearly that accurate, but it lets you use those fractional nanos to ensure a strictly increasing ordering.
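
As a tiny sketch (single-threaded, names made up):

  #include <algorithm>
  #include <cstdint>

  // Not thread-safe: intended for a single logging (or draining) thread.
  std::uint64_t next_log_ns(std::uint64_t cur_ns) {
    static std::uint64_t last_ns = 0;
    last_ns = std::max(last_ns + 1, cur_ns);
    return last_ns;
  }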


If rdtsc is a cycle counter then this also assumes a constant clock rate.


rdtsc has been "invariant" for a long time.


Oh TIL, I always thought it was a plain cycle counter


It was at first. Until it became obvious that it was useless like that.

You had the OS down-clocking cores to avoid CPUs cooking themselves, causing time to increase at inconsistent rates. Your app queried the clock rate, but unless it kept querying it, the number of cycles was meaningless.

ie:

  (CPU rate changed to 1GHz)
  Query Rate, 1GHz.
  (CPU rate changed to 700MHz)
  Read start time
  Read end time
  convert delta into time, using 1GHz rate... not accurate.
or

  (CPU rate set to 1GHz)
  Read start time
  Read end time
  (CPU rate set to 700MHz)
  Query rate, 700MHz.
  convert delta into time, using 700MHz rate... not accurate.
Next you had multicore, and the OS would down-clock or turn off some idle cores, again, perhaps they were throttled to avoid overheating. Again, the clock rate would change dynamically, but per core. Not only did time not increase at a constant pace, but if a thread jumped between cores, time could appear to go backwards!

It was a mess. Windows introduced a call to get time in a safe way, since it knew when it changed CPU frequencies. Intel fixed the hardware to use a constant clock after just a few models, but the damage was done... It's invariant... usually, but not for certain.


> It's invariant...usually, but not for certain.

You can verify it's invariant through one of the cpuid leaves. There's a feature bit.
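
Concretely, that bit lives in CPUID leaf 0x80000007, EDX bit 8 (invariant TSC). A GCC/Clang check looks something like this:

  #include <cpuid.h>   // __get_cpuid (GCC/Clang)
  #include <cstdio>

  bool has_invariant_tsc() {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
      return false;                 // leaf not supported
    return (edx & (1u << 8)) != 0;  // bit 8: invariant TSC
  }

  int main() {
    std::printf("invariant TSC: %s\n", has_invariant_tsc() ? "yes" : "no");
  }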


I did something similar about a decade ago. The main active costs for logging (assuming you're writing into a fixed-size lockless ring buffer) are the string formatting and the timestamp fetching. I did my own timestamp interpolation with RDTSC and periodic fetches to keep that minimal. The string formatting you can avoid completely by preregistering messages and storing away just the arguments.
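
A sketch of the preregistration idea (illustrative names, not my actual code): each call site registers its format string once and gets back a small integer ID, and the hot path records only the ID plus raw arguments.

  #include <cstdint>
  #include <string>
  #include <vector>

  // Registration happens once per call site at startup; not thread-safe as is.
  static std::vector<std::string> g_registry;   // index == message ID

  std::uint32_t register_message(const char* fmt) {
    g_registry.emplace_back(fmt);
    return static_cast<std::uint32_t>(g_registry.size() - 1);
  }

  // At each call site (e.g. behind a macro):
  //   static const std::uint32_t kId = register_message("order %u filled at %u");
  // The hot path then writes {kId, raw args...} into the ring buffer, and the
  // dumper resolves kId back to the format string when it finally renders text.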

I have also seen people build the logger into a DSO so they can dump strings from the main app binary and record addresses, to avoid the hassle of preregistering. But with preregistering I was able to get the perf all the way down to ~40ns per message without a separate log-dumper thread. At that point you actually start running into disk I/O limitations.

One additional thing I did was to create the ring buffer in a shared memory segment and log to it from all services in the system, so you don't even have to think about crashing services. Anyway, the point is this hardly seems groundbreaking. What I'd like to see is this evolve into a fast syslog-type service where one can just freely log trace messages throughout without worrying about the cost, and we can use them for postmortem analysis or filter at dump time.
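
A rough sketch of mapping such a shared segment (error handling trimmed, the segment name is made up):

  #include <cstddef>
  #include <fcntl.h>      // O_CREAT, O_RDWR
  #include <sys/mman.h>   // shm_open, mmap
  #include <unistd.h>     // ftruncate, close

  // Map (or create) a named shared-memory segment, e.g. "/fastlog". Every
  // service maps the same segment; a separate dumper or post-mortem tool can
  // read it even after a service crashes.
  void* map_log_segment(const char* name, std::size_t bytes, bool create) {
    int fd = shm_open(name, O_RDWR | (create ? O_CREAT : 0), 0600);
    if (fd < 0) return nullptr;
    if (create && ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
      close(fd);
      return nullptr;
    }
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);   // the mapping stays valid after close
    return p == MAP_FAILED ? nullptr : p;
  }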


The only claim that makes sense is that the thread sending the logs is blocked for 7ns; otherwise too much context-dependent overhead comes in to make a claim like this.


Yeah. But that’s okay. For the applications that care about this, it’s just a question of how long the application thread is blocked. The logging thread can spin on its core and take as long as it wants so long as it can burn down the buffer faster than the application thread can add to it.

I thought log4j2 used a disruptor in between app thread and logging thread but maybe I’m wrong.


One of the reasons I think logging as a production tool has risen in status in the last fifteen years is that in the 90s you had languages where it was decided that if your app crashes, we want to know what the last operation in flight was when it crashed.

As we now know the syscalls to force flush to disk often didn’t, so if you were trying to figure out what happened right before a kernel panic, good fucking luck. But if it was a purely userspace problem, you might be okay.

These days we care more about throughput. I’d personally rather see us commit to stats for use cases that require that sort of logging traffic, but I think that’s one of those conversations that’s nearing the top of the heap. Maybe we’ll have that argument as soon as we stop flooding the literature with talk of AI.

So 7ns tells me something about throughput, but I think I'd rather just know the messages per second per CPU. And perhaps more importantly, with a CPU that's 80% saturated doing non-logging tasks, because I need to make forward progress between the log messages.



