
It uses a background thread to do most of the work, and it appears the 7ns latency numbers are a little cooked:

1. The paper's 7ns-ish number is really ~8ns in microbenchmarks and ~18ns in applications. The 7ns figure, I'm guessing, comes from the microbenchmarks, and the true application-level number is probably more in the 17ns range.

2. It isn't precisely clear what that number is measuring. The paper says it is the invocation time on the logging thread. Since the thread making the log call just hands most of the work to a background thread through some kind of multi-producer single-consumer queue, this is likely the time to dump the entry into the queue.

So you really aren't logging in 7ns. The way I'm reading this is you're dumping on a queue in 17ns and letting a background thread do the actual work.

The workload is cut down by a preprocessing step that creates a dictionary of static elements to reduce the I/O cost for the thread doing the actual writing (I assume this just means taking the format strings and indexing them, which you could build at runtime, so I'm not sure the preprocessing step is really needed).

My logger that dumps binary blobs onto a ring buffer for another process to write out might be able to beat this invocation latency. This isn't really groundbreaking. I know a few places that log binary blobs and format them later. None of them do the dictionary part, but when that work is going to a background thread, I'm not sure how much it matters.
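
For flavor, here is a minimal sketch of what the hot path of such a binary-blob scheme might look like (my own illustration, not the paper's code, and all names are made up): copy a format-string pointer, a raw TSC sample, and the raw arguments into a ring-buffer slot and return; the background thread formats and writes later.

  // Sketch of the hot path only: single producer, no full-buffer check.
  #include <atomic>
  #include <cstddef>
  #include <cstdint>
  #include <x86intrin.h>   // __rdtsc

  struct LogEntry {
    const char*   fmt;      // pointer to a static format string
    std::uint64_t tsc;      // raw timestamp counter sample
    std::uint64_t args[4];  // raw argument values, decoded later
  };

  constexpr std::size_t kSlots = 1 << 16;        // power of two
  static LogEntry ring[kSlots];
  static std::atomic<std::uint64_t> head{0};     // producer cursor

  inline void log_fast(const char* fmt, std::uint64_t a0, std::uint64_t a1) {
    std::uint64_t slot = head.load(std::memory_order_relaxed);
    LogEntry& e = ring[slot % kSlots];
    e.fmt     = fmt;
    e.tsc     = __rdtsc();
    e.args[0] = a0;
    e.args[1] = a1;
    head.store(slot + 1, std::memory_order_release);  // publish to consumer
  }

The background thread then reads entries up to head, formats using e.fmt, and does the actual writing.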



Yes, the overhead in the logging thread is what this is trying to minimize. The background thread is considered "free". This sort of async logging is a common setup for some class of applications.

And yes, it boils down to writing data to a message queue. Most of the overhead is probably the call to the hardware timestamp counter.


In logging code I wrote, which is basically an SPSC ring buffer, I use a bit of RDTSC assembly, and at startup I calculate the frequency and epoch offset. It has a throughput of around 30 cycles.

That's already ~10 ns, so I'm not sure how they are getting their numbers. If they are timestamping the data when the background thread gets to it, that would be pushing even more work to it. I'm guessing they do, or else they could potentially be logging out-of-order data with multiple threads.
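
Roughly what the startup calibration mentioned above can look like, as a sketch (the names and the 100ms sleep are arbitrary, not from any particular library):

  #include <chrono>
  #include <cstdint>
  #include <thread>
  #include <time.h>        // clock_gettime
  #include <x86intrin.h>   // __rdtsc

  static double        tsc_per_ns = 0.0;
  static std::uint64_t tsc_epoch  = 0;   // TSC sample taken at ns_epoch
  static std::uint64_t ns_epoch   = 0;   // CLOCK_REALTIME at calibration

  static std::uint64_t realtime_ns() {
    timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return std::uint64_t(ts.tv_sec) * 1000000000ull + std::uint64_t(ts.tv_nsec);
  }

  // Call once at startup: derive ticks-per-ns and an epoch offset.
  void calibrate_tsc() {
    std::uint64_t t0 = __rdtsc(), n0 = realtime_ns();
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::uint64_t t1 = __rdtsc(), n1 = realtime_ns();
    tsc_per_ns = double(t1 - t0) / double(n1 - n0);
    tsc_epoch  = t0;
    ns_epoch   = n0;
  }

  // Convert a raw TSC sample to wall-clock nanoseconds.
  std::uint64_t tsc_to_ns(std::uint64_t tsc) {
    return ns_epoch + std::uint64_t(double(tsc - tsc_epoch) / tsc_per_ns);
  }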


In a similar SPSC ring buffer setup I see something like 1.8ns per 16B entry without RDTSC (our design has a cached thread-local tsc that is updated periodically, so this is vaguely reasonable) or 7.6ns per entry with RDTSC (approximate RDTSC cost of 4.8ns). This is on a VM on some sort of Skylake CPU (maybe Skylake X).


They are likely timestamping on the logging thread. Possibly they are just running at higher frequency.


You could store the current date and time in a global variable and have the producers just read it atomically. The consumer thread would then update it periodically.

Timestamps will be somewhat inaccurate but it may help performance.
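
A minimal sketch of that idea (names made up): the consumer thread refreshes a global atomic nanosecond counter every so often, and producers just load it.

  #include <atomic>
  #include <chrono>
  #include <cstdint>
  #include <thread>

  static std::atomic<std::int64_t> coarse_now_ns{0};

  // Runs on the consumer/background thread.
  void clock_updater(const std::atomic<bool>& running) {
    while (running.load(std::memory_order_relaxed)) {
      auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                    std::chrono::system_clock::now().time_since_epoch()).count();
      coarse_now_ns.store(ns, std::memory_order_relaxed);
      std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
  }

  // Producers pay only a relaxed atomic load for a timestamp.
  inline std::int64_t log_timestamp() {
    return coarse_now_ns.load(std::memory_order_relaxed);
  }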


That's what the vDSO version of clock_gettime does. If you use one of the *_COARSE clocks it will only update periodically and be much faster, but that means something like 15 milliseconds' worth of log messages will all have the same timestamp.
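
For reference, reading one of the coarse clocks looks roughly like this (Linux-specific; the resolution is the kernel tick, which you can query with clock_getres):

  #include <cstdint>
  #include <time.h>

  // CLOCK_REALTIME_COARSE returns a value the kernel updates once per tick,
  // so the vDSO call never has to touch the hardware counter.
  std::uint64_t coarse_ns() {
    timespec ts;
    clock_gettime(CLOCK_REALTIME_COARSE, &ts);
    return std::uint64_t(ts.tv_sec) * 1000000000ull + std::uint64_t(ts.tv_nsec);
  }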

The fastest option for nanosecond precision (bonus: it's even sub-nanosecond) is just to store the return value of RDTSC and let the background thread figure it all out. You don't even need to precalculate the frequency or epoch offset. Just write a couple of log messages containing both the RDTSC value and CLOCK_REALTIME, and let the post-processing figure it out.

To cut down on I/O, each log message's timestamp can just be an offset from the last event.

If you are willing to push a lot of work to the background thread and even more to the post-processing step, you really don't need to do very much.
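
A sketch of the offline half (illustrative names, not any particular tool): the log contains occasional (RDTSC, CLOCK_REALTIME) calibration records, and post-processing fits a rate and offset from any two of them to convert every raw TSC value.

  #include <cstdint>

  struct Calib { std::uint64_t tsc; std::uint64_t wall_ns; };  // one per record

  struct TscModel { double ns_per_tick; std::uint64_t tsc0, ns0; };

  // Fit rate and offset from two calibration records found in the log.
  TscModel fit(const Calib& a, const Calib& b) {
    return { double(b.wall_ns - a.wall_ns) / double(b.tsc - a.tsc),
             a.tsc, a.wall_ns };
  }

  // Convert any raw TSC sample (or accumulated TSC deltas) to wall-clock ns.
  std::uint64_t to_wall_ns(const TscModel& m, std::uint64_t tsc) {
    return m.ns0 + std::uint64_t(double(tsc - m.tsc0) * m.ns_per_tick);
  }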


> That's what the vDSO version of clock_gettime does. If you use one of the *_COARSE clocks it will only update periodically and be much faster, but that means something like 15 milliseconds' worth of log messages will all have the same timestamp.

Not sure it matters a lot to have multiple messages with the same timestamp; since they were added in order, you still know which one is older. The problem might arise when you send those logs to a remote place and the insertion order is discarded and the timestamp is used instead.

I assume that when you use a single thread with a queue / ring buffer the order of insertion is kept.


FWIW, I have relied on the accuracy of log timestamps well into the low microseconds.

If you have an event loop, it might be acceptable to sample the counter once per iteration, but any less often than that and it becomes too lossy.


Probably in this case it's important to use some kind of synthetic timestamping to preserve the ordering (for example, a 5.123 [ms] timestamp has ~1,000,000 ns of room under it, so say there are a thousand entries that need to be ordered, one can represent them as 5.123000[n] ... and the "000" part is just a silly in-band way to give a hint to someone who will later examine the logs).


Since you aren't going to be writing a message per nanosecond, you can always just do `last_nanos = max(last_nanos + 1, cur_nanos)` and then use last_nanos for the timestamp. You can even do it in RDTSC ticks and get roughly 1/3-nanosecond values. Obviously the clock isn't nearly that accurate, but it lets you use those fractional nanos to ensure a strictly increasing ordering.
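
As a tiny sketch (single-threaded, names made up):

  #include <algorithm>
  #include <cstdint>

  // Not thread-safe: intended for a single logging (or draining) thread.
  std::uint64_t next_log_ns(std::uint64_t cur_ns) {
    static std::uint64_t last_ns = 0;
    last_ns = std::max(last_ns + 1, cur_ns);
    return last_ns;
  }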


If rdtsc is a cycle counter then this also assumes a constant clock rate.


rdtsc has been "invariant" for a long time.


Oh TIL, I always thought it was a plain cycle counter


It was at first. Until it became obvious that it was useless like that.

You had the OS down-clocking cores to avoid CPUs cooking themselves, causing time to increase at inconsistent rates. Your app queried the clock rate, but unless it kept querying it, the number of cycles was meaningless.

ie:

  (CPU rate changed to 1GHz)
  Query Rate, 1GHz.
  (CPU rate changed to 700MHz)
  Read start time
  Read end time
  convert delta into time, using 1GHz rate... not accurate.
or

  (CPU rate set to 1GHz)
  Read start time
  Read end time
  (CPU rate set to 700MHz)
  Query rate, 700MHz.
  convert delta into time, using 700MHz rate... not accurate.
Next you had multicore, and the OS would down-clock or turn off some idle cores, again, perhaps they were throttled to avoid overheating. Again, the clock rate would change dynamically, but per core. Not only did time not increase at a constant pace, but if a thread jumped between cores, time could appear to go backwards!

It was a mess. Windows introduced a call to get time in a safe way, since it knew when it changed CPU frequencies. Intel fixed the hardware to use a constant clock after just a few models, but the damage was done... It's invariant... usually, but not for certain.


> It's invariant...usually, but not for certain.

You can verify it's invariant through one of the cpuid leaves. There's a feature bit.
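
Concretely, that bit lives in CPUID leaf 0x80000007, EDX bit 8 (invariant TSC). A GCC/Clang check looks something like this:

  #include <cpuid.h>   // __get_cpuid (GCC/Clang)
  #include <cstdio>

  bool has_invariant_tsc() {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
      return false;                 // leaf not supported
    return (edx & (1u << 8)) != 0;  // bit 8: invariant TSC
  }

  int main() {
    std::printf("invariant TSC: %s\n", has_invariant_tsc() ? "yes" : "no");
  }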


I did something similar about a decade ago. The main active costs for logging (assuming you're writing into a fixed-size lockless ring buffer) are the string formatting and the timestamp fetching. I did my own timestamp interpolation with RDTSC and periodic fetches to keep that minimal. The string formatting you can avoid completely by preregistering messages and storing away just the arguments.
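
A sketch of the preregistration idea (illustrative names, not my actual code): each call site registers its format string once and gets back a small integer ID, and the hot path records only the ID plus raw arguments.

  #include <cstdint>
  #include <string>
  #include <vector>

  // Registration happens once per call site at startup; not thread-safe as is.
  static std::vector<std::string> g_registry;   // index == message ID

  std::uint32_t register_message(const char* fmt) {
    g_registry.emplace_back(fmt);
    return static_cast<std::uint32_t>(g_registry.size() - 1);
  }

  // At each call site (e.g. behind a macro):
  //   static const std::uint32_t kId = register_message("order %u filled at %u");
  // The hot path then writes {kId, raw args...} into the ring buffer, and the
  // dumper resolves kId back to the format string when it finally renders text.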

I have also seen people build the logger into a DSO so they can dump strings from the main app binary and record addresses, to avoid the hassle of preregistering. But with preregistering I was able to get the perf all the way down to ~40ns per message without a separate log-dumper thread. At that point you actually start running into disk I/O limitations.

One additional thing I did was to create the ring buffer in a shared memory segment and log to it from all services in the system, so you don't even have to think about crashing services. Anyway, the point is this hardly seems groundbreaking. What I'd like to see is this evolve into a fast syslog-type service where one can just freely log trace messages throughout without worrying about the cost, and we can use them for postmortem analysis or filter at dump time.
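
A rough sketch of mapping such a shared segment (error handling trimmed, the segment name is made up):

  #include <cstddef>
  #include <fcntl.h>      // O_CREAT, O_RDWR
  #include <sys/mman.h>   // shm_open, mmap
  #include <unistd.h>     // ftruncate, close

  // Map (or create) a named shared-memory segment, e.g. "/fastlog". Every
  // service maps the same segment; a separate dumper or post-mortem tool can
  // read it even after a service crashes.
  void* map_log_segment(const char* name, std::size_t bytes, bool create) {
    int fd = shm_open(name, O_RDWR | (create ? O_CREAT : 0), 0600);
    if (fd < 0) return nullptr;
    if (create && ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
      close(fd);
      return nullptr;
    }
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);   // the mapping stays valid after close
    return p == MAP_FAILED ? nullptr : p;
  }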


The only claim that makes sense is that the thread sending the logs is blocked for 7ns; otherwise too much context-dependent overhead comes in to make a claim like this.


Yeah. But that’s okay. For the applications that care about this, it’s just a question of how long the application thread is blocked. The logging thread can spin on its core and take as long as it wants so long as it can burn down the buffer faster than the application thread can add to it.

I thought log4j2 used a disruptor in between app thread and logging thread but maybe I’m wrong.


One of the reasons I think logging as a production tool has risen in status in the last fifteen years is that in the 90s you had languages where it was decided that if your app crashes, we want to know what the last operation in flight was when it crashed.

As we now know the syscalls to force flush to disk often didn’t, so if you were trying to figure out what happened right before a kernel panic, good fucking luck. But if it was a purely userspace problem, you might be okay.

These days we care more about throughput. I’d personally rather see us commit to stats for use cases that require that sort of logging traffic, but I think that’s one of those conversations that’s nearing the top of the heap. Maybe we’ll have that argument as soon as we stop flooding the literature with talk of AI.

So 7ns tells me something about throughput, but I think I'd rather just know the messages per second per CPU. And perhaps more importantly, with a CPU that's 80% saturated doing non-logging tasks, because I need to make forward progress between the log messages.



