How to build software like an SRE (willett.io)
316 points by kiyanwang on Oct 17, 2022 | 226 comments


> Extremely strict RPC settings. I’m talking zero retries (or MAYBE one) [...]

I disagree. If we're talking about distributed systems, then one thing is guaranteed: the network is not going to be reliable. And if we have tens or hundreds of services, this policy means that at the smallest blip the whole thing collapses like a house of cards. If the concern is "it's hard to troubleshoot", well then perhaps implement better logging ("connectivity to service X has been unreliable with a Y% failure rate over the last Z hours" instead of "connection terminated", or no logging at all).


The answer is to "kick retries up the stack"; when you fail to reach a service, you return a 503 and have your clients retry, to avoid a case where every service in the stack starts retrying all at once and causes a massive increase in traffic.
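
In Go, a minimal sketch of that shape might look like this (callDownstream is a hypothetical stand-in for the downstream RPC): the handler doesn't retry in place, it surfaces a 503 with a Retry-After hint and lets the caller decide.

  package main

  import (
      "errors"
      "fmt"
      "net/http"
  )

  // callDownstream is a hypothetical stand-in for an RPC to another service.
  func callDownstream(r *http.Request) (string, error) {
      return "", errors.New("downstream unavailable")
  }

  func handler(w http.ResponseWriter, r *http.Request) {
      result, err := callDownstream(r)
      if err != nil {
          // Don't retry here; tell the caller to retry instead.
          w.Header().Set("Retry-After", "1")
          http.Error(w, "upstream dependency unavailable", http.StatusServiceUnavailable)
          return
      }
      fmt.Fprint(w, result)
  }

  func main() {
      http.HandleFunc("/", handler)
      http.ListenAndServe(":8080", nil)
  }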

IMHO you should only add retries if it proves necessary in practice to reach your SLA, if you're building something that isn't itself triggered by an RPC, or if you're performing an operation that can never be made idempotent.


Sometimes you want the exact opposite, though. Consider an endpoint that makes 100 behind-the-scenes requests (say, to S3). You absolutely want to retry at the lowest level, not the highest level. You could fail on the 99th request. If you kick it up, the caller will retry and you'll do those 99 requests again, instead of just retrying the one that failed. With enough requests, there's a point where you're unlikely ever to succeed without one of the calls failing if you restart from the beginning every time. I don't think you can "one size fits all" this, and that's one of the reasons retries are hard.

I use S3 as an example because it has a comparatively high random failure rate. You MUST implement timeouts and retries if you're using S3 as a backend.
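
To make the difference concrete, here's a rough Go sketch (fetchPart is a made-up stand-in for one S3 GET): retrying at the lowest level means a failure on the 99th part costs one extra call, not 99 repeated ones.

  package main

  import (
      "errors"
      "fmt"
      "math/rand"
  )

  // fetchPart is a hypothetical stand-in for one S3 GET with a small
  // random failure rate.
  func fetchPart(i int) error {
      if rand.Float64() < 0.01 {
          return errors.New("transient failure")
      }
      return nil
  }

  func main() {
      const parts, attempts = 100, 3
      for i := 0; i < parts; i++ {
          var err error
          // Retry at the lowest level: only the failing part is repeated.
          for try := 0; try < attempts; try++ {
              if err = fetchPart(i); err == nil {
                  break
              }
          }
          if err != nil {
              fmt.Printf("part %d failed after %d attempts: %v\n", i, attempts, err)
              return
          }
      }
      fmt.Println("all parts fetched")
  }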


That's why I said "except if your SLA requires it." If you've agreed--with other teams, with outside clients, or just as a project goal--that your service should work on 99.9% of calls, and you find in practice that you need to retry S3 calls to meet that target, then adding retries is reasonable.

If the problem is just that the 99 calls could overload downstream services on retries, the ideal solution is to add rate-limiting, though admittedly that is an imprecise science.


There's never a reason that a human needs to read an error message caused merely by Kubernetes restarting a pod in the normal course of operations.

Passing every error up the stack causes people to be confused by 500-status errors for trivially retryable calls. Browsers (and TCP) natively implement retries despite there being no SLAs in place.


> Consider an endpoint that makes 100 behind-the-scenes requests

Don't do that; don't ever make middleware retry.

An API endpoint should be quick, and if you can't figure out how to make it quick, use batch processing metaphors (submit task/query task) and a task-runner instead. The task-runner can retry indefinitely, but a user needs to be given feedback or they will push the refresh button.
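
For illustration, a bare-bones version of that submit/poll shape in Go might look like this (endpoint names and statuses are invented):

  package main

  import (
      "fmt"
      "net/http"
      "sync"
  )

  var (
      mu    sync.Mutex
      tasks = map[string]string{} // task ID -> status
  )

  func submit(w http.ResponseWriter, r *http.Request) {
      mu.Lock()
      id := fmt.Sprintf("task-%d", len(tasks)+1)
      tasks[id] = "running"
      mu.Unlock()
      go func() {
          // The real work (and any amount of retrying) would happen here;
          // this sketch just marks the task done immediately.
          mu.Lock()
          tasks[id] = "done"
          mu.Unlock()
      }()
      fmt.Fprintln(w, id) // respond immediately; the user polls for status
  }

  func status(w http.ResponseWriter, r *http.Request) {
      mu.Lock()
      defer mu.Unlock()
      fmt.Fprintln(w, tasks[r.URL.Query().Get("id")])
  }

  func main() {
      http.HandleFunc("/tasks", submit)
      http.HandleFunc("/tasks/status", status)
      http.ListenAndServe(":8080", nil)
  }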


Lisp's conditions and restarts are great for this, so you can make the policy decision at a high level while not fully unrolling the stack and allowing retries to be done in the original context if that's the verdict.


If a client retries it’s not necessary for all 100 requests to be attempted again. The system could be designed to only repeat the unsuccessful operations.


You mean like retrying at the lowest level?


You do want the policy on retries to live and be evaluated at the outermost level where your business logic lives. If that's living across an RPC boundary, then you're stuck making this weird trade-off, and I think that's where this back and forth is happening: people have different mental models of the specific service they're familiar with that they're using to test the recommendation against, and because there's a trade-off to be made, the advice can be correct in some contexts and wrong in others. If you can encode your policy and its evaluation generically so it flows through the stack, that's not terrible, although it becomes hard to manage certain other kinds of SLAs (e.g. bounding the latency of your overall operation or the latency of a specific suboperation).


I like the idea of pushing the retry up to the caller, but in a lot of apps I've seen they've built up an in-memory object-graph from the results of calling out to other services. Wouldn't failing due to one bad service and asking the caller to retry result in every other service being unnecessarily hit?


I had a tech lead who vehemently agreed with the parent commenter (retry from the top), but I ended up learning different lessons.

* Differentiate retryable and non-retryable errors. If the service can't return success because the DB it queries is borked, it should send a non-retryable error. Then it won't get overwhelmed by retries from upstream.

* Retry configuration should have sane defaults. Even "retry once" is too often; many services aren't overprovisioned for a 100% increase in traffic. What ended up working for us here was having the retry module collect req/sec statistics, then only allowing 20% of that number in retries/second (see the sketch after this list). Individual requests can be retried twice. That was small enough to not push over any services, but enough retries to compensate for garden-variety unavailability.

* Services shouldn't serve requests first-come-first-served under load. When SHTF, fairness means everyone has to suffer long delays, often much longer than the RPC timeout. Instead, serve them in an unfair order - the most recent requests to come in are the most likely to have a caller still interested in them. Answer those!

* Use headless services in kubernetes. By exposing the replicaset to the client, the client can load balance itself more intelligently. Retries should go to different replicas than the failing request. Furthermore, you can perform request hedging to different replicas than the lagging request.

* Define a degraded form of your in-memory object-graph. If a feature is optional to the core business flow, it shouldn't take down your whole product. This one is a lot more involved. We needed custom monitoring for degraded responses, in-memory collection and storage of "guesses" to substitute for degraded portions of the object graph, as well as some other work I can't think of right now. This does enable an organization to compartmentalize better, having faster, less fearful deploys of newer initiatives.
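
For the retry-budget bullet above, a toy Go sketch of the idea (the 20% ratio and cap are illustrative, not the exact module described): every normal request deposits a fraction of a token, and every retry has to withdraw a whole one.

  package main

  import (
      "fmt"
      "sync"
  )

  // RetryBudget permits retries worth a fixed fraction of overall traffic:
  // every normal request deposits `ratio` tokens, every retry withdraws one.
  type RetryBudget struct {
      mu     sync.Mutex
      tokens float64
      ratio  float64 // e.g. 0.2 => retries may be at most ~20% of traffic
      max    float64 // cap so idle periods can't bank unlimited retries
  }

  func (b *RetryBudget) OnRequest() {
      b.mu.Lock()
      defer b.mu.Unlock()
      b.tokens += b.ratio
      if b.tokens > b.max {
          b.tokens = b.max
      }
  }

  func (b *RetryBudget) AllowRetry() bool {
      b.mu.Lock()
      defer b.mu.Unlock()
      if b.tokens >= 1 {
          b.tokens--
          return true
      }
      return false
  }

  func main() {
      b := &RetryBudget{ratio: 0.2, max: 10}
      for i := 0; i < 10; i++ {
          b.OnRequest() // 10 requests bank 2 tokens' worth of retries
      }
      fmt.Println(b.AllowRetry(), b.AllowRetry(), b.AllowRetry()) // true true false
  }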


It seems like knowing the difference between a retryable and non-retryable error is itself difficult (perhaps impossible in practice).

DB is unreachable? Is it the DB or my host? If it’s the DB, maybe it’s not retryable, but if it’s me (and I’m load-balanced), it is probably retryable.


IMHO even if the DB is not reachable, it should be retryable.


That's a variation of ye olde end-to-end argument.

In fact, this subthread seems to be recreating many of the network protocol design ideas from the ancient days.


> If we're talking about distributed systems, then one thing is guaranteed: the network is not going to be reliable. And if we have tens or hundreds of services, this policy means that at the smallest blip the whole thing collapses like a house of cards.

With RPC, I believe the author is talking about retries at the application level. There are already enough retries in the TCP layer below it that happen with exponential backoff. Tuning that and also your HTTP library's timeout settings is possible if you happen to have a unique enough network that the defaults don't work.
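
For Go's net/http, for example, those knobs look roughly like this (the values are placeholders, not recommendations):

  package main

  import (
      "net"
      "net/http"
      "time"
  )

  func main() {
      client := &http.Client{
          Timeout: 15 * time.Second, // overall cap on the whole request
          Transport: &http.Transport{
              DialContext: (&net.Dialer{
                  Timeout: 2 * time.Second, // TCP connect
              }).DialContext,
              TLSHandshakeTimeout:   2 * time.Second,
              ResponseHeaderTimeout: 5 * time.Second,
          },
      }
      _ = client // use client.Get / client.Do as usual
  }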

But very likely, your slowness or problems will exist in the application layer – either on "your" side (your service is tied up doing something too long) or on the other side (their service is tied up doing something too long). The correct fix is to "Fix the flaky service!" as the author recommends, and this can take many forms – spin up more copies of the service, or fix any CPU or I/O resource problems.

Slapping on another layer of "just retry" on top of all the other retries at the application layer is what the author is recommending against – this is because you will end up inventing a new, more complicated model of a distributed system.


No, TCP retries are not enough.

TCP retries won't help if your backend is restarting, if failover switching is happening, if your overloaded cluster has just been scaled up to add nodes, etc.

A reasonable application-level retry policy (exponential randomized delays, limited attempts) would turn these from a service disruption for the client into a mere delay, often pretty short.
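
Something like this Go sketch, say (attempt count and base delay invented): capped attempts, exponential base delay, full jitter.

  package main

  import (
      "errors"
      "fmt"
      "math/rand"
      "time"
  )

  // retry runs op with at most `attempts` tries, sleeping an exponentially
  // growing, randomized ("full jitter") delay between failures.
  func retry(attempts int, base time.Duration, op func() error) error {
      var err error
      for i := 0; i < attempts; i++ {
          if err = op(); err == nil {
              return nil
          }
          backoff := base << i // 100ms, 200ms, 400ms, ...
          time.Sleep(time.Duration(rand.Int63n(int64(backoff)))) // jitter in [0, backoff)
      }
      return err
  }

  func main() {
      err := retry(4, 100*time.Millisecond, func() error {
          return errors.New("backend restarting") // stand-in for the real call
      })
      fmt.Println(err)
  }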


> TCP retries won't help if your backend is restarting, if failover switching is happening, if your overloaded cluster has just been scaled up to add nodes, etc.

Yeah, this is where the nuance begins.

You are correct that TCP is not always sufficient. Perhaps where we differ is that in my experience, it still helps for this to be a feature of the framework or infrastructure that the applications are running on (e.g. a retry budget in the service mesh, or a load balancer) rather than scattered around in the application itself. At some point it becomes a word game – you could say that the service mesh is also kind of an "application" itself, but the core principle is that the retries should be kept in a few simple, common places that are rarely tuned.

Otherwise, you will find that N developers who are tasked with figuring out something like this will scatter N different versions of your exponential randomized delay policy all across your codebase. It is always possible to avoid this with enough code review discipline, but once the trend starts, it's much harder to say "No, you need to fix this the right way".


I've seen this exact problem many times. Counting TCP, the current system I'm working on has 5 different layers of retries.

TCP, the service mesh, the company wide http client, application retries, and a database backed queueing library that everyone uses.

Imagine what happens when a service is flaky and each one of those layers retries 3-5 times.


Yep. And then the user says.. "Oh, that didn't work - I'll just try again... and again..."


It’s pretty important to implement a mechanism to prevent a thundering herd if you can, and TCP doesn’t help here. Accidental synchronization can have many many causes.

In addition, despite fantasies, TCP in particular is an instance-level concept, and reserving the ability to shoot instances without draining first is a good design.

Application-level user-perceived reliability is an end to end concern, not a transport layer concern.


My experience is that you need to plan for retries at the system level anyway (above application; think “dead letter queue processing” or equivalent).

When you don’t (as we sometimes haven’t), you end up with people having to write scripts or perform manual actions to poorly approximate what the system should have done, often with thousands of orders (or whatever entity matters to your company) in a limbo state.


TCP eventually gives up though, right? Like if you want to tolerate network partitions that last days or weeks, during which the other node might have received a different address, then you're gonna have to move that logic up the stack.


This particular concern cost some 10 folks 2 man-days per head last week across the globe, and due to its consequences the issue got escalated to higher management.

A slightly different topic - a messaging & routing ecosystem that had worked for 10 years without flaw suddenly started exhibiting slowness and would randomly just stop, needing a restart of the client. We debugged like crazy - the Java messaging system by me on one end, the Tibco EMS system on the other. Tibco refused to help due to the server being out of support (note for us/US team). We had network guys on a 13h call too, but they didn't have as much experience with WANs.

After 2 days, they discovered some internal backbone network system between the US and Switzerland had just failed out of the blue in the worst way possible - dropping roughly 20% of the packets, so things kept chugging along somehow, till they didn't (some ACKs on transaction commits occasionally never arrived, and then everything got blocked without a hint why).

2 lessons - don't always doubt yourself and your skills when SHTF. And don't take things like servers, OS, network for granted. Don't expect some clever monitoring already in place will figure out issues for you.


The stage of testing is important here. Early on, I agree with the author. Catch code/config bugs early with no wiggle room for retries. But later, testing in a live system that lives with unreliability, you need those configs that allow for it. This enters the chaos testing phase, where you can assume with some degree that the code works deterministically, but now you have to test how it works in non-deterministic settings. Or more likely, why it failed in retrospect and how it recovered previously. This is much harder.


You are 100% correct. It's crucial to understand and manage the failure modes of your system around transient network failures and permanent bottlenecks.

Retries are a must in most systems, but need to be planned, otherwise you DoS your own network or services.


Agreed, if you have retries then you also need circuit breakers.
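
A toy circuit breaker in Go, just to illustrate the idea (thresholds are arbitrary): after N consecutive failures, stop calling the downstream for a cool-down period.

  package main

  import (
      "errors"
      "fmt"
      "sync"
      "time"
  )

  type Breaker struct {
      mu        sync.Mutex
      failures  int
      threshold int           // consecutive failures before opening
      openUntil time.Time     // while in the future, calls are rejected
      cooldown  time.Duration
  }

  func (b *Breaker) Call(op func() error) error {
      b.mu.Lock()
      if time.Now().Before(b.openUntil) {
          b.mu.Unlock()
          return errors.New("circuit open: not calling downstream")
      }
      b.mu.Unlock()

      err := op()

      b.mu.Lock()
      defer b.mu.Unlock()
      if err == nil {
          b.failures = 0
          return nil
      }
      b.failures++
      if b.failures >= b.threshold {
          b.openUntil = time.Now().Add(b.cooldown) // give the service room to recover
          b.failures = 0
      }
      return err
  }

  func main() {
      b := &Breaker{threshold: 3, cooldown: time.Second}
      for i := 0; i < 5; i++ {
          fmt.Println(b.Call(func() error { return errors.New("boom") }))
      }
  }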


I'm with you. I recognize that "fail hard and restart" can be a valid design, but if you aren't careful you're going to hit some form of thundering herd problem at some point.

> If your service can’t load the config on startup for any reason, it should just crash

Not always the case. You don't want to carry on as usual as if everything is normal, but running in a degraded state or dedicated failure mode that you can coax information out of is often quite useful.

It almost seems like the author errs on the side of KISS to a fault, writing off or failing to recognize things like retry logic, metrics, logging, failure modes, etc.

It strikes me that the author may have dealt with a lot of projects that are far from feature complete and may not have seen one taken all the way to rock-solid. And by rock-solid I mean a system that auto-remediates to a certain extent and lets you ask questions of it when the sky is starting to fall.

That said most of the advice is alright, especially for a relatively young project. Rome wasn't built in a day and much of the failure behavior I tend to plumb into projects gets added only after a problem is encountered.

Edit: grammar


In practice, due to the way network / system failures tend to work at scale, failure of a first retry is generally strongly correlated with failure of a second retry. Thus a second retry can be more problematic than a first (especially if each retry causes load). From that you can infer that a single retry at the highest level is the right approach (most of the time; as always, YMMV). It's worth measuring this for your own services in production with real workloads by including a metric that captures how often a first and a second retry succeed.

When you don't choose zero / one as your multiplier, there's a strong risk of implementing a retry strategy that is multiplicative. E.g. given 3 layers with a try and then 3 retries at each layer you cause a potential 64x (4x4x4) amplification of any failure at the lowest level. Retries are an easy way to overload a service that would otherwise recover from a problematic situation.

Adaptive retry using a token bucket / circuit breaker approach is a reasonable alternative to zero/one.

In practice, for resilient systems, you can actually go even further than zero retries when you have shared knowledge of an outage in the downstream service (due to concurrent calls from the same source): you can choose not to make the call at all, and instead make only a small number of calls to the service to let it recover sanely. Obviously this is only useful for calls that are optional parts of a call chain. An example implementation skips a percentage of calls based on the percentage of failing calls (e.g. if 50% of the calls are failing due to an overloaded downstream service, backing off to only make around 50% of the call volume is directionally appropriate).
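
That last idea can be sketched in a few lines of Go (the failure-rate input is assumed to come from some sliding window you maintain elsewhere):

  package main

  import (
      "fmt"
      "math/rand"
  )

  // shouldCall decides whether to attempt an *optional* downstream call,
  // skipping roughly the same fraction of calls as are currently failing.
  func shouldCall(recentFailureRate float64) bool {
      return rand.Float64() >= recentFailureRate
  }

  func main() {
      // If ~50% of recent calls failed, make only ~50% of the volume.
      made := 0
      for i := 0; i < 1000; i++ {
          if shouldCall(0.5) {
              made++
          }
      }
      fmt.Printf("made %d of 1000 optional calls\n", made)
  }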

Better logging is always appreciated regardless of situation ;)


I was just about to post the exact same thing. 9 times out of 10 the issue isn't a broken service; it's going to be something environmental. You absolutely must have retries. Set alerts for elevated numbers of retries so you can respond when the system can't recover on its own, but don't leave retries out - you'll just generate alert fatigue for whoever is on call in your teams.


Retries are complicated to implement properly; you have so many timers in an HTTP request (TCP handshake, TLS, headers, sending bytes, etc.). Say you have 15 sec to do an API call: you have to take all the timers into consideration for the retry to be effective, and also to cancel the request if you go over 15 sec.
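
In Go, one way to keep every attempt inside a single overall budget (reusing the 15 sec figure from above; the URL is made up) is to hang all retries off one context deadline:

  package main

  import (
      "context"
      "fmt"
      "net/http"
      "time"
  )

  func main() {
      // One deadline for the whole operation: handshake, headers, body,
      // and every retry all draw from the same 15 seconds.
      ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
      defer cancel()

      var lastErr error
      for attempt := 0; attempt < 3; attempt++ {
          req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com/api", nil)
          if err != nil {
              lastErr = err
              break
          }
          resp, err := http.DefaultClient.Do(req)
          if err == nil {
              resp.Body.Close()
              fmt.Println("status:", resp.Status)
              return
          }
          lastErr = err // context expiry also cancels in-flight attempts
          if ctx.Err() != nil {
              break // budget exhausted; don't keep retrying
          }
      }
      fmt.Println("failed:", lastErr)
  }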


> Extremely strict RPC settings. I’m talking zero retries (or MAYBE one) [...]

> I disagree

I disagree with that too, pretty strongly.

The network is unreliable. Technology has become pretty good, to the point that we have become spoiled and take the reliability for granted. But it is at its very core an unreliable medium. Applications should expect that and survive accordingly.

Now, before packets go out to the network, or after they have made their way inside your network, they are in an environment over which you have a degree of control. Errors and retries should be monitored. If they increase and remain elevated, they should be investigated. But guess what: if your services are resilient, you should have some time to investigate before things start breaking down - as they would if you treated everything as if it were _localhost_.

Case in point: our app started breaking in horrendous ways once we deployed it in China and it tried to cross over the Great Firewall. Most issues would have been survivable if they just backed off and retried. The GFW doesn't usually like the first attempt to/from a new address, but it will usually allow the traffic after a while. We had to pay a company a not insignificant chunk of cash to get better connectivity (that would have been fine as an improvement, but not to make the system work at all).

Retries are at the core of everything we do. TCP has retries (if the author followed their own advice, they should switch to UDP!). Kubernetes has a whole bunch of retries before reaching error states: CrashLoopBackOff, ImagePullBackOff. Your ethernet card has retries. But it also tracks errors.

Track errors. And have retries. For as long as it makes sense to retry for your use-case.

> well then perhaps implement better logging

Not logging. Metrics! Logs are good for troubleshooting. For signaling issues, metrics are better. Errors can be a simple counter that you increment. Have Prometheus or similar scrape that and throw alerts as necessary.
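
With the Go Prometheus client, for instance, the counter-plus-scrape setup is only a few lines (the metric name is invented):

  package main

  import (
      "net/http"

      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promhttp"
  )

  var rpcErrors = prometheus.NewCounter(prometheus.CounterOpts{
      Name: "rpc_errors_total",
      Help: "Count of failed RPC calls to downstream services.",
  })

  func main() {
      prometheus.MustRegister(rpcErrors)

      // Somewhere in your call path:
      rpcErrors.Inc() // increment on each error; alert on the rate

      // Expose /metrics for Prometheus to scrape.
      http.Handle("/metrics", promhttp.Handler())
      http.ListenAndServe(":9090", nil)
  }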


Yup, better advice (IMO) is to put metrics on all incoming requests and any external services. Add request counts, request time buckets, and error counters. A lot of APM agents will instrument this automatically, but it's still easy to do yourself. Some frameworks have this functionality built in.


Structured binary logging is better than doing unstructured/text logging and metrics separately. With structured binary logs you can extract whatever metrics you want to your heart's content, and you have the freedom to turn metrics processing on/off as needed.


Agreed. That bullet point struck me as very wrong. Returning failures downstream is costly and most likely a terrible user experience.


I prefer to keep statistics like that in my metrics system instead of my logging system.


Big +1 to “Never give up on local testing”. The current code base I'm dealing with has many tests that require interaction with a dev environment in the cloud, and occasionally these tests fail due to a timeout or some other thing not related to actually testing the code (and instead reveal that I forgot to refresh my MFA).

Additionally, the dependency thing can be huge; we ran into a weird bug installing a particular dependency on our CI system, so our test there keeps failing, but being able to run it locally lets us know that the changes did not break our actual code tests.


Then he contradicts it somewhat later by saying unit tests are the least important. I think he’s dead wrong there.


Yeah, that advice was bad and looked like it was written by someone that's done too many deployments and not enough development.

He's certainly right that deployment validation is the best test at some level. But that completely leaves off the most important aspect: the mistakes you inevitably find during deployment validation should directly guide your test development to prevent regressions.

If you do this, future component upgrades and deployments get easier. It's well worth the extra effort.


He's not discouraging automated testing; he's just saying that we should prefer integration tests (using services running locally) to unit tests of individual functions. Look up "testing trophy" for more about this philosophy.


Totally with you on the value of integration tests. I'm familiar with the trophy and the triangle, but he's saying this:

'..with unit tests notably coming in last place – i.e. “only if you have some time”.'

The "only if you have time" part is what I disagree with. If we're doing TDD then I don't see how you can avoid writing unit tests, or even deprioritize them.

It's also kind of weird to separate them like that, as Fowler discussed here: https://martinfowler.com/articles/2021-test-shapes.html


> The "only if you have time" part is what I disagree with. If we're doing TDD then I don't see how you can avoid writing unit tests, or even deprioritize them.

Perhaps the author doesn't recommend TDD? They don't suggest anywhere that they do. It seems to me that the industry these days sees TDD as a tool for occasional use, not the dogma once proposed.


+1. Unit tests are essential for software evolution: refactors, updating business logic, and updating dependencies. More generally, good testing requires a defense-in-depth strategy, and unit testing, integration testing, and canary testing all have a role to play. All other things being equal, it's much better to catch a bug before you even push to the source repo than when it has one foot out the door to your customers.


Yeah, as some of the responses to your reply said, agreed - I've always found value in unit tests, and the "too many deployments and not enough development" comment from jsight rings loud in my head. Deploying to a real non-prod environment as a "test" can be fine for a team, but to do so while not testing for how you expect the code to behave... maybe that works for some folks if you're moving super fast, but I wouldn't be comfortable with it myself.


I totally disagree with the author on testing. Integration tests are more trouble than they are worth in my experience, and unit tests are really useful.


It depends on what type of software you are writing, I suppose.


How does this work in a serverless world? As far as I can tell local dev for most serverless environments that people actually care about is a joke.


Usually there is some sort of local development framework for serverless functions.

Most seem to use Docker; here's one for Google Cloud Functions: https://cloud.google.com/functions/docs/running/overview

If you use something like “cloud run” then it’s containers anyway.


Most people building stuff quickly in serverless seem to eschew the concept of local entirely, and it's something I've recently started doing too. I make a code change locally and it gets picked up by the deployment of my stack associated with my feature branch the second I save the file.

It's an uncomfortable concept at first but I find it's helped me build things faster and has ultimately led to less rabbit hole chasing across multiple categories of issues.


Having used localstack [1], I can vouch that it's not a joke.

[1]: https://github.com/localstack/localstack


thanks for saying that! people are pretty skeptical about localstack when they hear about it for the first time, and don't understand how we could ever emulate something remotely resembling AWS. tbh sometimes i'm baffled myself, it's pretty crazy what it can already do (i work there).


With AWS SAM you can invoke functions locally, run api gateway and use step functions. There's also a 3rd party project to run various AWS services locally, can't remember the project name.

There's also SAM sync that syncs changes you make quickly, so if you have your own dev copy in AWS you can quickly test changes.


There's also a fair amount of lambdas where you can just invoke the lambda handler yourself without anything special other than a compatible version of python/node/etc. Variations of things like:

node -e "console.log(require('./index').handler(require('./event.json')));"


This is for Go, but it would work in any language. What's really worked is the idea that your main loads configuration and injects all of the configured interfaces into your handler.

You can then build all of the testing on mocks/stubs to test the behavior. If you access a database, you access it through the interface, which can mimic the appropriate behavior for your code in test vs. production. If you need to, you can do local integration testing of the db access layer.
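
A minimal Go sketch of that shape (the Store interface and names are invented): main wires up real clients, tests wire up stubs.

  package main

  import (
      "context"
      "fmt"
  )

  // Store is the only thing the handler knows about persistence,
  // so tests can swap in a stub.
  type Store interface {
      Get(ctx context.Context, key string) (string, error)
  }

  type Handler struct {
      store Store
  }

  func (h *Handler) Handle(ctx context.Context, key string) (string, error) {
      return h.store.Get(ctx, key)
  }

  // stubStore is what a unit test would inject instead of a real DB client.
  type stubStore struct{}

  func (stubStore) Get(ctx context.Context, key string) (string, error) {
      return "stubbed-" + key, nil
  }

  func main() {
      // In the real Lambda main you'd construct a DB-backed Store from
      // configuration here, then hand h.Handle to the Lambda runtime.
      h := &Handler{store: stubStore{}}
      out, _ := h.Handle(context.Background(), "order-123")
      fmt.Println(out)
  }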


Honestly, my answer is: don't use serverless, or at least have a non-serverless way of running your code locally, preferably outside of a container/VM. Otherwise you have no way of running or testing your code locally, much less attaching a debugger; IMHO this can slow down dev cycles so much that it cancels out the time saved by adopting a serverless architecture.


Serverless deployments aren't painfully slow like they are with Kubernetes or other container orchestration platforms. At least with AWS Lambda you can change and deploy your code in <10 seconds and have it taking traffic (in an infinite number of "environments"), so developing locally is kind of pointless.


> Serverless deployments aren't painfully slow like they are with Kubernetes

What are you doing with your K8s cluster/containers that is making your deployments slow? My cluster at work pulls down containers faster than my local, and deployments are usually swapping within ~1 minute of being committed in git/the container build finishing…


For Cloudflare Workers, it's no longer a joke either: https://github.com/cloudflare/workerd


> Use Helm. Or some other tool for managing Kubernetes manifests, I’m not picky – the important thing is that you ~never directly use kubectl apply, edit, or delete. The resource lifecycle needs to be findable in version control.

I have to partly disagree with that one. I find tools like Helm obscure things that should be readily visible. My favoured method is to keep manifests in full (which you can source from `helm template`!) as pure YAML files and version those. If possible, freeze versions and go through regular patch cycles to review updates. Whether you apply them through `kubectl apply` or through Argo is irrelevant. I treat the repo as the state and the running cluster as stateless. If it's borked, just redeploy. I don't see it as useful to care too much about the in-cluster resource life-cycle. But I completely agree that resources need to be version-controlled.


> I find tools like Helm to obscure things that should be readily visible.

Goodness, is this ever true. Particularly with Prometheus Operator and all the monitoring bits that go around that. Dealing with this infrastructure breaks a number of the points in the article, like "Deploy everything all the time." and "No in-code fallbacks for configs."

A previous team built this monitoring infrastructure, so when I had to go back in and re-deploy, a bunch of the Helm charts were broken (YAML errors and the like). It hadn't been re-deployed in likely 4-6 months.

Then a lot of the components don't rely on the default configs, but the default configs are there nevertheless. So another team was troubleshooting an issue, and they reached the conclusion that the config for AlertManager was empty - but it's not; the config for AlertManager is in a different directory from the default config. Then an issue with Prom2Teams came up, and Prom2Teams reports an error that it doesn't have permission to load its default configuration file - Prom2Teams runs as a user, and the file is readable only by root. So another team came to the conclusion that Prom2Teams can't read its configuration file. But that's not the file it's actually using to configure the service; it's just a default.

So two red herrings as a result of default config files that aren't being used at all, compounded by Helm obscuring components that should be visible, and ultimately stemming from the inherent complexity of the system.

But in reality, there are issues that make this worse which are unrelated to Helm, Kubernetes, and the Prometheus stack.


Agree with this. Helm is a great tool for making really terrible abstractions over well-designed native configuration. Half the time I either end up having to fork charts and fix them, or literally just write my own.

It can be used for good and it can be used for bad. Less is more with Helm; otherwise it will create bad abstractions.

This (and CNI) are the rough bits of Kubernetes.


I recently discovered `kustomize` and the `kubectl apply -k` flag (which uses `kustomize`), which makes keeping full manifests pretty straightforward. There's only one or two things I dislike about `kustomize`, but those are things that can be worked around.


Kustomize has the best local dev environment for k8s that I have found; hot-reloading your cluster as you edit manifests gives a very tight development feedback loop.

For my money it is worth spending the time to grok the slightly funky overlay semantics, at least for teams with infra focus / dedicated SRE.

However, I'm not certain I recommend it for small teams doing full-stack DevOps, i.e. engineers deploying their own code who aren't k8s experts; if you only work with the Kustomize layer infrequently it can be a bit annoying/unintuitive.

Note you can still use a two-step GitOps process where the Kustomize scripts write to your raw config repo; I think this is a good middle option that keeps the infra legible, while allowing developers the ergonomics of a bit of dynamic logic in their deploy pipelines (e.g. parameters for each environment).


Agree 100%. Plus, Helm runs against a basic tenet of microservices (the usual architecture for apps deployed in k8s these days). People tend to bundle services together when using Helm and the like, which, in time, couples services together.


But don't you actually want a certain amount of coupling in the operations part? After all, you need to ensure services run in a place and in a manner that allows them to find and talk to each other, usually in a combination determined by the goal of a specific deployment - i.e. sometimes you might not need a rail-cargo-service because the customer only ships by truck, etc. Then scaling/autoscaling (if any) needs to be compatible, versions need to be within a certain range, any central data store must be coordinated as well, not to speak of service meshes, chaos experiments and the like. It's a good thing to develop services with minimal coupling, but that stage has different risks and goals from devops/ops, at least in my experience on both sides of the dev/ops transition zone.


> But don't you actually want a certain amount of coupling in the operations part?

In my opinion, for sure. There's a balance to be struck between "too coupled" and "too de-coupled", rather than going too far on either side. It's worth saying that this is also contextual; some projects may be fine with either more or less coupling than others, and that's OK.


I understand but it is a slippery slope.


> My favoured method is to keep manifests in full (which you can source from `helm template`!) as pure yaml files and version that.

I do something like this, but I normally find that helm charts are not parametrized the way I want, and I have to manually modify the output manifests. When updating from helm it can be challenging for other team members to understand what bits we want to take from the new helm template output and which ones we don't. How do you deal with this?

Sometimes I update the helm chart to fit our use case, but it’s still hard if that is not merged upstream (because that means maintaining our own version of the helm chart)


> How do you deal with this?

Isn't that the problem that kustomize is designed to solve? Flux even has a first-class declaration for "take this thing, then kustomize it". The `helm template` into git pattern could be extended to "helm template, write kustomize files, then version control both", since it would capture the end state as well as the diffs that were applied on top of the vanilla chart.

I think the "maintaining our own version of the helm chart" part is only painful if the helm chart itself is moving around a lot, as opposed to using helm releases to carry "--set image.tag=$(( tag + 1 ))" type changes.


Ah that’s a great suggestion! I didn’t think about it even though I use kustomize a lot.


I'd prefer to deploy full manifests as well, but it's not my impression that you can entirely obtain those through "helm template". Certain variables, like "Release.Namespace", are only available when actually being applied, AFAIK.

You will get a manifest, but it will usually be missing certain parts.

I completely agree with the philosophy of just redeploying the cluster if it's borked. I'm using NixOS myself for the task, and was trying to obtain full manifests through "helm template" originally - so I'd love to know if I was just missing something.


This isn't a problem we have seen before, and we deploy almost all of our third-party applications in this manner.

When we generate the templates we use -n to override the namespace.

The command looks like this:

  helm template CHART_NAME CHART_PATH -f CHART_INPUT_PATH/config.yaml --output-dir CHART_OUTPUT_PATH/manifests -n NAMESPACE --include-crds --render-subchart-notes --kube-version KUBE_VERSION


My memory didn't quite serve me right, so it's not exactly as I described, and I can see that, as you say, providing the namespace to the template command does work.

The problem for me is that setting the namespace that way with "helm template" does not seem to add it to manifests that don't explicitly set their namespace from .Release.Namespace.

The rancher 2.6.8 chart does not set this for all manifests, but does for some, so when I set the namespace through template and deployed everything from the manifests folder, I got some objects in the default namespace (because they had none specified) and some in the intended namespace, resulting in an installation that did not work.

As another reply to my comment suggested, this can of course be handled with post-processing of the result of "helm template", though, at the time, I was not certain the problem was limited to this namespace issue, so I didn't feel lucky enough to go down that route. :-)


About the namespace, I usually modify it manually if it’s a few files, otherwise use some post processing like https://github.com/helm/helm/issues/3553#issuecomment-417800...


That could definitely work. And I considered it a bit but didn't feel confident that the problem would be limited to the namespace, so it felt like the wrong tool for the job at the time. :-)

Thank you for the suggestion though. It's comforting to hear that it may actually be a viable approach.


We've deployed our helm charts with Spinnaker. Spinnaker has a nice UI that shows which charts are deployed, which environment variables were used, and the manifest files themselves.


I like to think I'm a pretty good architect - my team respects me, I solve a lot of problems that they feel like they can't, I read lists like this and if I don't follow all the advice, at least most of the advice isn't surprising.

But there's one thing that at this point makes me feel like I'm taking crazy pills because it seems like I disagree with just about everyone. I think unit tests should be first priority, ahead of integration tests and selenium-ish tests (whether you call them frontend tests, functional tests, whatever)

I have experience with all three approaches, but slowly moving from backend to full-stack and frontend, I'm struck by how many people - it almost seems a consensus at this point - argue that integration tests should be your first line of defense. I've long since given up the battle on this; I'm not directly on the frontend teams, so I can't seagull my way in there and contradict the lead.

But I think the point that gets lost on people is that the value of unit tests isn't chiefly the output of running the test suite. It's that the process of writing good unit tests forces you to write well-structured code. Code that is well-structured enough that it actually minimizes your need for integration and selenium-ish tests. But I feel like in the industry (not just my employer), I'm surrounded by people who are comfortable with writing integration tests and calling it good, even with spaghetti-code implementations underneath.

I've always suspected that the ideal testing setup would be mostly unit tests, integration tests only to fill those gaps, and selenium-ish tests (hopefully rare!) only to fill in the final remaining gaps. But I think the system dynamics are set up such that unit tests are rare (due to wanting to finish the Jira story when the feature starts to work), integration tests are frequent since you can happy-path across a large surface area, and large expensive suites of selenium-ish tests are written by a completely different department since you can write them without having to understand the codebase. It just seems like a recipe for poor overall quality and a lot of wasted work/time/money.

But I'm clearly in the minority. Maybe I'm missing something basic that I've somehow never learned in my long career.


What you're missing is that most people in web dev use a full featured framework like Django or Rails and do not need unit tests nearly as much as integration tests because very, very, very often most of the architectural decisions have been made for them and the framework linking it all together is where mismatches between expectations seep in.

"Wups, I pluralized this thing that should have been singular in the routes / urls file"

Also, when developing APIs for the front end it's pretty unlikely that I need to test Rabbit.permanent_url on its own and much more likely I need to test things like listing all the rabbits for a given rabbit farm that are candidates for sale in the local meat market.

Where exactly should this test go? The framework handles all the magic SQL generation and the frontend folks really only care about input -> output.

If you're building everything from the ground up, then of course there will be way more unit tests, but with an established framework you don't have to test everything. You trust the framework to get some things right for you.


There's a couple reasons I personally don't find front-end unit tests valuable:

1. (In my experience) client code is mostly integration: it's integrating local user interactions and remote data APIs into a stable experience. It's rare that bugs come from an idempotent function with a clear I/O that can be unit tested — it's much more likely that bugs come from something like an unexpected API response or a complex combination of user states.

2. TypeScript. Static typing obviates a good chunk of the low-hanging unit tests. And it addresses your point here:

> But I think the point that gets lost on people is that the value of unit tests isn't chiefly the output of running the test suite. It's that the process of writing good unit tests forces you to write well-structured code.

Strict TypeScript (+ ESLint) also does wonders to encourage well-structured code, such as making it hard to have a mystery object passed around your app, collecting new properties as it goes. That mystery object would need a type definition, and would be easier to deal with as a series of discrete states instead of an amalgamation of mutations. Types encourage clear structures and interfaces for your code.

With all that, I'd rather focus my time on type safety + integration tests.


I'm generally a proponent of "why not both?" when it comes to types and unit tests. At least with our codebase (nextjs, typescript strict mode, eslint), there is still a ton of room for improvement.


This is such an important line of thinking. If it has a positive ROI then do them both. (The greater the availability of capital, the more this holds.)


I agree, but I'd disagree with your reasoning on why unit tests deliver better bang for the time spent. In my experience, integration tests are very fragile, and something like Selenium is like testing eggshells by dancing on them. Sure, like all tests they improve reliability, but the amount of effort required to maintain them is enormous compared to unit tests. Given that unit tests generally are at least the same amount of code again, that's one hell of a whack.

Re "It's that the process of writing good unit tests forces you to write well-structured code": it doesn't ring true to me. I've see a lot of beautifully structured code that doesn't have a lot of tests. Much of the Linux kernel is like that. But what is true is it you are forced to write tests, you forced to write code that's testable. As anyone who's tried to write unit tests after the fact will testify, the difference between code that's designed to be testable and code that wasn't written with that in mind is so dramatic, it's almost never worth the effort to retro fit unit tests unless you are doing a major refactor. That's because you have to refactor it to get decent coverage.

Which brings us to the 100% code coverage thing mentioned in the article. The benefit of insisting on 100% code coverage isn't that 100% of the code is tested. It's that 100% of the code can be tested. What less than 100% means is that at some point the programmer gave up making his code testable.

But maybe I'm wrong about the benefits of 100% code coverage for reliability. SQLite's report on the difference achieving DO-178B made to bug report levels was an eye opener. Still, they say that to achieve DO-178B, the size of the unit test code went from 1-2 times the original code base to 7 times. Again, that's a _lot_ of overhead. But maybe that's where we actually need to be.


I remember pushing to get a project to 100% code coverage a few years ago - getting that last 1% was tough but it revealed a bug in a previously uncovered catch block - it was doing something that would have thrown an exception without logging the cause of the original failure.


in my experience the hard parts in software are remote interactions, i.e. behavior that localized unit testing has a hard time to capture, and concurrency.

so the valuable tests, those which actually find issues, are the tests in near production environments under near production load.

now I'm not sure if you describe just superficial testing, at all levels, including the "integration" and "selenium-ish" level?

did you ever measure code coverage across all tests?

my thinking is rather you only need to consider unit tests for those paths that are not touched by your integration and system tests.

my aha moment was when SQLite, famous for their efforts in full MCDC coverage, found uncomfortably many bugs through fuzzing. the SQLite team realized they didn't add corner case checks where it was hard to create a test case "because nobody provides suchandsuch input anyhow".

so my take would be to measure the coverage in the testing of your organization, to look at what you find and to then decide as a team where you have business risk level "undercoverage".


Your ideal is what is commonly described as the "testing pyramid".

It's very well suited to "single artifact" code (a lib, an app with an easy way to drive interaction through a gui, etc...)

It tends to turn into a "testing sandglass" with time (because most of the value of integration tests is also derived from gui/e2e tests.)

Depending on the app, it might makes sense.

Kent C Dodds (author of the poorly named "testing library") is embracing this sandglass shape by calling it the "testing trophy".

Honestly, to me the hardest deterrent to testing are :

* If it's not done from day 1, you end up with that one hard to test piece of code that makes every other piece of code hard to test

* Few people enjoy writing tests (I do, but I reckon I'm part of a minority.)

Do what helps you the most !!


I strongly agree. Most of this felt fine, except about testing.

I'm frequently tasked with writing reliable services, and I always 100% of the time start with tests. Are they perfect? Absolutely not, but I am able to write, test, maintain and iterate on at-scale critical services with great confidence; and 90% of that confidence can come from well structured tests. My code is generally also easy to refactor, understand, port to other languages, etc. Testing is such a critical part of that. I only test things "in staging" as a last check; and very rarely am I disappointed with the behavior (sometimes the speed or scale, but not behavior)


I just struggle to get value out of frontend unit tests when using TypeScript and React. Your main vulnerability is code receiving data in a shape/type it doesn't expect, and it is going to be very difficult for that to happen with React and TS. I still write tests for specific functions and complex regexes, but very rarely will for a React component.


> I've always suspected that the ideal testing setup would be mostly unit tests, integration tests only to fill those gaps, and selenium-ish tests (hopefully rare!) only to fill in the final remaining gaps.

I would have thought this is pretty standard. You've just described the testing pyramid.


Unit tests are great for getting something done correctly the first time, but they aren't as helpful for avoiding regressions over time. New functionality gets lots of manual testing, so you can get away with not having automated tests that test the assembled unit (ahem) that you are responsible for delivering. In the short term, this continues to be true, as the service is maintained by developers who have deep, recent knowledge of the system and are adept at manual testing.

However, once you have a large amount of established, stable functionality, and developers have moved on to other projects, you want to be able to make small marginal changes (extensions and bug fixes) with small marginal effort. Spending hours running manual tests isn't reasonable like it was when the system was getting its first big release. But at the same time you don't want to break all that stuff you aren't testing, so developers are careful to make sure that their changes only affect the functionality they are willing (and able) to manually test. If they find a bug in code that affects the whole system, they often won't fix it there. Large refactorings are completely ruled out. Over time, the consequences of making purely local changes and avoiding refactoring put you on a slide to crufty, non-cohesive, special-case-ridden code.

If you have automated top-level tests that test the complete unit that you are responsible for delivering, you can make whatever changes seem appropriate, even if they have global consequences, and feel confident shipping your code.

Reading guides to unit testing, it's funny that they almost all frame their examples as, you are responsible for delivering a certain class, so let's write unit tests for it. But how often are you responsible for shipping a single class? How do you extend that advice to shipping an entire service? Do you test the service as a running whole, or do you test all the classes in it? For me, tests of the entire service are what gives future developers the confidence to make changes and deliver them without worrying about the global effects of the change they made, so I think that's the most important level of testing in the long term.

Edit/PS: Unit tests of units (classes/functions) that are lower down from the top-level functionality also only help as long as the functionality they test is stable. Higher-level, more public functionality changes more slowly, which means tests at that level require less maintenance over time than lower-level tests that might be invalidated by internal changes due to refactoring.


100% in agreement. The largest projects that I've successfully completed have had extensive unit test and test suites in general.


> I think unit tests should be first priority,

15 YoE here, and I fully agree that unit tests can easily be the best ROI on quality investments. However, coverage and UTs are easy to game, so if you have a toxic culture then folks will simply abuse the goals/metrics, just like any others.

> showing it does what you wanted and doesn’t break everything.

This is an extremely broken premise. Why? Well, how do you show it doesn't break *everything*? Sure, you can easily click through a single happy path of the 2^10 branches in your new feature, but that doesn't convince a rational person that 1) the feature works as designed in all scenarios, or that 2) you didn't break tons of other things.

I've seen this addressed in a couple of ways. 1) Be like "no customers complained", to which I'd share my experience that they rarely do. Most customers shrug, try again, and if they can't get what they want they move to another task, or if it's really critical path they simply churn out of your product to a competitor.

Or 2) using "stats" like Datadog dashboards. Unfortunately those most often simply mean you hit a line of code, maybe with some volume of data (e.g. if you count the length of an array). A Datadog dashboard isn't going to tell you that you're pumping invalid JSON into that VARCHAR field, or that oops, your code is actually returning a 200 OK to the customer when a downstream service fails to accept the data you've accepted as safe in your db...

Unit tests can also be creatively used to isolate the easy-to-test portions and leave the harder-to-simulate things to other layers of testing. E.g. extract a portion of the code out to a function and UT the function; leave the network details to integration tests instead of implementing a full-fledged Mock (which is usually actually a stub, btw [1]).

I've come to the conclusion that a couple of factors are in play - nihilism and ignorance. Sadly, many engineers and their managers (and some product folks too) have started to behave as though quality doesn't matter. And sure, they're right: no one died when you took down prod or broke that feature. But your $1B ARR company can lose approximately an engineer-year every few hours if you screw things up. As for ignorance, so much of tech nowadays is just the blind leading the blind - engineers are promoted for coin tosses gone well rather than smart decisions made [2]; they take absurd risks and lose little (personally) when it fails, but keep the full reward when it works.

[1]: https://jesusvalerareales.medium.com/testing-with-test-doubl... and sinonjs docs https://sinonjs.org/releases/v14/ are both super good resources to help one think about what those "mocks" do for you and how much assurance each provides that what you wanted to happen happen.

[2]: if you don't understand the difference, consider whether an employee should be fired for taking company funds to Vegas and betting it all. Even if they win, that's not the kind of employee you want around.


A colleague of mine once shared his lay sociological theory about dev vs ops, and if taken for what it is -- an essentialization -- it's an interesting perspective.

The idea is that ops people have inherited a blue-collar culture, whereas devs have inherited an office-worker/academic culture.

Ops people conceive of their work as fundamentally operational: progress is measured in terms of actions taken, and while automation is greatly valued, there is nothing inherently "messy" with one-off fixes; the objective is to get things working now. The pathological case for this mindset is that of constantly being on the back-foot, responding to incidents with one-off fixes without recognizing that many of them share a common cause that could be addressed.

Dev people conceive of their work as fundamentally intellectual: progress is attained when the problem is correctly conceived, at which point the solution follows naturally. While writing code is greatly valued, most effort should be spent understanding the problem; the objective is to solve it correctly, once and for all. The pathological case for this mindset is that of over-engineering by an ivory-tower idealist, disconnected from the messiness of real-world praxis.

In nearly all orgs I've seen, the proportion of first-generation college grads is greater in ops teams than in dev teams. So too is the proportion of people who come from blue-collar families, or are mechanically inclined (ex: look around and see who tinkers with cars). Likewise, the proportion of people holding graduate degrees is greater in the dev crowd (ex: look around and see who's into math).

It should hopefully be clear that neither is superior to the other. The point is rather that the divide between dev and ops is partly sociological, which means it is largely based on values. Ops will tend to over-value "honest hard work" and dev will tend to over-value "clear articulate thought". There is also some latent, historical tension between these sociological groups, which has a funny way of masquerading as a technical problem. It is helpful to view arguments about "just ship it" vs "design it the right way" through this lens.

Far from being a cute "just so" story, it's been my experience that this dynamic is very important for two reasons: (1) it's harder than you might expect to foster the sense of common destiny required for real "devops"-style collaboration, and (2) each side of the dev-ops divide has a lot to gain from learning when the other side's mindset is helpful, and how to cultivate it in themselves.


i like this observation. it's an interesting dichotomy even if you don't accept the sociological bit. the same thing occurs in machining and fabrication, i think - there's a role for "dog meat" and a role for precision. true wisdom is valuing both and knowing when each is appropriate. i really like your neutral characterization of both attitudes; it's very hard not to be partial to one.


I think there's definitely a large nugget of truth here, even if it's not universally true or accepted wisdom.

I would enjoy reading it expanded on even more; 'ops' people tend to come almost universally from non-university backgrounds, for example, in my experience.


Interesting idea but perhaps you are confusing attitudes to code with attitudes in general.

E.g. Ops tends to be very attached to getting the infrastructure correct, whereas developers care about that less. Developers will write an application that handles all sorts of data edge cases but doesn't react well to network errors.


I don't think these are mutually exclusive. However, a preference for perfecting infrastructure doesn't explain most of the behaviors mentioned above.

IMO, the curious thing is that in addition to focusing on slightly different problems, there exists a cultural difference with respect to work values.


IMO this is seeing things a bit too black and white. Ops is _paid_ to get things back on track ASAP, whereas devs are generally given more time and have the space to think about the elegant solution. And sometimes one-off issues are just that - one-and-done problems that probably won't ever arise again. In that case, especially on a lean team, is it really worth the time to elegantly solve an edge case? Especially when 6 other things are on fire? Just my two cents.


Agreed. Though surely these job requirements select for different backgrounds/values, no?


> Use Helm. Or some other tool for managing Kubernetes manifests, I’m not picky – the important thing is that you ~never directly use kubectl apply, edit, or delete. The resource lifecycle needs to be findable in version control.

I'd say `kubectl apply` fits better with a Git-oriented flow than the imperative `helm install` flow. Nothing about helm guarantees or forces you to use version control; actually the opposite. If `helm install` fails, you need to `helm delete` before you can retry. It has no way of declaratively managing updates at all.

Having a CI job call `kubectl apply -k ./deploy --prune` on every commit is way nicer than helm, in my opinion.
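
For illustration, a minimal `./deploy/kustomization.yaml` for that flow might look something like this (file layout and names are hypothetical):

  # deploy/kustomization.yaml -- hypothetical layout for the CI flow above
  apiVersion: kustomize.config.k8s.io/v1beta1
  kind: Kustomization
  resources:
    - deployment.yaml
    - service.yaml
  commonLabels:
    app: my-app   # one shared label, which could also serve as a prune selector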


Helm also hides complexity in the templates.

There's a lot of valid complaints floating around YAML but what's worse than dealing with YAML is dealing with templated YAML.

kubectl brings everything to the surface. Helm is an unnecessary abstraction.


I think Helm can be nice for templating mildly complex stuff you can't reach simply with Kustomize, but it's true there are a lot of things you can't see until you apply them, even when `helm lint` doesn't fail.

If you need this templating, a thing you can do is use helm for that but not for install/deployment: render the charts with `helm template` and then apply them with `kubectl apply` and/or verify the changes with `kubectl diff`.
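
A minimal sketch of that flow (chart path, release name, and values file are placeholders):

  helm template my-release ./chart -f values.yaml > rendered.yaml
  kubectl diff -f rendered.yaml    # review what would actually change
  kubectl apply -f rendered.yaml   # apply only once the diff looks right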

IIRC Argo CD can show you the differences between the running helm chart and the new version you want to apply if you haven't enabled automated syncing, but I only use Argo CD with Kustomize, so I can't say for sure right now.

edit: by reading the article again I think the spirit of that point is not 'use just helm' but 'don't apply stuff directly'. Long term, if there's more than one person working in Kubernetes, that's prone to failure and configuration drift. Doing the templating in whatever language or system fits your needs and then using something like Argo CD or Octopus to apply them can remove most of those problems.


> by reading the article again I think the spirit of that point is not 'use just helm' but 'don't apply stuff directly'

Yep, this is what I meant! Just "if you're using some kind of manager for your resources, which you probably are, then ONLY use that manager, don't just `apply` stuff out of band" is the idea :)

I might take a pass later at making that more clear


Helm templates tend to be too complex, in my opinion. On the other hand, having 50 services, deployments, ingresses, etc. with names/labels copy/pasted in 3-5 different places is also extremely complex, especially when you want to make a change that updates all of them. Kustomize seems too basic; maybe JSONNet or something fills this gap.


I think one of the strengths of Kustomize is precisely that it's basic - although it may not be as basic as you think if you haven't used it in anger.

I don't want complexity in my manifests. I want them to be to the point.

Renaming in multiple places is very easy with find-replace.


Helm makes sure your tags and other metadata are the same in a more complex setup.

It also lets you calculate a checksum for a config map, which then triggers a restart of a deployment.
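
The usual way to do this is the "Automatically Roll Deployments" trick from the Helm docs: an annotation on the pod template that hashes the config map, roughly like this sketch:

  # in the deployment template of a chart (sketch)
  spec:
    template:
      metadata:
        annotations:
          checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}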

It allows you to do a proper release due to its working conventions.

It also makes it easy to pull out configuration options.

I think kubectl apply is the worst option.


Sounds like they're both the worst option.


Why?


Because we can't begin to agree on how to do this in a standard way. The whole idea of git oriented flows seems to be an afterthought rather than being considered an important part of a standard production k8s deployment from the start. Here are some competing options I see after a quick glance in this thread:

  - kubectl apply -k ./deploy --prune
  - helm install
  - helm template + kubectl apply + kubectl diff
  - Kustomize
  - Argo CD
  - Octopus
  - JSONNet
  - Cue
  - Pulumi
And because this is an advanced topic (most of the above require a large investment reading docs and github issues), I'm guessing many/most people never get beyond ad-hoc kubectl commands and letting their cluster run wild.


Funny that most of those things don't compete: helm template and helm install are two entirely different things. Argo CD is just a way to deliver manifests to your cluster - manifests could originate from helm template, kustomize, raw manifests or JSONNet.

kubectl apply -k is...well...Kustomize...

There is a big distinction between first-party software and third-party software. Helm is fine for first-party software because you control the chart, but awful for 3rd-party software because charts often don't have the level of customization you need.

Reinstalling shit with kustomize that was installed with helm is like 1/4 of my time at work.

Kustomize won't get you 100% though because something has to generate base files first.


k8s doesn't need to solve every aspect of life.

And I'm quite happy, by far the happiest in fact, with my helm + kustomize + argocd workflow.

None of it is complicated, but it works tremendously well.


oh! oh! Don't forget about embedding your k8s manifests and/or helm installs in terraform, and using `terraform apply` to push them out


I agree; helm is too imperative.

Whenever I can, I use helmfile[0] for storing variables for helm since it does add a declarative layer on top of helm.

0 - https://github.com/roboll/helmfile
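
A minimal helmfile.yaml sketch (release name and paths are made up):

  releases:
    - name: myapp
      namespace: prod
      chart: ./charts/myapp
      values:
        - values/prod.yaml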


I tried helmfile, but helm state weirdness became enough of an issue that I investigated `kubectl apply -k` (which uses `kustomize`), and after I got it working it felt like a significant improvement over helm+helmfile. (But I still use `helm` to generate manifests.)


If you don't like Helm, there is Kustomize.


I'd like to second Kustomize. It's not getting enough popularity, IMO. It's also great because it doesn't break tooling by mixing in a templating language on top of a configuration language. If people need more than Kustomize supports, there's always Cue or Pulumi; I'd never recommend Helm, personally.


And if you don't like Kustomize, try Jsonnet. I wish this was a more common setup. This works very well at my company.


If an install fails with helm, you can just roll back to the previous release: helm rollback <RELEASE> [REVISION]


FYI Argo is the evolution of this idea.


This list has things I agree with and things I don’t. You could probably make another similar list that didn’t overlap at all.

That’s because the real way to build reliable software has very little to do with a list of features/tools; it has to do with designing for the sad paths at the same time as the happy paths.

For each failure modality the app should have a conscious decision about what the app will do when the bad thing happens. This can be everything from “this will cause an outage” to “this will route to an alternate system”. What’s important is figuring out up front what those decisions mean for your data, your operations, your business and your users.

Without that all the structured logs and version controlled configs in the world are just window dressing.


Good article, until:

> Use Kubernetes.

Dear God no lol

If you're running a startup that's starting to scale dramatically, I think maybe this is valid advice.

But nowadays you can keep things dead simple and still be able to accomplish most business use-cases.


That stuck out to me as well as being suspect.

> Assuming you have more than one service and more than one instance, you either need or will need stuff like service discovery, autoscaling, and deployment versioning, and you do not want to manage all that by yourself.

To paraphrase, "If you have anything but a tiny amount of complexity to your architecture, you must add in a high degree of complexity". It sounds insane, right? Like, we're skipping some middle steps.

Yes, Kubernetes solves those problems, BUT ALSO the care and feeding of Kubernetes clusters themselves is a significant time investment. Manual scaling sucks, but if you have a good automation game, it's not that bad. Often what people think is a need to scale extremely rapidly is really an organizational problem they are trying to automate around (example: An ECommerce site sends out coupons to 10 million customers without telling the devops teams).


> BUT ALSO the care and feeding of Kubernetes clusters themselves is a significant time investment

AWS and GCP will do this for you for a small markup


While this is true (btw, Azure has a managed offering as well), the complexity is not the only issue. Cost!

The article is basically advocating "if you have more than one service, you need K8S", and sticking your app on Kubernetes in the cloud (managed or not) is a huge cost increase over a single server or a handful of servers.


You're right, but the higher your service count gets, the more you should probably be on k8s. Large scale ECS or Lambda stuff is much worse to maintain


I would disagree in a couple of important ways:

1. The higher your service count gets, the more you should probably be on some container orchestrator. There is more than one.

2. The number of services you have to manage to justify moving to a container orchestrator is most certainly not just > 1. That is crazy. I don't know the right number, as it probably varies according to how mature your organization is and how well the parts work together. But it is definitely higher than > 1!

My last point is: If all teams worked together smoothly, or one team had infinite capacity, or if automation was perfect, etc. etc. etc. we wouldn't need containers, let alone container orchestration. These are solutions to people problems, not technology problems.


You can't just yadda-yadda over the most important parts!


Time and time again, commenters suggest that the reason to use Kubernetes is to enable easy scaling, but this is not the case in my organization. Kubernetes is valuable because it is a standardized and flexible deployment model.

Easy scalability is a nice side effect of that.


I think better phrasing for this would be "develop 12-factor apps". At scale, Kubernetes is great.

When things are still small and simple, an Ansible Playbook and a Docker container can take a company pretty far while still being able to move to k8s pretty easily.


Doing simple things in Kubernetes isn't that much more complicated than doing simple things outside of Kubernetes.


I agree - I think the hardest hump for developers is to learn Kubernetes first. After that, it's really not bad for most workloads.


Kube has hit enough market/mind saturation that using it for more trivial use cases isn't necessarily a horrendous idea anymore.

Just having a Kube available that I can run a quick cron on is valuable.


> Kube has hit enough market/mind saturation

I agree with this. But I really see this from a business perspective. I think k8s works better at companies with large, complex engineering teams and infra. A lot of companies simply aren't ready, nor will they benefit from, introducing k8s into their stack.


GKE/EKS/AKS lower the bar by a lot. There is a ton of value in having just one platform, toolchain, deployment language, set of templates/blueprints, set of dashboards, set of alerts, target for hardening/securing, etc. One of the big upsides of this for me is that it becomes a lot more tractable to get ordinary devs and others involved, since you only have to polish things and train people once.

I've found it's quite neat for e.g. data scientists to be able to toss a quick yaml into git and have their thing running to their specifications in non-prod. Things like flux and Kyverno make sure they are boxed in tightly enough that they can't cause a lot of damage, and if that thing works out, someone adds CI in front of the yaml and some templating to the yaml itself and off we go to prod, and the result is still quite comprehensible, so people are enabled at least to ask the good questions to the right people. That's what I see as one of the killer applications of Kubernetes, to build a lingua franca interface for all things ops that doesn't change (much) for different service levels and even totally different needs, like frontend devs and data scientists.
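
To make that concrete, the "quick yaml" can be as small as something like this (all names are hypothetical):

  # a quick non-prod deployment, sketch only
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: experiment-7
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: experiment-7
    template:
      metadata:
        labels:
          app: experiment-7
      spec:
        containers:
          - name: model
            image: registry.example.com/team/model:latest
            resources:
              limits:
                memory: 2Gi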

It's probably not a great choice for a three-person startup trying to gain traction by moving fast, but IMO the threshold where k8s starts making sense is a lot lower than many here appear to think.


How much overhead does it require to feed and water k8s?


After initial setup, a managed GKE/AKS is pretty hands-off.


Sorry for the throwaway.

> I’ve been doing this “reliability” stuff for a little while now (~5 years), at companies ranging from about 20 developers to over 2,000

I've been doing it for 20+ years including running critical services in FAANGs.

> Use Docker > Use Kubernetes.

Hard pass. There's a reason why most FAANGs developed their own package management and deployment systems and keep using them. They are simpler, less bloated, easier to debug.

> Deploy everything all the time

On your local testbed, maybe, unless you already know a commit is broken. On production, absolutely not.


> There's a reason why most FAANGs developed their own package management and deployment systems and keep using them. They are simpler, less bloated, easier to debug.

And for every other company that isn't one of those 5, rolling your own deployment system won't be simpler, leaner or easier to debug. Nor will it, in all likelihood, be as stable, feature rich, observable, etc.


> There's a reason why most FAANGs developed their own package management and deployment systems and keep using them. They are simpler, less bloated, easier to debug.

Most FAANGs developed their computing infrastructure before Kubernetes gained popularity. After all the investment into building it and fine tuning for their software services (like ads), they probably have a lot more reasons to not migrate to something different. Not just because their systems are "simpler, less bloated, easier to debug".


Interestingly, it's G in FAANG who came up with K8s in the first place.


> There's a reason why most FAANGs developed their own package management and deployment systems and keep using them. They are simpler, less bloated, easier to debug.

FAANG-appropriate solutions aren’t always appropriate for the rest of the world.


> Hard pass. There's a reason why most FAANGs developed their own package management and deployment systems and keep using them. They are simpler, less bloated, easier to debug.

That seems like an odd thing to say when Kubernetes came out of Google and they based it on Borg, their internal orchestrator.

FAANGs create many things themselves as they're trendsetters; they have the size to do it. Having worked on in-house solutions, I can say it is incredibly difficult to maintain adequate documentation, training, and resources to keep them going.


Why, deploying everything all the time is fine; it's what FB, for instance, does. If you don't, there is a hazard of accumulating too much change per deployment, which becomes more dangerous to deploy and more problematic to roll back.

But it does not mean "deploy every change to 100% of the fleet instantly". Use canaries, use slow, monitored rollout. With smaller changes, it's relatively painless and routine.


> Deploy everything all the time

More often than not, this ends up being a business decision about balancing risk with potential loss of revenue (you don't want to take your stuff down, but you also don't want to delay new features that could bring more business)

Fast growing companies trying to enter the market tend to favor "deploy more often, break more often" while companies with large, established bases tend to favor "more testing, more deliberate changes"

I would say the inverse is a better rule: "you should be /able/ to deploy at any time"


I'm nowhere near faangs but I agree with your view, coming from about 20+ years of development.


> Kubernetes gives infra teams scalability superpowers

Gives nightmares too. I'm sticking with Fargate: full, native IaaC support, feature parity with k8s. Or plain EC2s in an ASG.


I don't think you need to take the advice of "use Kubernetes" as a literal "there are no other options", but more as along the lines of "use an existing implementation that you will be able to hire engineers to maintain, rather than a kludge of batch scripts and home rolled abstractions"

> Im sticking with Fargate, full, native IaaC support, feature parity with k8s

Fargate by itself doesn't provide anything that k8s does; ECS does. You can run k8s on Fargate via EKS, for example.

> Or plain EC2s in ASG.

This is exactly what this post is saying not to do. ASGs are a subset of the features of k8s. k8s does load balancing, service discovery, deployments, and health checks, for example. You may not need all of those, and that's fine, but most containerised applications IME benefit from the _majority_ of k8s features.


> I don't think you need to take the advice of "use Kubernetes" as a literal "there are no other options"

It looks pretty literal there. And why does everyone talk about service discovery? Put an LB in front of your app instances and give it a DNS name. There, sorted: no service discovery needed, the service is reachable at this one URI.


When you spin up more instances as load grows, what updates the DNS?

When you have more than one service interacting, or when your LB restarts / fails over, what updates the DNS?

Here's where service discovery enters the picture. It need not be excessively complex.


> When you spin up more instances as load grows, what updates the DNS?

If you're using cloud LBs, this happens automatically. If not, you can have the instances register themselves in DNS when they turn on. You can also have an out of band system register things in DNS based on rules (or maybe you already have software that supports registering based on health checks so you can add DNS entries from a fixed list of static IPs when they become healthy/powered on)

> When you have more than one service interacting, or when your LB restarts / fails over, what updates the DNS?

You can have the LBs communicate peer-to-peer so they can update DNS when they become unavailable (if A and B can't reach C, they remove its DNS record). Some care needs to be taken to prevent things like split brain, but there are established patterns for cluster formation. Something like keepalived could be used. You can also use VIPs instead of DNS.

You end up with service discovery either way: either you discover a load balancer or you discover an endpoint directly. Load balancers allow you to route traffic more granularly without the client being aware of the server topology. This is good when you're exposing services to the web/public clients. On the other hand, load balancers add a certain overhead and can become bottlenecks for high-traffic services.


> you can have the instances register themselves in DNS when they turn on. You can also have an out of band system register things in DNS based on rules (or maybe...

Yeah, don't do this. This is _exactly_ what the article is saying not to do. It's a nightmare to maintain, totally non standard.

> You can have the LBs communicate peer-to-peer so they can update DNS when they become unavailable <...>

Or you could use a system that doesn't require you to write custom P2P traffic for a solved problem. Doing this instead of using an existing system is an interesting decision.


The DNS of the LB stays as it is in both cases. Instances get registered and unregistered in the LB by the ECS Service. LBs in AWS have redundancy built in.

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...


See the author's reply about not taking this so literally. If you're using ECS, you're following the advice of the article.


> And why everyone talks about service discovery?

Because service discovery is pretty much the next problem that people find themselves facing once they've decided not to use kubernetes or ECS or something, and it's also something those systems solve very well, without relying on DNS with all its quirks. To use my favourite saying, "it's always DNS".

> Put an LB in front of your app instances and give it DNS name. Here, sorted, no service discovery needed, service is reachable at this URI only.

Dumping a service behind a load balancer and relying on DNS is a heavy-handed approach. Sure it's "simple", for some definitions of the word simple, but so is writing 30-60 lines of yaml and/or terraform to spin up an EKS or ECS cluster.


> It looks pretty literal there

You might have missed the first 5 paragraphs of the blog post


> I don't think you need to take the advice of "use Kubernetes" as a literal "there are no other options", but more as along the lines of "use an existing implementation that you will be able to hire engineers to maintain

Yeah, exactly this actually :) I didn't want to get too verbose in the little bullet points (I already use parentheticals way too often lol), but it'd be more accurate to have written "Use kubernetes, or some other tool that takes care of container networking, lifecycle, health checks etc for you" -- I bet if you squint even platforms like GCP app engine would fit the bill here


Fargate has full feature parity with Kubernetes? Is that true?

Also, what exactly is "full, native IaaC support" in this case, compared to e.g. EKS?


I meant ECS with the Fargate launch type, of course. IaaC support means you can use a single CFN template to deploy an autoscaled service, with load balancing and runtime parameters. No kubectl or helm needed.
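
For flavor, the heart of such a template is a single resource along these lines (fragment only, not a complete template):

  # CloudFormation fragment: an ECS service on Fargate
  Service:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref Cluster
      LaunchType: FARGATE
      DesiredCount: 3
      TaskDefinition: !Ref TaskDef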


Question stands for ECS too, does it actually have full feature parity by now? Last time I used it it was still more limited.


I’m sure it doesn’t have “full” feature parity, but do you really need 100% of k8s’ features? ECS is a lot simpler to work with and has plenty of features for most systems.


So like Full Self Driving?


Give me one, not too esoteric plz


What is IaaC? Infrastructure as a container?


For .NET folks, Dapr.


Not to be confused with the _other_ “Dapper” that .NET folks may be familiar with! [1]

[1]: https://github.com/DapperLib/Dapper


>The highest-value-per-time-spent kind of test is just pushing your change to staging (or better, prod!) and showing it does what you wanted and doesn’t break everything

Ah yes, the time honoured testing in prod strategy


Well, "this but unironically", really! One of the main tenets of how I hold SRE is that you should be confident in sending pretty much _anything_ to prod, because you trust the system you've built around it to catch problems before they lead to an outage. And in a world like that, with good shadow/canary deployments etc -- there's no better way to validate your change, right?


unless you run a service that stores data or does something that can't fail in production.


"Everyone's got a test system, some also have a stable test system"


Ah hello, author here! I wrote this up last week and posted it to Reddit (didn’t expect to find it on HN!) and I should probably say the same thing I did over there — this is a list of things I have learned are important, based on my own experiences. I’m sure others have had different experiences leading them to different opinions (sometimes even opposite to mine!), which is good, and I’m enjoying reading all the viewpoints here :)


> Structured logs are non-negotiable

I disagree with this. Logs should only be used by human operators; if you want something machine-readable, you should expose metrics via some other channel or record errors in a dedicated service.

Sure, some tools will let you read structured logs quite cleanly, but it will never be as nicely formatted as something designed, first and foremost, for humans. Of course, you may need to (say) tack on a request ID to keep logs from different threads/goroutines/etc from getting interleaved, but that doesn't require everything to be in JSON.


There are some logs which you want machine readable that are not necessarily metrics.

Let's say you have 10 services in serial, and you want to get precise information on how a request was handled - because your customer reported it performing super badly. To do that you might need to get logs from all those 10 services and align them. If they're all human readable and you can filter by a common request-id, that's easy to do. And you can even have tooling on top of them that automatically determines what might have happened by looking at other fields of the log entries.

Metrics are mostly just pushed to a metrics system, and at that point you no longer care about them or their relation to an end-to-end request. So while all services might have emitted a metric for the request, looking at the overall metrics won't tell you whether outliers in different services refer to the same end-to-end request or different ones.
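
The payoff is that aligning those 10 services becomes a one-liner (log lines are made up for illustration):

  $ grep 'request_id=abc123' service-*.log
  service-a.log:2022-10-17T10:00:01Z level=info  request_id=abc123 msg="order received"
  service-b.log:2022-10-17T10:00:02Z level=error request_id=abc123 msg="payment timeout"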


Structured logs can still be human readable.

Ex:

[2022-01... INFO] trace=abcd123 <msg>

There is a structure which can be parsed.


Yeah, I'm referring to the "log everything as JSON" format used by (say) MongoDB.


I don't care if it's in JSON, fmt, or your own hacky format; just choose one and structure all your output so that it's predictable where I have to look.


> Logs should only be used by human operators

I suspect that you will find a lot of resistance to the idea. At least I have. Machines should be consuming metrics, not logs, unless there aren't any alternatives. That statement never seems to be controversial, but people... just don't do it, for whatever reason.

Also, the author seems to want to do a poor man's version of distributed tracing.


As a counterpoint to part of this article, I just posted an article on HN in which an SRE turned solo dev ditched Kubernetes for Fly after 2 years and rising costs.[1]

VMs are good enough for high-traffic sites like HN and Pinboard. I think they can be good enough for most solo devs as well.

[1] https://news.ycombinator.com/item?id=33235042


> Enable limited “instant” config rollouts

Hmm, I think "config" is too vague here. I've seen numerous incidents caused by NGINX config changes that would have been far worse if we had instant 100% rollouts, but I'm not sure this is the kind of config change the author is referring to.

> Run 3 of everything

What is a "thing" in this example? Is it an instance of a service? A database? Something else?

If you're using Kubernetes and have autoscaling, what does "3 of everything" even mean? It sounds like the author is basically talking about databases..?


> What is a "thing" in this example? Is it an instance of a service? A database? Something else?

No, you just run three of everything. While only databases and similarly distributed data storages really need exactly three (or sometimes just an odd number larger than two) for quorum and network partition detection, there is a good chance you'll have three availability zones because of your databases. And since you already have three availability zones, it's only logical to put every service in every zone at least once, so running "three of everything" becomes a rule of thumb: easy to remember, easy to implement.

> If you're using Kubernetes and have autoscaling, what does "3 of everything" even mean? It sounds like the author is basically talking about databases..?

If you use autoscaling, you still need to start somewhere. And this somewhere is "three of everything".
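
In Kubernetes terms, that starting point could look like this sketch (the zone label is the standard well-known one; the app label is made up):

  # spread three replicas across availability zones
  spec:
    replicas: 3
    template:
      spec:
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app: myapp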


I think they’re getting at something closer to feature flags?


> Use docker

Heh nah, I’ve started learning ansible and it’s so much easier for me to understand.

For some reason docker is a little too complex for my purposes.

Ansible just runs ssh commands, no extra layer of abstraction.

I found that the fewer abstractions the better, because at some point you'll have to debug both the abstraction and the bottom layer.


It's very easy to understand.

With Ansible, you first start a machine, then build a configuration on it.

With Docker, you first build the configuration, then start the machine (not a real VM, but an isolated enough slice).

The upside of a Docker image is that you build it once, and then start instantly, no need to wait for Ansible to run through steps.
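
In concrete terms (image name is a placeholder):

  $ docker build -t myapp:1.0 .               # build the configuration once
  $ docker run --rm -p 8080:8080 myapp:1.0    # then start instantly from the image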

The downside, but also an upside, of Docker is that you can't make persistent changes inside it while it's running, and usually you avoid such changes. You always can restart it from a known state.

Of course Ansible can handle more aspects than just application software configuration, basically any aspect at all. This is its upside, but also a downside: you can make an unintended system change along with some innocent-looking operation by mistake, and nothing will stop you, there's no separation of concerns.


Thanks for that detailed explanation.

So it’s possible to pair Ansible with docker?

Like only the app on docker, but all the runtime, Apache, and dependencies configured using Ansible?


That’s an extremely common mode of operating.

Typically, if using bare nodes, you’ll see OS level concerns (disk, networking, init systems, etc) managed via a config management system, while app level concerns (lib dependencies, env variables, port setups, etc) managed via a container.


Yes, it's possible to configure some aspects of your machine via Ansible, and run Docker containers on it.

You can of course run a non-dockerized, locally configured Apache, and make it talk to backend code that lives in containers which just expose TCP ports or Unix sockets.

It allows you to package all the dependency tree horrors of a Node app, or of a large Django app, into a container once, at build time, and just put them on a host where you run them. You're guaranteed to run exactly the same code on your prod box(es) which you've tested on your dev and CI boxes, with all dependencies, libraries, etc guaranteed the same, never reinstalled.

Eventually you may discover that it's convenient to put the Apache into a container, too. Suddenly it stops depending on the distro your host is running.

You may also not need to run Docker proper on your prod machine(s); in simpler cases systemd will start / restart your containers just as well. During development though, docker-compose is invaluable, it allows you to locally run nearly your prod configuration of services, at a tiny scale.
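
A sketch of what that pairing can look like in a playbook (the container module is from the community.docker collection; names are made up):

  - hosts: web
    tasks:
      - name: host-level concern, managed by Ansible
        ansible.builtin.package:
          name: chrony
          state: present
      - name: the app itself, run as a container
        community.docker.docker_container:
          name: myapp
          image: registry.example.com/myapp:1.2.3
          ports:
            - "8080:8080"
          restart_policy: unless-stopped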


Yes. Although typically some of those other dependencies come with docker images of their own (such as Apache). Also, depending on your definition of dependency, some app dependencies need to live inside the image too (i.e. shared libraries, or app deps).


If you think Ansible and Docker are solving the same problem you haven’t understood yet why people use containers for deploying software.

We use them because then what's running in production is 100% the same as what's running locally when testing the application.

You can hardly get there using Ansible.


One challenge both containers and scripts tackle is reproducibility, and in that sense Ansible is solving the same problem.

With Linux competency and some thought put into design, you can find a balance between reproducibility and ease of use with Ansible. Easier for me and those who agree with me to configure and debug a Linux host vs. debugging and configuring Docker to be as flexible as running it straight on the OS.

It’s not an impossible task to harmonize configurations with Ansible. It definitely takes more thought than Docker and that’s where the competencies of the team count.


> what’s running in prodcution is 100% the same as is running locally

Only if you’re running the same kernel.


Do you test your ansible roles? If not, you absolutely should be. Check out molecule, which coincidentally uses docker. It'll definitely make your ansible roles better, and possibly help you learn docker and how it does something different from what ansible does.


You'll find yourself in a world of pain the next time you need to remove something. Over time, your playbook will fill up with stuff that shouldn't be on the machines. At some point you'll consider re-installing the machines from scratch from time to time. At some later point you'll consider re-installing machines every time your Ansible playbook changes. And at that point you'll have reimplemented containers. Poorly.


> Ansible just runs ssh commands

No SSH is better than any SSH :) When your infra matures enough, you'll bake machine images (AMIs in AWS) and start machines from those images. Containers already operate this way; that's why they're popular.


I agree with most of these points. They are a good starting point. I'd probably steer clear of kubernetes until you actually really really need it. ECS is good enough for most things, and a fucktonne more simple to look after.

But where I hard diverge is the lack of metrics. Everything should be generating metrics. Logs are great, but they are crap for giving you near-realtime trends.

Every time your container gets a connection: increment a counter. Every message processed: counter. Every message failed: counter. Every KB of data sent out: counter. Every service call: counter, by service.

metric all the things, in a sensible, mostly automated way, with a decent schema.

Then you can combine all those metrics into a dashboard that shows your system performing against business goals.


I disagree so much with discarding unit tests; if you're not dead sure of the base units, then you have a combinatoric explosion down the road. How do you even design code without unit tests :D


Discarding unit tests is such terrible advice. The author clearly never discovered regression bugs at build time, or developed using a dynamically typed language.

Also, I would have expected to see "Use only strongly typed languages" as one of the bullet items. I get the sense that the author doesn't actually have a ton of enterprise experience.


> Use Git. Use it for everything – infrastructure, configuration, code, dashboards, on-call rotations. Your git repository is your point-in-time-recoverable source of truth.

I'd amend that to "Use configuration management."

Git is wonderful, but it isn't a Swiss Army knife. There are some things that are better archived using systems like Perforce (many game developers use it, because Very Big Assets), or even simple incremental file system backups.

Back in the "dark ages," the Japanese release managers at my company used to clone an entire computer disk drive onto an external disk, and store that disk in a special file cabinet, so you not only had the source, you also had the entire system that was used to build it. This has some obvious issues, but it worked for them, for years.

Also, and I cannot stress this enough, TEST YOUR BACKUP RECOVERY. I have been in the position of needing to restore old files (either partially, or whole-hog), and found out, the hard way, that my backups were pooched.


The title should be "how to build software for an SRE"


I'd argue that without SRE you won't be delivering any software to the customer.


Great insights! I came to some similar conclusions as well.

I really like the one about failing on missing config parameters. I used to set default params all the time (thinking I was clever).

Another big win in the last few years has been feature flagging. I use it extensively to tune performance when in production. I can control batch sizes, delays, retries, and API key rotation.

bookmarked!


Lots of YAML, some duct tape.


Hah. That made me smile. But that's my issue with most "kids" these days. The code is in rust, takes dark wizard incantations to understand, a treasure map to even compile... and then the rest is built on yolo-yaml, reproducible like that time I made pancakes that didn't suck.


I agree on most of the bullet points.

As a Software Architect (the SRE mirror role :) I follow https://12factor.net/ and those guidelines simplified my CI/CD pipelines a lot. Helm + K8s is complex, but it easily lets you change some param and redeploy in a very fast and reliable way.

And yes, docker outperforms ansible in a lot of use-cases, in my humble opinion.


IMHO, How to build software like an SRE:

1. Let us know when something is broken. If a curl fails, log it. We want more logs not less.

2. Make the code configuration-file free. Use ENV vars. If an ENV var is missing, crash and say why (see the sketch after this list). The environment will determine how the app acts.

3. Build process and Deploy process should be different. Github actions is great for the Build process, Jenkins is great for deployments.
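
For point 2, failing fast on a missing ENV var can be as simple as this shell sketch (the variable name is hypothetical):

  # crash immediately, with a reason, if the variable is unset or empty
  : "${DATABASE_URL:?DATABASE_URL must be set, e.g. postgres://user:pass@host/db}"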


helm is a bad solution to a lazily thought out problem. why would i template yaml from another yaml, without versioning and with weird if-else constructs, just for deploying something which clearly should be json? because we devops are obviously not able to abstract the problem. oh, you switched from nginx to haproxy, let's change all the labels in my git-committed repos where my yaml lies. oh, the author changed the way the helm chart works, let's do a manual compare to see what changed. oh, k8s changed the api version again, let's change all the yaml. and helmfile on top of helm -> more abstraction in the wrong direction -> love it.

i would much rather use something like starlark and program my deployment needs against the correct library version of the k8s api which my cluster has and get the final yaml for free instead of templating and forking and maintaining dead weight every damn version.


> Never give up on local testing. It keeps dev cycle time much shorter than needing to rely on (and fiddle with) CI or remote workspaces.

Excellent advice for improving productivity and testability. I wish more teams would be sure to follow this by not adopting technologies that are impossible to test locally.


> Avoid state like the plague

This really depends on the problem domain. Application state is fine in many areas, and when the state is non-trivial, managing it in the application is a lot easier than pushing it to databases.


I have worked at some places where SRE guys were a joy to work with. But at my current place they take a cowboy approach and make my life harder. So I'm currently not interested in building software like an SRE.


SREs should not be cowboys. They should be the least cowboy engineers in your company. Something is very wrong there.


That used to be my perception as well. I have now had back-to-back jobs where there is some maverick on the SRE team. A single guy who modifies some configuration at 2AM for services your team supposedly owns can ruin your week. Some people think lack of bureaucracy is a get-out-of-jail card, and they are often right.


As someone who has managed a team of SREs that kind of behavior would land someone on a PIP if done repeatedly. Sounds like you had a run of bad luck in teams.


A PIP would be nice tbh. At my previous place they PIP'd one of those guys, but they ended up moving him to my team, where we taught him a bit about processes.


I saw a great talk at EuRuKo last week about treating your CI like an SRE would and using metrics and tripwires and things, by Mel Kaulfuss.


Great, but it is just one way to build reliable software. Many companies run reliable software without using half of the suggested points.


> Don’t waste time on code coverage

Is it really that useless? I was about to start doing code coverage and now wonder if I should reconsider.


What do you mean by "doing code coverage"?

Tracking it? Doesn't take too long to set up in some situations.

Increasing it? Depends on if that's trying to get to 100% or just to have a metric to see where tests are needed.

It's usually not productive to enforce it for long ongoing development projects (aside from maybe "stay above 70%" or something). It can be a neat metric, but I don't think it's very productive to chase. The value of tests is very unevenly distributed (see "Pareto distribution"): start by testing the most important business-critical functionality and then move out from there. 20% of your tests will cover 80% of your actual important use cases.

The trade-off would be easier to manage if tests didn't require maintenance, but to get the best bang for your buck, focus on the most critical stuff first. Code coverage is by no means a necessity.


I was thinking about including the necessary instrumentation to check code coverage during regression tests. Something like coverage.py. The intent is not to get a specific number, but to understand which parts of the code might not be exercised enough by existing test cases.
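
For that use case, the usual coverage.py flow is something like:

  $ coverage run -m pytest    # run the test suite under coverage
  $ coverage report -m        # per-file summary; -m lists the missed lines
  $ coverage html             # browsable report of unexercised code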


> My goal here isn’t “what is 100% the most reliability-oriented way we can build things”, it’s more like “what is the 80% of reliability we can get for 20% of the effort while still enabling devs to go fast”

The author also points out that these bullet points will not apply to every unique situation.

If your goal is to allow devs to go fast at the cost of more hardened code, then yes, skipping code coverage is a major way to code faster. There are definitely times when it is appropriate to skip writing tests when the primary goal is to go fast.

It's difficult to answer your question without knowing your specific situation though.


Code coverage doesn't tell you much IMO.

I'll take a system with 70% coverage that's focused on the really complex pieces over a system with 99% coverage that's mostly checking that "person.Name" returns the name.

Your engineers should know where be dragons in the code and unit test appropriately.


As with most good guidelines, it's useful as a general guideline; just don't get strict about it.


I couldn’t agree more on these bullets.

People forgot what it means to build 12 factor applications.


Good article. Seems like solid and useful advice.


"Extremely strict RPC settings. I’m talking zero retries (or MAYBE one), and a timeout like 3x the p99. We are striving for predictability here, and sprinkling retries or long timeouts as a quick fix for a flaky downstream service will turn into a week-long investigation and a migraine a year from now. Fix the flaky service!"

You want the 80%, right? Might be the network is flaky, might be a service, might be a load balancer. Use timeouts and retries to ignore the flakiness until it eats into your error budgets.

"Never give up on local testing. It keeps dev cycle time much shorter than needing to rely on (and fiddle with) CI or remote workspaces. Containerizing the local test environment can make it easier to keep dependencies straight and consistent across machines."

Until debugging the snowflakey local testing, or the difference between local dev and CI, ends up eating more time than just using a local<->remote sync tool. There's like a thousand tools now to combine local dev and remote envs, and they will save your organization a lot of time and hassle.

"Your git repository is your point-in-time-recoverable source of truth."

I call this "suggestion-of-truth". The truth is you never know what state things are actually in, but you hope it's like what's in Git. And there's never enough information kept in Git, like all the versions of all your services used at each deploy, or the state of the entire system, or filesystem snapshots, so recovery often fails.

"For infra changes, make plans extremely obvious. This could mean “post the Terraform plan as a comment on the pull request”,"

Sadly this isn't enough, as they might not be using a plan file, and when they apply, something unexpected happens. Even if you have a plan file, there are a million other things you should be "making obvious" at deploy time, but people never do. Hard problem to solve.

"Use Kubernetes. Assuming you have more than one service and more than one instance, you either need or will need stuff like service discovery, autoscaling, and deployment versioning, and you do not want to manage all that by yourself. Kubernetes gives infra teams scalability superpowers."

Yeah, if you pay for it, in money, in expertise, in person-hours. Those things don't just come out of the box. Autoscaling: not at all. Service discovery: you have to set up a DNS controller with an external nameserver, define a naming + namespace convention and stick to it, and use external service definitions for stuff not hosted in K8s. Deployment versioning: you have to define a deployment pattern that stores the versions of everything at deploy time and then link that to your K8s deploys. There's still weeks/months of setting shit up, and then dealing with the complexity of it all and the constant upgrading.

Do literally everything you can to abstract away K8s setup + maintenance. Pay anyone to do it all for you. If you're unwilling to pay for it, don't use it. If you do it yourself, even with a managed control plane, it's like building a car from scratch. People on HN sometimes say "oh setting up / maintaining K8s was easy" and I think of the guy who built his own 2-story wooden deck saying it was easy. I wouldn't stand near it.


> How to Build Software Like an Site Reliability Engineer

Acronyms don't excuse poor grammar.


The "S" in "SRE" is pronounced "ess" which would take an "an" as the article. The title is grammatically correct. You would also say "an FBI agent" or sending "an SOS message."


No, "an SRE" is correct here; it's actually "a SRE" that would be bad grammar.

Whether you use "a" or "an" depends on whether the next word starts with a vowel sound, not with a vowel letter; since the letter "S" is pronounced "ess" in an acronym, you would write "an SRE," but "a Site Reliability Engineer."


It's an initialism, not an acronym. And _an_ SRE is correct.


I have bad news for you about something I heard a recruiter say a couple weeks ago!



