Is managing a platform just housekeeping?
November 9, 2020 - François
Building and maintaining a globally-deployed platform to ensure the best possible availability and latency requires work on the foundation of your infrastructure, regardless of your business.
Our main objectives? An ultra-low latency live streaming service, and the perfect caching platform to deliver your VOD. For that purpose, we need to deploy dozens of datacenters, aka Points of Presence, around the world.
To get there, your services, whether they serve delivery or ingest purposes, must rely on other, more obscure, building blocks. These are the blocks that ensure you know what's going on, at all times, on the platform, and allow you to be proactive.
They can be blocks dedicated to security, management, log management, monitoring, ... Today, we will quickly review most of these topics, often forgotten yet so essential, giving you rules and guidelines.
Hector Barbossa - Pirates of the Caribbean
Most of the infrastructures I worked on in my previous lives repeated the same error of omission: not having an internal DNS resolution platform.
Why is this critical? To compensate for the lack of this internal DNS, the following workarounds are implemented:
- External DNS resolution: use the recursive DNS of your provider, and/or those of Google, OpenDNS, Cloudflare, ...
- Make your services communicate with each other: heavy modifications to /etc/hosts, declaring the names of all your machines and services in the public DNS zone of your business application, or even using raw IPs.
Both represent security & performance issues:
- Relying on a third party for public DNS resolution means that you depend on that third party for all your external actions (updates, calls to other APIs or SaaS, ...). DNS is often an afterthought for many hosting providers or ISPs: it's slow, and it breaks down regularly. It may also be filtered at the network level, restricting or preventing you from going out.
- If you do not retrieve answers from the right source, you are potentially exposed to DNS response hijacking, for political, financial, fraudulent, ... reasons.
- DNS queries are the first step of any electronic transaction (web/API, emails, ...): if this step is slow or failing, all the rest of the service suffers.
- Your services need to communicate via DNS records: this is what gives you flexibility, and with the right nomenclature, you can do wonders in your automation.
- If you have hundreds or even thousands of machines/services and you modify your /etc/hosts file, even automatically, it quickly becomes unmanageable on a daily basis: run a nslookup, then ask yourself why you forgot this or that firewall rule, while your applications actually resolve names through an oversized /etc/hosts file. This file should be reserved for exceptions only. The other direct consequence of overusing this file is the unnecessary additional disk I/O it can generate.
- Declaring your DNS records for private use in a public DNS zone is the best way to expose elements related to your architecture and internal workings to attackers.
Eminem - Hi! My name is (what?)
One of the practical questions to ask yourself is whether you want to manage your servers and your services in the same DNS zone: a purely practical, entirely subjective choice.
Then, you have to choose your DNS zone: in any case, avoid domain extensions that can be registered publicly, or that come with usage constraints.
Hello there, .dev & .test gTLD!
Next, consider that you will have two types of DNS services to manage internally:
- an authoritative server, for the DNS zones you have just defined
- a recursive server for all your machines, combining your private zones with public resolution, against the Root Servers for example: be careful to configure your DNS cache properly so you don't get blacklisted if you use them
On my end, I love the PowerDNS services for that purpose, especially the Recursor, which can embed /etc/hosts answers in its lookup process. Quite useful when you want all your apps worldwide to always request the very same DNS record, but expect a custom answer.
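As a sketch, a PowerDNS Recursor configuration combining both behaviors might look like this (the zone name and IP addresses are placeholders, not real values):

```ini
# recursor.conf (sketch; zone name and IPs are placeholders)
# Forward our private zone to the internal authoritative servers
forward-zones=internal.example=10.0.0.53;10.0.1.53
# Serve /etc/hosts entries as DNS answers during recursion
export-etc-hosts=yes
# Only answer queries from our own networks
allow-from=10.0.0.0/8, 127.0.0.0/8
```

With a per-PoP /etc/hosts on each recursor, every application worldwide can query the same record name and still receive a location-specific answer.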
Don't forget to make them redundant!
🔎 Beware: if you do end-to-end SSL within your service, you will need your own valid PKI, because Let's Encrypt will not work for these domains: no compatible challenge.
In addition to all of the above, I have always noticed a significant gain in internal application performance: no longer going through a DNS server external to the infrastructure can often save 10 to 100ms per DNS request, depending on your context!
The faster you work internally, the faster you deliver to your end-users!
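To put numbers on that gain in your own context, a few lines of Python are enough: this minimal sketch times resolutions through the system resolver, so you can run it before and after pointing your machines at the internal recursor (the hostname in the comment is hypothetical):

```python
import socket
import time

def resolve_time_ms(hostname: str, attempts: int = 5) -> float:
    """Average DNS resolution time in milliseconds, via the system resolver."""
    total = 0.0
    for _ in range(attempts):
        start = time.perf_counter()
        socket.getaddrinfo(hostname, None)  # triggers a lookup (or a cache hit)
        total += time.perf_counter() - start
    return total * 1000 / attempts

# Compare before/after switching resolv.conf to your internal recursor:
# print(resolve_time_ms("api.internal.example"))  # hypothetical record
```

Mind the caches (local stub resolver, nscd, ...) when interpreting the numbers: the first attempt is usually the honest one.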
I know what you did
Logging is the first step of any debug / test / forensic matter. You need to know what happened. But you also need to ensure that you keep track of your logs.
That's just a job for my ELK setup!
Yeah... but no. ELK, or any equivalent solution, is just the end of one of the paths you'll walk through to benefit from your logs.
First, you need to define how you collect them, and how you deal with any outage, within your datacenter/point of presence but also between your various locations, before your logs can even reach your ELK.
Indeed, when you manage dozens of points of presence, you don't expect to run one ELK per location.
Ryan Philippe - I Know What You Did Last Summer
Logs can come from many different sources: the system (via journald for example), your applications, your middlewares, your network appliances, ... Before you can process them, it's worth bringing them all back to the same transport: syslog, which is totally standard. Whether it is local to your machine, or the central one of your datacenter, it must be configured to handle connectivity losses to its upstream, as well as a little buffering in case of congestion.
This central syslog of your datacenter can then itself push the logs to your log management infrastructure.
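As an illustration, assuming rsyslog as the forwarding daemon (any syslog implementation with queueing works), a disk-assisted queue gives you exactly that buffering and outage tolerance when pushing to the datacenter relay; the relay hostname below is a placeholder:

```conf
# /etc/rsyslog.d/forward.conf (sketch; relay hostname is a placeholder)
action(
  type="omfwd"
  target="syslog.pop.internal.example" port="514" protocol="tcp"
  queue.type="LinkedList"          # in-memory queue...
  queue.filename="fwd_buffer"      # ...spilling to disk when needed
  queue.maxDiskSpace="1g"
  queue.saveOnShutdown="on"        # keep buffered logs across restarts
  action.resumeRetryCount="-1"     # retry forever on upstream outage
)
```

The same pattern applies one level up, on the datacenter relay pushing to your central log management.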
For us, log management can be complex:
- part of it goes to ELK, mainly for maintenance purposes
- part of it goes through our billing process
- and part of it is simply backed up
🔎 Don't forget to enable auditing too, to monitor any local user action, especially someone who would tamper with your logs!
Passing your logs to a dedicated service that properly manages connectivity losses, buffering, ... can positively impact your local I/O, leading to a better-behaved machine: namely, a better UX at the end of the day.
You're being watched
Any machine, any service you deploy should be actively monitored.
I often read that you should monitor and collect only what you know to be useful and what you understand. I'm not fond of this approach, as you never know everything, and curiosity is your best companion along the road.
When you monitor a service, you never know which corner cases or race conditions you'll hit a couple of months later. Having the metrics/details ahead of the event is the perfect way to, in the best case, avoid the issue or, in the worst case, be able to proceed with a proper root cause analysis.
Root - Person of Interest
Today there are mainly two paradigms for monitoring: the old world, relatively static (Nagios, Zabbix, ...) and the world revolving around the buzzword "observability" (Prometheus, Sensu, ...).
Whereas previously we designed our monitoring tools to warn us of any exceeded threshold or unexpected value, this mode of operation quickly showed its limits, with many blind spots.
Today, when we talk about observability, we are talking first of all about having a fairly broad vision, mixing metrics, logs, but also application traces. The alerting part is also undergoing an important evolution: we focus more and more on behavior, and more precisely on behavior change.
Let's take our database connections as an example. The number of connections is determined by the service's audience, with usual peaks depending on the region of the world and the time of day. With a permanent increase in traffic, defining alert thresholds is complicated and represents an endless race. While this information is useful for sizing each node in the cluster, it does not provide information about an abnormal situation.
On the other hand, if you are able to compare the current behavior with what was expected by extrapolating the data from the previous period, then you can very quickly determine if anything needs to change.
Another example: the slope of a curve will always say more about a change in behavior than a simple threshold.
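As a toy sketch of this approach, assuming simple fixed-size windows (the 50% tolerance is an arbitrary choice, not a recommendation), you can compare the slope of the current window with that of the equivalent previous window and alert on the relative change rather than on an absolute value:

```python
def slope(values):
    """Least-squares slope of a series sampled at unit intervals."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def behavior_changed(previous, current, tolerance=0.5):
    """Flag when the current slope deviates from the reference slope
    by more than `tolerance`, in relative terms."""
    base = slope(previous)
    now = slope(current)
    if base == 0:
        return abs(now) > tolerance
    return abs(now - base) / abs(base) > tolerance
```

With database connections as in the example above, a steadily growing `previous` window keeps raising any static threshold, while `behavior_changed` stays quiet as long as the growth rate itself does not shift.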
This is how we conceive our monitoring, and we strive to foresee any anomaly and to refine our capacity planning. Being as proactive as possible is the best way to reduce the moments when we have to be reactive.
🔎 Then, point all your observability data sources (log parsing, metrics collection, ...) to the same alerting service. This lets you pool the data and thus refine the results, but also avoid alert fatigue caused by too many alerts, by only reporting the ones that make sense.
Resistance is futile
For the same reason that the Borg want to assimilate you (the Collective is powerful), it is in your interest to assimilate your installations. In other words, it is in your interest that everything be perfectly reproducible.
Need performance? Scale, deploy seamlessly, and execute.
A bug? Deploy elsewhere the same way, and test.
A new point of presence to better serve some of your users? Deploy, always the same way, and enjoy.
Captain Picard as Locutus - Star Trek Next Generation
Ansible, Puppet, Chef, Terraform, ... the choice is yours to make. But always automate.
Thanks to your private DNS infrastructure, your log management, your monitoring, you'll be able to deploy a complete new point of presence in a matter of minutes.
Define generic rules as your defaults. Apply PoP specificity via your group configuration (or alike). Customize any exception where it applies.
With the proper nomenclature defined within your private DNS zone, you can even automate the configuration of cluster-aware setups where required, without writing explicit configuration.
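For instance, with a hypothetical naming scheme like `<service>-<index>.<pop>.internal.example` (every name below is made up for illustration), a few lines are enough to derive cluster membership from the inventory instead of maintaining it by hand:

```python
import re

# Hypothetical nomenclature: <service>-<index>.<pop>.internal.example
HOST_RE = re.compile(
    r"^(?P<service>[a-z]+)-(?P<index>\d+)\.(?P<pop>[a-z0-9]+)\.internal\.example$"
)

def parse_host(fqdn: str) -> dict:
    """Split an FQDN into its service / index / PoP components."""
    match = HOST_RE.match(fqdn)
    if not match:
        raise ValueError(f"hostname outside the nomenclature: {fqdn}")
    return match.groupdict()

def cluster_peers(fqdn: str, inventory: list) -> list:
    """Hosts sharing the same service and PoP: the node's cluster mates."""
    me = parse_host(fqdn)
    return [
        h for h in inventory
        if h != fqdn
        and parse_host(h)["service"] == me["service"]
        and parse_host(h)["pop"] == me["pop"]
    ]
```

Feed the result into your automation tool's templates, and each node configures its own cluster without a single hardcoded peer list.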
The more you automate, the fewer human errors you'll introduce in the end.
🔎 Want to proceed with some A/B testing, or a canary deployment? Restrict the run of your automation tool, and voilà!
Don't avoid the housekeeping
Working on your underlying services helps your daily platform management: better performance, better security, better prediction, ... In the end, you'll have a better day, and your users a better overall experience.
When you're working toward an ultra-low latency platform, every millisecond counts!
We're deploying our own Points of Presence, around the world, with that objective in mind.
Our Infrastructure team is also hiring for our next challenges: build our own private CDN, revamp the worldwide storage platform, orchestrate the applications, ... Just drop us a note if you want to join!
Head of Infrastructure @api.video