Video geo-replication journey

October 6, 2020 - François

We’re committed to making api.video the easiest and fastest solution for you and your users. To do so, we've focused on instant worldwide delivery. This is the first article in a series on the evolution of the api.video infrastructure over 2020/2021. We have big evolution projects ahead to serve you ever better, while keeping the existing service running.



To get there, we sometimes need to implement quick solutions on top of the existing setup before diverging strongly from it. That is precisely the case for today's topic.

The api.video service was initially deployed in 2019 on OVH's infrastructure, first in Roubaix and Gravelines for Europe, then in Beauharnois to serve the North American market.

Certain historical architecture choices led to logical building blocks (e.g. databases and storage) being used exclusively in Europe at first, before attempts were made to create transatlantic clusters: synchronous clusters, with an average internal latency of 80 ms per TCP transaction.

In late 2019, this led to poor performance on several bricks, and a rollback was decided until a better solution could be implemented.

A direct consequence of this setup was very poor performance for our North American users when accessing videos:

  • Latency: 2 seconds compared to 0.3 seconds from France
  • Throughput: 500 kB/s compared to 15 MB/s from France (and even 150 MB/s locally at storage)

As we deliver several dozen TB of videos, this is not a small matter to deal with.

As you may know, 2020 is api.video's year. This comes with new hires, especially on the infrastructure side.

To sustain the current growth in usage until the new platform (with edge networking, for instance) is ready, we decided to deploy asynchronous geo-replicated storage between Europe and North America.

What objectives, for which result?

Our primary objective here is to make the service properly redundant between Europe and North America, and to guarantee the delivery of videos to end users in proper conditions.

From a performance perspective, we defined two indicators as our initial objectives: the newly created geo-replicated storage has to provide the platform, internally, with the same ranges of latency and throughput, whether we are in Europe or North America:

  • Latency: 0.3 seconds
  • Throughput: at least 15 MB/s


Those are our new worst-case metrics, which we should implement in our monitoring probes.
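For illustration, a minimal probe of this kind can be built on top of curl; this is a sketch with a placeholder URL, not our actual probe:

curl -s -o /dev/null \
  -w 'latency: %{time_starttransfer}s  throughput: %{speed_download} B/s\n' \
  "https://video-delivery.example.com/vod/some-video/hls/segment-1.ts"

Any result above 0.3 s of latency or below 15 MB/s of throughput, from either continent, would then raise an alert.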

One last objective is the feedback from our users, either via social media or our helpdesk. A bit more difficult to measure, though.

Let's build it

(New) Initial attempt

As mentioned, a first attempt at building replicated storage between Europe and North America was made in 2019. It was based on the standard replication process from Gluster. As this is a blocking, synchronous replication, it led to the poor performance we know of.

Just as a quick reminder, we want this new setup to be a quick win. As the platform was still relying on Gluster, but with replicated nodes all located in the same datacenter, we decided to move forward with it and build a geo-replicated Gluster storage.

The GlusterFS documentation (since version 3.5) also promises that geo-replication works correctly over a distributed setup. This seemed like a good option to benefit from a scalable storage volume while also getting geo-replicated storage.

We agreed to have a master/slave setup at this stage of the infrastructure, as our priority is to deliver videos in proper conditions.

Setting it up

2020 is also a year of automation for us, from an infrastructure perspective. Ansible is our new BFF, and we run all our setup (installation, configuration, tuning) with it. The setup described here is also available via our own playbook.

The idea of this setup is to have the following elements:

  • Each country has its own GlusterFS cluster (the master being in France)
  • Each cluster has two servers in distributed mode
  • Each server has 10 disks of 4 TB available for this storage, built in software RAID 10, formatted in XFS
  • Each node mounts the cluster locally with these parameters: 127.0.0.1:/storage-fr /opt/self glusterfs defaults,_netdev,noatime,log-level=ERROR,log-file=/var/log/gluster.log,direct-io-mode=no,fopen-keep-cache,negative-timeout=10,attribute-timeout=30,fetch-attempts=5,use-readdirp=no,backupvolfile-server=fs-fr-02 0 0

GlusterFS Distributed Volume - official Red Hat documentation
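To make this more concrete, the gist of the Gluster commands driven by the playbook looks roughly like this; a sketch using our host and volume names, and assuming the slave volume already exists with passwordless SSH from master to slave:

# One distributed volume per country, two bricks each (French master shown)
gluster peer probe fs-fr-02
gluster volume create storage-fr fs-fr-01:/opt/brick fs-fr-02:/opt/brick
gluster volume start storage-fr

# Then an asynchronous geo-replication session towards the Canadian slave volume
gluster volume geo-replication storage-fr fs-ca-01::storage-ca create push-pem
gluster volume geo-replication storage-fr fs-ca-01::storage-ca start
gluster volume geo-replication storage-fr fs-ca-01::storage-ca status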

That’s the setup at the very beginning. Now let’s make it live.

And then there’s the drama

Thanks to the Ansible playbook, the entire deployment goes smoothly and without errors. So I import my 11 TB and admire the result… but not for long. It’s perfect, yet a SysOps rule scares me:

"If everything goes well on the first try, you’ve forgotten something."

Fortunately for me, not everything is so beautiful: geo-replication doesn’t start. A glance at the geo-replication status does indeed bring me back to reality:

MASTER NODE    MASTER VOL    MASTER BRICK    SLAVE USER    SLAVE                   SLAVE NODE    STATUS    CRAWL STATUS    LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------
fs-fr-01       storage-fr    /opt/brick      root          fs-ca-01::storage-ca    N/A           Faulty    N/A             N/A
fs-fr-02       storage-fr    /opt/brick      root          fs-ca-02::storage-ca    N/A           Faulty    N/A             N/A

In fact, on startup the replication goes to Active and then immediately to Faulty. So much for the successful setup. So obviously, we always start by consulting the logs first:

less $(gluster volume geo-replication storage-fr fs-ca-01::storage-ca config log-file)

And here we see a beautiful log entry, extremely explicit:

E [syncdutils(worker /gfs1-data/brick):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line 322, in main
    func(args)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/subcmds.py", line 82, in subcmd_worker
    local.service_loop(remote)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py", line 1277, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 599, in crawlwrap
    self.crawl()
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1555, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1455, in changelogs_batch_process
    self.process(batch)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1290, in process
    self.process_change(change, done, retry)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1229, in process_change
    st = lstat(go[0])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/syncdutils.py", line 564, in lstat
    return errno_wrap(os.lstat, [e], [ENOENT], [ESTALE, EBUSY])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/syncdutils.py", line 546, in errno_wrap
    return call(*arg)
OSError: [Errno 22] Invalid argument: '.gfid/1ab24e67-1234-abcd-5f6g-1ab24e67'

Our first thought is that there is an issue with some specific files. Checking the file matching this gfid, it appears to be valid. As an attempt to save the day, we get rid of it and restart the geo-replication.

... With no success, as it fails the same way ...
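For the record, a gfid can be mapped back to its real path on a brick along these lines; a sketch for a regular file, reusing the anonymized gfid from the log above:

# Each regular file on a brick is hard-linked under .glusterfs/<first 2 chars>/<next 2 chars>/<gfid>
GFID="1ab24e67-1234-abcd-5f6g-1ab24e67"
BRICK="/opt/brick"
ls -l "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"

# The inode number then points back to the human-readable path via a hard-link search
INODE=$(stat -c %i "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID")
find "$BRICK" -inum "$INODE" -not -path "*/.glusterfs/*"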



Discussing with the Gluster dev team over IRC, they wonder whether the amount of initial data might be the issue. Gluster is meant to deal with PB of data, but fair enough: we get rid of the data, 1 TB at a time, testing the geo-replication at each stage.

... Still no luck ...

As I end up with a totally wiped folder, with no data left, we wonder whether the changelog got corrupted or filled up during those tests. As a reminder, the changelog is how Gluster tracks any change it needs to replicate (creation, modification, deletion, permissions, ...).

We decide to wipe the complete cluster and attempt the deployment with Ansible once again, but without importing any data. To wipe the setup on the servers, we go with the following commands:

sudo ansible fs-fr-01 -m shell -a "gluster volume geo-replication storage-fr fs-ca-01::storage-ca stop force"
sudo ansible fs-fr-01 -m shell -a "gluster volume geo-replication storage-fr fs-ca-01::storage-ca delete reset-sync-time"
sudo ansible fs_fr -m shell -a "echo y | gluster volume stop storage-fr"
sudo ansible fs_ca -m shell -a "echo y | gluster volume stop storage-ca"
sudo ansible fs -m shell -a "dpkg -l | grep gluster | awk '{ print \$2 }' | xargs sudo apt -y remove"
sudo ansible fs -m shell -a "service glusterd stop"
sudo ansible fs -m shell -a "umount /opt/self"
sudo ansible fs -m shell -a "setfattr -x trusted.glusterfs.volume-id /opt/brick"
sudo ansible fs -m shell -a "setfattr -x trusted.gfid /opt/brick"
sudo ansible fs -m shell -a "rm -fR /opt/brick/.glusterfs"
sudo ansible fs -m shell -a "find /usr -name '*gluster*' -exec rm -fR {} \;"
sudo ansible fs -m shell -a "find /var -name '*gluster*' -exec rm -fR {} \;"
sudo ansible fs -m shell -a "find /etc -name '*gluster*' -exec rm -fR {} \;"

And as you might (not) have expected: it failed once again! Bollocks!

E [syncdutils(worker /gfs1-data/brick):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line 322, in main
    func(args)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/subcmds.py", line 82, in subcmd_worker
    local.service_loop(remote)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py", line 1277, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 599, in crawlwrap
    self.crawl()
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1555, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1455, in changelogs_batch_process
    self.process(batch)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1290, in process
    self.process_change(change, done, retry)
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1229, in process_change
    st = lstat(go[0])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/syncdutils.py", line 564, in lstat
    return errno_wrap(os.lstat, [e], [ENOENT], [ESTALE, EBUSY])
  File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/syncdutils.py", line 546, in errno_wrap
    return call(*arg)
OSError: [Errno 22] Invalid argument: '.gfid/00000000-0000-0000-0000-000000000001'

This time, the gfid matches the Gluster root itself: /opt/brick. How can it fail over an empty root folder?

We review the complete setup with the Gluster dev guys, and we cannot find any issue with it: it corresponds to the simplest version of the documentation, without any potential permission issue, as we built the setup as root to begin with. One of them proposed using a Python script that can do it all “by magic”. The tool's installation and usage are pretty straightforward.

… but yet: no luck ...

After several days of digging, building and wiping things, using different hardware and VMs, we ask a simple question: is a distributed setup meant to work as the source of the geo-replication? Collegial positive answer from the Gluster dev team: it has been available since version 3.5, we are on version 7.3 (when building this setup), and none of them has touched this part of the code, so it should work.

Right. 

Should is the key to it all.

None of them had tested this setup since the 4.x branch of GlusterFS, and the 5.x branch was a major change to the core code and logic.

We love to question things, even more so assumptions. Fair enough: we can build a new setup quite quickly, so we move to a replicated setup as the source of the geo-replication.

… And voilà!

Now it works perfectly fine. There has been some regression, at least since the 5.x branch, and no one noticed.



(New (new)) Initial setup

This replicated/geo-replicated setup is not the perfect solution, but it still achieves the results we expect. Let’s proceed with the complete setup:

  • Each country has its own GlusterFS cluster (the master being in France)
  • Each cluster has two servers in replicated mode
  • Each server has 10 disks of 4 TB available for this storage, built in software RAID 0, formatted in XFS
  • Each node mounts the cluster locally with these parameters: 127.0.0.1:/storage-fr /opt/self glusterfs defaults,_netdev,noatime,log-level=ERROR,log-file=/var/log/gluster.log,direct-io-mode=no,fopen-keep-cache,negative-timeout=10,attribute-timeout=30,fetch-attempts=5,use-readdirp=no,backupvolfile-server=fs-fr-02 0 0

GlusterFS Replicated Volume - official Red Hat documentation
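Compared with the first attempt, the only change on the Gluster side is the volume type; roughly, in the same hypothetical command sketch as before (the playbook does the real work):

# Replicated volume this time: replica 2, no arbiter (prone to split-brain, as we will see)
gluster volume create storage-fr replica 2 fs-fr-01:/opt/brick fs-fr-02:/opt/brick
gluster volume start storage-fr
gluster volume geo-replication storage-fr fs-ca-01::storage-ca create push-pem
gluster volume geo-replication storage-fr fs-ca-01::storage-ca start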

Now that we have a working setup, we can import the initial dozens of TB. It will take quite some time. I run a simple rsync command in a screen session without supervising it, and we repeat the command over and over so that the data is as up to date as possible prior to M-Day:

rsync -PWavu /mnt/old/ /mnt/new/ --delete-during --ignore-errors

A couple hours later

So far so good: M-Day is here and we decide to proceed with the migration. We update the mount points and benefit from the same options as previously. We are all happy, as the promised performance is reached in terms of latency and throughput. Even better for the latter, as we average 45 MB/s.

Did you think it was over? A couple of hours later, one of our US users reached out as he couldn't access any files. Checking the delivery logs, we see a bunch of HTTP 200 and 304 codes, so we initially wonder whether the issue is on his end, or an intermediate cache issue somewhere else: we ask him for a couple of problematic URLs to run further tests.

Checking the provided URLs on both the European and North American platforms, we manage to reproduce the issue: our origin web servers show that the files do not exist. As the platform data was (assumed to be) up to date, the reason was pretty unclear.

Comparing the same folder on both platforms, the North American instance appears… empty! But df reports the same volume usage, so we just add a simple -a parameter to our ls… and there they all are: the files do exist, with the proper size, in the proper folder, but instead of player.json (for instance), each is named something like .player.json.AbvgGY.

Weird? Not so much, as this is the format of rsync's temporary files: remember how we imported the files?

But why doesn't Gluster fix them on the North American platform, given that the data and naming are valid on the European side? Simply because Gluster relies on a simple checksum of the content and metadata to track changes, ignoring the actual name of the files. Instead, it expects to track the rename() operation on the fly.



At this stage, our guess was that something happened during the data import, leading to the wrong files being replicated. Checking each cluster's logs, status and details (including Gluster's heal info), it appeared that a split-brain had occurred on the European platform.
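For reference, the heal backlog and split-brain entries can be listed with commands of this kind (the volume name is ours):

gluster volume heal storage-fr info
gluster volume heal storage-fr info split-brain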

Pretty logical with only two nodes in the cluster and no arbiter. Digging a bit more into the timestamps, logs, states, … In the end, it was a race condition between multiple things:

  • A PEBCAK (from yours truly): I ran the rsync command with no specific parameter, leading to the creation of temporary files; I should have added the --inplace argument to write each file under its final name at once (see the corrected command after this list)
  • Network issues at the provider level: kind of expected, but they occurred multiple times, partly during the expected rename() operations
  • Gluster geo-replication being too simple: why the heck don’t they include the name in the validation of the content, in addition to the checksum?
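For illustration, the corrected import command simply adds --inplace to the rsync used earlier, so each file is written directly under its final name rather than a hidden temporary one:

rsync -PWavu --inplace --delete-during --ignore-errors /mnt/old/ /mnt/new/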

So, while it's simple to heal the European cluster (both the split-brain and the invalid data on one node), fixing the names on the North American cluster is not that obvious. There is no Gluster mechanism to do so, and we can't afford to wait a week for the data to sync all over again if we decide to wipe it. From here, we came up with a new idea that is simply an abuse of what Gluster is and of how a split-brain situation gets fixed.

We know that we have no more user connections to the North America cluster as we moved them all to the European cluster while solving the issue.

To summarize the idea:

  • Gluster considers that writing to a local brick (outside of Gluster) is an issue, as it will lead to a split-brain
  • A split-brain can be resolved by reading the file locally through Gluster (the mountpoint over 127.0.0.1) if there is no concurrent access to it
  • A rename() operation does not change the gfid, it just updates the underlying link Gluster uses (take a look at the directory tree in the .glusterfs folder of your brick)
  • The geo-replication happens once, based on the changelog: if a file is locally removed from the slave cluster, it can't be "retrieved"
  • We know the data is valid, only the name is an issue

Thanks to this, I write a quick script and run it over my folders:

#!/bin/bash
# For each hidden rsync temp file left on the local brick:
find /opt/brick/ -name ".*" -type f | while read f; do
  # 1. Rename .player.json.AbvgGY back to player.json, directly on the brick
  rename -f -d -v 's/\.([^.]*)\.([^.]*)\..*/$1.$2/' "$f"
  brick=$(dirname "$f")
  # 2. Read the same directory through the local Gluster mountpoint (brick -> self)
  #    so Gluster notices the out-of-band change and resolves the split-brain
  self=$(echo "$brick" | sed 's/brick/self/g')
  echo "$self" | xargs ls -la &> /dev/null
  # 3. Push the fixed directory to the other node's brick and drop its stale dot files
  dest=$(dirname "$brick")
  sudo rsync -a --inplace "$brick" root@fs-ca-02:"$dest" --delete-after > /dev/null
done
exit 0

Not the cleanest thing, but efficient at least. It renames the files and makes sure the split-brain is resolved, before removing any invalid remaining dot files on the other node of the cluster. Neat.

One more thing…

From now on, we should be fine. “Should”. But we were not.

A new issue occurs on the web servers while a full heal is running on the European platform. Some of them are exposed to corrupted files that are being healed, and then cache those invalid files.

As the files are OK on both nodes of the cluster, the issue is clearly on the web servers. To solve it, we just need to umount/mount the Gluster endpoint. Quick and efficient. We will redo it once the full heal is complete.
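In practice this is a one-liner per web server; a sketch, assuming they mount the volume the same way as the storage nodes:

# Remount the FUSE client to drop its stale cache (the fstab entry stays unchanged)
umount /opt/self && mount /opt/self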

Several months later

It's been a while now since we built this setup. The service has been quite stable. We did have issues outside of Gluster that impacted the cluster, though. One of the European nodes kept killing its hard drives (4 dead in a month). Even though our provider's tech team did not consider this an abnormal situation, it led to a couple of side effects on Gluster we had to deal with:

  • Cache issues on the Gluster client, due to the mount options and how FUSE deals with the cache: we noticed wrong permissions and invalid timestamps
  • Replication stuck between the European nodes
  • HTTP error codes when trying to deliver the videos

This was quickly fixed, but required getting the culprit node out of the cluster (and therefore adding another node to it).
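Pulling a brick out and bringing a replacement in boils down to a command of this kind, where fs-fr-03 stands for a hypothetical replacement server already probed into the pool:

gluster volume replace-brick storage-fr fs-fr-01:/opt/brick fs-fr-03:/opt/brick commit force
# Gluster then heals the new brick from the healthy replica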

We were also able to replace the nodes to grow the storage, and better absorb our steady growth.

With that setup, we reached the expected delivery performance on both of our main continents, while keeping enough engineering time to work on the new infrastructure, which will be discussed in future blog posts.

Also, our American users were relieved by the improved performance and overall experience of the platform.



From your perspective, how would you have dealt with it? Would you have gone with a Gluster geo-replication solution?

After this year with Gluster, we also know that we won't pursue it for our next stage. Our newly created infrastructure team is also hiring for our next challenges: building our own CDN, revamping the worldwide storage, orchestrating the applications, ... Just drop us a note if you want to join!


François

Head of Infrastructure @api.video
