I had to automate Cassandra backups recently and thought I’d write it up, because the process is not well documented. If you’re used to MySQL or PostgreSQL backups, you will find that Cassandra works differently. You don’t get a single file that holds all the backup data. What you get is a number of snapshot files per keyspace table, which you will probably want to put in a single .tar.gz file.
Typically, every Cassandra node contains only a part of the whole keyspace dataset, so you have to take a backup of every node separately. How many backups you actually need depends on the keyspace replication factor: with a replication factor of 3 on a 3 node cluster, every node holds a full copy of the data, so a backup from a single node is enough.
When backing up, you need to back up the schema and the data using 2 different processes. Cassandra backups are pretty quick, because they are essentially snapshots (hard links to the existing SSTable files). In simple words, the backup process is as follows:
# clear old snapshots
nodetool clearsnapshot -t mybackup1 -- mykeyspace1

# back up schema
cqlsh -e "describe keyspace mykeyspace1" > mykeyspace1.cqlsh

# snapshot keyspace to /var/lib/cassandra/data/mykeyspace1/*/snapshots/mybackup1
nodetool snapshot -t mybackup1 -- mykeyspace1

# tar gz (the -path pattern is quoted so that find, not the shell, expands the wildcard)
FILES=$(find /var/lib/cassandra/data/ -path '/var/lib/cassandra/data/mykeyspace1/*/snapshots/mybackup1')
tar czf backup.tar.gz $FILES ./mykeyspace1.cqlsh

# delete snapshot - we don't need it anymore
nodetool clearsnapshot -t mybackup1 -- mykeyspace1
This will create a .tar.gz with a bunch of snapshot files and a keyspace schema .cqlsh file.
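If you need to run this for several keyspaces or from cron, the steps above could be wrapped in a small function. This is only a sketch: the name backup_keyspace and its arguments are my own invention, the default data directory is the one used above, and it assumes nodetool, cqlsh, find and tar are on the PATH.

```shell
# Hypothetical wrapper around the backup steps; the function name and its
# arguments are illustrative and not part of Cassandra's tooling.
# Usage: backup_keyspace KEYSPACE TAG [DATA_DIR] [OUTPUT]
backup_keyspace() {
  local keyspace="$1"
  local tag="$2"
  local data_dir="${3:-/var/lib/cassandra/data}"
  local out="${4:-backup.tar.gz}"
  local schema="${keyspace}.cqlsh"
  data_dir="${data_dir%/}"   # drop a trailing slash so the -path pattern matches

  # Clear a stale snapshot with the same tag; the result is ignored on
  # purpose because there may be nothing to clear.
  nodetool clearsnapshot -t "$tag" -- "$keyspace" || true

  # Dump the schema into the current directory, then take the snapshot.
  cqlsh -e "describe keyspace ${keyspace}" > "$schema" || return 1
  nodetool snapshot -t "$tag" -- "$keyspace" || return 1

  # Archive the schema plus every per-table snapshot directory. The pattern
  # is quoted so that find, not the shell, expands the wildcard. If nothing
  # matches, tar is never run and no archive is created.
  rm -f "$out"
  find "$data_dir" -type d -path "${data_dir}/${keyspace}/*/snapshots/${tag}" \
    -exec tar czf "$out" "$schema" {} + || return 1
  if [ ! -f "$out" ]; then
    echo "no snapshot directories found for ${keyspace}/${tag}" >&2
    return 1
  fi

  # The snapshot is no longer needed once it is in the archive.
  nodetool clearsnapshot -t "$tag" -- "$keyspace"
}
```

Calling backup_keyspace mykeyspace1 mybackup1 on a node would then leave backup.tar.gz and mykeyspace1.cqlsh in the current directory. Note that tar strips the leading / from the snapshot paths, so the archive unpacks to var/lib/cassandra/data/... relative to wherever you extract it.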
Restoring From Backup
This is the tricky part, and it is not well documented. I spent some time trying a few different options. There are 2 ways you may restore a Cassandra database: using the sstableloader tool or copying the database files in place directly. Each method has its advantages and drawbacks. The first method, the sstableloader tool, is very reliable and automatically puts data on the correct node according to the current ring configuration, but it is a bit slow. The second method is a lot faster, but requires either the existing Cassandra cluster where the backup was taken OR a new Cassandra cluster with exactly the same ring configuration as the old one (which is difficult to reproduce). This article will only cover the sstableloader tool, because the second method is not 100% reliable (or it was not 100% reliable for me).
Once again, it is important to understand how Cassandra distributes data between the nodes depending on the replication factor, and therefore which backup files you will need to restore a keyspace. For a 3 node cluster and a keyspace with replication factor 3, only one file is enough. For a 5 node cluster and a replication factor of 1, you would need backup files from all 5 nodes. However, you won’t get any duplicate records if you ingest a backup file twice by mistake. Also, it is possible to restore data on a cluster that has a different replication strategy, because the new keyspace does not have to have the same replication strategy as the old one, and sstableloader will take care of connecting to the correct Cassandra nodes and distributing the data correctly.
Before you restore any data, you must make sure that the keyspace schema exists and is up to date. Also, the rpc_address of every Cassandra node should be reachable from the node where sstableloader will be running. First of all, restore the keyspace schema:
cqlsh < mykeyspace1.cqlsh
Then, for every keyspace/tablename directory from the backup file, run:
sstableloader -d cassandra1 mykeyspace1/table1
sstableloader -d cassandra1 mykeyspace1/table2
...
sstableloader -d cassandra1 mykeyspace1/tableN
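With more than a handful of tables you will want a loop. The sketch below is one possible way to do it, not an official procedure. It assumes the archive was extracted so that the snapshot files sit under <src>/mykeyspace1/<table>-<id>/snapshots/mybackup1 (the "-<id>" suffix is what I saw on my installation, treat it as an assumption), and it first copies them into the plain keyspace/table layout used in the commands above. The function names stage_tables and load_tables are made up for this example.

```shell
# Hypothetical helpers; names and arguments are illustrative.

# stage_tables SRC KEYSPACE TAG DEST
#   Copy SRC/KEYSPACE/<table>-<id>/snapshots/TAG/* to DEST/KEYSPACE/<table>/
stage_tables() {
  local src="${1%/}"
  local keyspace="$2"
  local tag="$3"
  local dest="${4%/}"
  local snap table_dir table
  for snap in "$src/$keyspace"/*/snapshots/"$tag"; do
    [ -d "$snap" ] || continue
    table_dir="${snap%/snapshots/*}"   # .../mykeyspace1/table1-<id>
    table="${table_dir##*/}"           # table1-<id>
    table="${table%-*}"                # table1 (assumes a "-<id>" suffix, if any)
    mkdir -p "$dest/$keyspace/$table" || return 1
    cp -R "$snap"/. "$dest/$keyspace/$table"/ || return 1
  done
}

# load_tables STAGED KEYSPACE HOST
#   Run sstableloader once per STAGED/KEYSPACE/<table> directory and stop
#   at the first failure.
load_tables() {
  local staged="${1%/}"
  local keyspace="$2"
  local host="$3"
  local table_dir
  for table_dir in "$staged/$keyspace"/*/; do
    [ -d "$table_dir" ] || continue
    sstableloader -d "$host" "${table_dir%/}" || return 1
  done
}
```

For an archive unpacked in the current directory, this might be called as stage_tables ./var/lib/cassandra/data mykeyspace1 mybackup1 ./staged followed by load_tables ./staged mykeyspace1 cassandra1. Copying rather than moving keeps the extracted backup intact in case a load has to be repeated.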
The data should be available straight away; use cqlsh to verify. I recommend trying the process out on a multi node Cassandra installation and cloning a keyspace, to get a better understanding of how the process works.