I - CTA basic concepts
The CERN Tape Archive (CTA) is a high-performance tape archive storage system that transfers files directly between remote disk storage systems and tape drives. CTA is developed at CERN and is currently compatible with remote disk storage systems such as EOS and dCache.
These are the basic elements that make up CTA (simplified):
Catalogue: A relational database that all tapeservers connect to. It contains all of CTA's logical data and a record of every file with its metadata, keeping persistent information about stored files. You can use either Oracle or PostgreSQL for the Catalogue.
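As an illustration, CTA components read the catalogue connection details from a one-line connection string. The file path, credentials and exact string format below are assumptions and depend on your CTA release and database choice:
  # /etc/cta/cta-catalogue.conf (path and format are indicative, check your release)
  # PostgreSQL example (hypothetical credentials and host):
  postgresql:postgresql://cta:secret@catalogue-db.example.org/cta
  # Oracle example (hypothetical credentials and TNS alias):
  oracle:cta/secret@CTADB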
ObjectStore: The queuing system of CTA is implemented on top of an ObjectStore. This is where the Scheduler database lives; therefore, all your servers must share the same ObjectStore. You can use Ceph or a VFS/NFS file system to create an ObjectStore, as long as it is a shared space between your tapeservers. The ObjectStore is mainly composed of two types of "objects": those that store all queueing and request-related information, and those representing the running service, maintenance and drive processes. All of these have a unique identifier.
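For a first look at those objects, CTA ships object-store helper tools. The sketch below assumes a VFS object store at a hypothetical path; argument order and available options may differ between releases, so check each tool's help output:
  # List the identifiers of the objects currently in the object store
  cta-objectstore-list /shared/cta/objectstore
  # Dump a single object to inspect its queued requests (object ID taken from the list above)
  cta-objectstore-dump-object /shared/cta/objectstore <objectID>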
Scheduler: An algorithm that, every ten seconds, checks, sorts and filters the queued requests to extract the potential tape mounts that are ready to be processed.
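To see what the Scheduler is working with, the current queues and potential mounts can be inspected from the admin CLI; a minimal sketch (the output columns vary between versions):
  # Summarise the archive/retrieve queues and the potential mounts derived from them
  cta-admin showqueues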
Frontend: When we talk about the frontend, we usually mean the daemon that allows us to authenticate through Kerberos and to communicate with the CTA CLI. It is also used to communicate with other EOS services.
CTA Admin CLI (cta-admin): The tool used by admin users to execute cta-admin commands, such as listing, adding, changing or removing most of the logical components of your infrastructure.
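A few everyday cta-admin calls, as a sketch of how the tool is used (most metadata objects follow the same add/ch/rm/ls sub-command pattern):
  # Check that the CLI can reach the Frontend and the Catalogue
  cta-admin version
  # List the configured tape drives and the logical library each one belongs to
  cta-admin drive ls
  # List the tapepools currently defined
  cta-admin tapepool ls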
Tapeserver: The set of processes and subprocesses managing the tape drive operations and transferring data.
Inside CTA there is a whole logical structure of administrative metadata that keeps our tape storage system organized. You should familiarize yourself with the following concepts before configuring your own instance, as you will be reading about them a lot:
Logical Library (ll): In addition to your physical tape library, in CTA you will need to define a logical library. Tape drives are assigned to logical libraries. We can have as many drives as we want in a single logical library, but each drive can only belong to one logical library. The same applies to tapes.
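A minimal sketch of creating a logical library with cta-admin (the name and comment are placeholders; check cta-admin logicallibrary add --help for the exact options in your release):
  # Create the logical library that drives and tapes will be assigned to
  cta-admin logicallibrary add --name IBM_LIB1 --comment "IBM library, building 513"
  # Verify the result
  cta-admin logicallibrary ls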
Disk Instance (di): This is generally another term for an “EOSCTA instance”, the ‘small EOS’ instance used to archive/retrieve files from tape. A CTA installation must have at least one disk instance. You can have one or multiple VOs assigned to a di. At CERN, there is a di for each of the large LHC experiments (e.g. eosctaatlas), and also instances shared by the small/medium sized experiments (e.g. eosctapublic), based on their archival needs.
Media Type (mt): This concept defines and specifies the tape format of our tapes, such as the type of cartridge, capacity, density codes and so on. By defining the media types properly, you can ensure that your system is correctly configured to handle the specific tape format you are working with.
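Both objects are created through cta-admin. The sketch below uses hypothetical names and LTO-9-like values, and the exact option names may vary between CTA releases:
  # Declare the disk instance that VOs will later be attached to
  cta-admin diskinstance add --name eosctapublic --comment "Shared EOSCTA instance"
  # Describe the cartridge format so CTA knows its native capacity and density code
  cta-admin mediatype add --name LTO9 --cartridge "LTO-9" --capacity 18000000000000 \
      --primarydensitycode 96 --comment "LTO-9 cartridges, 18 TB native"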
Virtual Organisation (vo): Represents a group of users working with the same data/experiment that requires separate data storage. Each experiment (regardless of its size) gets a dedicated VO with the same name. Each Virtual Organisation is linked to a Disk Instance. We can use VOs to enforce quotas, such as an experiment's number of dedicated drives for archival and retrieval, as well as to gather usage statistics and summaries for each experiment.
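A sketch of creating a VO with cta-admin (the names and drive counts are hypothetical, and option names may differ between versions):
  # Create a VO attached to a disk instance and cap its dedicated drives
  cta-admin virtualorganization add --vo atlas --diskinstance eosctaatlas \
      --readmaxdrives 2 --writemaxdrives 2 --comment "ATLAS experiment"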
Tapepool (tp): A logical grouping of tapes. Tapepools are used to keep data belonging to different VOs separate, to categorise types of data, and to separate multiple copies of files so that they are physically stored in different buildings. A tape can only belong to one tapepool, and a tapepool can only belong to one virtual organisation.
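Creating a tapepool for the VO above, as a sketch (the pool name, partial-tapes count and other values are placeholders):
  # Create a tapepool owned by the VO; CTA keeps a few partially filled tapes ready for writing
  cta-admin tapepool add --name atlas_prod --vo atlas --partialtapesnumber 5 \
      --encrypted false --comment "ATLAS production data"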
Storage Class (sc): With storage classes we specify how many tape copies an archive file is expected to have. A storage class can only belong to one virtual organisation, but one VO can have multiple storage classes with different names and numbers of copies.
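A sketch of a single-copy storage class (names are placeholders; check cta-admin storageclass add --help for the exact options):
  # Files tagged with this storage class are expected to get one tape copy
  cta-admin storageclass add --name atlas_raw --vo atlas --numberofcopies 1 \
      --comment "Single-copy ATLAS raw data"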
Archive Route (ar): Archive routes link storage classes with tapepools by specifying on which set of tapes the copies of an archive file will be written. Archive routes are useful if we want to create data separation between different file families of the same VO.
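Connecting the storage class and tapepool sketched above with an archive route, so copy number 1 of each file goes to that pool (option names are indicative):
  # Send copy number 1 of files with storage class atlas_raw to the atlas_prod tapepool
  cta-admin archiveroute add --storageclass atlas_raw --copynb 1 \
      --tapepool atlas_prod --comment "Route for ATLAS raw data"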
Mount Policy (mp): This feature allows us to set the requirements that trigger a tape mount, based on both 1. the minimum age (in seconds) that requests wait in the queue and 2. the priority parameter. You can create mount policies and customize these requirements for both archive and retrieve requests in order to ensure that tapes are only mounted when necessary. Mount policies have to be assigned to one (or more) Requester Mount Rules or Group Mount Rules.
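A sketch of a mount policy where both archive and retrieve requests become eligible for a mount after waiting 30 minutes (the policy name, ages and priorities are placeholders):
  # Equal priorities; requests older than 1800 s can trigger a mount
  cta-admin mountpolicy add --name default --archivepriority 1 --minarchiverequestage 1800 \
      --retrievepriority 1 --minretrieverequestage 1800 --comment "Default mount policy"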
Requester Mount Rule (rmr): In CTA, this is what matches a mount policy to the username of the user performing the archive or retrieve request. The rmr has a mount policy associated with it, which means that every request coming from that user will follow the mount requirements of that specific mount policy. When using CTA+dCache, this is the same user we define on the pools as cta-user when we create the hsm.
Group Mount Rule (gmr): In CTA, this is what matches a mount policy to the group of the user performing the archive or retrieve request. If the requesting username does not match any requester mount rule, CTA will use the mount policy assigned to the gmr to manage the requests. When using CTA+dCache, this is the same group we define on the pools as cta-group when we create the hsm.
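A sketch of mapping the dCache user and group mentioned above to the mount policy from the previous example (the disk instance name is hypothetical):
  # Requests from user cta-user on this disk instance follow the "default" mount policy
  cta-admin requestermountrule add --instance eosctapublic --name cta-user \
      --mountpolicy default --comment "dCache requester rule"
  # Fallback: requests from members of cta-group follow the same policy
  cta-admin groupmountrule add --instance eosctapublic --name cta-group \
      --mountpolicy default --comment "dCache group rule"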
Other useful concepts
Archiving a file: writing a file from disk to tape.
Retrieving a file: reading a file from tape to disk.
EOS: A disk-based, low-latency storage service with a highly scalable hierarchical namespace, which uses the XRootD protocol for data access.
dCache: A distributed storage system developed by DESY and proven to scale to hundreds of petabytes. Originally conceived as a disk cache (hence the name), it has evolved into a highly scalable, general-purpose, open source storage solution.
Enstore: Enstore is the mass storage system developed and implemented at Fermilab as the primary data store for scientific data sets. It provides access to data on tape to/from a user’s machine on-site over the local area network, or over the wide area network through the dCache disk caching system. In the near future, Enstore will no longer be maintained.
CASTOR: The CERN Advanced STORage manager is a hierarchical storage management system (i.e. it has both disk and tape) which was developed at CERN for archiving physics data with very large data volumes. CTA is the successor of CASTOR.
File Family (ff): A concept imported from Enstore that defines a category (or family) of data files within an experiment. There may be many file families inside a VO, as there is no preset limit. In dCache, a file_family tag is assigned to the different directories, specifying the ff associated with them. This concept does not exist in CTA. However, when integrating CTA+dCache we can create a storage class for every ff, using the convention "vo.ff@cta" for storage class names. This allows us to define different archive routes for each sc (in other words, files coming from different ffs can be stored in different tapepools to preserve the data separation between them).
You've finished reading the basic concepts needed to understand CTA logically. Jump to II - CTA Services to learn more about the components that make everything work!