Create Read Replicas
High Performance Computing (HPC) clusters (e.g. Azure Batch, https://azure.microsoft.com/en-us/services/batch/) typically present a pattern whereby a very large number of jobs running on virtual machines start concurrently as part of a scale-out computation. Each job typically requires a reference dataset that parameterizes the computation, and that dataset is often shared across all of the jobs. While the reference dataset itself may be relatively small (< 100 GB), a very large number of jobs simultaneously converging on the same data will often exceed the throughput limits of either the individual blobs or the storage account that contains the dataset.
A scalable solution to this problem is to create multiple replicas of the dataset so that the read load of all jobs is distributed across the replicas, thereby increasing the effective throughput.
Dash can be configured to asynchronously replicate blobs to all configured data accounts. Each subsequent read request will be randomly distributed across the set of available replicas.
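To get a rough sense of the scaling benefit, the sketch below estimates how many replicas a given read load would need and illustrates the idea of spreading reads across them. The throughput figures, account names, and the uniform random selection are illustrative assumptions only, not Dash internals or documented Azure limits.

```python
import math
import random

# Illustrative figures only -- substitute the documented limits for your
# storage account type and the measured read rate of your jobs.
PER_ACCOUNT_THROUGHPUT_GBPS = 10      # assumed egress limit per data account
concurrent_jobs = 2000                # jobs reading the shared reference dataset
per_job_read_gbps = 0.05              # assumed sustained read rate per job

aggregate_demand_gbps = concurrent_jobs * per_job_read_gbps
replicas_needed = math.ceil(aggregate_demand_gbps / PER_ACCOUNT_THROUGHPUT_GBPS)
print("Aggregate demand: %d Gbps -> %d replicas" % (aggregate_demand_gbps, replicas_needed))

# Conceptually, Dash spreads each read across the available replicas; a uniform
# random choice is shown here purely to illustrate how the load is distributed.
replica_accounts = ["dashdata%02d" % i for i in range(replicas_needed)]

def pick_replica_for_read():
    return random.choice(replica_accounts)

print("This read would be served from:", pick_replica_for_read())
```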
Dash supports two mechanisms to signal that a given blob should be replicated:
- The name of the blob matches a configured regular expression. If the regex matches, the blob will be replicated once it has been written. The configuration setting is `ReplicationPathPattern`. The regular expression grammar is described here: https://msdn.microsoft.com/en-us/library/hs600312(v=vs.110).aspx. An example that matches all blobs below `/output-blobs` and not contained in a folder named `_temporary` is `^.*/output-blobs/(?!.*/_temporary).*` (see the sketch after this list).
- The blob contains metadata that matches configured values, in which case it will be replicated once it has been written. The configuration setting for the name of the metadata is `ReplicationMetadataName`, and the setting that specifies the required metadata value is `ReplicationMetadataValue`.
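The quick check below runs the example `ReplicationPathPattern` against a few sample blob names to show which ones would be selected for replication. Dash evaluates the pattern with the .NET regex engine linked above; Python's `re` module handles this particular pattern (including the negative lookahead) the same way, so it is convenient for a local sanity check. The blob names are made up for illustration.

```python
import re

# The example ReplicationPathPattern from the list above.
pattern = re.compile(r"^.*/output-blobs/(?!.*/_temporary).*")

blob_names = [
    "job42/output-blobs/part-00000",                       # under /output-blobs/ -> replicate
    "job42/output-blobs/attempt_0/_temporary/part-00000",  # inside a /_temporary/ folder -> skip
    "job42/input-blobs/reference.dat",                     # not under /output-blobs/ -> skip
]

for name in blob_names:
    replicated = bool(pattern.match(name))
    print("%-52s -> %s" % (name, "replicate" if replicated else "skip"))
```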
If a blob satisfies the configured replication settings, replication will be triggered whenever Dash receives any of the following operations:
- Put Blob
- Put Block List
- Put Page
- Set Blob Metadata
- Set Blob Properties
- Delete Blob
- Copy Blob
The handlers for the above operations simply enqueue a message to replicate the blob asynchronously, so replication does not increase the latency of any of these operations. Note, however, that because replicas are created asynchronously you cannot assume that a replica will be present immediately after writing the blob. The effective delay before all replicas are created is at most 1 minute, although replication may complete sooner.
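As an illustration of the metadata-based trigger and the asynchronous behaviour, the sketch below writes a blob through a Dash endpoint with replication metadata set, assuming the legacy azure-storage Python SDK (`BlockBlobService`). The endpoint, account, container, and metadata name/value are placeholders and must match your Dash deployment's `ReplicationMetadataName` and `ReplicationMetadataValue` settings.

```python
from azure.storage.blob import BlockBlobService

# Placeholder credentials -- the Dash front end is addressed like a normal
# storage account, here via a custom domain pointing at the Dash deployment.
service = BlockBlobService(account_name="mydashaccount",
                           account_key="<dash-account-key>",
                           custom_domain="mydash.cloudapp.net")

# Assumes the Dash configuration contains, for example:
#   ReplicationMetadataName  = "replicate"
#   ReplicationMetadataValue = "true"
service.create_blob_from_text(
    container_name="reference-data",
    blob_name="datasets/model-params.json",
    text='{"alpha": 0.1}',
    metadata={"replicate": "true"})   # matching metadata marks the blob for replication

# The Put Blob handler only enqueues the replication work, so the call above
# returns immediately; replicas appear asynchronously, within about a minute.
```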
For the Delete Blob operation, the replicas are likewise deleted asynchronously. However, because the namespace entry for the blob is deleted synchronously, the replicas become immediately inaccessible to callers.
In addition to asynchronously copying potentially large amounts of data across storage accounts, the cost of storing each replica will also apply to the billing for the subscription. Consequently, it is strongly advised to use replication judiciously and to replicate only the small set of blobs that will benefit from the increased read throughput (as described above).