Create Read Replicas
High Performance Computing (HPC) clusters (e.g. Azure Batch, https://azure.microsoft.com/en-us/services/batch/) typically present a pattern whereby a very large number of jobs running on virtual machines start concurrently as part of a scale-out computation. Each job typically requires a reference dataset that parameterizes the computation, and that dataset is often shared across all of the jobs. While the reference dataset itself may be relatively small (< 100 GB), a very large number of jobs simultaneously converging on the same data will often exceed the throughput limits of either the individual blobs or the storage account that contains the dataset.
A scalable solution to this problem is to create multiple replicas of the dataset so that the read load of all jobs is distributed across the replicas, thereby increasing the effective throughput.
Dash can be configured to asynchronously replicate blobs to all configured data accounts. Each subsequent read request will be randomly distributed across the set of available replicas.
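To get a rough sense of the scaling benefit, the sketch below estimates how many replicas a given read load would need and illustrates the idea of spreading reads across them. The throughput figures, account names, and the uniform random selection are illustrative assumptions only, not Dash internals or documented Azure limits.

```python
import math
import random

# Illustrative figures only -- substitute the documented limits for your
# storage account type and the measured read rate of your jobs.
PER_ACCOUNT_THROUGHPUT_GBPS = 10      # assumed egress limit per data account
concurrent_jobs = 2000                # jobs reading the shared reference dataset
per_job_read_gbps = 0.05              # assumed sustained read rate per job

aggregate_demand_gbps = concurrent_jobs * per_job_read_gbps
replicas_needed = math.ceil(aggregate_demand_gbps / PER_ACCOUNT_THROUGHPUT_GBPS)
print("Aggregate demand: %d Gbps -> %d replicas" % (aggregate_demand_gbps, replicas_needed))

# Conceptually, Dash spreads each read across the available replicas; a uniform
# random choice is shown here purely to illustrate how the load is distributed.
replica_accounts = ["dashdata%02d" % i for i in range(replicas_needed)]

def pick_replica_for_read():
    return random.choice(replica_accounts)

print("This read would be served from:", pick_replica_for_read())
```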
Dash supports two mechanisms to signal that a given blob should be replicated:
- The name of the blob matches a configured regular expression. If the regex matches, the blob will be replicated once it has been written. The configuration setting is `ReplicationPathPattern`. The regular expression grammar is described here: https://msdn.microsoft.com/en-us/library/hs600312(v=vs.110).aspx. An example that matches all blobs below `/output-blobs` and not contained in a folder named `_temporary` is `^.*/output-blobs/(?!.*/_temporary).*` (see the sketch after this list).
- The blob contains metadata that matches configured values, in which case it will be replicated once it has been written. The configuration setting for the name of the metadata is `ReplicationMetadataName`, and the setting that specifies the required metadata value is `ReplicationMetadataValue`.
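The quick check below runs the example `ReplicationPathPattern` against a few sample blob names to show which ones would be selected for replication. Dash evaluates the pattern with the .NET regex engine linked above; Python's `re` module handles this particular pattern (including the negative lookahead) the same way, so it is convenient for a local sanity check. The blob names are made up for illustration.

```python
import re

# The example ReplicationPathPattern from the list above.
pattern = re.compile(r"^.*/output-blobs/(?!.*/_temporary).*")

blob_names = [
    "job42/output-blobs/part-00000",                       # under /output-blobs/ -> replicate
    "job42/output-blobs/attempt_0/_temporary/part-00000",  # inside a /_temporary/ folder -> skip
    "job42/input-blobs/reference.dat",                     # not under /output-blobs/ -> skip
]

for name in blob_names:
    replicated = bool(pattern.match(name))
    print("%-52s -> %s" % (name, "replicate" if replicated else "skip"))
```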
If a blob satisfies the configured replication settings, replication will be triggered whenever Dash receives any of the following operations:
- Put Blob
- Put Block List
- Put Page
- Set Blob Metadata
- Set Blob Properties
- Delete Blob
- Copy Blob
The handlers for the above operations simply enqueue a message to replicate the blob asynchronously, so replication does not increase the latency of any of these operations. Note, however, that because replicas are created asynchronously you cannot assume that a replica will be present immediately after writing the blob. The effective delay before all replicas are created is at most 1 minute, although replication may complete sooner.
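As an illustration of the metadata-based trigger and the asynchronous behaviour, the sketch below writes a blob through a Dash endpoint with replication metadata set, assuming the legacy azure-storage Python SDK (`BlockBlobService`). The endpoint, account, container, and metadata name/value are placeholders and must match your Dash deployment's `ReplicationMetadataName` and `ReplicationMetadataValue` settings.

```python
from azure.storage.blob import BlockBlobService

# Placeholder credentials -- the Dash front end is addressed like a normal
# storage account, here via a custom domain pointing at the Dash deployment.
service = BlockBlobService(account_name="mydashaccount",
                           account_key="<dash-account-key>",
                           custom_domain="mydash.cloudapp.net")

# Assumes the Dash configuration contains, for example:
#   ReplicationMetadataName  = "replicate"
#   ReplicationMetadataValue = "true"
service.create_blob_from_text(
    container_name="reference-data",
    blob_name="datasets/model-params.json",
    text='{"alpha": 0.1}',
    metadata={"replicate": "true"})   # matching metadata marks the blob for replication

# The Put Blob handler only enqueues the replication work, so the call above
# returns immediately; replicas appear asynchronously, within about a minute.
```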
For the Delete Blob operation, the replicas are likewise deleted asynchronously. However, because the namespace entry for the blob is deleted synchronously, the replicas become immediately inaccessible to callers.
In addition to asynchronously copying potentially large amounts of data across storage accounts, the cost of storing each replica will also apply to the billing for the subscription. Consequently, it is strongly advised to use replication judiciously and to replicate only the small set of blobs that will benefit from the increased read throughput (as described above).