
Data At Scale Hub (DASH) for Azure Storage

Overview

DASH provides scalability for very large solutions built on Azure Storage where the scale of the solution exceeds the limits of a single storage account (capacity, throughput, or IOPS). While aggregating the limits of multiple storage accounts to provide greater scalability is a relatively common pattern, the ability of DASH to do this in a completely transparent manner, while maintaining maximal network and compute efficiency, is a considerable improvement over existing practices.

An architectural overview of DASH is provided here: [Architectural Overview](Architectural Overview).

Supported Scenarios

The following scalability scenarios are supported for Azure Storage clients:

  1. Total storage capacity greater than 500TB. Applications that require very large storage capacity but do not want to build in the complexity of mapping where the data actually resides (e.g. media, backup, genomics).
  2. Distributed analytics workloads. Distributed compute clusters (e.g. HDInsight/Hadoop, Mesos) are capable of exceeding the throughput limits of a single storage account when all nodes in the cluster converge their I/O on that account. This applies to even modestly sized clusters (>= 15 nodes). DASH is capable of aggregating the throughput of multiple storage accounts (> 300Gbps) without introducing significant application complexity to the workload.
  3. High Performance Computing (HPC) clusters requiring high read throughput for reference datasets. HPC clusters typically consist of a very large number of VMs that must converge on a relatively small (< 100GB) reference dataset used in the workload calculation. DASH is able to create many Read Replicas that effectively distribute the read load over multiple storage accounts.
  4. Workloads that require a very large amount of transactional throughput or IOPS. Very large key-value stores (e.g. HBase running on HDInsight) can require more than the 20,000 operations per second permitted for a single storage account. These workloads can use DASH to aggregate the transaction rate across multiple storage accounts.
  5. (Coming Soon) The 200GB limit for a single blob is an issue for many workloads. DASH will provide a mechanism that allows more than the standard 50,000 blocks to be written to a single blob, resulting in multi-TB blobs.
  6. (Coming Soon) Geo-distribution. Many applications use Azure to provide a geo-distributed footprint. While having a local point of presence for the web frontend delivers many important improvements, data locality is also an issue. DASH will provide a mechanism whereby blobs may be replicated over a flexible topology of storage accounts, yielding the desired data locality. Additionally, DASH will provide a policy-based mechanism whereby data may be assigned to exist ONLY in a given region, regardless of which frontend performed the write. This capability provides the data sovereignty or 'safe harbor' qualities demanded by certain jurisdictions.

Deployment

There are three ways to deploy DASH, each with applicability to a different audience:

  1. Deploy a pre-built binary directly to Azure. See [Deploying Pre-Built Binaries](Deploying Pre-Built Binaries).
  2. Use Visual Studio to build DASH and then deploy from within the IDE. See [Deploying from Visual Studio](Deploying from Visual Studio).
  3. Incorporate building and deploying DASH into your normal development lifecycle. See [Incorporate Dash Deployment into ALM](Incorporate Dash Deployment into ALM).

Management

Once a DASH service has been deployed, it may be managed using any of the following approaches:

  1. Use the built-in Management Portal that provides a web application to manage and monitor the service.
  2. Write your own application or tooling and call the Management API REST interface.
  3. Use the Azure Management Portal to directly manipulate the configuration of the DASH service.

How Do I Use It From My Application?

From the application's perspective, DASH looks exactly like a standard Azure Blob Storage endpoint. The same REST API is supported, so applications directly using the REST API or any of the storage libraries will work unmodified.

The only thing that needs to change is the connection string:

  • The standard connection string format supports the specification of custom domain names, as described in the Azure Storage documentation.
  • Specify the DNS name of your DASH endpoint in the BlobEndpoint attribute (e.g. BlobEndpoint=http://mydashservice.cloudapp.net).
  • Include the account name/key OR a shared access signature as normal.
  • A complete example of a DASH connection string is:

AccountName=dashaccount;AccountKey=myBase64EncodedAccountKey;BlobEndpoint=http://mydashservice.cloudapp.net
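
To make the interpretation concrete, here is a minimal sketch in Python (standard library only) of how a client might parse this connection string; the container name 'mycontainer' is a hypothetical example:

    # A minimal sketch of how a client interprets the connection string
    # above; 'mycontainer' is a hypothetical container name.
    conn_str = ("AccountName=dashaccount;AccountKey=myBase64EncodedAccountKey;"
                "BlobEndpoint=http://mydashservice.cloudapp.net")

    # Split 'Key=Value' pairs on the first '=' only, since values such as
    # base64 keys and URLs may themselves contain '='.
    settings = dict(part.split("=", 1) for part in conn_str.split(";") if part)

    # All blob requests are then issued against the DASH endpoint rather
    # than the usual <account>.blob.core.windows.net endpoint.
    container_url = settings["BlobEndpoint"] + "/mycontainer"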

Library Support for Automatically Following Redirections

A number of the standard Azure Storage libraries for certain languages do not automatically follow HTTP redirections. While DASH supports communication with clients using these libraries, it does so in 'Passthrough Proxy' mode, which is far less efficient than redirection mode.

To address this issue, we have modified versions of the standard libraries for the following languages that automatically follow redirections (support for additional languages will be added as demand dictates):

  • .NET - The .NET library https://github.com/Azure/azure-storage-net/ does support automatic following of HTTP redirections. We have modified this library to support the Expect: 100-Continue header, which means that the payload is NOT sent to the DASH server for PUT requests.
  • Java - Various JREs have inconsistent support for automatically following HTTP redirects, so the Java library for Azure Storage https://github.com/Azure/azure-storage-java explicitly prevents it. Additionally, support for the Expect: 100-Continue request header was only added in Java 8. We have modified the standard library to support both of these features and to work in the most efficient manner with DASH.
  • Python (Coming Soon) - The Python library for Azure Storage https://github.com/Azure/azure-sdk-for-python exhibits similar behavior to the Java library and will be modified to work with DASH in full redirection mode.
  • Node.js (Coming Soon) - The Node.js library for Azure Storage https://github.com/Azure/azure-storage-node will be modified to fully support detection of a DASH server, automatic following of redirections, and Expect: 100-Continue behavior.

These modified libraries are available as pre-built binaries from our package downloader at the base URI https://www.dash-update.net/Client/Latest.
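
As an illustration of what these modified libraries do on the wire, here is a simplified, hypothetical sketch in Python of a redirect-aware PUT using Expect: 100-continue. The host and path are placeholders, and real client code would need TLS, authentication, and robust response parsing:

    # Simplified sketch of the redirect-aware PUT flow the modified
    # libraries implement. Host/path are placeholders; response parsing
    # is deliberately naive (no TLS, no partial-read handling).
    import socket

    def put_with_expect_continue(host, path, body, port=80):
        # body must be bytes
        sock = socket.create_connection((host, port))
        request_head = (
            "PUT {} HTTP/1.1\r\n"
            "Host: {}\r\n"
            "Content-Length: {}\r\n"
            "Expect: 100-continue\r\n"
            "User-Agent: my-app/1.0 dash\r\n"  # 'dash' marks redirect support
            "\r\n"
        ).format(path, host, len(body))
        sock.sendall(request_head.encode("ascii"))
        # Wait before sending the body: a '100 Continue' means send it here;
        # a '302' means re-issue the whole request against the Location
        # header (the actual storage account), so no payload transits DASH.
        reply = sock.recv(4096).decode("ascii", "replace")
        if reply.startswith("HTTP/1.1 100"):
            sock.sendall(body)
            reply = sock.recv(4096).decode("ascii", "replace")
        sock.close()
        return reply  # caller inspects the status line / Location header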

Requirements for Storage Clients Making Direct REST Calls

The following requirements must be met by clients making REST calls directly to the DASH endpoint (i.e. NOT using one of the standard Azure Storage SDKs):

  1. Augment the request's User-Agent header value to include 'DASH' as either part of the browser information or extensions component. The User-Agent string is used to determine whether the client can automatically follow HTTP redirections. E.g. a user agent string of 'Azure-Storage/2.2.0+(JavaJRE+1.7.0_79;+Linux+3.13.0-63-generic+dash)' will cause DASH to use redirection mode.
  2. For PUT requests with a body payload, include the request header Expect: 100-Continue to prevent the payload being sent to the DASH server. If the DASH server receives a request with a body payload, it will automatically use proxy mode for that request to the appropriate storage account, regardless of the User-Agent string.
  3. A client can query a storage endpoint to determine whether it is a DASH endpoint. This is useful for deciding whether to turn the Expect: 100-Continue request header on or off. To determine whether the endpoint is DASH, send an OPTIONS request; if the response includes the header x-ms-dash-client then the endpoint is DASH. A sketch of all three rules follows this list.
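
By way of example, here is a hedged sketch of these rules using Python and the third-party requests package; the endpoint, container, and blob names are placeholders, and request authentication is omitted for brevity:

    # Hypothetical sketch of the three rules above using the 'requests'
    # package; endpoint and blob names are placeholders, auth omitted.
    import requests

    DASH_ENDPOINT = "http://mydashservice.cloudapp.net"

    # Rule 3: an OPTIONS request reveals whether the endpoint is DASH.
    probe = requests.options(DASH_ENDPOINT)
    is_dash = "x-ms-dash-client" in probe.headers

    # Rule 1: include 'dash' in the User-Agent to opt in to redirection mode.
    headers = {"User-Agent": "my-app/1.0 dash"}

    # GET requests can simply follow the 302 that DASH issues; requests
    # does this automatically (allow_redirects defaults to True for GET).
    resp = requests.get(DASH_ENDPOINT + "/mycontainer/myblob",
                        headers=headers)

    # Rule 2 (PUT with Expect: 100-continue) needs lower-level control;
    # see the socket-based sketch in the previous section.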

An additional security measure may be performed by REST clients. When DASH sends an HTTP redirection (302) response, in addition to the standard Location header it also includes a response header that may be used to verify that the response originated from a DASH server with access to the expected Account Key value. This measure mitigates 'Man in the Middle' attacks, in which a malicious interloper could redirect a client to an unexpected location to steal the application's data. This mitigation is normally only required when not communicating over a trusted SSL/TLS tunnel.

To verify the redirection location, a client must use the signature included in the x-ms-redirect-signature response header. The value of this header is as follows:

[SharedKey|SharedKeyLite] <AccountName>:<Signature>

where:

AccountName is the name of the storage account - this should match the expected account name.

Signature is the base64-encoded SHA-256 HMAC of a canonicalized resource string (see below) using the AccountKey used to sign the original request. For details on how to construct a canonicalized resource string, see this article: https://msdn.microsoft.com/en-us/library/azure/dd179428.aspx.

The canonicalized resource string is composed of the following attributes in this format:

StringToSign = VERB + "\n" +
               Date + "\n" +
               Location + "\n" +
               CanonicalizedResponseHeaders

The CanonicalizedResponseHeaders is constructed as described in the 'Constructing the Canonicalized Headers String' section in the article above.

The base64-encoded HMAC may be compared to the Signature value to determine if the response was issued by a server that knows the secret Account Key. It also validates that the response has not been altered in any significant manner.
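
Putting this together, a minimal verification sketch in Python might look like the following; the function and parameter names are illustrative, and the caller must supply the verb, date, location, and canonicalized headers exactly as used to build the StringToSign:

    # Hypothetical sketch of verifying x-ms-redirect-signature; the
    # function and parameter names are illustrative only.
    import base64, hashlib, hmac

    def verify_redirect_signature(account_name, account_key_b64, verb, date,
                                  location, canonicalized_headers,
                                  header_value):
        # Header format: "[SharedKey|SharedKeyLite] <AccountName>:<Signature>"
        scheme, _, rest = header_value.partition(" ")
        if scheme not in ("SharedKey", "SharedKeyLite"):
            return False
        name, _, signature = rest.partition(":")
        if name != account_name:  # the account must be the one we expect
            return False
        # Rebuild StringToSign exactly as defined above.
        string_to_sign = "\n".join(
            [verb, date, location, canonicalized_headers])
        # HMAC-SHA256 keyed with the base64-decoded Account Key.
        key = base64.b64decode(account_key_b64)
        digest = hmac.new(key, string_to_sign.encode("utf-8"),
                          hashlib.sha256).digest()
        expected = base64.b64encode(digest).decode("ascii")
        # Constant-time comparison against the signature from the header.
        return hmac.compare_digest(expected, signature)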