
Summit 7 Team Blogs

The Performance Impact of Premium Storage in Microsoft Azure

Hello once again, all! Let me extend a special welcome to all my fellow CloudPros.

My primary project this fall has been deploying a medium-sized SharePoint 2013 farm in Microsoft Azure's Infrastructure as a Service (IaaS). We successfully deployed an identical environment early this summer to replace the customer's Development environment. This time, though, we are deploying their new Production farm in Azure.

The major concern is whether Azure will give us the performance we need to host the Production environment. When we deployed the Dev environment, the new Premium Storage option was not yet generally available (GA), so we used only Standard Storage. Premium Storage is Microsoft's persistent solid-state storage (SSD) offering. It provides significantly better performance than regular Standard Storage. Several Azure VM sizes already offered SSD, but those disks were ephemeral: their contents would be destroyed if the VM moved to a new host (extremely likely). When we performed load tests in Dev, we found that, sure enough, the SQL Server disk performance was a significant bottleneck. It could not deliver the performance needed while the system was under heavy load.

However, since then, Premium Storage has gone GA and so we chose to use it with the SQL Server. We pinned our Production hopes on Premium Storage's advertised performance. We recently completed the deployment of the farm (although at this time it's not yet live), and we just completed the load testing. Thankfully, it looks like our hopes were well-founded.

The Environments

Before discussing the load test and the results, it is important to first describe the environments. The primary load test was a full crawl of the farm's content. As such, this description will focus on the search components.

Both environments were hosted 100% in Microsoft Azure IaaS. Except for the SQL Server and a redesign of Workflow Manager (which doesn't factor into the tests), they are identical to each other in almost every way: same VM sizes, same disk layouts, same network architecture, same farm settings, same web applications, same search architecture, etc. The Dev environment is hosted in East US, but the Production environment is hosted in East US 2 in order to give us access to the DS series of VMs. We needed a DS for the SQL Server, since only the DS and GS sizes can use Premium Storage. Despite using a different series, we still used the same size for the SQL Server VM (D12 vs DS12).

The SharePoint farms contain two Search Service Applications. One is for the public-facing search site and crawls both a local web application as well as a collection of around 26 Sitecore sites. The second is for the intranet portal and crawls each local web application as well as some file shares in the corporate office.

We deployed two servers (D12) dedicated to Search. Because they are dedicated, we deployed all Search components on these two servers. Each had a separate, dedicated disk to hold the search indexes.

The following shows the Development Farm physical architecture (the names are approximate):

The following is the Production Farm physical architecture. The primary differences here are: it resides in East US 2, we eliminated a server and chose to host Workflow Manager on one of the App servers, we're publishing the web applications to the Internet, and we're using Premium Storage for the SQL Server. Note that we're using the DCs in East US, not East US 2. The names are approximate.

Storage Design

Since we expected storage to be the primary bottleneck, we designed it to provide maximum performance with what was available. For Standard Storage, that meant using multiple disks in a RAID-0 (striped) configuration across multiple Storage Accounts. I won't be getting into the whys at this time (future blog). Both Dev and Prod also have a two-disk RAID-0 array hosting a drive for backups, installation media, etc.

In both the Dev and Prod environments, because we're using D- and DS-series VMs, we chose to place the TempDB database on the local, ephemeral SSD disk. Normally, we are not supposed to put data on the local SSD disk, because its contents are lost when the VM moves to a new host. However, we used a technique to ensure that this will not be a problem. I will likely post a separate blog on how to do this.
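The technique itself is not described here, so the following is only a sketch of one commonly used approach, not necessarily the one used on this farm: set the SQL Server service to manual start, and have a script run at boot that recreates the TempDB folder on the ephemeral disk before starting the service. The folder path, the default-instance service name `MSSQLSERVER`, and the function names are assumptions for illustration.

```python
import os
import subprocess
from typing import Callable


def start_sql_service() -> None:
    # "net start" is the standard Windows command for starting a service;
    # MSSQLSERVER is the service name of a default SQL Server instance
    # (assumed here, a named instance would differ).
    subprocess.run(["net", "start", "MSSQLSERVER"], check=True)


def prepare_tempdb_and_start(
    tempdb_dir: str,
    start_service: Callable[[], None] = start_sql_service,
) -> None:
    """Recreate the TempDB folder if the ephemeral disk was wiped,
    then start SQL Server so it can rebuild TempDB there."""
    os.makedirs(tempdb_dir, exist_ok=True)
    start_service()
```

Run from a startup task, the call might look like `prepare_tempdb_and_start(r"D:\SQLTemp")`, where `D:\SQLTemp` is a hypothetical folder on the local SSD. This works because SQL Server rebuilds TempDB every time it starts, so only the folder has to exist.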

Again, in the Dev environment, we used Standard Storage for everything. For the SQL Server, we created two four-disk RAID-0 arrays. The database log files were placed on one and the data files were placed on another. The disks holding the search indexes also use a four-disk RAID-0 array. The following is the physical storage architecture for the Dev environment:

The storage architecture for the Prod environment is identical to Dev's except with one important difference: instead of the two RAID-0 arrays across four Standard Storage Accounts, two P30 disks were used from a single Premium Storage Account. The database log and data disks each have their own P30 disk. Again, though, the Search servers continued to use Standard Storage (saves money).
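To put the two designs side by side, here is a rough back-of-the-envelope sketch of the best-case IOPS ceiling for each layout. The per-disk figures are assumptions based on Azure's published targets at the time (500 IOPS per Standard data disk, 5,000 IOPS per P30 disk), not measurements from this farm, and VM-level limits may cap the totals further.

```python
# Per-disk limits: assumed from Azure's published targets, not measured here.
STANDARD_DISK_IOPS = 500   # Standard Storage data disk
P30_DISK_IOPS = 5000       # Premium Storage P30 disk


def striped_iops(disks: int, per_disk_iops: int) -> int:
    """Best-case IOPS for a RAID-0 set: member disks add up."""
    return disks * per_disk_iops


dev_array = striped_iops(4, STANDARD_DISK_IOPS)  # four-disk Standard array
prod_disk = striped_iops(1, P30_DISK_IOPS)       # single P30 disk

print(f"Dev  4 x Standard RAID-0: {dev_array} IOPS ceiling")
print(f"Prod 1 x P30:             {prod_disk} IOPS ceiling")
print(f"Ratio: {prod_disk / dev_array:.1f}x")
```

These are ceilings only. How close a workload gets to them, and what latency it sees on the way, is what the load test measures.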

The Load Test

Load testing SharePoint isn't the easiest thing to do, and it's especially difficult to test the write performance of a farm. My favorite test for writes is a full crawl of the content. A crawl hits the SQL Server hard: SharePoint must read from and write to the Search databases, and it must also read from the content databases to serve the content to the crawler. A crawl also uses the web servers, of course. For maximum impact, we first reset the indexes in order to truly start from scratch. This starts us out with no index on the Search servers, giving us heavy writes on the index disks as well. For the test, we kicked off Full Crawls for all content sources at the same time in both the public and intranet Search Service Applications (SSAs). This gave us a good, repeatable means to beat up our storage infrastructure.

One of the content sources in the intranet SSA is a file share back at the corporate office. The farm is hosted in Azure, so the crawl needs to pull the documents across the Internet over the VPN. This is unfortunate, but we intend to solve this challenge soon by deploying the new Cloud SSA (a topic for another time). Unfortunately, the Dev environment did not have access to the share, so you'll see that it failed to crawl. We will, however, see the impact of crawling across a VPN via the Prod crawl results.

The Results

When crawling, we have three primary metrics in addition to the total number of items crawled. First is the duration, of course. Second is the Crawl Rate, which measures how many documents are crawled per second (dps). Third is the Repository Latency, which measures the amount of time (in ms) the crawler took to pull down the data from the content source. On the SQL Server, the database log files are on the L: drive and the data files are on the M: drive. Here are the numbers for the intranet SSA:

 

Development Test

Content Source                          Count     Duration ([h:]mm:ss)   Crawl Rate (dps)   Repository Latency (ms)
On-site file share (all failures)       0         00:47                  0                  0
Site collection from a local web app    14,040    1:09:13                3.7                467
Local SharePoint sites                  8,559     14:56                  9.7                583
User Profiles                           11,544    15:40                  12.3               270

Production Test

Content Source                          Count     Duration ([h:]mm:ss)   Crawl Rate (dps)   Repository Latency (ms)
On-site file share                      73,769    6:48:56                3.1                6,935
Site collection from a local web app    17,719    19:27                  16.2               237
Local SharePoint sites                  13,534    19:28                  12.8               381
User Profiles                           11,611    19:28                  9.9                149
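As a sanity check on how the columns relate, the crawl rate can be approximated by dividing the item count by the duration. The short sketch below does that for the Production numbers. The helper names are mine, and I'm assuming the durations are in mm:ss or h:mm:ss form. The derived values land close to, but not exactly on, the reported rates, presumably because SharePoint averages the rate differently.

```python
def to_seconds(duration: str) -> int:
    """Convert 'mm:ss' or 'h:mm:ss' to a number of seconds."""
    seconds = 0
    for part in duration.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def docs_per_second(count: int, duration: str) -> float:
    """Approximate crawl rate: items crawled divided by elapsed seconds."""
    return count / to_seconds(duration)


# (content source, items crawled, duration, reported crawl rate in dps)
prod_rows = [
    ("On-site file share", 73769, "6:48:56", 3.1),
    ("Site collection from a local web app", 17719, "19:27", 16.2),
    ("Local SharePoint sites", 13534, "19:28", 12.8),
    ("User Profiles", 11611, "19:28", 9.9),
]

for name, count, duration, reported in prod_rows:
    derived = docs_per_second(count, duration)
    print(f"{name}: derived {derived:.1f} dps, reported {reported} dps")
```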


Interpreting the Results

The following are the primary take-aways from the results above.

  • For some content sources, the time to complete the full crawl was less in Dev. However, the Crawl Rate and Repository Latency tell a more accurate story. For the site collection from a local web app, the Dev crawl took about 3.5x longer than Prod, its crawl rate was more than 4x lower, and its repository latency was roughly double that of Prod. For "Local SharePoint sites," Prod's crawl rate was roughly 1/3 higher and its repository latency was about 35% lower. For User Profiles, Dev saw a faster crawl rate than Prod, but its Repository Latency was worse. Additionally, Prod was crawling the on-site file share for the duration of these crawls, potentially contributing to the longer durations.
  • It is very likely that the Repository Latencies (the time it takes to retrieve an item from the content source) are the primary cause of the poorer crawl rates in Dev. SharePoint was slower to serve content, leading to a lower crawl rate.
  • For the duration of the SharePoint crawls, the Prod SQL Server disks performed very well. The Avg. Disk sec/Write of the transaction log disk averaged only 2.8 ms (max of just 8.7 ms), and the data disk averaged 20 ms (max of 43 ms). The transaction log disk averaged 187 IOPS (max of 348), and the data disk averaged 104 IOPS (max of 460). In Dev, the IOPS were 25-33% lower, and the sec/Write was 50% slower for the data disk and 1,200% slower for the transaction log disk.
  • The write speeds on the index disks (on APP3 and APP4) were poor, averaging 71ms and 86ms per write. These disks utilize Standard Storage. The indexing performance could be improved by utilizing Premium Storage instead.
  • The improved disk performance of the SQL Server has pushed the crawling/indexing bottleneck to the search servers (primarily the CPU). Increasing the number of CPUs on APP3 and APP4 will decrease the time to complete the full crawls. Adding Premium Storage will decrease it even further.
  • The very long duration of the on-site file share crawl and the very slow Repository Latency (nearly 7 seconds!) are the result of the crawler needing to pull files from on-site to Azure over the VPN. We are currently looking to solve this issue by using the new Cloud SSA both on-site and in Azure to eliminate the need to pull the docs over the VPN. A good alternative would be to use ExpressRoute.
  • Although the search servers in both Dev and Prod are the same VM size (D12), Dev shows the CPU as being an E5-2660 while Prod is an E5-2673 v3. The Prod CPU is better (http://www.cpu-world.com/Compare/212/Intel_Xeon_E5-2660_vs_Intel_Xeon_E5-2673_v3.html) and could potentially explain some of the improvement.
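The Dev-versus-Prod comparisons above can be recomputed directly from the two tables. The sketch below does so for the three content sources that crawled in both environments. The dictionary layout and function name are just for illustration.

```python
# (crawl rate in dps, repository latency in ms) from the intranet SSA tables
dev = {
    "Site collection from a local web app": (3.7, 467),
    "Local SharePoint sites": (9.7, 583),
    "User Profiles": (12.3, 270),
}
prod = {
    "Site collection from a local web app": (16.2, 237),
    "Local SharePoint sites": (12.8, 381),
    "User Profiles": (9.9, 149),
}


def compare(dev_row, prod_row):
    """Return (Prod/Dev crawl-rate ratio, % change in repository latency)."""
    dev_rate, dev_latency = dev_row
    prod_rate, prod_latency = prod_row
    rate_ratio = prod_rate / dev_rate
    latency_change = (prod_latency - dev_latency) / dev_latency * 100
    return rate_ratio, latency_change


for source in dev:
    ratio, change = compare(dev[source], prod[source])
    print(f"{source}: crawl rate x{ratio:.2f}, latency {change:+.0f}%")
```

The on-site file share is left out because it failed in Dev, so there is nothing to compare it against.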

Conclusion

Overall, these crawl tests show significant improvement in Prod over Dev. Since the architectures are nearly identical apart from Premium Storage, I think we can point to Premium Storage as the main reason for the performance improvement. Additionally, if we find that the disk performance still isn't good enough, we can scale up significantly by adding more Premium Storage disks. Prod only uses a single Premium disk for each database drive, but we could very easily scale upwards by using multiple P30 disks in one or more RAID-0 arrays. We could also use Premium disks elsewhere in the environment, such as with the index disks on APP3 and APP4. We definitely have some good options if we need them.

Hopefully this blog has demonstrated that it's possible to host production workloads in Microsoft Azure IaaS. There are, of course, other factors which need to be considered when architecting a complete solution (hopefully a future blog or whitepaper). However, I believe that Premium Storage can eliminate concerns about disk performance and enable more high-end scenarios. A key roadblock has been removed. Well done, Microsoft!

Thanks for reading. To the Cloud!
