As part of the Secrets of Data Analytics Leaders by the Eckerson Group, I did a 30-minute podcast with Wayne Eckerson where I discussed myths of modern data management. Some of the myths discussed include ‘all you need is a data lake’, ‘the data warehouse is dead’, ‘we don’t need OLAP cubes anymore’, ‘cloud is too expensive and latency is too slow’, ‘you should always use a NoSQL product over an RDBMS.’ I hope you check it out!
Podcast: Myths of Modern Data Management
Cost savings of the cloud
I often hear people say moving to the cloud does not save money, but frequently they don’t take into account the savings for indirect costs that are hard to measure (or the benefits you get that are simply not cost-related). For example, the cloud allows you to get started in building a solution in a matter of minutes while starting a solution on-prem can take weeks or even months. How do you put a monetary figure on that? Or these other benefits that are difficult to put a dollar figure on:
- Unlimited storage
- Grow hardware as demand is needed (unlimited elastic scale) and even pause (and not pay anything)
- Upgrade hardware instantly compared to weeks/months to upgrade on-prem
- Enhanced availability and reliability (i.e. data in Azure automatically has three copies). What does each hour of downtime cost your business?
- Benefit of having separation of compute and storage so don’t need to upgrade one when you only need to upgrade the other
- Pay for only what you need (reduce hardware as demand lessens)
- Not having to guess how much hardware you need and getting too much or too little
- Not having to buy hardware sized solely for peak demand
- Ability to fail fast (cancel a project and not have hardware left over)
- Really helpful for proof-of-concept (POC) or development projects with a known lifespan because you don’t have to re-purpose hardware afterwards
- The value of being able to incorporate more data allowing more insights into your business
- No commitment or long-term vendor lock-in
- Benefit from technology improvements in the latest storage solutions as soon as they are available
- More frequent updates to the OS, SQL Server, etc.
- Automatic software updates
- The cloud vendors have much higher security than anything on-prem. You can imagine the loss of income if a vendor had a security breach, so the investment in keeping things secure is massive
As you can see, there is much more than just running numbers in an Excel spreadsheet to see how much money the cloud will save you. But if you really needed that, Microsoft has a Total Cost of Ownership (TCO) Calculator that will estimate the cost savings you can realize by migrating your application workloads to Microsoft Azure. You simply provide a brief description of your on-premises environment to get an instant report.
The benefits that are easier to put a dollar figure on:
- Don’t need co-location space, so cost savings (space, power, networking, etc)
- No need to manage the hardware infrastructure, reducing staff
- No up-front hardware costs or costs for hardware refresh cycles every 3-5 years
- High availability and disaster recovery done for you
- Automatic geography redundancy
- Having built-in tools (i.e. monitoring) so you don’t need to purchase 3rd-party software
Also, there are some constraints of on-premise data that go away when moving to the cloud:
- Scale constrained to on-premise procurement
- CapEx up-front costs instead of a yearly operating expense (OpEx)
- A staff of employees or consultants administering and supporting the hardware and software in place
- Expertise needed for tuning and deployment
I often tell clients that if you have your own on-premise data center, you are in the air conditioning business. Wouldn’t you rather focus all your efforts on analyzing data? You could also try to “save money” by doing your own accounting, but wouldn’t it make more sense to off-load that to an accounting company? Why not also off-load the costly, up-front investment of hardware, software, and other infrastructure, and the costs of maintaining, updating, and securing an on-premises system?
And when dealing with my favorite topic, data warehousing, a conventional on-premise data warehouse can cost millions of dollars in the following: licensing fees, hardware, and services; the time and expertise required to set up, manage, deploy, and tune the warehouse; and the costs to secure and back up the data. These are all items that a cloud solution eliminates or greatly minimizes.
When estimating hardware costs for a data warehouse, consider the costs of servers, additional storage devices, firewalls, networking switches, data center space to house the hardware, a high-speed network (with redundancy) to access the data, and the power and redundant power supplies needed to keep the system up and running. If your warehouse is mission critical then you need to also add the costs to configure a disaster recovery site, effectively doubling the cost.
When estimating software costs for a data warehouse, organizations frequently pay hundreds of thousands of dollars in software licensing fees for data warehouse software and add-on packages. Also consider that additional end users who are given access to the data warehouse, such as customers and suppliers, can significantly increase those costs. Finally, add the ongoing cost for annual support contracts, which often comprise 20 percent of the original license cost.
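To make that licensing math concrete, here is a minimal sketch (in Python, using entirely hypothetical figures) of how the software line item alone adds up over a typical five-year period once the annual support contract is included:

```python
# Rough five-year on-premises software cost sketch (all figures are hypothetical).
license_cost = 300_000   # one-time data warehouse software license
support_rate = 0.20      # annual support contract, ~20% of the license cost
years = 5

support_cost = license_cost * support_rate * years
total_software_cost = license_cost + support_cost

print(f"License: ${license_cost:,.0f}")
print(f"Support over {years} years: ${support_cost:,.0f}")
print(f"Total software cost: ${total_software_cost:,.0f}")  # $600,000 in this example
```

Even in this simplified example, the support contracts equal the original license cost over a typical refresh cycle, before any hardware, staffing, or data center costs are added.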
Also note that an on-premises data warehouse needs specialized IT personnel to deploy and maintain the system. This creates a potential bottleneck when issues arise and keeps responsibility for the system with the customer, not the vendor.
I’ll point out my two key favorite advantages of having a data warehousing solution in the cloud:
- The complexities and cost of capacity planning and administration, such as sizing, balancing, and tuning the system, are built into the system, automated, and covered by the cost of your subscription
- Being able to dynamically provision storage and compute resources on the fly to meet the demands of your changing workloads in peak and steady usage periods. Capacity is whatever you need whenever you need it
Hopefully this blog post points out that while there can be considerable cost savings in moving to the cloud, there are so many other benefits that cost should not be the only reason to move.
More info:
How To Measure the ROI of Moving To the Cloud
Cloud migration – where are the savings?
Comparing cloud vs on-premise? Six hidden costs people always forget about
The high cost and risk of On-Premise vs. Cloud
Why Move To The Cloud? 10 Benefits Of Cloud Computing
TCO Analysis Demonstrates How Moving To The Cloud Can Save Your Company Money
5 Common Assumptions Comparing Cloud To On-Premises
5 Financial Benefits of Moving to the Cloud
IT Execs Say Cost Savings Make Cloud-Based Analytics ‘Inevitable’
Podcast: Big Data Solutions in the Cloud
In this podcast I talk with Carlos Chacon of SQL Data Partners on big data solutions in the cloud. Here is the description of the chat:
Big Data. Do you have big data? What does that even mean? In this episode I explore some of the concepts of how organizations can manage their data and what questions you might need to ask before you implement the latest and greatest tool. I am joined by James Serra, Microsoft Cloud Architect, to get his thoughts on implementing cloud solutions, where they can contribute, and why you might not be able to go all cloud. I am interested to see if more traditional DBAs move toward architecture roles and help their organizations manage the various types of data. What types of issues are giving you troubles as you adopt a more diverse data ecosystem?
I hope you give it a listen!
Azure SQL Data Warehouse Gen2 announced
On Monday, Microsoft announced the general availability of the Compute Optimized Gen2 tier of Azure SQL Data Warehouse. With this performance-optimized tier, Microsoft is dramatically accelerating query performance and concurrency.
The changes in Azure SQL DW Compute Optimized Gen2 tier are:
- 5x query performance via an adaptive caching technology, which takes a blended approach of using remote storage in combination with a fast SSD cache layer (using NVMe drives) that places data next to compute based on user access patterns and frequency
- Significant improvement in serving concurrent queries (32 to 128 queries/cluster)
- Removes limits on columnar data volume, enabling unlimited columnar storage
- 5 times higher computing power compared to the current generation by leveraging the latest hardware innovations that Azure offers via additional Service Level Objectives (DW7500c, DW10000c, DW15000c and DW30000c)
- Added Transparent Data Encryption with customer-managed keys
The Azure SQL DW Compute Optimized Gen2 tier will initially roll out to 20 regions (you can find the full list of available regions), with subsequent rollouts to all other Azure regions. If you have a Gen1 data warehouse, take advantage of the latest generation of the service by upgrading. If you are getting started, try the Azure SQL DW Compute Optimized Gen2 tier today.
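To give a feel for how the new service level objectives are applied, here is a minimal sketch (Python with pyodbc; the server, database, and credentials are placeholders) that scales a warehouse to one of the Gen2 objectives with a plain ALTER DATABASE statement:

```python
# Sketch: scale an Azure SQL Data Warehouse to a Gen2 service level objective.
# Server, database, and credentials below are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=master;"   # the scale command is issued against the logical server
    "UID=myadmin;PWD=mypassword",
    autocommit=True,     # ALTER DATABASE cannot run inside a transaction
)

# Move the warehouse to the DW7500c objective (any supported SLO can be used).
conn.execute("ALTER DATABASE MyDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW7500c');")
conn.close()
```

The statement returns quickly and the scale operation continues in the background, so you would typically poll the database state before resuming heavy workloads.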
More info:
Turbocharge cloud analytics with Azure SQL Data Warehouse
Blazing fast data warehousing with Azure SQL Data Warehouse
Microsoft Mechanics video
Microsoft Build event announcements
Another Microsoft event and another bunch of exciting announcements. At the Microsoft Build event this week, the major announcements in the data platform space were:
Multi-master at global scale with Azure Cosmos DB. Perform writes on containers of data (for example, collections, graphs, tables) distributed anywhere in the world. You can update data in any region that is associated with your database account. These data updates can propagate asynchronously. In addition to providing fast access and write latency to your data, multi-master also provides a practical solution for failover and load-balancing issues. More info
Azure Cosmos DB Provision throughput at the database level in preview. Azure Cosmos DB customers with multiple collections can now provision throughput at a database level and share throughput across the database, making large collection databases cheaper to start and operate. More info
Virtual network service endpoint for Azure Cosmos DB. Generally available today, virtual network service endpoint (VNET) helps to ensure access to Azure Cosmos DB from the preferred virtual network subnet. The feature will remove the manual change of IP and provide an easier way to manage access to Azure Cosmos DB endpoint. More info
Azure Cognitive Search now in preview. Cognitive Search, a new preview feature in the existing Azure Search service, includes an enrichment pipeline allowing customers to find rich structured information from documents. That information can then become part of the Azure Search index. Cognitive Search also integrates with Natural Language Processing capabilities and includes built-in enrichers called cognitive skills. Built-in skills help to perform a variety of enrichment tasks, such as the extraction of entities from text or image analysis and OCR capabilities. Cognitive Search is also extensible and can connect to your own custom-built skills. More info
Azure SQL Database and Data Warehouse TDE with customer managed keys. Now generally available, Azure SQL Database and Data Warehouse Transparent Data Encryption (TDE) offers Bring Your Own Key (BYOK) support with Azure Key Vault integration. Azure Key Vault provides highly available and scalable secure storage for RSA cryptographic keys backed by FIPS 140-2 Level 2 validated Hardware Security Modules (HSMs). Key Vault streamlines the key management process and enables customers to maintain full control of encryption keys and allows them to manage and audit key access. This is one of the most frequently requested features by enterprise customers looking to protect sensitive data and meet regulatory or security compliance obligations. More info
Azure Database Migration Service is now generally available. This is a service that was designed to be a seamless, end-to-end solution for moving on-premises SQL Server, Oracle, and other relational databases to the cloud. The service will support migrations of homogeneous/heterogeneous source-target pairs, and the guided migration process will be easy to understand and implement. More info
4 new features now available in Azure Stream Analytics: Public preview: Session window; Private preview: C# custom code support for Stream Analytics jobs on IoT Edge, Blob output partitioning by custom attribute, Updated Built-In ML models for Anomaly Detection. More info
Getting value out of data quickly
There are times when you need to create a “quick and dirty” solution to build a report. This blog will show you one way of using a few Azure products to accomplish that. This should not be viewed as a replacement for a data warehouse, but rather as a way to quickly show a customer how to get value out of their data, to produce a one-time report, or to see whether certain data would be useful to move into your data warehouse.
Let’s look at a high-level architecture for building a report quickly using NCR data (restaurant data):
This solution has the restaurant data that is in on-prem SQL Server replicated to Azure SQL Database using transactional replication. Azure Data Factory is then used to copy the point-of-sale transaction logs in Azure SQL Database into Azure Data Lake Store. Then Azure Data Lake Analytics with U-SQL is used to transform/clean the data and store it back into Azure Data Lake Store. That data is then used in Power BI to create the reports and dashboards (business users can build the models in Power BI and the data can be refreshed multiple times during the day via the new incremental refresh). This is all done with Platform-as-a-Service products so there is nothing to set up or install and no VMs – just quickly and easily doing all the work via the Azure portal.
This solution is inexpensive since there is no need for the more expensive services like Azure SQL Data Warehouse or Azure Analysis Services, and Azure Data Lake Analytics is a job service that you only pay for when the query runs (where you specify the account units to use).
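If you prefer to kick off the Azure Data Factory copy step from code rather than the portal, a sketch like the one below works; it uses the azure-mgmt-datafactory package, and the resource group, factory, and pipeline names are hypothetical placeholders for whatever you created for this architecture:

```python
# Sketch: trigger an existing Azure Data Factory pipeline run from Python.
# The resource group, factory, and pipeline names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Start the pipeline that copies point-of-sale logs from Azure SQL Database
# into Azure Data Lake Store.
run = adf_client.pipelines.create_run(
    resource_group_name="rg-restaurant-analytics",
    factory_name="adf-restaurant",
    pipeline_name="CopyPosLogsToDataLake",
)
print(f"Started pipeline run: {run.run_id}")
```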
Some things to keep in mind with a solution like this:
- Power BI has been called “reporting crack” because once a business user is exposed to it they want more. And this solution gives them their first taste
- This solution should have a very limited scope – it’s more like a proof-of-concept and should be a short-term solution
- It takes the approach of ELT instead of ETL in that data is loaded into Azure Data Lake Store and then converted using the power of Azure Data Lake Analytics instead of it being transformed during the move from the source system to the data lake like you usually do when using SSIS
- This limits the data model building to one person using it for themselves or a department, versus having multiple people build models for an enterprise solution using Azure Analysis Services
- This results in quick value but sacrifices an enterprise solution that includes performance, data governance, data history, referential integrity, security, and master data management. Also, you will not be able to use tools that need to work against a relational format
- This solution will normally require a power user to develop reports since it’s working against a data lake instead of an easier-to-use relational model or a tabular model
An even better way to get value out of data quickly is with another product that is in preview called Common Data Service for Analytics. More on this in my next blog.
Understanding Cosmos DB
Cosmos DB is an awesome product that is mainly used for large-scale OLTP solutions. Any web, mobile, gaming, or IoT application that needs to handle massive amounts of data, reads, and writes at a globally distributed scale with near-real-time response times for a variety of data is a great use case (it can be scaled out to support many millions of transactions per second). Because it fits in the NoSQL category and is a scale-out solution, it can be difficult to wrap your head around how it works if you come from the relational world (i.e. SQL Server). So this blog will be about the differences in how Cosmos DB works.
First, a quick comparison of terminology to help you understand the difference:
| RDBMS | Cosmos DB (Document Model) | Cosmos DB (Graph Model) |
| --- | --- | --- |
| Database | Database | Database |
| Table, view | Collection | Graph |
| Row | Document (JSON) | Vertex |
| Column | Property | Property |
| Foreign Key | Reference | Edge |
| Join | Embedded document | .out() |
| Partition Key/Sharding Key | Partition Key | Partition Key |
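To make the document-model column of the table concrete, here is a minimal sketch using the azure-cosmos Python SDK (the account URL, key, and names are placeholders): a “collection” becomes a container, a “row” becomes a JSON document, and the partition key is declared up front.

```python
# Minimal Cosmos DB (SQL/document API) sketch - account details are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<primary-key>")

database = client.create_database_if_not_exists("retail")
container = database.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),  # pick a key with many distinct values
    offer_throughput=400,                            # RUs/second reserved for this container
)

# A "row" is just a JSON document; only "id" and the partition key are required.
container.upsert_item({"id": "1", "customerId": "c-42", "total": 19.99})

# Queries use a SQL-like syntax over the documents.
for item in container.query_items(
    query="SELECT c.id, c.total FROM c WHERE c.customerId = 'c-42'",
    enable_cross_partition_query=True,
):
    print(item)
```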
From Welcome to Azure Cosmos DB and other documentation, here are some key points to understand:
- You can distribute your data to any number of Azure regions, with the click of a button. This enables you to put your data where your users are, ensuring the lowest possible latency to your customers
- When a new region gets added, it is available for operations within 30 minutes anywhere in the world (assuming your data is 100 TBs or less).
- To control the exact sequence of regional failovers in case of an outage, Azure Cosmos DB enables you to associate a priority with the various regions associated with the database account
- Azure Cosmos DB enables you to configure the regions (associated with the database) for “read”, “write” or “read/write” regions.
- For Cosmos DB to offer strong consistency in a globally distributed setup, it needs to synchronously replicate the writes or to synchronously perform cross-region reads. The speed of light and the wide area network reliability dictates that strong consistency will result in higher latencies and reduced availability of database operations. Hence, in order to offer guaranteed low latencies at the 99th percentile and 99.99% availability for all single region accounts and all multi-region accounts with relaxed consistency, and 99.999% availability on all multi-region database accounts, it must employ asynchronous replication. This in-turn requires that it must also offer well-defined, relaxed consistency model(s) – weaker than strong (to offer low latency and availability guarantees) and ideally stronger than “eventual” consistency (with an intuitive programming model)
- Using Azure Cosmos DB’s multi-homing APIs, an app always knows where the nearest region is and sends requests to the nearest data center. All of this is possible with no config changes. You set your write-region and as many read-regions as you want, and the rest is handled for you
- As you add and remove regions to your Azure Cosmos DB database, your application does not need to be redeployed and continues to be highly available thanks to the multi-homing API capability
- It supports multiple data models, including but not limited to document, graph, key-value, table, and column-family data models
- APIs for the following data models are supported with SDKs available in multiple languages: SQL API, MongoDB API, Cassandra API, Gremlin API, Table API
- 99.99% availability SLA for all single-region database accounts, and 99.999% read availability on all multi-region database accounts. Deploy to any number of Azure regions for higher availability and better performance
- For a typical 1KB item, Cosmos DB guarantees end-to-end latency of reads under 10 ms and indexed writes under 15 ms at the 99th percentile within the same Azure region. The median latencies are significantly lower (under 5 ms). So you will want to deploy your app and your database to multiple regions to have users all over the world have the same low latency. If you have an app in one region but the Cosmos DB database in another, then you will have additional latency between the regions (see Azure Latency Test to determine what that latency would be, or go to see existing latency via the Azure Portal and choose Azure Cosmos DB then choose your database then choose Metrics -> Consistency -> SLA -> Replication latency)
- Developers reserve throughput of the service according to the application’s varying load. Behind the scenes, Cosmos DB will scale up resources (memory, processor, partitions, replicas, etc.) to achieve that requested throughput while maintaining the 99th percentile of latency for reads to under 10 ms and for writes to under 15 ms. Throughput is specified in request units (RUs) per second. The number of RUs consumed for a particular operation varies based upon a number of factors, but the fetching of a single 1KB document by id spends roughly 1 RU. Delete, update, and insert operations consume roughly 5 RUs assuming 1 KB documents. Big queries and stored procedure executions can consume 100s or 1000s of RUs based upon the complexity of the operations needed. For each collection (bucket of documents), you specify the RUs
- Throughput directly affects how much the user is charged but can be tuned up dynamically to handle peak load and down to save costs when more lightly loaded by using the Azure Portal, one of the supported SDKs, or the REST API
- Request Units (RU) are used to guarantee throughput in Cosmos DB. You will pay for what you reserve, not what you use. RUs are provisioned by region and can vary by region as a result. But they are not shared between regions. This will require you to understand usage patterns in each region where you have a replica
- For applications that exceed the provisioned request unit rate for a container, requests to that collection are throttled until the rate drops below the reserved level. When a throttle occurs, the server preemptively ends the request with RequestRateTooLargeException (HTTP status code 429) and returns the x-ms-retry-after-ms header indicating the amount of time, in milliseconds, that the user must wait before reattempting the request. So, you will get 10ms reads as long as requests stay under the set RUs (see the retry sketch after this list for one way to handle throttled requests)
- Cosmos DB provides five consistency levels: strong, bounded-staleness, session, consistent prefix, and eventual. The further to the left in this list, the greater the consistency but the higher the RU cost which essentially lowers available throughput for the same RU setting. Session level consistency is the default. Even when set to lower consistency level, any arbitrary set of operations can be executed in an ACID-compliant transaction by performing those operations from within a stored procedure. You can also change the consistency level for each request using the x-ms-consistency-level request header or the equivalent option in your SDK
- Azure Cosmos DB accounts that are configured to use strong consistency cannot associate more than one Azure region with their Azure Cosmos DB account
- There is no support for GROUP BY or other aggregation functionality found in database systems (a workaround is to use the Spark to Cosmos DB connector)
- No database schema/index management – it automatically indexes all the data it ingests without requiring any schema or indexes and serves blazing fast queries. By default, every field in each document is automatically indexed generally providing good performance without tuning to specific query patterns. These defaults can be modified by setting an indexing policy which can vary per field.
- Industry-leading, financially backed, comprehensive service level agreements (SLAs) for availability, latency, throughput, and consistency for your mission-critical data
- There is a local emulator running under MS Windows for developer desktop use (was added in the fall of 2016)
- Capacity options for a collection: Fixed (max of 10GB and 400 – 10,000 RU/s), Unlimited (1,000 – 100,000 RU/s). You can contact support if you need more than 100,000 RU/s. There is no limit to the total amount of data or throughput that a container can store in Azure Cosmos DB
- Costs: SSD Storage (per GB): $0.25 GB/month; Reserved RUs/second (per 100 RUs, 400 RUs minimum): $0.008/hour (for all regions except Japan and Brazil which are more)
- Global distribution (also known as global replication/geo-redundancy/geo-replication) is for delivering low-latency access to data to end users no matter where they are located around the globe and for adding regional resiliency for business continuity and disaster recovery (BCDR). When you choose to make containers span across geographic regions, you are billed for the throughput and storage for each container in every region and the data transfer between regions
- Cosmos DB implements optimistic concurrency so there are no locks or blocks but instead, if two transactions collide on the same data, one of them will fail and will be asked to retry
- Because there is currently no concept of a constraint, foreign-key or otherwise, any inter-document relationships that you have in documents are effectively “weak links” and will not be verified by the database itself. If you want to ensure that the data a document is referring to actually exists, then you need to do this in your application, or through the use of server-side triggers or stored procedures on Azure Cosmos DB.
- You can set up a policy to geo-fence a database to specific regions. This geo-fencing capability is especially useful when dealing with data sovereignty compliance that requires data to never leave a specific geographical boundary
- Backups are taken every four hours and two are kept at all times. Also, in the event of database deletion, the backups will be kept for thirty days before being discarded. With these rules in place, the client knows that in the event of some unintended data modification, they have an eight-hour window to get support involved and start the restore process
- Cosmos DB is an Azure data storage solution which means that the data at rest is encrypted by default and data is encrypted in transit. If you need Role-Based Access Control (RBAC), Azure Active Directory (AAD) is supported in Cosmos DB
- Within Cosmos DB, partitions are used to distribute your data for optimal read and write operations. It is recommended to create a granular key with highly distinct values. The partitions are managed for you. Cosmos DB will split or merge partitions to keep the data properly distributed. Keep in mind your key needs to support distributed writes and distributed reads
- Until recently, writes could only be made to one region. But now in private preview is writes to multi regions. See Multi-master at global scale with Azure Cosmos DB. With Azure Cosmos DB multi-master support, you can perform writes on containers of data (for example, collections, graphs, tables) distributed anywhere in the world. You can update data in any region that is associated with your database account. These data updates can propagate asynchronously. In addition to providing fast access and write latency to your data, multi-master also provides a practical solution for failover and load-balancing issues.
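As mentioned in the throttling bullet above, requests that exceed the provisioned RUs are rejected with HTTP status 429 and a retry-after header. The azure-cosmos SDK retries these automatically, but an explicit retry loop might look like the sketch below (the exception and header handling reflect my understanding of the SDK and should be verified against the version you use):

```python
# Sketch: handle RequestRateTooLarge (HTTP 429) responses with a simple retry loop.
# The SDK already retries 429s internally; this just makes the pattern explicit.
import time
from azure.cosmos import exceptions

def read_with_retry(container, item_id, partition_key, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return container.read_item(item=item_id, partition_key=partition_key)
        except exceptions.CosmosHttpResponseError as err:
            if err.status_code != 429 or attempt == max_attempts - 1:
                raise
            # x-ms-retry-after-ms indicates how long to back off before retrying.
            retry_ms = float(err.response.headers.get("x-ms-retry-after-ms", 1000))
            time.sleep(retry_ms / 1000.0)
```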
Azure Cosmos DB allows you to scale throughput (as well as storage) elastically across any number of regions depending on your needs or demand.
The above picture shows a single Azure Cosmos DB container horizontally partitioned (across three resource partitions within a region) and then globally distributed across three Azure regions.
An Azure Cosmos DB container gets distributed in two dimensions (i) within a region and (ii) across regions. Here’s how (see Partition and scale in Azure Cosmos DB for more info):
- Local distribution: Within a single region, an Azure Cosmos DB container is horizontally scaled out in terms of resource partitions. Each resource partition manages a set of keys and is strongly consistent and highly available, being physically represented by four replicas (also called a replica set) with state machine replication among those replicas. Azure Cosmos DB is a fully resource-governed system, where a resource partition is responsible for delivering its share of throughput for the budget of system resources allocated to it. The scaling of an Azure Cosmos DB container is transparent to the users. Azure Cosmos DB manages the resource partitions and splits and merges them as needed as storage and throughput requirements change
- Global distribution: If it is a multi-region database, each of the resource partitions is then distributed across those regions. Resource partitions owning the same set of keys across various regions form a partition set (see preceding figure). Resource partitions within a partition set are coordinated using state machine replication across multiple regions associated with the database. Depending on the consistency level configured, the resource partitions within a partition set are configured dynamically using different topologies (for example, star, daisy-chain, tree etc.)
The following links can help with understanding the core concepts better: Request units in Azure Cosmos DB, Performance tips for Azure Cosmos DB and .NET, Tuning query performance with Azure Cosmos DB, Partitioning in Azure Cosmos DB using the SQL API, Leverage Azure CosmosDB metrics to find issues.
You can Try Azure Cosmos DB for Free without an Azure subscription, free of charge and commitments. For a good training course on Cosmos DB check out Developing Planet-Scale Applications in Azure Cosmos DB and Learning Azure Cosmos DB.
More info:
Relational databases vs Non-relational databases
A technical overview of Azure Cosmos DB
Analytics Platform System (APS) AU7 released
The Analytics Platform System (APS), which is a renaming of the Parallel Data Warehouse (PDW), has just released an appliance update (AU7), which is sort of like a service pack, except that it includes many new features.
Below is what is new in this release:
Customers will get significantly improved query performance and enhanced security features with this release. APS AU7 builds on appliance update 6 (APS 2016) release as a foundation. Upgrading to APS appliance update 6 is a prerequisite to upgrade to appliance update 7.
Faster performance
APS AU7 now provides the ability to automatically create statistics and update existing outdated statistics for improved query optimization. APS AU7 also adds support for setting multiple variables from a single SELECT statement, reducing the number of redundant round trips to the server and improving overall query and ETL performance. Other T-SQL features include HASH and ORDER GROUP query hints to provide more control over improving query execution plans.
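The multiple-variable assignment feature is plain T-SQL. The sketch below (executed here through pyodbc; the DSN, credentials, and the dbo.FactSales table are hypothetical) shows the single round trip that previously required a separate SELECT per variable:

```python
# Sketch: assign several variables from one SELECT - a single round trip to APS.
# The DSN, credentials, and table name are placeholders.
import pyodbc

conn = pyodbc.connect("DSN=aps_appliance;UID=loader;PWD=secret")

batch = """
SET NOCOUNT ON;
DECLARE @min_date date, @max_date date, @row_count bigint;

-- One statement populates all three variables instead of three separate SELECTs.
SELECT @min_date  = MIN(OrderDate),
       @max_date  = MAX(OrderDate),
       @row_count = COUNT_BIG(*)
FROM dbo.FactSales;

SELECT @min_date AS min_date, @max_date AS max_date, @row_count AS row_count;
"""

row = conn.cursor().execute(batch).fetchone()
print(row.min_date, row.max_date, row.row_count)
```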
Better security
APS AU7 also includes the latest firmware and drivers from our hardware partners, along with the hardware and software patches to address the Spectre/Meltdown vulnerabilities.
Management enhancements
Customers already on APS2016 will experience an enhanced upgrade process to APS AU7 allowing a shorter maintenance window with the ability to uninstall and rollback to a previous version. AU7 also introduces a section called Feature Switch in configuration manager giving customers the ability to customize the behavior of new features.
More info:
Microsoft releases the latest update of Analytics Platform System
Azure Data Lake Store Gen2
Big news! The next generation of Azure Data Lake Store (ADLS) has arrived. See the official announcement.
In short, ADLS Gen2 is the combination of the current ADLS (now called Gen1) and Blob storage. Gen2 is built on Blob storage. By GA, ADLS Gen2 will have all the features of both, which means it will have features such as limitless storage capacity, support all Blob tiers (Hot, Cool, and Archive), the new lifecycle management feature, Azure Active Directory integration, hierarchical file system, and read-access geo-redundant storage.
A Gen2 capability is what is called “multi-modal” which means customers can use either Blob object store APIs or the new Gen2 file system APIs. The key here is that both blob and file system semantics are now supported over the same data.
For existing customers of Gen1, once Gen2 is GA, no new features will be added to Gen1. Customers can stay on Gen1 if they don’t need any new capabilities or can move to Gen2 where they can leverage all the goodness of the combined capabilities. They can upgrade when they choose to do so.
Existing customers of Blob storage can continue to use Blob storage to save a bit of money (storage costs will be the same between Blob and Gen2 but transaction costs will be a bit higher for Gen2 due to the overhead of namespaces). By GA, existing Blob storage accounts will just need to “enable Gen2” to get all the features of Gen2. Before GA, they will need to copy their data from Blob storage to Gen2.
New customers should go with Gen2 unless the simplicity of an object store is all that is needed – for example, storing images, storing backup data, website hosting, etc where the apps really don’t benefit from a file system namespace and the customer wants to save a bit of money on transaction costs.
Note that Blob storage and ADLS Gen1 will continue to exist and that Gen2 pricing will be roughly half of Gen1.
It was announced yesterday (June 27th) and will be available as a limited public preview (customers will have to sign up).
Because ADLS Gen2 is part of blob storage, it is a “ring 0” service and will at GA be available in all regions. The limited public preview program kicks off with two regions in the US with new regions added throughout the preview window.
For those using the current Blob SDKs: initially the SDKs are different and some code changes will be required. Microsoft is looking at whether they can reduce the need for code changes. For customers using the WASB or ADLS driver, it will be as simple as switching to the new Gen2 driver and changing configs.
Check out the Azure Data Lake Storage Gen2 overview video for more info as well as A closer look at Azure Data Lake Storage Gen2 and finally check out the Gen2 documentation.
Monitoring Azure SQL Database
There are a number of options to monitor Azure SQL Database. In this post I will briefly cover the built-in options and not 3rd-party products that I blogged about a while back (see Azure SQL Database monitoring).
Monitoring alerts you to problems. It also helps you determine whether your database has excess capacity or is having trouble because resources are maxed out, so you can then decide whether it’s time to adjust the performance level and service tier of your database. You can monitor your database using:
- Graphical tools in the Azure portal (click “Resource” on the Overview blade): monitor a single database’s metrics of CPU percentage, DTU percentage, Data IO percentage, Database size percentage and more. You can configure alerts if metrics exceed or fall below a certain threshold over a time period – click “Alerts (Classic)” under “Monitoring”.
- Use SQL dynamic management views (DMV): The two main ones are sys.resource_stats in the logical master database of your server, and sys.dm_db_resource_stats in the user database. You can use the sys.dm_db_resource_stats view in every SQL database. The sys.dm_db_resource_stats view shows recent resource use data relative to the service tier. Average percentages for CPU, data IO, log writes, and memory are recorded every 15 seconds and are maintained for 1 hour. Because this view provides a more granular look at resource use, use sys.dm_db_resource_stats first for any current-state analysis or troubleshooting. The sys.resource_stats view in the master database has additional information that can help you monitor the performance of your SQL database at its specific service tier and performance level. The data is collected every 5 minutes and is maintained for approximately 14 days. This view is useful for a longer-term historical analysis of how your SQL database uses resources. See Monitoring Azure SQL Database using dynamic management views for other DMVs you might want to use (a sample query is shown after this list)
- Monitor resource usage using SQL Database Query Performance Insight (requires Query Store). Review top CPU consuming queries and view individual query details
- Azure SQL Intelligent Insights is proactive monitoring that uses built-in intelligence to continuously monitor database usage through artificial intelligence and detect disruptive events that cause poor performance. Once detected, a detailed analysis is performed that generates a diagnostics log (usually to Azure Log Analytics) with an intelligent assessment of the issue. This assessment consists of a root cause analysis of the database performance issue and, where possible, recommendations for performance improvements. Intelligent Insights analyzes SQL Database performance by comparing the database workload from the last hour with the past seven-day baseline workload. It also monitors absolute operational thresholds and detects issues with excessive wait times, critical exceptions, and issues with query parameterizations that might affect performance. The system automatically considers changes to the workload and changes in the number of query requests made to the database to dynamically determine normal and out-of-the-ordinary database performance thresholds. Integration of Intelligent Insights with Azure Log Analytics is performed through first enabling Intelligent Insights logging (selecting “SQLInsights” under LOG) and then configuring Intelligent Insights log data to be streamed into Azure Log Analytics, which is a feature of the Operations Management Suite (OMS)
- Azure SQL Analytics: provides reporting and alerting capabilities on top of the Intelligent Insights and other diagnostics log data as well as metric data
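For the DMV option above, a minimal sketch of pulling the last hour of resource statistics might look like this (connection details are placeholders; the columns are the standard ones in sys.dm_db_resource_stats):

```python
# Sketch: check recent resource usage of an Azure SQL Database via DMVs.
# Connection details are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydb;UID=myuser;PWD=mypassword"
)

query = """
SELECT TOP 20
       end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent,
       avg_memory_usage_percent
FROM sys.dm_db_resource_stats   -- 15-second samples, kept for about an hour
ORDER BY end_time DESC;
"""

for row in conn.cursor().execute(query):
    print(row.end_time, row.avg_cpu_percent, row.avg_data_io_percent)
```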
Other ways of monitoring SQL Database:
- Overall Azure service health: See Track service health
- Azure Resource Health: Helps you diagnose and get support when an Azure service problem affects your resources. It informs you about the current and past health of your resources. And it provides technical support to help you mitigate problems.
- SQL Database Auditing: See Get started with SQL database auditing and Monitor your Azure SQL Database Auditing activity with Power BI
- Unified Alerts: A new unified alert experience that allows you to manage alerts from multiple subscriptions and introduces alert states and smart groups. Define your alert criteria by choosing a signal (i.e. create a database) and defining your alert condition, alert details, and action group (i.e. send a text)
- Emit metrics and diagnostics logs: Azure SQL Database can emit metrics and diagnostics logs for easier monitoring. You can configure SQL Database to store resource usage, workers and sessions, and connectivity into Azure Storage, Azure Event Hubs, or Azure Log Analytics
- SQL Database Threat Detection: See Get started with SQL Database Threat Detection
- Extended Events: See Extended Events for Azure SQL Database
- System Center: See Microsoft System Center Management Pack for Microsoft Azure SQL Database
More info:
Monitoring database performance in Azure SQL Database
The need for having both a DW and cubes
I have heard some people say if you have a data warehouse, there is no need for cubes (when I say “cubes” I am referring to tabular and multidimensional OLAP models). And I have heard others say if you have OLAP cubes, you don’t need a data warehouse. I strongly disagree with both these statements, as almost all the customers I see that are building a modern data warehouse use both in their solutions. Here are some reasons for both:
Why have a data warehouse if you can just use a cube?
- Breaking down complex steps so easier to build cube
- Cube is departmental view (cube builder not thinking enterprise solution)
- Easier to clean/join/master data in DW
- Processing cube is slow against sources
- One place to control data for consistency and have one version of the truth
- Use by tools that need relational format
- Cube does not have all data
- Cube may be behind in data updates (needs processing)
- DW is place to integrate data
- Risk of having multiple cubes doing same thing
- DW keeps historical records
- Easier to create data marts from DW
Reasons to report off cubes instead of the data warehouse (a summary from my prior blog post of Why use a SSAS cube?):
- Semantic layer
- Handle many concurrent users
- Aggregating data for performance
- Multidimensional analysis
- No joins or relationships
- Hierarchies, KPI’s
- Row-level Security
- Advanced time-calculations
- Slowly Changing Dimensions (SCD)
- Required for some reporting tools
The typical architecture I see looks like this:
Power BI new feature: Composite models
There are two really great features just added to Power BI that I wanted to blog about: Composite models and Dual storage mode. This is part of the July release for Power BI Desktop and it is in preview (see Power BI Desktop July 2018 Feature Summary). I’ll also talk about a future release called Aggregations.
First a review of the two ways to connect to a data source:
Import – The selected tables and columns are imported into Power BI Desktop. As you create or interact with a visualization, Power BI Desktop uses the imported data. You must refresh the data, which imports the full data set again (or use the preview feature incremental refresh), to see any changes that occurred to the underlying data since the initial import or the most recent refresh. Import datasets in the Power BI service have a 10GB dataset limitation for the Premium version and a 1GB limitation for the free version (although with compression you can import much larger data sets). See Data sources in Power BI Desktop
DirectQuery – No data is imported or copied into Power BI Desktop. As you create or interact with a visualization, Power BI Desktop queries the underlying data source, which means you’re always viewing current data. DirectQuery lets you build visualizations over very large datasets, where it otherwise would be unfeasible to first import all of the data with pre-aggregation. See Data sources supported by DirectQuery.
Up until now in Power BI, when you connect to a data source using DirectQuery, it is not possible to connect to any other data source in the same report (all tables must come from a single database), nor to include data that has been imported. The new composite model feature removes this restriction, allowing a single report to seamlessly combine data from one or more DirectQuery sources, and/or combine data from a mix of DirectQuery sources and imported data. So this means you can combine multiple DirectQuery sources with multiple Import sources. If your report has some DirectQuery tables and some import tables, the status bar on the bottom right of your report will show a storage mode of ‘Mixed.’ Clicking on this allows all tables to be switched to import mode easily.
For example, with composite models it’s possible to build a model that combines sales data from an enterprise data warehouse using DirectQuery, with data on sales targets that is in a departmental SQL Server database using DirectQuery, along with some data imported from a spreadsheet. A model that combines data from more than one DirectQuery source, or combines DirectQuery with imported data is referred to as a composite model.
Also, composite models include a new feature called dual storage mode. If you are using DirectQuery currently, all visuals will result in queries being sent to the backend source, even for simple visuals such as a slicer showing all the Product Categories. The ability to define a table as having a storage mode of “Dual” means that a copy of the data for that table will also be imported, and any visuals that reference only columns from this table will use the imported data, and not require a query to the underlying source. The benefits of this are improved performance, and lessened load on the backend source. But if there are large tables being queried using DirectQuery, the dual table will operate as a DirectQuery table so no table data would need to be imported to be joined with an imported table.
Another feature due out in the next 90 days is “Aggregations,” which allows you to create aggregation tables. This new feature along with composite models and dual storage mode allows you to create a solution that uses huge datasets. For example, say I have two related tables: One is at the detail grain called Sales, and another is the aggregated totals of Sales called Sales_Agg. Sales is set to DirectQuery storage mode and Sales_Agg is set to Import storage mode. If a user sends a query with a SELECT statement that has a GROUP BY that can be filled by the Sales_Agg table, the data will be pulled from cache in milliseconds since that table was imported (for example, 1.6 billion aggregated rows imported from SQL DW compressed to 10GB in memory). If a user sends a query with a GROUP BY for a field that is not in the Sales_Agg table, it will do a DirectQuery to the Sales table (for example, sending a Spark query to a 23-node HDI Spark cluster of 1 trillion detail rows of 250TB, taking about 40 seconds). The user is not aware there is a Sales_Agg table (all aggregation tables are hidden) – they simply send a query to Sales and Power BI automatically redirects the query to the best table to use. And if using a Date table, it can be set to Dual mode so it joins with Sales_Agg in memory in the first part of the example, or joins with Sales on the data source using DirectQuery in the second part of the example (so it does not have to pull the 1 trillion detail rows into Power BI in order to join with the imported Date table).
You will need to right-click the Sales_Agg table and choose “Manage aggregations” to map the aggregated Sales_Agg table columns to the detail Sales table columns. There is also a “Precedence” field that allows you to have multiple aggregation tables on the same fact table at different grains:
You can also create a report with a drillthrough feature where users can right-click on a data point in a report page that was built with an aggregation table and drill through to a focused page, built using DirectQuery, that shows details filtered to that context.
So in summary, there are three values for storage mode at the table level:
- Import – When set to Import, imported tables are cached. Queries submitted to the Power BI dataset that return data from Import tables can only be fulfilled from cached data
- DirectQuery – With this setting, DirectQuery tables are not cached. Queries submitted to the Power BI dataset (for example, DAX queries) that return data from DirectQuery tables can only be fulfilled by executing on-demand queries to the data source. Queries submitted to the data source use the query language for that data source (for example, SQL)
- Dual – Dual tables can act as either cached or not cached, depending on the context of the query submitted to the Power BI dataset. In some cases, queries are fulfilled from cached data; in other cases, queries are fulfilled by executing an on-demand query to the data source
Note that changing a table to Import is an irreversible operation; it cannot be changed back to DirectQuery, or back to Dual. Also note there are two limitations during the preview period: DirectQuery only supports the tabular model (not multi-dimensional model) and you can’t publish files to the Power BI service.
More info:
Power BI Monthly Digest – July 2018
Composite models in Power BI Desktop (Preview)
Storage mode in Power BI Desktop (Preview)
Power BI Composite Models: The Good, The Bad, The Ugly
Composite Model; DirectQuery and Import Data Combined; Evolution Begins in Power BI
Power BI: Dataflows
Dataflows, previously called Common Data Service for Analytics as well as Datapools, will be in preview soon and I wanted to explain in this blog what it is and how it can help you get value out of your data quickly (it’s a follow-up to my blog Getting value out of data quickly).
In short, Dataflows integrates data lake and ETL technology directly into Power BI, so anyone with Power Query skills (yes – Power Query is now part of Power BI service and not just Power BI Desktop and is called Power Query online) can create, customize and manage data within their Power BI experience (think of it as self-service data prep). Dataflows include a standard schema, called the Common Data Model (CDM), that contains the most common business entities across the major functions such as marketing, sales, service, finance, along with connectors that ingest data from the most common sources into these schemas. This greatly simplifies modeling and integration challenges (it prevents multiple metadata definitions for the same data). You can also extend the CDM by creating custom entities. Lastly – Microsoft and their partners will be shipping out-of-the-box applications that run on Power BI that populate data in the Common Data Model and deliver insights through Power BI.
A dataflow is not just the data itself, but also logic on how the data is manipulated. Dataflows belong to the Data Warehouse/Mart/Lake family. Its main job is to aggregate, cleanse, transform, integrate and harmonize data from a large and growing set of supported on-premises and cloud-based data sources including Dynamics 365, Salesforce, Azure SQL Database, Excel, SharePoint. Dataflows hold a collection of data-lake stored entities (i.e. tables) which are stored in internal Power BI Common Data Model compliant folders in Azure Data Lake Storage Gen2.
This adds two new layers to Power BI (Dataflows and Storage):
But you can instead use your own Azure Data Lake Store Gen2, allowing other Azure services to reuse the data (i.e. Azure Databricks can be used to manipulate the data).
You can also setup incremental refresh for any entity, link to entities from other dataflows, and can pull data down from the dataflows into Power BI desktop.
To use dataflows, in the Power BI Service, under a Workspace: Create – Dataflow – Add entities: This starts online Power Query and you then choose a connector from one of the many data sources (just like you do with Power Query in Power BI Desktop). Then choose a table to import and the screen will look like this:
To create a dashboard from these entities, in Power BI Desktop you simply choose Get Data -> Power BI dataflows.
The bottom line is Power BI users can now easily create a dataflow to prepare data in a centralized storage, using a standardized schema, ready for easy consumption, reuse, and generation of business insights.
Dataflows are a great way to have a power user get value out of data without involving IT. But while this adds enterprise tools to Power BI, it does not mean you are creating an enterprise solution. You still may need to create a data warehouse and cubes: See The need for having both a DW and cubes.
More info:
Self-service data prep with dataflows
Microsoft Common Data Services
Video Introduction to Common Data Service For Analytics
Video Common Data Service for Analytics (CDS-A) and Power BI – an Introduction
Power BI expands self-service prep for big data, unifies modern and enterprise BI
Video Introducing: Advanced data prep with dataflows—for unified data and powerful insights
Azure SQL Database high availability
In this blog I want to talk about how Azure SQL Database achieves high availability. One of the major benefits from moving from on-prem SQL Server to Azure SQL Database is how much easier it is to have high availability – no need for creating and managing a SQL Server failover cluster, AlwaysOn availability groups, database mirroring, log shipping, SAN replication, etc.
Azure SQL Database is a highly available database Platform as a Service that guarantees that your database is up and running 99.99% of time, without worrying about maintenance and downtimes. This is a fully managed SQL Server Database Engine process hosted in the Azure cloud that ensures that your SQL Server database is always upgraded/patched without affecting your workload. Azure automatically handles patching, backups, replication, failure detection, underlying hardware, software or network failures, deploying bug fixes, failovers, database upgrades, and other maintenance tasks. Azure SQL Database can quickly recover even in the most critical circumstances ensuring that your data is always available.
Azure SQL Database is based on the SQL Server Database Engine architecture that is adjusted for the cloud environment in order to ensure 99.99% availability even in the cases of infrastructure failures. There are two high-availability architectural models that are used in Azure SQL Database (both of them ensuring 99.99% availability):
(NOTE: Basic/Standard/Premium are service tiers that are DTU-based and used only for SQL Database Single, and General Purpose/Business Critical are vCore-based and used for both SQL Database Single and SQL Database Managed Instance)
- Basic/Standard/General Purpose model that is based on remote storage. This architectural model relies on high availability and reliability of the storage tier, but it might have some potential performance degradation during maintenance activities. This model uses Azure Premium Storage Disks
- Premium/Business Critical model that is based on a cluster of database engine processes. This architectural model relies on a fact that there is always a quorum of available database engine nodes and has minimal performance impact on your workload even during maintenance activities. This model uses AlwaysOn Availability Groups and local attached SSD storage. Provides higher IOPS and throughput than Basic/Standard/General Purpose
Azure SQL Database runs on the latest stable version of SQL Server Database Engine and Windows OS, and most of the users would not notice that the upgrades are performed continuously.
More details on these two options:
Basic/Standard/General Purpose
High availability in these service tiers is achieved by separation of compute and storage layers and the replication of data in the storage tier (which uses Azure Premium Storage):
There are two layers:
- Active compute nodes: A stateless compute layer that is running the sqlserver.exe process and contains only transient and cached data (for example – plan cache, buffer pool, column store pool). This stateless SQL Server node is operated by Azure Service Fabric, which initializes the process, controls the health of the node, and performs failover to another node if necessary
- Azure Storage accounts: A stateful data layer with database files (.mdf/.ldf) that are stored in Azure Premium Storage Disks, which is remote storage (i.e. it is accessed over the network, using Azure network infrastructure). It is able to use Azure Premium Storage by taking advantage of SQL Server native capability to use database files directly in Azure Blob Storage. This means that there is not a disk or a network share that hosts database files; instead, the file path is an HTTPS URL, and each database file is a page blob in Azure Blob Storage. Azure Storage guarantees that there will be no data loss of any record that is placed in any database file (since three copies of the data are made via LRS). Azure Storage has built-in data availability/redundancy that ensures that every record in the log file or page in the data file will be preserved even if the SQL Server process crashes. Note the tempdb database is not using Azure Premium Storage but rather is located on local SSD storage, which provides very low latency and high IOPS/throughput
Whenever the database engine or operating system is upgraded, or some part of underlying infrastructure fails, or if some critical issue is detected in the SQL Server process, Azure Service Fabric will move the stateless SQL Server process to another stateless compute node. There is a set of redundant (“spare”) nodes that is waiting to run new compute service in case of failover in order to minimize failover time. Data in the Azure storage layer is not affected, and data/log files are attached to the newly initialized SQL Server process. Failover time can be measured in seconds. This process guarantees 99.99% availability, but it might have some performance impacts on a heavy workload that is running due to transition time and the fact the new SQL Server node starts with cold cache.
See Storage performance best practices and considerations for Azure SQL DB Managed Instance (General Purpose) for info on performance improvement.
Premium/Business Critical
High availability in these service tiers is designed for intensive workloads that cannot tolerate any performance impact due to the ongoing maintenance operations.
In the premium model, Azure SQL Database integrates compute and storage on a single node. High availability in this architectural model is achieved by replication of compute (SQL Server Database Engine process) and storage (locally attached SSD) deployed in an Always On Availability Groups cluster with enough replicas to achieve quorum and provide HA guarantees (currently 4 nodes as shown below):
Both the SQL Server Database Engine process and the underlying .mdf/.ldf files are placed on the same node, with locally attached SSD storage providing low latency to your workload. High availability is implemented using Always On Availability Groups. Every database is a cluster of database nodes with one primary database that is accessible for the customer workload, and three secondary processes containing copies of the data. The primary node constantly pushes changes to the secondary nodes in order to ensure that the data is available on the secondary replicas if the primary node crashes for any reason. Failover is handled by Azure Service Fabric – one secondary replica becomes the primary node, a new secondary replica is created to ensure there are enough nodes in the cluster, and the workload is automatically redirected to the new primary node. Failover time is measured in milliseconds for most workloads, and the new primary instance is immediately ready to continue serving requests.
A note on the difference in the handling of a failover compared to on-prem: The database engine cannot control the failover because it may not be running when a failover has to occur, i.e. it may have just crashed. Failover has to be initiated by a component external to the database engine. For traditional SQL Server, this component is Windows Failover Clustering. For SQL DB and MI, this is Service Fabric.
IO Performance difference:
vCore model (from here):
| | General Purpose | Business Critical |
|---|---|---|
| IO throughput (approximate) | Singleton Database: 500 IOPS per vCore with 7000 maximum IOPS; Managed Instance: 500-7500 IOPS per data file (depends on size of file) | 5000 IOPS per core with 200000 maximum IOPS |
DTU model (from here):
| | Basic | Standard | Premium |
|---|---|---|---|
| IO throughput (approximate) | 2.5 IOPS per DTU | 2.5 IOPS per DTU | 48 IOPS per DTU |
| IO latency (approximate) | 5 ms (read), 10 ms (write) | 5 ms (read), 10 ms (write) | 2 ms (read/write) |
Note Managed Instance only supports the vCore model.
A word about storage: Azure SQL Database Singleton and Managed Instance both use Azure Storage page blobs as the underlying persistent storage for their databases. Azure premium managed disks are just premium page blobs with some API make-up so they look like disks. Azure Storage page blobs do not have officially published performance numbers per size, while Azure premium managed disks do, which is why tables like this show performance numbers for disks and not page blobs.
A word about failover times: In the Business Critical case, there is a secondary replica that is an exact read-only copy of the primary instance, so failover is just a switch to a new IP address and is almost instant. In more realistic cases, there is always some lag on the secondary replica because it is constantly redoing transaction log records that are sent from the primary node. Failover time is equal to the time needed to apply all remaining transaction log records to become consistent with the primary node, and then the switch to the new IP is completed. Under a heavy workload that saturates both primary and secondary replicas, there is a chance the secondary cannot immediately catch up to the primary, so the log redo time might be even longer. The exact time depends on the workload, and there are no official numbers or formula to calculate this. In the General Purpose case, there is a stateless compute node ready to run sqlservr.exe that attaches the .mdf/.ldf files from remote storage. This is a cold-cache process that must be initialized, so failover time is longer than in Business Critical. Failover time depends on the database size and can also vary.
Finally, if you are interested in how Microsoft manages data integrity for Azure SQL Database, check out Data Integrity in Azure SQL Database.
More info:
High-availability and Azure SQL Database
Overview of business continuity with Azure SQL Database
Overview: Active geo-replication and auto-failover groups
Reaching Azure disk storage limit on General Purpose Azure SQL Database Managed Instance
File layout in General Purpose Azure SQL Managed Instance
Azure SQL Database disaster recovery
My last blog post was on Azure SQL Database high availability and I would like to continue along that discussion with a blog post about disaster recovery in Azure SQL Database. First, a clarification on the difference between high availability and disaster recovery:
High Availability (HA) – Keeping your database up 100% of the time with no data loss during common problems. Redundancy at system level, focus on failover, addresses single predictable failure, focus is on technology. SQL Server IaaS would handle this with:
- Always On Failover cluster instances
- Always On Availability Groups (in same Azure region)
- SQL Server data files in Azure
Disaster Recovery (DR) – Protection if a major disaster or unusual failure wipes out your database. Use of an alternate site, focus on re-establishing services, addresses multiple failures, includes the people and processes to execute recovery. Usually includes HA also. SQL Server IaaS would handle this with:
- Log Shipping
- Database Mirroring
- Always On Availability Groups (different Azure regions)
- Backup to Azure
Azure SQL Database makes setting up disaster recovery so much easier than SQL Server IaaS (in a VM). Disaster recovery is done via active geo-replication, which is an Azure SQL Database feature that allows you to create readable replicas of your database in the same or different data center (region). All it takes is navigating to this page and choosing the region to create a secondary database (this example of active geo-replication is configured with a primary in the North Central US region and secondary in the South Central US region):
Once created, the secondary database is populated with the data copied from the primary database. This process is known as seeding. After the secondary database has been created and seeded, updates to the primary database are asynchronously replicated to the secondary database automatically. Asynchronous replication means that transactions are committed on the primary database before they are replicated to the secondary database.
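If you would rather script this than click through the portal, geo-replication can also be configured with T-SQL. A hedged sketch (server, database, and credential names are placeholders; the ALTER DATABASE ... ADD SECONDARY statement is run against the master database on the primary logical server, and the secondary logical server must already exist):

```python
# Sketch: create a readable geo-replicated secondary with T-SQL.
# Server/database/credential names are placeholders.
import pyodbc

PRIMARY_MASTER = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:primary-server.database.windows.net,1433;"
    "Database=master;Uid=admin_user;Pwd=admin_password;Encrypt=yes;"
)

with pyodbc.connect(PRIMARY_MASTER, autocommit=True) as conn:
    conn.cursor().execute(
        "ALTER DATABASE [mydb] "
        "ADD SECONDARY ON SERVER [secondary-server] "
        "WITH (ALLOW_CONNECTIONS = ALL);"  # make the secondary readable
    )
```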
Compare this to setting up AlwaysOn Availability Groups! And then think about the time it takes to monitor and maintain AlwaysOn Availability Groups, work that you won’t have to worry about anymore, and you can see why Azure SQL Database is such a pleasure.
Active geo-replication is designed as a business continuity solution that allows an application to perform quick disaster recovery in case of a data center scale outage. If geo-replication is enabled, the application can initiate a failover to a secondary database in a different Azure region. Up to four secondaries are supported in the same or different regions, and the secondaries can also be used for read-only access queries. The failover can be initiated manually by the application or the user. After failover, the new primary has a different connection end point. As each secondary is a discrete database with the same name as the primary but on a different server, you will need to reconfigure your application(s) with an updated connection string.
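To show what that manual failover looks like (placeholder names again): the documented pattern is to connect to the master database on the secondary server and ask it to become the primary.

```python
# Sketch: planned failover of a geo-replicated database.
# Run against master on the *secondary* logical server; names are placeholders.
import pyodbc

SECONDARY_MASTER = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:secondary-server.database.windows.net,1433;"
    "Database=master;Uid=admin_user;Pwd=admin_password;Encrypt=yes;"
)

with pyodbc.connect(SECONDARY_MASTER, autocommit=True) as conn:
    # FAILOVER performs a planned, no-data-loss failover; the unplanned
    # variant is FORCE_FAILOVER_ALLOW_DATA_LOSS.
    conn.cursor().execute("ALTER DATABASE [mydb] FAILOVER;")
```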
Auto-failover groups are an extension of active geo-replication. They are designed to manage the failover of multiple geo-replicated databases simultaneously, either using application-initiated failover or by delegating failover to the SQL Database service based on user-defined criteria. The latter allows you to automatically recover multiple related databases in a secondary region after a catastrophic failure or other unplanned event that results in full or partial loss of the SQL Database service’s availability in the primary region. Because auto-failover groups involve multiple databases, these databases must be configured on the primary server. Both the primary and secondary servers for the databases in the failover group must be in the same subscription. Auto-failover groups support replication of all databases in the group to only one secondary server in a different region.
If you are using active geo-replication and for any reason your primary database fails, or simply needs to be taken offline, you can initiate failover to any of your secondary databases. When failover is activated to one of the secondary databases, all other secondaries are automatically linked to the new primary. If you are using auto-failover groups to manage database recovery, any outage that impacts one or several of the databases in the group results in automatic failover. You can configure the auto-failover policy that best meets your application needs, or you can opt out and use manual activation. In addition, auto-failover groups provide read-write and read-only listener end-points that remain unchanged during failovers. Whether you use manual or automatic failover activation, failover switches all secondary databases in the group to primary. After the database failover is completed, the DNS record is automatically updated to redirect the end-points to the new region.
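The practical payoff of those listener end-points is that your connection strings never change. A sketch assuming a hypothetical failover group named myfog (the <failover-group-name>.database.windows.net and <failover-group-name>.secondary.database.windows.net patterns are the documented listener formats):

```python
# Sketch: connect through failover-group listeners so failovers are transparent.
# "myfog", the database name, and credentials are placeholders.
import pyodbc

READ_WRITE_CONN = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:myfog.database.windows.net,1433;"             # read-write listener
    "Database=mydb;Uid=app_user;Pwd=app_password;Encrypt=yes;"
)
READ_ONLY_CONN = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:myfog.secondary.database.windows.net,1433;"   # read-only listener
    "Database=mydb;Uid=app_user;Pwd=app_password;Encrypt=yes;"
    "ApplicationIntent=ReadOnly;"
)

oltp = pyodbc.connect(READ_WRITE_CONN)      # always resolves to the current primary
reporting = pyodbc.connect(READ_ONLY_CONN)  # resolves to the secondary region
```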
Active geo-replication leverages the Always On technology of SQL Server to asynchronously replicate committed transactions on the primary database to a secondary database using snapshot isolation. The primary and secondary instances in a geo-replication relationship have independent HA capabilities, the same as a standalone instance would have. Auto-failover groups provide the group semantics on top of active geo-replication but the same asynchronous replication mechanism is used. While at any given point, the secondary database might be slightly behind the primary database, the secondary data is guaranteed to never have partial transactions. Cross-region redundancy enables applications to quickly recover from a permanent loss of an entire datacenter or parts of a datacenter caused by natural disasters, catastrophic human errors, or malicious acts. The specific recovery point objective (RPO) data can be found at Overview of Business Continuity (The time period of updates that you might lose is under 5 seconds).
More info:
Data Integrity in Azure SQL Database
High-availability and Azure SQL Database
Overview of business continuity with Azure SQL Database
Overview: Active geo-replication and auto-failover groups
Designing globally available services using Azure SQL Database
Spotlight on SQL Database Active Geo-Replication
Azure SQL Database Business Continuity Enhancements
Azure SQL Database Read Scale-Out
Read Scale-Out is a little-known feature that allows you to load balance Azure SQL Database read-only workloads using the capacity of read-only replicas, for free.
As mentioned in my blog Azure SQL Database high availability, each database in the Premium tier (DTU-based purchasing model) or in the Business Critical tier (vCore-based purchasing model) is automatically provisioned with several AlwaysOn read-only replicas using synchronous-commit mode to support the availability SLA of 99.99% (these AlwaysOn replicas are created automatically even if you are not using geo-replication). These replicas are provisioned with the same performance level as the read-write replica used by the regular database connections. The Read Scale-Out feature allows you to load balance SQL Database read-only workloads using the capacity of one of the read-only replicas instead of all queries hitting the read-write replica. This way the read-only workload will be isolated from the main read-write workload and will not affect its performance. This feature is intended for applications that include logically separated read-only workloads, such as analytics, and therefore could gain performance benefits using this additional capacity at no extra cost.
I highlighted “one” above to bring attention to the fact that only one replica is used, meaning it does not use multiple read-only replicas and load balance between them.
Another option for read-only workloads is if you also decide to use geo-replication (which is not free): this will create secondary databases (currently up to four) using asynchronous-commit mode that can be made readable, and you can direct connections to each of those secondaries directly in the connection string and do your own load balancing between them. For more info on geo-replication see my blog Azure SQL Database disaster recovery.
And if you are using Read Scale-Out to load balance read-only workloads on a database that is geo-replicated (e.g. as a member of a failover group), make sure that Read Scale-Out is enabled on both the primary and the geo-replicated secondary databases. This will ensure the same load-balancing effect when your application connects to the new primary after failover.
To read how to enable Read Scale-Out and send queries to the read-only replica, check out Use read-only replicas to load balance read-only query workloads (preview).
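As a quick sketch of the application side (server, database, and credentials are placeholders): adding ApplicationIntent=ReadOnly to the connection string routes the session to the read-only replica, and DATABASEPROPERTYEX lets you confirm where you landed.

```python
# Sketch: send a read-only workload to the replica via ApplicationIntent.
# Server/database/credentials are placeholders.
import pyodbc

READ_ONLY_CONN = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:yourserver.database.windows.net,1433;"
    "Database=yourdb;Uid=report_user;Pwd=report_password;Encrypt=yes;"
    "ApplicationIntent=ReadOnly;"  # omit this to land on the read-write replica
)

with pyodbc.connect(READ_ONLY_CONN) as conn:
    updateability = conn.cursor().execute(
        "SELECT DATABASEPROPERTYEX(DB_NAME(), 'Updateability');"
    ).fetchone()[0]
    print(updateability)  # expect READ_ONLY when Read Scale-Out routed the session
```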
More info:
Overview: Active geo-replication and auto-failover groups
Azure Database Migration Service (DMS)
As I first mentioned in my blog Microsoft database migration tools, the Azure Database Migration Service (DMS) is a PaaS solution that makes it easy to migrate from on-prem/RDS to Azure and from one database type to another. I’ll give a brief overview of some of the key features:
The first thing you will do is create an instance of Azure Database Migration Service, which basically reserves compute power in a region (make sure it’s the region where your destination databases will be – there is a limit of 2 DMS instances, but you can email support to have it increased). DMS allows you to choose an existing VNET with connectivity to your source, or to create a basic VNET that can connect to source servers that have public-facing IPs (or are under the same VNET as the DMS service, or are accessible via VNET peering or tunneling). If your migration target is SQL Database Managed Instance (SQL MI), this also needs to be the VNET where your SQL MI instance is located (you can skip this step if migrating to SQL Database Single since it doesn’t support VNETs). Note the usual case is to connect to an existing VNET that has connectivity to the source and target so you don’t have to create a new VNET when creating a DMS. After you create the DMS, you can create a migration project and run migration activities.
Note that DMS now supports using existing SQL Server backup files for migrations from SQL Server to SQL MI, saving time and making the overall migration process easier to perform.
When creating a migration project, you will choose either “online data migration” or “offline data migration” to refer to migrations with and without ongoing replication, respectively.
Offline migrations do a backup & restore in the SQL to SQL VM and SQL to SQL MI scenarios. For SQL to SQL Database Single, DMS copies the data itself (schema migration support is to be added soon) using a home-built DMS streaming pipeline that uses bulk copy APIs, which has some of the best throughput compared to existing technology. The same tech ships in the Database Migration Assistant (DMA), but DMS is more reliable and scalable.
Online migrations (also known as continuous migrations, minimal downtime, or continuous sync) use technology that is based on reading the logs and streaming the data (for migrations to SQL Database Single it syncs via transactional replication, and you must use DMA first to migrate the schema, which may need changes). In private preview is a new pipeline for migrations to SQL MI that will be based on log shipping.
Here is the list of all the current migration options along with the roadmap:
The new Database Migration Guide is for enterprise customers, partners, and business decision makers who are interested in moving to Azure cloud services (i.e. migrating from Oracle or SQL Server to Azure Data Services). The Database Migration Guide provides comprehensive, step-by-step guidance for performing migrations, as well as improves the discoverability of the guidance, tools, software, and programs that are available to assist customers in performing these migrations. Also, this white paper will guide you through the thought process and steps required to migrate your database workloads from on-premises to Azure-based cloud services.
Currently Microsoft does not have assessment rules in DMA specifically for SQL MI but it should be available soon.
More info:
Migrating and modernizing your data estate to Azure with Data Migration Services : Build 2018
Azure SQL Database Hyperscale
At Microsoft Ignite, one of the announcements was for Azure SQL Database Hyperscale, which was made available in public preview October 1st, 2018 in 12 different Azure regions. SQL Database Hyperscale is a new SQL-based and highly scalable service tier for single databases that adapts on-demand to your workload’s needs. With SQL Database Hyperscale, databases can quickly auto-scale up to 100TB, eliminating the need to pre-provision storage resources, and significantly expanding the potential for app growth without being limited by storage size. Check out the documentation.
Compared to current Azure SQL Database service tiers, Hyperscale provides the following additional capabilities:
- Support for up to 100 TB of database size
- Nearly instantaneous database backups (based on file snapshots stored in Azure Blob storage) regardless of size with no IO impact on Compute
- Fast database restores (based on file snapshots) in minutes rather than hours or days (not a size of data operation)
- Higher overall performance due to higher log throughput and faster transaction commit times regardless of data volumes
- Rapid scale out – you can provision one or more read-only nodes for offloading your read workload and for use as hot-standbys
- Rapid Scale up – you can, in constant time, scale up your compute resources to accommodate heavy workloads as and when needed, and then scale the compute resources back down when not needed
The Hyperscale service tier removes many of the practical limits traditionally seen in cloud databases. Where most other databases are limited by the resources available in a single node, databases in the Hyperscale service tier have no such limits. With its flexible storage architecture, storage grows as needed. In fact, Hyperscale databases aren’t created with a defined max size. A Hyperscale database grows as needed – and you are billed only for the capacity you use. Storage is dynamically allocated between 5 GB and 100 TB, in 1 GB increments. For read-intensive workloads, the Hyperscale service tier provides rapid scale-out by provisioning additional read replicas as needed for offloading read workloads.
The Hyperscale service tier is primarily intended for customers who have large databases on-premises and want to modernize their applications by moving to the cloud, or for customers who are already in the cloud and are limited by the maximum database size restrictions (1-4 TB). It is also intended for customers who seek high performance and high scalability for storage and compute.
The Hyperscale service tier supports all SQL Server workloads, but it is primarily optimized for OLTP. The Hyperscale service tier also supports hybrid and analytical (data mart) workloads.
It is available under the vCore-based purchasing options for SQL Database (it is not available yet for SQL Database Managed Instance).
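As a provisioning sketch (treat the service objective name HS_Gen5_2 and all other names as assumptions to verify against current documentation): a Hyperscale database can be created with plain T-SQL from the master database by specifying the Hyperscale edition in the vCore model.

```python
# Sketch: create a Hyperscale database with T-SQL (names and the
# 'HS_Gen5_2' service objective are placeholders/assumptions).
import pyodbc

MASTER_CONN = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:yourserver.database.windows.net,1433;"
    "Database=master;Uid=admin_user;Pwd=admin_password;Encrypt=yes;"
)

with pyodbc.connect(MASTER_CONN, autocommit=True) as conn:
    conn.cursor().execute(
        "CREATE DATABASE [hyperscale_db] "
        "(EDITION = 'Hyperscale', SERVICE_OBJECTIVE = 'HS_Gen5_2');"
    )
```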
Azure SQL Database Hyperscale is built based on a new cloud-born architecture which decouples compute, log and storage.
A Hyperscale database contains the following different types of nodes:
Compute nodes
The compute nodes look like a traditional SQL Server, but without local data files or log files. The compute node is where the relational engine lives, so all the language elements, query processing, and so on, occur here. All user interactions with a Hyperscale database happen through these compute nodes. Compute nodes have SSD-based caches (labeled RBPEX – Resilient Buffer Pool Extension in the preceding diagram) to minimize the number of network round trips required to fetch a page of data. There is one primary compute node where all the read-write workloads and transactions are processed. There are one or more secondary compute nodes that act as hot standby nodes for failover purposes, as well as act as read-only compute nodes for offloading read workloads (if this functionality is desired).
Log service
The log service externalizes the transactional log from a Hyperscale database. The log service node accepts log records from the primary compute node, persists them in a durable log cache, and forwards the log records to the rest of the compute nodes (so they can update their caches) as well as the relevant page server(s), so that the data can be updated there. In this way, all data changes from the primary compute node are propagated through the log service to all the secondary compute nodes and page servers. Finally, the log record(s) are pushed out to long-term storage in Azure Standard Storage, which is an infinite storage repository. This mechanism removes the necessity for frequent log truncation. The log service also has local cache to speed up access.
Page servers
The page servers host and maintain the data files. They consume the log stream from the log service and apply the data modifications described in the log stream to the data files. Read requests for data pages that are not found in the compute node’s local data cache (RBPEX) are sent over the network to the page servers that own those pages. In the page servers, the data files are persisted in Azure Storage and are heavily cached through RBPEX (SSD-based caches).
Page servers are systems representing a scaled-out storage engine. Multiple page servers are created for a large database. When the database is growing and the available space in existing page servers falls below a threshold, a new page server is automatically added to the database. Since page servers work independently, the database can grow with no local resource constraints. Each page server is responsible for a subset of the pages in the database. Nominally, each page server controls one terabyte of data. No data is shared across more than one page server (outside of replicas that are kept for redundancy and availability). The job of a page server is to serve database pages out to the compute nodes on demand, and to keep the pages updated as transactions update data. Page servers are kept up to date by playing log records from the log service. Long-term storage of data pages is kept in Azure Standard Storage for additional reliability.
Azure standard storage node
The Azure Standard Storage node is the final destination of data from the page servers. This storage is used for backup purposes as well as for replication between Azure regions. Backups consist of snapshots of data files. Restore operations from these snapshots are fast, and data can be restored to any point in time.
Automated backup and point in time restore
In a Hyperscale database, snapshots of the data files are taken from the page servers periodically to replace the traditional streaming backup. This allows for a backup of a very large database in just a few seconds. Together with the log records stored in the log service, you can restore the database to any point in time during retention (7 days in public preview) in a very short time, regardless of the database size.
Since the backups are file-snapshot based, they are nearly instantaneous. The separation of storage and compute enables pushing the backup/restore operation down to the storage layer, reducing the processing burden on the primary compute node. As a result, the backup of a large database does not impact the performance of the primary compute node. Similarly, restores are done by copying the file snapshot and as such are not a size-of-data operation. For restores within the same storage account, the restore operation is fast.
More info:
Video New performance and scale enhancements for Azure SQL Database
SQL Server 2019 Big Data Clusters
At the Microsoft Ignite conference, Microsoft announced that SQL Server 2019 is now in preview and that SQL Server 2019 will include Apache Spark and Hadoop Distributed File System (HDFS) for scalable compute and storage. This new architecture that combines together the SQL Server database engine, Spark, and HDFS into a unified data platform is called a “big data cluster”, deployed as containers on Kubernetes. Big data clusters can be deployed in any cloud where there is a managed Kubernetes service, such as Azure Kubernetes Service (AKS), or in on-premises Kubernetes clusters, such as AKS on Azure Stack. The SQL Server 2019 relational database engine in a big data cluster leverages an elastically scalable storage layer that integrates SQL Server and HDFS to scale to petabytes of data storage. The Spark engine is now part of SQL Server:
While extract, transform, load (ETL) has its use cases, an alternative to ETL is data virtualization, which integrates data from disparate sources, locations, and formats, without replicating or moving the data, to create a single “virtual” data layer. The virtual data layer allows users to query data from many sources through a single, unified interface. Access to sensitive data sets can be controlled from a single location. The delays inherent to ETL need not apply; data can always be up to date. Storage costs and data governance complexity are minimized. See the pros and cons of data virtualization via Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement.
SQL Server 2019 big data clusters with enhancements to PolyBase act as a virtual data layer to integrate structured and unstructured data from across the entire data estate (SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Cosmos DB, MySQL, PostgreSQL, MongoDB, Oracle, Teradata, HDFS, Blob Storage, Azure Data Lake Store) using familiar programming frameworks and data analysis tools:
In SQL Server 2019 big data clusters, the SQL Server engine has gained the ability to natively read HDFS files, such as CSV and parquet files, by using SQL Server instances collocated on each of the HDFS data nodes that can filter and aggregate data locally in parallel across all of the HDFS data nodes.
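To give a feel for what this looks like, here is a heavily hedged sketch based on preview-era syntax; the sqlhdfs:// storage pool location, the endpoint/port, and all object names are assumptions that may change before general availability:

```python
# Sketch: expose a CSV directory in the big data cluster's HDFS storage pool
# as an external table (preview-era syntax; every name here is a placeholder).
import pyodbc

MASTER_INSTANCE = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:bdc-master-endpoint,31433;"
    "Database=sales;Uid=admin_user;Pwd=admin_password;"
)

STATEMENTS = [
    "CREATE EXTERNAL DATA SOURCE SqlStoragePool "
    "WITH (LOCATION = 'sqlhdfs://controller-svc/default');",

    "CREATE EXTERNAL FILE FORMAT csv_file "
    "WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS ("
    "FIELD_TERMINATOR = ',', STRING_DELIMITER = '\"', FIRST_ROW = 2));",

    "CREATE EXTERNAL TABLE web_clickstreams_hdfs ("
    "wcs_user_sk BIGINT, wcs_web_page_sk BIGINT, wcs_click_date_sk BIGINT) "
    "WITH (DATA_SOURCE = SqlStoragePool, LOCATION = '/clickstream_data', "
    "FILE_FORMAT = csv_file);",
]

with pyodbc.connect(MASTER_INSTANCE, autocommit=True) as conn:
    cursor = conn.cursor()
    for stmt in STATEMENTS:
        cursor.execute(stmt)
    # Once defined, the HDFS data can be queried like a regular table:
    print(cursor.execute("SELECT COUNT(*) FROM web_clickstreams_hdfs;").fetchone()[0])
```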
Performance of PolyBase queries in SQL Server 2019 big data clusters can be boosted further by distributing the cross-partition aggregation and shuffling of the filtered query results to “compute pools” comprised of multiple SQL Server instances that work together (this is similar to a PolyBase scale-out group).
When you combine the enhanced PolyBase connectors with SQL Server 2019 big data clusters data pools, data from external data sources can be partitioned and cached across all the SQL Server instances in a data pool, creating a “scale-out data mart”. There can be more than one scale-out data mart in a given data pool, and a data mart can combine data from multiple external data sources and tables, making it easy to integrate and cache combined data sets from multiple external sources. This will also be a great solution for importing IoT data.
SQL Server 2019 big data clusters make it easier for big data sets to be joined to the data stored in the enterprise relational database, enabling people and apps that use SQL Server to query big data more easily. The value of the big data greatly increases when it is not just in the hands of the data scientists and big data engineers but is also included in reports, dashboards, and applications used by regular end users. At the same time, the data scientists can continue to use big data ecosystem tools against HDFS while also utilizing easy, real-time access to the high-value data in SQL Server because it is all part of one integrated, complete system.
Azure Data Studio (previously released under the name of SQL Operations Studio) is an open-source, multi-purpose data management and analytics tool for DBAs, data scientists, and data engineers. New extensions for Azure Data Studio integrate the user experience for working with relational data in SQL Server with big data. The new HDFS browser lets analysts, data scientists, and data engineers easily view the HDFS files and directories in the big data cluster, upload/download files, open them, and delete them if needed. The new built-in notebooks in Azure Data Studio are built on Jupyter, enabling data scientists and engineers to write Python, R, or Scala code with Intellisense and syntax highlighting before submitting the code as Spark jobs and viewing the results inline. Notebooks facilitate collaboration between teammates working on a data analysis project together. Lastly, the External Table Wizard, which uses PolyBase connectors, simplifies the process of creating external data sources and tables, including column mappings (it’s much easier than the current way of creating external tables).
There will also be a management service that provisions agents on each pod to collect monitoring data and logs, which can be viewed via a browser-based cluster admin portal. The portal will also provide managed services for HA, backup/recovery, security, and provisioning.
In summary, SQL Server 2019 Big Data Clusters improves the 4 V’s of Big Data with these features:
More info:
Introducing Microsoft SQL Server 2019 Big Data Clusters
What are SQL Server 2019 big data clusters?
SQL Server 2019 Big Data Clusters white paper
Video New Feature Offerings in SQL Server 2019
Video SQL Server vNext meets AI and Big Data
Video The future of SQL Server and big data
Video Deep dive on SQL Server and big data
Premium blob storage
As a follow-up to my blog Azure Archive Blob Storage, Microsoft has released another storage tier called Azure Premium Blob Storage (announcement). It is in private preview in US East 2, US Central and US West regions.
This is a performance tier in Azure Blob Storage, complementing the existing Hot, Cool, and Archive tiers. Data in Premium Blob Storage is stored on solid-state drives, which are known for lower latency and higher transactional rates compared to traditional hard drives.
It is ideal for workloads that require very fast access time such as interactive video editing, static web content, and online transactions. It also works well for workloads that perform many relatively small transactions, such as capturing telemetry data, message passing, and data transformation.
Microsoft internal testing shows that both average and 99th percentile server latency is significantly better than the Hot access tier, providing faster and more consistent response times for both read and write across a range of object sizes.
Premium Blob Storage is available with Locally-Redundant Storage and comes with High-Throughput Block Blobs (HTBB), which provides (a) improved write throughput when ingesting larger block blobs, (b) instant write throughput, and (c) throughput that is unaffected by container and blob naming.
You can store block blobs and append blobs in Premium Blob Storage (page blobs are not yet available). To use Premium Blob Storage you provision a new ‘Block Blob’ storage account in your subscription and start creating containers and blobs using the existing Blob Service REST API and/or any existing tools such as AzCopy or Azure Storage Explorer.
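For example, here is a minimal sketch with the azure-storage-blob Python package (the connection string, container, and blob names are placeholders, and the account is assumed to already be a provisioned ‘Block Blob’ premium account):

```python
# Sketch: write a small block blob to a Premium Blob Storage account.
# Assumes `pip install azure-storage-blob`; the connection string is a placeholder.
from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobServiceClient

CONN_STR = (
    "DefaultEndpointsProtocol=https;AccountName=mypremiumblobacct;"
    "AccountKey=<account-key>;EndpointSuffix=core.windows.net"
)

service = BlobServiceClient.from_connection_string(CONN_STR)
container = service.get_container_client("telemetry")
try:
    container.create_container()
except ResourceExistsError:
    pass  # container already exists

# Premium Blob Storage targets exactly this kind of small, frequent write.
container.upload_blob(name="device-42/2018-10-01.json",
                      data=b'{"temp": 21.5}', overwrite=True)
```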
Premium Blob Storage has a higher data storage cost but a lower transaction cost compared to data stored in the regular Hot tier. This makes it cost effective, and it can even be less expensive overall for workloads with very high transaction rates. Check out the pricing page for more details.
At present, data stored in Premium cannot be tiered to the Hot, Cool, or Archive access tiers. Microsoft is working on supporting object tiering in the future. To move data, you can synchronously copy blobs using the new PutBlockFromURL API (sample code) or a version of AzCopy that supports this API. PutBlockFromURL synchronously copies data server side, which means that the data has finished copying when the call completes and all data movement happens inside Azure Storage.
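A hedged sketch of that server-side copy using the Python SDK’s stage_block_from_url / commit_block_list calls, which correspond to the Put Block From URL REST operation; the SAS URL, connection string, and blob names are placeholders, and the source blob must be readable by the target account:

```python
# Sketch: copy a premium blob into a Hot-tier account server side via
# Put Block From URL (all names, URLs, and keys are placeholders).
import base64
from azure.storage.blob import BlobBlock, BlobClient

SOURCE_URL = (
    "https://mypremiumblobacct.blob.core.windows.net/"
    "telemetry/device-42.json?<sas-token>"
)
target = BlobClient.from_connection_string(
    "<hot-account-connection-string>",
    container_name="archive", blob_name="device-42.json")

block_id = base64.b64encode(b"block-000001").decode()
target.stage_block_from_url(block_id, SOURCE_URL)   # data never leaves Azure Storage
target.commit_block_list([BlobBlock(block_id=block_id)])
```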
Check out the announcement on how to sign up for the private preview.
To summarize the four storage tiers:
- Premium storage (preview): provides high-performance hardware for data that is accessed frequently
- Hot storage: optimized for storing data that is accessed frequently
- Cool storage: optimized for storing data that is infrequently accessed and stored for at least 30 days
- Archive storage: optimized for storing data that is rarely accessed and stored for at least 180 days, with flexible latency requirements (on the order of hours)