James Serra's Blog

Podcast: Myths of Modern Data Management


As part of the Secrets of Data Analytics Leaders series by the Eckerson Group, I did a 30-minute podcast with Wayne Eckerson where I discussed myths of modern data management.  Some of the myths discussed include ‘all you need is a data lake’, ‘the data warehouse is dead’, ‘we don’t need OLAP cubes anymore’, ‘cloud is too expensive and latency is too slow’, and ‘you should always use a NoSQL product over an RDBMS’.  I hope you check it out!


Cost savings of the cloud


I often hear people say moving to the cloud does not save money, but frequently they don’t take into account the savings for indirect costs that are hard to measure (or the benefits you get that are simply not cost-related).  For example, the cloud allows you to get started in building a solution in a matter of minutes while starting a solution on-prem can take weeks or even months.  How do you put a monetary figure on that?  Or these other benefits that are difficult to put a dollar figure on:

  • Unlimited storage
  • Grow hardware as demand requires (unlimited elastic scale) and even pause it (and not pay anything while paused)
  • Upgrade hardware instantly compared to weeks/months to upgrade on-prem
  • Enhanced availability and reliability (i.e. data in Azure automatically has three copies). What does each hour of downtime cost your business?
  • Benefit of having separation of compute and storage so you don’t need to upgrade one when you only need to upgrade the other
  • Pay for only what you need (reduce hardware as demand lessens)
  • Not having to guess how much hardware you need and getting too much or too little
  • No need to size hardware solely for the maximum peak demand
  • Ability to fail fast (cancel a project and not have leftover hardware)
  • Really helpful for proof-of-concept (POC) or development projects with a known lifespan because you don’t have to re-purpose hardware afterwards
  • The value of being able to incorporate more data allowing more insights into your business
  • No commitment or long-term vendor lock-in
  • Benefit from changes in technology impacting the latest storage solutions
  • More frequent updates to the OS, SQL Server, etc
  • Automatic software updates
  • The cloud vendors have much higher security than anything on-prem.  You can imagine the loss of income if a vendor had a security breach, so the investment in keeping things secure is massive

As you can see, there is much more than just running numbers in an Excel spreadsheet to see how much money the cloud will save you.  But if you really need those numbers, Microsoft has a Total Cost of Ownership (TCO) Calculator that will estimate the cost savings you can realize by migrating your application workloads to Microsoft Azure.  You simply provide a brief description of your on-premises environment to get an instant report.

The benefits that are easier to put a dollar figure on:

  • Don’t need co-location space, so cost savings (space, power, networking, etc)
  • No need to manage the hardware infrastructure, reducing staff
  • No up-front hardware costs or costs for hardware refresh cycles every 3-5 years
  • High availability and disaster recovery done for you
  • Automatic geography redundancy
  • Having built-in tools (i.e. monitoring) so you don’t need to purchase 3rd-party software

Also, there are some constraints of on-premise data that go away when moving to the cloud:

  • Scale constrained to on-premise procurement
  • Large up-front capital expense (CapEx) instead of a pay-as-you-go yearly operating expense (OpEx)
  • A staff of employees or consultants administering and supporting the hardware and software in place
  • Expertise needed for tuning and deployment

I often tell clients that if you have your own on-premise data center, you are in the air conditioning business.  Wouldn’t you rather focus all your efforts on analyzing data?  You could also try to “save money” by doing your own accounting, but wouldn’t it make more sense to off-load that to an accounting company?  Why not also off-load the  costly, up-front investment of hardware, software, and other infrastructure, and the costs of maintaining, updating, and securing an on-premises system?

And when dealing with my favorite topic, data warehousing, a conventional on-premise data warehouse can cost millions of dollars in the following: licensing fees, hardware, and services; the time and expertise required to set up, manage, deploy, and tune the warehouse; and the costs to secure and back up the data.  All items that a cloud solution eliminates or greatly minimizes.

When estimating hardware costs for a data warehouse, consider the costs of servers, additional storage devices, firewalls, networking switches, data center space to house the hardware, a high-speed network (with redundancy) to access the data, and the power and redundant power supplies needed to keep the system up and running.  If your warehouse is mission critical then you need to also add the costs to configure a disaster recovery site, effectively doubling the cost.

When estimating software costs for a data warehouse, keep in mind that organizations frequently pay hundreds of thousands of dollars in software licensing fees for data warehouse software and add-on packages.  Also, additional end users that are given access to the data warehouse, such as customers and suppliers, can significantly increase those costs.  Finally, add the ongoing cost for annual support contracts, which often comprise 20 percent of the original license cost.
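
To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch.  Every dollar amount is a hypothetical placeholder rather than a real quote; the only relationships taken from the text above are that a disaster recovery site roughly doubles the hardware cost and that annual support runs about 20 percent of the original license cost.

```python
# Back-of-the-envelope on-prem data warehouse cost model.
# All dollar amounts are hypothetical placeholders for illustration only.

hardware = {
    "servers": 250_000,
    "storage": 150_000,
    "network_and_firewalls": 50_000,
    "data_center_space_and_power": 75_000,
}
software_license = 400_000          # hypothetical up-front license fees
years = 5

hardware_cost = sum(hardware.values())
hardware_cost *= 2                  # mission critical: a DR site roughly doubles hardware cost
support_cost = 0.20 * software_license * years   # ~20% of the license cost per year

total = hardware_cost + software_license + support_cost
print(f"Estimated {years}-year on-prem cost: ${total:,.0f}")
```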

Also note that an on-premises data warehouse needs specialized IT personnel to deploy and maintain the system.  This creates a potential bottleneck when issues arise and keeps responsibility for the system with the customer, not the vendor.

I’ll point out my two key favorite advantages of having a data warehousing solution in the cloud:

  • The complexities and cost of capacity planning and administration, such as sizing, balancing, and tuning the system, are built into the system, automated, and covered by the cost of your subscription
  • Being able to dynamically provision storage and compute resources on the fly to meet the demands of your changing workloads in peak and steady usage periods.  Capacity is whatever you need whenever you need it

Hopefully this blog post points out that while there can be considerable cost savings in moving to the cloud, there are so many other benefits that cost should not be the only reason to move.

More info:

How To Measure the ROI of Moving To the Cloud

Cloud migration – where are the savings?

Comparing cloud vs on-premise? Six hidden costs people always forget about

The high cost and risk of On-Premise vs. Cloud

TCO Analysis Demonstrates How Moving To The Cloud Can Save Your Company Money

5 Common Assumptions Comparing Cloud To On-Premises

5 Financial Benefits of Moving to the Cloud

Podcast: Big Data Solutions in the Cloud


In this podcast I talk with Carlos Chacon of SQL Data Partners on big data solutions in the cloud.  Here is the description of the chat:

Big Data.  Do you have big data?  What does that even mean?  In this episode I explore some of the concepts of how organizations can manage their data and what questions you might need to ask before you implement the latest and greatest tool.  I am joined by James Serra, Microsoft Cloud Architect, to get his thoughts on implementing cloud solutions, where they can contribute, and why you might not be able to go all cloud.  I am interested to see if more traditional DBAs move toward architecture roles and help their organizations manage the various types of data.  What types of issues are giving you troubles as you adopt a more diverse data ecosystem?

I hope you give it a listen!

Azure SQL Data Warehouse Gen2 announced


On Monday the general availability of the Compute Optimized Gen2 tier of Azure SQL Data Warehouse was announced.  With this performance-optimized tier, Microsoft is dramatically accelerating query performance and concurrency.

The changes in Azure SQL DW Compute Optimized Gen2 tier are:

  • 5x query performance via an adaptive caching technology, which takes a blended approach of using remote storage in combination with a fast SSD cache layer (using NVMe drives) that places data next to compute based on user access patterns and frequency
  • Significant improvement in serving concurrent queries (32 to 128 queries/cluster)
  • Removes limits on columnar data volume, enabling unlimited columnar data storage
  • 5 times higher computing power compared to the current generation by leveraging the latest hardware innovations that Azure offers via additional Service Level Objectives (DW7500c, DW10000c, DW15000c and DW30000c)
  • Added Transparent Data Encryption with customer-managed keys

Azure SQL DW Compute Optimized Gen2 tier will roll out to 20 regions initially (you can find the full list of available regions online), with subsequent rollouts to all other Azure regions.  If you have a Gen1 data warehouse, take advantage of the latest generation of the service by upgrading.  If you are getting started, try Azure SQL DW Compute Optimized Gen2 tier today.

More info:

Turbocharge cloud analytics with Azure SQL Data Warehouse

Blazing fast data warehousing with Azure SQL Data Warehouse

Microsoft Mechanics video

Microsoft Build event announcements


Another Microsoft event and another bunch of exciting announcements.  At the Microsoft Build event this week, the major announcements in the data platform space were:

Multi-master at global scale with Azure Cosmos DB.  Perform writes on containers of data (for example, collections, graphs, tables) distributed anywhere in the world. You can update data in any region that is associated with your database account. These data updates can propagate asynchronously. In addition to providing fast access and write latency to your data, multi-master also provides a practical solution for failover and load-balancing issues.  More info

Azure Cosmos DB Provision throughput at the database level in preview.  Azure Cosmos DB customers with multiple collections can now provision throughput at a database level and share throughput across the database, making large collection databases cheaper to start and operate.  More info

Virtual network service endpoint for Azure Cosmos DB.  Generally available today, virtual network (VNet) service endpoints help ensure access to Azure Cosmos DB only from the preferred virtual network subnet.  The feature removes the need for manual IP changes and provides an easier way to manage access to the Azure Cosmos DB endpoint.  More info

Azure Cognitive Search now in preview.  Cognitive Search, a new preview feature in the existing Azure Search service, includes an enrichment pipeline allowing customers to find rich structured information from documents.  That information can then become part of the Azure Search index.  Cognitive Search also integrates with Natural Language Processing capabilities and includes built-in enrichers called cognitive skills.  Built-in skills help to perform a variety of enrichment tasks, such as the extraction of entities from text or image analysis and OCR capabilities.  Cognitive Search is also extensible and can connect to your own custom-built skills.  More info

Azure SQL Database and Data Warehouse TDE with customer managed keys.  Now generally available, Azure SQL Database and Data Warehouse Transparent Data Encryption (TDE) offers Bring Your Own Key (BYOK) support with Azure Key Vault integration.  Azure Key Vault provides highly available and scalable secure storage for RSA cryptographic keys backed by FIPS 140-2 Level 2 validated Hardware Security Modules (HSMs).  Key Vault streamlines the key management process and enables customers to maintain full control of encryption keys and allows them to manage and audit key access.  This is one of the most frequently requested features by enterprise customers looking to protect sensitive data and meet regulatory or security compliance obligations.  More info

Azure Database Migration Service is now generally available.  This is a service that was designed to be a seamless, end-to-end solution for moving on-premises SQL Server, Oracle, and other relational databases to the cloud. The service will support migrations of homogeneous/heterogeneous source-target pairs, and the guided migration process will be easy to understand and implement.  More info

4 new features now available in Azure Stream Analytics.  In public preview: session windows.  In private preview: C# custom code support for Stream Analytics jobs on IoT Edge, blob output partitioning by custom attribute, and updated built-in ML models for anomaly detection.  More info

Getting value out of data quickly


There are times when you need to create a “quick and dirty” solution to build a report.  This blog will show you one way of using a few Azure products to accomplish that.  This should not be viewed as a replacement for a data warehouse, but rather as a way to quickly show a customer how to get value out of their data, to produce a one-time report, or to see if certain data would be useful to move into your data warehouse.

Let’s look at a high-level architecture for building a report quickly using NCR data (restaurant data):

This solution has the restaurant data that is in an on-prem SQL Server replicated to Azure SQL Database using transactional replication.  Azure Data Factory is then used to copy the point-of-sale transaction logs in Azure SQL Database into Azure Data Lake Store.  Then Azure Data Lake Analytics with U-SQL is used to transform/clean the data and store it back into Azure Data Lake Store.  That data is then used in Power BI to create the reports and dashboards (business users can build the models in Power BI and the data can be refreshed multiple times during the day via the new incremental refresh).  This is all done with Platform-as-a-Service products so there is nothing to set up or install and no VMs – just quickly and easily doing all the work via the Azure portal.

This solution is inexpensive since there is no need for the more expensive services like Azure SQL Data Warehouse or Azure Analysis Services, and Azure Data Lake Analytics is a job service that you only pay for when the query runs (where you specify the analytics units to use).
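
In this architecture the transform/clean step is written in U-SQL and runs as an Azure Data Lake Analytics job, which isn’t shown here.  As a stand-in, here is a minimal Python/pandas sketch of the same ELT idea – land the raw point-of-sale extract as-is, then clean and aggregate it within the lake’s folder structure – with file paths and column names that are made up for illustration.

```python
import pandas as pd

# Raw extract landed by Azure Data Factory (paths and columns are hypothetical).
raw = pd.read_csv("datalake/raw/pos_transactions.csv", parse_dates=["sale_time"])

# Clean: drop voided tickets and obvious bad rows.
clean = raw[(raw["status"] != "VOID") & (raw["total_amount"] > 0)].copy()

# Transform: daily sales per store, ready for Power BI.
daily_sales = (
    clean.assign(sale_date=clean["sale_time"].dt.date)
         .groupby(["store_id", "sale_date"], as_index=False)["total_amount"]
         .sum()
)

# Write the curated output back to the lake's curated zone.
daily_sales.to_csv("datalake/curated/daily_sales.csv", index=False)
```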

Some things to keep in mind with a solution like this:

  • Power BI has been called “reporting crack” because once a business user is exposed to it they want more.  And this solution gives them their first taste
  • This solution should have a very limited scope – it’s more like a proof-of-concept and should be a short-term solution
  • It takes the approach of ELT instead of ETL in that data is loaded into Azure Data Lake Store and then converted using the power of Azure Data Lake Analytics instead of it being transformed during the move from the source system to the data lake like you usually do when using SSIS
  • This limits the data model building to one person using it for themselves or a department, versus having multiple people build models for an enterprise solution using Azure Analysis Services
  • This results in quick value but sacrifices an enterprise solution that includes performance, data governance, data history, referential integrity, security, and master data management.  Also, you will not be able to use tools that need to work against a relational format
  • This solution will normally require a power user to develop reports since it’s working against a data lake instead of an easier-to-use relational model or a tabular model

An even better way to get value out of data quickly is with another product that is in preview called Common Data Service for Analytics.  More on this in my next blog.

Understanding Cosmos DB


Cosmos DB is an awesome product that is mainly used for large-scale OLTP solutions.  Any web, mobile, gaming, or IoT application that needs to handle massive amounts of data, reads, and writes at a globally distributed scale with near-real-time response times for a variety of data is a great use case (it can be scaled out to support many millions of transactions per second).  Because it fits in the NoSQL category and is a scale-out solution, it can be difficult to wrap your head around how it works if you come from the relational world (i.e. SQL Server).  So this blog will lay out the differences in how Cosmos DB works.

First, a quick comparison of terminology to help you understand the difference:

RDBMS                      | Cosmos DB (Document Model) | Cosmos DB (Graph Model)
Database                   | Database                   | Database
Table, view                | Collection                 | Graph
Row                        | Document (JSON)            | Vertex
Column                     | Property                   | Property
Foreign Key                | Reference                  | Edge
Join                       | Embedded document          | .out()
Partition Key/Sharding Key | Partition Key              | Partition Key
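
To ground the table above, here is a hypothetical order document: what would be a row plus joined child rows in an RDBMS becomes a single JSON document with an embedded array, and one property (here customerId) is chosen as the partition key.  All names and values are made up for illustration.

```python
# Hypothetical order stored as a single Cosmos DB document (JSON).
# "customerId" is the partition key; the line items that would be separate
# rows joined by a foreign key in an RDBMS are embedded in the document.
order_document = {
    "id": "order-1001",
    "customerId": "cust-42",        # partition key value
    "orderDate": "2018-05-01",
    "items": [                      # the "join" becomes an embedded array
        {"sku": "widget", "qty": 2, "price": 9.99},
        {"sku": "gadget", "qty": 1, "price": 24.50},
    ],
    "shippingAddress": {"city": "Seattle", "state": "WA"},
}
```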

From Welcome to Azure Cosmos DB and other documentation, here are some key points to understand:

  • You can distribute your data to any number of Azure regions, with the click of a button. This enables you to put your data where your users are, ensuring the lowest possible latency to your customers
  • When a new region gets added, it is available for operations within 30 minutes anywhere in the world (assuming your data is 100 TBs or less).
  • To control the exact sequence of regional failovers in case of an outage, Azure Cosmos DB enables you to associate a priority with the various regions associated with the database account
  • Azure Cosmos DB enables you to configure the regions (associated with the database) for “read”, “write” or “read/write” regions.
  • For Cosmos DB to offer strong consistency in a globally distributed setup, it needs to synchronously replicate the writes or to synchronously perform cross-region reads.  The speed of light and the wide area network reliability dictates that strong consistency will result in higher latencies and reduced availability of database operations.  Hence, in order to offer guaranteed low latencies at the 99th percentile and 99.99% availability for all single region accounts and all multi-region accounts with relaxed consistency, and 99.999% availability on all multi-region database accounts, it must employ asynchronous replication.  This in-turn requires that it must also offer well-defined, relaxed consistency model(s) – weaker than strong (to offer low latency and availability guarantees) and ideally stronger than “eventual” consistency (with an intuitive programming model)
  • Using Azure Cosmos DB’s multi-homing APIs, an app always knows where the nearest region is and sends requests to the nearest data center.  All of this is possible with no config changes.  You set your write-region and as many read-regions as you want, and the rest is handled for you
  • As you add and remove regions to your Azure Cosmos DB database, your application does not need to be redeployed and continues to be highly available thanks to the multi-homing API capability
  • It supports multiple data models, including but not limited to document, graph, key-value, table, and column-family data models
  • APIs for the following data models are supported with SDKs available in multiple languages: SQL API, MongoDB API, Cassandra API, Gremlin API, Table API
  • 99.99% availability SLA for all single-region database accounts, and a 99.999% read availability SLA on all multi-region database accounts.  Deploy to any number of Azure regions for higher availability and better performance
  • For a typical 1KB item, Cosmos DB guarantees end-to-end latency of reads under 10 ms and indexed writes under 15 ms at the 99th percentile within the same Azure region.  The median latencies are significantly lower (under 5 ms).  So you will want to deploy your app and your database to multiple regions to have users all over the world have the same low latency.  If you have an app in one region but the Cosmos DB database in another, then you will have additional latency between the regions (see Azure Latency Test to determine what that latency would be)
  • Developers reserve throughput of the service according to the application’s varying load.  Behind the scenes, Cosmos DB will scale up resources (memory, processor, partitions, replicas, etc.) to achieve that requested throughput while maintaining the 99th percentile of latency for reads to under 10 ms and for writes to under 15 ms. Throughput is specified in request units (RUs) per second.  The number of RUs consumed for a particular operation varies based upon a number of factors, but the fetching of a single 1KB document by id spends roughly 1 RU.  Delete, update, and insert operations consume roughly 5 RUs assuming 1 KB documents.  Big queries and stored procedure executions can consume 100s or 1000s of RUs based upon the complexity of the operations needed.  For each collection (bucket of documents), you specify the RUs
  • Throughput directly affects how much the user is charged but can be tuned up dynamically to handle peak load and down to save costs when more lightly loaded by using the Azure Portal, one of the supported SDKs, or the REST API
  • Request Units (RU) are used to guarantee throughput in Cosmos DB.  You will pay for what you reserve, not what you use.  RUs are provisioned by region and can vary by region as a result.  But they are not shared between regions.  This will require you to understand usage patterns in each region you have a replica
  • For applications that exceed the provisioned request unit rate for a container, requests to that collection are throttled until the rate drops below the reserved level.  When a throttle occurs, the server preemptively ends the request with RequestRateTooLargeException (HTTP status code 429) and returns the x-ms-retry-after-ms header indicating the amount of time, in milliseconds, that the user must wait before reattempting the request (see the SDK sketch after this list).  So, you will get 10ms reads as long as requests stay under the set RUs
  • Cosmos DB provides five consistency levels: strong, bounded-staleness, session, consistent prefix, and eventual.  The further to the left in this list, the greater the consistency but the higher the RU cost which essentially lowers available throughput for the same RU setting.  Session level consistency is the default.  Even when set to lower consistency level, any arbitrary set of operations can be executed in an ACID-compliant transaction by performing those operations from within a stored procedure.  You can also change the consistency level for each request using the x-ms-consistency-level request header or the equivalent option in your SDK
  • Azure Cosmos DB accounts that are configured to use strong consistency cannot associate more than one Azure region with their Azure Cosmos DB account
  • There is no support for GROUP BY or other aggregation functionality found in database systems (a workaround is to use the Spark to Cosmos DB connector)
  • No database schema/index management – it automatically indexes all the data it ingests without requiring any schema or indexes and serves blazing fast queries.  By default, every field in each document is automatically indexed generally providing good performance without tuning to specific query patterns.  These defaults can be modified by setting an indexing policy which can vary per field.
  • Industry-leading, financially backed, comprehensive service level agreements (SLAs) for availability, latency, throughput, and consistency for your mission-critical data
  • There is a local emulator running under MS Windows for developer desktop use (was added in the fall of 2016)
  • Storage capacity options for a collection: Fixed (max of 10GB and 400 – 10,000 RU/s), Unlimited (1,000 – 100,000 RU/s). You can contact support if you need more than 100,000 RU/s.  There is no limit to the total amount of data or throughput that a container can store in Azure Cosmos DB
  • Costs: SSD storage (per GB): $0.25 GB/month; reserved RUs/second (per 100 RUs, 400 RUs minimum): $0.008/hour (for all regions except Japan and Brazil, where costs are higher)
  • Global distribution (also known as global replication/geo-redundancy/geo-replication) is for delivering low-latency access to data to end users no matter where they are located around the globe and for adding regional resiliency for business continuity and disaster recovery (BCDR).  When you choose to make containers span across geographic regions, you are billed for the throughput and storage for each container in every region and the data transfer between regions
  • Cosmos DB implements optimistic concurrency so there are no locks or blocks but instead, if two transactions collide on the same data, one of them will fail and will be asked to retry
  • Because there is currently no concept of a constraint, foreign-key or otherwise, any inter-document relationships that you have in documents are effectively “weak links” and will not be verified by the database itself.  If you want to ensure that the data a document is referring to actually exists, then you need to do this in your application, or through the use of server-side triggers or stored procedures on Azure Cosmos DB.
  • You can set up a policy to geo-fence a database to specific regions.  This geo-fencing capability is especially useful when dealing with data sovereignty compliance that requires data to never leave a specific geographical boundary
  • Backups are taken every four hours and two are kept at all times.  Also, in the event of database deletion, the backups will be kept for thirty days before being discarded.  With these rules in place, the client knows that in the event of some unintended data modification, they have an eight-hour window to get support involved and start the restore process
  • Cosmos DB is an Azure data storage solution which means that the data at rest is encrypted by default and data is encrypted in transit.  If you need Role-Based Access Control (RBAC), Azure Active Directory (AAD) is supported in Cosmos DB
  • Within Cosmos DB, partitions are used to distribute your data for optimal read and write operations.  It is recommended to create a granular key with highly distinct values.  The partitions are managed for you.  Cosmos DB will split or merge partitions to keep the data properly distributed.  Keep in mind your key needs to support distributed writes and distributed reads
  • Until recently, writes could only be made to one region.  But now in private preview is writes to multi regions.  See Multi-master at global scale with Azure Cosmos DB.  With Azure Cosmos DB multi-master support, you can perform writes on containers of data (for example, collections, graphs, tables) distributed anywhere in the world. You can update data in any region that is associated with your database account. These data updates can propagate asynchronously. In addition to providing fast access and write latency to your data, multi-master also provides a practical solution for failover and load-balancing issues.
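
Several of the bullets above come down to “reserve RUs per container, pick a good partition key, and handle 429s when you exceed the reservation.”  Here is a minimal sketch of what that looks like with the azure-cosmos Python SDK; treat it as an assumption-laden illustration: the account URL, key, names, and RU number are placeholders, the SDK (a later version than existed when this post was written) already retries 429s on its own, and exact header handling may differ from what is shown.

```python
import time
from azure.cosmos import CosmosClient, PartitionKey, exceptions

# Placeholders -- substitute your own account URL, key, and names.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<account-key>")

database = client.create_database_if_not_exists(id="retail")
container = database.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),  # a granular, high-cardinality key
    offer_throughput=400,                            # reserved RU/s for this container
)

def upsert_with_retry(doc, attempts=5):
    """Upsert a document, backing off if the container is throttled (HTTP 429).

    The SDK already retries 429s internally; this loop just makes the
    x-ms-retry-after-ms idea from the bullet list explicit.
    """
    for attempt in range(attempts):
        try:
            return container.upsert_item(doc)
        except exceptions.CosmosHttpResponseError as err:
            if err.status_code != 429:
                raise
            time.sleep(0.1 * (attempt + 1))  # stand-in for honoring the retry-after header
    raise RuntimeError("still throttled after %d attempts" % attempts)

upsert_with_retry({"id": "order-1001", "customerId": "cust-42", "total": 44.48})
```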

Azure Cosmos DB allows you to scale throughput (as well as storage) elastically across any number of regions depending on your needs or demand.

Azure Cosmos DB distributed and partitioned collections

The above picture shows a single Azure Cosmos DB container horizontally partitioned (across three resource partitions within a region) and then globally distributed across three Azure regions.

An Azure Cosmos DB container gets distributed in two dimensions (i) within a region and (ii) across regions. Here’s how (see Partition and scale in Azure Cosmos DB for more info):

  • Local distribution: Within a single region, an Azure Cosmos DB container is horizontally scaled out in terms of resource partitions.  Each resource partition manages a set of keys and is strongly consistent and highly available, being physically represented by four replicas (also called a replica-set) with state machine replication among those replicas.  Azure Cosmos DB is a fully resource-governed system, where a resource partition is responsible for delivering its share of throughput for the budget of system resources allocated to it.  The scaling of an Azure Cosmos DB container is transparent to the users.  Azure Cosmos DB manages the resource partitions and splits and merges them as needed as storage and throughput requirements change
  • Global distribution: If it is a multi-region database, each of the resource partitions is then distributed across those regions.  Resource partitions owning the same set of keys across various regions form a partition set (see preceding figure).  Resource partitions within a partition set are coordinated using state machine replication across multiple regions associated with the database.  Depending on the consistency level configured, the resource partitions within a partition set are configured dynamically using different topologies (for example, star, daisy-chain, tree etc.)

You can Try Azure Cosmos DB for Free without an Azure subscription, free of charge and commitments.

More info:

Relational databases vs Non-relational databases

A technical overview of Azure Cosmos DB

Analytics Platform System (APS) AU7 released


The Analytics Platform System (APS), which is a renaming of the Parallel Data Warehouse (PDW), has just released an appliance update (AU7), which is sort of like a service pack, except that it includes many new features.

Below is what is new in this release:

Customers will get significantly improved query performance and enhanced security features with this release. APS AU7 builds on appliance update 6 (APS 2016) release as a foundation. Upgrading to APS appliance update 6 is a prerequisite to upgrade to appliance update 7.

Faster performance

APS AU7 now provides the ability to automatically create statistics and update existing outdated statistics for improved query optimization.  APS AU7 also adds support for setting multiple variables from a single SELECT statement, reducing the number of redundant round trips to the server and improving overall query and ETL performance.  Other T-SQL features include HASH and ORDER GROUP query hints to provide more control over improving query execution plans.

Better security

APS AU7 also includes latest firmware and drivers along with the hardware and software patch to address the Spectre/Meltdown vulnerability from our hardware partners.

Management enhancements

Customers already on APS2016 will experience an enhanced upgrade process to APS AU7 allowing a shorter maintenance window with the ability to uninstall and rollback to a previous version.  AU7 also introduces a section called Feature Switch in configuration manager giving customers the ability to customize the behavior of new features.

More info:

Microsoft releases the latest update of Analytics Platform System


Microsoft Connect(); announcements


Microsoft Connect(); is a developer event from Nov 15-17, where plenty of announcements are made.  Here is a summary of the data platform related announcements:

  • Azure Databricks: In preview, this is a fast, easy, and collaborative Apache Spark based analytics platform optimized for Azure. It delivers one-click set up, streamlined workflows, and an interactive workspace all integrated with Azure SQL Data Warehouse, Azure Storage, Azure Cosmos DB, Azure Active Directory, and Power BI.  More info
  • Azure Cosmos DB with Apache Cassandra API: In preview, this enables Cassandra developers to simply use the Cassandra API in Azure Cosmos DB and enjoy the benefits of Azure Cosmos DB with the familiarity of the Cassandra SDKs and tools, with no code changes to their application.  More info.  See all Cosmos DB announcements
  • Microsoft joins the MariaDB Foundation: Microsoft is a platinum sponsor – MariaDB is a community-developed fork of the MySQL relational database management system, and Microsoft will be actively contributing to MariaDB and the MariaDB community.  More info
  • Azure Database for MariaDB: An upcoming preview will bring fully managed service capabilities to MariaDB, further demonstrating Microsoft’s commitment to meeting customers and developers where they are by offering their favorite technologies on Azure.  More info
  • Azure SQL Database with Machine Learning Services: In preview this provides support for machine learning models inside Azure SQL Database. This makes it seamless for data scientists and developers to create and train models in Azure Machine Learning and deploy models directly to Azure SQL Database to create predictions at blazing fast speeds
  • Visual Studio Code Tools for AI: In preview, create, train, manage, and deploy AI models with all the productivity of Visual Studio and the power of Azure.  Works on Windows and MacOS.  More info

What is Azure Databricks?


Azure Databricks (documentation and user guide) was announced at Microsoft Connect, and with this post I’ll try to explain its use case.  At a high level, think of it as a tool for curating and processing massive amounts of data, developing, training, and deploying models on that data, and managing the whole workflow process throughout the project.  It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and Machine Learning Library (MLlib).  It has built-in integration with Azure Blob Storage, Azure Data Lake Storage (ADLS), Azure SQL Data Warehouse (SQL DW), Cosmos DB, Azure Event Hub, Apache Kafka for HDInsight, and Power BI (see Spark Data Sources).  Think of it as an alternative to HDInsight (HDI) and Azure Data Lake Analytics (ADLA).

It differs from HDI in that HDI is a PaaS-like experience that allows working with many more OSS tools at a lower cost.  Databricks’ advantage is that it is a Software-as-a-Service-like experience (or Spark-as-a-service) that is easier to use, has native Azure AD integration (HDI security is via Apache Ranger and is Kerberos based), has auto-scaling and auto-termination (like a pause/resume), has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.  Note that all clusters within the same workspace share data among all of those clusters.

Also note that with built-in integration to SQL DW, Databricks can write directly to SQL DW.  HDInsight cannot, so more steps are required: when HDInsight processes data it must write it back to Blob Storage and then use Azure Data Factory (ADF) to move the data from Blob Storage to SQL DW.
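
As a sketch of that direct write path, here is roughly what it looks like from a Databricks notebook using the Azure SQL DW connector that ships with Databricks.  The JDBC URL, storage account, paths, and table names are placeholders, and the exact option names should be checked against the connector documentation.

```python
# Runs inside a Databricks notebook, where `spark` is already defined.
# All connection strings, storage paths, and table names are placeholders.

df = spark.read.parquet("wasbs://curated@mystorageacct.blob.core.windows.net/daily_sales/")

(df.write
   .format("com.databricks.spark.sqldw")   # SQL DW connector bundled with Databricks
   .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;"
                  "database=mydw;user=loader;password=<password>;encrypt=true")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.DailySales")
   .option("tempDir", "wasbs://staging@mystorageacct.blob.core.windows.net/tmp")  # PolyBase staging area
   .mode("append")
   .save())
```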

It is in limited public preview now: Sign up for the Azure Databricks limited preview

More info

Microsoft makes Databricks a first-party service on Azure

DATABRICKS + MICROSOFT AZURE = CLOUD-SCALE SPARK POWER

Microsoft Launches Preview of Azure Databricks

A technical overview of Azure Databricks

Microsoft Azure Debuts a ‘Spark-as-a-Service’

3 REASONS TO CHOOSE AZURE DATABRICKS FOR DATA SCIENCE AND BIG DATA WORKLOADS

Is the traditional data warehouse dead?


There have been a number of enhancements to Hadoop recently when it comes to fast interactive querying with such products as Hive LLAP and Spark SQL which are being used over slower interactive querying options such as Tez/Yarn and batch processing options such as MapReduce (see Azure HDInsight Performance Benchmarking: Interactive Query, Spark and Presto).

This has led to a question I have started to see from customers: Do I still need a data warehouse or can I just put everything in a data lake and report off of that using Hive LLAP or Spark SQL?  Which leads to the argument: “Is the data warehouse dead?”

I think what is confusing is that the argument should not be over whether the “data warehouse” is dead but whether the “traditional data warehouse” is dead, as the reasons that a “data warehouse” is needed are greater than ever (i.e. integrate many sources of data, reduce reporting stress on production systems, data governance including cleaning and mastering and security, historical analysis, user-friendly data structure, minimize silos, single version of the truth, etc – see Why You Need a Data Warehouse).  And what is meant by a “traditional” data warehouse is usually a relational data warehouse built using SQL Server (if using Microsoft products), while a data lake usually means one built in Hadoop using Azure Data Lake Store (ADLS) and HDInsight (which has cluster types for Spark SQL and Hive LLAP, the latter also called Interactive Query).

I think the ultimate question is: Can all the benefits of a traditional relational data warehouse be implemented inside of a Hadoop data lake with interactive querying via Hive LLAP or Spark SQL, or should I use both a data lake and a relational data warehouse in my big data solution?  The short answer is you should use both.  The rest of this post will dig into the reasons why.

I touched on this ultimate question in a blog that is now over a few years old, Hadoop and Data Warehouses, so this is a good time to provide an update.  I also touched on this topic in my blogs Use cases of various products for a big data cloud solution, Data lake details, Why use a data lake?, and What is a data lake?, and in my presentation Big data architectures and the data lake.

The main benefits I hear of a data lake-only approach: Don’t have to load data into another system and therefore manage schemas across different systems, data load times can be expensive, data freshness challenges, operational challenges of managing multiple systems, and cost.  While these are valid benefits, I don’t feel they are enough to warrant not having a relational data warehouse in your solution.

First let’s talk about cost and dismiss the incorrect assumption that Hadoop is cheaper: Hadoop can be 3x cheaper for data refinement, but to build a data warehouse in Hadoop it can be 3x more expensive due to the cost of writing complex queries and analysis (based on a WinterCorp report and my experiences).

Understand that a “big data” solution does not mean just using Hadoop-related technologies, but could mean a combination of Hadoop and relational technologies and tools.  Many clients will build their solution using just Microsoft products, while others use a combination of both Microsoft and open source (see Microsoft Products vs Hadoop/OSS Products).  Building a data warehouse solution on the cloud or migrating to the cloud is often the best idea (see To Cloud or Not to Cloud – Should You Migrate Your Data Warehouse?) and you can often migrate to the cloud without retooling technology and skills.

I have seen Hadoop adopters typically falling into two broad categories: those who see it as a platform for big data innovation, and those who dream of it providing the same capabilities as an enterprise data warehouse but at a cheaper cost.  Big data innovators are thriving on the Hadoop platform especially when used in combination with relational database technologies, mining and refining data at volumes that were never before possible.  However, most of those who expected Hadoop to replace their enterprise data warehouse have been greatly disappointed, and in response have been building complex architectures that typically do not end up meeting their business requirements.

As far as reporting goes, choosing whether users report off of a data lake or off of a relational database and/or a cube is a balance.  One option is to give users the data quickly and have them do the work to join, clean, and master it (getting IT out of the way); the other is to have IT make multiple copies of the data and clean, join, and master it, which makes reporting easier for users but introduces a delay while they wait for IT to do all this.  The risk in the first case is users repeating the clean/join/master process, doing it wrong, and getting different answers to the same question; another risk is slower performance because the data is not laid out efficiently.  Most solutions incorporate both: power users and data scientists access the data quickly via the data lake, while all other users access the data in a relational database or cube, making self-service BI a reality (most users do not have the skills to access data in a data lake properly, or at all, so a cube is appropriate as it provides a semantic layer, among other advantages, to make report building very easy – see Why use a SSAS cube?).

Relational data warehouses continue to meet the information needs of users and continue to provide value.  Many people use them, depend on them, trust them, and don’t want them to be replaced with a data lake.  Data lakes offer a rich source of data for data scientists and self-service data consumers (“power users”) and serve analytics and big data needs well.  But not all data and information workers want to become power users.  The majority (at least 90%) continue to need well-integrated, systematically cleansed, easy-to-access relational data that includes a large body of time-variant history.  These people are best served with a data warehouse.

I can’t stress enough that if you need high-data-quality reports, you need to apply the exact same transformations to the same data to produce that report no matter what your technical implementation is.  Whether you call it a data lake or a data warehouse, or use an ETL tool or Python code, the development and maintenance effort is still there.  You need to avoid falling into the old mistake of thinking that the data lake does not need data governance.  It’s not a place with unicorns and fairies that will magically make all the data come out properly – a data lake is just a glorified file folder.

Here are some of the reasons why it is not a good idea to have a data lake in Hadoop as your data warehouse and forgo a relational data warehouse:

  • Hadoop does not provide for very fast query reads in all use cases.  While Hadoop has come a long way in this area, Hive LLAP and Spark SQL have limits on what type of queries they support (i.e. not having full support for ANSI SQL such as certain aggregate functions which limits the range of users, tools, and applications that can access Hadoop data) and it still isn’t quite at the performance level that a relational database can provide
  • Hadoop lacks a sophisticated query optimizer, in-database operators, advanced memory management, concurrency, dynamic workload management and robust indexing strategies and therefore performs poorly for complex queries
  • Hadoop does not have the ability to place “hot” and “cold” data on a variety of storage devices with different levels of performance to reduce cost
  • Hadoop is not relational, as all the data is in files in HDFS, so there is always a conversion process to convert the data to a relational format if a reporting tool requires it in a relational format
  • Hadoop is not a database management system.  It does not have functionality such as update/delete of data, referential integrity, statistics, ACID compliance, data security, and the plethora of tools and facilities needed to govern corporate data assets
  • There is no metadata stored in HDFS, so another tool such as a Hive Metastore needs to be used to store that, adding complexity and slowing performance.  And most metastores only work with a limited number of tools, requiring multiple metastores
  • Finding expertise in Hadoop is very difficult: The small number of people who understand Hadoop and all its various versions and products versus the large number of people who know SQL
  • Hadoop is super complex, with lots of integration with multiple technologies to make it work
  • Hadoop has many tools/technologies/versions/vendors (fragmentation), no standards, and it is difficult to make it a corporate standard.  See all the various Apache Hadoop technologies here
  • Some reporting tools don’t work against Hadoop
  • May require end-users to learn new reporting tools and Hadoop technologies to query the data
  • The newer Hadoop solutions (Tez, Spark, Hive LLAP etc) are still figuring themselves out.  Customers might not want to take the risk of investing in one of these solutions that may become obsolete (like MapReduce)
  • It might not save you much in costs: you still have to purchase hardware or pay for cloud consumption, support, licenses, training, and migration costs.  As relational databases scale up, support non-standard data types like JSON, and run functions written in Python, Perl, and Scala, it makes it even more difficult to replace them with a data lake as the migration costs alone would be substantial
  • If you need to combine relational data with Hadoop, you will need to move that relational data to Hadoop or invest in a technology such as PolyBase to query Hadoop data using SQL
  • Is your current IT experience and comfort level mostly around non-Hadoop technologies, like SQL Server?  Many companies have dozens or hundreds of employees that know SQL Server and not Hadoop so therefore would require a ton of training as Hadoop can be overwhelming

As far as performance, it is greatly affected by the use of indexing – Hive with LLAP (or not) doesn’t have indexing, so when you run a query, it reads all of the data (minus partition elimination).  Spark SQL, on the other hand, isn’t really an interactive environment – it’s fast-batch – so again, not going to see the performance users will expect from a relational database.  Also, a relational database still beats most competitors when performing complex, multi-way joins.  Given that most analytic queries are just that, a traditional data warehouse still might be the right choice.

From a security standpoint, you would need to integrate Hive LLAP or Spark with Apache Ranger to support granular security definition at the column level, including data masking where appropriate.

Concurrency is another thing to think about – Hadoop clusters have to get VERY large to support hundreds or thousands of concurrent connections – remember, these systems aren’t designed for interactive usage – they are optimized for batch and we are trying to shoehorn interactivity on top of that.

A traditional relational data warehouse should be viewed as just one more data source available to a user on some very large federated data fabric.  It is just pre-compiled to run certain queries very fast.  And a data lake is another data source for the right type of people.  A data lake should not be blocked from all users so you don’t have to tell everyone “please wait three weeks while I mistranslate your query request into a new measure and three new dimensions in the data warehouse”.

Most data lake vendors assume data scientists or skilled data analysts are the principal users of the data.  So, they can feed these skilled data users the raw data.  But most business users get lost in that morass.  So, someone has to model the data so it makes sense to business users.  In the past, IT did this, but now data scientists and data analysts can do it using powerful, self-service tools.  But the real question is: does a data scientist or analyst think locally or globally?  Do they create a model that supports just their use case, or do they think more broadly about how this data set can support other use cases?  So it may be best to continue to let IT model and refine the data inside a relational data warehouse so that it is suitable for different types of business users.

I’m not saying your data warehouse can’t consist of just a Hadoop data lake, as it has been done at Google, the NY Times, eBay, Twitter, and Yahoo.  But are you as big as them?  Do you have their resources?  Do you generate data like them?  Do you want a solution that only 1% of the workforce has the skillset for?  Is your IT department radical or is it conservative?

I think a relational data warehouse still has an important place: performance, ease of access, security, integration with reporting components, and concurrency all lean towards using it, especially when performing complex, multi-way joins that make up analytic queries which is the sweet spot for a traditional data warehouse.

The bottom line is a majority of end users need the data in a relational data warehouse to easily do self-service reporting off of it.  A Hadoop data lake should not be a replacement for a data warehouse, but rather should augment/complement a data warehouse.

More info:

Is Hadoop going to Replace Data Warehouse?

IS AZURE SQL DATA WAREHOUSE A GOOD FIT?

The Demise of the Data Warehouse

Counterpoint: The Data Warehouse is Still Alive

The Future of the Data Warehouse

Whither the Data Warehouse? Reflections From Strata NYC 2017

Big Data Solutions Decision Tree

Dimensional Modeling and Kimball Data Marts in the Age of Big Data and Hadoop

Hadoop vs Data Warehouse: Apples & Oranges?

HADOOP AND THE DATA WAREHOUSE: WHEN TO USE WHICH

Reference architecture for enterprise reporting in Azure


As I mentioned in my recent blog Use cases of various products for a big data cloud solution, with so many products it can be difficult to know the best products to use when building a solution.  When it comes to building an enterprise reporting solution, there is a recently released reference architecture to help you in choosing the correct products.  It will also help you get started quickly as it includes an implementation component in Azure.  The blog post announcement is here.

This reference architecture is focused solely on reporting, for those use cases where you will have a lot of users building dashboards via Power BI and operational reports via SSRS.  You can certainly expand the capabilities to add more features such as machine learning as well as enhancing the purpose of certain products, such as using Azure SQL Data Warehouse (SQL DW) to accept large ad-hoc queries from users.  The reference architecture is also for a batch-type environment (i.e. loading data every hour) and not a real-time environment (i.e. handling thousands of events per second).

Key features and benefits include:

  • Pre-built based on selected and stable Azure components proven to work in enterprise BI and reporting scenarios
  • Easily configured and deployed to an Azure subscription within a few hours
  • Bundled with software to handle all the operational essentials for a full-fledged production system
  • Tested end-to-end against large workloads
  • You can operationalize the infrastructure using the steps in the User’s Guide, and explore component level details from the Technical Guides.  Also, check out the FAQ

You can one-click deploy the infrastructure implementation from one of these two locations, which also go into details on each step in the above diagram:

The idea is you are deploying a base architecture, then you will modify as needed to fit all your needs.  But the hard work of choosing the right products and building the starting architecture is done for you, reducing your risk and shortening development time.  However, this does not mean you should use these chosen products in every situation.  For example, if you are comfortable with Hadoop technologies you can use Azure Data Lake Store and HDInsight instead of SQL DW, or use Azure Analysis Services (AAS) instead of SQL Server Analysis Services (SSAS) in a VM (AAS did not support VNETs when this reference architecture was created).  But for many who just need an enterprise reporting solution, this will do the job with little modification.

Note the Cortana Intelligence Gallery has many others solutions so be sure to check them out and avoid “reinventing the wheel”.

Data Virtualization vs. Data Movement


I have blogged about Data Virtualization vs Data Warehouse and wanted to blog on a similar topic: Data Virtualization vs. Data Movement.

Data virtualization integrates data from disparate sources, locations and formats, without replicating or moving the data, to create a single “virtual” data layer that delivers unified data services to support multiple applications and users.

Data movement is the process of extracting data from source systems and bringing it into the data warehouse and is commonly called ETL, which stands for extraction, transformation, and loading.

If you are building a data warehouse, should you move all the source data into the data warehouse, or should you create a virtualization layer on top of the source data and keep it where it is?

The most common scenario where you would want to do data movement is if you will aggregate/transform one time and query the results many times.  Another common scenario is if you will be joining data sets from multiple sources frequently and the performance needs to be super fast.  These turn out to be the scenarios for most data warehouse solutions.  But there could be cases where you will have many ad-hoc queries that don’t need to be super fast.  And you could certainly have a data warehouse that uses data movement for some tables and data virtualization for others.

Here is a comparison of both:

Other data virtualization benefits:

  • Provides complete data lineage from the source to the presentation layer
  • Additional data sources can be added without having to change transformation packages or staging tables
  • All data presented through the data virtualization software is available through a common SQL interface regardless of the source (i.e. flat files, Excel, mainframe, SQL Server, etc)

While this table gives some good benefits of data virtualization over data movement, it may not be enough to overcome the sacrifice in performance or other drawbacks listed at Data Virtualization vs Data Warehouse.  Also keep in mind the virtualization tool you choose may not support some of your data sources.

The better data virtualization tools provide such features as query optimization, query pushdown, and caching (i.e. Denodo) that may help with performance.  You may see tools with these features called “data virtualization” and tools without these features called “data federation” (i.e. PolyBase).
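
To illustrate why query pushdown matters, here is a toy Python sketch – no real virtualization product or API, just a simulated source – showing that with pushdown the filter travels to the source and only matching rows cross the wire, while without it the virtualization layer pulls everything and filters locally.

```python
# Toy illustration of query pushdown vs. local filtering in a data virtualization layer.
# The remote "source" is simulated in memory; the interesting number is how many
# rows have to travel from the source to the virtualization layer in each case.

SOURCE_ROWS = [
    {"customer_id": 1, "total_sales": 120.0},
    {"customer_id": 2, "total_sales": 15.0},
    {"customer_id": 3, "total_sales": 480.0},
]

def with_pushdown(min_sales):
    # The predicate runs at the source, so only matching rows cross the wire.
    shipped = [r for r in SOURCE_ROWS if r["total_sales"] >= min_sales]
    return shipped, len(shipped)           # rows transferred == rows returned

def without_pushdown(min_sales):
    # The whole table crosses the wire, then the virtualization layer filters it.
    shipped = list(SOURCE_ROWS)            # simulates pulling the full table
    result = [r for r in shipped if r["total_sales"] >= min_sales]
    return result, len(shipped)            # rows transferred == whole table

print(with_pushdown(100.0))    # 2 rows shipped
print(without_pushdown(100.0)) # 3 rows shipped for the same 2-row answer
```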

More info:

A FRESH LOOK AT DATA VIRTUALIZATION

Developing a Bi-Modal Logical Data Warehouse Architecture Using Data Virtualization

Conversations with Data Warehouse Experts – Podcast


In this podcast I talk with Mike Rabinovici of Dimodelo Solutions about data being the new currency, the importance of showing customers the art of the possible, and last but not least my go-to TV show.  Click here to listen.  Also check out the podcasts of other data warehouse experts.

Azure Data Architecture Guide (ADAG)


The Azure Data Architecture Guide has just been released!  Check it out: http://aka.ms/ADAG

Think of it as a menu or syllabus for data professionals.  What service should you use, why, and when would you use it.  I had a small involvement in its creation, but there were a large number of people within Microsoft and from 3rd parties that put it together over many months.  Hopefully you find this clears up some of the confusion caused by so many technologies and products.

“This guide presents a structured approach for designing data-centric solutions on Microsoft Azure.  It is based on proven practices derived from customer engagements.”

You can even download a PDF version (106 pages!).

The guide is structured around a basic pivot: The distinction between relational data and non-relational data:

Within each of these two main categories, the Data Architecture Guide contains the following sections:

  • Concepts. Overview articles that introduce the main concepts you need to understand when working with this type of data.
  • Scenarios. A representative set of data scenarios, including a discussion of the relevant Azure services and the appropriate architecture for the scenario.
  • Technology choices. Detailed comparisons of various data technologies available on Azure, including open source options.  Within each category, we describe the key selection criteria and a capability matrix, to help you choose the right technology for your scenario.

The table of contents looks like this:

Traditional RDBMS
  • Concepts
  • Scenarios

Big data and NoSQL
  • Concepts
  • Scenarios

Cross-cutting concerns


My latest presentations


I frequently present at user groups, and always try to create a brand new presentation to keep things interesting.  We all know technology changes so quickly so there is no shortage of topics!  There is a list of all my presentations with slide decks.  Here are the new presentations I created the past year:

Differentiate Big Data vs Data Warehouse use cases for a cloud solution

It can be quite challenging keeping up with the frequent updates to the Microsoft products and understanding all their use cases and how all the products fit together.  In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn’t, in order for you to position, design and deliver the proper adoption use cases for each with your customers.  We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS  as well as high-level concepts such as when to use a data lake.  We will also review the most common reference architectures (“patterns”) witnessed in customer adoption. (slides)

Introduction to Azure Databricks

Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service) that is a tool for curating and processing massive amounts of data, developing, training and deploying models on that data, and managing the whole workflow process throughout the project.  It is for those who are comfortable with Apache Spark as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and the Machine Learning Library (MLlib).  It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark. (slides)
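For a flavor of what working in Databricks looks like, here is a minimal PySpark sketch; the storage path and column names are hypothetical, and in a Databricks notebook the `spark` session already exists, so the builder line is only needed when running elsewhere.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Databricks notebook `spark` already exists; this is for running elsewhere.
spark = SparkSession.builder.appName("databricks-sketch").getOrCreate()

# Hypothetical path to raw files landed in a data lake / blob storage mount.
raw = spark.read.json("/mnt/datalake/raw/clickstream/")

# Curate: filter, derive a column, and aggregate with Spark SQL functions.
daily = (raw
         .filter(F.col("event_type") == "purchase")
         .withColumn("event_date", F.to_date("event_timestamp"))
         .groupBy("event_date")
         .agg(F.count(F.lit(1)).alias("purchases"),
              F.sum("amount").alias("revenue")))

# Persist the curated result back to the lake in a columnar format.
daily.write.mode("overwrite").parquet("/mnt/datalake/curated/daily_purchases/")
```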

Azure SQL Database Managed Instance

Azure SQL Database Managed Instance is a new flavor of Azure SQL Database that is a game changer.  It offers near-complete SQL Server compatibility and network isolation to easily lift and shift databases to Azure (you can literally back up an on-premises database and restore it into an Azure SQL Database Managed Instance).  Think of it as an enhancement to Azure SQL Database that is built on the same PaaS infrastructure and maintains all its features (i.e. active geo-replication, high availability, automatic backups, database advisor, threat detection, intelligent insights, vulnerability assessment, etc.) but adds support for databases up to 35TB, VNET, SQL Agent, cross-database querying, replication, etc.  So, you can migrate your databases from on-prem to Azure with very little migration effort, which is a big improvement over the current Singleton or Elastic Pool flavors, which can require substantial changes. (slides)
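One way to picture the lift and shift: take a native backup to Azure Blob storage and restore it on the Managed Instance with plain T-SQL.  The sketch below uses Python with pyodbc; the server name, storage URL, and database are hypothetical, and a storage CREDENTIAL is assumed to already exist on the instance.

```python
# A sketch of a native restore into Managed Instance; names and URLs are
# placeholders, and a CREDENTIAL for the storage container is assumed to exist.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myinstance.abc123.database.windows.net;"  # hypothetical instance
    "DATABASE=master;UID=myadmin;PWD=mypassword",
    autocommit=True,  # RESTORE cannot run inside a user transaction
)

restore_sql = """
RESTORE DATABASE [AdventureWorks]
FROM URL = 'https://mystorage.blob.core.windows.net/backups/AdventureWorks.bak'
"""

cursor = conn.cursor()
cursor.execute(restore_sql)
# RESTORE emits informational messages; drain the result sets until done.
while cursor.nextset():
    pass
print("Restore completed.")
```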

What’s new in SQL Server 2017

Covers all the new features in SQL Server 2017, as well as details on upgrading and migrating to SQL Server 2017 or to Azure SQL Database. (slides)
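As a small taste of the T-SQL surface-area additions covered in that session, here is a sketch using STRING_AGG and TRIM, both introduced in SQL Server 2017 (Python with pyodbc; the connection string and dbo.Customers table are hypothetical).

```python
# Small sketch of two T-SQL functions introduced in SQL Server 2017;
# the connection string and dbo.Customers table are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=DemoDb;Trusted_Connection=yes"
)

query = """
SELECT City,
       STRING_AGG(TRIM(CustomerName), ', ')        -- STRING_AGG and TRIM are
         WITHIN GROUP (ORDER BY CustomerName)      -- new in SQL Server 2017
         AS CustomersInCity
FROM dbo.Customers
GROUP BY City;
"""

for city, customers in conn.cursor().execute(query):
    print(f"{city}: {customers}")
```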

Microsoft Data Platform – What’s included

The pace of Microsoft product innovation is so fast that even though I spend half my days learning, I struggle to keep up. And as I work with customers I find they are often in the dark about many of the products that we have since they are focused on just keeping what they have running and putting out fires. So, let me cover what products you might have missed in the Microsoft data platform world. Be prepared to discover all the various Microsoft technologies and products for collecting data, transforming it, storing it, and visualizing it.  My goal is to help you not only understand each product but understand how they all fit together and their proper use cases, allowing you to build the appropriate solution that can incorporate any data in the future no matter the size, frequency, or type. Along the way we will touch on technologies covering NoSQL, Hadoop, and open source. (slides)

Learning to present and becoming good at it

Have you been thinking about presenting at a user group?  Are you being asked to present at your work?  Is learning to present one of the keys to advancing your career?  Or do you just think it would be fun to present but you are too nervous to try it?  Well, take the first step to becoming a presenter by attending this session and I will guide you through the process of learning to present and becoming good at it.  It’s easier than you think!  I am an introvert and was deathly afraid to speak in public.  Now I love to present and it’s actually my main function in my job at Microsoft.  I’ll share with you the journey that led me to speak at major conferences and the skills I learned along the way to become a good presenter and to get rid of the fear.  You can do it! (slides)

Microsoft cloud big data strategy

Think of big data as all data, no matter what the volume, velocity, or variety.  The simple truth is a traditional on-prem data warehouse will not handle big data.  So what is Microsoft’s strategy for building a big data solution?  And why is it best to have this solution in the cloud?  That is what this presentation will cover.  Be prepared to discover all the various Microsoft technologies and products for collecting data, transforming it, storing it, and visualizing it.  My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your company’s big data solution. (slides)

Choosing technologies for a big data solution in the cloud

Has your company been building data warehouses for years using SQL Server?  And are you now tasked with creating or moving your data warehouse to the cloud and modernizing it to support “Big Data”?  What technologies and tools should you use?  That is what this presentation will help you answer.  First we will level-set on what big data is and cover related definitions, discuss questions to ask to help decide which technologies to use, go over the new technologies to choose from, and then compare the pros and cons of those technologies.  Finally we will show you common big data architecture solutions and help you to answer questions such as: Where do I store the data?  Should I use a data lake?  Do I still need a cube?  What about Hadoop/NoSQL?  Do I need the power of MPP?  Should I build a “logical data warehouse”?  What is this lambda architecture?  And we’ll close by showing some architectures of real-world customer big data solutions.  Come to this session to get started down the path to making the proper technology choices in moving to the cloud. (slides)

It’s all about the use cases


There is no better way to see the art of the possible with the cloud than through use cases/customer stories and sample solutions/architectures.  Many of these are domain-specific, which resonates best with business decision makers:

Use cases/customer stories

Microsoft IoT customer stories: Explore Internet of Things (IoT) examples and IoT use cases to learn how Microsoft IoT is already transforming your industry.  The industries covered are Manufacturing, Smart Infrastructure, Transportation, Retail, and Healthcare.

Customer stories: Dozens of customer stories of solutions built in Azure that you can filter by language, industry, product, organization size, and region.

Case studies: See the amazing things people are doing with Azure broken out by industry, product, solution, and customer location.

Sample solutions/architectures

Azure solution architectures: These architectures help you design and implement secure, highly available, performant, and resilient solutions on Azure.

Pre-configured AI solutions: These serve as a great starting point when building an AI solution.  Broken out by Retail, Manufacturing, Banking, and Healthcare.

Internet of Things (IoT) solutions: Great IoT sample solutions such as: connected factory, remote monitoring, predictive maintenance, connected field service, connected vehicle, and smart buildings.

Public preview of Azure SQL Database Managed Instance


Microsoft has announced the public preview of Azure SQL Database Managed Instance.  I blogged about this before.  This will lead to a tidal wave of on-prem SQL Server database migrations to the cloud.  In summary:

Managed Instance is an expansion of the existing SQL Database service, providing a third deployment option alongside single databases and elastic pools. It is designed to enable database lift-and-shift to a fully-managed service, without re-designing the application.  SQL Database Managed Instance provides the broadest SQL Server engine compatibility and native virtual network (VNET) support so you can migrate your SQL Server databases to SQL Database without changing your apps.  It combines the rich SQL Server surface area with the operational and financial benefits of an intelligent, fully-managed service.

Two other related items that are available:

  • Azure Hybrid Benefit for SQL Server on Azure SQL Database Managed Instance. The Azure Hybrid Benefit for SQL Server is an Azure-based benefit that enables customers to use their SQL Server licenses with Software Assurance to save up to 30% on SQL Database Managed Instance. Exclusive to Azure, the hybrid benefit will provide an additional benefit for highly-virtualized Enterprise Edition workloads with active Software Assurance: for every 1 core a customer owns on-premises, they will receive 4 vCores of Managed Instance General Purpose (see the sizing sketch after this list). This makes moving virtualized applications to Managed Instance highly cost-effective.
  • Database Migration Service for Azure SQL Database Managed Instance. Using the fully-automated Azure Database Migration Service (DMS), customers can easily lift and shift their on-premises SQL Server databases to a SQL Database Managed Instance. DMS is a fully managed, first-party Azure service that enables seamless and frictionless migrations from heterogeneous database sources to Azure database platforms with minimal downtime. It will provide customers with assessment reports that guide them through the changes required prior to performing a migration. When the customer is ready, DMS will perform all the steps associated with the migration process.
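To make the core-to-vCore exchange above tangible, here is a tiny sizing sketch of the arithmetic.  The 1:4 ratio for highly-virtualized Enterprise Edition cores with active Software Assurance comes from the announcement; any pricing you layer on top would be your own assumption.

```python
# Rough sizing sketch for the Azure Hybrid Benefit exchange described above.
# Ratio per the announcement: 1 on-premises EE core with active SA
# -> 4 vCores of Managed Instance General Purpose.
EE_CORE_TO_GP_VCORE_RATIO = 4

def entitled_gp_vcores(owned_ee_cores_with_sa: int) -> int:
    """vCores of Managed Instance General Purpose covered by the benefit."""
    return owned_ee_cores_with_sa * EE_CORE_TO_GP_VCORE_RATIO

if __name__ == "__main__":
    for cores in (8, 16, 32):
        print(f"{cores} EE cores with SA -> {entitled_gp_vcores(cores)} GP vCores")
```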

More info:

Migrate your databases to a fully managed service with Azure SQL Database Managed Instance

What is Azure SQL Database Managed Instance?

Video Introducing Azure SQL Database Managed Instance

Azure SQL Database Managed Instance – the Good, the Bad, the Ugly

Is the traditional data warehouse dead? webinar


As a follow-up to my blog Is the traditional data warehouse dead?, I will be doing a webinar on that very topic tomorrow (March 27th) at 11am EST for the Agile Big Data Processing Summit, which I hope you can join.  Details can be found here.  The abstract is:

Is the traditional data warehouse dead?

With new technologies such as Hive LLAP or Spark SQL, do you still need a data warehouse or can you just put everything in a data lake and report off of that? No! In the presentation, James will discuss why you still need a relational data warehouse and how to use a data lake and an RDBMS data warehouse to get the best of both worlds.

James will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. He’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution, and he will put it all together by showing common big data architectures.

Webinar: Is the traditional data warehouse dead?
