James Serra's Blog

Azure Archive Blob Storage


Last week Microsoft released a public preview of a new service called Azure Archive Blob Storage, offering customers a lower-cost cloud storage solution for rarely accessed data.  This allows for storage tiering, where organizations can place their critical data on expensive, high-performance storage and then move it to lower-cost tiers as it is accessed less frequently over time.

Last year Microsoft introduced Azure Cool Blob storage, which costs customers a penny per GB per month in some Azure regions.  Now, users have another, lower-cost option in Azure Archive Blob Storage, along with new Blob-Level Tiering data lifecycle management capabilities.  So there are now three Azure Blob storage tiers: Hot, Cool, and Archive.

Azure Archive Blob Storage costs 0.18 cents per GB per month when the service is delivered through the East US 2 region (for comparison, in the same region Hot is 1.8 cents and Cool is 1.0 cents per GB per month).  Customers can expect a 99 percent availability SLA (service level agreement) when the service makes its way out of the preview stage.
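
To put those per-GB prices in perspective, here is a rough back-of-the-envelope comparison (a sketch only: it uses the East US 2 preview prices quoted above and ignores transaction, retrieval, and early-deletion charges):

```python
# Rough monthly storage-cost comparison using the East US 2 preview prices
# quoted above, in cents per GB per month. Transaction, retrieval, and
# early-deletion charges are ignored in this sketch.
PRICE_CENTS_PER_GB = {"hot": 1.8, "cool": 1.0, "archive": 0.18}

def monthly_cost_usd(size_tb: float, tier: str) -> float:
    """Approximate monthly storage cost in USD for size_tb terabytes."""
    size_gb = size_tb * 1024
    return size_gb * PRICE_CENTS_PER_GB[tier] / 100  # cents -> dollars

for tier in ("hot", "cool", "archive"):
    print(f"100 TB in {tier:7}: ~${monthly_cost_usd(100, tier):,.0f}/month")
# hot ~ $1,843, cool ~ $1,024, archive ~ $184 -- roughly a 10x saving over hot
```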

Complementing the new service is a new Blob-level Tiering feature that will allow customers to change the access tier of blob storage objects among Hot, Cool or Archive.  Also in preview, it enables users to match costs to usage patterns without moving data between accounts.

Archive storage has the lowest storage cost and highest data retrieval costs compared to hot and cool storage.

While a blob is in archive storage, it cannot be read, copied, overwritten, or modified.  Nor can you take snapshots of a blob in archive storage.  However, you may use existing operations to delete, list, get blob properties/metadata, or change the tier of your blob.  To read data in archive storage, you must first change the tier of the blob to hot or cool.  This process is known as rehydration and can take up to 15 hours to complete for blobs less than 50 GB.  Additional time required for larger blobs varies with the blob throughput limit.

During rehydration, you may check the “archive status” blob property to confirm if the tier has changed.  The status reads “rehydrate-pending-to-hot” or “rehydrate-pending-to-cool” depending on the destination tier.  Upon completion, the “archive status” blob property is removed, and the “access tier” blob property reflects the hot or cool tier.
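
To make the tier change and rehydration check concrete, here is a minimal sketch using the azure-storage-blob Python SDK (the SDK shown here post-dates the preview-era tooling, and the connection string, container/blob names, and 10-minute polling interval are illustrative assumptions):

```python
import time
from azure.storage.blob import BlobServiceClient

# Placeholder connection string, container, and blob names.
service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="backups", blob="2017/archive-me.bak")

# Send the blob to the Archive tier (lowest storage cost, highest retrieval cost).
blob.set_standard_blob_tier("Archive")

# Later: rehydrate by setting the tier back to Hot (or Cool)...
blob.set_standard_blob_tier("Hot")

# ...then poll the archive status until rehydration completes (this can take hours).
while True:
    props = blob.get_blob_properties()
    if props.archive_status is None:  # rehydration finished
        print("Rehydrated; access tier is now", props.blob_tier)
        break
    print("Still rehydrating:", props.archive_status)  # e.g. rehydrate-pending-to-hot
    time.sleep(600)  # check every 10 minutes (illustrative)
```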

Example usage scenarios for the archive storage tier include:

  • Long-term backup, archival, and disaster recovery datasets
  • Original (raw) data that must be preserved even after it has been processed into its final usable form (for example, raw media files after transcoding into other formats)
  • Compliance and archival data that needs to be stored for a long time and is hardly ever accessed (for example, security camera footage, old X-rays/MRIs for healthcare organizations, audio recordings, and transcripts of customer calls for financial services)

More info:

Announcing the public preview of Azure Archive Blob Storage and Blob-Level Tiering

Microsoft Unveils Cost-Cutting Archival Cloud Storage Option


Distributed Writes


In SQL Server, scaling out reads (i.e. using Active secondary replicas via AlwaysOn Availability Groups) is a lot easier than scaling out writes.  So what are your options when you have a tremendous amount of writes that scaling up will not handle, no matter how big your server is?  There are a number of options that allow you to write to many servers (instead of writing to one master server) that I’ll call distributed writes.  Here are some ideas:

The one option out of all the above that does not require coding and can support a large number of writes per second is Azure Cosmos DB.  All the other options can require significant coding and/or can only handle a limited number of writes per second.  This is because Cosmos DB uses documents (JSON files) in which all the needed information is included, so no joins are required and documents can be spread across multiple servers (see Partition and scale in Azure Cosmos DB and A technical overview of Azure Cosmos DB).  This is opposed to relational databases, which use multiple tables that must be joined; if the tables are on different nodes, the joins cause a lot of data shuffling and hurt performance.
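
To make the document model concrete, here is a rough sketch of writing self-contained JSON documents to a partitioned Cosmos DB container with the azure-cosmos Python SDK (the account, database, container, partition key, and throughput values are illustrative assumptions, not from any particular workload):

```python
from azure.cosmos import CosmosClient, PartitionKey

# Illustrative endpoint/key and names; a real app would load these from config.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists("sales")

# The partition key spreads documents across physical partitions (servers),
# which is what lets writes scale out horizontally.
orders = db.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
    offer_throughput=10000,  # provisioned RU/s, shared across partitions
)

# Each document is self-contained (customer, line items, totals), so no joins
# are needed at read time and the write touches only one partition.
orders.upsert_item({
    "id": "order-1001",
    "customerId": "cust-42",
    "items": [{"sku": "widget", "qty": 3, "price": 9.99}],
    "total": 29.97,
})
```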

To go into greater detail on the benefits of Cosmos DB over SQL Server for distributed writes:

  • Consistency
    • Peer2Peer SQL Replication introduces pains around data consistency and conflict resolution
  • Availability
    • Sharding with SQL introduces pains around maintaining availability when increasing/decreasing the degree of scale-out.  Frequently, downtime is involved due to the need to re-balance data across shards
    • SQL requires rigid schemas and indices to be defined upfront.  Every time schema and index updates are needed, you incur a heavy operational cost of running CREATE INDEX and ALTER TABLE scripts across all database shards and replicas.  Furthermore, this introduces availability issues while schemas are being altered.
  • Handling sustained heavy write ingestion
    • Putting queueing mechanisms in front of SQL only gives you a buffer for handling spikes in writes, but at the end of the day the database itself needs to support sustained heavy write ingestion in order to consume the buffered events.  What happens if events come into the buffer faster than you can drain it?  You will need a database specifically designed for heavy write ingestion

Azure Cosmos DB solves these by:

  • Providing five well-defined consistency models to help developers tune the right consistency vs. performance tradeoff for their scenario
  • Scaling on demand and supporting a flexible data model while maintaining high availability (99.99% availability SLA).  Scale-out and partition management are taken care of by the service on behalf of the user
  • Using log-structured techniques to be a truly latch-free database that can sustain heavy write ingestion with durable persistence

In the end, eliminating schema management, index management, and JOINs is a necessary byproduct of the scale-out that Azure Cosmos DB provides.
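
As a small illustration of the consistency-model point above, the consistency level is just a setting when you connect with the azure-cosmos Python SDK (endpoint and key are placeholders; note the account itself also has a default consistency level, and a client can only request a weaker level than that default, not a stronger one):

```python
from azure.cosmos import CosmosClient

# Placeholder endpoint and key. Each client connection picks one of the five
# consistency models; "Session" is the typical default, "Eventual" favors
# latency/availability, "Strong" favors correctness.
client = CosmosClient(
    "https://<account>.documents.azure.com:443/",
    credential="<key>",
    consistency_level="Eventual",  # Strong | BoundedStaleness | Session | ConsistentPrefix | Eventual
)
```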

After the initial post of this blog, I received the question “Why not just use SQL 2016 in-Memory tables for heavy write systems (info)?” and received a great reply from a Cosmos DB product manager:

SQL in-memory is only viable when:

  • Request and data volume are small enough to fit on a single machine.  You still have the fundamental problem of hard limits due to the scale-up model.
  • The scenario does not need durability, reliability, or availability – which are requirements for >99% of mission-critical customer scenarios.

Durability

  • If data is kept only in memory, you experience data loss upon any intermittent failure that requires the computer to restart (e.g. OS crash, power outage, the OS deciding to reboot for an update, etc.).  In order for data to be durable, it needs to be persisted to disk.  In order to offer resiliency against disk failures, you will need to replicate to a number of disks
  • For durable scenarios, memory only acts as a buffer to absorb spikes.  In order to achieve sustained write ingestion, you will need to flush the buffer as fast as you fill it.  Now you have a bottleneck on disk I/O unless you scale out
  • This is why they immediately have to address that this is for “applications where durability is not required”; durability is a requirement for >99% of data scenarios
  • Data loss and data corruption should be treated as cardinal sins for any data product

Scale

  • This model is still a scale-up model, in which there are many hard limits
  • What happens when the data volume doesn’t fit in memory (which tends to be very small relative to disk storage)?  You need to scale out
  • What happens when the request volume exceeds what memory bandwidth can handle?  You need to scale out
  • This is why the throughput numbers in the blog are orders of magnitude smaller than what customers are doing every day on Cosmos DB, and why storage size is quietly ignored

Expensive

  • Memory is ~100x more expensive than SSD.  Storing data on SSD in a scale-out system will yield not only better scale and durability characteristics but also much lower costs for any large-scale scenario

More info:

Database Sharding

Microsoft Ignite Announcements


Many product announcements were made this week at Microsoft Ignite, and I wanted to give a quick overview of all the data platform related announcements:

  • SQL Server 2017 on Linux, Windows, and Docker, generally available on October 2nd.  SQL Server 2017 is being released simultaneously for Windows and various flavors of Linux: Red Hat Enterprise Linux 7.3, SUSE Linux Enterprise Server 12, Ubuntu and Docker. The official Docker image is based on Ubuntu 16.04.  The performance of SQL Server on Linux vs Windows is “basically the same”.  However, not everything has been ported. There are no Reporting Services or Analysis Services, nor Machine Learning Services, transactional replication, Stretch DB, or File Table (see Unsupported features and services).  Management tools remain for the most part Windows only, though command-line tools work.  The major new features are graph query support, Python in Machine Learning Services, SSIS scale-out, and Adaptive Query Processing and Automatic Tuning for better query optimization.  Learn more and see What’s new in SQL Server 2017
  • Azure Database Migration Service (DMS) and Azure SQL Database Managed Instance, public preview.  The new Managed Instance offering within SQL Database offers near-complete SQL Server compatibility and network isolation for the easiest lift and shift to Azure.  DMS is a fully managed, first-party Azure service that enables customers to easily migrate their on-premises SQL Server databases to Azure SQL Database Managed Instance and SQL Server in Azure Virtual Machines with minimal to no downtime.  Customers can maximize existing license investments with discounted rates on Managed Instance using a new Azure Hybrid Benefit for SQL Server.  Sign up for news on availability
  • Azure Machine Learning, new capabilities in public preview.  Updates connect every element of the data science process with enhanced productivity and collaboration for AI developers and data scientists at any scale.  Enables them to start building right away with their choice of tools and frameworks.  The updated platform includes an enhanced data cleansing and prepping tool called ML Workbench to start the modeling process sooner.  It is a client application that runs on Windows and Mac, is targeted at data scientists who are not users of Visual Studio, and integrates with popular open source data science toolkits such as Python scikit-learn, Jupyter Notebooks, and Matplotlib.  It integrates with the cloud by seamlessly moving the heavy lifting to GPU-powered VMs in Azure.  Other new capabilities include the Azure Machine Learning Experimentation service, which allows developers and data scientists to increase their rate of experimentation, and the Model Management service, which provides deployment, hosting, versioning, management, and monitoring for models in Azure, on-premises, and on IoT Edge devices.  These new features will help data scientists develop, deploy, and manage machine learning and AI models at any scale wherever data lives: in the cloud, on-premises, and at the edge.  Learn more on the Azure Machine Learning page and Diving deep into what’s new with Azure Machine Learning
  • Microsoft Cognitive Services updates.  Includes general availability of Text Analytics API, a cloud-based service for language processing such as sentiment analysis, key phrase extraction and language detection.  In October, we will also make generally available Bing Custom Search to create customized search experience for a section of the web, and Bing Search APIs v7 for searching the entire web for more relevant results using Bing Web, News, Video & Image search.  Read the announcement blog post
  • Announcing the preview of Machine Learning Services with R support in Azure SQL Database.  You can evaluate this preview functionality in any server/database created in the West Central US Region.  More info
  • Azure Data Factory (ADF) – announcing new capabilities in public preview.  These new capabilities in ADF will enable you to build hybrid data integration at scale.  Now you can create, schedule, and orchestrate your ETL/ELT workflows, wherever your data lives, in the cloud or on any self-hosted network.  Meet security and compliance needs while taking advantage of extensive capabilities and paying only for what you use.  Accelerate your data integration with multiple data source connectors natively available in-service.  SQL Server Integration Services (SSIS) customers will benefit from easily lifting their SSIS packages into the cloud using new managed SSIS hosting capabilities in Data Factory.  We have taken the first steps to separate Control Flow and Data Flow within ADF to provide greater control over complex orchestrations that now facilitate looping, branching, and conditional structures within Control Flow.  We have added new flexibility to scheduling by enabling triggering with wall-clock timers or on-demand via event generation.  Parameters can now be defined and passed while invoking pipelines to enable incremental data loads.  If you want to move your SSIS workloads, you can create a data factory version 2, and provision an Azure-SSIS Integration Runtime (IR).  The Azure-SSIS IR is a fully managed cluster of Azure VMs (nodes) dedicated to run your SSIS packages in the cloud.  For step-by-step instructions, see the tutorial: deploy SSIS packages to Azure.  Full details of the release and features can be found on the Azure Data Factory service page. We encourage you to try these new capabilities, available at public preview pricing
  • Announcing the preview for the Azure Data Box.  A hardware appliance that companies can use to load their data for shipping to the closest Microsoft Azure data center.  The 45-lb box, which is tamper proof, holds up to 100 terabytes (TB) of data.  It plugs into a corporate network for downloads, and then into Azure’s own high-speed networks to upload its contents.  Companies will be able to rent it, fill it, and ship it while tracking its progress.  Data on the device will be encrypted throughout the journey.  More info
  • Introducing Azure Availability Zones for resiliency and high availability.  Availability Zones are fault-isolated locations within an Azure region, providing redundant power, cooling, and networking.  Availability Zones allow customers to run mission-critical applications with higher availability and fault tolerance to datacenter failures.  More info
  • Public preview: Virtual network service endpoints for Azure Storage and SQL Database.  You can now secure Azure Storage and Azure SQL Database to only your virtual networks, by using virtual network service endpoints.  Endpoints provide a direct connection from your virtual network to the Azure services, extending your virtual network’s private address space and identity to the services.  Traffic from your virtual network to the services will always remain on the Microsoft Azure network backbone.  More info
  • Intelligent insights for Azure SQL Database.  Azure SQL Database built-in intelligence continuously monitors database usage through artificial intelligence and detects disruptive events that cause poor performance.  Once detected, a detailed analysis is performed generating a diagnostic log with intelligent assessment of the issue.  This assessment consists of a root cause analysis of the database performance issue and where possible recommendations for performance improvements.  More info
  • Read replicas for Azure Database for MySQL.  Read replicas will allow customers using MySQL on-premises or on other cloud service providers to create replicas of their instance in Azure.  They can then choose to upgrade the replica to master in Azure Database for MySQL, and connect their apps directly to the new database instance.  If you are interested in understanding the functionality of this private preview, visit the Azure blog for more information
  • Renamed R Server to Machine Learning Server.  Announced was the renaming of Microsoft R Server to Microsoft Machine Learning Server and SQL Server R Services to SQL Server Machine Learning Services.  The additional language support aligns the Advanced Analytics workload to Machine Learning capabilities and focus on AI.  With Python support in addition to R and Microsoft ML libraries we enhance Machine Learning capabilities and offer the ability to develop new intelligent applications combining the best of open source and enterprise capabilities of SQL Server 2017.  More info
  • Azure SQL Database: Vulnerability Assessment.  SQL Vulnerability Assessment (currently in preview) is an easy to configure tool that can discover, track, and remediate potential database vulnerabilities.  Use it to proactively improve your database security.  More info
  • The Power BI team announced a much-awaited feature: automatic updates to the Power BI Desktop.  Through the Windows Store, you can now install the Power BI Desktop once and get updates automatically every month.  Read this blog post on http://aka.ms/biatmicrosoft to learn more
  • Faster compute optimized performance tier for Azure SQL Data Warehouse.  The compute optimized performance tier brings several benefits to your analytics workloads.  The first benefit can be seen through dramatically improved query performance.  Individual query execution times have improved by as much as 10x.  We’ve also seen some fantastic results with customer workloads and benchmarks where queries are completing twice as fast on average.  The compute and storage scalability has also been dramatically increased with this performance tier.  You can now provision 5x the computing power and store an unlimited amount of columnar data, empowering you to run your largest and most complex analytics workloads.  More info
  • Azure free account, now available.  A best-in-industry offer, the Azure free account helps customers try Azure.  It comes with 12-months free access to compute, storage, database, and networking services, along with 25+ always-free services, including Azure App Service and Functions.  It also includes a $200 credit allowing customers to try any Azure product for the first 30 days. More information at azure.com/free and Azure Free Account FAQ
  • Azure Stack, now shipping through Dell EMC, HPE, and Lenovo.  Azure Stack is an extension of Azure, allowing customers to uniquely meet hybrid requirements like compliance, latency, and true consistency as a part of their hybrid cloud strategy.  Cisco and Wortmann will start taking orders soon.  Customers can also buy Azure Stack as a managed service from Avanade, Rackspace, and several MSP partners.  Azure Stack certification for IT Professionals materials are available now, and certifications exams will start Q1 2018.  More information on azure.com/azurestack
  • Azure Reserved Virtual Machine Instances.  When available later in 2017, customers will be able to reserve virtual machines on Azure for a one- or three-year term with significant cost savings of up to 82% over pay-as-you-go prices when combined with Azure Hybrid Benefit and up to 72% on all VMs.  Customers select the VM type, term, and datacenter region, so the compute resources are available when and where needed.  Improve budgeting with a single up-front payment while maintaining the flexibility to exchange or cancel at any time.  Details on Azure.com
  • Native integration between Azure Cosmos DB and Azure Functions. We’re bringing the power of Azure Cosmos DB to our serverless offering, Azure Functions.  With this integration, developers can write serverless apps backed by Cosmos DB, with just a few lines of code.  They can innovate faster by reacting in real-time to changes happening in the database to drive more engaging and personalized customer experiences.  Using Azure Functions and Azure Cosmos DB, customers can create and deploy event-driven, planet-scale serverless apps with extremely low-latency access against very rich data.  Read the blog
  • GA of HDInsight Interactive Query (Hive LLAP).  This is an Azure HDInsight cluster type.  It supports in-memory caching, which makes Hive queries faster and much more interactive.  More info
  • Microsoft is now offering Blob storage accounts with up to 5PB (petabytes) of maximum capacity, a 10x increase.  Both incoming and outgoing data can now move at up to 50Gbps (gigabits per second) and users can expect 50,000 TPS/IOPS (transactions per second/input output operations per second) performance, a 2.5x jump.  More info
  • Announcing new Azure VM sizes for more cost-effective database workloads.  We are excited to announce the latest versions of our most popular VM sizes (DS, ES, GS, and MS), which constrain the vCPU count to one half or one quarter of the original VM size, while maintaining the same memory, storage and I/O bandwidth. We have marked these new VM sizes with a suffix that specifies the number of active vCPUs to make them easier for you to identify.  For example, the current VM size Standard_GS5 comes with 32 vCPUs, 448GB mem, 64 disks (up to 256 TB), and 80,000 IOPs or 2 GB/s of I/O bandwidth. The new VM sizes Standard_GS5-16 and Standard_GS5-8 come with 16 and 8 active vCPUs, respectively, while maintaining the rest of the specs of the Standard_GS5 with regard to memory, storage, and I/O bandwidth.  More info
  • New in Stream Analytics: Output to Azure Functions, built-in anomaly detection, etc.  Announced the preview of several new and compelling capabilities in Azure Stream Analytics.  These include built-in inline machine learning based anomaly detection, egress to Azure Functions, support for compressed data formats, JavaScript user-defined aggregates, and support for CI/CD in Visual Studio tooling.  These new features will start rolling out over the course of the next several weeks.  More info
  • Announcing Azure Migrate.  A new service that provides the guidance, insights, and mechanisms needed to assist you in migrating on-premises virtual machines and servers to Azure.  More info

What’s new in SQL Server 2017 presentation


I just uploaded a new presentation called “What’s new in SQL Server 2017”.  It covers all the new features in SQL Server 2017 (which went GA this past Monday), as well as details on upgrading and migrating to SQL Server 2017 or to Azure SQL Database.  Check out the slides (note there is a “download” button if you wish to have the PowerPoint presentation).

This is a good time to mention my other related presentations: Should I move my database to the cloud?, Implement SQL Server on an Azure VM, Introducing Azure SQL Database, and HA/DR options with SQL Server in Azure and hybrid.

Hope you find these useful!

Use cases of various products for a big data cloud solution


There is a tremendous number of cloud-based Microsoft products for building big data solutions.  It’s great that there are so many products to choose from, but it does lead to confusion about which products are best for particular use cases and how all the products fit together.  My job as a Microsoft Cloud Solution Architect is to help companies learn about all the products and to help them choose the best ones for building their solution.  Based on a recent architect design session with a customer, I wanted to list the products and use cases that we discussed for their desire to build a big data solution in the cloud (focusing on compute and data storage products and not ingestion/ETL, real-time streaming, advanced analytics, or reporting; also, only PaaS solutions are included – no IaaS):

  • Azure Data Lake Store (ADLS): A high-throughput distributed file system built for cloud-scale storage.  It is capable of ingesting any data type, from videos and images to PDFs and CSVs.  This is the “landing zone” for all data.  It is HDFS compliant, meaning all products that work against HDFS will also work against ADLS.  Think of ADLS as the place all other products will use as the source of their data.  All data will be sent here including on-prem data, cloud-based data, and data from IoT devices.  This landing zone is typically called the Data Lake and there are many great reasons for using a Data Lake (see Data lake details and Why use a data lake? and the presentation Big data architectures and the data lake)
  • Azure HDInsight (HDI):  Under the covers, HDInsight is simply Hortonworks HDP 2.6 that contains 22 open source products such as Hadoop (Common, YARN, MapReduce), Spark, HBase, Storm, and Kafka.  You can use any of those or install any other open source products that can all use the data in ADLS (HDInsight just connects to ADLS and uses that as its storage source)
  • Azure Data Lake Analytics (ADLA): This is a distributed analytics service built on Apache YARN that lets you submit a job to the service where the service will automatically run it in parallel in the cloud and scale to process data of any size.  Included with ADLA is U-SQL, which has a scalable distributed query capability enabling you to efficiently analyze data whether it be structured (CSV) or not (images) in the Azure Data Lake Store and across Azure Blob Storage, SQL Servers in Azure, Azure SQL Database and Azure SQL Data Warehouse.  Note that U-SQL supports batch queries and does not support interactive queries, and does not handle persistence or indexing.
  • Azure Analysis Services (AAS): This is a PaaS for SQL Server Analysis Services (SSAS).  It allows you to create an Azure Analysis Services Tabular Model (i.e. cube) which allows for much faster query and reporting processing compared to going directly against a database or data warehouse.  A key AAS feature is vertical scale-out for high availability and high concurrency.  It also creates a semantic model over the raw data to make it much easier for business users to explore the data.  It pulls data from the ADLS and aggregates it and stores it in AAS.  The additional work required to add a cube to your solution involves the time to process the cube and slower performance for ad-hoc queries (not pre-determined), but there are additional benefits of a cube – see Why use a SSAS cube?
  • Azure SQL Data Warehouse (SQL DW): This is a SQL-based, fully-managed, petabyte-scale cloud data warehouse.  It’s highly elastic, and it enables you to set up in minutes and scale capacity in seconds.  You can scale compute and storage independently, which allows you to burst compute for complex analytical workloads.  It is an MPP technology that shines when used for ad-hoc queries in relational format.  It requires data to be copied from ADLS into SQL DW but this can be done quickly using PolyBase.  Compute and storage are separated so you can pause SQL DW to save costs (see SQL Data Warehouse reference architectures)
  • Azure Cosmos DB: This is a globally distributed, multi-model (key-value, graph, and document) database service.  It fits into the NoSQL camp by having a non-relational model (supporting schema-on-read and JSON documents) and working really well for large-scale OLTP solutions (it also can be used for a data warehouse when used in combination with Apache Spark – a later blog).  See Distributed Writes and the presentation Relational databases vs Non-relational databases.  It requires data to be imported into it from ADLS using Azure Data Factory
  • Azure Search: This is a search-as-a-service cloud solution that gives developers APIs and tools for adding a rich full-text search experience over your data.  You can store indexes in Azure Search with pointers to objects sitting in ADLS.  Azure Search is rarely used in data warehouse solutions but if queries are needed such as getting the number of records that contain “win”, then it may be appropriate.  Azure Search supports a pull model that crawls a supported data source such as Azure Blob Storage or Cosmos DB and automatically uploads the data into your index.  It also supports the push model for other data sources such as ADLS to programmatically send the data to Azure Search to make it available for searching (see the sketch after this list).  Note that Azure Search is built on top of Elasticsearch and uses the Lucene query syntax
  • Azure Data Catalog: This is an enterprise-wide metadata catalog that makes data asset discovery straightforward.  It’s a fully-managed service that lets you register, enrich, discover, understand, and consume data sources such as ADLS.  It is a single, central place for all of an organization’s users to contribute their knowledge and build a community and culture of data.  Without using this product you will be in danger of having a lot of data duplication and wasted effort
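
As a rough illustration of the Azure Search push model mentioned in the list above, here is a sketch using the azure-search-documents Python SDK (the “transcripts” index, its fields, and the sample documents are hypothetical, and the index itself would have to be created separately):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Hypothetical index "transcripts" with fields id/callId/text, created separately.
search = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="transcripts",
    credential=AzureKeyCredential("<admin-key>"),
)

# Push model: the application reads content (e.g. files landed in ADLS) and
# uploads documents into the index programmatically.
search.upload_documents(documents=[
    {"id": "1", "callId": "c-001", "text": "customer asked about archive pricing"},
    {"id": "2", "callId": "c-002", "text": "big win on the new contract"},
])

# Full-text query, e.g. counting records that contain "win".
results = search.search(search_text="win", include_total_count=True)
print("matches:", results.get_count())
```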

In addition to ADLS, Azure Blob storage can be used instead of ADLS or in combination with it.  When comparing ADLS with Blob storage, Blob storage has the advantage of lower cost since there are now three Azure Blob storage tiers: Hot, Cool, and Archive, that are all less expensive than ADLS.  The advantage of ADLS is that there are no limits on account size and file size (Blob storage has a 5 PB account limit and a 4.75 TB file limit).  ADLS is also faster as files are auto-sharded/chunked where in Blob storage they remain intact.  ADLS supports Active Directory while Blob storage supports SAS keys.  ADLS also supports WebHDFS while Blob storage does not (it supports WASB which is a thin layer over Blob storage that exposes it as a HDFS file system).  Finally, while Blob storage is in all Azure regions, ADLS is only in two US regions (East, Central) and North Europe (other regions coming soon).  See Comparing Azure Data Lake Store and Azure Blob Storage.

Now that you have a high-level understanding of all the products, the next step is to determine the best combination to use to build a solution.  If you want to use Hadoop and don’t need a relational data warehouse the product choices may look like this:

Most companies will use a combination of HDI and ADLA.  The main advantages of ADLA over HDI are that there is nothing you have to manage (i.e. performance tuning), you only incur costs when running jobs whereas HDI clusters are always running and incurring costs regardless of whether you are processing data or not, and you can scale individual queries independently of each other instead of having queries fight for resources in the same HDInsight cluster (so predictable vs unpredictable performance).  In addition, ADLA is always available so there is no startup time to create the cluster like with HDI.  HDI has the advantage that it has more products available with it (i.e. Kafka) and you can customize it (i.e. install additional software) where in ADLA you cannot.  When submitting a U-SQL job under ADLA you specify the resources to use via an Analytics Unit (AU).  Currently, an AU is the equivalent of 2 CPU cores and 6 GB of RAM and you can go as high as 450 AUs.  For HDI you can give more resources to your query by increasing the number of worker nodes in a cluster (limited by the region max core count per subscription, but you can contact billing support to increase your limit).

Most of the time a relational data warehouse will be part of your solution, with the biggest reasons being familiarity with relational databases by the existing staff and the need to present an easier-to-understand presentation layer to the end user so they can create their own reports (self-service BI).  A solution that adds a relational database may look like this:

The Data Lake technology can be ADLS or blob storage, or even Cosmos DB.  The main reason against using Cosmos DB as your Data Lake is cost and having to convert all files to JSON.  A good reason for using Cosmos DB as a Data Lake is that it enables you to have a single underlying datastore that serves both operational queries (low latency, high concurrency, low compute queries – direct from Cosmos DB) as well as analytical queries (high latency, low concurrency, high compute queries – via Spark on Cosmos DB).  By consolidating to a single data store you do not need to worry about data consistency issues between maintaining multiple copies across multiple data stores.  Additionally, Cosmos DB has disaster recovery built-in by easily allowing you to replicate data across Azure regions with automatic failover (see How to distribute data globally with Azure Cosmos DB), while ADLS requires replication and failover to be done manually (see Disaster recovery guidance for data in Data Lake Store).  Blob storage has disaster recovery built-in via Geo-redundant storage (GRS) but requires manual failover by Microsoft (see Redundancy Options in Azure Blob Storage).

An option to save costs is to put “hot” data in Cosmos DB, and warm/cold data in ADLS or Blob storage while using the same reporting tool, Power BI, to access the data from either of those sources as well as many others (see Data sources in Power BI Desktop and Power BI and Excel options for Hadoop).

If Cosmos DB is your data lake or used as your data warehouse (instead of SQL DW/DB in the picture above), you can perform ad-hoc queries using familiar SQL-like grammar over JSON documents (including aggregate functions like SUM) without requiring explicit schemas or creation of secondary indexes.  This is done via the REST API, JavaScript, .NET, Node.js, or Python.  Querying can also be done via Apache Spark on Azure HDInsight, which provides additional benefits such as faster performance and SQL statements such as GROUP BY (see Accelerate real-time big-data analytics with the Spark to Azure Cosmos DB connector).  Check out the Query Playground to run sample queries on Cosmos DB using sample data.  Note the query results are in JSON instead of rows and columns.
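
For example, an aggregate over the JSON documents might look like the following with the azure-cosmos Python SDK (the account, “sales”/“orders” names, and the customerId/total properties are hypothetical; the same query text also works through the REST API or the Query Playground):

```python
from azure.cosmos import CosmosClient

# Placeholder account, database ("sales"), and container ("orders") names.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
orders = client.get_database_client("sales").get_container_client("orders")

# SQL-like grammar directly over JSON documents -- no explicit schema or
# secondary indexes need to be defined up front.
query = "SELECT VALUE SUM(c.total) FROM c WHERE c.customerId = @cust"
for result in orders.query_items(
    query=query,
    parameters=[{"name": "@cust", "value": "cust-42"}],
    enable_cross_partition_query=True,
):
    print("total spend:", result)  # results come back as JSON values, not rows and columns
```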

You will need to determine if your solution will have dashboard and/or ad-hoc queries.  Your choice of products in your solution will depend on the need to support one or both of those query types.  For ad-hoc queries, you have to determine what the acceptable performance is for those queries as that will determine if you need a SMP or MPP solution (see Introducing Azure SQL Data Warehouse).  For dashboard queries (i.e. from Power BI) it’s usually best to have those queries go against AAS to get top-notch performance and because SQL DW has a 32-concurrent-query limit (and one dashboard can have a dozen or so queries).  Complex queries, sometimes referred to as “last mile” queries, may be too slow for a SMP solution (i.e. SQL Server, Azure SQL Database) and require a MPP solution (i.e. SQL DW).

The diagram above shows SQL DW or Azure SQL Database (SQL DB) as the data warehouse.  To decide which is the best option, see Azure SQL Database vs SQL Data Warehouse.  With a clustered columnstore index, SQL DB competes very well in the big data space, and with the addition of R/Python stored procedures, it becomes one of the fastest performing machine learning solutions available.  But be aware that the max database size for SQL DB is 4 TB; there will soon be an option called SQL DB Managed Instance that supports a much larger max database size.  See the presentations Should I move my database to the cloud? and Introducing Azure SQL Database.

You will also need to determine if your solution will have batch and/or interactive queries.  All the products support batch queries, but ADLA does not support interactive queries (so you could not use the combination of Power BI and ADLA).  If you want to stay within the Hadoop world you can use the HDInsight cluster types of Spark on HDInsight or HDInsight Interactive Query (Hive LLAP) for interactive queries against ADLS or Blob Storage (see General availability of HDInsight Interactive Query – blazing fast queries on hyper-scale data) and can use AtScale instead of AAS to build cubes/OLAP within Hadoop.  AtScale will work against data in ADLS and Blob Storage via HDInsight.

Whether to have users report off of ADLS or via a relational database and/or a cube is a balance between giving users data quickly and having them do the work to join, clean and master data (getting IT out-of-the-way) versus having IT make multiple copies of the data and cleaning, joining and mastering it to make it easier for users to report off of the data.  The risk in the first case is having users repeating the process to clean/join/master data and cleaning/joining/mastering it wrong and getting different answers to the same question (falling into the old mistake that the data lake does not need data governance and will magically make all the data come out properly – not understanding that HDFS is just a glorified file folder).  Another risk in the first case is performance because the data is not laid out efficiently.  Most solutions incorporate both to allow “power users” to access the data quickly via ADLS while allowing all the other users to access the data in a relational database or cube.

Digging deeper, if you want to run reports straight off of data in ADLS, be aware it is file-based security and so you may want to create a cube for row-level security and also for faster performance, since ADLS is a file system and does not have indexes (although you can use a product such as Jethro Data to create indexes within ADLS/HDFS).  Also, running reports off of ADLS compared to a database has disadvantages such as limited support for concurrent users; lack of indexing, a metadata layer, a query optimizer, and memory management; no ACID support or data integrity; and security limitations.

The decision on which products to use is a balance between having multiple copies of the data (with the additional costs that incurs and the maintenance and learning of multiple products) versus less flexibility in reporting of data and slower performance.  Also, while incorporating more products into a solution means it takes longer to build, the huge benefit is that you “future proof” your solution to be able to handle any data in the future no matter the size, type, or frequency.

The bottom line is there are so many products with so many combinations of putting them together that a blog like this can only help so much – you may wind up needing a solution architect like me to help you make sense of it all 🙂

More info:

My presentation Choosing technologies for a big data solution in the cloud

Using Azure Analysis Services on Top of Azure Data Lake Store

Understanding WASB and Hadoop Storage in Azure

Microsoft Ignite Big Data Presentations


There were so many good presentations at Microsoft Ignite, all of which can be viewed on-demand.  I wanted to list the big data related presentations that I found the most useful.  It’s a lot of stuff to watch and with our busy schedules it can be quite challenging to view them all.  What I do is set aside 40 minutes every day to watch half a session (they are 75 minutes).  It may take a few weeks, but if you consistently watch you will be rewarded by a much better understanding of all the product options and their use cases, and my last blog post (Use cases of various products for a big data cloud solution) can be used as a summary of all these options:

Modernize your on-premises applications with SQL Database Managed Instances: More and more customers who are looking to modernize their data centers have the need to lift and shift their fleet of databases to public cloud with the low effort and cost. We’ve developed Azure SQL Database to be the ideal destination, with enterprise security, full application compatibility and unique intelligent PaaS capabilities that reduce overall TCO. In this session, through preview stories and demos learn what SQL Database Managed Instances are, and how you can use them to speed up and simplify your journey to cloud.

Database migration roadmap with Microsoft: Today’s organizations must adapt quickly to change, using new technologies to fuel competitive advantage, or risk getting left behind. Organizations understand that data is a key strategic asset which, when combined with the scale and intelligence of cloud, can provide the opportunity to automate, innovate, and increase the speed of business. But every migration journey is unique, so knowing the tricks of the trade will make your journey far easier. In this session, we use real-world case studies to provide details about how to perform large-scale migrations. We also share information about how Microsoft is investing in making this journey simpler with Azure Database Migration Service and related tools.

What’s new with Azure SQL Database: Focus on your business, not on the database: Azure SQL Database is Microsoft’s fully managed, database-as-a-service offering based on the world’s top relational database management system, SQL Server. In this session, learn about the latest innovations in Azure SQL Database and how customers are using our managed service to modernize their applications. Our most recent version combines advanced intelligence, enterprise-grade performance, high-availability, and industry-leading security in one easy-to-use database. Thanks to innovations such as In-Memory OLTP, Columnstore indexes, and our most recent Adaptive Query Processing feature family, customers can rely on Azure SQL DB for their relational data management needs, from managing just a few megabytes of transactional data.

Deep dive into SQL Server Integration Services (SSIS) 2017 and beyond: See how to use the latest SSIS 2017 to modernize traditional on-premises ETL workflows, transforming them into scalable hybrid ETL/ELT workflows. We showcase the latest additions to SSIS Azure Feature Pack, introducing/improving Azure connectivity components, and take a deep dive into SSIS Scale-Out feature, guiding you end-to-end from cluster installation to parallel execution, to help reduce the overall runtime of your workflows. Finally, we show you how to orchestrate/schedule SSIS executions using Azure Data Factory (ADF) and share our cloud-first product roadmap towards SSIS Platform as a Service (PaaS).

Understanding big data on Azure – structured, unstructured and streaming: Data is the new Electricity, and Big Data technologies are helping organizations leverage this new phenomena to foster their businesses in innovative ways. In this session, we show how you can leverage the big data services such as Data Warehousing, Hadoop, Spark, Machine Learning, and Real Time Analytics on Azure and how you can make the most of these for your business scenarios. This is a foundational session to ground your understanding on the technology, its use cases, patterns, and customer scenarios. You will see a lot of these technologies in action and get a good view of the breadth. Join this session if you want to get a real understanding of Big Data on Azure, and how the services are structured to achieve your desired outcome.

Architect your big data solutions with SQL Data Warehouse and Azure Analysis Services: Have you ever wondered what’s the secret sauce that allows a company to use their data effectively? How do they ingest all their data, analyze it, and then make it available to thousands of end users? What happens if you need to scale the solution? Come find out how some of the top companies in the world are building big data solutions with Azure Data Lake, Azure HDInsight, Azure SQL Data Warehouse, and Azure Analysis Services. We cover some of the reference architectures of these companies, best practices, and sample some of the new features that enable insight at the speed of thought.

Building Petabyte scale Interactive Data warehouse in Azure HDInsight: Come learn to understand real world challenges associated with building a complex, large-scale data warehouse in the cloud. Learn how technologies such as Low Latency Analytical Processing [LLAP] and Hive 2.x are making it better through dramatically improved performance and a simplified architecture that suits the public cloud. In this session, we go deep into LLAP’s performance and architecture benefits and how it compares with Spark and Presto. We also look at how business analysts can use familiar tools such as Microsoft Excel and Power BI, and do interactive query over their data lake without moving data outside the data lake.

Building modern data pipelines with Spark on Azure HDInsight: You are already familiar with the key value propositions of Apache Spark. In this session, we cover new capabilities coming in the latest versions of Spark. More importantly we cover how customers are using Apache Spark for building end-to-end data analytics pipeline. It starts from ingestion, Spark streaming, and then goes into the details on data manipulation and finally getting your data ready for serving to your BI analysts.

Azure Blob Storage: Scalable, efficient storage for PBs of unstructured data: Azure Blob Storage is the exa-scale object storage service for Microsoft Azure. In this session, we cover new services and features including the brand new Archival Storage tier, dramatically larger storage accounts, throughput and latency improvements and more. We give you an overview of the new features, present use cases and customer success stories with Blob Storage, and help you get started with these exciting new improvements.

Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platform, and intelligent: Increasingly, customers looking to modernize their analytics needs are exploring the data lake approach. They are challenged by poorly-integrated technologies, a variety of data formats, and inconvenient data types. We explore a modern ETL pipeline through the lens of Azure Data Lake. This approach allows pipelines to scale to thousands of nodes instantly and lets pipelines integrate code written in .NET, Python, and R. This degree of extensibility allows pipelines to handle formats such as CSV, XML, JSON, Images, etc. Finally, we explore how the next generation of ETL scenarios are enabled by integrating intelligence in the data layer in the form of built-in cognitive capabilities.

Azure Cosmos DB: The globally distributed, multi-model database: Earlier this year, we announced Azure Cosmos DB – the first and only globally distributed, multi-model database system. The service is designed to allow customers to elastically and horizontally scale both throughput and storage across any number of geographical regions, it offers guaranteed <10 ms latencies at the 99th percentile, 99.99% high availability and five well defined consistency models to developers. It’s been powering Microsoft’s internet-scale services for years. In this session, we present an overview of Azure Cosmos DB—from global distribution to scaling out throughput and storage—enabling you to build highly scalable mission critical applications.

First look at What’s New in Azure Machine Learning: Take in the huge set of capabilities announced at Ignite for the next generation of the Azure Machine Learning platform. Build and deploy ML applications in the cloud, on-premises, and at the edge. Get started by wrangling your data into shape easily and efficiently, then take advantage of popular tools like Cognitive Toolkit, Jupyter, and Tensorflow to build advanced ML models and train them locally or at large scale in the cloud. Learn how to deploy models with a powerful, new, Docker-based hosting service complete with the ability to monitor and manage everything in production.

Azure SQL Database Managed Instance


Azure SQL Database Managed Instance is a new flavor of Azure SQL Database that is a game changer.  It offers near-complete SQL Server compatibility and network isolation to easily lift and shift databases to Azure (you can literally back up an on-premises database and restore it into an Azure SQL Database Managed Instance).  Think of it as an enhancement to Azure SQL Database that is built on the same PaaS infrastructure and maintains all its features (i.e. active geo-replication, high availability, automatic backups, database advisor, threat detection, intelligent insights, vulnerability assessment, etc) but adds support for databases up to 35 TB, VNET, SQL Agent, cross-database querying, replication, etc.  So, you can migrate your databases from on-prem to Azure with very little migration effort, which is a big improvement over the current Singleton or Elastic Pool flavors, which can require substantial changes.
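
As a rough sketch of that backup/restore lift and shift, driven from Python with pyodbc (the server, storage account, SAS token, and database names are all placeholders; the underlying T-SQL is the native RESTORE ... FROM URL pattern that Managed Instance supports):

```python
import pyodbc

# Placeholders throughout: server, credentials, storage container, and .bak name.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<managed-instance>.database.windows.net;UID=<user>;PWD=<password>",
    autocommit=True,  # RESTORE cannot run inside a user transaction
)
cur = conn.cursor()

# 1) Credential so the instance can read the backup from Blob storage (SAS-based).
cur.execute("""
CREATE CREDENTIAL [https://<account>.blob.core.windows.net/backups]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE', SECRET = '<sas-token>'
""")

# 2) Native restore of the on-premises backup into the managed instance.
cur.execute("""
RESTORE DATABASE [MyDb]
FROM URL = 'https://<account>.blob.core.windows.net/backups/MyDb.bak'
""")
```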

I have created a presentation about Managed Instance here.  If you are not familiar with Azure SQL Database, first check out my introduction presentation.

For more details see the presentation at Ignite by Drazen Sumic called “Modernize your on-premises applications with SQL Database Managed Instances (BRK2217)” and this blog post by Lindsey Allen.

There was also a presentation at Ignite called “What’s new with Azure SQL Database: Focus on your business, not on the database (BRK2230)” on the new features in SQL Database (Adaptive Query Processing, SQL Graph, Automatic Tuning, Intelligent Insights, Vulnerability Assessment, Service Endpoint) as well details on Azure Data Sync and an introduction to Managed Instances.

More info:

Native database backup in Azure SQL Managed Instance

Top Questions from New Users of Azure SQL Database

Managed Instances versus Azure SQL Database—What’s the Right Solution for You?

Analytics Platform System (APS) AU6 released


Better late than never: The Analytics Platform System (APS), which is a renaming of the Parallel Data Warehouse (PDW), released an appliance update (AU6) about a year ago, and I missed the announcement.  Below is what is new in this release, also called APS 2016.  APS is alive and well and there will be another AU next calendar year:

Microsoft is pleased to announce that the appliance update, Analytics Platform System (APS) 2016, has been released to manufacturing and is now generally available.  APS is Microsoft’s scale-out Massively Parallel Processing fully integrated system for data warehouse specific workloads.

This appliance update builds on the SQL Server 2016 release as a foundation to bring you many value-added features.  APS 2016 offers additional language coverage to support migrations from SQL Server and other platforms.  It also features improved security for hybrid scenarios and the latest security and bug fixes through new firmware and driver updates.

SQL Server 2016

APS 2016 runs on the latest SQL Server 2016 release and now uses the default database compatibility level 130 which can support improved query performance.  SQL Server 2016 allows APS to offer features such as secondary index support for CCI tables and PolyBase Kerberos support.

Transact-SQL

APS 2016 supports a broader set of T-SQL compatibility, including support for wider rows and a large number of rows, VARCHAR(MAX), NVARCHAR(MAX), and VARBINARY(MAX).  For greater analysis flexibility, APS supports full window frame syntax for ROWS or RANGE and additional windowing functions like FIRST_VALUE, LAST_VALUE, CUME_DIST, and PERCENT_RANK.  Additional functions like NEWID() and RAND() work with new data type support for UNIQUEIDENTIFIER and NUMERIC.  For the full set of supported T-SQL, please visit the online documentation.
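
As a quick illustration of the new windowing support, here is a hedged sketch of a query using full window frame syntax, FIRST_VALUE, and PERCENT_RANK, submitted through pyodbc against a hypothetical FactSales table (the DSN and table are assumptions, not part of the release notes):

```python
import pyodbc

# Hypothetical appliance DSN and FactSales table, purely for illustration.
conn = pyodbc.connect("DSN=aps2016")
sql = """
SELECT  CustomerKey,
        OrderDate,
        SalesAmount,
        -- full window frame syntax (ROWS BETWEEN ...) is now supported
        SUM(SalesAmount) OVER (PARTITION BY CustomerKey ORDER BY OrderDate
                               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningTotal,
        FIRST_VALUE(SalesAmount) OVER (PARTITION BY CustomerKey ORDER BY OrderDate)   AS FirstOrderAmount,
        PERCENT_RANK()           OVER (PARTITION BY CustomerKey ORDER BY SalesAmount) AS AmountPercentile
FROM    dbo.FactSales;
"""
for row in conn.cursor().execute(sql):
    print(row.CustomerKey, row.RunningTotal, row.FirstOrderAmount)
```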

PolyBase/Hadoop enhancements

PolyBase now supports the latest Hortonworks HDP 2.4 and HDP 2.5.  This appliance update provides enhanced security through Kerberos support via database-scoped credentials and credential support with Azure Storage Blobs for added security across big data analysis.

Install and upgrade enhancements

Hardware architecture updates bring the latest generation processor support (Broadwell), DDR4 DIMMs, and improved DIMM throughput – these will ship with hardware purchased from HPE, Dell or Quanta.  This update offers customers an enhanced upgrade and deployment experience on account of pre-packaging of certain Windows Server updates, hotfixes, and an installer that previously required an on-site download.

APS 2016 also adds Fully Qualified Domain Name (FQDN) support, making it possible to set up a domain trust to the appliance.  It also ships with the latest firmware/driver updates containing security updates and fixes.

Flexibility of choice with Microsoft’s data warehouse portfolio

The latest APS update is an addition to already existing data warehouse portfolio from Microsoft, covering a range of technology and deployment options that help customers get to insights faster.  Customers exploring data warehouse products can also consider SQL Server with Fast Track for Data Warehouse or Azure SQL Data Warehouse, a cloud based fully managed service.

Next Steps

For more details about these features, please visit our online documentation or download the client tools.


Microsoft Connect(); announcements


Microsoft Connect(); is a developer event from Nov 15-17, where plenty of announcements are made.  Here is a summary of the data platform related announcements:

  • Azure Databricks: In preview, this is a fast, easy, and collaborative Apache Spark based analytics platform optimized for Azure. It delivers one-click set up, streamlined workflows, and an interactive workspace all integrated with Azure SQL Data Warehouse, Azure Storage, Azure Cosmos DB, Azure Active Directory, and Power BI.  More info
  • Azure Cosmos DB with Apache Cassandra API: In preview, this enables Cassandra developers to simply use the Cassandra API in Azure Cosmos DB and enjoy the benefits of Azure Cosmos DB with the familiarity of the Cassandra SDKs and tools, with no code changes to their application.  More info.  See all Cosmos DB announcements
  • Microsoft joins the MariaDB Foundation: Microsoft is a platinum sponsor – MariaDB is a community-developed fork of the MySQL relational database management system, and Microsoft will be actively contributing to MariaDB and the MariaDB community.  More info
  • Azure Database for MariaDB: An upcoming preview will bring fully managed service capabilities to MariaDB, further demonstrating Microsoft’s commitment to meeting customers and developers where they are by offering their favorite technologies on Azure.  More info
  • Azure SQL Database with Machine Learning Services: In preview this provides support for machine learning models inside Azure SQL Database. This makes it seamless for data scientists and developers to create and train models in Azure Machine Learning and deploy models directly to Azure SQL Database to create predictions at blazing fast speeds
  • Visual Studio Code Tools for AI: In preview, create, train, manage, and deploy AI models with all the productivity of Visual Studio and the power of Azure.  Works on Windows and MacOS.  More info

What is Azure Databricks?


Azure Databricks (documentation and user guide) was announced at Microsoft Connect, and with this post I’ll try to explain its use case.  At a high level, think of it as a tool for curating and processing massive amounts of data and developing, training and deploying models on that data, and managing the whole workflow process throughout the project.  It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and Machine Learning Library (MLlib).  It has built-in integration with Azure Blob Storage, Azure Data Lake Store (ADLS), Azure SQL Data Warehouse (SQL DW), Cosmos DB, Azure Event Hub, Apache Kafka for HDInsight, and Power BI (see Spark Data Sources).  Think of it as an alternative to HDInsight (HDI) and Azure Data Lake Analytics (ADLA).

It differs from HDI in that HDI is a PaaS-like experience that allows working with many more OSS tools at a lower cost.  Databricks’ advantage is that it is a Software-as-a-Service-like experience (or Spark-as-a-service) that is easier to use, has native Azure AD integration (HDI security is via Apache Ranger and is Kerberos-based), has auto-scaling and auto-termination (like a pause/resume), has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.  Note that all clusters within the same workspace share data among all of those clusters.

Also note that with its built-in integration to SQL DW it can write directly to SQL DW.  HDInsight, by contrast, cannot, so more steps are required: when HDInsight processes data it must write it back to Blob Storage, and Azure Data Factory (ADF) is then needed to move the data from Blob Storage to SQL DW.
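
To illustrate that built-in SQL DW integration, here is a hedged PySpark sketch of the pattern as it would run in an Azure Databricks notebook (the storage account, paths, table name, and connector option values are placeholders based on the Databricks SQL DW connector pattern, so verify them against the current documentation):

```python
# Runs inside an Azure Databricks notebook, where `spark` is predefined.
# All account names, keys, and paths below are placeholders.

# Read raw data landed in Blob storage (ADLS works similarly with an adl:// path).
spark.conf.set("fs.azure.account.key.<account>.blob.core.windows.net", "<storage-key>")
events = spark.read.json("wasbs://landing@<account>.blob.core.windows.net/events/2017/")

daily = events.groupBy("eventDate").count()

# Write the aggregate straight to SQL DW via the connector; it stages data
# through tempDir and loads it with PolyBase under the covers.
(daily.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<dw>;user=<u>;password=<p>")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("tempDir", "wasbs://tempdir@<account>.blob.core.windows.net/stage")
    .option("dbTable", "dbo.DailyEventCounts")
    .mode("overwrite")
    .save())
```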

It is in limited public preview now: Sign up for the Azure Databricks limited preview

More info:

Microsoft makes Databricks a first-party service on Azure

DATABRICKS + MICROSOFT AZURE = CLOUD-SCALE SPARK POWER

Microsoft Launches Preview of Azure Databricks

A technical overview of Azure Databricks

Microsoft Azure Debuts a ‘Spark-as-a-Service’

Is the traditional data warehouse dead?


There have been a number of enhancements to Hadoop recently when it comes to fast interactive querying, with products such as Hive LLAP and Spark SQL being used in place of slower interactive querying options such as Tez/Yarn and batch processing options such as MapReduce (see Azure HDInsight Performance Benchmarking: Interactive Query, Spark and Presto).

This has led to a question I have started to see from customers: Do I still need a data warehouse or can I just put everything in a data lake and report off of that using Hive LLAP or Spark SQL?  Which leads to the argument: “Is the data warehouse dead?”
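
To make the question concrete, the data-lake-only approach usually amounts to registering the files in the lake as tables and querying them in place with Spark SQL, along the lines of the hedged sketch below (the ADLS path, table, and columns are made up for illustration, and an existing SparkSession named spark is assumed).

    # Hedged sketch of reporting straight off the data lake with Spark SQL.
    # Path and schema are illustrative; assumes an existing SparkSession `spark`.
    spark.read.parquet(
        "adl://mydatalake.azuredatalakestore.net/curated/orders/"
    ).createOrReplaceTempView("orders")

    top_customers = spark.sql("""
        SELECT customer_id, SUM(order_total) AS total_spend
        FROM orders
        WHERE order_date >= '2017-01-01'
        GROUP BY customer_id
        ORDER BY total_spend DESC
        LIMIT 10
    """)
    top_customers.show()

The rest of this post is about whether that pattern alone can replace the relational data warehouse.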

I think what is confusing is that the argument should not be over whether the “data warehouse” is dead but whether the “traditional data warehouse” is dead, as the reasons a “data warehouse” is needed are greater than ever (i.e. integrating many sources of data, reducing reporting stress on production systems, data governance including cleaning, mastering, and security, historical analysis, a user-friendly data structure, minimizing silos, a single version of the truth, etc – see Why You Need a Data Warehouse).  A “traditional” data warehouse usually refers to a relational data warehouse built using SQL Server (if using Microsoft products), while the data lake mentioned is usually one built in Hadoop using Azure Data Lake Store (ADLS) and HDInsight (which has cluster types for Spark SQL and Hive LLAP, the latter also called Interactive Query).

I think the ultimate question is: Can all the benefits of a traditional relational data warehouse be implemented inside of a Hadoop data lake with interactive querying via Hive LLAP or Spark SQL, or should I use both a data lake and a relational data warehouse in my big data solution?  The short answer is you should use both.  The rest of this post will dig into the reasons why.

I touched on this ultimate question in a blog that is now over a few years old, Hadoop and Data Warehouses, so this is a good time to provide an update.  I also touched on this topic in my blogs Use cases of various products for a big data cloud solution, Data lake details, Why use a data lake?, and What is a data lake?, and my presentation Big data architectures and the data lake.

The main benefits I hear for a data lake-only approach: you don’t have to load data into another system and therefore manage schemas across different systems, data load times can be expensive, data freshness challenges, the operational challenges of managing multiple systems, and cost.  While these are valid benefits, I don’t feel they are enough to warrant not having a relational data warehouse in your solution.

First let’s talk about cost and dismiss the incorrect assumption that Hadoop is cheaper: Hadoop can be 3x cheaper for data refinement, but building a data warehouse in Hadoop can be 3x more expensive due to the cost of writing complex queries and analysis (based on a WinterCorp report and my experiences).

Understand that a “big data” solution does not mean just using Hadoop-related technologies, but could mean a combination of Hadoop and relational technologies and tools.  Many clients will build their solution using just Microsoft products, while others use a combination of both Microsoft and open source (see Microsoft Products vs Hadoop/OSS Products).  Building a data warehouse solution on the cloud or migrating to the cloud is often the best idea (see To Cloud or Not to Cloud – Should You Migrate Your Data Warehouse?) and you can often migrate to the cloud without retooling technology and skills.

I have seen Hadoop adopters typically falling into two broad categories: those who see it as a platform for big data innovation, and those who dream of it providing the same capabilities as an enterprise data warehouse but at a cheaper cost.  Big data innovators are thriving on the Hadoop platform especially when used in combination with relational database technologies, mining and refining data at volumes that were never before possible.  However, most of those who expected Hadoop to replace their enterprise data warehouse have been greatly disappointed, and in response have been building complex architectures that typically do not end up meeting their business requirements.

As far as reporting goes, whether to have users report off of a data lake or via a relational database and/or a cube is a balance: giving users the data quickly and having them do the work to join, clean, and master it (getting IT out of the way) versus having IT make multiple copies of the data and clean, join, and master it to make reporting easier for users, at the cost of waiting for IT to do all this.  The risks in the first case are users repeating the clean/join/master process, doing it wrong, and getting different answers to the same question, as well as slower performance because the data is not laid out efficiently.  Most solutions incorporate both, allowing power users or data scientists to access the data quickly via a data lake while all the other users access the data in a relational database or cube, making self-service BI a reality (most users would not have the skills to access data in a data lake properly, or at all, so a cube is appropriate as it provides a semantic layer among other advantages to make report building very easy – see Why use a SSAS cube?).

Relational data warehouses continue to meet the information needs of users and continue to provide value.  Many people use them, depend on them, and don’t want them replaced with a data lake.  Data lakes offer a rich source of data for data scientists and self-service data consumers (“power users”) and serve analytics and big data needs well.  But not all data and information workers want to become power users.  The majority (at least 90%) continue to need well-integrated, systematically cleansed, easy-to-access relational data that includes a large body of time-variant history.  These people are best served with a data warehouse.

I can’t stress enough that if you need reports with high data quality, you need to apply the exact same transformations to the same data to produce that report no matter what your technical implementation is.  Whether you call it a data lake or a data warehouse, or use an ETL tool or Python code, the development and maintenance effort is still there.  You need to avoid falling into the old trap of thinking the data lake does not need data governance.  It’s not a place with unicorns and fairies that will magically make all the data come out properly – a data lake is just a glorified file folder.

Here are some of the reasons why it is not a good idea to have a data lake in Hadoop as your data warehouse and forgo a relational data warehouse:

  • Hadoop does not provide for very fast query reads in all use cases.  While Hadoop has come a long way in this area, Hive LLAP and Spark SQL have limits on the types of queries they support (e.g. not having full support for ANSI SQL, such as certain aggregate functions, which limits the range of users, tools, and applications that can access Hadoop data), and they still aren’t quite at the performance level that a relational database can provide
  • Hadoop lacks a sophisticated query optimizer, in-database operators, advanced memory management, concurrency, dynamic workload management and robust indexing strategies and therefore performs poorly for complex queries
  • Hadoop does not have the ability to place “hot” and “cold” data on a variety of storage devices with different levels of performance to reduce cost
  • Hadoop is not relational, as all the data is in files in HDFS, so there is always a conversion process to convert the data to a relational format if a reporting tool requires it in a relational format
  • Hadoop is not a database management system.  It does not have functionality such as update/delete of data, referential integrity, statistics, ACID compliance, data security, and the plethora of tools and facilities needed to govern corporate data assets
  • There is no metadata stored in HDFS, so another tool such as a Hive Metastore needs to be used to store that, adding complexity and slowing performance.  And most metastores only work with a limited number of tools, requiring multiple metastores
  • Finding expertise in Hadoop is very difficult: The small number of people who understand Hadoop and all its various versions and products versus the large number of people who know SQL
  • Hadoop is super complex, with lots of integration with multiple technologies required to make it work
  • Hadoop has many tools/technologies/versions/vendors (fragmentation), no standards, and it is difficult to make it a corporate standard.  See all the various Apache Hadoop technologies here
  • Some reporting tools don’t work against Hadoop
  • May require end-users to learn new reporting tools and Hadoop technologies to query the data
  • The newer Hadoop solutions (Tez, Spark, Hive LLAP etc) are still figuring themselves out.  Customers might not want to take the risk of investing in one of these solutions that may become obsolete (like MapReduce)
  • It might not save you much in costs: you still have to purchase hardware or pay for cloud consumption, support, licenses, training, and migration costs.  As relational databases scale up, support non-standard data types like JSON, and run functions written in Python, Perl, and Scala, it makes it even more difficult to replace them with a data lake as the migration costs alone would be substantial
  • If you need to combine relational data with Hadoop, you will need to move that relational data to Hadoop or invest in a technology such as PolyBase to query Hadoop data using SQL
  • Is your current IT experience and comfort level mostly around non-Hadoop technologies, like SQL Server?  Many companies have dozens or hundreds of employees that know SQL Server and not Hadoop so therefore would require a ton of training as Hadoop can be overwhelming

As far as performance goes, it is greatly affected by the use of indexing – Hive, with or without LLAP, doesn’t have indexing, so when you run a query it reads all of the data (minus partition elimination).  Spark SQL, on the other hand, isn’t really an interactive environment – it’s fast batch – so again you are not going to see the performance users will expect from a relational database.  Also, a relational database still beats most competitors when performing complex, multi-way joins.  Given that most analytic queries are just that, a traditional data warehouse still might be the right choice.
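
To illustrate the partition-elimination point, here is a hedged sketch with made-up paths and columns: laying the files out by a commonly filtered column lets Spark skip whole folders for predicates on that column, but a filter on any other column still scans everything because there is no index to fall back on.

    # Hedged sketch: partition elimination is the only "indexing" you get.
    # Paths and column names are illustrative; assumes a SparkSession `spark`.
    curated = "adl://mydatalake.azuredatalakestore.net/curated/orders_by_date/"

    orders = spark.read.parquet("adl://mydatalake.azuredatalakestore.net/raw/orders/")
    orders.write.partitionBy("order_date").mode("overwrite").parquet(curated)

    # Predicate on the partition column: only the matching folder(s) are read.
    pruned = spark.read.parquet(curated).where("order_date = '2018-03-01'")

    # Predicate on any other column: every partition and file is scanned.
    full_scan = spark.read.parquet(curated).where("customer_id = 42")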

From a security standpoint, you would need to integrate Hive LLAP or Spark with Apache Ranger to support granular security definition at the column level, including data masking where appropriate.

Concurrency is another thing to think about – Hadoop clusters have to get VERY large to support hundreds or thousands of concurrent connections – remember, these systems aren’t designed for interactive usage – they are optimized for batch and we are trying to shoehorn interactivity on top of that.

A traditional relational data warehouse should be viewed as just one more data source available to a user on some very large federated data fabric.  It is just pre-compiled to run certain queries very fast.  And a data lake is another data source for the right type of people.  A data lake should not be blocked from all users so you don’t have to tell everyone “please wait three weeks while I mistranslate your query request into a new measure and three new dimensions in the data warehouse”.

Most data lake vendors assume data scientists or skilled data analysts are the principal users of the data.  So, they can feed these skilled data users the raw data.  But most business users get lost in that morass.  So, someone has to model the data so it makes sense to business users.  In the past, IT did this, but now data scientists and data analysts can do it using powerful, self-service tools.  But the real question is: does a data scientist or analyst think locally or globally?  Do they create a model that supports just their use case, or do they think more broadly about how this data set can support other use cases?  So it may be best to continue to let IT model and refine the data inside a relational data warehouse so that it is suitable for different types of business users.

I’m not saying your data warehouse can’t consist of just a Hadoop data lake, as it has been done at Google, the NY Times, eBay, Twitter, and Yahoo.  But are you as big as them?  Do you have their resources?  Do you generate data like them?  Do you want a solution that only 1% of the workforce has the skillset for?  Is your IT department radical or is it conservative?

I think a relational data warehouse still has an important place: performance, ease of access, security, integration with reporting components, and concurrency all lean towards using it, especially when performing the complex, multi-way joins that make up analytic queries, which are the sweet spot for a traditional data warehouse.

The bottom line is a majority of end users need the data in a relational data warehouse to easily do self-service reporting off of it.  A Hadoop data lake should not be a replacement for a data warehouse, but rather should augment/complement a data warehouse.

More info:

Is Hadoop going to Replace Data Warehouse?

IS AZURE SQL DATA WAREHOUSE A GOOD FIT?

The Demise of the Data Warehouse

Counterpoint: The Data Warehouse is Still Alive

The Future of the Data Warehouse

Whither the Data Warehouse? Reflections From Strata NYC 2017

Big Data Solutions Decision Tree

Dimensional Modeling and Kimball Data Marts in the Age of Big Data and Hadoop

Hadoop vs Data Warehouse: Apples & Oranges?

HADOOP AND THE DATA WAREHOUSE: WHEN TO USE WHICH

Reference architecture for enterprise reporting in Azure


As I mentioned in my recent blog Use cases of various products for a big data cloud solution, with so many products it can be difficult to know the best products to use when building a solution.  When it comes to building an enterprise reporting solution, there is a recently released reference architecture to help you in choosing the correct products.  It will also help you get started quickly as it includes an implementation component in Azure.  The blog post announcement is here.

This reference architecture is focused solely on reporting, for those use cases where you will have a lot of users building dashboards via Power BI and operational reports via SSRS.  You can certainly expand the capabilities to add more features such as machine learning as well as enhancing the purpose of certain products, such as using Azure SQL Data Warehouse (SQL DW) to accept large ad-hoc queries from users.  The reference architecture is also for a batch-type environment (i.e. loading data every hour) and not a real-time environment (i.e. handling thousands of events per second).

Key features and benefits include:

  • Pre-built based on selected and stable Azure components proven to work in enterprise BI and reporting scenarios
  • Easily configured and deployed to an Azure subscription within a few hours
  • Bundled with software to handle all the operational essentials for a full-fledged production system
  • Tested end-to-end against large workloads
  • You can operationalize the infrastructure using the steps in the User’s Guide, and explore component level details from the Technical Guides.  Also, check out the FAQ

You can one-click deploy the infrastructure implementation from one of these two locations, which also go into details on each step in the above diagram:

The idea is you are deploying a base architecture, then you will modify as needed to fit all your needs.  But the hard work of choosing the right products and building the starting architecture is done for you, reducing your risk and shortening development time.  However, this does not mean you should use these chosen products in every situation.  For example, if you are comfortable with Hadoop technologies you can use Azure Data Lake Store and HDInsight instead of SQL DW, or use Azure Analysis Services (AAS) instead of SQL Server Analysis Services (SSAS) in a VM (AAS did not support VNETs when this reference architecture was created).  But for many who just need an enterprise reporting solution, this will do the job with little modification.

Note the Cortana Intelligence Gallery has many other solutions, so be sure to check them out and avoid “reinventing the wheel”.

Data Virtualization vs. Data Movement


I have blogged about Data Virtualization vs Data Warehouse and wanted to blog on a similar topic: Data Virtualization vs. Data Movement.

Data virtualization integrates data from disparate sources, locations and formats, without replicating or moving the data, to create a single “virtual” data layer that delivers unified data services to support multiple applications and users.

Data movement is the process of extracting data from source systems and bringing it into the data warehouse and is commonly called ETL, which stands for extraction, transformation, and loading.

If you are building a data warehouse, should you move all the source data into the data warehouse, or should you create a virtualization layer on top of the source data and keep it where it is?

The most common scenario where you would want to do data movement is if you will aggregate/transform one time and query the results many times.  Another common scenario is if you will be joining data sets from multiple sources frequently and the performance needs to be super fast.  These turn out to be the scenarios for most data warehouse solutions.  But there could be cases where you will have many ad-hoc queries that don’t need to be super fast.  And you could certainly have a data warehouse that uses data movement for some tables and data virtualization for others.

Here is a comparison of both:

Other data virtualization benefits:

  • Provides complete data lineage from the source to the presentation layer
  • Additional data sources can be added without having to change transformation packages or staging tables
  • All data presented through the data virtualization software is available through a common SQL interface regardless of the source (i.e. flat files, Excel, mainframe, SQL Server, etc)

While this table gives some good benefits of data virtualization over data movement, it may not be enough to overcome the sacrifice in performance or other drawbacks listed at Data Virtualization vs Data Warehouse.  Also keep in mind the virtualization tool you choose may not support some of your data sources.

The better data virtualization tools provide such features as query optimization, query pushdown, and caching (i.e. Denodo) that may help with performance.  You may see tools with these features called “data virtualization” and tools without these features called “data federation” (i.e. PolyBase).
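
Query pushdown is worth a quick illustration, since it is the feature that most affects performance: the tool either sends the filter and aggregation to the source system, or it drags every row back and does the work locally. A rough, hedged sketch of the two patterns (the DSN, table, and columns are made up):

    # Hedged sketch contrasting pushdown vs. client-side work. DSN and table are placeholders.
    import pandas as pd
    import pyodbc

    conn = pyodbc.connect("DSN=SourceWarehouse")  # illustrative ODBC DSN

    # With pushdown: the source does the filter/aggregation; only a tiny result returns.
    pushed = pd.read_sql(
        "SELECT region, SUM(sales) AS total FROM dbo.Orders "
        "WHERE order_date >= '2018-01-01' GROUP BY region",
        conn,
    )

    # Without pushdown: every row crosses the network and the work happens client-side,
    # which is what a naive federation layer effectively does.
    all_rows = pd.read_sql("SELECT region, sales, order_date FROM dbo.Orders", conn)
    local = (all_rows[all_rows["order_date"] >= "2018-01-01"]
             .groupby("region")["sales"].sum())

A caching layer in the better tools then avoids repeating even the pushed-down work for queries that run over and over.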

More info:

A FRESH LOOK AT DATA VIRTUALIZATION

Conversations with Data Warehouse Experts – Podcast


In this podcast I talk with Mike Rabinovici of Dimodelo Solutions about data being the new currency, the importance of showing customers the art of the possible, and last but not least my go-to TV show.  Click here to listen.  Also check out the podcasts of other data warehouse experts.

Azure Data Architecture Guide (ADAG)


The Azure Data Architecture Guide has just been released!  Check it out: http://aka.ms/ADAG

Think of it as a menu or syllabus for data professionals: what service should you use, why, and when would you use it?  I had a small involvement in its creation, but a large number of people within Microsoft and from 3rd parties put it together over many months.  Hopefully you find this clears up some of the confusion caused by so many technologies and products.

“This guide presents a structured approach for designing data-centric solutions on Microsoft Azure.  It is based on proven practices derived from customer engagements.”

You can even download a PDF version (106 pages!).

The guide is structured around a basic pivot: the distinction between relational data and non-relational data.

Within each of these two main categories, the Data Architecture Guide contains the following sections:

  • Concepts. Overview articles that introduce the main concepts you need to understand when working with this type of data.
  • Scenarios. A representative set of data scenarios, including a discussion of the relevant Azure services and the appropriate architecture for the scenario.
  • Technology choices. Detailed comparisons of various data technologies available on Azure, including open source options.  Within each category, we describe the key selection criteria and a capability matrix, to help you choose the right technology for your scenario.

The table of contents looks like this:

  • Traditional RDBMS
      • Concepts
      • Scenarios
  • Big data and NoSQL
      • Concepts
      • Scenarios
  • Cross-cutting concerns


My latest presentations


I frequently present at user groups, and always try to create a brand new presentation to keep things interesting.  We all know technology changes so quickly so there is no shortage of topics!  There is a list of all my presentations with slide decks.  Here are the new presentations I created the past year:

Differentiate Big Data vs Data Warehouse use cases for a cloud solution

It can be quite challenging keeping up with the frequent updates to the Microsoft products and understanding all their use cases and how all the products fit together.  In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn’t, in order for you to position, design and deliver the proper adoption use cases for each with your customers.  We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob Storage, and AAS, as well as high-level concepts such as when to use a data lake.  We will also review the most common reference architectures (“patterns”) witnessed in customer adoption. (slides)

Introduction to Azure Databricks

Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service) that is a tool for curating and processing massive amounts of data and developing, training and deploying models on that data, and managing the whole workflow process throughout the project.  It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and Machine Learning Library (MLlib).  It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark. (slides)

Azure SQL Database Managed Instance

Azure SQL Database Managed Instance is a new flavor of Azure SQL Database that is a game changer.  It offers near-complete SQL Server compatibility and network isolation to easily lift and shift databases to Azure (you can literally back up an on-premises database and restore it into an Azure SQL Database Managed Instance).  Think of it as an enhancement to Azure SQL Database that is built on the same PaaS infrastructure and maintains all its features (i.e. active geo-replication, high availability, automatic backups, database advisor, threat detection, intelligent insights, vulnerability assessment, etc) but adds support for databases up to 35TB, VNET, SQL Agent, cross-database querying, replication, etc.  So, you can migrate your databases from on-prem to Azure with very little migration effort, which is a big improvement over the current Singleton or Elastic Pool flavors which can require substantial changes. (slides)

What’s new in SQL Server 2017

Covers all the new features in SQL Server 2017, as well as details on upgrading and migrating to SQL Server 2017 or to Azure SQL Database. (slides)

Microsoft Data Platform – What’s included

The pace of Microsoft product innovation is so fast that even though I spend half my days learning, I struggle to keep up. And as I work with customers I find they are often in the dark about many of the products that we have since they are focused on just keeping what they have running and putting out fires. So, let me cover what products you might have missed in the Microsoft data platform world. Be prepared to discover all the various Microsoft technologies and products for collecting data, transforming it, storing it, and visualizing it.  My goal is to help you not only understand each product but understand how they all fit together and their proper use cases, allowing you to build the appropriate solution that can incorporate any data in the future no matter the size, frequency, or type. Along the way we will touch on technologies covering NoSQL, Hadoop, and open source. (slides)

Learning to present and becoming good at it

Have you been thinking about presenting at a user group?  Are you being asked to present at your work?  Is learning to present one of the keys to advancing your career?  Or do you just think it would be fun to present but you are too nervous to try it?  Well, take the first step to becoming a presenter by attending this session and I will guide you through the process of learning to present and becoming good at it.  It’s easier than you think!  I am an introvert and was deathly afraid to speak in public.  Now I love to present and it’s actually my main function in my job at Microsoft.  I’ll share with you the journey that led me to speak at major conferences and the skills I learned along the way to become a good presenter and get rid of the fear.  You can do it! (slides)

Microsoft cloud big data strategy

Think of big data as all data, no matter what the volume, velocity, or variety.  The simple truth is a traditional on-prem data warehouse will not handle big data.  So what is Microsoft’s strategy for building a big data solution?  And why is it best to have this solution in the cloud?  That is what this presentation will cover.  Be prepared to discover all the various Microsoft technologies and products from collecting data, transforming it, storing it, to visualizing it.  My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your company’s big data solution. (slides)

Choosing technologies for a big data solution in the cloud

Has your company been building data warehouses for years using SQL Server?  And are you now tasked with creating or moving your data warehouse to the cloud and modernizing it to support “Big Data”?  What technologies and tools should you use?  That is what this presentation will help you answer.  First we will level-set on what big data is and other definitions, cover questions to ask to help decide which technologies to use, go over the new technologies to choose from, and then compare the pros and cons of the technologies.  Finally we will show you common big data architecture solutions and help you to answer questions such as: Where do I store the data?  Should I use a data lake?  Do I still need a cube?  What about Hadoop/NoSQL?  Do I need the power of MPP?  Should I build a “logical data warehouse”?  What is this lambda architecture?  And we’ll close with showing some architectures of real-world customer big data solutions.  Come to this session to get started down the path to making the proper technology choices in moving to the cloud. (slides)

It’s all about the use cases


There is no better way to see the art of the possible with the cloud than in use cases/customer stories and sample solutions/architectures.  Many of these are domain-specific, which resonates best with business decision makers:

Use cases/customer stories

Microsoft IoT customer stories: Explore Internet of Things (IoT) examples and IoT use cases to learn how Microsoft IoT is already transforming your industry.  The industries are broken out by: Manufacturing, Smart Infrastructure, Transportation, Retail, and Healthcare.

Customer stories: Dozens of customer stories of solutions built in Azure that you can filter on by language, industry, product, organization size, and region.

Case studies: See the amazing things people are doing with Azure broken out by industry, product, solution, and customer location.

Sample solutions/architectures

Azure solution architectures: These architectures help you design and implement secure, highly available, performant, and resilient solutions on Azure.

Pre-configured AI solutions: These serve as a great starting point when building an AI solution.  Broken out by Retail, Manufacturing, Banking, and Healthcare.

Internet of Things (IoT) solutions: Great IoT sample solutions such as: connected factory, remote monitoring, predictive maintenance, connected field service, connected vehicle, and smart buildings.

Public preview of Azure SQL Database Managed Instance


Microsoft has announced the public preview of Azure SQL Database Managed Instance.  I blogged about this before.  This will lead to a tidal wave of on-prem SQL Server database migrations to the cloud.  In summary:

Managed Instance is an expansion of the existing SQL Database service, providing a third deployment option alongside single databases and elastic pools. It is designed to enable database lift-and-shift to a fully-managed service, without re-designing the application.  SQL Database Managed Instance provides the broadest SQL Server engine compatibility and native virtual network (VNET) support so you can migrate your SQL Server databases to SQL Database without changing your apps.  It combines the rich SQL Server surface area with the operational and financial benefits of an intelligent, fully-managed service.

Two other related items that are available:

  • Azure Hybrid Benefit for SQL Server on Azure SQL Database Managed Instance. The Azure Hybrid Benefit for SQL Server is an Azure-based benefit that enables customers to use their SQL Server licenses with Software Assurance to save up to 30% on SQL Database Managed Instance. Exclusive to Azure, the hybrid benefit will provide an additional benefit for highly-virtualized Enterprise Edition workloads with active Software Assurance: for every 1 core a customer owns on-premises, they will receive 4 vCores of Managed Instance General Purpose. This makes moving virtualized applications to Managed Instance highly cost-effective.
  • Database Migration Services for Azure SQL Database Managed Instance. Using the fully-automated Database Migration Service (DMS) in Azure, customers can easily lift and shift their on-premises SQL Server databases to a SQL Database Managed Instance. DMS is a fully managed, first party Azure service that enables seamless and frictionless migrations from heterogeneous database sources to Azure Database platforms with minimal downtime. It will provide customers with assessment reports that guide them through the changes required prior to performing a migration. When the customer is ready, the DMS will perform all the steps associated with the migration process.

More info:

Migrate your databases to a fully managed service with Azure SQL Database Managed Instance

What is Azure SQL Database Managed Instance?

Video Introducing Azure SQL Database Managed Instance

Azure SQL Database Managed Instance – the Good, the Bad, the Ugly

Is the traditional data warehouse dead? webinar


As a follow-up to my blog Is the traditional data warehouse dead?, I will be doing a webinar on that very topic tomorrow (March 27th) at 11am EST for the Agile Big Data Processing Summit that I hope you can join.  Details can be found here.  The abstract is:

Is the traditional data warehouse dead?

With new technologies such as Hive LLAP or Spark SQL, do you still need a data warehouse or can you just put everything in a data lake and report off of that? No! In the presentation, James will discuss why you still need a relational data warehouse and how to use a data lake and an RDBMS data warehouse to get the best of both worlds.

James will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. He’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution, and he will put it all together by showing common big data architectures.

Webinar: Is the traditional data warehouse dead?
