Should I load structured data into my data lake?

November 27, 2018, 8:00 am

With data lakes becoming very popular, a common question I have been hearing often from customers is, “Should I load structured/relational data into my data lake?”. I talked about this a while back in my blog post What is a data lake? and will expand on it in this blog. Melissa Coates also talked about this recently, and I used her graphic below to illustrate:

I would not say it’s common place to load structured data into the data lake, but I do see it frequently.

In most cases it is not necessary to first copy relational source data into the data lake and then into the data warehouse, especially when keeping in mind the effort to migrate existing ETL jobs that are already copying source data into the data warehouse, but there are some good uses cases to do just that:

Wanting to offload the data refinement to the data lake (usually Hadoop), so the processing and space on the enterprise data warehouse (EDW) is reduced, and you avoid clashing with people running queries/reports on the EDW
Wanting to use some Hadoop technologies/tools to refine/filter data that do the refinement quicker/better than your EDW
It can ingest large files quickly and provide data redundancy
ELT jobs on EDW are taking too long because of increasing data volumes and increasing rate of ingesting (velocity), so offload some of them to the Hadoop data lake
The data lake is a good place for data that you “might” use down the road. You can land it in the data lake and have users use SQL via Polybase to look at the data and determine if it has value. Note PolyBase allows end-users to query data in a data lake using regular SQL, so they are not required to learn any Hadoop-related technologies. PolyBase even allows the end-user to use SQL, or any reporting tool that uses SQL, to join data in a relational database with data in a Hadoop cluster
Have a backup of the raw data in case you need to load it again due to an ETL error (and not have to go back to the source). You can keep a long history of raw data
A power user/data scientist wants to make use of the structured data in the data lake (usually combining it with non-relational data)
As a quicker way to load data into Azure SQL Data Warehouse via PolyBase from the data lake (which is usually much faster than using SSIS to copy from the source to Azure SQL Data Warehouse)

You will have to balance these benefits with the extra work required to export data from a relational source into a format such as CSV, then copied to the data lake (where it can be cleaned with a tool such as Databricks and then copied into the relational EDW such as Azure SQL Data Warehouse). Keep in mind for the relational data that is moved to the data lake you will lose valuable metadata for that data such as data types, constraints, foreign keys, etc.

I have seen some customers with relational data who bypassed the data lake and had the source systems copy data directly to the EDW (i.e. Azure SQL Data Warehouse) using SSIS/ADF, but still wanted all data in the data lake so they used Databricks to copy the data warehouse structures into the data lake using the SQL Data Warehouse Spark Connector. In these cases, the customers used the data lake for static data (i.e. web logs, IOT data streams and older ERP data that will not change) and managed non-static data directly in the EDW, as moving data updates through the data lake to the EDW adds a layer of complexity they wanted to avoid.

I have also seen cases when the customer moved EDW data to Hadoop, refined it, and moved it back to the EDW which gave the benefit of offloading the processing and/or when they needed to use Hadoop tools.

A common practice when I see customers migrate an on-prem EDW solution to the cloud is they don’t use the data lake for their current sources (which are usually all structured data), but will use it for all new sources (which are usually mostly unstructured data). However, they like the benefits listed above and will slowly modify most of their ETL packages (usually SSIS) to use the data lake, in some cases replacing SSIS with Azure Data Factory.

More info:

When Should We Load Relational Data to a Data Lake?

↧

Azure Data Factory Data Flow

December 17, 2018, 8:00 am

≫ Next: What product to use to transform my data?

≪ Previous: Should I load structured data into my data lake?

Azure Data Factory v2 (ADF) has a new feature in public preview called Data Flow. I have usually described ADF as an orchestration tool instead of an Extract-Transform-Load (ETL) tool since it has the “E” and “L” in ETL but not the “T”. But now it has the data transformation capability, making ADF the equivalent of “SSIS in the cloud” since it has the ability to mimic SSIS Data Flow business logic.

Data Flow works without you having to write any lines of code as you build the solution by using drag-and-drop features of the ADF interface to perform data transformations like data cleaning, aggregation, and preparation. Behind the scenes, the ADF JSON code is converted to the appropriate code in the Scala programming language and will be prepared, compile and execute in Azure Databricks which will automatically scale-out as needed.

There is an excellent chart created by Kamil Nowinski that shows the SSIS tasks and the equivalent ADF operation. There are currently 13 different dataset manipulation operations with more to come:

New Branch
Join
Conditional Split
Union
Lookup
Derived Column
Aggregate
Surrogate Key
Exists
Select
Filter
Sort
Extend

The Data Flow feature in ADF is currently in limited public preview. If you would like to try this out on your Data Factories, please fill out this form to request whitelisting your Azure Subscription for ADF Data Flows: http://aka.ms/dataflowpreview. Once your subscription has been enabled, you will see “Data Factory V2 (with data flows)” as an option from the Azure Portal when creating Data Factories.

Follow Mark Kromer and the ADF team on Twitter to stay up to date on the rollout of the preview. Also check out the ADF Data Flow’s documentation and these ADF Data Flow’s videos.

Don’t confuse this with Dataflows in Power BI, they have nothing to do with each other.

More info:

Azure Data Factory v2 and its available components in Data Flows

WHAT IS AZURE DATA FACTORY DATA FLOW?

AZURE DATA FACTORY DATA FLOW PREVIEW: SABERMETRICS EXAMPLE

↧

What product to use to transform my data?

January 8, 2019, 8:00 am

≫ Next: Storage options for SQL Server database files in Azure

≪ Previous: Azure Data Factory Data Flow

If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake. And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL). Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data. There is good news and bad news when it comes to which product to use. The good news is there are a lot of products to choose from. The bad news is there are a lot of products to choose from :-). I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part):

SSIS – The product used most often for on-prem ETL still has use cases for it when moving to the cloud, especially when migrating a solution from on-prem to the cloud. If you have hundreds of SSIS packages, it may make the most sense to continue to use those packages and just change the destination adapter (i.e. from your on-prem SQL Server to Azure SQL Database) and have them execute under the Azure-SSIS integration runtime (IR) in Azure Data Factory (ADF). Keep in mind if a package is accessing local resources (like storing temp files to a local path or using customer controls), some re-work will be required. A concern will be if a package is moving a ton of data, as executing transformations row-by-row (ETL) can be much slower than batch transformations (ELT) – see Difference between ETL and ELT. Also, the data has to be moved to the location of the IR and the compute power is limited by the size of the IR cluster. For those SSIS packages that are too slow you can just replace those with a product below
Azure Data Factory (ADF) – Now that ADF has a new feature called Data Flow, it can transform data so it is more than just an orchestration tool. Behind the scenes, the ADF JSON code that is created when you build a solution is converted to the appropriate code in the Scala programming language and is prepared, compiled and executed in Azure Databricks. This means Data Flow operates in an ELT manner: It loads the data into a place where Databricks can access it, performs the transformations, and then moves it to the destination. ADF provides a native ETL scheduler so that you can automate data transformation and movement processes either through visual data flows or via script activities that execute in Databricks or other execution engines (so, like with SSIS, data flows are row-by-row transformations and for large amounts of data it may be faster to execute a batch transformation via a script in Databricks). My thoughts on when to use ADF are obviously if you are already using it or if your skillset lies in SSIS as it’s pretty easy to learn ADF with a SSIS background
Databricks – It is a Spark-based analytics platform which makes it great to use if you like to work with Spark, Python, Scala, and notebooks. When choosing between Databricks and ADF, what I’ve noticed is that it depends highly on the customer personas and their capabilities. There are plenty of Data Engineers and Data Scientists who want to get deep into Python or Scala and sling some code in Databricks Notebooks. But the larger audience who wants to focus on building business logic to clean customer/address data, for example, doesn’t want to learn Python libraries, and will use the ADF visual data flow designer. Many of those are also Data Engineers and Data Scientists, but then we start to move up the value stack to include Data Analysts and Business Analysts, which is where we start to overlap with Power BI Dataflow (see below). Either way, when you want to orchestrate these cleaning routines with schedules, triggers, and monitors, you want that to be through ADF. Keep in mind if you code your transformations in Databricks Notebooks, you will be responsible for maintaining that code, troubleshooting, and scheduling those routines
HDInsight (HDI) – Databricks is the preferred product over HDI, unless the customer has a mature Hadoop ecosystem already established. But more and more I tend to find that the majority of workloads are Spark, so Databricks is a better option. In terms of pure Spark workloads Databricks greatly outperforms HDI. Although Databricks clusters can be up and running indefinitely, they’re intended to be created and terminated as necessary – so if you wish to use other Apache Hadoop data transformation tools and have them available 24/7 then HDI may better a better option than Databricks
PolyBase – Azure SQL Data Warehouse (SQL DW) supports PolyBase, so if you are using SQL DW you can pull data from the data lake into SQL DW using T-SQL. An option is to use PolyBase to import the raw data from the data lake into SQL DW and then clean it there using Stored Procedures, especially if you want to stick with T-SQL and don’t want to deal with Spark or Hive or other more-difficult technologies. But the benefit of cleaning the data as it sits in the data lake is to off-load the cleaning process so you are not affecting user queries and to reduce the storage space in SQL DW. Cleaning the data in the data lake can also be less expensive and there are more tools available for data manipulation than just T-SQL. Another reason is by cleaning the data in the data lake you can make it available to data scientists or power users who don’t have to wait until it’s in SQL DW
Power BI Dataflow – In short, Dataflows integrates data lake and data prep technology directly into Power BI, so anyone with Power Query skills (yes – Power Query is now part of Power BI service and not just Power BI Desktop and is called Power Query online) can create, customize and manage data within their Power BI experience (think of it as self-service data prep). So even though the main use case for Dataflow is for an individual solution or for small workloads, you can think of Power Query as the “compute” part that transforms data that lands in the data lake and can then be used as part of an enterprise solution

In addition to what products to use to clean the data, another consideration is where to clean the data. Say I have an on-prem SQL Server database and I’m exporting some tables into CSV files and then copying them to the data lake (ADLS Gen2) and then to Azure SQL DW. And I not only have to clean the data but join these CSV files with other files also in the data lake. Should I clean the data as it moves from the data source to the data lake via ADF or always land the raw data in the data lake and then clean it and join it and land it back into the data lake (via ADF of Databricks)? Or clean it as it moves from the data lake to SQL DW via ADF or Databricks? In most cases you will want do all the transforming in the data lake. So copy the raw data into the data lake, transform it and write it back into the data lake (so you have multiple data lake layers such as raw and cleaned), then copy it to SQL DW, all of which can be orchestrated by ADF. But this is not a “one size fits all” as you may be building a solution that is moving lots of data from many different data sources as there may be a handful of use cases within that solution where it may be faster and/or easier to clean the data as it moves to or from the data lake.

More info:

Decision Tree for Enterprise Information Management (EIM)

Big Data Decision Tree v4

↧

Storage options for SQL Server database files in Azure

January 29, 2019, 8:00 am

≫ Next: Azure Data Lake Store Gen2 is GA

≪ Previous: What product to use to transform my data?

If you are using SQL Server in an Azure VM (IaaS) you have a number of options of where to store the database files (.mdf, .ldf, and .ndf). Most customers use managed disks, available in a number of offerings: Standard HDD, Standard SSD, Premium SSD, and Ultra SSD. Managed disks are highly recommended to use over unmanaged disks (see Azure Managed vs Unmanaged disks : The choice). You create and then attach one or more managed disks to the VM. Each VM comes with built-in storage, but this is a temporary storage drive, labeled as the D: drive, and is not persisted to Azure blob storage. You do not want to store your user database files or user transaction log files on the D: drive (see Performance best practices for Azure VM). The same goes for the C: drive, which holds the operating system, and should not be used for database files.

If using managed disks, the recommendation is to attach several premium SSDs to a VM. From Sizes for Windows virtual machines in Azure you can see for the size DS5_v2 you can add 64 of 4TB data disks (P50) yielding the potential total volume size of up to 256TB per VM, and if you use the preview sizes (P80) your application can have up to around 2PB of storage per VM (64 of 32TB disks). With premium storage disks, your applications can achieve 80,000 I/O operations per second (IOPS) per VM, and a disk throughput of up to 2,000 megabytes per second (MB/s) per VM. Note the more disks you add the more storage you get and the more IOPS you get, but only up to a certain point due to VM limits, meaning at some point more disks get you more storage but not more IOPS.

Ultra SSD is in preview and an option for data-intensive workloads such as SAP HANA, top tier databases, and transaction-heavy workloads and has a disk capacity up to 4PB (64 of 64TB disks) and 160,000 IOPS per disk with a disk throughput of up to 2,000 megabytes per second (MB/s) per VM. Ultra SSD has higher performance than Premium SSD so you can use fewer of them to achieve the performance you need, simplifying things (i.e. possibly not having to create a storage space) and saving costs without sacrificing IOPS (see Mission critical performance with Ultra SSD for SQL Server on Azure VM). Note that there is no SQL Server VM image that uses Ultra SSD from the Azure Marketplace yet.

In the past, after provisioning a SQL Server VM (a SQL VM created from an Azure Marketplace image), you had to manually attach and configure the right number of data disks to provide the desired number of IOPS or throughput (MB/s). Then you needed to stripe your SQL files across the disks or create a Storage Pool to divide the IOPS or throughput across them. Finally, you’d have to configure SQL Server according to the performance best practices for Azure VM.

A couple of years ago, Microsoft made this part of the provisioning experience (but only when choosing a SQL Server VM from the Marketplace, not a regular VM). When creating the VM you can easily configure the desired IOPS, throughput, and storage size within the limits of the selected VM size, as well as the target workload to optimize for (either online transaction processing or data warehousing). As you change the IOPS, throughput, and storage size, it will automatically select the right number of disks to attach to the VM. During the VM provisioning, if more than one disk is required for your specified settings, it will automatically create one Windows storage space (virtual drive) across all disks (see Windows Server 2012 Storage Virtualization Explained). For more details see Storage configuration for SQL Server VMs.

If you are using Ultra SSD, since there is no SQL Server VM image in the Azure Marketplace that uses it, you will need to implement storage spaces manually if you wish to use more than one data disk – see Implementing Storage Spaces Inside Azure Virtual Machines.

Keep in mind that Azure premium managed disks are just premium page blobs with some API make-up so they look like disks (disks and blobs use different APIs). Azure storage page blobs do not officially announce performance numbers per size while Azure premium managed disks announce this info, so that is why tables like this will show performance numbers for disks and not page blobs. Page blobs are one of three types of blob storage, the other two being block blobs and append blobs.

If you need extreme performance for SQL Server and are OK with not having high-availability for storage, the Lsv2-series VMs are optimized to use the local disk on the node attached directly to the VM rather than using durable data disks. This allows for greater IOPS / throughput for your workloads. It supports up to 19.2TB of storage. Your applications can achieve up to 3.4M I/O operations per second (IOPS) per VM, and a disk throughput of up to 22,000 megabytes per second (MB/s) per VM. However, you miss out on the high-availability benefits of Azure blob storage, so instead of using it for SQL Server, it is better used for NoSQL stores such as Apache Cassandra and MongoDB which replicate data across multiple VMs to achieve persistence in the event of the failure of a single VM.

Another option for storing the database files is Azure blob storage. I see this used for some on-prem SQL Server installations when the latency caused by accessing blob storage from on-prem is not a big deal (see SQL Server data files in Microsoft Azure and SQL Server 2014 and Windows Azure Blob Storage Service: Better Together). It is also sometimes used when using SQL Server in a VM with a large database (3-30 TB range) and a fast backup/restore is needed, as a file-snapshot backup (BACKUP DATABASE using WITH FILE_SNAPSHOT) could be done very quickly (note a file-snapshot cannot be done when database files are on managed disks, so you must do streaming backups). Other benefits that come with storing the database files in Azure blob storage is Instant Log Initialization for SQL Server in Azure, fast consistency checking, and reporting off of a restored copy of the snapshot. Be aware there are some limitations: you are limited to a database size of 35TB and geo-replication for your storage account is not supported (if a storage account is geo-replicated and a geo-failover happened, database corruption could occur), there is no local SSD cache to potentially improve storage performance and the same network bandwidth is shared between the VM network traffic and the database storage traffic, which can affect network intensive applications. For more info see SQL Server VLDB in Azure: DBA Tasks Made Simple.

In looking at the PaaS version of SQL Server, Azure SQL Database, here is the underlying persistent storage that is used for its databases for the various options:

Singleton: DTU Basic tiers (max 2GB database size) and Standard tiers (max 1TB database size) use blob storage, Premium tier (max 4TB database size) uses a locally attached SSD; vCore General Purpose tier (max 4TB database size) uses premium blob storage, Business Critical tier (max 4TB database size) uses a locally attached SSD
Managed Instance: General Purpose tier (max 8TB database size) uses premium blob storage, Business Critical tier (max 4TB database size) uses a locally attached SSD
Hyperscale: Uses standard blob storage (max 100TB database size) but achieves very high performance due to multiple SSD-based caching tiers

More info:

Storage performance best practices and considerations for Azure SQL DB Managed Instance (General Purpose)

Azure SQL Database high availability

Premium blob storage

High-performance Premium Storage and managed disks for VMs

About disks storage for Azure Windows VMs

Tutorial: Use Azure Blob storage service with SQL Server 2016

Storage Configuration Guidelines for SQL Server on Azure VM

How to Choose the Correct Azure VM Disk Tier

↧

Azure Data Lake Store Gen2 is GA

February 19, 2019, 8:00 am

≫ Next: My latest presentations

≪ Previous: Storage options for SQL Server database files in Azure

Azure Data Lake Store (ADLS) Gen2 was made generally available on February 7th. In short, ADLS Gen2 is the best of the previous version of ADLS (now called ADLS Gen1) and Azure Blob Storage. ADLS Gen2 is built on Blob storage and because of that it is a “ring 0” service and is available in all regions. To create an ADLS Gen2 account, see Quickstart: Create an Azure Data Lake Storage Gen2 storage account. Note that creating Blob storage or ADLS Gen2 both follow the same process, which is to create a storage account. The difference is there is an Advanced tab with an ADLS Gen2 section where you set Hierarchical namespace (HNS) to Enabled to make the account ADLS Gen2 instead of Blob.

ADLS Gen2 has most of the features of both ADLS Gen1 and Blob storage (with the features not supported yet listed below). Features currently supported include limitless storage capacity, Azure Active Directory integration, hierarchical file system, and read-access geo-redundant storage.

When to use Blob vs ADLS Gen2

New analytics projects should use ADLS Gen2, and current Blob storage should be converted to ADLS Gen2, unless these are non-analytical use cases that only need object storage rather than hierarchical storage (i.e. video, images, backup files), in which case you can use Blob Storage and save a bit of money on transaction costs (storage costs will be the same between Blob and ADLS Gen2 but transaction costs will be a bit higher for ADLS Gen2 due to the overhead of namespaces).

Upgrade guidance

Soon existing Blob storage accounts (general purpose v2) will have a seamless mechanism to convert it to ADLS Gen2 to get all the features of ADLS Gen2 (eg. go to the Configuration screen and set Hierarchical namespace to Enabled). If your Blob storage account is general purpose v1, you will first need to upgrade it to general purpose v2 before you can set Hierarchical namespace to Enabled. For now you can’t upgrade to ADLS Gen2 this way so you will need to copy your data from Blob storage to ADLS Gen2.

For existing customers of ADLS Gen1, no new features will be added to ADLS Gen1. You can stay on ADLS Gen1 if you don’t need any of the new capabilities or can move to ADLS Gen2 where you can leverage all the goodness of the combined capabilities and save money (ADLS Gen2 pricing will be roughly half of ADLS Gen1). You can upgrade when you chose to do so, which is done by copying the data from ADLS Gen1 to ADLS Gen2 as well as any app modifications required to run on ADLS Gen2, with migration tools coming soon. See Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2 for help on migrating as well as the list of some Azure products and 3rd-party products that don’t yet support ADLS Gen2 that may make you want to hold off on migrating.

What works and what does not work

You might want to stay with Blob storage for now if there is a feature that is not yet supported in ADLS Gen2 such as soft delete, snapshots, object level storage tiers (Hot, Cool, and Archive), and lifecycle management (see Known issues with Azure Data Lake Storage Gen2). If you are using a 3rd-party app or an Azure app, make sure that it supports ADLS Gen2 (see Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2). If you are using the WASB or ADLS driver, it will be as simple as switching to the new ADLS Gen2 driver and changing configs.

Blob Storage APIs and ADLS Gen2 APIs aren’t interoperable with each other yet in an ADLS Gen2 account (an account with the HNS enabled). If you have tools, applications, services, or scripts that use Blob APIs, and you want to use them to work with all of the content that you upload to your account, then don’t upgrade yet until Blob APIs become interoperable with ADLS Gen2 APIs. Using a storage account without a hierarchical namespace means you then don’t have access to ADLS Gen2 specific features, such as directory and filesystem access control lists.

Note that the file size limit in ADLS Gen2 is 5TB.

Replication options

Below is an excellent slide that describes the four replication options available with Azure Blob Storage which is also available for ADLS Gen2. Notice the SLA which is a guarantee of the percentage of time Azure will successfully process requests to read data or write data from storage. Also notice the durability, which specifies the chances of losing data, which is incredible small.

More info:

Individually great, collectively unmatched: Announcing updates to 3 great Azure Data Services

10 THINGS TO KNOW ABOUT AZURE DATA LAKE STORAGE GEN2

↧

My latest presentations

March 11, 2019, 8:00 am

≫ Next: Azure Data Explorer

≪ Previous: Azure Data Lake Store Gen2 is GA

I frequently present at user groups, and always try to create a brand new presentation to keep things interesting. We all know technology changes so quickly so there is no shortage of topics! There is a list of all my presentations with slide decks. Here are the new presentations I created the past year:

Building a modern data warehouse

Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap others. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions of when to use what products and the pros/cons of each. (slides)

AI for an intelligent cloud and intelligent edge: Discover, deploy, and manage with Azure ML services

Discover, manage, deploy, monitor – rinse and repeat. In this session we show how Azure Machine Learning can be used to create the right AI model for your challenge and then easily customize it using your development tools while relying on Azure ML to optimize them to run in hardware accelerated environments for the cloud and the edge using FPGAs and Neural Network accelerators. We then show you how to deploy the model to highly scalable web services and nimble edge applications that Azure can manage and monitor for you. Finally, we illustrate how you can leverage the model telemetry to retrain and improve your content. (slides)

Power BI for Big Data and the New Look of Big Data Solutions

New features in Power BI give it enterprise tools, but that does not mean it automatically creates an enterprise solution. In this talk we will cover these new features (composite models, aggregations tables, dataflow) as well as Azure Data Lake Store Gen2, and describe the use cases and products of an individual, departmental, and enterprise big data solution. We will also talk about why a data warehouse and cubes still should be part of an enterprise solution, and how a data lake should be organized. (slides)

How to build your career

In three years I went from a complete unknown to a popular blogger, speaker at PASS Summit, a SQL Server MVP, and then joined Microsoft. Along the way I saw my yearly income triple. Is it because I know some secret? Is it because I am a genius? No! It is just about laying out your career path, setting goals, and doing the work.

I’ll cover tips I learned over my career on everything from interviewing to building your personal brand. I’ll discuss perm positions, consulting, contracting, working for Microsoft or partners, hot fields, in-demand skills, social media, networking, presenting, blogging, salary negotiating, dealing with recruiters, certifications, speaking at major conferences, resume tips, and keys to a high-paying career.

Your first step to enhancing your career will be to attend this session! Let me be your career coach! (slides)

Is the traditional data warehouse dead?

With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that? No! In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I’ll put it all together by showing common big data architectures. (slides)

↧

Azure Data Explorer

March 14, 2019, 8:00 am

≫ Next: Two more of my presentations

≪ Previous: My latest presentations

Azure Data Explorer (ADX) was announced as generally available on Feb 7th. In short, ADX is a fully managed data analytics service for near real-time analysis on large volumes of data streaming (i.e. log and telemetry data) from such sources as applications, websites, or IoT devices. ADX makes it simple to ingest this data and enables you to perform complex ad-hoc queries on the data in seconds – ADX has speeds of up to 200MB/sec per node (currently up to 3 nodes) and queries across a billion records take less than a second. A typical use case is when you are generating terabytes of data from which you need to understand quickly what that data is telling you, as opposed to a traditional database that takes longer to get value out of the data because of the effort to collect the data and place it in the database before you can start to explore it.

It’s a tool for speculative analysis of your data, one that can inform the code you build, optimizing what you query for or helping build new models that can become part of your machine learning platform. It can not only work on numbers but also does full-text search on semi-structured or un-structured data. One of my favorite demo’s was watching a query over 6 trillion log records, counting the number of critical errors by doing a full-text search for the word ‘alert’ in the event text that took just 2.7 seconds. Because of this speed, ADX can be a replacement for search and log analytics engines such as elasticsearch or Splunk. One way I heard it described that I liked was to think of it as an optimized cache on top of a data lake.

Azure Data Explorer integrates with other major services to provide an end-to-end solution that includes data collection, ingestion, storage, indexing, querying, and visualization. It has a pivotal role in the data warehousing flow by executing the EXPLORE step below on terabytes of diverse raw data.

In addition to getting instant feedback from streaming data (via ad-hoc analysis of source data), another way to use ADX is to have it consume large amounts of streaming data, aggregate it, have it land in its internal database, and then have that aggregated data copied to ADLS Gen2 in CSV format via data export or Azure Data Factory. Once there, it can then be transformed and cleaned using another tool such as Databricks (since ADX can’t transform data) and then copied to a traditional data warehouse such as Azure SQL Data Warehouse. This can be much faster than trying to use other tools such as Azure Data Factory to land tons of streaming data into ADLS Gen2 and much less expensive than landing streaming data into Azure SQL Data Warehouse.

The following diagram shows the different aspects of working with Azure Data Explorer:

Work in Azure Data Explorer generally follows this pattern:

Create database: Create a dedicated cluster (currently available VMs are D11_v2, D12_v2, D13_v2, D14_v2, L4, L8, L16) and then create one or more databases in that cluster. See Quickstart: Create an Azure Data Explorer cluster and database. You can scale up the cluster (increasing the size of the node/instance) or scale out the cluster, also known as autoscale (adding more nodes/instances)
Ingest data: Load data from a streaming or a batch source (event hub, iot hub, blob storage or ADLS Gen2 via event grid, Kafka, Logstash plugin, apps via APIs, or Azure Data Factory) into database tables in the database created in step 1 so that you can run queries against it. This is done by creating a target table in ADX and connecting the source such as event hub to this table and mapping the incoming data to the column names in the target table. Supported formats are Avro, CSV, JSON, MULTILINE JSON, PSV, SOH, SCSV, TSV, and TXT. The database is maintenance free so no DBA is needed for it. See Quickstart: Ingest data from Event Hub into Azure Data Explorer
Query database: Use an Azure web application to run, review, and share queries and results (you can go to the web application directly via dataexplorer.azure.com). It is available in the Azure portal and as a stand-alone application. In addition, you can send queries programmatically (using an SDK to query data from your code) or to a REST API endpoint. The Azure portal-based query builder displays the results in a table where you can filter your data and even apply basic pivot table options. It also includes tools for basic visualization, and you can choose from a range of chart types. ADX uses a different query language called Kusto Query Language (KQL) that is for querying only (you can not update or delete data using KQL) but is a very robust and powerful language that has 400+ operators (filtering, aggregation, grouping, joining, etc as well as full-text indexing and time series analysis) and has built-in machine learning features such as clustering, regression, detecting anomalies or doing forecasting. KQL can also query data from a SQL Database and Cosmos DB. KQL is the same language used in Azure Log Analytics and Application Insights. You can find more information about the query language here. You can also take an online PluralSight course. See Quickstart: Query data in Azure Data Explorer
You can also visualize your analysis in ADX either through the ‘render’ command in KQL or via three options for connecting to data in Power BI: use the built-in connector, import a query from Azure Data Explorer, or use a SQL query. Finally, there is a Python plugin so you can start using tools like the Anaconda analytics environment and Jupyter Notebooks with saved trained ML models that work with your Azure Data Explorer data sets

All the products and tools that can interact with ADX:

Capability for many data types, formats, and sources
Structured (numbers), semi-structured (JSON\XML), and free text
Batch or streaming ingestion
Use managed ingestion pipeline or queue a request for pull ingestion
Compute and storage isolation
– Independent scale out / scale in
– Persistent data in Azure Blob Storage
– Caching for low-latency on compute
Multiple options to support data consumption
Use out-of-the box tools and connectors or use APIs/SDKs for custom solution

Kusto stores its ingested data in reliable storage (Azure Blob Storage), away from its actual processing (e.g. Azure Compute) nodes. To speed-up queries on that data, Kusto caches this data (or parts of it) on its processing nodes, SSD or even in RAM. Kusto includes a sophisticated cache mechanism designed to make intelligent decisions as to which data objects to cache. The Kusto cache provides a granular cache policy (set when creating a database in step #1 above) that customers can use to differentiate between two data cache policies: hot data cache and cold data cache. The Kusto cache will attempt to keep all data that falls into the hot data cache in local SSD (or RAM), up to the defined size of the hot data cache. The remaining local SSD space will be used to hold data that is not categorized as hot.

The main implications of setting the hot cache policy are:

Cost: The cost of reliable storage can be dramatically cheaper than local SSD (for example, in Azure it is currently about 45x times cheaper)
Performance: Data can be queried faster when it is in local SSD. This is particularly true for range queries, i.e. queries that scan large quantities of data

For more details on why ADX is so fast, check out Azure Data Explorer Technology 101 and Azure Data Explorer whitepaper.

Note that Azure Time Series Insights (TSI) is built on top of Azure Data Explorer. TSI is a SaaS offering as part of the IoT product suite and ADX is a PaaS offering, part of the Azure data platform, powering TSI and other SaaS apps. So, using ADX allows us to get a layer deeper than TSI and build a custom solution. Products like TSI, Azure Monitor Log Analytics, Application insights, and Azure Security Insights are pre-built solutions for a specific purpose, where ADX is used when you are building your own analytics solution or platform and need full control of data sources, data schema and data management, the powerful analytics capabilities of KQL over time series and telemetry data, and a reserved resources cost model.

To sum up in one sentence, Azure Data Explorer is a big data analytics cloud platform optimized for interactive, ad-hoc queries on top of fast flowing data. To get an idea of the cost, check out the cost estimator and note that your cluster can be stopped (you only pay for when it is running).

More info:

Azure Data Explorer Technology 101

Azure Data Explorer whitepaper

What is Azure Data Explorer and Kusto Querying Language (KQL)?

Time series analysis in Azure Data Explorer

How to save $$$ when using Azure Data Explorer

How to use Azure Data Explorer for large-scale data analysis

Video An overview of Azure Data Explorer (ADX)

Video Azure Data Explorer – Query billions of records in seconds!

↧

Two more of my presentations

April 4, 2019, 8:00 am

≫ Next: Where should I clean my data?

≪ Previous: Azure Data Explorer

I recently made available two more presentations that you might find helpful. Feel free to download them and present them to others (adding a line that you got them from me is all I ask). There is a full list of all my presentations with slide decks here. I continually update all my decks, but unfortunately SlideShare no longer allows you to update decks, so just email me at jamesserra3@gmail.com if you want the latest one.

Azure data platform overview

Want to see a high-level overview of the products in the Microsoft data platform portfolio in Azure? I’ll cover products in the categories of OLTP, OLAP, data warehouse, storage, data transport, data prep, data lake, IaaS, PaaS, SMP/MPP, NoSQL, Hadoop, open source, reporting, machine learning, and AI. It’s a lot to digest but I’ll categorize the products and discuss their use cases to help you narrow down the best products for the solution you want to build. (slides)

Machine Learning and AI

The breath and depth of Azure products that fall under the AI and ML umbrella can be difficult to follow. In this presentation I’ll first define exactly what AI, ML, and deep learning is, and then go over the various Microsoft AI and ML products and their use cases. (slides)

↧

Where should I clean my data?

April 25, 2019, 8:00 am

≫ Next: Microsoft Build event announcements

≪ Previous: Two more of my presentations

As a follow-up to my blogs What product to use to transform my data? and Should I load structured data into my data lake?, I wanted to talk about where you should you clean your data when building a modern data warehouse in Azure. As an example, let’s say I have an on-prem SQL Server database and I want to copy one million rows from a few tables to a data lake (ADLS Gen2) and then to Azure SQL DW, where the data will be used to generate Power BI reports (for background on a data lake, check out What is a data lake? and Why use a data lake? and Data lake details). There are five places that you could clean the data:

Clean the data and optionally aggregate it as it sits in source system. The tool used for this would depend on the source system that stores the data (i.e. if SQL Server, you would use stored procedures). The only benefit with this option is if you aggregate the data, you will move less data from the source system to Azure, which can be helpful if you have a small pipe to Azure and don’t need the row-level details. The disadvantages are: the raw source data is not available in the data lake, so you would always need to go back to source system if you needed to get it again, and it may not even still exist in the source system; you would put extra stress on the source system when doing the cleaning which could affect end users using the system; it could take a long time to clean the data as the source system may not have fast performance; and you would not be able to use other tools (i.e. Hadoop, Databricks) to clean it. Strongly advise against this option
Clean data as it goes from source to the data lake. The products that could be used for this include SSIS, Azure Data Factory (ADF) Data Flows (renamed to ADF Mapping Data Flows), Databricks, or Power BI Dataflow. Note that ADF Mapping Data Flows can only be used against four data sources: Azure SQL DB, Azure SQL DW, Parquet (from ADLS & Blob), and Delimited Text (from ADLS & Blob). Think of these four data sources as staging areas for the data to be used by ADF Mapping Data Flows. The disadvantages of this option are: It will cause additional time pulling data from the source system which could affect performance of end users using the system; and you won’t have the raw data in the data lake. If using SSIS or Power BI Dataflow it would cause nonparallel line-by-line transformations (see Difference between ETL and ELT) which could be very slow when cleaning many rows of data. See below for how SSIS compares to ADF Mapping Data Flows and Databricks. Strongly advise against this option
Clean data after it has landed into data lake. You land the data into a raw area in the data lake, clean it, then write it to a cleaned area in the data lake (so you have multiple data lake layers such as raw and cleaned), then copy it to SQL DW via Polybase, all of which can be orchestrated by ADF. This is the preferred option as it results in having the raw data in the data lake and minimizes the time hitting the source system, as well as saving costs as opposed to cleaning the data in a data warehouse (see Reduce Costs By Adding A Data Lake To Your Cloud Data Warehouse). This will also be the way to clean the data the fastest. ADF makes it real easy to move data from a source system (supports 79 sources) to the data lake (ADLS Gen2) by creating an ADF pipeline that uses the Copy Activity with a Self-Hosted Integration Runtime connected to your on-prem SQL Server. You could land the data in the data lake in Parquet format with a high Data Integration Units (DIU) setting to make it super-fast. ADF also makes it real easy via the Copy Activity to copy the cleaned data in the data lake to 23 destinations (“sinks”) such as SQL DW. In addition, you could copy the cleaned data in the data lake to an MDM solution to master it, then write it back to the data lake in another layer called the mastered layer. The cleaned area is also where you could join the data from various files together or split them apart for security reasons (i.e. having a file with rows from different countries split into multiple files for each country). Products that could be used to clean the data: ADF Mapping Data Flows, Databricks, or HDInsight. See below for how ADF Mapping Data Flows compares to Databricks
Clean data as it moves from raw area in data lake to SQL DW. The disadvantages of using this option is you won’t have the clean data in the data lake, it would do row-by-row transformations which are slow, and some products don’t use Polybase so it would be slow to load the data into SQL DW. Products that can be used for this include ADF Mapping Data Flows or Databricks using the SQL DW Connector (which uses Polybase). Strongly advise against this option
Copy data from raw area in data lake to SQL DW then clean it via stored procedures in SQL DW. The disadvantages of using this option is it is more costly as SQL DW costs more than Databricks, the data transformations can affect users running queries on SQL DW, and the cleaned data won’t be available in the data lake for others to use. You would use ADF to copy the data from the raw data area into staging in SQL DW, which would use Polybase. Then execute stored procedures that would clean the data and copy it to production tables. Strongly advise against this option

Be aware this is not a “one size fits all” as you may be building a solution that is moving lots of data from many different data sources as there may be a handful of use cases within that solution where it may be faster and/or easier to clean the data as it moves to or from the data lake.

A bit about how ADF Mapping Data Flows works “under the covers”. ADF Mapping Data Flows is a big deal as it brings GUI-based design together with scale. It uses the Azure Databricks compute engine under the covers: ADF JSON code is converted to the appropriate code in the Scala programming language and will be prepared, compiled and executed in Azure Databricks which will automatically scale-out as needed. The distribution of data across worker/compute nodes in Databricks is where the optimizations come in: each of the transformations in the data flow has an “optimize” tab that lets you control how the data gets partitioned across the worker nodes. If your source is Azure SQL DB, it’ll be using the out-of-the-box JDBC driver, which means the Databricks workers are accessing Azure SQL DB directly. There are slight exceptions – if Azure SQL Data Warehouse is the source/destination, then it can land the data temporarily in blob storage (not the hierarchical namespace / ADLS Gen2) before Polybasing it in. If the source/destination can Polybase, you set a staging linked service/container in the Data Flow activity settings.

How does ADF Mapping Data Flows compare to using SSIS? In SSIS, you would create an SSIS Data Flow Task to do, for example, a Lookup and a Data Conversion on the one million rows. This could be slow because it would have to copy the data from SQL Server in batches (usually 10,000 rows) to the SSIS server (so your dependent on the size of the SSIS server), then would go row-by-row updating the data, and landing the batch to a cleaned layer in ADLS Gen2 before doing the next one. With ADF Mapping Data Flows, you create an ADF pipeline that uses the Copy Activity to copy the one million rows from SQL Server to a raw area in ADLS Gen2, then create a Data Flow activity in the ADF pipeline to do the transformations (see Azure Data Factory Data Flow), which behind-the-scenes fires up Databricks, puts the data in a Spark in-memory DataFrame across the workers, and runs Scala code on the DataFrame to update the rows. This is very fast because the reads from that data lake into memory are done in parallel by the Databricks worker engines (you can be reading as many extents of that file as you have worker cores), and because Databricks can be scaled for the transformations (it’s still doing row-by-row iterations, but it’s doing them in parallel according to the number of workers/cores and the way the data is distributed across those workers). After the transformations are done, you would use the ADF Mapping Data Flow Sink Transformation to save this data in a cleaned layer in ADLS Gen2. So the end result is ADF Mapping Data Flows is likely to be much faster than SSIS.

What is the difference between using Databricks vs ADF Mapping Data Flows? ADF Mapping Data Flows is much easier to use because of its drag-and-drop interface, while Databricks is all code-based. However, use Databricks if you like notebooks, want to use ML tools, and want more monitoring. Performance when using ADF Mapping Data Flows would be the same as using Databricks separately or when using the Databricks Notebook activity in ADF unless you used a different partitioning scheme (controlled by the “optimize” tab for ADF Mapping Data Flows) or cluster size (default is 8 cores of general compute for ADF Mapping Data Flows – the number of worker nodes is dictated by the number of cores you use – the driver node is assigned 4 cores with the rest assigned to workers), but also depends on the startup time for Databricks when using an ADF Mapping Data Flow (which can take five minutes, but this can be avoided if you turn on the Mapping Data Flow Debug Mode) compared to using a separate Databricks cluster that is already running (which is instantaneous if you keep it running but that can be very expensive). Also keep in mind that Databricks allows you to build a generic solution that could clean data in many files, while ADF Mapping Data Flows requires a package per object.

More info:

ADF Data Flow tutorial videos

ADF Mapping Data Flow Patterns

↧

Microsoft Build event announcements

May 16, 2019, 8:00 am

≫ Next: Common Data Model

≪ Previous: Where should I clean my data?

Another Microsoft event and another bunch of exciting announcements. At the Microsoft Build event last week, the major announcements in the data platform and AI space were:

Machine Learning Services enhancements

In Private Preview is a new visual interface for Azure Machine Learning adds drag-and-drop workflow capabilities to Azure Machine Learning service. It simplifies the process of building, testing, and deploying machine learning models for customers who prefer a visual versus coding experience. This integration brings the best from ML Studio and AML service together. The drag-n-drop experience let any data scientist can quickly build a model without coding. The tool also gives enough flexibility for data scientist to fine tune its model. The AML service as the backend platform offer all the scalability, security, debuggability … that ML studio can’t give. The easy deployment capability in visual interface enables easy generation of score.py file and creating images. With a few clicks, a trained model can be deployed to any AKS cluster associated with AML service. Think of this as Azure Machine Learning Studio V2. More info

Create, run, and explore automated machine learning experiments in the Azure portal without a single line of code. Automated machine learning automates the process of selecting the best algorithm to use for your specific data, so you can generate a machine learning model quickly. More info

Azure Cognitive Services enhancements

The launch of Personalizer, along with Anomaly Detector and Content Moderator, are part of the new Decision category of Cognitive Services that provide recommendations to enable informed and efficient decision-making for users. Available now in preview. More info

Wrangling Data Flows in Azure Data Factory

A new capability called Wrangling Data Flows, available in preview, gives users the ability to explore and wrangle data at scale. Wrangling Data Flows empowers users to visually discover and explore their data without writing a single line of code. It is described at https://mybuild.techcommunity.microsoft.com/sessions/76997?source=speakerdetail#top-anchor (at 21 min). It is what you use with PowerQuery Online. The execution runtime underneath is Mapping Dataflows but the user interface is PowerQuery UI. More info

Azure Database for PostgreSQL Hyperscale (Citus)

Now in Public Preview, this brings high-performance scaling to PostgreSQL database workloads by horizontally scaling a single database across hundreds of nodes to deliver blazingly fast performance and scale. With horizontal scale-out, you can fit more data in-memory, parallelize queries across hundreds of nodes, and index data faster. The addition of Hyperscale (Citus) as a deployment option for Azure Database for PostgreSQL simplifies your infrastructure and application design, saving organizations time to focus on their business needs. More info

Azure SQL Database Hyperscale

Now Generally Available. More info

Azure SQL Database Serverless

Azure SQL Database serverless is a new compute tier that optimizes price-performance and simplifies performance management for databases with intermittent, unpredictable usage. Serverless automatically scales compute for single databases based on workload demand and bills for compute used per second. Serverless databases automatically pause during inactive periods when only storage is billed and automatically resume when activity returns. Serverless helps customers focus on building apps faster and more efficiently. with all databases in SQL Database, serverless databases are always up-to-date, highly available, and benefit from built-in intelligence for further security and performance optimization.

Azure SQL Data Warehouse’s support for semi-structured data

Azure SQL Data Warehouse now supports semi-structured data. Now with one service, both structured and semi-structured data formats (like JSON) can now be analyzed directly from the data warehouse for faster insights. In private preview. More info

Azure Data Explorer enhancements

The new Spark connector will enable customers to seamlessly read data from Azure Data Explorer into a Spark Dataframe as well as ingest data from it. This enables a number of use cases related to data transformation and machine learning in Spark and the ability to use Azure Data Explorer as a data source/destination for interactive analytics or to operationalize machine learning models for fast scoring of data arriving in Azure Data Explorer.

Azure Data Explorer Python and R plugins will enable customers to embed Python and R code as part of the query. More info

Continuous data export: This feature writes CSV or Parquet files to the data lake, as data streams in via event hubs, IoT hubs, or any other path. It enables building analytical solution over fresh data and seamlessly fill the data lake.

The ability to query data in the lake in its natural format using Azure Data Explorer. Simply define this data as an external table in Azure Data Explorer and query it. You can then join/union it with more data from Azure Data Explorer, SQL servers, and more.

Autoscale for Azure HDInsight

The Autoscale capability for Azure HDInsight is now available in public preview. Autoscale automatically scales your Spark, Hive, or MapReduce HDInsight clusters up or down based on load or a pre-defined schedule.

Load-based Autoscale uses several workload-specific metrics, such as CPU and memory usage, to intelligently scale your cluster between user-configurable min and max sizes based on the load
Schedule-based Autoscale allows you to set your own custom schedule for cluster scaling. For example, you can set your cluster to scale up to 10 nodes starting at 9:00 am and scale back down to 3 nodes at 9:00 pm on working days

The Autoscale capability can be configured using the Azure portal or programmatically using ARM templates.

↧

Common Data Model

June 6, 2019, 12:00 pm

≫ Next: The Microsoft Power Platform

≪ Previous: Microsoft Build event announcements

The Common Data Model (CDM) is a shared data model that is a place to keep all common data to be shared between applications and data sources. Another way to think of it is is a way to organize data from many sources that are in different formats into a standard structure.

The Common Data Model includes over 340 standardized, extensible data schemas that Microsoft and its partners have published. This collection of predefined schemas includes entities, attributes, semantic metadata, and relationships. The schemas represent commonly used concepts and activities, such as Account and Campaign, to simplify the creation, aggregation, and analysis of data. Having the metadata stored along with the data is very helpful when using a data lake which normally does not have metadata due to the schema-on-read nature. This graphic shows some elements of the standard entities (for more information see Common Data Model repository on GitHub):

Use the Visual Entity Navigator for interactively exploring entities, entity extensions/inheritance, attributes, and relationships.

Industry Accelerators are foundational components within the Microsoft Power platform and Dynamics 365 that enable ISVs and other solution providers to quickly build industry vertical solutions. The accelerators extend the Common Data Model to include new entities to support a data schema for concepts within specific industries. Think of the time savings if you are in one of these industries and need to create a data model. Instead of spending weeks or months creating one from scratch, you have one already built for you. Microsoft is focused on delivering accelerators for these industries and others:

The CDM is already supported in the Common Data Services for Apps, Dynamics 365, PowerApps, Power BI via Dataflows, and it will be supported in many upcoming Azure data services.

With Power BI Dataflows, the common data model stores the data into Azure Data Lake Storage (ADLS) Gen2, either internal storage provided by Power BI or stored in your organization’s ADLS Gen2 account (see Dataflows and Azure Data Lake integration (Preview)). Dataflow can map the output of a query to an entity in the common data model. This feature is handled with the “Map to Standard” option in the Dataflow Power Query Editor (see CDM and Azure Data Services Integration). Note that healthcare is currently the only industry accelerator supported for Power BI Dataflows.

The following diagram shows a sample CDM folder in ADLS Gen2, created by a Power BI Dataflow, that contains three entities:

Still to be determined is how exactly CDM fits into a modern data warehouse solution as there are still a number of products that do not have integration points into it yet (i.e. Azure Data Factory is not yet able to write directly to CDM). However, there are manual steps you can do to integrate Azure Data Factory, Databricks, Azure Machine Learning, and SQL Data Warehouse with CDM until the products gain built-in integration (see CDM and Azure Data Services Integration).

If we think about just Power BI Dataflows, the current state is limiting as Power BI requires its own filesystem inside of ADLS Gen2, which means that sadly we can’t integrate any of it with the rest of our data lake. In turn that means it’s difficult to frame the enterprise data lake plus Power BI Dataflows, which may be CDM entities or custom entities, plus CDM from other products like Dynamics.

Also, keep in mind that a modern data warehouse also has a relational database in the solution that is fed from the data lake, so the CDM in the data lake would need to be replicated in the relational database. However, using Power BI Dataflows a business user can do the work of copying and transforming data from multiple source systems into the CDM in the data lake, instead of waiting for IT to do it (who may have a backlog and who may not understand the data).

One more thing to point out: the Open Data Initiative (ODI) is a collaboration between Microsoft, Adobe, and SAP to use the same common data models in their products.

More info:

Video Dataflows and data interop with the Common Data Model

Video Tips & Tricks for getting data of any shape and size into Azure Data Lake using Dataflows

What is the Common Data Model and Why Should I Care? Part 3 of Dataflow Series in Power BI

Video Introduction to the Microsoft Common Data Model

Video Microsoft Common Data Model (CDM): An introductory session – BRK2052

↧

The Microsoft Power Platform

June 25, 2019, 8:00 am

≫ Next: Big Data Workshop

≪ Previous: Common Data Model

The Microsoft Power Platform consists of three products: Power BI, PowerApps, and Microsoft Flow. I find customers are confused on the use cases of these products and how they compare to other products (Azure Functions, Logic Apps, and Azure Data Factory). So I wanted to write a blog and describe at a high level what these products are, their primary use case, how they interact, and how they compare.

Power BI – A self-service reporting tool that allows you to visually explore and analyze data to get better insights into your data so you can make better business decisions. The reports can be shared with colleagues on any device. This product is easy to understand, although its new feature, dataflows, makes things a bit confusing.

PowerApps – Allows you to quickly build custom web and mobile business applications that can be shared within your organization without custom coding using any data source you wish (via over 300 connectors or build a custom connector). Example: Create an app to allow employees to view and compare laptops and place an order, entering data along the way. Some people call PowerApps the next version of Microsoft Access. It is not meant for external usage, only internal because of licencing and security reasons (with some exceptions via the upcoming PowerApps Portals). PowerApps can also store data via the Common Data Service (CDS) and uses the Common Data Model. The data in CDS can be consumed by other products such as Power BI. Within PowerApps you can also load data from 23 external data sources to the Common Data Service using Power Query. Note that there are two types of PowerApps: Canvas Apps for simple apps and the newer Model-Driven Apps for more complex apps (view a comparison). Model-Driven apps are in fact the evolution of the backend of Microsoft Dynamics 365 Customer Engagement (CRM) so in that regard it’s not really new.

Microsoft Flow – Create sophisticated automated workflows directly in your apps with a no-code approach that connects to a wide variety of applications and services (over 250 connectors). Allows non-technical users to automate complex business processes and workflows without a complex IT deployment. You will frequently call Microsoft Flow from PowerApps. Example: After a laptop order is placed, send an email to the selected approver and take actions based on their approved or rejected response.

So those are brief explanations of the three products that make up the Microsoft Power Platform. Using one sentence to describe the Power Platform: “The Power Platform is a new low-code development platform that intertwines the best components of Power BI, Microsoft Flow, Common Data Service, and PowerApps”. With CDS as a core data component of Power Platform you can think of it as actually four products. Now onto how they compare to other Azure products:

Azure Functions – A solution for easily running small pieces of code, or “functions,” in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or Python. You pay only for the time your code runs and trust Azure to scale as needed, allowing you to build truly serverless applications. Azure Functions integrates with various Azure products (i.e. PowerApps, Microsoft Flow, Azure Data Factory, Logic Apps) and 3rd-party services. These products and services can trigger your function and start execution, or they can serve as input and output for your code.

Logic Apps – Azure Logic Apps is a cloud service that helps you schedule, automate, and orchestrate tasks, business processes, and workflows when you need to integrate apps, data, systems, and services across enterprises or organizations. Despite the word “Apps” in the product name, is does not have a visual interface like PowerApps for building business applications – it is more like an ETL tool comparable to Microsoft Flow. In fact, Microsoft Flow is build on top of Logic Apps and they share the same workflow designer and the same connectors. Think of Logic Apps as a more powerful version of Microsoft Flow – you can build a solution with Microsoft Flow at first, and then convert it to a Logic App as needed. Logic Apps includes additional capabilities like integration with Azure Resource Manager and the Azure Portal, PowerShell and the Azure Command-Line Interface (Azure CLI), Visual Studio, and more advanced connectors.

Logic Apps and Functions fall under the umbrella of developer type services, which can be built in a browser or Visual Studio. PowerApps and Microsoft Flow are limited to the browser only, hence are targeted for power users or business users.

Azure Data Factory (ADF) – Easily construct ETL and ELT processes code-free within the intuitive visual environment, or write your own code. Visually integrate data sources using more than 80 natively built and maintenance-free connectors. ADF is somewhat similar to Logic Apps but there are clear differences for their use cases. With ADF, think about using it for moving data, especially large data sets, and transforming the data and building data warehouses (data integration). If you’re monitoring a folder on-prem or in OneDrive and you’re looking to see when files get posted there and you want to simply move that file to another location or send a notification about an action on the file, that a great use case for a Logic App (application integration). But keep in mind these two products will often be used together (i.e. copying data from SQL Server to a Data Lake via ADF, then using the web API out of ADF to send an email notification through a Logic App back to a user to say a job has competed or failed). Also note that Logic Apps, along with Microsoft Flow, have connectors to run an ADF pipeline.

More info:

The Microsoft Power Platform – Empowering millions of people to achieve more

What are Microsoft Flow, Logic Apps, Functions, and WebJobs?

Power Apps, Logic Apps and Flow

Azure Data Factory vs Logic Apps

Azure Data Factory 2.0 – How is it different from Azure Logic Apps?

PowerApps Canvas vs Model-driven Applications

Azure Logic Apps vs Microsoft Flow, Why Not Both?

↧

Big Data Workshop

July 10, 2019, 8:00 am

≫ Next: Top Modern Data Warehouse questions

≪ Previous: The Microsoft Power Platform

A challenge I have with customers who want to get hands-on experience with the Azure products that are found in a modern data warehouse architecture is finding a workshop that covers many of those products. To the rescue is a workshop created by my Microsoft colleagues Fabio Braga and Rod Colledge, explained in their blog post Azure Data Platform End2End with the GitHub located here. This is an on-demand workshop with labs that you can run at any time.

The idea of this workshop is to give experienced BI professionals (but new to Azure) a view of the variety of data services available and the role they play in the overall architecture. Most professionals never had a chance before to use a Spark cluster, or a NoSQL database, so the workshop aims to fill this gap. It’s true that similar outcomes can also be achieve with other services/features (this workshop uses only a subset of a much larger family of Azure services), but there is only so much that can be covered in a 2-day workshop. So keep in mind the architecture used in this workshop is only one of many possibilities for building a modern data warehouse solution. The lab will be updated as new products and features are released (i.e. ADF Mapping Data Flow when it GA’s).

A description of the workshop:

In this 2-day workshop you will learn about the main concepts related to advanced analytics and Big Data processing and how Azure Data Services can be used to implement a modern data warehouse architecture. You will understand what Azure services you can leverage to establish a solid data platform to quickly ingest, process and visualize data from a large variety of data sources. The reference architecture you will build as part of this exercise has been proven to give you the flexibility and scalability to grow and handle large volumes of data and keep an optimal level of performance. In the exercises in this lab you will build data pipelines using data related to New York City. The workshop was designed to progressively implement an extended modern data platform architecture starting from a traditional relational data pipeline. Then we introduce big data scenarios with large files and distributed computing. We add non-structured data and AI into the mix and finish with real-time streaming analytics. You will have done all of that by the end of the workshop. The workshop include a series of five labs with a discussion of concepts in-between each lab.

Technologies you will use: SQL Data Warehouse, SQL Server in a VM, Azure Data Factory, Databricks (w/Spark), Cognitive Services (w/computer vision), Event Hub, Stream Analytics, PolyBase, Power BI, Blob Storage, Cosmos DB, Logic App

↧

Podcasts on Modern Data Warehouse

August 8, 2019, 8:00 am

≫ Next: New product: Azure Data Share

≪ Previous: Top SQL Server migration questions

I did a couple of recent podcasts that I wanted to mention:

The first one was with Kirby Repko from the YouTube channel SQLTalk: Modern Data Warehouse Design with James Serra (15 minutes). Check out his playlist of other useful topics.

The other was with Jason Strate from the PASS Cloud Virtual Group: Big data architectures and the data lake (60 minutes). Check out the ton of other previous meetings.

↧

New product: Azure Data Share

August 19, 2019, 8:00 am

≫ Next: Ways to access data in ADLS Gen2

≪ Previous: Podcasts on Modern Data Warehouse

A brand new product by Microsoft called Azure Data Share was recently announced. It is in public preview. To explain the product in short, any data which resides in Azure storage can be securely shared between a data provider and a data consumer. It does this by copying a snapshot of the data to the consumer’s subscription (called snapshot-based copying, and in the future there will be in-place sharing). It currently supports ADLS Gen1, ADLS Gen2, and Blob storage, and eventually will support Azure Data Explorer, SQL DB, and SQL DW. Check out the Documentation and a video, and then go try it out.

You can share the data with a few clicks as long as the user you are trying to share with has access to an Azure Subscription and storage account. The copying and updating of the data is handled for you using the Microsoft backbone for best performance, and is encrypted during transit. You can specify the frequency at which the data consumers receive updates. It also is a simple way to control, manage, and monitor all of your data sharing.

To visually see how Azure Data Share helps, consider how data sharing is done today:

With Azure Data Share, as a data provider, you provision a data share and invite recipients to the data share. Data consumers receive an invitation to your data share via e-mail. Once a data consumer accepts the invitation, they can trigger a full snapshot of the data shared you shared them. This data is received into the data consumers storage account. Data consumers can receive regular, incremental updates to the data shared with them so that they always have the latest version of the data.

When a data consumer accepts a data share, they are able to receive the data in a storage account of their choosing. For example, if the data provider shares data using Azure Blob Storage, the data consumer can receive this data in Azure Data Lake Store.

And to see how Azure Data Share fits in a modern data warehouse architecture:

Azure Data Share provides a great way for companies to monetize some of their most valuable internal data. With all the work done to build a modern data warehouse, why not sell the data to partners and business customers? They will save a ton of time and money trying to create the same data.

Microsoft is focusing first on offering the technology for its corporate customers to share data with their partners in compliance with governance restrictions, data controls and privacy policies. For now, monetization will be up to those customers. Longer term, Microsoft is also considering whether it could operate a marketplace where data could be more easily monetized and shared.

More info:

New Microsoft Azure service helps companies share (and potentially sell) their internal data

How to share data securely using azure data share

Video Azure Data Share

WTH is Azure Data Share?

Video Share data simply and securely using Azure Data Share | Azure Friday

↧

Ways to access data in ADLS Gen2

September 16, 2019, 8:00 am

≫ Next: Databricks Delta Lake

≪ Previous: New product: Azure Data Share

With data lakes becoming popular, and Azure Data Lake Store (ADLS) Gen2 being used for many of them, a common question I am asked about is “How can I access data in ADLS Gen2 instead of a copy of the data in another product (i.e. Azure SQL Data Warehouse)?”. The benefits of accessing ADLS Gen2 directly is less ETL, less cost, to see if the data in the data lake has value before making it part of ETL, for a one-time report, for a data scientist who wants to use the data to train a model, or for using a compute solution that points to ADLS Gen2 to clean your data. While these are all valid reasons, you still want to have a relational database (see Is the traditional data warehouse dead?). The trade-off in accessing data directly in ADLS Gen2 is slower performance, limited concurrency, limited data security (no row-level, column-level, dynamic data masking, etc) and the difficulty in accessing it compared to accessing a relational database.

Since ADLS Gen2 is just storage, you need other technologies to copy data to it or to read data in it. Here are some of the options:

Power BI can access it directly for reporting (in beta) or via dataflows (in preview) which allows you to copy and clean data from a source system to ADLS Gen2 (see Connect Azure Data Lake Storage Gen2 for dataflow storage (Preview)). If you get an “Access to the resource is forbidden” error when trying to read the data in Power BI, go to the ADLS Gen2 storage account on the Azure portal, choose Access control, “Add a role Assignment”, and add “Storage Blob Data Owner” (you will only get this error if, when accessing ADLS Gen2 via Get Data in Power BI, you sign in with your username – you won’t get the error if you use Access Key). Also, when importing into Power BI, make sure to choose “Combine & Transform Data” or “Combine & Load”. Note that trying to read Parquet format is not supported (only CSV, Excel) – a work around is you can use a Spark connector to a Databricks cluster which has imported the Parquet files
PolyBase in SQL Data Warehouse (SQL DW). PolyBase allows the use of T-SQL. There is no pushdown computation support, so PolyBase is mostly used for data loading from ADLS Gen2 (see Load data from Azure Data Lake Storage to SQL Data Warehouse)
PolyBase in SQL Server 2016/SQL Server 2017. Pushdown computation support is only supported when using Hadoop (Hortonworks or Cloudera). Note that PolyBase in SQL Server 2016/2017 only supports Hadoop, Blob Storage, and ADLS Gen1 (NOT ALDS Gen2), and SQL Database does not support PolyBase at all
SQL Server 2019 CTP 2.0 introduces new connectors for PolyBase, including SQL Server, Oracle, Teradata, and MongoDB, all of which support pushdown computation. SQL Server 2019 also introduces a new feature called big data clusters (BDC), which has a special feature called ‘HDFS tiering’ that allow you to mount a directory from ADLS Gen2 as a virtual directory in the HDFS inside of the big data cluster. Then you can create an external table over that HDFS directory and query it from the SQL Server master instance in the big data cluster. You can only do this with BDC. Other than that there is no support for ADLS Gen2 in PolyBase yet. For more information, see the PolyBase documentation for SQL Server 2019 CTP 2.0
Azure Databricks via Spark SQL, Hive, Python, Scala, or R. Accessing ADLS Gen2 with Databricks is a bit of a pain. For help, see the official docs and these blogs: Access to Azure Data Lake Storage Gen 2 from Databricks Part 1: Quick & Dirty, Avoiding error 403 (“request not authorized”) when accessing ADLS Gen 2 from Azure Databricks while using a Service Principal, Analyzing Data in Azure Data Lake Storage Gen 2 using Databricks and the video Creating a Connection to Azure Data Lake Gen 2 with Azure Databricks. Note that in public preview is a way to enable ADLS credential passthrough on standard clusters in Databricks, simplifying the process of connecting to ADLS Gen2 – see Enable Azure Data Lake Storage credential passthrough for a standard cluster (Azure AD Credential Passthrough has been GA on interactive clusters for about six months)
Azure Data Factory supports ADLS Gen2 as one of its many data sources. See Copy data to or from Azure Data Lake Storage Gen2 using Azure Data Factory
Azure HDInsight supports ADLS Gen2 and is available as a storage option for almost all Azure HDInsight cluster types as both a default and an additional storage account. See Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters
Azure Data Explorer (ADX). To query, see Query data in Azure Data Lake using Azure Data Explorer (Preview) or execute a query that writes to ADLS Gen2
HDInsight with Hive or Pig or MapReduce. See Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters
Hortonworks (see Configuring access to ADLS Gen2) or Cloudera (see Configuring ADLS Gen2 Connectivity) which are both available in the Azure Marketplace

The main thing to consider when determining the technology to use to access data in ADLS Gen2 is the skillset of the end user and the ease of use of the tool. T-SQL is easiest, but currently the Microsoft products have some limitations on when T-SQL can be used.

Note that if you are looking for info on how to access the Common Data Model (CDM) which stores the data in ADLS Gen2, check out my blog post Common Data Model.

↧

Databricks Delta Lake

October 15, 2019, 8:00 am

≫ Next: Microsoft Ignite Announcements

≪ Previous: Ways to access data in ADLS Gen2

Introduced in April 2019, Databricks Delta Lake is, in short, a transactional storage layer that runs on top of cloud storage such as Azure Data Lake Storage (ADLS) Gen2 and adds a layer of reliability to organizational data lakes by enabling many features such as ACID transactions, data versioning and rollback. It does not replace your storage system. It is a Spark proprietary extension and cloud-only. It has been open sourced and the code can be found here. Full documentation is at the Delta Lake Guide.

It’s real easy to use. When writing code in Databricks, instead of using “parquet” for the format property, just use “delta”. So instead of having data land in your cloud storage in its native format, it instead lands in parquet format and while doing so adds certain features to the data. See Delta Lake Quickstart.

Some of the top features are:

ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data via the implementation of a transaction log, which includes checkpoint support. See Diving Into Delta Lake: Unpacking The Transaction Log
Scalable metadata handling: Leverages Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files at ease
Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink, making a solution for a Lambda architecture but going one step further since both batch and real-time data are landing in the same sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box
Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion. See Diving Into Delta Lake: Schema Enforcement & Evolution
Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments. This allows queries to pull data as it was at one point in the past. It can query the data using a data version point or a data timestamp. In both cases, Delta guarantees that the query will return as a complete dataset as was at the time of the version or Timestamp. It can be used for data recovery following an unwanted change or job run. It can help identify all the changes that occurred to a table during a time interval. In the case of Disaster Recovery, Delta time travel can be used to replicate the data from the DR site back to the regular site and reduce the work and time until the business is back to usual. See Introducing Delta Time Travel for Large Scale Data Lakes
Upserts and deletes: Supports merge, update and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, etc. See Simple, Reliable Upserts and Deletes on Delta Lake Tables using Python APIs
Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet
100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine
Performance: Delta boasts query performance of 10 to 100 times faster than with Apache Spark on Parquet. It accomplishes this via Data Skipping (Delta maintains file statistics on the data subset so that only relevant portions of the data is read in a query), Compaction (Delta manages file sizes of the underlying Parquet files for the most efficient use), and Data Caching (Delta automatically caches highly accessed data to improve run times for commonly run queries) and well as other optimizations

The Bronze/Silver/Gold in the above picture are just layers in your data lake. Bronze is raw ingestion, Silver is the filtered and cleaned data, and Gold is business-level aggregates. This is just a suggestion on how to organize your data lake, with each layer having various Delta Lake tables that contain the data. Delta Lake tables are a combination of Parquet based storage, a Delta transaction log and Delta indexes (so updating the indexes and ACID support will slow down the ingestion performance a bit). The tables can only be written/read by a Delta cluster. See Productionizing Machine Learning with Delta Lake.

Having a Delta Lake does not mean you don’t need a relational data warehouse in your modern data warehouse solution. While Delta Lake can store and process data faster and easier than a relational data warehouse and can scale better, it is not a replacement for a data warehouse as it is not as robust and performant, among other reasons (see Is the traditional data warehouse dead?) .

I usually see customers using Delta Lake for the staging and processing of data, supporting streaming and batch. As for the final product I see it used for drill down, in-frequent, or large data queries and real-time dashboards.

One big clarification, since the transaction log in Delta Lake sounds a lot like one you would find in a relational database. This leads to customers asking if Delta Lake can support OLTP. The answer is a big NO. This is not the domain of the product and I would not recommend it. You can stream a large number of inserts to Delta, but update and deletes will slow down the operation as this involves file copies and versioning.

Another thing to note is Delta Lake does not support true data lineage (which is the ability to track when and how data has been copied within the Delta Lake), but does have auditing and versioning (storing the old schemas in the metadata). Also, Databricks partnered with Immuta and Privacera, which provides row/column security in Delta Lake like you find with Apache Ranger in HDFS.

One limitation is you can’t access Delta Lake tables outside of Databricks Runtime as ingestion requires using the Delta ACID API (via Spark Streaming) and running queries requires the use the Delta JDBC (with one exception of static data sets but it’s not easy – see Can I access Delta Lake tables outside of Databricks Runtime?). So most of the use of Delta Lake is within the Databricks ecosystem as you can’t copy data from the Delta Lake to upstream products like Azure SQL Data Warehouse, but expect this to change as other 3rd party products along with Hive and Presto build native readers to Delta Lake. Details on how to connect Power BI to Delta Lake can be found here.

More info:

Databricks Delta Lake vs Data Lake ETL: Overview and Comparison

A standard for storing big data? Apache Spark creators release open-source Delta Lake

A Deep Dive Into Databricks Delta

Stream IoT sensor data from Azure IoT Hub into Databricks Delta Lake

Video Simplify and Scale Data Engineering Pipelines with Delta Lake

Video Delta Architecture, A Step Beyond Lambda Architecture

Video Making Apache Spark Better with Delta Lake

Video Delta Lake – Open Source Reliability for Data Lakes

↧

Microsoft Ignite Announcements

November 6, 2019, 8:00 am

≫ Next: Azure Synapse Analytics new features

≪ Previous: Databricks Delta Lake

Microsoft Ignite has always announced many new products and new product features, and this year was no exception. Many exciting announcements, and below I list the major data platform related announcements, and I will follow-up with more details in future blogs but wanted to post a quick summary:

Azure Synapse Analytics (preview) – This is an evolution of Azure SQL Data Warehouse. It blends together big data, data warehousing, and data integration into a single service for end-to-end analytics at cloud scale. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources. Azure Synapse brings the data warehousing and Big Data analytics together with a unified experience to ingest, prepare, manage, and serve data for immediate BI and machine learning needs. Read the blog and check out the docs. Join the private preview here. The changes boil down to these areas:

The core data warehouse engine now supports not only explicitly provisioned workloads (pay per DWU provisioned) but serverless on-demand workloads (pay per TB processed) using T-SQL
A Synapse workspace that allows you to provision resources including SQL pools (SQL DW) and Apache Spark pools
Integration of Apache Spark, ADLS Gen2, Power BI, and Azure Data Factory (all are provisioned or linked within the Synapse workspace). Files in ADLS are query-able using a SQL Analytics runtime (T-SQL that does not require an EXTERNAL TABLE command) or a Spark runtime (Spark SQL, Python, Scala, R, .NET)
A unified web user interface called Azure Synapse Studio which controls all your resources. It integrates a notebook experience that can use Python, Scala, .NET, and Spark SQL. It is here you can ingest (Azure Data Factory), explore (T-SQL script or Spark SQL in a notebook), analyze (Machine Learning Services in a notebook), and visualize data (Power BI), bringing everything together in a single pane of glass

Azure Arc (preview) – A set of technologies that unlocks new hybrid scenarios for customers by bringing Azure services and management to any infrastructure:

Enables Azure SQL Database (managed instance) and Azure Database for PostgreSQL Hyperscale to run on-premises (using Kubernetes on Linux on the hardware of choice), at the edge and in multi-cloud environments. Eventually Arc will support all the data services
Expanded the Azure Stack portfolio to offer customers even more flexibility with the addition of Azure Stack Edge. Azure Stack Edge, previously known as Azure Data Box Edge, is a managed AI-enabled edge appliance that brings compute, storage and intelligence to any edge. Also introduced is a new Rugged series of Azure Stack form-factors designed to provide cloud capabilities in the harshest environment conditions supporting scenarios such as tactical edge, humanitarian and emergency response efforts
Extend Azure management and security to any infrastructure. Hundreds of millions of Azure resources are organized, governed and secured daily by customers using Azure management. Azure Arc extends these proven Azure management capabilities to Linux and Windows servers, as well as Kubernetes clusters on any infrastructure across datacenters, clouds and edge devices
Read the blog

SQL Server 2019 is GA – Major new features are Big Data Clusters (see this great workshop to learn it), data virtualization, intelligent database, secure enclaves, extensiblity, and accelerated database recovery. Read the blog.

Azure Cosmos DB enhancements:

Azure Cosmos DB new analytics storage engine (preview) – Azure Cosmos DB containers are now internally backed by two storage engines: a transactional storage engine and a new analytical storage engine. Both the storage engines are log-structured and write-optimized for blazing fast updates. The transactional storage engine is encoded in row-oriented format for blazing fast transactional reads and queries. The analytical storage engine is encoded in columnar format for fast analytical queries and scans and is fully updatable. The transactional storage engine is backed by local SSDs, and the analytical storage is stored on a cost efficient off-cluster SSD storage. They also integrated the Azure Cosmos DB Spark connector directly into Azure Synapse Analytics’ Spark capabilities, making it easy to have Azure Synapse query and operate over data in the Cosmos DB analytical storage (ideal for the use case of Cosmos DB being used as a hot updatable data lake as it ingests telemetry data). Check out the docs
Azure Cosmos DB autopilot mode (preview) – With Autopilot mode customers select a throughput tier and Azure Cosmos DB will automatically scale RU/s up and down as needed to match capacity with demand. Customers can opt-in for tier upgrades that will automatically shift their account to the next highest tier if demand spikes beyond the upper limit of their current tier. This means customers won’t have to manage or adjust provisioned throughput up or down to match traffic patterns. This saves time, eliminates the need for custom scripting, and removes the risk of over- or under-provisioning throughput. When opted-in to tier upgrades, customers can also rest assured that their apps will never be rate-limited due to excess demand or insufficient throughput. Check out the docs
GROUP BY support – Check out the docs
More details and other new features listed at What’s new in Azure Cosmos DB: November 2019

Azure Data Share is now GA – Azure Data Share enables organizations to easily and securely share data with other organizations to expand analytics datasets for enhanced insights.

Azure Database for PostgreSQL – Hyperscale (Citus) is now GA – Scales out Postgres horizontally and is ideal for multi-tenant and SaaS applications that are growing fast—and for real-time analytics apps that need sub-second response times across billions of rows. Check out the docs.

Azure Database Migration Service new Hybrid Mode (preview) – Uses a migration worker hosted on-premises together with an instance of Azure Database Migration Service running in the cloud (the “cloud service”) to manage database migrations. The Azure Database Migration Service hybrid mode is especially useful for scenarios in which there is a lack of site-to-site connectivity between the on-premises network and Azure or limited site-to-site connectivity bandwidth. Check out the docs.

Azure Search – Has been renamed to Azure Cognitive Search. There are new built-in skills: translation, custom entity lookup (preview), and custom document cracking, making it simple to transform content in meaningful ways, such as translating text and identifying custom entities important to your business. Also, there are new data connectors for ADLS Gen2, Cosmos DB Gremlin API, and Cosmos DB Cassandra API.

Azure SQL Database enhancements:

Azure Stream Analytics integration (preview) – Customers may now stream real time data into a SQL Database table directly from the Azure Portal using Azure Stream Analytics
New hardware choices (preview) – M-series, now in preview, is a new memory optimized hardware option in SQL Database for workloads demanding more memory and higher compute limits than provided by Gen5. M-series provides 29 GB per vCore and 128 vCores which increases the previous memory limit in SQL Database by 8x to nearly 4 TB. Fsv2-series, now in preview, is a new compute optimized hardware option in SQL Database delivering low CPU latency and high clock speed for the most CPU demanding workloads. Depending on the workload, Fsv2-series can deliver more CPU performance per vCore than Gen5, and the 72 vCore size can provide more CPU performance for less cost than 80 vCores on Gen5. Fsv2 provides less memory and tempdb per vCore than other hardware so workloads sensitive to those limits may want to consider Gen5 or M-series instead. Check out the docs
PowerApps integration (preview) – The Azure portal offers an entry point to create an app with PowerApps that is automatically connected to a table in a given database. Quickly create low-code robust customer experiences for any device using data from your SQL Database tables. Customers may also seamlessly work with their Azure based services, and benefit from rich extensibility enabling “no limits” development. This new entry point comes in addition to the creation, connection, and configuration experiences supporting SQL Database that already exist in the PowerApps Studio. This new experience starts from the context of a customers’ database, enabling them to create an app backed by SQL in minutes and navigate seamlessly between the SQL Database and PowerApps
Azure SQL Database serverless now generally available – Serverless enables auto-scaling of compute and hands-off compute sizing, and pay for compute used on a per second basis

Azure Data Factory Wrangling Data Flows (preview) – Allows Data Engineers to do code-free agile data preparation at cloud scale via spark execution. It uses the Power Query data preparation technology (also used in Power Platform dataflows) to seamlessly prepare/shape the data. Check out the docs and video Data Exploration with ADF Data Flows.

Data Migration Assistant and SQL Server 2019 is GA – Customers can now upgrade their on-premises SQL Server to SQL Server 2019 by detecting compatibility issues that can impact database functionality on the new version of SQL Server. Data Migration Assistant recommends performance and reliability improvements for the customers’ target environment, and it enables the ability to move not only their schema and data, but also uncontained objects from their source server to their target SQL Server 2019. Check out the docs.

Azure SQL Database Edge (preview) – This is a small-footprint containerized database (< 500MB) running in ARM- and x64-based devices in a connected or disconnected environment. It is available as a Ubuntu container, which can run on a Linux host. It has built-in data streaming and time-series that you deploy from the Azure IoT Edge as a module. See the blog post.

Large models in Power BI Premium (preview) – Until now, dataset caches in Power BI Premium have been limited to 10 GB after compression. Large models remove this limitation, so dataset cache sizes are limited only by the Power BI Premium capacity size (up to 400GB). View the blog.

Microsoft Flow – has been renamed to Power Automate and had added Robotic Process Automation (RPA), called UI Flows, which records step-by-step actions such as mouse clicks, keyboard use, and data entry, then replay those actions and turn them into intelligent workflows using a simple, guided process. View the blog.

Power Virtual Agents – A no-code/low-code app that allows anyone to create and deploy intelligent AI-powered virtual agents (bots). View the blog.

More info:

The official Microsoft Ignite 2019 Book of News

Build a data-driven culture to accelerate innovation

↧