
SQL on-demand in Azure Synapse Analytics


The new SQL on-demand (to be renamed SQL serverless) offering within Azure Synapse Analytics opens up a whole new world of getting insights into data immediately, at low cost, using familiar tools and languages. There is no infrastructure to set up or clusters to maintain. It uses a pay-per-query model (you are charged only for the data scanned by the queries you run) and auto-scales depending on the amount of data being queried. See Quickstart: Use SQL on-demand

A SQL on-demand pool can currently access data from ADLS Gen2, Spark tables, and Cosmos DB using T-SQL.

Querying data in ADLS Gen2 storage using T-SQL is made easy by the OPENROWSET function and its additional capabilities (check out the T-SQL that is supported). The file types in ADLS Gen2 that SQL on-demand currently supports are Parquet, CSV, and JSON. ParquetDirect and CSV 2.0 add performance improvements (see Benchmarking Azure Synapse Analytics – SQL Serverless, using .NET Interactive). You can also query folders and multiple files and use file metadata in queries.
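For example, a minimal ad-hoc query over Parquet files (the storage account and path below are placeholders) can be run from a SQL script in Synapse Studio without creating any objects first:

SELECT TOP 10 *
FROM OPENROWSET(
  BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet',
  FORMAT='PARQUET'
) AS rows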

Here are the three main scenarios that SQL on-demand is great for:

  • Basic discovery and exploration – Quickly view the data in various formats (Parquet, CSV, JSON) in your data lake, so you can plan how to extract insights from it
  • Logical data warehouse – Provide a relational abstraction on top of raw or disparate data without having to relocate or transform the data, allowing an always up-to-date view of your data. Putting T-SQL views on top of data in your data lake makes it appear to end users that they are querying data in a relational database, blurring the line between a relational database and a data lake
  • Data transformation – A simple, scalable, and performant way to transform data in the lake using T-SQL, so it can be fed to BI and other tools or loaded into a relational data store (Synapse SQL databases, Azure SQL Database, etc.). For example, using the Copy activity in Azure Data Factory you can convert CSV files in the data lake (via T-SQL views in SQL on-demand) to Parquet files in the data lake. See Azure Synapse Analytics: How serverless is replacing the data warehouse
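As a rough sketch of the data transformation scenario, SQL on-demand also supports CREATE EXTERNAL TABLE AS SELECT (CETAS), which can read CSV files from the lake, transform them with T-SQL, and write the result back to the lake as Parquet. The storage paths, object names, and columns below are assumptions for illustration:

-- One-time setup: where the transformed data will land and in what format
CREATE EXTERNAL DATA SOURCE MyDataLake
WITH (LOCATION = 'https://mydatalake.dfs.core.windows.net/curated')
GO
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET)
GO
-- Read CSV from the lake, transform with T-SQL, write Parquet back to the lake
CREATE EXTERNAL TABLE dbo.SalesCurated
WITH (
  LOCATION = 'sales/',
  DATA_SOURCE = MyDataLake,
  FILE_FORMAT = ParquetFormat
)
AS
SELECT CustomerId, CAST(Amount AS decimal(18,2)) AS Amount
FROM OPENROWSET(
  BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.csv',
  FORMAT='CSV', PARSER_VERSION='2.0', FIRSTROW=2
) WITH (CustomerId int, Amount varchar(20)) AS src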

Different professional roles can benefit from SQL on-demand:

  • Data Engineers can explore the lake, transform and prepare data using this service, and simplify their data transformation pipelines. For more information, check out this tutorial. You could even create a view over data in the data lake and use that to refresh your data in a tabular model
  • Data Scientists can quickly reason about the contents and structure of the data in the lake, thanks to features such as OPENROWSET and automatic schema inference
  • Data Analysts can explore data and Spark external tables created by Data Scientists or Data Engineers using familiar T-SQL language or their favorite tools, which can connect to SQL on-demand. So if you create a Spark Table, that table will be created as an external table in SQL On-Demand so you can query it without having to keep a Spark cluster up and running. Currently this is only supported with Spark Tables created using Parquet as the input format
  • BI Professionals can quickly create Power BI reports on top of data in the lake and Spark tables. You are connecting to a “database” that has no data, but rather views or external tables to data in the data lake. Not only can you create reports, but you can use Power BI to explore the data and do basic discovery

Here are the top FAQs about SQL on-demand that I have heard from customers:

  • Will SQL on-demand be able to query SQL Pool tables? It will be able to in the future
  • Is SQL serverless MPP? SQL serverless uses nodes to scale out processing, similar to MPP, although it is a completely different architecture from existing MPP data warehouses. It uses customized SQL engines
  • What are the warmup times when submitting a query? For the 95th percentile it is less than two seconds. There is a timeout after which there is a scale-down, so if you don’t execute any query targeting storage during that timeout, your resources will be “taken from you”. Once you come back and execute a query that targets storage, resources are granted back to your endpoint. A pool of warm clusters is maintained to keep the warmup time short
  • Is there a predicate pushdown concept for SQL on-demand in Azure Synapse? Yes, there is filter pushdown, where SQL on-demand will push down queries from the front-end to the back-end nodes. For example, if you query Parquet files, the Parquet metadata is used to target only the row groups that contain the values you are looking for. Microsoft is expanding the range of cases in which filter pushdown is supported
  • What are the best practices for SQL on-demand?  Check out https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-sql-on-demand
  • Why should I choose serverless and take a penalty on the first (warmup) query, instead of using provisioned? It depends on your workload. For constant or high workloads provisioned might be a better choice, while for sporadic, ad-hoc, or exploratory workloads serverless might be a better fit. It also brings a difference in charges: provisioned means you are paying for resources, while in serverless you are paying for what you use
  • So when would you use provisioned SQL (aka SQL pool, which is exactly what was in SQL Data Warehouse)? When you need consistent performance, high performance, or have a large number of queries being run consistently. Provisioned SQL may give you better and more predictable performance due to resource reservation. The good thing is that because both use T-SQL, it is easy to transition back and forth between SQL serverless and a SQL pool
  • Most customers are starting to use Delta Lake to store their data. When will SQL on-demand support this file format? Support for Delta Lake is being worked on; in the following months there will be more details on what it will include and when it will be available
  • What is the ADLS Gen2 security? There are different methods to access storage account files using SQL serverless. You can use the Synapse workspace managed identity (MSI), SAS tokens, or ACLs at the storage level for a particular AAD account
  • Can we read from files stored in Common Data Model (CDM) format within the data lake (ADLS Gen2)? Yes, at this moment CDM data can be read using a SQL script (serverless or provisioned) that gets metadata from the model.json file and creates views over the CDM data (native support in the query engine currently doesn’t exist)
  • Should I use external tables or views? Views are faster and have more features, so avoid external tables if possible. In particular, the virtual functions filepath and filename are not supported in external tables at this moment, which means users cannot do partition elimination on external tables
  • Can you use PBI and AAS with SQL on-demand? Yes, you can create views on top of your data lake and use wildcards and the filepath/filename functions to expose partitioning column values in a view (see the sketch after this list). That way, for PBI and AAS it is just another object (make sure you do proper data type casting of the virtual function results, as stated in the best practices document)
  • Why use a logical data warehouse instead of just AAS or PBI? Hot analytics is one reason; a logical data warehouse can also shorten the time from user request to delivery. Ideally, no ETL would need to be created (this might not be feasible in all cases, but it provides another perspective)
  • Is there any cost control? Yes, cost control will be available per workspace and per time period (daily/weekly/monthly). Cost estimation will come later
  • What about CI/CD? When Synapse goes GA there will be improvements overall in the CI/CD area, where customers can create a group/department and scope the permissions of particular teams to specific artifacts and capabilities
  • Can I use other tools with SQL on-demand? Yes, there is a default endpoint (i.e. workspacejs1966-ondemand.sql.azuresynapse.net) for this service that is provided within every Azure Synapse workspace. Any tool capable of establishing a TDS connection to SQL offerings can connect to and query the Synapse SQL on-demand endpoint. You can connect with Azure Data Studio or SSMS and run ad-hoc queries, or connect with Power BI to gain insights. You can also use Private Link to bring your SQL on-demand endpoint into your managed workspace VNet
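As a sketch of that wildcard/filepath pattern (the storage path, folder layout, and column names below are assumptions), a view can cast the folder values exposed by the filepath() virtual function into columns, so that filters on those columns only read the matching folders:

CREATE VIEW dbo.vSalesByMonth AS
SELECT
  CAST(rows.filepath(1) AS int) AS [Year],
  CAST(rows.filepath(2) AS tinyint) AS [Month],
  rows.*
FROM OPENROWSET(
  BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/year=*/month=*/*.parquet',
  FORMAT='PARQUET'
) AS rows
GO
-- Filtering on Year/Month lets SQL on-demand skip the folders that don't match
SELECT COUNT(*) FROM dbo.vSalesByMonth WHERE [Year] = 2020 AND [Month] = 9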

More info:

Talking about Azure Synapse on Microsoft Mechanics!

Azure Synapse Analytics – Getting Started with SQL On Demand

AZURE SYNAPSE ANALYTICS – THE ESSENTIAL SQL ON-DEMAND CHEAT SHEET

THINGS TO KNOW ABOUT SQL ON-DEMAND (SERVERLESS QUERY) IN AZURE SYNAPSE ANALYTICS

Power BI and SQL on-demand


Cosmos DB for the SQL Professional – Referencing Tables


I had a previous blog post comparing Cosmos DB to a relational database (see Understanding Cosmos DB coming from a relational world), and one topic it did not address, which I want to cover now, is how to handle the reference tables that are common in the relational database world.

A big difference with Cosmos DB compared to a relational database is that you will create a denormalized data model.  Take a person record, for example.  You will embed all the information related to a person, such as their contact details and addresses, into a single JSON document.  Retrieving a complete person record from the database is now a single read operation against a single container and for a single item.  Updating a person record, with their contact details and addresses, is also a single write operation against a single item.  By denormalizing data, your application will typically have better read and write performance and allow for a scale-out architecture since you don’t need to join tables. 

(Side note: “container” is the generic term; depending on the API, a specific term is used, such as “collection” for the SQL API. Think of a container as one or more tables in the relational world. Going a little deeper, think of a container as a group of one or more “entities” which share the same partition key. A relational table shares a schema, but containers are not bound in that way.)

Embedding data works nicely for many cases, but there are scenarios when denormalizing your data will cause more problems than it is worth.  In a document database, you can have information in one document that relates to data in other documents. While there may be some use cases that are better suited to a relational database than to Cosmos DB (see below), in most cases you can handle relationships in Cosmos DB by creating a normalized data model for them, with the tradeoff that it can require more round trips to the server to read data (but it improves the efficiency of write operations since less data is written).  In general, use normalized data models to represent one-to-many or many-to-many relationships when the related data changes frequently. The key is knowing whether the cost of the updates is greater than the cost of the queries.

When using a normalized data model, your application will need to handle creating the reference document.  One way would be to use a change feed that triggers on the creation of a new document – the change feed essentially triggers an Azure function that creates the  relationship record.

When using a normalized data model, your application will need to query the multiple documents that need to be joined (costing more money because it will use more request units) and do the joining within the application (i.e. join a main document with the documents that contain the reference data), as you cannot do a “join” between documents in different containers in Cosmos DB (joins between documents within the same container can be done via self-joins).  Since every time you display a document it needs to search the entire container for the reference data, it would be best to put the other document type (the reference data) in a different container so you can have different partition keys for each document type (read up on how partitioning can make a big impact on performance and cost).
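To make the self-join point concrete, here is a minimal sketch of a Cosmos DB SQL API query (the container and property names are made up). A JOIN in Cosmos DB unnests an array within the same document rather than joining one document to another, which is why cross-document or cross-container relationships have to be resolved by the application:

SELECT p.id, p.name, addr.city
FROM people p
JOIN addr IN p.addresses
WHERE addr.city = 'Seattle'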

Note that “partitioning” in an RDBMS and in Cosmos DB are different things: partitioning in Cosmos DB refers to “sharding” or “horizontal partitioning”, where replica sets, which contain both the data and copies of the compute (database) resources, operate in a “shared nothing” architecture (i.e. scaled “horizontally”, where each compute resource (server node) operates independently of every other node, but with a programming model that is transparent to developers). Conversely, what is often referred to as “partitioning” in an RDBMS is purely a separation of data into separate filegroups within a shared compute (database) environment. This is also often called “vertical partitioning”.

Another option that is a common pattern for NoSQL databases is to create a separate container to satisfy specific queries.  For example, having one container for products based on category and another container for products based on geography.  Both of those containers are sourced from a “main” or “source” container that is being updated (by the front end or another app), and the change feed attached to that main container pushes the changes out to the other containers that are used for queries.  This means duplicating data, but storage is cheap and you save costs to retrieve data (think of those extra containers as covering indexes in the relational database world). 

Since joining data can involve multiple ways of reading the data, it’s important to understand the two ways to read data using the Azure Cosmos DB SQL API:

  • Point reads – You can do a key/value lookup on a single item ID and partition key. The item ID and partition key combination is the key and the item itself is the value. For a 1 KB document, point reads typically cost 1 request unit with a latency under 10ms. Point reads return a single item
  • SQL queries – You can query data by writing queries using the Structured Query Language (SQL) as a JSON query language. Queries always cost at least 2.3 request units and, in general, will have a higher and more variable latency than point reads. Queries can return many items. See Getting started with SQL queries
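As a small illustration (the property names are assumptions), a query like the one below goes through the query engine and costs at least 2.3 RUs, whereas fetching the same item by its id and partition key through the SDK's point-read API would typically cost about 1 RU:

SELECT c.id, c.lastName, c.address.city
FROM c
WHERE c.lastName = 'Andersen' AND c.address.city = 'Seattle'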

The key in deciding when to use a normalized data model is how frequently the data will change.  If the data only changes once a year, it may not be worthwhile to create a reference document and instead just do an update to all the documents.  But be aware that the update has to be done from the client side, spread over the affected documents and done in batches, as one big UPDATE statement does not exist in Cosmos DB.  You will need to retrieve the entire document from Cosmos DB, update the property/properties in your application, and then call the ‘Replace’ method in the Cosmos DB SDK to replace the document in question (see CosmosDb – Updating a Document (Partially)). If you are using the SQL API and .NET or Java, you can consider using bulk support (.NET) or the bulk executor (Java). Other ideas would involve using the change feed, or if you really need a level of ACID consistency, you can achieve this using stored procedures, with snapshot isolation scoped to a single partition (these are not the same as stored procedures in SQL – rather they are designed specifically to support multi-document transactions).  

Also be aware that because there is currently no concept of a constraint, foreign-key or otherwise, any inter-document relationships that you have in documents are effectively “weak links” and will not be verified by the database itself.  If you want to ensure that the data a document is referring to actually exists, then you need to do this in your application, or through the use of server-side triggers or stored procedures on Azure Cosmos DB.

What are OLTP scenarios where a relational database is essential?

Avoiding the broader topic of when to use a relational database over a non-relational database, there are a few use cases where a relational database is essential:

  • The customer experience and comfort zone is with relational databases. It is a reality that relational databases are ahead on the maturity curve with respect to tooling (an example would be foreign-key constraint behavior). However, it should be noted that this is not the same as saying that “more use cases are technically better suited to the relational model”. Rather, the barrier to entry in new customer projects tends to be lower because mindshare is greater in the relational space. In these cases, it often isn’t worth the effort for companies to upskill
  • The system really needs strict ACID semantics across the entire dataset. Sharded/partitioned databases like Cosmos DB will not provide ACID guarantees across the entire set of physical partitions (and likely never will). In reality, however, the use cases where this is necessary are quite small. Things like transaction management and other SDK-level features that go along with these aspects come easier in the RDBMS space, but this is really the same as the above point – the RDBMS is ahead on the maturity curve for user-level tooling that helps abstract paradigm-specific concepts – but this does not make the paradigm better suited to a greater number of use cases
  • Having a single data store that services both operational and analytical needs with equal utility, including tabular models – this is probably the most powerful argument, and NoSQL engines are likely never going to serve a data structure that coalesces as well into the tabular models that produce reports, charts, graphs, etc. But again, history has proven that, at scale, the “one size fits all” approach can have some non-trivial drawbacks. And the new Analytical Store in Cosmos DB is addressing the need to service both operational and analytical needs

You can create complex hierarchical “relationships” in Cosmos DB that would have to be modelled in separate tables in an RDBMS. Cosmos DB can’t handle them using joins – but again, this is a paradigmatic/semantic difference, not a fundamental flaw in the database model itself. In order to do the equivalent of what you may be trying to achieve in a relational database, you may have to “unlearn what you have learned”, but this comes back to your comfort level with an RDBMS, which is not a trivial thing and can be the main, and very valid, reason for staying with an RDBMS.

In summary, in a NoSQL database like Cosmos DB, most use cases are covered. Some things are a little harder (due to lack of maturity in tooling), but most things are easier, many things can only be done in NoSQL (e.g. handling millions of transactions per second), and very few things cannot be done in a NoSQL database. Most NoSQL engines are characterized by having a lot more configurability, tunability, and flexibility than an RDBMS. And in many ways, that is the hardest challenge for newcomers.

More info:

Data modeling in Azure Cosmos DB

Video Data modelling and partitioning in Azure Cosmos DB: What every relational database user needs to know

Video A tour of Azure Cosmos DB database operations models


Data Lakehouse & Synapse


I am starting to see this relatively new phrase, “Data Lakehouse”, being used in the data platform world. It’s the combination of “Data Lake” and “Data Warehouse”. In this post I’ll give my thoughts on it, and how the next version of Azure Synapse Analytics that is in public preview fits right in with the Data Lakehouse.

First, I want to clear up a bit of confusion regarding Azure Synapse Analytics. If you log in to the Azure portal and do a search, you will see:

Notice that Azure SQL Data Warehouse (SQL DW) is now called Azure Synapse Analytics. This product has been generally available (GA) for a while and is simply a renaming. It contains new features and soon-to-be new features that I blogged about here, but this is separate from other features in a public preview version that I will call “Synapse PP” (see the new docs). I’ll call the current GA version “Synapse GA”. I blogged about this here.

I like to think of a major feature of a Data Lakehouse as the ability to use T-SQL to query data the same way no matter whether the data is in a data lake (storage such as ADLS Gen2) or a data warehouse (a relational database such as Synapse GA).

Here are some of the major new Synapse PP features that support a Data Lakehouse:

There are some great benefits to having a data lake that I blogged about here. And there are many reasons to also have a relational data warehouse along with your data lake that I blogged about here. Up until now, the data lake and the relational data warehouse were sort of on their own islands, requiring you to jump to a different product to interface with each. For example, your data lake could be in ADLS Gen2 and you can use Azure Databricks to query it with SparkSQL (first setting up a connection to ADLS Gen2), while your relational data warehouse could be in Synapse GA and you use SSMS to query it with T-SQL. Having to use Azure Data Factory or Power BI means opening up another tab in your browser, logging into those products, and connecting to ADLS Gen2. So it can be time consuming and costly to create an environment to query both, and also be confusing to query using two different versions of SQL.

You can now get the best of both worlds all under one roof via Azure Synapse Studio: using the relational data warehouse when you need fast query performance, high user concurrency, enhanced security, or just prefer working in a relational environment; or using the data lake when you have huge data volumes, need instant access to the data via schema-on-read, or need to deal with semi-structured data. You never have to leave your workspace no matter which tool you want to use, and can use T-SQL no matter if the data is in the data lake or in the relational data warehouse. And querying a file in the data lake is simply a matter of right-clicking the file (more on this in my next blog).

So at a high level you can view Azure Synapse Studio as supporting business intelligence, data science, T-SQL plus other languages (Python, Scala, C#, SparkSQL), schema flexibility, ACID compliance, any data type, petabytes of data, streaming and batch, with high performance and user concurrency.

So as you can see, Synapse PP combines the use of a data lake and a relational database to make it a Data Lakehouse. It is not really a “new” technology, but just a merging of existing technologies to make it easier to gain insights from data in order to make better business decisions.

With these new features, especially SQL on-demand and T-SQL against ADLS, I am starting to see a few use cases where you may not need a relational database anymore in your modern data warehouse, which goes against my previous way of thinking (see Is the traditional data warehouse dead?). I still feel you will want to have a relational database in your modern data warehouse architecture a large majority of the time, but there will be exceptions.

For example, I can land a bunch of parquet files into the data lake and create a T-SQL view on top of that data, where that view is stored in SQL on-demand. Then I call that view from Power BI, making it appear to the end user that they are using a relational database, while at the same time they are only paying when the query is being run. So that leads to compute cost savings as well as not having to copy the data from the data lake to a relational database. This means less time is needed to build a solution and there is a reduction in the complexity of the solution, and therefore additional cost savings. Below are some of the concerns that arise out of skipping the relational database and the new features that “counter” some of those concerns:

  • Speed: Queries against relational storage will always be faster than against a data lake because of the lack of statistics, query plans, and indexes in the data lake. Counter: ParquetDirect, CSV 2.0, query acceleration, predicate pushdown, and SQL on-demand auto-scaling are some of the features that can make queries against ADLS nearly as fast as a relational database.  Then there are features like Delta Lake and the ability to use statistics for external tables that can add even more performance. Plus you can also import the data into Power BI, use Power BI aggregation tables, or import the data into Azure Analysis Services to get even faster performance
  • Security: Row-level security (RLS), column-level security, and dynamic data masking are security-related features that are not available in a data lake. Counter: Use RLS in Power BI or RLS on external tables instead of RLS on a database table, which then allows you to use result set caching in Synapse
  • Complexity: Schema-on-read (ADLS) is more complex to query than schema-on-write (a relational database). Schema-on-read means the end user must define the metadata, whereas with schema-on-write the metadata is stored along with the data. Then there is the difficulty of querying in a file-based world compared to a relational database world. Counter: Create a SQL relational view on top of files in the data lake so the end user does not have to create the metadata, which will make it appear to the end user that the data is in a relational database. But I still see it being very difficult to manage a solution with just a data lake when you have data from many sources
  • Missing features: Auditing, referential integrity, ACID compliance, updating/deleting rows of data, data caching, and full support of T-SQL are not available in a data lake. Counter: Many of these features can be added when using a Delta Lake, but they will not be as easy to implement as in a relational database

Some additional reasons for using a relational database include helping to break up a large problem into smaller pieces (think of the relational database as an additional step in the data journey from a data lake to the end-user), and the need to handle slowly changing dimensions (SCD). Using time travel in the Delta Lake or creating daily folders for each table in the data lake are options for handling SCD but much more complex than in a relational database world. I also see some customers who have source systems that are relational databases and as they are in the early stages of building a modern data warehouse in Azure they will bypass the data lake and copy the relational data right into a relational data warehouse (especially if they have many existing SSIS packages that they are using for their on-prem data warehouse and they want to continue using those for the short-term). They do this for quick wins and eventually will land the data in the data lake.

A strong reason for skipping a relational database is that, for the short term, SQL on-demand is not able to access a relational database, so you can only use this feature against a data lake. So if you want to save costs by using SQL on-demand, it will have to be against data in the data lake. This could be especially true if you have a “small” data warehouse.

I’m still thinking through this, but these are my initial thoughts and it will be interesting to have discussions with customers on this topic when Synapse PP goes GA. Please comment below on your thoughts!

I’ll be presenting and doing a demo of the public preview of Azure Synapse Analytics at my sessions at European Digital Week on 9/25 (session info), SQL Bits on 10/3 (session info), PASS Summit on 11/10 (session info), and Big Data Conference Europe on 11/25 (session info).

More info:

What is a Lakehouse?

The Data Lakehouse – Dismantling the Hype

THE DATA LAKEHOUSE (updated)

Data LakeHouse — Paradigm of the Decade

Eckerson Group Webinar -Shop Talk – the Data Lakehouse – April 17, 2020

All Hail, the Data Lakehouse! (If Built on a Modern Data Warehouse)

An Architect’s View of the Data Lakehouse: Perplexity and Perspective

Data Lakehouses Hold Water (thanks to the Cloud Data Lake)

The Rise of Data Lakehouse with Data Virtualization

Meet the Data Lakehouse


Query options in Azure Synapse Analytics


The public preview version of Azure Synapse Analytics has three compute options and four types of storage that it can access (mentioned in my blog at SQL on-demand in Azure Synapse Analytics). This gives twelve possible combinations for querying data. Not all of these combinations are currently supported, and some have a few quirks, which I list below.

(NOTE: I’ll demo these features at my sessions at European Digital Week on 9/25 (session info), SQL Bits on 10/3 (session info), PASS Summit on 11/10 (session info), and Big Data Conference Europe on 11/25 (session info). I hope you can join!)

                       Relational Database   ADLS Gen2   Spark Table   Cosmos DB
SQL Provisioned pool   Y                     Y (*1)      N (*2)        N (*6)
SQL On-demand pool     N (*6)                Y           Y (*3)        N (*4)
Apache Spark pool      Y                     Y           Y             Y (*5)

*1: This is available via an external table, which uses PolyBase technology and does not push down queries, so it can be slow (a sketch of such an external table follows after these notes). A feature called fast parquet, which will be available after Azure Synapse Analytics goes GA, will speed up queries over external tables mapped to parquet files (the technology underneath is the same as that being used for SQL on-demand)
*2: This feature will be available soon, but will only support Spark tables created using Parquet as the input format
*3: This feature only supports Spark tables created using Parquet as the input format (explained here)
*4: This feature is available via the Azure Synapse Link and will soon be available
*5: This feature is available via the Azure Synapse Link
*6: This feature will be available after Azure Synapse Analytics goes GA
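As a sketch of the external table approach mentioned in footnote *1 (the data source, file format, table, and column names are illustrative; the external data source and file format objects have to be created in the dedicated pool first):

CREATE EXTERNAL TABLE dbo.extSales (
  CustomerId int,
  Amount decimal(18,2)
)
WITH (
  LOCATION = '/raw/sales/',
  DATA_SOURCE = MyDataLake,      -- created with CREATE EXTERNAL DATA SOURCE pointing to ADLS Gen2
  FILE_FORMAT = ParquetFormat    -- created with CREATE EXTERNAL FILE FORMAT (FORMAT_TYPE = PARQUET)
)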

A huge benefit of using the public preview version of Azure Synapse Analytics is the ability to query a file in the data lake by simply right-clicking the file. When you do that, a menu pops up giving you three choices on how to query the data. Currently this works on three different file types: parquet, csv, and JSON (note that you can always query additional file types using a Spark notebook). The three query choices are listed below, with all but one combination currently supported:

          Preview   New SQL script -> Select TOP 100 rows   New notebook
parquet   N (*3)    Y                                       Y
csv       Y         Y (*1)                                  Y
JSON      Y         Y (*2)                                  Y

“Preview” opens a pop-up window with the contents of the file, “Select TOP 100 rows” opens a tab with a T-SQL SELECT statement using SQL on-demand and then you can run the statement to show the results, and “New notebook” opens a Spark notebook that has PySpark code that you can run to load the data into a Spark in-memory DataFrame and display the rows.
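The T-SQL that “Select TOP 100 rows” generates is roughly the following (this sketch assumes a CSV file; the path is whatever file you right-clicked):

SELECT TOP 100 *
FROM OPENROWSET(
  BULK 'https://mydatalake.dfs.core.windows.net/raw/customers.csv',
  FORMAT='CSV',
  PARSER_VERSION='2.0'
) AS [result]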

*1: This feature was just added and has a limitation in that the column headers do not show correctly (a feature is being worked on to enable the query to infer the column names if the file has a header row)
*2: replace “varchar(8000)” with “varchar(MAX)” in the SELECT statement if you receive an error when running the query
*3: A work-around is to right-click the file, choose New SQL script -> Bulk load, click the Continue button and you will see a “Preview data” option


Azure Synapse Analytics overlooked features


There are some options in Azure Synapse Analytics that are not obvious that I wanted to point out.

While in Azure Synapse Studio, going to the Data hub and clicking the “+” at the top to the right of Data brings up:

  • Workspace
    • Synapse SQL database
  • Linked
    • Connect to external data
    • Integration dataset
    • Browse samples

Under “Workspace”, choosing “Synapse SQL Database” gives you the ability to create a database in a SQL on-demand pool. A SQL on-demand pool (which you access via an endpoint) can have multiple databases. Note that SQL on-demand has no local storage so only metadata objects are stored in the databases.

After you create a database, if you then go to the Develop hub and create a new SQL script, you can choose to connect to “SQL on-demand”, and then the newly created database will be available under “Use database”. You can then create views or external tables in that database.  When you go to the Data hub, under Databases you will see the newly created database (you might need to hit refresh first).
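A minimal sketch of that flow against the SQL on-demand endpoint (the database, view, and path names are made up):

CREATE DATABASE SalesMetadata
GO
USE SalesMetadata
GO
-- Only the view definition (metadata) is stored; the data stays in the lake
CREATE VIEW dbo.vCustomers AS
SELECT *
FROM OPENROWSET(
  BULK 'https://mydatalake.dfs.core.windows.net/raw/customers/*.parquet',
  FORMAT='PARQUET'
) AS rows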

Under “Linked”, choosing “Connect to external data” will create a linked service to ADLS Gen2 or Cosmos DB that you can use in the Data hub (under “Linked”).  This is the same method as going to the Manage hub and choosing “Linked services” and clicking “New” (but that option has more sources to connect to). These linked services can also be used in Azure Data Factory (see Linked Services in Azure Data Factory). 

Under “Linked”, choosing “Integration dataset” will allow you to create a dataset that can be used in pipeline activities and in data flows.  When creating the dataset you will need to specify a linked service to use for the dataset (which you could have done via “Connect to external data”), or you can create a new linked service (see Datasets in Azure Data Factory).

Under “Linked”, choosing “Browse samples” will give you a ton of samples to choose from, broken into four sets:

  • Datasets: A bunch of datasets that when added will show up under Data -> Linked -> Azure Blob Storage.  You can then choose an action (via “…” next to any of the containers in the dataset) and choose New SQL script -> Select TOP 100 rows to examine the data as well as choose “New notebook” to load the data into a Spark dataframe.  Any dataset you add is a linked service to files in a blob storage container using SAS authentication.  You can also create an external table in a SQL on-demand pool or SQL provisioned pool to each dataset via an action (via “…” next to “External tables” under the database, then New SQL script -> New external table) and then query it or insert the data into a SQL provisioned database
  • Notebooks: A bunch of sample Apache Spark notebooks in various languages (PySpark/Scala/Spark.NET C#/SparkSQL) that will show up under Develop -> Notebooks
  • SQL scripts: A bunch of sample SQL scripts that will show up under Develop -> SQL scripts
  • Pipelines: This is similar to choosing the “Create pipeline from template” option on the home page of an Azure Data Factory workspace and will create a pipeline that will show up under Orchestrate -> Pipelines

These samples offer a great way to shortcut the process of building a solution in Synapse.


Microsoft Ignite Announcements 2020


Microsoft Ignite has always announced many new products and new product features, and this year was no exception. Many exciting announcements, and below I list the major data platform related announcements:

Usage-based optimization with Azure Synapse and Power BI: A new feature that analyzes usage patterns in Power BI and shares the information with Synapse is now available to improve query performance. Synapse automatically creates a materialized view optimized for Power BI users — greatly accelerating query performance. More info

Apache Spark for Azure Synapse In-cluster Caching and Shuffle Service (Preview): Caching and shuffle are two of the components of infrastructure for Apache Spark that have the greatest impact on performance. These new services, which are written from scratch, allow the optimization of performance for these components on modern hardware and operating systems. The service is enabled for Apache Spark Pools in Azure Synapse today.

High throughput output from Stream Analytics to Azure Synapse Analytics (General Availability): ​Azure Stream Analytics jobs now have the ability to output to an existing SQL pool table in Azure Synapse Analytics, and can process throughput rates even higher than 200 MB/s. This will support the most demanding real-time analytics and hot-path data processing needs of Azure customers who depend on their Data Warehouse for workloads such as reporting and dashboarding.

Azure AD authentication features for Azure SQL DB, Azure Synapse Analytics, and Azure SQL Managed Instance: ​Three new features using Azure Active Directory (Azure AD) authentication are currently in preview for Azure SQL Database, Azure Synapse Analytics, and Azure Managed Instance. These features help automate user creation using Azure AD applications and allow individual Azure AD guest users to be created in SQL Database, Azure Synapse Analytics, or SQL Managed Instance. First, support is being added for Azure AD user creation in SQL Database and Azure Synapse Analytics on behalf of Azure AD applications. This functionality is useful in the Azure AD application automation processes where Azure AD objects are created and maintained in SQL Database and Azure Synapse Analytics without human interaction. Second, users can use cloud groups to manage Directory Readers role assignment in Azure AD. This change will promote ease of use when setting up SQL Database, Azure Synapse Analytics, and SQL Managed Instance to create users. Finally, Azure AD guest users can now be created as database users and set as Azure AD admin without the need to first add them as members of a group created in Azure AD. These features extend existing functionality, remove user limitations, and provide customers with greater ease of use when setting up the SQL Database, Azure Synapse Analytics, or SQL Managed Instance.

New Azure Cognitive Services and capabilities in preview: In the Decision category, we have the new Metrics Advisor service that proactively monitors metrics and diagnoses issues so organizations can protect their growth engine, from sales revenue to manufacturing. Built on top of Anomaly Detector, Metrics Advisor can be applied to business metric monitoring, AI Ops, and predictive maintenance scenarios.  See an example in action. In the Vision category, the new spatial analysis feature in Computer Vision helps organizations maximize the value of their physical spaces by understanding people’s movements in near-real time – especially critical these days in helping organizations reopen safely by supporting social distancing and other health compliance measures. It is available in a container to run at the edge. Here’s a walkthrough of how customers are applying it. Additionally for containers, the following are available in preview: for Speech, there are automatic language detection and neural text-to-speech containers, and for Computer Vision, the Read 3.0 and 3.1 containers for text extraction.

Announcing Azure SQL Edge now in general availability: ​Azure SQL Edge, which brings the most secure Microsoft SQL data engine to Internet of Things (IoT) gateways and edge devices, is now available. Optimized for IoT workloads, Azure SQL Edge supports built-in data streaming, storage, and artificial intelligence packed into a small footprint container that works in connected, disconnected, and semi-connected environments where local storage and analytics are important. You can reduce cost, drive innovation, and improve operational efficiencies by using Azure SQL Edge to stream, store, and analyze IoT data. You simply need to develop your application once and deploy it anywhere across the edge, your datacenter, and Azure.

Hosting catalog databases for all supported versions of SSRS in Azure SQL Managed Instance: Now in general availability, Azure SQL Managed Instance can host catalog databases for all supported versions of SQL Server Reporting Services (SSRS). Last year, SQL Server 2019 Reporting Services introduced native support for hosting catalog databases in SQL Managed Instance. Now you can also use SQL Managed Instance to host catalog databases for earlier supported versions of SSRS. This is especially useful for fast migration of the existing solutions to SQL Managed Instance, without the need to test and adopt the latest version of SSRS. It also helps you get quick business benefits and then modernize further at your own pace. To learn how to configure your SQL Managed Instance to host catalog databases for earlier supported versions of SSRS, visit the Tech Community blog.

Major performance improvements for Azure SQL Managed Instances: We are announcing a set of major performance improvements for Azure SQL Managed Instances, which enable you to migrate your more performance-hungry database workloads to Azure SQL Managed Instance. These improvements include better transaction log write throughput for general purpose and business critical instances and superior data/log IOPS for business critical instances. Increased performance predictability and stability for general purpose service tier through better integration and enhanced tempdb performance are also included. These improvements are automatically enabled for all existing and future Azure SQL Managed Instances at no extra charge, making Azure SQL Managed Instance the best fully-managed database platform for your mission-critical workloads. Learn more about major performance improvements for Azure SQL Managed Instances

Machine Learning on Azure SQL Managed Instance in preview: Machine Learning Services with support for the R and Python languages now include preview support on Azure SQL Managed Instance. When using Machine Learning Services with R or Python support in Azure SQL Managed Instance, you can run R and Python scripts to do data preparation and general purpose data processing, train machine learning models in database, and deploy your models and scripts into production in stored procedures.

Configurable backup storage redundancy options for Azure SQL Database coming soon: By default, all backups taken on Azure SQL Database are stored on read-access geo-redundant storage (RA-GRS) blobs that are being replicated in paired regions, ensuring data is protected in case of planned and unplanned events. Preview of configurable backup storage redundancy options for Azure SQL DB is coming soon to South East Asia, with additional regions expected in the coming months. This feature will provide you with more flexibility and choice to pick between locally redundant storage (LRS), zone redundant storage (ZRS) or geo redundant (RA-GRS) storage blobs for their backups. As additional regions are launched in preview, there will be more backup storage redundancy options available to use to meet your data residency requirements or minimize backup costs. You will be able to easily configure backup storage redundancy in the Azure portal. For more information, visit our Tech Community blog.

Azure Synapse Link now supports Azure Cosmos DB API for Mongo DB (Preview): Azure Synapse Link for Azure Cosmos DB now supports API for Mongo DB in preview. You can analyze the data in Azure Cosmos DB API for Mongo DB collections, using Synapse Apache Spark or Synapse SQL serverless, without impacting the performance of your transactional workloads. Azure Synapse Link, announced earlier this year, creates a tight integration between Azure Cosmos DB and Azure Synapse Analytics. You can directly access your Azure Cosmos DB data from Azure Synapse Analytics and run near real-time business intelligence, analytics, and machine learning pipelines. This integration enables you to build cloud-native HTAP (Hybrid transactional/analytical processing) solutions to perform no-ETL (extract, transform, and load) analytical processing on operational data in Azure Cosmos DB.

Azure Synapse Link for Azure Cosmos DB: SQL serverless runtime support coming soon: ​Azure Synapse Link will soon have preview support for querying Azure Cosmos DB data with Synapse SQL serverless. In the coming months, you will be able to run analytical T-SQL queries over data in Azure Cosmos DB in place within seconds. These queries can be used to build rich near real-time dashboards using Power BI integration with Azure Synapse Analytics. This new feature enhances the capability of Azure Synapse Link and will give you a wide range of business intelligence and ad-hoc querying tools via the T-SQL interface. Azure Synapse Link, announced earlier this year, creates a tight integration between Azure Cosmos DB and Azure Synapse Analytics. This gives you direct access to Azure Cosmos DB data from Azure Synapse Analytics to run near real-time business intelligence, analytics, and machine learning pipelines. This integration enables you to build cloud-native HTAP (Hybrid transactional/analytical processing) solutions to perform no-ETL (extract, transform, and load) analytical processing on operational data in Azure Cosmos DB, without impacting the performance of your transactional workloads. Azure Synapse Analytics will soon support Azure Synapse Link with Synapse Apache Spark and Synapse SQL serverless.

Serverless offer for Azure Cosmos DB coming soon on all APIs: ​Azure Cosmos DB serverless is in preview for the Core (SQL) API and will soon be available in preview for all Azure Cosmos DB APIs, including MongoDB, Cassandra, Gremlin, and Table. This offers NoSQL developers a database operations option with zero capacity planning or management required. As a consumption-based option with no minimum, serverless could also significantly lower the entry price for applications. It is ideally suited for small and medium-sized workloads that do not require steady throughput, receive requests sporadically or in short bursts, and have moderate performance requirements. Learn more.

Enhanced management experience for Azure SQL Managed Instance: Management operations are an essential part of Azure SQL Managed Instance, covering create, update, and delete scenarios. Most of the management operations in SQL Managed Instance are long-running but did not have operation progress visible. Through the introduction of a new CRUD API version, the SQL Managed Instance resource is now visible from when the create request is submitted. In addition, the new OPERATIONS API adds the ability to monitor management operations, see operation steps, and take dependent actions based on operation progress. Learn More

Transactionally consistent database copy for Azure SQL Database Hyperscale tier: Now in preview, the Hyperscale service tier in Azure SQL Database supports generating a transactionally consistent database copy to either the same or different logical server or region, similar to database copy already supported in other service tiers. This enables scenarios where Hyperscale databases have to be copied for various reasons, such as development and testing. If a database copy is created within the same region, it is a fast operation regardless of data size, similar to a point in time restore in Hyperscale. A cross-region database copy is a slower size of data operation. If you require support for database copy in your workflow, you can now take advantage of unique Hyperscale capabilities such as support for very large databases, fast scaling, and multiple replicas. Learn how to create a database copy and visit our Hyperscale database copy blog for more details.

Hyperspace, an indexing subsystem for Apache Spark™, is now open source: For Microsoft’s internal teams and external customers, we store datasets that span from a few GBs to 100s of PBs in our data lake. The scope of analytics on these datasets ranges from traditional batch-style queries (e.g., OLAP) to explorative “finding the needle in a haystack” type of queries (e.g., point-lookups, summarization). Resorting to linear scans of these large datasets with huge clusters for every simple query is prohibitively expensive and not the top choice for many of our customers, who are constantly exploring ways to reduce their operational costs – incurring unchecked expenses is their worst nightmare. Over the years, we have seen a huge demand for bringing indexing capabilities that come de facto in the traditional database systems world into Apache Spark™. Today, we are making this possible by releasing an indexing subsystem for Apache Spark called Hyperspace – the same technology that powers indexing within Azure Synapse Analytics.

New MERGE command for Azure Synapse Analytics: ​The new MERGE command in Azure Synapse Analytics allows customers to insert, update, or delete data from one table based on the differences discovered in another table. With the MERGE command, the previous multi-step process to synchronize two tables in a dedicated SQL pool is now streamlined into a single step, thus reducing the data operation cost and improving the user experience. Learn More
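For example, a typical upsert from a staging table into a dimension table in a dedicated SQL pool now collapses into a single statement (the table and column names below are illustrative):

MERGE dbo.DimCustomer AS tgt
USING dbo.StageCustomer AS src
  ON tgt.CustomerId = src.CustomerId
WHEN MATCHED THEN
  UPDATE SET tgt.Email = src.Email, tgt.City = src.City
WHEN NOT MATCHED BY TARGET THEN
  INSERT (CustomerId, Email, City) VALUES (src.CustomerId, src.Email, src.City);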

COPY command now generally available in Azure Synapse Analytics: The COPY command feature in Azure Synapse Analytics provides users a simple, flexible, and fast interface for high-throughput data ingestion for SQL workloads. With this announcement, we’ve added performance improvements with file splits, ease of use with Parquet auto-schema discovery and complex data type support, and the support of the COPY command within our data integration partner ecosystem including: Databricks, Informatica, Streamsets, Panoply, and Talend. Learn More
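A hedged example of loading Parquet files into a dedicated SQL pool table with COPY (the path, table name, and authentication choice are placeholders):

COPY INTO dbo.FactSales
FROM 'https://mydatalake.dfs.core.windows.net/curated/sales/*.parquet'
WITH (
  FILE_TYPE = 'PARQUET',
  CREDENTIAL = (IDENTITY = 'Managed Identity')
)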

Column-level encryption for Azure Synapse Analytics: ​Column-level encryption (CLE) helps customers implement fine-grained protection of sensitive data within a table (server-side encryption). With CLE, customers gain the ability to use different protection keys for columns in tables with each key having its own access permissions. The data in CLE enforced columns is encrypted on disk (and remains encrypted in memory) until the DECRYPTBYKEY function is used to decrypt it. This feature applies to dedicated SQL pools in Azure Synapse Analytics. Learn More
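The pattern follows the classic SQL Server column-level encryption approach; a rough sketch (the key, table, and column names are made up, the encrypted column is assumed to be varbinary, and the key is protected by a password for brevity):

CREATE SYMMETRIC KEY SsnKey WITH ALGORITHM = AES_256
  ENCRYPTION BY PASSWORD = '<strong password>'
GO
OPEN SYMMETRIC KEY SsnKey DECRYPTION BY PASSWORD = '<strong password>'
-- Encrypt on write (SsnEncrypted is a varbinary column)...
UPDATE dbo.Employee SET SsnEncrypted = ENCRYPTBYKEY(KEY_GUID('SsnKey'), Ssn)
-- ...and decrypt on read with DECRYPTBYKEY
SELECT CONVERT(char(11), DECRYPTBYKEY(SsnEncrypted)) AS Ssn FROM dbo.Employee
CLOSE SYMMETRIC KEY SsnKey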

Azure Databricks – Delta engine with Photon: Announced the preview of Photon powered Delta engine on Azure Databricks – fast, easy, and collaborative Analytics and AI service. Built from scratch in C++ and fully compatible with Spark APIs, Photon is a vectorized query engine that leverages modern CPU architecture along with Delta Lake to enhance Apache Spark 3.0’s performance by up to 20x. As organizations worldwide embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly analyze massive amounts and types of data. However, this has been a challenge. While storage and network performance have increased 10x, CPU processing speeds have only increased marginally. This leads to the question if CPUs have become the bottleneck, how can we achieve the next level of performance? The answer with Photon lies in greater parallelism of CPU processing at both the data-level and instruction-level. Photon powered Delta Engine is a 100% Apache Spark-compatible vectorized query engine designed to take advantage of modern CPU architecture for extremely fast parallel processing of data. Read the blog to learn more.

Azure Arc enabled data services: Announcing Azure data services anywhere, a new capability enabled by Azure Arc that allows customers to run Azure data services on-premises, multi-cloud and edge, using Kubernetes on the hardware of choice. Azure Arc enabled SQL Managed Instance and Azure Arc enabled PostgreSQL Hyperscale are available in preview now, with other services to come over time. Customers can now deploy a variety of data services, with access to latest innovation, cloud benefits like automation and elastic scale, unified management across all data assets in hybrid environments, and unique Azure security and governance capabilities like Azure Security Center, Policies and Role-based Access Control for on-premises data workloads. Best of all, customers can use modern cloud billing models on-premises for better cost efficiency. Learn more about Azure Arc enabled data services

Azure ML Ignite 2020 announcements: ​Azure Machine Learning Enterprise Edition (preview) capabilities are merging into Basic Edition on September 22, 2020.  With these enhanced capabilities included in Basic Edition, the Enterprise Edition of Azure Machine Learning will retire on January 1, 2021.

Designer

  • No-code drag and drop experience for building and deploying ML models

AutoML UI

  • Automatically build and deploy predictive models using the no-code UI or through a code-first notebooks experience.

ML Labeling Assist

  • The machine learning assisted labeling feature lets users trigger automatic machine learning models to accelerate the labeling task.

Advanced RBAC (preview)

  • AzureML operation level RBAC will allow customers to set custom roles, or re-use pre-built roles to control specific operations for the individual users in a workspace.

mlflow updates (preview)

  • MLflow projects support for simplified job submission from local to cloud
  • MLflow model registry support
  • Model deployment support (take a model from the MLflow model registry and deploy it to ACI or AKS)
  • Expanded support for MLflow experimentation UI

Workspace Private Link (preview)

A network isolation feature that enables you to access Azure ML over a private IP in your virtual network (VNet).

Azure SQL Database – Cross-SQL MI distributed Tx – Preview: Distributed database transactions spanning multiple Azure SQL Managed Instances will be added soon to enable frictionless migration of existing applications, as well as development of modern multi-tenant applications relying on vertically or horizontally partitioned database architectures. By utilizing distributed transactions once public preview is announced, customers will be able to save time when migrating existing applications that require this capability, as it eliminates the need to change application code and to perform extensive testing. If customers develop new applications, they will be able to benefit from partitioning data into multiple databases to overcome current sizing limitations of Azure SQL Managed Instance, while utilizing distributed transactions to keep partitioned data in a consistent state. Once the preview is announced, two methods of running distributed transactions will be supported: using the BEGIN DISTRIBUTED TRANSACTION statement from Transact-SQL code, and using the TransactionScope class from .NET code.

Power BI announcements from Ignite:

  • Power BI Premium will soon be available on a per-user basis​.  Individuals will be able to upsize their per-seat Professional licenses to Premium. During the preview period, the upgrade carries no additional cost. See Answering your questions around the new Power BI Premium per user license
  • Power BI Premium is getting faster, and is gaining an autoscale capability that, when enabled, allows Microsoft to provision additional “V-cores” (virtual CPU cores) to the customer’s cloud tenant for periods of 24 hours, when overloads are detected. The v-cores are automatically removed during idle periods​. See Announcing the Upcoming Evolution of Power BI Premium to enterprise markets and beyond
  • Preview of “smart narratives,” an augmented analytics feature which provides plain-English summarizations of the data in a report, either on a per-visualization or page-wide basis. The narratives automatically update when data is filtered or drilled down upon, and the narratives are editable, both in terms of formatting and for insertion of arbitrary or calculation-driven text. See Smart Narrative (preview)
  • Dedicated Power BI add-in application for Microsoft’s Teams collaboration platform, released as a preview.  The Teams integration includes the ability to browse reports, dashboards, and workspaces and directly embed links to them in Teams channel chats. It’s not just about linking though, as Teams users can also browse Power BI datasets, either through an alphabetical listing or by reviewing a palette of recommended ones. In both cases, datasets previously marked as Certified or Promoted will be identified as such, and Teams users will have the ability to view their lineage, generate template-based reports on them, or just analyze their data in Excel. See Announcing: New Power BI experiences in Microsoft Teams
  • Through a new preview feature, Power BI asset sensitivity levels, applied as Microsoft Information Protection (MIP) labels, can be applied from within Power BI Desktop, making it unnecessary to apply them to the report and its underlying dataset in the cloud service’s user interface
  • Microsoft is enhancing its Power Query technology to feature a visual interface rather than relying on the data grid view that has been its hallmark. Essentially, Power Query gets a new Diagram View where all queries in the model, and each of the transformation steps within them, will appear together, in a flowchart-like representation. This visual presentation indicates the dependencies between queries and the sequence of individual transformations within each. New transforms can be easily added in the Diagram View as well. Visual data preparation is coming soon. It will be available initially in Power BI dataflows only (i.e. online), and not in Power BI Desktop. The capability will come to Desktop within the next calendar year
  • An anomaly detection visual (coming soon)
  • A new visual that can trigger Power Automate processes in a data-driven fashion (also coming soon)
  • A performance accelerator for Azure Synapse Analytics that automatically creates materialized views to accelerate frequent queries from Power BI
  • GA of deployment pipelines (including availability on government clouds) that can move Power BI assets between development, test and production environments. See Announcing deployment pipelines General Availability (GA)
  • A preview of a data impact analysis feature that can notify users of datasets that may be affected by a change to another dataset upstream

More info:

Driving a data culture in a world of remote everything

Microsoft Ignite 2020 Book of News

What’s new in Azure SQL Managed Instance at Ignite 2020

The post Microsoft Ignite Announcements 2020 first appeared on James Serra's Blog.

Synapse and federated queries

As I was finishing up a demo script for my presentation at the SQL PASS Virtual Summit on 11/13 (details on my session here), I wanted to blog about part of the demo that shows a feature in the public preview of Synapse that is, frankly, very cool. It is the ability to query data as it sits in ADLS Gen2, a Spark table, and Cosmos DB and join the data together with one T-SQL statement using SQL on-demand (also called SQL serverless), making it a federated query (also known as data virtualization). The beauty of this is that you don’t have to first write ETL to collect all the data into a relational database in order to query it all together, and you don’t have to provision a SQL pool, saving costs. Further, because you are using T-SQL to query all of those data sources, you can use a reporting tool like Power BI to see the results.

The queries are also fast, as SQL on-demand pushes queries down from the front-end node to the back-end nodes (which contain a SQL Server engine) that sit next to the storage. Note there is no cache yet in SQL on-demand, so queries won’t run faster after the first run. On to my demo:

Working inside Azure Synapse Studio, first, I wrote a T-SQL view that queried all the parquet files in an ADLS Gen2 folder (the key is the OPENROWSET function):

CREATE VIEW [dbo].[vFQ_ADLSTable] AS
SELECT
  ADLSTable.*
FROM
  OPENROWSET(
  BULK 
  'https://asaexpdatalakejs1234.dfs.core.windows.net/twitterdata/*.parquet',
  FORMAT='PARQUET'
  ) AS ADLSTable

Notice how I am using a wildcard in the file path to query all parquet files in the folder instead of just one.
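
If you also want to see which file each row came from, serverless SQL exposes a filename() function on the OPENROWSET alias. A hedged sketch using the same (hypothetical) storage path as the view above:

-- Count rows per source file; filename() returns the file each row was read from
SELECT
  ADLSTable.filename() AS source_file,
  COUNT(*) AS row_count
FROM
  OPENROWSET(
  BULK
  'https://asaexpdatalakejs1234.dfs.core.windows.net/twitterdata/*.parquet',
  FORMAT='PARQUET'
  ) AS ADLSTable
GROUP BY ADLSTable.filename()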

Then I wrote a T-SQL view that queried a Spark table called bannedusers in a Spark database called users. Even though it is a Spark table and the Spark cluster is not running, I am able to query this table using SQL on-demand because of a feature in Synapse that copies the metadata for the Spark table to SQL on-demand so you can use it outside of Spark (see Azure Synapse Analytics shared metadata):

CREATE VIEW [dbo].[vFQ_SparkTable] AS
SELECT 
  SparkTable.* 
FROM users.dbo.bannedusers AS SparkTable

Then I wrote a T-SQL view that queries a database in Cosmos DB called CosmosIOTDemo and the container bannedcities using Azure Synapse Link for Cosmos DB:

CREATE VIEW [dbo].[vFQ_CosmosTable] AS
SELECT
  CosmosTable.*
FROM OPENROWSET (
    'CosmosDB', N'account=synapselinkdemoserra;database=CosmosIOTDemo;region=eastus;key=xxx',bannedcities
    ) with ( id varchar(36), city varchar(1), cityname varchar(30), banned varchar(1) ) AS CosmosTable

Finally, I created a view using all three above views to join the data together. The parquet files in ADLS Gen2 are twitter feeds, and the Spark table and Cosmos DB table are reference tables that list the users and cities in the twitter feeds that should be excluded from the query (they have to be in the table with the banned field set to Y to not show in the query). While the twitter feed was only about 4000 rows, this query only took 2-3 seconds (you may have to wait an additional 2-3 seconds for the SQL on-demand pool to warm up the first time):

CREATE VIEW [dbo].[vFQ_Combined] AS
SELECT 
    t1.*, 
    t2.Banned as UserBanned, 
    t3.Banned as CityBanned
FROM 
    vFQ_ADLSTable t1 
        LEFT JOIN vFQ_SparkTable t2
            ON (t1.UserName COLLATE SQL_Latin1_General_CP1_CI_AS = t2.username)
        LEFT JOIN vFQ_CosmosTable t3
            ON (t1.City = t3.cityname)
WHERE (t2.banned is NULL or t2.banned = 'N')
AND (t3.banned is NULL or t3.banned = 'N')

Then I went to the Develop hub, and under Power BI chose “Power BI datasets” (first you have to link to a Power BI workspace), chose “New Power BI dataset”, chose my SQL on-demand pool, and downloaded the Power BI Desktop Source (.pbids) file that has the connection info to the SQL on-demand database. Once I clicked on that download file, it opened up Power BI Desktop and allowed me to choose the combined query vFQ_Combined. You can choose to import the data or use DirectQuery to get the results in real-time as any file that is dropped into the ADLS Gen2 folder along with the other parquet files will be read.

It appears to the end user that the data is stored in a relational database because the view is creating an abstract layer with the metadata (although because of automatic schema inference you don’t even need to specify the field layouts for a file in ADLS Gen2 or a Spark table).

Data in ADLS Gen2 can be moved to different access tiers to save costs, and eventually when SQL on-demand supports querying the relational SQL Provisioned pool you can archive out older relational data to ADLS Gen2 and use a query on that storage and combine it with a query on the relational database to also save costs and improve performance within the relational database.

As I said, very cool!

The post Synapse and federated queries first appeared on James Serra's Blog.

Top Azure Synapse Analytics and Power BI questions

I have listed below the top questions I see from customers about Azure Synapse Analytics and Power BI, and the blogs that I have written that try to answer each question. I have also added these questions to my blog’s FAQ page, which also includes “Modern data warehouse” and “SQL Server migration to Azure” and which I have updated.

Azure Synapse Analytics

Below are the top questions I am seeing from customers about Azure Synapse Analytics (formerly called Azure SQL Data Warehouse), and the blogs that I have written that try to answer each question:

  1. What is the difference between the GA version of Synapse and the one in public preview? Azure Synapse Analytics confusion
  2. What are the newest features for the SQL Pool? Azure Synapse Analytics new features and Microsoft Ignite Announcements 2020
  3. What are the performance options when using Synapse and Power BI? Azure Synapse Analytics & Power BI performance
  4. Can I run Power BI dashboards against Synapse? Azure Synapse Analytics & Power BI concurrency
  5. Can you explain the new SQL on-demand in Synapse? SQL on-demand in Azure Synapse Analytics
  6. What is a “Data Lakehouse”? Data Lakehouse & Synapse
  7. What are all the ways to query data in Synapse? Query options in Azure Synapse Analytics
  8. Are there samples to use in Synapse? Azure Synapse Analytics overlooked features
  9. How can I create federated queries in Synapse? Synapse and federated queries

Power BI

Below are the top questions I am seeing from customers about Power BI, and the blogs that I have written that try to answer each question:

  1. What are Composite models and Aggregations? Power BI new feature: Composite models
  2. What are Dataflows? Can I clean data with Power BI? Power BI: Dataflows and Where should I clean my data?
  3. What is the Microsoft Power Platform? The Microsoft Power Platform
  4. Can Power BI access data in ADLS Gen2? Ways to access data in ADLS Gen2
  5. What are performance options when using Power BI against Synapse? Azure Synapse Analytics & Power BI performance
  6. Can I use Power BI dashboards against Synapse? Azure Synapse Analytics & Power BI concurrency
  7. How can I stream data into Power BI and do real-time updates? Power BI Real-time Streaming
  8. What are the various type of security options in Power BI? Power BI Security
  9. What are all the ways to share dashboards and reports in Power BI? Power BI Sharing
  10. What are the ways to monitor performance and usage in Power BI? Monitoring Power BI
  11. What were the latest announcements about Power BI at Ignite? Microsoft Ignite Announcements 2020
  12. What are other questions you hear customers asking? Power BI FAQ

The post Top Azure Synapse Analytics and Power BI questions first appeared on James Serra's Blog.


External tables vs T-SQL views on files in a data lake

A question that I have been hearing recently from customers using Azure Synapse Analytics (the public preview version) is: what is the difference between using an external table and using a T-SQL view on a file in a data lake?

Note that a T-SQL view and an external table pointing to a file in a data lake can be created in both a SQL Provisioned pool as well as a SQL On-demand pool.

Here are the differences that I have found:

  • Overall summary: views are generally faster and have more features such as OPENROWSET
  • Virtual functions (filepath and filename) are not supported with external tables, which means users cannot do partition elimination based on filepath() or use complex wildcard expressions via OPENROWSET (both of which can be done with views; see the filepath() sketch after this list)
  • External tables can be shareable with other computes, since their metadata can be mapped to and from Spark and other compute experiences, while views are SQL queries and thus can only be used by SQL On-demand or SQL Provisioned pool
  • External tables can use indexes to improve performance, while views would require indexed views for that
  • SQL On-demand automatically creates statistics both for external tables and for views using OPENROWSET. You can also explicitly create/update statistics on files queried via OPENROWSET. Note that automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics manually until automatic creation of statistics for CSV files is supported
  • Views give you more flexibility in the data layout (external tables expect the OSS Hive partitioning layout for example), and allow more query expressions to be added
  • External tables require an explicit defined schema while views can use OPENROWSET to provide automatic schema inference allowing for more flexibility (but note that an explicitly defined schema can provide faster performance)
  • If you reference the same external table in your query twice, the query optimizer will know that you are referencing the same object twice, while two identical OPENROWSETs will not be recognized as the same object. For this reason, better execution plans can be generated in such cases when using external tables instead of views using OPENROWSETs
  • Row-level security (Polybase external tables for Azure Synapse only) and Dynamic Data Masking will work on external tables. Row-level security is not supported with views using OPENROWSET
  • You can use both external tables and views to write data to the data lake via CETAS (this is the only way either option can write data to the data lake)
  • If using SQL On-demand, make sure to read Best practices for SQL on-demand (preview) in Azure Synapse Analytics
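
To make the virtual function point above concrete, here is a hedged sketch of partition elimination with filepath() in a serverless OPENROWSET query (the storage account, container, and year=/month= folder layout are hypothetical):

-- Only folders matching year=2020 are read because of the filepath(1) filter,
-- something an external table over the same data cannot do
SELECT
  r.filepath(1) AS [year],
  r.filepath(2) AS [month],
  COUNT(*) AS row_count
FROM
  OPENROWSET(
  BULK
  'https://mydatalake.dfs.core.windows.net/sales/year=*/month=*/*.parquet',
  FORMAT='PARQUET'
  ) AS r
WHERE r.filepath(1) = '2020'
GROUP BY r.filepath(1), r.filepath(2)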

I often get asked what the difference in performance is when querying via an external table or view against a file in ADLS Gen2 vs. querying against a highly compressed table in a SQL Provisioned pool (i.e. a managed table). It’s hard to quantify without understanding more about each customer’s scenario, but you will roughly see a 5X performance difference between queries over external tables and views vs. managed tables (obviously, depending on the query, that will vary, but that’s a rough number – it could be more than 5X in some scenarios). A few things contribute to that: in-memory caching, SSD-based caches, result-set caching, and the ability to design and align data and tables when they are stored as managed tables. You can also create materialized views for managed tables, which typically bring lots of performance improvements as well. If you are querying Parquet data, that is a columnstore file format with compression, so it gives you similar data/column elimination to what a managed SQL clustered columnstore index (CCI) would give, but if you are querying non-Parquet files you do not get this functionality. Note that for managed tables, on top of performance, you also get a granular security model, workload management capabilities, and so on (see Data Lakehouse & Synapse).
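
Since materialized views come up in this comparison, here is a hedged sketch of one in a dedicated SQL pool (the fact table name and distribution column are hypothetical):

-- Pre-aggregates tweet counts by city; the optimizer can answer matching queries
-- from the materialized view instead of scanning the base table
CREATE MATERIALIZED VIEW dbo.mvTweetCountsByCity
WITH (DISTRIBUTION = HASH(City))
AS
SELECT
  City,
  COUNT_BIG(*) AS TweetCount
FROM dbo.FactTweets
GROUP BY City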

The post External tables vs T-SQL views on files in a data lake first appeared on James Serra's Blog.

Azure Stack and Azure Arc for data services

For those companies that can’t yet move to the cloud, have certain workloads that can’t move to the cloud, or have limited to no internet access, Microsoft has options to build your own private on-prem cloud via Azure Stack and Azure Arc. I’ll focus this blog on using these products to host your databases.

Azure Stack is an extension of Azure that provides a way to run apps and databases in an on-premises environment and deliver Azure services via three options:

Azure Stack Hub: Run your own private, autonomous cloud—connected or disconnected with cloud-native apps using consistent Azure services on-premises. Azure Stack Hub integrated systems are composed of racks of 4-16 servers built by trusted hardware partners and delivered straight to your datacenter. Azure Stack Hub is built on industry-standard hardware and is managed using the same tools you already use for managing Azure subscriptions. As a result, you can apply consistent DevOps processes whether you’re connected to Azure or not. The Azure Stack Hub architecture lets you provide Azure services for remote locations with intermittent connectivity or disconnected from the internet. You can also create hybrid solutions that process data locally in Azure Stack Hub and then aggregate it in Azure for additional processing and analytics. Finally, because Azure Stack Hub is installed on-premises, you can meet specific regulatory or policy requirements with the flexibility of deploying cloud apps on-premises without changing any code. See Azure Stack Hub overview

Azure Stack Edge: Get rapid insights with an Azure-managed appliance using compute and hardware-accelerated machine learning at edge locations for your Internet of Things (IoT) and AI workloads. Think of it as a much smaller version of Azure Stack Hub that uses purpose-built hardware-as-a-service such as Pro GPU, Pro FPGA, Pro R, and Mini R. The Mini is designed to work in the harshest environment conditions, supporting scenarios such as tactical edge, humanitarian and emergency response efforts. See Azure Stack Edge documentation

Azure Stack HCI (preview): A hyperconverged infrastructure (HCI) cluster solution that hosts virtualized Windows and Linux workloads and their storage in a hybrid on-premises environment. Think of it as a virtualization fabric for VM or Kubernetes hosting – software only, to put on your certified hardware. See Azure Stack HCI solution overview

These Azure Stack options are almost all VMs/IaaS, with no PaaS options for data services such as SQL Database (the only data service available is SQL Server in a VM). It is integrated certified hardware and software run by Microsoft, just plug in and go. For support, there is “one throat to choke” as the saying goes. It is a great option if you are disconnected from Azure. It extends Azure management and security to any infrastructure and provides flexibility in deployment of applications, making management more consistent (a single view for on-prem, clouds, and edge). It brings the Azure fabric to your own data center but allows you to use your own security requirements. Microsoft orchestrates the upgrades of hardware, firmware, and software, but you control when those updates happen.

Azure Arc is a software-only solution that can be deployed on any hardware, including Azure Stack, AWS, or your own hardware. With Azure Arc and Azure Arc-enabled data services (preview) you can deploy Azure SQL Managed Instance (SQL MI) and Azure Database for PostgreSQL Hyperscale to any of these environments, which requires Kubernetes. It can also manage SQL Server in a VM by just installing an agent on the SQL Server (see Preview of Azure Arc enabled SQL Server is now available). Any of these databases can then be easily moved from your hardware to Azure down the road. It allows you to extend Azure management across your environments, adopt cloud practices on-premises, and implement Azure security anywhere you choose.

This allows for many options to use Azure Arc on Azure Stack or on other platforms (click to expand):

Some features about Azure Arc:

  • It can be used to solve for data residency requirements (data sovereignty)
  • It is supported in disconnected and intermittently connected scenarios such as air gapped private data centers, cruise ships that are off the grid for multiple weeks, factory floors that have occasional disconnects due to power outages, etc.
  • Customers can use Azure Data Studio (instead of the Azure Portal) to manage their data estate when operating in a disconnected/intermittent connected mode
  • Could eventually support other products like Azure Synapse Analytics
  • Can use larger hardware solutions and more hardware tiers than what is available in Azure, but you have to do your own HA/DR
  • You are not charged if you shut down SQL MI, unlike in Azure, as it’s your hardware, where in Azure the hardware is dedicated to you even if you are not using it
  • With Arc you are managing the hardware, but with Stack Microsoft is managing the hardware
  • Can use modern cloud billing models on-premises for better cost efficiency
  • With Azure Arc enabled SQL Server, you can use the Azure Portal to register and track the inventory of your SQL Server instances across on-premises, edge sites, and multi-cloud in a single view. You can also take advantage of Azure security services, such as Azure Security Center and Azure Sentinel, as well as use the SQL Assessment service
  • Azure Stack Hub provides consistent hardware, but if you use your own hardware you have more flexibility and possibly cheaper hardware costs

These slides cover the major benefits of Azure Arc and what the architecture looks like:

Looking at the differences when you are connected directly vs. connected indirectly (in the indirect case, an Arc server is not connected to the Internet, so it must coordinate with a server that is connected):

Here is what an Azure Arc data services architecture looks like:

Some of the top use cases we see with customers using Azure Stack and/or Azure Arc:

  • Cloud-to-cloud failover
  • On-prem databases with failover to cloud
  • Easier migration: Deploy locally, then flip a switch to go to cloud

This slide provides details on the differences with SQL databases (click to expand):

More info:

Understanding Azure Arc Enabled SQL Server

What is Azure Arc Enabled SQL Managed Instance

The post Azure Stack and Azure Arc for data services first appeared on James Serra's Blog.

PASS SQLSaturdays I will be speaking at

I will be speaking at two upcoming PASS SQLSaturdays. These are free events that you can attend virtually:

Azure Synapse Analytics: A Data Lakehouse

12/5/20, 1:10pm EST, PASS SQL Saturday Atlanta BI (info) (slides)

Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. In this presentation, I’ll talk about the new products and features that make up Azure Synapse Analytics and how it fits in a modern data warehouse, as well as provide demonstrations. (register) (full schedule)

How to build your career

12/12/20, 4:30pm EST, PASS SQL Saturday Minnesota, (info) (slides)

In three years I went from a complete unknown to a popular blogger, speaker at PASS Summit, a SQL Server MVP, and then joined Microsoft.  Along the way I saw my yearly income triple.  Is it because I know some secret?  Is it because I am a genius?  No!  It is just about laying out your career path, setting goals, and doing the work.

I’ll cover tips I learned over my career on everything from interviewing to building your personal brand.  I’ll discuss perm positions, consulting, contracting, working for Microsoft or partners, hot fields, in-demand skills, social media, networking, presenting, blogging, salary negotiating, dealing with recruiters, certifications, speaking at major conferences, resume tips, and keys to a high-paying career.

Your first step to enhancing your career will be to attend this session! Let me be your career coach! (register) (full schedule)

The post PASS SQLSaturdays I will be speaking at first appeared on James Serra's Blog.

Azure Synapse Analytics is GA!

(Note: I will give a demo on Azure Synapse Analytics this Saturday Dec 5th at 1:10pm EST, at the PASS SQL Saturday Atlanta BI (info) (register) (full schedule))

Great news! Azure Synapse Analytics is now GA (see announcement). While most of the features are GA, there are a few that are still in preview.

For those of you who were using the public preview version of Azure Synapse Analytics, nothing has changed – just access your Synapse workspace as before.

For those of you who have a Synapse database (i.e. SQL DW database) that was not under a Synapse workspace, your existing data warehouse resources are now listed under “Dedicated SQL pool (formerly SQL DW)” in the Azure portal (where you can still create a standalone database, called a SQL pool). You now have three migration options going forward for your existing database:

  1. Standalone: Keep the database (called a SQL pool) as is and get none of the new workspace features listed here, but you are able to continue using your database, operations, automation, and tooling like before with no changes
  2. Enable Azure Synapse workspace features: Go to the overview page for your existing database and choose “New synapse workspace” in the top menu bar and get all the new features except unified management and monitoring. All management operations will continue via SQL resource provider. Except for SQL requests submitted via the Synapse Studio, all SQL monitoring capabilities remain on the database (dedicated SQL pool). So basically you should just continue to use your current processes for monitoring your SQL resources on the server. For more details on the steps to enable the workspace features see Enabling Synapse workspace features for an existing dedicated SQL pool (formerly SQL DW)
  3. Migrate to Azure Synapse workspace: Create a user-defined restore point through the Azure portal, create a new synapse workspace or use an existing one, and then restore the database and get all the new features. All monitoring and management is done via the Synapse workspace and the Synapse Studio experience

I also wanted to point out that Dedicated SQL pool storage costs have dropped 83% to $23 per TB per month (East US)! Check out the Azure Calculator.

More info:

Microsoft introduces Azure Purview data catalog; announces GA of Synapse Analytics

The post Azure Synapse Analytics is GA! first appeared on James Serra's Blog.

New Microsoft data governance product: Azure Purview

Azure Purview is a data governance solution that is the successor to Azure Data Catalog, and it is now available in public preview.

Purview catalogs data from on-premises, multi-cloud, or software-as-a-service (SaaS) locations. Purview lets you understand exactly what data you have, manage its compliance with privacy regulations, and derive insights. Create a unified, up-to-date understanding of your data landscape with automated data discovery, sensitive data classification, and end-to-end data lineage with the Purview Data Map. Purview aims to maximize the compliant use of your own data by understanding it, how it moves (i.e. lineage), and who it is shared with. It also integrates with Microsoft Information Protection and Power BI sensitivity labels and certification and promotion labels.

Azure Purview includes three main components:

  • Data discovery, classification, and mapping: It will automatically find all of an organization’s data on-premises or in the cloud, even those that are managed by other providers, and evaluate the characteristics and sensitivity of the data as it scans it
  • Data catalog: It enables all users to search for trusted data using a simple web-based search engine. There are also visual graphs that let you quickly see if data of interest is from a trusted source
  • Data governance: It provides a bird’s-eye view of your company’s data landscape, enabling “data officers” to efficiently govern the use of data. This enables key insights such as the distribution of data across multiple environments, how data is moving, and where sensitive data is stored

There is a sophisticated search engine to view all the scanned items:

It tracks data lineage (click to expand):

Below are the nine different sources you can currently scan (more to come soon). I got all the scans to work on all of the sources except Power BI, as that requires a bit of extra work to scan a workspace different from the one in your subscription (by default, the system will use the Power BI tenant that exists in the same Azure subscription). To register a Power BI workspace outside your subscription, see Use PowerShell to register and scan Power BI in Azure Purview (preview). For those sources that are not supported, there is an option to submit data to the catalog via the Azure Purview REST APIs. You can also use the APIs to build your own user experience on top of the catalog.

You can also use a “map view” to see all the sources and group them under collections (click to expand):

Azure Purview also comes with system-defined classification rules, but you can also add your own custom classification rules:

To ramp up quickly, I suggest you visit the Azure Purview product page, get started with the Azure Purview documentation, and view the Mechanics video to see Azure Purview in action. You can give your feedback via UserVoice.

More info:

A first look at Azure Purview – Data Governance for your data estate

Azure Synapse Analytics – Introduction to Azure Purview

Microsoft introduces Azure Purview data catalog; announces GA of Synapse Analytics

Use Power BI with Azure Purview to achieve better data governance and discovery

Map your data estate with Azure Purview

Unified Data Governance using Azure Purview – preventing Data Lake from becoming a Data Swamp

The post New Microsoft data governance product: Azure Purview first appeared on James Serra's Blog.

DirectQuery for Power BI datasets and Azure Analysis Services (preview)

Announced last week is a major new feature for Power BI: you can now use DirectQuery to connect to Azure Analysis Services or Power BI Datasets and combine it with other DirectQuery datasets and/or imported datasets. This is a HUGE improvement that has the Power BI community buzzing! Think of it as the next generation of composite models. Note this requires the December version of Power BI Desktop, and you must go to Options -> Preview features and select “DirectQuery for Power BI datasets and Analysis Services”.

(Diagram: combinations of enterprise semantic models with Power BI datasets)

You begin with a Power BI dataset that has been published to the Power BI service. In Power BI Desktop you connect to the dataset, where you can build a new model over the original one.  You can extend the original model by adding tables, columns, and measures, and you can also connect to other datasets and combine them into a single semantic model.  While doing this you do not lose any elements from the original model – measures and relationships continue to work. You do not have to worry about anything but the additional data you want to integrate. When the original model is refreshed, your local model also sees any updated information. You work as if you have a local copy of the model and full rights to modify and expand it, even though you are not duplicating any data already stored on the server.

This feature is ideal for report authors who want to combine the data from their enterprise semantic model with other data they may own like an Excel spreadsheet, or who want to personalize or enrich the metadata from their enterprise semantic model. This seals the marriage between self-service and corporate BI.

The main technology that makes this work is DirectQuery storage mode. DirectQuery will allow composite models to work with live connected sources and other data sources like Excel or SQL Server.

Using DirectQuery for Power BI datasets and Azure Analysis Services requires that your report has a local model.  You can start from a live connection to an existing dataset and upgrade to a local model, or start with a DirectQuery connection or imported data, which will automatically create a local model in your report.

Live connection is basically a connection to the remote model (the model is not inside of Power BI Desktop).  Converting the remote model to DirectQuery gives you a local model in Power BI Desktop.  So if you want to make changes to your live connection, you will first convert to a DirectQuery connection. If you don’t need to make changes to the remote model and just combine it with other models, you can keep the remote model as a live connection.

Keep in mind that when you publish a report with a local model to the service, a dataset for that local model will be published as well. This is the same behavior as when you publish a report with imported data to the service.

When connecting to a remote model, the data for that model is kept in the cloud, and you can join it with another local model and its data to create new columns, new measures, and new tables without ever moving the data from the remote model to your local PC. It’s an easy way to extend that remote model, which could be managed by IT and refreshed all the time. You are just responsible for managing and refreshing your local model and data. This is how you combine the efforts of enterprise IT and end users.

With this new feature I can see many companies creating enterprise semantic models in Azure Analysis Services or in a premium instance of Power BI. These semantic models can have the entire business logic of your company using the huge amount of data from your data warehouse. Then users can use DirectQuery against those models and extend those models locally with their own calculations, without having to download any data from the semantic models to Power BI Desktop. This is definitely a game changer.

This new feature allows you to do dataset chaining. Chaining allows you to publish a report and a dataset that is based on other Power BI datasets, which was previously not possible. Together, datasets and the datasets and models they are based on form a chain.

For example, imagine your colleague publishes a Power BI dataset called Sales and Budget that is based on an Azure Analysis Services model called Sales, and combines it with an Excel sheet called Budget.

If you create and publish a new report and dataset called Sales and Budget Europe that is based on the Sales and Budget Power BI dataset published by your colleague, making some further modifications or extensions, you are effectively adding a report and dataset to a chain of length three (the max supported), which started with the Sales Azure Analysis Services model and ends with your Sales and Budget Europe Power BI dataset:

This opens a whole new world of possibilities. Before this feature was available, to modify a dataset you would have to get a copy of the pbix file with the dataset and make your own pbix copy, which would also include the data. Also, you were not able to chain models together or to combine datasets (i.e. making models of models). This is quite an improvement!

Share your feedback on this new feature at this Power BI Community forum post.

More info:

New composite models in Power BI: A milestone in Business Intelligence

Power BI Direct Query Composite Models = Amazing

Composite Models Gen 2 and DirectQuery over Power BI Datasets

Power BI Composite Models using Analysis Services -Direct Query Mode

Composite models over Power BI datasets and Azure Analysis Services

“Composite” arrived – CAUTION!

Prologika Newsletter Winter 2020

The Most Important Update to Power BI Desktop, Ever!

DirectQuery for Power BI datasets and Analysis Services. What is it and why it is a big deal?

Understand the queries behind Composite Models

The post DirectQuery for Power BI datasets and Azure Analysis Services (preview) first appeared on James Serra's Blog.

Syncing SQL databases

There are various ways to keep SQL Server databases and SQL Database/SQL Managed Instance (SQL MI) databases in sync asynchronously, which I will discuss in this blog. The main use cases for syncing databases are to distribute reads (i.e. a second copy for reporting), for a hybrid application (having some data on-prem and some in Azure), for a globally distributed application (reduce network latency by having data in a region close to you), or to keep a test or development database in sync with a production database. The options below are not for disaster recovery (DR), unless indicated otherwise.

Some of these options were discussed in my blog Distributed Writes, which focused on scaling out database writes.

  1. An ETL tool such as Azure Data Factory (ADF) or SQL Server Integration Services (SSIS). This will require building of the solution, which could be complex due to having to determine what records have changed in the source database (see Methods for populating a data warehouse)
  2. Azure SQL Data Sync: This is an Azure web service that provides bi-directional data synchronization capabilities for SQL databases. It allows data to be synchronized between on-premises SQL Server databases and Azure SQL databases; in addition, it can also keep multiple Azure SQL databases in sync. It does not support Azure SQL Managed Instance at this time. Note there is a minimum 5-minute frequency between syncs
  3. SQL Server Always On availability group (which supports Hybrid, Cross-OS and Distributed), SQL Database Active Geo-Replication, or SQL Managed Instance Auto-failover group: One-way replication that automatically copies your data to one or more secondary replicas. Note this could be used for DR
  4. SQL Server Replication: There are six types of replication to copy data from one database to another. SQL Server supports all six. Azure SQL Database can only be used as the push subscriber in a one-way transactional or snapshot replication topology (see Replication to Azure SQL Database). The SQL Server instance acting as publisher and/or distributor can be an instance of SQL Server running on-premises, an Azure SQL Managed Instance, or an instance of SQL Server running on an Azure virtual machine in the cloud. In preview is the ability of SQL MI to publish to an on-prem SQL Server as well as allowing an on-prem SQL Server to publish to SQL MI
  5. SQL Server backup/restore, export/import of a bacpac file for SQL Server/SQL Database/SQL MI, or a database snapshot for SQL Server are options with the drawbacks of doing a full copy instead of just an incremental copy, and having longer periods where the destination database is not available. This could be used as an inexpensive solution for DR, with the tradeoff of possible data loss and long downtime
  6. Use a 3rd party product: Attunity Replicate for SQL Server (now Qlik Replicate), ScaleArc for SQL Server, StarQuest Data Replication, RedGate SQL Data Compare
  7. Azure Database Migration Service (DMS) has an online (continuous sync) option for copying data from SQL Server to Azure SQL Database or SQL MI with minimal downtime
  8. If you need to copy data for latency and/or data sovereignty reasons, Azure Arc might be a solution (see Azure Stack and Azure Arc for data services)
  9. Create a messaging and queuing application in SQL Server Service Broker where all writes are placed on the queue and sent to different servers
  10. Create a message queue using an asynchronous Azure Event Hub
  11. All of the above options require no changes to your source database. If you are in a position that would allow you to spend the time to change your database to another solution, then Cosmos DB could be used to greatly reduce the complexity of copying data since it can transparently replicate data to all the regions you specify nearly instantaneously with no downtime

The post Syncing SQL databases first appeared on James Serra's Blog.


Data Lakehouse defined

As a follow-up to my blog Data Lakehouse & Synapse, I wanted to talk about the various definitions I am seeing about what a data lakehouse is, including a recent paper by Databricks.

Databricks uses the term “Lakehouse” in their paper (see Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics), which argues that the data warehouse architecture as we know it today will wither in the coming years and be replaced by a new architectural pattern, the Lakehouse. Instead of the two-tier data lake + relational data warehouse model, you will just need a data lake, which is made possible by implementing data warehousing functionality over open data lake file formats.

While I agree there may be some uses cases where technical designs may allow Lakehouse systems to completely replace relational data warehouses, I believe those use cases are much more limited than this paper suggests.

It’s funny how when Hadoop first came out, I heard many say that the end of relational data warehouses was here and to just use a data lake for everything. This was clearly a mistake (see Is the traditional data warehouse dead?). Now we are here once again!

For simplicity I’ll break down a data lakehouse into two types of architectures: one-tier that is data lake (in the form of using schema-on-read storage), which I’ll call NoEDW, and two-tier that is a data lake and a relational database (in the form of an enterprise data warehouse, or EDW), which I’ll call ProEDW. While Databricks touts NoEDW by using Delta Lake and SQL Analytics, Microsoft touts ProEDW with Azure Synapse Analytics.

For NoEDW, my thought process is, if you are trying to make a data lake work like a relational database, why not just use a relational database (RDBMS)? Then have the data lake do what it is good at, and the RDBMS do what it is good at?

The extra cost, complexity, and time to value in incorporating a relational database into a data lakehouse is worth it for many reasons, one of which is a relational database combines the metadata with the data to make it much easier for self-service BI compared to a data lake where the metadata is separated out from the data in many cases. This becomes even more apparent when you are dealing with data from many different sources.

I can certainly see some uses where you could be fine with a NoEDW: if you have a small amount of data, if the users are all data scientists (hence have advanced technical skills), if you just want to build a POC, or if you want to get a quick win with a report/dashboard. The other prior reason, which was to save costs, is much less compelling now that the storage cost for Synapse has dropped around 80%, so it is about the same cost as data lake storage.

But I still see it being very difficult to manage a solution with just a data lake when you have data from many sources. Having the metadata along with the data in a relational database allows everyone to be on the same page as to what the data actually means, versus more of a wild west with a data lake. And a ProEDW gives you the additional benefits of speed, security, and features that I mentioned at Data Lakehouse & Synapse.

In addition, the NoEDW option requires using Delta Lake, adding a layer of complexity and requiring all tools using the data lake to have to support Delta Lake. Also note that Delta Lake does not support cross-table transactions and that Databricks does not have a pay-per-query approach like Synapse serverless has. The paper also does not discuss how master data management (MDM) fits in, which are almost always relational database solutions.

You definitely do need a data lake (see reasons). The bottom line is you can try to get by with just a NoEDW, but it is very likely that you will run into issues and will need to have some of the data in the data lake copied to a relational database.

Federated queries

What I’m seeing customers do is adopting a lakehouse architecture that goes beyond the data lake and the data warehouse. They do this by using federated queries to integrate data across the data lake, data warehouses, and any purpose-built data services that are being used.

Part of this architecture is making it easy to query data in multiple sources by building out a semantic layer using a distributed query engine like Presto or OPENROWSET in a serverless pool in Synapse. With Synapse you can do a federated query over ADLS Gen2, Spark Tables, and Cosmos DB, and eventually others such as Synapse dedicated pools, SQL Database and SQL Managed Instance. It uses T-SQL and is pay-per-query, so simply create SQL views over the data and they can be called from any tool that supports T-SQL, such as Power BI (see Synapse and federated queries).

A federated query gives you this ad-hoc view. So if you want to see what you’ve got in multiple places, you can write a query and get a result right now. Then think of a SQL view as being able to materialize that result on a continuous basis. Part of thinking about data movement is not just getting data from A to B, but also being able to query data wherever it might live and get a result back in one place.

While using federated queries removes the ETL complexity and staleness, there are many cons, so make sure to read Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement. Querying distributed data falls into the Data Mesh category, a distributed data architecture, the subject of my next blog.

Unified analytics platform

I see Synapse as a great solution for a ProEDW, a unified analytics platform approach, where it incorporates a data lake, a relational data warehouse, spark tables, and tools such as Azure Data Factory, Power BI, and soon Azure Purview all under one roof called Azure Synapse Studio. It also supports federated queries mentioned above via a serverless model.

Synapse also makes supporting machine learning (ML) easier via automated ML for data in a data lake – see Tutorial: Train a machine learning model without code, Tutorial: Sentiment analysis with Cognitive Services (Preview), and Tutorial: Anomaly detection with Cognitive Services (Preview) – and even for relational data – see Tutorial: Machine learning model scoring wizard (preview) for dedicated SQL pools

All these options allow for many ways to transform and move data, so here is my list of those options in order of easiest to hardest to build and fewest features to most features:

Transforming data options:

  1. *Power BI Data flows
  2. Azure Data Factory Wrangling Data flows
  3. Azure Data Factory Mapping Data flows
  4. T-SQL in serverless or dedicated SQL pools
  5. Synapse Spark
  6. *Databricks

ETL options:

  1. *Power BI Data flows
  2. Azure Data Factory
  3. T-SQL in serverless or dedicated SQL pools (COPY INTO)
  4. T-SQL in serverless or dedicated SQL pools (CETAS/CTAS) (see the sketch after this list)
  5. Synapse Spark
  6. *Databricks

* = done outside Synapse
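
For options 3 and 4 in the ETL list, here is a hedged sketch of what each looks like (the storage path, target table, external data source, and file format objects are hypothetical, and COPY INTO runs in a dedicated SQL pool):

-- Option 3: COPY INTO loads files from the lake into an existing dedicated pool table
-- (a CREDENTIAL clause may be needed depending on how the storage is secured)
COPY INTO dbo.TwitterFeed
FROM 'https://asaexpdatalakejs1234.dfs.core.windows.net/twitterdata/*.parquet'
WITH (FILE_TYPE = 'PARQUET')

-- Option 4: CETAS transforms data with T-SQL and writes the result back to the lake as Parquet
CREATE EXTERNAL TABLE dbo.TwitterFeed_Curated
WITH (
  LOCATION = 'curated/twitterdata/',
  DATA_SOURCE = MyDataLake,      -- external data source created beforehand
  FILE_FORMAT = ParquetFormat    -- external file format created beforehand
)
AS
SELECT UserName, City, COUNT(*) AS TweetCount
FROM dbo.vFQ_ADLSTable
GROUP BY UserName, City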

More info:

Data Alone Is Not Enough: The Evolution of Data Architectures

The post Data Lakehouse defined first appeared on James Serra's Blog.

Data Mesh defined

The two latest trends in emerging data platform architectures are the Data Lakehouse (the subject of my last blog Data Lakehouse defined), and the Data Mesh, the subject of this blog.

Data Mesh was first introduced by ThoughtWorks via the blog How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. From that blog comes this graphic (Data mesh architecture from a 30,000-foot view):

The data mesh is a new approach to designing and developing data architectures. Unlike a centralized and monolithic architecture based on a data warehouse and/or a data lake, a data mesh is a highly decentralized data architecture.

Data mesh tries to solve three challenges with a centralized data lake/warehouse:

  • Lack of ownership: who owns the data – the data source team or the infrastructure team?
  • Lack of quality: the infrastructure team is responsible for quality but does not know the data well
  • Organizational scaling: the central team becomes the bottleneck, such as with an enterprise data lake/warehouse

Its goal is to treat data as a product, with each source having its own data product manager/owner (who is part of a cross-functional team of data engineers) and being its own clearly focused domain that has an autonomous offering, becoming the fundamental building blocks of a mesh and leading to a domain-driven distributed architecture. Note that for performance reasons, you could have a domain that aggregates data from multiple sources. Each domain should be discoverable, addressable, self-describing, secure (governed by global access control), trustworthy, and interoperable (governed by an open standard). Each domain will store its data in a data lake and in many cases will also have a copy of some of the data in a relational database (see Data Lakehouse defined for why you still want a relational database in most cases).

Another component in a data mesh is data infrastructure as a platform, which provides storage, pipeline, data catalog, and access control to the domains. The main idea is to avoid duplicating effort. This will allow each data product team to build its data products quickly. Note this data infrastructure platform should not become a data platform (it stays domain agnostic).

It’s a mindset shift where you go from:

  • Centralized ownership to decentralized ownership
  • Pipelines as first-class concern to domain data as first-class concern
  • Data as a by-product to data as a product
  • A siloed data engineering team to cross-functional domain-data teams
  • A centralized data lake/warehouse to an ecosystem of data products

As for my opinion on Data Mesh, it’s something that sounds great in theory, but I’m really struggling to see how companies are going to solve it technically. It would seem to require full-blown proprietary data virtualization software, and data virtualization has many issues (I already blogged about those at Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement). There is also a large gap in open-source or commercial tooling to accelerate implementation of a data mesh (for example, implementation of a universal access model to time-based polyglot data).

I am seeing attempts from Microsoft customers at building a data mesh. Here is an excellent slide by John Mallinder from Microsoft (click to expand) for a customer building a data mesh, in which he uses the name “Harmonized Mesh”:

Creating this in the Azure world, Azure Purview would be your starting point for discovering data. If you need to do cross-domain queries, also called federated queries, you would use Synapse serverless with Azure virtual network peering if querying data from storage accounts (by linking the storage accounts in each Synapse workspace). If querying data from Synapse relational dedicated pools, that would currently require extra work, such as using Synapse Spark notebooks, Databricks, Power BI, or Azure Data Factory data flows to call multiple databases hosted in separate dedicated pools.
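
Here is a hedged sketch of what such a cross-domain federated query could look like from a serverless SQL pool, assuming the two (hypothetical) domain storage accounts have been linked to the workspace and permissions granted:

-- Join curated data owned by two different domains in a single serverless query
SELECT
  s.CustomerId,
  s.SalesAmount,
  m.Segment
FROM
  OPENROWSET(
  BULK
  'https://salesdomainlake.dfs.core.windows.net/curated/sales/*.parquet',
  FORMAT='PARQUET'
  ) AS s
  INNER JOIN
  OPENROWSET(
  BULK
  'https://marketingdomainlake.dfs.core.windows.net/curated/segments/*.parquet',
  FORMAT='PARQUET'
  ) AS m
  ON s.CustomerId = m.CustomerId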

Keep in mind a data mesh only makes sense for companies with many large domains of data, where there is usually a lot of political infighting over who controls the data and/or data sovereignty is needed. So typically a data mesh is only for the largest companies, as it can be very difficult and time consuming to set up this environment. I am aware of a handful of companies who have been setting this up for many months, but I have not yet seen a company with a data mesh in production; I am hopeful though. Please comment below if you have a data mesh in production!

More info:

Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes Beyond the Data Lake

The Data Mesh – The New Kid On The Data Architecture Block

The Distributed Data Mesh as a Solution to Centralized Data Monoliths

Data mesh

Keynote – Data Mesh by Zhamak Dehghani

Data Mesh Paradigm Shift in Data Platform Architecture

The Data Mesh: Re-Thinking Data Integration

What is a Data Mesh — and How Not to Mesh it Up

The post Data Mesh defined first appeared on James Serra's Blog.

Power BI and a Star Schema

I sometimes get asked from customers if they should use a star schema or should they use one large single table/flatfile for their dataset in Power BI. The answer is almost always star schema, and Patrick LeBlanc does a great job explaining what a star schema is and why star schemas are so important for Power BI in his video Why Power BI loves a Star Schema. I wanted to list out his four reasons:

  1. Usability: If you have just one large table with all the data, when creating a report it is difficult to find the fields you are looking for in the “Fields” list. With a star schema, the Fields list is much easier to navigate. Star schemas make your data model much cleaner and easier to use
  2. Simpler DAX: The code for DAX will be much simpler against a star schema compared to a wide flat table. It will be shorter, easier to write, easier to read, and easier to maintain
  3. Performance: Star schemas can handle huge amounts of data without impacting performance. The test Patrick ran was against 7 million rows, and a query took 29ms against the one large table compared to 7ms against the equivalent star schema. You can imagine the performance difference against tens of millions of rows
  4. Faster refresh: In Power BI Desktop, Patrick used Power Query against 7.8 million records from a SQL Server database and chose Close and Apply. For the large table, this took 34 minutes (2069 rows/second). The star schema took only 4 minutes (28k rows/second)

Patrick also has a video showing how to turn a flat file into a star schema: Power BI Tutorial | From Flat File To Data Model.

And a report using a flattened table could return inaccurate numbers, as pointed out by sqlbi: The importance of star schemas in Power BI.

For more information on star schema, check out the “bible” by Ralph Kimball: The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling.

More info:

Understand star schema and the importance for Power BI

Basics of Modeling in Power BI: What is a Dimension Table and Why Say No to a Single Big Table

Power BI Basics of Modeling: Star Schema and How to Build it

Build Your First Star Schema Model in Action: Power BI Modeling Basics

Power BI Tips for Star Schema and Dimensional Data Modeling

Visualise your Power BI Refresh

The post Power BI and a Star Schema first appeared on James Serra's Blog.

Microsoft Ignite Announcements 2021

Microsoft Ignite has always announced many new products and new product features, and this year was no exception. Many exciting announcements, and below I list the major data platform and AI related announcements:

  • Azure Arc enabled machine learning (preview): Build models on-premises, in multi-cloud, and at the edge with Azure Arc. It does this by deploying Azure Machine Learning to a Kubernetes cluster that resides on-prem or in another cloud. More info
  • Azure Migrate new features: Discover and assess your SQL servers and their databases for migration to Azure from within the Azure Migrate portal, get target SKU recommendations, and estimate monthly costs. More info
  • Azure SQL enhancements: Maintenance window – Azure SQL Database and Azure Managed Instance fixed maintenance windows (preview) – More info; Advance notifications – Enable you to prepare for planned maintenance events on your Azure SQL Database resources and minimize the impact of database failover on your sensitive workloads – More info
  • Azure Purview (preview): Ability to scan Amazon AWS S3 and Amazon AWS subscriptions, Oracle DB, SAP ECC, SAP S4/HANA, Azure Synapse Analytics workspaces, Azure subscriptions and resource groups; ability to implement a hierarchical business glossary; support for private endpoints; Purview-powered search in Synapse Studio; extension of the free four capacity units until June 1st. More info
  • Azure Synapse Pathway: Migrate from your current data warehouse to Azure Synapse Analytics. It will translate existing SQL code optimized for your source system (e.g. Snowflake) to SQL code optimized for your new target system (e.g. Azure Synapse Analytics). This preview version of Azure Synapse Pathway supports conversion of databases, schemas, and tables from IBM Netezza, SQL Server, and Snowflake today. Soon, it will add support for Teradata, followed by Redshift, BigQuery and Hive-based Hadoop solutions. It will expand the surface area of conversion to database views, procedures, functions and more to further simplify the automation of analytics workloads into Azure Synapse. More info
  • Azure Cosmos DB: API for MongoDB v4.0, GA; Continuous backup and PITR, Public Preview; RBAC with AAD integration, Public Preview; Azure Synapse Link, GA. More info
  • Azure Managed Instance for Apache Cassandra (public preview): Connect to an on-prem Cassandra database. Eventually it will have data replication from Cassandra to Cosmos DB. More info
  • Azure Cache for Redis: Now with Enterprise tiers and active-active geo-replication which gives multi-master writes and five-9’s of availability. More info
  • Azure Cognitive Search: Semantic Search Capability – When entering search phrases it returns instant answers (Short snippet of the most relevant query result), Better Query Understanding (Results based on meaning of words and not just syntax), and Intelligent Ranking (Contextual query ranking using deep neural networks). In addition, a new SharePoint connector makes it easier to ingest and explore SharePoint content within Azure Cognitive Search. More info
  • Form Recognizer: Pre-built IDs (Passport and Driving License), pre-built invoices, 64 new languages, support for custom extraction. More info
  • Azure Stream Analytics: Now provides support for single-tenant dedicated clusters (GA). Azure Stream Analytics Dedicated provides single-tenant hosting for increased reliability with no noise from other tenants. Customer resources are “isolated” and perform better during bursts in traffic. More info
  • Power BI: Gen2, the latest update to Power BI Premium, is now available in public preview. This includes architectural updates for across-the-board improvements to performance, stability and security. Premium Gen2 delivers enhancements for every Power BI Premium customer, including: Flexibility to license per user or per capacity; Faster processing — up to 16x performance boost with Microsoft managed architecture; Autoscale to automatically add capacity when needed; and consistent and reliable cost management with utilization metrics. Power BI Premium Per User will become generally available on April 2, 2021. More info

For more details, check out the Microsoft Ignite book of news.

More info:

Microsoft Ignite Data and Analytics roundup: Platform extensions are the key theme

At Microsoft Ignite, expanding reach for Azure ML & Purview, Power BI Premium

The post Microsoft Ignite Announcements 2021 first appeared on James Serra's Blog.

A new WordPress theme and a new job

As you can tell, I have switched to a new WordPress theme for this blog site. I have now been blogging for ten years! It seemed like a good time to finally make a change 🙂

Speaking of which, I have also changed jobs. After seven years at Microsoft, I have moved to EY as a Data Platform Architecture Lead, where I am a Director in the Client Technology Group. Microsoft was BY FAR the best company I have ever worked for. As evidence, the longest I have ever stayed at a company before Microsoft was for four years at one of my first jobs back in the 80’s. The reasons I joined Microsoft still ring true, which I blogged about here: Why I just became a Microsoft Employee.

I met many great people at Microsoft and I will continue those relationships while at EY as I will remain immersed in Microsoft technology. Crazy that I have been using Microsoft technologies now for 35 years. I still remember using SQL Server 1.0 in 1989 on OS/2!

I will continue to blog on Microsoft technologies, present at conferences, post presentation decks, and do a video or two in the future.

EY is a great new opportunity in which I will stretch my skillset, and I am really looking forward to it. What spurred me on to make this major career change are the traits that I have leaned on many times in my career: having a growth mindset, taking risks, and the saying “Take a leap: Extend beyond what you think you are capable of”.

On to a new adventure!

The post A new WordPress theme and a new job first appeared on James Serra's Blog.