Almost lost in all the announcements from Ignite were a number of amazing new features added to the Provisioned Resources/SQL Pool (read: SQL DW functionality) side of Azure Synapse Analytics (formerly called Azure SQL Data Warehouse). Here is a quick run-down of those features:
Just wanted to make everyone aware of my latest presentations that I recently uploaded. Details below. I also have a list of all my presentations with slide decks here.
Azure Synapse Analytics Overview
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works. (slides)
Data Lake Overview
The data lake has become extremely popular, but there is still confusion on how it should be used. In this presentation I will cover common big data architectures that use the data lake, the characteristics and benefits of a data lake, and how it works in conjunction with a relational data warehouse. Then I’ll go into details on using Azure Data Lake Store Gen2 as your data lake, and various typical use cases of the data lake. As a bonus I’ll talk about how to organize a data lake and discuss the various products that can be used in a modern data warehouse. (slides)
Power BI Overview
Power BI has become a product with a ton of exciting features. This presentation will give a detailed overview of some of them, including Power BI Desktop, Power BI service, what’s new, integration with other services, Power BI premium, and administration. (slides)
Power BI Overview, Deployment and Governance
Deploying Power BI in a large enterprise is a complex task, and one that requires a lot of thought and planning. The purpose of this presentation is to help you make your Power BI deployment a success. After a quick Power BI overview, I’ll discuss deployment strategies, common usage scenarios, how to store and refresh data, prototyping options, how to share externally, and then finish with how to administer and secure Power BI. I’ll outline considerations and best practices for achieving an optimal, well-performing, enterprise level Power BI deployment. (slides)
With two new relational database features (Result-set caching and Materialized Views) just GA’d in Azure Synapse Analytics (formerly called Azure SQL Data Warehouse), it makes for some very compelling reporting performance options when combined with Power BI. In this blog I’ll discuss the different ways you can make Power BI queries run faster, and whether you still need Azure Analysis Services or if the tabular model (i.e. “cube”) within Power BI is enough. Note there is a separate preview version of Azure Synapse Analytics which adds workspaces and new features such as Apache Spark, Azure Synapse Studio, and serverless on-demand queries, and in which the relational database engine and relational storage are part of a “SQL Analytics” pool. Everything in this blog also applies to the SQL Analytics pool. To avoid confusion, in the rest of this blog I will use “SQL DW” to refer to both the current version of Azure Synapse Analytics and the SQL Analytics pool that is in preview.
First a review of options available within Power BI:
Import: The selected tables and columns from the data source (i.e. SQL DW) are imported into Power BI Desktop and into the computer’s memory. As you create or interact with a visualization, Power BI Desktop uses the imported data and never touches the data source (under the covers, Power BI stores the data in an Analysis Services engine in-memory cache). You must refresh the data, which imports the full data set again (or use the Power BI Premium incremental refresh feature), to see any changes that occurred to the underlying data since the initial import or the most recent refresh (so it’s not real-time). Imported datasets in the Power BI service have a 10GB dataset limitation for the Power BI premium version (with 400GB in preview, which is what Azure Analysis Services supports) and a 1GB limitation for the Power BI free version. Note data is heavily compressed when imported into memory, so you can import much larger datasets than these limits suggest. See Data sources in Power BI Desktop.
DirectQuery: No data is imported or copied into Power BI Desktop. Instead, as you create or interact with a visualization, Power BI Desktop queries the underlying data source (i.e. SQL DW), which means you’re always viewing the latest data in SQL DW (i.e. real-time). DirectQuery lets you build visualizations over very large datasets, where it otherwise would be unfeasible to first import and aggregate all of the data (although now with support for 400GB datasets and with Aggregation tables the need to use DirectQuery because the dataset won’t fit into memory goes away in many cases and DirectQuery is needed only if real-time results are required). See Data sources supported by DirectQuery.
Composite model: Allows a single report to seamlessly combine data from one or more DirectQuery sources, and/or combine data from a mix of DirectQuery sources and imported data. So this means you can combine multiple DirectQuery sources with multiple Import sources.
Dual Storage Mode: Dual tables can act as either cached (imported) or not cached, depending on the context of the query that’s submitted to the Power BI dataset. In some cases, you fulfill queries from cached data. In other cases, you fulfill queries by executing an on-demand query (DirectQuery) to the data source.
Aggregations: Create an aggregated table (which will be in-memory if set to Import mode) from an underlying detail table (which is set to DirectQuery, meaning the detailed data is kept at the data source and not imported). If the user query asks for data that it can get from the aggregated table, it will get it from the in-memory table. Otherwise, it will use a DirectQuery to the underlying detail table. You can create multiple aggregation tables using different summations off of one detail table. Think of aggregation tables as mini-cubes, or a performance optimization technique similar to the way you use indexes in SQL databases to help speed up SQL queries. Be aware it could take some time to create the aggregated table (it’s similar to processing a cube) and the data is not real-time (it’s as of the last refresh). See Aggregations in Power BI and Creative Aggs Part I : Introduction.
You may ask: are aggregation tables still needed now that Power BI supports large models (400GB) in Import mode? In most cases, yes. For one, large model support is only in Power BI Premium, and if you are not using that you are limited to 10GB models. Also, importing a large detail table imports all of the detailed records with no aggregations being done (if it’s a 26 billion row table, you are importing 26 billion rows into memory). So really large tables won’t be able to be imported even with a 400GB limit, whereas aggregation tables allow you to create aggregations, hence the name (a 26 billion row table may reduce to only 10 million aggregated rows, which easily fits in memory). So you need to replicate much less data into memory. And if a query wants an aggregation that it can’t get from the in-memory aggregation table, no problem: Power BI will send a DirectQuery to the detail table. So in summary, aggregation tables can unlock massive datasets that would not otherwise fit in memory, save costs by needing a smaller SKU, and avoid having to manage replicating all that data into memory.
Two new performance features in Azure Synapse Analytics that are a big deal:
Result-set caching: Automatically caches query results in the user database for repetitive use. This allows subsequent query executions to get results directly from the persisted cache so recomputation is not needed. Result-set caching improves query performance (down to milliseconds) and reduces compute resource usage. In addition, queries using cached result sets do not use any concurrency slots in Azure Synapse Analytics and thus do not count against existing concurrency limits
Materialized Views: A view that pre-computes, stores, and maintains its data in SQL DW just like a table. There’s no recomputation needed each time a materialized view is used. Queries that use all or a subset of the data in materialized views can get faster performance. Even better, queries can use a materialized view without making direct reference to it, so there’s no need to change application code (see the T-SQL sketch after this list)
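To make these two features concrete, here is a minimal T-SQL sketch (the database, table, and column names are hypothetical) of turning on result-set caching and defining a materialized view:

```sql
-- Run against the master database: enable result-set caching for a
-- (hypothetical) SQL DW database named SalesDW.
ALTER DATABASE SalesDW SET RESULT_SET_CACHING ON;

-- Run in SalesDW: pre-compute and store an aggregation over a
-- (hypothetical) fact table. Queries that aggregate store sales by year
-- can be answered from this view without referencing it directly.
CREATE MATERIALIZED VIEW dbo.StoreSalesByYear
WITH (DISTRIBUTION = HASH(SalesYear))
AS
SELECT SalesYear,
       SUM(SalesAmount) AS TotalSales,
       COUNT_BIG(*)     AS RowCnt
FROM dbo.FactStoreSales
GROUP BY SalesYear;
```

Once the materialized view exists, the SQL DW optimizer can rewrite a matching query to use it automatically, with no change to the query text.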
So, if we look at all the layers involved where a query can access data when using Power BI that is using Azure Synapse Analytics as the data source, it would look like this:
As an example of the speed of each layer, during an Ignite session (view here), there was a Power BI query run against 26 billion rows that was returning a sum of store sales by year. The same query was run three times using a different layer:
Using a DirectQuery against tables in SQL DW took 8 seconds
Using a DirectQuery against a materialized view in SQL DW took 2.4 seconds. Note you don’t have to specify that you are using a materialized view in the query, as the SQL DW optimizer will know if it can use it or not
Using an Aggregation table that is Imported into Power BI took 0 milliseconds
Keep in mind this is all hidden from the user – they just create the report. If they do a query against a table not in memory in Power BI, it will do a DirectQuery against the data source, which could take a while. However, due to SQL DW result-set caching, repeat DirectQueries can be very fast (in the Ignite session they demo’d a DirectQuery that took 42 seconds the first time the query was run, and just 154 milliseconds the second time, when it used result-set caching).
So by using features such as result-set caching and materialized views, you may achieve the results you are looking for without having to load the data into Power BI.
One thing to note: In the short term, Synapse won’t leverage materialized views with queries that use outer joins (which are the types of queries Power BI sends by default). So Power BI users will need to set the trust data source for data integrity setting for materialized views to be useful (this won’t be required when Synapse goes GA). Also note that materialized views can be queried directly, so users can create a materialized view and in Power BI they can create aggregations that refer to them directly.
These new features bring into question whether you still need Azure Analysis Services (AAS) or if the tabular model (i.e. “cube”) within Power BI is enough. Or whether you don’t need either – instead just use DirectQuery within Power BI.
For the first question, the goal is that soon there won’t be a reason to use AAS, assuming you already have Power BI Premium (especially now with XMLA endpoint support). Eventually Power BI Premium will be a superset of AAS. But AAS will be supported for the foreseeable future for the lift-and-shift scenario: AAS has modeling parity with SSAS on-prem, so customers that haven’t yet looked at the full modernization benefits of Power BI (see those benefits below) may choose to move to AAS, perhaps as a stop-gap till they’re ready to commit to Power BI (but eventually lifting and shifting SSAS to Power BI will be supported). And of course AAS would still be needed if you wanted to use a 3rd-party product against it (a reporting tool, for example) and were not able to use Power BI Premium (because your company had standardized on a different reporting tool, for example). A scenario I currently see is customers spending money training a small group of people to build AAS cubes and making them available to a large group of users who can then do self-service reporting against the cubes.
The good news is that it should be very easy to deploy an Analysis Services model to Power BI Premium in the future, so you won’t lose any work if you use Analysis Services now. You’ll simply change the value of the deployment server property in Visual Studio from your existing SSAS or Azure AS server to the XMLA endpoint of the Power BI Premium workspace.
For the second question, on whether you can just use DirectQuery against SQL DW and avoid using both AAS and a cube within Power BI, the answer is ‘yes’ if you are certain the queries to be used will always hit the result-set cache in SQL DW (see what’s not cached here and when cache results are used here). But if some won’t, especially when you are using a Power BI dashboard where you need millisecond response time for all queries, it would be best to use a cube or aggregations in Power BI for the dashboards and use SQL DW for ad-hoc queries.
A common question I hear from customers is whether, because of the performance of Azure Synapse Analytics (formerly called Azure SQL Data Warehouse or SQL DW), they can run Power BI dashboards against it using DirectQuery (and not have to use Azure Analysis Services (AAS), import the data into Power BI, or use Power BI aggregation tables), avoiding having another copy of the data (saving money) and keeping the data “real time” (as of the last refresh of the data warehouse)?
There are two things to think about in considering an answer to this question. The first is whether you will get the performance you need (discussed in my last blog); the second is whether a certain number of concurrent queries or connections will cause a problem (the subject of this blog).
In Power BI, if you use import mode or create aggregation tables, then you are not hitting SQL DW. But if you define a table to be in DirectQuery mode or Dual Storage Mode (and the query is not fulfilled in cache so switches to DirectQuery mode) then you need to worry about concurrency against SQL DW.
The first thing to be aware of is that there is a limit of 128 max concurrent queries in SQL DW and 1,024 max open concurrent connections. When the concurrency limit is exceeded, the request goes into an internal queue where it waits to be processed. When the connection limit is exceeded, connections will be refused with an error. So if you have thousands of dashboard users who will be hitting the database at the same time, you may want to leave SQL DW just for ad-hoc queries and have those dashboard users query against another source such as: data marts in Azure SQL Database, AAS tabular models (“cubes”), or within Power BI by importing the data or using Power BI aggregation tables.
This concurrent query limit is imposed since there is no resource governor or CPU query scheduler in SQL DW like there is in SQL Server. But the benefit is each SQL DW query gets its own resources (CPU, memory) and it won’t affect other queries (i.e. you don’t have to worry about a query taking all resources and blocking everyone else). There are also resource classes in SQL DW that allow more memory and CPU cycles to be allocated to queries run by a given user so they run faster, with the trade-off that it reduces the number of concurrent queries that can run.
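As a hedged example (the user name is made up), moving a user into a larger static resource class is just a role-membership change in the SQL DW user database:

```sql
-- Give the (hypothetical) loaduser login the 'largerc' static resource class
-- so its queries get more memory and CPU, at the cost of consuming more
-- concurrency slots.
EXEC sp_addrolemember 'largerc', 'loaduser';

-- Revert to the default resource class (smallrc) later if needed.
EXEC sp_droprolemember 'largerc', 'loaduser';
```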
Note that the maximum number of concurrent queries is based on the SQL DW service level (see here), as is the maximum number of concurrent open sessions (see here). Keep in mind that concurrent queries that have to wait in the queue will likely take only a few extra seconds to run, due to the fast performance of SQL DW. I have seen customers running 600+ concurrent queries with no user complaints about delays, due to the nature of those queries being “ad-hoc” and overall taking just a few seconds. But you would likely get complaints if these queries came from a dashboard where users are expecting millisecond response time as they slice and dice through the dashboard.
Another option mentioned before is to use Azure Analysis Services (AAS), which supports thousands of concurrent queries and connections based on the tier selected, and can support even more queries via the scale-out feature. Note there are no hard limits for concurrent queries or connections, but there are certainly soft limits (based on tier/QPU and average query duration). For example, a tier of 100 QPU = ~5 cores = ~5 concurrent queries, and any additional queries “wait” for resources. So while the queries don’t fail, performance and the user experience suffer (this math gets a bit more complicated with the new query interleaving feature). Because AAS contains aggregated data, queries against it usually take just a few milliseconds.
Be aware that SQL DW queries that hit the SQL DW result-set cache do not use any concurrency slots in Azure Synapse Analytics and thus do not count against existing concurrency limits.
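If you want to see how a DirectQuery workload is behaving on the SQL DW side, a quick sketch against the requests DMV shows which queries are waiting for a concurrency slot and which were answered from the result-set cache (column list trimmed for readability):

```sql
-- Recent requests: a 'Suspended' status generally means the request is
-- waiting for a concurrency slot, and result_cache_hit = 1 means the
-- result came straight from the result-set cache.
SELECT request_id,
       [status],
       submit_time,
       total_elapsed_time,
       result_cache_hit,
       command
FROM sys.dm_pdw_exec_requests
ORDER BY submit_time DESC;
```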
Coming to Azure Synapse Analytics in the future is a feature called Multi-master cluster where user workloads can operate over the same shareable relational data set while having independent clusters to serve those various workloads. This allows for very high concurrency. This was demo’d at Ignite by Rohan Kumar showing 10k concurrent queries (video at 0:28). Two features called Workload Isolation (public preview) and Workload Importance will make this even more powerful. I’ll post a blog with more details when Azure Synapse Analytics GA’s.
DevOps, a set of practices that combines software development (Dev) and information-technology operations (Ops), has become a very popular way to shorten the systems development life cycle and provide continuous delivery of applications (“software”). The implementation of continuous delivery and DevOps to data analytics has been termed DataOps, which is the topic of this blog.
Databases are more difficult to manage than applications from a development perspective. Applications, generally, do not concern themselves with state. For any given “release” or build an application can be deployed and overlaid over the previous version without needing to maintain any portion of the previous application. Databases are different. It’s much harder to deploy the next version of your database if you need to be concerned with maintaining “state” in the database.
So what is the "state" you need to be concerned with maintaining?
Lookup data is the simple example. Almost every database has tables that are used for allowable values, lookup data, and reference data. If you need to change that data for a new release, how do you do that? What happens if the customer or user has already changed that data? How do you migrate that data?
Another example: a table undergoes a major schema migration. New columns are added and the table is split and normalized among new tables. How do we write the migration code to ensure it runs exactly once or runs multiple times without side effects (using scripts that are “idempotent”)?
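As a sketch of what "idempotent" means in practice (the table and column names are invented), a migration script guards every change so it can be run any number of times with the same end result:

```sql
-- Add a new column only if it is not already there.
IF NOT EXISTS (SELECT 1
               FROM sys.columns
               WHERE object_id = OBJECT_ID('dbo.Customer')
                 AND name = 'MiddleName')
BEGIN
    ALTER TABLE dbo.Customer ADD MiddleName NVARCHAR(50) NULL;
END;

-- Upsert lookup/reference data so re-running the script never duplicates rows.
MERGE dbo.OrderStatus AS target
USING (VALUES (1, 'Open'), (2, 'Shipped'), (3, 'Cancelled'))
      AS source (StatusId, StatusName)
   ON target.StatusId = source.StatusId
WHEN MATCHED THEN
    UPDATE SET StatusName = source.StatusName
WHEN NOT MATCHED THEN
    INSERT (StatusId, StatusName) VALUES (source.StatusId, source.StatusName);
```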
Other objects that require state to be considered during an upgrade:
Indexes: what happens if an index is renamed or an included column is added? What happens if the DBA adds a new emergency index? Will your DevOps tool remove it since it isn’t in an official build?
Keys: if you change a primary key, will that change require the PK to be dropped and recreated? If so, what happens to the foreign keys?
In most cases, database objects like functions, views, and stored procedures have no state considerations and can be re-deployed during every release.
So how do you overcome these "state" difficulties, especially if you are aiming towards frequent releases and agile, collaborative development?
The first step is to make a major decision when including databases in your DevOps processes, and that is how you will store the data model. There are two options:
Migration-based deployment: Sometimes called transformation-based deployment, this is the most common option today and is a very traditional way to work with databases during development. At some point you create an initial database (a “seed” database that is a single migration script stored inside source control), and after that you keep every script that’s needed to bring the database schema up to the current point (you can use SQL Server Management Studio to create the scripts). Those migration scripts will have an incremental version number and will often include data fixes or new values for reference tables, along with the Data Definition Language (DDL) required for the schema changes. So basically you are migrating the database from one state to another. The system of truth in a migration-based approach is the database itself. There are a few problems with this option:
Deployments keep taking longer as more and more scripts need to be applied when upgrading a database. A way around this is to create new seed databases on a regular basis to avoid starting with the very first database
A lot of wasted time can happen with large databases when dealing with, for example, the design of an index. If the requirements keep changing, a large index can be added to the database, then deleted, then reapplied slightly differently (i.e. adding a new column to it), and this can be repeated many times
There is no data model that shows what the database should really look like. The only option is to look at the freshly updated database
Upgrade scripts can break if schema drift occurs. This could happen if a patch was made to a production server and those changes didn’t make it back to the development environment or were not implemented the same way as was done in the production environment
Upgrade scripts can also break if not run in the correct order
State-based deployment: With this option you store the data model by taking a snapshot of the current state of the database and putting it in source control, and then use comparison tools to figure out what needs to be deployed (i.e. doing a schema compare between your repository and the target database). Every table, stored procedure, view, and trigger is saved as a separate .sql file, which is the real representation of the state of your database object. This is a much faster option, as the only changes deployed are those needed to move from the current state to the required state (usually via a DACPAC). This is what SQL Server Data Tools (SSDT) for Visual Studio does with its database projects, which include schema comparison and data comparison tools, or you can use a product like SQL Compare from Red-Gate. Using the index example above, in this option you simply create the final index instead of creating and modifying it multiple times. In a state-based approach the system of truth is the source code itself. Another good thing is that you do not have to deal with ALTER scripts with a state-based approach – the schema/data compare tool takes care of generating the ALTER scripts and running them against the target database without any manual intervention. So the developer just needs to keep the database structure up-to-date and the tools will do all the work. The end result is that there is much less work needed with this option compared to migration-based deployment (a small sketch follows below)
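To contrast the two approaches with a small hedged example (object names invented): in a state-based project, source control holds only the desired end state of each object, and the comparison tool works out whatever ALTER is needed to get a given target database there:

```sql
-- What lives in source control (e.g. an SSDT database project file for dbo.Customer):
-- just the desired end state of the table.
CREATE TABLE dbo.Customer
(
    CustomerId INT           NOT NULL PRIMARY KEY,
    FirstName  NVARCHAR(50)  NOT NULL,
    LastName   NVARCHAR(50)  NOT NULL,
    Email      NVARCHAR(256) NULL   -- a new column is added directly to this definition
);

-- The kind of script a schema-compare tool would generate and run against a
-- target database that does not yet have the Email column.
ALTER TABLE dbo.Customer ADD Email NVARCHAR(256) NULL;
```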
While it may seem state-based deployment is always the way to go, migration-based deployment may make more sense in scenarios where you need more fine-grained control over the scripts, since with state-based deployment you are not able to modify the difference script. And having control over the scripts allows you to write better scripts than you think the schema compare would generate. Other reasons are: by making the change a first-class artifact, you can “build once, deploy often” (as opposed to something new that is generated prior to each deployment); you encourage small, incremental changes (per Agile/DevOps philosophy); and it’s much easier to support parallel development strategies with migrations – in part because the migrations themselves are small, incremental changes (i.e. the ability to deploy different features or development branches to target databases, that is, environments like stage and production).
As a follow-up to my blog post Azure Data Lake Store Gen2 is GA, I wanted to give some pointers on using ADLS Gen2 as well as blob storage, as it can get a bit confusing with all the options that are available.
Note that underneath the covers, ADLS Gen2 uses Azure Blob Storage and is simply a layer over blob storage providing additional features (i.e. hierarchical file system, better performance, enhanced security, Hadoop compatible access).
For blob storage, you organize a set of files/blobs under a container. In the Azure portal this is located in “Containers” under “Blob service”. It is called “Blob Containers” in both the portal and desktop Storage Explorer. For ADLS Gen2, you also use containers, located in the portal in “Containers” under “Data Lake Storage”. It is called “Blob Containers” in the desktop Storage Explorer but it is called “File Systems” in the portal Storage Explorer
In the Azure portal, for blob storage, you can upload/access files by going to the storage account and choosing Containers (under “Blob service”) or by using the Storage Explorer (preview) in the portal. For ADLS Gen2, Containers (under “Data Lake Storage”) has no functionality except to create a container (click “File system”), but you can use the portal Storage Explorer. However, that is limited (you can’t upload files or change access tiers), so you should use the desktop Azure Storage Explorer to upload files or change access tiers
There are two types of storage performance tiers: Premium and Standard. The Premium performance tier can’t be changed to the Standard performance tier and vice versa, so this is locked in when you create the storage account
The Premium performance tier is not yet available for ADLS Gen2, and only supports locally redundant storage (LRS)
There are three types of storage access tiers: Hot, Cool, and Archive. You can change access tiers with the Standard performance tier, but not with the Premium performance tier. Only block blobs support access tiers
When creating a storage account, you will be asked for the Account kind and you should use the default of General purpose v2 (StorageV2) unless you want to create block blobs or append blobs with the premium performance tier in which case you should choose Block Blob (BlockBlobStorage). Note that BlockBlobStorage accounts don’t currently support tiering to hot, cool, or archive access tiers
A storage access tier can be set for each file, but if it is not set it will default to the access tier (Hot or Cool) that the storage account is set to. The account access tier is the default tier that is inferred by any file without an explicitly set tier. The Archive access tier can only be set at the file level and not on the account
For blob storage, you can specify the access tier for a file (hot, cool, or archive) when uploading via the portal, but not when using the desktop Storage Explorer. For ADLS Gen2, there is not a way to upload files via the portal and you also can’t specify the access tier when uploading via the desktop Storage Explorer
For blob storage, to change an access tier, in the Azure portal, under the storage account, go to the container and choose the file and click “Change tier” to change its access tier. Or go to the portal or go to desktop Storage Explorer and right-click the file and choose “Change Access Tier”. For ADLS Gen2, you must use desktop Storage Explorer to change the access tier
For blob storage, you can specify that the blob type of a file is block, page, or append when uploading the file via the portal or with desktop Storage Explorer. Once a file has been created, its blob type cannot be changed. ADLS Gen2 only supports block blob type
Data in the Archive tier blob cannot be read until it is rehydrated to the Cool or Hot tier. The “standard” rehydration process can take up to 15 hours to complete. There is a priority retrieval (called “high”) that takes less than an hour (see Azure Archive Storage expanded capabilities: faster, simpler, better). You specify the rehydrate priority (standard or high) when choosing to switch from the archive tier on the portal. The option to choose the high priority retrieval is not available in the desktop Storage Explorer and is not available anywhere for ADLS Gen2
With Power BI real-time streaming, you can stream data and update dashboards in real-time. Any visual or dashboard that can be created in Power BI can also be created to display and update real-time data and visuals. When learning how to do this, I found it a bit difficult, so wanted to write this blog to hopefully make it easier for you.
At a high level, you create real-time visuals by logging into the Power BI Service and choosing “Streaming dataset” from the Create menu, choosing the streaming dataset type (API, Azure Stream, or PubNub), then adding report visuals or tiles to your dashboard that use that streaming dataset. You will then push data into the streaming dataset using various methods (Power BI REST APIs, Streaming Dataset UI, Azure Stream Analytics).
Note you can’t create a streaming dataset in Power BI Desktop, but you can connect to a streaming dataset created in the Power BI Service.
First off, let’s define the three categories of real-time datasets which are designed for display on real-time dashboards. These are not specific options you will find in Power BI, rather, the choices you make when building the streaming datasets will result in the dataset fitting into one of these categories:
Push dataset – data is pushed into the Power BI service. When the dataset is created, the Power BI service automatically creates a new database in the service to store the data. Since there is an underlying database that continues to store the data as it comes in (up to 5M rows per table), reports can be created with the data. These reports and their visuals are just like any other report visuals, which means you can use all of Power BI’s report building features to create visuals, including custom visuals, data alerts, pinned dashboard tiles, and more. Once a report is created using the push dataset, any of its visuals can be pinned to a dashboard. On that dashboard, visuals update in real-time whenever the data is updated. Within the service, the dashboard is triggering a tile refresh every time new data is received. The push dataset is a special case of the streaming dataset in which you enable Historic data analysis in the Streaming data source configuration dialog
Streaming dataset – data is also pushed into the Power BI service, with an important difference: Power BI only stores the data into a temporary cache, which quickly expires. The temporary cache is only used to display visuals which have some transient sense of history, such as a line chart that has a time window of one hour (there is currently no way to clear data from a streaming dataset, though the data will clear itself after an hour). With a streaming dataset, there is no underlying database and therefore limited history, so you cannot build report visuals using the data that flows in from the stream. As such, you cannot make use of report functionality such as filtering, custom visuals, and other report functions. The only way to visualize a streaming dataset is, while editing a dashboard, choose “Add tile” and choose “Custom Streaming Data” under “Real-time Data” and then choose the streaming dataset. The custom streaming tile that is based on a streaming dataset is optimized for quickly displaying real-time data. There is very little latency between when the data is pushed into the Power BI service and when the visual is updated, since there’s no need for the data to be entered into or read from a database. In practice, streaming datasets and their accompanying streaming visuals are best used in situations when it is critical to minimize the latency between when data is pushed and when it is visualized (they update on change, meaning that if your data changes every second, so will the tiles). In addition, it’s best practice to have the data pushed in a format that can be visualized as-is, without any additional aggregations. Examples of data that’s ready as-is include temperatures, and pre-calculated averages. You disable Historic data analysis in the Streaming data source configuration dialog to create a Streaming dataset (but you can always change this afterwards to switch to a push dataset)
PubNub streaming dataset – the Power BI web client uses the PubNub SDK to read an existing PubNub data stream, and no data is stored by the Power BI service. As with the streaming dataset, there is no underlying database in Power BI, so you cannot build report visuals against the data that flows in, and cannot take advantage of report functionality such as filtering, custom visuals, and so on. As such, the PubNub streaming dataset can also only be visualized by adding a tile to the dashboard, and configuring a PubNub data stream as the source. Tiles based on a PubNub streaming dataset are optimized for quickly displaying real-time data. Since Power BI is directly connected to the PubNub data stream, there is very little latency between when the data is pushed into the Power BI service and when the visual is updated. PubNub is a third-party data service (http://pubnub.com)
To clarify, using report visuals (e.g. a line chart), which requires Historic data analysis to be turned on, gives you added functionality like filtering, but does not update as fast as tiles (in my testing, pushing data every 1-2 seconds will update 6-8 points on the report visual every 6-8 seconds). Using tiles with Historic data analysis turned on will update immediately (but tiles are limited to only showing the current value). Using tiles with Historic data analysis turned off did not result in a faster update, so it seems you should always turn it on (other than to save some storage space).
And tiles will update faster when using streaming datasets or PubNub datasets instead of push datasets. Note when using a tile, you can choose from five different visualization types: Card, Line chart, Clustered bar chart, Clustered column chart, and Gauge. These “real time tiles” will have a lightning bolt on the upper left of the tile when displayed on a dashboard.
There are three primary ways you can push data into a dataset (notice there is no need for you to create a database to handle the streaming data). Be aware that with these options you can also create the dataset itself:
Using the Power BI REST APIs – Can be used to create and send data to push datasets and to streaming datasets. Once a dataset is created, use the REST APIs to push data using the PostRows API
Using the Streaming Dataset UI – In the Power BI Service, choose “Streaming dataset” from the Create menu, choose the streaming dataset type of “API”, then configure the values to be used in the stream. Select Create, and you will be given a Push URL (REST API URL endpoint). Then create an application (i.e. C# in Azure Functions) that uses POST requests to the Push URL to push the data. Another option is to choose the streaming dataset type of PubNub and follow the same instructions
Using Azure Stream Analytics – You can add Power BI as an output within Azure Stream Analytics (ASA), which uses the Power BI REST APIs to create its output data stream to Power BI. ASA creates the dataset which stores 200,000 rows, and after that limit is reached, rows are dropped in a first-in first-out (FIFO) fashion
As I see a huge number of customers migrating their on-prem databases to the Azure cloud, the main question they ask is whether they should go with an IaaS solution (SQL Server in a VM) or a PaaS solution (SQL Database). Because SQL Database MI (Managed Instance) has near 100% compatibility with on-premises SQL Server (it supports SQL Agent, VNET, cross-database queries, CLR, replication, CDC, Service Broker – see Azure SQL Database Features), I’m seeing a large majority of them go with MI.
The only reasons to go with IaaS:
You need control over / access to the operating system
You need full control over the database engine. You can choose when to start maintenance/patching, change the recovery model to simple or bulk-logged, pause or start the service when needed, and you can fully customize the SQL Server database engine (but with this additional control comes the added responsibility to manage the virtual machine)
You have to run an app or agent side-by-side with the database
You need one of the few features MI does not support, such as Filestream, Filetable, or linked servers to non-SQL Server sources (see Azure SQL Database Features). You can use the Database Migration Assistant to see if there is anything in your database that is not supported. So if you do have a database that uses an unsupported feature, land that database in a VM and place the others in MI
You have a 3rd-party database that the vendor has not tested on MI. Even though it may work fine, the vendor would not provide support until they give it a stamp of approval
Performance (80 cores is the max supported by MI) or storage (MI supports a database max of 8TB)
Cost (out of scope for this blog)
Below I’ll list the major benefits of a PaaS solution over IaaS:
No VMs (virtual machines). How great is that! You never have to remote into a server anymore or manage it in any way. Everything is done via the Azure portal and tools like SSMS. Of course there is a VM somewhere that hosts the databases, but you are completely abstracted away from that
No patching or upgrading. I was a DBA for many years, and one of the biggest pains was having to patch or upgrade SQL Server, the OS, drivers, etc. Having to set up a test environment, then knocking people off the servers on a weekend to upgrade and hoping not to run into any problems. Those days are gone! MI is patched with no downtime (see Hot patching SQL Server Engine in Azure SQL Database)
You get database backups out of the box. As soon as you create a database, Azure automatically starts backing up the database and you can do a point-in-time restore and even restore a deleted database. No more using SQL Agent or a 3rd-party product to setup and monitor the backups
Simplified disaster recovery (DR). Availability Groups are a pain to set up and monitor. With MI, instance failover groups basically just require you to pick the area of the country you want DR in, and Azure takes care of setting it up and making sure it keeps working
You get the latest version of SQL Server. Think of it as SQL Server 2019+. New features are added to MI every few weeks – you can choose to use them or not. Every couple of years those features will be gathered up and added to the boxed version, but you get them right away with MI. Database compatibility levels exist so your code won’t break when there is an upgrade
Built-in Advanced threat detection that detects anomalous activities indicating unusual and potentially harmful attempts to access or exploit databases
Built-in Vulnerability Assessment that helps you discover, track, and remediate potential database vulnerabilities
Built-in Data Discovery & Classification that provides advanced capabilities built into Azure SQL Database for discovering, classifying, labeling & reporting the sensitive data in your databases
Migrating your database can be as simple as backing up your on-prem database, copying the .bak to Azure storage, and restoring it to a SQL Database MI database (a minimal T-SQL sketch is below). Check out the Database Migration Service (DMS) for help in the migration. Also check out the Azure Database Migration Guide.
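As a hedged illustration of that backup/copy/restore flow (the storage account, container, and SAS token are placeholders), the final step on the managed instance is a native RESTORE FROM URL:

```sql
-- On the managed instance: create a credential that points at the blob
-- container holding the .bak file (placeholder URL and SAS token).
CREATE CREDENTIAL [https://mystorageaccount.blob.core.windows.net/backups]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET   = '<SAS-token-without-the-leading-question-mark>';

-- Restore the on-prem backup that was copied to Azure blob storage.
RESTORE DATABASE MyOnPremDb
FROM URL = 'https://mystorageaccount.blob.core.windows.net/backups/MyOnPremDb.bak';
```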
A common topic I have been discussing recently with customers is the security around Power BI. Basically, how to prevent users seeing data they shouldn’t. So I’ll discuss the various “layers” of security.
The Power BI service architecture is based on two clusters – the Web Front End (WFE) cluster and the Back-End cluster. The WFE cluster manages the initial connection and authentication to the Power BI service, and once authenticated, the Back-End handles all subsequent user interactions. Power BI uses Azure Active Directory (AAD) to store and manage user identities, and manages the storage of data (in Azure Blob storage) and metadata (in Azure SQL Database), both using encryption at rest.
Power BI also leverages the Azure Traffic Manager (ATM) for directing traffic to the nearest WFE, based on the DNS record of the client, for the authentication process and to download static content and files. Power BI uses the Azure Content Delivery Network (CDN) to efficiently distribute the necessary static content and files to users based on geographical locale (see Power BI Security).
User Authentication: Power BI uses AAD to authenticate users who sign in to the Power BI service, and in turn, uses the Power BI login credentials whenever a user attempts to access any resources that require authentication. Users sign in to the Power BI service using the email address used to establish their Power BI account.
Power BI workspaces and apps: You can publish content from Power BI Desktop into Power BI workspaces, which are collections of dashboards, reports, workbooks, datasets, and dataflows. You can then add security groups, distribution lists, Office 365 groups, or individuals to the workspaces and assign users their roles and privileges as either viewer, contributor, member, or admin. You have the option to bundle that collection into a neat package called an app and distribute/publish it to your whole organization, or to specific people or groups. Only dashboards, reports, and workbooks are part of the bundle, and you choose which ones you want to publish via the “INCLUDED IN APP” option. You can also allow app users to connect to the app’s underlying datasets by giving them Build permission. They’ll see these datasets when they’re searching for shared datasets. You typically start the process of creating an app within workspaces, where you can collaborate on Power BI content with your colleagues, and then publish the finished apps to a large group of people in your organization. Apps make it easier to manage permissions on these collections.
Row-Level Security: With row-level security (RLS) you are given the ability to publish a single report to your users but expose the data differently to each person. So instead of creating multiple copies of the same report in order to limit the data, you can just create one report that will only show the data the logged-in user is allowed to see. This is done with filters, which restrict data access at the row level, and you define filters within roles. For example, you might create a role called “United States” that filters the data in a table where the Region = “United States”. You then add members (user, security group, or distribution list) who should only see data for the United States to the “United States” role (the assignment of members can only be done within the Power BI Service). If a user should not have access to a report, then just don’t include that person in any of the roles for that report, and they will always see a blank report.
Auditing: Knowing who is taking what action on which item in your Power BI tenant can be critical in helping your organization fulfill its requirements, like meeting regulatory compliance and records management. With Power BI, you have two options to track user activity: The Power BI activity log and the unified Office 365 audit log (differences listed here). These logs both contain a complete copy of the Power BI auditing data so you can view exhaustive logs of all Power BI activities.
There are many ways to share the dashboards, reports, and datasets that you create in Power BI. Below I’ll compare all such options (there are twelve!).
First make sure to have an understanding about Power BI security (which I blogged about here), and be aware that no matter which option you choose below, to share your content you need a Power BI Pro license or the content needs to be in a Premium capacity.
Collaborate in a workspace: You can publish content from Power BI Desktop into Power BI workspaces, which are collections of dashboards, reports, workbooks, datasets, and dataflows. You can then add security groups, distribution lists, Office 365 groups, or individuals to the workspaces and assign users their roles and privileges as either viewer, contributor, member, or admin. This allows collaboration among your coworkers on everything within the workspace. You might naturally put content in your My Workspace (a personal workspace that only you have access to) and share it from there (via the options below). But workspaces are better for collaboration than My Workspace, because others can access them and they allow co-ownership of content. You and your entire team can easily make updates or give others access. My Workspace is best used by individuals for one-off or personal content. More info at Create the new workspaces in Power BI.
Distribute insights in an app: You can take selected items in a workspace and bundle them into a neat package called an app and distribute/publish it to your whole organization, or to specific people or groups. Only dashboards, reports, and workbooks are part of the bundle, and you choose which ones you want to publish via the “INCLUDED IN APP” option. You can send your business users a direct link to the app, or they can search for it in Microsoft AppSource. After they install an app, they can view it in their browser or mobile device (you can publish apps to people outside your organization via template apps, described below). More info at Publish an app in Power BI.
Subscribe yourself and others: You can subscribe yourself and your colleagues to report pages, dashboards, and paginated reports via the Power BI service. Power BI emails a snapshot of the report page or dashboard with a link to open the report or dashboard. It can be setup on a schedule. More info at Subscribe yourself and others to reports and dashboards in the Power BI service.
Annotate and share from the Power BI mobile apps: In a Power BI mobile app for iOS or Android devices, you can annotate a tile, report, or visual and then share it with anyone via email. You’re sharing a snapshot of the tile, report, or visual, and your recipients see it exactly as it was when you sent the mail. The mail also contains a link to the dashboard or report. More info at Annotate and share a tile, report, or visual in Power BI mobile apps.
Embed a report in Microsoft Teams: You can add separate Power BI tabs for each individual report for your colleagues to view and then comment on in a conversation window within Teams (when you add a Power BI report tab to Teams, Teams automatically creates a tab conversation to accompany the report). More info at Embed report with the Power BI tab for Microsoft Teams.
Embed reports in Sharepoint Online: With Power BI’s new report web part for SharePoint Online, you can easily embed interactive Power BI reports in SharePoint Online pages. Power BI enforces all permissions and data security before users can see content. The person viewing the report needs the appropriate license. More info at Embed with report web part in SharePoint Online.
Embed a report in a secure portal or website: Easily embed reports in internal web portals such as SharePoint 2019 or a website or blog using a link or HTML. Power BI enforces all permissions and data security before users can see content. The person viewing the report needs the appropriate license. More info at Embed a report in a secure portal or website.
Publish from Power BI to the public web: Easily embed interactive Power BI visualizations online, such as in blog posts and websites using HTML, or via a link you can send in emails or via social media. You can also easily edit, update, refresh, or stop sharing your published visuals. Anyone on the Internet can view your reports, and you have no control over who can see what you’ve published. They don’t need a Power BI license. More info at Publish to web from Power BI.
Share a dataset: This option allows your whole organization to benefit from using the same well-designed data models and provide ‘one source of truth’. Dataset creators can control who has access to their data by using the Build permission. Dataset creators can also certify or promote datasets so others know which datasets are high quality and official. When you open the dataset catalog in Power BI Desktop or Power BI Service, it shows datasets that are in your My Workspace and in other workspaces that have shared them. More info at Intro to datasets across workspaces (Preview).
I see a lot of confusion among many people on what features are available today in Azure Synapse Analytics (formerly called Azure SQL Data Warehouse) and what features are coming in the future. Below is a picture (click to zoom), described below, that hopefully clears things up:
So, if you log into the Azure portal today and use Synapse Analytics, you are using the GA version and nothing is different – it’s simply a name change from SQL DW. For the purposes of this picture we will call the GA version “v1” (the v1, v2, and v3 in my diagram are not in any way officially part of the product naming and are just used in this blog). This v1 will get certain new features over time (shown on the left) that I blogged about here.
In private preview are other new features, shown as v2 in my diagram. If you join the private preview, your Azure subscription will be whitelisted and you will have these new features available to you. You will not see these new features unless you are accepted into the private preview. The major new features in v2 include Azure Synapse Studio (a single pane of glass that uses workspaces to access databases, ADLS Gen2, ADF, Power BI, Spark, SQL Scripts, notebooks, monitoring, security), Apache Spark, on-demand T-SQL, and T-SQL over ADLS Gen2.
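As a taste of the on-demand T-SQL over ADLS Gen2 feature (the storage account, file system, and folder below are placeholders), a serverless query can read files in the data lake directly with OPENROWSET:

```sql
-- Serverless (on-demand) T-SQL over Parquet files sitting in ADLS Gen2:
-- nothing is provisioned ahead of time, and you pay per data processed.
SELECT TOP 10 *
FROM OPENROWSET(
         BULK 'https://mydatalake.dfs.core.windows.net/myfilesystem/sales/*.parquet',
         FORMAT = 'PARQUET'
     ) AS sales;
```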
Much further down the road will be “Gen3”, or v3 in my diagram. The biggest feature in that version will be Multi-master cluster, where user workloads can operate over the same shareable relational data set while having independent clusters to serve those various workloads. This allows for very high concurrency. This was demo’d at Ignite by Rohan Kumar showing 10k concurrent queries (video at 0:28).
Any of the features that get added to the v1 version will also be added to v2 and v3.
About a year ago Microsoft added a way to use SSMS without using a VNET (announcement) by allowing you to enable a public endpoint for your SQLMI. This made it easy for me to access a SQLMI database from my laptop.
I am posting this blog because I found a lot of people don’t know this (it’s not mentioned under “Quick start” when you are in the Azure portal for the SQL managed instance). Plus there are a couple of quirks to be aware of to get it to work.
The directions to set this up are at Configure public endpoint in Azure SQL Database managed instance. The part easily missed is at the end where the article explains the connection string. This is where you will find the server name to use in SSMS to login to your SQLMI database. It will look like serrami.public.c23ca324af.database.windows.net,3342. Notice the “public” in the name and the “,3342” at the end.
When you created your Azure SQL Database Managed Instance, you were prompted to create an admin login and password, as the screenshot below shows. This info will also be used to login via SSMS.
So, to login via SSMS, gather the server name and admin login/password mentioned above, choose to connect to a database engine in SSMS, and then you will enter that info into the connection screen similar to the one below, using SQL Server Authentication. Again, note the “public” in the server name and the “,3342” at the end of the server name:
The result is you connect to SQLMI without a VNET. Hope this helps!
Just announced is Query Acceleration for Azure Data Lake Storage Gen2 (ADLS) as well as Blob Storage. This is a new capability for ADLS that enables applications and analytics frameworks to dramatically optimize data processing by retrieving only the data that they require to perform a given operation from storage. This reduces the time and processing power that is required to query stored data.
For example, if an application executes a SELECT statement that filters columns and rows from a csv file, instead of pulling the entire csv file over the network into the application and then filtering the data, the filtering is done at the time the data is read from the disk, so that only the filtered data is transferred over the network to the application. So if you have a csv file with 50 columns and 1 million rows, but the filters limit the data to 5 columns and 1000 rows, then only those 5 columns and 1000 rows will be retrieved from the disk and sent over the network to the application.
It accomplishes this by pushing down predicates and column projections so they can be applied at the time data is first read, enabling applications to filter rows and columns as data is read from disk so that all downstream data handling is saved the cost of filtering and processing unrequired data. This reduces network latency and compute cost (an analysis showed that 80% of data is needlessly transferred across the network, parsed, and filtered by applications). Also, the CPU load required to parse and filter unneeded data forces your application to provision a greater number of larger VMs in order to do its work. By transferring this compute load to query acceleration, applications can realize significant cost savings.
You can use SQL to specify the row filter predicates and column projections in a Query Acceleration request. A request processes only one file. Therefore, advanced relational features of SQL, such as joins and the GROUP BY aggregate, aren’t supported. Query acceleration supports CSV and JSON formatted data as input to each request.
The following diagram illustrates how a typical application uses Query Acceleration to process data:
The client application requests file data by specifying predicates and column projections
Query Acceleration parses the specified query and distributes work to parse and filter data
Processors read the data from the disk, parse the data by using the appropriate format, and then filter data by applying the specified predicates and column projections. Azure Storage already consists of a non-trivial amount of compute to implement all of the storage functionality (e.g. serving requests, encrypting/decrypting, attaching JBODs, etc.). Query Acceleration is simply allocated a quota of this resource to do its various jobs
Query Acceleration combines the response shards to stream back to the client application
The client application receives and parses the streamed response. The application doesn’t need to filter any additional data and can apply the desired calculation or transformation directly
Query acceleration supports an ANSI SQL-like language for expressing queries over blob contents. The query acceleration SQL dialect is a subset of ANSI SQL, with a limited set of supported data types, operators, etc., but it also expands on ANSI SQL to support queries over hierarchical semi-structured data formats such as JSON.
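As a small hedged example of that dialect (the column names are illustrative, and assume the CSV has a header row): only the projected columns and matching rows come back over the network:

```sql
-- Filter a CSV blob server-side. With a header row, columns can be referenced
-- by name; without one they are referenced by position (_1, _2, ...).
SELECT Region, SalesAmount
FROM BlobStorage
WHERE Region = 'US'
```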
Right now this works via Java and .NET, and in the future it will work for such tools as Python and Azure Synapse Analytics (Microsoft is actively working with OSS and commercial partners to integrate Query Acceleration into these frameworks, as appropriate).
Due to the increased compute load within the ADLS service, the pricing model for using query acceleration differs from the normal ADLS transaction model. Query acceleration charges a cost for the amount of data scanned as well as a cost for the amount of data returned to the caller.
Despite the change to the billing model, Query acceleration’s pricing model is designed to lower the total cost of ownership for a workload, given the reduction in the much more expensive VM costs.
To find out more about Query Acceleration for Azure Data Lake Storage, check out the documentation.
There are a number of options for monitoring Power BI that I wanted to mention:
Performance analyzer: Find out how each of your report elements, such as visuals and DAX formulas, are performing. Using the Performance Analyzer, you can see and record logs that measure how each of your report elements performs when users interact with them, and which aspects of their performance are most (or least) resource intensive. This is accomplished by clicking a “Start recording” button and interacting with the elements you want to test. More info at Use Performance Analyzer to examine report element performance.
Usage Metrics: Enables you to discover how your reports are being used throughout your organization and who’s using them. You can also identify high-level performance issues. To run, from the workspace content list, open the context menu of the report and select View usage metrics report. Alternatively, open the report, then open the context menu on the command bar, and then select Usage metrics. More info at Monitor usage metrics in the new workspace experience. Please be aware of the upcoming Admin UI tenant usage metrics improvement in June – see Admin Usage Metrics.
Query Diagnostics: Allows you to determine what Power Query is doing during authoring time. It allows you to understand what sort of queries you are emitting, what slowdowns you might run into during authoring refresh, and what kind of background events are happening. To run, in the Power Query Editor ribbon choose ‘Start Diagnostics’ or ‘Diagnose Step’ under the Tools menu. More info at Query Diagnostics.
Monitor Premium capacities: For a higher level overview of average use metrics over the last seven days for Power BI Premium, you can use the Admin portal. To learn more about monitoring in the portal, see Monitor Premium capacities in the Admin portal.
Microsoft Cloud App Security real time user activity monitoring: Help protect your Power BI reports, data, and services from unintended leaks or breaches. With Cloud App Security, you create conditional access policies for your organization’s data, using real-time session controls in Azure Active Directory (Azure AD), that help to ensure your Power BI analytics are secure. Once these policies have been set, administrators can monitor user access and activity, perform real-time risk analysis, and set label-specific controls. See Using Microsoft cloud app security controls in Power BI (preview)
A few data platform announcements yesterday at Microsoft Build that I wanted to blog about.
The biggest one is that Azure Synapse Analytics is now available in public preview! You can immediately log into your Azure portal and use it. While in the Azure portal, search for “Synapse” and you will see “Azure Synapse Analytics (workspaces preview)”. Choose that and then click “Create Synapse workspace” (you may first need to register the resource provider “Microsoft.Synapse” in your subscription – see Azure resource providers and types).
Check out the full documentation. Note on the home page of the Synapse workspace, under “Useful links”, there is a “Getting started” link that has an option to “Query sample data” that creates a new SQL pool for you and loads sample data into it. It also provides sample scripts so you can start querying the data (there is not much sample data or many sample queries yet).
I also have a large PowerPoint deck on an overview of Azure Synapse Analytics that you may find useful.
Another announcement was Azure Synapse Link (see What is Azure Synapse Link for Azure Cosmos DB (Preview)?). This allows you to take your Azure Synapse Analytics and point it directly at your operational database and do T-SQL queries against it without having to copy the data to Synapse. This means you can do real-time analytics without impacting your online or operational systems. This is especially important when you are talking about lots of data at big scale. Sometimes this is referred to as hybrid transactional-analytical processing (HTAP). See Azure Analytics: Clarity in an instant.
Also announced was that the Azure SQL Edge product is now available in public preview. Announced at last year’s Build conference under the name “Azure SQL Database Edge,” the product is a version of Azure SQL Database that can run on small edge devices, including those based on ARM processors. SQL Edge also integrates a specially-implemented version of Azure Stream Analytics.
With SQL Edge’s public preview, this now means Microsoft’s T-SQL language works at the edge, on-premises and in the cloud, on relational and NoSQL operational data, on the data warehouse, on the data lake and in HTAP implementations.
Blob index preview: Recently announced in preview, blob index is a managed secondary index that allows you to store multi-dimensional object attributes to describe your data objects for Azure Blob storage. This allows you to categorize and find data based on attribute tags set on the data. Cool! To populate the blob index, you define key-value tag attributes on your data, either on new data during upload or on existing data already in your storage account. These blob index tags are stored alongside your underlying blob data. The blob indexing engine then automatically reads the new tags, indexes them, and exposes them to a user-queryable blob index. Blob Index not only helps you categorize, manage, and find your blob data but also provides integrations with other Blob service features, such as Lifecycle management, allowing you to move data to cooler tiers or delete data based on the tags applied to your blobs.
The below scenario is an example of how Blob Index works (a code sketch follows the steps):
In a storage account container with a million blobs, a user uploads a new blob “B2” with the following blob index tags: < Status = Unprocessed, Quality = 8K, Source = RAW >
The blob and its blob index tags are persisted to the storage account and the account indexing engine exposes the new blob index shortly after
Later on, an encoding application wants to find all unprocessed media files that are at least 4K resolution quality. It issues a FindBlobs API call to find all blobs that match the following criteria: < Status = Unprocessed AND Quality >= 4K AND Source = RAW >
The blob index quickly returns just blob “B2,” the sole blob out of one million blobs that matches the specified criteria. The encoding application can quickly start its processing job, saving idle compute time and money
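Here is a minimal code sketch of that flow, assuming the tag APIs in the azure-storage-blob Python SDK (the connection string, container, blob name, and tag values are hypothetical):

```python
# Minimal sketch of the blob index scenario above, assuming the azure-storage-blob
# SDK's tag APIs; all names and values are hypothetical.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="media", blob="B2")

# Upload the blob and set its index tags (tags can also be supplied at upload time).
blob.upload_blob(b"...raw media bytes...", overwrite=True)
blob.set_blob_tags({"Status": "Unprocessed", "Quality": "8K", "Source": "RAW"})

# Query the account-wide index; note that tag values compare as strings.
matches = service.find_blobs_by_tags(
    "\"Status\" = 'Unprocessed' AND \"Quality\" >= '4K' AND \"Source\" = 'RAW'"
)
for match in matches:
    print(match.container, match.name)
```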
Geo-Zone-Redundant Storage (GZRS): GZRS and Read-Access Geo-Zone-Redundant Storage (RA-GZRS) are now generally available. GZRS writes three copies of your data synchronously across multiple Azure Availability zones, similar to Zone redundant storage (ZRS), providing you continued read and write access even if a datacenter or availability zone is unavailable. In addition, GZRS asynchronously replicates your data to the secondary geo pair region to protect against regional unavailability. RA-GZRS exposes a read endpoint on this secondary replica allowing you to read data in the event of primary region unavailability. To learn more, see Azure Storage redundancy.
Account failover: Customer-initiated storage account failover is now generally available, allowing you to determine when to initiate a failover instead of waiting for Microsoft to do so. When you perform a failover, the secondary replica of the storage account becomes the new primary. The DNS records for all storage service endpoints—blob, file, queue, and table—are updated to point to this new primary. Once the failover is complete, clients will automatically begin reading and writing data to the storage account in the new primary region, with no code changes. Customer-initiated failover is available for GRS, RA-GRS, GZRS and RA-GZRS accounts. To learn more, see Disaster recovery and account failover.
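As a rough sketch of what a customer-initiated failover could look like, assuming the azure-mgmt-storage Python SDK (resource group and account names are hypothetical):

```python
# Rough sketch of initiating a customer-managed account failover, assuming the
# azure-mgmt-storage management SDK; resource names are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The secondary region becomes the new primary; the DNS records for the
# blob/file/queue/table endpoints are updated once the operation completes.
poller = client.storage_accounts.begin_failover("my-rg", "mystorageaccount")
poller.wait()
```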
Versioning preview: Versioning automatically maintains prior versions of an object and identifies them with version IDs. You can restore a prior version of a blob to recover your data if it is erroneously modified or deleted. A version captures a committed blob state at a given point in time. When versioning is enabled for a storage account, Azure Storage automatically creates a new version of a blob each time that blob is modified or deleted. Versioning and soft delete work together to provide you with optimal data protection. To learn more, see Blob versioning.
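A minimal sketch of working with versions, assuming versioning is enabled on the account and using the azure-storage-blob Python SDK (the container, blob, and version ID are hypothetical):

```python
# Minimal sketch: list blob versions and promote an older version by copying it
# over the current blob. A SAS may be needed on the copy source URL depending on
# how the request is authorized.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("documents")

# Enumerate versions alongside the current blobs.
for props in container.list_blobs(include=["versions"]):
    print(props.name, props.version_id)

# "Restore" a prior version by copying it over the current blob.
blob = container.get_blob_client("report.csv")
blob.start_copy_from_url(f"{blob.url}?versionid=<version-id>")
```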
Point in time restore preview: Point in time restore for Azure Blob Storage provides storage account administrators the ability to restore a subset of containers or blobs within a storage account to a previous state. This can be done by an administrator to a specific past date and time in the event of an application corrupting data, a user inadvertently deleting contents, or a test run of a machine learning model. Point in time restore makes use of Blob Change feed, currently in preview. Change feed enables recording of all blob creation, modification, and deletion operations that occur in your storage account. To learn more, see Point in time restore.
Routing preferences preview: Configure a routing preference to direct network traffic for the default public endpoint of your Storage account using the Microsoft global network or using the public internet. Optimize for premium network performance by using the Microsoft global network, which delivers low-latency path selection with high reliability and routes traffic through the point-of-presence closest to the client. Alternatively, route traffic through the point-of-presence closest to your storage account to lower network costs and minimize traversal over the Microsoft global network. Routing configuration options for your Storage account also enable you to publish additional route-specific endpoints. Use these new public endpoints to override the routing preference specified for the default public endpoint by explicitly routing traffic over a desired path. Learn more.
Object replication preview: Object replication is a new capability for block blobs that lets you asynchronously replicate your data from your blob container in one storage account to another anywhere in Azure. Object replication unblocks a new set of common replication scenarios:
Minimize latency – have your users consume the data locally rather than issuing cross-region read requests
Increase efficiency – have your compute clusters process the same set of objects locally in different regions
Optimize data distribution – have your data consolidated in a single location for processing/analytics and then distribute only resulting dashboards to your offices worldwide
Minimize cost – tier down your data to Archive upon replication completion using lifecycle management policies to minimize the cost
Power BI has become hugely popular and I hear a lot of common questions about its functionality and features, so I thought I would put some of those questions, with answers, in this blog:
What is the best way to organize workspaces?
To improve your Power BI implementation, it is highly recommended to separate datasets from reports by having separate pbix files for each, resulting in a “data workspace” and a “reporting workspace”. The data workspace should have permissions so only certain people (i.e. data modelers) can edit the data, while giving broader permissions to people (i.e. report developers) for the reporting workspace. Power BI row-level security will still be invoked for report developers even if they don’t have edit rights on the dataset (they just need read and build permissions). Make sure to use auditing to track who is using the workspaces. A few downsides: pinning to a dashboard from Q&A or Quick Insights is no longer a valid option because you can’t choose a dashboard in another workspace. Also, if your company has a huge investment in O365 groups, currently the build permission can’t be set on an O365 group.
Using a dataset for multiple reports is called a ‘golden dataset’ or the ‘hub and spoke’ method. To do this, when creating a new report in Power BI Desktop, rather than importing data into the .pbix file, instead use “Get Data” to make a live connection to an existing Power BI dataset. It’s a good idea to use report pages in the data workspace to describe the datasets. For more info, check out Melissa Coates blog and video at 5 Tips for Separating Power BI Datasets and Reports [VIDEO].
Should we have dev, test, and prod workspaces?
Yes! You should use change management to move reports through the dev/test/prod workspace tiers via the new deployment pipelines in Power BI. Use the workspaces to collaborate on Power BI content with your colleagues, while distributing the report to a larger audience by publishing an app. You should also promote and certify your datasets. The reports and datasets should have repeatable test criteria.
When you publish an app, is it always in the same dedicated capacity as the workspace? Wondering if you could have a workspace in dedicated capacity “A” and publish the app to dedicated capacity “B” (if you do not want people hammering the workspace to cause performance issues with people using the app).
An App is tied to the content stored in a workspace. That content is physically located in Azure storage (the data model) and its metadata is stored in Azure SQL. These details are covered in the security whitepaper. In fact, in Power BI Embedded you can use the app GUID and workspace GUID interchangeably; GUIDs are logical content groupings. Capacity backend cores and memory are used to process data in use – this is the only real physical relationship with the capacity. You should look into shared datasets: shared datasets can reside in any workspace, so they can be moved to any available capacity. Frontend cores are shared per cluster, so there are no front-end load benefits. Workspace movement between capacities is instantaneous (assuming they are in the same data center).
Can you share datasets across workspaces that reside in different capacities? And in a deployment pipeline, can you have dev/test/prod in workspaces that are in different capacities?
Yes, you can change a workspace to a different capacity via settings on the deployment pipeline screen.
How can you know about datasets that you don’t have access to? It would make sense to be able to search for a dataset and get a result back that the dataset exists but you need to request permission to use it.
This is not supported. It could possibly be a use case for Azure Data Catalog.
What are the differences between Power Query in Power BI Desktop vs. the Power BI service (dataflows)?
The main difference is around available connectors (~125 in Power Query Desktop vs. ~50 in Power Query Online).
Can a deleted report or dashboard be recovered?
For a deleted report, there is a 3rd party product that can recover it: Power BI Sentinel. For a deleted dashboard: no. A tenant admin can recover deleted workspaces, but not individual artifacts. You can raise a support ticket with Microsoft Support.
How can I get an alert when a premium capacity hits a performance metric (e.g. when the CPU of a premium capacity hits 100% or a report takes more than a minute to execute)?
When a Power BI Premium capacity is experiencing extended periods of high resource use that potentially impact reliability, a notification email is automatically sent. Examples of such impacts include extended delays in operations such as opening a report, dataset refreshes, and query executions. More info at https://docs.microsoft.com/en-us/power-bi/admin/service-interruption-notifications#capacity-and-reliability-notifications. Coming in July is a public preview of Azure Monitor integration that will allow customers to connect their Power BI environment to pre-configured Azure Log Analytics workspaces. This provides long-term data storage, retention policies, ad hoc query capability, and the ability to analyze the log data directly from Power BI (see Azure Monitor integration).
Why would you restart a Power BI Premium capacity?
Users can cause performance issues by overloading the Power BI service with jobs, writing overly complex queries, creating circular references, and so on, that can consume all of the resources available on the capacity. You need the ability to mitigate significant issues when they occur. The quickest way to mitigate these issues is to restart the capacity. More details at https://docs.microsoft.com/en-us/power-bi/admin/service-admin-premium-restart.
Can I prompt for parameters in the Power BI Service?
Is it possible to change the region selection of a premium capacity after it is created?
This is not possible. Workarounds are:
Create a second capacity and move the workspaces. Free users won’t experience any downtime as long as the tenant has spare v-cores.
If creating a second capacity isn’t an option, you can temporarily move the content back to shared capacity from Premium. You don’t need extra v-cores, but free users will experience some downtime. Then create the premium capacity in the new region and move the workspaces from shared to that premium capacity.
Is there a way to see the permissions given to a published app?
Yes, go to the Apps screen and choose “Edit app”. Then click “Update app” and go to the “Permissions” tab to see the permissions previously given to the app. There is not a way to see the permissions programmatically, although you can use the Activity Logs to review when people are added (a sketch of querying the activity log via the REST API is below).
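For the activity log piece, here is a rough sketch of pulling activity events via the Power BI REST admin API, assuming you already have an Azure AD access token with the required Power BI admin permissions (the date range is just an example):

```python
# Rough sketch of querying the Power BI activity log (Get Activity Events admin API).
# Token acquisition is out of scope and shown as a placeholder.
import requests

token = "<aad-access-token>"
url = (
    "https://api.powerbi.com/v1.0/myorg/admin/activityevents"
    "?startDateTime='2020-05-14T00:00:00Z'&endDateTime='2020-05-14T23:59:59Z'"
)

resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
for event in resp.json().get("activityEventEntities", []):
    print(event.get("Activity"), event.get("UserId"), event.get("CreationTime"))
```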
How can I see what new features are on the roadmap?
Check out the Power Platform: 2020 release wave 1 plan (last updated May 14, 2020). The Power Platform release plan (formerly release notes) for the 2020 release wave 1 describes all new features releasing from April 2020 through September 2020 for Power BI, Power Apps, Power Automate, AI Builder, Power Virtual Agents, and Common Data Model and Data Integration. You can either browse the release plan online, download the document as a PDF file, or view it via the Power BI Release Wave.
I discovered a small new feature the other day, but a very useful one. Previously, there was no way to upload files to an ADLS Gen2 storage account via the Azure portal. You had to use Azure Storage Explorer – if you hit the Upload button via the “Storage Explorer (preview)” in the Azure Portal, it told you to download Azure Storage Explorer.
But now, when on the Azure Portal, you can upload a file by choosing “Containers” from the overview blade or choosing “Containers” under “Data Lake Storage”, selecting a container, and using the “Upload” button. Note that you can upload multiple files at once and specify their authentication type, block size, and access tier.
You can also change the tier of a file to Hot, Cool, or Archive, create a container, or create a folder/directory (Storage Explorer in the portal does not support changing the tier).
Also note in the containers section that next to each file the “…” has an option to “View/edit” the file, so you don’t have to download it to view it (unless the file size is over 2.1MB, which is the max supported by the editor). Among the types of files that can be viewed are csv, json, jpg, log, and avro, but not parquet. All files in the Storage Explorer in the portal must be downloaded to view.
I have updated my previous post Azure Storage tips to reflect these changes.
Archive tier is now GA: The archive tier provides an ultra-low cost tier for long term retention of data while keeping your data available for future analytics needs. Tier your data seamlessly among hot, cool, and archive so all your data stays in one storage account. Lifecycle management policies can be set so files are moved automatically to the archive tier when data access becomes rare. When needed, data in the archive tier can be quickly and easily rehydrated so that the data is available for your analytics workloads. More info.
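As a rough sketch of such a lifecycle policy, assuming the azure-mgmt-storage Python SDK (the rule name, prefix filter, 180-day threshold, and resource names are all hypothetical, and the exact SDK model shape may differ by version; the dict mirrors the lifecycle policy JSON schema):

```python
# Rough sketch of a lifecycle rule that tiers block blobs to Archive after 180 days
# without modification; assumes the azure-mgmt-storage SDK and is illustrative only.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

policy = {
    "policy": {
        "rules": [
            {
                "name": "archive-cold-data",
                "enabled": True,
                "type": "Lifecycle",
                "definition": {
                    "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["raw/"]},
                    "actions": {
                        "baseBlob": {
                            "tierToArchive": {"daysAfterModificationGreaterThan": 180}
                        }
                    },
                },
            }
        ]
    }
}

client.management_policies.create_or_update("my-rg", "mystorageaccount", "default", policy)
```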
Immutable storage (preview): Immutable storage provides the capability to store data in a write once, read many (WORM) state. Once data is written, the data becomes non-erasable and non-modifiable, and you can set a retention period so that files can’t be deleted until after that period has elapsed. Additionally, legal holds can be placed on data to make that data non-erasable and non-modifiable until the hold is removed. This preview is currently available in Canada Central, Canada East, France Central, and France South. To enroll in the preview, complete this form. More info.
File snapshots (preview): Use file snapshots to take an unlimited number of point-in-time snapshots of your files. These snapshots can be used to revert a file back to that snapshot in the case of accidental or inadvertent updates. Snapshots can also be retained so you can reference the content of a file at that point in time. File snapshots is currently available in preview in Canada Central, Canada East, France Central, and France South. To enroll in the preview, complete this form. More info.
Static website (preview): Use static website to directly host static content from Azure Data Lake Storage, and view that site content from a browser by using the public URL of that website. This preview is currently available in Canada Central, Canada East, France Central, and France South. To enroll in the preview, complete this form. More info.
Business continuity in Azure SQL Database and SQL Managed Instance refers to the mechanisms, policies, and procedures that enable your business to continue operating in the face of disruption, particularly to its computing infrastructure. In most cases, SQL Database and SQL Managed Instance will handle the disruptive events that might happen in the cloud environment and keep your applications and business processes running. From a database perspective, these are the major potential disruption scenarios, or disasters:
Data corruption or deletion, typically caused by an application bug or human error (a user accidentally deleted or updated a row in a table). Such failures are application-specific and typically cannot be detected by the database service
A malicious attacker succeeds in deleting data or dropping a database
A datacenter outage or a datacenter being temporarily disabled, possibly caused by a natural disaster such as an earthquake. This scenario requires some level of geo-redundancy with application failover to an alternate datacenter
Local hardware or software failures affecting the database node, such as a disk-drive failure
Upgrade or maintenance errors: unanticipated issues that occur during planned infrastructure maintenance or upgrades may require a rapid rollback to a prior database state
This overview describes the capabilities that SQL Database and SQL Managed Instance provide for business continuity and disaster recovery.
To mitigate local hardware and software failures, SQL Database includes a high availability architecture, which guarantees automatic recovery from these failures with an availability SLA of up to 99.995%.
SQL Database and SQL Managed Instance also provide several business continuity features that you can use to mitigate various unplanned scenarios:
Temporal tables enable you to restore row versions from any point in time
To protect your business from data loss, SQL Database and SQL Managed Instance automatically create full database backups weekly, differential database backups every 12 hours, and transaction log backups every 5 – 10 minutes (see Built-in automated backups). The backups are stored in RA-GRS storage for at least 7 days for all service tiers. All service tiers except Basic support a configurable backup retention period for point-in-time restore, up to 35 days. Point in Time Restore enables you to restore a complete database to some point in time within the configured retention period
You can restore a deleted database to the point at which it was deleted if the server has not been deleted.
For a corrupted database, you can create a new database from a backup on the same server, usually in less than 12 hours unless it is a very large or very active database (see database recovery time). You can also restore a database to another geographic region, called geo-restore, which is possible because backups are geo-replicated: each backup is automatically copied to an Azure blob in a different region (note there is a delay between when a backup is taken and when it is geo-replicated, so the restored database can be up to one hour behind the original database). This allows you to recover from a geographic disaster when you cannot access your database or backups in the primary region. Geo-restore creates a new database on any existing server or managed instance, in any Azure region. See Azure SQL Database and Backups
Long-term backup retention (LTR) enables you to keep the backups up to 10 years (this is in limited public preview for SQL Managed Instance). LTR allows you to restore an old version of the database by using the Azure portal or Azure PowerShell to satisfy a compliance request or to run an old version of the application
Active geo-replication enables you to create readable replicas and manually failover to any replica in case of a datacenter outage or application upgrade (see table below to compare with auto-failover groups). If you have an application that must be taken offline because of planned maintenance such as an application upgrade, check out Manage application upgrades which describes how to use active geo-replication to enable rolling upgrades of your cloud application to minimize downtime during upgrades and provide a recovery path if something goes wrong
Auto-failover groups allow the application to automatically recover in case of a datacenter outage (see table below to compare with active geo-replication)
Auto-failover groups simplify the deployment and usage of active geo-replication and add the additional capabilities as described in the following table:
Feature | Geo-replication | Failover groups
Automatic failover | No | Yes
Fail over multiple databases simultaneously | No | Yes
User must update connection string after failover | Yes | No
SQL Managed Instance support | No | Yes
Can be in same region as primary | Yes | No
Multiple replicas | Yes | No
Supports read-scale | Yes | Yes
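To make the auto-failover group setup described above concrete, here is a rough sketch of creating one, assuming the azure-mgmt-sql Python SDK (server, database, and group names are hypothetical and the exact model shape may differ by SDK/API version):

```python
# Rough sketch of creating an auto-failover group between a primary and a partner
# server, assuming the azure-mgmt-sql SDK; names and settings are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

sub = "<subscription-id>"
client = SqlManagementClient(DefaultAzureCredential(), sub)

partner_id = (
    f"/subscriptions/{sub}/resourceGroups/my-rg/providers/Microsoft.Sql"
    "/servers/my-secondary-server"
)
db_id = (
    f"/subscriptions/{sub}/resourceGroups/my-rg/providers/Microsoft.Sql"
    "/servers/my-primary-server/databases/mydb"
)

poller = client.failover_groups.begin_create_or_update(
    "my-rg",
    "my-primary-server",
    "my-fog",
    {
        "read_write_endpoint": {
            "failover_policy": "Automatic",
            "failover_with_data_loss_grace_period_minutes": 60,
        },
        "partner_servers": [{"id": partner_id}],
        "databases": [db_id],
    },
)
poller.result()
```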
Recovering a database
Some details if you need to recover a database due to the very rare case of an Azure datacenter having an outage:
One option is to wait for your database to come back online when the datacenter outage is over. This works for applications that can afford to have the database offline. When a datacenter has an outage, you do not know how long the outage might last, so this option only works if you don’t need your database for a while
Another option is to restore a database on any server in any Azure region using geo-restore, as explained above
Finally, you can quickly recover from an outage if you have configured either a geo-secondary using active geo-replication or an auto-failover group for your database or databases, as explained above. Depending on your choice of these technologies, you can use either manual or automatic failover. While the failover itself takes only a few seconds, the service will take at least 1 hour to activate it; this is necessary to ensure that the failover is justified by the scale of the outage. Also, the failover may result in small data loss due to the nature of asynchronous replication
As you develop your business continuity plan, you need to understand the maximum acceptable time before the application fully recovers after the disruptive event. The time required for application to fully recover is known as Recovery time objective (RTO). You also need to understand the maximum period of recent data updates (time interval) the application can tolerate losing when recovering from an unplanned disruptive event. The potential data loss is known as Recovery point objective (RPO).
Different recovery methods offer different levels of RPO and RTO. You can choose a specific recovery method, or use a combination of database backups and active geo-replication to achieve full application recovery. The following table compares RPO and RTO of each recovery option:
The following sections provide an overview of the steps to recover using database backups, active geo-replication, or auto-failover groups. For detailed steps including planning requirements, post recovery steps, and information about how to simulate an outage to perform a disaster recovery drill, see Recover a database in SQL Database from an outage.
Prepare for an outage
Regardless of the business continuity feature you use, you must:
Identify and prepare the target server, including server-level IP firewall rules, logins, and master database level permissions
Determine how to redirect clients and client applications to the new server
Document other dependencies, such as auditing settings and alerts
Failover to a geo-replicated secondary database
If you are using active geo-replication or auto-failover groups as your recovery mechanism, you can configure an automatic failover policy or use manual unplanned failover. Once initiated, the failover causes the secondary to become the new primary and to be ready to record new transactions and respond to queries – with minimal data loss for the data not yet replicated. For information on designing the failover process, see Design an application for cloud disaster recovery. When the Azure datacenter comes back online the old primaries automatically reconnect to the new primary and become secondary databases. If you need to relocate the primary back to the original region, you can initiate a planned failover manually (failback).
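As a rough sketch of initiating the failover itself, assuming the azure-mgmt-sql Python SDK (resource names are hypothetical):

```python
# Rough sketch of failing over an auto-failover group. begin_failover is the
# planned (no data loss) variant; begin_force_failover_allow_data_loss is the
# unplanned variant that accepts potential data loss. Assumes azure-mgmt-sql.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Run against the secondary server: it becomes the new primary.
poller = client.failover_groups.begin_failover("my-rg", "my-secondary-server", "my-fog")
poller.wait()
```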
Perform a geo-restore
If you are using the automated backups with geo-redundant storage (enabled by default), you can recover the database using geo-restore. Recovery usually takes place within 12 hours – with data loss of up to one hour determined by when the last log backup was taken and replicated. Until the recovery completes, the database is unable to record any transactions or respond to any queries. Note that geo-restore only restores the database to the last available point in time. If the datacenter comes back online before you switch your application over to the recovered database, you can cancel the recovery.
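As a rough sketch, a geo-restore can be expressed with the azure-mgmt-sql Python SDK as a database create with create_mode set to “Recovery”, pointing at the recoverable (geo-replicated backup) resource; the names are hypothetical and the exact property names may differ by API version:

```python
# Rough sketch of a geo-restore via the azure-mgmt-sql SDK; illustrative only.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

recoverable_id = (
    "/subscriptions/<subscription-id>/resourceGroups/my-rg/providers/Microsoft.Sql"
    "/servers/my-old-server/recoverabledatabases/mydb"
)

# Create a new database in another region from the geo-replicated backup.
poller = client.databases.begin_create_or_update(
    "my-rg",
    "my-new-server",
    "mydb",
    {
        "location": "westus2",
        "create_mode": "Recovery",
        "source_database_id": recoverable_id,
    },
)
poller.result()
```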
Perform post failover / recovery tasks
After recovery, you must perform the following additional tasks before your users and applications are back up and running:
Redirect clients and client applications to the new server and restored database
Ensure appropriate server-level IP firewall rules are in place for users to connect or use database-level firewalls to enable appropriate rules
Ensure appropriate logins and master database level permissions are in place (or use contained users)
Configure auditing, as appropriate
Configure alerts, as appropriate
If you are using an auto-failover group and connect to the databases using the read-write listener, the redirection after failover will happen automatically and transparently to the application.
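A minimal sketch of what those connections could look like, assuming pyodbc and a hypothetical failover group named “my-fog”: the read-write listener always resolves to the current primary, so no connection string change is needed after failover, while the .secondary listener plus ApplicationIntent=ReadOnly routes read-only work to the secondary.

```python
# Minimal sketch of connecting through the failover group listeners with pyodbc.
# The failover group name, database, and credentials are hypothetical.
import pyodbc

read_write = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:my-fog.database.windows.net,1433;"
    "Database=mydb;Uid=appuser;Pwd=<password>;Encrypt=yes;"
)

read_only = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:my-fog.secondary.database.windows.net,1433;"
    "Database=mydb;Uid=appuser;Pwd=<password>;Encrypt=yes;"
    "ApplicationIntent=ReadOnly;"
)
```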