James Serra's Blog

Microsoft Build event announcements


There were a number of data platform announcements at Microsoft Build yesterday that I wanted to blog about.

Everything announced at Build can be found in the Microsoft Build 2022 Book of News.

Microsoft Intelligent Data Platform. This is not a new product, but rather a first step in integrating best-in-class databases, analytics, and governance products into a unified data ecosystem. More info

SQL Server 2022 is now in public preview. You can download it now. More info

Azure Cosmos DB new features. These include increased serverless capacity to 1 TB, shared throughput across database partitions, support for hierarchical partition keys, an improved 30-day free trial experience (now generally available), and support for MongoDB data in the Azure Cosmos DB Linux desktop emulator. A new, free, continuous backup and point-in-time restore capability enables seven-day data recovery and restoration from accidental deletes, and role-based access control support for the Azure Cosmos DB API for MongoDB offers enhanced security.

Azure SQL Database new features. In preview: Updated input and output bindings in Azure Functions, a local development environment, and new JSON constructors and ISJSON enhancements. Also, the ledger feature in Azure SQL Database is now GA. More info
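To get a quick feel for the new JSON surface area, here is a minimal sketch (not official sample code) that runs one of the new JSON constructors and the enhanced ISJSON type check from Python. The server, database, and driver values are placeholders you would replace with your own, and the exact function behavior may change while these features are in preview.

```python
# Quick sketch: try the new JSON_OBJECT constructor and the ISJSON type
# check against an Azure SQL Database (server/database names are placeholders).
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<your-server>.database.windows.net,1433;"
    "Database=<your-database>;"
    "Authentication=ActiveDirectoryInteractive;"
)

row = conn.cursor().execute(
    """
    SELECT JSON_OBJECT('id': 1, 'name': 'widget') AS doc,
           ISJSON('{"id": 1}', OBJECT)            AS is_json_object;
    """
).fetchone()

print(row.doc, row.is_json_object)
```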

Azure Synapse Analytics new features. In preview: Azure Synapse Link for SQL, and Microsoft Graph Data Connect. More info

Azure Database for MySQL Flexible Server Business Critical service tier. Now GA. Microsoft is renaming the Azure Database for MySQL Flexible Server Memory Optimized service tier to the Business Critical service tier. More info

Microsoft Purview new features. In preview: Microsoft Purview Data Policy, Microsoft Purview Data Estate Insights, and Expansion of Microsoft Purview multicloud and extensibility capabilities (NOTE: in case you missed it, Azure Purview was renamed to Microsoft Purview). More info – Data Policy, More info – Data Estate

Datamart in Power BI. In preview: A new self-service capability included with Power BI Premium that enables users to uncover actionable insights through their own data without any help from IT teams. Build a relational database for analytics using no-code experiences for workloads up to 100GB. I’ll go more into this new feature in my next blog post. More info

More info:

The Microsoft Intelligent Data Platform: Bringing together databases, analytics and governance

The biggest Azure announcements from Microsoft Build 2022

The post Microsoft Build event announcements first appeared on James Serra's Blog.

Power BI Datamarts


As mentioned in my previous blog about the Microsoft Build event announcements, the biggest news was Power BI Datamarts. This is a new self-service capability included with Power BI Premium (premium capacity or per user) that enables users to uncover actionable insights through their own data without any help from IT teams. You can quickly and easily build a relational database for analytics using no-code experiences for workloads up to 100GB. Think of a datamart as a mini data warehouse for a department.

It is a combined data prep and model authoring experience built into the Power BI Service. It does this by bringing together dataflows and data modeling, with an automatically provisioned Azure SQL Database behind the scenes storing the relational database and powering everything. So you ingest data from various sources and extract, transform, and load (ETL) the data using Power Query via a special dataflow into an Azure SQL database that is fully managed and requires no tuning or optimization. This differs from a regular dataflow, which just creates a data model that is stored with the report.

Once data is loaded into a datamart, you can then create relationships/measures/calculated tables using the new model authoring experience. It will also automatically generate a Power BI DirectQuery dataset pointing to the datamart, which can be used to create Power BI reports and dashboards (even though it does not support an Import dataset, it is still very fast). What's more, you'll get a T-SQL endpoint so you can query the underlying data inside Power BI or outside Power BI (e.g., SSMS, Azure Data Studio). The dataflow you set up to populate the datamart can be refreshed on a schedule (and incremental refresh is supported). All of the above is editable via a single web UI and treated as one package. This makes a very compelling argument to keep the simpler workloads within Power BI – reducing complexity and lowering the barrier to entry for many users.

Some important benefits of a datamart:

  • A datamart can be built from source systems, or it can be built from an enterprise data warehouse created by IT, for cases where you want a subset of the data warehouse so you can run your own queries on a much smaller data model
  • It cuts through several barriers that users face when they want to set up a reporting solution. Instead of having to get access to an Azure subscription to set up an Azure Data Factory resource and an Azure SQL Database resource, you can now just do it with the click of a button
  • It can be a better option than storing data in a data lake (which a dataflow does today by storing data in CDM format in a data lake), as loading a dataflow into an Azure SQL Database allows you to assign roles and row-level security, allows for simple connections from a variety of tools, and has faster performance
  • Because it has a brand new web experience for data modeling and measure authoring, no Power BI Desktop is required (this means Mac users don’t have to run Power BI Desktop in a Windows emulator anymore)


To create a datamart: Log onto app.powerbi.com and go into a premium workspace (premium capacity or per user). On the home page of a workspace, click New -> Datamart. Then choose “Get Data” and connect to a data source. Then choose the data you want to import into the datamart, and you will be taken to Power Query. Then a datamart and a DirectQuery dataset will be created (both with the same name). You will then be taken to the datamart workspace (visual designer) where there are four tabs:

  • Data – View the data in the tables; here you can create a new measure (via DAX), set up incremental refresh, and do filtering and sorting
  • Design – Create a query to view the data via a Power Query diagram view. This is for users who don’t know SQL – they can write custom queries using this visual query editor
  • SQL – A visual editor that allows you to create T-SQL queries to view the data. In the future you will be able to save queries
  • Model – Model the data, import additional data sources, manage roles (row-level security), and create new measures

If you go to the dataset that was created (which pulls data from the datamart), from there you can create a report. Or you can go to Power BI desktop and use the “Azure SQL Database” connection and create a report that way (via the connection string mentioned below). You can also use “Power BI datasets” to connect to the datamart. See Create reports using datamarts.

To use a tool like SSMS to view the datamart: Go to Settings for the datamart, go to “Server settings” and copy the connection string. In SSMS, use that string for the server name, use “Azure Active Directory – Universal with MFA” for the authentication, and after it connects you can go to the Views section in SSMS and choose a view to query the tables in the datamart. See Analyze outside the editor and Connecting to Power BI Datamart’s SQL Server from Desktop Tools and External Services.
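The same connection string works from code as well as SSMS. Here is a minimal sketch (assuming the ODBC Driver 18 for SQL Server and the pyodbc package are installed) that connects to a datamart's SQL endpoint with Azure AD interactive (MFA) authentication and lists the views it exposes; the server value and view name are placeholders.

```python
# Quick sketch: query a Power BI datamart through its T-SQL endpoint.
# <datamart-connection-string> is the value copied from Settings > Server settings.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<datamart-connection-string>;"
    "Authentication=ActiveDirectoryInteractive;"  # Azure AD sign-in with MFA prompt
)
cursor = conn.cursor()

# List the views the datamart exposes...
for schema, name in cursor.execute(
    "SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.VIEWS ORDER BY 1, 2"
).fetchall():
    print(f"{schema}.{name}")

# ...then sample one of them (replace with a view name from the list above).
for row in cursor.execute("SELECT TOP 10 * FROM <schema>.<view-name>"):
    print(row)
```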

Some of the main reasons to use datamarts:

  • When you are not able to wait for IT to make changes to the enterprise data warehouse (i.e. you want to quickly add budgets and forecasts to the data)
  • Sort, filter, do simple aggregation visually or through expressions defined in SQL
  • For outputs that are results, sets, tables, and filtered tables of data
  • Provide accessible data through a SQL endpoint (so you can use SSMS, Excel, etc)
  • Enable users who don’t have access to Power BI Desktop

Datamarts do not yet support data manipulation language (DML) to update data in the datamart, or data definition language (DDL) to update the database schemas in the datamart, but will in the future.

I still see use cases where you would use a regular dataflow instead of a datamart: for tables that are often reused within an organization such as a calendar table, a dimension table, or a lookup table.

The bottom line is departments now have a full-blown analytics product that allows them to connect to data sources, transform and clean the data, and create a data mart with a dataset on top. All without writing any code!


More info:

Introduction to datamarts (official docs)

Power BI Datamart – What is it and Why You Should Use it?

Datamarts and exploratory analysis using Power BI

Microsoft Power BI Gets Low-Code Datamart Feature

CREATE END-TO-END SELF-SERVICE ANALYTICS WITH POWER BI DATAMARTS

Power BI Data: Dataset vs. Dataflow vs. Datamart vs. Dataverse vs. SQL Server vs. Synapse – Are You Confused Just Yet?

Video Add Data at Scale | Datamart in Power BI

Dataflows with benefits

Video Exploring the preview of datamart in Power BI! and Power BI dataflows vs datamarts: What’s the difference??? and Answering your datamart questions

Video What is Power BI Datamarts?

First Look at Datamart

Power BI Datamarts First Impressions

Video What is a datamart? | Compared with data lakes, data warehouses & databases and Will datamarts become the online version of Power BI Desktop?

The post Power BI Datamarts first appeared on James Serra's Blog.

Power BI as an enterprise data warehouse solution?


With Power BI continuing to get many great new features, including the latest in Datamarts (see my blog Power BI Datamarts), I’m starting to hear customers ask “Can I just build my entire enterprise data warehouse solution in Power BI”? In other words, can I just use Power BI and all its built-in features instead of using Azure Data Lake Gen2, Azure Data Factory (ADF), Azure Synapse, Databricks, etc? The short answer is “No”.

First off, as far as Datamarts, I do not recommend using this new capability as a competitor to or replacement for an existing data warehouse. Datamart is a self-service capability recommended for business users/analysts that can connect to many data sources or any data warehouse to pull data to create a Datamart for business use. On the other hand, IT departments build and manage a data warehouse. I think in many cases you will see Datamarts pulling data from an enterprise data warehouse, but I can also see Datamarts pulling from data sources that are not yet in the data warehouse (in which case a Datamart could then possibly be used as a source to an enterprise data warehouse).

The bottom line is a Datamart is an evolution of the Power BI service best used for the “departmental citizen analyst” persona and is ideal for smaller workloads, while DBAs, data engineers, and data architects should use Azure Synapse and other powerful Azure tools for larger workloads.

Here is a summary of the major reasons why Power BI should not be used for enterprise data warehouses:

ETL: Power BI allows for self-service data prep via Dataflows, but it is not nearly as robust as ADF. ADF has two kinds of data flows: Mapping data flows and Wrangling data flows. ADF wrangling data flows and Power BI dataflows are very similar and both use the Power Query engine. However, ADF wrangling data flows are more powerful – see How Microsoft Power Platform dataflows and Azure Data Factory wrangling dataflows relate to each other. And ADF mapping data flows are way more powerful than Power BI dataflows – see Power BI Dataflows vs ADF Mapping Data Flows.

Data size – Datamarts support a max data size of 100GB

Performance tuning – There are no knobs to turn to improve performance in Power BI, which is fine for small workloads but a problem for large workloads, where tuning is essential (yes, a DBA is still needed for things that you can do in Synapse, like index creation, workload management, adjusting SQL syntax, etc.)

Inconsistent performance: While Power BI offers a central modeling layer that can include disparate data sources, it falls back on each data source’s capability – for example, whether it can support predicate pushdown for DirectQuery or not. Hence reporting/model building efforts will deliver an inconsistent performance experience. Trying to put an entire corporate strategy around this model also leaves little space for citizen ETL/ELT developers who don’t want visualizations/dashboards as a final outcome (for example, when the data should be surfaced in the CRM system or as a one-time answer).

Computing resources: Since Power BI is model centric, the compute will have limitations: model size limits have upper bounds (1 GB, 100 GB, 400 GB), capacity SKUs range from EM1 to EM3 and P1/A4 to P5/A8, and total vCores range from 1 to 128. On the other hand, Synapse supports Massively Parallel Processing and is capable of processing and storing huge volumes of data at scale (terabytes/petabytes). It also has a resource model via workload classifiers and static/dynamic resource allocation. It addresses complex scenarios where heavy data processing can occur at the same time as high concurrency for reporting – all under different workload isolations. This level of compute-selection complexity is not present in Power BI. Then there are the additional features you get with ADF: different groups can have their own TCO-based compute for performing their data processing and cleansing, or tie it in with other components such as AI/ML workbooks whose net output can be Power BI. ADF has much more compute available than Power BI by way of clusters, and has compute environments to process or transform data, all of which shut down once inactive.
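To make the workload-isolation point concrete, here is a hedged sketch (the group and classifier names, percentages, and login are made up) of creating a workload group and classifier on a Synapse dedicated SQL pool so an ETL login gets guaranteed resources while report users run in a separate group. This is standard dedicated-pool T-SQL, submitted here from Python.

```python
# Sketch: carve out resources for an ETL login on a Synapse dedicated SQL pool.
# Group/classifier names, percentages, and the login are illustrative only.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>.sql.azuresynapse.net;Database=<dedicated-pool>;"
    "Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)

# Guarantee 25% of pool resources to a workload group for data loads.
conn.execute("""
    CREATE WORKLOAD GROUP wgDataLoads
    WITH ( MIN_PERCENTAGE_RESOURCE = 25
         , CAP_PERCENTAGE_RESOURCE = 50
         , REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5 );
""")

# Route the ETL service account's queries into that group at high importance.
conn.execute("""
    CREATE WORKLOAD CLASSIFIER wcNightlyLoads
    WITH ( WORKLOAD_GROUP = 'wgDataLoads'
         , MEMBERNAME     = 'etl_service_account'
         , IMPORTANCE     = HIGH );
""")
```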

Different universes: If you are using Power BI to build your solution, then obviously you are limited to just the features within Power BI. Azure, on the other hand, is designed for scale and, most importantly, integration within the Microsoft platform and other platforms. There is a supermarket selection of services to choose from in the Azure marketplace (even such options as a VM with SQL Server). In the Power BI world, you are limited to what’s provided for you, which can be a blessing or a curse depending on what you want to build. Azure options allow you to integrate with first-party services (e.g., Databricks) and 3rd-party services (e.g., Profisee for Master Data Management), which is quite important for an enterprise solution and very different from the strategy Power BI offers.

Who will use it? Just one department, or the entire company? Power BI is focused on use for a department. Synapse is focused on use for the entire company, and has features to accommodate this, such as source control to support a large number of people using it.

Finally, in the Power BI Datamart scenario you also lose all the goodness that you get from the data lake and data warehouse, such as:

  1. Data exhaust: You are only really storing the curated datasets in the Datamart – you lose all of the raw data if you go straight to the Datamart
  2. Flexibility: Anytime you are accessing the data for data science or ad-hoc workloads, you are accessing Power BI compute. This is either going to be expensive or throttled and most likely both
  3. Features: You lose out on the ability to do complex transformations or near-real-time data sources


In the end I think Datamarts are a great solution as a presentation layer for small (under 100GB) curated datasets which were built from data that had been ingested and integrated in Synapse or Databricks. Also, Power BI can be a great option for building a POC and/or getting a quick win, then using that POC as the blueprint/requirements for building an enterprise solution.

Hopefully this blog helped you to understand when it is better to build a solution via “self-service” in Power BI versus having IT build an enterprise solution using tools like Synapse. Would love to hear your thoughts!

More info:

What is the point of a data warehouse if Power BI has ETL Capabilities?

The post Power BI as an enterprise data warehouse solution? first appeared on James Serra's Blog.

Questions to ask when designing a data architecture


When I’m leading a full-day architecture design session (ADS) with a customer and the goal is to come up with a data architecture for them, I first start doing “discovery” with them for at least a couple of hours so I can come up with the best architecture for their particular use case. Discovery involves asking them a bunch of questions so I can design a high-level architecture without mentioning any products at first, then once the high-level architecture is complete, I’ll ask more questions and apply products to the architecture. I tell customers Microsoft has many tools in the toolbox for building a solution, and based on the answers to my questions, I’ll reduce that toolbox to just a few tools that are most appropriate for them. I know it is very difficult for customers to keep up with all the new technology Microsoft comes out with on a very frequent basis (after all, you have day jobs), and the reason Microsoft has architects like me is to clear up the confusion and help customers in choosing the right tools for the job.

As an example, if I were asked what product to use to store data in the Azure cloud, I could come up with at least a dozen options, so I need to ask questions to reduce the choices to the best fit for the customer’s situation. This will avoid what I have seen many times – a company chooses a particular product and, after their solution is built, says the product is “terrible”, but they were using it for a use case it was not designed for. The customer was simply not aware of a better product for their use case because “they don’t know what they don’t know”. That is why working with an expert architect should be one of your first orders of business: the technology decisions at this early stage of building a solution are vital to get correct, as finding out six months or a year later that you made the wrong choice and have to start over can lead to so much wasted time and money (and I have seen some shocking waste).

Some of the questions I will ask:

  • Can you use the cloud? (nowadays, this is almost always yes, if not, let’s evaluate why and see if we can overcome it)
  • Is this a new solution or a migration?
  • What is the skillset of the developers?
  • Is this an OLTP or OLAP/DW solution?
  • Will you use non-relational data (variety)?
  • How much data do you need to store (volume)?
  • Will you have streaming data (velocity)?
  • Will you use dashboards and/or ad-hoc queries?
  • Will you use batch and/or interactive queries?
  • How fast do the operational reports need to run (SLAs)?
  • Will you do predictive analytics/machine learning (ML)?
  • Do you want to use Microsoft tools or open source?
  • What are your high availability and/or disaster recovery requirements?
  • Do you need to master the data (MDM)?
  • Are there any security limitations with storing data in the cloud (i.e. defined in your customer contracts)?
  • Does this solution require 24/7 client access?
  • How many concurrent users will be accessing the solution at peak-time and on average?
  • What is the skill level of the end users?
  • What is your budget and timeline?
  • Is the source data cloud-born and/or on-prem born?
  • How much daily data needs to be imported into the solution?
  • What are your current pain points or obstacles (performance, scale, storage, concurrency, query times, etc)?
  • Are you ok with using products that are in public or private preview?
  • What are your security requirements? Do you need data sovereignty?
  • Is data movement a challenge?
  • How much self-service BI would you like?


And you have to be flexible: after a day spent in an ADS with the customer, going over all these questions and coming up with the best architecture and products for them, you might hear them say that it will cost too much and they want a more cost-effective solution. And I usually brief customers on what new products and features that Microsoft has in private preview (or about to be) as it may be something they want to consider if it fits within their timeline, which is usually quite long when building a data architecture such as a modern data warehouse, data fabric, data lakehouse, or data mesh (see Data Lakehouse, Data Mesh, and Data Fabric).

Sometimes I do miss the days when we just had to worry about a new version of SQL Server every few years, where we would just go to a bootcamp for a few weeks and then we knew everything we needed. But learning is fun so I do prefer the challenge of today’s world where the technology is changing on a near-daily basis.

More info:

Understanding the options and asking the right questions

The post Questions to ask when designing a data architecture first appeared on James Serra's Blog.

Power BI guidance documentation


Recently there have been a number of great articles published on Power BI that I wanted to make you aware of, which go beyond the feature descriptions found in the Power BI documentation. These new articles fall under the Power BI guidance documentation and are designed to address common strategic patterns. Below is my summary of the articles, and check out Power BI guidance from the CAT by Matthew Roche for a more detailed summary.

Other helpful links:

The post Power BI guidance documentation first appeared on James Serra's Blog.

How to keep up with technology


(Side note: There is a “subscribe” button on the right side of my home page if you wish to receive my blog updates via email as soon as they are published).

I have been asked a number of times how I keep up with technology, given that Microsoft comes up with new products and features on an almost daily basis. Gone are the days when SQL Server would come out with a new version every 3-4 years and you would just go to a bootcamp for a couple of weeks and you were all caught up. That all changed with the cloud. While this is challenging, I find it is more fun, as I equate learning with having fun. But if you are a customer of Microsoft, you have a day job and don’t have the time to keep up with all the technology. That is why architects like me exist within Microsoft: to learn about a customer’s business and keep updated on the technologies to help the customer choose the best architectures and products for their use cases.

Of course the internet has made learning so much easier. Pre-internet I remember we had to read books and magazines, go to the library, call tech support, or know a handful of “experts” that we could call. Now, everything is just a click away, but what to focus on? Working for Microsoft, we get a lot of internal emails, decks, whitepapers, Yammer, Teams channels, etc, that the outside world does not get access to. But there is plenty of info outside of that which I use. So, I monitored my actions for the last few weeks and listed below everything I used:


Read these blogs:

James Serra’s Blog
Chris Webb’s BI Blog
Paul Randal
Brent Ozar
sqlbi
Paul Turley SQL Server BI Blog
Melissa Coates
Paul Andrew
Matthew Roche
Azure Data Blog

Watch these YouTube channels:

Guy in a Cube
Advancing Analytics
Curbal
Fun with Azure
Bryan Cafferky
RADACAD
Kasper on BI

I also have monthly 30-minute chats with about a dozen people (inside and outside of Microsoft) to keep up with what they are doing and what they have learned. No doubt it is challenging and time consuming to keep up. Especially within my role as a data and AI architect, which I feel is so big it should be split into three separate roles: data platform, AI, and Power BI. Fortunately learning is part of my day job at Microsoft, although I voluntarily read about technology some nights and weekends because it is fun to me and does not feel like “work”. I also like to play with new features and products via hands-on experience to understand them better so I can be more thorough if needed when educating customers. I like to simplify things and explain them at a high level for c-level people or end-users, or go deeper if I’m talking to technical people. But with so many products the best you can do is to know most products in your solution area at a high level, and go deep with 2-3 products, because we also have to learn about customers, industries, the competition, and the Microsoft organization (which radically changes almost every year). The one thing I have gotten better at over the years is “learning how to learn”. I’m able to filter out very quickly what is important and what is not and absorb it rapidly (like cramming for a test). This helps compensate for the fact my brain neurons don’t fire as fast!

I hope this blog is helpful in giving you some new sources of learning, and please comment below on other sources that you like that I have not listed.

The post How to keep up with technology first appeared on James Serra's Blog.

Synapse database templates info and tips


I had previously blogged about Azure Synapse Analytics database templates, and wanted to follow up with some notes and tips on that feature, as I have been involved in a project that is using it:

  • Purview does not yet pull in the metadata for database templates (table/field descriptions and table relationships). Right now it pulls in the metadata as if it was a SQL table or as if it was a file in ADLS. Both just have the basic information supported by those types. The SQL one is probably preferred
  • Power BI does not import the table and field descriptions when connecting to a lake database (where the database templates are stored), but it does import the table relationships. You can see the table descriptions by hovering over the table names in the navigator when importing tables using the “Azure Synapse Analytics workspace (Beta)” connector. Note you are not able to see the table descriptions when hovering over the table names using the “Azure Synapse Analytics SQL” connector. Also note the “Select Related Tables” button does not work in the navigator
  • When using Power BI to connect to database templates, make sure the tables you want to use have data and are stored in Parquet format or you will get message “DataSource.Error: Microsoft SQL: External table ‘dbo.TableName’ is not accessible because content of directory cannot be listed”. Note that DirectQuery does not work with the “Azure Synapse Analytics workspace (Beta)” connector (see Power BI connectors to Azure Synapse), but works with the “Azure Synapse Analytics SQL” connector
  • The storage format for database templates defaults to delimited text, but recommend you store it in parquet format (via the “Data format” drop-down in the model properties). Delta format is not supported yet
  • When using the database template designer, think of it as you are building multiple virtualization layers (data models) over the data in your data lake – it’s not a robust modeling tool like erwin or ER/Studio
  • You can use the Map data tool (which is in beta) instead of using Spark notebooks or the ADF copy tool or ADF data flows to map source data to the database template (mapping is by far the most time-consuming part). Once you are done with your Map Data transformations, clicking the “Create pipeline” button will generate an Execute Pipeline activity that calls a mapping data flow, which you can then debug and run to perform your transformation. You can think of the Map data tool as a wizard to shortcut the process of building a mapping data flow
  • There is no ability to share models between different Synapse workspaces within the template designer. In the future, APIs may become available to accomplish this
  • Create multiple domains from a database template model instead of just using one gigantic model. For example, instead of using the Consumer Goods model and just having that one model, break it up into multiple domains that each have its own model, such as HR and Finance
  • Documentation is in the models themselves via the description fields in the database template designer, along with all the relationships you can see visually in the designer. Note that many of the models have the same entities/tables
  • Within Synapse, you can utilize GitHub or Azure DevOps Git to store the database models as JSON files. While the actual data that is copied to the models is stored in the lake database, the metadata for the models is not visible in the lake database, only in Synapse (as well as within JSON’s files in GitHub)
  • You might want to create SQL views that query the data models in order to limit permissions. These views can be created in Synapse serverless, or via a recently released feature that allows you to create views inside a lake database (see Azure Synapse Analytics August Update 2022). See the sketch after this list for an example of the serverless approach
  • Each database template has a version number that you can see in the properties blade when in the database template designer
  • You are not able to share models between Synapse workspaces (must use GitHub to share)
  • All the database template models are in third-normal form (3NF) and stored in a lake database (ADLS Gen2). There is not a way to automatically replicate the models to a dedicated pool from within the database template designer
  • You can have a Power BI data modeler import the database template tables and relationships into a Power BI dataset. Purview will list those datasets (see Microsoft Power BI and Microsoft Purview work better together) and you can request access to the datasets via Purview, but the access needs to be given manually as Power BI datasets access grant is not automated yet. Power BI report builders can then choose a dataset to build their reports so they do not have to create the table relationships themselves
  • If needed, you can create your own star schemas. Star schemas will reduce the number of joins and improve performance. You can use “Transform Data” in the Power BI navigator to fire up Power Query to build the star schemas, or you can build them using Synapse pipelines and store them in the data lake or in a Synapse dedicated pool. Using a star schema instead of 3NF will result in fewer tables and joins, reducing complexity for end-users when creating reports, along with faster performance (though the gain will be minimal if importing the data into Power BI). You can equate 3NF with a snowflake schema, and see the differences here: Star Schema vs. Snowflake Schema (zentut.com)
  • Concerning a way to set up roles and responsibilities for a database template project: you can have data modelers, working with domain SMEs, who will create a spreadsheet (blueprint) of the mapping of source to target along with transformation rules. Once completed, the spreadsheet would be handed off to data engineers to write the code to copy the source data into the database template and do the transformations
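As an example of the views approach mentioned above, here is a hedged sketch (the storage account, container, columns, and user are placeholders) that creates a view in a Synapse serverless SQL database over the Parquet files behind a database template table and grants SELECT on the view only; in practice you would also need to handle storage-level access, for example via a database scoped credential or managed identity.

```python
# Sketch: a permission-limiting view in Synapse serverless SQL over the Parquet
# files behind a database template table (paths and names are placeholders).
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;Database=<serverless-db>;"
    "Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)

conn.execute("""
    CREATE VIEW dbo.vw_Customer AS
    SELECT CustomerId, CustomerName, CountryRegion
    FROM OPENROWSET(
            BULK 'https://<storage-account>.dfs.core.windows.net/<container>/Customer/**',
            FORMAT = 'PARQUET'
         ) AS rows;
""")

# Grant read access on the view only, rather than on the underlying files.
conn.execute("CREATE USER [analyst@contoso.com] FROM EXTERNAL PROVIDER;")
conn.execute("GRANT SELECT ON OBJECT::dbo.vw_Customer TO [analyst@contoso.com];")
```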


For more information about database templates or to get started, visit aka.ms/SynapseDBTemplates.

The post Synapse database templates info and tips first appeared on James Serra's Blog.

Data lake architecture


I have had a lot of conversations with customers to help them understand how to design a data lake. I touched on this in my blog Data lake details, but that was written a long time ago so I wanted to update it. I often find customers do not spend enough time in designing a data lake and many times have to go back and redo their design and data lake build-out because they did not think through all their use cases for data. So make sure you think through all the sources of data you will use now and in the future, understanding the size, type, and speed of the data. Then absorb all the information you can find on data lake architecture and choose the appropriate design for your situation.

A data lake should have layers such as:

  • Raw data layer – Raw events are stored for historical reference, usually kept forever (immutable). Think of the raw layer as a reservoir that stores data in its natural and original state. It’s unfiltered and unpurified. Advantages are auditability, discovery, and recovery. A typical example is if you need to rerun an ETL job because of a bug, you can get the data from the raw layer instead of going back to the source. Also called the bronze layer, staging layer, or landing area. Sometimes there is a separate conformed layer (or base layer) that is used after the raw layer to make all the file types the same, usually parquet.
  • Cleansed data layer – Raw events are transformed (cleaned and mastered) into directly consumable data sets. Think of the cleansed layer as a filtration layer. It removes impurities and can also involve enrichment. The aim is to standardize the way files are stored in terms of encoding, format, data types, and content (e.g., strings and integers). Also called the silver, transformed, integrated, or enriched layer
  • Presentation data layer – Business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g., a data warehouse application, an advanced analysis process, etc.). The data is joined and/or aggregated, and can be stored in de-normalized data marts or star schemas. Also called the application, workspace, trusted, gold, secure, production ready, governed, curated, or consumption layer
  • Sandbox data layer – Optional layer to be used to “play” in, usually for data scientists. It is usually a copy of the raw layer. Also called the exploration layer, development layer, or data science workspace


Within each layer there will be a folder structure, which is designed based upon reasons such as subject matter, security, or performance (e.g., partitioning and incremental processing). Some good examples of this can be found in the doc Data lake zones and containers, and from one of my favorite bloggers, Melissa Coates (Coates Data Strategies): Zones in a Data Lake, Data Lake Use Cases and Planning Considerations, FAQs About Organizing a Data Lake, and the PowerPoint Architecting a Data Lake. Also, see my video Modern Data Warehouse explained for how data is moved between the layers.
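As a small illustration of such a folder structure, here is a hedged sketch using the azure-storage-file-datalake SDK (the account, container, source system, and entity names are made up) that lays out raw-layer folders by source system, entity, and load date, which supports security scoping at the folder level as well as incremental processing by date partition.

```python
# Sketch: lay out raw-layer folders by source system, entity, and load date
# (storage account, container, and names are illustrative).
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

raw = service.get_file_system_client("raw")  # one container per layer

# raw/<source-system>/<entity>/<yyyy>/<mm>/<dd>/
for entity in ["orders", "customers", "products"]:
    raw.create_directory(f"salesdb/{entity}/2022/06/15")
```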

With rare exceptions, all layers use Azure Data Lake Storage (ADLS) Gen2. I have seen some customers use Azure Blob Storage for the raw layer because it was a bit cheaper, or if there are huge demands on throughput it may be a good way to isolate ingestion workloads (using Blob Storage) from processing/analytics workloads (using ADLS Gen2).

Most times all these layers are under one Azure subscription. Exceptions are if you have specific requirements for billing, you will hit some subscription limit, or you want separate subscriptions for dev, test, and prod.

Most customers create ADLS Gen2 storage accounts for each layer, all within a single resource group. This provides isolation of the layers to help with performance predictability, and allows for different features and functionality at the storage account level, such as lifecycle management options or firewall rules, or to prevent hitting some storage account limit (i.e. throughput limit).

Most data lakes make use of Azure storage access tiers, with the raw layer using the archive tier, cleansed using the cold tier, and presentation and sandbox using the hot tier.

I recommend some type of auditing or integrity checks be put in place to make sure the data is accurate as it moves through the layers. For example, if the data is finance data, create a query that sums up the total of the day’s orders (count and sales total) to make sure the values are equal in all the data layers (and compare them to the source data).
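Here is a minimal sketch of such a check (the paths, column names, and load date are made up) written as a PySpark job that compares the day's order count and sales total between the raw and cleansed layers and fails loudly if they drift apart.

```python
# Sketch: reconcile a day's order count and sales total between layers
# (paths, column names, and the load date are illustrative).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
load_date = "2022-06-15"

raw = spark.read.parquet("abfss://raw@<account>.dfs.core.windows.net/salesdb/orders/2022/06/15/")
cleansed = spark.read.format("delta").load("abfss://cleansed@<account>.dfs.core.windows.net/sales/orders/")

def daily_totals(df):
    return (df.filter(F.col("order_date") == load_date)
              .agg(F.count("*").alias("order_count"),
                   F.sum("sales_amount").alias("sales_total"))
              .first())

r, c = daily_totals(raw), daily_totals(cleansed)
assert r["order_count"] == c["order_count"], f"Order count mismatch: {r} vs {c}"
assert r["sales_total"] == c["sales_total"], f"Sales total mismatch: {r} vs {c}"
```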

A large majority of customers are using delta lake in their data lake architecture because of the following reasons:

  • ACID transactions
  • Time travel (data versioning enables rollbacks, audit trail)
  • Streaming and batch unification
  • Schema enforcement
  • Supports the DELETE, UPDATE, and MERGE commands
  • Performance improvement
  • Solve “small files” problem via OPTIMIZE command (compact/merge)


Usually the raw data layer does not use delta lake, but cleansed and presentation do use it. See Azure Synapse and Delta Lake for more info about delta lake.
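As a hedged sketch of that flow (the paths are placeholders, and it assumes a Spark pool or cluster with a recent Delta Lake runtime), the code below appends a day's raw data into a cleansed Delta table, compacts small files with OPTIMIZE, and shows time travel back to an earlier version.

```python
# Sketch: append a day's raw data into a cleansed Delta table, compact small
# files, and time travel (paths are placeholders; assumes Delta Lake 2.0+).
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
cleansed_path = "abfss://cleansed@<account>.dfs.core.windows.net/sales/orders/"

# Append the day's raw data to the cleansed Delta table (schema is enforced).
(spark.read.parquet("abfss://raw@<account>.dfs.core.windows.net/salesdb/orders/2022/06/15/")
      .write.format("delta").mode("append").save(cleansed_path))

# Compact small files (the "small files" problem) via OPTIMIZE.
DeltaTable.forPath(spark, cleansed_path).optimize().executeCompaction()

# Time travel: read an earlier version of the table for audit or rollback.
spark.read.format("delta").option("versionAsOf", 0).load(cleansed_path).show(5)
```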

More info:

Book “The Enterprise Big Data Lake” by Alex Gorelik

Building your Data Lake on Azure Data Lake Storage gen2 – Part 1

Building your Data Lake on Azure Data Lake Storage gen2 – Part 2

The Hitchhiker’s Guide to the Data Lake

Should I load structured data into my data lake?

What is a data lake?

Why use a data lake?

THE DATA LAKE RAW ZONE

Video Data Lake Zones, Topology, and Security


The post Data lake architecture first appeared on James Serra's Blog.

Microsoft Ignite Announcements Oct 2022


Announced at Microsoft Ignite were some new product features related to data platform and AI. Below are the ones I found most interesting:

  • Azure Cosmos DB adds distributed PostgreSQL support: Azure Cosmos DB now brings fast, flexible and scalable service to open-source relational data with the introduction of distributed PostgreSQL support. Built upon the Hyperscale (Citus) engine, the new Azure Cosmos DB for PostgreSQL brings everything developers love about PostgreSQL and the powerful Citus extension to Microsoft’s fast and scalable database for cloud-native app development. Developers can now build apps with both relational and non-relational (NoSQL) data using the same familiar database service. More info
  • Autoscale IO feature for Azure Database for MySQL (preview): Azure Database for MySQL, a fully managed and scalable MySQL database service, can now scale input/output (IO) on-demand without having to pre-provision a certain amount of IO per second. Customers will enjoy worry-free IO management in Azure Database for MySQL Flexible Server because the server will scale input/output operations per second (IOPS) up or down automatically depending on workload needs. With the Autoscale IO preview, Azure Database for MySQL customers pay only for the IO they consume and no longer need to provision and pay for resources they are not fully using, saving both time and money. In addition, mission-critical Tier 1 apps can achieve consistent performance by making additional IO available to the workload at any time. Autoscale IO eliminates the administration required to provide the best performance at the least cost for Azure Database for MySQL customers. More info
  • Azure Data Studio now supports Oracle database assessments for migration to Azure-managed databases (preview): The Database Migration Assessment for Oracle, an Azure Data Studio extension powered by Azure Database Migration Service, now offers a migration assessment for moving from Oracle Database to Azure Database for PostgreSQL. The assessment includes database migration recommendations and an evaluation of database code complexity. Through the same tooling, customers can get target sizing recommendations for Oracle Database migration to Azure Database for PostgreSQL and Azure SQL, including Azure SQL Database Hyperscale, which is ideal for large workloads up to 100 TB. More info
  • Azure Synapse Analytics – Microsoft 365 features (preview): A new pipeline template for Microsoft 365 data will simplify the configuration experience by enabling a one-click experience to set up Mapping Data Flows. This new feature will eliminate the extra steps needed to connect Microsoft 365 source data for analytics, making it easier for customers to configure an always synchronized and compliant integration. Mapping Data Flows builds on top of the Copy Activity functionality by improving the way Microsoft 365 data is brought in for analysis. Mapping Data Flows will clean, normalize and flatten the data (Parquet format) in multiple data sinks such as Azure Data Lake Storage (ADLS), Azure Cosmos DB and Synapse SQL DW (dedicated SQL pool). Flattened data is much cheaper and faster to process with big data processing. Additional data sync support gives customers flexibility and greater efficiency in constructing data pipelines that optimize workflows
  • Azure Synapse Analytics – R language support (preview): The R language will enable data scientists to apply the industry-standard R language to process data and develop machine learning models. Azure Synapse Spark now supports Python, Scala, C#, Spark SQL, and R
  • Azure Data Explorer adds new sources for near real-time analytics: Ingestion support for the following data sources: Amazon Simple Storage Service (S3); Azure Synapse Link from Azure Cosmos DB to Azure Data Explorer; OpenTelemetry Metrics, Logs and Traces; Azure Stream Analytics output; and streaming ingestion support for Telegraf agent output. More info
  • SAP Change Data Connector for ADF now GA: The SAP Change Data Connector (CDC) for Azure Data Factory is now generally available. This feature, which previewed in June 2022, allows customers to easily bring SAP data into Azure for analytics, AI and other apps. More info
  • Microsoft Purview new features (preview): Improved root cause analysis and traceability with SQL Dynamic lineage and fine-grained lineage on Power BI datasets. Customers can do thorough root cause analysis from a single location in Microsoft Purview; Metamodels that will enable customers to define organization, departments, data domains and business processes on their technical data; Machine learning-based classifications will make detection of human names and addresses simple and scalable in user data
  • Azure Stream Analytics native support of Delta Lake output: Allows you to directly write streaming data to your delta lake tables without writing a single line of code. More info
  • Power BI updates: Power BI in Office installer; View and edit Power BI reports directly from OneDrive and SharePoint; B2B dataset sharing and report discoverability; Large dataset reporting support; Easier migration to Power BI Premium; Power BI reports and datasets in a Power App solution. More info


For more details, check out the Microsoft Ignite book of news.

More info:

Microsoft boosts Azure’s big-data cred with flurry of database-related enhancements

The post Microsoft Ignite Announcements Oct 2022 first appeared on James Serra's Blog.

Attending and presenting at conferences


I have attended and presented at a ton of conferences over the years (see the entire list at Presentations | James Serra’s Blog). If you are looking to learn a lot about the Microsoft data platform, the two biggest conferences to attend are the PASS Data Community Summit in November and SQLBits in March. I have presented at both and will be presenting at the PASS Summit next week. My session is called “Data Lakehouse, Data Mesh, and Data Fabric (the alphabet soup of data architectures)” and will be on Friday, Nov 18th at 2:30pm PST (more info). If you are at the PASS Summit and see me, make sure to say hi!

In addition to learning about technology, attending conferences is a great way to meet people in the industry and to network for your next job. I blogged about all the people I was able to meet at a conference many years ago: PASS Business Analytics Conference: the ultimate networking. When COVID hit and everything went virtual, it made me realize how much I enjoyed meeting everyone in-person, so I’m really looking forward to getting back to that next week at the PASS Summit. I usually spend more time chatting with people than attending the sessions (which you can always watch later). This year I hope to catch Brent Ozar in the PASS Summit Community Zone where he said he will be hanging out a lot, and I will be too (starting on Thursday). Brent helped me to get started blogging almost 12 years ago (I can’t believe it has been that long and that this is my 627th blog post!). Keep in mind that many of the conferences, including the PASS Summit, have a “personal development” track so it is more than just technical stuff that you can learn.

Other big Microsoft conferences that you may want to attend are Microsoft Build in May, Microsoft Inspire in July, Microsoft Ignite in November and Azure Data Conference in December.

If you are a new presenter, a good way to start is to present at a SQL Saturday. I have done over 50 presentations at SQL Saturdays and really enjoy them. For tips on presenting, check out my presentation Learning to present and becoming good at it. I highly recommend presenting if you have not done so before – it is very rewarding and can greatly help your career. See Brent’s Become a Presenter, Change Your Life.

A great way to get notified when calls for speakers are made available for events is to sign up at callfordataspeakers.com. Check out their List of events and you will be surprised how many data-related events there actually are.

Most events now use Sessionize to submit your abstracts, so you only need to enter your session details once, as you can select previously submitted sessions, making the process so much easier.

I hope to see some of you at the PASS Summit!

The post Attending and presenting at conferences first appeared on James Serra's Blog.

SQL Server 2022 is GA!


The big announcement at the PASS Data Community Summit 2022 was that SQL Server 2022 is now generally available. See the official announcement.

My top 10 list of new features are:

  • Integration with Azure SQL Database Managed Instance — the Microsoft-managed, cloud-based deployment of the SQL Server box product. This integration supports migrations to Managed Instance through the use of Distributed Availability Group (DAG), which will enable near-zero-downtime database migrations. Additionally, you will have the ability to move back to on-premises through a database restore (only available for SQL Server 2022), giving bi-directional HA/DR to Azure SQL. You can also use this link feature in read scale-out scenarios to offload heavy requests that might otherwise affect database performance. More info
  • Implementation of the ledger feature that already exists in Azure SQL Database (announced in May of last year), bringing the same blockchain capabilities to SQL Server. More info
  • Azure Synapse Link for SQL Server, which provides for replication of data from SQL Server 2022 into Azure Synapse-dedicated SQL pools. More info
  • Integration with Microsoft Purview, which assures that the cloud-based data governance platform encompasses SQL Server data, bringing data stored on-premises into its governance scope. That scope even includes propagation of Purview policies for centralized administration of management operations. More info
  • Query Store on secondary replicas enables the same Query Store functionality on secondary replica workloads that is available for primary replicas. More info
  • Query Store hints leverage the Query Store to provide a method to shape query plans without changing application code. Previously only available on Azure SQL Database and Azure SQL Managed Instance, Query Store hints are now available in SQL Server 2022 (16.x). Requires the Query Store to be enabled and in “Read write” mode (see the sketch after this list for an example). More info
  • A new feature called Parameter Sensitive Plan Optimization which automatically enables the generation of multiple active cached query plans for a single parameterized statement, accommodating different data sizes based on provided runtime parameter values. More info
  • An update to PolyBase that uses REST APIs to connect to data lakes (Azure storage and Amazon S3) in addition to using the ODBC drivers, as well as supporting the OPENROWSET command. More info
  • New built-in server-level roles enable least privileged access for administrative tasks that apply to the whole SQL Server Instance. More info
  • Enhancements to T-SQL that include an enhanced set of functions for working with JSON data and new time series capabilities. More info
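As a quick illustration of the Query Store hints item above, here is a hedged sketch (the query text and hint are made up, and Query Store must already be enabled and in read-write mode) that looks up a query's ID in the Query Store and pins an OPTION(RECOMPILE) hint to it, submitted from Python against a SQL Server 2022 instance.

```python
# Sketch: pin a Query Store hint to an existing query in SQL Server 2022
# (the query text and hint are illustrative; Query Store must be read-write).
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<sql-2022-server>;Database=<database>;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Find the Query Store query_id of the statement we want to shape.
query_id = cursor.execute("""
    SELECT TOP 1 q.query_id
    FROM sys.query_store_query_text AS qt
    JOIN sys.query_store_query AS q ON q.query_text_id = qt.query_text_id
    WHERE qt.query_sql_text LIKE '%FROM dbo.Orders WHERE CustomerId%';
""").fetchval()

# Attach a hint to that query without touching application code.
cursor.execute(
    "EXEC sys.sp_query_store_set_hints @query_id = ?, @query_hints = N'OPTION(RECOMPILE)';",
    query_id,
)
conn.commit()
```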

For the full list of new features, check out What’s new in SQL Server 2022 (16.x), What’s new in SQL Server Analysis Services, and What’s new in SQL Server Reporting Services (SSRS).

The post SQL Server 2022 is GA! first appeared on James Serra's Blog.

My presentation recordings of data architectures


Over the years I have presented a ton (see the list), and some of those presentations were recorded. I had put some of them on my YouTube channel, but neglected to post some of them (13 in fact). They are now all there, and below I have highlighted a few that I hope you will find helpful. If you do find any helpful, please subscribe to my YouTube channel to be notified of new videos, and I’ll make sure to upload any future recordings within a few days of the event:

Data Lakehouse, Data Mesh, and Data Fabric (1 hour) – Calgary Azure Analytics User Group

(view) So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I’ll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I’ll discuss Microsoft’s version of the data mesh.

Data Lakehouse, Data Mesh, and Data Fabric (10 minute overview) – DataMinutes

(view) So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I’ll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I’ll discuss Microsoft’s version of the data mesh.

Data Lakehouse: Debunking the Hype

(view) In this podcast I talked about data warehouse, data lakehouse, and the differences.

The Rise of Data Mesh: Panel discussion – James Serra – Decisive 2022 

(view) I was on a panel for the conference Decisive 2022 and discussed the data mesh.

Interview on data mesh and data warehousing – James Serra – UNION: The Data Fest 

(view) In this interview for the UNION conference I talked about my thoughts on data mesh, data warehousing, Microsoft Purview, and Microsoft’s approach to competition.

Big Data Architectures and The Data Lake – PASS Cloud Virtual Group

(view) With so many new technologies it can get confusing to decide on the best approach to building a big data architecture. The data lake is a great new concept, usually built in Hadoop, but what exactly is it and how does it fit in? In this presentation I’ll discuss the four most common patterns in big data production implementations, the top-down vs bottom-up approach to analytics, and how you can use a data lake and an RDBMS data warehouse together. We will go into detail on the characteristics of a data lake and its benefits, and how you still need to perform the same data governance tasks in a data lake as you do in a data warehouse. Come to this presentation to make sure your data lake does not turn into a data swamp!

Data Warehousing Trends, Best Practices, and Future Outlook

(view) Over the last decade, the 3Vs of data – Volume, Velocity & Variety – have grown massively. The Big Data revolution has completely changed the way companies collect, analyze & store data. Advancements in cloud-based data warehousing technologies have empowered companies to fully leverage big data without heavy investments both in terms of time and resources. But that doesn’t mean building and managing a cloud data warehouse comes without challenges. From deciding on a service provider to the design architecture, deploying a data warehouse tailored to your business needs is a strenuous undertaking. Looking to deploy a data warehouse to scale your company’s data infrastructure or still on the fence? In this presentation you will gain insights into the current Data Warehousing trends, best practices, and future outlook. Learn how to build your data warehouse with the help of real-life use cases and discussion of commonly faced challenges. In this presentation you will learn:

  • Choosing the best solution – Data Lake vs. Data Warehouse vs. Data Mart
  • Choosing the best Data Warehouse design methodologies: Data Vault vs. Kimball vs. Inmon
  • Step by step approach to building an effective data warehouse architecture
  • Common reasons for the failure of data warehouse implementations and how to avoid them

The post My presentation recordings of data architectures first appeared on James Serra's Blog.

New year, new role, and new book


Happy new year to everyone!

As I enter my 9th year at Microsoft, I have switched roles, and am now an Industry Advisor in Federal Civilian, helping our Federal Civilian customers deliver on their missions through the Microsoft cloud. Microsoft has many different industry groups, such as healthcare, retail, and finance, and Federal is another industry group, with three operating units under it: Defense (Air Force, Army, Navy, Marines, defense contractors, etc.), Intel (intelligence agencies), and Civilian (everything else: NASA, DOE, USPS, FAA, DOJ, IRS, DEA, and many more). So, this will be a much different type of customer than I have worked with in the past, and I am greatly looking forward to it. If you are working for a federal agency and wish to have a chat with me, let your Microsoft account team know.

Note that in addition to the commercial cloud, Microsoft also has a government cloud as well as classified clouds for Secret and Top Secret workloads (see Azure Government for national security and Azure Government Top Secret now generally available for US national security missions). They work just like the commercial cloud but might not have all the same services, so check out the Azure geographies and explore products by region. Many of the federal customers that I work with have classified workloads, and the government and classified clouds solve their extra security needs.

Here is the breakdown of all the Azure cloud environments available:

  • Azure is available globally. It is sometimes referred to as Azure commercial, Azure public, or Azure global.
  • Azure China is available through a unique partnership between Microsoft and 21Vianet, one of the country’s largest Internet providers.
  • Azure Government is available from five regions in the United States to US government agencies and their partners. Two regions (US DoD Central and US DoD East) are reserved for exclusive use by the US Department of Defense.
  • Azure Government Secret is available from three regions exclusively for the needs of US Government and designed to accommodate classified Secret workloads and native connectivity to classified networks.
  • Azure Government Top Secret serves the national security mission and empowers leaders across the Intelligence Community (IC), Department of Defense (DoD), and Federal Civilian agencies to process national security workloads classified at the US Top Secret level.
(the classification descriptions can be found here)

This will be my fifth role change within Microsoft, and it’s similar to a Cloud Solution Architect (CSA) role I had when I first joined Microsoft, although we were called Technical Sales Professionals (TSP) back then (I first started by selling the parallel data warehouse, the precursor to Azure Synapse). An Industry Advisor is sort of a mash-up between a seller and an architect. Industry Advisors and CSAs are unique roles that you don’t find in many other companies: we are trying to sell the value of the Azure cloud to companies, as opposed to trying to sell them services or products. If you are a customer of Microsoft, you have a day job and don’t have the time to keep up with all the technology. That is why architects like me exist within Microsoft: to learn about a customer’s business and keep updated on the technologies to help the customer choose the best architectures and products for their use cases. My goal is to educate customers and get them excited about building a solution on Azure. Education can be in the form of ideation sessions, strategy sessions, architecture sessions, demos, hackathons, workshops, POCs, or just a conversation to show them the “art of the possible” with Azure. Once a customer decides they want to build a solution on Azure, they go off to build that solution on their own or with a partner, and I can help with that transition.

Finally, I have been writing a book the last couple of months on data architectures that greatly expands on my previous blogs and presentations such as “Data Lakehouse, Data Mesh, and Data Fabric (video)” and “Data Warehousing Trends, Best Practices, and Future Outlook (video)”. More details to come later!

The post New year, new role, and new book first appeared on James Serra's Blog.

When to have multiple data lakes

A question I frequently get from customers when discussing data lake architecture is “Should I use one data lake for all my data, or multiple lakes?”. Ideally, you would use just one data lake, but I have seen many valid use cases where customers are using multiple data lakes. Here are some of those reasons:

  • Because of organizational structure, where each org keeps ownership of their own data. Typical with a data mesh
  • To support multi-regional deployments, where certain regions have data residency/sovereignty requirements. For example, data in China cannot leave China
  • To avoid Azure subscription or service limits, quotas, and constraints. For example, the limit of 250 storage accounts with standard endpoints per region per subscription
  • To enact different Azure policies for each data lake. For example, specifying that storage accounts should have infrastructure encryption
  • Having multiple lakes, each with its own Azure subscription, makes it easier to track costs for billing purposes, especially compared to other options such as using tags
  • If you have confidential or sensitive data and want to keep it separate from other less sensitive data for security reasons. Plus, you can implement more restrictive security controls on the sensitive data
  • Different lakes for dev, test, and prod environments
  • To improve latency – having a data lake reside in the same region as an end-user or an application querying the data, instead of users all over the world accessing data in one lake that could be located a considerable distance away
  • For security purposes, to limit the scope of elevated privileges, so that a person with those privileges has them only in the lake they are working in
  • Having one source-aligned data lake as well as a consumer-aligned data lake
  • To manage data that has different governance or compliance requirements. This can be especially important for organizations that need to comply with regulations such as GDPR or HIPAA.
  • You have different teams or departments that need their own data lake for specific use cases or projects
  • For better disaster recovery by having multiple data lakes in different regions with copies of the data, so you can ensure that your data is available in the event of a disaster or other disruption
  • To enable the use of different data recovery and disaster recovery strategies for different types of data
  • To enable different data retention policies. Organizations may have to retain data for a certain period of time due to legal or regulatory requirements, and having separate data lakes for different types of data can make it easier to implement different retention policies for different types of data
  • The ability to implement different levels of service for different types of data. For example, you could use one data lake for storing and processing high-priority data that requires low-latency access and high availability, and another data lake for storing and processing lower-priority data that can tolerate higher latencies and lower availability. This can help to optimize the cost and performance of your data management infrastructure by allowing you to use less expensive storage and processing resources for lower-priority data

It’s important to note that using multiple data lakes can increase the complexity and cost of your data management infrastructure and require more resources and expertise to maintain, so it’s important to weigh the benefits against the costs before implementing multiple data lakes (although in some cases, such as data sovereignty, you will have no choice but to use multiple data lakes). Multiple data lakes may also require additional data integration and management tools to help ensure that the data is properly transferred between the different data lakes and that data is consistent across all of them. Finally, having multiple data lakes adds the performance challenge of combining the data when a query or report needs data from multiple lakes.
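
To make that last point concrete, below is a minimal PySpark sketch of what combining data across two lakes can look like; the storage accounts, containers, paths, and the join column are hypothetical placeholders, and authentication to both accounts is assumed to already be configured on the Spark cluster.

    # Minimal sketch: joining data that lives in two separate ADLS Gen2 data lakes.
    # The storage accounts, containers, paths, and join column are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cross-lake-join").getOrCreate()

    # Each lake is just a different ABFS endpoint; credentials (service principal,
    # managed identity, etc.) are assumed to be configured on the cluster.
    sales = spark.read.parquet("abfss://curated@saleslakeeastus.dfs.core.windows.net/sales/")
    customers = spark.read.parquet("abfss://curated@crmlakewesteurope.dfs.core.windows.net/customers/")

    # The join is easy to express, but it moves data between lakes (and possibly
    # regions) at query time, which is the performance and cost tradeoff noted above.
    result = sales.join(customers, on="customer_id", how="inner")
    result.show(10)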

The post When to have multiple data lakes first appeared on James Serra's Blog.

New Microsoft Purview features

Microsoft Purview, formerly called Azure Purview (see Azure Purview is generally available), has recently released a number of cool new features. I wanted to call out a few of them:

  • Data Sharing – In public preview, you can now share data in-place from Azure Data Lake Storage Gen2 and Azure Storage accounts, both within and across organizations. Share data directly with users and partners without data duplication and centrally manage your sharing activities from within Microsoft Purview. You can now have near real-time access to shared data. Storage data access and transactions are charged to the data consumers based on what they use, at no additional cost to the data providers (more info). Note this uses the same technology as Azure Data Share, except Microsoft Purview data sharing does not support snapshot-based sharing, only in-place sharing
  • DevOps policies – DevOps policies are a special type of Microsoft Purview access policies. They grant access to database system metadata (not user data) provided the data source is enabled for Data Use Management. For example, giving a person the ability to log in to dozens of Azure SQL logical servers to monitor their performance (more info)
  • Data owner access policies – In public preview, enables you to manage access to user data in sources that have been registered for Data Use Management in Microsoft Purview (data use management needs certain permissions and can affect the security of your data, as it delegates to certain Microsoft Purview roles to manage access to the data sources). These policies can be authored directly in the Microsoft Purview governance portal, and after publishing, they get enforced by the data source (more info). For example, giving a user read access to an Azure Storage account that has been registered in Microsoft Purview
  • Self-service data access policies – In public preview, allows a data consumer to request access to data when browsing or searching for data, which triggers a self-service data access workflow. Once the data access request is approved, a policy is auto-generated and applied against the respective data source to grant access to the requestor, provided the data source is enabled for Data Use Management. Currently, self-service data access policies are supported for storage accounts, containers, folders, and files (more info). As an example, an end-user can be browsing folders in Purview and finds one that contains files the end-user would like to use. That person would request access to the folder through Purview, and if the access is approved, that person would be able to use a tool outside of Purview to read the files. Behind the scenes, a workflow was set up in Purview to automatically grant a user access to that folder, so nothing manual needs to be done to give access (the workflows look like Power Automate). Click here to see how the process works. The difference between “data owner access policies” (DOAP) and “self-service data access policies” (SDAP) is that DOAP is for the data owner to grant access to user data in sources, while SDAP is a way for data consumers to request access to user data in a source. So with DOAP no one is requesting access, and with SDAP they are. As an example, say a DOAP is created to give five people read access to FolderA. If a sixth person wanted access to FolderA, that person would request access by creating a SDAP
  • Workflows – Workflows are automated processes that are made up of connectors containing a common set of pre-established actions, and they run when specified operations occur in your data catalog. Workflow actions include things like generating approval requests or sending a notification, which allow users to automate validation and notification systems across their organization. There are two kinds of workflows: Data governance – for data policy, access governance, and loss prevention (as used in the self-service data access policies mentioned above), and Data catalog – to manage approvals for CUD (create, update, delete) operations for glossary terms (more info). For example: a user attempts to delete a business glossary term that is bound to a workflow. When the user submits this operation, the workflow runs through its actions instead of, or before, the original delete operation. The deletion could then be prevented if certain conditions are not met
  • Asset Types – In public preview, you can build a metamodel by adding asset types to a canvas and defining the relationships between them. Asset types allow you to describe parts of your business that are not technical assets. An asset type is a template for important concepts like business processes, departments, lines of business, or even products. A metamodel tells a story about how your data is grouped in data domains, how it’s used in business processes, what projects are impacted by the data, and ultimately how the data fits into the day-to-day of your business (more info). For example, it can help answer questions like: What department produces this dataset? Are there any projects that are using this dataset? Where does this report come from? You can define an asset type “department” and then create new department assets for each of your business departments, as well as attach existing data source assets. These new assets are stored in Microsoft Purview like any other data assets that were scanned in the data map, so you can search and browse for them in the data catalog. Metamodel includes several predefined asset types to help you get started, but you can also create your own.
The post New Microsoft Purview features first appeared on James Serra's Blog.

Using a data lakehouse

As I mentioned in my Data Mesh, Data Fabric, Data Lakehouse presentation, the data lakehouse architecture, where you use a data lake with delta lake as a software layer and skip using a relational data warehouse, is becoming more and more popular. For some customers, I will recommend “Use a data lake until you can’t”. What I mean by this is to take the following steps when building a new data architecture in Azure with Azure Synapse Analytics:

  1. Create a data lake in Azure and start with just using that – no delta lake or Synapse dedicated SQL pool. Set it up using best practices – see Best practices for using Azure Data Lake Storage Gen2 and The best practices for organizing Synapse workspaces and lakehouses. This may be all you need if your data size is on the smaller end and you don’t have complicated queries. Just be aware of the tradeoffs of not using a dedicated SQL pool (see Data Lakehouse & Synapse). You will be using a Synapse serverless pool to access the data in the data lake, so be sure to read Best practices for serverless SQL pool in Azure Synapse Analytics. Also, learn ways to adjust the folder structure in your data lake to improve performance – see Synapse Serverless SQL Pool – Performance and cost optimization with partitioning. You might want to do a few POCs to see if you can get by without a delta lake and/or a Synapse dedicated SQL pool
  2. If you find performance is not satisfactory or you need some of the features of a delta lake (see Azure Synapse and Delta Lake), then update all your code to write the data into your data lake in delta lake format (for example, see Transform data in delta lake using mapping data flows, and the minimal PySpark sketch after this list). Delta lake does give you many benefits, but it has the tradeoff of more cost and complexity. In my experience, the majority of customers are using delta lake
  3. If you are using Power BI for reporting against data in the data lake, there are a lot of features you can use if you are not getting the performance you need. Using Import mode is usually the best option for fast performance, but there are other options – see Power BI Performance Features
  4. If your performance is still suffering, then it’s time to start using a Synapse dedicated SQL pool. Just realize you only need to copy into the Synapse dedicated SQL pool the data that is causing problems, not all the data in the data lake. Be aware that with dedicated SQL pools there are a lot of knobs to turn to improve performance (see Best practices for dedicated SQL pools in Azure Synapse Analytics)
  5. An option to get better performance with a dedicated SQL pool is to duplicate the data from one SQL pool to another that has different indexes on it. So, queries and reports will go against the pool that gives the best performance. Of course, the tradeoff is cost and complexity in updating two pools
  6. Another option to get further performance is to convert your data to a star schema, which can be done in the Synapse dedicated SQL pool or as a dataset in Power BI (see Power BI and a Star Schema), or you could even land the star schema in your data lake
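
To illustrate step 2, here is a minimal PySpark sketch of converting an existing folder of Parquet files into delta lake format; the paths and partition column are hypothetical placeholders, and it assumes a Spark pool or cluster that already has the Delta Lake libraries available (the default in Synapse Spark and Databricks).

    # Minimal sketch of step 2: rewriting existing Parquet data into delta lake format.
    # The storage paths and partition column below are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("convert-to-delta").getOrCreate()

    raw = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/sales/2023/")

    # Writing with format("delta") adds the _delta_log transaction log next to the
    # Parquet files, which is what enables ACID updates, time travel, and schema
    # enforcement on the same data.
    (raw.write
        .format("delta")
        .mode("overwrite")
        .partitionBy("sale_date")
        .save("abfss://curated@mydatalake.dfs.core.windows.net/delta/sales/"))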

A question I get asked in a data lakehouse discussion is: what is the difference between importing directly from the data lake into PBI vs using Synapse serverless? There are two blogs on this by Chris Webb from Microsoft that I refer customers to:
Comparing The Performance Of Importing Data Into Power BI From ADLSgen2 Direct And Via Azure Synapse Analytics Serverless
Comparing The Performance Of Importing Data Into Power BI From ADLSgen2 Direct And Via Azure Synapse Analytics Serverless, Part 2: Transformations
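
For readers who want to see what the serverless path looks like outside of Power BI, here is a minimal sketch of querying the lake through a Synapse serverless SQL pool from Python. It assumes pyodbc, the ODBC Driver 17 for SQL Server, and a workspace whose serverless endpoint follows the usual <workspace>-ondemand naming; the workspace name, user, and file path are hypothetical placeholders.

    # Minimal sketch: querying Parquet files in the data lake through a Synapse
    # serverless SQL pool from Python. Endpoint, user, and path are placeholders.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
        "DATABASE=master;"
        "Authentication=ActiveDirectoryInteractive;"
        "UID=user@contoso.com;"
    )

    # OPENROWSET lets the serverless pool read the files in place - no data loading.
    query = """
    SELECT TOP 10 *
    FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/curated/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sales
    """

    for row in conn.execute(query):
        print(row)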

I’ll be discussing in detail the data lakehouse architecture (as well as other architectures such as data mesh and data fabric) in my upcoming book, which will be published by O’Reilly. I’m at least halfway through writing the draft of the book and hope to have a couple of chapters available soon that you can view via the O’Reilly early release program. I’ll then add another couple of chapters every few weeks. The early release program is a great way to start reading the chapters well before the entire book is officially published. I’ll post on my blog when it becomes available.

The post Using a data lakehouse first appeared on James Serra's Blog.

Data Marketplace

A question I have been getting from customers lately is about a data marketplace. What is it, and what products can I use to build it?

A data marketplace, sometimes called a data exchange, is an online platform where data providers and data consumers come together to buy, sell, or exchange data sets. These marketplaces are designed to facilitate the process of discovering, evaluating, and purchasing data that can be used for various purposes, such as data analysis, machine learning, business intelligence, and decision-making.

Data providers in a marketplace can include organizations, governments, or individuals who have collected or generated valuable data sets. Data consumers, on the other hand, can be businesses, researchers, or developers looking to obtain relevant data to address their specific needs.

Data marketplaces typically offer features such as:

  • Data catalog: A searchable directory of data sets, often categorized by industry, topic, or source, that allows users to easily find relevant data
  • Data quality and standardization: Many marketplaces assess and improve the quality of the data sets to ensure they are accurate, complete, and consistent, making it easier for consumers to use the data
  • Pricing and licensing: Data marketplaces often establish pricing structures and licensing agreements that clearly define the terms of use and ensure legal compliance for both data providers and consumers
  • Data delivery and integration: Once a data set is purchased, the marketplace may offer tools to help consumers easily access and integrate the data into their workflows, applications, or systems
  • Ratings and reviews: Some marketplaces allow users to rate and review data sets, which can help potential consumers make informed decisions about the quality and relevance of the data
  • Data privacy and security: Data marketplaces should have robust security measures in place to protect user data, ensure compliance with data protection regulations, and maintain the confidentiality of sensitive information
  • Data enrichment and preprocessing: Some data marketplaces provide tools and services for data cleaning, transformation, and enrichment, which can save users time and effort in preparing data for analysis. This can be particularly valuable when integrating data from multiple sources or dealing with incomplete or inconsistent data
  • Customized data sets: Data marketplaces may allow users to create customized data sets by combining data from multiple sources or filtering data based on specific criteria, enabling users to access only the data they need for their specific use case
  • Data analytics and visualization tools: Some data marketplaces offer built-in analytics tools and data visualization capabilities that allow users to analyze and visualize the data directly within the platform, without the need for additional software or tools
  • Collaboration and community: Data marketplaces can foster a community of data providers and consumers, enabling collaboration, sharing of knowledge, and best practices in data management and analysis. This can help users to learn from each other and improve their skills in working with data
  • Training and support: Data marketplaces may provide training materials, tutorials, and support services to help users get started with using the data and familiarize themselves with the platform and its features
  • Data monetization: Data providers can leverage data marketplaces to monetize their data assets, generating new revenue streams. This can be particularly beneficial for organizations that collect large amounts of data but may not have the resources or expertise to analyze and monetize it themselves


Data marketplaces have become increasingly popular as the demand for data-driven insights and the availability of diverse data sets continue to grow. Data marketplaces can benefit both data providers and data consumers by facilitating a more efficient and transparent exchange of data, fostering innovation, and encouraging the development of new data-driven products and services.

In Azure, there is a product called Microsoft Purview that can be used as a solution to build a data marketplace (see Empowering data consumers leveraging Microsoft Purview as a data marketplace). In combination with other products such as Power BI and Azure Data Factory, it will handle most of the features above. But for things like pricing and licensing, ratings and reviews, and data monetization, I have seen some customers build a “front end” application to handle those features, and use the Atlas API that Purview supports to pull information from Purview into their application.
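
As a sketch of that last point, the snippet below shows one way a custom front-end might pull search results out of Purview using its REST discovery (search) endpoint; the account name is a placeholder, the api-version shown may have changed, and the request/response shapes should be verified against the current Microsoft Purview REST API reference.

    # Minimal sketch: calling Microsoft Purview's discovery (search) REST endpoint from
    # a custom application. Account name is a placeholder; verify the api-version and
    # response fields against the current Purview REST API documentation.
    import requests
    from azure.identity import DefaultAzureCredential

    account = "mypurviewaccount"  # hypothetical Purview account name
    token = DefaultAzureCredential().get_token("https://purview.azure.net/.default").token

    response = requests.post(
        f"https://{account}.purview.azure.com/catalog/api/search/query",
        params={"api-version": "2022-08-01-preview"},
        headers={"Authorization": f"Bearer {token}"},
        json={"keywords": "sales", "limit": 10},
    )
    response.raise_for_status()

    # Each hit typically includes the asset name, qualified name, and entity type.
    for item in response.json().get("value", []):
        print(item.get("name"), "-", item.get("qualifiedName"))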

More info:

Ultimate Guide to The Data Marketplace in 2023

Data Marketplaces: A Paradigm Shift for the Big Data Boom

Everything to Know About Marketplaces for Data

Data Marketplace: An Endless Source of Data – Complete Guide

Why You Should Focus On Building An Internal Data Marketplace

Data Marketplace is the key to success of Data Mesh

The post Data Marketplace first appeared on James Serra's Blog.

The first two chapters of my book are available!

As I have mentioned in prior blog posts, I have been writing a data architecture book, which I started last November. The title of the book is “Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh” and it is being published by O’Reilly. They have made the first two chapters and the preface available in their Early Release program. It’s 32 printed pages. Check it out here! You can expect to see 1-2 additional chapters appear each month. This is a great way to start reading the book without having to wait until the entire book is done. Note you have to have an O’Reilly subscription to access it, or start a free 10-day trial. The site has the release date for the full book as September 2024, but I’m expecting it to be available by the end of this year. Please send me any feedback on the book to jamesserra3@gmail.com. Would love to hear what you think!

Here is the abstract of the book:

Data fabric, data lakehouse, and data mesh have recently appeared as viable alternatives to the modern data warehouse. These new architectures have solid benefits, but they’re also surrounded by a lot of hyperbole and confusion. This practical book provides a guided tour of each architecture to help data professionals understand its pros and cons.

In the process, James Serra, big data and data warehousing solution architect at Microsoft, examines common data architecture concepts, including how data warehouses have had to evolve to work with data lake features. You’ll learn what data lakehouses can help you achieve, and how to distinguish data mesh hype from reality. Best of all, you’ll be able to determine the most appropriate data architecture for your needs. By reading this book, you’ll:

  • Gain a working understanding of several data architectures
  • Know the pros and cons of each approach
  • Distinguish data architecture theory from the reality
  • Learn to pick the best architecture for your use case
  • Understand the differences between data warehouses and data lakes
  • Learn common data architecture concepts to help you build better solutions
  • Alleviate confusion by clearly defining each data architecture
  • Know what architectures to use for each cloud provider

And here is the table of contents (subject to change):

A brief description of how the book publishing process works, for those interested: You submit a book proposal (for O’Reilly go here). You discuss your abstract with an acquisitions editor, and if the proposal is accepted, you sign a contract that contains a timeline for when the chapters are due. Then start writing! It could take a year or more to finish a book, hence the benefit of the Early Release program. How this works is you write a chapter and submit it to a development editor, who will make edits and suggested changes and send it back to you (the edits are more along the lines of structure and content, not so much on grammar). You make some or all of the suggested changes and submit it back to the development editor. You may repeat this cycle a couple of times for each chapter. Once you do this for two chapters (they don’t have to be the first two chapters in the book), you are ready for the Early Release program. Those two chapters are sent to a production editor who publishes them to the O’Reilly site. Then approximately every month you try to write 1-2 chapters that can be edited and posted to the site. The content is considered “unedited”, but as I explained earlier there is editing being done for structure and content, and the grammar editing will be done by a copy editor after all the chapters are posted in the Early Release program. So the chapters you read in the Early Release program will have some changes, but usually not much. Most of what I’m describing happens before your book gets to production (if you’re curious about that, I’d recommend checking out O’Reilly’s guide to Production). The Early Release process is pretty O’Reilly-specific, and different authors and development editors will manage the revisions and number of chapter expectations differently. The level of edit for copyedits and proofreads in production will depend on a number of factors as well.

The post The first two chapters of my book are available! first appeared on James Serra's Blog.

Build announcement: Microsoft Fabric

The HUGE announcement at Microsoft Build yesterday was Microsoft Fabric (see Introducing Microsoft Fabric: Data analytics for the era of AI), now available in public preview. I have been at Microsoft for nearly nine years, and this is easily the biggest data-related announcement since I have been here. Satya Nadella, Microsoft’s CEO, even said Microsoft Fabric is “the biggest launch of a Microsoft data product since the launch of SQL Server”. I will introduce Microsoft Fabric in this blog post, and then follow up with other blog posts that will go into more detail on specific features.

The best way to understand Microsoft Fabric is to think of it as an enhancement to Power BI that adds SaaS versions of many Microsoft analytical products to the Power BI workspace, now called a Fabric workspace. Those products include Azure Synapse Analytics, Azure Data Factory, Azure Data Explorer, and Power BI. Do not think of it as a new version of Azure Synapse Analytics. SaaS versions of all these products now become available to all Power BI users via the Fabric workspace, making it much easier for business end-users to get insights into their data without having to wait until IT builds a solution for them. Synapse workspaces (old) are how you manage your services; Fabric workspaces (new) are how you manage your content.

But Fabric is not just for departmental use. IT will also use it to build enterprise solutions, providing one place for everyone to build solutions. This means you won’t have to decide between using Synapse or Fabric. Fabric is going to run your entire data estate: departmental projects as well as the largest data warehouse, data lakehouse and data science projects.

When using Microsoft Fabric, you won’t even realize you are using Azure. There are no more subscriptions, no storage accounts to create, and no up-front time spent filling out various configuration properties to get a resource created. You want a data lakehouse? Simply enter the lakehouse name and in a few seconds you will have it – no other info needs to be specified. There are minimal knobs, and Fabric is auto-optimized and auto-integrated, with centralized administration for everything.

Power BI capacities become Fabric capacities, and all the compute you require is pulled from the Fabric capacities as needed in a serverless fashion. That means no more serverless pools, dedicated pools, DWUs, or Spark clusters. Everything has become simplified.

All data that is stored within Fabric is in delta lake format. Since delta lake is open source, anything you create in Fabric can be used outside of Fabric by any product that can read from a delta lake (which is nearly all products). For example, you can use Databricks to access data created in Fabric. Use whatever compute is easiest and/or cheapest. This deep commitment to a common open data format means that customers need to load the data into the lake only once, and all the workloads can operate on the same data without having to separately ingest it.
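
As a quick illustration of that openness, here is a minimal PySpark sketch of reading a Fabric-created delta table from an external engine such as Databricks; the workspace, lakehouse, and table names are placeholders, and the OneLake path format shown is my understanding at the time of writing, so verify it against the OneLake documentation.

    # Minimal sketch: reading a delta table created by Fabric in OneLake from an
    # external Spark engine (e.g. Databricks). Workspace, lakehouse, and table names
    # are hypothetical; verify the OneLake path format in the current documentation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-onelake").getOrCreate()

    # OneLake exposes an ADLS Gen2-compatible endpoint, so a standard abfss:// path
    # works, provided the cluster authenticates with an identity that can access the
    # Fabric workspace.
    path = ("abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
            "MyLakehouse.Lakehouse/Tables/sales")

    df = spark.read.format("delta").load(path)
    df.show(10)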

Even data for a warehouse is stored in delta format. There is no more relational storage. Fabric has fully embraced the data lakehouse concept.

On the home screen of Fabric, you will be asked to choose a persona:

From then on, the Fabric workspace will be customized to the persona chosen. For example, if Synapse Data Engineering is chosen, the main screen will contain the options for creating a lakehouse, notebook, or Spark job definition.

The various items you can create in Fabric are listed below:

There is also a very impressive new feature called OneLake, which is a single SaaS lake for the whole organization. There is no need for you to create this data lake as it is provisioned automatically with your tenant. When you create a workspace, a folder is created in OneLake storage (ADLS Gen2 behind the scenes) on your customer tenant. All workloads automatically store their data in OneLake workspace folders in delta format. Think of it as a OneDrive for data. Even better, you can create shortcuts within OneLake that point to other data locations, such as ADLS Gen2 or even AWS S3 and Google Storage (coming soon). A shortcut is nothing more than a symbolic link which points from one data location to another, just like the shortcuts you create in Windows. The data will appear in the shortcut location as if it were physically there. Your OneLake becomes a logical container that can point to many physical containers, so you can think of it as an abstraction layer or a virtualization layer. This means you can use your existing data lakes within Fabric.

By adopting OneLake as the store and delta as the common format for all workloads, Microsoft offers customers a data stack that’s unified at the most fundamental level. Customers do not need to maintain different copies of data for databases, data lakes, data warehousing, business intelligence, or real-time analytics. Instead, a single copy of the data in OneLake can directly power all the workloads.

If you already have Power BI, you can try Microsoft Fabric today by having your Power BI admin turn it on via the admin portal in your Power BI tenant. There is a free trial period that lasts until Fabric is GA’d and is then extended another 60 days.

If you do not have Power BI, you can sign up for the Microsoft Fabric free trial.

I know this announcement will lead to a lot of questions, and I will be posting blogs over the next few months that will hopefully answer most of those questions. In the meantime, please post your most pressing questions in the comment section below and I will answer them directly or with a blog post.

Check out the Fabric blog posts by Microsoft and the Fabric documentation, or ask questions at the Fabric Community.

And one more thing: coming soon is Copilot in Fabric. You can use conversational language to create dataflows and data pipelines, generate code and entire functions, build machine learning models, or visualize results. Check it out: video!

More info:

Why Microsoft is combining all its data analytics products into Fabric

Microsoft Fabric intends to be a data platform for the AI era

The post Build announcement: Microsoft Fabric first appeared on James Serra's Blog.

Microsoft Fabric introduction video

I blogged about Microsoft Fabric a few weeks ago and wanted to follow up with an introduction video that covers the basics, so hopefully you will understand the major features of Fabric at a high level.

You can find the video here. The deck used in the video can be found here.

Here is the video abstract:

Microsoft Fabric is the next version of Azure Data Factory, Azure Data Explorer, Azure Synapse Analytics, and Power BI. It brings all of these capabilities together into a single unified analytics platform that goes from the data lake to the business user in a SaaS-like environment. Therefore, the vision of Fabric is to be a one-stop shop for all the analytical needs of every enterprise and one platform for everyone from a citizen developer to a data engineer. Fabric will cover the complete spectrum of services including data movement, data lake, data engineering, data integration and data science, observational analytics, and business intelligence. With Fabric, there is no need to stitch together different services from multiple vendors. Instead, the customer enjoys an end-to-end, highly integrated, single offering that is easy to understand, onboard, create, and operate.

This is a hugely important new product from Microsoft and I will simplify your understanding of it via a presentation and demo.

Agenda:
What is Microsoft Fabric?
Workspaces and capacities
OneLake
Lakehouse
Data Warehouse
ADF
Power BI / DirectLake
Resources

The post Microsoft Fabric introduction video first appeared on James Serra's Blog.