
Three more chapters of my data architecture book are available!


As I have mentioned in a prior blog post, I have been writing a data architecture book, which I started last November. The title of the book is “Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh” and it is being published by O’Reilly.

There are now five chapters and the preface available in their Early Release program:

  1. Big Data
  2. Types of Data Architectures
  3. The Architecture Design Session
  4. The Relational Data Warehouse
  5. Data Lake


It’s 80 printed pages. Check it out here! You can expect to see two additional chapters appear each month. This is a great way to start reading the book without having to wait until the entire book is done. Note you have to have an O’Reilly subscription to access it, or start a free 10-day trial. The site has the release date for the full book as May 2024, but I’m expecting it to be available by the end of this year. Please send me any feedback on the book to jamesserra3@gmail.com. Would love to hear what you think!

The post Three more chapters of my data architecture book are available! first appeared on James Serra's Blog.

Microsoft Fabric: Lakehouse vs Warehouse video


Since Microsoft Fabric became available, I blogged about it and have an introduction video on Fabric that you can view here. I wanted to follow up with a short 30-minute video on the biggest point of confusion I see with Fabric, and that is the difference between Lakehouse and Warehouse.

You can find the video here. The deck used in the video can be found here.

Here is the video abstract:

In Microsoft Fabric, I see a lot of confusion about the differences between Lakehouse and Warehouse – when to use what. I created this video to give a brief overview of each, then dive into the differences that hopefully will clear things up for you.

Hope this helps!

The post Microsoft Fabric: Lakehouse vs Warehouse video first appeared on James Serra's Blog.

Serving layers with a data lake


Data lakes typically have three layers: raw, cleaned, and presentation (also called bronze, silver, and gold if using the medallion architecture popularized by Databricks). I talk about this in my prior blog post on Data lake architecture. Many times, companies will create a fourth layer outside of the data lake that I call the relational serving layer. I’ve been having conversations recently with companies about the need for another type of fourth layer, which I will call the physical serving layer. In this blog post I’ll discuss both the relational serving layer and the physical serving layer.

Typically, once data is in the presentation layer, it is “ready to go”. From there the data can be shared via a tool like Azure Data Share, or an end-user can access the data in the presentation layer directly with a tool like Power BI.

Because a data lake is schema-on-read, the schema is applied to the data at the time it is read, rather than beforehand. A data lake is a folder-and-file system with no built-in context about what the data is, unlike a relational database, which has a metadata layer tied directly to the data. So, you might want to create a “serving layer” on top of the data in the data lake that ties metadata directly to the data. To make it easier for an end-user to find and understand the data, you will likely want to present it in the form of a relational data model, hence creating a relational serving layer on top of the data. If done correctly, the end-user will have no idea they are actually pulling data from a data lake – they will think it is coming from a relational data warehouse.

The relational serving layer can be created in many ways, such as a SQL view, a dataset in a reporting tool like Power BI, an Apache Hive table, or an ad-hoc SQL query. With this layer you can also define the relationships between files if you need to join more than one together. This is a common practice because defined relationships do not exist within a data lake, where each file is its own isolated island.

As an example of using a relational serving layer, many companies will create SQL views on top of files in the data lake, then use a reporting tool to call those views, making it easy for an end-user to create reports or dashboards because it gives the end-user a “relational model” on top of the data instead of seeing everything in a folder-file format.
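
As a hedged illustration of that pattern, here is a minimal sketch of a relational serving layer built as a SQL view over Parquet files in the gold layer, created through pyodbc against a Synapse serverless SQL pool. The server, storage account, container, and view names are placeholders I made up for the example, not anything from a real environment.

```python
# Minimal sketch: expose gold-layer Parquet files as a relational view so reporting
# tools see a "table" instead of a folder of files. Names below are illustrative.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"   # hypothetical serverless endpoint
    "DATABASE=ServingLayer;"
    "Authentication=ActiveDirectoryInteractive;"
)

create_view = """
CREATE VIEW dbo.vw_Sales AS
SELECT *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/gold/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;
"""

with conn.cursor() as cur:
    cur.execute(create_view)   # the view now presents lake files as a relational object
conn.commit()
```

A reporting tool like Power BI can then connect to the serverless endpoint and query dbo.vw_Sales as if it were a table in a relational data warehouse.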

Another type of serving layer could be a physical serving layer. This is where data is copied from the presentation layer in the data lake to one or more products to make it easier for end-users to access – products such as Azure Cosmos DB, Azure SQL Database, or a graph database, just to name a few. This helps satisfy different lines of business that have different needs for the data. For example, say a department within your company is very familiar with Azure SQL Database and has built applications and uses reporting tools that go against data in an Azure SQL Database. Instead of having that department pull data from the data lake into an Azure SQL Database that resides within the department, IT does it for them, but puts the data into an Azure SQL Database that IT maintains in the physical serving layer (you could think of that data as a “datamart“). This approach eliminates the work the department would need to do to get the data ready, and they can spend that extra time getting more insights from the data.
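
To make the physical serving layer concrete, here is a hedged sketch that reads a gold-layer Delta table and writes it to an Azure SQL Database over JDBC using PySpark. The paths, database, table names, and credentials are all placeholders, and a real pipeline might instead use a tool such as Azure Data Factory or incremental loads rather than a full overwrite.

```python
# Sketch: copy a gold-layer table from the data lake into an Azure SQL Database
# that IT maintains as a departmental "datamart" in the physical serving layer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("physical-serving-layer").getOrCreate()

# Read the curated (gold/presentation) data from the lake.
gold_df = spark.read.format("delta").load(
    "abfss://gold@mydatalake.dfs.core.windows.net/sales"
)

# Write it to the serving database; secrets should come from a key vault in practice.
(gold_df.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=SalesMart")
    .option("dbtable", "dbo.Sales")
    .option("user", "etl_user")
    .option("password", "<secret>")
    .mode("overwrite")            # full refresh for simplicity; incremental loads are common
    .save())
```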

So, what has to be decided is in which cases IT should do the work of getting the data into the needed format, and in which cases a department or end-user should do it themselves. One consideration is how many people will use the data in the physical serving layer – just a few people in one department, or many people across multiple departments? Does IT have the resources and availability to do this work, or would they become a bottleneck because their resources are constrained?

You should also consider the benefits of having a physical serving layer:

  • It can help control costs, especially if multiple departments need the data in this new format so you are preventing duplicate data
  • It could help improve query and report performance since IT likely has better expertise and can tune the resulting data in the physical serving layer
  • It helps with data lineage since the physical serving layer is controlled by IT as opposed to a department taking the data and storing it in a place IT may not have access to
  • It helps with data governance and security since IT can incorporate the physical serving layer into their governance policies and security environment as opposed to hoping each department has the proper governance and security in place

Previously I created a short YouTube video (20 minutes) that is a whiteboarding session describing the five stages (ingest, store, transform, model, visualize) that make up a modern data warehouse (MDW) and the Azure products you can use for each stage. You can view the video here. The relational serving layer and physical serving layer change things at the model stage: instead of an RDBMS such as an Azure Synapse dedicated pool that all departments and end-users access, you can use a relational serving layer that removes the need to copy the data into an RDBMS, giving you a true data lakehouse. You still might utilize a physical serving layer that does have an RDBMS, but that would be for specific departmental use cases that IT builds for them instead of the department having to build and maintain it themselves.

One thing to point out is that within the presentation or gold layer you could have multiple copies of the same data in different formats, such as 3rd-normal form or star schemas, Microsoft Fabric lakehouses or warehouses, or copies to help with performance or cost savings via features like object tiering. These are all happening within the data lake, as opposed to the physical serving layer that is outside the data lake. Hopefully I’m not confusing things too much 🙂

I’d be remiss if I did not mention that you need to pay close attention to data governance, which becomes even more important when you have serving layers. Using a product like Microsoft Purview is the easy part – implementing best practices for data governance is what will be challenging and time consuming.

I hope this helps to give you some ideas on how to make things better for the consumers of the data lake!

The post Serving layers with a data lake first appeared on James Serra's Blog.

Nine chapters of my data architecture book are available


As I have mentioned in prior blog posts, I have been writing a data architecture book, which I started last November. The title of the book is “Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh” and it is being published by O’Reilly. I have spent a ton of time writing the book and it’s getting close to being finished.

There are now nine chapters and the preface available in their Early Release program. The book will have 16 chapters. Here is the TOC so far:

  1. Big Data
    • What is Big Data and how can it help you?
    • Data maturity
    • Self-Service Business Intelligence
    • Summary
  2. Types of Data Architectures
    • Evolution of data architectures
    • Summary
  3. The Architecture Design Session
    • What is an ADS?
    • Why hold an ADS?
    • Before the ADS
    • Conducting the ADS
    • After the ADS
    • Tips
    • Summary
  4. The Relational Data Warehouse
    • What is a relational data warehouse?
    • The top-down approach
    • Why use a relational data warehouse
    • Drawbacks to using a relational data warehouse
    • Populating a data warehouse
    • The death of the relational data warehouse has been greatly exaggerated
    • Summary
  5. Data Lake
    • What is a data lake?
    • Why use a data lake?
    • Bottoms-up approach
    • Best practices for data lake design
    • Multiple data lakes
    • Summary
  6. Data Storage Solutions and Process
    • Data storage solutions
    • Data processes
    • Summary
  7. Approaches to Design
    • Online transaction processing (OLTP) versus online analytical processing (OLAP)
    • Operational and analytical data
    • Symmetric multiprocessing (SMP) and massively parallel processing (MPP)
    • Lambda architecture
    • Kappa architecture
    • Polyglot persistence and polyglot data stores
    • Summary
  8. Approaches to Data Modeling
    • Relational modeling
    • Dimensional Modeling
    • Common Data Model (CDM)
    • Data Vault
    • The Kimball and Inmon data warehouse methodologies
    • Summary
  9. Approaches to Data Ingestion
    • ETL versus ELT
    • Reverse ETL
    • Data governance
    • Summary

      Future chapters:
    • Modern data warehouse
    • Data fabric
    • Data lakehouse
    • Data mesh foundation
    • Data mesh adoption
    • People and process
    • Technologies

It’s 147 printed pages so far. Check it out here. You can expect to see two additional chapters appear every few weeks. This is a great way to start reading the book without having to wait until the entire book is done. Note you have to have an O’Reilly subscription to access it, or start a free 10-day trial. The site has the release date for the full book as May 2024, but I’m expecting it to be available by the end of this year. Please send me any feedback on the book to jamesserra3@gmail.com. Would love to hear what you think!

The post Nine chapters of my data architecture book are available first appeared on James Serra's Blog.

Eleven chapters of my data architecture book are available


As I have mentioned in prior blog posts, I have been writing a data architecture book, which I started last November. The title of the book is “Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh” and it is being published by O’Reilly. I have spent a ton of time writing the book and it’s getting close to being finished – all the chapters are written and only five chapters left to edit. The fully finished book should be available for download or for a printed copy by the end of year.

There are now eleven chapters and the preface available in their Early Release program. The book will have 16 chapters. Here is the TOC so far:

  1. Big Data
    • What is Big Data and how can it help you?
    • Data maturity
    • Self-Service Business Intelligence
    • Summary
  2. Types of Data Architectures
    • Evolution of data architectures
    • Summary
  3. The Architecture Design Session
    • What is an ADS?
    • Why hold an ADS?
    • Before the ADS
    • Conducting the ADS
    • After the ADS
    • Tips
    • Summary
  4. The Relational Data Warehouse
    • What is a relational data warehouse?
    • The top-down approach
    • Why use a relational data warehouse
    • Drawbacks to using a relational data warehouse
    • Populating a data warehouse
    • The death of the relational data warehouse has been greatly exaggerated
    • Summary
  5. Data Lake
    • What is a data lake?
    • Why use a data lake?
    • Bottoms-up approach
    • Best practices for data lake design
    • Multiple data lakes
    • Summary
  6. Data Storage Solutions and Process
    • Data storage solutions
    • Data processes
    • Summary
  7. Approaches to Design
    • Online transaction processing (OLTP) versus online analytical processing (OLAP)
    • Operational and analytical data
    • Symmetric multiprocessing (SMP) and massively parallel processing (MPP)
    • Lambda architecture
    • Kappa architecture
    • Polyglot persistence and polyglot data stores
    • Summary
  8. Approaches to Data Modeling
    • Relational modeling
    • Dimensional Modeling
    • Common Data Model (CDM)
    • Data Vault
    • The Kimball and Inmon data warehouse methodologies
    • Summary
  9. Approaches to Data Ingestion
    • ETL versus ELT
    • Reverse ETL
    • Data governance
    • Summary
  10. The Modern Data Warehouse
    • The MDW Architecture
    • Pros and Cons of the MDW Architecture
    • Combining the RDW and Data Lake
    • Stepping Stones to the MDW
    • Case Study: Wilson & Gunkerk’s Strategic Shift to an MDW
    • Summary
  11. Data Fabric
    • The Data Fabric Architecture
    • Why Transition from an MDW to a Data Fabric Architecture?
    • Potential Drawbacks
    • Summary
  12. Data Lakehouse (future)
    • Delta lake features
    • Performance improvements
    • What if you skip the relational data warehouse?
    • Relational serving layer
    • Summary
  13. Data mesh foundation (future)
    • A decentralized data architecture
    • Data mesh hype
    • Dehghani’s four principles of a data mesh
    • The “pure” data mesh
    • Data domains
    • Different topologies
    • Data mesh compared to data fabric
    • Use cases
    • Summary
  14. Should you adopt data mesh? Myths, concerns, and the future (future)
    • Myths
    • Concerns
    • Organizational assessment: Should you adopt a data mesh?
    • Recommendations for implementing a successful data mesh
    • Conclusion: the future of data mesh
  15. People and process (future)
    • Team organization: Roles and responsibilities
    • Roles for MDW, data fabric, or data lakehouse
    • Roles for data mesh
    • Why projects fail: Pitfalls and prevention
    • Why projects succeed
    • Conclusion
  16. Technologies (future)
    • Open source
    • Hadoop
    • Benefits of the cloud
    • Major cloud providers
    • Multi-cloud
    • Databricks
    • Snowflake
    • Summary


It’s 172 printed pages so far. Check it out here. Chapter 12 should appear in the next two weeks, then chapters 13-14 a few weeks after that, followed by 15-16 a few weeks after that. Then the book will be updated by a grammar editor and the figures redrawn, and then it’s off to the presses!

This is a great way to start reading the book without having to wait until the entire book is done. Note you have to have an O’Reilly subscription to access it, or start a free 10-day trial. The site has the release date for the full book as May 2024, but I’m expecting it to be available by the end of this year. Please send me any feedback on the book to jamesserra3@gmail.com. Would love to hear what you think!

The post Eleven chapters of my data architecture book are available first appeared on James Serra's Blog.

Microsoft Fabric roadmap


Microsoft Fabric is an awesome product that has now been in public preview for five months. If you are not familiar with it, check out my recent video where I provide a Microsoft Fabric introduction. Also, an excellent training course has just been released to learn all about Fabric: Microsoft Fabric Complete Guide – Future of Data with Fabric.

Just released was the Microsoft Fabric roadmap that you can check out at https://aka.ms/FabricRoadmap. It’s great to see Microsoft be transparent on what features they are working on and when they will be available.

Here are my top 18 features on the roadmap that I am most excited about (in the order found in the roadmap):

Admin and governance:

Purview hub for administrators and data owners – Public preview

Estimated release timeline: Q4 2023

Fabric admins (Q3 2023) and data owners (Q4 2023) get Purview hubs that contain insights about sensitive data and about certified and promoted items, along with a gateway to advanced capabilities in the Microsoft Purview portals.

Purview data loss prevention policies for schematized data in OneLake

Estimated release timeline: Q1 2024

Compliance admins can use Microsoft Purview Data Loss Prevention (DLP) policies to detect the upload of sensitive data (such as a social security number) to OneLake. If such an upload is detected, the policies will trigger an automatic policy tip that is visible to data owners, and they can also trigger an alert for compliance admins. DLP policies can automate the compliance processes needed to meet enterprise-scale compliance and regulatory requirements in an effective way.

Microsoft Fabric user REST APIs

Estimated release timeline: Q4 2023

Deliver a user-friendly, standardized API for Fabric’s core functionality and experience APIs, ensuring ease of use for developers. The well-documented Fabric REST API includes authentication, authorization, version control, policy enforcement, and error handling. Additionally, developers can use existing protocol-specific APIs like XMLA and TDS. Some examples include Workspace and capacity management, CRUD operations on items, and permission management.
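
As a purely illustrative sketch of what calling such an API could look like, here is a Python snippet that lists workspaces with a bearer token. The endpoint shown is my assumption based on the description above; check the official Fabric REST API documentation for the exact routes and payloads.

```python
# Hypothetical example: list Fabric workspaces via the user REST API.
import requests

token = "<Azure AD access token scoped to the Fabric API>"   # e.g. acquired with MSAL

resp = requests.get(
    "https://api.fabric.microsoft.com/v1/workspaces",        # assumed endpoint
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()

for ws in resp.json().get("value", []):
    print(ws.get("id"), ws.get("displayName"))
```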

OneLake:

OneLake security model for tables and files (public preview)

Estimated release timeline: Q2 2024

Managing data security across multiple analytical engines and copies of data is challenging. OneLake and Fabric simplify this by enabling the use of a single data copy across multiple analytical engines without any data movement or duplication. Taking the “one copy” concept further, OneLake is also enhancing security with a finer-grained model, allowing security to be defined directly on tables and folders. These security definitions live with the data and travel across shortcuts to wherever the data is used. Security defined at OneLake is universally enforced no matter which analytical engine is used to access the data.

Folders in workspaces

Estimated release timeline: Preview in Q4 2023

Introducing folders in the workspace allows you to better organize and find items. The preview of this feature will provide the organizational capabilities of folders. Subsequent updates will address folder-related permission management scenarios.

Synapse – Data Warehouse:

Data warehouse SQL security enhancements

Estimated release timeline: Q4 2023 (available now: Announcing: Column-Level & Row-Level Security for Fabric Warehouse & SQL Endpoint | Microsoft Fabric Blog | Microsoft Fabric)

You can define granular row-level security for data in the data warehouse, ensuring restricted access and appropriate viewing based on entitlements.
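
For readers wondering what this looks like in practice, here is a hedged sketch using the standard T-SQL row-level security pattern (a predicate function plus a security policy) run against the warehouse SQL endpoint via pyodbc. The schema, table, and column names are invented for the example; confirm the exact supported syntax in the Fabric documentation.

```python
# Sketch: row-level security on a warehouse table so each sales rep only sees their rows.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-warehouse-sql-endpoint>;DATABASE=SalesWarehouse;"
    "Authentication=ActiveDirectoryInteractive;"
)

statements = [
    "CREATE SCHEMA rls;",
    """
    CREATE FUNCTION rls.fn_filter_sales(@SalesRep AS varchar(128))
    RETURNS TABLE
    WITH SCHEMABINDING
    AS
    RETURN SELECT 1 AS allowed WHERE @SalesRep = USER_NAME();
    """,
    """
    CREATE SECURITY POLICY rls.SalesFilter
    ADD FILTER PREDICATE rls.fn_filter_sales(SalesRep) ON dbo.Sales
    WITH (STATE = ON);
    """,
]

cur = conn.cursor()
for stmt in statements:
    cur.execute(stmt)     # rows in dbo.Sales are now filtered per user at query time
conn.commit()
```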

Synapse – Data Engineering:

Lakehouse data security (Public Preview)

Estimated release timeline: Q2 2024

You’ll have the ability to apply file, folder, and table (or object-level) security in the lakehouse. You can also control who can access data in the lakehouse and the level of permissions they have. For example, you can grant read permissions on files, folders, and tables. Once permissions are applied, they’re automatically synchronized across all engines, which means permissions will be consistent across Spark, SQL, Power BI, and external engines.

Schema support for Lakehouse

Estimated release timeline: Q2 2024

The lakehouse will support a three-part naming convention, enabling you to add schemas to your lakehouses, which is consistent with the current warehouse experience.

Policy management

Estimated release timeline: Q2 2024

Workspace admins will be able to author and enforce policies based on Spark properties, ensuring that your workloads comply with certain rules. For example, they can limit the number of resources, the time that a workload can consume, or prevent users from changing certain Spark settings. This will enhance the governance and security of your Spark workloads.

Copilot integration in notebooks (Public Preview)

Estimated release timeline: Q4 2023

You’ll be able to use Copilot in notebooks to chat about your data, get code suggestions, and debug your code. Copilot will be data aware, meaning it will have context about the lakehouse tables and schemas. Copilot is a smart and helpful assistant for data engineering tasks.

Dynamic lineage of data engineering items

Estimated release timeline: Q4 2023

You will be able to trace lineage within Fabric across code items such as notebooks and Spark jobs, and data items such as a lakehouse. This lineage will be dynamic, meaning that if the code adds or removes references to lakehouses, it will be reflected in the lineage view.

Synapse – Data Science:

Semantic link

Estimated release timeline: Q4 2023 (available now: Semantic link in Microsoft Fabric: Bridging BI and Data Science | Microsoft Fabric Blog | Microsoft Fabric)

Semantic link bridges the gap between data science and BI by providing a Python library (SemPy) that enables data scientists to interact with Power BI datasets and measures. You can use SemPy to read, explore, query, and validate data in Power BI from Python notebooks, and use the library’s features to detect and resolve data challenges. Users can also write back to the Power BI dataset through the lakehouse with Direct Lake mode.
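
Here is a small, hedged example of the kind of exploration SemPy enables, run from a Fabric Python notebook. The dataset, table, measure, and column names are placeholders, and the exact function signatures may differ slightly from this sketch.

```python
# Illustrative SemPy usage: explore a Power BI semantic model from a notebook.
import sempy.fabric as fabric

# List the semantic models (datasets) visible to the notebook.
print(fabric.list_datasets().head())

# Read a table from a dataset into a pandas-style FabricDataFrame.
sales = fabric.read_table("Sales Model", "Sales")        # hypothetical dataset/table
print(sales.describe())

# Evaluate a Power BI measure grouped by a column.
revenue = fabric.evaluate_measure(
    "Sales Model", measure="Total Revenue", groupby_columns=["Region"]
)
print(revenue)
```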

Synapse – Real-Time Analytics:

SQL native support in KQL querysets

Estimated release timeline: Q2 2024

This feature enables customers to use a native SQL editor to run SQL over KQL databases in a queryset, alongside using KQL. With this capability, customers are able to use the SQL editor’s native capabilities, such as syntax highlighting, suggestions, and more.

Co-pilot (Preview)

Estimated release timeline: Q4 2023

KQL Co-pilot allows you to write queries in natural language and have them translated into Kusto Query Language (KQL). You can use Co-pilot to ask how-to questions, explore your data in a KQL database, and create Kusto entities such as tables, functions, and materialized views.

Create actions and alerts with Data Activator

Estimated release timeline: Q4 2023

This feature provides a low-code/no-code experience to drive actions and alerts from your KQL database data. Data Activator gives you a single place to define actionable patterns in your data. These patterns can range from simple thresholds (such as a value being exceeded) to more complex patterns over time (such as a value trending down). When Data Activator detects an actionable pattern, it triggers an action. That action can be an email or a Teams alert to the relevant person in your organization. It can also trigger an automatic process, via a Power Automate flow or an action in one of your organization’s line-of-business apps.

Data Factory:

Fast Copy support in Dataflow Gen2

Estimated release timeline: Q1 2024

We’re adding support for large-scale data ingestion directly within the Dataflow Gen2 experience, utilizing the pipelines Copy Activity capability. This supports sources such as Azure SQL Databases, CSV, and Parquet files in Azure Data Lake Storage and Blob Storage.

This enhancement significantly scales up the data processing capacity of Dataflow Gen2, providing high-scale ELT (Extract-Load-Transform) capabilities.

Copilot in Data Factory

Estimated release timeline: Q4 2023

You’ll be able to use copilot with dataflows and data pipelines in Data Factory. Copilot in Data Factory empowers both citizen and professional developers to build simple to complex dataflows and pipelines using natural language. You’ll be able to work together with Copilot to iteratively develop dataflows and data pipelines for your data integration needs.

Data Activator: (available now: Announcing the Data Activator public preview)

Metrics, triggers, and actions

Estimated release timeline: Q2 2024

In addition to monitoring business objects, Data Activator will let you define key metrics derived from your data stream. These metrics can trigger actions based on aggregated values across multiple dimensions at specific time intervals.

The post Microsoft Fabric roadmap first appeared on James Serra's Blog.

My data architecture book now has 15 chapters available!


Only one more chapter to go! As I have mentioned in prior blog posts, I have been writing a data architecture book, which I started last November. The title of the book is “Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh” and it is being published by O’Reilly. The fully finished book should be available for download or for a printed copy by the end of January. The cover will feature the Discus fish (O’Reilly is known for having different animals on their cover).

15 of 16 chapters are out. Here is the likely final TOC:

  1. Big Data
    • What is Big Data and how can it help you?
    • Data maturity
    • Self-Service Business Intelligence
    • Summary
  2. Types of Data Architectures
    • Evolution of data architectures
    • Relational Data Warehouse
    • Data Lake
    • Modern Data Warehouse
    • Data Fabric
    • Data Lakehouse
    • Data Mesh
    • Summary
  3. The Architecture Design Session
    • What is an ADS?
    • Why hold an ADS?
    • Before the ADS
    • Conducting the ADS
    • After the ADS
    • Tips
    • Summary
  4. The Relational Data Warehouse
    • What is a relational data warehouse?
    • The top-down approach
    • Why use a relational data warehouse
    • Drawbacks to using a relational data warehouse
    • Populating a data warehouse
    • The death of the relational data warehouse has been greatly exaggerated
    • Summary
  5. Data Lake
    • What is a data lake?
    • Why use a data lake?
    • Bottoms-up approach
    • Best practices for data lake design
    • Multiple data lakes
    • Summary
  6. Data Storage Solutions and Process
    • Data storage solutions
    • Data processes
    • Summary
  7. Approaches to Design
    • Online transaction processing (OLTP) versus online analytical processing (OLAP)
    • Operational and analytical data
    • Symmetric multiprocessing (SMP) and massively parallel processing (MPP)
    • Lambda architecture
    • Kappa architecture
    • Polyglot persistence and polyglot data stores
    • Summary
  8. Approaches to Data Modeling
    • Relational modeling
    • Dimensional Modeling
    • Common Data Model (CDM)
    • Data Vault
    • The Kimball and Inmon data warehouse methodologies
    • Summary
  9. Approaches to Data Ingestion
    • ETL versus ELT
    • Reverse ETL
    • Data governance
    • Summary
  10. The Modern Data Warehouse
    • The MDW Architecture
    • Pros and Cons of the MDW Architecture
    • Combining the RDW and Data Lake
    • Stepping Stones to the MDW
    • Case Study: Wilson & Gunkerk’s Strategic Shift to an MDW
    • Summary
  11. Data Fabric
    • The Data Fabric Architecture
    • Why Transition from an MDW to a Data Fabric Architecture?
    • Potential Drawbacks
    • Summary
  12. Data Lakehouse
    • Delta lake features
    • Performance improvements
    • The data lakehouse architecture
    • What if you skip the relational data warehouse?
    • Relational serving layer
    • Summary
  13. Data mesh foundation
    • A decentralized data architecture
    • Data mesh hype
    • Dehghani’s four principles of a data mesh
    • The “pure” data mesh
    • Data domains
    • Data mesh logical architecture
    • Example domains
    • Summary
  14. Should you adopt data mesh? Myths, concerns, and the future
    • Myths
    • Concerns
    • Organizational assessment: Should you adopt a data mesh?
    • Recommendations for implementing a successful data mesh
    • The future of data mesh
    • Conclusion: Zooming out: understanding data architectures and their application
  15. People and process
    • Team organization: Roles and responsibilities
    • Why projects fail: Pitfalls and prevention
    • Why projects succeed
    • Conclusion
  16. Technologies
    • Choosing a platform
    • Cloud service models
    • Software frameworks
    • Conclusion


It’s 236 printed pages so far. Check it out here. Soon the book will go into “production”, where it will be updated by a grammar editor, the figures redrawn, the TOC and index created, and the cover drawn, and then it’s off to the presses!

This is a great way to start reading the book without having to wait until the entire book is done. Note you have to have an O’Reilly subscription to access it, or start a free 10-day trial. Please send me any feedback on the book to jamesserra3@gmail.com. Would love to hear what you think!

The post My data architecture book now has 15 chapters available! first appeared on James Serra's Blog.

All the chapters in my data architecture book are now available!


As I have mentioned in prior blog posts, I have been writing a data architecture book, which I started last November. The title of the book is “Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh” and it is being published by O’Reilly.

All 16 chapters are now available in the O’Reilly Early Release program. It’s 258 printed pages. Check it out here. The Early Release program is a great way to start reading the book without having to wait until the entire book is done. Note you have to have an O’Reilly subscription to access it, or start a free 10-day trial, which gives you access to not only my book, but all the books published by O’Reilly and multiple other publishers.

My book now goes off to “production” where it will be thoroughly grammar checked, get a TOC and index, and have all the pictures redrawn. That process usually takes 2-3 months. The fully finished book should be available for download or for a printed copy by the end of January. The new cover below features the Discus fish (O’Reilly is known for having different animals on their cover).

Please send me any feedback on the book to jamesserra3@gmail.com. Would love to hear what you think!

Here is the likely final TOC:

  1. Big Data
    • What is Big Data and how can it help you?
    • Data maturity
    • Self-Service Business Intelligence
    • Summary
  2. Types of Data Architectures
    • Evolution of data architectures
    • Relational Data Warehouse
    • Data Lake
    • Modern Data Warehouse
    • Data Fabric
    • Data Lakehouse
    • Data Mesh
    • Summary
  3. The Architecture Design Session
    • What is an ADS?
    • Why hold an ADS?
    • Before the ADS
    • Conducting the ADS
    • After the ADS
    • Tips
    • Summary
  4. The Relational Data Warehouse
    • What is a relational data warehouse?
    • The top-down approach
    • Why use a relational data warehouse
    • Drawbacks to using a relational data warehouse
    • Populating a data warehouse
    • The death of the relational data warehouse has been greatly exaggerated
    • Summary
  5. Data Lake
    • What is a data lake?
    • Why use a data lake?
    • Bottoms-up approach
    • Best practices for data lake design
    • Multiple data lakes
    • Summary
  6. Data Storage Solutions and Process
    • Data storage solutions
    • Data processes
    • Summary
  7. Approaches to Design
    • Online transaction processing (OLTP) versus online analytical processing (OLAP)
    • Operational and analytical data
    • Symmetric multiprocessing (SMP) and massively parallel processing (MPP)
    • Lambda architecture
    • Kappa architecture
    • Polyglot persistence and polyglot data stores
    • Summary
  8. Approaches to Data Modeling
    • Relational modeling
    • Dimensional Modeling
    • Common Data Model (CDM)
    • Data Vault
    • The Kimball and Inmon data warehouse methodologies
    • Summary
  9. Approaches to Data Ingestion
    • ETL versus ELT
    • Reverse ETL
    • Data governance
    • Summary
  10. The Modern Data Warehouse
    • The MDW Architecture
    • Pros and Cons of the MDW Architecture
    • Combining the RDW and Data Lake
    • Stepping Stones to the MDW
    • Case Study: Wilson & Gunkerk’s Strategic Shift to an MDW
    • Summary
  11. Data Fabric
    • The Data Fabric Architecture
    • Why Transition from an MDW to a Data Fabric Architecture?
    • Potential Drawbacks
    • Summary
  12. Data Lakehouse
    • Delta lake features
    • Performance improvements
    • The data lakehouse architecture
    • What if you skip the relational data warehouse?
    • Relational serving layer
    • Summary
  13. Data mesh foundation
    • A decentralized data architecture
    • Data mesh hype
    • Dehghani’s four principles of a data mesh
    • The “pure” data mesh
    • Data domains
    • Data mesh logical architecture
    • Example domains
    • Summary
  14. Should you adopt data mesh? Myths, concerns, and the future
    • Myths
    • Concerns
    • Organizational assessment: Should you adopt a data mesh?
    • Recommendations for implementing a successful data mesh
    • The future of data mesh
    • Conclusion: Zooming out: understanding data architectures and their application
  15. People and process
    • Team organization: Roles and responsibilities
    • Why projects fail: Pitfalls and prevention
    • Why projects succeed
    • Conclusion
  16. Technologies
    • Choosing a platform
    • Cloud service models
    • Software frameworks
    • Conclusion


The post All the chapters in my data architecture book are now available! first appeared on James Serra's Blog.

My data architecture book is now available for pre-order!


For those looking to be first in line when my book is available for a printed copy, I’m very excited to let you know that it is now available for pre-order on Amazon at Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh: 9781098150761: Serra, James. Note that I’m expecting the book to be available a couple months before the listed April 2nd date.

And for those at the PASS Summit this week, I’ll be doing two sessions:

Enhancing your Career: Building your Personal Brand
Thursday, Nov 16, 3:15 PM – 4:30 PM PST, room 609

In three years I went from a complete unknown to a popular blogger, speaker at PASS Summit, and a SQL Server MVP. Along the way I saw my yearly income triple. Is it because I know some secret? Is it because I am a genius? No! It is just about laying out your career path, setting goals, and doing the work. It’s about building your personal brand and stepping out of your comfort zone. It’s about overcoming your fear of taking risks. If you can do those things, you will be rewarded. I will discuss how you too can go from unknown to well-known. I will talk about building your community presence by blogging, presenting, writing articles and books, twitter, LinkedIn, certifications, interviewing, networking, and consulting and contracting. Your first step to enhancing your career will be to attend this session!

Data Warehousing Trends, Best Practices, and Future Outlook
Friday, Nov 17, 10:15 AM – 11:30 AM PST, room 606

Over the last decade, the 3Vs of data – Volume, Velocity & Variety – have grown massively. The Big Data revolution has completely changed the way companies collect, analyze & store data. Advancements in cloud-based data warehousing technologies have empowered companies to fully leverage big data without heavy investments both in terms of time and resources. But that doesn’t mean building and managing a cloud data warehouse isn’t accompanied by any challenges. From deciding on a service provider to the design architecture, deploying a data warehouse tailored to your business needs is a strenuous undertaking. Looking to deploy a data warehouse to scale your company’s data infrastructure? In this presentation you will gain insights into the current Data Warehousing trends, best practices, and future outlook. Learn how to build your data warehouse with the help of real-life use cases and a discussion of commonly faced challenges. You will learn:

  • Choosing the best solution – Data Lake vs. Data Warehouse vs. Data Mart
  • Choosing the best Data Warehouse design methodologies: Data Vault vs. Kimball vs. Inmon
  • A step-by-step approach to building an effective data warehouse architecture

I hope to see you at PASS! Stop me in the hallways if you see me and let’s chat!

The book cover and TOC:

Here is the likely final TOC:

  1. Big Data
    • What is Big Data and how can it help you?
    • Data maturity
    • Self-Service Business Intelligence
    • Summary
  2. Types of Data Architectures
    • Evolution of data architectures
    • Relational Data Warehouse
    • Data Lake
    • Modern Data Warehouse
    • Data Fabric
    • Data Lakehouse
    • Data Mesh
    • Summary
  3. The Architecture Design Session
    • What is an ADS?
    • Why hold an ADS?
    • Before the ADS
    • Conducting the ADS
    • After the ADS
    • Tips
    • Summary
  4. The Relational Data Warehouse
    • What is a relational data warehouse?
    • The top-down approach
    • Why use a relational data warehouse
    • Drawbacks to using a relational data warehouse
    • Populating a data warehouse
    • The death of the relational data warehouse has been greatly exaggerated
    • Summary
  5. Data Lake
    • What is a data lake?
    • Why use a data lake?
    • Bottoms-up approach
    • Best practices for data lake design
    • Multiple data lakes
    • Summary
  6. Data Storage Solutions and Process
    • Data storage solutions
    • Data processes
    • Summary
  7. Approaches to Design
    • Online transaction processing (OLTP) versus online analytical processing (OLAP)
    • Operational and analytical data
    • Symmetric multiprocessing (SMP) and massively parallel processing (MPP)
    • Lambda architecture
    • Kappa architecture
    • Polyglot persistence and polyglot data stores
    • Summary
  8. Approaches to Data Modeling
    • Relational modeling
    • Dimensional Modeling
    • Common Data Model (CDM)
    • Data Vault
    • The Kimball and Inmon data warehouse methodologies
    • Summary
  9. Approaches to Data Ingestion
    • ETL versus ELT
    • Reverse ETL
    • Data governance
    • Summary
  10. The Modern Data Warehouse
    • The MDW Architecture
    • Pros and Cons of the MDW Architecture
    • Combining the RDW and Data Lake
    • Stepping Stones to the MDW
    • Case Study: Wilson & Gunkerk’s Strategic Shift to an MDW
    • Summary
  11. Data Fabric
    • The Data Fabric Architecture
    • Why Transition from an MDW to a Data Fabric Architecture?
    • Potential Drawbacks
    • Summary
  12. Data Lakehouse
    • Delta lake features
    • Performance improvements
    • The data lakehouse architecture
    • What if you skip the relational data warehouse?
    • Relational serving layer
    • Summary
  13. Data mesh foundation
    • A decentralized data architecture
    • Data mesh hype
    • Dehghani’s four principles of a data mesh
    • The “pure” data mesh
    • Data domains
    • Data mesh logical architecture
    • Example domains
    • Summary
  14. Should you adopt data mesh? Myths, concerns, and the future
    • Myths
    • Concerns
    • Organizational assessment: Should you adopt a data mesh?
    • Recommendations for implementing a successful data mesh
    • The future of data mesh
    • Conclusion: Zooming out: understanding data architectures and their application
  15. People and process
    • Team organization: Roles and responsibilities
    • Why projects fail: Pitfalls and prevention
    • Why projects succeed
    • Conclusion
  16. Technologies
    • Choosing a platform
    • Cloud service models
    • Software frameworks
    • Conclusion

The post My data architecture book is now available for pre-order! first appeared on James Serra's Blog.

Microsoft Fabric is now GA!


After more than two years in development and six months in public preview, Microsoft Fabric is now generally available (GA). Here is the announcement made during Microsoft Ignite last week. If you are not familiar with Fabric, check out my blog Build announcement: Microsoft Fabric | James Serra’s Blog.

Make sure to check out the Microsoft Fabric roadmap at https://aka.ms/FabricRoadmap to be aware of those features that are not yet available in GA.

“General Availability” signifies that a product or service is fully developed, tested, and ready for production use. Once a product reaches GA, it has undergone thorough testing and improvements, addressing issues identified during the Public Preview. GA products are stable, fully functional, and backed by Microsoft’s support and Service Level Agreement (SLA) guarantees. This stage marks the official release of the service, ensuring reliability, performance standards, and compliance with relevant regulations, making it suitable for production environments.

Here are other major announcements made at Ignite about Fabric (the full list is here):

Mirroring, Microsoft Graph integration, disaster recovery

Mirroring is a new, frictionless way to add and manage existing cloud data warehouses and databases in Fabric’s Synapse Data Warehouse experience. Mirroring replicates a snapshot of the database to OneLake in Delta Parquet tables and keeps the replica in sync in near real time. Once the source database is attached, features like shortcuts, Direct Lake mode in Power BI, and the universal security model work instantly. Microsoft will soon enable Azure Cosmos DB, Azure SQL DB, Snowflake, and MongoDB customers to use mirroring to access their data in OneLake, with more data sources coming in 2024 such as SQL Server, Azure PostgreSQL, and Azure MySQL. More info. For participation in the early adopter program, submit your application here.

Microsoft is making it easier to analyze the vast amount of work data you have in Microsoft 365 with native integration into Microsoft Graph, the unified data model for products like Teams, Outlook, SharePoint, Viva Insights, and more. Previously, Microsoft 365 data was only offered in JSON format, but it’s now also offered in Delta Parquet format for easy integration into OneLake. Your Microsoft 365 data can now be seamlessly joined with other data sources in OneLake. More info.

Microsoft announced disaster recovery for Fabric. With this feature, your data in OneLake will be replicated across regions, ensuring availability of the data in case of regional outages. You will be able to choose which capacities need to be replicated via capacity-level configurations. BCDR for Power BI will be available by default as it is today and isn’t impacted by this disaster recovery capability.

Power BI semantic model support for Direct Lake on Synapse Data Warehouse

Power BI semantic models can now leverage Direct Lake mode in conjunction with Synapse Data Warehouses in Microsoft Fabric. Until now, Direct Lake mode was limited to semantic models on Fabric lakehouses, while warehouses were queried only in DirectQuery mode. Now, Microsoft has expanded Direct Lake mode to support warehouses in Fabric as well. For more information, read the Direct Lake in Power BI and Microsoft Fabric page in the product documentation.

Announcing public preview of stored credentials for Direct Lake semantic model row-level security and object-level security

Now in public preview are row-level security (RLS), object-level security (OLS), and stored credentials for Direct Lake semantic models. RLS and OLS are Power BI features that enable you to define row-level and object-level access rules in a semantic model, so different users can see different subsets of the data based on their roles and permissions. Stored credentials help reduce configuration complexity and are strongly recommended when using RLS and OLS with Direct Lake semantic models. You can add users to RLS roles in a Direct Lake model using the web modeling experience. The web modeling security roles dialog will be fully deployed in the coming days or weeks. For more information about how to set up stored credentials, see the Direct Lake product documentation. For RLS and OLS, see the articles Row-level security (RLS) with Power BI and Object level security (OLS).

Instantly integrate your import-mode semantic models into OneLake

In public preview is Microsoft OneLake integration for import models. With the flick of a switch, you can enable OneLake integration and automatically write data imported into your semantic models to delta tables in OneLake and enjoy the benefits of Fabric without any migration effort. The data is instantly and concurrently accessible through these delta tables. Data scientists, database administrators, app developers, data engineers, citizen developers, and any other type of data consumer now have direct access to the same data that drives your business intelligence. Enable OneLake integration to include these delta tables in your Lakehouses and Synapse Data Warehouses through shortcuts, enabling your users to use T-SQL, Python, Scala, PySpark, Spark SQL, R, and no-code and low-code solutions to query the data. More info.

Quickly answer your data questions with Explore

There is a new public preview feature called Explore that enables anyone to quickly explore a semantic model. Similar to exporting and building a PivotTable in Excel, you can open the Explore experience and create a matrix or visual view of your data. Analysts could use Explore to learn about a new semantic model before building a report, for example, or a business user could answer a specific question they have about the data without building an entire report. More info.

Minimize costs with new Microsoft Fabric licensing options

In June 2023, Microsoft announced pay-as-you-go prices for Fabric that allow you to dynamically scale up or scale down and pause capacity as needed. There is now reservation pricing for Fabric that will allow you to pre-commit Fabric Capacity Units in one-year increments, helping you save up to 40.5 percent over the pay-as-you-go prices (excluding Power BI Premium capacity SKUs). Microsoft also announced OneLake business continuity and disaster recovery (BCDR) and cache storage prices, expanding on the already announced OneLake storage pricing. Check out all these pricing options on the Microsoft Fabric pricing page.

Announcing the public preview of Copilot in Microsoft Fabric

Now in public preview is Copilot in Microsoft Fabric. It helps users get started quickly by creating reports in the Power BI web experience. Based on a high-level prompt, Copilot for Power BI in Fabric creates an entire report page for you by identifying the tables, fields, measures, and charts that would help you get started. You can then customize the page using the existing editing experiences. Copilot can also help you understand your semantic model and even suggest topics for your report pages. It’s a fast and easy way to get started with a report, especially if you’re less familiar with report creation in Power BI. Also new is a Narrative with Copilot visual. This visual summarizes the data and insights on the page, across your report, or even for your own template if you need to define a specific summary.

Lastly, in Power BI Desktop, Copilot can help model authors improve their models and save time. The first capability Microsoft is releasing in November 2023 helps authors generate synonyms for their fields, measures, and tables using Copilot. But this is just the start. Future Power BI Desktop updates will bring even more new Copilot experiences, including the report creation experience from the service, a Data Analysis Expressions (DAX) writing experience, and more.

The preview of Copilot in Microsoft Fabric will be rolling out in stages with the goal that all customers with Power BI Premium capacity (P1 or higher) or Fabric capacity (F64 or higher) will have access to the Copilot preview by the end of March 2024. You don’t need to sign up to join the preview, it will automatically become available to you as a new setting in the Fabric admin portal when it is rolled out to your tenant. When charging begins for the Copilot in Fabric experiences, you can simply count Copilot usage against your existing Fabric or Power BI Premium capacity. Check out the Copilot for Power BI documentation for complete instructions and requirements.

Synapse Data Warehouse – Query Insights

Query Insights (QI) is a scalable, sustainable, and extendable solution to enhance the SQL analytics experience. With historical query data, aggregated insights, and access to actual query text, you can analyze and tune your SQL queries.

Query Insights provides a central location for historic query data and actionable insights for 30 days, helping you to make informed decisions to enhance the performance of your Warehouse or SQL Endpoint. When a SQL query runs in Microsoft Fabric, Query Insights collects and consolidates its execution data asynchronously, providing you with valuable information. Admin, Member, and Contributor roles can access the feature. More info.
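
As a hedged sketch of how you might pull this execution history into a script, the snippet below queries a Query Insights view over pyodbc. The queryinsights.exec_requests_history view name follows the feature description above but should be verified against the Fabric documentation, and the connection details are placeholders.

```python
# Sketch: inspect recent query history collected by Query Insights.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-warehouse-sql-endpoint>;DATABASE=SalesWarehouse;"
    "Authentication=ActiveDirectoryInteractive;"
)

cur = conn.cursor()
cur.execute("SELECT TOP 20 * FROM queryinsights.exec_requests_history;")   # assumed view name

columns = [c[0] for c in cur.description]
for row in cur.fetchall():
    print(dict(zip(columns, row)))      # one dictionary per historical query execution
```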

What else was announced at Ignite?

For more information about what else was announced at Ignite concerning other products, check out the Microsoft Ignite 2023 Book of News and Microsoft Ignite 2023: AI transformation and the technology driving change.

More info:

Fabric workloads are now generally available! | Microsoft Fabric Blog | Microsoft Fabric

Empower Power BI users with Microsoft Fabric and Copilot | Microsoft Power BI Blog | Microsoft Power BI

The post Microsoft Fabric is now GA! first appeared on James Serra's Blog.

Microsoft Fabric – the great unifier


I’m seeing a lot of excitement from customers over Microsoft Fabric, now that it went GA a few weeks ago. One thing generating a lot of that excitement is using Fabric in a way that I call “the great unifier”. That is, using Fabric shortcuts and mirroring so that anyone can easily use Power BI to create reports and dashboards using data from multiple sources without copying the data into OneLake. Shortcuts and mirroring make it appear to the report user that all the data they need is local, thanks to the ease of use of the object explorer:

These tables all appear to be local, but in fact only the first four are; the other three, designated by the paperclip-looking icon, could be shortcuts to data located all over the world. For mirrored sources, data is replicated in near real time and copied to OneLake, where it appears as a mirrored database that can be joined in a query (the example below shows Cosmos DB and SQL DB being mirrored as warehouses). This is a huge benefit because you don’t need to write ETL and the data is never outdated:

Click here for a YouTube video demonstrating mirroring.
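
To illustrate the “great unifier” idea in code, here is a hedged sketch from a Fabric notebook where a mirrored database table and a shortcut table surface as ordinary lakehouse tables, so Spark SQL can join them as if everything were local. All lakehouse and table names are hypothetical.

```python
# Sketch: join mirrored and shortcut tables without any ETL, as if they were local.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # already provided in a Fabric notebook

unified = spark.sql("""
    SELECT c.customer_id,
           c.region,                          -- from a mirrored Azure SQL database
           SUM(o.amount) AS revenue           -- from a shortcut to an external lake
    FROM   sales_lakehouse.customers_mirrored AS c
    JOIN   sales_lakehouse.orders_shortcut    AS o
           ON o.customer_id = c.customer_id
    GROUP  BY c.customer_id, c.region
""")

unified.show(10)
```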

Here is what Fabric looks like as the great unifier:

Some points about the benefits of such an architecture:

  • If end-users do not like the reporting features in other products like Databricks or Snowflake, or they are just used to Power BI, Fabric gives them a great way to use the Power BI reporting tool instead of being forced to use the reporting tools from other products. Plus, the end-user can easily supplement the data from those other products by uploading their own data into Fabric and joining the data together, instead of having to ask IT to ingest the data into those other products
  • Some customers will use a competing cloud platform for various data analytics workloads, but may still want to use Fabric’s business intelligence, data science, data engineering, and other capabilities on that data
  • It is an easy button, helping end-users who are not technical to unify data from all different sources
  • Microsoft is not trying to compete with or replace other products/clouds
  • Helps with being multi-cloud: If you have data in Azure, Amazon, and Google (supported soon), you can use Fabric to bring all that data together to query and generate reports
  • Great way to bring data together to train ML models
  • Think of shortcuts as a light virtualization layer
  • Shortcuts cache data and only pull over data that is needed, saving egress costs and removing the need to build ETL to keep data in sync
  • A great way to support data sovereignty as you can keep data in data lakes within countries and create shortcuts to unify all the data
  • Microsoft Purview can be used for data governance

The post Microsoft Fabric – the great unifier first appeared on James Serra's Blog.

Common Data Mesh exceptions


When it comes to data meshes that have been constructed or are currently under development, I have not observed any instances where the four core data mesh principles have been utilized to their fullest extent. This ideal, which I term the ‘pure’ data mesh, demands strict adherence to its defined principles. However, in practice, all implementations deviate from this pure form, each with its own set of exceptions. Among these, some of the most frequent exceptions I have seen are:

  • Data is kept in one enterprise data lake with each domain getting its own container/folder instead of each domain having its own data lake. This is popular with users of Microsoft Fabric and its OneLake. This helps with performance by not having to join data from multiple data lakes that can be physically located all over the world. It also helps with governance as all data is in one location
  • Limiting the technologies each domain can use, which is almost always at least limited to one cloud provider. And usually within the cloud provider, limited to certain technologies. For example, each domain can only use Microsoft Fabric and not other technologies such as Azure Synapse. This reduces complexity and cost
  • Central IT creates the data ingestion piece for each domain instead of each domain creating their own. This speeds up development as central IT will know the data ingestion piece well and can reuse a lot of the code
  • Some services (for example, compute, MDM) are shared instead of each domain having their own. For example, a handful of Databricks clusters can be used by all the domains instead of each domain having their own Databricks cluster. This can greatly reduce costs
  • IT builds the entire solution for each domain, including creating the analytical data. This will speed up development as IT will become experts in building these solutions and can repeat the process for each domain
  • Using a product (like Microsoft Purview) for data cataloging all domain data instead of each domain using their own product and providing API calls to access their domain metadata
  • Using a virtualization tool (like Microsoft Fabric with OneLake shortcuts) to access all domain data instead of each domain providing API calls to access data
  • Having data products that simply contain data and metadata and not implementing the more complicated concept of using a data quantum as a data product (in which it holds not only the domain-oriented data and metadata but also the code that performs the necessary data transformations, the policies governing the data and the infrastructure to hold the data)
  • Not using consumer-aligned domains or aggregate domains
  • Using a shared model (there is only one domain that is shared by multiple business units) instead of a federated model (there are multiple domains that may be aligned to business units, products, or data producers/consumers)

It’s important to recognize that many of these exceptions can diminish or completely negate a key advantage of a data mesh: its ability to facilitate organizational and technical scaling. This scaling is crucial in preventing bottlenecks caused by people and infrastructure limitations.

This leads to the question: At what point do you have so many exceptions it should not be called a data mesh?

We can call that point the Minimum Viable Mesh (MVM). How many exceptions would be allowed for a solution to be considered an MVM? Can we all even agree on the four data mesh principles? If the four principles aren’t universally accepted in total, we can’t even try to answer that question. Do we need an industry committee to come up with a revised definition of a data mesh and an MVM? How do we avoid feedback based solely on self-interest?

Unfortunately, I don’t have the answers. If you are looking to build a data mesh, I suggest learning all you can about a data mesh and other architectures to see what would work best for your particular use case. My upcoming book might help with that. The eBook version will be available in early February. It’s got a pretty cover now 🙂

The post Common Data Mesh exceptions first appeared on James Serra's Blog.

Data Mesh Topologies


As a follow up to my blog Common Data Mesh exceptions, I wanted to discuss various types of data mesh topologies I am seeing being built. I put them into three categories, but there are many variations to these three (variations mentioned in my exceptions blog). Ole Olesen-Bagneux has just posted a similar discussion on LinkedIn about data mesh types that I encourage you to check out.

On the left of the figure below, the architectures have the most centralization, and the architectures become more distributed as you move to the right:

Mesh Type 1
In this setup, all domains use identical technology, restricted to a single cloud provider’s offerings. Each domain maintains its own infrastructure but shares a central enterprise data lake, where each domain has a dedicated container or folder. This setup, common due to its performance benefits and simplified security, monitoring, and disaster recovery, involves minimal product variation within domains.
Mesh Type 2
Similar to Type 1 in technology use, but here, each domain possesses its own data lake. This fully decentralized approach faces challenges in linking data lakes and maintaining performance when integrating data across domains. Despite these challenges, Mesh Type 1 is more prevalent due to its practicality in data and integration, and is the approach usually taken with customers using Microsoft Fabric. In my opinion, this fits closely with what Ole calls the pragmatic data- and integration mesh.
Mesh Type 3
Domains in this architecture have the freedom to choose any technology and cloud provider, with each having its separate data lake. This leads to a diverse tech environment with varying security protocols, a need for expertise in multiple products, and challenges in governance and infrastructure automation. Combining data across these varied systems is complex. Due to these challenges, Mesh Type 3, though visionary, is considered impractical for widespread adoption. I have not seen any company even attempt it. This fits into the category of what I call the pure data mesh or what Ole calls the visionary data mesh. 

The bottom line is you should design an architecture that works best for your use case, based on the size, speed, and type of data. And that architecture could be a combination of certain features of a data mesh along with features of a data fabric or data lakehouse. I go into much more detail about this in my upcoming book Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh which you can pre-order on Amazon and which will (finally!) be available in eBook form the second week of February and the print version a couple of weeks after that. Note you can currently only pre-order the print version on Amazon, but the eBook can be ordered next week.

The post Data Mesh Topologies first appeared on James Serra's Blog.

My data architecture book is published!


I’m thrilled to share that after 15 months of dedicated effort, my book, “Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh,” is now available in eBook/Kindle format on Amazon! The journey to this moment has been incredibly rewarding, and I’m excited for you to see the result. The paperback edition will be available in about 2-3 weeks. You can order the eBook on Amazon here (for USA). And I would greatly appreciate any Amazon reviews! The book is being rolled out to other online stores, and some may have it at a lower price than Amazon (UPDATE: Barnes & Noble has the lowest eBook price so far).

Below is the abstract. The TOC and editorial reviews can be found here. Please send me any feedback on the book to jamesserra3@gmail.com. Would love to hear what you think!

Abstract

Data fabric, data lakehouse, and data mesh have recently appeared as viable alternatives to the modern data warehouse. These new architectures have solid benefits, but they’re also surrounded by a lot of hyperbole and confusion. This practical book provides a guided tour of these architectures to help data professionals understand the pros and cons of each.

James Serra, big data and data warehousing solution architect at Microsoft, examines common data architecture concepts, including how data warehouses have had to evolve to work with data lake features. You’ll learn what data lakehouses can help you achieve, and how to distinguish data mesh hype from reality. Best of all, you’ll be able to determine the most appropriate data architecture for your needs. With this book, you’ll:

  • Gain a working understanding of several data architectures
  • Learn the strengths and weaknesses of each approach
  • Distinguish data architecture theory from reality
  • Pick the best architecture for your use case
  • Understand the differences between data warehouses and data lakes
  • Learn common data architecture concepts to help you build better solutions
  • Explore the historical evolution and characteristics of data architectures
  • Learn essentials of running an architecture design session, team organization, and project success factors
The post My data architecture book is published! first appeared on James Serra's Blog.

Elevate Your Data Leadership: A Unique Intensive Learning Experience


I’ve collaborated with two industry experts to design a transformative course for data leaders: “The Technical and Strategic Data Leader.” This six-week intensive learning journey is crafted to elevate both your technical skills and strategic acumen.

Facing Data Leadership Challenges?

  • Struggling with the balance between business objectives and technical challenges?
  • Feeling isolated in your leadership role, wishing for a community that understands?
  • Comfortable in technical or business discussions but feel out of depth when crossing over?


What Sets This Course Apart?

  • Beyond Basics: Dive into a curriculum that goes beyond technical training to emphasize the integration of business insight and technological expertise.
  • Collaborative Learning: Engage with live discussions in a collaborative environment, fostering active participation and real-time feedback from industry leaders.
  • Focused Content: Participate in sessions on Data Team Strategy, Business Intelligence, Data Architecture, and Data Leadership.


Join Our Transformative Journey

Early bird pricing ends February 16th, 2024, with our first cohort beginning on March 12th. Don’t miss this opportunity for a deep, immersive learning experience that will redefine your leadership and technical prowess in the data realm.

In the first two weeks, I’ll cover “Intro and Data Architectures: Understanding Common Data Architecture Concepts” followed by “Data Architectures: Gain a Working Understanding of Several Data Architectures.” These sessions are designed to equip you with the foundational knowledge and practical insights into data architectures, crucial for any data leader looking to leverage technology strategically within their organization.

Learn more and register now at https://www.thedatashop.co/leader.

The post Elevate Your Data Leadership: A Unique Intensive Learning Experience first appeared on James Serra's Blog.

Introduction to OpenAI and LLMs


I focus most of my blog posts on the data platform and how companies can make better business decisions using structured data (think SQL tables), but I’m seeing more and more customers interested in OpenAI and how they can make better business decisions with OpenAI using unstructured data (think text in documents). They also want to know whether it is possible to use OpenAI on structured data. This is the first blog in a three-part series on the topic.

This first blog will focus on using OpenAI on unstructured data, where the ideal solution is a bot, like ChatGPT, that is used to ask questions on documents from your company.

First, I want to explain in layman’s terms what OpenAI, ChatGPT, and Bing’s Copilot are. ChatGPT and Copilot are basically bots that work with what are called Generative AI models, commonly known as Large Language Models or LLMs, that were “trained” on multiple data sets that include essentially the entire web as well as millions of digital books. The models were built using OpenAI’s technology (OpenAI is a leading artificial intelligence research lab that focuses on developing advanced AI technologies). So these models are very smart! Sitting on top of these LLMs are “bots” that allow you to ask questions (via prompts), and the bot returns answers using the LLM. The more detail you put in the question, the better the answer will be (this technique is called prompt engineering – designing prompts for LLMs that improve accuracy and relevancy in responses, optimizing the performance of the model). An example question would be “What are the best cities in the USA?”, and the LLM would return an answer based on all the websites, blog posts, Reddit posts, books, etc. that it found that talked about the best USA cities.
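
To make prompt engineering a little more concrete, here is a minimal sketch using the OpenAI Python SDK. The model name and the wording of the prompts are illustrative assumptions only, and the sketch assumes an OPENAI_API_KEY environment variable is set.

```python
# Minimal prompt-engineering sketch using the OpenAI Python SDK (v1.x).
# Assumptions: OPENAI_API_KEY is set in the environment; the model name
# "gpt-4o-mini" is illustrative and may differ in your subscription.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        # The system message steers tone and detail level (prompt engineering).
        {"role": "system",
         "content": "You are a travel expert. Answer for a reader with no prior knowledge, in 5 bullet points."},
        # The user prompt: more detail generally yields a better answer.
        {"role": "user",
         "content": "What are the best cities in the USA for a first-time visitor interested in food and museums?"},
    ],
)

print(response.choices[0].message.content)
```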

But what if you wanted to ask questions on data sets that the LLMs did not use for their training, such as PDFs that your company has that are not available publicly (not on your website)? For example, maybe you are a company that makes refrigerators and you have a bunch of material such as refrigerator user guides, model specifications, repair manuals, customer problems and solutions, etc. And you would like to train an LLM on the text in those documents and have a bot on top of it so that customers can ask questions of that material. Just think of the improved customer service: customers would not need to talk to a customer service person and could get quick, accurate answers about features of a refrigerator, as well as quick answers to fix problems they are having with their refrigerator.

LLMs, when it comes to using them in a real-world production scenario, have some limitations, mainly due to the fact that they can answer questions related only to the data they were trained on (called the base model or pre-trained LLM). This means that they do not know facts that happened after their date of training, and they do not have access to data protected by firewalls or otherwise not accessible on the internet. So how do you get LLMs to also use PDFs from your company? There are two approaches that can supplement the base model: further training of the base model with new data, called fine-tuning, or retrieval-augmented generation (RAG), which uses prompt engineering to supplement or guide the model in real time.

Let’s first talk about RAG. RAG supplements the base model by providing the LLM with the relevant and freshest data to answer a user question by injecting the new information through the prompt. This means RAG works with pre-trained LLMs and your own data to generate responses. Your own data can be PDF documents.

A system that implements the RAG pattern has in its architecture a knowledge base, hosting the validated docs (usually private data) on which the model should base its answers. Each time a user question comes to the system, the following steps occur (a minimal code sketch follows the numbered list):

  1. Information Retrieval: The user question is converted into a query to search into the knowledge base for relevant docs, which are your private docs such as the previously mentioned refrigerator user guides. An index is commonly used to optimize the search process
  2. Prompt Engineering: The matching docs are combined with the user question and a system message and injected into the pre-trained LLM. The system message contains instructions that guides the LLM in generating the desired output, such as “the user is a 5th grader” so its answer will be more simple to understand
  3. LLM Generation: The LLM, trained on a massive dataset of text, generates a response based on the prompt and the retrieved information
  4. Output Response: The generated text is then presented to the user, written in natural language, providing them with insights and assistance based on their private docs
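
To make those four steps concrete, here is a minimal, self-contained sketch of the RAG pattern. It uses a toy keyword-overlap retriever in place of a real knowledge base index (production systems typically use a search or vector index such as Azure AI Search), and the model name and document snippets are illustrative assumptions only.

```python
# Minimal RAG sketch: toy retrieval + prompt assembly + LLM generation.
# Assumptions: OPENAI_API_KEY is set; "gpt-4o-mini" is an illustrative model name;
# the documents below stand in for your private knowledge base (e.g. refrigerator guides).
from openai import OpenAI

client = OpenAI()

knowledge_base = [
    "Model FR-200 user guide: to reset the ice maker, hold the reset button for 3 seconds.",
    "Model FR-200 specs: 20 cu ft capacity, Energy Star rated, dual evaporators.",
    "Repair manual: error code E3 indicates a faulty defrost sensor; replace part DS-11.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    # 1. Information Retrieval: naive keyword-overlap scoring (a real system would
    #    query an index over the private docs instead).
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question, knowledge_base))
    # 2. Prompt Engineering: combine a system message, the retrieved docs, and the question.
    messages = [
        {"role": "system",
         "content": "Answer only from the provided context. If the answer is not in the context, say 'not found'."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    # 3. LLM Generation.
    completion = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    # 4. Output Response.
    return completion.choices[0].message.content

print(answer("My FR-200 shows error code E3. What should I do?"))
```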

Note that you can choose to have user questions answered only with the knowledge base of private docs, or also with the text that was used to train the LLM (“the internet”). For example, if a user question is for an older refrigerator model that is not part of the private docs, you can decide to return an answer of “not found”, or you can choose to search the pre-trained LLM and return what is found from the public information. You can also choose to combine the two: for example, if the user question is for a model you have in your private docs, you can return information from the private docs and combine it with public information to give a more detailed answer, perhaps with the public information giving customer reviews that the private docs do not have (the system message is used to indicate this).

The other approach, fine-tuning, enhances an existing pre-trained LLM using example data, like your refrigerator user guides (a domain-specific dataset). This results in a new “custom” LLM, or fine-tuned LLM, that has been optimized for the provided example data. The main issue with fine-tuning is the time and cost it takes to enhance (“retrain”) the LLM, and it will still only have information from when it was retrained, as opposed to RAG that is “real-time”.
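
For contrast, here is a hedged sketch of what the fine-tuning route can look like with the OpenAI Python SDK: you upload a JSONL file of example conversations and start a fine-tuning job that produces a custom model. The file name, base model, and example content are assumptions for illustration; Azure OpenAI exposes a similar flow with its own model availability.

```python
# Fine-tuning sketch (OpenAI Python SDK v1.x). Assumptions: OPENAI_API_KEY is set,
# and "refrigerator_qa.jsonl" is a hypothetical training file where each line looks like:
# {"messages": [{"role": "user", "content": "How do I reset the FR-200 ice maker?"},
#               {"role": "assistant", "content": "Hold the reset button for 3 seconds."}]}
from openai import OpenAI

client = OpenAI()

# Upload the domain-specific training data.
training_file = client.files.create(
    file=open("refrigerator_qa.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job; the base model name is illustrative, and the set of
# fine-tunable models changes over time.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)

print(job.id, job.status)  # poll the job until it finishes, then call the resulting custom model
```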

When deciding between RAG and fine-tuning, it’s essential to consider the distinct advantages each offers. RAG, by leveraging existing models to intelligently process new inputs through prompts, facilitates in-context learning without the significant costs associated with fine-tuning. This approach allows businesses to precisely tailor their solutions, maintaining data relevance and optimizing expenses. In contrast, fine-tuning enables models to adapt specifically to new domains, markedly enhancing their performance but often at a higher cost due to the extensive resources required. Employing RAG enables companies to harness the analytical capabilities of LLMs to interpret and respond to novel information efficiently, supporting the periodic incorporation of fresh data into the model’s framework without undergoing the fine-tuning process. This strategy simplifies the integration and maintenance of LLMs in business settings, effectively balancing performance improvement with cost efficiency.

Bing’s Copilot uses RAG to give you the most up-to-date answers to your questions (by scraping web pages so it is real-time), as opposed to rebuilding the LLM using fine-tuning, which would take tons of hours, be impractical to do each day, and still lag behind real-time. Microsoft’s Copilot in its Office 365 products also uses RAG on your data (PowerPoint, Word files, etc.) – see How Microsoft Copilot Incorporates Private Enterprise Data. Think of Office 365 Copilot as a customized bot for a specific purpose (working with Office 365 files).

Two popular choices for building RAG are Azure AI Studio and Microsoft Copilot Studio – see Building your own copilot with low-code approach: a comparison between Azure AI Studio and Microsoft Copilot Studio.

Now that you understand the “what” (what OpenAI and LLMs are), the next blog post will cover the “how” (via Azure OpenAI On Your Data), and the third blog post will be about using OpenAI on structured data (or on both unstructured data and structured data at the same time).

More info:

What is Retrieval Augmented Generation (RAG)?

Retrieval Augmented Generation using Azure Machine Learning prompt flow (preview)

Full Fine-Tuning, PEFT, Prompt Engineering, and RAG: Which One Is Right for You?

RAG vs. fine-tuning: A comparison of two techniques for enhancing LLMs

Building your own copilot – yes, but how? (Part 1 of 2)

How Microsoft 365 Copilot works

The Fashionable Truth About AI

The post Introduction to OpenAI and LLMs first appeared on James Serra's Blog.

Announcements from the Microsoft Fabric Community Conference


(Shameless plug: The price of my book “Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh” has dropped on Amazon to its lowest price yet)

A ton of new features for Microsoft Fabric were announced at the Microsoft Fabric Community Conference. Here are all the new features I am aware of, with some released now and others coming soon:

  • Mirroring is now in public preview for Cosmos DB, Azure SQL DB and Snowflake. See Announcing the Public Preview of Database Mirroring in Microsoft Fabric
  • You get a free terabyte of Mirroring storage for replicas for every capacity unit (CU) you have purchased and provisioned. For example, if you purchase F64, you will get sixty-four free terabytes worth of storage for your mirrored replicas
  • You can now access on-premises data using the on-premises Data Gateway in Data Factory pipelines. See Integrating On-Premises Data into Microsoft Fabric Using Data Pipelines in Data Factory
  • You can have folders in the Workspace view. See Announcing Folder in Workspace in Public Preview
  • Improving the Microsoft Fabric’s CI/CD experience. This improvement includes support for data pipelines and data warehouses in Fabric Git integration and deployment pipelines. Spark job definition and Spark environment will become available in Git integration. Microsoft is also giving you the ability to easily branch out a workspace integrated into Git with just a couple of clicks to help you reduce the time to code. Additionally, because many organizations already have robust CI/CD processes established in tools such as Azure DevOps, Fabric will also support both Fabric Git integration APIs as well as Fabric deployment pipelines APIs, enabling you to integrate Fabric into these familiar CI/CD tools. All of these updates will be launched in a preview experience in early April. See Data Factory Adds CI/CD to Fabric Data Pipelines
  • Fast Copy in Dataflows Gen2, where you can ingest a large amount of data using the same data movement backend as the “copy” activity in data pipelines
  • A feature coming soon that will give you the ability to add tags to Fabric items and manage them for enhanced compliance, discoverability, and reuse
  • A new Microsoft Fabric feature called task flows. Task flows can help you visualize a data project from end-to-end
  • Security admins will soon be able to define Purview Information Protection policies in Microsoft Fabric to automatically enforce access permissions to sensitive information in Fabric
  • Coming soon is the extension of Purview Data Loss Prevention (DLP) policies to Fabric, enabling security teams to automatically identify the upload of sensitive information to Fabric and trigger automatic risk remediation actions. The DLP policies will initially work with Fabric Lakehouses with support for other Fabric workloads to follow. See Extend your data security to Microsoft Fabric
  • In preview is the ability for organizations to create subdomains, the ability to set default domains for security groups, the ability to use public admin APIs, and more.  See Easily implement data mesh architecture with domains in Fabric
  • Released in preview is the ability to create shortcuts to the Google Cloud Platform
  • Released in preview is the ability to create shortcuts to cloud-based S3 compatible data sources and, coming soon, on-premises S3 compatible data sources (these sources include Cloudflare, Qumulo, MinIO, Dell ECS, and many more)
  • Introducing an external data-sharing experience for Microsoft Fabric data and artifacts, helping make collaboration easier and more fruitful across organizations. Fabric external data sharing, coming soon, enables you to share data and assets with external organizations such as business partners, customers, and vendors in an easy, quick, and secure manner. Because this experience is built on top of OneLake’s shortcut capabilities, you can share data in place from OneLake storage locations without copying the data. External users can access it in their Fabric tenant, combine it with their data, and work with it across any Fabric experience and engine
  • Coming soon is a metrics layer in Fabric which allows organizations to create standardized business metrics that are rooted in measures and are discoverable and intended for reuse. Trusted creators can select Power BI measures to promote to metrics and even include descriptions, dimensions, and other metadata to help users better understand how they should be applied and interpreted. When looking through the metrics, users can preview and explore the simplified semantic model in a simple UI before using it in their solution. These metrics can not only be used in reports, scorecards, and Power BI solutions but also in other artifacts across Fabric, such as data science notebooks
  • Later in the year, Microsoft will release the ability to live edit Direct Lake semantic models in the Fabric service right from Power BI Desktop, so you can work with data directly from OneLake
  • In preview is the ability to connect to over a hundred data sources and create paginated reports right from the Power BI Report Builder 
  • In preview is the new ability to create Power BI reports in the Fabric web service by simply connecting to your Excel and CSV files with relationship detection enabled
  • To save you time when you are building reports, in preview Microsoft has created new visuals for calculations and are introducing a new way for you to create and edit a custom date table without writing any Data Analysis Expressions (DAX) formulas
  • In preview you can generate mobile-optimized layouts for any report page to help everyone view insights even on the go
  • In Explorer, Microsoft added a new “Data overview” button which provides a summary, powered by Copilot, of the semantic model to help users get started. This feature will be released in preview in early April and will roll out to regions gradually
  • In preview is the ability for Copilot to help you write and explain DAX queries in the DAX query view
  • Coming soon is a new generative AI feature in Fabric that will enable custom Q&A experiences for your data. You can simply select the data source in Fabric you want to explore and immediately start asking questions about your data—even without any configuration. When answering questions, the generative AI experience will show the query it generated to find the answer and you can enhance the Q&A experience by adding more tables, setting additional context, and configuring settings. Data professionals can use this experience to learn more about their data or it could even be embedded into apps for business users to query
  • More new Fabric features are at Microsoft Fabric March 2024 Update
  • And for specific Power BI changes, check out Power BI March 2024 Feature Summary

More info:

Announcements from the Microsoft Fabric Community Conference

The post Announcements from the Microsoft Fabric Community Conference first appeared on James Serra's Blog.

Microsoft Purview new data governance features


Starting last week is a rollout of the public preview of a new and fully reimagined Microsoft Purview data governance solution. Data governance has become so much more important with the collection of more and more data. Data governance involves the processes, policies, and standards that ensure the effective use of information in an organization. It focuses on maintaining data quality, managing data life cycle, ensuring compliance with regulations, and defining roles and responsibilities for data management. This framework helps organizations achieve their goals by enhancing decision-making, facilitating compliance, and improving data security and privacy.

This new SaaS experience is intended to supersede the previous Microsoft Purview Data Catalog experience and addresses the key areas of customer feedback as well as support the requirements of a modern data governance solution via these new features:

  • Business-friendly catalog for business users across data roles (Chief Data Officer, data stewards, data asset creators, and data consumers) and anchored to durable business concepts and business objectives and key results (OKRs)
  • A unified and extensible experience that works how customers work, backed by compliant, self-serve data access
  • AI-enabled experiences and automation to dramatically scale business success
  • Built to enable a culture of federated data governance through simplicity and actionable insights

The new data governance experience will be available on the new unified Microsoft Purview portal. If you have not yet transitioned to this new experience, please reference this useful article on how to transition. You can easily toggle from the new Purview portal experience to the previous one right from within the user interface (upper righthand corner). The new experience has already rolled out to a few regions, and additional regions will become available over the course of several weeks (see Data Governance for the Age of AI for the region rollout dates).

The changes to Purview, in short, are that the “Data Catalog” section of Microsoft Purview was redesigned and updated with new features, including a “Data management” section and a “Data estate health” section. Within the data management section, you can easily define and assign business-friendly terminology (such as Finance and Claims). Business-friendly language follows the data governance experience through Data Products (a collection of data assets used for a business function), Business Domains (ownership of Data Products), Data Quality (assessment of quality), Data Access, and Data Estate Health (reports and insights). Let’s explore each of these new features in detail:

  • Data search – Browse, search, and discover data assets across your organization. Enter in keywords that help narrow down your search such as name, data type, classifications, and glossary terms. Or you can explore by source type or collection. There are many options to filter the results by, such as asset type, classification, contact, or endorsement. This is basically the same as the search in the older portal
  • Data product search – Discover, understand, and access data products from across your organization. You can search for data products in the catalog, explore data products by business domain, view data product details, and view data asset details. In the data product search bar, you can perform a natural language search for any of your organization’s data products. For example: “I’m looking for daily sales data for retail stores worldwide to analyze sales trends for the last six months.” Also on this page is where you can see all your data access requests via the “My data access” tab. This will make it so much easier to find assets, as instead of using the data search page which returns all assets, this page allows you to drill down through the business domain and data products to find the relevant assets
  • Business domains – Organize data products into meaningful groups (such as Sales or Marketing) and link them to business concepts. It employs the concept of business domains to manage business concepts and define data products. A business domain is a framework for managing your business-related information in the catalog, typically organized around a common business purpose or business capability. It is a boundary that aligns your data estate to your organization; think of it as a mini catalog inside your data catalog. This is different from the domains under Data Map, which are technical domains for logical groupings (called collections) such as by project, asset type, or ownership (be sure to check out What’s coming for domains?). Business domains can be mapped to collections in your data map. Mapping your business domain to a collection means that the assets associated with the business concepts in that domain will be from that collection or its children. Within business domains is where you define glossary terms. Steps to build out your business domains are: 1) Create and manage business domains to curate your data catalog, 2) Assign owners and stewards for a business domain, 3) Relate business domains to physical data collections and domains in the data map, 4) Create glossary terms for the business domain (note you can use built-in Copilot to suggest terms), 5) Understand and take timely actions to keep your business domains in a healthy state, 6) Define business objectives and key results (OKRs), such as a 10% rise in sales or a 3% reduction in support cases (whose progress you would manually track in Purview), 7) Define critical data elements, also called CDEs, which are a logical grouping of important pieces of information to make data easier to understand as well as to promote standardization (for example: a “Customer ID” critical data element can map “CustID” from one table and “CID” from another table into the same logical container)
  • Data products – Manage groups of data assets packaged together for specific use cases. Data products are essentially logical business concepts. Each data product will be assigned to a business domain, and assets such as tables, files, and Power BI reports will be assigned to the data product (Copilot can be used for suggestions). No more requesting access to 15 different tables you might need to build a data model. Once one user does the research to create a viable data product, all other users can benefit from that work. They can find (and request access to) the data in that product and have everything they need in one place. A business domain can house many data products, but a data product is managed by a single business domain and can be discovered across many business domains. For example, with a data product, a data scientist can create a data product that lists all the assets used to create their data model. The description provides a full use case, with examples or suggestions on how to use the data. The data scientist is now a data product owner and they’ve improved their data consumer’s search experience by helping them get everything they need in this one data product. Data products streamline governance for data assets as well: With data products, when a user finds the data product, they can request access to the data product, which will provide them access (after approval) to all the associated data assets.  The resulting hierarchy looks like: Business domains -> Data products -> Assets. An example would be Sales -> Global Sales Revenue for 2023CY -> Global Sales for 2023 (Power BI report). You can also assign previously created glossary terms to the data product
  • Data quality – Identify and fix data quality issues. The new data quality model enables your organization to set data quality rules top down with business domains, data products, and the data assets themselves. This is done using no-code/low-code rules, including out-of-the-box (OOB) rules. Some examples of rules are checking for duplicate rows, empty fields, unique values, etc. Copilot can be used to suggest rules. Data access policies can be set on a business domain, a data product, and a glossary term, and also to manage access requests to a data product (a request triggers a workflow that requests that the owners of the data resource grant you access to the data product). Any time a glossary term is applied to a data product, all the associated policies will be automatically applied. These policies are where you determine permitted access (“access time limit”), approval requirements (“manager approval required” or “privacy and compliance review required”), and digital attestations (“permit data copies”). Once rules and policies are applied, the data quality model will generate data quality scores at the asset, data product, or business domain level, giving you snapshot insights into your data quality relative to your business rules. Within the data quality model, there are two metadata analysis capabilities: 1) data profiling—quick sample set insights, and 2) data quality scans—in-depth scans of full data sets. These profiling capabilities use your defined rules or built-in templates to reason over your metadata and give you data quality insights and recommendations. Also available are data quality actions, which identify problems that you should address to improve data quality in your data estate. For example, “Data profile outlier values detected” and “Data asset quality rule score has fallen below threshold”. You would assign the action to a person, and when that person fixes the issue via a tool such as ADF, you would mark it as resolved. Also, there are data quality alerts that notify Microsoft Purview users about important events or unexpected behavior detected around the quality of the data. When you create alerts for assets, you’ll receive email notifications about data quality scores. An alert example would be to send a notification if the data quality score for the sales domain (containing customer and fact_sale data assets) dropped below 50%. Lastly, there are data quality rules you can choose for CDEs, such as “unique values” and “empty/blank fields”
  • Data access – This page is where you manage requests to access data products in your business domains by approving or declining the request (which can also be done through an email notification). But first you manage access to your data products and set up a system to provide access to users who request them: on the data products page, select a data product and then select “Manage policies” on the data product page. There you can define access policies in many ways, as mentioned above in the data quality section. For example, setting the usage purposes, who approves the request, requiring manager approval, requiring acknowledgement of terms of use, determining if copies of the data are permitted, and setting the maximum access duration. The selected values affect what the data consumers see on their access request form and the actions they need to take. Note that in this preview experience, the approvers of the request must provide access to the individual data assets manually. To request access to a data product, a user will select the “Request access” button while on a data product details page
  • Health controls – Track your journey to complete data governance by using health controls to monitor your progress. Health controls measure your current governance practices against standards that give your data estate a score. Some example controls are metadata completeness, cataloging, classification, access entitlement, and data quality. A data officer can configure rules which determine the score and define what constitutes a red/yellow/green indicator score, ensuring your rules and indicators reflect the unique standards of your organization. An example would be checking if data assets are classified, with a target value of 80%, or if data assets are mapped to data products for discoverability in the catalog, with a target of 90%
  • Health actions – Steps you can take to improve data governance across your data estate. This new action center aggregates and summarizes governance-related actions by role, data product, or business domain. All the anomalies noted by health controls are translated into actions that you can assign to an owner, each with recommendations for resolution that you can track and address from within Microsoft Purview. Actions stem from usage or implementation being out of alignment with defined controls. This interactive summary makes it easy for teams to manage and track actions—simply click on the action to make the change required. Cleaning up outstanding actions helps improve the overall posture of your data governance practice—key to making governance a team sport. An example is missing classification on data assets, or a data product not linked to data assets
  • Metadata quality – A low-code/no code experience for data stewards and members of the Chief Data Officer’s office to write any logic to test the metadata health and quality. It comes with a set of predefined logic for each health control and you can add more rules to the existing logic, and at the next refresh of the health controls, the new metadata quality logic will be applied to all the metadata, based on the scope (data product or business domain or any other entity). For example, you can create a rule that a data product must have a published term of use or a data product must have a description
  • Reports – Data governance is a practice which is nurtured over time. Aggregated insights help you put the “practice” into your data governance practice by showcasing the overall health of your governed data estate. These reports provide deep insight across a variety of dimensions: asset insights (an overview of assets by type and collection, and their curation status), catalog adoption (to understand at a glance how your data catalog is being used), classification insights (an overview of assets classified and the types of classifications), data stewardship (for the governance and quality focused users, like data stewards and chief data officers, to understand governance health gaps like asset curation and asset ownership), glossary insights (health and use of glossary terms), and sensitivity label insights (an overview of assets that have sensitivity labels applied and the types of labels applied). Coming soon is data governance health and data quality health
  • Roles and permissions – Admin roles give users permission to view data and complete tasks in Microsoft Purview. Give users only the access they need by assigning the least-permissive role. Roles include Data Governance Administrators, Business Domain Creators, Data Health Owners, and Data Health Readers. These roles are called application-level permissions (with the application being the Data Catalog). Note these relate to using features in the Data Catalog, separate from the role assignments for collections and separate from the roles for each business domain (called business domain level permission). For details on all the permissions, check out Permissions in the new Microsoft Purview portal preview

Finally, I wanted to stress that it’s important to understand this new data quality life cycle:

  1. Assign user(s) data quality steward permissions in your data catalog to use all data quality features
  2. Register and scan a data source in your Microsoft Purview Data Map
  3. Add your data asset to a data product
  4. Set up a data source connection to prepare your source for data quality assessment. The currently supported data source types are ADLS Gen2 (delta format), Azure SQL Database, and Fabric Lakehouse (delta table)
  5. Configure and run data profiling for an asset in your data source. Data profiling is the process of examining the data available in different data sources and collecting statistics and information about this data
    1. When profiling is complete, browse the results for each column in the data asset to understand your data’s current structure and state
  6. Set up data quality rules based on the profiling results and apply them to your data asset (a generic sketch of the kinds of checks such rules perform appears after this list). Data quality rules are essential guidelines that organizations establish to ensure the accuracy, consistency, and completeness of their data. These rules help maintain data integrity and reliability
  7. Configure and run a data quality scan on a data product to assess the quality of all supported assets in the data product and produce a score. Your data stewards can use that score to assess the data health and address any issues that might be lowering the quality of your data
  8. Review your scan results to evaluate your data product’s current data quality
  9. Repeat steps 5-8 periodically over your data asset’s life cycle to ensure it’s maintaining quality
  10. Continually monitor your data quality
    1. Review data quality actions to identify and resolve problems
    2. Set data quality notifications to alert you to quality issues
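
Purview expresses rules like “duplicate rows”, “empty/blank fields”, and “unique values” through its no-code/low-code experience. Purely to build intuition for what such checks actually evaluate, here is a generic pandas sketch; it is not Purview’s implementation, and the column names and thresholds are hypothetical.

```python
# Generic data-quality checks of the kind Purview rules express (illustrative only,
# not Purview's implementation). Column names and thresholds are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [1, 2, 2, 4, None],
    "Email": ["a@x.com", "b@x.com", "b@x.com", "", None],
})

checks = {
    # Duplicate rows: full-row duplicates as a share of all rows.
    "duplicate_rows_pct": 100 * df.duplicated().mean(),
    # Empty/blank fields: nulls or empty strings in a required column.
    "email_blank_pct": 100 * (df["Email"].isna() | (df["Email"] == "")).mean(),
    # Unique values: a key column should have no repeated non-null values.
    "customer_id_unique": df["CustomerID"].dropna().is_unique,
}

# Turn the checks into a simple score against thresholds, loosely analogous to
# a rule score that can fall below a defined threshold and trigger an action.
score = 100 - checks["duplicate_rows_pct"] - checks["email_blank_pct"]
print(checks, "score:", round(score, 1))
```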

For more details on the new governance features, check out New Microsoft Purview Data Catalog (Preview) as well as the videos on the Microsoft Purview YouTube channel.

As you can see, Microsoft Purview has expanded its focus from just cataloging data and applying policies to managing logical concepts (business domains and data products) and providing data governance via quality checks on the data and compliance enforcement.

More info:

Introducing modern data governance for the era of AI 

Video of changes

Get ready for the next enhancement in Microsoft Purview governance

Scalable Data Management with Microsoft Fabric and Microsoft Purview

Microsoft Purview’s reimagined data governance experience

Episode 5: Connecting the dots with Microsoft Purview

The post Microsoft Purview new data governance features first appeared on James Serra's Blog.

Microsoft Fabric shortcuts


I talked about Microsoft Fabric shortcuts in my blog post Microsoft Fabric – the great unifier (where I have updated the picture with the newest supported sources) and wanted to provide more details on how shortcuts work and reduce some confusion.

This is how shortcuts appear in Explorer:

Here are important key points in understanding shortcuts:

  • You can create shortcuts in Fabric lakehouses and Kusto Query Language (KQL) databases, not in Fabric warehouses
  • You can use the Fabric UI to create shortcuts interactively, and you can use the REST API to create shortcuts programmatically (a hedged example of the API call appears after this list)
  • In the Tables section, you can only create shortcuts at the top level.  Shortcuts aren’t supported in other subdirectories of the Tables section.  If the target of the shortcut contains data in Delta format, the lakehouse automatically synchronizes the metadata and recognizes the folder as a table.  Shortcuts in the tables can be to anywhere and any format (technically) but only the Delta tables are seen by the SQL engine* so it’s useless to create shortcuts to any other data
  • In the Files section, there are no restrictions on where you can create shortcuts.  You can create them at any level of the Files section folder hierarchy.  Table discovery doesn’t happen in the Files section.  Shortcuts in the files section can be to any supported source in any format but those will never be visible from the SQL engine, regardless of the data type
  • Any Fabric or non-Fabric service that can access data in OneLake can use the shortcuts.  Shortcuts just appear as another folder in the lake
  • In the Tables section, you can create shortcuts to folders, but folders will show up in an Unidentified folder and not be usable by the SQL engine if the folder does not contain a Delta log.  You can also create shortcuts to individual warehouse/lakehouse tables
  • In the Files section, you can only create shortcuts to folders, not to individual files
  • Shortcut supported sources don’t include relational storage – for those use Fabric mirroring

* = The SQL engine being the new Fabric unified SQL engine that can query tables in the lakehouse as well as the warehouse through the SQL Analytics Endpoint
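
As a rough illustration of the programmatic option mentioned in the list above, here is a sketch of creating a OneLake shortcut with the Fabric REST API from Python. The endpoint path, payload shape, and all IDs are assumptions based on my reading of the public documentation at the time of writing; check the current Fabric REST API reference before relying on them, and note that acquiring the bearer token (for example via azure-identity) is glossed over here.

```python
# Hedged sketch: create a OneLake shortcut programmatically via the Fabric REST API.
# The endpoint, payload shape, and all IDs/URLs below are illustrative assumptions;
# verify them against the current Microsoft Fabric REST API reference.
import requests

token = "<bearer token for https://api.fabric.microsoft.com>"  # e.g. obtained via azure-identity
workspace_id = "<workspace-guid>"
lakehouse_id = "<lakehouse-item-guid>"

url = f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/items/{lakehouse_id}/shortcuts"

payload = {
    "path": "Files",          # create the shortcut under the Files section
    "name": "SalesRawData",   # hypothetical shortcut name
    "target": {
        "adlsGen2": {         # pointing at an ADLS Gen2 account (one of the supported sources)
            "location": "https://mystorageaccount.dfs.core.windows.net",
            "subpath": "/landing/sales",
            "connectionId": "<connection-guid>",
        }
    },
}

resp = requests.post(url, json=payload, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
print(resp.status_code)  # a successful call returns the created shortcut
```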

The post Microsoft Fabric shortcuts first appeared on James Serra's Blog.

Microsoft Build event announcements on Fabric


There were a number of Microsoft Fabric announcements at Microsoft Build yesterday that I wanted to blog about.

Everything announced at Build can be found in the Microsoft Build 2024 Book of News.

Top announcement: Real-Time Intelligence

The new Real-Time Intelligence within Microsoft Fabric will provide an end-to-end software as a service (SaaS) solution that will empower customers to act on high volume, time-sensitive and highly granular data in a proactive and timely fashion to make faster and more-informed business decisions. Real-Time Intelligence, now in preview, will empower user roles such as everyday analysts with simple low-code/no-code experiences, as well as pro developers with code-rich user interfaces.

Features of Real-Time Intelligence will include:

  • Real-Time hub, a single place to ingest, process and route events in Fabric as a central point for managing events from diverse sources across the organization. All events that flow through Real-Time hub will be easily transformed and routed to any Fabric data stores.
  • Event streams that will provide out-of-the-box streaming connectors to cross cloud sources and content-based routing that helps remove the complexity of ingesting streaming data from external sources.
  • Event house and real-time dashboards with improved data exploration to assist business users looking to gain insights from terabytes of streaming data without writing code.
  • Data Activator that will integrate with the Real-Time hub, event streams, real-time dashboards and KQL query sets, to make it seamless to trigger on any patterns or changes in real-time data.
  • AI-powered insights, now with an integrated Microsoft Copilot in Fabric experience for generating queries, in preview, and a one-click anomaly detection experience, allowing users to detect unknown conditions beyond human scale with high granularity in high-volume data, in private preview.
  • Event-Driven Fabric will allow users to respond to system events that happen within Fabric and trigger Fabric actions, such as running data pipelines.

More info
Documentation

Other announcements

Updates to Fabric include:

  • Fabric Workload Development Kit: When building an app, it must be flexible, customizable and efficient. Fabric Workload Development Kit will make this possible by enabling ISVs and developers to extend apps within Fabric, creating a unified user experience. This feature is now in preview. More info, documentation
  • Fabric Data Sharing feature: Enables real-time data sharing across users and apps. The shortcut feature API allows seamless access to data stored in external sources to perform analytics without the traditional heavy integration tax. The new Automation feature now streamlines repetitive tasks, resulting in less manual work, fewer errors, and more time to focus on the growth of the business. These features are now generally available.
  • GraphQL API and user data functions in Fabric: GraphQL API in Fabric is a savvy personal assistant for data. It’s an API that will let developers access data from multiple sources within Fabric using a single GraphQL query. User data functions will enhance data processing efficiency, enabling data-centric experiences and apps using Fabric data sources like lakehouses, data warehouses and mirrored databases using native code ability, custom logic and seamless integration. These features are now in preview.
  • AI skills in Fabric: AI skills in Fabric is designed to weave generative AI into data specific work happening in Fabric. With this feature, analysts, creators, developers and even those with minimal technical expertise will be empowered to build intuitive AI experiences with data to unlock insights. Users will be able to ask questions and receive insights as if they were asking an expert colleague while honoring user security permissions. This feature is now in preview.
  • Copilot in Fabric: Microsoft is infusing Fabric with Microsoft Azure OpenAI Service at every layer to help customers unlock the full potential of their data to find insights. Customers can use conversational language to create dataflows and data pipelines, generate code and entire functions, build machine learning models or visualize results. Copilot in Fabric is generally available in Power BI and available in preview in the other Fabric workloads. More info
  • Snowflake Apache Iceberg shortcuts in Fabric: Apache Iceberg is an open-source native table format. With Iceberg shortcuts in OneLake, users will be able to unify data across domains, clouds and accounts by creating a single virtual data lake for the entire enterprise. Through Iceberg shortcuts, now in preview, Microsoft Fabric customers will be able to connect Iceberg tables in Snowflake to Fabric quickly, easily and without compromising performance. This is possible because of the Apache XTable project that both Microsoft and Snowflake are committed to (see Snowflake and Microsoft announce expansion of their partnership).  Previously shortcuts only supported Delta format.
  • Direct Snowflake integration: In Snowflake, when creating a Snowflake database, you can specify to use Iceberg tables (instead of Snowflake’s proprietary format) and also specify that you will use Microsoft Fabric as the external storage provider, resulting in data that will be stored in Iceberg format in OneLake. So in OneLake you will see a Snowflake database that has been automatically created. Within Fabric, you can then create shortcuts inside this Snowflake database to other OneLake items (like a lakehouse or warehouse) that will be expressed in Iceberg format. This means if I go into the Snowflake environment and look inside this newly created database, I will see those OneLake items that have shortcuts to them. The result: I can query this Snowflake database in Fabric, and updates made to it via Snowflake will be seen immediately in Fabric. Also, when using the database within Snowflake, I can combine the data with those Fabric items I have created shortcuts to. This will be in preview later this year.
  • Data workflows in Microsoft Fabric: Powered by the Apache Airflow runtime, this will help you author, schedule, and monitor workflows or data pipelines using Python (a generic Airflow DAG sketch appears after this list). More info
  • Azure Databricks Unity Catalog integration with Fabric: Coming soon, you will be able to access Azure Databricks Unity Catalog tables directly in Fabric, making it even easier to unify Azure Databricks with Fabric. From the Fabric portal, you can create and configure a new Azure Databricks Unity Catalog item in Fabric with just a few clicks. You can add a full catalog, a schema, or even individual tables to link and the management of this Azure Databricks item in OneLake—a shortcut connected to Unity Catalog—is automatically taken care of for you. This data acts like any other data in OneLake—you can write SQL queries or use it with any other workloads in Fabric including Power BI through Direct Lake mode. When the data is modified or tables are added, removed, or renamed in Azure Databricks, the data in Fabric will remain always in sync. This new integration makes it simple to unify Azure Databricks data in Fabric and seamlessly use it across every Fabric workload.
  • Federate OneLake as a Remote Catalog in Azure Databricks: Also coming soon, Fabric users will be able to access Fabric data items like lakehouses as a catalog in Azure Databricks. While the data remains in OneLake, you can access and view data lineage and other metadata in Azure Databricks and leverage the full power of Unity Catalog. This includes extending Unity Catalog’s unified governance over data and AI into Azure Databricks Mosaic AI. In total, you will be able to combine this data with other native and federated data in Azure Databricks, perform analysis assisted by generative AI, and publish the aggregated data back to Power BI—making this integration complete across the entire data and AI lifecycle.
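
Since the data workflows item above is built on the Apache Airflow runtime, a plain Airflow DAG gives a feel for the authoring model. This is a generic Airflow 2.4+ sketch, not Fabric-specific code; the pipeline name, task names, and logic are placeholders.

```python
# Generic Apache Airflow DAG sketch (placeholder names/logic), to illustrate the
# Python authoring model that Fabric data workflows are built on. Assumes Airflow 2.4+
# (older 2.x versions use schedule_interval instead of schedule).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from a source system")  # placeholder

def transform():
    print("clean and reshape the extracted data")  # placeholder

with DAG(
    dag_id="daily_sales_ingest",   # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",             # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # run transform after extract
```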

More info

Also check out the Microsoft Fabric May 2024 Update

The post Microsoft Build event announcements on Fabric first appeared on James Serra's Blog.