
Introduction to OpenAI and LLMs – Part 2


My previous blog post on this topic was Introduction to OpenAI and LLMs, the “what” part (what is OpenAI and LLM), and this blog post will talk about the “how” part (how to use OpenAI on your own unstructured data via products like Azure OpenAI On Your Data). But first, a review of the previous blog with some additional clarifications and definitions to help you better understand what OpenAI and LLMs are.

First a few definitions:

  • Structured data: data organized into a fixed schema of rows and columns, such as relational databases
  • Semi-structured data: data with a flexible, self-describing structure, such as CSV, XML, or JSON files and logs
  • Unstructured data: free-form content such as emails, documents, and PDFs
  • Binary data: images, audio, and video

The “generative” in generative AI refers to the ability of these systems to create or generate new content (text, images, code), often in response to a prompt entered by a user, rather than simply analyzing or responding to existing content. This involves producing output that is original and often unpredictable, based on the patterns, rules, and knowledge it has learned during its training phase. Large Language Models (LLMs) like GPT (Generative Pre-trained Transformer) are a type of generative AI that allow users to type questions or instructions into an input field (such as a chat bot), upon which the model will generate a human-like response.

Machine learning (ML) is a broad field encompassing various algorithms and models that learn from data to make predictions or decisions. Large Language Models (LLMs) like GPT are a specific type of ML model focused on processing and generating text based on the patterns they’ve learned from large datasets. The main difference between LLMs and the broader category of ML is that ML includes a wide range of algorithms for different tasks, such as image recognition, data analysis, and predictive modeling, whereas LLMs are specialized for understanding and generating human language.

Generative AI uses a computing process known as deep learning to analyze patterns in large sets of data and then replicates this to create new data that appears human-generated. It does this by employing neural networks, a type of machine learning process that is loosely inspired by the way the human brain processes, interprets and learns from information over time. LLMs are based on a specific type of neural network architecture known as the Transformer, which uses vectors and weights (vectors convert words into numerical data that the model interprets, while weights adjust how these vectors influence each other, enabling the model to identify patterns and generate text that reflects the meaning and context of words).
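
To make vectors and weights a bit more concrete, here is a toy illustration (not the actual Transformer implementation): each word is a small vector of numbers, and attention-style weights determine how much the other words' vectors influence the word being processed.

```python
import numpy as np

# Toy word vectors (real models learn thousands of dimensions during training)
vectors = {
    "bank":  np.array([0.2, 0.9, 0.1]),
    "river": np.array([0.1, 0.8, 0.3]),
    "money": np.array([0.9, 0.1, 0.7]),
}

def attention_weights(query_word, context_words):
    """Score each context word against the query word and normalize with a softmax."""
    scores = np.array([vectors[query_word] @ vectors[w] for w in context_words])
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

context = ["river", "money"]
weights = attention_weights("bank", context)

# Blend the context vectors using the weights -> a context-aware representation of "bank"
blended = sum(w * vectors[word] for w, word in zip(weights, context))
print(dict(zip(context, weights.round(2))), blended.round(2))
```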

To give an example, if you were to feed lots of fiction writing into a generative AI model, it would eventually gain the ability to craft stories or story elements based on the literature it’s been trained on. This is because the machine learning algorithms that power generative AI models learn from the information they’re fed — in the case of fiction, this would include elements like plot structure, characters, themes and other narrative devices. Generative AI models get more sophisticated over time — the more data a model is trained on and generates, the more convincing and human-like its outputs become.

Training an LLM is about teaching a computer to understand and use language effectively and answer like a human. First, we collect a lot of written text and prepare it so that it’s easy for the computer to process. Then, we use this text to train the neural network, which learns by trying to predict what word comes next in a sentence. We constantly adjust the model during training to help it learn better and check its progress by testing it with new text it hasn’t seen before. If needed, we can further train it on specific types of text to improve its skills in certain areas. This whole process requires powerful computers and the knowledge of how to train and adjust these complex models.
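
As a heavily simplified sketch of the "predict the next word" objective (a real LLM learns billions of parameters with gradient descent rather than a lookup table), the toy code below "trains" on a tiny corpus by counting which word tends to follow which:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# "Training": learn, for each word, how often each next word follows it
next_word_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_word_counts[current][nxt] += 1

def predict_next(word):
    """Predict the most likely next word seen during training."""
    counts = next_word_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # 'cat' - it followed 'the' most often in the training text
```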

RAG (Retrieval-Augmented Generation), as explained in detail in my prior blog post, does not rewire or fundamentally alter the neural network of the LLM itself. The underlying neural architecture of the LLM component remains the same. What changes is the input process: RAG methods combine the input query with additional context from the retrieval system before processing it through the neural network. This allows the LLM to use both the original input and the external information, improving the relevance and quality of the outputs by altering the input the model works with rather than the model itself. In short, it makes the questions a person asks smarter.
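
Conceptually, a RAG pipeline just augments the prompt before it reaches the LLM. In the sketch below, search_index and call_llm are hypothetical stand-ins for your retrieval system and your LLM endpoint:

```python
def rag_answer(question: str) -> str:
    # 1. Retrieve: pull the most relevant passages from your own documents
    passages = search_index(question, top_k=3)        # hypothetical retrieval call

    # 2. Augment: combine the retrieved context with the user's question
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Generate: the LLM itself is unchanged; only its input is enriched
    return call_llm(prompt)                           # hypothetical LLM call
```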

GPT-4 is reported to have approximately 1.8 trillion parameters spread across 120 layers. Layers in a neural network are levels of neurons, where each layer processes inputs from the previous layer and passes its output to the next. Parameters are the internal variables that the model adjusts during training to improve its predictions. This process helps the model understand how words, phrases, and sentences are typically used and relate to each other. The 120 layers in GPT-4 allow for more complex processing and deeper analysis of language compared to previous models. This is a significant increase from GPT-3.5’s 175 billion parameters and 96 layers. The larger number of parameters and layers enables GPT-4 to have a deeper understanding of language nuances and generate more complex responses.

As part of the fully managed Azure OpenAI Service, the GPT-3 models analyze and generate natural language, Codex models analyze and generate code and plain text code commentary, and the GPT-4 models can understand and generate both natural language and code. These models use an autoregressive architecture, meaning they use data from prior observations to predict the most probable next word. This process is then repeated by appending the newly generated content to the original text to produce the complete generated response. Because the response is conditioned on the input text, these models can be applied to various tasks simply by changing the input text.
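
A minimal sketch of that autoregressive loop, where predict_next_token is a hypothetical stand-in for the model: predict the next token, append it to the text so far, and repeat until an end token or a length limit is reached.

```python
def generate(prompt: str, max_tokens: int = 50) -> str:
    text = prompt
    for _ in range(max_tokens):
        # Hypothetical: returns the most probable next token given all text so far
        token = predict_next_token(text)
        if token == "<end>":      # stop when the model emits an end-of-sequence token
            break
        text += token             # append and feed the longer text back in next iteration
    return text
```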

The GPT-3 series of models was pretrained on a wide body of publicly available free text data. This data is sourced from a combination of web crawling and higher-quality datasets. The web-crawl portion is a filtered version of Common Crawl, which covers a broad range of text from the internet, is filtered to improve its quality, and makes up 60 percent of the weighted pretraining dataset. The higher-quality datasets include an expanded version of the WebText dataset (text gathered from links shared on Reddit), two internet-based books corpora (Books1 and Books2), and English-language Wikipedia (for more info on these data sources, check out AI Training Datasets: the Books1+Books2 that Big AI eats for breakfast). The model was fine-tuned using reinforcement learning from human feedback (RLHF). Note that OpenAI has not publicly disclosed the full details of the specific datasets used to train GPT-4.

It’s accurate to describe GPT as a sophisticated autocomplete system (autocomplete, or word completion, is a feature in which an application predicts the rest of a word a user is typing). GPT-4, like other advanced language models, predicts the next word or sequence of words based on the input it receives, similar to how autocomplete functions. However, it goes beyond simple prediction by understanding context, managing complex conversations, generating coherent and diverse content, and adapting to a wide range of tasks and prompts. This level of sophistication and adaptability is what sets it apart from standard autocomplete features.

Learn more about the training and modeling techniques in OpenAI’s GPT-3, GPT-4, and Codex research papers.

A point of confusion with an LLM is whether it “stores” text or whether the LLM “remembers” prior interactions and learns from the questions. To clarify, when you ask a question or make a request, the LLM uses only the text or documents you provide as input to generate an answer. It does not access or retrieve information from external sources in real-time or pull from a specific database of stored facts. Instead, it generates responses based on patterns and information it learned during its training phase. This means the model’s responses are constructed by predicting what text is most likely to be relevant or appropriate, based on the input it receives and the training it underwent.

The model does not “remember” previous interactions in the traditional sense. Each response is independently generated based on the current input it receives, without any retained knowledge from past interactions unless those interactions are part of the current session or explicitly included in the conversation. So it is not learning along the way, and the RAG method is not training your model.
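
This is why chat applications resend the conversation history with every request: the "memory" lives in the prompt, not in the model. A minimal sketch using the common chat-message format (chat below is a hypothetical stand-in for whatever chat-completions endpoint you call):

```python
# Conversation history lives in the application, not in the model
messages = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(question: str) -> str:
    messages.append({"role": "user", "content": question})
    answer = chat(messages)   # hypothetical call to a chat-completions style endpoint
    # Save the reply so the *next* question is answered with the full conversation as context
    messages.append({"role": "assistant", "content": answer})
    return answer

# ask("What is RAG?")
# ask("How is it different from fine-tuning?")  # only answerable because the prior turns are resent
```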

The accuracy and relevance of the model’s answers depend on how well its training data covered the topic in question and how effectively it learned from that data. Therefore, while it can provide information that feels quite informed and accurate, it can also make errors or produce outdated information that reflects only what it knew as of its last training update.

An LLM doesn’t “store” text in the sense of storing it directly as a database does. Instead, it learns patterns, relationships, and information from the text it was trained on and encodes this knowledge into a complex neural network of weights and biases within its architecture. During training, the model adjusts these weights and biases based on the input data. The weights determine the strength of the connection between neurons, while biases adjust the output of the neurons.
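
As a tiny illustration of weights and biases (a real LLM has billions of them, adjusted automatically during training), a single artificial neuron combines its inputs using weights, adds a bias, and passes the result through an activation function:

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, squashed by a sigmoid activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-total))

# The weights control how strongly each input influences the output;
# the bias shifts the output up or down regardless of the inputs.
print(neuron([0.5, 0.8], weights=[0.9, -0.3], bias=0.1))
```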

When you ask a question, the model uses these learned patterns to generate text that it predicts would be a plausible continuation or response based on the input you provide. It’s not recalling specific texts or copying them verbatim from its training data but rather generating responses based on the statistical properties and linguistic structures it learned during training.

So, the model doesn’t contain the text itself but has learned from a vast amount of text how to generate relevant and coherent language outputs. This process is more akin to a skilled writer recalling knowledge, ideas, and linguistic structures they’ve learned over time to compose something new, rather than pulling exact entries from a reference book.

Grounded models, such as those using the RAG method, reference or integrate external data sources or specific pieces of information (supplied alongside the user prompt) in real time while generating responses. Ungrounded models, on the other hand, generate responses based solely on the data they were trained on, without any real-time access to external information (the GPT-4 model in its base form is an ungrounded model).

Now that you have a good understanding of OpenAI and its LLMs, let’s discuss how you can leverage these technologies with your own unstructured data (text in documents). When you interact with ChatGPT, you’re engaging with a model trained on a diverse range of internet text. However, OpenAI now offers the capability to upload your own documents (.txt, .pdf, .docx, .xlsx) directly into ChatGPT (Bing Copilot supports document uploads too, and many more file types). This allows the model to reference your specific documents when answering questions, enhancing the relevance and accuracy of responses. This feature is an application of RAG techniques.

Azure OpenAI Service On Your Data is the Microsoft product that helps you build a solution to upload documents (unstructured enterprise data) and ask questions of them. Azure OpenAI On Your Data enables you to run advanced AI models such as GPT-35-Turbo and GPT-4 on your own enterprise data without needing to train or fine-tune models. You can chat on top of and analyze your data with greater accuracy. You can specify sources such as company databases, internal document repositories, cloud storage systems like Azure Blob Storage, SharePoint, or other designated data sources that contain the latest information to support the responses. You can access Azure OpenAI On Your Data using a REST API, via the SDK, or the web-based interface in the Azure OpenAI Studio. Azure OpenAI On Your Data supports the following file types for uploading: .txt, .md, .html, .docx, .pptx, .pdf. You can also create a web app that connects to your data to enable an enhanced chat solution, or deploy it directly as a Copilot using Microsoft Copilot Studio.
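
As a rough sketch of what a REST call to Azure OpenAI On Your Data can look like, assuming an Azure AI Search index already built over your documents (the resource names are placeholders, and the api-version and exact request fields vary between API versions, so confirm the current schema in the Azure OpenAI documentation before using this):

```python
import requests

# Illustrative only: replace the placeholders with your own resource names and keys
endpoint = "https://<your-resource>.openai.azure.com"
deployment = "<your-gpt-deployment>"
url = f"{endpoint}/openai/deployments/{deployment}/chat/completions?api-version=2024-02-01"

body = {
    "messages": [{"role": "user", "content": "Summarize our travel expense policy."}],
    "data_sources": [  # the "On Your Data" part: ground the model in your own search index
        {
            "type": "azure_search",
            "parameters": {
                "endpoint": "https://<your-search-service>.search.windows.net",
                "index_name": "<your-index>",
                "authentication": {"type": "api_key", "key": "<search-api-key>"},
            },
        }
    ],
}

response = requests.post(url, headers={"api-key": "<azure-openai-key>"}, json=body)
print(response.json()["choices"][0]["message"]["content"])
```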

Microsoft Copilot Studio is a low-code conversational AI platform that enables you to extend and customize Copilot for Microsoft 365 with plugins, as well as build your own copilots. Plugins are reusable building blocks that allow Copilot to access data from other systems of record, such as CRM, ERP, HRM and line-of-business apps, using 1200+ standard and premium connectors. You can also use plugins to incorporate your unique business processes into Copilot, such as expense management, HR onboarding, or IT support. And you can use plugins to control how Copilot responds to specific questions on topics like compliance, HR policies, and more.  Imagine you want to know how much of your team’s travel budget is left for the rest of the quarter. You ask Copilot in the chat, but it can’t answer because the data you’re looking for resides in your SAP system. With Copilot Studio, you can customize Copilot to connect to your SAP system and retrieve the information you need. Ask questions like “How many deals did I close this quarter?” or “What are the top opportunities in my pipeline?”. You can also orchestrate workflows with Copilot using Power Automate, such as booking a meeting, sending an email, or creating a document.

Copilot Studio also allows you to build custom copilots for generative AI experiences outside of Microsoft 365. With a separate Copilot Studio license, you can create conversational copilots for customers or employees and publish them on various channels, including websites, SharePoint, and social media. This flexibility enables organizations to design unique AI experiences, whether for enhancing customer interactions, streamlining internal functions, or developing innovative solutions. For instance, you can create a copilot for your website to help customers check in-stock items, provide quotes, or book services, or for your SharePoint site to assist employees with HR or IT requests.

Lastly, there is an accelerator called Information Assistant that will create an out-of-the-box solution in your Azure environment that includes a chat bot and the ability to upload your own documents to get answers to your questions using RAG techniques. Check out the code at https://aka.ms/fia and this video on how it works: Information Assistant, built with Azure OpenAI Service (youtube.com).

More info:

Transparency Note for Azure OpenAI Service

Will AI end with SQL, Is this the end of SQL?

Using OpenAI with Structured Data: A Beginner’s Guide | by Margaux Vander Plaetsen | Medium

How to use Azure Open AI to Enhance Your Data Analysis in Power BI (microsoft.com)

Querying structured data with Azure OpenAI | by Valentina Alto | Microsoft Azure | Medium

Using your data with Azure OpenAI Service – Azure OpenAI | Microsoft Learn

Microsoft Copilot or Copilot for Microsoft 365 (M365) or ChatGPT

Generative AI Defined: How it Works, Benefits and Dangers


Transform yourself into an Invaluable Data Leader in just 6 weeks


I’ve collaborated with two industry experts to design a transformative course for data leaders: “The Technical and Strategic Data Leader.” This six-week intensive learning journey is crafted to elevate both your technical skills and strategic acumen.

Facing Data Leadership Challenges?

  • Struggling with the balance between business objectives and technical challenges?
  • Feeling isolated in your leadership role, wishing for a community that understands?
  • Comfortable in technical or business discussions but feel out of depth when crossing over?


What Sets This Course Apart?

  • Beyond Basics: Dive into a curriculum that goes beyond technical training to emphasize the integration of business insight and technological expertise.
  • Collaborative Learning: Engage with live discussions in a collaborative environment, fostering active participation and real-time feedback from industry leaders.
  • Focused Content: Participate in sessions on Data Team Strategy, Business Intelligence, Data Architecture, and Data Leadership.


Join Our Transformative Journey

Our first cohort begins on July 16th. Don’t miss this opportunity for a deep, immersive learning experience that will redefine your leadership and technical prowess in the data realm.

This is the second time we have offered this course, after the tremendous success of the first one. Here is one of many recommendations from the first course:

In weeks two and three, I’ll cover “Intro and Data Architectures: Understanding Common Data Architecture Concepts” followed by “Data Architectures: Gain a Working Understanding of Several Data Architectures.” These sessions are designed to equip you with the foundational knowledge and practical insights into data architectures, crucial for any data leader looking to leverage technology strategically within their organization. Here is the full schedule:

Course Schedule:

Week 1 – Intro and Foundations of Great Data Teams: Purpose and Strategy
Week 2 – Data Architectures: Understanding Common Data Architecture Concepts
Week 3 – Data Architectures: Gain a working understanding of several data architectures
Week 4 – Foundations of Great Data Teams: Part 2
Week 5 – Power BI Architecture: Best Practices in Power BI Administration & Governance
Week 6 – Managing Power BI Developers: Standardizing requirements gathering, UI/UX design, and the dashboard creation process

Learn more and register now at https://www.thedatashop.co/leader.


Copilot in Microsoft Fabric


Microsoft Copilot is an app that uses AI to help you find information, create content, and get things done faster (see What Is Copilot? Microsoft’s AI Assistant Explained).  Copilot is now integrated heavily in Microsoft Fabric to bring new ways to transform and analyze data, generate insights, and create visualizations and reports in Microsoft Fabric and Power BI. I wanted to cover the places you will find Copilot in Fabric.

First, you need to enable Copilot. Note that Copilot in Microsoft Fabric is rolling out in stages, with the goal that all customers with a paid Fabric capacity (F64 or higher) or Power BI Premium capacity (P1 or higher) have access to Copilot. It becomes available to you automatically as a new setting in the Fabric admin portal when it’s rolled out to your tenant. When charging begins for the Copilot in Fabric experiences, Copilot usage will count against your existing Fabric or Power BI Premium capacity.

See the article Overview of Copilot in Fabric for answers to your questions about how it works in the different workloads, how it keeps your business data secure and adheres to privacy requirements, and how to use generative AI responsibly.

Here are the places where you will find Copilot in Fabric:

Copilot for Power BI – Quickly create report pages, natural language summaries, and generate synonyms. As a report author, you can use Copilot to help you write DAX queries, streamline your semantic model documentation, provide a summary about your semantic model, and help you get started with report creation by suggesting topics based on your data. Additionally, Copilot can also create a narrative visual that summarizes a page or a whole report and can generate synonyms for Q&A, to help report readers find what they’re looking for in your reports. You can also ask specific questions about the visualized data on a report page and receive a tailored response. This response includes references to specific visuals, aiding you in understanding the specific data sources contributing to each part of the answer or summary within the report.

Copilot for Data Factory (in preview) – Get intelligent code generation to transform data with ease and code explanations to help you better understand complex tasks. Copilot works with Dataflow Gen2 to: generate new transformation steps for an existing query, provide a summary of the query and the applied steps, and generate a new query that may include sample data or a reference to an existing query.

Copilot for Data Science and Data Engineering (in preview) – Quickly generate code in Notebooks to help work with Lakehouse data and get insights. Copilot for Data Science and Data Engineering is an AI assistant that helps analyze and visualize data. It works with Lakehouse tables and files, Power BI Datasets, and pandas/spark/fabric dataframes, providing answers and code snippets directly in the notebook.  The most effective way of using Copilot is to add your data as a dataframe. You can ask your questions in the chat panel, and the AI provides responses or code to copy into your notebook. It understands your data’s schema and metadata, and if data is loaded into a dataframe, it has awareness of the data inside of the data frame as well. You can ask Copilot to provide insights on data, create code for visualizations, or provide code for data transformations, and it recognizes file names for easy reference. Copilot streamlines data analysis by eliminating complex coding.

Copilot for Data Warehouse – Write and explain T-SQL queries, or even make intelligent suggestions and fixes while you are coding. Key features of Copilot for Warehouse include:

  • Natural Language to SQL: Ask Copilot to generate SQL queries using simple natural language questions.
  • Code completion: Enhance your coding efficiency with AI-powered code completions.
  • Quick actions: Quickly fix and explain SQL queries with readily available actions.
  • Intelligent Insights: Receive smart suggestions and insights based on your warehouse schema and metadata.

There are three ways to interact with Copilot in the Fabric Warehouse editor:

  • Chat Pane: Use the chat pane to ask questions to Copilot through natural language. Copilot will respond with a generated SQL query or natural language based on the question asked.
  • Code completions: Start writing T-SQL in the SQL query editor and Copilot will automatically generate a code suggestion to help complete your query. The Tab key accepts the code suggestion, or keep typing to ignore the suggestion.
  • Quick Actions: In the ribbon of the SQL query editor, the Fix and Explain options are quick actions. Highlight a SQL query of your choice and select one of the quick action buttons to perform the selected action on your query.
    • Explain: Copilot can provide natural language explanations of your SQL query and warehouse schema in comments format.
    • Fix: Copilot can fix errors in your code as error messages arise. Error scenarios can include incorrect/unsupported T-SQL code, wrong spellings, and more. Copilot will also provide comments that explain the changes and suggest SQL best practices.
    • How to: Use Copilot quick actions for Synapse Data Warehouse

Copilot for Real-Time Intelligence (in preview) – Copilot for Real-Time Intelligence lets you effortlessly translate natural language queries into Kusto Query Language (KQL). The copilot acts as a bridge between everyday language and KQL’s technical intricacies, and in doing so removes adoption barriers for citizen data scientists.

Check out Copilot for Fabric Consumption for information on how the Fabric Copilot usage is billed and reported.

Copilot is a fantastic accelerator for building solutions and you can expect to see more of Copilot in Fabric in the near future!

More info:

Copilot in Power Query in Power BI Service and Microsoft Fabric Dataflow Gen2

Exploring the Potential and Pitfalls of Microsoft Fabric Copilot: A Practical Analysis

How to Use Power BI Copilot in Microsoft Fabric


Classifications and sensitivity labels in Microsoft Purview


I see a lot of confusion on how classifications and sensitivity labels work in Microsoft Purview. This blog will help to clear that up, but I first must address the confusion with Purview now that multiple products have been renamed to Microsoft Purview. I decided to use a question-and-answer format that will hopefully clear up the confusion (I was very confused too!):

Microsoft Purview is now the combination of multiple Microsoft products.  Can you explain the differences?

Let’s break Microsoft Purview down into three sections of features that were formerly other products to clarify things:

  • Data governance:  This deals with data catalog, data quality (preview), data lineage, data management, and data estate insights (preview).  The product that had these features was formerly called Azure Purview
  • Data security: Covers data loss prevention, insider risk management, information protection, and adaptive protection.  The product that had these features was formerly called Microsoft Information Protection (MIP)
  • Data compliance: This covers compliance manager, eDiscovery and audit, communication compliance, data lifecycle management, and records management.  The product that had these features was formerly called Microsoft Information Governance

For simplification, you can refer to the data governance features as Azure Purview, and the data security and data compliance features as M365 Purview (M365 is Microsoft 365, previously called Office 365). I will reference these names in this blog post. Azure Purview generally works with products that contain structured data such as SQL Database, ADLS, and Cosmos DB, collecting metadata and classifying data. M365 Purview usually deals with unstructured data such as email and Word and Excel documents, applying sensitivity labels and securing documents so only those with appropriate privileges can view them (Azure Purview has very limited features to secure data).

Azure Purview and M365 Purview are two very different products combined into one, which is confusing as there is not much the two have in common except for sensitivity labels, hence the reason for this blog. Within Microsoft, when a customer asks for a “Microsoft Purview” demo, we always have to ask if they want a demo on the Azure Purview features or the M365 Purview features, as they are both very large products with very different features. Interestingly, to add to the confusion, my Microsoft cohort who demos the M365 piece is also named James!

What is the difference between classifications and sensitivity labels in Purview?

In Microsoft Purview, classifications and sensitivity labels serve distinct purposes. Classifications categorize data based on its content, such as identifying credit card numbers or Social Security Numbers (SSNs). They can be applied to specific data points like columns in a database. Dozens of built-in classifications include personally identifiable information (PII) and financial data, and you can create custom classifications for something like “Customer ID”.

Sensitivity Labels, on the other hand, define how data should be handled and protected, using built-in labels such as Public, General, Confidential, or Highly Confidential, or custom labels you create such as Secret or Top Secret. They enforce protection policies such as encryption and access control and can be applied to various data types including documents, emails, and databases. When it comes to databases, sensitivity labels can be applied at the column level to ensure that data within those columns is handled according to the defined protection policies. For instance, a “Confidential” label might encrypt the data in a column and restrict access to authorized users, while a “Public” label would have no restrictions. 

The key differences between the two are their objectives and functions. Classifications aim to identify and tag data, whereas sensitivity labels focus on protecting data. Classifications are automatically applied by scanning data, while sensitivity labels can be applied manually (for example, see Apply sensitivity labels to your files and email) or automatically based on policies. Essentially, classifications help organize data, while sensitivity labels ensure its protection.  For more info see Classifications vs sensitivity labels.

Classification rules are created within Azure Purview and automatically applied to data sources when the data sources are scanned by Azure Purview (see Data classification in the Microsoft Purview governance portal). Classifications can be applied to tables (for structured data such as CSV, TSV, JSON, SQL Table, etc.) or files (for unstructured data such as DOC, PDF, TXT, etc., see File types supported for scanning). Table assets are not automatically assigned classifications at the table level, because classifications are only automatically assigned to their columns, but you can manually apply classifications to table assets at the table level. A classification can be automatically applied to a file asset. For example, if you have a file named multiple.docx and it has a National ID number in its content, during the scanning process Microsoft Purview adds the classification EU National Identification Number to the file asset’s detail page (a file can have multiple classifications applied to it). To see all the built-in classifications in Azure Purview, check out System classifications in Microsoft Purview.

Sensitivity labels are created within M365 Purview (see Create and publish sensitivity labels). Also within M365, you can set up auto-labeling for items (Office files, Power BI items, files in ADLS, emails), defining the conditions where you want your label to be automatically applied to your data (see apply a sensitivity label to data automatically). So you can, for example, automatically apply a Highly Confidential label to any content that contains customers’ personal information, such as credit card numbers, social security numbers, or passport numbers.

When an office document (as an example) has a sensitivity label that was applied manually or by auto-labeling, and then is scanned by Azure Purview into the Microsoft Purview Data Map, the label will be applied to the data asset within Azure Purview. While the sensitivity label is applied to the actual file in Microsoft Purview Information Protection, it’s only added as metadata in the Microsoft Purview Data Map. 

Within M365, when creating sensitivity labels, you can choose to set up auto-labeling for schematized data assets (such as Azure SQL Database, Azure Synapse, and Cosmos DB), which will automatically apply sensitivity labels to your data in the Microsoft Purview Data Map. You choose the sensitive info types (SIT) that you want to apply to your label, such as driver’s license number, SSN, or passport number (for example, if an SSN is found, the data asset is marked Highly Confidential). Once you create a sensitivity label, you need to scan your data in the Microsoft Purview Data Map to automatically apply the labels you created, based on the auto-labeling rules you defined. Only columns can be tagged as sensitive, not at the table or database level. Scanning an asset in the Microsoft Purview Data Map applies the labels to assets in the catalog based on the SIT found in the data during the scan – it uses the Azure Purview scanning engine to find the SIT, the same scan process it uses to find classified data based on the Azure Purview classification rules. Sensitivity labels are applied only to the asset metadata in the Microsoft Purview Data Map and aren’t applied to the actual files or database columns. These sensitivity labels don’t modify your files or databases in any way. Applying sensitivity labels manually to data sources within Azure Purview is not supported. For more info, see Labeling in the Microsoft Purview Data Map (preview).

To see what data sources and file types support classification and which support sensitivity labeling, check out Microsoft Purview Data Map available data sources.

What Azure database and report products can use Microsoft Purview Information Protection sensitivity labels?

Microsoft Purview Information Protection sensitivity labels provide a simple and uniform way for your users to tag sensitive data within the products SQL Server, SQL Databases (Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics), and Power BI (dashboards, reports, semantic models, dataflows, and paginated reports). For databases, only columns can be tagged as sensitive, not at the table or database level. Any sources tagged with sensitive data in this way will automatically have the sensitivity labels imported into the Microsoft Purview Data Map in Azure Purview when scanned via Azure Purview. Note that SQL Server and SQL Databases offer both SQL Information Protection policy and Microsoft Information Protection policy to apply sensitivity labels. Labels applied via SQL Information Protection policy are NOT imported into the Azure Purview Data Map, only sensitivity labels applied via Microsoft Information Protection policy. Microsoft Purview Information Protection labels give users a uniform way to classify sensitive data across different Microsoft applications, instead of each application classifying data in its own way.

Is there a way to enforce access control on database columns with sensitive data?

Yes!  Azure SQL Database supports the ability to enforce access control on the columns with sensitive data that have been labeled using Microsoft Purview Information Protection sensitivity labels (other sources supported soon are Azure Blob storage, ADLS Gen2, and AWS S3). This enables personas like enterprise security/compliance admins to configure and enforce access control actions on sensitive data in their databases, ensuring that sensitive data can’t be accessed by unauthorized users for a particular sensitivity label. To configure and enforce Purview access policies, the database must be registered in the Microsoft Purview Data Map and scanned by Azure Purview, so that Microsoft Purview Information Protection sensitivity labels get assigned by Azure Purview to the database columns containing sensitive data. Once sensitivity labels are assigned, the user can configure Microsoft Purview Information Protection access policies to enforce deny actions on database columns with a specific sensitivity label, restricting access to sensitive data in those columns to only an allowed user or group of users. Any attempt by an unauthorized user to run a T-SQL query to access columns in an Azure SQL database with a sensitivity label scoped to the policy will fail. This feature requires existing Microsoft Purview accounts to be upgraded to the Microsoft Purview single-tenant model and new portal experience, using the enterprise version of Microsoft Purview. See Enabling access control for sensitive data using Microsoft Purview Information Protection policies (public preview), Enable data policy enforcement on your Microsoft Purview sources, and Authoring and publishing protection policies (preview).

How do sensitivity labels work in Power BI?

Sensitivity labels from Microsoft Purview Information Protection provide a simple way for your users to classify critical content in Power BI without compromising productivity or the ability to collaborate. They can be applied in both Power BI Desktop and the Power BI service, making it possible to protect your sensitive data from the moment you first start developing your content on through to when it’s being accessed from Excel via a live connection. Sensitivity labels are retained when you move your content back and forth between Desktop and the service in the form of .pbix files.

In the Power BI service, sensitivity labels can be applied to semantic models, reports, dashboards, and dataflows. When labeled data leaves Power BI, either via export to Excel, PowerPoint, PDF, or .pbix files, or via other supported export scenarios such as Analyze in Excel or live connection PivotTables in Excel, Power BI automatically applies the label to the exported file and protects it according to the label’s file encryption settings. This way your sensitive data can remain protected, even when it leaves Power BI.  You can require your organization’s Power BI users to apply sensitivity labels to content they create or edit in Power BI.

In addition, sensitivity labels can be applied to .pbix files in Power BI Desktop, so that your data and content is safe when it’s shared outside Power BI (for example, so that only users within your organization can open a confidential .pbix that has been shared or attached in an email), even before it has been published to the Power BI service. See Restrict access to content by using sensitivity labels to apply encryption for more detail.

In the Power BI service, sensitivity labeling does not affect access to content. Access to content in the service is managed solely by Power BI permissions. While the labels are visible, any associated encryption settings (configured in the Microsoft Purview compliance portal) aren’t applied. They’re applied only to data that leaves the service via a supported export path, such as export to Excel, PowerPoint, or PDF, and download to .pbix.

In Power BI Desktop, sensitivity labels with encryption settings do affect access to content. If a user doesn’t have sufficient permissions according to the encryption settings of the sensitivity label on the .pbix file, they won’t be able to open the file. In addition, in Desktop, when you save your work, any sensitivity label you’ve added and its associated encryption settings will be applied to the saved .pbix file.

For more info, see Sensitivity labels from Microsoft Purview Information Protection in Power BI – Power BI | Microsoft Learn.

More info:

Understanding Sensitivity Labels: Set Up and Management Across Power BI, Azure Purview, and O365 (video)

Microsoft 365 Information Protection & How it REALLY Works! (video)

CONFIGURE SENSITIVITY LABELS IN MICROSOFT PURVIEW (video)


Microsoft Fabric reference architecture


Microsoft Fabric uses a data lakehouse architecture, which means it does not use a relational data warehouse (with its relational engine and relational storage) and instead uses only a data lake to store data. Data is stored in Delta lake format so that the data lake acquires relational data warehouse-like features (check out my book that goes into much detail on this, or my video). Here is what a typical architecture looks like when using Fabric (click here for the .vsd):

This section describes the five stages of data movement within a data lakehouse architecture, as indicated by the numbered steps in the diagram above. While I highlight the most commonly used Microsoft Fabric features for each stage, other options may also be applicable:

  1. Ingest – A data lakehouse can handle any type of data from many sources including on-prem and in the cloud. The data may vary in size, speed, and type; it can be unstructured, semi-structured, or relational; it can come in batches or via real-time streaming; it can come in small files or massive ones. ELT processes will be written to copy this data from the source systems into the raw layer (bronze) of the data lake. In Fabric, you will use features such as Azure IoT Hubs and Eventstream for real-time streaming, and data pipelines for batch processing
  2. Store – Storing all the data in a data lake results in a single version of the truth and allows end-users to access it no matter where they are as long as they have a connection to the cloud. In Fabric, OneLake is used for the data lake (which uses ADLS Gen2 under the covers). Fabric shortcuts can be used to access data sources outside of OneLake (such as other data lakes on ADLS Gen2 or AWS S3), and mirroring can be used to copy data from data sources into OneLake in real-time (such as Snowflake or Azure SQL Database)
  3. Transform – A data lake is just storage, so in this step compute resources are used to copy files from the raw layer in the data lake into a conformed folder, which simply converts all the files to Delta format (which under the covers uses Parquet and a Delta log). Then the data is transformed (enriching and cleaning it) and stored in the cleaned layer (silver) of the data lake. Next, the computing tool takes the files from the cleaned layer and curates the data for performance or ease of use (such as joining data from multiple files and aggregating it) and then writes it to the presentation layer (gold) in the data lake. If you are implementing Master Data Management (MDM) with a tool such as Profisee, the MDM would be done between the cleaned and presentation layers. Within the presentation layer you may also want to copy the data into a star schema for performance reasons and simplification. You might need more or less layers in the data lake, depending on the size, speed, and type of your data. Also important is the folder structure of each layer – see Data lake architecture and Serving layers with a data lake. Within all the layers the data can either be stored in a Fabric lakehouse or warehouse (see Microsoft Fabric: Lakehouse vs Warehouse video). In Fabric, for the compute required to transform data, you can use features such as Dataflow Gen2, Spark notebooks, or stored procedures (see the notebook sketch after this list)
  4. Model – Reporting directly from the data in a data lake can be confusing for end-users, especially if they are used to a relational data warehouse. Because Delta Lake is schema-on-read, the schema is applied to the data when it is read, not beforehand. Delta Lake is a file-folder system, so it doesn’t provide context for what the data is. (Contrast that with a relational data warehouse’s metadata presentation layer, which is on top of and tied directly to the data). Also, defined relationships don’t exist within Delta Lake. Each file is in its own isolated island, so you need to create a “serving layer” on top of the data in Delta Lake to tie the metadata directly to the data. To help end users understand the data, you will likely want to present it in a relational data model, so that makes what you’re building a “relational serving layer.” With this layer on top of the data, if you need to join more than one file together, you can define the relationships between them. The relational serving layer can take many forms such as a SQL view or a semantic model in star schema format in Power BI. If done correctly, the end user will have no idea they are actually pulling data from a Delta Lake—they will think it is from the relational data warehouse
  5. Visualize – Once the relational serving layer exposes the data in an easy-to-understand format, end-users can easily analyze the data using familiar tools to create reports and dashboards. In Fabric, you would use Power BI for building reports and dashboards
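
To make step 3 concrete, here is a minimal sketch of what a bronze-to-silver transformation might look like in a Fabric Spark notebook (the table name, folder path, and cleansing rules are made up for illustration; a Fabric notebook provides the spark session automatically):

```python
from pyspark.sql import functions as F

# Read raw files that were landed in the bronze (raw) layer of the lakehouse
bronze = (spark.read.format("csv")
          .option("header", True)
          .load("Files/bronze/orders/"))   # hypothetical folder in the lakehouse Files area

# Clean and conform: fix data types, remove duplicates and obviously bad rows
silver = (bronze
          .withColumn("order_date", F.to_date("order_date"))
          .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
          .dropDuplicates(["order_id"])
          .filter(F.col("amount") > 0))

# Write the cleaned data to the silver (cleaned) layer as a Delta table
silver.write.format("delta").mode("overwrite").saveAsTable("silver_orders")
```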


A few more things to call out in the diagram: a data scientist can use the ML model feature in Fabric on the data in OneLake. Often a sandbox layer is created, where some of the raw data is copied so the data scientist can modify the data for their own purposes without affecting everyone else. Also, Microsoft Purview can be used as a data catalog to make everyone aware of all the data and reports in the data lakehouse and to request access to those items.

In the end, data is moved along in this architecture and copied multiple times, resulting in more cost and complexity. However, all this work done by IT has the benefit of making the data really easy for end-users to query and build reports, resulting in self-service BI, so IT gets out of the business of building reports.


Microsoft Fabric AI Skill


Are you ready to have conversations with your data? Announced in public preview within Microsoft Fabric is AI Skill, a new capability in Fabric that allows you to build your own generative AI experiences. In short, this is a generative AI product that enables a fundamentally new way for you to interact with your data, dramatically increasing the amount of data-driven decision-making. (If you are not familiar with generative AI, LLMs, RAG, ChatGPT, and OpenAI, check out my blogs Introduction to OpenAI and LLMs and Introduction to OpenAI and LLMs – Part 2).

One generative AI product that you may already be familiar with is Fabric Copilot, which helps you explore data more easily. It is intended to be an assistant, and as such, there is an expectation that the end user will work with the AI to verify and approve its outputs.  Copilots are incredibly powerful for anyone performing data tasks in Fabric, but after working with many customers, Microsoft has observed a need for a different experience. An experience that requires even less input from the end user. This is where AI Skill comes in.

What is an AI Skill?
AI skills allow you, a Fabric developer, to create your own conversational Q&A systems in Fabric using generative AI on your structured data (limited to a Lakehouse or Warehouse for now). By putting in effort up front, you can give end users the experience of simply asking a question and getting a reliable, data-driven answer in return. With the AI Skill, you can provide instructions and examples to guide the AI to the correct answer for any given question in your organization. This allows you to ensure that the AI understands your organization and your data context before you share this capability more broadly with others in your organization or team, who can then ask their questions in plain English.

How does the AI Skill work?
The AI skill relies on Generative AI – specifically, Large Language Models (LLMs). These LLMs can generate queries, in this case T-SQL queries, based on a given schema and a user question. The system sends the question asked in the AI skill interface along with information about the selected data (metadata that includes the table and column names, and the data types found in the tables) to the LLM. Next, it requests the generation of a T-SQL query that answers the question. Then it parses the generated query to first ensure that it doesn’t change the data in any way, and then executes that query. Finally, it shows the query execution results. An AI skill is intended to access specific database resources, and then generate and execute relevant T-SQL queries. You can ask questions on the data such as “How many active customers did we have on June 1st, 2013?” or “Which promotion was the most impactful?” or even “What is the largest table?”
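
As a rough illustration of that flow (this is a sketch of the general pattern, not the AI skill's actual implementation; call_llm and run_query are hypothetical stand-ins):

```python
def answer_question(question: str, schema_metadata: str, notes: str, examples: str) -> list:
    # 1. Build a prompt from the selected tables' metadata plus any notes and example queries
    prompt = (
        f"Schema:\n{schema_metadata}\n\n"
        f"Notes:\n{notes}\n\n"
        f"Example queries:\n{examples}\n\n"
        f"Write a T-SQL SELECT statement that answers: {question}"
    )

    # 2. Ask the LLM to generate the T-SQL
    sql = call_llm(prompt)                      # hypothetical LLM call

    # 3. Guardrail: only allow read-only statements before executing anything
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("Only SELECT statements are allowed")

    # 4. Execute the query against the lakehouse/warehouse and return the results
    return run_query(sql)                       # hypothetical query execution
```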

Why do I need an AI Skill?
Many organizations have centralized or embedded analyst groups, who spend significant portions of their day answering data questions that are not necessarily overly complex or nuanced but still require knowledge about query languages (like T-SQL) as well as the data context. As a result, data-driven insights remain locked behind a few groups or individuals, who often struggle to keep up with requests for answers. Responding to constant incoming data questions takes valuable time away from more sophisticated analysis or proactive strategic planning.

Generative AI offers a promising solution to this challenge. Generative AI has proven to be adept at writing queries. The part that is still missing in many generative AI applications is the nuance and context that comes with any real-world data system. AI skills allow you to capture this context and nuance in a way that allows the AI to understand your data systems fully. You cannot expect a newly hired analyst to immediately and reliably tackle all incoming data questions on day one. You expect them to gradually learn about your metrics, definitions, and data quirks. In the same way, you cannot expect an AI to be perfectly accurate in answering questions unless you give it the full set of background information that it needs to answer your questions, which is what AI Skill allows you to do.

How do I configure AI Skill?
You should expect to handle some necessary configuration steps before the AI skill works properly. An AI skill can often provide out-of-the-box answers to reasonable questions, but it could provide incorrect answers for your specific situation. These incorrect answers typically occur because the AI is missing context about your company, setup, or definition of key terms. To solve the problem, you can customize it:

  • Focus the AI: Select the data you want the AI to access, to focus its scope on specific tables in your databases.
  • Configure through instructions: Provide instructions in English via “Notes for model” to guide the AI to follow rules or definitions. For example: “Whenever I ask about ‘the most sold’ products or items, the metric of interest is total sales revenue, and not order quantity” or “The primary table to use is the FactInternetSales – only use FactResellerSales if explicitly asked about resales or when asked about total sales”. This would be especially useful if table or field names in your database are cryptic, and you can provide better descriptions – “The table T116 contains customer data”.
  • Configure through examples: Give example question and query pairs via “Example SQL queries” that the AI can use to answer similar questions. You will have a “question” section and a corresponding “SQL query” section that you will populate. For example: In the question section enter “What was the largest shipping delay in days?” and in the SQL Query section enter “Select max(datediff(day, OrderDate, ShipDate)) as LargestShippingDelayInDays from dbo.factinternetsales”.

What is the difference between AI Skill and Copilot?
The technology behind the AI skill and Fabric Copilot is similar. They both use Generative AI to reason over data. However, they have some key differences:

  • Configuration: With an AI skill, you can configure the AI to behave the way you need. You can provide it with instructions and examples that tune it to your specific use case. A Fabric Copilot doesn’t offer this configuration flexibility.
  • Use Case: A Copilot can help you do your work in Fabric. It can help you generate Notebook code or Data Warehouse queries. In contrast, the AI skill operates independently, and you can eventually connect it to Microsoft Teams and other areas outside of Fabric.

To summarize AI Skill in a couple of sentences: it creates your own conversational Q&A system that allows you to ask plain English questions that, via LLMs, are turned into T-SQL queries which it executes on your structured data and returns the results that answer the question. You can improve the answers via model notes and SQL examples sent to the LLM. Basically, it’s using RAG on one of the GPT models to pass metadata, model notes and example SQL queries to the LLM. It’s Copilot on your data!

Currently you can only use the interface in Fabric for AI Skill, but eventually there will be integration with Copilot Studio and M365 chat.

The AI Skill experience is now available for all customers when using Fabric F64 or larger capacity. Please note that your tenant admin must enable this preview experience and the Fabric AI setting before you can try it out. Also, check out the current AI Skill limitations.

Click here to learn more about AI Skill. You can set up an environment to demo AI Skill via AI skill example with the AdventureWorks dataset (preview). Of note in this demo is that it creates a lakehouse using AdventureWorksDW, which can be very useful to demo other features in Fabric.

This is an amazing new feature, and we are now at the point of generative AI working on structured data and not just text!

More info:

Microsoft Fabric: AI Skills Preview! Build RAG Patterns Simple and Easy!! (video)


Microsoft Purview FAQ


I get many of the same questions about Microsoft Purview, so I wanted to list those common questions here along with their answers. If your question is not answered here, please put it in the comments and I will reply:

Microsoft Purview is now the combination of multiple Microsoft products.  Can you explain the differences?

Let’s break Microsoft Purview down into three sections of features that were formerly other products to clarify things:

  • Data governance:  This deals with data catalog, data quality (preview), data lineage, data management, and data estate insights (preview).  The product that had these features was formerly called Azure Purview
  • Data security: Covers data loss prevention, insider risk management, information protection, and adaptive protection.  The product that had these features was formerly called Microsoft Information Protection (MIP)
  • Data compliance: This covers compliance manager, eDiscovery and audit, communication compliance, data lifecycle management, and records management.  The product that had these features was formerly called Microsoft Information Governance

For simplification, you can refer to the data governance features as Azure Purview, and the data security and data compliance features as M365 Purview (M365 is Microsoft 365, previously called Office 365). I will reference these names in this blog post. Azure Purview generally works with products that contain structured data such as SQL Database, ADLS, and Cosmos DB, collecting metadata and classifying data. M365 Purview usually deals with unstructured data such as email and Word and Excel documents, applying sensitivity labels and securing documents so only those with appropriate privileges can view them (Azure Purview has very limited features to secure data).

Can you explain the Microsoft Purview compliance portal and the Microsoft Purview governance portal?

In Microsoft Purview, you have the choice of using the classic portal or the new unified portal (which GA’d on 8/1/24). The classic portal means that the Microsoft Purview governance portal (https://web.purview.azure.com/) and the Microsoft Purview compliance portal (https://compliance.microsoft.com/) are completely separated. The new unified portal, available by flipping the “New Microsoft Purview portal” switch that is at the top of any of the web pages in Microsoft Purview, combines the two portals (https://purview.microsoft.com/). See Learn about the Microsoft Purview portal. Customers using the Azure government portal (GCC High, DoD) would have URLs ending in .us instead of .com and will be rolled out to the new unified portal starting 8/30/24, along with customers in GCC (see roadmap item). However, these customers will not see the solution icons Data Map and Data Catalog in the new unified portal (when they will is TBD). The compliance portal will be deprecated starting 11/4/24 (but still available for 36 months), and there is no ETA for deprecating the governance portal. This new unified portal, in addition to a new easier-to-use menu layout, has new data governance features such as business domains, data products, data quality, data product search, data access, health controls, and metadata quality (these will start to GA on September 1st, 2024 in the 26 commercial regions that Microsoft Purview is available in – see rollout schedule). This takes Azure Purview from a PaaS solution to Microsoft Purview, a SaaS solution. The one government region that Microsoft Purview is in (USGov Virginia) won’t have the new data governance features until late this year or early next year.

Is the new portal and new data governance features available for customers using GCC?

While a GCC customer can turn on the new Purview portal starting 8/30/24, the Data Map and Data Catalog icons are not visible.  GCC customers will have those icons working on the new portal, as well as the new data governance features, on the same timeline as Azure Gov customers (late this year or early next year).  This is because Purview for GCC customers is considered part of the GCC M365 suite due to the M365 Purview features (i.e. Microsoft Information Protection).  GCC customers using Purview will have the M365 Purview features using resources in the Gov cloud, while the Azure Purview governance features are using resources in the commercial cloud.  This means Azure Purview under the GCC tenant is able to scan resources residing in Azure commercial.  

Do you have a slide that covers the first two questions?

Sure:

What can you use the “Request access” feature on (to request access to a data asset)?

The “Request access” feature can be used for two things: 1) requesting access to a physical asset, and 2) requesting access to a data product that may contain multiple physical assets.  Note #2 is only available in the new portal and currently does not automatically assign read permissions to the assets – it must all be done manually, and to review requests you would go to Data Catalog → Data management → Data access.

For a physical asset, will “Request access” automatically assign read permissions to an asset for the requester?

For a few assets (Azure Blob Storage, ADLS Gen2, Azure SQL Database), when you click “Request access” and it is approved, and if the data source is registered for data policy enforcement, it will automatically assign read permission to the asset for the requester (via a data access policy that gets auto-generated and applied against the respective data source to grant read access to the requester), and you can see approved asset requests on the Self-service access policies screen.  Otherwise, it just creates a task that is assigned to a user or a Microsoft Entra group who will need to manually provide access to the assets for the requester and then approve the request.  Note that for the “Request access” option to be available for an asset, a self-service access workflow needs to be created and assigned to the collection where the asset is registered (“Request access” can be made available for any asset, but there are only a few assets (Azure Blob Storage, ADLS Gen2, Azure SQL Database) where it will actually grant access; the rest will need to be manually provided). Also note automatic read permission can only be applied to the requester – applying read permission to a group or moving the requester to a group that has read permission is not supported. [NOTICE: automatic permission provisioning is not yet available in the new portal – slated for Q2CY25].  For the classic portal, automatic permission provisioning is in private preview for SQL MI, SQL Server 2022, Azure Databricks (Unity Catalog), and Snowflake.

As an example, an end-user can be browsing folders in Purview and find one that contains files the end-user would like to use. That person would request access to the folder through Purview (via the “Request access” button), which triggers a self-service data access workflow, and if the access is approved, that person would be able to use a tool outside of Purview to read the files (such as Power BI).  This can also be done for storage accounts, containers, folders, individual files, databases, or tables in a database.

Can Purview scan on-prem file systems?

Yes, Microsoft Purview does have the capability to scan on-premises file systems.  The metadata curated at the end of the scan process can include data asset names such as table names or file names, file size, columns, and data lineage among other details.  To leverage this functionality, you would typically configure a self-hosted integration runtime which provides a bridge between your on-premises network and the cloud-based Microsoft Purview service. Once this is set up, you can proceed to register and configure your on-premises file system as a data source in Purview for scanning.

What security permissions are needed to register and scan a data source? 

You’ll need to be a Data Source Admin and one of the other Microsoft Purview Data Map roles (for example, Data Reader or Data Share Contributor) to register a source and manage it in the Microsoft Purview data map. See our Microsoft Purview Permissions page for details on roles and adding permissions (or Understand access and permissions in the classic Microsoft Purview governance portal | Microsoft Learn if using the classic portal).  When you go to register a source, you choose one of your subscriptions and the items for that source that you have access to are displayed.  For example, for ADLS Gen2, a list of storage account names would be displayed.  Most data sources have prerequisites to register and scan them in Microsoft Purview. For example, to scan ADLS Gen2, the storage account must have the role “Storage Blob Data Reader” assigned to the Microsoft Purview account name.  For a list of all available sources, and links to source-specific instructions to register and scan, see our supported sources article.  Click on a source to get details on how to register and scan it.

Does Purview connect to Databricks Unity Catalog?

Yes, see https://learn.microsoft.com/en-us/purview/register-scan-azure-databricks-unity-catalog.  How they work together: Best data governance tool: Databricks Unity Catalog or Microsoft Purview? | LinkedIn.

Can Purview scan Databricks tables without Unity Catalog?

Microsoft Purview can scan Databricks tables without requiring Unity Catalog, but using Unity Catalog offers significant benefits that enhance data governance and metadata management. Without Unity Catalog, Microsoft Purview connects directly to Azure Databricks and can scan tables to gather metadata (using the Hive Metastore), such as table schemas and columns, and perform basic classification tasks. However, this method may have limitations in metadata detail and advanced governance features.

Unity Catalog provides a centralized and standardized metadata layer for all data assets within Databricks, which Microsoft Purview can leverage to access richer metadata and comprehensive data classifications. This integration simplifies metadata management and enhances data governance by offering fine-grained access control, auditing, and compliance features. Unity Catalog streamlines the integration process by centralizing metadata, making it easier for Purview to scan and classify data assets effectively.

However, it’s important to note that Microsoft Purview does not currently support lineage extraction from Unity Catalog. This means that while Purview can identify and classify data assets within Unity Catalog, it cannot track the flow of data between different stages or transformations within Databricks. For lineage extraction, Purview relies on scanning the Hive Metastore within Databricks, which does provide lineage information.  Note that Microsoft Purview requires either the Hive Metastore or Unity Catalog to scan Databricks.  See Connect to and manage Azure Databricks Unity Catalog and Connect to and manage Azure Databricks.

Can Purview scan PDF documents or Word documents?

This applies to M365 Purview:

Yes, Microsoft Purview (formerly known as Microsoft Information Protection) can scan PDF documents, Word documents, and other types of files for sensitive information. Microsoft Purview offers a comprehensive range of compliance and risk management solutions, including information protection, data loss prevention (DLP), governance, and compliance capabilities across your Microsoft 365 and Office 365 services, as well as other environments.

For PDF documents, it uses Optical Character Recognition (OCR) to scan content in images for sensitive information (see Learn about optical character recognition in Microsoft Purview). This feature is optional and can be enabled at the tenant level. Once enabled, you can select the locations where you want to scan images. Please note that each page in a PDF file is charged separately. For example, if there are 10 pages in a PDF file, an OCR scan of the PDF file counts as 10 separate scans.

This applies to Azure Purview:

For Word documents, the scanning process establishes a connection to the data source and captures technical metadata like names, file size, columns, and so on. It also extracts schema for structured data sources, applies classifications on schemas, and applies sensitivity labels if your Microsoft Purview Data Map is connected to a Microsoft Purview compliance portal. The scanning process can be triggered to run immediately or can be scheduled to run on a periodic basis to keep your Microsoft Purview account up to date.

It doesn’t store the documents themselves but rather metadata about these documents and insights into the sensitive information they contain.

When it comes to a Word document (or any other document type like PDFs), Microsoft Purview can:

  1. Scan and classify the document based on its content, identifying sensitive information.
  2. Catalog the classification and metadata about the document in the Purview Data Map. This includes information like where the document is stored, the type of sensitive information it contains, and how it’s classified.
  3. Enable governance policies to be applied based on this classification, such as protection actions, access controls, and monitoring.

However, the actual content of the Word document remains stored in its original location. The Purview Data Catalog focuses on managing the metadata and governance policies around the document, rather than the document file itself.  The text content of a Word document does not get stored in the Microsoft Purview Data Catalog.

In Azure Purview, scanning a data asset such as ADLS will return just basic file info for documents.  Document types in ADLS that it will scan: DOC, DOCM, DOCX, DOT, ODP, ODS, ODT, PDF, POT, PPS, PPSX, PPT, PPTM, PPTX, XLC, XLS, XLSB, XLSM, XLSX, XLT (structured data it scans: CSV, JSON, PSV, SSV, TSV, GZIP, TXT, XML, PARQUET, AVRO, ORC).  See File types supported for scanning.

Azure Purview can scan cloud storage locations like Blob storage, ADLS, AWS S3, etc. but SharePoint and OneDrive for Business locations are only scanned with M365 Purview.

Can you explain the following concepts: Domains, Business Domains, Collections, Data Products, and Data Assets?

Check out Explaining Purview concepts: Domains, Business Domains, Collections, Data Products and Data Assets. (microsoft.com).

Is there customer training?

Here is click-through training:

Are there Purview best practices?

Yes, starting at Microsoft Purview (formerly Azure Purview) accounts architecture and best practices.

What is the difference between classifications and sensitivity labels in Purview?

See prior blog.

What Azure database and report products can use Microsoft Purview Information Protection sensitivity labels?

See prior blog.

Is there a way to enforce access control on database columns with sensitive data?

See prior blog.

How do sensitivity labels work in Power BI?

See prior blog.

The post Microsoft Purview FAQ first appeared on James Serra's Blog.

Microsoft Purview GA menus

The new data governance features in Microsoft Purview are now being made generally available as they are gradually rolled out across various regions. You can view the deployment schedule at New Microsoft Purview Data Catalog deployment regions. These enhanced features include business domains, data products, data quality, data product search, data access, health controls, and metadata quality, which I previously discussed in my blog post Microsoft Purview new data governance features.

Additionally, with each region that reaches general availability, the Data Catalog in the new unified portal will receive an updated menu layout. Below, I’ve outlined the key changes to the menu. This will help those of you, like me, who have been using the new unified portal and have become familiar with that layout (for those looking for info on the menu changes between the classic portal and new portal, go to Learn about the Microsoft Purview portal):

Data search –> moved under menu “Discovery” -> Data assets
Data product search –> moved under new menu “Discovery” -> Data products

Data management –> renamed “Catalog management”
Business domains –> Governance domains
Data quality –> moved under Health management
Data access –> Requests

Data estate health –> renamed “Health management”
Health controls –> Controls
Health actions –> Actions
Metadata quality –> Moved into Controls page (edit a control and go to the Rules page)

Roles and permissions –> moved under Settings -> Solution settings -> Data Catalog

Business domains page, dropdown menu on “Business domains”
Business domains –> dropdown removed
Glossaries (classic) –> Catalog management -> Classic types -> Glossaries
Business assets (classic) –> Catalog management -> Classic types -> Business assets
Asset types (classic) –> Catalog management -> Classic types -> Asset types (also: Data Map -> Metamodel -> Asset types)
Managed attributes (classic) –> Catalog management -> Classic types -> Managed attributes (also: Data Map -> Metamodel -> Managed attributes)

NEW menu item:
Discovery -> Enterprise glossary [to search for any published (in any business domain) Glossary Terms, Critical Data Elements and OKRs]

The post Microsoft Purview GA menus first appeared on James Serra's Blog.

Get mentored and coached by me and other industry experts!

This fall you can take the next step in your data leadership journey by joining a cohort of industry peers and getting mentored by experts in the field.

In this cohort you will get personalized interaction and coaching from me, collaboration with your industry peers, and guidance around the biggest challenges you face as a data leader.

Imagine what it would be like…

—————————

…if you had mentors to guide you on this journey and a group of industry peers to walk with you?

…if you felt less alone in the challenges you faced as a data leader, and had the resources and connections to overcome obstacles?

…if you could feel confident about the role your data team plays in your organization because you have the tools and frameworks to build a great team?

…if technical data architecture no longer felt like a foreign language that you couldn’t understand, but something you strategically leverage to deliver value?

—————————

Here’s what some past cohort participants have said:

Sammantha (Performance Management Consultant) said: “I’m a lot more comfortable in conversations about data architecture in the cloud. It was incredibly insightful and practical.”

Cole (Director of Analytics) said: “I would absolutely recommend the TSDL to anyone who is struggling with the challenges of maturing the data practice inside of an organization. The content is relatable and relevant to the issues data teams face today.”

Josh (Associate Director of Enterprise BI) said: “This was the most unique training I have taken in the data space.”
—————————

We built The Technical and Strategic Data Leader to meet these needs. The cohort is launching October 8th.

Space is limited. Additional details and registration info are available here: thedatashop.co/leader

The post Get mentored and coached by me and other industry experts! first appeared on James Serra's Blog.

European Microsoft Fabric Community Conference announcements

A TON of new features were announced at the European Microsoft Fabric Community Conference held last week. The full list is here, and I wanted to list my favorite announcements from that list:

  • Access Databricks Unity Catalog tables from Fabric (public preview): You can now access Databricks Unity Catalog tables directly from Fabric. In Fabric, you can now create a new data item called “Mirrored Azure Databricks Catalog”. When creating this item, you simply provide your Azure Databricks workspace URL and select the catalog you want to make available in Fabric. Rather than making a copy of the data, Fabric creates a shortcut for every table in the selected catalog. It also keeps the Fabric data item in sync. So, if a table is added or removed from UC, the change is automatically reflected in Fabric. Once your Azure Databricks Catalog item is created, it behaves the same as any other item in Fabric. Seamlessly access tables through the SQL endpoint, utilize Spark with Fabric notebooks and take full advantage of Direct Lake mode with Power BI reports. To learn more about Databricks integration with Fabric, see our documentation here.
  • Copilot for Data Warehouse (public preview): Copilot for Data Warehouse is an AI assistant that helps developers generate insights through T-SQL exploratory analysis. Copilot is contextualized to your warehouse’s schema. With this feature, data engineers and data analysts can use Copilot to: Generate T-SQL queries for data analysis; Explain and add in-line code comments for existing T-SQL queries; Fix broken T-SQL code; Receive answers regarding general data warehousing tasks and operations. Learn more about Copilot for Data Warehouse. Copilot for Data Warehouse is currently only available in the Warehouse. Make sure you have Copilot enabled in your tenant and capacity settings to take advantage of these capabilities. Copilot in the SQL analytics endpoint is coming soon.
  • Database Migration Experience (private preview): We are excited to announce the opening of a Private Preview for a new Migration Experience. Designed to accelerate the migration of SQL Server, Synapse dedicated SQL pools, and other warehouses to the Fabric Data Warehouse, users will be able to migrate the code and data from the source database, automatically converting the source schema and code to Fabric Data Warehouse, helping with data migration, and providing AI powered assistance. Please contact your Microsoft account team if you are interested in joining the preview.
  • T-SQL Notebook (public preview): You can now use Fabric Notebooks to develop your Fabric warehouse and consume data from your warehouse or SQL analytics endpoint. The ability to create a new notebook item from the warehouse editor lets you carry over your warehouse context into the notebook and use the rich capabilities of notebooks to run T-SQL queries. The T-SQL notebook enables you to execute complex T-SQL queries, visualize results in real time, and document your analytical process within a single, cohesive interface. The embedded rich T-SQL IntelliSense and easy gestures like Save as table, Save as view, or Run selected code provide familiar experiences in the notebook to increase your productivity. Learn more here.
  • Share Feature for Fabric AI Skill (public preview): The highly anticipated “Share” capability for Fabric AI Skill is now in public preview. This powerful addition allows you to share the AI Skill with others using a variety of permission models, providing you with complete control over how your AI Skill is accessed and utilized. With this new feature, you can: Co-create: Invite others to collaborate on the development of your AI Skill, enabling joint efforts in refining and enhancing its functionality; View Configuration: Allow others to view the configuration of your AI Skill without making any changes; Query: Enable others to interact with the AI Skill to obtain answers to their queries. Additionally, we are introducing flexibility in managing versions. You can now switch between the published version and the current version you are working on. This feature facilitates performance comparison by running the same set of queries, providing valuable insights into how your changes impact the AI Skill’s effectiveness. We’ve also refined the publishing process. You can now include a description that outlines what your AI Skill does. This description will be visible to users, helping them understand the purpose and functionality of your AI Skill.
  • Real-time Intelligence:
    • Creating a Real-Time Dashboard with Copilot: From the list of tables in Real-Time hub, users can click on the three dots menu and select create real-time dashboard. Copilot will review the table and automatically create a dashboard with two pages: one with insights about the data in the table, and one that contains a profile of the data with a sample, the table schema, and more details about the values in each column. This can be further explored and edited to make it easy for users to find insights on their time-series data without having to write a single line of code.
    • Four new Eventstream connectors have been introduced into the Real-Time hub. Now you can stream data from Azure SQL MI DB (CDC), SQL Server on VM DB (CDC), Apache Kafka, and Amazon MSK Kafka. 
    • Set Alerts Based on KQL Query Results or Specific Conditions: With this new feature, you can set alerts to trigger based on specific results or conditions from a scheduled KQL query. For example, if your KQL DB tracks application logs, you can configure an alert to notify you if the query, scheduled at a frequency of your choice (e.g., every 5 minutes), returns any logs where the message field contains the string “error”. This feature also lets you monitor live data trends by setting conditions on visualizations, similar to how you can set alerts on visuals within Real-Time Dashboards. For instance, if you visualize sales data distribution across product categories in a pie chart, you can set an alert to notify you if the share of any category drops below a certain threshold. This helps you quickly identify and address potential issues with that product line. You can choose whether to receive alerts via email or Teams messages when the condition is met. To read more about setting alerts for KQL querysets, check out the documentation.
    • Real-Time Dashboard lower-than-ever refresh rate: We are pleased to share an enhancement to our dashboard auto refresh feature, now supporting continuous and 10-second refresh rates, in addition to the existing options. This upgrade, addressing a popular customer request, allows both editors and viewers to set near real-time and real-time data updates, ensuring your dashboards display the most current information with minimal delay. Experience faster data refresh and make more timely decisions with our improved dashboard capabilities. As the dashboard author you can enable the Auto refresh setting and set a minimum time interval, to prevent users from setting an auto refresh interval smaller than the provided value. Note that the Continuous option should be used with caution: the data is refreshed every second, or after the previous refresh is completed if it takes more than 1 second.
    • Real-Time Intelligence Copilot conversational mode: We’d like to share an upgrade to our Copilot assistant, which translates natural language into KQL. Now, the assistant supports a conversational mode, allowing you to ask follow-up questions that build on previous queries within the chat. This enhancement enables a more intuitive and seamless data exploration experience, making it easier to refine your queries and dive deeper into your data, all within a natural, conversational flow.
  • Deeper integration with Microsoft Purview, Microsoft’s unified data security, data governance, and compliance solution. Coming soon, security admins will be able to use Microsoft Purview Information Protection sensitivity labels to manage who has access to Fabric items with certain labels—similar to Microsoft 365. Also coming soon, we are extending support for Microsoft Purview Data Loss Prevention (DLP) policies, so security admins can apply DLP policies to detect the upload of sensitive data, like social security numbers, to a lakehouse in Fabric. If detected, the policy will trigger an automatic audit activity, can alert the security admin, and can even show a custom policy tip to data owners to remedy themselves. These capabilities will be available at no additional cost during preview in the near term, but will be part of a new Purview pay-as-you-go consumptive model, with pricing details to follow in the future. Learn more about how to secure your Fabric data with Microsoft Purview by watching the following video.
  • Incremental refresh for Dataflow Gen2: This significant enhancement in Microsoft Fabric’s Data Factory is designed to optimize data ingestion and transformation, particularly as your data continues to expand. More info
  • Invoke remote pipeline in Data pipeline (public preview): We have now added the exciting ability to call pipelines from Azure Data Factory (ADF) or Synapse Analytics pipelines as a public preview. This opens tremendous possibilities to utilize your existing ADF or Synapse pipelines inside of a Fabric pipeline by calling it inline through this new Invoke Pipeline activity. Use cases that include calling Mapping Data Flows or SSIS pipelines from your Fabric data pipeline will now be possible. More info
  • New Azure Data Factory Item: Bring your existing Azure Data Factory (ADF) to your Fabric workspace. We are introducing a new preview capability that allows you to connect to your existing ADF factories from your Fabric workspace. By clicking “Create Azure Data Factory” inside of your Fabric Data Factory workspace, you will now be able to fully manage your ADF factories directly from the Fabric workspace UI. Once your ADF is linked to your Fabric workspace, you’ll be able to trigger, execute, and monitor your pipelines as you do in ADF but directly inside of Fabric. More info
  • Copy Job (public preview): We’d like to introduce Copy Job, elevating the data ingestion experience to a more streamlined and user-friendly process from any source to any destination. Now, copying your data is easier than ever before. Moreover, Copy job supports various data delivery styles, including both batch copy and incremental copy, offering flexibility to meet your specific needs. Click here to learn more about Copy Job.

More info:

Microsoft Fabric Conference Europe Recap: Copilot, Real-Time Intelligence and More

European Fabric Community Conference 2024: Building an AI-powered data platform

Recap of Data Factory Announcements at Fabric Community Conference Europe

Announcing Updates to Data Activator in Public Preview

Fabric Community Conference Europe Recap

The post European Microsoft Fabric Community Conference announcements first appeared on James Serra's Blog.

Benefits of Migrating from Azure Synapse Analytics to Microsoft Fabric

Many customers ask me about the advantages of moving from Azure Synapse Analytics to Microsoft Fabric. Here’s a breakdown of the standout features that make Fabric an appealing choice:

  • Unified Environment for All Users
    Fabric serves everyone—from report writers and citizen developers to IT engineers—unlike Synapse, which primarily targets IT professionals.
  • Hands-Free Optimization
    Fabric is auto-optimized and fully integrated, allowing most features to perform well without requiring technical adjustments.
  • Simplified Data Storage with OneLake and Shortcuts
    OneLake and shortcuts make data storage and access straightforward, enhancing data management efficiency.
  • Open-Source Delta Lake Compatibility
    Fabric stores all data in open-source Delta Lake format, enabling easy integration with Databricks and other analytics platforms.
  • SaaS Architecture for Ease of Use
    As a Software-as-a-Service (SaaS) platform, Fabric simplifies development and maintenance, unlike the more hands-on Synapse.
  • Enhanced Cost Control through Fabric capacities
    Fabric’s capacity-based pricing model provides straightforward cost management.
  • Scalability from Small to Enterprise Solutions
    Migrating a small Fabric solution to an enterprise-scale solution is seamless and efficient.
  • Direct Lake for Faster Query Performance
    Direct Lake enhances query performance and speeds up reporting.
  • Unified Compute with Capacity Pooling
    Fabric’s “universal bucket of compute” allows all products to share the same capacity, instead of each having individual compute, simplifying resource management.
  • Universal Security (Available Q4 2024)
    Fabric’s upcoming OneSecurity, or “universal security,” will provide centralized security across datasets.
  • Integrated Copilot for Enhanced Development
    Copilot integration aids in development and data insights, making it easier to analyze and understand data.
  • Seamless Data Sharing Across Workspaces
    Data can be easily shared between workspaces for improved collaboration.
  • Multi-Cluster Compute
    Fabric supports multiple capacities accessing the same data, similar to “multi-cluster compute” or “multi-cluster warehousing”, optimizing concurrency.
  • Cost Savings Compared to Dedicated SQL Pools
    Fabric offers a more cost-effective solution than dedicated SQL pools.

Cost-Saving Insights with Microsoft Fabric

Additionally, here are indirect ways customers are saving costs with Fabric compared to Synapse:

  • Consolidated Technology Stack
    A single platform for IT and citizen developers (no separate learning curves for Synapse and Power BI).
  • Reduced DBA Needs
    Auto-optimization and integration lessen the demand for dedicated database administration.
  • Storage and Egress Cost Savings
    OneLake shortcuts reduce storage and egress charges, and simplify ETL development.
  • Delta Lake Storage for Flexible Use
    Data stored in Delta Lake format is accessible outside Fabric, providing greater flexibility.
  • SaaS Benefits
    The SaaS model reduces development overhead and streamlines updates.
  • Unified Compute with Capacity Pooling
    Shared compute across capacities reduces costs by eliminating individual compute allocations per service.
  • Copilot Acceleration
    Copilot speeds up development, reducing time-to-insight.
The post Benefits of Migrating from Azure Synapse Analytics to Microsoft Fabric first appeared on James Serra's Blog.

Microsoft Ignite Announcements Nov 2024

Announced at Microsoft Ignite last week were some new product features related to the data platform and AI. Check out the Major announcements and Book of News. Below are the ones I found most interesting:

Fabric-related:

  • Fabric Databases, now in public preview and being rolled out to various regions (it will be available in all Fabric regions by early December). Fabric Databases represent a new class of cloud databases that brings a world-class transactional database natively to Microsoft Fabric.  Fabric now brings together both transactional and analytical workloads, creating a truly unified data platform.  SQL database, the first database available in Fabric Databases, was built on the SQL Server engine and the simple and intuitive SaaS platform of Fabric.  Data professionals who’ve tried SQL database in Fabric were able to complete common database tasks up to 71% faster and with 63% more effective task completion. Data in SQL database is automatically mirrored to Fabric OneLake, making it easy to combine the SQL database data with other data and making it available to other platforms. SQL database is just the beginning for Fabric Databases, with more databases on the roadmap.  Customer scenarios I can envision to use a Fabric SQL database include a: 1) Metadata driven framework, where you need a control table in a database to drive an ETL process (see build large-scale data copy pipelines with metadata-driven approach in copy data tool); 2) Data warehouse, when the scale of Fabric Data Warehouse is not needed or you need to use a T-SQL feature not available in a Fabric Data Warehouse; 3) Digital native app development by low-code or no-code developers, where a database is needed and the developers don’t know anything about managing databases and they don’t want to know; 4) Curated data/Reverse ETL/Cache, where data is copied from the data warehouse into a SQL database for analysts to query, especially when the analysts want the interface to the data to look like it did before (by using their existing tools); 5) Hybrid application, with the operational data tier in Fabric and the operational application tier in Azure, and the operational data will be a source ingested into the Fabric data warehouse; 6) Power BI Writeback Function, where you create a user data function (in preview) inside of Fabric that updates data in a SQL Database. Then call that function from a Power BI report (via a Power BI button that calls that function). Make sure to check out the limitations in SQL database in Microsoft Fabric and Features comparison: Azure SQL Database and SQL database in Microsoft Fabric (preview). SQL database in Fabric will be free until January 1, 2025, after which compute and data storage charges will begin, with backup billing starting on February 1, 2025. To learn more, read the Fabric Databases blog post, watch the Microsoft Mechanics deep dive video or sizzle video, watch these Ignite sessions: Fuel AI innovation with Azure Databases, Use AI with the latest Azure SQL innovations to transform your data, Power AI apps with insights from SQL database in Fabric, and check out the Learning pathways for SQL database in Microsoft Fabric.
  • OneLake catalog, which is an evolution of the OneLake data hub. It is a complete solution to explore, manage, and govern your entire Fabric data estate. The OneLake catalog comes with two tabs, Explore and Govern, that can help all Fabric users discover and manage trusted data, as well as provide governance for data owners with valuable insights, recommended actions, and tooling.  The Explore tab is now generally available, and the Govern tab will be coming soon in preview.  Learn more about the OneLake catalog by reading this blog post and by watching the demo, or watch this Ignite session: Ingest, govern, and secure your data with OneLake.
  • The preview of open mirroring, a feature that allows any application or data provider to write change data directly into a mirrored database within Fabric. Microsoft’s Open Mirroring partner ecosystem continues to grow with publicly available solutions from Striim, Oracle GoldenGate, and MongoDB, with DataStax’s solution coming soon. Watch this Ignite session: Ingest, govern, and secure your data with OneLake.
  • The public preview of SQL MI mirroring, seamlessly synchronizing data from operational databases within Azure SQL Managed Instance into Microsoft Fabric’s OneLake.
  • The preview of the Copilot in Fabric experience for data pipelines in Fabric Data Factory.  These features function as an AI expert to help users build, troubleshoot, and maintain data pipelines.
  • Coming soon, the preview of AI skill enhancements, including a more conversational experience and support for semantic models and Eventhouse KQL databases. 
  • Coming soon, the preview of AI skill integration with Agent Service in the newly announced Azure AI Foundry, allowing developers to use AI skills as a core knowledge source.
  • The preview of workspace monitoring, which provides detailed diagnostic logs for workspaces to troubleshoot performance issues, capacity performance, and data downtime.
  • The preview of further integration with Microsoft Purview including extending Protection policies to enforce access permissions to more sources and using Data Loss Prevention policies to restrict access to semantic models with sensitive data. 
  • Tenant switcher control – The tenant switcher is now available in the Fabric portal. Users with access to more than one Fabric tenant can easily switch between tenants directly from the account manager in the top right corner of the Fabric portal. This is in addition to the existing From External Orgs tab that can be found in the home page of the Power BI experience.
  • Microsoft Fabric SKU Estimator, in private preview. This is an evolution of the Microsoft Fabric Capacity Calculator, designed to help customers and partners accurately estimate their capacity requirements and identify the appropriate SKU for their needs.
  • Upcoming Changes to the Fabric Navigation Experience, designed to enhance your navigation experience with Microsoft Fabric. These updates aim to simplify your workflow and make navigation more intuitive. It removes the granular persona/workload-based experience we currently have, in favor of a simplified two-experience model: Fabric or Power BI. The Fabric model is a new workspace-centric navigation and task-oriented item creation flow that allows you to focus on your projects without the distraction of selecting a specific workload. The Power BI model is designed for users focused on exploring insights in reports, apps, and semantic models. It features an item-first approach, providing direct access to items using Power BI tools.  You will also see a new option on the navigation bar called Workloads, which is now your go-to hub for discovering available workloads, along with comprehensive getting started guides and tutorials. It’s the place where you can learn how to leverage these workloads to maximize their impact on your projects. Whether you’re exploring new features or looking to deepen your expertise, you’ll find all the resources you need to get up to speed and drive better results.
  • Reflex (under Real-Time Intelligence) has been renamed to Activator.
  • Fabric is now FedRAMP High certified for the Azure Commercial cloud, the highest level of compliance and security standards required by the federal government for cloud service providers. Now government agencies can run Fabric on the Azure Commercial cloud while maintaining strict compliance (see: Services added in the last 90 days).

Non-Fabric related:

  • Microsoft Purview Data Catalog is being renamed to Microsoft Purview Unified Catalog to better reflect the offering’s comprehensive customer benefits.
  • New Purview features: integration with new OneLake catalog; a new data quality scan engine; Purview Analytics in OneLake; and expanded Data Loss Prevention (DLP) capabilities for Fabric lakehouse and semantic models.
  • A new product was announced called Azure AI Foundry, but it is more of a grouping and rebranding of existing products and services – these services include the Azure AI Foundry portal, formerly known as Azure AI Studio, the Azure AI Foundry software development kit (SDK), Azure AI Agents, pre-built application templates (25 pre-built application templates at launch) and a suite of tools designed for AI-based application development. This service integrates with existing Azure AI tools, including Azure AI Search, AI Agents, AI Content Safety and Azure Machine Learning.
  • SQL Server 2025, now in private preview. See Bob Ward’s Announcing SQL Server 2025 and the Ignite session SQL Server 2025: an enterprise AI-ready database platform.

More info:

Ignite video Microsoft Fabric: What’s new and what’s next with Arun Ulagaratchagan and Amir Netz

Spreading your SQL Server wings with SQL database in Fabric

The post Microsoft Ignite Announcements Nov 2024 first appeared on James Serra's Blog.

Ways to land data into Fabric OneLake

Microsoft Fabric is rapidly gaining popularity as a unified data platform, leveraging OneLake as its central data storage hub for all Fabric-integrated products. A variety of tools and methods are available for copying data into OneLake, catering to diverse data ingestion needs. Below is an overview of what I believe are the key options:

Fabric Data Pipeline via Copy activity

Simplify data movement with managed workflows designed for efficient and reliable transfers. Ideal for orchestrating complex data pipelines with minimal effort.

Fabric Dataflow Gen2

Create repeatable, scalable ETL (Extract, Transform, Load) processes. Dataflow Gen2 allows for visually mapping transformations and is perfect for business users and data engineers alike.

Local file/folder upload via Fabric Portal Explorer

Leverage drag-and-drop functionality in the Fabric portal for quick, manual uploads of local files and folders to OneLake.

Fabric Eventstreams

Ingest event-driven data in real time. This is an excellent option for use cases like IoT telemetry, application logs, or transactional events.

Fabric OneLake File Explorer

Manage your OneLake files as if they were stored locally on your machine. This tool enhances accessibility and streamlines workflows.

Fabric Spark notebooks via APIs

Utilize Spark notebooks to process and load data programmatically. Combined with OneLake’s REST API, this method is tailored for advanced, customizable data ingestion needs.
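To make this concrete, here is a minimal sketch of what that could look like inside a Fabric Spark notebook attached to a lakehouse. The file path and table name are hypothetical, and `spark` is the session Fabric already provides in the notebook:

```python
# Minimal sketch: runs inside a Fabric Spark notebook attached to a lakehouse,
# where a SparkSession named `spark` is already provided by Fabric.
# The file path and table name are hypothetical.

# Read a CSV that was previously landed in the lakehouse "Files" area
df = spark.read.csv("Files/raw/movies.csv", header=True, inferSchema=True)

# Light shaping before persisting the data
df_clean = df.dropDuplicates()

# Write to OneLake as a managed Delta table in the lakehouse "Tables" area
df_clean.write.format("delta").mode("overwrite").saveAsTable("movies")
```

Because the table is written in Delta format to OneLake, it can then be queried through the SQL analytics endpoint or used in Power BI like any other lakehouse table.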

Fabric Mirroring

Synchronize OneLake with external storage systems seamlessly. This option ensures your OneLake data stays updated without manual intervention.

Azure Storage Explorer

Use this desktop app to manage data across your Azure storage resources, including OneLake. It’s particularly useful for managing large datasets with a familiar interface.

AzCopy

Leverage this powerful command-line utility for efficient, large-scale data transfers. It’s the perfect tool for moving massive datasets to OneLake.

OneLake integration for semantic models

Automatically write data imported into model tables to Delta tables in OneLake. This integration simplifies analytics workflows while enhancing data consistency.

Azure Data Factory (ADF)

For enterprise-scale ETL needs, ADF offers robust capabilities that integrate seamlessly with OneLake. While similar to Fabric Data Pipelines, ADF shines in complex, high-volume scenarios.

T-SQL COPY INTO

Load data directly into OneLake using SQL scripts. This method is ideal for developers and database administrators looking for a straightforward, SQL-native approach.
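As a rough sketch of this approach (not a definitive pattern), a COPY INTO statement could be run against a Fabric warehouse SQL endpoint from Python like this. The server name, warehouse, table, storage path, and SAS token below are all hypothetical, and the authentication options you use for both the connection and the storage account may differ:

```python
# Rough sketch: execute COPY INTO against a Fabric warehouse SQL endpoint with pyodbc.
# Server, database, table, storage path, and SAS token are hypothetical;
# connection and storage authentication options vary by environment.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=yourworkspace.datawarehouse.fabric.microsoft.com;"  # hypothetical SQL endpoint
    "Database=SalesWarehouse;"                                  # hypothetical warehouse
    "Authentication=ActiveDirectoryInteractive;"
)

copy_sql = """
COPY INTO dbo.Sales
FROM 'https://yourstorageaccount.blob.core.windows.net/landing/sales/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>')
);
"""

cursor = conn.cursor()
cursor.execute(copy_sql)  # loads the Parquet files into the warehouse table
conn.commit()
conn.close()
```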

By leveraging these tools and methods, organizations can effectively and efficiently ingest data into Fabric OneLake, ensuring optimal use of its unified data platform capabilities. Each approach has its unique strengths, allowing teams to choose the best fit for their specific use case.

The post Ways to land data into Fabric OneLake first appeared on James Serra's Blog.

Introduction to OpenAI and LLMs – Part 3

My previous blog posts on this topic were Introduction to OpenAI and LLMs, the “what” part (what is OpenAI and LLM), and the “how” part Introduction to OpenAI and LLMs – Part 2. Those blogs focused on using LLMs on unstructured data, such as emails and documents in formats like .pdf, .docx, or .txt files (think documents that include text). This blog will focus on using LLMs on semi-structured data, such as files and logs in CSV, Parquet, XML, or JSON formats (often in table format, meaning rows and columns) as well as Excel files; and structured data such as relational databases (SQL tables and fields).

First, let’s review some key definitions from the previous two blogs:

  • AI: Computational systems and models capable of performing tasks that typically require human intelligence.  GenAI and ML are subsets of AI.
  • Generative AI (GenAI): AI systems capable of generating new content, such as text, images, or audio. It does this by employing neural networks, a type of machine learning process that is loosely inspired by the way the human brain processes, interprets and learns from information over time.
  • Large Language Models (LLMs): A type of GenAI designed for natural language (text) understanding and generation, trained on diverse datasets.  Think of it like a super-smart auto-complete on steroids. GenAI uses other specialized models (non-LLMs) for image or video generation.
  • Machine learning (ML): A broad subset of AI that encompasses algorithms and models capable of learning from data to make predictions or decisions. LLMs are a type of deep learning within ML, focused on language understanding and generation, whereas ML includes techniques for diverse tasks like image recognition, data analysis, and predictive modeling.
  • OpenAI: A leading organization specializing in AI research and development, known for creating GenAI models such as GPT (used in ChatGPT) and other AI technologies.
  • ChatGPT, Copilot: Applications (“bots”) built on GenAI models (e.g., GPT). These tools allow users to interact with GenAI via natural language prompts, generating responses tailored to various contexts, such as conversations (ChatGPT) or coding (Copilot).
  • Prompt engineering: The practice of designing input prompts to guide LLMs in generating accurate and relevant responses, optimizing the model’s performance for specific tasks.

There are many possible use cases for using GenAI on semi-structured and structured data: Conversational querying, data enrichment and cleaning, sample data creation, data summarization, trend identification, forecasting and predictions, what-if analysis, anomaly detection and correction, product/service recommendations, mapping fields, MDM, and creating semantic models. In this blog I will focus on the first two: conversational querying, and data enrichment and cleaning.

Let’s first talk about using ChatGPT or Copilot on semi-structured data. It’s something I don’t think many are aware of, but you can use these tools to query and transform data that sits in files such as CSV and Excel simply by uploading those files and entering prompts. The data in those files just needs to be in a table format (rows and columns). The tool will take the uploaded data and create an internal table that you can modify and then save to a file on your computer. Check the features and limits of the version of ChatGPT and Copilot you are using to make sure it supports file uploads (only the enterprise version of Copilot supports file uploads).

As an example using ChatGPT, I can upload a csv movie file and then ask questions like “Tell me about this file”, “How many movies does it contain?”, and “List the movies with a rating of 10 or above”. After any question, I can ask “Generate the T-SQL query for the last question I just asked”. I can then start to modify the data with prompts like “Change ‘sci-fi’ in the genres field to ‘science fiction’” and “Replace any empty fields with a zero” (this is data cleaning). You can get really sophisticated with prompts such as “Add two more columns to the csv file: leading actor in the movie and leading actress, then populate the movies by searching publicly available information from Wikipedia or IMDb to find the leading actor and actress for each movie” and “Add a column to the csv file called MovieGross and look up each movie from Wikipedia or IMDb for the total gross at the box office each movie made and populate that column” (this is data enrichment). Just think of all the time it can save you from manually doing those lookups!
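If you later wanted to reproduce those cleaning steps in a script rather than through prompts, a minimal pandas sketch might look like this. The file name and column names are hypothetical, and the enrichment columns are just placeholders since the Wikipedia/IMDb lookups ChatGPT performs are not shown here:

```python
# Minimal sketch of the same cleaning steps in pandas; file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("movies.csv")

# "Change 'sci-fi' in the genres field to 'science fiction'"
df["genres"] = df["genres"].replace("sci-fi", "science fiction")

# "Replace any empty fields with a zero"
df = df.fillna(0)

# Placeholder enrichment columns (values would come from a lookup such as Wikipedia or IMDb)
df["LeadingActor"] = ""
df["LeadingActress"] = ""
df["MovieGross"] = 0

df.to_csv("movies_clean.csv", index=False)
```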

Now let’s talk about using Microsoft Fabric AI Skill on structured data. The Microsoft Fabric AI Skill is a feature powered by GenAI that makes data more accessible by enabling users to interact with it using plain English questions. Leveraging LLMs, the AI Skill generates queries via T-SQL based on the user’s question and the database schema. When a user poses a question, the system sends the question, along with metadata about the selected data (such as table and column names and data types), to the LLM. The model then generates a T-SQL query to answer the question, ensuring that the query does not alter the data before executing it and returning the results along with the T-SQL. This can be used against a lakehouse, warehouse, and soon also a semantic model or Eventhouse KQL DB.
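To illustrate the pattern (the question plus schema metadata goes to the LLM, which returns T-SQL), here is a minimal sketch using the OpenAI Python client. This is not the Fabric AI Skill API itself, just an approximation of the flow it describes; the model name, schema, and question are illustrative:

```python
# Sketch of the general pattern (question + schema metadata -> LLM -> T-SQL).
# This is NOT the Fabric AI Skill API; model name, schema, and question are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

schema_metadata = """
Table dbo.Sales (SaleId int, ProductName varchar(100), Region varchar(50),
                 SaleDate date, Amount decimal(18,2))
"""
question = "What are the top 5 selling products by total amount?"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You write read-only T-SQL. Use only the tables and columns provided. "
                    "Return only the query."},
        {"role": "user", "content": f"Schema:\n{schema_metadata}\nQuestion: {question}"},
    ],
)

generated_tsql = response.choices[0].message.content
print(generated_tsql)  # review/validate before running it against the warehouse
```

The AI Skill adds the guardrails described above (ensuring the generated query does not alter the data) and executes the query for you against the lakehouse or warehouse, returning both the results and the T-SQL.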

Configuring the AI Skill is similar to creating a Power BI report. Once set up, it can be shared with colleagues for collaborative data exploration. While the AI Skill often provides accurate answers out of the box, it may need refinement to better understand your organization’s specific context. You can provide additional instructions and example question-query pairs to help guide the AI to generate more precise responses, making it a powerful tool for intuitive and collaborative data analysis. Note that a Fabric tenant admin has to enable the Tenant settings for a Fabric AI Skill so you can use them. Check out this video to see a demo of AI Skill in action.

You may be asking what is the difference between AI Skill and Copilot? The technology behind AI Skill and Fabric Copilot is similar, as both use GenAI to analyze and interact with data. However, there are notable differences between the two capabilities. AI Skill offers extensive configuration options, allowing users to customize the AI’s behavior to meet specific needs. Users can provide instructions and examples to guide the AI, ensuring it aligns with their use case. Fabric Copilot, on the other hand, does not provide this level of configuration flexibility. While Copilot assists users in performing tasks within Fabric, such as generating Notebook code or Data Warehouse queries, AI Skill functions independently. AI Skill can also be integrated with other platforms, such as Microsoft Teams, making it a versatile tool for broader applications beyond Fabric. Copilot in Power BI is designed to work within reports or models, focusing on updating and enhancing those reports. In contrast, AI Skill is geared toward handling ad-hoc queries, working directly against a lakehouse or warehouse to generate and execute T-SQL queries and return query results.

So now technology allows us to use GenAI on both structured and unstructured data at the same time. An industry use case that highlights this is a healthcare organization aiming to improve patient outcomes by leveraging GenAI to optimize treatment plans by combining unstructured and structured data. This approach involves analyzing unstructured data, such as doctor notes and lab reports, and linking it with structured data, including diagnostic codes, treatment history, and patient demographics stored in relational databases. By doing so, GenAI can suggest the most effective treatments for patients. This optimization not only improves patient outcomes but also reduces the time required to identify suitable care paths, enabling more efficient and effective healthcare delivery.

Now there are two key architectural approaches to handling data queries and analysis: the traditional method and a GenAI-driven method (discussed in this blog).

The traditional approach involves extracting numbers and text from documents (unstructured data), storing them in a database alongside other structured data, and using SQL to query the information. This method is ideal when the questions to be asked are predefined, as it ensures consistent and accurate results. Tools like Azure AI Document Intelligence can extract data from documents (and then use LLMs to convert from the JSON output to CSV to more easily add to a database), while solutions like Microsoft Fabric AI Skill can use LLMs to convert natural language questions into SQL queries. However, this approach relies heavily on the accuracy of document data extraction and does not use LLMs to enhance answers beyond the scope of SQL queries. It is particularly suited for scenarios requiring high accuracy and consistency, such as financial reporting or operational dashboards. The workflow follows a straightforward pattern: Document → Database → SQL Query → Answer, making it best for accuracy and predefined queries.

The GenAI approach takes a more flexible and exploratory route by feeding various types of data—structured, semi-structured, and unstructured—into an LLM. An application/bot is then used to ask questions on this data, which is all sent to an LLM. This method is beneficial when the questions are not predefined, allowing for broader exploration and discovery. The bot can upload unstructured data (documents), semi-structured data (CSV files), and structured data exported to CSV or pulled directly from a database. However, challenges include ensuring CSV files have sufficient metadata and maintaining accuracy in data extraction when using retrieval-augmented generation (RAG). This approach is particularly useful for applications like customer support bots or summarizing and analyzing large document repositories. The workflow typically follows the pattern: Document/CSV/Database → LLM → Bot Query → Answer, making it best for exploration and leveraging many types of data.
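For contrast with the metadata-driven flow above, here is a minimal sketch of the exploratory pattern, where a small extract of the data itself is sent to the LLM along with an open-ended question. The file name, model, and prompt are illustrative, and anything larger than a small extract would typically require chunking or retrieval-augmented generation rather than inlining the data:

```python
# Sketch of the exploratory pattern (Document/CSV -> LLM -> Bot Query -> Answer).
# File name, model, and question are illustrative; larger datasets would need
# chunking or retrieval-augmented generation instead of inlining the whole file.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

# A small extract of structured data exported to CSV
sample = pd.read_csv("sales_extract.csv").head(50).to_csv(index=False)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You are a data analyst. Answer using only the data provided."},
        {"role": "user",
         "content": f"Here is a sample of my sales data:\n{sample}\n\n"
                    "Find any anomalies and describe the overall trend."},
    ],
)

print(response.choices[0].message.content)
```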

In summary, the traditional method excels in accuracy and predefined queries, while the GenAI approach is ideal for exploration and working with diverse data sources.

One more point about Microsoft Fabric AI Skill: it is not being used to send data to an LLM.  Rather, it takes natural language questions, turns them into SQL, runs the SQL on structured data, and returns the results, which could then be fed to an LLM by manual means (i.e. by exporting the data to a csv and sending it to the LLM along with unstructured data to help answer a question). The AI Skill essentially acts as a natural language interface to your structured data.  But you can argue that the AI Skill sending metadata to an LLM to create a SQL statement to answer a question is as good as sending the actual data to the LLM, with the caveat that you are not able to combine it with unstructured data.  Also, SQL only deals with specific questions (“What are the top-selling products?”, “What is the average sales growth by region?”), not with more general questions that an LLM can answer (“Find anomalies in the data” or “Identify patterns indicative of fraud”). While metadata-driven SQL generation is efficient for many business intelligence use cases, it falls short of providing the deeper contextual or semantic analysis that LLMs excel at.

I gave a presentation on this topic for the Toronto Data Professional Community; you can view it and download the slides.

More info:

Build Chatbot with Large Language Model (LLM) and Azure SQL Database

Natural Language to SQL Query

NLP based Data Engineering and ETL Tool – Ask On Data

Ask questions to your data, is Copilot the way to go or should we consider alternatives like AI Skills?

The post Introduction to OpenAI and LLMs – Part 3 first appeared on James Serra's Blog.

Cool AI sites

As I researched and wrote my OpenAI and LLMs blogs (see Introduction to OpenAI and LLMs, Introduction to OpenAI and LLMs – Part 2 and Introduction to OpenAI and LLMs – Part 3, along with a presentation on that topic that I did for the Toronto Data Professional Community which you can view and download the slides), I found and played with many fascinating AI products and features. I am continually amazed at the progress we are making with AI, especially in the GenAI world, and feel like we are just getting started. Here are my favorites:

HeyGen – Speak to an avatar live. Check out their demos where you can interact with avatars such as a therapist, fitness coach, and doctor. They keep adding new ones. Amazing stuff!

OpenAI Sora – Create video from text. My favorite is Historical footage of California during the gold rush.

GPT Store – Discover and create custom versions of ChatGPT that combine instructions, extra knowledge, and any combination of skills. My favorites are Books, Movies, and Therapist/Psychologist.

ChatGPT advanced voice mode – Have spoken interactive conversations with ChatGPT, where you can screen share and also share live video. I had a long conversation with ChatGPT while driving alone to help keep me from getting sleepy – we discussed the best Yankee teams of all time. The ChatGPT voice is animated so it’s like talking to a real person (choose from nine lifelike output voices for ChatGPT, each with its own distinct tone and character). Screen share can be used for things like help guiding you through settings on your computer in order to help troubleshoot a problem, while live video can be used in ways such as helping you recognize objects or tell you the color of a shirt (great for colorblind people like me).

ChatGPT DALL-E 3 – Create images from text. This used to be a separate product but is now built into ChatGPT. I recently asked it to draw me a picture of a crawfish blowing out a birthday cake for a crawfish boil I did for someone’s birthday and sent that picture with the birthday invitations.

Azure AI Speech Studio – Create a custom text to speech avatar using the Azure AI Custom Neural Voice and the Custom Avatar Self-Service capabilities in Azure AI Speech Studio. You can use the video and voice of anyone you wish. I hope to one day create an avatar that looks and sounds like me and then talk to it. Freaky!

Clonos – Create your own virtual avatar from existing video and sound, and make it say anything. See examples of Clonos in the sports world that are very funny at memerunngergpt on Instagram, where they modify videos of sports figures to say hilarious things in their own voices (warning: foul language).

Microsoft Copilot – An AI chatbot that is now in many Microsoft products. Check out a few of my favorites: Copilot in Teams, Copilot in Microsoft Fabric, Copilot on Windows, Copilot in Word, Microsoft 365 Copilot Chat.

If you have not used ChatGPT, you need to do so immediately! One way to use it that you may not be aware of is via roleplay, where OpenAI takes on a persona. This is a great way to learn about a subject matter. For example, I am reading about the crusades, so I prompted OpenAI: “Pretend you are a knight from the first crusade and fought from the very beginning to the very end of the crusade. I will ask you questions, and I want you to answer in the first person. Draw upon historical knowledge and accounts of the first crusade to immerse yourself in the mindset, beliefs, and experiences of such a knight, responding to me as though you were truly from that era”. Then I was able to ask questions directly to a “knight” and get all sorts of great info. Other amazing personas you can ask it to emulate include things like “pretend you are SQL Server and I’m using SSMS” and “pretend you are the game Zork” (for you old-timers out there like me).

I also used ChatGPT on my iPhone and told it to pretend it was Santa Claus and to talk to my 6-year-old grandson. I turned on voice mode, which has a seasonal Santa voice, and my grandson had a long and animated conversation with Santa (ChatGPT)!

To see some of the capabilities of OpenAI and ChatGPT, check out the OpenAI YouTube channel, which has demo videos. My favorites are Interview Prep with GPT-4o, Live demo of GPT-4o’s vision capabilities, and Interview roleplay with GPT-4o voice and vision.

A helpful tip when asking ChatGPT questions: you will sometimes get wrong answers, especially if your question (the “prompt”) does not include much information. If you get a wrong answer, tell ChatGPT in a follow-up prompt that the answer is wrong and give it the correct answer. If it then responds with the correct answer, prompt it with “change my original prompt to prevent the incorrect answer if I were to ask the question again”. You will then get an improved prompt that you can use to ask the question again, or to build upon.
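
The same correct-and-improve loop can be scripted against the API by keeping the conversation history and then asking the model to rewrite your original prompt. This is only a sketch under the same assumptions as the earlier example (openai package, API key set, illustrative model and message text):

# Sketch of the "tell it the right answer, then improve my prompt" loop.
from openai import OpenAI

client = OpenAI()
history = [{"role": "user", "content": "Who won the World Series in 1978?"}]

first = client.chat.completions.create(model="gpt-4o", messages=history)
history.append({"role": "assistant", "content": first.choices[0].message.content})

# If the answer was wrong, correct it and ask for a better prompt.
history.append({"role": "user", "content": (
    "That answer is wrong - the New York Yankees won the 1978 World Series. "
    "Change my original prompt to prevent the incorrect answer if I were to "
    "ask the question again."
)})
improved = client.chat.completions.create(model="gpt-4o", messages=history)
print(improved.choices[0].message.content)  # the improved prompt suggestion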

More info:

AI Companions Will Change Our Lives

The post Cool AI sites first appeared on James Serra's Blog.

Azure SQL offerings


There are three Azure SQL products with so many different deployment options, service tiers, and compute tiers that it can get quite confusing when choosing the right option for your workload. So, I thought I would write this blog to help out a bit.

Azure SQL is a cloud-based suite of database services designed to offer flexibility, scalability, and ease of management. It comprises three main products, each catering to different deployment needs and compatibility requirements, and within each product are various deployment options, service tiers, and compute tiers.

Below I expand on each of these products and options:

Azure SQL Products

1. SQL Server on a Virtual Machine (VM)

For organizations needing full control over their SQL Server environment, running SQL Server on an Azure VM is a great option. This Infrastructure-as-a-Service (IaaS) offering provides complete access to the operating system and database engine, enabling users to configure and manage SQL Server as they would on-premises. It is best suited for:

  • Lift-and-shift migrations with minimal changes.
  • Applications requiring full SQL Server features.
  • Custom configurations and third-party integrations.
  • See Provision SQL Server on Azure VM and VM size.

2. Azure SQL Database

Azure SQL Database is a fully managed, Platform-as-a-Service (PaaS) database solution that automates maintenance tasks like patching, backups, and scaling. It is ideal for modern cloud applications and is available in two deployment options:

Single Database

  • A dedicated, isolated database with predictable performance.
  • Best for applications that require resource guarantees at the database level.
  • Supports service tiers: General Purpose, Business Critical, and Hyperscale.
  • Can use the serverless compute tier (only in the General Purpose and Hyperscale tiers), allowing dynamic scaling of resources based on workload demand.
  • See What is a single database in Azure SQL Database?
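
To get a feel for the single database option, here is a minimal connection sketch using pyodbc. It assumes the ODBC Driver 18 for SQL Server is installed, and the server, database, and credential values are placeholders (Microsoft Entra ID authentication is another option):

# Connect to an Azure SQL Database single database with pyodbc (placeholders throughout).
import pyodbc

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"  # placeholder logical server
    "Database=mydb;"                                  # placeholder database
    "Uid=myadmin;Pwd=mypassword;"                     # placeholder credentials
    "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
)

conn = pyodbc.connect(conn_str)
row = conn.cursor().execute("SELECT @@VERSION;").fetchone()
print(row[0])  # reports the Microsoft SQL Azure engine version
conn.close()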

Elastic Pool

  • A set of databases on a single logical server that share a pool of compute and storage resources.
  • Best for SaaS applications with many databases that have variable, unpredictable usage patterns, since sharing resources lowers overall cost.
  • Supports service tiers: General Purpose, Business Critical, and Hyperscale.
  • Does not support the serverless compute tier.
  • See Elastic pools in Azure SQL Database.

3. Azure SQL Managed Instance

Azure SQL Managed Instance is an instance-scoped deployment option that provides near 100% compatibility with SQL Server while delivering full PaaS benefits. This makes it ideal for organizations looking to modernize their database infrastructure with minimal friction. Key benefits include:

  • Native support for SQL Server features like cross-database queries, linked servers, and SQL Agent.
  • Built-in high availability, automated maintenance, and security.
  • Supports service tiers: General Purpose and Business Critical.
  • Does not support the serverless compute tier.
  • See What is Azure SQL Managed Instance?
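
To illustrate why that near-full compatibility matters, the following pyodbc sketch runs a cross-database query using three-part names, which works on Managed Instance (and SQL Server) but not across separate Azure SQL databases. The instance endpoint, credentials, and database/table names are all placeholders:

# Cross-database query on Azure SQL Managed Instance via its public endpoint (port 3342).
import pyodbc

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myinstance.public.abc123.database.windows.net,3342;"  # placeholder endpoint
    "Database=SalesDb;"                                               # placeholder database
    "Uid=myadmin;Pwd=mypassword;Encrypt=yes;"                         # placeholder credentials
)

conn = pyodbc.connect(conn_str)
rows = conn.cursor().execute(
    # Three-part naming reaches a second database on the same instance.
    "SELECT TOP 5 o.OrderId, c.CustomerName "
    "FROM dbo.Orders AS o "
    "JOIN ReferenceDb.dbo.Customers AS c ON c.CustomerId = o.CustomerId;"
).fetchall()
for r in rows:
    print(r.OrderId, r.CustomerName)
conn.close()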

Service Tiers in Azure SQL

Azure SQL Database and Azure SQL Managed Instance offer the following service tiers to cater to different workload needs (Hyperscale is available only in Azure SQL Database):

1. General Purpose

  • Balanced performance and cost-effective for most applications.
  • Available in both Azure SQL Database (single and elastic pool) and Azure SQL Managed Instance.
  • Supports the serverless compute tier (only for single database deployment).
  • See vCore purchasing model – Azure SQL Database.

2. Business Critical

  • Designed for applications requiring high transaction rates and low-latency I/O performance.
  • Includes built-in high availability with multiple replicas.
  • Available in Azure SQL Database (single and elastic pool) and Azure SQL Managed Instance.
  • Does not support the serverless compute tier.
  • See vCore purchasing model – Azure SQL Database.

3. Hyperscale

  • Optimized for extremely large databases, supporting up to 128 TB.
  • Provides rapid scaling of compute and storage independently.
  • Available for Azure SQL Database (single database and elastic pool deployments).
  • Supports the serverless compute tier for single database deployment.
  • See Hyperscale service tier.

Serverless Compute Tier

The serverless compute tier is a cost-effective option that allows automatic scaling of compute resources based on demand. It is available in the General Purpose and Hyperscale service tiers for Azure SQL Database (single database deployment).

Benefits of Serverless Compute Tier:

  • Autoscaling of CPU and memory based on workload fluctuations.
  • Automatic pausing of the database during periods of inactivity (currently only in the General Purpose tier), reducing costs.
  • Best suited for intermittent, unpredictable workloads that do not require continuous availability.
  • See Serverless compute tier for Azure SQL Database.
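
As an illustration, here is a sketch of creating a serverless General Purpose database with the azure-mgmt-sql Python SDK. It assumes the azure-mgmt-sql and azure-identity packages are installed and that you have the necessary Azure permissions; the subscription, resource group, server, database, region, and SKU values are placeholders, so check the documentation for the current SKU names:

# Create a serverless General Purpose database (sketch; names and SKU are placeholders).
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import Database, Sku

client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.databases.begin_create_or_update(
    resource_group_name="my-rg",       # placeholder resource group
    server_name="myserver",            # placeholder logical server
    database_name="mydb",              # placeholder database name
    parameters=Database(
        location="eastus",             # placeholder region
        sku=Sku(name="GP_S_Gen5_2", tier="GeneralPurpose", family="Gen5", capacity=2),
        auto_pause_delay=60,           # minutes of inactivity before auto-pause (-1 disables)
        min_capacity=0.5,              # minimum vCores when scaled down
    ),
)
db = poller.result()
print(db.name, db.status)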

Choosing the right Azure SQL option

When selecting an Azure SQL offering, consider factors like workload requirements, compatibility, cost, and management overhead. Below is a quick reference guide:

Feature | SQL Server on VM | Azure SQL Database (Single) | Azure SQL Database (Elastic Pool) | Azure SQL Managed Instance
Fully Managed (PaaS) | No | Yes | Yes | Yes
Full SQL Server Compatibility | Yes | Partial | Partial | Nearly Full
Best for Lift-and-Shift | Yes | No | No | Yes
Best for SaaS Apps | No | No | Yes | No
Best for Modernization | No | Yes | Yes | Yes
Supports General Purpose | N/A | Yes | Yes | Yes
Supports Business Critical | N/A | Yes | Yes | Yes
Supports Hyperscale | N/A | Yes | Yes | No
Supports Serverless Compute | N/A | Yes (General Purpose, Hyperscale) | No | No
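
To make the quick reference concrete, here is a toy Python helper that encodes roughly the same decision guide; it is purely illustrative and not an official recommendation or sizing tool:

# Toy decision helper that mirrors the table above (illustrative only).
def suggest_azure_sql(needs_full_compatibility: bool,
                      lift_and_shift: bool,
                      saas_many_databases: bool,
                      wants_paas: bool) -> str:
    if lift_and_shift and needs_full_compatibility and not wants_paas:
        return "SQL Server on an Azure VM"
    if needs_full_compatibility or lift_and_shift:
        return "Azure SQL Managed Instance"
    if saas_many_databases:
        return "Azure SQL Database elastic pool"
    return "Azure SQL Database single database"

print(suggest_azure_sql(needs_full_compatibility=True, lift_and_shift=True,
                        saas_many_databases=False, wants_paas=True))
# -> Azure SQL Managed Instance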

Conclusion

Azure SQL offers a diverse set of solutions to accommodate different business and technical requirements. Whether you need full control with SQL Server on a VM, a fully managed single database, a cost-effective elastic pool, or near full SQL Server compatibility with Managed Instance, Azure SQL provides the flexibility to optimize cost, performance, and scalability. Understanding these options and service tiers will help you make an informed decision tailored to your specific workload needs.


The post Azure SQL offerings first appeared on James Serra's Blog.