
Modern Data Warehouse & Reverse ETL


An extension to the Modern Data Warehouse (MDW) that I have heard a bit about lately is called “Reverse ETL”. Before I describe what that is, first I wanted to give a quick review of a typical MDW, which consists of five stages:

  1. Ingest: Data is ingested from multiple sources via ELT
  2. Store: The ingested data is stored in a data lake in a raw layer in the format that it came from the source
  3. Transform: The data is then cleaned and written to a cleaned layer in the data lake, and then joined and/or aggregated and copied into a presentation layer in the data lake
  4. Model: Some of the data is then copied into a relational database in third normal form (3NF) and/or a star schema
  5. Visualize: Reports and dashboards are then built on top of the data


There are many variations, additions, and exceptions to these stages and multiple products/tools can be used at each stage. In the Azure world, typically Azure Data Factory is used to ingest data, Azure Data Lake Storage Gen2 is used to store the data, mapping data flows in Azure Data Factory are used to transform the data, Azure Synapse Analytics is used to model the data, and Power BI is used to visualize the data. Of course, there are many variations on the tools you can use. I’ll post a video in the next few weeks that will discuss this in more detail.
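
To make the store and transform stages more concrete, here is a minimal PySpark sketch of moving data from a raw layer to a cleaned layer and then to a presentation layer in the data lake. The container names, paths, and columns are placeholders for illustration only; in the Azure example above this logic would typically live in a mapping data flow or a Spark notebook.

```python
# Minimal sketch of the store -> transform stages, using PySpark.
# Layer names ("raw", "cleaned", "presentation"), paths, and columns are
# placeholders, not a prescribed layout.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mdw-transform").getOrCreate()

lake = "abfss://{layer}@<storageaccount>.dfs.core.windows.net"

# Store: ingested files sit in the raw layer in their source format
orders_raw = spark.read.json(f"{lake.format(layer='raw')}/sales/orders/")

# Transform: clean the data and write it to the cleaned layer...
orders_clean = (
    orders_raw
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_date"))
    .filter(F.col("amount") > 0)
)
orders_clean.write.mode("overwrite").parquet(f"{lake.format(layer='cleaned')}/sales/orders/")

# ...then join/aggregate into the presentation layer for the model and visualize stages
daily_sales = orders_clean.groupBy("order_date").agg(F.sum("amount").alias("total_sales"))
daily_sales.write.mode("overwrite").parquet(f"{lake.format(layer='presentation')}/sales/daily_sales/")
```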

“Reverse ETL” is the process of moving data from a modern data warehouse into third party systems to make the data operational. Traditionally, data stored in a data warehouse is used for analytical workloads and business intelligence (i.e. identifying long-term trends and influencing long-term strategy), but some companies are now recognizing that this data can be further utilized for operational analytics. Operational analytics helps with day-to-day decisions with the goal of improving the efficiency and effectiveness of an organization’s operations. In simpler terms, it’s putting a company’s data to work so everyone can make better and smarter decisions about the business. For example, if your MDW ingested customer data which was then cleaned and mastered, that customer data can then be copied into multiple SaaS systems such as Salesforce to make sure there is a consistent view of the customer across all systems. Customer info can also be copied to a customer support system to provide better support to that customer by having more info about that person, or copied to a sales system to give the customer a better sales experience. As a last example, you can identify at-risk customers by surfacing customer usage data in a CRM.

Companies are building key definitions in SQL on top of the data warehouse such as Lifetime Value (LTV), Product Qualified Lead (PQL), propensity score, customer health, etc. Yes, you can easily create reports and visualizations using this data in BI tools or SQL, but these insights can be much more powerful if they drive the everyday operations of your teams across sales, marketing, finance, etc. in the tools they live in. Instead of training sales reps to use the BI reports you built from the MDW, the data analyst can operationalize their analysis by feeding lead scores from the data warehouse into, for example, a custom field in Salesforce.  As another example, say your data science team calculates a propensity score on top of the data warehouse or data lake describing the user’s likelihood of buying a product. Using Reverse ETL, you can move the propensity score to an operational production database to serve customers personalized in-app experiences in real time.

Instead of writing your own API connectors from the data warehouse to SaaS products to pipe the data into operational systems like Salesforce and dealing with all the mapping of fields, reverse ETL solutions have appeared which offer out-of-the-box connectors to numerous systems. They provide the mapping to the SaaS products and allow you to continuously sync or define what triggers the syncing between the two systems. I see reverse ETL products with extensive and easy-to-use mapping capabilities to many SaaS products such as Salesforce, Hubspot, Marketo, and Zendesk as a big advantage over trying to use Azure Data Factory and coding your own.
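
For a sense of what these tools automate, here is a rough sketch of a hand-rolled reverse ETL sync in Python that pushes lead scores from the warehouse into a custom Salesforce field. The lead_scores table, the Lead_Score__c custom field, and all connection details are hypothetical; the per-field mapping shown is exactly what the reverse ETL products give you out of the box.

```python
# Rough sketch of a hand-rolled reverse ETL sync: push lead scores from the
# warehouse into a custom Salesforce field. The lead_scores table, the
# Lead_Score__c field, and all connection details are hypothetical.
import pyodbc
from simple_salesforce import Salesforce

warehouse = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<your-warehouse>.sql.azuresynapse.net;"
    "DATABASE=<your-database>;UID=<user>;PWD=<password>"
)

sf = Salesforce(username="<user>", password="<password>", security_token="<token>")

# Read the scores that the analysts defined on top of the warehouse
rows = warehouse.execute(
    "SELECT salesforce_contact_id, lead_score FROM dbo.lead_scores"
).fetchall()

# Map each warehouse row onto the SaaS object, one field at a time
for contact_id, score in rows:
    sf.Contact.update(contact_id, {"Lead_Score__c": score})
```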

There are now a handful of startups building reverse ETL products, including Hightouch, Census, Grouparoo (open source), Headsup, Polytomic, RudderStack, and Seekwell. It will be interesting to see how these products evolve and if reverse ETL becomes popular.

More info:

Reverse ETL — A Primer

Reverse ETL is Just Another Data Pipeline

What is Reverse ETL?

Reverse ETL: A new category emerges (video)

What is Reverse ETL and Why it is Taking Off

The post Modern Data Warehouse & Reverse ETL first appeared on James Serra's Blog.

Modern Data Warehouse explained


I created a short YouTube video (20 minutes) that is a whiteboarding session that describes the five stages (ingest, store, transform, model, visualize) that make up a modern data warehouse (MDW) and the Azure products that you can use for each stage. You can view the video here. I hope it helps!


What’s new with Power BI?


At the Microsoft Business Applications Summit 2021, a ton of new features for Power BI were announced. Below is my list of the top ten new features, but there were plenty more. For a list of all the announcements, see Microsoft Business Application Summit Recap.

Lineage view and Impact analysis support for DirectQuery for Power BI datasets

Coming soon. Microsoft is adding support for the DirectQuery for Power BI datasets preview feature to the lineage view and impact analysis, both in the Power BI service and in Power BI Desktop. Whether the dataset is in the same workspace as its source datasets or in a different workspace, you can see it in lineage view and impact analysis.

Make datasets discoverable and request access 

Coming soon. In a culture of data reuse, analysts should be able to see what data is available in their organization for them to use. To this end, another big step Power BI has taken is to allow a data owner to make their dataset discoverable even by users who don’t yet have access to the data it contains.

As an admin or a member of a workspace, when you endorse a dataset, by default you’ll also be making it discoverable by other users, who can then request access to it if they don’t have access already.

Now that datasets can be made discoverable, users can find data that interests them and request access to it in the Datasets Hub and in Power BI Desktop.

In many cases, getting access to data goes through a process that is unique to your org and to your dataset. Recognizing the variety of processes for requesting and granting access, Power BI allows any dataset owner (workspace admin/member/contributor) to customize per-dataset instructions for how to get access. When an end-user requests access to the dataset, they will be shown a dialog message with details about how to get access.

By default, if no instructions were set, the access request will be sent by email to the user who created the dataset or to the one who assumed ownership.

Access request instructions can be set on a dataset.

Achieve more and stay on track with goals in Power BI 

In preview. Goals in Power BI will redefine how you measure key business metrics, outcomes, and milestones. Using goals, individuals can aggregate metrics with deep ownership and accountability. Goals helps you streamline the process of collecting, tracking, and analyzing all your business metrics in one place. You can create scorecards with writeback capabilities, define status, check in with notes, and assign goals to individuals. Goals can be powered by data coming from one or more Power BI reports, spanning different workspaces.

New Insights and Governance functionality for Power BI admins 

Coming soon. Azure Monitor integration will allow you to connect Power BI Premium/Embedded environments to Azure Log Analytics workspaces.  Log Analytics provides long-term data storage, retention policies, ad hoc query capability, and the ability to analyze activities in Power BI using a built-in connector.  You will have access to detailed logs on Analysis Services dataset activity, capacity utilization, and report usage and performance in near real time.  You will be able to view template reports and apps so you can understand how, when, and by whom content and resources are being consumed.


Modern Usage Metrics will be updated to provide 90 days of history. You will be able to access built-in usage reports on aggregated/trend data and drill down to each individual page level view.  Each page view will have performance information, including breakdowns of query and render duration.

Microsoft will also release a new series of Admin APIs to provide detailed visibility of access rights. You will be able to determine the users, groups and permissions on a given Power BI artifact such as report, dashboard, dataset, dataflow, app, workspace or capacity. You can also specify all the content and permissions any specific user has access to and satisfy audit and compliance needs.

Paginated report visual for Power BI reports 


Coming soon. Microsoft is bringing together paginated reports and interactive reports for the first time with the new paginated report visual. You can embed a paginated report directly in your Power BI report, and use filters/slicers and cross-highlighting to interact with the visual just as you would other Power BI visuals. Additionally, you will be able to export and print the paginated report directly from the Power BI report canvas and preserve all formatting.

Automatic Aggregations 

Coming soon. Automatic aggregations unlock massive datasets for interactive analysis as Power BI automatically creates and manages in-memory aggregations based on usage patterns to boost query performance and user concurrency. Automatic aggregations will support DirectQuery datasets over Azure Synapse, Azure SQL Database, Snowflake and Google BigQuery.

Deployment Pipelines automation 

Coming soon. Automate the release of content updates with a set of new APIs for deployment pipelines.

With these APIs, BI teams can:

  • Deploy multiple pipelines at a scheduled time.
  • Cascade deployments of Power BI content.
  • Easily integrate Power BI release processes into familiar DevOps tools, such as Azure DevOps or GitHub actions.
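
As a sketch of what integrating Power BI release processes into DevOps tools can look like, the snippet below calls the deployment pipelines REST API from a script that could run in an Azure DevOps or GitHub Actions job. The pipeline ID and token acquisition are placeholders, and the exact endpoint and payload should be verified against the current Power BI REST API documentation.

```python
# Sketch: trigger a deployment pipeline from a CI/CD job via the Power BI REST API.
# The pipeline ID and access token are placeholders; verify the endpoint and
# payload shape against the current API docs before relying on them.
import requests

PIPELINE_ID = "<deployment-pipeline-id>"        # hypothetical
ACCESS_TOKEN = "<azure-ad-token-for-power-bi>"  # e.g. acquired via MSAL with a service principal

resp = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/pipelines/{PIPELINE_ID}/deployAll",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={
        "sourceStageOrder": 0,  # deploy from the development stage to the next stage
        "options": {"allowCreateArtifact": True, "allowOverwriteArtifact": True},
    },
)
resp.raise_for_status()
print("Deployment started:", resp.json())
```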

Automated Insights 

Coming soon. Automated Insights combines all Power BI AI functionality, ranging from Anomaly Detection to Smart Narratives and the Decomposition tree, into a single end-to-end experience directly catered to your needs. When you open a report, Automated Insights runs behind the scenes and surfaces details that need your attention. You can also click the new ‘Get Insights’ button in the ribbon whenever you want to get additional insights about the visuals you are seeing.

Hybrid tables

This allows you to have a single table that combines different storage modes and refresh approaches (Import and DirectQuery). In the example shown, older data is imported, more recent data is kept up to date with incremental refresh, and the current day’s data uses DirectQuery to keep it near real-time.

Streaming dataflows

Coming soon. Streaming dataflows allow you to apply Power Query transformations on top of streaming data.

To learn more

The Power Platform release plan (formerly release notes) for the 2021 release wave 1 describes all new features releasing from April 2021 through September 2021 for Power BI, Power Apps, Power Automate, AI Builder, Power Virtual Agents, and Common Data Model and Data Integration.  You can either browse the release plan online or download the document as a PDF file or via the Power BI Release Wave.

Share your feedback on the community forum for users of the “Power” suite of products (Power BI, Power Apps, Power Automate, and Power Virtual Agents).  Microsoft will use your feedback to make improvements. And you can also vote for new features: https://ideas.powerbi.com/forums/265200-power-bi-ideas.

More info:

Driving a data culture with Power BI– Empowering individuals, every team, and every organization | Microsoft Power BI Blog | Microsoft Power BI

Incredible Power BI Announcements at Microsoft Business Applications Summit 2021

All Power BI Announcements of Microsoft Business Applications Summit 2021 in One Place

My favorite Power BI announcements from the Business Application Summit


Data Fabric defined


Another buzzword that you may have been hearing a lot about lately is Data Fabric. In short, a data fabric is a single environment consisting of a unified architecture, with services and technologies running on it, that helps a company manage its data. It enables accessing, ingesting, integrating, and sharing data in an environment where the data can be batched or streamed and be in the cloud or on-prem. The ultimate goal of a data fabric is to use all your data to gain better insights into your company and make better business decisions.  If you are thinking this sounds a lot like the modern data warehouse that I posted a video on recently at Modern Data Warehouse explained, well, I would argue it basically is the same thing except that a data fabric expands on that architecture. A data fabric includes building blocks such as data pipeline, data access, data lake, data store, data policy, ingestion framework, and data visualization. These building blocks would be used to build platforms or “products” such as a client data integration platform, data hub, governance framework, and a global semantic layer, giving you centralized governance and standardization. Ideally the building blocks could be used by other solutions outside of the data fabric. At EY, my new place of employment, we are building a data fabric that will be the subject of a future blog post.

You may now be thinking: how does a data fabric compare to a data mesh? (If you are not familiar with a data mesh, check out my blog Data Mesh defined). A data fabric and a data mesh both provide an architecture to access data across multiple technologies and platforms, but a data fabric is technology-centric, while a data mesh focuses on organizational change. Another difference is that a data mesh is decentralized (or “distributed”), where each of the sets of data is a domain (treated like a product) that is kept within each of the various organizations within a company, whereas in a data fabric all the data is brought into a centralized location. I need to point out here that this is my interpretation of a data fabric compared to a data mesh and you will find many who have variations of my view, and some that can be very different. In fact, two companies can have very different technology solutions for a data fabric or a data mesh that can both be correct, as what is correct is the best solution based on your company’s data (size, speed, and type), security policies, skillset, performance requirements, and monetary constraints.

Fundamentally, the data fabric is about collecting data and making it available via purpose-built APIs (optionally also via direct connection to the data stores for those tools that don’t support APIs). The data mesh involves building data products by copying data into specific datasets for specific use cases, but built by the dept/domain that keeps and owns the data.

As an example, say I want a dashboard that measures sales vs inventory. In the data fabric world I would ingest the data from the sales system as well as the data from the inventory system into a central location, then I would build an API that joins them together and expose that to the dashboard. Data fabrics are more about technical data integration and don’t really dictate who does it or who owns the data. In the data mesh world I would get the sales team to copy data from the sales system to a sales product dataset and the inventory management team to copy data from the inventory system to an inventory dataset, and get the dashboard owner to build a joined table that the dashboard uses.
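
To make the data fabric side of that example concrete, here is an illustrative-only sketch of a small API that joins the centrally ingested sales and inventory data and exposes the result to a dashboard. The table names, columns, and connection details are made up.

```python
# Illustrative-only sketch of the data fabric example: a small API that joins
# centrally ingested sales and inventory data for a dashboard. Table/column
# names and connection details are made up.
import json

import pandas as pd
import pyodbc
from fastapi import FastAPI

app = FastAPI()

def central_store():
    # One central location (here a SQL pool) holding both ingested datasets
    return pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=<central-warehouse>;DATABASE=<db>;UID=<user>;PWD=<password>"
    )

@app.get("/sales-vs-inventory")
def sales_vs_inventory():
    with central_store() as conn:
        sales = pd.read_sql(
            "SELECT product_id, SUM(quantity) AS sold FROM sales GROUP BY product_id", conn
        )
        inventory = pd.read_sql("SELECT product_id, on_hand FROM inventory", conn)
    # Join the two domains centrally and hand one combined dataset to the dashboard
    merged = sales.merge(inventory, on="product_id", how="outer").fillna(0)
    # Round-trip through to_json so numpy types serialize cleanly
    return json.loads(merged.to_json(orient="records"))
```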

In summary, a data mesh is more about people and process than architecture, while a data fabric is an architectural approach that tackles the complexity of data and metadata in a smart way that works well together.

In the future the technology used to build a data mesh could look very different than the technology used to build a data fabric, but for now some of the technology needed to build a true data mesh does not exist, so the result is a built data mesh may look more like a data fabric. If you still find this all confusing, you are not alone! Please share your thoughts by entering a comment below.

More info:

Data Virtualization in the Context of the Data Mesh

Disambiguation of Data Mesh, Fabric, Centric, Driven, and Everything

The Role of the Data Fabric In Your Target State Architecture

Catalog & Cocktails #32: Is Your Data Fabric a Mesh?

Catalog and Cocktails #44: Why it’s time to mesh with your data architecture

What is a Data Fabric? | Talend


Centralized vs decentralized data architecture


One of the biggest differences between the Data Mesh and other data platform architectures is a data mesh is a highly decentralized distributed data architecture as opposed to a centralized monolithic data architecture based on a data warehouse and/or a data lake.

A centralized data architecture means the data from each domain/subject (i.e. payroll, operations, finance) is copied to one location (i.e. a data lake under one storage account), and that the data from the multiple domains/subjects are combined to create centralized data models and unified views. It also means centralized ownership of the data (usually IT). This is the approach used by a Data Fabric.

A decentralized distributed data architecture means the data from each domain is not copied but rather kept within the domain (each domain/subject has its own data lake under its own storage account) and each domain has its own data models. It also means distributed ownership of the data, with each domain having its own owner.

So is decentralized better than centralized?

The first thing to mention is that a decentralized solution is not for smaller companies, only for really big companies that have very complex data models, high data volumes, and many data domains. I would say that means at least for 90% of companies, a decentralized solution would be overkill.

Second, a lot depends on the technology used. In future blog posts I’ll go more into the technology used for a data mesh and some concerns I have over it. If you are not familiar with the data mesh, I recommend you read the just-released freely available chapters in the book by Zhamak Dehghani, Data Mesh: Delivering Data-Driven Value At Scale.

For this blog, I want to cover the specific question: Is data virtualization/federation a good solution for enabling decentralization, where data in separate remote data stores can be queried and joined together? (I’ll dig into domain data models vs centralized data models, along with data ownership, in a future blog post).

To enable data virtualization/federation, there are full proprietary virtualization software products such as Denodo, Dremio, Starburst, and Fraxses, that can query many different types of data stores (i.e. Dremio supports 19, Starburst supports 45, Denodo supports 67+).

While there are benefits to using full proprietary virtualization software, there are some tradeoffs. I already blogged about those tradeoffs at Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement. I also found a list of pros/cons from a presentation from Microsoft called Azure Modern Data Strategy with Data Mesh.  It explains how to use Azure to build a data mesh and takes an exception to the ideal data mesh in that storage and data governance is centralized (which I’m finding is a common exception to the ideal data mesh). Definitely worth a watch! Here is that list of pros/cons of data virtualization:

Pros:

  • Reduces data duplication
  • Reduces ETL/ELT data pipelines
  • Improves Speed-to-Market & rapid prototyping
  • Lowers costs (but beware of egress/ingress charges)
  • Reduces data staleness/refresh
  • Security is centralized

Cons:

  • Slower performance (not sub-seconds)
  • Data ownership is still not addressed
  • Data versioning/history not supported (i.e. Slowly Changing Dimensions)
  • Affects source system performance (OLTP)
  • How to manage Master Data Management (MDM)?
  • How to manage data cleansing?
  • Not a star schema optimized for reads
  • Changes at the source will break the chain
  • Might require installing software on the source systems

An alternative to using full proprietary virtualization software, sort of a “light” version of virtualization, is the Serverless SQL pool in Azure Synapse Analytics, which can query remote data stores. It currently only supports querying data in the Azure Data Lake (Parquet, Delta Lake, delimited text formats), Cosmos DB, or Dataverse, but hopefully more will come in the future. And if your company uses Power BI, another option is to use Power BI’s DirectQuery which also can query remote data stores and supports many data sources. Note that a dataset built in Power BI that uses DirectQuery can be used outside Power BI via XMLA endpoints.
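
As a sketch of this "light" virtualization option, the query below uses a Synapse serverless SQL pool to read Parquet files in place in the data lake, called here from Python over ODBC. The workspace name, storage path, and credentials are placeholders.

```python
# Sketch of the "light" virtualization option: query files in the data lake in
# place through a Synapse serverless SQL pool. Workspace, path, and credentials
# are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<workspace>-ondemand.sql.azuresynapse.net;"
    "DATABASE=master;UID=<user>;PWD=<password>"
)

# OPENROWSET reads the Parquet files where they sit -- no copy into a warehouse first
query = """
SELECT TOP 10 customer_id, SUM(amount) AS total
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/presentation/sales/orders/*.parquet',
    FORMAT = 'PARQUET'
) AS orders
GROUP BY customer_id
ORDER BY total DESC;
"""

for row in conn.execute(query):
    print(row.customer_id, row.total)
```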

I have seen the most use of a data virtualization product when data from many sources is copied into different data stores inside a modern data warehouse or data fabric (Cosmos DB, SQL Database, ADLS Gen2, etc) and you need to query those multiple different data stores and join the data.

Now if you are building a data fabric and decide to use data virtualization to keep data in place instead of copying it to a centralized location, then I would say your data fabric and a data mesh are nearly the same thing, with at least one difference being that a data mesh has standards/frameworks on how each domain handles its data, treating data as a product with the domain as the owner, where a data fabric does not.

I would love to hear your thoughts in the comment section below. More to come on this topic!


Data Mesh: Centralized ownership vs decentralized ownership


I have done a ton of research lately on Data Mesh (see the excellent Building a successful Data Mesh – More than just a technology initiative for more details), and have some concerns about the paradigm shift it requires. My last blog tackled the one about Centralized vs decentralized data architecture. In this one I want to talk about centralized ownership vs decentralized ownership, along with another paradigm shift (or core principle) closely related to it, siloed data engineering teams vs cross-functional data domain teams.

First I wanted to mention there is a Data Mesh Learning Slack channel that I have spent a lot of time reading, and what is apparent is there is a lot of confusion on exactly what a data mesh is and how to build it. I see this as a major problem: the more difficult it is to explain a concept, the more difficult it will be for companies to successfully build that concept, so the promise of a data mesh improving the failure rates for big data projects will be difficult to achieve if we can’t all agree exactly what a data mesh is. What’s more, the core principles of the data mesh sound great in theory but will have challenges in implementing them, hence my thoughts in this blog on centralized ownership vs decentralized ownership.

To review what is centralized ownership vs decentralized ownership (which reminds me of the data mart arguments of the Kimball vs Inmon debates many years ago): Rather than thinking in terms of pipeline stages (i.e. data source teams copying data to a central data lake to be filtered by a centralized data team in IT, who then prepare it for data consumers, so “central ownership”), we think about data in terms of domains (e.g. HR or marketing or finance) where the data is owned and kept within each domain (called a data product), hence “decentralized ownership” (also called domain or distributed ownership).  From a business perspective this makes things easier as it maps much more closely to the actual structure of your business.  Domains can be followed from one end of the business to the other. Each team is accountable for their data, and their processes can be scaled without impacting other teams. Each domain will have their own team for implementing their domain solution (“cross-functional data domain teams”) instead of one centralized team that resides in IT being responsible for all the implementations (“siloed data engineering teams”).

Inside a domain such as HR, that team manages their HR-related OLTP systems (i.e. Salesforce, Dynamics) and has created their own datasets built on top of a data warehouse or a data lake that has combined the data from all the HR-related OLTP systems. I have not seen clarity from the data mesh discussions on how exactly a domain handles OLTP and analytical data, so please comment below if you have a different opinion.

To be part of the data mesh, each domain must follow a set of IT guidelines and standards (“contracts”) that describe how their domain data will be managed, secured, discovered and accessed.

Having built database and data warehouse solutions for 35 years, I have some concerns about this approach:

  • Domains will only be thinking of their own data product and not how to work with other products, possibly making it difficult to combine the data from multiple domains
  • Not having IT-like people in each product group to do the implementation but instead trying to use business-like people
  • Does each domain have the budget to do its own implementation?
  • You may have domains not wanting to deal with data and just focus on what they are good at (i.e. serving their customers), happy to have IT handle their data
  • Each domain could be using different technology, some of which could be obscure.  And not having the experience to pick the right technology
  • Having centralized policies with a data mesh oftentimes leaves the implementation details to the individual teams. This has the potential of inconsistent implementations that may lead to performance degradations and differing cost profiles
  • If implementing a Common Data Model (CDM), then you will have to get every domain to implement it
  • You will have to coordinate each domain to have its own unique ID’s for rows when it has the same types of data as other domains (i.e. customers)
  • Domains may have their own roadmap and want to implement their use case now and/or don’t want to pay or wait for a data mesh.  And what if you have dozens of domains/orgs who feel this way?
  • Conformed dimensions would have to be duplicated in each domain
  • You could plan on having a bunch of people with domain knowledge within each domain, but what about if you already have many people in IT who understand all the domains and how to integrate the data to get more value than the separate domains of data? Wouldn’t this favor a centralized ownership?
  • Ideally you want deep expertise on your cross-functional teams in streaming, ETL batch processing, data warehouse design, and data visualization.  So if you have many domains this means many roles to fill and that might not be affordable. The data mesh approach assumes that each domain team has the necessary skills, or can acquire them, to build robust data products. These skills are incredibly hard to find
  • How do you convince ‘business people’ in each domain to take ownership of data if it only introduces extra work for them? And that there could possibly be a disruption in service?
  • If each domain is building their own data transformation code, then there will be a lot of duplication of effort
  • If there are already data experts within each domain, why not just have IT work closely with them if using a centralized ownership?
  • The domain teams may say their data is clean and won’t change it, where if the data is centralized then it can be cleaned.  And domains may have different interpretations of clean or how to standardize data (i.e. defining states with abbreviations or the full state name).  And what if the domains don’t have time to clean the data?
  • Who scans for personally identifiable information (PII) data and who fixes the issue if it is found out that people are seeing PII information that they should not be allowed to see?
  • Who coordinates if a domain changes its data model, causing problems with core data models or queries that join domain data models?
  • Who handles DataOps?
  • Shifting from a centralized set of individuals servicing their data requests to a self-serve approach could be very challenging for many companies
  • Each domain ingesting their own data could lead to duplication of purchased data, along with many domains building similar ingestion platforms
  • The problem of domains ignoring the data security standards or data quality standards, which would not happen in a centralized architecture
  • You create data silos for domains that don’t want to join the data mesh or are not allowed to because they don’t follow the data mesh contract for domains
  • Replacing the IT data engineers with engineers in each domain (“business engineers”) will provide the benefit of business engineers knowing the data better, but the tradeoff is they don’t have the specialized technical knowledge that IT data engineers have which could lead to less-than-optimal technical solutions
  • Having multiple domains that have aggregates or copies of data from other domains for performance reasons leads to duplication of data
  • A data mesh assumes that the people who are closest to the data are the best able to understand it, but that is not always true.  Plus, they likely don’t understand how best to combine their data with other domains
  • A data mesh touts that it reduces the “organizational complexity”, but it may actually make it worse when the teams are distributed instead of centralized and many more people are involved
  • The assumption that IT data engineers don’t have business and domain knowledge is not always true in my experience. I have seen some that have more knowledge than the actual domain experts, plus they understand how to combine the data from different domains together.  And if IT data engineers don’t have the domain knowledge, having them obtain that knowledge could be a better solution than a whole new way of working that comes with a data mesh (in which those people are in many cases just moved to the business group).  Wouldn’t improving the communication between IT and the domains be the easiest solution?


Finally, I have to take issue when I hear that current big data solutions don’t scale and data mesh will solve that problem. It is trying to solve what it perceives as a major problem (“crisis”) that is really not major in my opinion. There are thousands that have implemented successful big data solutions, but there are very few data meshes in production. I have seen many “monolithic” architectures scale the technology and the organization very well. Sure, many big data projects fail, but for the same reasons that would have made them fail if they tried to implement a data mesh instead (and arguably there would be an even higher failure rate trying to build a data mesh due to the additional challenges of a data mesh). Technology for centralizing data has improved greatly, allowing solutions to scale, with serverless options now available to meet the needs of most big data requirements along with cost savings, and it will continue to improve. There is a risk with the new architecture and organizational change that comes with a data mesh, especially compared to the centralized data warehouse which has proven to work for many years if done right. Plus, the data mesh assumes that each source system can dynamically scale to meet the demands of the consumers, which will be particularly challenging when data assets become “hot spots” within the ecosystem.

But I want to be clear that I see a lot of positives with the data mesh architecture, and my hope is that it will be a great approach for certain use cases (mainly large fragmented organizations). I’m just trying to point out that a data mesh is not a silver bullet and you need to be aware of the concerns listed above before undertaking a data mesh to make sure it’s the right approach for you, so you don’t become another statistic under the failed project column. It requires a large change in a company’s technology strategy and an even larger change in a company’s organizational strategy, which will be a huge challenge that you have to be prepared for.

More info:

Data Mesh Pain Points

Building a data mesh to support an ecosystem of data products at Adevinta

Will the Data Mesh save organizations from the Data Mess?


Podcast and presentation decks on data architectures


Tomorrow (Tuesday, 8/10/21) I will be on a podcast for SaxonGlobal called “The Alphabet Soup of Data Architectures” where I will talk about the modern data warehouse, data fabric, data lakehouse, data mesh, and more. I hope you can check it out live or via replay here.

I have also created two more presentation decks:

Data Lakehouse, Data Mesh, and Data Fabric

So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs. Check it out here.

Data Warehousing Trends, Best Practices, and Future Outlook

Over the last decade, the 3Vs of data – Volume, Velocity & Variety – have grown massively. The Big Data revolution has completely changed the way companies collect, analyze & store data. Advancements in cloud-based data warehousing technologies have empowered companies to fully leverage big data without heavy investments both in terms of time and resources. But that doesn’t mean building and managing a cloud data warehouse isn’t accompanied by any challenges. From deciding on a service provider to the design architecture, deploying a data warehouse tailored to your business needs is a strenuous undertaking. Looking to deploy a data warehouse to scale your company’s data infrastructure or still on the fence? In this presentation you will gain insights into the current Data Warehousing trends, best practices, and future outlook. Learn how to build your data warehouse with the help of real-life use-cases and discussion on commonly faced challenges. In this session you will learn:

– Choosing the best solution – Data Lake vs. Data Warehouse vs. Data Mart
– Choosing the best Data Warehouse design methodologies: Data Vault vs. Kimball vs. Inmon
– Step by step approach to building an effective data warehouse architecture
– Common reasons for the failure of data warehouse implementations and how to avoid them

Check it out here.


Single-cloud versus Multi-cloud


A discussion I have seen many companies have is whether they should be single-cloud (using only one cloud company) or multi-cloud (using more than one cloud company). The three major Cloud Service Providers (CSPs) that companies use for nearly all use cases are Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform (GCP). First, here are the benefits of being multi-cloud:

  • Improved ability to meet SLAs: One CSP could have reduced latency or better HA/DR than another
  • Reduced cost: The same service may be cheaper at another CSP or the price is the same but has more features or performance
  • Reduced lock-in: Some companies don’t like the idea of using one CSP because they are “putting all their eggs in one basket”. Or they fear the CSP they are using might become a competitor to their business, which has happened with AWS
  • Capacity issues: If one CSP has capacity issues, you can lean on the other CSP for the capacity you need
  • Missing features/products: There may be certain features or products that one CSP has that another does not, and you would like to use those. Or a way to hedge your bet that one cloud company will introduce a new product/feature that the other does not have. Think best-in-class services or best-of-breed. Also, it could be a CSP supports a government cloud in a certain region and the others do not
  • Data sovereignty: One CSP may have a cloud region that another does not. For example, if you are using MongoDB in Azure, but Azure does not have a region in Taiwan, you can use MongoDB in AWS for Taiwan. This could be very important for government or regulatory reasons


But there are plenty of reasons that multi-cloud may not be a good idea:

  • Performance: All these clouds are not designed to work together so moving data and compute back and forth between clouds is typically very slow and clunky. Plus all the CSPs have egress charges when moving data between clouds
  • Increasing the skillset: Your company will need to understand two or more clouds, greatly increasing your costs
  • Reduced interoperability: Each CSP is of course only thinking about working with its own components, so having a product from one CSP work with a product from a different CSP could prove to be very challenging
  • Switching costs: Moving data and applications from one CSP to another and launching another product could be very costly, especially if you have to re-engineer applications to support multiple clouds. And there is usually no simple migration capability between CSPs. Also think about if you have purchased a bigger pipeline to the Internet, such as ExpressRoute with Azure, you would have to make the same purchase at the other CSP
  • Management overhead: Now you will have an additional CSP to manage and will have to create additional policies, standards, and procedures. Another increase in cost
  • Administrative complexity: You will now have two very different types of security between CSPs that you have to get to work together, two places to add/delete/update users, two places to monitor, billing from multiple CSPs, just to name a few things. Yet another increase in cost
  • Least common denominator: If you want to make it easy to move from one cloud to another, you won’t be able to use features in one CSP that others don’t have. In particular, you will have to use Infrastructure as a Service (IaaS) offerings instead of Platform as a Service (PaaS) which takes away a major benefit of being on the cloud. This is a big opportunity cost as you can no longer use the higher value services that each CSP offers
  • Exposure: Having data in multiple clouds increases the exposure of your data, and instead of increasing your security expertise in one cloud you are splitting it between multiple clouds


I did not put as a benefit that being multi-cloud means you can negotiate a better deal from a CSP because you have put them in a compete situation. That is more of a myth. With CSPs, the more you use, the bigger discounts you get. For enterprise license agreements (ELA), the CSPs give you higher discounts based on how much consumption you can commit on their cloud. They call such commitment-based discounts different names such as pre-commit discounts, commercial pricing discounts, or volume-based discounts. Then add in tier-based pricing, reserved instance pricing (or committed usage discounts), and sustained usage discounts, and you can see you will save more sticking with just one CSP.

And one thing to note: having two cloud providers isn’t going to save you from an outage. The CSPs have specifically designed their networks so that an outage at one region doesn’t impact other regions. So your disaster recovery solution should be to failover to another region within the same CSP, not a failover to another CSP.

Finally, going with just one CSP could allow for partnership benefits, such as EY has with Microsoft: EY and Microsoft announce expansion of collaboration to drive a US$15b growth opportunity and technology innovation across industries – Stories.

I believe that the multi-cloud approach will prove to be short lived and in time most companies will choose one CSP and stick with them so that they can go deeper and leverage the services and ecosystem fully as applications modernize to take full advantage of that cloud’s PaaS and SaaS. This is especially true now that all three CSPs have very few unique features or products anymore.

Some companies will settle on a couple of clouds as they are big enough to invest the time and people to go deep on both, and that makes sense if the clouds offer something different that is beneficial to the workload, but be advised only to do so if you go in deeply with both and take advantage of the PaaS services that each offers.

More info:

Should You Be Multi-cloud?

The Why and How of Multi-Cloud Adoption

The Advantages of Accepting Lock-In From Your Cloud Provider

Why multicloud management is a mess

Is A Single Cloud Provider a Single Point of Failure?

The myths (and realities) of cost savings with multicloud

What are your architectural drivers for adopting multi-cloud?


Data Platform products for Microsoft gaps


Microsoft has a ton of data platform-related products, but there are certain areas where they either don’t have a product or what they have is limited and you need to look at a 3rd-party product to fill that gap. At the company I work at, EY, we are building a data fabric on Azure and I have listed below the areas where we have had to look at other products outside the Microsoft realm:

Note these are just some of the products for each category based on my knowledge. Please leave a comment for products that I have missed that you like!


Azure Purview is generally available


After a very long public preview (9 months), Azure Purview is finally generally available (GA). I described Purview in a previous blog (New Microsoft data governance product: Azure Purview). In short, Purview’s main purpose is to collect metadata from various sources and provide a search feature over that metadata so you can quickly find relevant data. It can automatically classify the data (i.e. SSN) and provide data lineage, as well as allowing you to enter glossary terms. It then provides insights into the data:

(Note: The below info, including slides, were talked about during the GA announcement that can be found here).

There are over 200 built-in classifiers and support for 35 data sources and growing (in public preview are Google Cloud, Erwin, Salesforce, IBM DB2, and Cassandra; coming soon are Snowflake, PostgreSQL, MongoDB, SAP HANA, and MySQL). Automated lineage extraction now supports ADF, Azure Synapse, Azure Data Share, Teradata stored procedures, and Power BI.

The roadmap was shared and I was excited to see that data sharing, data quality, and data policy will be added:

Note there are a few items related to data insights that are still in public preview:

Azure Purview has a limited-time free offer that includes free scanning of on-premises SQL Servers and Power BI tenants, and free data sensitivity labeling for all existing Microsoft 365 E5 customers. Get started here.

More info:

Azure Purview data governance service heads to GA

Microsoft launches data governance service Azure Purview in general availability


Microsoft Ignite Announcements Nov 2021


Microsoft Ignite has always announced many new products and new product features, and this year was no exception. Many exciting announcements, and below I list the major data platform and AI related announcements:

  • SQL Server 2022: A new version of SQL Server has arrived! It is now in private preview. Some of the top new features are:
      • Integration with Azure SQL Database Managed Instance — the Microsoft-managed, cloud-based deployment of the SQL Server box product. This integration supports migrations to Managed Instance through the use of a Distributed Availability Group (DAG), which will enable near-zero-downtime database migrations. Additionally, you will have the ability to move back to on-premises through a database restore (only available for SQL Server 2022), giving bi-directional HA/DR to Azure SQL. You can also use this link feature in read scale-out scenarios to offload heavy requests that might otherwise affect database performance
      • Implementation of the ledger feature in Azure SQL Database, which was announced in May of this year, bringing the same blockchain capabilities to SQL Server
      • Azure Synapse Link for SQL Server, which provides for replication of data from SQL Server into Synapse dedicated SQL pools
      • Integration with Azure Purview, which assures that the cloud-based data governance platform encompasses SQL Server data, bringing data stored on-premises into its governance scope. That scope even includes propagation of Purview policies for centralized administration of management operations
      • Support for multi-write replication, creating corresponding multiple read replicas. This facilitates SQL Server Query Store’s enablement of query hints for the multiple replicas, improving performance without requiring a rewrite of Transact-SQL (T-SQL) code
      • A new feature called Parameter Sensitive Plan Optimization, which automatically enables the generation of multiple active cached query plans for a single parameterized statement, accommodating different data sizes based on provided runtime parameter values
      • An update to PolyBase that uses REST APIs to connect to data lakes (Azure Storage and Amazon S3) in addition to using the ODBC drivers, as well as supporting the OPENROWSET command
      • Enhancements to T-SQL that include an enhanced set of functions for working with JSON data and new time series capabilities
    More info
  • Azure Database for PostgreSQL – Flexible Server: A middle-ground position between the default Single Server offering, which is entry-level, and the Hyperscale offering, which is for large scale-out applications. Flexible Server provides high-availability options to help ensure zero data loss, a burstable compute tier, and built-in capabilities for cost optimization — including the ability to stop the server (and billing) when not in use. Flexible Server will be made generally available this month. More info
  • Azure Managed Instance for Apache Cassandra: Now GA. Cassandra is an open source, column family store NoSQL database. The Azure Cassandra service includes an automatic synchronization feature that can sync data with customers’ own Cassandra instances, on-premises and elsewhere. More info
  • Cosmos DB improvements: Partial document updates using the SQL interface are now GA; customizable provisioned throughput spending limits and cost-savings alerts in Azure Advisor are now GA
  • Azure Synapse Analytics improvements:
      • Azure Synapse Link for SQL Server, which provides for replication of data from SQL Server into Synapse dedicated SQL pools
      • Integration and adaptation of Azure Data Explorer (ADX) into the Azure Synapse platform. ADX is a platform for real-time analysis of huge volumes of log/machine/telemetry and other time series data
      • Azure Event Hubs Premium, which hits GA today, is another integration in Azure Synapse. Azure Event Hubs is a streaming event data platform that is available as a linked service, which means that Synapse users can perform event streaming, ingestion, and analysis right in the platform and do so with the reserved compute, memory, and storage resources offered by the Premium level of the service
      • Integration of a set of industry-specific database templates into Synapse Studio. Templates for retail, consumer packaged goods, and financial services (including banking, fund management, property, and casualty insurance) are being added as a preview feature, at no additional cost. The database templates are actually a new name for what was previously called common data models. A new database designer gives you the ability to create a new data model or modify an existing one for your lake database (the lake database in Azure Synapse Analytics brings together database design, meta information about the data that is stored, and a way to describe how and where the data should be stored)


For more details, check out the Microsoft Ignite book of news.

More info:

At Ignite, Microsoft enhances its cloud database, warehouse and lake services

Microsoft’s cloud-connected on-prem database: SQL Server 2022 rolls out in private preview

Introducing SQL Server 2022: Top 3 New Features Announced at Microsoft Ignite

At Ignite, Microsoft unveils data analytics, server, and DevOps products for Azure

Announcing SQL Server 2022 preview: Azure-enabled with continued performance and security innovation


Azure Synapse Analytics database templates


One of the biggest announcements at Microsoft Ignite that seemed to be overlooked by a lot of people was Azure Synapse Analytics database templates, now in public preview. I wanted to dive into it a bit in this blog because I feel this is an exciting new feature that will be used by a lot of companies.

Basically, database templates are a set of industry-specific database templates that are integrated into Synapse Studio at no additional cost. The database templates are actually common data models (see my blog Common Data Model), and an earlier version of this feature was called Synapse CDM. They were also part of a product in preview called Industry Data Workbench that was merged into Synapse. The idea is that instead of creating a data model from scratch, which can take weeks if not months, you have pre-built data models you can use instead (if you are in an industry that is currently supported). In addition to the time savings, you will have a model that is very well thought-out and tested so you won’t have to worry that it is deficient like you would if you created your own model from scratch. This greatly helps to solve the challenge of bringing in all your data from various similar sources into a standardized format to more easily analyze the data.

Within Synapse there is a new database designer that gives you the ability to create and modify a database model using a database template. And you have the option to create a new database model from scratch or add tables from an existing data lake.

The model will be stored in a lake database in Azure Synapse Analytics. The lake database brings together database design, meta information about the data that is stored, and a possibility to describe how and where the data should be stored. Lake databases use a data lake on an Azure Storage account to store the data of the database. The data can be stored in Parquet or CSV format and different settings can be used to optimize the storage.

The database templates started with six industries, and they have already added five more industries (see New Azure Synapse database templates in public preview):

I expect many more database templates to be added in the near future as Microsoft already has 75 industry vertical schemas that it acquired when it purchased ADRM software (press release).

To create a data model, go to the Data tab in Synapse and click “+”. Then choose “Lake database (preview)”. You have now created a lake database and can proceed to add data models to it. You do this by selecting the “Table” drop-down menu and choosing “Custom” to create a brand new model, or “From template” to create a data model using one of the industry templates. You can then select a table from the designer pane to modify it.

To map fields from the source data to the Synapse lake database tables, use the Map Data tool in Synapse as described here (it uses the mapping data flow in ADF). Hopefully in the future Microsoft will add default mapping templates for popular sources such as Salesforce and SAP. I also hope to see the ability to create the data models in a Synapse dedicated pool along with the ADF code to transfer the data from the lake database to the Synapse dedicated pool, as I can see the need to query the data in a relational database instead of a data lake (see Data Lakehouse & Synapse for the reasons querying from a relational database may be better than a data lake).
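
Since lake database tables are also visible to Synapse Spark pools, another way to load them (alongside the Map Data tool) is from a Spark notebook. Here is a minimal sketch; the retail_db database, Customer table, columns, and storage path are placeholders and the column mapping is purely illustrative.

```python
# Minimal sketch: loading a lake database table from a Synapse Spark notebook.
# The "retail_db" database, "Customer" table, columns, and storage path are
# placeholders; the mapping shown is purely illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided automatically in Synapse notebooks

# Read source data that has already been landed in the data lake
raw_customers = spark.read.parquet(
    "abfss://raw@<storageaccount>.dfs.core.windows.net/crm/customers/"
)

# Shape the source columns to match the template table's schema
customers = raw_customers.selectExpr(
    "customer_id as CustomerId",
    "first_name as FirstName",
    "last_name as LastName",
)

# Append into the lake database table; the rows land as files in the
# storage account backing the lake database
customers.write.mode("append").saveAsTable("retail_db.Customer")

# The same table can then be queried from Spark or serverless SQL
spark.sql("SELECT COUNT(*) AS customer_count FROM retail_db.Customer").show()
```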

Using database templates means we will know the shape of the data which provides another benefit: we can use pre-built ML and AI models on that data. Microsoft has already provided one in the gallery under “Database templates – AI Solutions” called “Retail – Product recommendations”, which creates a Jupyter notebook with Python code for you, and I expect to see many more pre-built ML models in the near future.

For more info about Azure Synapse Analytics database templates check out the documentation.

More info:

Database templates in Azure Synapse Analytics


Azure Synapse Analytics November updates


Microsoft recently came out with a blog on a bunch of new features available for Azure Synapse Analytics (see Azure Synapse Analytics November 2021 Update), and I wanted to point out my top ten personal favorites:

  • Synapse Data Explorer now available in preview: Azure Data Explorer (ADX) is a fully managed data analytics service for near real-time analysis on large volumes of data streaming (i.e. log and telemetry data) from such sources as applications, websites, or IoT devices. ADX is now integrated into Synapse and complements the Synapse SQL and Apache Spark analytic engines already available in Synapse Analytics. See What is Azure Synapse Data Explorer?
  • Database Templates: A set of industry-specific database templates that are integrated into Synapse Studio at no additional cost.  I blogged about this feature at Azure Synapse Analytics database templates
  • Introducing Lake databases: Previously, Synapse workspaces had a kind of database called a Spark Database. Tables in Spark databases kept their underlying data in Azure Storage accounts (i.e. data lakes), and tables in Spark databases could be queried by both Spark pools and by serverless SQL pools. To help make it clear that these databases are supported by both Spark and SQL and to clarify their relationship to data lakes, they have renamed Spark databases to Lake databases. Lake databases work just like Spark databases did before – they just have a new name. They will show up under “Lake database” on the Data tab in Synapse Studio. Any database created using database templates is also a Lake database and will also show up under “Lake database” on the Data tab. Note that SQL databases (dedicated SQL pool databases or databases created using serverless SQL pools) will show up under “SQL database” on the Data tab. See The best practices for organizing Synapse workspaces and lakehouses
  • Lake database designer now available in preview: Until now you’ve had to write code to design databases, tables, etc. In this update, instead of writing code to design your database, any new Lake databases you create will support a new no-code design experience called the database designer that is built into Synapse Studio. Note the designer does not work for SQL databases (i.e. a dedicated SQL pool database)
  • Delta Lake support for serverless SQL is generally available: Azure Synapse has had preview-level support for serverless SQL pools querying the Delta Lake format. This enables BI and reporting tools to access data in Delta Lake format through standard T-SQL. With this latest update, the support is now Generally Available and can be used in production. See How to query Delta Lake files using serverless SQL pools
  • Handling invalid rows with OPENROWSET in serverless SQL: Often raw data includes invalid rows that will cause your queries to fail. You can now use OPENROWSET to reject these bad rows and place them in a separate file so you can examine those rows later. See the reject options in OPENROWSET and external tables
  • Accelerate Spark workloads with NVIDIA GPU acceleration: Hardware-accelerated pools are now in public preview for Spark in Synapse. With Hardware-accelerated pools you can speed up big data with NVIDIA GPU-accelerated Spark pools. This can reduce the time necessary to run data integration pipelines, score ML models, and more. This means less time waiting for data to process and more time identifying insights to drive better business outcomes. See GPU-Accelerated Apache Spark Pools – Azure Synapse Analytics
  • Mapping Data Flow gets new native connectors: Data flows allow data transformation using a visual designer instead of writing code. They have added two new native mapping data flow connectors for Synapse Analytics. You can now connect directly to AWS S3 buckets and Azure Data Explorer clusters for data transformations. See Mapping Data Flow gets new native connectors
  • Synapse Link for Dataverse. Dataverse became the new name for what used to be called Common Data Service (CDS). CDS is a data service that has served as the foundation for data storage and modeling capabilities for building custom Power Apps as well as leveraging the many Dynamics 365 app products (such as Dynamics 365 Sales, Dynamics 365 Customer Service, or Dynamics 365 Talent) built by Microsoft. Previously available in Preview, Synapse Link for Dataverse is now Generally Available. With a few clicks, data from Dataverse will land in Azure Synapse for analytics exploration without putting unnecessary load on the operational databases—no ETL processes, pipelines, or management overhead. See Create an Azure Synapse Link for Dataverse with your Azure Synapse Workspace
  • Synapse Link for SQL Server: Provides automatic change feeds that capture the changes within SQL Server and feed those into Azure Synapse Analytics. It provides near real-time analysis and hybrid transactional and analytical processing with minimal impact on operational systems. So no more ETL! You can now apply for the preview of Synapse Link for SQL Server 2022.
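To illustrate the Delta Lake item above: because serverless SQL pools can now read the Delta format with OPENROWSET, any tool that can reach the serverless endpoint can query a Delta folder with plain T-SQL. The sketch below does it from Python via pyodbc; the endpoint, credentials, and storage path are placeholders, and the exact ODBC driver version will depend on what you have installed.

```python
import pyodbc

# Placeholders: your workspace's serverless (on-demand) SQL endpoint and credentials.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "UID=sqladminuser;PWD=<password>;"
)

# Serverless SQL pools can read Delta Lake folders directly with FORMAT = 'DELTA'.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorage.dfs.core.windows.net/datalake/presentation/sales/',
    FORMAT = 'DELTA'
) AS rows;
"""

for row in conn.cursor().execute(query):
    print(row)

conn.close()
```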


Many of these new features can be seen in the Microsoft webinar Build a Unified Analytics Platform with Azure Synapse and Power BI.

There is free training for Synapse at Let’s Build Together — Hands-on Training Series for Azure Synapse Analytics. If you need some free Synapse queries for the training, or just want to play with Synapse, check out the Limited-time free quantities offer for Azure Synapse Analytics.

The post Azure Synapse Analytics November updates first appeared on James Serra's Blog.

Distributed SQL


A very brief history of databases for online transaction processing (OLTP) workloads starts with relational databases (RDBMS), which worked well for many years. With the advent of the internet and the need to handle millions of transactions per second, NoSQL databases filled that need, but they model data in means other than the tabular relations used in relational databases and use eventual consistency instead of ACID (among other differences). Then came NewSQL, a class of RDBMS that seeks to provide the scalability of NoSQL systems for OLTP workloads while maintaining the ACID guarantees of a traditional database system. Now the latest technology for databases is Distributed SQL.

A Distributed SQL database is a single relational database that replicates data across multiple servers. They are strongly consistent, and most support consistency across racks, data centers, and wide area networks, including cloud availability zones and cloud geographic zones. Distributed SQL databases typically use the Paxos or Raft algorithms to achieve consensus across multiple nodes. They are for OLTP applications, not for OLAP solutions. Sometimes distributed SQL databases are referred to as NewSQL, but NewSQL is a more inclusive term that includes databases that are not distributed databases. Usually distributed SQL databases are built from the ground up, while NewSQL databases add replication and sharding technologies to existing client-server relational databases like PostgreSQL. Some experts define Distributed SQL databases as a more specific subset of NewSQL databases. For more details on the differences between relational databases, non-relational databases (NoSQL), and NewSQL, see my blog Relational databases vs Non-relational databases.

A Distributed SQL database was really made possible by the cloud, with its ability to easily scale and its built-in resiliency and disaster recovery. You simply scale by adding more nodes, and each node can handle read or write queries. It can scale beyond a single data center into multiple regions (even different clouds) while maintaining one single logical database (so your applications use a single connection string). It can survive the failure of a node or even a service disruption in an entire region. You can also geo-locate data near a user to reduce read/write latencies or comply with regulations.

To better understand how a Distributed SQL database works, I’ll go into details using CockroachDB, since it is one of the more popular ones (2nd most popular next to Amazon Aurora according to db-engines):

Every database consists of three layers, from bottom to top: storage, SQL execution, and SQL syntax. CockroachDB rearchitects the bottom two. Behind the scenes, most Distributed SQL solutions store the data differently than a relational database does. For example, CockroachDB uses a key-value store (called Pebble). So data appears to the user to be in a relational database (PostgreSQL in the case of CockroachDB) but is actually in a key-value store, which allows tables to be sharded (into what CockroachDB calls ranges) across multiple regions. It uses the Raft algorithm to replicate data to three replicas in real time (you can configure it to use more than three replicas).

Each of the three replicas can be placed anywhere, based on your application’s requirements, such as diversity (balanced across nodes to improve resiliency), load (placement balanced on real-time usage to improve performance), or latency and geo-partitioning (to improve latency and satisfy regulations). It provides consistent transactions (ACID), so it can be used for things like financial transactions (unlike NoSQL, which provides eventual consistency). If a node is added, CockroachDB automatically redistributes the replicas to even the load across the cluster based on the three replica placement heuristics just discussed. If a node goes down, it automatically replaces it with a new replica on an active node.

The tradeoff to writing to three replicas is that writes can take longer, especially if the nodes are in regions that are far apart. And reads can take longer if a query needs data that is on multiple nodes that are far apart (this is improved via push-down queries and cost-based optimizers). The key to improving latency is to have your ranges and replicas close together via the three replica placement heuristics (or via other methods discussed at 9 Techniques to Build Cloud-Native, Geo-Distributed SQL Apps with Low Latency), but there are built-in optimizations in CockroachDB that help with distributed transaction performance (called Lazy Transaction Records and Write Pipelining).
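Because CockroachDB speaks the PostgreSQL wire protocol, an application talks to the whole cluster as if it were one ordinary PostgreSQL database. The sketch below, with placeholder host, credentials, and table, shows that single-logical-database experience from Python using psycopg2; the replication, consensus, and replica placement described above all happen underneath this one connection.

```python
import psycopg2

# One logical database and one connection string, no matter how many nodes or
# regions sit behind it. Host, credentials, and table are placeholders.
conn = psycopg2.connect(
    host="my-cockroach-cluster.example.com",
    port=26257,              # CockroachDB's default SQL port
    dbname="bank",
    user="app_user",
    password="<password>",
    sslmode="require",
)

# An ordinary ACID transaction; Raft replication to the replicas is handled by
# the cluster, invisible to the application.
with conn, conn.cursor() as cur:
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (1,))
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = %s", (2,))

conn.close()
```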

The below slide from a CockroachDB video compares Distributed SQL with relational and NoSQL:

Distributed SQL databases include: MariaDB’s Xpand, CockroachDB, Google Cloud Spanner, YugabyteDB, MariaDB’s SkySQL, Azure Database for PostgreSQL-Hyperscale, Amazon Aurora, and PingCap TiDB. NoSQL databases include: MongoDB, Azure Cosmos DB, Cassandra, Couchbase, and HBase. NewSQL databases include VoltDB, NuoDB, MemSQL, SAP HANA, Splice Machine, Clustrix, and Altibase.

In my experience, I have yet to see a company that uses a Distributed SQL database. Considering that the most popular one, Amazon Aurora, is ranked only #45 on db-engines, they are not that popular. They have certainly become more popular than NewSQL databases, which have really dropped in customer use. I can’t see Distributed SQL databases becoming much more popular than they are now, as their use case is small: if your OLTP application does not need to support millions of transactions per second, then a traditional relational database will do just fine. And most workloads that do need that type of performance have gone the NoSQL route, and I think it will be hard for Distributed SQL databases to cut into the NoSQL market, which has been around a lot longer and is very popular.

More info:

Distributed SQL Takes Databases to the Next Level

Video Distributed SQL Architecture | Cloud Database | Distributed Database

What is Distributed SQL?

Distributed SQL vs. NewSQL

SQL VS. NOSQL VS. NEWSQL: FINDING THE RIGHT SOLUTION

NoSQL vs. NewSQL vs. Distributed SQL: DZone’s 2020 Trend Report

A Decade in Review: Distributed SQL Takes the Stage as NewSQL Exits

NoSQL, NewSQL and Distributed SQL Systems

The post Distributed SQL first appeared on James Serra's Blog.

Microsoft industry clouds



In talking with Microsoft customers, I have found that most are not aware that Microsoft has created industry clouds (see Microsoft Industry Clouds). So I wanted to use this blog to briefly explain what they are and how they may be useful to you. To clear up a lot of confusion: despite the name, these are not clouds that are separate from Azure. Rather, they use the Azure cloud along with existing tools and products customized for a specific industry and placed within the Azure cloud (think “add-ons”).

These industry clouds package together common data models, cross-cloud connectors, workflows, application programming interfaces, and industry-specific components and standards. For example, the retail cloud includes a pre-built unified customer profile that was created using Dynamics 365 and specially built widgets, along with machine learning models that use a retail common data model. As another example, the nonprofit cloud has a Fundraising and Engagement module, an add-on on top of Dynamics 365 Sales Enterprise that handles the complex relationships nonprofit organizations have with the people they deal with.

I like to think of the industry clouds as a shortcut to getting value out of your data as you can build solutions much quicker, or even use a pre-built solution right out of the box. All of the industry solutions use the following products (in addition to Azure): Dynamics 365, Microsoft 365, Microsoft Teams, and Power Platform. And most use additional products such as HoloLens, LinkedIn or Azure Synapse Analytics.

The six industry clouds created so far:

Microsoft Cloud for Healthcare (see announcement) – Provides capabilities to manage health data at scale and make it easier for healthcare organizations to improve the patient experience, coordinate care, and drive operational efficiency, while helping support security, compliance, and interoperability of health data. See products used and capabilities and pricing.

Microsoft Cloud for Retail (see announcement) – Reimagine retail and deliver seamless experiences across the end-to-end shopper journey. See products used and capabilities and pricing.

Microsoft Cloud for Financial Services (see announcement) – Provides capabilities to manage data to deliver differentiated experiences, empower employees, and combat financial crime while facilitating security, compliance, and interoperability. See products used and capabilities and pricing.

Microsoft Cloud for Nonprofit (see announcement) – Built for fundraisers, volunteer managers, program managers, and other roles unique to nonprofit organizations, these products address the sector’s most urgent challenges. See products used and capabilities and pricing.

Microsoft Cloud for Manufacturing (preview) (see announcement) – Designed to deliver capabilities that support the core processes and requirements of the industry. These end-to-end manufacturing cloud solutions include released and new capabilities that help securely connect people, assets, workflow, and business processes, empowering organizations to be more resilient.

Microsoft Cloud for Sustainability (preview) (see announcement) – An extensible software-as-a-service solution that helps you record, report, and reduce your organization’s environmental impact through automated data connections and actionable insights.

Industry clouds will also be useful for Microsoft partners: instead of building a custom solution from scratch, partners can use the building blocks in each industry cloud and then customize those. Many partners have already built solutions for the industry clouds.

More info:

Give customers an edge with new Microsoft industry clouds

Microsoft unveils three more ‘industry clouds’ for financial, manufacturing and nonprofit

Ignite Nov ’21: Microsoft expands Industry Clouds verticals

Microsoft is going to have lots of clouds for industries. Here’s why they matter

Microsoft: How we deliver industry-specific cloud computing at scale

Microsoft Expands Cloud Programs for Specific Industries

The post Microsoft industry clouds first appeared on James Serra's Blog.

Data Lakehouse, Data Mesh, and Data Fabric 


(NOTE: I have returned to Microsoft and am working as a Solution Architect in Microsoft Industry Solutions, formerly known as Microsoft Consulting Services (MCS), where I help customers build solutions on Azure. Contact your Microsoft account executive for more info. That being said: the views and opinions in this blog are mine and not those of Microsoft).

There certainly has been a lot of discussion lately on the topic of Data Lakehouse, Data Mesh, and Data Fabric, and how they compare to the Modern Data Warehouse. There is no clear definition of these data architectures, so I have created a presentation with my own take on them that I have been presenting frequently internally at Microsoft and externally to customers and at conferences. Hopefully these presentations, blog posts, and videos can help clarify all these data architectures for you:


Look for a blog post of mine in a couple of months that will cover Microsoft’s vision and technology solution for a data mesh.

Presentation abstract:

Data Lakehouse, Data Mesh, and Data Fabric (the alphabet soup of data architectures)

So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric.  What do all these terms mean, and how do they compare to a modern data warehouse?  In this session I’ll cover all of them in detail and compare the pros and cons of each.  They all may sound great in theory, but I’ll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I’ll discuss Microsoft’s version of the data mesh.

The post Data Lakehouse, Data Mesh, and Data Fabric  first appeared on James Serra's Blog.

Azure Synapse and Delta Lake


Many companies are seeing the value in collecting data to help them make better business decisions. When building a solution in Azure to collect the data, nearly everyone is using a data lake. A majority of those are also using delta lake, which is basically a software layer over a data lake that gives additional features. I have yet to see anyone using competing technologies to delta lake in Azure, such as Apache Hudi or Apache Iceberg (see A Thorough Comparison of Delta Lake, Iceberg and Hudi and Open Source Data Lake Table Formats: Evaluating Current Interest and Rate of Adoption).

The reason most are using delta lake is the following features that it provides over just using a plain data lake (with support for the MERGE statement being the biggest one; a sketch of a MERGE follows the list below):

  • ACID transactions
  • Time travel (data versioning enables rollbacks, audit trail)
  • Streaming and batch unification
  • Schema enforcement
  • Supports the DELETE, UPDATE, and MERGE commands
  • Performance improvement
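As a sketch of that MERGE capability, here is a minimal PySpark upsert using the Delta Lake Python API. The storage paths and column names are placeholders; in a Synapse Spark pool or Azure Databricks cluster the Delta libraries are already available, so no extra installation should be needed.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Placeholder paths: the Delta table in the cleaned layer and a batch of changed rows.
target = DeltaTable.forPath(
    spark, "abfss://datalake@mystorage.dfs.core.windows.net/cleaned/customers")
changes = spark.read.format("delta").load(
    "abfss://datalake@mystorage.dfs.core.windows.net/raw/customer_changes")

# Upsert: update customers that already exist, insert the ones that don't --
# the MERGE that plain Parquet files in a data lake cannot do.
(target.alias("t")
    .merge(changes.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```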


Fortunately most Azure products now support delta lake, such as:


However, some products or features do not support delta lake (at least not yet), so I wanted to make you aware of those:


Serverless SQL pools do not support updating delta lake files. Use Azure Databricks or Apache Spark pools in Azure Synapse Analytics to update Delta Lake.

Within Power BI, there is a connector for Synapse (called “Azure Synapse Analytics SQL”) that can connect to an Azure Synapse serverless SQL pool, which can have a view that queries a delta table. However, you are limited to the compute offered by the serverless pool, and if that does not give you the performance you need, or if you want direct control over the ability to scale up, you might instead want to use the “Azure Databricks” connector, which will give you more compute (see Connecting Power BI to Azure Databricks). Note there is a new “Azure Synapse Analytics workspace (Beta)” connector in Power BI that can also query a delta table (see Supercharge BI insights with the new Azure Synapse Analytics workspace connector for Power Query and Azure Synapse Analytics workspace (Beta)), but that also uses serverless SQL pool compute and not Spark pool compute.

Note that an Azure Synapse serverless SQL pool can access data in a data lake, a delta lake, and a Spark table (called a Lake database), but only if the Lake database is in Parquet or CSV format and NOT in delta lake format – see Azure Synapse Analytics shared metadata tables. An Azure Synapse Spark pool can access data in a data lake, a delta lake, and a Lake database in any format, including delta lake. So if you are using a Lake database that is built on the delta lake format, you would not be able to use an Azure Synapse serverless SQL pool to query it, only an Azure Synapse Spark pool. This also means that if you are using the “Azure Synapse Analytics workspace (Beta)” connector in Power BI, you won’t see Lake database tables built on the delta lake format to connect to.
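As a quick illustration of that last point, a Synapse Spark pool can query a Lake database table stored in Delta format with ordinary Spark SQL, while the serverless SQL pool (and the Power BI connectors that rely on it) will not surface that table. The database and table names below are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A Lake database table backed by Delta files is visible to the Spark engine...
df = spark.sql("SELECT customer_id, lifetime_value FROM my_lake_db.customer_delta")
df.show()

# ...but the same table is not exposed through the serverless SQL pool until it
# is stored as Parquet or CSV, so plan your formats around how you will query.
```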

More info:

Exploring Delta Lake in Azure Synapse Analytics

The post Azure Synapse and Delta Lake first appeared on James Serra's Blog.

Azure IoT Central


This is a short blog to give you a high-level overview of a product called Azure IoT Central. I saw this fairly new Azure product (GA Sept 2018) in use for the first time at a large manufacturing company that was using it at their manufacturing facility (see Grupo Bimbo takes a bite out of production costs with Azure IoT throughout factories). They have thousands of sensors that are collecting data for all the machines used in producing their products. In short, think of it as an “Application Platform as a Service (aPaaS)” for quickly building IoT solutions. It boxes up IoT Hub, Device Provisioning Service (DPS), Stream Analytics, Data Explorer, SQL Database, Time Series Insights, and Cosmos DB to make it easy to quickly build a solution and get value out of the IoT data. To get an idea of what this solution would look like, check out the IoT Central sample for calculating Overall Equipment Effectiveness (OEE) of industrial equipment.

IoT Central starts with connecting and monitoring IoT devices and their data, then moves on to analyzing that information via reports and ML models, then to improving physical processes, and finally to transforming operations and business models.

Azure IoT Central has a bunch of base capabilities (connectivity, provisioning, device management, dashboarding, etc.) leveraging all those products. The data ingested into the platform is kept for 30 days. If you need to use the data beyond the built-in capabilities or keep it longer, you can leverage “continuous data export” to stream data out to external destinations like Event Hubs, a data lake, Azure Data Explorer (ADX), etc.

At a high level, this is what the architecture of IoT Central looks like:

The out-of-the-box website that is created for you looks like this:

Screenshot of IoT Central Dashboard.

A brief discussion of each menu item:

Devices lets you manage all your devices.

Device groups lets you view and create collections of devices specified by a query. Device groups are used through the application to perform bulk operations.

Device templates lets you create and manage the characteristics of devices that connect to your application.

Data explorer exposes rich capabilities to analyze historical trends and correlate various telemetries from your devices.

Dashboards displays all application and personal dashboards.

Jobs lets you manage your devices at scale by running bulk operations.

Rules lets you create and edit rules to monitor your devices. Rules are evaluated based on device data and trigger customizable actions.

Data export lets you configure a continuous export to external services such as storage and queues.

Permissions lets you manage an organization’s users, devices and data.

Application lets you manage your application’s settings, billing, users, and roles.

Customization lets you customize your application appearance.

IoT Central uses ADX under the hood, but that is not exposed to the user directly. Instead, you can query the data using the UI shown above or via the IoT Central REST API.

When you begin your IoT journey, start with Azure IoT Central. It is the fastest and easiest way to get started using Azure IoT. However, if you require a high level of customization, you can move from IoT Central and go lower in the stack with the Azure IoT platform as a service (PaaS) services. Use the IoT Central migrator tool to migrate devices seamlessly from IoT Central to a custom PaaS solution that uses the Device Provisioning Service (DPS) and IoT Hub service. See How do I move between aPaaS and PaaS solutions?

The IoT Central REST API lets you develop client applications that integrate with IoT Central applications. Use the REST API to work with resources in your IoT Central application such as device templates, devices, jobs, users, and roles.
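As a small example of what that looks like, the sketch below lists the devices in an application with the Python requests library. The app subdomain and API token are placeholders, and the api-version value is an assumption; check the IoT Central REST API docs for the version that is current when you try this.

```python
import requests

# Placeholders / assumptions: your IoT Central app URL, an API token generated under
# Administration > API tokens, and an api-version (verify against the current docs).
APP_URL = "https://my-iot-app.azureiotcentral.com"
API_VERSION = "2022-07-31"
API_TOKEN = "<IoT Central API token>"

headers = {"Authorization": API_TOKEN}

# List the devices registered in the application.
resp = requests.get(f"{APP_URL}/api/devices",
                    params={"api-version": API_VERSION},
                    headers=headers)
resp.raise_for_status()

for device in resp.json().get("value", []):
    print(device.get("id"), device.get("displayName"))
```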

The IoT Central homepage is the place to learn more about the latest news and features available on IoT Central, create new applications, and see and launch your existing applications.

For more information about IoT central, check out updates to the product, demos, and the docs.

Also check out the upcoming IoT Central Summit on April 7th as it has all the very latest messaging and demos. Registration is open to the public at https://aka.ms/azureiotcentralsummitregistration.

More info:

Azure IoT Central Webinar

The post Azure IoT Central first appeared on James Serra's Blog.

Power BI Performance Features


It’s hard to believe, but Power BI has now been available for over 10 years (see history)! Over the last few years there have been a number of new features to improve the performance of queries for dashboards and reports, especially for very large datasets, and I wanted to mention those features so you are aware of them in case you need to give your dashboards and reports a speed boost. Some of these features also improve the time it takes to load data:

  • Composite models: Allows a single report to seamlessly combine data from one or more DirectQuery sources, and/or combine data from a mix of DirectQuery sources and imported data.  So this means you can combine multiple DirectQuery sources with multiple Import sources.  I blogged about this in detail at Power BI new feature: Composite models. A good review of Import mode, DirectQuery mode, and Composite mode is at Dataset modes in the Power BI service.
  • User-defined Aggregations: Can improve query performance over very large DirectQuery datasets. By using aggregations, you cache data at the aggregated level in-memory (it is similar to what Azure Analysis Services does). I blogged about this in detail at Power BI new feature: Composite models. For more info see User-defined aggregations
  • Incremental refreshes: When data is loaded, instead of doing a flush-and-fill (wiping out the entire dataset and re-loading it, which can be a long process if you have a big dataset), it can be incrementally refreshed, so that only new or updated data is loaded.  Power BI incremental refresh does that and also provides automated partition creation and management for dataset tables that often load new and updated data. See Incremental refresh and real-time data for datasets
  • Hybrid tables (public preview): A hybrid table is a table with one or multiple Import partitions and one DirectQuery partition. The advantage of a hybrid table is that it can be queried quickly and efficiently from in-memory data while at the same time including the latest data changes from the data source that occurred after the last import cycle. The easiest way to create a hybrid table is to configure an incremental refresh policy in Power BI Desktop and enable the option “Get the latest data in real time with DirectQuery (Premium only)”. More info
  • Automatic Aggregations (public preview): Uses state-of-the-art machine learning (ML) to continuously optimize DirectQuery datasets for maximum report query performance. Built on top of existing user-defined aggregations (mentioned above). Unlike user-defined aggregations, automatic aggregations don’t require extensive data modeling and query-optimization skills to configure and maintain. Automatic aggregations are both self-training and self-optimizing.  Automatic aggregations are supported for Power BI Premium per capacity, Premium per user, and Power BI Embedded datasets. More info
  • Power BI performance accelerator for Azure Synapse Analytics: When turned on in Azure Synapse Analytics, tracks the most utilized Power BI queries in an organization and creates cached views to optimize query performance. This was announced a while back, but not available yet

If you are using Synapse with Power BI, check out Azure Synapse Analytics & Power BI performance.

The post Power BI Performance Features first appeared on James Serra's Blog.

Power BI connectors to Azure Synapse


When using Power BI and pulling data from Azure Synapse, you will use the “Get Data” feature in Power BI. There are now three connectors that you can use, which I will cover in this blog. But first, a review of the various ways to store data in Synapse:

Lake databases: Previously, Synapse workspaces had a kind of database called a Spark database. Tables in Spark databases kept their underlying data in Azure Storage accounts (i.e. data lakes), and tables in Spark databases could be queried by both Spark pools and by serverless SQL pools (if the tables are in Parquet or CSV format). To help make it clear that these databases are supported by both Spark and SQL and to clarify their relationship to data lakes, they have renamed Spark databases to Lake databases. Lake databases work just like Spark databases did before – they just have a new name. Any database created using database templates is also a Lake database. They will show up under “Lake database” on the Data tab in Synapse Studio. See The best practices for organizing Synapse workspaces and lakehouses.

SQL databases: These are dedicated SQL pool relational databases or serverless databases (that use external tables) created using serverless SQL pools. They will show up under “SQL database” on the Data tab in Synapse Studio.

The three connectors you can use in Power BI to connect to Synapse are:

Azure Synapse Analytics SQL – Can be used to connect to SQL databases (dedicated SQL Pools and serverless SQL pools), as well as Lake databases (if they are in Parquet or CSV format). Have to enter the Dedicated SQL endpoint or Serverless SQL endpoint. Supports Import and DirectQuery mode
Azure SQL Database – Can also be used to connect to SQL databases (dedicated SQL Pools and serverless SQL pools), as well as Lake databases (if they are in Parquet or CSV format). Have to enter the Dedicated SQL endpoint or Serverless SQL endpoint. Supports Import and DirectQuery mode. This seems to be the exact same connector as Azure Synapse Analytics SQL
Azure Synapse Analytics workspace (beta) – Recently released (see Supercharge BI insights with the new Azure Synapse Analytics workspace connector for Power Query). Can connect to Lake databases (if they are in Parquet or CSV format) and dedicated SQL pools, but it seems not serverless databases yet (I’m running into errors and will report back). Supports Import mode but not DirectQuery mode. The cool thing is you don’t need to enter endpoints; just sign in to your organization in Power BI and you will see all the Synapse workspaces under your subscription, and under each workspace you will see all the Lake databases and SQL databases you have access to

Note that with all these connectors, Lake databases and SQL databases are not separated out in the Power BI Navigator after you connect to Synapse; rather, they are all grouped together.

Make sure to read Azure Synapse and Delta Lake on the limitations on the types of data that can be accessed using Azure Synapse serverless SQL pools.

The post Power BI connectors to Azure Synapse first appeared on James Serra's Blog.