Deep-Dive into the New Features and Capabilities of Databricks
April 29, 2021
In recent years, Databricks has made its mark as a unified analytics framework that makes it easier for users to collaborate and share code and resources. By breaking down silos between Data Engineers and Data Analysts, Databricks enables large-scale data processing, analytics, data science, and machine learning.
History and Core Functionalities of Databricks
Databricks is the brainchild of the creators of Apache Spark, a unified framework that provides capabilities for distributed data processing. The idea behind Spark was to create a platform-agnostic framework for developing software and applications for distributed computing. When Spark gained popularity across various systems, technologies, and platforms, its creators established Databricks as a company delivering closed-source optimizations of Apache Spark in terms of performance and extended capabilities.
Databricks, which is essentially a wrapper on top of Spark, is a closed-source unified analytics framework available only in the cloud. It comes with a collaborative user environment in which users can share resources and work together. Databricks notebooks can be used for both data analytics and data integration, which makes information management and sharing between Data Engineers and Data Analysts much easier.
Like Apache Spark, Databricks supports several programming languages, including Java, Scala, Python, and R. Of these, R and Python are predominantly used for analytics, while Scala and Java are typical application development languages.
Since Databricks is a wrapper on Spark, the core Apache Spark APIs and libraries are all available in Databricks. Spark revolves around the concept of the Resilient Distributed Dataset (RDD), and all libraries and extensions available within Spark are based on it. Spark SQL is a framework for processing structured data in Spark and lets users query that data using SQL. DataFrames and Datasets, which are abstractions of Spark SQL, are also available to Databricks users. Spark Streaming and its successor, Structured Streaming, can be used for near-real-time data processing in micro-batches. Other commonly used Spark libraries include MLlib for machine learning and GraphX for graphs and graph-parallel computation.
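To make the micro-batch idea concrete, here is a minimal sketch in plain Python. It does not use the Spark API: the function names and the sample events are hypothetical, and batches here are cut by size, whereas a streaming engine would typically cut them on a time trigger. It only illustrates the pattern of updating a result incrementally as each small batch arrives.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def micro_batches(events: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an incoming event stream into fixed-size micro-batches."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_counts(events: Iterable[str], batch_size: int) -> Iterator[Dict[str, int]]:
    """Process one micro-batch at a time, yielding the updated totals after each."""
    totals: Counter = Counter()
    for batch in micro_batches(events, batch_size):
        totals.update(batch)  # incremental update, not a full recompute
        yield dict(totals)


# Hypothetical click events standing in for a real stream source.
sample_events = ["home", "cart", "home", "checkout", "home"]
snapshots = list(running_counts(sample_events, batch_size=2))
```

Each element of `snapshots` is the state of the aggregate after one batch, which is roughly what a dashboard reading from a streaming job would see refreshed over time.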
As Databricks evolved, more technologies developed by the company have been added to the offering. These include:
- Delta Lake: Delta Lake is an open-format storage layer that delivers reliability, security, and performance on your data lake, for both streaming and batch operations. It is a metadata wrapper around data stored in the Apache Parquet format, which allows users to ensure data consistency, keep a comprehensive data history, and even roll back to older versions of data if needed. It is an open approach to introducing the basic tenets of data management and governance into data lakes.
- MLflow: MLflow is a machine learning lifecycle platform that allows data analysts to keep track of their machine learning models and experiments. It is a leading framework for MLOps supporting the tracking, registry, and deployment of machine learning models.
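The data history and rollback described for Delta Lake can be pictured as an append-only log of commits kept alongside immutable data files. The sketch below is a toy model in plain Python, not the Delta Lake API: the class `ToyVersionedTable`, its methods, and the file names are all hypothetical, and a real implementation also handles schemas, concurrency, and checkpoints. It shows only why replaying a log up to a given version is enough to read the table as it was at that point.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Commit:
    """One entry in the toy transaction log: files added and files removed."""
    added: Set[str] = field(default_factory=set)
    removed: Set[str] = field(default_factory=set)


class ToyVersionedTable:
    """Append-only commit log over immutable data files (illustrative only)."""

    def __init__(self) -> None:
        self._log: List[Commit] = []

    def commit(self, added: Iterable[str] = (), removed: Iterable[str] = ()) -> int:
        """Record a change and return the new version number (0-based)."""
        self._log.append(Commit(set(added), set(removed)))
        return len(self._log) - 1

    def files_at(self, version: int) -> Set[str]:
        """Replay the log up to `version` to find the files visible then."""
        if not 0 <= version < len(self._log):
            raise ValueError(f"unknown version {version}")
        live: Set[str] = set()
        for entry in self._log[: version + 1]:
            live -= entry.removed
            live |= entry.added
        return live

    def latest(self) -> Set[str]:
        """Files visible in the most recent version."""
        return self.files_at(len(self._log) - 1)


table = ToyVersionedTable()
v0 = table.commit(added=["part-0.parquet"])
v1 = table.commit(added=["part-1.parquet"])
# A rewrite replaces an old file with a new one; the old file is only
# marked as removed in the log, so earlier versions stay readable.
v2 = table.commit(added=["part-2.parquet"], removed=["part-0.parquet"])
```

Because nothing is overwritten in place, reading `table.files_at(v1)` after the rewrite still returns the files that made up version 1, which is the essence of rolling back to an older version.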
What’s New in Databricks
Databricks is constantly evolving to add new features and capabilities that better address the requirements of its customers.
Data Lakehouse: Data Lakehouse is a fast SQL querying capability that Databricks introduced in 2021. Lakehouse integrates the capabilities of Databricks’ optimized Spark SQL and Delta Lake, allowing users to efficiently query data stored in the Delta Lake using SQL. The Lakehouse mimics the functionalities and performance of cloud-based data warehouses without the need to actually create one. This new, open architecture combines the best features of data lakes and data warehouses and enables BI and machine learning on all data.
Availability on all major clouds: Previously, Databricks was only available on Amazon Web Services (AWS) and Microsoft Azure. In March 2021, however, Databricks was made available for public preview on Google Cloud and is expected to go into general availability later in 2021. With this move, Databricks will be available for use on all three major cloud platforms. It is also available on Alibaba Cloud in the Asia Pacific region.
Industry-Specific Accelerators: Databricks has been working on introducing industry-specific accelerators for Healthcare & Life Sciences, Manufacturing and Automotive, Retail & CPG, Financial Services, Media and Entertainment, and other sectors. These use cases are built on Lakehouse, which draws data from the Delta Lake via the Delta Engine for streaming analytics, BI, data science, machine learning, and other applications.
Databricks’ industry-specific technical teams have written standard code to address challenges that they see across each industry. For example, within Retail & CPG, Databricks’ technical team has prepared about 80% of the plain code needed for Demand Forecasting solutions, which is then made available to Databricks partners and customers as an accelerator. Some of the other existing solution accelerators include:
- Retail & CPG: Demand Forecasting, Customer Lifetime Value, Customer Retention, Customer Segmentation, and Safety Stock Optimization
- Media and Entertainment: Survivorship and Churn, Quality of Streaming Service, Sales Forecasting & Ad Attribution, and Recommendation Engines
- Financial Services: Market Risk, Alternative Data for Investing, Fraud Detection, and ESG Analytics
- Healthcare and Life Sciences: Genomics Sequencing and Image Analytics
Sample notebooks are available for everything from data preparation to analytics. These can be used within Databricks to build and customize solutions for common industry challenges, accelerating speed-to-market and significantly reducing the time it would otherwise take to create solutions from scratch. Databricks is building more solution accelerators on an ongoing basis.
How Databricks Works with Various Cloud Platforms
Databricks runs on all three major cloud platforms, and its focus is on being simple, open, and collaborative. Databricks is deployed in the customer’s cloud account, and all data and compute draw on the customer’s existing consumption investments in that cloud. Consequently, it is not a separate investment from the cloud platform.
In the case of Microsoft Synapse, while Synapse does include open-source Spark, Databricks has optimized Spark to be faster, more performant, and more reliable within Databricks. Consequently, Databricks can be used to cleanse and validate data and write it back to Azure Data Lake Storage (ADLS), and Synapse can then be used as the serving layer for reporting or analytics.
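The cleanse-and-validate step mentioned above can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not Databricks or Synapse code: the field names `id` and `amount`, the validation rule, and the sample rows are hypothetical. The point is the shape of the step, in which raw rows are normalized and then split into a curated set for the serving layer and a rejected set kept for inspection.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def cleanse(record: Record) -> Record:
    """Trim whitespace and normalise empty strings to None."""
    cleaned: Record = {}
    for key, value in record.items():
        text = value.strip() if isinstance(value, str) else None
        cleaned[key] = text or None
    return cleaned


def is_valid(record: Record) -> bool:
    """Hypothetical rule: an id is required and amount must parse as a number."""
    if not record.get("id"):
        return False
    try:
        float(record.get("amount") or "")
    except ValueError:
        return False
    return True


def split_valid(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (curated, rejected) so bad rows are quarantined, not silently dropped."""
    curated: List[Record] = []
    rejected: List[Record] = []
    for raw in records:
        row = cleanse(raw)
        (curated if is_valid(row) else rejected).append(row)
    return curated, rejected


# Hypothetical raw rows as they might land in the data lake.
raw_rows = [
    {"id": " A1 ", "amount": "19.90"},
    {"id": "", "amount": "5"},
    {"id": "A3", "amount": "n/a"},
]
curated, rejected = split_valid(raw_rows)
```

In a real pipeline the curated rows would be written back to the lake for the serving layer to read, while the rejected rows would go to a separate location for review.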
Both Synapse and Databricks have their place in the architecture, and the two organizations work closely to ensure that their offerings work “better together” to provide the customer with a faster, integrated architecture.
Databricks on AWS and Databricks on Google Cloud also allow users to make the most of the combined platforms, leveraging the strengths of each for a more performant system. Databricks’ partnerships with the various cloud providers help customers accelerate Databricks implementations by simplifying data access and combining analytics and AI/ML capabilities to better drive business outcomes.
As a leader in data and analytics, Adastra has expertise in implementing emerging technologies such as Databricks across various industries, including financial services, retail, and energy. As an official Databricks partner, Adastra can offer best-in-class services and implementations backed by Databricks experts.
Adastra offers services to help our customers implement Databricks solutions. These include identifying the right data pipelines and where the data will come from, understanding the kind of models customers want to build and the underlying ML solution to be used, building out the models, and refining the solution. To get our customers to the solution faster, we can also leverage pre-built accelerators developed by Databricks.
We have dedicated Data Engineering and Data Analytics teams that work closely with our Cloud partners to ensure end-to-end implementation and project delivery. Adastra has partnerships with all three major cloud providers, and we offer a full stack of services in the data and analytics domain, ranging from Data Governance to AI and Managed Services.