
Understanding Azure Databricks: Definition | Runtime | Security

Azure Databricks is a fast, secure, and collaborative Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform.

Databricks is integrated with Microsoft Azure to streamline workflows, providing one-click setup and a shared workspace for collaboration among data engineers, business analysts, and data scientists.

In a typical big data pipeline, raw and structured data is ingested into Azure in batches through Azure Data Factory. The data then lands in a data lake for persistent storage, in either Azure Data Lake Storage or Azure Blob Storage.

You can use Databricks to read data from multiple sources, such as Azure Data Lake Storage, Azure SQL Data Warehouse, Azure Blob Storage, and Azure Cosmos DB. Once you have collected data from these sources, you can use Spark to turn it into useful insights.
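As a rough illustration of reading from one of these sources, the sketch below builds an `abfss://` URI for Azure Data Lake Storage Gen2 and hands it to Spark. The container, storage account, and path names are hypothetical, and `spark` is assumed to be the SparkSession that a Databricks notebook provides.

```python
def adls_uri(container: str, account: str, path: str) -> str:
    """Build an abfss:// URI for a path in Azure Data Lake Storage Gen2."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def load_parquet(spark, uri: str):
    """Read a Parquet dataset into a Spark DataFrame.

    `spark` is assumed to be an existing SparkSession, such as the one
    predefined in a Databricks notebook.
    """
    return spark.read.parquet(uri)


# Hypothetical names, for illustration only.
events_uri = adls_uri("raw", "examplestorage", "/events/2021/")
print(events_uri)
```

In a notebook, `load_parquet(spark, events_uri).show(5)` would then preview the data, assuming the cluster has already been granted access to the storage account.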

The integrated and collaborative environment of Azure Databricks streamlines different processes that include data exploration, running data-driven applications, and prototyping in Spark. It allows:

Visualizing data quickly in a few clicks.
Using familiar tools such as d3, ggplot, and Matplotlib.
Using dynamic dashboards for creating interactive reports.
Documenting progress in notebooks in Python, SQL, Scala, or R.
Using Spark and interacting with the data at the same time.
Determining how to use the data through quick, thorough exploration.

Apache Spark in Azure Databricks

By providing a zero-management Cloud platform, Databricks builds on the capabilities of Apache Spark. It includes:

Fully managed Spark clusters
A unified platform that powers Spark-based applications
A collaborative workspace for visualization and exploration

Apache Spark Clusters in Cloud

Azure Databricks offers a secure and reliable production environment in the cloud. The cloud architecture is supported and managed by experts. It allows for:

Creating clusters in seconds.
Using clusters programmatically and efficiently by using REST APIs.
Getting instant access to the latest Spark features with each new release.
Autoscaling clusters up and down dynamically, including serverless clusters that can be shared across teams.
Using secure data integration capabilities built on top of Spark, which let users unify their data without centralizing it.
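The REST API and autoscaling points above can be sketched as a single request to the Clusters API. This is only an illustration: the workspace URL, token, cluster name, runtime version, and node type are placeholders, and the request is built but never sent.

```python
import json
import urllib.request


def build_create_cluster_request(workspace_url: str, token: str,
                                 min_workers: int, max_workers: int) -> urllib.request.Request:
    """Prepare (but do not send) a Clusters API call for an autoscaling cluster."""
    # Sanity check used by this sketch only.
    if not 0 < min_workers <= max_workers:
        raise ValueError("expected 0 < min_workers <= max_workers")
    payload = {
        "cluster_name": "example-autoscaling",  # hypothetical name
        "spark_version": "<runtime-version>",   # placeholder, pick a real runtime
        "node_type_id": "<node-type>",          # placeholder, pick a real VM size
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": 30,          # shut down after 30 idle minutes
    }
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/clusters/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical workspace host and token.
request = build_create_cluster_request(
    "https://example.azuredatabricks.net", "<token>", 2, 8)
print(request.full_url)
print(json.loads(request.data)["autoscale"])
```

Passing the prepared request to `urllib.request.urlopen` would submit it. Keeping construction separate from sending makes the payload easy to inspect first.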

Databricks Runtime

Databricks Runtime is built on top of Spark and runs natively on the Azure cloud. It is a data processing engine credited with performance gains of up to 50x. Databricks runs on auto-scaling infrastructure, giving teams an easy self-service environment that needs no DevOps work and does not compromise security. It also offers the complete administrative control that IT teams require for production.

With the Serverless option, it abstracts away the complexity of the infrastructure and removes the need for specialized expertise to set up and configure data infrastructure. Serverless lets data engineers and data scientists iterate quickly and efficiently as a team. It also lets them build pipelines, schedule jobs, and train models more efficiently than before.

Azure Databricks delivers a faster, higher-performing Spark engine through optimizations at the I/O layer (also known as Databricks I/O) and the processing layer. These optimizations help data engineers run production tasks efficiently.

Benefits of Databricks Runtime

Performance

Databricks Runtime is highly optimized for performance by the developers of Apache Spark. The resulting acceleration enables pipeline and data processing use cases that were not previously possible, and it improves the productivity of data teams.

Simplicity

Databricks brings along a comprehensive suite of integrated services for management and automation. These services give significant administrative control to data teams while enabling them to build and manage pipelines easily.

Cost-effective

Databricks Runtime leverages auto-scaling compute and storage to manage and control infrastructure costs. Clusters start and terminate intelligently, and the high performance-to-cost ratio reduces infrastructure spend.

How Does Databricks Work?

Databricks Runtime implements the open Apache Spark APIs with a highly optimized execution engine, which is designed to deliver significant performance gains over standard open-source Spark running on other cloud Spark platforms. The core engine is wrapped with additional services for developer productivity and enterprise governance.

Databricks Security

Azure Databricks provides an additional layer of security for your business's data and network infrastructure.

Access Control

Databricks lets you use access control lists (ACLs) to configure permissions on data tables, clusters, pools, jobs, and workspace objects such as notebooks, folders, models, and experiments. Admin users have full control over access control lists. Users who have been delegated permission to manage access control lists can manage them as well.
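As one way to picture an ACL, the sketch below assembles the path and JSON body for a Permissions API call that grants cluster permissions to users. The cluster ID and user names are hypothetical, and nothing is sent over the network.

```python
import json

# Cluster permission levels, from least to most privileged.
CLUSTER_PERMISSION_LEVELS = ("CAN_ATTACH_TO", "CAN_RESTART", "CAN_MANAGE")


def cluster_acl_patch(cluster_id: str, grants: dict) -> tuple:
    """Return the (path, JSON body) for a Permissions API PATCH on one cluster.

    `grants` maps a user name to the permission level to grant.
    """
    for user, level in grants.items():
        if level not in CLUSTER_PERMISSION_LEVELS:
            raise ValueError(f"unknown permission level for {user}: {level}")
    body = {
        "access_control_list": [
            {"user_name": user, "permission_level": level}
            for user, level in grants.items()
        ]
    }
    return f"/api/2.0/permissions/clusters/{cluster_id}", json.dumps(body)


# Hypothetical cluster ID and users.
path, body = cluster_acl_patch(
    "0000-000000-example",
    {"analyst@example.com": "CAN_ATTACH_TO", "engineer@example.com": "CAN_RESTART"},
)
print(path)
print(body)
```

Validating the permission level before building the body catches typos locally instead of relying on an error response from the service.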

Secret Management

Accessing data sometimes requires authenticating to external data sources via JDBC. Instead of entering login credentials directly into a notebook, you can store them as Databricks secrets and reference them in notebooks and jobs. To manage secrets, use the Databricks CLI, which calls the Secrets API.
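A minimal sketch of the pattern follows, assuming a secret scope named `jdbc` that holds `username` and `password` keys (both names are hypothetical). In a notebook, `dbutils` is provided for you; a small stand-in is defined here only so the example runs outside Databricks.

```python
from types import SimpleNamespace


def jdbc_options(dbutils, scope: str, jdbc_url: str, table: str) -> dict:
    """Assemble Spark JDBC reader options with credentials read from a secret scope."""
    return {
        "url": jdbc_url,
        "dbtable": table,
        "user": dbutils.secrets.get(scope=scope, key="username"),
        "password": dbutils.secrets.get(scope=scope, key="password"),
    }


# Stand-in for the notebook-provided `dbutils`, with made-up values.
_fake_store = {("jdbc", "username"): "svc_reader",
               ("jdbc", "password"): "not-a-real-password"}
fake_dbutils = SimpleNamespace(
    secrets=SimpleNamespace(get=lambda scope, key: _fake_store[(scope, key)])
)

options = jdbc_options(
    fake_dbutils, "jdbc",
    "jdbc:sqlserver://<server>:1433;database=<db>",  # placeholder URL
    "dbo.orders",                                     # hypothetical table
)
print(sorted(options))
```

In a real notebook, the options would be passed on as `spark.read.format("jdbc").options(**options).load()`, so the credentials never appear in the notebook source.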

Secure Cluster Connectivity

With secure cluster connectivity enabled, Databricks Runtime cluster nodes have no public IP addresses, and the customer virtual network (VNet) has no open ports.

Encrypt Traffic Between Cluster Worker Nodes

User queries and transformations are typically sent to clusters over an encrypted channel. By default, however, the data exchanged between worker nodes is not encrypted. To encrypt it, you set Spark configuration parameters through an init script. This feature is available in the Enterprise Plan.
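To give a sense of what such Spark configuration parameters look like, the sketch below renders a few standard Spark security properties as configuration lines. It is deliberately partial: a working setup would likely also need a shared authentication secret and TLS keystore settings, which are omitted here.

```python
# Standard Spark properties for authenticating and encrypting traffic
# between nodes. This is an illustrative subset, not a complete setup.
ENCRYPTION_CONF = {
    "spark.authenticate": "true",
    "spark.authenticate.enableSaslEncryption": "true",
    "spark.network.crypto.enabled": "true",
    "spark.io.encryption.enabled": "true",
}


def render_spark_conf(conf: dict) -> str:
    """Render properties as 'key value' lines, the spark-defaults.conf format."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


conf_text = render_spark_conf(ENCRYPTION_CONF)
print(conf_text)
```

An init script would typically write lines like these into the Spark configuration on each node before the cluster starts.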

IP Access Lists

Security-centric businesses that use cloud SaaS platforms need to restrict how employees reach them. Accessing a cloud service from an unsecured network can put business data at risk, particularly when the user has rights to business-critical data. Enterprise network perimeters apply security policies to limit access to external services, and IP access lists let workspace admins allow access only from approved IP addresses.
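An IP access list expresses such a policy as a set of allowed addresses or CIDR ranges. The sketch below builds a JSON body of the kind the IP Access Lists API accepts and adds a local check of whether a client address falls inside the list. The label and the ranges are hypothetical (the ranges are reserved for documentation).

```python
import ipaddress
import json


def ip_access_list_body(label: str, cidrs: list) -> str:
    """JSON body for creating an ALLOW list via the IP Access Lists API."""
    for cidr in cidrs:
        ipaddress.ip_network(cidr)  # raises ValueError on a malformed entry
    return json.dumps({"label": label, "list_type": "ALLOW", "ip_addresses": cidrs})


def is_allowed(client_ip: str, cidrs: list) -> bool:
    """Local check of the allow-list idea: is the IP inside any listed range?"""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)


office_ranges = ["192.0.2.0/24", "198.51.100.17/32"]  # documentation-only ranges
print(ip_access_list_body("office", office_ranges))
print(is_allowed("192.0.2.44", office_ranges))
print(is_allowed("203.0.113.9", office_ranges))
```

The local check is only for reasoning about a list before submitting it. Enforcement itself happens on the service side.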

Configure Domain Name Firewall Rules

If your corporate firewall blocks traffic based on domain names, you must allow HTTPS and WebSocket traffic to the Azure Databricks domain names to ensure access to resources.

Credential Passthrough

Credential passthrough lets users authenticate automatically from Databricks clusters to Azure Data Lake Storage, using the same Azure Active Directory identity they use to log in to Azure Databricks.

Credential Redaction

Databricks redacts keys and credentials in audit logs and Apache Spark logs to protect data from unauthorized access and information leakage. Three types of credentials are redacted at logging time: AWS access keys, AWS secret access keys, and credentials in URIs.

CCPA & GDPR Compliance

Use Delta Lake on Azure Databricks for managing GDPR (General Data Protection Regulation) and CCPA (California Consumer Privacy Act) compliance for the data lake.
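One common compliance task is erasing a person's records on request. The sketch below generates the two Delta Lake SQL statements usually involved: `DELETE` removes the rows from the current table version, and `VACUUM` later removes the old data files that still contain them. The table and column names are hypothetical, and the 168-hour retention matches Delta Lake's default of seven days.

```python
import re

# Accept only plain (optionally dotted) identifiers, to keep the SQL safe.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_.]*$")


def erasure_statements(table: str, key_column: str, subject_id: int,
                       retain_hours: int = 168) -> list:
    """SQL statements for an erasure request against a Delta table."""
    if not (_IDENTIFIER.match(table) and _IDENTIFIER.match(key_column)):
        raise ValueError("table and column must be plain identifiers")
    return [
        f"DELETE FROM {table} WHERE {key_column} = {int(subject_id)}",
        f"VACUUM {table} RETAIN {int(retain_hours)} HOURS",
    ]


# Hypothetical table, column, and customer ID.
statements = erasure_statements("crm.customers", "customer_id", 4021)
for statement in statements:
    print(statement)
```

In a notebook, each statement could be run with `spark.sql(statement)`. Until the vacuum runs, the deleted rows remain reachable through older table versions.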