In the previous blog, we discussed how to securely access Azure Data Services from Azure Databricks using Virtual Network Service Endpoints or Private Link. With those best practices as a baseline, this article walks through detailed steps to harden your Azure Databricks deployment from a network security perspective in order to prevent data exfiltration.
As per Wikipedia: Data exfiltration occurs when malware and/or a malicious actor carries out an unauthorized data transfer from a computer. It is also commonly called data extrusion or data exportation. Data exfiltration is also considered a form of data theft. Since the year 2000, a number of data exfiltration efforts have severely damaged the consumer confidence, corporate valuation, and intellectual property of businesses, as well as the national security of governments across the world. The problem assumes even more significance as enterprises start storing and processing sensitive data (PII, PHI or Strategic Confidential) with public cloud services.
Solving for data exfiltration can become an unmanageable problem if the PaaS service requires you to store your data with them or it processes the data in the service provider’s network. But with Azure Databricks, our customers get to keep all data in their Azure subscription and process it in their own managed private virtual network(s), all while preserving the PaaS nature of the fastest growing Data & AI service on Azure. We’ve come up with a secure deployment architecture for the platform while working with some of our most security-conscious customers, and it’s time that we share it out broadly.
High-level Data Exfiltration Protection Architecture
We recommend a reference architecture based on a hub-and-spoke topology. The hub virtual network houses the shared infrastructure required to connect to validated sources and, optionally, to an on-premises environment. The spoke virtual networks peer with the hub and house isolated Azure Databricks workspaces for different business units or segregated teams.
High-level view of art of the possible:
Following are high-level steps to set up a secure Azure Databricks deployment (see corresponding diagram below):
- Deploy Azure Databricks in a spoke virtual network using VNet injection (azuredatabricks-spoke-vnet in below diagram)
- Set up Private Link endpoints for your Azure Data Services in a separate subnet within the Azure Databricks spoke virtual network (privatelink-subnet in below diagram). This ensures that all workload data is accessed securely over the Azure network backbone with default data exfiltration protection in place (see this for more). In general, it’s also fine to deploy these endpoints in another virtual network that’s peered to the one hosting the Azure Databricks workspace.
- Optionally, set up Azure SQL database as an External Hive Metastore to serve as the primary metastore for all clusters in the workspace. This overrides the default configuration, which uses the consolidated metastore housed in the control plane.
- Deploy Azure Firewall (or other Network Virtual Appliance) in a hub virtual network (shared-infra-hub-vnet in below diagram). With Azure Firewall, you could configure:
• Application rules that define fully qualified domain names (FQDNs) that are accessible through the firewall. Some Azure Databricks required traffic could be whitelisted using the application rules.
• Network rules that define IP address, port and protocol for endpoints that can’t be configured using FQDNs. Some of the required Azure Databricks traffic needs to be whitelisted using the network rules.
Some of our customers prefer to use a third-party firewall appliance instead of Azure Firewall, which generally works fine. Please note, though, that each product has its own nuances, and it’s better to engage the relevant product support and network security teams to troubleshoot any pertinent issues.
• Set up Service Endpoint to Azure Storage for the Azure Firewall subnet, such that all traffic to whitelisted in-region or in-paired-region storage goes over the Azure network backbone (includes endpoints in Azure Databricks control plane if the customer data plane region is a match or paired).
- Create a user-defined route table with the following rules and attach it to Azure Databricks subnets.
Name | Address | Next Hop | Purpose |
to-databricks-control-plane-NAT | Based on the region where you’ve deployed Azure Databricks workspace, select control plane NAT IP from here | Internet | Required to provision Azure Databricks Clusters in your private network |
to-firewall | 0.0.0.0/0 | Azure Firewall Private IP | Default quad-zero route for all other traffic |
- Configure virtual network peering between the Azure Databricks spoke and Azure Firewall hub virtual networks.
Such a hub-and-spoke architecture allows creating multiple spoke VNETs for different purposes and teams. That said, we’ve seen some of our customers implement isolation by creating separate subnets for different teams within a large contiguous virtual network. In such instances, it’s possible to set up multiple isolated Azure Databricks workspaces in their own subnet pairs, and deploy Azure Firewall in a sister subnet within the same virtual network.
We’ll now discuss this setup in more detail.
Secure Azure Databricks Deployment Details
Prerequisites
Please take note of the Azure Databricks control plane endpoints for your workspace from here (map them based on the region of your workspace). We’ll need these details to configure Azure Firewall rules later.
Name | Source | Destination | Protocol:Port | Purpose |
databricks-webapp | Azure Databricks workspace subnets | Region specific Webapp Endpoint | tcp:443 | Communication with Azure Databricks webapp |
databricks-log-blob-storage | Azure Databricks workspace subnets | Region specific Log Blob Storage Endpoint | https:443 | To store Azure Databricks audit and cluster logs (anonymized / masked) for support and troubleshooting |
databricks-artifact-blob-storage | Azure Databricks workspace subnets | Region specific Artifact Blob Storage Endpoint | https:443 | Stores Databricks Runtime images to be deployed on cluster nodes |
databricks-observability-eventhub | Azure Databricks workspace subnets | Region specific Observability Event Hub Endpoint | tcp:9093 | Transit for Azure Databricks on-cluster service specific telemetry |
databricks-dbfs | Azure Databricks workspace subnets | DBFS Blob Storage Endpoint | https:443 | Azure Databricks workspace root storage |
databricks-sql-metastore (OPTIONAL – please see Step 3 for External Hive Metastore below) | Azure Databricks workspace subnets | Region specific SQL Metastore Endpoint | tcp:3306 | Stores metadata for databases and child objects in an Azure Databricks workspace |
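The table above maps onto the two Azure Firewall rule types configured later in Step 4. As a rough sketch, the split can be written down in Python: endpoints listed as https are matched by FQDN (application rules), while those listed as tcp are matched by IP address (network rules). The `Endpoint` class and the helper names are hypothetical and not part of any Azure SDK; the region-specific destinations are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    protocol: str  # "https" or "tcp", as listed in the prerequisites table
    port: int


# Rows from the prerequisites table; destinations are region specific and omitted.
ENDPOINTS = [
    Endpoint("databricks-webapp", "tcp", 443),
    Endpoint("databricks-log-blob-storage", "https", 443),
    Endpoint("databricks-artifact-blob-storage", "https", 443),
    Endpoint("databricks-observability-eventhub", "tcp", 9093),
    Endpoint("databricks-dbfs", "https", 443),
    Endpoint("databricks-sql-metastore", "tcp", 3306),  # optional, see Step 3
]


def rule_type(endpoint):
    """https rows become application rules (FQDN); tcp rows become network rules (IP)."""
    return "application" if endpoint.protocol == "https" else "network"


def group_by_rule_type(endpoints):
    """Return endpoint names grouped by the firewall rule type they need."""
    groups = {"application": [], "network": []}
    for endpoint in endpoints:
        groups[rule_type(endpoint)].append(endpoint.name)
    return groups


if __name__ == "__main__":
    for kind, names in group_by_rule_type(ENDPOINTS).items():
        print(kind, names)
```

Keeping the list in one place like this makes it easier to review which endpoints a workspace is allowed to reach before any firewall rule is created.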
Step 1: Deploy Azure Databricks Workspace in your virtual network
The default deployment of Azure Databricks creates a new virtual network (with two subnets) in a resource group managed by Databricks. To make the customizations needed for a secure deployment, the workspace data plane should instead be deployed in your own virtual network. This quickstart shows how to do that in a few easy steps. Before you start, create a virtual network named azuredatabricks-spoke-vnet with address space 10.2.1.0/24 in resource group adblabs-rg (names and address space are specific to this test setup).
Referring to Azure Databricks deployment documentation:
- From the Azure portal menu, select Create a resource. Then select Analytics > Databricks.
- Under Azure Databricks Service, apply the following settings:
Setting | Suggested value | Description |
Workspace name | adblabs-ws | Select a name for your Azure Databricks workspace. |
Subscription | “Your subscription” | Select the Azure subscription that you want to use. |
Resource group | adblabs-rg | Select the same resource group you used for the virtual network. |
Location | Central US | Choose the same location as your virtual network. |
Pricing Tier | Premium | For more information on pricing tiers, see the Azure Databricks pricing page. |
- Once you’ve finished entering basic settings, select Next: Networking > and apply the following settings:
Setting | Value | Description |
Deploy Azure Databricks workspace in your Virtual Network (VNet) | Yes | This setting allows you to deploy an Azure Databricks workspace in your virtual network. |
Virtual Network | azuredatabricks-spoke-vnet | Select the virtual network you created earlier. |
Public Subnet Name | public-subnet | Use the default public subnet name, though any name works. |
Public Subnet CIDR Range | 10.2.1.64/26 | Use a CIDR range up to and including /26. |
Private Subnet Name | private-subnet | Use the default private subnet name, though any name works. |
Private Subnet CIDR Range | 10.2.1.128/26 | Use a CIDR range up to and including /26. |
Click Review and Create. A few things to note:
- The virtual network must include two subnets dedicated to each Azure Databricks workspace: a private subnet and a public subnet (feel free to use a different nomenclature). The public subnet is the source of a private IP for each cluster node’s host VM. The private subnet is the source of a private IP for the Databricks Runtime container deployed on each cluster node. This means that each cluster node has two private IP addresses today.
- Each workspace subnet size is allowed to be anywhere from /18 to /26, and the actual sizing will be based on forecasting for the overall workloads per workspace. The address space could be arbitrary (including non-RFC 1918 ones), but it must align with the enterprise on-premises plus cloud network strategy.
- Azure Databricks will create these subnets for you when you deploy the workspace using Azure portal and will perform subnet delegation to the Microsoft.Databricks/workspaces service. That allows Azure Databricks to create the required Network Security Group (NSG) rules. Azure Databricks will always give advance notice if we need to add or update the scope of an Azure Databricks-managed NSG rule. Please note that if these subnets already exist, the service will use them as they are.
- There is a one-to-one relationship between a subnet pair and an Azure Databricks workspace. Multiple workspaces cannot share the same subnet pair; each workspace needs a new subnet pair of its own.
- Notice the resource group and managed resource group in the Azure Databricks resource overview page on Azure portal. You cannot create any resources in the managed resource group, nor can you edit any existing ones.
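The subnet constraints above are easy to check before deployment. The following is a minimal sketch using Python's standard `ipaddress` module and the address ranges from this test setup. The function names are hypothetical. The node estimate assumes that Azure reserves five addresses in every subnet and that nothing else consumes addresses in the workspace subnets, so treat it as an upper bound.

```python
import ipaddress

VNET = ipaddress.ip_network("10.2.1.0/24")
SUBNETS = {
    "public-subnet": ipaddress.ip_network("10.2.1.64/26"),
    "private-subnet": ipaddress.ip_network("10.2.1.128/26"),
}
AZURE_RESERVED_IPS = 5  # assumption: addresses Azure reserves in each subnet


def validate_workspace_subnets(vnet, subnets):
    """Return a list of problems; an empty list means the plan looks consistent."""
    problems = []
    for name, net in subnets.items():
        if not net.subnet_of(vnet):
            problems.append(f"{name} {net} is outside {vnet}")
        if not 18 <= net.prefixlen <= 26:
            problems.append(f"{name} {net} must be between /18 and /26")
    items = list(subnets.items())
    for index, (name_a, net_a) in enumerate(items):
        for name_b, net_b in items[index + 1:]:
            if net_a.overlaps(net_b):
                problems.append(f"{name_a} overlaps {name_b}")
    return problems


def max_cluster_nodes(subnets):
    """Each node takes one IP from each subnet, so the smaller subnet sets the limit."""
    return min(net.num_addresses for net in subnets.values()) - AZURE_RESERVED_IPS


if __name__ == "__main__":
    print(validate_workspace_subnets(VNET, SUBNETS))
    print(max_cluster_nodes(SUBNETS))
```

With two /26 subnets, the estimate works out to 59 nodes across all clusters in the workspace, which is worth keeping in mind when sizing subnets for production workloads.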
Step 2: Set up Private Link Endpoints
As discussed in the Securely Accessing Azure Data Services blog, we’ll use Azure Private Link to securely connect the previously created Azure Databricks workspace to your Azure Data Services. We do not recommend setting up access to such data services through a network virtual appliance / firewall, as that has the potential to adversely impact the performance of big data workloads and the intermediate infrastructure.
Please create a subnet privatelink-subnet with address space 10.2.1.0/26 in the virtual network azuredatabricks-spoke-vnet.
For the test setup, we’ll deploy a sample storage account and then create a Private Link endpoint for that. Referring to the setting up private link documentation:
- On the upper-left side of the screen in the Azure portal, select Create a resource > Storage > Storage account.
- In Create storage account – Basics, enter or select this information:
Setting | Value |
PROJECT DETAILS | |
Subscription | Select your subscription. |
Resource group | Select adblabs-rg. You created this in the previous section. |
INSTANCE DETAILS | |
Storage account name | Enter myteststorageaccount. If this name is taken, please provide a unique name. |
Region | Select Central US (or the same region you used for Azure Databricks workspace and virtual network). |
Performance | Leave the default Standard. |
Replication | Select Read-access geo-redundant storage (RA-GRS). |
Select Next:Networking >
- In Create a storage account – Networking, connectivity method, select Private Endpoint.
- In Create a storage account – Networking, select Add Private Endpoint.
- In Create Private Endpoint, enter or select this information:
Setting | Value |
PROJECT DETAILS | |
Subscription | Select your subscription. |
Resource group | Select adblabs-rg. You created this in the previous section. |
Location | Select Central US (or the same region you used for Azure Databricks workspace and virtual network). |
Name | Enter myStoragePrivateEndpoint. |
Storage sub-resource | Select dfs. |
NETWORKING | |
Virtual network | Select azuredatabricks-spoke-vnet from resource group adblabs-rg. |
Subnet | Select privatelink-subnet. |
PRIVATE DNS INTEGRATION | |
Integrate with private DNS zone | Leave the default Yes. |
Private DNS zone | Leave the default (New) privatelink.dfs.core.windows.net. |
Select OK.
- Select Review + create. You’re taken to the Review + create page where Azure validates your configuration.
- When you see the Validation passed message, select Create.
- Browse to the storage account resource that you just created.
It’s possible to create more than one Private Link endpoint for supported Azure Data Services. To configure such endpoints for additional services, please refer to the relevant Azure documentation.
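One way to confirm that the private DNS integration is working is to check, from a notebook on a cluster in the workspace, that the storage account's dfs hostname resolves to an address inside privatelink-subnet. The sketch below is illustrative only: `resolves_into_subnet` is a hypothetical helper, and the resolver is injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket

# privatelink-subnet from this test setup
PRIVATELINK_SUBNET = ipaddress.ip_network("10.2.1.0/26")


def resolves_into_subnet(hostname, subnet=PRIVATELINK_SUBNET,
                         resolver=socket.gethostbyname):
    """Return True when hostname resolves to an address inside subnet."""
    address = ipaddress.ip_address(resolver(hostname))
    return address in subnet


# Example call from a notebook attached to a cluster in the workspace:
#   resolves_into_subnet("myteststorageaccount.dfs.core.windows.net")
# A False result suggests the name is still resolving to a public address.
```

Running the same check from a machine outside the virtual network would be expected to return False, since the private DNS zone is only linked to the virtual network.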
Step 3: Set up External Hive Metastore
Provision Azure SQL database
This step is optional. By default, the consolidated regional metastore is used for the Azure Databricks workspace. Please skip to the next step if you would like to avoid managing an Azure SQL database for this end-to-end deployment.
Following the documentation for provisioning an Azure SQL database, please provision an Azure SQL database that we will use as an external Hive metastore for the Azure Databricks workspace.
- On the upper-left side of the screen in the Azure portal, select Create a resource > Databases > SQL database.
- In Create SQL database – Basics, enter or select this information:
Setting | Value |
DATABASE DETAILS | |
Subscription | Select your subscription. |
Resource group | Select adblabs-rg. You created this in the previous section. |
INSTANCE DETAILS | |
Database name | Enter myhivedatabase. If this name is taken, please provide a unique name. |
- In Server, select Create new.
- In New server, enter or select this information:
Setting | Value |
Server name | Enter mysqlserver. If this name is taken, please provide a unique name. |
Server admin login | Enter an administrator name of your choice. |
Password | Enter a password of your choice. The password must be at least 8 characters long and meet the defined requirements. |
Location | Select Central US (or the same region you used for Azure Databricks workspace and virtual network). |
Select OK.
- Select Review + create. You’re taken to the Review + create page where Azure validates your configuration.
- When you see the Validation passed message, select Create.
Create a Private Link endpoint
In this section, you will add a Private Link endpoint for the Azure SQL database created above, following this source:
- On the upper-left side of the screen in the Azure portal, select Create a resource > Networking > Private Link Center.
- In Private Link Center – Overview, on the option to Build a private connection to a service, select Start.
- In Create a private endpoint – Basics, enter or select this information:
Setting | Value |
PROJECT DETAILS | |
Subscription | Select your subscription. |
Resource group | Select adblabs-rg. You created this in the previous section. |
INSTANCE DETAILS | |
Name | Enter mySqlDBPrivateEndpoint. If this name is taken, please provide a unique name. |
Region | Select Central US (or the same region you used for Azure Databricks workspace and virtual network). |
Select Next: Resource
In Create a private endpoint – Resource, enter or select this information:
Setting | Value |
Connection method | Select connect to an Azure resource in my directory. |
Subscription | Select your subscription. |
Resource type | Select Microsoft.Sql/servers. |
Resource | Select mysqlserver |
Target sub-resource | Select sqlServer |
Select Next: Configuration
In Create a private endpoint – Configuration, enter or select this information:
Setting | Value |
NETWORKING | |
Virtual network | Select azuredatabricks-spoke-vnet |
Subnet | Select privatelink-subnet |
PRIVATE DNS INTEGRATION | |
Integrate with private DNS zone | Select Yes. |
Private DNS Zone | Select (New) privatelink.database.windows.net |
- Select Review + create. You’re taken to the Review + create page where Azure validates your configuration.
- When you see the Validation passed message, select Create.
Configure External Hive Metastore
- From Azure Portal, search for the adblabs-rg resource group
- Go to Azure Databricks workspace resource
- Click Launch Workspace
- Please follow the instructions documented here to configure the Azure SQL database created above as an external hive metastore for the Azure Databricks workspace.
Step 4: Deploy Azure Firewall
We recommend Azure Firewall as a scalable cloud firewall to act as the filtering device for Azure Databricks control plane traffic, DBFS Storage, and any allowed public endpoints to be accessible from your Azure Databricks workspace.
Referring to the documentation for configuring an Azure Firewall, you could deploy Azure Firewall into a new virtual network. Please create the virtual network named hub-vnet with address space 10.3.1.0/24 in resource group adblabs-rg (names and address space are specific to this test setup). Also create a subnet named AzureFirewallSubnet with address space 10.3.1.0/26 in hub-vnet.
- On the Azure portal menu or from the Home page, select Create a resource.
- Type firewall in the search box and press Enter.
- Select Firewall and then select Create.
- On the Create a Firewall page, use the following table to configure the firewall:
Setting | Value |
Subscription | “your subscription” |
Resource group | adblabs-rg |
Name | firewall |
Location | Select Central US (or the same region you used for Azure Databricks workspace and virtual network). |
Choose a virtual network | Use existing: hub-vnet |
Public IP address | Add new. The Public IP address must be the Standard SKU type. Name it fw-public-ip |
- Select Review + create.
- Review the summary, and then select Create to deploy the firewall.
- This will take a few minutes.
- After the deployment completes, go to the adblabs-rg resource group, and select the firewall
- Note the private IP address. You’ll use it later when you create the custom default route from Azure Databricks subnets.
Configure Azure Firewall Rules
With Azure Firewall, you can configure:
- Application rules that define fully qualified domain names (FQDNs) that can be accessed from a subnet.
- Network rules that define source address, protocol, destination port, and destination address.
- Network traffic is subjected to the configured firewall rules when you route your network traffic to the firewall as the subnet default gateway.
Configure Application Rule
We first need to configure application rules to allow outbound access to Log Blob Storage and Artifact Blob Storage endpoints in the Azure Databricks control plane plus the DBFS Root Blob Storage for the workspace.
- Go to the resource group adblabs-rg, and select the firewall.
- On the firewall page, under Settings, select Rules.
- Select the Application rule collection tab.
- Select Add application rule collection.
- For Name, type databricks-control-plane-services.
- For Priority, type 200.
- For Action, select Allow.
- Configure the following in Rules -> Target FQDNs
Name | Source type | Source | Protocol:Port | Target FQDNs |
databricks-spark-log-blob-storage | IP Address | Azure Databricks workspace subnets 10.2.1.128/26,10.2.1.64/26 | https:443 | Refer to the notes from Prerequisites above (for Central US) |
databricks-audit-log-blob-storage | IP Address | Azure Databricks workspace subnets 10.2.1.128/26,10.2.1.64/26 | https:443 | Refer to the notes from Prerequisites above (for Central US). This is separate log storage only for US regions today |
databricks-artifact-blob-storage | IP Address | Azure Databricks workspace subnets 10.2.1.128/26,10.2.1.64/26 | https:443 | Refer to the notes from Prerequisites above (for Central US) |
databricks-dbfs | IP Address | Azure Databricks workspace subnets 10.2.1.128/26,10.2.1.64/26 | https:443 | Refer to the notes from Prerequisites above |
Public Repositories for Python and R Libraries (OPTIONAL – if workspace users are allowed to install libraries from public repos) | IP Address | 10.2.1.128/26,10.2.1.64/26 | https:443 | *pypi.org,*pythonhosted.org,cran.r-project.org. Add any other public repos as desired |
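To reason about what the application rule collection above permits, it can help to model it locally. The sketch below is a simplified approximation, not Azure Firewall's actual matching logic: it uses glob-style matching from Python's `fnmatch` module for the wildcard FQDNs and only covers the optional public repository rule, since the other target FQDNs are region specific. The `is_allowed` helper is hypothetical.

```python
import ipaddress
from fnmatch import fnmatchcase

# Azure Databricks workspace subnets from this test setup
WORKSPACE_SUBNETS = [
    ipaddress.ip_network("10.2.1.128/26"),
    ipaddress.ip_network("10.2.1.64/26"),
]

# Target FQDNs of the optional public repository rule
TARGET_FQDNS = ["*pypi.org", "*pythonhosted.org", "cran.r-project.org"]


def is_allowed(source_ip, fqdn, sources=WORKSPACE_SUBNETS, targets=TARGET_FQDNS):
    """Approximate an allow decision: source in a workspace subnet and FQDN matches."""
    ip = ipaddress.ip_address(source_ip)
    from_workspace = any(ip in net for net in sources)
    host = fqdn.lower().rstrip(".")
    return from_workspace and any(fnmatchcase(host, pattern) for pattern in targets)


if __name__ == "__main__":
    print(is_allowed("10.2.1.70", "files.pythonhosted.org"))
    print(is_allowed("10.2.1.70", "example.com"))
```

Note that under glob semantics a pattern such as `*pypi.org` also matches any hostname that merely ends in `pypi.org`, which is one reason to keep wildcard entries as narrow as possible.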
Configure Network Rule
Some endpoints can’t be configured as application rules using FQDNs. So we’ll set those up as network rules, namely the Observability Event Hub and Webapp.
- Open the resource group adblabs-rg, and select the firewall.
- On the firewall page, under Settings, select Rules.
- Select the Network rule collection tab.
- Select Add network rule collection.
- For Name, type databricks-control-plane-services.
- For Priority, type 200.
- For Action, select Allow.
- Configure the following in Rules -> IP Addresses.
Name | Protocol | Source type | Source | Destination type | Destination Address | Destination Ports |
databricks-webapp | TCP | IP Address | Azure Databricks workspace subnets 10.2.1.128/26,10.2.1.64/26 | IP Address | Refer to the notes from Prerequisites above (for Central US) | 443 |
databricks-observability-eventhub | TCP | IP Address | Azure Databricks workspace subnets 10.2.1.128/26,10.2.1.64/26 | IP Address | Refer to the notes from Prerequisites above (for Central US) | 9093 |
databricks-sql-metastore (OPTIONAL – please see Step 3 for External Hive Metastore above) | TCP | IP Address | Azure Databricks workspace subnets 10.2.1.128/26,10.2.1.64/26 | IP Address | Refer to the notes from Prerequisites above (for Central US) | 3306 |
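Network rules match on protocol, source, destination address and destination port, and traffic that matches no rule is denied. The sketch below models that behavior for the two required rules. The destination addresses are placeholders from a documentation-only range, not real Azure Databricks endpoints, so substitute the region-specific values from the Prerequisites section. `NetworkRule` and `matching_rule` are hypothetical names.

```python
import ipaddress
from typing import NamedTuple


class NetworkRule(NamedTuple):
    name: str
    protocol: str
    sources: tuple      # source CIDR ranges
    destination: str    # destination CIDR range
    port: int


SOURCES = ("10.2.1.128/26", "10.2.1.64/26")  # workspace subnets

RULES = [
    # Placeholder destinations (203.0.113.0/24 is reserved for documentation).
    NetworkRule("databricks-webapp", "TCP", SOURCES, "203.0.113.10/32", 443),
    NetworkRule("databricks-observability-eventhub", "TCP", SOURCES,
                "203.0.113.20/32", 9093),
]


def matching_rule(protocol, source_ip, dest_ip, port, rules=RULES):
    """Return the name of the first rule that allows the flow, or None if denied."""
    src = ipaddress.ip_address(source_ip)
    dst = ipaddress.ip_address(dest_ip)
    for rule in rules:
        if (rule.protocol == protocol.upper()
                and rule.port == port
                and any(src in ipaddress.ip_network(cidr) for cidr in rule.sources)
                and dst in ipaddress.ip_network(rule.destination)):
            return rule.name
    return None  # no rule matched: the flow is denied


if __name__ == "__main__":
    print(matching_rule("tcp", "10.2.1.70", "203.0.113.10", 443))
    print(matching_rule("tcp", "10.2.1.70", "203.0.113.10", 22))
```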
Configure Virtual Network Service Endpoints
- On the hub-vnet page, click Service endpoints and then Add
- From Services select “Microsoft.Storage”
- In Subnets, select AzureFirewallSubnet
The service endpoint allows traffic from AzureFirewallSubnet to Log Blob Storage, Artifact Blob Storage, and DBFS Storage to go over the Azure network backbone, thus eliminating exposure to public networks.
If users are going to access Azure Storage using Service Principals, then we recommend creating an additional service endpoint from Azure Databricks workspace subnets to Microsoft.AzureActiveDirectory.
Step 5: Create User Defined Routes (UDRs)
At this point, the majority of the infrastructure setup for a secure, locked-down deployment has been completed. We now need to route the appropriate traffic from the Azure Databricks workspace subnets to the Control Plane NAT IP (see FAQ below) and to the Azure Firewall set up earlier.
Referring to the documentation for user defined routes:
- On the Azure portal menu, select All services and search for Route Tables. Go to that section.
- Select Add
- For Name, type firewall-route-table.
- For Subscription, select your subscription.
- For the Resource group, select adblabs-rg.
- For Location, select the same location that you used previously i.e. Central US
- Select Create.
- Select Refresh, and then select the firewall-route-table route table.
- Select Routes and then select Add.
- For Route name, add to-firewall.
- For Address prefix, add 0.0.0.0/0.
- For Next hop type, select Virtual appliance.
- For the Next hop address, add the Private IP address for the Azure Firewall that you noted earlier.
- Select OK.
Now add one more route for Azure Databricks Control Plane NAT.
- Select Routes and then select Add.
- For Route name, add to-central-us-databricks-control-plane.
- For Address prefix, add the Control Plane NAT IP address for Central US from here.
- For Next hop type, select Internet (see the FAQ below for the reason).
- Select OK.
The route table needs to be associated with both of the Azure Databricks workspace subnets.
- Go to the firewall-route-table.
- Select Subnets and then select Associate.
- Select Virtual network > azuredatabricks-spoke-vnet.
- For Subnet, select both workspace subnets.
- Select OK.
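Azure picks the route whose address prefix is the longest match for the destination, which is why the specific control plane NAT route takes precedence over the 0.0.0.0/0 default. The sketch below models only the two user-defined routes created in this step and ignores system routes such as those for the virtual network itself and its peerings. The NAT address is a placeholder from a documentation-only range, and the firewall private IP is an assumed value, so use the ones you noted for your deployment.

```python
import ipaddress

CONTROL_PLANE_NAT = "203.0.113.50/32"  # placeholder, not a real control plane NAT IP
FIREWALL_PRIVATE_IP = "10.3.1.4"       # assumption: the private IP noted in Step 4

# (route name, address prefix, next hop)
ROUTES = [
    ("to-firewall", ipaddress.ip_network("0.0.0.0/0"),
     f"VirtualAppliance {FIREWALL_PRIVATE_IP}"),
    ("to-central-us-databricks-control-plane",
     ipaddress.ip_network(CONTROL_PLANE_NAT), "Internet"),
]


def select_route(dest_ip, routes=ROUTES):
    """Return the route with the longest prefix that contains dest_ip."""
    dest = ipaddress.ip_address(dest_ip)
    candidates = [route for route in routes if dest in route[1]]
    return max(candidates, key=lambda route: route[1].prefixlen)


if __name__ == "__main__":
    for destination in ("203.0.113.50", "198.51.100.7"):
        name, prefix, next_hop = select_route(destination)
        print(destination, "->", name, next_hop)
```

Traffic to the control plane NAT address therefore bypasses the firewall, while every other outbound destination falls through to the default route and is filtered.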
Step 6: Configure VNET Peering
We are now at the last step. The virtual networks azuredatabricks-spoke-vnet and hub-vnet need to be peered so that the route table configured earlier works properly.
Referring to the documentation for configuring VNET peering:
- In the search box at the top of the Azure portal, enter virtual networks. When Virtual networks appears in the search results, select it.
- Go to hub-vnet.
- Under Settings, select Peerings.
- Select Add, and enter or select values as follows:
Name | Value |
Name of the peering from hub-vnet to remote virtual network | from-hub-vnet-to-databricks-spoke-vnet |
Virtual network deployment model | Resource Manager |
Subscription | Select your subscription |
Virtual Network | azuredatabricks-spoke-vnet or select the VNET where Azure Databricks is deployed |
Name of the peering from remote virtual network to hub-vnet | from-databricks-spoke-vnet-to-hub-vnet |
- Leave the rest of the default values as is and click OK.
The setup is now complete.
Step 7: Validate Deployment
It’s time to put everything to test now:
- Go to the Azure Databricks workspace adblabs-ws that you created in Step 1, launch it, and create a cluster.
- Create a notebook and attach it to the cluster.
- Try and access the storage account myteststorageaccount that you created in Step 2 earlier.
If the data access worked without any issues, that means you’ve accomplished the optimum secure deployment for Azure Databricks in your subscription. This was quite a bit of manual work, but that was more for a one-time showcase. In practical terms, you would want to automate such a setup using a combination of ARM Templates, Azure CLI, Azure SDK etc.:
- Deploy Azure Databricks in your own managed VNET using ARM Template
- Create Private Endpoint using Azure CLI (or ARM Template)
- Deploy Azure SQL as External Metastore using ARM Template
- Deploy Azure Firewall using ARM Template (or Azure CLI)
- Deploy Route Table and Custom Routes using ARM Template
- Peer Virtual Networks using ARM Template
Common Questions with Data Exfiltration Protection Architecture
Can I use service endpoint policies to secure data egress to Azure Data Services?
Service Endpoint Policies allow you to filter virtual network traffic to only specific Azure Data Service instances over Service Endpoints. Endpoint policies cannot be applied to Azure Databricks workspace subnets or other such managed Azure services that have resources in a management or control plane subscription. Hence this feature cannot be used here.
Can I use Network Virtual Appliance (NVA) other than Azure Firewall?
Yes, you could use a third-party NVA as long as network traffic rules are configured as discussed in this article. Please note that we have tested this setup with Azure Firewall only, though some of our customers use other third-party appliances. Ideally, deploy the appliance in the cloud rather than on-premises.
Can I have a firewall subnet in the same virtual network as Azure Databricks?
Yes, you can. As per the Azure reference architecture, it is advisable to use a hub-spoke virtual network topology to plan better for the future. Should you choose to create the Azure Firewall subnet in the same virtual network as the Azure Databricks workspace subnets, you wouldn’t need to configure virtual network peering as discussed in Step 6 above.
Can I filter Azure Databricks control plane NAT traffic through Azure Firewall?
No. To bootstrap Azure Databricks clusters, the control plane initiates the communication to the virtual machines in your subscription. If the control plane NAT traffic is routed through the firewall, the acknowledgement for the incoming TCP message is sent via that route rather than back along the path on which the request arrived. This creates asymmetric routing, and the cluster bootstrap fails. The control plane NAT traffic therefore needs to be routed directly over the public network, as discussed in Step 5 above.
Can I analyze accepted or blocked traffic by Azure Firewall?
We recommend using Azure Firewall Logs and Metrics for that requirement.
Getting Started with Data Exfiltration Protection with Azure Databricks
We discussed utilizing cloud-native security controls to implement data exfiltration protection for your Azure Databricks deployments, all of which can be automated to enable data teams at scale. Some other things that you may want to consider and implement as part of this project:
- Enable meta controls to unlock true potential of your data lake
- Manage access to notebook features
- Access ADLS using Credential Passthrough
- Audit everything with Diagnostic Logs, Storage Access Logs and NSG Flow Logs (requires VNET Injection).
Please reach out to your Microsoft or Databricks account team for any questions.
The post Data Exfiltration Protection with Azure Databricks appeared first on Databricks.