
Developing Shiny Applications in Databricks


Join our live webinar hosted by Data Science Central on March 12 to learn more

We are excited to announce that you can now develop and test Shiny applications in Databricks! Inside the RStudio Server hosted on Databricks clusters, you can now import the Shiny package and interactively develop Shiny applications. Once completed, you can publish the Shiny application to an external hosting service, while continuing to leverage Databricks to access data securely and at scale.

What is Shiny?

Shiny is an open-source R package for developing interactive R applications or dashboards. With Shiny, data scientists can easily create interactive web apps and user interfaces using R, even if they don’t have any web development experience. During development, data scientists can also build Shiny apps to visualize data and gain insights.

Shiny is a popular framework among R developers because it allows them to showcase their work to a wider audience, and have direct business impact. Making Shiny available in Databricks has been a highly requested feature among our customers.

Databricks for R Developers

As the Unified Data Analytics Platform, Databricks brings together a variety of data users onto a single platform: data engineers, data scientists, data analysts, BI users, and more. The unification drives cross-functional collaboration, improves efficient use of resources, and ultimately leads enterprises to extract value from their datasets.

R as a programming language has an established track record of helping data scientists gain insights from data, and thus is well adopted by enterprise data science teams across many industries. R also boasts a strong open-source ecosystem with many popular packages readily available, including Shiny.

Serving R users is a critical piece to Databricks’s vision of being the Unified Data Analytics Platform. Since the start, Databricks has been introducing new features to serve the diverse needs of R developers:

  • Databricks notebooks have supported R since 2015. Users can use R interchangeably with Python, Scala, and SQL in Databricks notebooks (Blog)
  • We introduced support for using sparklyr in Databricks Notebooks in addition to SparkR. Users can use either sparklyr or SparkR in Databricks (Blog)
  • We enabled hosted RStudio Server inside Databricks so that users can work in the most popular R IDE while taking advantage of many Databricks features (Blog)

Adding Shiny to Databricks Ecosystem

We have now made it possible to use the Shiny package in the hosted RStudio Server inside Databricks. If RStudio is hosted on a cluster that runs Databricks Runtime (or Databricks Runtime ML) 6.2 or later, the Shiny package is pre-installed – you can simply import “shiny” in RStudio. If you are using an earlier Runtime version, you can use the Databricks Library UI [AWS | Azure] to install the Shiny package from CRAN on the cluster before importing it in RStudio.
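As a minimal sketch of that check in RStudio on Databricks (the explicit install is only needed on older runtimes or when the package was not attached via the Library UI):

```r
# On Databricks Runtime 6.2+ the package ships with the runtime; on older
# runtimes install it from CRAN (or attach it through the Library UI).
if (!requireNamespace("shiny", quietly = TRUE)) {
  install.packages("shiny")
}
library(shiny)
```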


Figure 1. Importing Shiny package in RStudio in Databricks

Apache Spark™ is optimally configured for all Databricks clusters. Based on your preference, you can seamlessly use either SparkR or sparklyr while developing a Shiny application. Many R developers particularly prefer to use Spark to read data, as Spark offers scalability and connectors to many popular data sources.
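As a sketch, reading and aggregating a table with sparklyr inside the hosted RStudio might look like the following; the table and column names are hypothetical:

```r
library(sparklyr)
library(dplyr)

# Inside Databricks-hosted RStudio, sparklyr connects to the cluster's
# pre-configured Spark session with method = "databricks".
sc <- spark_connect(method = "databricks")

# Hypothetical table: aggregate in Spark, then collect the small result
# back into R for use in the Shiny UI.
delays <- tbl(sc, "airline_delays") %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  collect()
```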


Figure 2. Use Spark to read data while developing a Shiny application

When you run the command to launch the Shiny application, the application is displayed in a separate browser window. The Shiny application runs inside RStudio’s R session, so if you stop the R session, the Shiny application window will disappear. This workflow is meant for developers to interactively develop and test a Shiny application, not to host Shiny applications for a broad audience to consume.
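A minimal app run from the RStudio console illustrates this workflow; the example below uses a built-in dataset purely for illustration:

```r
library(shiny)

ui <- fluidPage(
  sliderInput("bins", "Number of bins", min = 5, max = 50, value = 25),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    # Built-in dataset used purely for illustration; any collected Spark
    # result could be plotted here instead.
    hist(faithful$eruptions, breaks = input$bins, col = "steelblue", main = NULL)
  })
}

# Launches the app in a separate browser window; it stays up only as long
# as this R session is running. For an app saved in a directory, use
# runApp("<path-to-app>") instead (placeholder path).
shinyApp(ui = ui, server = server)
```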


Figure 3. A sample Shiny application

Publishing to Shiny Server or Hosting Service

Once you complete a Shiny application, you can publish it to a hosting service. Popular products that host Shiny applications include shinyapps.io, RStudio Connect, and Shiny Server.

The hosting service can be outside of Databricks. The deployed Shiny applications can still access data through Databricks by connecting to the JDBC/ODBC server [AWS | Azure] that is available on every Databricks cluster and submitting queries to it. That way, you can continue to take advantage of the security and scalability of Databricks, even if the Shiny application runs outside of Databricks.
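As a sketch, a deployed Shiny app could query Databricks over ODBC roughly as follows; the driver name, host, HTTP path, and token are placeholders taken from the cluster’s JDBC/ODBC settings and a personal access token:

```r
library(DBI)
library(odbc)

# All connection details below are placeholders.
con <- dbConnect(
  odbc::odbc(),
  Driver          = "Simba Spark ODBC Driver",
  Host            = "<databricks-workspace-host>",
  Port            = 443,
  HTTPPath        = "<cluster-http-path>",
  SSL             = 1,
  ThriftTransport = 2,
  AuthMech        = 3,
  UID             = "token",
  PWD             = Sys.getenv("DATABRICKS_TOKEN")
)

# Hypothetical query; the heavy lifting runs on the Databricks cluster.
daily_totals <- dbGetQuery(
  con,
  "SELECT order_date, SUM(amount) AS total FROM sales GROUP BY order_date"
)
dbDisconnect(con)
```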

In summary, with this new feature you can develop Shiny applications on Databricks. You can continue to leverage Databricks’ highly scalable and secure platform after you publish your Shiny application to a hosting service.


Figure 4. Publish a developed Shiny application to a hosting service

Get Started with Shiny in Databricks

To learn more, don’t miss our Shiny User Guide [AWS | Azure] and join us on March 12 for a live webinar on “Developing and testing Shiny apps“, hosted by Data Science Central, where we will demonstrate these new capabilities.

We are committed to continuing to enhance and expand our product support for R users, and we look forward to your feedback!

--

Try Databricks for free. Get started today.

The post Developing Shiny Applications in Databricks appeared first on Databricks.


Security that Unblocks the True Potential of your Data Lake


Over the last few years, Databricks has gained a lot of experience deploying data analytics at scale in the enterprise. In many cases, our customers have thousands of people using our product across different business units for a variety of use cases — all of which involve accessing data of various classifications, from private and sensitive data to public data. This has surfaced the challenges that come with deploying, operating and securing a data analytics platform at scale. In this blog post, I want to talk about some of those learnings.

Challenges with securing a data lake

As they move to break down data silos, many organizations push all of their data, from different sources, into a data lake where data engineers, data scientists and business analysts can process and query the data. This addresses the challenge of making data available to users but creates a new challenge of protecting and isolating different classes of data from users who are not allowed to access it.

What we have learned from our experience is that scaling from operationalizing a single use case in production to operationalizing a platform that any team in the enterprise could leverage raises a lot of security questions:

  • How can we ensure that every compute environment accessing the data lake is secure and compliant with enterprise governance controls?
  • How do we ensure that each user can only access the data they are allowed to access?
  • How do we audit who is accessing the data lake and what data are they reading/writing to?
  • How do we create a policy-governed environment without relying on users to follow best practices to protect our company’s most sensitive data?

These questions are simple to answer and implement for a small team or a small dataset supporting a specific use case. However, it is really hard to operationalize data at scale so that every data scientist, engineer, and analyst can make the most of the data. That’s exactly what the Databricks platform is built for — simplifying and enabling data analytics securely at enterprise scale.

Based on our experience, here are some themes that platforms need to pay attention to:

Cloud-native controls for core security

Enterprises spend a lot of money and resources creating and maintaining a data lake with the promise that the data can be used for a variety of products and services across the enterprise. No one platform can solve all enterprise needs, which means this data will be used by different products, whether homegrown, vendor-acquired or cloud-native. For this reason, data has to be unified in an open format and secured using cloud-native controls where possible. Why? Two reasons. One, because cloud providers have figured out how to scale their core security controls. Two, if protecting and accessing data requires proprietary tools, then you will have to integrate those tools with everything that accesses the data. This can be a nightmare to scale. So, when in doubt, go cloud-native.

This is exactly what the Databricks platform does. It integrates with IAM and AAD for identity, KMS/Key Vault for data encryption, STS for access tokens, and security groups/NSGs for instance firewalls. This gives enterprises control over their trust anchors and lets them centralize their access control policies in one place and extend them to Databricks seamlessly.

Cloud native security

Isolate the environment

Separation of compute and storage is an accepted architecture pattern for storing and processing large amounts of data. Securing and protecting the compute environment that can access the data is the most important step when it comes to reducing the overall attack surface. How do you secure the compute environment? It reminds me of a quote by Dennis Hughes of the FBI: “The only secure computer is one that’s unplugged, locked in a safe, and buried 20 feet under the ground in a secret location and I’m not even too sure about that one” – well, sure, but that does not help us with the goal of enabling all enterprise data scientists and engineers to get going with new data projects in minutes, across the globe and at scale. So what does? Isolation, isolation, isolation.

Step 1. Ensure that the cloud workspaces for your analytics are only accessible from your secured corporate perimeters. If employees need to work from remote locations, they need to VPN into the corporate network to access anything that can touch data. This allows enterprise IT to monitor, inspect and enforce policies on any access to workspaces in the cloud.

Step 2. Go invisible, and by that I mean, implement Azure Private Link or AWS PrivateLink. Ensure that all traffic between the users of your platform, the notebooks, and the compute clusters that process queries is encrypted and transmitted over the cloud provider’s network backbone, inaccessible to the outside world. This also mitigates data exfiltration, because compromised or malicious users cannot send data externally. VPC/VNet peering addresses a similar requirement but is operationally more intensive and can’t scale as well.

Step 3. Restrict and monitor your compute. The compute clusters that execute the queries should be protected by restricting SSH and network access. This prevents installation of arbitrary packages and ensures you’re only using images that are periodically scanned for vulnerabilities and continuously monitored for compliance. This can be accomplished in Databricks by simply clicking “launch cluster.” Done!

Databricks makes it really easy to do the above. Dynamic IP access lists allow admins to ensure that workspaces are accessed only from their corporate networks. Furthermore, Private Link ensures that the entire network traffic between users->Databricks->clusters->data stays within the cloud provider’s network. Every cluster that is launched starts from images that have been scanned for vulnerabilities and are locked down so that changes that violate compliance can be restricted – all of this is built into workspace creation and cluster launch.
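As a rough sketch, an admin could register an IP access list through the REST API along these lines (the workspace URL and CIDR range are placeholders, and the exact endpoint may vary by release):

```r
library(httr)

workspace <- "https://<workspace-url>"   # placeholder
token     <- Sys.getenv("DATABRICKS_TOKEN")

# Allow workspace access only from the corporate network range (placeholder CIDR).
resp <- POST(
  paste0(workspace, "/api/2.0/ip-access-lists"),
  add_headers(Authorization = paste("Bearer", token)),
  body = list(
    label        = "corp-network",
    list_type    = "ALLOW",
    ip_addresses = list("203.0.113.0/24")
  ),
  encode = "json"
)
stop_for_status(resp)
```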

Network isolation


Secure the data

The challenge with securing and protecting a data lake is that it holds large amounts of data with different levels of classification and sensitivity. Often this data is accessed by users through different products and services and can contain PII. How do you provide data access to hundreds or thousands of engineers while ensuring that each of them can only access the data they are allowed to?

Remove PII data

Before data lands in the data lake, remove PII. This should be possible in many cases, and it has proven to be the most successful route for minimizing the scope of compliance and ensuring that users don’t accidentally use or leak PII. There are several ways to accomplish this, but incorporating it as part of your ingest is the best approach. If you must keep data that can be classified as PII in the data lake, be sure to build in the capability to query for it and delete it if required by regulations such as CCPA and GDPR. This article demonstrates how this can be achieved using Delta Lake.
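For example, with Delta Lake such a deletion can be expressed directly in SQL from R; the table name and key below are hypothetical:

```r
library(sparklyr)

sc <- spark_connect(method = "databricks")

# Delete every record tied to a data subject (hypothetical table and key),
# e.g. to honor a CCPA/GDPR erasure request. dbGetQuery simply executes the
# statement through the Spark connection.
DBI::dbGetQuery(sc, "DELETE FROM customer_events WHERE customer_id = '12345'")

# Clean up the underlying files that still hold the deleted rows once the
# table's retention window has passed.
DBI::dbGetQuery(sc, "VACUUM customer_events")
```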

Strong access control

Most enterprises have some form of data classification in place. The access control strategy depends on how the data is stored in the data lake. If data categorized under different classifications is separated into different folders, then having IAM roles map to the segregated storage enables clean separation, and users/groups in the identity provider can be associated with one or more of these roles. If this approach suffices, it’s easier to scale than implementing granular access control.

If classification is defined at the data object level, or access control needs to be implemented at the row/column/record level, the architecture requires a centralized access control layer that can enforce granular access control policies on every query. The reason this should be centralized is that there may be different tools/products that access the data lake, and having different solutions for each will require maintaining policies in multiple places. There are products that have rich features in this area of attribute-based access control, and the cloud providers are also implementing this functionality. The winner will have the right combination of ease-of-use and scalability.

Whatever you do, it’s important to ensure that you can attribute access back to an individual user. A query executed by a user should assume the identity and role of that user before accessing the data, which will not only give you granular access control but also provide the audit trail required for compliance.

Encryption

Encryption not only acts as a way to gain “ownership” of data on third-party infrastructure but can also be used as an additional layer of access control. Use cloud provider key-management systems over third parties here, because they are tightly integrated with all services. Achieving the same level of integration for all the cloud services that you want to use with third-party encryption providers is nearly impossible.

Enterprises that want to go the extra mile in security should configure policies on customer-managed keys used to encrypt/decrypt data and combine that with access control of the storage folder itself. This approach ensures separation of duties between users who manage storage environments vs those who need to access the data in the storage environment. Even if new IAM roles are created to access data they will not be authorized to access the KMS key to decrypt it, thus creating a second level of enforcement.

Unblock your data lake’s potential

The true potential of data lakes can only be realized when the data in the lake is available to all of the engineers and scientists who want to use it. Accomplishing this requires a strong security fabric woven into the data platform. Building such a data platform that can also scale to all users across the globe is a complex undertaking. Databricks delivers such a platform that is trusted by some of the largest companies in the world as the foundation of their AI-driven future.

To learn more about other steps in your journey to create a simple, scalable and production-ready data platform, read the following blogs:

Enabling Massive Data Transformation Across Your Organization

--

Try Databricks for free. Get started today.

The post Security that Unblocks the True Potential of your Data Lake appeared first on Databricks.

Delivering and Managing a Cloud-Scale Enterprise Data Platform with Ease


Data is growing exponentially and organizations are building products to harness the data and provide services to their customers. However, this exponential growth cannot be sustained by an exponential growth in infrastructure spend or human capital cost.

Today, there are over a hundred services available in each of the major clouds (AWS, Azure) that can be used to build your data platform. And there are hundreds of enterprise services that also need to integrate with your data platform. Data leaders and platform administrators are tasked with providing the right set of services and products to fulfill the data needs of their organization. These services need to be on-demand, at scale, reliable, policy-compliant and within budget.

Complexities of administering an organization-wide data platform

Data is the lifeblood of any organization. As organizations become more and more data-driven, every team in every line of business is trying to leverage the power of data to innovate on their products and services. How do you create an enterprise-wide data, analytics and ML platform that is easy to use for users while having the right visibility and control for admins?

Heterogeneous teams have heterogeneous operations

Product and services teams want ready-to-use analytic tools, so they can get to work on the meaty parts of the problems that they are trying to solve.

Data science teams use data sets to build analytic models to answer hard questions about the business. They use notebooks, connecting them to databases or data lakes, reading log files potentially stored in cloud or on-premises data stores and event streams. They often use tools that are most easily available on their laptops and work on a representative set of data to validate their models.

Data engineering teams, on the other hand, are trying to take these models into production so the insights from the models and apps are available 24/7 to the business. They need infrastructure that can scale to their needs. They need the right set of testing and deployment infrastructure to test out their pipelines before deploying them for production use.

Disjointed solutions are hard-to-manage solutions

Various teams end up building bespoke solutions to solve their problems as fast as possible. They deploy infrastructure that may not be suited to the needs of their workloads, which may result in either starvation for workloads (under-provisioning) or runaway costs (over-provisioning). The infrastructure and tools may not be configured correctly to meet the compliance, security and governance policies set by the organization, and admin teams have no visibility into it. While these teams have the right expertise to do this for traditional application development, they may not have the right expertise or tools to do this in the rapidly changing data ecosystem. The end result is a potpourri of solutions sprinkled across the organization that lacks the visibility and control needed to scale across the entire organization.

A simple-to-administer data platform

So what would it take to build a platform for data platform leaders that allows them to provide data environments for the analytic needs of product and services teams while retaining the visibility, control, and scale that allows them to sleep well at night? We focused on visibility, control, and scale as the key pillars of this platform.

Visibility – Audit and analyze all the activity in your account for full transparency

Typically, the data platform engineering team starts onboarding their workloads directly on the data platform they are managing. Initially, the euphoria of getting to a working state with these workloads overshadows the costs being incurred. However, as the number and scale of these workloads increase, so do resources and the cost of compute needed to process the data.

The conscientious administrators of the data platform look for ways to visualize the usage on the platform. They can visualize past usage and get an empirical understanding of the usage trends on the platform.

Usage Visualization


As more product and services teams are onboarded, the resulting explosion in usage quickly exceeds the allocated budget. The only viable way for the data platform administration team to run its business is to issue chargebacks to product teams for their usage. In order to do that, the admin needs access to usage logs that carry the right usage tags.
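As a sketch, once those usage logs land in the data lake, a chargeback report by team tag could be assembled roughly as follows (the table and column names are hypothetical and depend on how usage logs are delivered in your account):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

# Hypothetical usage-log table with one row per cluster usage record.
monthly_chargeback <- tbl(sc, "usage_logs") %>%
  filter(usage_date >= "2020-02-01", usage_date < "2020-03-01") %>%
  group_by(team_tag) %>%
  summarise(total_dbus = sum(dbus, na.rm = TRUE)) %>%
  arrange(desc(total_dbus)) %>%
  collect()
```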

Over the course of operations during the year, there may be spikes in the usage of resources. It is hard to determine whether these spikes are expected changes in workloads or some unintended behavior — like a team running a job with an error that causes unexpected use of resources. Detailed usage logs help identify the workloads and teams that caused the anomalous usage. Admin teams can then use detailed audit logs to analyze the events leading up to that usage. They can work with the respective team to get qualitative information about this usage and make a determination on the anomaly. If it is a change in the usage patterns of the workloads, they can set up automated means to classify this usage as ‘normal’ in the future. Similarly, if the usage is actually an anomaly, they can set up monitoring and alerts to catch such anomalies in real time in the future.

As data platform leaders plan budgets, detailed usage data from the past can be used to build out more accurate forecasts of costs, usage and return on investment.

Control – Set policies to administer users, control the budget and manage infrastructure

While visibility is great, when managing many teams, it is better to have proactive controls to ensure a policy-compliant use of the platform.

When new data scientists are onboarded, they may not have a good understanding of the underlying infrastructure that runs their models. They can be provided environments that are pre-provisioned with the right policy-mandated clusters, the right access controls, and the ability to view and analyze the results of their experiments.

Similarly, as part of automating data pipelines, data engineers create clusters on demand and kill them when not needed, so that infrastructure is used optimally. However, they may be creating clusters that are fairly large and do not conform to the IT policies of the organization. The admin can apply cluster policies for this team such that any cluster a user creates will conform to the mandated IT policies. This allows the team to spin up resources in a self-service and policy-compliant way, without any manual dependency on the admin.
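As an illustrative sketch, such a policy could be registered through the Cluster Policies API roughly as follows (the workspace URL, node types and limits are placeholders):

```r
library(httr)
library(jsonlite)

workspace <- "https://<workspace-url>"   # placeholder
token     <- Sys.getenv("DATABRICKS_TOKEN")

# Policy definition: restrict node types and cap cluster size and idle time.
definition <- as.character(toJSON(list(
  "node_type_id"            = list(type = "allowlist", values = list("i3.xlarge", "i3.2xlarge")),
  "autoscale.max_workers"   = list(type = "range", maxValue = 10),
  "autotermination_minutes" = list(type = "fixed", value = 60)
), auto_unbox = TRUE))

resp <- POST(
  paste0(workspace, "/api/2.0/policies/clusters/create"),
  add_headers(Authorization = paste("Bearer", token)),
  body = list(name = "data-science-policy", definition = definition),
  encode = "json"
)
stop_for_status(resp)
```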

Set policies to administer users


Furthermore, the admin can set bounds on the infrastructure being used by allocating pools of infrastructure that dynamically auto-scale for the team. This ensures that the team is only able to spin up resources from within the bounds of the pool. Additionally, the resources in the pool can be spun down when not in use, thereby optimizing the overall use of the infrastructure.

Scale – Extend and scale the platform to all your users, customers and partners

As hundreds of teams are onboarded to the data platform, team workspaces are needed to isolate teams so that they can work collaboratively within their team without being distracted or affected by other teams working on the platform. The workspace can be fully configured for the team’s use, with notebooks, data sources, infrastructure, runtimes and integration with DevOps tools. With user provisioning and entitlements managed by trusted Identity Providers (IdPs), the admin can ensure that the right set of users can access the right workspaces by using enterprise-wide Single Sign-On capabilities. This isolation and access mechanism ensures that hundreds of teams can co-exist on the same data platform in a systematic manner, allowing the admin to manage them easily and scale the platform across the organization worldwide.

All of the above-mentioned capabilities of the platform should be available to admins both in an easy-to-use UI and via a rich set of REST APIs. The APIs enable the admin to automate onboarding and make it efficient and fast for new teams.

Workspaces for teams across the organization


Use Databricks to effortlessly manage users and infrastructure

The Databricks platform has a number of these capabilities that help you provide a global-scale data platform to various product and services teams in your organization. The heterogeneity and scale of customers on the platform pose new challenges every day. We are building more into the Databricks platform so you can press the ‘easy button’ and enable consistent, compliant data environments across your organization on demand.

To learn more about other steps in your journey to create a simple, scalable and production-ready data platform, read the following blogs:

Enabling Massive Data Transformation Across Your Organization

--

Try Databricks for free. Get started today.

The post Delivering and Managing a Cloud-Scale Enterprise Data Platform with Ease appeared first on Databricks.

Productionize and automate your data platform at scale


Data-driven innovation is no longer optional to stay competitive in today’s marketplace. Companies that can bring data, analytics, and ML-based products to market first will quickly beat their competition. While many companies have streamlined CI/CD (continuous integration and delivery) processes for application development, very few have well-defined processes for developing data and ML products.

That’s why it’s critical to have production-ready, reliable and scalable data pipelines that feed the analytics dashboards and ML applications your managers use. As new feature sets are developed, data scientists, data analysts and data engineers need consistent toolsets and environments that help them rapidly iterate on ideas. As these ideas progress, they need to be tested and taken from development to production rapidly. Once they are in production, the ML models and analytics need to be constantly monitored for effectiveness, stability, and scale.

If you want to accelerate the creation of new and innovative data products, you will need to rely heavily on automation to overcome the following challenges.

Lack of consistent and collaborative development environments

When a new team is spinning up in an organization to build their data product or service, they need the infrastructure, tools and enterprise governance policies set up before they can start. We refer to this as a ‘fully configured’ data environment. This requires coordination and negotiations with multiple teams in the organization and may take days (if not weeks or months). The ideal would be to get such a fully configured data environment on-demand within minutes.

Lack of consistent DevOps processes

A lot of code is written by data teams to manage the data itself — like code for data ingestion, data quality, data cleanup, and data transformations. All of this is needed before the data can be utilized by downstream teams for business intelligence and machine learning. Machine learning flows themselves are very iterative. The amount and variety of data changes rapidly, thus requiring changes to the code that handles the data pipelines, and trains machine learning models. Like any other application development codebase, this requires the discipline of a CI/CD pipeline to ensure quality, consistency, and idempotency. Ideally, data engineers and data scientists — like app dev engineers — could focus on iterations of code and let the data platform and tools apply the right quality gates to transport the code to production.

Limited visibility into data pipeline and ML model performance

Data environments can change on multiple dimensions — the underlying data, the code for data transformations in data pipelines, and the models that are built using the data. Any of these can affect the behavior and performance of applications that depend on it. Multiply that by hundreds of teams deploying thousands of applications and services. The problem of scaling and monitoring the health and performance of such applications becomes complex. DevOps teams supporting data platforms for the organization need automated tools that help data teams scale as their workloads become larger and leverage monitoring tools to be able to ensure the health of these applications.

Fully configured data environments on-demand

  1. Deploy workspace
  2. Connect data sources
  3. Provision users and groups
  4. Create clusters and cluster policies
  5. Add permissions for users and groups

Deploy workspace – A global organization should be able to serve the data platform needs of their teams by provisioning data environments closest to where their data teams are — and more importantly, co-locate the services where the data resides. As the data platform leader of the organization, you should be able to service the needs of these teams on multiple clouds and in multiple regions. Once a region is chosen, the next step is deploying completely isolated ‘environments’ for the separate teams within the organization. Workspaces can represent such environments that isolate the teams from each other while still allowing the team members to collaborate among themselves. These workspaces can be created in an automated fashion, by calling Workspace REST APIs directly or using a client-side Workspace provisioning tool that uses those APIs.

Connect data sources – The next step is connecting data sources to the environment, including mounting them within the workspace. In order to access these cloud-native data sources with the right level of permissions from within the data environment, the appropriate permissions and roles can be set up using standardized infrastructure-as-code tools like Terraform.

Provision users and groups – The next step is to provision users and groups within the workspace using the standards-based SCIM API. This can be further automated when using an Identity Provider (IdP), like Azure Active Directory, Okta, etc., by setting up automated sync between the IdP and Databricks. This enables seamless management of users in one standard location, the IdP.
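A minimal sketch of adding a single user via the SCIM API (the workspace URL and user name are placeholders):

```r
library(httr)
library(jsonlite)

workspace <- "https://<workspace-url>"   # placeholder
token     <- Sys.getenv("DATABRICKS_TOKEN")

# SCIM payload for a new user (placeholder user name).
body_json <- toJSON(list(
  schemas  = list("urn:ietf:params:scim:schemas:core:2.0:User"),
  userName = "new.analyst@example.com"
), auto_unbox = TRUE)

resp <- POST(
  paste0(workspace, "/api/2.0/preview/scim/v2/Users"),
  add_headers(Authorization = paste("Bearer", token)),
  content_type("application/scim+json"),
  body = body_json
)
stop_for_status(resp)
```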

Create clusters and cluster policies – Now that users and data have been provisioned, you need to set up the compute so that users can run their workloads to process data. The cluster object represents the fully managed, auto-scaling unit of compute in a workspace. Typically, organizations have two modes of instantiating clusters. First is the long-running, static cluster that is used for interactive workloads – data scientists doing exploratory analysis in their notebooks. Second is the transient cluster that is created by scheduled or on-demand automated jobs. The static clusters are set up by the admins in the process of creating the data environment using the Clusters API. This ensures that the clusters conform to policies such as using the right instance types for VMs, the right runtimes, the right tags, etc. You can also configure the clusters with the right set of libraries using the Libraries API, so end users don’t have to worry about it. Transient clusters by definition are created at run time, so the policies have to be applied at runtime. To automate this you can use cluster policies, which let you define the parameters of any cluster that can be created by any user in the workspace, thus ensuring that these clusters are compliant with your policies.
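A sketch of creating one of those static clusters through the Clusters API, referencing a previously created policy, might look like this (the names, runtime version, instance type and IDs are placeholders):

```r
library(httr)

workspace <- "https://<workspace-url>"   # placeholder
token     <- Sys.getenv("DATABRICKS_TOKEN")

resp <- POST(
  paste0(workspace, "/api/2.0/clusters/create"),
  add_headers(Authorization = paste("Bearer", token)),
  body = list(
    cluster_name  = "analytics-shared",
    spark_version = "6.4.x-scala2.11",            # placeholder runtime key
    node_type_id  = "i3.xlarge",                  # placeholder instance type
    autoscale     = list(min_workers = 2, max_workers = 8),
    custom_tags   = list(team = "data-science", cost_center = "1234"),
    policy_id     = "<policy-id>",                # from the Cluster Policies API
    autotermination_minutes = 60
  ),
  encode = "json"
)
stop_for_status(resp)
```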

Grant permissions – Next, you will want to give your users and groups the right level of permissions on objects in the data environment so they can do their jobs. Databricks supports fine-grained access controls on objects such as clusters, jobs, notebooks, pools, etc. These can be automated using the Permissions API (in Preview).

Fully configured data environments


The CI/CD pipeline

Development environment – Now that you have delivered a fully configured data environment to the product (or services) team in your organization, the data scientists have started working on it. They are using the data science notebook interface that they are familiar with to do exploratory analysis. The data engineers have also started working in the environment, and they prefer working in the context of their IDEs. They would like a connection between their favorite IDE and the data environment that lets them code in the familiar IDE interface while using the power of the data environment to run unit tests, all without leaving the IDE.

Any disciplined engineering team would take their code from the developer’s desktop to production, running through various quality gates and feedback loops. As a start, the team needs to connect their data environment to their code repository on a Git-based service so that the code base is properly versioned and the team can work collaboratively on the codebase.

Staging/Integration environment – While hundreds of data scientists and data engineers work in the development phase of the data environment, more control is required when the time comes to push a set of changes to the integration testing phase. Typically, you want fewer users to have access to the integration testing environment, where tests are running continuously, results are being reported and changes are being promoted further. In order to do that, the team requires another workspace to represent their ‘staging’ environment. Another fully configured environment can be delivered very quickly to this team. Integration with popular continuous integration tools such as Jenkins or Azure DevOps Services enables the team to continuously test the changes. With more developers and more workloads, the rate of changes to the code base increases, and tests need to run faster. This also requires the underlying infrastructure to be available very quickly. Databricks pools allow the infrastructure to be held in a ready-to-use state while preventing runaway costs for always-on infrastructure. With all this tooling in place, the team is able to realize the ‘Staging’ environment for their continuous integration workflows.

Production environment – Eventually, when the code needs to be deployed to production — similar to the fully configured staging environment — a fully configured production environment can be provisioned quickly. This is a more locked-down environment, with access for only a few users. Using their standard deployment tools and the well-defined REST APIs of the platform, the team can deploy the artifacts to their production environment.

Once this CI/CD pipeline is set up, the team can quickly move their changes from a developer desktop to production, while testing at scale, using familiar tools to ship high-quality products.

Databricks automation at scale


Streamlined operations

Over time, as the amount of data being processed increases, data teams need to scale their workloads. DevOps teams need to ensure that the data platform can scale seamlessly for these workloads, and they can leverage auto-scaling capabilities in the data platform to deliver seamless, automated scale. Additionally, a platform that is available on multiple clouds, and in multiple regions of each cloud (AWS, Azure), allows the DevOps teams to deliver an at-scale platform to data teams wherever they are operating in the world.

DevOps is in charge of supporting the data platform being used by teams across the organization. They would like to ensure that the three layers of the data environment are working as expected — infrastructure, platform, and application. At the bottom is the infrastructure, the compute that is used to perform the data processing jobs. Integration with tools such as Ganglia and Datadog provides visibility into core infrastructure metrics, including CPU, memory, and disk usage. A layer above the infrastructure is the data platform. Objects such as clusters and jobs enable DevOps teams to automate analytics and machine learning workflows. Performance metrics for these objects can be consumed using the REST APIs for Cluster Events and Jobs and plugged into the monitoring tool of choice for the organization. The last layer is the application, specifically the Spark-based application. Visibility into monitoring data from Apache Spark application logs enables the team to troubleshoot failures and performance regressions and design optimizations for running workloads efficiently.
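As a sketch, cluster lifecycle events can be pulled from the Cluster Events API and forwarded to whatever monitoring tool the organization uses; the workspace URL and cluster ID below are placeholders:

```r
library(httr)

workspace <- "https://<workspace-url>"   # placeholder
token     <- Sys.getenv("DATABRICKS_TOKEN")

# Pull recent lifecycle and autoscaling events for one cluster (placeholder ID).
resp <- POST(
  paste0(workspace, "/api/2.0/clusters/events"),
  add_headers(Authorization = paste("Bearer", token)),
  body = list(cluster_id = "<cluster-id>", limit = 50),
  encode = "json"
)
stop_for_status(resp)
events <- content(resp)$events
```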

The automation interfaces enable the DevOps teams to apply consistent tooling across the CI/CD pipeline and integrate performance monitoring not only in production but also in lower-level environments. It allows the teams to debug the changes and also test out the performance monitoring workflow itself before the changes reach production.

Take innovations to market faster

AI/ML-driven innovations are happening organically in every enterprise in all sectors of the economy. Enterprises can realize the full potential of those innovations only when they are developed, tested and productionized rapidly so their customers can use them. In order to do that, the organization needs a reliable, global-scale, easy-to-use data platform that enables CI/CD for AI/ML applications and that can be made available to any team in the organization quickly.

To learn more about other steps in your journey to create a simple, scalable and production-ready data platform, read the following blogs:

Enabling Massive Data Transformation Across Your Organization

 

--

Try Databricks for free. Get started today.

The post Productionize and automate your data platform at scale appeared first on Databricks.

Enabling Massive Data Transformation Across Your Organization


The top four companies by market cap today (MSFT, AAPL, AMZN, GOOG) are all software companies, running their businesses on data. This is a radical departure from a decade ago. Enterprises in every segment are beginning to establish dominance through data. As a bank executive recently told me: “We’re not a finance company, we are a data company.”

Why is every company a data company (and every product a data product)?

The next wave of dominant companies in every segment will, underneath the covers, be data companies. This requires a data platform that drives the decisions of every employee and, just as important, powers data products.  What are data products? A financial instrument, such as a credit card with a credit limit, can become a data product. Its competitive edge comes from crunching enormous amounts of data. Genomic sequencing is a data product. Finding life on Mars is a data product.

To enable the massive data transformation we talk about, you need to bring all your users and all of your data together.  And then give them the tools and infrastructure they need to draw insights while following the enterprise security protocols. You need an Enterprise Data platform that scales across every department and every team. So why is that getting harder, not easier?

Your data has become more sensitive — The scale of data is increasing exponentially, yet it’s siloed across different systems in different departments. How do you make sure the right users have access to the right data, and it’s all monitored and audited centrally? And, at the same time, how do you stay in compliance with international regulations?

Your costs are difficult to control – Every organization is under pressure to do more with less. Exponential growth in data does not justify exponential growth in data infrastructure costs. When you have no visibility into who is doing what with what data, it results in uncontrolled costs — infrastructure costs, data costs and labor costs.

Data projects are difficult to manage – How do you track an initiative from start to finish when disparate teams — business analysts, data scientists, and data engineering — deploy disparate technologies managed by IT, Security and DevOps? Which projects are in production? How are we monetizing them? What happens if an app goes down?

The complexities in going from small-scale success to enterprise-wide data transformation are enormous. A survey by McKinsey reveals that only 8% of enterprises have been successful at scaling their Data and Analytics practices across the organization.¹

Executives need a holistic strategy to scale data across the organization

Enterprises new to these challenges may take an incremental approach, or take on-premises solutions and move them to the cloud. But without a holistic approach, you are setting yourself up to replace one outdated architecture with another that is not up to the challenge long term. The following five steps can ensure you are progressing towards a system that can stand the test of time.

Step 1: Bring all your data together

Data warehouses have been used for decades to aggregate structured business data and drive decisions through BI dashboards on visualization tools. The arrival of data lakes — with their attractive scaling properties and suitability for unstructured data — was vital for enabling data science and machine learning. Today, the Data Lakehouse model combines the reliability of data warehouses with the scalability of data lakes using an open format such as Delta Lake. Regardless of your specific architecture choices, choose a structure that can store all of your data — structured and unstructured — in open formats for long-term control, suitable for processing by a rapidly evolving set of technologies.

Step 2: Enable users to securely access the data

Make sure every member of your data team (data engineers, data scientists, ML engineers, BI analysts, and business decision-makers) across various roles and business units has access to the data they need (and none of the data they’re not authorized to access). This means complying with various regulations, including GDPR, CCPA, HIPAA, and PCI.

It is important that all of your data — and all people that interact with it — remain together, in one place. If you are fragmenting the data by copying it into a new system for a subset of users (e.g., a data warehouse for your BI users), you have data drift, which leads to issues in Step 3. It also means you have drift of “truth”, where some information in your organization is stale or of a different quality, leading to (at best) organizational mistrust and (more likely) bad business outcomes.

Step 3: Manage your data platform like you manage your business

When you onboard a new employee, you set them up for success.  They get the right computer, access to the right systems, etc. Your data platform should be the same.

Since all of your data is in one place, every employee can see a different facet of the data, according to their roles and responsibilities. And this data access needs to be aligned with how you manage other employee onboarding; everything must be tied to your onboarding systems, automated, and audited.

Step 4: Leverage cloud-native security

As cloud computing has become the de facto destination for massive data processing and ML, core security principles have been reformulated as cloud-native security. The DMZ and perimeter security of on-premises environments are replaced with “zero-trust” and “software-defined networking.” Locks on physical doors have transformed into modern cryptography. So you must ensure your data processing platform is designed for the cloud and leverages best-in-class cloud-native controls.

Moreover, the cloud auditing and telemetry provide a record of data access and modification through the cloud-native tools, since every user accesses data with their own identity.  This makes Step 3 possible – the groups that you manage your company with are enforced and auditable down to the cloud-native security primitives and tools.

Step 5: Automate for scale

Whether rolling out your platform to hundreds of business units, or many thousands of customers, it needs to be automated from the ground up.  This requires that your data platform can be deployed with zero human intervention.

Further, for each workspace (environment for a business unit), data access, machine learning models, and other templates must be configured in an automated fashion to be ready for your business.

But powering this scale also demands powerful controls. With the compute of millions of machines at your fingertips, it is easy to run up a massive bill. To deploy across departments in the enterprise, the right spend policies and chargebacks need to be designed to ensure the platform is being used as the business expects.

APIs can automate everything from provisioning users and team workspaces to automating production pipelines, controlling costs, and measuring business outcomes.  A fully automatable platform is necessary to power your enterprise.

Become the data company you must be

It is time to begin the journey to compete as a data company. Enterprises around the world are making this journey by placing Databricks at their core. A large modern bank uses Databricks to process 20 million transactions across 13 million end users every day to detect credit card fraud and support many other use cases. They have been able to democratize data access so that 5,000 employees make data-based decisions on Databricks. One of the largest food & beverage retailers in the world operates over 220 production pipelines, with 667 TB of data and 70+ published data products on the Databricks platform. We are glimpsing the beginning of the data revolution in business and are excited to see where the road takes us from here.

Regardless of the platform choices you make, incorporate these five steps to ensure you are designing a platform that delivers for years to come.

Resources:

To learn more and watch actual demos, sign up for the webinar.

Learn more about each one of the steps in detail by reading the following blogs

1 https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/ten-red-flags-signaling-your-analytics-program-will-fail

--

Try Databricks for free. Get started today.

The post Enabling Massive Data Transformation Across Your Organization appeared first on Databricks.

Business Continuity at Databricks During COVID-19 Crisis


Dear valued customer,

As COVID-19 continues to impact our daily lives, we are thinking about those affected as well as the practitioners who are delivering essential services to so many.

As we adjust to a new reality, we want you to know that we have never lost focus on you and have a comprehensive plan in place for this evolving situation. As always, you can continue to rely on Databricks for our technology and our people.

We fully recognize the critical role that our platform plays for so many of our customers, and we are committed to providing you with the same world-class customer experience you’ve come to trust. We remain dedicated to the reliability of the Databricks platform, and our Customer Success team looks forward to continuing to advise and support you globally. As a reminder, Databricks’ real-time service availability can be viewed at any time at status.databricks.com.

Along with our commitment to our customers, the safety and well-being of our employees and our community are top priorities for us. We are taking measures to prevent the spread of COVID-19 by having employees work from home, canceling or virtualizing events, and restricting travel. We want to ensure that you and our teams are safe, connected, and supported during these times.

Finally, to stay responsive to this evolving situation, we have created a task force of senior leaders across Databricks. This group is monitoring best practices on a daily basis, and taking necessary actions in the best interest of our customers and employees. For Databricks updates related to COVID-19, please follow this blog post.

Being a trusted partner to you is our mission and our privilege. We remain focused on providing reliable service to you and your business throughout this difficult time. Please take care, and be well.

All the best,

Ali Ghodsi
Databricks CEO & Co-Founder

--

Try Databricks for free. Get started today.

The post Business Continuity at Databricks During COVID-19 Crisis appeared first on Databricks.

Women’s HERstory Month at Databricks


As we celebrate Women’s HERstory Month this March, we reflect on the community that we’ve built within Databricks and throughout our careers. Our mission as Co-Leads of the Women of Databricks Employee Resource Group is to promote inclusion, career growth, leadership, mentorship, and networking. A strong support system is one of the keys to achieving this mission. In this blog post, we highlight how mentorship and advocacy have shaped the careers of some of the women of Databricks.


Celebrating International Women’s Day #EachforEqual

Women’s HERstory Month Q&A with the Women of Databricks

Describe what mentorship and advocacy means to you – where have you seen it within Databricks, or in past roles? Who has made a positive impact on your career and what advice would you give people who want to mentor?

“For me, mentorship brings a reminder that whatever challenge I am facing has been conquered by thousands of women before me. I find comfort in that shared experience and speaking with a mentor leaves me feeling inspired and lighter. At Databricks, my teammate Allie Emrich has become my mentor. I have found that mentorship relationships come naturally in organizations that have a healthy culture. Mentorship is often stressed for women and it is crucial, especially in fields where women are underrepresented. A sponsor will advocate for you in your career and will champion you to get you to the next level. My manager Vinay Wagh acts as my sponsor and my career would look much different without him.”
— Anna Shrestinian, Sr. Product Manager

“I am a big proponent of mentoring. I’ve been both a mentor and mentee, and believe that everyone should do both throughout their careers. I’ve had many mentors/mentees in my career, both formal and informal, peers and superiors, men and women. One rule of thumb I like to follow is that you are your best advocate and you need to own your career, meaning that you should regularly seek constructive feedback from people you trust. Both peers and superiors have had different experiences and perspectives and can give you feedback that will help you grow professionally. For people who want to mentor, start by developing trust with others, volunteering that constructive feedback, and providing a welcoming environment for others to seek out your thoughts.”
— Jody Soeiro de Faria, Director of Curriculum Development

“In my opinion, mentorship is about learning from others to grow into a better self, as well as offering support for others to achieve their goals. At Databricks, I have received a lot of constructive feedback from my managers, in each stage of my career growth: from “how to give a meetup talk on my work” to “how to work effectively in cross-functional collaboration”. When I became an engineer manager, I started to do more mentoring for the team, as well as for other Bricksters. I realize that the path I have gone through and the experience I have learned can be applicable to help them achieve their goals. Our discussions often spark new observations for myself as well. I believe mentorship is a win-win for both sides and everyone grows through the mentorship relationship.”
— Nanxi Kang, Team Lead, Cloud Team

“Mentorship means going the extra mile to support, guide and share knowledge with another person or team. The biggest area I’ve seen mentorship is within Databricks leadership. In a team that’s highly technical, where my role is not, I’ve had individuals in my team (from managers, directors, to our CISO), take time to answer my questions in depth, even if it didn’t directly involve them. This has given me more confidence working on projects with other teams, and provided a better understanding of the product. The best advice I’d give is to take the time to be open, answer questions, and help out people at all levels. This will not only help that person understand their role better, but provide a better experience for the team as a whole.”
— Chelsea Stoecker, Project Manager, Security & IT

“Many people may think of mentorship as seeking advice from a single person. However, for me, I look for advice from a diverse set of individuals whether they are a peer, a manager or a customer. At Databricks, we have a strong team of Customer Success leaders that I can always count on to guide me through tough situations. It’s up to you as the individual to be vocal and actively solicit mentorship. Activate that childlike wonder mentality and don’t be afraid to treat every interaction as an opportunity to learn something new.”
—Sharon Cheung, Manager, Enterprise West – Customer Success Engineering

“In my experience, the best mentorship has never been through a formal assignment or program, but has happened organically within the personal and professional relationships I’ve built. By understanding the strengths and backgrounds of the people around me, I can get mentorship without putting a label on it. It can come from anywhere and the mentorship needed will vary throughout your career, so I encourage people seeking a mentor to perhaps instead think about what exactly they’re looking for from that relationship – if you can define that, you can identify the best person to support you from the network you already have.”
— Cynthia Hanson, Head of HR, EMEA

“Mentorship is more than just shadowing someone. It’s about working with that mentor to gain skills but also develop your own style and develop yourself professionally. A mentor encourages you to dig deeper and question everything, even what they are teaching you. They inspire you to seek truth and act as a trusted sounding board for your ideas and goals. Being a trusted mentor in any capacity is something we should all aim to achieve.”
— Palla Lentz, Resident Solutions Architect

As we continue our work with the Women of Databricks ERG, we are excited to offer more avenues for networking and establishing relationships that can lead to mentorship opportunities. We recently co-sponsored happy hours at Databricks offices around the world to celebrate International Women’s Day, and we will be providing many more events and activities throughout the year.

Start Your Story

We would love to hear your story. Learn more about how you can get involved with Databricks by checking our Careers Page.

--

Try Databricks for free. Get started today.

The post Women’s HERstory Month at Databricks appeared first on Databricks.

Spark + AI Summit’s Expanded Agenda Announced


Over the years, technical conferences tend to expand beyond their initial focus, adding new technologies, types of attendees, and a broader range of sessions and speakers. From its original focus on Apache Spark™, the Spark + AI Summit has extended its scope to include not only data engineering and infrastructure, but also machine learning automation and applications.

Data, AI, and the Cloud are pillars of advanced data analytics that bring together data visionaries, Spark experts, machine learning developers, data engineers, data scientists, and data analysts to drive innovation at scale and share novel ideas with the community. With the dawn of a new decade of data, this conference broadly covers topics in data engineering and architecture, business analytics and visualization, data platforms for machine learning, and artificial intelligence industry use cases.

The Spark + AI Summit 2020, being held June 22-25 in San Francisco, will feature an expansive agenda covering a range of big data and AI subjects for data engineers, data scientists, and data analysts.

The agenda for this year’s Summit has just been announced, and you’ll see a remarkable range of sessions designed to help attendees learn how to put the latest technologies and techniques into practice. The 190+ presenters, tracks and sessions cover not only open source projects originated by Databricks (Spark, MLflow, Delta Lake, and Koalas) but also other important open source technologies, including TensorFlow, PyTorch, the Python data ecosystem, Ray, Presto, Apache Arrow, and Apache Kafka.

New AI, ML and Other Sessions by Theme

Among the companies presenting are Apple, Microsoft, Facebook, Airbnb, Pinterest, Linkedin, Capital One, Netflix, Uber, Adobe, Nvidia, Walmart, Zillow, Paypal, Visa, Target, T-Mobile, Intuit, Atlassian, Comcast, Alibaba, Tencent, and Bytedance. Here are some highlights, clustered by themes:

  • Automation and AI use cases: Learn how to use machine learning and AI technologies to automate workflows, processes, and systems. We have speakers from leading research groups and industry sectors, including IT and software, financial services, retail, logistics, IoT, and media and advertising.
  • Building, deploying, and maintaining data pipelines: As data and machine learning applications become more sophisticated, underlying data pipelines have become harder to build and maintain. We have a series of presentations on best practices and new open source projects aimed directly at helping data engineering teams build and maintain data pipelines using Spark, Delta Lake, and other open source technologies.
  • ML Ops: Over 20 presentations on managing the machine learning development lifecycle and on deploying and monitoring models in production. This is an area where open source projects and best practices are starting to emerge.
  • Data management and platforms: In a recent post, we introduced a new data management paradigm – “the lakehouse” – for the age of data, machine learning and AI. This year’s conference will have sessions on lakehouses and deep dives into various open source technologies for data management.
  • Performance and scalability: Over 40 sessions covering aspects of scaling and tuning machine learning models, Spark SQL and Spark 3.0, analytics and data platforms, and end-to-end data applications.
  • Open source technologies: We have expanded the program to include dedicated sessions on open source projects in data and machine learning, including a series of technical presentations from contributors of several notable libraries and frameworks.

Come and Join Us

Join the community and enjoy the camaraderie at Spark + AI Summit 2020. Register now to save an extra 20% off the already low early-bird rate: use code BenAndJulesSAI20.

Also check out who is giving keynotes and all the courses offered on two days of expanded training.

--

Try Databricks for free. Get started today.

The post Spark + AI Summit’s Expanded Agenda Announced appeared first on Databricks.


Trust but Verify with Databricks


As enterprises modernize their data infrastructure to make data-driven decisions, teams across the organization become consumers of that platform. Data workloads grow exponentially, the cloud data lake becomes the centralized storage for enterprise-wide functions, and different tools and technologies are used to gain insights from it. For cloud security teams, more services and more users mean more potential vulnerabilities and security threats. They need to ensure that all data access adheres to enterprise governance controls and can be easily monitored and audited. Organizations face the challenge of enabling broader access to data for better business decisions while following a myriad of controls and regulations to prevent unauthorized access and data leaks. Databricks helps address this security challenge by providing visibility into all platform activities through Audit Logs, which, when combined with cloud provider activity logging, become a powerful tracking tool in the hands of security and admin teams.

Databricks Audit Logs

Audit Logging allows enterprise security and admins to monitor all access to data and other cloud resources, which helps to establish an increased level of trust with the users. Security teams gain insight into a host of activities occurring within or from a Databricks workspace, like:

  • Cluster administration
  • Permission management
  • Workspace access via the Web Application or the API
  • And much more…

Databricks architecture highlighting Audit Logs Service and cloud provider security infrastructure.

Audit logs can be configured to be delivered to your cloud storage (Azure / AWS). From there, Databricks or any other log analysis service can be used to find anomalies of interest and integrated with cloud-based notification services to create a seamless alerting workflow. We’ll discuss some scenarios where Databricks audit logs could prove to be a critical security asset.
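To make this concrete, here is a minimal sketch of loading delivered audit logs with PySpark for ad hoc analysis. The storage path and the column names used below (timestamp, serviceName, actionName, userIdentity.email) are assumptions based on a typical JSON audit log delivery; adjust them to match your own delivery configuration.

```python
# Minimal sketch: load delivered Databricks audit logs for ad hoc analysis.
# The path and column names are assumptions; adjust them to match the audit
# log delivery location and schema configured for your workspace.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

audit_logs = (
    spark.read.json("abfss://audit-logs@<storage-account>.dfs.core.windows.net/")  # placeholder path
    # assuming `timestamp` is epoch milliseconds
    .withColumn("date", F.to_date(F.from_unixtime(F.col("timestamp") / 1000)))
)

audit_logs.createOrReplaceTempView("audit_logs")

# Example: which services and actions are invoked most frequently?
spark.sql("""
  SELECT serviceName, actionName, COUNT(*) AS events
  FROM audit_logs
  GROUP BY serviceName, actionName
  ORDER BY events DESC
""").show(20, truncate=False)
```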

Workspace Access Control

Issue: An admin accidentally adds a new user to a group that has an elevated access level that the user should not have been granted. They realize the mistake a few days later and remove the user from the relevant group.

Vulnerability: The admin is worried that the user may have used the workspace in ways they were not entitled to during the period of elevated access. This particular group has the ability to grant permissions across workspace objects. The admin needs to ensure that this user did not share clusters or jobs with other users, which would cause cascading security issues in terms of access control.

Example security measures limiting a user’s workspace access made possible by the use of Databricks Audit Logs.

Solution: The admin can use Databricks Audit Logs to see the exact amount of time that the user was in the wrong group. They can analyze every action the user took during that time period to see if the user took advantage of the higher privileges in the workspace. In this case, if the user did not behave maliciously, the audit logs can prove it.
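As a rough illustration, the query below (continuing from the audit_logs DataFrame loaded earlier) pulls every logged action for a placeholder user over the window in question; the email address, dates and column names are assumptions to adapt to your own logs.

```python
# Continuing from the audit_logs DataFrame loaded in the earlier sketch.
# The email address and time window below are placeholders.
from pyspark.sql import functions as F

user_activity = (
    audit_logs
    .filter(F.col("userIdentity.email") == "user@example.com")
    .filter(F.col("date").between("2020-03-01", "2020-03-05"))
    .select("date", "serviceName", "actionName", "requestParams")
    .orderBy("date")
)

user_activity.show(truncate=False)
```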

Workspace Budget Control

Issue: Databricks admins want to ensure that teams are using the service within allocated budgets. But a few workspace admins have given many of their users broad controls to create and manage clusters.

Vulnerability: The admin is worried that large clusters are being created or existing clusters are being resized due to elevated cluster provisioning controls. This could put relevant teams over their allocated budgets.

With Databricks Audit Logs, a workspace super-admin can monitor overall cluster management, and if necessary, put in place Cluster Policies to restrict the cluster admins’ control over compute resources.

Solution: The admin can use Audit Logs to watch over cluster activities. They can see when cluster creation or resizing occurs, which enables them to notify users to keep within their teams’ allocated budgets. Additionally, the admin can create Cluster Policies to address this concern in the future.
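A sketch of such monitoring is shown below, again building on the audit_logs DataFrame from earlier; the serviceName and actionName values are assumptions about how cluster events are labeled, so verify them against a sample of your own audit data.

```python
# Continuing from the audit_logs DataFrame loaded earlier. The serviceName and
# actionName values are assumptions about how cluster events are labeled in the
# audit logs; verify them against a sample of your own log data.
from pyspark.sql import functions as F

cluster_events = (
    audit_logs
    .filter(F.col("serviceName") == "clusters")
    .filter(F.col("actionName").isin("create", "edit", "resizeCluster"))
    .groupBy("date", F.col("userIdentity.email").alias("user"), "actionName")
    .count()
    .orderBy(F.desc("date"))
)

cluster_events.show(truncate=False)
```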

Cloud Provider Infrastructure Logs

Databricks logging allows security and admin teams to demonstrate conformance to data governance standards within or from a Databricks workspace. Customers, especially in regulated industries, also need records on activities like:

  • User access control to cloud data storage
  • Cloud Identity and Access Management roles
  • User access to cloud network and compute
  • And much more…

Databricks audit logs and integrations with different cloud provider logs provide the necessary proof to meet varying degrees of compliance. We’ll now discuss some of the scenarios where such integrations could be useful.

Data Access Security Controls

Issue: A healthcare company wants to ensure that only verified or allowed users access their sensitive data. This data is in cloud storage, and a group of verified users has access to this bucket. An issue could occur if a verified user shares the Databricks notebook and the cluster used to access this data with a non-verified user.

Vulnerability: The non-verified user could access the sensitive data and potentially use it for nefarious purposes. Administrators need to ensure that there are no loopholes in the data access security controls and verify this with an audit of who is accessing the cloud storage.

With Databricks Audit Logs, a workspace admin can minimize security risk, by identifying and eliminating loopholes allowing unauthorized access to sensitive data.

Solution: The admin could enforce that Databricks users access the cloud storage only from a passthrough-enabled (Azure / AWS) cluster. This ensures that even if a user accidentally shares access to this cluster with a non-verified user, that user will not be able to access the underlying data. An audit of which user is accessing which file or folder can be delivered using the cloud storage access logs capability (Azure / AWS). These logs will prove that only verified users are accessing the classified data, which can be used to meet compliance requirements.

Data Exfiltration Controls

Issue: A group of admins are allowed elevated permissions in the cloud account / subscription to provision Databricks workspaces and manage related cloud infrastructure like security groups, subnets etc. One of these admins updates the outbound security group rules to allow for extra egress locations.

Vulnerability: Users can now access the Databricks workspace and exfiltrate data to this new location that was added to the outbound security group rules.

With Databricks Audit Logs, a workspace admin can prevent data exfiltration by monitoring changes to cloud entitlements and, if necessary, restricting access control to certain cloud services.

Solution: Admins should always follow the best practices of the shared responsibility model in the cloud and assign elevated cloud account/subscription permissions only to a minimum set of authorized superusers. In this case, one could monitor whether such changes are being made, and by whom, using cloud provider activity logs (Azure / AWS). Additionally, admins should configure appropriate access control in the Databricks workspace (Azure / AWS) and can monitor that access in Databricks Audit Logs.

Getting started with workspace auditing

Tracking workspace activity using Databricks audit logs and various cloud provider logs gives security and admin teams the insights they need to let their users access the required data while conforming to enterprise governance controls. Databricks users can also be confident that everything that needs to be audited is, and that they are working in a safe and secure cloud environment. This enables more and better data-driven decisions throughout the organization. Here is a recap of the different types of logs worth looking into:

  • Databricks Audit Logs (Azure / AWS)
  • Cloud Storage Access Logs (Azure / AWS)
  • Cloud Provider Activity Logs / CloudTrail (Azure / AWS)
  • Virtual Network Traffic Flow Logs (Azure / AWS)

We plan to publish deep dives into how to analyze the above types of logs in the near future. Until then, please feel free to reach out to your Databricks account team for any questions.

--

Try Databricks for free. Get started today.

The post Trust but Verify with Databricks appeared first on Databricks.

Creating a Productive Work Environment from Home


Whether working from home is an old habit or the “new normal” for those adhering to COVID-19 protocols, we want to share Bricksters’ tips and tricks of how they create a productive and engaging work environment at home.

Rebekah Uusitalo, Director, Talent Acquisition Operations and Programs (Toronto, ON)

Home office of Rebekah Uusitalo, Director, Talent Operations and Programs, Databricks

My home office with my Supurrrrrvisor

While many companies are shifting to a work from home scenario for the first time, this has been my daily life for the last 12 years. If I’ve learned anything, it’s that relationships are key, they stop at no border and that communication is essential. It also takes a significant amount of self-awareness, discipline and laser-like focus to do it well.

A few tips that I have learned along the way that have made it a connected, productive and rewarding experience:

Be kind to yourself

Telecommuting is not for everyone. You may feel lonely, isolated, frustrated, missing your colleagues or even bored. You may also feel energized, motivated, more relaxed and more in control of your day (no commute will do that). You may feel all — or a combination of — these feelings every day. Observe how you’re feeling and adjust your approach as needed. Transition takes time.

Set boundaries

When blending your home and work life, it’s important to have a space that is dedicated to your work (so your family knows when you’re there, you’re working) and set your hours of work and be disciplined (it’s easy to continue working or jump back online when your laptop is never far away). Establishing boundaries and communicating them upfront to your family, roommates, your manager and your colleagues will make the situation clear from the beginning. Holding yourself accountable will ensure that you remain focused, reach your objectives and remain productive.

Take scheduled breaks

Set some ground rules for how you’ll manage your time and put blocks on your calendar to break up the day. This will avoid getting distracted by mixing work time with other tasks and allow you to stretch, grab a glass of water or something to eat — but try not to loiter in the kitchen (you’ll thank me later!)

Communicate

While working from home can feel isolating, communication can help you overcome this and help to build and maintain strong relationships. Take the time to understand which tools and methods of communication you and your colleagues prefer (email, phone, Slack, video conferencing, text, etc.). It’s not one-size-fits-all, and creating the space for different means of communication can also help in changing up your day (a quick phone call or a walking meeting instead of yet another email may be a refreshing way for you both to connect on a deeper level).

Chessa Vir, Field Marketing Manager (Portland, ME)

Home office of Chessa Vir, Field Marketing Manager, Databricks

Anything is paw-ssible

Having worked remotely for three years, here are my top pieces of advice for a successful WFH day:

  1. Kick things off by reviewing your schedule and prioritizing your tasks so you can structure your day.
  2. Especially during these uncertain times, stay away from watching news throughout the day. Instead, opt for some good tunes or a podcast to keep your focus.
  3. Make time to step outside; a walk with the dogs, a jog or a simple hop to the mailbox while on the phone with your family will do.
  4. Have a dedicated organized workspace and make sure you have all the tools needed to be at your most productive. Do you need a monitor? What about a separate keyboard? Whatever it is, it could make a huge difference.
  5. Don’t be scared about your dogs or kids making a guest appearance. Your co-workers love getting a glimpse into your family life.

Inbar Gam, University Recruiting Program Manager (San Francisco, CA)

Home office of  Inbar Gam, University Recruiting Program Manager, Databricks

Taking a break to get moving with some on-demand yoga

Working from home is new for me and it’s definitely taken some time to adjust. When I’m in the office, I move around throughout the day — walking to meetings, grabbing water or a snack, or going on walks with coworkers. The first few days of working from home I noticed I wasn’t moving around enough and by the end of the day it was hard to focus on work. Since then, I’ve been blocking off times throughout the day to get in a quick 10-15 minute workout. Stepping away from work and moving my body allows me to come back more focused and productive.

Maria Pere-Perez, Director, Strategic Technology Partners (Reno, NV)

Home office of Maria Pere-Perez, Director, Strategic Technology Partners, Databricks

Right meow it’s work time

I’ve been working from home for 12+ years. It starts with great home office mates. For me, they provide a sort of yin-yang balance.

Antonio Gomez, Head of EMEA & APAC Talent Acquisition (London, UK)

Home office of Antonio Gomez, Head of EMEA & APAC Talent Acquisition, Databricks

EMEA Recruiting team singing to “I Will Survive”

Although parts of our team were used to working remotely prior to the current circumstances, it’s been fun to see more of each other on a regular basis via video and to bring some laughter to those meetings as a way to drive team bonding. Most recently, we have introduced a virtual karaoke happy-half-hour at the end of the day on Friday where one of our team members picks a song. We then all sing along karaoke-style — with a drink or two to aid our voices!

Andreana Garcia-Phillips, Field Marketing Manager (San Francisco, CA)

Home office of Andreana Garcia-Phillips, Field Marketing Manager, Databricks

Let’s raise the woof!

As I’ve recently transitioned into a WFH lifestyle, I’ve picked up a few tips and tricks along the way.

  1. Have a dog? Move the dog bed near you for hourly pets and instant mood improvement.
  2. Check in with your team members with the webcam on. Start meetings with a song and dance (when appropriate). Bonus points for background jams in between meetings.
  3. Include workouts, meditation, daily walks with your dog or significant other (I chose dog).
  4. Stay positive, encourage one another and make the best of the situation. Remind yourself how lucky we are to be able to work from home during this time.
  5. Crush it.

Vish Gupta, Marketing Operations Manager (San Jose, CA)

Home office of Vish Gupta, Marketing Operations Manager, Databricks

Creating a makeshift standing desk with a step stool

As an employee who had previously split time between the office and home, switching to fully remote work has made my routine significantly different. Now, I am much more deliberate about my own daily schedule and have found a few things that I do to help me stay energized.

  1. Create a similar setup to your work desk. I’ve found a creative way to turn my home set up into a sit-stand option using my Ikea Step Stool.
  2. Keep the same working hours and start the day as if you’re going to work. Get dressed for work, plan a workout during the morning or after work, eat lunch in the 11am-2pm time frame, and avoid doing housework, like laundry, during the day.
  3. It’s easy to have too much screen time, so find time to sketch, walk, and take a break from your laptop for at least five minutes, multiple times during the day, to avoid eye strain.

Alexa Friedman, Manager, University Recruiting (San Francisco, CA)

Home office of Alexa Friedman, Manager, University Recruiting, Databricks

Databricks’ winter interns playing online Pictionary

The UR team is having fun hosting virtual intern events like online Pictionary and video lunch & learns. It’s important to have a community of people you can enjoy spending time with, even if it’s not in person.

Sonya Vargas, Director of Analyst Relations (San Diego, CA)

Home office of Sonya Vargas, Director of Analyst Relations, Databricks

Getting camera ready for meetings

As someone who has been working from home for over 10 years now, there are a few essential tips that have worked for me.

  1. Wake up at the same time every day. I am a mom of three little girls so that happens by default (most days we are up before we should be).
  2. Get “camera ready”: As a full-time virtual employee, I turn on my camera for all meetings in order to have face time with my colleagues and partners. It’s one of the main ways to feel connected to your team while not being with them in the office.
  3. Move: Set some time aside in your day to move and exercise. Whether it’s a quick yoga flow, HIIT workout, or walk/run in the neighborhood.
  4. Shut down: Aim to shut down your computer at a set time after each work day.

--

Try Databricks for free. Get started today.

The post Creating a Productive Work Environment from Home appeared first on Databricks.

New Methods for Improving Supply Chain Demand Forecasting


Organizations Are Rapidly Embracing Fine-Grained Demand Forecasting

Retailers and Consumer Goods manufacturers are increasingly seeking improvements to their supply chain management in order to reduce costs, free up working capital and create a foundation for omnichannel innovation. Changes in consumer purchasing behavior are placing new strains on the supply chain. Developing a better understanding of consumer demand via a demand forecast is considered a good starting point for most of these efforts, as the demand for products and services drives decisions about labor, inventory management, supply and production planning, freight and logistics, and many other areas.

In Notes from the AI Frontier, McKinsey & Company highlight that a 10 to 20% improvement in retail supply chain forecasting accuracy is likely to produce a 5% reduction in inventory costs and a 2 to 3% increase in revenues. Traditional supply chain forecasting tools have failed to deliver the desired results. With claims of industry-average inaccuracies of 32% in retailer supply chain demand forecasting, the potential impact of even modest forecasting improvements is immense for most retailers. As a result, many organizations are moving away from pre-packaged forecasting solutions, exploring ways to bring demand forecasting skills in-house, and revisiting past practices which compromised forecast accuracy for computational efficiency.

A key focus of these efforts is the generation of forecasts at a finer level of temporal and (location/product) hierarchical granularity. Fine-grain demand forecasts have the potential to capture the patterns that influence demand closer to the level at which that demand must be met. Whereas in the past a retailer might have predicted short-term demand for a class of products at a market or distribution level for a month or week, and then used the forecasted values to determine how many units of a specific product in that class should be placed in a given store on a given day, fine-grain demand forecasting allows forecasters to build more localized models that reflect the dynamics of that specific product in a particular location.

Fine-grain Demand Forecasting Comes with Challenges

As exciting as fine-grain demand forecasting sounds, it comes with many challenges. First, by moving away from aggregate forecasts, the number of forecasting models and predictions which must be generated explodes. The level of processing required is either unattainable by existing forecasting tools, or it greatly exceeds the service windows for this information to be useful. This limitation leads to companies making tradeoffs in the number of categories being processed, or the level of grain in the analysis.

As examined in a prior blog post, Apache Spark can be employed to overcome this challenge, allowing modelers to parallelize the work for timely, efficient execution. When deployed on cloud-native platforms such as Databricks, computational resources can be quickly allocated and then released, keeping the cost of this work within budget.

The second and more difficult challenge to overcome is understanding that demand patterns that exist in aggregate may not be present when examining data at a finer level of granularity. To paraphrase Aristotle, the whole may often be greater than the sum of its parts. As we move to lower levels of detail in our analysis, patterns more easily modeled at higher levels of granularity may no longer be reliably present, making the generation of forecasts with techniques applicable at higher levels more challenging. This problem within the context of forecasting is noted by many practitioners going all the way back to Henri Theil in the 1950s.

As we move closer to the transaction level of granularity, we also need to consider the external causal factors that influence individual customer demand and purchase decisions. In aggregate, these may be reflected in the averages, trends and seasonality that make up a time series but at finer levels of granularity, we may need to incorporate these directly into our forecasting models.

Finally, moving to a finer level of granularity increases the likelihood the structure of our data will not allow for the use of traditional forecasting techniques. The closer we move to the transaction grain, the higher the likelihood we will need to address periods of inactivity in our data. At this level of granularity, our dependent variables, especially when dealing with count data such as units sold, may take on a skewed distribution that’s not amenable to simple transformations and which may require the use of forecasting techniques outside the comfort zone of many Data Scientists.

Accessing the Historical Data

See the Data Preparation notebook for details.

In order to examine these challenges, we will leverage public trip history data from the New York City Bike Share program, also known as Citi Bike NYC. Citi Bike NYC is a company that promises to help people “Unlock a Bike. Unlock New York.” Their service allows people to go to any of more than 850 rental locations throughout the NYC area and rent bikes. The company has an inventory of over 13,000 bikes, with plans to increase the number to 40,000. Citi Bike has well over 100,000 subscribers who make nearly 14,000 rides per day.

Citi Bike NYC reallocates bikes from where they were left to where it anticipates future demand. Citi Bike NYC has a challenge that is similar to what retailers and consumer goods companies deal with on a daily basis. How do we best predict demand to allocate resources to the right areas? If we underestimate demand, we miss revenue opportunities and potentially hurt customer sentiment. If we overestimate demand, we have excess bike inventory sitting unused.

This publicly available dataset provides information on each bicycle rental from the end of the prior month all the way back to the inception of the program in mid-2013. The trip history data identifies the exact time a bicycle is rented from a specific rental station and the time that bicycle is returned to another rental station. If we treat stations in the Citi Bike NYC program as store locations and consider the initiation of a rental as a transaction, we have something closely approximating a long and detailed transaction history with which we can produce forecasts.

As part of this exercise, we will need to identify external factors to incorporate into our modeling efforts. We will leverage both holiday events as well as historical (and predicted) weather data as external influencers. For the holiday dataset, we will simply identify standard holidays from 2013 to present using the holidays library in Python. For the weather data, we will employ hourly extracts from Visual Crossing, a popular weather data aggregator.
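As a small illustration, the sketch below builds a holiday table with the holidays package; the column names follow Prophet’s ds/holiday convention so the same frame can be reused in the models later, and the year range is a placeholder.

```python
# Rough sketch of building the holiday table described above with the Python
# holidays package. Column names follow Prophet's expected `ds` / `holiday`
# convention so the same frame can be reused later; the year range is a placeholder.
import holidays
import pandas as pd

us_holidays = holidays.UnitedStates(years=range(2013, 2021))

holidays_pd = pd.DataFrame(
    sorted(us_holidays.items()), columns=["ds", "holiday"]
)
holidays_pd["ds"] = pd.to_datetime(holidays_pd["ds"])

print(holidays_pd.head())
```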

Citi Bike NYC and Visual Crossing data sets have terms and conditions that prohibit our directly sharing of their data. Those wishing to recreate our results should visit the data providers’ websites, review their Terms & Conditions, and download their datasets to their environments in an appropriate manner. We will provide the data preparation logic required to transform these raw data assets into the data objects used in our analysis.

Examining the Transactional Data

See the Exploratory Analysis notebook for details.

As of January 2020, the Citi Bike NYC bike share program consists of 864 active stations operating in the New York City metropolitan area, primarily in Manhattan. In 2019 alone, a little over 4 million unique rentals were initiated by customers, with nearly 14,000 rentals taking place on peak days.

Sample sales data visualization for Citi Bike NYC, illustrating demand across available rental stations.

Since the start of the program, we can see the number of rentals has increased year over year. Some of this growth is likely due to the increased utilization of the bicycles, but much of it seems to be aligned with the expansion of the overall station network.

Data visualization depicting increasing Citibike NYC rental demand and utilization between 2013 and 2020.
Data visualization depicting increasing Citibike NYC rental demand and utilization between 2013 and 2020.

Normalizing rentals by the number of active stations in the network shows that growth in ridership on a per-station basis has been slowly ticking up for the last few years in what we might consider to be a slight linear upward trend.

 Data visualization depicting increasing Citibike NYC per-station ridership between 2013 and 2020

Using this normalized value for rentals, ridership seems to follow a distinctly seasonal pattern, rising in the Spring, Summer and Fall and then dropping in Winter as the weather outside becomes less conducive to bike riding.

Data visualization depicting the seasonality of Citibike NYC rental demand for years 2013 through 2020

This pattern appears to closely follow patterns in the maximum temperatures (in degrees Fahrenheit) for the city.

Data visualization showing the correlation between higher temperatures and increased demand Citibike NYC bike rentals.

While it can be hard to separate monthly ridership from patterns in temperatures, rainfall (in average monthly inches) does not mirror these patterns quite so readily.

Data visualization for monthly average rainfall illustrates the difficulty in correlating weather to Citibike NYC rental demand.

Examining weekly patterns of ridership with Sunday identified as 1 and Saturday identified as 7, it would appear that New Yorkers are using the bicycles as commuter devices, a pattern seen in many other bike share programs.

Data visualization of Citibike NYC bike ridership by day of week indicates a pattern of commuter utilization seen with many other bike share programs.

Breaking down these ridership patterns by hour of the day, we see distinct weekday patterns where ridership spikes during standard commute hours. On the weekends, patterns indicate more leisurely utilization of the program, supporting our earlier hypothesis.

Data visualization of Citibike NYC bike ridership by hour of day displays rental activity occurring at all hours of the day and night.

An interesting pattern is that holidays, regardless of their day of the week, show consumption patterns that roughly mimic weekend usage. The infrequent occurrence of holidays may explain the erratic nature of these trends. Still, the chart seems to support the idea that identifying holidays is important to producing a reliable forecast.

Data visualization of Citibike NYC bike holiday ridership by hour of day indicates a utilization pattern that roughly mimics weekend usage.

In aggregate, the hourly data appear to show that New York City is truly the city that never sleeps. In reality, there are many stations for which there are a large proportion of hours during which no bicycles are rented.

Data visualization of Citibike NYC bike ridership by number of individual stations that record 1 hour of inactivity during the day illustrates difficulty in using traditional methods to forecast demand.

These gaps in activity can be problematic when attempting to generate a forecast. By moving from 1-hour to 4-hour intervals, the number of periods within which individual stations experience no rental activity drops considerably though there are still many stations that are inactive across this timeframe.

Data visualization of Citibike NYC bike ridership by number of individual stations that record 4 hours of inactivity during the day illustrates the difficulty in using traditional methods to forecast demand.

Instead of sidestepping the problem of inactive periods by moving towards even coarser levels of granularity, we will attempt to make a forecast at the hourly level, exploring how an alternative forecasting technique may help us deal with this dataset. As forecasting for stations that are largely inactive isn’t terribly interesting, we’ll limit our analysis to the top 200 most active stations.

Forecasting Bike Share Rentals with Facebook Prophet

In an initial attempt to forecast bike rentals at the per-station level, we made use of Facebook Prophet, a popular Python library for time series forecasting. The model was configured to explore a linear growth pattern with daily, weekly and yearly seasonal patterns. Periods in the dataset associated with holidays were also identified so that anomalous behavior on these dates would not affect the average, trend and seasonal patterns detected by the algorithm.
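For reference, a single-station version of that configuration might look roughly like the sketch below (using the fbprophet package name current at the time of writing; newer releases ship as prophet). The station_pd DataFrame is a placeholder for a pandas frame of hourly rentals with Prophet’s ds and y columns, and holidays_pd is the holiday frame built earlier.

```python
# Minimal sketch of a per-station Prophet model along the lines described above:
# linear growth, daily/weekly/yearly seasonality, and a holidays frame.
# `station_pd` is a placeholder pandas DataFrame with `ds` (hour) and `y` (rentals).
from fbprophet import Prophet

model = Prophet(
    growth="linear",
    daily_seasonality=True,
    weekly_seasonality=True,
    yearly_seasonality=True,
    holidays=holidays_pd,  # holiday frame built earlier with `ds` and `holiday` columns
)
model.fit(station_pd)

future = model.make_future_dataframe(periods=36, freq="H")  # 36-hour horizon
forecast = model.predict(future)
```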

Using the scale-out pattern documented in the previously referenced blog post, models were trained for the 200 most active stations, and 36-hour forecasts were generated for each. Collectively, the models had a Root Mean Squared Error (RMSE) of 5.44 with a Mean Absolute Percentage Error (MAPE) of 0.73. (Zero-value actuals were adjusted to 1 for the MAPE calculation.)
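The scale-out pattern itself is covered in the referenced blog post, but its general shape looks something like the sketch below, assuming a Spark 3.0+ cluster and a Spark DataFrame named rentals_sdf with station_id, ds and y columns (both names are placeholders).

```python
# Rough sketch of training one Prophet model per station in parallel with Spark.
# Assumes Spark 3.0+ (applyInPandas) and a placeholder Spark DataFrame `rentals_sdf`
# with station_id, ds and y columns; `holidays_pd` is the holiday frame built earlier.
import pandas as pd
from fbprophet import Prophet

result_schema = "station_id INT, ds TIMESTAMP, yhat DOUBLE"

def forecast_station(history_pd: pd.DataFrame) -> pd.DataFrame:
    model = Prophet(daily_seasonality=True, weekly_seasonality=True,
                    yearly_seasonality=True, holidays=holidays_pd)
    model.fit(history_pd)  # extra columns such as station_id are ignored by Prophet
    future = model.make_future_dataframe(periods=36, freq="H")
    forecast = model.predict(future)
    forecast["station_id"] = history_pd["station_id"].iloc[0]
    return forecast[["station_id", "ds", "yhat"]]

forecasts = (
    rentals_sdf
    .groupBy("station_id")
    .applyInPandas(forecast_station, schema=result_schema)
)
```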

These metrics indicate that the models do a reasonably good job of predicting rentals but are missing when hourly rental rates move higher. Visualizing sales data for individual stations, you can see this graphically such as in this chart for Station 518, E 39 St & 2 Ave, which has an RMSE of 4.58 and a MAPE of 0.69:

See the Time Series notebook for details.

Data visualization showing the limitations of using a Facebook Prophet forecasting model configured to explore a linear growth pattern to predict localized demand. The model does a reasonably good job of predicting rentals for an individual Citibike NYC rental station but starts to miss when hourly rental rates move higher.

The model was then adjusted to incorporate temperature and precipitation as regressors. Collectively, the resulting forecasts had an RMSE of 5.35 and a MAPE of 0.72. While a very slight improvement, the models are still having difficulty picking up on the large swings in ridership found at the station level, as demonstrated again by Station 518, which had an RMSE of 4.51 and a MAPE of 0.68:
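A hedged sketch of that adjustment is shown below; it assumes the per-station frame now carries temperature and precipitation columns and that a placeholder weather_pd frame supplies hourly observed and forecasted weather keyed by ds.

```python
# Sketch of the adjustment described above: temperature and precipitation added
# as extra regressors. Assumes `station_pd` now carries `temperature` and
# `precipitation` columns alongside ds and y, and that `weather_pd` (a placeholder)
# holds hourly observed and forecasted weather keyed by `ds`.
from fbprophet import Prophet

model = Prophet(daily_seasonality=True, weekly_seasonality=True,
                yearly_seasonality=True, holidays=holidays_pd)
model.add_regressor("temperature")
model.add_regressor("precipitation")
model.fit(station_pd)

future = model.make_future_dataframe(periods=36, freq="H")
future = future.merge(weather_pd, on="ds", how="left")
forecast = model.predict(future)
```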

See the Time Series with Regressors notebook for details.
Data visualization of a Facebook Prophet forecasting model configured to explore a linear growth pattern, with adjustments to incorporate weather as regressors. While a very slight improvement, the models are still having difficulty picking up on the large swings in ridership found at the Citibike NYC station level.

This pattern of difficulty modeling the higher values in both the time series models is typical of working with data having a Poisson distribution. In such a distribution, we will have a large number of values around an average with a long-tail of values above it. On the other side of the average, a floor of zero leaves the data skewed. Today, Facebook Prophet expects data to have a normal (Gaussian) distribution but plans for the incorporation of Poisson regression have been discussed.

Alternative Approaches to Forecasting Supply Chain Demand

How might we then proceed with generating a forecast for these data? One solution, as the caretakers of Facebook Prophet are considering, is to leverage Poisson regression capabilities in the context of a traditional time series model. While this may be an excellent approach, it is not widely documented so tackling this on our own before considering other techniques may not be the best approach for our needs.

Another potential solution is to model the scale of non-zero values and the frequency of the occurrence of the zero-valued periods. The output of each model can then be combined to assemble a forecast. This method, known as Croston’s method, is supported by the recently released croston Python library while another data scientist has implemented his own function for it. Still, this is not a widely adopted method (despite the technique dating back to the 1970s) and our preference is to explore something a bit more out-of-the-box.
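For readers unfamiliar with the technique, the sketch below illustrates the basic intuition behind Croston’s method with simple exponential smoothing. It is a minimal illustration only, not the API of the croston library mentioned above.

```python
# Minimal illustration of Croston's method: smooth the size of non-zero demands
# and the interval between them separately, then forecast their ratio.
# Not the croston library's API, just the underlying idea, with smoothing factor alpha.
def croston_forecast(demand, alpha=0.1):
    z = None                     # smoothed non-zero demand size
    p = None                     # smoothed inter-demand interval
    periods_since_demand = 1

    for y in demand:
        if y > 0:
            if z is None:        # initialize on the first observed demand
                z, p = y, periods_since_demand
            else:
                z = alpha * y + (1 - alpha) * z
                p = alpha * periods_since_demand + (1 - alpha) * p
            periods_since_demand = 1
        else:
            periods_since_demand += 1

    if z is None:
        return 0.0               # no demand ever observed
    return z / p                 # expected demand per period

# Example: hourly rentals with many zero-valued periods
print(croston_forecast([0, 0, 3, 0, 0, 0, 5, 0, 2, 0, 0, 4]))
```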

Given this preference, a random forest regressor would seem to make quite a bit of sense. Decision trees, in general, do not impose the same constraints on data distribution as many statistical methods. The range of values for the predicted variable is such that it may make sense to transform rentals using something like a square root transformation before training the model, but even then, we might see how well the algorithm performs without it.

To leverage this model, we’ll need to engineer a few features. It’s clear from the exploratory analysis that there are strong seasonal patterns in the data at the annual, weekly and daily levels. This leads us to extract year, month, day of week and hour of the day as features. We may also include a flag for holidays.
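A minimal sketch of that feature engineering and model, assuming a placeholder pandas DataFrame rentals_pd with a datetime column ds, an hourly rental count y and an is_holiday flag, might look like this:

```python
# Sketch of the feature engineering and random forest model described above.
# `rentals_pd` is a placeholder pandas DataFrame with a datetime column `ds`,
# an hourly rental count `y`, and an `is_holiday` flag.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

features = pd.DataFrame({
    "year": rentals_pd["ds"].dt.year,
    "month": rentals_pd["ds"].dt.month,
    "dayofweek": rentals_pd["ds"].dt.dayofweek,
    "hour": rentals_pd["ds"].dt.hour,
    "is_holiday": rentals_pd["is_holiday"].astype(int),
})

# Keep the split chronological rather than shuffled, since this is a forecast.
X_train, X_test, y_train, y_test = train_test_split(
    features, rentals_pd["y"], test_size=0.2, shuffle=False
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```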

Using a random forest regressor and nothing but time-derived features, we arrive at an overall RMSE of 3.4 and MAPE of 0.39. For Station 518, the RMSE and MAPE values are 3.09 and 0.38, respectively:

See Temporal Notebook for details.

By leveraging precipitation and temperature data in combination with some of these same temporal features, we are able to better (though not perfectly) address some of the higher rental values. The RMSE for Station 518 drops to 2.14 and the MAPE to 0.26. Overall, the RMSE drops to 2.37 and the MAPE to 0.26, indicating that weather data is valuable in forecasting demand for bicycles.

See the Random Forest with Temporal & Weather Features notebook for details.

Data visualization of the random forest regressor forecast for an individual Citibike NYC rental station.

Implications of the Results

Demand forecasting at finer levels of granularity may require us to think differently about our approach to modeling. External influencers which may be safely considered summarized in high-level time series patterns may need to be more explicitly incorporated into our models. Patterns in data distribution hidden at the aggregate level may become more readily exposed and necessitate changes in modeling approaches. In this dataset, these challenges were best addressed by the inclusion of hourly weather data and a shift away from traditional time series techniques towards an algorithm which makes fewer assumptions about our input data.

There may be many other external influencers and algorithms worth exploring, and as we go down this path, we may find that some of these work better for some subset of our data than for others. We may also find that as new data arrives, techniques that previously worked well may need to be abandoned and new techniques considered.

A common pattern we are seeing with customers exploring fine-grain demand forecasting is the evaluation of multiple techniques with each training and forecasting cycle, something we might describe as an automated model bake-off. In a bake-off round, the model producing the best results for a given subset of the data wins the round with each subset able to decide its own winning model type. In the end, we want to ensure we are performing good Data Science where our data is properly aligned with the algorithms we employ, but as is noted in article after article, there isn’t always just one solution to a problem and some may work better at one time than at others. The power of what we have available today with platforms like Apache Spark and Databricks is that we have access to the computational capacity to explore all these paths and deliver the best solution to our business.

Additional Retail/CPG and Demand Forecasting Resources

Fine-Grained Time Series Forecasting At Scale With Facebook Prophet And Apache Spark

On-Demand Webinar: Granular Demand Forecasting At Scale

 

--

Try Databricks for free. Get started today.

The post New Methods for Improving Supply Chain Demand Forecasting appeared first on Databricks.

Data Exfiltration Protection with Azure Databricks


In the previous blog, we discussed how to securely access Azure Data Services from Azure Databricks using Virtual Network Service Endpoints or Private Link. Given a baseline of those best practices, in this article we walk through detailed steps on how to harden your Azure Databricks deployment from a network security perspective in order to prevent data exfiltration.

As per Wikipedia: data exfiltration occurs when malware and/or a malicious actor carries out an unauthorized data transfer from a computer. It is also commonly called data extrusion or data exportation, and is considered a form of data theft. Since the year 2000, a number of data exfiltration efforts have severely damaged the consumer confidence, corporate valuation and intellectual property of businesses, as well as the national security of governments across the world. The problem assumes even more significance as enterprises start storing and processing sensitive data (PII, PHI or strategic confidential data) with public cloud services.

Solving for data exfiltration can become an unmanageable problem if the PaaS service requires you to store your data with the provider or processes the data in the provider’s network. But with Azure Databricks, our customers get to keep all data in their Azure subscription and process it in their own managed private virtual network(s), all while preserving the PaaS nature of the fastest growing Data & AI service on Azure. We’ve come up with a secure deployment architecture for the platform while working with some of our most security-conscious customers, and it’s time that we share it out broadly.

High-level Data Exfiltration Protection Architecture

We recommend a hub and spoke topology styled reference architecture. The hub virtual network houses the shared infrastructure required to connect to validated sources and optionally to an on-premises environment. And the spoke virtual networks peer with the hub, while housing isolated Azure Databricks workspaces for different business units or segregated teams.

High-level view of art of the possible:

High-level view of the architecture recommended to prevent data exfiltration and secure sensitive information.

Following are high-level steps to set up a secure Azure Databricks deployment (see corresponding diagram below):

  1. Deploy Azure Databricks in a spoke virtual network using VNet injection (azuredatabricks-spoke-vnet in below diagram)
  2. Set up Private Link endpoints for your Azure Data Services in a separate subnet within the Azure Databricks spoke virtual network (privatelink-subnet in the diagram below). This ensures that all workload data is accessed securely over the Azure network backbone with default data exfiltration protection in place (see this for more). In general, it is also fine to deploy these endpoints in another virtual network that is peered to the one hosting the Azure Databricks workspace.
  3. Optionally, set up an Azure SQL database as an External Hive Metastore to serve as the primary metastore for all clusters in the workspace, overriding the consolidated metastore housed in the control plane.
  4. Deploy Azure Firewall (or other Network Virtual Appliance) in a hub virtual network (shared-infra-hub-vnet in below diagram). With Azure Firewall, you could configure:

Application rules that define fully qualified domain names (FQDNs) that are accessible through the firewall. Some Azure Databricks required traffic could be whitelisted using the application rules.

Network rules that define IP address, port and protocol for endpoints that can’t be configured using FQDNs. Some of the required Azure Databricks traffic needs to be whitelisted using the network rules.

Some of our customers prefer to use a third-party firewall appliance instead of Azure Firewall, which works generally fine. Though please note that each product has its own nuances and it’s better to engage relevant product support and network security teams to troubleshoot any pertinent issues.

  5. Set up a Service Endpoint to Azure Storage for the Azure Firewall subnet, such that all traffic to whitelisted in-region or in-paired-region storage goes over the Azure network backbone (this includes endpoints in the Azure Databricks control plane if the customer data plane region is the same or paired).

  6. Create a user-defined route table with the following rules and attach it to Azure Databricks subnets.
Name Address Next Hop Purpose
to-databricks-control-plane-NAT Based on the region where you’ve deployed Azure Databricks workspace, select control plane NAT IP from here Internet Required to provision Azure Databricks Clusters in your private network
to-firewall 0.0.0.0/0 Azure Firewall Private IP Default quad-zero route for all other traffic
  7. Configure virtual network peering between the Azure Databricks spoke and Azure Firewall hub virtual networks.

High-level steps recommended to set up a secure Azure Databricks deployment.

Such a hub-and-spoke architecture allows creating multiple spoke VNets for different purposes and teams. That said, we’ve seen some of our customers implement isolation by creating separate subnets for different teams within a large contiguous virtual network. In such instances, it is entirely possible to set up multiple isolated Azure Databricks workspaces in their own subnet pairs and deploy Azure Firewall in another sister subnet within the same virtual network.

We’ll now discuss the above setup in more detail.

Secure Azure Databricks Deployment Details

Prerequisites

Please take note of the Azure Databricks control plane endpoints for your workspace from here (mapped based on the region of your workspace). We’ll need these details to configure Azure Firewall rules later.

Name Source Destination Protocol:Port Purpose
databricks-webapp Azure Databricks workspace subnets Region specific Webapp Endpoint tcp:443 Communication with Azure Databricks webapp
databricks-log-blob-storage Azure Databricks workspace subnets Region specific Log Blob Storage Endpoint https:443 To store Azure Databricks audit and cluster logs (anonymized / masked) for support and troubleshooting
databricks-artifact-blob-storage Azure Databricks workspace subnets Region specific Artifact Blob Storage Endpoint https:443 Stores Databricks Runtime images to be deployed on cluster nodes
databricks-observability-eventhub Azure Databricks workspace subnets Region specific Observability Event Hub Endpoint tcp:9093 Transit for Azure Databricks on-cluster service specific telemetry
databricks-dbfs Azure Databricks workspace subnets DBFS Blob Storage Endpoint https:443 Azure Databricks workspace root storage
databricks-sql-metastore
(OPTIONAL – please see Step 3 for External Hive Metastore below)
Azure Databricks workspace subnets Region specific SQL Metastore Endpoint tcp:3306 Stores metadata for databases and child objects in a Azure Databricks workspace

Step 1: Deploy Azure Databricks Workspace in your virtual network

The default deployment of Azure Databricks creates a new virtual network (with two subnets) in a resource group managed by Databricks. To make the necessary customizations for a secure deployment, the workspace data plane should be deployed in your own virtual network. This quickstart shows how to do that in a few easy steps. Before that, you should create a virtual network named azuredatabricks-spoke-vnet with address space 10.2.1.0/24 in resource group adblabs-rg (names and address space are specific to this test setup).

Step 1 for setting up a secure Azure Databricks deployment: deploying Azure Databricks in your virtual network.

Referring to Azure Databricks deployment documentation:

  • From the Azure portal menu, select Create a resource. Then select Analytics > Databricks.
  • Under Azure Databricks Service, apply the following settings:
Setting Suggested value Description
Workspace name adblabs-ws Select a name for your Azure Databricks workspace.
Subscription “Your subscription” Select the Azure subscription that you want to use.
Resource group adblabs-rg Select the same resource group you used for the virtual network.
Location Central US Choose the same location as your virtual network.
Pricing Tier Premium For more information on pricing tiers, see the Azure Databricks pricing page.
  • Once you’ve finished entering basic settings, select Next: Networking > and apply the following settings:
Setting Value Description
Deploy Azure Databricks workspace in your Virtual Network (VNet) Yes This setting allows you to deploy an Azure Databricks workspace in your virtual network.
Virtual Network azuredatabricks-spoke-vnet Select the virtual network you created earlier.
Public Subnet Name public-subnet Use the default public subnet name, you could use any name though.
Public Subnet CIDR Range 10.2.1.64/26 Use a CIDR range up to and including /26.
Private Subnet Name private-subnet Use the default private subnet name, you could use any name though.
Private Subnet CIDR Range 10.2.1.128/26 Use a CIDR range up to and including /26.

Click Review and Create. Few things to note:

  • The virtual network must include two subnets dedicated to each Azure Databricks workspace: a private subnet and public subnet (feel free to use a different nomenclature). The public subnet is the source of a private IP for each cluster node’s host VM. The private subnet is the source of a private IP for the Databricks Runtime container deployed on each cluster node. It indicates that each cluster node has two private IP addresses today.
  • Each workspace subnet size is allowed to be anywhere from /18 to /26, and the actual sizing will be based on forecasting for the overall workloads per workspace. The address space could be arbitrary (including non RFC 1918 ones), but it must align with the enterprise on-premises plus cloud network strategy.
  • Azure Databricks will create these subnets for you when you deploy the workspace using Azure portal and will perform subnet delegation to the Microsoft.Databricks/workspaces service. That allows Azure Databricks to create the required Network Security Group (NSG) rules. Azure Databricks will always give advance notice if we need to add or update the scope of an Azure Databricks-managed NSG rule. Please note that if these subnets already exist, the service will use those as such.
  • There is a one-to-one relationship between these subnets and an Azure Databricks workspace. You cannot share multiple workspaces across the same subnet pair, and must use a new subnet pair for each different workspace.
  • Notice the resource group and managed resource group in the Azure Databricks resource overview page on Azure portal. You cannot create any resources in the managed resource group, nor can you edit any existing ones.

Step 2: Set up Private Link Endpoints

As discussed in the Securely Accessing Azure Data Services blog, we’ll use Azure Private Link to securely connect the previously created Azure Databricks workspace to your Azure Data Services. We do not recommend setting up access to such data services through a network virtual appliance / firewall, as that has the potential to adversely impact the performance of big data workloads and the intermediate infrastructure.

Please create a subnet privatelink-subnet with address space 10.2.1.0/26 in the virtual network azuredatabricks-spoke-vnet.

Step 2 for setting up a secure Azure Databricks deployment: setting up Private Link Endpoints

For the test setup, we’ll deploy a sample storage account and then create a Private Link endpoint for that. Referring to the setting up private link documentation:

  • On the upper-left side of the screen in the Azure portal, select Create a resource > Storage > Storage account.
  • In Create storage account – Basics, enter or select this information:
Setting Value
PROJECT DETAILS
Subscription Select your subscription.
Resource group Select adblabs-rg. You created this in the previous section.
INSTANCE DETAILS
Storage account name Enter myteststorageaccount. If this name is taken, please provide a unique name.
Region Select Central US (or the same region you used for Azure Databricks workspace and virtual network).
Performance Leave the default Standard.
Replication Select Read-access geo-redundant storage (RA-GRS).

Select Next:Networking >

  • In Create a storage account – Networking, connectivity method, select Private Endpoint.
  • In Create a storage account – Networking, select Add Private Endpoint.
  • In Create Private Endpoint, enter or select this information:
Setting Value
PROJECT DETAILS
Subscription Select your subscription.
Resource group Select adblabs-rg. You created this in the previous section.
Location Select Central US (or the same region you used for Azure Databricks workspace and virtual network).
Name Enter myStoragePrivateEndpoint.
Storage sub-resource Select dfs.
NETWORKING
Virtual network Select azuredatabricks-spoke-vnet from resource group adblabs-rg.
Subnet Select privatelink-subnet.
PRIVATE DNS INTEGRATION
Integrate with private DNS zone Leave the default Yes.
Private DNS zone Leave the default (New) privatelink.dfs.core.windows.net.

Select OK.

  • Select Review + create. You’re taken to the Review + create page where Azure validates your configuration.
  • When you see the Validation passed message, select Create.
  • Browse to the storage account resource that you just created.

It’s possible to create more than one Private Link endpoint for supported Azure Data Services. To configure such endpoints for additional services, please refer to the relevant Azure documentation.

Step 3: Set up External Hive Metastore

Provision Azure SQL database

This step is optional. By default, the consolidated regional metastore is used for the Azure Databricks workspace. Please skip to the next step if you would like to avoid managing an Azure SQL database for this end-to-end deployment.

Step 3 for setting up a secure Azure Databricks deployment: setting up external hive metastore.

Referring to provisioning an Azure SQL database documentation, please provision an Azure SQL database which we will use as an external hive metastore for the Azure Databricks workspace.

  • On the upper-left side of the screen in the Azure portal, select Create a resource > Databases > SQL database.
  • In Create SQL database – Basics, enter or select this information:

Setting Value
DATABASE DETAILS
Subscription Select your subscription.
Resource group Select adblabs-rg. You created this in the previous section.
INSTANCE DETAILS
Database name Enter myhivedatabase. If this name is taken, please provide a unique name.
  • In Server, select Create new.
  • In New server, enter or select this information:
Setting Value
Server name Enter mysqlserver. If this name is taken, please provide a unique name.
Server admin login Enter an administrator name of your choice.
Password Enter a password of your choice. The password must be at least 8 characters long and meet the defined requirements.
Location Select Central US (or the same region you used for Azure Databricks workspace and virtual network).

Select OK.

  • Select Review + create. You’re taken to the Review + create page where Azure validates your configuration.
  • When you see the Validation passed message, select Create.
Create a Private Link endpoint

In this section, you will add a Private Link endpoint for the Azure SQL database created above, referring to this source:

  • On the upper-left side of the screen in the Azure portal, select Create a resource > Networking > Private Link Center.
  • In Private Link Center – Overview, on the option to Build a private connection to a service, select Start.
  • In Create a private endpoint – Basics, enter or select this information:
Setting Value
PROJECT DETAILS
Subscription Select your subscription.
Resource group Select adblabs-rg. You created this in the previous section.
INSTANCE DETAILS
Name Enter mySqlDBPrivateEndpoint. If this name is taken, please provide a unique name.
Region Select Central US (or the same region you used for Azure Databricks workspace and virtual network).
Select Next: Resource

In Create a private endpoint – Resource, enter or select this information:

Setting Value
Connection method Select connect to an Azure resource in my directory.
Subscription Select your subscription.
Resource type Select Microsoft.Sql/servers.
Resource Select mysqlserver
Target sub-resource Select sqlServer

Select Next: Configuration

In Create a private endpoint – Configuration, enter or select this information:

Setting Value
NETWORKING
Virtual network Select azuredatabricks-spoke-vnet
Subnet Select privatelink-subnet
PRIVATE DNS INTEGRATION
Integrate with private DNS zone Select Yes.
Private DNS Zone Select (New)privatelink.database.windows.net
  • Select Review + create. You’re taken to the Review + create page where Azure validates your configuration.
  • When you see the Validation passed message, select Create.
Configure External Hive Metastore
  • From Azure Portal, search for the adblabs-rg resource group
  • Go to Azure Databricks workspace resource
  • Click Launch Workspace
  • Please follow the instructions documented here to configure the Azure SQL database created above as an external hive metastore for the Azure Databricks workspace; a configuration sketch follows below.
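
For orientation only, the cluster Spark configuration for such an external Hive metastore generally follows the pattern below. This is a hedged sketch, not a drop-in configuration: it reuses the mysqlserver and myhivedatabase names from this walkthrough, the angle-bracket values are placeholders, and the exact keys, supported Hive versions and JDBC settings should be taken from the documentation linked above. Credentials are best referenced through Databricks secrets rather than typed in plain text.

spark.sql.hive.metastore.version <hive-version>
spark.sql.hive.metastore.jars maven
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://mysqlserver.database.windows.net:1433;database=myhivedatabase
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionUserName <server-admin-login>
spark.hadoop.javax.jdo.option.ConnectionPassword <password>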

Step 4: Deploy Azure Firewall

We recommend Azure Firewall as a scalable cloud firewall to act as the filtering device for Azure Databricks control plane traffic, DBFS Storage, and any allowed public endpoints that should be accessible from your Azure Databricks workspace.

Step 4 for setting up a secure Azure Databricks deployment: deploying Azure firewall with relevant rules

Referring to the documentation for configuring an Azure Firewall, you could deploy Azure Firewall into a new virtual network. Please create the virtual network named hub-vnet with address space 10.3.1.0/24 in resource group adblabs-rg (names and address space are specific to this test setup). Also create a subnet named AzureFirewallSubnet with address space 10.3.1.0/26 in hub-vnet.

  • On the Azure portal menu or from the Home page, select Create a resource.
  • Type firewall in the search box and press Enter.
  • Select Firewall and then select Create.
  • On the Create a Firewall page, use the following table to configure the firewall:
  • Subscription: Your subscription.
  • Resource group: adblabs-rg.
  • Name: firewall.
  • Location: Select Central US (or the same region you used for the Azure Databricks workspace and virtual network).
  • Choose a virtual network: Use existing: hub-vnet.
  • Public IP address: Add new. The Public IP address must be the Standard SKU type. Name it fw-public-ip.
  • Select Review + create.
  • Review the summary, and then select Create to deploy the firewall.
  • This will take a few minutes.
  • After the deployment completes, go to the adblabs-rg resource group, and select the firewall
  • Note the private IP address. You’ll use it later when you create the custom default route from Azure Databricks subnets.
Configure Azure Firewall Rules

With Azure Firewall, you can configure:

  • Application rules that define fully qualified domain names (FQDNs) that can be accessed from a subnet.
  • Network rules that define source address, protocol, destination port, and destination address.
  • Network traffic is subjected to the configured firewall rules when you route your network traffic to the firewall as the subnet default gateway.
Configure Application Rule

We first need to configure application rules to allow outbound access to Log Blob Storage and Artifact Blob Storage endpoints in the Azure Databricks control plane plus the DBFS Root Blob Storage for the workspace.

  • Go to the resource group adblabs-rg, and select the firewall.
  • On the firewall page, under Settings, select Rules.
  • Select the Application rule collection tab.
  • Select Add application rule collection.
  • For Name, type databricks-control-plane-services.
  • For Priority, type 200.
  • For Action, select Allow.
  • Configure the following in Rules -> Target FQDNs
  • databricks-spark-log-blob-storage: Source type IP Address; Source: Azure Databricks workspace subnets (10.2.1.128/26, 10.2.1.64/26); Protocol:Port https:443; Target FQDNs: refer to notes from Prerequisites above (for Central US).
  • databricks-audit-log-blob-storage: Source type IP Address; Source: Azure Databricks workspace subnets (10.2.1.128/26, 10.2.1.64/26); Protocol:Port https:443; Target FQDNs: refer to notes from Prerequisites above (for Central US). This is separate log storage only for US regions today.
  • databricks-artifact-blob-storage: Source type IP Address; Source: Azure Databricks workspace subnets (10.2.1.128/26, 10.2.1.64/26); Protocol:Port https:443; Target FQDNs: refer to notes from Prerequisites above (for Central US).
  • databricks-dbfs: Source type IP Address; Source: Azure Databricks workspace subnets (10.2.1.128/26, 10.2.1.64/26); Protocol:Port https:443; Target FQDNs: refer to notes from Prerequisites above.
  • Public Repositories for Python and R Libraries (OPTIONAL, only if workspace users are allowed to install libraries from public repos): Source type IP Address; Source: 10.2.1.128/26, 10.2.1.64/26; Protocol:Port https:443; Target FQDNs: *pypi.org, *pythonhosted.org, cran.r-project.org. Add any other public repos as desired.

Configure Network Rule

Some endpoints can’t be configured as application rules using FQDNs. So we’ll set those up as network rules, namely the Observability Event Hub and Webapp.

  • Open the resource group adblabs-rg, and select the firewall.
  • On the firewall page, under Settings, select Rules.
  • Select the Network rule collection tab.
  • Select Add network rule collection.
  • For Name, type databricks-control-plane-services.
  • For Priority, type 200.
  • For Action, select Allow.
  • Configure the following in Rules -> IP Addresses.
  • databricks-webapp: Protocol TCP; Source type IP Address; Source: Azure Databricks workspace subnets (10.2.1.128/26, 10.2.1.64/26); Destination type IP Address; Destination Address: refer to notes from Prerequisites above (for Central US); Destination Ports: 443.
  • databricks-observability-eventhub: Protocol TCP; Source type IP Address; Source: Azure Databricks workspace subnets (10.2.1.128/26, 10.2.1.64/26); Destination type IP Address; Destination Address: refer to notes from Prerequisites above (for Central US); Destination Ports: 9093.
  • databricks-sql-metastore (OPTIONAL, please see Step 3 for External Hive Metastore above): Protocol TCP; Source type IP Address; Source: Azure Databricks workspace subnets (10.2.1.128/26, 10.2.1.64/26); Destination type IP Address; Destination Address: refer to notes from Prerequisites above (for Central US); Destination Ports: 3306.
Configure Virtual Network Service Endpoints
  • On the hub-vnet page, click Service endpoints and then Add
  • From Services, select Microsoft.Storage.
  • In Subnets, select AzureFirewallSubnet

Configuring Virtual Network Service Endpoints

The service endpoint allows traffic from AzureFirewallSubnet to Log Blob Storage, Artifact Blob Storage, and DBFS Storage to travel over the Azure network backbone, thus eliminating exposure to public networks.

If users are going to access Azure Storage using Service Principals, then we recommend creating an additional service endpoint from Azure Databricks workspace subnets to Microsoft.AzureActiveDirectory.

Step 5: Create User Defined Routes (UDRs)

At this point, the majority of the infrastructure setup for a secure, locked-down deployment has been completed. We now need to route appropriate traffic from Azure Databricks workspace subnets to the Control Plane NAT IP (see FAQ below) and Azure Firewall setup earlier.

Step 5 for setting up a secure Azure Databricks deployment: creating User Defined Routes (UDRs)

Referring to the documentation for user defined routes:

  • On the Azure portal menu, select All services and search for Route Tables. Go to that section.
  • Select Add
  • For Name, type firewall-route.
  • For Subscription, select your subscription.
  • For the Resource group, select adblabs-rg.
  • For Location, select the same location that you used previously, i.e. Central US.
  • Select Create.
  • Select Refresh, and then select the firewall-route route table.
  • Select Routes and then select Add.
  • For Route name, add to-firewall.
  • For Address prefix, add 0.0.0.0/0.
  • For Next hop type, select Virtual appliance.
  • For the Next hop address, add the Private IP address for the Azure Firewall that you noted earlier.
  • Select OK.

Now add one more route for Azure Databricks Control Plane NAT.

  • Select Routes and then select Add.
  • For Route name, add to-central-us-databricks-control-plane.
  • For Address prefix, add the Control Plane NAT IP address for Central US from here.
  • For Next hop type, select Internet (see the FAQ below for why).
  • Select OK.

The route table needs to be associated with both of the Azure Databricks workspace subnets.

  • Go to the firewall-route route table.
  • Select Subnets and then select Associate.
  • Select Virtual network > azuredatabricks-spoke-vnet.
  • For Subnet, select both workspace subnets.
  • Select OK.

Step 6: Configure VNET Peering

We are now at the last step. The virtual networks azuredatabricks-spoke-vnet and hub-vnet need to be peered so that the route table configured earlier can work properly.

Step 6 for setting up a secure Azure Databricks deployment: configuring VNET peering

Referring to the documentation for configuring VNET peering:

In the search box at the top of the Azure portal, enter virtual networks. When Virtual networks appears in the search results, select it.

  • Go to hub-vnet.
  • Under Settings, select Peerings.
  • Select Add, and enter or select values as follows:
  • Name of the peering from hub-vnet to remote virtual network: from-hub-vnet-to-databricks-spoke-vnet
  • Virtual network deployment model: Resource Manager
  • Subscription: Select your subscription
  • Virtual network: azuredatabricks-spoke-vnet, or select the VNET where Azure Databricks is deployed
  • Name of the peering from remote virtual network to hub-vnet: from-databricks-spoke-vnet-to-hub-vnet
  • Leave the rest of the values at their defaults and click OK

The setup is now complete.

Step 7: Validate Deployment

It’s time to put everything to the test now.

If data access works without any issues, you’ve accomplished an optimally secure deployment of Azure Databricks in your subscription. This was quite a bit of manual work, but it was meant as a one-time showcase. In practice, you would want to automate such a setup using a combination of ARM templates, the Azure CLI, the Azure SDK, etc.

Common Questions with Data Exfiltration Protection Architecture

Can I use service endpoint policies to secure data egress to Azure Data Services?

Service Endpoint Policies allow you to filter virtual network traffic to only specific Azure Data Service instances over Service Endpoints. Endpoint policies cannot be applied to Azure Databricks workspace subnets, or to other managed Azure services that have resources in a management or control plane subscription. Hence we cannot use this feature.

Can I use Network Virtual Appliance (NVA) other than Azure Firewall?

Yes, you could use a third-party NVA as long as network traffic rules are configured as discussed in this article. Please note that we have tested this setup with Azure Firewall only, though some of our customers use other third-party appliances. It’s ideal to deploy the appliance in the cloud rather than on-premises.

Can I have a firewall subnet in the same virtual network as Azure Databricks?

Yes, you can. As per the Azure reference architecture, it is advisable to use a hub-spoke virtual network topology to plan better for the future. Should you choose to create the Azure Firewall subnet in the same virtual network as the Azure Databricks workspace subnets, you wouldn’t need to configure virtual network peering as discussed in Step 6 above.

Can I filter Azure Databricks control plane NAT traffic through Azure Firewall?

To bootstrap Azure Databricks clusters, the control plane initiates communication with the virtual machines in your subscription. If the control plane NAT traffic is configured to be sent through the firewall, the acknowledgement for the incoming TCP message would be sent via that route, which creates asymmetric routing and causes cluster bootstrap to fail. Thus the control plane NAT traffic needs to be routed directly over the public network, as discussed in Step 5 above.

Can I analyze accepted or blocked traffic by Azure Firewall?

We recommend using Azure Firewall Logs and Metrics for that requirement.

Getting Started with Data Exfiltration Protection with Azure Databricks

We discussed utilizing cloud-native security controls to implement data exfiltration protection for your Azure Databricks deployments, all of which could be automated to enable data teams at scale. There are other things that you may want to consider and implement as part of this project as well.

Please reach out to your Microsoft or Databricks account team for any questions.

--

Try Databricks for free. Get started today.

The post Data Exfiltration Protection with Azure Databricks appeared first on Databricks.

10 Minutes from pandas to Koalas on Apache Spark


This is a guest community post from Haejoon Lee, a software engineer at Mobigen in South Korea and a Koalas contributor.

pandas is a great tool to analyze small datasets on a single machine. When the need for bigger datasets arises, users often choose PySpark. However, converting code from pandas to PySpark is not easy, as PySpark APIs are considerably different from pandas APIs. Koalas makes the learning curve significantly easier by providing pandas-like APIs on top of PySpark. With Koalas, users can take advantage of the benefits of PySpark with minimal effort, and thus get to value much faster.

A number of blog posts such as Koalas: Easy Transition from pandas to Apache Spark, How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas, and 10 minutes to Koalas in Koalas official docs have demonstrated the ease of conversion between pandas and Koalas. However, despite having the same APIs, there are subtleties when working in a distributed environment that may not be obvious to pandas users. In addition, only about 70% of pandas APIs are implemented in Koalas. While the open-source community is actively implementing the remaining pandas APIs in Koalas, users need to use PySpark to work around the missing ones. Finally, Koalas also offers its own APIs such as to_spark(), DataFrame.map_in_pandas(), ks.sql(), etc. that can significantly improve user productivity.

Therefore, Koalas is not meant to completely replace the needs for learning PySpark. Instead, Koalas makes learning PySpark much easier by offering pandas-like functions. To be proficient in Koalas, users would need to understand the basics of Spark and some PySpark APIs. In fact, we find that users using Koalas and PySpark interchangeably tend to extract the most value from Koalas.

In particular, two types of users benefit the most from Koalas:

  • pandas users who want to scale out using PySpark and potentially migrate codebase to PySpark. Koalas is scalable and makes learning PySpark much easier
  • Spark users who want to leverage Koalas to become more productive. Koalas offers pandas-like functions so that users don’t have to build these functions themselves in PySpark

This blog post will not only demonstrate how easy it is to convert code written in pandas to Koalas, but also discuss best practices for using Koalas: when to use Koalas as a drop-in replacement for pandas, how to use PySpark to work around pandas APIs that are not yet available in Koalas, and when to apply Koalas-specific APIs to improve productivity. The example notebook in this blog can be found here.

Distributed and Partitioned Koalas DataFrame

Even though you can apply the same APIs in Koalas as in pandas, under the hood a Koalas DataFrame is very different from a pandas DataFrame. A Koalas DataFrame is distributed, which means the data is partitioned and computed across different workers. On the other hand, all the data in a pandas DataFrame fits in a single machine. As you will see, this difference leads to different behaviors.

Migration from pandas to Koalas

This section will describe how Koalas supports easy migration from pandas to Koalas with various code examples.

Object Creation

The packages below are customarily imported in order to use Koalas. Technically, packages like numpy or pandas are not necessary, but they allow users to utilize Koalas more flexibly.

import numpy as np
import pandas as pd
import databricks.koalas as ks

A Koalas Series can be created by passing a list of values, the same way as a pandas Series. A Koalas Series can also be created by passing a pandas Series.

# Create a pandas Series
pser = pd.Series([1, 3, 5, np.nan, 6, 8]) 
# Create a Koalas Series
kser = ks.Series([1, 3, 5, np.nan, 6, 8])
# Create a Koalas Series by passing a pandas Series
kser = ks.Series(pser)
kser = ks.from_pandas(pser)

Best Practice: As shown below, unlike pandas, Koalas does not guarantee the order of indices. This is because almost all operations in Koalas run in a distributed manner. You can use Series.sort_index() if you want ordered indices.

>>> pser
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
>>> kser
3    NaN
2    5.0
1    3.0
5    8.0
0    1.0
4    6.0
Name: 0, dtype: float64
# Apply sort_index() to a Koalas series
>>> kser.sort_index() 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
Name: 0, dtype: float64

A Koalas DataFrame can also be created by passing a NumPy array, the same way as a pandas DataFrame. Unlike a PySpark DataFrame, a Koalas DataFrame has an index, so the index of a pandas DataFrame is preserved when a Koalas DataFrame is created from it.

# Create a pandas DataFrame
pdf = pd.DataFrame({'A': np.random.rand(5),
                    'B': np.random.rand(5)})
# Create a Koalas DataFrame
kdf = ks.DataFrame({'A': np.random.rand(5),
                    'B': np.random.rand(5)})
# Create a Koalas DataFrame by passing a pandas DataFrame
kdf = ks.DataFrame(pdf)
kdf = ks.from_pandas(pdf)

Likewise, the order of indices can be sorted by DataFrame.sort_index().

>>> pdf
          A         B
0  0.015869  0.584455
1  0.224340  0.632132
2  0.637126  0.820495
3  0.810577  0.388611
4  0.037077  0.876712
>>> kdf.sort_index()
          A         B
0  0.015869  0.584455
1  0.224340  0.632132
2  0.637126  0.820495
3  0.810577  0.388611
4  0.037077  0.876712

Viewing Data

As with a pandas DataFrame, the top rows of a Koalas DataFrame can be displayed using DataFrame.head(). Confusion often occurs when converting from pandas to PySpark due to the different behavior of head() in pandas and PySpark, but Koalas supports it in the same way as pandas by using PySpark’s limit() under the hood.

>>> kdf.head(2)
          A         B
0  0.015869  0.584455
1  0.224340  0.632132

A quick statistical summary of a Koalas DataFrame can be displayed using DataFrame.describe().

>>> kdf.describe()
              A         B
count  5.000000  5.000000
mean   0.344998  0.660481
std    0.360486  0.195485
min    0.015869  0.388611
25%    0.037077  0.584455
50%    0.224340  0.632132
75%    0.637126  0.820495
max    0.810577  0.876712

Sorting a Koalas DataFrame can be done using DataFrame.sort_values().

>>> kdf.sort_values(by='B')
          A         B
3  0.810577  0.388611
0  0.015869  0.584455
1  0.224340  0.632132
2  0.637126  0.820495
4  0.037077  0.876712

Transposing a Koalas DataFrame can be done using DataFrame.transpose().

>>> kdf.transpose()
          0         1         2         3         4
A  0.015869  0.224340  0.637126  0.810577  0.037077
B  0.584455  0.632132  0.820495  0.388611  0.876712

Best Practice: DataFrame.transpose() will fail when the number of rows is more than the value of compute.max_rows, which is set to 1000 by default. This is to prevent users from unknowingly executing expensive operations. In Koalas, you can easily reset the default compute.max_rows. See the official docs for DataFrame.transpose() for more details.

>>> from databricks.koalas.config import set_option, get_option
>>> ks.get_option('compute.max_rows')
1000
>>> ks.set_option('compute.max_rows', 2000)
>>> ks.get_option('compute.max_rows')
2000

Selecting or Accessing Data

As with a pandas DataFrame, selecting a single column from a Koalas DataFrame returns a Series.

>>> kdf['A']  # or kdf.A
0    0.015869
1    0.224340
2    0.637126
3    0.810577
4    0.037077
Name: A, dtype: float64

Selecting multiple columns from a Koalas DataFrame returns a Koalas DataFrame.

>>> kdf[['A', 'B']]
          A         B
0  0.015869  0.584455
1  0.224340  0.632132
2  0.637126  0.820495
3  0.810577  0.388611
4  0.037077  0.876712

Slicing is available for selecting rows from a Koalas DataFrame.

>>> kdf.loc[1:2]
          A         B
1  0.224340  0.632132
2  0.637126  0.820495

Slicing rows and columns is also available.

>>> kdf.iloc[:3, 1:2]
          B
0  0.584455
1  0.632132
2  0.820495

Best Practice: By default, Koalas disallows adding columns coming from different DataFrames or Series to a Koalas DataFrame as adding columns requires join operations which are generally expensive. This operation can be enabled by setting compute.ops_on_diff_frames to True. See Available options in the docs for more detail.

>>> kser = ks.Series([100, 200, 300, 400, 500], index=[0, 1, 2, 3, 4])
>>> kdf['C'] = kser


...
ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
# Those are needed for managing options
>>> from databricks.koalas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> kdf['C'] = kser
# Reset to default to avoid potential expensive operation in the future
>>> reset_option("compute.ops_on_diff_frames")
>>> kdf
          A         B    C
0  0.015869  0.584455  100
1  0.224340  0.632132  200
3  0.810577  0.388611  400
2  0.637126  0.820495  300
4  0.037077  0.876712  500

Applying a Python Function to Koalas DataFrame

DataFrame.apply() is a very powerful function favored by many pandas users. Koalas DataFrames also support this function.

>>> kdf.apply(np.cumsum)
          A         B     C
0  0.015869  0.584455   100
1  0.240210  1.216587   300
3  1.050786  1.605198   700
2  1.687913  2.425693  1000
4  1.724990  3.302404  1500

DataFrame.apply() also works for axis = 1 or ‘columns’ (0 or ‘index’ is the default).

>>> kdf.apply(np.cumsum, axis=1)
          A         B           C
0  0.015869  0.600324  100.600324
1  0.224340  0.856472  200.856472
3  0.810577  1.199187  401.199187
2  0.637126  1.457621  301.457621
4  0.037077  0.913788  500.913788

Also, a Python native function can be applied to a Koalas DataFrame.

>>> kdf.apply(lambda x: x ** 2)
          A         B       C
0  0.000252  0.341588   10000
1  0.050329  0.399591   40000
3  0.657035  0.151018  160000
2  0.405930  0.673212   90000
4  0.001375  0.768623  250000

Best Practice: While it works fine as it is, it is recommended to specify a return type hint, which Koalas uses internally as the Spark return type, when applying user-defined functions to a Koalas DataFrame. If the return type hint is not specified, Koalas runs the function once on a small sample to infer the Spark return type, which can be fairly expensive.

>>> def square(x) -> ks.Series[np.float64]:
...     return x ** 2
>>> kdf.apply(square)
          A         B         C
0  0.405930  0.673212   90000.0
1  0.001375  0.768623  250000.0
2  0.000252  0.341588   10000.0
3  0.657035  0.151018  160000.0
4  0.050329  0.399591   40000.0

Note that DataFrame.apply() in Koalas does not support global aggregations by design. However, if the size of the data is smaller than compute.shortcut_limit, it might work because Koalas uses pandas as a shortcut execution.

# Working properly since size of data <= compute.shortcut_limit (1000)
>>> ks.DataFrame({'A': range(1000)}).apply(lambda col: col.max())
A    999
Name: 0, dtype: int64
# Not working properly since size of data > compute.shortcut_limit (1000)
>>> ks.DataFrame({'A': range(1001)}).apply(lambda col: col.max())
A     165
A     580
A     331
A     497
A     829
A     414
A     746
A     663
A     912
A    1000
A     248
A      82
Name: 0, dtype: int64

Best Practice: In Koalas, compute.shortcut_limit (default = 1000) computes a specified number of rows in pandas as a shortcut when operating on a small dataset. Koalas uses the pandas API directly in some cases when the size of input data is below this threshold. Therefore, setting this limit too high could slow down the execution or even lead to out-of-memory errors. The following code example sets a higher compute.shortcut_limit, which then allows the previous code to work properly. See the Available options for more details.

>>> ks.set_option('compute.shortcut_limit', 1001)
>>> ks.DataFrame({'A': range(1001)}).apply(lambda col: col.max())
A    1000
Name: 0, dtype: int64

Grouping Data

Grouping data by columns is one of the common APIs in pandas. DataFrame.groupby() is available in Koalas as well.

>>> kdf.groupby('A').sum()
                 B    C
A                      
0.224340  0.632132  200
0.637126  0.820495  300
0.015869  0.584455  100
0.810577  0.388611  400
0.037077  0.876712  500

See also grouping data by multiple columns below.

>>> kdf.groupby(['A', 'B']).sum()
                     C
A        B            
0.224340 0.632132  200
0.015869 0.584455  100
0.037077 0.876712  500
0.810577 0.388611  400
0.637126 0.820495  300

Plotting and Visualizing Data

In pandas, DataFrame.plot is a good solution for visualizing data. It can be used in the same way in Koalas.

Note that Koalas leverages approximation for faster rendering. Therefore, the results could be slightly different when the number of rows is larger than plotting.max_rows.

See the example below that plots a Koalas DataFrame as a bar chart with DataFrame.plot.bar().

>>> speed = [0.1, 17.5, 40, 48, 52, 69, 88]
>>> lifespan = [2, 8, 70, 1.5, 25, 12, 28]
>>> index = ['snail', 'pig', 'elephant',
...          'rabbit', 'giraffe', 'coyote', 'horse']
>>> kdf = ks.DataFrame({'speed': speed,
...                     'lifespan': lifespan}, index=index)
>>> kdf.plot.bar()

Example visualization plotting a Koalas DataFrame as a bar chart with DataFrame.plot.bar().

Also, the horizontal bar plot is supported with DataFrame.plot.barh().

>>> kdf.plot.barh()

Example visualization plotting a Koalas DataFrame as a horizontal bar chart

Make a pie plot using DataFrame.plot.pie().

>>> kdf = ks.DataFrame({'mass': [0.330, 4.87, 5.97],
...                     'radius': [2439.7, 6051.8, 6378.1]},
...                    index=['Mercury', 'Venus', 'Earth'])
>>> kdf.plot.pie(y='mass')

Example pie chart visualization using a Koalas DataFrame

Best Practice: For bar and pie plots, only the top n rows are displayed for more efficient rendering; the threshold can be set with the plotting.max_rows option.
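
For example, the threshold can be inspected and raised with the same option API used earlier (a small sketch, assuming the default has not been changed):

>>> ks.get_option('plotting.max_rows')
1000
>>> ks.set_option('plotting.max_rows', 2000)  # plot up to the top 2000 rows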

Make a stacked area plot using DataFrame.plot.area().

>>> kdf = ks.DataFrame({
...     'sales': [3, 2, 3, 9, 10, 6, 3],
...     'signups': [5, 5, 6, 12, 14, 13, 9],
...     'visits': [20, 42, 28, 62, 81, 50, 90],
... }, index=pd.date_range(start='2019/08/15', end='2020/03/09',
...                        freq='M'))
>>> kdf.plot.area()

Example stacked area plot visualization using a Koalas DataFrame

Make line charts using DataFrame.plot.line().

>>> kdf = ks.DataFrame({'pig': [20, 18, 489, 675, 1776],
...                     'horse': [4, 25, 281, 600, 1900]},
...                    index=[1990, 1997, 2003, 2009, 2014])
>>> kdf.plot.line()

Example line chart visualization using a Koalas DataFrame

Best Practice: For area and line plots, the proportion of data that will be plotted can be set by plotting.sample_ratio. If it is not set, the sample size is derived from plotting.max_rows (1000 by default). See Available options for details.
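
As an illustration, the sampling ratio can be scoped with option_context (a sketch, assuming the kdf line-chart DataFrame defined above):

>>> from databricks.koalas import option_context
>>> with option_context('plotting.sample_ratio', 0.5):  # plot a 50% sample of the rows
...     kdf.plot.line()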

Make a histogram using DataFrame.plot.hist()

>>> kdf = pd.DataFrame(
...     np.random.randint(1, 7, 6000),
...     columns=['one'])
>>> kdf['two'] = kdf['one'] + np.random.randint(1, 7, 6000)
>>> kdf = ks.from_pandas(kdf)
>>> kdf.plot.hist(bins=12, alpha=0.5)

Example histogram visualization using a Koalas DataFrame

Make a scatter plot using DataFrame.plot.scatter()

>>> kdf = ks.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1],
...                     [6.4, 3.2, 1], [5.9, 3.0, 2]],
...                    columns=['length', 'width', 'species'])
>>> kdf.plot.scatter(x='length', y='width', c='species', colormap='viridis')

Example scatter plot visualization using a Koalas DataFrame

Missing Functionalities and Workarounds in Koalas

When working with Koalas, there are a few things to look out for. First, not all pandas APIs are currently available in Koalas; about 70% of the pandas APIs have been implemented. In addition, there are subtle behavioral differences between Koalas and pandas, even when the same APIs are applied. Due to these differences, it would not make sense to implement certain pandas APIs in Koalas. This section discusses common workarounds.

Using pandas APIs via Conversion

When dealing with missing pandas APIs in Koalas, a common workaround is to convert Koalas DataFrames to pandas or PySpark DataFrames, and then apply either pandas or PySpark APIs. Converting between Koalas DataFrames and pandas/PySpark DataFrames is pretty straightforward: DataFrame.to_pandas() and koalas.from_pandas() for conversion to/from pandas; DataFrame.to_spark() and DataFrame.to_koalas() for conversion to/from PySpark, as the sketch below illustrates. However, if the Koalas DataFrame is too large to fit on a single machine, converting to pandas can cause an out-of-memory error.
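
A quick round-trip through these four conversion functions (a sketch, assuming the kdf DataFrame from the earlier examples and an active Spark session):

>>> pdf = kdf.to_pandas()               # Koalas -> pandas (collects all data to the driver)
>>> kdf_from_pd = ks.from_pandas(pdf)   # pandas -> Koalas
>>> sdf = kdf.to_spark()                # Koalas -> PySpark
>>> kdf_from_sdf = sdf.to_koalas()      # PySpark -> Koalas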

The following snippets show a missing pandas API and a simple workaround using DataFrame.to_pandas().

>>> kidx = kdf.index
>>> kidx.to_list()

...
PandasNotImplementedError: The method `pd.Index.to_list()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.

Best Practice: Index.to_list() raises PandasNotImplementedError. Koalas does not support this because it requires collecting all data into the client (driver node) side. A simple workaround is to convert to pandas using to_pandas().

>>> kidx.to_pandas().to_list()
[0, 1, 2, 3, 4]

Native Support for pandas Objects

Koalas also provides native support for pandas objects; they can be leveraged directly as shown below.

>>> kdf = ks.DataFrame({'A': 1.,
...                     'B': pd.Timestamp('20130102'),
...                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...                     'D': np.array([3] * 4, dtype='int32'),
...                     'F': 'foo'})
>>> kdf
     A          B    C  D    F
0  1.0 2013-01-02  1.0  3  foo
1  1.0 2013-01-02  1.0  3  foo
2  1.0 2013-01-02  1.0  3  foo
3  1.0 2013-01-02  1.0  3  foo

ks.Timestamp() is not implemented yet, and ks.Series() cannot be used in the creation of a Koalas DataFrame. In these cases, the pandas native objects pd.Timestamp() and pd.Series() can be used instead.

Distributing a pandas Function in Koalas

In addition, Koalas offers Koalas-specific APIs such as DataFrame.map_in_pandas(), which natively support distributing a given pandas function in Koalas.

>>> i = pd.date_range('2018-04-09', periods=2000, freq='1D1min')
>>> ts = ks.DataFrame({'A': ['timestamp']}, index=i)
>>> ts.between_time('0:15', '0:16')


...
PandasNotImplementedError: The method `pd.DataFrame.between_time()` is not implemented yet.

DataFrame.between_time() is not yet implemented in Koalas. As shown below, a simple workaround is to convert to a pandas DataFrame using to_pandas(), and then applying the function.

>>> ts.to_pandas().between_time('0:15', '0:16')
                             A
2018-04-24 00:15:00  timestamp
2018-04-25 00:16:00  timestamp
2022-04-04 00:15:00  timestamp
2022-04-05 00:16:00  timestamp

However, DataFrame.map_in_pandas() is a better alternative workaround because it does not require moving data into a single client node and potentially causing out-of-memory errors.

>>> ts.map_in_pandas(func=lambda pdf: pdf.between_time('0:15', '0:16'))
                             A
2022-04-04 00:15:00  timestamp
2022-04-05 00:16:00  timestamp
2018-04-24 00:15:00  timestamp
2018-04-25 00:16:00  timestamp

Best Practice: In this way, DataFrame.between_time(), which is a pandas function, can be performed on a distributed Koalas DataFrame because DataFrame.map_in_pandas() executes the given function across multiple nodes. See DataFrame.map_in_pandas().

Using SQL in Koalas

Koalas supports standard SQL syntax with ks.sql(), which allows executing a Spark SQL query and returning the result as a Koalas DataFrame.

>>> kdf = ks.DataFrame({'year': [1990, 1997, 2003, 2009, 2014],
...                     'pig': [20, 18, 489, 675, 1776],
...                     'horse': [4, 25, 281, 600, 1900]})
>>> ks.sql("SELECT * FROM {kdf} WHERE pig > 100")
   year   pig  horse
0  2003   489    281
1  2009   675    600
2  2014  1776   1900

Also, mixing Koalas DataFrame and pandas DataFrame is supported in a join operation.

>>> pdf = pd.DataFrame({'year': [1990, 1997, 2003, 2009, 2014],
...                     'sheep': [22, 50, 121, 445, 791],
...                     'chicken': [250, 326, 589, 1241, 2118]})
>>> ks.sql('''
...     SELECT ks.pig, pd.chicken
...     FROM {kdf} ks INNER JOIN {pdf} pd
...     ON ks.year = pd.year
...     ORDER BY ks.pig, pd.chicken''')
    pig  chicken
0    18      326
1    20      250
2   489      589
3   675     1241
4  1776     2118

Working with PySpark

You can also apply several PySpark APIs to Koalas DataFrames. A PySpark background can make you more productive when working with Koalas. If you know PySpark, you can use PySpark APIs as workarounds when the pandas-equivalent APIs are not available in Koalas, and you can take advantage of rich features such as the Spark UI, the history server, etc.

Conversion from and to PySpark DataFrame

A Koalas DataFrame can be easily converted to a PySpark DataFrame using DataFrame.to_spark(), similar to DataFrame.to_pandas(). On the other hand, a PySpark DataFrame can be easily converted to a Koalas DataFrame using DataFrame.to_koalas(), which extends the Spark DataFrame class.

>>> kdf = ks.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})
>>> sdf = kdf.to_spark()
>>> type(sdf)
pyspark.sql.dataframe.DataFrame
>>> sdf.show()
+---+---+
|  A|  B|
+---+---+
|  1| 10|
|  2| 20|
|  3| 30|
|  4| 40|
|  5| 50|
+---+---+

Note that converting from PySpark to Koalas can cause an out-of-memory error when the default index type is sequence. The default index type can be set by compute.default_index_type (default = sequence). If a sequence-like default index is needed for a large dataset, distributed-sequence should be used instead.

>>> from databricks.koalas import option_context
>>> with option_context(
...         "compute.default_index_type", "distributed-sequence"):
...     kdf = sdf.to_koalas()
>>> type(kdf)
databricks.koalas.frame.DataFrame
>>> kdf
   A   B
3  4  40
1  2  20
2  3  30
4  5  50
0  1  10

Best Practice: Converting from a PySpark DataFrame to a Koalas DataFrame can have some overhead because it requires creating a new default index internally – PySpark DataFrames do not have indices. You can avoid this overhead by specifying the column that can be used as an index column. See the Default Index type for more detail.

>>> sdf.to_koalas(index_col='A')
    B
A    
1  10
2  20
3  30
4  40
5  50

Checking Spark’s Execution Plans

DataFrame.explain() is a useful PySpark API and is also available in Koalas. It shows the Spark execution plans before the actual execution, helping you understand and predict the execution and avoid critical performance degradation.

from databricks.koalas import option_context

with option_context(
        "compute.ops_on_diff_frames", True,
        "compute.default_index_type", 'distributed'):
    df = ks.range(10) + ks.range(10)
    df.explain()

The command above simply adds two DataFrames with the same values. The result is shown below.

== Physical Plan ==
*(5) Project [...]
+- SortMergeJoin [...], FullOuter
   :- *(2) Sort [...], false, 0
   :  +- Exchange hashpartitioning(...), [id=#]
   :     +- *(1) Project [...]
   :        +- *(1) Range (0, 10, step=1, splits=12)
   +- *(4) Sort [...], false, 0
      +- ReusedExchange [...], Exchange hashpartitioning(...), [id=#]

As shown in the physical plan, the execution will be fairly expensive because it will perform the sort merge join to combine DataFrames. To improve the execution performance, you can reuse the same DataFrame to avoid the merge. See Physical Plans in Spark SQL to learn more.

with option_context(
        "compute.ops_on_diff_frames", False,
        "compute.default_index_type", 'distributed'):
    df = ks.range(10)
    df = df + df
    df.explain()

Now it uses the same DataFrame for the operations, avoiding the combination of different DataFrames and the sort merge join that compute.ops_on_diff_frames enables.

== Physical Plan ==
*(1) Project [...]
+- *(1) Project [...]
   +- *(1) Range (0, 10, step=1, splits=12)

This operation is much cheaper than the previous one while producing the same output. Use DataFrame.explain() to help improve your code’s efficiency.

Caching DataFrame

DataFrame.cache() is a useful PySpark API and is available in Koalas as well. It is used to cache the output from a Koalas operation so that it does not need to be computed again in subsequent executions. This significantly improves execution speed when the output needs to be accessed repeatedly.

with option_context("compute.default_index_type", 'distributed'):
    df = ks.range(10)
    new_df = (df + df).cache()  # `(df + df)` is cached here as `new_df`
    new_df.explain()

As the physical plan shows below, new_df will be cached once it is executed.

== Physical Plan ==
*(1) InMemoryTableScan [...]
   +- InMemoryRelation [...], StorageLevel(...)
      +- *(1) Project [...]
         +- *(1) Project [...]
            +- *(1) Project [...]
               +- *(1) Range (0, 10, step=1, splits=12)

InMemoryTableScan and InMemoryRelation mean that new_df is cached – Spark does not need to perform the same (df + df) operation the next time it is executed.

A cached DataFrame can be uncached by DataFrame.unpersist().

new_df.unpersist()

Best Practice: A cached DataFrame can be used as a context manager to scope the caching to the DataFrame: it is cached on entering the with block and uncached when the block exits.

with (df + df).cache() as df:
    df.explain()

Conclusion

The examples in this blog demonstrate how easily you can migrate your pandas codebase to Koalas when working with large datasets. Koalas is built on top of PySpark and provides the same API interface as pandas. While there are subtle differences between pandas and Koalas, Koalas provides additional Koalas-specific functions to make working in a distributed setting easier. Finally, this blog showed common workarounds and best practices for working in Koalas. For pandas users who need to scale out, Koalas fits their needs nicely.

Get Started with Koalas on Apache Spark

You can get started with trying examples in this blog in this notebook, visit the Koalas documentation and peruse examples, and contribute at Koalas GitHub. Also, join the koalas-dev mailing list for discussions and new release announcements.

References

--

Try Databricks for free. Get started today.

The post 10 Minutes from pandas to Koalas on Apache Spark appeared first on Databricks.

Data Science with Azure Databricks at Clifford Chance


Guest blog by Mirko Bernardoni (Fiume Ltd) and Lulu Wan (Clifford Chance)

With headquarters in London, Clifford Chance is a member of the “Magic Circle” of law firms and is one of the ten largest law firms in the world measured both by number of lawyers and revenue.

As a global law firm, we support clients at both the local and international level across Europe, Asia Pacific, the Americas, the Middle East and Africa. Our global view, coupled with our sector approach, gives us a detailed understanding of our clients’ business, including the drivers and competitive landscapes.

To achieve our vision of becoming the global law firm of choice, we must be the firm that creates the greatest value for our clients. That means delivering service that is ever quicker, simpler, more efficient and more robust. By investing in smart technology and applying our extensive legal expertise, we can continually improve value and outcomes for clients, making delivery more effective, every time.

Data Science and Legal

Artificial intelligence is growing at a phenomenal speed and is now set to transform the legal industry by mining documents, reviewing and creating contracts, raising red flags and performing due diligence. We are enthusiastic early adopters of AI and other advanced technology tools to enable us to deliver a better service to our clients.

To ensure we are providing the best value to our clients, Clifford Chance created an internal Data Science Lab, organised like a startup inside the firm. We work with, and as part of, the Innovation Lab and Best Delivery Hub at Clifford Chance, where we deliver initiatives that help lawyers in their daily work.

Applying data science to the lawyer’s work comes with many challenges. These include handling lengthy documents, working with a specific domain language, analysing millions of documents and classifying them, extracting information and predicting statements and clauses. For example, a simple document classification can become a complex exercise if we consider that our documents contain more than 5,000 words.

Data Science Lab process

The process that enables the data science lab to work at full capacity can be summarised in four steps:

  1. Idea management. Every idea is catalogued with a specific workflow for managing all progression gates and stakeholder’s interaction efficiently. This focuses us on embedding the idea in our existing business processes or creating a new product.
  2. Data processing. It is up to the Data Science Lab to collaborate with other teams to acquire data, seek the necessary approvals and transform it in such a way that only the relevant data with the right permission in the right format reaches the data scientist. Databricks with Apache Spark™ — we have an on-premises instance for filtering and obfuscating the data based on our contracts and regulations — allows us to move the data to Azure efficiently. Thanks to the unified data analytics platform, the entire data team — data engineers and data scientists — can fix minor bugs in our processes.
  3. Data science. Without Databricks it would be incredibly expensive for us to conduct research. The size of the team is small, but we are always looking to implement the latest academic research. We need a platform that allows us to code in an efficient manner without considering all the infrastructure aspects. Databricks provides a unified, collaborative environment for all our data scientists, while also ensuring that we can comply with the security standards as mandated by our organisation.
  4. Operationalisation. The Databricks platform is used to re-train the models and run the ETL process which moves data into production as necessary. Again, in this case, unifying data engineering and data science was a big win for us. It reduces the time to fix issues and bugs and helps us to better understand the data.

Workflow process for Clifford Chance’s Data Science Lab

Workflow process for Data Science Lab

Data Science Lab toolkit

The Data Science Lab’s requirements for building our toolkit are:

  • Maintain high standards of confidentiality
  • Build products as quickly as possible
  • Keep control of our models and personalisation
  • Usable by a small team of four members with mixed skills and roles

These requirements drove us to automate all of our processes and choose the right platforms for development. We had to unify data engineering and data science while reducing costs and time required to be operational.

We use a variety of third-party tools, including Azure Cloud, open-source and in-house build tools for our data stack:

  • Spark on-premise installation for applying the first level of governance on our data (such as defining what can be copied in the cloud)
  • Kafka and Event Hub are our transport protocol for moving the data in Azure
  • Databricks Unified Data Analytics Platform for ETL transformations, iterative development and testing of the models we build
  • MLflow to log model metadata, select the best models and hyperparameters, and deploy models
  • Hyperopt for model tuning and optimisation at scale
  • Azure Data Lake with Delta Lake for storing our datasets, enabling traceability and model storage

Clifford Chance Data Science Lab -- data ingestion and elaboration architecture

Data Science Lab data ingestion and elaboration architecture

An example use case: Document classification

Having the ability to automatically label documents speeds up many legal processes when thousands or millions of documents are involved. To build our model, we worked with the EDGAR dataset, an online public database from the U.S. Securities and Exchange Commission (SEC). EDGAR is the primary system for submissions by companies and others required to file information with the SEC.

The first step was to extract the documents from filings and find entries that are similar in size to our use case (more than 5,000 words) and extract only the relevant text. The process took multiple iterations to get a usable labelled dataset. We started from more than 15 million files and selected only 28,445 for creating our models.

What was novel about our approach was applying chunk embedding, inspired by audio segmentation. This entailed dividing a long document into chunks and mapping each chunk into a numeric space to obtain chunk embeddings. For more details, you can read our published paper here: Long-length Legal Document Classification.

On top of a long short-term memory (LSTM) network, we employed an attention mechanism to enable our model to assign different scores to different parts of the whole document. Throughout the architecture of the model, a set of hyperparameters, comprising embedding dimension, hidden size, batch size, learning rate and weight decay, plays a vital role both in determining the performance of the model and in the time consumed training it.
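
To make the chunk-attention idea concrete, here is a minimal, hypothetical sketch in PyTorch. The layer sizes, class count and layer choices are illustrative only and are not the firm's production model; see the paper linked above for the actual architecture.

import torch
import torch.nn as nn

class ChunkAttentionClassifier(nn.Module):
    """Illustrative sketch: a BiLSTM over chunk embeddings with a simple
    attention layer that scores each chunk before classification."""
    def __init__(self, embed_dim=300, hidden_size=128, num_classes=8):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_size,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_size, 1)           # one score per chunk
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, chunks):                               # (batch, n_chunks, embed_dim)
        states, _ = self.lstm(chunks)                        # (batch, n_chunks, 2*hidden)
        weights = torch.softmax(self.attn(states), dim=1)    # attention weights over chunks
        doc_vector = (weights * states).sum(dim=1)           # weighted sum -> document vector
        return self.classifier(doc_vector)

# Example: a batch of 4 documents, each represented by 20 chunk embeddings
model = ChunkAttentionClassifier()
logits = model(torch.randn(4, 20, 300))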

Clifford Chance model architecture -- Used to automatically label (or classify) legal documents

Model architecture

Even though we can narrow down candidate values for each hyperparameter to a limited range, the total number of combinations is still massive. In this case, an exhaustive search over the hyperparameter space is unrealistic, but Hyperopt makes life much easier. All we need to do is construct the objective function and define the hyperparameter space. Meanwhile, all the results generated during training are stored in MLflow, so no model evaluations are lost.
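
The shape of such a search is sketched below. This is a hedged illustration rather than the Lab's actual code: train_and_evaluate() is a placeholder for the real training routine, and the parameter ranges are arbitrary.

from hyperopt import fmin, tpe, hp, Trials
import mlflow

# Hypothetical search space over the hyperparameters mentioned above
space = {
    'learning_rate': hp.loguniform('learning_rate', -10, -4),
    'hidden_size': hp.choice('hidden_size', [64, 128, 256]),
    'weight_decay': hp.loguniform('weight_decay', -12, -6),
}

def objective(params):
    # train_and_evaluate() stands in for the actual model training and validation
    loss = train_and_evaluate(**params)
    mlflow.log_metric('val_loss', loss)   # every evaluation is recorded in MLflow
    return loss

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=Trials())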

How Clifford Chance uses Hyperopt to define and explore the hyperparameter space.

t-SNE plot of projections of document embeddings, using Doc2Vec + BiLSTM

Conclusion

The Clifford Chance Data Science Lab team is able to deliver end-user applications and academic research with a small team and limited resources. This has been achieved through automating processes and using a combination of Azure Cloud, Azure Databricks, MLflow and Hyperopt.

In the use case above, we achieved an F1 score greater than 0.98 on our document classification task with long-length documents. This is assisting multiple projects where we are dealing with huge numbers of files that require classification.

Looking forward, we plan to further automate our processes to reduce the workload of managing product development. We are continuing to optimise our processes to add alerting and monitoring. We plan to produce more scientific papers and contribute to the MLflow and Hyperopt open-source projects in the near future so we can share our specific use cases.

 

--

Try Databricks for free. Get started today.

The post Data Science with Azure Databricks at Clifford Chance appeared first on Databricks.

Operationalizing machine learning at scale with Databricks and Accenture


Guest blog by Atish Ray, Managing Director at Accenture Applied Intelligence

While many machine learning pilots are successful, scaling and operating full blown applications to deliver business-critical outcomes remains a key challenge. Accenture and Databricks are partnering to overcome this, writes Atish Ray, Managing Director at Accenture Applied Intelligence, who specializes in big data and AI.

In 2019, machine learning (ML) applications and platforms attracted an average of $42 billion in funding worldwide. Despite this promise, scaling and operating full-blown ML applications remains a key challenge—especially in a business context, where many of the long-term benefits of industrialized ML are yet to be realized.

While ML is lauded for its ability to learn patterns of data, subsequently improving performance and outcomes based on experience, the barriers to scaling it are many and varied. For instance, a lack of good management of metadata end-to-end in an ML lifecycle could lead to fundamental issues around trust and traceability. The rapid evolution of skills and technologies required, and the potential incompatibility of traditional operating models and business processes, especially in IT, both pose hurdles in moving ML applications from pilot stage into production.

The good news is that recent advances in and availability of several AI and ML technologies have yielded the tools necessary to democratize and industrialize the lifecycle of an ML application. Increasing popularity and use of public cloud has enabled organizations to store and process more data at higher efficiency than ever before—a prerequisite for ML applications to scale and run most efficiently.

Innovations from open source communities supported by companies like Databricks have resulted in state-of-the-art products that allow scientists, engineers, and architects to collaborate together and rapidly build and deploy ML applications. And, what used to require a PhD in machine learning, has now been abstracted into a wide variety of software tools and services, which are democratized for use by a more diverse set of users.

Combine all of this with deep knowledge of an industry and its data, and it is clear there has never been a better time for organizations to deploy and operate ML at scale.

What makes the ML lifecycle a complex, collaborative process?

To monitor if ML delivers sustained outcomes to a business over a period of time, a deep understanding of the people, processes and technologies in each phase of the ML lifecycle is critical (Figure 1). From the outset, key stakeholders must be aligned on what exactly it is they need to achieve for the business.

End-to-end machine learning product life cycle

Figure 1: An end-to-end ML product life cycle

In a business context, one of the best practices is to begin by prioritizing one or two business challenges around which to build a minimum viable product or MVP supported by an initial foundation. Once this is established and the necessary data prepared, an experimentation phase takes place to determine the right model for any given problem.

After the model has been selected, tested, tuned and finalized, the ML application is ready to be operationalized. Traditionally, the work required to get to this point has been where data scientists focus the bulk of their time. However, in order to operationalize at scale, models may need to be deployed on certain platforms, such as cloud platforms, or integrated into user-facing business applications. 

Once this is all complete, the next step is to monitor and tune the performance of these learning models as they’re deployed into a production environment, where they are delivering specific outcomes such as making recommendations and predictions or monitoring certain types of operational efficiencies.

In one case, for example, an online ad agency in Japan was utilizing ML to create lists of target customers for ad delivery. They were successful in creating accurate models but were suffering from high operational cost of building models and evaluating the targeting outcomes. There was an urgent need to normalize and automate the process across measures.

To address this issue, Accenture implemented a reusable scripts tool to build, train, test and validate models. The scripts, which ran from a GUI frontend, were integrated with MLflow to enable deployment with ease, substantially reducing the DevOps time and effort needed to scale.

In another case, a large pharmaceutical retailer in the US was struggling to engage with its 80-million-plus member base through offers made with its loyalty program. It needed a way to increase uplift, yet apart from manual processes, there were no systems in place to build a reliable, uniform and reproducible ML pipeline to evaluate billions of combinations of offers for millions of customers on a continual basis.

Accenture developed and delivered a personalization engine with the Databricks platform to build, train, test, validate and deploy models at scale, across tens of millions of customers, billions of offers, and tens of thousands of products. An automated ML model deployment process and modernized AI pipeline were also deployed. The result was substantially reduced DevOps time and effort in deploying models, and the business was able to achieve an estimated 20% higher margin for pilot retail locations.

What are the technical building blocks for industrial ML?

Leveraging established building blocks by partnering with established experts—such as in the two cases outlined above—accelerates both the build and deployment of these types of programs, which can be iterated, incrementally scaled and applied to deliver increasingly complex business outcomes.

To help its clients build and operate these ML applications, Accenture has partnered with Databricks. Accenture is leveraging Databricks’ platform to establish the key technical foundation needed to address three core areas of industrial ML: collaboration, data dependency and deployment (Figure 2).

Databricks’ Unified Data Analytics Platform enables key technical components for each of the three fundamental areas, and Accenture has developed a set of additional technical components that co-exist and integrate with the Databricks platform. This also includes a package of reusable components, which accelerate collaboration, improve understanding of data and streamline operational deployments.

Ultimately, the objective of this partnership was to streamline the methodologies that have been proven successful for large-scale deployment.

The Databricks and Accenture solution components and partnership architecture, highlighting collaboration, data dependency, deployment, and infrastructure.

Figure 2: Collaboration, data dependency and deployment

Based on extensive implementation experience, we know that organizations industrializing ML development and deployment are focusing on the three fundamental areas discussed here:

Collaboration

Comprehensive collaboration of analytics communities across organization boundaries, managing and sharing features and models, is key to success. As a collaborative environment, Databricks Workspaces provides a space in which data engineers and data scientists can jointly explore datasets, build models iteratively and execute experiments and data pipelines. MLflow, a key component and open source project from Databricks, supports collaboration across the ML lifecycle from experimentation to deployment, and allows users to track model performance, versions and reproducible results.

Accenture brings a toolkit of models and feature engineering for many scenarios, for example a recommendation engine, that bootstraps the full ML application lifecycle. It leverages industry knowledge of successful models and enables baseline production feedback to inform calibration efforts.

Data dependency

We cannot emphasize enough the importance of access to and understanding of usable datasets and associated metadata for driving successful outcomes. Our data dependency components capture standards and rules to shape data, and provide visual charts to help assess data quality.  This improves speed of data acquisition and curation, and further accelerates understanding of data and improves efficiencies of feature engineering.

The Databricks platform provides several capabilities to improve data quality and processing performance at scale.  Delta Lake, available as part of Databricks, is an open source storage layer that enables ACID transactions and data quality features, and brings reliability to data lakes at scale. Apache Spark delivers a highly scalable engine for big data and ML, with additional enhancements from Databricks for high performance.
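
As a small illustration of the Delta Lake layer mentioned above (a sketch assuming a Databricks notebook with an active spark session and a hypothetical storage path):

# Write a DataFrame as a Delta table, then read it back with ACID guarantees
df = spark.range(0, 1000)
df.write.format("delta").mode("overwrite").save("/mnt/datalake/events")
events = spark.read.format("delta").load("/mnt/datalake/events")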

Deployment

Whereas experimentation requires data science knowledge to apply the right solutions to the right industry problem, deployment requires well integrated, cross-functional teams. Our deployment components use a metadata-driven approach to build and deploy ML pipelines representing continuous workflow from inception to validation. By enabling standards and deployment patterns, these components make it possible to operationalize experiments.

The Databricks Enterprise Cloud Service is a simple, secure and scalable managed service that enables consistent deployment of high-performing ML pipelines and applications. What’s more, a governance structure for the deployment and management of production models and drift can also be enabled. Integrated together, these components from Databricks and Accenture deliver significant acceleration to the deployment of an ML lifecycle on the AWS and Azure clouds.

What are the key considerations before deploying ML at scale?

For those considering an industrialized approach to ML, there are a few key questions to consider first. They include:

  1. Are business stakeholders aligned on the business problems ML needs to solve and the expectations on key outcomes it needs to achieve?
  2. Are the right roles and skills in place to deploy, scale and monitor the ML application once experimentation succeeds?
  3. Are the necessary infrastructure and automation needs understood and made available for an industrialized ML solution?
  4. Does the data science team have the right operating model, standards and enablers needed to avoid significant deployment re-engineering once experimentation is done?

When it comes to industrialized ML, it can be tempting to start by building a very deep technical foundation that addresses all aspects of the lifecycle. More often than not, however, that approach risks losing sight of business outcomes and encountering challenges with adoption, spend and justification of approach.

Instead, we have experienced successful iterative development of these foundations in a business-critical environment, built through successive cycles delivering incremental business outcomes.

About Accenture

Accenture is a leading global professional services company, providing a broad range of services and solutions in strategy, consulting, digital, technology and operations. Combining unmatched experience and specialized skills across more than 40 industries and all business functions – underpinned by the world’s largest delivery network – Accenture works at the intersection of business and technology to help clients improve their performance and create sustainable value for their stakeholders. With more than 492,000 people serving clients in more than 120 countries, Accenture drives innovation to improve the way the world works and lives. Visit us at www.accenture.com.

This document is produced by consultants at Accenture as general guidance. It is not intended to provide specific advice on your circumstances. If you require advice or further details on any matters referred to, please contact your Accenture representative.

This document makes descriptive reference to trademarks that may be owned by others. The use of such trademarks herein is not an assertion of ownership of such trademarks by Accenture and is not intended to represent or imply the existence of an association between Accenture and the lawful owners of such trademarks.

Copyright © 2019 Accenture. All rights reserved. Accenture, its logo, and High Performance. Delivered. are trademarks of Accenture.

--

Try Databricks for free. Get started today.

The post Operationalizing machine learning at scale with Databricks and Accenture appeared first on Databricks.


What it means to be customer obsessed

One of our values at Databricks is to be customer obsessed. We deeply care about the impact and success of our customers, and are proud to be recognized by Gartner for focusing on this. A key part of that is how we strategize on making the world better through the applications of data. Working as a customer success engineer at Databricks has provided me with a life-changing career opportunity to help companies in the Data and Cloud space — across a variety of industries — solve the world’s toughest problems.

Helping customers succeed across a variety of industries

The Databricks customer success team at Tech Summit ’19

Having worked as a lead data science engineer for five years at Citi, I wanted a role that could help me understand the wider data industry landscape, specifically around machine learning and AI use cases. Through my previous experience with Apache Spark™, I had already been impressed by this powerful open source project and how it accelerated model development for big data.

My current role in customer success management allows me to work with many organizations focused on data-driven products and solutions, and it’s gratifying to work with both executives and technical users. Because Databricks is so industry agnostic, our teams get exposure to a wide range of verticals like Financial Services (FSI), Health and Life Sciences, Media and Entertainment, Gaming, Energy, Retail and eCommerce, Automotive, and Technology.

Customer success engineers (CSEs) serve as trusted advisors, ensuring that customers are successful with the product and are innovating at a fast pace. One highlight of being a CSE is that you feel like a CEO for your customers — you must understand their use cases and strategies, map our solutions to their vision, champion them, and drive advocacy through feedback cycles and enablement. I am gratified to share one such example of growth and success: my very first customer, YipitData, is leveraging the scale and processing ability of the Databricks Unified Data Analytics Platform for a competitive advantage.

Working and scaling as a team

Part of my role includes working with internal teams like Sales, Product, Engineering, Support, Marketing and, of course, our Customer Success organization. This allows me to grow and benefit from different perspectives, including how something can be done, trade-offs for different options, and learning from subject matter experts that explain the bigger picture. I love the energy at Databricks, where we think about the positive impact we are making on billions of users and how we can push for 10x innovation. The vision and pace of the company allows me to go beyond my skill set and push the boundaries.

The Databricks Customer Success team at our 2020 Company Retreat

When I joined in 2018, there were approximately 400 employees; we have since grown to more than 1,300. With this growth, I found many opportunities to lead and contribute outside of my daily job, whether it was helping the open source community, driving technical enablement for the field, supporting diversity and inclusion, empowering other organizations to share their success stories, providing visibility in the market, building champions, or serving as a Databricks ambassador at conferences and meetups.

We believe in the power of community and acknowledge the amazing solutions that others have built by partnering with us to solve the world’s toughest data problems. Our team is constantly looking to build our community of employees, customers, partners and contributors to open source projects. Databricks is growing tremendously and the opportunities are limitless for what we can give back to the community. If you want to learn more, visit databricks.com or if you’re interested in working with us, please apply on our Careers at Databricks page.

Vini is a customer success engineer at Databricks in San Francisco. She serves as a trusted technical advisor for our customers and works cross-functionally with our internal Databricks teams.

--

Try Databricks for free. Get started today.

The post What it means to be customer obsessed appeared first on Databricks.

COVID-19 Datasets Now Available on Databricks: How the Data Community Can Help

With the massive disruption of the current COVID-19 pandemic, many data engineers and data scientists are asking themselves “How can the data community help?” The data community is already doing some amazing work in a short amount of time including (but certainly not limited to) one of the most commonly used COVID-19 data sources: the 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE. The following animated GIF is a visual representation of the number of confirmed cases (counties) and deaths (circles) spanning 20 days from March 22nd to April 11th.

Figure 1: Confirmed COVID-19 cases and deaths spanning 20 days (2020-03-22 to 2020-04-11)
[source notebook] 

Other examples include Genomic epidemiology of novel coronavirus which provides real-time tracking of pathogen evolution (click to play the transmissions and phylogeny).

Figure 2: Genomic epidemiology of novel coronavirus (from 2020-04-08)

A powerful example of hospital resource utilization modeling is the University of Washington’s Institute for Health Metrics and Evaluation (IHME) COVID-19 projections. The screenshot below shows the projected hospital resource utilization metrics, highlighting that peak resource use occurred on March 28th, 2020.

Figure 3: IHME COVID-19 projections for Italy (from 2020-04-08)

But how can I help?

We believe that overcoming COVID-19 is the world’s toughest problem at the moment, and to help make important decisions, it is important to understand the underlying data. So we’ve taken steps to enable anyone — from first-time data explorers to data professionals — to participate in the effort.

In late March, we began with a data analytics primer of COVID-19 datasets with our tech talk on Analyzing COVID-19: Can the Data Community Help? In this session, we performed exploratory data analysis and natural language processing (NLP) with various open source projects, including but not limited to Apache Spark™, Python, pandas, and BERT. We have also made these notebooks available for you to download and use in your own environment of choice, whether that is your own local Python virtual environment, cloud computing, or Databricks Community Edition.

For example, during this session we analyzed the COVID-19 Open Research Dataset Challenge (CORD-19) dataset and observed:

  • There are thousands of JSON files, each containing the research paper text details including their references. The complexity of the JSON schema can make processing this data a complicated task. Fortunately, Apache Spark can quickly and automatically infer the schema of these JSON files, and using this notebook we save the thousands of JSON files into a few Parquet files to make the subsequent exploratory data analysis easier (a minimal sketch of this pattern follows the figures below).
  • As most of this text is unstructured, there are data quality issues including (but not limited to) correctly identifying the primary author’s country. In this notebook, we provide the steps to clean up this data and identify the ISO Alpha 3 country code so we can subsequently map the number of papers by primary author’s country.
Figure 4: Number of COVID-19-related research papers by primary author’s country from Analyzing COVID-19: Can the Data Community Help?

  • Upon cleaning up the data, we can apply various NLP algorithms to it to gain insight and intuition into this data. This notebook performs various tasks, including summarizing paper abstracts (one paper went from 7,800 to 1,100 characters) as well as creating the following word cloud based on the titles of these research papers.
Figure 5: Word cloud based on COVID-19-related research paper titles from Analyzing COVID-19: Can the Data Community Help?
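
If you want to try the JSON-to-Parquet pattern yourself, here is a minimal PySpark sketch; it is not the notebook’s exact code, and the input and output paths are placeholders:

# spark is the SparkSession provided in a Databricks notebook
# Read the CORD-19 JSON files; Spark infers the nested schema automatically
json_df = spark.read.option("multiLine", True).json("/path/to/cord19/json/")

# Consolidate thousands of small JSON files into a few Parquet files
json_df.coalesce(8).write.mode("overwrite").parquet("/path/to/cord19/parquet/")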

Show me the data!

As most data analysts, engineers, and scientists will attest, the quality of your data has a formidable effect on your exploratory data analysis. As noted in A Few Useful Things to Know about Machine Learning (October 2012):

“A dumb algorithm with lots and lots of data beats a clever one with modest amounts of it.”

It is important to note that this quote emphasizes the value of a large amount of high-quality data; it does not trivialize the many other important aspects of machine learning, such as (but not limited to) feature engineering, nor the paper’s point that data alone is not enough.


Many in the data community have been and are continuing to work expeditiously to provide various SARS-CoV-2 (the cause) and COVID-19 (the disease) datasets on Kaggle and GitHub.

To make it easier for you to perform your analysis — if you’re using Databricks or Databricks Community Edition — we are periodically refreshing and making available various COVID-19 datasets for research (i.e. non-commercial) purposes.  We are currently refreshing the following datasets and we plan to add more over time:

/databricks-datasets/[location]     Resource
/../COVID/CORD-19/                  COVID-19 Open Research Dataset Challenge (CORD-19)
/../COVID/CSSEGISandData/           2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE
/../COVID/ESRI_hospital_beds/       Definitive Healthcare: USA Hospital Beds
/../COVID/IHME/                     IHME (UW) COVID-19 Projections
/../COVID/USAFacts/                 USA Facts: Confirmed | Deaths
/../COVID/coronavirusdataset/       Data Science for COVID-19 (DS4C) (South Korea)
/../COVID/covid-19-data/            NY Times COVID-19 Datasets
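
As a quick way to start exploring, the sketch below reads the NY Times data from the refreshed location with PySpark. The exact file layout inside each folder may change between refreshes, and the us-counties.csv file name is an assumption based on the NY Times repository, so list the folder first:

# dbutils.fs.ls("/databricks-datasets/COVID/covid-19-data/") lists the available files
# File name below is an assumption; adjust after checking the folder contents
nyt_df = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("/databricks-datasets/COVID/covid-19-data/us-counties.csv"))

display(nyt_df.limit(10))  # display() is available in Databricks notebooks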

Learn more with our exploratory data analysis workshops

Thanks to the positive feedback from our earlier tech talk, we are happy to announce that we are following up with a workshop series on exploratory data analysis in Python with COVID-19 datasets. The videos will be available on YouTube and the notebooks will be available at https://github.com/databricks/tech-talks for you to use in your environment of choice.

Intro to Python on Databricks

This workshop shows you the simple steps needed to program in Python using a notebook environment on the free Databricks Community Edition. Python is a popular programming language because of its wide range of applications, including data analysis, machine learning and web development. This workshop covers major foundational concepts to get you started coding in Python, with a focus on data analysis. You will learn about different types of variables, for loops, functions, and conditional statements. No prior programming knowledge is required.

Who should attend this workshop: Anyone and everyone, CS students and even non-technical folks are welcome to join. No prior programming knowledge is required. If you have taken Python courses in the past, this may be too basic for you.

Intro to Python on Databricks workshop covers major foundational concepts to get you started coding in Python, with a focus on data analysis.

Data Analysis with pandas

This workshop focuses on pandas, a powerful open-source Python package for data analysis and manipulation. In this workshop, you learn how to read data, compute summary statistics, check data distributions, conduct basic data cleaning and transformation, and plot simple data visualizations. We will be using the Johns Hopkins Center for Systems Science and Engineering (CSSE) Novel Coronavirus (COVID-19) dataset.

Who should attend this workshop: Anyone and everyone – CS students and even non-technical folks are welcome to join. Prior basic Python experience is recommended.

What you need: Although no prep work is required, we do recommend basic Python knowledge. If you’re new to Python, a great jump start is our Introduction to Python tutorial.

Data Analysis with pandas workshop focuses on how to read data, compute summary statistics, check data distributions, conduct basic data cleaning and transformation, and plot simple data visualizations.

Machine Learning with scikit-learn

scikit-learn is one of the most popular open source machine learning libraries for data science practitioners. This workshop walks through the basics of machine learning, the different types of machine learning, and how to build a simple machine learning model. This workshop focuses on the techniques of applying and evaluating machine learning methods, rather than the statistical concepts behind them. We will be using the Johns Hopkins Center for Systems Science and Engineering (CSSE) Novel Coronavirus (COVID-19) dataset.

Who should attend this workshop: Anyone and everyone – CS students and even non-technical folks are welcome to join. Prior basic Python and pandas experience is required. If you’re new to Python and pandas, watch the Introduction to Python tutorial and register for the Data Analysis with pandas tutorial.

Machine Learning with scikit-learn workshop focuses on the techniques of applying and evaluating machine-learning methods, rather than the statistical concepts behind them.

Gaining some insight into COVID-19 datasets

To help you jumpstart your analysis of COVID-19 datasets, we have also included additional notebooks in the tech-talks/samples folder for both the New York Times COVID-19 dataset and 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE (both available and regularly refreshed in /databricks-datasets/COVID).

The NY Times COVID-19 Analysis notebook includes the analysis of COVID-19 cases and deaths by county.

Figure 6: COVID-19 cases for Washington State by top 10 counties (source: New York Times COVID-19 dataset as of April 11th, 2020)

Figure 7: COVID-19 deaths for New York State by top 10 counties (source: New York Times COVID-19 dataset as of April 11th, 2020)

Some observations based on the JHU COVID-19 Analysis notebook include:

  • As of April 11th, 2020, the schema of the JHU COVID-19 daily reports has changed three times. The preceding notebook includes a script that loops through each file, extracts the file name (to obtain the date), and merges the three different schemas together (a simplified sketch of this merge follows the figures below).
  • It uses Altair visualizations to show the exponential growth of the number of COVID-19 cases and deaths in the United States, both statically and dynamically via a slider bar.
COVID-19 Confirmed Cases (counties) and deaths (lat, long) using Altair Choropleth map on 3/22 per Johns Hopkins COVID-19 dataset

COVID-19 Confirmed Cases (counties) and deaths (lat, long) using Altair Choropleth map on 3/22 and 4/11 per Johns Hopkins COVID-19 dataset
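
For reference, a simplified sketch of that schema merge is shown below; it is not the notebook’s actual script. The base path is an assumption about how the repository is mirrored under /databricks-datasets (verify with dbutils.fs.ls first), and unionByName with allowMissingColumns requires Spark 3.1 or later:

import os
from functools import reduce
from pyspark.sql import functions as F

# spark and dbutils are available in Databricks notebooks
# Assumed location of the mirrored daily reports; check the folder layout before running
base_path = "/databricks-datasets/COVID/CSSEGISandData/csse_covid_19_data/csse_covid_19_daily_reports/"
files = [f.path for f in dbutils.fs.ls(base_path) if f.path.endswith(".csv")]

daily_dfs = []
for path in files:
    # File names look like "03-22-2020.csv", so the name encodes the report date
    report_date = os.path.basename(path).replace(".csv", "")
    df = (spark.read.option("header", True).option("inferSchema", True).csv(path)
          .withColumn("report_date", F.to_date(F.lit(report_date), "MM-dd-yyyy")))
    daily_dfs.append(df)

# unionByName with allowMissingColumns reconciles the different schema variants (Spark 3.1+)
merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), daily_dfs)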

Discussion

The data community can help during this pandemic by providing crucial insight on the patterns behind the data: rate of growth of confirmed cases and deaths in each county, the impact to that growth where states applied social distancing earlier, understanding how we are flattening the curve by social distancing, etc. While at its core, COVID-19 is a medical problem — i.e. how do we save patients’ lives — it is also an epidemiological problem where understanding the data will help the medical community make better decisions — e.g. how can we use data to make better public health policies to keep people from becoming patients.

--

Try Databricks for free. Get started today.

The post COVID-19 Datasets Now Available on Databricks: How the Data Community Can Help appeared first on Databricks.

Spark + AI Summit is now a global virtual event

Extraordinary times call for extraordinary measures. That’s why we transformed this year’s Spark + AI Summit into a fully virtual experience and opened the doors to welcome everyone, free of charge. This gives us the opportunity to turn Summit into a truly global event, bringing together tens of thousands of data scientists, engineers and analysts from around the world in what will be a defining moment for data teams.

2020 Spark + AI Summit -- free, virtual event for data teams

Already the largest event of its kind for data teams, this year’s digital conference will be bigger than ever, drawing a worldwide audience who will tune in to collaborate, connect and explore the latest advances in data science, data engineering, machine learning and AI. Expanded to five days — June 22–26 — and featuring more than 200 sessions with live keynotes, interactive demos and panels, plus 4x the amount of training, this Summit may be virtual but its impact will be real.

Data Teams Unite!

We believe that everyone — from startups to large enterprises — needs to unleash their data teams so they can solve the complex challenges we face as businesses, people and as a planet. Recent events only underscore the urgency to tackle the biggest challenges by empowering data teams with the data, the tools, and the platform they need.

That’s why the theme of this year’s summit is Data Teams Unite! Today, teams are often dispersed geographically — and right now many of them are working remotely — but our platform, based on open source technologies, unites them wherever they are.

Summit represents a one-of-a-kind opportunity for data teams to connect with each other, share knowledge and discover new technologies. Summit will showcase our vision of flexible, agile data teams who can collaborate from anywhere, work independently and tackle a wide range of projects that once required highly specialized skills.

You might not yet call yourselves a data team, but practitioners at Summit will find the lines between personas blurring and discover the value of unified teams for collaborating and innovating. Data scientists, engineers and analysts will have a range of opportunities to broaden their skill sets, hone best practices and explore the growing interdependencies between their roles and other members of the data team. You’ll also find a wealth of sessions and tutorials on key topics in data and AI including machine learning and MLOps, AI applications, data science and analytics, data engineering and architecture, data management and data pipelines, performance and scalability, and open source technologies.

Best of all, this year’s Summit is free so register your entire data team now. No travel required. Join tens of thousands of practitioners — data scientists, engineers, analysts, machine learning pros — and business leaders as we shape the future of Big Data, AI, and open-source technologies like Apache Spark™, Delta Lake, and MLflow.

FREE REGISTRATION

--

Try Databricks for free. Get started today.

The post Spark + AI Summit is now a global virtual event appeared first on Databricks.

Databricks Extends MLflow Model Registry with Enterprise Features

We are excited to announce new enterprise-grade features for the MLflow Model Registry on Databricks. The Model Registry is now enabled by default for all customers using the Databricks Unified Data Analytics Platform.

In this blog, we want to highlight the benefits of the Model Registry as a centralized hub for model management, how data teams across organizations can share and control access to their models, and touch upon how you can use Model Registry APIs for integration or inspection.

Central Hub for Collaborative Model Lifecycle Management

MLflow already has the ability to track metrics, parameters, and artifacts as part of experiments; package models and reproducible ML projects; and deploy models to batch or real-time serving platforms. Built on these existing capabilities, the MLflow Model Registry [AWS] [Azure] provides a central repository to manage the model deployment lifecycle.

Overview of the CI/CD tools, architecture and workflow of the MLflow centralized hub for model management.

One of the primary challenges among data scientists in a large organization is the absence of a central repository to collaborate, share code, and manage deployment stage transitions for models, model versions, and their history. A centralized registry for models across an organization affords data teams the ability to:

  • discover registered models, their current development stage, associated experiment runs, and the code associated with a registered model
  • transition models between deployment stages
  • deploy different versions of a registered model in different stages, giving MLOps engineers the ability to deploy and test different model versions
  • archive older models for posterity and provenance
  • peruse model activities and annotations throughout the model’s lifecycle
  • control granular access and permissions for model registration, transitions or modifications

MLflow Model Registry manages model stages throughout the lifecycle.
The Model Registry shows different versions in different stages throughout their lifecycle.

Access Control for Model Stage Management

In the current decade of data and machine learning innovation, models have become precious assets, essential to business strategies. Their uses range from predicting mechanical failures in machinery to forecasting power consumption or financial performance, and from fraud and anomaly detection to recommending related items for purchase.

As with sensitive data, so with the models trained and scored on that data: an access control list (ACL) is imperative so that only authorized users can access them. Through a set of ACLs, data team administrators can grant granular access to operations on a registered model during the model’s lifecycle, preventing inappropriate use of the models or unapproved model transitions to production stages.

In the Databricks Unified Data Analytics Platform, you can now set permissions on individual registered models, following the general Databricks access control and permissions model [AWS] [Azure].

MLflow Model Registry provides an additional object to set policies on registered models.
Access Control Policies for Databricks Assets.

From the Registered Models UI in the Databricks workspace, you can assign users and groups with appropriate permissions for models in the registry, similar to notebooks or clusters.

Using the MLflow Model Registry UI to set permissions.
Set permissions in the Model Registry UI using the ACLs

As shown in the table below, an administrator can assign four permission levels to models registered in the Model Registry: No permissions, Read, Edit, and Manage. Depending on team members’ requirements to access models, you can grant permissions to individual users or groups for each of the abilities shown below.

Ability | No Permissions | Read | Edit | Manage
Create a model | X | X | X | X
View model and its model versions in a list |   | X | X | X
View model’s details, its versions and their details, stage transition requests, activities, and artifact download URIs |   | X | X | X
Request stage transitions for a model version |   | X | X | X
Add a new version to model |   |   | X | X
Update model and version description |   |   | X | X
Rename model |   |   |   | X
Transition model version between stages |   |   |   | X
Approve, reject, or cancel a model version stage transition request |   |   |   | X
Modify permissions |   |   |   | X
Delete model and model versions |   |   |   | X
Table for Model Registry Access, Abilities and Permissions

How to Use the Model Registry

Typically, data scientists who use MLflow will conduct many experiments, each with a number of runs that track and log metrics and parameters. During the course of this development cycle, they will select the best run within an experiment and register its model with the registry. Thereafter, the registry lets data scientists track multiple versions as the model progresses, assigning each version a lifecycle stage: Staging, Production, or Archived.
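
For completeness, here is a minimal sketch of that step, registering a run’s model and promoting the new version to Staging; the run_id and model name are illustrative placeholders rather than values from the blog:

import mlflow
from mlflow.tracking import MlflowClient

run_id = "<run-id-from-your-experiment>"  # placeholder; comes from your own experiment run

# Register the model logged under that run; "runs:/<run_id>/model" points at the run's artifact
model_details = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="scikit-learn-power-forecasting-model"
)

# Promote the newly created version to the Staging stage
client = MlflowClient()
client.transition_model_version_stage(
    name=model_details.name,
    version=model_details.version,
    stage="Staging"
)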

There are two ways to interact with the MLflow Model Registry [AWS] [Azure]: through the Model Registry UI integrated with the Databricks workspace, or via the MLflow Tracking Client APIs. The latter gives MLOps engineers access to registered models so they can integrate with CI/CD tools for testing, or inspect a model’s runs and metadata.

Model Registry UI Workflows

The Model Registry UI is accessible from the Databricks workspace. From the Model Registry UI, you can conduct the following activities as part of your workflow:

  • Register a model from the Run’s page
  • Edit a model version description
  • Transition a model version
  • View model version activities and annotations
  • Display and search registered models
  • Delete a model version

Model Registry APIs Workflows

An alternative way to interact with the Model Registry is to use the MLflow model flavor or the MLflow Tracking Client API. As enumerated above in the UI workflows, you can perform similar operations on registered models with the APIs. These APIs are useful for inspection, or for integrating with external tools that need access to models for nightly testing.

Load Models from the Model Registry

The Model Registry’s APIs allow you to integrate with your choice of continuous integration and deployment (CI/CD) tools such as Jenkins to test your models. For example, your unit tests, with proper permissions granted as mentioned above, can load a version of a model for testing.

In the code snippet below, we are loading two versions of the same model: version 3 in staging and the latest version in production.

import mlflow.sklearn

# Load version 3 of the registered model using a models:/ URI
model_version_uri = "models:/{model_name}/3".format(model_name="scikit-learn-power-forecasting-model")
model_version_3 = mlflow.sklearn.load_model(model_version_uri)

Now your Jenkins job has access to version 3 of the model, currently in staging, for testing. If you want to load the latest production version, you simply change the models:/ URI to reference the production stage.

# Load the model in production stage
model_production_uri = "models:/{model_name}/production".format(model_name="scikit-learn-power-forecasting-model")
model_production = mlflow.sklearn.load_model(model_production_uri)

Integrate with an Apache Spark Job

As well as integrating with your choice of deployment (CI/CD) tools, you can load models from the registry and use them in your Spark batch jobs. A common scenario is to load a registered model as a Spark UDF.

# Load the model as a Spark UDF and score a batch of data
import mlflow.pyfunc

batch_df = spark.read.parquet("<path-to-batch-features>")  # placeholder path to your feature data
features = ["temperature", "wind-speed", "humidity"]
pyfunc_forecast_udf = mlflow.pyfunc.spark_udf(spark, model_production_uri)
prediction_batch_df = batch_df.withColumn("prediction",
                           pyfunc_forecast_udf(*features))

Inspect, List or Search Information about Registered Models

At times, you may want to inspect a registered model’s information via a programmatic interface to examine its MLflow entities. For example, you can fetch a list of all registered models in the registry with a simple method call and iterate over their version information.

from mlflow.tracking import MlflowClient

client = MlflowClient()
for rm in client.list_registered_models():
    print(f"name={rm.name}")
    for mv in rm.latest_versions:
        print(f"run_id={mv.run_id}")
        print(f"current_stage={mv.current_stage}")
        print(f"version={mv.version}")

This outputs:

name=sk-learn-random-forest-reg-model
run_id=dfe7227d2cae4c33890fe2e61aa8f54b
current_stage=Production
version=1
...
...

With hundreds of models, it can be cumbersome to peruse or print the results returned from this call. A more efficient approach is to search for a specific model name and list its version details with the search_model_versions() method, providing a filter string such as "name='sk-learn-random-forest-reg-model'".

from pprint import pprint
from mlflow.tracking import MlflowClient

client = MlflowClient()
for mv in client.search_model_versions("name='sk-learn-random-forest-reg-model'"):
    pprint(dict(mv), indent=4)

{   'creation_timestamp': 1582671933246,
    'current_stage': 'Production',
    'description': 'A random forest model containing 100 decision trees'
                   'trained in scikit-learn',
    'last_updated_timestamp': 1582671960712,
    'name': 'sk-learn-random-forest-reg-model',
    'run_id': 'ae2cc01346de45f79a44a320aab1797b',
    'source': './mlruns/0/ae2cc01346de45f79a44a320aab1797b/artifacts/sklearn-model',
    'status': 'READY',
    'status_message': None,
    'user_id': 'jane@doe.ml',
    'version': 1}
 ...

To sum up, the MLflow Model Registry is available by default to all Databricks customers. As a central hub for ML models, it enables data teams across large organizations to collaborate on and share models, manage stage transitions, and annotate and examine model lineage. For controlled collaboration, administrators set policies with ACLs to grant permissions on registered models.

And finally, you can interact with the registry either using a Databricks workspace’s MLflow UI or MLflow APIs as part of your model lifecycle workflow.

Get Started with the Model Registry

Ready to get started or try it out for yourself? You can read more about the MLflow Model Registry and how to use it on AWS or Azure, or try an example notebook [AWS] [Azure].

If you are new to MLflow, read the open source MLflow quickstart with the latest MLflow 1.7. For production use cases, read about Managed MLflow on Databricks and get started using the MLflow Model Registry.

And if you’re interested to learn about the latest developments and best practices for managing the full ML lifecycle on Databricks with MLflow, join our interactive MLOps Virtual Event.

--

Try Databricks for free. Get started today.

The post Databricks Extends MLflow Model Registry with Enterprise Features appeared first on Databricks.

Building a Modern Clinical Health Data Lake with Delta Lake

The healthcare industry is one of the biggest producers of data. In fact, the average healthcare organization is sitting on nearly 9 petabytes of medical data. The rise of electronic health records (EHR), digital medical imagery, and wearables is contributing to this data explosion. For example, an EHR system at a large provider can catalogue millions of medical tests, clinical interactions, and prescribed treatments. The potential to learn from this population-scale data is massive. By building analytic dashboards and machine learning models on top of these datasets, healthcare organizations can improve the patient experience and drive better health outcomes. Here are a few real-world examples:

  • Preventing Neonatal Sepsis
  • Early Detection of Chronic Disease
  • Tracking Disease Physiology Across Populations
  • Preventing Claims Fraud and Abuse

Top 3 Big Data Challenges for Healthcare Organizations

Despite the opportunity to improve patient care with analytics and machine learning, healthcare organizations face the classical big data challenges:

  • Variety – The delivery of care produces a lot of multidimensional data from a variety of data sources. Healthcare teams need to run queries across patients, treatments, facilities and time windows to build a holistic view of the patient experience. This is compute intensive for legacy analytics platforms. On top of that, 80% of healthcare data is unstructured (e.g. clinical notes, medical imaging, genomics, etc). Unfortunately, traditional data warehouses, which serve as the analytics backbone for most healthcare organizations, don’t support unstructured data.
  • Volume – Some organizations have started investing in health data lakes to bring their petabytes of structured and unstructured data together. Unfortunately, traditional query engines struggle with data volumes of this magnitude. A simple ad-hoc analysis can take hours or days. This is too long to wait when adjusting for patient needs in real time.
  • Velocity – Patients are always coming into the clinic or hospital. With a constant flow of data, EHR records may need to be updated to fix coding errors. It’s critical that a transactional model exists to allow for updates.

As if this wasn’t challenging enough, the data store must also support data scientists who need to run ad-hoc transformations, like creating a longitudinal view of a patient, or build predictive insights with machine learning techniques.

Fortunately, Delta Lake, an open-source storage layer that brings ACID transactions to big data workloads, along with Apache Spark™ can help solve these challenges by providing a transactional store that supports fast multidimensional queries on diverse data along with rich data science capabilities. With Delta Lake and Apache Spark, healthcare organizations can build a scalable clinical data lake for analytics and ML.

In this blog series, we’ll start by walking through a simple example showing how Delta Lake can be used for ad hoc analytics on health and clinical data. In future blogs, we will look at how Delta Lake and Spark can be coupled together to process streaming HL7/FHIR datasets. Finally, we will look at a number of data science use cases that can run on top of a health data lake built with Delta Lake.

Using Delta Lake to Build a Comorbidity Dashboard

To demonstrate how Delta Lake makes it easier to work with large clinical datasets, we will start off with a simple but powerful use case. We will build a dashboard that allows us to identify comorbid conditions (one or more diseases or conditions that occur along with another condition in the same person at the same time) across a population of patients. To do this, we will use a simulated EHR dataset, generated by the Synthea simulator, made available through Databricks Datasets (AWS | Azure). This dataset represents a cohort of approximately 11,000 patients from Massachusetts, and is stored in 12 CSV files. We will load the CSV files in, before masking protected health information (PHI) and joining the tables together to get the data representation we need for our downstream query. Once the data has been refined, we will use SparkR to build a dashboard that allows us to interactively explore and compute common health statistics on our dataset.

Example clinical health data lake architecture, demonstrating how Delta Lake can improve the exploration and analysis of large volumes of clinical data.

This use case is a very common starting point. In a clinical setting, we may look at comorbidities as a way to understand the risk of a patient’s disease increasing in severity. From a medical coding and financial perspective, looking at comorbid diseases may allow us to identify common medical coding issues that impact reimbursement. In pharmaceutical research, looking at comorbid diseases with shared genetic evidence may give us a deeper understanding of the function of a gene.

However, when we think about the underlying analytics architecture, we are also at a starting point. Instead of loading data in one large batch, we might seek to load streaming EHR data to allow for real-time analytics. Instead of using a dashboard that gives us simple insights, we may advance to machine learning use cases, such as training a machine learning model that uses data from recent patient encounters to predict the progression of a disease. This can be powerful in an ER setting where streaming data and ML can be used to predict the likelihood of a patient improving or declining in real-time.

In the rest of this blog, we will walk through the implementation of our dashboard. We will first start by using Apache Spark and Delta Lake to ETL our simulated EHR dataset. Once the data has been prepared for analysis, we will then create a notebook that identifies comorbid conditions in our dataset. By using built-in capabilities in Databricks (AWS | Azure), we can then directly transform the notebook into a dashboard.

ETLing Clinical Data into Delta Lake

To start off, we need to load our CSV data dump into a consistent representation that we can use for our analytical workloads. By using Delta Lake, we can accelerate a number of the downstream queries that we will run. Delta Lake supports Z-ordering, which allows us to efficiently query data across multiple dimensions. This is critical for working with EHR data, as we may want to slice and dice our data by patient, by date, by care facility, or by condition, amongst other things. Additionally, the managed Delta Lake offering in Databricks provides additional optimizations, which accelerate exploratory queries into our dataset. Delta Lake also future-proofs our work: while we aren’t currently working with streaming data, we may work with live streams from an EHR system in the future, and Delta Lake’s ACID semantics (AWS | Azure) make working with streams simple and reliable.

Our workflow follows a few steps that we show in the figure below. We will start by loading the raw/bronze data from our eight different CSV files, we will mask any PHI that is present in the tables, and we will write out a set of silver tables. We will then join our silver tables together to get an easier representation to work with for downstream queries.

Loading our raw CSV files into Delta Lake tables is a straightforward process. Apache Spark has native support for loading CSV, and we are able to load our files with a single line of code per file. While Spark does not have in-built support for masking PHI, we can use Spark’s rich support for user defined functions (UDFs, AWS | Azure) to define an arbitrary function that deterministically masks fields with PHI or PII. In our example notebook, we use a Python function to compute a SHA1 hash. Finally, saving the data into Delta Lake is a single line of code.

Loading raw CSV files into Delta Lake tables is a straightforward process.
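
The snippet below is a minimal sketch of this pattern rather than the notebook’s exact code; the paths and the masked column names (modeled on typical Synthea output) are illustrative:

import hashlib
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def mask(value):
    # Deterministic masking: the same input always maps to the same SHA1 token
    return hashlib.sha1(value.encode("utf-8")).hexdigest() if value is not None else None

# Load one raw (bronze) CSV file; the header row supplies the column names
patients_raw = spark.read.option("header", True).csv("/path/to/synthea/csv/patients.csv")

# Mask identifier columns before writing the silver table (column names are illustrative)
patients_silver = (patients_raw
                   .withColumn("SSN", mask(col("SSN")))
                   .withColumn("DRIVERS", mask(col("DRIVERS"))))

patients_silver.write.format("delta").mode("overwrite").save("/path/to/delta/silver/patients")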

Once data has been loaded into Delta, we can optimize the tables by running a simple SQL command. In our example comorbid condition prediction engine, we will want to rapidly query across both the patient ID and the condition they were evaluated for. By using Delta Lake’s Z-ordering command, we can optimize the table so it can be rapidly queried down either dimension. We have done this on one of our final gold tables, which has joined several of our silver tables together to achieve the data representation we will need for our dashboard.

Example use of Delta Lake’s Z-ordering command, to optimize the table so it can be rapidly queried down by either dimension.
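
As a minimal sketch of that step (the table and column names are illustrative, and OPTIMIZE ... ZORDER BY is Delta Lake SQL on Databricks), the Z-ordering command looks like this:

# Cluster the gold table so queries filtering on either patient or condition code stay fast
spark.sql("""
  OPTIMIZE patient_conditions
  ZORDER BY (PATIENT, CODE)
""")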

Building a Comorbidity Dashboard

Now that we have prepared our dataset, we will build our dashboard allowing us to explore comorbid conditions, or more simply put, conditions that commonly co-occur in a single patient. Some of the time these can be precursors or risk factors; for example, high blood pressure is a well-known risk factor for stroke and other cardiovascular diseases. By discovering and monitoring comorbid conditions and other health statistics, we can improve care by identifying risks and advising patients on preventative steps they can take. Ultimately, identifying comorbidities is a counting exercise! We need to identify the distinct set of patients who had both condition A and condition B, which means it can all be done in SQL using Spark SQL. In our dashboard, we will follow a simple three-step process:

  1. First, we create a data frame that has conditions, ranked by the number of patients they occurred in. This allows the user to visualize the relative frequency of the most common conditions in their dataset.

Example use of Spark SQL to create a data frame of health conditions, ranked by the number of patients they occurred in.

  2. We then give the user widgets (AWS | Azure) to specify two conditions they are interested in comparing. Using Spark SQL, we identify the full set of patients in which each condition occurred.

Example use of widgets to specify two health conditions for comparison.

  3. Since we do this in Spark SQL using SparkR, we can easily collect the counts of patients at the end and use a χ2 test to compute significance. We print whether or not the association between the two conditions is statistically significant. (A minimal Python analogue of this counting approach follows the list.)
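
The notebook performs this counting in Spark SQL via SparkR with R’s chisq.test; as a rough Python analogue (table, column, and condition names are illustrative, and scipy.stats.chi2_contingency stands in for chisq.test), the core counting and significance test look like this:

from scipy.stats import chi2_contingency

cond_a, cond_b = "Hypertension", "Stroke"   # in the dashboard these would come from widgets

# Distinct patients diagnosed with each condition (table and column names are illustrative)
patients_a = spark.sql(
    f"SELECT DISTINCT PATIENT FROM conditions WHERE DESCRIPTION = '{cond_a}'")
patients_b = spark.sql(
    f"SELECT DISTINCT PATIENT FROM conditions WHERE DESCRIPTION = '{cond_b}'")

n_total = spark.table("patients").select("Id").distinct().count()
n_a = patients_a.count()
n_b = patients_b.count()
n_both = patients_a.intersect(patients_b).count()

# 2x2 contingency table: rows = has condition A (yes/no), columns = has condition B (yes/no)
table = [[n_both, n_a - n_both],
         [n_b - n_both, n_total - n_a - n_b + n_both]]
chi2, p_value, _, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p-value={p_value:.4f}")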

While a data scientist who is rapidly iterating to understand what trends lie in their dataset may be happy working in a notebook, we will encounter a number of users (clinicians, public health officials and researchers, operations analysts, billing analysts) who are less interested in seeing the code underlying the analysis. By using the built-in dashboarding function, we can hide the code and focus on the visualizations we’ve generated. Since we added widgets into our notebook, our users can still provide input to the notebook and change which diseases to compare.

Example use of SparkR with Spark SQL to collect a count of patients and apply a χ2 test to compute significance.

Get Started Building Your Clinical Data Lake

In this blog, we laid down the fundamentals for building a scalable health data lake with Delta Lake and a simple comorbidity dashboard. To learn more about using Delta Lake to store and process health and clinical datasets:

--

Try Databricks for free. Get started today.

The post Building a Modern Clinical Health Data Lake with Delta Lake appeared first on Databricks.
