
Databricks, AWS, and SafeGraph Team Up For Easier Analysis of Consumer Behavior


This notebook was produced as a collaboration between SafeGraph and Databricks

  • Ryan Fox Squire, Product Data Scientist @ SafeGraph
  • Andrew Hutchinson, Solutions Architect @ Databricks
  • Prasad Kona, Partner Solutions Architect @ Databricks

We’ve created this Databricks notebook (.dbc download here) and published this blog so that you can hit the ground running using SafeGraph data from AWS Data Exchange in Databricks. For ready-to-run code, please see the complementary Databricks notebook.

To see the full SafeGraph dataset, visit the SafeGraph Data Bar.

Learn more – register now for this webinar: Building Reliable Data Pipelines for Machine Learning at SafeGraph

This blog will show you:

  • How to load SafeGraph Patterns data (a rich dataset on consumer points-of-interest) from Amazon S3 (via AWS Data Exchange) into a Databricks notebook.
  • How to take full advantage of Databricks Delta Lake technology.
  • How to use SafeGraph data to analyze offline consumer behavior and foot traffic to major corporate retail and restaurant brands, like Starbucks.
    • What times of day and days of the week is Starbucks most popular or least busy?
    • How long do customers stay during their visit to Starbucks?
    • How far from home do customers travel to visit Starbucks?
    • What are the cross-shopping brand preferences of Starbucks customers? What other stores do they visit?
    • How can I use SafeGraph data, combined with Census data, to do customer demographic analysis and build customer demographic profiles?

The first half of this notebook shows how to read, load, and prepare the data. The second half shows how to answer analytics questions using Spark SQL.
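For orientation, here is a minimal sketch of that loading step in PySpark. The S3 paths and table name are placeholders rather than the notebook’s actual parameters, and spark refers to the session that Databricks notebooks provide automatically.

# Read the exported SafeGraph Patterns CSVs from S3 (placeholder path)
patterns_raw = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("escape", "\"")
    .csv("s3a://YOUR_BUCKET/safegraph/patterns/"))

# Persist them as a Delta table for fast, repeated analytics
(patterns_raw.write
    .format("delta")
    .mode("overwrite")
    .save("s3a://YOUR_BUCKET/delta/safegraph_patterns"))

spark.sql("""
  CREATE TABLE IF NOT EXISTS safegraph_patterns
  USING DELTA
  LOCATION 's3a://YOUR_BUCKET/delta/safegraph_patterns'
""")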

Questions? Get in touch with us at datastories+aws@safegraph.com.

What is SafeGraph Patterns?

SafeGraph is a geospatial data company focused on understanding the physical world. SafeGraph Patterns is a dataset of 3.6MM commercial brick-and-mortar points-of-interest (POI) in the USA and includes anonymized counts of how many people visit these POI each month. The counts of visitors are derived from an anonymized panel (sample of population that is measured longitudinally) of ~35MM mobile devices (e.g., smart phones) in the USA.

SafeGraph Patterns is designed to answer questions like:

  • How many people are visiting a place? How frequently are they visiting?
  • How many unique visitors from our panel went to this place?
  • On average, what census areas do visitors come from?
  • What is the cross-shopping behavior of visitors from one POI to another?
  • What times of day and days of the week do people visit?
  • How far from home do visitors travel to visit this place?
  • How long do people stay at this place when they visit?

Protecting individual consumer privacy is at the core of the SafeGraph mission:

“SafeGraph’s mission is to make the world’s data open for innovation while protecting individuals’ privacy.” – SafeGraph Vision and Values

The devices in the panel are fully anonymized; no identity or demographic information exists for devices in the panel, and individual device-level data is not present in SafeGraph products. The aggregated form of SafeGraph Patterns helps to ensure the protection of individuals’ privacy, while also providing actionable data for statistical analysis and data science. For all the details on SafeGraph Patterns, see the SafeGraph Patterns docs.

What is Databricks?

Databricks is a unified analytics platform that enables data science, data engineering, and business analytics teams to derive value from data at scale, with ease of use, and in a collaborative manner.

At its core, the Databricks platform is powered by Apache Spark and Delta Lake in a cloud-native architecture, which gives users virtually unlimited horsepower to acquire, clean, transform, combine, and analyze data sets within minutes from a notebook interface, with popular languages of choice (Python, Scala, SQL, R).

Because Databricks is a managed platform, customers do not have to become big data DevOps gurus to power their analytical needs, which reduces the administrative burden, costs, and risks of their data-driven projects.

Delta Lake, as also featured in the SafeGraph notebooks below, brings unique capabilities to the Databricks platform:

  • Reliability: Delta Lake improves the integrity of data sets in the data lake by making data engineering pipelines transactional. ACID semantics, applied to data engineering and machine learning, give customers confidence that they are doing analytics on high-quality data; problems such as partially ingested datasets, dirty reads, and concurrent consistent access to fresh data are automatically taken care of.
  • Performance: Delta Lake has specific optimizations under the hood, such as smart caching, auto-collection of stats, compaction, and z-ordering, which speed up both data engineering pipelines and the reporting done on cleaned data (a brief sketch follows this list).
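As a small illustration of those optimizations, the sketch below shows how compaction and z-ordering can be triggered on a Delta table from a notebook; the table and column names are placeholders.

# Compact small files and co-locate data by a frequently filtered column
spark.sql("OPTIMIZE safegraph_patterns ZORDER BY (safegraph_place_id)")

# The transaction log records every operation, which is what keeps concurrent reads consistent
spark.sql("DESCRIBE HISTORY safegraph_patterns").show(truncate=False)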

How do we load SafeGraph Patterns from AWS Data Exchange into Databricks Data Lake?

To demonstrate the power of SafeGraph data inside Databricks, we are highlighting three datasets from SafeGraph currently available for free inside AWS Data Exchange.

  1. SafeGraph Patterns – Starbucks in the USA
  2. SafeGraph Core Places – Starbucks in the USA
  3. SafeGraph Open Census Data

Follow these steps to subscribe to SafeGraph datasets in AWS Data Exchange:

  • Go to the AWS Data Exchange service in your AWS account and search for “SafeGraph Patterns Census – Starbucks in the USA”

  • Subscribe to the three SafeGraph datasets above from the AWS Data Exchange UI

  • The subscription process takes a few minutes – once it’s complete, the subscriptions appear in the Subscriptions UI

  • Export all three subscribed datasets to an S3 bucket of your choice by clicking the dataset name in the Subscriptions UI and following the export-to-S3 flow from its revision ID.

  • Once the datasets have been exported to your S3 bucket of choice, download the Databricks notebook from the Databricks link on any one of the datasets

  • Create and start an interactive Databricks cluster
    • Instructions on how to create an interactive Databricks cluster
    • A two-node i3.2xl cluster should suffice
    • Ensure your cluster has rights to access the S3 bucket to which you exported the AWS Data Exchange SafeGraph datasets
  • Import the Databricks notebook you downloaded from SafeGraph’s AWS Data Exchange UI
  • Attach the imported notebook to the cluster
  • Update the notebook parameters to point to your S3 bucket
    • Replace the “Delta External Table Location” parameter at the top of the notebook to point to a folder of your choice in the S3 bucket configured above – this is where Databricks will write the optimized Delta datasets
    • Replace the Open Census, SafeGraph Core Places, and SafeGraph Patterns parameters to point to the respective AWS Data Exchange datasets you exported to your S3 bucket

  • Click Run All to execute the notebook

  • The notebook then parses, cleans, and joins the above datasets and converts them into Delta tables for fast analytics at scale – all of this work runs on the Databricks cluster you created.

  • This is the experience customers get regardless of the size of the data involved – users focus on analytics while the underlying Databricks clusters automatically scale to handle petabytes of data, without anyone having to become a big data DevOps expert.

What can I learn about consumer behavior using SafeGraph data in Databricks?

Once you have SafeGraph data loaded into Databricks, a wealth of answers about consumer behavior is at your fingertips.

To see these implemented in a Databricks notebook, check out the accompanying Demo Notebook.

What time of day do people visit Starbucks?

With a few lines of code you can examine the relative popularity of individual Starbucks locations, as well as the average popularity across Starbucks nationwide. Each safegraph_place_id is a unique Starbucks location. The x-axis shows each hour of the day (local time) from midnight (0) to 11pm (23). The y-axis reflects how many visits happen at each hour, summed across all the days of the month, as a percent of the total visits for the entire month. (Note: visits that cross hour boundaries are counted in multiple hours, so the total percentage across all hours may add up to more than 100%.)
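The sketch below shows one way to compute this metric in PySpark. The table name is a placeholder, and the popularity_by_hour and raw_visit_counts column names follow the SafeGraph Patterns docs but should be verified against your loaded schema; popularity_by_hour is assumed to arrive as a JSON-style string of 24 integers.

from pyspark.sql import functions as F

patterns = spark.table("safegraph_patterns")

# Parse the 24-element array and explode it into one row per (place, hour)
by_hour = (patterns
    .withColumn("hours", F.from_json("popularity_by_hour", "array<int>"))
    .select("safegraph_place_id", "raw_visit_counts",
            F.posexplode("hours").alias("hour", "visits"))
    .withColumn("pct_of_monthly_visits",
                100.0 * F.col("visits") / F.col("raw_visit_counts")))

by_hour.orderBy("safegraph_place_id", "hour").show(24)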

We see that although traffic certainly ramps up during the morning, peak traffic is actually around 12pm and 1pm.

What days of the week do people visit Starbucks?

We can ask the same question about which days of the week are popular.

Looking at 20 random Starbucks locations, we see that on average no days are strongly preferred over others. However, some POI do show interesting weekend vs. weekday differences.

We can examine one of these POI and compare it to the national average.

This data shows that, on average nationally, the busiest days of the week at Starbucks are Wednesdays and Thursdays, although this is a mild preference. In contrast, safegraph_place_id sg:68513387500e48eb87d719207d058309 shows a very different pattern and is significantly less popular during the weekends compared to weekdays.

To visualize where this POI is located, you can read the (latitude, longitude) from the SafeGraph dataset and search for it in Google Maps. It turns out that this particular Starbucks is located on the campus of the Boston University School of Law. Presumably the fact that classes are not held during weekends is causing this very large weekday vs weekend difference.

How far do people travel to visit Starbucks?

SafeGraph reports the median distance travelled (from the home census block group) for each POI. Using this we can construct a histogram of Starbucks locations, showing how far people travel to visit Starbucks.

This data shows that most Starbucks locations draw visitors that live less than 10 kilometers away. However, there is a long, thin tail of Starbucks locations where the median distance from home is hundreds of kilometers. These locations are likely in high-tourist or high-commute areas (like in an airport) where most visitors do not live geographically nearby.
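A PySpark sketch of such a histogram follows. The table name is a placeholder, and distance_from_home is assumed (per the SafeGraph docs) to be reported in meters, so verify the unit against your export.

from pyspark.sql import functions as F

patterns = spark.table("safegraph_patterns")

# Bucket each location's median distance-from-home into 5 km bins
hist = (patterns
    .withColumn("km_from_home", F.col("distance_from_home") / 1000.0)
    .withColumn("km_bucket", F.floor(F.col("km_from_home") / 5) * 5)
    .groupBy("km_bucket")
    .agg(F.count("*").alias("num_locations"))
    .orderBy("km_bucket"))

hist.show()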

What are the cross-shopping preferences of Starbucks customers?

The columns related_same_month_brand and related_same_day_brand report an index of how frequently visitors to a POI also visit other brands (relative to the average visitor rate to that brand).

Here we look at what other brands are frequently visited by customers of Starbucks. The larger the index, the more frequently Starbucks customers visit that brand.

Although Starbucks is a national chain, cross-brand shopping is highly influenced by local geography. Here we show the top 5 cross-shopping brands for Starbucks customers in California, New York, and Texas. Only McDonald’s is in the Top 5 of all 3 states.
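A hedged sketch of the brand-level aggregation is below. The table name is a placeholder and related_same_month_brand is assumed to arrive as a JSON map string of brand to index; repeating the aggregation after joining the state (region) column from Core Places on safegraph_place_id would yield the per-state rankings described above.

from pyspark.sql import functions as F

patterns = spark.table("safegraph_patterns")

# Parse the brand -> index map and average the index across all Starbucks locations
cross_shopping = (patterns
    .withColumn("related", F.from_json("related_same_month_brand", "map<string,int>"))
    .select(F.explode("related").alias("brand", "index"))
    .groupBy("brand")
    .agg(F.avg("index").alias("avg_index"))
    .orderBy(F.desc("avg_index")))

cross_shopping.show(5)  # top cross-shopped brands nationwide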

Analyzing a Brand’s Customer Demographics

You can use SafeGraph data from AWS Data Exchange in Databricks to analyze the customer demographics of individual POI or brands. For a deep dive on the methodology, along with a more complete statistical analysis, see this workbook.

Here we analyze Starbucks customer demographics along the race demographic dimension using data available from SafeGraph in AWS Data Exchange.

This analysis could be repeated for any demographic information tracked by the Census, and reported at the census block group level. That includes Ethnicity, Educational Attainment, Household Income, and much, much more.

To do this analysis we will use:

  • Census data (from Open Census Data)
  • SafeGraph Patterns data, specifically the visitor_home_cbgs column
  • SafeGraph Panel Overview data

The y-axis shows the % of total visitors for each demographic segment.

The baseline demographics of the United States are shown as a reference. SafeGraph Patterns shows interesting differences between the census-area demographics of Starbucks customers and the overall USA population:

  • SafeGraph Patterns data shows that on average, the home census block groups (CBGs) of Starbucks customers are 78.4% White, whereas the USA population is only 73.3% White. In other words, the home census areas of Starbucks customers are a larger fraction White than the US population.
  • The home CBGs of Starbucks customers are a larger fraction Asian, compared to the USA population.
  • The home CBGs of Starbucks customers are a smaller fraction Black or African American compared to the overall USA average.

 

Importantly, these differences are not due to geographic sampling bias in the SafeGraph dataset.  It is true that the SafeGraph dataset has some small geographic biases. For a full report see “What about bias in the SafeGraph dataset?”. However, we are able to measure and correct the small effects of sampling bias in the SafeGraph dataset as part of the cbg_adjust_factor calculation. If the differences observed were due solely to geographic sampling bias in the SafeGraph dataset, then they would disappear after the correction. The differences that remain cannot be attributed to sampling bias. For a thorough discussion on this methodology, see A Workbook to Analyze Demographic Profiles from SafeGraph Patterns Data.
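To make the mechanics concrete, here is a rough sketch of the join behind such a profile. The census table and its demographic column names are placeholders, visitor_home_cbgs is assumed to arrive as a JSON map string of CBG to visitor count, and the cbg_adjust_factor bias correction described above is omitted for brevity.

from pyspark.sql import functions as F

patterns = spark.table("safegraph_patterns")
census = spark.table("open_census_data")  # assumed: one row per census_block_group

# One row per (census block group, visitor count) across all Starbucks locations
visitors_by_cbg = (patterns
    .withColumn("home_cbgs", F.from_json("visitor_home_cbgs", "map<string,int>"))
    .select(F.explode("home_cbgs").alias("census_block_group", "visitors")))

# Visitor-weighted demographic mix of the home CBGs (placeholder census columns)
profile = (visitors_by_cbg
    .join(census, "census_block_group")
    .agg(
        (F.sum(F.col("visitors") * F.col("pct_white")) / F.sum("visitors")).alias("pct_white"),
        (F.sum(F.col("visitors") * F.col("pct_asian")) / F.sum("visitors")).alias("pct_asian"),
        (F.sum(F.col("visitors") * F.col("pct_black")) / F.sum("visitors")).alias("pct_black")))

profile.show()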

Summary

  • Reading SafeGraph data from AWS Data Exchange into Databricks is quick and easy.
  • Combining these technologies and datasets enables you to answer powerful and precise questions about consumer behavior.

Thanks for reading!

Want to get more SafeGraph data?

  • There are over 20 datasets available for free or for purchase in AWS Data Exchange. Check them out!
  • And you can download CSVs for data on over 6MM points-of-interest at the SafeGraph Data Bar. Use coupon code SafeGraphAWSDatabricksNotebook for $200 of free data.
  • Questions on this notebook? Drop us a line at datastories+aws@safegraph.com

--

Try Databricks for free. Get started today.

The post Databricks, AWS, and SafeGraph Team Up For Easier Analysis of Consumer Behavior appeared first on Databricks.


Azure Databricks Highlights Adoption of Delta Lake, MLflow, and Integration with Azure Machine Learning at Microsoft Ignite 2019


At Microsoft Ignite 2019, thousands of attendees participated in hands-on workshops, breakout sessions, and theater presentations to learn how customers are achieving phenomenal results with Azure Databricks! It was an action-packed week of making new connections and learning about new innovation across data science, data engineering, and business analytics.

Azure Databricks -- Amazing growth in less than 2 years, including thousands of global Azure Databricks customers, millions of server-hours spinning up every day, 2 exabytes of data processed per month, and global coverage with 29 Regions available worldwide.

We shared the news that over 75% of data processed on Azure Databricks is in Delta Lake — the new open source standard for data lakes. Hands-on labs and breakout sessions gave attendees an opportunity to see and experience Delta Lake on Azure Databricks first hand.

Delta Lake -- Open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads, including 75% of data processed in Azure Databricks.
Delta Lake’s openness and extensibility enable faster innovation and more effective use of data. Azure Databricks has seen amazing growth over the past two years. This rapid growth has created the need for new tools and capabilities, such as the MLflow Model Registry and ML lifecycle management, which help customers track, deploy, and update their ML models. Attendees learned three ways Azure Databricks works with Azure Synapse Analytics to bring analytics, business intelligence (BI), and data science together in one solution architecture.

Azure Databricks momentum and acceleration were highlighted in many sessions during the week of Ignite 2019. Below are just a few sessions to give you an idea of the breadth of customer use cases driving such phenomenal growth.

Azure Databricks Sessions at Ignite 2019

  • Delta Lake on Azure Databricks: Implementing a new open source standard for Data Lakes (THR2339) – Ajay Singh from Databricks shared how you can enhance your data lakes for greater value by adding new capabilities for transactions, version control and indexing using the latest open source innovations. Founded by the original creators of Apache Spark, Databricks has continued to innovate by launching a new open source project called Delta Lake designed to make existing data lakes more scalable and reliable. Built on top of infinitely scalable Azure Data Lake Storage and deeply integrated with Azure SQL Data Warehouse, Delta Lake on Azure Databricks makes data better prepared for analytic and machine learning workloads. Available today as part of Azure Databricks, see why large organizations are upgrading their data lake in-place with zero migration to Delta Lake!
  • Managing your ML lifecycle with Azure Databricks and Azure Machine Learning (BRK3245) – Premal Shah from Microsoft and Mike Cornell from Databricks shared how machine learning development has new complexities beyond software development. There are a myriad of tools and frameworks which make it hard to track experiments, reproduce results, and deploy machine learning models. Learn how you can accelerate and manage your end-to-end machine learning lifecycle on Azure Databricks using MLflow and Azure ML to reliably build, share, and deploy machine learning applications using Azure Databricks.
  • Hands-on with Azure Databricks Delta Lake (WRK2013) – In this lab, Kyle Weller, Premal Shah, Shiva Nimmagadda Venkata, and Santosh Perla from Microsoft guided attendees as they learned how to increase performance and reliability of their data engineering pipelines with Azure Databricks Delta Lake. Delta Lake provides optimized layouts to enable big data use cases, from batch and streaming ingests, fast interactive queries, to machine learning.
  • Maximizing your Azure Databricks deployment (BRK3043) – Yatharth Gupta from Microsoft shared insights and key steps to get the most out of your Azure Databricks resources. Whether you’re new to Spark or an Azure Databricks veteran, attend and learn the tips, tricks, and best practices for working with Azure Databricks. Based on experiences with real industry customers, these best practices cover deployments, management, security, networking, monitoring, machine learning, and more. See how the power of Azure Databricks’ managed Spark experience can benefit your analytics pipeline!
  • Azure Databricks and Azure Machine Learning better together (THR2186) – In this theater session, Premal Shah from Microsoft demonstrated the new integrations between Azure Databricks and Azure Machine Learning to realize comprehensive E2E machine learning scenarios for our customers.
  • Data reproducibility, audits, immediate rollbacks and other applications of time travel with Delta Lake in Azure Databricks (BRK3254) – Time travel is now possible with Azure Databricks Delta Lake! Kyle Weller from Microsoft uncovered how Delta Lake makes time travel possible and why it matters to you. Through presentation, notebooks, and code, he showcased several common applications and how they can improve your modern data engineering pipelines. Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™. It provides snapshot isolation for concurrent reads and writes, enables efficient upserts, deletes, and immediate rollbacks, and allows background file optimization through compaction and z-order partitioning, achieving up to 100x performance improvements. In this presentation, learn what challenges Delta Lake solves, how Delta Lake works under the hood, and applications of the new Delta Time Travel capability (a minimal sketch follows this list).
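As a small illustration of the time travel capability described in that session, the sketch below queries earlier snapshots of a Delta table from a notebook; the table name, timestamp, and version number are placeholders.

# Inspect the table's change history recorded in the Delta transaction log
spark.sql("DESCRIBE HISTORY events").show(truncate=False)

# Query the table as it looked at a point in time or at a specific version
as_of_time = spark.sql("SELECT * FROM events TIMESTAMP AS OF '2019-11-01 00:00:00'")
as_of_version = spark.sql("SELECT * FROM events VERSION AS OF 5")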

Get Started with Azure Databricks

The fun doesn’t stop there! Keep learning more:

Follow us on Twitter, LinkedIn, and Facebook for more Azure Databricks news, customer highlights, and new feature announcements.

--

Try Databricks for free. Get started today.

The post Azure Databricks Highlights Adoption of Delta Lake, MLflow, and Integration with Azure Machine Learning at Microsoft Ignite 2019 appeared first on Databricks.

A Day at the 2019 Women in Product Conference


Databricks team members that participated in the 2019 Women in Product Conference

From left to right: Shveta, Julia, Yardley Pohl (a WIP board member and co-founder), Anna, Allie, Cyrielle, and Rani at the Databricks booth

Databricks was a proud sponsor of the 2019 Women In Product conference which focuses on empowering women in product management and advocating for equal representation. We had a booth and happy hour where attendees could network with our product team, and learn more about our open roles. Read more about what this experience meant to the team members that attended below!

What led you to your current position at Databricks?

Anna Shrestinian, Senior Product Manager
I am a product manager on identity and data security at Databricks. I work to solve our customers’ challenges with securing data at an enterprise scale.

Rebecca Li, Staff Product Manager, Pricing and Portfolio
I am a product manager on pricing, cost management, and usage visibility. I work to design a packaging & pricing model that drives Databricks’ adoption in the market; and optimizes the long-term revenue of the company. At the same time, I work to make sure customers have the right tools to closely monitor their usage and cost.

Allie Emrich, Senior Program Manager, Product
I’m a program manager on the product team who’s in charge of coordinating and communicating much of our roadmap and the updates that occur throughout the quarter. Program management is a funny position because it doesn’t necessarily have a conventional path. I started out working on the customer success side, and really enjoyed cross-departmental projects focusing on product launches. When I saw the role at Databricks that leveraged my skill set on the CS side, but allowed me to grow operationally on the product side, I knew it was an opportunity I couldn’t pass up.

Cyrielle Simeone, Senior Product Marketing Manager, Data Science and Machine Learning
I’m a product marketing manager for data science and machine learning at Databricks. My role primarily consists of developing deep domain and audience understanding, product messaging collateral, as well as go-to-market strategies and campaigns. It’s funny, I didn’t originally plan on becoming a PMM, but I’m so glad I did. I graduated from a French generalist engineering school in 2006 with a specialization in image and signal processing. While curious and fascinated by math and science, I realized early on in my career that I was more interested in enabling people with technology and learning how they solved problems rather than becoming a specialist myself. The PMM role at Databricks is perfect for me in the sense that it sits at the intersection of product, customers, and sales, and gives me the opportunity to collaborate with extremely talented people across functions to bring to market ML products with a fantastic community fit, like MLflow. It allows me to stay abreast of practitioner needs and challenges, industry trends, and use cases across industries. It is a very diverse and demanding role, that can be very rewarding when executed well.

Julia Han, Engineering Program Manager
I’m an engineering program manager on the technical program management team. The technical program management team works on ensuring that there is alignment between the engineering and product management orgs. I work on reducing engineering overhead for project management and ensuring that proper coordination and communication is in place to facilitate execution for the engineering teams.

2019 Women in Product Keynote Session with Diane von Furstenberg and Anna Shrestinian, Senior Product Manager, Databricks.

Keynote Session with Diane von Furstenberg at the conference

What did sponsoring the Women in Product conference mean to you?

Anna Shrestinian, Senior Product Manager
The energy at the Women in Product conference was invigorating. It was amazing to be surrounded by leaders in the field and hear their insights on Product skills as well as their inspiring stories. I am proud to see Databricks’ name alongside best-in-class organizations as a Gold Sponsor. It shows me our commitment to diversity and inclusion.

Rebecca Li, Staff Product Manager, Pricing and Portfolio
I was pleasantly surprised by the diversity of industries, experiences, and career trajectories of our guest speakers. Each of them shed light on a different area of expertise and shared their unique story of how they became who they are today. It was very inspiring and encouraging to understand their mindsets, their struggles, and how they overcame them to build their unique characters. The crowd was very energetic and engaged. You could tell the message resonated deeply with the audience, and it is a community of women in product that can help each other grow. It was a great experience, and there should be more events like it!

Allie Emrich, Senior Program Manager, Product
Many companies are focused on increasing gender diversity, but it’s great to be part of a company that is visibly taking action to help move the needle. The atmosphere at the conference was vivacious and contagious, full of attendees who were excited to be there and connect with one another. Getting our name out there and teaching people what Databricks is all about was an experience for the books!

Cyrielle Simeone, Senior Product Marketing Manager
It was very exciting and humbling to be part of this initiative and represent Databricks at the Women in Product conference. It demonstrates real commitment and concrete initiatives toward more diversity and inclusion as we continue to scale, which is key. As a woman in tech, it was fantastic to meet and connect with the community, hear the experiences and aspirations of many women throughout the day, and introduce them to Databricks. The energy was vibrant and infectious; I look forward to the next one!

Julia Han, Engineering Program Manager
As a woman in tech myself, it’s always eye opening to be able to hear and learn about the experiences of other women who are in tech. As Databricks scales, we want to see gender diversity make progress across the board by learning from other companies on their efforts and vice versa. It’s exciting to be part of a company that supports women who are seeking to connect and search for growth opportunities.

We are so excited to be part of a team that promotes growth and development and encourages us to learn from other subject matter experts in the field. Interested in working with our product teams here at Databricks? Check out our Careers Page!

Learn More about Databricks Involvement with Women in Tech:

https://databricks.com/sparkaisummit/north-america/women-in-ua

--

Try Databricks for free. Get started today.

The post A Day at the 2019 Women in Product Conference appeared first on Databricks.

Deep Learning Tutorial Demonstrates How to Simplify Distributed Deep Learning Model Inference Using Delta Lake and Apache Spark™


On October 10th, our team hosted a live webinar—Simple Distributed Deep Learning Model Inference—with Xiangrui Meng, Software Engineer at Databricks.

Simple Distributed Deep Learning Model Inference Webinar

Model inference, unlike model training, is usually embarrassingly parallel and hence simple to distribute. However, in practice, complex data scenarios and compute infrastructure often make this “simple” task hard to do from data source to sink.

In this webinar, we provided a reference end-to-end pipeline for distributed deep learning model inference using the latest features from Apache Spark and Delta Lake. While the reference pipeline applies to various deep learning scenarios, we focused on image applications, and demonstrated specific pain points and proposed solutions.

The walkthrough starts from data ingestion and ETL, using the binary file data source from Apache Spark to load and store raw image files into a Delta Lake table. A small code change then enables Spark Structured Streaming to continuously discover and import new images, keeping the table up to date. From the Delta Lake table, a Pandas UDF is used to wrap single-node code and perform distributed model inference in Spark.
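Condensed into a few lines, the shape of that pipeline looks roughly like the sketch below. The paths are placeholders, load_my_model() is a hypothetical helper standing in for your own model loading, and the pandas UDF style shown matches the Spark 2.4-era API discussed in the webinar.

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Ingest raw image files with the binary file data source (placeholder path);
# switching spark.read to spark.readStream turns this into continuous ingestion.
raw = (spark.read.format("binaryFile")
       .option("pathGlobFilter", "*.jpg")
       .load("/mnt/raw-images/"))

# Land the raw bytes in a Delta table (the ETL step)
raw.write.format("delta").mode("append").save("/mnt/delta/images")

images = spark.read.format("delta").load("/mnt/delta/images")

@pandas_udf("double", PandasUDFType.SCALAR)
def predict_udf(content):
    # load_my_model() is hypothetical; in practice, load the model once per worker
    model = load_my_model()
    return content.apply(lambda b: float(model.predict(b)))

# Distributed model inference: score every image in parallel across the cluster
scored = images.withColumn("score", predict_udf("content"))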

We demonstrated these concepts using these Simple Distributed Deep Learning Model Inference Notebooks and Tutorials.

Here are some additional deep learning tutorials and resources available from Databricks.

If you’d like free access to the Databricks Unified Analytics Platform and want to try our notebooks on it, you can start a free trial here.

--

Try Databricks for free. Get started today.

The post Deep Learning Tutorial Demonstrates How to Simplify Distributed Deep Learning Model Inference Using Delta Lake and Apache Spark™ appeared first on Databricks.

How Databricks and Privacera Combine to Secure Data for Cloud Analytics


In their quest to anticipate customer needs, forward-looking organizations are turning to cloud-based analytics and AI to innovate. But we often hear from customers how challenging it is to manipulate large data volumes in a secure and compliant way. Databricks and Privacera have partnered to help customers address several key use cases for cloud-based analytics with a fine-grained access control and data classification solution.

For customers using Apache Ranger, the transition is especially easy: the Privacera platform is based on Apache Ranger and developed by its original team, providing centralized policy management and dynamic data masking.

The Databricks and Privacera Joint Solution Addresses the Following Key Data Security Use Cases

1. Create a Sensitive Data Catalog to Comply with GDPR, CCPA

The Databricks and Privacera joint solution enables automatic scanning of incoming data into Databricks to identify and profile sensitive data. By tagging this data into a scalable metadata store, it builds a centralized data catalog of sensitive data including PII information.

Thus, enterprises can comply with regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) while maintaining data privacy.

2. Ensure Centralized Data Governance for Analytics and AI/ML

The partnership enables organizations to create fine-grained access control policies at the row, column, and file level natively on Delta Lake tables. Role-based, fine-grained access allows centralized governance, so users with access can use more data for analytics and ML instead of being denied access at the whole-table level.

The integration also features a Privacera plugin natively inside Databricks to provide authorization for Spark SQL in Databricks. This ensures continuous flow of data for users with access permissions when they run queries in their Databricks environment.

The Privacera plugin operates natively inside Databricks, which provides authorization for Spark SQL and ensures the continuous flow of data for users with access permissions.

3. Make More Data Available without Compromising Data Privacy

The joint solution also provides de-identification to anonymize sensitive data in Databricks, so that data engineering teams can make sensitive data available to data scientists and analysts. This preserves data privacy while maintaining the data’s referential integrity and analytical value. Data scientists and analysts can then work with anonymized or masked data in Databricks for analytics and ML/AI workloads, while the enterprise retains centralized control over the complete data sets.

Available Today

The integration is available today. To learn more, check out this detailed blog co-authored by Databricks and Privacera.

More Resources

https://databricks.com/databricks-enterprise-security

https://docs.databricks.com/security/index.html

https://www.privacera.com/databricks.php

--

Try Databricks for free. Get started today.

The post How Databricks and Privacera Combine to Secure Data for Cloud Analytics appeared first on Databricks.

Brickster Spotlight: Meet Alex


At Databricks, we help data teams solve the world’s toughest problems and we couldn’t do that without our wonderful Databricks Team. “Teamwork makes the dream work” is not only a central function of our product but it is also central to our culture. Learn more about Alex, our Director of Resident Solution Architects in Databricks’ Customer Success organization, and about what he does here at Databricks!

Alex and his team at a Customer Success Offsite

Alex (upper right hand corner) and his team at a Customer Success Offsite

Tell us a little bit about yourself

I am an aerospace engineer who became a data scientist by accident before the title was popularized and data sets became bigger, faster and more complex. I lived through the (often painful) evolution from data warehouses, to on-prem data lakes, and finally to the cloud while working for a large enterprise. I was a user of the Databricks Unified Data Analytics Platform prior to joining the company and became a believer in the product vision since it solved so many fundamental problems in big data. I’ve been with Databricks for one year as a leader in our Customer Success organization, based in our Midtown Manhattan office. I live in Brooklyn with my wife, baby girl and dog, enjoying the NYC lifestyle.

What were you looking for in your next opportunity, and why did you choose Databricks?

Databricks was a step-change for me both in technology and career growth. I lead a team called Resident Solutions Architects in our Customer Success organization. This is a diverse group of customer-facing big data architects who work across the data engineering and machine learning spectrum to solve our customers’ most challenging use-cases in the field. The variety and complexity of the problems we solve is unparalleled. We are learning and adding value in every business vertical and feeding those insights back into the product continuously. We design massively scalable solutions, write production-quality code and become trusted advisors to the data teams of the most successful enterprises in the world.

What gets you excited to come to work every day?

As an engineer, this is an easy one. At Databricks, we are solving the most challenging big data problems in the world, with an amazing culture of teamwork and customer focus. Let’s do this.

One of our core values at Databricks is to be an owner. What is your most memorable experience at Databricks when you owned it?

Databricks’ 1,000 Employee milestone celebration in the NYC office

Databricks’ 1,000 Employee milestone celebration in the NYC office

Rapidly scaling our diverse team is my number one priority by a wide margin. Inspiring the best people in the big data and machine learning industry to grow with our team is how Databricks will continue its rapid success. Our East Coast team in particular has increased by 3X over the last year to keep up with customer demand. At this pace, owning our team culture becomes a major focus for everyone, and my biggest accomplishment is simply enabling our team to learn from each other and continue to do their magic in the field at an ever-growing scale.

What has been the biggest challenge you’ve faced, and what is a lesson you learned from it?

A great challenge of any field team is sharing best practices across the globe so that our customers benefit from the rapidly advancing state of the art. Here’s the good news: we have an amazing team of industry experts with the answers. Databricks invests time so that our experts can drive initiatives that benefit the company and our customers. This is a unique aspect of our Customer Success organization: time is specifically carved out of their schedules to build scalable solutions and grow their careers. Our Resident Solution Architects and Sr. Solutions Consultants have designed and implemented next-gen security features, created an Auto ML Toolkit and ML Feature Factory under our Databricks Labs initiative, and actively share Apache Spark optimization best practices with the community, to name just a few examples. This approach of investing in our experts allows us to scale and ensures we are capturing everything we are learning in the field. Did I mention that we are hiring?

Databricks has grown tremendously in the last few years. How do you see the future of Databricks evolving and what are you most excited to see us accomplish?

We are a product company, and the main driver of our success is aligning people, technology, and business value. It’s a very satisfying win-win for our customers, and I think that the way we treat our customers is truly an industry differentiator for Databricks. It’s amazing to think that we are just getting started, and I’m excited to see our growth globally.

What advice would you give to field engineering professionals who are starting their careers?

There are so many early career engineers that don’t even know that they will end up as field professionals – including me. So, first be aware of all the opportunities that are out there. Whether it’s sales, support, or customer success, there is a place for you in the field if you want to work directly with customers and solve problems in a very dynamic environment. Any field role will simultaneously accelerate your career from both a technical and leadership perspective. Wait, did I mention that we are hiring?

Interested in helping solve our customers’ most challenging use-cases in the field? Learn more about Alex’s team and check out our Careers Page.

--

Try Databricks for free. Get started today.

The post Brickster Spotlight: Meet Alex appeared first on Databricks.

Migration from Hadoop to modern cloud platforms: The case for Hadoop alternatives


Companies rely on their big data and analytics platforms to support innovation and digital transformation strategies. However, many Hadoop users struggle with complexity, unscalable infrastructure, excessive maintenance overhead and overall, unrealized value. We help customers navigate their Hadoop migrations to modern cloud platforms such as Databricks and our partner products and solutions, and in this post, we’ll share what we’ve learned.

Challenges with Hadoop Architectures

Teams migrate from Hadoop for a variety of reasons. It’s often a combination of “push” and “pull”: limitations with existing Hadoop systems are pushing teams to explore Hadoop alternatives, and they’re also being pulled by the new possibilities enabled by modern cloud data architectures. While the architecture requirements vary for different teams, we’ve seen a number of common factors with customers looking to leave Hadoop.

  • Poor data reliability and scalability: A pharmaceutical company had data-scalability issues with its Hadoop clusters, which could not scale up for research projects or scale down to reduce costs. A consumer brand company was tired of its Hadoop jobs failing, leaving its data in limbo and impacting team productivity.
  • Time and resource costs: One retail company was experiencing excessive operational burdens given the time and headcount required to maintain, patch, and upgrade complicated Hadoop systems. A media start-up suffered reduced productivity because of the amount of time spent configuring its systems instead of getting work done for the business.
  • Blocked projects: A logistics company wanted to do more with its data, but the company’s Hadoop-based data platform couldn’t keep up with its business goals—the team could only process a sample of their imaging data, and they had advanced network computations that couldn’t be finished within a reasonable period of time. Another manufacturing company had data stuck in different silos, some in HPC clusters, others on Hadoop, which was hindering important deep learning projects for the business.

Beyond the technical challenges, we’ve also had customers raise concerns around the long term viability of the technology and the business stability of its vendors. Google, whose seminal 2004 paper on MapReduce underpinned the open-source development of Apache Hadoop, has stopped using MapReduce altogether, as tweeted by Google SVP Urs Hölzle: “… R.I.P. MapReduce. After having served us well since 2003, today we removed the remaining internal codebase for good…” These technology shifts are reflected by the consolidation and purchase activity in the space that Hadoop-focused vendors have seen.  This collection of concerns has inspired many companies to re-evaluate their Hadoop investments to see if the technology still meets their needs.

Shift toward Modern Cloud Data Platforms

Data platforms built for cloud-native use can deliver significant gains compared to legacy Hadoop environments, which “pull” companies into their cloud adoption. This also includes customers that have tried to use Hadoop in the cloud. Here are some results from a customer that migrated to Databricks from a cloud based Hadoop service.

  • Up to 50% performance improvement in data processing job runtime
  • 40% lower monthly infrastructure cost
  • 200% greater data processing throughput
  • Security environment credentials centralized across six global teams
  • Fifteen AI and ML initiatives unblocked and accelerated

Hadoop was not designed to run natively in cloud environments, and while cloud-based Hadoop services certainly offer improvements over their on-premises counterparts, both still lag behind modern data platforms architected to run natively in the cloud, in terms of both performance and the ability to address more sophisticated data use cases. On-premises Hadoop customers we’ve worked with have seen improvements even greater than those noted above.

Managing Change: Hadoop to Cloud Migration Principles

While migrating to a modern cloud data platform can be daunting, the customers we’ve worked with often consider the prospect of staying with their existing solutions to be even worse: the pain of staying put outweighed the cost of migrating. We’ve worked hard to streamline the migration process across several dimensions:

  • Managing Complexity and Scale: metadata movement, workload migration, data migration
  • Managing Quality and Risk: methodology, project plans, timelines, technology mappings
  • Managing Cost and Time: partners and professional services bringing experience and training

Future Proofing Your Cloud Analytics Projects

Cloud migration decisions are as much business decisions as they are technology decisions. They force companies to take a hard look at what their current systems deliver and evaluate what they need to achieve their goals, whether those goals are measured in petabytes of data processed, customer insights uncovered, or business financial targets.

With clarity on these goals comes important technical details, such as mapping technology components from on-premises models to cloud models, evaluating cloud resource utilization and cost-to-performance, and structuring a migration project to minimize errors and risks. If you want to learn more, check out my on-demand webinar to explore cloud migration concepts, data modernization best practices, and migration product demos.

--

Try Databricks for free. Get started today.

The post Migration from Hadoop to modern cloud platforms: The case for Hadoop alternatives appeared first on Databricks.

New Databricks Integration for Jupyter Bridges Local and Remote Workflows


Introduction

For many years now, data scientists have developed specific workflows on premises using local filesystem hierarchies, source code revision systems and CI/CD processes.

On the other hand, the available data is growing exponentially, and new capabilities for data analysis and modeling are needed: for example, easily scalable storage, distributed computing systems, or special hardware such as GPUs for deep learning.

These capabilities are hard to provide on premises in a flexible way, so companies increasingly leverage solutions in the cloud, and data scientists face the challenge of combining their existing local workflows with these new cloud-based capabilities.

The project JupyterLab Integration, published in Databricks Labs, was built to bridge these two worlds. Data scientists can use their familiar local environments with JupyterLab and work with remote data and remote clusters simply by selecting a kernel.

Example scenarios enabled by JupyterLab Integration from your local JupyterLab:

  • Execute single node data science Jupyter notebooks on remote clusters maintained by Databricks with access to the remote Data Lake.
  • Run deep learning code on Databricks GPU clusters.
  • Run remote Spark jobs with an integrated user experience (progress bars, DBFS browser, …).
  • Easily follow deep learning tutorials where the setup is based on Jupyter or JupyterLab and run the code on a Databricks cluster.
  • Mirror a remote cluster environment locally (python and library versions) and switch seamlessly between local and remote execution by just selecting Jupyter kernels.

This blog post starts with a quick overview of what using a remote Databricks cluster from your local JupyterLab looks like. It then provides an end-to-end example of working with JupyterLab Integration, followed by an explanation of the differences from Databricks Connect. If you want to try it yourself, the last section explains the installation.

Using a remote cluster from a local JupyterLab

JupyterLab Integration follows the standard approach of Jupyter/JupyterLab and allows you to create Jupyter kernels for remote Databricks clusters (this is explained in the next section). To work with JupyterLab Integration you start JupyterLab with the standard command:

$ jupyter lab

In the notebook, select the remote kernel from the menu to connect to the remote Databricks cluster and get a Spark session with the following Python code:

from databrickslabs_jupyterlab.connect import dbcontext
dbcontext()

The image below shows this process and some of the features of JupyterLab Integration.

The Databricks Jupyter - JupyterLab Integration follows the standard approach of Jupyter/JupyterLab and allows you to create Jupyter kernels for remote Databricks clusters.

Databricks-JupyterLab Integration — An end-to-end example

Before configuring a Databricks cluster for JupyterLab Integration, let’s understand how it will be identified: a Databricks cluster runs in the cloud in a Databricks Data Science Workspace. These workspaces can be managed from a local terminal with the Databricks CLI. The Databricks CLI stores the URL and personal access token for a workspace in a local configuration file under a selectable profile name. JupyterLab Integration uses this profile name to reference Databricks workspaces, e.g., demo for the workspace demo.cloud.databricks.com.

Configuring a remote kernel for JupyterLab

Let’s assume the JupyterLab Integration is already installed and configured to mirror a remote cluster named bernhard-6.1ml (details about installation are at the end of this blog post).

The first step is to create a Jupyter kernel specification for a remote cluster, e.g. in the workspace with profile name demo:

(bernhard-6.1ml)$ alias dj=databrickslabs-jupyterlab
(bernhard-6.1ml)$ dj demo -k

The following wizard lets you select the remote cluster in workspace demo, stores its driver IP address in the local ssh configuration file and installs some necessary runtime libraries on the remote driver:

The Databrick-JupyterLab Integration wizard lets you select the remote cluster in workspace demo, stores its driver IP address in the local ssh configuration file and installs some necessary runtime libraries on the remote driver.

At the end, a new kernel SSH 1104-182503-trust65 demo:bernhard-6.1ml will be available in JupyterLab (the name is a combination of the remote cluster id 1104-182503-trust65, the Databricks CLI profile name demo, the remote cluster name bernhard-6.1ml and optionally the local conda environment name).

Starting JupyterLab with the Databricks integration

Now we have two choices to start JupyterLab, first the usual way:

(bernhard-6.1ml)$ jupyter lab

This will work perfectly when the remote cluster is already up and running and its local configuration is up to date. However, the preferred way to start JupyterLab for JupyterLab Integration is:

(bernhard-6.1ml)$ dj demo -l -c

This command automatically starts the remote cluster (if terminated), installs the runtime libraries “ipykernel” and “ipywidgets” on the driver and saves the remote IP address of the driver locally. As a nice side effect, with flag -c the personal access token is automatically copied to the clipboard. You will need the token in the next step in the notebook to authenticate against the remote cluster. It is important to note that the personal access token will not be stored on the remote cluster.

Getting a Spark Context in the Jupyter Notebook

To create a Spark session in a Jupyter Notebook that is connected to this remote kernel, enter the following two lines into a notebook cell:

from databrickslabs_jupyterlab.connect import dbcontext, is_remote
dbcontext()

This will request to enter the personal access token (the one that was copied to the clipboard above) and then connect the notebook to the remote Spark Context.

Running hyperparameter tuning locally and remotely

The following code will run on both a local Python kernel and a remote Databricks kernel. Running locally, it will use GridSearchCV from scikit-learn with a small hyperparameter space. Running on the remote Databricks kernel, it will leverage spark-sklearn to distribute the hyperparameter optimization across Spark executors. For settings that differ between the local and remote environments (e.g., paths to data), the is_remote() function from JupyterLab Integration can be used.

  1. Define the data locations both locally and remotely and load GridSearchCV
    if is_remote():
        from functools import partial
        from spark_sklearn import GridSearchCV
        GridSearchCV = partial(GridSearchCV, sc)  # add Spark context
        data_path = "/dbfs/bernhard/digits.csv"
    else:
        from sklearn.model_selection import GridSearchCV
        data_path = ("/Users/bernhardwalter/Data/digits/digits.csv")
    
  2. Load the data
    import pandas as pd
    
    digits = pd.read_csv(data_path, index_col=None)
    X, y = digits.iloc[:,1:-1], digits.iloc[:,-1]
    
  3. Define the different hyperparameter spaces for local and remote execution
    from sklearn.ensemble import RandomForestClassifier
    
    if is_remote():
        param_grid = {
            "max_depth": [3, 5, 10, 15],
            "max_features": ["auto", "sqrt", "log2", None],
            "min_samples_split": [2, 5, 10],
            "min_samples_leaf": [1, 3, 10],
            "n_estimators": [10, 15, 25, 50, 75, 100]
        }  # 864 options
    else:
        param_grid = {
            "max_depth": [3, None],
            "max_features": [1, 3],
            "min_samples_split": [2, 10],
            "min_samples_leaf": [1, 10],
            "n_estimators": [10, 20]
        }  # 32 options
    
    cv = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
    cv.fit(X,y)
    
  4. Finally, evaluate the model
    best = cv.best_index_
    cv_results = cv.cv_results_
    print("mean_test_score", cv_results["mean_test_score"][best], 
          "std_test_score", cv_results["std_test_score"][best]) 
    cv_results["params"][best]
    

Below is an animated demo for both a local and a remote run:

Running hyperparameter tuning locally and remotely

JupyterLab Integration and Databricks Connect

Databricks Connect allows you to connect your favorite IDE, notebook server, and other custom applications to Databricks clusters. It provides a special local Spark Context which is basically a proxy to the remote Spark Context. Only Spark code will be executed on the remote cluster. This means, for example, if you start a GPU node in Databricks for some Deep Learning experiments, with Databricks Connect your code will run on the laptop and will not leverage the GPU of the remote machine:

Databricks Connect allows you to connect your favorite IDE, notebook server, and other custom applications to Databricks clusters.

Databricks Connect architecture

JupyterLab Integration, on the other hand, keeps notebooks locally but runs all code on the remote cluster if a remote kernel is selected. This enables your local JupyterLab to run single-node data science notebooks (using pandas, scikit-learn, etc.) on a remote environment maintained by Databricks, or to run your deep learning code on a remote Databricks GPU machine. Your local JupyterLab can also execute distributed Spark jobs on Databricks clusters, with progress bars providing the status of the Spark job.

JupyterLab Integration allows you to run single-node data science notebooks on a remote environment managed by Databricks or to run deep learning code on a remote Databricks GPU machine.

JupyterLab Integration architecture

Furthermore, you can set up a local conda environment that mirrors a remote cluster. You can start building out your experiment locally, where you have full control over your environment and processes and easy access to all log files. When the code is stable, you can use the remote cluster to apply it to the full remote data set, or do distributed hyperparameter optimization on a remote cluster without uploading data with every run.

Note: If a notebook is connected to a remote cluster, its Python kernel runs on the remote cluster and neither local config files nor local data can be accessed with Python and Spark. To exchange files between the local laptop and DBFS on the remote cluster, use Databricks CLI to copy data back and forth:

$ databricks --profile $PROFILE fs cp /DATA/abc.csv dbfs:/data

Since libraries like pandas cannot access files in DBFS via dbfs:/, there is a mount point /dbfs/ that allows you to access the data in DBFS (like /dbfs/data/abc.csv) with standard Python libraries.
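For example, a file copied to dbfs:/data as above can then be read on the remote kernel with plain pandas through that mount point:

import pandas as pd

df = pd.read_csv("/dbfs/data/abc.csv")  # same file as dbfs:/data/abc.csv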

JupyterLab Integration Installation

After we have seen how JupyterLab Integration works, let’s have a look at how to install it.

Prerequisites

JupyterLab Integration runs with Databricks on both AWS and Azure Databricks. The setup is based on the Databricks CLI configuration and assumes:

  1. Anaconda is installed (the libraries for the JupyterLab Integration will be installed later)
  2. Databricks CLI is installed and configured for the workspace you want to use
  3. An SSH key pair is created for the cluster you want to use
  4. The cluster you want to use is SSH enabled and has the public key from step 3 installed

Note: It currently runs only on macOS and Linux and has been tested with Databricks Runtime 5.5, 6.0 and 6.1 (Standard and ML).

Required setup for running JupyterLab Integration on either AWS or Azure Databricks, based on the Databricks CLI configuration.

The convention is that the SSH key pair is named after the Databricks CLI profile name. For more details on prerequisites, please see the “prerequisites” section of the documentation.

Installation

  1. Create a local conda environment and install JupyterLab Integration:
    (base)$ conda create -n db-jlab python=3.6
    (base)$ conda activate db-jlab
    (db-jlab)$ pip install --upgrade databrickslabs-jupyterlab
    

    The prefix (db-jlab)$ for the command examples in this blog post shows that the conda environment db-jlab is activated.

    The terminal command name databrickslabs-jupyterlab is quite long, so let’s create an alias

    (db-jlab)$ alias dj=databrickslabs-jupyterlab
    
  2. Bootstrap JupyterLab Integration:

    This will install the necessary libraries and extensions (using the alias from above):
    (db-jlab)$ dj -b
    
  3. Optionally, if you want to run the same notebook locally and remotely (mirroring):
    This will ask for the name of a cluster to be mirrored and install all of its data-science-related libraries in a local conda environment, matching all versions.
    (db-jlab)$ dj $PROFILE -m     
    

    For more details see the “mirror” section of the documentation.

Get started with JupyterLab Integration

In this blog post we have shown how JupyterLab Integration brings remote Databricks clusters into locally established workflows by running Python kernels on the Databricks clusters via SSH. This allows data scientists to work in their familiar local environments with JupyterLab and access remote data and remote clusters in a consistent way. We have also shown that JupyterLab Integration follows a different approach than Databricks Connect by using SSH. Compared to Databricks Data Science Workspaces and Databricks Connect, this enables a set of additional use cases.

https://github.com/databrickslabs/Jupyterlab-Integration

Related Resources

--

Try Databricks for free. Get started today.

The post New Databricks Integration for Jupyter Bridges Local and Remote Workflows appeared first on Databricks.


Databricks and Informatica Integration Simplifies Data Lineage and Governance for Cloud Analytics


In a rapidly evolving world of big data, data discovery, governance and data lineage is an essential aspect of data management. As organizations modernize their workloads into multi-cloud and hybrid environments, data starts to get distributed across cloud data lakes and SaaS applications. With that, organizations are trying to answer key questions:

  • How do I find the right dataset?
  • How do I ensure the data is of high quality?
  • How do I move faster to deliver insights for analytics and ML workloads?
  • How do I comply with regulations and deliver trusted data?

Achieving data discovery, lineage and reliability – at enterprise scale – is an opportunity for organizations. To help enterprises build a strong foundation for data management, we’ve partnered with Informatica to provide an end-to-end lineage solution. This joint solution provides complete visibility and traceability into data pipelines on Delta Lake, the open-source storage layer for reliable data lakes at scale.

Building End-to-End Pipelines with Data Discovery, Governance and Lineage in the Cloud on Delta Lake

Take a moment to think about all the applications we use everyday – email, web, mobile, social media, SaaS applications, BI, reporting dashboards and many others. Data Engineers spend vast amounts of time in these applications finding datasets and tracing data transformations, which delays analytics and machine learning projects.

The joint solution by Databricks and Informatica solves this problem by enabling data engineers and data scientists to find, validate and trace datasets as they move through data pipelines. Informatica’s Enterprise Data Catalog (EDC) connects seamlessly with Delta Lake to scan and index metadata so teams can discover and profile data and find detailed lineage of that data as it moves through pipelines. This allows data engineers and data scientists to easily track data movement, including column/metric-level lineage to identify related tables, views and domains.

Along with EDC, Databricks integrates with Informatica Data Engineering Integration (DEI). DEI uses dynamic mappings and data transformations to ingest data from multiple source systems and applications into Delta tables with complete data lineage tracking. Once the data is in Delta tables, EDC performs scanning, profiling and discovery to help data engineers find the right data sets.

Viewing Lineage of Delta Datasets Engineered with Informatica Data Engineering Integration in EDC

Data Governance at Enterprise Scale

With new and upcoming regulations such as the General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA), data governance becomes integral to any data management initiative. For example, GDPR mandates ‘Right to Access’ that allows a customer to view their personal information across the entire enterprise. Similarly, the Right to Erasure requires that all their personal data be deleted without delay. Since Delta is a transactional engine, specific data (i.e. rows in tables) can be easily deleted using DELETE commands in the ‘Right to be forgotten’ compliance scenarios, without the burden of coding elaborate pipelines.
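For illustration, a 'right to be forgotten' deletion on a Delta table can be as small as the sketch below; the table and column names are hypothetical:

# Hypothetical Delta table and column, shown only to illustrate the DELETE pattern
spark.sql("""
  DELETE FROM customer_events
  WHERE customer_id = '12345-abcde'
""")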

Overall, with data dispersed across disparate sources, it can be challenging to keep track of what data resides where and which on-premises and cloud workflows touch that data. A cloud-based data platform and discovery program for the entire organization can future-proof its data governance discipline.

Getting started with the Databricks-Informatica End-to-end Data Lineage solution

Building intelligent data pipelines to bring data from different silos, tracing its origin and creating a complete view of data movement in the cloud is critical to enterprise organizations. The Databricks and Informatica partnership enables modern data teams to leverage data assets to scale and document datasets and data pipelines for analytics and ML. It is a powerful integration for data engineers and data scientists looking to automate their governance processes while achieving speed and agility of data management for the future.

Check out the Data Discovery and Lineage Analytics webinar for an in-depth demo of the Databricks and Informatica joint solution for data lineage.

Related Resources

--

Try Databricks for free. Get started today.

The post Databricks and Informatica Integration Simplifies Data Lineage and Governance for Cloud Analytics appeared first on Databricks.

Streamlining Variant Normalization on Large Genomic Datasets with Glow


Cross posted from the Glow blog.

Many research and drug development projects in the genomics world involve large genomic variant data sets, the volume of which has been growing exponentially over the past decade. However, the tools to extract, transform, load (ETL) and analyze these data sets have not kept pace with this growth. Single-node command line tools or scripts are very inefficient in handling terabytes of genomics data in these projects. In October of this year, Databricks and the Regeneron Genetics Center partnered to introduce project Glow, an open-source toolkit for large-scale genomic analysis based on Apache Spark™, to address this issue. An optimized version of Glow is incorporated into Databricks Unified Data Analytics Platform (UDAP) for Genomics, which in addition to several feature optimizations, provides many more secondary and tertiary analysis features on top of a scalable, managed cloud service, making Databricks the best platform to run Glow.

In large cross-team research or drug discovery projects, computational biologists and bioinformaticians usually need to merge very large variant call sets in order to perform downstream analyses. In a prior post, we showcased the power and simplicity of Glow in ETL and merging of variant call sets from different sources using Glow’s VCF and BGEN Data Sources at unprecedented scales. Differently sourced variant call sets impose another major challenge. It is not uncommon for these sets to be generated by different variant calling tools and methods. Consequently, the same genomic variant may be represented differently (in terms of genomic position and alleles) across different call sets. These discrepancies in variant representation must be resolved before any further analysis on the data. This is critical for the following reasons:

  1. To avoid incorrect bias in the results of downstream analysis on the merged set of variants or waste of analysis effort on seemingly new variants due to lack of normalization, which are in fact redundant (see Tan et al. for examples of this redundancy in 1000 Genome Project variant calls and dbSNP)
  2. To ensure that the merged data set and its post-analysis derivations are compatible and comparable with other public and private variant databases.

This is achieved by what is referred to as variant normalization, a process that ensures the same variant is represented identically across different data sets. Performing variant normalization on terabytes of variant data in large projects using popular single-node tools can become quite a challenge as the acceptable input and output of these tools are the flat file formats that are commonly used to store variant calls (such as VCF and BGEN). To address this issue, we introduced the variant normalization transformation into Glow, which directly acts on a Spark DataFrame of variants to generate a DataFrame of normalized variants, harnessing the power of Spark to normalize variants from hundreds of thousands of samples in a fast and scalable manner with just a single line of Python or Scala code. Before addressing our normalizer, let us have a slightly more technical look at what variant normalization actually does.

What does variant normalization do?

Variant normalization ensures that the representation of a variant is both “parsimonious” and “left-aligned.” A variant is parsimonious if it is represented in as few nucleotides as possible without reducing the length of any allele to zero. An example is given in Figure 1.

Figure 1. Variant Parsimony

A variant is left-aligned if its position cannot be shifted to the left while keeping the length of all its alleles the same. An example is given in Figure 2.

Figure 2. Left-aligned Variant

Tan et al. have proved that normalization results in uniqueness. In other words, two variants have different normalized representations if and only if they are actually different variants.

Variant normalization in Glow

We have introduced the normalize_variants transformer into Glow (Figure 3). After ingesting variant calls into a Spark DataFrame using the VCF, BGEN or Delta readers, a user can call a single line of Python or Scala code to normalize all variants. This generates another DataFrame in which all variants are presented in their normalized form. The normalized DataFrame can then be used for downstream analyses like a GWAS using our built-in regression functions or an efficiently-parallelized GWAS tool.

Figure 3. Scalable Variant Normalization Using Glow

The normalize_variants transformer brings unprecedented scalability and simplicity to this important upstream process, and hence is yet another reason why Glow and Databricks UDAP for Genomics are ideal platforms for biobank-scale genomic analyses, e.g., association studies between genetic variations and diseases across cohorts of hundreds of thousands of individuals.

The underlying normalization algorithm and its accuracy

There are several single-node tools for variant normalization that use different normalization algorithms. Widely used tools for variant normalization include vt normalize, bcftools norm, and the GATK’s LeftAlignAndTrimVariants.

Based on our own investigation and also as indicated by Bayat et al. and Tan et al., the GATK’s LeftAlignAndTrimVariants algorithm frequently fails to completely left-align some variants. For example, we noticed that on the test_left_align_hg38.vcf test file from GATK itself, applying LeftAlignAndTrimVariants results in an incorrect normalization of 3 of the 16 variants in the file, including the variants at positions chr20:63669973, chr20:64012187, and chr21:13255301. These variants are normalized correctly using vt normalize and bcftools norm.

Consequently, in our normalize_variants transformer, we used an improved version of the bcftools norm or vt normalize algorithms, which are similar in fundamentals. For a given variant, we start by right-trimming all the alleles of the variant as long as their rightmost nucleotides are the same. If the length of any allele reaches zero, we left-append it with a fixed block of nucleotides from the reference genome (the nucleotides are added in blocks as opposed to one-by-one to limit the number of referrals to the reference genome). When right-trimming is terminated, a potential left-trimming is performed to eliminate the leftmost nucleotides common to all alleles (possibly generated by prior left-appendings). The start, end, and alleles of the variants are updated appropriately during this process.
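The following is a deliberately simplified Python sketch of the trimming logic just described, not Glow's actual implementation; fetch_ref is a hypothetical helper that returns the reference bases for a half-open interval [start, end):

def normalize(start, alleles, fetch_ref, block=10):
    # Right-trim while all alleles end in the same nucleotide
    while min(len(a) for a in alleles) > 0 and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]
        # If an allele became empty, left-append a block of reference bases to all alleles
        if min(len(a) for a in alleles) == 0:
            pad = fetch_ref(start - block, start)
            alleles = [pad + a for a in alleles]
            start -= block
    # Left-trim nucleotides common to all alleles (never shrinking an allele to zero length)
    while min(len(a) for a in alleles) > 1 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        start += 1
    return start, alleles

The real transformer applies the same idea across a whole DataFrame of variants and, as noted above, also updates the end coordinate of each variant accordingly.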

We benchmarked the accuracy of our normalization algorithm against vt normalize and bcftools norm on multiple test files and validated that our results match the results of these tools.

The optional splitting of multiallelic variants

Our normalize_variants transformer can optionally split multiallelic variants to biallelics. This is controlled by the mode option that can be supplied to this transformer. The possible values for the mode option are as follows: normalize (default), which performs normalization only, split_and_normalize, which splits multiallelic variants to biallelic ones before performing normalization, and split, which only splits multiallelics without doing any normalization.

The splitting logic of our transformer is the same as that followed by GATK’s LeftAlignAndTrimVariants tool with the --splitMultiallelics option. More precisely, when splitting multiallelic variants loaded from VCF files, this transformer recalculates the GT blocks for the resulting biallelic variants if possible, and drops all INFO fields except for AC, AN, and AF. These three fields are imputed based on the newly calculated GT blocks, if any exist; otherwise, these fields are dropped as well.

Using the normalize_variants transformer

Here, we briefly demonstrate how very large variant call sets can be normalized and/or split using Glow. First, VCF and/or BGEN files can be read into a Spark DataFrame as demonstrated in a prior post. This is shown in Python for the set of VCF files contained in a folder named /databricks-datasets/genomics/call-sets:

original_variants_df = spark.read\
  .format("vcf")\
  .option("includeSampleIds", False)\
  .load("/databricks-datasets/genomics/call-sets")

An example of the DataFrame original_variants_df is shown in Figure 4.

Figure 4. The variant DataFrame original_variants_df

The variants can then be normalized using the normalize_variants transformer as follows:

import glow

ref_genome_path = '/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38.fa'

normalized_variants_df = glow.transform(\
  "normalize_variants",\
  original_variants_df,\
  reference_genome_path=ref_genome_path\
)

Note that normalization requires the reference genome .fasta or .fa file, which is provided using the reference_genome_path option. The .dict and .fai files must accompany the reference genome file in the same folder (read more about these file formats here).

Our example DataFrame after normalization can be seen in Figure 5.

Figure 5. The normalized_variants_df DataFrame obtained after applying the normalize_variants transformer on original_variants_df. Notice that several variants are normalized and their start, end, and alleles have changed accordingly.

By default, the transformer normalizes each variant without splitting the multiallelic variants before normalization as seen in Figure 5. By setting the mode option to split_and_normalize, nothing changes for biallelic variants, but the multiallelic variants are first split to the appropriate number of biallelics and the resulting biallelics are normalized. This can be done as follows:

split_and_normalized_variants_df = glow.transform(\
  "normalize_variants",\
  original_variants_df,\
  reference_genome_path=ref_genome_path,\
  mode="split_and_normalize"
)

The resulting DataFrame looks like Figure 6.

Figure 6. The split_and_normalized_variants_df DataFrame after applying the normalize_variants transformer with mode="split_and_normalize" on original_variants_df. Notice that, for example, the triallelic variant (chr20, start=19883344, end=19883345, REF=T, ALT=[TT,C]) of original_variants_df has been split into two biallelic variants and then normalized, resulting in two normalized biallelic variants (chr20, start=19883336, end=19883337, REF=C, ALT=CT) and (chr20, start=19883344, end=19883345, REF=T, ALT=C).

As mentioned before, the transformer can also be used only for splitting of multiallelics without doing any normalization by setting the mode option to split.

Summary

Using the Glow normalize_variants transformer, computational biologists and bioinformaticians can normalize very large variant datasets of hundreds of thousands of samples in a fast and scalable manner. Differently sourced call sets can be ingested and merged using the VCF and/or BGEN readers, and normalization can be performed with this transformer in just a single line of code. The transformer can optionally split multiallelic variants to biallelics as well.

Get started with Glow — Streamline variant normalization

Our normalize_variants transformer makes it easy to normalize (and split) large variant datasets with a very small amount of code (Azure | AWS). Learn more about Glow features here and check out Databricks Unified Data Analytics for Genomics or try out a preview today.

Next Steps

Join our upcoming webinar Accelerate and Scale Joint Genotyping in the Cloud to see how Databricks and Glow simplify multi-sample variant calling at scale

References

Arash Bayat, Bruno Gaëta, Aleksandar Ignjatovic, Sri Parameswaran, Improved VCF normalization for accurate VCF comparison, Bioinformatics, Volume 33, Issue 7, 2017, Pages 964–970

Adrian Tan, Gonçalo R. Abecasis, Hyun Min Kang, Unified representation of genetic variants, Bioinformatics, Volume 31, Issue 13, 2015, Pages 2202–2204

Additional Resources

--

Try Databricks for free. Get started today.

The post Streamlining Variant Normalization on Large Genomic Datasets with Glow appeared first on Databricks.

Processing Geospatial Data at Scale With Databricks


The evolution and convergence of technology has fueled a vibrant marketplace for timely and accurate geospatial data. Every day billions of handheld and IoT devices along with thousands of airborne and satellite remote sensing platforms generate hundreds of exabytes of location-aware data. This boom of geospatial big data combined with advancements in machine learning is enabling organizations across industry to build new products and capabilities.

Maps leveraging geospatial data are used widely across industry, spanning multiple use cases, including disaster recovery, defense and intel, infrastructure and health services.

For example, numerous companies provide localized drone-based services such as mapping and site inspection (reference Developing for the Intelligent Cloud and Intelligent Edge). Another rapidly growing industry for geospatial data is autonomous vehicles. Startups and established companies alike are amassing large corpuses of highly contextualized geodata from vehicle sensors to deliver the next innovation in self-driving cars (reference Databricks fuels wejo’s ambition to create a mobility data ecosystem). Retailers and government agencies are also looking to make use of their geospatial data. For example, foot-traffic analysis (reference Building Foot-Traffic Insights Dataset) can help determine the best location to open a new store or, in the Public Sector, improve urban planning. Despite all these investments in geospatial data, a number of challenges exist.

Challenges Analyzing Geospatial at Scale

The first challenge involves dealing with scale in streaming and batch applications. The sheer proliferation of geospatial data and the SLAs required by applications overwhelm traditional storage and processing systems. Customer data has been spilling out of existing vertically scaled geo databases into data lakes for many years now due to pressures such as data volume, velocity, storage cost, and strict schema-on-write enforcement. While enterprises have invested in geospatial data, few have the proper technology architecture to prepare these large, complex datasets for downstream analytics. Further, given that scaled data is often required for advanced use cases, the majority of AI-driven initiatives are failing to make it from pilot to production.

Compatibility with various spatial formats poses the second challenge. There are many different specialized geospatial formats established over many decades as well as incidental data sources in which location information may be harvested:

  • Vector formats such as GeoJSON, KML, Shapefile, and WKT
  • Raster formats such as ESRI Grid, GeoTIFF, JPEG 2000, and NITF
  • Navigational standards such as used by AIS and GPS devices
  • Geodatabases accessible via JDBC / ODBC connections such as PostgreSQL / PostGIS
  • Remote sensor formats from Hyperspectral, Multispectral, Lidar, and Radar platforms
  • OGC web standards such as WCS, WFS, WMS, and WMTS
  • Geotagged logs, pictures, videos, and social media
  • Unstructured data with location references

In this blog post, we give an overview of general approaches to deal with the two main challenges listed above using the Databricks Unified Data Analytics Platform. This is the first part of a series of blog posts on working with large volumes of geospatial data.

Scaling Geospatial Workloads with Databricks

Databricks offers a unified data analytics platform for big data analytics and machine learning used by thousands of customers worldwide. It is powered by Apache Spark™, Delta Lake, and MLflow with a wide ecosystem of third-party and available library integrations. Databricks UDAP delivers enterprise-grade security, support, reliability, and performance at scale for production workloads. Geospatial workloads are typically complex and there is no one library fitting all use cases. While Apache Spark does not offer geospatial Data Types natively, the open source community as well as enterprises have directed much effort to develop spatial libraries, resulting in a sea of options from which to choose.

There are generally three patterns for scaling geospatial operations such as spatial joins or nearest neighbors:

  1. Using purpose-built libraries which extend Apache Spark for geospatial analytics. GeoSpark, GeoMesa, GeoTrellis, and Rasterframes are a few such libraries used by our customers. These frameworks often offer multiple language bindings, have much better scaling and performance than non-formalized approaches, but can also come with a learning curve.
  2. Wrapping single-node libraries such as GeoPandas, the Geospatial Data Abstraction Library (GDAL), or the Java Topology Suite (JTS) in ad-hoc user defined functions (UDFs) for processing in a distributed fashion with Spark DataFrames. This is the simplest approach for scaling existing workloads without much code rewrite; however, it can introduce performance drawbacks as it is more lift-and-shift in nature.
  3. Indexing the data with grid systems and leveraging the generated index to perform spatial operations is a common approach for dealing with very large scale or computationally restricted workloads. S2, GeoHex and Uber’s H3 are examples of such grid systems. Grids approximate geo features such as polygons or points with a fixed set of identifiable cells, thus avoiding expensive geospatial operations altogether and offering much better scaling behavior. Implementers can decide between grids fixed to a single accuracy, which can be somewhat lossy yet more performant, or grids with multiple accuracies, which can be less performant but mitigate against lossiness (see the short standalone sketch after this list).
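The sketch below illustrates the grid-indexing idea outside of Spark. It assumes the h3-py package, version 3.x (in v4 the function names changed, e.g. geo_to_h3 became latlng_to_cell):

import h3

pickup = (40.7831, -73.9712)                   # latitude, longitude of a point of interest
cell = h3.geo_to_h3(pickup[0], pickup[1], 9)   # H3 cell id at resolution 9 (cells of roughly 0.1 km2)
neighbors = h3.k_ring(cell, 1)                 # the cell plus its immediate neighbors
print(cell, len(neighbors))

Two features that share an H3 cell id can then be matched with a plain equality join instead of a geometric predicate, which is exactly what the Spark example later in this post does at scale.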

The examples which follow are generally oriented around a NYC taxi pickup / dropoff dataset found here. NYC Taxi Zone data with geometries will also be used as the set of polygons. This data contains polygons for the five boroughs of NYC as well as the neighborhoods. This notebook will walk you through the preparation and cleaning done to convert the initial CSV files into Delta Lake tables as a reliable and performant data source.

Our base DataFrame is the taxi pickup / dropoff data read from a Delta Lake Table using Databricks.

%scala
val dfRaw = spark.read.format("delta").load("/ml/blogs/geospatial/delta/nyc-green") 
display(dfRaw) // showing first 10 columns

Example geospatial data read from a Delta Lake table using Databricks.

Geospatial Operations using GeoSpatial Libraries for Apache Spark

Over the last few years, several libraries have been developed to extend the capabilities of Apache Spark for geospatial analysis. These frameworks bear the brunt of registering commonly applied user defined types (UDT) and functions (UDF) in a consistent manner, lifting the burden otherwise placed on users and teams to write ad-hoc spatial logic. Please note that in this blog post we use several different spatial frameworks chosen to highlight various capabilities. We understand that other frameworks exist beyond those highlighted which you might also want to use with Databricks to process your spatial workloads.

Earlier, we loaded our base data into a DataFrame. Now we need to turn the latitude/longitude attributes into point geometries. To accomplish this, we will use UDFs to perform operations on DataFrames in a distributed fashion. Please refer to the provided notebooks at the end of the blog for details on adding these frameworks to a cluster and the initialization calls to register UDFs and UDTs. For starters, we have added GeoMesa to our cluster, a framework especially adept at handling vector data. For ingestion, we are mainly leveraging its integration of JTS with Spark SQL, which allows us to easily convert to and use registered JTS geometry classes. We will be using the function st_makePoint, which, given a latitude and longitude, creates a Point geometry object. Since the function is a UDF, we can apply it to columns directly.

%scala
val df = dfRaw
 .withColumn("pickup_point", st_makePoint(col("pickup_longitude"), col("pickup_latitude")))
 .withColumn("dropoff_point", st_makePoint(col("dropoff_longitude"),col("dropoff_latitude")))
display(df.select("dropoff_point","dropoff_datetime"))

Using UDFs to perform operations on DataFrames in a distributed fashion to turn geospatial data latitude/longitude attributes into point geometries.

We can also perform distributed spatial joins, in this case using GeoMesa’s provided st_contains UDF to produce the resulting join of all polygons against pickup points.

%scala
val joinedDF = wktDF.join(df, st_contains($"the_geom", $"pickup_point"))
display(joinedDF.select("zone","borough","pickup_point","pickup_datetime"))

Using GeoMesa’s provided st_contains UDF, for example, to produce the resulting join of all polygons against pickup points.

Wrapping Single-node Libraries in UDFs

In addition to using purpose-built distributed spatial frameworks, existing single-node libraries can also be wrapped in ad-hoc UDFs for performing geospatial operations on DataFrames in a distributed fashion. This pattern is available to all Spark language bindings – Scala, Java, Python, R, and SQL – and is a simple approach for leveraging existing workloads with minimal code changes. To demonstrate a single-node example, let’s load NYC borough data and define the UDF find_borough(…) for a point-in-polygon operation that assigns each GPS location to a borough using geopandas. This could also have been accomplished with a vectorized UDF for even better performance.

%python
import geopandas as gdp
from shapely.geometry import Point
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# read the boroughs polygons with geopandas
gdf = gdp.read_file("/dbfs/ml/blogs/geospatial/nyc_boroughs.geojson")

b_gdf = sc.broadcast(gdf) # broadcast the geopandas dataframe to all nodes of the cluster
def find_borough(latitude, longitude):
  # return the borough name if the point falls within its polygon, else None
  mgdf = b_gdf.value.apply(lambda x: x["boro_name"] if x["geometry"].intersects(Point(longitude, latitude)) else None, axis=1)
  idx = mgdf.first_valid_index()
  return mgdf.loc[idx] if idx is not None else None

find_borough_udf = udf(find_borough, StringType())

Now we can apply the UDF to add a column to our Spark DataFrame which assigns a borough name to each pickup point.

%python
from pyspark.sql.functions import col

# read the coordinates from delta
df = spark.read.format("delta").load("/ml/blogs/geospatial/delta/nyc-green")
df_with_boroughs = df.withColumn("pickup_borough", find_borough_udf(col("pickup_latitude"), col("pickup_longitude")))
display(df_with_boroughs.select(
  "pickup_datetime","pickup_latitude","pickup_longitude","pickup_borough"))

The result of the single-node example, where geopandas is used to assign each GPS location to a NYC borough.

Grid Systems for Spatial Indexing

Geospatial operations are inherently computationally expensive. Point-in-polygon, spatial joins, nearest neighbor or snapping to routes all involve complex operations. By indexing with grid systems, the aim is to avoid geospatial operations altogether. This approach leads to the most scalable implementations with the caveat of approximate operations. Here is a brief example with H3.

Scaling spatial operations with H3 is essentially a two step process. The first step is to compute an H3 index for each feature (points, polygons, …) defined as UDF geoToH3(…). The second step is to use these indices for spatial operations such as spatial join (point in polygon, k-nearest neighbors, etc), in this case defined as UDF multiPolygonToH3(…).

%scala 
import com.uber.h3core.H3Core
import com.uber.h3core.util.GeoCoord
import scala.collection.JavaConversions._
import scala.collection.JavaConverters._

object H3 extends Serializable {
  val instance = H3Core.newInstance()
}

val geoToH3 = udf{ (latitude: Double, longitude: Double, resolution: Int) => 
  H3.instance.geoToH3(latitude, longitude, resolution) 
}
                  
val polygonToH3 = udf{ (geometry: Geometry, resolution: Int) => 
  var points: List[GeoCoord] = List()
  var holes: List[java.util.List[GeoCoord]] = List()
  if (geometry.getGeometryType == "Polygon") {
    points = List(
      geometry
        .getCoordinates()
        .toList
        .map(coord => new GeoCoord(coord.y, coord.x)): _*)
  }
  H3.instance.polyfill(points, holes.asJava, resolution).toList 
}

val multiPolygonToH3 = udf{ (geometry: Geometry, resolution: Int) => 
  var points: List[GeoCoord] = List()
  var holes: List[java.util.List[GeoCoord]] = List()
  if (geometry.getGeometryType == "MultiPolygon") {
    val numGeometries = geometry.getNumGeometries()
    if (numGeometries > 0) {
      points = List(
        geometry
          .getGeometryN(0)
          .getCoordinates()
          .toList
          .map(coord => new GeoCoord(coord.y, coord.x)): _*)
    }
    if (numGeometries > 1) {
      holes = (1 to (numGeometries - 1)).toList.map(n => {
        List(
          geometry
            .getGeometryN(n)
            .getCoordinates()
            .toList
            .map(coord => new GeoCoord(coord.y, coord.x)): _*).asJava
      })
    }
  }
  H3.instance.polyfill(points, holes.asJava, resolution).toList 
}

We can now apply these two UDFs to the NYC taxi data as well as the set of borough polygons to generate the H3 index.

%scala
val res = 7 //the resolution of the H3 index, 1.2km
val dfH3 = df.withColumn(
  "h3index",
  geoToH3(col("pickup_latitude"), col("pickup_longitude"), lit(res))
)
val wktDFH3 = wktDF
  .withColumn("h3index", multiPolygonToH3(col("the_geom"), lit(res)))
  .withColumn("h3index", explode($"h3index"))

Given a set of a lat/lon points and a set of polygon geometries, it is now possible to perform the spatial join using h3index field as the join condition. These assignments can be used to aggregate the number of points that fall within each polygon for instance. There are usually millions or billions of points that have to be matched to thousands or millions of polygons which necessitates a scalable approach. There are other techniques not covered in this blog which can be used for indexing in support of spatial operations when an approximation is insufficient.

%scala
val dfWithBoroughH3 = dfH3.join(wktDFH3,"h3index") 
    
display(dfWithBoroughH3.select("zone","borough","pickup_point","pickup_datetime","h3index"))

DataFrame table representing the spatial join of a set of lat/lon points and polygon geometries, using a specific field as the join condition.
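As a sketch of the aggregation mentioned above, counting pickups per zone after the H3 join is a standard group-by. The PySpark snippet below assumes Python DataFrames points_h3 and zones_h3 analogous to the Scala dfH3 and wktDFH3 (these names are hypothetical):

%python
from pyspark.sql import functions as F

# points_h3 and zones_h3 are hypothetical PySpark counterparts of dfH3 and wktDFH3 above
pickups_per_zone = (
  points_h3.join(zones_h3, "h3index")              # approximate point-in-polygon via the shared H3 cell
           .groupBy("zone", "borough")
           .agg(F.count("*").alias("pickup_count"))
           .orderBy(F.desc("pickup_count"))
)
display(pickups_per_zone)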

Here is a visualization of taxi dropoff locations, with latitude and longitude binned at a resolution of 7 (1.22km edge length) and colored by aggregated counts within each bin.

Geospatial visualization of taxi dropoff locations, with latitude and longitude binned at a resolution of 7 (1.22km edge length) and colored by aggregated counts within each bin.

Handling Spatial Formats with Databricks

Geospatial data involves reference points, such as latitude and longitude, to physical locations or extents on the earth along with features described by attributes. While there are many file formats to choose from, we have picked out a handful of representative vector and raster formats to demonstrate reading with Databricks.

Vector Data

Vector data is a representation of the world stored in x (longitude), y (latitude) coordinates in degrees, and optionally z (altitude in meters) if elevation is considered. The three basic symbol types for vector data are points, lines, and polygons. Well-known text (WKT), GeoJSON, and Shapefile are some popular formats for storing vector data that we highlight below.

Let’s read NYC Taxi Zone data with geometries stored as WKT. The data structure we want to get back is a DataFrame which will allow us to standardize with other APIs and available data sources, such as those used elsewhere in the blog. We are able to easily convert the WKT text content found in field the_geom into its corresponding JTS Geometry class through the st_geomFromWKT(…) UDF call.

%scala
val wktDFText = sqlContext.read.format("csv")
 .option("header", "true")
 .option("inferSchema", "true")
 .load("/ml/blogs/geospatial/nyc_taxi_zones.wkt.csv")

val wktDF = wktDFText.withColumn("the_geom", st_geomFromWKT(col("the_geom"))).cache

GeoJSON is used by many open source GIS packages for encoding a variety of geographic data structures, including their features, properties, and spatial extents. For this example, we will read NYC Borough Boundaries with the approach taken depending on the workflow. Since the data is conforming JSON, we could use the Databricks built-in JSON reader with .option(“multiline”,”true”) to load the data with the nested schema.

%python
json_df = spark.read.option("multiline","true").json("nyc_boroughs.geojson")

Example of using the Databricks built-in JSON reader .option("multiline","true") to load the data with the nested schema.

From there we could choose to hoist any of the fields up to top level columns using Spark’s built-in explode function. For example, we might want to bring up geometry, properties, and type and then convert geometry to its corresponding JTS class as was shown with the WKT example.

%python
from pyspark.sql import functions as F
json_explode_df = ( json_df.select(
 "features",
 "type",
 F.explode(F.col("features.properties")).alias("properties")
).select("*",F.explode(F.col("features.geometry")).alias("geometry")).drop("features"))

display(json_explode_df)

Using Spark’s built-in explode function to raise a field to the top level, displayed within a DataFrame table.

We can also visualize the NYC Taxi Zone data within a notebook using an existing DataFrame or directly rendering the data with a library such as Folium, a Python library for rendering spatial data. Databricks File System (DBFS) runs over a distributed storage layer which allows code to work with data formats using familiar file system standards. DBFS has a FUSE Mount to allow local API calls which perform file read and write operations, which makes it very easy to load data with non-distributed APIs for interactive rendering. In the Python open(…) command below, the "/dbfs/…" prefix enables the use of FUSE Mount.

%python 
import folium
import json

with open ("/dbfs/ml/blogs/geospatial/nyc_boroughs.geojson", "r") as myfile:
 boro_data=myfile.read() # read GeoJSON from DBFS using FuseMount

m = folium.Map(
 location=[40.7128, -74.0060],
 tiles='Stamen Terrain',
 zoom_start=12 
)
folium.GeoJson(json.loads(boro_data)).add_to(m)
m # to display, also could use displayHTML(...) variants

We can also visualize the NYC Taxi Zone data, for example, within a notebook using an existing DataFrame or directly rendering the data with a library such as Folium, a Python library for rendering geospatial data.

Shapefile is a popular vector format developed by ESRI which stores the geometric location and attribute information of geographic features. The format consists of a collection of files with a common filename prefix (*.shp, *.shx, and *.dbf are mandatory) stored in the same directory. An alternative to shapefile is KML, also used by our customers but not shown for brevity. For this example, let’s use NYC Building shapefiles. While there are many ways to demonstrate reading shapefiles, we will give an example using GeoSpark. The built-in ShapefileReader is used to generate the rawSpatialDf DataFrame.

%scala
var spatialRDD = new SpatialRDD[Geometry]
spatialRDD = ShapefileReader.readToGeometryRDD(sc, "/ml/blogs/geospatial/shapefiles/nyc")

var rawSpatialDf = Adapter.toDf(spatialRDD,spark)
rawSpatialDf.createOrReplaceTempView("rawSpatialDf") //DataFrame now available to SQL, Python, and R 

By registering rawSpatialDf as a temp view, we can easily drop into pure Spark SQL syntax to work with the DataFrame, to include applying a UDF to convert the shapefile WKT into Geometry.

%sql 
SELECT *,
 ST_GeomFromWKT(geometry) AS geometry -- GeoSpark UDF to convert WKT to Geometry 
FROM rawspatialdf 

Additionally, we can use Databricks built in visualization for inline analytics such as charting the tallest buildings in NYC.

%sql 
SELECT name, 
 round(Cast(num_floors AS DOUBLE), 0) AS num_floors --String to Number
FROM rawspatialdf 
WHERE name <> ''
ORDER BY num_floors DESC LIMIT 5

A Databricks built-in visualization for inline analytics charting, for example, the tallest buildings in NYC.

Raster Data

Raster data stores information of features in a matrix of cells (or pixels) organized into rows and columns (either discrete or continuous). Satellite images, photogrammetry, and scanned maps are all types of raster-based Earth Observation (EO) data.

The following Python example uses RasterFrames, a DataFrame-centric spatial analytics framework, to read two bands of GeoTIFF Landsat-8 imagery (red and near-infrared) and combine them into a Normalized Difference Vegetation Index (NDVI). We can use this data to assess plant health around NYC. The rf_ipython module is used to manipulate RasterFrame contents into a variety of visually useful forms, such as below where the red, NIR and NDVI tile columns are rendered with color ramps, using the Databricks built-in displayHTML(…) command to show the results within the notebook.

%python
# construct a CSV "catalog" for RasterFrames `raster` reader 
# catalogs can also be Spark or Pandas DataFrames
bands = [f'B{b}' for b in [4, 5]]
uris = [f'https://landsat-pds.s3.us-west-2.amazonaws.com/c1/L8/014/032/LC08_L1TP_014032_20190720_20190731_01_T1/LC08_L1TP_014032_20190720_20190731_01_T1_{b}.TIF' for b in bands]
catalog = ','.join(bands) + '\n' + ','.join(uris)

# read red and NIR bands from Landsat 8 dataset over NYC
rf = spark.read.raster(catalog, bands) \
 .withColumnRenamed('B4', 'red').withColumnRenamed('B5', 'NIR') \
 .withColumn('longitude_latitude', st_reproject(st_centroid(rf_geometry('red')), rf_crs('red'), lit('EPSG:4326'))) \
 .withColumn('NDVI', rf_normalized_difference('NIR', 'red')) \
 .where(rf_tile_sum('NDVI') > 10000)

results = rf.select('longitude_latitude', rf_tile('red'), rf_tile('NIR'), rf_tile('NDVI'))
displayHTML(rf_ipython.spark_df_to_html(results))

RasterFrame contents can be filtered, transformed, summarized, resampled, and rasterized through 200+ raster and vector functions.

Through its custom Spark DataSource, RasterFrames can read various raster formats, including GeoTIFF, JP2000, MRF, and HDF, from an array of services. It also supports reading the vector formats GeoJSON and WKT/WKB. RasterFrame contents can be filtered, transformed, summarized, resampled, and rasterized through 200+ raster and vector functions, such as st_reproject(…) and st_centroid(…) used in the example above. It provides APIs for Python, SQL, and Scala as well as interoperability with Spark ML.

GeoDatabases

Geo databases can be file-based for smaller-scale data or accessible via JDBC / ODBC connections for medium-scale data. You can use Databricks to query many SQL databases with the built-in JDBC / ODBC Data Source. Connecting to PostgreSQL, which is commonly used for smaller-scale workloads with the PostGIS extension applied, is shown below. This pattern of connectivity allows customers to maintain as-is access to existing databases.

%scala
display(
  sqlContext.read.format("jdbc")
    .option("url", jdbcUrl)
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", 
      """(SELECT * FROM yellow_tripdata_staging 
      OFFSET 5 LIMIT 10) AS t""") //predicate pushdown
    .option("user", jdbcUsername)
    .option("jdbcPassword", jdbcPassword)
  .load)

Getting Started with Geospatial Analysis on Databricks

Businesses and government agencies seek to use spatially referenced data in conjunction with enterprise data sources to draw actionable insights and deliver on a broad range of innovative use cases. In this blog we demonstrated how the Databricks Unified Data Analytics Platform can easily scale geospatial workloads, enabling our customers to harness the power of the cloud to capture, store and analyze data of massive size.

In an upcoming blog, we will take a deep dive into more advanced topics for geospatial processing at-scale with Databricks. You will find additional details about the spatial formats and highlighted frameworks by reviewing Data Prep Notebook, GeoMesa + H3 Notebook, GeoSpark Notebook, GeoPandas Notebook, and Rasterframes Notebook. Also, stay tuned for a new section in our documentation specifically for geospatial topics of interest.

Next Steps

--

Try Databricks for free. Get started today.

The post Processing Geospatial Data at Scale With Databricks appeared first on Databricks.

Databricks Demonstrates AWS Platform Integrations at re:Invent 2019


Databricks was proud to be a Platinum sponsor at re:Invent. The past year has been an exciting one for our partnership with AWS, as we built new integrations and deepened existing ones with so many AWS services. re:Invent was a great opportunity to showcase how our joint customers have benefitted from those integrations and we wanted to share a recap of what was highlighted at the conference!

Databricks at AWS re:Invent 2019

Session: Building Reliable Data Lakes for Analytics with Delta Lake
In this session, Michael Armbrust, the creator of Delta Lake, walked through the evolution of Delta Lake. He showed how customers using data lakes for analytics often build complex architectures that require many validation steps. Then he showed how Delta Lake takes care of that and also introduces a three-step process to refine data and make it ready for analytics. Kyle Burke, head of data platform at Kabbage Inc., walked through how Kabbage has been using Delta Lake and some of their key metrics of improvement. Kabbage has migrated from an on-premises Hadoop-based architecture to a Spark-based cloud architecture using AWS and Databricks. They expect to see a 30-50% savings in costs due to this migration, but Kyle also pointed out that their system is much more flexible and provides much more capability. Their new Delta Lake architecture enables them to handle all their streaming data as well as their batch data, and to deliver data for data science and BI reporting from one source of data, removing data discrepancies.

Announcement: Databricks Achieves Retail Competency
Databricks was awarded the AWS Retail Competency in recognition of our solutions and number of joint customers in Retail. This adds to the list of competencies Databricks already holds in Machine Learning, Data and Analytics, Life Sciences and Public Sector. Learn more about how retail customers combine Databricks and AWS on our retail solutions page.

Delta Lake and Athena, Glue and Redshift
Delta Lake is an open source tool that customers are using to build powerful data lakes with Amazon’s S3 service. Databricks includes Managed Delta Lake in our Unified Data Analytics Platform to provide schema enforcement, ACID transactions and a time travel feature that enables you to roll back data sets at any time. Using AWS Glue as a data catalog, Delta Lake tables can be registered for access, and AWS services such as Redshift and Athena can query Glue to identify tables and query Delta Lake for datasets. You can find out more about the integrations with Glue, Athena and Redshift in this blog post: Transform Your AWS Data Lake using Databricks Delta and the AWS Glue Data Catalog Service.

MLflow and SageMaker
Databricks provides built-in support for Python and R, and also provides built-in ML frameworks such as Keras, TensorFlow, and PyTorch. Many customers are using the power of Databricks for creating and building models, and then using SageMaker to put those models into production through our integration built on MLflow. You can find out more about how our customer Brandless is using this integration in this blog post: Using Databricks, MLflow, and Amazon SageMaker at Brandless to Bring Recommendation Systems to Production.

Enterprise Security and Credentials
Databricks enables you to architect a complete data and analytics solution seamlessly integrated with AWS security, roles and other platform elements. These integrations provide enterprise-wide visibility and policies for various teams. Policy violations can be flagged and departments can be billed with chargebacks. Learn more about IAM credential passthrough in this blog post: Introducing Databricks AWS IAM Credential Passthrough.

Koalas and pandas
Over the past few years pandas has emerged as a key Python framework for data science. To provide scalability, Databricks has developed Koalas, which implements the pandas DataFrame API on top of Apache Spark and enables data scientists to make the transition from a single machine to a distributed environment without needing to learn a new framework. You can learn more in this blog post: Koalas: Easy Transition from pandas to Apache Spark.

AWS Data Exchange
On November 13th, AWS announced the AWS Data Exchange, which makes it easier for customers to subscribe to third-party datasets to mix with their own data and drive new insights. Databricks is used by both data providers and data subscribers to build and blend datasets at scale. You can learn more in this blog post: Databricks, AWS, and SafeGraph Team Up For Easier Analysis of Consumer Behavior.

To Learn More:

  • Sign up now for our post-re:Invent webinar with SafeGraph. http://bit.ly/SAFEGRAPH
  • Get six hours of free training using Databricks on AWS: http://bit.ly/TrainingAWS
  • Talk to an expert: Contact us to get answers to questions you might have as you start your first project or to learn more about available training.

--

Try Databricks for free. Get started today.

The post Databricks Demonstrates AWS Platform Integrations at re:Invent 2019 appeared first on Databricks.

End-to-end Data Governance with Databricks and Immuta


Businesses are consuming data at a staggering rate, but when it comes to getting insights from this data they grapple with secure data access, data sharing, and ensuring compliance. With new customer data privacy regulations like GDPR and the upcoming CCPA, the leash on data security policies is getting tighter, slowing down analytics and machine learning (ML) projects.

That’s why Databricks and Immuta have partnered to provide an end-to-end data governance solution with enterprise data security for analytics, data science and machine learning. This joint solution is centered around fine-grained security, secure data discovery and search that allows teams to securely share data and perform compliant analytics and ML on their data lakes.

Enabling Scalable Analytics and ML on Sensitive Data in Data Lakes

Immuta’s automated governance solution integrates natively with Databricks Unified Data Analytics Platform. The advanced, fine-grained data governance controls give users an end-to-end, easy way to manage access to Delta Lake and meet their organization’s security and data stewardship directives.

  1. Regulatory Compliance: Immuta offers fine-grained access control that provides row, column and cell-level access to data in Databricks. This makes it possible to make more data assets available to users without restricting entire table level access. All data security policies are enforced dynamically as users run their jobs in Databricks.
  2. Secure Data Sharing: By building a self-service data catalog, Immuta makes it easy to perform secure data discovery and search in Databricks. The integration comes with features like programmatic data access that automatically enables global and local policies on Spark jobs in Databricks. Data engineers and data scientists can securely subscribe to and collaborate on sensitive data while having the peace of mind for all their data security and privacy needs.
  3. Compliant Analytics and ML: Using anonymization and masking techniques in Immuta, Databricks users can perform compliant data analytics and ML in Delta tables within the context under which they need to act, e.g. vertical (HIPAA) or horizontal compliance (GDPR, CCPA). With automated policy application, the joint solution eliminates the need to check for permissions each time data is accessed to speed up analytics workloads while preserving the data value.

Get Started with the New Data Governance Tool

To learn more about the Databricks and Immuta partnership, check out the Data Governance and Data Security for Cloud Analytics webinar.

In this webinar, Steve Touw, Co-founder and Chief Technology Officer of Immuta, and Todd Greenstein, Product Manager at Databricks share details about the solution along with an in-depth demo of the native integration.

Additional Resources

Databricks Enterprise Security – Databricks

Security and Privacy — Databricks Documentation

Databricks and Immuta: Data Analytics with Automated Governance

 

 

--

Try Databricks for free. Get started today.

The post End-to-end Data Governance with Databricks and Immuta appeared first on Databricks.

Make Your Data Lake CCPA Compliant with a Unified Approach to Data and Analytics


With more digital data being captured every day, there has been a rise of various regulatory standards such as the General Data Protection Regulation (GDPR) and recently the California Consumer Privacy Act (CCPA). These privacy laws and standards are aimed at protecting consumers from businesses that improperly collect, use, or share their personal information, and they are changing the way businesses have to manage and protect the consumer data they collect and store.

Similar to the GDPR, the CCPA empowers individuals to request:

  • what personal information is being captured,
  • how personal information is being used, and
  • to have that personal information deleted.

Additionally, the CCPA encompasses information about ‘households’. This has the potential to significantly expand the scope of personal information subject to these requests. Failure to comply in a timely manner can result in statutory fines and statutory damages (where a consumer need not even prove damages) that can add up quickly. The challenge for companies doing business in California or otherwise subject to the CCPA, then, is to ensure they can quickly find, secure, and delete that personal information.

Many companies mistakenly assume that the data privacy processes and controls put in place for GDPR compliance will guarantee complete compliance with the CCPA. While the steps you may have taken to prepare for the GDPR are helpful and a great start, they are unlikely to be sufficient. Companies need to focus on understanding their compliance obligations and must determine which processes and controls can effectively prevent the misuse and unauthorized sale of consumer data.

Are you prepared for CCPA?

CCPA requires businesses to potentially delete all personal information about a consumer upon request. Many organizations today are using or plan to use a data lake for storing the vast majority of their data in order to have a comprehensive view of their customers and business and power downstream data science, machine learning, and business analytics. The lack of structure of a data lake makes it challenging to locate and remove individual records to remain compliant with these regulatory requirements.

This is critical when responding to a consumer’s deletion request, and if a business receives more than just a few consumer rights requests in a short period of time, the resources spent to comply with the requests could be significant. Businesses that fail to comply with CCPA requirements by January 1, 2020 could be subject to lawsuits and civil penalties. The CCPA also contains a “lookback” period applying it to actions and personal information since January 1, 2019, making it vital to get these solutions in place quickly.

Taking your data security beyond the data lake

When it comes to adhering to CCPA requirements, your data lake should enable you to respond to consumer rights requests within prescribed timelines without handicapping your business. Unfortunately, most data lakes lack the data management and data manipulation capabilities to quickly locate and remove records, which makes this challenging.

Fortunately, Databricks offers a solution. The Databricks Unified Data Analytics Platform simplifies data access and engineering, while fostering a collaborative environment that supports analytics and machine learning. As part of the platform, Databricks offers a Unified Data Service that ensures reliability and scalability for your data pipelines, data lakes, and data analytics workflows.

Databricks Unified Data Analytics Platform simplifies data access and engineering, while fostering a collaborative environment that supports analytics and machine learning driven innovation.

One of the main components of the Databricks Unified Data Service is Delta Lake, an open-source storage layer that brings enhanced data reliability, performance, and lifecycle management to your data lake. With improved data management, organizations can start to think “beyond the data lake” and leverage more advanced analytics techniques and technologies to extend their data for downstream business needs including data privacy protection and CCPA compliance.

Start building a CCPA-friendly data lake with Delta Lake

Delta Lake provides your data lake with a structured data management system including transactional capabilities. This enables you to easily and quickly search, modify, and clean your data using standard DML statements (e.g. DELETE, UPDATE, MERGE INTO).

To get started, ingest your raw data with the Spark APIs that you’re familiar with and write them out as Delta Lake tables. Doing this also adds metadata to your files. If your data is already in Parquet format, you also have the option to convert the parquet files in place to a Delta Lake table without rewriting any of the data. Delta uses an open file format (parquet) so there are no worries of being locked in as you can quickly and easily convert your data back into another format if you need to.
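As a minimal sketch of the two ingestion paths just described (the paths and source format below are hypothetical):

# Option 1: write raw data out as a Delta Lake table
raw_df = spark.read.json("/mnt/raw/consumer_events/")
raw_df.write.format("delta").save("/mnt/delta/consumer_events/")

# Option 2: convert existing Parquet files in place, without rewriting the data
spark.sql("CONVERT TO DELTA parquet.`/mnt/raw/consumer_events_parquet/`")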

Once ingested, you can easily search and modify individual records within your Delta Lake tables. The final step is to make Delta Lake your single source of truth by erasing any underlying raw data. This removes any lingering records from your raw data sets. We suggest setting up a retention policy with AWS or Azure of thirty days or less to automatically remove raw data so that no further action is needed to delete the raw consumer data to meet CCPA response timelines.
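On AWS, one way to implement such a retention policy is an S3 lifecycle rule. The boto3 sketch below assumes a hypothetical bucket name and prefix; Azure offers equivalent lifecycle management rules for Blob Storage:

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-raw-data-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-raw-consumer-data",
            "Filter": {"Prefix": "raw/consumer_events/"},
            "Status": "Enabled",
            "Expiration": {"Days": 30},   # raw files older than 30 days are removed automatically
        }]
    },
)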

How do I delete data in my data lake using Delta Lake?

You can find and delete any personal information related to a consumer by running two commands:

  1. DELETE FROM data WHERE email = 'consumer@domain.com';
  2. VACUUM data;

The first command identifies the records whose email column contains 'consumer@domain.com' and deletes them by rewriting the respective underlying files with the consumer's personal data removed, marking the old files as deleted.

The second command cleans up the Delta table, removing any stale records that have been logically deleted and are outside of the default retention period. With the default retention period of 7 days, this means that files marked for deletion will linger around until you run the VACUUM command at least 7 days later. You could easily set up a scheduled job with the Databricks Job Scheduler to run the VACUUM command for you in an automated fashion. You might also be familiar with Delta Lake’s time travel capabilities, which allows you to keep historical versions of your Delta Lake table in case you need to query an earlier version of the table. Note that when you run VACUUM, you will lose the ability to time travel back to a version older than the default 7-day data retention period.
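If your response timelines require removing the underlying files sooner than the 7-day default, a shorter retention window can be used. The sketch below is illustrative (it reuses the same placeholder table name as above); note that lowering the retention and disabling the retention duration check also shortens how far back you can time travel:

# Sketch: vacuum with a shorter retention window so deleted files are physically removed sooner
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM data RETAIN 24 HOURS")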

After running these commands, you can now safely state that you have removed the necessary consumer data and records from your data lake.

How else does Databricks help me with CCPA consumer rights requests?

Once a user’s personal information has been removed from the data lake, it is also important to remove it from the tools used by your data teams. Oftentimes these tools reside locally on a data scientist’s or engineer’s laptop. A better, more secure solution is to use Databricks with its hosted Data Science Workspace, where data teams can prep, explore, and model data collaboratively in a shared notebook environment. This improves team productivity while creating a secure, centralized environment for the entire analytics workflow.

To help you meet CCPA compliance requirements, Databricks provides you with privacy protection tools to permanently remove personal information, either on a per-command or per-notebook level.

After you delete a notebook, it is moved to trash. If you don’t take further action, it will be permanently deleted within 30 days – allowing you to be confident it has been deleted within the prescribed timelines for both CCPA and GDPR.

If for any reason you need to do this more quickly, we also offer the ability to permanently delete individual items in the trash:

Databricks allows you to easily delete personal information on a per-command or per-notebook level.

deleting all items in a particular user’s trash:

Databricks also lets you permanently delete any personal information residing in the Trash, immediately.

or purging all deleted items in a workspace on command, which includes deleted notebook cells, notebook comments, or MLflow experiments:

Databricks also lets you purge all deleted items in workspace on command, for easy compliance with CCPA and other privacy regulations.

You also have the option to purge Databricks notebook revision history, which is useful to ensure that old query results are permanently deleted:

As an additional privacy protection, Databricks also gives you the option of purging the notebook revision history, to ensure old query results are permanently deleted.

Getting Started with CCPA Compliance for Data and Analytics

With the Databricks Unified Data Analytics Platform and Delta Lake, you can bring enhanced data security, reliability, performance, and lifecycle management to your data lake while delivering on all your analytics needs. Organizations can now quickly find and remove individual records from a data lake to meet CCPA access requests and compliance requirements without hindering their business.

Learn more about Delta Lake and the Databricks Unified Data Analytics Platform. Sign up for your free Databricks trial now.

--

Try Databricks for free. Get started today.

The post Make Your Data Lake CCPA Compliant with a Unified Approach to Data and Analytics appeared first on Databricks.

YipitData Example Highlights Benefits of Databricks integration with AWS Glue


At Databricks, we have partnered with the team at Amazon Web Services (AWS) to provide a seamless integration with the AWS Glue Metastore.   Databricks can easily use Glue as the metastore, even across multiple workspaces. YipitData, a longtime Databricks customer, has taken full advantage of this feature, storing all their metadata in AWS Glue.  Databricks’ integration with Glue enables YipitData to seamlessly interact with all data that is catalogued within their metastore.

YipitData is a data company that specializes in sourcing and analyzing alternative data to answer key questions for fundamental investors. YipitData relies on the scale and processing power of Databricks Unified Data Analytics for a competitive advantage: they are able to incorporate a far greater variety of data, enriched and analyzed in more ways, than competitors in their space. The ability to use the AWS Glue Metastore has been instrumental to their continued growth and success.

The key benefits for YipitData’s usage of AWS Glue with Databricks:

  • All their metadata resides in one data catalog, easily accessible across their data lake. Keeping separate metastores in sync was a difficult challenge, and using Glue removes this burden.
  • They can quickly and seamlessly integrate tools within their existing stack against the same metastore. For example, they often perform quick queries using Amazon Athena, and data that has been ETL’d using Databricks is easily accessible to any tool in the AWS stack, including Amazon CloudWatch for monitoring.
  • AWS Glue’s APIs are ideal for mass sorting and filtering. Understanding expiry across tens of thousands of tables is core to YipitData’s business; this used to take 8 hours to accomplish and, with Databricks and Glue, can now be done in under 5 minutes.

Databricks also provides several advantages that help YipitData succeed. The power of notebooks has enabled rapid sharing of information, removing the silos of tribal knowledge common in the past – now their analysts can easily share information. Using AWS’s Single Sign-On service has also been a huge benefit to the team, as they haven’t needed to implement costly, complex third-party solutions. Databricks’ ability to scale means, as Andrew Gross, Staff Engineer at YipitData, puts it, “Databricks allows us to effortlessly trade scale for speed, which was not possible before.”

Get Started with Databricks and AWS Glue

You can apply the power of Databricks and AWS Glue to help solve your toughest data problems.   Learn more at https://docs.databricks.com/data/metastores/aws-glue-metastore.html
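As a quick orientation (the documentation linked above is the authoritative guide), enabling Glue as the metastore is done through a cluster Spark configuration. The key below is taken from those docs; verify it against your Databricks Runtime version:

spark.databricks.hive.metastore.glueCatalog.enabled true

With that setting in place, tables created or ETL’d from Databricks appear in the Glue Data Catalog, where they are immediately visible to other tools in the AWS stack such as Amazon Athena.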

Additional Resources

Using AWS Glue Data Catalog as the Metastore for Databricks 

AWS Data Lake Delta Transformation Using AWS Glue

 

--

Try Databricks for free. Get started today.

The post YipitData Example Highlights Benefits of Databricks integration with AWS Glue appeared first on Databricks.


Better Machine Learning through Active Learning


Try this notebook to reproduce the steps outlined below

Machine learning models can seem like magical savants. They can distinguish hot dogs from not-hot-dogs, but that’s long since an easy trick. My aunt’s parrot can do that too. But machine-learned models power voice-activated assistants that effortlessly understand noisy human speech, and cars that drive themselves more or less safely. It’s no wonder we assume these are at some level artificially ‘intelligent’.

What they don’t tell you is that these supervised models are more parrot than oracle. They learn by example, lots of them, and learn to emulate the connection between input and output that the examples suggest. Herein lies the problem that many companies face when embracing machine learning: the modeling is (relatively) easy. Having the right examples to learn from is not.

Obtaining these examples can be hard. One can’t start collecting the last five years of data, today, of course. Where there is data, it may be just ‘inputs’ without desired ‘outputs’ to learn. Worse, producing that label is typically a manual process. After all, if there were an automated process for it, there would be no need to relearn it as a model!

Where labels are not readily available, some manual labeling is inevitable. Fortunately, not all data has to be labeled. A class of techniques commonly called ‘active learning’ can make the process collaborative, wherein a model trained on some data helps identify data that are most useful to label next.

This example uses a Python library for active learning, modAL, to assist a human in labeling data for a simple text classification problem. It will show how Apache Spark can apply modAL at scale, and how open source tools like Hyperopt and mlflow, as integrated with Spark in Databricks, can help along the way.

Real-world Learning Problem: Classifying Consumer Complaints as “Distressed”

The US Consumer Financial Protection Bureau (CFPB) oversees financial institutions’ relationships with consumers and handles complaints from consumers. It has published an anonymized data set of these complaints. Most of it is simple tabular data, but it also contains the free text of a consumer’s complaint (if present). Anyone who has handled customer support tickets will not be surprised by what they look like.

complaints_df = full_complaints_df.\
  select(col("Complaint ID").alias("id"),\
    col("Consumer complaint narrative").alias("complaint")).\
  filter("complaint IS NOT NULL")
display(complaints_df)

Example of a common text classification problem where the training data is largely unlabeled.

Imagine that the CFPB wants to prioritize or pre-emptively escalate handling of complaints that seem distressed: a consumer that is frightened or angry, would be raising voices on a call. It’s a straightforward text classification problem — if these complaints are already labeled accordingly. They are not. With over 440,000 complaints, it’s not realistic to hand-label them all.

Accepting that, your author labeled about 230 of the complaints (dataset).

labeled1_df = spark.read.option("header", True).option("inferSchema", True).\
  csv(data_path + "/labeled.csv")
input1_df = complaints_df.join(labeled1_df, "id")
pool_df = complaints_df.join(labeled1_df, "id", how="left_anti")
display(input1_df)

Active learning requires only a small subset of the training data to be manually labeled

Using Spark ML to Build the Initial Classification Model

Spark ML can construct a basic TF-IDF embedding of the text at scale. At the moment, only the handful of labeled examples need transformation, but the entire data set will need this transformation later.

# Tokenize into words
tokenizer = Tokenizer(inputCol="complaint", outputCol="tokenized")
# Remove stopwords
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered")
# Compute term frequencies on the stopword-filtered tokens and hash into buckets
hashing_tf = HashingTF(inputCol=remover.getOutputCol(), outputCol="hashed",\
  numFeatures=1000)
# Convert to TF-IDF
idf = IDF(inputCol=hashing_tf.getOutputCol(), outputCol="features")

pipeline = Pipeline(stages=[tokenizer, remover, hashing_tf, idf])
pipeline_model = pipeline.fit(complaints_df)

# need array of float, not Spark vector, for pandas later
tolist_udf = udf(lambda v: v.toArray().tolist(), ArrayType(FloatType()))
featurized1_df = pipeline_model.transform(input1_df).\
  select("id", "complaint", "features", "distressed").\
  withColumn("features", tolist_udf("features"))

There is no value in applying distributed Spark ML at this scale. Instead, scikit-learn can fit the model on this tiny data set in seconds. However, Spark still has a role here. Fitting a model typically means fitting many variants on the model, varying ‘hyperparameters’ like more or less regularization. These variants can be fit in parallel by Spark. Hyperopt is an open-source tool integrated with Spark in Databricks that can drive this search for optimal hyperparameters in a way that learns what combinations work best, rather than just randomly searching.

The attached notebook has a full code listing, but an edit of the key portion of the implementation follows:

# Core function to train a model given train set and params
def train_model(params, X_train, y_train):
  lr = LogisticRegression(solver='liblinear', max_iter=1000,\
         penalty=params['penalty'], C=params['C'], random_state=seed)
  return lr.fit(X_train, y_train)

# Wraps core modeling function to evaluate and return results for hyperopt
def train_model_fmin(params):
  lr = train_model(params, X_train, y_train)
  loss = log_loss(y_val, lr.predict_proba(X_val))
  # supplement auto logging in mlflow with accuracy
  accuracy = accuracy_score(y_val, lr.predict(X_val))
  mlflow.log_metric('accuracy', accuracy)
  return {'status': STATUS_OK, 'loss': loss, 'accuracy': accuracy}

penalties = ['l1', 'l2']
search_space = {
  'C': hp.loguniform('C', -6, 1),
  'penalty': hp.choice('penalty', penalties)
}

best_params = fmin(fn=train_model_fmin,
                   space=search_space,
                   algo=tpe.suggest,
                   max_evals=32,
                   trials=SparkTrials(parallelism=4),
                   rstate=np.random.RandomState(seed))

# Need to translate this back from 0/1 in output to be used again as input
best_params['penalty'] = penalties[best_params['penalty']]
# Train final model on train + validation sets
final_model = train_model(best_params,\
                          np.concatenate([X_train, X_val]),\
                          np.concatenate([y_train, y_val]))

...
(X_train, X_val, X_test, y_train, y_val, y_test) = build_test_train_split(featurized1_pd, 80)
(best_params, best_model) = find_best_lr_model(X_train, X_val, y_train, y_val)
(accuracy, loss) = log_and_eval_model(best_model, best_params, X_test, y_test)
...

Accuracy: 0.6
Loss: 0.6928265768789768

Hyperopt tries a number of different hyperparameter combinations in its search (32 in the listing above, set by max_evals). Here, it varies the L1 vs L2 regularization penalty, and the strength of regularization, C. It returns the best settings it found, from which a final model is refit on the train and validation data. Note that the results of these trials are automatically logged to MLflow, if using Databricks. The listing above shows that it’s possible to log additional metrics like accuracy, not just the ‘loss’ that Hyperopt records. It’s clear, for example, that L1 regularization is better, incidentally:

The open source tool hyperopt can drive the search for optimal hyperparameters, returning better combinations than random searching.

For the run with best loss of about 0.7, accuracy is only 60%. Further tuning and more sophisticated models could improve this, but there is only so far this can get with a small training set. More labeled data is needed.

Applying modAL for Active Learning

This is where active learning comes in, via the modAL library. It is pleasantly simple to apply. When wrapped around a classifier or regressor that can return a probabilistic estimate of its prediction, it can analyze remaining data and decide which are most useful to label.

“Most useful” generally means labels for inputs that the classifier is currently most uncertain about. Knowing that label is more likely to improve the classifier than knowing the label of an input whose prediction is already quite certain. modAL supports classifiers like logistic regression, whose output is a probability, via ActiveLearner.

from modAL.models import ActiveLearner
learner = ActiveLearner(estimator=best_model, X_training=X_train, y_training=y_train)

It’s necessary to prepare the ‘pool’ of remaining data for querying. This means featurizing the rest of the data, so it’s handy that it was implemented with Spark ML:

featurized_pool_df = pipeline_model.transform(pool_df).\
  select("id", "complaint", "features").\
  withColumn("features", tolist_udf("features")).cache()

ActiveLearner’s query() method returns the most uncertain instances from an unlabeled data set, but it can’t directly operate in parallel via Spark. However, Spark can apply it in parallel to chunks of the featurized data using a pandas UDF, which efficiently presents the data as pandas DataFrames or Series; each chunk can then be queried independently with ActiveLearner. Your author can only bear labeling a hundred or so more complaints, so this example tries to choose just about 0.02% of the 440,000 complaints in the pool:

query_fraction = 0.0002

@pandas_udf("boolean")
def to_query(features_series):
  X_i = np.stack(features_series.to_numpy())
  n = X_i.shape[0]
  query_idx, _ = learner.query(X_i, n_instances=math.ceil(n * query_fraction))
  # Output has same size of inputs; most instances were not sampled for query
  query_result = pd.Series([False] * n)
  # Set True where ActiveLearner wants a label
  query_result.iloc[query_idx] = True
  return query_result

with_query_df = featurized_pool_df.withColumn("query", to_query("features"))
display(with_query_df.filter("query").select("complaint"))

ActiveLearner's query() method selects approximately the top 0.02% from each chunk of unlabeled data.

Note that this isn’t quite the same as selecting the best 0.02% to query from the entire pool of 440,000, because it selects the top 0.02% from each chunk of that data, as a pandas DataFrame, separately. This won’t necessarily give the very best query candidates overall. The upside is parallelism, and the tradeoff is probably worth making in practical cases, as the selected candidates will still be far more useful to query than most.

Understanding the Active Learner Queries

Indeed, the model returns probabilities between 49.9% and 50.1% for all complaints in the query. It is uncertain about all of them.

The input features can be plotted in two dimensions (via scikit-learn’s PCA) with seaborn to visualize not only which complaints are classified as ‘distressed’, but which the learner has chosen for labeling.

...
queried = with_query_pd['query']
ax = sns.scatterplot(x=pca_pd[:,0], y=pca_pd[:,1],\
                     hue=best_model.predict(with_query_np), style=~queried, size=~queried,\
                     alpha=0.8, legend=False)
# Zoom in on the interesting part
ax.set_xlim(-0.75,1)
ax.set_ylim(-1,1)
display()

Here, orange points are ‘distressed’ and blue are not, according to the model so far. The larger points are some of those selected to query; they are all, as it happens, negative.

Model Classification of (Projected) Sample, with Queried Points

Plotting the Active Learner queries chosen for labeling

Although hard to interpret visually, it does seem to choose points in regions where both classifications appear, not from uniform regions.

Effects on Machine Learning Accuracy

Your author downloaded the query set from Databricks as CSV and dutifully labeled almost 100 more in a favorite spreadsheet program, then exported and uploaded it back to storage as CSV. A low-tech process like this — a column in a spreadsheet — may be just fine for small scale labeling. Of course it is also possible to save the query as a table that an external system uses to manage labeling.

The same process above can be repeated with the new, larger data set. The result? Cutting to the chase, it’s 68% accuracy. Your mileage may vary. This time Hyperopt’s search (see listing above) over hyperparameters found better models from nearly the first few trials and improved from there, rather than plateauing at about 60% accuracy.

Learning Strategy Variations on modAL Queries

modAL has other strategies for choosing query candidates: max uncertainty sampling, max margin sampling and entropy sampling. These differ in the multi-class case, but are equivalent in a binary classification case such as this.

Also, for example, ActiveLearner’s query_strategy can be customized to use “uncertainty batch sampling” to return queries ranked by uncertainty. This may be useful to prepare a longer list of queries to be labeled in order of usefulness as much as time permits before the next model build and query loop.

from modAL.batch import uncertainty_batch_sampling

def preset_batch(classifier, X_pool):
  return uncertainty_batch_sampling(classifier, X_pool, 100)

learner = ActiveLearner(estimator=..., query_strategy=preset_batch)

Active Learning with Streaming

Above, the entire pool of candidates were available for the query() method. This is useful when choosing the best ones to query in a batch context. However it might be necessary to apply the same ideas to a stream of data, one at a time.

It’s already of course possible to score the model against a stream of complaints and flag the ones that are predicted to be ‘distressed’ with high probability for preemptive escalation. However it might equally be useful, in some cases, to flag highly-uncertain inputs for evaluation by a data science team, before the model and learner are rebuilt.

from modAL.uncertainty import classifier_uncertainty

@pandas_udf("boolean")
def uncertain(features_series):
  X_i = np.stack(features_series.to_numpy())
  n = X_i.shape[0]
  uncertain = pd.Series([False] * n)
  # Set True where uncertainty is high. Uncertainty is at most 0.5
  uncertain[classifier_uncertainty(learner, X_i) > 0.4999] = True
  return uncertain

display(pool2_df.filter(uncertain(pool2_df['features'])).drop("features"))

Using Active Learning with streaming to flag “highly-uncertain” complaint data for evaluation by the data science team.

In the simple binary classification case, this essentially reduces to finding where the model outputs a probability near 0.5. However modAL offers other possibilities for quantifying uncertainty that do differ in the multi-class case.
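For reference, a small sketch of those measures as exposed by modAL’s uncertainty module (X_i and learner here are the same objects used in the snippet above):

from modAL.uncertainty import classifier_uncertainty, classifier_margin, classifier_entropy

u = classifier_uncertainty(learner, X_i)  # 1 - maximum predicted class probability
m = classifier_margin(learner, X_i)       # gap between the top two class probabilities
e = classifier_entropy(learner, X_i)      # entropy of the predicted class distribution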

Getting Started with Your Active Learning Problem

When we learn from data with supervised machine learning techniques, it’s not how much data we have that counts, but how much labeled data. In some cases labels are expensive to acquire, manually. Fortunately active learning techniques, as implemented in open source tools like modAL, can help humans prioritize what to label. The recipe is:

  • Label a small amount of data, if not already available
  • Train an initial model
  • Apply active learning to decide what to label
  • Train a new model and repeat until accuracy is sufficient or you run out of labelers’ patience

modAL can be applied at scale with Apache Spark, and integrates well with other standard open source tools like scikit-learn, Hyperopt, and mlflow.
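As a minimal sketch of that loop, assuming numpy arrays X_seed/y_seed for the initial labels, an unlabeled pool X_pool, and a hypothetical label_by_hand() function standing in for the manual labeling step:

import numpy as np
from sklearn.linear_model import LogisticRegression
from modAL.models import ActiveLearner

learner = ActiveLearner(estimator=LogisticRegression(solver="liblinear"),
                        X_training=X_seed, y_training=y_seed)

for _ in range(5):  # repeat until accuracy is sufficient, or labelers run out of patience
    query_idx, query_instances = learner.query(X_pool, n_instances=100)
    y_new = label_by_hand(query_instances)         # hypothetical manual labeling step
    learner.teach(X_pool[query_idx], y_new)        # retrain on the newly labeled examples
    X_pool = np.delete(X_pool, query_idx, axis=0)  # drop the labeled rows from the pool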

Complaints about this blog? Please contact the CFPB.

--

Try Databricks for free. Get started today.

The post Better Machine Learning through Active Learning appeared first on Databricks.

Automate deployment and testing with Databricks Notebook + MLflow


Today many data science (DS) organizations are accelerating the agile analytics development process using Databricks notebooks. Fully leveraging the distributed computing power of Apache Spark™, these organizations are able to interact easily with data at multi-terabyte scale, from exploration to fast prototyping and all the way to productionizing sophisticated machine learning (ML) models. As fast iteration is achieved at high velocity, it has become increasingly evident that managing the DS life cycle for efficiency, reproducibility, and high quality is non-trivial. The challenge multiplies in large enterprises where data volume grows exponentially, the expectation of ROI from getting business value out of data is high, and cross-functional collaborations are common.

In this blog, we introduce a joint work with Iterable that hardens the DS process with best practices from software development.  This approach automates building, testing, and deployment of DS workflow from inside Databricks notebooks and integrates fully with MLflow and Databricks CLI. It enables proper version control and comprehensive logging of important metrics, including functional and integration tests, model performance metrics, and data lineage. All of these are achieved without the need to maintain a separate build server.

Overview

In a typical software development workflow (e.g. Github flow), a feature branch is created based on the master branch for feature development. A notebook can be synced to the feature branch via Github integration, or a notebook can be exported from the Databricks workspace to your laptop and code changes committed to the feature branch with git commands. When the development is ready for review, a Pull Request (PR) will be set up and the feature branch will be deployed to a staging environment for integration testing. Once tested and approved, the feature branch will be merged into the master branch. The master branch is always ready to be deployed to production environments.

In our approach, the driver of the deployment and testing processes is a notebook. The driver notebook can run on its own cluster or on a dedicated high-concurrency cluster shared with other deployment notebooks. The notebooks can be triggered manually, or they can be integrated with a build server for a full-fledged CI/CD implementation. The input parameters include the deployment environment (testing, staging, prod, etc.), an experiment ID with which MLflow logs messages and artifacts, and the source code version.
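A minimal sketch of how such parameters might be passed in with notebook widgets follows (the widget names are assumptions for illustration, not the exact ones used in this pipeline):

dbutils.widgets.text("deploy_env", "staging")   # testing, staging, prod, ...
dbutils.widgets.text("experiment_id", "")       # MLflow experiment to log runs to
dbutils.widgets.text("git_hash", "")            # source code version to deploy

deploy_env = dbutils.widgets.get("deploy_env")
experiment_id = dbutils.widgets.get("experiment_id")
git_hash = dbutils.widgets.get("git_hash")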

As depicted in the workflow below, the driver notebook starts by initializing the access tokens to both the Databricks workspace and the source code repo (e.g. github). The building and deploying process runs on the driver node of the cluster, and the build artifacts will be deployed to a dbfs directory. The deploy status and messages can be logged as part of the current MLflow run.

After the deployment, functional and integration tests can be triggered by the driver notebook. The test results are logged as part of a run in an MLflow experiment. The test results from different runs can be tracked and compared with MLflow. In this blog, python and scala code are provided as examples of how to utilize MLflow tracking capabilities in your tests.

Automate Notebooks Deployment Process

First of all, a UUID and a dedicated work directory are created for each deployment so that concurrent deployments are isolated from each other. The following code snippet shows how the deploy UUID is assigned from the active run ID of an MLflow experiment, and how the working directory is created.

import os
import mlflow
active_run = mlflow.start_run(experiment_id=experiment_id)
deploy_uuid = active_run.info.run_id
workspace = "/tmp/{}".format(deploy_uuid)
print("workspace: {}".format(workspace))
if not os.path.exists(workspace):
  os.mkdir(workspace)

To authenticate and access Databricks CLI and Github, you can set up personal access tokens. Details of setting up CLI authentication can be found at: Databricks CLI > Set up authentication.  Access tokens should be treated with care. Explicitly including the tokens in the notebooks can be dangerous. The tokens can accidentally be exposed when the notebook is exported and shared with other users.

One way to protect your tokens is to store the tokens in Databricks secrets. A scope needs to be created first:

databricks secrets create-scope --scope cicd-test 

To store a token in a scope:

databricks secrets put --scope cicd-test --key token

To access the tokens stored in secrets, dbutils.secrets.get can be utilized. The fetched tokens are displayed in notebooks as [REDACTED]. The permission to access a token can be defined using Secrets ACL. For more details about the secrets API, please refer to Databricks Secrets API.

The following code snippet shows how secrets are retrieved from a scope:

db_token = dbutils.secrets.get(scope = pipeline_config["secrets_scope"], key = pipeline_config["databricks_access_token"])
git_username = dbutils.secrets.get(scope = pipeline_config["secrets_scope"], key = pipeline_config["github_user"])
git_token = dbutils.secrets.get(scope = pipeline_config["secrets_scope"], key = pipeline_config["github_access_token"])

Databricks access can be set up via .databrickscfg file as follows. Please note that each working directory has its own .databrickscfg file to support concurrent deployments.

dbcfg_path = os.path.join(workspace, ".databrickscfg")
with open(dbcfg_path, "w+") as f:
  f.write("[DEFAULT]\n")
  f.write("host = {}\n".format(db_host_url))
  f.write("token = {}\n".format(db_token))

The following code snippet shows how to check out the source code from Github given a code version. The building process is not included but can be added after the checkout step. After that, the artifact is deployed to a dbfs location, and notebooks can be imported to Databricks workspace.

lines = '''
#!/bin/bash
export DATABRICKS_CONFIG_FILE={dbcfg}
echo "cd {workspace}/{repo_name}/notebooks/"
cd {workspace}/{repo_name}/notebooks/

echo "target_ver_dir={target_ver_dir}"
databricks workspace delete -r {target_ver_dir}
databricks workspace mkdirs {target_ver_dir}
if [[ $? != 0 ]]; then exit -1; fi

databricks workspace import_dir {source_dir} {target_ver_dir}
'''.format(target_base_dir=target_base_dir, git_hash=git_hash, deploy_env=deploy_env, repo_name=repo_name, target_ver_dir=target_ver_dir, git_url=git_url, pipeline_id=pipeline_id, workspace=workspace, dbcfg=dbcfg_path)

with open("{}/deploy_notebooks.sh".format(workspace), "w+") as f:
  f.writelines(lines)

process = subprocess.Popen(["bash", "{}/deploy_notebooks.sh".format(workspace)], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
sys.stdout.write(process.communicate()[0].decode("utf-8"))

Deploy Tracking

For visibility into the state of our deployment, we normally might store that in a database or use some sort of managed deployment service with a UI. In our case, we can use MLflow for those purposes.

The metadata such as deploy environment, app name, notes can be logged by MLflow tracking API:

try:
  mlflow.log_param("run_id", active_run.info.run_uuid)
  mlflow.log_param("env", deploy_env)
  mlflow.log_param("githash", git_hash)
  mlflow.log_param("pipeline_id", pipeline_config["pipeline-id"])
  mlflow.log_param("deploy_note", deploy_note)
except:
  clean_up(active_run.info.run_uuid)
  raise

Triggering Notebooks

Now that we have deployed our notebooks into our workspace path, we need to be able to trigger the correct version of the set of notebooks given the environment. We may have notebooks on version A in the prd environment while simultaneously testing version B in our staging environment.

Every deployment system needs a source of truth for the mappings for the “deployed” githash for each environment. For us, we leverage Databricks Delta since it provides us with transactional guarantees.

For us, we simply look up in the deployment delta table the githash for a given environment and run the notebook at that path.

dbutils.notebook.run(PATH_PREFIX + s"${git_hash}/notebook", ...)
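A hypothetical PySpark sketch of that lookup follows; the table name deployments and the columns env, git_hash, and deploy_time are assumptions for illustration, as is PATH_PREFIX:

# Look up the most recently deployed git hash for the current environment
git_hash = (spark.table("deployments")
            .filter(f"env = '{deploy_env}'")
            .orderBy("deploy_time", ascending=False)
            .select("git_hash")
            .first()["git_hash"])

dbutils.notebook.run(PATH_PREFIX + "{}/notebook".format(git_hash), 0)  # 0 = no timeout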

 

In Production

At Iterable, we needed to move quickly and avoid setting up the heavy infrastructure to have a deployment and triggering system if possible. Hence we developed this approach with Li at Databricks such that we could conduct most of our workflow within Databricks itself, leverage Delta as a database, and use MLflow for a view for the state of truth for deployments.

Because our data-scientists work within Databricks and can now deploy their latest changes all within Databricks, leveraging the UI that MLflow and Databricks notebooks provide, we are able to iterate quickly while having a robust deployment and triggering system that has zero downtime between deployments.

Implement tests

Tests and validation can be added to your notebooks by calling assertion statements. However, error messages from assertions are scattered across notebooks, and there is no overview of the testing results available. In this section, we are going to show you how to automate tests from notebooks and track the results using the MLflow tracking APIs.

In our example, a driver notebook serves as the main entry point for all the tests. The driver notebook is source controlled and can be invoked from the deployment notebook. In the driver notebook, a list of tests/test notebooks is defined and looped through to run and generate test results. The tests can be a set of regression tests and tests specific to the current branch. The driver notebook handles creating the MLflow scope and logs the test results to the proper run of an experiment.

def log_message(test_name: str, msg: str):
  if experiment_id:
    mlflow.log_param(test_name, msg)
  print("{} {}".format(test_name, msg))

def test_notebook(notebook_path):
  import time
  test_name = get_notebook_name(notebook_path)
  try:
    start_time = time.time()
    result = dbutils.notebook.run(notebook_path, 120, {"egg_file": egg_file})
    elapsed_time = time.time() - start_time
    log_message(test_name, result)
    mlflow.log_metric("{}_dur".format(test_name), elapsed_time)
  except Exception as e:
    log_message(test_name, "Failed")
    print(e)

for t in test_notebooks:
  test_notebook(t)

The picture below shows a screenshot of an experiment of MLflow, which contains testing results from different runs. Each run is based on a code version (git commit), which is also logged as a parameter of the run.

The MLflow UI provides powerful capabilities for end-users to explore and analyze the results of their experiments. The result table can be filtered by specific parameters and metrics. Metrics from different runs can be compared and generate a trend of the metric like below:

Unit tests of individual functions are also tracked by MLflow. A common testing fixture can be implemented for logging the metadata of tests, and test classes inherit this common fixture to add MLflow tracking capability to their tests. Our current implementation is based on ScalaTest, though a similar implementation can be done with other testing frameworks as well.

The code example below shows how a fixture (testTracker) can be defined by overriding the withFixture method on TestSuiteMixin. A test function is passed to withFixture and executed inside it; this way, withFixture serves as a wrapper function around the test. Pre- and post-processing code can be implemented inside withFixture. In our case, the preprocessing records the start time of the test, and the post-processing logs the metadata of the test function. Any test suite that inherits this fixture will automatically run it before and after each test to log the test’s metadata.

import org.scalatest._
import scala.collection.mutable._

trait TestTracker extends TestSuiteMixin { this: TestSuite =>

  var testRuns = scala.collection.mutable.Map[String, MetricsTrackData]()
  var envVariables: String = ""
  //To be overridden with the actual test suite name
  var testSuiteName: String = randomUUID().toString

  abstract override def withFixture(test: NoArgTest) = {
    val ts = System.currentTimeMillis()
    val t0 = System.nanoTime()
    var testRes: Boolean = false
    var ex: String = ""
    try super.withFixture(test) match {
      case result: Outcome =>
        if (result.isFailed) {
          val failed = result.asInstanceOf[Failed]
          ex = failed.exception.toString
        }
        else if (result.isCanceled) {
          val canceled = result.asInstanceOf[Canceled]
          ex = canceled.exception.toString
        }
        else if (result.isSucceeded) {
          testRes = true
        }
        result
      case other =>
        other
    }
    finally {
      //TODO: log your metrics here
    }
  }
  
  def AllTestsPassed(): Boolean = {
    if(testRuns.size == 0) true
    else {
      testRuns.values.forall { t =>
        t.status
      }
    }
  }
}

A test suite needs to extend from TestTracker to incorporate the logging capability to its own tests. The code example below shows how to inherit the testing metadata logging capability from the fixture defined above:

class TestClass extends FlatSpec with TestTracker {

  "addCountColumns" should "add correct counts" in {
    //TODO: test your function here
    //assert(...)
  }
}

//Run the tests
val test = new TestClass
test.execute()

Discussion

In this blog, we have reviewed how to build a CI/CD pipeline combining the capability of Databricks CLI and MLflow. The main advantages of this approach are:

  • Deploy notebooks to production without having to set up and maintain a build server.
  • Log metrics of tests automatically.
  • Provide query capability of tests.
  • Provide an overview of deployment status and test results.
  • ML algorithm performance is tracked and can be analyzed (e.g. detect model drift, performance degradation).

With this approach, you can quickly set up a production pipeline in the Databricks environment. You can also extend the approach by adding more constraints and steps for your own productization process.

Credits

We want to thank the following contributors: Denny Lee, Ankur Mathur, Christopher Hoshino-Fish, Andre Mesarovic, and Clemens Mewald.

--

Try Databricks for free. Get started today.

The post Automate deployment and testing with Databricks Notebook + MLflow appeared first on Databricks.

Azure Databricks Achieves HITRUST CSF® Certification


We are excited to announce that Azure Databricks is now certified for the HITRUST Common Security Framework (HITRUST CSF®).

Azure Databricks is already trusted by organizations such as Electrolux, Shell, renewables.AI, and Devon Energy for their business-critical use cases. The HITRUST CSF certification provides customers the assurance that Azure Databricks meets a level of security and risk controls to support their regulatory requirements and specific industry use cases. The HITRUST CSF is widely adopted across a variety of industries by organizations who are modernizing their approach to information security and privacy.

The HITRUST CSF is a certifiable framework that can be leveraged by organizations to comply with ISO, SOC 2, NIST, and HIPAA information security requirements.

For example, the HITRUST CSF certification can be used to measure and attest to the effectiveness of an organization’s own internal security and compliance efforts and to evaluate the security and risk-management efforts of its supply chain and third-party vendors. A growing number of healthcare organizations require their business associates to obtain CSF Certification in order to demonstrate effective security and privacy practices to meet healthcare industry requirements. HITRUST CSF has supported New York State Cybersecurity Requirements for Financial Services Companies (23 NYCRR 500) since 2018. Increasingly, Fintech companies are adopting HITRUST CSF to demonstrate the effectiveness of internal security and compliance efforts and to evaluate third-party vendor efforts. HITRUST CSF certification also provides startups a common approach to information risk management and compliance for appropriate security and privacy oversight. This translates to lower risk for customers.

View the Azure Databricks HITRUST CSF Assessment and other Security Compliance Documentation

You can view and download the Azure Databricks HITRUST CSF certification and download related certifications by visiting the Microsoft Trust Center. Learn more about HITRUST by viewing the Microsoft HITRUST documentation.

As always, we welcome your feedback and questions and commit to helping customers achieve and maintain the highest standard of security and compliance. Please feel free to reach out to the team through Azure Support.

Follow us on Twitter, LinkedIn, and Facebook for more Azure Databricks security and compliance news, customer highlights, and new feature announcements.

--

Try Databricks for free. Get started today.

The post Azure Databricks Achieves HITRUST CSF® Certification appeared first on Databricks.

Engineering Interviews — A Hiring Manager’s Guide to Standing Out


Ask any engineering leader at a growth stage company what their top priority is, and they’ll likely say hiring. When we think about how big a decision taking a job is for both the company and candidate, the few hours of interviews seems pretty short. We want to make sure our job interview process makes the most of that time to help both candidates and Databricks understand if the role is a good fit. We want to learn about you and make sure you get the information you need to make the best decision. One of the best ways to do this is to design interviews that emphasize conversation and collaboration. Real world problems are messy and complex. We want to understand how candidates solve abstract challenges more than we want to see a specific solution.

What do you want candidates to understand about the data team at Databricks before entering the interview process?

Despite the scale of infrastructure Databricks operates, we have a relatively small engineering organization. We operate millions of virtual machines, generating terabytes of logs and processing exabytes of data per day. At our scale, we regularly observe cloud hardware, network, and operating system faults, and our software must gracefully shield our customers from any of the above. We do all this with less than 200 engineers.

Our size means we have the flexibility to adopt or create the technology we believe is the best solution for each engineering challenge. The flip side of that is there are many parts of our infrastructure that are still maturing, so the set of concerns for many initiatives expands beyond the scope of a single service. It’s also still a startup so the boundaries of ownership and responsibility aren’t always clear. That means it’s easy to make changes and have an impact outside your core focus areas, and that you’ll own much more of a project than you would somewhere else.

What are you going to be a master of after working at Databricks? You will be able to create scalable systems within the Big Data and Machine Learning field. Most engineers don’t do applied ML in their day to day work, but we deeply understand how it’s being used across a range of industries for our customers.

How can you prepare for technical interview questions?

Our engineering interviews consist of a mix of technical and soft skills assessments between 45 and 90 minutes long. While some of our technical interviews are more traditional algorithm questions focused on data structures and computer science fundamentals, we have been shifting towards more hands-on problem solving and coding assessments. Even on the algorithm questions, candidates are welcome to work through the problem on a laptop rather than a whiteboard if they prefer. This helps us get a sense of how they write code in a more realistic environment. For our coding questions, we focus less on algorithm knowledge and more on design, code structure, debugging and learning new domains. For example, some of our technical questions will probably use a language/framework you are unfamiliar with so you’ll need to demonstrate an ability to read documentation and solve a problem in a new area. Other questions involve progressively building a complex program in stages by following a feature spec.

We also adapt our interviews based on the candidate’s background, work experience, and role. For more fullstack roles, we spend more time on the basics of web communication (http, websockets, authentication), browser fundamentals (caching, js event handling), and API + data modeling. For more low level systems engineering, we’ll emphasize multi threading and OS primitives.

I recommend three things to prepare:

  1. Find coding questions online and practice solving them completely. This means creating full working code and tests without looking at the solution. Creating tests is important; some of our technical questions have several stages, so you’ll want to be able to quickly set up a test harness for a fast edit/compile/debug loop during the interview, just like you would for your day to day work.
  2. Review computer science fundamentals. Know common data structures, the runtime and memory utilization of each method, and their interface in the language you plan to use. This technical interview handbook on GitHub is a good overview of the different data structures, but you should also study systems concepts like multi-threading, concurrency, locks and transactions.
  3. Do mock interviews. The time pressure and dialogue of a mock job interview is a great way to get comfortable before the real thing. Have a friend ask you questions you don’t know and hint along the way as needed.

Haoyi on our Dev Tools team wrote a great blog post on how to interview effectively that gives good insight into how we structure our interviews and what we look for.

What are the most common mistakes you see during interviews?

Now that we’ve covered what we look for and how to prepare for interviews, there are a few things you should consciously try not to do during an engineering job interview.

The main one is lacking passion or interest in the role. Remember, you are interviewing the company as well and it’s important you show that you are invested in making a match. Having low enthusiasm, not being familiar with the Databricks product, not asking any questions and in general relying on the interviewer to drive the entire conversation are all signs you aren’t interested. Just as you want an interview process that challenges you and dives into your skills and interests, we like a candidate that asks us tough questions and takes the time to get to know us.

For technical interviews, if a candidate is pursuing a solution that won’t work, we try to help them realize it before spending a lot of time on implementation. If the interviewer is asking questions, chances are they are trying to hint you towards a different path. Rather than staying fixed on a single track solution, take a minute to step back and reconsider your approach with new hints or questions. Remember that your interviewer has probably asked the same question dozens of times and seen a range of approaches. They also want to see how you’d respond in a real-world environment, where you’d be working with a team that offers help in a similar way.

For interviews focused on work history and soft skills, have specific examples. It’s OK to start with a broad generalization, but tell a story about how specific examples from your past work history answer the question. When talking about your work experience, try to (1) clearly define the problem, (2) describe your solution, (3) explain the outcome, and (4) share any reflections on improvements. A good way to provide a well thought-out answer is by using the STAR Interview Response Technique.

What are some qualities you’ve seen in successful and impactful engineers on your teams (both in the present and past)?

At a startup like Databricks, the most important quality I’ve seen in successful engineers is ownership. We are growing quickly, which brings a lot of new challenges every week, but it’s not always clear how responsibilities divide across teams and priorities get determined. Great engineers handle this ambiguity by surfacing the most impactful problems to work on, not just those limited to their current team’s responsibilities. Sometimes this means directly helping to build the solution, but often it’s motivating others to prioritize the work.

The second quality we focus on, particularly for those earlier in their career, is the ability to learn and grow. The derivative of knowledge is often more important than a candidate’s current technical skills. Many of the engineering problems we are solving don’t have existing templates to follow. That means continually breaking through layers of abstraction to consider the larger system – from the lowest level of cpu instructions, up to how visualizations are rendered in the browser.

How have I seen these qualities in interviews? Engineers that show a lot of ownership can often speak in detail about the adjacent systems they relied on for past work. For example, they know the strengths and weaknesses of a specific storage layer or build system they used and why. They also often create changes to help their team become more effective – either through tooling improvements or a process change. Growth comes across through reflection on past work. No solution is perfect, and great engineers know what they would do next or do differently. A lot of candidates say the opportunity to grow is their main criteria for choosing their next job, but they should be able to talk about what they are already doing to grow. Maybe that’s a side project, a new technology they recently learned, some improvement to their developer environment, or a mentor relationship they are cultivating in their current role.

What are some of the problems your team is working on? What skills are you looking for that will make candidates successful with these problems?

The Workspace team has a pretty broad set of product use cases to support and most of the team works full stack. We look for generalists who have shown an ability to quickly learn new technologies. We are also very customer facing and need engineers that can dig deep to understand our users to formulate requirements. Several of the team members either had their own startups in the past or worked as early employees at startups.

One of the best ways to understand a role is to ask, “What will I become a master of?” For the Workspace team it’s three main skills.

  1. Quickly learning new technologies. The Workspace team does a lot of exploratory and prototype work. The team has many generalists who need to combine product sense and an ability to adapt existing technology to novel problems. A good example is adapting open source Jupyter to run in Databricks hosted cloud with Databricks Clusters. Another one is creating a pub/sub infra to stream updates through a GraphQL API to realtime web clients.
  2. The workflows around data science, machine learning and data analytics. We are building products for those personas so you will intimately understand the daily workflow of data scientists and data engineers at a variety of customers across many industries and company sizes. Engineers on this team have regular exposure to our customers and internal customer champions in Field Engineering.
  3. Scalable web service design on the JVM. Our team works on the core backend for the stateful notebooks and Workspace, which often faces design challenges unique to a service at our scale. Everyone on the team develops a deep understanding of resource primitives (cpu/memory/io/network) and how to optimize their usage in a distributed fault tolerant architecture.

At Databricks, we are constantly looking for Software Engineers who embody the characteristics we’ve talked about. If you are interested in solving some of the challenges that we are currently tackling here, check out our Careers Page and apply to interview with us!

Ted Tomlinson is a Director of Engineering at Databricks. He manages the Workspace team, which is responsible for Databricks’ flagship collaborative notebooks product and the services used to enable interactive data science and machine learning across environments.

--

Try Databricks for free. Get started today.

The post Engineering Interviews — A Hiring Manager’s Guide to Standing Out appeared first on Databricks.

How Databricks makes distributed sites successful


Credit: Pexels.com

Databricks recently announced signing a lease for new office space in Toronto. After Amsterdam, Toronto is our second geographically-distributed engineering site and a major expansion milestone for the company. This is a good time to reflect on how we work hard to make distributed offices successful and make sure employees do their best work and get growth opportunities around the world.

Like many high-growth technology startups, Databricks is a distributed company that develops products in multiple locations: San Francisco, Amsterdam and now Toronto. Databricks is a global company with customers on every continent and in every vertical (“from aviation to zygotes”). Since we build products for global customers, we believe that a global presence will help us be more successful. We gain access to top talent, and our customers benefit from the perspectives and skills of a diverse distributed workforce and company culture.

One of the most common questions I get from candidates is: “How do you choose what projects are worked on in Amsterdam?” What candidates are too polite to ask is this: “I’m worried about working on second-tier projects,” or “I don’t want to have late-night meetings with a product manager across the ocean” or even “I don’t want my career to suffer because I am not in the ‘right’ place – San Francisco.”

We believe that doing great work in every site is not simply a matter of choosing what work is done in Amsterdam, Toronto or San Francisco. It’s about setting up the distributed teams for success. Let’s see how Databricks crafts sites that are high-performance, take on high-impact projects and are empowered to grow.

Setting Up Distributed Teams for Success

The first ingredient in setting up a distributed team for success is to hire a seasoned site lead and assemble their extended leadership team. The site lead and their extended leadership team are seasoned engineering and product leaders who have scaled teams and products before. The site leadership acts as advocates for the team members’ needs. They are skilled at getting a lot of work done with a small team. For example, it may be a year or two before a distributed office  gets a human resources function, so the site lead must fill in.

In addition to a strong leadership team, Databricks transfers a small number of seasoned engineers from the home office in San Francisco to the site, both on a long-term and temporary basis. For example, Miles Yucht moved from San Francisco to Amsterdam to help stand up  a new team, which has doubled in size over the last year. Senior engineers who choose to relocate to Amsterdam commit to moving here for at least two years, and bring existing experience with our engineering processes and company values to the site. We also have a regular 1-2 week rotation of engineers between Amsterdam and San Francisco. These short trips help transfer knowledge between sites and encourage collaboration.

The next ingredient for doing great work at a site is to hire “three in a box” teams. Before we start a project in Amsterdam, we make sure that a strong tech lead, engineering manager and product manager are available to work on it. A tech lead provides technical guidance and leadership as we execute a project. A product manager gathers customer requirements, partners with the tech lead to design features and is the “glue” that makes sure the work we do brings value to customers. Finally, an engineering manager provides people leadership, scope, guidance and mentorship. The “three in a box” model has proved to be a crucial ingredient in making Amsterdam successful.

Once we have a “three in a box” team, we bring meaningful projects to the distributed office site and keep them there. Meaningful projects must have strong revenue and customer impact. They are projects that have interesting engineering challenges ahead of them as we scale. An example of this is Databricks Jobs, which is built and managed entirely out of Amsterdam. Once you bring a big project to a physical site, it is important to go “all in” and keep it there. It is tempting to want to move projects back to HQ – perhaps you hired a key leader there, or hiring is going faster at the home office than at a site. One of our key lessons was to give the distributed team the opportunity to hire their own talent and grow.

Databricks is heavily invested in making our European Development Center a hub of growth. This starts with helping our employees do their best work, whether in San Francisco or Amsterdam. If you are interested in helping solve some of the world’s hardest problems with data, view our open positions at our Careers page.

Bilal is a Director of Product Management at Databricks, Amsterdam. His team focuses on the performance of large-scale distributed systems, with a focus on storage, I/O, benchmarking, query planning and optimization.

--

Try Databricks for free. Get started today.

The post How Databricks makes distributed sites successful appeared first on Databricks.
