It’s Time to Re-evaluate Your Relationship With Hadoop

With companies forced to adapt to a remote, distributed workforce this past year, cloud adoption has accelerated at an unprecedented pace, growing an additional 14% and landing roughly 2%, or $13B, above pre-pandemic forecasts for 2020, with possibly more than $600B in on-prem-to-cloud migrations coming within the next few years. This shift to the cloud places growing importance on a new generation of data and analytics platforms to fuel innovation and deliver on enterprise digital transformation strategies. However, many organizations still struggle with the complexity, unscalable infrastructure and severe maintenance overhead of their legacy Hadoop environments, eventually sacrificing the value of their data and, in turn, risking their competitive edge. To tackle this challenge and unlock more (sometimes hidden) opportunities in their data, organizations are turning to open, simple and collaborative cloud-based data and analytics platforms like the Databricks Lakehouse Platform. In this blog, you’ll learn about the challenges driving organizations to explore modern cloud-based solutions and the role the lakehouse architecture plays in sparking the next wave of data-driven innovation.

Unfulfilled promises of Hadoop

Hadoop’s distributed file system (HDFS) was a game-changing technology when it launched and will remain an icon in the halls of data history. With its advent, organizations were no longer bound by the confines of relational databases, and it gave rise to modern big data storage and, eventually, cloud data lakes. Yet for all the glory and fanfare leading up to 2015, Hadoop struggled to support the evolving potential of all data types, especially at enterprise scale. Ultimately, as the data landscape and accompanying business needs evolved, Hadoop failed to keep delivering on its promises. As a result, enterprises have begun exploring cloud-based alternatives, and the rate of migration from Hadoop to the cloud is only increasing.

Teams migrate from Hadoop for a variety of reasons; it’s often a combination of “push” and “pull.” Limitations with existing Hadoop systems and high licensing and administration costs are pushing teams to explore alternatives. They’re also being pulled by the new possibilities enabled by modern cloud data architectures. While the architecture requirements vary by organization, we see several common factors that lead customers to realize it’s time to start saying goodbye. These include:

  • Wasted hardware capacity: Over-capacity is a given in on-premises implementations so that you can scale up to meet peak demand, but the result is that much of that capacity sits idle while continuing to add to operational and maintenance costs.
  • Scaling costs add up fast: Decoupling storage and compute is not possible in an on-premises Hadoop environment, so costs grow with data sets. Factor in the rapid digitization resulting from the COVID-19 pandemic and global data growth: research forecasts that the total amount of data created, captured, copied and consumed in the world will increase by 152.5% from 2020 to 2024, reaching 149 zettabytes. In a world of hyper data growth, runaway costs can balloon rapidly.
  • DevOps burden: Based on our customers’ experience, you can assume 4 to 8 full-time employees for every 100 nodes.
  • Increased power costs: Expect to pay as much as $800 per server annually based on consumption and cooling. That’s $80K per year for a 100 node Hadoop cluster!
  • New and replacement hardware costs:  This accounts for ~20% of TCO, which is equal to the Hadoop clusters’ administration costs.
  • Software version upgrades: These upgrades are often mandated to ensure the support contract is retained, and those projects take months at a time, deliver little new functionality and take up precious bandwidth of the data teams.

In addition to the full range of challenges above, there’s genuine concern about the long-term viability of Hadoop. In 2019, the world saw a massive unraveling within the Hadoop sphere. Google, whose seminal 2004 paper on MapReduce underpinned the creation of Apache Hadoop, stopped using MapReduce altogether, as tweeted by Google SVP of Technical Infrastructure, Urs Hölzle. Furthermore, there were some very high profile mergers and acquisitions in the world of Hadoop. Lastly, in 2020, a leading Hadoop provider shifted its product set away from being Hadoop-centric, as Hadoop is now thought of as “more of a philosophy than a technology”. This growing collection of concerns paired with the accelerated need to digitize has encouraged many companies to re-evaluate their relationship with Hadoop.

The shift toward lakehouse architecture

A lakehouse architecture is the ideal data architecture for data-driven organizations. It combines the best qualities of data warehouses and data lakes to provide a single high-performance solution for all data workloads, supporting use cases that range from streaming analytics and BI to data science and AI. Why do customers love the Databricks Lakehouse Platform?

  • It’s simple. Unify your data, analytics and AI on one platform.
  • It’s open.  Unify your data ecosystem with open standards and formats.
  • It’s collaborative. Unify your data teams to collaborate across the entire data and AI workflow.

A lakehouse architecture can deliver significant gains compared to legacy Hadoop environments, and those gains are what “pull” companies toward cloud adoption. This also includes customers who have tried to run Hadoop in the cloud but aren’t getting the results they expected. As R. Tyler Croy, Director of Engineering at Scribd, explains: “Databricks claimed an optimization of 30%–50% for most traditional Apache Spark™ workloads. Out of curiosity, I refactored my cost model to account for the price of Databricks and the potential Spark job optimizations. After tweaking the numbers, I discovered that at a 17% optimization rate, Databricks would reduce our Amazon Web Services (AWS) infrastructure cost so much that it would pay for the cost of the Databricks platform itself. After our initial evaluation, I was already sold on the features and developer velocity improvements Databricks would offer. When I ran the numbers in my model, I learned that I couldn’t afford not to adopt Databricks!”

Scribd isn’t alone; additional customers that have migrated from Hadoop to the Databricks Lakehouse Platform include:

  • H&M processes massive volumes of data from over 5,000 stores in over 70 markets, serving millions of customers every day. Their Hadoop-based architecture created significant challenges: it became resource-intensive and costly to scale, presented data security issues, struggled to support data science efforts across various siloed data sources and slowed time-to-market because of significant DevOps delays; it could take a whole year to go from ideation to production. With Databricks, H&M has improved operational efficiency, reducing operating costs by 70%, improving cross-team collaboration and boosting business impact with faster time-to-insight.
  • Viacom18 needs to process terabytes of daily viewer data to optimize programming. Their on-premises Hadoop data lake could not process 90 days of rolling data within SLAs, limiting their ability to deliver on business needs. With Databricks, they significantly lowered costs with faster querying times and less DevOps despite increasing data volumes. Viacom18 also improved team productivity by 26% with a fully managed platform that supports ETL, analytics and ML at scale.
  • Reckitt Benckiser Group (RB) struggled with the complexity of forecasting demand across 500,000 stores. They process over 2TB of data every day across 250 pipelines. The legacy Hadoop infrastructure proved to be complex, cumbersome, costly to scale and struggled with performance. With Databricks, RB realized 10x more capacity to support business volume, 98% data compression from 80TB to 2TB, reducing operational costs, and 2x faster data pipeline performance for 24×7 jobs.

Hadoop was never built to run in cloud environments. While cloud-based Hadoop services make incremental improvements over their on-premises counterparts, both still lag behind the lakehouse architecture: they deliver low performance and low productivity at high cost, and neither can address more sophisticated data use cases at scale.

Future-proofing your data, analytics and AI-driven growth

Cloud migration decisions are business decisions. They force companies to take a hard look at what their current systems actually deliver and to evaluate what they need to achieve for both near-term and long-term goals. As AI investment continues to gain momentum, data, analytics and technology leaders need to play a critical role in thinking beyond the existing Hadoop architecture, asking: “Will this get us where we need to go?”

With clarity on the goals come critical technical details, such as technology mapping, evaluating cloud resource utilization and cost-to-performance, and structuring a migration project that minimizes errors and risks. But most importantly, you need the data-driven conviction that it’s time to re-evaluate your relationship with Hadoop. Learn more about how migrating from Hadoop can accelerate business outcomes across your data use cases.


1. Source: Gartner Market Databook, Goldman Sachs Global Investment Research

--

Try Databricks for free. Get started today.



Creating Growth and Advancement Opportunities: Introducing the Women in Tech Mentorship Program

At Databricks, we recognize the importance of offering professional growth and advancement opportunities for all communities and are committed to fostering a work environment where every employee can contribute the best work of their careers.

This week, we’re wrapping up our celebration of Women’s History Month. In alignment with our commitment to employee growth, we wanted to highlight our newly launched global Women in Tech Mentorship pilot program. We strongly believe that mentorship is a key driver of career advancement, especially for women in technology and STEM fields such as computer science. With this growth mindset, as well as direct feedback from our employees, we’re proud to launch this program.

The objective of the mentorship program is to support the professional development of women across the company while building a community and support network within all functions and geographies. The initiative is a 6-month long program consisting of two main components: monthly group sessions with the entire cohort and monthly individual mentoring sessions with an assigned role model.

To assess the value of the program, participants were asked to complete a pre-program survey to capture their current understanding of the program topics. Over the course of the program, participants will complete a monthly survey and a final questionnaire once the program has come to a close.

Learn more about our employee perspectives on mentorship in this conversation between a current program pairing: mentee Samantha Menot (Senior Customer Growth Program Manager) and mentor Alexandra Bellogini (Manager, Technical Solutions): 

What is the most valuable aspect of participating in a mentorship program at Databricks? 

Sam: For me, I think it’s being able to build relationships with people we wouldn’t normally connect with back in the office. It gives us great networking opportunities. That’s definitely the number one reason why I decided to join the mentorship program.

Alex: I would agree with you Sam. The most valuable part of this program is having the opportunity to connect with others outside of our own team. I also love having the opportunity to mentor and being able to share my own experiences with you and watch you grow in your career.

Out of the topics we are covering during the program, such as personal branding and negotiating, which one are you most passionate about and why? 

Sam: I would definitely like to learn more about negotiating, as I have only really thought about negotiating a few times in my career. I am still learning how to go into a negotiation, what makes a strong negotiator and how to present and position yourself in a negotiation.

Alex: Negotiating is a big area I would like to learn more about also, but personal branding is an important one for me. It’s great to have the opportunity to really stop and think about who I am now and what I want to be, as personal branding is something that shifts throughout your career.

What do you hope to learn from each other during this experience?

Sam: Based on the first couple of sessions that we’ve had, I’m looking forward to learning more about personal branding. I’m trying to figure out what my brand is right now and what I would like it to be. It’s been helpful being able to have an honest conversation with [Alex], and to hear about the experiences you have had within your own career.

Alex: One of the biggest parts for me as a mentor is learning from you and the experiences you have had. I think it’s important to hear how you have dealt with certain situations in your career and to compare them to my own similar situations in order to understand what could have been done better. I’m also interested in hearing about your role, how you interact with others, the challenges you face and what I can learn from that.

Our goal is for the program to serve as a valuable tool in providing guidance, advocacy and navigation through important career milestones. As the Women in Tech mentorship program continues to grow, we look forward to hearing about the continued positive impact it will have for both mentors and mentees across the company.

Interested in joining Databricks?

Visit our careers page to explore our global opportunities and to learn more about how you can join our Databricks community.

--

Try Databricks for free. Get started today.


Databricks Notebook Dark Theme

This became the most-requested feature in Databricks’ history, and now it’s here: a dark theme for the Databricks notebook!  We’re excited for you to try it out. To turn it on, open a notebook and navigate to the View menu > Notebook Theme > Dark Theme.

Enjoy!

--

Try Databricks for free. Get started today.


Top Questions from Customers about Delta Lake

Last week, we hosted a virtual event highlighting Delta Lake, an open source storage layer that brings reliability, performance and security to your data lake. We had amazing engagement from the audience, with almost 200 thoughtful questions submitted! While we can’t answer all in this blog, we thought we should share answers to some of the most popular questions. For those who weren’t able to attend, feel free to take a look at the on-demand version here.

For those who aren’t familiar, Delta Lake is an open format, transactional storage layer that forms the foundation of a lakehouse. Delta Lake delivers reliability, security and performance on your data lake — for both streaming and batch operations — and eliminates data silos by providing a single home for structured, semi-structured and unstructured data. Ultimately, by making analytics, data science and machine learning simple and accessible across the enterprise, Delta Lake is the foundation and enabler of a cost-effective, highly-scalable lakehouse architecture.

Before diving into your questions, let’s start by establishing what the difference is between a data warehouse, a data lake and a lakehouse:

Data warehouses are data management systems with a structured format, designed to support business intelligence. While great for structured data, the world’s data continues to get more and more complex, and data warehouses are not suited for many of the use cases we have today, primarily involving a variety of data types. On top of that, data warehouses are expensive and lock users into a proprietary format.

Data lakes were developed in response to the challenges of data warehouses and have the ability to collect large amounts of data from many different sources in a variety of formats. While suitable for storing data and keeping costs low, data lakes lack some critical reliability and performance features like transactions, data quality enforcement and consistency/isolation, ultimately leading to severe limitations in their usability.

A lakehouse brings the best of data warehouses and data lakes together – all through an open and standardized system design. By adding a transaction layer on top of your data lake, you can enable critical capabilities like ACID transactions, schema enforcement/evolution and data versioning that provide reliability, performance and security to your data lake. A lakehouse is a scalable, low-cost option that unifies data, analytics and AI.

How does Delta Lake compare to other transactional storage layers?

While Delta Lake and other transaction storage layers aim to solve many of the same challenges, Delta Lake has broader use case coverage across the data ecosystem. In addition to bringing reliability, performance and security to data lakes, Delta Lake provides a unified framework for batch and streaming workloads, improving efficiency not only in data transformation pipelines, but also in downstream activities like BI, data science and ML. Using Delta Lake on Databricks provides, among other benefits, better performance with Delta Engine, better security and governance with fine-grained access controls, and broader ecosystem support with faster native connectors to the most popular BI tools. Finally, Delta Lake on Databricks has been battle-tested and used in production for over 3 years by thousands of customers. Every day, Delta Lake ingests at least 3PB of data.

How do I ingest data into Delta Lake?

Ingesting data into Delta Lake is very easy. You can automatically ingest new data files into Delta Lake as they land in your data lake (e.g. on S3 or ADLS) using Databricks Auto Loader or the COPY INTO command with SQL. You can also use Apache Spark™ to batch read your data, perform any transformations and save the result in Delta Lake format. Learn more about ingesting data into Delta Lake here.
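To make that concrete, here is a minimal sketch of both approaches as they might appear in a Databricks notebook (where a SparkSession named spark is already available). The bucket path, file format and target location are hypothetical placeholders; COPY INTO, mentioned above, is a SQL alternative to the same pattern.

# Option 1: incrementally ingest new files with Auto Loader as they land in cloud storage
raw = (spark.readStream
    .format("cloudFiles")                   # Auto Loader source
    .option("cloudFiles.format", "json")    # format of the incoming files
    .load("s3://my-bucket/raw/events/"))

(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/events")
    .start("/delta/events"))

# Option 2: batch read with Apache Spark, transform as needed, and save in Delta format
batch = spark.read.json("s3://my-bucket/raw/events/")
batch.write.format("delta").mode("append").save("/delta/events")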

Is Delta Lake on Databricks suitable for BI and reporting use cases?

Yes, Delta Lake works great with BI and reporting workloads. To address this data analysis use case in particular, we recently announced the release of SQL Analytics, which is currently in public preview. SQL Analytics is designed specifically for BI use cases and enables customers to perform analytics directly on their data lake. So if you have a lot of users who are going to be querying your Delta Lake tables, we suggest taking a look at SQL Analytics. You can either leverage the built-in query and dashboarding capabilities or connect your favorite BI tool with native, optimized connectors.

Apart from data engineering, does Delta Lake help with ML and training ML models? 

Yes, Delta Lake provides the ability to version your data sets, which is a really important feature when it comes to reproducibility. The ability to essentially pin your models to a specific version of your dataset is extremely valuable because it allows other members of your data team to step in and reproduce your model training to make sure they get the exact same results. It also allows you to ensure you are training on the exact same data, and the exact same version of the data specifically, that you trained your model on as well. Learn more about data science and ML on Databricks.
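For instance, pinning a training run to an exact snapshot of a feature table might look like the sketch below; the path and version number are placeholders.

# Read a Delta table as of a specific version so model training is reproducible
features_v5 = (spark.read.format("delta")
    .option("versionAsOf", 5)                  # pin to an exact table version
    .load("/delta/features"))

# ...or pin to a point in time instead
features_mar1 = (spark.read.format("delta")
    .option("timestampAsOf", "2021-03-01")
    .load("/delta/features"))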

How does Delta Lake help with compliance? And how does Delta Lake handle previous versions of data on delete for GDPR and CCPA support?

Delta Lake gives you the power to purge individual records from the underlying files in your data lake, which has tremendous implications for regulations like CCPA and GDPR.

When it comes to targeted deletion, in many cases, businesses will actually want those deletions to propagate down to their cloud object-store. By leveraging a managed table in Delta Lake, where the data is managed by Databricks, deletions are propagated down to your cloud object-store.
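A simplified sketch of that workflow, assuming a Delta table named customers with a customer_id column, might look like this:

# Remove an individual's records from the table
spark.sql("DELETE FROM customers WHERE customer_id = '12345'")

# Physically purge files no longer referenced by the table and older than
# the retention threshold (here, 7 days)
spark.sql("VACUUM customers RETAIN 168 HOURS")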

Does Delta Lake provide access controls for security and governance? 

Yes, with Delta Lake on Databricks, you can use access control lists (ACLs) to configure permission to access workspace objects (folders, notebooks, experiments and models), clusters, pools, jobs, and data schemas, tables, views, etc. Admins can manage access control lists, as can users who have been given delegated permissions to manage access control lists. Learn more about data governance best practices on Databricks.
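As a small illustration, granting a group read access to a table (with table access control enabled) could look like the sketch below; the table and group names are made up.

# Grant read-only access on a table to a group
spark.sql("GRANT SELECT ON TABLE sales TO `data-analysts`")

# Revoke it again if the group no longer needs access
spark.sql("REVOKE SELECT ON TABLE sales FROM `data-analysts`")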

How does Delta Lake help with streaming vs. batch operations?

With Delta Lake, you can run both batch and streaming operations on one simplified architecture that avoids complex, redundant systems and operational challenges. A table on Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill and interactive queries all work out of the box and directly integrate with Spark Structured Streaming.
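For example, the same Delta table can be read as a batch DataFrame, read as a streaming source and written to as a streaming sink, roughly as sketched below; the paths are placeholders.

# Batch read of a Delta table
batch_df = spark.read.format("delta").load("/delta/events")

# The same table as a continuous streaming source
stream_df = spark.readStream.format("delta").load("/delta/events")

# Incrementally write the stream out to a downstream Delta table (streaming sink)
(stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/events_silver")
    .start("/delta/events_silver"))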

This is just a small sample of the amazing engagement we received from all of the attendees during this event. If you were able to join live, thank you for taking the time to learn about Delta Lake and how it forms the foundation of a scalable, cost-efficient lakehouse. If you haven’t had a chance to check out the event, you can view it here.

--

Try Databricks for free. Get started today.


Introducing Delta Time Travel for Future Data Sets

We are thrilled to introduce enhanced time travel capabilities in Databricks Delta, the next-gen unified analytics engine built on top of Apache Spark, for all of our users. With this new feature, Delta can automatically extrapolate big datasets stored in your data lake, enabling access to any future version of your data today. This temporal data management feature simplifies your data pipeline by making it easy to audit, forecast schema changes and run experiments before that data even exists or is suspected of existing. Your organization can finally standardize analytics in the cloud on the dataset that will arrive in the future and not rely on the existing datasets of today.

Common Challenges with Current Data

  • Stuck in the present: Today’s data becomes much more valuable and actionable when it’s tomorrow’s data. Because guess what? Now it’s yesterday’s data, which is great for reporting but isn’t going to win you any innovation awards.
  • Audit data changes: Not knowing which data might arrive in the future can lead to data compliance and debugging challenges. Understanding future data changes can significantly improve data governance and data pipelines, preventing future data mismatches.
  • Forward-looking experiments & reports: The moment scientists run experiments to produce models, their source data is already outdated. Often, they are caught off guard by late-arriving data and struggle to produce their experiments for tomorrow’s results.
  • Roll forward: Data pipelines can sometimes write bad data for downstream consumers due to issues ranging from infrastructure instabilities to messy data to bugs in the pipeline. Roll forwards allow Data Engineers to simplify data pipelines by detecting bad data that will come from downstream systems.

Introducing Future Time Travel in Delta

To enable our users to make data of tomorrow accessible already today, we enhanced the time travel capabilities of existing Delta Lakes to support future time travel. The way this works is that we implemented a Lambda Vacuum solution as an exact solution to the Delta Lake equation in which a data gravity term is the only term in the data-momentum tensor. This can be interpreted as a kind of classical approximation to an alpha vacuum data point.

This is a CTC, or Closed Timelike Curve, implementation of the Gödel spacetime. But let‘s see how it works exactly.

Delta Lake Equation by Albert Einstein

What the Delta Lake Equation actually shows is that if you have a qubit of information and can create enough data gravity around it, and you accelerate it using GPU-vectorized operations fast enough, you can extrapolate the information from that data gravity point to an alpha data point in the future.

This works by creating a data momentum tensor with high data density on SSDs, which holds extrapolated information of the future.

Lambda Vacuum on a Delta table (Delta Lake on Databricks)

LAMBDA VACUUM recursively vacuums directories associated with the Delta table, adding data files that will be in a future state of the table’s transaction log and that are older than an extrapolation threshold. Files are added according to the time they will be logically added to Delta’s transaction log plus the extrapolation hours, not their modification timestamps on the storage system. The default threshold is 7 days. Databricks does not automatically trigger LAMBDA VACUUM operations on Delta tables. See Add files for future reference by a Delta table.

If you run LAMBDA VACUUM on a Delta table, you gain the ability to time travel forward to versions beyond the specified data extrapolation period.

LAMBDA VACUUM table_identifier [EXTRAPOLATE num HOURS] [RUN TOMORROW]

  • table_identifier
    [database_name.] table_name:
    A table name, optionally qualified with a database name.
    delta.`<path-to-table>`:
    The location of an existing Delta table.
  • EXTRAPOLATE num HOURS
    The extrapolation threshold.
  • RUN TOMORROW
    Dry run the next day to return a list of files to be added.

Conclusion

As an implementation of the Delta Lake Equation, the Lambda Vacuum function of modern Delta Lakes makes data of tomorrow already accessible today by extrapolating the existing data points along an alpha data point. This is an exact CTC solution as an implementation of the Gödel spacetime. Stay tuned for more updates!

--

Try Databricks for free. Get started today.


Data Democratization: A Key to Building a Healthy Data Culture

Building a thriving data culture is a strategic priority for many organizations, but only 24% of enterprises have managed to forge a data culture.

What is a thriving data culture anyway? In its purest form, it’s when the entire organization – from the C-suite to the front-line workers – are making informed business decisions every day using readily available and relevant data. It’s when data is given greater weight than experience, intuition or tenure.  However, building a successful data culture that creates new business value and competitive advantages is no simple task, and there is no easy button that makes it happen overnight. It requires a few key ingredients (data access, governance, data literacy, data-driven decision-makers and data architecture) that enable data to be easily manipulated and consumed.

In a world where data is growing exponentially, and advanced analytics and machine learning (ML) are becoming standards, why do so many organizations still struggle at creating a successful data culture? In this blog, you’ll get a glimpse into one data leader’s approach to nurturing a healthy data culture and key considerations for democratizing data.

The role of disruptive leadership

According to Stuart Hughes, Chief Information and Digital Officer of Rolls-Royce, one of the primary reasons organizations struggle at building thriving data cultures is the difficulty of bridging data silos and democratizing data throughout the entire enterprise. When silos around data are removed, every person in the organization has access to all the relevant data and tools to easily interpret it and make data-driven decisions. In fact, a MicroStrategy report reveals that only 3% of surveyed employees required to use data for business decisions can do so in seconds. Shockingly, 60% needed hours or days – an SLA that can clearly stifle innovation when compounded over long periods of time.

In our Champions Data + AI podcast, Stuart Hughes discusses how data democratization starts with disruptive leadership. When data leaders simplify the data architecture and empower data teams with the right tools and creative freedom to do their best work, they unleash the true power of data.

In addition, organizations need an approach to data that factors in:

  1. Managed access to all relevant data: Unlike the movie Frozen, you don’t want to open up the gates 100% to all the data for every user. Instead, you need to rationalize data access by role. It’s about effectively using approaches like role-based access control (RBAC) to ensure users within a particular security group have access to the relevant data to do their jobs. In fact, just as employees are provided their laptops/tablets and badges during new hire onboarding, the necessary data access to do their job should be part of this process. Done well, this reduces the risk that artificially small data sets undermine the effectiveness of exploration, experimentation, model development and model training, which results in inaccurate models once they are deployed into production and used with the full data sets. Poor data access management ultimately creates massive amounts of rework when access to the full data set is finally granted.
  2. Increased data team productivity: Get your data teams focused on more analytics by increasing automation across data engineering, creation and sharing of data assets to reduce duplication and create efficiencies (i.e. data pipelines, reports/dashboards, code and ML models). Also, ensure your data architecture has the needed horsepower and set-up to offer the best performance, enabling you to increase the speed of experimentation.
  3. It’s not just the data; tools matter: The democratization of data also requires that users have a simple and intuitive way to understand the data, collaborate on the data and ML models and extract insights to make informed decisions. At the core of the toolset is the data and AI platform. Data and analytics leaders need to assess whether their platform is simple and easy to use. Is it open so data teams can easily work with existing tools and avoid proprietary formats? Lastly, is it collaborative so data engineers, analysts, and data scientists can work together and more efficiently?

Fostering an environment that lives and breathes data isn’t a feat achieved overnight; it requires effective leadership that leads by example: leadership that creates an environment for grassroots efforts to emerge, where experimentation is encouraged and data-driven decisions are celebrated. None of this is possible unless leadership enables every decision-making body in the organization with all the relevant data to make informed, data-driven decisions. Check out the full interview with Stuart Hughes to learn more about his disruptive approach to democratizing data.

View the full interview with Stuart Hughes, Chief Information & Digital Officer Rolls-Royce.

--

Try Databricks for free. Get started today.


Data + AI Summit Is Back

Data + AI Summit, the global event for the data community, returns May 24-28. We are thrilled to announce that registration for this free virtual event is now open!

The future is open

Data and AI are rapidly opening up new possibilities and solving stubborn problems. Unifying data opens up collaboration and solutions that were once unthinkable. Pushing the limits of AI and ML has opened doors to innovation that were, even just recently, deemed impossible.

That’s why we’re celebrating this spirit of openness — open source, open standards, open minded — at this year’s summit.

For many of us, open takes on an even greater meaning: after more than a year of COVID shutdowns, the world is starting to reopen. That’s why one of our goals for Data + AI Summit is to virtually bring together top data professionals from every corner of the globe to explore new ideas and technologies.

What will you experience?

With more than 100,000 data and AI experts, leaders and visionaries expected at Data + AI Summit, our team is hard at work putting together five days of sessions, keynotes, training and demos.

We’ve curated a broad range of speakers – within tech and beyond – to share their stories and insights alongside industry leaders. Expect to hear from change-makers like:

  • Malala Yousafzai, Nobel Peace Prize winner and education advocate
  • DJ Patil, Former Chief Data Scientist of the US Office of Science and Technology Policy
  • Matei Zaharia, Databricks Co-founder & Chief Technologist, and original creator of Apache Spark™ and MLflow

You’ll also hear from data leaders at companies like H&M, Apple, Northwestern Mutual and Nielsen who will share their stories about unifying data and AI to power their data teams and solve business challenges.

You can choose from over 200 sessions presented by leading experts in industry, research and academia. These highly-technical sessions will keep you up-to-date with the latest advances in open-source technologies like Apache Spark™, Delta Lake, MLflow and Koalas. Plus, you will walk away from Summit understanding how your organization will benefit from uniting data and AI on a common platform with Databricks Lakehouse.

And don’t forget the free and paid hands-on training for your entire data team; last year, these sessions filled up quickly. From data engineers to data scientists to business leaders, everyone will benefit from these deep-dive, specialized workshops. Use the code DAISBLOG25 to get 25% off our pre-conference training sessions!

Join the global data community

Begin planning your Data + AI Summit experience by registering now and checking out the full event agenda here. In the coming weeks, keep an eye out for information on keynote and session speakers, hands-on training, panels and more!

--

Try Databricks for free. Get started today.


Fine-Grained Time Series Forecasting at Scale With Facebook Prophet and Apache Spark: Updated for Spark 3

Advances in time series forecasting are enabling retailers to generate more reliable demand forecasts. The challenge now is to produce these forecasts in a timely manner and at a level of granularity that allows the business to make precise adjustments to product inventories. Leveraging Apache Spark™ and Facebook Prophet, more and more enterprises facing these challenges are finding they can overcome the scalability and accuracy limits of past solutions.

Go directly to the forecasting notebook referenced in this post
To see this solution for Spark 2.0, please read the original post here

In this post, we’ll discuss the importance of time series forecasting, visualize some sample time series data, and then build a simple model to show the use of Facebook Prophet. Once you’re comfortable building a single model, we’ll combine Facebook Prophet with the magic of Spark to show you how to train hundreds of models at once, allowing you to create precise forecasts for each individual product-store combination at a level of granularity rarely achieved until now.

Accurate and timely forecasting is now more important than ever

Improving the speed and accuracy of time series analyses in order to better forecast demand for products and services is critical to retailers’ success. If too much product is placed in a store, shelf and storeroom space can be strained, products can expire and retailers may find their financial resources tied up in inventory, leaving them unable to take advantage of new opportunities generated by manufacturers or shifts in consumer patterns. If too little product is placed in a store, customers may not be able to purchase the products they need. Not only do these forecast errors result in an immediate loss of revenue to the retailer, but over time, consumer frustration may drive customers towards competitors.

New expectations require more precise time series models and forecasting methods

For some time, enterprise resource planning (ERP) systems and third-party solutions have provided retailers with demand forecasting capabilities based on simple time series models. But with advances in technology and increased pressure in the sector, many retailers are looking to move beyond the linear models and more traditional algorithms historically available to them.

New capabilities, such as those provided by Facebook’s Prophet, are emerging from the data science community, and companies are seeking the flexibility to apply these machine learning (ML) models to their time series forecasting needs.

This movement away from traditional forecasting solutions requires retailers and the like to develop in-house expertise not only in the complexities of demand forecasting but also in the efficient distribution of the work required to generate hundreds of thousands or even millions of ML models in a timely manner. Luckily, we can use Spark to distribute the training of these models, making it possible to predict both demand for products and services and the unique demand for each product in each location.

Visualizing demand seasonality in time series data

To demonstrate the use of Facebook Prophet to generate fine-grained demand forecasts for individual stores and products, we will use a publicly available dataset from Kaggle. It consists of 5 years of daily sales data for 50 individual items across 10 different stores.

To get started, let’s look at the overall yearly sales trend for all products and stores. As you can see, total product sales are increasing year over year with no clear sign of convergence around a plateau.

Sample Kaggle retail data used to demonstrate the combined fine-grained demand forecasting capabilities of Facebook Prophet and Apache Spark

Next, by viewing the same data on a monthly basis, it’s clear that the year-over-year upward trend doesn’t progress steadily each month. Instead, there is a clear seasonal pattern of peaks in the summer months and troughs in the winter months. Using the built-in data visualization feature of Databricks Collaborative Notebooks, we can see the value of our data during each month by mousing over the chart.

At the weekday level, sales peak on Sundays (weekday 0), followed by a hard drop on Mondays (weekday 1), then steadily recover throughout the rest of the week.

Demonstrating the difficulty of accounting for seasonal patterns with traditional time series forecasting methods

Getting started with a simple time series forecasting model on Facebook Prophet

As illustrated above, our data shows a clear year-over-year upward trend in sales, along with both annual and weekly seasonal patterns. It’s these overlapping patterns in the data that Facebook Prophet is designed to address.

Facebook Prophet follows the scikit-learn API, so it should be easy to pick up for anyone with experience with sklearn. We need to pass in a two-column pandas DataFrame as input: the first column is the date, and the second is the value to predict (in our case, sales). Once our data is in the proper format, building a model is easy:

import pandas as pd
from fbprophet import Prophet
 
# instantiate the model and set parameters
model = Prophet(
    interval_width=0.95,
    growth='linear',
    daily_seasonality=False,
    weekly_seasonality=True,
    yearly_seasonality=True,
    seasonality_mode='multiplicative'
)
 
# fit the model to historical data
model.fit(history_pd)

Now that we have fit our model to the data, let’s use it to build a 90 day forecast:

# define a dataset including both historical dates & 90-days beyond the last available date, using Prophet's built-in make_future_dataframe method
future_pd = model.make_future_dataframe(
    periods=90, 
    freq='d', 
    include_history=True
)
 
# predict over the dataset
forecast_pd = model.predict(future_pd)

That’s it! We can now visualize how our actual and predicted data line up as well as a forecast for the future using the Facebook Prophet model’s built-in .plot method. As you can see, the weekly and seasonal demand patterns shown earlier are reflected in the forecasted results.

predict_fig = model.plot(forecast_pd, xlabel='date', ylabel='sales')
display(predict_fig)

Comparing the actual demand to the time-series forecast generated by Facebook Prophet leveraging Apache Spark

This visualization is a bit busy. Bartosz Mikulski provides an excellent breakdown of it that is well worth checking out. In a nutshell, the black dots represent our actuals with the darker blue line representing our predictions and the lighter blue band representing our (95%) uncertainty interval.

Training hundreds of time series forecasting models in parallel with Facebook Prophet and Spark

Now that we’ve demonstrated how to build a single model, we can use the power of Spark to multiply our efforts. Our goal is to generate not one forecast for the entire dataset, but hundreds of models and forecasts for each product-store combination, something that would be incredibly time-consuming to perform as a sequential operation.

Building models in this way could allow a grocery store chain, for example, to create a precise forecast for the amount of milk they should order for their Sandusky store that differs from the amount needed in their Cleveland store, based upon the differing demand at those locations.

How to use Spark DataFrames to distribute the processing of time series data

Data scientists frequently tackle the challenge of training large numbers of models using a distributed data processing engine such as Spark. By leveraging a Spark cluster, individual worker nodes in the cluster can train a subset of models in parallel with other worker nodes, greatly reducing the overall time required to train the entire collection of time series models.

Of course, training models on a cluster of worker nodes (computers) requires more cloud infrastructure, and this comes at a price. But with the easy availability of on-demand cloud resources, companies can quickly provision the resources they need, train their models, and release those resources just as quickly, allowing them to achieve massive scalability without long-term commitments to physical assets.

The key mechanism for achieving distributed data processing in Spark is the DataFrame. By loading the data into a Spark DataFrame, the data is distributed across the workers in the cluster. This allows these workers to process subsets of the data in a parallel manner, reducing the overall amount of time required to perform our work.

Of course, each worker needs to have access to the subset of data it requires to do its work. By grouping the data on key values, in this case on combinations of store and item, we bring together all the time series data for those key values onto a specific worker node.

store_item_history
    .groupBy('store', 'item')
    . . .

We share the groupBy code here to underscore how it enables us to train many models in parallel efficiently, although it will not actually come into play until we set up and apply a custom pandas function to our data in the next section.

Leveraging the power of pandas functions

With our time series data properly grouped by store and item, we now need to train a single model for each group. To accomplish this, we can use a pandas function, which allows us to apply a custom function to each group of data in our DataFrame.

This function will not only train a model for each group, but also generate a result set representing the predictions from that model. But while the function will train and predict on each group in the DataFrame independent of the others, the results returned from each group will be conveniently collected into a single resulting DataFrame. This will allow us to generate store-item level forecasts but present our results to analysts and managers as a single output dataset.

As you can see in the abbreviated code below, building our function is relatively straightforward. Unlike in previous versions of Spark, we can declare our function in a fairly streamlined manner, using Python type hints to specify the type of pandas object we expect to receive and return.

Within the function definition, we instantiate our model, configure it and fit it to the data it has received. The model makes a prediction, and that data is returned as the output of the function.

def forecast_store_item(history_pd: pd.DataFrame) -> pd.DataFrame: 
    
    # instantiate the model, configure the parameters
    model = Prophet(
        interval_width=0.95,
        growth='linear',
        daily_seasonality=False,
        weekly_seasonality=True,
        yearly_seasonality=True,
        seasonality_mode='multiplicative'
    )
    
    # fit the model
    model.fit(history_pd)
    
    # configure predictions
    future_pd = model.make_future_dataframe(
        periods=90, 
        freq='d',
        include_history=True
    )
    
    # make predictions
    results_pd = model.predict(future_pd)
    
    # . . .
    
    # return predictions
    return results_pd

Now, to bring it all together, we use the groupBy command we discussed earlier to ensure our dataset is properly partitioned into groups representing specific store and item combinations. We then simply apply our function to the DataFrame with applyInPandas, allowing it to fit a model and make predictions on each grouping of data.

The dataset returned by the application of the function to each group is updated to reflect the date on which we generated our predictions. This will help us keep track of data generated during different model runs as we eventually take our functionality into production.

from pyspark.sql.functions import current_date
 
results = (
    store_item_history
        .groupBy('store', 'item')
        .applyInPandas(forecast_store_item, schema=result_schema)
        .withColumn('training_date', current_date())
)

Next steps

We have now constructed a forecast for each store-item combination. Using a SQL query, analysts can view the tailored forecasts for each product. In the chart below, we’ve plotted the projected demand for product #1 across 10 stores. As you can see, the demand forecasts vary from store to store, but the general pattern is consistent across all of the stores, as we would expect.

Sample time series visualization generated via a SQL query

As new sales data arrives, we can efficiently generate new forecasts and append these to our existing table structures, allowing analysts to update the business’s expectations as conditions evolve.
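A rough sketch of how those results might be persisted and then queried is shown below. The table name forecasts is a placeholder, and it assumes the elided portion of the function above adds the store and item keys back onto Prophet’s output columns (ds, yhat, yhat_lower, yhat_upper).

# Append the latest run's forecasts to a Delta table for analysts to query
(results
    .write
    .format("delta")
    .mode("append")
    .saveAsTable("forecasts"))

# Example: pull the projected demand for item 1 across all stores
item1_forecasts = spark.sql("""
    SELECT store, ds, yhat, yhat_lower, yhat_upper
    FROM forecasts
    WHERE item = 1
    ORDER BY store, ds
""")
display(item1_forecasts)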

To generate these forecasts in your Databricks environment, please import the following notebook:

To access the prior version of this notebook, built for Spark 2.0, please click this link.

--

Try Databricks for free. Get started today.



Benchmark: Koalas (PySpark) and Dask

Koalas is a data science library that implements the pandas APIs on top of Apache Spark so data scientists can use their favorite APIs on datasets of all sizes. This blog post compares the performance of Dask’s implementation of the pandas API and Koalas on PySpark. Using a repeatable benchmark, we have found that Koalas is 4x faster than Dask on a single node, 8x on a cluster and, in some cases, up to 25x.

First, we walk through the benchmarking methodology, environment and results of our test. Then, we discuss why Koalas/Spark is significantly faster than Dask by diving into Spark’s optimized SQL engine, which uses sophisticated techniques such as code generation and query optimizations.

Methodology

The benchmark was performed against the 2009 – 2013 Yellow Taxi Trip Records (157 GB) from NYC Taxi and Limousine Commission (TLC) Trip Record Data. We identified common operations from our pandas workloads such as basic statistical calculations, joins, filtering and grouping on this dataset.

Local and distributed execution were also taken into account in order to cover both single node cases and cluster computing cases comprehensively. The operations were measured with/without filter operations and caching to consider various real-world workloads.

Therefore, we performed the benchmark in the dimensions below:

  • Standard operations (local & distributed execution)
  • Operations with filtering (local & distributed execution)
  • Operations with filtering and caching (local & distributed execution)

Dataset

The yellow taxi trip record dataset contains CSV files, which consist of 17 columns with numeric and text types. The fields include pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types and driver-reported passenger counts. The CSV files were downloaded into Databricks File System (DBFS), and then were converted into Parquet files via Koalas for better efficiency.
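The conversion itself is a one-liner in Koalas; a rough sketch, with hypothetical DBFS paths, looks like this:

import databricks.koalas as ks

# Read the raw CSV trip records and rewrite them as Parquet for faster, columnar access
trips = ks.read_csv("dbfs:/tmp/nyctaxi/csv/yellow_tripdata_*.csv")
trips.to_parquet("dbfs:/tmp/nyctaxi/parquet/")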

Operations

We analyzed multiple existing pandas workloads and identified several patterns of common operations. Below is some pseudocode of the derived operations.

def operations(df):
    # complex arithmetic
    np.sin ... np.cos ... np.arctan2
    # count
    len(df)
    # count index
    len(df.index)
    # groupby statistics
    df.groupby(by='series_c').agg(... ['mean', 'std'] ...)
    # join
    merge(df1, df2)
    # join count
    len(merge(df1, df2))
    # mean
    df.series_a.mean()
    # mean of complex arithmetic
    (np.sin ... np.cos ... np.arctan2).mean()
    # mean of series addition
    (df.series_a + df.series_b).mean()
    # mean of series multiplication
    (df.series_a * df.series_b).mean()
    # read file
    read_parquet(...)
    # series addition
    df.series_a + df.series_b
    # series multiplication
    df.series_a * df.series_b
    # standard derivation
    df.series_a.std()
    # value counts
    df.series_a.value_counts()

The operations were executed with/without filtering and caching respectively, to consider the impact of lazy evaluation, caching and related optimizations in both systems, as shown below.

  • Standard operations
    operations(df)
    
  • Operations with filtering
    # Filtering is computed together with the operations lazily.
    operations(df[(df.tip_amt >= 1) & (df.tip_amt <= 5)])
    
  • The filter operation finds the records that received a tip between $1 and $5, and it filters down to 36% of the original data.
  • Operations with filtering and caching
    # Koalas
    df = df[(df.tip_amt >= 1) & (df.tip_amt <= 5)]
    df.cache()
    len(df) # Make sure data is cached.
    operations(df)
    
    # Dask
    df = df[(df.tip_amt >= 1) & (df.tip_amt <= 5)]
    df = dask_client.persist(df)
    wait(df) # Make sure data is cached.
    operations(df)
    
  • When caching was enabled, the data was fully cached before measuring the operations.

For the entire code used in this benchmark, please refer to the notebooks included on the bottom of this blog.

Environment

The benchmark was performed on both a single node for local execution, as well as a cluster with 3 worker nodes for distributed execution. To set the environment up easily, we used Databricks Runtime 7.6 (Apache Spark 3.0.1) and Databricks notebooks.

System environment

  • Operating System: Ubuntu 18.04.5 LTS
  • Java: Zulu 8.50.0.51-CA-linux64 (build 1.8.0_275-b01)
  • Scala: 2.12.10
  • Python: 3.7.5

Python libraries

  • pandas: 1.1.5
  • PyArrow: 1.0.1
  • NumPy: 1.19.5
  • Koalas: 1.7.0
  • Dask: 2021.03.0

Local execution

For local execution, we used a single i3.16xlarge VM from AWS that has 488 GB memory and 64 cores with 25 Gigabit Ethernet.

Machine specification for local execution

Distributed execution

For distributed execution, 3 worker nodes were used with a i3.4xlarge VM that has 122 GB memory and 16 cores with (up to) 10 Gigabit Ethernet. This cluster has the same total memory as the single-node configuration.

Machine specification for distributed execution

Results

The benchmark results below include overviews with geometric means to explain the general performance differences between Koalas and Dask, and each bar shows the ratio of the elapsed times between Dask and Koalas (Dask / Koalas). Because the Koalas APIs are written on top of PySpark, the results of this benchmark would apply similarly to PySpark.

Standard operations

In local execution, Koalas was on average 1.2x faster than Dask:

  • In Koalas, join with count (join count) was 17.6x faster.
  • In Dask, computing the standard deviation was 3.7x faster.

In distributed execution, Koalas was on average 2.1x faster than Dask:

  • In Koalas, the count index operation was 25x faster.
  • In Dask, the mean of complex arithmetic operations was 1.8x faster.

Operations with filtering

In local execution, Koalas was on average 6.4x faster than Dask in all cases:

  • In Koalas, the count operation was 11.1x faster.
  • Complex arithmetic operations had the smallest gap in which Koalas was 2.7x faster.

In distributed execution, Koalas was on average 9.2x faster than Dask in all cases:

  • In Koalas, the count index operation was 16.7x faster.
  • Complex arithmetic operations had the smallest gap in which Koalas was 3.5x faster.

Operations with filtering and caching

In local execution, Koalas was on average 1.4x faster than Dask:

  • In Koalas, join with count (join count) was 5.9x faster.
  • In Dask, Series.value_counts(value counts) was 3.6x faster.

In distributed execution, Koalas was on average 5.2x faster than Dask in all cases:

  • In Koalas, the count index operation was 28.6x faster.
  • Complex arithmetic operations had the smallest gap in which Koalas was 1.7x faster.

Analysis

Koalas (PySpark) was considerably faster than Dask in most cases. The reason is fairly straightforward: both Koalas and PySpark are based on Spark, one of the fastest distributed computing engines. Spark has a full optimizing SQL engine (Spark SQL) with highly advanced query plan optimization and code generation. As a rough comparison, Spark SQL has nearly a million lines of code and 1600+ contributors over 11 years, whereas Dask’s code base is around 10% of Spark’s, with 400+ contributors over roughly 6 years.

In order to identify which of Spark SQL’s many optimization techniques contributed most to Koalas’ performance, we analyzed the operations executed in a distributed manner with filtering, where Koalas outperformed Dask the most:

  • Statistical calculations
  • Joins

We dug into the execution and plan optimization aspects for these operations and were able to identify the two most significant factors: code generation and query plan optimization in Spark SQL.

Code generation

One of the most important execution optimizations in Spark SQL is code generation. The Spark engine generates optimized bytecodes for each query at runtime, which greatly improves performance. This optimization considerably affected statistical calculations and joins in the benchmark for Koalas by avoiding virtual function dispatches, etc. Please read the code generation introduction blog post to learn more.

For example, with code generation disabled in a Databricks production environment, the same benchmark code takes around 8.37 seconds for the mean calculation and roughly 27.5 seconds for the join count. After enabling code generation (on by default), calculating the mean takes around 1.26 seconds and the join count takes 2.27 seconds, a speedup of roughly 6.6x and 12x, respectively.

Performance difference by code generation

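For reference, whole-stage code generation is controlled by a Spark SQL configuration, which is roughly how a comparison like the one above can be reproduced; this is a sketch only, and the setting is on by default.

# Disable whole-stage code generation to measure its impact...
spark.conf.set("spark.sql.codegen.wholeStage", "false")
# ...run the mean / join count operations here, then restore the default
spark.conf.set("spark.sql.codegen.wholeStage", "true")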

Query plan optimization

Spark SQL has a sophisticated query plan optimizer: Catalyst, which dynamically optimizes the query plan throughout execution (Adaptive query execution). In Koalas’ statistics calculations and join with filtering, the Catalyst optimizer also significantly improved the performance.

When Koalas computes the mean without leveraging the Catalyst query optimization, the raw execution plan in Spark SQL is roughly as follows. It uses brute-force to read all columns, and then performs projection multiple times with the filter in the middle before computing the mean.

Aggregate [avg(fare_amt)]
+- Project [fare_amt]
   +- Project [vendor_name, fare_amt, tip_amt, ...]
      +- Filter tip_amt >= 1 AND tip_amt <= 5
         +- Project [vendor_name, fare_amt, tip_amt, ...]
            +- Relation [vendor_name, fare_amt, tip_amt, ...]

This is considerably inefficient because it reads more data, spends more time on I/O and performs the same projections multiple times.

On the other hand, the plan below is optimized to perform efficiently by the Catalyst optimizer:

Aggregate [avg(fare_amt)]
+- Project [fare_amt]
   +- Relation [fare_amt, tip_amt], tip_amt >= 1 AND tip_amt <= 5

The plan becomes significantly simpler. Now it only reads the columns needed for the computation (column pruning) and filters data at the source level, which saves memory (filter pushdown).
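As an illustration of how such a plan can be inspected, the sketch below runs the equivalent filtered mean in plain PySpark and prints the optimized plan with explain(). The table and column names are assumptions borrowed from the plans above.

```python
from pyspark.sql import functions as F

# Hypothetical source table with the columns shown in the plans above
df = spark.read.table("nyc_taxi_trips")

(df.filter((F.col("tip_amt") >= 1) & (F.col("tip_amt") <= 5))
   .select("fare_amt")
   .agg(F.avg("fare_amt"))
   .explain(True))   # prints the parsed, analyzed, optimized and physical plans
```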

As for the joining operation with counting (join count), Koalas, via PySpark, creates a raw execution plan of Spark SQL as below:

Aggregate [count()]
+- Project [tip_amt, ...]
   +- Join
      :- Project [tip_amt, ...]
      :  +- Filter tip_amt >= 1 AND tip_amt <= 5
      :     +- Project [tip_amt, ...]
      :        +- Relation[tip_amt, ...]
      +- Project [...]
         +- Relation [...]

It has the same problem as shown in the mean calculation. It unnecessarily reads and projects data multiple times. One difference is that the data will be shuffled and exchanged to perform join operations, which typically causes considerable network I/O and negative performance impact. The Catalyst optimizer is able to remove the shuffle when data on one side of the join is much smaller, resulting in the BroadcastHashJoin you see below:

Aggregate [count()]
+- Project
   +- BroadcastHashJoin
      :- Project []
      :  +- Filter tip_amt >= 1 AND tip_amt <= 5
      :     +- Relation[tip_amt]
      +- BroadcastExchange
         +- Project []
            +- Relation[]

It applies not only column pruning and filter pushdown but also removes the shuffle step by broadcasting the smaller DataFrame. Internally, it sends the smaller DataFrame to each executor, and performs joins without exchanging data. This removes an unnecessary shuffle and greatly improves the performance.
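For reference, PySpark also lets you influence this strategy explicitly. Below is a minimal sketch, assuming two hypothetical DataFrames where small_df fits comfortably in executor memory; it is not part of the benchmark code.

```python
from pyspark.sql import functions as F

# Spark broadcasts the smaller side automatically when it is below this threshold (default 10MB)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

# Or give the optimizer an explicit hint; large_df and small_df are hypothetical DataFrames
joined = large_df.join(F.broadcast(small_df), on="vendor_name", how="inner")
joined.count()
```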

Conclusion

The results of the benchmark demonstrated that Koalas (PySpark) significantly outperforms Dask in the majority of use cases, with the biggest contributing factor being Spark SQL, an execution engine with many advanced optimization techniques.

Koalas’ local and distributed executions of the identified operations were much faster than Dask’s as shown below:

  • Local execution: 2.1x (geometric mean) and 4x (simple average)
  • Distributed execution: 4.6x (geometric mean) and 7.9x (simple average)

Secondly, caching impacted the performance of both Koalas and Dask, and it reduced their elapsed times dramatically.

Lastly, the biggest performance gaps were in the distributed execution of statistical calculations and joins with filtering, where Koalas (PySpark) was 9.2x faster across all identified cases (geometric mean).

We have included the full self-contained notebooks, the dataset and operations, and all settings and benchmark codes for transparency. Please refer to the notebooks below:

--

Try Databricks for free. Get started today.

The post Benchmark: Koalas (PySpark) and Dask appeared first on Databricks.

Efficiently Building ML Models for Predictive Maintenance in the Oil and Gas Industry With Databricks


Guest authored post by Halliburton’s Varun Tyagi, Data Scientist, and Daili Zhang, Principal Data Scientist, as part of the Databricks Guest Blog Program

Halliburton is an oil field services company with a 100-year-long proven track record of best-in-class oilfield offerings. With operations in over 70 countries, Halliburton provides services related to exploration, development and production of oil and natural gas. Success and increased productivity for our company and our customers is heavily dependent on the data we collect from the subsurface and during all phases of the production life cycle.

At Halliburton, petabytes of historical data about drilling and maintenance activities have been collected for decades. A huge amount of time has been spent on data collection, cleaning, transformation and feature engineering in preparation for data modelling and machine learning (ML) tasks. We use ML to automate operational processes, conduct predictive maintenance and gain insights from the data we have collected to make informed decisions.

One of our biggest data-driven use cases is for predictive maintenance, which utilizes time-series data from hundreds of sensors on equipment at each rig site, as well as maintenance records and operational information from various databases. Using Apache Spark on Azure Databricks, we can easily ingest, merge and transform terabytes of data at scale, helping us gain meaningful insights regarding possible causes of equipment failures and their relations to operational and run-time parameters.

In this blog, we discuss tools, processes and techniques that we have implemented to get more value from the extensive amounts of data using Databricks.

Analytics life cycle

Halliburton follows the basic analytics development life cycle with some variations to specific use cases. The whole process has been standardized and automated as much as possible, either through event-triggered pipeline runs or scheduled jobs. Yet, the whole workflow still provides the flexibility for some unusual scenarios.


Analytics Lifecycle


With thousands of rigs for different customers globally, each rig contains many tools equipped with sensors that collect information in real time. Large amounts of historical data from the past 20-30 years have been collected and stored in a variety of formats and systems, such as text files, PDF files, pictures, Excel files and Parquet files in SAP systems, on-premises SQL databases and on network drives/local machines. For each project, data scientists, data engineers and domain experts work together to identify the different data sources, and then collect and integrate the data into the same platform.

Due to the variety of sources and varying quality of the data, data scientists spend over 80% of their time cleaning, aggregating, transforming and eventually performing feature engineering on the data. After all of this hard work, various modeling methods are explored and experimented with. The final model is selected based on a specific metric for each use case. Most of the time, the metrics chosen are directly related to the economic impact.

Once the model is selected, it can be deployed either through Azure Container Service, Azure Kubernetes Service or Azure Web App Services. After the model is deployed, it can be called and utilized in applications. However, this is not the end of the analytics life cycle. Monitoring the model performance to detect any model performance drifting over time is an essential part too. The ground truth data (after-the-fact data) flows into the system, goes through the data ingestion, data cleaning/aggregation/transformation and feature engineering. It is then stored in the system for monitoring the overall model performance or for future usage. The overall data volume is huge. However, the data for each individual equipment or well is relatively small. The data cleaning, aggregation and transformation are applied on the equipment level most of the time. Thus, if the data is partitioned by equipment number or by well, a Databricks cluster with a large number of nodes and a relatively small size for each node provides great performance boosts.

Data sources and data quality

Operational data is recorded in the field from hundreds of sensors at each rig site from the equipment belonging to different product service lines (PSLs) within Halliburton. This data is stored in a proprietary format called ADI and eventually parsed into parquet files. The amount of parquet data from one PSL could easily exceed 1.5 petabytes of historic data. With more real-time sensors and data from edge devices being recorded, this amount is expected to grow exponentially.

Configuration, maintenance and other event-related data in the field have been collected as well. These types of data are normally stored in SQL databases or entered on SAP. For example, over 5 million maintenance records have been collected in less than 2.5 years. External data has been leveraged for use when it is necessary and available, including weather related, geological and geophysical data.

Before we can analyze the vast amounts of data, it has to be merged and cleaned. At Halliburton, customized Python and PySpark libraries were developed to help us in the following ways:

  • Merging un-synchronized time-series datasets acquired at different sample rates efficiently.
  • Aggregating high-frequency time series data at lower frequency time intervals efficiently using user-defined aggregation functions.
  • Analyzing and handling missing data (categorical or numerical) and implementing statistical techniques to fill them in.
  • Efficiently detecting outliers based on thresholds and percentiles, with optional deletion or replacement by statistical values (a simplified sketch of this step follows the list).
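As a rough illustration of the last point, here is a minimal sketch of percentile-based outlier handling in PySpark. The DataFrame and column names (sensor_df, sensor_1) are assumptions for illustration, not Halliburton’s internal library.

```python
from pyspark.sql import functions as F

# sensor_df and 'sensor_1' are hypothetical; use the 1st and 99th percentiles as thresholds
low, high = sensor_df.approxQuantile("sensor_1", [0.01, 0.99], 0.001)

cleaned_df = sensor_df.withColumn(
    "sensor_1",
    F.when((F.col("sensor_1") < low) | (F.col("sensor_1") > high), F.lit(None))  # null out outliers...
     .otherwise(F.col("sensor_1"))                                               # ...keep the rest
)
```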

Data ingestion: Delta Lake for schema evolution, update and merge (“Upsert”) capabilities

A lot of legacy data with evolving schemas has been ingested into the system, and the data fused and stitched together from various sources contains duplicate records. For these issues, Databricks Delta Lake’s schema evolution and update/merge (upsert) capabilities are great solutions for storing such data.

The schema evolution functionality can be set with the following configuration parameter:

```python
spark.sql("set spark.databricks.delta.schema.autoMerge.enabled = true")
```

As a simple example, if incoming data is stored in the dataframe “new_data” and the Delta table archive is located in the path “delta_table_path”, then the following code sample would merge the data using the keys “id”, “key1” and “datetime” to check for existing records with matching keys before inserting the incoming data.

```python
from delta.tables import *

deltaTable = DeltaTable.forPath(spark, delta_table_path)

deltaTable.alias("events").merge(
    new_data.alias("updates"),
    """events.id = updates.id AND
       events.key1 = updates.key1 AND
       events.datetime = updates.datetime""") \
    .whenNotMatchedInsertAll() \
    .execute()
```

Pandas UDF utilization

Spark performs some operations row by row, which can be slow. For example, filling in the missing values for all of the sensors on each piece of equipment, from the corresponding previous value if available, or otherwise from the next available value, took hours with the native Spark window function `pyspark.sql.functions.last('col_i', ignorenulls=True).over(win)` for one of our projects. Most of the time, it ended up with an out-of-memory error, which was a pain point for the project. While searching for ways to improve the operation, we came across a mention of `pandas_udf` in a blog post introducing Spark window functions, and it turned out to be a powerful way to speed up the operation.

A Pandas user-defined function, also known as vectorized UDF, is a user-defined function that uses Apache Arrow to transfer data and utilize Pandas to work with the data. This blog is a great resource to learn more about utilizing `pandas_udf` with Apache Spark 3.0.

```python
import pandas as pd

col_list = col_sel  # col_sel (defined upstream): the list of sensor columns with missing values to fill

def fill_pdf(pdf: pd.DataFrame) -> pd.DataFrame:
    # Forward-fill, then backward-fill missing values within each equipment group
    for col_i in col_list:
        pdf.loc[:, col_i] = pdf[[col_i]].ffill().bfill()
    return pdf

# Apply the function per equipment group; filling values does not change the schema
processed_df = raw_df.groupBy('pumpID').applyInPandas(fill_pdf, schema=raw_df.schema)
```

Feature engineering for large time-series datasets

Dealing with large amounts of time-series data from hundreds of sensors at each rig site can be challenging in terms of processing the data efficiently to extract features that could be useful for ML tasks.

We usually analyze time-series data acquired by sensors in fixed-length time windows during periods of activity in which we like to look at aggregated sensor amplitude information (min, max, average, etc.) and frequency domain features (peak frequency, signal-to-noise, wavelet transforms, etc.).

We make heavy use of the distributed computing features of Spark on Databricks to extract features from large time-series datasets.

As a simple example, let’s consider time-series data coming from two sensor types; “sensor_1” and “sensor_2” are amplitudes acquired at various times (“time” column). The sensors belong to vehicle transmissions identified by “trans_id”, which are powered by engines denoted by “engine_type” and drive pumps denoted by “pump_id”. In this case, we already tagged the data into appropriate fixed-length time windows in a simple operation in which the window ID is in the column “time_window”.


Input Dataframe


For each time window for each transmission, pump and engine grouping, we first collect the time-series data from both sensors in a tuple-like ‘struct’ data type:
```python
import pyspark.sql.functions as f

group_columns = ['trans_id', 'pump_id', 'engine_type', 'time_window']
process_columns = ['time', 'sensor_1', 'sensor_2']
exprs = [f.collect_list(f.struct("time", cols)).alias(cols) for cols in process_columns]
agg_df = input_df.groupBy(group_columns).agg(*exprs)
```

This gives us a tuple of time-amplitude pairs for each sensor in each window as shown below:


Aggregated Data


Since the ordering of the tuples is not guaranteed by the “collect_list” aggregate function due to the distributed nature of Spark, we sort these collected tuples in increasing time order using a PySpark UDF and output the sorted amplitude values:
```python
import operator
from pyspark.sql import types as t

def sort_time_arrays(l):
    # Sort the collected (time, value) structs by time and return only the values
    res = sorted(l, key=operator.itemgetter(0))
    y = [item[1] for item in res]
    return y

sort_udf = f.udf(sort_time_arrays, t.ArrayType(t.DoubleType()))

for cols in process_columns:
    agg_df = agg_df.withColumn(cols, sort_udf(cols))
```
Data frame with the collected and time-sorted amplitude values for each window:

Sorted Data


Any signal processing function can now be applied to these sensor amplitude arrays for each window efficiently using PySpark or Pandas UDFs. In this example, we extract the Welch Power Spectral Density (Welch PSD) for each time-window, and in general, this UDF can be replaced by any function:
```python
from scipy import signal

def welch(x):
    # time_interval_sel (defined upstream) is the sampling interval of the sensor data
    fs = 1 / time_interval_sel
    freq, psd = signal.welch(x, fs, nperseg=128)
    freq = freq.tolist()
    psd = psd.tolist()
    y = freq + psd
    return y

welch_udf = f.udf(welch, t.ArrayType(t.DoubleType()))
process_cols = ['sensor_1', 'sensor_2']

for col_i in process_cols:
    agg_df = agg_df.withColumn(col_i + "_freq", welch_udf(f.col(col_i)))
```

The resulting data frame contains frequency domain Welch PSD extractions for each sensor in each time window:


Transformed Data


Below are some examples of the plotted time-series data and corresponding Welch PSD results for multiple windows for the two different sensors. We may choose to further extract the peak frequencies, spectral roll-off and spectral centroid from the Welch PSD results to use as features in ML tasks for predictive maintenance.

Results One and Two


Utilizing the concepts above, we can efficiently extract features from extremely large time series datasets using Spark on Databricks.

Model Development, management and performance monitoring

Model development and model management are a big part of the analytics life cycle. At Halliburton, prior to using MLOps, the models and the corresponding environment files were stored in blob storage with certain naming conventions, and a .csv file was used to track all of the model-specific information. This caused a lot of confusion and errors as more and more models and experiments were tested out.

So, an MLOps application has been developed by the team to manage the whole flow automatically. Essentially, the MLOps application links the three key elements (model development, model registration and model deployment) and runs them through DevOps pipelines automatically. The process is triggered either by a code commit to a specific branch of the centralized git repo or by updates to the dataset in the linked datastore. This standardizes the ML development and deployment process, drastically reducing the ML project life cycle time with quick delivery and updates. It frees up the team to focus on differentiating, value-added work, boosts collaboration and cooperation, and provides clear project history tracking, making analytics development across the company more sustainable.
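While the MLOps application itself is internal to Halliburton, a minimal sketch of the kind of registration step it automates might look like the following with MLflow; the run ID and model name below are placeholders.

```python
import mlflow

# Register a logged model from a completed run (run_id and the model name are placeholders)
model_uri = f"runs:/{run_id}/model"
registered = mlflow.register_model(model_uri, "predictive_maintenance_model")

# Promote the new version to Staging for validation before production rollout
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="predictive_maintenance_model",
    version=registered.version,
    stage="Staging",
)
```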

Summary

The Databricks platform has enabled us to utilize petabytes of time series data in a manageable way. We have used Databricks to:

  • Streamline the data ingestion process from various sources and efficiently store the data using Delta Lake.
  • Transform and manipulate data for use in ML in an efficient and distributed manner.
  • Help prepare and automate data pipelines for use in our MLOps platform.

These have helped our organization make efficient data-driven decisions that increase success and productivity for Halliburton and our customers.

Watch the Data + AI Summit Talk

The post Efficiently Building ML Models for Predictive Maintenance in the Oil and Gas Industry With Databricks appeared first on Databricks.

Identifying Financial Fraud With Geospatial Clustering


For most financial services institutions (FSIs), fraud prevention often implies a complex ecosystem made of various components: a mixture of traditional rules-based controls and artificial intelligence (AI), and a patchwork of on-premises systems, proprietary frameworks and open source cloud technologies. Combined with strict regulatory requirements (such as model explainability), high governance frameworks, low latency and high availability (sub-second response times for card transactions), these systems are costly to operate, hard to maintain and even harder to adapt to customers’ changing behaviors and fraudsters alike. Similar to risk management, a modern fraud prevention strategy must be agile at its core and combine a collaborative, data-centered operating model with an established delivery strategy for code, data and machine learning (ML), such as DataOps, DevOps and MLOps. In a previous solution accelerator, we addressed the problem of combining rules with AI in a common orchestration framework powered by MLflow.

As consumers become more digitally engaged, large FSIs often have access to real-time GPS coordinates of every purchase made by their customers. With around 40 billion card transactions processed in the US every year, retail banks have a lot of data they can leverage to better understand transaction behaviors for customers opting into GPS-enabled banking applications. Given the data volume and complexity, it often requires access to a large amount of compute resources and cutting-edge libraries to run geospatial analytics that do not “fit well” within a traditional data warehouse and relational database paradigm.

Geospatial clustering to identify customer spending behaviors

In this solution centered around geospatial analytics, we show how the Databricks Lakehouse Platform enables organizations to better understand customers’ spending behaviors in terms of both who they are and how they bank. This is not a one-size-fits-all model, but truly personalized AI. After all, identifying abnormal patterns can only be achieved with the ability to first understand what normal behaviour is, and doing so for millions of customers is a challenge that requires data and AI combined into one platform.

Geospatial clustering as part of a fraud prevention strategy

As part of this real-world solution, we are releasing a new open source geospatial library, GEOSCAN, to detect geospatial behaviors at massive scale, track customers’ patterns over time and detect anomalous card transactions. Finally, we demonstrate how organizations can surface anomalies from an analytics environment to an online data store (ODS) with tight SLA requirements, following a Lambda-like infrastructure underpinned by Delta Lake, Apache Spark and MLflow.

Geospatial clustering of card transactions

DBSCAN (density-based spatial clustering of applications with noise) is a common ML technique used to group points that are closely packed together. Compared to other clustering methodologies, it doesn’t require you to indicate the number of clusters beforehand, can detect clusters of varying shapes and sizes, and effectively finds outliers that don’t belong in any dense area. This makes it a great candidate for geospatial analysis of credit card transactions and potentially fraudulent activities. However, it comes with a serious price tag: DBSCAN requires all points to be compared to every other point to find dense neighbourhoods, which is a significant limitation given the scale large FSIs operate at. As we could not find a viable solution that could scale to millions of customers or more than a few hundred thousand records, we created our own open source AI library: GEOSCAN. Available with both Scala and Python APIs, GEOSCAN is our implementation of the DBSCAN algorithm for geospatial clustering at big data scale.

Density based clustering (DBSCAN)


GEOSCAN leverages Uber’s H3 library to group only points we know are in close vicinity (sharing at least one H3 polygon) and relies on the GraphX API to detect dense areas at massive scale, understand user spending behaviors and detect anomalous transactions in near real time.

GEOSCAN logic to group points in close proximity using H3

In order to validate our framework, we created a synthetic dataset of credit card transactions in the NYC area. Our dataset only contains a tokenized value for users, a geospatial coordinate (latitude and longitude), a timestamp and a transaction amount. In real life, the data would also contain additional transaction context (such as a merchant narrative or MCC code) and would often be enriched with clean brand information (the latter will be addressed as part of a future Databricks Solution Accelerator). In this demo, we will extract dense clusters that correspond to areas with higher transaction activity, such as high streets and shopping malls.

There are two modes supported by our GEOSCAN implementation, distributed and pseudo distributed.

  1. In distributed mode, our framework detects clusters from an entire dataframe (i.e., across our entire user base).
  2. In pseudo-distributed mode, it retrieves clusters for a grouped predicate, hence training millions of models for millions of customers in parallel.

Both modes are useful to better understand customers’ shopping behaviour as part of a personalized fraud prevention strategy.

Detecting dense shopping areas

Working fully distributed, the core of the GEOSCAN algorithm relies on GraphX to detect points having distance < epsilon (expressed in meters) and neighbors > minPoints.

```python
from geoscan import Geoscan

geoscan = Geoscan() \
   .setLatitudeCol('latitude') \
   .setLongitudeCol('longitude') \
   .setPredictionCol('cluster') \
   .setEpsilon(200) \
   .setMinPts(20)

model = geoscan.fit(points_df)
clusters_df = model.transform(points_df)
```

As strong advocates of open standards, we built GEOSCAN to support RFC 7946 (aka GeoJSON) as a model output that can be processed as-is with any geospatial library (such as geopandas), GIS database (GeoMesa) or visualization tool (folium). As represented below, MLflow natively supports the use of GeoJSON as a model artifact.

MLflow displaying GeoJSON file format

Let’s walk through an example of GEOSCAN in action. With this solution, we have programmatically extracted geographical shapes corresponding to a high density of card transactions in the New York City (NYC) area. As represented above, our parameters resulted in a relatively large shape covering most of NYC. Although increasing the minPts value or reducing epsilon could help tighten that shape, it may also penalize less dense areas such as Williamsburg in Brooklyn. Largely domain-specific, we explore different approaches to tune our model and improve performance in the notebooks referenced at the end of this blog.

Model inference and cluster tiling

As the core of GEOSCAN logic relies on the use of H3 polygons, it becomes natural to leverage the same for model inference instead of bringing in extra GIS dependencies for expensive point-in-polygon queries. Our approach consists of “tiling” our clusters with H3 hexagons that can easily be joined to our original dataframe, making the best use of Delta Lake optimizations (such as ZORDER indexing) and turning complex geospatial queries into simple SQL operations.

Tiling a geo shape with H3 hexagons

Personalized clustering

We have demonstrated how GEOSCAN can be used across our entire dataset. However, the aim was not to machine learn the shape of NYC, nor to find the best place to go shopping, but to track user spending behaviour over time and - most importantly - where transactions are the least expected to happen for a given customer, therefore requiring a personalized approach to geospatial clustering.

```python
from geoscan import GeoscanPersonalized

geoscan = GeoscanPersonalized() \
   .setLatitudeCol('latitude') \
   .setLongitudeCol('longitude') \
   .setPredictionCol('cluster') \
   .setGroupedCol('user') \
   .setEpsilon(500) \
   .setMinPts(3)

model = geoscan.fit(points_df)
clusters_df = model.transform(points_df)
```

Similar to our distributed approach, models can be stored and retrieved as per the standard Spark ML API. In this mode, we return a dataframe made of GeoJSON objects rather than a single file in which each card holder has an associated record capturing their spending geographical patterns.

Understanding customer specific patterns

It is important to step back and reflect on the insights gained so far. As we learn more about our entire customer base (distributed approach), we can leverage this information to better understand the behaviour that is specific to each individual. If everyone were to shop at the same location, such an area would be less specific to a particular user. We can score “personalized” zones by how much they overlap with common areas, better understand our end customers and pave the way towards truly personalized banking.

Detecting the areas that are most descriptive of each user is similar to detecting the keywords that are most descriptive of each sentence in natural language processing use cases. We can use a term frequency / inverse document frequency (TF-IDF) approach to increase the weight of user-specific locations while reducing the weight of common areas.
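A minimal sketch of that idea is shown below, assuming a hypothetical DataFrame user_tiles with one row per (user, h3) pair; the exact weighting used in the GEOSCAN notebooks may differ.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# user_tiles is a hypothetical DataFrame with one row per (user, h3) pair
total_users = user_tiles.select('user').distinct().count()

# Term frequency: the share of a user's activity that falls in a given H3 cell
tf = (user_tiles.groupBy('user', 'h3').count()
      .withColumn('tf', F.col('count') / F.sum('count').over(Window.partitionBy('user'))))

# Inverse document frequency: cells shared by many users carry less personal signal
idf = (user_tiles.groupBy('h3').agg(F.countDistinct('user').alias('n_users'))
       .withColumn('idf', F.log(F.lit(total_users) / F.col('n_users'))))

personalized = tf.join(idf, 'h3').withColumn('tf_idf', F.col('tf') * F.col('idf'))
```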

Geographical spending patterns for a given customer

We have suddenly gained incredible insights about our customers’ shopping behaviour. Although the core of this user’s transactions is made in the Chelsea and Financial District areas of NYC, what seems to better define this user are their transactions around the Plaza Hotel on 5th Avenue and in Williamsburg. Given a specific user and a location, this framework can be used to better understand whether a card transaction falls within a known shopping pattern at a specific time of day or day of the week.

Fraud prevention using personalized AI

In the previous section, we showed that geospatial data analysis can tell us a lot about customers’ behaviour and trends, making it a critical component of an anomaly detection model within an overarching fraud prevention strategy. In this section, we demonstrate how to use these insights to detect suspicious behaviour in real time. More often than not, fraud detection systems run outside of an analytics environment due to the combination of data sensitivity (PII), regulatory requirements (PCI/DSS) and model materiality (high SLAs and low latency). For these reasons, we explore multiple strategies to serve our insights either as a self-contained framework using MLflow or through a relational or NoSQL online datastore such as Redis, MongoDB, Redshift or ElastiCache, although many other solutions may be viable.

Extracting anomalies

Since we have stored and indexed all of our personalized ML models as H3 polygons in a Delta table, it becomes easy to enrich each transaction with its cluster using a simple JOIN operation. In the example below, we extract anomalies (transactions not matching any known pattern of a given user) at a specific H3 resolution (see the resolution table) embedded in a user-defined function.

```python
from pyspark.sql import functions as F

# to_h3 is a helper UDF (defined in the accompanying notebooks) converting lat/lon to an H3 index
anomalous_transactions = (
  spark
    .read
    .table('geospatial.transactions')
    .withColumn('h3', to_h3('latitude', 'longitude', 10))
    .join(tiles, ['user', 'h3'], 'left_outer')
    .filter(F.expr('cluster IS NULL'))
)
```

Out of half a million transactions, we extracted 81 records in less than 5 seconds. Not necessarily fraudulent, maybe not even suspicious, but these transactions did not match any of our users’ “normal” behaviors, and as such, should be flagged as part of an overarching fraud prevention framework and further combined with other rules and models. In a real-life example, we should also factor in time and additional transactional context. Would the same transaction happening on a Sunday afternoon or a Wednesday morning be suspicious given the user characteristics we could learn?

With millions of transactions and low latency requirements, it would not be realistic to join these large datasets in real time. Although we could load all clusters (their H3 tiles) in memory, we may have to evaluate multiple models at different times of the day, for different segments, different transaction indicators (e.g., for different brand categories or MCC codes) and for millions of consumers, resulting in a complex system that requires efficient lookup strategies against millions of variables.

Bloom filtering

Enter Bloom filters, an efficient probabilistic data structure that can test the existence of a given record without keeping the entire set in memory. Although Bloom filters have been around for a long time, their usage has sadly not been democratized beyond low-level engineering techniques such as database optimization engines and query execution planners (the Delta engine leverages Bloom filter optimizations under the hood, among other techniques). This technique is a powerful tool worth having in a modern data science toolkit.

The concept of a Bloom filter is to convert a series of records (in our case, a card transaction location as an H3 polygon) into a series of hash values, overlaying their byte-array representations as vectors of 1s and 0s. Testing the existence of a given record amounts to testing whether each of its corresponding bits is set to 1. The memory efficiency of Bloom filters comes with a non-negligible downside in the context of fraud detection: although Bloom filters offer a false negative rate of 0, there is a non-zero false positive rate (records we wrongly assume have been seen due to hash collisions) that can be controlled by the length of the array and the number of hash functions.

We will be using the pybloomfilter Python library to validate this approach, training a Bloom filter against each and every known H3 tile of every given user. Although our filter may logically contain millions of records, we would only need to physically maintain 1 byte array in memory to enable a probabilistic search, sacrificing 1% of anomalous transactions (our false positive rate) for higher execution speed.

```python
import pybloomfilter

def train_bloom(records):
  num_records = len(records)
  # Size the filter for the number of records with a 1% false positive rate
  cluster = pybloomfilter.BloomFilter(num_records, 0.01)
  cluster.update(records)
  return cluster

# `user` holds the identifier of the card holder whose filter we train
records = list(tiles.filter(F.col('user') == user).toPandas().h3)
bloom = train_bloom(records)
```

Testing the (in-)existence of a specific card transaction location can then be done at lightning speed.

```python
anomalies = transactions[transactions['h3'].apply(lambda x: x not in bloom)]
```

In the notebooks listed in this blog, we demonstrate how data scientists can embed that business logic as an MLflow experiment that can be further delivered to a batch or stream processing or to external APIs with higher throughput requirements (see MLflow deployments).

```python
import mlflow

model = mlflow.pyfunc.load_model('models:/bloom/production')
anomalies = model.predict(transactions)
anomalies = anomalies[anomalies['anomaly'] != 0]
```

However, this approach poses an important operational challenge for large financial services organizations, as new models would need to be constantly retrained and redeployed to adapt to users changing behaviors.

Let's take an example of a user going on holidays. Although their first card transactions may be returned as anomalous (not necessarily suspicious), such a strategy would need to adapt and learn the new "normal" as more and more transactions are observed. One would need to run the same process with new training data, resulting in a new version of a model being released, reviewed by an independent team of experts, approved by a governance entity and eventually updated to a fraud production endpoint outside of any change freeze. Technically possible and definitely made easier with Databricks due to the platform’s collaborative approach to data management, this approach may not be viable for many.

Online data store

It is fairly common for financial services institutions to have an online data store decoupled from the analytics platform. A real-time flow of incoming card transactions, usually accessed from an enterprise message broker such as Kafka, Event Hubs (Azure) or Kinesis (AWS), is compared with reference data points in real time. An alternative approach to the above is to use an online datastore (like MongoDB) to keep “pushing” reference data points to a live endpoint as a business-as-usual process (hence outside of ITSM change windows). Any incoming transaction would be matched against a set of rules constantly updated and accessible via sub-second lookup queries. Using the MongoDB Spark connector (as an example), we show how organizations can save our geo cluster dataframes for real-time serving, combining the predictive power of advanced analytics with the low latency and explainability of traditional rules-based systems.

```scala
import com.mongodb.spark._
import org.apache.spark.sql.functions.current_timestamp

tiles
 .withColumn("createdAt", current_timestamp())
 .write
 .format("mongo")
 .mode("append")
 .option("database", "fraud")
 .option("collection", "tiles")
 .save()
```

In the example above, we append new reference data (card transaction H3 locations) to a MongoDB collection (i.e., a table) for each of our millions of card holders at regular time intervals. In that setting, new transactions can be compared with reference historical data stored in MongoDB through a simple request. Given a user and a transaction location (as an H3 polygon), is this card transaction happening in a known user pattern?

```
mongo > use fraud
mongo > db.tiles.find({"user": "Antoine", "tile": "8A2A1008916FFFF"})
```

As part of this solution, we want to leverage another built-in capability of MongoDB: Time to Live (TTL). Besides the operational benefit of not having to maintain this collection (records are purged after the TTL expires), we can bring a temporal dimension to our model in order to cope with users’ changing patterns. With a TTL of 1 week (for example) and a new model trained on Databricks every day, we can track clusters over time while dynamically adapting our fraud strategy as new transactions are observed, purposely drifting our model to follow users’ changing behaviors.
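For completeness, here is a minimal sketch of how such a TTL index could be created with pymongo; the database and collection names mirror the example above, while the connection string and 7-day expiry are assumptions.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
tiles_collection = client["fraud"]["tiles"]

# Documents are purged automatically once `createdAt` is older than 7 days
tiles_collection.create_index("createdAt", expireAfterSeconds=7 * 24 * 3600)
```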

Example of how change in customers' transactional behaviors could be tracked over time

In the visualization above, we show an example of how change in customers' transactional behaviors could be tracked over time (thanks to our TTL on MongoDB in real time and / or time travel functionality on Delta), where any observed location stays active for a period of X days and wherein anomalous transactions can be detected in real time.

Closing thoughts: Geospatial analytics for customer-centric banking

Card transaction fraud will never be addressed by a one-size-fits-all model; instead, isolated indicators coming from different controls should always be contextualized as part of an overarching fraud prevention strategy. Often, this combines advanced modeling techniques (such as neural networks) with rules-based systems, integrates advanced technologies and legacy processes, cloud-based infrastructures and on-premises systems, and must comply with tight regulatory requirements and critical SLAs. Although this solution does not aim to identify fraudulent transactions on its own, we demonstrated through the release of a new open source library, GEOSCAN, how geospatial analytics can greatly contribute to extracting anomalous events in a timely, cost-effective (self-maintained) and fully-explainable manner, making it a great candidate to combat financial fraud more effectively in a coordinated rules + AI strategy.

As part of this exercise, we also discovered something equally important in financial services. We demonstrated the ability of the Lakehouse infrastructure to transition from traditional to personalized banking where consumers are no longer segmented by demographics (who they are) but by their spending patterns (how they bank), paving the way towards a more customer-centric and inclusive future of retail banking.

Getting started

Try the below notebooks on Databricks to accelerate your fraud prevention development strategy today and contact us to learn more about how we assist customers with similar use cases.

--

Try Databricks for free. Get started today.

The post Identifying Financial Fraud With Geospatial Clustering appeared first on Databricks.

Databricks and University of Rochester


At Databricks, we strongly believe (“know” you could say) that data and AI are mission-critical for solving the biggest problems our world faces. From healthcare to sustainability to transportation, data is a key to understanding and analyzing these issues at the deepest level – often in real time – and in turn shapes effective solutions.

That’s why we’re thrilled to announce that Databricks will be working with the University of Rochester’s Goergen Institute for Data Science on student capstone projects to drive social change with public datasets. The core mission of Databricks is to solve the world’s toughest problems with data, so we are very excited to add this work with Rochester alongside other work we have done with nonprofits, policy makers, NGOs and other organizations. One of my favorite  examples is when we leveraged public healthcare data sets to empower the data community at the early onset of the global pandemic.

Databricks’ collaboration began with Rochester’s membership in the Databricks University Alliance, a global program with more than 160 member universities worldwide that helps more than 7,000 students get hands-on experience using Databricks. In this new extended partnership, Databricks employees will work with students on identifying problems, selecting datasets, doing machine learning and sharing Databricks notebooks and models that highlight novel and actionable information.

This joint effort is especially exciting given the interest in leveraging data science for social good that Databricks and several University of Rochester faculty members share. Professor Lloyd Palum, an instructor for the course Data Science at Scale, first approached Databricks in August of 2020 with an interest in using our platform to introduce students to data-intensive applications for capstone projects. This ultimately culminated in Professor Palum, in collaboration with Dr. Ajay Anand, Deputy Director of the Goergen Institute for Data Science, presenting his approach to teaching large-scale analytics using Databricks to 50+ faculty members earlier this year. Part of that conversation involved using data for good, a subject that many Databricks employees are passionate about (for an example of such work, see this recent blog post from Chengyin Eng & Brooke Wenig exploring fatality rates in police shootings).

In follow-up conversations with Professor Palum and Dr. Anand, Databricks employees expressed great enthusiasm for working with Rochester on solving tough problems with public data sets.  What does this program mean for solving the world’s toughest problems? This spring 2021 semester, Databricks engineers will investigate various social, public health and humanitarian issues that have publicly-available datasets and present options for student capstone projects in the fall. 

As we hit milestones in this collaboration, we will publish more blog posts to bring visibility to the important work that these students, professors and Databricks employees are undertaking in the interest of social good.  We hope to make significant contributions to the ecosystem of responsible data scientists, ethical AI and data sets in the public domain that target real-world problems like climate change, pandemic management, social equity and sustainability at a global scale. Stay tuned!

How to Get Started with Databricks University Alliance

The Databricks University Alliance exists to help students and professors learn and use public-cloud-based analytical tools in college classrooms virtually or in-person. Enroll now and join more than 150 universities across the globe that are building the data science workforce of tomorrow.  If you are a professor or student interested in working with Databricks on using public data sets to drive social change, please contact  university@databricks.com. We believe that thoughtful collaboration can make a difference!

Upon acceptance, members will get access to curated content, training materials, sample notebook and pre-recorded content for learning data science and data engineering tools, including Apache Spark, Delta Lake and MLflow.  Students focused on individual skills development can sign up for the free Databricks Community Edition and follow along with these free one-hour hands-on workshops for aspiring data scientists, as well as access free self-paced courses from Databricks Academy, the training and certification organization within Databricks.

The Databricks University Alliance is powered by leading cloud providers such as Microsoft Azure, AWS and Google Cloud. Those educators looking for high-scale computing resources for their in-person and virtual classrooms may apply for cloud computing credits.

--

Try Databricks for free. Get started today.

The post Databricks and University of Rochester appeared first on Databricks.

7 Reasons to Learn PyTorch on Databricks


What expedites the process of learning new concepts, languages or systems? When learning a new task, do you look for analogs from skills you already possess?

Across all learning endeavors, three favorable characteristics stand out: familiarity, clarity and simplicity. Familiarity eases the transition because of a recognizable link between the old and new ways of doing. Clarity minimizes the cognitive burden. And simplicity reduces the friction in the adoption of the unknown and, as a result, increases the fruition of learning a new concept, language or system.

Aside from being popular among researchers, gaining adoption by machine learning practitioners in production and having a vibrant community, PyTorch has a familiar feel to it, is easy to learn and can be employed for your machine learning use cases.

Keeping these characteristics in mind, we examine in this blog several reasons why it’s easy to learn PyTorch, and how the Databricks Lakehouse Platform facilitates the learning process.

1a. PyTorch is Pythonic

Luciano Ramalho in Fluent Python defines Pythonic as an idiomatic way to use Python code that makes use of language features to be concise and readable. Python object constructs follow a certain protocol, and their behaviors adhere to a consistent pattern across classes, iterators, generators, sequences, context managers, modules, coroutines, decorators, etc. Even with little familiarity with the Python data model, modules and language constructs, you recognize similar constructs in PyTorch APIs, such as a torch.tensor, torch.nn.Module, torch.utils.data.Datasets, torch.utils.data.DataLoaders etc. Another aspect is the concise code you can write in PyTorch as with PyData packages such as Pandas, scikit-learn or SciPy.

PyTorch integrates with the PyData ecosystem, so your familiarity with NumPy makes learning Torch Tensors incredibly simple. NumPy arrays and tensors have similar data structures and operations. Just as DataFrames are central data structures to Apache Spark™ operations, so are tensors as inputs to PyTorch models, training operations, computations and scoring. A PyTorch tensor’s mental image (shown in the diagram below) maps to an n-dimensional NumPy array.

A PyTorch tensor’s mental image maps to an n-dimensional numpy array.

For instance, you can seamlessly create NumPy arrays and convert them into Torch tensors. Familiarity with NumPy operations transfers to tensor operations, as shown in the code below.

Both have familiar, imperative and intuitive operations that one would expect from Python object APIs, such as lists, tuples, dictionaries, sets, etc. All this familiarity with NumPy’s equivalent array operations on Torch tensors helps. Consider these examples:
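Since the original code images are not reproduced here, a quick hedged illustration of that NumPy-to-tensor familiarity:

```python
import numpy as np
import torch

# Create a NumPy array and convert it to a Torch tensor (and back)
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
tensor = torch.from_numpy(arr)        # shares memory with the NumPy array
back_to_numpy = tensor.numpy()

# Familiar, element-wise operations look the same in both worlds
print(arr.mean(), tensor.mean())      # 3.5 and tensor(3.5000, dtype=torch.float64)
print(arr.T.shape, tensor.T.shape)    # (3, 2) and torch.Size([3, 2])
```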

The latest release, PyTorch 1.8.0, further builds on this parity between PyTorch tensors and NumPy with a dedicated module for fast Fourier transforms.

1b. Easy-to-extend PyTorch nn Modules

The PyTorch library includes neural network modules to build a layered network architecture. In PyTorch parlance, these modules comprise each layer of your network. By deriving from the base class torch.nn.Module, you can easily create a simple or complex layered neural network. To define a customized PyTorch network module class and its methods, you follow a pattern similar to building a customized Python class derived from the base object class. Let’s define a simple two-layered linear network example.

Notice that the custom TwoLayeredNet below is Pythonic in its flow and structure. Derived classes from the torch.nn.Module have class initializers with parameters, define interface methods, and are callable. That is, the base class torch.nn.Module implements the Python magic __call__() object method. Although the two-layered model is simple, it demonstrates this familiarity with extending a class from Python’s base object.
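The TwoLayeredNet referenced above is not reproduced in this export’s code images, so here is a hedged sketch of what such a module typically looks like; layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class TwoLayeredNet(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        # Two linear layers with a ReLU non-linearity in between
        self.linear1 = nn.Linear(d_in, d_hidden)
        self.linear2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))

model = TwoLayeredNet(d_in=1000, d_hidden=100, d_out=10)
y_pred = model(torch.randn(64, 1000))   # the module is callable, like any Python object
```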

Furthermore, you get an intuitive feeling that you are writing or reading Python application code while using PyTorch APIs: the syntax, structure, form and behavior are all too familiar. The unfamiliar bits are the PyTorch modules and APIs, which is no different from learning a new PyData package’s APIs and incorporating them into your Python application code.

For more Pythonic code, read the accompanying notebook on the imperative nature of PyTorch code for writing training loops and loss functions, familiar Python iterative constructs, and using the cuda library for GPUs.

Now we define a simple training loop with some iterations, using familiar Python language constructs.
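A hedged sketch of such a loop, reusing the TwoLayeredNet and imports from the sketch above on random tensors; sizes and hyperparameters are illustrative.

```python
x = torch.randn(64, 1000)
y = torch.randn(64, 10)

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(100):                 # a familiar, imperative Python loop
    y_pred = model(x)                    # forward pass
    loss = loss_fn(y_pred, y)            # compute the loss
    optimizer.zero_grad()                # reset accumulated gradients
    loss.backward()                      # backward pass (autograd)
    optimizer.step()                     # update the weights
```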

What follows is a recognizable pattern and flow between a customized Python class and a simple PyTorch neural network. Also, the code reads like Python code. Another recognizable Pythonic pattern in PyTorch is how Dataset and DataLoaders use Python protocols to build iterators.

1c. Easy-to-customize PyTorch Dataset for Dataloaders

At the core of PyTorch’s data loading utility is the torch.utils.data.DataLoader class. It is an integral part of the PyTorch iterative training process, which iterates over batches of input during an epoch of training. A DataLoader wraps a Dataset, which implements the Python sequence protocol via the __len__ and __getitem__ magic methods. Again, very Pythonic in behavior; as part of the implementation, we employ list comprehensions, use NumPy arrays to convert to tensors and use random access to fetch the nth data item, all conforming to familiar Python access patterns and behaviors.

Let’s look at a simple custom Dataset of temperatures for use in training a model. Other complex datasets could be images, extensive features datasets of tensors, etc.
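The original notebook’s FahrenheitTemperatures class isn’t shown in this export, so here is a hedged reconstruction of what such a Dataset could look like (Celsius inputs and their Fahrenheit targets):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class FahrenheitTemperatures(Dataset):
    """Hypothetical reconstruction: Celsius inputs and their Fahrenheit equivalents."""
    def __init__(self, low=-50, high=50):
        celsius = np.arange(low, high, dtype=np.float32)
        fahrenheit = celsius * 9.0 / 5.0 + 32.0
        self.x = torch.from_numpy(celsius).unsqueeze(1)      # shape: (n, 1)
        self.y = torch.from_numpy(fahrenheit).unsqueeze(1)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]
```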

A PyTorch Dataloader class takes an instance of a customized FahrenheitTemperatures class object as a parameter. This utility class is standard in PyTorch training loops. It offers an ability to iterate over batches of data like an iterator: again, a very Pythonic and straightforward way of doing things!

Since we implemented our custom Dataset, let’s use it in the PyTorch training loop.
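A short sketch of that usage, building on the Dataset reconstruction above and feeding it through a DataLoader into a training loop; the model and hyperparameters are illustrative.

```python
from torch.utils.data import DataLoader

dataset = FahrenheitTemperatures()
loader = DataLoader(dataset, batch_size=8, shuffle=True)

simple_model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(simple_model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for epoch in range(10):
    for batch_x, batch_y in loader:      # the DataLoader yields batches, like any Python iterable
        loss = loss_fn(simple_model(batch_x), batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```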

Although the aforementioned Pythonic reasons are not directly related to Databricks Lakehouse Platform, they account for ideas of familiarity, clarity, simplicity, and the Pythonic way of writing PyTorch code. Next, we examine what aspects within the Databricks Lakehouse Platform’s runtime for machine learning facilitate learning PyTorch.

2. No need to install Python packages

As part of the Databricks Lakehouse Platform, the runtime for machine learning (ML) comes preinstalled with the latest versions of Python, PyTorch, PyData ecosystem packages and additional standard ML libraries, saving you from installing or managing any packages. This out-of-the-box, ready-to-use runtime environment unburdens you from package management. If you want to install additional Python packages, simply use %pip install. This ability to manage packages on your cluster is popular among Databricks customers and widely used as part of their model development lifecycle.

To inspect the list of all preinstalled packages, use %pip list.

To inspect the list of all packages preinstalled with Databricks runtime for machine learning, use the pip list

3. Easy-to-Use CPUs or GPUs

Neural networks for deep learning involve numeric-intensive computations, including dot products and matrix multiplications on large, higher-ranked tensors. For compute-bound PyTorch applications that require GPUs, create an MLR cluster with GPUs and move your data and model onto them. As such, all training can be done on GPUs, as the above example of TwoLayeredNet demonstrates using cuda.

Note that while this example shows simple code, the matrix multiplication of two randomly generated tensors, real PyTorch applications will have much more intense computation during their forward and backward passes and autograd computations.

Example Pytorch code from the Databricks Lakehouse Platform showing matrix multiplication of two randomly generated tensors
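A hedged sketch of that pattern, moving randomly generated tensors (and, by extension, a model) onto a GPU when one is available:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(2000, 2000, device=device)
b = torch.randn(2000, 2000, device=device)
c = torch.matmul(a, b)                 # runs on the GPU when device is "cuda"

# The same .to(device) call moves a model's parameters onto the GPU, for example:
# model = TwoLayeredNet(1000, 100, 10).to(device)
```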

4. Easy-to-Use TensorBoard

Already announced in a blog as part of the Databricks Runtime (DBR), this magic command displays your training metrics from TensorBoard within the same notebook. No longer do you need to leave your notebook and launch TensorBoard from another tab. This in-place TensorBoard visualization is a significant improvement toward simplicity and developer experience. And PyTorch developers can quickly see their metrics in TensorBoard.

Let’s try to run a sample PyTorch FashionMNIST example with TensorBoard logging.
First, define a SummaryWriter, then feed the FashionMNIST Dataset through a DataLoader into our PyTorch torchvision.models.resnet50 model.

PyTorch FashionMNIST example with TensorBoard logging.
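The screenshot above comes from the notebook; below is a minimal sketch of just the logging part, with the model and dataset details elided (num_steps and train_step are placeholders).

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="./runs")   # the same directory passed to %tensorboard below

for step in range(num_steps):              # num_steps and train_step() are placeholders
    loss = train_step()
    writer.add_scalar("train/loss", loss, step)

writer.close()
```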

Using Databricks notebook’s magic commands, you can launch the TensorBoard within your cell and examine the training metrics and model outputs.

%load_ext tensorboard

%tensorboard --logdir=./runs

5. PyTorch Integrated with MLflow

In our steadfast effort to make Databricks simpler, we enhanced MLflow fluent tracking APIs to autolog MLflow entities—metrics, tags, parameters and artifacts—for supported ML libraries, including PyTorch Lightning. Through the MLflow UI, an integral part of the workspace, you can access all MLflow experiments via the Experiment icon in the upper right corner. All experiment runs during training are automatically logged to the MLflow tracking server. No need for you to explicitly use the tracking APIs to log MLflow entities, albeit it does not prevent you from tracking and logging any additional entities such as images, dictionaries, or text artifacts.

Here is a minimal example of a PyTorch Lightning FashionMNIST instance with just a training loop step (no validation, no testing). It illustrates how you can use MLflow to autolog MLflow entities, peruse the MLflow UI to inspect its runs from within this notebook, register the model and serve or deploy it.

A PyTorch Lightning FashionMNIST instance with just a training loop step, illustrating how you can use MLflow to autolog MLflow entities, peruse the MLflow UI and register, serve or deploy the model.

Create the PyTorch model as you would create a Python class, use the FashionMNIST DataLoader and a PyTorch Lightning Trainer, and autolog all MLflow entities during its trainer.fit() method.

You can use the FashionMNIST DataLoader, a PyTorch Lightning Trainer, to autolog all MLflow entities during its trainer.fit() method.
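A minimal sketch of that autologging flow; FashionMNISTModel and train_loader below are placeholders standing in for the notebook’s LightningModule and DataLoader.

```python
import mlflow.pytorch
import pytorch_lightning as pl

mlflow.pytorch.autolog()           # metrics, params and the model are logged automatically

trainer = pl.Trainer(max_epochs=5)
with mlflow.start_run():
    trainer.fit(FashionMNISTModel(), train_loader)   # both names are placeholders
```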

6. Convert MLflow PyTorch-logged Models to TorchScript

TorchScript is a way to create serializable and optimizable models from PyTorch code. We can convert an MLflow-logged PyTorch model into the TorchScript format, save it, and load (or deploy) it to a high-performance, independent process, or deploy and serve it on a Databricks cluster as an endpoint.

The process entails the following steps (a minimal sketch follows the list):

  1. Create an MLflow PyTorch model
  2. Compile the model using JIT and convert it to the TorchScript model
  3. Log or save the TorchScript model
  4. Load or deploy the TorchScript model
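A minimal sketch of steps 2-4, assuming an already-trained, scriptable `model` (a torch.nn.Module) and a placeholder input shape:

```python
import torch
import mlflow.pytorch

scripted_model = torch.jit.script(model)                 # step 2: compile to TorchScript

with mlflow.start_run():
    mlflow.pytorch.log_model(scripted_model, "scripted_model")   # step 3: log the TorchScript model
    run_id = mlflow.active_run().info.run_id

loaded = mlflow.pytorch.load_model(f"runs:/{run_id}/scripted_model")   # step 4: load it back
loaded(torch.randn(1, 1000))                             # placeholder input shape
```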


We have not included all the code here for brevity, but you can examine the sample code—IrisClassification and MNIST—in the GitHub MLflow examples directory.

7. Ready-to-run PyTorch Tutorials for Distributed Training

Lastly, you can use a Databricks Lakehouse MLR cluster to distribute your PyTorch model training. We provide a set of tutorials that demonstrate a) how to set up single-node training and b) how to migrate to the Horovod library to distribute your training. Working through these tutorials equips you to apply distributed training to your PyTorch models. These ready-to-run, easy-to-import notebooks are an excellent stepping-stone to learn distributed training. Just follow the recommended setups and sit back and watch the model train…


© r/memes – Watching a train model meme

Each notebook provides a step-by-step guide to set up an MLR cluster, how to adapt your code to use either CPUs or GPUs and train your models in a distributed fashion with the Horovod library.

Moreover, the PyTorch community provides excellent Learning with PyTorch Examples starter tutorials. You can just as simply cut-and-paste the code into a Databricks notebook or import a Jupyter notebook and run it on your MLR cluster as in a Python IDE. As you work through them, you get a feel for the Pythonic nature of PyTorch: imperative and intuitive.

Finally, there will be a number of PyTorch production ML use case sessions at the upcoming Data + AI Summit. Registration is open now. Save your spot.

What’s Next: How to get started

You can try the accompanying notebook in your MLR cluster and import the PyTorch tutorials mentioned in this notebook. If you don’t have a Databricks account, sign up today for a free trial and have a go at PyTorch on the Databricks Lakehouse Platform. For single-node training with limited functionality and CPU-only usage, use the Databricks Community Edition.

--

Try Databricks for free. Get started today.

The post 7 Reasons to Learn PyTorch on Databricks appeared first on Databricks.

How (Not) to Tune Your Model with Hyperopt


Hyperopt is a powerful tool for tuning ML models with Apache Spark. Read on to learn how to define and execute (and debug) the tuning optimally!

So, you want to build a model. You’ve solved the harder problems of accessing data, cleaning it and selecting features. Now, you just need to fit a model, and the good news is that there are many open source tools available: xgboost, scikit-learn, Keras, and so on. The bad news is also that there are so many of them, and that they each have so many knobs to turn. How much regularization do you need? What learning rate? And what is “gamma” anyway?

There is no simple way to know which algorithm, and which settings for that algorithm (“hyperparameters”), produces the best model for the data. Any honest model-fitting process entails trying many combinations of hyperparameters, even many algorithms.

One popular open-source tool for hyperparameter tuning is Hyperopt. It is simple to use, but using Hyperopt efficiently requires care. Whether you are just getting started with the library, or are already using Hyperopt and have had problems scaling it or getting good results, this blog is for you. It will explore common problems and solutions to ensure you can find the best model without wasting time and money. It will show how to:

  • Specify the Hyperopt search space correctly
  • Debug common errors
  • Utilize parallelism on an Apache Spark cluster optimally
  • Optimize execution of Hyperopt trials
  • Use MLflow to track models

What is Hyperopt?

Hyperopt is a powerful tool for tuning ML models with Apache Spark

Hyperopt is a Python library that can optimize a function’s value over complex spaces of inputs. For machine learning specifically, this means it can optimize a model’s accuracy (loss, really) over a space of hyperparameters. It’s a Bayesian optimizer, meaning it is not merely randomly searching or searching a grid, but intelligently learning which combinations of values work well as it goes, and focusing the search there.

There are many optimization packages out there, but Hyperopt has several things going for it, not least its simplicity and flexibility.

That flexibility is a double-edged sword: Hyperopt makes no assumptions about the task and puts the burden of specifying the bounds of the search correctly on the user. Done right, Hyperopt is a powerful way to efficiently find a best model. However, there are a number of best practices to know with Hyperopt for specifying the search, executing it efficiently, debugging problems and obtaining the best model via MLflow.

Specifying the space: what’s a hyperparameter?

When using any tuning framework, it’s necessary to specify which hyperparameters to tune. But, what are hyperparameters?

They’re not the parameters of a model, which are learned from the data, like the coefficients in a linear regression, or the weights in a deep learning network. Hyperparameters are inputs to the modeling process itself, which chooses the best parameters. This includes, for example, the strength of regularization in fitting a model. Scalar parameters to a model are probably hyperparameters. Whatever doesn’t have an obvious single correct value is fair game.

Some arguments are not tunable because there’s one correct value. For example, xgboost wants an objective function to minimize. For classification, it’s often reg:logistic. For regression problems, it’s reg:squarederror. But these are not alternatives in one problem: it makes no sense to try reg:squarederror for classification. Similarly, in generalized linear models, there is often one link function that correctly corresponds to the problem being solved, not a choice. For a simpler example: you don’t need to tune verbose anywhere!

Some arguments are ambiguous because they are tunable, but primarily affect speed. Consider n_jobs in scikit-learn implementations. This controls the number of parallel threads used to build the model. It should not affect the final model’s quality, so it’s not something to tune as a hyperparameter.

Similarly, parameters like convergence tolerances aren’t likely something to tune. Too large, and the model accuracy does suffer, but small values basically just spend more compute cycles. These are the kinds of arguments that can be left at a default.

In the same vein, the number of epochs in a deep learning model is probably not something to tune. Training should stop when accuracy stops improving via early stopping. See “How (Not) To Scale Deep Learning in 6 Easy Steps” for more discussion of this idea.

Specifying the space: what range to choose?

Next, what range of values is appropriate for each hyperparameter? Sometimes it’s obvious. For example, if choosing Adam versus SGD as the optimizer when training a neural network, then those are clearly the only two possible choices.

For scalar values, it’s not as clear. Hyperopt requires a minimum and maximum. In some cases the minimum is clear; a learning rate-like parameter can only be positive. An elastic net parameter is a ratio, so it must be between 0 and 1. But what is, say, a reasonable maximum “gamma” parameter in a support vector machine? It’s necessary to consult the implementation’s documentation to understand hard minimums or maximums and the default value.

If in doubt, choose bounds that are extreme and let Hyperopt learn what values aren’t working well. For example, if a regularization parameter is typically between 1 and 10, try values from 0 to 100. The range should include the default value, certainly. At worst, it may spend time trying extreme values that do not work well at all, but it should learn and stop wasting trials on bad values. This may mean subsequently re-running the search with a narrowed range after an initial exploration to better explore reasonable values.

Some hyperparameters have a large impact on runtime. A large max tree depth in tree-based algorithms can cause them to fit models that are large and expensive to train, for example. Worse, sometimes models take a long time to train because they are overfitting the data! Hyperopt does not try to learn about the runtime of trials or factor that into its choice of hyperparameters. If some tasks fail for lack of memory or run very slowly, examine their hyperparameters. Sometimes it will reveal that certain settings are just too expensive to consider.

A final subtlety is the difference between uniform and log-uniform hyperparameter spaces. Hyperopt offers hp.uniform and hp.loguniform, both of which produce real values in a min/max range. hp.loguniform is more suitable when one might choose a geometric series of values to try (0.001, 0.01, 0.1) rather than arithmetic (0.1, 0.2, 0.3). Which one is more suitable depends on the context, and typically does not make a large difference, but is worth considering.
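
For instance, here is a small sketch of a search space combining the two (the parameter names are illustrative):

import numpy as np
from hyperopt import hp

search_space = {
    # A learning rate varies over orders of magnitude, so sample it log-uniformly
    # between 1e-5 and 1e-1; hp.loguniform takes the log of the bounds.
    'learning_rate': hp.loguniform('learning_rate', np.log(1e-5), np.log(1e-1)),
    # A ratio like the elastic net mixing parameter is naturally uniform on [0, 1].
    'l1_ratio': hp.uniform('l1_ratio', 0.0, 1.0),
}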

To recap, a reasonable workflow with Hyperopt is as follows:

  • Choose what hyperparameters are reasonable to optimize
  • Define broad ranges for each of the hyperparameters (including the default where applicable)
  • Run a small number of trials
  • Observe the results in an MLflow parallel coordinate plot and select the runs with lowest loss
  • Move the range towards those higher/lower values when the best runs’ hyperparameter values are pushed against one end of a range
  • Determine whether certain hyperparameter values cause fitting to take a long time (and avoid those values)
  • Re-run with more trials
  • Repeat until the best runs are comfortably within the given search bounds and none are taking excessive time

Use hp.quniform for scalars, hp.choice for categoricals

Consider choosing the maximum depth of a tree building process. This must be an integer like 3 or 10. Hyperopt offers hp.choice and hp.randint to choose an integer from a range, and users commonly choose hp.choice as a sensible-looking range type.

However, these are exactly the wrong choices for such a hyperparameter. While these will generate integers in the right range, Hyperopt would not consider that a value of “10” is larger than “5” and much larger than “1”, as it would for scalar values. Yet, that is how a maximum depth parameter behaves. If 1 and 10 are bad choices, and 3 is good, then it should probably prefer to try 2 and 4, but it will not learn that with hp.choice or hp.randint.

Instead, the right choice is hp.quniform (“quantized uniform”) or hp.qloguniform to generate integers. hp.choice is the right choice when, for example, choosing among categorical choices (which might in some situations even be integers, but not usually).
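
For example, here is a sketch of a search space with one ordinal and one categorical hyperparameter (note that hp.quniform yields floats like 7.0, so cast to int inside the objective function):

from hyperopt import hp

search_space = {
    # Ordinal integer: quantized uniform over 2..16 in steps of 1
    'max_depth': hp.quniform('max_depth', 2, 16, 1),
    # Truly categorical: hp.choice is the right tool here
    'criterion': hp.choice('criterion', ['gini', 'entropy']),
}

def objective(params):
    max_depth = int(params['max_depth'])   # cast the quantized float to an int
    ...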

Here are a few common types of hyperparameters, and a likely Hyperopt range type to choose to describe them:

 

| Hyperparameter Type | Suggested Hyperopt range |
| ------------------- | ------------------------ |
| Maximum depth, number of trees, max ‘bins’ in Spark ML decision trees | hp.quniform with min >= 1 |
| Learning rate | hp.loguniform with max = 0 (because exp(0) = 1.0) |
| Regularization strength | hp.uniform with min = 0, or hp.loguniform |
| Ratios or fractions, like Elastic net ratio | hp.uniform with min = 0, max = 1 |
| Shrinkage factors like eta in xgboost | hp.uniform with min = 0, max = 1 |
| Loss criterion in decision trees (ex: gini vs entropy) | hp.choice |
| Activation function (e.g. ReLU vs leaky ReLU) | hp.choice |
| Optimizer (e.g. Adam vs SGD) | hp.choice |
| Neural net layer width, embedding size | hp.quniform with min >= 1 |

One final caveat: when using hp.choice over, say, two choices like “adam” and “sgd”, the value that Hyperopt sends to the function (and which is auto-logged by MLflow) is an integer index like 0 or 1, not a string like “adam”. To log the actual value of the choice, it’s necessary to consult the list of choices supplied. Example:


optimizers = ["adam", "sgd"]
search_space = {
  ...
  'optimizer': hp.choice("optimizer", optimizers)
}
 
def my_objective(params):
  ...
  the_optimizer = optimizers[params['optimizer']]
  mlflow.log_param('optimizer', the_optimizer)
  ...

 

“There are no evaluation tasks, cannot return argmin of task losses”

One error that users commonly encounter with Hyperopt is: There are no evaluation tasks, cannot return argmin of task losses.

This means that no trial completed successfully. This almost always means that there is a bug in the objective function, and every invocation is resulting in an error. See the error output in the logs for details. In Databricks, the underlying error is surfaced for easier debugging.

It can also arise if the model fitting process is not prepared to deal with missing / NaN values, and is always returning a NaN loss.

Sometimes it’s “normal” for the objective function to fail to compute a loss. Sometimes a particular configuration of hyperparameters does not work at all with the training data — maybe choosing to add a certain exogenous variable in a time series model causes it to fail to fit. It’s OK to let the objective function fail in a few cases if that’s expected. It’s also possible to simply return a very large dummy loss value in these cases to help Hyperopt learn that the hyperparameter combination does not work well.
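
A sketch of both approaches, where train_and_evaluate is a hypothetical helper standing in for your own model-fitting code:

from hyperopt import STATUS_OK, STATUS_FAIL

def objective(params):
    try:
        loss = train_and_evaluate(params)   # hypothetical: fit a model and compute its loss
    except ValueError:
        # An expected failure for some hyperparameter combinations:
        # mark the trial as failed so Hyperopt ignores it...
        return {'status': STATUS_FAIL}
        # ...or, alternatively, return a large dummy loss instead:
        # return {'status': STATUS_OK, 'loss': 1e6}
    return {'status': STATUS_OK, 'loss': loss}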

Setting SparkTrials parallelism optimally

Hyperopt can parallelize its trials across a Spark cluster, which is a great feature. Building and evaluating a model for each set of hyperparameters is inherently parallelizable, as each trial is independent of the others. Using Spark to execute trials is simply a matter of using “SparkTrials” instead of “Trials” in Hyperopt. This is a great idea in environments like Databricks where a Spark cluster is readily available.

SparkTrials takes a parallelism parameter, which specifies how many trials are run in parallel. Of course, setting this too low wastes resources. If running on a cluster with 32 cores, then running just 2 trials in parallel leaves 30 cores idle.

Setting parallelism too high can cause a subtler problem. With a 32-core cluster, it’s natural to choose parallelism=32 of course, to maximize usage of the cluster’s resources. Setting it higher than cluster parallelism is counterproductive, as each wave of trials will see some trials waiting to execute.

However, Hyperopt’s tuning process is iterative, so setting it to exactly 32 may not be ideal either. It uses the results of completed trials to compute and try the next-best set of hyperparameters. Consider the case where max_evals, the total number of trials, is also 32. If parallelism is 32, then all 32 trials would launch at once, with no knowledge of each other’s results. It would effectively be a random search.

parallelism should likely be an order of magnitude smaller than max_evals. That is, given a target number of total trials, adjust cluster size to match a parallelism that’s much smaller. If targeting 200 trials, consider parallelism of 20 and a cluster with about 20 cores.

There’s more to this rule of thumb. It’s also not effective to have a large parallelism when the number of hyperparameters being tuned is small. For example, if searching over 4 hyperparameters, parallelism should not be much larger than 4. 8 or 16 may be fine, but 64 may not help a lot. With many trials and few hyperparameters to vary, the search becomes more speculative and random. It doesn’t hurt, it just may not help much.

Set parallelism to a small multiple of the number of hyperparameters, and allocate cluster resources accordingly. How to choose max_evals after that is covered below.
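
A minimal sketch of wiring this together, assuming an objective function and search space like those defined earlier:

from hyperopt import fmin, tpe, SparkTrials

# e.g., 4 hyperparameters being tuned, so a parallelism of 8 is a reasonable small multiple
spark_trials = SparkTrials(parallelism=8)

best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=200,
    trials=spark_trials,
)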

Leveraging task parallelism optimally

There’s a little more to that calculation. Some machine learning libraries can take advantage of multiple threads on one machine. For example, several scikit-learn implementations have an n_jobs parameter that sets the number of threads the fitting process can use.

Although a single Spark task is assumed to use one core, nothing stops the task from using multiple cores. For example, with 16 cores available, one can run 16 single-threaded tasks, or 4 tasks that use 4 cores each. The latter is actually advantageous if the fitting process can efficiently use, say, 4 cores. This is because Hyperopt is iterative, and returning fewer results faster improves its ability to learn from early results when scheduling the next trials. In this scenario, trials 5-8 could learn from the results of 1-4 if those first 4 tasks used 4 cores each to complete quickly, whereas if all trials were run at once, none of the trials’ hyperparameter choices would have the benefit of information from any of the others’ results.

How to set n_jobs (or the equivalent parameter in other frameworks, like nthread in xgboost) optimally depends on the framework. scikit-learn and xgboost implementations can typically benefit from several cores, though they see diminishing returns beyond that, but it depends. One solution is simply to set n_jobs (or equivalent) higher than 1 without telling Spark that tasks will use more than 1 core. The executor VM may be overcommitted, but will certainly be fully utilized. If not taken to an extreme, this can be close enough.

This affects thinking about the setting of parallelism. If a Hyperopt fitting process can reasonably use parallelism = 8, then by default one would allocate a cluster with 8 cores to execute it. But if the individual tasks can each use 4 cores, then allocating a 4 * 8 = 32-core cluster would be advantageous.

Ideally, it’s possible to tell Spark that each task will want 4 cores in this example. This is done by setting spark.task.cpus. This will help Spark avoid scheduling too many core-hungry tasks on one machine. The disadvantage is that this is a cluster-wide configuration, which will cause all Spark jobs executed in the session to assume 4 cores for any task. This is only reasonable if the tuning job is the only work executing within the session. Simply not setting this value may work out well enough in practice.
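
On Databricks, spark.task.cpus is normally set in the cluster’s Spark config; for a self-managed Spark application, a sketch of supplying it at session creation (assuming 4 cores per fitting task) looks like:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hyperopt-tuning")
    .config("spark.task.cpus", "4")   # tell Spark that each task may use 4 cores
    .getOrCreate()
)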

Optimizing Spark-based ML jobs

The examples above have contemplated tuning a modeling job that uses a single-node library like scikit-learn or xgboost. Hyperopt can equally be used to tune modeling jobs that leverage Spark for parallelism, such as those from Spark ML, xgboost4j-spark, or Horovod with Keras or PyTorch.

However, in these cases, the modeling job itself is already getting parallelism from the Spark cluster. Just use Trials, not SparkTrials, with Hyperopt. Jobs will execute serially. Hence, it’s important to tune the Spark-based library’s execution to maximize efficiency; there is no Hyperopt parallelism to tune or worry about.

Avoid large serialized objects in the objective function

When using SparkTrials, Hyperopt parallelizes execution of the supplied objective function across a Spark cluster. This means the function is magically serialized, like any Spark function, along with any objects the function refers to.

This can be bad if the function references a large object like a large DL model or a huge data set.


model = # load large model
train, test = # load data

def my_objective():
  ...
  model.fit(train, ...)
  model.evaluate(test, ...)

Hyperopt has to send the model and data to the executors repeatedly every time the function is invoked. This can dramatically slow down tuning. Instead, it’s better to broadcast these, which is a fine idea even if the model or data aren’t huge:


model = # load large model
train, test = # load data
b_model = spark.broadcast(model)
b_train = spark.broadcast(train)
b_test = spark.broadcast(test)

def my_objective():
  ...
  b_model.value.fit(b_train.value, ...)
  b_model.value.evaluate(b_test.value, ...)

However, this will not work if the broadcasted object is more than 2GB in size. It may also be necessary to, for example, convert the data into a form that is serializable (using a NumPy array instead of a pandas DataFrame) to make this pattern work.

If not possible to broadcast, then there’s no way around the overhead of loading the model and/or data each time. The objective function has to load these artifacts directly from distributed storage. This works, and at least, the data isn’t all being sent from a single driver to each worker.

Use Early Stopping

Optimizing a model’s loss with Hyperopt is an iterative process, just like (for example) training a neural network is. It keeps improving some metric, like the loss of a model. However, at some point the optimization stops making much progress. It’s possible that Hyperopt struggles to find a set of hyperparameters that produces a better loss than the best one so far. You may observe that the best loss isn’t going down at all towards the end of a tuning process.

It’s advantageous to stop running trials if progress has stopped. Hyperopt offers an early_stop_fn parameter, which specifies a function that decides when to stop trials before max_evals has been reached. Hyperopt provides a function no_progress_loss, which can stop iteration if best loss hasn’t improved in n trials.
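
For example, here is a sketch where the search stops once the best loss has not improved for 20 consecutive trials:

from hyperopt import fmin, tpe, Trials
from hyperopt.early_stop import no_progress_loss

best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=500,
    trials=Trials(),
    early_stop_fn=no_progress_loss(20),   # stop after 20 trials with no improvement
)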

How should I set max_evals?

Below is some general guidance on how to choose a value for max_evals:

| Parameter Expression | Optimal Results | Fastest Results |
| -------------------- | --------------- | --------------- |
| Ordinal parameters (hp.uniform, hp.quniform, hp.loguniform, hp.qloguniform) | 20 x # parameters | 10 x # parameters |
| Categorical parameters (hp.choice) | 15 x total categorical breadth* | 15 x total categorical breadth* |

* “total categorical breadth” is the total number of categorical choices in the space.  If you have hp.choice with two options “on, off”, and another with five options “a, b, c, d, e”, your total categorical breadth is 10.

| Modifier | Optimal Results | Fastest Results |
| -------- | --------------- | --------------- |
| Parallelism | x # of workers | x ½ # of workers |

By adding the two numbers together, you can get a base number to use when thinking about how many evaluations to run, before applying multipliers for things like parallelism.

Example: You have two hp.uniform, one hp.loguniform, and two hp.quniform hyperparameters, as well as three hp.choice parameters. Two of them have 2 choices, and the third has 5 choices. To calculate the range for max_evals, we take 5 x 10-20 = (50, 100) for the ordinal parameters, and then 15 x (2 x 2 x 5) = 300 for the categorical parameters, resulting in a range of 350-450. With no parallelism, we would then choose a number from that range, depending on how you want to trade off between speed (closer to 350) and getting the optimal result (closer to 450). As you might imagine, a value of 400 strikes a balance between the two and is a reasonable choice for most situations. If we wanted to use 8 parallel workers (using SparkTrials), we would multiply these numbers by the appropriate modifier: in this case, 4x for speed and 8x for optimal results, resulting in a range of 1400 to 3600, with 2500 being a reasonable balance between speed and the optimal result.

One final note: when we say “optimal results”, what we mean is confidence of optimal results. It is possible, and even probable, that the fastest value and optimal value will give similar results. However, by specifying and then running more evaluations, we allow Hyperopt to better learn about the hyperparameter space, and we gain higher confidence in the quality of our best seen result.

Avoid cross validation in the objective function

The objective function optimized by Hyperopt, primarily, returns a loss value. Given hyperparameter values that Hyperopt chooses, the function computes the loss for a model built with those hyperparameters. It returns a dict including the loss value under the key ‘loss’:

return {'status': STATUS_OK, 'loss': loss}

To do this, the function has to split the data into a training and validation set in order to train the model and then evaluate its loss on held-out data. A train-validation split is normal and essential.

It’s common in machine learning to perform k-fold cross-validation when fitting a model. Instead of fitting one model on one train-validation split, k models are fit on k different splits of the data. This can produce a better estimate of the loss, because many models’ loss estimates are averaged.

However, it’s worth considering whether cross validation is worthwhile in a hyperparameter tuning task. It improves the accuracy of each loss estimate, and provides information about the certainty of that estimate, but it comes at a price: k models are fit, not one. That means each task runs roughly k times longer. This time could also have been spent exploring k other hyperparameter combinations. That is, increasing max_evals by a factor of k is probably better than adding k-fold cross-validation, all else equal.

If k-fold cross validation is performed anyway, it’s possible to at least make use of additional information that it provides. With k losses, it’s possible to estimate the variance of the loss, a measure of uncertainty of its value. This is useful to Hyperopt because it is updating a probability distribution over the loss. To do so, return an estimate of the variance under “loss_variance”. Note that the losses returned from cross validation are just an estimate of the true population loss, so return the Bessel-corrected estimate:


losses = # list of k model losses (assumes numpy is imported as np)
return {'status': STATUS_OK,
        'loss': np.mean(losses),
        'loss_variance': np.var(losses, ddof=1)}

Note: Some specific model types, like certain time series forecasting models, estimate the variance of the prediction inherently without cross validation. If so, it’s useful to return that as above.

Choosing the Right Loss

An optimization process is only as good as the metric being optimized. Models are evaluated according to the loss returned from the objective function. Sometimes the model provides an obvious loss metric, but that may not accurately describe the model’s usefulness to the business.

For example, classifiers are often optimizing a loss function like cross-entropy loss. This expresses the model’s “incorrectness” but does not take into account which way the model is wrong. Returning “true” when the right answer is “false” is as bad as the reverse in this loss function. However it may be much more important that the model rarely returns false negatives (“false” when the right answer is “true”). Recall captures that more than cross-entropy loss, so it’s probably better to optimize for recall. It’s reasonable to return recall of a classifier in this case, not its loss. Note that Hyperopt is minimizing the returned loss value, whereas higher recall values are better, so it’s necessary in a case like this to return -recall.
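
A sketch of returning negative recall as the loss, where fit_model, X_val and y_val are hypothetical placeholders for your own training code and held-out validation data:

from hyperopt import STATUS_OK
from sklearn.metrics import recall_score

def objective(params):
    model = fit_model(params)            # hypothetical: train a classifier with these hyperparameters
    preds = model.predict(X_val)         # hypothetical held-out validation features
    recall = recall_score(y_val, preds)
    # Hyperopt minimizes the loss, so negate a higher-is-better metric
    return {'status': STATUS_OK, 'loss': -recall}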

Retraining the best model

Hyperopt selects the hyperparameters that produce a model with the lowest loss, and nothing more. Because it integrates with MLflow, the results of every Hyperopt trial can be automatically logged with no additional code in the Databricks workspace. The results of many trials can then be compared in the MLflow Tracking Server UI to understand the results of the search. Hundreds of runs can be compared in a parallel coordinates plot, for example, to understand which combinations appear to be producing the best loss.

This is useful in the early stages of model optimization where, for example, it’s not even so clear what is worth optimizing, or what ranges of values are reasonable.

However, the MLflow integration does not (cannot, actually) automatically log the models fit by each Hyperopt trial. This is not a bad thing. It may not be desirable to spend time saving every single model when only the best one would possibly be useful.

It is possible to manually log each model from within the function if desired; simply call MLflow APIs to add this or anything else to the auto-logged information. For example:


def my_objective(params):
    model = # fit a model using params
    ...
    # note the argument order: the fitted model first, then the artifact path
    mlflow.sklearn.log_model(model, "model")
    ...

 

Although up for debate, it’s reasonable to instead take the optimal hyperparameters determined by Hyperopt and re-fit one final model on all of the data, and log it with MLflow. While the hyperparameter tuning process had to restrict training to a train set, it’s no longer necessary to fit the final model on just the training set. With the ‘best’ hyperparameters, a model fit on all the data might yield slightly better parameters. The disadvantage is that the generalization error of this final model can’t be evaluated, although there is reason to believe that was well estimated by Hyperopt. A sketch of how to tune, and then refit and log a model, follows:


all_data = # load all data
train, test = # split all_data to train, test

def fit_model(params, data):
  model = # fit model to data with params
  return model

def my_objective(params):
  model = fit_model(params, train)
  # evaluate and return loss on test

best_params = fmin(fn=my_objective, …)

final_model = fit_model(best_params, all_data)
mlflow.sklearn.log_model(final_model, "model")

More best practices

If you’re interested in more tips and best practices, see additional resources:

Conclusion

This blog covered best practices for using Hyperopt to automatically select the best machine learning model, as well as common problems and issues in specifying the search correctly and executing its search efficiently. It covered best practices for distributed execution on a Spark cluster and debugging failures, as well as integration with MLflow.

With these best practices in hand, you can leverage Hyperopt’s simplicity to quickly integrate efficient model selection into any machine learning pipeline.

Get started

Use Hyperopt on Databricks (with Spark and MLflow) to build your best model!

--

Try Databricks for free. Get started today.

The post How (Not) to Tune Your Model with Hyperopt appeared first on Databricks.

Attack of the Delta Clones (Against Disaster Recovery Availability Complexity)


Notebook: Using Deep Clone for Disaster Recovery with Delta Lake on Databricks

For most businesses, the creation of a business continuity plan is crucial to ensure vital services, such as data stores, remain online in the event of a disaster,  emergency or other issue. For many, it is mission critical that data teams can still use the Databricks platform even in the rare case of a regional cloud outage, whether caused by a disaster like a hurricane or some other unforeseen event. As noted in the Azure and AWS disaster recovery guides, Databricks is often a core part of an overall data ecosystem, including, but not limited to, upstream data ingestion, sophisticated data pipelines, cloud-native storage, machine learning and artificial intelligence, business intelligence and orchestration. Some use cases might be particularly sensitive to a regional service-wide outage.

Disaster recovery – the tools, policies and procedures in place to recover or ensure continuity of your data infrastructure – is a crucial component of any business continuity plan. Delta clones simplify data replication, enabling you to develop an effective recovery strategy for your Delta tables. Using Delta clones allows you to quickly and easily incrementally synchronize data in the correct order between your primary and secondary sites or regions. Delta uses its transaction log to perform this synchronization, analogous to how RDBMS replication relies on its logs to restore or recover the database to a stable version. While solutions such as cloud multi-region synchronization may solve some problems, these processes are typically asynchronous, which can result in operations being applied out of order and, ultimately, in data corruption.

This article shows how Delta clones can avoid these issues and facilitate DR by controlling the process of data synchronization between data centers.

What are clones again?

Naturally, the first question is: what are clones?

Clones are replicas of a source table at a given point in time. They have the same metadata as the source table: the same schema, constraints, column descriptions, statistics and partitioning. Note, however, that clones have a separate, independent history from the source table. For example, time travel queries on your source table and clone may not return the same result.

A shallow (also known as zero-copy) clone only duplicates the metadata of the table being cloned; the data files of the table itself are not copied. Because this type of cloning does not create another physical copy of the data, the storage costs are minimal. Shallow clones are not resource-intensive and can be extremely fast to create. However, these clones are not self-contained and maintain a dependency on the source from which they were cloned. Shallow clones are beneficial for testing and experimentation –  such as for staging structural changes against your production table without actually modifying it. For more information, refer to Easily Clone your Delta Lake for Testing, Sharing, and ML Reproducibility.

A deep clone makes a full copy of the metadata and data files of the table being cloned. In that sense, it is similar to copying with a CTAS command (CREATE TABLE... AS... SELECT...). However, it’s simpler because it makes a faithful copy of the current version of the original table at that point in time, and you don’t need to re-specify partitioning options, constraints and other information as you have to do with CTAS. In addition, it’s much faster, more robust and can work in an incremental manner. This last point is critical in that it enables an efficient solution to replicate only the data that is required to protect against failures, instead of all of the data.

A deep clone makes a full copy of the metadata and data files of the Delta table being cloned.

source:https://raw.githubusercontent.com/databricks/tech-talks/master/images/delta-lake-deep-clone.png


Deep clones are useful for:
  • Testing in a production environment without risking production data processes and affecting users
  • Staging major changes to a production table
  • Ensuring reproducibility of ML results
  • Data migration, sharing and/or archiving

In this article, we’ll be focusing on the role of Delta deep clones in disaster recovery.

Show me the clones!

Creating a clone can be done with the following SQL command:

CREATE OR REPLACE TABLE loan_details_delta_clone
DEEP CLONE loan_details_delta;

You can query both the original table (loan_details_delta) and the cloned table (loan_details_delta_clone) using the following SQL statements:

-- Original view of data
SELECT addr_state, funded_amnt FROM loan_details_delta GROUP BY addr_state, funded_amnt

-- Clone view of data
SELECT addr_state, funded_amnt FROM loan_details_delta_clone GROUP BY addr_state, funded_amnt

The following graphic shows the results using a Databricks notebook map visualization.
A cloned Delta table is an exact replication of the original, as demonstrated by this Databricks notebook map visualization.

https://raw.githubusercontent.com/databricks/tech-talks/master/images/delta-lake-original-clone-view-map.png

An important feature of deep clones is that they allow incremental updates. That is, instead of copying the entire table to ensure consistency between the original and the clone, only rows that contain changes to the data (e.g., where records have been updated, deleted, merged or inserted) will need to be copied and/or modified the next time the table is cloned.

Deep Delta cloned tables allow for incremental updates

https://raw.githubusercontent.com/databricks/tech-talks/master/images/deep-clone-incremental-update.png

To illustrate how incremental updates work with deep clones, let’s delete some rows in our original table (loan_details_delta) using the following SQL statement (step 1 in the above diagram):

DELETE FROM loan_details_delta WHERE addr_state = 'OH';

At this point, the original table (loan_details_delta) no longer contains rows for Ohio (OH), while the cloned table (loan_details_delta_clone) still contains those rows. To re-sync the two tables, we perform the clone operation again (step 2):

CREATE OR REPLACE TABLE loan_details_delta_clone
DEEP CLONE loan_details_delta

The clone and original tables are now back in sync (this will be readily apparent in the following sections). But instead of copying the entire content of the original table, the rows were deleted incrementally, significantly speeding up the process. The cloned table will only update the rows that were modified in the original table, so in this case, the rows corresponding to Ohio were removed, but all the other rows remained unchanged.

Okay, let’s recover from this disaster of a blog

While disaster recovery by cloning is conceptually straightforward, as any DBA or DevOps engineer will attest, the practical implementation of a disaster recovery solution is far more complex. Many production systems require a two-way disaster recovery process, analogous to an active–active cluster for relational database systems. When the active server (Source) goes offline, the secondary server (Clone) needs to come online for both read and write operations. All subsequent changes to the system (e.g., inserts and modifications to the data) are recorded by the secondary server, which is now the active server in the cluster. Then, once the original server (Source) comes back online, you need to re-sync all the changes performed on Clone back to Source. Upon completion of the re-sync, the Source server becomes active again and the Clone server returns to its secondary state. Because the systems are constantly serving and/or modifying data, it is important that the copies of data synchronize quickly to eliminate data loss.

Deep clones make it easy to perform this workflow, even on a multi-region distributed system, as illustrated in the following graphic.

Delta deep clones make it easy to perform the synchronization workflow, even on a multi-region distributed system.

Source:https://raw.githubusercontent.com/databricks/tech-talks/master/images/delta-lake-deep-clone-timeline.png

In this example, Source is a table in the active Databricks region and Clone is the table in the secondary region:

  • At t0: An insert/update statement (merge) is executed on the Source table, and we then execute a DEEP CLONE to keep the Source and Clone tables in sync.
  • At t1: The two tables remain in sync.
  • At t2: The Source table is not accessible. The Clone table now becomes Source’, which is where all queries and data modifications take place from this timestamp forward.
  • At t3: A DELETE statement is executed on Source’.
  • At t4: The Source table is accessible now, but Source’ and Source are not in sync.
  • At t4’: We run a DEEP CLONE to synchronize Source’ and Source.
  • At t5: Now that the two copies are synchronized, Source resumes the identity of the active table and Source’ that of the secondary table, Clone.

Next, we’ll show you how to perform these steps using SQL commands in Databricks. You can also follow along by running the Databricks notebook Using Deep Clone for Disaster Recovery with Delta Lake on Databricks.

Modify the source

We begin by modifying the data in the table in our active region (Source) as part of our normal data processing.

In step one of the Delta clone process, data in the active region is modified as part of the normal data processing workflow.

Source:https://raw.githubusercontent.com/databricks/tech-talks/master/images/delta-lake-deep-clone-timeline-0.png

Example update to the Delta table illustrating step one of the Delta cloning Process.

Source:https://raw.githubusercontent.com/databricks/tech-talks/master/images/delta-lake-insert-update-example.png

In this case, we’ll implement the merge using an UPDATE statement (to update the Amount column) for `TX` and an INSERT statement for `OH` (to insert new rows based on entries from `TX` and `AZ`).

Check the versions of the source and clone tables

We started this scenario at t0, where the loan_details_delta and loan_details_delta_clone tables were in sync; then we modified the loan_details_delta table. How can we tell if the tables have the same version of the data without querying both and comparing them? With Delta Lake, this information is stored within the transaction log, so the DEEP CLONE statements can automatically determine both the source and clone versions in a single table query.

When you execute DESCRIBE HISTORY DeltaTable, you will get something similar to the following screenshot.

With Delta Lake, version information is stored within the transaction log, so the DEEP CLONE statements can automatically determine both the source and clone versions in a single table query.

Note: original table is loan_details_delta while clone table is loan_details_delta_clone.

Diving deeper into this:

  • For the source table, we query the most recent version to determine the table version – here, version 2.
  • For the clone table, we query the most recent operationParameters.sourceVersion to identify which version of the source table the clone table has – here, version 1.

As noted, all of this information is stored within the Delta Lake transaction log. You can also use the checkTableVersions() function included in the associated notebook to query the transaction log to verify the versions of the two tables:

checkTableVersions()

Delta Lake Original Table Version: 2, Cloned Table Version: 1

For more information on the log, check out Diving into Delta Lake: Unpacking the Transaction Log.
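
The helper’s implementation isn’t reproduced in this post, but a hedged sketch of what such a function could look like, using the spark session available in a Databricks notebook, is:

def check_table_versions(source="loan_details_delta", clone="loan_details_delta_clone"):
    # Latest version of the source table, read from its transaction log
    source_version = (
        spark.sql(f"DESCRIBE HISTORY {source}")
             .selectExpr("max(version) AS v")
             .first()["v"]
    )
    # Source version recorded by the most recent CLONE operation on the clone table
    clone_source_version = (
        spark.sql(f"DESCRIBE HISTORY {clone}")
             .where("operation = 'CLONE'")
             .orderBy("version", ascending=False)
             .selectExpr("operationParameters['sourceVersion'] AS v")
             .first()["v"]
    )
    print(f"Delta Lake Original Table Version: {source_version}, "
          f"Cloned Table Version: {clone_source_version}")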

Re-sync the source and clone

As we saw in the preceding section, the source table is now on v2 while the clone table is on v1.

Source:https://raw.githubusercontent.com/databricks/tech-talks/master/images/delta-lake-deep-clone-timeline-0b1.png

To synchronize the two tables, we run the following command at t1:

CREATE OR REPLACE TABLE loan_details_delta_clone
DEEP CLONE loan_details_delta;

To synchronize the two tables with Delta clone, you run a CREATE OR REPLACE TABLE ... DEEP CLONE statement.

Source:https://raw.githubusercontent.com/databricks/tech-talks/master/images/delta-lake-deep-clone-timeline-1.png

Source table is not accessible

At t2, the Source table is not accessible. Whatever causes this – from user error to an entire region going down – you have to admit this is not a great life hack for waking up.

Production developers should always be prepared for their systems to go offline.

Source:https://www.reddit.com/r/ProgrammerHumor/comments/kvwj9f/burn_the_backups_if_you_need_that_extra_kick

Jokes aside, the reality is that you should always be prepared for a production system to go offline. Fortunately, because of your business continuity plan, you have a secondary clone where you can redirect your services to read and modify your data.

Data correction

With your original Source unavailable, your table in the secondary region (Clone) is now the Source’ table.

With Delta clones, when your original source is unavailable, your table in the secondary region (the clone) stands ready to take its place.

Source:https://raw.githubusercontent.com/databricks/tech-talks/master/images/delta-lake-deep-clone-timeline-2.png

Some services that do not need to modify data right away can switch to read-only mode while Source is not accessible, but many services and production environments cannot afford such delays. In this case, at t3 we need to modify the Source’ data and DELETE some records:

-- Running `DELETE` on the Delta Lake Source' table
DELETE FROM loan_details_delta_clone WHERE addr_state = 'OH';

If you review the table history or run checkTableVersions(), you will see that the clone (Source’) table is now at version 4 after running the DELETE statement:

# Check the table versions
checkTableVersions(2)

Delta Lake Original Table Version: None, Cloned Table Version: 4

Because the original Source table is unreachable, its version is reported as None.

With Delta clones, when the original source is unreachable, its version is reported as None.

Source:https://raw.githubusercontent.com/databricks/tech-talks/master/images/delta-lake-deep-clone-timeline-3b.png

The reason the clone table is on version 4 can be quickly determined by reviewing its history, which shows the three previous CLONE operations and the DELETE command:

DESCRIBE HISTORY loan_details_delta_clone

 

| version | timestamp | operation |
| ------- | --------- | --------- |
| 4       | ...       | DELETE    |
| 3       | ...       | CLONE     |
| 2       | ...       | CLONE     |
| 1       | ...       | CLONE     |
| 0       | ...       | CREATE TABLE AS SELECT |

Getting back to the source

Whew, after <insert fix here>, the original Source table is back online!

With Delta clones, once the original source is back online, it will be out of sync with its clone.

Source:https://raw.githubusercontent.com/databricks/tech-talks/master/images/delta-lake-deep-clone-timeline-4b.png

But as we saw in the previous steps, there are now differences between Source and the Source’ replica. Fortunately, fixing this problem is easy:

/* Fail back from the `loan_details_delta_clone` table to the `loan_details_delta` table */
CREATE OR REPLACE TABLE loan_details_delta
DEEP CLONE loan_details_delta_clone;

By running a DEEP CLONE to replace the original Source table (loan_details_delta) with its clone (loan_details_delta_clone), we can quickly return to the original state.

With Delta Lake, by running a DEEP CLONE to replace the original Source table with its clone, we can quickly return to the original state.

Source:https://raw.githubusercontent.com/databricks/tech-talks/master/images/delta-lake-deep-clone-timeline-5.png

Now all of our services can point back to the original Source, and the Source’ table returns to its Clone (or secondary) state.

Some of you may have noted that the Source table is now version 3 while the Clone table is version 4. This is because the version number is associated with the number of operations performed on the table. In this example, the Source table had fewer operations:

DESCRIBE HISTORY loan_details_delta

| version | timestamp | operation |
| ------- | --------- | --------- |
| 3       | ...       | CLONE     |
| 2       | ...       | MERGE     |
| 1       | ...       | DELETE    |
| 0       | ...       | WRITE     |

Words of caution

This method for disaster recovery ensures the availability of both reads and writes regardless of outages. However, this comes at the cost of possibly losing intermediate changes. Consider the following scenario.

With the Delta clones method for data disaster recovery, you are assured of availability for both reads and writes, but there is the possibility you’ll lose intermediate changes under some scenarios.

Notice that there is an update at t=1, which happens between the two CLONE operations. It’s likely that any changes made during this interval will be lost at t=4 when the second CLONE operation occurs. To ensure no changes are lost, you’d have to guarantee that no writes to the Source table occur from t=1 to t=4. This could be challenging to accomplish, considering that an entire region may be misbehaving. That being said, there are many use cases where availability is the more important consideration.

Summary

This article has demonstrated how to perform two-way disaster recovery using the DEEP CLONE feature with Delta Lake on Databricks. Using only SQL statements with Delta Lake, you can significantly simplify and speed up data replication as part of your business continuity plan. For more information on Delta clones, refer to Easily Clone your Delta Lake for Testing, Sharing, and ML Reproducibility. Check out Using Deep Clone for Disaster Recovery with Delta Lake on Databricks to walk through this exercise yourself with Databricks Runtime.

Acknowledgements

We would like to thank Peter Stern, Rachel Head, Ryan Kennedy, Afsana Afzal, Ashley Trainor for their invaluable contributions to this blog.

--

Try Databricks for free. Get started today.

The post Attack of the Delta Clones (Against Disaster Recovery Availability Complexity) appeared first on Databricks.


Private Databricks Workspaces With AWS PrivateLink Is in Public Preview


We’re excited to announce that PrivateLink connectivity for Databricks workspaces on AWS (Amazon Web Services) is now in public preview, with full support for production deployments. This release applies to all AWS regions supporting E2 architecture, as part of the Enterprise pricing tier. We have received great feedback from our global customers, including large financial services, healthcare and communications organizations, during the feature’s private preview period, as it allows them to deploy private workspaces of the Databricks Lakehouse Platform on AWS. Customers can enforce cloud-native, private-only connectivity for both front-end and back-end interfaces of Databricks workspaces, thus satisfying a major requirement of their enterprise governance policies.

Private Databricks workspaces with AWS PrivateLink overview

A Databricks workspace enables you to leverage enhanced security capabilities through a simple and well-integrated architecture. AWS PrivateLink for Databricks E2 workspaces enables the following benefits:

  • Private connectivity to front-end interfaces: Configure AWS VPC (virtual private cloud) endpoints to Databricks front-end interfaces and ensure that all user/client traffic to Notebooks, SQL Endpoints, the REST API (including the CLI) and Databricks Connect transits over your private network and the AWS network backbone.
  • Private connectivity to back-end interfaces: If you deploy a Databricks workspace in your own customer-managed VPC using secure cluster connectivity, you can configure AWS VPC endpoints to Databricks back-end interfaces and ensure that all cluster traffic to the secure cluster connectivity relay and internal APIs transits over your private network and the AWS network backbone.
  • Increased reliability and scalability: Your data platform is now more reliable and scalable for large and extra-large workloads, as there’s no dependency on launching public IPs for cluster nodes and attaching them to the corresponding network interfaces. Additionally, the workspace traffic is not subject to bandwidth availability on public networks.

All traffic to Databricks front-end and back-end interfaces transits over the customer’s private network and AWS’s network backbone.

At a high level, the product architecture consists of a control/management plane and a data plane. The control plane resides in a Databricks AWS account and hosts services such as the web application, cluster manager, jobs service, SQL gateway, etc. The data plane, which is in your AWS account, consists of a customer-managed VPC (minimum two subnets), a security group and a root Amazon S3 bucket known as DBFS.

You can deploy a workspace with PrivateLink for both front-end and back-end interfaces using a combination of the E2 Account API and the AWS CLI/CloudFormation, or using our technical-field-managed Terraform resource provider. We recommend the latter if you already use Terraform for automating your infrastructure and configuration management.

Getting Started with Private Databricks Workspaces with AWS PrivateLink

Get started with the enhanced security capabilities by deploying Private Databricks Workspaces with AWS PrivateLink. Please refer to the following resources:

Please refer to Platform Security for Enterprises for a deeper view into how we bring a security-first mindset while building the most popular lakehouse platform on AWS.

--

Try Databricks for free. Get started today.

The post Private Databricks Workspaces With AWS PrivateLink Is in Public Preview appeared first on Databricks.

How We Launched a Podcast: Lessons, (Minor) Mishaps & Key Takeaways


After six episodes featuring amazing leaders and practitioners in the data and AI community, we wrapped up season 1 of Data Brew by Databricks, our homegrown podcast hosted by us two – Denny and Brooke. This season focused on all things lakehouses – combining the key features of data warehouses, such as ACID transactions, with the scalability of data lakes, directly against low-cost object stores.

In the coming weeks, we’ll be launching season 2 of Data Brew, and trust us, you’ll want to hyper-tune in for it. We had the opportunity to interview some of the brightest minds in research and industry to dive into the ever-changing world of machine learning. But before we kick off season 2, we thought it’d be a good time to reflect on what we learned by building our own podcast from the ground up. Hopefully, this will inspire other folks in the community to get creative, learn from our experience and launch their own podcast!

So, here it is: how to launch a podcast 101.

Identifying & developing your brand

One of the hardest parts about starting Data Brew, frankly, was coming up with a creative name and building an entire brand around it. To us, a name serves to both give a sense of what listeners can expect from your podcast and also a glimpse as to who you, as the hosts, are. However, virtually every name we came up with was already in use.

We chose Data Brew for a few reasons. It immediately gives the listener a clear idea of the topics and audience we’re speaking to: data scientists, data engineers, analysts, etc. That’s the Data part. Brew was a personal touch since Denny, a Seattle resident, is a devout coffee drinker and Brooke loves tea – and even took a course on tea in college. (Yes, we know it’s a bit weird talking about ourselves in the third person, but oh well). On top of that, the phrase together really conveyed what we’re about – steeping into the great ideas and insights from some of the best and brightest minds in data. It took a while, but eventually, we realized that our names, Denny and Brooke, jointly had the initials of Databricks, thus unlocking the potential for many alliterations: Data Brew by Databricks with Denny & Brooke.

Data Brew by Databricks with Denny & Brooke

The podcast’s name and overall POV are really the foundation to building an impactful brand. For us, this meant having some stellar design assets created playing off our “brew” vibe. You’ll even spot some of our hacky stickers we put on coffee mugs early on (thank goodness for trusty 15-year old label makers).

Practice, practice, practice

We learned this the hard way – even if you say something in your head 10 times, the moment when you try to say it live for the first time, it doesn’t always come out right (if at all). Luckily, we learned a couple of tricks that greatly helped us:

Rehearse but don’t script

Sometimes you’re intimately familiar with the guests on your podcast; other times you’re essentially strangers. Regardless, you should always research the speakers you invite. In addition to having an understanding of their professional experience, find out their hobbies and fun facts, and bring them up in the interview (or the interview prep talk). It’s a great way to get them to be more relaxed, likely to make them laugh, and can greatly increase your connection with them.

We always send our guests a set of proposed questions ahead of time so that no topic catches them off-guard, but still provides room for a natural conversation. One mistake we made early on was writing up our questions asynchronously and not saying them aloud. When read off the script or memorized, it sounded very formal and awkward. Written English and casual podcast English sound very different. We were basically guiding the listener through the questions rather than the conversation. To fix this, we now send just the list of topics and a few questions to initially kickstart the conversation.

Sometimes, you run through your topics faster than you anticipated, have a mental block or just find it hard to continue the conversation. Have some backup questions in your arsenal that you can ask any speaker to keep the conversation on track and flowing. Maybe you’ll even find you don’t get through all of your questions – having a good conversation is more important than rushing through them!

We often get the question “will there be a dry run?” We opted not to do them (and sometimes no meet/greet). Instead, we add a bit of buffer to the start of the recording to loosen up, discuss topics at a high level, etc. We listened to an incredibly helpful podcast from a16z on moderating talks & panels, and they advised against it. Some problems with dry runs include people scripting their responses, or referencing something during the recording that came up in the dry run but that the audience isn’t aware of. So instead, we block off 15 minutes for setup and discussion, and then 5-10 minutes at the end for all the “off the record” discussions and next steps.

Finding your Voice

We all have our favorite podcasts/vidcasts. When we started this process, we listened to many different podcasters to see what their style and “voices” were like for inspiration. We found there are things that we loved and wanted to replicate, and other things that we wanted to avoid. Denny & Brooke perceive the “audience” differently and have different frames of reference, and we really enjoyed the co-host dynamic. Just remember you’re the eyes and ears of the audience and keep a clear goal for what you want them to get out of your podcast.

Speaking of voice, vocal stamina is a thing. You should definitely keep tons of liquids around. Just be careful if you’re drinking caffeine – it definitely provides you with a burst of energy, but you’re likely to speak faster when you’re caffeinated. Brooke recommends hot water with honey & lemon as the honey naturally coats your throat. Denny, on the other hand, highly recommends getting creative and drinking your soothing latte art.


Source: https://www.instagram.com/p/CLzoatQBCRK/ (yes, please follow Denny!)

Getting technical: recording & producing

We generally live by Hofstadter’s Law: “It always takes longer than you expect, even when you take into account Hofstadter’s Law.” However, we had no idea just how much work goes into the actual production side of launching even a single episode. Huge kudos to all podcasters before us. Luckily, we have some learnings to pass on:

Find the right tools

Things aren’t always as they appear. We started recording our episodes with Zoom but quickly learned that it records at a lower quality than it broadcasts due to everyone now using Zoom at home (and yes, we paid for the premo version!). We then pivoted to having everyone record locally with Quicktime, and our awesome video editor stitched them all together. It also makes the conversation more candid and relaxing to know that any redos or mess-ups can just be cut out without even noticing.

Given that we’re computer scientists and need to design for redundancy, we still had the Zoom recording as our backup. The lesson here: don’t try to do the editing at home! If you have the budget for a video editor, they will save you loads of time and do a better job :).

Since podcasts are audio-centric, good audio quality is a must. A cardioid mic is one way to pick up your voice without too much background noise, though you’ll still want to record in as quiet a setting as possible. Other tips include making sure the mic isn’t too close or too far from the speaker and testing out the acoustics of your room. It’s shocking how even characteristics like high ceilings can drastically change the output. Try it yourself: record yourself with your laptop’s built-in microphone and then with an external microphone. When you listen back, you’ll hear a huge difference in quality. You don’t necessarily need to invest in a cardioid mic, but even a good headset with a built-in mic will do wonders.

A few tips from our video editor

Since Data Brew is a vidcast and a podcast, we had to get our camera setup and backgrounds just right. This doesn’t mean buying a green screen or brand new furniture. Here are some basic tips:

  • Find the right camera angle. Especially when people are working on laptops, the camera is often positioned below eye level. In extreme cases, you get a pretty view of one’s nostrils. A simple fix is to prop your laptop up on a few dense textbooks until it is at eye level. Having an external keyboard will make this much more comfortable. Look directly into the camera (this is surprisingly difficult!) and try to prevent the camera from pointing at the ceiling.
  • Lights, lights, and more lights. If you ever present on a mainstage, you’ll notice that you are blinded by all the lights. If that kind of setup isn’t available, go for direct, soft light that faces you. Avoid having direct light behind you, or you might just show up as a dark silhouette.
  • Get a good external webcam. You don’t need a mirrorless camera (at least not to start with), but it is a good idea to get a 1080p or higher webcam so your video (and your hair) looks great!
  • Don’t read off a script. Or at least, don’t make it obvious. Try placing any notes as high up on your screen as possible to keep them near eye level. But, most importantly, practice as much as you can so you don’t need to read too much.

Was it worth it?

Even though it was a lot of work and there was a huge learning curve, it was an amazing experience. We really enjoyed building connections with thought leaders and hearing their experiences (turns out people are eager to share their knowledge). And, Denny & Brooke are still friends and didn’t kill each other (virtually). Wins all around! In addition, we applied a lot of these best practices for internal meetings as well as moderating customer panels for conferences. True transfer learning at its finest!

Speaking of machine learning, tune in to season 2 to see these tips in action and learn more about ML from experts in industry and academia, including Matei Zaharia, Erin LeDell, Ameet Talwalkar, and many more!

--

Try Databricks for free. Get started today.

The post How We Launched a Podcast: Lessons, (Minor) Mishaps & Key Takeaways appeared first on Databricks.

Reproduce Anything: Machine Learning Meets Lakehouse


Machine learning has proven to add unprecedented value to organizations and projects – whether that’s for accelerating innovation, personalization, demand forecasting or countless other use cases. However, machine learning (ML) leverages data from a myriad of sources with an ever-changing ecosystem of tools and dependencies, leaving these solutions constantly in flux and difficult to reproduce.

While no one can guarantee their model is 100% correct, experiments that have a reproducible model and results are more likely to be trusted than those that are not. A reproducible ML experiment implies that we are able to at least reproduce the following:

  • Training/validation/test data
  • Compute
  • Environment
  • Model (and associated hyperparameters, etc.)
  • Code

However, reproducibility in ML is a much more difficult task than it appears. You need access to the same underlying data the model was trained on, but how can you guarantee that data hasn’t changed? Did you version control your data in addition to your source code? On top of that, which libraries (and versions), hyperparameters and models were used? Worse yet, does the code successfully run end-to-end?

In this blog, we’ll walk through how the lakehouse architecture built on Delta Lake coupled with the open-source library MLflow helps solve these replication challenges. In particular, this blog covers:

  • Lakehouse architecture
  • Data versioning with Delta Lake
  • Tracking experiments with MLflow
  • End-to-end reproducibility with Databricks

What is a lakehouse (and why you should care)

As a data scientist, you might not care where your underlying data comes from – a CSV, relational database, etc. But let’s say you’re working with training data that is updated nightly. You build a model today with a given set of hyperparameters, but tomorrow you want to improve this model and tweak some of those hyperparameters. Well, did the model performance improve because of the updated hyperparameters or because the underlying data changed? Without being able to version your data and compare apples to apples, there is no way to know! You might say, “Well, I’ll just snapshot all my data,” but that could be very costly, would go stale quickly and is difficult to maintain and version. You need a single source of truth for your data that is scalable, always up to date and provides data versioning without snapshotting your entire dataset.

This is where a lakehouse comes in. Lakehouses combine the best qualities of data warehouses and data lakes. Now, you can have the scalability and low-cost storage of data lakes with the speed and ACID transactional guarantees of data warehouses. This enables you to have a single source of truth for your data, and you never need to experience stale, inconsistent data again. It accomplishes this by augmenting your existing data lake with metadata management for optimized performance, eliminating the need to copy data around to a data warehouse. You get data versioning, reliable and fault-tolerant transactions, and a fast query engine, all while maintaining open standards. Now, you can have a single solution for all major data workloads – from streaming analytics to BI, data science, and AI. This is the new standard.

So this sounds great in theory, but how do you get started?

Data versioning with Delta Lake

Delta Lake is an open-source project that powers the lakehouse architecture. While there are a few open-source lakehouse projects, we favor Delta Lake for its tight integration with Apache Spark™ and its support for the following features:

  • ACID transactions
  • Scalable metadata handling
  • Time travel
  • Schema evolution
  • Audit history
  • Deletes and updates
  • Unified batch and streaming
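
Getting data into Delta format takes very little code. As a rough sketch (wine_df is a hypothetical Spark DataFrame, and data_path an arbitrary storage path matching the names used later in this post):

# Write a Spark DataFrame out in Delta format; every subsequent write
# becomes a new version recorded in the Delta transaction log
(wine_df.write
    .format('delta')
    .mode('overwrite')
    .save(data_path))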

Good ML starts with high-quality data. By using Delta Lake and some of the aforementioned features, you can ensure that your data science projects start on a solid foundation (get it, lakehouse, foundation?). With constant changes and updates to the data, ACID transactions ensure that data integrity is maintained across concurrent reads and writes, whether they are batch or streaming. This way, everyone has a consistent view of the data.

Delta Lake tracks only the “delta,” or the changes since the previous commit, and stores them in the Delta transaction log. This enables time travel based on data versions, so you can keep the data constant while making changes to the model, hyperparameters, etc. However, you’re not locked into a given schema with Delta: it supports schema evolution, so you’re able to add additional features as input to your machine learning models.
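
As a minimal sketch of schema evolution in practice (new_features_df is a hypothetical DataFrame carrying the extra feature columns), appending with mergeSchema enabled evolves the table rather than rejecting the write:

# Append rows that contain additional feature columns; mergeSchema lets the
# Delta table's schema evolve instead of failing on the mismatch
(new_features_df.write
    .format('delta')
    .mode('append')
    .option('mergeSchema', 'true')
    .save(data_path))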

We can view the changes in the transaction log using the history() method from the Delta APIs:

Delta history API
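
In a Python notebook, that looks roughly like the following (assuming the table lives at the same data_path used in the snippets below):

from delta.tables import DeltaTable

# Load the Delta table and display its transaction history:
# version, timestamp, operation, operationParameters, etc.
delta_table = DeltaTable.forPath(spark, data_path)
delta_table.history().show(truncate=False)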

This makes it easy to trace the lineage of all changes to the underlying data, ensuring that your model can be reproduced with exactly the same data it was built on. You can specify a version number or a timestamp when you load your data from Delta Lake:

version = 1 
wine_df_delta = spark.read.format('delta').option('versionAsOf', version).load(data_path)

# Version by Timestamp
timestamp = '2021-03-02T15:33:29.000+0000'
wine_df_delta = spark.read.format('delta').option('timeStampAsOf', timestamp).load(data_path)

Tracking models with MLflow

Once you’re able to reliably reproduce your data, the next step is reproducing your model. The open-source library MLflow includes 4 components for managing the ML lifecycle and greatly simplifies experiment reproducibility.

4 components of MLflow

MLflow tracking allows you to log hyperparameters, metrics, code, model and any additional artifacts (such as files, plots, data versions, etc.) to a central location. This includes logging Delta tables and corresponding versions to ensure data consistency for each run (avoiding the need to actually copy or snapshot the entire data). Let’s take a look at an example of building a random forest model on the wine dataset, and logging our experiment with MLflow. The entire code can be found in this notebook.

import numpy as np
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# X_train, X_test, y_train, y_test, data_version and seed are assumed to be
# defined earlier (see the accompanying notebook for the full pipeline).
with mlflow.start_run() as run:
    # Log params (including the Delta data version used for training)
    n_estimators = 1000
    max_features = 'sqrt'
    params = {'data_version': data_version,
              'n_estimators': n_estimators,
              'max_features': max_features}
    mlflow.log_params(params)

    # Train and log the model
    rf = RandomForestRegressor(n_estimators=n_estimators,
                               max_features=max_features,
                               random_state=seed)
    rf.fit(X_train, y_train)
    mlflow.sklearn.log_model(rf, 'model')

    # Evaluate on the held-out test set and log metrics
    preds = rf.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    metrics = {'rmse': rmse,
               'mae': mae,
               'r2': r2}
    mlflow.log_metrics(metrics)

The results are logged to the MLflow tracking UI, which you can access by selecting the Experiment icon in the upper right-hand corner of your Databricks notebook (unless you provide a different experiment location). From here, you can compare runs, filter based on certain metrics or parameters, etc.

 MLflow UI

In addition to manually logging your parameters, metrics and so forth, MLflow provides autologging capabilities for several built-in model flavors. For example, to automatically log an sklearn model, you simply add mlflow.sklearn.autolog(), and it will log the parameters and metrics, generate confusion matrices for classification problems and much more whenever estimator.fit() is called.
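
As a rough sketch of what that looks like (using sklearn’s built-in wine dataset purely for illustration):

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Enable autologging for all subsequent sklearn training runs
mlflow.sklearn.autolog()

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    # Parameters, training metrics, the fitted model and (for classifiers)
    # artifacts such as a confusion matrix are logged automatically
    clf.fit(X_train, y_train)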

When logging a model to the tracking server, MLflow creates a standard model packaging format. It automatically creates a conda.yaml file, which outlines the channels, dependencies and versions required to recreate the environment needed to load the model. This means you can easily mirror the environment of any model tracked and logged to MLflow.

conda.yaml logged in MLflow UI
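
If you want to inspect that environment spec programmatically, one approach is to pull the conda.yaml artifact down with the MLflow tracking client (run_id below is a placeholder for the run you want to reproduce):

from mlflow.tracking import MlflowClient

client = MlflowClient()
run_id = '<your-run-id>'  # placeholder: the MLflow run whose environment you want to mirror

# Download the environment spec logged alongside the model
local_path = client.download_artifacts(run_id, 'model/conda.yaml')
print(open(local_path).read())

You can then recreate a matching environment with, for example, conda env create -f conda.yaml.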

When using managed MLflow on the Databricks platform, there is a ‘reproduce run’ feature that allows you to reproduce training runs with the click of a button. It automatically snapshots your Databricks notebook, cluster configuration and any additional libraries you might have installed.

Reproduce Run option on Databricks

Check out this reproduce run feature and see if you can reproduce your own experiments or those of your coworkers!

Putting it all together

Now that you’ve learned how the lakehouse architecture with Delta Lake and MLflow addresses the data, model, code and environment challenges of ML reproducibility, take a look at this notebook and reproduce our experiment for yourself! Even with the ability to reproduce the aforementioned items, there might still be some things outside of your control. Regardless, building ML solutions with Delta Lake and MLflow on Databricks addresses the vast majority of issues people face when reproducing ML experiments.

Interested to learn what other problems you can solve with a data lakehouse? Read this recent blog on the challenges of the traditional two-tier data architecture and how the lakehouse architecture is helping businesses overcome them.

--

Try Databricks for free. Get started today.

The post Reproduce Anything: Machine Learning Meets Lakehouse appeared first on Databricks.

A Guide to Data + AI Summit Sessions: Machine Learning, Data Engineering, Apache Spark and More


We are only a few weeks away from Data + AI Summit, returning May 24–28. If you haven’t signed up yet, take advantage of free registration for five days of virtual engagement: training, talks, meetups, AMAs and community camaraderie.

To help you navigate through hundreds of sessions, I am sharing some of the content — including deep dives — that I’m excited about.

Below are a few additional picks for developer-focused Apache Spark talks. Use the code JulesDAIS2021 for 25% off pre-conference training!

Deep Dive Into the New Features of Apache Spark™ 3.1

Monitor Apache Spark™ 3 on Kubernetes Using Metrics and Plugins

Efficient Distributed Hyperparameter Tuning With Apache Spark™

The Rise of ZStandard: Apache Spark™/Parquet/ORC/Avro

Project Zen: Making Data Science Easier in PySpark

Grow your knowledge tree by joining 100,000 of your fellow data professionals at Data + AI Summit.

--

Try Databricks for free. Get started today.

The post A Guide to Data + AI Summit Sessions: Machine Learning, Data Engineering, Apache Spark and More appeared first on Databricks.

Databricks Named Data Science & Analytics Launch Partner for New AWS for Media & Entertainment Initiative


“Digital transformation” isn’t just a buzzword – especially in the media and entertainment industry. More than just a more efficient way of creating or distributing content, the move to the cloud for media workflows is a necessity to meet audiences’ demand for “more content, right now” (while my kids can watch the same show twenty times in a row, I like to watch something new occasionally).

Databricks named launch partner for new AWS media & entertainment initiative

To support this growing need, Amazon Web Services (AWS) launched AWS for Media & Entertainment, an initiative featuring new and existing services and solutions from AWS and AWS Partners, built specifically for content creators, rights holders, producers, broadcasters and distributors. AWS has added the newly-announced Amazon Nimble Studio, a service that enables customers to set up creative studios in hours instead of weeks, to a portfolio that contains more purpose-built media and entertainment industry services than any other cloud platform. This portfolio includes services such as AWS Elemental MediaPackage, AWS Elemental MediaConnect, AWS Elemental MediaLive, AWS Elemental MediaConvert, and Amazon Interactive Video Service (IVS).

AWS for Media & Entertainment also simplifies the process of building, deploying and reinventing mission-critical industry workloads by aligning AWS and AWS Partner capabilities against five solution areas: Content Production; Media Supply Chain & Archive; Broadcast; Direct-to-Consumer & Streaming; and Data Science & Analytics.

We are happy to announce that Databricks is a key launch partner for AWS for Media & Entertainment in the Data Science & Analytics area – a natural fit, since many media companies use Databricks to leverage audience, content and viewing behavior data to build better, more personalized experiences. Through the cloud-based media lakehouse, customers can easily merge structured and unstructured data sources, whether streaming or batch, to build a 360 view of their business from end to end. AWS customers are using Databricks to:

  • Acquire, engage and retain subscribers through advanced audience analytics
  • Improve the user experience with real-time OTT and video streaming analytics
  • Optimize advertising revenues through better segmentation, personalization and attribution
  • Maximize content monetization through better understanding of content lifecycle management

You can hear Databricks + AWS customers like Disney and Comcast talk through how they are creating better audience experiences using the media lakehouse paradigm at Data + AI Summit 2021.

To stay up to date on the latest M&E use cases, solutions, and customer stories, visit our Databricks Media & Entertainment page.

--

Try Databricks for free. Get started today.

The post Databricks Named Data Science & Analytics Launch Partner for New AWS for Media & Entertainment Initiative appeared first on Databricks.
