Announcing Hackathon for Social Good

Data Teams Unite!

We’re excited to announce our first-ever virtual and global hackathon, where you’ll form data teams to help tackle climate change, the COVID-19 pandemic or issues unique to your local community.

Data scientists, engineers and analysts are invited to collaborate and innovate for social good in the Spark + AI Summit Hackathon for Social Good.

Your challenge

Apply your ideas and data skills to help address real-world problems.

To participate in the hackathon, follow these steps:

  1. Register a data team (up to four participants) on the Hackathon for Social Good website.
  2. Build an application or create a compelling notebook of your analysis that allows end users to better understand data related to these issues. Your application or notebook should use data analysis, data science or machine learning technologies featured in the Spark + AI Summit.
  3. Submit your project, along with a video screencast describing the potential social good impact.

Unite for a cause

By participating in the Hackathon for Social Good, your team will put its work toward a noble cause. In addition to helping us understand the data around these issues, the three winning teams will be invited to direct a donation to a charity of their choice, with a combined value of $35,000. Winning projects will also be announced in the Spark + AI Summit keynote on June 24 and recognized during special Summit events.

The grand-prize-winning team will direct a $20,000 donation to a charity and receive free training and VIP passes to the June 22–26 Spark + AI Summit, as well as complimentary passes to a future Spark + AI Summit.

Bring your best ideas to the biggest issues

When planning your hackathon project, we encourage you to focus on one of these three issues:

  1. Provide greater insights into the COVID-19 pandemic: Various COVID-19 data sets are now available on Databricks, Kaggle and GitHub. Use these sets — and other public sources — to surface insight into correlations, causes or potential solutions.
  2. Reduce the impact of climate change: Write an application or perform an analysis on the causes of or solutions to climate change.
  3. Drive social change in your community: What challenges do you see where you live and work? Check out a local city data set and inspire change close to home.

For details, including suggested data sets and complete submission and participation requirements, please visit the Hackathon for Social Good website.

Here’s what you need to know about timing:

Submissions:        April 22–June 12
Judging:            June 15–19
Winners announced:  June 24

If you have questions or comments, reach us at hackathon@databricks.com.

We can’t wait to see your projects.

START HACKING!

--

Try Databricks for free. Get started today.

The post Announcing Hackathon for Social Good appeared first on Databricks.


How a Fresh Approach to Safety Stock Analysis Can Optimize Inventory

Refer to the accompanying notebook for more details.

A manufacturer is working on an order for a customer only to find that the delivery of a critical part is delayed by a supplier. A retailer experiences an unforeseen spike in demand for beer and loses sales for lack of supply. Customers have a negative experience because these companies cannot meet demand. The companies lose immediate revenue, and their reputations are damaged. Does this sound familiar?

In an ideal world, demand for goods would be easily predictable. In practice, even the best forecasts are impacted by unexpected events. Disruptions happen due to raw material supply, freight and logistics, manufacturing breakdowns, unexpected demand and more. Retailers, distributors, manufacturers and suppliers all must wrestle with these challenges to ensure they are able to reliably meet their customers’ needs while also not carrying excessive inventory. This is where an improved method of safety stock analysis can help your business.

Organizations constantly work on allocating resources where they are needed to meet anticipated demand. The immediate focus is often on improving the accuracy of their forecasts. To achieve this goal, organizations are investing in scalable platforms, in-house expertise and sophisticated new models.

Even the best forecasts do not perfectly predict the future, and sudden shifts in demand can leave shelves bare. This was highlighted in early 2020 when concerns about the virus that causes COVID-19 led to widespread toilet paper stockouts. As Craig Boyan, the president of H-E-B, commented, “We sold in two weeks what we normally sell in two months.”

Scaling up production is not a simple solution to the problem. Georgia-Pacific, a leading manufacturer of toilet paper, estimated that the average American household would consume 40% more toilet paper as people stayed home during the pandemic. In response, the company was able to boost production by 20% across its 14 facilities configured for the production of toilet paper. Most mills already run operations 24 hours a day, seven days a week with fixed capacity, so any further increase in production would require an expansion in capacity enabled through the purchase of additional equipment or the building of new plants.

This bump in production output can have upstream consequences. Suppliers may struggle to provide the resources required by newly scaled and expanded manufacturing capacity. Toilet paper is a simple product, but its production depends on pulp shipped from forested regions of the U.S., Canada, Scandinavia and Russia as well as more locally sourced recycled paper fiber. It takes time for suppliers to harvest, process and ship the materials needed by manufacturers once initial reserves are exhausted.

A supply chain concept called the bullwhip effect underpins all this uncertainty. Distorted information throughout the supply chain can cause large inefficiencies in inventory, increased freight and logistics costs, inaccurate capacity planning and more. Manufacturers or retailers eager to return stocks to normal may trigger their suppliers to ramp production which in turn triggers upstream suppliers to ramp theirs. If not carefully managed, retailers and suppliers may find themselves with excess inventory and production capacity when demand returns to normal or even encounters a slight dip below normal as consumers work through a backlog of their own personal inventories. Careful consideration of the dynamics of demand along with scrutiny of the uncertainty around the demand we forecast is needed to mitigate this bullwhip effect.

Managing Uncertainty with Safety Stock Analysis

The kinds of shifts in consumer demand surrounding the COVID-19 pandemic are hard to predict, but they highlight an extreme example of the concept of uncertainty that every organization managing a supply chain must address. Even in periods of relatively normal consumer activity, demand for products and services varies and must be considered and actively managed against.

Predicted sales as a mean value of actual demand

Modern demand forecasting tools predict a mean value for demand, taking into consideration the effects of weekly and annual seasonality, long-term trends, holidays and events, and external influencers such as weather, promotions, the economy, and additional factors. They produce a singular value for forecasted demand that can be misleading, as half the time we expect to see demand below this value and the other half we expect to see demand above it.

The mean forecasted value is important to understand, but just as critical is an understanding of the uncertainty on either side of it. We can think of this uncertainty as providing a range of potential demand values, each of which has a quantifiable probability of being encountered. And by thinking of our forecasts this way, we can begin to have a conversation about what parts of this range we should attempt to address.

Statistically speaking, the full range of potential demand is infinite and, therefore, never 100% fully addressable. But long before we need to engage in any kind of theoretical dialogue, we can recognize that each incremental improvement in our ability to address the range of potential demand comes with a sizable (actually exponential) increase in inventory requirements. This leads us to pursue a targeted service level at which we attempt to address a specific proportion of the full range of possible demand that balances the revenue goals of our organization with the cost of inventory.

The consequence of defining this service level expectation is that we must carry a certain amount of extra inventory, above the volume required to address our mean forecasted demand, to serve as a buffer against uncertainty. This safety stock, when added to the cycle stock required to meet mean periodic demand, gives us the ability to address most (though not all) fluctuations in actual demand while balancing our overall organizational goals.

The relationship between cycle stock and safety stock in addressing periodic demand

Calculating the Required Safety Stock Levels

In the classic supply chain literature, safety stock is calculated using one of two formulas that address uncertainty in demand and uncertainty in delivery. As our focus in this article is on demand uncertainty, we can eliminate the consideration of uncertain lead times, leaving us with a single, simplified safety stock formula to consider:

Safety Stock = Ζ * √(PC/T) * σD

In a nutshell, this formula explains that safety stock is calculated as the average uncertainty in demand around the mean forecasted value (σD) multiplied by the square root of the duration of the (performance) cycle for which we are stocking (√(PC/T)) multiplied by a value associated with the portion of the range of uncertainty we wish to address (Ζ). Each component of this formula deserves a little explanation to ensure it is fully understood.
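
As a rough illustration of how the three terms combine, the formula can be sketched in Python, reading the duration term as the square root of the cycle duration PC divided by the period length T. The function name, parameter names and sample inputs are illustrative assumptions and are not taken from the accompanying notebook.

```python
import math

def safety_stock(z, cycle_duration, period_length, sigma_d):
    """Safety stock = Z * sqrt(PC / T) * sigma_D.

    cycle_duration (PC) and period_length (T) must be in the same time unit,
    and sigma_d must be the standard deviation of demand per period of length T.
    """
    if period_length <= 0 or cycle_duration < 0:
        raise ValueError("cycle_duration must be >= 0 and period_length must be > 0")
    return z * math.sqrt(cycle_duration / period_length) * sigma_d

# Hypothetical inputs: a 95% service level (z = 1.6449), a 7-day cycle,
# and a standard deviation of 20 units computed from daily demand values.
units = safety_stock(z=1.6449, cycle_duration=7, period_length=1, sigma_d=20.0)
print(f"safety stock: {units:.1f} units")
```

Because the standard deviation enters the formula linearly, any error in estimating it carries straight through to the resulting buffer.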

In the previous section of this article, we explained that demand exists as a range of potential values around a mean value which is what our forecast generates. If we assume this range is evenly distributed around this mean, we can calculate an average of this range on either side of the mean value. This is known as a standard deviation. The value σD, also known as the standard deviation of demand, provides us with a measure of the range of values around the mean.

Because we have assumed this range is balanced around the mean, it turns out that we can derive the proportion of the values in this range that exist some number of standard deviations from that mean. If we use our service level expectation to represent the proportion of potential demand we wish to address, we can back into the number of standard deviations in demand that we need to consider as part of our planning for safety stock. The actual math behind the calculation of the required number of standard deviations (known as z-scores and represented in the formula as Ζ) required to capture a percentage of the range of values gets a little complex, but luckily z-score tables are widely published and online calculators are available. With that said, here are some z-score values that correspond to some commonly employed service level expectations:

Service Level Expectation    Ζ (z-score)
80.00%                       0.8416
85.00%                       1.0364
90.00%                       1.2816
95.00%                       1.6449
97.00%                       1.8808
98.00%                       2.0537
99.00%                       2.3263
99.90%                       3.0902
99.99%                       3.7190
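
For service levels not listed in the table, the z-score can be computed rather than looked up. The following is a minimal sketch that assumes demand is normally distributed around the forecast mean and uses the inverse cumulative distribution function from Python's standard library; the helper name z_score is our own.

```python
from statistics import NormalDist

def z_score(service_level):
    """Standard deviations above the mean needed to cover the given share of demand.

    service_level must be strictly between 0 and 1.
    """
    return NormalDist().inv_cdf(service_level)

# Print a few of the service levels from the table for comparison.
for level in (0.80, 0.90, 0.95, 0.99):
    print(f"{level:.2%}  {z_score(level):.4f}")
```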

Finally, we get to the term that addresses the duration of the cycle for which we are calculating safety stock (√(PC/T)). Putting aside why it is we need the square root calculation, this is the simplest element of the formula to understand. The PC value represents the duration of the (performance) cycle for which we are calculating our safety stock. The division by T is simply a reminder that we need to express this duration in the same units as those used to calculate our standard deviation value. For example, if we were planning safety stock for a 7-day cycle, we can take the square root of 7 for this term so long as we have calculated the standard deviation of demand leveraging daily demand values.
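
One way to see where the square root comes from: if demand on each day is independent of the others, the variances of the daily values add up across the cycle, so the standard deviation of total demand grows with the square root of the number of days. The simulation below is a sketch under that independence assumption, with hypothetical daily mean and standard deviation values chosen only for illustration.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

DAILY_MEAN = 100.0   # hypothetical mean daily demand
DAILY_SIGMA = 20.0   # hypothetical standard deviation of daily demand
CYCLE_DAYS = 7

# Simulate many 7-day cycles of independent, normally distributed daily demand
# and record the total demand seen in each cycle.
cycle_totals = [
    sum(random.gauss(DAILY_MEAN, DAILY_SIGMA) for _ in range(CYCLE_DAYS))
    for _ in range(20_000)
]

simulated_sigma = statistics.stdev(cycle_totals)
formula_sigma = DAILY_SIGMA * math.sqrt(CYCLE_DAYS)
print(f"simulated: {simulated_sigma:.1f}, sigma * sqrt(7): {formula_sigma:.1f}")
```

When daily demand values are correlated, as they often are in practice, this relationship no longer holds exactly.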

Demand Variance Is Hard to Estimate

On the surface, the calculation of safety stock analysis requirements is fairly straightforward. In Supply Chain Management classes, students are often provided historical values for demand from which they can calculate the standard deviation component of the formula. Given a service level expectation, they can then quickly derive a z-score and pull together the safety stock requirements to meet that target level. But these numbers are wrong, or at least they are wrong outside a critical assumption that is almost never valid.

The sticking point in any safety stock calculation is the standard deviation of demand. The standard formula depends on knowing the variation associated with demand in the future period for which we are planning. It is extremely rare that variation in a time series is stable. Instead, it often changes with trends and seasonal patterns in the data. Events and external regressors exert their own influences as well.

To overcome this problem, supply chain software packages often substitute measures of forecast error such as the root mean squared error (RMSE) or mean absolute error (MAE) for the standard deviation of demand, but these values represent different (though related) concepts. This often leads to an underestimation of safety stock requirements as is illustrated in this chart within which a 92.7% service level is achieved despite the setting of a 95% expectation.

Required stocking only achieving a 92.7% service level when built using mean absolute error against a 95% service level goal
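
The direction of this effect can be sketched with a simplified calculation. Assuming forecast errors are normally distributed with zero mean (an assumption of ours, not something the chart above is stated to rely on), the mean absolute error equals the standard deviation multiplied by √(2/π), or roughly 80% of it. Sizing the buffer with MAE in place of the standard deviation therefore covers less of the demand range than intended. The figures this sketch prints come from the idealized assumption and are not expected to match the 92.7% in the chart.

```python
import math
from statistics import NormalDist

TARGET_SERVICE_LEVEL = 0.95
sigma = 1.0  # standard deviation of forecast errors; any positive value gives the same result

# For normally distributed, zero-mean errors, MAE = sigma * sqrt(2 / pi).
mae = sigma * math.sqrt(2 / math.pi)

z = NormalDist().inv_cdf(TARGET_SERVICE_LEVEL)

# Buffer sized with MAE where the formula expects the standard deviation.
buffer_with_mae = z * mae

# Share of demand outcomes that this smaller buffer actually covers.
achieved = NormalDist(mu=0.0, sigma=sigma).cdf(buffer_with_mae)
print(f"target: {TARGET_SERVICE_LEVEL:.1%}, achieved: {achieved:.1%}")
```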

And as most forecasting models work to minimize error while calculating a forecast mean, the irony is that improvements in model performance often exacerbate the problem of underestimation. It’s very likely this is behind the growing recognition that although many retailers work toward published service level expectations, most of them fall short of these goals.

Where Do We Go from Here and How Does Databricks Help?

An important first step in addressing the problem is recognizing the shortcomings in our safety stock analysis calculations. Recognition alone is seldom satisfying.

A few researchers are working to define techniques that better estimate demand variance for the explicit purpose of improving safety stock estimation, but there isn’t consensus as to how this should be performed. And software to make these techniques easier to implement isn’t widely available.

For now, we would strongly encourage supply chain managers to carefully examine their historical service level performance to see whether stated targets are being met. This requires the careful combination of past forecasts as well as historical actuals. Because of the cost of preserving data in traditional database platforms, many organizations do not keep past forecasts or atomic-level source data, but the use of cloud-based storage with data stored in high-performance, compressed formats accessed through on-demand computational technology — provided through platforms such as Databricks — can make this cost effective and provide improved query performance for many organizations.
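
A basic version of this check can be sketched as follows, assuming you have retained, for each past period, the quantity you stocked (forecast mean plus safety stock) and the demand that actually materialized. The function name and the sample history are hypothetical and serve only to illustrate the comparison.

```python
def achieved_service_level(stocked_units, actual_demand):
    """Share of periods in which the stocked quantity covered actual demand."""
    if len(stocked_units) != len(actual_demand) or not stocked_units:
        raise ValueError("inputs must be non-empty and of equal length")
    covered = sum(
        1 for stock, demand in zip(stocked_units, actual_demand) if demand <= stock
    )
    return covered / len(stocked_units)

# Hypothetical history for ten periods of a single item.
stocked = [120, 118, 125, 130, 122, 119, 127, 124, 121, 126]
actuals = [101, 117, 131, 112, 120, 119, 108, 129, 115, 110]

print(f"achieved service level: {achieved_service_level(stocked, actuals):.0%}")
```

Comparing this figure against the stated target, item by item and over a long enough history, indicates whether the current safety stock calculation is delivering what it promises.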

As automated or digitally-enabled fulfillment systems are deployed — required for many buy online pick up in-store (BOPIS) models — and begin generating real-time data on order fulfillment, companies will wish to use this data to detect out-of-stock issues that indicate the need to reassess service level expectations as well as in-store inventory management practices. Manufacturers that were limited to running these analyses on a daily routine may want to analyze and make adjustments per shift. Databricks’ streaming ingestion capabilities provide a solution, enabling companies to perform safety stock analysis with near real-time data.

Finally, consider exploring new methods of generating forecasts that provide better inputs into your inventory planning processes. The combination of Facebook Prophet with parallelization and autoscaling platforms such as Databricks has made timely, fine-grained forecasting a reality for many enterprises. Still other forecasting techniques, such as Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models, may allow you to examine shifts in demand variability that could prove very fruitful in designing a safety stock strategy.

The resolution of the safety stock challenge has significant potential benefits for organizations willing to undertake the journey, but as the path to the end state is not readily defined, flexibility is going to be the key to your success. We believe that Databricks is uniquely positioned to be the vehicle for this journey, and we look forward to working with our customers in navigating it together.

Databricks thanks Professor Sreekumar Bhaskaran at the Southern Methodist University Cox School of Business for his insights on this important topic.

The post How a Fresh Approach to Safety Stock Analysis Can Optimize Inventory appeared first on Databricks.

Glow 0.3.0 Introduces New Large-Scale Genomic Analysis Features

In October of last year, Databricks and the Regeneron Genetics Center® partnered to introduce Project Glow, an open-source analysis tool aimed at empowering genetics researchers to work on genomics projects at the scale of millions of samples. Since we introduced Glow, we have been busy at work adding new high-quality algorithms, improving performance, and making Glow’s APIs easier to use. Glow 0.3.0 was released on February 21, 2020 and improves Glow’s power and ease of use in performing large-scale, high-throughput genomic analysis. In this blog, we highlight features and improvements introduced in the 0.3.0 release.

Python and Scala APIs for Glow SQL functions

In this release, native Python and Scala APIs were introduced for all Glow SQL functions, similar to what is available for Spark SQL functions. In addition to improved simplicity, this provides enhanced compile-time safety. The SQL functions and their Python and Scala clients are generated from the same source so any new functionality in the future will always appear in all three languages. Please refer to Glow PySpark Functions for more information on Python APIs for these functions. A code example showing Python and Scala APIs for the function normalize_variant is presented at the end of the next section.

Improved variant normalization

The variant normalizer received a major performance improvement in this release. It still behaves like bcftools norm and vt normalize, but is about 2.5x faster and has a more flexible API. Moreover, the new normalizer is implemented as a function in addition to a transformer.

normalize_variants transformer: The improved transformer preserves the columns of the input dataframe, adds the normalization status to the dataframe, and has the option of adding the normalization results (including the normalized coordinates and alleles) to the dataframe as a new column. To start, we use the following command to read the original_variants_df dataframe. Figure 1 shows the variants in this dataframe.

    original_variants_df = spark.read \
        .format("vcf") \
        .option("includeSampleIds", False) \
        .load("/databricks-datasets/genomics/call-sets")

Glow SQL dataframe with original variants prior to applying the improved variant normalization provided by Glow 0.3.0.
Figure 1: The variant dataframe original_variants_df

The improved normalizer transformer can be applied on this dataframe using the following command. This uses the transformer syntax used by the previous version of the normalizer:

    import glow
    normalized_variants_df = glow.transform(
        "normalize_variants",
        original_variants_df,
        reference_genome_path="/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa"
    )

Example dataframe, demonstrating the improved variant normalization provided by Glow 0.3.0, the latest release of the joint open source genomic analysis project.
Figure 2: The normalized dataframe normalized_variants_df

Figure 2 shows the dataframe generated by the improved normalizer. The start, end, referenceAllele, and alternateAlleles fields are updated with the normalized values and a normalizationStatus column is added to the dataframe. This column contains a changed subfield that indicates whether normalization changed the variant, and an errorMessage subfield containing the error message, if an error occurred.

The newly introduced replace_columns option can be used to add the normalization results as a new column to the dataframe instead of replacing the original start, end, referenceAllele, and alternateAlleles fields:

    import glow
    # Keep the original columns and add the normalization results as a new column.
    normalized_noreplace_variants_df = glow.transform(
        "normalize_variants",
        original_variants_df,
        replace_columns="False",
        reference_genome_path="/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa"
    )

Example Glow SQL normalized dataframe, demonstrating Glow 0.3.0’s capability to add an additional column with normalization results.
Figure 3: The normalized dataframe normalized_noreplace_variants_df with normalization results added as a new column

Figure 3 shows the resulting dataframe. A normalizationResults column is added to the dataframe. This column contains the normalization status, along with normalized start, end, referenceAllele, and alternateAlleles subfields.

Since the multiallelic variant splitter is implemented as a separate transformer in this release, the mode option of the normalize_variants transformer is deprecated. Refer to the Variant Normalization documentation for more details on the normalize_variants transformer.

normalize_variant function: As mentioned above, this release introduces the normalize_variant SQL expression:

    from pyspark.sql.functions import expr
    function_normalized_variants_df = original_variants_df.withColumn(
        "normalizationResult",
        expr("normalize_variant(contigName, start, end, referenceAllele, alternateAlleles, '/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa')")
    )

As discussed in the previous section, this SQL expression function has Python and Scala APIs as well. Therefore, we can rewrite the previous code example as follows:

    from glow.functions import normalize_variant
    function_normalized_variants_df = original_variants_df.withColumn(
        "normalizationResult",
        normalize_variant(
            "contigName",
            "start",
            "end",
            "referenceAllele",
            "alternateAlleles",
            "/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa"
        )
    )

This example can also be easily ported to Scala:

    import io.projectglow.functions.normalize_variant
    import org.apache.spark.sql.functions.col
    val function_normalized_variants_df = original_variants_df.withColumn(
        "normalizationResult",
        normalize_variant(
            col("contigName"),
            col("start"),
            col("end"),
            col("referenceAllele"),
            col("alternateAlleles"),
            "/mnt/dbnucleus/dbgenomics/grch38/data/GRCh38_full_analysis_set_plus_decoy_hla.fa"
        )
    )

The result of any of the above commands will be the same as Figure 3.

A new transformer for splitting multiallelic variants

This release also introduced a new dataframe transformer called split_multiallelics. This transformer splits multiallelic variants into biallelic variants, and behaves similarly to vt decompose with the -s option. This behavior is more powerful than the behavior of the previous splitter, which behaved like GATK’s LeftAlignAndTrimVariants with --split-multi-allelics. In particular, the array-type INFO and genotype fields with elements corresponding to reference and alternate alleles are split into biallelic rows (see the -s option of vt decompose). So are the array-type genotype fields with elements sorted in colex order of genotype calls, e.g., the GL, PL, and GP fields in the VCF format. Moreover, an OLD_MULTIALLELIC INFO field is added to the dataframe to store the original multiallelic form of the split variants.

The following is an example of using the split_multiallelics transformer on the original_variants_df. Figure 4 contains the result of this transformation.

    import glow
    split_variants_df = glow.transform("split_multiallelics", original_variants_df)

Example Glow SQL dataframe, demonstrating Glow 0.3.0’s new dataframe transformer called split_multiallelics, which splits multiallelic variants into biallelic variants.
Figure 4: The split dataframe split_variants_df

Please note that the new splitter is implemented as a separate transformer from the normalize_variants transformer. Previously, splitting could only be done as one of the operation modes of the normalize_variants transformer using the now-deprecated mode option. Please refer to the documentation of the split_multiallelics transformer for complete details on the behavior of this new transformer.

Parsing of Annotation Fields

The VCF reader and pipe transformer now parse variant annotations from tools such as SnpEff and VEP. This flattens the ANN and CSQ INFO fields, which simplifies and accelerates queries on annotations. Figure 5 shows the output of the code below, which queries the annotated consequences in a VCF annotated using the LOFTEE VEP plugin.

    from pyspark.sql.functions import expr
    variants_df = spark.read \
        .format("vcf") \
        .load("dbfs:/databricks-datasets/genomics/vcfs/loftee.vcf")
    # Explode the parsed CSQ annotations into one row per annotation,
    # then expand each annotation struct into top-level columns.
    annotated_variants_df = variants_df.withColumn( \
        "Exploded_INFO_CSQ", \
        expr("explode(INFO_CSQ)") \
    ) \
    .selectExpr("contigName", \
        "start", \
        "end", \
        "referenceAllele", \
        "alternateAlleles", \
        "expand_struct(Exploded_INFO_CSQ)", \
        "genotypes" \
    )

Example Glow SQL dataframe, demonstrating Glow 0.3.0’s ability to parse variant annotations from tools such as SnpEff and VEP.
Figure 5: The annotated dataframe annotated_variants_df with expanded subfields of the exploded INFO_CSQ

Other Data Analysis Improvements

Glow 0.3.0 also includes optimized implementations of the linear and logistic regression functions, resulting in ~50% performance improvements. See the documentation at Linear regression and Logistic regression.

Furthermore, the new release supports Scala 2.12 in addition to Scala 2.11. The Maven artifacts for both Scala versions are available on Maven Central.

Try Glow 0.3.0!

Glow 0.3 is installed in the Databricks Genomics Runtime (Azure | AWS) and is optimized for improved performance when using cloud computing to analyze large genomics datasets. Learn more about our genomics solutions and how we’re helping to further human and agricultural genome research and enable advances like population-scale next-generation sequencing in the Databricks Unified Analytics Platform for Genomics and try out a preview today.

The post Glow 0.3.0 Introduces New Large-Scale Genomic Analysis Features appeared first on Databricks.

New study: Databricks delivers nearly $29 million in economic benefits and pays for itself in less than six months

A new commissioned study by Forrester Consulting on behalf of Databricks finds that Databricks customers experience revenue acceleration, improved data team productivity and infrastructure savings resulting in a 417% ROI on their data analytics and AI projects.

According to Forrester, “In today’s hypercompetitive business environment, harnessing and applying data, business analytics, and machine learning at every opportunity to differentiate products and customer experiences is fast becoming a prerequisite for success.” So it’s no wonder enterprises are betting big on data analytics and AI. In fact, roughly 65% of CIOs at Fortune 1000 companies plan to invest over $50 million in data and AI projects in 2020.

But not all data and AI strategies are created equal. Recent studies have shown that only one in three data and AI projects is successful. This is largely a result of legacy analytics investments that lack the scale, collaboration features, and modern big data and AI capabilities required to build and deploy advanced analytics products. With organizations struggling to deliver data-driven innovation, this raises the question: What enterprise-grade technologies should organizations invest in to help their data teams be successful? And how can organizations quantify the impact of these investments?

Delivering measurable business value with Databricks

To help answer these questions, Databricks commissioned a Forrester Consulting study: The Total Economic Impact™ (TEI) of the Databricks Unified Data Analytics Platform. In this new study, Forrester examines how data teams — and the entire business — can move faster, collaborate better and operate more efficiently when they have a unified, open platform for data engineering, machine learning, and big data analytics. Through customer interviews, Forrester found that organizations deploying Databricks realize nearly $29 million in total economic benefits and a return on investment of 417% over a three-year period. They also concluded that the Databricks platform pays for itself in less than six months.
 
Databricks delivers business value
 

More specifically, data teams interviewed for the study experienced the following key benefits from the Databricks Unified Data Analytics Platform:

Increased revenues by accelerating data science outcomes

Databricks customers achieved a 5% increase in revenues by enabling data science teams to build more — and better — ML models, faster. Additionally, Databricks democratized data access across the organization. This led to new users creating a diverse set of new analytics products such as recommendation engines, pricing optimizations and predictive maintenance models. All these innovations led to top-line growth.

“With Databricks, we are able to train models against all our data more quickly, resulting in more accurate pricing predictions that have had a material impact on revenue.” – Bryn Clark, Data Scientist, Nationwide

Read Nationwide’s story

Improved productivity of data teams

Databricks improved the productivity of customers’ data scientists and data engineers by 25% and 20%, respectively. Customers shared that the improved data management capabilities enabled data teams to spend less time searching for and cleaning data, less time creating and maintaining ETL pipelines, and more time building analytics and ML models to drive meaningful business outcomes. Databricks also helped remove technical barriers that limited collaboration among analysts, data scientists, and data engineers, enabling data teams to work together more efficiently.

“Being on the Databricks platform has allowed our team of data scientists to make huge strides in setting aside all those configuration headaches that we were faced with. It’s dramatically improved our productivity.” – Josh McNutt, SVP of Data Strategy and Consumer Analytics, Showtime

Read Showtime’s story

Significant cost savings retiring legacy analytics platforms

By migrating to Databricks, interviewed organizations were able to retire on-premises infrastructure and cancel legacy software licenses, resulting in millions of dollars of savings. Additionally, the management of the Databricks platform proved substantially easier than legacy environments. This enabled customers to reallocate IT resources to higher-value projects and reduce operational costs.

“Databricks has enabled Comcast to process petabytes of data while reducing compute costs by 10x. Teams can spend more time on analytics and less time on infrastructure management.”

Read Comcast’s Story 

Read the Forrester TEI study

With the Databricks Unified Data Analytics Platform, customers can now accelerate data-driven innovation, thanks to a unified, open platform for data science, ML, and analytics that brings together data teams, processes, and technologies.

To find out more, download the full Total Economic Impact study for Databricks.

DOWNLOAD NOW!

--

Try Databricks for free. Get started today.

The post New study: Databricks delivers nearly $29 million in economic benefits and pays for itself in less than six months appeared first on Databricks.

Evolving the Databricks brand


Some brands start out as, well, brands. A lot of work goes into the concept and painting the picture before the business is ever launched.

Databricks is different. It always has been and always will be an engineering-led company.

Databricks’ model for innovation is inspired by the open-source community. This is where our roots run deepest, as it underpins everything that makes us special — our platform, our culture and our ethos. And like the open-source community, Databricks is driven by the spirit of collaboration and its impact on innovation.

There weren’t many people willing to bet that a bunch of students and teachers in a research lab at Berkeley — with virtually no business experience among them — could take the open-source software that they helped create, enhance it and deliver it as a cloud-based platform for data and AI. Three major technology transformations would have to unfold in order for this to succeed:

  1. The cloud would have to go mainstream
  2. Open source software would have to become a standard in the enterprise
  3. Machine learning would have to become more than science fiction

Today, Databricks is one of the fastest-growing SaaS companies in history, if not the fastest. Our success is attributable to our relentless focus on innovation and on making customers successful. More specifically, it comes from putting all of our energy into simplifying data and AI so that data teams (engineers, scientists and analysts) can innovate faster.

But we didn’t get here because we focused on building an amazing brand.

In fact, some might argue that we intentionally avoided the “B word” because it was too much of a distraction. Some might say that focusing on it would have led to some kind of radical reinvention that would make us lose who we really were. Some might say that the absence of a brand was Databricks’ brand. But the truth is our brand was always there, embedded in our core values and the way we approached everything we do.

Brand is about how you’re perceived – it’s not something you can define for yourself. It’s crafted over time, through the sum of experiences that people have with your company. By definition, a brand needs this time — about seven years in our case — before it can really be captured and defined in a way that’s credible and true to your roots.

This is the first time we’ve ever been intentional about capturing our brand. We spent a lot of time with employees, customers, developers and partners to understand how the people who know us best think about Databricks. Three consistent themes emerged, each of which describes how we operate internally as well as how our customers think about us:

Collaboration – We believe innovation happens faster when we work together, learn from each other, iterate and constantly improve. More than ever, data and AI is a team sport and Databricks is the platform for data teams.

Innovation – We believe in science and the limitless potential of data. Databricks enables organizations to realize that potential as quickly as possible.

Impact – Impact is everything. It’s why people work at Databricks. It’s what drives us. Most importantly, it’s what organizations depend on us for. We live in awe of the impact our customers are having on the world and are inspired by the role that Databricks plays.

We put a lot of thought into how we wanted to represent these attributes, both in our look and our language. Needless to say, we’re proud of the result — not just because it looks good or sounds good — but because it’s authentic to how people think about Databricks and the value that we bring to the table.

We think the video at the top of this blog captures it well – who we are, what we do and what our customers and the entire data community can achieve on an open, unified platform for data and AI. We’d love to hear what you think – email us at brand@databricks.com.

We didn’t always use these words to say it, but our mission has always been to help data teams solve the world’s toughest problems. Never have we been more proud or felt such a sense of urgency for what we can do together with our customers and partners. And as our mission continues to unfold, we can’t wait to see how our brand continues to evolve. We are just getting started!

The post Evolving the Databricks brand appeared first on Databricks.

Faster SQL Queries on Delta Lake with Dynamic File Pruning

There are two time-honored optimization techniques for making queries run faster in data systems: process data at a faster rate or simply process less data by skipping non-relevant data. This blog post introduces Dynamic File Pruning (DFP), a new data-skipping technique enabled by default in Databricks Runtime 6.1, which can significantly improve queries with selective joins on non-partition columns on tables in Delta Lake.

In our experiments using TPC-DS data and queries with Dynamic File Pruning, we observed up to an 8x speedup in query performance and 36 queries had a 2x or larger speedup.

The Benefits of Dynamic File Pruning

Data engineers frequently choose a partitioning strategy for large Delta Lake tables that allows the queries and jobs accessing those tables to skip considerable amounts of data, thus significantly speeding up query execution. Partition pruning can take place at query compilation time, when queries include an explicit literal predicate on the partition key column, or at runtime via Dynamic Partition Pruning.

Delta Lake on Databricks Performance Tuning

In addition to eliminating data at partition granularity, Delta Lake on Databricks dynamically skips unnecessary files when possible. This works because Delta Lake automatically collects metadata about the data files it manages, so data can be skipped without opening the data files themselves. Prior to Dynamic File Pruning, file pruning only took place when queries contained a literal value in the predicate; now it works for both literal filters and join filters. This means that star schema queries can take advantage of data skipping at file granularity.

                           | Per Partition             | Per File (Delta Lake on Databricks only)
Static (based on filters)  | Partition Pruning         | File Pruning
Dynamic (based on joins)   | Dynamic Partition Pruning | Dynamic File Pruning (NEW!)

How Does Dynamic File Pruning Work?

Before we dive into the details of how Dynamic File Pruning works, let’s briefly present how file pruning works with literal predicates.

Example 1 – Static File Pruning

For simplicity, let’s consider the following query derived from the TPC-DS schema to explain how file pruning can reduce the size of the SCAN operation.

    -- Q1
    SELECT sum(ss_quantity) 
    FROM store_sales 
    WHERE ss_item_sk IN (40, 41, 42) 

Delta Lake stores the minimum and maximum values for each column on a per file basis. Therefore, files in which the filtered values (40, 41, 42) fall outside the min-max range of the ss_item_sk column can be skipped entirely. We can reduce the length of value ranges per file by using data clustering techniques such as Z-Ordering. This is very attractive for Dynamic File Pruning because having tighter ranges per file results in better skipping effectiveness. Therefore, we have Z-ordered the store_sales table by the ss_item_sk column.
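To make the min-max idea concrete, here is a minimal pure-Python sketch. It is not Delta Lake's actual implementation: the `FileStats` class, the `files_to_scan` helper and both file layouts are hypothetical and exist only for illustration. The only thing taken from the text above is the idea of keeping a minimum and maximum of `ss_item_sk` per file and skipping files whose range cannot contain the filtered values.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file metadata: min and max of ss_item_sk in that file."""
    name: str
    min_key: int
    max_key: int


def files_to_scan(files: Iterable[FileStats], wanted: Iterable[int]) -> List[str]:
    """Keep only files whose [min, max] range could contain a wanted key."""
    wanted = list(wanted)
    return [f.name for f in files
            if any(f.min_key <= k <= f.max_key for k in wanted)]


# Unclustered layout: every file spans almost the whole key domain.
unclustered = [FileStats(f"u{i}", 1 + i, 97 + i) for i in range(4)]
# Clustered layout (for example after Z-Ordering by the key): tight, disjoint ranges.
clustered = [FileStats(f"c{i}", 1 + 25 * i, 25 * (i + 1)) for i in range(4)]

keys = (40, 41, 42)  # the literal values from Q1
scan_unclustered = files_to_scan(unclustered, keys)
scan_clustered = files_to_scan(clustered, keys)
print(scan_unclustered)
print(scan_clustered)
```

With the wide, overlapping ranges no file can be ruled out, while the clustered layout leaves a single candidate file. That is the effect the tighter per-file ranges are meant to produce.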

In query Q1, predicate pushdown takes place, so file pruning happens as a metadata operation inside the SCAN operator. The SCAN is still followed by a FILTER operation that removes any remaining non-matching rows.

When the filter contains literal predicates, the query compiler can embed these literal values in the query plan. However, when predicates are specified as part of a join, as is commonly found in most data warehouse queries (e.g., star schema join), a different approach is needed. In such cases, the join filters on the fact table are unknown at query compilation time.

Example 2 – Star Schema Join without DFP

Below is an example of a query with a typical star schema join.

    -- Q2 
    SELECT sum(ss_quantity) 
    FROM store_sales 
    JOIN item ON ss_item_sk = i_item_sk
    WHERE i_item_id = 'AAAAAAAAICAAAAAA'

Query Q2 returns the same results as Q1. However, it specifies the predicate on the dimension table (item), not the fact table (store_sales). This means that filtering of rows for store_sales would typically be done as part of the JOIN operation since the values of ss_item_sk are not known until after the SCAN and FILTER operations take place on the item table.

Below is a logical query execution plan for Q2.

Figure: Logical query execution plan for Q2 without Dynamic File Pruning.

As the query plan for Q2 shows, only 48K rows meet the JOIN criteria, yet over 8.6B records had to be read from the store_sales table. Both the query runtime and the amount of data scanned could be reduced significantly if there were a way to push the JOIN filter down into the SCAN of store_sales.

Example 3 – Star Schema Join with Dynamic File Pruning

If we take Q2 and enable Dynamic File Pruning, we can see that a dynamic filter is created from the build side of the join and passed into the SCAN operation for store_sales. The logical plan diagram below represents this optimization.

Figure: Logical query execution plan for Q2 with Dynamic File Pruning enabled.

The result of applying Dynamic File Pruning in the SCAN operation for store_sales is that the number of scanned rows drops from 8.6 billion to 66 million. Although the improvement is significant, we still read more data than needed because DFP operates at the granularity of files instead of rows.
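The flow described in Examples 2 and 3 can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the Databricks optimizer: the tiny `item` dictionary, the three `store_sales_files` entries and the `q2_with_dfp` function are all made up for illustration. It shows the dimension filter being evaluated first, the resulting join keys being used to skip fact-table files by their min-max range, and the join discarding the rows that survive only because pruning works per file.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical dimension table: i_item_sk -> i_item_id
item: Dict[int, str] = {40: "AAAAAAAAICAAAAAA", 41: "BBBB", 90: "CCCC"}

# Hypothetical fact-table files: (min ss_item_sk, max ss_item_sk, rows),
# where each row is (ss_item_sk, ss_quantity).
store_sales_files: List[Tuple[int, int, List[Tuple[int, int]]]] = [
    (1, 25, [(3, 5), (25, 1)]),
    (26, 50, [(30, 2), (40, 7), (40, 3)]),
    (51, 100, [(60, 4), (90, 9)]),
]


def q2_with_dfp(item_id: str) -> Tuple[int, int]:
    """Return (sum of ss_quantity, number of fact rows scanned)."""
    # 1. Build side: run the dimension filter first and collect the join keys.
    keys: Set[int] = {sk for sk, iid in item.items() if iid == item_id}
    total = scanned = 0
    for lo, hi, rows in store_sales_files:
        # 2. Dynamic filter: skip files whose min-max range holds no join key.
        if not any(lo <= k <= hi for k in keys):
            continue
        # 3. Surviving files are still read in full; the join drops the rest.
        for sk, qty in rows:
            scanned += 1
            if sk in keys:
                total += qty
    return total, scanned


result = q2_with_dfp("AAAAAAAAICAAAAAA")
print(result)  # (10, 3)
```

Three rows are scanned even though only two match, which mirrors the point above that pruning at file granularity still reads some rows the join later discards.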

We can observe the impact of Dynamic File Pruning by looking at the DAG from the Spark UI (snippets below) for this query and expanding the SCAN operation for the store_sales table. In particular, using Dynamic File Pruning in this query eliminates more than 99% of the input data which improves the query runtime from 10s to less than 1s.
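As a quick sanity check, the "more than 99%" figure can be reproduced from the row counts quoted earlier. The snippet assumes that the share of input data eliminated tracks the share of rows no longer scanned; the byte-level figure in the Spark UI may differ somewhat.

```python
rows_without_dfp = 8.6e9  # rows scanned from store_sales without DFP
rows_with_dfp = 66e6      # rows scanned from store_sales with DFP

# Fraction of rows that no longer have to be read.
reduction = 1 - rows_with_dfp / rows_without_dfp
print(f"{reduction:.2%} of rows skipped")
```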

Figure: Scan node statistics without dynamic file pruning.

Figure: Scan node statistics with dynamic file pruning.

Enabling Dynamic File Pruning

DFP is automatically enabled in Databricks Runtime 6.1 and higher, and applies if a query meets the following criteria:

  • The inner table (probe side) being joined is in Delta Lake format
  • The join type is INNER or LEFT-SEMI
  • The join strategy is BROADCAST HASH JOIN
  • The number of files in the inner table is greater than the value for spark.databricks.optimizer.deltaTableFilesThreshold

DFP can be controlled by the following configuration parameters:

  • spark.databricks.optimizer.dynamicFilePruning (default is true) is the main flag that enables the optimizer to push down DFP filters.
  • spark.databricks.optimizer.deltaTableSizeThreshold (default is 10GB) This parameter represents the minimum size in bytes of the Delta table on the probe side of the join required to trigger dynamic file pruning.
  • spark.databricks.optimizer.deltaTableFilesThreshold (default is 1000) This parameter represents the number of files of the Delta table on the probe side of the join required to trigger dynamic file pruning.

Note: In the experiments reported in this article, we set spark.databricks.optimizer.deltaTableFilesThreshold to 100 in order to trigger DFP because the store_sales table has fewer than 1000 files.
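The criteria and parameters above can be read as a single eligibility check. The sketch below is only an illustration of that reading, not the optimizer's code: `DfpConf`, `dfp_applies` and the sample table are hypothetical names, the interpretation of 10GB as 10 × 2^30 bytes is an assumption, and so is the choice to require both the size and the file-count threshold and to treat the size threshold as an inclusive minimum.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # assumption: "10GB" is read as 10 * 2**30 bytes


@dataclass(frozen=True)
class DfpConf:
    """Defaults copied from the parameter list above."""
    enabled: bool = True                  # spark.databricks.optimizer.dynamicFilePruning
    size_threshold_bytes: int = 10 * GIB  # spark.databricks.optimizer.deltaTableSizeThreshold
    files_threshold: int = 1000           # spark.databricks.optimizer.deltaTableFilesThreshold


def dfp_applies(conf: DfpConf, *, is_delta: bool, join_type: str,
                join_strategy: str, size_bytes: int, num_files: int) -> bool:
    """Hypothetical eligibility check mirroring the criteria listed above."""
    return (
        conf.enabled
        and is_delta
        and join_type in {"INNER", "LEFT-SEMI"}
        and join_strategy == "BROADCAST HASH JOIN"
        and size_bytes >= conf.size_threshold_bytes  # assumed: inclusive minimum
        and num_files > conf.files_threshold         # "greater than" per the list
    )


# Hypothetical probe-side table: large, but stored in fewer than 1000 files.
table = dict(is_delta=True, join_type="INNER",
             join_strategy="BROADCAST HASH JOIN",
             size_bytes=400 * GIB, num_files=600)

with_defaults = dfp_applies(DfpConf(), **table)
with_lower_threshold = dfp_applies(DfpConf(files_threshold=100), **table)
print(with_defaults, with_lower_threshold)  # False True
```

Under these assumptions, a table with fewer than 1000 files only becomes eligible once the file threshold is lowered, which matches the reason given in the note for setting it to 100.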

Experiments and Results with TPC-DS

To understand the impact of Dynamic File Pruning on SQL workloads we compared the performance of TPC-DS queries on unpartitioned schemas from a 1TB dataset. We used Z-Ordering to cluster the joined fact tables on the date and item key columns. DFP delivers good performance in nearly every query. In 36 out of 103 queries we observed a speedup of over 2x with the largest speedup achieved for a single query of roughly 8x. The chart below highlights the impact of DFP by showing the top 10 most improved queries.

Dynamic File Pruning reduces by a large factor the number of files read in several TPC-DS queries running on a 1TB dataset.

Many TPC-DS queries use a typical star schema join between a date dimension table and a fact table (or multiple fact tables) to filter date ranges, which makes it a great workload to showcase the impact of DFP. The data presented in the above chart explains why DFP is so effective for this set of queries: they now read significantly less data. Each query has a join filter on the fact tables limiting the period of time to a range between 30 and 90 days (fact tables store 5 years of data). DFP is very attractive for this workload as some of the queries may access up to three fact tables.

Getting Started with Dynamic File Pruning

Dynamic File Pruning (DFP), a new feature in Databricks Runtime 6.1, can significantly improve the performance of many queries on Delta Lake. DFP is especially efficient when running join queries on non-partitioned tables. The performance gain from DFP often depends on how well the data is clustered, so users may consider using Z-Ordering to maximize the benefit of DFP. To leverage these latest performance optimizations, sign up for a Databricks account today!

The post Faster SQL Queries on Delta Lake with Dynamic File Pruning appeared first on Databricks.

Intern Tips for a Virtual Databricks Internship

Winter 2020 Interns at our Spark Social S’mores Event

At Databricks, we host interns year-round, and we love sharing their experiences working on impactful projects that help data teams solve the world’s toughest challenges.

This summer’s intern program will be entirely virtual, with interns working on our Engineering team from all over the world. It’s a huge shift so we’re lucky to have our outgoing winter intern class share their thoughts on moving from working in our San Francisco office to a virtual experience, where our interns returned home to finish their internships remotely.

Read on to hear more about their Databricks experiences and how to get the most out of your virtual internship.

Brandon — Clusters Team

What has been your proudest accomplishment?
My team gives me work and lets me go off to work on it on my own, trusting that I’ll get it done. They rely on the work I do and trust that even as an intern I’ll be able to get it done.

How can other interns get the most out of their internship experience?
Read the Databricks Technical Blogs. I was able to get a sense of what types of projects were being asked of previous interns. Additionally, take time to hop on a call with your manager and/or mentor before you start to set up a positive relationship with them early on!

Sarth — ML Core Team

If you could describe your time at Databricks in one word, what would it be?
Learning: I’ve done eight internships previously, and I’ve learned more at Databricks so far than in any of my previous ones. I couldn’t have even imagined getting to work on the projects I’m working on here, things like contributing to open-source Apache Spark™.

What has been your favorite memory or experience so far?
It’s actually been really fun during this abnormal COVID-19 time. My team used to have really fun social hours, and once we all started working from home, we just pivoted to virtual social hours and they’re great! We play games and get a chance to hang out with each other during these times.

Carl — Observability Team

What has been surprising about your experience so far?
Definitely the transparency across the company and visibility of engineering efforts across different teams and organizations.

How can other interns get the most out of their internship at Databricks?
Work to drive your own project by participating fully in team stand-ups and planning meetings. As my internship progressed, I made an effort to do more of this and I had good outcomes from it. The teams were really receptive to my feedback and input.

Melanie — Growth Team

What skills have you learned or improved?
So many things! I learned everything from creating a design doc and holding a design review to the workflows needed to ship a feature to production to utilizing tooling to be productive in my day-to-day.

How have you stayed connected with your team through the WFH period?
Our team has been doing weekly team hangout calls, as well as a #growth-random Slack channel where we have conversations about non-work related things — everything from the food we’re eating to tips for taking care of dogs!

Jon — Delta Pipelines Team

What has been surprising about your experience at Databricks?
Truly how much trust and confidence the team has in interns. Interns are given real, big projects to work on throughout their time; we’re given the freedom and flexibility to find solutions on our own or the support and assistance when needed.

What tips and tricks can you share about interning from home?
Work with your manager to find the best schedule for you. I have found time to make sure I get a chance to work out or cook throughout the day by splitting my days into two or three chunks!

Dhruv — Data Team

What has been your proudest accomplishment?
I’ve taken time to set up meetings with all sorts of people to learn more about Databricks. I started within engineering, setting up coffee chats to get more acquainted with other team members and parts of the product. Through this process, I’ve gotten insights about other parts of the company as well.

What has been your favorite memory or experience so far?
All the different board game nights. Board game nights are a chance to get to hang out with co-workers (including full-time engineers and other interns) and socialize. They’re now virtual but still provide fun opportunities!

Andrew — Cloud Team

What has mentorship been like for you?
From Day One, I had a mentor assigned to me. This mentor helped me onboard and had daily syncs to go through trivial things. My mentor also gave me career advice and pre-quarantine travel advice.

I found that my other team members were also more than happy to help out. They are willing to drop what they’re working on for a moment to help you out if you’re stuck on something.

If you could describe your time at Databricks in one word, what would it be?
Ownership: you really are the owner of your project. I had full control over the project and was able to change the direction slightly as it made sense. You’re really trusted to do that!

Shubhra — Workspace Team

What has mentorship been like for you?
Great! I have the freedom to make design choices, but when I make mistakes there’s support for me to learn from them. Everybody is super helpful and patient.

What has been surprising about your experience at Databricks?
The overall scale and scope of my project; I expected an “intern” project, however it’s much bigger than that. I get to work on some core functionality and code that was written a super long time ago. As an intern, it has been an amazing opportunity to be able to contribute to such important pieces of our product.

Scott — Dev Tools Team

What skills have you learned or improved?
This was my first infrastructure role, so I really learned all about deployment systems and build systems, everything from GitHub webhooks to Kubernetes.

What tips and tricks can you share about interning from home?
Make sure to not stay inside all day! I take any opportunity to go outside — walking my dog several times a day, doing exercises, and stretching.

Interested in joining our next class of interns? Check out our Careers Page.

The post Intern Tips for a Virtual Databricks Internship appeared first on Databricks.

Azure Databricks Security Best Practices

Azure Databricks is a Unified Data Analytics Platform that is a part of the Microsoft Azure Cloud. Built upon the foundations of Delta Lake, MLflow, Koalas and Apache Spark™, Azure Databricks is a first-party PaaS on Microsoft Azure that provides one-click setup, native integrations with other Azure cloud services, an interactive workspace, and enterprise-grade security to power Data & AI use cases for small to large global customers. The platform enables true collaboration between different data personas in any enterprise, like Data Engineers, Data Scientists, Business Analysts and SecOps / Cloud Engineering.

In this article, we share a list of cloud security features and capabilities that an enterprise data team can use to harden their Azure Databricks environment in line with their governance policy.

Security that Unblocks the True Potential of your Data Lake

Learn how Azure Databricks helps address the challenges that come with deploying, operating and securing a cloud-native data analytics platform at scale.

Bring Your Own VNET

Learn what the Azure Databricks platform architecture looks like and how you can set it up in your own enterprise-managed virtual network, so that you can make the customizations required by your network security team.

Trust But Verify with Azure Databricks

Get visibility into relevant platform activity in terms of who’s doing what and when, by configuring Azure Databricks Diagnostic Logs and other related audit logs in the Azure Cloud.

Securely Accessing Azure Data Sources from Azure Databricks

Understand the different ways of connecting Azure Databricks clusters in your private virtual network to your Azure Data Sources in a cloud-native secure manner.

Data Exfiltration Protection with Azure Databricks

Learn how to utilize cloud-native security constructs to create a battle-tested secure architecture for your Azure Databricks environment that helps you prevent data exfiltration. This is most relevant for organizations working with personally identifiable information (PII), protected health information (PHI) and other types of sensitive data.

Enable Customer-Managed Keys with Notebooks

Azure Databricks notebooks are stored in the scalable management layer powered by Microsoft, and are by default encrypted with a Microsoft-managed per-workspace key. You can also bring your own key to encrypt the notebooks.

Simplify Data Lake Access with Azure AD Credential Passthrough

Control who has access to what data by using seamless identity federation with Azure AD under the hood, and get cloud-native visibility into who is processing the data and when. Please feel free to refer to cloud-native access control for ADLS Gen 2 and how to configure it using Azure Storage Explorer. Such access management controls, including role-based access controls, are seamlessly utilized by Azure Databricks as outlined in the passthrough article.

Azure Databricks is HITRUST CSF Certified

Azure Databricks is HITRUST CSF Certified to meet the required level of security and risk controls to support the regulatory requirements of our customers. It is in addition to the HIPAA compliance that’s applicable through Microsoft Azure BAA.

What’s Next?

Attend the Azure Databricks Security Best Practices Webinar and bookmark this page, as we’ll keep it updated with the new security-related capabilities & controls. If you want to try out the mentioned features, get started by creating an Azure Databricks workspace in your managed VNET.

The post Azure Databricks Security Best Practices appeared first on Databricks.


How to build a Quality of Service (QoS) analytics solution for streaming video services

The Importance of Quality to Streaming Video Services
Databricks QoS Solution Overview
Video QoS Solution Architecture
Making Your Data Ready for Analytics
Creating the Dashboard / Virtual Network Operations Center
Creating (Near) Real Time Alerts
Next steps: Machine learning
Getting Started with the Databricks Streaming Video Solution
 

The Importance of Quality to Streaming Video Services

As traditional pay TV continues to stagnate, content owners have embraced direct-to-consumer (D2C) subscription and ad-supported streaming for monetizing their libraries of content. For companies whose entire business model revolved around producing great content which they then licensed to distributors, the shift to now owning the entire glass-to-glass experience has required new capabilities such as building media supply chains for content delivery to consumers, supporting apps for a myriad of devices and operating systems, and performing customer relationship functions like billing and customer service.

With most vMVPD (virtual multichannel video programming distributor) and SVOD (subscription video on demand) services renewing on a monthly basis, subscription service operators need to prove value to their subscribers every month, week and day. The barriers to a viewer leaving an AVOD (ad-supported video on demand) service are even lower: simply opening a different app or channel. General quality of streaming video issues (encompassing buffering, latency, pixelation, jitter, packet loss, and the blank screen) have significant business impacts, whether it’s increased subscriber churn or decreased video engagement.

When you start streaming you realize there are so many places where breaks can happen and the viewer experience can suffer, whether it be an issue at the source in the servers on-prem or in the cloud; in transit at either the CDN level or ISP level or the viewer’s home network; or at the playout level with player/client issues. What breaks at n × 10^4 concurrent streamers is different from what breaks at n × 10^5 or n × 10^6. There is no pre-release testing that can quite replicate real-world users and their ability to push even the most redundant systems to their breaking point as they channel surf, click in and out of the app, sign on from different devices simultaneously, and so on. And because of the nature of TV, things will go wrong during the most important, high profile events drawing the largest audiences. If you start receiving complaints on social media, how can you tell if they are unique to that one user or rather regional or a national issue? If national, is it across all devices or only certain types (e.g., possibly the OEM updated the OS on an older device type which ended up causing compatibility issues with the client)?

Identifying, remediating, and preventing viewer quality of experience issues becomes a big data problem when you consider the number of users, the number of actions they are taking, and the number of handoffs in the experience (servers to CDN to ISP to home network to client). Quality of Service (QoS) helps make sense of these streams of data so you can understand what is going wrong, where, and why. Eventually you can get into predictive analytics around what could go wrong and how to remediate it before anything breaks.

Databricks QoS Solution Overview

The aim of this solution is to provide the core for any streaming video platform that wants to improve its QoS system. It is based on the AWS Streaming Media Analytics Solution provided by AWS Labs, which we extended by adding Databricks as a unified data analytics platform for both real-time insights and advanced analytics capabilities.

By using Databricks, streaming platforms can get faster insights by always leveraging the most complete and recent datasets, powered by robust and reliable data pipelines. They can decrease time to market for new features by accelerating data science in a collaborative environment with support for managing the end-to-end machine learning lifecycle. And they can reduce operational costs across all cycles of software development by having a unified platform for both data engineering and data science.

Video QoS Solution Architecture

With complexities like low-latency monitoring alerts and the highly scalable infrastructure required for peak video traffic hours, the straightforward architectural choice was the Delta Architecture. Standard big data architectures like Lambda and Kappa both have disadvantages around the operational effort required to maintain multiple types of pipelines (streaming and batch) and lack support for a unified Data Engineering and Data Science approach.

The Delta Architecture is the next generation paradigm that enables all the types of Data Personas in your organisation to be more productive:

  • Data Engineers can develop data pipelines in a cost efficient manner continuously without having to choose between batch and streaming
  • Data Analysts can get near real-time insights and faster answers to their BI queries
  • Data Scientists can develop better machine learning models using more reliable datasets with support for time travel that facilitates reproducible experiments and reports

Fig. 1 Delta Architecture using the “multi-hop” approach for data pipelines

Writing data pipelines using the Delta Architecture follows the best practices of having a multi-layer “multi-hop” approach where we progressively add structure to data: “Bronze” tables or Ingestion tables are usually raw datasets in the native format (JSON, CSV or txt), “Silver” Tables represent cleaned/transformed datasets ready for reporting or data science and “Gold” tables are the final presentation layer.
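The multi-hop idea can be illustrated without any Spark dependency. The sketch below is a toy model in plain Python, not the pipeline from the solution: the event payloads, the field names (`user`, `event`, `device`) and the functions `to_silver` and `to_gold` are hypothetical. It shows raw records kept as-is in a Bronze list, parsed and validated into Silver, and aggregated into a Gold summary.

```python
import json
from collections import Counter
from typing import Dict, List

# Bronze: raw events kept exactly as they arrived (hypothetical JSON payloads).
bronze: List[str] = [
    '{"user": "u1", "event": "play", "device": "tv"}',
    '{"user": "u2", "event": "buffer", "device": "phone"}',
    'not valid json',
    '{"user": "u3", "event": "buffer", "device": "tv"}',
]


def to_silver(raw_rows: List[str]) -> List[Dict[str, str]]:
    """Parse raw rows and drop records that fail basic quality checks."""
    silver_rows = []
    for raw in raw_rows:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed rows stay in Bronze only
        if isinstance(record, dict) and {"user", "event", "device"} <= record.keys():
            silver_rows.append(record)
    return silver_rows


def to_gold(silver_rows: List[Dict[str, str]]) -> Dict[str, int]:
    """Aggregate cleaned rows into a presentation-ready summary."""
    return dict(Counter(r["device"] for r in silver_rows if r["event"] == "buffer"))


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # buffering events per device
```

Each hop only adds structure to the previous one, so a bug in the Gold aggregation can be fixed and replayed from Silver without touching the raw data.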

For the pure streaming use cases, the option of materializing the Dataframes in intermediate Delta tables is basically just a tradeoff between latency/SLAs and cost (an example being real time monitoring alerts vs updates of the recommender system based on new content).

Fig. 2 A streaming architecture can still be achieved while materializing dataframes in Delta tables

The number of “hops” in this approach is directly affected by the number of downstream consumers, the complexity of the aggregations (e.g., Structured Streaming enforces certain limitations around chaining multiple aggregations) and the need to maximise operational efficiency.

The QoS solution architecture is focused on best practices for data processing and is not a full VOD (video-on-demand) solution. Some standard components, such as the “front door” service Amazon API Gateway, are omitted from the high-level architecture in order to keep the focus on data and analytics.


Fig. 3 High-Level Architecture for the QoS platform

Making your data ready for analytics

Both sources of data included in the QoS Solution (application events and CDN logs) use the JSON format. JSON is great for data exchange because it can represent complex nested structures, but it does not scale well and is difficult to maintain as a storage format for your data lake / analytics system.

In order to make the data directly queryable across the entire organisation, the Bronze to Silver pipeline (the “make your data available to everyone” pipeline) should transform any raw formats into Delta and include all the quality checks or data masking required by any regulatory agencies.

Video Applications Events

Based on the architecture, the video application events are pushed directly to Kinesis Streams and then ingested into an append-only Delta table without any changes to the schema.

Fig. 4 Raw format of the app events

Using this pattern allows a high number of downstream consumers to process the data in a streaming paradigm without having to scale the throughput of the Kinesis stream. As a side effect of using a Delta table as a sink (which supports optimize), we don’t have to worry about how the size of the processing window affects the number of files in the target table, known as the “small files” issue in the big data world.

Both the timestamp and the type of message are extracted from the JSON event so that the data can be partitioned and consumers can choose the type of events they want to process. Again, combining a single Kinesis stream for the events with a Delta “Events” table reduces the operational complexity while making it easier to scale during peak hours.
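As a rough illustration of that extraction step, the sketch below pulls the two partitioning values out of a single raw event using only the standard library. The field names `timestamp` (assumed to be epoch seconds) and `type`, and the function name `partition_keys`, are assumptions; the actual event schema is the one shown in the figures.

```python
import json
from datetime import datetime, timezone


def partition_keys(raw_event):
    """Return the date and event type used to partition a raw JSON event.

    Assumes the event carries an epoch-seconds `timestamp` and a `type` field;
    both names are hypothetical stand-ins for the real schema.
    """
    event = json.loads(raw_event)
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return {"date": ts.strftime("%Y-%m-%d"), "type": event["type"]}


example = partition_keys('{"type": "play", "timestamp": 1588327200}')
```

Partitioning on a coarse date plus the event type keeps the number of partitions small while still letting a consumer read only the events it cares about.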

Fig. 5 All the details are extracted from JSON for the Silver table

CDN Logs

The CDN Logs are delivered to S3, so the easiest way to process them is the Databricks Auto Loader, which incrementally and efficiently processes new data files as they arrive in S3 without any additional setup.

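    from pyspark.sql.functions import col

    # Incrementally pick up new CDN log files from S3 with Auto Loader
    auto_loader_df = spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "json") \
        .option("cloudFiles.region", region) \
        .load(input_location)

    # ip_anonymizer and map_ip_to_location are UDFs that are not shown here
    anonymized_df = auto_loader_df.select('*', ip_anonymizer('requestip').alias('ip')) \
        .drop('requestip') \
        .withColumn("origin", map_ip_to_location(col('ip')))

    # Continuously write the anonymised, enriched logs to the Silver Delta table
    anonymized_df.writeStream \
        .option('checkpointLocation', checkpoint_location) \
        .format('delta') \
        .table(silver_database + '.cdn_logs')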

As the logs contain IP addresses, which are considered personal data under the GDPR, the “make your data available to everyone” pipeline has to include an anonymisation step. Different techniques can be used, but we decided to simply strip the last octet from IPv4 addresses and the last 80 bits from IPv6 addresses. In addition, the dataset is enriched with information about the origin country and the ISP, which will be used later in the Network Operation Centers for localisation.
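The truncation rule described here can be sketched with Python's standard `ipaddress` module. This is a minimal illustration, not the solution's actual `ip_anonymizer` UDF; the function name `anonymize_ip` is hypothetical, and "stripping" is interpreted as zeroing the trailing bits.

```python
import ipaddress


def anonymize_ip(ip):
    """Zero the last octet of an IPv4 address or the last 80 bits of an IPv6 address."""
    addr = ipaddress.ip_address(ip)
    # Number of trailing bits to clear: 8 for IPv4, 80 for IPv6.
    drop_bits = 8 if addr.version == 4 else 80
    masked = (int(addr) >> drop_bits) << drop_bits
    # Rebuild an address of the same family from the masked integer.
    return str(type(addr)(masked))


anonymized_v4 = anonymize_ip("203.0.113.77")
anonymized_v6 = anonymize_ip("2001:db8:85a3:8d3:1319:8a2e:370:7348")
```

In a pipeline, a function like this would typically be wrapped as a UDF and applied to the IP column before the data is written to the Silver table.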

Creating the dashboard / virtual Network Operation Centers

Streaming companies need to monitor network performance and the user experience as near real time as possible, tracking down to the individual level with the ability to abstract at the segment level, easily defining new segments such as those defined by geos, devices, networks, and/or current and historical viewing behavior. For streaming companies that has meant adopting the concept of Network Operation Centers (NOC) from telco networks for monitoring the health of the streaming experience for their users at a macro level, flagging and responding to any issues early on. At their most basic, NOCs should have dashboards that compare the current experience for users against a performance baseline so that the product teams can quickly and easily identify and attend to any service anomalies.

In the QoS Solution we have incorporated a Databricks dashboard. BI Tools can also be effortlessly connected in order to build more complex visualisations, but based on customer feedback, built-in dashboards are most of the time the fastest way to present the insights to business users.

The aggregated tables for the NOC will basically be the Gold layer of our Delta Architecture: a combination of CDN logs and the application events.

Fig.6 Example of Network Operations Center Dashboard

The dashboard is just a way to visually package the results of SQL queries or Python / R transformations. Each notebook supports multiple dashboards, so when multiple end users have different requirements we don’t have to duplicate the code. As a bonus, the refresh can also be scheduled as a Databricks job.

Fig.7 Visualization of the results of a SQL query

Loading time for videos (time to first frame) gives a better understanding of the performance of individual locations of your CDN, in this case the AWS CloudFront edge nodes. This has a direct impact on your strategy for improving this KPI, either by spreading the user traffic over multiple CDNs or by implementing dynamic origin selection with Lambda@Edge in the case of AWS CloudFront.

Example video load time visualization for the Databricks streaming video Qos solution

Failure to understand the reasons for high levels of buffering, and the poor video quality experience that it brings, has a significant impact on subscriber churn rate. On top of that, advertisers are not willing to spend money on ads that reduce viewer engagement by adding extra buffering, so profits from the advertising business are usually impacted too. In this context, collecting as much information as possible from the application side is crucial so that the analysis can be done not only at the video level but also by browser or even by type / version of application.

Example buffer time data visualizations for the Databricks streaming video QoS solution.

On the content side, application events can provide useful information about user behaviour and overall quality of experience. How many people who paused a video actually finished watching that episode / video? Is the cause for stopping the quality of the content, or are there delivery issues? Of course, further analyses can be done by linking all the sources together (user behaviour, performance of CDNs / ISPs) to not only create a user profile but also to forecast churn.


Sample data visualization providing insight into user behavior available via the Databricks streaming video QoS solution.
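To make the "paused, then finished" question concrete, here is a small sketch of how it could be answered from an ordered list of application events. The field names (`user_id`, `video_id`, `type`) and the event type values `pause` and `finish` are assumptions, and the events are assumed to be sorted by time.

```python
def finished_after_pause_ratio(events):
    """Share of (user, video) sessions with a pause that later reached a finish event.

    Expects events in chronological order; field names are hypothetical.
    """
    paused = set()
    finished = set()
    for event in events:
        key = (event["user_id"], event["video_id"])
        if event["type"] == "pause":
            paused.add(key)
        elif event["type"] == "finish" and key in paused:
            finished.add(key)
    return len(finished) / len(paused) if paused else 0.0


sample_events = [
    {"user_id": 1, "video_id": "a", "type": "pause"},
    {"user_id": 2, "video_id": "a", "type": "pause"},
    {"user_id": 1, "video_id": "a", "type": "finish"},
    {"user_id": 3, "video_id": "b", "type": "finish"},  # never paused, not counted
]
ratio = finished_after_pause_ratio(sample_events)
```

The same logic maps naturally onto a grouped aggregation over the Silver events table when the data no longer fits in memory.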

Creating (Near) Real Time Alerts

When dealing with the velocity, volume, and variety of data generated in video streaming from millions of concurrent users, dashboard complexity can make it harder for human operators in the NOC to focus on the most important data at the moment and zero in on root cause issues. With this solution, you can easily set up automated alerts when performance crosses certain thresholds that can help the human operators of the network as well as set off automatic remediation protocols via a Lambda function. For example:

  • If a CDN’s latency is much higher than baseline (e.g., more than 10% above the baseline average), initiate automatic CDN traffic shifts.
  • If more than a set threshold (e.g., 5%) of clients report playback errors, alert the product team that there is likely a client issue for a specific device.
  • If viewers on a certain ISP are having higher than average buffering and pixelation issues, alert frontline customer representatives on responses and ways to decrease issues (e.g., set stream quality lower).
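The first two rules above reduce to simple threshold checks. The sketch below is one possible way to express them; the function names and default thresholds are illustrative assumptions taken from the examples in the list, not part of the solution's code.

```python
def cdn_latency_alert(current_ms, baseline_ms, tolerance=0.10):
    """True when current latency exceeds the baseline by more than `tolerance` (10% by default)."""
    return current_ms > baseline_ms * (1 + tolerance)


def playback_error_alert(error_clients, total_clients, threshold=0.05):
    """True when the share of clients reporting playback errors exceeds `threshold` (5% by default)."""
    if total_clients == 0:
        return False  # no traffic, nothing to alert on
    return error_clients / total_clients > threshold


shift_traffic = cdn_latency_alert(current_ms=115, baseline_ms=100)
notify_product_team = playback_error_alert(error_clients=6, total_clients=100)
```

In practice, checks like these would run against the output of a streaming aggregation, with the baseline computed from historical data.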

From a technical perspective, generating real-time alerts requires a streaming engine capable of processing data in real time and a publish-subscribe service to push notifications.

Fig.8 Integrating microservices using Amazon SNS and Amazon SQS

The QoS solution implements the AWS best practices for integrating microservices by using Amazon SNS and its integrations with AWS Lambda (see the updates of web applications below) or Amazon SQS for other consumers. The custom foreach writer option makes it really straightforward to write a pipeline that sends email notifications based on a rule-based engine (e.g., validating the percentage of errors for each individual type of app over a period of time).

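    import boto3

    def send_error_notification(row):
        # Publish an alert to SNS for each row that exceeded the threshold
        sns_client = boto3.client('sns', region_name=region)

        error_message = 'Number of errors for the App has exceeded the threshold {}'.format(row['percentage'])

        response = sns_client.publish(
            TopicArn=topic_arn,  # value elided in the original; supply your SNS topic ARN
            Message=error_message,
            Subject=subject,     # value elided in the original; supply your own subject line
            MessageStructure='string')

    # Structured Streaming Job
    # getKinesisStream and calculate_error_percentage are helpers that are not shown here

    getKinesisStream("player_events")\
        .selectExpr("type", "app_type")\
        .groupBy("app_type")\
        .apply(calculate_error_percentage)\
        .where("percentage > {}".format(threshold)) \
        .writeStream\
        .foreach(send_error_notification)\
        .start()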

Fig.9 Sending email notifications using AWS SNS

On top of the basic email use case, the Demo Player includes three widgets updated in real time using AWS AppSync: the number of active users, the most popular videos and the number of users concurrently watching a video.

Fig.10 Updating the application with the results of real-time aggregations

The QoS Solution applies a similar approach, Structured Streaming and Amazon SNS, to update all the values, allowing extra consumers to be plugged in using Amazon SQS. This is a common pattern when huge volumes of events have to be enhanced and analysed: pre-aggregate the data once and allow each service (consumer) to make its own decision downstream.

Next Steps: Machine Learning

Manually making sense of the historical data is important but is also very slow – if we want to be able to make automated decisions in the future, we have to integrate machine learning algorithms.

As a Unified Data Analytics Platform, Databricks empowers Data Scientists to build better Data Science products using features like the ML Runtime with built-in support for Hyperopt / Horovod / AutoML, or the integration with MLflow, the end-to-end machine learning lifecycle management tool.

We have already explored a few important use cases across our customer base while focusing on the possible extensions to the QoS Solution.

Point-of-failure prediction & remediation

As D2C streamers reach more users, the cost of even a momentary loss of service increases. ML can help operators move from reporting to prevention by forecasting where issues could come up and remediating before anything goes wrong (e.g., a spike in concurrent viewers leads to automatically switching to a CDN with more capacity).


Customer Churn

Critical to growing subscription services is keeping the subscribers you have. By understanding the quality of service at the individual level, you can add QoS as a variable in churn and customer lifetime value models. Additionally, you can create customer cohorts for those who have had video quality issues in order to test proactive messaging and save offers.

Getting Started with the Databricks Streaming Video QoS Solution

Providing consistent quality in the streaming video experience is table stakes at this point for keeping fickle audiences, who have ample entertainment options, on your platform. With this solution we have sought to create a quick start for most streaming video platform environments to embed this QoS real-time streaming analytics solution in a way that:

  • Scales to any audience size
  • Quickly flags quality performance issues at key parts of the distribution workflow
  • Is flexible and modular enough to easily customize for your audience and your needs such as creating new automated alerts or enabling data scientists to test and roll-out predictive analytics and machine learning.

To get started, go to the GitHub repository created specifically for the Databricks streaming video QoS solution. For more guidance on how to unify batch and streaming data into a single system, view the Delta Architecture Webinar.

--

Try Databricks for free. Get started today.

The post How to build a Quality of Service (QoS) analytics solution for streaming video services appeared first on Databricks.

Databricks Launches Global University Alliance Program

We are excited to announce the Databricks University Alliance, a global program to help students get hands-on experience using Databricks for both in-person learning and in virtual classrooms. Educators and faculty can apply here; upon acceptance members will get access to curated curriculum content, training materials, sample notebooks, webinars, and other pre-recorded content for learning data science and data engineering tools including Apache Spark, Delta Lake, and MLflow. Accepted members will also have access to a community of fellow educators in support of data science and engineering education on Databricks.

Students focused on individual skills development from home can sign up for the free Databricks Community Edition and follow along with these free one-hour hands-on workshops for aspiring data scientists, as well as access free self-paced courses from Databricks Academy, the training and certification organization within Databricks.

The Databricks University Alliance is powered by leading cloud providers such as Microsoft Azure and Amazon Web Services (AWS). Educators looking for high-scale computing resources for their in-person and virtual classrooms may apply for cloud computing credits. We’ll evaluate applications based on the program description, the educational need for large-scale computing, the reputation of the school and program, and availability. Courses reaching a large number of students through MOOCs or other online learning will be prioritized.

Demand for Data Scientists and Data Engineers Surges

Databricks created this program to help the supply of data scientists, engineers, and analysts meet the growing demand. In 2016 the National Science Foundation recommended a national-level data science education and training agenda for US universities. Over the last five years, Google searches for Data Science have quadrupled; in addition, Glassdoor has ranked Data Science as one of the top ten Best Jobs in America every year since they started publishing the data in 2015. At Databricks, we believe that university students should learn the latest open-source data science tools to enhance their value in the workforce upon graduation.

How to Get Started with Databricks University Alliance

The Databricks University Alliance exists to help students and professors learn and use these next-generation data analytics tools for both bricks-and-mortar and virtual classrooms. Enroll now and join universities across the globe who are building the data science workforce of tomorrow.

The post Databricks Launches Global University Alliance Program appeared first on Databricks.

Now on Databricks: A Technical Preview of Databricks Runtime 7 Including a Preview of Apache Spark 3.0

Introducing Databricks Runtime 7.0 Beta

We’re excited to announce that the Apache Spark 3.0.0-preview2 release is available on Databricks as part of our new Databricks Runtime 7.0 Beta. The 3.0.0-preview2 release is the culmination of tremendous contributions from the open-source community to deliver new capabilities, performance gains and expanded compatibility for the Spark ecosystem. Using the preview is as simple as selecting the version “7.0 Beta” when launching a cluster.

Spark 3.0.0 preview on Databricks Runtime 7.0 Beta

The upcoming release of Apache Spark 3.0 builds on many of the innovations from Spark 2.0, bringing new ideas as well as continuing long-term projects that have been in development. Our vision has always been to unify data and AI, and we’ve continued to invest in making Spark powerful enough to solve your toughest big data problems but also easy enough to use that you can actually do so. And this is not just for data engineers and data scientists, but also for anyone who runs SQL workloads with Spark SQL. Over 3,000 Jira tickets were resolved with this new release of Spark and, while we won’t be able to cover all these new capabilities in depth in this post, we’d like to highlight some of the items in this release.

Adaptive SQL query optimization

Spark SQL is the engine for Spark. With the Catalyst optimizer, the Spark applications built on DataFrame, Dataset, SQL, Structured Streaming, MLlib and other third-party libraries are all optimized. To generate good query plans, the query optimizer needs to understand the data characteristics. In most scenarios, data statistics are commonly absent, especially when statistics collection is even more expensive than the data processing itself. Even if the statistics are available, the statistics are likely out of date. Because of the storage and compute separation in Spark, the characteristic of data arrival is unpredictable. For all these reasons, runtime adaptivity becomes more critical for Spark than for traditional systems. This release introduces a new Adaptive Query Execution (AQE) framework and new runtime filtering for Dynamic Partition Pruning (DPP):

  • The AQE framework is built with three major features: 1) dynamically coalescing shuffle partitions, 2) dynamically switching join strategies and 3) dynamically optimizing skew joins. Based on a 1TB TPC-DS benchmark without statistics, Spark 3.0 can yield 8x speedup for q77, 2x speedup for q5 and more than 1.1x speedup for another 26 queries. AQE can be enabled by setting SQL config spark.sql.adaptive.enabled to true (default false in Spark 3.0).

TPC-DS 1TB No-Statistics with vs. without Adaptive Query Execution.

  • DPP occurs when the optimizer is unable to identify at compile time the partitions it can skip. This is not uncommon in star schema, which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In the TPC-DS benchmark, 60 out of 102 queries show a significant speedup between 2x and 18x.

TPC-DS 1 TB with vs. without Dynamic Partition Pruning
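The AQE setting mentioned above can be applied to a running session from a notebook. The sketch below is an assumption-laden illustration: `apply_conf` is a hypothetical helper, `spark` is assumed to be an existing `SparkSession`, and the dynamic partition pruning key is the one listed in the Spark SQL configuration documentation (to my knowledge it is already on by default, so setting it here is only for explicitness).

```python
# Settings discussed in the text. The AQE key comes from the post itself;
# the DPP key is taken from the Spark SQL configuration documentation.
ADAPTIVE_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
}


def apply_conf(spark, conf=None):
    """Set each key on the session's runtime config and return the values read back."""
    conf = ADAPTIVE_CONF if conf is None else conf
    for key, value in conf.items():
        spark.conf.set(key, value)
    return {key: spark.conf.get(key) for key in conf}
```

In a Databricks notebook, where `spark` already exists, calling `apply_conf(spark)` would be enough to turn AQE on for subsequent queries in that session.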

Richer APIs and functionalities

To enable new use cases and simplify the Spark application development, this release delivers new capabilities and enhances existing features.

  • Enhanced pandas UDFs. Pandas UDFs were initially introduced in Spark 2.3 for scaling user-defined functions in PySpark and integrating pandas APIs into PySpark applications. However, the existing interface became difficult to understand as more UDF types were added. This release introduces a new pandas UDF interface with Python type hints. It also adds two new pandas UDF types, iterator of series to iterator of series and iterator of multiple series to iterator of series, and three new pandas function APIs: grouped map, map and co-grouped map.
  • A complete set of join hints. While we keep making the compiler smarter, there’s no guarantee that the compiler can always make the optimal decision for every case. Join algorithm selection is based on statistics and heuristics. When the compiler is unable to make the best choice, users still can use the join hints for influencing the optimizer to choose a better plan. This release extended the existing join hints by adding the new hints: SHUFFLE_MERGE, SHUFFLE_HASH and SHUFFLE_REPLICATE_NL.
  • New built-in functions: 32 new built-in functions and higher-order functions are added in the Scala APIs. Among these, a set of MAP-specific built-in functions [transform_keys, transform_values, map_entries, map_filter, map_zip_with] are added to simplify the handling of the MAP data type.
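To give a feel for what the MAP-specific functions compute, the sketch below re-creates their semantics on plain Python dictionaries. These are not Spark APIs: they are stand-ins named after the Spark SQL functions `transform_keys`, `transform_values`, `map_filter`, `map_entries` and `map_zip_with`, written only to show the shape of the inputs and outputs.

```python
def transform_keys(m, f):
    """Apply f(key, value) to produce a new key for every entry."""
    return {f(k, v): v for k, v in m.items()}


def transform_values(m, f):
    """Apply f(key, value) to produce a new value for every entry."""
    return {k: f(k, v) for k, v in m.items()}


def map_filter(m, predicate):
    """Keep only the entries for which predicate(key, value) is true."""
    return {k: v for k, v in m.items() if predicate(k, v)}


def map_entries(m):
    """Return the entries as a list of (key, value) pairs."""
    return list(m.items())


def map_zip_with(m1, m2, f):
    """Merge two maps key by key; a key missing from one side is passed as None."""
    keys = list(m1) + [k for k in m2 if k not in m1]
    return {k: f(k, m1.get(k), m2.get(k)) for k in keys}


views = {"ios": 10, "web": 4}
errors = {"ios": 1, "tv": 2}
upper = transform_keys(views, lambda k, v: k.upper())
doubled = transform_values(views, lambda k, v: v * 2)
busy = map_filter(views, lambda k, v: v > 5)
merged = map_zip_with(views, errors, lambda k, v1, v2: (v1 or 0) + (v2 or 0))
```

In Spark SQL the same operations are expressed with lambda syntax inside the query, for example a two-argument lambda over the key and value of a MAP column.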

Enhanced monitoring capabilities

This release includes many enhancements that make monitoring more comprehensive and stable. These enhancements are efficient and do not have a high impact on performance.

  • New UI for structured streaming: Structured streaming was initially introduced in Spark 2.0. This release adds the dedicated new Spark UI for inspection of these streaming jobs. This new UI offers two sets of statistics: 1) aggregate information of a streaming query job completed and 2) detailed statistics information about the streaming query, including Input Rate, Process Rate, Input Rows, Batch Duration, Operation Duration, etc.

New Spark UI for inspection of streaming jobs available via Databricks Runtime 7.0

  • Enhanced EXPLAIN command: Reading plans is critical for understanding and tuning queries. The existing solution looks cluttered and each operator’s string representation can be very wide or even truncated. This release enhanced it with a new FORMATTED mode and also provided a capability to dump the plans to the files.
  • Observable metrics: Continuously monitoring the changes of the data quality is a highly desirable feature for managing a data pipeline. This release introduced such a capability for both batch and streaming applications. Observable metrics are named arbitrary aggregate functions that can be defined on a query (dataframe). As soon as the execution of a dataframe reaches a completion point (e.g., finishes batch query or reaches streaming epoch), a named event is emitted that contains the metrics for the data processed since the last completion point.

Try the Spark 3.0 Preview in the Runtime 7.0 Beta

The upcoming Apache Spark 3.0 release brings many new feature capabilities, performance improvements and expanded compatibility to the Spark ecosystem. Aside from core functional and performance improvements for data engineering, data science, data analytics, and machine learning workloads on Apache Spark, these improvements also deliver a significantly improved SQL analyst experience with Spark, including for reporting jobs and interactive queries. Once again, we appreciate all the contributions from the Spark community to make this possible.

This blog post only summarizes some of the salient features in this release. Stay tuned as we will be publishing a series of technical blogs explaining some of these features in more depth.

If you want to try the upcoming Apache Spark 3.0 preview in Databricks Runtime 7.0, sign up for a free trial account.

The post Now on Databricks: A Technical Preview of Databricks Runtime 7 Including a Preview of Apache Spark 3.0 appeared first on Databricks.

Fighting Cyber Threats in the Public Sector with Scalable Analytics and AI

Watch our on-demand webinar Real-time Threat Detection, Analytics and AI in the Public Sector to learn more and see a live demo.

In 2019, there were 7,098 data breaches exposing over 15.1 billion records. That equates to a cyber incident every hour and fifteen minutes. The Public Sector is a prime target with cyber criminals and nation states launching a constant barrage of attacks focused on disrupting government operations and obtaining a political edge. In fact, 88% of public sector organizations report that they have faced at least one cyber attack over the past two years.

Local governments and federal agencies need to be more vigilant and defensive than ever before. To prevent attacks, information security teams need to build a holistic view of the threat environment. But, this is no easy task. In today’s digital, mobile, and connected world, confidential and sensitive data is being accessed and shared across a growing list of applications and network endpoints. This creates hundreds of thousands of events and hundreds of terabytes of data every month that need to be analyzed and contextualized in near real-time. For most agencies this creates a big data problem.

Traditional Security Tools are Falling Short

To help manage this effort, many local and federal agencies have invested in traditional SIEM tools. While these threat intelligence tools are great for monitoring known threat patterns, most were built for the on-premise world. Scaling them for terabytes of data requires an expensive infrastructure build-out. And even cloud-based SIEM tools typically charge per GB of data ingested. This makes scaling threat detection tools for large volumes of data cost prohibitive. As a result, most agencies store a few weeks of threat data at best. This can be a real problem in scenarios where a perpetrator gains access to a network but waits months before doing anything malicious. Without a long historical record, security teams can’t analyze cyberattacks over long time horizons or conduct deep forensic reviews.

Beyond scaling challenges, many legacy SIEM tools lack the critical infrastructure — advanced analytics, graph processing and machine learning capabilities — needed to detect unknown threat patterns or deliver on a broader set of security use cases like behavioral analytics. For example, a rules-based SIEM might not detect questionable employee behavior such as an employee emailing sensitive documents to their personal email address right before they quit. In these scenarios, machine learning models are needed to detect anomalous behavior patterns across a broader set of non-traditional data sets.

Augmenting Threat Detection with Big Data Technologies

Databricks provides governmental agencies with the big data tools and technology to prevent and minimize cybersecurity threats

To prevent threats in today’s environment, government agencies need to find a better, more cost effective way to process, correlate and analyze massive amounts of real-time and historical data. Fortunately, the Databricks Unified Data Analytics Platform along with popular open-source tools Apache Spark™ and Delta Lake offer agencies a path forward:

  • Holistic, Real-time Threat Analysis – Native to the cloud and built on Apache Spark™, Databricks is optimized to process large volumes of threat data in real-time. This enables government agencies to quickly query petabytes of data stretching years into the past. This is critical for forensic reviews and profiling long-term threats. Delta Lake, an open-source storage layer that brings ACID transactions to big data workloads, is natively integrated into Databricks, providing additional optimizations that significantly accelerate queries of structured and unstructured data sets. Delta Lake also enables security teams to easily and reliably combine batch and streaming data sources which is critical for detecting new threats as they happen [watch this keynote to see how one of the world’s largest tech companies uses Delta Lake to scale threat detection].
  • Next Gen Threat Detection – Machine learning is critical to uncovering unknown threat patterns across broad sets of data. Databricks’ collaborative notebook environment provides data scientists with built-in machine learning libraries and the tooling they need to rapidly experiment with advanced analytics. With these capabilities, data scientists have the flexibility to build predictive machine learning models that support a broad set of security use cases such as reducing false positives produced by SIEM tools, uncovering suspicious employee behavior, detecting complex malware and more.
  • Cost Efficient Scale – The Databricks platform is a fully managed cloud service with cost-efficient pricing designed for big data processing. Cloud clusters auto-scale up and down so teams only use the compute needed for the job. Security teams no longer need to absorb the costly burden of building and maintaining a homegrown cybersecurity analytics platform or paying per GB of data ingested.

Improve Your Agency’s Security Posture with Big Data and AI

As cyber criminals continue to evolve their techniques, so do local and national government security teams need to evolve their cybersecurity strategies and how they detect and prevent threats. Big data analytics and machine learning technologies provide government agencies a path forward, but choosing the right platform is critical to success.

Get Started with These Threat Detection Resources

The post Fighting Cyber Threats in the Public Sector with Scalable Analytics and AI appeared first on Databricks.

A Convolutional Neural Network Implementation For Car Classification

Convolutional Neural Networks (CNNs) are state-of-the-art Neural Network architectures that are primarily used for computer vision tasks. CNNs can be applied to a number of different tasks, such as image recognition, object localization, and change detection. Recently, our partner Data Insights received a challenging request from a major car company: Develop a Computer Vision application which could identify the car model in a given image. Considering that different car models can appear quite similar and any car can look very different depending on its surroundings and the angle at which it is photographed, such a task was, until quite recently, simply impossible.

 
Example artificial neural network, with multiple layers between the input and output layers, where the input is an image and the output is a car model classification.
However, starting around 2012, the Deep Learning Revolution made it possible to handle such a problem. Instead of having the concept of a car explained to them, computers could repeatedly study pictures and learn such concepts themselves. In the past few years, additional Artificial Neural Network innovations have resulted in AI that can perform image classification tasks with human-level accuracy. Building on such developments, we were able to train a Deep CNN to classify cars by their model. The Neural Network was trained on the Stanford Cars Dataset, which contains over 16,000 pictures of cars, comprising 196 different models. Over time we could see the accuracy of predictions begin to improve, as the neural network learned the concept of a car and how to distinguish between different models.

Together with our partner we built an end-to-end machine learning pipeline using Apache Spark™ and Koalas for the data preprocessing, Keras with TensorFlow for the model training, MLflow for the tracking of models and results, and Azure ML for the deployment of a REST service. This setup within Azure Databricks is optimized to train networks quickly and efficiently, and also makes it much faster to try many different CNN configurations. Even after only a few practice attempts, the CNN’s accuracy reached around 85%.

Setting up an Artificial Neural Network to Classify Images

This article outlines some of the main techniques used in getting a Neural Network into production. If you’d like to get the Neural Network running yourself, the full notebooks, which include a meticulous step-by-step guide, can be found below.

This demo uses the publicly available Stanford Cars Dataset, which is one of the more comprehensive public data sets. It is a little outdated, so you won’t find car models from after 2012 (although, once the network is trained, transfer learning could easily allow a new dataset to be substituted). The data is provided through an ADLS Gen2 storage account that you can mount to your workspace.

Stanford Cars Dataset used in CNN image classification demonstration.

For the first step of data preprocessing, the images are compressed into hdf5 files (one for training and one for testing). These files can then be read in by the neural network. This step can be omitted completely, if you like, as the hdf5 files are included in the ADLS Gen2 storage provided with the notebooks.

Image Augmentation with Koalas

The quantity and diversity of data gathered has a large impact on the results one can achieve with deep learning models. Data augmentation is a strategy that can significantly improve learning results without the need to actually collect new data. With different techniques like cropping, padding, and horizontal flipping, which are commonly used to train large neural networks, the data sets can be artificially inflated by increasing the number of images for training and testing.

Example data augmentation, cropping, padding, horizontal flipping, etc., made possible by the use of Koalas.

Applying augmentation to a large corpus of training data can be very expensive, especially when comparing the results of different approaches. With Koalas it becomes easy to try existing Python frameworks for image augmentation and to scale the process out on a multi-node cluster using the Pandas API that is familiar to data scientists.
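To make the augmentation techniques named above concrete, here is a minimal sketch in plain Python. It is not the code from the notebooks: the function names (horizontal_flip, pad, crop, augment) are hypothetical, and a tiny list-of-rows "image" stands in for real image arrays. In the actual pipeline, a function like augment would be the per-image step that gets scaled out across the cluster.

```python
# A tiny grayscale "image" stored as a list of rows; real images would be arrays.

def horizontal_flip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def pad(img, size, fill=0):
    """Add a border of `size` pixels, filled with `fill`, on every side."""
    width = len(img[0]) + 2 * size
    top = [[fill] * width for _ in range(size)]
    middle = [[fill] * size + list(row) + [fill] * size for row in img]
    bottom = [[fill] * width for _ in range(size)]
    return top + middle + bottom

def crop(img, top, left, height, width):
    """Cut out a height x width window whose corner is at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]

def augment(img):
    """Return the original image plus three derived variants."""
    padded = pad(img, 1)
    # Padding and then cropping at an offset shifts the content by one pixel.
    shifted = crop(padded, 0, 0, len(img), len(img[0]))
    return [img, horizontal_flip(img), padded, shifted]

image = [[1, 2, 3],
         [4, 5, 6]]
variants = augment(image)
print(len(variants))           # one input image becomes four samples
print(horizontal_flip(image))  # [[3, 2, 1], [6, 5, 4]]
```

Each variant is derived from data that was already collected, which is why augmentation enlarges the training set without any new photographs.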

Coding a ResNet in Keras

When you break apart a CNN, they comprise different ‘blocks’, with each block simply representing a group of operations to be applied to some input data. These blocks can be broadly categorized into:

  • Identity Block: A series of operations which keep the shape of the data the same.
  • Convolution Block: A series of operations which reduce the shape of the input data to a smaller shape.

A CNN is a series of both Identity Blocks and Convolution Blocks (or ConvBlocks) which reduce an input image to a compact group of numbers. Each of these resulting numbers (if trained correctly) should eventually tell you something useful towards classifying the image. A Residual CNN adds an additional step for each block. The data is saved as a temporary variable before the operations that constitute the block are applied, and then this temporary data is added to the output data. Generally, this additional step is applied to each block. As an example the below figure demonstrates a simplified CNN for detecting handwritten numbers:

 
Example CNN used to detect handwritten numbers.
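The residual step described above (save the input, apply the block's operations, add the saved input back) can be sketched without any deep learning library. This is an illustration only: identity_block, scale_by and relu are hypothetical stand-ins for real Keras layers, and a flat list of numbers stands in for an image tensor.

```python
def identity_block(x, operations):
    """Apply the block's operations, then add the saved input back."""
    shortcut = list(x)          # the temporary copy of the input
    out = x
    for op in operations:
        out = op(out)
    # An identity block keeps the shape unchanged, so element-wise addition is valid.
    return [o + s for o, s in zip(out, shortcut)]

def scale_by(factor):
    """Stand-in for a learned linear operation such as a convolution."""
    return lambda values: [factor * v for v in values]

def relu(values):
    """Rectified linear activation."""
    return [max(0.0, v) for v in values]

x = [1.0, -2.0, 3.0]
y = identity_block(x, [scale_by(0.5), relu])
print(y)  # [1.5, -2.0, 4.5]
```

Because the input is added back unchanged, the block only has to learn the difference between its input and the desired output, which is what makes very deep networks trainable in practice.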

 

There are many different methods of implementing a Neural Network. One of the more intuitive ways is via Keras. Keras provides a simple front-end library for executing the individual steps which comprise a neural network. Keras can be configured to work with a TensorFlow back-end or a Theano back-end. Here, we will be using a TensorFlow back-end. A Keras network is broken up into multiple layers, as seen below. For our network we are also defining our own custom implementation of a layer.

Example Keras neural network broken into its constituent parts of input layers, convolutional layers, and max pooling layers, and demonstrating the continuous image reduction process used to aid classification.

The Scale Layer

For any custom operation that has trainable weights Keras allows you to implement your own layer. When dealing with huge amounts of image data, one can run into memory issues. Initially, RGB images contain integer data (0-255). When running gradient descent as part of the optimisation during backpropagation, one will find that integer gradients do not allow for sufficient accuracy to properly adjust network weights. Therefore, it is necessary to change to float precision. This is where issues can arise. Even when images are scaled down to 224x224x3, when we use ten thousand training images, we are looking at over 1 billion floating point entries. As opposed to turning an entire dataset to float precision, better practice is to use a ‘Scale Layer’, which scales the input data one image at a time, and only when it is needed. This should be applied after Batch Normalization in the model. The parameters of this Scale Layer are also parameters that can be learned through training.
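The "over 1 billion floating point entries" figure follows directly from the image dimensions. The short calculation below reproduces it; the byte sizes assume 8-bit integers for the raw pixels and 32-bit floats after conversion, which is an assumption on our part since the text only says "float precision".

```python
def dataset_entries(num_images, height=224, width=224, channels=3):
    """Total number of scalar values in a stack of images."""
    return num_images * height * width * channels

entries = dataset_entries(10_000)
bytes_uint8 = entries * 1      # one byte per 0-255 integer value
bytes_float32 = entries * 4    # four bytes per 32-bit float (assumed precision)

print(f"{entries:,} entries")                     # 1,505,280,000 entries
print(f"uint8:   {bytes_uint8 / 1e9:.2f} GB")     # uint8:   1.51 GB
print(f"float32: {bytes_float32 / 1e9:.2f} GB")   # float32: 6.02 GB
```

Under that assumption, converting the whole dataset up front quadruples its memory footprint, whereas converting one image or batch at a time keeps the bulk of the data in its compact integer form.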

To use this custom layer during scoring as well, we have to package the class together with our model. With MLflow we can achieve this with a Keras custom_objects dictionary that maps names (strings) to the custom classes or functions associated with the Keras model. MLflow saves these custom layers using CloudPickle and restores them automatically when the model is loaded with mlflow.keras.load_model() and mlflow.pyfunc.load_model().

mlflow.keras.log_model(model, "model", custom_objects={"Scale": Scale})

Tracking Results with MLflow and Azure Machine Learning

Machine learning development involves additional complexities beyond software development. The myriad of tools and frameworks makes it hard to track experiments, reproduce results and deploy machine learning models. Together with Azure Machine Learning, one can accelerate and manage the end-to-end machine learning lifecycle, using MLflow to reliably build, share and deploy machine learning applications on Azure Databricks.

In order to automatically track results, an existing or new Azure ML workspace can be linked to your Azure Databricks workspace. Additionally, MLflow supports auto-logging for Keras models (mlflow.keras.autolog()), making the experience almost effortless.

MLflow allows you to automatically track the results of your Azure ML workspace and seamlessly supports auto-logging for Keras models.

While MLflow’s built-in model persistence utilities are convenient for packaging models from various popular ML libraries such as Keras, they do not cover every use case. For example, you may want to use a model from an ML library that is not explicitly supported by MLflow’s built-in flavours. Alternatively, you may want to package custom inference code and data to create an MLflow Model. Fortunately, MLflow provides two solutions that can be used to accomplish these tasks: Custom Python Models and Custom Flavors.

In this scenario we want to make sure we can use a model inference engine that supports serving requests from a REST API client. For this we are using a custom model based on the previously built Keras model to accept a JSON Dataframe object that has a Base64-encoded image inside.

import mlflow.pyfunc

class AutoResNet150(mlflow.pyfunc.PythonModel):

    def predict_from_picture(self, img_df):
        import cv2 as cv
        import numpy as np
        import base64

        # decode the base64-encoded image used for transport over HTTP
        img = np.frombuffer(base64.b64decode(img_df[0][0]), dtype=np.uint8)
        # decode to a 3-channel image, then resize to the 224x224 network input
        img_res = cv.resize(cv.imdecode(img, cv.IMREAD_COLOR), (224, 224))
        # add a leading batch dimension: (224, 224, 3) becomes (1, 224, 224, 3)
        rgb_img = np.expand_dims(img_res, 0)

        preds = self.keras_model.predict(rgb_img)
        prob = np.max(preds)

        class_id = np.argmax(preds)
        return {"label": self.class_names[class_id][0][0], "prob": "{:.4}".format(prob)}

    def load_context(self, context):
        import scipy.io
        import numpy as np
        import h5py
        import keras
        import cloudpickle
        from keras.models import load_model

        self.results = []
        with open(context.artifacts["cars_meta"], "rb") as file:
            # load the car classes file
            cars_meta = scipy.io.loadmat(file)
            self.class_names = cars_meta['class_names']
            self.class_names = np.transpose(self.class_names)

        with open(context.artifacts["scale_layer"], "rb") as file:
            # restore the pickled custom Scale layer class
            self.scale_layer = cloudpickle.load(file)

        with open(context.artifacts["keras_model"], "rb") as file:
            # open the saved HDF5 model and rebuild it with the custom layer
            f = h5py.File(file.name,'r')
            self.keras_model = load_model(f, custom_objects={"Scale": self.scale_layer})

    def predict(self, context, model_input):
        return self.predict_from_picture(model_input)
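On the client side, the image has to be Base64-encoded and wrapped in JSON before it is sent. The sketch below shows that round trip using only the standard library. The payload layout (a single column named 0 with a single row, matching the img_df[0][0] lookup above) is an assumption; the exact request format depends on the serving layer, and build_payload and extract_image_bytes are hypothetical helper names.

```python
import base64
import json

def build_payload(image_bytes):
    """Wrap raw image bytes in a JSON table with one column and one row."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    # Assumed layout: a pandas "split"-style table holding a single cell.
    return json.dumps({"columns": [0], "data": [[encoded]]})

def extract_image_bytes(payload):
    """Server-side counterpart: recover the original bytes from the payload."""
    table = json.loads(payload)
    return base64.b64decode(table["data"][0][0])

# Placeholder bytes; a real client would read them from an image file.
fake_image = bytes([0xFF, 0xD8, 0xFF, 0xE0]) + b"not a real JPEG"
payload = build_payload(fake_image)
print(extract_image_bytes(payload) == fake_image)  # True
```

Base64 is used because JSON cannot carry raw binary data; the encoding turns the image into plain ASCII text at the cost of roughly a third more bytes on the wire.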

In the next step, we can take this py_model and deploy it to Azure Container Instances through MLflow’s Azure ML integration.

Example Keras model deployed to an Azure Container Instance made possible by MLflow’s Azure ML integration.

Deploy an Image Classification Model in Azure Container Instances

By now we have a trained machine learning model, and have registered a model in our workspace with MLflow in the cloud. As a final step we would like to deploy the model as a web service on Azure Container Instances.

A web service is an image, in this case a Docker image, that encapsulates the scoring logic and the model itself. Here we are using our custom MLflow model representation, which gives us control over how the scoring logic takes in car images from a REST client and how the response is shaped.

import mlflow.azureml
from azureml.core.webservice import AciWebservice, Webservice

# Build an Azure ML Container Image for an MLflow model
# (ws, the Azure ML workspace, and resnet150_latest_run, the MLflow run,
#  are assumed to be defined earlier in the notebook)
azure_image, azure_model = mlflow.azureml.build_image(
                    model_uri="{}/py_model"
                            .format(resnet150_latest_run.info.artifact_uri),
                    image_name="car-resnet150",
                    model_name="car-resnet150",
                    workspace=ws,
                    synchronous=True)

# defining the container specs
aci_config = AciWebservice.deploy_configuration(cpu_cores=3.0, memory_gb=12.0)

webservice = Webservice.deploy_from_image(
    image=azure_image,
    workspace=ws,
    name="car-resnet150",
    deployment_config=aci_config,
    overwrite=True)

# block until the container is up and the service is reachable
webservice.wait_for_deployment()

Container Instances is a great solution for testing and understanding the workflow. For scalable production deployments, consider using Azure Kubernetes Service. For more information, see how to deploy and where.

Getting Started with CNN Image Classification

This article and notebooks demonstrate the main techniques used in setting up an end-to-end workflow training and deploying a Neural Network in production on Azure. The exercises of the linked notebook will walk you through the required steps of creating this inside your own Azure Databricks environment using tools like Keras, Databricks Koalas, MLflow, and Azure ML.

--

Try Databricks for free. Get started today.

The post A Convolutional Neural Network Implementation For Car Classification appeared first on Databricks.

Shrink Training Time and Cost Using NVIDIA GPU-Accelerated XGBoost and Apache Spark™ on Databricks

Guest Blog by Niranjan Nataraja and Karthikeyan Rajendran of NVIDIA. Niranjan Nataraja is a lead data scientist at NVIDIA who specializes in building big data pipelines for data science tasks and creating mathematical models for data center operations and cloud gaming services. Karthikeyan Rajendran is the lead product manager for NVIDIA’s Spark team.

This blog will show how to utilize XGBoost and Spark from Databricks notebooks and the setup steps necessary to take advantage of NVIDIA GPUs to significantly reduce training time and cost. We illustrate the benefits of GPU-acceleration with a real-world use case from NVIDIA’s GeForce NOW team and show you how to enable it in your own notebooks.

About XGBoost

XGBoost is an open source library that provides a gradient boosting framework usable from many programming languages (Python, Java, R, Scala, C++ and more). XGBoost can run on a single machine or on multiple machines under several different distributed processing frameworks (Apache Hadoop, Apache Spark, Apache Flink). XGBoost models can be trained on both CPUs and GPUs. However, data scientists on the GeForce NOW team run into significant challenges with cost and training time when using CPU-based XGBoost.

GeForce NOW Use Case

GeForce NOW is NVIDIA’s cloud-based, game-streaming service, delivering real-time gameplay straight from the cloud to laptops, desktops, SHIELD TVs, or Android devices. Network traffic latency issues can affect a gamer’s user experience. GeForce NOW uses an XGBoost model to predict the network quality of multiple internet transit providers so a gamer’s network traffic can be routed through a transit vendor with the highest predicted network quality. XGBoost models are trained using gaming session network metrics for each internet service provider. GeForce NOW generates billions of events per day for network traffic, consisting of structured and unstructured data. NVIDIA’s big data platform merges data from multiple sources and generates a network traffic data record for each gaming session which is used as training data.

As network traffic varies dramatically over the course of a day, the prediction model needs to be re-trained frequently with the latest GeForce NOW data. Given a myriad of features and large datasets, NVIDIA GeForce NOW data scientists rely upon hyperparameter searches to build highly accurate models. For a dataset of tens of millions of rows and a non-trivial number of features, CPU model training with Hyperopt takes more than 20 hours on a single AWS r5.4xlarge CPU instance. Even with a scale-out approach using 2 CPU server instances, training still takes 6 hours, with spiraling infrastructure costs.

Unleashing the Power of NVIDIA GPU-accelerated XGBoost

A recent NVIDIA developer blog illustrated the significant benefits of GPU-accelerated XGBoost model training. NVIDIA data scientists followed a similar approach to achieve a 22x speed-up and 8x cost savings compared to CPU-based XGBoost. As illustrated in Figure 1, training on a GeForce NOW production network traffic dataset with 40 million rows and 32 features took only 18 minutes on GPU, compared to 3.2 hours (191 minutes) on CPU. In addition, the right-hand side of Figure 1 compares CPU cluster costs and GPU cluster costs, which include both AWS instance and Databricks runtime costs.

Figure 1: Network quality prediction training on GPU vs. CPU

As for model performance, the trained XGBoost models were compared on four different metrics.

  • Root mean squared error
  • Mean absolute error
  • Mean absolute percentage error
  • Correlation coefficient

The NVIDIA GPU-based XGBoost model has similar accuracy in all these metrics.

Now that we have seen the performance and cost savings, next we will discuss the setup and best practices to run a sample XGBoost notebook on a Databricks GPU cluster.

Quick Start on NVIDIA GPU-accelerated XGBoost on Databricks

Databricks supports XGBoost on several ML runtimes. Here is a well-written user guide for running XGBoost on single node and multiple nodes.

To run XGBoost on GPU, you only need to make the following adjustments:

  1. Set up a Spark cluster with GPU instances (instead of CPU instances)
  2. Modify your XGBoost training code to switch `tree_method` parameter from `hist` to `gpu_hist`
  3. Set up data loading

Set Up NVIDIA GPU Cluster for XGBoost Training

To conduct NVIDIA GPU-based XGBoost training, you need to set up your Spark cluster with GPUs and the proper Databricks ML runtime.

  • We used a p2.xlarge (61.0 GB memory, 1 GPU, 1.22 DBU) instance for the driver node and two p3.2xlarge (61.0 GB memory, 1 GPU, 4.15 DBU) instances for the worker nodes.
  • We chose 6.3 ML (includes Apache Spark 2.4.4, GPU, Scala 2.11) as our Databricks runtime version. Any Databricks ML runtime with GPUs should work for running XGBoost on Databricks.

Code Change on `tree_method` Parameter

After starting the cluster, change the `tree_method` parameter in your XGBoost notebook from `hist` to `gpu_hist`.

For CPU-based training:

xgb_reg = xgboost.XGBRegressor(objective='reg:squarederror', ..., tree_method='hist')

For GPU-based training:

xgb_reg = xgboost.XGBRegressor(objective='reg:squarederror', ..., tree_method='gpu_hist')
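Since the two calls differ in a single argument, one way to keep a single code path for both cluster types is to build the parameters from a flag. This is a minimal sketch rather than part of the GeForce NOW notebook: xgb_params is a hypothetical helper, and the max_depth value is purely illustrative.

```python
def xgb_params(use_gpu, **overrides):
    """Build XGBoost keyword arguments that differ only in tree_method."""
    params = {
        "objective": "reg:squarederror",
        "tree_method": "gpu_hist" if use_gpu else "hist",
    }
    params.update(overrides)
    return params

cpu_params = xgb_params(use_gpu=False)
gpu_params = xgb_params(use_gpu=True, max_depth=8)  # max_depth is illustrative
print(cpu_params["tree_method"], gpu_params["tree_method"])  # hist gpu_hist
```

The resulting dictionary could then be unpacked into the estimator, for example xgboost.XGBRegressor(**gpu_params), so that switching between CPU and GPU clusters becomes a one-flag change.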

Getting Started with GPU Model Training

NVIDIA’s GPU-accelerated XGBoost helped GeForce NOW meet the service-level objective of training the model every eight hours, and reduced costs significantly. Switching from CPU-based XGBoost to a GPU-accelerated version was very straightforward. If you’re also struggling with accelerating your training time or reducing your training costs, we encourage you to try it!

Watch this space to learn about new data science use cases that leverage GPUs and Apache Spark 3.0 on Databricks 7.x ML runtimes.

You can find the GeForce NOW PySpark notebook hosted on GitHub. The notebook uses Hyperopt for hyperparameter search and DBFS’s local file interface to load data onto worker nodes.

The post Shrink Training Time and Cost Using NVIDIA GPU-Accelerated XGBoost and Apache Spark™ on Databricks appeared first on Databricks.

Schema Evolution in Merge Operations and Operational Metrics in Delta Lake

Try this notebook to reproduce the steps outlined below

We recently announced the release of Delta Lake 0.6.0, which introduces schema evolution and performance improvements in merge and operational metrics in table history. The key features in this release are:

  • Support for schema evolution in merge operations (#170) – You can now automatically evolve the schema of the table with the merge operation. This is useful in scenarios where you want to upsert change data into a table and the schema of the data changes over time. Instead of detecting and applying schema changes before upserting, merge can simultaneously evolve the schema and upsert the changes. See the documentation for details.
  • Improved merge performance with automatic repartitioning (#349) – When merging into partitioned tables, you can choose to automatically repartition the data by the partition columns before writing to the table. In cases where the merge operation on a partitioned table is slow because it generates too many small files (#345), enabling automatic repartition (spark.delta.merge.repartitionBeforeWrite) can improve performance. See the documentation for details.
  • Improved performance when there is no insert clause (#342) – You can now get better performance in a merge operation if it does not have any insert clause.
  • Operation metrics in DESCRIBE HISTORY (#312) – You can now see operation metrics (for example, number of files and rows changed) for all writes, updates, and deletes on a Delta table in the table history. See the documentation for details.
  • Support for reading Delta tables from any file system (#347) – You can now read Delta tables on any storage system with a Hadoop FileSystem implementation. However, writing to Delta tables still requires configuring a LogStore implementation that gives the necessary guarantees on the storage system. See the documentation for details.

Schema Evolution in Merge Operations

As noted in earlier releases, Delta Lake can execute merge operations, which combine your insert/update/delete operations into a single atomic operation, and can enforce and evolve your schema (more details can also be found in this tech talk). With the release of Delta Lake 0.6.0, you can now evolve your schema within a merge operation.

Let’s showcase this with a timely example; you can find the original code sample in this notebook. We’ll start with a small subset of the 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE dataset, which we have made available in /databricks-datasets. This is a dataset commonly used by researchers and analysts to gain insight into the number of cases of COVID-19 throughout the world. One of the issues with the data is that the schema changes over time.

For example, the files representing COVID-19 cases from March 1st – March 21st (as of April 30th, 2020) have the following schema:

# Import old_data
old_data = (spark.read.option("inferSchema", True).option("header", True)...
.csv("/databricks-datasets/COVID/.../03-21-2020.csv"))
old_data.printSchema()

root
 |-- Province/State: string (nullable = true)
 |-- Country/Region: string (nullable = true)
 |-- Last Update: timestamp (nullable = true)
 |-- Confirmed: integer (nullable = true)
 |-- Deaths: integer (nullable = true)
 |-- Recovered: integer (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)    

But the files from March 22nd onwards (as of April 30th) had additional columns including FIPS, Admin2, Active, and Combined_Key.

# Import new_data
new_data = (spark.read.option("inferSchema", True).option("header", True)...
.csv("/databricks-datasets/COVID/.../04-21-2020.csv"))
new_data.printSchema()

root
 |-- FIPS: integer (nullable = true)
 |-- Admin2: string (nullable = true)
 |-- Province_State: string (nullable = true)
 |-- Country_Region: string (nullable = true)
 |-- Last_Update: string (nullable = true)
 |-- Lat: double (nullable = true)
 |-- Long_: double (nullable = true)
 |-- Confirmed: integer (nullable = true)
 |-- Deaths: integer (nullable = true)
 |-- Recovered: integer (nullable = true)
 |-- Active: integer (nullable = true)
 |-- Combined_Key: string (nullable = true)

In our sample code, we renamed some of the columns (e.g. Long_ -> Longitude, Province/State -> Province_State, etc.) as they are semantically the same.  Instead of evolving the table schema, we simply renamed the columns.
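The renaming step can be pictured as a simple lookup applied to each column name. The mapping below is inferred from the two schemas printed above and is therefore an assumption, not a copy of the notebook; RENAMES and harmonized_names are hypothetical names.

```python
# Assumed mapping, inferred from the old and new schemas shown above.
RENAMES = {
    "Province/State": "Province_State",
    "Country/Region": "Country_Region",
    "Last Update": "Last_Update",
    "Lat": "Latitude",
    "Long_": "Longitude",
}

def harmonized_names(columns, renames=RENAMES):
    """Map each column to its common name, leaving unknown names untouched."""
    return [renames.get(name, name) for name in columns]

old_cols = ["Province/State", "Country/Region", "Last Update", "Confirmed",
            "Deaths", "Recovered", "Latitude", "Longitude"]
new_cols = ["FIPS", "Admin2", "Province_State", "Country_Region", "Last_Update",
            "Lat", "Long_", "Confirmed", "Deaths", "Recovered", "Active",
            "Combined_Key"]

old_h = harmonized_names(old_cols)
new_h = harmonized_names(new_cols)
# Columns that remain genuinely new once the names are harmonized.
added = [name for name in new_h if name not in old_h]
print(added)  # ['FIPS', 'Admin2', 'Active', 'Combined_Key']
```

On a Spark DataFrame, each pair in such a mapping could be applied with withColumnRenamed. After harmonizing, only the four columns that are truly new are left for schema evolution to handle.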

If the key concern was just merging the schemas together, we could use Delta Lake’s schema evolution feature using the “mergeSchema” option in DataFrame.write(), as shown in the following statement.

new_data.write.option("mergeSchema", "true").mode("append").save(path)

But what happens if you need to update an existing value and merge the schema at the same time? With Delta Lake 0.6.0, this can be achieved with schema evolution for merge operations. To visualize this, let’s start by reviewing the old_data which is one row.

old_data.select("process_date", "Province_State", "Country_Region", "Last_Update", "Confirmed").show()
+------------+--------------+--------------+-------------------+---------+
|process_date|Province_State|Country_Region|        Last_Update|Confirmed|
+------------+--------------+--------------+-------------------+---------+
|  2020-03-21|    Washington|            US|2020-03-21 22:43:04|     1793|
+------------+--------------+--------------+-------------------+---------+

Next, let’s simulate an update entry that follows the schema of new_data:

# Simulate an Updated Entry
items = [(53, '', 'Washington', 'US', '2020-04-27T19:00:00', 47.4009, -121.4905, 1793, 94, 0, '', '', '2020-03-21', 2)]
cols = ['FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Last_Update', 'Latitude', 'Longitude', 'Confirmed', 'Deaths', 'Recovered', 'Active', 'Combined_Key', 'process_date', 'level']
simulated_update = spark.createDataFrame(items, cols)

Then union simulated_update with new_data, for a total of 40 rows.

new_data.select("process_date", "FIPS", "Province_State", "Country_Region", "Last_Update", "Confirmed").sort(col("FIPS")).show(5)
+------------+-----+--------------+--------------+-------------------+---------+
|process_date| FIPS|Province_State|Country_Region|        Last_Update|Confirmed|
+------------+-----+--------------+--------------+-------------------+---------+
|  2020-03-21|   53|    Washington|            US|2020-04-27T19:00:00|     1793|
|  2020-04-11|53001|    Washington|            US|2020-04-11 22:45:33|       30|
|  2020-04-11|53003|    Washington|            US|2020-04-11 22:45:33|        4|
|  2020-04-11|53005|    Washington|            US|2020-04-11 22:45:33|      244|
|  2020-04-11|53007|    Washington|            US|2020-04-11 22:45:33|       53|
+------------+-----+--------------+--------------+-------------------+---------+

Set the following parameter to configure your environment for automatic schema evolution:

# Enable automatic schema evolution
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true") 

Now we can run a single atomic operation to update the values (from 3/21/2020) as well as merge together the new schema with the following statement.

from delta.tables import *
deltaTable = DeltaTable.forPath(spark, DELTA_PATH)

# Schema Evolution with a Merge Operation
deltaTable.alias("t").merge(
    new_data.alias("s"),
    "s.process_date = t.process_date AND s.province_state = t.province_state AND s.country_region = t.country_region AND s.level = t.level"
).whenMatchedUpdateAll(  
).whenNotMatchedInsertAll(
).execute()
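To build intuition for what the merge above does, here is a plain-Python model of its semantics: matched rows are updated, unmatched rows are inserted, and columns that exist only in the source are added to the schema. It illustrates the behavior, not how Delta Lake implements it; merge_with_schema_evolution is a hypothetical function and the rows are a reduced version of the example data.

```python
def merge_with_schema_evolution(target, source, key_cols):
    """Upsert `source` rows into `target` (lists of dicts), widening the schema."""
    # Evolved schema: target columns first, then columns only the source has.
    columns = list(target[0].keys()) if target else []
    for row in source:
        for name in row:
            if name not in columns:
                columns.append(name)

    def key(row):
        return tuple(row.get(c) for c in key_cols)

    # Existing rows get None for any column they did not have before.
    merged = {key(row): {c: row.get(c) for c in columns} for row in target}
    stats = {"updated": 0, "inserted": 0}
    for row in source:
        k = key(row)
        if k in merged:
            merged[k].update(row)      # matched: overwrite with source values
            stats["updated"] += 1
        else:
            merged[k] = {c: row.get(c) for c in columns}  # not matched: insert
            stats["inserted"] += 1
    return list(merged.values()), stats

target = [{"process_date": "2020-03-21", "Province_State": "Washington",
           "Confirmed": 1793}]
source = [
    {"process_date": "2020-03-21", "Province_State": "Washington",
     "Confirmed": 1793, "FIPS": 53},
    {"process_date": "2020-04-11", "Province_State": "Washington",
     "Confirmed": 30, "FIPS": 53001},
]
rows, stats = merge_with_schema_evolution(
    target, source, ["process_date", "Province_State"])
print(stats)  # {'updated': 1, 'inserted': 1}
```

The existing 2020-03-21 row gains a FIPS value in the same pass that adds the FIPS column, which is the combination a plain append with mergeSchema cannot express.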

Let’s review the Delta Lake table with the following statement:

# Load the data
from pyspark.sql.functions import col

(spark.read.format("delta").load(DELTA_PATH)
    .select("process_date", "FIPS", "Province_State", "Country_Region", "Last_Update", "Confirmed", "Admin2")
    .sort(col("FIPS"))
    .show())

+------------+-----+--------------+--------------+-------------------+---------+------+
|process_date| FIPS|Province_State|Country_Region|        Last_Update|Confirmed|Admin2|
+------------+-----+--------------+--------------+-------------------+---------+------+
|  2020-03-21|   53|    Washington|            US|2020-04-27T19:00:00|     1793|      |
|  2020-04-11|53001|    Washington|            US|2020-04-11 22:45:33|       30| Adams|
|  2020-04-11|53003|    Washington|            US|2020-04-11 22:45:33|        4|Asotin|
|  2020-04-11|53005|    Washington|            US|2020-04-11 22:45:33|      244|Benton|
|  2020-04-11|53007|    Washington|            US|2020-04-11 22:45:33|       53|Chelan|
+------------+-----+--------------+--------------+-------------------+---------+------+

Operational Metrics

You can further dive into the operational metrics by looking at the Delta Lake Table History (operationMetrics column) in the Spark UI by running the following statement:

deltaTable.history().show()

Below is an abbreviated output from the preceding command.

+-------+------+---------+--------------------+
|version|userId|operation|    operationMetrics|
+-------+------+---------+--------------------+
|      1|100802|    MERGE|[numTargetRowsCop...|
|      0|100802|    WRITE|[numFiles -> 1, n...|
+-------+------+---------+--------------------+

You will notice two versions of the table, one for the old schema and another for the new schema. The operational metrics below show that 39 rows were inserted and 1 row was updated.

{
    "numTargetRowsCopied":"0",
    "numTargetRowsDeleted":"0",
    "numTargetFilesAdded":"3",
    "numTargetRowsInserted":"39",
    "numTargetRowsUpdated":"1",
    "numOutputRows":"40",
    "numSourceRows":"40",
    "numTargetFilesRemoved":"1"
}
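The metrics arrive as strings, so a quick way to sanity-check them is to parse the JSON and compare the counts. The snippet below does this for the values shown above. It is a sketch for this particular output, and the variable names are our own.

```python
import json

metrics_json = """
{
    "numTargetRowsCopied": "0",
    "numTargetRowsDeleted": "0",
    "numTargetFilesAdded": "3",
    "numTargetRowsInserted": "39",
    "numTargetRowsUpdated": "1",
    "numOutputRows": "40",
    "numSourceRows": "40",
    "numTargetFilesRemoved": "1"
}
"""

# Convert every value from its string form to an integer.
metrics = {name: int(value) for name, value in json.loads(metrics_json).items()}

rows_written = (metrics["numTargetRowsInserted"]
                + metrics["numTargetRowsUpdated"]
                + metrics["numTargetRowsCopied"])
print(rows_written)  # 40
print(rows_written == metrics["numOutputRows"])  # True
```

For this merge, the 39 inserted rows and the 1 updated row account for all 40 source rows and all 40 output rows, which matches the union of new_data and the simulated update.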

You can understand more about the details behind these operational metrics by going to the SQL tab within the Spark UI.

An example of the operational metrics now available for review in the Spark UI through Delta Lake 0.6.0

The animated GIF calls out the main components of the Spark UI for your review.

  1. 39 initial rows from one file (for 4/11/2020 with the new schema) that created the initial new_data DataFrame
  2. 1 simulated update row generated that would union with the new_data DataFrame
  3. 1 row from the one file (for 3/21/2020 with the old schema) that created the old_data DataFrame.
  4. A SortMergeJoin used to join the two DataFrames together to be persisted in our Delta Lake table.

To dive further into how to interpret these operational metrics, check out the Diving into Delta Lake Part 3: How do DELETE, UPDATE, and MERGE work tech talk.

Get Started with Delta Lake 0.6.0

Try out Delta Lake with the preceding code snippets on your Apache Spark 2.4.5 (or greater) instance (on Databricks, try this with DBR 6.6+). Delta Lake makes your data lakes more reliable (whether you create a new one or migrate an existing data lake).  To learn more, refer to https://delta.io/, and join the Delta Lake community via Slack and Google Group.  You can track all the upcoming releases and planned features in GitHub milestones. You can also try out Managed Delta Lake on Databricks with a free account.

Credits

We want to thank the following contributors for updates, doc changes, and contributions in Delta Lake 0.6.0: Ali Afroozeh, Andrew Fogarty, Anurag870, Burak Yavuz, Erik LaBianca, Gengliang Wang, IonutBoicuAms, Jakub Orłowski, Jose Torres, KevinKarlBob, Michael Armbrust, Pranav Anand, Rahul Govind, Rahul Mahadev, Shixiong Zhu, Steve Suh, Tathagata Das, Timothy Zhang, Tom van Bussel, Wesley Hoffman, Xiao Li, chet, Eugene Koifman, Herman van Hovell, hongdd, lswyyy, lys0716, Mahmoud Mahdi, Maryann Xue

The post Schema Evolution in Merge Operations and Operational Metrics in Delta Lake appeared first on Databricks.


Manage and Scale Machine Learning Models for IoT Devices

A common data science internet of things (IoT) use case involves training machine learning models on real-time data coming from an army of IoT sensors. Some use cases demand that each connected device has its own individual model, since many basic per-device models often outperform a single complex model. We see this in supply chain optimization, predictive maintenance, electric vehicle charging, smart home management, or any number of other use cases. The problem is this:

  • The overall IoT data is so large that it won’t fit on any one machine
  • The per device data does fit on a single machine
  • An individual model is needed for each device
  • The data science team is implementing using single node libraries like sklearn and pandas, so they need low friction in distributing their single-machine proof of concept

In this blog, we demonstrate how you solve this problem with two distinct schemes for each IoT device: Model Training and Model Scoring.

The Multi-IoT Device ML Solution

This is a canonical big data problem. IoT devices such as weather sensors and vehicles produce an awe-inspiring amount of data points. Single-machine solutions won’t scale to a problem of this complexity and often don’t integrate as well into production environments. And data science teams don’t want to worry about whether the DataFrame they’re using is a single-machine pandas object or is distributed by Apache Spark. And one more thing: we need to log our models and their performance somewhere for reproducibility, monitoring, and deployment.

Here are the two schemes we need to solve this problem:

  • Model Training: create a function that takes the data for a single device as an input, trains the model, and logs the resulting model and any evaluation metrics using MLflow, an open source platform for the machine learning lifecycle
  • Model Scoring: create a second function that pulls the trained model for that device from MLflow, applies it, and returns the predictions

With these abstractions in place, we only have to convert our functions into Pandas UDFs in order to distribute them with Spark. A Pandas UDF allows for the efficient distribution of arbitrary Python code within a Spark job, so otherwise serial operations can run in parallel. We will then have taken a single-node solution and made it embarrassingly parallel.
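The pattern behind this is split-apply-combine: split the records by device, apply the same function to every group independently, and combine the results. The sketch below imitates it with the standard library only. A thread pool stands in for Spark executors, and the "model" simply predicts the mean label; both are assumptions made to keep the example self-contained, and train_for_device and train_all are hypothetical names.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def train_for_device(device_id, rows):
    """Stand-in for per-device training: a 'model' that predicts the mean label."""
    labels = [r["label"] for r in rows]
    prediction = mean(labels)
    mse = mean((y - prediction) ** 2 for y in labels)
    return {"device_id": device_id, "n_used": len(rows), "mse": mse}

def train_all(records, max_workers=4):
    # Split: one group of records per device.
    groups = defaultdict(list)
    for r in records:
        groups[r["device_id"]].append(r)
    # Apply: groups share no state, so they can be trained concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(train_for_device, d, rows)
                   for d, rows in groups.items()]
        results = [f.result() for f in futures]
    # Combine: one summary row per device.
    return sorted(results, key=lambda r: r["device_id"])

records = [{"device_id": i % 3, "label": float(i)} for i in range(12)]
summary = train_all(records)
for row in summary:
    print(row)
```

Because no group depends on any other, adding workers (or, with Spark, adding executors) shortens the wall-clock time without changing the per-device code.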

IoT Model Training

Now let’s take a closer look at model training. Start with some dummy data. We have a fleet of connected devices, a number of samples for each, a few features, and a label we’re looking to predict. As is often the case with IoT devices, the featurization steps can be done with Spark to leverage its scalability.

import pyspark.sql.functions as f

df = (spark.range(10000*1000)
    .select(f.col("id").alias("record_id"), (f.col("id")%10).alias("device_id"))
    .withColumn("feature_1", f.rand() * 1)
    .withColumn("feature_2", f.rand() * 2)
    .withColumn("feature_3", f.rand() * 3)
    .withColumn("label", (f.col("feature_1") + f.col("feature_2") + f.col("feature_3")) + f.rand())
)    

Next we need to define the schema that our training function will return. We want to return the device ID, the number of records used in the training, the path to the model, and an evaluation metric.

import pyspark.sql.types as t

trainReturnSchema = t.StructType([
    t.StructField('device_id', t.IntegerType()), # unique device ID
    t.StructField('n_used', t.IntegerType()),    # number of records used in training
    t.StructField('model_path', t.StringType()), # path to the model for a given device
    t.StructField('mse', t.FloatType())          # metric for model performance
])    

Define a Pandas UDF that takes a pandas DataFrame for one group of data as an input and returns model metadata as its output.

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

@f.pandas_udf(trainReturnSchema, functionType=f.PandasUDFType.GROUPED_MAP)
def train_model(df_pandas):
    '''
    Trains an sklearn model on grouped instances
    '''
    # Pull metadata
    device_id = df_pandas['device_id'].iloc[0]
    n_used = df_pandas.shape[0]
    run_id = df_pandas['run_id'].iloc[0] # Pulls run ID to do a nested run
    
    # Train the model
    X = df_pandas[['feature_1', 'feature_2', 'feature_3']]
    y = df_pandas['label']
    rf = RandomForestRegressor()
    rf.fit(X, y)

    # Evaluate the model
    predictions = rf.predict(X)
    mse = mean_squared_error(y, predictions) # Note we could add a train/test split
    
    # Resume the top-level training
    with mlflow.start_run(run_id=run_id):
        # Create a nested run for the specific device
        with mlflow.start_run(run_name=str(device_id), nested=True) as run:
            mlflow.sklearn.log_model(rf, str(device_id))
            mlflow.log_metric("mse", mse)
            
            artifact_uri = f"runs:/{run.info.run_id}/{device_id}"
            # Create a return pandas DataFrame that matches the schema above
            returnDF = pd.DataFrame([[device_id, n_used, artifact_uri, mse]], 
                columns=["device_id", "n_used", "model_path", "mse"])

    return returnDF     
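The training function above scores the model on the same rows it was trained on, and its own comment notes that a train/test split could be added. The sketch below shows one possible way to do that for a single device's pandas DataFrame. The helper name holdout_mse is hypothetical, and ordinary least squares is used as a lightweight stand-in for the random forest so that the sketch depends only on NumPy and pandas.

```python
import numpy as np
import pandas as pd

def holdout_mse(pdf: pd.DataFrame, features, label, test_frac=0.2, seed=0):
    """Split one device's rows into train/test and return the test-set MSE."""
    test = pdf.sample(frac=test_frac, random_state=seed)
    train = pdf.drop(test.index)

    def design(frame):
        # Feature matrix with an intercept column appended.
        return np.column_stack([frame[features].to_numpy(), np.ones(len(frame))])

    # Stand-in model: ordinary least squares instead of a random forest.
    coef, *_ = np.linalg.lstsq(design(train), train[label].to_numpy(), rcond=None)
    residuals = design(test) @ coef - test[label].to_numpy()
    return float(np.mean(residuals ** 2)), len(train), len(test)

# Dummy data for one device, loosely shaped like the data generated earlier.
rng = np.random.default_rng(42)
pdf = pd.DataFrame({"feature_1": rng.random(100), "feature_2": rng.random(100) * 2})
pdf["label"] = pdf["feature_1"] + pdf["feature_2"] + rng.random(100)

mse, n_train, n_test = holdout_mse(pdf, ["feature_1", "feature_2"], "label")
```

The same split could be placed inside the Pandas UDF before fitting, so that the logged metric reflects rows the model has not seen.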

IoT Device Model Logging with Nested Runs in MLflow

The MLflow tracking package allows us to log different aspects of the machine learning development process. In our case, we will create a run (or one execution of machine learning code) for each of our devices. We will aggregate these runs together using one parent run.

This also allows us to see if any individual models are less performant than others. We simply need to add the logging logic in the Pandas UDF, as seen above. Even though this code will be executing on the worker nodes of the cluster, if we start the parent run before we start the nested run, we’ll still be able to log these models together.

We could just query MLflow to get the URIs for each model back. Returning the URI from the Pandas UDF instead makes the whole pipeline a bit easier to stitch together.

Parallel Training

Now we just need to apply the Grouped Map Pandas UDF. As long as the data for any given device will fit on a node of the Spark cluster, we can distribute the training. First make the MLflow parent run and then apply the Pandas UDF using a groupby and then an apply.

with mlflow.start_run(run_name="Training session for all devices") as run:
    run_id = run.info.run_id

    modelDirectoriesDF = (df
        .withColumn("run_id", f.lit(run_id)) # Add run_id
        .groupby("device_id")
        .apply(train_model)
    )

combinedDF = (df
    .join(modelDirectoriesDF, on="device_id", how="left")
)

And there you go! A model has now been trained and logged for each device.

IoT Model Scoring

Now for the scoring. The optimization trick here is to make sure that we only fetch the model once for each device, limiting the communication overhead. Then we apply the model as we would in a single-machine context and return a pandas DataFrame of the record IDs and their predictions.

applyReturnSchema = t.StructType([
    t.StructField('record_id', t.IntegerType()),
    t.StructField('prediction', t.FloatType())
])

@f.pandas_udf(applyReturnSchema, functionType=f.PandasUDFType.GROUPED_MAP)
def apply_model(df_pandas):
    '''
    Applies model to data for a particular device, represented as a pandas DataFrame
    '''
    model_path = df_pandas['model_path'].iloc[0]

    input_columns = ['feature_1', 'feature_2', 'feature_3']
    X = df_pandas[input_columns]

    model = mlflow.sklearn.load_model(model_path)
    prediction = model.predict(X)

    returnDF = pd.DataFrame({
        "record_id": df_pandas['record_id'],
        "prediction": prediction
    })
    return returnDF

predictionDF = combinedDF.groupby("device_id").apply(apply_model)  

Note that in each case we’re using a Grouped Map Pandas UDF. In the first case, we take a group as an input and return one row for each device (a many-to-one mapping). In this case, we take a group as an input and return one prediction per row (a one-to-one mapping). A Grouped Map Pandas UDF allows for both approaches.
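To make the "fetch the model once per device" point concrete, here is a small pandas-only sketch. The loader fake_load_model and the runs:/demo/... paths are made up for illustration and stand in for an expensive fetch such as mlflow.sklearn.load_model; the loop over groups is a serial stand-in for what Spark does with the grouped data.

```python
import pandas as pd

load_calls = []  # records every (simulated) model fetch

def fake_load_model(model_path: str):
    """Hypothetical stand-in for an expensive model fetch."""
    load_calls.append(model_path)
    offset = float(model_path.rsplit("/", 1)[-1])  # toy "model": add the device number
    return lambda X: X["feature_1"] + offset

def apply_model_local(pdf: pd.DataFrame) -> pd.DataFrame:
    # One fetch per group, then a vectorized predict over all of the group's rows.
    model = fake_load_model(pdf["model_path"].iloc[0])
    return pd.DataFrame({"record_id": pdf["record_id"],
                         "prediction": model(pdf)})

combined = pd.DataFrame({
    "record_id": range(6),
    "device_id": [0, 0, 0, 1, 1, 1],
    "feature_1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
combined["model_path"] = "runs:/demo/" + combined["device_id"].astype(str)

scored = pd.concat(
    [apply_model_local(g) for _, g in combined.groupby("device_id")],
    ignore_index=True,
)
```

Six rows are scored with only two fetches, one per device, which is the overhead the grouped approach is meant to limit.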

Conclusion

So there you have individualized models trained across an army of IoT devices. This supports the idea that many basic models generally outperform a single, more complex model. Even if this is generally the case, there might be some individual models that perform below average due in part to limited or missing data for that device. Here are some ideas for taking this further:

  • Using the number of records available in the training and the evaluation metric, you can easily delineate the individual models that perform well versus the models that perform poorly. You can use this information to toggle between a per-device model and a model trained on the entire fleet.
  • You could also train an ensemble model that takes the predictions from a per-device model, the predictions from a fleet-wide model, and metadata like the evaluation metrics and number of records per device. This would create a final prediction that would improve the under-performing individual models.
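The first idea above can be sketched in a few lines of pandas. The summary table mirrors the columns returned by the training function, but its values and the two thresholds (MIN_RECORDS, MAX_MSE) are invented for illustration and would need to be chosen for a real fleet.

```python
import pandas as pd

# Hypothetical training summary, shaped like the output of the training UDF.
summary = pd.DataFrame({
    "device_id": [0, 1, 2, 3],
    "n_used":    [1000, 12, 950, 1100],
    "mse":       [0.08, 0.02, 0.45, 0.09],
})

MIN_RECORDS = 100   # assumed threshold: too few rows to trust a per-device model
MAX_MSE = 0.20      # assumed threshold: per-device error considered acceptable

# Use the per-device model only when it has enough data and a low enough error.
use_device_model = (summary["n_used"] >= MIN_RECORDS) & (summary["mse"] <= MAX_MSE)
summary["model_choice"] = use_device_model.map({True: "per_device", False: "fleet"})
```

Device 1 falls back to the fleet-wide model because it has too few records, even though its training error looks low, and device 2 falls back because its error is too high.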

Get Started with MLflow for IoT Devices

Ready to try it out yourself? You can see the full example used in this blog post in a runnable notebook on AWS or Azure.

If you are new to MLflow, read the MLflow quickstart with the latest MLflow release. For production use cases, read about Managed MLflow on Databricks.

The post Manage and Scale Machine Learning Models for IoT Devices appeared first on Databricks.

New Pandas UDFs and Python Type Hints in the Upcoming Release of Apache Spark 3.0™

Pandas user-defined functions (UDFs) are one of the most significant enhancements in Apache Spark for data science. They bring many benefits, such as enabling users to use Pandas APIs and improving performance.

However, Pandas UDFs have evolved organically over time, which has led to some inconsistencies and is creating confusion among users. The full release of Apache Spark 3.0, expected soon, will introduce a new interface for Pandas UDFs that leverages Python type hints to address the proliferation of Pandas UDF types and help them become more Pythonic and self-descriptive.

This blog post introduces new Pandas UDFs with Python type hints, and the new Pandas Function APIs including grouped map, map, and co-grouped map.

Pandas UDFs

Pandas UDFs were introduced in Spark 2.3 (see also Introducing Pandas UDF for PySpark). Pandas is well known to data scientists and integrates seamlessly with many Python libraries and packages such as NumPy, statsmodels, and scikit-learn. Pandas UDFs allow data scientists not only to scale out their workloads, but also to leverage the Pandas APIs in Apache Spark.

The user-defined functions are executed by:

  • Apache Arrow, to exchange data directly between JVM and Python driver/executors with near-zero (de)serialization cost.
  • Pandas inside the function, to work with Pandas instances and APIs.

Pandas UDFs work with Pandas APIs inside the function and use Apache Arrow to exchange data. This allows vectorized operations that can increase performance by up to 100x compared to row-at-a-time Python UDFs.

The example below shows a Pandas UDF that simply adds one to each value. It is defined as a function called pandas_plus_one, decorated with pandas_udf, with the Pandas UDF type specified as PandasUDFType.SCALAR.

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    # `v` is a pandas Series
    return v.add(1)  # outputs a pandas Series

spark.range(10).select(pandas_plus_one("id")).show()

The Python function takes and outputs a Pandas Series. You can perform a vectorized operation for adding one to each value by using the rich set of Pandas APIs within this function. (De)serialization is also automatically vectorized by leveraging Apache Arrow under the hood.
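The difference between row-at-a-time and vectorized execution can be seen without Spark at all. The following is a minimal pandas-only sketch; plus_one_row is a hypothetical helper that plays the role of a regular Python UDF body.

```python
import pandas as pd

v = pd.Series(range(5), dtype="float64")

# Row-at-a-time: the Python function is called once per value.
def plus_one_row(x: float) -> float:
    return x + 1

row_result = pd.Series([plus_one_row(x) for x in v])

# Vectorized: one call operates on the whole Series at once.
vectorized_result = v.add(1)
```

Both produce the same values; the vectorized form simply avoids one Python function call per row, which is where the savings come from on large batches.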

Python Type Hints

Python type hints were officially introduced in PEP 484 with Python 3.5. Type hinting is an official way to statically indicate the type of a value in Python. See the example below.

def greeting(name: str) -> str:
    return 'Hello ' + name

The name: str annotation indicates that the name argument is of type str, and the -> syntax indicates that the greeting() function returns a string.

Python type hints bring two significant benefits to the PySpark and Pandas UDF context.

  • It gives a clear definition of what the function is supposed to do, making it easier for users to understand the code. For example, without a type hint, users cannot know whether greeting accepts None unless that is documented. Type hints can remove the need to document such subtle cases with a bunch of test cases, or for users to test and figure them out by themselves.
  • It can make it easier to perform static analysis. IDEs such as PyCharm and Visual Studio Code can leverage type annotations to provide code completion, show errors, and support better go-to-definition functionality.
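As a small illustration of both points, the sketch below uses only the standard typing module. The second function, greeting_or_default, is a hypothetical variant added here to show how a hint can state explicitly that None is accepted; get_type_hints shows that the annotations are available to tools at runtime.

```python
from typing import Optional, get_type_hints

def greeting(name: str) -> str:
    return 'Hello ' + name

def greeting_or_default(name: Optional[str]) -> str:
    # Optional[str] states explicitly that None is an accepted input.
    return 'Hello ' + (name if name is not None else 'stranger')

# Annotations can be read back programmatically, e.g. by IDEs or frameworks.
hints = get_type_hints(greeting)
optional_hints = get_type_hints(greeting_or_default)
```

Here hints maps name and return to str, while optional_hints records that name may also be None.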

Proliferation of Pandas UDF Types

Since the release of Apache Spark 2.3, a number of new Pandas UDFs have been implemented, making it difficult for users to learn about the new specifications and how to use them. For example, here are three Pandas UDFs that output virtually the same results:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    # `v` is a pandas Series
    return v + 1  # outputs a pandas Series

spark.range(10).select(pandas_plus_one("id")).show()

from pyspark.sql.functions import pandas_udf, PandasUDFType

# New type of Pandas UDF in Spark 3.0.
@pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(itr):
    # `itr` is an iterator of pandas Series.
    return map(lambda v: v + 1, itr)  # outputs an iterator of pandas Series.

spark.range(10).select(pandas_plus_one("id")).show()

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("id long", PandasUDFType.GROUPED_MAP)
def pandas_plus_one(pdf):
    # `pdf` is a pandas DataFrame
    return pdf + 1  # outputs a pandas DataFrame

# `pandas_plus_one` can _only_ be used with `groupby(...).apply(...)`
spark.range(10).groupby('id').apply(pandas_plus_one).show()

Although each of these UDF types has a distinct purpose, several can be applicable. In this simple case, you could use any of the three. However, each of the Pandas UDFs expects different input and output types, and works in a different way with a distinct semantic and different performance. It confuses users about which one to use and learn, and how each works.

Furthermore, pandas_plus_one in the first and second cases can be used wherever regular PySpark columns are used, for example as the argument of withColumn or combined with other expressions, as in pandas_plus_one("id") + 1. However, the last pandas_plus_one can only be used with groupby(...).apply(pandas_plus_one).

This level of complexity has triggered numerous discussions among Spark developers and drove the effort to introduce the new Pandas APIs with Python type hints via an official proposal. The goal is to enable users to naturally express their Pandas UDFs using Python type hints, without the confusion seen in the problematic cases above. For example, the cases above can be written as below:

from typing import Iterator
import pandas as pd

def pandas_plus_one(v: pd.Series) -> pd.Series:
    return v + 1

def pandas_plus_one(itr: Iterator[pd.Series]) -> Iterator[pd.Series]:
    return map(lambda v: v + 1, itr)

def pandas_plus_one(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf + 1

New Pandas APIs with Python Type Hints

To address the complexity in the old Pandas UDFs, from Apache Spark 3.0 with Python 3.6 and above, Python type hints such as pandas.Series, pandas.DataFrame, Tuple, and Iterator can be used to express the new Pandas UDF types.

In addition, the old Pandas UDFs were split into two API categories: Pandas UDFs and Pandas Function APIs. Although they work internally in a similar way, there are distinct differences.

You can treat Pandas UDFs in the same way that you use other PySpark column instances. However, you cannot use the Pandas Function APIs with these column instances. Here are these two examples:

# Pandas UDF
import pandas as pd
from pyspark.sql.functions import pandas_udf, log2, col

@pandas_udf('long')
def pandas_plus_one(s: pd.Series) -> pd.Series:
    return s + 1

# pandas_plus_one("id") is identically treated as _a SQL expression_ internally.
# Namely, you can combine with other columns, functions and expressions.
spark.range(10).select(
    pandas_plus_one(col("id") - 1) + log2("id") + 1).show()

# Pandas Function API
from typing import Iterator
import pandas as pd


def pandas_plus_one(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    return map(lambda v: v + 1, iterator)


# pandas_plus_one is just a regular Python function, and mapInPandas is
# logically treated as _a separate SQL query plan_ instead of a SQL expression. 
# Therefore, direct interactions with other expressions are impossible.
spark.range(10).mapInPandas(pandas_plus_one, schema="id long").show()

Also, note that Pandas UDFs require Python type hints whereas the type hints in Pandas Function APIs are currently optional. Type hints are planned for Pandas Function APIs and may be required at some point in the future.

New Pandas UDFs

Instead of defining and specifying each Pandas UDF type manually, the new Pandas UDFs infer the Pandas UDF type from the Python type hints given on the Python function. There are currently four supported cases of the Python type hints in Pandas UDFs:

  • Series to Series
  • Iterator of Series to Iterator of Series
  • Iterator of Multiple Series to Iterator of Series
  • Series to Scalar (a single value)
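To give a feel for how a UDF type can be inferred from hints, here is a toy classifier for the four cases listed above. It is only a sketch built on the standard typing helpers (Python 3.8+); the function infer_udf_kind is hypothetical and is not how Spark itself implements the inference.

```python
import collections.abc
from typing import Iterator, Tuple, get_args, get_origin, get_type_hints
import pandas as pd

def infer_udf_kind(func) -> str:
    """Toy classifier for the four hint patterns (illustrative, not Spark's code)."""
    hints = get_type_hints(func)
    ret = hints.pop("return")
    params = list(hints.values())

    # One or more plain Series inputs: the return hint decides the kind.
    if params and all(p is pd.Series for p in params):
        return "Series to Series" if ret is pd.Series else "Series to Scalar"

    # A single Iterator[...] input that returns Iterator[pd.Series].
    if len(params) == 1 and get_origin(params[0]) is collections.abc.Iterator:
        (item,) = get_args(params[0])
        if ret == Iterator[pd.Series]:
            if item is pd.Series:
                return "Iterator of Series to Iterator of Series"
            if get_origin(item) is tuple:
                return "Iterator of Multiple Series to Iterator of Series"

    raise TypeError("unsupported type hint combination")

def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

def plus_one_iter(it: Iterator[pd.Series]) -> Iterator[pd.Series]:
    return map(lambda s: s + 1, it)

def multiply(it: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    return (a * b for a, b in it)

def mean(v: pd.Series) -> float:
    return float(v.mean())

kinds = [infer_udf_kind(f) for f in (plus_one, plus_one_iter, multiply, mean)]
```

The point of the sketch is that the function signature alone carries enough information to pick the execution mode, which is why no explicit PandasUDFType is needed.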

Before we do a deep dive into each case, let’s look at three key points about working with the new Pandas UDFs.

  • Although Python type hints are optional in the Python world in general, you must specify Python type hints for the input and output in order to use the new Pandas UDFs.
  • Users can still use the old way by manually specifying the Pandas UDF type. However, using Python type hints is encouraged.
  • The type hint should use pandas.Series in most cases. The one exception is when an input or output column is of StructType: in that case, use pandas.DataFrame for the corresponding input or output type hint instead.

    Take a look at the example below:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    
    
    df = spark.createDataFrame(
        [[1, "a string", ("a nested string",)]],
        "long_col long, string_col string, struct_col struct<col1:string>")
    
    @pandas_udf("col1 string, col2 long")
    def pandas_plus_len(
            s1: pd.Series, s2: pd.Series, pdf: pd.DataFrame) -> pd.DataFrame:
        # Regular columns are series and the struct column is a DataFrame.
        pdf['col2'] = s1 + s2.str.len() 
        return pdf  # the struct column expects a DataFrame to return
    
    df.select(pandas_plus_len("long_col", "string_col", "struct_col")).show()
    

Series to Series

Series to Series maps to the scalar Pandas UDF introduced in Apache Spark 2.3. The type hints can be expressed as pandas.Series, ... -> pandas.Series. The given function is expected to take one or more pandas.Series and output one pandas.Series. The output length is expected to be the same as the input length.

import pandas as pd
from pyspark.sql.functions import pandas_udf       

@pandas_udf('long')
def pandas_plus_one(s: pd.Series) -> pd.Series:
    return s + 1

spark.range(10).select(pandas_plus_one("id")).show()

The example above can be mapped to the old style with scalar Pandas UDF, as below.

from pyspark.sql.functions import pandas_udf, PandasUDFType                                                                                                                                                                                                                                                                                                                                                                                                                   

@pandas_udf('long', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return v + 1

spark.range(10).select(pandas_plus_one("id")).show()

Iterator of Series to Iterator of Series

This is a new type of Pandas UDF coming in Apache Spark 3.0. It is a variant of Series to Series, and the type hints can be expressed as Iterator[pd.Series] -> Iterator[pd.Series]. The function takes and outputs an iterator of pandas.Series.

The length of the whole output must be the same as the length of the whole input. The function can therefore prefetch data from the input iterator, as long as the total input and output lengths match. The given function should take a single column as input.

from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf       

@pandas_udf('long')
def pandas_plus_one(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    return map(lambda s: s + 1, iterator)

spark.range(10).select(pandas_plus_one("id")).show()

It is also useful when the UDF execution requires expensive initialization of some state. The pseudocode below illustrates the case.

@pandas_udf("long")
def calculate(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Do some expensive initialization with a state
    state = very_expensive_initialization()
    for x in iterator:
        # Use that state for the whole iterator.
        yield calculate_with_state(x, state)

df.select(calculate("value")).show()
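The pattern in the pseudocode can be tried locally with plain pandas. In the sketch below, very_expensive_initialization is a hypothetical stand-in that just returns an offset and records how often it was called, and the list of batches imitates what Spark might hand to the function; no Spark session is involved.

```python
from typing import Iterator
import pandas as pd

init_calls = []  # records each (simulated) expensive initialization

def very_expensive_initialization() -> int:
    """Hypothetical stand-in for loading a model or opening a connection."""
    init_calls.append(1)
    return 100  # the "state": here just an offset

def calculate(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    state = very_expensive_initialization()  # runs once, before the first batch
    for x in iterator:
        yield x + state                      # reuse the state for every batch

# Three batches of different sizes.
batches = [pd.Series([1, 2]), pd.Series([3]), pd.Series([4, 5, 6])]
results = list(calculate(iter(batches)))
```

The initialization runs once for all three batches, and the total output length equals the total input length, as the iterator variant requires.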

Iterator of Series to Iterator of Series can also be mapped to the old Pandas UDF style. See the example below.

from pyspark.sql.functions import pandas_udf, PandasUDFType                                                                                                                                                                                                                   

@pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(iterator):
    return map(lambda s: s + 1, iterator)

spark.range(10).select(pandas_plus_one("id")).show()

Iterator of Multiple Series to Iterator of Series

This type of Pandas UDF will also be introduced in Apache Spark 3.0, together with Iterator of Series to Iterator of Series. The type hints can be expressed as Iterator[Tuple[pandas.Series, ...]] -> Iterator[pandas.Series].

It has similar characteristics and restrictions to Iterator of Series to Iterator of Series. The given function takes an iterator of a tuple of pandas.Series and outputs an iterator of pandas.Series. It is likewise useful when the function needs to keep some state or prefetch the input data. The length of the entire output should also be the same as the length of the entire input. However, the given function should take multiple columns as input, unlike Iterator of Series to Iterator of Series.

from typing import Iterator, Tuple
import pandas as pd
from pyspark.sql.functions import pandas_udf       

@pandas_udf("long")
def multiply_two(
        iterator: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    return (a * b for a, b in iterator)

spark.range(10).select(multiply_two("id", "id")).show()

This can also be mapped to the old Pandas UDF style as below.

from pyspark.sql.functions import pandas_udf, PandasUDFType                                                                                                                                                                                                                                                                                                                                                                                                                              

@pandas_udf('long', PandasUDFType.SCALAR_ITER)
def multiply_two(iterator):
    return (a * b for a, b in iterator)

spark.range(10).select(multiply_two("id", "id")).show()

Series to Scalar

Series to Scalar maps to the grouped aggregate Pandas UDF introduced in Apache Spark 2.4. The type hints are expressed as pandas.Series, ... -> Any. The function takes one or more pandas.Series and outputs a primitive data type. The returned scalar can be either a Python primitive type, e.g., int or float, or a NumPy data type such as numpy.int64 or numpy.float64. Ideally, Any should be replaced with the specific scalar type that is returned.

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql import Window

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

@pandas_udf("double")
def pandas_mean(v: pd.Series) -> float:
    return v.mean()

df.select(pandas_mean(df['v'])).show()
df.groupby("id").agg(pandas_mean(df['v'])).show()
df.select(pandas_mean(df['v']).over(Window.partitionBy('id'))).show()

The example above can be converted to the old style with the grouped aggregate Pandas UDF, as you can see here:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql import Window

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def pandas_mean(v):
    return v.mean()

df.select(pandas_mean(df['v'])).show()
df.groupby("id").agg(pandas_mean(df['v'])).show()
df.select(pandas_mean(df['v']).over(Window.partitionBy('id'))).show()
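For intuition about the three usages (whole column, grouped aggregate, and window), here is a rough pandas-only analogue on the same data. It is a local approximation rather than Spark code: groupby(...).agg plays the role of the grouped aggregate, and groupby(...).transform plays the role of the window partitioned by id.

```python
import pandas as pd

pdf = pd.DataFrame({"id": [1, 1, 2, 2, 2], "v": [1.0, 2.0, 3.0, 5.0, 10.0]})

def pandas_mean(v: pd.Series) -> float:
    # Series in, single scalar out.
    return float(v.mean())

overall = pandas_mean(pdf["v"])                       # one value for the whole column
per_group = pdf.groupby("id")["v"].agg(pandas_mean)   # one value per id
windowed = pdf.groupby("id")["v"].transform("mean")   # per-id value repeated on each row
```

The aggregate collapses each group to one row, while the window-style result keeps every input row and attaches the group's value to it.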

New Pandas Function APIs

This new category in Apache Spark 3.0 enables you to directly apply a native Python function, which takes and outputs Pandas instances, against a PySpark DataFrame. The Pandas Function APIs supported in Apache Spark 3.0 are: grouped map, map, and co-grouped map.

Note that the grouped map Pandas UDF is now categorized as a grouped map Pandas Function API. As mentioned earlier, the Python type hints in Pandas Function APIs are currently optional.

Grouped Map

Grouped map in the Pandas Function API is applyInPandas on a grouped DataFrame, e.g., df.groupby(...). It maps to the grouped map Pandas UDF in the old Pandas UDF types. Each group is passed to the function as a pandas.DataFrame. Note that the output is not required to be the same length as the input.

import pandas as pd

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").applyInPandas(subtract_mean, schema=df.schema).show()

Grouped map type is mapped to grouped map Pandas UDF supported from Spark 2.3, as below:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").apply(subtract_mean).show()
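Since the function body is ordinary pandas code, it can be checked locally before handing it to Spark. The sketch below applies subtract_mean to each group in a plain loop, which is only a serial stand-in for the grouped map and not how Spark executes it.

```python
import pandas as pd

pdf = pd.DataFrame({"id": [1, 1, 2, 2, 2], "v": [1.0, 2.0, 3.0, 5.0, 10.0]})

def subtract_mean(group: pd.DataFrame) -> pd.DataFrame:
    v = group.v
    return group.assign(v=v - v.mean())

# Serial stand-in for df.groupby("id").applyInPandas(subtract_mean, ...).
centered = pd.concat(
    [subtract_mean(g) for _, g in pdf.groupby("id")], ignore_index=True
)
```

After centering, the values within each id sum to zero, which is a quick sanity check for the function.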

Map

The map Pandas Function API is mapInPandas on a DataFrame. It is new in Apache Spark 3.0. It applies the function to the batches in each partition and transforms each of them. The function takes an iterator of pandas.DataFrame and outputs an iterator of pandas.DataFrame. The output length does not need to match the input length.

from typing import Iterator
import pandas as pd

df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

def pandas_filter(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        yield pdf[pdf.id == 1]

df.mapInPandas(pandas_filter, schema=df.schema).show()
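The same function can be exercised locally by feeding it an iterator of pandas DataFrames. In this sketch the two batches are invented for illustration; in Spark, how rows are split into batches is decided by the engine, not by the function.

```python
from typing import Iterator
import pandas as pd

def pandas_filter(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        yield pdf[pdf.id == 1]

# Two hypothetical batches of input rows.
batches = [
    pd.DataFrame({"id": [1, 2], "age": [21, 30]}),
    pd.DataFrame({"id": [1, 3], "age": [40, 50]}),
]
filtered = pd.concat(list(pandas_filter(iter(batches))), ignore_index=True)
```

Four rows go in and two come out, which shows that the map API places no constraint on the output length.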

Co-grouped Map

Co-grouped map, applyInPandas on a co-grouped DataFrame such as df.groupby(...).cogroup(df.groupby(...)), will also be introduced in Apache Spark 3.0. Similar to grouped map, it passes each group to the function as a pandas.DataFrame, but it first groups two DataFrames by common key(s) and then applies the function to each co-group. Likewise, there is no restriction on the output length.

import pandas as pd

df1 = spark.createDataFrame(
    [(1201, 1, 1.0), (1201, 2, 2.0), (1202, 1, 3.0), (1202, 2, 4.0)],
    ("time", "id", "v1"))
df2 = spark.createDataFrame(
    [(1201, 1, "x"), (1201, 2, "y")], ("time", "id", "v2"))

def asof_join(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    return pd.merge_asof(left, right, on="time", by="id")

df1.groupby("id").cogroup(
    df2.groupby("id")
).applyInPandas(asof_join, "time int, id int, v1 double, v2 string").show()
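To see what each co-group call does, the sketch below reproduces the example's data in plain pandas and pairs up the per-id groups by hand. This manual pairing is a simplified stand-in for cogroup and assumes every id appears on both sides.

```python
import pandas as pd

left = pd.DataFrame({"time": [1201, 1201, 1202, 1202],
                     "id":   [1, 2, 1, 2],
                     "v1":   [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({"time": [1201, 1201], "id": [1, 2], "v2": ["x", "y"]})

def asof_join(l: pd.DataFrame, r: pd.DataFrame) -> pd.DataFrame:
    # For each left row, take the most recent right row at or before its time.
    return pd.merge_asof(l, r, on="time", by="id")

# Serial stand-in for cogroup: pair up the per-id groups from both sides.
right_groups = {key: g for key, g in right.groupby("id")}
joined = pd.concat(
    [asof_join(g, right_groups[key]) for key, g in left.groupby("id")],
    ignore_index=True,
)
```

Each left row picks up the v2 value from the latest matching right row for the same id, so both rows for id 1 get "x" and both rows for id 2 get "y".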

Conclusion and Future Work

The upcoming release of Apache Spark 3.0 (read our preview blog for details) will offer Python type hints to make it simpler for users to express Pandas UDFs and Pandas Function APIs. In the future, we should consider adding support for other type hint combinations in both Pandas UDFs and Pandas Function APIs. Currently, the supported cases are only a few of the many possible combinations of Python type hints. There are also other ongoing discussions in the Apache Spark community. Visit Side Discussions and Future Improvement to learn more.

Try out these new capabilities today for free on Databricks as part of the Databricks Runtime 7.0 Beta.

The post New Pandas UDFs and Python Type Hints in the Upcoming Release of Apache Spark 3.0™ appeared first on Databricks.

MLOps takes center stage at Spark + AI Summit

As companies ramp up machine learning, the growth in the number of models they have under development begins to impact their set of tools, processes and infrastructure. Machine learning involves data and data pipelines, model training and tuning (i.e., experiments), governance, and specialized tools for deployment, monitoring and observability. About three years ago, we first started to hear of “machine learning engineer” as a role emerging in the San Francisco Bay Area. Today, machine learning engineers are more common, and companies are beginning to think through systems and strategies for MLOps, a set of new practices for productionizing machine learning.
 
MLOps-focused trainings, tutorials, keynotes, and sessions featured at the 2020 Virtual Spark + AI Summit

Featured MLOps-focused Programs and Topics

The growing interest in DevOps for machine learning, or MLOps, is something we’ve been tracking closely and, for the upcoming virtual Spark + AI Summit, we have training, tutorials, keynotes and sessions on topics relevant to MLOps. We provided a sneak peek during a recent virtual conference focused on ML platforms, and we have much more in store at the conference in June. Some of the topics that will be covered in the virtual Spark + AI Summit include the following:

  • Model development, tuning and governance: There will be hands-on training focused on MLflow and case studies from many companies, including Atlassian, Halliburton, Zynga, Outreach and Facebook.
  • Feature stores: In 2017, Uber introduced the concept of a central place to store curated features within an organization. Since the creation and discovery of relevant features is a central part of the ML development process, teams need data management systems (feature stores) where they can store, share and discover features. At this year’s summit, a set of companies, including Tecton, Logical Clocks, Accenture and iFoods, will describe their approach to building feature stores.
  • Large-scale model inference and prediction services: Several speakers will describe how they deploy ML models to production, including sessions that detail best practices for designing large-scale prediction services. We will have speakers from Facebook, LinkedIn, Microsoft, IBM, Stanford, ExxonMobil, Condé Nast and more.
  • Monitoring and managing models in production: Model monitoring and observability are capabilities that many companies are just beginning to build. At this year’s virtual summit we are fortunate to have presentations from companies and speakers who are building these services. Those presenting and teaching on these topics include speakers from Intuit, Databricks, Iguazio and AWS.
  • MLOps: Continuous integration (CI) and continuous deployment (CD) are well-known software engineering practices that are beginning to influence how companies develop, deploy and manage machine learning models. This year’s summit will have presentations on CI/CD for ML from leading companies, including Intel, Outreach, Databricks and others.

Machine learning and AI are impacting a wider variety of domains and industries. At the same time, most companies are just beginning to build, manage and deploy machine learning models to production. The upcoming virtual Spark + AI Summit will highlight best practices, tools and case studies from companies and speakers at the forefront of making machine learning work in real-world applications.

The post MLOps takes center stage at Spark + AI Summit appeared first on Databricks.

Modernizing Risk Management Part 1: Streaming data-ingestion, rapid model development and Monte-Carlo Simulations at Scale

Managing risk within financial services, especially within the banking sector, has increased in complexity over the past several years. First, new frameworks (such as FRTB) are being introduced that potentially require tremendous computing power and an ability to analyze years of historical data. At the same time, regulators are demanding more transparency and explainability from the banks they oversee. Finally, the introduction of new technologies and business models means the need for sound risk governance is at an all-time high.

However, meeting these demands effectively has not been easy for the banking industry. Traditional banks relying on on-premises infrastructure can no longer effectively manage risk. Banks must abandon the computational inefficiencies of legacy technologies and build an agile Modern Risk Management practice capable of rapidly responding to market and economic volatility through the use of data and advanced analytics. Recent experience shows that as new threats emerge, historical data and aggregated risk models quickly lose their predictive value. Risk analysts must augment traditional data with alternative datasets in order to explore new ways of identifying and quantifying the risks facing their business, both at scale and in real time.

In this blog, we will demonstrate how to modernize traditional value-at-risk (VaR) calculation through the use of various components of the Databricks Unified Data Analytics Platform (Delta Lake, Apache Spark™ and MLflow) in order to enable a more agile and forward-looking approach to risk management.

The Databricks architecture used to modernize traditional VaR Calculations.

This first series of notebooks will cover the multiple data engineering and data science challenges that must be addressed to effectively modernize risk management practices:

  • Using Delta Lake to have a unified view of your market data
  • Leveraging MLflow as a delivery vehicle for model development and deployment
  • Using Apache Spark for distributing Monte Carlo simulations at scale

The ability to efficiently slice and dice your Monte Carlo simulations in order to have a more agile and forward-looking approach to risk management will be covered in a second blog post, focused more on a risk analyst persona.

Modernizing data management with Delta Lake

With the rise of big data and cloud-based technologies, the IT landscape has drastically changed in the last decade. Yet, most FSIs still rely on mainframes and non-distributed databases for core risk operations such as VaR calculations and move only some of their downstream processes to modern data lakes and cloud infrastructure. As a result, banks are falling behind the technology curve and their current risk management practices are no longer sufficient for the modern economy. Modernizing risk management starts with the data. Specifically, by shifting the lens through which data is viewed: not as a cost, but as an asset.

Old Approach: When data is considered as a cost, FSIs limit the capacity of risk analysts to explore “what if” scenarios and restrict their aggregated data silos to only satisfy predefined risk strategies. Over time, the rigidity of maintaining silos has led engineers to branch new processes and create new aggregated views on the basis of already fragile workflows in order to adapt to evolving requirements. Paradoxically, the constant struggle to keep data as a low-cost commodity on-premises has led to a more fragile and therefore more expensive ecosystem to maintain overall. Failed processes (marked with an X symbol below) have far too many downstream impacts to guarantee both the timeliness and the reliability of your data. Consequently, having an intra-day (and reliable) view of market risk has become increasingly complex and cost prohibitive to achieve given all the moving components and inter-dependencies shown in the diagram below.
Traditional approaches to risk management, which prioritize keeping data costs low, place financial risk managers at a disadvantage.

Modern Approach: When data is considered as an asset, organizations embrace the versatile nature of the data, serving multiple use cases (such as value-at-risk and expected shortfall) and enabling a variety of ad-hoc analysis (such as understanding risk exposure to a specific country). Risk analysts are no longer restricted to a narrow view of the risk and can adopt a more agile approach to risk management. By unifying streaming and batch ETL, ensuring ACID compliance and schema enforcement, Delta Lake brings performance and reliability to your data lake, gradually increasing the quality and relevance of your data through its bronze, silver and gold layers and bridging the gap between operation processes and analytics data.

Modern approach to financial risk management emphasizes increasing the quality and relevance of data and bridging the gap between operation processes and analytics data.

In this demo, we evaluate the level of risk of various investments in a Latin America equity portfolio composed of 40 instruments across multiple industries, storing all returns in a centralized Delta Lake table that will drive all our value-at-risk calculations (covered in our part 2 demo).

Sample risk portfolio of various Latin American instruments.

For the purpose of this demo, we access daily close prices from Yahoo Finance using the Python yfinance library. In real life, one may acquire market data from source systems directly (such as change data capture from mainframes) to a Delta Lake table, storing raw information on a Bronze table and curated / validated data on a Silver table, in real time.

With our core data available on Delta Lake, we apply a simple window function to compute daily log returns and output results back to a gold table ready for risk modelling and analysis.

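import numpy as np
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import udf

# log return between the previous close ('first') and the current close
@udf("double")
def compute_return(first, close):
    return float(np.log(close / first))

# two-row window per ticker: the previous row and the current row
window = Window.partitionBy('ticker').orderBy('date').rowsBetween(-1, 0)

spark \
    .read \
    .table(stock_data_silver) \
    .withColumn("first", F.first('close').over(window)) \
    .withColumn("return", compute_return('first', 'close')) \
    .select('date', 'ticker', 'return') \
    .write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable(stock_data_gold)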
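
To make the return calculation concrete, here is a minimal pure-Python sketch of the same daily log return. The close prices are made up for illustration and the helper name log_returns is an assumption, not something taken from the notebooks.

```python
import math

def log_returns(closes):
    """Return log(close_t / close_{t-1}) for each consecutive pair of prices."""
    return [math.log(curr / prev) for prev, curr in zip(closes, closes[1:])]

# hypothetical close prices for a single instrument, ordered by date
closes = [100.0, 102.0, 99.0, 99.0]
returns = log_returns(closes)

# log returns are additive: their sum is the log return over the whole period
total_return = sum(returns)
```

Unlike the window version, which yields a return of 0 for the first row of each ticker (the window then contains only that row), this sketch simply skips the first observation.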

In the example below, we show a specific slice of our investment data for AVAL (Grupo Aval Acciones y Valores S.A), a financial services company operating in Colombia. Given the expected drop in its stock price after March 2020, we can evaluate its impact on our overall risk portfolio.

Sample data used by Databricks to illustrate the effectiveness of a modern approach to financial risk and data management.

Streamlining model development with MLflow

Although quantitative analysis is not a new concept, the recent rise of data science and the explosion of data volumes have uncovered major inefficiencies in the way banks operate models. Without any industry standard, data scientists often operate on a best-effort basis. This often means training models against data samples on single nodes and manually tracking models throughout the development process, resulting in long release cycles (it may take between 6 and 12 months to deliver a model to production). The long model development cycle hinders their ability to quickly adapt to emerging threats and to dynamically mitigate the associated risks. The major challenge FSIs face in this paradigm is reducing model development-to-production time without doing so at the expense of governance and regulations or contributing to an even more fragile data science ecosystem.

MLflow is the de facto standard for managing the machine learning lifecycle, bringing immutability and transparency to model development, but it is not restricted to AI. A bank’s definition of a model is usually quite broad and includes any financial model, from Excel macros to rule-based systems or state-of-the-art machine learning, all of which could benefit from the central model registry provided by MLflow within the Databricks Unified Data Analytics Platform.

Reproducing model development

In this example, we want to train a new model that predicts stock returns given market indicators (such as S&P 500, crude oil and treasury bonds). We can retrieve “AS OF” data in order to ensure full model reproducibility and audit compliance. This capability of Delta Lake is commonly referred to as “time travel”. The resulting data set will remain consistent throughout all experiments and can be accessed as-is for audit purposes.

DESCRIBE HISTORY market_return;
SELECT * FROM market_return TIMESTAMP AS OF '2020-05-04';
SELECT * FROM market_return VERSION AS OF 2;

In order to select the right features for their models, quantitative analysts often navigate between Spark and Pandas dataframes. We show here how to switch from a PySpark to a Python context in order to extract correlations of our market factors. The Databricks interactive notebooks come with built-in visualisations and also fully support the use of Matplotlib and seaborn (or ggplot2 for R).

factor_returns_pd = factor_returns_df.toPandas()
factor_corr = factor_returns_pd.corr(method='spearman', min_periods=12)

Sample variance-covariance table generated by a Databricks interactive notebook, demonstrating its efficacy and rigor in constructing its predictive risk models.

Assuming our indicators are not correlated (they are) and predictive of our portfolio returns (they may be), we want to log this graph as evidence of our successful experiment. This shows internal audit, model validation functions and regulators that model exploration was conducted to the highest quality standards and that its development was led by empirical results.

mlflow.log_artifact('/tmp/correlation.png')

Training models in parallel

As the number of instruments in our portfolio increases, we may want to train models in parallel. This can be achieved through a simple Pandas UDF as follows. For convenience (models may be more complex in real life), we want to train a simple linear regression model and aggregate all model coefficients as an n x m matrix (n being the number of instruments and m the number of features derived from our market factors).

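import numpy as np
import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, FloatType

schema = StructType([
    StructField('ticker', StringType(), True),
    StructField('weights', ArrayType(FloatType()), True)
  ])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def train_model(group, pdf):
  # stack the per-row feature arrays into a 2-D (rows x features) matrix
  X = np.array(pdf['features'].tolist())
  X = sm.add_constant(X, prepend=True)
  y = np.array(pdf['return'])
  # ordinary least squares; params holds the intercept followed by the feature weights
  model = sm.OLS(y, X).fit()
  w_df = pd.DataFrame(data=[[model.params]], columns=['weights'])
  w_df['ticker'] = group[0]
  return w_df

models_df = x_train.groupBy('ticker').apply(train_model).toPandas()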

The resulting dataset (weights for each model) can be easily collected back to memory and logged to MLflow as our model candidate for the rest of the experiment. In the graph below, we report the predicted vs actual stock return derived from our model for Ecopetrol S.A., an oil and gas producer in Colombia.

Sample model output visualization retained by MLflow, along with the experiment, its revisions, and the underlying data, providing full transparency, traceability, and context.

Our experiment is now stored on MLflow alongside all evidence required for an independent validation unit (IVU) submission, which is likely a part of your model risk management framework. It is key to note that this experiment is not only linked to our notebook, but to the exact revision of it, giving independent experts and regulators full traceability of our model as well as all the necessary context required for model validation.

Monte Carlo simulations at scale with Apache Spark

Value-at-risk is the process of simulating random walks that cover possible outcomes as well as worst case (n) scenarios. A 95% value-at-risk for a period of (t) days is the best case scenario out of the worst 5% of trials. We therefore want to generate enough simulations to cover a range of possible outcomes given the 90-day historical market volatility observed across all the instruments in our portfolio. Given the number of simulations required for each instrument, this system must be designed with a high degree of parallelism in mind, making value-at-risk the perfect workload to execute in a cloud-based environment. Risk management is the number one reason top tier banks evaluate cloud compute for analytics today and accelerate value through the Databricks runtime.
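
The definition above can be illustrated with a small, self-contained sketch: sort the simulated returns, isolate the worst 5% of trials, and take the best of them. The helper name value_at_risk and the normally distributed trial returns are assumptions made purely for illustration; they do not come from the notebooks.

```python
import random

def value_at_risk(trial_returns, confidence=0.95):
    """Best return among the worst (1 - confidence) share of trials."""
    ordered = sorted(trial_returns)
    # number of trials that fall in the tail, at least one
    n_worst = max(1, round(len(ordered) * (1 - confidence)))
    return ordered[n_worst - 1]

# hypothetical simulated returns: 1,000 trials with 2% volatility
rng = random.Random(42)
trials = [rng.gauss(0.0, 0.02) for _ in range(1000)]

var_95 = value_at_risk(trials, confidence=0.95)
```

With 1,000 trials, the worst 5% are the 50 lowest returns, so var_95 is the 50th lowest simulated return.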

Creating a multivariate distribution

Whilst the industry recommends generating between 20 and 30 thousand simulations, the main complexity of calculating value-at-risk for a mixed portfolio is not measuring individual asset returns, but the correlations between them. At a portfolio level, market indicators can be elegantly manipulated within native Python without having to shift complex matrix computation to a distributed framework. As it is common to operate with multiple books and portfolios, this same process can easily scale out by distributing matrix calculation in parallel. We use the last 90 days of market returns in order to compute today’s volatility (extracting both average and covariance).

def retrieve_market_factors(from_date, to_date):

    from_ts = F.to_date(F.lit(from_date)).cast(TimestampType())
    to_ts = F.to_date(F.lit(to_date)).cast(TimestampType())

    f_ret = spark.table(market_return_table) \
        .filter(F.col('date') > from_ts) \
        .filter(F.col('date') <= to_ts) \
        .orderBy(F.asc('date'))

    f_ret_pdf = f_ret.toPandas()
    f_ret_pdf.index = f_ret_pdf['date']
    f_ret_pdf = f_ret_pdf.drop(['date'], axis=1)

    return f_ret_pdf
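
The function above returns a pandas dataframe of factor returns indexed by date. One plausible way to derive the average and covariance mentioned earlier is sketched below; the factor names and the randomly generated returns are placeholders, and the original notebooks may compute these statistics differently.

```python
import numpy as np
import pandas as pd

# hypothetical daily returns for three market factors (one row per trading day)
rng = np.random.default_rng(0)
f_ret_pdf = pd.DataFrame(
    rng.normal(loc=0.0, scale=0.01, size=(90, 3)),
    columns=["sp500", "crude_oil", "treasury"],
)

# average return per factor (vector of length m)
f_ret_avg = f_ret_pdf.mean().to_numpy()

# m x m covariance matrix capturing how the factors move together
f_ret_cov = f_ret_pdf.cov().to_numpy()
```

These two arrays are the inputs expected by a multivariate normal sampler such as np.random.multivariate_normal.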

We generate a specific market condition by sampling a point of the market's multivariate projection (superposition of individual normal distributions of our market factors). This provides a feature vector that can be injected into our model in order to predict the return of our financial instrument.

def simulate_market(f_ret_avg, f_ret_cov, seed):
    np.random.seed(seed = seed)
    return np.random.multivariate_normal(f_ret_avg, f_ret_cov)

Generating consistent and independent trials at scale

Another complexity of simulating value-at-risk is to avoid auto-correlation by carefully fixing random numbers using a ‘seed’. We want each trial to be independent albeit consistent across instruments (market conditions are identical for each simulated position). See below an example of creating an independent and consistent trial set - running this same block twice will result in the exact same set of generated market vectors.

seed_init = 42
seeds = [seed_init + x for x in np.arange(0, 10)]
market_data = [simulate_market(f_ret_avg, f_ret_cov, s) for s in seeds]
market_df = pd.DataFrame(market_data, columns=feature_names)
market_df['_seed'] = seeds
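
The reproducibility claim can be checked with a short standalone sketch. The two-factor averages and covariance below are invented values; only simulate_market mirrors the function shown earlier.

```python
import numpy as np

def simulate_market(f_ret_avg, f_ret_cov, seed):
    # fixing the seed makes the sampled market vector deterministic
    np.random.seed(seed=seed)
    return np.random.multivariate_normal(f_ret_avg, f_ret_cov)

# hypothetical two-factor market (made-up averages and covariance)
f_ret_avg = np.array([0.001, -0.002])
f_ret_cov = np.array([[1e-4, 2e-5],
                      [2e-5, 4e-4]])

seeds = [42 + x for x in range(10)]
first_run = [simulate_market(f_ret_avg, f_ret_cov, s) for s in seeds]
second_run = [simulate_market(f_ret_avg, f_ret_cov, s) for s in seeds]

# same seeds give identical vectors; different seeds give different vectors
identical = all(np.array_equal(a, b) for a, b in zip(first_run, second_run))
```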

In a distributed environment, we want each executor in our cluster to be responsible for multiple simulations across multiple instruments. We define our seed strategy so that each executor will be responsible for num_instruments x ( num_simulations / num_executors ) trials. Given 100,000 Monte Carlo simulations, a parallelism of 50 executors and 10 instruments in our portfolio, each executor will run 20,000 instrument returns.

# fixing our initial seed with today's experiment
trial_date = datetime.strptime('2020-05-01', '%Y-%m-%d')
seed_init = int(trial_date.timestamp())

# create our seed strategy per executor
seeds = [[seed_init + x, x % parallelism] for x in np.arange(0, runs)]
seed_pdf = pd.DataFrame(data = seeds, columns = ['seed', 'executor'])
seed_sdf = spark.createDataFrame(seed_pdf).repartition(parallelism, 'executor')

# evaluate and cache our repartitioning strategy
seed_sdf.cache()
seed_sdf.count()    
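
The arithmetic behind this seed strategy can be verified in plain Python. The variable names below are chosen for illustration; the modulo assignment mirrors the x % parallelism expression used above.

```python
from collections import Counter

num_simulations = 100_000   # Monte Carlo runs
num_executors = 50          # parallelism
num_instruments = 10        # instruments in the portfolio

# each executor receives an equal share of the seeds
simulations_per_executor = num_simulations // num_executors

# every seed is evaluated once per instrument
trials_per_executor = num_instruments * simulations_per_executor

# round-robin assignment of seeds to executors via a modulo
seeds_per_executor = Counter(x % num_executors for x in range(num_simulations))
balanced = set(seeds_per_executor.values()) == {simulations_per_executor}
```

Because the number of simulations is a multiple of the number of executors, every group holds exactly the same number of seeds, which is what keeps the workload balanced.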

We group our set of seeds per executor and generate trials for each of our models through the use of a Pandas UDF. Note that there may be multiple ways to achieve the same result, but this approach has the benefit of fully controlling the level of parallelism in order to ensure no hotspot occurs and no executor is left idle waiting for other tasks to finish.

@pandas_udf('ticker string, seed int, trial float', PandasUDFType.GROUPED_MAP)
def run_trials(pdf):

    # retrieve our broadcast models and 90 days market volatility
    models = model_dict.value
    f_ret_avg = f_ret_avg_B.value
    f_ret_cov = f_ret_cov_B.value

    trials = []
    for seed in np.array(pdf.seed):
        # same simulated market conditions for every instrument in this trial
        market_features = simulate_market(f_ret_avg, f_ret_cov, seed)
        for ticker, model in models.items():
            trial = model.predict(market_features)
            trials.append([ticker, seed, trial])

    return pd.DataFrame(trials, columns=['ticker', 'seed', 'trial'])

# execute Monte Carlo in parallel
mc_df = seed_sdf.groupBy('executor').apply(run_trials)

We append our trials partitioned by day onto a Delta Lake table so that analysts can easily access a day’s worth of simulations and group individual returns by a trial Id (i.e. the seed) in order to access the daily distribution of returns and its respective value-at-risk.
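
As a rough illustration of how such a table could be consumed, the sketch below groups trials by seed to obtain one portfolio return per simulation and then reads off the 5% quantile. The tickers, the equal weights and the randomly generated trial values are all assumptions; the actual aggregation is the subject of the part 2 post.

```python
import numpy as np
import pandas as pd

# hypothetical trials: one simulated return per (ticker, seed) pair
rng = np.random.default_rng(7)
tickers = ["AAA", "BBB", "CCC"]  # placeholder tickers
trials = pd.DataFrame(
    [(t, int(s), rng.normal(0.0, 0.02)) for s in range(1000) for t in tickers],
    columns=["ticker", "seed", "trial"],
)

# equal portfolio weights are an assumption made for this sketch
weights = {t: 1.0 / len(tickers) for t in tickers}
trials["weighted"] = trials["trial"] * trials["ticker"].map(weights)

# one portfolio return per trial id (the seed)
portfolio_returns = trials.groupby("seed")["weighted"].sum()

# 95% value-at-risk approximated by the 5% quantile of portfolio returns
var_95 = portfolio_returns.quantile(0.05)
```

Note that quantile interpolates between neighbouring observations by default, so the result approximates rather than exactly matches the "best of the worst 5%" definition.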

Sample Delta Lake table to which trials, partitioned by day, have been appended to facilitate risk analysts’ review.

With respect to our original definition of data being a core asset (as opposed to being a cost), we store all our trials enriched with our portfolio taxonomy (such as industry type and country of operation), enabling a more holistic and on-demand view of the risk facing our investment strategies. These concepts of slicing and dicing value-at-risk data efficiently and easily (through the use of SQL) will be covered in our part 2 blog post, focused more on a risk analyst persona.

Getting started with a modern approach to VaR and risk management

In this article, we have demonstrated how banks can modernize their risk management practices by efficiently scaling their Monte Carlo simulations from tens of thousands up to millions by leveraging both the flexibility of cloud compute and the robustness of Apache Spark. We also demonstrated how Databricks, as the only Unified Data Analytics Platform, helps accelerate the model development lifecycle by bringing both transparency to your experiments and reliability to your data, bridging the gap between science and engineering and enabling banks to have a more robust yet agile approach to risk management.

Try the notebooks below on Databricks today! And if you want to learn how unified data analytics can bring data science, business analytics and engineering together to accelerate your data and ML efforts, check out the on-demand workshop: Unifying Data Pipelines, Business Analytics and Machine Learning with Apache Spark™

VaR and Risk Management Notebooks:
https://databricks.com/notebooks/00_context.html
https://databricks.com/notebooks/01_market_etl.html
https://databricks.com/notebooks/02_model.html
https://databricks.com/notebooks/03_monte_carlo.html
https://databricks.com/notebooks/04_var_aggregation.html
https://databricks.com/notebooks/05_alt_data.html
https://databricks.com/notebooks/06_backtesting.html

Contact us to learn more about how we assist customers with market risk use cases.

The post Modernizing Risk Management Part 1: Streaming data-ingestion, rapid model development and Monte-Carlo Simulations at Scale appeared first on Databricks.

Automating away engineering on-call workflows at Databricks

A Summer of Self-healing

This summer I interned with the Cloud Infrastructure team. The team is responsible for building scalable infrastructure to support Databricks’s multi-cloud product, while using cloud-agnostic technologies like Terraform and Kubernetes. My main focus was developing a new auto-remediation service, Healer, which automatically repairs our Kubernetes infrastructure to improve our service availability and reduce on-call burden.

Automatically Reducing Outages and Downtime

The Cloud Infra team at Databricks is responsible for underlying compute infrastructure for all of Databricks, managing thousands of VMs and database instances across clouds and regions. As components in a distributed system, these cloud-managed resources are expected to fail from time to time. On-call engineers sometimes perform repetitive tasks to fix these expected incidents. When a PagerDuty alert fires, the on-call engineer manually addresses the problem by following a documented playbook.

Though ideally we’d like to track down and fix the root cause of every issue, to do so would be prohibitively expensive. For this long tail of issues, we instead rely on playbooks that address the symptoms and keep them in check. And in some cases, the root cause is a known issue with one of the many open-source projects we work with (like Kubernetes, Prometheus, Envoy, Consul, Hashicorp Vault), so a workaround is the only feasible option.

On-call pages require careful attention from our engineers. Databricks engineering categorizes issues based on priority. Lower-priority issues will only page during business hours (i.e. engineers won’t be woken up at night!). For example, if a Kubernetes node is corrupt in our dev environment, the on-call engineer will only be alerted the following morning to triage the issue. Since Databricks engineering is distributed worldwide (with offices in San Francisco, Toronto, and Amsterdam) and most teams are based out of a single office, an issue with the dev environment can impede certain engineers for hours, decreasing developer productivity.

We are always looking for ways to reduce our keeping-the-lights-on (KTLO) burden, so designing a system that responds to alerts without human intervention by executing engineer-defined playbooks makes a lot of sense to manage resources at our scale. We set out to design a system that would help us address these systemic concerns.

Self-healing Architecture

The Databricks self-managing service Healer features an event-driven architecture that autonomously monitors and repairs the Kubernetes infrastructure.

The Healer architecture is composed of input events (Prometheus/Alertmanager), execution (Healer endpoint, worker queue/threads), and actions (Jenkins, Kubernetes, Spinnaker jobs).


Healer is designed using an event-driven architecture that autonomously repairs the Kubernetes infrastructure. Our alerting system (Prometheus, Alertmanager) monitors our production infrastructure and fires alerts based on defined expressions. Healer runs as a backend service listening to HTTP requests from Alertmanager with alert payloads.
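
To give a feel for the input side, here is a minimal sketch of how a service could pull firing alerts out of an Alertmanager webhook body. It assumes the standard webhook JSON layout with a top-level alerts list; the function name and the node label are hypothetical, and Healer's real request handling is not shown in this post.

```python
import json

def extract_firing_alerts(body):
    """Return (alert name, labels) for every firing alert in a webhook payload."""
    payload = json.loads(body)
    firing = []
    for alert in payload.get("alerts", []):
        # resolved alerts need no remediation
        if alert.get("status") == "firing":
            labels = alert.get("labels", {})
            firing.append((labels.get("alertname"), labels))
    return firing

# hypothetical payload with one firing and one resolved alert
example_body = json.dumps({
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "NodeDiskPressure", "node": "node-1"}},
        {"status": "resolved",
         "labels": {"alertname": "NodeDiskPressure", "node": "node-2"}},
    ],
})

alerts = extract_firing_alerts(example_body)
```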

Using the alert metadata, Healer constructs the appropriate remediation based on the alert type and the alert labels. A remediation dictates what the remediation action will be as well as any parameters needed.

Each remediation is scheduled onto a worker execution thread pool. The worker thread will run the respective remediation by making calls to the appropriate service and then monitor the remediation for completion. In practice, this could be kicking off a Jenkins, Kubernetes, or Spinnaker job that automates the manual script workflow. We choose to support these frameworks, because they provide Databricks engineers with a wide ability to customize actions in reaction to the alerts.
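
The scheduling step can be pictured with Python's standard thread pool. This is only a sketch of the pattern: run_remediation is a placeholder for triggering an external job and polling it until completion, and the dictionaries standing in for remediations are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def run_remediation(remediation):
    # placeholder: a real worker would start the job, then poll for completion
    return {"job": remediation["job"], "status": "succeeded"}

# hypothetical remediations waiting to be executed
remediations = [
    {"job": "DockerPruneNode", "params": {"node": "node-1"}},
    {"job": "DockerPruneNode", "params": {"node": "node-2"}},
]

# a fixed-size pool bounds how many remediations run concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_remediation, remediations))
```

Bounding the pool size is one way to keep a burst of alerts from overwhelming the downstream job systems.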

Once the remediation completes, JIRA and Slack notifications are sent to the corresponding team confirming remediation task completion.

Healer can be easily extended with new kinds of remediations. Engineering teams outside of Cloud Infra can onboard remediation jobs that integrate with their service alerts, taking the actions needed to recover from incidents and reducing on-call load across engineering.

Example Use Case

One use case for Healer is for remediating low disk space on our Kubernetes nodes. The on-call engineer is notified of this problem by an alert called “NodeDiskPressure”. To remedy NodeDiskPressure, an on-call engineer would connect to the appropriate node and execute a docker image prune command.

To automate this, we first develop an action to be triggered by Healer; we define a Jenkins job called DockerPruneNode, which automates the manual steps equivalent to connecting to a node and executing docker image prune. We then configure a Healer remediation to resolve NodeDiskPressure alerts automatically by defining a Healer rule that binds an exact remedy (DockerPruneNode) given an alert and its parameters.

Below is an example of how a NodeDiskPressure alert gets translated into a specific remediation including the job to be run and all the needed parameters. The final remediation object has three “translated” params taken from the alert as well as one “static” hard-coded param.

Example repair initiated by the Databricks auto-remediation service Healer for a NodeDiskPressure alert.

The configuration also has a few other parameters which engineers can configure to tune the exact behavior of the remediation. They are omitted here for brevity.
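
The translation from alert to remediation could be modelled roughly as follows. Everything here is illustrative: the Rule class, the parameter names (NODE, REGION, ENV, DRY_RUN) and the label names are assumptions, since the actual Healer configuration format is not reproduced in this post.

```python
from dataclasses import dataclass, field

@dataclass
class Rule:
    alert_name: str
    job: str
    translated_params: dict                      # remediation param -> alert label
    static_params: dict = field(default_factory=dict)

def build_remediation(rule, alert_labels):
    """Bind a rule to an alert, or return None when the rule does not apply."""
    if alert_labels.get("alertname") != rule.alert_name:
        return None
    # copy values out of the alert labels, then add hard-coded params
    params = {p: alert_labels[label] for p, label in rule.translated_params.items()}
    params.update(rule.static_params)
    return {"job": rule.job, "params": params}

# three translated params and one static param, as described above
rule = Rule(
    alert_name="NodeDiskPressure",
    job="DockerPruneNode",
    translated_params={"NODE": "node", "REGION": "region", "ENV": "environment"},
    static_params={"DRY_RUN": "false"},
)
labels = {"alertname": "NodeDiskPressure", "node": "node-1",
          "region": "us-west-2", "environment": "dev"}
remediation = build_remediation(rule, labels)
```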

After defining this rule, the underlying issue is fully automated away, allowing on-call engineers to focus on other, more important matters!

Future Steps

Currently Healer is up, running, and improving availability of our development infrastructure. What do next steps for the service look like?

Initially, we plan to onboard more of the Cloud Infra team’s use cases. Specifically, we are looking at the following use cases:

  • Support fine-grained auto-scaling to our clusters by leveraging existing system usage alerts (CPU, memory) to trigger a remediation that will increase cluster capacity.
  • Terminate and reprovision Kubernetes nodes that are identified as unhealthy.
  • Rotate service TLS certificates when they are close to expiring.

Furthermore, we want to continue to push adoption of this tool within the engineering organization and help other teams to onboard their use cases. This general framework can be extended to other teams to reduce their on-call load as well.

I am looking forward to seeing what other incidents Healer can help remediate for Databricks in the future!

Special thanks for a great internship experience to my mentor Ziheng Liao, managers Nanxi Kang and Eric Wang, as well as the rest of the cloud team here at Databricks!

I really enjoyed my summer at Databricks and encourage anyone looking for a challenging and rewarding career in platform engineering to join the team. If you are interested in contributing to our self-healing architecture, check out our open job opportunities!

The post Automating away engineering on-call workflows at Databricks appeared first on Databricks.
