
A Guide to Women In Unified Analytics Events at Spark+AI Summit Europe


Spark + AI Summit is Europe’s largest data and machine learning conference, and the big news in 2019 is how many women are driving some of the greatest advances in big data, machine learning, and data science.

From 15-17 October in Amsterdam, more than 1700 data scientists, engineers, and developers will gather to share best practices for building real-world AI applications. In 120-plus informative sessions, we’ll discuss the latest advances in open-source technologies, including Apache Spark™, MLflow, Delta Lake, Koalas, and much more.

We’re very excited about a keynote address from Katie Bouman, Assistant Professor of Computing and Mathematical Sciences at Caltech. Katie specializes in using emerging computational methods to push the boundaries of imaging, and she’ll talk about how she developed an ML algorithm to capture the first-ever picture of a black hole.

We’re also thrilled to introduce new events for Women in Unified Analytics, providing fantastic opportunities for women and their allies in big data, data science, machine learning, and AI to connect, learn, and network. All genders welcome!

Below is the full agenda of events:

15 October, 18:30 – 21:00 – Happy Hour + Meetup

  • Clémence Burnichon, Senior Data Scientist, Depop
  • Adi Polak, Senior Cloud Developer Advocate, Microsoft
  • Laura Hollink, Tenure Track Researcher, Information Access, CWI

16 October, 18:00 – 19:30 – Happy Hour & Networking

17 October, 11:50 – 13:30 – Lunch with Panel

  • Devon Edward-Joseph, Data Engineer, Lloyds Bank
  • Caroline Jelinda Britto, Data Scientist Manager, Wehkamp
  • Mamta Thangaraj, Functional Lead, Operations Research/Decision Analytics, ARM

If you haven’t done so already, use code WUDASummit to register for the 2019 Spark + AI Summit with 20% off, and join women data science leaders from Depop, Microsoft, CWI, ARM, Wehkamp, and Lloyds Banking Group for tech talks and panel discussion on topics ranging from technology trends to diversity and inclusion, ethical AI, and career development!

Women in Unified Analytics events featured at the 2019 Spark + AI Summit Europe



Engineering population scale Genome-Wide Association Studies with Apache Spark™, Delta Lake, and MLflow

Try this notebook series in Databricks

The advent of genome-wide association studies (GWAS) in the late 2000s enabled scientists to begin to understand the causes of complex diseases such as diabetes and Crohn’s disease at their most fundamental level. However, academic bioinformatics tools to perform GWAS have not kept pace with the growth of genomic data, which has been doubling globally every seven months.

Given the scale of the challenge and the importance of genomics to the future of healthcare, at Databricks we have dedicated an engineering team to develop extensible Spark-native implementations of workflows such as GWAS, which leverage the high-performance big-data store, Delta Lake, and log runs with MLflow. Combining these three technologies with a library we have developed in-house to enable customers to work with genomic data solves the challenges that we have seen our customers face when working with population-scale genomic data.

This tooling includes an architecture that allows users to ingest genomics data directly from flat file formats such as bed, VCF, or BGEN into Delta Lake. In this blog, we focus on moving common association testing kernels into Spark SQL, streamlining the running of common tests such as genome-wide linear regression. In our next blog, we will generalize this process by using the pipe transformer to parallelize any single-node bioinformatics tool with Apache Spark™, starting with the GWAS tool SAIGE.

Here we showcase how to run an end-to-end GWAS workflow in a single notebook using the publicly available 1000 Genomes dataset, producing the results in figure 1. We used associated variants from the GWAS catalog to generate a synthetic body-mass index (BMI) phenotype (since the 1000 Genomes project did not capture phenotypes). This notebook is written in Python, but can also be implemented in R, Scala, and SQL.

Figure 1. Databricks dashboard showing key results from a GWAS on simulated data based on the 1000 genomes dataset.

Ingest 1,000 Genomes Data into Delta Lake

To start, we will load in the 1,000 Genomes VCF file as a Spark SQL DataFrame and calculate summary statistics. Our schema is an intuitive representation of genomic variants that is consistent across both VCF and BGEN data.

# Read the VCF file, compute per-variant summary statistics, and write the result to Delta Lake
spark.read.format("com.databricks.vcf"). \
           option("splitToBiallelic", "true"). \
           option("flattenInfoFields", "false"). \
           load(vcf_path). \
           selectExpr("*", "expand_struct(call_summary_stats(genotypes))", "expand_struct(hardy_weinberg(genotypes))"). \
           write. \
           format("delta"). \
           save(delta_path)

Figure 2. Databricks’ display() command showing VCF file in a Spark DataFrame

The 1,000 Genomes dataset contains whole genome sequencing data, and thus includes many rare variants. By running a count query on the dataset, we find that there are more than 80 million variants. Let’s go ahead and log this metric to MLflow.

# VCF count
num_variants = spark.read.format("delta").load(delta_path).count()
mlflow.log_metric("Number Variants pre-QC", num_variants)
num_variants

# Output
81271745

Perform quality control

In our genomics library, we have added quality control functions that compute common statistics across the genotypes at a single variant, as well as across all of the samples in a single callset. Here we are going to filter variants that are not in Hardy-Weinberg equilibrium (“pValueHwe”), which is a population genetics statistic that can be used to assess if variants have been correctly genotyped. We will exclude rare variants based on allele frequency.

spark.read.format("delta"). \
   load(delta_path). \
   where((col("alleleFrequencies").getItem(0) >= allele_freq_cutoff) & 
         (col("alleleFrequencies").getItem(0) <= (1.0 - allele_freq_cutoff)) & (col("pValueHwe") >= hwe_cutoff)). \
   write. \
   format("delta"). \
   save(delta_qc_path)

Figure 3. Histogram of Hardy-Weinberg Equilibrium P values

Control for ancestry

Population structure can confound genotype-phenotype association analyses. To control for differing ancestry between participants in the study, here we calculate principal components (PCs), which are provided as covariates to the regression kernel. Spark supports singular value decomposition (SVD) through the Spark MLlib DistributedMatrix API, and SVD can be used to calculate PCs from the transpose of the genotypes matrix. We have introduced an API in Spark that makes it easy to build a DistributedMatrix from a DataFrame, and use this to run SVD and get our PCs.

from pyspark.mllib.linalg.distributed import RowMatrix
from pyspark.mllib.util import MLUtils

vectorized = spark.read.format("delta"). \
                        load(delta_qc_path). \
                        selectExpr("array_to_sparse_vector(genotype_states(genotypes)) as features").cache()

matrix = RowMatrix(MLUtils.convertVectorColumnsFromML(vectorized, "features").rdd.map(lambda x: x.features))
pcs = matrix.computeSVD(num_pcs)
pcs_df = spark.createDataFrame(pcs.V.toArray().tolist(), ["pc" + str(i) for i in range(num_pcs)])

After running PCA, we get back a dense matrix of PCs per sample, which we will pass as covariates to the regression analysis. The next steps extract only the sampleId and the principal components, which allows us to join against the 1,000 Genomes sample metadata file to label each sample with its super-population.
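The sample-labelling step itself lives in the attached notebook; a minimal sketch of the idea, assuming the sample IDs have been extracted (in genotype-array order) into a Python list called sample_ids and that sample_metadata_path points to the 1000 Genomes panel file (both names are hypothetical here), might look like:

from pyspark.sql.functions import col

# Attach sample IDs to the PCs (V has one row per sample, in genotype-array order)
pcs_with_samples = spark.createDataFrame(
    [(s,) + tuple(row) for s, row in zip(sample_ids, pcs.V.toArray().tolist())],
    ["sampleId"] + ["pc" + str(i) for i in range(num_pcs)])

# Join against the 1000 Genomes sample metadata to label each sample's super-population
sample_metadata = spark.read.csv(sample_metadata_path, sep="\t", header=True). \
                        select(col("sample").alias("sampleId"), "super_pop")

display(pcs_with_samples.join(sample_metadata, "sampleId"))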

With Databricks’ display() command, we can view the clusters of our components within the following scatterplot.

Figure 4. Principal Component Analysis with Super Population Labelling:

EUR = European, EAS = East Asian, AMR = Admixed American, SAS = South Asian, AFR = African

Ingest Phenotype Data

For this genome-wide association study, we will be using simulated BMI phenotypic data to associate with the genotypes. Similar to the ingestion of our genotype data, we will ingest the BMI data by reading our sample Parquet data.

# Ingest normalized phenotype data
phenotypes_path = "dbfs:/databricks-datasets/genomics/1000G/phenotypes.normalized"
bmiPhenotype = spark.read. \
                     format("parquet"). \
                     load(phenotypes_path). \
                     withColumnRenamed("values", "phenotype_values")

# View BMI data
display(bmiPhenotype.selectExpr("explode(phenotype_values) AS bmi"))

You can visualize the BMI histogram from the preceding display() command.

Figure 5. BMI histogram

Running the Genome-Wide Association Study

Now that we have performed the necessary quality control and extract, transform, and load (ETL) steps, the next phase of our solution is to run the GWAS by performing the following tasks:

  • Map the genotypes, phenotypes, and principal components together (using crossJoin).
  • Calculate the GWAS statistics by running linear regression.
  • Build a new Apache Spark DataFrame (gwas_df) that contains the GWAS statistics.

# Map variants to GWAS via cross-joins between genotypes, phenotypes, and principal components
covariates = spark.read.format("delta").load(principal_components_path)
phenotypeAndCovariates = bmiPhenotype.crossJoin(covariates)
genotypes = spark.read.format("delta").load(delta_qc_path)

genotypes.crossJoin(phenotypeAndCovariates). \
          selectExpr("contigName", "start", "phenotype", \
                     "expand_struct(linear_regression_gwas(genotype_states(genotypes), phenotype_values, covariates))"). \
          write. \
          format("delta"). \
          save(gwas_results_path)

# Display data
display(spark.read.format("delta").load(gwas_results_path))

Figure 6. Spark DataFrame of GWAS results

The display() command allows us to sanity check the results. Next, we can convert our PySpark DataFrame to R, which allows us to use the qqman package to visualize the results across the genome with a Manhattan plot.

# Extract out GWAS results (and alias various column names)
gwas_results <- select(gwas_df, c(cast(alias(gwas_df$contigName, "CHR"), "double"), alias(gwas_df$start, "BP"), "P"))

# Convert from a Spark DataFrame to an R DataFrame
gwas_results_rdf <- as.data.frame(gwas_results)

# Install packages necessary for Manhattan plot
install.packages("qqman", repos="http://cran.us.r-project.org")
library(qqman)

# Create Manhattan plot of GWAS results and log to MLflow
png('/databricks/driver/manhattan.png')
manhattan(gwas_results_rdf, 
          col = c("#228b22", "#6441A5"), 
          chrlabs = NULL,
          suggestiveline = -log10(1e-05), 
          genomewideline = -log10(5e-08),
          highlight = NULL, 
          logp = TRUE, 
          annotatePval = NULL, 
          ylim=c(0,17))
dev.off()

mlflow.log_artifact('/databricks/driver/manhattan.png')

Figure 7. GWAS Manhattan Plot

As you can see from our genome-wide association study, for our 1000 genomes simulated data, there are several loci associated with BMI clustered on chromosome 2. In fact, these are the loci whose known associations with BMI were used to simulate our BMI phenotype.

# Execute QQ plot
qq(gwas_results_rdf$P)

We can also check that we have successfully controlled for ancestry by making a quantile-quantile (QQ) plot. In this case, the deviation from expected represents true associations.

Figure 8. GWAS QQ Plot

Finally, we have logged the parameters, metrics, and plots associated with this GWAS run using MLflow, so the analysis can be tracked, monitored, and reproduced.
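The pattern is the same as the earlier mlflow.log_metric call; a short sketch of the logging calls used around the workflow (the run name and choice of parameters are illustrative) is:

import mlflow

with mlflow.start_run(run_name="gwas-1kg"):  # run name is illustrative
    # parameters that define the analysis
    mlflow.log_param("allele_freq_cutoff", allele_freq_cutoff)
    mlflow.log_param("hwe_cutoff", hwe_cutoff)
    mlflow.log_param("num_pcs", num_pcs)
    # metrics computed along the way
    mlflow.log_metric("Number Variants pre-QC", num_variants)
    # plots written to the driver's local disk, e.g. the Manhattan plot above
    mlflow.log_artifact("/databricks/driver/manhattan.png")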

Summarizing the Analysis

In this blog, we have demonstrated an end-to-end GWAS workflow using Apache Spark, Delta Lake, and MLflow. Whether you are validating the accuracy of a genotyping assay in clinical use, like Sanford Health, or performing a meta-analysis of GWAS results for target identification, like Regeneron, the Databricks platform makes it easy to extend analyses and build downstream exploratory visualizations. You can do this through our built-in dashboarding functionality or through our optimized connectors to BI tools such as Tableau and Power BI, enabling non-coding bench scientists and clinicians to rapidly explore large datasets.

Figure 9. MLflow tracking of each run enables reproducibility of experiments

By robustly engineering an end-to-end GWAS workflow, scientists can move away from ad hoc analysis on flat files to scalable and reproducible computational frameworks in production. Furthermore, by reading VCF data through a Spark data source into Delta Lake, data scientists can now integrate tabular phenotypes, Electronic Health Record (EHR) extracts, images, real-world evidence, and lab values under a unified framework. Try it yourself today by downloading the Engineering population scale GWAS Databricks notebook.

Try it!

Run our scalable GWAS workflow on the Databricks platform (Azure | AWS). Learn more about our genomics solutions in the Databricks Unified Analytics for Genomics and try out a preview today.


Democratizing Financial Time Series Analysis with Databricks

Try this notebook in Databricks

Introduction

The role of data scientists, data engineers, and analysts at financial institutions includes (but is not limited to) protecting hundreds of billions of dollars’ worth of assets and protecting investors from trillion-dollar impacts, say, from a flash crash. One of the biggest technical challenges underlying these problems is scaling time series manipulation. Tick data, alternative data sets such as geospatial or transactional data, and fundamental economic data are examples of the rich data sources available to financial institutions, all of which are naturally indexed by timestamp. Solving business problems in finance such as risk, fraud, and compliance ultimately rests on being able to aggregate and analyze thousands of time series in parallel. Older technologies, which are RDBMS-based, do not easily scale when analyzing trading strategies or conducting regulatory analyses over years of historical data. Moreover, many existing time series technologies use specialized languages instead of standard SQL or Python-based APIs.

Fortunately, Apache Spark™ contains plenty of built-in functionality such as windowing which naturally parallelizes time-series operations.  Moreover, Koalas, an open-source project that allows you to execute distributed Machine Learning queries via Apache Spark using the familiar pandas syntax, helps extend this power to data scientists and analysts.

In this blog, we will show how to build time series functions on hundreds of thousands of tickers in parallel. Next, we demonstrate how to modularize functions in a local IDE and create rich time-series feature sets with Databricks Connect. Lastly, if you are a pandas user looking to scale data preparation which feeds into financial anomaly detection or other statistical analyses, we use a market manipulation example to show how Koalas makes scaling transparent to the typical data science workflow.

Set-Up Time Series Data Sources

Let’s begin by ingesting a couple of traditional financial time series datasets: trades and quotes. We have simulated the datasets for this blog, which are modeled on data received from a trade reporting facility (trades) and the National Best Bid Offer (NBBO) feed (from an exchange such as the NYSE). You can find some example data here: https://www.tickdata.com/product/nbbo/.

This article assumes familiarity with basic financial terms; for more extensive references, see Investopedia’s documentation. What is notable from the datasets below is that we’ve assigned the TimestampType to each timestamp, so the trade execution time and quote change time have been renamed to event_ts for normalization purposes. In addition, as shown in the full notebook attached in this article, we ultimately convert these datasets to Delta format so that we ensure data quality and keep a columnar format, which is most efficient for the type of interactive queries we have below.

from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

trade_schema = StructType([
    StructField("symbol", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("trade_dt", StringType()),
    StructField("trade_pr", DoubleType())
])

quote_schema = StructType([
    StructField("symbol", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("trade_dt", StringType()),
    StructField("bid_pr", DoubleType()),
    StructField("ask_pr", DoubleType())
])
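The ingestion itself lives in the attached notebook; a minimal sketch of loading the raw files with these schemas and converting them to Delta (the source format and all paths here are hypothetical) could be:

# Hypothetical CSV sources and target paths; see the attached notebook for the actual ingestion
trades = spark.read.csv("/mnt/raw/trades", header=True, schema=trade_schema)
quotes = spark.read.csv("/mnt/raw/quotes", header=True, schema=quote_schema)

# Persist as Delta to get a columnar format plus transactional guarantees
trades.write.format("delta").mode("overwrite").save("/mnt/delta/trades")
quotes.write.format("delta").mode("overwrite").save("/mnt/delta/quotes")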

Merging and Aggregating Time Series with Apache Spark™

There are over six hundred thousand publicly traded securities globally today in financial markets. Given our trade and quote datasets span this volume of securities, we’ll need a tool that scales easily.  Because Apache Spark™ offers a simple API for ETL and it is the standard engine for parallelization, it is our go-to tool for merging and aggregating standard metrics which in turn help us understand liquidity, risk, and fraud. We’ll start with the merging of trades and quotes, then aggregate the trades dataset to show simple ways to slice the data. Lastly, we’ll show how to package this code up into classes for faster iterative development with Databricks Connect. The full code used for the metrics below is in the attached notebook.

AS-OF Joins

An as-of join is a commonly used ‘merge’ technique that returns the latest right value effective at the time of the left timestamp. For most time-series analyses, multiple types of time series are joined together on the symbol to understand the state of one time series (e.g. NBBO) at a particular time present in another time series (e.g. trades). The example below records the state of the NBBO for every trade for all symbols. As seen in the figure below, we have started off with an initial base time series (trades) and merged the NBBO dataset so that each timestamp has the latest bid and offer recorded ‘as of the time of the trade.’ Once we know the latest bid and offer, we can compute the difference (known as the spread) to understand at what points the liquidity may have been lower (indicated by a large spread). This kind of metric impacts how you may organize your trading strategy to boost your alpha.

First, let’s use the built-in windowing function last to find the last non-null quote value after ordering by time.

# sample code inside join method
from pyspark.sql import Window
from pyspark.sql.functions import last

# define partitioning keys for window
partition_spec = Window.partitionBy('symbol')
        
# define sort - the ind_cd is a sort key (quotes before trades)
join_spec = partition_spec.orderBy('event_ts'). \
                  rowsBetween(Window.unboundedPreceding, Window.currentRow)
        
# use the last_value functionality to get the latest effective record
# (applied inside the join method to the DataFrame being merged)
select(last("bid", True).over(join_spec).alias("latest_bid"))

Now, we’ll call our custom join to merge our data and attach our quotes. See attached notebook for full code.

# apply our custom join
mkt_hrs_trades = trades.filter(col("symbol") == "K")
mkt_hrs_trades_ts = base_ts(mkt_hrs_trades)
quotes_ts = quotes.filter(col("symbol") == "K")

display(mkt_hrs_trades_ts.join(quotes_ts))

 

Marking VWAP Against Trade Patterns

We’ve shown a merging technique above, so now let’s focus on a standard aggregation, namely Volume-Weighted Average Price (VWAP), which is the average price weighted by volume. This metric is an indicator of the trend and value of the security throughout the day.  The vwap function within our wrapper class (in the attached notebook) shows where the VWAP falls above or below the trading price of the security. In particular, we can now identify the window during which the VWAP (in orange) falls below the trade price, showing that the stock is overbought.

trade_ts = base_ts(trades.select('event_ts', 'symbol', 'price', lit(100).alias("volume")))
vwap_df = trade_ts.vwap(frequency = 'm')

display(vwap_df.filter(col('symbol') == "K") \
    .filter(col('time_group').between('09:30','16:00')) \
    .orderBy('time_group'))
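The vwap wrapper itself is defined in the attached notebook; a rough plain-PySpark equivalent of a minute-level VWAP, assuming the trade_pr column from the schema above and a constant volume like the lit(100) placeholder, might be:

from pyspark.sql.functions import col, date_trunc, lit, sum as spark_sum

# Bucket trades by minute and compute sum(price * volume) / sum(volume) per symbol
vwap_sketch = trades. \
    withColumn("volume", lit(100)). \
    withColumn("time_group", date_trunc("minute", col("event_ts"))). \
    groupBy("symbol", "time_group"). \
    agg((spark_sum(col("trade_pr") * col("volume")) /
         spark_sum(col("volume"))).alias("vwap"))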

Faster Iterative Development with Databricks Connect

Up to this point, we’ve created some basic wrappers for one-off time-series metrics. However, productionalization of code requires modularization and testing, and this is best accomplished in an IDE. This year, we introduced Databricks Connect, which enables local IDE development and testing against a live Databricks cluster. The benefits of Databricks Connect for financial analyses include the ability to add time-series features on small test data with the added flexibility to execute interactive Spark queries against years of historical tick data to validate features.

We use PyCharm to organize classes needed for wrapping PySpark functionality for generating a rich time series feature set. This IDE gives us code completion, formatting standards, and an environment to quickly test classes and methods before running code.

We can quickly debug classes then run Spark code directly from our laptop using a Jupyter notebook which loads our local classes and executes interactive queries with scalable infrastructure. The console pane shows our jobs being executed against a live cluster.

Lastly, we get the best of both worlds by using our local IDE and at the same time appending to our materialized time-series view on our largest time-series dataset.
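With the databricks-connect package installed and configured in place of a local PySpark, the usual session builder transparently targets the remote cluster; a minimal sketch (the Delta path is hypothetical) looks like:

from pyspark.sql import SparkSession

# databricks-connect routes this session to the configured Databricks cluster
spark = SparkSession.builder.getOrCreate()

# Local wrapper classes can now run interactive queries against years of tick data
trades = spark.read.format("delta").load("/mnt/delta/trades")  # hypothetical path
print(trades.count())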

Leveraging Koalas for Market Manipulation

The pandas API is the standard tool for data manipulation and analysis in Python and is deeply integrated into the Python data science ecosystem, e.g. NumPy, SciPy, matplotlib. One drawback of pandas is that it does not scale easily to large amounts of data.  Financial data always includes years of historical data, which is critical for risk aggregation or compliance analysis. To make this easier, we introduced Koalas as a way to leverage pandas APIs while executing Spark on the backend. Since the Koalas API matches Pandas, we don’t sacrifice ease of use, and migration to scalable code is a one-line code change (see import of Koalas in the next section). Before we showcase Koalas’ fit for financial time series problems, let’s start with some context on a specific problem in financial fraud: front running.

Front running occurs when the following sequence occurs:

  1. A trading firm is aware of non-public information which may affect the price of a security
  2. The firm buys a large block of the security (or a set of orders totaling a large aggregate volume)
  3. Due to the removal of liquidity, the security price rises
  4. The firm sells the security (whose price has been driven upward by the previous purchase) at a large profit, forcing investors to pay a higher price even though the information on which the trade was based was non-public

Source: CC0 Public domain images: https://pxhere.com/en/photo/1531985, https://pxhere.com/en/photo/847099

For illustration purposes, a simple example using farmer’s markets and an apple pie business is found here. This example shows Freddy, a runner who is aware of the imminent demand for apples needed for apple pie businesses across the country and subsequently purchases apples at all farmer’s markets. This, in effect, allows Freddy to sell his apples at a premium to buyers since Freddy caused a major impact by purchasing before any other buyers (representing investors) had a chance to buy the product.

Detection of front running requires an understanding of order flow imbalances (see diagram below). In particular, anomalies in order flow imbalance will help identify windows during which front running may be occurring.

Let’s now use the koalas package to improve our productivity while solving the market manipulation problem. Namely, we’ll focus on the following to find order flow imbalance anomalies:

  • De-duplication of events at the same time
  • Lag windows for assessing supply/demand increases
  • Merging of data frames to aggregate order flow imbalances

De-duplication of Time Series

Common time series data cleansing involves imputation and de-duplication. You may find duplicate values in high-frequency data (such as quote data). When there are multiple values per time with no sequence number, we need to deduplicate so subsequent statistical analysis makes sense.  In the case below, multiple bid/ask shares quantities are reported per time, so for computation of order imbalance, we want to rely on one value for maximum depth per time.

import databricks.koalas as ks 

kdf_src = ks.read_delta("...")
grouped_kdf = kdf_src.groupby(['event_ts'], as_index=False).max()
grouped_kdf = grouped_kdf.sort_values(by=['event_ts'])
grouped_kdf.head()

 

Time Series Windowing with Koalas

We’ve deduplicated our time series, so now let’s look at windows so we can find supply and demand. Windowing for time series generally refers to looking at slices or intervals of time. Most trend calculations (simple moving average, for example) use the concept of time windows to perform calculations. Koalas inherits the simple pandas interface for getting lag or lead values within a window using shift (analogous to Spark’s lag function), as demonstrated below.

grouped_kdf.set_index('event_ts', inplace=True, drop=True)
lag_grouped_kdf = grouped_kdf.shift(periods=1, fill_value=0)

lag_grouped_kdf.head()

Merge on Timestamp and Compute Imbalance with Koalas Column Arithmetic

Now that we have lag values computed, we want to be able to merge this dataset with our original time series of quotes. Below, we employ the Koalas merge to accomplish this with our time index. This gives us the consolidated view we need for supply/demand computations which lead to our order imbalance metric.

lagged = grouped_kdf.merge(lag_grouped_kdf, left_index=True, right_index=True, suffixes=['', '_lag'])
lagged['imblnc_contrib'] = lagged['bid_shrs_qt']*lagged['incr_demand'] \
    - lagged['bid_shrs_qt_lag']*lagged['decr_demand'] \
    - lagged['ask_shrs_qt']*lagged['incr_supply'] \
    + lagged['ask_shrs_qt_lag']*lagged['decr_supply']       

Koalas to NumPy for Fitting Distributions

After our initial prep, it’s time to convert our Koalas data frame to a format useful for statistical analysis. For this problem, we might aggregate our imbalances down to the minute or other unit of time before proceeding, but for purposes of illustration, we’ll run against the full dataset for our ticker ‘ITUB’. Below, we convert our Koalas structure to a NumPy dataset so we can use the SciPy library for detecting anomalies in order flow imbalance. Simply use the to_numpy() syntax to bridge this analysis.

from scipy.stats import t
import scipy.stats as st
import numpy as np

q_ofi_values = lagged['imblnc_contrib'].to_numpy()

Below, we plotted the distribution of our order flow imbalances along with markers for the 5th and 95th percentiles to identify the events during which imbalance anomalies occurred. See the full notebook for the code to fit distributions and create this plot. The imbalance windows we just computed with our Koalas/SciPy workflow will correlate with potential instances of front running, the market manipulation scheme we were searching for.
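The full notebook contains the plotting code; a simplified sketch of the fit-and-flag step with SciPy, reusing the q_ofi_values array and the t distribution imported above, might look like:

# Fit a Student's t distribution to the imbalance values and flag observations
# outside the 5th/95th percentiles as candidate anomalies
params = t.fit(q_ofi_values)
lower, upper = t.ppf([0.05, 0.95], *params)

anomalies = q_ofi_values[(q_ofi_values < lower) | (q_ofi_values > upper)]
print("Number of anomalous imbalance observations: %d" % len(anomalies))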

The time series visualization below pinpoints the anomalies retrieved as outliers above, highlighted in orange. In our final visualization, we use the plotly library to summarize time windows and frequency of anomalies in the form of a heat map. Specifically, we identify the 10:50:10 – 10:50:20 timeframe as a potential problem area from the front running perspective.

Conclusion

In this article, we’ve shown how Apache Spark and Databricks can be leveraged for time series analysis both directly, by using windowing and wrappers, and indirectly, by using Koalas. Most data scientists rely on the pandas API, so Koalas helps them use pandas functionality while allowing the scale of Apache Spark. The advantages of using Spark and Koalas for time series analyses include:

  • Parallelize analyses of your time series for risk, fraud, or compliance use cases with as-of joins and simple aggregations
  • Iterate faster and create rich time series features with Databricks Connect
  • Arm your data science and quant teams with Koalas to scale out data preparation while not sacrificing pandas ease of use and APIs

Try this notebook on Databricks today! Contact us to learn more about how we assist customers with financial time series use cases.


How Informatica Data Engineering Goes Hadoop-less with Databricks


Back in May, we announced our partnership with Informatica to build out a rich set of integrations between our two platforms.

It’s been exciting work for the team because of what we can do for joint customers who combine our Managed Delta Lake with Informatica’s Big Data Management and Enterprise Data Catalog. This vision led us to the term “Intelligent Data Pipelines,” which we outlined in our first blog post. Customers can have a solution that enables data engineers to quickly ingest high volumes of data from multiple hybrid sources into the cloud, stream that into an optimized data lake, and ensure that data is properly governed, making it accurate and ready for downstream analytics and ML.

Migrating Big Data Workloads from On-premises Hadoop to the Cloud

Most recently, we focused specifically on organizations looking to migrate their big data workloads from on-premises Hadoop to the cloud. Those data teams still spend a lot of time on data preparation and ingestion vs. the higher-value advanced analytics and machine learning. Core Hadoop services such as YARN and HDFS are complex to manage, which results in high TCO. Users have to manually configure and optimize clusters for scale-up and scale-down, which is time consuming and directly impacts the reliability and performance of Hadoop-based data lakes.

Key Questions Concerning a Hadoop to Cloud Migration

Does migrating from Hadoop to the cloud release the operational burden of managing shared clusters? How do you manage compute and storage when migrating to the cloud? What are the key benefits of migrating to a cloud-native platform like Databricks? How does Databricks compare to YARN and HDFS?

Those questions are the exact topic of this blog co-authored by Informatica and Databricks. It is a detailed review of the architecture changes in migrating from Hadoop to Databricks, and for added measure it covers best practices of Hadoop migration to fully leverage the Databricks and Informatica data engineering integration.  Check it out!


Delta Lake Now Hosted by the Linux Foundation to Become the Open Standard for Data Lakes


At today’s Spark + AI Summit Europe in Amsterdam, we announced that Delta Lake is becoming a Linux Foundation project. Together with the community, the project aims to establish an open standard for managing large amounts of data in data lakes. The Apache 2.0 software license remains unchanged.

Delta Lake focuses on improving the reliability and scalability of data lakes. Its higher level abstractions and guarantees, including ACID transactions and time travel, drastically simplify the complexity of real-world data engineering architecture. Since we open sourced Delta Lake six months ago, we have been humbled by the reception. The project has been deployed at thousands of organizations and processes exabytes of data each month, becoming an indispensable pillar in data and AI architectures.

To further drive adoption and grow the community, we’ve decided to partner with the Linux Foundation to leverage their platform and their extensive experience in fostering influential open source projects, ranging from Linux itself to Jenkins and Kubernetes. We are joined by Alibaba, Booz Allen Hamilton, Intel, and Starburst in the announcement to develop Delta Lake support not just for Apache Spark, but also Apache Hive, Apache Nifi, and Presto.

Rich Feature Sets for More Robust Data Lakes

As discussed earlier, Delta Lake makes data lakes easier to work with and more robust. It is designed to address many of the problems commonly found with data lakes. For example, incomplete data ingestion can lead to corrupt data; this is addressed by Delta Lake’s ACID Transactions, including for multiple data pipelines reading and writing data concurrently to a data lake. Data sources feeding data lakes may not provide complete column data or correct data types, and so Schema Enforcement prevents bad data from causing data corruption. Change data capture and update/delete/upsert support allows non-append-only workloads to work well on data lakes, a must for GDPR/CCPA.
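As an illustration of the update/delete/upsert support, a small sketch using the Delta Lake Python API (the table path, predicate, and the updates_df DataFrame are hypothetical) could look like:

from delta.tables import DeltaTable

# Delete a user's records, e.g. to satisfy a GDPR/CCPA request (hypothetical path and predicate)
events = DeltaTable.forPath(spark, "/mnt/delta/events")
events.delete("user_id = 'user-123'")

# Upsert a DataFrame of changed records (updates_df, hypothetical) into the same table
events.alias("t").merge(
    updates_df.alias("u"), "t.event_id = u.event_id"). \
    whenMatchedUpdateAll(). \
    whenNotMatchedInsertAll(). \
    execute()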

The list of Delta Lake’s capabilities goes on, with the overarching goal of bringing greater data reliability and scalability to data lakes, so that their data can be consumed more easily by other systems and technologies.

Data Lake Openness and Extensibility

The key tenets for Delta Lake’s design are for openness and extensibility. Delta Lake stores all the data and metadata in cloud object stores, with an open protocol design that leverages existing open formats such as JSON and Apache Parquet. This openness not only removes the risk of vendor lock-in, but is also critical in building an ecosystem to enable the myriad of different use cases from data science, machine learning, and SQL.

To ensure the project’s long-term growth and community development, we’ve worked with the Linux Foundation to further this spirit of openness.

Open Delta Lake Governance & Community Participation

We’re excited that the Linux Foundation will now host Delta Lake as a neutral home for the project, with an open-governance model to encourage participation and technical contributions. This will help provide a framework for long-term stewardship; establish a community ecosystem invested in Delta Lake’s success; and develop an open standard for data storage in data lakes. We believe that this approach will help ensure that data stored in Delta Lake remains open and accessible, while driving increased innovation and development to solve the challenging problems in this space.

The Databricks team has created and contributed to a variety of open-source projects for the data & AI ecosystem, including Apache Spark, MLflow, Koalas, and Delta Lake. We continue to participate in the open-source community because we know it’s the fastest, most comprehensive way to bring new capabilities to market. We’ve been able to build a sustainable, healthy business, while also connecting with the community to ensure that projects don’t lock customers into proprietary systems or data formats.

We can’t wait to see how the community will shape the future of Delta Lake and the broader ecosystem. Please visit delta.io for the latest release information, and follow @DeltaLakeOSS on Twitter.

Learn more: Linux Foundation Press Release on hosting the Delta Lake Open Source Project


Managed MLflow Now Available on Databricks Community Edition


In February 2016, we introduced Databricks Community Edition, a free edition for big data developers to learn and get started quickly with Apache Spark. Since then our commitment to foster a community of developers remains steadfast: to date, we have over 150K registered Community Edition users, and we have trained thousands of people at meetups, Spark + AI Summits, and other open-source events.

Today, we are excited to extend Databricks Community Edition with hosted MLflow for free, as part of our ongoing commitment to help developers learn about the machine learning lifecycle. With the Community Edition, you can try tutorials that demonstrate how to track results and experiments as you build machine learning models, a crucial stage in the machine learning model’s development lifecycle.

MLflow is an open-source platform for the machine learning lifecycle with four components: MLflow Tracking, MLflow Projects, MLflow Models, and MLflow Registry. MLflow is now included in Databricks Community Edition, meaning that you can utilize its Tracking and Model APIs within a notebook or from your laptop just as easily as you would with managed MLflow in Databricks Enterprise Edition.

In this blog, we briefly explain how you can use MLflow in Community Edition. We’ll share an example notebook that trains a Keras/TensorFlow model and run it within Databricks Community Edition, followed by how to run GitHub examples on your laptop and log results remotely on Databricks Community Edition.

Run Experiments within Community Edition Workspace

First, register for Community Edition. Then, create a cluster with ML Runtime 6.0, which ships with a pre-configured ML environment including mlflow, Keras, PyTorch, TensorFlow, and other libraries. With any other Runtime, you’ll have to install the mlflow library or run dbutils.library.installPyPI("mlflow") in one of the first cells of your notebook.

Creating an Experiment in your Workspace

When in a notebook, MLflow will automatically log results to an experiment associated with the notebook. You can also explicitly create an experiment under which all your model training runs and results are tracked, as shown below:
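A minimal sketch of doing this from a notebook cell (the experiment path is illustrative) is:

import mlflow

# Create (or select) a named experiment in your workspace; runs logged after this
# call are grouped under it. The path below is illustrative.
mlflow.set_experiment("/Users/you@example.com/ce-mnist-experiment")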

Logging Runs in your Default Notebook Experiment

While running your MLflow code within a notebook, the runs will be logged to a default experiment associated with the notebook. Alternatively, you can explicitly set an experiment name with mlflow.set_experiment("path_to_experiment_name") to aggregate and compare runs across multiple notebooks.

Under this workspace and default experiment name, we will train a Keras MNIST model with various training and regularization parameters, such as the number of epochs, hidden layers, units per layer, batch size, momentum, dropout, and activation function. We can run a few experiments with different parameters and select the best model with the lowest validation loss and highest accuracy.

Creating an MLflow Session with the Tracking Server

By using mlflow.start_run(run_name=run_name), we automatically initiate a session with the tracking server, while mlflow.keras.autolog() picks up the current active run and automatically logs parameters, metrics, tags, and the model. Below is an excerpt of the code from the notebook, which you can import into Community Edition.

def run_mlflow(run_name="MLflow CE MNIST"):
    # start an active run
    mlflow.start_run(run_name=run_name)
    # automatically log the metrics under this run_name
    mlflow.keras.autolog()
    ...
    # build Keras model
    model = models.Sequential()
    model.add(layers.Flatten(input_shape=x_train[0].shape))
    ...
    model.add(layers.Dense(10, activation=tf.nn.softmax))
    # compile & fit the model with optimizer and loss type
    model.compile(optimizer=optimizer,
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=args.epochs, batch_size=args.batch_size)
    # evaluate the model
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
    # end the current run
    mlflow.end_run(status='FINISHED')
    ...

As you can see from the above, tracking experiment runs within Community Edition is relatively simple. With a few lines of code, you can use the MLflow Tracking and Model APIs to generate runs in your notebook and visualize their parameters and metrics for evaluation.

This step is an important stage in your model development life cycle.

Run Experiments Locally and Track Results on Community Edition

You can also run experiments on your laptop or local machine, tracking results to the Community Edition. After configuring your local environment and registering for Community Edition, you can track results remotely.

Configuring your Local Environment

  1. pip install mlflow (as described in the MLflow quickstart guide)
  2. As above, create an experiment in your workspace and get its path.

  3. Create a credentials file via databricks configure CLI (and answer the prompts)

    • Databricks Host (should begin with https://): https://community.cloud.databricks.com
    • Username: enter your login credentials
    • Password: enter password for community edition
  4. Configure MLflow to communicate with the Community Edition server: export MLFLOW_TRACKING_URI=databricks
  5. Test out your configuration by creating an experiment via the CLI: mlflow experiments create -n /Users/<your-username>/my-experiment

After the above steps, you can run any Python, Java, or R script containing your machine learning and MLflow code locally and track the results on the MLflow Tracking Server hosted on Community Edition. In addition to the above steps, set the MLFLOW_EXPERIMENT_NAME environment variable to the experiment created above, or in Python:

import mlflow
mlflow.set_experiment("/path to your experiment name in your Workspace")

For this experimental run, we are going to add the above lines to the examples/sklearn_elasticnet_diabetes/osx/train_diabetes.py from the MLflow GitHub Repository in our cloned repo.

Let’s execute three separate runs, each with different parameters on our laptop. With each run, the results will be logged on our Community Edition server under the experiment created above.

python train_diabetes.py 0.01 0.01 && python train_diabetes.py 0.01 0.75 && python train_diabetes.py 0.01 1.0

As shown in the animation above, when the code is executed locally, the runs’ results are logged remotely on the MLflow Tracking Server hosted on your Community Edition.

Or you can simply cut and paste the following code into your favorite editor and run it from your laptop, after configuring the laptop with Databricks MLflow credentials:

import os
import shutil

from random import random, randint
import mlflow
from mlflow import log_metric, log_param, log_artifacts

if __name__ == "__main__":

   # set the tracking server to be Databricks Community Edition
   # set the experiment name; if name does not exist, MLflow will
   # create one for you
   mlflow.set_tracking_uri("databricks")
   mlflow.set_experiment("/Users/your@mail/your_experiment_name")
   print("Running experiment_ce.py")
   print("Tracking on https://community.cloud.databricks.com")
   mlflow.start_run(run_name="CE_TEST")

   # log parameters and metrics
   log_param("param-1", randint(0, 100))
   log_metric("metric-1", random())
   log_metric("metric-2", random() + 1)
   log_metric("metric-3", random() + 2)

   # create artifact directory for your artifacts
   if not os.path.exists("outputs"):
       os.makedirs("outputs")
   with open("outputs/test.txt", "w") as f:
       f.write("Looks like I logged on the Community Edition!")

   # log artifacts
   log_artifacts("outputs")
   shutil.rmtree('outputs')
   mlflow.end_run()

Summary

To recap, MLflow is now available on Databricks Community Edition. As an important step in the machine learning model development lifecycle, we shared two ways to run your machine learning experiments using MLflow APIs: one is by running in a notebook within Community Edition; the other is by running scripts locally on your laptop and logging results to the tracking server hosted on Community Edition.

Intended for rapid experimentation and learning, the MLflow server on Community Edition is not designed for production use. For example, it does not include the ability to run and reproduce MLflow Projects. And its scalability and uptime guarantees are limited.

Since its original release in February 2016, Community Edition has proved a useful tool for learning about Apache Spark, data science, and data engineering. We’re happy to extend it to learn about managing the machine learning lifecycle with MLflow.

What’s Next

To get started, try some examples from the MLflow GitHub repository on your laptop. These Python scripts (quickstart/mlflow_tracking.py and sklearn_elasticnet_wine/train.py) are a good start to train models locally on your laptop and track remotely on the Community Edition. Or import and run this notebook in your Community Edition.

Join the MLflow community and download the latest MLflow 1.3. Finally, after using MLflow, feel free to contribute.

Read More

If you are new to MLflow, read the MLflow quickstart. For production use cases, read about Managed MLflow on Databricks.


Introducing the MLflow Model Registry


At today’s Spark + AI Summit in Amsterdam, we announced the availability of the MLflow Model Registry, a new component in the MLflow open source ML platform. Since we introduced MLflow at Spark+AI Summit 2018, the project has gained more than 140 contributors and 800,000 monthly downloads on PyPI, making MLflow one of the fastest growing open source projects in machine learning!

MLflow already has the ability to track metrics, parameters, and artifacts as part of experiments, package models and reproducible ML projects, and deploy models to batch or real-time serving platforms.

The MLflow Model Registry builds on MLflow’s existing capabilities to provide organizations with one central place to share ML models, collaborate on moving them from experimentation to testing and production, and implement approval and governance workflows. Since we started MLflow, model management was the top requested feature among our open source users, so we are excited to launch a model management system that integrates directly with MLflow.

The Model Registry gives MLflow new tools to share, review, and manage ML models throughout their lifecycle

Applying Good Engineering Principles to Machine Learning with MLflow Model Registry

Many Data Science and Machine Learning projects fail due to preventable issues that were discovered and solved in software engineering more than a decade ago. However, those solutions need to be adapted due to key differences between developing code and training ML models.

  • Expertise, Code, AND Data: With the addition of data, Data Science and ML code not only needs to deal with data dependencies, but also handle the inherent non-deterministic characteristics of statistical modeling. ML models are not guaranteed to behave the same way when trained twice, unlike traditional code that can be easily unit tested.
  • Model Artifacts: In addition to application code, ML products and features also depend on models that are the result of a training process. Those model artifacts can often be large (on the order of gigabytes) and often need to be served differently from code itself.
  • Collaboration: In large organizations, models that are deployed in an application are often not trained by the same people responsible for the deployment. Handoffs between experimentation, testing, and production deployments are similar, but not identical to approval processes in software engineering.
The MLflow Model Registry addresses the aforementioned challenges. Below are some of the key features of this new component.

One hub for managing ML models collaboratively

Building and deploying ML models is a team sport. Not only are the responsibilities along the machine learning model lifecycle often split across multiple people (e.g. data scientists train models whereas production engineers deploy them), but also, at each lifecycle stage, teams can benefit from collaboration and sharing (e.g. a fraud model built in one part of the organization could be re-used in others).

The new Model Registry facilitates sharing of expertise and knowledge across teams by making ML models more discoverable and providing collaborative features to jointly improve on common ML tasks. Simply register an MLflow model from your experiments to get started, as shown in the sketch below. The registry will then let you track multiple versions of the model and mark each one with a lifecycle stage: development, staging, production or archived.
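A sketch of that workflow using the registry client APIs (the API names follow the MLflow releases that ship the registry; the run ID and model name below are illustrative) looks roughly like:

import mlflow
from mlflow.tracking import MlflowClient

# Register the model artifact logged by an experiment run under a shared name
model_uri = "runs:/{}/model".format(run_id)          # run_id is illustrative
result = mlflow.register_model(model_uri, "fraud-detection")

# Move a specific version through the lifecycle stages, e.g. to Production
client = MlflowClient()
client.transition_model_version_stage(
    name="fraud-detection", version=result.version, stage="Production")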

Sample machine learning models displayed via the MLflow Model Registry dashboard

Flexible CI/CD pipelines to manage stage transitions

The MLflow Model Registry lets you manage your models’ lifecycle either manually or through automated tools. Analogous to the approval process in software engineering, users can manually request to move a model to a new lifecycle stage (e.g., from Staging to Production), and review or comment on other users’ transition requests. Alternatively, you can use the Model Registry’s API to plug in continuous integration and deployment (CI/CD) tools such as Jenkins to automatically test and transition your models. Each model also links to the experiment run that built it in MLflow Tracking to let you easily review models.

Example machine learning model page view in MLflow, showing how users can request and review changes to a model’s stage

Visibility and governance for the ML lifecycle

In large enterprises, the number of ML models that are in development, staging, and production at any given point in time may be in the 100s or 1,000s. Having full visibility into which models exist, what stages they are in, and who has collaborated on and changed the deployment stages of a model allows organizations to better manage their ML efforts.

The MLflow Model Registry provides full visibility and enables governance by keeping track of each model’s history and managing who can approve changes to the model’s stages.

Identify model versions, stages, and authors of each model version

Get Started with the MLflow Model Registry

We’ve been developing the MLflow Model Registry with feedback from Databricks customers over the past few quarters, and today, we’ve posted the first open source patch for the MLflow Model Registry on GitHub. We would love to hear your feedback! We plan to continue developing the registry over the next few months and include it in an upcoming MLflow release. Databricks customers can also sign up here to get started with the Model Registry.


Introducing Glow: an open-source toolkit for large-scale genomic analysis


The key to solving some of today’s most challenging medical problems lies in the analysis of genomics data. Understanding the impact of the minor changes in an individual’s genome on their overall health is fundamentally a data driven challenge that requires integration across hundreds of thousands of individuals. By analyzing genomes across large cohorts, researchers can build highly accurate models that predict disease risk, which can then be interrogated to understand the prospects of targeting a single gene for therapeutic development. However, aggregating data from many individuals in order to build these models and run them at population-scale entails significant data engineering efforts.

Today, we are excited to introduce Glow, an open-source collaboration between the Regeneron Genetics Center® and Databricks. Glow is an open-source toolkit built on Apache Spark™ that makes it easy to aggregate genomic and phenotypic data with accelerated algorithms for genomic data preparation, statistical analysis, and machine learning at biobank-scale. Over the last few years, we have seen researchers struggle when trying to aggregate insights across large genomic cohorts. In the Glow project, we have jointly developed an industrial-quality framework that accelerates the data engineering processes required to build high-quality pipelines for storing and analyzing genomic data at scale. Simultaneously, Glow provides a bridge out of niche bioinformatic toolkits into modern data analytics environments, where machine learning can be fully leveraged on multifaceted population-scale health datasets that include genomic data, and a rapidly expanding universe of -omics and phenotypes.

Problems with analyzing large genomic datasets

In the last few years, our organizations have worked with a wide variety of projects and collaborators, and we found areas where we could be more effective and efficient when working with large genomic variation datasets, such as the UK Biobank cohort. These areas for improvement include:

  • Lack of scalable workflows: As incredibly valuable as the legacy bioinformatics tools for genomics have been (e.g. GATK, PLINK, BCFtools, tabix, Picard, SAIGE, BOLT-LMM, VEP, SnpEff), they were primarily designed to run on single-node machines that do not scale for population-wide analyses. Teams are spending long hours splitting up datasets to parallelize workflows in hopes of improving processing speeds. On top of that, they typically create a large number of interconnected jobs to run these complex workflows. Not only is this time consuming, but it’s also hard to manage. A single job failure can bring down an entire pipeline, causing hours or days of lost work.
  • Rigid tools that are hard to use: Traditional bioinformatics tools have a steep learning curve with little flexibility, making them hard to adopt and use. These tools are typically rigid command line tools requiring users to learn a proprietary query language and API structure. Although there have been efforts to create tools for massive datasets, they require specialized APIs and file formats, and support neither user-defined functions nor integration with common phenotypic data sources like an EHR system or imaging study, preventing teams from optimizing their genome analysis workflows to meet their unique needs.
  • Limited support for tertiary analytics and ML: Genomic analysis tools rely on file formats without explicit data schemas that are designed for a limited set of genomic analyses. Integrating novel genomic methods and machine learning is not an option, preventing teams from building powerful predictive models. For example, using ML across genotypes from many samples for use cases like polygenic risk scoring is difficult because existing genome-wide association study (GWAS) tools do not efficiently integrate into large-scale ML frameworks. As such, scientists typically pre-filter variants before building a risk model, reducing the quality of the model, especially in the presence of high impact rare variants.

As genomic datasets grow larger and larger, these problems become more challenging. While single node command line tools may have been sufficient to preprocess and conduct quality control on cohorts of hundreds of samples, they are far too slow and cumbersome to use when merging hundreds of thousands of samples. Traditional GWAS tools may have been sufficient when studying a single phenotype, but their throughput becomes too low when working with high-dimensional phenotype data or PheWAS studies. While several tools aim to solve these problems today, they have complex and proprietary APIs that make them both hard to learn and difficult to use alongside phenotypic data culled from an electronic medical record system, or generated by transcriptomic or imaging studies.

    Glow integrates bioinformatics tools with best-of-breed big data processing engines

    In Glow, we aspire to solve these problems by building an easy-to-learn and easy-to-use  genomics library that builds on top of the widely used Apache Spark open-source project, and is natively optimized to benefit from the scale of cloud computing. We approach the problem with the following three guidelines:

    • Build for scale on industry-trusted tooling: We have developed on top of leading open-source technologies for distributed computing in the cloud including Apache Spark SQL and the high performance Delta Lake storage layer. These tools transparently manage, cache, and process large volumes of data, making it possible to both query petabytes of genomic data in near-real time and run thousands of data processing tasks with high reliability and scalability.
    • Simplify use with prebuilt genomic analyses and integration with common tools: We provide built-in, single line commands in Python, R, Scala, and SQL for common genomic analyses (e.g., quality control functions, variant normalization, GWAS, etc.), that make it easy to get your workflows running in no time. We work with common array (BGEN) and sequencing (VCF) file formats, and provide a bridge to run command-line tools in parallel using Glow. This allows you to eliminate time spent slicing and dicing your genotype data, and lets you focus on doing science.
• Empower downstream workflows with open source integrations: Glow allows you to take advantage of machine learning with native integrations with popular open-source technologies for machine learning (e.g. TensorFlow through Horovod, pandas, scikit-learn, etc.) and native integration with tracking frameworks like MLflow that enable analysis reproducibility. Glow is built with open-source APIs, uses the open and widely used Delta Lake file format, and has clear project documentation and source code.

Our approach abstracts away complexity, leading to a framework that is powerful, but lightweight. Most functions in Glow are implemented directly in Spark SQL and can be called with a single line of code. Native integration with Spark SQL also provides a unified set of APIs for working with both genomic and phenotypic data, and allows users to flow directly between traditional genomic data processing and machine learning. Ultimately, a complex GWAS analysis can be simplified down to tens of lines of code, and run in minutes.
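For a flavor of what this looks like in practice, here is a minimal sketch in Python: register Glow on an existing SparkSession, read a VCF into a DataFrame, compute per-variant quality-control statistics, and save the result as a Delta Lake table. The paths are placeholders, and the call_summary_stats function name should be checked against the Glow documentation:

import glow
from pyspark.sql.functions import expr

glow.register(spark)  # registers Glow's functions on the notebook's SparkSession

variants = (spark.read
    .format("vcf")                      # Glow's VCF reader
    .load("/mnt/genomics/cohort.vcf"))  # placeholder input path

qc = variants.select(
    "contigName", "start", "referenceAllele", "alternateAlleles",
    expr("call_summary_stats(genotypes)").alias("stats"))

(qc.write
    .format("delta")
    .mode("overwrite")
    .save("/mnt/genomics/delta/variant_qc"))  # placeholder output path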

    Join us and try Glow!

    We are excited to release Glow into the wild, and we hope that you are excited by our vision too! Glow is an open source project hosted on Github, with an Apache 2 license. You can get started by reading our project docs, or create a fork of the repository to start contributing code today. Our hope is to grow Glow as a project where many diverse researchers with varied interests and skills who are working across large-scale genomics can come together to collaborate on new architectures and methods.

    --

    Try Databricks for free. Get started today.

    The post Introducing Glow: an open-source toolkit for large-scale genomic analysis appeared first on Databricks.


    Scaling Financial Time Series Analysis Beyond PCs and Pandas: On-Demand Webinar and FAQ Now Available!


On Oct 9th, 2019, we hosted a live webinar — Scaling Financial Time Series Analysis Beyond PCs and Pandas — with Junta Nakai, Industry Leader Financial Services at Databricks, and Ricardo Portilla, Solution Architect at Databricks. This live webinar showcased the content in the blog post Democratizing Financial Time Series Analysis with Databricks.

    Please find the slide deck for this webinar here.

    Fundamental economic data, financial stock tick data and alternative data sets such as geospatial or transactional data are all indexed by time, often at irregular intervals. Solving business problems in finance such as investment risk, fraud, transaction costs analysis and compliance ultimately rests on being able to analyze millions of time series in parallel. Older technologies, which are RDBMS-based, do not easily scale when analyzing trading strategies or conducting regulatory analyses over years of historical data.

    In this webinar we reviewed:

    • How to build time series functions on hundreds of thousands of tickers in parallel using Apache Spark™.
• If you are a pandas (Python Data Analysis Library) user looking to scale data preparation that feeds into financial anomaly detection or other statistical analyses, we used a market manipulation example to show how Koalas makes scaling transparent to the typical data science workflow.

We demonstrated these concepts using this notebook in Databricks.

    If you’d like free access to the Unified Data Analytics Platform and try our notebooks on it, you can access a free trial here.

    Toward the end, we held a Q&A and below are the questions and answers.

     

    Q: BI Tools traditionally query data warehouses, can they now connect to Databricks?

    A: Great question. There are two approaches to this. Yes, you can connect your BI tools to Databricks directly to query the data lake. Let’s look at this slide below.

If you look at the diagram, BI tools are pointed at one of the managed tables created by Apache Spark. If you have an aggregate table that is specific to a line of business (let's say you created a table with aggregated trade windows throughout the day), this can be queried with a BI tool such as Tableau, Looker, etc. If you need very low latency (for example, dashboards for the C-level), then you can query a data warehouse.

     

Q: Is there a way to effectively distribute the modeling of time series, or is this only distributed pandas-based data manipulation to prepare the data set? Specifically, I use quite a bit of SARIMAX, and I am trying to figure out how to distribute cross-validation of candidate SARIMAX models.

A: This presentation was more focused on the manipulation aspect, but Spark can absolutely distribute things like hyperparameter tuning and cross-validation. If you have a grid defined, or you want to do a random or Bayesian search, what you need to do is define independent problems, or a partitioning of your problem. A good example is forecasting. Let's say I want to iterate through 100 different combinations, where I change whether we specify daily or yearly seasonality, and multiply that by all the different parameters that I am using for an ARIMA model. Then all I need to do is define that grid, and Spark can execute one task per parameter combination. So effectively you are running up to 1,000 or 5,000 forecasts all in parallel. This would be the go-to method to parallelize things like forecasting.
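To make that concrete, here is a rough sketch (not taken from the webinar notebook) of distributing a SARIMAX grid with Spark. The synthetic series and the hold-out MSE scoring are purely illustrative, and sc is the SparkContext available in a Databricks notebook:

import itertools
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic daily series, purely for illustration.
series = pd.Series(np.random.randn(200).cumsum())
train, holdout = series[:180], series[180:]

# Candidate (p, d, q) x seasonal (P, D, Q, s) combinations.
grid = [((p, d, q), (P, D, Q, 7))
        for p, d, q, P, D, Q in itertools.product([0, 1], repeat=6)]

def score(params):
    order, seasonal_order = params
    fit = SARIMAX(train, order=order, seasonal_order=seasonal_order).fit(disp=False)
    forecast = fit.forecast(steps=len(holdout))
    mse = float(np.mean((forecast.values - holdout.values) ** 2))
    return order, seasonal_order, mse

# One Spark task per candidate model; pick the best by hold-out error.
# In practice you would also guard against fits that fail to converge.
results = sc.parallelize(grid, numSlices=len(grid)).map(score).collect()
best = min(results, key=lambda r: r[2])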

     

Q: Is Koalas open source? Does Koalas work with scikit-learn?

A: Yes, Koalas is open-source software, and Koalas definitely works with scikit-learn. If you go through the notebook in the blog here, you can convert any of those data structures and feed them directly into scikit-learn. The only difference is that you may have to convert the structure right before you put it into a machine learning model, i.e. you may have to convert to pandas in the final step. But it should work otherwise; the underlying NumPy data structures act as the bridge.
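As a small sketch of that pattern (the path and the label column are placeholders): do the heavy data preparation with Koalas on Spark, convert to pandas at the very end, and hand the result to scikit-learn.

import databricks.koalas as ks
from sklearn.linear_model import LogisticRegression

kdf = ks.read_csv("/mnt/data/features.csv")   # placeholder path; runs on Spark
kdf = kdf.dropna()                            # large-scale prep stays distributed

pdf = kdf.to_pandas()                         # final step: bring results to pandas
X = pdf.drop(columns=["label"]).values        # placeholder label column
y = pdf["label"].values

model = LogisticRegression(max_iter=1000).fit(X, y)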

     

Q: As a team, how can we do code review or version control if we work on Databricks?

A: The blog article actually points out the mechanism to do that. If you want to leverage Databricks for the performance and compute aspects, MLflow, and all of that, we released something called Databricks Connect, which allows you to work in your local IDE. If you do that, you can always check your code into version control using your standard tools and then deploy using Jenkins as you usually do. The second option is that Databricks notebooks themselves integrate with Git, so you can directly save your work to version control as you go along in a notebook as well.

     

    Q: Any resources, demo, tutorial to handle geospatially oriented time series data? For example, something that could look at the past 5 years of real estate data and combine that with traffic data to show how housing density affects traffic patterns.

A: The techniques highlighted here were multi-purpose. For the AS-OF join, you can certainly use the data sets you are describing; it is a matter of aligning the right timestamps and then choosing a partitioning column. We will consider having subsequent blogs on geospatial data in particular, which will likely go deeper into the techniques or libraries that can be used to effectively join geospatial data. But right now the AS-OF join should work for any data sets you want to use, as long as you are just trying to merge them to get contextual AS-OF data.
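For reference, one plain PySpark way to express an AS-OF join (illustrative only; the notebook linked above packages this pattern differently): union the two feeds, then carry the most recent quote forward onto each trade with a window per ticker. The trades and quotes DataFrames and their columns are assumptions here.

from pyspark.sql import functions as F, Window

# Assumed schemas: trades(ticker, ts, trade_px), quotes(ticker, ts, bid, ask)
quotes_tagged = quotes.select("ticker", "ts", "bid", "ask",
                              F.lit(None).cast("double").alias("trade_px"))
trades_tagged = trades.select("ticker", "ts",
                              F.lit(None).cast("double").alias("bid"),
                              F.lit(None).cast("double").alias("ask"),
                              "trade_px")

merged = quotes_tagged.unionByName(trades_tagged)
w = (Window.partitionBy("ticker").orderBy("ts")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

asof = (merged
        .withColumn("bid", F.last("bid", ignorenulls=True).over(w))
        .withColumn("ask", F.last("ask", ignorenulls=True).over(w))
        .where(F.col("trade_px").isNotNull()))  # trade rows, enriched with the latest quote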

    Additional Resources

    --

    Try Databricks for free. Get started today.

    The post Scaling Financial Time Series Analysis Beyond PCs and Pandas: On-Demand Webinar and FAQ Now Available! appeared first on Databricks.

    Spark + AI in Amsterdam: European Summit Recap, Keynote Videos, & Announcements


    Spark + AI Summit Europe 2019 came to Amsterdam this past week! Over 2,300 data scientists, data engineers, and global business leaders from 63 different countries descended upon the RAI Amsterdam Convention Centre, for the latest community and open source developments around Apache Spark™, Delta Lake, MLflow, Koalas, and more. Check out the keynote recordings to learn more about the latest announcements and community updates from this sold out event!

    Main stage at Spark + AI Summit Europe 2019 in Amsterdam.

    Open Source Updates: Delta Lake joins the Linux Foundation, Apache Spark™ 3.0 plans, MLflow Model Registry, and more

    At Spark Summit Europe 2019, we learned about some exciting new developments with several open source Apache Spark™ projects that the community has been eagerly awaiting.

First up, we heard from Ali Ghodsi, CEO and Co-founder of Databricks, on the tough problems that data scientists, data engineers and business analysts face head-on every day. In his keynote address, entitled Unified Data Analytics: Helping Data Teams Solve the World's Toughest Problems, Ali clearly lays out an expansive vision for the future of big data and AI that is not to be missed.

    Ali Ghodsi addressing the audience at Spark + AI Summit Europe 2019 in Amsterdam.

    Delta Lake joins the Linux Foundation

    In his keynote address, New Developments in the Open Source Ecosystem, Principal Software Engineer at Databricks Michael Armbrust shared plans for the continued growth of open source Delta Lake, highlighting the increasingly rapid adoption of this promising technology. Michael was pleased to report that over 3,700 organizations are already using Delta, and more than 2 billion gigabytes (2 exabytes, you read that right) of data are processed with it each month.

    Databricks' Michael Armbrust speaking onstage at Spark + AI Summit Europe 2019 in Amsterdam.

    To top it off, in a surprise announcement, Michael told the crowd in Amsterdam that Delta Lake is joining the Linux Foundation, to help continue to drive adoption of Delta Lake and growth of the open source community!

    Punctuating Michael’s point, Senior Software Engineer Burak Yavuz walked the audience through a Delta Lake demo, expertly showcasing Delta’s capability and power.

    Burak Yavuz demos Delta Lake onstage at Spark Summit Europe 2019 in Amsterdam.

    Apache Spark™ 3.0 upcoming enhancements

    In addition to the exciting news about Delta Lake, Michael also shared several new developments about the upcoming release of Apache Spark™ 3.0, including significant performance improvements that are coming to the Spark SQL Optimizer. These improvements, which include partition pruning and the clever use of broadcast joins for certain merge operations, can provide up to 17x performance improvements for some queries. Finally, Michael introduced a new federated data catalog to Spark.

    MLflow Model Registry update

    Not to be outdone, Chief Technologist and Co-founder of Databricks Matei Zaharia discussed the importance of the new Model Registry to the MLflow ecosystem. In his keynote address to the crowd at RAI Amsterdam Convention Centre entitled Simplifying Model Management with MLflow, the original creator of Apache Spark™ explained how the MLflow Model Registry allows data teams to organize and productionize different versions of machine learning models, by offering a collaborative repository where named ML models can be saved and versioned.

    Databricks CTO and Co-founder Matei Zaharia presents to the crowd in Amsterdam at Spark + AI Summit Europe 2019.

    The Model Registry also makes it possible for data engineers and data scientists to implement flexible CI/CD pipelines, which Databricks Software Engineer Corey Zumar was kind enough to demo for the crowd. Learn more about the MLflow Model Registry here.

    Corey Zumar speaks to a packed house at AI Summit Europe 2019 in Amsterdam.

    Koalas community growth and adoption

    We also heard from Databricks’ own Principal Consultant Brooke Wenig on the continuing success of the Koalas open source project, which aims to bring the power of Apache Spark™ to pandas, the popular Python data analysis library. Open source community members have downloaded Koalas over 10,000 times per day, and the project has experienced over 100% month-over-month growth since its inception. We were also treated to a live demonstration, showing how easy it is to transition from single node data science on pandas to multi node data science on Spark using Koalas. Learn more about this exciting open source project here.

    Brooke Wenig speaks behind a podium onstage at Spark Summit Europe 2019 in Amsterdam.

    Keynotes: Katie Bouman on creating the first black hole image, Gaël Varoquaux on the “secret weapon” of scikit-learn’s success, and much more

This year's Spark + AI Summit Europe featured a keynote speech from none other than Katie Bouman, Assistant Professor of Computing and Mathematical Sciences at Caltech. In her keynote speech, Imaging the Unseen: Taking the First Picture of a Black Hole, Katie shared the process that she and her team used to produce the first-ever image of a black hole from Event Horizon Telescope data.

    Katie Bouman presents to the crowd onstage at Databricks' Spark Summit 2019 in Amsterdam.

    We were also lucky enough to hear from Gaël Varoquaux, Creator of scikit-learn and Faculty Researcher at Inria. In Gaël’s keynote, Democratizing Machine Learning: Perspective From a scikit-learn Creator, he explained the simple principles, including one that he calls a “secret weapon,” that have been key to scikit-learn’s runaway success.

    Gaël Varoquaux speaking onstage in front of microphones at the Amsterdam Spark Summit 2019.

    Oriol Vinyals, Principal Scientist at Google DeepMind and former member of the Google Brain team, shared a fascinating story with us in his talk Project AlphaStar: mastering the real-time strategy game StarCraft II with AI.

    Oriol Vinyals speaks behind the podium onstage, which reads "Spark + AI Summit 2019 Europe."

    Customer Keynotes

    The lively audience in Amsterdam was also treated to talks from other legends and luminaries, including:

    A view of the convention center theater from the balcony above, with attendees from Spark + AI Summit Europe in nearly every seat.

    Women in Unified Analytics: Panel discussions and networking at Spark + AI in Amsterdam

    This year’s Spark Summit also gathered Women in Unified Analytics and allies, providing an opportunity for women in big data, data science, machine learning and AI to connect, learn, and network. Female leaders from Microsoft, Lloyds Bank, Wehkamp, Centrum Wiskunde & Informatica, ARM, Depop, and more, met together for tech talks and panel discussion. They covered topics including technology trends, diversity & inclusion, ethical AI, and career development for women.

    Photo of four women from the Women in Unified Analytics group at the Spark + AI Summit.

    Spark Summit Europe Community Sessions: Spark tuning workshops, technical tutorials, big data case studies, and more

    In addition to the Keynote presentations, at this year’s Spark + AI Summit Europe, attendees were treated to over 140 different community sessions and instructor-led trainings. These community sessions featured speakers from companies like: KTH, Socialbakers, Airbnb, Eventbrite, Getyourguide, H&M, CERN, La Poste, Klario, Facebook, Societe Generale, Canal+, Nielsen, and more. These technical sessions covered all sorts of use cases and best practices, with hands-on tutorials on topics including deep learning, structured streaming, Apache Spark™ tuning, Delta Lake, MLflow, and more.

    An attendee from the audience speaks into a microphone with other attendees all around him at a Spark + AI Summit training session.

    What’s Next for Spark + AI Summit

    Spark + AI Summit Europe 2019 keynote videos are now available! To see the newest product announcements and thought leadership, follow @Databricks on Twitter or subscribe to our newsletter. You can also learn Apache Spark, Delta Lake, and MLflow today on our free Databricks Community Edition, or build a production data application by trying Databricks today for free.

    Three attendees pose in front of large orange physical letters that spell out "Build. Unify. Scale." behind them at Spark + AI Summit 2019.

    As always, thanks for your support, and we look forward to seeing you again stateside, at the upcoming Spark + AI Summit in San Francisco on June 23-25, 2020!

    Related Links

    Announcement: Introducing the MLflow Model Registry

    Announcement: Delta Lake Now Hosted by the Linux Foundation to Become the Open Standard For Data Lakes

    Video: Spark + AI Summit Europe Keynote Videos

    --

    Try Databricks for free. Get started today.

    The post Spark + AI in Amsterdam: European Summit Recap, Keynote Videos, & Announcements appeared first on Databricks.

    Simplify Data Lake Access with Azure AD Credential Passthrough


Azure Databricks brings together the best of Apache Spark, Delta Lake, and the Azure cloud. The close partnership provides integrations with Azure services, including Azure's cloud-based role-based access control, Azure Active Directory (AAD), and Azure's cloud storage, Azure Data Lake Storage (ADLS).

    Even with these close integrations, data access control continues to prove a challenge for our users. Customers want to control which users have access to which data and audit who is accessing what. They want a simple solution that integrates with their existing controls. Azure AD Credential Passthrough is our solution to these requests.

    Azure Data Lake Storage Gen2

    Azure Data Lake Storage (ADLS) Gen2, which became generally available earlier this year, is quickly becoming the standard for data storage in Azure for analytics consumption. ADLS Gen2 enables a hierarchical file system that extends Azure Blob Storage capabilities and provides enhanced manageability, security and performance.

The hierarchical file system provides granular access control to ADLS Gen2. Role-based access control (RBAC) can be used to grant role assignments to top-level resources, and POSIX-compliant access control lists (ACLs) allow for finer-grained permissions at the folder and file level. These features allow users to securely access their data within Azure Databricks using the Azure Blob File System (ABFS) driver, which is built into the Databricks Runtime.

    Challenges with Accessing ADLS from Databricks

Even with the ABFS driver natively in the Databricks Runtime, customers still found it challenging to access ADLS from an Azure Databricks cluster in a secure way. The primary way to access ADLS from Databricks is using an Azure AD Service Principal and OAuth 2.0, either directly or by mounting to DBFS. While this remains the ideal way to connect for ETL jobs, it has some limitations for interactive use cases:

    1. Accessing ADLS from an Azure Databricks cluster requires a service principal to be made with delegated permissions for each user. The credentials should then be stored in Secrets. This creates complexity for Azure AD and Azure Databricks admins.
    2. Mounting a filesystem to DBFS allows all users in the Azure Databricks workspace to have access to the mounted ADLS account. This requires customers to set up multiple Azure Databricks workspaces for different roles and access controls in line with their storage account access, thereby increasing complexity.
3. When accessing ADLS, either directly or with mount points, users on a Databricks cluster share the same identity when accessing resources. This means there is no audit trail of which user accessed which data with cloud-native logging such as Storage Analytics.

    To solve this, we looked into how we could expand our seamless Single Sign-on with Azure AD integration to reach ADLS.

    Getting Started with Azure AD Credential Passthrough

    Azure AD Credential Passthrough allows you to authenticate seamlessly to Azure Data Lake Storage (both Gen1 and Gen2) from Azure Databricks clusters using the same Azure AD identity that you use to log into Azure Databricks. Your data access is controlled via the ADLS roles and ACLs you have already set up and can be analyzed in Azure’s Storage Analytics.

    When you enable your cluster for Azure AD Credential Passthrough, commands that you run on that cluster will be able to read and write your data in ADLS without requiring you to configure service principal credentials for access to storage. In order to use Credential Passthrough, just enable the new “Azure Data Lake Storage Credential Passthrough” cluster configuration.

To use Credential Passthrough, just enable the new Azure Data Lake Storage Credential Passthrough cluster configuration.

Passthrough is available on both High Concurrency and Standard clusters. Currently, Python and SQL are supported on High Concurrency clusters, which isolate commands run by different users to ensure that credentials cannot be leaked across different sessions. This allows multiple users to share one Passthrough cluster and access ADLS using their own identities.

    On Standard clusters, Python, SQL, Scala and R are all supported and users are isolated by restricting the cluster to a single user.
    Enable single user passthrough access to your Azure Data Lake through the Azure Databricks interface
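Once passthrough is enabled on the cluster, reading from ADLS Gen2 in a notebook is just a regular Spark read against an abfss:// path, with no keys or service principal configuration in the code. The storage account, container, and path below are placeholders:

# Runs as the notebook user's own Azure AD identity on a passthrough-enabled cluster.
df = (spark.read
        .format("parquet")
        .load("abfss://analytics@mystorageaccount.dfs.core.windows.net/clickstream/"))

df.groupBy("event_date").count().show(5)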

    Powerful, Built-in Access Control

Azure AD Passthrough allows for powerful data access controls by supporting both RBAC and ACLs for ADLS Gen2. Users can be granted access to the whole storage account through RBAC, or to a single filesystem, folder, or file using ACLs. Passthrough ensures a user can only access the data that they have previously been granted access to via Azure AD in ADLS Gen2.

Since Passthrough identifies the individual user, auditing is available by simply enabling ADLS logging via Storage Analytics. All ADLS access will be tied directly to the user via the OAuth user ID in the Storage Analytics logs.

    Conclusion

Azure AD Credential Passthrough provides end-to-end security from Azure Databricks to Azure Data Lake Storage. This feature provides seamless access control over your data with no additional setup. You can safely let your analysts, data scientists, and data engineers use the powerful features of the Databricks Unified Analytics Platform while keeping your data secure!

    Related Resources

    Real-time distributed monitoring and logging in the Azure Cloud

    How can you observe the unobservable? At Databricks we rely heavily on detailed metrics from our internal services to maintain high availability and reliability. However,…

    Azure Databricks – Bring Your Own VNET

    Azure Databricks Unified Analytics Platform is the result of a joint product/engineering effort between Databricks and Microsoft. It’s available as a managed first-party service on…

    Spark + AI Summit 2019 Product Announcements and Recap. Watch the keynote recordings today!

    Spark + AI Summit 2019, the world’s largest data and machine learning conference for the Apache Spark™ Community, brought nearly 5000 registered data scientists, engineers,…

     

    --

    Try Databricks for free. Get started today.

    The post Simplify Data Lake Access with Azure AD Credential Passthrough appeared first on Databricks.

    Scaling Hyperopt to Tune Machine Learning Models in Python


    Try the Hyperopt notebook to reproduce the steps outlined below and watch our on-demand webinar to learn more.

    Hyperopt is one of the most popular open-source libraries for tuning Machine Learning models in Python.  We’re excited to announce that Hyperopt 0.2.1 supports distributed tuning via Apache Spark.  The new SparkTrials class allows you to scale out hyperparameter tuning across a Spark cluster, leading to faster tuning and better models. SparkTrials was contributed by Joseph Bradley, Hanyu Cui, Lu Wang, Weichen Xu, and Liang Zhang (Databricks), in collaboration with Max Pumperla (Konduit).

    What is Hyperopt?

    Hyperopt is an open-source hyperparameter tuning library written for Python.  With 445,000+ PyPI downloads each month and 3800+ stars on Github as of October 2019, it has strong adoption and community support.  For Data Scientists, Hyperopt provides a general API for searching over hyperparameters and model types. Hyperopt offers two tuning algorithms: Random Search and the Bayesian method Tree of Parzen Estimators.

    For developers, Hyperopt provides pluggable APIs for its algorithms and compute backends.  We took advantage of this pluggability to write a new compute backend powered by Apache Spark.

    Scaling out Hyperopt with Spark

    With the new class SparkTrials, you can tell Hyperopt to distribute a tuning job across a Spark cluster.  Initially developed within Databricks, this API for hyperparameter tuning has enabled many Databricks customers to distribute computationally complex tuning jobs, and it has now been contributed to the open-source Hyperopt project, available in the latest release.

    Hyperparameter tuning and model selection often involve training hundreds or thousands of models.  SparkTrials runs batches of these training tasks in parallel, one on each Spark executor, allowing massive scale-out for tuning.  To use SparkTrials with Hyperopt, simply pass the SparkTrials object to Hyperopt’s fmin() function:

from hyperopt import fmin, tpe, SparkTrials

best_hyperparameters = fmin(
    fn=training_function,
    space=search_space,
    algo=tpe.suggest,
    max_evals=64,
    trials=SparkTrials())

    For a full example with code, check out the Hyperopt documentation on SparkTrials.

Under the hood, fmin() will generate new hyperparameter settings to test and pass them to SparkTrials. The diagram below shows how SparkTrials runs these tasks asynchronously on a cluster: (A) Hyperopt's primary logic runs on the Spark driver, computing new hyperparameter settings. (B) When a worker is ready for a new task, Hyperopt kicks off a single-task Spark job for that hyperparameter setting. (C) Within that task, which runs on one Spark executor, user code will be executed to train and evaluate a new ML model. (D) When done, the Spark task will return the results, including the loss, to the driver. These new results are used by Hyperopt to compute better hyperparameter settings for future tasks.

    How new hyperparameter settings are tested and passed to SparkTrials using Hyperopt

    Since SparkTrials fits and evaluates each model on one Spark worker, it is limited to tuning single-machine ML models and workflows, such as scikit-learn or single-machine TensorFlow.  For distributed ML algorithms such as Apache Spark MLlib or Horovod, you can use Hyperopt’s default Trials class.

    Using SparkTrials in practice

    SparkTrials takes 2 key parameters: parallelism (Maximum number of parallel trials to run, defaulting to the number of Spark executors) and timeout (Maximum time in seconds which fmin is allowed to take, defaulting to None).  Timeout provides a budgeting mechanism, allowing a cap on how long tuning can take.

    The parallelism parameter can be set in conjunction with the max_evals parameter for fmin() using the guideline described in the following diagram.  Hyperopt will test max_evals total settings for your hyperparameters, in batches of size parallelism. If parallelism = max_evals, then Hyperopt will do Random Search: it will select all hyperparameter settings to test independently and then evaluate them in parallel.  If parallelism = 1, then Hyperopt can make full use of adaptive algorithms like Tree of Parzen Estimators which iteratively explore the hyperparameter space: each new hyperparameter setting tested will be chosen based on previous results. Setting parallelism in between 1 and max_evals allows you to trade off scalability (getting results faster) and adaptiveness (sometimes getting better models).  Good choices tend to be in the middle, such as sqrt(max_evals).

    Setting parallelism for SparkTrials.
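For example, to test 128 settings with roughly sqrt(128) trials in flight at a time and a one-hour cap (the values here are purely illustrative, and training_function and search_space are the same objects used in the snippet above):

import math
from hyperopt import fmin, tpe, SparkTrials

spark_trials = SparkTrials(parallelism=int(math.sqrt(128)), timeout=3600)

best = fmin(fn=training_function, space=search_space,
            algo=tpe.suggest, max_evals=128, trials=spark_trials)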

    To illustrate the benefits of tuning, we ran Hyperopt with SparkTrials on the MNIST dataset using the PyTorch workflow from our recent webinar.  Our workflow trained a basic Deep Learning model to predict handwritten digits, and we tuned 3 parameters: batch size, learning rate, and momentum.  This was run on a Databricks cluster on AWS with p2.xlarge workers and Databricks Runtime 5.5 ML.

    In the plot below, we fixed max_evals to 128 and varied the number of workers.  As expected, more workers (greater parallelism) allow faster runtimes, with linear scale-out.

    Plotting the effect of greater parallelism on the running time of Hyperopt

    We then fixed the timeout at 4 minutes and varied the number of workers, repeating this experiment for several trials.  The plot below shows the loss (negative log likelihood, where “180m” = “0.180”) vs. the number of workers; the blue points are individual trials, and the red line is a LOESS curve showing the trend.  In general, model performance improves as we use greater parallelism since that allows us to test more hyperparameter settings. Notice that behavior varies across trials since Hyperopt uses randomization in its search.

    Demonstrating improved model performance with the use of greater parallelism

    Getting started with Hyperopt 0.2.1

    SparkTrials is available now within Hyperopt 0.2.1 (available on the PyPi project page) and in the Databricks Runtime for Machine Learning (5.4 and later).

    To learn more about Hyperopt and see examples and demos, check out:

    • Example notebooks in the Databricks Documentation for AWS and Azure

Hyperopt can also be combined with MLflow for tracking experiments and models. Learn more about this integration in the open-source MLflow example and in our Hyperparameter Tuning blog post and webinar.
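A hedged sketch of what that integration typically looks like, logging each trial's parameters and loss from inside the objective function (train_and_evaluate is an assumed user-defined helper that fits a model and returns a validation loss):

import mlflow
from hyperopt import STATUS_OK

def training_function(params):
    with mlflow.start_run(nested=True):     # one MLflow run per Hyperopt trial
        mlflow.log_params(params)
        loss = train_and_evaluate(params)   # assumed helper: fit a model, return validation loss
        mlflow.log_metric("loss", loss)
    return {"loss": loss, "status": STATUS_OK}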

    You can get involved via the Github project page:

    Related Resources

    --

    Try Databricks for free. Get started today.

    The post Scaling Hyperopt to Tune Machine Learning Models in Python appeared first on Databricks.

    Why we are investing 100 million euros in our European Development Center


    A few days ago, we announced an investment of 100 million euros in our European Development Center in Amsterdam. I want to take a moment to describe why this is a pivotal moment for Databricks and why Amsterdam is a cornerstone of our growth strategy.

    Solving the Hardest Data Problems at Scale

    Our Unified Data Analytics Platform helps customers solve some of the hardest problems on the planet, from genomics research to credit card fraud detection. The Netherlands provides us with access to a large pool of talent that is uniquely suited to our needs. The Netherlands is home to world-class universities such as the Vrije Universiteit Amsterdam, Delft University of Technology, and many others. We have built close partnerships with local universities and research centers, helping translate cutting-edge research into product. For example, Databricks partners with Centrum Wiskunde & Informatica (CWI), one of the world leaders in distributed systems and database research.

Our employees and partners benefit from the excellent infrastructure that powers the competitive Dutch economy. For example, most of our employees in the Netherlands skip the car commute and take a quick train or bike ride to work because of the superb public transport and bike-friendly infrastructure.

    It’s no secret that relocating to the Netherlands and getting settled is very easy. The Dutch provide a “fast track” for knowledge workers, with streamlined entry and onboarding procedures. IN Amsterdam, for example, provides a one-stop shop for registration, immigration and much more. Many employees who relocate to the Netherlands are eligible for the 30% Ruling, which provides a significant tax incentive for up to five years.

    Databricks European Development Center

    Lastly, we are thrilled by the accomplishments of the European Development Center. At Databricks, we have built the European Development Center as a fully operational site from the start, with local leadership in key functions such as engineering, product management, HR and customer success. As a result, the EDC has shipped key features in almost every aspect of our Unified Data Analytics Platform.

    The future is bright for the Databricks European Development Center. If you are interested in learning about opportunities in beautiful Amsterdam, message me directly or visit our Careers Site.

    Related Resources

    --

    Try Databricks for free. Get started today.

    The post Why we are investing 100 million euros in our European Development Center appeared first on Databricks.

    Scalable near real-time S3 access logging analytics with Apache Spark™ and Delta Lake


    The original blog is from Viacheslav Inozemtsev, Senior Data Engineer at Zalando, reproduced with permission.

    Introduction

    Many organizations use AWS S3 as their main storage infrastructure for their data. Moreover, by using Apache Spark™ on Databricks they often perform transformations of that data and save the refined results back to S3 for further analysis. When the size of data and the amount of processing reach a certain scale, it often becomes necessary to observe the data access patterns. Common questions that arise include (but are not limited to): Which datasets are used the most? What is the ratio between accessing new and past data? How quickly a dataset can be moved to a cheaper storage class without affecting the performance of the users? Etc.

In Zalando, we have faced this issue since data and computation became a commodity for us in the last few years. Almost all of our ~200 engineering teams regularly perform analytics, reporting, or machine learning, meaning they all read data from the central data lake. The main motivation to enable observability over this data was to reduce the cost of storage and processing by deleting unused data and by shrinking resource usage of the pipelines that produce that data. An additional driver was to understand whether our engineering teams need to query historical data or whether they are only interested in the recent state of the data.

    To answer these types of questions S3 provides a useful feature – S3 Server Access Logging. When enabled, it constantly dumps logs about every read and write access in the observed bucket. The problem that appears almost immediately, and especially at a higher scale, is that these logs are in the form of comparatively small text files, with a format similar to the logs of Apache Web Server.

    To query these logs we have leveraged capabilities of Apache Spark™ Structured Streaming on Databricks and built a streaming pipeline that constructs Delta Lake tables. These tables – for each observed bucket – contain well-structured data of the S3 Access Logs, they are partitioned, can be sorted if needed, and, as a result, enable extended and efficient analysis of the access patterns of the company’s data. This allows us to answer the previously mentioned questions and many more. In this blog post we are going to describe the production architecture we designed in Zalando, and to show in detail how you can deploy such a pipeline yourself.

    Solution

    Before we start, let us make two qualifications.

    The first note is about why we chose Delta Lake, and not plain Parquet or any other format. As you will see, to solve the problem described we are going to create a continuous application using Spark Structured Streaming. The properties of the Delta Lake, in this case, will give us the following benefits:

    • ACID Transactions: No corrupted/inconsistent reads by the consumers of the table in case write operation is still in progress or has failed leaving partial results on S3. More information is also available in Diving into Delta Lake: Unpacking the Transaction Log.
    • Schema Enforcement: The metadata is controlled by the table; there is no chance that we break the schema if there is a bug in the code of the Spark job or if the format of the logs has changed. More information is available in Diving Into Delta Lake: Schema Enforcement & Evolution.
    • Schema Evolution: On the other hand, if there is a change in the log format – we can purposely extend the schema by adding new fields. More information is available in Diving Into Delta Lake: Schema Enforcement & Evolution.
    • Open Format: All the benefits of the plain Parquet format for readers apply, e.g. predicate push-down, column projection, etc.
    • Unified Batch and Streaming Source and Sink: Opportunity to chain downstream Spark Structured Streaming jobs to produce aggregations based on the new content

    The second note is about the datasets that are being read by the clients of our data lake. For the most part, the mentioned datasets consist of 2 categories: 1) snapshots of the data warehouse tables from the BI databases, and 2) continuously appended streams of events from the central event bus of the company. This means that there are 2 types of patterns of how data gets written in the first place – full snapshot once per day and continuously appended stream, respectively.

In both cases we have a hypothesis that the data generated in the last day is consumed most often. For the snapshots we also know of infrequent comparisons between the current snapshot and past versions, for example one from a year ago. We are also aware of use cases where a whole month or even a year of historical data for a certain stream of events has to be processed. This gives us an idea of what to look for, and this is where the described pipeline should help us prove or disprove our hypotheses.

    Let us now dive into the technical details of the implementation of this pipeline. The only entity we have at the current stage is the S3 bucket. Our goal is to analyze what patterns appear in the read and write access to this bucket.

    To give you an idea of what we are going to show, on the diagram below you can see the final architecture, that represents the final state of the pipeline. The flow it depicts is the following:

    1. AWS constantly monitors the S3 bucket data-bucket
    2. It writes raw text logs to the target S3 bucket raw-logs-bucket
    3. For every created object an Event Notification is sent to the SQS queue new-log-objects-queue
    4. Once every hour a Spark job gets started by Databricks
    5. Spark job reads all the new messages from the queue
    6. Spark job reads all the objects (described in the messages from the queue) from raw-logs-bucket
    7. Spark job writes the new data in append mode to the Delta Lake table in the delta-logs-bucket S3 bucket (optionally also executes OPTIMIZE and VACUUM, or runs in the Auto-Optimize mode)
    8. This Delta Lake table can be queried for the analysis of the access patterns

     

    Administrative Setup

    First we will perform the administrative setup of configuring our S3 Server Access Logging and creating an SQS Queue.

    Configure S3 Server Access Logging

    First of all you need to configure S3 Server Access Logging for the data-bucket. To store the raw logs you first need to create an additional bucket – let’s call it raw-logs-bucket. Then you can configure logging via UI or using API. Let’s assume that we specify target prefix as data-bucket-logs/, so that we can use this bucket for S3 access logs of multiple data buckets.

    After this is done – raw logs will start appearing in the raw-logs-bucket as soon as someone is doing requests to the data-bucket. The number and the size of the objects with logs will depend on the intensity of requests. We experienced three different patterns for three different buckets as noted in the table below.

    You can see that the velocity of data can be rather different, which means you have to account for this when processing these disparate sources of data.

    Create an SQS queue

Now that logs are being created, you can start thinking about how to read them with Spark to produce the desired Delta Lake table. Because S3 logs are written in append-only mode – only new objects get created, and no object ever gets modified or deleted – this is a perfect case to leverage the S3-SQS Spark reader created by Databricks. To use it, you first need to create an SQS queue. We recommend setting Message Retention Period to 7 days, and Default Visibility Timeout to 5 minutes. From our experience, these are good defaults that also match the defaults of the Spark S3-SQS reader. Let's refer to the queue by the name new-log-objects-queue.

Now you need to configure the policy of the queue to allow sending messages to the queue from the raw-logs-bucket. To achieve this you can edit it directly in the Permissions tab of the queue in the UI, or do it via API. This is how the statement should look:

    {
        "Effect": "Allow",
        "Principal": "*",
        "Action": "SQS:SendMessage",
        "Resource": "arn:aws:sqs:{REGION}:{MAIN_ACCOUNT_ID}:new-log-objects-queue",
        "Condition": {
            "ArnEquals": {
                "aws:SourceArn": "arn:aws:s3:::raw-logs-bucket"
            }
        }
    }
    

    Configure S3 event notification

Now, you are ready to connect raw-logs-bucket and new-log-objects-queue, so that for each new object a message is sent to the queue. To achieve this you can configure the S3 Event Notification in the UI or via API. Here is how the JSON version of this configuration looks:

    {
    
        "QueueConfigurations": [
            {
                "Id": "raw-logs",
                "QueueArn": "arn:aws:sqs:{REGION}:{MAIN_ACCOUNT_ID}:new-log-objects-queue",
                "Events": ["s3:ObjectCreated:*"]
            }
        ]
    }
    

    Operational Setup

    In this section, we will perform the necessary cluster configurations including creating IAM roles and prepare the cluster configuration.

    Create IAM roles

To be able to run the Spark job, you need to create two IAM roles – one for the job (cluster role), and one to access S3 (assumed role). The reason you additionally need to assume a separate S3 role is that the cluster and its cluster role are located in the dedicated AWS account for Databricks EC2 instances and roles, whereas the raw-logs-bucket is located in the AWS account where the original source bucket resides. And because every log object is written by an Amazon-owned role, the cluster role doesn't have permission to read any of the logs, in accordance with the ACLs of the log objects. You can read more about it in Secure Access to S3 Buckets Across Accounts Using IAM Roles with an AssumeRole Policy.

The cluster role, referred to here as cluster-role, should be created in the AWS account dedicated to Databricks, and should have these 2 policies:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "sqs:ReceiveMessage",
                    "sqs:DeleteMessage",
                    "sqs:GetQueueAttributes"
                ],
                "Resource": ["arn:aws:sqs:{REGION}:{DATABRICKS_ACCOUNT_ID}:new-log-objects-queue"],
                "Effect": "Allow"
            }
        ]
    }
    

    and

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": "arn:aws:iam::{DATABRICKS_ACCOUNT_ID}:role/s3-access-role-to-assume"
            }
        ]
    }
    

    You will also need to add the instance profile of this role as usual to the Databricks platform.

The role to access S3, referred to here as s3-access-role-to-assume, should be created in the same account where both buckets reside. It should refer to the cluster-role by its ARN in the assumed_by parameter, and should have these 2 policies:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:GetObject",
                    "s3:GetObjectMetadata"
                ],
                "Resource": [
                    "arn:aws:s3:::raw-logs-bucket",
                    "arn:aws:s3:::raw-logs-bucket/*"
                ],
                "Effect": "Allow"
            }
        ]
    }
    

    and

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:GetObject",
                    "s3:GetObjectMetadata",
                    "s3:PutObject",
                    "s3:PutObjectAcl",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::delta-logs-bucket",
                    "arn:aws:s3:::delta-logs-bucket/data-bucket-logs/*"
            ],
                "Effect": "Allow"
            }
        ]
    }
    

    where delta-logs-bucket is another bucket you need to create, where the resulting Delta Lake tables will be located.

    Prepare cluster configuration

    Here we outline the spark_conf settings that are necessary in the cluster configuration so that the job can run correctly:

    spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem
    spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.acl.default BucketOwnerFullControl
    spark.hadoop.fs.s3a.canned.acl BucketOwnerFullControl
    spark.hadoop.fs.s3a.credentialsType AssumeRole
    spark.hadoop.fs.s3a.stsAssumeRole.arn arn:aws:iam::{MAIN_ACCOUNT_ID}:role/s3-access-role-to-assume
    

    If you go for more than one bucket, we also recommend these settings to enable FAIR scheduler, external shuffling, and RocksDB for keeping state:

    spark.sql.streaming.stateStore.providerClass com.databricks.sql.streaming.state.RocksDBStateStoreProvider
    spark.dynamicAllocation.enabled true
    spark.shuffle.service.enabled true
    spark.scheduler.mode FAIR
    

    Generate Delta Lake table with a Continuous Application

In the previous sections you completed the necessary administrative and operational setup. Now that this is done, you can write the code that will finally produce the desired Delta Lake table, and run it as a Continuous Application.

    The notebook

    The code is written in Scala. First we define a record case class:

    Then we create a few helper functions for parsing:


    And finally we define the Spark job:
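As a rough, illustrative sketch only (in PySpark rather than the Scala of the notebook, with a simplified schema, an assumed parse_access_logs helper, and S3-SQS source option names that should be checked against the Databricks documentation), the shape of the job is roughly:

raw = (spark.readStream
         .format("s3-sqs")                  # Databricks S3-SQS source
         .option("queueUrl", "https://sqs.{REGION}.amazonaws.com/{MAIN_ACCOUNT_ID}/new-log-objects-queue")
         .option("fileFormat", "text")      # raw S3 access logs are plain text lines
         .option("region", "{REGION}")
         .schema("value STRING")
         .load())

parsed = parse_access_logs(raw)             # assumed helper: regex-parse each line into
                                            # date, bucket, key, operation, remote_ip, ...

(parsed.writeStream
   .format("delta")
   .partitionBy("date", "bucket")
   .option("checkpointLocation", "s3://delta-logs-bucket/_checkpoints/data-bucket-logs")
   .trigger(once=True)                      # the hourly job processes what is new, then stops
   .start("s3://delta-logs-bucket/data-bucket-logs/"))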

    Create Databricks job

The last step is to make the whole pipeline run. For this you need to create a Databricks job. You can use the "New Automated Cluster" type, add the spark_conf we defined above, and schedule it to run, for example, once every hour using the "Schedule" section. That is it – as soon as you confirm creation of the job and the scheduler starts running it, you should be able to see that messages from the SQS queue are getting consumed, and that the job is writing to the output Delta Lake table.

    Execute Notebook Queries

At this point the data is available, and you can create a notebook and execute queries to answer the questions we posed at the beginning of this blog post.

    Create interactive cluster and a notebook to run analytics

As soon as the Delta Lake table has data, you can start querying it. For this you can create a permanent cluster with a role that only needs to be able to read the delta-logs-bucket. This means it doesn't need to use the AssumeRole technique, but only needs ListBucket and GetObject permissions. After that you can attach a notebook to this cluster and execute your first analysis.

    Queries to analyze access patterns

    Let’s get back to one of the questions that we asked in the beginning – which datasets are used the most? If we assume that in the source bucket every dataset is located under prefix data/{DATASET_NAME}/, then to answer it, we could come up with a query like this one:

    SELECT dataset, count(*) AS cnt
    FROM (
        SELECT regexp_extract(key, '^data\/([^/]+)\/.+', 1) AS dataset
        FROM delta.`s3://delta-logs-bucket/data-bucket-logs/`
        WHERE date = 'YYYY-MM-DD' AND bucket = 'data-bucket' AND key rlike '^data\/' AND operation = 'REST.GET.OBJECT'
    )
    GROUP BY dataset
    ORDER BY cnt DESC;
    

    The outcome of the query can look like this:

This query tells us how many individual GetObject requests were made to each dataset during one day, ordered from the most accessed down to the least intensively accessed. By itself this might not be enough to say whether one dataset is accessed more often than another, so we can normalize each aggregate by the number of objects in each dataset. We can also group by dataset and day, so that we see the correlation over time. There are many further options, but the point is that, with this Delta Lake table at hand, we can answer almost any kind of question about the access patterns in the bucket.
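The same analysis can also be expressed in PySpark against the same table; for example, grouping by dataset and day (illustrative, reusing the paths and columns from the query above):

from pyspark.sql import functions as F

logs = spark.read.format("delta").load("s3://delta-logs-bucket/data-bucket-logs/")

daily = (logs
    .where((F.col("bucket") == "data-bucket") &
           (F.col("operation") == "REST.GET.OBJECT") &
           F.col("key").rlike("^data/"))
    .withColumn("dataset", F.regexp_extract("key", "^data/([^/]+)/.+", 1))
    .groupBy("dataset", "date")
    .count()
    .orderBy("dataset", "date"))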

    Extensibility

    The pipeline we have shown is extensible out of the box. You can fully reuse the same SQS queue and add more buckets with logging into the pipeline, by simply using the same raw-logs-bucket to store S3 Server Access Logs. Because the Spark job already partitions by date and bucket, it will keep working fine, and your Delta Lake table will contain log data from the new buckets.

One piece of advice we can give is to use the AWS CDK to handle the infrastructure, i.e. to configure the buckets raw-logs-bucket and delta-logs-bucket, the SQS queue, and the role s3-access-role-to-assume. This will simplify operations and make the infrastructure part of your code as well.

    Conclusion

In this blog post we have described how S3 Server Access Logs can be transformed into a Delta Lake table in a continuous fashion, so that analysis of the access patterns to the data can be performed. We showed that Spark Structured Streaming together with the S3-SQS reader can be used to read the raw logging data, and we described what kind of IAM policies and spark_conf parameters you will need to make this pipeline work. Overall, this solution is easy to deploy and operate, and it can deliver real benefit by providing observability over access to your data.

    --

    Try Databricks for free. Get started today.

    The post Scalable near real-time S3 access logging analytics with Apache Spark™ and Delta Lake appeared first on Databricks.

    Solving the Challenge of Big Data Cloud Migration with WANdisco, Databricks and Delta Lake


Migrating from on-premises Hadoop to the cloud has been a common theme in recent Databricks blog posts and conference sessions. These have identified key considerations, highlighted partnerships, described solutions for moving and streaming data to the cloud with governance and other controls, and compared the runtime environments offered by Hadoop and Databricks to highlight the benefits of the Databricks Unified Data Analytics Platform.

    Challenges for Hadoop users when moving to the cloud

    WANdisco has partnered with Databricks to solve many of the challenges for large-scale Hadoop migrations. A particular challenge for organizations that have adopted Hadoop at scale is the traditional problem of data gravity. Because their applications assume ready, local, fast access to an on-premises data lake built on HDFS, building applications away from that data becomes difficult, because it requires building additional workflows to manually copy or access data from the on-premises Hadoop data lake.

    This problem is exacerbated by an order of magnitude if those on-premises data sets continue to change, because the workflows to move data between environments add a layer of complexity, and don’t handle changing data easily.

While the cloud brings efficiencies for data lakes, there remain concerns about the reliability and consistency of the data. Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions.

    Hadoop migration with Databricks and WANdisco

    The Databricks and WANdisco partnership solves these challenges, by providing full read and write access to changing data lakes at scale during migration between on-premises systems and Databricks in the cloud. This solution is called LiveAnalytics, and it takes advantage of WANdisco’s platform to migrate and replicate the largest Hadoop datasets to Databricks and Delta Lake. WANdisco makes it possible to migrate data at scale, even while those data sets continue to be modified, using a novel distributed coordination engine to maintain data consistency between Hive data and Delta Lake tables.

    LiveAnalytics migrates and replicates the largest Hadoop datasets to Databricks and Delta Lake

    WANdisco’s architecture and consensus-based approach is the key to this capability. It allows migration without disruption, application downtime or data loss, and opens up the benefits of applying Databricks to the largest of data lakes that were previously difficult to bring to the cloud.

Because WANdisco LiveAnalytics provides direct support for Delta Lake and Databricks along with common Hadoop platforms, it provides a compelling solution for bringing your on-premises Hadoop data to Databricks without impacting your ability to continue using Hadoop while the migration is in progress.

    WANdisco’s architecture allows migration from Hadoop to the cloud without disruption, application downtime or data loss

    You can take advantage of WANdisco’s technology today to help bring your Hadoop data lake to Databricks, with native support of common Hadoop platforms on-premises and for Databricks and Delta Lake on Azure or AWS.

    Related Resources

    --

    Try Databricks for free. Get started today.

    The post Solving the Challenge of Big Data Cloud Migration with WANdisco, Databricks and Delta Lake appeared first on Databricks.


    New Microsoft Azure Data Warehouse Service and Azure Databricks Combine Analytics, BI, and Data Science


    In the last two years since it first became available, thousands of companies have adopted Azure Databricks, making it one of the fastest growing data and AI services on Microsoft Azure. Customers now process over 2 exabytes per month with millions of server-hours spinning up every day. All of this is driven by organizations like Electrolux, Shell, and renewables.AI that are using Azure Databricks to process data at massive scale for data science and analytics.

Within this amazing adoption is a specific solution architecture to highlight, called the Modern Data Warehouse (MDW). Earlier this year we wrote about the performance and scale benefits of this solution, and part of the pattern's success has been our close integration with Azure SQL Data Warehouse via a high-performance connector that was jointly engineered to make it fast and easy to move data between the two services.

    Three ways Azure Databricks works with Azure Synapse Analytics

    Today, Microsoft announced the next evolution of their data warehouse service: Azure Synapse Analytics. This is exciting news and we continue to work closely with Microsoft to integrate with Azure Synapse and bring analytics, business intelligence (BI), and data science together in one solution architecture. Here are three key ways Azure Databricks works with Azure Synapse:

      1. The high-performance connector between Azure Databricks and Azure Synapse will enable fast data transfer between the services, including support for streaming data. This means customers can continue to use Azure Databricks (up to 50x faster than open source Apache Spark) for extract, transform, and load (ETL) workloads to prep and shape data at scale for Azure Synapse.
      2. Azure Data Factory (ADF) supports Azure Databricks in the Mapping Data Flows feature. This offers code-free visual ETL for data preparation and transformation at scale, and now that ADF is part of the Azure Synapse workspace it provides another avenue to access these capabilities.
      3. Azure Synapse and Azure Databricks can run analytics on the same data in Azure Data Lake Storage. This opens even greater opportunities to combine analytics, BI, and data science solutions with a shared data lake across services.
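
    To make the first integration point above concrete, here is a minimal sketch of pushing a prepared DataFrame from Azure Databricks into Azure Synapse with the connector. The JDBC URL, staging path, source table, and destination table names are placeholders and assumptions for illustration, not values from this announcement.

    // Minimal sketch: write a DataFrame prepared in Azure Databricks to Azure Synapse.
    // The JDBC URL, staging path, and table names below are placeholders.
    val preparedDF = spark.table("curated_sales")    // any DataFrame prepared in Databricks

    preparedDF.write
      .format("com.databricks.spark.sqldw")          // Azure Synapse (SQL DW) connector
      .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
      .option("tempDir", "abfss://<container>@<account>.dfs.core.windows.net/tmp/synapse-staging")
      .option("forwardSparkAzureStorageCredentials", "true")
      .option("dbTable", "dbo.sales_curated")
      .mode("append")
      .save()

    Reads work the same way with spark.read.format("com.databricks.spark.sqldw") plus a dbTable or query option, so the same connector covers both directions of the Modern Data Warehouse pattern.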

    We would love to hear your feedback as you begin using Azure Databricks and Azure Synapse in the next evolution of the Modern Data Warehouse solution architecture.

    --

    Try Databricks for free. Get started today.

    The post New Microsoft Azure Data Warehouse Service and Azure Databricks Combine Analytics, BI, and Data Science appeared first on Databricks.

    Celebrating Growth at Databricks and 1,000 Employees!

    Celebrating 1,000 employees -- Databricks marks the milestone of hiring its 1,000th full-time employee.

    This November, Databricks hired our 1,000th full-time employee! Founded in Berkeley in 2013, our six co-founders created Databricks to help data teams solve the world’s toughest problems – and since then, we’ve grown tremendously! Not only have we hit major milestones, such as our Microsoft partnership (resulting in Azure Databricks) and the creation of new open source projects like Delta Lake and MLflow, but we have also expanded our employee count and global presence. We now have offices across the world, including London, Amsterdam, Singapore, New York and our headquarters in SF! We are so excited for what’s to come and owe a big thank you to our employees, partners, and customers who have been on this journey with us.

    How did we reach 1,000 employees?

    At Databricks, we recognize that hiring a new member for our team is a great example of how our employees embody one of our core values “Teamwork makes the dream work!” Providing an excellent candidate experience requires the coordination and collaboration of many teams. With the help of our awesome teammates, Databricks has been able to scale quickly and effectively.

    Take a look at all the teams that go into helping a new hire have a great experience!

    The path a candidate takes to become a new Brickster

    How has our company changed from our first year in 2013?

    We asked our employees to talk about the biggest changes and growth that they’ve noticed at Databricks from the year that they first started to now, both in the role they play here and the company itself. Learn from their perspective below!

    2013: Meet Michael Armbrust, Principal Software Engineer

    Michael was one of the first engineers hired at Databricks and is a frequent speaker at Spark+AI Summit. He is a committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and Delta Lake.

    Before Databricks, I was a post-doc at Google doing research on building composable optimizers as part of the F1 team. When I joined Databricks, I was really excited to get to put many of the ideas we came up with during that project into production. This led to the Catalyst optimizer that powers Spark SQL today. Even though I studied databases in grad school, I had never built a “real” one before, and being able to create something like that from scratch at my first job was an amazing opportunity. It was also really gratifying to see hundreds of global contributors from the Apache® Spark community help grow the engine after we released it as open source. Working with such a vibrant community was a big change from working on it on my own for a year.

    More recently I got the opportunity to open-source another really cool piece of technology, called “Delta”. Delta started as a proprietary product that was inspired by a conversation I had with a potential customer at Spark Summit. He was at a large Fortune 100 company, and he wanted to ingest petabytes of data per week into a massive data lake that could be queried in real time by analysts around the world. I knew that a workload like that would overload Spark’s metadata management layer. However, this challenge sparked the idea of creating a scalable transaction log. It was cool to see how many different factors contributed to the start of Delta Lake: the Spark Community, Spark Summit, our sales team, leadership (including our CEO Ali) and the awesome members of the “streamteam” at Databricks. We worked really closely with the customer and in around six months we had gone from an idea to actually running in production! Before long, we decided to share this technology with the world, and earlier this year the Delta Lake open source project was born. While this project is still young, I’m really excited about the momentum so far, and I can’t wait to see where it goes!

    2014: Meet Tim Hunter, Technical Lead, Jobs Team

    Tim started off at Databricks as a Software Engineer on the Machine Learning Team. He did his Ph.D. in the AMPLab, the UC Berkeley lab that created Apache Spark. As part of his research, he wrote Machine Learning algorithms using Spark 0.0.2. Wanting to learn more about the impact of our product, he transferred over to become a Solutions Architect in London, and recently moved to Amsterdam to lead our Jobs Team.

    When I started at Databricks, there were 15 employees crammed into a tiny office right down the hill from UC Berkeley. Back then, a lot of our work was groundbreaking and done in semi-stealth mode. It was finally unveiled during the first Spark Summit, in what we used to call the “Mother of all demos”. My early work was to make Spark much easier to use with a service called the “chauffeur”, which connects data science to clusters – so people could point and click and not have to worry about the details behind the scenes. After the first two years, I worked on the ML team, with a focus on deep learning, deployment and AI efforts inside Databricks. This is where I looked at how we could add and scale more intricate functions of machine learning like DBUs, image processing, geospatial processing, and graph processing.

    One thing that I enjoy about engineering at Databricks is that you’re not just writing code, you’re also thinking globally about how to present your work to users and getting feedback on it. If you also enjoy speaking, like I do, there are a lot of opportunities to give presentations about your work, especially at Meetups or Spark Summit. Since I wanted to understand how the ML system I was building was being used, I asked to do a rotation as a Solutions Architect in our London office. The excitement and the small team reminded me of the original small startup from a few years ago – except that it had the full backing of the U.S. team and a proven, very successful product to sell. There, I helped large European companies put together some ML/AI solutions on top of Databricks in a wide variety of industries such as car manufacturing, drug processing, and chemical processing. After this rotation, I moved to our Amsterdam office to be the Tech Lead of our newly formed Jobs team. It is amazing to see that the service maintained by the jobs team, which sprouted out as a quick hackathon project a few years ago, has developed into an industrial-strength system that is pretty much used by every Databricks user! The flexibility that Databricks has given me to explore different offices and roles within the company has really helped me grow as an engineer, and allowed me to see the different areas of growth that we have gone through in our product.

    2015: Meet Jen Aman, Senior Event Manager

    Jen started off as a Marketing Manager and helped Databricks launch our third Spark Summit when it was still only being held in the United States. She is now a Senior Event Manager, and plans and develops the strategy for large scale global events for Databricks, including our annual Spark + AI Summit (Americas and Europe) and company retreat.

    I started two weeks before our third Spark Summit, which at the time was held only in San Francisco (around 2,000 attendees) and on the East Coast (around 1,400 attendees). We decided that year that it would be a good time to launch Spark Summit Europe for the first time in Amsterdam. This year, we held our 5th Spark Summit in Europe – which sold out! We eventually decided to only have the Americas Summit in San Francisco and moved from hosting it in hotel ballrooms at the Hilton to Moscone Center (one of the largest convention centers in SF), where we now have around 5,000 attendees. My first year, my core responsibility was to figure out the booth duty schedule for everybody. Having started only two weeks before, I had no idea who anyone was and had to look at their pictures to figure out names. Eventually, I took on more responsibility for Summit: owning the agenda process, the call for papers (where the community submits talks), the marketing and execution of these talks, managing speaker attendance, the keynote process, swag, catering, space planning and managing the creative.

    The content of our Spark + AI Summit conference has also expanded beyond just additional keynotes and tracks running at the same time. We now have vertical events (health and life sciences and FinTech), as well as networking events, meetups, tutorials, lightning talks and an advisory bar for questions on Apache Spark™ and Databricks. It’s also been a huge change managing our internal attendance – we only had 54 Databricks employees during the 3rd Spark Summit and in 2019’s Summit, we had 800+ employees! Our external audience has also expanded from mainly Spark enthusiasts to include Databricks customers and partners, data professional networks, and our Women in Unified Analytics program. It has been really rewarding to see the impact and growth of our events through the evolution of Spark Summit, along with the other internal events I help launch!

    2016: Meet Shelby Ferson, Geo Enterprise Account Executive

    Shelby started off at Databricks as a Mid Market Rep, when the sales team had fewer than 20 people. She has since been promoted to a Commercial Account Executive, and now is an Enterprise Account Executive, helping evangelize Databricks and communicating its value to customers and system integrators, while also helping build out regions around the world.

    I had the exciting opportunity to join Databricks when our team was fairly small; our sales team had fewer than 20 people globally at that time! Throughout my time here, it has been rewarding to work with customers that are continuously innovating and seeing how our product has been able to support them and their initiatives over these past three years. It’s also refreshing that with our size, I can still walk down to the engineering floor and have technical conversations and learn more about our product, no matter how busy they are. Everyone is willing to help to get customers on board and work as a team to create the best experience for our customers. It’s a great testament that when sales, product, engineering and customer success work really closely, amazing things happen!

    I’ve also been lucky to be closely supported by our sales leadership, who encouraged me to challenge myself and helped accelerate my growth within Databricks. Our executives invest in building a positive sales team culture and have supported initiatives that I’ve worked on (along with the team) to help launch, such as our Women of Databricks events and external events like Databricks’ co-sponsored talk with Tableau. This support also extends to when we build out new regions, and how we emphasize the importance of training and embedding our sales culture in our new teammates. Recently, I had the opportunity to help ramp up and support our sales team in our Australia office, and share all the knowledge I had around our product and company. It’s been a once-in-a-lifetime opportunity to be part of the first sales team here, and now to see us expanding our sales team to almost 300 people and growing in regions all over the world!

    2017: Meet Yvette Ramirez, Junior Recruiter

    Yvette joined Databricks in 2017 as a Recruiting Coordinator in the San Francisco office. After focusing on candidate interviewing and hiring experiences, Yvette supported the Field Engineering, Customer Success, and Professional Services teams as a sourcer for a year. Most recently, she transitioned into a role at the Amsterdam office as a Junior Recruiter to help build the Go-To-Market team in EMEA.

    When I started as a Recruiting Coordinator, there were a little fewer than 250 employees, and the recruiting team was just 9 people. The size, chaos, and growth of the startup world drive you to be scrappy and create organization from the ambiguity. This was the main reason why I moved from Florida to California: for the opportunity to grow with a company like Databricks that I believed was going to be really successful. After just one year, our team helped to more than double the size of the company, reaching 600+ employees. As the growth continued globally into EMEA and Sydney, the size and complexity of our team’s work followed. I saw this as an opportunity to expand my skill set, jumping into the Sourcing role with the help of many teammates and mentoring the RCs who came afterward.

    My most recent move brought me to Amsterdam as a Jr. Recruiter, helping build out our EMEA Go-To-Market teams. With a smaller European recruiting team, it’s really exciting to again get the chance to help build out our offices that are rapidly growing. An influx of people creates the opportunity for more diverse perspectives, which leads to better systems and processes and ultimately can transform the way we operate and hire. Diversity has always been a passion of mine, having helped our diversity committee at the early stages of its inception. Through events like Lunch & Learns and Women in Analytics events at Spark Summit, we’ve strived to make Databricks a more inclusive environment. Building on that work, we strive to think on a larger scale about how Databricks can become one of the most diverse and inclusive places out there. It has been amazing to grow alongside Databricks, and I’m so excited to see what else we accomplish globally!

    2018: Meet Kunal Taneja, Senior Manager, Field Engineering, APJ

    Kunal started off as our first employee in the southern hemisphere, joining as our Sr. Manager of Field Engineering. Within 6 months, he took on building out all of our field engineering teams in APJ. He is now responsible for leading, managing and recruiting a team of field engineers in APJ, who help organizations adopt and use Databricks for driving business value from AI, ML and Unified Analytics.

    I joined Databricks in Sydney around employee 370 and was the first hire for our Asia-Pacific Region (APJ). I was paired up with an account representative who started about a month after me, and we were tasked with helping grow our region and the investment in APJ. We then brought in an SVP General Manager for our region based out of Singapore, the hub for our central functions, with the focus of ramping up hiring and building out APJ. Within the first 6 months, we had hired 8 people, and my manager asked if I wanted to help build out our region’s Solutions Architect team to keep up with the growth of our account representatives. From there, I was asked to lead and grow the APJ Solutions Architect team.

    It’s crazy to think that when we first started, we had only 2 employees and no office for 7 months. Now, approaching the one-year mark, we have had to move offices twice just because of our growth in Australia, and we have team members everywhere in the APJ region! Because we’re a smaller office, I’ve really enjoyed that we get to work so closely together, and hang out outside of work at happy hours, lunches and boat cruises! From a region perspective, we’ve also been growing tremendously in Australia, Japan, and India, where the cloud market is booming. We now have 8 Solutions Architects on board, and next year we are expected to at least double. Databricks has offered such a unique opportunity to work with some of the best people in the industry and I’m so excited to see our presence in APJ continue to expand!

    2019: Meet Amy Reichandater, Chief People Officer

    Amy is our Chief People Officer, and joined Databricks with a strong background in creating highly scalable hiring and retention programs, and driving culture, organization development, and total rewards strategies to support the company’s accelerated global expansion. She has some exciting plans for our growth as a company!

    When I joined Databricks in May 2019, I was so excited by the people, market opportunity, growth, and the opportunity to help build an amazing company. As I think about the future of Databricks and what we want to accomplish, my main goal is to create an extraordinary and consistent employee experience. My vision is for all employees, globally, to see Databricks as the most important experience in their career, and as a place where they can bring their best selves to work. Regardless of who they are and where they come from, we want them to feel connected to Databricks, understand our mission and feel empowered to make the company better.

    This means we need to be thoughtful not only about keeping the talent bar high, but also about how we can create opportunities for our own teams to grow their careers internally as Databricks grows. In doing this, we want to ensure our hiring adds strategic value to the business, and that we evolve the culture and operations in a way that helps us achieve our potential as a company. Our team is really excited about creating extraordinary candidate and employee experiences as we scale, and I’m looking forward to seeing us continue to hire the best talent and build amazing teams around the world!

    We are so proud of the team that we have been able to grow with our company and we’re not stopping! Interested? Find your place at Databricks.

    --

    Try Databricks for free. Get started today.

    The post Celebrating Growth at Databricks and 1,000 Employees! appeared first on Databricks.

    Using AutoML Toolkit’s FamilyRunner Pipeline APIs to Simplify and Automate Loan Default Predictions

    Try this Loan Risk with AutoML Pipeline API Notebook in Databricks

    Introduction

    In the post Using AutoML Toolkit to Automate Loan Default Predictions, we showed how the Databricks Labs AutoML Toolkit simplified Machine Learning model feature engineering and model building optimization (MBO). It also improved the area under the curve (AUC) from 0.6732 (handmade XGBoost model) to 0.723 (AutoML XGBoost model). With AutoML Toolkit’s Release 0.6.1, we have upgraded to MLflow version 1.3.0 and introduced a new Pipeline API that simplifies feature generation and inference.

    In this post, we will discuss:

    • The FamilyRunner API, which lets you easily try different model families to determine the best model
    • Simplifying inference with the Pipeline API
    • Simplifying feature engineering with the Pipeline API

    It’s all in the Family…Runner

    As noted in the original post Loan Risk Analysis with XGBoost and Databricks Runtime for Machine Learning, we had tried three different model families: GLM, GBT, and XGBoost.  Without diving into the details, this comprised hundreds of lines of code for each model type.

    As noted in Using AutoML Toolkit to Automate Loan Default Predictions, we had reduced this to a few lines of code for each model type. With the AutoML Toolkit’s FamilyRunner API, we have simplified this further by allowing you to run multiple model types concurrently, distributed across the nodes of your Databricks cluster. Below are the three lines of code required to run two models (Logistic Regression and XGBoost).

    val xgBoostConfig = ConfigurationGenerator.generateConfigFromMap("XGBoost", "classifier", xgBoostOverrides)
    val logisticRegressionConfig = ConfigurationGenerator.generateConfigFromMap("LogisticRegression", "classifier", logisticRegOverrides)
    
    val runner = FamilyRunner(datasetTrain, Array(xgBoostConfig, logisticRegressionConfig)).executeWithPipeline()
    

    Within the output cell of this code snippet, you can observe the FamilyRunner API execute multiple tasks, each working to find the best hyperparameters for your selection of model types.

    === AutoML Pipeline Stage: class com.databricks.labs.automl.pipeline.MlFlowLoggingValidationStageTransformer log ==> 
    Stage Name: MlFlowLoggingValidationStageTransformer_18aeadd79de9 
    Total Stage Execution time: 194 ms 
    Stage Params: {
        automlInternalId: automl_internal_id,
        isDebugEnabled: true,
        mlFlowAPIToken: [REDACTED],
        mlFlowExperimentName: /Users/jas.bali@databricks.com/AutoML/Jas_AutoML_Demo/runXG_1,
        mlFlowLoggingFlag: true,
        mlFlowTrackingURI: https://demo.cloud.databricks.com,
        pipelineId: 290b3c8d-8dbc-4b1b-a9da-8807153ec602
    } 
     Input dataset count: 547821 
     Output dataset count: 547821 
    ...
    

    With AutoML Toolkit’s Release 0.6.1, we have upgraded to utilize the latest version of MLflow (1.3.0). The following clip shows the results of this AutoML FamilyRunner experiment logged within MLflow, allowing you to compare the results of the logistic regression model (AUC=0.716) and XGBoost (AUC=0.72).

    Simplifying Inference with the Pipeline API

    The Pipeline APIs on the FamilyRunner let you run inference using either an MLflow run ID or a PipelineModel object. These pipelines contain a sequence of stages that are built directly from AutoML’s main configuration. Running inference either way ensures that the prediction dataset goes through the identical set of feature engineering steps used for training. This makes for fully contained, portable and serializable pipelines that can be exported and served for standalone requirements, without the need to manually apply feature engineering tasks. The following code snippet shows how to run inference.

    Using MLflow Run ID

    When you are using MLflow with your AutoML run, you can run inference simply by using the MLflow run ID (and MLflow config), as shown in the code snippet below.

    val bestMlFlowRunId = runner.bestMlFlowRunId("XGBoost")
    val bestPipelineModel = PipelineModelInference.getPipelineModelByMlFlowRunId(bestMlFlowRunId, xgBoostConfig.loggingConfig)
    val inferredDf = bestPipelineModel.transform(datasetValid)
    

    As can be seen in the cell output, the AutoML Pipeline API executes all of the stages originally created against the training data, now applied to the validation dataset. Below is the abridged Pipeline API cell output showing the stages it executed.

    === AutoML Pipeline Stage: class com.databricks.labs.automl.pipeline.ZipRegisterTempTransformer log ==> 
    Stage Name: ZipRegisterTempTransformer_a88351e04577 
    Total Stage Execution time: 57 ms 
    ...
     
    === AutoML Pipeline Stage: class com.databricks.labs.automl.pipeline.MlFlowLoggingValidationStageTransformer log ==> 
    Stage Name: MlFlowLoggingValidationStageTransformer_18aeadd79de9 
    Total Stage Execution time: 233 ms 
    ...
    
    === AutoML Pipeline Stage: class com.databricks.labs.automl.pipeline.CardinalityLimitColumnPrunerTransformer log ==> 
    Stage Name: CardinalityLimitColumnPrunerTransformer_e8aede7e3f4d 
    Total Stage Execution time: 1 ms 
    ...
     
    === AutoML Pipeline Stage: class com.databricks.labs.automl.pipeline.DateFieldTransformer log ==> 
    Stage Name: DateFieldTransformer_5ec5e2680828 
    Total Stage Execution time: 7 ms 
    ...
    
     
    === AutoML Pipeline Stage: class com.databricks.labs.automl.pipeline.DropColumnsTransformer log ==> 
    Stage Name: DropColumnsTransformer_1859c7895f19 
    Total Stage Execution time: 4 ms 
    ...
     
    === AutoML Pipeline Stage: class com.databricks.labs.automl.pipeline.ColumnNameTransformer log ==> 
    Stage Name: ColumnNameTransformer_d727a713897e 
    Total Stage Execution time: 3 ms 
    ...
    
    === AutoML Pipeline Stage: class com.databricks.labs.automl.pipeline.DropColumnsTransformer log ==> 
    Stage Name: DropColumnsTransformer_a3160a31ec07 
    Total Stage Execution time: 3 ms 
    ...
    
    === AutoML Pipeline Stage: class com.databricks.labs.automl.pipeline.DataSanitizerTransformer log ==> 
    Stage Name: DataSanitizerTransformer_a9866eaba0de 
    Total Stage Execution time: 1.79 seconds 
    ...
    
    === AutoML Pipeline Stage: class com.databricks.labs.automl.pipeline.VarianceFilterTransformer log ==> 
    Stage Name: VarianceFilterTransformer_63da1ccb67fe 
    Total Stage Execution time: 4 ms 
    ...
    
    === AutoML Pipeline Stage: class com.databricks.labs.automl.pipeline.DropColumnsTransformer log ==> 
    Stage Name: DropColumnsTransformer_d239d19c60e6 
    Total Stage Execution time: 12 ms 
    ...
    
    === AutoML Pipeline Stage: class com.databricks.labs.automl.pipeline.DropColumnsTransformer log ==> 
    Stage Name: DropColumnsTransformer_54010312beee 
    Total Stage Execution time: 5 ms 
    ...
    
    bestPipelineModel: org.apache.spark.ml.PipelineModel = final_linted_infer_pipeline_25618e0d3e91
    inferredDf: org.apache.spark.sql.DataFrame = [term: string, home_ownership: string ... 20 more fields]
    

    As noted in the previous code snippet (expand to review it), the inference DataFrame inferredDf generated by the Pipeline API contains the validation dataset, including the calculated predictions (as shown in the screenshot below).

    As can be seen, only the MLflow run ID was required to fetch the pipeline and run inference. This is because the Pipeline APIs internally log all artifacts to a run under an experiment in the MLflow project. The notebook Using AutoML Toolkit’s FamilyRunner Pipeline APIs to Simplify and Automate Loan Default Predictions further demonstrates all the tags added to the MLflow run.
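
    If you want to sanity-check those predictions yourself, a standard Spark ML evaluator can be run directly on inferredDf. The sketch below assumes the label column is named label and that the pipeline emitted a rawPrediction column; adjust these names to match your actual output schema.

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    // Minimal sketch: score the inferred validation DataFrame.
    // "label" and "rawPrediction" are assumed column names; adjust to your schema.
    val evaluator = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderROC")

    val validationAuc = evaluator.evaluate(inferredDf)
    println(f"Validation AUC: $validationAuc%1.4f")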

    Use PipelineModel to Manually Save and Load your AutoML Pipelines

    Even if MLflow is not enabled, the PipelineModel provides the flexibility to manually save these pipeline models under a custom path.

    //Save it
    val pipelinePath = "tmp/predict-pipeline-lg-1"
    runner.bestPipelineModel("LogisticRegression").write.overwrite().save(pipelinePath)
    
    // Load it
    val pipelineModel = PipelineModel.load(pipelinePath)
    val inferredDf = pipelineModel.transform(datasetValid)
    

    Simplifying Feature Engineering with the Pipeline API

    In addition to the full inference pipeline, FamilyRunner also exposes an API to run only the feature engineering steps, without executing feature selection or computing feature importances. It takes AutoML’s main configuration object and converts it into a pipeline. This can be useful for analyzing feature-engineered datasets without having to manually apply Pearson filters, covariance filters, outlier filters, cardinality limits, and more. It also enables the use of models that aren’t yet part of the AutoML Toolkit, while still leveraging AutoML’s advanced feature engineering stages.

    val featureEngPipelineModel = FamilyRunner(datasetTrain, Array(xgBoostConfig, logisticRegressionConfig)).generateFeatureEngineeredPipeline(verbose=true)("XGBoost")
    val featuredData = featureEngPipelineModel.transform(datasetTrain)
    display(featuredData)
    

    Discussion

    With the Family Runner API, you can run multiple model types concurrently to find the best model and its hyperparameters across multiple models.  With AutoML Toolkit’s Release 0.6.1, we have upgraded to MLflow 1.3.0 and introduced a new Pipeline API that significantly simplifies feature generation and inference. Try the AutoML Toolkit and the Using AutoML Toolkit’s FamilyRunner Pipeline APIs to Simplify Loan Risk Analysis notebook today!

     

    Contributions

    We’d like to thank Sean Owen, Ben Wilson, Brooke Wenig, and Mladen Kovacevic for their contributions to this blog.

    --

    Try Databricks for free. Get started today.

    The post Using AutoML Toolkit’s FamilyRunner Pipeline APIs to Simplify and Automate Loan Default Predictions appeared first on Databricks.

    Automate and Fast-track Data Lake and Cloud ETL with Databricks and StreamSets

    Data lake ingestion is a critical component of a modern data infrastructure. But enterprises often run into challenges when they have to use this data for analytics and machine learning workloads. Consolidating high volumes of data from disparate sources into a data lake is difficult, even more so if it comes from both batch and streaming sources. Big data is often unorganized and inconsistent, with discrepancies in formats and data types. This makes it difficult to update data in the data lake. With low query speeds and a lack of real-time access, the result is a development environment that can’t keep pace. Additionally, it leads to poor data quality and poor overall performance of the data lake, further delaying deployments to production.

    Bringing Speed and Agility with Smart ETL Ingest

    What can organizations do to make their data lakes more performant and useful? The challenges discussed above can slow down an organization’s cloud analytics and data science plans significantly – especially if they are short on data engineering and data science professionals. Data engineers waste their time on ad-hoc, proof-of-concept sandboxes while struggling to transition data into production. In turn, data scientists lack the confidence to use that data for analytics and machine learning applications.

    Databricks and StreamSets have partnered to accelerate value to cloud analytics by automating ingest and data transformation tasks. The joint solution brings rapid pipeline design and testing to cloud data processing. StreamSets Data Collector and Transformer provide a drag-and-drop interface to design, manage and test data pipelines for cloud data processing.

    Together, this partnership brings the power of Databricks and Delta Lake to a wider audience. Delta Lake makes it possible to unify batch and streaming data from disparate sources and analyze it at data warehouse speeds. It supports transactional insertions, deletions, upserts and queries. It provides ACID compliance, which means that any writes are always complete and failed jobs are fully backed out.
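
    As a minimal illustration of those Delta Lake semantics, independent of StreamSets, the sketch below appends a batch of records to a Delta table and queries it. The input path, table path, and view name are placeholders.

    // Minimal sketch of Delta Lake's transactional write/read path (paths are placeholders).
    val batchDF = spark.read.json("/mnt/raw/events/2019-11-01")   // assumed raw input location
    batchDF.write.format("delta").mode("append").save("/delta/events")

    // Readers always see a consistent snapshot, even while writes are in flight.
    spark.read.format("delta").load("/delta/events").createOrReplaceTempView("events")
    spark.sql("SELECT count(*) AS event_count FROM events").show()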

    The integration provides several key benefits:

    • Faster migration to cloud with less overhead on data engineering resources
    • Easily bring data from multiple disparate sources using a drag-and-drop interface
    • Better management of data quality and performance for cloud data lakes with Delta Lake
    • Change Data Capture (CDC) capability from several data sources into Delta Lake
    • Decreased risk of disruptions for Hadoop migrations, with quicker time-to-value for on-premises-to-cloud initiatives
    • Continuous monitoring of data pipelines to lower support cost and optimize ETL pipelines

    Databricks Architecture with StreamSets

    Using Visual Pipeline Development to Ingest Data into Delta Lake

    Data teams spend a large amount of time building ETL jobs in their current data architectures, and this often tends to be complex and code-intensive. For example, organizations may want to know real-time usage in production as well as run historical reports to analyze usage trends over time without being slowed down by complex ETL processing. Overcoming messy data issues, corrupt data and other challenges requires validation and reprocessing that can take hours if not days. The query performance of streaming data may slow things down further.

    The integration of Databricks and StreamSets solves this by allowing users to design, test and monitor batch and streaming ETL pipelines without the need for coding or specialized skills. The drag-and-drop interface with StreamSets makes it easy to ingest data from multiple sources into Delta Lake. With its execution engine, StreamSets Transformer, users can create data processing pipelines that execute on Apache Spark. Transformer generates native Spark applications that execute on a Databricks cluster.

    Below is an example of how simple it is to create a Delta Lake ingest pipeline with StreamSets where Kafka is the source and Delta Lake is the destination.

    There is a native Delta Lake destination in Transformer, which is very easy to configure. You simply specify the location of the Delta dataset, which could be a DBFS mount, and data from Kafka (or any other source supported by Transformer) flows into the destination Delta table.

    An example of how simple it is to create a Delta Lake ingest pipeline with StreamSets where Kafka is the source and Delta is the destination

    Transformer can also perform transformations on Delta tables. These are expressed visually but converted to Spark code at runtime and pushed down as Spark jobs to Databricks clusters, so joint customers can enjoy the scale, reliability and agility of a fully managed data engineering and AI platform with the click of a few buttons.

    Below is an example of a transformation pipeline where both the source and destination are Delta Lake tables, while the middle steps are transformations being done on the source table.

    An example of a transformation pipeline where both the source and destination are Delta Lake tables, while the middle steps are transformations on the source table

    Transformer communicates with Databricks via simple REST APIs. It orchestrates the uploading of code and running of jobs in Databricks via these secure APIs.

    A simple configuration dialogue in a Transformer pipeline allows a customer to connect Transformer to their Databricks environment. Note that Transformer supports both interactive and data engineering clusters in Databricks, giving customers the flexibility to choose the right cluster type for the right use case.

    A simple configuration dialogue in a Transformer pipeline allows a customer to connect Transformer to their Databricks environment

    Monitoring of Delta Lake pipelines is also a critical capability of the integration because it gives customers a visual window into the health and status of an ingestion job or transformation pipeline. For example, the screenshot below depicts the flow of records from a relational source and Kafka into a Delta table, which can be monitored for throughput or record count.

    The flow of records from a relational source & Kafka into a Delta Lake table

    Change Data Capture (CDC) with Delta Lake’s MERGE

    Data lakes such as Delta Lake bring together data from multiple origin data sources into a central location for holistic analytics. If the source data in an origin data source changes, it becomes imperative to reflect that change in Delta Lake so data remains fresh and accurate. Equally important is the need to manage that change reliably so end users do not end up doing analytics on partially ingested or dirty data.

    Change data capture (CDC) is one such technique to reconcile changes in a source system with a destination system. StreamSets has out-of-the-box CDC capability for popular relational data sources (such as MySQL, PostgreSQL and more), which makes it possible to capture changes in those databases. In many cases, StreamSets reads the binary log of the relational system to capture changes, which means the source database does not experience any performance or load impact from CDC pipelines.

    StreamSets has implemented Delta Lake’s MERGE functionality, which makes it possible to automatically reconcile changes from CDC sources into Delta tables with a simple visual pipeline, simplifying CDC pipelines from source systems into Delta Lake for customers.

    StreamSets uses Delta Lake’s MERGE functionality to reconcile changes from CDC sources into Delta tables with a simple visual pipeline.

    Because StreamSets uses Delta Lake to implement the CDC pipeline, customers get the benefit of the transactional semantics and performance of Delta Lake for the CDC ingest process, which guarantees that fresh, reliable data is available in the lake in a format that’s optimized for downstream analytics.
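
    Under the hood, that reconciliation step amounts to a Delta Lake MERGE. The sketch below shows roughly what such a CDC upsert looks like when written directly against the Delta Lake Scala API; the table path, key column, and the op column on the change feed are illustrative assumptions rather than StreamSets’ actual generated code.

    import io.delta.tables.DeltaTable

    // Minimal sketch of a CDC-style upsert into a Delta table.
    // The path, key column, and "op" column on the change feed are assumptions.
    val cdcUpdatesDF = spark.table("cdc_staging")            // change records landed by the ingest pipeline
    val customers    = DeltaTable.forPath(spark, "/delta/customers")

    customers.as("t")
      .merge(cdcUpdatesDF.as("s"), "t.customer_id = s.customer_id")
      .whenMatched("s.op = 'DELETE'").delete()
      .whenMatched().updateAll()
      .whenNotMatched("s.op != 'DELETE'").insertAll()
      .execute()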

    How to Get Started with Databricks and StreamSets New Ingest Solution

    We are excited about this integration and its potential for accelerating analytics and ML projects in the cloud. To learn more, register for the Manage Big Data Pipelines in the Cloud webinar. We will show a live demo of how easy it is to build high-volume data pipelines to move data into Delta Lake.

    Related Resources

    --

    Try Databricks for free. Get started today.

    The post Automate and Fast-track Data Lake and Cloud ETL with Databricks and StreamSets appeared first on Databricks.

    Use Databricks Pools to Speed up your Data Pipelines and Scale Clusters Quickly

    Data Engineering teams deploy short, automated jobs on Databricks. They expect their clusters to start quickly, execute the job, and terminate. Data Analytics teams run large auto-scaling, interactive clusters on Databricks. They expect these clusters to adapt to increased load and scale up quickly in order to minimize query latency. Databricks is pleased to announce Databricks Pools, a managed cache of virtual machine instances that enables clusters to start and scale 4 times faster.

    Cluster lifecycles before Databricks Pools

    Without Pools, Databricks acquires virtual machine (VM) instances from the cloud provider upon request. This is cost-effective but slow. There are no idle VM instances to pay for, but with each cluster create and auto-scaling event, Databricks must request VMs from the cloud and wait for them to initialize. The below diagram shows the typical lifecycle for Data Engineering job clusters and interactive Data Analytics clusters.

    Databricks clusters acquire VM instances directly from the cloud provider when not using Databricks Pools.

    This is not sufficient for Data Engineers running short jobs. The cluster start time can dominate the job’s total execution time. Nor is it sufficient for Data Analysts. Waiting for a cluster to scale up when running a large query slows down productivity.

    Comparing performance with Databricks Pools

    The graph below shows the median start times for Databricks clusters. Without Pools – seen in red – each cluster create request must acquire new VMs from the cloud, initialize daemon services on those VMs, and download Databricks Runtime (DBR) to them. These steps result in a median cluster creation time of 145 seconds. That’s two and a half minutes! With Pools – seen in blue – cluster creation skips these steps and takes less than 40 seconds. Cluster auto-scaling also skips these steps, providing a similar performance boost.

    Median cluster create times are 4x faster with Databricks Pools.

    Typical cluster creation times with (blue line) and without (red line) Databricks Pools. Pools are 4x faster.

    A new architecture with Databricks Pools

    Databricks introduces Pools, a managed cache of VM instances, to achieve this reduction in cluster start and auto-scaling times from minutes to seconds.

    When a cluster attached to a pool needs VM instances, rather than requesting new ones from the cloud provider, it checks the pool. If there are enough idle instances in the pool, the cluster acquires them and starts or scales quickly. If there are not enough idle instances, the pool expands by allocating new instances from the cloud provider to satisfy the cluster’s request. This will slow down the request, so it is important to maintain enough idle instances in the pool. When a pool cluster releases instances, they return to the pool and are free for other clusters to use. Only clusters attached to a pool can use that pool’s idle instances.

    The below diagram shows the typical lifecycle for Data Engineering job clusters and interactive Data Analytics clusters using Databricks Pools.

    Databricks clusters start and scale 4x faster when acquiring instances from a Databricks Pool.

    Cost control with Databricks Pools

    Keeping idle VM instances in a Databricks Pool is great for performance, but not free. Databricks does not charge DBUs for idle instances not in use by a Databricks cluster, but cloud provider infrastructure costs do apply.

    There are a few recommended ways to manage this cost. First, manually edit the size of your pool to meet your needs. If you’re only running interactive workloads during business hours, make sure the pool’s “Min Idle” instance count is set to zero after hours. Or if your automated data pipeline runs for a few hours at night, set the “Min Idle” count a few minutes before the pipeline starts and then revert it to zero afterwards. Alternatively, always keep a “Min Idle” of zero, but set the “Idle Instance Auto Termination” timeout to meet your needs. The first job run on the pool will start slowly, but subsequent jobs run within the timeout period will start quickly. When the jobs are done, all instances in the pool will terminate after the idle timeout period, avoiding cloud provider costs.

    Optionally, you can also budget VM resources by setting a maximum capacity for the pool. This limits the sum of all idle instances and instances used by clusters attached to the pool.
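
    If you prefer to manage pools programmatically rather than through the UI, the same knobs (“Min Idle”, maximum capacity, and the idle termination timeout) are exposed through the Instance Pools REST API. The sketch below, in Scala on JDK 11+, is illustrative only: the workspace URL and node type are placeholders, and the payload field names should be confirmed against the Instance Pools API documentation for your workspace.

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    // Minimal sketch: create a pool via the Instance Pools REST API.
    // Workspace URL and node_type_id are placeholders; verify field names against the API docs.
    val workspaceUrl = "https://<your-workspace>.cloud.databricks.com"
    val token        = sys.env("DATABRICKS_TOKEN")

    val payload =
      """{
        |  "instance_pool_name": "nightly-etl-pool",
        |  "node_type_id": "i3.xlarge",
        |  "min_idle_instances": 0,
        |  "max_capacity": 40,
        |  "idle_instance_autotermination_minutes": 20
        |}""".stripMargin

    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"$workspaceUrl/api/2.0/instance-pools/create"))
      .header("Authorization", s"Bearer $token")
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(payload))
      .build()

    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body())   // on success, the response includes the new pool's instance_pool_id

    Clusters created afterwards can reference the returned instance_pool_id in their own create request to draw instances from this pool.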

    Deploying a managed cache of VM instances via Databricks Pools

    Getting started with Databricks Pools is easy. Click the Clusters icon in the sidebar, select the Pools tab and click the “Create Pool” button.

    Getting started with Databricks Pools: Creating a pool

    After you’ve created the pool, you can see the number of instances that are in use by clusters, idle and ready for use, and pending (i.e. idle, but not yet ready).

    Getting started with Databricks Pools: A demo pool

    In order to use the idle instances in the pool, select the pool from the dropdown in the cluster create template. This works for both interactive clusters and automated job clusters. With a pool selected, the cluster will use the pool’s instance type for both the driver and worker nodes.

    Assuming there are enough idle instances warm in the pool – set via the “Min Idle” field during pool creation – the cluster will start in under 40 seconds. While the cluster is running, the pool will backfill more idle instances in order to maintain the minimum idle instance count. Once the cluster is done using the instances, they will return to the pool to be used by the next cluster. Idle instances above the minimum idle count are terminated after being idle for the “Idle Instance Auto Termination” timeout period (defaults to 60 minutes).

    Conclusion

    Databricks Pools increase the productivity of both Data Engineers and Data Analysts. With Pools, Databricks customers eliminate slow cluster start and auto-scaling times. Data Engineers can reduce the time it takes to run short jobs in their data pipeline, thereby providing better SLAs to their downstream teams. Data Analytics teams can scale out clusters faster to decrease query execution time, increasing the recency of downstream reporting. Pools allow teams to rapidly iterate and innovate, moving them one step closer to real-time analytics. All of this is possible while reducing Databricks licensing costs, making the feature a no-brainer to deploy.

    Get started with Databricks Pools

    To learn how to deploy the feature, please read the Databricks Pools documentation here. If you don’t already have Databricks, start a trial here and use the quick start guide here.

    Related Resources

    https://docs.databricks.com/user-guide/instance-pools/index.html

    https://databricks.com/glossary/what-is-databricks-runtime

    https://docs.databricks.com/clusters/index.html

    https://databricks.com/session/virtualizing-apache-spark

     

    --

    Try Databricks for free. Get started today.

    The post Use Databricks Pools to Speed up your Data Pipelines and Scale Clusters Quickly appeared first on Databricks.
