Announcing the 2021 Databricks Data Team Award Finalists

The annual Databricks Data Team Award recognizes data teams that are harnessing the power of data + AI to deliver solutions to some of the world’s toughest problems. From finding ways to help manage the COVID-19 pandemic to improving the sustainability of industrial machinery to enabling better healthcare outcomes, Databricks helps unify data engineers, scientists and analysts so they can innovate faster than ever before.

For the second year running, the Data Team Awards will be presented at the Data + AI Summit (formerly known as Spark + AI Summit), the world’s largest gathering of the data community.

Here are the finalists in each of the four categories:

Data Team for Good Award

This award celebrates organizations that have used data + AI to make a positive difference in the world, delivering solutions for global challenges, from healthcare to sustainability.

Artemis Health
Artemis Health is on a mission to help U.S. companies and their advisors make better-informed decisions about how to optimize health benefits programs for happier and healthier employees. The data team is helping to make this possible using the Databricks Lakehouse Platform to ingest and transform data that powers their Employee Benefits Analysis platform, enabling their customers to gain better insight into health care plans, usage and potential abuse. As an example, one customer was able to identify signs of opioid abuse in their employees and determine who might be “doctor shopping” or seeking prescriptions from multiple providers. They broke it down even further to see if certain office locations were more prone to opioid abuse than others. This allowed their customer to intervene and make sure employees were following safe and appropriate protocols when using their health benefits. Artemis Health is using data for good and trying to turn the world’s health data into great healthcare that everyone can afford.

Daimler
Daimler, a leading global supplier of premium and luxury vehicles, is on a mission to improve the sustainability of its production and supply chain operations, with a commitment to become carbon neutral by 2039. The success of this ambitious initiative is heavily reliant on harnessing the power of data and AI with eMission Platform, Daimler’s centralized data hub built upon the Databricks Lakehouse Platform. eMission Platform seeks to create transparency across the whole product life cycle, ensuring compliance with emission regulations while maximizing performance. Daimler’s data team is collecting massive, diverse sets of emissions data, standardizing it with Delta Lake, and then applying machine learning models to identify bottlenecks in fully-automated production lines. Analysts then use these insights to optimize operational efficiency and ultimately decrease the carbon footprint of every vehicle that is brought to market. With the help of advanced analytics, they achieve the ultimate goal: the right car, meeting all regulatory requirements, in the right market, on time.

Samsara
Samsara’s mission is to increase the efficiency, safety and sustainability of operations that power the global economy. They are the pioneers of the Connected Operations Cloud, which allows businesses that depend on physical operations to harness IoT data to develop actionable business insights and improve their operations. Samsara innovates quickly based on customer feedback to make a real impact on the world – using data to help prevent distracted driving and worksite accidents, monitor in-transit temperatures to combat food waste, and reduce the environmental footprint of their customers’ operations. To keep up with their mission and the many impactful projects that are making a difference in the world, Samsara leverages the Databricks Lakehouse Platform to store, transform and analyze data, allowing their engineers, data scientists, and analysts to deliver new, meaningful products and data insights to their customers.

U.S. Department of Veterans Affairs (VA)
The U.S. Department of Veterans Affairs (VA) is advancing its efforts to prevent veteran suicide through data and analytics. In 2020, Azure Databricks, a native cloud service on Microsoft Azure, achieved FedRAMP High authorization and became available to the VA. With the Databricks Lakehouse Platform, the VA was able to enhance algorithms like the Medication Possession Ratio (MPR), shortening MPR computation time by 88% and expanding the scope of data to improve the accuracy and efficiency of the algorithm. MPR, which evaluates a Veteran’s medication regimen, is one of the factors used to improve prediction of suicide risk and missed mental health appointments. With Databricks, the VA is able to quickly ingest and process over 5 million patient records (representing 130 VA Medical Centers), including 8 years of medical history data and over 100K medications. The U.S. Department of Veterans Affairs is the largest health system in the US, and it is using data and analytics to serve Veterans and save the lives of those who have risked so much.

Data Team Innovation Award

This award recognizes the Data Teams that are pushing the boundaries of what’s possible, leading their industries by continually innovating and building solutions to solve even the most complex challenges.

Atlassian
Atlassian is a leading provider of team collaboration and productivity software that helps teams organize, discuss and complete shared work. Innovation is at their core. As an early adopter of the Lakehouse architecture, Atlassian has made a paradigm shift, enabling data democratization at scale, both internally and externally. They have standardized on Databricks on AWS to serve as an open and unified Lakehouse platform, enabling the company to become more data driven. Across the organization, they’ve seen a massive shift, with over 3,000 internal users accessing the platform on a monthly basis – that’s more than half of the organization leveraging data to make decisions. The data and platform teams at Atlassian are among the most innovative in the industry, leading the way with new and creative ways to capitalize on data.

John Deere
John Deere is leveraging big data and AI to deliver ‘smart’ industrial solutions that are revolutionizing agriculture and construction, driving sustainability and ultimately helping to meet the world’s increasing need for food, fuel, shelter and infrastructure. Their Enterprise Data Lake (EDL) built upon the Databricks Lakehouse Platform is at the core of this innovation, with petabytes of data and trillions of records being ingested into the EDL, giving data teams fast, reliable access to standardized data sets to innovate and deliver ML and analytics solutions ranging from traditional IT use cases to customer applications. From IoT sensor-enabled equipment driving proactive alerts that prevent failures, to precision agriculture that maximizes field output, to optimizing operations in supply chain, finance and marketing, John Deere is providing advanced products, technology and services for customers who cultivate, harvest, transform, enrich, and build upon the land.

Virgin Hyperloop
Virgin Hyperloop is pushing the boundaries of rapid mass mobility systems with the goal of making high-speed travel more accessible than ever before, while reducing both greenhouse gas emissions and transit times. From the vehicle and track design, to determining the best travel routes, to optimizing the passenger experience, it’s all being done with data. The Virgin Hyperloop data team uses the Databricks Lakehouse Platform to ingest diverse data sets and run simulations at unprecedented speeds to help inform critical design decisions, while also proving the business viability and low environmental impact of the system to potential partners, investors and customers. Through the power of data + AI, Virgin Hyperloop is bringing next-generation transportation to the world.

Data Team OSS Award

This award recognizes and celebrates the Data Teams that are making the most out of leveraging, or contributing to, the open-source technologies that are defining the future of data + AI.

Adobe
Adobe has a strong commitment to open source technology and open standards. Not only do they leverage open-source technology for their own products, they extend the capability to their customers. Adobe’s market-leading Experience Platform leverages the Databricks Lakehouse Platform as part of its open architecture to help drive personalization across its various applications and services. Adobe’s data team is able to build solutions atop open technologies like Delta Lake and MLflow to deliver actionable insights and leverage machine learning for a full picture of their customers in real-time to deliver personalized experiences across every channel.

Healthgrades
Healthgrades’ Platform division works with the nation’s leading health systems to improve patient engagement and strengthen physician alignment, driving measurable financial outcomes. Hg Mercury, the company’s enterprise-wide intelligent engagement platform, is a comprehensive data and SaaS platform that delivers insights and communications solutions that integrate seamlessly with the health system’s broader MarTech stack and enterprise ecosystem. To make data and information more accessible to their customers, Healthgrades has adopted an open data architecture leveraging the Databricks Lakehouse Platform. Healthgrades leverages machine learning algorithms to enable health systems to be more proactive and predictive as they engage current and potential patients. Through this open architecture, using Delta, MLflow, Airflow, Python, Scala and other open frameworks, Healthgrades has seen a decrease in the overall cost of operation, improved customer satisfaction and an improved user experience. Bottom line: through open source technology, at Healthgrades, better health gets a head start.

Scribd
With over a million paid subscribers and 100 million monthly visitors, Scribd is on a mission to change the way the world reads. Through their monthly subscription, readers gain access to the best ebooks, audiobooks, magazines, podcasts, and much more. Scribd leverages data and analytics to uncover interesting ways to get people excited about reading. The data engineering team optimizes the Lakehouse architecture to ultimately deliver an engaging experience to their customers. Additionally, the team is contributing to multiple open source projects and they have a deep understanding of the Delta Lake OSS ecosystem.

Data Team Impact Award

This award recognizes the data teams that have done an outstanding job supporting their organization’s core mission, harnessing the power of data + AI to create a massive impact.

Asurion
Tech care company Asurion helps nearly 300 million customers worldwide with repair, replacement, protection and Expert tech help for phones, home tech and appliances. Asurion Experts keep customers’ tech working to keep life moving. Among the company’s many initiatives to enhance the customer experience, Asurion leverages insights gleaned from Experts’ interactions with customers. Asurion uses the Databricks Lakehouse Platform and advanced machine learning models to power and support real-time insights collected from support calls. The rapid availability of new data enables Asurion to enhance training for Experts and help provide customized support for customers.

GSK
GSK is a global, science-led health company that specializes in pharmaceuticals, vaccines and consumer healthcare products. Through their science-based, cutting-edge products, they are changing the health and wellness of consumers around the world. The breadth of GSK’s investment in data and analytics (D&A) technology and people has already delivered tremendous short- and long-term impact, for example cost savings in safety stock analysis and optimization. GSK’s D&A platform, powered by Databricks, has supported over 100 use cases, including pricing, trading, forecasting, KPI tracking and financial modeling, applying insights and smart decision making to ultimately support their purpose to help people do more, feel better and live longer.

H&M
The H&M group is a family of brands serving markets around the globe. Throughout its deep fashion heritage, dating back to 1947, H&M Group has been committed to making fashion available for everyone; now, they want to make sustainable fashion affordable and available for everyone. H&M Group has been making necessary investments in digitizing its supply chain, logistics, tech infrastructure and AI, with Databricks providing key data and analytics capabilities that enable the fast development and deployment of AI solutions to market. By harnessing the power of data and technology, H&M Group continues to lead the industry by making its operations more efficient and sustainable, while also remaining at the heart of fashion and lifestyle inspiration for their customers.

And the winner is…

The Databricks Data Team Award winners for 2021 will be announced at the Data + AI Summit at the end of May, and will be celebrated in an upcoming blog post, so check back to see which of the finalists took home the trophy.

In the meantime, check out how our customers across industries are innovating and unlocking the power of data and AI.

--

Try Databricks for free. Get started today.


Evolution to the Data Lakehouse

This is a guest authored article by the data team at Forest Rim Technology. We thank Bill Inmon, CEO, and Mary Levins, chief data strategy officer, of Forest Rim Technology for their contributions.

The Original Data Challenge

With the proliferation of applications came the problem of data integrity: the same data appeared in many places with different values. In order to make a decision, the user had to find WHICH version of the data was the right one to use among the many applications. If the user did not find and use the right version of the data, incorrect decisions might be made.

A different architectural approach is needed to find the right data to use for decision making.

People discovered that they needed a different architectural approach to find the right data to use for decision making. Thus, the data warehouse was born.

The data warehouse

The data warehouse brought disparate application data together in a single, separate physical location, and the designer had to build an entirely new infrastructure around the data warehouse.

The traditional analytical infrastructure surrounding the data warehouse

The analytical infrastructure surrounding the data warehouse contained such things as:

  • Metadata – a guide to what data was located where
  • A data model – an abstraction of the data found in the data warehouse
  • Data lineage – the tale of the origins and transformations of data in the warehouse
  • Summarization – a description of the algorithmic work designed to create the data
  • KPIs – where key performance indicators are found
  • ETL – enabled application data to be transformed into corporate data

The limitations of data warehouses became evident with the increasing variety of data (text, IoT, images, audio, video, etc.) in the enterprise. In addition, the rise of machine learning (ML) and AI introduced iterative algorithms that required direct data access and were not based on SQL.

All the data in the corporation

As important and useful as data warehouses are, they are, for the most part, centered around structured data. But now there are many other data types in the corporation. To see what data resides in a corporation, consider three broad categories:

Structured data is typically transaction-based data that is generated by an organization to conduct day-to-day business activities. Textual data is data that is generated by letters, email and conversations that take place inside the corporation. Other unstructured data is data that has other sources, such as IoT data, image, video and analog-based data.

The data lake

The data lake is an amalgamation of ALL of the different kinds of data found in the corporation. It has become the place where enterprises offload all their data, given its low-cost storage systems with a file API that hold data in generic and open file formats, such as Apache Parquet and ORC. The use of open formats also made data lake data directly accessible to a wide range of other analytics engines, such as machine learning systems.

The data lake is an amalgamation of ALL of the different kinds of data found in the corporation
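
As a minimal sketch of this open, file-based approach (the bucket path, column names and sample rows below are hypothetical), structured records can be landed on low-cost object storage in Apache Parquet and read back directly by any Parquet-aware engine:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("open-data-lake-example").getOrCreate()
import spark.implicits._

// Hypothetical landing path on low-cost object storage
val lakePath = "s3://example-data-lake/events/clickstream"

// Write records in an open format (Parquet), partitioned by date
Seq(
  ("2021-05-01", "user-1", "product-page"),
  ("2021-05-01", "user-2", "checkout")
).toDF("event_date", "user_id", "page")
  .write
  .mode("append")
  .partitionBy("event_date")
  .parquet(lakePath)

// Any engine that reads Parquet (Spark, Presto, ML libraries, ...) can access the files directly
spark.read.parquet(lakePath).show()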

When the data lake was first conceived, it was thought that all that was required was that data should be extracted and placed in the data lake. Once in the data lake, the end user could just dive in and find data and do analysis. However, corporations quickly discovered that using the data in the data lake was a completely different story than merely having the data placed in the lake.

Many of the promises of the data lakes have not been realized due to the lack of some critical features: no support for transactions, no enforcement of data quality or governance and poor performance optimizations. As a result, most of the data lakes in the enterprise have become data swamps.

Challenges with current data architecture

Due to the limitations of data lakes and warehouses, a common approach is to use multiple systems – a data lake, several data warehouses and other specialized systems, resulting in three common problems:

    1. Lack of openness: Data warehouses lock data into proprietary formats that increase the cost of migrating data or workloads to other systems. Given that data warehouses primarily provide SQL-only access, it is hard to run any other analytics engines, such as machine learning systems. Moreover, it is very expensive and slow to directly access data in the warehouse with SQL, making integrations with other technologies difficult.

    2. Limited support for machine learning: Despite much research on the confluence of ML and data management, none of the leading machine learning systems, such as TensorFlow, PyTorch and XGBoost, work well on top of warehouses. Unlike BI, which extracts a small amount of data, ML systems process large datasets using complex non-SQL code. For these use cases, warehouse vendors recommend exporting data to files, which further increases complexity and staleness.

    3. Forced trade-off between data lakes and data warehouses: More than 90% of enterprise data is stored in data lakes due to its flexibility from open direct access to files and low cost, as it uses cheap storage. To overcome the lack of performance and quality issues of the data lake, enterprises ETLed a small subset of data in the data lake to a downstream data warehouse for the most important decision support and BI applications. This dual system architecture requires continuous engineering to ETL data between the lake and warehouse. Each ETL step risks incurring failures or introducing bugs that reduce data quality, while keeping the data lake and warehouse consistent is difficult and costly. Apart from paying for continuous ETL, users pay double the storage cost for data copied to a warehouse.

Emergence of the data lakehouse

We are seeing the emergence of a new class of data architecture called data lakehouse, which is enabled by a new open and standardized system design: implementing similar data structures and data management features to those in a data warehouse, directly on the kind of low cost storage used for data lakes.

The data lakehouse architecture addresses the key challenges of current data architectures

The data lakehouse architecture addresses the key challenges of current data architectures discussed in the previous section by:

  • enabling open direct-access by using open formats, such as Apache Parquet
  • providing first-class support for data science and machine learning
  • offering best-in-class performance and reliability on low cost storage

Here are the various features that enable the key benefits of the lakehouse architecture:

Openness:

  • Open File Formats: Built on open and standardized file formats, such as Apache Parquet and ORC
  • Open API: Provides an open API that can efficiently access the data directly without the need for proprietary engines and vendor lock-in
  • Language Support: Supports not only SQL access, but also a variety of other tools and engines, including machine learning and Python/R libraries

Machine learning support:

  • Support for diverse data types: Store, refine, analyze and access data for many new applications, including images, video, audio, semi-structured data and text.
  • Efficient non-SQL direct reads: Direct efficient access of large volumes of data for running machine learning experiments using R and Python libraries.
  • Support for DataFrame API: Built-in declarative DataFrame API with query optimizations for data access in ML workloads since ML systems such as TensorFlow, PyTorch and XGBoost have adopted DataFrames as the main abstraction for manipulating data.
  • Data Versioning for ML experiments: Providing snapshots of data enabling data science and machine learning teams to access and revert to earlier versions of data for audits and rollbacks or to reproduce ML experiments.
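
To make the data versioning point above concrete, here is a minimal sketch using Delta Lake time travel to pin an ML training set to a specific table version; the table path and version number are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("delta-time-travel-example").getOrCreate()

// Hypothetical Delta table of training features
val featuresPath = "s3://example-lakehouse/features/customer_churn"

// Read the table exactly as it was at version 12 (or at a timestamp) to reproduce an experiment
val trainingSnapshot = spark.read
  .format("delta")
  .option("versionAsOf", 12)                 // alternatively: .option("timestampAsOf", "2021-05-01")
  .load(featuresPath)

// The same snapshot can be re-read later for audits or rollbacks, even after the live table changes
trainingSnapshot.show()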

Best-in-class performance and reliability at low cost:

  • Performance optimizations: Enable various optimization techniques, such as caching, multi-dimensional clustering and data skipping, by leveraging file statistics and data compaction to right-size the files.
  • Schema enforcement and governance: Support for DW schema architectures like star/snowflake schemas, with robust governance and auditing mechanisms.
  • Transaction support: Leverage ACID transactions to ensure consistency as multiple parties concurrently read or write data, typically using SQL.
  • Low cost storage: Lakehouse architecture is built using low cost object storage such as Amazon S3, Azure Blob Storage or Google Cloud Storage.
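
A minimal sketch of how a few of these properties surface on a Delta table (the table path, columns and sample rows are hypothetical): appends are ACID, writes with an unexpected schema are rejected, and small files can be compacted and clustered with a single command:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lakehouse-reliability-example").getOrCreate()
import spark.implicits._

// Hypothetical fact table stored as Delta on object storage
val salesPath = "s3://example-lakehouse/warehouse/fact_sales"

// ACID append: concurrent readers never observe a partially-written transaction
Seq((101L, "2021-05-20", 49.99)).toDF("order_id", "order_date", "amount")
  .write.format("delta").mode("append").save(salesPath)

// Schema enforcement: a write with an extra, unexpected column fails instead of corrupting the table
// Seq((102L, "2021-05-21", 19.99, "oops")).toDF("order_id", "order_date", "amount", "extra")
//   .write.format("delta").mode("append").save(salesPath)   // throws an AnalysisException

// Performance: compact small files and cluster by a commonly-filtered column (Databricks Delta)
spark.sql(s"OPTIMIZE delta.`$salesPath` ZORDER BY (order_date)")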

Comparing data warehouse and data lake with data lakehouse

Data format
  • Data warehouse: Closed, proprietary format
  • Data lake: Open format
  • Data lakehouse: Open format

Types of data
  • Data warehouse: Structured data, with limited support for semi-structured data
  • Data lake: All types: structured data, semi-structured data, textual data, unstructured (raw) data
  • Data lakehouse: All types: structured data, semi-structured data, textual data, unstructured (raw) data

Data access
  • Data warehouse: SQL-only, no direct access to files
  • Data lake: Open APIs for direct access to files with SQL, R, Python and other languages
  • Data lakehouse: Open APIs for direct access to files with SQL, R, Python and other languages

Reliability
  • Data warehouse: High quality, reliable data with ACID transactions
  • Data lake: Low quality, data swamp
  • Data lakehouse: High quality, reliable data with ACID transactions

Governance and security
  • Data warehouse: Fine-grained security and governance at the row/column level for tables
  • Data lake: Poor governance, as security needs to be applied to files
  • Data lakehouse: Fine-grained security and governance at the row/column level for tables

Performance
  • Data warehouse: High
  • Data lake: Low
  • Data lakehouse: High

Scalability
  • Data warehouse: Scaling becomes exponentially more expensive
  • Data lake: Scales to hold any amount of data at low cost, regardless of type
  • Data lakehouse: Scales to hold any amount of data at low cost, regardless of type

Use case support
  • Data warehouse: Limited to BI, SQL applications and decision support
  • Data lake: Limited to machine learning
  • Data lakehouse: One data architecture for BI, SQL and machine learning

Impact of the lakehouse

We believe that the data lakehouse architecture presents an opportunity comparable to the one we saw during early years of the data warehouse market. The unique ability of the lakehouse to manage data in an open environment, blend all varieties of data from all parts of the enterprise and combine the data science focus of the data lake with the end-user analytics of the data warehouse will unlock incredible value for organizations.


Forest Rim Technology was founded by Bill Inmon and is the world leader in converting textual unstructured data to a structured database for deeper insights and meaningful decisions.

--

Try Databricks for free. Get started today.

Building a Cybersecurity Lakehouse for CrowdStrike Falcon Events

Endpoint data is required by security teams for threat detection, threat hunting, incident investigations and to meet compliance requirements. The data volumes can be terabytes per day or petabytes per year. Most organizations struggle to collect, store and analyze endpoint logs because of the costs and complexities associated with such large data volumes. But it doesn’t have to be this way.

In this two-part blog series, we will cover how you can operationalize petabytes of endpoint data with Databricks to improve your security posture with advanced analytics, in a cost-effective way. Part 1 (this blog) covers the architecture of data collection and the integration with a SIEM (Splunk); by the end of it, with the notebooks provided, you will be ready to use the data for analysis. Part 2 will discuss specific use cases and how to create ML models, automated enrichments and analytics; by the end of Part 2, you will be able to implement the notebooks to detect and investigate threats using endpoint data.

We will use Crowdstrike’s Falcon logs as our example. To access Falcon logs, one can use the Falcon Data Replicator (FDR) to push raw event data from CrowdStrike’s platform to cloud storage such as Amazon S3. This data can be ingested, transformed, analyzed and stored using the Databricks Lakehouse Platform alongside the rest of their security telemetry. Customers can ingest CrowdStrike Falcon data, apply Python-based real-time detections, search through historical data with Databricks SQL, and query from SIEM tools like Splunk with Databricks Add-on for Splunk.

Challenge of operationalizing Crowdstrike data

Although the Crowdstrike Falcon data offers comprehensive event logging details, it is a daunting task to ingest, process and operationalize complex and large volumes of cybersecurity data on a near real-time basis in a cost-effective manner. These are a few of the well-known challenges:

  • Real-time data ingestion at scale: It is difficult to keep track of processed and unprocessed raw data files, which are written by FDR on cloud storage in near real time.
  • Complex transformations: The data format is semi-structured. Every line of each log file can contain any of hundreds of different payload types, and the structure of event data can change over time.
  • Data governance: This kind of data can be sensitive, and access must be gated to only users who need it.
  • Simplified security analytics end-to-end: Scalable tools are needed to do the data engineering, ML and analysis on these fast-moving and high-volume datasets.
  • Collaboration: Effective collaboration can leverage domain expertise from the data engineers, cybersecurity analysts and ML engineers. Thus, having a collaborative platform improves the efficiency of cybersecurity analysis and response workloads.

As a result, security engineers across enterprises find themselves in a difficult situation struggling to manage cost and operational efficiency. They either have to accept being locked into very expensive proprietary systems or spend tremendous efforts to build their own endpoint security tools while fighting for scalability and performance.

Databricks cybersecurity lakehouse

Databricks offers security teams and data scientists a new hope to perform their jobs efficiently and effectively, as well as a set of tools to combat the growing challenges of big data and sophisticated threats.

Lakehouse, an open architecture that combines the best elements of data lakes and data warehouses, simplifies building a multi-hop data engineering pipeline that progressively adds structure to the data. The benefit of a multi-hop architecture is that data engineers can build a pipeline that begins with raw data as a “single source of truth” from which everything flows. Crowdstrike’s semi-structured raw data can be stored for years, and subsequent transformations and aggregations can be done in an end-to-end streaming fashion to refine the data and introduce context-specific structure to analyze and detect security risks in different scenarios.

  • Data ingestion: Autoloader (AWS | Azure | GCP) helps to immediately read data as soon as a new file is written by Crowdstrike FDR into raw data storage. It leverages cloud notification services to incrementally process new files as they arrive on the cloud. Autoloader also automatically configures and listens to the notification service for new files and can scale up to millions of files per second.
  • Unified stream and batch processing: Delta Lake is an open approach to bringing data management and governance to data lakes that leverages Apache Spark™’s distributed computation power for huge volumes of data and metadata. Databricks’ Delta Engine is a highly-optimized engine that can process millions of records per second.
  • Data governance: With Databricks Table Access Control (AWS | Azure | GCP), admins can grant different levels of access to Delta tables based on a user’s business function (see the example after this list).
  • Security analysis tools: Databricks SQL helps to create an interactive dashboard with automatic alerting when unusual patterns are detected. Likewise, it can easily integrate with highly-adopted BI tools such as Tableau, Microsoft Power BI and Looker.
  • Collaboration on Databricks notebooks: Databricks collaborative notebooks enable security teams to collaborate in real time. Multiple users can run queries in multiple languages, share visualizations and make comments within the same workspace to keep investigations moving forward without interruption.
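
As a small illustration of the data governance point above, access to silver tables can be gated with SQL grants; this minimal sketch assumes a cluster with Table Access Control enabled, and the table and group names are hypothetical:

// Hypothetical grants: analysts may read the silver DNS events table,
// while only detection engineers may modify it
spark.sql("GRANT SELECT ON TABLE demo_silver.dns_request TO `security-analysts`")
spark.sql("GRANT MODIFY ON TABLE demo_silver.dns_request TO `detection-engineering`")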

Lakehouse architecture for Crowdstrike Falcon data

We recommend the following lakehouse architecture for cybersecurity workloads, such as Crowdstrike’s Falcon data. Autoloader and Delta Lake simplify the process of reading raw data from cloud storage and writing to a delta table at low cost and minimal DevOps work.

Recommended lakehouse architecture for endpoint security.

In this architecture, semi-structured Crowdstrike data is loaded to the customer’s cloud storage in the landing zone. Then Autoloader uses cloud notification services to automatically trigger the processing and ingestion of new files into the customer’s bronze tables, which will act as the single source of truth for all downstream jobs. Autoloader will track processed and unprocessed files using checkpoints in order to prevent duplicate data processing.

As we move from the bronze-to-silver stage, schema will be added to provide structure to the data. Since we are reading from a single source of truth, we are able to process all of the different event types and enforce the correct schema as they are written to their respective tables. The ability to enforce schemas at the Silver layer provides a solid foundation for building ML and analytical workloads.

The gold stage, which aggregates data for faster query and performance in dashboards and BI tools, is optional, depending on the use case and data volumes. Alerts can be set to trigger when unexpected trends are observed.

Another optional feature is the Databricks Add-on for Splunk, which allows security teams to take advantage of Databricks’ cost-effective model and the power of AI without having to leave the comforts of Splunk. Customers can run ad-hoc queries against Databricks from within a Splunk dashboard or search bar with the add-on. Users can also launch notebooks or jobs in Databricks through a Splunk dashboard or in response to a Splunk search. The Databricks integration is bi-directional, letting customers summarize noisy data or run detections in Databricks that show up in Splunk Enterprise Security. Customers can even run Splunk searches from within a Databricks notebook to prevent the need to duplicate data.

The Splunk and Databricks integration allows customers to reduce costs, expand the data sources they analyze and provide the results of a more robust analytics engine, all without changing the tools used by their staff day-to-day.

Code walkthrough

Since Autoloader abstracts the most complex part of file-based data ingestion, a raw-to-bronze ingestion pipeline can be created within a few lines of code. Below is a Scala code example for a Delta ingestion pipeline. Crowdstrike Falcon event records have one common field name: “event_simpleName.”

import org.apache.spark.sql.functions._
import spark.implicits._

val crowdstrikeStream = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "text")                // raw text files, so no input schema is needed
  .option("cloudFiles.region", "us-west-2")
  .option("cloudFiles.useNotifications", "true")
  .load(rawDataSource)
  .withColumn("load_timestamp", current_timestamp())
  .withColumn("load_date", to_date($"load_timestamp"))
  // extract only the event name from the raw JSON payload
  .withColumn("eventType", from_json($"value", "struct<event_simpleName:string>", Map.empty[String, String]))
  .selectExpr("eventType.event_simpleName", "load_date", "load_timestamp", "value")
  .writeStream
  .format("delta")
  .partitionBy("event_simpleName", "load_date")       // partition bronze by event name and load date
  .option("checkpointLocation", checkPointLocation)
  .table("demo_bronze.crowdstrike")

In the raw-to-bronze layer, only the event name is extracted from the raw data. Load timestamp and load date columns are added, and the raw data is stored in the bronze table. The bronze table is partitioned by event name and load date, which helps to make bronze-to-silver jobs more performant, especially when only a limited set of events or date ranges is of interest.

Next, a bronze-to-silver streaming job reads events from a bronze table, enforces a schema and writes to hundreds of event tables based on the event name. Below is a Scala code example:

spark
  .readStream
  .option("ignoreChanges", "true")
  .option("maxBytesPerTrigger", "2g")
  .option("maxFilesPerTrigger", "64")
  .format("delta")
  .load(bronzeTableLocation)
  .filter($"event_simpleName" === "event_name")
  .withColumn("event", from_json($"value", schema_of_json(sampleJson)) )
  .select($"event.*", $"load_timestamp", $"load_date")
  .withColumn("silver_timestamp", current_timestamp())
  .writeStream
  .format("delta")
  .outputMode("append")
  .option("mergeSchema", "true")    
  .option("checkpointLocation", checkPoint)
  .option("path", tableLocation)   
  .start()

Each event schema can be stored in a schema registry or in a Delta table in case a schema needs to be shared across multiple data-driven services. Note that the above code uses a sample JSON string read from the bronze table, and the schema is inferred from the JSON using schema_of_json(). Later, the JSON string is converted to a struct using from_json(). Then the struct is flattened, and a silver timestamp column is added. These steps provide a DataFrame with all the required columns to be appended to an event table. Finally, we write this structured data to an event table in append mode.

It is also possible to fan out events to multiple tables with one stream with foreachBatch by defining a function that will handle microbatches. Using foreachBatch(), it is possible to reuse existing batch data sources for filtering and writing to multiple tables. However, foreachBatch() provides only at-least-once write guarantees. So, a manual implementation is needed to enforce exactly-once semantics.
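
A minimal sketch of that fan-out pattern is shown below, assuming hypothetical event names plus silverBasePath and fanOutCheckpointLocation variables; because foreachBatch is at-least-once, downstream deduplication is still needed if exactly-once behavior is required:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val eventTypes = Seq("DnsRequest", "ProcessRollup2")     // illustrative Falcon event names

def fanOutBatch(batch: DataFrame, batchId: Long): Unit = {
  batch.persist()                                        // reuse the micro-batch across event filters
  eventTypes.foreach { eventName =>
    batch
      .filter(col("event_simpleName") === eventName)     // route rows to their event-specific table
      .write
      .format("delta")
      .mode("append")
      .save(s"$silverBasePath/$eventName")
  }
  batch.unpersist()
}

spark.readStream
  .format("delta")
  .load(bronzeTableLocation)
  .writeStream
  .foreachBatch(fanOutBatch _)
  .option("checkpointLocation", fanOutCheckpointLocation)
  .start()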

At this stage, the structured data can be queried with any of the languages supported in Databricks notebooks and jobs: Python, R, Scala and SQL. The silver layer data is convenient to use for ML and Cyberattack analysis.

The next streaming pipeline would be silver-to-gold. In this stage, it is possible to aggregate data for dashboarding and alerting. In the second part of this blog series we will provide some more insights into how we build dashboards using Databricks SQL.
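
As a hedged preview of what that might look like, the sketch below computes hourly DNS request counts per endpoint from a hypothetical silver table and writes the result to a gold table that a Databricks SQL dashboard or alert could query; the paths, columns, table names and checkpoint variable are all illustrative:

import org.apache.spark.sql.functions._

// Hypothetical silver table of parsed DnsRequest events ("aid" is the endpoint agent id)
val dnsSilver = spark.readStream
  .format("delta")
  .load(s"$silverBasePath/DnsRequest")

// Hourly request counts per endpoint, appended to a gold table for dashboards and alerts
dnsSilver
  .withWatermark("silver_timestamp", "2 hours")
  .groupBy(window(col("silver_timestamp"), "1 hour"), col("aid"))
  .agg(count("*").alias("request_count"))
  .writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", goldCheckpointLocation)
  .table("demo_gold.dns_request_hourly")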

What’s next

Stay tuned for more blog posts that build even more value on this use case by applying ML and using Databricks SQL.

You can use these notebooks in your own Databricks deployment; each section of the notebooks has comments. We invite you to log in to your own Databricks account, run the notebooks and email us at cybersecurity@databricks.com with questions and suggestions for making them easier to understand and deploy.

Please refer to the docs for detailed instructions on importing the notebook to run.


Acknowledgments
We would like to thank the Bricksters who supported this blog, with special thanks to Monzy Merza, Andrew Hutchinson and Anand Ladda for their insightful discussions and contributions.

--

Try Databricks for free. Get started today.

Countdown to Data + AI Summit 2021: Event Preview and Flyover

Data + AI Summit 2021, the global event for the data community, takes place virtually in just a few days, May 24–28 — and it’s free to attend! This year’s Summit is truly a can’t miss event – it’s not just keynotes and sessions on a website, it’s much, much more!

Data + AI Summit, formerly known as Spark + AI Summit, will bring together tens of thousands of data teams, leaders and visionaries from more than 160 countries to shape the future of data and AI in an interactive, engaging virtual platform. Attendees can choose from 200+ technical sessions and hear from practitioners, leaders and innovators as they share best practices for Apache Spark™, Delta Lake, MLflow, PyTorch, TensorFlow and Transformers. Participants can also join deep dives into streaming architectures, SQL analytics and leveraging modern BI tools for the lakehouse.

With Summit only a week away, here’s a rundown of what you need to know:

Our theme: the future is open

This year, the theme of Summit is all about “openness” and the endless possibilities that it provides. So what exactly does this mean?
This year we want our theme to capture the optimism and opportunity for our data community and what’s ahead of us. It represents an ongoing commitment to the open source community and a nod to new product innovations that further enable an open, collaborative data ecosystem. Furthermore, our theme reinforces that now more than ever, we need to band together to do what’s right — to help solve the world’s toughest problems and drive the change we want to see across the globe.

Our virtual event platform launches Friday, May 21, but here’s a sneak peek at what awaits our attendees. As soon as the platform launches, get a head start by building your agenda and personal profile. If you haven’t already registered, there’s still time! General admission is FREE and sessions will be available live from May 24-28 and then the entire platform will be available on-demand through June 28!

Let’s take a tour of what our platform offers!

Personalized dashboard

As you enter the conference, you will be welcomed by your personalized dashboard — a home for everything you need to know about the conference. The dashboard presents the most useful links to navigate the program agenda, featured sessions, and interactive attendee experience. Make sure to keep an eye on your inbox for notifications so you don’t miss any program updates.

Sample personalized dashboard available to attendees of Data + AI Summit

Build your agenda

Our agenda is jam-packed with rich technical content, product deep dives, AMA sessions and more. To add sessions to your personalized agenda in “My Summit,” simply click on the ❤️ next to the session title. You can also add each session to your personal calendar.

We have an incredible lineup of keynotes from industry thought leaders such as Ali Ghodsi, Matei Zaharia and Reynold Xin, as well as luminary keynotes from visionaries like Bill Inmon, the father of data warehousing; Shafi Goldwasser, a Turing Award winner for her work on the science of cryptography; DJ Patil, the first U.S. Chief Data Scientist; Rajat Monga, co-creator of TensorFlow; Bill Nye, science educator, engineer, comedian and inventor; and Malala Yousafzai, Nobel laureate and education activist, among many others. Make sure to take the time to play with the agenda filters by track and topic and explore the speakers’ pages to build your ideal agenda.

Sample personalized agenda available to attendees of Data + AI Summit

Dev Hub + Expo

Connect with your peers and sponsors at the Dev Hub + Expo. We are making live networking happen at our Hallway Chatter rooms, where you can chat with like-minded attendees or hear lightning talks from community members at our expo stage. You can also learn more about Delta Lake, Apache Spark™, MLflow, Koalas and more at the Databricks Booth, and interact with our valued sponsors who are tech innovators in the Data & AI Community by visiting their booths.

Other highlights at the Dev Hub + Expo are our job board, the advisory bar where you can chat 1:1 with subject matter experts on just about anything, and our solutions theater where you can hear industry use cases.

Sample Dev Hub and Expo available to attendees of Data + AI Summit

Our goal is to bring people together to learn and connect in a virtual environment. Check out “My Summit” to meet like-minded individuals and to discover recommended sessions/experiences based on your interests.

Each evening, attendees will be able to “choose their own adventure” by joining a musical performance, attending our highly-engaging live Meetups or exploring everything the Dev Hub + Expo hall has to offer.

Sample personalized Summit Quest leaderboard available to attendees of Data + AI Summit

Meetups at Data + AI Summit

Your registration gives you access to Meetups — accessible only through the Summit platform. These special events are free to attend on a first-come, first-served basis.

May 24: Databricks University Alliance Panel
May 25: Women in Advanced Data Analytics: Data + ML
May 26: Data Engineering: Dagster and Spark SQL
May 27: Machine Learning Frameworks, Model Management, Operations

Databricks Experience (DBX)

A curated set of content and sessions to help you learn more about the amazing innovation happening at Databricks. Designed to help you make the most of your time at Summit, DBX offers quick access to the subjects most relevant to your needs. Some key features of DBX include:

  1. Product deep dives concentrated on Delta Lake, managing multi-cloud platforms, modernizing to a cloud data architecture and more
  2. Training in SQL Analytics, data science, machine learning and more
  3. News on upcoming products and features
  4. Networking and engagement with other attendees and Databricks experts

There is so much more that we can share, but now it’s your turn to discover what Data + AI Summit has to offer. Take a look at this guide for a deeper dive. If you have registered, join the experience on May 24–28…and if you haven’t yet registered, it’s not too late. Join us for all the action at Data + AI Summit — we look forward to seeing you there!

--

Try Databricks for free. Get started today.

Customer-managed Key (CMK) Public Previews for Databricks on Azure and AWS

We’re excited to release the Customer-managed key (CMK) public previews for Azure Databricks and Databricks workspaces on AWS (Amazon Web Services), with full support for production deployments. On Microsoft Azure, you can now use your own key to encrypt the notebooks and queries managed by Azure Databricks; this capability is available in the Premium pricing tier in all Azure Databricks regions. For those using AWS, you can bring your own key to encrypt the data on DBFS and cluster volumes, available in the Enterprise pricing tier in all AWS regions supporting the E2 architecture. We have received great feedback from our global customers during the corresponding private previews, as these capabilities allow them to unleash the full power of the Databricks Lakehouse Platform to process highly sensitive and confidential data.

Let’s dive deeper into these capabilities.

CMK Managed Services for Azure Databricks

An Azure Databricks workspace is a managed application on the Azure Cloud that delivers enhanced security capabilities through a simple and well-integrated architecture. With the Customer-managed key from an Azure Key Vault instance, users can encrypt the notebooks, queries and secrets stored in the Azure Databricks regional infrastructure. This is a public preview release allowing you to leverage the capability for production deployments.

It’s already possible to bring your own key from Azure Key Vault to encrypt the data stored on DBFS (Blob Storage) and Azure-native data sources like ADLS Gen2 and Azure SQL. You can seamlessly process such data with Azure Databricks without having to configure any settings for a workspace.

Azure Databricks managed data can be encrypted using your own key from Azure Key Vault.

  • CMK Managed Services (notebooks, queries and secrets stored in the control plane): Public Preview
  • CMK Workspace Storage (DBFS): Generally Available (GA)
  • CMK for your own data sources: Already works seamlessly

CMK Workspace Storage for Databricks on AWS

A Databricks workspace on AWS delivers the same security capabilities as on Azure, described above. Use the customer-managed key from an AWS KMS instance to encrypt the data stored on DBFS and Cluster EBS Volumes. This is a public preview release. If you already configure an AWS account default encryption key for EBS volumes, we provide the flexibility to opt-out of using the CMK capability for Cluster EBS Volumes.

Also in public preview is the ability to use customer-managed keys to encrypt the notebooks, queries and secrets stored in the Databricks regional infrastructure. And it’s already possible to use a customer-managed key to encrypt your data on AWS-native data sources like S3 and RDS. Refer to this documentation to allow your Databricks clusters to encrypt / decrypt data on your non-DBFS S3 buckets.

Databricks managed data can be encrypted using your own key from AWS KMS.

  • CMK Managed Services (notebooks, queries and secrets stored in the control plane): Public Preview
  • CMK Workspace Storage (DBFS): Generally Available (GA)
  • CMK for your own data sources: Already works seamlessly

Get started with these enhanced security capabilities by deploying Azure Databricks and Databricks on AWS workspaces with customer-managed keys for Managed Services and Workspace Storage.

Please refer to Platform Security for Enterprises for a deeper view into how we bring a security-first mindset while building the most popular lakehouse platform on Azure & AWS.

--

Try Databricks for free. Get started today.

Summit This Week: 29+ Talks on Deep Learning, MLOps, NLP and Other Machine Learning Topics

The Data + AI Summit happening this week had its origins as the Spark + AI Summit, but we thought it was important to expand the conference due to the convergence of data topics we’ve been seeing in the community. The Summit now covers all things data — from data science and data engineering to data analytics and machine learning.

This year, we’re lucky to be joined by some of the leading innovators in the field of machine learning, such as Rajat Monga (co-creator of TensorFlow), Soumith Chintala (co-creator of PyTorch), Clément Delangue (co-founder of Hugging Face, makers of the Transformers NLP library), Matei Zaharia (co-creator of MLflow), Manasi Vartak (creator of ModelDB) and more.

In addition to these well-known creators, there are many other amazing practitioners joining Data + AI Summit to share their knowledge. Here are some of the 200+ sessions at Data + AI Summit.

29+ Talks on Deep Learning, MLOps, NLP and Other Machine Learning Topics

Machine learning at large
Keynote: AI is Eating Software by Rajat Monga, co-creator of TensorFlow

Keynote: Using Mathematics to Address the Growing Distrust in Algorithms by Turing Award winner Shafi Goldwasser, CS Professor at MIT, UC Berkeley and Weizmann

A Vision for the Future of ML Frameworks by Soumith Chintala, co-creator of PyTorch and lead of PyTorch at Facebook AI

Meetup: Machine Learning Frameworks, Model Management and Ops by Clément Delangue (Hugging Face), Max Fisher (Databricks), Zhihao Jia (Facebook)

Keynote: AI for Intelligent Financial Services by Dr. Manuela Veloso (JP Morgan, Carnegie Mellon University)

Deep learning
Drug Repurposing using Deep Learning on Knowledge Graphs by Alexander Thomas (Wisecube) and Vishnu Vettrivel (Wisecube)

Object Detection with Transformers by Dr Liam Li (Determined AI)

Automated Background Removal Using PyTorch by Oleksander Miroshnychenko (GlobalLogic), Simona Stolnicu (Levi9)

ML operations (MLOps)
CI/CD in MLOps: Implementing a Framework for Self-Service Everything by Cara Phillips (Artis Consulting) and Wesly Clark (J.B. Hunt)

Why APM is Not the Same as ML Monitoring by Cory Johannsen (Verta)

The Function, the Context, and the Data – Enabling MLOps at Stitch Fix by Elijah Ben Izzy

Consolidating MLOps at One of Europe’s Biggest Airports by Floris Hoogenboom (Schiphol) and Sebastiaan Grasdijk (Schiphol)

Catch Me If You Can: Keeping Up With ML Models in Production by Shreya Shankar (former Viaduct, Google Brain, Stanford)

Machine Learning CI/CD for Email Attack Detection by Jeshua Bratman (Abnormal Security) and Justin Young (Abnormal Security)

Scale
Anomaly Detection at Scale! by Opher Dubrovsky (Nielsen) and Max Peres (Nielsen)

Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue by Han Wang (Lyft)

The Rise of Vector Data by Edo Liberty (Pinecone)

Model Monitoring at Scale with Apache Spark and Verta by Manasi Vartak (Verta)

Natural Language Processing (NLP)
Efficient Large-Scale Language Model Training on GPU Clusters by Deepak Narayanan (PhD Student @ Stanford)

Automatic ICD-10 Code Assignment to Consultations in healthcare by Joinal Ahmed (Halodoc) and Nirav Kumar (Halodoc)

NLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobil by Sumeet Trehan

Advanced Natural Language Processing with Apache Spark NLP by David Talby (John Snow Labs)

Conversational AI with Transformer Models by Rajesh Shreedhar Bhat (Walmart) and Dinesh Ladi (Walmart)

Recommendation Systems
Towards Personalization in Global Digital Health by Africa Perianez (benshi.ai)

Scaling Online ML Predictions at DoorDash by Hien Luu and Arbaz Khan

Offer Recommendation System with Apache Spark at Burger King by Luyang Wang and Kai Huang (Intel)

Recommender-Based Transformers by Denis Rothman

Building A Product Assortment Recommendation Engine by Ethan Dubois (Anheuser Busch) and Justin Morse (Anheuser Busch)

I hope you enjoy this selection of amazing talks on the past, present and future of machine learning. You’ll also want to be sure to attend the Apache Spark Data Science and Machine Learning keynotes on Thursday AM (PDT) where you’ll hear the latest announcements and product releases from Databricks and related open source projects.

See you in the Dev Hub at Summit!

Register today

Announcing LabelSpark, the Labelbox Connector on Databricks

This is a guest authored post by Nick Lee, partnership integration lead at Labelbox.

Large data lakes typically house a combination of structured and unstructured data. Data teams often use Apache Spark™ to analyze structured data, but may struggle to apply the same analysis to unstructured, unlabeled data (specifically in the form of images, video, etc). To tackle these challenges, Fortune 500 enterprises such as WarnerMedia and Stryker are leveraging Labelbox’s training data platform to quickly produce structured data from unstructured data. Labelbox has been used to support a variety of production AI use cases, including improved marketing personalization through visual search, manufacturing defect detection and smart camera development.

 Labelbox’s training data platform supports a variety of production AI use cases.

In the past, AI/ML teams had to use expensive, manual processes to transform their unstructured data into something more useful — either by paying a third-party to label their data, buying a labeled dataset or narrowing the scope of their project to leverage public datasets. Finding faster and more cost effective ways to convert unstructured data into structured data is highly beneficial towards supporting more advanced use cases built around companies’ unique, unstructured datasets.

With Labelbox, Databricks users can quickly convert unstructured to structured data and apply the results to a range of machine learning use cases, from deep learning to computer vision.

With Databricks, data science and AI teams can now easily prepare unstructured data for AI and analytics. Teams can label data with human effort, machine learning models in Databricks, or a combination of both. Teams can also employ a model-assisted labeling workflow that allows humans to easily inspect and correct a model’s predicted labels. In terms of time and cost savings, this process can drastically reduce the amount of unstructured data you need to achieve strong model performance.

With LabelSpark, the Labelbox connector on Databricks, data teams can use a model-assisted labeling workflow that allows humans to easily inspect and correct a model’s predicted labels.

Labelbox has recently launched a connector between Databricks and Labelbox — the LabelSpark library — so teams can connect an unstructured dataset to Labelbox. With LabelSpark, teams can programmatically set up an ontology for labeling and return the labeled dataset in a Spark DataFrame. Combining Databricks and Labelbox gives data and AI teams an end-to-end environment for unstructured data workflows, along with a query engine built around Delta Lake, coupling fast annotation tools with a powerful machine learning compute environment.

Learn more about using Databricks with Labelbox and see a live technical demo of the workflow at the Productionizing Unstructured Data for AI and Analytics session at Data + AI Summit 2021.

--

Try Databricks for free. Get started today.

Databricks Announces 2021 North America Partner Awards

The Databricks Partner Ecosystem of over 450 partners globally is critical to building and delivering the best data and AI solutions in the world. We are proud to collaborate with our partners and recognize that the success of our joint customers is the result of mutual commitment and investments in training, solution development and our ongoing field programs. Our partners bring the right software, services and strategic consulting expertise to accelerate success for data teams.

At this year’s Partner Executive Summit, we were thrilled to present the 2021 Databricks North America Partner Awards, which recognized our top-performing partners for their exceptional accomplishments and joint collaboration with Databricks in the past year. These awards highlight the standout work we have done in collaboration with technology, consulting and system integrator partners, who brought deep industry expertise, technology skills and impactful solutions to Databricks customers like H&M, Credit Suisse, Starbucks and Navy Federal Credit Union.

Check out a rundown of the winners of each category:

Consulting and System Integrator Partners

Global Consulting & SI Partner of the Year: Accenture/Avanade
Databricks has partnered with Accenture and Avanade for years to create 20+ joint solution accelerators and 100+ global client solutions that deliver opportunities across industries. By combining solution accelerators and assets, strong investments in training and marketing and a dedicated innovation center, we enable joint customers to realize a faster time-to-deployment and 3x ROI. Co-developed offerings cover Hadoop Migration, industrialized machine learning, master data management and financial risk management, while client engagements span top enterprise companies in financial services, energy and utilities, retail and consumer goods and much more.

Congratulations to Alan Grogan, Atish Ray, and the Accenture and Avanade team!
Databricks 2021 Global C&SI Partner of the Year Award winner

National Consulting & SI Partner of the Year: Slalom

With strong leadership collaboration, go-to-market alignment and technical resources, it’s no surprise that Slalom wins this year’s Databricks National Consulting & SI Partner of the Year award. We partnered to develop the Modern Culture of Data powered by Databricks, which enables customers to unlock the potential of their organization and realize investments in AI. By combining our collective expertise across data engineering, data science and DevOps, customers accelerate the ML process. Since its inception, the Modern Culture of Data powered by Databricks has helped customers like Comcast and Walgreens embed data into every business decision.

Congratulations to David Frigeri and the Slalom team!
Databricks 2021 North America National Partner of the Year Award winner

C&SI Innovation Award: Neudesic
When it comes to solution accelerators and assets, Neudesic goes above and beyond. The Neudesic Azure Data and AI Platform Accelerator is a collection of repeatable frameworks, automations and configurations that unify data engineering and data science. Single-click deployments lower cost and provide a faster path to production, while integrating critical functions like MLOps and machine learning lifecycle management. More than 25 clients have been able to automate the movement of data from siloed sources to an enterprise-scale data lake in as little as 10 days, reducing upfront deployment time by 50%. In addition, their Utility Data & AI Platform Accelerator enables energy and utility customers to use AI to identify patterns in large IoT datasets and pinpoint the sources and causes of reliability issues.

Congratulations Mike Rossi and the Neudesic team!
Databricks 2021 North America C&SI Innovation Award winner

C&SI Customer Impact Award: Cognizant

Our partnership with Cognizant has led to impactful engagements across key industries. To name a few, we partnered with Cognizant to help a global automotive manufacturer lower costs and reach a faster time-to-market. Because the customer's on-premises Hadoop environment required a large number of admins and suffered from poor data science standards and limited hardware capacity, they leveraged Cognizant and Databricks to develop their reference architecture and migrate their data. This resulted in 50% faster ETL workloads and a 3x productivity increase for data science and data engineering models. We also partnered to enable a global pharmaceutical company to implement a data science strategy that reduced their data processing from 128 days to 12 hours, improving profitability and saving them $20M in budget.

Congratulations Gaurav Gupta and the Cognizant team!
Databricks 2021 North America Customer Impact Award winner

C&SI Customer Rising Star Award: 3Cloud
The 3Cloud team quickly built a strong Databricks practice within their organization, and together we have helped several customers implement AI solutions using Azure Databricks. With repeatable solution accelerators built on Azure Databricks, customers can set up their data environments in minutes, auto-scale and collaborate on shared projects in an interactive workspace. In addition, 3Cloud is one of Databricks' biggest marketing advocates, consistently publishing blogs, hosting webinars and creating videos that showcase how to extract, transform and load data into the Databricks Lakehouse Platform for more efficient performance and rapid acceleration.

Congratulations Adam Jorgensen and the 3Cloud team!
Databricks 2021 North America Rising Star Award winner

Federal Consulting & SI Partner of the Year: Booz Allen Hamilton
Booz Allen Hamilton brings bold, innovative thinking to industries ranging from defense to international development. It's this boldness that makes partnering with Booz Allen Hamilton a differentiator for our federal customers. Since our partnership began, we've been helping customers like the National Geospatial-Intelligence Agency, the Veterans Administration and the Department of Defense transform how they modernize IT and data analytics. With over 500 practitioners at Booz Allen Hamilton trained on Databricks, we have been able to quickly co-develop repeatable solution accelerators and assets for Federalwide Assurance, cyber and genomics, amplifying the impact we bring to our joint customers.

Congratulations to Steven Escaravage and the Booz Allen Hamilton team!
Databricks 2021 North America Federal Partner of the Year Award winner

LATAM Consulting & SI Partner of the Year: BlueShift
Since joining the Databricks partner ecosystem last year, our partnership with BlueShift has been seamless. With great leadership alignment and strong investments in training and marketing, BlueShift and Databricks have worked together with several key customers across Brazil to add value to their automation, big data, infrastructure and cloud projects. For example, we recently partnered to help a Brazilian investment firm migrate from on-premises to the cloud, reducing their data processing time from 3 hours to 2 minutes and cutting costs through autoscaling. In addition to nurturing our customer relationships, BlueShift has gone above and beyond to host several joint marketing events and to develop repeatable solutions centered on the Databricks Lakehouse Platform.

Congratulations to Alan Camillo and the BlueShift team!
Databricks 2021 North America LATAM Partner of the Year Award winner

Technology Partners

ISV Innovation Award: Matillion
To keep up with the changing pace of the market, organizations need to innovate by taking advantage of the cloud to maximize analytics, reporting and advanced use cases like machine learning and AI. Matillion ETL for Delta Lake is an elegant solution for simplifying data ingestion and transformation on the Databricks Lakehouse Platform. Matillion’s deep integration with Databricks leverages push-down instructions to maximize performance on Delta Lake. With just a few clicks, the easy-to-use interface empowers data teams to unify their data science and BI environments.

Congratulations Matthew Scullion and the Matillion team!
Databricks 2021 North America ISV Innovation Award winner

ISV Customer Impact Award: Qlik
Our partnership with Qlik enhances efficiency and the customer experience. For example, we partnered to help the transportation and logistics company J.B. Hunt address the critical importance of customer responsiveness, service quality and operational efficiency, accelerating the delivery of analytics-ready data to their users with one-to-two-minute latency. Over the last year, we've seen even more demand from mutual customers to modernize their data architectures across multiple cloud environments. Qlik's support for Databricks, in both data integration and analytics, has enabled us to meet and exceed these demands. Qlik's SQL integration, multi-cloud support and ongoing customer success are just some of the many reasons Qlik is this year's ISV Customer Impact Award winner.

Congratulations Itamar Ankorion and the Qlik team!
 Databricks 2021 North America ISV Customer Impact Award winner

ISV Momentum Award: Confluent
Together, Confluent and Databricks have developed a powerful and complete data solution focused on helping companies operate at scale in real-time, and we’re only just getting started. Leveraging Confluent Cloud and the Databricks Lakehouse Platform as fully-managed services on multiple clouds, developers can implement real-time data pipelines with less effort and without the need to upgrade their datacenter (or set up a new one). This opens the door for processing all types of data to allow customers to make real-time predictions and gain immediate business insights. We’ll continue to enhance this partnership with further growth of joint customers on Kafka and by introducing a new connector for Confluent Cloud later this year.

Congratulations Sid Rabindran and the Confluent team!
Databricks 2021 North America ISV Momentum Award winner


ISV Rising Star Award: dbt Labs
With dbt, analysts take ownership of the entire analytics engineering workflow. We partnered with dbt to announce brand-new support for the dbt-spark plugin in dbt Cloud, which includes native support for Databricks. There are three reasons why we're excited about the Databricks and dbt partnership: analysts can model complete datasets from the same platform trusted by their data science counterparts; organizations can apply analytical best practices like version control, testing, scheduling and documentation without sacrificing speed or reliability; and open source is at the heart of the dbt ecosystem, so customers benefit from continuous innovation and community support.

Congratulations Amy Deora and the dbt team!
Databricks 2021 North America ISV Rising Star Award winner

Partner Champions

This year, we awarded special recognition to members of our Partner Champions group. These eight Partner Champions were recognized for their excellent evangelism of the Databricks Lakehouse Platform through customer implementations, technical innovation and community support.

  • Jai Malhotra from Accenture for implementing Databricks’ first true multi-cloud engagement, allowing for single platform and automation at scale with AWS, Azure and Google Cloud.
  • Nathan Buesgens from Accenture for leading the effort behind Accenture’s Industrialized Machine Learning platform to accelerate model delivery and optimize the data science workflow.
  • Vishal Vibhandik from Cognizant for leading AI and analytic collaborations and being personally involved in nearly all joint client engagements between Databricks and Cognizant.
  • Matt Collins from Slalom for championing the Modern Culture of Data powered by Databricks to help organizations achieve the massive cultural transformation needed to drive analytics at scale.
  • Abhishek Dey from Wipro for driving IntelliProc, which complements Azure Cloud Native Data Services and automates the cloud data transformation journey with ‘out of the box’ capabilities.
  • Jason Workman from Insight for leading the charge on several strategy, architecture and migration projects to the Databricks Lakehouse Platform, along with incredible collaboration with the Databricks cross-functional teams.
  • Maciej Szpakowski from Prophecy.io for providing the technical expertise to integrate Prophecy with Databricks, dedication to end-to-end migrations for key global customers and ongoing collaboration with the Databricks product team.
  • Alexa Maturana-Lowe from Fivetran for going above-and-beyond as a Databricks champion and collaborator, and for leading the effort and innovation behind capturing Delta Lake workloads using Fivetran.

Databricks 2021 North America Partner Champions

Thank you to the entire community of our 450+ partners, and we look forward to even more engagement and collaboration!

--

Try Databricks for free. Get started today.

The post Databricks Announces 2021 North America Partner Awards appeared first on Databricks.


Time Series Data Analytics in Financial Services with Databricks and KX


This is a guest co-authored post. We thank Connor Gervin, partner engineering lead, KX, for his contributions.

KX recently announced a partnership with Databricks, making it possible to cover the full range of use cases for high-speed time-series data analytics. In this post, we explain the integration options available between the two platforms for both streaming and batch time-series data, and why they provide significant benefits to Financial Services companies.

Time-series data is used ubiquitously within Financial Services, and we believe new use cases can bring both the technologies and the data to new frontiers. There are a number of tools in the market to capture, manage and process time-series data, but given the nature of financial services use cases, it is necessary not only to leverage best-of-breed tools for data processing but also to keep innovating. This was the driving force behind the partnership between Databricks and KX. Through this initiative, we can combine the low-latency streaming capabilities of kdb+ and the q language with the scalable data and AI platform that Databricks provides. Financial Services firms can unlock and accelerate innovation by taking advantage of their time-series data in the cloud rather than having it locked and siloed in on-premises instances of kdb+.

In one simple lakehouse platform, Databricks provides best-in-class ETL and data science capabilities that are now available to KX users for ad hoc analysis and for collaboratively building sophisticated machine learning models. This integration brings KX data that was previously out of reach, at the edge or on-premises, into the cloud with Databricks for ML and AI use cases. In addition to opening a new world of use cases on KX data, KX introduces Databricks to the world of low-latency streaming analytics, where continuous intelligence and data-driven decision making can be performed in real time. Below, we outline just a few of the options available today to users who would like to integrate Databricks and KX across a range of use cases and deployment configurations.

Option: kdb+ to blob store as Parquet or Delta

Within KX Insights, kdb+ has drivers to write data directly to the native blob storage services of each cloud provider (e.g., S3, ADLS Gen2, GCS). Databricks provides Auto Loader to automatically recognize new data arriving in cloud storage and load it into Databricks for processing. Databricks can then write data back out to cloud storage as Parquet or Delta Lake, where new insights, analytics or machine learning outputs can be loaded into the kdb+ environment.
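For example, once kdb+ has landed Parquet files in cloud storage, a minimal Auto Loader stream along the following lines (paths and table names are placeholders) can pick them up and persist them as a Delta table:

# Sketch: ingest Parquet files written by kdb+ into a Delta table with Auto Loader.
raw_ticks = (
    spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", "s3://bucket/_schemas/ticks")
        .load("s3://bucket/kdb-exports/ticks/")
)

(raw_ticks.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://bucket/_checkpoints/ticks")
    .trigger(once=True)            # or a continuous trigger for streaming ingest
    .toTable("market_data.ticks_bronze"))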

This integration serves a number of financial use cases, such as batch modeling, optimization and machine learning on the data, leading to better risk calculations, pricing models, rates data population, surveillance and even recommendations across portfolios. The ability to feed the calculated values back into KX enables the low-latency streaming and serving use cases where kdb+ accelerates business goals.

Within KX Insights, kdb+ has drivers to directly write data to the native blob and storage services on each of the cloud providers (e.g. S3, ADLS Gen2, GCS).

Option: KdbSpark Spark data source

KdbSpark is a Spark Data Source available directly on GitHub. It was originally developed two years ago by Hugh Hyndman, a well-regarded kdb+ champion. The interface can directly send and receive data and queries between Databricks and kdb+, both on the same instance and across a network connection. KdbSpark lets you connect Databricks directly to kdb+, execute ad hoc queries or server-side functions with arguments, and benefit from additional speed enhancements such as predicate push-down. One business benefit, for example, is the ability to access kdb+ through the data source so that quants and data modelers can use Databricks notebooks for exploratory data analysis of new strategies and insights directly with q and kdb+.

KdbSpark is a Spark Data Source available directly on GitHub.
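A hedged sketch of what reading from kdb+ over the data source can look like follows; the format name and option keys (host, port, qexpr) are assumptions, so check the KdbSpark README on GitHub for the exact spelling:

# Assumed option names; consult the KdbSpark documentation for exact usage.
trades_df = (
    spark.read.format("kdb")
        .option("host", "kdb-host.internal")
        .option("port", "5001")
        .option("qexpr", "select from trade where date=.z.d")  # q query pushed to kdb+
        .load()
)
trades_df.createOrReplaceTempView("todays_trades")  # explore with SQL from a notebook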

Option: KX Insights on a Databricks Cluster

KX and Databricks have worked together to build a containerized version of KX Cloud Edition that can be distributed and run across a Databricks cluster. By leveraging Databricks, kdb+ can easily scale in the cloud with the simple interface and mechanisms of Spark. By scaling kdb+, a number of workloads can be run in parallel against the data stored natively within kdb+ on object storage across all the cloud providers. This allows kdb+ scripts and models (for example, Monte Carlo simulations, regressions and other models) to scale horizontally on ephemeral cloud compute in a faster, easier and cheaper way, increasing throughput and greatly reducing the time to actionable insights.

By leveraging Databricks, kdb+ can easily scale in the cloud with the simple interface and mechanisms of Spark.

Option: Federated Queries via ODBC/JDBC/REST API

There's more. KX Insights provides a mechanism to federate queries across multiple data sources via a single interface, decomposing a query between kdb+ and Databricks so that each system storing data can be queried, aggregated and analyzed. For analysts focused on BI workloads or on correlating time series with fundamental data, a single-pane-of-glass access layer across real-time and historical data sources is preferred.

KX Insights provides a mechanism to federate queries across multiple data sources in a single interface

Option: Databricks APIs

KX has developed a series of interfaces that enable access to the Databricks REST APIs from within kdb+/q. All of the exposed features of Databricks are available programmatically, including the Databricks Workspace APIs, SQL endpoints and MLflow. The MLflow APIs are particularly exciting because kdb+ and q can connect to both the open source and managed versions of MLflow, enabling MLflow to manage the modeling and machine learning lifecycle via kdb+.

Option: Full kdb+ Python runtime (PyQ)

As part of the Fusion for kdb+ interface collection, PyQ brings the Python programming language to the kdb+ database, allowing developers to integrate Python and q code seamlessly in one application. When used within the Python runtime of a Spark session, PyQ brings a full interactive kdb+ environment right inside the Python process, installed simply via pip. Connecting Databricks and kdb+ through Python is ideal for porting your data science and machine learning models into Databricks to be trained and executed in the cloud.
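As a minimal sketch (assuming pyq is installed on the cluster with a licensed kdb+ runtime, and that the toy table below stands in for real data), q analytics can run in-process and hand their results to Spark:

# Minimal PyQ sketch: run q in-process, then hand the result to Spark.
from pyq import q

q('trade:([] sym:`AAPL`MSFT`AAPL; price:101.5 250.1 102.3; size:100 200 150)')
vwap = q('select vwap:size wavg price by sym from trade')   # q analytics in kdb+

vwap_pdf = vwap.pd()                        # convert the kdb+ result to pandas
spark_df = spark.createDataFrame(vwap_pdf)  # continue in Spark / Delta for ML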

In summary, there are a number of ways that Databricks and KX integrate. With the collaboration of strong technical skill sets on both sides, Databricks and KX have partnered to deliver best-of-industry features to their mutual customers. Please reach out to your contact at either organization to learn more about how Databricks and KX technologies integrate and democratize time-series data for new use cases and new business value.

Or come hear KX speak at the Databricks Financial Services Industry Forum on Unlocking New Insights on Real-time Financial Data with ML.

--

Try Databricks for free. Get started today.

The post Time Series Data Analytics in Financial Services with Databricks and KX appeared first on Databricks.

Introducing Delta Sharing: An Open Protocol for Secure Data Sharing


Data sharing has become critical in the modern economy as enterprises look to securely exchange data with their customers, suppliers and partners. For example, a retailer may want to publish sales data to its suppliers in real time, or a supplier may want to share real-time inventory. But so far, data sharing has been severely limited because sharing solutions are tied to a single vendor. This creates friction for both data providers and consumers, who naturally run different platforms.

Today, we’re launching a new open source project that simplifies cross-organization sharing: Delta Sharing, an open protocol for secure real-time exchange of large datasets, which enables secure data sharing across products for the first time. We’re developing Delta Sharing with partners at the top software and data providers in the world.

To see why today’s data sharing solutions create friction, consider a retailer that wants to share data with an analyst at one of its suppliers. Today, the retailer could use one of several cloud data warehouses that offer data sharing, but then the analyst would need to work with their IT, security, and procurement teams to deploy the same warehouse product at their company, a process that can take months. Furthermore, once the warehouse is deployed, the first thing the analyst would do is export the data from it into their favorite data science tool, such as pandas or Tableau.

With Delta Sharing, data users can directly connect to the shared data through pandas, Tableau, or dozens of other systems that implement the open protocol, without having to deploy a specific platform first. This reduces their access time from months to minutes, and greatly reduces work for data providers who want to reach as many users as possible.

We’re working with a vibrant ecosystem of partners on Delta Sharing, including product teams at the leading cloud, BI and data vendors:

 Delta Sharing Ecosystem - Apache Spark, Pandas, Presto, Trino, Rust, Hive, Tableau, Power BI, Qlik, Looker, Databricks, Microsoft Azure, Google BigQuery, Starburst, Dremio, AtScale, Immuta, Privacera, Alation, Collibra, Nasdaq, S&P, ICE, NYSE, AWS, FactSet, Precisely, Atlassian, Foursquare, Sequence Bio


In this post, we’ll explain how Delta Sharing works and why we’re so excited about an open approach to data sharing.

Delta Sharing goals

Delta Sharing is designed to be easy for both providers and consumers to use with their existing data and workflows. We designed it with four goals in mind:

  • Share live data directly without copying it: We want to make it easy to share existing data in real time. Today, the majority of enterprise data is stored in cloud data lake and lakehouse systems. Delta Sharing works over these; in particular, it lets you securely share any existing dataset in the Delta Lake or Apache Parquet formats.
  • Support a wide range of clients: Recipients should be able to directly consume data from their tools of choice without installing a new platform. The Delta Sharing protocol is designed to be easy for tools to support directly. It’s based on Parquet, which most tools already support, so implementing a connector for it is easy.
  • Strong security, auditing and governance: The protocol is designed to help you meet privacy and compliance requirements. Delta Sharing lets you grant, track and audit access to shared data from a single point of enforcement.
  • Scale to massive datasets: Data sharing increasingly needs to support terabyte-scale datasets, such as fine-grained industrial or financial data, a challenge for legacy solutions. Delta Sharing leverages the cost and elasticity of cloud storage systems to share massive datasets economically and reliably.

How does Delta Sharing work?

Delta Sharing is a simple REST protocol that securely shares access to part of a cloud dataset. It leverages modern cloud storage systems, such as S3, ADLS or GCS, to reliably transfer large datasets. There are two parties involved: Data Providers and Recipients.

As the Data Provider, Delta Sharing lets you share existing tables or parts thereof (e.g., specific table versions or partitions) stored on your cloud data lake in Delta Lake format. A Delta Lake table is essentially a collection of Parquet files, and it's easy to wrap existing Parquet tables into Delta Lake if needed. The data provider decides what data they want to share and runs a sharing server in front of it that implements the Delta Sharing protocol and manages access for recipients. We've open sourced a reference sharing server, and we provide a hosted one on Databricks, as we expect other vendors will.

As a Data Recipient, all you need is one of the many Delta Sharing clients that supports the protocol. We’ve released open source connectors for pandas, Apache Spark, Rust and Python, and we’re working with partners on many more.
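For example, with the open source Python connector, a recipient can list and load shared tables in a few lines (the profile file path and the share, schema and table names below are placeholders):

# Recipient-side sketch using the open source delta-sharing Python connector.
import delta_sharing

profile = "/dbfs/config/open-datasets.share"        # credential file from the provider
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())                     # discover what has been shared

table_url = profile + "#retail_share.sales.daily_sales"
pdf = delta_sharing.load_as_pandas(table_url)       # small tables: straight into pandas
sdf = delta_sharing.load_as_spark(table_url)        # large tables: as a Spark DataFrame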

Delta Sharing is a simple REST protocol that securely shares access to part of a cloud dataset. It leverages modern cloud storage systems, such as S3, ADLS or GCS, to reliably transfer large datasets.

The actual exchange is carefully designed to be efficient by leveraging the functionality of cloud storage systems and Delta Lake. The protocol works as follows:

  1. The recipient’s client authenticates to the sharing server (via a bearer token or other method) and asks to query a specific table. The client can also provide filters on the data (e.g. “country=US”) as a hint to read just a subset of the data.
  2. The server verifies whether the client is allowed to access the data, logs the request, and then determines which data to send back. This will be a subset of the data objects in S3 or other cloud storage systems that actually make up the table.
  3. To transfer the data, the server generates short-lived pre-signed URLs that allow the client to read these Parquet files directly from the cloud provider, so that the transfer can happen in parallel at massive bandwidth, without streaming through the sharing server. This powerful feature available in all the major clouds makes it fast, cheap and reliable to share very large datasets.

Benefits of the design

The Delta Sharing design provides many benefits for both providers and consumers:

  • Data Providers can easily share an entire table, or just one version or partition of the table, because clients are only given access to a specific subset of the objects in it.
  • Data Providers can update data reliably in real time using the ACID transactions on Delta Lake, and recipients will always see a consistent view.
  • Data Recipients don’t need to be on the same platform as the provider, or even in the cloud at all — sharing works across clouds and even from cloud to on-premise users.
  • The Delta Sharing protocol is very easy for clients to implement if they already understand Parquet. Most of our prototype implementations with open source engines and BI tools only took 1-2 weeks to build.
  • Transfer is fast, cheap, reliable and parallelizable using the underlying cloud system.

An open ecosystem

As previously mentioned, we are excited about establishing an open approach to data sharing. Data providers, like Nasdaq, have uniformly told us that it is too hard to deliver data to diverse consumers, all of which use different analytics tools.

“We support Delta Sharing and its vision of an open protocol that will simplify secure data sharing and collaboration across organizations. Delta Sharing will enhance the way we work with our partners, reduce operational costs and enable more users to access a comprehensive range of Nasdaq’s data suite to discover insights and develop financial strategies,” said Bill Dague, Head of Alternative Data, Nasdaq.

With Delta Sharing, dozens of popular systems will be able to connect directly to shared data so that any user can use it, reducing friction for all participants. We are working with dozens of partners to define the Delta Sharing standard, and we invite you to participate.
Many of these companies extended their support for today’s launch:

BI Tools: Tableau, Qlik, Power BI, Looker
Analytics: AtScale, Dremio, Starburst, Microsoft Azure, Google BigQuery
Governance: Collibra, Immuta, Alation, Privacera
Data Providers: Nasdaq, Precisely, Safegraph, Atlassian, AWS, FactSet, Foursquare, ICE, Quandl, S&P, SequenceBio

Delta Sharing on Databricks

Databricks customers will have a native integration of Delta Sharing in our Unity Catalog, providing a streamlined experience for sharing data both within and across organizations. Administrators will be able to manage shares using a new CREATE SHARE SQL syntax or REST APIs and audit all accesses centrally. Recipients will be able to consume the data from any platform. Sign up to join our waitlist for preview access and updates.
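As a rough sketch of what that SQL workflow might look like (the exact keywords may differ when the preview ships; the share, table and recipient names are placeholders):

-- Provider side: define a share, add data to it, and grant a recipient access.
CREATE SHARE retail_share;
ALTER SHARE retail_share ADD TABLE sales.daily_sales;

CREATE RECIPIENT supplier_a;   -- issues credentials the recipient uses to connect
GRANT SELECT ON SHARE retail_share TO RECIPIENT supplier_a;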

Roadmap

This first version of Delta Sharing is just a start. As we develop the project, we plan to extend it to sharing other objects, such as streams, SQL views or arbitrary files like machine learning models. We believe that the future of data sharing is open, and we are thrilled to bring this approach to other sharing workflows.

Getting started with Delta Sharing

To try the open source Delta Sharing release, follow the instructions at delta.io/sharing. Or, if you are a Databricks customer, sign up for updates on our service. We are very excited to hear your feedback!

--

Try Databricks for free. Get started today.

The post Introducing Delta Sharing: An Open Protocol for Secure Data Sharing appeared first on Databricks.

Introducing Databricks Unity Catalog: Fine-grained Governance for Data and AI on the Lakehouse


Data lake systems such as S3, ADLS, and GCS store the majority of data in today’s enterprises thanks to their scalability, low cost, and open interfaces. Over time, these systems have also become an attractive place to process data thanks to lakehouse technologies such as Delta Lake that enable ACID transactions and fast queries. However, one area where data lakes have remained harder to manage than traditional databases is governance; so far, these systems have only offered tools to manage permissions at the file level (e.g. S3 and ADLS ACLs), using cloud-specific concepts like IAM roles that are unfamiliar to most data professionals.

That’s why we’re thrilled to announce our Unity Catalog, which brings fine-grained governance and security to lakehouse data using a familiar, open interface. Unity Catalog lets organizations manage fine-grained data permissions using standard ANSI SQL or a simple UI, enabling them to safely open their lakehouse for broad internal consumption. It works uniformly across clouds and data types. Finally, it goes beyond managing tables to govern other types of data assets, such as ML models and files. Thus, enterprises get a simple way to govern all their data and AI assets:

Unity Catalog, a new product that brings fine-grained governance and security to lakehouse systems using an open interface while retaining all the benefits of data lakes.

What’s hard with data lake governance tools today?

Although all cloud storage systems (e.g. S3, ADLS and GCS) offer security controls today, these tools are file-oriented and cloud-specific, both of which cause problems as organizations scale up. We’ve often seen customers run into four problems:

  • Lack of fine-grained (row, column and view level) security: Cloud data lakes can generally only set permissions at the file or directory level, making it hard to share just a subset of a table with particular users. This makes it tedious to onboard enterprise users who should not have access to the whole table.
  • Governance tied to physical data layout: Because governance controls are at the file level, data teams must carefully structure their data layout to support the desired policies. For example, a team might partition data into different directories by country and give access to each directory to different groups. But what should the team do when governance rules change? If different states inside one country adopt different data regulations, the organization may need to restructure all its data.
  • Nonstandard, cloud-specific interfaces: Cloud governance APIs such as IAM are unfamiliar to data professionals (e.g., database administrators) and differ across clouds. Today, enterprises increasingly have to store data in multiple clouds (e.g., to satisfy privacy regulations), so they need to be able to manage data across clouds.
  • No support for other asset types: Data lake governance APIs work for files in the lake, but modern enterprise workflows produce a wide range of other types of data assets. For example, SQL workflows often revolve around views, data science workloads produce ML models, and many workloads connect to data sources other than the lake (e.g., databases). In the modern compliance landscape, all of these assets need to be governed the same way if they contain sensitive data. Thus, data teams have to reimplement the same security policies in many different systems.

Unity Catalog’s approach

Unity Catalog solves these problems by implementing a fine-grained approach to data governance based on open standards that works across data asset types and clouds. It is designed around four key principles:

  • Fine-grained permissions: Unity Catalog can enforce permissions for data at the row, column or view level instead of the file level, so that you can always share just part of your data with a new user without copying it.
  • An open, standard interface: Unity Catalog’s permission model is based on ANSI SQL, making it instantly familiar to any database professional. We’ve also built a UI to make governance easy for data stewards, and we’ve extended the SQL model to support attribute-based access control, allowing you to tag many objects with the same attribute (e.g., “PII data”) and apply one policy to all of them. Finally, the same SQL based interface can be used to manage ML models and external data sources.
  • Central control: Unity Catalog can work across multiple Databricks workspaces, geographic regions and clouds, allowing you to manage all enterprise data centrally. This central position also enables it to track lineage and audit all accesses.
  • Secure access from any platform: Although we love the Databricks platform, we know that many customers will also access the data from other platforms and that they’d like their governance rules to work across them. Unity Catalog enforces security permissions from any client that connects through JDBC/ODBC or through Delta Sharing, the open protocol we’ve launched to exchange large datasets between a wide range of platforms.

Let’s look at how the Unity Catalog can be used to implement common governance tasks.

Easily manage permissions using ANSI SQL

Unity Catalog brings fine-grained centralized governance to all data assets across clouds through the open standard ANSI SQL Data Control Language (DCL). This means administrators can easily grant permission to arbitrary user-specific subsets of the data using familiar SQL — no need to learn an arcane, cloud-specific interface. We’ve also added a powerful tagging feature that lets you control access to multiple data items at once based on attributes to further simplify governance at scale.

Below are a few examples of how you can use SQL grant statements with the Unity Catalog to add permissions to existing data stored on your data lake.

First, you can create tables in the catalog either from scratch or by pointing to existing data in a cloud storage system, such as S3, accessed with cloud-specific credentials:

CREATE EXTERNAL TABLE iot_events LOCATION 's3://...'
  WITH CREDENTIAL iot_iam_role

You can now simply use standard SQL GRANT statements to set permissions, as in any database. Below is an example of how to grant permissions on iot_events to an entire group such as engineers, or on just the date and country columns to the marketing group:

GRANT SELECT ON iot_events TO engineers
GRANT SELECT(date, country) ON iot_events TO marketing

The Unity Catalog also understands SQL views, which let you aggregate data in arbitrarily complex ways. Here is how you can use View-Based Access Control to grant business_analysts access to only an aggregated version of the data:


CREATE VIEW aggregate_data AS
  SELECT date, country, COUNT(*) AS num_events FROM iot_events
  GROUP BY date, country

GRANT SELECT ON aggregate_data TO business_analysts

In addition, the Unity Catalog allows you to set policies across many items at once using attributes (Attribute-Based Access Control), a powerful way to simplify governance at scale. For example, you can tag multiple columns as PII and manage access to all columns tagged as PII in a single rule:

ALTER TABLE iot_events ADD ATTRIBUTE pii ON email
ALTER TABLE users ADD ATTRIBUTE pii ON phone

GRANT SELECT ON DATABASE iot_data
  HAVING ATTRIBUTE NOT IN (pii)
  TO product_managers

Finally, the same attribute system lets you easily govern MLflow models and other objects in a consistent way with your raw data:

GRANT EXECUTE ON MODELS HAVING ATTRIBUTE (eu_data)
  TO eu_product_managers

Discover and govern data assets in the UI

Unity Catalog’s UI makes it easy to discover, describe, audit and govern data assets in one place. Data stewards can set or review all permissions visually, and the catalog captures audit and lineage information that shows you how each data asset was produced and accessed. The UI is designed for collaboration so that data users can document each asset and see who uses it.

The Unity Catalog UI makes it easy for data stewards to confidently manage and secure data access to meet compliance and privacy needs, directly on the lakehouse.

Share data across organizations with Delta Sharing

Every organization needs to share data with customers, partners and suppliers to collaborate. Unity Catalog implements the open source Delta Sharing standard to let you securely share data across organizations, regardless of which computing platform or cloud they run on (any Delta Sharing client can connect to the data).

Share Data Across Organizations with Delta Sharing

Open interfaces for easy access

Unity Catalog works with your existing catalogs, data, storage and computing systems so you can leverage your existing investments and build a future-proof governance model. It can mount existing data in Apache Hive Metastores or cloud storage systems such as S3, ADLS and GCS without moving it. It also connects with governance platforms like Privacera and Immuta to let you define custom workflows for managing access to data. Finally, we designed Unity Catalog so that you can also access it from computing platforms other than Databricks: ODBC/JDBC interfaces and high-throughput access via Delta Sharing allow you to securely query your data from any computing system.

Next steps

As shared in our keynote today, we’re very excited to begin the preview of the Unity Catalog shortly. You can already sign up to join our waitlist. We look forward to your feedback!

--

Try Databricks for free. Get started today.

The post Introducing Databricks Unity Catalog: Fine-grained Governance for Data and AI on the Lakehouse appeared first on Databricks.

Announcing the Launch of Delta Live Tables: Reliable Data Engineering Made Easy



As the amount of data, data sources and data types at organizations grow, building and maintaining reliable data pipelines has become a key enabler for analytics, data science and machine learning (ML). Prioritizing these initiatives puts increasing pressure on data engineering teams because processing the raw, messy data into clean, fresh, reliable data is a critical step before these strategic initiatives can be pursued.

At Data + AI Summit, we announced Delta Live Tables (DLT), a new capability on Delta Lake to provide Databricks customers a first-class experience that simplifies ETL development and management. DLT vastly simplifies the work of data engineers with declarative pipeline development, improved data reliability and cloud-scale production operations.

We developed this product in response to our customers, who have shared their challenges in building and maintaining reliable data pipelines. So let’s take a look at why ETL and building data pipelines are so hard.

Data teams are constantly asked to provide critical data for analysis. To do this, they are expected to quickly turn raw, messy input files into exploratory data analytics dashboards that are accurate and up to date. While the initial steps of writing SQL queries to load and transform the data are fairly straightforward, the challenge arises when these analytics projects require consistently fresh data and the initial SQL queries need to be turned into production-grade ETL pipelines. Fresh data relies on a number of dependencies on various other sources and on the jobs that update those sources. To solve for this, many data engineering teams break up tables into partitions and build an engine that can understand dependencies and update individual partitions in the correct order.

Delta Live Tables (DLT), a new capability on Delta Lake to provide Databricks customers with a first-class experience that simplifies ETL development and management.

Once this is built out, checkpoints and retries are required to ensure that you can recover quickly from inevitable transient failures. On top of that, teams are required to build quality checks to ensure data quality, monitoring capabilities to alert on errors and governance capabilities to track how data moves through the system. And once all of this is done, when a new request comes in, these teams need a way to redo the entire process with changes or new features added on top of it.

With so much of these teams' time spent on tooling instead of transformation, operational complexity begins to take over, and data engineers spend less and less time deriving value from the data.

Delta Live Tables

DLT streamlines and democratizes ETL, making the ETL lifecycle easier and enabling data teams to build and operate their own production ETL pipelines by writing only SQL queries. Simply by adding "LIVE" to your SQL queries, DLT automatically takes care of your operational, governance and quality challenges. With the ability to mix Python with SQL, users get powerful extensions to SQL to implement advanced transformations and embed AI models as part of their pipelines, as sketched below.

DLT makes the ETL lifecycle easier, enabling data teams to build and operate production pipelines by writing only SQL queries.
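To make this concrete, here is a hedged sketch of a small DLT pipeline expressed in SQL (table and column names are illustrative; the exact syntax may differ slightly in the preview):

CREATE LIVE TABLE raw_orders
COMMENT "Raw orders loaded from the landing zone."
AS SELECT * FROM json.`/mnt/landing/orders/`;

CREATE LIVE TABLE clean_orders (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "Orders with a valid order_id, ready for analytics."
AS SELECT order_id, customer_id, order_ts, amount
FROM LIVE.raw_orders;

The LIVE. prefix is what lets DLT infer the dependency between the two tables and keep them updated in the right order.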

Understand your data dependencies

DLT takes the queries that you write to transform your data and, instead of just executing them against a database, deeply analyzes them to understand the data flow between them. Once it understands the data flow, lineage information is captured and used to keep data fresh and pipelines operating smoothly.

Because DLT understands the data flow and lineage, and because this lineage is expressed in an environment-independent way, different copies of the data (i.e., development, staging, production) are isolated and can be updated using a single code base: the same set of query definitions can be run against any of those datasets.

The ability to track data lineage is hugely beneficial for improving change management and reducing development errors, but most importantly, it provides users the visibility into the sources used for analytics – increasing trust and confidence in the insights derived from the data.

DLT provides users of the data the visibility into the sources used for analytics - increasing trust and confidence in the insights derived from the data.

Gain deep visibility into your pipelines

DLT provides deep visibility into pipeline operations with detailed logging and tools to visually track operational stats and quality metrics. With this capability, data teams can understand the performance and status of each table in the pipeline. Data engineers can see which pipelines have run successfully or failed, and can reduce downtime with automatic error handling and easy refresh.

DLT provides deep visibility into pipeline operations with detailed logging and tools to visually track operational stats and quality metrics.

Treat your data as code

One of the core ideas we considered in building this new product, an idea that has become popular across many data engineering projects today, is treating your data as code. Your data should be a single source of truth for what is going on inside your business. Beyond the transformations themselves, there are a number of other things that should be included in the code that defines your data:

  1. Quality Expectations: With declarative quality expectations, DLT allows users to specify what makes bad data bad and how bad data should be addressed with tunable severity.
  2. Documentation with Transformation: DLT enables users to document where the data comes from, what it’s used for and how it was transformed. This documentation is stored along with the transformations, guaranteeing that this information is always fresh and up to date.
  3. Table Attributes: Attributes of a table (e.g., contains PII), along with quality and operational information about table execution, are automatically captured in the Event Log. This information can be used to understand how data flows through an organization and to meet regulatory requirements. A Python sketch of these ideas follows this list.
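Here is a hedged Python sketch of the same ideas, using the dlt module that DLT pipelines import (the table names, properties and upstream table are illustrative):

import dlt
from pyspark.sql import functions as F

@dlt.table(
    comment="Orders with a valid order_id, ready for analytics.",  # documentation lives with the code
    table_properties={"contains_pii": "false"}                     # table attributes captured with the table
)
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")       # declarative quality expectation
def clean_orders():
    return (
        dlt.read("raw_orders")
           .select("order_id", "customer_id", "order_ts", "amount")
           .withColumn("amount", F.col("amount").cast("double"))
    )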

With declarative pipeline development, improved data reliability and cloud-scale production operations, DLT makes the ETL lifecycle easier and enables data teams to build and leverage their own data pipelines to get to insights faster, ultimately reducing the load on data engineers.

“At Shell, we are aggregating all our sensor data into an integrated data store — working at the multi-trillion-record scale. Delta Live Tables has helped our teams save time and effort in managing data at this scale. We have been focusing on continuously improving our AI engineering capability and have an Integrated Development Environment (IDE) with a graphical interface supporting our Extract Transform Load (ETL) work.
With this capability augmenting the existing lakehouse architecture, Databricks is disrupting the ETL and data warehouse markets, which is important for companies like ours. We are excited to continue to work with Databricks as an innovation partner.”
— Dan Jeavons, General Manager — Data Science, Shell

Getting started

Delta Live Tables is currently in Private Preview and is available to customers upon request. Existing customers can request access to DLT to start developing DLT pipelines here. Visit the Demo Hub to see a demo of DLT and the DLT documentation to learn more.

As this is a gated preview, we will onboard customers on a case-by-case basis to guarantee a smooth preview process. We have limited slots for preview and hope to include as many customers as possible. If we are unable to onboard you during the gated preview, we will reach out and update you when we are ready to roll out broadly.

Learn more about Delta Live Tables directly from the product and engineering team by attending the free Data + AI Summit.

--

Try Databricks for free. Get started today.

The post Announcing the Launch of Delta Live Tables: Reliable Data Engineering Made Easy appeared first on Databricks.

Databricks Announces the First Feature Store Co-designed with a Data and MLOps Platform


Today, we announced the launch of the Databricks Feature Store, the first of its kind that has been co-designed with Delta Lake and MLflow to accelerate ML deployments. It inherits all of the benefits from Delta Lake, most importantly: data stored in an open format, built-in versioning and automated lineage tracking to facilitate feature discovery. By packaging up feature information with the MLflow model format, it provides lineage information from features to models, which facilitates end-to-end governance and model retraining when data changes. At model deployment, the models look up features from the Feature Store directly, significantly simplifying the process of deploying new models and features.

The data problem in AI

Raw data (transaction logs, click history, images, text, etc.) cannot be used directly for machine learning (ML). Data engineers, data scientists and ML engineers spend an inordinately large amount of time transforming data from its raw form into the final “features” that can be consumed by ML models. This process is also called “feature engineering” and can involve anything from aggregating data (e.g. number of purchases for a user in a given time window) to complex features that are the result of ML algorithms (e.g. word embeddings).

This interdependency of data transformations and ML algorithms poses major challenges for the development and deployment of ML models.

  • Online/offline skew: For some of the most meaningful ML use cases, models need to be deployed online at low latency (think of a recommendation model that needs to run in tens of milliseconds when a webpage loads). The transformations that were used to compute features at training time (offline) now need to be repeated in model deployment (online) at low latency. This often leads teams to reimplement features multiple times, introducing subtle differences (called online/offline skew) that have significant impact on the model quality.
  • Reusability and discoverability: In most cases, features get reimplemented several times because they are not discoverable, and if they are, they are not managed in a way that facilitates reuse. Naive approaches to this problem provide search based on feature names, which requires data scientists to correctly guess which names someone else used for their features. Furthermore, there is no way that data scientists can know which features are used where, making decisions such as updating or deleting a feature table difficult.

The Databricks Feature Store takes a unique approach to solving the data problem in AI

The Databricks Feature Store is the first of its kind that is co-designed with a data and MLOps platform. Tight integration with the popular open source frameworks Delta Lake and MLflow guarantees that data stored in the Feature Store is open, and that models trained with any ML framework can benefit from the integration of the Feature Store with the MLflow model format. As a result, the Feature Store provides several unique differentiators that help data teams accelerate their ML efforts:

  • Eliminating online/offline skew with native model packaging: MLflow integration enables the Feature Store to package up feature lookup logic hermetically with the model artifact. When an MLflow model that was trained on data from the Feature Store is deployed, the model itself will look up features from the appropriate online store. This means that the client that calls the model can be oblivious to the fact that the Feature Store exists in the first place. As a result, the client becomes less complex and feature updates can be made without any changes to the client that calls the model.
  • Enabling reusability and discoverability with automated lineage tracking: Computing features in a data-native environment enables the Databricks Feature Store to automatically track the data sources used for feature computation, as well as the exact version of the code that was used. This facilitates lineage-based search: a data scientist can take their raw data and find all of the features that are already being computed based on that same data. In addition, integration with the MLflow model format provides downstream lineage from features to models: the Feature Store knows exactly which models and endpoints consume any given feature, facilitating end-to-end lineage, as well as safe decision-making on whether a feature table can be updated or deleted.

The Feature Store UI as a central repository for discovery, collaboration and governance

The Feature Store is a central repository of all the features in an organization. It provides a searchable record of all features, their definition and computation logic, data sources, as well as producers and consumers of features. Using the UI, a data scientist can:

  • Search for feature tables by feature table name, feature, or data source
  • Navigate from feature tables to their features and connected online stores
  • Identify the data sources used to create a feature table
  • Identify all of the consumers of a particular feature, including Models, Endpoints, Notebooks and Jobs
  • Control access to a feature table's metadata


Feature Store search using feature name ‘customer_id’, feature tables with ‘feature_pipeline’, and ‘raw_data’ sources in their name.

The Databricks Feature Store is fully-integrated with other Databricks components. This native integration provides full lineage of the data that was consumed to compute features, all of the consumers that request features, and models in the Databricks Model Registry that were trained using features from the Feature Store.

Feature table 'user_features.behavior' shows lineage to data sources and producer notebooks and models, as well as consumer models, endpoints and notebooks.

Consistent access to features offline at high throughput and online at low latency

The Feature Store supports a variety of offline and online feature providers that applications use to access features. Features are served in two modes. The batch layer provides features at high throughput for training ML models and for batch inference. Online providers serve features at low latency for consumption of the same features in online model serving. An online provider is a pluggable, common abstraction that supports a variety of online stores, with APIs to publish features from offline to online stores and to look up features.

Features are computed using batch computation or as streaming pipelines on Databricks and stored as Delta tables. These features can then be published using a scheduled Databricks Job or as a streaming pipeline. This ensures consistency of features used in batch training and features used in batch or online model inference, guaranteeing that there is no drift between features that were consumed at training and at serving time.

Model packaging

Integration with the MLflow Model format ensures that feature information is packaged with the MLflow model. During model training, the Feature Store APIs will automatically package the model artifact with all the feature information and code required at runtime for looking up feature information from the Feature Registry and fetching features from suitable providers.

feature_spec.yaml containing feature store information is packaged with the MLflow model artifact.

This feature spec, packaged with the model artifact, provides the necessary information for the Feature Store scoring APIs to automatically fetch and join the features by the stored keys for scoring input data. At model deployment, this means the client that calls the model does not need to interact with the Feature Store. As a result, features can be updated without making changes to the client.

Feature Store API workflows

The FeatureStoreClient Python library provides APIs to interact with the Feature Store components to define feature computation and storage, use existing features in model training and automatic lookup for batch scoring, and publish features to online stores. The library comes automatically packaged with Databricks Runtime for Machine Learning (v 8.3 or later).

Creating new features

The low level APIs provide a convenient mechanism to write custom feature computation code. Data scientists write a Python function to compute features using source tables or files.

def compute_customer_features(data):
  '''Custom function to compute features and return a Spark DataFrame'''
  pass

customer_features_df = compute_customer_features(input_df)

To create and register new features in the Feature Registry, call the create_feature_table API. You can specify a specific database and table name as the destination for your features. Each feature table needs a primary key to uniquely identify the entity's feature values.

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

customer_feature_table = fs.create_feature_table(
  name='recommender.customer_features',
  keys='customer_id',
  features_df=customer_features_df,
  description='Customer features. Uses data from app interactions.'
)

Training a model using features from the Feature Store

In order to use features from the Feature Store, create a training set that identifies the features required from each feature table and describes the keys from your training dataset that will be used to look up or join with the feature tables.

The example below uses two features ('total_purchases_30d' and 'page_visits_7d') from the customer_features table, joining the customer_id from the training DataFrame against the feature table's primary key. It also uses 'quantity_sold' from the product_features table, with 'product_id' and 'country_code' as a composite primary key to look up features from that table. It then uses the create_training_set API to build the training set, excluding keys that are not required for model training.

from databricks.feature_store import FeatureLookup

feature_lookups = [
    FeatureLookup(
      table_name = 'prod.customer_features',
      feature_name = 'total_purchases_30d',
      lookup_key = 'customer_id'
    ),
    FeatureLookup(
      table_name = 'prod.customer_features',
      feature_name = 'page_visits_7d',
      lookup_key = 'customer_id'
    ),
    FeatureLookup(
      table_name = 'prod.product_features',
      feature_name = 'quantity_sold',
      lookup_key = ['product_id', 'country_code']
    )
  ]

fs = FeatureStoreClient()

training_set = fs.create_training_set(
  df,
  feature_lookups = feature_lookups,
  label = 'rating',
  exclude_columns = ['customer_id', 'product_id']
)

You can use any ML framework to train the model. The log_model API packages the feature lookup information along with the MLflow model; this information is used to look up features during model inference.

import mlflow
from sklearn import linear_model

# Load the training set (input DataFrame joined with the looked-up features)
# as a pandas DataFrame
training_df = training_set.load_df().toPandas()

X_train = training_df.drop(['rating'], axis=1)
y_train = training_df.rating

model = linear_model.LinearRegression().fit(X_train, y_train)

fs.log_model(
  model,
  "recommendation_model",
  flavor=mlflow.sklearn,
  training_set=training_set,
  registered_model_name="recommendation_model"
)

Feature lookup and batch scoring

When scoring a model packaged with feature information, the features are looked up automatically; the user only needs to provide the primary key columns as input. Under the hood, the Feature Store’s score_batch API uses the feature spec stored in the model artifact to consult the Feature Registry for the specific tables, feature columns and join keys. The API then performs efficient joins with the appropriate feature tables to produce a DataFrame with the schema the model expects. The following code shows this operation with the model trained above.

# batch_df has columns 'customer_id' and 'product_id'
# model_uri points to the model logged above, e.g. 'models:/recommendation_model/1'
predictions = fs.score_batch(
    model_uri,
    batch_df
)

# The returned 'predictions' DataFrame has these columns:
#  inputs from batch_df: 'customer_id', 'product_id'
#  features: 'total_purchases_30d', 'page_visits_7d', 'quantity_sold'
#  model output: 'prediction'

Publishing features to online stores

To publish a feature table to an online store, first create an online store spec and pass it to the publish_table API. The following example overwrites the existing online table with the latest features from the customer_features table in the batch provider.

from databricks.feature_store.online_store_spec import AmazonRdsMySqlSpec

online_store = AmazonRdsMySqlSpec(hostname, port, user, password)

fs.publish_table(
  name='recommender_system.customer_features',
  online_store=online_store,
  mode='overwrite'
)

publish_table supports various options to filter out specific feature values (by date or any filter condition). In the following example, today’s feature values are streamed into the online store and merged with the existing features.

import datetime

fs.publish_table(
  name='recommender_system.customer_features',
  online_store=online_store,
  filter_condition=f"_dt = '{datetime.date.today()}'",
  mode='merge',
  streaming=True
)

Get started with the Feature Store

Ready to get started or try it out for yourself? You can read more about Databricks Feature Store and how to use it in our documentation at AWS, Azure and GCP.

--

Try Databricks for free. Get started today.


Introducing Databricks AutoML: A Glass Box Approach to Automating Machine Learning Development


Today, we announced Databricks AutoML, a tool that empowers data teams to quickly build and deploy machine learning models by automating the heavy lifting of preprocessing, feature engineering and model training/tuning. With this launch, data teams can select a dataset, configure training, and deploy models entirely through a UI. We also provide an advanced experience in which data scientists can access generated notebooks with the source code for each trained model to customize training or collaborate with experts for productionization. Databricks AutoML integrates with the Databricks ML ecosystem, including automatically tracking trial run metrics and parameters with MLflow and easily enabling teams to register and version control their models in the Databricks Model Registry for deployment.


A glass box approach to AutoML

Today, many existing AutoML tools are opaque boxes — meaning users don’t know exactly how a model was trained. Data scientists hit a wall with these tools when they need to make domain-specific modifications or when they work in an industry that requires auditability for regulatory reasons. Data teams then have to invest the time and resources to reverse engineer these models to make customizations, which counteracts many of the productivity gains they were supposed to receive.

That’s why we are excited to bring customers Databricks AutoML, a glass box approach to AutoML that provides Python notebooks for every model trained to augment developer workflows.

Data scientists can leverage their domain expertise and easily add or modify cells to these generated notebooks. Data scientists can also use Databricks AutoML generated notebooks to jumpstart ML development by bypassing the need to write boilerplate code.


Get quick insights into datasets

In addition to model training and selection, Databricks AutoML creates a data exploration notebook that gives basic summary stats on a dataset. By automating the data exploration stage, which many find tedious, Databricks AutoML saves data scientists time and lets them quickly confirm that their datasets are fit for training. The data exploration notebooks use pandas profiling to provide users with warnings (high cardinality, correlations and null values) as well as information on the distribution of variables.
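
Outside of the generated notebook, a similar report can be produced directly with the open source pandas-profiling library. This is a rough, self-contained sketch (the file path and title are placeholders), not the exact code AutoML generates:

import pandas as pd
from pandas_profiling import ProfileReport

# Load a sample of the training data as a pandas DataFrame (placeholder path)
pdf = pd.read_csv('/dbfs/tmp/sample_training_data.csv')

# Summary stats, distributions, correlations and missing-value warnings
profile = ProfileReport(pdf, title='Training data overview', minimal=True)
profile.to_file('training_data_profile.html')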


Learn ML best practices

The AutoML experience integrates with MLflow, our API for tracking metrics and parameters across trial runs, and uses ML best practices to help improve productivity on data science teams:

  • From the Experiments page, data scientists can compare trial runs and register and serve models in the Databricks Model Registry.
  • Our generated training notebooks provide all the code used to train a given model, from loading data to splitting test/train sets to tuning hyperparameters to displaying SHAP plots for explainability.


AutoML public preview features

The Databricks AutoML Public Preview parallelizes training over sklearn and xgboost models for classification (binary and multiclass) and regression problems. We support datasets with numerical, categorical and timestamp features and automatically handle one-hot encoding and null imputation. Trained models are sklearn pipelines such that all data preprocessing is wrapped with the model for inference.
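
Because every trial is logged with MLflow, the resulting sklearn pipeline can be loaded back and applied to new data with standard MLflow calls. A minimal sketch, not taken from the original post, with a placeholder run ID and input path:

import mlflow.pyfunc
import pandas as pd

# Placeholder URI; substitute the run ID of the chosen AutoML trial
model = mlflow.pyfunc.load_model('runs:/<automl_trial_run_id>/model')

# new_data has the same raw columns used for training; one-hot encoding and
# null imputation happen inside the pipeline
new_data = pd.read_csv('/dbfs/tmp/new_records.csv')
predictions = model.predict(new_data)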


Additionally, Databricks AutoML has several advanced options. Many teams are trying to get quick answers from AutoML, so customers can control how long AutoML training lasts through configurable stopping conditions: a wall-clock timeout or the maximum number of trials to run. They can also configure the evaluation metric for ranking model performance.


Get started with Databricks AutoML public preview

Databricks AutoML is now in Public Preview and is part of the Databricks Machine Learning experience. To get started:

  • Use the left-hand sidebar to switch to the “Machine Learning” experience to access Databricks AutoML via the UI. Click on the “(+) Create” on the left navigation bar and click “AutoML Experiment” or navigate to the Experiments page and click “Create AutoML Experiment” to get started.
  • Use the AutoML API, a single-line call, which can be seen in our documentation (a rough sketch follows this list).
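
Here is a rough sketch of what that single-line call can look like for a classification problem; the DataFrame and target column are placeholders, and parameter names are best confirmed in the documentation for your runtime version:

from databricks import automl

# df is a Spark or pandas DataFrame; 'churn' is a placeholder target column
summary = automl.classify(df, target_col='churn', timeout_minutes=30)

# The returned summary references the MLflow experiment, the data exploration
# notebook and the generated notebook for each trial, including the best one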


Ready to get started or try Databricks AutoML out for yourself? Read more about Databricks AutoML and how to use it on AWS, Azure, and GCP.

--

Try Databricks for free. Get started today.


Introducing Databricks Machine Learning: a Data-native, Collaborative, Full ML Lifecycle Solution


Today, we announced the launch of Databricks Machine Learning, the first enterprise ML solution that is data-native, collaborative, and supports the full ML lifecycle. This launch introduces a new purpose-built product surface in Databricks specifically for Machine Learning (ML) that brings together existing capabilities, such as managed MLflow, and introduces new components, such as AutoML and the Feature Store. Databricks ML provides a solution for the full ML lifecycle by supporting any data type at any scale, enabling users to train ML models with the ML framework of their choice and managing the model deployment lifecycle – from large-scale batch scoring to low latency online serving.


The hard part about AI is data

Many ML platforms fall short because they ignore a key challenge in ML: they assume that high-quality data is ready and available for training. That requires data teams to stitch together solutions that are good at data but not AI, with others that are good at AI but not data. To complicate things further, the people responsible for data platforms and pipelines (data engineers) are different from those that train ML models (data scientists), who are different from those who deploy product applications (engineering teams who own business applications). As a result, solutions for ML need to bridge gaps between data and AI, the tooling required and the people involved.

The answer is a data-native and collaborative solution for the full ML lifecycle

Data-native
ML models are the result of “compiling” data and code into a machine learning model. However, existing tools used in software development are inadequate for dealing with this interdependency between data and code. Databricks ML is built on top of an open data lakehouse foundation, which makes it the first data-native ML solution. Capabilities include:

  • Any type of data, at any scale, from any source: With the Machine Learning Runtime, users can ingest and process images, audio, video, tabular or any other type of data – from CSV files to terabytes of streaming IoT sensor data. With an open source ecosystem of connectors, data can be ingested from any data source, across clouds, from on prem or from IoT sensors.
  • Built-in data versioning, lineage and governance: Integrating with the time travel feature of Delta Lake, Databricks ML automatically tracks the exact version of data used to train a model. Combined with other lineage information logged by MLflow, this provides full end-to-end governance to facilitate robust ML pipelines. (A minimal time travel read is sketched after this list.)
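
For context, and not taken from the original post, this is roughly what a Delta Lake time travel read looks like in a Databricks notebook (where spark is predefined); the path and version number are placeholders:

# Read the current state of a feature table (placeholder path)
current_df = spark.read.format('delta').load('/mnt/lake/customer_features')

# Reproduce the exact snapshot that a model was trained on
training_snapshot_df = (
  spark.read.format('delta')
    .option('versionAsOf', 12)   # placeholder version number
    .load('/mnt/lake/customer_features')
)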

Collaborative
Fully productionizing ML models requires contributions from data engineers, data scientists and application engineers. Databricks ML facilitates collaboration for all members of a data team by both supporting their workflows on Databricks and by providing built-in processes for handoffs. Key features include:

  • Multi-language notebooks: Databricks Notebooks support Python, SQL, R and Scala within the same notebook. This provides flexibility for individuals who want to mix and match, and also collaboration across individuals who prefer different languages.
  • Cloud-native collaborative features: Databricks Notebooks can be shared and jointly worked on in real time. Users can see who is active in a notebook from the co-presence indicator and watch their changes in real time. Built-in comments further facilitate collaboration.
  • Model lifecycle management: The Model Registry is a collaborative hub in which teams can share ML models, collaborate on everything from experimentation to online testing and production, integrate with approval and governance workflows, and monitor ML deployments and their performance.
  • Sharing and managed access: To enable secure collaboration, Databricks provides fine-grained access controls on all types of objects (Notebooks, Experiments, Models, etc.).

Full ML lifecycle
MLOps is a combination of DataOps, DevOps and ModelOps. To get MLOps right, there is a vast ecosystem of tools that needs to be integrated. Databricks ML takes a unique approach to supporting the full ML lifecycle and true MLOps.


  • DataOps: Through its data-native nature, Databricks ML is the only ML platform that provides built-in data versioning and governance. The exact version of the data is logged with every ML model that is trained on Databricks.
  • DevOps: Databricks ML provides native integration with Git providers through its Repos feature, enabling data teams to follow best practices and integrate with CI/CD systems.
  • ModelOps: With managed MLflow, Databricks ML provides a full set of features, from tracking ML models with their associated parameters and metrics, to managing the deployment lifecycle, to deploying models across all modes (from batch to online scoring) on any platform (AWS, Azure, GCP, on-prem or on-device).
  • Full reproducibility: Providing a well-integrated solution for the full ML lifecycle means that work on Databricks ML is fully reproducible: data, parameters, metrics, models, code, compute configuration and library versions are all tracked and can be reproduced at any time.

New persona-based navigation and machine learning dashboard
To simplify the full ML lifecycle on Databricks, we introduce a new persona-based navigation. Machine Learning is a new option, alongside Data Science & Engineering and SQL. By selecting Machine Learning, users gain access to all of the tools and features to train, manage and deploy ML models. We also provide a new ML landing page where we surface recently accessed ML assets (e.g. Models, Features, Experiments) and ML-related resources.


Introducing: Feature Store and AutoML
The latest additions to Databricks Machine Learning further underscore the unique attributes of a data-native and collaborative platform:

Feature Store
Our Feature Store is the first feature store co-designed with a data and MLOps platform. It facilitates reuse of features with a centralized Feature Registry and eliminates the risk of online/offline skew by providing consistent access to features offline (for training and batch scoring) and online (for model serving).


  • To facilitate end-to-end lineage and lineage-based search, the Feature Registry tracks all Feature Tables, the code that produces them, the source data used to compute features and all consumers of features (e.g. Models and Endpoints). This provides full lineage from raw data, to the feature tables computed from that raw data, to the models that consume those feature tables.
  • To ensure feature consistency between training and serving and eliminate offline/online skew, the feature provider makes features available at high throughput and low latency. The feature provider also integrates with MLflow, simplifying the model deployment process. The MLflow model format stores information about which features the model consumed from the Feature Store, and at deployment the model takes care of the feature lookup, allowing the client application that calls the model to be entirely unaware of the Feature Store.

Read more about our Feature Store product in the Feature Store launch blog post.

AutoML
Our AutoML product takes a glass box approach that provides a UI-based workflow for citizen data scientists to deploy recommended models. AutoML also generates the training code that a data scientist would write if they developed the same model themselves. This transparency is critical in highly regulated environments and for collaboration with expert data scientists.


  • In highly regulated environments, auditability and reproducibility are often hard requirements. Most AutoML products are opaque boxes that only provide a model artifact, making it hard to meet regulatory requirements such as providing visibility into what type of model was trained. Because Databricks AutoML generates the full Python notebook with the training code, we provide full visibility for regulators.
  • For collaboration with expert data scientists, the generated code is a starting point that can be adjusted using domain expertise. In practice, AutoML is often used to produce a baseline model, and once a model shows promise, experts can refine it.

Read more about our AutoML product in the AutoML launch blog post.

Getting Started

Databricks Machine Learning is available to all Databricks customers starting today. Simply click on the new persona switcher and select Machine Learning. The new navigation bar will give you access to all ML features, and the ML Dashboard will guide you through relevant resources and provide access to your recently used ML artifacts. Learn more in our documentation for AWS, Azure and GCP.

Learn more about the new features of Databricks Machine Learning directly from the product and engineering team by attending the free Data + AI Summit.



Congratulations to the 2021 Databricks Data Team Award Winners


During the Data + AI Summit (formerly known as Spark + AI Summit), Databricks CEO Ali Ghodsi recognized four exceptional data teams for how they have used data, machine learning and AI to solve some of the world’s toughest problems.

The second annual Databricks Data Team Awards brought a diverse set of submissions, representing many industries, use cases, and geographies. Across the board, the finalists showcased how the Databricks Lakehouse Platform can help bring together the diverse talents of data engineers, data scientists and data analysts to focus their ideas, skills and energy toward accomplishing amazing things.

We are proud to recognize and celebrate this year’s Data Team Award winners, H&M Group, John Deere, Scribd, and the US Department of Veterans Affairs. Hear how these organizations are using data in very different and unique ways to do incredible work.

2021 Data Team Winners

Data Team for Good Award: US Department of Veterans Affairs
The U.S. Department of Veterans Affairs (VA) is advancing its efforts to prevent veteran suicide through data and analytics. In 2020, Databricks, available as a native cloud service on Microsoft Azure, achieved FedRAMP High authorization and became available to VA. With the Databricks Lakehouse Platform, VA was able to enhance algorithms like the Medication Possession Ratio (MPR), shortening MPR computation by 88% and expanding the scope of data to improve the accuracy and efficiency of the algorithm. MPR, which evaluates a Veteran’s medication regimen, is one of the factors that improve prediction of suicide risk and opioid overdose risk. With Spark on Databricks, VA is able to quickly ingest and process over 6 million patient records (representing 130 VA Medical Centers), including 20 years of medical history data and all medications on VA’s national drug list. The U.S. Department of Veterans Affairs is the largest health system in the US, and it is using data and analytics to serve Veterans and save the lives of those who have risked so much.

Accepting the award on behalf of the team at the US Department of Veterans Affairs were Wanmei Ou, Ph.D., White House Presidential Innovation Fellow and Head of the Rockies Data and Analytics Platform, and Jodie Trafton, Ph.D., Director, Program Evaluation and Resource Center, VHA Office of Mental Health and Suicide Prevention.

Finalists: Artemis Health, Daimler, Samsara

Data Team Innovation Award: John Deere
John Deere is leveraging big data and AI to deliver ‘smart’ industrial solutions that are revolutionizing agriculture and construction, driving sustainability and ultimately helping to meet the world’s increasing need for food, fuel, shelter and infrastructure. Their Enterprise Data Lake (EDL) built upon the Databricks Lakehouse Platform is at the core of this innovation, with petabytes of data and trillions of records being ingested into the EDL, giving data teams fast, reliable access to standardized data sets to innovate and deliver ML and analytics solutions ranging from traditional IT use cases to customer applications. From IoT sensor-enabled equipment driving proactive alerts that prevent failures, to precision agriculture that maximizes field output, to optimizing operations in the supply chain, finance and marketing, John Deere is providing advanced products, technology and services for customers who cultivate, harvest, transform, enrich, and build upon the land.

Accepting the award on behalf of the team at John Deere was Brian Roller, Senior Software Engineer.

Finalists: Atlassian, Virgin Hyperloop

Data Team OSS Award: Scribd
With over a million paid subscribers and 100 million monthly visitors, Scribd is on a mission to change the way the world reads. Through their monthly subscription, readers gain access to the best ebooks, audiobooks, magazines, podcasts and much more. Scribd leverages data and analytics to uncover interesting ways to get people excited about reading. The data engineering team optimizes the Lakehouse architecture to ultimately deliver an engaging experience to their customers. Additionally, the team contributes to multiple open source projects and has a deep understanding of the Delta Lake OSS ecosystem.

Accepting the award on behalf of the team at Scribd was R Tyler Croy, Director Of Platform Engineering.

Finalists: Adobe, Healthgrades

Data Team Impact Award: H&M
The H&M group is a family of brands serving markets around the globe. Throughout its deep fashion heritage, dating back to 1947, H&M Group has been committed to making fashion available for everyone; now, they want to make sustainable fashion affordable and available for everyone. H&M Group has been making necessary investments in digitizing its supply chain, logistics, tech infrastructure and AI, with Databricks providing key data and analytics capabilities that enable the fast development and deployment of AI solutions to market. By harnessing the power of data and technology, H&M Group continues to lead the industry by making its operations more efficient and sustainable, while also remaining at the heart of fashion and lifestyle inspiration for its customers.

Accepting the award on behalf of the team at H&M was Errol Koolmeister, Head of AI Foundation.

Finalists: Asurion, GSK

Cheers to your success
Congratulations to the Data Team Award winners for their exceptional achievements! We will continually celebrate data teams for using data and AI to make a difference.

--

Try Databricks for free. Get started today.


Jump Start Your Data Projects with Pre-Built Solution Accelerators


Deliver value faster.

We hear this theme in nearly every executive discussion with customers. Data teams and data leaders need to deliver value in weeks, not months or years. The business climate is volatile, and they don’t have the luxury of long project timelines to deliver data and analytic capabilities designed to drive business value, such as increased revenues or decreased costs.

At the same time, data teams often face resource constraints like the lack of in-house experts in Python or Scala or even a broader lack of deep data science expertise. Additionally, data teams need to find special “unicorns” that have not just the technical skills, but also the domain knowledge to understand the nuances of the industry dynamics (for example, regulatory constraints) being solved.

Even with the right talent, data teams often need to spend weeks or months researching, building the back-end data pipelines to serve their models, developing the models, and then optimizing the code for a proof of concept (POC). This lag between identifying the need, working through potential solutions, finalizing an implementation and seeing results drains momentum from even the most important data science initiatives.

Our customers have asked for a more prescriptive approach based on best practices from our work spanning thousands of clients in various industries, from the largest household names to digital-native challenger brands, with the most demanding SLAs.

Meet the Databricks Solution Accelerators

To help our customers overcome these challenges, we’re proud to offer a rich portfolio of Solution Accelerators. Solution Accelerators are fully functional, pre-built code that tackles the most common and high-impact use cases our customers are facing. They are designed to help Databricks customers go from idea to POC in less than two weeks.

From its inception, Databricks has been focused on helping customers accelerate value through data science. To enable us to go deeper into specific domains, we’ve assembled a team of seasoned executives and experts in Retail & Manufacturing, Financial Services, Media & Communications and Healthcare & Life Sciences and are focusing these resources on tackling the most pressing use cases in each of these industries.

These teams are focused on the development of Solution Accelerators within their industries. Each Solution Accelerator is backed by extensive research on emerging and best practices, is fully optimized to take advantage of Databricks performance and includes additional enablement materials such as blogs, webinars and business value calculators that help customers consider how an implementation may impact their organization’s goals. These assets are made freely available to Databricks customers through our public blogs, industry-aligned webinars and engagement with local Databricks representatives.

Databricks Solution Accelerators vs. the traditional approach

The Databricks Solution Accelerators provide core analytics with a modular design so you can focus on integration and customization for your business, delivering faster time to value for data teams with a POC in under two weeks.

Solution Accelerators aren’t designed to be a one-size-fits-all, complete solution. They’re accelerators. They can be extended with customer data, customized to specific business needs and integrated into existing processes. We’ve seen many customers begin with a Solution Accelerator POC and bring a full solution to production several weeks after the POC.

Solution Accelerators now in market or coming soon

You can see all our accelerators at our Databricks Solution Accelerator hub.

Databricks is committed to continually adding to and updating these Solution Accelerators across industries.

We have also published explainers walking through a couple of our solutions now in market.

Additionally, we have several solutions dedicated specifically to Cloudera/Hadoop-to-Cloud Migration and Automated/Production Pipelines Migration going across industries.

Getting started:

If you’re ready to get started, visit the Databricks Solution Accelerator hub page to find all the relevant assets for your use case.

You can also contact your account team to:

  • Help implement fast POCs in your environment.
  • Perform business value assessments to support your business case.
  • Engage Databricks Professional Services or third-party SIs for a full implementation.

Or visit the Databricks Blog to get:

  • A full description of the problem, challenges and approach.
  • Direct access to notebooks that you can load into your environment.
  • The latest Solution Accelerator launches.

You can also check out our events page for the latest solution webinars, which include:

  • A walk-through of the business problem, challenges and impact.
  • A technical walk-through of implementing the solution.

--

Try Databricks for free. Get started today.


Celebrating Asian American Pacific Islander Heritage Month


We joined Databricks in early October of 2020, in the middle of the pandemic with no clear end in sight. Our remote onboarding began with a few emails, some icebreaker slides, Kahoots and the ubiquitous Zoom session. Our cohort was a diverse group consisting of new hires from at least four different departments in the company, and our only initial commonality seemed to be the new red badge and a slight competitive streak at trivia. We greeted each other and then went to our separate technical training sessions. The technical process was smooth but fully onboarding at a company also requires building relationships, and those take time.

We initially started contemplating an Asian Employee Resource Group in January to celebrate important cultural events and build community, especially for those starting remotely. Employee Resource Groups (ERGs) are voluntary, employee-led collectives designed to foster an inclusive workplace by creating a place for individuals from underrepresented backgrounds (and allies) to come together to discuss important issues and build connections. The Databricks Asian Employee Network (AEN) Employee Resource Group was officially launched on Lunar New Year 2021 with the goal of fostering a community of Bricksters who can learn from one another, advocate for each other and work together to advance awareness, inclusion, engagement and equitable policies; basically to help us all onboard a bit better.

During this time, some of us had been personally feeling the impact of the increasing anti-Asian sentiment and violence. The pandemic had amplified and brought this hateful rhetoric to the forefront. Our fledgling ERG became a safe space where we could help each other process these events, including sharing our collective fear and frustration, and also banding together to take action.

Since our group’s inception, we’ve helped organize two company-wide fundraisers to support the #StopAAPIHate movement, as well as COVID-19 relief efforts in India. Building on this incredible momentum, in May we celebrated Asian American Pacific Islander (AAPI) Heritage Month with virtual events and programs that highlighted historic achievements and contributions of the AAPI community across the globe. Here are some of our favorite moments:

Asian Employee Network Block Party
We were excited to host our virtual Block Party to create a fun and interactive space with themed rooms for everyone to develop more meaningful connections. We shared recipes in the potluck room, created a playlist together in the music room and chatted about Asian literature in the “Lit Corner!”

Snapshot from the Asian Employee Network Block Party virtual event

“The In-Betweens” Keynote with Jeff Chang 
We also hosted Jeff Chang, award-winning author, American historian and Vice President at Race Forward for an intimate fireside chat on the “in-between” status that the AAPI community faces in the 21st century. This powerful talk centered on building solidarity and discussed how important it is for us to address the long history of violence against Asian Americans in order to truly achieve racial justice for all.


Reflecting on our shared history 
This month we thought it was important to take time to learn more about the history of Asian Americans and hear the diverse experiences and stories from these Asian and Pacific Islander communities. To help facilitate this collective reflection, we organized movie screenings and discussions throughout the month in addition to recommending books by Asian and Pacific Islander authors. Our team decided to feature the first episode of the PBS film series, Asian Americans, as well as the Oscar-nominated film, Minari. These powerful stories showcased the incredible role Asian Americans have had in shaping our history.

 

We are so proud of the Asian Employee Network and the supportive community we have been able to build in just a few months. Our team is excited to continue growing and having an even bigger positive impact in the years to come!

Interested in joining Databricks?

Visit our careers page to explore our global opportunities and to learn more about how you can join our Databricks community.

--

Try Databricks for free. Get started today.


Don’t Miss These Top 10 Announcements From Data + AI Summit


The 2021 Data + AI Summit was filled with so many exciting announcements for open source and Databricks, talks from top-tier creators across the industry (such as Rajat Monga, co-creator of TensorFlow) and guest luminaries like Bill Nye, Malala Yousafzai and the NASA Mars Rover team. You can watch the keynotes, meetups and hundreds of talks on-demand on the Summit platform, which is available through June 28th and requires free registration.

In this post, I’d like to cover my personal top 10 announcements in open source and Databricks from Summit. They are in no particular order and links to the talks go to the Summit platform.

Delta Lake 1.0

The Delta Lake open source project is a key enabler of the lakehouse, as it fixes many of the limitations of data lakes: data quality, performance and governance. The project has come a long way since its initial release, and the Delta Lake 1.0 release was just certified by the community. The release includes a variety of new features, such as generated columns, cloud independence with multi-cluster writes and my favorite, Delta Lake Standalone, which reads Delta tables without requiring Apache Spark™.

Delta Lake 1.0 announcement from the Wednesday AM Keynote


We also announced a number of new committers to the Delta Lake project, including QP Hou, R. Tyler Croy, Christian Williams, Mykhailo Osypov and Florian Valeye.

Learn more about Delta Lake 1.0 in the keynotes from co-creator and Distinguished Engineer Michael Armbrust.

Delta Sharing

Open isn’t just about open source; it’s also about access and sharing. Data is the lifeblood of every successful organization, but it also needs to be able to flow smoothly between organizations. Data sharing solutions have historically been tied to a single commercial product, introducing vendor lock-in risk and data silos. Databricks Co-founder & CEO Ali Ghodsi announced Delta Sharing, the industry’s first open protocol for secure data sharing. It supports SQL and Python data science, plus easy management, privacy, security and compliance. It will be part of the Delta Lake project under the Linux Foundation.

We’ve already seen tremendous support for the project, with over 1,000 datasets to be made available by AWS Data Exchange, FactSet, S&P Global, Nasdaq and more. Additionally, Microsoft, Google, Tableau and many others have committed to adding support for Delta Sharing to their products.
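
To give a feel for the protocol from the consumer side, here is a rough sketch using the open source delta-sharing Python connector; the profile file and the share, schema and table names are placeholders that a data provider would supply:

import delta_sharing

# A profile file from the data provider holds the sharing server endpoint
# and a bearer token (placeholder path)
profile = '/dbfs/tmp/open-datasets.share'

# Discover the tables shared with you
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table directly into pandas
pdf = delta_sharing.load_as_pandas(profile + '#my_share.my_schema.my_table')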

Delta Sharing announcement from the Wednesday AM Keynote


Learn more about Delta Sharing from Apache Spark and MLflow co-creator Matei Zaharia in the keynotes. You can also watch a session from Tableau on How to Gain 3 Benefits with Delta Sharing.

Early Release: Delta Lake Definitive Guide by O’Reilly

My esteemed colleagues Denny Lee, Vini Jaiswal and Tathagata Das are hard at work writing a new book exploring how to build modern data lakehouse architectures with Delta Lake. As Michael Armbrust announced during the keynotes, we’ve joined with O’Reilly to make the early release available for free. Download it today, and we’ll be sure to let you know when the final release is published!

Early Release: Delta Lake Definitive Guide by O’Reilly

Unity Catalog

Companies are collecting massive amounts of data in data lakes in the cloud, and these lakes keep growing. It’s been hard to maintain governance in a single cloud, let alone the multi-cloud environment that many enterprises use. The Unity Catalog is the industry’s first unified catalog for the lakehouse, enabling users to standardize on one fine-grained solution across all clouds. You can use ANSI SQL to control access to tables, fields, views, models — not files. It also provides an audit log to make it easy to understand who and what is accessing all your data.

Unity Catalog announcement from Wednesday’s AM Keynote


Learn more from Chief Technologist Matei Zaharia in the opening keynote, and sign up for the waitlist to get access to Unity Catalog.

Databricks SQL: improved performance, administration and analyst experience

We want to provide the most performant, simplest and most powerful SQL platform in an open way. SQL is an important part of the data lakehouse vision, and we’ve been focused on improving the performance and usability of SQL in real-world applications.

Last year, we talked about how Databricks, powered by Delta Lake and the Photon engine, performed better than data warehouses in the TPC-DS price/performance comparison on a 30TB workload. At Summit, Reynold Xin, Chief Architect at Databricks, announced an update to this performance optimization work, focused on concurrent queries on a 10GB TPC-DS workload. After making over 100 different micro-optimizations, Databricks SQL now outperforms popular cloud data warehouses for small queries with lots of concurrent users.

Databricks Chief Architect Reynold Xin on improvements to Databricks SQL


Learn more about the improvements in Databricks SQL and the Photon Engine from Reynold Xin, Chief Architect at Databricks and top all-time contributor to Apache Spark. Be sure to stay tuned for Databricks CEO Ali Ghodsi’s discussion with Bill Inmon, the “father” of the data warehouse.

You can also watch an in-depth session from the tech lead and product manager of the Photon team.

Lakehouse momentum

The momentum in lakehouse adoption that Databricks CEO Ali Ghodsi discussed in the opening keynote is representative of significant engineering advancements that are simplifying the work of data teams globally.

No longer do these companies need to have two-tier data architectures with both data lakes and (sometimes multiple) data warehouses. By adopting the data lakehouse, they can now have both the performance, reliability and compliance capabilities typical in a data warehouse along with the scalability and support for unstructured data found in data lakes.

Rohan Dhupelia joined the opening keynote to talk about how the lakehouse transformed and simplified the work of data teams at Atlassian.

Ali then invited Bill Inmon, the “father” of the data warehouse, to the virtual stage to talk about the transformation he’s seen over the last few decades. Bill says “if you don’t turn your data lake into a lakehouse, then you’re turning it into a swamp” and emphasizes that the lakehouse will unlock the data and present opportunities we’ve never seen before.

Hear first-hand from Ali, Rohan and Bill in the opening keynote on the lakehouse data architecture, data engineering and analytics. Stay tuned for Bill Inmon’s upcoming book on the data lakehouse and read his blog post to understand the evolution to the lakehouse.

Koalas is being merged into Apache Spark

The most important library for data science is pandas. In order to better support data scientists moving from single-node “laptop data science” to highly scalable clusters, we launched the Koalas project two years ago. Koalas is an implementation of the pandas APIs, optimized for clustered environments enabling work on large datasets.

We’re now seeing over 3 million PyPI downloads of Koalas every month – changing the way data scientists work at scale. Reynold Xin, the all-time top contributor to Apache Spark, announced that we’ve decided to donate Koalas upstream into the Apache Spark project. Now, anytime you write code for Apache Spark, you can take comfort in knowing the pandas APIs will be available to you.

The merging of these projects has a great side benefit for Spark users as well – the efficient plotting techniques in the pandas APIs on Spark automatically determine the best way to plot data without the need for manual downsampling.
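
As a rough sketch of what that looks like in practice: the code below uses the Koalas package available today, and once the donation lands the same API is expected to be importable from PySpark itself (the path and column names are placeholders):

# Today: import databricks.koalas; after the merge, the equivalent is expected
# to be `import pyspark.pandas as ps`
import databricks.koalas as ks

# Familiar pandas syntax, executed on a Spark cluster (placeholder path)
kdf = ks.read_parquet('/mnt/lake/events')
daily_revenue = kdf.groupby('event_date')['revenue'].sum()

# Plotting aggregates/downsamples automatically instead of collecting all rows
daily_revenue.plot.line()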

Brooke Wenig, Machine Learning Practice Lead at Databricks, demos the pandas APIs on Apache Spark


Learn more about the Koalas merger in the keynote and demo from Reynold Xin and Brooke Wenig. You can also watch a deep-dive into the Koalas project from the engineering team, including benchmarks and comparisons to other pandas scaling efforts.

Machine Learning Dashboard

Director of Product Management Clemens Mewald announced several new improvements in the machine learning capabilities of Databricks.

These improvements seek to simplify the full machine learning lifecycle, from data to model deployment (and back). One way we’re doing this is with new persona-based navigation in the Databricks Workspace, including an ML Dashboard that brings together data, models, feature stores and experiment tracking under a single interface.

Director of Product Management Clemens Mewald talks about the new ML capabilities


Learn more from Clemens in the Spark, Data Science, and Machine Learning keynote and see a demo from Sr. Product Manager Kasey Uhlenhuth. They also have an in-depth session where they dive into the details of these announcements.

Machine Learning Feature Store

The Databricks Feature Store is the first that’s co-designed with a data and MLOps platform.

What’s a feature? Features are the inputs to a machine learning model, including transformations, context, feature augmentation and pre-computed attributes.

A feature store exists to make it easier to implement a feature once and use it both during training and low-latency production serving, preventing online/offline skew. The Databricks Feature Store includes a feature registry to facilitate discoverability and reusability of features, including tracking of the data sources. It also integrates into MLflow, enabling the feature versions used in training a particular version of a model to be automatically used in production serving, without manual configuration.

The data in the Databricks Feature Store is stored in an open format, Delta Lake tables, so it can be accessed from clients in Python, SQL and more.

Director of Product Management Clemens Mewald talks about the new ML capabilities


Learn more from Clemens in the Spark, Data Science and Machine Learning keynote and see a demo from Sr. Product Manager Kasey Uhlenhuth. They also have an in-depth session where they dive into the details of these announcements.

AutoML with reproducible trial notebooks

Databricks AutoML is a unique glass-box approach to AutoML that empowers data teams without taking away control. It generates a baseline model to quickly validate the feasibility of a machine learning project and guide the project direction.

Many other AutoML solutions designed for the citizen data scientist hit a wall if the auto-generated model doesn’t work, because they don’t provide the control needed to tune it.

Sr. Product Manager Kasey Uhlenhuth talks about AutoML’s glass-box approach

Databricks AutoML augments data scientists and enables them to see exactly what’s happening under the hood by providing the source code for each trial run in separate, modifiable Python notebooks. The transparency of this glass box approach means there is no need to spend time reverse engineering an opaque auto-generated model to tune based on your subject matter expertise. It also supports regulatory compliance via the ability to show exactly how a model is trained.

AutoML integrates tightly with MLflow, tracking all the parameters, metrics, artifacts and models associated with every trial run. Get more details in the in-depth session.

Register or log in to the event site to rewatch any or all of the 2021 sessions.

--

Try Databricks for free. Get started today.


Women in Product at Databricks


Databricks was a proud sponsor of the 2021 Women In Product Conference, which this year had a theme of igniting possibility. We interviewed a few of the women on our product team (designers, program managers and product managers), all of whom play an integral role in Databricks’ success. We had the opportunity to speak with them and learn more about their career journeys and what inspires them outside of work. Read below to get to know some of our Product teammates!

Miranda Luna – Product Manager

What’s the best career advice you’ve received?
I’d have to call it a tie between ‘Join a company where, more often than not, you’re in a room where you’re actively learning from someone else’ and ‘Get comfortable bringing structure to ambiguity.’ I can’t emphasize enough how important it’s been for my career path to be surrounded by intelligent, driven and intentional colleagues and to learn from them every day. It’s been equally critical to deeply understand customer problems and distill what’s required to solve them. As your career progresses, the problems only get bigger and more amorphous, so a commitment to strengthening that muscle over time has an outsize impact on your effectiveness.

Who’s a person you look up to and why?
I’ll always look up to my mother, someone who embodies both hard work and a true love of learning. She obtained her PhD in mechanical engineering while raising three kids and working full time at NASA. As if that weren’t impressive enough, she also made a fairly significant career pivot, from designing air-cleaning systems for the shuttle and space suits to studying the impact of different weather patterns on climate change. Truly a lifelong learner.

What’s your favorite book and why?
Every year, I come back to John Steinbeck’s East of Eden and The Grapes of Wrath. Even after reading them many times, I can always find a new facet to appreciate.

Ginny Wood – Product Designer

How did you decide you wanted to become a product designer?
While I was a graphic designer, my friend helped me get a product design internship to explore different options. Once I discovered design systems, I knew this work could fascinate me for my whole career. I love principles, patterns, and systems. I love that I get to be craft-oriented about solving detailed problems, while scaling them to serve a whole product ecosystem.

What’s the best career advice you’ve ever received?
“Art is not about thinking something up. It is the opposite — getting something down.”—Julia Cameron
Creating anything is a process of discovery, not generation. You are responsible for the quantity of attempts, and the universe is responsible for the quality of what you find—the solution already exists, and you are merely discovering it.

Who’s a person you look up to and why?
My brother Robinson Wood. He inspires me to take creative risks, trust the process, and approach people with curiosity.

Anna Shestinian – Product Manager

Who’s a person you look up to and why?
My sister Leah for her resilience in the ongoing call for justice, belonging, comradeship, truth-telling, and joy.

What’s the best career advice you’ve received?
There is power in a pause. Pausing and reflecting allows us to make better product decisions and deal with more complex issues. A false sense of urgency around a project creates a lot of activity without productive results.

What book(s) would you recommend to our readers?
I just finished the Broken Earth trilogy by N. K. Jemisin, which blends sci-fi and fantasy with ecological consciousness.

Stephanie Liu – Program Manager, Product

Any good books you’ve read recently that you’d like to recommend?
Invisible Women should be required reading in every school curriculum. It’s well researched, infuriating, and will change how you see the modern world around us: how the world is not designed for women, despite us being half the population, and how that hurts women and society. You will finish the book and start chanting “sex disaggregated data NOW!”
The Hidden Life of Trees is well-cited and a fascinating look at how trees communicate and the importance of old growth forests to the planet. Braiding Sweetgrass is a compelling blend of botany and motherhood and moving nature prose.

What’s the best career advice you’ve received?
Poke your head up every few years and re-evaluate your career goals. Are you learning what you want to be learning? Is this the best job for you right now? Imagine yourself at 30, 35, 40, 45, 50… what job/role/title do you want to be searching for?

Who’s a person you look up to and why?
Marian Croak – a successful woman technical leader (over 200 patents), who also doesn’t necessarily fit the stereotypical mold of an executive. She’s very soft spoken, and earlier in her career had to make her voice heard by recruiting others to her cause to amplify her positions. I was always impressed with her style when I briefly intersected with her at Google. It’s inspiring to see the different ways that women, especially women of color, can be successful in their careers without adopting more “male traits” to be heard and gain credibility.

Noa Braun – Program Manager, Product

What’s the best career advice you’ve received?
There was an exec at my previous company who was known for saying that “the whole person comes to work – not just the employee”. This seems obvious, but I think we can all recognize how important it is to understand this when working on a team, or cross-team. Blockers and issues will arise no matter what you work on, but approaching them with a desire to collaborate and understand others will consistently lead to stronger outcomes.

Any good books you’ve read recently that you’d like to recommend?
Most recently I finished Transcendent Kingdom by Yaa Gyasi. It’s an incredibly emotionally compelling exploration of the journey to seek purpose and identity while reacting to and processing loss. It’s a bit of a tear-jerker, but it is a beautifully told story. I would highly recommend it (and her other book)!

Who’s a person you look up to and why?
My sister has always been my role model, for her thoughtfulness, intellect, determination and seemingly never-ending care for those around her.

Stefania Leone – Product Manager

How did you decide you wanted to become a product manager?
During my PhD and postdoc, I worked on developer experience and loved it. I then started to work on a low code platform product on the customer success side, and quickly realized that I can have a greater impact on the product side. That’s how I moved into my first product role (and never looked back)!

Any good books you’ve read recently that you’d like to recommend?
I just read Educated by Tara Westover over the weekend – I did not do much else but read.

What’s the best career advice you’ve received?
It’s not advice I have been given, but being a planner, I have had great success setting a goal and making a plan for how to achieve it. For example, I always wanted to work and live abroad, and with this approach we went to the US for a postdoc (even though we had a small baby, which was not initially planned) and recently relocated to Amsterdam with the whole family when I joined Databricks.
 
Learn more about Databricks and how you can join these fabulous women on the product team. See our open roles here.

--

Try Databricks for free. Get started today.

