This is a guest blog from Justin Mills, Data Team Lead at Yesware. To try out Databricks for your next Spark application, sign up for a free trial.
Yesware provides salespeople with data-driven insights to help them understand whether their outreach efforts are working. We accomplish this by logging events such as email opens and clicks from our application (which integrates with a salesperson’s email application) and associating those events with the salesperson’s activity in the form of a report. For example, using Yesware, a sales team can track the open and reply rates on emails sent using different templates. They can then prioritize the templates with a higher success rate and share best practices with the rest of the team. We track a high volume of email and event data: over 20 million email activities per month and approximately 200 million events associated with those emails. We also integrate with Salesforce and incorporate its rich dataset into our reports. Spark and Databricks give us the development and production environments for connecting all this data together to build new features.
In this post, I will walk through how we at Yesware used Databricks to take an important new feature in our Premier Tier from idea to production: the Activity versus Engagement Report. Using Spark and Databricks, we were able to build this new report more quickly, and at a lower cost, than previous reports. Moreover, we are confident that Spark on Databricks will give us the platform to run more complex jobs and process more data as we scale up.
Background
Our data pipeline is fairly simple. Databases are archived into S3; Databricks runs Spark jobs to process those files and produce Resilient Distributed Datasets (RDDs) as output data. These RDDs are then written to timestamped tables in a PostgreSQL database. An API layer sits atop this database and gives clients access to the underlying data by running SQL queries that aggregate it.
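To make the shape of these jobs a little more concrete, here is a minimal sketch of the loading step, assuming (purely for illustration) CSV archives and a made-up bucket layout and record schema; it is not our actual code.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical record type; the fields are illustrative only.
case class EmailEvent(userId: Long, emailId: Long, eventType: String, day: String)

object LoadArchivedEvents {
  // Reads one day of archived events from S3 into an RDD. The bucket name,
  // path layout, and CSV format are assumptions made for this sketch.
  def run(sc: SparkContext, day: String): RDD[EmailEvent] =
    sc.textFile(s"s3a://example-archive-bucket/email_events/$day/*.csv")
      .map(_.split(','))
      .collect { case Array(userId, emailId, eventType) =>
        EmailEvent(userId.toLong, emailId.toLong, eventType, day)
      }
}
```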
Many of our reports support arbitrary date selection, so we typically store data in “day” buckets, which means most of these tables have a key that includes user id and day. Each job writes to a new table name. When the job has finished writing the data and building indexes, it updates a metadata table to indicate that the most recent version of a data source is ready for querying. The metadata table also records other attributes, such as the time range represented in the data and when the job was run, to indicate how current the data is. This allows us to quickly roll back to a previous version of the data if new data turns out to be incorrect. The rendering of this data in a report is done in a separate application that consumes the service.
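A rough sketch of this versioned-table pattern might look like the following; the table names, columns, and SQL are hypothetical (and it assumes a PostgreSQL JDBC driver is available on the cluster), so treat it as an outline of the idea rather than our actual job.

```scala
import java.sql.DriverManager
import org.apache.spark.rdd.RDD

// Hypothetical per-user, per-day report row; the fields are illustrative.
case class ReportRow(userId: Long, day: String, activityScore: Double, engagementScore: Double)

object WriteVersionedTable {
  // Writes the report into a freshly timestamped table, builds an index,
  // then records the new table in a metadata table so the API can switch
  // over to it. Older tables stay around for quick rollback.
  def run(rows: RDD[ReportRow], jdbcUrl: String, user: String, password: String): Unit = {
    val table = s"activity_engagement_${System.currentTimeMillis / 1000}"

    // Create the new table from the driver.
    val setup = DriverManager.getConnection(jdbcUrl, user, password)
    try {
      setup.createStatement().execute(
        s"CREATE TABLE $table (user_id BIGINT, day DATE, activity DOUBLE PRECISION, engagement DOUBLE PRECISION)")
    } finally setup.close()

    // Insert rows from each partition over its own connection.
    rows.foreachPartition { partition =>
      val conn = DriverManager.getConnection(jdbcUrl, user, password)
      try {
        val stmt = conn.prepareStatement(s"INSERT INTO $table VALUES (?, ?::date, ?, ?)")
        partition.foreach { r =>
          stmt.setLong(1, r.userId)
          stmt.setString(2, r.day)
          stmt.setDouble(3, r.activityScore)
          stmt.setDouble(4, r.engagementScore)
          stmt.addBatch()
        }
        stmt.executeBatch()
      } finally conn.close()
    }

    // Index after the bulk load, then flip the metadata pointer.
    val meta = DriverManager.getConnection(jdbcUrl, user, password)
    try {
      meta.createStatement().execute(s"CREATE INDEX ON $table (user_id, day)")
      meta.createStatement().execute(
        s"INSERT INTO report_versions (report, table_name, generated_at) " +
          s"VALUES ('activity_engagement', '$table', now())")
    } finally meta.close()
  }
}
```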
Building the Activity vs. Engagement Report with Databricks
Concept
We wanted to give salespeople an easy way to see which accounts they should focus their efforts on. Our idea is to show a correlation of sales activity and prospect engagement in a report called “Activity versus Engagement”. Activity would represent things such as emails sent and engagement would be action taken in response to those emails such as clicking on links, replying, and data from Salesforce (see a mock-up of the idea from our designer below).
Activity vs. Engagement report concept mock-up
Prototyping
We began prototyping using Databricks notebooks in conjunction with our custom-built Spark library (the Yesware JAR, where our production code lives). The library includes production-tested methods to build RDDs out of our data in S3, and the code to build derived datasets that do things like aggregate event data at the email level.
During the prototyping phase, Databricks provided a one-stop shop: the environment to build prototype code and the tools to visualize the results, so we could start to see what the report would look like in the application. We built intermediate Spark RDDs and cached them to make re-computations faster. A typical development cycle might look like this:
- Use Databricks to start a cluster with the Yesware JAR loaded into it.
- Run the first N cells of a notebook, some of which cache intermediate RDDs (see the sketch after this list).
- Tweak values in a cell.
- Run that cell and any following cells again.
- Repeat until the data visualizations and table data show the results we are looking for.
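The sketch below shows what such notebook cells could look like; the record shape and scoring weights are placeholders, and a tiny in-memory dataset stands in for the real RDDs the Yesware JAR builds from S3.

```scala
// Cell 1: build an expensive intermediate RDD once and cache it.
case class EmailSummary(recipient: String, opens: Int, clicks: Int, replies: Int)
val emailSummaries = sparkContext.parallelize(Seq(
  EmailSummary("amy@example.com", opens = 3, clicks = 1, replies = 0),
  EmailSummary("bob@example.com", opens = 1, clicks = 0, replies = 1)
)).cache()
emailSummaries.count()  // materialize the cache

// Cell 2: cheap to re-run while tweaking the weights, since the input is cached.
val clickWeight = 2.0
val replyWeight = 5.0
val scored = emailSummaries.map(e => (e.recipient, e.clicks * clickWeight + e.replies * replyWeight))

// Cell 3: inspect the result, adjust the weights in cell 2, and re-run cells 2 and 3 only.
scored.take(20).foreach(println)
```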
Through this process, we were also able to decide what insights from the initial idea we could keep and what we needed to continue to refine. It was an incredibly useful method for one or two developers to move quickly before an entire team jumped in to build out the production-ready feature.
The early prototypes of this concept were centered around binning data by prospect. We took one of our aggregated datasets (an email with summary statistics) and flattened it out into a record per email address. We then aggregated that data by email address to compute an activity score and an engagement score. Databricks notebooks allowed us to incrementally build up, in a visual environment, the data we ultimately wanted in the Activity versus Engagement Report:
Prototype of the report in Databricks notebook
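As an illustration of that flatten-and-aggregate step, here is a small sketch with made-up record shapes, sample values, and weights standing in for the real datasets built by the Yesware JAR.

```scala
// Hypothetical shape of the aggregated "email with summary statistics" dataset.
case class EmailWithStats(recipients: Seq[String], sent: Int, opens: Int, clicks: Int, replies: Int)

// A tiny in-memory stand-in for the real dataset.
val emailsWithStats = sparkContext.parallelize(Seq(
  EmailWithStats(Seq("amy@example.com", "bob@example.com"), sent = 1, opens = 3, clicks = 1, replies = 0),
  EmailWithStats(Seq("amy@example.com"), sent = 1, opens = 1, clicks = 0, replies = 1)
))

// Flatten each email into one record per recipient address...
val perAddress = emailsWithStats.flatMap { e =>
  e.recipients.map { address =>
    // (activity, engagement) contribution of this email; the weights are made up.
    (address, (e.sent.toDouble, e.opens * 1.0 + e.clicks * 2.0 + e.replies * 5.0))
  }
}

// ...then aggregate by address into a single activity and engagement score per prospect.
val scores = perAddress.reduceByKey { case ((a1, e1), (a2, e2)) => (a1 + a2, e1 + e2) }
```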
Testing
After we had a working prototype in the form of a Databricks notebook, the next step was to properly test the prototype code. Our ultimate goal was to merge the tested code into the Yesware JAR, where the code becomes part of our production pipeline. This was done by the following process:
- Creating objects and classes in Scala to contain the logic in the prototype notebook cells.
- Copying this logic over to these classes. This process can sometimes be a straight copy of a block of Spark code, and other times means some refactoring of that code.
- Writing tests of the new code in the Yesware JAR.
- Submitting a pull request for code review.
- Merging and releasing a new version of the Yesware JAR.
We try, wherever possible, to write our Spark code to take parameters, such as how many days of input data to use or the target database to write to. This makes step 2 above much simpler, since not all parameter values available in a Databricks Spark environment are available locally for testing. The final piece of copying the logic into the Yesware JAR is wrapping all of the individual pieces of code into a single entry point. This is a Scala object that we can call into, passing in all the necessary parameters. So where our prototype notebook may have dozens of cells, each with one piece of the Spark transformations, the production notebook often has one cell that looks something like this:
```scala
val options = Map(
  // Options to the job here
)

// sparkContext is defined by Databricks and available
// to all cells in all notebooks.
val reportRDD = BuildActivityEngagementReport.run(sparkContext, options)
```
Part of the process of copying code from the notebook into source code was reworking the entry point to the job so that the Databricks notebook could provide inputs to control it, such as how many days of data to process, the target database to load into, and so on. The result is that while our prototyping notebooks comprise many cells, each with Spark transformations and business logic, our production notebook looks something like this:
```scala
val reportRDD = BuildActivityEngagementReport.run(sparkContext, options)
```
where options is a Map of the parameters mentioned above, such as database credentials.
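For example, such a map might look roughly like the following; the key names and values are made up, since the real options are internal to the Yesware JAR.

```scala
// Illustrative only; the actual option keys live in the Yesware JAR.
val options = Map(
  "daysOfInput"  -> "90",
  "jdbcUrl"      -> "jdbc:postgresql://reports-db.example.com:5432/reports",
  "jdbcUser"     -> "report_writer",
  "jdbcPassword" -> sys.env("REPORTS_DB_PASSWORD")  // keep credentials out of the notebook
)
```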
Another benefit of having the code live in our shared library is easier automated testing and code review. Both are a big part of the Yesware culture, and we wanted our Spark and Databricks code to be part of that mix. To support this, we keep a small sample of production data that we use locally to test our Spark code with ScalaTest. Having real production data gives new developers enough to play with locally to do interesting things, and it provides more realistic data for automated tests to run against. Keeping the code in a Git repository also integrates perfectly with our existing development flow of using GitHub Pull Requests to review any proposed changes. This process gives us a chance to look over our implementation again and see whether there are any other optimizations we could make. Code review, for example, led us to realize we were filtering data after an expensive join; with some rearranging, we could reduce the volume of data going into the join and make it faster. Having this formal process in place, and tests to back up the validity of such optimizations, has helped us immensely.
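A minimal sketch of what such a ScalaTest spec can look like is shown below; the option keys and sample-data path are placeholders, not our real fixtures.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Sketch only: option keys and the sample-data path are placeholders.
class BuildActivityEngagementReportSpec extends FunSuite with BeforeAndAfterAll {

  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    // A small local SparkContext is enough to exercise the job against sample data.
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("report-spec"))
  }

  override def afterAll(): Unit = sc.stop()

  test("produces report rows from the checked-in sample data") {
    val options = Map(
      "inputPath"   -> "src/test/resources/sample-data",
      "daysOfInput" -> "30"
    )
    val report = BuildActivityEngagementReport.run(sc, options)
    assert(report.count() > 0)
  }
}
```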
Production
Once the JAR source code was updated and we had cut a release, we used the Databricks jobs feature to schedule the job. In the notebook that calls into our custom Spark library (the Yesware JAR), we also include some basic visualizations of the data (see figure below). This is extremely useful for ensuring that the job continues to produce data that looks correct. It is also a handy way to provide higher-level visualizations, such as the overall Activity versus Engagement Report for all of our users, or for our “top performers” across all customers. The job is set up to prefer spot instances to save on cost, falling back to on-demand instances if needed. Finally, email notifications are connected to mailing lists so we can track progress and alerts via existing mechanisms.
Output from the Activity vs. Engagement Report job
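For illustration, a sanity-check cell of this kind could look roughly like the following, where the report row’s field names are placeholders and sqlContext and display() are provided by the Databricks notebook environment.

```scala
// Illustrative sanity-check cell; the report row's field names are placeholders.
val summary = sqlContext.createDataFrame(
  reportRDD.map(r => (r.userId, r.activityScore, r.engagementScore))
).toDF("user_id", "activity", "engagement")

// Render a quick activity-vs-engagement view of the freshly built report.
display(summary)
```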
Final product
The final product is a chart of activity versus engagement that individual salespeople can use to see how their prospecting efforts compare to the baseline for their team. It is also a tool sales managers can use to direct effort and provide guidance on what is and is not working.
Final product in the Yesware application
This report joins several other team-based reports available to Yesware users that help sales teams gain more insight into what is working. We will continue to look to Spark and Databricks for building features like this, as they have become the cornerstone of our reporting features.
To try Databricks, sign up for a free trial!