In part 2 of this three-part series on Azure data analytics for modern industrial internet of things (IIoT) applications, we ingested real-time IIoT data from field devices into Azure and performed complex time-series processing on Data Lake directly. In this post, we will leverage machine learning for predictive maintenance and to maximize the revenue of a wind turbine while minimizing the opportunity cost of downtime, thereby maximizing profit.
The end result of our model training and visualization will be a Power BI report shown below:
The end-to-end architecture is again shown below.
Machine Learning: Power Output and Remaining Life Optimization
Optimizing the utility, lifetime, and operational efficiency of industrial assets like wind turbines has numerous revenue and cost benefits. The real-world challenge we explore in this article is maximizing the revenue of a wind turbine while minimizing the opportunity cost of downtime, thereby maximizing our net profit.
Net profit = Power generation revenue – Cost of added strain on equipment
If we push a turbine to a higher RPM, it will generate more energy and therefore more revenue. However, the added strain on the turbine will cause it to fail more often, introducing cost.
To solve this optimization problem, we will create two models:
- Predict the power generated of a turbine given a set of operating conditions
- Predict the remaining life of a turbine given a set of operating conditions
We can then produce a profit curve to identify the optimal operating conditions that maximize power revenue while minimizing costs.
Using Azure Databricks with our Gold Delta tables, we will perform feature engineering to extract the fields of interest, train the two models, and finally deploy the models to Azure Machine Learning for hosting.
To calculate the remaining useful lifetime of each Wind Turbine, we can use our maintenance records that indicate when each asset is replaced.
%sql -- Calculate the age of each turbine and the remaining life in days CREATE OR REPLACE VIEW turbine_age AS WITH reading_dates AS (SELECT distinct date, deviceid FROM turbine_power), maintenance_dates AS ( SELECT d.*, datediff(nm.date, d.date) as datediff_next, datediff(d.date, lm.date) as datediff_last FROM reading_dates d LEFT JOIN turbine_maintenance nm ON (d.deviceid=nm.deviceid AND d.date<=nm.date) LEFT JOIN turbine_maintenance lm ON (d.deviceid=lm.deviceid AND d.date>=lm.date )) SELECT date, deviceid, min(datediff_last) AS age, min(datediff_next) AS remaining_life FROM maintenance_dates GROUP BY deviceid, date;
To predict power output at a six-hour time horizon, we calculate time series shifts using Spark window functions.
CREATE OR REPLACE VIEW feature_table AS SELECT r.*, age, remaining_life, -- Calculate the power 6 hours ahead using Spark Windowing and build a feature_table to feed into our ML models LEAD(power, 6, power) OVER (PARTITION BY r.deviceid ORDER BY time_interval) as power_6_hours_ahead FROM gold_readings r JOIN turbine_age a ON (r.date=a.date AND r.deviceid=a.deviceid) WHERE r.date < CURRENT_DATE(); -- Only train on historical data
There are strong correlations between both Turbine operating parameters (RPM and Angle) as well as weather conditions and the power generated six hours from now.
We can now train an XGBoost Regressor model to use our feature columns (weather, sensor and power readings) to predict our label (power reading six hours ahead). We can train a model for each Wind Turbine in parallel using a Pandas UDF, which distributes our XGBoost model training code to all the available nodes in the Azure Databricks cluster.
# Create a Spark Dataframe that contains the features and labels we need feature_cols = ['angle','rpm','temperature','humidity','windspeed','power','age'] label_col = 'power_6_hours_ahead' # Read in our feature table and select the columns of interest feature_df = spark.table('feature_table') # Create a Pandas UDF to train a XGBoost Regressor on each turbine's data @pandas_udf(feature_df.schema, PandasUDFType.GROUPED_MAP) def train_power_model(readings_pd): mlflow.xgboost.autolog() # Auto-Log the XGB parameters, metrics, model and artifacts with mlflow.start_run(): # Train an XGBRegressor on the data for this Turbine alg = xgb.XGBRegressor() train_dmatrix = xgb.DMatrix(data=readings_pd[feature_cols].astype('float'),label=readings_pd[label_col]) model = xgb.train(dtrain=train_dmatrix, evals=[(train_dmatrix, 'train')]) return readings_pd # Run the Pandas UDF against our feature dataset power_predictions = feature_df.groupBy('deviceid').apply(train_power_model)
Azure Databricks will automatically track each model training run with a hosted MLflow experiment. For XGBoost Regression, MLflow will track any parameters passed into the params argument, the RMSE metric, the turbine this model was trained on, and the resulting model itself. For example, the RMSE for predicting power on deviceid WindTurbine-18 is 45.79.
We can train a similar model for the remaining life of the wind turbine. The actuals vs. predicted for one of the turbines is shown below.
Model Deployment and Hosting
Azure Databricks is integrated with Azure Machine Learning for model deployment and scoring. Using the Azure ML APIs directly inside of Databricks, we can automatically deploy an image for each model to be hosted in a fast, scalable container service (ACI or AKS) by Azure ML.
# Create a model image inside of AzureML model_image, azure_model = mlflow.azureml.build_image(model_uri=path, workspace=workspace, model_name=model, image_name=model, description="XGBoost model to predict power output” synchronous=False) # Deploy a web service to host the model as a REST API dev_webservice_deployment_config = AciWebservice.deploy_configuration() dev_webservice = Webservice.deploy_from_image(name=dev_webservice_name, image=model_image, workspace=workspace)
Once the model is deployed, it will show up inside the Azure ML studio, and we can make REST API calls to score data interactively.
# Construct a payload to send with the request payload = { 'angle':12, 'rpm':10, 'temperature':25, 'humidity':50, 'windspeed':10, 'power':200, 'age':10 } def score_data(uri, payload): rest_payload = json.dumps({"data": [list(payload.values())]}) response = requests.post(uri, data=rest_payload, headers={"Content-Type": "application/json"}) return json.loads(response.text) print(f'Predicted power (in kwh) from model: {score_data(power_uri, payload)}') print(f'Predicted remaining life (in days) from model: {score_data(life_uri, payload)}')
Now that both the power optimization and the RUL models are deployed as prediction services, we can utilize both in optimizing net profit from each wind turbine.
Assuming $1 per KWh, annual revenue can simply be calculated by multiplying the expected hourly power by 24 hours and 365 days.
The annual cost can be calculated by multiplying the daily revenue by the number of times the Turbine needs to be maintained in a year (365 days / remaining life).
We can iteratively score various operating parameters simply by making multiple calls to our models hosted in Azure ML. By visualizing the expected profit cost for various operating parameters, we can identify the optimal RPM to maximize profit.
Data Serving: Azure Data Explorer and Azure Synapse Analytics
Operational Reporting in ADX
Azure Data Explorer (ADX) provides real-time operational analytics on streaming time-series data. IIoT device data can be streamed directly into ADX from IoT Hub, or pushed from Azure Databricks using the Kusto Spark Connector from Microsoft as shown below.
stream_to_adx = ( spark.readStream.format('delta').option('ignoreChanges',True).table('turbine_enriched') .writeStream.format("com.microsoft.kusto.spark.datasink.KustoSinkProvider") .option("kustoCluster",kustoOptions["kustoCluster"]) .option("kustoDatabase",kustoOptions["kustoDatabase"]) .option("kustoTable", kustoOptions["kustoTable"]) .option("kustoAadAppId",kustoOptions["kustoAadAppId"]) .option("kustoAadAppSecret",kustoOptions["kustoAadAppSecret"]) .option("kustoAadAuthorityID",kustoOptions["kustoAadAuthorityID"]) )
PowerBI can then be connected to the Kusto table to create a true, real-time, operational dashboard for Turbine engineers.
ADX also contains native time-series analysis functions such as forecasting and anomaly detection. For example, the Kusto code below finds anomalous points for RPM readings in the data stream.
turbine_raw | where rpm > 0 | make-series rpm_normal = avg(rpm) default=0 on todatetime(timestamp) in range(datetime(2020-06-30 00:00:00), datetime(2020-06-30 01:00:00), 10s) | extend anomalies = series_decompose_anomalies(rpm_normal, 0.5) | render anomalychart with(anomalycolumns=anomalies, title="RPM Anomalies")
Analytical Reporting in ASA
Azure Synapse Analytics (ASA) is the next generation data warehouse from Azure that natively leverages ADLS Gen 2 and integrates with Azure Databricks to enable seamless data sharing between these services.
While leveraging the capabilities of Synapse and Azure Databricks, the recommended approach is to use the best tool for the job given your team’s requirements and the user personas accessing the data. For example, data engineers that need the performance benefits of Delta and data scientists that need a collaborative, rich and flexible workspace will gravitate towards Azure Databricks. Analysts that need a low-code or data warehouse-based SQL environment to ingest, process and visualize data will gravitate towards Synapse.
The Synapse streaming connector for Azure Databricks allows us to stream the Gold Turbine readings directly into a Synapse SQL Pool for reporting.
spark.conf.set("spark.databricks.sqldw.writeSemantics", "copy") # Use COPY INTO for faster loads write_to_synapse = ( spark.readStream.format('delta').option('ignoreChanges',True).table('turbine_enriched') # Read in Gold turbine readings .writeStream.format("com.databricks.spark.sqldw") # Write to Synapse .option("url",dbutils.secrets.get("iot","synapse_cs")) # SQL Pool JDBC (SQL Auth) .option("tempDir", SYNAPSE_PATH) # Temporary ADLS path .option("forwardSparkAzureStorageCredentials", "true") .option("dbTable", "turbine_enriched") # Table in Synapse to write to .option("checkpointLocation", CHECKPOINT_PATH+"synapse") # Streaming checkpoint .start() )
Alternatively, Azure Data Factory can be used to read data from the Delta format and write it into Synapse SQL Pools. More documentation can be found here.
Now that the data is clean, processed, and available to data analysts for reporting, we can build a live PowerBI dashboard against the live data as well as the predictions from our ML model as shown below.
Summary
To summarize, we have successfully:
- Ingested real-time IIoT data from field devices into Azure
- Performed complex time-series processing on Data Lake directly
- Trained and deployed ML models to optimize the utilization of our Wind Turbine assets
- Served the data to engineers for operational reporting and data analysts for analytical reporting
They key big data technology that ties everything together is Delta Lake. Delta on ADLS provides reliable streaming data pipelines and highly performant data science and analytics queries on massive volumes of time-series data. Lastly, it enables organizations to truly adopt a Lakehouse pattern by bringing best of breed Azure tools to a write-once, access-often data store.
What’s Next?
Try out the notebook hosted here, learn more about Azure Databricks with this 3-part training series and see how to create modern data architectures on Azure by attending this webinar.
--
Try Databricks for free. Get started today.
The post Modern Industrial IoT Analytics on Azure – Part 3 appeared first on Databricks.