Inspiration

Do something simple & meaningful without a data scientist

What it does

Estimates the Time To Intervene for a main, given a set of its features.

How I built it

We saw that the Mains ID was a common key in the data, so we started with the complete list of assets. Then, one by one, we looked at likely features and at which of them could be integrated into a simple grid of data. Owing to the short time available in the Hackathon we made some concessions on features, for example omitting pressure profiles, as we didn't have time to transform them into an easy-to-use shape. Instead we focused on "quick win" features that slotted nicely into a simple grid of data. We used various characteristics of soil, pipe and population.
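The grid-building step described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the Excel and SQL Server process we actually used; the table names, column names and values are all hypothetical.

```python
# Hypothetical feature tables, each keyed by Mains ID (names and values are illustrative).
assets = {
    "M001": {"material": "cast_iron", "install_year": 1950},
    "M002": {"material": "pvc", "install_year": 1985},
    "M003": {"material": "ductile_iron", "install_year": 1972},
}
soil = {"M001": {"soil_corrosivity": 0.8}, "M003": {"soil_corrosivity": 0.3}}
population = {"M001": {"population_served": 1200}, "M002": {"population_served": 300}}


def build_grid(assets, *feature_tables):
    """Left-join each feature table onto the complete asset list by Mains ID."""
    # Collect every column name first so that all rows share the same shape.
    columns = []
    for table in (assets, *feature_tables):
        for record in table.values():
            for name in record:
                if name not in columns:
                    columns.append(name)

    grid = []
    for mains_id, base in assets.items():
        # Start with every column empty, then fill in whatever is known.
        row = {"mains_id": mains_id, **{c: None for c in columns}}
        row.update(base)
        for table in feature_tables:
            row.update(table.get(mains_id, {}))
        grid.append(row)
    return grid


grid = build_grid(assets, soil, population)
for row in grid:
    print(row)
```

Starting from the asset list keeps one row per main, and a feature that is missing for a main shows up as an explicit `None` rather than a dropped row.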

The grid of data was uploaded to Azure ML Studio. From there we trained models using Linear Regression and Bayesian Linear Regression to make our prediction. Publishing the model as a web service was an automatic process. We wrote a simple WPF client to query it, although we didn't have time to implement the design.
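For readers unfamiliar with what a linear regression module does with such a grid, here is a minimal ordinary least squares sketch in plain Python. It is not the Azure ML Studio implementation, and it does not cover the Bayesian variant; the two features (soil corrosivity and pipe diameter) and all the numbers are made up for illustration.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    X = [[1.0, *row] for row in rows]  # prepend a constant column for the intercept
    n = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(weights, row):
    """Intercept plus the weighted sum of the feature values."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Hypothetical training grid: [soil_corrosivity, diameter_mm] -> lifespan in years.
rows = [[0.1, 100.0], [0.5, 150.0], [0.9, 100.0], [0.3, 200.0]]
targets = [82.0, 72.5, 58.0, 81.0]

weights = fit_linear_regression(rows, targets)
print(predict(weights, [0.7, 120.0]))
```

In this toy data a higher soil corrosivity lowers the fitted lifespan, which is the kind of relationship we hoped the real model would pick up.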

Challenges we ran into

The first model wouldn't finish training. With some help from the Microsoft team, we came to realise that skewed values in the data were confusing the calculation. We cleaned the data but were not happy with the results. The problem was that 10% of our data represented burst pipes, each with a burst year, but the pipes that had not yet burst of course have no value for that year. These null values were not helping.

The next step was to include only records of actual bursts. This approach produced results, but all the predictions were in the past because, as far as the ML algorithm could see from the selection of records, "all pipes always burst between 2000 & 2015", which is clearly not true.

Microsoft recommended taking a predictive maintenance approach: rather than predict the year in which the pipe would burst, predict its eventual lifespan. Initially this had the same problem, but we thought of a way to utilise the missing 90% of records. We attributed the manufacturer's expected lifespan for each pipe material to each pipe, and made a column containing the actual life of known burst pipes, or the expected life of a pipe that hasn't yet burst. Although this is still slightly flawed, given that the expected lifespan is not entirely accurate, the predictive model was outputting believable results and was factoring in features such as the corrosivity of the soil.
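The lifespan column described above can be sketched like this. It is a minimal illustration in Python rather than the Excel and SQL Server steps we used, and the field names and expected-life figures are hypothetical placeholders, not real manufacturer data.

```python
# Hypothetical expected lifespans in years per pipe material (illustrative values only).
EXPECTED_LIFE_YEARS = {"cast_iron": 100, "ductile_iron": 80, "pvc": 60}


def lifespan_target(pipe):
    """Actual life for pipes with a known burst, expected life for the rest."""
    if pipe.get("burst_year") is not None:
        return pipe["burst_year"] - pipe["install_year"]
    return EXPECTED_LIFE_YEARS[pipe["material"]]


pipes = [
    {"mains_id": "M001", "material": "cast_iron", "install_year": 1950, "burst_year": 2008},
    {"mains_id": "M002", "material": "pvc", "install_year": 1985, "burst_year": None},
]

for pipe in pipes:
    pipe["lifespan_years"] = lifespan_target(pipe)
    # Flag which rows are observed and which are filled in from the expected life.
    pipe["is_observed_burst"] = pipe["burst_year"] is not None
    print(pipe["mains_id"], pipe["lifespan_years"], pipe["is_observed_burst"])
```

The extra `is_observed_burst` flag is our own addition for the sketch: it makes it easy to check later how much of the training target is real and how much is assumed.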

Accomplishments that we are proud of

  • Coming up with a viable concept / proposal of practical benefit
  • A working predictive model (albeit one whose accuracy can't yet be relied on) despite never having done ML before

What we learned

  • Importance of having an experienced data scientist to consult with regarding methodology
  • Machine Learning "doesn't like gaps"

What's next for our solution

  • Integrate weather and flow profiles
  • Gather more data (maybe cross-company) on bursts to better estimate Time to Intervene
  • Refine model
  • Validate model

Built With

  • Excel & SQL Server for manual data cleansing
  • Azure ML Studio for Machine Learning

Built With

  • azure-ml
  • data
  • excel
  • invision
  • sql-server
  • webapi