To predict the nacelle weight and blade weight, we first tried fitting a simple linear model by minimising mean squared error. However, given the many missing values and the scarcity of the data, we found this difficult. We were wary of using more powerful techniques for fear of overfitting, so we decided to try generating some of our own data instead.

Extensive internet searching didn't help, and we couldn't find any extra datasets online. It also didn't help that no dataset description was available, so we didn't know what many of the column names meant. With nothing else to turn to, we decided to try to mimic the data using Generative Adversarial Networks (GANs).

What it does

Given some information about wind turbines, we predict the nacelle weight and blade weight of the turbine.

How we built it

Since the data provided was quite limited and contained missing values, we first imputed the missing values. This gave us more information to work with when training the models. We then trained a GAN to generate more data. Finally, we fitted a linear regression model on the generated and original data to predict the nacelle and blade weights.
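As a rough illustration of the imputation and regression steps (the GAN step is left out), here is a minimal sketch. It assumes mean imputation and a single feature; the feature name `rotor_diameter_m` and all the numbers are made up for the example and are not taken from our dataset.

```python
from statistics import mean

# Hypothetical toy rows: (rotor_diameter_m, nacelle_weight); None marks a missing value.
rows = [(80.0, 60.0), (None, 70.0), (100.0, 80.0), (120.0, 100.0)]

def impute_mean(values):
    """Replace None entries with the mean of the observed entries (an assumed strategy)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, i.e. minimum squared error."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

xs = impute_mean([r[0] for r in rows])  # fill the missing feature value
ys = [r[1] for r in rows]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With more than one feature the same idea applies, but the fit would normally be done with a linear algebra or machine learning library rather than by hand.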

Challenges we ran into

The data was very limited, and we had trouble generating good data. There is a balance to strike: a more complex generator is more likely to overfit the few real samples we had. To guard against this, we compared the distribution of the generated data with that of the real data to check that they were similar.
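One possible way to make this comparison concrete for a single numeric column is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distribution functions. This is only a sketch of one such check, not necessarily the comparison we ran; the function name `ks_statistic` and the sample values are illustrative.

```python
from bisect import bisect_right

def ks_statistic(real, generated):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = fully separated)."""
    real, generated = sorted(real), sorted(generated)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for point in sorted(set(real) | set(generated)):
        cdf_real = bisect_right(real, point) / len(real)
        cdf_gen = bisect_right(generated, point) / len(generated)
        gap = max(gap, abs(cdf_real - cdf_gen))
    return gap

# Made-up values standing in for one column of real and generated data.
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A small statistic suggests the generated column follows a similar distribution to the real one, although with very few real samples the check is weak.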

Accomplishments that we're proud of

The final linear regression with the extra generated data gave the following results:

R2 score:
train MAE:
train RMSE:

These results are based on the rows where the nacelle weight was filled in. For the single blade weight:

R2 score:
train MAE:
train RMSE:
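For reference, the three metrics above can be computed from the true and predicted weights as in the following sketch. The function name `regression_metrics` and the sample numbers are illustrative only and are not our results.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return R2, mean absolute error and root mean squared error."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    y_bar = sum(y_true) / n
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": sqrt(ss_res / n),
    }

# Made-up values: predicting the mean everywhere gives an R2 of zero.
print(regression_metrics([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```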

We also obtained confidence intervals based on the standard deviation of the residuals of our estimates. An example of a 95% confidence interval for the first missing nacelle weight, given as (lower bound, estimate, upper bound), would be:

(39.21, 81.43, 123.66)

And for the first missing single blade weight:

(11.09, 19.65, 28.20)
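Intervals of this shape can be produced as in the sketch below. It assumes roughly normal residuals, so that a 95% interval is the estimate plus or minus 1.96 residual standard deviations; the function name `residual_interval`, the multiplier and the sample residuals are assumptions for illustration.

```python
from statistics import stdev

def residual_interval(prediction, residuals, z=1.96):
    """Return (lower, estimate, upper) using the sample std of the residuals."""
    s = stdev(residuals)
    return (prediction - z * s, prediction, prediction + z * s)

# Made-up residuals with a sample standard deviation of exactly 1.
print(residual_interval(10.0, [-1.0, 0.0, 1.0]))
```

Under the same assumption, the half-width of about 42.2 in the nacelle example would correspond to a residual standard deviation of roughly 21.5.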

What we learned

The data had a lot of noise, and it was difficult to decide which variables to exclude and which variables were independent. It was hard to test for independence at all for the Regions and Operator categories, since they were either particularly unbalanced or had too few data points.

What's next for Wind turbine project

The results we obtained weren't spectacular at all. A simple linear regression model was the best we could do. Had we collected more data or had more information about it, further data preprocessing and feature extraction would likely have improved our model's performance.
