⚡ ML Coursework · Plant 1, India · 34-day dataset
Solar Power Prediction with Machine Learning
Predicting AC power output from irradiation and temperature using Random Forest, XGBoost, and a PyTorch FFNN — with Optuna Bayesian tuning and GroupKFold time-series validation.
0.9965
Best R²
278
Best MAE (kW)
3
Models tested
750
Fold evaluations
3182
Data points
Exploratory Data Analysis
Dataset structure & distributions
Two CSV files merged after timestamp alignment: weather-sensor data (plant-level, 3182 rows) and power-generation data (inverter-aggregated). 15-minute intervals over 34 days, May–June 2020.
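A minimal sketch of the merge step, assuming Kaggle-style Plant-1 column names with `DATE_TIME` as the join key; the tiny in-memory frames below stand in for the real CSVs:

```python
import pandas as pd

# Stand-in for the weather-sensor CSV (plant-level readings).
weather = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 06:00", "2020-05-15 06:15"]),
    "AMBIENT_TEMPERATURE": [22.7, 22.9],
    "MODULE_TEMPERATURE": [21.1, 21.8],
    "IRRADIATION": [0.02, 0.05],
})
# Stand-in for the power-generation CSV (inverter-aggregated).
generation = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 06:00", "2020-05-15 06:15"]),
    "AC_POWER": [512.3, 804.7],
    "DC_POWER": [5230.1, 8190.4],
})
# Inner join keeps only timestamps present in both files, which is how
# intervals missing from one side (e.g. the 25 night-time gaps) drop out.
df = generation.merge(weather, on="DATE_TIME", how="inner")
```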
| Stat | Ambient (°C) | Module (°C) | Irradiation |
|---|---|---|---|
| count | 3182 | 3182 | 3182 |
| mean | 25.53 | 31.09 | 0.228 |
| std | 3.35 | 12.26 | 0.301 |
| min | 20.40 | 18.14 | 0.000 |
| 25% | 22.71 | 21.09 | 0.000 |
| 50% | 24.61 | 24.62 | 0.025 |
| 75% | 27.97 | 40.81 | 0.455 |
| max | 35.67 | 66.69 | 0.994 |
| Stat | AC Power (kW) | DC Power (kW) |
|---|---|---|
| count | 3182 | 3182 |
| mean | 6652.15 | 68021.36 |
| std | 8590.77 | 87917.83 |
| min | 0.00 | 0.00 |
| 25% | 0.00 | 0.00 |
| 50% | 726.01 | 7512.33 |
| 75% | 13649.68 | 139358.13 |
| max | 29150.21 | 298937.79 |
AC Power distribution (target) — bimodal: night zeros + daytime peak
Irradiation distribution — spike at 0 confirms night-time dominance
Module temperature distribution — strong right skew
Ambient temperature distribution — symmetric, centred ~25°C
Correlation with AC Power — Irradiation r=1.00, Module r=0.96, Ambient r=0.73
Irradiation vs AC Power scatter — near-linear relationship
Module temperature vs AC Power — r = 0.96
Ambient temperature vs AC Power — r = 0.73
Average AC power by hour — solar bell curve (daytime only)
Missing timestamps by hour — 25 total, all at night (irradiation ≈ 0)
Feature relevance — why PLANT_ID & SOURCE_KEY were dropped (zero variance / non-informative)
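The screening step can be sketched as below; the column names follow the Kaggle files and the `drop_uninformative` helper is illustrative, not the notebook's own code. `PLANT_ID` is constant for a single plant (zero variance), and `SOURCE_KEY` is just a sensor identifier, so neither can help a regressor:

```python
import pandas as pd

def drop_uninformative(df: pd.DataFrame) -> pd.DataFrame:
    # Constant columns carry zero variance, hence zero signal.
    constant = [c for c in df.columns if df[c].nunique() <= 1]
    # Identifier columns are labels, not physical measurements.
    identifiers = [c for c in ("SOURCE_KEY",) if c in df.columns]
    return df.drop(columns=constant + identifiers)

demo = pd.DataFrame({
    "PLANT_ID": [4135001] * 3,       # constant -> dropped
    "SOURCE_KEY": ["a", "b", "c"],   # sensor ID -> dropped
    "IRRADIATION": [0.0, 0.2, 0.6],  # informative -> kept
})
kept = drop_uninformative(demo).columns.tolist()
```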
Interactive 3D — Three.js
Solar panel simulation
Drag to orbit, scroll to zoom. Day cycle shows panels reacting to sun position. Data view shows the irradiation→AC power scatter cloud in 3D space.
Drag · Orbit · Scroll to zoom
Model Selection & Hyperparameter Tuning
Optuna Bayesian tuning — 50 trials each
GroupKFold(5) cross-validation embedded inside each Optuna trial. Best parameters saved and used for the final repeated evaluation: 50 repeats × 5 folds, with R², MAE, and RMSE logged for each fold (750 recorded scores per model).
🌲 Random Forest — best hyperparameters
n_estimators
1009
max_depth
42
min_samples_split
5
min_samples_leaf
4
max_features
0.7865
bootstrap
True
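The winning configuration as it would be instantiated for the final repeated evaluation (the `random_state` and `n_jobs` values are assumptions; the notebook's seed is not listed here):

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=1009,
    max_depth=42,
    min_samples_split=5,
    min_samples_leaf=4,
    max_features=0.7865,   # fraction of features considered per split
    bootstrap=True,
    n_jobs=-1,             # assumption: parallel fitting
    random_state=42,       # assumption: seed not shown in the tables
)
```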
⚡ XGBoost — best hyperparameters
n_estimators
254
max_depth
12
learning_rate
0.1460
subsample
0.531
gamma
0.157
reg_lambda
0.520
🧠 FFNN — best hyperparameters (PyTorch)
hidden_dim
126
hidden_layers
3
dropout
0.091
activation
leaky_relu
learning_rate
3.32e-4
epochs
108
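A sketch of the tuned architecture in PyTorch. The three input features (ambient temperature, module temperature, irradiation) are an assumption from the EDA, and the builder function is illustrative rather than the notebook's exact code:

```python
import torch.nn as nn

def make_ffnn(in_features=3, hidden_dim=126, hidden_layers=3, dropout=0.091):
    # Stack hidden_layers blocks of Linear -> LeakyReLU -> Dropout,
    # then a single linear output head for the AC-power regression target.
    layers, width = [], in_features
    for _ in range(hidden_layers):
        layers += [nn.Linear(width, hidden_dim), nn.LeakyReLU(), nn.Dropout(dropout)]
        width = hidden_dim
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)

model = make_ffnn()
```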
Optuna trial progression — best R² per trial (RF vs XGBoost vs FFNN)
Statistical Analysis
750 evaluations per model — real notebook outputs
50 repeats × 5 GroupKFold splits per model, each split logging R², MAE, and RMSE (750 recorded scores per model). The Friedman test confirms significant differences between the models.
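The Friedman test itself is a one-liner with scipy; the synthetic score columns below are stand-ins for the notebook's real per-fold R² tables, drawn roughly around each model's reported confidence interval:

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(1)
# Illustrative score distributions, NOT the real notebook outputs.
rf_r2   = rng.normal(0.993, 0.001, 750)
xgb_r2  = rng.normal(0.977, 0.004, 750)
ffnn_r2 = rng.normal(0.970, 0.010, 750)

# Non-parametric test for differences across the matched score columns.
stat, p = friedmanchisquare(rf_r2, xgb_r2, ffnn_r2)
```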
🏆 Winner
🌲
Random Forest
n=1009 trees, max_depth=42
R² 95% CI: (0.9930, 0.9935)
MAE 95% CI: (322.7, 329.7)
RMSE 95% CI: (666.2, 684.5)
Best R²: 0.9965
ANOVA F-stat: 1193.67 (p ≈ 0)
⚡
XGBoost
Gradient boosted trees
R² 95% CI: (0.9754, 0.9779)
MAE 95% CI: (844.3, 898.0)
RMSE 95% CI: (1185.5, 1251.1)
Best R²: 0.9958
ANOVA F-stat: 4.38 (p = 0.0017)
🧠
FFNN
PyTorch · leaky_relu · 3 layers
R² 95% CI: (0.9611, 0.9804)
MAE 95% CI: (551.0, 689.4)
RMSE 95% CI: (872.4, 1030.7)
Best R²: 0.9971
ANOVA F-stat: 0.12 (p = 0.9747)
RF — mean R² per fold (50 runs)
XGBoost — mean R² per fold
FFNN — mean R² per fold
Mean R² — higher is better
Mean MAE — lower is better
Mean RMSE — lower is better
Friedman Test (R²)
Statistic χ²: 315.70
p-value: 2.79 × 10⁻⁶⁹
Conclusion: Significant ✓
ANOVA — RF folds
F-statistic: 1193.67
p-value: ≈ 0.0000
Paired t (F1 vs F2): t = −248.39
ANOVA — XGBoost folds
F-statistic: 4.38
p-value: 0.0017
Paired t (F1 vs F2): t = −12.77
Fitting & Generalisation
Cross-validation results — overfitting diagnosis
GroupKFold(5) on the best RF hyperparameters. The mean train-vs-test R² gap is 0.0033, indicating a well-balanced model with no significant overfitting.
| Fold | Train R² | Test R² | Train MAE | Test MAE | Train RMSE | Test RMSE |
|---|---|---|---|---|---|---|
| 1 | 0.997822 | 0.988751 | 192.37 | 348.13 | 407.11 | 839.41 |
| 2 | 0.997485 | 0.992985 | 200.67 | 313.62 | 434.70 | 692.52 |
| 3 | 0.997199 | 0.995316 | 203.45 | 292.53 | 454.41 | 588.52 |
| 4 | 0.996959 | 0.996533 | 209.00 | 277.56 | 467.29 | 529.73 |
| 5 | 0.996986 | 0.996439 | 209.37 | 272.14 | 465.73 | 535.40 |
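The headline gap can be recomputed directly from the fold table above:

```python
# Per-fold R² values copied from the cross-validation table.
train_r2 = [0.997822, 0.997485, 0.997199, 0.996959, 0.996986]
test_r2  = [0.988751, 0.992985, 0.995316, 0.996533, 0.996439]

# Overfitting diagnosis: mean train R² minus mean test R².
gap = sum(train_r2) / 5 - sum(test_r2) / 5
print(round(gap, 4))  # → 0.0033
```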
0.9973
Mean Train R²
All folds ≥ 0.9969
0.9940
Mean Test R²
Gap = 0.0033 — balanced ✓
Train vs Test R² per fold — gap stays below 0.01
Predicted vs Actual — RF best fold (test set)
Residual plot — RF model (residuals centred on zero, no systematic pattern)
Interactive Tool
Live AC Power predictor
Adjust weather inputs to simulate the Random Forest model output. Based on actual feature correlations from EDA: irradiation dominates (r=1.00), module temp second (r=0.96).
Predicted AC output
14820
kW
RF model R² 0.9965
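A hedged sketch of what such a demo widget might compute: an illustrative linear approximation driven by the EDA correlations, not the trained Random Forest. The scale factor and derating coefficient are invented for the example:

```python
def predict_ac_kw(irradiation: float, module_temp: float) -> float:
    """Toy AC-power estimate in kW from normalised irradiation and
    module temperature (°C). Illustrative coefficients only."""
    # Irradiation dominates (r ≈ 1.00): scale toward the plant's
    # observed peak of roughly 29,150 kW near irradiation ≈ 1.
    base = irradiation * 29300
    # Mild derating above 25 °C reflects the weaker module-temp effect.
    derating = 1 - 0.002 * max(module_temp - 25, 0)
    return max(base * derating, 0.0)
```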
Open Source
Full source code on GitHub
Machine_Learning_Algorithm_Solar_Power_Prediction_TimeSeries
Complete Jupyter notebook — preprocessing, EDA, Optuna tuning, RF/XGBoost/FFNN training, GroupKFold validation, statistical tests. Open source.
Python
Jupyter
scikit-learn
PyTorch
Optuna
XGBoost