⚡ ML Coursework · Plant 1, India · 34-day dataset

Solar Power Prediction with Machine Learning

Predicting AC power output from irradiation and temperature using Random Forest, XGBoost, and a PyTorch FFNN — with Optuna Bayesian tuning and GroupKFold time-series validation.

Best R²: 0.9965 · Best MAE: 278 kW · Models tested: 3 · Fold evaluations: 750 · Data points: 3182

Exploratory Data Analysis
Dataset structure & distributions
Two CSV files merged after timestamp alignment: a plant-level weather-sensor file (3182 rows) and an inverter-level power-generation file aggregated to plant level. Readings are at 15-minute intervals over 34 days, May–June 2020.
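A minimal sketch of the merge step described above, using tiny synthetic DataFrames in place of the real CSVs (the column names here are assumptions, not the notebook's exact schema):

```python
import pandas as pd

# Synthetic stand-ins for the weather-sensor and generation CSVs.
weather = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 12:00", "2020-05-15 12:15"]),
    "IRRADIATION": [0.65, 0.70],
    "MODULE_TEMPERATURE": [40.0, 41.2],
})
power = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 12:00", "2020-05-15 12:00",
                                 "2020-05-15 12:15", "2020-05-15 12:15"]),
    "SOURCE_KEY": ["inv1", "inv2", "inv1", "inv2"],
    "AC_POWER": [7000.0, 7100.0, 7200.0, 7300.0],
})

# Aggregate per-inverter generation to plant level for each 15-minute stamp,
# then inner-join on the aligned timestamps.
power_agg = power.groupby("DATE_TIME", as_index=False)["AC_POWER"].sum()
df = weather.merge(power_agg, on="DATE_TIME", how="inner")
print(df.shape)  # (2, 4)
```

An inner join keeps only intervals present in both files, which is one way the 25 missing night-time timestamps end up excluded.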
| Stat | Ambient (°C) | Module (°C) | Irradiation |
|---|---|---|---|
| count | 3182 | 3182 | 3182 |
| mean | 25.53 | 31.09 | 0.228 |
| std | 3.35 | 12.26 | 0.301 |
| min | 20.40 | 18.14 | 0.000 |
| 25% | 22.71 | 21.09 | 0.000 |
| 50% | 24.61 | 24.62 | 0.025 |
| 75% | 27.97 | 40.81 | 0.455 |
| max | 35.67 | 66.69 | 0.994 |
| Stat | AC Power (kW) | DC Power (kW) |
|---|---|---|
| count | 3182 | 3182 |
| mean | 6652.15 | 68021.36 |
| std | 8590.77 | 87917.83 |
| min | 0.00 | 0.00 |
| 25% | 0.00 | 0.00 |
| 50% | 726.01 | 7512.33 |
| 75% | 13649.68 | 139358.13 |
| max | 29150.21 | 298937.79 |
AC Power distribution (target) — bimodal: night zeros + daytime peak
Irradiation distribution — spike at 0 confirms night-time dominance
Module temperature distribution — strong right skew
Ambient temperature distribution — symmetric, centred ~25°C
Correlation with AC Power — Irradiation r=1.00, Module r=0.96, Ambient r=0.73
Irradiation vs AC Power scatter — near-linear relationship
Module temperature vs AC Power — r = 0.96
Ambient temperature vs AC Power — r = 0.73
Average AC power by hour — solar bell curve (daytime only)
Missing timestamps by hour — 25 total, all at night (irradiation ≈ 0)
Feature relevance — why PLANT_ID & SOURCE_KEY were dropped (zero variance / non-informative)

Interactive 3D — Three.js
Solar panel simulation
Drag to orbit, scroll to zoom. Day cycle shows panels reacting to sun position. Data view shows the irradiation→AC power scatter cloud in 3D space.

Model Selection & Hyperparameter Tuning
Optuna Bayesian tuning — 50 trials each
GroupKFold(5) cross-validation embedded inside each Optuna trial. The best parameters are saved and reused for the final repeated evaluation (50 repeats × 5 folds = 250 rows per model, 750 in total).
🌲 Random Forest — best hyperparameters
n_estimators: 1009
max_depth: 42
min_samples_split: 5
min_samples_leaf: 4
max_features: 0.7865
bootstrap: True
⚡ XGBoost — best hyperparameters
n_estimators: 254
max_depth: 12
learning_rate: 0.1460
subsample: 0.531
gamma: 0.157
reg_lambda: 0.520
🧠 FFNN — best hyperparameters (PyTorch)
hidden_dim: 126
hidden_layers: 3
dropout: 0.091
activation: leaky_relu
learning_rate: 3.32e-4
epochs: 108
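The tuned FFNN architecture above can be assembled in PyTorch roughly like this (an input dimension of 3, for the three weather features, is an assumption):

```python
import torch
import torch.nn as nn

def make_ffnn(in_dim=3, hidden_dim=126, hidden_layers=3, dropout=0.091):
    """Build a feed-forward regressor matching the tuned hyperparameters above."""
    layers, d = [], in_dim
    for _ in range(hidden_layers):
        layers += [nn.Linear(d, hidden_dim), nn.LeakyReLU(), nn.Dropout(dropout)]
        d = hidden_dim
    layers.append(nn.Linear(d, 1))  # single regression output: AC power
    return nn.Sequential(*layers)

model = make_ffnn()
out = model(torch.randn(8, 3))  # batch of 8 synthetic weather samples
print(out.shape)  # torch.Size([8, 1])
```

Training would use Adam at the tuned learning rate (3.32e-4) for 108 epochs with an MSE loss, per the listed hyperparameters.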
Optuna trial progression — best R² per trial (RF vs XGBoost vs FFNN)

Statistical Analysis
750 fold evaluations — real notebook outputs
50 repeats × 5 GroupKFold splits = 250 rows of R², MAE, and RMSE per model (750 across the three models). A Friedman test confirms significant differences between the models.
🏆 Winner: 🌲 Random Forest (n_estimators=1009, max_depth=42)
Mean R²: 0.9933 · Mean MAE: 326.22 kW · Mean RMSE: 675.34 kW
95% CI: R² (0.9930, 0.9935) · MAE (322.7, 329.7) · RMSE (666.2, 684.5)
Best R²: 0.9965 · ANOVA F-stat: 1193.67 (p ≈ 0)
⚡ XGBoost (gradient-boosted trees)
Mean R²: 0.9767 · Mean MAE: 871.15 kW · Mean RMSE: 1218.27 kW
95% CI: R² (0.9754, 0.9779) · MAE (844.3, 898.0) · RMSE (1185.5, 1251.1)
Best R²: 0.9958 · ANOVA F-stat: 4.38 (p = 0.0017)
🧠 FFNN (PyTorch · leaky_relu · 3 hidden layers)
Mean R²: 0.9708 · Mean MAE: 620.18 kW · Mean RMSE: 951.51 kW
95% CI: R² (0.9611, 0.9804) · MAE (551.0, 689.4) · RMSE (872.4, 1030.7)
Best R²: 0.9971 · ANOVA F-stat: 0.12 (p = 0.9747)
RF — mean R² per fold (50 runs)
XGBoost — mean R² per fold
FFNN — mean R² per fold
Mean R² — higher is better
Mean MAE — lower is better
Mean RMSE — lower is better

Friedman Test (R²)
χ² statistic: 315.70
p-value: 2.79 × 10⁻⁶⁹
Conclusion: significant ✓

ANOVA — RF folds
F-statistic: 1193.67
p-value: ≈ 0.0000
Paired t (fold 1 vs fold 2): t = −248.39

ANOVA — XGBoost folds
F-statistic: 4.38
p-value: 0.0017
Paired t (fold 1 vs fold 2): t = −12.77
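All three tests can be reproduced with SciPy; the arrays below are synthetic stand-ins for the notebook's fold-level R² scores, generated to match the reported means:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic per-evaluation R² scores, matching the reported model means.
rf = rng.normal(0.9933, 0.001, 250)
xgb = rng.normal(0.9767, 0.005, 250)
ffnn = rng.normal(0.9708, 0.02, 250)

# Friedman test: do the three models differ across paired evaluations?
chi2, p = stats.friedmanchisquare(rf, xgb, ffnn)
print(p < 0.05)  # True (significant difference)

# One-way ANOVA across RF folds, plus a paired t-test between two folds.
fold_means = (0.9888, 0.9930, 0.9953, 0.9965, 0.9964)
folds = [rng.normal(m, 0.0005, 50) for m in fold_means]
f_stat, p_anova = stats.f_oneway(*folds)
t_stat, p_t = stats.ttest_rel(folds[0], folds[1])
```

The Friedman test is the right choice here because the 250 evaluations are paired across models (same fold, same repeat), violating the independence assumption of a plain ANOVA between models.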

Fitting & Generalisation
Cross-validation results — overfitting diagnosis
GroupKFold(5) on the best RF hyperparameters. The train vs test R² gap of 0.0033 indicates a well-balanced model with no significant overfitting.
| Fold | Train R² | Test R² | Train MAE | Test MAE | Train RMSE | Test RMSE |
|---|---|---|---|---|---|---|
| 1 | 0.997822 | 0.988751 | 192.37 | 348.13 | 407.11 | 839.41 |
| 2 | 0.997485 | 0.992985 | 200.67 | 313.62 | 434.70 | 692.52 |
| 3 | 0.997199 | 0.995316 | 203.45 | 292.53 | 454.41 | 588.52 |
| 4 | 0.996959 | 0.996533 | 209.00 | 277.56 | 467.29 | 529.73 |
| 5 | 0.996986 | 0.996439 | 209.37 | 272.14 | 465.73 | 535.40 |
Mean Train R²: 0.9973 (all folds ≥ 0.9969)
Mean Test R²: 0.9940 (gap = 0.0033, balanced ✓)
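The gap diagnosis itself is simple to compute: score each fold on both its training and held-out groups and compare the means. A sketch on synthetic data (not the plant dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(1)
X = rng.random((300, 3))
y = 30000 * X[:, 0] + rng.normal(0, 500, 300)
groups = np.repeat(np.arange(30), 10)

train_r2, test_r2 = [], []
for tr, te in GroupKFold(n_splits=5).split(X, y, groups):
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[tr], y[tr])
    train_r2.append(r2_score(y[tr], model.predict(X[tr])))
    test_r2.append(r2_score(y[te], model.predict(X[te])))

# A small positive gap means the model generalises; a large one means overfitting.
gap = np.mean(train_r2) - np.mean(test_r2)
print(round(gap, 4))
```

With group-aware splits the test score is an honest estimate, so a gap on the order of 0.003 (as reported above) supports the "no significant overfitting" conclusion.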
Train vs Test R² per fold — gap stays below 0.01
Predicted vs Actual — RF best fold (test set)
Residual plot — RF model (residuals centred on zero, no systematic pattern)

Interactive Tool
Live AC Power predictor
Adjust weather inputs to simulate the Random Forest model output. Based on actual feature correlations from EDA: irradiation dominates (r=1.00), module temp second (r=0.96).

Weather inputs
☀ Irradiation (kW/m²): 0.65
🌡 Ambient Temperature (°C): 28.0
🔲 Module Temperature (°C): 40.0
Dominant feature: Irradiation
Model R² (test): 0.9965
Best fold MAE: 278.19 kW
Best fold RMSE: 529.76 kW
Predicted AC output: 14820 kW
RF model confidence: 0.9965
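A plausible heuristic behind such a widget (an illustrative assumption, not the notebook's fitted model) scales irradiation toward the observed peak AC power, with a small derating for module temperature above a reference point:

```python
def predict_ac_power(irradiation, module_temp, ambient_temp,
                     peak_kw=29150.21, temp_coeff=-0.004, ref_temp=25.0):
    """Toy linear approximation of the RF model's response surface.

    Irradiation drives output almost linearly (r ≈ 1.00 in the EDA); module
    temperature above ~25 °C derates output slightly. The coefficients are
    illustrative assumptions, not fitted values; peak_kw is the dataset max.
    """
    derate = 1.0 + temp_coeff * max(0.0, module_temp - ref_temp)
    return max(0.0, peak_kw * irradiation * derate)

print(round(predict_ac_power(0.65, 40.0, 28.0)))
```

The real page would instead evaluate the trained Random Forest (e.g. a serialised model loaded client- or server-side); this sketch only mirrors the feature ordering the EDA justifies.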

Open Source
Full source code on GitHub

Machine_Learning_Algorithm_Solar_Power_Prediction_TimeSeries

Complete Jupyter notebook — preprocessing, EDA, Optuna tuning, RF/XGBoost/FFNN training, GroupKFold validation, statistical tests. Open source.

Python
Jupyter
scikit-learn
PyTorch
Optuna
XGBoost