⚡ ML Coursework · Plant 1, India · 34-day dataset
Solar Power Prediction with Machine Learning
Predicting AC power output from irradiation and temperature using Random Forest, XGBoost, and a PyTorch FFNN — with Optuna Bayesian tuning and GroupKFold time-series validation.
0.9965
Best R²
278
Best MAE (kW)
3
Models tested
750
Fold evaluations
3182
Data points
Exploratory Data Analysis
Dataset structure & distributions
Two CSV files merged after timestamp alignment: weather-sensor data (plant-level, 3182 rows) and power-generation data (inverter-aggregated). 15-minute intervals over 34 days, May–June 2020.
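A minimal sketch of the merge step, assuming Kaggle-style Plant-1 column names with `DATE_TIME` as the join key; the tiny in-memory frames below stand in for the real CSVs:

```python
import pandas as pd

# Stand-in for the weather-sensor CSV (plant-level readings).
weather = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 06:00", "2020-05-15 06:15"]),
    "AMBIENT_TEMPERATURE": [22.7, 22.9],
    "MODULE_TEMPERATURE": [21.1, 21.8],
    "IRRADIATION": [0.02, 0.05],
})
# Stand-in for the power-generation CSV (inverter-aggregated).
generation = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 06:00", "2020-05-15 06:15"]),
    "AC_POWER": [512.3, 804.7],
    "DC_POWER": [5230.1, 8190.4],
})
# Inner join keeps only timestamps present in both files, which is how
# intervals missing from one side (e.g. the 25 night-time gaps) drop out.
df = generation.merge(weather, on="DATE_TIME", how="inner")
```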
| Stat | Ambient (°C) | Module (°C) | Irradiation |
|---|---|---|---|
| count | 3182 | 3182 | 3182 |
| mean | 25.53 | 31.09 | 0.228 |
| std | 3.35 | 12.26 | 0.301 |
| min | 20.40 | 18.14 | 0.000 |
| 25% | 22.71 | 21.09 | 0.000 |
| 50% | 24.61 | 24.62 | 0.025 |
| 75% | 27.97 | 40.81 | 0.455 |
| max | 35.67 | 66.69 | 0.994 |
| Stat | AC Power (kW) | DC Power (kW) |
|---|---|---|
| count | 3182 | 3182 |
| mean | 6652.15 | 68021.36 |
| std | 8590.77 | 87917.83 |
| min | 0.00 | 0.00 |
| 25% | 0.00 | 0.00 |
| 50% | 726.01 | 7512.33 |
| 75% | 13649.68 | 139358.13 |
| max | 29150.21 | 298937.79 |
AC Power distribution (target) — bimodal: night zeros + daytime peak
Irradiation distribution — spike at 0 confirms night-time dominance
Module temperature distribution — strong right skew
Ambient temperature distribution — symmetric, centred ~25°C
Correlation with AC Power — Irradiation r=1.00, Module r=0.96, Ambient r=0.73
Irradiation vs AC Power scatter — near-linear relationship
Module temperature vs AC Power — r = 0.96
Ambient temperature vs AC Power — r = 0.73
Average AC power by hour — solar bell curve (daytime only)
Missing timestamps by hour — 25 total, all at night (irradiation ≈ 0)
Feature relevance — why PLANT_ID & SOURCE_KEY were dropped (zero variance / non-informative)
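The screening step can be sketched as below; the column names follow the Kaggle files and the `drop_uninformative` helper is illustrative, not the notebook's own code. `PLANT_ID` is constant for a single plant (zero variance), and `SOURCE_KEY` is just a sensor identifier, so neither can help a regressor:

```python
import pandas as pd

def drop_uninformative(df: pd.DataFrame) -> pd.DataFrame:
    # Constant columns carry zero variance, hence zero signal.
    constant = [c for c in df.columns if df[c].nunique() <= 1]
    # Identifier columns are labels, not physical measurements.
    identifiers = [c for c in ("SOURCE_KEY",) if c in df.columns]
    return df.drop(columns=constant + identifiers)

demo = pd.DataFrame({
    "PLANT_ID": [4135001] * 3,       # constant -> dropped
    "SOURCE_KEY": ["a", "b", "c"],   # sensor ID -> dropped
    "IRRADIATION": [0.0, 0.2, 0.6],  # informative -> kept
})
kept = drop_uninformative(demo).columns.tolist()
```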
Interactive 3D — Three.js
Solar panel simulation
Drag to orbit, scroll to zoom. Day cycle shows panels reacting to sun position. Data view shows the irradiation→AC power scatter cloud in 3D space.
Drag · Orbit · Scroll to zoom
Model Selection & Hyperparameter Tuning
Optuna Bayesian tuning — 50 trials each
GroupKFold(5) cross-validation embedded inside each Optuna trial. Best parameters saved and used for the final repeated evaluation: 50 repeats × 5 folds, with R², MAE, and RMSE logged for each fold (750 recorded scores per model).
🌲 Random Forest — best hyperparameters
n_estimators
1009
max_depth
42
min_samples_split
5
min_samples_leaf
4
max_features
0.7865
bootstrap
True
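The winning configuration as it would be instantiated for the final repeated evaluation (the `random_state` and `n_jobs` values are assumptions; the notebook's seed is not listed here):

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=1009,
    max_depth=42,
    min_samples_split=5,
    min_samples_leaf=4,
    max_features=0.7865,   # fraction of features considered per split
    bootstrap=True,
    n_jobs=-1,             # assumption: parallel fitting
    random_state=42,       # assumption: seed not shown in the tables
)
```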
⚡ XGBoost — best hyperparameters
n_estimators
254
max_depth
12
learning_rate
0.1460
subsample
0.531
gamma
0.157
reg_lambda
0.520
🧠 FFNN — best hyperparameters (PyTorch)
hidden_dim
126
hidden_layers
3
dropout
0.091
activation
leaky_relu
learning_rate
3.32e-4
epochs
108
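A sketch of the tuned architecture in PyTorch. The three input features (ambient temperature, module temperature, irradiation) are an assumption from the EDA, and the builder function is illustrative rather than the notebook's exact code:

```python
import torch.nn as nn

def make_ffnn(in_features=3, hidden_dim=126, hidden_layers=3, dropout=0.091):
    # Stack hidden_layers blocks of Linear -> LeakyReLU -> Dropout,
    # then a single linear output head for the AC-power regression target.
    layers, width = [], in_features
    for _ in range(hidden_layers):
        layers += [nn.Linear(width, hidden_dim), nn.LeakyReLU(), nn.Dropout(dropout)]
        width = hidden_dim
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)

model = make_ffnn()
```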
Optuna trial progression — best R² per trial (RF vs XGBoost vs FFNN)
Statistical Analysis
750 evaluations per model — real notebook outputs
50 repeats × 5 GroupKFold splits per model, each split logging R², MAE, and RMSE (750 recorded scores per model). The Friedman test confirms significant differences between the models.
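The Friedman test itself is a one-liner with scipy; the synthetic score columns below are stand-ins for the notebook's real per-fold R² tables, drawn roughly around each model's reported confidence interval:

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(1)
# Illustrative score distributions, NOT the real notebook outputs.
rf_r2   = rng.normal(0.993, 0.001, 750)
xgb_r2  = rng.normal(0.977, 0.004, 750)
ffnn_r2 = rng.normal(0.970, 0.010, 750)

# Non-parametric test for differences across the matched score columns.
stat, p = friedmanchisquare(rf_r2, xgb_r2, ffnn_r2)
```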
🏆 Winner
🌲
Random Forest
n=1009 trees, max_depth=42
R² 95% CI: (0.9930, 0.9935)
MAE 95% CI: (322.7, 329.7)
RMSE 95% CI: (666.2, 684.5)
Best R²: 0.9965
ANOVA F-stat: 1193.67 (p ≈ 0)
⚡
XGBoost
Gradient boosted trees
R² 95% CI: (0.9754, 0.9779)
MAE 95% CI: (844.3, 898.0)
RMSE 95% CI: (1185.5, 1251.1)
Best R²: 0.9958
ANOVA F-stat: 4.38 (p = 0.0017)
🧠
FFNN
PyTorch · leaky_relu · 3 layers
R² 95% CI: (0.9611, 0.9804)
MAE 95% CI: (551.0, 689.4)
RMSE 95% CI: (872.4, 1030.7)
Best R²: 0.9971
ANOVA F-stat: 0.12 (p = 0.9747)
RF — mean R² per fold (50 runs)
XGBoost — mean R² per fold
FFNN — mean R² per fold
Mean R² — higher is better
Mean MAE — lower is better
Mean RMSE — lower is better
Friedman Test (R²)
Statistic χ²: 315.70
p-value: 2.79 × 10⁻⁶⁹
Conclusion: Significant ✓
ANOVA — RF folds
F-statistic: 1193.67
p-value: ≈ 0.0000
Paired t (F1 vs F2): t = −248.39
ANOVA — XGBoost folds
F-statistic: 4.38
p-value: 0.0017
Paired t (F1 vs F2): t = −12.77
Fitting & Generalisation
Cross-validation results — overfitting diagnosis
GroupKFold(5) on the best RF hyperparameters. The mean train-vs-test R² gap is 0.0033, indicating a well-balanced model with no significant overfitting.
| Fold | Train R² | Test R² | Train MAE | Test MAE | Train RMSE | Test RMSE |
|---|---|---|---|---|---|---|
| 1 | 0.997822 | 0.988751 | 192.37 | 348.13 | 407.11 | 839.41 |
| 2 | 0.997485 | 0.992985 | 200.67 | 313.62 | 434.70 | 692.52 |
| 3 | 0.997199 | 0.995316 | 203.45 | 292.53 | 454.41 | 588.52 |
| 4 | 0.996959 | 0.996533 | 209.00 | 277.56 | 467.29 | 529.73 |
| 5 | 0.996986 | 0.996439 | 209.37 | 272.14 | 465.73 | 535.40 |
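The headline gap can be recomputed directly from the fold table above:

```python
# Per-fold R² values copied from the cross-validation table.
train_r2 = [0.997822, 0.997485, 0.997199, 0.996959, 0.996986]
test_r2  = [0.988751, 0.992985, 0.995316, 0.996533, 0.996439]

# Overfitting diagnosis: mean train R² minus mean test R².
gap = sum(train_r2) / 5 - sum(test_r2) / 5
print(round(gap, 4))  # → 0.0033
```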
0.9973
Mean Train R²
All folds ≥ 0.9969
0.9940
Mean Test R²
Gap = 0.0033 — balanced ✓
Train vs Test R² per fold — gap stays below 0.01
Predicted vs Actual — RF best fold (test set)
Residual plot — RF model (residuals centred on zero, no systematic pattern)
Interactive Tool
Live AC Power predictor
Adjust weather inputs to simulate the Random Forest model output. Based on actual feature correlations from EDA: irradiation dominates (r=1.00), module temp second (r=0.96).
Predicted AC output
14820
kW
RF model R² 0.9965
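A hedged sketch of what such a demo widget might compute: an illustrative linear approximation driven by the EDA correlations, not the trained Random Forest. The scale factor and derating coefficient are invented for the example:

```python
def predict_ac_kw(irradiation: float, module_temp: float) -> float:
    """Toy AC-power estimate in kW from normalised irradiation and
    module temperature (°C). Illustrative coefficients only."""
    # Irradiation dominates (r ≈ 1.00): scale toward the plant's
    # observed peak of roughly 29,150 kW near irradiation ≈ 1.
    base = irradiation * 29300
    # Mild derating above 25 °C reflects the weaker module-temp effect.
    derating = 1 - 0.002 * max(module_temp - 25, 0)
    return max(base * derating, 0.0)
```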
Open Source
Full source code on GitHub
Machine_Learning_Algorithm_Solar_Power_Prediction_TimeSeries
Complete Jupyter notebook — preprocessing, EDA, Optuna tuning, RF/XGBoost/FFNN training, GroupKFold validation, statistical tests. Open source.
Python
Jupyter
scikit-learn
PyTorch
Optuna
XGBoost