From Drone Video to a 3D Gaussian Splat

This project takes drone video footage and turns it into a 3D Gaussian Splat that can be rendered from new viewpoints. The implementation follows the practical "canonical" pipeline most 3DGS systems use in the real world:

Video → frames → COLMAP (camera poses + sparse points) → 3D Gaussian Splatting training → novel-view rendering.

This post focuses on what Gaussian splatting is, how it works, why it matters, and how the Kaggle pipeline reproduces it end-to-end.

Github

What is 3D Gaussian Splatting?

3D Gaussian Splatting (3DGS) represents a scene as millions of tiny 3D ellipsoids ("Gaussians") instead of a mesh or a voxel grid.

Each Gaussian typically stores:

Position: the 3D center
Shape: a covariance (scale + rotation) defining an ellipsoid
Opacity: how much it contributes when projected
Color / appearance: often spherical harmonics (SH) coefficients for view-dependent color
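To make the "scale + rotation" parameterization concrete, here is a minimal sketch of how a covariance matrix can be assembled as Sigma = R S S^T R^T, where R comes from a unit quaternion and S is a diagonal matrix of per-axis scales. The `Gaussian` class and its field names are illustrative assumptions, not the layout used by any particular codebase.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    # Hypothetical container; field names are chosen for illustration only.
    position: tuple   # (x, y, z) center
    scale: tuple      # (sx, sy, sz) per-axis standard deviations
    rotation: tuple   # quaternion (w, x, y, z)
    opacity: float    # in [0, 1]
    sh_coeffs: list   # spherical-harmonics coefficients for color


def quat_to_matrix(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n  # normalize defensively
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(g):
    """Return the 3x3 covariance Sigma = (R S)(R S)^T of a Gaussian."""
    R = quat_to_matrix(g.rotation)
    # M = R @ S, where S = diag(scale): scale column j of R by scale[j].
    M = [[R[i][j] * g.scale[j] for j in range(3)] for i in range(3)]
    return [
        [sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]
```

Building the covariance from a rotation and a scale, rather than storing six free numbers, keeps it symmetric and positive semi-definite no matter how the optimizer updates the parameters.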

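At render time, the system projects ("splats") these ellipsoids onto the image plane and alpha-blends them in depth-sorted order. The reference implementation walks each pixel's splats front to back, which lets it stop as soon as the pixel is effectively opaque. The key is that the rendering is differentiable, so the Gaussian parameters can be optimized by gradient descent to match real images.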
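The blending step can be sketched for a single pixel as follows. This is a simplified illustration, not the tiled GPU rasterizer real systems use: it assumes each splat has already been projected and reduced to a hypothetical `(depth, alpha, rgb)` tuple for this pixel. Splats are visited nearest-first while tracking the remaining transmittance, which produces the same color as back-to-front "over" blending but allows early termination.

```python
def composite_pixel(splats, min_transmittance=1e-4):
    """Blend the splats covering one pixel.

    splats: list of (depth, alpha, (r, g, b)) tuples (hypothetical layout).
    Returns the blended color and the leftover transmittance.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through
    for _depth, alpha, rgb in sorted(splats, key=lambda s: s[0]):
        weight = alpha * transmittance
        for c in range(3):
            color[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:
            break  # pixel is effectively opaque; farther splats are hidden
    return tuple(color), transmittance


# A half-transparent red splat in front of a half-transparent blue one.
pixel, leftover = composite_pixel([
    (2.0, 0.5, (0.0, 0.0, 1.0)),
    (1.0, 0.5, (1.0, 0.0, 0.0)),
])
```

Every operation here is a multiply or an add, which is why gradients can flow from the pixel color back to each splat's opacity and color.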

Why this matters: compared to NeRF-style volumetric rendering, 3DGS is typically much faster to render while achieving high-quality novel views, making it attractive for interactive applications and real-time visualization.

How Gaussian Splatting works

A practical 3DGS pipeline rests on three core ideas:

1) Get cameras and a sparse 3D seed (SfM)

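Training compares renders against real photos, so it needs to know where each photo was taken from, and a single image cannot pin down 3D structure on its own. 3DGS therefore commonly starts by estimating: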

camera intrinsics/extrinsics
a sparse point cloud

This is usually done with Structure-from-Motion (SfM).
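As a rough sketch of the video → frames → COLMAP stage, the helper below only builds the command lines; it does not run them. The function name, directory layout, and the 2 fps sampling rate are assumptions made for illustration, and the choice of COLMAP's sequential matcher is a guess that suits ordered video frames, not necessarily what this project uses.

```python
from pathlib import Path


def build_sfm_commands(video, workdir, fps=2):
    """Return the ffmpeg and COLMAP commands for a basic SfM run.

    The layout (images/, database.db, sparse/) is a hypothetical choice.
    """
    work = Path(workdir)
    frames = work / "images"
    database = work / "database.db"
    sparse = work / "sparse"
    return [
        # Sample frames from the video at a fixed rate.
        ["ffmpeg", "-i", str(video), "-vf", f"fps={fps}",
         str(frames / "%05d.jpg")],
        # Detect and describe keypoints in every frame.
        ["colmap", "feature_extractor",
         "--database_path", str(database),
         "--image_path", str(frames)],
        # Match features between neighboring frames of the sequence.
        ["colmap", "sequential_matcher",
         "--database_path", str(database)],
        # Solve for camera poses and a sparse point cloud.
        ["colmap", "mapper",
         "--database_path", str(database),
         "--image_path", str(frames),
         "--output_path", str(sparse)],
    ]


commands = build_sfm_commands("flight.mp4", "work")
```

Each entry can be passed to `subprocess.run(cmd, check=True)` once the `images` and `sparse` directories exist. Sampling fewer frames speeds up matching, but too little overlap between consecutive frames can cause the reconstruction to fail.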

2) Initialize and optimize Gaussians

Sparse points become initial Gaussian centers (or guide initialization). Training iteratively:

renders the current Gaussians from known camera poses
compares the render to each training image (photometric loss)
backpropagates to update Gaussian parameters
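The render, compare, update loop can be illustrated with a deliberately tiny 1D toy. It assumes fixed Gaussian centers and widths and optimizes only the amplitudes against a mean-squared-error loss with a hand-derived gradient. Real 3DGS training updates positions, covariances, opacities, and SH coefficients through automatic differentiation, and the original method uses an L1 plus D-SSIM image loss, so treat this purely as a sketch of the loop's shape.

```python
import math


def basis(x, center, sigma):
    """Unnormalized 1D Gaussian bump."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def render(xs, centers, sigma, amps):
    """'Render' a 1D signal as a sum of weighted Gaussian bumps."""
    return [sum(a * basis(x, c, sigma) for a, c in zip(amps, centers))
            for x in xs]


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def fit_amplitudes(xs, target, centers, sigma=0.5, lr=0.5, steps=300):
    amps = [0.0] * len(centers)
    history = []
    for _ in range(steps):
        pred = render(xs, centers, sigma, amps)           # 1) render
        history.append(mse(pred, target))                 # 2) compare
        residual = [p - t for p, t in zip(pred, target)]
        # 3) gradient of the MSE with respect to each amplitude
        grads = [
            2.0 / len(xs) * sum(r * basis(x, c, sigma)
                                for r, x in zip(residual, xs))
            for c in centers
        ]
        amps = [a - lr * g for a, g in zip(amps, grads)]  # 4) update
    return amps, history


xs = [i * 0.1 for i in range(41)]
centers = [1.0, 2.0, 3.0]
true_amps = [1.0, 0.5, 0.8]
target = render(xs, centers, 0.5, true_amps)
fitted, losses = fit_amplitudes(xs, target, centers)
```

Because the target was generated from the same bumps, the fitted amplitudes approach the true ones and the loss falls toward zero. With real images the loss plateaus above zero, and the remaining error is what drives densification.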

3) Adaptive density control (densify / prune)

During training, the model adapts density:

densify (clone or split Gaussians) in under-reconstructed regions, which the original method flags by large view-space position gradients
prune redundant or near-transparent Gaussians to control memory

This "grow where needed" behavior is one reason 3DGS can capture sharp edges and fine structure efficiently.
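A minimal sketch of one density-control pass is shown below. It assumes each Gaussian is a plain dict with an isotropic scalar `scale` and an accumulated gradient magnitude `grad`. All threshold values, the shrink factor, and the deterministic split offset are illustrative placeholders; real implementations sample child positions from the parent Gaussian and work with anisotropic scales.

```python
def density_control(gaussians, grad_threshold=0.0002,
                    scale_threshold=0.01, min_opacity=0.005,
                    shrink=1.6):
    """Return a new list of Gaussians after pruning and densifying.

    gaussians: list of dicts with keys "position", "scale",
    "opacity", "grad" (a hypothetical layout for illustration).
    """
    result = []
    for g in gaussians:
        if g["opacity"] < min_opacity:
            continue  # prune: contributes almost nothing to any pixel
        if g["grad"] <= grad_threshold:
            result.append(g)  # region already well reconstructed
        elif g["scale"] <= scale_threshold:
            # Small Gaussian in a high-error region: clone it so the
            # optimizer can move the copies apart to cover more detail.
            result.extend([dict(g), dict(g)])
        else:
            # Large Gaussian in a high-error region: replace it with
            # two smaller children placed on either side of the parent.
            x, y, z = g["position"]
            for sign in (-1.0, 1.0):
                child = dict(g)
                child["scale"] = g["scale"] / shrink
                child["position"] = (x + sign * g["scale"], y, z)
                result.append(child)
    return result


scene = [
    {"position": (0.0, 0.0, 0.0), "scale": 0.005, "opacity": 0.001, "grad": 0.0},
    {"position": (1.0, 0.0, 0.0), "scale": 0.005, "opacity": 0.9, "grad": 0.0001},
    {"position": (2.0, 0.0, 0.0), "scale": 0.005, "opacity": 0.9, "grad": 0.001},
    {"position": (3.0, 0.0, 0.0), "scale": 0.05, "opacity": 0.9, "grad": 0.001},
]
updated = density_control(scene)
```

In this example the transparent Gaussian is dropped, the well-fit one is kept, the small high-gradient one is cloned, and the large high-gradient one is split, so four Gaussians become five.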

Why Gaussian Splatting matters

Gaussian splatting hits a practical sweet spot:

High visual quality for novel view synthesis
Real-time rendering potential on GPUs (interactive frame rates)
Avoids explicit meshing, which tends to struggle with thin structures, foliage, and specular surfaces
Useful representation for downstream tasks (localization, mapping, simulation) when paired with geometry/semantics

It's not the right tool for every 3D problem, but it is one of the most deployment-friendly learned scene representations today.

Real-world applications

Media, VFX, and interactive content

Fast capture of real environments for AR/VR walkthroughs
Rapid asset creation for games and film previsualization
View synthesis for cinematic shots without full mesh reconstruction

Digital twins and industrial inspection

Plant/factory walkthroughs from handheld/drone scans
Monitoring site changes over time (before/after scans)
Remote inspection: render novel viewpoints without re-flying a drone

Robotics (SLAM-adjacent workflows)

Gaussian splats are increasingly used as dense, renderable scene maps:

Localization: render predicted views from candidate poses and compare to live camera frames (photometric alignment)
Mapping: maintain an updated scene representation for navigation and teleoperation
Sim-to-real / synthetic data: generate additional viewpoints to augment perception datasets
Human-in-the-loop teleop: provide interactive novel views of remote environments
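The localization idea in the list above can be sketched as a brute-force search: render the map from each candidate pose and keep the pose whose render best matches the live frame. The `render_fn` callback, the flat-list image format, and the mean absolute error metric are all assumptions made for illustration. Practical systems refine the pose with gradients through the renderer instead of enumerating candidates.

```python
def photometric_error(rendered, observed):
    """Mean absolute difference between two equally sized images,
    each given as a flat list of intensities."""
    return sum(abs(r - o) for r, o in zip(rendered, observed)) / len(observed)


def localize(candidate_poses, live_frame, render_fn):
    """Return the candidate pose with the lowest photometric error.

    render_fn(pose) is a hypothetical callback that renders the
    splat map from the given pose and returns a flat intensity list.
    """
    best_pose, best_error = None, float("inf")
    for pose in candidate_poses:
        error = photometric_error(render_fn(pose), live_frame)
        if error < best_error:
            best_pose, best_error = pose, error
    return best_pose, best_error


# Toy stand-in for a renderer: the "image" shifts with a 1D pose.
def toy_render(pose):
    return [float(pose + i) for i in range(4)]


pose, error = localize([0, 1, 2, 3], toy_render(2), toy_render)
```

The toy renderer makes the search trivially solvable. With a real splat map, lighting changes and moving objects mean the best match has a nonzero error, which is one reason the robustness caveat below matters.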

Important caveat: classic 3DGS is primarily a renderable representation, not a guaranteed metric-accurate mesh. Robotics workflows often combine 3DGS with depth/geometry constraints, semantics, or SLAM back-ends for robustness.

E-commerce and product visualization

Turning turntable videos into interactive "look around" product views
View-dependent appearance helps with glossy/reflective objects
