Greyscale Image Colorization (best in class)

TL;DR

Built a U-Net-based model in PyTorch to colorize grayscale driving scene images by predicting the a and b channels of the Lab color space. Achieved a validation MSE of ~5.77 with a clean, modular implementation built from reusable PyTorch blocks.


Motivation & Context

This project was developed for an advanced machine learning course in which we were asked to choose and solve an image-to-image translation task. I selected image colorization - converting grayscale images into color versions - because it combines low-level spatial understanding with semantic reasoning, which makes it particularly challenging in real-world driving scenes.

Model Architecture

I implemented a U-Net architecture, inspired by the original paper “U-Net: Convolutional Networks for Biomedical Image Segmentation” by Ronneberger et al., but adapted it for the colorization task. The model was built from modular PyTorch components (see the code sketch after this list):

  • ConvBlock: Each block included two Conv2d layers, each followed by BatchNorm2d and ReLU.
  • DownBlock: Each encoder block combined a ConvBlock and MaxPool2d for spatial downsampling. It returned both the downsampled output and the intermediate ConvBlock result, so the skip connections used the full-resolution features. This fixed an earlier bug where I had mistakenly fed the post-MaxPool output into the skip connections.
  • UpBlock: Used an Upsample layer followed by a Conv2d (kernel size 1) for feature refinement. I initially avoided ConvTranspose2d based on advice I read on StackOverflow, and later realized the U-Net paper itself suggests upsampling followed by a convolution - the main difference being my kernel size (1 instead of 2).
  • Final Output: The model predicted two output channels (a and b in Lab space), with the grayscale input serving as the L channel.
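
Below is a minimal sketch of how these blocks could look in PyTorch. The class names mirror the list above, but several details are my assumptions rather than confirmed choices from the project: 3×3 padded convolutions inside ConvBlock, bilinear upsampling in UpBlock, and a ConvBlock applied after each decoder-side concatenation.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Two Conv2d layers, each followed by BatchNorm2d and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # 3x3 padded convs are an assumption
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class DownBlock(nn.Module):
    """ConvBlock + MaxPool2d; returns the pooled output and the pre-pooling
    features so the decoder's skip connection gets the full-resolution map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = ConvBlock(in_ch, out_ch)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.conv(x)              # kept for the skip connection
        return self.pool(skip), skip


class UpBlock(nn.Module):
    """Upsample + 1x1 Conv2d for refinement, then a ConvBlock on the
    concatenation of the upsampled features and the skip connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
        )
        self.conv = ConvBlock(in_ch, out_ch)  # in_ch = out_ch (upsampled) + out_ch (skip)

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)
        return self.conv(x)
```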

The full model followed the classic encoder–bottleneck–decoder structure, with skip connections linking mirrored layers to preserve spatial information.
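
Continuing the sketch above, the blocks could be assembled like this; the depth and channel widths (64→128→256 with a 512-channel bottleneck) are illustrative assumptions, not the project's exact configuration.

```python
class ColorizationUNet(nn.Module):
    """Encoder-bottleneck-decoder with skip connections: L channel in, ab channels out."""
    def __init__(self):
        super().__init__()
        self.down1 = DownBlock(1, 64)        # input is the grayscale L channel
        self.down2 = DownBlock(64, 128)
        self.down3 = DownBlock(128, 256)
        self.bottleneck = ConvBlock(256, 512)
        self.up3 = UpBlock(512, 256)
        self.up2 = UpBlock(256, 128)
        self.up1 = UpBlock(128, 64)
        self.head = nn.Conv2d(64, 2, kernel_size=1)  # predict the a and b channels

    def forward(self, x):
        x, s1 = self.down1(x)
        x, s2 = self.down2(x)
        x, s3 = self.down3(x)
        x = self.bottleneck(x)
        x = self.up3(x, s3)
        x = self.up2(x, s2)
        x = self.up1(x, s1)
        return self.head(x)


model = ColorizationUNet()
ab = model(torch.randn(1, 1, 256, 256))   # -> shape [1, 2, 256, 256]
```

The predicted ab map is then stacked with the input L channel to form a full Lab image for visualization.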

Training Process

Training presented some initial challenges:

  • I ran into GPU memory constraints: larger batch sizes didn't fit on my 8GB GPU, yet the batch size I settled on used only about 50% of its memory, so there's room for future optimization.
  • I began by training for only 1–2 epochs at a time to inspect early outputs.
  • An early bug in test data ordering - unsorted filenames paired predictions with the wrong targets - produced an MSE > 500; sorting the file lists fixed it (see the sketch below).
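
For reference, the ordering fix amounts to sorting the file lists deterministically before pairing inputs with targets; the directory layout in this sketch is purely illustrative.

```python
from glob import glob

# Deterministic pairing of grayscale inputs with their color targets.
# These paths are hypothetical, not the project's actual layout.
gray_paths = sorted(glob("data/test/gray/*.png"))
color_paths = sorted(glob("data/test/color/*.png"))
assert len(gray_paths) == len(color_paths)
pairs = list(zip(gray_paths, color_paths))
```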

Once the pipeline was stable, I ran:

  • Two training runs (10 epochs each) at learning rates 1e-3 and 3e-4 → achieved MSE ≈ 6.5.
  • Added a learning rate scheduler to decay the LR from 1e-3 to 1/20 of that value over 30 epochs → final MSE: 5.7691 (one possible wiring is sketched below).
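
The sketch below shows one way the final run could be wired up, reusing the ColorizationUNet sketch from the architecture section. Adam, MSELoss, and PyTorch's LinearLR scheduler are my assumptions; only the decay profile (1e-3 down to 1e-3/20 over 30 epochs) comes from the run above, and the dummy loader stands in for the real Lab driving-scene dataset.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ColorizationUNet().to(device)

# Dummy (L, ab) batches for illustration only; the real Dataset converted
# driving-scene frames to Lab and split the channels.
dummy = TensorDataset(torch.randn(16, 1, 128, 128), torch.randn(16, 2, 128, 128))
train_loader = DataLoader(dummy, batch_size=4, shuffle=True)

criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Linear decay from 1e-3 to 1e-3 / 20 over 30 epochs.
scheduler = optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=1 / 20, total_iters=30
)

for epoch in range(30):
    model.train()
    for gray, ab_target in train_loader:
        gray, ab_target = gray.to(device), ab_target.to(device)
        optimizer.zero_grad()
        loss = criterion(model(gray), ab_target)
        loss.backward()
        optimizer.step()
    scheduler.step()   # one scheduler step per epoch
```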

Validation error (MSE) was tracked throughout; no signs of overfitting appeared, most likely because the training and validation sets came from the same test drive and therefore had very similar distributions.
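
A sketch of how that per-epoch validation error could be computed; model and val_loader are assumed to match the training sketch above (a loader yielding L/ab batches).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def validation_mse(model, val_loader, device):
    """Mean squared error over all predicted ab values in the validation set."""
    model.eval()
    total, count = 0.0, 0
    for gray, ab_target in val_loader:
        gray, ab_target = gray.to(device), ab_target.to(device)
        pred = model(gray)
        total += F.mse_loss(pred, ab_target, reduction="sum").item()
        count += ab_target.numel()
    return total / count
```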

Although I considered adding data augmentation and integrating the other challenge dataset, GPU conversion issues made further experimentation impractical in the available time.

Reflections

This project taught me how to:

  • Build and debug complex CNN architectures.
  • Think pragmatically - sometimes the simplest architectural fix is the right one.
  • Evaluate real-world model performance with both quantitative metrics (MSE) and visual inspection of the colorized outputs.
  • Balance time, performance, and GPU limits in a real research workflow.