Motivation & Context
This project was developed as part of an advanced machine learning course where we were asked to choose and solve an image-to-image translation task. I selected image colorization - converting grayscale images into colored versions - because it combines low-level spatial understanding with semantic reasoning, and is particularly challenging in real-world driving scenes.
Model Architecture
I implemented a U-Net architecture, inspired by the original paper “U-Net: Convolutional Networks for Biomedical Image Segmentation” by Ronneberger et al., but adapted it carefully for the colorization task. The model was built in modular components using PyTorch:
- ConvBlock: Each block included two Conv2d layers, each followed by BatchNorm2d and ReLU.
- DownBlock: Each encoder block combined a ConvBlock and MaxPool2d for spatial downsampling. The implementation returned both the downsampled output and the intermediate ConvBlock result for skip connections. This was a workaround for an earlier bug where I had mistakenly used the output after MaxPooling for skip connections.
- UpBlock: Used an Upsample layer followed by a Conv2d with kernel size 1 for feature refinement. Initially I avoided ConvTranspose2d based on advice I read on StackOverflow, though I later realized the U-Net paper itself suggests upsampling followed by a convolution ("up-convolution"); the main difference from the paper was my kernel size (1 instead of 2).
- Final Output: The model predicted two output channels (a and b in Lab space), with the grayscale input serving as the L channel.
The full model followed the classic encoder–bottleneck–decoder structure, with skip connections linking mirrored layers to preserve spatial information.
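The modules described above can be sketched in PyTorch roughly as follows. This is a minimal two-level version for illustration; the channel counts, depth, and the `ColorizationUNet` name are assumptions, not the exact implementation from the project:

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Two Conv2d layers, each followed by BatchNorm2d and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class DownBlock(nn.Module):
    """Returns both the pooled output and the pre-pooling features,
    so the skip connection uses the full-resolution features."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = ConvBlock(in_ch, out_ch)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.conv(x)
        return self.pool(skip), skip


class UpBlock(nn.Module):
    """Upsample + 1x1 Conv2d for refinement, then fuse with the skip.
    Assumes in_ch == 2 * out_ch so channels line up after concatenation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
        )
        self.conv = ConvBlock(in_ch, out_ch)

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)
        return self.conv(x)


class ColorizationUNet(nn.Module):
    """Encoder-bottleneck-decoder; input is the L channel, output is ab."""
    def __init__(self):
        super().__init__()
        self.down1 = DownBlock(1, 64)
        self.down2 = DownBlock(64, 128)
        self.bottleneck = ConvBlock(128, 256)
        self.up1 = UpBlock(256, 128)
        self.up2 = UpBlock(128, 64)
        self.out = nn.Conv2d(64, 2, kernel_size=1)  # predict a and b

    def forward(self, x):
        x, s1 = self.down1(x)
        x, s2 = self.down2(x)
        x = self.bottleneck(x)
        x = self.up1(x, s2)
        x = self.up2(x, s1)
        return self.out(x)


model = ColorizationUNet().eval()
out = model(torch.randn(1, 1, 64, 64))  # L channel in, ab channels out
```

At inference time the predicted ab channels are concatenated with the input L channel to form a full Lab image before converting back to RGB.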
Training Process
Training presented some initial challenges:
- I had issues with GPU memory: larger batch sizes exceeded what my card could hold, while smaller ones used only about 50% of its 8 GB, so there's room for future optimization.
- I began by training for only 1–2 epochs at a time to inspect early outputs.
- An initial bug in test data ordering - predictions and targets were paired from unsorted filename lists - inflated the MSE above 500 until I fixed it.
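The ordering fix amounts to forcing a deterministic file order before pairing predictions with targets. A minimal sketch (the function name and `.png` extension are illustrative):

```python
import tempfile
from pathlib import Path


def list_frames(directory):
    # sorted() guarantees a deterministic, alphabetical order;
    # filesystem listing order is OS-dependent, so without it
    # predictions and ground-truth frames can be paired up wrongly.
    return sorted(Path(directory).glob("*.png"))


# Demonstration with a throwaway directory of out-of-order files:
demo_dir = Path(tempfile.mkdtemp())
for name in ("frame_002.png", "frame_000.png", "frame_001.png"):
    (demo_dir / name).touch()
ordered = [p.name for p in list_frames(demo_dir)]
```

Note that alphabetical sorting only matches frame order if filenames are zero-padded, as above.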
Once the pipeline was stable, I ran:
- Two training runs (10 epochs each) at learning rates 1e-3 and 3e-4 → achieved MSE ≈ 6.5.
- Added a learning rate scheduler to decay the LR from 1e-3 to 1/20 of that value over 30 epochs → final MSE: 5.7691.
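Such a schedule can be sketched with `torch.optim.lr_scheduler.LambdaLR`. The linear decay shape and the use of Adam are assumptions here (only the start and end learning rates are given above):

```python
import torch

# Dummy parameter so the optimizer has something to manage.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(params, lr=1e-3)

# Decay linearly from the initial LR down to 1/20 of it over 30 epochs.
epochs, final_factor = 30, 1.0 / 20.0
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda e: 1.0 - (1.0 - final_factor) * min(e, epochs) / epochs,
)

for epoch in range(epochs):
    # ... forward pass, loss, backward, etc. would go here ...
    optimizer.step()
    scheduler.step()  # after 30 epochs, LR reaches 1e-3 / 20 = 5e-5
```

An exponential decay via `ExponentialLR` with `gamma = (1/20) ** (1/30)` would reach the same endpoint along a different curve.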
Validation performance was tracked; no signs of overfitting appeared, likely because the training and validation sets were very similar (both came from the same test drive).
Although I considered adding data augmentation and integrating the other challenge dataset, GPU conversion issues made further experimentation impractical in the available time.
Reflections
This project taught me how to:
- Build and debug complex CNN architectures.
- Think pragmatically - sometimes the simplest architectural fix is the right one.
- Measure real-world model performance with both quantitative metrics (MSE) and visual inspection.
- Balance time, performance, and GPU limits in a real research workflow.