Deep learning has become a big deal in the world of computer vision. But before we get into how it works, let’s understand something important: how do computers “see” an image?
How Computers See Images
To you and me, an image is a picture — a photo of a cat, a car, or a tree. But to a computer, an image is just a grid of numbers.
- In a black-and-white (grayscale) image, each pixel (tiny square of the image) has a value from 0 (black) to 255 (white).
- For example, a 7×7 image is just a grid of 49 numbers: 7 rows by 7 columns.
Most real-world images, though, are in color, and that means each image actually has three layers, called channels:
- Red
- Green
- Blue
This is called an RGB image. The computer uses the combination of these three values to figure out what color each pixel should be. For example:
- A purple pixel might be Red: 150, Green: 0, Blue: 255
- A yellow pixel might be Red: 255, Green: 255, Blue: 0
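To make this concrete, here is a minimal NumPy sketch of both ideas: a grayscale image as a 2-D grid of numbers, and a color image as the same grid with a third axis for the three RGB channels. The pixel values below (including the purple and yellow from the examples above) are just illustrative.

```python
import numpy as np

# A tiny 3x3 grayscale "image": one number per pixel, 0 = black, 255 = white.
gray = np.array([
    [  0, 128, 255],
    [ 64, 128, 192],
    [255, 128,   0],
], dtype=np.uint8)
print(gray.shape)  # (3, 3): rows x columns

# A color image adds a third axis for the Red, Green, and Blue channels.
color = np.zeros((2, 2, 3), dtype=np.uint8)
color[0, 0] = [150, 0, 255]   # the purple pixel: R=150, G=0, B=255
color[0, 1] = [255, 255, 0]   # the yellow pixel: R=255, G=255, B=0
print(color.shape)  # (2, 2, 3): rows x columns x channels
```

This is exactly the shape a deep learning model receives as input: height × width for grayscale, or height × width × 3 for RGB.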
Filters and Feature Extraction
Now, imagine you want the computer to recognize what’s in the picture — like whether it’s a dog or a bicycle.
This is where filters (also called kernels) come in. These are small grids of numbers, like 3×3 or 5×5, that slide across the image, multiplying and summing pixel values at each position to highlight important parts, like edges, shapes, and patterns.
This process is called convolution, and it’s the heart of something called a Convolutional Neural Network (CNN).
Think of filters like special glasses:
- One filter might highlight the edges in an image.
- Another might blur the image.
- Another might detect patterns like stripes or circles.
When filters pass over the image, they create new images called feature maps — these are simplified, transformed versions of the original image that make it easier for the computer to understand what it’s looking at.
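Here is a small, hand-rolled sketch of that process (in practice you would use a library, but writing the loop out shows what convolution actually does). The image and the vertical-edge kernel below are made-up examples: the image is dark on the left and bright on the right, and the resulting feature map lights up exactly where the edge is.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel over a grayscale image (no padding) and
    return the resulting feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # Element-wise multiply the patch by the kernel, then sum.
            out[r, c] = np.sum(image[r:r+kh, c:c+kw] * kernel)
    return out

# An image with a vertical edge: dark (0) on the left, bright (255) on the right.
image = np.array([
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
], dtype=float)

# A classic vertical-edge filter: negative weights on the left column,
# positive on the right, so flat regions cancel out to zero.
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

feature_map = convolve2d(image, kernel)
print(feature_map)  # large values at the edge, zeros in the flat regions
```

Notice that the feature map is smaller than the input (a 3×3 kernel over a 4×6 image gives a 2×4 map) and that it no longer stores raw brightness, only "how edge-like" each region is.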
Convolutional Neural Networks (CNNs)
A CNN is a special type of deep learning model that’s designed for image tasks.
How It Works:
- Input layer takes in the image (with its RGB values).
- Convolution layers apply filters to extract important features.
- Feature maps are created from the filter outputs.
- These maps are passed through additional layers — often more filters or pooling layers that shrink the image and keep the important parts.
- The processed data is then sent to a fully connected neural network, which combines all the extracted features to make a final decision.
- The output layer gives probabilities for different classes (e.g., 70% apple, 20% orange, 10% banana).
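The steps above can be sketched end to end in plain NumPy. This is a toy forward pass, not a real trained network: the filter and weights are random, the input is a single made-up 8×8 grayscale image, and the three output classes stand in for the apple/orange/banana example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Input layer: one 8x8 grayscale image, values scaled to 0..1.
image = rng.random((8, 8))

# Convolution layer: one 3x3 filter slides over the image.
kernel = rng.standard_normal((3, 3))
fmap = np.zeros((6, 6))
for r in range(6):
    for c in range(6):
        fmap[r, c] = np.sum(image[r:r+3, c:c+3] * kernel)

# Activation: keep positive responses, zero out the rest (ReLU).
fmap = np.maximum(fmap, 0)

# Pooling layer: shrink the 6x6 feature map to 3x3 by taking
# the largest value in each 2x2 block (max pooling).
pooled = fmap.reshape(3, 2, 3, 2).max(axis=(1, 3))

# Fully connected layer: flatten the map and score 3 classes.
flat = pooled.flatten()                  # 9 numbers
weights = rng.standard_normal((3, 9))
bias = rng.standard_normal(3)
logits = weights @ flat + bias

# Output layer: softmax turns scores into probabilities that sum to 1.
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs)  # three probabilities, one per class
```

A real CNN stacks many such convolution + pooling blocks and learns the filter values instead of drawing them at random, but the data flow is exactly this.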
Learning Through Mistakes
CNNs learn by making guesses and checking how wrong they are. Here’s how:
- The model starts with random guesses.
- A loss function calculates how far off the guess is from the correct answer.
- The model adjusts its internal settings (called weights and biases) to make better guesses next time.
- This cycle repeats many times. Each full pass through the training data is called an epoch, and the model typically gets a little better after each one.
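The guess-check-adjust loop is easiest to see on a tiny example. Below, a "model" with a single weight learns the made-up relationship y = 3x by gradient descent: it measures its loss, works out which direction to nudge the weight, and repeats. Real CNNs do exactly this, just with millions of weights.

```python
import numpy as np

# Toy model: prediction = w * x. The true relationship is y = 3x,
# so the model should learn w close to 3.
rng = np.random.default_rng(1)
x = rng.random(20)
y = 3.0 * x

w = 0.0        # start with a (bad) initial guess
lr = 0.1       # learning rate: how big each adjustment is

for epoch in range(200):                  # each pass over the data = one epoch
    pred = w * x
    loss = np.mean((pred - y) ** 2)       # loss: how wrong the guesses are
    grad = np.mean(2 * (pred - y) * x)    # which way (and how much) to adjust w
    w -= lr * grad                        # adjust the weight to shrink the loss

print(round(w, 3))  # close to 3.0 after training
```

The loss function here is mean squared error; image classifiers usually use a different one (cross-entropy), but the learning loop is the same.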
What Can CNNs Do?
CNNs are used in all kinds of image-related tasks:
- Image classification (Is it a dog or a cat?)
- Object detection (Where is the cat in the picture?)
- Image captioning (A cat sitting on a sofa)
- Face recognition (Unlock your phone with your face)
- Medical imaging (Spotting tumors in X-rays)
Moving Beyond CNNs: Vision + Language
CNNs are powerful, but they work with images alone. Newer models go even further by combining images and text.
These models are called multimodal models because they can understand multiple types of data (like images + words).
Example: Microsoft’s Florence Model
- Florence is trained on millions of images and captions.
- It has an image encoder that understands pictures.
- It has a language encoder that understands text.
- Together, these let the model do many things:
  - Describe an image
  - Answer questions about an image
  - Classify or detect objects
  - Match images to the right caption
Instead of training one model for each task, Florence can do it all.
Summary
Here’s a quick summary of everything you learned:
| Topic | What It Means |
|---|---|
| Pixels | Tiny squares in an image, represented by numbers. |
| RGB Channels | Red, Green, and Blue layers that form a full-color image. |
| Filters | Small grids that help the computer find patterns. |
| CNN (Convolutional Neural Network) | A deep learning model that's great for image tasks. |
| Feature Map | New version of the image showing important features. |
| Loss Function | Tells the model how wrong its guess was. |
| Multimodal Models | Newer models that combine image + text understanding. |
Final Thought
Deep learning has made computer vision smarter than ever before. From recognizing your face to translating signs in real time, it’s helping machines see, understand, and even describe the world around us.