{"id":1979,"date":"2025-05-23T16:37:32","date_gmt":"2025-05-23T20:37:32","guid":{"rendered":"https:\/\/molecularsciences.org\/content\/?p=1979"},"modified":"2025-05-20T16:48:46","modified_gmt":"2025-05-20T20:48:46","slug":"what-is-deep-learning-in-computer-vision","status":"publish","type":"post","link":"https:\/\/molecularsciences.org\/content\/what-is-deep-learning-in-computer-vision\/","title":{"rendered":"What is Deep Learning in Computer Vision"},"content":{"rendered":"\n<p>Deep learning has become a big deal in the world of computer vision. But before we get into how it works, let\u2019s understand something important: <strong>how do computers \u201csee\u201d an image?<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How Computers See Images<\/h2>\n\n\n\n<p>To you and me, an image is a picture \u2014 a photo of a cat, a car, or a tree. But to a computer, an image is just <strong>a grid of numbers<\/strong>.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>In a <strong>black-and-white (grayscale)<\/strong> image, each pixel (tiny square of the image) has a value from 0 (black) to 255 (white).<\/li>\n\n\n\n<li>For example, a <strong>7&#215;7 image<\/strong> is just a 7-by-7 grid of numbers.<\/li>\n<\/ul>\n\n\n\n<p>Most real-world images, though, are in <strong>color<\/strong>, and that means each image actually has <strong>three layers<\/strong>, called channels:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Red<\/strong><\/li>\n\n\n\n<li><strong>Green<\/strong><\/li>\n\n\n\n<li><strong>Blue<\/strong><\/li>\n<\/ul>\n\n\n\n<p>This is called an <strong>RGB image<\/strong>. The computer uses the combination of these three values to figure out what color each pixel should be. For example:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A purple pixel might be <strong>Red: 150, Green: 0, Blue: 255<\/strong><\/li>\n\n\n\n<li>A yellow pixel might be <strong>Red: 255, Green: 255, Blue: 0<\/strong><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Filters and Feature Extraction<\/h2>\n\n\n\n<p>Now, imagine you want the computer to recognize what\u2019s in the picture \u2014 like whether it\u2019s a dog or a bicycle.<\/p>\n\n\n\n<p>This is where <strong>filters<\/strong> (also called <strong>kernels<\/strong>) come in. These are small grids, like 3&#215;3 or 5&#215;5, that move across the image and do some math to highlight important parts \u2014 like edges, shapes, and patterns.<\/p>\n\n\n\n<p>This process is called <strong>convolution<\/strong>, and it\u2019s the heart of something called a <strong>Convolutional Neural Network (CNN)<\/strong>.<\/p>\n\n\n\n<p>Think of filters like special glasses:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>One filter might highlight the <strong>edges<\/strong> in an image.<\/li>\n\n\n\n<li>Another might blur the image.<\/li>\n\n\n\n<li>Another might detect <strong>patterns<\/strong> like stripes or circles.<\/li>\n<\/ul>\n\n\n\n<p>When filters pass over the image, they create new images called <strong>feature maps<\/strong> \u2014 these are simplified, transformed versions of the original image that make it easier for the computer to understand what it\u2019s looking at.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Convolutional Neural Networks (CNNs)<\/h2>\n\n\n\n<p>A <strong>CNN<\/strong> is a special type of deep learning model that\u2019s designed for image tasks.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How It Works:<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Input layer<\/strong> takes in the image (with its RGB values).<\/li>\n\n\n\n<li><strong>Convolution layers<\/strong> apply filters to extract important features.<\/li>\n\n\n\n<li><strong>Feature maps<\/strong> are created from the filter outputs.<\/li>\n\n\n\n<li>These maps are passed through additional layers \u2014 often more filters or pooling layers that shrink the image and keep the important parts.<\/li>\n\n\n\n<li>The processed data is then sent to a <strong>fully connected neural network<\/strong>, like a giant decision tree.<\/li>\n\n\n\n<li>The <strong>output layer<\/strong> gives probabilities for different classes (e.g., 70% apple, 20% orange, 10% banana).<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">Learning Through Mistakes<\/h2>\n\n\n\n<p>CNNs <strong>learn<\/strong> by making guesses and checking how wrong they are. Here\u2019s how:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The model starts with random guesses.<\/li>\n\n\n\n<li>A <strong>loss function<\/strong> calculates how far off the guess is from the correct answer.<\/li>\n\n\n\n<li>The model <strong>adjusts<\/strong> its internal settings (called weights and biases) to make better guesses next time.<\/li>\n\n\n\n<li>This cycle repeats thousands of times through something called <strong>epochs<\/strong>, where the model gets smarter after each round.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">What Can CNNs Do?<\/h2>\n\n\n\n<p>CNNs are used in all kinds of image-related tasks:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Image classification<\/strong> (Is it a dog or a cat?)<\/li>\n\n\n\n<li><strong>Object detection<\/strong> (Where is the cat in the picture?)<\/li>\n\n\n\n<li><strong>Image captioning<\/strong> (A cat sitting on a sofa)<\/li>\n\n\n\n<li><strong>Face recognition<\/strong> (Unlock your phone with your face)<\/li>\n\n\n\n<li><strong>Medical imaging<\/strong> (Spotting tumors in X-rays)<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Moving Beyond CNNs: Vision + Language<\/h2>\n\n\n\n<p>CNNs are powerful, but they mostly work with images only. New models go even further by combining <strong>images and text<\/strong>.<\/p>\n\n\n\n<p>These models are called <strong>multimodal models<\/strong> because they can understand multiple types of data (like images + words).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example: Microsoft\u2019s Florence Model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Florence is trained on <strong>millions of images and captions<\/strong>.<\/li>\n\n\n\n<li>It has an <strong>image encoder<\/strong> that understands pictures.<\/li>\n\n\n\n<li>It has a <strong>language encoder<\/strong> that understands text.<\/li>\n\n\n\n<li>Together, these let the model do many things:\n<ul class=\"wp-block-list\">\n<li>Describe an image<\/li>\n\n\n\n<li>Answer questions about an image<\/li>\n\n\n\n<li>Classify or detect objects<\/li>\n\n\n\n<li>Match images to the right caption<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p>Instead of training one model for each task, Florence can <strong>do it all<\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n\n\n\n<p>Here\u2019s a quick summary of everything you learned:<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><thead><tr><th>Topic<\/th><th>What It Means<\/th><\/tr><\/thead><tbody><tr><td><strong>Pixels<\/strong><\/td><td>Tiny squares in an image, represented by numbers.<\/td><\/tr><tr><td><strong>RGB Channels<\/strong><\/td><td>Red, Green, and Blue layers that form a full-color image.<\/td><\/tr><tr><td><strong>Filters<\/strong><\/td><td>Small grids that help the computer find patterns.<\/td><\/tr><tr><td><strong>CNN (Convolutional Neural Network)<\/strong><\/td><td>A deep learning model that\u2019s great for image tasks.<\/td><\/tr><tr><td><strong>Feature Map<\/strong><\/td><td>New version of the image showing important features.<\/td><\/tr><tr><td><strong>Loss Function<\/strong><\/td><td>Tells the model how wrong its guess was.<\/td><\/tr><tr><td><strong>Multimodal Models<\/strong><\/td><td>Newer models that combine image + text understanding.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Final Thought<\/h2>\n\n\n\n<p>Deep learning has made computer vision smarter than ever before. From recognizing your face to translating signs in real time, it&#8217;s helping machines <strong>see, understand, and even describe the world around us.<\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Deep learning has become a big deal in the world of computer vision. But before we get into how it works, let\u2019s understand something important: how do computers \u201csee\u201d an image? How Computers See Images To you and me, an image is a picture \u2014 a photo of a cat, a car, or a tree. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[532],"tags":[533,534,540,535],"class_list":["post-1979","post","type-post","status-publish","format-standard","hentry","category-artificial-intelligence","tag-ai","tag-computer-vision","tag-deep-learning","tag-ml"],"_links":{"self":[{"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/posts\/1979","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/comments?post=1979"}],"version-history":[{"count":1,"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/posts\/1979\/revisions"}],"predecessor-version":[{"id":1980,"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/posts\/1979\/revisions\/1980"}],"wp:attachment":[{"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/media?parent=1979"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/categories?post=1979"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/tags?post=1979"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}