{"id":329278,"date":"2023-09-18T13:00:00","date_gmt":"2023-09-18T13:00:00","guid":{"rendered":"https:\/\/pyimagesearch.com\/?p=40478"},"modified":"2023-09-18T13:00:00","modified_gmt":"2023-09-18T13:00:00","slug":"sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks","status":"publish","type":"post","link":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/","title":{"rendered":"SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks"},"content":{"rendered":"<hr class=\"wp-block-separator has-alpha-channel-opacity\" id=\"TOC\"\/>\n<h2><strong>Table of 
Contents<\/strong><\/h2>\n<div class=\"toc\">\n<ul>\n<li id=\"TOC-h2BPTitle\"><a rel=\"noopener\"  href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#h2BPTitle\">SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks<\/a><\/li>\n<ul>\n<li id=\"TOC-h3Integration\"><a rel=\"noopener\"  href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#h3Integration\">SAM and CLIP Integration<\/a><\/li>\n<li id=\"TOC-h3Environment\"><a rel=\"noopener\"  href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#h3Environment\">Configuring Your Development Environment<\/a><\/li>\n<li id=\"TOC-h3Help\"><a rel=\"noopener\"  href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#h3Help\">Need Help Configuring Your Development Environment?<\/a><\/li>\n<li id=\"TOC-h3Structure\"><a rel=\"noopener\"  href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#h3Structure\">Project Structure<\/a><\/li>\n<li id=\"TOC-h3Individual\"><a rel=\"noopener\"  href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#h3Individual\">Getting Individual Objects from Image<\/a><\/li>\n<li id=\"TOC-h3Downstream\"><a rel=\"noopener\"  href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#h3Downstream\">Downstream Tasks with SAM and CLIP<\/a><\/li>\n<\/ul>\n<li id=\"TOC-h2Summary\"><a rel=\"noopener\"  
href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#h2Summary\">Summary<\/a><\/li>\n<ul>\n<li id=\"TOC-h3Citation\"><a rel=\"noopener\"  href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#h3Citation\">Citation Information<\/a><\/li>\n<\/ul>\n<\/ul>\n<\/div>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" id=\"h2BPTitle\"\/>\n<h2><a href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#TOC-h2BPTitle\"><strong>SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks<\/strong><\/a><\/h2>\n<p>In this tutorial, you will learn about the Segment Anything Model (SAM) from Meta AI and understand how it can be integrated with other foundational models like CLIP (contrastive language-image pretraining) to tackle a diverse range of downstream tasks like zero-shot image classification, text-to-image retrieval and image similarity.  
<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><a href=\"https:\/\/pyimagesearch.com\/wp-content\/uploads\/2023\/09\/sam-2-featured.png\"  rel=\"noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/b2633864.smushcdn.com\/2633864\/wp-content\/uploads\/2023\/09\/sam-2-featured.png?lossy=2&#038;strip=1&#038;webp=1\" alt=\"\" class=\"wp-image-41341\" width=\"603\" height=\"500\" srcset=\"https:\/\/b2633864.smushcdn.com\/2633864\/wp-content\/uploads\/2023\/09\/sam-2-featured.png?size=126x104&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https:\/\/b2633864.smushcdn.com\/2633864\/wp-content\/uploads\/2023\/09\/sam-2-featured-300x249.png?lossy=2&amp;strip=1&amp;webp=1 300w, https:\/\/b2633864.smushcdn.com\/2633864\/wp-content\/uploads\/2023\/09\/sam-2-featured.png?size=378x313&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https:\/\/b2633864.smushcdn.com\/2633864\/wp-content\/uploads\/2023\/09\/sam-2-featured.png?size=504x418&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https:\/\/b2633864.smushcdn.com\/2633864\/wp-content\/uploads\/2023\/09\/sam-2-featured-768x637.png?lossy=2&amp;strip=1&amp;webp=1 768w, https:\/\/b2633864.smushcdn.com\/2633864\/wp-content\/uploads\/2023\/09\/sam-2-featured.png?lossy=2&amp;strip=1&amp;webp=1 940w\" sizes=\"(max-width: 603px) 100vw, 603px\" \/><\/a><\/figure>\n<\/div>\n<p>This lesson is the last of a 2-part series on <strong>Segment Anything Model (SAM) from Meta AI<\/strong>:<\/p>\n<ol>\n<li><a href=\"https:\/\/pyimg.co\/0ivy4\"  rel=\"noreferrer noopener\"><em>SAM from Meta AI (Part 1): Segmentation with Prompts<\/em><\/a><em>  <\/em><\/li>\n<li><a href=\"https:\/\/pyimg.co\/27uny\"  rel=\"noreferrer noopener\"><strong><em>SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks<\/em><\/strong><\/a><strong><em> <\/em>(this tutorial)<\/strong><\/li>\n<\/ol>\n<p><strong>To learn how to integrate SAM with CLIP for downstream tasks, <\/strong><strong><em>just keep 
reading.<\/em><\/strong><\/p>\n<p>In the first part of this series, we discussed the 
development of \u201cfoundational models,\u201d which are trained on large-scale datasets and possess more \u201cgeneral\u201d capabilities that allow them to understand data holistically and perform downstream tasks across varied data distributions.<\/p>\n<p>We took an in-depth look at SAM, a foundational segmentation model, to develop a holistic understanding of how it works.<\/p>\n<p>The two most important characteristics of SAM that we discussed were (1) its ability to segment data from a variety of distributions through prompt engineering and (2) its seamless integration with pre-established computer vision models and systems, which boosts their capabilities and performance on complex tasks for which training task-specific models would not be feasible.<\/p>\n<p>In the <a href=\"https:\/\/pyimg.co\/0ivy4\"  rel=\"noreferrer noopener\">previous tutorial<\/a>, we discussed the former in detail to understand how SAM can be prompted in different ways to segment specific regions in any image in real time. In this tutorial, we will focus on the latter and discuss how SAM can be integrated with other foundational models like CLIP to perform a range of downstream tasks (e.g., zero-shot image classification, text-to-image retrieval, and image similarity).<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" id=\"h3Integration\"\/>\n<h3><a href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#TOC-h3Integration\"><strong>SAM and CLIP Integration<\/strong><\/a><\/h3>\n<p>As we briefly discussed in the previous tutorial, the CLIP model is a recent foundational model for computer vision that uses large-scale web-based image-text data to learn strong representations that encode semantic knowledge. 
The image and text instances are directly scraped from the web and are weakly aligned (e.g., image and tags on the web), which allows us to train this model with low annotation cost.<\/p>\n<p>The CLIP model consists of an image encoder and a text encoder. The image encoder takes an image as input and maps it to an N-dimensional representation space. Similarly, the text encoder takes as input the corresponding text and maps it to the same N-dimensional representation space. Then, a contrastive objective is used to align corresponding image and text pairs and push dissimilar image and text pairs apart to learn useful semantic representations. <\/p>\n<p>Let us now take an example and understand how our SAM+CLIP system can help us perform complex and diverse downstream tasks.<\/p>\n<p>Suppose we have an image <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">I<\/code> with multiple objects, and we want to classify each object in the image into a set of categories in a zero-shot way (i.e., without using any data to train the model).<\/p>\n<p><strong>Figure 1<\/strong> shows an overview of a pipeline that integrates SAM with the CLIP model to perform this task seamlessly.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><a href=\"https:\/\/lh3.googleusercontent.com\/-dbiAScohbemPfQp-X5L2mpRme7xhK9l_slcn2B3WC7eMwRvO87oqKgZpwSLjhoqVD7p1jSwepjUJ_0qRtiIWCWSTFdwJDS_a1RkLWzK-5lyq0nsEa3IiGGBht3mBfOU3_WSeP4zftMw6W1JgAI1AJo\"  rel=\"noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/-dbiAScohbemPfQp-X5L2mpRme7xhK9l_slcn2B3WC7eMwRvO87oqKgZpwSLjhoqVD7p1jSwepjUJ_0qRtiIWCWSTFdwJDS_a1RkLWzK-5lyq0nsEa3IiGGBht3mBfOU3_WSeP4zftMw6W1JgAI1AJo\" alt=\"\" width=\"646\" height=\"500\"\/><\/a><figcaption><strong>Figure 1:<\/strong> Overview of SAM and CLIP integration pipeline for downstream tasks (source: image by the author).<\/figcaption><\/figure>\n<\/div>\n<p>The input image <code 
class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">I<\/code> is passed through SAM, which is used in a mode that extracts and outputs all plausible segmentation masks for objects in that image. Then, these individual objects are cropped out of the image and preprocessed so they can be passed through the pre-trained CLIP image encoder, which maps each crop to an N-dimensional representation.<\/p>\n<p>Similarly, the text prompts are passed as input to the pre-trained CLIP text encoder, which maps them to an N-dimensional representation. Since CLIP is trained so that the outputs of the text and image encoders are aligned in the same N-dimensional space, we can compute the cosine similarity between each cropped object representation and each prompt representation and classify every object in the image by matching it with the most similar prompt from the prompt list, as shown.<\/p>\n<p>This integrated system can also be used for various other tasks (e.g., text-to-image retrieval and image similarity), as detailed in this tutorial.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" id=\"h3Environment\"\/>\n<h3><a href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#TOC-h3Environment\"><strong>Configuring Your Development Environment<\/strong><\/a><\/h3>\n<p>To follow this guide, you need to have the SAM and CLIP models installed. 
Furthermore, you will need to download the pre-trained checkpoints for these models.<\/p>\n<p>Luckily, this can be done easily by following the commands below:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"true\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks\" data-enlighter-group=\"1\">$ pip install git+https:\/\/github.com\/facebookresearch\/segment-anything.git\n$ pip install opencv-python pycocotools matplotlib onnxruntime onnx\n$ pip install open_clip_torch\n\n$ mkdir checkpoints\n$ cd checkpoints\n$ wget -c https:\/\/dl.fbaipublicfiles.com\/segment_anything\/sam_vit_h_4b8939.pth<\/pre>\n<p><strong>If you need help configuring your development environment for OpenCV, we <em>highly recommend<\/em> that you read our <\/strong><a href=\"https:\/\/pyimagesearch.com\/2018\/09\/19\/pip-install-opencv\/\"  rel=\"noreferrer noopener\"><strong><em>pip install OpenCV<\/em> guide<\/strong><\/a> \u2014 it will have you up and running in minutes.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" id=\"h3Help\"\/>\n<h3><a href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#TOC-h3Help\"><strong>Need Help Configuring Your Development Environment?<\/strong><\/a><\/h3>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/pyimagesearch.com\/pyimagesearch-university\/\"  rel=\"noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"500\" height=\"334\" src=\"https:\/\/b2633864.smushcdn.com\/2633864\/wp-content\/uploads\/2021\/05\/pyimagesearch_plus_jupyter.png?lossy=2&#038;strip=1&#038;webp=1\" alt=\"Need help configuring your dev environment? Want access to pre-configured Jupyter Notebooks running on Google Colab? 
Be sure to join PyImageSearch University \u2014 you\u2019ll be up and running with this tutorial in minutes.\" class=\"wp-image-19836\" srcset=\"https:\/\/b2633864.smushcdn.com\/2633864\/wp-content\/uploads\/2021\/05\/pyimagesearch_plus_jupyter.png?size=126x84&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https:\/\/b2633864.smushcdn.com\/2633864\/wp-content\/uploads\/2021\/05\/pyimagesearch_plus_jupyter-300x200.png?lossy=2&amp;strip=1&amp;webp=1 300w, https:\/\/b2633864.smushcdn.com\/2633864\/wp-content\/uploads\/2021\/05\/pyimagesearch_plus_jupyter.png?size=378x253&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https:\/\/b2633864.smushcdn.com\/2633864\/wp-content\/uploads\/2021\/05\/pyimagesearch_plus_jupyter.png?lossy=2&amp;strip=1&amp;webp=1 500w\" sizes=\"(max-width: 500px) 100vw, 500px\" \/><\/a><figcaption>Need help configuring your dev environment? Want access to pre-configured Jupyter Notebooks running on Google Colab? Be sure to join <a href=\"https:\/\/pyimagesearch.com\/pyimagesearch-university\/\"  rel=\"noreferrer noopener\">PyImageSearch University<\/a> \u2014 you\u2019ll be up and running with this tutorial in minutes.<\/figcaption><\/figure>\n<\/div>\n<p>All that said, are you:<\/p>\n<ul>\n<li>Short on time?<\/li>\n<li>Learning on your employer\u2019s administratively locked system?<\/li>\n<li>Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?<\/li>\n<li><strong>Ready to run the code <\/strong><strong><em>immediately<\/em><\/strong><strong> on your Windows, macOS, or Linux system?<\/strong><\/li>\n<\/ul>\n<p>Then join <a href=\"https:\/\/pyimagesearch.com\/pyimagesearch-university\/\"  rel=\"noreferrer noopener\">PyImageSearch University<\/a> today!<\/p>\n<p><strong>Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides <\/strong><strong><em>pre-configured<\/em><\/strong><strong> to run on Google Colab\u2019s ecosystem right in your web browser!<\/strong> No installation 
required.<\/p>\n<p>And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" id=\"h3Structure\"\/>\n<h3><a href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#TOC-h3Structure\"><strong>Project Structure<\/strong><\/a><\/h3>\n<p>We first need to review our project directory structure.<\/p>\n<p>Start by accessing this tutorial\u2019s <strong><em>\u201cDownloads\u201d<\/em><\/strong> section to retrieve the source code and example images.<\/p>\n<p>From there, take a look at the directory structure:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"true\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks\" data-enlighter-group=\"2\">\u251c\u2500\u2500 checkpoints\n\u251c\u2500\u2500 clip_integration.py\n\u251c\u2500\u2500 config.py\n\u251c\u2500\u2500 gdino_integration.py\n\u251c\u2500\u2500 get_objects.py\n\u251c\u2500\u2500 images\n\u2502   \u251c\u2500\u2500 kitchen.jpeg\n\u2502   \u2514\u2500\u2500 living_room.jpg\n\u251c\u2500\u2500 sam.py\n\u2514\u2500\u2500 utils.py<\/pre>\n<p>In the previous tutorial, we discussed the directory structure and the function of each file in detail.<\/p>\n<p>Specifically, we discussed the checkpoints and images folders, which store the pre-trained checkpoints and the images we will use for the tutorial. 
Furthermore, we discussed the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">config.py<\/code>, <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">gdino_integration.py<\/code>, <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">sam.py<\/code>, and <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">utils.py<\/code> files in detail.<\/p>\n<p>In this tutorial, we will take an in-depth look at the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">get_objects.py<\/code> file, which implements the code to extract segmentation masks for objects in the input image and then crop them for our downstream tasks.<\/p>\n<p>Additionally, we will walk through the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">clip_integration.py<\/code> file, which implements the code for our SAM and CLIP integrated system and performs tasks (e.g., zero-shot object classification, text-to-image retrieval, and image similarity).<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" id=\"h3Individual\"\/>\n<h3><a href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#TOC-h3Individual\"><strong>Getting Individual Objects from Image<\/strong><\/a><\/h3>\n<p>In this section, we discuss how we can extract different objects in our input image so we can use them later for performing our downstream recognition and retrieval tasks.<\/p>\n<p>In the <a href=\"https:\/\/pyimg.co\/0ivy4\"  rel=\"noreferrer noopener\">previous tutorial<\/a>, we discussed how SAM can be prompted to segment regions or objects of interest. This can be referred to as the promptable segmentation mode for SAM.<\/p>\n<p>Another more general mode in which SAM can be used is to generate all plausible segmentation masks in a given input image. 
This can be done with the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">SamAutomaticMaskGenerator<\/code> functionality, where SAM takes an input image, predicts plausible segmentation masks, and estimates bounding boxes corresponding to each mask.<\/p>\n<p>Let us go ahead and discuss how this can be implemented in code and how we can use this functionality of SAM to segment and crop out prominent objects in our input image.<\/p>\n<p>We open the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">get_objects.py<\/code> file and get started.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"true\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks\" data-enlighter-group=\"3\"># USAGE: python get_objects.py\n\n# Import the necessary packages\nimport os\nimport pickle\n\nimport cv2\nimport matplotlib.pyplot as plt\nfrom segment_anything import SamAutomaticMaskGenerator, sam_model_registry\n\nfrom pyimagesearch import config, utils\n\n\ndef generate_object_masks(image):\n    # Initialize SAM model and generate masks\n    print(\"[INFO] Loading SAM model...\")\n    sam = sam_model_registry[config.MODEL_TYPE](checkpoint=config.SAM_CHECKPOINT_PATH)\n    mask_generator = SamAutomaticMaskGenerator(sam)\n    masks = mask_generator.generate(image)\n\n    return masks\n\n\ndef save_object_crops(masks, mask_ids, labels):\n    # Crop objects from image using mask IDs and save\n    obj_crops = []\n    for id in mask_ids:\n        box = masks[id][\"bbox\"]\n        x0, y0, w, h = box\n        crop = image[y0 : y0 + h, x0 : x0 + w]\n        obj_crops.append(crop)\n\n    # Save object crops and corresponding labels\n    obj_dict = {\"crops\": obj_crops, \"labels\": labels}\n    with open(os.path.join(config.OUT_PATH, \"objects.pkl\"), \"wb\") as fp:\n        
pickle.dump(obj_dict, fp)<\/pre>\n<p>We start by importing the <code data-enlighter-language=\"python\" class=\"EnlighterJSRAW\">os<\/code> and <code data-enlighter-language=\"python\" class=\"EnlighterJSRAW\">pickle<\/code> modules (<strong>Lines 4 and 5<\/strong>). Next, we import the OpenCV and Matplotlib libraries, as shown on <strong>Lines 7 and 8<\/strong>, respectively. On <strong>Line 9<\/strong>, we import what we need to use SAM for making predictions in our tutorial. Specifically, we get the <code data-enlighter-language=\"python\" class=\"EnlighterJSRAW\">SamAutomaticMaskGenerator<\/code> class and the <code data-enlighter-language=\"python\" class=\"EnlighterJSRAW\">sam_model_registry<\/code> dictionary from <code data-enlighter-language=\"python\" class=\"EnlighterJSRAW\">segment_anything<\/code>.<\/p>\n<p>Finally, on <strong>Line 11<\/strong>, we import the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">config<\/code> file, which stores the initial configurations of our parameters, and the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">utils.py<\/code> file, which implements the helper functions to visualize the segmentation masks from SAM.<\/p>\n<p>Now that we have imported the necessary modules and packages, it is time to implement our <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">generate_object_masks<\/code> function (<strong>Lines 14-21<\/strong>), which takes the input image and uses the pre-trained SAM to automatically generate masks for the entire image.<\/p>\n<p>We first initialize SAM by indexing <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">sam_model_registry<\/code> with the type of model (i.e., <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">config.MODEL_TYPE<\/code>), which we take as <code class=\"EnlighterJSRAW\" 
data-enlighter-language=\"python\">vit_h<\/code> for this tutorial. Further, we provide the path where we stored the SAM checkpoint (i.e., <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">config.SAM_CHECKPOINT_PATH<\/code>), as shown on <strong>Line 17<\/strong>.<\/p>\n<p>On <strong>Line 18<\/strong>, we create the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">mask_generator<\/code> object from the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">SamAutomaticMaskGenerator<\/code> class, allowing us to automatically extract masks for the different objects in our input image.<\/p>\n<p>Finally, on <strong>Line 19<\/strong>, we call the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">generate<\/code> method of the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">mask_generator<\/code> object that we created to get the masks for the entire image and return the predicted <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">masks<\/code> on <strong>Line 21<\/strong>.<\/p>\n<p>Now that we have completed the definition of our <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">generate_object_masks<\/code> function, let us discuss the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">save_object_crops<\/code> function, which will allow us to crop out everyday objects detected by SAM from our input image and save them with their corresponding labels. These crops are what we will use later to perform the downstream tasks on the different objects in our image.<\/p>\n<p>We first initialize an empty <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">obj_crops<\/code> list (<strong>Line 26<\/strong>), where we will store our object crops. 
Then, for each ID in our <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">mask_ids<\/code> list, we get the bounding box predicted by SAM (<strong>Line 28<\/strong>) and get the corresponding <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">x0<\/code>, <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">y0<\/code> coordinates and width and height (<strong>Line 29<\/strong>).<\/p>\n<p>Next, we use the coordinates to get the part or crop of the image with the particular object (<strong>Line 30<\/strong>) and add it to our <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">obj_crops<\/code> list (<strong>Line 31<\/strong>).<\/p>\n<p>Finally, we create a dictionary named <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">obj_dict<\/code> and store the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">obj_crops<\/code> list and the corresponding labels, as shown on <strong>Line 34<\/strong>. Then, we save this dictionary with the filename <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">objects.pkl<\/code> in the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">config.OUT_PATH<\/code> folder (<strong>Line 35<\/strong>).<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"true\" data-enlighter-lineoffset=\"39\" data-enlighter-title=\"SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks\" data-enlighter-group=\"4\">if __name__ == \"__main__\":\n    # Load input image and convert to RGB\n    print(\"[INFO] Loading image...\")\n    image = cv2.imread(config.IMG_PATH[1])\n    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)\n\n    # Generate segmentation masks\n    print(\"[INFO] Generating masks from SAM...\")\n    all_masks = generate_object_masks(image)\n\n    # Create output directory if it doesn't exist\n    if not 
os.path.exists(config.OUT_PATH):\n        os.makedirs(config.OUT_PATH)\n\n    # Plot and save predicted image\n    prediction_path = os.path.join(config.OUT_PATH, config.OUT_PRED_PATH)\n    plt.figure(figsize=(8, 8))\n    plt.imshow(image)\n    utils.show_all_masks(all_masks)\n    plt.savefig(prediction_path)\n\n    # Plot individual object masks\n    for i, mask_info in enumerate(all_masks):\n        plt.figure(figsize=(6, 6))\n        plt.imshow(image)\n        utils.show_mask(mask_info[\"segmentation\"], plt.gca())\n        plt.axis(\"off\")\n        plt.savefig(os.path.join(config.OUT_PATH, f\"mask_{i}.png\"))\n\n    # TODO: Place mask and labels in the config\n    mask_ids = [0, 2, 3, 7, 8, 11, 12, 20, 22, 23, 26, 27, 34, 35, 45, 66, 67, 78]\n    labels = [\n        \"painting\",\n        \"door\",\n        \"painting\",\n        \"carpet\",\n        \"vase\",\n        \"plant\",\n        \"pillow\",\n        \"blanket\",\n        \"pillow\",\n        \"pillow\",\n        \"blanket\",\n        \"jug\",\n        \"blanket\",\n        \"dog\",\n        \"plant\",\n        \"plant\",\n        \"plant\",\n        \"pillow\",\n    ]\n\n    # Save object crops and corresponding labels\n    print(\"[INFO] Cropping and saving objects...\")\n    save_object_crops(all_masks, mask_ids, labels)<\/pre>\n<p>Next, let us go ahead and implement the main function to see our model in action.<\/p>\n<p>We start by loading our image using OpenCV from the path defined by <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">config.IMG_PATH[1]<\/code> and converting it from BGR to RGB color space using the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">cv2.cvtColor<\/code> function, as shown on <strong>Lines 42 and 43<\/strong>. 
<\/p>\n<p>Next, on <strong>Line 47<\/strong>, we use the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">generate_object_masks<\/code> function that we defined above to extract the segmentation masks from our input image and store them as <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">all_masks<\/code>. Notice that the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">all_masks<\/code> output from SAM is a list of dictionaries, one per predicted mask, each with the following keys: [&#8216;segmentation&#8217;, &#8216;area&#8217;, &#8216;bbox&#8217;, &#8216;predicted_iou&#8217;, &#8216;point_coords&#8217;, &#8216;stability_score&#8217;, &#8216;crop_box&#8217;].<\/p>\n<p>Note that we will use the segmentation masks (i.e., &#8216;segmentation&#8217; key) for visualization and the bounding boxes (i.e., &#8216;bbox&#8217; key) for cropping individual objects.<\/p>\n<p>Now that we have the SAM predictions, let us prepare to visualize them on our input image and save the final visualizations.<\/p>\n<p>We first check if the folder where the output predictions will be stored (i.e., <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">config.OUT_PATH<\/code>) already exists, and if not, we create it (<strong>Lines 50 and 51<\/strong>). 
Furthermore, we define the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">prediction_path<\/code> on <strong>Line 54<\/strong>, which indicates the location where our segmentation visualization will be stored.<\/p>\n<p>On <strong>Lines 55 and 56<\/strong>, we initialize a matplotlib figure and visualize the input image.<\/p>\n<p>Next, we visualize all the predicted segmentation masks from SAM using the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">show_all_masks<\/code> function and save the visualization using the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">plt.savefig<\/code> at the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">prediction_path<\/code>, as shown on<strong> Lines 57 and 58<\/strong>.<\/p>\n<p><strong>Figure 2<\/strong> shows the visualization of all segmentation masks predicted by SAM for our input image.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><a href=\"https:\/\/lh5.googleusercontent.com\/22y736ZtuRNdIzc6jLZ4wcxlVzoESbOyg8QpHrrxeJBlBvl3s8B9lEl8U4tcCkDHoS8tcpR7M-V0qmSpctXJyDbT4WHNgSEUTQ5Mkf_BXyogARTsSeq3AX2UvESGdv3wK4eKmfbKTMs6LHrpEum0qRQ\"  rel=\"noreferrer noopener\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/22y736ZtuRNdIzc6jLZ4wcxlVzoESbOyg8QpHrrxeJBlBvl3s8B9lEl8U4tcCkDHoS8tcpR7M-V0qmSpctXJyDbT4WHNgSEUTQ5Mkf_BXyogARTsSeq3AX2UvESGdv3wK4eKmfbKTMs6LHrpEum0qRQ\" alt=\"\"\/><\/a><figcaption> <strong>Figure 2:<\/strong> Input Image (<em>left<\/em>) and all mask prediction from SAM (<em>right<\/em>) (source: image by the author).<\/figcaption><\/figure>\n<\/div>\n<p>Now, let us go ahead and visualize each segmentation mask and the corresponding bounding box predicted by SAM to get a sense of the prominent objects that were segmented out by SAM.<\/p>\n<p>We start by iterating over the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">all_masks<\/code> output. 
For each segmentation prediction, we first plot the input image (<strong>Lines 62 and 63<\/strong>) and then visualize the segmentation mask (i.e., <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">mask_info['segmentation']<\/code>) using the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">show_mask<\/code> function (<strong>Lines 64 and 65<\/strong>).<\/p>\n<p>Finally, we save the visualization using <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">plt.savefig<\/code> in the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">config.OUT_PATH<\/code> folder, as shown on <strong>Line 66<\/strong>.<\/p>\n<p>Now that we have visualized each prediction and its corresponding segmentation mask and bounding box, let us take a list of predicted mask IDs from SAM that contain prominent everyday objects, along with the corresponding object names or labels (<strong>Line 69<\/strong>).<\/p>\n<p>Note that on <strong>Lines 69-89<\/strong>, we manually curate a few mask IDs (which we visualized) with prominent everyday objects (e.g., paintings, plants, vases, pillows, etc.) and assign them corresponding labels with the help of the labels list.<\/p>\n<p>Finally, we use the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">save_object_crops<\/code> function that we defined above to save the object crops.<\/p>\n<p>We will use these objects extracted from our input image for our downstream tasks.<\/p>\n<p><strong>Figure 3<\/strong> shows a few examples of cropped objects and their corresponding labels. 
<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><a href=\"https:\/\/lh3.googleusercontent.com\/OifGyDgzKpbjvNozsx8M2LqdltKj4j1C2BPSv4BHa-phIByXVNrxzOaBjh4AKPjd3D3b-OfvOTo8PtvbjEKTZ8MG8P5buW_BRCG8I1xDeWPdNqnAisrQMrmFOME06gDicmnxqpvNBXbZguNMDcsfbxY\"  rel=\"noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/OifGyDgzKpbjvNozsx8M2LqdltKj4j1C2BPSv4BHa-phIByXVNrxzOaBjh4AKPjd3D3b-OfvOTo8PtvbjEKTZ8MG8P5buW_BRCG8I1xDeWPdNqnAisrQMrmFOME06gDicmnxqpvNBXbZguNMDcsfbxY\" alt=\"\" width=\"700\" height=\"173\"\/><\/a><figcaption><strong>Figure 3:<\/strong> Cropped individual objects \u2014painting, plant, dog, painting, door, plant (source: image by the author).<\/figcaption><\/figure>\n<\/div>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" id=\"h3Downstream\"\/>\n<h3><a href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#TOC-h3Downstream\"><strong>Downstream Tasks with SAM and CLIP<\/strong><\/a><\/h3>\n<p>Now that we have extracted individual objects from our input image, it is time to complete building our SAM and the CLIP-based system and use it for the aforementioned downstream tasks.<\/p>\n<p>Specifically, we will take the cropped objects stored in the previous section and pass them through the pre-trained CLIP image encoder model. 
Additionally, we will engineer prompts for our labels and pass them through the pre-trained CLIP text encoder model to get N-dimensional (here 512-dimensional) representations.<\/p>\n<p>Then, we will use cosine similarity to classify images or retrieve them based on text prompts, as discussed below.<\/p>\n<p>Let us open our <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">clip_integration.py<\/code> file and get started.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"true\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks\" data-enlighter-group=\"5\"># USAGE\n# python clip_integration.py\n\n# import the necessary packages\nfrom pyimagesearch import config\nimport matplotlib.pyplot as plt\nimport open_clip\nimport pickle\nimport torch\nimport PIL.Image  # import the submodule explicitly so PIL.Image.fromarray resolves\nimport os\n\ndef compute_clip_features(model, tokenizer, image, prompts):\n    \"\"\"Compute CLIP features for given image and prompts\"\"\"\n\n    # tokenize the text\n    text = tokenizer(prompts)\n\n    # compute CLIP image and text features\n    with torch.no_grad(), torch.cuda.amp.autocast():\n        image_features = model.encode_image(image)\n        text_features = model.encode_text(text)\n        image_features \/= image_features.norm(dim=-1, keepdim=True)\n        text_features \/= text_features.norm(dim=-1, keepdim=True)\n\n    # return image and text features\n    return (image_features, text_features)\n\nif __name__ == \"__main__\":\n    # load object crops and labels\n    print(\"[INFO] Loading object crops...\")\n    with open(os.path.join(config.OUT_PATH, \"objects.pkl\"), \"rb\") as fp:\n        obj_dict = pickle.load(fp)\n        obj_crops = obj_dict[\"crops\"]\n        labels = obj_dict[\"labels\"]\n\n    # get unique objects\n    print(\"[INFO] Getting unique objects...\")\n    objects = set(labels)\n\n    
# initialize CLIP model\n    print(\"[INFO] Loading CLIP model...\")\n    model, _, preprocess = open_clip.create_model_and_transforms(\n        \"ViT-B-32\", pretrained=\"laion2b_s34b_b79k\"\n    )\n    tokenizer = open_clip.get_tokenizer(\"ViT-B-32\")\n\n    # create prompts\n    print(\"[INFO] Creating prompts...\")\n    prompts = [\"a photo of a \" + item for item in objects]<\/pre>\n<p>We start by importing the necessary packages as always on <strong>Lines 5-11<\/strong>.<\/p>\n<p>Next, we define the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">compute_clip_features<\/code> function, which takes as input the pre-trained CLIP model, the corresponding text tokenizer, the input image, and the list of prompts and outputs the image features and text features or representations from CLIP (<strong>Lines 13-27<\/strong>).<\/p>\n<p>On <strong>Line 17<\/strong>, we start by using the tokenizer to convert the text prompts into corresponding tokens, which can be input to the text encoder of CLIP. 
Next, we switch off the gradient computation and enable mixed-precision autocasting (<strong>Line 20<\/strong>), since we are using a pre-trained CLIP model for inference only, and encode the image and the text by passing them through the CLIP model using the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">encode_image<\/code> and <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">encode_text<\/code> functions (<strong>Lines 21 and 22<\/strong>).<\/p>\n<p>Finally, we normalize <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">image_features<\/code> and <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">text_features<\/code> to have unit norm (<strong>Lines 23 and 24<\/strong>) and return them (<strong>Line 27<\/strong>).<\/p>\n<p>Now that we have implemented the function to get the CLIP features, let us load the cropped objects we saved in the previous section and prepare for implementing the downstream tasks.<\/p>\n<p>We start by opening the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">objects.pkl<\/code> file from the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">config.OUT_PATH<\/code> folder (<strong>Line 32<\/strong>) and loading the saved <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">obj_dict<\/code> dictionary (<strong>Line 33<\/strong>). Next, we get the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">obj_crops<\/code> and <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">labels<\/code> lists we created in the previous section (<strong>Lines 34 and 35<\/strong>). We use <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">set()<\/code> to get the set of unique objects from our labels list (<strong>Line 39<\/strong>).<\/p>\n<p>It is now time to initialize and load our pre-trained CLIP model. 
Note that for this tutorial, we will use the <a href=\"https:\/\/github.com\/mlfoundations\/open_clip\"  rel=\"noreferrer noopener\">Open CLIP<\/a> implementation and weights.<\/p>\n<p>On <strong>Lines 43-45<\/strong>, we initialize the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">open_clip<\/code> ViT-B-32 architecture-based model and load the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">laion2b_s34b_b79k<\/code> pre-trained weights. This gives us the model and the preprocess transformations (<strong>Line 43<\/strong>), which we will apply to our input image before passing it through the image encoder of the CLIP model. Furthermore, on <strong>Line 46<\/strong>, we get the text tokenizer for &#8216;ViT-B-32&#8217;, which will allow us to preprocess the text and convert it to tokens, which can be fed to the text encoder of CLIP.<\/p>\n<p>Then, we engineer prompts for each object in our set of unique objects. The simplest way to engineer prompts for CLIP is to use a sentence like \u2018a photo of a {}\u2019 and replace the {} with the object name.<\/p>\n<p>On <strong>Line 50<\/strong>, we create a prompt for each object with the above format. 
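<\/p>\n<p>As a rough illustration of the prompt template described above, the sketch below builds one prompt per unique label in plain Python. The <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">build_prompts<\/code> helper and the sorting step are assumptions introduced here so that the prompt order is the same on every run. The tutorial's script iterates over the set directly.<\/p>

```python
# Labels copied from the manually curated list earlier in the tutorial.
labels = [
    "painting", "door", "painting", "carpet", "vase", "plant",
    "pillow", "blanket", "pillow", "pillow", "blanket", "jug",
    "blanket", "dog", "plant", "plant", "plant", "pillow",
]


def build_prompts(labels, template="a photo of a {}"):
    """Hypothetical helper: one prompt per unique label, in sorted order."""
    unique_objects = sorted(set(labels))
    return [template.format(name) for name in unique_objects]


prompts = build_prompts(labels)
print(len(prompts))  # number of unique objects
print(prompts[0])    # first prompt in alphabetical order
```

<p>Sorting is only one possible choice. Any fixed order works, as long as the index of a prompt means the same thing each time the script runs, which matters when prompt indices are later used in output file names.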
<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"true\" data-enlighter-lineoffset=\"52\" data-enlighter-title=\"SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks\" data-enlighter-group=\"6\">    # initialize list to store all processed images\n    images_all = []\n\n    # loop over all objects and perform zero-shot image classification\n    for (i, crop) in enumerate(obj_crops):\n        # pre-process the image and compute both image and text\n        # features\n        image = PIL.Image.fromarray(crop)\n        image_processed = preprocess(image).unsqueeze(0)\n        images_all.append(image_processed)\n        image_features, text_features = compute_clip_features(model,\n            tokenizer, image_processed, prompts)\n\n        # calculate the similarity and display it\n        similarity = image_features @ text_features.T\n        probs = (100.0 * similarity).softmax(dim=-1)\n        print(\"Object is\", prompts[torch.argmax(probs).item()])\n\n    # text to image retrieval\n    images_all_tensor = torch.cat(images_all, dim=0)\n\n    # loop over all prompts\n    for (i, prompt) in enumerate(prompts):\n        # compute features and calculate the similarity\n        (image_features, text_features) = compute_clip_features(model,\n            tokenizer, images_all_tensor, [prompt])\n        similarity = image_features @ text_features.T\n        probs = (100.0 * similarity).softmax(dim=0)  # normalize across crops, not the single prompt\n        max_crop = obj_crops[torch.argmax(probs).item()]\n\n        # plot the image and save to disk\n        plt.figure(figsize=(1, 1))\n        plt.imshow(max_crop)\n        plt.savefig(os.path.join(config.OUT_PATH, \"obj_\" + str(i) + \".png\"))\n\n    # calculate and display image similarity\n    img_embed, _ = compute_clip_features(model, tokenizer,\n        images_all_tensor, [prompts[0]])\n    sim = img_embed @ img_embed.T\n    print(sim)<\/pre>\n<p>Now 
that we have our prompts and pre-trained CLIP model ready, it is time to go ahead and see our pipeline perform different downstream tasks.<\/p>\n<p>Let us start with zero-shot classification of the objects we extracted from the image using SAM.<\/p>\n<p>In this task, we take one object crop at a time and use CLIP to identify which prompt sentence from the prompts list best describes the object.<\/p>\n<p>We start by initializing an empty <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">images_all<\/code> list (<strong>Line 53<\/strong>), which will store all processed images for later use.<\/p>\n<p>For each cropped object image in the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">obj_crops<\/code> list, we first convert the NumPy array to a PIL image (<strong>Line 59<\/strong>) and use the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">preprocess<\/code> function from Open CLIP to get the image in the format expected by the image encoder of CLIP (<strong>Line 60<\/strong>). We store this processed image in the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">images_all<\/code> list, as shown on <strong>Line 61<\/strong>.<\/p>\n<p>Next, we pass the CLIP model, tokenizer, processed image (i.e., <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">image_processed<\/code>), and prompts list as arguments to the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">compute_clip_features<\/code> function we defined above and get the corresponding <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">image_features<\/code> and <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">text_features<\/code> (<strong>Lines 62 and 63<\/strong>). 
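<\/p>\n<p>To make the scoring step that follows more concrete, here is a minimal sketch in plain Python of how unit-normalized features can be turned into class probabilities. The toy 3-dimensional vectors and the helper names (<code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">normalize<\/code>, <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">softmax<\/code>, <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">classify<\/code>) are assumptions for illustration only. The real features in this tutorial are 512-dimensional tensors produced by CLIP.<\/p>

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_vec, text_vecs, scale=100.0):
    """Return (best_index, probabilities) for one image against all prompts."""
    image_unit = normalize(image_vec)
    # For unit vectors, the dot product equals the cosine similarity.
    sims = [
        sum(a * b for a, b in zip(image_unit, normalize(text_vec)))
        for text_vec in text_vecs
    ]
    probs = softmax([scale * s for s in sims])
    return probs.index(max(probs)), probs


# Made-up toy features: the image points mostly along the second prompt.
image_feature = [0.1, 0.9, 0.2]
text_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
best, probs = classify(image_feature, text_features)
print(best, probs)
```

<p>In this sketch the factor of 100 sharpens the softmax so that the best-matching prompt receives almost all of the probability mass. For classification the scores are normalized across prompts. For retrieval, the same scores would instead be normalized across crops.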
<\/p>\n<p>Notice that <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">image_features<\/code> is the CLIP image encoder representation for the current cropped object under consideration (with dimension <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">[1, 512]<\/code>) and <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">text_features<\/code> is the CLIP text encoder representation for all 9 prompts in the prompts list (with dimension <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">[9, 512]<\/code>).<\/p>\n<p>Since the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">image_features<\/code> and <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">text_features<\/code> are in the same latent space, we can simply compute the cosine similarity of <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">image_features<\/code> with <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">text_features<\/code> of each prompt.<\/p>\n<p>Since both <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">image_features<\/code> and <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">text_features<\/code> have unit norms, the cosine similarity reduces to a dot product, so we can compute all scores at once with a matrix product (i.e., <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">image_features @ text_features.T<\/code>), as shown on <strong>Line 66<\/strong>. Each entry of the resulting <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">[1, 9]<\/code> matrix is the similarity between the image and one prompt.<\/p>\n<p>Furthermore, to convert the similarity scores to probabilities, we scale them by multiplying them by 100 and take the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">softmax<\/code> over the prompts, as shown on <strong>Line 67<\/strong>.<\/p>\n<p>This gives us <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">9<\/code> probability values, each indicating the similarity of the prompt sentence representation with the image representation.<\/p>\n<p>On 
<strong>Line 68<\/strong>, we take the prompt with the maximum probability as the description or class of the object in the image crop.<\/p>\n<p>Next, let us move on to the text-to-image retrieval task. For this task, our job is to take one prompt in our prompts list at a time and find the object crop (among all the object crops we have) that best matches the prompt.<\/p>\n<p>Recall that every object crop was already preprocessed into the format the CLIP image encoder expects and collected in the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">images_all<\/code> list during the classification loop.<\/p>\n<p>On <strong>Line 71<\/strong>, we consolidate them by concatenating all the elements in the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">images_all<\/code> list into a single tensor.<\/p>\n<p>We are now ready to perform our text-to-image retrieval task.<\/p>\n<p>We iterate over the prompts in our prompts list one by one (<strong>Line 74<\/strong>) and pass the CLIP model, tokenizer, <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">images_all_tensor<\/code>, and the specific prompt (i.e., <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">[prompt]<\/code>) to the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">compute_clip_features<\/code> function (<strong>Lines 76 and 77<\/strong>), which outputs the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">image_features<\/code> and <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">text_features<\/code>.<\/p>\n<p>Notice that the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">image_features<\/code> here is the CLIP image encoder representation for all cropped objects (with dimension <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">[18, 512]<\/code>), and <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">text_features<\/code> is the CLIP text encoder representation of the particular prompt under consideration (with dimension <code class=\"EnlighterJSRAW\" 
data-enlighter-language=\"python\">[1, 512]<\/code>).<\/p>\n<p>Once we have the representations, we follow the same procedure as in our zero-shot classification experiment.<\/p>\n<p>As both <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">image_features<\/code> and <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">text_features<\/code> have unit norms, the cosine similarity reduces to a dot product, which we compute for all crops at once with a matrix product (i.e., <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">image_features @ text_features.T<\/code>), as shown on <strong>Line 78<\/strong>.<\/p>\n<p>Furthermore, to convert the similarity scores to probabilities, we scale them by multiplying them by 100 and take the softmax, as shown on <strong>Line 79<\/strong>.<\/p>\n<p>This gives us <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">18<\/code> probability values, one for each of the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">18<\/code> object crops, indicating how well that crop matches the given prompt sentence.<\/p>\n<p>On <strong>Line 80<\/strong>, we take the crop from the <code data-enlighter-language=\"python\" class=\"EnlighterJSRAW\">obj_crops<\/code> list with the maximum probability assigned (i.e., <code data-enlighter-language=\"python\" class=\"EnlighterJSRAW\">max_crop<\/code>) for the given prompt and plot it using matplotlib on <strong>Lines 83 and 84<\/strong>. 
Finally, on <strong>Line 85<\/strong>, we save this object crop.<\/p>\n<p>Let us now move to our image similarity task, where our job will be to find out which object crops in our list <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">obj_crops<\/code> are the most similar.<\/p>\n<p>Note that in both the previous tasks, we used both the CLIP image and text encoders to get scores since these tasks were based on both the image and text modalities.<\/p>\n<p>However, we only need the CLIP image encoder for the image similarity task, as this task involves only the visual modality and not the text modality.<\/p>\n<p>On <strong>Line 88<\/strong>, we simply use the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">compute_clip_features<\/code> function, which takes our model, tokenizer, the tensor with all our images, and a dummy prompt. Note that we can pass any prompt here since we are only interested in using the image encoder to get image features, and we will not use the text encoder features at all for this task.<\/p>\n<p>We store the image embeddings or features as <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">img_embed<\/code>, as shown on <strong>Line 88<\/strong>.<\/p>\n<p>Next, we want to compute the similarity score between each pair of object crops or images.<\/p>\n<p>This can simply be done using the matrix product of <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">img_embed<\/code> with its own transpose, which yields an 18 x 18 matrix of pairwise cosine similarities, as shown on <strong>Line 90<\/strong>.<\/p>\n<p>Let us consider a few examples of crops we saw above and try to understand the <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">sim<\/code> similarity matrix for these crops that our code generates.<\/p>\n<p><strong>Figure 4<\/strong> shows the image similarities for a few crops we considered in the previous section.<\/p>\n<div 
class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><a href=\"https:\/\/lh6.googleusercontent.com\/5NNGIAl5cupqr2r0d2OVD35xcbXMr0KkZNuTaLVO7GTnkXJ5oPxkn0cI9hrq--0uuqy5pAX16JfY11YcPTxLPZEVX_26Rj86VCLAapjftTMl1sU0hfKRe2qSylIPbDhNoazyF2AW8Os8Ehp-X00texw\"  rel=\"noreferrer noopener\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/5NNGIAl5cupqr2r0d2OVD35xcbXMr0KkZNuTaLVO7GTnkXJ5oPxkn0cI9hrq--0uuqy5pAX16JfY11YcPTxLPZEVX_26Rj86VCLAapjftTMl1sU0hfKRe2qSylIPbDhNoazyF2AW8Os8Ehp-X00texw\" alt=\"\" width=\"352\" height=\"500\"\/><\/a><figcaption><strong>Figure 4:<\/strong> Similarity matrix from CLIP for a few object crops \u2014 painting, door, painting, plant, dog, plant (source: image by the author).<\/figcaption><\/figure>\n<\/div>\n<p>Notice that semantically similar objects have high similarity scores. Also, note that the similarity of an image crop with itself (on the diagonal) is always <code class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">1.0<\/code>.<\/p>\n<p>In <strong>row 1<\/strong>, we notice that the painting crop is most similar to (apart from itself) the crop of another painting. In <strong>row 3<\/strong>, we also notice the same thing with the other painting.<\/p>\n<p>In <strong>row 4<\/strong>,  we notice that the plant crop is most similar to (apart from itself) the crop of another plant as shown. We also notice the same thing in <strong>row 6<\/strong>.<\/p>\n<p>This completes our implementation for performing various downstream tasks with SAM and the CLIP-based system we discussed at the beginning of the tutorial.<\/p>\n
<hr class=\"wp-block-separator has-alpha-channel-opacity\" id=\"h2Summary\"\/>\n<h2><a href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#TOC-h2Summary\"><strong>Summary<\/strong><\/a><\/h2>\n<p>In this tutorial, we looked at how SAM can be integrated with existing models to build systems that can perform diverse downstream tasks without fine-tuning the models using a task-specific curated dataset.<\/p>\n<p>Specifically, we built our own SAM and CLIP-based pipeline in PyTorch, which can extract objects of interest from images and perform classification or text-to-image retrieval in a zero-shot fashion on the extracted individual objects.<\/p>\n<p>The two tutorials in this series show the amazing capabilities of SAM and discuss the various ways that SAM can be used in your own computer vision for solving segmentation or other diverse downstream tasks.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" id=\"h3Citation\"\/>\n<h3><a href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/#TOC-h3Citation\"><strong>Citation Information<\/strong><\/a><\/h3>\n<p><strong>Chandhok, S. <\/strong>\u201cSAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks,\u201d <em>PyImageSearch<\/em>, P. Chugh, A. R. Gosthipaty, S. Huot, K. Kidriavsteva, and R. 
Raha, eds., 2023, <a href=\"https:\/\/pyimg.co\/27uny\"  rel=\"noreferrer noopener\">https:\/\/pyimg.co\/27uny<\/a>  <\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\" data-enlighter-theme=\"classic\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"false\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">@incollection{Chandhok_2023_SAM-Part2,\n  author = {Shivam Chandhok},\n  title = {{SAM} from {Meta AI} (Part 2): Integration with {CLIP} for Downstream Tasks},\n  booktitle = {PyImageSearch},\n  editor = {Puneet Chugh and Aritra Roy Gosthipaty and Susan Huot and Kseniia Kidriavsteva and Ritwik Raha},\n  year = {2023},\n  url = {https:\/\/pyimg.co\/27uny},\n}<\/pre>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/\">SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/pyimagesearch.com\/\">PyImageSearch<\/a>.<\/p>\n\n<p class=\"syndicated-attribution\"><figure class= \\\"wp-block-image alignnone \\\"><img src= \\\"http:\/\/itteacheritfreelance.hk\/test\/wordpress\/wp-content\/uploads\/2016\/05\/logo2-2.png\\\" alt=\\\"IT\u96fb\u8166\u88dc\u7fd2 java\u88dc\u7fd2 \u70ba\u5927\u5bb6\u914d\u5c0d\u96fb\u8166\u88dc\u7fd2,IT freelance, \u79c1\u4eba\u8001\u5e2b, PHP\u88dc\u7fd2,CSS\u88dc\u7fd2,XML,Java\u88dc\u7fd2,MySQL\u88dc\u7fd2,graphic design\u88dc\u7fd2,\u4e2d\u5c0f\u5b78ICT\u88dc\u7fd2,\u4e00\u5c0d\u4e00\u79c1\u4eba\u88dc\u7fd2\u548cFreelance\u81ea\u7531\u5de5\u4f5c\u914d\u5c0d\u3002\\\"\/><figcaption>\u7acb\u523b\u8a3b\u518a\u53ca\u5831\u540d\u96fb\u8166\u88dc\u7fd2\u8ab2\u7a0b\u5427!<\/figcaption><\/figure>\r\n<\/br>Find A Teacher Form:\r\n<\/br>https:\/\/docs.google.com\/forms\/d\/1vREBnX5n262umf4wU5U2pyTwvk9O-JrAgblA-wH9GFQ\/viewform?edit_requested=true#responses\r\n<\/br><\/br>Email:\r\n<\/br>public1989two@gmail.com<br><br><br><br><br><br><br>\r\n<a href=www.itsec.hk style=color:#FFFFFF;>www.itsec.hk<\/a><br>\r\n<a href=\\\"www.itsec.vip\\\" 
style=color:#FFFFFF;>www.itsec.vip<\/a><br>\r\n<a href=\\\"www.itseceu.uk\\\" style=color:#FFFFFF;>www.itseceu.uk<\/a><br><\/p>","protected":false},"excerpt":{"rendered":"<div class=\"mh-excerpt\"><p>Table of Contents SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks SAM and CLIP Integration Configuring Your Development Environment Need Help Configuring Your Development Environment? Project Structure Getting Individual Objects from Image Downstream Tasks with SAM\u2026<\/p>\n<p>The post <a rel=\"nofollow\" href=\"https:\/\/pyimagesearch.com\/2023\/09\/18\/sam-from-meta-ai-part-2-integration-with-clip-for-downstream-tasks\/\">SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks<\/a> appeared first on <a rel=\"nofollow\" href=\"https:\/\/pyimagesearch.com\/\">PyImageSearch<\/a>.<\/p>\n<\/div>","protected":false},"author":2021,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"slim_seo":{"title":"SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks - ITTeacherITFreelance.hk","description":"Table of Contents SAM from Meta AI (Part 2): Integration with CLIP for Downstream Tasks SAM and CLIP Integration Configuring Your Development Environment Need 
H"},"footnotes":""},"categories":[10700],"tags":[10727,10728,10729,10730,10731,10732,10733],"_links":{"self":[{"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/329278"}],"collection":[{"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/users\/2021"}],"replies":[{"embeddable":true,"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/comments?post=329278"}],"version-history":[{"count":1,"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/329278\/revisions"}],"predecessor-version":[{"id":329279,"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/329278\/revisions\/329279"}],"wp:attachment":[{"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/media?parent=329278"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/categories?post=329278"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itteacheritfreelance.hk\/wordpress\/index.php\/wp-json\/wp\/v2\/tags?post=329278"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}