
@simonw
Last active August 26, 2024 04:15

Gemini API Image and Prompt Tool

human (Aug 25, 2024, 09:07 PM)

/**
 * @license
 * Copyright 2024 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
import { GoogleGenerativeAI } from "https://esm.run/@google/generative-ai";
import { marked } from "https://esm.run/marked";
/**
 * Returns a model instance.
 *
 * @param {GoogleGenerativeAI.ModelParams} params
 * @returns {GoogleGenerativeAI.GenerativeModel}
 */
export async function getGenerativeModel(params) {
  // Fetch API key from server
  // If you need a new API key, get it from https://makersuite.google.com/app/apikey
  const API_KEY = await (await fetch("API_KEY")).text();
  const genAI = new GoogleGenerativeAI(API_KEY);
  return genAI.getGenerativeModel(params);
}
/**
 * Converts a File object to a GoogleGenerativeAI.Part object.
 *
 * @param {Blob} file
 * @returns {GoogleGenerativeAI.Part}
 */
export async function fileToGenerativePart(file) {
  const base64EncodedDataPromise = new Promise((resolve) => {
    const reader = new FileReader();
    reader.onloadend = () => resolve(reader.result.split(",")[1]);
    reader.readAsDataURL(file);
  });
  return {
    inlineData: { data: await base64EncodedDataPromise, mimeType: file.type },
  };
}
/**
 * Scrolls the document all the way to the bottom.
 */
export function scrollToDocumentBottom() {
  const scrollingElement = document.scrollingElement || document.body;
  scrollingElement.scrollTop = scrollingElement.scrollHeight;
}
/**
 * Updates the `resultEl` with parsed markdown text returned by a `getResult()` call.
 *
 * @param {HTMLElement} resultEl
 * @param {() => Promise<GoogleGenerativeAI.GenerateContentResponse>} getResult
 * @param {boolean} streaming
 */
export async function updateUI(resultEl, getResult, streaming) {
  resultEl.className = "loading";
  let text = "";
  try {
    const result = await getResult();
    if (streaming) {
      resultEl.innerText = "";
      for await (const chunk of result.stream) {
        // Get first candidate's current text chunk
        const chunkText = chunk.text();
        text += chunkText;
        resultEl.innerHTML = marked.parse(text);
        scrollToDocumentBottom();
      }
    } else {
      const response = await result.response;
      text = response.text();
    }
    resultEl.className = ""; // Remove .loading class
  } catch (err) {
    text += "\n\n> " + err;
    resultEl.className = "error";
  }
  resultEl.innerHTML = marked.parse(text);
  scrollToDocumentBottom();
}

Based on that example text, build me an HTML page with vanilla JS that loads the Gemini API from esm.run. It should have a file input, a textarea and a submit button: you attach an image, enter a prompt and click the button, and it runs a Gemini prompt with that image and prompt and injects the returned result into a div on the page.

assistant (Aug 25, 2024, 09:07 PM)

I'd be happy to create an HTML page with vanilla JavaScript that uses the Gemini API to process an image and prompt. Here's the implementation:

Gemini API Image and Prompt Processor

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Gemini API Image and Prompt Processor</title>
    <script type="module">
        import { GoogleGenerativeAI } from "https://esm.run/@google/generative-ai";
        import { marked } from "https://esm.run/marked";

        async function getGenerativeModel(params) {
            const API_KEY = "YOUR_API_KEY_HERE"; // Replace with your actual API key
            const genAI = new GoogleGenerativeAI(API_KEY);
            return genAI.getGenerativeModel(params);
        }

        async function fileToGenerativePart(file) {
            return new Promise((resolve) => {
                const reader = new FileReader();
                reader.onloadend = () => resolve({
                    inlineData: { 
                        data: reader.result.split(",")[1],
                        mimeType: file.type 
                    }
                });
                reader.readAsDataURL(file);
            });
        }

        async function processImageAndPrompt() {
            const fileInput = document.getElementById('imageInput');
            const promptInput = document.getElementById('promptInput');
            const resultDiv = document.getElementById('result');
            
            if (!fileInput.files[0] || !promptInput.value) {
                alert('Please select an image and enter a prompt.');
                return;
            }

            resultDiv.innerHTML = 'Processing...';
            
            try {
                const model = await getGenerativeModel({ model: "gemini-pro-vision" });
                const imagePart = await fileToGenerativePart(fileInput.files[0]);
                
                const result = await model.generateContent([promptInput.value, imagePart]);
                const response = await result.response;
                const text = response.text();
                
                resultDiv.innerHTML = marked.parse(text);
            } catch (error) {
                resultDiv.innerHTML = `Error: ${error.message}`;
            }
        }

        // Attach event listener to the submit button
        document.getElementById('submitBtn').addEventListener('click', processImageAndPrompt);
    </script>
    <style>
        body {
            font-family: Arial, sans-serif;
            max-width: 800px;
            margin: 0 auto;
            padding: 20px;
        }
        textarea {
            width: 100%;
            height: 100px;
        }
        #result {
            margin-top: 20px;
            border: 1px solid #ccc;
            padding: 10px;
            min-height: 100px;
        }
    </style>
</head>
<body>
    <h1>Gemini API Image and Prompt Processor</h1>
    <input type="file" id="imageInput" accept="image/*">
    <textarea id="promptInput" placeholder="Enter your prompt here"></textarea>
    <button id="submitBtn">Process</button>
    <div id="result"></div>
</body>
</html>

This HTML page includes everything needed to interact with the Gemini API using an image and a text prompt. Here's a breakdown of the key components:

  1. The page loads the GoogleGenerativeAI and marked libraries from esm.run.
  2. It includes a file input for selecting an image, a textarea for entering the prompt, and a submit button.
  3. The JavaScript code handles:
    • Converting the selected file to the format required by the Gemini API.
    • Sending the image and prompt to the API when the submit button is clicked.
    • Displaying the result in the result div, with markdown parsing for better formatting.

To use this page:

  1. Replace "YOUR_API_KEY_HERE" with your actual Gemini API key.
  2. Save the content as an HTML file and open it in a web browser.
  3. Select an image, enter a prompt, and click the "Process" button to see the results.

Note that this implementation assumes you have a valid API key and are authorized to use the Gemini API. Make sure to handle the API key securely in a production environment.

human (Aug 25, 2024, 09:10 PM)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Bounding Box Visualizer (Gemini Coordinates)</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            display: flex;
            flex-direction: column;
            align-items: center;
            padding: 20px;
        }
        #fileInput, #coordinatesInput {
            margin-bottom: 20px;
        }
        #coordinatesInput {
            width: 100%;
            height: 100px;
        }
        #imageContainer {
            width: 80%;
            position: relative;
        }
        #canvas {
            width: 100%;
            height: auto;
        }
        #hoveredCoordinates {
            margin-top: 10px;
            font-size: 14px;
        }
    </style>
</head>
<body>
    <input type="file" id="fileInput" accept="image/*">
    <textarea id="coordinatesInput" placeholder="Paste coordinates here (e.g., [488, 945, 519, 999])"></textarea>
    <div id="imageContainer">
        <canvas id="canvas"></canvas>
    </div>
    <div id="hoveredCoordinates"></div>

    <script>
        var fileInput = document.getElementById('fileInput');
        var coordinatesInput = document.getElementById('coordinatesInput');
        var canvas = document.getElementById('canvas');
        var ctx = canvas.getContext('2d');
        var hoveredCoordinates = document.getElementById('hoveredCoordinates');

        var image = null;
        var boxes = [];

        var colors = ['#FF0000', '#00FF00', '#0000FF', '#FFFF00', '#FF00FF', '#00FFFF'];

        fileInput.addEventListener('change', handleImageUpload);
        coordinatesInput.addEventListener('input', handleCoordinatesChange);
        canvas.addEventListener('mousemove', handleCanvasMouseMove);
        canvas.addEventListener('mouseleave', function() {
            hoveredCoordinates.textContent = '';
        });

        function handleImageUpload(e) {
            var file = e.target.files[0];
            var reader = new FileReader();
            reader.onload = function(event) {
                image = new Image();
                image.onload = drawImage;
                image.src = event.target.result;
            };
            reader.readAsDataURL(file);
        }

        function handleCoordinatesChange(e) {
            var input = e.target.value;
            var parsedBoxes = input.match(/\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]/g) || [];
            boxes = parsedBoxes.map(function(box) {
                var bbox = JSON.parse(box);
                return [
                    1000 - bbox[2],
                    1000 - bbox[3],
                    1000 - bbox[0],
                    1000 - bbox[1]
                ];
            });
            drawImage();
        }

        function drawImage() {
            if (!image) return;

            canvas.width = image.width;
            canvas.height = image.height;
            ctx.drawImage(image, 0, 0);

            boxes.forEach(function(box, index) {
                var ymin = (box[0] / 1000) * image.height;
                var xmin = (box[1] / 1000) * image.width;
                var ymax = (box[2] / 1000) * image.height;
                var xmax = (box[3] / 1000) * image.width;

                var width = xmax - xmin;
                var height = ymax - ymin;

                ctx.strokeStyle = colors[index % colors.length];
                ctx.lineWidth = 2;
                ctx.strokeRect(xmin, ymin, width, height);
            });
        }

        function handleCanvasMouseMove(e) {
            var rect = canvas.getBoundingClientRect();
            var x = (e.clientX - rect.left) / (rect.right - rect.left) * canvas.width;
            var y = (e.clientY - rect.top) / (rect.bottom - rect.top) * canvas.height;

            var hoveredBoxIndex = boxes.findIndex(function(box) {
                var ymin = (box[0] / 1000) * image.height;
                var xmin = (box[1] / 1000) * image.width;
                var ymax = (box[2] / 1000) * image.height;
                var xmax = (box[3] / 1000) * image.width;

                return x >= xmin && x <= xmax && y >= ymin && y <= ymax;
            });

            if (hoveredBoxIndex !== -1) {
                hoveredCoordinates.textContent = 'Hovered box coordinates: [' + boxes[hoveredBoxIndex].join(', ') + ']';
            } else {
                hoveredCoordinates.textContent = '';
            }
        }
    </script>
</body>
</html>

Now incorporate the logic from this tool, such that when the response is returned from the prompt, the image is displayed with any bounding boxes rendered on it.

assistant (Aug 25, 2024, 09:10 PM)

I understand you want to incorporate the bounding box visualization logic from the provided HTML into our Gemini API image and prompt processor. I'll update the previous HTML to include this functionality. The new version will display the uploaded image with bounding boxes based on the coordinates returned in the Gemini API response.

Gemini API Image Processor with Bounding Box Visualization

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Gemini API Image Processor with Bounding Box Visualization</title>
    <script type="module">
        import { GoogleGenerativeAI } from "https://esm.run/@google/generative-ai";
        import { marked } from "https://esm.run/marked";

        async function getGenerativeModel(params) {
            const API_KEY = "YOUR_API_KEY_HERE"; // Replace with your actual API key
            const genAI = new GoogleGenerativeAI(API_KEY);
            return genAI.getGenerativeModel(params);
        }

        async function fileToGenerativePart(file) {
            return new Promise((resolve) => {
                const reader = new FileReader();
                reader.onloadend = () => resolve({
                    inlineData: { 
                        data: reader.result.split(",")[1],
                        mimeType: file.type 
                    }
                });
                reader.readAsDataURL(file);
            });
        }

        async function processImageAndPrompt() {
            const fileInput = document.getElementById('imageInput');
            const promptInput = document.getElementById('promptInput');
            const resultDiv = document.getElementById('result');
            
            if (!fileInput.files[0] || !promptInput.value) {
                alert('Please select an image and enter a prompt.');
                return;
            }

            resultDiv.innerHTML = 'Processing...';
            
            try {
                const model = await getGenerativeModel({ model: "gemini-pro-vision" });
                const imagePart = await fileToGenerativePart(fileInput.files[0]);
                
                const result = await model.generateContent([promptInput.value, imagePart]);
                const response = await result.response;
                const text = response.text();
                
                resultDiv.innerHTML = marked.parse(text);

                // Extract coordinates from the response
                const coordinates = extractCoordinates(text);
                if (coordinates.length > 0) {
                    displayImageWithBoundingBoxes(fileInput.files[0], coordinates);
                }
            } catch (error) {
                resultDiv.innerHTML = `Error: ${error.message}`;
            }
        }

        function extractCoordinates(text) {
            const regex = /\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]/g;
            const matches = text.match(regex) || [];
            return matches.map(JSON.parse);
        }

        function displayImageWithBoundingBoxes(file, coordinates) {
            const reader = new FileReader();
            reader.onload = function(event) {
                const image = new Image();
                image.onload = function() {
                    const canvas = document.getElementById('canvas');
                    canvas.width = image.width;
                    canvas.height = image.height;
                    const ctx = canvas.getContext('2d');
                    ctx.drawImage(image, 0, 0);

                    const colors = ['#FF0000', '#00FF00', '#0000FF', '#FFFF00', '#FF00FF', '#00FFFF'];
                    coordinates.forEach((box, index) => {
                        // Apply the same transform as the standalone tool:
                        // flip and reorder so we get [ymin, xmin, ymax, xmax] on a 0-1 scale.
                        const ymin = (1000 - box[2]) / 1000;
                        const xmin = (1000 - box[3]) / 1000;
                        const ymax = (1000 - box[0]) / 1000;
                        const xmax = (1000 - box[1]) / 1000;
                        const width = (xmax - xmin) * image.width;
                        const height = (ymax - ymin) * image.height;

                        ctx.strokeStyle = colors[index % colors.length];
                        ctx.lineWidth = 2;
                        ctx.strokeRect(xmin * image.width, ymin * image.height, width, height);
                    });
                };
                image.src = event.target.result;
            };
            reader.readAsDataURL(file);
        }

        // Attach event listener to the submit button
        document.getElementById('submitBtn').addEventListener('click', processImageAndPrompt);
    </script>
    <style>
        body {
            font-family: Arial, sans-serif;
            max-width: 800px;
            margin: 0 auto;
            padding: 20px;
        }
        textarea {
            width: 100%;
            height: 100px;
        }
        #result, #imageContainer {
            margin-top: 20px;
            border: 1px solid #ccc;
            padding: 10px;
        }
        #canvas {
            max-width: 100%;
            height: auto;
        }
    </style>
</head>
<body>
    <h1>Gemini API Image Processor with Bounding Box Visualization</h1>
    <input type="file" id="imageInput" accept="image/*">
    <textarea id="promptInput" placeholder="Enter your prompt here"></textarea>
    <button id="submitBtn">Process</button>
    <div id="result"></div>
    <div id="imageContainer">
        <canvas id="canvas"></canvas>
    </div>
</body>
</html>

This updated HTML page now incorporates the bounding box visualization logic. Here are the key changes and additions:

  1. The processImageAndPrompt function now includes logic to extract coordinates from the API response and display the image with bounding boxes.

  2. New functions added:

    • extractCoordinates: Extracts coordinate arrays from the API response text.
    • displayImageWithBoundingBoxes: Renders the uploaded image on a canvas and draws bounding boxes based on the extracted coordinates.
  3. The HTML structure now includes a canvas element for displaying the image with bounding boxes.

  4. The styling has been updated to accommodate the new image display area.
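The extraction regex can be exercised on its own, outside the page; the sample response text below is invented for illustration:

```javascript
// Same regex-based extraction as in the page above: find every
// [n, n, n, n] group in the response text and parse it into an array.
function extractCoordinates(text) {
  const regex = /\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]/g;
  const matches = text.match(regex) || [];
  return matches.map(JSON.parse);
}

// Made-up sample of what a Gemini response mentioning boxes might look like.
const sample = "The dog is at [120, 45, 880, 960] and the ball at [300, 500, 420, 640].";
console.log(extractCoordinates(sample));
// [[120, 45, 880, 960], [300, 500, 420, 640]]
```

Any text around the bracketed groups is ignored, so the prompt can ask for a normal prose answer as long as the coordinates appear somewhere in it.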

To use this updated page:

  1. Replace "YOUR_API_KEY_HERE" with your actual Gemini API key.
  2. Save the content as an HTML file and open it in a web browser.
  3. Select an image, enter a prompt that you expect to return bounding box coordinates, and click the "Process" button.
  4. The API response will be displayed in the result div, and if coordinates are found in the response, the image will be shown below with bounding boxes drawn around the specified areas.

This implementation assumes that the Gemini API response includes coordinates in the format [y_min, x_min, y_max, x_max] within the text. If the API returns coordinates in a different format, you may need to adjust the extractCoordinates function accordingly.
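If you do need to adjust for a different coordinate convention, here is a minimal sketch of the plain mapping from a `[y_min, x_min, y_max, x_max]` box on Gemini's 0-1000 scale to pixel coordinates, without the flip-and-reorder transform the pasted tool applies (treat it as a starting point, not the canonical Gemini format):

```javascript
// Convert a [y_min, x_min, y_max, x_max] box on a 0-1000 scale into
// the { x, y, width, height } pixel values that strokeRect expects.
function boxToPixels(box, imageWidth, imageHeight) {
  const [ymin, xmin, ymax, xmax] = box;
  return {
    x: (xmin / 1000) * imageWidth,
    y: (ymin / 1000) * imageHeight,
    width: ((xmax - xmin) / 1000) * imageWidth,
    height: ((ymax - ymin) / 1000) * imageHeight,
  };
}

console.log(boxToPixels([100, 200, 600, 800], 1000, 500));
// { x: 200, y: 50, width: 600, height: 250 }
```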
