@simonw
Last active October 23, 2024 19:28

Extract Text from Image

human (Oct 23, 2024, 10:01 AM)

cat <<EOF > input.json
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "Extract text from this image"
        },
        {
          "inlineData": {
            "data": "$(base64 -i output_0.png)",
            "mimeType": "image/png"
          }
        }
      ]
    }
  ]
}
EOF

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key=$GOOGLE_API_KEY" -H 'Content-Type: application/json' -X POST -d -

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-8b-latest:generateContent?key=$GOOGLE_API_KEY" \
  -H 'Content-Type: application/json' \
  -X POST \
  -d @input.json | jq

Turn this into a Bash script that runs like this:

prompt-gemini "this is the prompt"
prompt-gemini "This is the prompt" blah.png
prompt-gemini "This is the prompt" blah.pdf
prompt-gemini "this is the prompt" -m pro

It should exit with an error if GOOGLE_API_KEY is not set

It should use a temporary file for input.json which is deleted on completion

If no file was provided it should skip the inlineData bit

It should use the correct mimeType for PNG or PDF or JPG or JPEG or GIF depending on the file extension

The -m option should follow the following rules: it defaults to 8b, or it can be:

8b => gemini-1.5-flash-8b-latest (the default)
flash-8b => gemini-1.5-flash-8b-latest
flash => gemini-1.5-flash-latest
pro => gemini-1.5-pro-latest

Any other value should be used directly in the gemini-1.5-flash:generateContent portion of the URL

assistant (Oct 23, 2024, 10:01 AM)

I'll help you create a Bash script that implements these requirements.

Gemini API Prompt Script

#!/bin/bash

# Check if GOOGLE_API_KEY is set
if [ -z "$GOOGLE_API_KEY" ]; then
    echo "Error: GOOGLE_API_KEY environment variable is not set" >&2
    exit 1
fi

# Default model
model="8b"
prompt=""
image_file=""

# Parse arguments
while [[ $# -gt 0 ]]; do
    case $1 in
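        -m)
            # Without this check, a trailing -m makes "shift 2" fail and the loop never ends
            if [ $# -lt 2 ]; then
                echo "Error: -m requires a model name" >&2
                exit 1
            fi
            model="$2"
            shift 2
            ;;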
        *)
            if [ -z "$prompt" ]; then
                prompt="$1"
            elif [ -z "$image_file" ]; then
                image_file="$1"
            fi
            shift
            ;;
    esac
done

# Validate prompt
if [ -z "$prompt" ]; then
    echo "Error: No prompt provided" >&2
    echo "Usage: prompt-gemini \"prompt\" [image_file] [-m model]" >&2
    exit 1
fi

# Map model names to full model strings
case $model in
    "8b"|"flash-8b")
        model_string="gemini-1.5-flash-8b-latest"
        ;;
    "flash")
        model_string="gemini-1.5-flash-latest"
        ;;
    "pro")
        model_string="gemini-1.5-pro-latest"
        ;;
    *)
        # Any other value is used directly as the model name in the URL
        model_string="$model"
        ;;
esac

# Create temporary file
temp_file=$(mktemp)
trap 'rm -f "$temp_file"' EXIT

# Determine mime type if image file is provided
if [ -n "$image_file" ]; then
    if [ ! -f "$image_file" ]; then
        echo "Error: File '$image_file' not found" >&2
        exit 1
    fi

    # Get file extension and convert to lowercase
    ext=$(echo "${image_file##*.}" | tr '[:upper:]' '[:lower:]')
    
    case $ext in
        png)
            mime_type="image/png"
            ;;
        jpg|jpeg)
            mime_type="image/jpeg"
            ;;
        gif)
            mime_type="image/gif"
            ;;
        pdf)
            mime_type="application/pdf"
            ;;
        *)
            echo "Error: Unsupported file type .$ext" >&2
            exit 1
            ;;
    esac

    # Create JSON with image data
    cat <<EOF > "$temp_file"
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": $(printf '%s' "$prompt" | jq -Rs .)
        },
        {
          "inlineData": {
            "data": "$(base64 < "$image_file" | tr -d '\n')",
            "mimeType": "$mime_type"
          }
        }
      ]
    }
  ]
}
EOF
else
    # Create JSON without image data
    cat <<EOF > "$temp_file"
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": $(printf '%s' "$prompt" | jq -Rs .)
        }
      ]
    }
  ]
}
EOF
fi

# Make API request
curl "https://generativelanguage.googleapis.com/v1beta/models/$model_string:generateContent?key=$GOOGLE_API_KEY" \
    -H 'Content-Type: application/json' \
    -X POST \
    -d @"$temp_file" | jq

I've created a script that meets all your requirements. Here's how to use it:

  1. Save it as prompt-gemini and make it executable with chmod +x prompt-gemini
  2. Set your GOOGLE_API_KEY environment variable
  3. Use it with any of the formats you specified:
prompt-gemini "Your prompt here"
prompt-gemini "Your prompt" image.png
prompt-gemini "Your prompt" doc.pdf
prompt-gemini "Your prompt" -m pro

The script includes:

  • Error checking for the API key
  • Automatic mime type detection based on file extension
  • Model name mapping as specified
  • Temporary file handling with proper cleanup
  • Support for all the required file types (PNG, PDF, JPG, JPEG, GIF)
  • Pretty-printed JSON output using jq

Would you like me to explain any part of the script or make any modifications?

simonw commented Oct 23, 2024

I manually edited this to use base64 -i image.png instead (needed on macOS).
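
A minimal sketch of one way to sidestep the platform difference, assuming both the GNU and macOS base64 tools read from standard input when given no file argument. The helper name encode_file is hypothetical and not part of the script above. Piping through tr also strips the line wrapping that GNU base64 applies by default, which would otherwise put raw newlines inside the JSON string.

```shell
#!/bin/bash

# Hypothetical helper (not in the original script): print the base64
# encoding of a file on a single line.
# Feeding the file on stdin avoids the -i flag difference between
# GNU and macOS base64; tr removes any line wrapping from the output.
encode_file() {
    base64 < "$1" | tr -d '\n'
}
```

Under these assumptions, the data line in the script's JSON template could call it as "$(encode_file "$image_file")".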

I made one modification in a separate session: https://gist.github.com/simonw/b1bffe54ebdf3583ec4e3639fb535567
