@simonw
Created October 25, 2024 16:10

Extracting SVG content from a file

human (Oct 25, 2024, 08:56 AM)

cat /tmp/tmp | rg -U '(<svg.*</svg>)'

What's wrong with this? I want to output just the captured regex bit

assistant (Oct 25, 2024, 08:56 AM)

Let me help you fix that regex command. The -U flag enables multiline search, but . still does not match newlines by default, so <svg.*</svg> cannot span lines. Here's the corrected version:

cat /tmp/tmp | rg -U '(?s)(<svg.*?</svg>)'

The key changes are:

  1. Added (?s) flag inline to enable dot-matches-newline mode
  2. Made the .* non-greedy with *? to prevent over-capturing when there might be multiple SVG tags

This will now correctly:

  • Match across multiple lines with (?s)
  • Capture everything between (and including) <svg and </svg>
  • Use non-greedy matching to handle potential multiple SVG tags in the file
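  • Print each line the match touches; parentheses alone do not limit the output to the captured group, so add -o (--only-matching) to print just the matched text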

You could also use the PCRE2 engine if you prefer, keeping -U so the match can still span lines:

cat /tmp/tmp | rg -UP '(?s)(<svg.*?</svg>)'

Both versions should match the SVG content you're looking for.
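To illustrate the flags discussed above, here is a minimal sketch that combines -U, (?s), and -o so that only the matched text is printed. The sample text is made up for the example, and the guard around rg is only there so the snippet degrades gracefully when ripgrep is missing.

```shell
# Hypothetical sample: an SVG surrounded by other text.
sample=$(printf 'Here you go:\n<svg width="10">\n<circle r="1"/>\n</svg>\nHope that helps')

result=""
if command -v rg >/dev/null 2>&1; then
    # -U: multiline search; (?s): . matches newlines; -o: print only the match
    result=$(printf '%s\n' "$sample" | rg -U -o '(?s)<svg.*?</svg>')
else
    echo "ripgrep (rg) is not installed" >&2
fi
printf '%s\n' "$result"
```

Without -o, ripgrep would also print any text that shares a line with the opening or closing tag.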

human (Oct 25, 2024, 08:57 AM)

still no output, and the input has <svg and stuff in it

assistant (Oct 25, 2024, 08:57 AM)

Let me help troubleshoot. Since you mentioned the input definitely has SVG content, let's break this down:

  1. First, let's verify the content exists:
cat /tmp/tmp | grep svg
  2. The ripgrep regex might be too strict. Let's try a simpler version:
cat /tmp/tmp | rg -U '<svg[\s\S]*?</svg>'

If that still doesn't work, could you share a small sample of what's in the file? That way I can help craft the exact regex needed. Sometimes SVG content can have attributes or namespaces that need to be accounted for in the regex.

Also, to rule out line-number prefixes or color escape codes interfering with the output, you could try:

cat /tmp/tmp | rg --no-line-number --color never -U '<svg[\s\S]*?</svg>'

human (Oct 25, 2024, 08:57 AM)

% cat /tmp/tmp | grep svg
```svg
<svg width="400" height="300" viewBox="0 0 400 300" xmlns="http://www.w3.org/2000/svg">
</svg>

assistant (Oct 25, 2024, 08:57 AM)

Ah, I see the issue - there are backticks (```) surrounding the SVG content. Let's modify the regex to handle that:

cat /tmp/tmp | rg -U -o '(?s)```svg\n(.*?)</svg>' -r '$1</svg>'

Or if you want an even more precise match:

cat /tmp/tmp | rg -U -o '(?s)```svg\n(<svg.*?</svg>)\n```' -r '$1'

This will:

  1. Match the opening ```svg
  2. Capture everything between that and the closing </svg> tag, with (?s) letting . span newlines
  3. Use -o together with the -r flag to output only the captured content
  4. Remove the backticks and ```svg marker

This should give you just the clean SVG content without the markdown code fence markers.
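As a rough check of the fence-stripping idea, the following sketch feeds a made-up fenced response through rg and keeps only the first capture group. The sample text is an assumption for illustration, not the real contents of /tmp/tmp.

```shell
# Hypothetical sample: a response that wraps the SVG in a markdown code fence.
sample=$(printf 'Sure:\n```svg\n<svg width="400" height="300">\n</svg>\n```\nDone')

result=""
if command -v rg >/dev/null 2>&1; then
    # -o limits output to the match; -r '$1' replaces the match with group 1,
    # which drops the fence lines and keeps only the SVG itself.
    result=$(printf '%s\n' "$sample" | rg -U -o '(?s)```svg\n(<svg.*?</svg>)\n```' -r '$1')
else
    echo "ripgrep (rg) is not installed" >&2
fi
printf '%s\n' "$result"
```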

human (Oct 25, 2024, 08:58 AM)

# Loop through each model
for model in "${models[@]}"; do
    output_file="$model.svg"
    
    # Check if file already exists
    if [ ! -f "$output_file" ]; then
        echo "Generating SVG using model: $model"
        llm -m $model 'Generate an SVG of a pelican riding a bicycle' | | rg -U '<svg[\s\S]*?</svg>' > "$output_file"
        echo "Created $output_file"
    else
        echo "Skipping $model - file already exists"
    fi
done

Modify this so that it either writes out the .svg file if the regex matches, or writes out a file called $model.failure.YYYYMMDDHHMM.txt if the regex fails to match

assistant (Oct 25, 2024, 08:58 AM)

Here's the modified script that handles both successful SVG generation and failures:

# Loop through each model
for model in "${models[@]}"; do
    output_file="$model.svg"
    timestamp=$(date '+%Y%m%d%H%M')
    failure_file="$model.failure.$timestamp.txt"
    
    # Check if file already exists
    if [ ! -f "$output_file" ]; then
        echo "Generating SVG using model: $model"
        # Capture the full output
        output=$(llm -m "$model" 'Generate an SVG of a pelican riding a bicycle')

        # Try to extract SVG (-o keeps only the matched text)
        if echo "$output" | rg -U -o '<svg[\s\S]*?</svg>' > "$output_file" && [ -s "$output_file" ]; then
            echo "Created $output_file"
        else
            # If regex didn't match or file is empty, save full output to failure file
            echo "$output" > "$failure_file"
            echo "No SVG found - saved output to $failure_file"
            # Clean up empty svg file if it was created
            rm -f "$output_file"
        fi
    else
        echo "Skipping $model - file already exists"
    fi
done

Key changes:

  1. Added timestamp generation using date '+%Y%m%d%H%M'
  2. Capture full LLM output to a variable first
  3. Check both if the regex matches (rg succeeds) AND if the resulting file has content (-s test)
  4. If either check fails, save the complete output to a timestamped failure file
  5. Clean up any empty SVG file that might have been created
  6. Removed the stray duplicate pipe (| |), which was a syntax error in the original

This will now:

  • Create modelname.svg if valid SVG is found
  • Create modelname.failure.202410251423.txt (for example) if no SVG is found or if extraction fails
  • Preserve the full output in failure cases for debugging
  • Still skip existing SVG files
  • Clean up any empty files that might be created during failed extractions
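One possible variation on the branching above, sketched under assumptions: choose the destination before writing anything, so no empty .svg file has to be cleaned up afterwards. The save_result helper, its arguments, and the model-a/model-b names are hypothetical and only serve the example.

```shell
# Hypothetical helper: write the extracted SVG to "$name.svg", or write the
# raw output to a timestamped failure file when nothing was extracted.
# Prints the path of the file it created.
save_result() {
    name=$1
    extracted=$2
    raw=$3
    if [ -n "$extracted" ]; then
        printf '%s\n' "$extracted" > "$name.svg"
        echo "$name.svg"
    else
        failure_file="$name.failure.$(date '+%Y%m%d%H%M').txt"
        printf '%s\n' "$raw" > "$failure_file"
        echo "$failure_file"
    fi
}

# Try it in a scratch directory with one success and one failure.
workdir=$(mktemp -d)
ok_path=$(save_result "$workdir/model-a" '<svg></svg>' 'Here: <svg></svg>')
bad_path=$(save_result "$workdir/model-b" '' 'Sorry, I cannot draw that.')
echo "$ok_path"
echo "$bad_path"
```

In the loop, the second argument would be the captured output of the rg extraction and the third the full model output.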

human (Oct 25, 2024, 09:00 AM)

It should also check the exit code of that llm command and, if it did not succeed, write its standard error output to that failure file and not attempt the svg extraction

assistant (Oct 25, 2024, 09:00 AM)

Here's the modified script that handles LLM command failures properly:

# Loop through each model
for model in "${models[@]}"; do
    output_file="$model.svg"
    timestamp=$(date '+%Y%m%d%H%M')
    failure_file="$model.failure.$timestamp.txt"
    
    # Check if file already exists
    if [ ! -f "$output_file" ]; then
        echo "Generating SVG using model: $model"
        # Capture both stdout and stderr, and the exit status
        if output=$(llm -m "$model" 'Generate an SVG of a pelican riding a bicycle' 2>&1); then
            # LLM succeeded, try to extract SVG (-o keeps only the matched text)
            if echo "$output" | rg -U -o '<svg[\s\S]*?</svg>' > "$output_file" && [ -s "$output_file" ]; then
                echo "Created $output_file"
            else
                # SVG extraction failed
                echo "$output" > "$failure_file"
                echo "No SVG found - saved output to $failure_file"
                rm -f "$output_file"
            fi
        else
            # LLM command failed
            echo "LLM command failed - saved error to $failure_file"
            echo "$output" > "$failure_file"
        fi
    else
        echo "Skipping $model - file already exists"
    fi
done

Key changes:

  1. Merged the llm command's stderr into the captured output with 2>&1
  2. Wrapped llm command in an if statement to check its exit status
  3. Only attempts SVG extraction if the llm command succeeded
  4. Saves both stdout and stderr to the failure file when the llm command fails
  5. Different error messages to distinguish between LLM failures and SVG extraction failures

This will now:

  • Handle LLM command failures properly (like API errors, timeouts, etc.)
  • Save error output to the failure file in case of LLM command failures
  • Only attempt SVG extraction if the LLM command succeeded
  • Still handle SVG extraction failures as before
  • Keep the timestamp-based failure files for both types of failures
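If only standard error should end up in the failure file, one option is to send stderr to a temporary file instead of merging it with 2>&1. The sketch below assumes a stand-in function, fake_llm, in place of the real llm command so that it can run anywhere; the messages it prints are invented.

```shell
# Hypothetical stand-in for the llm command: writes to stdout and stderr,
# then exits with the status passed as its first argument.
fake_llm() {
    echo "partial response"
    echo "API error: rate limited" >&2
    return "$1"
}

# Keep stdout in $output and stderr in a temp file, then branch on the exit status.
err_file=$(mktemp)
failure_text=""
if output=$(fake_llm 1 2>"$err_file"); then
    status="ok"
else
    status="failed"
    failure_text=$(cat "$err_file")   # stderr only, not the mixed stream
fi
rm -f "$err_file"
echo "$status: $failure_text"
```

The trade-off is one extra temporary file per call in exchange for a failure file that contains only the error message.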