@hexcowboy
Last active May 13, 2023 00:31
NFT Image Scraper

Started by asking it to write Python code with an example from the Reservoir docs

can you turn this into a python3 request

curl --request GET \
     --url 'https://api.reservoir.tools/tokens/v6?collection=0x3Fe1a4c1481c8351E91B64D5c398b159dE07cbc5&sortBy=tokenId&sortDirection=asc&limit=10&includeAttributes=true' \
     --header 'accept: */*' \
     --header 'x-api-key: demo-api-key'
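
The requests version it produced was, give or take variable names, something like this (a reconstruction, not the verbatim output):

import requests

url = "https://api.reservoir.tools/tokens/v6"
params = {
    "collection": "0x3Fe1a4c1481c8351E91B64D5c398b159dE07cbc5",
    "sortBy": "tokenId",
    "sortDirection": "asc",
    "limit": 10,
    "includeAttributes": "true",
}
headers = {"accept": "*/*", "x-api-key": "demo-api-key"}

response = requests.get(url, params=params, headers=headers)
print(response.json())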

Asked it to iterate over the 20k NFT collection

can you make it iterate 100 at a time, from 0-20,000
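
Its first attempt paged with an offset parameter, roughly like this (again reconstructed; as the next prompt explains, the API actually paginates differently):

import requests

url = "https://api.reservoir.tools/tokens/v6"
headers = {"accept": "*/*", "x-api-key": "demo-api-key"}

# Walk the collection 100 tokens at a time, 0 through 20,000.
for offset in range(0, 20000, 100):
    params = {
        "collection": "0x3Fe1a4c1481c8351E91B64D5c398b159dE07cbc5",
        "sortBy": "tokenId",
        "sortDirection": "asc",
        "limit": 100,
        "offset": offset,  # replaced by a continuation key below
    }
    data = requests.get(url, params=params, headers=headers).json()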

Told it a bit about how the response was formatted and how it could interface with the API

so actually instead of providing "offset", each response will provide a continuation key in the form of

{
    ...rest of response
    "continuation": "abcdef123"
}

which you must use in the subsequent request params as "continuation": "abcdef123"
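
That turns the loop into continuation-based pagination, which in the final script boils down to this (shown here as a synchronous sketch for clarity):

import requests

url = "https://api.reservoir.tools/tokens/v6"
headers = {"accept": "*/*", "x-api-key": "demo-api-key"}
continuation = None

while True:
    params = {
        "collection": "0x3Fe1a4c1481c8351E91B64D5c398b159dE07cbc5",
        "sortBy": "tokenId",
        "sortDirection": "asc",
        "limit": 100,
    }
    if continuation:
        params["continuation"] = continuation
    data = requests.get(url, params=params, headers=headers).json()
    # Each response hands back the cursor for the next page; an empty
    # value means we've reached the end of the collection.
    continuation = data.get("continuation")
    if not continuation:
        break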

Now that it had a loop set up properly, I told it about the business logic I wanted it to include

great. lets exclude the "includeAttributes" param. the response will have the following shape

{
    "tokens": [
        {
            "token": {
                "tokenId": "0",
                "image": "https://i.seadn.io/gcs/files/9acb975358b9caf654f9103f309c1b3e.png?w=500&auto=format"
            },
        },
    ],
    "continuation": "MHhkNjY4YTJlMDAxZjMzODViOGJiYzVhODY4MmFjM2MwZDgzYzE5MTIyXzk5",
}

there will be many `tokens`. you will need to go through each token on each request to get the `token.image`, download the image, and output it to a folder called "./images/". name the image by the `token.tokenId`, so the resulting file should be at `./images/0.webp`

some of the `token.images` will be gif and some will be .avif, so you will need to first determine which one it is, and then convert it to a webp image when you save the file.

also, you should resize the images so they are 100x100 pixels.
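
The image-handling part of the prompt comes down to a few lines of Pillow. One caveat that's mine, not ChatGPT's: Pillow has no built-in AVIF decoder, so opening the .avif files needs the pillow-avif-plugin package installed (the import below registers the codec). The helper name here is illustrative:

import io

from PIL import Image
import pillow_avif  # noqa: F401 -- third-party plugin that adds AVIF support to Pillow


def save_webp_thumbnail(image_data: bytes, token_id: str) -> None:
    # Pillow sniffs the real format (GIF, AVIF, PNG, ...) from the bytes,
    # so no explicit detection step is needed.
    image = Image.open(io.BytesIO(image_data))
    image = image.convert("RGB")  # flatten palette/alpha before WebP encoding
    image = image.resize((100, 100), Image.LANCZOS)
    image.save(f"./images/{token_id}.webp", format="webp")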

I noticed a bug, so I asked it to fix it

i found a bug. to fix it, you have to check if `token.image` is None, and if it is, use `token.collection.image` instead
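
In the generated script that fix is a simple fallback (this is taken straight from the final code below):

image_url = token.get("token", {}).get("image")
if image_url is None:
    # Fall back to the collection-level image when the token has none.
    image_url = token.get("token", {}).get("collection", {}).get("image")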

It was running very slowly, so I asked it to run the downloads in parallel. This is usually a relatively big engineering task.

this works great. can you please make it async so it can download images in parallel?
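
The pattern it reached for is the standard asyncio one: wrap each download in a task and gather them. A minimal self-contained sketch of the idea (not the final script, which follows below):

import asyncio

import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str) -> bytes:
    async with session.get(url) as response:
        return await response.read()


async def fetch_all(urls: list[str]) -> list[bytes]:
    async with aiohttp.ClientSession() as session:
        # create_task schedules every download at once; gather awaits them
        # together, so the requests overlap instead of running serially.
        tasks = [asyncio.create_task(fetch(session, u)) for u in urls]
        return await asyncio.gather(*tasks)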

Once it did this, I just had it fix a few smaller bugs and it was set to go. In total, I spent 20 minutes prompting ChatGPT before it had completely built this scraper. I probably would've spent 1-2 hours writing it myself, and a junior dev would probably have spent a day or more.

import asyncio
import io
import os
import sys

import aiohttp
import PIL
from PIL import Image

url = "https://api.reservoir.tools/tokens/v6"
headers = {"accept": "*/*", "x-api-key": "demo-api-key"}
continuation_key = None
image_folder = "./images/"

# Create the image folder if it doesn't exist
if not os.path.exists(image_folder):
    os.makedirs(image_folder)


async def download_image(session, token_id, image_url):
    try:
        async with session.get(image_url) as response:
            if response.status == 200:
                image_data = await response.read()
                try:
                    # Open the image using PIL
                    image = Image.open(io.BytesIO(image_data))
                    # Convert the image to WebP format
                    image = image.convert("RGB")
                    image_format = "webp"
                    # Resize the image to 100x100 pixels
                    image = image.resize((100, 100), Image.LANCZOS)
                    # Save the image to the specified folder
                    image_path = os.path.join(
                        image_folder, f"{token_id}.{image_format}"
                    )
                    image.save(image_path, format=image_format)
                    print(f"Saved image {token_id}.{image_format}")
                except PIL.UnidentifiedImageError:
                    print(f"Error: Failed to identify image file for token {token_id}")
            else:
                print(
                    f"Failed to download image for token {token_id}. Status: {response.status}"
                )
    except aiohttp.ClientError as e:
        print(f"Error downloading image for token {token_id}: {e}")


async def process_tokens(tokens):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for token in tokens:
            token_id = token.get("token", {}).get("tokenId")
            image_url = token.get("token", {}).get("image")
            if image_url is None:
                image_url = token.get("token", {}).get("collection", {}).get("image")
            if image_url is None:
                print(f"Token {token_id} has no image URL. Skipping...")
                continue
            task = asyncio.create_task(download_image(session, token_id, image_url))
            tasks.append(task)
        await asyncio.gather(*tasks)


async def main():
    global continuation_key  # Declare continuation_key as a global variable

    # Retrieve the filenames from the image folder
    processed_tokens = set()
    for filename in os.listdir(image_folder):
        token_id = os.path.splitext(filename)[0]
        processed_tokens.add(token_id)

    while True:
        params = {
            "collection": "0xd668A2E001f3385B8BBC5a8682AC3C0D83C19122",
            "sortBy": "tokenId",
            "sortDirection": "asc",
            "limit": 100,
        }
        if continuation_key:
            params["continuation"] = continuation_key

        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(url, params=params, headers=headers) as response:
                    data = await response.json()
                    # Extract the continuation key from the response
                    continuation_key = data.get("continuation")
                    tokens = data.get("tokens", [])
                    # Skip tokens that have already been processed
                    tokens = [
                        token
                        for token in tokens
                        if token["token"]["tokenId"] not in processed_tokens
                    ]
                    await process_tokens(tokens)
                    # Update the set of processed tokens
                    processed_tokens.update(
                        token["token"]["tokenId"] for token in tokens
                    )
                    # Break the loop if no more continuation key is provided
                    if not continuation_key:
                        break
            except aiohttp.ClientError as e:
                print("Error:", e)
                sys.exit(1)


if __name__ == "__main__":
    asyncio.run(main())