@pansapiens
Last active October 6, 2023 23:16
Convert HTML to a self-contained file with inline Base64-encoded PNG images
#!/usr/bin/env python
# A simple script to suck up HTML, convert any images to inline Base64
# encoded format and write out the converted file.
#
# Usage: python standalone_html.py <input_file.html> <output_file.html>
#
# TODO: Consider MHTML format: https://en.wikipedia.org/wiki/MHTML
import os
from bs4 import BeautifulSoup


def guess_type(filepath):
    """
    Return the mimetype of a file, given its path.

    This is a wrapper around two alternative methods - Unix 'file'-style
    magic which guesses the type based on file content (if available),
    and simple guessing based on the file extension (eg .jpg).

    :param filepath: Path to the file.
    :type filepath: str
    :return: Mimetype string.
    :rtype: str
    """
    try:
        import magic  # python-magic
        return magic.from_file(filepath, mime=True)
    except ImportError:
        import mimetypes
        return mimetypes.guess_type(filepath)[0]


def file_to_base64(filepath):
    """
    Returns the content of a file as a Base64 encoded string.

    :param filepath: Path to the file.
    :type filepath: str
    :return: The file content, Base64 encoded.
    :rtype: str
    """
    import base64
    with open(filepath, 'rb') as f:
        encoded_str = base64.b64encode(f.read())
    return encoded_str.decode('utf-8')


def make_html_images_inline(in_filepath, out_filepath):
    """
    Takes an HTML file and writes a new version with inline Base64 encoded
    images.

    :param in_filepath: Input file path (HTML)
    :type in_filepath: str
    :param out_filepath: Output file path (HTML)
    :type out_filepath: str
    """
    # Image paths in the HTML are resolved relative to the input file.
    basepath = os.path.split(in_filepath.rstrip(os.path.sep))[0]
    with open(in_filepath, 'r') as f:
        soup = BeautifulSoup(f, 'html.parser')

    for img in soup.find_all('img'):
        img_path = os.path.join(basepath, img.attrs['src'])
        mimetype = guess_type(img_path)
        img.attrs['src'] = \
            "data:%s;base64,%s" % (mimetype, file_to_base64(img_path))

    with open(out_filepath, 'w') as of:
        of.write(str(soup))


if __name__ == '__main__':
    import sys
    make_html_images_inline(sys.argv[1], sys.argv[2])
@pansapiens
Author

Could probably optionally use htmlmin here: https://htmlmin.readthedocs.io/en/latest/

@carbontracking

That's brilliant, exactly the script I was looking for.
One thing though, I tested it on Windows so YMMV

The function file_to_base64 returns a bytes object, so when I looked at the embedded images, the data was wrapped in b'........', which meant they didn't display.

To make it work I changed
return encoded_str
to
return encoded_str.decode("utf-8")

and then it worked perfectly!

Thanks again.

@pansapiens
Author

Glad it was useful - this was originally written for Python 2.x, hence the string decoding issue. I've integrated your change so it should be Python 3.x compatible now.
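
The Python 2 versus Python 3 difference discussed here can be illustrated with a short sketch. In Python 3, `base64.b64encode` returns `bytes`, so formatting the result straight into a string embeds the `b'...'` wrapper, while decoding it first yields plain text. The sample bytes below are a placeholder, not real image data.

```python
import base64

raw = b"hi"  # placeholder bytes standing in for image data (assumption)

encoded = base64.b64encode(raw)  # a bytes object in Python 3

# Formatting bytes with %s inserts their repr, including the b'...' wrapper.
broken = "data:image/png;base64,%s" % encoded

# Decoding to str first gives a usable data URI.
fixed = "data:image/png;base64,%s" % encoded.decode("utf-8")

print(broken)  # data:image/png;base64,b'aGk='
print(fixed)   # data:image/png;base64,aGk=
```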

@almeyras

almeyras commented Jul 7, 2021

Yes! This is a much needed script for batch exporting files from Ulysses to Moodle (both pages and books). But I get an error (macOS):

python3 standalone_html.py Desktop/index.html
Traceback (most recent call last):
File "standalone_html.py", line 67, in <module>
make_html_images_inline(sys.argv[1], sys.argv[2])

@pansapiens
Author

@almeyras You need a second argument, the output file path. eg

python3 standalone_html.py Desktop/index.html Desktop/index_standalone.html
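
A missing output path currently surfaces as a bare traceback. One way to get a readable usage message instead is `argparse` from the standard library. This is only a sketch: the function name `parse_args` and the argument names `input_file` and `output_file` are assumptions, not part of the gist.

```python
import argparse


def parse_args(argv=None):
    """Parse the two required positional arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Inline images in an HTML file as Base64 data URIs.")
    parser.add_argument("input_file", help="path to the source HTML file")
    parser.add_argument("output_file", help="path for the standalone HTML file")
    # With argv=None, argparse reads sys.argv[1:]; if an argument is missing
    # it prints a usage message and exits instead of raising IndexError.
    return parser.parse_args(argv)


# Example with an explicit argument list:
args = parse_args(["Desktop/index.html", "Desktop/index_standalone.html"])
print(args.input_file, args.output_file)
```

In the script itself, the result could be passed on as `make_html_images_inline(args.input_file, args.output_file)`.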

@mzhukovs

Nice. Here's Python 3 code for replacing HTML images specified by URL with inline Base64 data:

import base64
import mimetypes
import requests
from bs4 import BeautifulSoup

def make_html_images_inline(html: str) -> str:
    soup = BeautifulSoup(html, 'html.parser')

    for img in soup.find_all('img'):
        img_src = img.attrs['src']

        if not img_src.startswith('http'):
            continue

        mimetype = mimetypes.guess_type(img_src)[0]
        img_b64 =  base64.b64encode(requests.get(img_src).content)

        img.attrs['src'] = \
            "data:%s;base64,%s" % (mimetype, img_b64.decode('utf-8'))

    return str(soup)
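
One caveat with guessing the type from the URL: an image URL with no file extension gives `None`, which would end up in the data URI as the literal text `None`. A possible refinement, sketched below, prefers the `Content-Type` header from the HTTP response when one is available and falls back to the extension of the URL path. The helper name `pick_mimetype` is hypothetical; with `requests`, the header value would come from `response.headers.get("Content-Type")`.

```python
import mimetypes
from urllib.parse import urlparse


def pick_mimetype(url, content_type=None):
    """Choose a MIME type for an image URL (hypothetical helper).

    Prefers the server-supplied Content-Type header, stripped of any
    parameters such as "; charset=...". Falls back to guessing from the
    file extension in the URL path, ignoring the query string.
    """
    if content_type:
        return content_type.split(";")[0].strip()
    return mimetypes.guess_type(urlparse(url).path)[0]


print(pick_mimetype("https://example.com/logo.png?size=large"))
print(pick_mimetype("https://example.com/image/42", "image/jpeg; charset=binary"))
```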

@almeyras

almeyras commented Mar 23, 2022

@almeyras You need a second argument, the output file path. eg

python3 standalone_html.py Desktop/index.html Desktop/index_standalone.html

Thank you so much, that worked. This script is bliss. I will be using it for making direct exports to Moodle without all the fuss involved in uploading images.

Now I want to report an error when the script runs into image names with blank spaces:

Traceback (most recent call last):
  File "/Users/daniel/Desktop/standalone_html.py", line 67, in <module>
    make_html_images_inline(sys.argv[1], sys.argv[2])
  File "/Users/daniel/Desktop/standalone_html.py", line 60, in make_html_images_inline
    "data:%s;base64,%s" % (mimetype, file_to_base64(img_path))
  File "/Users/daniel/Desktop/standalone_html.py", line 40, in file_to_base64
    with open(filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'Captura%20de%20pantalla%202022-03-14%20a%20las%2022.14.46.jpg'

Thank you yet again!
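
The traceback shows the `src` attribute still percent-encoded (`%20` for each space) when it is used as a filesystem path. A minimal sketch of a fix, assuming the images are local files referenced by relative paths, is to decode the value with `urllib.parse.unquote` before joining it to the base path. The helper name `resolve_img_path` is hypothetical.

```python
import os
from urllib.parse import unquote


def resolve_img_path(basepath, src):
    """Turn an <img> src attribute into a local file path (hypothetical helper).

    Percent-escapes such as %20 are decoded so that the result matches
    the real file name on disk.
    """
    return os.path.join(basepath, unquote(src))


print(resolve_img_path("Desktop", "Captura%20de%20pantalla.jpg"))
```

In `make_html_images_inline`, this would stand in for the `os.path.join(basepath, img.attrs['src'])` call.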

@dmptechcode

dmptechcode commented Jan 30, 2023

Hello. This is useful to no end, and I love it. I have a problem though, which is probably related to my locale (Italy).
It seems to have a problem with accented characters, apostrophes and the like. I just lose them when converting.

For instance,

L&rsquo;Hotel

becomes

L�Hotel

and, as you can see, the character displays as a question mark in the browser.

The file seems to be actually ANSI encoded. If I open it with Notepad++, switch to ANSI (no conversion) and then convert it to UTF-8 it becomes completely readable again, but I lose all the HTML codes, so

L&rsquo;Hotel

becomes

L'Hotel

in the code.
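
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the parser turns entities such as `&rsquo;` into the literal character, and `open()` without an `encoding` argument uses the platform default (often a legacy code page on Windows), so the bytes written may not match the charset the browser expects. A minimal workaround sketch is to read the input with an explicit encoding and write the output as ASCII with `errors="xmlcharrefreplace"`, which turns every non-ASCII character into a numeric character reference. The function names `read_html` and `write_html_ascii_safe` are hypothetical.

```python
def read_html(in_filepath, encoding="utf-8"):
    """Read HTML text with an explicit encoding (hypothetical helper).

    Assumption: the caller knows the source encoding, e.g. "cp1252" for
    a file that an editor reports as ANSI on Western European Windows.
    """
    with open(in_filepath, "r", encoding=encoding) as f:
        return f.read()


def write_html_ascii_safe(html_text, out_filepath):
    """Write HTML so that it displays the same under any declared charset.

    Non-ASCII characters are replaced by numeric character references,
    for example U+2019 (right single quote) becomes &#8217;.
    """
    with open(out_filepath, "w", encoding="ascii",
              errors="xmlcharrefreplace") as f:
        f.write(html_text)
```

In the script, `str(soup)` would be passed to `write_html_ascii_safe`, so `L’Hotel` would be written as `L&#8217;Hotel`.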

@pansapiens
Author

If anyone is looking for a more comprehensive solution to do this, I really like the SingleFile browser extension: https://github.com/gildas-lormeau/SingleFile (there is also a commandline version https://github.com/gildas-lormeau/single-file-cli).
