#!/usr/bin/env python
# A simple script to suck up HTML, convert any images to inline Base64
# encoded format and write out the converted file.
#
# Usage: python standalone_html.py <input_file.html> <output_file.html>
#
# TODO: Consider MHTML format: https://en.wikipedia.org/wiki/MHTML

import os
from bs4 import BeautifulSoup


def guess_type(filepath):
    """
    Return the mimetype of a file, given its path.

    This is a wrapper around two alternative methods - Unix 'file'-style
    magic which guesses the type based on file content (if available),
    and simple guessing based on the file extension (eg .jpg).

    :param filepath: Path to the file.
    :type filepath: str
    :return: Mimetype string.
    :rtype: str
    """
    try:
        import magic  # python-magic
        return magic.from_file(filepath, mime=True)
    except ImportError:
        import mimetypes
        return mimetypes.guess_type(filepath)[0]


def file_to_base64(filepath):
    """
    Returns the content of a file as a Base64 encoded string.

    :param filepath: Path to the file.
    :type filepath: str
    :return: The file content, Base64 encoded.
    :rtype: str
    """
    import base64
    with open(filepath, 'rb') as f:
        encoded_str = base64.b64encode(f.read())
    return encoded_str.decode('utf-8')


def make_html_images_inline(in_filepath, out_filepath):
    """
    Takes an HTML file and writes a new version with inline Base64 encoded
    images.

    :param in_filepath: Input file path (HTML)
    :type in_filepath: str
    :param out_filepath: Output file path (HTML)
    :type out_filepath: str
    """
    basepath = os.path.split(in_filepath.rstrip(os.path.sep))[0]
    # Use a context manager so the input file handle is closed after parsing.
    with open(in_filepath, 'r') as f:
        soup = BeautifulSoup(f, 'html.parser')

    for img in soup.find_all('img'):
        # Image paths are resolved relative to the input HTML file.
        img_path = os.path.join(basepath, img.attrs['src'])
        mimetype = guess_type(img_path)
        img.attrs['src'] = \
            "data:%s;base64,%s" % (mimetype, file_to_base64(img_path))

    with open(out_filepath, 'w') as of:
        of.write(str(soup))


if __name__ == '__main__':
    import sys
    make_html_images_inline(sys.argv[1], sys.argv[2])

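One edge case worth noting: `mimetypes.guess_type` returns `None` for the type when it does not recognise the extension, so the `%s` formatting above would produce a `data:None;base64,...` URI. Below is a minimal sketch of a helper that falls back to a generic type. The name `build_data_uri` and the choice of `application/octet-stream` as the fallback are assumptions for illustration, not part of the original script.

```python
import base64


def build_data_uri(data, mimetype=None):
    """Build a data: URI from raw bytes (hypothetical helper).

    Falls back to a generic mimetype when none could be guessed;
    the fallback value is an assumption, pick whatever suits you.
    """
    if not mimetype:
        mimetype = "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return "data:%s;base64,%s" % (mimetype, encoded)
```

A browser may still refuse to render an image served with a generic type, so logging a warning for unrecognised files could be more useful than silently embedding them.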
That's brilliant, exactly the script I was looking for.
One thing, though: I tested it on Windows, so YMMV.
The function file_to_base64 returns a bytes object, so when I looked at the embedded images they were all wrapped in b'........',
which meant they didn't display.
To make it work I changed
return encoded_str
to
return encoded_str.decode("utf-8")
and then it worked perfectly!
Thanks again.
Glad it was useful - this was originally written for Python 2.x, hence the string decoding issue. I've integrated your change so it should be Python 3.x compatible now.
Yes! This is a much needed script for batch exporting files from Ulysses to Moodle (both pages and books). But I get an error (macOS):
python3 standalone_html.py Desktop/index.html
Traceback (most recent call last):
File "standalone_html.py", line 67, in <module>
make_html_images_inline(sys.argv[1], sys.argv[2])
@almeyras You need a second argument, the output file path, e.g.:
python3 standalone_html.py Desktop/index.html Desktop/index_standalone.html
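To get a usage message instead of a bare traceback when an argument is missing, the `sys.argv` indexing could be swapped for `argparse`. This is only a sketch: the function name `parse_args` and the argument names `input_file` and `output_file` are made up here, and the original script does not do this.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional paths (hypothetical replacement for sys.argv)."""
    parser = argparse.ArgumentParser(
        description="Inline images in an HTML file as Base64 data URIs.")
    parser.add_argument("input_file", help="path to the input HTML file")
    parser.add_argument("output_file", help="path to write the standalone HTML to")
    # argv=None makes argparse read sys.argv[1:]; a list can be passed for testing.
    return parser.parse_args(argv)
```

In the `__main__` block you would then call `args = parse_args()` and pass `args.input_file` and `args.output_file` to `make_html_images_inline`. With a missing argument, argparse prints the usage line and exits.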
Nice. Here's Python 3 code for replacing HTML images specified by URL with inline Base64:
import base64
import mimetypes

import requests
from bs4 import BeautifulSoup


def make_html_images_inline(html: str) -> str:
    soup = BeautifulSoup(html, 'html.parser')

    for img in soup.find_all('img'):
        img_src = img.attrs['src']

        # Only remote images are converted; local paths are left untouched.
        if not img_src.startswith('http'):
            continue

        mimetype = mimetypes.guess_type(img_src)[0]
        img_b64 = base64.b64encode(requests.get(img_src).content)

        img.attrs['src'] = \
            "data:%s;base64,%s" % (mimetype, img_b64.decode('utf-8'))

    return str(soup)
Thank you so much, that worked. This script is bliss. I will be using it for making direct exports to Moodle without all the fuss involved in uploading images.
Now I want to report an error that occurs when the script runs into image file names containing spaces:
Traceback (most recent call last):
File "/Users/daniel/Desktop/standalone_html.py", line 67, in <module>
make_html_images_inline(sys.argv[1], sys.argv[2])
File "/Users/daniel/Desktop/standalone_html.py", line 60, in make_html_images_inline
"data:%s;base64,%s" % (mimetype, file_to_base64(img_path))
File "/Users/daniel/Desktop/standalone_html.py", line 40, in file_to_base64
with open(filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'Captura%20de%20pantalla%202022-03-14%20a%20las%2022.14.46.jpg'
Thank you yet again!
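The `%20` sequences in that traceback suggest the `src` attribute is percent-encoded, while the file on disk has literal spaces. A possible workaround, sketched below, is to decode the value with `urllib.parse.unquote` before joining it to the base path. The helper name `resolve_img_path` is hypothetical, and this assumes `src` is a relative URL pointing at a local file.

```python
import os
from urllib.parse import unquote, urlsplit


def resolve_img_path(basepath, src):
    """Map an <img> src value to a local file path (hypothetical helper)."""
    # Keep only the path component (drops any ?query or #fragment),
    # then turn %XX escapes such as %20 back into literal characters.
    local_part = unquote(urlsplit(src).path)
    return os.path.join(basepath, local_part)
```

In `make_html_images_inline` this would stand in for the `os.path.join(basepath, img.attrs['src'])` line. It has not been tested against absolute Windows paths, where the drive letter could be mistaken for a URL scheme.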
Hello. This is useful to no end, and I love it. I have a problem though, which is probably related to my locale (Italy).
It seems to have a problem with accented characters, apostrophes and the like. I just lose them when converting.
For instance,
L’Hotel
becomes
L�Hotel
and, as you can see, the character ends up displaying as a question mark in the browser.
The file seems to be actually ANSI encoded. If I open it with Notepad++, switch to ANSI (no conversion) and then convert it to UTF-8 it becomes completely readable again, but I lose all the HTML codes, so
L’Hotel
becomes
L'Hotel
in the code.
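One possible explanation, offered as a guess rather than a diagnosis: the script calls `open(out_filepath, 'w')` without an `encoding` argument, so Python uses the platform default. On Windows that is often a legacy code page, and a browser expecting UTF-8 would then show the replacement character. The sketch below forces UTF-8 on output; the helper name `write_html_utf8` is made up for illustration.

```python
import os
import tempfile


def write_html_utf8(out_filepath, html):
    """Write HTML text as UTF-8 regardless of the platform default (hypothetical helper)."""
    with open(out_filepath, "w", encoding="utf-8") as f:
        f.write(html)


# Small round trip: the right single quote (U+2019) should be stored
# as the three UTF-8 bytes E2 80 99.
tmp_path = os.path.join(tempfile.mkdtemp(), "out.html")
write_html_utf8(tmp_path, "L\u2019Hotel")
with open(tmp_path, "rb") as f:
    raw = f.read()
```

The same `encoding="utf-8"` argument may be needed when opening the input file, assuming the input really is UTF-8. If you would rather keep named entities in the output, Beautiful Soup's output formatters (for example `soup.decode(formatter="html")`) might be worth a look, though that is untested here.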
If anyone is looking for a more comprehensive solution to do this, I really like the SingleFile browser extension: https://github.com/gildas-lormeau/SingleFile (there is also a commandline version https://github.com/gildas-lormeau/single-file-cli).
There is also this: https://github.com/Y2Z/monolith
Could probably optionally use htmlmin here: https://htmlmin.readthedocs.io/en/latest/