@ilovefreesw
Last active October 4, 2023 07:18
A PowerShell script to bulk convert webpages to PDF using headless Chrome. It saves each PDF under a numeric name or under a name based on the webpage title.
$sourceFile = " " # the source file containing the URLs you want to convert, one per line
$destFolder = " " # converted PDFs will be saved here. Folder has to exist. Join-Path adds the separator, so a trailing "/" is optional
$num = 1
foreach ($link in [System.IO.File]::ReadLines($sourceFile))
{
    # Name each PDF after its position in the list: 1.pdf, 2.pdf, ...
    $outfile = $num.ToString() + '.pdf'
    $outputPath = Join-Path -Path $destFolder -ChildPath $outfile
    & 'C:\Program Files\Google\Chrome\Application\chrome.exe' --headless --print-to-pdf="$outputPath" "$link"
    Start-Sleep -Seconds 3 # pause before starting the next conversion
    $num++
}
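
The script above requires the destination folder to exist and passes every line of the source file to Chrome, blank ones included. A minimal sketch of two pre-flight helpers follows. `Get-UrlList` and `Initialize-DestFolder` are hypothetical names, not part of the original gist, and the sketch assumes blank lines in the source file should simply be skipped.

```powershell
# Hypothetical helper (not in the original gist): returns the usable URLs from a text file.
function Get-UrlList {
    param([Parameter(Mandatory)][string]$Path)

    foreach ($line in [System.IO.File]::ReadLines($Path)) {
        $url = $line.Trim()
        # Skip empty or whitespace-only lines so Chrome is never started without a URL.
        if ($url) { $url }
    }
}

# Hypothetical helper (not in the original gist): creates the destination folder if it is missing.
function Initialize-DestFolder {
    param([Parameter(Mandatory)][string]$Path)

    if (-not (Test-Path -LiteralPath $Path -PathType Container)) {
        New-Item -ItemType Directory -Path $Path -Force | Out-Null
    }
}
```

With these in place, the loop could start with `Initialize-DestFolder -Path $destFolder` and iterate over `(Get-UrlList -Path $sourceFile)` instead of reading the file directly.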

zatt commented Jun 15, 2023

I couldn't get this script to work, so I used AI to fix it. Before the fix, each new file overwrote the previous one, and I was left with only one PDF file. If anyone else has this problem, here's an updated script that works for me:

$sourceFile = " " # the source file containing the URLs you want to convert
$destFolder = " " # converted PDFs will be saved here. Folder has to exist.
$num = 1

foreach ($link in [System.IO.File]::ReadLines($sourceFile))
{
    $outfile = $num.ToString() + '.pdf'
    $outputPath = Join-Path -Path $destFolder -ChildPath $outfile
    & 'C:\Program Files\Google\Chrome\Application\chrome.exe' --headless --print-to-pdf="$outputPath" "$link"
    Start-Sleep -Seconds 3
    $num++
}

P.S. Thanks for the article/script, much appreciated!


tjubb commented Oct 3, 2023

This is awesome! Huge time saver for a project I'm working on! However, is it possible to tweak the script to name the output files according to the page title instead of a nondescript number?

Author

ilovefreesw commented Oct 4, 2023

@tjubb

You can try this snippet. The webpage title cannot be used as a file name as-is because it may contain illegal characters such as ":", so the snippet strips those characters first.

This modified snippet depends on PowerHTML, so install it first:

Install-Module -Name PowerHTML

and then use this script:

$sourceFile = ""  # the source file containing the URLs you want to convert
$destFolder = "" # converted PDFs will be saved here. Folder has to exist.

foreach ($link in [System.IO.File]::ReadLines($sourceFile))
{
    # Fetch the page and read the text of its <title> element
    $title = (ConvertFrom-HTML -uri $link).SelectNodes('/html/head/title').InnerText

    # Build a character class of everything that is not allowed in a file name, then strip it
    $pattern = "[{0}]" -f [RegEx]::Escape([System.IO.Path]::GetInvalidFileNameChars() -join '')
    $cleanedTitle = $title -replace $pattern, ''

    echo $cleanedTitle

    $outfile = $cleanedTitle + '.pdf'
    $outputPath = Join-Path -Path $destFolder -ChildPath $outfile

    & 'C:\Program Files\Google\Chrome\Application\chrome.exe' --headless --print-to-pdf="$outputPath" "$link"
    Start-Sleep -Seconds 3
    echo " "
}
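
Title-based names bring back the overwrite problem mentioned earlier in the thread: two pages with the same title, or a page with no title at all, would end up with the same file name. Below is a minimal sketch of one way to guard against that. `Get-SafeFileName` is a hypothetical helper, not part of the gist or of PowerHTML, and it assumes that falling back to the numeric name is acceptable when the title is empty or already taken.

```powershell
# Hypothetical helper (not part of the original gist or of PowerHTML).
function Get-SafeFileName {
    param(
        [string]$Title,   # raw page title, may be empty
        [int]$Index,      # position of the URL in the list, used as the fallback name
        [System.Collections.Generic.HashSet[string]]$Used   # names already handed out in this run
    )

    # Same character-class pattern as the snippet above.
    $pattern = "[{0}]" -f [RegEx]::Escape([System.IO.Path]::GetInvalidFileNameChars() -join '')
    $name = ("$Title" -replace $pattern, '').Trim()

    # Fall back to the numeric name when nothing usable is left.
    if (-not $name) { $name = $Index.ToString() }

    # Append the index when the name was already used, so no PDF gets overwritten.
    if (-not $Used.Add($name)) {
        $name = "$name ($Index)"
        [void]$Used.Add($name)
    }

    return "$name.pdf"
}

# Usage: one case-insensitive set per run, one call per URL.
$used = [System.Collections.Generic.HashSet[string]]::new([System.StringComparer]::OrdinalIgnoreCase)
$outfile = Get-SafeFileName -Title 'Release notes / 2023' -Index 1 -Used $used
```

The set is case-insensitive because file names on Windows are. Inside the loop, `$Index` would come from a counter such as the `$num` variable in the numeric version of the script.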
