This script reads a list of URLs from a file, removes any duplicates, and then loops through each one.
It checks each URL's response headers to see whether it returns an HTTP status code of 200.
If a URL is a redirect (301) or not found (404), it is not added to the sitemap.
All the good URLs are written to sitemap.xml at the end.
It was built to take the output from site crawlers and build a sitemap out of what they find. Sitemaps generated by crawlers typically do not take redirects into account; this script does.
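The filtering rule described above can be sketched as a small predicate. This is an illustrative sketch, not the script's actual code; the function names are made up for the example:

```javascript
// Decide whether a URL's HTTP status code qualifies it for the sitemap.
// Only a 200 (OK) is kept; redirects (301) and not-found (404) are skipped.
function shouldInclude(statusCode) {
  return statusCode === 200;
}

// Keep only the URLs whose status check passed.
function goodUrls(results) {
  return results
    .filter(r => shouldInclude(r.status))
    .map(r => r.url);
}
```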
To run it you will need Node.js installed. If you need some help with that, have a Google.
Once it is installed, you just need to run it like so:
node extract.js files.txt
where files.txt is the file containing your list of URLs.
The list of URLs must look something like this:
http://www.example.com/index.html
http://www.example.com/contact.html
http://www.example.com/aboutus.html
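For reference, the sitemap.xml that gets written follows the standard sitemaps.org protocol (a urlset element containing one url/loc pair per page). A sketch of that serialisation step, with an illustrative helper name:

```javascript
// Serialise a list of good (status 200) URLs into sitemap XML,
// following the sitemaps.org protocol format.
function buildSitemap(urls) {
  const entries = urls
    .map(u => '  <url>\n    <loc>' + u + '</loc>\n  </url>')
    .join('\n');
  return '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    entries + '\n' +
    '</urlset>\n';
}
```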