@david-crespo
Last active February 15, 2023 21:30
Download FB tagged photos and videos

Download photos and videos you're tagged in on Facebook

Why

When you download an archive of your Facebook account, Facebook includes photos and videos you've uploaded, but not photos and videos you're tagged in that were uploaded by other people. This is a script to automatically download those.

Setup

This requires Python 3.

  1. Make sure you have curl (Linux and Mac likely already have it; on Windows, see https://curl.se/windows/)
  2. Run mkdir photos videos in the same directory as the scripts
  3. Run pip3 install selenium
  4. Download the ChromeDriver executable and put it somewhere in your PATH
  5. Set FB_USER_ID and CHROME_PROFILE_PATH in helpers.py
  6. Set CONTAINER_SELECTOR in tagged_photos.py (see below)
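
The first few setup steps can be checked before a long run. This is a minimal sketch, assuming the directory names from step 2; preflight is a hypothetical helper and not part of the original scripts:

```python
import os
import shutil


def preflight(base_dir=".", tools=("curl", "chromedriver")):
    """Create the output directories and return the tools missing from PATH."""
    for name in ("photos", "videos"):
        # exist_ok=True makes this safe to call on every run
        os.makedirs(os.path.join(base_dir, name), exist_ok=True)
    # shutil.which returns None when an executable cannot be found in PATH
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling preflight() at the top of either script and stopping when the returned list is non-empty would surface a missing curl or ChromeDriver before the browser opens.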

Authentication

The trick here is to avoid logging in from the script by reusing the same Chrome profile every time the script runs. Run python3 tagged_photos.py: a Chrome window will open and you will be redirected to the FB login page. Once you log in, the login persists in that profile, so you can close the window and run the script again, and it should work.

CONTAINER_SELECTOR

The photo downloader relies on a particular CSS class that is auto-generated by FB's frontend build process, so it is likely to change over time. It was .atb when I wrote this, but expect it to be different by the time you run the script. You'll have to dig into the source of a photo page to find the class on the container around the photo info (timestamp, album, etc.).

Running it

python3 tagged_photos.py or python3 tagged_videos.py

The photo script can take a while if you have a lot of photos because it navigates through the site in real time, and I didn't figure out how to parallelize it because this is Python (it would have been easy in JS). About 900 photos took almost an hour.
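
With a single browser window the page navigation has to stay serial, but the curl downloads do not. Here is a rough sketch, assuming the main loop first collects (url, path) pairs and the files are fetched afterwards. download_all, curl_download and max_workers are hypothetical names, not part of the original scripts, and how much time this saves depends on how much of the run is download rather than navigation:

```python
from concurrent.futures import ThreadPoolExecutor
from subprocess import call


def curl_download(url, path):
    """Download one file with curl; returns curl's exit code (0 on success)."""
    return call(["curl", "--silent", url, "--output", path])


def download_all(jobs, fetch=curl_download, max_workers=4):
    """Run fetch(url, path) for every (url, path) pair on a small thread pool.

    Results come back in the same order as jobs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: fetch(*job), jobs))
```

Threads are enough here because each worker spends its time waiting on a curl subprocess rather than running Python code.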

helpers.py

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

FB_USER_ID = ''  # SET ME

# on mac, probably /Users/<mac username>/Library/Application Support/Google/Chrome/Default
CHROME_PROFILE_PATH = ""


def get_driver():
    wd_options = Options()
    wd_options.add_argument("--disable-notifications")
    wd_options.add_argument("--disable-infobars")
    wd_options.add_argument("--mute-audio")
    wd_options.add_argument("--start-maximized")
    # reuse the same Chrome profile on every run so the FB login persists
    wd_options.add_argument("--user-data-dir={}".format(CHROME_PROFILE_PATH))
    # options= replaces the deprecated chrome_options= keyword
    return webdriver.Chrome(options=wd_options)


def scroll_to_bottom(driver):
    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(1)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
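
The photo and video scripts call Selenium's find_element_by_* helpers, which recent Selenium 4 releases no longer provide. Here is a minimal compatibility sketch, assuming you want the same code to run on both old and new versions. find_all_css is a hypothetical helper, not part of the original gist:

```python
CSS_SELECTOR = "css selector"  # same string as selenium's By.CSS_SELECTOR


def find_all_css(driver, selector):
    """Return all elements matching a CSS selector on old or new Selenium."""
    legacy = getattr(driver, "find_elements_by_css_selector", None)
    if legacy is not None:
        return legacy(selector)  # Selenium 3 style helper
    return driver.find_elements(CSS_SELECTOR, selector)  # Selenium 4 style
```

With this in helpers.py, a call such as driver.find_elements_by_css_selector('.tagName') would become find_all_css(driver, '.tagName').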

tagged_photos.py

import json, re
from datetime import datetime, timezone
from subprocess import call
from helpers import scroll_to_bottom, get_driver, FB_USER_ID

# you will likely need to update this to something that selects
# for the container around the photo info, timestamp, album, etc
CONTAINER_SELECTOR = ".atb"

def get_fb_id(link):
    match = re.search("fbid=([0-9]+)", link)
    if match:
        return match.group(1)
    return "fake_id_" + str(hash(link))

if __name__ == '__main__':
    print("-"*20 + "\nOpening Browser...")

    driver = get_driver()

    driver.get("https://m.facebook.com/{}/photos".format(FB_USER_ID))

    scroll_to_bottom(driver)

    photo_links = list(map(
        lambda el: el.get_attribute("href"),
        driver.find_elements_by_css_selector('.timeline.photos a')
    ))

    pretty = dict(sort_keys=True, indent=4, separators=(',', ': '))

    photos = []
    for link in photo_links:
        driver.get(link)

        photo_id = get_fb_id(link)

        full_size_url = driver.find_element_by_link_text("View Full Size").get_attribute("href")

        actor = driver.find_element_by_css_selector('.actor').text
        people = list(map(
            lambda el: el.text,
            driver.find_elements_by_css_selector('.tagName')
        ))
        caption = driver.find_element_by_css_selector('.msg > div').text
        timestamp_json = driver.find_element_by_css_selector('{} abbr'.format(CONTAINER_SELECTOR)).get_attribute('data-store')
        timestamp = json.loads(timestamp_json).get("time")
        info = driver.find_element_by_css_selector('{} > div'.format(CONTAINER_SELECTOR)).text.replace('\u00b7', '-').rstrip(' -')
        date = datetime.fromtimestamp(timestamp, timezone.utc).strftime("%Y-%m-%d")
        filename = "{}_{}.jpg".format(date, photo_id)

        driver.get(full_size_url)

        photo = {
            "fb_url": link,
            "cdn_url": driver.current_url,
            "actor": actor,
            "caption": caption,
            "timestamp": timestamp,
            "info": info,
            "filename": filename,
            "people": people
        }

        print(json.dumps(photo, **pretty))

        photos.append(photo)

        with open('photos/data.json', 'w') as f:
            f.write(
                json.dumps(photos, **pretty)
            )

        call(["curl", driver.current_url, "--output", "photos/{}".format(filename)])
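
One caveat about get_fb_id: Python 3 randomizes str hashes per process (unless PYTHONHASHSEED is set), so the fake_id_ fallback produces a different file name for the same link on every run. Here is a sketch of a repeatable alternative, assuming a truncated SHA-1 of the link is an acceptable ID. stable_fb_id is a hypothetical replacement, not the author's code:

```python
import hashlib
import re


def stable_fb_id(link):
    """Return the fbid from a photo link, or a repeatable fake ID if there is none."""
    match = re.search(r"fbid=([0-9]+)", link)
    if match:
        return match.group(1)
    # hashlib digests do not change between runs, unlike the built-in hash() of a str
    digest = hashlib.sha1(link.encode("utf-8")).hexdigest()
    return "fake_id_" + digest[:12]
```

A repeatable ID means that re-running the script overwrites an earlier download instead of leaving a second copy under a new name.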

tagged_videos.py

import re
from subprocess import call
from helpers import scroll_to_bottom, get_driver, FB_USER_ID

if __name__ == '__main__':
    print("-"*20 + "\nOpening Browser...")

    driver = get_driver()

    driver.get("https://www.facebook.com/{}/videos".format(FB_USER_ID))

    scroll_to_bottom(driver)

    video_links = list(map(
        lambda el: el.get_attribute("href").replace('www.', 'm.'),
        driver.find_elements_by_css_selector('ul.fbStarGrid > li > a')
    ))

    for link in video_links:
        driver.get(link)

        page_source = driver.page_source

        driver.find_element_by_css_selector('[data-sigil="m-video-play-button playInlineVideo"]').click() # play video

        cdn_url = driver.find_element_by_css_selector('video').get_attribute('src')
        filename = cdn_url.split('?')[0].split('/')[-1]

        with open('videos/{}.html'.format(filename), 'w') as f:
            f.write(page_source)

        call(["curl", cdn_url, "--output", "videos/{}".format(filename)])

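
The video script derives the file name by cutting the CDN URL at the first ? and taking the last path segment. The same idea can be sketched with the standard URL parser, assuming a fallback name is wanted when the path is empty. filename_from_url and the example.com URL in the note below are made up for illustration:

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url, default="video.mp4"):
    """Return the last path segment of a URL, ignoring the query string and fragment."""
    name = posixpath.basename(urlsplit(url).path)
    # fall back to a fixed name when the URL path ends in a slash or is empty
    return name or default
```

For a URL such as https://video.example.com/v/12345_n.mp4?oh=abc this returns 12345_n.mp4.
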
@thecmanp11

In line 66 of tagged_photos.py I needed to change "photos/data.json" to just "photos.json"

@Sneakysouthpaw

Thanks a lot for this. I get the following error towards the end of the script. Any ideas where I could be going wrong?

Traceback (most recent call last):
  File "C:\Users\Bruiser\AppData\Local\Programs\Python\Python38-32\Scripts\tagged_photos.py", line 71, in <module>
    call(["curl", driver.current_url, "--output", "photos/{}".format(filename)])
  File "C:\Users\Bruiser\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 340, in call
    with Popen(*popenargs, **kwargs) as p:
  File "C:\Users\Bruiser\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\Bruiser\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 1307, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

@david-crespo
Author

david-crespo commented Dec 13, 2020

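The most likely culprit is that you don't have curl installed: subprocess raises FileNotFoundError when it cannot find the curl executable. I didn't test this on Windows. You can get curl from https://curl.se/windows/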
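
If installing curl is not an option, the download step could in principle be done with Python's standard library instead. This is a minimal sketch, not what the scripts do. download is a hypothetical helper, it has not been tried against Facebook's CDN, and it sends Python's default User-Agent rather than curl's:

```python
import shutil
import urllib.request


def download(url, path):
    """Save the resource at url to path using only the standard library."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        # stream the body to disk in chunks instead of loading it all into memory
        shutil.copyfileobj(response, out)
```

In tagged_photos.py this would stand in for the call(["curl", ...]) line, as download(driver.current_url, "photos/{}".format(filename)).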

@Sneakysouthpaw

Sneakysouthpaw commented Dec 14, 2020

Thank you so much, that did the trick!

@jeasmith

Hey - This worked a treat for me! I wanted to add the dates back on to the images from the json file so they would import into my Photo app closer to the date they were taken, so I've written a bit of additional code. (I'm a dotnet dev so please excuse any lack of python knowledge)

import json
import time
import piexif
from os import listdir

not_found_photos = []

with open('photos.json') as json_file:
    data = json.load(json_file)
    for image in data:
        file_name_from_json = image['filename']
        time_from_timestamp_from_json = time.strftime("%Y:%m:%d %H:%M:%S", time.localtime(image['timestamp']))
        found = False
        photo_list = listdir('photos')
        for photo_file_name in photo_list:
            if(photo_file_name == file_name_from_json):
                found = True
                file_name = 'photos/' + photo_file_name
                exif_dict = piexif.load(file_name)
                exif_dict['0th'][piexif.ImageIFD.DateTime] = time_from_timestamp_from_json
                exif_dict['Exif'][piexif.ExifIFD.DateTimeOriginal] = time_from_timestamp_from_json
                exif_dict['Exif'][piexif.ExifIFD.DateTimeDigitized] = time_from_timestamp_from_json
                exif_bytes = piexif.dump(exif_dict)
                piexif.insert(exif_bytes, file_name)

        if(found == False):
            not_found_photos.append(file_name_from_json)

print('Files not found: ' + str(len(not_found_photos)))

If you no longer have the JSON, you can also get the date from the file name; it just won't include the time of day (i.e. only the date):

import piexif
from os import listdir

photo_list = listdir('photos')
for photo_file_name in photo_list:
    if(photo_file_name.endswith('.jpg')):
        file_name_date = photo_file_name.split('_')[0].split('-')
        date_to_use = ':'.join(map(str, file_name_date)) + ' 00:00:00'
        file_name = 'photos/' + photo_file_name
        exif_dict = piexif.load(file_name)
        exif_dict['0th'][piexif.ImageIFD.DateTime] = date_to_use
        exif_dict['Exif'][piexif.ExifIFD.DateTimeOriginal] = date_to_use
        exif_dict['Exif'][piexif.ExifIFD.DateTimeDigitized] = date_to_use
        exif_bytes = piexif.dump(exif_dict)
        piexif.insert(exif_bytes, file_name)
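
The file-name parsing above can also be sketched as a small function that validates the date prefix, so that files without one (for example data.json) are skipped instead of being given a malformed date. exif_date_from_filename is a hypothetical helper, and the example file name in its docstring is made up:

```python
from datetime import datetime


def exif_date_from_filename(file_name):
    """Turn '2015-06-01_12345.jpg' into '2015:06:01 00:00:00', or None if no date prefix."""
    prefix = file_name.split("_")[0]
    try:
        day = datetime.strptime(prefix, "%Y-%m-%d")
    except ValueError:
        # the part before the first underscore is not a YYYY-MM-DD date
        return None
    return day.strftime("%Y:%m:%d %H:%M:%S")
```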

@david-crespo
Author

That's awesome. Here are a couple of tips for more pythonic style:

No parens around if conditions:

if photo_file_name == file_name_from_json:

The following is preferred unless you need to distinguish between False and some other falsy value like None or [] or 0 or "" (which you don't here):

if not found:

But you actually don't need either of those lines, because you don't have to loop through photo_list at all:

import json
import time
import piexif
from os import listdir

not_found_photos = []

with open('photos.json') as json_file:
    data = json.load(json_file)
    photo_list = listdir('photos')
    for image in data:
        file_name_from_json = image['filename']

        # skip entries whose photo was never downloaded
        if file_name_from_json not in photo_list:
            not_found_photos.append(file_name_from_json)
            continue

        file_name = 'photos/' + file_name_from_json
        time_from_timestamp_from_json = time.strftime("%Y:%m:%d %H:%M:%S", time.localtime(image['timestamp']))

        exif_dict = piexif.load(file_name)
        exif_dict['0th'][piexif.ImageIFD.DateTime] = time_from_timestamp_from_json
        exif_dict['Exif'][piexif.ExifIFD.DateTimeOriginal] = time_from_timestamp_from_json
        exif_dict['Exif'][piexif.ExifIFD.DateTimeDigitized] = time_from_timestamp_from_json
        exif_bytes = piexif.dump(exif_dict)
        piexif.insert(exif_bytes, file_name)

print('Files not found: ' + str(len(not_found_photos)))
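
One thing to be aware of: tagged_photos.py builds the date in the file name from the timestamp in UTC, while the snippet above formats the EXIF date with time.localtime, so the two can disagree by a day for photos posted near midnight. If you would rather keep them consistent, here is a sketch that formats the EXIF string in UTC as well. exif_timestamp_utc is a hypothetical helper, and whether UTC or local time is the better choice depends on what your photo app expects:

```python
from datetime import datetime, timezone


def exif_timestamp_utc(timestamp):
    """Format a Unix timestamp as an EXIF date string ('YYYY:MM:DD HH:MM:SS') in UTC."""
    return datetime.fromtimestamp(timestamp, timezone.utc).strftime("%Y:%m:%d %H:%M:%S")
```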

@sevenfour-74

I keep getting the error

Warning: Failed to create the file photos/x.jpg: No such file or directory

so I moved "--output" before the URL in line 71 of tagged_photos.py and then got the error

curl: (6) Could not resolve host: photos

Do you know what's going wrong?

@david-crespo
Author

david-crespo commented Apr 5, 2021

I believe the solution is one of the following:

  • create the photos directory by running mkdir photos, or

  • add --create-dirs to the curl command, like this

    call(["curl", driver.current_url, "--create-dirs", "--output", "photos/{}".format(filename)])
    
  • or change "photos/{}".format(filename) to filename (which will drop all the photos right in the directory where the script is).

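The --output flag is necessary, and it has to come immediately before the file path: curl treats the argument after --output as the output file and any other bare argument as a URL. Moving or removing --output makes curl think photos/x.jpg is another URL to hit (as opposed to the path where the downloaded photo goes), which is why it complains that it cannot resolve the host named photos.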
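
To keep the flag and its path together, the argument list can be built in one place. This is a small sketch using only the curl options mentioned in this thread. curl_command and fetch are hypothetical helper names:

```python
from subprocess import call


def curl_command(url, output_path):
    """Build the curl argument list, keeping --output directly before its file path."""
    return ["curl", url, "--create-dirs", "--output", output_path]


def fetch(url, output_path):
    """Run curl and return its exit code (0 means success)."""
    return call(curl_command(url, output_path))
```

Checking the return value of fetch also makes a failed download visible instead of letting the loop carry on silently.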

@david-crespo
Author

I updated the setup steps to include

  1. make sure you have curl
  2. mkdir photos videos
