@Equim-chan · Last active March 17, 2024

(obsolete) Equim’s SOP for Archiving YouTube Livestream

This document describes an SOP for archiving a YouTube livestream, either public or private.

The demonstrations below are performed on Arch Linux, but they should work on other systems as well, including Windows with MSYS2.

This SOP was originally written for archiving Gawr Gura’s unarchived streams.

Overview

The result of the archive consists of:

  • an MPEG2-TS video file

  • a thumbnail file

  • a metadata JSON file

  • a streamlink trace log file

  • an integrity check log file

  • a raw livechat JSON file

  • a rendered livechat HTML file

The folder layout will look like this:

.
├── m-Bq5CG_rGQ.html
├── m-Bq5CG_rGQ.json.gz
├── slchk.log
├── streamlink.log
├── [UNARCHIVED KARAOKE] Jazz Lounge!-m-Bq5CG_rGQ.info.json
├── [UNARCHIVED KARAOKE] Jazz Lounge!-m-Bq5CG_rGQ.jpg
└── [UNARCHIVED KARAOKE] Jazz Lounge!-m-Bq5CG_rGQ.ts

Preparation for Tools

Install the Tools

You will need

  • streamlink

  • youtube-dl

  • ffmpeg

  • virtualenv

  • pytchat

  • scripts in this gist

$ sudo pacman -S streamlink youtube-dl python-virtualenv ffmpeg
$ git clone https://github.com/taizan-hokuto/pytchat.git
$ cd pytchat
$ virtualenv venv
$ . venv/bin/activate
$ pip install -r requirements.txt
$ deactivate

Prepare cookies.txt

Note
This step is only needed if you are going to archive a private stream.

Export your cookies for the host youtube.com to a cookies.txt file, then test and sanitize the file using youtube-dl.

$ youtube-dl --cookies cookies.txt --skip-download "$any_youtube_video_url"

youtube-dl will actually rewrite your cookies.txt; afterwards you will see # This file is generated by youtube-dl. Do not edit. at the beginning of the file.
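As a quick sanity check (a sketch, assuming your cookies.txt sits in the current directory), you can test for that marker line:

```shell
# Returns success if the file was rewritten (sanitized) by youtube-dl,
# i.e. its first line contains the youtube-dl marker comment.
check_cookies() {
  head -n 1 "$1" | grep -q 'generated by youtube-dl'
}

check_cookies cookies.txt && echo 'sanitized' || echo 'not sanitized yet'
```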

Before the Stream, After the Waiting Room is Available

Set Cookies for pytchat

Note
This step is only needed if you are going to archive a private stream.

You need to prepare the livechat archiver at this step.

Visit https://www.youtube.com/live_chat?v=$video_id in your browser. After the page has loaded, open devtools (usually F12), go to the "Network" tab, look for any POST request whose URL is prefixed with https://www.youtube.com/youtubei/v1/live_chat/get_live_chat, then copy the value of the Cookie header of such a request.

Edit pytchat/config/__init__.py, add a field with key cookie to headers, and paste the cookie value there.
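The edit might look like the sketch below. The exact contents of pytchat/config/__init__.py vary between pytchat versions, and both strings here are placeholders, not real values:

```python
# pytchat/config/__init__.py (sketch; both strings are placeholders)
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) ...',
    # paste the Cookie value copied from devtools here:
    'cookie': 'PASTE_COPIED_COOKIE_VALUE_HERE',
}
```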

Start Archiving the Chat

$ . venv/bin/activate
$ ./chat-archive.py fetch "$video_id"

2 Minutes Before the Scheduled Start Time

Start Archiving the Video

Run the streamlink-cookie.py script if you have prepared a cookies.txt; otherwise, replace it with vanilla streamlink.

$ env TZ=UTC ./streamlink-cookie.py \
  -o archive.ts \
  -l trace \
  --retry-streams 30 \
  --hls-live-restart \
  --hls-segment-threads 4 \
  --hls-segment-attempts 20 \
  --hls-playlist-reload-attempt 20 \
  "$video_url" \
  best \
  |& tee streamlink.log
Tip
The log file at trace level is very important as it is the source to check your archive’s integrity.

During the Stream

Get the Exact Start Time of the Stream

Note
If the stream is going to be unarchived, this will be the only chance you have to get this information.
$ ./get_start_time.sh "$video_id"
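get_start_time.sh (listed at the end of this document) fetches the stream playlist, reads a media segment's Last-Modified header, and prints it as a UTC Unix timestamp in microseconds; that value is the $exact_stream_start_timestamp passed to chat-archive.py render later. The conversion step alone, with a hypothetical example date, looks like:

```shell
# GNU date: %s = seconds since epoch, %6N = 6-digit fractional part
# (microseconds; zero here since the input has whole-second precision)
date +'%s%6N' --utc -d 'Sat, 01 Jan 2022 00:00:00 GMT'
```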

Archive Metadata and Thumbnail

$ youtube-dl --write-thumbnail --write-info-json --skip-download "$video_url"
Note
The archived JSON may contain sensitive information about your archiving environment, such as your IP address. Make sure to erase it if you intend to share the file.

After the Stream

Render the Chat HTML

$ env TZ=UTC ./chat-archive.py render "$video_id" "$exact_stream_start_timestamp"
$ # compress the raw json
$ gzip -9 "$video_id.json"

Check the Integrity of the Archive

$ ./slchk.py streamlink.log |& tee slchk.log

If you see missing segments count: 0, congrats.
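The check boils down to sorting the segment IDs found in the trace log and counting the gaps between consecutive IDs; a condensed sketch of that idea in isolation:

```python
def count_missing(segment_ids):
    """Return how many segment IDs are absent between consecutive IDs seen."""
    ids = sorted(segment_ids)
    return sum(b - a - 1 for a, b in zip(ids, ids[1:]))

# Segments 4 and 6-8 were never downloaded:
print(count_missing([1, 2, 3, 5, 9]))  # prints 4
```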

Organize Files

Rename your archive.ts with a proper title. It is recommended to use the same filename as the metadata JSON file you got from youtube-dl.
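One way to do the rename (a sketch; assumes exactly one *.info.json file in the current directory):

```shell
# Rename archive.ts to match the basename of the youtube-dl metadata file,
# e.g. "Title-m-Bq5CG_rGQ.info.json" -> "Title-m-Bq5CG_rGQ.ts".
rename_archive() {
  local info_json base
  info_json=$(ls ./*.info.json | head -n 1)
  base=${info_json%.info.json}
  mv archive.ts "${base}.ts"
}
```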

chat-archive.py

#!/usr/bin/env python
import sys
import time
import logging
import json
from datetime import datetime, timedelta, timezone

import pytchat
from pytchat.processors.dummy_processor import DummyProcessor
from pytchat.processors.html_archiver import HTMLArchiver

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(pathname)s:%(lineno)s:\t%(msg)s')


def fetch(video_id, fallback_poll_interval=5):
    stream = pytchat.create(video_id=video_id, processor=DummyProcessor())
    total_len = 0
    with open(video_id + '.json', 'a') as json_out:
        logging.info(f'appending to {json_out.name}')
        while stream.is_alive():
            poll_interval = fallback_poll_interval
            chats = stream.get()
            if len(chats) != 1:
                logging.info(f'len(chats) != 1, sleep: {poll_interval}')
                time.sleep(poll_interval)
                continue
            chat = chats[0]
            if not chat:
                logging.info(f'chats[0] is empty, sleep: {poll_interval}')
                time.sleep(poll_interval)
                continue
            poll_interval = chat.get('timeout', poll_interval)
            chatdata = chat.get('chatdata') or []
            for item in chatdata:
                json_out.write(json.dumps(item, ensure_ascii=False, sort_keys=True, separators=(',', ':')) + '\n')
            logging.info(f'len: {total_len} + {len(chatdata)}, sleep: {poll_interval}')
            total_len += len(chatdata)
            time.sleep(poll_interval)


def render(video_id, start_us_utc):
    start = datetime.fromtimestamp(start_us_utc / 1e6, timezone.utc)
    ar = HTMLArchiver(video_id + '.html')
    with open(video_id + '.json') as json_in:
        logging.info(f'reading from {json_in.name}')
        batch = []
        for line in json_in:
            chat = json.loads(line)
            if 'addChatItemAction' not in chat:
                continue
            # write elapsed time
            for k, v in chat['addChatItemAction']['item'].items():
                if not v.get('timestampUsec'):
                    continue
                timestamp_us = float(v['timestampUsec'])
                timestamp = datetime.fromtimestamp(timestamp_us / 1e6, timezone.utc)
                if timestamp >= start:
                    elapsed = str(timestamp - start)
                else:
                    elapsed = '-' + str(start - timestamp)
                chat['addChatItemAction']['item'][k]['timestampText'] = {'simpleText': elapsed}
            batch.append(chat)
            if (len(batch) + 1) % 32 == 0:
                ar.process([{'chatdata': batch}])
                batch.clear()
        if len(batch) > 0:
            ar.process([{'chatdata': batch}])
    ar.finalize()


if __name__ == '__main__':
    verb = sys.argv[1]
    if verb == 'fetch':
        video_id = sys.argv[2]
        fetch(video_id)
    elif verb == 'render':
        video_id = sys.argv[2]
        start_us_utc = float(sys.argv[3])
        render(video_id, start_us_utc)
    else:
        sys.exit(1)
get_start_time.sh

#!/usr/bin/env bash
youtube-dl --cookies cookies.txt -g "https://www.youtube.com/watch?v=$1" | \
head -n 1 | \
xargs curl -SsL | \
tail -n 1 | \
xargs curl -SsL -I | \
grep -i 'last-modified' | \
sed 's/last-modified: //i' | \
xargs -d '\n' date +'%s%6N' --utc -d
slchk.py

#!/usr/bin/env python
import re
import sys

log_file = sys.argv[1]
segment_pat = re.compile(r'^.+ Segment')
segments = []
# Collect (segment_id, timestamp) pairs from the trace-level streamlink log.
with open(log_file) as f:
    for line in f:
        if 'Segment' not in line:
            continue
        segment_id = int(segment_pat.sub('', line.replace('complete', '')).strip())
        timestamp = line[1:line.index(']')]
        segments.append((segment_id, timestamp))
segments.sort(key=lambda x: x[0])
print(f'start: {segments[0][0]} ({segments[0][1]})')
print(f'end: {segments[-1][0]} ({segments[-1][1]})')
latest_id, latest_timestamp = segments[0]
missing_count = 0
for segment_id, timestamp in segments[1:]:
    if segment_id == latest_id + 1:
        latest_id = segment_id
        latest_timestamp = timestamp
        continue
    delta = segment_id - latest_id - 1
    if delta > 1:
        print(f'missing: {latest_id + 1}-{segment_id - 1} ({latest_timestamp} - {timestamp})')
    else:
        print(f'missing: {latest_id + 1} ({latest_timestamp} - {timestamp})')
    missing_count += delta
    latest_id = segment_id
    latest_timestamp = timestamp
print(f'missing segments count: {missing_count}')
sys.exit(1 if missing_count > 0 else 0)