Process yt-dlp Extracted Subtitles Script

Description

This Python script processes .vtt subtitle files downloaded with yt-dlp from YouTube or similar platforms. It merges overlapping subtitle segments, cleans the text by removing excess whitespace, and writes the result to a new text file with a timestamped filename.

Features

  • Subtitle Merging: Combines multiple subtitle entries into a single entry, considering overlaps.
  • Text Cleaning: Cleans subtitle text by replacing newline characters and reducing multiple spaces to a single space.
  • Output: Generates a cleaned and merged text file for each .vtt file in the specified directory.
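
The merging and cleaning steps listed above can be illustrated with a minimal sketch. This is not the gist's actual code: the names `Cue`, `clean_text`, and `merge_overlapping` are hypothetical, and cues are assumed to be already parsed into `(start_seconds, end_seconds, text)` tuples.

```python
import re
from typing import List, Tuple

# Hypothetical cue representation: (start_seconds, end_seconds, text).
Cue = Tuple[float, float, str]


def clean_text(text: str) -> str:
    """Replace newlines with spaces and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip()


def merge_overlapping(cues: List[Cue]) -> List[Cue]:
    """Merge cues whose time ranges overlap or touch, joining their cleaned text."""
    merged: List[Cue] = []
    for start, end, text in sorted(cues):
        text = clean_text(text)
        if merged and start <= merged[-1][1]:
            prev_start, prev_end, prev_text = merged[-1]
            # Do not append text that the previous cue already contains.
            if text in prev_text:
                joined = prev_text
            else:
                joined = clean_text(prev_text + " " + text)
            merged[-1] = (prev_start, max(prev_end, end), joined)
        else:
            merged.append((start, end, text))
    return merged


if __name__ == "__main__":
    sample = [
        (0.0, 2.0, "hello\nworld"),
        (1.5, 3.0, "world  again"),
        (5.0, 6.0, "next"),
    ]
    for cue in merge_overlapping(sample):
        print(cue)
```

Sorting by start time first means each cue only has to be compared with the most recently merged entry. A real script would also need to parse the .vtt timestamps and write the output file.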
@lucidyan
lucidyan / exact_bayesian_inference.md
Created February 4, 2020 20:15
Exact Bayesian Inference for A/B testing (all three parts)

Exact Bayesian Inference for A/B testing

Use the "MathJax Plugin for Github" Chrome extension for equation support.

Part I

author: Evan Haas
2009.12.09
Source

In this three-part series I’m going to talk about statistics in the context of A/B Testing. Part I discusses how to analyze experiments using traditional techniques from the frequentist school. Part II will discuss the Bayesian approach, and Part III will provide an implementation of the Bayesian method. Much of the information is adapted from the excellent Information Theory, Inference, and Learning Algorithms by Davi

lucidyan / spark-on-k8s-operator.md
Last active November 18, 2019 14:20
Example of running spark-on-k8s-operator locally on a minikube cluster

spark-on-k8s-operator

Install minikube

curl -Lo minikube https://github.com/kubernetes/minikube/releases/download/v1.5.2/minikube-linux-amd64 && chmod +x minikube
sudo mkdir -p /usr/local/bin/
sudo install minikube /usr/local/bin/

Install VirtualBox

Grep files with selected extensions only

find ./ -type f \( -iname \*.md -o -iname \*.txt \) -exec grep -Hi 'word' {} +

Show files opened in Sublime

cat $HOME/.config/sublime-text-3/Local/Auto\ Save\ Session.sublime_session | grep "\"file\":" | sed 's/^[[:space:]]*//g' | sed 's/^\"file\"\: \"//g' | sort -u | sed 's/[\",]*//ig'

Convert all *.wav to *.mp3

lucidyan / scipy_odr_test.py
Created May 9, 2019 20:55 — forked from peci1/scipy_odr_test.py
Test of scipy.odr regressor
from __future__ import print_function
import numpy as np
import scipy.linalg
from scipy.odr import *
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import pyplot as plt
import sys
import time
lucidyan / spark_to_pandas.py
Last active September 20, 2022 02:44 — forked from joshlk/faster_toPandas.py
fastest pyspark DataFrame to pandas DataFrame conversion using mapPartitions
import pandas as pd
from pyspark.sql import DataFrame
# Wrapper for seamless Spark's serialisation
def spark_to_pandas(spark_df: DataFrame) -> pd.DataFrame:
    """
    PySpark toPandas realisation using mapPartitions
    much faster than vanilla version
    fork: https://gist.github.com/lucidyan/1e5d9e490a101cdc1c2ed901568e082b
    origin: https://gist.github.com/joshlk/871d58e01417478176e7
lucidyan / gpu-control.md
Last active March 19, 2023 09:37
Prevent NVIDIA GPUs' throttling on headless server

Prevent NVIDIA GPUs' throttling on headless server

  • Unlock manual fan & overclock settings
    sudo nvidia-xconfig -a --cool-bits=28 --allow-empty-initial-configuration
  • Reboot system
  • Create script /usr/local/bin/gpu-fan-control.sh
#!/bin/bash
lucidyan / flatten_dict.py
Created July 3, 2018 14:14
Non-recursive flatten nested dictionaries in Python3
# Non-recursive flatten nested dictionaries in Python3
# Source: https://codereview.stackexchange.com/a/173483/173155
from itertools import chain, starmap
def flatten_dict(dictionary):
    """Flatten a nested dictionary structure"""
    def unpack(parent_key, parent_value):
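
As a rough illustration of the non-recursive idea, here is a stack-based sketch. It is not the code from the linked Code Review answer, which builds on `chain` and `starmap`; the function name `flatten_dict_iterative` and the `sep` parameter are hypothetical and chosen only for this example.

```python
def flatten_dict_iterative(nested, sep="_"):
    """Flatten nested dicts using an explicit stack instead of recursion."""
    flat = {}
    # Each stack entry pairs the key path built so far with the value found there.
    stack = [("", nested)]
    while stack:
        prefix, value = stack.pop()
        if isinstance(value, dict):
            for key, child in value.items():
                new_key = f"{prefix}{sep}{key}" if prefix else str(key)
                stack.append((new_key, child))
            # Assumption: an empty nested dict is kept as a leaf value.
            if not value and prefix:
                flat[prefix] = value
        else:
            flat[prefix] = value
    return flat


if __name__ == "__main__":
    print(flatten_dict_iterative({"a": 1, "b": {"c": 2, "d": {"e": 3}}}))
```

Because the pending work lives in a list rather than on the call stack, deeply nested input does not hit Python's recursion limit. The order of keys in the result follows the stack's pop order, not the input order.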