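Several ways to reduce the memory a DataFrame uses:

  1. Drop columns that carry redundant information
  2. Drop unneeded rows early
  3. Downcast numeric columns to the smallest precision that fits (-50%)
  4. Convert Python strings (object dtype) to pyarrow strings (-30%)
  5. Convert date strings to pandas datetime (-85%)
  6. Convert strings with many repeated values to category (-95%)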
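
The list above can be sketched in pandas roughly as follows. This is a minimal sketch, not the gist's code: `shrink`, its parameters, and the sample columns are hypothetical. The pyarrow string conversion is left out because it needs pyarrow installed, and actual savings depend on the data, so they may differ from the percentages quoted.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def shrink(df, date_cols=(), category_cols=()):
    """Return a copy of df with smaller dtypes (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        if is_integer_dtype(out[col]):
            # Smallest integer type that holds every value in the column
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif is_float_dtype(out[col]):
            # May go to float32, which loses some precision
            out[col] = pd.to_numeric(out[col], downcast="float")
    for col in date_cols:
        out[col] = pd.to_datetime(out[col])
    for col in category_cols:
        out[col] = out[col].astype("category")
    return out


# Made-up sample data, for illustration only
df = pd.DataFrame({
    "id": range(1000),
    "score": [i / 10 for i in range(1000)],
    "day": ["2023-06-27"] * 1000,
    "tag": ["a", "b"] * 500,
})
small = shrink(df, date_cols=["day"], category_cols=["tag"])

# deep=True counts the string payloads, not just the pointers
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
```

Comparing `before` and `after` shows the effect on a given frame; dropping redundant columns and unneeded rows (items 1 and 2) is best done before any of these conversions.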

Using functools.partial to pass real arguments into Kedro nodes:

from functools import partial, update_wrapper
from kedro.pipeline import Pipeline, node

from .nodes import process_todo, DemoMerger


def create_wrapped_partial(func, *args, **kwargs):
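
The preview stops at the function signature. A common way to write a helper like this, given the `partial` and `update_wrapper` imports, is sketched below. The body is an assumption and not the gist's code; `make_wrapped_partial` and `scale` are hypothetical names.

```python
from functools import partial, update_wrapper


def make_wrapped_partial(func, *args, **kwargs):
    """Bind arguments with partial and copy func's metadata onto the result."""
    bound = partial(func, *args, **kwargs)
    # A bare partial object has no __name__; update_wrapper copies
    # __name__, __doc__ and related attributes over from func.
    update_wrapper(bound, func)
    return bound


def scale(values, factor):  # hypothetical stand-in for a node function
    """Multiply every value by factor."""
    return [v * factor for v in values]


double = make_wrapped_partial(scale, factor=2)
print(double.__name__)    # scale
print(double([1, 2, 3]))  # [2, 4, 6]
```

Keeping the original `__name__` matters for tools that label a callable by its function name, which is presumably why the gist wraps the partial before handing it to `node`.
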

import numpy as np
import matplotlib.pyplot as plt

def plot_quadratic_coefficients(coefficients):
    """
    Plots y = ax^2 + bx + c for each set of coefficients within specified x and y ranges.

    Parameters:
    - coefficients: dict, a dictionary of coefficient sets with 'a', 'b', and 'c' for each key.

Code to count the tags in a DataFrame column and then visualize the counts:

def safe_split_tag_str(tag_str, separator=","):
    """
    Splits a tag string into a list of non-empty, whitespace-stripped tag strings.
    """
    if not tag_str:
        return []

(pixiv-data-process/yada/13_pixiv_streamlined.ipynb)
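
The function preview above is cut off, so here is a minimal sketch of the counting step using only the standard library. `split_tags` and `count_tags` are hypothetical names, the split logic is an assumption based on the docstring, and the plotting step is omitted.

```python
from collections import Counter


def split_tags(tag_str, separator=","):
    """Split a tag string into non-empty, whitespace-stripped tags."""
    if not tag_str:
        return []
    return [t.strip() for t in tag_str.split(separator) if t.strip()]


def count_tags(tag_strings, separator=","):
    """Count how often each tag appears across many tag strings."""
    counts = Counter()
    for tag_str in tag_strings:
        counts.update(split_tags(tag_str, separator))
    return counts


# Made-up rows; None and "" stand in for missing values in the column
rows = ["cat, dog", "dog,,bird ", None, ""]
counts = count_tags(rows)
print(counts.most_common(1))  # [('dog', 2)]
```

With a DataFrame, the column values could be passed in as `count_tags(df["tags"])`, where `"tags"` is a placeholder column name; `counts.most_common(n)` then gives the top tags to plot.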

Given a path (local or S3), return a list of all files under it, then upload the image-to-metadata mapping to S3:
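
The gist's own code is not shown in this preview. As a rough sketch of the local half only, the listing step might look like the following; `list_local_files` is a hypothetical name, and the S3 listing and upload are left out because they depend on the author's tooling.

```python
import os
import tempfile


def list_local_files(root):
    """Return the sorted paths of all files under root, searched recursively."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


# Demo on a throwaway directory with one image and one metadata file
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.png", os.path.join("sub", "b.json")):
        with open(os.path.join(tmp, rel), "w", encoding="utf-8") as f:
            f.write("x")
    found = [os.path.relpath(p, tmp) for p in list_local_files(tmp)]
```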

(When there isn't much data, you can use it directly like this:)

# https://github.com/troph-team/build-it/blob/f996fe55a6fd2beda9e62a6624be0f0fe2a05848/buildit/sagemaker/parquet_splitter.py#L13
import os
from dataproc3.sagemaker import ParquetSplitter

nd setup, works on Lambda H100 PCIe:

conda:

cd ~/ && mkdir -p miniconda3 \
  && wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh -O ./miniconda3/miniconda.sh --no-check-certificate \
  && bash ./miniconda3/miniconda.sh -b -u -p ./miniconda3 \
  && rm ./miniconda3/miniconda.sh \
  && ./miniconda3/bin/conda init bash \
  && source ~/.bashrc \
  && python -m pip install unibox ipykernel jupyter poetry \
  && python -m ipykernel install --user --name=conda310

nd:

@trojblue
trojblue / extract_url_from_artstation_json.py
Created November 16, 2023 01:10
Extract all twitter.com or x.com links from the file fevercell_projects.json:
import json

# Function to extract handles from a given domain in a nested dictionary
def extract_handles(data, domain):
    def find_handles(d):
        handles = []
        for k, v in d.items():
            if isinstance(v, dict):
                handles.extend(find_handles(v))
            elif isinstance(v, list):
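
The preview ends partway through the recursion. One way to finish the idea is sketched below as a self-contained variant, not the gist's actual code: the regular expression, the `domains` parameter, and the sample data are all assumptions.

```python
import re


def extract_handles(data, domains=("twitter.com", "x.com")):
    """Collect account handles from profile URLs found anywhere in nested data."""
    domain_alt = "|".join(re.escape(d) for d in domains)
    # Matches e.g. https://twitter.com/<handle> or https://www.x.com/<handle>
    pattern = re.compile(rf"https?://(?:www\.)?(?:{domain_alt})/([A-Za-z0-9_]+)")
    handles = []

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif isinstance(node, str):
            handles.extend(pattern.findall(node))

    walk(data)
    return handles


# Made-up sample in place of the real JSON file
sample = {
    "a": {"links": ["https://twitter.com/alice", "https://example.com/bob"]},
    "b": "see https://x.com/carol_1 and https://www.twitter.com/dave",
    "c": 3,
}
print(extract_handles(sample))  # ['alice', 'carol_1', 'dave']
```

For a real file, the data would come from `json.load` on the opened file before being passed to `extract_handles`.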
trojblue / cuda_11.8_installation_on_Ubuntu_22.04
Created August 30, 2023 16:56 — forked from MihailCosmin/cuda_11.8_installation_on_Ubuntu_22.04
Instructions for CUDA v11.8 and cuDNN 8.7 installation on Ubuntu 22.04 for PyTorch 2.0.0
#!/bin/bash
### steps ####
# verify the system has a cuda-capable gpu
# download and install the nvidia cuda toolkit and cudnn
# set up environment variables
# verify the installation
# https://gist.github.com/MihailCosmin/affa6b1b71b43787e9228c25fe15aeba
###
trojblue / pytthon_debug.md
Created June 27, 2023 02:50
common python debug commands

save to txt:

my_list = responses[56]

# Write one item per line
with open('my_file2.txt', 'w', encoding="utf-8") as f:
    for item in my_list:
        f.write(f"{item}\n")

save to clipboard: