bexgboost (BexTuychiev)
Feature | Jupyter Notebooks | Databricks Notebooks
Platform | Open source; runs locally or on cloud platforms | Exclusive to the Databricks platform
Collaboration and Sharing | Limited collaboration features; manual sharing | Built-in collaboration; real-time concurrent editing
Execution | Relies on local or external servers | Runs on Databricks clusters
Integration with Big Data | Can be integrated with Spark, but requires additional configuration | Native integration with Apache Spark, optimized for big data
Built-in Features | External tools/extensions for version control, collaboration, and visualization | Integrated Databricks-specific features such as Delta Lake; built-in support for collaboration and analytics tools
Cost and Scaling | Local installations are often free; cloud-based setups may incur costs | Paid service; costs depend on usage and scale with Databricks clusters
Ease of Use | Familiar and widely used in the data science community |
import pandas as pd
import seaborn as sns
# Load the dataset from Seaborn
diamonds = sns.load_dataset("diamonds")
# Create a Pandas DataFrame
df = pd.DataFrame(diamonds)
# Save the DataFrame directly as a Parquet file (requires pyarrow or fastparquet; the file name is illustrative)
df.to_parquet("diamonds.parquet")
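To confirm the export, the Parquet file can be read back with pandas. This is a minimal check, assuming pyarrow (or fastparquet) is installed and the illustrative file name above was used:

import pandas as pd
# Read the Parquet file back into a DataFrame
df_check = pd.read_parquet("diamonds.parquet")
# The shape should match df.shape from the snippet above
print(df_check.shape)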
import optuna
import xgboost as xgb
from sklearn.metrics import mean_squared_error # or any other metric
from sklearn.model_selection import train_test_split
# Load the dataset
X, y = ... # load your own
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the objective function for Optuna
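The gist cuts off at the objective function. Continuing from the imports and train/test split above, here is a minimal sketch of what it could look like, assuming an XGBoost regressor and a couple of common hyperparameters; the search ranges, metric direction, and trial count are illustrative, not the gist's actual settings:

def objective(trial):
    # Sample candidate hyperparameters for this trial (ranges are illustrative)
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
    }
    model = xgb.XGBRegressor(**params)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # Optuna minimizes the returned value when direction="minimize"
    return mean_squared_error(y_test, preds)

# Run the optimization
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)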
import pandas as pd
import numpy as np
import string
# Set the desired number of rows and columns
num_rows = 10_000_000
num_cols = 10
chunk_size = 100_000
# Define an empty DataFrame to store the chunks
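The snippet stops at the comment above. Here is a minimal sketch of one way the chunked generation could continue, assuming random numeric columns plus a random-letter string column; the column layout is illustrative, and the chunks are collected in a list and concatenated once (cheaper than growing a DataFrame chunk by chunk):

chunks = []
for start in range(0, num_rows, chunk_size):
    n = min(chunk_size, num_rows - start)
    # Random numeric columns
    chunk = pd.DataFrame(
        np.random.rand(n, num_cols),
        columns=[f"col_{i}" for i in range(num_cols)],
    )
    # One random single-letter string column (uses the string module imported above)
    chunk["letter"] = np.random.choice(list(string.ascii_letters), size=n)
    chunks.append(chunk)

# Concatenate all chunks into one large DataFrame
df = pd.concat(chunks, ignore_index=True)
print(df.shape)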
BexTuychiev / db_benchmark.py
Created March 27, 2023 17:07
Benchmark code that measures the computation time of the pandas PyArrow backend, Polars, and datatable.
import time
import datatable as dt
import pandas as pd
import polars as pl
# Define a DataFrame to store the results
results_df = pd.DataFrame(
    columns=["Function", "Library", "Runtime (s)"]
)
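Only the setup is shown in the gist. Below is a minimal sketch of the kind of timing helper that could fill results_df; the helper name, the appended columns, and the example calls (including the file path) are assumptions for illustration:

def benchmark(func, name, library):
    # Time a single call and record one row in the results table
    start = time.time()
    func()
    runtime = time.time() - start
    results_df.loc[len(results_df)] = [name, library, runtime]
    return runtime

# Example usage (file path is illustrative):
# benchmark(lambda: pd.read_csv("data.csv", engine="pyarrow"), "read_csv", "pandas (pyarrow)")
# benchmark(lambda: pl.read_csv("data.csv"), "read_csv", "polars")
# benchmark(lambda: dt.fread("data.csv"), "read_csv", "datatable")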
import os
import random
import time
import numpy as np
import pandas as pd
from faker import Faker
# Set seed for reproducibility
random.seed(42)
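The gist is truncated after the seed line. A minimal sketch of how a fake dataset could be generated from here, assuming name/email/date columns; the columns, helper function, and row count are illustrative:

np.random.seed(42)
Faker.seed(42)
fake = Faker()

def generate_rows(n):
    # Build n fake customer-style records (columns are illustrative)
    return pd.DataFrame({
        "name": [fake.name() for _ in range(n)],
        "email": [fake.email() for _ in range(n)],
        "signup_date": [fake.date_between(start_date="-2y", end_date="today") for _ in range(n)],
        "purchases": np.random.randint(0, 50, size=n),
    })

start = time.time()
df = generate_rows(10_000)
print(f"Generated {len(df):,} rows in {time.time() - start:.2f} s")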
FROM nvidia/cuda:11.2.0-runtime-ubuntu20.04
# install utilities
RUN apt-get update && \
    apt-get install --no-install-recommends -y curl

ENV CONDA_AUTO_UPDATE_CONDA=false \
    PATH=/opt/miniconda/bin:$PATH

# Install Miniconda into /opt/miniconda (the prefix expected by the PATH set above)
RUN curl -sLo ~/miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-py38_4.9.2-Linux-x86_64.sh \
    && chmod +x ~/miniconda.sh \
    && ~/miniconda.sh -b -p /opt/miniconda \
    && rm ~/miniconda.sh
$ dvc remove dvclive.dvc models.dvc
$ rm -rf dvclive models
$ git add --all
$ git commit -m "Remove all experiments"
$ git tag "cnn32"