Russell Jurney rjurney

@rjurney
rjurney / academic.py
Last active September 26, 2023 13:04
Q&A on all your academic papers…
import logging
import os
from langchain.chains import ConversationalRetrievalChain
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import Chroma
@rjurney
rjurney / centrality.groovy
Created September 9, 2023 01:37
Feature engineering using JanusGraph for Relato's business graph - customer recommender system in 2014
// Use this to start up a session
conf = new BaseConfiguration()
conf.setProperty("storage.directory", "/Users/rjurney/Software/marketing/titan/data")
conf.setProperty("storage.backend", "berkeleyje")
graph = TitanFactory.open(conf)
// Get a graph traverser
g = graph.traversal()
// Various centralities to use as features - JSONize and save
@rjurney
rjurney / Dockerfile
Last active August 10, 2023 03:50
Dockerfile to use a named Anaconda Python 'conda' environment as a kernel in a Jupyter notebook with Poetry
# Start from a Jupyter Docker Stacks version
FROM jupyter/scipy-notebook:python-3.10.11
# Work in the jovyan user's home directory
WORKDIR "/home/${NB_USER}"
# Needed for Poetry package management: no venv, latest Poetry; GRANT_SUDO doesn't work :(
ENV POETRY_VIRTUALENVS_CREATE=false \
POETRY_VERSION=1.4.2 \
GRANT_SUDO=yes
@rjurney
rjurney / assortativity.md
Created July 6, 2023 06:35
What is wrong with this Markdown? Why won't Jupyter parse it?

Assortativity

Assortativity in networks refers to a correlation pattern observed in real-world networks where nodes are preferentially connected to other nodes that are like (or unlike) them in some way. This is essentially a bias in connection preference.

--ChatGPT4

A related term is assortative mixing:

In the study of complex networks, assortative mixing, or assortativity, is a bias in favor of connections between network nodes with similar characteristics. In the specific case of social networks, assortative mixing is also known as homophily. The rarer disassortative mixing is a bias in favor of connections between dissimilar nodes.
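As a minimal sketch of the idea, the snippet below measures assortativity by node degree, which is only one possible notion of nodes being "like" each other. It computes the Pearson correlation between the degrees found at the two ends of every edge in a simple undirected graph. The function name `degree_assortativity` and the edge-list input format are assumptions made for illustration, not something taken from the gist.

```python
from collections import Counter


def degree_assortativity(edges):
    """Pearson correlation of the degrees at either end of each edge.

    `edges` is a list of (u, v) pairs describing a simple undirected graph
    (no self-loops, no duplicate edges). Returns NaN when every node has the
    same degree, because the correlation is undefined in that case.
    """
    # Count how many edges touch each node
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Each undirected edge contributes both orderings of its endpoint degrees
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    if n == 0:
        return float("nan")

    # xs and ys hold the same values in a different order, so they share
    # one mean and one variance
    mean = sum(xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        return float("nan")
    return cov / var


# A star is disassortative: the high-degree hub only touches low-degree leaves
star = [(0, 1), (0, 2), (0, 3)]
r_star = degree_assortativity(star)

# A triangle plus a separate edge is assortative: every edge joins equal degrees
like_with_like = [(0, 1), (1, 2), (0, 2), (3, 4)]
r_like = degree_assortativity(like_with_like)
```

A positive value indicates the assortative mixing described above, and a negative value indicates disassortative mixing.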

@rjurney
rjurney / poetry.toml
Last active April 12, 2023 21:46
Poetry update can't solve a Python 3.9+ project with JUST pytest as a dev dependency... what is going on here?
[virtualenvs]
create = false

Add a random ID column to a pandas DataFrame using Numpy

I needed random IDs to partition some data for Dask while writing a Parquet file from pandas, a cheap operation that did not need multiple cores. I didn't like any of the answers I found, so I hacked this recipe together myself, partly to remind myself I can still work from API docs :)

I think for efficiency you want to do this via [numpy.random.randint][1] and then make a column out of it via a [pandas.Series][2], since a Series is just a [numpy.ndarray][3] with some dressing added.

One-dimensional ndarray with axis labels (including time series).
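The approach described above might look roughly like the sketch below. The helper name `add_random_id`, the column name `random_id`, and the `n_partitions` and `seed` parameters are assumptions for illustration. It draws the integers through `numpy.random.RandomState.randint` (the seeded counterpart of `numpy.random.randint`) so the result is reproducible.

```python
from typing import Optional

import numpy as np
import pandas as pd


def add_random_id(
    df: pd.DataFrame,
    n_partitions: int,
    column: str = "random_id",
    seed: Optional[int] = None,
) -> pd.DataFrame:
    """Return a copy of `df` with a column of random integers in [0, n_partitions)."""
    rng = np.random.RandomState(seed)
    # One vectorized call generates an ID for every row
    ids = rng.randint(0, n_partitions, size=len(df))
    out = df.copy()
    # Wrap the ndarray in a Series that shares the frame's index so rows line up
    out[column] = pd.Series(ids, index=df.index)
    return out


df = pd.DataFrame({"name": list("abcdef")})
df_with_ids = add_random_id(df, n_partitions=3, seed=42)
```

The resulting column can then serve as the partitioning key. Passing `seed=None` gives different IDs on every run.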

import random
@rjurney
rjurney / first_try.py
Last active September 3, 2022 03:19
Can't install Modin with Ray with poetry on Python 3.10 on OS X
poetry add "modin[ray]"
Using version ^0.15.2 for modin
Updating dependencies
Resolving dependencies... (60.3s)
Writing lock file
Package operations: 31 installs, 0 updates, 0 removals
@rjurney
rjurney / datasets.js
Last active September 1, 2022 23:05
The fields available in the DBLP data
{
"entity_id": "<UUID4>",
"entity_type": "node",
"entity_class": "",
"@key": "conf\/www\/Ericsson07",
"@cdate": "2021-01-01",
"@mdate": "2022-08-31",
"@publtype": NaN,
"address": "",
// Note: there is another form where author is just a string - must ETL
@rjurney
rjurney / 00README.md
Last active August 25, 2022 10:41
DBLP Types, Schemas and Example Records

DBLP Training Data

I need to create a network whose edges include a SAME_AS edge type and a NOT_SAME_AS edge type for entity resolution. It will serve as training data so that @tanmoyio can proceed with training an entity resolution model in #3.

DBLP Datasets

DBLP is a database of scholarly research in computer science.

The datasets we use are the actual DBLP data and a set of labels for entity resolution of authors.
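One way such training edges could be derived is sketched below. It assumes the labels arrive as a mapping from a true-author label to the record IDs that belong to that author. That layout, the function name `label_edges`, and the example IDs are all hypothetical; the actual DBLP label format is not shown here.

```python
from itertools import combinations


def label_edges(clusters):
    """Build (record_a, record_b, edge_type) triples from labeled clusters.

    `clusters` maps a true-author label to a list of record IDs (a
    hypothetical layout). Records in the same cluster get SAME_AS edges;
    records in different clusters get NOT_SAME_AS edges.
    """
    edges = []

    # Positive examples: every pair of records within one cluster
    for records in clusters.values():
        for a, b in combinations(sorted(records), 2):
            edges.append((a, b, "SAME_AS"))

    # Negative examples: every pair of records drawn from two different clusters
    for (_, records_a), (_, records_b) in combinations(sorted(clusters.items()), 2):
        for a in records_a:
            for b in records_b:
                edges.append((a, b, "NOT_SAME_AS"))

    return edges


# Hypothetical labels: two records for one author, one record for another
example_clusters = {"author_1": ["rec_a", "rec_b"], "author_2": ["rec_c"]}
example_edges = label_edges(example_clusters)
```

Cross-cluster pairs grow quadratically with the number of records, so a real pipeline would likely sample the NOT_SAME_AS edges instead of enumerating all of them.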

@rjurney
rjurney / test_etl.py
Created August 25, 2022 03:24
Test code for attempt at PyDantic ETL code for PySpark
def test_graphlet_etl(spark_session_context) -> None:
    """Test the classes with Spark UDFs."""
    spark, sc = spark_session_context
    @F.pandas_udf("long")
    def text_runtime_to_minutes_pandas_udf(x: pd.Series) -> pd.Series:
        """text_runtime_to_minutes_pandas_udf PySpark pandas_udf to run text_runtime_to_minutes.
        Parameters