@dburkhardt
Last active April 2, 2020 16:18
Issue

Hi, I wanted to bring this up again because I've been logging some of the issues I've encountered. It seems we're at a bit of a philosophical divide, so perhaps it's best for me to simply register the use cases where AnnData / scanpy cause me friction.

Instead of pasting all the errors, I'm just going to paste code blocks I wish worked. Note that these are actual use cases I have regularly.

1. Cannot pass AnnData to numpy or sklearn operators

import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import decomposition, cluster

data = np.random.normal(size=(100,10))
adata = sc.AnnData(data)

# All of the following raise errors
np.sqrt(adata)
adata[:, adata.var_names[0:3]] - adata[:, adata.var_names[3:6]]

adata.obsm['X_PCA'] = decomposition.PCA(2).fit_transform(adata)

2. Requirement to use .var_vector or .obs_vector for single columns

# This works as expected
adata[:, adata.var_names[0:3]]

# I wish this did as well.
adata[:, adata.var_names[0]]

3. .var_vector doesn't return a Series

pdata = pd.DataFrame(data)
# Returns series
pdata[0]

# Returns ndarray (var_vector is a method and takes a key, here an obs name)
adata.var_vector(adata.obs_names[0])

4. Clusters as categories creates confusing scatterplots

sc.pp.neighbors(adata)
sc.tl.leiden(adata)

plt.scatter(adata.obs['leiden'], adata.X[:,0])

This produces the following plot. I would like the clusters to appear in order 0-5 by default.

image
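
One way to get the numeric order is to cast the labels to integers before plotting. The sketch below uses a hand-made stand-in for `adata.obs['leiden']` (the labels and values are made up) and assumes every label is a digit string.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Made-up stand-in for adata.obs['leiden']: a categorical of digit strings
leiden = pd.Series(['3', '0', '5', '1', '4', '2', '0', '3'], dtype='category')
values = np.arange(len(leiden), dtype=float)

# Integer labels put the clusters on a numeric axis, so they appear in order 0-5
leiden_int = leiden.astype(str).astype(int)

fig, ax = plt.subplots()
ax.scatter(leiden_int, values)
ax.set_xticks(sorted(leiden_int.unique()))
```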

5. Clusters as categories frustrate subclustering

sc.pp.neighbors(adata)
sc.tl.leiden(adata)

cluster_zero = adata[adata.obs['leiden'] == '0']
sub_clusters = cluster.KMeans(n_clusters=2).fit_predict(cluster_zero.X)

# Here I'm trying to break up cluster '0' into subclusters with
# new names that don't clash with the existing clusters.
# However, np.max() and the + operator aren't well defined for
# categoricals of strings
sub_clusters = sub_clusters + np.max(adata.obs['leiden'])
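
A workaround for the clash is to do the arithmetic on integer copies of the labels and convert back to strings at the end. This sketch uses made-up stand-ins for `adata.obs['leiden']` and the KMeans output, and assumes all labels are digit strings; numbering the new subclusters after the current maximum is one naming choice among several.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for adata.obs['leiden'] and the KMeans labels of cluster '0'
leiden = pd.Series(['0', '1', '0', '2', '0', '1'], dtype='category')
sub_clusters = np.array([0, 1, 1])  # one label per cell in cluster '0'

labels = leiden.astype(str).astype(int)
in_zero = labels == 0

# Sub-cluster 0 keeps the name '0'; the others are numbered after the current maximum
new_ids = np.where(sub_clusters == 0, 0, sub_clusters + labels.max())
labels.loc[in_zero] = new_ids

# Back to a categorical of strings, matching the original label type
leiden_sub = labels.astype(str).astype('category')
```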