dburkhardt/issue.md

## issue.md

      
Hi, I just wanted to bring this back up again because I've been logging some of the issues I've encountered. It seems we're at a bit of a philosophical divide, so perhaps it's best for me to just register the use cases where AnnData / scanpy are causing me friction:
Instead of pasting all errors, I'm just going to paste code blocks I wish worked. Note, these are actual use cases I have regularly.
1. Cannot pass AnnData to numpy or sklearn operators
import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import decomposition, cluster

data = np.random.normal(size=(100,10))
adata = sc.AnnData(data)

# All of the following raise errors
np.sqrt(adata)
adata[:, adata.var_names[0:3]] - adata[:, adata.var_names[3:6]]

adata.obsm['X_PCA'] = decomposition.PCA(2).fit_transform(adata)
2. Requirement to use .var_vector or .obs_vector for single columns
# This works as expected
adata[:, adata.var_names[0:3]]

# I wish this did as well.
adata[:, adata.var_names[0]]
3. .var_vector doesn't return a Series
pdata = pd.DataFrame(data)
# Returns series
pdata[0]

# Returns ndarray (var_vector behaves the same way)
adata.obs_vector(adata.var_names[0])
4. Clusters as categories creates confusing scatterplots
sc.pp.neighbors(adata)
sc.tl.leiden(adata)

plt.scatter(adata.obs['leiden'], adata.X[:,0])
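This produces a plot (not reproduced here) in which the clusters appear on the x-axis in order of first appearance rather than in numeric order. I would like the order to be 0-5 by default.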
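Two ways to get a numerically ordered axis, sketched with plain pandas. The `leiden` Series here is a hypothetical stand-in for `adata.obs['leiden']`, and the plotting calls are left as comments.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for adata.obs['leiden']: string labels as a categorical.
leiden = pd.Series(["3", "0", "10", "2", "0", "1"], dtype="category")

# Option 1: convert the labels to integers so the x-axis is numeric.
as_int = leiden.astype(int)
# plt.scatter(as_int, adata.X[:, 0])

# Option 2: keep string labels but sort the rows numerically first,
# since matplotlib places string categories in order of first appearance.
order = np.argsort(as_int.to_numpy(), kind="stable")
sorted_labels = leiden.iloc[order].astype(str)
# plt.scatter(sorted_labels, adata.X[order, 0])
```

Either way the conversion has to be done by hand, which is the part I'd like to avoid.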

5. Clusters as categories frustrate subclustering
sc.pp.neighbors(adata)
sc.tl.leiden(adata)

cluster_zero = adata[adata.obs['leiden'] == '0']
sub_clusters = cluster.KMeans(n_clusters=2).fit_predict(cluster_zero.X)

# Here I'm trying to break up cluster '0' into subclusters with
# new names that don't clash with the existing clusters.
# However, np.max() and the + operator aren't well defined for
# categoricals of strings
sub_clusters = sub_clusters + np.max(adata.obs['leiden'])
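What I end up doing instead is a round trip through integers. A minimal sketch with plain numpy/pandas: `leiden` and `sub_clusters` are hypothetical stand-ins for `adata.obs['leiden']` and the KMeans output, and the offset is `max + 1` (my assumption) so that sub-cluster 0 does not collide with the largest existing label.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for adata.obs['leiden'] and the KMeans labels
# computed on the three cells of cluster '0'.
leiden = pd.Series(["0", "1", "0", "2", "0", "1"], dtype="category")
sub_clusters = np.array([0, 1, 1])

# Boolean mask of the cells that belong to cluster '0'.
in_zero = (leiden == "0").to_numpy()

# Convert to integers so that max() and + are well defined.
labels = leiden.astype(int).to_numpy(copy=True)
offset = labels.max() + 1

# Give the sub-clusters fresh labels above every existing one.
labels[in_zero] = sub_clusters + offset

# Convert back to a categorical of strings, as scanpy expects.
new_leiden = pd.Categorical(labels.astype(str))
```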