@brianhill11
Last active August 15, 2022 13:46
Aggregate one-hot encoded vectors for multiple column values mapping to the same primary key
"""
Requires pandas and NumPy.
Example:
In: data = {"primary_key": ["a", "a", "a", "b", "b", "b", "b", "c", "c"],
            "target": ["x", "y", "z", "w", "x", "y", "z", "t", "u"]}
In: example_df = pd.DataFrame.from_dict(data)
Out:
primary_key target
0 a x
1 a y
2 a z
3 b w
4 b x
5 b y
6 b z
7 c t
8 c u
In: categorical_encoder(example_df, group_col="primary_key", target_col="target")
Out:
target_t target_u target_w target_x target_y target_z
a 0 0 0 1 1 1
b 0 0 1 1 1 1
c 1 1 0 0 0 0
"""
import numpy as np
import pandas as pd


def categorical_encoder(dataframe, group_col, target_col):
    """
    dataframe: pandas DataFrame containing the data to encode
    group_col: column in dataframe to group into a single value
    target_col: column with categorical data that will be encoded into a single vector
    """
    # get one-hot encoded vector for each row
    dummy_df = pd.get_dummies(dataframe, columns=[target_col])
    aggregated_vecs = {}
    # group by group_col, and aggregate each group by summing one-hot encoded vectors
    for name, group in dummy_df.groupby(group_col):
        # get all columns that match our dummy-variable prefix
        # (match "target_col_" so unrelated columns sharing the prefix are excluded)
        dummy_cols = group.columns[group.columns.str.startswith(target_col + "_")]
        # sum the one-hot rows into a single vector
        # (np.matrix is deprecated, so sum the plain NumPy array instead)
        aggregated_vecs[name] = np.asarray(group[dummy_cols]).sum(axis=0)
    # create dataframe from dictionary mapping group_col values to aggregated vectors
    aggregated_vecs_df = pd.DataFrame.from_dict(aggregated_vecs, orient="index")
    # add back column names
    aggregated_vecs_df.columns = dummy_df.columns[dummy_df.columns.str.startswith(target_col + "_")]
    return aggregated_vecs_df
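
A quick usage sketch, reusing the example data from the module docstring. The clipping step is an assumption for data where a (primary_key, target) pair can repeat, since repeats make the summed counts exceed 1:

example_df = pd.DataFrame.from_dict({
    "primary_key": ["a", "a", "a", "b", "b", "b", "b", "c", "c"],
    "target": ["x", "y", "z", "w", "x", "y", "z", "t", "u"],
})
encoded = categorical_encoder(example_df, group_col="primary_key", target_col="target")
# clip to keep the result strictly binary if duplicate pairs are possible
print(encoded.clip(upper=1))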
@Hywel-W commented Jul 20, 2022

Hi Brian,
I used your split_bam.py script yesterday, and it looked like it all worked according to plan, but it only output one file. Below are the command I used and the output from the script. Can you see any reason why only one output file was produced instead of two? Have I done something wrong?

[c.wpchjw@cl1 split_bam_python_script]$ python split_bam.py -b /scratch/scwc0011/ISSF3_pcHiC_Analysis/chicagoTeam-chicago-8d9f26734c5b/chicagoTools/H199-S3.bam -p 0.5
Number of reads processed: 100000
Number of reads processed: 200000
[...progress messages continue in steps of 100000...]
Number of reads processed: 4400000
Total reads processed: 4423098
Number paired reads: 2 (4.521717583467515e-05%)
Number unpaired reads: 0 (0.0%)
Number paired reads missing their mate: 4423096 (99.99995478282416%)
Number of reads in first file: 2212112 (50.01272863499746%)
Number of reads in second file: 2210986 (49.98727136500254%)

The output file was named: H199-S3_0.5.bam

Many thanks in advance for your help

Hywel

@brianhill11 (Author)

@Hywel-W I think you found a bug in my code that I can fix later this evening. In the meantime, can you try manually specifying the output files using the -o1 and -o2 command line arguments?
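
For example (output file names here are hypothetical; only the -b, -p, -o1, and -o2 arguments mentioned in this thread are assumed):

python split_bam.py -b H199-S3.bam -p 0.5 -o1 H199-S3_first.bam -o2 H199-S3_second.bam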

@Hywel-W commented Jul 20, 2022 via email

@Hywel-W commented Jul 21, 2022 via email

@Hywel-W commented Aug 15, 2022 via email

@brianhill11 (Author)

Hi @Hywel-W, I think I used Python 2.7 when I wrote this way back when. I don't think I've tried to run the script with Python 3.x, so that might be part of the issue, but I would need some more time to debug.
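
For what it's worth, one classic Python 2 vs. 3 difference that breaks scripts like this is integer division. A minimal illustration follows; this is a guess at the kind of issue, not a confirmed diagnosis of split_bam.py, and num_reads is a hypothetical name:

# Python 2: 3 / 2 == 1   (integers use floor division)
# Python 3: 3 / 2 == 1.5 (true division)
half = num_reads / 2   # float in Python 3, int in Python 2
half = num_reads // 2  # explicit floor division behaves the same in both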
