Last active
August 15, 2022 13:46
Aggregate one-hot encoded vectors for multiple column values mapping to same primary key
"""
Requires Pandas, Numpy
Example:
    In: data = {"primary_key": ["a", "a", "a", "b", "b", "b", "b", "c", "c"],
                "target": ["x", "y", "z", "w", "x", "y", "z", "t", "u"]}
    In: example_df = pd.DataFrame.from_dict(data)
    Out:
      primary_key target
    0           a      x
    1           a      y
    2           a      z
    3           b      w
    4           b      x
    5           b      y
    6           b      z
    7           c      t
    8           c      u
    In: categorical_encoder(example_df, group_col="primary_key", target_col="target")
    Out:
       target_t  target_u  target_w  target_x  target_y  target_z
    a         0         0         0         1         1         1
    b         0         0         1         1         1         1
    c         1         1         0         0         0         0
"""
import numpy as np
import pandas as pd


def categorical_encoder(dataframe, group_col, target_col):
    """
    dataframe: Pandas dataframe containing data to encode
    group_col: column in dataframe to group into single value
    target_col: column with categorical data that will be encoded into a single vector
    """
    # get one-hot encoded vector for each row
    dummy_df = pd.get_dummies(dataframe, columns=[target_col])
    # dummy columns are named "<target_col>_<value>"; match on the full prefix
    # so that other columns that merely start with target_col are not picked up
    is_dummy_col = dummy_df.columns.str.startswith(target_col + "_")
    aggregated_vecs = {}
    # group by group_col, and aggregate each group by summing one-hot encoded vectors
    for name, group in dummy_df.groupby(group_col):
        # get all columns that match our dummy variable prefix
        g = group[group.columns[is_dummy_col]]
        # sum columns together; the result is already a 1-D vector
        aggregated_vecs[name] = np.asarray(g).sum(axis=0)
    # create dataframe with dictionary mapping group_col values to aggregated vectors
    aggregated_vecs_df = pd.DataFrame.from_dict(aggregated_vecs, orient="index")
    # add back column names
    aggregated_vecs_df.columns = dummy_df.columns.values[is_dummy_col]
    return aggregated_vecs_df
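The same per-group counts can probably be obtained without an explicit loop. The following is only a sketch, not the gist's method: it swaps in `pd.crosstab` for the get_dummies-and-sum approach, and the function name `categorical_encoder_crosstab` is hypothetical. It assumes the output should keep the `<target_col>_<value>` column naming shown in the docstring example.

```python
import pandas as pd


def categorical_encoder_crosstab(dataframe, group_col, target_col):
    # Hypothetical helper, not part of the gist: count how often each
    # target value occurs for each group_col value (one row per group).
    counts = pd.crosstab(dataframe[group_col], dataframe[target_col])
    # Mirror the "<target_col>_<value>" names that pd.get_dummies produces.
    counts.columns = [f"{target_col}_{value}" for value in counts.columns]
    # Drop the index name so the result resembles the loop-based output.
    counts.index.name = None
    return counts


data = {"primary_key": ["a", "a", "a", "b", "b", "b", "b", "c", "c"],
        "target": ["x", "y", "z", "w", "x", "y", "z", "t", "u"]}
example_df = pd.DataFrame.from_dict(data)
encoded = categorical_encoder_crosstab(example_df, "primary_key", "target")
print(encoded)
```

Note that, like the summing approach, this yields counts rather than strict 0/1 flags when a group contains the same target value more than once.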
@Hywel-W I think you found a bug in my code that I can fix later this evening, but can you try manually specifying the output files using the ‘-o1’ and ‘-o2’ command line arguments?
Thanks Brian, I'll set up a run now and let you know how it goes later.
Hi Brian, I've been trying to run this script but for some reason I keep getting the following error message and I can't work out why:
Traceback (most recent call last):
  File "/scratch/scwc0011/ISSF3_pcHiC_Analysis/split_bam_python_script/split_bam.py", line 3, in <module>
    import pysam
  File "/home/c.wpchjw/.local/lib/python3.9/site-packages/pysam/__init__.py", line 4, in <module>
    from pysam.libchtslib import *
ImportError: /home/c.wpchjw/.local/lib/python3.9/site-packages/pysam/libchtslib.cpython-39-x86_64-linux-gnu.so: undefined symbol: __intel_sse2_strcpy
Does that make sense to you and if so do you know how I can fix it?
Best wishes
Hywel
Hi Brian, I was wondering if you can help as I'm still unable to run your split_bam.py script without getting the error below:
(base) ***@***.*** split_bam_python_script]$ python split_bam.py -b H180-M-00A-PT.bam -p 0.5 -o1 H180a.bam -o2 H180b.bam
Traceback (most recent call last):
  File "/scratch/scwc0011/ISSF3_pcHiC_Analysis/split_bam_python_script/split_bam.py", line 3, in <module>
    import pysam
  File "/home/c.wpchjw/.local/lib/python3.9/site-packages/pysam/__init__.py", line 4, in <module>
    from pysam.libchtslib import *
ImportError: /home/c.wpchjw/.local/lib/python3.9/site-packages/pysam/libchtslib.cpython-39-x86_64-linux-gnu.so: undefined symbol: __intel_sse2_strcpy
I am running this in Anaconda and have also tried downloading pysam, but I get the same error message. Could you let me know which version of Python this script was written for, as that might help, or whether it is available in Anaconda?
Many thanks
Hywel
Hi @Hywel-W, I think I used Python 2.7 when I wrote this way back when. I don't think I've tried to run the script with Python 3.x, so that might be part of the issue, but I would need some more time to debug.
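When comparing environments, it may help to record which interpreter is running and whether pysam imports at all. The paths in the traceback point to a copy of pysam under `~/.local/lib/python3.9/site-packages`, which suggests a user-level install rather than one managed by the Anaconda environment. The sketch below is only a diagnostic aid under that assumption; the helper name `report_import` is made up for illustration and is not part of split_bam.py or pysam.

```python
import importlib
import platform
import sys


def report_import(module_name):
    """Try to import module_name and return (ok, detail) instead of raising."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Covers both a missing module and a shared library that fails to load.
        return False, f"{type(exc).__name__}: {exc}"
    return True, getattr(module, "__version__", "unknown version")


print("Python:", platform.python_version())
print("Interpreter:", sys.executable)
ok, detail = report_import("pysam")
print("pysam import:", "ok" if ok else "failed", "-", detail)
```

Running this with the same `python` used for split_bam.py shows which Python version and which interpreter path are actually in use.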
Hi Brian,
I used your split_bam.py script yesterday and it looked like it all worked according to plan, but it only output one file. Below is the command I used and the output from the script. Can you see any reason why only one output file was produced instead of two? Have I done something wrong?
[c.wpchjw@cl1 split_bam_python_script]$ python split_bam.py -b /scratch/scwc0011/ISSF3_pcHiC_Analysis/chicagoTeam-chicago-8d9f26734c5b/chicagoTools/H199-S3.bam -p 0.5
Number of reads processed: 100000
Number of reads processed: 200000
Number of reads processed: 300000
Number of reads processed: 400000
Number of reads processed: 500000
Number of reads processed: 600000
Number of reads processed: 700000
Number of reads processed: 800000
Number of reads processed: 900000
Number of reads processed: 1000000
Number of reads processed: 1100000
Number of reads processed: 1200000
Number of reads processed: 1300000
Number of reads processed: 1400000
Number of reads processed: 1500000
Number of reads processed: 1600000
Number of reads processed: 1700000
Number of reads processed: 1800000
Number of reads processed: 1900000
Number of reads processed: 2000000
Number of reads processed: 2100000
Number of reads processed: 2200000
Number of reads processed: 2300000
Number of reads processed: 2400000
Number of reads processed: 2500000
Number of reads processed: 2600000
Number of reads processed: 2700000
Number of reads processed: 2800000
Number of reads processed: 2900000
Number of reads processed: 3000000
Number of reads processed: 3100000
Number of reads processed: 3200000
Number of reads processed: 3300000
Number of reads processed: 3400000
Number of reads processed: 3500000
Number of reads processed: 3600000
Number of reads processed: 3700000
Number of reads processed: 3800000
Number of reads processed: 3900000
Number of reads processed: 4000000
Number of reads processed: 4100000
Number of reads processed: 4200000
Number of reads processed: 4300000
Number of reads processed: 4400000
Total reads processed: 4423098
Number paired reads: 2 (4.521717583467515e-05%)
Number unpaired reads: 0 (0.0%)
Number paired reads missing their mate: 4423096 (99.99995478282416%)
Number of reads in first file: 2212112 (50.01272863499746%)
Number of reads in second file: 2210986 (49.98727136500254%)
The output file was named: H199-S3_0.5.bam
Many thanks in advance for your help
Hywel
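As a quick sanity check on the summary above, the reported counts can be recomputed by hand. This sketch only re-derives the percentages from the numbers printed in the log; the variable names are illustrative and nothing here comes from split_bam.py itself.

```python
# Counts copied from the script output above.
total_reads = 4423098
first_file = 2212112
second_file = 2210986
paired = 2
missing_mate = 4423096


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# The two per-file counts should account for every processed read.
assert first_file + second_file == total_reads
# The paired and missing-mate counts should do the same.
assert paired + missing_mate == total_reads

print(f"first file:   {percent(first_file, total_reads):.4f}%")
print(f"second file:  {percent(second_file, total_reads):.4f}%")
print(f"paired reads: {percent(paired, total_reads):.3e}%")
```

The counts are internally consistent, so the script did assign reads to two files even though only one output file name was reported.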