Avril Coghlan avrilcoghlan
@avrilcoghlan
avrilcoghlan / pdb_rest_example_get_uniprot_for_pdbid.py
Created June 18, 2019 10:37
Python script to retrieve the UniProt id for a particular PDB id
#!/usr/bin/env python
# example from https://github.com/PDBeurope/PDBe_Programming/blob/master/REST_API/snippets/basic_get_post.py
# edited to use the python 'requests' module, and to get the UniProt id for a particular PDBe entry id.
import argparse
import sys
import requests # this is used to access json files
PY3 = sys.version_info[0] >= 3 # compare the major version number rather than the version string
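The preview above stops before the lookup itself. A minimal sketch of how such a lookup might work, assuming the PDBe SIFTS mappings endpoint `https://www.ebi.ac.uk/pdbe/api/mappings/uniprot/<pdb_id>` and a JSON reply shaped like `{pdb_id: {"UniProt": {accession: {...}}}}`. The function names `extract_uniprot_ids` and `get_uniprot_ids` are hypothetical and are not taken from the original script.

```python
# Assumed endpoint; not taken from the original script.
PDBE_MAPPINGS_URL = "https://www.ebi.ac.uk/pdbe/api/mappings/uniprot/"


def extract_uniprot_ids(pdb_id, response_json):
    """Return the sorted UniProt accessions listed for pdb_id in a decoded JSON reply.

    Assumes the reply is keyed by the lower-case PDB id and has a "UniProt"
    sub-dictionary keyed by accession.
    """
    entry = response_json.get(pdb_id.lower(), {})
    return sorted(entry.get("UniProt", {}).keys())


def get_uniprot_ids(pdb_id, timeout=30):
    """Query the (assumed) mappings endpoint and return UniProt accessions for pdb_id."""
    # Imported here so that the parsing helper can be used without 'requests' installed.
    import requests

    response = requests.get(PDBE_MAPPINGS_URL + pdb_id.lower(), timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_uniprot_ids(pdb_id, response.json())
```

Separating the parsing step from the network call keeps the JSON handling testable without contacting the service.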
avrilcoghlan / retrieve_bioactivity_info_from_chembl.py
Created May 30, 2019 10:11
Python script to query the ChEMBL database to retrieve a list of compounds with bioactivities for certain target proteins, and then retrieve information on the molecular properties of those compounds
import pandas as pd # uses pandas python module to view and analyse data
import requests # this is used to access json files
#====================================================================#
# using a list of known targets, find compounds that are active on these targets:
def find_bioactivities_for_targets(targets):
    targets = ",".join(targets) # join the targets into a suitable string to fulfil the search conditions of the ChEMBL api
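The comma-joined target string in the preview is presumably passed to a ChEMBL web-service filter. A minimal sketch of building such a query, assuming the `activity.json` endpoint and the `target_chembl_id__in` filter. The URL, the `limit` default, and the helper name `build_activity_query` are assumptions, not taken from the original script.

```python
from urllib.parse import urlencode

# Assumed endpoint; not taken from the original script.
CHEMBL_ACTIVITY_URL = "https://www.ebi.ac.uk/chembl/api/data/activity.json"


def build_activity_query(targets, limit=1000):
    """Build a query URL asking for activities on any of the given target ids."""
    joined = ",".join(targets)  # same comma-joining step as in the preview
    params = {"target_chembl_id__in": joined, "limit": limit}
    # safe="," keeps the commas readable instead of encoding them as %2C
    return CHEMBL_ACTIVITY_URL + "?" + urlencode(params, safe=",")
```

The resulting URL could then be fetched with `requests.get(url).json()`, as the original imports suggest.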
avrilcoghlan / rename_genes_in_maker_gff.pl
Created August 27, 2013 14:22
Perl script that renames genes in the maker gff files so that they have unique names.
#!/usr/bin/env perl
=head1 NAME
rename_genes_in_maker_gff.pl
=head1 SYNOPSIS
rename_genes_in_maker_gff.pl input_gff output_gff outputdir species
where input_gff is the input gff file,
avrilcoghlan / treefam_gene_losses.pl
Created March 1, 2013 13:22
Perl script to identify gene losses in human since divergence from chimp, based on TreeFam trees
#!/usr/local/bin/perl
#
# Perl script treefam_genelosses.pl
# Written by Avril Coghlan (alc@sanger.ac.uk).
# 28-Aug-06.
#
# For the TreeFam project.
#
# This perl script connects to the MYSQL database of
avrilcoghlan / merge_optical_map_xml_files.py
Last active October 8, 2022 06:49
Python script for merging optical map xml files (for different scaffolds) into one large xml file
import sys
import os
from xml.etree import ElementTree as ET
import AvrilFileUtils
class Error (Exception): pass
#====================================================================#
# define a function to merge optical map xml files for different scaffolds.
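The merge function itself is cut off in this preview. A minimal sketch of one way to merge per-scaffold XML files with `ElementTree`, assuming every input file has the same root tag and that the merged file simply holds all of their child elements in order. The real optical map schema is not shown here, so the function names and this structure are hypothetical.

```python
from xml.etree import ElementTree as ET


def merge_roots(roots):
    """Return a new root element holding the children of every root in 'roots'."""
    if not roots:
        raise ValueError("no XML documents to merge")
    # Copy the tag and attributes of the first root for the merged document.
    merged = ET.Element(roots[0].tag, dict(roots[0].attrib))
    for root in roots:
        if root.tag != merged.tag:
            raise ValueError("root tags differ: %s vs %s" % (root.tag, merged.tag))
        merged.extend(list(root))  # append this file's child elements in order
    return merged


def merge_xml_files(input_paths, output_path):
    """Parse each input file, merge the roots, and write one combined XML file."""
    roots = [ET.parse(path).getroot() for path in input_paths]
    ET.ElementTree(merge_roots(roots)).write(
        output_path, encoding="utf-8", xml_declaration=True
    )
```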
avrilcoghlan / run_genewisedb_afterblast.pl
Last active June 19, 2022 12:20
Perl script to run GeneWise by comparing a file of multiple HMMs to a fasta file of multiple sequences, by running GeneWise on the regions of the DNA sequences where the proteins used to make the HMMs have tblastn matches
#!/usr/local/bin/perl
=head1 NAME
run_genewisedb_afterblast.pl
=head1 SYNOPSIS
run_genewisedb_afterblast.pl input_fasta input_hmms output outputdir spliceflat parameterfile treefam_seqs eval_cutoff flank_length blast_path
where input_fasta is the input fasta file of scaffolds,
avrilcoghlan / reformat_paralogs_file.pl
Created March 4, 2022 13:52
Perl script to reformat the file of within-species paralogs into the format that my pipeline expects
#!/usr/bin/perl
$file = $ARGV[0]; # input file of within-species paralogs from BioMart
open(FILE,"$file") || die "ERROR: cannot open $file\n";
while(<FILE>)
{
    $line = $_;
    chomp $line;
    @temp = split(/\t+/,$line);
    # Genome project Gene stable ID Paralogue gene stable ID
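The parsing step in the Perl preview can be sketched in Python as follows, assuming the three tab-separated columns named in the comment (genome project, gene stable ID, paralogue gene stable ID). The function name is hypothetical, and the output format that the pipeline expects is not shown in the preview, so the sketch stops at the parsed fields.

```python
import re


def parse_paralog_lines(lines):
    """Yield (genome, gene, paralogue) tuples from tab-separated lines.

    Mirrors the Perl split(/\\t+/, ...) by splitting on runs of tabs, and skips
    lines with fewer than three fields (for example genes with no paralogue).
    A header line, if the file has one, is not treated specially here.
    """
    for line in lines:
        fields = re.split(r"\t+", line.rstrip("\n"))
        if len(fields) < 3:
            continue
        genome, gene, paralogue = fields[:3]
        yield genome, gene, paralogue
```

Because it accepts any iterable of lines, the function can be given an open file handle directly.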
avrilcoghlan / format_blastp_output_for_chembl_humanblast.py
Last active March 4, 2022 13:28
Python script to parse BLAST output from comparing ChEMBL proteins to human proteins
import os
import sys
from collections import defaultdict
import FiftyHG_Chembl
#====================================================================#
def main():
    # find the blast output files:
avrilcoghlan / format_blastp_output_for_chembl_singleproteintargetsonly.py
Created March 4, 2022 11:33
Python script to filter BLAST hits to ChEMBL, keeping only hits to single-protein targets
import os
import sys
from collections import defaultdict
import FiftyHG_Chembl
#====================================================================#
def main():
    # find the blast output files:
avrilcoghlan / format_blastp_output_for_chembl_besthitonly.py
Created March 4, 2022 11:19
Python script to keep only the top ChEMBL hit for each query gene, plus any hits with E-values within 1e+5 of it, restricted to hits with E-value <= 1e-10
import os
import sys
from collections import defaultdict
import FiftyHG_Chembl
#====================================================================#
def main():
    # find the blast output files:
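The filtering rule in this gist's description can be sketched as below. This is an illustration under stated assumptions, not the author's code: hits are assumed to be `(query, subject, evalue)` tuples, "within 1e+5" is read as a multiplicative factor on the best E-value, and the function name is hypothetical.

```python
from collections import defaultdict


def filter_best_hits(hits, max_evalue=1e-10, evalue_ratio=1e5):
    """Keep, per query, the best hit and any hit within evalue_ratio of its E-value.

    'hits' is an iterable of (query, subject, evalue) tuples. Hits with an
    E-value above max_evalue are discarded first. Queries with no surviving
    hits do not appear in the result.
    """
    by_query = defaultdict(list)
    for query, subject, evalue in hits:
        if evalue <= max_evalue:
            by_query[query].append((subject, evalue))

    kept = {}
    for query, query_hits in by_query.items():
        best = min(evalue for _, evalue in query_hits)
        # Assumption: "within 1e+5" means at most 1e5 times the best E-value.
        # If the best E-value is exactly 0.0, only other 0.0 hits are kept.
        kept[query] = [(s, e) for s, e in query_hits if e <= best * evalue_ratio]
    return kept
```

Applying the absolute cutoff before picking the best hit gives the same result as applying it afterwards, since the best hit is always the one with the smallest E-value.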