@danilotat
Created June 26, 2024 22:19
Master directives
Download the human reference proteome (UniProt UP000005640)
```
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.fasta.gz
```
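The download is gzip-compressed, while the awk command later in these notes reads `UP000005640_9606.fasta`, so the archive presumably has to be decompressed first. A minimal sketch, assuming `gzip` is installed; the helper name `decompress_keep` and the toy file are made up for illustration:

```shell
# Decompress NAME.gz to NAME while keeping the compressed copy.
decompress_keep() {
    gzip -dc "$1" > "${1%.gz}"
}

# Demonstration on a throwaway file (toy content, not the real proteome).
workdir=$(mktemp -d)
printf '>toy header\nMKT\n' > "$workdir/toy.fasta"
gzip "$workdir/toy.fasta"                  # leaves only toy.fasta.gz
decompress_keep "$workdir/toy.fasta.gz"    # recreates toy.fasta next to it
ls "$workdir"
```

On the real download the equivalent call would be `decompress_keep UP000005640_9606.fasta.gz`.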
Last awk
```
# Print every record (header plus sequence lines) whose header mentions BRCA2
awk '/^>/ {keep = ($0 ~ /BRCA2/)} keep' UP000005640_9606.fasta
```
## Task 1
Mismatch repair (MMR) is a key DNA damage repair pathway. Many mutations in genes involved in this pathway are known to contribute to genetic instability and are therefore considered cancer risk factors. You’re interested in genes whose names start with the prefix “MLH”, “MSH” or “PMS”, such as “MLH1” or “PMS1”.
- Create a folder named `mmr_genes` in your home directory.
- How many proteins are reported in the reference proteome?
- How many different proteins have a gene name starting with “MLH”?
- Which of these proteins starts with the sequence “MASLGAN”?
- Which is the longest protein whose gene name starts with “PMS”? Export its sequence (without the header) to a file called `pms_longest.fasta` in the `mmr_genes` folder.
>NOTE: Use the `GN` (gene name) tag from the FASTA header. To solve the last question, start from the last awk command shown. Be smart, don’t use ChatGPT et similia.
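As a hint for reading the `GN` tag, here is a minimal sketch that pulls the gene name out of UniProt-style headers. The helper name `gene_names` and the toy records (accessions, gene names and sequences) are invented for illustration and are not taken from the proteome:

```shell
# Print the value of the GN= tag for every FASTA header read from stdin.
# Headers without a GN= tag produce no output.
gene_names() {
    grep '^>' | sed -n 's/.* GN=\([^ ]*\).*/\1/p'
}

# Toy records (invented accessions and sequences, not real proteome entries).
toy_fasta='>sp|X00001|TOYA_HUMAN Toy protein A OS=Homo sapiens OX=9606 GN=TOYA1 PE=1 SV=1
MKTAYIAK
>tr|X00002|X00002_HUMAN Toy protein B OS=Homo sapiens OX=9606 GN=TOYB2 PE=1 SV=1
MASLQQ
>tr|X00003|X00003_HUMAN Toy protein without a gene tag OS=Homo sapiens OX=9606 PE=4 SV=1
MGG'

printf '%s\n' "$toy_fasta" | gene_names
```

On the toy input this prints `TOYA1` and `TOYB2`, one per line. Piping the result through `sort -u` or `grep` is one way to go on from there.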
## Task 2
Write a shell script called `num_loop.sh` that loops through every number from 1 to 20 and prints each number to standard output. The script should also conditionally print “I'm big!” for every number larger than 10.
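The loop-plus-condition pattern can be sketched on a smaller range. This is only one possible layout, assuming the message goes on its own line after the number; the function name `print_range` and its three parameters are hypothetical:

```shell
# print_range START END THRESHOLD
# Print each integer from START to END; after every value above THRESHOLD,
# also print the message.
print_range() {
    start=$1
    end=$2
    threshold=$3
    i=$start
    while [ "$i" -le "$end" ]; do
        echo "$i"
        if [ "$i" -gt "$threshold" ]; then
            echo "I'm big!"
        fi
        i=$((i + 1))
    done
}

# Small demonstration: numbers 1 to 4, message for values above 2.
print_range 1 4 2
```

A `for` loop over `$(seq 1 20)` would work just as well; the `while` form is used here because it relies only on POSIX shell features.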
## Task 3
Download and decompress the GTF file from the latest Ensembl release (112 in the URL below):
```
https://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz
```
The GTF file is sorted so that features are ordered by position and length along each chromosome.
- Create a folder named `BRCA2` in your home directory.
- On which chromosome is BRCA2 located?
- Is it true that BRCA2 has 4 protein-coding transcripts?
- Export the entries reporting nonsense-mediated decay transcripts to a file called `NMD.gtf` inside the `BRCA2` folder.
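GTF rows are tab-separated, with the feature type in column 3 and the `key "value";` attributes in column 9. The sketch below shows one way to filter on both. It assumes the Ensembl attribute names `gene_name` and `transcript_biotype`. The helper names `gtf_transcripts` and `toy_gtf`, and all toy rows, are invented for illustration:

```shell
# gtf_transcripts GENE BIOTYPE
# Read a GTF on stdin and keep transcript rows for GENE with the given biotype.
gtf_transcripts() {
    awk -F'\t' -v gene="$1" -v biotype="$2" '
        /^#/ { next }
        $3 == "transcript" &&
        index($9, "gene_name \"" gene "\";") &&
        index($9, "transcript_biotype \"" biotype "\";")
    '
}

# Toy GTF (invented coordinates and identifiers, not real Ensembl entries).
toy_gtf() {
    printf '#!toy header line\n'
    printf 'T\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "TOYG"; gene_biotype "protein_coding";\n'
    printf 'T\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "TOYG"; transcript_biotype "protein_coding";\n'
    printf 'T\ttoy\ttranscript\t100\t700\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; gene_name "TOYG"; transcript_biotype "nonsense_mediated_decay";\n'
    printf 'T\ttoy\ttranscript\t950\t990\t.\t-\t.\tgene_id "G2"; transcript_id "T3"; gene_name "TOYG2"; transcript_biotype "nonsense_mediated_decay";\n'
}

# Show chromosome, start and end of the matching toy transcript.
toy_gtf | gtf_transcripts TOYG nonsense_mediated_decay | cut -f 1,4,5
```

Matching the full `gene_name "TOYG";` string, closing quote included, keeps `TOYG2` from matching when the query is `TOYG`.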
## Git & GitHub
**proteins.txt**
```
Lysozyme Chicken LYSC_CHICK P00698
Lysozyme Human LYSC_HUMAN P61626
Hemoglobin-alpha Human HBA_HUMAN P69905
Hemoglobin-beta Human HBB_HUMAN P68871
```
**proteins.txt** with length
```
Lysozyme Chicken LYSC_CHICK P00698 147
Lysozyme Human LYSC_HUMAN P61626 148
Hemoglobin-alpha Human HBA_HUMAN P69905 142
Hemoglobin-beta Human HBB_HUMAN P68871 147
PFKM Human PFKAM_HUMAN P08237 780
Pfkm Mouse PFKAM_MOUSE P47857 780
```