@ar0ch
Last active March 28, 2022 00:58
Convert CEL to VCF with PLINK
#!/bin/bash
# Requires plink and pee (from moreutils).
mkdir -p lgen plink vcf
for i in folder_of_CELs/*; do
  j=$(basename "$i")
  # Skip the 13 header lines, then feed the data rows to three awk processes at once.
  tail -n +14 "$i" | pee "awk -F '\t' '{print \"FAMID\",\$1,\$2,\$5,\$6}' > 'lgen/$j.lgen'" \
    "awk -F '\t' '{print \$3,\$2,0,\$4}' > 'lgen/$j.map'" \
    "awk -F '\t' 'NR==1 {print \"FAMID\",\$1,0,0,0,0}' > 'lgen/$j.fam'"
  # Build the binary fileset and the VCF from the same LGEN/FAM/MAP triple.
  plink --lgen "lgen/$j.lgen" --fam "lgen/$j.fam" --map "lgen/$j.map" --make-bed --out "plink/$j"
  plink --lgen "lgen/$j.lgen" --fam "lgen/$j.fam" --map "lgen/$j.map" --recode vcf --out "vcf/$j"
done
ar0ch commented Aug 29, 2019

This is not a particularly elegant or efficient solution, but it works. Thanks to pee (from moreutils), each CEL file is read only once even though three files are derived from it.
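If pee is not available, the same single-pass split could be sketched with one awk process that writes all three files itself. This is only an illustration, assuming the same tab-separated column layout and 13-line header as the script above; `split_report` is a hypothetical helper name, not part of the gist.

```shell
#!/bin/bash
# split_report is a hypothetical helper name, not part of the original gist.
# Usage: split_report <report_file> <output_prefix>
# Assumes 13 header lines and the columns:
#   Sample ID, SNP Name, Chr, Position, Allele1, Allele2 (tab-separated).
split_report() {
  local report=$1 prefix=$2
  awk -F '\t' -v prefix="$prefix" '
    NR < 14 { next }   # skip the header block
    # Write the single FAM line when the first data row is seen.
    !fam_done { print "FAMID", $1, 0, 0, 0, 0 > (prefix ".fam"); fam_done = 1 }
    {
      print "FAMID", $1, $2, $5, $6 > (prefix ".lgen")
      print $3, $2, 0, $4 > (prefix ".map")
    }
  ' "$report"
}
```

Inside the loop it might be called as `split_report "$i" "lgen/$j"`, which would replace the `tail | pee` pipeline.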

The CEL header should look something like this (13 lines, which is why the script starts reading at line 14):

[Header]
GSGT Version 2.0.4
Processing Date 5/6/2019 1:48 PM
Content GSA-24v2-0_A2.bpm
Num SNPs 665608
Total SNPs 665608
Num Samples 96
Total Samples 96
File 56 of 96
Cluster GSA-24v2-0_A1_ClusterFile.egt
Gender Male
[Data]
Sample ID SNP Name Chr Position Allele1 - Forward Allele2 - Forward
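Since the script hard-codes line 14 as the start of the data, a file with a longer or shorter header would be parsed incorrectly. A minimal sketch of locating the first data row from the `[Data]` marker instead, assuming the marker is always followed by exactly one column-header line as shown above; `data_start` is a hypothetical helper name.

```shell
#!/bin/bash
# data_start is a hypothetical helper name, not part of the original gist.
# Prints the line number of the first data row: the [Data] marker is followed
# by one column-header line, so the data begins two lines after the marker.
data_start() {
  awk '/^\[Data\]/ { print NR + 2; exit }' "$1"
}
```

With that in place, `tail -n +14 "$i"` could become `tail -n +"$(data_start "$i")" "$i"`.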

The LGEN, FAM and MAP files should look like:

==> lgen/test_ind.fam <==
FAMID test_ind 0 0 0 0

==> lgen/test_ind.lgen <==
FAMID test_ind 1:103380393 G G
FAMID test_ind 1:109439680 A A
FAMID test_ind 1:118227370 T T
FAMID test_ind 1:1183442 A G
...

==> lgen/test_ind.map <==
1 1:103380393 0 102914837
1 1:109439680 0 108897058
1 1:118227370 0 117684748
...
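Before handing the files to plink, a quick sanity check can catch a mis-parsed header. The sketch below assumes one sample per input file, as in the example above, so the FAM file should have exactly one line and the LGEN and MAP files should have one line per SNP each; `check_outputs` is a hypothetical helper name.

```shell
#!/bin/bash
# check_outputs is a hypothetical helper name, not part of the original gist.
# Usage: check_outputs <prefix>   (expects <prefix>.lgen, <prefix>.map, <prefix>.fam)
check_outputs() {
  local prefix=$1
  local n_lgen n_map n_fam
  n_lgen=$(wc -l < "$prefix.lgen")
  n_map=$(wc -l < "$prefix.map")
  n_fam=$(wc -l < "$prefix.fam")
  # One sample per file: a single FAM line, and matching LGEN/MAP line counts.
  if [ "$n_fam" -ne 1 ] || [ "$n_lgen" -ne "$n_map" ]; then
    echo "unexpected line counts for $prefix: fam=$n_fam lgen=$n_lgen map=$n_map" >&2
    return 1
  fi
}
```

For example, `check_outputs "lgen/$j" || continue` would skip a sample whose files look wrong.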
