Alper Yilmaz alperyilmaz

@alperyilmaz
alperyilmaz / uniq-solutions1.sh
Last active November 17, 2016 13:21
five most frequent words in Emma
$ cat emma.txt | tr A-Z a-z | tr -sc "a-z0-9" "\n" | sort | uniq -c | sort -nr | head -5
5242 to
5209 the
4898 and
4300 of
3192 i
alperyilmaz / uniq-solutions2.sh
Created November 17, 2016 13:14
five most frequent first letters of words in Emma
$ cat emma.txt | tr A-Z a-z | tr -sc "a-z0-9" "\n" | sort | uniq | cut -c1 | uniq -c | sort -nr | head -5
763 s
672 c
567 p
554 a
523 d
# note that the first uniq only collapses repeated words, so each distinct word is counted ONCE
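The effect of that first `uniq` is easier to see on a tiny input. The following is a minimal sketch on a made-up sentence (the `sample` text and the function names are illustrative assumptions, not part of the original gist); it runs the same pipeline with and without the deduplication step.

```shell
# Hypothetical sample text, not taken from emma.txt.
sample='The cat saw the cow. The cat sat.'

# Split into lowercase words, one per line (same tr steps as the gist).
words() {
  printf '%s\n' "$1" | tr A-Z a-z | tr -sc "a-z0-9" "\n"
}

# First-letter counts over distinct words (with the first uniq).
first_letters_distinct() {
  words "$1" | sort | uniq | cut -c1 | uniq -c | sort -nr
}

# First-letter counts over every occurrence (without the first uniq).
first_letters_all() {
  words "$1" | sort | cut -c1 | uniq -c | sort -nr
}

echo "distinct words:";  first_letters_distinct "$sample"
echo "all occurrences:"; first_letters_all "$sample"
```

With the first `uniq`, "the" contributes a single `t`; without it, the three occurrences of "the" are each counted, so the letter ranking reflects word frequency rather than vocabulary.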
alperyilmaz / uniq-solutions3.sh
Created November 17, 2016 13:18
the year in which most movies were filmed
$ cut -f3 movies | sort | uniq -c | sort -nr | head -5
441 2002
405 2000
403 2001
384 1998
384 1996
alperyilmaz / uniq-solutions4.sh
Created November 17, 2016 13:25
date and time at which most ratings were given
$ cat ratings | cut -f4 | sort | uniq -c | sort -nr | head
432 01-03-1996 00:00:00
63 26-07-2005 19:24:47
44 28-03-1996 22:58:30
42 30-03-1996 16:27:16
42 27-03-1996 19:23:03
42 16-04-1996 13:08:41
42 15-04-1996 10:23:54
42 14-04-1996 17:37:12
42 14-04-1996 14:45:40
alperyilmaz / uniq-solutions5.sh
Created November 17, 2016 13:32
frequency distribution of ratings
$ cut -f3 ratings | sort | uniq -c
94988 0.5
384180 1
118278 1.5
790306 2
370178 2.5
2356676 3
879764 3.5
2875850 4
585022 4.5
alperyilmaz / uniq-solutions6.sh
Created November 17, 2016 13:35
five most frequent letters in the harfler file, shown in alphabetical order
$ sort harfler | uniq -c | sort -nr | head -5 | sort -k2
15 e
11 f
15 n
9 u
10 y
alperyilmaz / uniq-solutions7.sh
Created November 17, 2016 13:42
frequency of year and genre combination
$ rev movies | cut -f1-2 | rev | sort | uniq -c | sort -nr | head
85 2002 Drama
84 2000 Drama
80 1998 Drama
80 1996 Drama
79 1999 Drama
78 2001 Drama
74 1995 Drama
68 1997 Drama
62 1994 Drama
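The `rev | cut -f1-2 | rev` idiom selects the last two tab-separated fields without knowing how many fields precede them. A minimal sketch on invented rows follows; the column layout of the real `movies` file is an assumption here, and the `awk` variant is shown only as an equivalent alternative, not as the author's method.

```shell
# Hypothetical tab-separated rows; the real layout of `movies` is assumed.
sample=$(printf '1\tToy Story\t1995\tAnimation\n2\tHeat\t1995\tAction\n')

# Reverse each line, keep the first two fields, reverse back.
last_two_rev() {
  printf '%s\n' "$1" | rev | cut -f1-2 | rev
}

# Same selection with awk: print the second-to-last and last fields.
last_two_awk() {
  printf '%s\n' "$1" | awk -F'\t' -v OFS='\t' '{ print $(NF-1), $NF }'
}

last_two_rev "$sample"
last_two_awk "$sample"
```

Both functions print the year and genre columns for each row; either output can then be fed to `sort | uniq -c | sort -nr` as in the gist.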
alperyilmaz / uniq-solutions8.sh
Created November 17, 2016 13:45
second most frequent start codon in E. coli coding sequences
$ cut -f2 Ecoli-cds-protein | cut -c1-3 | sort | uniq -c | sort -nr
3715 ATG
307 GTG
71 TTG
2 CTG
2 ATT
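The pipeline above prints the whole ranking, and the answer is read off the second row. As a sketch of one way to extract that row directly, `sed -n 2p` can be appended. The sample rows below are invented (a name, a tab, then a sequence) and are not real E. coli data; the function name is hypothetical.

```shell
# Hypothetical name<TAB>sequence rows; not real E. coli data.
sample=$(printf 'g1\tATGAAA\ng2\tATGCCC\ng3\tGTGAAA\ng4\tGTGTTT\ng5\tATGGGG\ng6\tTTGAAA\n')

# Rank start codons by frequency, keep the second row, print only the codon.
second_start_codon() {
  printf '%s\n' "$1" \
    | cut -f2 | cut -c1-3 \
    | sort | uniq -c | sort -nr \
    | sed -n 2p \
    | awk '{ print $2 }'
}

second_start_codon "$sample"
```

In this sample ATG occurs three times, GTG twice, and TTG once, so the function prints `GTG`. Ties in the count column would make the "second" row depend on how `sort` breaks them, which this sketch does not handle.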
Verifying that "alperyilmaz.id" is my Blockstack ID. https://onename.com/alperyilmaz
alperyilmaz / comic-test
Created April 27, 2018 08:40
bowtie test
$ docker run --rm -v $(pwd)/data2:/data comics/bowtie2 wgsim -e 0 -N 10000 -1 300 -2 300 -r 0 -R 0 /data/s_cerevisiae.fa /data/sample_seq_1.fastq /data/sample_seq_2.fastq
[wgsim] seed = 1524818001
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 18 sequences, total length: 12162995
$ ls -alh data2
total 25M
drwxr-xr-x 2 alper alper 4.0K Apr 27 11:32 .
drwxr-xr-x 4 alper alper 4.0K Apr 27 11:31 ..
-rw-r--r-- 1 root root 6.2M Apr 27 11:33 sample_seq_1.fastq