Dan Bolser dbolser-ebi

## gist:2645f0f51df05c805953
Total Elapsed Time = 8.131346 Seconds
  User+System Time = 7.401346 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 49.8   3.689  3.689 114036   0.0000 0.0000  Bit::Vector::Resize
 11.8   0.880  0.905 236667   0.0000 0.0000  Set::IntRange::Interval_Fill
 8.34   0.617  0.745 221030   0.0000 0.0000  Set::IntRange::Max
 5.51   0.408  4.097 114036   0.0000 0.0000  Set::IntRange::Resize
 3.13   0.232  0.232  31690   0.0000 0.0000  Bit::Vector::Norm
 1.82   0.135  0.197  31690   0.0000 0.0000  Set::IntRange::new

## time to die?
[dbolser@ebi-001 VCFLoad_simple]$ cat calculate_frequency.sql

-- Try adding an index...
#ALTER TABLE tmp_individual_genotype_single_bp
#  ADD INDEX allele_1_idx (allele_1),
#  ADD INDEX allele_2_idx (allele_2);

# Query OK, 277545309 rows affected (43 min 29.78 sec)

-- The above index seems to have no impact on query execution time

## gist:b5b1ea053331ad3cfb83
{{Taxobox
| image = Candida albicans 2.jpg
| image_width = 250px
| regnum = [[Fungi]]
| division = [[Ascomycota]]
| divisio = [[Ascomycota]]
| classis = [[Saccharomycetes]]
| ordo = [[Saccharomycetales]]
| familia = [[Saccharomycetaceae]]
| genus = ''[[Candida (genus)|Candida]]''

## gist:ae4ce30424a9f53e3789
git init
git remote add origin git@github.com:dbolser-ebi/VCFLoad-simple.git
git add load_vcf_simple.plx
git commit -m 'first commit'
git push
git rebase origin
git rebase origin/master
git rebase origin master
git remote -v
git fetch

## gist:e17da7821a8b9f3489e7
ensrw@mysql-eg-prod-1.ebi.ac.uk:4238 (solanum_lycopersicum_variation_27_80_250)
> SELECT COUNT(*), COUNT(DISTINCT allele_code_id), COUNT(DISTINCT allele) FROM allele_code;
+----------+--------------------------------+------------------------+
| COUNT(*) | COUNT(DISTINCT allele_code_id) | COUNT(DISTINCT allele) |
+----------+--------------------------------+------------------------+
|  1136037 |                        1136037 |                      0 |
+----------+--------------------------------+------------------------+
1 row in set (2 min 11.99 sec)
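
One possible reading of the result above: `COUNT(DISTINCT col)` skips NULLs, so a count of 0 next to 1136037 rows would be consistent with every `allele` value being NULL (empty strings would still count as one distinct value). The sketch below reproduces that behaviour on a toy in-memory SQLite table. Only the table and column names are borrowed from the query; the schema, the data and the `count_summary` helper are assumptions for illustration, and SQLite stands in for MySQL.

```python
import sqlite3


def count_summary(alleles):
    """Load a toy allele_code table and run the same three counts as the query above.

    `alleles` is a list of values for the allele column (None means NULL).
    The schema is a guess made for this sketch, not the real Ensembl table.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE allele_code ("
        " allele_code_id INTEGER PRIMARY KEY,"
        " allele TEXT)"
    )
    con.executemany(
        "INSERT INTO allele_code (allele) VALUES (?)",
        [(a,) for a in alleles],
    )
    row = con.execute(
        "SELECT COUNT(*), COUNT(DISTINCT allele_code_id), COUNT(DISTINCT allele)"
        " FROM allele_code"
    ).fetchone()
    con.close()
    return row


if __name__ == "__main__":
    # All-NULL column: rows and ids are counted, distinct alleles are not.
    print(count_summary([None, None, None]))
    # Empty strings are values, not NULLs, so they count as one distinct value.
    print(count_summary(["", ""]))
```

If that reading is right, `SELECT COUNT(*) FROM allele_code WHERE allele IS NULL;` on the real table should return the full row count.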


## some.sh
url=http://ves-ebi-60:8045/solr
core=transPlant-IPK

curl "${url}/${core}/update" \
    -H 'Content-type:application/xml' \
    -d '<delete><query>database_name:GEBIS</query></delete>'
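
A delete-by-query posted to Solr's update handler is normally not visible to searches until a commit happens. A minimal sketch of the same request with `commit=true` appended, built with Python's standard library; the `build_delete_request` helper is hypothetical, and whether an explicit commit is wanted here (rather than relying on autocommit) is an assumption. The request is only constructed, not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request
from xml.sax.saxutils import escape


def build_delete_request(url, core, query, commit=True):
    """Build (but do not send) a Solr delete-by-query POST request.

    url, core and query mirror the shell variables above; commit=True
    appends ?commit=true so the deletion is committed with the update.
    """
    endpoint = f"{url}/{quote(core)}/update"
    if commit:
        endpoint += "?" + urlencode({"commit": "true"})
    # Escape &, < and > so the query cannot break the XML envelope.
    body = f"<delete><query>{escape(query)}</query></delete>".encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_delete_request(
        "http://ves-ebi-60:8045/solr", "transPlant-IPK", "database_name:GEBIS"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```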


## outline.txt
We have 2 'staging' databases that we use to prepare alternating
releases of our data. One staging machine is a copy of what's
currently 'live' and the other is the place where the next
release is prepared (pre-live if you like).

We have 3 'production' databases where DB-heavy processes are run
in preparation for putting a database onto 'pre-live'.

We have 3 'development' databases where we run ad hoc analyses.

## what?
> SELECT COUNT(DISTINCT species_set_id)
FROM plantsx INNER JOIN species_set USING (genome_db_id)
INNER JOIN method_link_species_set USING (species_set_id) WHERE method_link_id = 401
GROUP BY method_link_id;
+--------------------------------+
| COUNT(DISTINCT species_set_id) |
+--------------------------------+
|                             43 |
+--------------------------------+
1 row in set (0.01 sec)

## get_all_exon_sequences.pl
#!/usr/bin/env perl

use strict;
use warnings;

use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::Registry->
  load_registry_from_db(
    -host => 'mysql-eg-prod-3.ebi.ac.uk',

## my.sql
mysql-staging-2-ensrw hordeum_vulgare_core_30_83_2 -Ne '
  SELECT CONCAT(">", name, char(10), sequence)
  FROM temp_name INNER JOIN seq_region USING (name)
  INNER JOIN dna USING (seq_region_id)' \
      > Data/Hv_IBSC_PGSB_v2/bac_assemblies/morex.fasta