This gist is for use with script described here.
(Contents converted from a file that begins with What I did to plot Shafi's DESeq2 output from
and ends with with my plot_expression script.md
)
These are the general steps for preparing the data from what DESeq2 produces. (As described in Michael Love's answer here ideally analysis of the data would be run with DESeq(dds, betaPrior=FALSE)
, but that wasn't done on the data provided me, and yet the aneuploidy was still obvious using the DESeq2 results to plot across the chromosomes. The fact the anueploidy was visible is consistent with the one time I tried it both ways, I didn't see much of an increase in log2FoldChange values for genes where I had seen using TPM values (Salmon-generated data there) show much more FoldChange. However, that may have been a very unique case, since the sample was very unique, where the 'inference' falls apart because "shrinking of fold changes requires that the software can estimate the range of reasonable values for LFC by looking at the distribution of LFCs (particularly the upper quantile of the absolute LFC)., see Michael Love's comments at 02-19-2015, 02:09 PM here):
- Deleted all rows where
lof2foldchange
wasNA
. - Added a new column called
FoldChange
. FoldChange
entries were made the inverse log, base 2 of thelof2foldchange
column.- I added an additional column, as last column on the right, that I called
rel_fix
where the value on each row in that column was1.0
. - Renamed first column header
gene
. - Saved munged data in tab-delimited form.
- Made sure line endings will be recognized by setting them to
Unix
style.
Below details the intitial steps as performed with Excel. Excel was used because that was the form the data was provided and I hoped that once the data cleaning process was worked out, the provider would be able to do those steps for similar data in the future.
Feel free to use your favorite tools to accomplish the steps.
- I deleted all rows where
lof2foldchange
wasNA
(using MBK's answer here) to select all those rows and then used menuEdit
>Delete...
to delete those seected rows, and then following that I turned off the filter.
That step was critical or Eccel wouldn't let me do the next steps for those cells without throwing an error. (I also assumed without that step my script would throw an error possibly too.) - Next I inserted a column called
FoldChange
aand set the contents for the first entry in that column equal to=2^C2
, based on https://support.microsoft.com/en-us/help/214116/xl-formulas-to-find-the-log-and-inverse-log-of-a-number .
Then copied that cell to all the others below that.
I did that step here because my script already handles the conversion to log2 and labels the plot axis appropriately. While I can tell it not to convert to log2, the axis will not refelct that the data is actually in log2 form, and so just easier to undo the log2 calculation here and let the script later recalculate it as it does by default. - I added an additional column, as last column on the right, that I called
rel_fix
. In that column I set all entries to be 1.0 by setting the first one and copying the first one and pasting to all.
The idea is that this will allow me to get theFoldChange
value using values from two columns as as myplot_expression_across_chromosomes.py
is set up to do. That will work because I can use theFoldChange
as the experimental values and tell the script to divide it by therel_fix
column which will be1.0
, and so when my script divides experimental data by wild-type data it always gives what the result theFoldChange
already shows as the result. - Made first column header
gene
.
Not sure if that was necessary but by doing it, things will be consistent with my old data and it better insures a small thing doesn't block the script running correctly. - Saved that data as tab-delimited
data.txt
since Excel won't let me save wtih.tsv
extension. - Exited Excel and renamed the file
data.tsv
in the operating system. - Opened the
data.tsv
in Sublime Text and changedline endings
toUnix
underView
>Line Endings
.
For me this step was CRITICAL, but may not be in all cases.
(Otherwise the ending for me wasMac OS
style and while they still looked okay in browser at PythonAnywhere, I saw an error where it could only read one line as diagnosed by temporarily insertingprint(line)
after the linefor line in data_file:
inplot_expression_across_chromosomes.py
.)
gene baseMean log2FoldChange FoldChange lfcSE stat pvalue padj rel_fix
Q0045 43.12549925 -0.299218949 0.812692255 0.39772274 -0.752330503 0.451852329 0.460098844 1.0
Q0050 306.7615864 -0.21163634 0.863557208 0.435968249 -0.485439801 0.627364405 0.777204768 1.0
Q0055 133.1451461 -0.252451454 0.839468762 0.423703906 -0.595820455 0.551295177 0.718261454 1.0
Q0060 5.127642428 -0.281679976 0.822632528 0.410742269 -0.685782784 0.492850094 0.673742795 1.0
Q0065 4.502251164 -0.319113731 0.801562138 0.358650008 -0.889763626 0.373592822 0.566250389 1.0
Q0070 1.905722251 -0.114269939 0.923849697 0.241490485 -0.473186092 0.636080419 NA 1.0
Q0075 2.705718533 0.242002044 1.182632676 0.362736728 0.667156165 0.504672388 0.681654476 1.0
....
Documentation for the script being used is found here.
To run the script:
Uploaded Excel output to PythonAnywhere in a directory where I already had the script.
We'll refer to the data file as data.tsv
here.
Ran command:
python plot_expression_across_chromosomes.py data.tsv --columns 1,9,4 --exp_desig <experimental_designation> --base_desig wt
Where columns 1,9,4
referenced the gene names, the rel_fix
(last) column that is used in place of the typical wild-type data value, and the FoldChange column for the experimental data, respectively; experimental_designation
was replaced with text identifying the strain designation or allele involved.
Oter times, the options --save_vg
and --smooth
were used in conjunction with the rest of that command to control output.