Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save fomightez/87bbf42b53ba001329785b44e8f652f8 to your computer and use it in GitHub Desktop.
Save fomightez/87bbf42b53ba001329785b44e8f652f8 to your computer and use it in GitHub Desktop.
Converting DESeq2 output to use in plot_expression_across_chromosomes.py

Converting DESeq2 output to use in plot_expression_across_chromosomes.py

This gist is for use with script described here.
(Contents converted from a file that begins with What I did to plot Shafi's DESeq2 output from and ends with with my plot_expression script.md)

Data preparation

These are the general steps for preparing the data from what DESeq2 produces. (As described in Michael Love's answer here ideally analysis of the data would be run with DESeq(dds, betaPrior=FALSE), but that wasn't done on the data provided me, and yet the aneuploidy was still obvious using the DESeq2 results to plot across the chromosomes. The fact the anueploidy was visible is consistent with the one time I tried it both ways, I didn't see much of an increase in log2FoldChange values for genes where I had seen using TPM values (Salmon-generated data there) show much more FoldChange. However, that may have been a very unique case, since the sample was very unique, where the 'inference' falls apart because "shrinking of fold changes requires that the software can estimate the range of reasonable values for LFC by looking at the distribution of LFCs (particularly the upper quantile of the absolute LFC)., see Michael Love's comments at 02-19-2015, 02:09 PM here):

  • Deleted all rows where lof2foldchange was NA.
  • Added a new column called FoldChange.
  • FoldChange entries were made the inverse log, base 2 of the lof2foldchange column.
  • I added an additional column, as last column on the right, that I called rel_fix where the value on each row in that column was 1.0.
  • Renamed first column header gene.
  • Saved munged data in tab-delimited form.
  • Made sure line endings will be recognized by setting them to Unix style.

Below details the intitial steps as performed with Excel. Excel was used because that was the form the data was provided and I hoped that once the data cleaning process was worked out, the provider would be able to do those steps for similar data in the future.
Feel free to use your favorite tools to accomplish the steps.

Data preparation as done with Excel

  • I deleted all rows where lof2foldchange was NA (using MBK's answer here) to select all those rows and then used menu Edit > Delete... to delete those seected rows, and then following that I turned off the filter.
    That step was critical or Eccel wouldn't let me do the next steps for those cells without throwing an error. (I also assumed without that step my script would throw an error possibly too.)
  • Next I inserted a column called FoldChange aand set the contents for the first entry in that column equal to =2^C2, based on https://support.microsoft.com/en-us/help/214116/xl-formulas-to-find-the-log-and-inverse-log-of-a-number .
    Then copied that cell to all the others below that.
    I did that step here because my script already handles the conversion to log2 and labels the plot axis appropriately. While I can tell it not to convert to log2, the axis will not refelct that the data is actually in log2 form, and so just easier to undo the log2 calculation here and let the script later recalculate it as it does by default.
  • I added an additional column, as last column on the right, that I called rel_fix. In that column I set all entries to be 1.0 by setting the first one and copying the first one and pasting to all.
    The idea is that this will allow me to get the FoldChange value using values from two columns as as my plot_expression_across_chromosomes.pyis set up to do. That will work because I can use the FoldChange as the experimental values and tell the script to divide it by the rel_fix column which will be 1.0, and so when my script divides experimental data by wild-type data it always gives what the result the FoldChange already shows as the result.
  • Made first column header gene.
    Not sure if that was necessary but by doing it, things will be consistent with my old data and it better insures a small thing doesn't block the script running correctly.
  • Saved that data as tab-delimited data.txt since Excel won't let me save wtih .tsv extension.
  • Exited Excel and renamed the file data.tsv in the operating system.
  • Opened the data.tsv in Sublime Text and changed line endings to Unix under View > Line Endings.
    For me this step was CRITICAL, but may not be in all cases.
    (Otherwise the ending for me was Mac OS style and while they still looked okay in browser at PythonAnywhere, I saw an error where it could only read one line as diagnosed by temporarily inserting print(line) after the line for line in data_file: in plot_expression_across_chromosomes.py.)

Sample of first few lines of the prepared data

gene	baseMean	log2FoldChange	FoldChange	lfcSE	stat	pvalue	padj	rel_fix
Q0045	43.12549925	-0.299218949	0.812692255	0.39772274	-0.752330503	0.451852329	0.460098844	1.0
Q0050	306.7615864	-0.21163634	0.863557208	0.435968249	-0.485439801	0.627364405	0.777204768	1.0
Q0055	133.1451461	-0.252451454	0.839468762	0.423703906	-0.595820455	0.551295177	0.718261454	1.0
Q0060	5.127642428	-0.281679976	0.822632528	0.410742269	-0.685782784	0.492850094	0.673742795	1.0
Q0065	4.502251164	-0.319113731	0.801562138	0.358650008	-0.889763626	0.373592822	0.566250389	1.0
Q0070	1.905722251	-0.114269939	0.923849697	0.241490485	-0.473186092	0.636080419	NA	1.0
Q0075	2.705718533	0.242002044	1.182632676	0.362736728	0.667156165	0.504672388	0.681654476	1.0
....

Executing the script with the prepared data

Documentation for the script being used is found here.

To run the script:

Uploaded Excel output to PythonAnywhere in a directory where I already had the script.

We'll refer to the data file as data.tsv here.

Ran command:

python plot_expression_across_chromosomes.py data.tsv --columns 1,9,4 --exp_desig <experimental_designation> --base_desig wt

Where columns 1,9,4 referenced the gene names, the rel_fix (last) column that is used in place of the typical wild-type data value, and the FoldChange column for the experimental data, respectively; experimental_designation was replaced with text identifying the strain designation or allele involved.

Oter times, the options --save_vg and --smooth were used in conjunction with the rest of that command to control output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment