Last active May 12, 2019 18:07
Transpose Beamer
import Data.List
import qualified Data.ByteString.Char8 as B
main :: IO ()
main = do
  stdin <- B.getContents
  B.putStrLn $ (B.intercalate (B.pack "\n")
    . map (B.intercalate (B.pack "\t"))
    . transpose . map (B.split '\t')
    . B.split '\n') stdin
(cat list.txt | while read f; do   # several
    cut -f $col "$f" | paste -s    # details
  done
) | gzip > $tpfile                 # omitted; e.g. rowname equality check.
transpose -i $tpfile -o $dst       # Conflicted, but: bottleneck process.
main :: IO ()
main = interact $ unlines . map unwords . transpose . map words . lines
transpose :: [[a]] -> [[a]]
transpose = foldr (zipWith (:)) (repeat [])
import sys
import numpy
data = numpy.loadtxt(sys.argv[1], delimiter="\t", dtype='str')
numpy.savetxt(sys.stdout, data.transpose(), fmt='%s', delimiter="\t")
import sys
import pandas
df = pandas.read_table(sys.argv[1], sep='\t', dtype='str')
df.transpose().to_csv(sys.stdout, sep="\t")
my @data = map { chomp; [ split "\t" ] } <>;
my @idx = 0..$#data;
for (my $i = 0; $i < @{$data[0]}; $i++) {
  print join "\t", map { $data[$_][$i] } @idx;
  print "\n";
}
args <- commandArgs(trailingOnly=TRUE)
df <- read.table(gzfile(args[[1]]), colClasses=c('character'))
write.table(t(df), "", sep="\t", quote=FALSE, row.names=TRUE, col.names=FALSE)
puts{|x|x*" "}
datamash transpose < <(zcat i.gz) > >(gzip > o.gz)
% vim: ts=2 sw=0 sts=-1 et ai cole=0 wrap
%^ vim modeline to set "tabs" to be two spaces and some other stuff
% Transposition presentation roughly translated to Beamer. Will probably compile
% if you have a distribution of TeX Live (I personally use latexmk for
% compilation).
% Come to think of it, to use minted probably you will also need pygments. If
% this isn't available through a proper system package manager (which it is, use
% homebrew) you could install with pip install --user Pygments, and make sure
% that pygmentize ends up in your PATH.
% It's not a perfect translation, nor does it probably fully utilise Beamer's
% numerous features, but I think it does a fair job.
% use 16:9 aspect ratio instead of 4:3
% \documentclass[aspectratio=169]{beamer}
% this is just a theme I personally like the look of
\title{Transposing a big matrix/text file}
\subtitle{Merging many columns to create a big matrix in a text file}
\author{Stijn van Dongen}
% automatically insert title slides for sections, subsections, and
% subsubsections
% nicer tables with \toprule, \midrule, \bottomrule
% \usepackage{booktabs}
% format SI units and other numbers, eg with \num, \si, \SI
% listings of code, set to use an appropriate fontsize because everything in
% Beamer is HUGE
% Also some options to determine how it wraps code if it has to.
% linenos,
% chosen so that the listings mostly don't need to wrap
% make annoying red fboxes around $ in haskell code go away
% \usemintedstyle{friendly}
% line numbers size (although they're not currently on)
% make things that are yet to be transitioned to transparent, rather than
% invisible
% \setbeamercovered{transparent}
% :r !python -c "print('\\\\\\\\\n'.join(' & '.join(map(str, range(i, i + 5))) for i in range(0, 15, 5)))"
%^ vim command used to generate the table contents. You can recycle it by
% navigating to the line in normal mode, and then issuing the key sequence
% 0f:y$q:p<CR> where <CR> is a carriage return. Obviously needs Python.
0 & 1 & 2 & 3 & 4\\
5 & 6 & 7 & 8 & 9\\
10 & 11 & 12 & 13 & 14
% :r !python -c "print('\\\\\\\\\n'.join(' & '.join(map(str, range(i, 11 + i, 5))) for i in range(5)))"
0 & 5 & 10\\
1 & 6 & 11\\
2 & 7 & 12\\
3 & 8 & 13\\
4 & 9 & 14
\section{Transposing a big matrix in a text file}
\frametitle{Transposing a big matrix in a text file}
\item[{--}] 60k genes \(\times\) 20k samples
\item[{--}] \(\num{1.2e9}\) fields
\item[{--}] 3.5G compressed file
\item many Smart-Seq2 samples (our use case 2018/19)
\item \textbf{10x / hdf5} may make this data pattern obsolete (or just rare)
\item sometimes required either by scientist or circumstance
\item reuse old code \dots
\item 2009, ArrayExpress miRNA project at EBI, dense tables
\item custom solutions \texttt{transpose}
\item \url{}
\item not for the faint of heart (but
    \textcolor{red}{\texttt{valgrind}}-tested etc.)
\bfseries Benefits & \bfseries Costs \\
fast, low memory &
bespoke C code \\
shoveling bytes &
should be optimal (is it?) \\
& ownership \\
\item \textcolor{red}{read matrix as a single string}
\item \textcolor{red}{write transpose while walking array of ptr from sep to
\item r/w gzipped data transparently (zlib)
\item recognises header line with off-by-one field+tab or just field
\item whatever else is needed \dots
\item \textcolor{red}{\texttt{transpose(transpose(X)) == X}} (validation)
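The bullets above describe the C tool only in outline, so the following is a rough Python sketch of the general idea rather than the actual implementation: hold the whole matrix in one buffer, record where each field starts and ends, then emit the transpose by walking those offsets column by column. The function name `transpose_bytes` and the rectangular-input check are assumptions made for illustration.

```python
def transpose_bytes(buf: bytes, sep: bytes = b"\t") -> bytes:
    """Transpose a rectangular separator-delimited table held in one buffer."""
    # Drop a single trailing newline so it does not create an extra empty row.
    body = buf[:-1] if buf.endswith(b"\n") else buf

    # Pass 1: record (start, end) offsets of every field, row by row.
    rows = []
    pos = 0
    while pos <= len(body):
        eol = body.find(b"\n", pos)
        if eol == -1:
            eol = len(body)
        offsets = []
        start = pos
        while True:
            end = body.find(sep, start, eol)
            if end == -1:
                offsets.append((start, eol))
                break
            offsets.append((start, end))
            start = end + len(sep)
        rows.append(offsets)
        pos = eol + 1

    ncols = len(rows[0])
    if any(len(offsets) != ncols for offsets in rows):
        raise ValueError("ragged input: rows differ in field count")

    # Pass 2: walk the offsets column by column and copy the bytes out.
    out = bytearray()
    for j in range(ncols):
        for i, offsets in enumerate(rows):
            if i:
                out += sep
            a, b = offsets[j]
            out += body[a:b]
        out += b"\n"
    return bytes(out)


table = b"gene\ts1\ts2\ng1\t0\t1.50\n"
# Fields are never parsed, so transposing twice returns the input unchanged.
assert transpose_bytes(transpose_bytes(table)) == table
```

Because fields are copied as raw bytes and never parsed, the round-trip check in the last bullet holds for any rectangular input that ends in a newline.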
2019: how does it compare? Investigate \(\rightarrow\)
\item[\textunderscore\textunderscore] bash
\footnote{\smallish \url{}}
\item[\textunderscore\textunderscore] Python pandas
\item[\textunderscore\textunderscore] Python numpy
\item[\textunderscore\textunderscore] Vanilla python

\item[\textunderscore\textunderscore] Haskell 1 (stackoverflow)
\item[\textunderscore\textunderscore] Bytes aware Haskell
\item[\textunderscore\textunderscore] perl
\item[\textunderscore\textunderscore] R
\item[\textunderscore\textunderscore] \textcolor{red}{datamash (GNU tool)}
\item[\textunderscore\textunderscore] ruby (stackoverflow, very much not
\frametitle{Omitted solutions}
\item[--] awk
\item[--] jq
\item[--] julia
\item[--] bash
\item[--] ruby (optimised)
\frametitle{Transpose test case}
\item \{ 10k, 20k, 30k, 40k, 50k, 60k \} \(\times\) 4671 matrix
\item Largest test case:
\item[\(\circ\)] 282M fields (84\% zeroes)
\item[\(\circ\)] Compressed 125M
\item[\(\circ\)] Uncompressed 666M
Note: read cells as strings, so that\\
\texttt{transpose(transpose(X)) == X}\\
(avoid rounding/truncating/NaN/NA/null/""/conversions)
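A small illustration of why the cells are kept as strings, using made-up cell values: with string cells the double transpose is the identity, while a detour through `float` rewrites some cells. The helper names are hypothetical.

```python
def transpose(rows):
    """Transpose a rectangular list of rows."""
    return [list(column) for column in zip(*rows)]


def via_float(cell):
    """Parse a cell as a float where possible, then format it back to text."""
    try:
        return str(float(cell))
    except ValueError:
        return cell


# Made-up cell values for illustration.
cells = [["gene", "s1", "s2"],
         ["g1", "0", "1.50"],
         ["g2", "NA", "1e3"]]

# String cells survive the round trip untouched.
assert transpose(transpose(cells)) == cells

# A numeric detour changes the text of several cells.
converted = [[via_float(cell) for cell in row] for row in cells]
assert converted != cells
assert converted[1] == ["g1", "0.0", "1.5"]
assert converted[2] == ["g2", "NA", "1000.0"]
```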
% there's nothing stopping this from being a PDF/PS/<other vector format> file
\item[--] Pure Python, Haskell, datamash are effective, with different
time/memory trade-offs.
\item[--] Special purpose C code is highly effective, minimal memory
\item[--] Python data frames, perl, ruby, awk, R best avoided
\frametitle{Original problem: Creating a big matrix in a text file}
Aggregation step after a parallelised pipeline:\\
Combine 60k-element columns from thousands of result files.
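The shape of this aggregation step can be sketched as follows, assuming each result file is tab-separated and contributes one column. Each file's column is written out as one row, which is cheap to append, and the stacked rows then need a single transpose to become genes by samples. The function names and the sample contents are hypothetical, not the pipeline's own code.

```python
import io


def column_as_row(handle, col, sep="\t"):
    """Return 1-based column `col` of an open text file as one sep-joined line."""
    return sep.join(line.rstrip("\n").split(sep)[col - 1] for line in handle)


def stack_columns(handles, col, out, sep="\t"):
    """Write the chosen column of each input as one row of `out`."""
    for handle in handles:
        out.write(column_as_row(handle, col, sep) + "\n")


# Two tiny stand-ins for per-sample result files (hypothetical contents).
sample_a = io.StringIO("g1\t5\ng2\t0\n")
sample_b = io.StringIO("g1\t7\ng2\t3\n")
stacked = io.StringIO()
stack_columns([sample_a, sample_b], 2, stacked)
assert stacked.getvalue() == "5\t0\n7\t3\n"
```

Only one file is open at a time and only one column is held in memory, so the cost of the full matrix is deferred to the transpose step.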
Python: slow churn reading files, high memory
File-utils type approach:
\centering \Large
for c in zip(*(l.split() for l in zin.readlines())):
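The loop above is cut off in this copy. A complete minimal sketch in the same spirit is given here, assuming tab-separated text. It splits on the separator rather than on whitespace so that empty cells are preserved, which is a deliberate change from the line above. The name `transpose_lines` and the in-memory streams are illustrative stand-ins.

```python
import io


def transpose_lines(zin, zout, sep="\t"):
    """Read all rows from `zin` and write the transposed table to `zout`."""
    rows = (line.rstrip("\n").split(sep) for line in zin)
    # zip(*rows) materialises every row, so memory grows with the whole table;
    # it also stops at the shortest row, so ragged input is silently truncated.
    for column in zip(*rows):
        zout.write(sep.join(column) + "\n")


source = io.StringIO("gene\ts1\ts2\ng1\t0\t\ng2\tNA\t3\n")
target = io.StringIO()
transpose_lines(source, target)
assert target.getvalue() == "gene\tg1\tg2\ns1\t0\tNA\ns2\t\t3\n"
```

For gzipped input, `gzip.open(path, "rt")` from the standard library yields a text handle that could be passed as `zin`.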
