\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{longtable}
\usepackage{float}
\usepackage{wrapfig}
\usepackage{soul}
\usepackage{amssymb}
\usepackage{hyperref}
\title{EXERCISE 6}
\author{Andrea Crotti (\#299466), Can Liu (\#286105), Sebastian Roidl (\#281941)}
\date{December 4, 2009}
\begin{document}
\maketitle
\obeylines
\texttt{DEADLINE:} \textit{2009-12-02 Wed 12:15}\newline
\section*{1. Selection and projection}
\label{sec-2}
\subsection*{1. Given the size of a relation (1000 pages), what are the I/O costs for an equality selection on a non-key attribute for the following cases? Assume each B-tree leaf holds 20 record pointers and each page contains 10 records.}
\label{sec-2.1}
In the worst case an equality selection must scan the whole relation and check the selection attribute of every record.
Using a (clustered or unclustered) B+-tree index can reduce the number of I/O operations.
\subsubsection*{(a) with a clustered b-tree of height 3 (matching records are located in one page)}
\label{sec-2.1.1}
This is the best scenario: a clustered B+-tree stores the data in sorted order on disk.
We only need to traverse the tree from root to leaf and then read the single data page holding the matching records:
$cost = h + 1 = 3 + 1 = 4$ I/Os
\subsubsection*{(b) without any index, nor is the file sorted on the attribute occurring in selection}
\label{sec-2.1.2}
This is the worst case: we must scan all the records and check the attribute for equality.
The number of I/Os needed is therefore the total number of pages of the relation:
$cost = 1000$ I/Os
\subsubsection*{(c) with an unclustered b-tree index of height 3, and there are 10 matching records}
\label{sec-2.1.3}
We traverse the tree and in addition must read the data pages, which this time are not contiguous; in the worst case each of the $k = 10$ matching records lies on a different page.
$cost = h + k = 3 + 10 = 13$ I/Os
\subsubsection*{(d) with an unclustered b-tree of height 3 and one tenth of the records match the selection}
\label{sec-2.1.4}
Same as before, but having more matching records makes the unclustered B+-tree much less convenient: one tenth of the $10000$ tuples match, so up to $1000$ different pages must be fetched.
$cost = h + \#tuples/10 = 3 + 1000 = 1003$ I/Os
This already exceeds the cost of a full scan ($1000$ I/Os), so for higher matching rates a sequential scan of all pages is clearly preferable.
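More generally (a rough rule of thumb, not part of the exercise), the unclustered index stops paying off once the number of matching tuples approaches the number of pages of the relation, i.e.\ here around $1000$ matches ($10\%$ of the $10000$ tuples).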
\subsection*{2. We assume the relation is of N pages.}
\label{sec-2.2}
\subsubsection*{(a) What is the complexity of sorting-based projection?}
\label{sec-2.2.1}
When doing a \emph{projection} we must
\begin{itemize}
\item remove unnecessary attributes
\item delete any duplicates that this generates
\end{itemize}
If we use a sorting-based algorithm, all duplicates end up adjacent after sorting, so they are easy to detect and delete.
The complexity is then $O(N \log{N})$, where $N$ denotes the number of pages.
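As a rough sketch (assuming a standard external merge sort with $B$ buffer pages, where $B$ is not specified in this sub-exercise): the sort needs $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$ passes, each reading and writing the relation once, so the cost is about
$2N \cdot \left(1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil\right)$ I/Os,
which is the $O(N \log N)$ behaviour stated above.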
\subsubsection*{(b) If the records are distributed uniformly, what is the complexity for hash-based projection?}
\label{sec-2.2.2}
For hash-based projection the best case is $O(N)$.
More precisely, the cost is the cost of partitioning (one read and one write of the relation) plus the cost of duplicate elimination within each partition (one more read):
$cost_{part} + cost_{dup} = (N + N) + N = 3N$ I/Os
To get good performance from hash-based projection, each partition must fit in memory.
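A rough sketch of this memory requirement (again assuming $B$ buffer pages and a uniform hash function): partitioning produces $B-1$ partitions of about $N/(B-1)$ pages each, so a single partitioning pass suffices when $N/(B-1) \lesssim B$, i.e.\ roughly when $B \gtrsim \sqrt{N}$.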
\subsubsection*{(c) Why is sorting-based projection the standard solution in practice (or why is it superior to hash-based projection)?}
\label{sec-2.2.3}
In general a sorting-based projection is used because:
\begin{itemize}
\item sorting is a basic routine in database implementations and the code is fairly optimized
\item hashing can fail (and require recursive re-partitioning) if a partition does not fit in memory
\item hash-based projection can become very slow if the data are not distributed uniformly
\end{itemize}
\section*{2. Join}
\label{sec-3}
Consider $Order \Join_{Order.cid=Customer.id} Customer$.
The cost metric is the number of page I/Os.
\begin{itemize}
\item Order contains 10,000 tuples with 10 tuples per page.
\item Customer contains 2,000 tuples and 10 tuples per page.
\item 52 buffer pages are available
\item each Customer tuple matches 5 tuples in Order
\item accessing a leaf of the $B^{+}$-tree takes 3 I/Os for Order and 2 I/Os for Customer
\end{itemize}
Formalizing, the following data are given:
\begin{itemize}
\item M = 1000 (number of pages in \emph{Order})
\item N = 200 (number of pages in \emph{Customer})
\item Po = 10 (number of tuples/page in \emph{Order})
\item Pc = 10 (number of tuples/page in \emph{Customer})
\item B = 52 (number of buffer pages)
\item Ord\_for\_Cust = 5 (Order tuples matching each Customer tuple)
\end{itemize}
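(The page counts above follow directly from the tuple counts: $M = 10000 / 10 = 1000$ and $N = 2000 / 10 = 200$.)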
\subsection*{1. What is the cost of joining Order and Customer using a page-oriented simple nested loops join?}
\label{sec-3.1}
We take Order as the outer relation and Customer as the inner one.
Since the join is page-oriented, every page of Order requires one full scan of Customer, so the cost is:
$M + (M \cdot N) = 1000 + (1000 \cdot 200) = 201000$ I/Os
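For comparison (not asked here), a tuple-at-a-time simple nested loops join, which scans Customer once per \emph{tuple} of Order, would cost $M + (Po \cdot M) \cdot N = 1000 + 10000 \cdot 200 \approx 2 \cdot 10^{6}$ I/Os.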
\subsection*{2. What is the cost of joining Order and Customer using a block nested loops join?}
\label{sec-3.2}
One buffer page is reserved for scanning \emph{Customer} (the inner relation) and one for the output, so $B - 2 = 50$ buffer pages are left to hold blocks of \emph{Order} (the outer relation).
The cost is:
$M + \lceil M / (B - 2)\rceil \cdot N = 1000 + \lceil 1000 / 50 \rceil \cdot 200 = 1000 + 20 \cdot 200 = 5000$ I/Os
\subsection*{3. What is the cost of joining Order and Customer using a sort-merge join?}
\label{sec-3.3}
First we need to sort the two relations, at a cost of $O(M \log M + N \log N)$ page I/Os.
In the merge phase Order is scanned once, and each group of matching Customer tuples is scanned once per matching Order tuple; assuming no rescanning this costs $M + N$.
So we get the final cost: $M \log M + N \log N + M + N = 1000 \log 1000 + 200 \log 200 + 1200$ I/Os
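A more concrete figure, assuming each relation is sorted with a standard two-pass external sort (possible here since $B = 52 > \sqrt{M}$): sorting costs $2 \cdot 2 \cdot M + 2 \cdot 2 \cdot N = 4000 + 800$ I/Os, and the merge adds $M + N = 1200$ I/Os, for a total of $6000$ I/Os.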
\subsection*{4. What is the cost of joining Order and Customer using a hash join?}
\label{sec-3.4}
We first partition both relations, which means reading and writing each of them once, at a cost of $2 \cdot (M + N) = 2 \cdot 1200 = 2400$ I/Os.
Then we scan the matching partitions, reading both relations once more, at a cost of $M + N = 1200$ I/Os.
$Cost = Cost(Partitioning) + Cost(Matching) = 2400 + 1200 = 3600$ I/Os
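This assumes no recursive partitioning is needed, which holds here: partitioning the smaller relation Customer into $B - 1 = 51$ partitions gives partitions of about $\lceil N / (B-1) \rceil = \lceil 200 / 51 \rceil = 4$ pages each, which comfortably fits in the $52$ available buffer pages.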
\subsection*{5. If an unclustered B+tree index existed on Order.cid or Customer.id, would either provide a cheaper alternative (index nested loops join) for performing the join than a block nested loops join?}
\label{sec-3.5}
We can use the index on the join column of one relation and make that relation the inner one of the loop (that is where most of the time is spent).
The cheaper option is to scan \emph{Customer} as the outer relation and probe the unclustered B+tree on \emph{Order.cid}: each of the $N \cdot Pc = 2000$ Customer tuples costs 3 I/Os to reach the leaf plus, in the worst case, one I/O for each of its 5 matching Order tuples.
$Cost = N + (N \cdot Pc) \cdot (3 + Ord\_for\_Cust) = 200 + 2000 \cdot (3 + 5) = 16200$ I/Os
This is still far more expensive than the block nested loops join (5000 I/Os), so in this case we don't get any advantage from the unclustered B+tree.
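The other direction (a quick check, using the given 2 I/Os to reach a Customer leaf) is no better: with Order as the outer relation and the unclustered index on Customer.id, each of the $M \cdot Po = 10000$ Order tuples matches exactly one Customer tuple, so $Cost = M + (M \cdot Po) \cdot (2 + 1) = 1000 + 30000 = 31000$ I/Os.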
\subsection*{6. Reconsider the question above when the B+-tree index is clustered.}
\label{sec-3.6}
Here the only difference is that the 5 Order tuples matching a given Customer tuple are stored contiguously (they fit on a single page), so retrieving them costs just 1 I/O thanks to the clustering.
So we get:
$Cost = N + (N \cdot Pc) \cdot (3 + 1) = 200 + 2000 \cdot 4 = 8200$ I/Os
This is cheaper than with the unclustered index, but still more expensive than the block nested loops join (5000 I/Os).
\end{document}