\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{longtable}
\usepackage{float}
\usepackage{wrapfig}
\usepackage{soul}
\usepackage{amssymb}
\usepackage{hyperref}
\title{EXERCISE 6}
\author{Andrea Crotti (\#299466), Can Liu (\#286105), Sebastian Roidl (\#281941)}
\date{4 December 2009}
\begin{document}
\maketitle
\obeylines
\section*{}
\texttt{DEADLINE:} \textit{2009-12-02 Wed 12:15}\newline
\section*{1. Selection and projection}
\label{sec-2}
\subsection*{1. Given the size of a relation (1000 pages), what are the I/O costs for an equality selection on a non-key attribute in the following cases? Assume each B-tree leaf holds 20 record pointers and each page contains 10 records}
\label{sec-2.1}
In the worst case an equality selection must be carried out by scanning the whole relation and checking the selection attribute of every record.
Using an index, such as a clustered or unclustered B+-tree, can reduce the number of I/O operations.
\subsubsection*{(a) with a clustered b-tree of height 3 (matching records are located in one page)}
\label{sec-2.1.1}
This is the best scenario: a clustered B+-tree stores the data pages in sorted order on disk.
$cost = h + 1 = 3 + 1 = 4$ I/Os: we only need to traverse the tree from the root to a leaf and then read the single data page containing the matching records.
\subsubsection*{(b) without any index, nor is the file sorted on the attribute occurring in selection}
\label{sec-2.1.2}
This is the worst possible case: we need to scan all the records and check the selection attribute for equality.
The number of I/Os needed is therefore the total number of pages of the relation:
$cost = 1000$ I/Os
\subsubsection*{(c) with an unclustered b-tree index of height 3, and there are 10 matching records}
\label{sec-2.1.3}
We traverse the tree and in addition we must read the matching data pages, which this time are not contiguous as in the clustered case, so each of the 10 matching records may require a separate page read.
$cost = h + k = 3 + 10 = 13$ I/Os, where $k$ is the number of matching records.
\subsubsection*{(d) with an unclustered b-tree of height 3 and one tenth of the records match the selection}
\label{sec-2.1.4}
As before, but having more records match the selection makes the unclustered B+-tree much less convenient.
$cost = h + \#tuples/10 = 3 + 10000/10 = 1003$ I/Os
If the matching rate were any higher, it would clearly be faster to simply scan all the pages.
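A quick check against the full scan of case (b), using only the numbers given above, shows that the index-based plan is already the (slightly) worse option here:
$cost_{index} = h + 1000 = 1003 > 1000 = cost_{scan}$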
\subsection*{2. We assume the relation has N pages.}
\label{sec-2.2}
\subsubsection*{(a) What is the complexity of sorting-based projection?}
\label{sec-2.2.1}
When doing a \emph{projection} we must
\begin{itemize}
\item remove unnecessary attributes
\item delete all the possible duplicates that we generated
\end{itemize}
If we use a sorting-based algorithm, all duplicates end up adjacent to each other, so it becomes very easy to detect and delete them.
The complexity is then $O(N \log N)$, where $N$ denotes the number of pages.
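As a rough sketch of where the $O(N \log N)$ comes from, assuming the standard external-merge-sort cost model with $B$ buffer pages ($B$ and the pass count $p$ are not part of the exercise):
$p = 1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$ passes over the data
$cost \approx 2N \cdot p = O(N \log N)$ I/Os, since each pass reads and writes the whole relation.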
\subsubsection*{(b) If the records are distributed uniformly, what is the complexity for hash-based projection?}
\label{sec-2.2.2}
For hash-based projection the best case is $O(N)$.
More precisely, the cost is the sum of the partitioning phase (reading and writing the relation) and the duplicate elimination within each bucket (reading the partitions again):
$cost_{part} + cost_{dedup} = (N + N) + N = 3N$ I/Os
To achieve this performance the buckets must fit in memory.
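A rough sketch of that memory condition, assuming $B$ buffer pages and $B-1$ output partitions (neither is specified in this sub-exercise): each partition holds about $N/(B-1)$ pages and must fit in memory during the second phase, so approximately
$N/(B-1) \le B \Rightarrow B \gtrsim \sqrt{N}$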
\subsubsection*{(c) Why is sorting-based projection the standard solution in practice (or why is it superior to hash-based projection)?}
\label{sec-2.2.3}
In general a sorting-based projection is used because:
\begin{itemize}
\item sorting is a basic routine in database implementations and the code is fairly optimized
\item hashing can fail if a bucket is too large to fit in memory
\item hash-based projection can become very slow if the data are not distributed uniformly
\end{itemize}
\section*{2. Join}
\label{sec-3}
Consider $Order \Join_{Order.cid=Customer.id} Customer$.
The cost metric is the number of page I/Os.
\begin{itemize}
\item Order contains 10,000 tuples with 10 tuples per page.
\item Customer contains 2,000 tuples, also with 10 tuples per page.
\item 52 buffer pages are available.
\item each Customer tuple matches 5 tuples in Order.
\item it takes 3 I/O operations to access a leaf of the $B^{+}$-tree for Order, and 2 I/Os for Customer.
\end{itemize}
Formalizing, the following data are given (the page counts M and N are derived right after the list):
\begin{itemize}
\item M = 1000 (number of pages in \emph{Order})
\item N = 200 (number of pages in \emph{Customer})
\item Po = 10 (number of tuples/page in \emph{Order})
\item Pc = 10 (number of tuples/page in \emph{Customer})
\item B = 52 (number of buffer pages)
\item Ord\_for\_Cust = 5 (number of Order tuples matching each Customer tuple)
\end{itemize}
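The page counts follow directly from the tuple counts and the tuples-per-page figures given in the exercise:
$M = 10000 / Po = 10000 / 10 = 1000$ pages
$N = 2000 / Pc = 2000 / 10 = 200$ pages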
\subsection*{1. What is the cost of joining Order and Customer using a page-oriented simple nested loops join?}
\label{sec-3.1}
The Order relation is the outer relation, while the Customer relation is the inner one.
Since the join is page-oriented, for each page of the outer relation we scan the whole inner relation once, so the cost is:
$M + (M * N) = 1000 + (1000 * 200) = 201000$ I/Os
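For comparison, a tuple-at-a-time simple nested loops join (scanning all of Customer once per Order tuple) would cost considerably more:
$M + (Po * M * N) = 1000 + (10 * 1000 * 200) \approx 2 \cdot 10^{6}$ I/Os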
\subsection*{2. What is the cost of joining Order and Customer using a block nested loops join?}
\label{sec-3.2}
One buffer page is reserved for scanning \emph{Customer} (the inner relation) and one for the output, so we can use $(52-2)$ buffer pages for holding blocks of \emph{Order} (the outer relation).
The cost is:
$M + \lceil M / (B - 2)\rceil * N = 1000 + \lceil 1000 / 50 \rceil * 200 = 1000 + 20 * 200 = 5000$ I/Os
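Not required by the exercise, but as a quick check of the same formula with the smaller relation (Customer) as the outer one:
$N + \lceil N / (B - 2)\rceil * M = 200 + \lceil 200 / 50 \rceil * 1000 = 200 + 4 * 1000 = 4200$ I/Os, which would be even cheaper.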
\subsection*{3. What is the cost of joining Order and Customer using a sort-merge join?}
\label{sec-3.3}
First we need to sort the two relations on the join attribute; counting in pages, this costs about $M \log M + N \log N$ I/Os.
In the merge phase Order is scanned once and each Customer group is scanned once per matching Order tuple, which in the ideal case amounts to roughly one scan of each relation.
So we get the final cost: $M \log M + N \log N + M + N = 1000 \log 1000 + 200 \log 200 + 1200$ I/Os
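A possible refinement, assuming the standard external-merge-sort cost model: with $B = 52$ buffer pages each relation can be sorted in two passes (each pass reads and writes the relation), so
$cost = 2 \cdot 2 \cdot M + 2 \cdot 2 \cdot N + (M + N) = 4000 + 800 + 1200 = 6000$ I/Os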
\subsection*{4. What is the cost of joining Order and Customer using a hash join?}
\label{sec-3.4}
We first partition both relations, which means we read and write both of them, at a cost of $2 * (M + N) = 2 * 1200 = 2400$ I/Os.
Then we scan the matching partitions, reading both relations once more, at a cost of $M + N = 1000 + 200 = 1200$ I/Os.
$Cost = Cost(Partitioning) + Cost(Matching) = 2400 + 1200 = 3600$ I/Os
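A quick sanity check on memory (a rough rule of thumb, not part of the exercise): hash join needs roughly $B > \sqrt{N}$ buffer pages so that the partitions of the smaller relation fit in memory during the matching phase, and
$\sqrt{N} = \sqrt{200} \approx 14.2 < 52 = B$
so the available buffer pages are sufficient.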
\subsection*{5. If an unclustered B+-tree index existed on Order.cid or Customer.id, would either provide a cheaper alternative (index nested loops join) for performing the join than a block nested loops join?}
\label{sec-3.5}
We can use the index on the join column of one relation and make that relation the inner one, probing the index once per tuple of the outer relation.
With Customer as the outer relation and the unclustered B+-tree on Order.cid as the inner access path (3 I/Os to reach a leaf, plus up to 5 I/Os to fetch the matching Order tuples, since they may lie on different pages) we get:
$Cost = N + N * Pc * (3 + Ord\_for\_Cust) = 200 + 2000 * (3 + 5) = 16200$ I/Os
This is far more than the 5000 I/Os of the block nested loops join, so in this case we get no advantage from using the B+-tree.
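The other direction is no better, assuming each Order tuple references exactly one Customer: with Order as the outer relation and the unclustered index on Customer.id as the inner access path (2 I/Os to reach a leaf plus 1 I/O for the single matching Customer tuple) we get
$Cost = M + M * Po * (2 + 1) = 1000 + 10000 * 3 = 31000$ I/Os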
\subsection*{6. Reconsider the question above when the B+-tree index is clustered.}
\label{sec-3.6}
Here the only difference is that the 5 matching Order tuples are now stored contiguously, so they can be fetched with a single I/O per probe instead of 5.
So we get:
$Cost = N + N * Pc * (3 + 1) = 200 + 2000 * 4 = 8200$ I/Os, still more expensive than the block nested loops join.
\end{document}