Last active: January 28, 2016 15:24
Question 1)
Discuss the trade-offs between a linked list and an array-based list,
considering both practical and theoretical considerations. In what
situations are linked lists preferable to array-based lists, and
vice-versa? Keep the answer brief.
Answer:
+------------------+-----------------------------------+
|                  | data structure                    |
+------------------+---------------------+-------------+
| operation        | array list          | linked list |
+------------------+---------------------+-------------+
| contains         | O(N)                | O(N)        |
+------------------+---------------------+-------------+
| iteration 1 item | O(1)                | O(1)        |
+------------------+---------------------+-------------+
| add              | O(1) see 1)         | O(1)        |
+------------------+---------------------+-------------+
| remove(object)   | O(N)                | O(N)        |
+------------------+---------------------+-------------+
| remove(index)    | O(N) see 2)         | O(index)    |
+------------------+---------------------+-------------+
| get(index)       | O(1)                | O(index)    |
+------------------+---------------------+-------------+
1) If the array is full, a new, larger array must be allocated and the existing elements copied over; if the array has free space, add is O(1) (amortized O(1) overall).
Also, if the required number of entries is known in advance, it is good practice to initialize the array list with that capacity.
2) In practice, if we have an array of N elements and want to remove the i-th element, we need to move N - i - 1 elements:
System.arraycopy(elementData, index + 1, elementData, index, size - index - 1);
These structures also have different memory footprints:
- An array list uses an Object[] to store pointers to values, which costs:
  (object header: 12 bytes)
  + (pointer size: 32 or 64 bits) * (size of array).
- A linked list uses a node object per value, storing a pointer to the value and links to the prev and next entries, which costs:
  ((object header: 12 bytes)
  + (pointer to value: 32 or 64 bits)
  + (pointer to next entry: 32 or 64 bits)
  + (pointer to prev entry: 32 or 64 bits)) * (count of entries).
In most cases, we use an array list for fast random access and lower memory usage.
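As a quick illustration of the table above, here is a minimal sketch (class name and sizes are illustrative assumptions) showing pre-sizing an ArrayList to avoid the grow-and-copy from note 1), and the O(1) vs O(index) cost of get:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListTradeoffs {
    public static void main(String[] args) {
        int expected = 1_000;
        // Pre-sizing avoids the reallocation described in note 1):
        // every add() is then a plain O(1) store into the backing array.
        List<Integer> arrayList = new ArrayList<>(expected);
        List<Integer> linkedList = new LinkedList<>();
        for (int i = 0; i < expected; i++) {
            arrayList.add(i);  // O(1) here thanks to pre-sizing
            linkedList.add(i); // O(1): appends at the tail pointer
        }
        // get(index): O(1) array offset vs O(index) pointer chase.
        System.out.println(arrayList.get(expected / 2));  // prints 500
        System.out.println(linkedList.get(expected / 2)); // prints 500, but walks 500 nodes
    }
}
```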
============================================================================================================ | |
Question 2)
version 1:
You have M iterators (equivalent to java.util.Iterator). Each iterator is defined as some base iterator composed
with a number of low-cost transformations to form a chain of iterators underlying the resultant iterator. Each
resultant iterator delivers a stream of ordered (and in-order) data items, which you want to combine into a single
(union) ordered stream.
Each data item is composed of a sequence of K keys [k1,k2,k3..kK], with the
order defined on k1 first, then k2, etc.; i.e. [1, 2, 3,..] < [2, 1, 3,..].
Each key sequence occurs only once per iterator, but is likely to occur in
more than one iterator.
version 2:
There are several data sources whose total data size does not fit into the PC's memory. Each iterator
returns data as keys composed of subkeys k1, k2, k3, ... The results of all the iterators must be combined
into one using the notion of a composite iterator: the data streams from all sources are merged into a single
stream in which all keys are also sorted.
============================================================================================================ | |
Part A: Outline a novel efficient algorithm for finding the union of these iterators. Try to find an algorithm that
is more efficient than O(N.M.K.lg(M)), where for simplicity of analysis we assume each iterator returns exactly
N elements. State the algorithmic complexity of your algorithm (do not prove it, but be prepared to justify it);
you may introduce new terms for use in your analysis. You may state multiple algorithms, but please indicate
the trade-offs between the choices if you do.
Answer:
Define terms:
M - number of iterators
n - number of data items per iterator
N - total number of data items, n*M in our case
K - number of keys per data item
Obviously, the best solution can't be better than O(N*K). In practice, the achievable complexity will be O(N*K*fn(M)),
where fn() is some function of M.
There is a specialized algorithm for merging multiple ordered streams, the "N-way merge", a member of the
merge-ordered-sequences family of algorithms.
A typical implementation uses a priority queue holding the current head item of each stream. For a priority queue,
inserting a new element and deleting the minimum element each cost O(log(m)), where m is the number of elements in
the queue. Retrieving all elements through the N-way merge therefore costs approximately O(N*lg(M)) comparisons,
where N is the total number of elements. Comparing two data items costs O(K). The total solution complexity is
approximately O(N*K*lg(M)), and memory usage is constant: at most M entries in memory at a time.
This means the solution is approximately a factor of O(lg(M)) slower than the ideal O(N*K).
Is this the best solution? Given the problem statement, it is impossible to merge multiple ordered streams without
comparing data items. As a result, this is a comparison-based sorting problem with its theoretical O(n*log(n)) barrier;
we can't find an algorithm that performs better than O(N*K*lg(M)).
The Google Guava library already has an implementation of the N-way merge; here is a code example:
https://gist.github.com/mitallast/4186bc5666326ffd2bf7
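The priority-queue approach described above can also be sketched directly with java.util.PriorityQueue, independent of the Guava example below (class and method names here are hypothetical):

```java
import java.util.*;

public class KWayMerge {
    /** Merges M sorted iterators into one sorted list via a min-heap of size M. */
    static <T> List<T> merge(List<Iterator<T>> iterators, Comparator<T> cmp) {
        // The heap holds the current head item of each non-exhausted iterator.
        Comparator<Map.Entry<T, Iterator<T>>> byHead = Map.Entry.comparingByKey(cmp);
        PriorityQueue<Map.Entry<T, Iterator<T>>> heap = new PriorityQueue<>(byHead);
        for (Iterator<T> it : iterators) {
            if (it.hasNext()) {
                heap.add(new AbstractMap.SimpleEntry<>(it.next(), it));
            }
        }
        List<T> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            // poll/add each cost O(lg M) comparisons; each comparison costs O(K).
            Map.Entry<T, Iterator<T>> head = heap.poll();
            out.add(head.getKey());
            Iterator<T> source = head.getValue();
            if (source.hasNext()) {
                heap.add(new AbstractMap.SimpleEntry<>(source.next(), source));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> iterators = Arrays.asList(
            Arrays.asList(1, 4, 7).iterator(),
            Arrays.asList(2, 5, 8).iterator(),
            Arrays.asList(3, 6, 9).iterator());
        System.out.println(merge(iterators, Comparator.naturalOrder()));
        // prints [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

Note this materializes the result into a list for simplicity; a production version would expose the merged stream as an Iterator, as Guava's Iterators.mergeSorted does.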
============================================================================================================ | |
Part B: Outline some simple steps that could be taken to redesign the
iterator definitions to reduce the constant factor overheads, and explain
why they would do so.
Answer (for question 2 version 1):
1) Simplify the data streams:
Here is a schematic representation of the application:
(loop over iterator)
-> (merge iterator)
   -> (list of resultant iterators)
      -> (resultant iterator)
         -> (chain iterator (transformation item 1))
         -> (chain iterator (transformation item 2))
         ...
         -> (chain iterator (transformation item T))
            -> (base iterator)
This structure contains multiple chain iterators, which add per-item call overhead without doing any useful work of their own.
Simplify the architecture: combine the transformations into one composite transformation:
(loop over iterator)
-> (merge iterator)
   -> (list of resultant iterators)
      -> (resultant iterator)
         -> (transformation item 1)
         -> (transformation item 2)
         -> ...
         -> (transformation item T)
         -> (base iterator)
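The simplification above can be sketched with java.util.function.Function: fold the T per-item transformations into a single composed function and wrap the base iterator once, instead of nesting T iterator wrappers (names here are illustrative):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

public class ComposedTransform {
    /** Wraps the base iterator once, applying one composed function per item. */
    static <T> Iterator<T> transform(Iterator<T> base, Function<T, T> f) {
        return new Iterator<T>() {
            public boolean hasNext() { return base.hasNext(); }
            public T next() { return f.apply(base.next()); }
        };
    }

    public static void main(String[] args) {
        List<Function<Integer, Integer>> steps = Arrays.asList(x -> x + 1, x -> x * 2);
        // Fold the T transformation steps into one function: still one call per
        // step, but only one iterator wrapper instead of T nested wrappers.
        Function<Integer, Integer> composed =
            steps.stream().reduce(Function.identity(), Function::andThen);
        Iterator<Integer> it = transform(Arrays.asList(1, 2, 3).iterator(), composed);
        while (it.hasNext()) {
            System.out.println(it.next()); // prints 4, 6, 8
        }
    }
}
```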
2) Optimize the data item representation for lower memory overhead:
A naive Java design might look like this:
class DataItem {
    private final ImmutableList<Integer> keys;
    // additional methods hidden
}
This has high memory usage (using Google Guava's RegularImmutableList as an example):
(data item header: 12 bytes)
+ (pointer to ImmutableList: 32 or 64 bits)
+ (size of RegularImmutableList of N elements:
    (object header: 12 bytes)
    + (int offset: 4 bytes)
    + (int size: 4 bytes)
    + (pointer to Object[] array: 32 or 64 bits)
    + (size of Object[]:
        (object header: 12 bytes)
        + (pointer size: 32 or 64 bits) * (number of items)
        + (size of Integer:
            (object header: 12 bytes)
            + (value int: 4 bytes)
        ) * (number of items)
    )
)
For N keys, memory usage is approximately N*20 + 48 bytes, spread over N + 3 objects.
Rewrite it as a custom immutable data structure:
class DataItem implements Comparable<DataItem> {
    private final int[] keys;
    // implement the Comparable interface
    // additional methods hidden
}
Memory usage:
(data item header: 12 bytes)
+ (pointer to int[] array: 32 or 64 bits)
+ (size of int[] array:
    (object header: 12 bytes)
    + (number of items) * (int size: 4 bytes)
)
For N keys, memory usage is approximately N*4 + 38 bytes, about 5 times less than the original, and only 2 objects.
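A minimal runnable sketch of this compact representation, including the Comparable implementation the comment above hints at (the enclosing class is illustrative):

```java
import java.util.Arrays;

public class Demo {
    // The compact representation described above: keys inlined in a primitive array.
    static final class DataItem implements Comparable<DataItem> {
        private final int[] keys;

        DataItem(int... keys) { this.keys = keys; }

        @Override
        public int compareTo(DataItem other) {
            // Lexicographic order: compare on k1 first, then k2, etc.
            int len = Math.min(keys.length, other.keys.length);
            for (int i = 0; i < len; i++) {
                int cmp = Integer.compare(keys[i], other.keys[i]);
                if (cmp != 0) return cmp;
            }
            return Integer.compare(keys.length, other.keys.length);
        }

        @Override
        public String toString() { return Arrays.toString(keys); }
    }

    public static void main(String[] args) {
        DataItem a = new DataItem(1, 2, 3);
        DataItem b = new DataItem(2, 1, 3);
        System.out.println(a.compareTo(b) < 0); // prints true: [1,2,3] < [2,1,3]
    }
}
```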
============================================================================================================ | |
Question 3)
Part A: What is a suitable data structure for storing an immutable ordered
collection of data items (as defined in Q2) for implementing a database?
Answer:
First, the solution depends on which operations need to be optimized. Usually the following operations are needed:
- an efficient append operation
- random access by index
- sequential read
A generalized solution for immutable ordered collections is the family of log-based data structures.
In practice, most production solutions implement some variant of a segmented log.
The most popular production-ready libraries, such as LevelDB, RocksDB, and Cassandra, use the Log-Structured Merge-Tree (LSM-Tree).
These solutions allow fast append operations (approximately O(1)) and fast random access (approximately O(1)
disk I/O operations with a hash-based in-memory key lookup table, or O(lg(N)) with a tree-based one).
Usually, LSM-Trees store key-value pairs in SSTables on disk:
[ key 1: value 1 ; key 2: value 2; ... ; key N: value N ]
In Q2 there are only data items, without a primary key, so we can simplify the SSTable structure for our situation:
[ data item 1; data item 2; ... ; data item N ]
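A hypothetical sketch of this simplified layout, writing data items sequentially with no per-item key prefix and recording each item's byte offset (the format and names are illustrative assumptions, not the LevelDB/RocksDB file format):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class SstableSketch {
    public static void main(String[] args) throws IOException {
        int K = 3; // keys per data item, fixed for the whole table
        int[][] items = {{1, 2, 3}, {2, 1, 3}, {2, 2, 2}};
        // In-memory stand-in for the on-disk SSTable file.
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(file);
        long[] offsets = new long[items.length]; // side index, see Part B
        for (int i = 0; i < items.length; i++) {
            offsets[i] = file.size(); // byte offset where data item i starts
            for (int k = 0; k < K; k++) {
                out.writeInt(items[i][k]); // keys written back-to-back
            }
        }
        // Each item occupies K * 4 bytes, so offsets are 0, 12, 24.
        System.out.println(Arrays.toString(offsets)); // prints [0, 12, 24]
    }
}
```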
============================================================================================================ | |
Part B: How would you optimise its in-memory representation, with Java
specifically in mind? Why? What are the trade-offs and why are they
sensible?
Answer:
The most significant trade-off is the number of disk I/O operations, because a disk access takes much more time
than a heap access.
To optimize random access by data item index, we can keep an in-memory offset index for each SSTable, for example:
- index 0 : offset
- index 1 : offset
- ...
- index N : offset
In Java, for each segment we can store the first index of the segment and an array of offsets:
class SegmentIndex {
    int firstIndex;
    long[] offsets;
}
For N data items, memory usage will be:
(segment index object header: 12 bytes)
+ (first index: 4 bytes)
+ (array pointer: platform-dependent, 32 or 64 bits)
+ (array object header: 12 bytes)
+ (long size: 8 bytes) * N.
To optimize sequential reads from an SSTable, it is possible to use a fixed-size direct buffer to reduce the number of
disk seek operations and short reads.
To optimize random-access reads, we can add a cache, e.g. with a Least-Recently-Used (LRU) eviction policy.
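The SegmentIndex above can be given a simple lookup method translating a global data item index into a file offset in O(1); a minimal sketch (method names and the enclosing class are illustrative):

```java
public class SegmentIndexDemo {
    static class SegmentIndex {
        final int firstIndex; // global index of the segment's first data item
        final long[] offsets; // file offset of each data item in this segment

        SegmentIndex(int firstIndex, long[] offsets) {
            this.firstIndex = firstIndex;
            this.offsets = offsets;
        }

        /** True if the global item index falls inside this segment. */
        boolean contains(int index) {
            return index >= firstIndex && index < firstIndex + offsets.length;
        }

        /** O(1) translation from global item index to a file offset. */
        long offsetOf(int index) {
            return offsets[index - firstIndex];
        }
    }

    public static void main(String[] args) {
        SegmentIndex seg = new SegmentIndex(100, new long[]{0, 12, 24, 36});
        System.out.println(seg.contains(102)); // prints true
        System.out.println(seg.offsetOf(102)); // prints 24
    }
}
```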
============================================================================================================ | |
Part C: If each key can be compared in byte-order (i.e. given two keys,
their respective order is defined by the first byte that differs), how does
this affect your answer to A? What if only some keys can?
Answer:
In my code example the stream is byte-ordered and the comparator is a naive pure-Java implementation. In practice,
signed or unsigned lexicographical order may be required.
If all keys can be compared in byte order, I can use the utilities Google Guava provides for this, which are optimized
to use the sun.misc.Unsafe API to compare byte[] arrays as long[] arrays for better performance.
Otherwise, a custom pure-Java comparator has to be implemented, and that solution will not be as fast as the more
generalized and optimized one based on native lexicographical order.
package org.questions;

import com.google.common.collect.Iterators;
import com.google.common.primitives.SignedBytes;
import com.google.common.primitives.UnsignedBytes;

import java.util.*;

public class Test {
    // number of iterators
    private final static int M = 10;
    // number of data items per iterator
    private final static int n = 10;
    // total number of data items
    private final static int N = n * M;
    // number of keys per data item
    private final static int K = 3;
    // deterministic pseudo-random generator
    private final static Random random = new Random(13);

    public static void main(String[] args) {
        /*
         * Complexity of the merging iterator: O(N*log(M)).
         * Complexity of the comparator: O(K).
         *
         * Total complexity: O(K*N*log(M)).
         */
        String comparatorType = Optional.of(args)
            .filter(strings -> strings.length > 0)
            .map(strings -> strings[0])
            .orElse("naive");
        final Comparator<DataItem> comparator;
        System.out.println("comparator: " + comparatorType);
        switch (comparatorType) {
            case "naive":
                comparator = new NaiveDataItemComparator();
                break;
            case "unsigned":
                comparator = new UnsignedLexicographicalDataItemComparator();
                break;
            case "signed":
                comparator = new SignedLexicographicalDataItemComparator();
                break;
            default:
                throw new IllegalArgumentException();
        }
        List<Iterator<DataItem>> iterators = baseIterators(comparator);
        Iterator<DataItem> iterator = Iterators.mergeSorted(iterators, comparator);
        DataItem prev = null;
        while (iterator.hasNext()) {
            DataItem current = iterator.next();
            System.out.println(current);
            if (prev != null && comparator.compare(prev, current) > 0) {
                throw new Error("Invalid item order");
            }
            prev = current;
        }
    }

    /**
     * @return list of M data item iterators
     */
    private static List<Iterator<DataItem>> baseIterators(Comparator<DataItem> comparator) {
        List<Iterator<DataItem>> iterators = new ArrayList<>(M);
        for (int i = 0; i < M; i++) {
            iterators.add(baseIterator(comparator));
        }
        return iterators;
    }

    /**
     * @return iterator over n ordered data items
     */
    private static Iterator<DataItem> baseIterator(Comparator<DataItem> comparator) {
        ArrayList<DataItem> dataItems = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            dataItems.add(dataItem());
        }
        dataItems.sort(comparator);
        return dataItems.iterator();
    }

    /**
     * @return DataItem with K random keys
     */
    private static DataItem dataItem() {
        byte[] data = new byte[K];
        random.nextBytes(data);
        return new DataItem(data);
    }

    /**
     * One comparison takes O(K), where K is the number of keys per data item.
     */
    private static class NaiveDataItemComparator implements Comparator<DataItem> {
        @Override
        public int compare(DataItem o1, DataItem o2) {
            for (int i = 0; i < K; i++) {
                int cmp = Byte.compare(o1.get(i), o2.get(i));
                if (cmp != 0) {
                    return cmp;
                }
            }
            return 0;
        }
    }

    private static class UnsignedLexicographicalDataItemComparator implements Comparator<DataItem> {
        @Override
        public int compare(DataItem o1, DataItem o2) {
            return UnsignedBytes.lexicographicalComparator().compare(o1.data, o2.data);
        }
    }

    private static class SignedLexicographicalDataItemComparator implements Comparator<DataItem> {
        @Override
        public int compare(DataItem o1, DataItem o2) {
            return SignedBytes.lexicographicalComparator().compare(o1.data, o2.data);
        }
    }

    private static class DataItem {
        private final byte[] data;

        public DataItem(byte[] data) {
            this.data = data;
        }

        public byte get(int index) {
            return data[index];
        }

        @Override
        public String toString() {
            return "DataItem{" + Arrays.toString(data) + '}';
        }
    }
}