Question 1)
Discuss the trade-offs between a linked list and an array-based list,
considering both practical and theoretical considerations. In what
situations are linked-lists preferable to array-based lists, and
vice-versa? Keep answer brief.
Answer:
+------------------+-----------------------------------+
|                  |          data structure           |
+------------------+---------------------+-------------+
| operation        | array list          | linked list |
+------------------+---------------------+-------------+
| contains         | O(N)                | O(N)        |
+------------------+---------------------+-------------+
| iteration 1 item | O(1)                | O(1)        |
+------------------+---------------------+-------------+
| add              | O(1) see 1)         | O(1)        |
+------------------+---------------------+-------------+
| remove(object)   | O(N)                | O(N)        |
+------------------+---------------------+-------------+
| remove(index)    | O(N) see 2)         | O(index)    |
+------------------+---------------------+-------------+
| get(index)       | O(1)                | O(index)    |
+------------------+---------------------+-------------+
1) If the array is full, a new array must be allocated and the elements copied over from the previous one, so add is amortized O(1); if the array has free space, it is O(1).
Also, if the required number of entries is known in advance, it is good practice to initialize the array list with that capacity.
2) In practice, if we have an array of N elements and want to remove the i-th element, we need to move N - i - 1 elements:
System.arraycopy(elementData, index + 1, elementData, index, size - index - 1);
These structures also have different memory footprints:
- An array list uses an Object[] to store pointers to values, which costs:
(object header - 12 bytes)
+ (pointer size - 32 or 64 bit) * (size of array).
- A linked list uses a node object to store each value plus links to the prev and next entries, which costs:
((object header - 12 bytes)
+ (pointer to value - 32 or 64 bit)
+ (pointer to next entry - 32 or 64 bit)
+ (pointer to prev entry - 32 or 64 bit)) * (count of entries).
In most cases we use an array list, for fast random access and lower memory usage; a linked list is preferable when elements are frequently inserted or removed through an iterator, since nothing needs to be shifted or reallocated.
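As a minimal illustration of footnotes 1) and 2) - a sketch, not a benchmark; class and variable names are mine, not from the question:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

public class ListTradeoffs {
    public static void main(String[] args) {
        int count = 1_000_000;
        // Footnote 1: pre-sizing avoids repeated reallocation and copying as the list grows.
        List<Integer> arrayList = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            arrayList.add(i);
        }
        // Footnote 2: remove(index) on an array list shifts the tail via arraycopy,
        // so removing near the head is the worst case (count - 1 moved elements here).
        arrayList.remove(0);

        // With a linked list, removal through an iterator is O(1) per element
        // once the iterator is positioned: the node is unlinked, nothing is moved.
        List<Integer> linkedList = new LinkedList<>(arrayList);
        Iterator<Integer> it = linkedList.iterator();
        while (it.hasNext()) {
            if (it.next() % 2 == 0) {
                it.remove();
            }
        }
        System.out.println(arrayList.size() + " " + linkedList.size());
    }
}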
============================================================================================================
Question 2)
version 1:
You have M iterators (equivalent to java.util.Iterator). Each iterator is defined as some base iterator composed
with a number of low-cost transformations to form a chain of iterators underlying the resultant iterator. Each
resultant iterator delivers a stream of ordered (and in-order) data items, which you want to combine into a single
(union) ordered stream.
Each data item is composed of a sequence of K keys [k1,k2,k3..kK], with the
order defined on k1 first, then k2, etc.; i.e. [1, 2, 3,..] < [2, 1, 3,..].
Each key sequence occurs only once per iterator, but is likely to occur in
more than one iterator.
version 2:
There are several data sources, and the data is too large to fit in the PC's memory. Each iterator returns data as
keys consisting of subkeys k1, k2, k3, ...; the results of all iterators must be combined into one using the concept
of a composite iterator, where the data streams from all sources are merged into a single stream in which all keys
are also sorted.
============================================================================================================
Part A: Outline a novel efficient algorithm for finding the union of these iterators. Try to find an algorithm that
is more efficient than O(N.M.K.lg(M)), where for simplicity of analysis we assume each iterator returns exactly
N elements. State the algorithmic complexity of your algorithm (do not prove it, but be prepared to justify it);
you may introduce new terms for use in your analysis. You may state multiple algorithms, but please indicate
the trade-offs between the choices if you do.
Answer:
Define terms:
M - number of iterators
n - number of data items per iterator
N - total number of data items, n*M in our case
K - number of keys per data item
Obviously, the best solution can't be better than O(N*K), since every item must be read and each comparison costs up
to O(K). In practice the complexity will be O(N*K*fn(M)), where fn() is some function of the number of iterators.
There is a specialized algorithm for merging multiple ordered streams - the "N-way merge" (also known as k-way
merge), a kind of merge of ordered sequences.
A typical implementation uses a priority queue holding the current head item of each stream. For a priority queue,
inserting a new element and deleting the minimum each cost O(log(m)), where m is the number of elements in the queue.
Retrieving all elements from the N-way merge therefore costs approximately O(N*lg(M)) queue operations, where N is
the total number of elements, and comparing two data items costs O(K). The total complexity is approximately
O(N*K*lg(M)), with constant memory usage: at most M entries are in memory at a time.
This means the solution is roughly a factor of lg(M) slower than the ideal O(N*K).
Is this the best solution? According to the question, it is impossible to merge multiple ordered streams without
comparing data items. As a result, this is a comparison-based sorting problem with the theoretical O(n*log(n))
barrier, so we can't find an algorithm with better worst-case performance than O(N*K*lg(M)).
The Google Guava library already provides an implementation of the N-way merge (Iterators.mergeSorted); there's a
code example below:
https://gist.github.com/mitallast/4186bc5666326ffd2bf7
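For reference, here is a minimal sketch of the priority-queue merge described above, without Guava. It collects into
a list for simplicity; a production version (like Guava's Iterators.mergeSorted) would return a lazy iterator so only
the M head entries stay in memory:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public final class KWayMerge {

    // Merges M sorted iterators into one sorted list. The queue never holds
    // more than M entries; each element passes through the queue once, with
    // O(lg M) comparisons per queue operation and O(K) per comparison,
    // giving O(N*K*lg(M)) in total.
    public static <T> List<T> merge(List<Iterator<T>> iterators, Comparator<? super T> cmp) {
        PriorityQueue<Head<T>> queue = new PriorityQueue<>(
            Math.max(1, iterators.size()),
            (a, b) -> cmp.compare(a.item, b.item));
        for (Iterator<T> it : iterators) {
            if (it.hasNext()) {
                queue.offer(new Head<>(it.next(), it));
            }
        }
        List<T> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            Head<T> min = queue.poll();        // smallest current head
            result.add(min.item);
            if (min.source.hasNext()) {        // refill from the same stream
                queue.offer(new Head<>(min.source.next(), min.source));
            }
        }
        return result;
    }

    // One queue entry: the current head of one source iterator.
    private static final class Head<T> {
        final T item;
        final Iterator<T> source;
        Head(T item, Iterator<T> source) {
            this.item = item;
            this.source = source;
        }
    }
}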
============================================================================================================
Part B: Outline some simple steps that could be taken to redesign the
iterator definitions to reduce the constant factor overheads, and explain
why they would do so.
Answer (for question 2 version 1):
1) Simplify data streams:
Here is a schematic representation of the application:
(loop over iterator)
-> (merge iterator)
-> (list of resultant iterators)
-> (resultant iterator)
-> (chain iterator (transformation item 1))
-> (chain iterator (transformation item 2))
...
-> (chain iterator (transformation item T))
-> (base iterator)
This structure contains multiple chained iterators, which add per-element call overhead without doing useful work.
Simplify the architecture by combining the transformations into a single composite transformation:
(loop over iterator)
-> (merge iterator)
-> (list of resultant iterators)
-> (resultant iterator)
-> (transformation item 1)
-> (transformation item 2)
-> ...
-> (transformation item T)
-> (base iterator)
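A hypothetical sketch of this idea, assuming each transformation maps an item to an item of the same type (the class
name ComposedIterator and the Function-based signature are my own):

import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Instead of wrapping the base iterator in T separate transforming iterators
// (T virtual hasNext()/next() calls per element), fold the transformations
// into one function and wrap the iterator once.
final class ComposedIterator<A> implements Iterator<A> {
    private final Iterator<A> base;
    private final Function<A, A> composed;

    ComposedIterator(Iterator<A> base, List<Function<A, A>> transformations) {
        this.base = base;
        Function<A, A> f = Function.identity();
        for (Function<A, A> t : transformations) {
            f = f.andThen(t); // apply in the original chain order
        }
        this.composed = f;
    }

    @Override public boolean hasNext() { return base.hasNext(); }
    @Override public A next() { return composed.apply(base.next()); }
}

This replaces T nested iterator frames with a single function chain per element, which is also easier for the JIT
to inline.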
2) Optimize data item representation for less memory overhead:
A naive Java design may look like this:
class DataItem {
    private final ImmutableList<Integer> keys;
    // additional methods hidden
}
This incurs significant memory overhead (taking Google Guava's RegularImmutableList as an example):
(data item header: 12 bytes)
+ (pointer to ImmutableList: 32 or 64 bit)
+ (size of RegularImmutableList of N elements:
    (object header: 12 bytes)
    + (int offset: 4 bytes)
    + (int size: 4 bytes)
    + (pointer to Object[] array: 32 or 64 bit)
    + (size of Object[]:
        (object header: 12 bytes)
        + (pointer size: 32 or 64 bit) * (number of items))
    + (number of items) * (size of Integer:
        (object header: 12 bytes)
        + (value int: 4 bytes)))
For N keys the memory usage is approximately N*20 + 48 bytes (with 32-bit compressed pointers), and N + 3 objects.
Rewrite to custom immutable data structure:
class DataItem implements Comparable<DataItem> {
    private final int[] keys;
    // implement Comparable interface
    // additional methods hidden
}
Memory usage:
(data item header: 12 bytes)
+ (pointer to int[] array: 32 or 64 bit)
+ (size of int[] array:
    (array header: 12 bytes + 4 bytes length)
    + (number of items) * (int size: 4 bytes))
For N keys the memory usage is approximately N*4 + 32 bytes, about 5 times less than the original, and only 2 objects.
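A sketch of the hidden Comparable implementation, assuming keys are compared on k1 first, then k2, etc.
(Arrays.compare(int[], int[]) exists since Java 9; on older JVMs a manual loop does the same):

import java.util.Arrays;

class DataItem implements Comparable<DataItem> {
    private final int[] keys;

    DataItem(int[] keys) {
        this.keys = keys.clone(); // defensive copy keeps the item immutable
    }

    @Override
    public int compareTo(DataItem other) {
        // Lexicographic order over the packed keys: k1 first, then k2, ...
        return Arrays.compare(this.keys, other.keys);
    }
}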
============================================================================================================
Question 3)
Part A: What is a suitable data structure for storing an immutable ordered
collection of data items (as defined in Q2) for implementing a database?
Answer:
First, the solution depends on which operations need to be optimized. Usually the following are needed:
- a fast append operation
- random access by index
- sequential read
The general solution for immutable ordered collections is a log-based data structure.
In practice, most production solutions implement some variant of a segmented log.
The most popular production-ready systems, such as LevelDB, RocksDB, and Cassandra, use the Log-Structured
Merge-Tree (LSM-Tree).
These solutions allow fast append operations (approximately O(1)) and fast random access (approximately O(1) disk
I/O operations for hash-based, or O(lg(N)) for tree-based, in-memory key lookup tables).
Usually an LSM-tree stores key-value pairs in an SSTable on disk:
[ key 1: value 1 ; key 2: value 2; ... ; key N: value N]
In Q2 there are only data items, without a primary key, so we can simplify the SSTable structure for our situation:
[ data item 1; data item 2; ... ; data item N]
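A hypothetical sketch of a writer for such a simplified SSTable segment (the file format and class name are my own
assumptions: each item is stored as a key count followed by its int keys):

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Append-only segment file of data items with no primary key.
// Items can be read back sequentially in insertion order.
final class SSTableWriter implements AutoCloseable {
    private final DataOutputStream out;

    SSTableWriter(String path) throws IOException {
        out = new DataOutputStream(new FileOutputStream(path, true)); // append mode
    }

    void append(int[] keys) throws IOException {
        out.writeInt(keys.length);
        for (int key : keys) {
            out.writeInt(key);
        }
    }

    @Override public void close() throws IOException { out.close(); }
}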
============================================================================================================
Part B: How would you optimise its in-memory representation, with Java
specifically in mind? Why? What are the trade-offs and why are they
sensible?
Answer:
The most significant trade-off is amount of disk i/o operations, because disk access operation takes much more time
than heap.
To optimize random access operation by data item index, we can use in-memory offset index for each SSTable, example:
- index 0 : offset
- index 1 : offset
- ...
- index N : offset
In java, for each segment we cat store first index of segment and array with offsets:
class SegmentIndex {
    int firstIndex;
    long[] offsets;
}
For N data items, the memory usage will be:
(segment index object header: 12 bytes)
+ (first index: 4 bytes)
+ (array pointer: platform-dependent, 32 or 64 bits)
+ (array header: 12 bytes + 4 bytes length)
+ (long size: 8 bytes) * N.
To optimize sequential reads from an SSTable, a fixed-size direct buffer can be used to reduce the number of disk
seek operations and short reads.
To optimize random-access reads, we can add a cache with an eviction policy such as Least Recently Used (LRU), as
sketched below.
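A hypothetical sketch of a random-access read combining the SegmentIndex above with the segment file format sketched
in Part A (one disk seek per uncached lookup; an LRU cache of decoded items would sit in front of this):

import java.io.IOException;
import java.io.RandomAccessFile;

final class SegmentReader {
    private final RandomAccessFile file;
    private final SegmentIndex index;

    SegmentReader(RandomAccessFile file, SegmentIndex index) {
        this.file = file;
        this.index = index;
    }

    int[] read(int globalIndex) throws IOException {
        // Translate the global data item index into a file offset.
        long offset = index.offsets[globalIndex - index.firstIndex];
        file.seek(offset);                  // single disk seek
        int count = file.readInt();         // item layout from the SSTable sketch
        int[] keys = new int[count];
        for (int i = 0; i < count; i++) {
            keys[i] = file.readInt();
        }
        return keys;
    }
}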
============================================================================================================
Part C: If each key can be compared in byte-order (i.e. given two keys,
their respective order is defined by the first byte that differs), how does
this affect your answer to A? What if only some keys can?
Answer:
In my code example, stream is byte-ordered, comparator is naive pure-java implemented. In practice may be required
use signed or unsigned lexicographical order.
If all my keys can can be compared in byte-order, I cat use Google guava contains utilities for, with optimization
to use sun.misc.Unsafe api for comparing byte[] arrays as long[] arrays for better performance.
In other case, it's need to implement custom pure java comparator, and this solution will works not so fast as more
generalized and optimized solution, as native lexicographical order.
package org.questions;

import com.google.common.collect.Iterators;
import com.google.common.primitives.SignedBytes;
import com.google.common.primitives.UnsignedBytes;

import java.util.*;

public class Test {
    // number of iterators
    private final static int M = 10;
    // number of data items in iterator
    private final static int n = 10;
    // number of data items
    private final static int N = n * M;
    // number of keys in data item
    private final static int K = 3;
    // deterministic pseudo-random generator
    private final static Random random = new Random(13);

    public static void main(String[] args) {
        /**
         * Complexity of merging iterator - O(N*log(M)).
         * Complexity of comparator - O(K).
         *
         * Total complexity O(K*N*log(M)).
         */
        String comparatorType = Optional.of(args)
            .filter(strings -> strings.length > 0)
            .map(strings -> strings[0])
            .orElse("naive");
        final Comparator<DataItem> comparator;
        System.out.println("comparator: " + comparatorType);
        switch (comparatorType) {
            case "naive":
                comparator = new NaiveDataItemComparator();
                break;
            case "unsigned":
                comparator = new UnsignedLexicographicalDataItemComparator();
                break;
            case "signed":
                comparator = new SignedLexicographicalDataItemComparator();
                break;
            default:
                throw new IllegalArgumentException();
        }
        List<Iterator<DataItem>> iterators = baseIterators(comparator);
        Iterator<DataItem> iterator = Iterators.mergeSorted(iterators, comparator);
        DataItem prev = null;
        while (iterator.hasNext()) {
            DataItem current = iterator.next();
            System.out.println(current);
            if (prev != null && comparator.compare(prev, current) > 0) {
                throw new Error("Invalid item order");
            }
            prev = current;
        }
    }

    /**
     * @return list of M data item iterators
     */
    private static List<Iterator<DataItem>> baseIterators(Comparator<DataItem> comparator) {
        List<Iterator<DataItem>> iterators = new ArrayList<>(M);
        for (int i = 0; i < M; i++) {
            iterators.add(baseIterator(comparator));
        }
        return iterators;
    }

    /**
     * @return iterator with n ordered data items
     */
    private static Iterator<DataItem> baseIterator(Comparator<DataItem> comparator) {
        ArrayList<DataItem> dataItems = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            dataItems.add(dataItem());
        }
        dataItems.sort(comparator);
        return dataItems.iterator();
    }

    /**
     * @return DataItem with K keys
     */
    private static DataItem dataItem() {
        byte[] data = new byte[K];
        random.nextBytes(data);
        return new DataItem(data);
    }

    /**
     * One comparison takes O(K), where K is the number of data item keys.
     */
    private static class NaiveDataItemComparator implements Comparator<DataItem> {
        @Override
        public int compare(DataItem o1, DataItem o2) {
            for (int i = 0; i < K; i++) {
                int cmp = Byte.compare(o1.get(i), o2.get(i));
                if (cmp != 0) {
                    return cmp;
                }
            }
            return 0;
        }
    }

    private static class UnsignedLexicographicalDataItemComparator implements Comparator<DataItem> {
        @Override
        public int compare(DataItem o1, DataItem o2) {
            return UnsignedBytes.lexicographicalComparator().compare(o1.data, o2.data);
        }
    }

    private static class SignedLexicographicalDataItemComparator implements Comparator<DataItem> {
        @Override
        public int compare(DataItem o1, DataItem o2) {
            return SignedBytes.lexicographicalComparator().compare(o1.data, o2.data);
        }
    }

    private static class DataItem {
        private final byte[] data;

        public DataItem(byte[] data) {
            this.data = data;
        }

        public byte get(int index) {
            return data[index];
        }

        @Override
        public String toString() {
            return "DataItem{" + Arrays.toString(data) + '}';
        }
    }
}