@tdunning
Created April 12, 2019 23:18
Demonstrates the summarization of database fields using t-digest
package com.tdunning.tdigest.quality;

import com.google.common.collect.ImmutableList;
import com.google.common.io.Resources;
import com.tdunning.math.stats.MergingDigest;
import com.tdunning.math.stats.TDigest;
import org.junit.Test;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

public class DbStatsTest {
    @Test
    public void testColumnBreaks() throws IOException {
        SimpleDateFormat parser = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
        List<String> keyFields = ImmutableList.of("c_integer", "c_bigint", "c_float", "c_timestamp");

        // one digest per summarized column, kept in the order listed above
        Map<String, TDigest> digests = new LinkedHashMap<>();
        keyFields.forEach((key) -> digests.put(key, new MergingDigest(100)));

        List<String> headers = new ArrayList<>();
        // try-with-resources closes the file handle that Files.lines keeps open
        try (Stream<String> data = Files.lines(new File(Resources.getResource("db_data.csv").getFile()).toPath())) {
            data.forEach((String line) -> {
                if (headers.isEmpty()) {
                    // first line has field names
                    headers.addAll(Arrays.asList(line.split(",")));
                } else {
                    Iterator<String> i = headers.iterator();
                    for (String value : line.split(",")) {
                        String key = i.next();
                        if ("c_timestamp".equals(key)) {
                            try {
                                // timestamps are summarized as milliseconds since the epoch
                                digests.get(key).add(parser.parse(value).getTime());
                            } catch (ParseException e) {
                                // ignore bad dates
                            }
                        } else if (keyFields.contains(key) && !"null".equals(value)) {
                            digests.get(key).add(Double.parseDouble(value));
                        }
                    }
                }
            });
        }

        for (Map.Entry<String, TDigest> entry : digests.entrySet()) {
            System.out.printf("%s", entry.getKey());
            // an integer counter makes the last quantile exactly 1.0;
            // accumulating 0.1 in a double stops just short of it
            for (int step = 0; step <= 10; step++) {
                System.out.printf(",%10.5g", entry.getValue().quantile(step / 10.0));
            }
            System.out.println();
        }
    }
}
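
To sanity-check the digest's estimates on a small sample, exact quantiles can be computed by sorting. The following is a minimal sketch in plain Java with no t-digest dependency. The class name `ExactQuantiles` and the linear interpolation between neighboring order statistics are assumptions made for illustration, and t-digest interpolates between centroids in its own way, so small differences from its output are to be expected.

```java
import java.util.Arrays;

public class ExactQuantiles {
    // Exact quantile of a sample: sort a copy, then interpolate linearly
    // between the two order statistics that bracket position q * (n - 1).
    // Hypothetical helper, not part of the t-digest library.
    static double quantile(double[] sample, double q) {
        if (sample.length == 0 || q < 0 || q > 1) {
            throw new IllegalArgumentException("need a non-empty sample and 0 <= q <= 1");
        }
        double[] sorted = sample.clone();
        Arrays.sort(sorted);
        double pos = q * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    public static void main(String[] args) {
        double[] sample = {5, 1, 4, 2, 3};
        // Same eleven break points as the test: q = 0.0, 0.1, ..., 1.0
        for (int step = 0; step <= 10; step++) {
            System.out.printf("%.1f -> %.2f%n", step / 10.0, quantile(sample, step / 10.0));
        }
    }
}
```

With this definition, `quantile(sample, 0)` is the sample minimum and `quantile(sample, 1)` is the sample maximum, which matches how the first and last columns of the digest output are read.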
@tdunning

Output looks like this. Each line is a column name followed by the estimated quantiles from q = 0 to q = 1 in steps of 0.1, so the first number on each line is the minimum value and the last is the maximum.

c_integer,-2.1278e+09,-1.6601e+09,-1.2665e+09,-8.0910e+08,-4.3179e+08,-5.6708e+07,7.7199e+07,4.3322e+08,1.0286e+09,1.3180e+09,2.1119e+09
c_bigint,-8.8049e+18,-7.3087e+18,-5.7873e+18,-2.9042e+18,-1.3941e+18,    0.0000,    0.0000,    0.0000,    0.0000,8.8484e+17,4.6441e+18
c_float,-4.6874e+09,9.2246e+08,2.0305e+09,3.2413e+09,4.6905e+09,5.1524e+09,6.0231e+09,7.0389e+09,8.0028e+09,8.7593e+09,9.9838e+09
c_timestamp,1.3887e+12,1.3914e+12,1.3971e+12,1.4000e+12,1.4040e+12,1.4058e+12,1.4072e+12,1.4120e+12,1.4137e+12,1.4170e+12,1.4200e+12
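
The `c_timestamp` row is in milliseconds since the epoch, because the test adds `Date.getTime()` values to the digest. A small sketch of turning those quantiles back into readable instants follows. The class and method names are made up for illustration, and the inputs are the printed values, which are rounded to five significant digits, so the results are only approximate.

```java
import java.time.Instant;

public class QuantileTimestamps {
    // Hypothetical helper: interpret a digest quantile as epoch milliseconds.
    static Instant toInstant(double epochMillis) {
        return Instant.ofEpochMilli(Math.round(epochMillis));
    }

    public static void main(String[] args) {
        // Minimum and maximum of c_timestamp as printed above (rounded values).
        double min = 1.3887e12;
        double max = 1.4200e12;
        System.out.println("min ~ " + toInstant(min)); // 2014-01-02T22:00:00Z
        System.out.println("max ~ " + toInstant(max)); // 2014-12-31T04:26:40Z
    }
}
```

`Instant` prints in UTC, while `SimpleDateFormat` in the test parses in the JVM's default time zone, so the wall-clock times in the source data may be offset from these by the local zone offset.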
