@awwsmm awwsmm/Typifier.java
Last active Nov 12, 2018

Infer type of data from String representation in Java
import java.nio.CharBuffer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map.Entry;
import java.util.Optional;
import java.util.AbstractMap.SimpleEntry;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
public class Typifier {
public static void main (String[] args) {
System.out.println("\nFeatures:\n");
System.out.println("String.trim() is used to remove leading and trailing whitespace, so the following work fine:\n");
example(" 23 "); example(" 3.4 ");
System.out.println("But if the user enters only whitespace, that's fine, too:\n");
example(" "); example(" ");
System.out.println("The user can choose to interpret 1/0 as true/false:\n");
example("1"); example("true"); example("0");
System.out.println("Ranges of Byte, Short, and Float are used to box the value in the narrowest type available:\n");
example(" 2 "); example(" 200 "); example("2e9 "); example(" 2e99");
System.out.println("If String has length 1 and is not Boolean or Byte, it will be assigned the Character type:\n");
example("4"); example("-"); example("a");
System.out.println("Dates can also be parsed, and formats can be defined by the user:\n");
example("2014-12-22 14:35:22"); example("3/5/99 6:30");
System.out.println("Flags are available to allow the user to:");
System.out.println(" - interpret 0/1 as boolean false/true (or not)");
System.out.println(" - restrict to \"common\" types: Boolean, Double, String only (also LocalDateTime w/different flag)");
System.out.println(" - attempt to parse Strings as LocalDateTime objects, using a predefined list of DateTimeFormatters");
System.out.println(" - disallow postfixed l/L or f/F for long/float values, respectively (instead, interpreted as Strings)\n");
}
public static void example (String string) {
Entry<Class, String> result = typify(string);
System.out.printf(" typify (\"%s\") => \"%s\" [%s]%n%n",
string, result.getValue(), result.getKey().getSimpleName());
}
///---------------------------------------------------------------------------
///
/// private helper variables for typify()
///
///---------------------------------------------------------------------------
private static final Locale defaultLocale = Locale.ENGLISH;
private static HashSet<DateTimeFormatter> formats = new HashSet<>(Arrays.asList(
DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss", defaultLocale),
DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss.SS", defaultLocale),
DateTimeFormatter.ofPattern("MM/dd/yyyy hh:mm:ss a", defaultLocale),
DateTimeFormatter.ofPattern("M/d/yy H:mm", defaultLocale),
DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm:ss", defaultLocale) ));
// only Strings contain these characters -- skip all numeric processing
// arranged roughly by frequency in ~130MB of sample DASGIP files:
// $ awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' file.txt
private static final char[] StringCharacters = new char[]{
' ', ':', 'n', 'a', 't', 'r', 'o', 'C', 'i', 'P', 'D', 's', 'c', 'S', 'u',
'A', 'm', '=', 'O', '\\', 'd', 'p', 'T', 'M', 'g', 'I', 'b', 'U', 'h', 'H' };
// typify() scans the input String for the characters above before it attempts
// any numeric parsing. If it finds one, it skips straight to the String-processing
// section, because no plain decimal number contains those characters.
// (Side effect: forms that Double.parseDouble() would otherwise accept, such as
// "NaN", "Infinity", a trailing d/D, or hexadecimal floats with p/P, are treated
// as Strings.)
// Each extra character in the list adds a comparison for every piece of data
// parsed, but also makes it less likely that String data reaches the numeric
// parsers and throws a (slow) NumberFormatException.
// The characters below could also be added, but the list above seems to be
// near-optimal:
// 'J', '+', 'V', 'B', 'G', 'R', 'y', '(', ')', 'v', '_', ',', '[', ']', '/',
// 'N', 'k', 'w', '}', '{', 'X', '%', '>', 'x', '\'', 'W', '<', 'K', 'Q', 'q',
// 'z', 'Y', 'j', 'Z', '!', '#', '$', '&', '*', ';', '?', '@', '^', '`',
// '|', '~'};
private static final String[] falseAliases = new String[]{ "false", "False", "FALSE" };
private static final String[] trueAliases = new String[]{ "true", "True", "TRUE" };
// if a String contains any of these, try to evaluate it as an equation
// (not currently used by typify())
private static final char[] MathCharacters = new char[]{ '+', '-', '/', '*', '=' };
///---------------------------------------------------------------------------
///
/// typify():
/// attempt to classify String input as double, int, char, etc.
/// "common types" are: boolean, double, string, timestamp
///
///---------------------------------------------------------------------------
// default is:
// > interpret "0" and "1" as false and true
// > don't restrict interpretation to common types
// > allow f/F/l/L postfixes for float/long numbers
// > attempt to parse dates
public static Entry<Class, String> typify (String data) {
return typify(data, true, false, true, true);
}
public static Entry<Class, String> typify (String data, boolean bool01, boolean commonTypes, boolean postfixFL, boolean parseDates) {
// -2. if the input data is null or has 0 length, return Object.class with a null value
if (data == null || data.length() == 0) return new SimpleEntry<>(Object.class, null);
String s = data.trim();
int slen = s.length();
// -1. if the input data is only whitespace, return "String" and input as-is
if (slen == 0) return new SimpleEntry<>(String.class, data);
// in most data, numerical values are more common than true/false values. So,
// to speed up parsing, the Boolean check runs here only when uncommon types are
// allowed; for common types it is deferred until after the numeric checks (see /// ***)
/// 0. check if the data is Boolean (true or false)
if (!commonTypes) {
if (contains(falseAliases, s)) return new SimpleEntry<>(Boolean.class, "false");
else if (contains(trueAliases, s)) return new SimpleEntry<>(Boolean.class, "true");
}
// check for any String-only characters; if we find them, don't bother trying to parse this as a number
if (!containsAny(s, StringCharacters)) {
// try again for boolean -- need to make sure it's not parsed as Byte
if (bool01) {
if (s.equals("0")) return new SimpleEntry<>(Boolean.class, "false");
else if (s.equals("1")) return new SimpleEntry<>(Boolean.class, "true");
}
char lastChar = s.charAt(slen-1); // we'll need this later...
boolean lastCharF = (lastChar == 'f' || lastChar == 'F');
boolean lastCharL = (lastChar == 'l' || lastChar == 'L');
if (!commonTypes) { // if we're not restricted to common types, look for anything
/// 1. check if data is a Byte (1-byte integer with range [-(2^7) = -128, (2^7)-1 = 127])
try {
Byte b = Byte.parseByte(s);
return new SimpleEntry<>(Byte.class, b.toString()); // if we make it to this line, the data parsed fine as a Byte
} catch (NumberFormatException ex) {
// okay, guess it's not a Byte
}
/// 2. check if data is a Short (2-byte integer with range [-(2^15) = -32768, (2^15)-1 = 32767])
try {
Short h = Short.parseShort(s);
return new SimpleEntry<>(Short.class, h.toString()); // if we make it to this line, the data parsed fine as a Short
} catch (NumberFormatException ex) {
// okay, guess it's not a Short
}
/// 3. check if data is an Integer (4-byte integer with range [-(2^31), (2^31)-1])
try {
Integer i = Integer.parseInt(s);
return new SimpleEntry<>(Integer.class, i.toString()); // if we make it to this line, the data parsed fine as an Integer
} catch (NumberFormatException ex) {
// okay, guess it's not an Integer
}
String s_L_trimmed = s;
/// 4. check if data is a Long (8-byte integer with range [-(2^63), (2^63)-1])
// ...first, see if the last character of the string is "L" or "l"
// ... Java parses "3.3F", etc. fine as a float, but throws an error with "3L", etc.
if (postfixFL && slen > 1 && lastCharL)
s_L_trimmed = s.substring(0, slen-1);
try {
Long l = Long.parseLong(s_L_trimmed);
return new SimpleEntry<>(Long.class, l.toString()); // if we make it to this line, the data parsed fine as a Long
} catch (NumberFormatException ex) {
// okay, guess it's not a Long
}
/// 5. check if data is a Float (32-bit IEEE 754 floating point with approximate extents +/- 3.4028235e38)
if (postfixFL || !lastCharF) {
try {
Float f = Float.parseFloat(s);
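// if it's beyond the range of Float, maybe it's not beyond the range of Double;
// also reject nonzero input that underflows to 0 as a Float (e.g. "1e-50")
if (!f.isInfinite() && (f != 0.0f || Double.parseDouble(s) == 0.0))
return new SimpleEntry<>(Float.class, f.toString()); // if we make it to this line, the data parsed fine as a Float and is finite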
} catch (NumberFormatException ex) {
// okay, guess it's not a Float
} }
} // end uncommon types 1/2
/// 6. check if data is a Double (64-bit IEEE 754 floating point with approximate extents +/- 1.797693134862315e308 )
if (postfixFL || !lastCharF) {
try {
Double d = Double.parseDouble(s);
if (!d.isInfinite())
return new SimpleEntry<>(Double.class, d.toString()); // if we make it to this line, the data parsed fine as a Double
else // if it's beyond the range of Double, just return a String and let the user decide what to do
return new SimpleEntry<>(String.class, s);
} catch (NumberFormatException ex) {
// okay, guess it's not a Double
} }
} // if we have StringCharacters, we must have a String...
// ...or a Boolean!
if (commonTypes) { // try again for Boolean /// ***
if (contains(falseAliases, s)) return new SimpleEntry<>(Boolean.class, "false");
else if (contains(trueAliases, s)) return new SimpleEntry<>(Boolean.class, "true");
}
/// 7. revert to String by default, with caveats...
/// 7a. if string has length 1, it is a single character
if (!commonTypes && slen == 1)
return new SimpleEntry<>(Character.class, s); // end uncommon types 2/2
/// 7b. attempt to parse String as a LocalDateTime
if (parseDates && stringAsDate(s) != null) return new SimpleEntry<>(LocalDateTime.class, s);
// ...if we've made it all the way to here without returning, give up and return "String" and input as-is
return new SimpleEntry<>(String.class, data);
}
//----- helper function which attempts to parse a String as a date -----------
private static LocalDateTime stringAsDate (String date) {
for (DateTimeFormatter format : formats) {
try {
return LocalDateTime.parse(date, format);
} catch (java.time.format.DateTimeParseException ex) {
// can't parse it as this format, but maybe the next one...?
} }
return null; // if none work, return null
}
///---------------------------------------------------------------------------
///
/// decodeTypify():
/// parses a typify result and returns the value contained within
///
///---------------------------------------------------------------------------
public static Optional<Object> decodeTypify (Entry<Class, String> entry) {
// String
if (entry.getKey() == String.class)
return Optional.of(entry.getValue());
// Boolean
else if (entry.getKey() == Boolean.class)
return Optional.of(Boolean.parseBoolean(entry.getValue()));
// Byte
else if (entry.getKey() == Byte.class)
return Optional.of(Byte.parseByte(entry.getValue()));
// Character
else if (entry.getKey() == Character.class)
return Optional.of((Character) entry.getValue().charAt(0));
// Short
else if (entry.getKey() == Short.class)
return Optional.of(Short.parseShort(entry.getValue()));
// Integer
else if (entry.getKey() == Integer.class)
return Optional.of(Integer.parseInt(entry.getValue()));
// Long
else if (entry.getKey() == Long.class)
return Optional.of(Long.parseLong(entry.getValue()));
// Float
else if (entry.getKey() == Float.class)
return Optional.of(Float.parseFloat(entry.getValue()));
// Double
else if (entry.getKey() == Double.class)
return Optional.of(Double.parseDouble(entry.getValue()));
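// LocalDateTime (re-parse with the same formatters typify() used; the one-argument
// LocalDateTime.parse() expects ISO-8601 input such as "2014-12-22T14:35:22")
else if (entry.getKey() == LocalDateTime.class)
return Optional.ofNullable(stringAsDate(entry.getValue()));
// otherwise, empty
else return Optional.empty();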
}
///---------------------------------------------------------------------------
///
/// containsAny():
/// returns true if any of the source chars are found in the target
///
///---------------------------------------------------------------------------
public static boolean containsAny (CharSequence target, CharSequence source) {
if (target == null || target.length() == 0 ||
source == null || source.length() == 0)
return false;
for (int aa = 0; aa < target.length(); ++aa)
for (int bb = 0; bb < source.length(); ++bb)
if (source.charAt(bb) == target.charAt(aa))
return true;
return false;
}
public static boolean containsAny (CharSequence target, char[] source) {
return containsAny(target, CharBuffer.wrap(source));
}
public static boolean containsAny (char[] target, CharSequence source) {
return containsAny(CharBuffer.wrap(target), source);
}
public static boolean containsAny (char[] target, char[] source) {
return containsAny(CharBuffer.wrap(target), CharBuffer.wrap(source));
}
///---------------------------------------------------------------------------
///
/// contains():
/// checks if a target array contains the source term
///
///---------------------------------------------------------------------------
public static <T> boolean contains (T[] target, T source) {
if (source == null) return false;
for (T t : target) if (t != null && t.equals(source)) return true;
return false;
}
// primitive boolean version
public static boolean contains (boolean[] target, boolean source) {
for (boolean t : target) if (t == source) return true;
return false;
}
// primitive char version
public static boolean contains (char[] target, char source) {
for (char t : target) if (t == source) return true;
return false;
}
} // end of class Typifier
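The character pre-filter used by typify() can be shown in isolation. The sketch below is a hypothetical stand-alone class (PreFilter, STRING_ONLY, and tryNumber are illustrative names, not part of Typifier) and it uses only a small subset of the StringCharacters list. It assumes the goal is simply to avoid the cost of a thrown NumberFormatException for input that is obviously text.

```java
import java.util.Optional;

// Hypothetical stand-alone sketch; PreFilter is not part of Typifier.
final class PreFilter {

    // A small, illustrative subset of the StringCharacters list.
    private static final char[] STRING_ONLY = { ' ', ':', 'n', 'a', 't', 'r', 'o' };

    // True if any character of 'chars' occurs in 'target'.
    static boolean containsAny(CharSequence target, char[] chars) {
        for (int i = 0; i < target.length(); ++i)
            for (char c : chars)
                if (target.charAt(i) == c) return true;
        return false;
    }

    // Parses a double only if the cheap character scan finds nothing suspicious.
    static Optional<Double> tryNumber(String s) {
        if (containsAny(s, STRING_ONLY)) return Optional.empty(); // no exception needed
        try {
            return Optional.of(Double.parseDouble(s));
        } catch (NumberFormatException ex) {
            return Optional.empty(); // slipped past the filter but is still not a number
        }
    }

    public static void main(String[] args) {
        for (String in : new String[] { "3.5", "12:30", "NaN", "1e6", "hello" })
            System.out.println('"' + in + "\" => " + tryNumber(in));
    }
}
```

Note that "NaN" is rejected by the filter (it contains 'a') even though Double.parseDouble() would accept it, which matches the behavior of the full character list.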
/*
===================
PROGRAM OUTPUT:
===================
Features:
String.trim() is used to remove leading and trailing whitespace, so the following work fine:
typify (" 23 ") => "23" [Byte]
typify (" 3.4 ") => "3.4" [Float]
But if the user enters only whitespace, that's fine, too:
typify (" ") => " " [String]
typify (" ") => " " [String]
The user can choose to interpret 1/0 as true/false:
typify ("1") => "true" [Boolean]
typify ("true") => "true" [Boolean]
typify ("0") => "false" [Boolean]
Ranges of Byte, Short, and Float are used to box the value in the narrowest type available:
typify (" 2 ") => "2" [Byte]
typify (" 200 ") => "200" [Short]
typify ("2e9 ") => "2.0E9" [Float]
typify (" 2e99") => "2.0E99" [Double]
If String has length 1 and is not Boolean or Byte, it will be assigned the Character type:
typify ("4") => "4" [Byte]
typify ("-") => "-" [Character]
typify ("a") => "a" [Character]
Dates can also be parsed, and formats can be defined by the user:
typify ("2014-12-22 14:35:22") => "2014-12-22 14:35:22" [LocalDateTime]
typify ("3/5/99 6:30") => "3/5/99 6:30" [LocalDateTime]
Flags are available to allow the user to:
- interpret 0/1 as boolean false/true (or not)
- restrict to "common" types: Boolean, Double, String only (also LocalDateTime w/different flag)
- attempt to parse Strings as LocalDateTime objects, using a predefined list of DateTimeFormatters
- disallow postfixed l/L or f/F for long/float values, respectively (instead, interpreted as Strings)
*/
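The "narrowest type available" behavior in the output above comes from trying each parser in order of increasing width. A minimal sketch of that cascade follows, as a hypothetical stand-alone class (NarrowestNumber and infer are illustrative names, not part of Typifier). It leaves out the boolean handling, the character pre-filter, the l/L and f/F postfix flags, and date parsing, so it is not a substitute for typify().

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;

// Hypothetical stand-alone sketch; NarrowestNumber is not part of Typifier.
final class NarrowestNumber {

    // Pairs the narrowest boxed numeric type that can hold the trimmed input
    // with its canonical text, or falls back to String with the input as-is.
    static Entry<Class<?>, String> infer(String raw) {
        String s = raw.trim();
        try {
            return new SimpleEntry<>(Byte.class, Byte.valueOf(s).toString());
        } catch (NumberFormatException ignored) { /* out of range or not integral */ }
        try {
            return new SimpleEntry<>(Short.class, Short.valueOf(s).toString());
        } catch (NumberFormatException ignored) { /* try the next width */ }
        try {
            return new SimpleEntry<>(Integer.class, Integer.valueOf(s).toString());
        } catch (NumberFormatException ignored) { /* try the next width */ }
        try {
            return new SimpleEntry<>(Long.class, Long.valueOf(s).toString());
        } catch (NumberFormatException ignored) { /* not an integer at all */ }
        try {
            // Overflow yields Infinity rather than an exception, so test for it.
            Float f = Float.valueOf(s);
            if (!f.isInfinite()) return new SimpleEntry<>(Float.class, f.toString());
            Double d = Double.valueOf(s);
            if (!d.isInfinite()) return new SimpleEntry<>(Double.class, d.toString());
        } catch (NumberFormatException ignored) { /* not a number */ }
        return new SimpleEntry<>(String.class, raw);
    }

    public static void main(String[] args) {
        for (String in : new String[] { " 23 ", "200", "70000", "3000000000", "2e9", "2e99", "abc" }) {
            Entry<Class<?>, String> r = infer(in);
            System.out.printf("%-14s => %s [%s]%n", '"' + in + '"', r.getValue(), r.getKey().getSimpleName());
        }
    }
}
```

One caveat of this ordering: any finite value is accepted as a Float, even if it has more significant digits than a Float can represent, so callers that care about precision may prefer to skip straight to Double.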
/*
RESOURCES / CITATIONS:
https://stackoverflow.com/questions/13314215/java-how-to-infer-type-from-data-coming-from-multiple-sources
https://stackoverflow.com/questions/36820754/floating-and-double-types-range-in-java
https://stackoverflow.com/questions/17223185/how-can-detect-one-string-contain-one-of-several-characters-using-java-regex
https://stackoverflow.com/questions/20085287/java-how-to-compare-multiple-strings
https://stackoverflow.com/questions/3422673/evaluating-a-math-expression-given-in-string-form
http://www.java2s.com/Tutorials/Java/Scripting_in_Java/0040__Scripting_in_Java_eval.htm
*/
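Date detection in the file above works by trying each DateTimeFormatter until one succeeds. The sketch below isolates that idea in a hypothetical stand-alone class (DateGuess and its FORMATS list are illustrative, not part of Typifier) with two of the patterns from the file. It uses an ordered List instead of a HashSet, on the assumption that a deterministic try-order is preferable when patterns could overlap. It also shows why a matched string cannot be round-tripped through the one-argument LocalDateTime.parse(), which expects ISO-8601 text.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical stand-alone sketch; DateGuess is not part of Typifier.
final class DateGuess {

    // Formatters are tried in list order; the first match wins.
    private static final List<DateTimeFormatter> FORMATS = Arrays.asList(
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss", Locale.ENGLISH),
        DateTimeFormatter.ofPattern("M/d/yy H:mm", Locale.ENGLISH));

    // Returns the parsed timestamp, or empty if no formatter accepts the text.
    static Optional<LocalDateTime> parse(String text) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDateTime.parse(text, format));
            } catch (DateTimeParseException ex) {
                // not this pattern; try the next one
            }
        }
        return Optional.empty();
    }

    // True if the text is valid for the default ISO-8601 parser.
    static boolean isIso(String text) {
        try {
            LocalDateTime.parse(text);
            return true;
        } catch (DateTimeParseException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        String in = "2014-12-22 14:35:22";
        System.out.println(parse(in));           // matched by the first pattern
        System.out.println(isIso(in));           // the space separator is not ISO-8601
        System.out.println(parse("not a date")); // no pattern matches
    }
}
```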