@awwsmm
Last active June 22, 2023 10:45
Infer type of data from String representation in Java
/*
Copyright 2022 Andrew Watson
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
import java.nio.CharBuffer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map.Entry;
import java.util.Optional;
import java.util.AbstractMap.SimpleEntry;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
public class Typifier {
public static void main (String[] args) {
System.out.println("\nFeatures:\n");
System.out.println("String.trim() is used to remove leading and trailing whitespace, so the following work fine:\n");
example(" 23 "); example(" 3.4 ");
System.out.println("But if the user enters only whitespace, that's fine, too:\n");
example(" "); example(" ");
System.out.println("The user can choose to interpret 1/0 as true/false:\n");
example("1"); example("true"); example("0");
System.out.println("Ranges of Byte, Short, and Float are used to box the value in the narrowest type available:\n");
example(" 2 "); example(" 200 "); example("2e9 "); example(" 2e99");
System.out.println("If String has length 1 and is not Boolean or Byte, it will be assigned the Character type:\n");
example("4"); example("-"); example("a");
System.out.println("Dates can also be parsed, and formats can be defined by the user:\n");
example("2014-12-22 14:35:22"); example("3/5/99 6:30");
System.out.println("Flags are available to allow the user to:");
System.out.println(" - interpret 0/1 as boolean false/true (or not)");
System.out.println(" - restrict to \"common\" types: Boolean, Double, String only (also LocalDateTime w/different flag)");
System.out.println(" - attempt to parse Strings as LocalDateTime objects, using a predefined list of DateTimeFormatters");
System.out.println(" - disallow postfixed l/L or f/F for long/float values, respectively (instead, interpreted as Strings)\n");
}
public static void example (String string) {
Entry<Class, String> result = typify(string);
System.out.printf(" typify (\"%s\") => \"%s\" [%s]%n%n",
string, result.getValue(), result.getKey().getSimpleName());
}
///---------------------------------------------------------------------------
///
/// private helper variables for typify()
///
///---------------------------------------------------------------------------
private static Locale defaultLocale = new Locale("en");
private static HashSet<DateTimeFormatter> formats = new HashSet<>(Arrays.asList(
DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss", defaultLocale),
DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss.SS", defaultLocale),
DateTimeFormatter.ofPattern("MM/dd/yyyy hh:mm:ss a", defaultLocale),
DateTimeFormatter.ofPattern("M/d/yy H:mm", defaultLocale),
DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm:ss", defaultLocale) ));
// only Strings contain these characters -- skip all numeric processing
// arranged roughly by frequency in ~130MB of sample DASGIP files:
// $ awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' file.txt
private static final char[] StringCharacters = new char[]{
' ', ':', 'n', 'a', 't', 'r', 'o', 'C', 'i', 'P', 'D', 's', 'c', 'S', 'u',
'A', 'm', '=', 'O', '\\', 'd', 'p', 'T', 'M', 'g', 'I', 'b', 'U', 'h', 'H' };
// typify looks for the above characters in an input String before it makes
// any attempt at parsing that String. If it finds any of the above characters,
// it immediately skips to the String-processing section, because no numerical
// type can contain those characters.
// Adding more characters to the list means more characters to look for in
// every input String, but it also reduces the likelihood that an Exception
// is thrown while trying to parse String data as numerical data (and avoiding
// those Exceptions saves time).
// The characters below can also be added to the list, but the list above
// seems to be near-optimal:
// 'J', '+', 'V', 'B', 'G', 'R', 'y', '(', ')', 'v', '_', ',', '[', ']', '/',
// 'N', 'k', 'w', '}', '{', 'X', '%', '>', 'x', '\'', 'W', '<', 'K', 'Q', 'q',
// 'z', 'Y', 'j', 'Z', '!', '#', '$', '&', '*', ',', ';', '?', '@', '^', '`',
// '|', '~'};
private static final String[] falseAliases = new String[]{ "false", "False", "FALSE" };
private static final String[] trueAliases = new String[]{ "true", "True", "TRUE" };
// if a String contains any of these, it could be evaluated as an equation
// (not currently used by typify())
private static final char[] MathCharacters = new char[]{ '+', '-', '/', '*', '=' };
///---------------------------------------------------------------------------
///
/// typify():
/// attempt to classify String input as double, int, char, etc.
/// "common types" are: boolean, double, string, timestamp
///
///---------------------------------------------------------------------------
// default is:
// > interpret "0" and "1" as false and true
// > don't restrict interpretation to common types
// > allow f/F/l/L postfixes for float/long numbers
// > attempt to parse dates
public static Entry<Class, String> typify (String data) {
return typify(data, true, false, true, true);
}
public static Entry<Class, String> typify (String data, boolean bool01, boolean commonTypes, boolean postfixFL, boolean parseDates) {
// -2. if the input data is null or has 0 length, return Object.class with a null value
if (data == null || data.length() == 0) return new SimpleEntry<>(Object.class, null);
String s = data.trim();
int slen = s.length();
// -1. if the input data is only whitespace, return "String" and input as-is
if (slen == 0) return new SimpleEntry<>(String.class, data);
// in most data, numerical values are more common than true/false values. So,
// if we want to speed up data parsing, we can move this block to the end when
// looking only for common types. look for /// ***
/// 0. check if the data is Boolean (true or false)
if (!commonTypes) {
if (contains(falseAliases, s)) return new SimpleEntry<>(Boolean.class, "false");
else if (contains(trueAliases, s)) return new SimpleEntry<>(Boolean.class, "true");
}
// check for any String-only characters; if we find them, don't bother trying to parse this as a number
if (!containsAny(s, StringCharacters)) {
// try again for boolean -- need to make sure it's not parsed as Byte
if (bool01) {
if (s.equals("0")) return new SimpleEntry<>(Boolean.class, "false");
else if (s.equals("1")) return new SimpleEntry<>(Boolean.class, "true");
}
char lastChar = s.charAt(slen-1); // we'll need this later...
boolean lastCharF = (lastChar == 'f' || lastChar == 'F');
boolean lastCharL = (lastChar == 'l' || lastChar == 'L');
if (!commonTypes) { // if we're not restricted to common types, look for anything
/// 1. check if data is a Byte (1-byte integer with range [-(2^7) = -128, (2^7)-1 = 127])
try {
Byte b = Byte.parseByte(s);
return new SimpleEntry<>(Byte.class, b.toString()); // if we make it to this line, the data parsed fine as a Byte
} catch (NumberFormatException ex) {
// okay, guess it's not a Byte
}
/// 2. check if data is a Short (2-byte integer with range [-(2^15) = -32768, (2^15)-1 = 32767])
try {
Short h = Short.parseShort(s);
return new SimpleEntry<>(Short.class, h.toString()); // if we make it to this line, the data parsed fine as a Short
} catch (NumberFormatException ex) {
// okay, guess it's not a Short
}
/// 3. check if data is an Integer (4-byte integer with range [-(2^31), (2^31)-1])
try {
Integer i = Integer.parseInt(s);
return new SimpleEntry<>(Integer.class, i.toString()); // if we make it to this line, the data parsed fine as an Integer
} catch (NumberFormatException ex) {
// okay, guess it's not an Integer
}
String s_L_trimmed = s;
/// 4. check if data is a Long (8-byte integer with range [-(2^63), (2^63)-1])
// ...first, see if the last character of the string is "L" or "l"
// ... Java parses "3.3F", etc. fine as a float, but throws an error with "3L", etc.
if (postfixFL && slen > 1 && lastCharL)
s_L_trimmed = s.substring(0, slen-1);
try {
Long l = Long.parseLong(s_L_trimmed);
return new SimpleEntry<>(Long.class, l.toString()); // if we make it to this line, the data parsed fine as a Long
} catch (NumberFormatException ex) {
// okay, guess it's not a Long
}
/// 5. check if data is a Float (32-bit IEEE 754 floating point with approximate extents +/- 3.4028235e38)
if (postfixFL || !lastCharF) {
try {
Float f = Float.parseFloat(s);
if (!f.isInfinite()) // if it's beyond the range of Float, maybe it's not beyond the range of Double
return new SimpleEntry<>(Float.class, f.toString()); // if we make it to this line, the data parsed fine as a Float and is finite
} catch (NumberFormatException ex) {
// okay, guess it's not a Float
} }
} // end uncommon types 1/2
/// 6. check if data is a Double (64-bit IEEE 754 floating point with approximate extents +/- 1.797693134862315e308 )
if (postfixFL || !lastCharF) {
try {
Double d = Double.parseDouble(s);
if (!d.isInfinite())
return new SimpleEntry<>(Double.class, d.toString()); // if we make it to this line, the data parsed fine as a Double
else // if it's beyond the range of Double, just return a String and let the user decide what to do
return new SimpleEntry<>(String.class, s);
} catch (NumberFormatException ex) {
// okay, guess it's not a Double
} }
} // if we have StringCharacters, we must have a String...
// ...or a Boolean!
if (commonTypes) { // try again for Boolean /// ***
if (contains(falseAliases, s)) return new SimpleEntry<>(Boolean.class, "false");
else if (contains(trueAliases, s)) return new SimpleEntry<>(Boolean.class, "true");
}
/// 7. revert to String by default, with caveats...
/// 7a. if string has length 1, it is a single character
if (!commonTypes && slen == 1)
return new SimpleEntry<>(Character.class, s); // end uncommon types 2/2
/// 7b. attempt to parse String as a LocalDateTime
if (parseDates && stringAsDate(s) != null) return new SimpleEntry<>(LocalDateTime.class, s);
// ...if we've made it all the way to here without returning, give up and return "String" and input as-is
return new SimpleEntry<>(String.class, data);
}
//----- helper function which attempts to parse a String as a date -----------
private static LocalDateTime stringAsDate (String date) {
for (DateTimeFormatter format : formats) {
try {
return LocalDateTime.parse(date, format);
} catch (java.time.format.DateTimeParseException ex) {
// can't parse it as this format, but maybe the next one...?
} }
return null; // if none work, return null
}
///---------------------------------------------------------------------------
///
/// decodeTypify():
/// parses a typify result and returns the value contained within
///
///---------------------------------------------------------------------------
public static Optional<Object> decodeTypify (Entry<Class, String> entry) {
// String
if (entry.getKey() == String.class)
return Optional.of(entry.getValue());
// Boolean
else if (entry.getKey() == Boolean.class)
return Optional.of(Boolean.parseBoolean(entry.getValue()));
// Byte
else if (entry.getKey() == Byte.class)
return Optional.of(Byte.parseByte(entry.getValue()));
// Character
else if (entry.getKey() == Character.class)
return Optional.of((Character) entry.getValue().charAt(0));
// Short
else if (entry.getKey() == Short.class)
return Optional.of(Short.parseShort(entry.getValue()));
// Integer
else if (entry.getKey() == Integer.class)
return Optional.of(Integer.parseInt(entry.getValue()));
// Long
else if (entry.getKey() == Long.class)
return Optional.of(Long.parseLong(entry.getValue()));
// Float
else if (entry.getKey() == Float.class)
return Optional.of(Float.parseFloat(entry.getValue()));
// Double
else if (entry.getKey() == Double.class)
return Optional.of(Double.parseDouble(entry.getValue()));
// LocalDateTime
else if (entry.getKey() == LocalDateTime.class)
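return Optional.ofNullable(stringAsDate(entry.getValue())); // typify() keeps the original text, which is usually not ISO-8601, so reuse the known formats
// otherwise, an empty Optional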
else return Optional.empty();
}
///---------------------------------------------------------------------------
///
/// containsAny():
/// returns true if any of the source chars are found in the target
///
///---------------------------------------------------------------------------
public static boolean containsAny (CharSequence target, CharSequence source) {
if (target == null || target.length() == 0 ||
source == null || source.length() == 0)
return false;
for (int aa = 0; aa < target.length(); ++aa)
for (int bb = 0; bb < source.length(); ++bb)
if (source.charAt(bb) == target.charAt(aa))
return true;
return false;
}
public static boolean containsAny (CharSequence target, char[] source) {
return containsAny(target, CharBuffer.wrap(source));
}
public static boolean containsAny (char[] target, CharSequence source) {
return containsAny(CharBuffer.wrap(target), source);
}
public static boolean containsAny (char[] target, char[] source) {
return containsAny(CharBuffer.wrap(target), CharBuffer.wrap(source));
}
///---------------------------------------------------------------------------
///
/// contains():
/// checks if a target array contains the source term
///
///---------------------------------------------------------------------------
public static <T> boolean contains (T[] target, T source) {
if (source == null) return false;
for (T t : target) if (t != null && t.equals(source)) return true;
return false;
}
// primitive boolean version
public static boolean contains (boolean[] target, boolean source) {
for (boolean t : target) if (t == source) return true;
return false;
}
// primitive char version
public static boolean contains (char[] target, char source) {
for (char t : target) if (t == source) return true;
return false;
}
} // end of class Typifier
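The pre-screening idea behind StringCharacters can be illustrated on its own. What follows is a minimal sketch, not the gist's code: the class PrescreenSketch, its method names, and the shortened character list are hypothetical, chosen only to show how a cheap character scan lets most text skip the parse attempt that would otherwise throw.

```java
// Sketch only: PrescreenSketch and its members are hypothetical names, not part of Typifier.
class PrescreenSketch {

    // Assumed, shortened list of characters that no numeric literal of interest contains.
    static final char[] TEXT_ONLY = { ' ', ':', 'n', 'a', 't', 'r', 'o' };

    // Returns true if any character of s appears in TEXT_ONLY.
    static boolean looksLikeText(String s) {
        for (int i = 0; i < s.length(); i++)
            for (char c : TEXT_ONLY)
                if (s.charAt(i) == c) return true;
        return false;
    }

    // Rejects obvious text with the scan; only the remainder pays for a parse attempt.
    static boolean isNumber(String s) {
        if (looksLikeText(s)) return false; // cheap path: no exception is constructed
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException ex) {
            return false; // text that slipped past the scan still ends up here
        }
    }

    public static void main(String[] args) {
        for (String s : new String[]{ "3.4", "not a number", "xyz" })
            System.out.printf("isNumber(\"%s\") = %b%n", s, isNumber(s));
    }
}
```

A longer list catches more text before the parse attempt, at the cost of a longer scan for every input, which is the trade-off the comments on StringCharacters describe.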
/*
===================
PROGRAM OUTPUT:
===================
Features:
String.trim() is used to remove leading and trailing whitespace, so the following work fine:
typify (" 23 ") => "23" [Byte]
typify (" 3.4 ") => "3.4" [Float]
But if the user enters only whitespace, that's fine, too:
typify (" ") => " " [String]
typify (" ") => " " [String]
The user can choose to interpret 1/0 as true/false:
typify ("1") => "true" [Boolean]
typify ("true") => "true" [Boolean]
typify ("0") => "false" [Boolean]
Ranges of Byte, Short, and Float are used to box the value in the narrowest type available:
typify (" 2 ") => "2" [Byte]
typify (" 200 ") => "200" [Short]
typify ("2e9 ") => "2.0E9" [Float]
typify (" 2e99") => "2.0E99" [Double]
If String has length 1 and is not Boolean or Byte, it will be assigned the Character type:
typify ("4") => "4" [Byte]
typify ("-") => "-" [Character]
typify ("a") => "a" [Character]
Dates can also be parsed, and formats can be defined by the user:
typify ("2014-12-22 14:35:22") => "2014-12-22 14:35:22" [LocalDateTime]
typify ("3/5/99 6:30") => "3/5/99 6:30" [LocalDateTime]
Flags are available to allow the user to:
- interpret 0/1 as boolean false/true (or not)
- restrict to "common" types: Boolean, Double, String only (also LocalDateTime w/different flag)
- attempt to parse Strings as LocalDateTime objects, using a predefined list of DateTimeFormatters
- disallow postfixed l/L or f/F for long/float values, respectively (instead, interpreted as Strings)
*/
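The "narrowest type available" behaviour in the output above comes from trying each numeric parser in order of increasing range. The sketch below reduces that cascade to its core, under stated assumptions: NarrowestTypeSketch and infer() are hypothetical names, and the boolean, postfix, character, and date handling of typify() is deliberately left out.

```java
// Sketch only: NarrowestTypeSketch and infer() are hypothetical names, not part of Typifier.
class NarrowestTypeSketch {

    // Try each numeric type from narrowest to widest; the first successful parse wins.
    static Class<?> infer(String raw) {
        String s = raw.trim();
        try { Byte.parseByte(s);     return Byte.class;    } catch (NumberFormatException ignored) { }
        try { Short.parseShort(s);   return Short.class;   } catch (NumberFormatException ignored) { }
        try { Integer.parseInt(s);   return Integer.class; } catch (NumberFormatException ignored) { }
        try { Long.parseLong(s);     return Long.class;    } catch (NumberFormatException ignored) { }
        try {
            // Out-of-range floating-point text parses to Infinity instead of throwing,
            // so the range check has to be explicit.
            if (!Float.isInfinite(Float.parseFloat(s)))    return Float.class;
            if (!Double.isInfinite(Double.parseDouble(s))) return Double.class;
        } catch (NumberFormatException ignored) { }
        return String.class; // nothing numeric matched
    }

    public static void main(String[] args) {
        for (String s : new String[]{ " 2 ", " 200 ", "70000", "2e9 ", " 2e99", "abc" })
            System.out.printf("infer(\"%s\") => %s%n", s, infer(s).getSimpleName());
    }
}
```

The integer parsers signal overflow with a NumberFormatException, while the floating-point parsers return an infinite value, which is why the two halves of the cascade are written differently.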
/*
RESOURCES / CITATIONS:
https://stackoverflow.com/questions/13314215/java-how-to-infer-type-from-data-coming-from-multiple-sources
https://stackoverflow.com/questions/36820754/floating-and-double-types-range-in-java
https://stackoverflow.com/questions/17223185/how-can-detect-one-string-contain-one-of-several-characters-using-java-regex
https://stackoverflow.com/questions/20085287/java-how-to-compare-multiple-strings
https://stackoverflow.com/questions/3422673/evaluating-a-math-expression-given-in-string-form
http://www.java2s.com/Tutorials/Java/Scripting_in_Java/0040__Scripting_in_Java_eval.htm
*/
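A value that typify() tags as LocalDateTime is kept in its original text form, so turning it back into an object needs the same formatters that recognized it. The single-argument LocalDateTime.parse(CharSequence) expects ISO-8601 text such as 2014-12-22T14:35:22. The following is a minimal sketch of trying a list of formatters in turn; DateFormatsSketch, its parse() method, and the two-entry list are hypothetical and only mirror the approach of stringAsDate().

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Sketch only: DateFormatsSketch is a hypothetical name, not part of Typifier.
class DateFormatsSketch {

    // Assumed subset of the patterns used in the gist.
    static final List<DateTimeFormatter> FORMATS = Arrays.asList(
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss", Locale.ENGLISH),
        DateTimeFormatter.ofPattern("M/d/yy H:mm", Locale.ENGLISH));

    // Returns the first successful parse, or an empty Optional if no format matches.
    static Optional<LocalDateTime> parse(String text) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDateTime.parse(text, format));
            } catch (DateTimeParseException ex) {
                // this format does not match; try the next one
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String text = "2014-12-22 14:35:22";
        System.out.println("with formatter list: " + parse(text));
        try {
            System.out.println("ISO parse: " + LocalDateTime.parse(text));
        } catch (DateTimeParseException ex) {
            System.out.println("ISO parse rejected the text (it has a space where ISO-8601 expects 'T')");
        }
    }
}
```

Returning an Optional rather than null keeps the "no format matched" case explicit for callers, which matches the return type decodeTypify() already uses.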
@awwsmm
Author

awwsmm commented Jul 19, 2022

@cgivre there you go. I attached an MIT license.

@cgivre

cgivre commented Jul 19, 2022

Thanks @awwsmm !!

@jgenoese

This is excellent and saved me no end of time. I've added a LocalDate (it already had LocalDateTime). I'm going to contribute that back as my way of saying "Thank you"
