@sysr-q
Last active December 19, 2015 02:59
;; require'd [digest] and [base64-clj.core :as base64]
; This excessively large string is an SSH RSA key.
; You're meant to base64 decode then hash the result to get a key fingerprint.
(def k "AAAAB3NzaC1yc2EAAAADAQABAAABAQDt6wSbWhx/lQ1kqzqy4ET7ogSZqZngcDzaYiS8/ZWKgamkt4o9+2RebcysJT2DX/8t0Mif3jovSsUjW+6dCLY8rO0+fGMctwWL4HqAlHFgWY6xA2M4/ZLYvlm53WUKt02ygFeO9M4Fj9w9MoTRhQjS52Z6PA5OE0ppjjupvLWUp3wu23usUUWQucye50mTPBE4tZAbnh+H3w7FXTHOROsNTSuNbDYQ8pPqHy66hbJ5t2Dz5/3yTpL6mzc4rFHJPt5O8Wxlur4kzNSYlhPYsbrDoZF4lpxjdrkrE4qGxJUk48Hufr/VJ+3cafLumk7DsdVNnDeAO8lgpgXh2Hvr9ZYl")
(digest/md5 (base64/decode k)) ; => 8befc4d6d025ea89bd5603ea2bf7ecdd
; My question is this: in any other language, the above gives an output of:
; 287893e77d820df340d92b833e942128
; I've tested (independently and together) both the md5 and base64 functions, to verify that they're not cocking up the output and spewing out random crap, but they're giving valid info
(digest/md5 "hello") ; => 5d41402abc4b2a76b9719d911017c592 (just like in Python)
(base64/decode "aGVsbG8=") ; => "hello" (just like in Python)
; Why is it that in all cases but the one I actually need valid output, they don't reach the right hash?
; The string I'm trying to decode and hash is exactly the same as it is in Python/other places, but in Clojure it refuses to hash into the md5 I want it to. ;_;
; Halp me dobry-den-kanobi, you're my only hope.
@danneu
danneu commented Jun 28, 2013

Change this:

(digest/md5 (base64/decode k))

to this:

(digest/md5 (base64/decode-bytes (.getBytes k)))

I know you would only ask me this as a last resort, so here is a crash course in debugging infuriating byte/string/encoding-level problems, which Clojure actually makes pretty simple.

(class "a")       ;=> String
(class \a)        ;=> Character
(int "a")         ;   Error
(int \a)          ;=> 97
(char 97)         ;=> \a

(seq "hello")     ;=> (\h \e \l \l \o)
(map int "hello") ;=> (104 101 108 108 111)

;; Any time you need an array of bytes (a primitive byte[]), particularly for Java
;; interop, remember this function. With no argument it uses the JVM's default charset.
(.getBytes "hello")

;; You can also build one in Clojure directly, but there's no point.
(byte-array (map byte "hello"))

;; The JVM oddly prints the class of a byte array as [B:
(class (.getBytes "hello"))  ;=> [B

;; Java
new String()

;; Clojure
(String.)

;; Java:
new String("hello".getBytes())  ;=> "hello"

;; Clojure - `String.` is the way you get Strings back from
;;           Java stuff.
(String. (.getBytes "hello"))   ;=> "hello"

Anyway, here's how I debugged it. The first thing I did was take a look at what base64/decode actually returns. REPL output is misleading when debugging this kind of stuff since it, of course, can't display bytes that aren't printable.

But (map int x) makes it trivial to see the actual values.

The 65533 values instantly revealed that shit was cray.

x.core> (map int (base64/decode k))

;=> (0 0 0 7 115 115 104 45 114 115 97 0 0 0 3 1 0 1 0 0 1 1 0 
     65533 65533 4 65533 90 28 127 65533 13 100 65533 58 65533 
     65533 68 65533 4 65533 65533 65533 65533 112 60 65533 98 36 
     65533 65533 65533 65533 61 65533 100 94 109 812 37 61 65533 95 
     65533 45 65533 543 65533 58 47 74 65533 35 91 65533 8 65533 60
     65533 65533 62 124 99 28 65533 5 65533 65533 122 65533 65533 ...)

Bytes of course are represented as 8 bits (00000000), so there are 256 possible combinations.

Java has signed bytes (two's complement), which means that the left-most bit indicates +/-.

  • Languages with unsigned bytes represent their bytes as integers 0 to 255.
  • Languages with signed bytes represent their bytes as integers -128 to 127.
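
To make the two views concrete, here is a small sketch of converting between them. The helper names `unsigned` and `signed` are made up for this example and aren't from any library.

```clojure
;; Signed view (what Java and Clojure show) -> unsigned view (0..255).
;; Masking with 0xff keeps only the low 8 bits.
(defn unsigned [b] (bit-and b 0xff))

;; Unsigned value -> signed Java byte. unchecked-byte wraps around
;; instead of throwing on values above 127.
(defn signed [n] (unchecked-byte n))

(unsigned -19) ;=> 237
(signed 237)   ;=> -19

;; Plain ASCII bytes look the same in both views.
(map unsigned (.getBytes "hello" "UTF-8")) ;=> (104 101 108 108 111)
```

So a -19 in Clojure's output and a 237 in, say, a Python dump are the same byte (0xED).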

No byte can hold a value anywhere near 65533, so integer values this high indicate we're looking at characters, not bytes: the data has been decoded into a String, most likely as UTF-8.

Which means that there's a coercion ->String somewhere.

And since the purpose of Base64 is to encode bytes->ascii and decode ascii->bytes, it doesn't make sense to cast the result of a Base64 decode to a string. Arbitrary bytes generally won't survive it: ASCII only covers 0-127, and bytes above that are only valid UTF-8 as part of specific multi-byte sequences.
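
As a rough illustration of the bytes->ascii->bytes contract, this sketch uses the JDK's own `java.util.Base64` (Java 8+) rather than base64-clj. The sample bytes and the names `raw`, `encoded` and `decoded` are arbitrary picks for the example.

```clojure
(import '[java.util Base64])

;; Arbitrary bytes, including some with the high bit set.
(def raw (byte-array (map unchecked-byte [0 7 0xED 0xEB 0x04 0x9B])))

;; bytes -> ASCII text
(def encoded (.encodeToString (Base64/getEncoder) raw))

;; ASCII text -> bytes, with no String step on the decoded side
(def decoded (.decode (Base64/getDecoder) ^String encoded))

(= (seq raw) (seq decoded)) ;=> true
```

As long as the decoded side stays a byte array, every byte comes back exactly as it went in.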

Sure enough, that's what is going on here.

x.core> (map int (base64/decode-bytes (.getBytes k)))

;=> (0 0 0 7 115 115 104 45 114 115 97 0 0 0 3 1 0 1 0 0 1 1 0 -19
     -21 4 -101 90 28 127 -107 13 100 -85 58 -78 -32 68 -5 -94 4 -103
     -87 -103 -32 112 60 -38 98 36 -68 -3 -107 -118 -127 -87 -92 -73 
     -118 61 -5 100 94 109 -52 -84 37 61 -125 95 -1 45 -48 -56 -97 -34 ...)

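But the new String() constructor decodes the bytes using the platform default charset (UTF-8 here). Bytes with the high bit set show up as negative ints in Java, and wherever they don't form a valid UTF-8 sequence the decoder substitutes U+FFFD, the "replacement character", which, thanks to you, I just learned about (http://stackoverflow.com/questions/3526965/unicode-issue-with-an-html-title-question-mark-65533). It's 65533 in base 10. The few byte pairs that do happen to form a valid sequence get merged into one character, which is where oddballs like 812 come from.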
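
You can reproduce the effect in isolation. This is just a sketch: the charset is passed explicitly so it doesn't depend on the platform default, and the names `bad` and `pair` are made up for the example.

```clojure
;; 0xFF never appears in well-formed UTF-8, so the decoder replaces it.
(def bad (byte-array [(unchecked-byte 0xFF)]))

(map int (String. bad "UTF-8"))                 ;=> (65533)

;; Going back to bytes gives the encoding of U+FFFD, not the original byte.
(seq (.getBytes (String. bad "UTF-8") "UTF-8")) ;=> (-17 -65 -67)

;; 0xCC 0xAC (-52 -84 as signed bytes) happens to be a valid two-byte
;; sequence, so it collapses into the single character U+032C.
(def pair (byte-array (map unchecked-byte [0xCC 0xAC])))

(map int (String. pair "UTF-8"))                ;=> (812)

;; ISO-8859-1 maps every byte to exactly one char, so that round trip is lossless.
(seq (.getBytes (String. bad "ISO-8859-1") "ISO-8859-1")) ;=> (-1)
```

Either way the original bytes are gone once they've been through a UTF-8 String, so the hash of the result can't match.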

x.core> (map int (String. (base64/decode-bytes (.getBytes k))))

;=> (0 0 0 7 115 115 104 45 114 115 97 0 0 0 3 1 0 1 0 0 1 1 0 65533
     65533 4 65533 90 28 127 65533 13 100 65533 58 65533 65533 68 65533
     4 65533 65533 65533 65533 112 60 65533 98 36 65533 65533 65533 65533)

I doubt you needed the history of computing but maybe you are like how I was 5 months ago (before my attempt at implementing Bitcoin -_-) and are a bit rusty.


Finally, this is actually an area where Java libraries are nice: You know exactly what you're getting and they are perf-optimized for you.

For my Bitcoin project, I used the Codec lib from Apache Commons (http://commons.apache.org/). Basically, Apache Commons solves the problem of Java stdlib hell by providing nice wrappers. Definitely look there when you want some included batteries.

For instance:

(ns my-ns.core
  (:import [org.apache.commons.codec.digest DigestUtils]
           [org.apache.commons.codec.binary Base64]))

(def k "AAAAB3NzaC1yc2EAAAADAQABAAABAQDt6wSbWhx/lQ1kqzqy4ET7ogSZqZngcDzaYiS8/ZWKgamkt4o9+2RebcysJT2DX/8t0Mif3jovSsUjW+6dCLY8rO0+fGMctwWL4HqAlHFgWY6xA2M4/ZLYvlm53WUKt02ygFeO9M4Fj9w9MoTRhQjS52Z6PA5OE0ppjjupvLWUp3wu23usUUWQucye50mTPBE4tZAbnh+H3w7FXTHOROsNTSuNbDYQ8pPqHy66hbJ5t2Dz5/3yTpL6mzc4rFHJPt5O8Wxlur4kzNSYlhPYsbrDoZF4lpxjdrkrE4qGxJUk48Hufr/VJ+3cafLumk7DsdVNnDeAO8lgpgXh2Hvr9ZYl")

(DigestUtils/md5Hex (Base64/decodeBase64 (.getBytes k)))
;=> "287893e77d820df340d92b833e942128"

; And of course we can recreate your original problem with an intermediate String-cast.
(DigestUtils/md5Hex (String. (Base64/decodeBase64 (.getBytes k))))
;=> "8befc4d6d025ea89bd5603ea2bf7ecdd"
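
If you'd rather not pull in a dependency at all, here's a minimal sketch using only JDK classes (`java.util.Base64` needs Java 8 or newer). The function names `md5-hex` and `fingerprint` are mine, not from any library.

```clojure
(import '[java.util Base64]
        '[java.security MessageDigest])

(defn md5-hex
  "Lowercase hex MD5 digest of a byte array."
  [^bytes bs]
  (let [digest (.digest (MessageDigest/getInstance "MD5") bs)]
    ;; Mask each signed byte to 0..255 before formatting it as two hex digits.
    (apply str (map #(format "%02x" (bit-and % 0xff)) digest))))

(defn fingerprint
  "Base64-decode straight to bytes and hash them, with no String in between."
  [^String b64]
  (md5-hex (.decode (Base64/getDecoder) b64)))

(fingerprint "aGVsbG8=") ;=> "5d41402abc4b2a76b9719d911017c592"
```

`(fingerprint k)` should then give the same 2878... hash as the Commons Codec version above, since both hash the raw decoded bytes.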
