When we were in the final stages of the virtual piano project, one of the last steps was to map the user's finger points from the color frame to the depth frame. This would let us determine whether a key was actually being pressed rather than just hovered over (which is pretty important if you want a piano to actually play notes). Let's say your hand looked like this, where the green circles denote the centres of your fingertips:
To determine whether a particular key was being pressed, we calculated the difference between the current depth at that finger point and the original depth at the same point when the frame was empty, which we had stored in a 2D numpy array of background depths. If the difference was lower than a specified threshold, the finger was close enough to the surface to count as a press, and you're playing music.
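The check can be sketched like this. Everything here is illustrative: the array shapes, the millimetre units, the fingertip coordinates, and the threshold value are all assumptions, not the project's actual numbers.

```python
import numpy as np

# Hypothetical background depth frame, captured with nothing in view
# (depths in millimetres, shaped like a typical 640x480 depth image).
background = np.full((480, 640), 1500, dtype=np.uint16)

# Current depth frame: a fingertip pressing a key sits only a few mm
# above the surface, so its depth is close to the background depth.
current = background.copy()
current[240, 320] = background[240, 320] - 5  # pressed finger, ~5 mm above

PRESS_THRESHOLD_MM = 15  # assumed threshold; would be tuned per setup

def is_pressed(current_depth, background_depth, point,
               threshold=PRESS_THRESHOLD_MM):
    """Return True if the fingertip at `point` (row, col) is close enough
    to the original empty-frame depth to count as a key press."""
    row, col = point
    # Cast to int before subtracting so uint16 values can't wrap around.
    diff = int(background_depth[row, col]) - int(current_depth[row, col])
    return 0 <= diff < threshold

print(is_pressed(current, background, (240, 320)))  # → True: diff = 5 mm
```

A finger hovering well above the key would leave a large difference between the background depth and the current depth, so the same check returns `False` and no note plays.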