I came across Fabian's nice old blog post on quaternion differentiation:
I wanted to write a quick note on some of the broader context, which hopefully makes the quaternion case look less special.
Given any associative algebra where you can define what exp means, it's always true that d/dt exp(at) = a exp(at), which means the unique solution of x' = ax is x(t) = exp(at) x(0) = exp(a)^t x(0). In fact, it's true even if you work with formal power series, where you treat t as a formal symbol and interpret differentiation as the operator that shifts t^n down to n t^(n-1).
But usually we do want exp to be an analytic function. Examples of associative algebras where we can do this include the real numbers, the complex numbers, the quaternions, and matrix algebras. Matrix algebras are nice because they're very concrete to work with and are in a certain sense universal. For example, you can represent the complex numbers C as a 2x2 real matrix algebra generated by two matrices I and J:
I = [1, 0; 0, 1]
J = [0, -1; 1, 0]
I is the identity matrix and J is the 90-degree counterclockwise rotation matrix. You can see that I behaves like 1 over the complex numbers because I^2 = I and IJ = JI = J, and J behaves like the imaginary unit i because J^2 = -I just like i^2 = -1. Indeed, successive 90-degree rotations give a 180-degree rotation, which is just negation. So, the complex number a + ib can be represented faithfully by the matrix aI + bJ. Addition and multiplication of such matrices then corresponds to addition and multiplication of complex numbers.
If you believe in this isomorphism, it immediately follows from Euler's formula that exp(aJ) = cos(a)I + sin(a)J. Here exp(aJ) is interpreted as a matrix exponential, exp(A) = sum A^n / n!, which as mentioned earlier is the generator of solutions to the differential equation x' = Ax, so that x(t) = exp(At) x(0) = exp(A)^t x(0). You can prove that the power series for exp(A) always converges by showing that the matrix norms of the tail sums go to zero. This reduces the question to the familiar case for the real numbers, where we know it converges.
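If you'd like to see this concretely, here's a quick numerical sanity check (a sketch, assuming numpy and scipy are available) comparing the matrix exponential of aJ against cos(a)I + sin(a)J:

```python
import numpy as np
from scipy.linalg import expm

# J is the 90-degree counterclockwise rotation matrix from above.
I = np.eye(2)
J = np.array([[0.0, -1.0],
              [1.0,  0.0]])

a = 0.7  # an arbitrary angle
lhs = expm(a * J)                    # matrix exponential of aJ
rhs = np.cos(a) * I + np.sin(a) * J  # Euler's formula in matrix form

assert np.allclose(lhs, rhs)
```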
I've been trying to explain that the familiar properties of exp (even the analytical properties) generalize very readily. However, one formula that doesn't generalize is exp(A + B) = exp(A) exp(B). If you try to prove this either by working with the power series or with the differential equation, you'll see that this only works when A and B commute: AB = BA. You can see this if you expand exp(A + B) and exp(A) exp(B) to the second order:
exp(A + B) = 1 + (A + B) + 1/2 (A + B)^2 + O((A+B)^3) = 1 + (A + B) + 1/2 (A^2 + AB + B^2 + BA) + O((A+B)^3)
exp(A) exp(B) = (1 + A + 1/2 A^2 + O(A^3)) (1 + B + 1/2 B^2 + O(B^3)) = 1 + (A + B) + 1/2 (A^2 + B^2 + 2 AB) + O((A+B)^3)
You see they always agree up to first order, but even to second order exp(A + B) = exp(A) exp(B) requires AB + BA = 2 AB and therefore AB = BA. It's useful to define the so-called commutator [A, B] = AB - BA to quantify A's degree of non-commutativity with B. Then [A, B] = 0 if and only if A and B commute. Using this, we can rewrite the relationship as follows:
exp(A) exp(B) = exp(A + B) + 1/2 [A, B] + O((A+B)^3)

So 1/2 [A, B] can be thought of as the second-order correction to the formula for exp(A) exp(B) in terms of exp(A + B). There's an extension of this correction to all higher orders, yielding the exact difference in the limit, using nested commutators starting with A and B. It's called the Baker-Campbell-Hausdorff formula and it's pretty complicated. But the fact that [A, B] controls the second-order correction is usually all you care about.
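You can check the second-order correction numerically (a sketch, assuming numpy and scipy are available): after adding the 1/2 [A, B] term, the residual between exp(tA) exp(tB) and exp(t(A + B)) should shrink like t^3 as t shrinks:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

def correction_error(t):
    # Residual between exp(tA)exp(tB) and exp(t(A+B)) + (t^2/2)[A, B];
    # the BCH formula says this should be O(t^3).
    comm = A @ B - B @ A
    return np.linalg.norm(expm(t*A) @ expm(t*B)
                          - (expm(t*(A+B)) + 0.5 * t*t * comm))

# Halving t should cut the residual by roughly 2^3 = 8.
ratio = correction_error(1e-2) / correction_error(5e-3)
assert 6.0 < ratio < 10.0
```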
Anyway, the complex numbers are a commutative algebra (you can check that [I, J] = 0), so this problem doesn't arise there and hence we do have exp(A + B) = exp(A) exp(B) if A and B are linear combinations of I and J, even though 2x2 matrices don't commute in general (e.g. non-uniform scaling doesn't commute with 90-degree rotation).
Going back to Euler's formula, exp(aJ) = cos(a)I + sin(a)J, let's try to derive it in a way that's probably new to you. The trigonometry shows pretty clearly that exp(aJ) is a rotation matrix, but you can also check that directly. Note that the inverse formula exp(aJ)^(-1) = exp(-aJ) follows from the exp addition formula; it's true even in a non-commutative algebra since a matrix always commutes with itself. Let's prove that exp(aJ) is orthogonal using the differential equation for exp. If x' = Ax then the derivative of x^T x is (A x)^T x + x^T A x = x^T A^T x + x^T A x = x^T (A^T + A) x. Hence the rate of change of the squared length is controlled by the symmetric matrix A^T + A. This is zero if and only if A^T = -A, so that A is skew-symmetric. You can easily verify that J^T = -J, so aJ is indeed skew-symmetric. You can also use the power series to prove exp(aJ)^T = exp(-aJ) = exp(aJ)^(-1). But that isn't very geometrically enlightening and doesn't quantify the rate of length stretching.
So, we've shown that exp(aJ) is an orthogonal matrix. And it cannot contain a reflection since the determinant of the exponential of any matrix is strictly positive, so exp(aJ) must be a pure rotation. The positivity of the determinant of exp follows from a simple continuity argument: For any matrix A, exp(At) = I at t = 0 and exp(At) = exp(A) at t = 1. Since det is continuous and exp(At) is nonsingular for all A and t (its inverse is exp(-At)), det(exp(A)) must have the same sign as det(I) = 1. To change signs it would have to pass through some exp(At) for 0 < t < 1 with det(exp(At)) = 0. As a fun little exercise you can calculate that the derivative of det(exp(At)) at t = 0 is the trace (sum of diagonal entries) tr(A). This shows that the flows generated by x' = Ax are volume preserving if and only if A is traceless, tr(A) = 0. As a special case, tr(J) = 0 since a skew-symmetric matrix like J has all zeroes on the diagonal. We already know the rotations are volume preserving, so it isn't telling us anything new in this case, but there are lots of traceless non-skew-symmetric matrices. For example, any traceless matrix with nonzeroes on the diagonal will do, like A = [a, 0; 0, -a]. This matrix generates a motion that stretches and squashes the x and y directions in inverse proportion: exp(At) = [exp(at), 0; 0, exp(-at)].
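The trace fact has a sharp quantitative version, det(exp(A)) = exp(tr(A)), which packages both the positivity of the determinant and the volume-preservation condition at once. A quick check (a sketch, assuming numpy and scipy are available):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))

# det(exp(A)) = exp(tr(A)) > 0: exp can never produce a reflection.
assert np.isclose(np.linalg.det(expm(A)), np.exp(np.trace(A)))

# A traceless generator gives a volume-preserving flow: det(exp(Bt)) = 1.
B = np.array([[1.0,  0.0],
              [0.0, -1.0]])  # the stretch-and-squash example from the text
assert np.isclose(np.linalg.det(expm(0.3 * B)), 1.0)
```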
Looking back, the only property we used to show that exp(aJ) is a rotation (an orientation-preserving orthogonal matrix) was that aJ is skew-symmetric. So we've in fact shown something much more general: exp(A) is a rotation whenever A is a skew-symmetric matrix. Given the relationship with the differential equation x' = Ax, this shows that the skew-symmetric matrices correspond to generalized angular velocities in Euclidean spaces of any dimension; they're the derivatives of continuous rigid motions. When A = aJ, a is the usual scalar angular velocity of a two-dimensional uniform circular motion, which is just Euler's formula in its more familiar form.
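To see the general claim in action, here's a sketch (assuming numpy and scipy are available) that exponentiates a random skew-symmetric matrix in 5 dimensions and checks that the result is a rotation:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 5))
A = X - X.T        # a random skew-symmetric matrix (A^T = -A)
R = expm(A)

# R is orthogonal (R^T R = I) and orientation-preserving (det R = +1),
# i.e. a rotation in 5-dimensional space.
assert np.allclose(R.T @ R, np.eye(5))
assert np.isclose(np.linalg.det(R), 1.0)
```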
The traceless matrices are only constrained by the linear equation tr(A) = 0, so they live in an (n^2 - 1)-dimensional subspace of the n^2-dimensional space of matrices when the underlying vector space is n-dimensional. The space of skew-symmetric matrices is much smaller: they're defined by their strictly subdiagonal entries, so they live in a space of dimension n-1 + n-2 + ... + 1 = n(n-1)/2. You can verify that for n = 2 it is 1 (which is why 2D angular velocities can be described by scalars) and for n = 3 it is 3 (which is why 3D angular velocities can be described by vectors).
We've used that exp(A) has an inverse exp(A)^(-1) = exp(-A) regardless of whether A is invertible. It's a consequence of the simple fact that the dynamical system described by x' = Ax is time reversible. Almost all non-pathological dynamical systems described by a differential equation are at least locally time reversible, so that part isn't even particular to x' = Ax. For example, x' = f(x) has locally time reversible solutions if f is differentiable: the derivative of f is the Jacobian of the flow, which is the constant matrix A for f(x) = Ax. If f(x0) = 0 at some particular x0, so that x0 is an equilibrium, then f(x0 + x) = f'(x0) x + O(x^2) and hence the linearized dynamical system x' = Ax for A = f'(x0) lets you study the nonlinear system in the neighborhood of an equilibrium. This lets us apply our earlier qualitative knowledge to nonlinear systems. If A is traceless the system is locally volume preserving, if A is skew-symmetric the system is length preserving, and the eigenvalues of A tell you about the local stability/instability: when the eigenvalue is a + ib then exp((a + ib)t) = exp(at) exp(ibt), so a controls the rate of decay (if a < 0) or growth (if a > 0) and b controls the oscillation rate.
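The eigenvalue story is easy to check in the 2x2 case (a sketch, assuming numpy and scipy are available): a matrix of the form aI + bJ has eigenvalues a ± ib and generates a spiraling motion whose decay rate is a and whose oscillation rate is b:

```python
import numpy as np
from scipy.linalg import expm

a, b = -0.5, 2.0
A = np.array([[a, -b],
              [b,  a]])   # A = aI + bJ, eigenvalues a +/- ib

t = 1.3
Et = expm(A * t)

# Since aI commutes with bJ: exp(At) = e^{at} (cos(bt) I + sin(bt) J),
# a decaying rotation (a spiral sink, because a < 0).
expected = np.exp(a*t) * np.array([[np.cos(b*t), -np.sin(b*t)],
                                   [np.sin(b*t),  np.cos(b*t)]])
assert np.allclose(Et, expected)
```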
You might have become curious about a certain inverse problem: If exp(A) is orthogonal for every skew-symmetric A, does that mean every orthogonal matrix is the exponential of some skew-symmetric matrix? That is, do orthogonal matrices always have "logarithms"? Well, we learned earlier that orientation-reversing orthogonal matrices like reflections can't be the exp of any A, so clearly some things must be left out (like a generalized version of the negative numbers for exp over the real numbers). So, what if we restrict to orientation-preserving orthogonal matrices? There's a general theorem (which is too hard to prove here) that a matrix group like this is the surjective exponential image of an associated algebra if the group is path connected (any element can be continuously deformed into any other element while never leaving the group) and compact (closed and bounded). The orthogonal group of nxn matrices O(n) is easily verified to be compact, but it is not path connected since it has two disjoint components corresponding to determinant +1 and -1. However, if you pick out the determinant 1 component of special orthogonal matrices, SO(n), which is a group in its own right, then you get path connectedness and surjectivity is guaranteed.
You can at least make this last result plausible by counting the degrees of freedom in O(n) compared to the set of skew-symmetric matrices. Let's construct an orthogonal matrix column by column in the Gram-Schmidt style. The first column has n entries with a unit length constraint, so there are n-1 degrees of freedom. The second column must be in the orthogonal complement of the first column, so it has one fewer degree of freedom, hence n-2 in total. And so on. This shows that the dimension of O(n) is n-1 + n-2 + ... + 1 = n(n-1)/2, same as for the set of skew-symmetric matrices. SO(n) has the same dimension as O(n); what distinguishes O(n) from SO(n) is a discrete 1-bit degree of freedom, not a continuous degree of freedom. Not coincidentally, n(n-1)/2 = (n choose 2) is the dimension of the algebra of bivectors in an n-dimensional space. That's another way of working with angular velocities, which I won't pursue here; the bivector perspective is beautiful and useful, but there's something to be said for low-brow matrix algebra and most of what we're doing can be copy-pasted for bivectors. Anyway, this sort of dimension numerology would be extremely suggestive if you didn't already know of this three-way relationship.
Let's talk a bit more about Euler's formula before covering its equivalent for higher dimensions. From the power series for exp, it's immediately clear that exp(aJ) has the form cI + sJ for unknown scalars c, s since J^2 = -I and hence the even terms contribute to c and the odd terms contribute to s, with alternating signs. In 3 dimensions, we know from the dimension counting that we'll need 3 generators to span the skew-symmetric matrices. In fact, we can get a set of generators by using skew-symmetric 1x1 blocks and 2x2 skew-symmetric blocks in a 3x3 block-diagonal matrix. The only 1x1 skew-symmetric block is 0 and J is the generator of 2x2 skew-symmetric matrices. Thus multiplying a vector by a standard generator is equivalent to taking the cross product with the standard basis vector corresponding to the position of the 1x1 block: it kills the vector's component along the invariant axis (since the derivative of an invariant component should be 0) and rotates the component in the 2-plane by 90 degrees counterclockwise.
Let P, Q, R be these skew-symmetric standard generators. Then any skew-symmetric matrix has the form xP + yQ + zR, corresponding to the angular velocity vector [x, y, z]. You can calculate that xP + yQ + zR = [0, -z, y; z, 0, -x; -y, x, 0], which is called the hat matrix for [x, y, z]. Earlier we noted that J^2 = -I in the two dimensional case because a 180-degree rotation in the plane is vector negation. This is a bit different in the 3x3 case. P generates rotations around the x axis, so the yz component is rotated, but the x component is killed. Concretely:
P = [0, 0, 0; 0, 0, -1; 0, 1, 0]
P^2 = [0, 0, 0; 0, -1, 0; 0, 0, -1]
P^3 = [0, 0, 0; 0, 0, 1; 0, -1, 0] = -P
We're back where we started but with a minus sign. This is just saying that rotating three times counterclockwise by 90 degrees is the same as rotating clockwise once by 90 degrees. This geometric interpretation both illuminates what's going on with the algebra and makes it clear that it must be true for generators in any plane, not just P. When we collapsed the power series for exp(aJ) using J^2 = -I, we got Euler's formula. Here we have something very similar, but we have to include the square as well. Hence if A is any skew-symmetric matrix corresponding to a 90-degree rotation in a plane, we have exp(aA) = I + sA + (1-c)A^2 where the coefficients c = cos(a) and s = sin(a) follow from calibrating to the 2D case since A^2 collapses to -I there. This is usually called Rodrigues's formula. Note that the reasoning that led us to this formula only depended on A being a skew-symmetric generator of a 2-plane rotation, so it works in any dimension.
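Here's a quick check of Rodrigues's formula against the matrix exponential (a sketch, assuming numpy and scipy are available):

```python
import numpy as np
from scipy.linalg import expm

def rodrigues(a, A):
    # exp(aA) = I + sin(a) A + (1 - cos(a)) A^2 for a skew-symmetric
    # generator A of a single 2-plane rotation (so that A^3 = -A).
    n = A.shape[0]
    return np.eye(n) + np.sin(a) * A + (1.0 - np.cos(a)) * (A @ A)

# P generates rotations around the x axis, as in the text.
P = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])

a = 1.1
assert np.allclose(rodrigues(a, P), expm(a * P))
```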
I've used the commutator [A, B] = AB - BA a few times so far. If A and B are skew-symmetric then (AB)^T = B^T A^T = (-B)(-A) = BA, so AB is only skew-symmetric if A and B anti-commute, BA = -AB. But the commutator [A, B] is always skew-symmetric: [A, B]^T = (AB)^T - (BA)^T = BA - AB = -[A, B]. You can see that 1/2 [A, B] = AB when A and B anti-commute. The fact that the skew-symmetric matrices are closed under commutators but not under matrix products suggests treating the commutator as a kind of generalized matrix product. The most unusual thing about doing this is that the commutator product is non-associative in general. That is, [A, [B, C]] is not necessarily the same as [[A, B], C]. You can verify that in the 3D case we have [P, Q] = R, [Q, R] = P and [R, P] = Q, so the commutator product is isomorphic to the cross product for the corresponding axis vectors. This also gives an example of non-associativity: [P, P] = 0, so [[P, P], Q] = [0, Q] = 0 but [P, [P, Q]] = [P, R] = -[R, P] = -Q. The vector cross product is of course non-associative.
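To make this concrete, here's a sketch (assuming numpy is available) checking that the hat matrices reproduce the cross product, that the commutator mirrors the cross product on the generators, and the non-associativity example:

```python
import numpy as np

def hat(v):
    # The skew-symmetric matrix xP + yQ + zR for v = [x, y, z].
    x, y, z = v
    return np.array([[0.0,  -z,   y],
                     [  z, 0.0,  -x],
                     [ -y,   x, 0.0]])

P, Q, R = hat([1, 0, 0]), hat([0, 1, 0]), hat([0, 0, 1])

def comm(A, B):
    return A @ B - B @ A

# Multiplying by hat(v) is the cross product with v.
v, w = np.array([1.0, 2.0, 3.0]), np.array([-0.5, 4.0, 1.5])
assert np.allclose(hat(v) @ w, np.cross(v, w))

# The commutator product mirrors the cross product on the axis vectors.
assert np.allclose(comm(P, Q), R)
assert np.allclose(comm(Q, R), P)
assert np.allclose(comm(R, P), Q)

# Non-associativity: [[P, P], Q] = 0 but [P, [P, Q]] = -Q.
assert np.allclose(comm(comm(P, P), Q), 0 * Q)
assert np.allclose(comm(P, comm(P, Q)), -Q)
```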
While the commutator product is non-associative in general, it's not free-for-all chaos. For any given A, there's an associated linear map d_A(X) = [A, X]. It's fruitful to think of this as the derivative along A. With this interpretation, you'd expect some kind of Leibniz product rule for derivatives. It's called the Jacobi identity and says that d_A([B, C]) = d_B([A, C]) - d_C([A, B]) or equivalently that [A, [B, C]] = [B, [A, C]] - [C, [A, B]]. You can verify it holds for the simplest non-associative case with P, Q, R: [P, [Q, R]] = [P, P] = 0, [Q, [P, R]] = [Q, -Q] = 0, [R, [P, Q]] = [R, R] = 0. You can attack the general case directly using skew-symmetry, but it's messy. V. I. Arnold once made a tantalizing statement in passing: "The Jacobi identity says the altitudes of a triangle intersect". He was talking about the vector cross product specifically, but maybe there's an appropriate generalization to skew-symmetric matrices or bivectors in higher dimensions. You can see a simple proof for the cross product here: http://www.maths.manchester.ac.uk/~khudian/Etudes/Geometry/jacidentandheights1.pdf
Continuing on from earlier, it's pretty easy to see (and quite plausible) that exp is surjective "near the identity" but the global result is less trivial in its full generality. It basically says that in a group like this, if you can reach a point by a potentially very circuitous path in the group, you can also reach the same point by shooting an inertial particle in an appropriate direction. "Inertial" means constant velocity in an appropriate sense, e.g. constant angular velocity for rotations.
A nice pay-off from this way of thinking is that you can regard a free rigid body (tumbling freely but not necessarily rotating around a stable axis) as inertial motion in the phase space of a "particle" where the "position" X is a rotation matrix and the "velocity" V is a skew-symmetric matrix and there is a symmetric positive definite mass matrix M corresponding to the tensor of inertia. The mass matrix M defines an inner product on velocities <V, W> = tr(V^T M W). This is just the usual Frobenius inner product of matrices tr(V^T W) with the mass matrix interposed. This creates a relationship between velocity and momentum space: The momentum matrix L corresponding to V is V^T M. Then L V = V^T M V is a symmetric positive semidefinite matrix and 1/2<V, V> is the kinetic energy; it is nonnegative because the trace equals the sum of eigenvalues, which are nonnegative for a semidefinite matrix.
The standard generator of angular velocity in 2D is J. You can verify by a direct computation that if M = diag(a, b) then 1/2<J, J> = 1/2 tr(M) = 1/2(a + b). Thus the mass matrix only affects kinetic energy through its trace. That makes sense: for the inertia it only matters how the mass is distributed in the plane, not along which particular axes.
The standard generators of angular velocity in 3D are P, Q, R. With the mass matrix M = diag(1, 2, 3), you can verify that 1/2<P, P> = 2.5, 1/2<Q, Q> = 2.0 and 1/2<R, R> = 1.5. Kinetic energy scales quadratically with speed, e.g. 1/2<2Q, 2Q> = 8. Spinning around the x axis is hardest (requires the most kinetic energy for a given angular rate) because most of the mass is spread out in the yz plane. You can see how this generalizes the 2D case. We get the trace of the mass matrix in the plane orthogonal to the axis of rotation. This can be understood more deeply by using the cyclic property of the trace: tr(ABC) = tr(BCA) = tr(CAB). Applied to <V, V>, we get tr(V^T M V) = tr(M V V^T). You can verify that V V^T = -V^2 is proportional to an orthogonal projection; when V is a standard generator that rotates by 90 degrees in its 2-planes then V^2 rotates 180 degrees in its 2-planes and kills the non-rotated components. For example, P P^T = [0, 0, 0; 0, 1, 0; 0, 0, 1] projects to the yz plane.
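You can reproduce these numbers with a few lines (a sketch, assuming numpy is available):

```python
import numpy as np

def hat(v):
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

P, Q, R = hat([1, 0, 0]), hat([0, 1, 0]), hat([0, 0, 1])
M = np.diag([1.0, 2.0, 3.0])  # the mass matrix from the text

def kinetic_energy(V):
    # 1/2 <V, V> = 1/2 tr(V^T M V)
    return 0.5 * np.trace(V.T @ M @ V)

assert np.isclose(kinetic_energy(P), 2.5)      # yz plane carries masses 2, 3
assert np.isclose(kinetic_energy(Q), 2.0)      # xz plane: masses 1, 3
assert np.isclose(kinetic_energy(R), 1.5)      # xy plane: masses 1, 2
assert np.isclose(kinetic_energy(2 * Q), 8.0)  # quadratic in angular speed

# Cyclic trace trick: tr(V^T M V) = tr(M V V^T), with P P^T the yz projection.
assert np.allclose(P @ P.T, np.diag([0.0, 1.0, 1.0]))
```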
Can you guess how this generalizes to 4D? Rotations in higher dimensions can be decomposed into a bunch of simultaneous 2-plane rotations, in which case the orthogonal matrix can be expressed as a block diagonal matrix consisting of 1x1 identity matrices and 2x2 rotation matrices, and hence the kinetic energy will be the sum of the 2-dimensional moments of inertia in each rotational 2-plane. That's what the trace inner product based on the mass matrix ends up computing. 4D rotations don't necessarily have an invariant axis of rotation. In 5D, like in 3D, any rotation will have at least one invariant axis along with up to two simultaneous 2-plane rotations; this is true in any odd-dimensional space. One unfamiliar thing about higher dimensional rotations starting in 4D is isoclinic rotations. Suppose you rotate in the xy plane and in the zw plane by the same amount. Then the xy plane and zw plane are of course invariant, but there's actually an infinite family of invariant 2-planes. For example, if you rotate a vector of the form [a, b, a, b] you get [a', b', a', b'], both of which lie in the 2-plane cut out by the linear equations x = z, y = w. This is a diagonal 2-plane that makes a 45-degree angle with both the xy plane and zw plane. Rotating clockwise in xy and counterclockwise in zw by the same amount yields another kind of isoclinic rotation.
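Here's a minimal check of the isoclinic case (a sketch, assuming numpy is available): rotating xy and zw by the same angle maps vectors of the form [a, b, a, b] to vectors of the same form, so the diagonal 2-plane x = z, y = w is invariant:

```python
import numpy as np

def rot2(a):
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

a = 0.8
# Isoclinic 4D rotation: rotate the xy and zw planes by the same angle.
Riso = np.zeros((4, 4))
Riso[:2, :2] = rot2(a)
Riso[2:, 2:] = rot2(a)

v = np.array([1.0, 2.0, 1.0, 2.0])  # a vector of the form [a, b, a, b]
w = Riso @ v

# The image has the same form, so the diagonal 2-plane is invariant.
assert np.isclose(w[0], w[2]) and np.isclose(w[1], w[3])
```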
I mentioned that rotations in odd dimensions always have an invariant axis. That's a simple theorem with many important implications, so let's see why it's true. Any rotation has the form exp(A) for some skew-symmetric A. An invariant vector x of exp(A) is a null vector of A: x' = 0 implies Ax = 0. So we have to show that any skew-symmetric matrix in an odd-dimensional space is singular. This follows from taking determinants of A^T = -A: det(A^T) = det(A) but det(-A) = (-1)^n det(A). If n is odd then (-1)^n = -1 and hence det(A) = -det(A) and det(A) = 0. Therefore A is singular and must have a null vector. But the skew-symmetry of the derivative only forces a single invariant axis for odd-dimensional rotations. You can have an invariant axis combined with an even-dimensional rotation in the orthogonal complement, and we know that even-dimensional rotations need not have invariant axes (e.g. concatenate 2x2 rotation blocks).
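Both halves of the argument are easy to check numerically (a sketch, assuming numpy is available): an odd-dimensional skew-symmetric matrix is singular, and its null vector is an invariant axis:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 5))
A = X - X.T   # a random skew-symmetric matrix in odd dimension

# det(A) = det(A^T) = det(-A) = (-1)^5 det(A), so det(A) must be 0.
assert np.isclose(np.linalg.det(A), 0.0)

# Extract the null vector: the right singular vector for the
# (numerically) zero singular value. It satisfies Ax = 0, so it is
# an invariant axis of the rotation exp(A).
_, _, Vt = np.linalg.svd(A)
axis = Vt[-1]
assert np.allclose(A @ axis, np.zeros(5), atol=1e-10)
```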
Let's also say something about invariant 2-planes. For any matrix A, A + A^T is symmetric and A - A^T is skew-symmetric. When A is orthogonal, this means that A + A^(-1) is symmetric and A - A^(-1) is skew-symmetric. From the real spectral theorem, we know that a symmetric matrix like A + A^(-1) is orthogonally diagonalizable with real eigenvalues. Let x be an eigenvector with eigenvalue r. Then (A + A^(-1)) x = r x, and multiplying by A we get A^2 x = r Ax - x. This shows that A^n x for n >= 2 is a linear combination of x and Ax. This reduction of higher powers is something we saw earlier in the Euler-Rodrigues formula. Here the implication is that x and Ax span an invariant 2-plane of A (or an invariant axis if Ax is proportional to x); higher powers of A applied to x don't move us out of the plane. Geometry tells us that a 2-dimensional rotation has no preferred directions, so in fact all vectors in the invariant 2-plane must be eigenvectors of A + A^(-1) with the same eigenvalue. Hence A + A^(-1) in a diagonalizing basis just has a 2x2 diagonal block B = r I. By calibrating to the 2-dimensional case with cos(a)I + sin(a)J you see that B = 2 cos(a) I, so the eigenvalue is r = 2 cos(a). What about A - A^(-1)? It's diagonalizable, but only over the complex numbers since it has imaginary eigenvalues. You can show that it gives a block 2 sin(a) J. This corresponds to the complex conjugate formulas z + z* = 2 Re(z) and z - z* = 2i Im(z) combined with Euler's formula exp(ia) = cos(a) + i sin(a).
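As a final check (a sketch, assuming numpy and scipy are available), here's the eigenvalue structure of A + A^(-1) for a 3D rotation by angle a: the invariant axis contributes eigenvalue 2, and the rotation 2-plane contributes the double eigenvalue 2 cos(a):

```python
import numpy as np
from scipy.linalg import expm

# P generates rotations around the x axis, as in the text.
P = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])

a = 0.9
A = expm(a * P)          # rotation by angle a around the x axis

S = A + np.linalg.inv(A)  # symmetric, since A is orthogonal
eigs = np.sort(np.linalg.eigvalsh(S))

# Eigenvalue 2 on the invariant axis, 2 cos(a) (doubled) on the 2-plane.
assert np.allclose(eigs, np.sort([2.0, 2*np.cos(a), 2*np.cos(a)]))
```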