One of my favorite components of research is learning about the mathematics behind the algorithms I’m implementing. I might be an odd one out — many programmers I’ve talked to tend to focus on how to use their tools best, skimming over exactly how their tools work in the first place.
But for me, I like how clean the mathematics behind programs can be. Understanding that math allows me to use algorithms in entirely new ways.
Recently, I’ve been learning about how to match two 3D point sets that differ by a rigid rotation and translation. This is incredibly useful for all sorts of vision algorithms. In my situation, I have been using it to quickly calibrate ground truth using AR tags.
In the world of computer vision and tracking systems, it is common to establish a local frame based on internal sensors. However, in order to know where you are in a global frame, you need to be able to identify particular landmarks in your scene that have known locations. Depending on your algorithm, these landmarks might look a bit different, but I am using AR tags for now. They are easy to use, and can be detected fairly quickly. If you haven’t seen one, here’s what they look like:
Each one has a unique ID (for this particular set of 5×5 markers, you can have up to 1024 unique tags!) and common libraries like OpenCV do a decent job at tracking the pose of these tags when printed onto paper.
Because these are identifiable, I know two things: where I believe the tracker is in my local frame, and where the tracker should be in the global frame. When we collect many of these, we can find a rotation and translation that maps us from our local frame to the global frame by minimizing the error in positions of the trackers. So how do we minimize this error?
I found an old, but quite reliable (and simple!) solution here: http://post.queensu.ca/~sdb2/PAPERS/PAMI-3DLS-1987.pdf
Here’s the basic idea: we can model a transformation between the local and global frame as:
p' = Rp + T + N
p' is the position where we observe a tracker to be,
p is the position of where we know the tracker is in the global frame,
T are rotation and translations, and
N is some noise that may be introduced by small errors in computer vision. Our job is to figure out what R and T should be — and remember, they will need to be the same for each tracker!
The translation is actually pretty easy to find: given these two sets of points, the centroid (or average position) of each set would need to be the same. So, we just set
T to be the difference in the centroids of the two sets.
The rotation is a bit trickier, depending on your background in mathematics. After we center our observed point set around the centroid, we use singular value decomposition to find the optimal rotation. I won’t get into too many details here, but the idea is that we can break a matrix down into three sub-matricies, UWV*, where U and V* are rotation matrices and W is a diagonal matrix which represents any scaling factor. Here’s a picture from Wikipedia that may help you visualize this:
While calculating the SVD of a matrix is somewhat difficult, fortunately many people have implemented this! There are several libraries available that can do this. For my C#/Unity-based project, I used Math.NET. This library has a pretty nice and easy way to set up matrices and perform operations on them.
We get the SVD of a matrix H, whose construction is detailed in the paper above, and use the two rotational components, VU*, to find our optimal rotation.
And that’s it! Well… almost.
Sometimes, we find ourselves in a case where VU* is not the correct rotation. In some cases, this is because there is too much noise to create a good rotation. In other times, perhaps the point set is co-linear, in which case there are infinitely many rotations that work! (Hint: avoid making your point sets co-linear!)
However, there is one case we can address, and that is when VU* represents a rotation and a reflection. We can check that by looking at the determinant of VU*. If it is negative, then we have a reflection, and one of the singular values of H is zero. We find the singular value that is zero, and flip the corresponding column in V. This adjusts for the reflection.
That’s it! You apply these optimal rotation and translation matrices to your points, and they should match up pretty well.
All in all, I have found this algorithm easy to implement, fast, and reliable. I would definitely give it a try if you find yourself looking to solve a similar problem.
If you have any questions, or if I made a mistake somewhere, feel free to let me know in the comments.
Banner image credit to https://timefirevr.com/