Convergence

The recent explosion of deep learning has demonstrated its awesome power, but data-driven machine learning isn't the end of the story.

Machine learning is only one piece of the puzzle. In the future it will simply be, as Ken has put it, "like a sine function": something that finds use everywhere. It will be used like mathematical glue, joining the many components of complex systems for the future of extended reality across edge computing, distributed networking, graphics, vision, and machine learning. Consequently, disentangling these supposedly separable disciplines will become nearly impossible.

One such illustration is 3D pose estimation, which we've eyed for constructing animations from videos of people dancing and performing other actions. Recovering the poses per frame requires a high-performance GPU, yet it's unlikely that people will keep using desktop computers en masse now that we all carry smartphones in our pockets. The characters reconstructed via vision and machine learning would therefore need to feed into graphics algorithms running elsewhere, on a lightweight processor, to produce animated characters. To really breathe life into these characters you need to employ procedural techniques, which would let you interact with them, poke and yank them, have them walk around on your table or climb your couch…perhaps even tap dance with the cat.

David Byrne – Talking Heads, Once in a Lifetime

As a specific demonstration, I'll discuss an experiment I ran. Above is a short clip of David Byrne's eccentric dance moves in the Talking Heads music video for Once in a Lifetime. I fed this source to a 3D pose reconstruction network and extracted the following:

Above is the raw output of the predictions, which has no temporal coherence. I experimented with least-squares smoothing over a sliding window, and eventually found the following to be my preferred result.

Smoothed raw 3D joints output
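The post doesn't include the exact smoothing code, but a minimal sketch of sliding-window least-squares smoothing over per-frame joint predictions might look like this (the function name and the `window`/`degree` parameters are my own choices, not the original experiment's settings):

```python
import numpy as np

def smooth_joints(joints, window=9, degree=2):
    """Sliding-window least-squares smoothing of per-frame 3D joints.

    joints: array of shape (frames, n_joints, 3), the raw network output.
    For each frame, fit a low-degree polynomial to every coordinate over
    a centered window of frames, then replace the frame with the fitted
    value. This suppresses per-frame jitter while preserving the motion.
    """
    frames = joints.shape[0]
    half = window // 2
    flat = joints.reshape(frames, -1)      # (frames, n_joints * 3)
    smoothed = np.empty_like(flat)
    for t in range(frames):
        lo, hi = max(0, t - half), min(frames, t + half + 1)
        xs = np.arange(lo, hi)
        # Least-squares polynomial fit per coordinate over the window
        # (np.polyfit accepts a 2-D y and fits each column independently).
        coeffs = np.polyfit(xs, flat[lo:hi], degree)
        smoothed[t] = np.polyval(coeffs, t)
    return smoothed.reshape(joints.shape)
```

Because each frame is re-fit from its neighbors rather than replaced by a running average, genuine fast motion survives better than it would under a simple box filter; tightening `window` or raising `degree` trades smoothness for responsiveness.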

My point here is that you can't solve everything with purely data-driven ML approaches; even with recent advances in reinforcement learning, the need for fine-grained artistic control over the semantics is unassailable. As with style transfer techniques, for now about the only control you retain is your choice of target style image; the algorithms themselves work as a black box, and unpredictably so. Of course you could add more layers or yet another network module, but the network above was already using several GB of memory, and adding temporal dimensions to such a network is only barely achievable on consumer hardware (and that's assuming you have a Titan RTX with 24 GB of GPU memory lying around).

From my perspective, AI and ML aren't here to replace humans in the content creation loop, but to empower the artist with more powerful brushes in the designer's toolkit. Procedural techniques alone can be quite limited, both by computation and by how difficult it is to make them respond to the artist's input. If we embrace combining ML with traditional graphics and procedural techniques, we can work towards this human-driven use of ML.

Ben Ahlbrand