Predicting what comes next, and doing it accurately, is certainly exciting but difficult… though perhaps less so for Facebook! Facebook AI recently developed the Anticipative Video Transformer (AVT), an end-to-end attention-based model for anticipating actions in video…
A machine learning model for anticipating activities
For applications ranging from self-driving cars to augmented reality, it is important that artificial intelligence systems can anticipate people’s future actions. Consider a simple situation: an autonomous vehicle stopped at a stop sign must judge whether a pedestrian is about to cross the street. Anticipating future activities is a difficult challenge for AI, however, as it requires both predicting the multimodal distribution of possible future activities and modeling the progression of past actions.
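To see why the full distribution matters, here is a toy sketch (our illustration, not Facebook’s code; the labels and scores are invented) of a model that outputs probabilities over candidate next actions rather than a single guess:

```python
import torch

# Toy illustration with invented numbers: an anticipation model outputs a
# probability distribution over candidate next actions, because several
# futures can be plausible at once (the distribution is multimodal).
next_actions = ["cross the street", "wait at the curb", "turn back"]
logits = torch.tensor([1.2, 1.1, -0.8])   # hypothetical model scores
probs = torch.softmax(logits, dim=0)

for action, p in zip(next_actions, probs):
    print(f"{action}: {p.item():.2f}")
# "cross the street" and "wait at the curb" come out almost equally likely,
# so acting on the argmax alone would discard most of what the model knows.
```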
To address this challenge, Rohit Girdhar of Facebook AI Research and Kristen Grauman of the University of Texas at Austin developed the Anticipative Video Transformer (AVT).
Facebook has unveiled this new machine learning model, which is capable of predicting future actions from visual interpretation of a scene. The model builds on recent advances in Transformer architectures, particularly for natural language processing (NLP) and image modeling, and uses end-to-end attention to anticipate actions in video.
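To give a sense of what an end-to-end attention model for anticipation can look like, here is a minimal PyTorch sketch under our own assumptions; the class name, dimensions, and single classification head are illustrative, and this is not Facebook’s released AVT implementation:

```python
import torch
import torch.nn as nn

class CausalVideoAnticipator(nn.Module):
    """Minimal sketch: a causal Transformer over per-frame features that
    predicts the upcoming action at every step. Illustrative only."""

    def __init__(self, feat_dim=512, num_actions=100, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, num_actions)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim), e.g. from an image backbone.
        t = frame_feats.size(1)
        # Boolean causal mask: True marks positions a frame may NOT attend to,
        # so each frame sees only itself and the past, never the future.
        causal_mask = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=frame_feats.device),
            diagonal=1,
        )
        h = self.encoder(frame_feats, mask=causal_mask)
        return self.head(h)  # per-frame logits over the *next* action

model = CausalVideoAnticipator()
clip_feats = torch.randn(2, 16, 512)     # 2 clips x 16 frames x 512-d features
next_action_logits = model(clip_feats)   # shape: (2, 16, 100)
```

The causal mask is the key design choice here: because each frame only attends to the past, the same model can be rolled forward frame by frame to anticipate actions it has not yet observed.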
What will Facebook’s AVT be used for?
Viewed as a small cog in Facebook’s larger machine for augmented reality (AR) and the metaverse, AVT analyzes an ongoing activity to surface its possible outcomes.
“We train the model to predict future actions and features using three losses. First, we classify the features of the last frame of the video clip to predict the labeled future action. Second, we regress the features of intermediate frames to those of later frames, which trains the model to predict what will happen next. Third, we train the model to classify the intermediate actions. We found that by jointly optimizing all three losses, our model anticipates future actions 10–30% better than a model trained with only bidirectional attention.”
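Translated into code, the three losses in the quote might be combined roughly as follows. This is our hedged reading of the description, not Facebook’s training code; the function name `avt_style_losses`, the tensor shapes, and the equal weighting of the terms are all assumptions:

```python
import torch
import torch.nn.functional as F

def avt_style_losses(pred_feats, input_feats, action_logits,
                     future_action, intermediate_actions):
    """Sketch of the three-loss objective described in the quote above.

    pred_feats:           (batch, time, dim)  features predicted by the model
    input_feats:          (batch, time, dim)  observed per-frame features
    action_logits:        (batch, time, num_actions)
    future_action:        (batch,)            label of the upcoming action
    intermediate_actions: (batch, time - 1)   frame labels, -1 where unlabeled
    """
    # 1) Classify the last frame's features to predict the future action.
    loss_future = F.cross_entropy(action_logits[:, -1], future_action)

    # 2) Regress each frame's predicted feature onto the *next* frame's
    #    observed feature, training the model to predict what comes next.
    loss_feats = F.mse_loss(pred_feats[:, :-1], input_feats[:, 1:].detach())

    # 3) Classify the intermediate actions on frames that carry labels.
    num_actions = action_logits.size(-1)
    loss_inter = F.cross_entropy(
        action_logits[:, :-1].reshape(-1, num_actions),
        intermediate_actions.reshape(-1),
        ignore_index=-1,
    )
    # Equal weighting is an assumption; the actual loss weights may differ.
    return loss_future + loss_feats + loss_inter

# Smoke test with random tensors, just to show the expected shapes.
B, T, D, A = 2, 16, 512, 100
loss = avt_style_losses(
    torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, A),
    torch.randint(0, A, (B,)), torch.randint(-1, A, (B, T - 1)),
)
```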
Essentially, the tool could warn people in time, before they make a mistake. Facebook gives a concrete example: if you are about to grab a hot pan, AVT could warn you, based on your previous interactions with that pan.
It might also help you save water when washing cooking utensils, and more generally help you avoid hazards in the kitchen: the technique alerts you whenever you are about to do something that could harm you.
In a professional setting, many tasks come with a step-by-step procedure to follow. New recruits at a company, for example, could carry out their duties more easily because the technology would tell them what to do next. This could prove more efficient than a physical manual, since the instructions would be displayed right in front of you, as if they were part of the real world.
In the context of AR glasses, this could provide a range of useful cues to guide people through a variety of tasks, whether at home or at work.
The Facebook team also believes that AVT could be useful for tasks beyond anticipation, such as:
- Self-supervised learning, a machine learning method in which models learn from unlabeled data;
- Discovering action schemas and boundaries;
- General action recognition in tasks that involve modeling the temporal progression of actions.
It will take some time before this technology is actually deployed. Still, the project highlights Facebook’s ongoing AI development, as well as the kind of features likely to appear in the next phase of its AR glasses project.