Predicting what comes next is exciting but difficult… though perhaps not for Facebook! Facebook AI recently unveiled the Anticipative Video Transformer (AVT), an end-to-end attention-based model for anticipating actions in video.
A machine learning process to estimate activities
For applications ranging from self-driving cars to augmented reality, it is important that artificial intelligence systems can predict people’s future actions. Consider a simple situation: an autonomous vehicle stopped at a stop sign needs to judge whether a pedestrian is about to cross the street. Anticipating future activity is nevertheless a hard problem for AI, since it requires both modeling the course of past actions and predicting the distribution of possible future ones.
To address this challenge, two researchers, Rohit Girdhar of Facebook AI Research and Kristen Grauman of the University of Texas at Austin, developed the Anticipative Video Transformer (AVT).
Facebook unveiled this machine learning model, which is capable of predicting future actions from visual input alone. AVT builds on recent advances in transformer architectures, particularly in natural language processing (NLP) and image modeling, and applies end-to-end attention to action anticipation in video.
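The key architectural idea behind an anticipative transformer is that each frame may only attend to itself and earlier frames, so the model is forced to anticipate rather than peek at the future. Below is a minimal, illustrative sketch of such causal self-attention over frame features; the function name, single-head form, and absence of learned projections are simplifications for clarity, not Facebook's actual implementation.

```python
import numpy as np

def causal_attention(x):
    """Single-head causal self-attention over a sequence of frame features.

    x: (T, d) array of per-frame features. Positions above the diagonal
    of the score matrix (i.e., future frames) are masked out, so frame t
    aggregates information only from frames 0..t.
    Illustrative sketch only -- the real AVT uses multi-head attention
    with learned query/key/value projections.
    """
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)             # (T, T) pairwise similarities
    mask = np.triu(np.ones((T, T)), k=1)      # 1s above the diagonal = future
    scores = np.where(mask == 1, -np.inf, scores)
    # softmax over the unmasked (past) positions in each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x
```

Because the first frame can attend only to itself, its output is simply its own feature vector, while later frames blend progressively more context from the past.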
What will Facebook’s AVT be used for?
Viewed as a small cog in Facebook’s much larger augmented reality (AR) and metaverse machine, AVT analyzes an ongoing activity to anticipate its possible outcomes.
“We train the model to predict future actions and features using three losses. First, we classify the features of the last frame of the video clip to predict the labeled future action; second, we regress the features of intermediate frames to the features of later frames, allowing the model to predict what will happen next; third, we train the model to classify intermediate actions. We have shown that by optimizing all three losses together, our model predicts future actions 10–30% better than a model trained with only bidirectional attention.”
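The three objectives in the quote can be sketched as a single combined training loss. The function below is a hypothetical illustration (the shapes, names, and equal weighting are assumptions, not Facebook's code): a cross-entropy loss on the final frame for the future action, a regression loss matching each frame's predicted features to the next observed frame's features, and a cross-entropy loss on the action at every intermediate frame.

```python
import numpy as np

def avt_style_losses(frame_feats, predicted_feats, frame_logits,
                     action_labels, future_label):
    """Sketch of the three training objectives described above.

    frame_feats:     (T, d) observed per-frame features
    predicted_feats: (T, d) model's predicted next-frame features
    frame_logits:    (T, C) per-frame action class scores
    action_labels:   (T,)   per-frame action labels
    future_label:    int    label of the future action to anticipate
    Hypothetical shapes/names; equal loss weights assumed for simplicity.
    """
    def cross_entropy(logits, label):
        z = logits - logits.max()                  # numerical stability
        log_probs = z - np.log(np.exp(z).sum())
        return -log_probs[label]

    # Loss 1: classify the last frame's features to predict the future action
    loss_future = cross_entropy(frame_logits[-1], future_label)

    # Loss 2: predicted features at frame t should match the observed
    # features at frame t+1 (simple mean-squared-error regression)
    loss_feat = np.mean((predicted_feats[:-1] - frame_feats[1:]) ** 2)

    # Loss 3: classify the action at each intermediate frame
    loss_interm = np.mean([cross_entropy(l, y)
                           for l, y in zip(frame_logits[:-1],
                                           action_labels[:-1])])

    return loss_future + loss_feat + loss_interm
```

Optimizing the three terms jointly is what pushes the model to both recognize what is happening now and anticipate what comes next.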
Essentially, the tool would warn people before they make a mistake. Facebook gives a concrete example: if you were about to grab a hot pan, AVT could warn you based on your previous interactions with that pan. It might also help you avoid waste, such as leaving the water running while washing utensils, and more generally help you avoid hazards in the kitchen by alerting you when you are about to perform an action that could result in injury.
In professional settings, many tasks amount to a step-by-step procedure to follow. New recruits at a company, for example, could get up to speed more easily because the technology would tell them what to do next. This could be more efficient than a traditional manual, since the instructions would appear right in front of you, as if they were part of the real world.
In the context of AR glasses, this can provide a range of useful metrics to help guide people in performing a variety of tasks, whether at home or at work.
The Facebook team also believes that AVT can be useful for tasks other than anticipation, such as:
- Self-supervised learning, a method of machine learning in artificial intelligence;
- Discovering action boundaries and patterns;
- General action recognition in tasks that involve modeling a chronological sequence of actions.
It will take some time before this technology reaches products. Still, the project highlights the continuing evolution of Facebook’s AI, as well as features likely to appear in the next phase of its AR glasses projects.