What if security cameras didn’t just capture video but could understand what’s happening, distinguishing between routine and potentially dangerous activities in real time? That future is being shaped by researchers at the University of Virginia School of Engineering and Applied Science, who have developed an AI-powered intelligent video analyzer that can detect human actions in video footage with unprecedented precision.
The system, called the Semantic and Motion-Aware Spatiotemporal Transformer Network (SMAST), could help with everything from powering surveillance systems and improving public safety to enabling more advanced motion tracking in the medical field and helping autonomous vehicles navigate complex environments, promising a wide range of benefits to society.
“This AI technology opens the door to real-time action detection in some of the most demanding environments,” said Scott T. Acton, professor in the Department of Electrical and Computer Engineering and principal investigator on the project. “This is the kind of advancement that could prevent accidents, improve diagnosis, and even save lives.”
AI-driven innovation for complex video analysis
So how does it work? At the core of SMAST is artificial intelligence. The system relies on two main components to detect and understand complex human behavior. The first is a multiple feature selective attention model. This helps the AI focus on the most important parts of the scene (like people and objects) while ignoring unnecessary details. This allows the system to more accurately identify what’s going on, such as recognizing that someone is throwing a ball rather than just moving their arm.
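The selective-attention idea described above can be sketched roughly as follows. This is a minimal illustrative example, not the paper’s actual model: the function name `selective_attention`, the feature shapes, and the top-k masking step are all assumptions made for demonstration.

```python
import numpy as np

def selective_attention(features, query, top_k=2):
    """Score each region feature against a query and keep only the
    top_k most relevant regions, zeroing out the rest.

    features: (n_regions, dim) array of per-region feature vectors
    query:    (dim,) vector representing what the model attends to
    """
    # Scaled dot-product relevance scores, as in standard attention
    scores = features @ query / np.sqrt(features.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Selective step: mask out all but the top_k regions, renormalize
    keep = np.argsort(weights)[-top_k:]
    mask = np.zeros_like(weights)
    mask[keep] = weights[keep]
    mask /= mask.sum()

    # Weighted sum of the surviving region features
    return mask @ features, mask

# Four regions (people/objects); region 0 and region 2 resemble the query,
# so the "ignore unnecessary details" step zeroes out regions 1 and 3.
features = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.0, 0.0])
context, mask = selective_attention(features, query, top_k=2)
```

Here `mask` ends up nonzero only for the two regions most relevant to the query, mirroring how the model can focus on a thrown ball and the throwing arm while ignoring background clutter.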
The second key feature is a motion-aware 2D position-encoding algorithm that helps the AI track how objects move over time. Imagine watching a video where people constantly change positions. This tool helps the AI remember those movements and understand how they relate to each other. By integrating these capabilities, SMAST can accurately recognize complex actions in real time, making it more effective in high-stakes scenarios such as surveillance, medical diagnostics, and autonomous driving.
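The motion-aware position-encoding idea can be sketched in the same spirit. Again, this is a hedged sketch under stated assumptions: the sinusoidal encoding, the use of frame-to-frame displacement as the motion signal, and the additive fusion are illustrative choices, not the published algorithm.

```python
import numpy as np

def sinusoidal_2d_encoding(x, y, dim=8):
    """Standard sinusoidal encoding of an (x, y) position: half the
    channels encode x, half encode y, at geometrically spaced frequencies."""
    half = dim // 2
    freqs = 1.0 / (10000 ** (np.arange(half // 2) * 2.0 / half))
    def enc(v):
        return np.concatenate([np.sin(v * freqs), np.cos(v * freqs)])
    return np.concatenate([enc(x), enc(y)])

def motion_aware_encoding(track, dim=8):
    """Encode each position in a track together with its motion: the code
    for where an object is now is fused with the code for its displacement
    since the previous frame, so two objects at the same spot but moving
    differently receive different encodings."""
    codes = []
    prev = track[0]
    for (x, y) in track:
        dx, dy = x - prev[0], y - prev[1]
        pos = sinusoidal_2d_encoding(x, y, dim)
        mot = sinusoidal_2d_encoding(dx, dy, dim)
        codes.append(pos + mot)  # simple additive fusion (an assumption)
        prev = (x, y)
    return np.stack(codes)

# Two objects pass through the same point (5, 5) at frame 1, but with
# different motion; a purely positional code could not tell them apart.
track_a = [(4.0, 5.0), (5.0, 5.0)]  # moving right
track_b = [(5.0, 6.0), (5.0, 5.0)]  # moving down
codes_a = motion_aware_encoding(track_a)
codes_b = motion_aware_encoding(track_b)
```

The point of the example is the disambiguation: at frame 1 both objects share a position, so their positional codes match, but their displacement codes differ, which is the kind of information that lets a transformer relate movements across frames.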
SMAST redefines how machines detect and interpret human behavior. Current systems struggle with chaotic, unedited, continuous video footage, where the context of events is often lost. SMAST’s design, by contrast, leverages AI components that learn and adapt from data, capturing the dynamic relationships between people and objects with remarkable accuracy.
Setting a new standard for behavioral detection technology
This technological leap means AI systems can identify behaviors such as runners crossing the road, doctors giving precise treatments, and even safety threats in crowded spaces. SMAST has already outperformed top solutions across key academic benchmarks including AVA, UCF101-24, and EPIC-Kitchens, setting new standards for accuracy and efficiency.
“The implications for society could be huge,” said Matthew Korban, a postdoctoral researcher in Acton’s lab working on the project. “We are excited to see how this AI technology will transform the industry, making video-based systems more intelligent and capable of real-time understanding.”
The work is described in the IEEE Transactions on Pattern Analysis and Machine Intelligence article “A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection.” The paper’s authors are Matthew Korban, Peter Youngs, and Scott T. Acton of the University of Virginia.
This project was supported by the National Science Foundation (NSF) under Grant 2000487 and Grant 2322993.