
Innovative Approaches to Anomaly Detection in Video Surveillance

Testing various models for detecting unusual activities in video data.

Fabien Poirier




Since I couldn’t access real surveillance cameras during my studies, all the videos I used were ones I downloaded. Video data needs a lot of computing power to process. Unfortunately, I didn’t have a fancy GPU server at my company or research lab, so I had to make do with a regular computer with 32 GB of RAM and a decent Intel Core i9 processor. I also had an Nvidia GeForce RTX 2080 graphics card, which gave me a bit of a boost.

In this part, we'll talk about the tests I ran where I combined two models to check how well they detected unusual activities (or anomalies). I'll explain how the results changed depending on whether I ran the models in parallel or one after the other. Then, I'll share the experiments I did on detecting objects and anomalies, which helped me figure out which models worked best for each situation. Finally, I'll wrap up with a summary of everything.

Data Preprocessing

Here, I'll explain how I got the data ready for testing.

As mentioned earlier, videos often show normal actions interrupted by unusual ones. Because of this, I had to cut the videos into pieces to focus on the unusual activities and analyze them better. Even with this approach, handling all this video data was a real headache due to how much there was.

Loading all the videos at once was out of the question because my computer's memory just couldn't handle it. To deal with this issue, I decided to use a special tool called a generator. Think of it as a waiter bringing you dishes one at a time instead of serving the whole meal at once.
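To make that concrete, here is a minimal sketch of what such a generator can look like in Python with OpenCV. It is not the exact generator from the thesis: the batch size, frame count, image size, and function name are placeholder values I chose for illustration.

```python
import cv2
import numpy as np

def video_batch_generator(video_paths, labels, batch_size=4, num_frames=20, size=(112, 112)):
    """Yield small batches of (frames, label) pairs so the whole dataset
    never has to sit in RAM at once."""
    while True:  # Keras-style generators loop forever; the training call decides when to stop
        for start in range(0, len(video_paths), batch_size):
            batch_x, batch_y = [], []
            for path, label in zip(video_paths[start:start + batch_size],
                                   labels[start:start + batch_size]):
                cap = cv2.VideoCapture(path)
                frames = []
                while len(frames) < num_frames:
                    ok, frame = cap.read()
                    if not ok:
                        break
                    frames.append(cv2.resize(frame, size) / 255.0)  # normalize to [0, 1]
                cap.release()
                if len(frames) == num_frames:
                    batch_x.append(frames)
                    batch_y.append(label)
            if batch_x:
                yield np.array(batch_x, dtype=np.float32), np.array(batch_y)
```

The point is simply that only one small batch of decoded frames exists in memory at any moment.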

My first test was to see how different generators worked. I tried four different types:

  1. A generator that creates video sequences by moving a window along the video.
  2. A generator that also uses a sliding window but overlaps the sequences.
  3. A generator that uses a dynamic step to collect images from each video.
  4. A generator that combines the sliding window with the dynamic step.

The third generator turned out to be the best option. Why? Because it solves some big problems that the others have. For the first two generators, the time it takes for the models to learn depends on how long the videos are—long videos mean longer learning times. Also, deciding the size of the sliding window is tricky; it needs to capture the entire action, or the model might not learn properly.

Of course, the third generator isn’t perfect either. It has its own issue when dealing with videos of different lengths: for short actions, the time between sampled images is short, while for longer videos there can be a long gap between them. So short videos get a finely sampled sequence, while long videos are sampled more coarsely and can miss brief details.

This third generator lets me create sequences whose images don't have to be consecutive frames, which is handy. By changing the step size, I can decide how quickly I want to detect things, whether per sequence or per video. This flexibility lets me handle both finite videos and ongoing streams.
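Here is a rough sketch of that dynamic-step idea: instead of sliding a fixed window, it spreads a fixed number of frames evenly across the whole clip, so every video yields a sequence of the same length. The frame count and image size are illustrative, not the thesis's values.

```python
import cv2
import numpy as np

def sample_with_dynamic_step(video_path, num_frames=20, size=(112, 112)):
    """Pick `num_frames` images spread evenly across the video, so short and
    long videos both yield a sequence of the same length."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Dynamic step: the gap between sampled frames grows with the video length.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size) / 255.0)
    cap.release()
    return np.array(frames, dtype=np.float32)
```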

Once I picked my generator, I had to decide on the size of my images. I adjusted the size and found the best size to be... well, let's say it worked perfectly.

For the sequence size, I tested various lengths between 15 and 30 images. Since my videos run at 30 frames per second, I found that using 20 images per sequence was optimal.

To see how well each experiment ran, I created my own method to define the step used when picking images. This way, I could test my models under the same conditions as during training, whether I was checking a whole video or working with sequences. Since my main goal is to help people keep an eye on continuous streams, I'll focus on performance related to sequence detection.
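As an illustration of that idea, a helper along these lines can pick the sampling step depending on whether you want one verdict per video or a stream of verdicts per sequence; the function name and the fixed stream step are mine, not the thesis's.

```python
def choose_step(total_frames, seq_len=20, mode="video"):
    """Return the gap between sampled frames.
    - "video": spread seq_len frames over the whole clip (one decision per video).
    - "stream": keep a fixed step so a sliding window gives one decision per sequence."""
    if mode == "video":
        return max(1, total_frames // seq_len)
    return 1  # fixed step for continuous streams; raise it to trade latency for context
```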

After setting up my sequences, I made my data a bit more varied by applying some augmentation techniques, like mirror effects, zooms, and brightness changes. This multiplied my data without straying too far from reality.
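The augmentations themselves are standard; a sketch like the following (plain OpenCV and NumPy, with parameter values chosen arbitrarily) applies the same mirror, zoom, and brightness change to every frame so the clip stays coherent over time.

```python
import cv2
import numpy as np

def augment_sequence(frames, flip=True, zoom=1.1, brightness=1.2):
    """Apply the same mirror / zoom / brightness change to every frame of a
    sequence so the clip stays temporally consistent."""
    out = []
    for frame in frames:
        f = frame
        if flip:
            f = cv2.flip(f, 1)  # horizontal mirror
        if zoom and zoom > 1.0:
            h, w = f.shape[:2]
            ch, cw = int(h / zoom), int(w / zoom)
            y0, x0 = (h - ch) // 2, (w - cw) // 2
            f = cv2.resize(f[y0:y0 + ch, x0:x0 + cw], (w, h))  # central crop, then resize back
        f = np.clip(f * brightness, 0.0, 1.0)  # assumes frames already scaled to [0, 1]
        out.append(f)
    return np.array(out, dtype=np.float32)
```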

Next, I looked at different ways to preprocess the data, meaning how to clean and prepare the images before feeding them to the models. I started with standard tricks from the computer vision world: optical flow (measuring how things move between images), inter-image differences, and masks.
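For reference, both of those classical steps are essentially one-liners in OpenCV. The sketch below assumes same-sized BGR frames and reuses the Farneback parameters from the OpenCV documentation examples.

```python
import cv2

def motion_features(prev_frame, frame):
    """Classical preprocessing: inter-image difference and Farneback optical flow.
    Both inputs are expected as BGR frames of the same size."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev_gray, gray)  # inter-image difference
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)  # dense (dx, dy) per pixel
    return diff, flow
```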

My findings? The optical flow method didn't impress me at all: it stayed at 50% accuracy throughout training. The inter-image difference method brought slightly better results, but they got worse when I added a mask. Surprisingly, the best results came from using data augmentation alone. Without any preprocessing, my accuracy was low even though recall was high; with just data augmentation, I got decent accuracy and solid recall.

I then moved on to some more advanced preprocessing methods using specialized models. I tried DINO, a self-supervised Vision Transformer. It took too much processing time to be practical for my real-time needs, although it did handle gunshot detection well without needing any extra training.

For the fight detection, things didn't go as well, leading me to exclude Vision Transformers from my plans and focus on models that suited my needs better.

Now, let’s talk about YOLO, my go-to for spotting objects in my videos and figuring out what people are doing. It did a good job, so I added it into my setup.
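As an illustration of that per-frame detection step, here is how a YOLO call can look using the ultralytics Python package. Treat the model file and wrapper as stand-ins: the thesis used its own YOLO setup (including YOLOv7 for pose), not necessarily this package.

```python
from ultralytics import YOLO  # illustration only; not necessarily the wrapper used in the thesis

model = YOLO("yolov8n.pt")  # any pretrained detection checkpoint works for this sketch

def detect_objects(frame, conf=0.55):
    """Return (class_name, confidence, [x1, y1, x2, y2]) tuples for one frame."""
    result = model(frame, conf=conf, verbose=False)[0]
    detections = []
    for box in result.boxes:
        cls_id = int(box.cls[0])
        detections.append((result.names[cls_id], float(box.conf[0]),
                           box.xyxy[0].tolist()))
    return detections
```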

Experiments in Series Mode

Now, it's time to check out how my models performed when stacked one after another. I started by breaking my videos down into individual frames, which I ran through YOLO to spot the objects in each image. After that, I put the video back together and ran it through my other model, CGRU (a convolutional recurrent network combining VGG19 and a GRU), to check for anomalies.
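Put together, the series pipeline looks roughly like the sketch below. Here `detector` is the per-frame detection helper sketched earlier and `cgru_model` stands in for the trained temporal model; both names are placeholders of mine.

```python
import cv2
import numpy as np

def series_pipeline(frames, detector, cgru_model):
    """Series mode: annotate every raw BGR frame with YOLO boxes first, then feed
    the reassembled, normalized sequence to the temporal model."""
    processed = []
    for frame in frames:  # frames are uint8 BGR images here
        annotated = frame.copy()
        for _, _, (x1, y1, x2, y2) in detector(frame):
            cv2.rectangle(annotated, (int(x1), int(y1)), (int(x2), int(y2)),
                          (0, 255, 0), 2)  # draw each detected object
        processed.append(annotated.astype(np.float32) / 255.0)
    sequence = np.expand_dims(np.array(processed), axis=0)  # batch dimension of 1
    return cgru_model.predict(sequence)
```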

I compared this method with using CGRU by itself, and the results were pretty revealing. Adding YOLO barely changed how the model performed, which suggests the model wasn't really paying attention to the bounding boxes during training. So I refined my preprocessing by using the bounding boxes to create masks that kept only the parts of the image containing the objects and removed as much background as possible.

For any frame where nothing was detected, I had two choices: keep the original image or replace it with a black one. I tested both options, and while they performed similarly, I saw slight differences. Using a black background seemed to improve the detection of normal actions, but it hurt accuracy for fights and fires. This likely happens when important details are missed during detection, causing the model to lose crucial information.
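The masking step can be as simple as the sketch below: keep the pixels inside the detected boxes, blank out the rest, and pick a fallback when nothing was detected. The function name and flag are mine, chosen to mirror the two options described above.

```python
import numpy as np

def mask_background(frame, boxes, keep_original_if_empty=True):
    """Keep only the pixels inside detected boxes; everything else becomes black.
    If nothing was detected, either keep the frame or return a black image."""
    if not boxes:
        return frame if keep_original_if_empty else np.zeros_like(frame)
    mask = np.zeros(frame.shape[:2], dtype=bool)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = True
    masked = np.zeros_like(frame)
    masked[mask] = frame[mask]
    return masked
```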

I quickly realized that the object detection parameters, like the confidence and overlap (IoU) thresholds, were essential. I set the confidence threshold to 55 percent for my tests.

Now, when it came to detecting actions performed by people, YOLO version 7 stepped up by outlining their "skeletons," similar to what OpenPose does. So I ran pose estimation on my videos to see how it affected behavior detection. To keep things simple, I focused only on the Fight and Gunshot classes, since not all anomalies involve people.
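For completeness, here is what extracting those skeletons can look like with an ultralytics pose checkpoint; again, this is a stand-in for the YOLOv7 pose estimation used in the thesis, and the model file is an assumption.

```python
from ultralytics import YOLO  # illustration only; the thesis used YOLOv7's pose variant

pose_model = YOLO("yolov8n-pose.pt")

def extract_skeletons(frame):
    """Return one (num_keypoints, 2) array of (x, y) joints per detected person."""
    result = pose_model(frame, verbose=False)[0]
    if result.keypoints is None:
        return []
    return [kp.cpu().numpy() for kp in result.keypoints.xy]
```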

Then I tested whether pose estimation made things better. Initially, I took out the backgrounds from the videos to sharpen detection, but soon I realized that could throw off detecting other types of anomalies. So, I brought the background back in and retrained my model to see if it could still spot things like fires.

In terms of overall performance, adding the fire class didn’t change much, but the results showed a drop in detecting gunshots since some of them were tagged as fires. This led me to swap my multi-class model out for a normal/abnormal setup to see how YOLO influenced things. I trained two new models—one without backgrounds and another with them.

Regardless of preprocessing, grouping related types of anomalies together consistently improved my models' performance, while including unrelated anomalies, like fires, tended to hurt the results. I also noticed that using YOLO to prepare my data boosted accuracy.

Parallel Mode Processing

Next, I decided to run my models in parallel. The idea was to detect objects at the same time as analyzing the temporal dimension, then combine the two results to improve accuracy. My first experiment combined a CGRU trained on the "gunshot" category with YOLO, using a simple rule: if the model predicted "normal" but YOLO spotted a gun, the output switched to "gunshot."
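That fusion rule is just a conditional override. A sketch of it, assuming YOLO class names like "gun" or "firearm" (the actual label set isn't given here):

```python
def fuse_parallel(cgru_label, detections, conf_threshold=0.55):
    """Override the temporal model's 'normal' verdict when YOLO sees a firearm."""
    gun_seen = any(name in {"gun", "firearm"} and conf >= conf_threshold
                   for name, conf, _ in detections)
    if cgru_label == "normal" and gun_seen:
        return "gunshot"
    return cgru_label
```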

For fire detection, I did the same. I evaluated how well this combination worked for each video sequence and set a confidence threshold at 55 percent.

The results for detecting fire were promising. The combination of CGRU and YOLO improved fire detection, while gunshot detection didn’t show any change. At first glance, it seemed that both models were picking up the same features for gunshot detection, which indicated how important YOLO's precision was for the overall performance.

I decided to tweak the rules for detecting gunshots a bit. Since YOLO gives me the coordinates of every detected object, I thought: maybe a gunshot should only register if the gun is near a person. So I trained a fresh model that included images of guns along with some other images to see how it performed.
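A simple way to encode "near a person" is to compare box centres, as in the sketch below; the class names and the pixel threshold are illustrative values of mine, not the thesis's.

```python
def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def gun_near_person(detections, max_dist=150):
    """True if any detected gun's centre lies within `max_dist` pixels of a person's centre.
    The distance threshold is an arbitrary illustration value."""
    guns = [box_center(b) for name, _, b in detections if name in {"gun", "firearm"}]
    people = [box_center(b) for name, _, b in detections if name == "person"]
    return any(((gx - px) ** 2 + (gy - py) ** 2) ** 0.5 <= max_dist
               for gx, gy in guns for px, py in people)
```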

My new model for detecting people did a better job than before, though the firearm detection still had its ups and downs. When I compared the results, my new model's performance for gunshot detection saw a nice little boost.

Then I looked into cutting down false positives. After a fresh round of evaluations, I did see fewer false alarms, but true positives dipped slightly too, which shows how much the overall result depends on YOLO's precision.

Given the gravity of the anomalies I was trying to detect, I wanted to keep the model that had the lowest false negative rate, even if it meant allowing some false alerts. Plus, training a single object detection model to cover all my anomalies would just make life simpler.

Performance Comparison for Each of My Models

Time to share how all the models performed overall! I had three distinct ones for detecting fights, gunshots, and fires. I'll evaluate them on video classification (one detection per video) and on their ability to spot anomalies in a continuous stream (one detection per sequence).

For fights, the model did well when classifying whole videos, hitting about 85.6% accuracy, but fell to 63.1% on continuous streams. Gunshot detection reached 86.5% accuracy on whole videos and rose to 91.8% on sequences. Fire detection was solid too, scoring 83.8% on videos and 86.0% on sequences, making it a reliable performer.

When I combined all my datasets for a multi-class model, I noticed some interesting trends. Despite having more data, detection performance dropped for both the fire class and gunshot class in continuous streams, but overall my multi-class model held up decently.

Looking at videos of real-life incidents, my multi-class model performed respectably. The speeds at which it could process data weren't bad either, recording detection times between 104 and 744 milliseconds.

Conclusions

Through all these tests and adjustments, what have I learned? If you simply want to detect any incident, a binary model (normal/anomalous) is the way to go. It might not pinpoint exactly what went wrong, but it covers all bases.

On the other hand, if your goal is to spot a specific type of anomaly, like a fight or a fire, sticking to a specialized model should yield better results. If you want to mix together all kinds of anomalies and have a human figure out the details later, a normal/abnormal model suits your needs perfectly.

In short, experimenting with these models has been a wild ride. Real-time detection isn't always perfect, but with the right tweaks and approaches, we can get close enough to provide valuable insights for keeping an eye on safety.

Original Source

Title: Real-Time Anomaly Detection in Video Streams

Abstract: This thesis is part of a CIFRE agreement between the company Othello and the LIASD laboratory. The objective is to develop an artificial intelligence system that can detect real-time dangers in a video stream. To achieve this, a novel approach combining temporal and spatial analysis has been proposed. Several avenues have been explored to improve anomaly detection by integrating object detection, human pose detection, and motion analysis. For result interpretability, techniques commonly used for image analysis, such as activation and saliency maps, have been extended to videos, and an original method has been proposed. The proposed architecture performs binary or multiclass classification depending on whether an alert or the cause needs to be identified. Numerous neural network models have been tested, and three of them have been selected. You Only Look Once (YOLO) has been used for spatial analysis, a Convolutional Recurrent Neural Network (CRNN) composed of VGG19 and a Gated Recurrent Unit (GRU) for temporal analysis, and a multi-layer perceptron for classification. These models handle different types of data and can be combined in parallel or in series. Although the parallel mode is faster, the serial mode is generally more reliable. For training these models, supervised learning was chosen, and two proprietary datasets were created. The first dataset focuses on objects that may play a potential role in anomalies, while the second consists of videos containing anomalies or non-anomalies. This approach allows for the processing of both continuous video streams and finite videos, providing greater flexibility in detection.

Authors: Fabien Poirier

Last Update: 2024-11-29 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2411.19731

Source PDF: https://arxiv.org/pdf/2411.19731

Licence: https://creativecommons.org/licenses/by-nc-sa/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
