
Innovative Approaches to Anomaly Detection in Video Surveillance

Testing various models for detecting unusual activities in video data.

Fabien Poirier




Since I couldn’t access real surveillance cameras during my studies, all the videos I used were ones I downloaded. Video data needs a lot of computing power to process. Unfortunately, I didn’t have a fancy GPU server at my company or research lab, so I had to make do with a regular computer with 32 GB of RAM and a decent Intel Core i9 processor. I also had an Nvidia GeForce RTX 2080 graphics card, which gave me a bit of a boost.

In this part, we'll talk about the tests I ran where I combined two models to check how well they detected unusual activities (or anomalies). I'll explain how the results changed depending on whether I ran the models in parallel or one after the other. Then, I'll share the experiments I did on detecting objects and anomalies, which helped me figure out which models worked best for each situation. Finally, I'll wrap up with a summary of everything.

Data Preprocessing

Here, I'll explain how I got the data ready for testing.

As mentioned earlier, videos often show normal actions interrupted by unusual ones. Because of this, I had to cut the videos into pieces to focus on the unusual activities and analyze them better. Even with this approach, handling all this video data was a real headache due to how much there was.

Loading all the videos at once was out of the question because my computer's memory just couldn't handle it. To deal with this issue, I decided to use a special tool called a generator. Think of it as a waiter bringing you dishes one at a time instead of serving the whole meal at once.
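To make that concrete, here is a minimal sketch of what such a generator can look like in Python with OpenCV. It is not the exact generator from the thesis: the batch size, frame count, image size, and function name are placeholder values I chose for illustration.

```python
import cv2
import numpy as np

def video_batch_generator(video_paths, labels, batch_size=4, num_frames=20, size=(112, 112)):
    """Yield small batches of (frames, label) pairs so the whole dataset
    never has to sit in RAM at once."""
    while True:  # Keras-style generators loop forever; the training call decides when to stop
        for start in range(0, len(video_paths), batch_size):
            batch_x, batch_y = [], []
            for path, label in zip(video_paths[start:start + batch_size],
                                   labels[start:start + batch_size]):
                cap = cv2.VideoCapture(path)
                frames = []
                while len(frames) < num_frames:
                    ok, frame = cap.read()
                    if not ok:
                        break
                    frames.append(cv2.resize(frame, size) / 255.0)  # normalize to [0, 1]
                cap.release()
                if len(frames) == num_frames:
                    batch_x.append(frames)
                    batch_y.append(label)
            if batch_x:
                yield np.array(batch_x, dtype=np.float32), np.array(batch_y)
```

The point is simply that only one small batch of decoded frames exists in memory at any moment.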

My first test was to see how different generators worked. I tried four different types:

  1. A generator that creates video sequences by moving a window along the video.
  2. A generator that also uses a sliding window but overlaps the sequences.
  3. A generator that uses a dynamic step to collect images from each video.
  4. A generator that combines the sliding window with the dynamic step.

The third generator turned out to be the best option. Why? Because it solves some big problems that the others have. For the first two generators, the time it takes for the models to learn depends on how long the videos are—long videos mean longer learning times. Also, deciding the size of the sliding window is tricky; it needs to capture the entire action, or the model might not learn properly.

Of course, the third generator isn’t perfect either. It has its own issue when dealing with videos of different lengths: for short actions, the time between sampled images is short, while for longer videos there can be a long gap between them. So short videos get a finely sampled sequence, while long videos are sampled more coarsely and can miss brief details.

This third generator lets me create sequences whose images don't have to be consecutive frames, which is handy. By changing the step size, I can decide how quickly I want to detect things, whether per sequence or per video. This flexibility lets me handle both finite videos and ongoing streams.
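Here is a rough sketch of that dynamic-step idea: instead of sliding a fixed window, it spreads a fixed number of frames evenly across the whole clip, so every video yields a sequence of the same length. The frame count and image size are illustrative, not the thesis's values.

```python
import cv2
import numpy as np

def sample_with_dynamic_step(video_path, num_frames=20, size=(112, 112)):
    """Pick `num_frames` images spread evenly across the video, so short and
    long videos both yield a sequence of the same length."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Dynamic step: the gap between sampled frames grows with the video length.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size) / 255.0)
    cap.release()
    return np.array(frames, dtype=np.float32)
```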

Once I picked my generator, I had to decide on the size of my images. I adjusted the size and found the best size to be... well, let's say it worked perfectly.

For the sequence size, I tested various lengths between 15 and 30 images. Since my videos run at 30 frames per second, I found that using 20 images per sequence was optimal.

To see how well each experiment ran, I created my own method to define the step used when picking images. This way, I could test my models under the same conditions as during training, whether I was checking a whole video or working with sequences. Since my main goal is to help people keep an eye on continuous streams, I'll focus on performance related to sequence detection.
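As an illustration of that idea, a helper along these lines can pick the sampling step depending on whether you want one verdict per video or a stream of verdicts per sequence; the function name and the fixed stream step are mine, not the thesis's.

```python
def choose_step(total_frames, seq_len=20, mode="video"):
    """Return the gap between sampled frames.
    - "video": spread seq_len frames over the whole clip (one decision per video).
    - "stream": keep a fixed step so a sliding window gives one decision per sequence."""
    if mode == "video":
        return max(1, total_frames // seq_len)
    return 1  # fixed step for continuous streams; raise it to trade latency for context
```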

After setting up my sequences, I made my data a bit more varied by applying some augmentation techniques, like mirror effects, zooms, and brightness changes. This multiplied my data without straying too far from reality.
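The augmentations themselves are standard; a sketch like the following (plain OpenCV and NumPy, with parameter values chosen arbitrarily) applies the same mirror, zoom, and brightness change to every frame so the clip stays coherent over time.

```python
import cv2
import numpy as np

def augment_sequence(frames, flip=True, zoom=1.1, brightness=1.2):
    """Apply the same mirror / zoom / brightness change to every frame of a
    sequence so the clip stays temporally consistent."""
    out = []
    for frame in frames:
        f = frame
        if flip:
            f = cv2.flip(f, 1)  # horizontal mirror
        if zoom and zoom > 1.0:
            h, w = f.shape[:2]
            ch, cw = int(h / zoom), int(w / zoom)
            y0, x0 = (h - ch) // 2, (w - cw) // 2
            f = cv2.resize(f[y0:y0 + ch, x0:x0 + cw], (w, h))  # central crop, then resize back
        f = np.clip(f * brightness, 0.0, 1.0)  # assumes frames already scaled to [0, 1]
        out.append(f)
    return np.array(out, dtype=np.float32)
```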

Next, I looked at different ways to preprocess the data, meaning how to clean and prepare the images before feeding them to the models. I started with standard tricks from the computer vision world: optical flow (measuring how things move between images), inter-image differences, and masks.
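For reference, both of those classical steps are essentially one-liners in OpenCV. The sketch below assumes same-sized BGR frames and reuses the Farneback parameters from the OpenCV documentation examples.

```python
import cv2

def motion_features(prev_frame, frame):
    """Classical preprocessing: inter-image difference and Farneback optical flow.
    Both inputs are expected as BGR frames of the same size."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev_gray, gray)  # inter-image difference
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)  # dense (dx, dy) per pixel
    return diff, flow
```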

My findings? The optical flow method didn't impress me at all: it stayed at 50% accuracy throughout training. The inter-image difference method brought slightly better results, but they got worse when I added a mask. Surprisingly, the best results came from using data augmentation alone. Without any preprocessing, my accuracy was low even though recall was high; with just data augmentation, I got decent accuracy and solid recall.

I then moved on to some more advanced preprocessing methods using specialized models. I tried DINO, a self-supervised Vision Transformer. It took too much processing time to be practical for my real-time needs, although it did handle gunshot detection well without needing any extra training.

For the fight detection, things didn't go as well, leading me to exclude Vision Transformers from my plans and focus on models that suited my needs better.

Now, let’s talk about YOLO, my go-to for spotting objects in my videos and figuring out what people are doing. It did a good job, so I added it into my setup.
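As an illustration of that per-frame detection step, here is how a YOLO call can look using the ultralytics Python package. Treat the model file and wrapper as stand-ins: the thesis used its own YOLO setup (including YOLOv7 for pose), not necessarily this package.

```python
from ultralytics import YOLO  # illustration only; not necessarily the wrapper used in the thesis

model = YOLO("yolov8n.pt")  # any pretrained detection checkpoint works for this sketch

def detect_objects(frame, conf=0.55):
    """Return (class_name, confidence, [x1, y1, x2, y2]) tuples for one frame."""
    result = model(frame, conf=conf, verbose=False)[0]
    detections = []
    for box in result.boxes:
        cls_id = int(box.cls[0])
        detections.append((result.names[cls_id], float(box.conf[0]),
                           box.xyxy[0].tolist()))
    return detections
```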

Experiments in Series Mode

Now, it's time to check out how my models performed when stacked one after another. I started by breaking my videos down into individual frames, which I ran through YOLO to spot the objects in each image. After that, I put the video back together and ran it through my other model, CGRU (a convolutional recurrent network combining VGG19 and a GRU), to check for anomalies.
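Put together, the series pipeline looks roughly like the sketch below. Here `detector` is the per-frame detection helper sketched earlier and `cgru_model` stands in for the trained temporal model; both names are placeholders of mine.

```python
import cv2
import numpy as np

def series_pipeline(frames, detector, cgru_model):
    """Series mode: annotate every raw BGR frame with YOLO boxes first, then feed
    the reassembled, normalized sequence to the temporal model."""
    processed = []
    for frame in frames:  # frames are uint8 BGR images here
        annotated = frame.copy()
        for _, _, (x1, y1, x2, y2) in detector(frame):
            cv2.rectangle(annotated, (int(x1), int(y1)), (int(x2), int(y2)),
                          (0, 255, 0), 2)  # draw each detected object
        processed.append(annotated.astype(np.float32) / 255.0)
    sequence = np.expand_dims(np.array(processed), axis=0)  # batch dimension of 1
    return cgru_model.predict(sequence)
```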

I compared this method with using CGRU by itself, and the results were pretty revealing. Adding YOLO barely changed how the model performed, which suggests the model wasn't really paying attention to the bounding boxes during training. So I refined my preprocessing by using the bounding boxes to create masks that kept only the parts of the image containing the objects and removed as much background as possible.

For any frame where nothing was detected, I had two choices: keep the original image or replace it with a black one. I tested both options, and while they performed similarly, I saw slight differences. Using a black background seemed to improve the detection of normal actions, but it hurt accuracy for fights and fires. This likely happens when important details are missed during detection, causing the model to lose crucial information.
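The masking step can be as simple as the sketch below: keep the pixels inside the detected boxes, blank out the rest, and pick a fallback when nothing was detected. The function name and flag are mine, chosen to mirror the two options described above.

```python
import numpy as np

def mask_background(frame, boxes, keep_original_if_empty=True):
    """Keep only the pixels inside detected boxes; everything else becomes black.
    If nothing was detected, either keep the frame or return a black image."""
    if not boxes:
        return frame if keep_original_if_empty else np.zeros_like(frame)
    mask = np.zeros(frame.shape[:2], dtype=bool)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = True
    masked = np.zeros_like(frame)
    masked[mask] = frame[mask]
    return masked
```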

I quickly realized that the object detection parameters, like the confidence and overlap (IoU) thresholds, were essential. I set the confidence threshold to 55 percent for my tests.

Now, when it came to detecting actions performed by people, YOLO version 7 stepped up by outlining their "skeletons," similar to what OpenPose does. So I ran pose estimation on my videos to see how it affected behavior detection. To keep things simple, I focused only on the Fight and Gunshot classes, since not all anomalies involve people.
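For completeness, here is what extracting those skeletons can look like with an ultralytics pose checkpoint; again, this is a stand-in for the YOLOv7 pose estimation used in the thesis, and the model file is an assumption.

```python
from ultralytics import YOLO  # illustration only; the thesis used YOLOv7's pose variant

pose_model = YOLO("yolov8n-pose.pt")

def extract_skeletons(frame):
    """Return one (num_keypoints, 2) array of (x, y) joints per detected person."""
    result = pose_model(frame, verbose=False)[0]
    if result.keypoints is None:
        return []
    return [kp.cpu().numpy() for kp in result.keypoints.xy]
```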

Then I tested whether pose estimation made things better. Initially, I took out the backgrounds from the videos to sharpen detection, but soon I realized that could throw off detecting other types of anomalies. So, I brought the background back in and retrained my model to see if it could still spot things like fires.

In terms of overall performance, adding the fire class didn’t change much, but the results showed a drop in detecting gunshots since some of them were tagged as fires. This led me to swap my multi-class model out for a normal/abnormal setup to see how YOLO influenced things. I trained two new models—one without backgrounds and another with them.

Regardless of preprocessing, grouping related types of anomalies together consistently improved my models' performance, while including unrelated anomalies, like fires, tended to hurt the results. I also noticed that using YOLO to prepare my data boosted accuracy.

Parallel Mode Processing

Next, I decided to run my models in parallel. The idea was to detect objects at the same time as analyzing the temporal dimension, then combine the two results to improve accuracy. My first experiment combined a CGRU trained on the "gunshot" category with YOLO, using a simple rule: if the model predicted "normal" but YOLO spotted a gun, the output switched to "gunshot."
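That fusion rule is just a conditional override. A sketch of it, assuming YOLO class names like "gun" or "firearm" (the actual label set isn't given here):

```python
def fuse_parallel(cgru_label, detections, conf_threshold=0.55):
    """Override the temporal model's 'normal' verdict when YOLO sees a firearm."""
    gun_seen = any(name in {"gun", "firearm"} and conf >= conf_threshold
                   for name, conf, _ in detections)
    if cgru_label == "normal" and gun_seen:
        return "gunshot"
    return cgru_label
```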

For fire detection, I did the same. I evaluated how well this combination worked for each video sequence and set a confidence threshold at 55 percent.

The results for detecting fire were promising. The combination of CGRU and YOLO improved fire detection, while gunshot detection didn’t show any change. At first glance, it seemed that both models were picking up the same features for gunshot detection, which indicated how important YOLO's precision was for the overall performance.

I decided to tweak the rules for detecting gunshots a bit. Since YOLO gives me the coordinates of every detected object, I thought: maybe a gunshot should only register if the gun is near a person. So I trained a fresh model that included images of guns along with some other images to see how it performed.
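A simple way to encode "near a person" is to compare box centres, as in the sketch below; the class names and the pixel threshold are illustrative values of mine, not the thesis's.

```python
def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def gun_near_person(detections, max_dist=150):
    """True if any detected gun's centre lies within `max_dist` pixels of a person's centre.
    The distance threshold is an arbitrary illustration value."""
    guns = [box_center(b) for name, _, b in detections if name in {"gun", "firearm"}]
    people = [box_center(b) for name, _, b in detections if name == "person"]
    return any(((gx - px) ** 2 + (gy - py) ** 2) ** 0.5 <= max_dist
               for gx, gy in guns for px, py in people)
```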

My new model for detecting people did a better job than before, though the firearm detection still had its ups and downs. When I compared the results, my new model's performance for gunshot detection saw a nice little boost.

Then I looked into cutting down false positives. After a fresh round of evaluations, I did see fewer false alarms, but true positives dipped slightly too, which shows how much the overall result depends on YOLO's precision.

Given the gravity of the anomalies I was trying to detect, I wanted to keep the model that had the lowest false negative rate, even if it meant allowing some false alerts. Plus, training a single object detection model to cover all my anomalies would just make life simpler.

Performance Comparison for Each of My Models

Time to share how all the models performed overall! I had three distinct ones for detecting fights, gunshots, and fires. I'll evaluate them on video classification (one detection per video) and on their ability to spot anomalies in a continuous stream (one detection per sequence).

For fights, the model did well when classifying whole videos, hitting about 85.6% accuracy, but fell to 63.1% on continuous streams. Gunshot detection reached 86.5% accuracy on whole videos and rose to 91.8% on sequences. Fire detection was solid too, scoring 83.8% on videos and 86.0% on sequences, making it a reliable performer.

When I combined all my datasets for a multi-class model, I noticed some interesting trends. Despite having more data, detection performance dropped for both the fire class and gunshot class in continuous streams, but overall my multi-class model held up decently.

Looking at videos of real-life incidents, my multi-class model performed respectably. The speeds at which it could process data weren't bad either, recording detection times between 104 and 744 milliseconds.

Conclusions

Through all these tests and adjustments, what have I learned? If you simply want to detect any incident, a binary model (normal/anomalous) is the way to go. It might not pinpoint exactly what went wrong, but it covers all bases.

On the other hand, if your goal is to spot a specific type of anomaly, like a fight or a fire, sticking to a specialized model should yield better results. If you want to mix together all kinds of anomalies and have a human figure out the details later, a normal/abnormal model suits your needs perfectly.

In short, experimenting with these models has been a wild ride. Real-time detection isn't always perfect, but with the right tweaks and approaches, we can get close enough to provide valuable insights for keeping an eye on safety.

Original Source

Title: Real-Time Anomaly Detection in Video Streams

Abstract: This thesis is part of a CIFRE agreement between the company Othello and the LIASD laboratory. The objective is to develop an artificial intelligence system that can detect real-time dangers in a video stream. To achieve this, a novel approach combining temporal and spatial analysis has been proposed. Several avenues have been explored to improve anomaly detection by integrating object detection, human pose detection, and motion analysis. For result interpretability, techniques commonly used for image analysis, such as activation and saliency maps, have been extended to videos, and an original method has been proposed. The proposed architecture performs binary or multiclass classification depending on whether an alert or the cause needs to be identified. Numerous neural network models have been tested, and three of them have been selected. You Only Look Once (YOLO) has been used for spatial analysis, a Convolutional Recurrent Neural Network (CRNN) composed of VGG19 and a Gated Recurrent Unit (GRU) for temporal analysis, and a multi-layer perceptron for classification. These models handle different types of data and can be combined in parallel or in series. Although the parallel mode is faster, the serial mode is generally more reliable. For training these models, supervised learning was chosen, and two proprietary datasets were created. The first dataset focuses on objects that may play a potential role in anomalies, while the second consists of videos containing anomalies or non-anomalies. This approach allows for the processing of both continuous video streams and finite videos, providing greater flexibility in detection.

Authors: Fabien Poirier

Last Update: 2024-11-29 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2411.19731

Source PDF: https://arxiv.org/pdf/2411.19731

Licence: https://creativecommons.org/licenses/by-nc-sa/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
