Simple Science

Cutting-edge science explained simply


Real-Time Tracking of Singing Voices with SingNet

SingNet improves beat tracking in singing voices using past data.

― 6 min read


Figure: SingNet, a real-time vocal beat tracking system. SingNet revolutionizes how we track singing rhythms.

Tracking beats and downbeats in singing voices is important for many music-related tasks. It can help with automatic music production, analysis, and even live performances. However, tracking these elements in singing is tricky due to the unique rhythms and melodies found in songs. Real-time processing adds to the challenge, since it limits access to future data and makes it impossible to correct earlier mistakes based on new information.

What is SingNet?

SingNet is a new system designed to track the beats and downbeats in singing voices in real time. It uses a fresh method called dynamic particle filtering that combines past information with ongoing analysis to improve accuracy. Traditional methods often rely solely on current data, which can make them less effective. SingNet improves on this by also using data from the past to make better guesses about the present.

How Does it Work?

The system starts with a model that processes the sound from singing. It uses a type of neural network called a Convolutional Recurrent Neural Network (CRNN) to identify when beats and downbeats occur. The unique twist in SingNet is its dynamic particle filtering model, which adjusts the number of analysis "particles" based on the situation, rather than using a fixed number as in standard methods.
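
As a rough picture of how these pieces fit together in real time, here is a minimal Python sketch of the frame-by-frame loop. The interfaces `crnn.step` and `particle_filter.update` are hypothetical names used only for illustration, not the authors' actual code.

```python
# A minimal sketch of the frame-by-frame loop described above: a causal CRNN
# turns each incoming feature frame into beat/downbeat activations, and a
# dynamic particle filter turns those activations into beat decisions on the fly.

def process_stream(frames, crnn, particle_filter):
    """Causal beat tracking over a stream of feature frames."""
    beats = []
    for t, frame in enumerate(frames):
        # The model only sees past and current frames (real-time constraint).
        beat_act, downbeat_act = crnn.step(frame)
        # The filter may grow or shrink its particle set here, instead of
        # keeping a fixed particle count as in standard particle filtering.
        if particle_filter.update(beat_act, downbeat_act):
            beats.append(t)
    return beats
```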

Importance of Past Data

By integrating past data into its real-time analysis, SingNet can make informed decisions. When there are strong signals, it adds more particles to improve tracking. This past-informed method creates a more accurate representation of the singing's rhythm.
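
As a rough illustration of this past-informed idea, the sketch below spawns extra particles near the current best guess whenever recent activations are strong. The window length, threshold, and noise scale are made-up values for illustration, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_particles(particles, weights, activation_history,
                     threshold=0.7, n_new=20):
    """Add particles near the best current guess when past activations are strong.

    Here a particle is simply a beat-phase guess in [0, 1); all constants are
    illustrative assumptions.
    """
    recent = np.asarray(activation_history[-50:])      # roughly the last second
    if recent.size and recent.max() > threshold:
        anchor = particles[np.argmax(weights)]         # strongest current hypothesis
        new = (anchor + rng.normal(scale=0.02, size=n_new)) % 1.0
        particles = np.concatenate([particles, new])
        weights = np.concatenate([weights, np.full(n_new, weights.mean())])
        weights = weights / weights.sum()              # renormalise
    return particles, weights
```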

Comparison with Other Methods

Many existing methods use deep learning models to analyze music. Common techniques include Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN). However, these models usually work offline, meaning they analyze data after it has already been captured. Some newer systems have tried to run in real time but often fall short due to technical limitations.

SingNet stands apart because it is designed from the ground up to work in real time. While some other methods offer good results when analyzing full song tracks, they often struggle with isolated vocals. In other words, they are not sophisticated enough to effectively analyze the singer's voice on its own, without any instrumental support.

Challenges with Isolated Singing Voices

Isolated singing presents unique challenges. Unlike complete music tracks, isolated singing lacks the percussive and harmonic elements that help guide rhythmic analysis. Typical music analysis methods therefore tend to be less effective when applied to vocals alone, since existing approaches often rely on the richer cues present in full songs.

When researchers have tried to develop models that track beats and downbeats for isolated singing, they have found the task to be much tougher, because isolated vocals do not provide rhythmic cues as clear as those in more layered music.

Methodology Overview

In SingNet, the neural network uses features extracted from the sound to identify the singing's rhythm accurately. It ignores instruments and focuses on the voice to produce more relevant data. The preprocessing for SingNet relies on conventional spectral features, which are fast enough to compute in real time.
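
A minimal sketch of this kind of spectral preprocessing (a log-mel spectrogram, framed over time) is shown below. The sample rate, hop size, FFT size, and mel-band count are assumptions for illustration, not the paper's exact settings.

```python
import librosa
import numpy as np

def vocal_features(path, sr=22050, hop=441, n_mels=81):
    """Load an audio file and return per-frame log-mel spectral features."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=hop, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return log_mel.T  # shape: (frames, n_mels), one row per audio frame
```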

Design of the Neural Network

The neural network in SingNet is structured with careful consideration of the challenges it faces. It contains three layers of LSTM (Long Short-Term Memory) cells that help manage the complexities of rhythm in isolated singing. This design came from testing different configurations to find what works best; a larger model with more layers captures more detail, which tracking isolated singing requires.
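
The sketch below is a small CRNN in the spirit of this description: a convolutional front end followed by three LSTM layers and a per-frame output for beat and downbeat activations. Layer sizes and kernel choices are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class BeatCRNN(nn.Module):
    """Illustrative CRNN: conv front end, three LSTM layers, per-frame outputs."""

    def __init__(self, n_mels=81, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ELU(),
            nn.MaxPool2d((1, 3)),                      # pool over frequency only
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ELU(),
            nn.MaxPool2d((1, 3)),
        )
        conv_features = 16 * ((n_mels // 3) // 3)
        self.lstm = nn.LSTM(conv_features, hidden, num_layers=3, batch_first=True)
        self.head = nn.Linear(hidden, 2)               # beat and downbeat activations

    def forward(self, x):                              # x: (batch, frames, n_mels)
        z = self.conv(x.unsqueeze(1))                  # (batch, 16, frames, reduced mels)
        z = z.permute(0, 2, 1, 3).flatten(2)           # (batch, frames, conv_features)
        out, _ = self.lstm(z)
        return torch.sigmoid(self.head(out))           # per-frame activations in [0, 1]
```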

Tracking Process

SingNet relies on particles that represent possible states in the music. At the start, these particles are spread out randomly. As it processes music, the system adjusts the particles’ positions based on what it hears. If a strong signal arises, new particles are added to reflect that change.
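 
Below is a rough sketch of that particle-update idea: each particle is a guess about the current beat phase and tempo period, particles are re-weighted by how well they agree with the latest activation, and then resampled. This is purely illustrative; the paper's state space and weighting scheme are more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_particles(n=100, min_period=20, max_period=60):
    """Spread particles randomly over beat phase and tempo period (in frames)."""
    phase = rng.uniform(0, 1, n)                       # position within the beat
    period = rng.uniform(min_period, max_period, n)    # frames per beat
    return np.stack([phase, period], axis=1)

def update(particles, activation):
    """Advance, re-weight against the new activation, and resample."""
    particles[:, 0] = (particles[:, 0] + 1.0 / particles[:, 1]) % 1.0
    # Particles near phase 0 expect a beat now; reward them when the
    # activation is high, penalise them otherwise.
    near_beat = np.minimum(particles[:, 0], 1.0 - particles[:, 0]) < 0.1
    weights = np.where(near_beat, activation, 1.0 - activation) + 1e-6
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]
```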

Inference Model

The inference model in SingNet is a two-step process: it first tracks beats and then uses them to find downbeats. This sequential design gives the system a clear picture of both rhythmic elements. The idea is to keep the particle filtering dynamic, adjusting the number of analysis particles based on the current audio input while still factoring in historical data.
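
An illustrative version of this two-stage inference is sketched below: a first filter decides which frames are beats, and a second stage decides which of those beats are downbeats. The function names and interfaces are hypothetical.

```python
def infer(beat_acts, downbeat_acts, beat_filter, downbeat_filter):
    """Two-stage inference: track beats first, then pick downbeats among them."""
    beats, downbeats = [], []
    for t, (b_act, d_act) in enumerate(zip(beat_acts, downbeat_acts)):
        if beat_filter.update(b_act):           # stage 1: is this frame a beat?
            beats.append(t)
            if downbeat_filter.update(d_act):   # stage 2: is this beat a downbeat?
                downbeats.append(t)
    return beats, downbeats
```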

Datasets and Testing

Evaluating the system's effectiveness can be complicated, since there are few public datasets focused solely on isolated vocals, and annotating beats and downbeats in a purely vocal recording is difficult. The researchers therefore used music source separation techniques to extract vocal tracks from full mixes, allowing for more accurate assessments.

For testing, SingNet used two key datasets. The first was the publicly available GTZAN collection, with vocal tracks separated from the full mixes. The second was an in-house collection with thousands of clean, isolated vocal stems. Each dataset was split into training, validation, and testing segments so the system could be assessed across different scenarios.
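
A typical way to produce such a split is sketched below. The 80/10/10 ratio is an assumption for illustration; the paper does not necessarily use these exact proportions.

```python
from sklearn.model_selection import train_test_split

def split_clips(clip_paths, seed=42):
    """Split a list of clip paths into roughly 80% train, 10% val, 10% test."""
    train, rest = train_test_split(clip_paths, test_size=0.2, random_state=seed)
    val, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return train, val, test
```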

Results and Findings

The results from the experiments indicate that SingNet significantly outperforms traditional methods. The dynamic particle filtering variants (salience-informed, past-informed, and combined) all improved over baseline models, with reported gains of around 3-5%. SingNet's combined method consistently yielded the best results, demonstrating the value of integrating both past and present data in real time.
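
Beat trackers are commonly scored with a tolerance-window F-measure, for example via the mir_eval library. The timestamps below are made-up example values (in seconds), not results from the paper.

```python
import numpy as np
import mir_eval

reference_beats = np.array([0.50, 1.00, 1.50, 2.00, 2.50])   # annotated beats
estimated_beats = np.array([0.52, 1.01, 1.48, 2.03, 2.55])   # tracker output

# F-measure with mir_eval's default 70 ms tolerance window.
f = mir_eval.beat.f_measure(reference_beats, estimated_beats)
print(f"Beat F-measure: {f:.3f}")
```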

Comparison with Baseline Models

When evaluated, SingNet showed higher accuracy in identifying beats and downbeats than baseline models. This improvement was particularly noticeable in testing scenarios involving isolated singing. While other models did well with complete music tracks, SingNet proved more adept at precisely tracking rhythm in vocal-only tracks.

Future Applications

The technology behind SingNet holds promise for various applications, particularly in music-related fields. For instance, it could be used in interactive music systems, allowing users to produce music or create arrangements based solely on their singing. Other possibilities include live performance processing and real-time audio mixing.

Conclusion

In summary, SingNet represents an innovative step forward in singing voice beat and downbeat tracking. The system's unique approach of dynamic particle filtering, which incorporates both current and historical data, allows it to excel in real-time analysis. Despite the challenges of working with isolated singing voices, the results indicate a robust performance that opens the door to a variety of future applications in music technology.

Original Source

Title: SingNet: A Real-time Singing Voice Beat and Downbeat Tracking System

Abstract: Singing voice beat and downbeat tracking posses several applications in automatic music production, analysis and manipulation. Among them, some require real-time processing, such as live performance processing and auto-accompaniment for singing inputs. This task is challenging owing to the non-trivial rhythmic and harmonic patterns in singing signals. For real-time processing, it introduces further constraints such as inaccessibility to future data and the impossibility to correct the previous results that are inconsistent with the latter ones. In this paper, we introduce the first system that tracks the beats and downbeats of singing voices in real-time. Specifically, we propose a novel dynamic particle filtering approach that incorporates offline historical data to correct the online inference by using a variable number of particles. We evaluate the performance on two datasets: GTZAN with the separated vocal tracks, and an in-house dataset with the original vocal stems. Experimental result demonstrates that our proposed approach outperforms the baseline by 3-5%.

Authors: Mojtaba Heydari, Ju-Chiang Wang, Zhiyao Duan

Last Update: 2023-06-04

Language: English

Source URL: https://arxiv.org/abs/2306.02372

Source PDF: https://arxiv.org/pdf/2306.02372

Licence: https://creativecommons.org/licenses/by-nc-sa/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
