AudioSR: Versatile Audio Super-Resolution at Scale

Introduction

As digital technology advances, the demand for crisp, clear audio keeps growing. Audio quality matters in a wide variety of applications, including video calls, virtual meetings, online classes, and cinematic soundtracks. But how do we transform imperfect, low-resolution audio into the high-fidelity sound that captivates and engages us?

Audio super-resolution (audio SR) is a crucial technique in digital audio processing. It enhances the resolution of audio signals by reconstructing the high-frequency components that are missing from low-resolution audio. This technology is instrumental in applications ranging from restoring historical recordings to improving the audio quality of modern digital communication and entertainment systems.

AudioSR is a diffusion-based generative model designed to robustly upscale a wide array of audio types. Earlier super-resolution methods were typically confined to a specific type of audio, such as music or speech, and to a limited range of bandwidth settings. AudioSR removes these constraints: it can upscale sound effects, music, and speech, catering to a broad spectrum of audio enhancement needs.

AudioSR can process any audio signal with a bandwidth between 2 kHz and 16 kHz, upscaling it to a high-resolution signal with a 24 kHz bandwidth at a 48 kHz sampling rate. This makes it a highly adaptable tool in the audio processing landscape. Its performance has been validated through extensive objective evaluations across various audio super-resolution benchmarks, demonstrating its efficacy and robustness.
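The bandwidth and sampling-rate figures above are linked by the Nyquist limit: a signal sampled at a given rate can only represent frequencies up to half that rate. The short Python sketch below works through the quoted numbers. The function names are illustrative and are not part of the AudioSR package.

```python
def nyquist_bandwidth_hz(sampling_rate_hz: float) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sampling_rate_hz / 2.0


def min_sampling_rate_hz(bandwidth_hz: float) -> float:
    """Lowest sampling rate that can carry a signal of the given bandwidth."""
    return 2.0 * bandwidth_hz


# Figures quoted in the article.
OUTPUT_SAMPLING_RATE_HZ = 48_000
INPUT_BANDWIDTHS_HZ = (2_000, 16_000)

output_bandwidth = nyquist_bandwidth_hz(OUTPUT_SAMPLING_RATE_HZ)
print(f"Output bandwidth: {output_bandwidth:.0f} Hz")

for bandwidth in INPUT_BANDWIDTHS_HZ:
    rate = min_sampling_rate_hz(bandwidth)
    factor = output_bandwidth / bandwidth
    print(f"Input bandwidth {bandwidth} Hz: at least {rate:.0f} Hz sampling, "
          f"bandwidth extended {factor:.1f}x")
```

In other words, the inputs AudioSR accepts correspond to sampling rates as low as 4 kHz, and the 48 kHz output is what makes the full 24 kHz bandwidth possible.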

In addition, AudioSR's adaptability extends to its compatibility with other audio generative models. It acts as a plug-and-play enhancement module that can be attached to models like AudioLDM, FastSpeech2, and MusicGen to elevate the overall audio generation quality. This positions AudioSR not just as a standalone solution but as a valuable component in a broader range of audio processing and generation applications.

Methodology of AudioSR

To appreciate the full potential of AudioSR, it helps to understand how it works. At the heart of AudioSR is a latent diffusion model (LDM), which sets it apart from traditional audio SR methods. The LDM estimates the high-resolution mel spectrogram from a given low-resolution audio input. Its strength lies in its ability to capture and reproduce the intricate details that constitute high-quality sound.

Alongside the LDM, AudioSR uses a neural vocoder, which translates the mel spectrograms back into audible waveforms. The combination of these two components allows AudioSR not only to predict missing audio details but also to recreate them in a way that is both natural and pleasing to the ear.

The process begins with resampling the low-resolution audio to a higher sampling rate while preserving the original's maximum bandwidth. AudioSR uses the resampled audio as a conditioning signal that guides the LDM in estimating the high-resolution mel spectrogram. To ensure the consistency and integrity of the audio, the lower-frequency parts of the estimated spectrogram are then replaced with those from the original signal.

Replacing the lower frequencies in this way preserves the authenticity of the sound while enhancing its quality. This post-processing step ensures that the final output retains the essence of the original audio, especially in the lower frequencies, which are crucial for perceived audio quality.
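The low-frequency replacement described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not AudioSR's actual implementation: it uses a linear frequency axis rather than a mel axis, it assumes the input bandwidth is already known, and the function names are hypothetical.

```python
import numpy as np


def cutoff_bin(bandwidth_hz: float, sampling_rate_hz: float, n_freq_bins: int) -> int:
    """Number of frequency bins at or below the input's bandwidth.

    Assumes a linear frequency axis running from 0 Hz to the Nyquist frequency.
    """
    nyquist_hz = sampling_rate_hz / 2.0
    return int(np.floor(bandwidth_hz / nyquist_hz * (n_freq_bins - 1))) + 1


def replace_low_frequencies(estimated: np.ndarray, original: np.ndarray, n_keep: int) -> np.ndarray:
    """Return a copy of `estimated` whose lowest `n_keep` frequency rows come from `original`.

    Both spectrograms have shape (frequency bins, time frames).
    """
    if estimated.shape != original.shape:
        raise ValueError("spectrograms must have the same shape")
    merged = estimated.copy()
    merged[:n_keep, :] = original[:n_keep, :]
    return merged


# Toy example: 513 frequency bins at 48 kHz, input limited to a 4 kHz bandwidth.
rng = np.random.default_rng(0)
original_spec = rng.random((513, 20))   # spectrogram of the resampled input
estimated_spec = rng.random((513, 20))  # stand-in for the model's estimate

n_keep = cutoff_bin(4_000, 48_000, 513)
merged_spec = replace_low_frequencies(estimated_spec, original_spec, n_keep)
```

The rows below the cutoff come straight from the input, so the model only contributes the content it had to generate.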

Testing AudioSR

To test AudioSR, head over to E2E Cloud and launch a GPU node. Ensure that you add your SSH keys during the launch.

Once that’s done, you can use the Remote Explorer extension for VS Code to connect over SSH and treat the E2E Cloud GPU instance as your local development environment.

Then install the Python module:

conda create -n audiosr python=3.9; conda activate audiosr
pip3 install audiosr==0.0.6

To process a single audio file, run:

audiosr -i example/music.wav

The full usage is:

audiosr [-h] -i INPUT_AUDIO_FILE [-il INPUT_FILE_LIST] [-s SAVE_PATH] [--model_name {basic,speech}] [-d DEVICE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE] [--seed SEED]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_AUDIO_FILE, --input_audio_file INPUT_AUDIO_FILE
                        Input audio file for audio super resolution
  -il INPUT_FILE_LIST, --input_file_list INPUT_FILE_LIST
                        A file that contains all audio files that need to perform audio super resolution
  -s SAVE_PATH, --save_path SAVE_PATH
                        The path to save model output
  --model_name {basic,speech}
                        The checkpoint you gonna use
  -d DEVICE, --device DEVICE
                        The device for computation. If not specified, the script will automatically choose the device based on your environment.
  --ddim_steps DDIM_STEPS
                        The sampling step for DDIM
  -gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE
                        Guidance scale (Large => better quality and relavancy to text; Small => better diversity)
  --seed SEED           Change this value (any integer number) will lead to a different generation result.
  --suffix SUFFIX       Suffix for the output file

If you want to process a batch of files:

audiosr -il batch.lst
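For larger jobs it can be convenient to generate the list file and the command from Python. The sketch below uses only the flags shown in the usage text above. It assumes the list file holds one audio path per line, which the help text does not state explicitly, and the helper names are hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def write_batch_list(audio_dir: str, list_path: str, pattern: str = "*.wav") -> int:
    """Write the matching audio paths to `list_path`, one per line (assumed format).

    Returns the number of files listed.
    """
    paths = sorted(str(p.resolve()) for p in Path(audio_dir).glob(pattern))
    Path(list_path).write_text("".join(path + "\n" for path in paths))
    return len(paths)


def build_audiosr_command(
    list_path: str,
    save_path: Optional[str] = None,
    model_name: str = "basic",
    ddim_steps: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Assemble an `audiosr` command line from the documented flags."""
    if model_name not in ("basic", "speech"):
        raise ValueError("model_name must be 'basic' or 'speech'")
    command = ["audiosr", "-il", str(list_path), "--model_name", model_name]
    if save_path is not None:
        command += ["-s", str(save_path)]
    if ddim_steps is not None:
        command += ["--ddim_steps", str(ddim_steps)]
    if seed is not None:
        command += ["--seed", str(seed)]
    return command


command = build_audiosr_command("batch.lst", save_path="output", model_name="speech", seed=42)
print(" ".join(command))
# To launch the job: import subprocess; subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting problems when file paths contain spaces.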

AudioSR in Action: Experiment and Results

To ensure its robustness and versatility, AudioSR was trained on a comprehensive array of datasets, including MUSDB18-HQ, MoisesDB, MedleyDB, FreeSound, and a speech dataset from OpenSLR. These datasets encompass a wide range of audio types, from music to speech, and everything in between. This extensive training, totaling approximately 7000 hours of audio, is crucial for the model's ability to adapt to various audio super-resolution tasks.

One of the most compelling pieces of evidence of AudioSR's capabilities is presented in the table below, which details both objective and subjective evaluation results. This table showcases how AudioSR and its speech-specific variant, AudioSR-Speech, perform against other models like NuWave and NVSR across different datasets and cutoff frequencies. The results show that AudioSR consistently outperforms other models, especially in music super-resolution.

The experimental results highlight several key findings:

Music Super-Resolution: In this category AudioSR shines, achieving the best performance across all cut-off frequency settings when compared to the baseline NVSR model.
Speech Super-Resolution: AudioSR-Speech shows exceptional results, particularly in the 24 kHz to 48 kHz upsampling task, underscoring the benefits of fine-tuning on specific data domains.
LSD Metric vs. Perceptual Quality: An interesting observation is the discrepancy between the LSD metric and perceptual quality, particularly in sound effect data. This suggests that while LSD is a useful objective measure, it might not always align with the subjective audio quality experienced by listeners.

[Figure: bar chart of overall quality scores from the subjective evaluation]

From the graph, we can see that both GT-Mel and AudioSR have the highest overall quality scores, indicating a strong preference for their output in subjective evaluations. This highlights the effectiveness of AudioSR in enhancing audio quality, making it a competitive choice in the field of audio super-resolution. Beyond the numbers, subjective evaluations reveal that AudioSR significantly enhances the perceptual quality of outputs from AudioLDM, MusicGen, and FastSpeech2. This underscores its potential as a plug-and-play solution for improving existing audio generation models.

Insights and Implications

The results from the evaluation of AudioSR, especially when viewed in the context of its robust performance across various datasets and subjective quality assessments, offer a glimpse into the future of audio processing. AudioSR's adaptability in handling diverse audio types and its ability to operate over a wide range of input bandwidths position it as a pivotal tool in the evolution of audio enhancement technologies.

AudioSR can improve old recordings, bringing clarity and richness to sounds that were previously constrained by older recording technologies. For speech and communication, the enhanced clarity offered by AudioSR could revolutionize telecommunication and voice recognition systems, making them more efficient and reliable, especially in noisy environments. In entertainment and virtual reality, the heightened audio experience provided by AudioSR can significantly elevate the level of immersion.

One of the key insights from the evaluation of AudioSR is the observed discrepancy between the LSD metric and perceptual quality assessments. While the LSD metric is a valuable objective tool for quantifying audio quality, it doesn't always align with the subjective auditory experience of listeners. This divergence highlights the complexity of human audio perception and underscores the need for more nuanced and comprehensive evaluation methods in audio super-resolution.

For instance, in certain scenarios, a better LSD score did not translate into higher subjective quality ratings. This suggests that while technical metrics are essential for initial assessments, they should be complemented with thorough subjective listening tests to capture the full scope of audio quality.
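To make the metric concrete, here is a minimal NumPy sketch of the log-spectral distance (LSD) as it is commonly defined in the super-resolution literature: the root-mean-square difference of log power spectra across frequency, averaged over time frames. The STFT settings (a 2048-sample Hann window with a hop of 512) are assumptions and may differ from those used in the AudioSR evaluation.

```python
import numpy as np


def magnitude_spectrogram(signal: np.ndarray, n_fft: int = 2048, hop: int = 512) -> np.ndarray:
    """Magnitude STFT with a Hann window, shape (frames, frequency bins).

    The signal must contain at least `n_fft` samples.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


def log_spectral_distance(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """LSD between two equal-length waveforms; lower means closer spectra."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    ref_power = magnitude_spectrogram(reference) ** 2
    est_power = magnitude_spectrogram(estimate) ** 2
    log_ratio = np.log10((ref_power + eps) / (est_power + eps))
    per_frame = np.sqrt(np.mean(log_ratio ** 2, axis=1))  # RMS over frequency
    return float(np.mean(per_frame))                      # mean over time


# Toy signals: a reference with a low and a high tone, and a version missing the high tone.
sampling_rate = 48_000
t = np.arange(sampling_rate) / sampling_rate
low_tone = np.sin(2 * np.pi * 1_000 * t)
high_tone = 0.5 * np.sin(2 * np.pi * 10_000 * t)
reference = low_tone + high_tone
band_limited = low_tone

print("LSD(reference, reference):   ", log_spectral_distance(reference, reference))
print("LSD(reference, band_limited):", log_spectral_distance(reference, band_limited))
```

Because LSD weights every frequency bin equally and compares log magnitudes, it can penalize differences that listeners barely notice, which is one plausible reason it diverges from subjective ratings.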

The development and evaluation of audio super-resolution technologies like AudioSR should adopt a holistic approach. This approach would integrate both objective metrics and subjective listening tests, ensuring that the advancements in the field are not just technically sound but also aligned with the nuanced preferences of human listeners. By doing so, the future of audio processing can be steered towards innovations that are not only scientifically robust but also perceptually meaningful.

Future of AudioSR

As AudioSR continues to set new standards in audio super-resolution, its future appears rich with possibilities. The potential applications and improvements for AudioSR span a vast array of domains:

Enhanced Real-Time Processing: One of the immediate areas of development could be implementing AudioSR in real-time audio processing systems. This could revolutionize live broadcasting, real-time communications, and interactive gaming experiences, providing high-resolution audio without perceptible delays.
Integration with Emerging Technologies: The integration of AudioSR with emerging technologies like AI-driven voice assistants, IoT devices, and virtual reality could lead to more immersive and natural user experiences. Imagine smart home devices delivering crystal-clear audio commands or VR environments with hyper-realistic soundscapes.
Preservation and Restoration: In the field of archival and restoration, AudioSR can play a pivotal role in preserving historical audio recordings. By enhancing the clarity and quality of old recordings, it could unlock new cultural and historical insights.
Healthcare Applications: AudioSR could also find applications in healthcare, especially in developing hearing aids that can provide clearer sound quality for individuals with hearing impairments.

It is clear that AudioSR stands as a significant milestone in the audio technology landscape. Its ability to enhance audio quality across various types of content marks a leap forward in our quest for perfect sound. AudioSR not only demonstrates what is technically feasible with current technology but also opens doors to new creative and functional possibilities in sound.

The journey of AudioSR from a concept to a tool that can enhance our auditory experience in multiple ways is a testament to the rapid advancements in audio technology. It underscores the importance of continuous innovation and user-centered design in the development of new technologies. As we move forward, AudioSR is likely to become a key player in shaping the future of sound, making our interactions with audio more enriching and immersive than ever before.

Using E2E Cloud to Run AudioSR

Are you fascinated by the transformative potential of AudioSR? Whether you're an audio engineer, a technology enthusiast, or someone passionate about the future of sound, there's a world of exploration awaiting you with AudioSR, and E2E Cloud is the perfect platform to start this journey.

E2E Cloud offers access to cutting-edge GPUs, including the powerful A100 and H100, which are ideal for handling the intensive computational demands of audio super-resolution tasks. By using these advanced GPUs, you can experiment with AudioSR, test its capabilities, and even push its boundaries. Imagine the possibilities of enhancing audio in real-time, creating immersive soundscapes, or restoring historical audio recordings with unprecedented clarity.

Visit E2E Cloud and discover how their powerful GPU infrastructure can help you harness the full potential of AudioSR. E2E Cloud provides the resources you need to explore and innovate in the world of audio super-resolution.