Learning from Sound

technical
intermediate
audio
Classifying respiratory sounds with PyTorch and torchaudio
Published: September 14, 2021

Some utility functions for this notebook
# To be used with torchaudio
def print_stats(waveform, sample_rate=None, src=None):
  if src:
    print("-" * 10)
    print("Source:", src)
    print("-" * 10)
  if sample_rate:
    print("Sample Rate:", sample_rate)
  print("Shape:", tuple(waveform.shape))
  print("Dtype:", waveform.dtype)
  print(f" - Max:     {waveform.max().item():6.3f}")
  print(f" - Min:     {waveform.min().item():6.3f}")
  print(f" - Mean:    {waveform.mean().item():6.3f}")
  print(f" - Std Dev: {waveform.std().item():6.3f}")
  print()
  print(waveform)
  print()

def plot_waveform(waveform, sample_rate, title="Waveform", xlim=None, ylim=None):
  waveform = waveform.numpy()

  num_channels, num_frames = waveform.shape
  time_axis = torch.arange(0, num_frames) / sample_rate

  figure, axes = plt.subplots(num_channels, 1)
  if num_channels == 1:
    axes = [axes]
  for c in range(num_channels):
    axes[c].plot(time_axis, waveform[c], linewidth=1)
    axes[c].grid(True)
    if num_channels > 1:
      axes[c].set_ylabel(f'Channel {c+1}')
    if xlim:
      axes[c].set_xlim(xlim)
    if ylim:
      axes[c].set_ylim(ylim)
  figure.suptitle(title)
  plt.show(block=False)

def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
  waveform = waveform.numpy()

  num_channels, num_frames = waveform.shape
  time_axis = torch.arange(0, num_frames) / sample_rate

  figure, axes = plt.subplots(num_channels, 1)
  if num_channels == 1:
    axes = [axes]
  for c in range(num_channels):
    axes[c].specgram(waveform[c], Fs=sample_rate)
    if num_channels > 1:
      axes[c].set_ylabel(f'Channel {c+1}')
    if xlim:
      axes[c].set_xlim(xlim)
  figure.suptitle(title)
  plt.show(block=False)

def play_audio(waveform, sample_rate):
  waveform = waveform.numpy()

  num_channels, num_frames = waveform.shape
  if num_channels == 1:
    display(Audio(waveform[0], rate=sample_rate))
  elif num_channels == 2:
    display(Audio((waveform[0], waveform[1]), rate=sample_rate))
  else:
    raise ValueError("Waveform with more than 2 channels are not supported.")

def inspect_file(path):
  print("-" * 10)
  print("Source:", path)
  print("-" * 10)
  print(f" - File size: {os.path.getsize(path)} bytes")
  print(f" - {torchaudio.info(path)}")

def plot_spectrogram(spec, title=None, ylabel='freq_bin', aspect='auto', xmax=None):
  fig, axs = plt.subplots(1, 1)
  axs.set_title(title or 'Spectrogram (db)')
  axs.set_ylabel(ylabel)
  axs.set_xlabel('frame')
  im = axs.imshow(librosa.power_to_db(spec), origin='lower', aspect=aspect)
  if xmax:
    axs.set_xlim((0, xmax))
  fig.colorbar(im, ax=axs)
  plt.show(block=False)

The experimental version of this notebook can be found in this repo: Learning from Sound - Experimental

This notebook assumes basic knowledge of training neural networks, what a CNN is, deep learning concepts such as batch normalization, and a basic understanding of how sound is represented in digital format.

To learn about the latter, you can go through this 6-part blog series, which explains the topic from the ground up. (The first four posts are sufficient for this notebook.)

Imports
import os
import random
from collections import Counter
import librosa

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from fastcore.all import *

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split, WeightedRandomSampler

import torchaudio
import torchaudio.transforms as T

from IPython.display import Audio, display

Introduction

Respiratory sounds are important indicators of respiratory health and respiratory disorders. The sound emitted when a person breathes is directly related to air movement, changes within lung tissue and the position of secretions within the lung. A wheezing sound, for example, is a common sign that a patient has an obstructive airway disease like asthma or chronic obstructive pulmonary disease (COPD).

These sounds can be recorded using digital stethoscopes and other recording techniques. This digital data opens up the possibility of using machine learning to automatically diagnose respiratory disorders like asthma, pneumonia and bronchiolitis, to name a few.

In this notebook, we are going to create a Convolutional Neural Network that can distinguish between different respiratory sounds and classify them into a diagnosis. Along the way, we will look at how sound is represented in digital format, how to convert audio files into spectrograms that a CNN can learn from, and a few other practical aspects of training neural networks.

I learned a lot from other people while making this notebook and I reference all of them at the bottom.

Getting the Data

Luckily for us, two research teams in Portugal and Greece already prepared a suitable dataset that can be found on Kaggle. It includes 920 annotated recordings of varying length - 10s to 90s. These recordings were taken from 126 patients. There are a total of 5.5 hours of recordings containing 6898 respiratory cycles.

We can download the dataset using the kaggle command.

!kaggle datasets download -d vbookshelf/respiratory-sound-database
Downloading respiratory-sound-database.zip to /content
100% 3.68G/3.69G [01:38<00:00, 24.1MB/s]
100% 3.69G/3.69G [01:38<00:00, 40.2MB/s]

Working with torchaudio

We are going to be using PyTorch and torchaudio in this notebook.

Let’s create a pathlib object pointing to where our data is located:

data_path = Path('data/respiratory_sound_database/Respiratory_Sound_Database')

We can see what files are present in our data_path:

data_path.ls()
(#4) [Path('patient_diagnosis.csv'),Path('audio_and_txt_files'),Path('filename_format.txt'),Path('filename_differences.txt')]

And get one file to use as our example:

(data_path/'audio_and_txt_files').ls(file_exts='.wav')[0]
Path('audio_and_txt_files/138_1p2_Ar_mc_AKGC417L.wav')
AUDIO_FILE = (data_path/'audio_and_txt_files').ls(file_exts='.wav')[0]

Let us load that audio file using torchaudio. It returns a tuple containing the waveform and its sample rate.

waveform, sample_rate = torchaudio.load(AUDIO_FILE)
waveform.shape, sample_rate
(torch.Size([1, 882000]), 44100)

Our example audio file has a shape of [1, 882000] and a sample rate of 44100 Hz (44.1 kHz), which is pretty common.

Other info about the audio file can be seen using the following handy utility function:

print_stats(waveform)
Shape: (1, 882000)
Dtype: torch.float32
 - Max:      0.899
 - Min:     -0.623
 - Mean:     0.000
 - Std Dev:  0.112

tensor([[0.0000, 0.0000, 0.0000,  ..., 0.0847, 0.0853, 0.0724]])
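We could also query the file’s metadata, such as its size on disk, encoding and number of channels, without decoding the whole waveform, by using the inspect_file helper defined at the top of this notebook (a quick aside; output omitted):

# print file size plus torchaudio's header metadata for the example file
inspect_file(AUDIO_FILE)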

We can plot the waveform of the audio file:

plot_waveform(waveform, sample_rate);

As you can see, the waveform is still a raw signal, but a CNN expects an image-like input, so we need a way to convert the signal into an image. A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time.
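Under the hood, a spectrogram is computed with a short-time Fourier transform over the waveform. As a minimal sketch (not used in the rest of the notebook), torchaudio can compute one directly with its Spectrogram transform:

# power spectrogram straight from the waveform
# output shape: (channels, n_fft // 2 + 1 frequency bins, time frames)
spectrogram = T.Spectrogram(n_fft=1024, hop_length=512, power=2.0)
spectrogram(waveform).shape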

Here is a spectrogram of the example audio above:

plot_specgram(waveform, sample_rate);
/usr/local/lib/python3.7/dist-packages/matplotlib/axes/_axes.py:7592: RuntimeWarning: divide by zero encountered in log10
  Z = 10. * np.log10(spec)

As you can see, an ordinary spectrogram won’t give our CNN much to learn from. Mel spectrograms work better in this case, and computing one directly from the waveform is easy with torchaudio.

n_fft = 1024
win_length = None
hop_length = 512
n_mels = 128

mel_spectrogram = T.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
    center=True,
    pad_mode="reflect",
    power=2.0,
    norm='slaney',
    onesided=True,
    n_mels=n_mels,
    mel_scale="htk",
)

melspec = mel_spectrogram(waveform)
plot_spectrogram(
    melspec[0], title="MelSpectrogram", ylabel='mel freq');
/usr/local/lib/python3.7/dist-packages/torchaudio/functional/functional.py:433: UserWarning: At least one mel filterbank has all zero values. The value for `n_mels` (128) may be set too high. Or, the value for `n_freqs` (513) may be set too low.
  "At least one mel filterbank has all zero values. "

This is a better visual representation than ordinary spectrograms and gives our neural network something to work with.

Finally, we can play the audio and hear the respiratory recording.

play_audio(waveform, sample_rate);

Audio Data Augmentation (SpecAugment by Google)

To make training robust in deep learning, we usually use data augmentation, which is artificially creating new data from the data we have. This also helps regularize the model and curb overfitting.

But for our data, we can’t just use the usual image augmentations like flipping and rotating. Google came up with an augmentation designed specifically for spectrograms called SpecAugment. It involves masking our Mel spectrograms along the time axis (Time Masking) or along the frequency axis (Frequency Masking).

torchaudio makes these transforms easy to implement. Here are the corresponding outputs when we apply each masking:

Time Masking

time_masking = T.TimeMasking(time_mask_param=80)
spec = time_masking(melspec)

plot_spectrogram(spec[0], title="Masked along time axis")

Frequency Masking

freq_masking = T.FrequencyMasking(freq_mask_param=80)
spec = freq_masking(melspec)

plot_spectrogram(spec[0], title="Masked along frequency axis")

For our specific task, we are going to apply both transforms, one after the other.

spec_augment = nn.Sequential(
    time_masking,
    freq_masking)

spec = spec_augment(melspec)

plot_spectrogram(spec[0], title="SpecAugment")

Getting samples and the corresponding labels

Now that we know how to load our audio files and perform augmentations on them, we need a way to get the labels for each audio file.

The creators of the dataset provided us with a csv file that we can pop into a pandas dataframe and get the labels of each file:

df = pd.read_csv(data_path/'patient_diagnosis.csv', 
                 names=['Patient number', 'Diagnosis'])

df.head()
Patient number Diagnosis
0 101 URTI
1 102 Healthy
2 103 Asthma
3 104 COPD
4 105 URTI

Here is a distribution of the Diagnosis column:

plt.figure(figsize=(10,5))
sns.countplot(df['Diagnosis']);
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

df['Diagnosis'].value_counts()
COPD              64
Healthy           26
URTI              14
Bronchiectasis     7
Bronchiolitis      6
Pneumonia          6
LRTI               2
Asthma             1
Name: Diagnosis, dtype: int64

As with many medical datasets, we can already see a massive imbalance in the data. This is a problem because our model can simply learn to predict the most common class and still be correct a high percentage of the time. We will need to fix that later, before feeding this data into our model.

To extract the diagnosis from our audio files, we need to look again closely at the structure of our filename.

AUDIO_FILE
Path('audio_and_txt_files/138_1p2_Ar_mc_AKGC417L.wav')

It contains a lot of cryptic information that can best be understood from the description of the dataset given by the dataset creators.

It reads:

Each audio file name is divided into 5 elements, separated with underscores (_).

  1. Patient number (101,102,…,226)
  2. Recording index
  3. Chest location
  4. Acquisition mode (single channel (sc) or multichannel (mc))
  5. Recording equipment

We are mostly interested in the first element, the patient number, which we can then cross-check against the dataframe to get the corresponding diagnosis.

Since we know the patient number is always three digits, we can use the following code to get the diagnosis:

df[df['Patient number'] == int(AUDIO_FILE.stem[:3])]['Diagnosis'].item()
'COPD'

Let’s pop that into a function since that functionality is vital to creating our dataset.

def get_y(path):
  return df[df['Patient number'] == int(path.stem[:3])]['Diagnosis'].item()

get_y(AUDIO_FILE)
'COPD'

Now that we can get the labels, let us revisit the unbalanced dataset problem and see how bad it actually is. You see, in this dataset, we have multiple recordings for the same patient, corresponding to different chest locations. So we need to get all audio files, label them and get the exact count and distribution of our data.

Let us create a function that takes in a list, and returns a dictionary containing the frequency of all the unique items in that list:

def CountFrequency(my_list):
    # Creating an empty dictionary
    freq = {}
    for item in my_list:
        if (item in freq):
            freq[item] += 1
        else:
            freq[item] = 1
    return freq

Next, we loop through all the audio files, get their labels, and append each label to a list:

diagnosis_list = []
for recording in (data_path/'audio_and_txt_files').ls(file_exts='.wav'):
  diagnosis = df[df['Patient number'] == int(recording.stem[:3])]['Diagnosis'].item()
  diagnosis_list.append(diagnosis)
len(diagnosis_list)
920

We have a total of 920 labels. Now let’s see the frequency of each diagnosis:

CountFrequency(diagnosis_list)
{'Asthma': 1,
 'Bronchiectasis': 16,
 'Bronchiolitis': 13,
 'COPD': 793,
 'Healthy': 35,
 'LRTI': 2,
 'Pneumonia': 37,
 'URTI': 23}
count = CountFrequency(diagnosis_list)

plt.figure(figsize=(16, 6))
sns.barplot(x=list(count.keys()), y=list(count.values()));

COPD is heavily over-represented: it appears 793 times, compared to only 1 recording for Asthma and 2 for LRTI.
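To put a number on it: a model that always predicts COPD would already be correct on roughly 86% of the recordings, so plain accuracy alone is a weak target here. A quick back-of-the-envelope check using the counts above:

# majority-class baseline: always predict COPD
count['COPD'] / len(diagnosis_list)  # 793 / 920 ≈ 0.86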

To solve this, we will need to oversample the smaller classes until they are on par with the largest class. We will do that when creating the DataLoader.

Creating the Dataset

We need to create a PyTorch Dataset before we deal with the imbalance problem.

The audio processing pipeline for neural networks involves a sequence of steps:

  • Load the audio file.
  • Rechannel the waveform to be consistent, either mono (one channel) or stereo (two channels), since our tensors have to have the same number of channels. We will convert all of them to stereo.
  • Resample the waveforms to a consistent sample rate, since they can be sampled at different rates. We will resample them to 44100 Hz here.
  • Make the audio files the same duration, since some might be 20 seconds and others 15 seconds, and our model expects tensors of the same size. Resizing is accomplished by either padding shorter files or truncating longer files, depending on their initial size and our target size.
  • Apply a Time Shift data augmentation to the waveform before it is converted into a visual representation.
  • Finally, convert the waveforms into their respective Mel spectrogram representation.

To simplify the process, I have created the following audio utility class that implements the above steps as static methods.

# Audio utility function
class AudioUtil():
  """
  A utility class providing static methods for:
  -------------------------------------------------------
  * loading audio files
  * rechanneling the audio files
  * resampling the audio files
  * padding or truncating the audio files
  * Time Shift Data Augmentation
  * Converting waveform into Mel Spectrogram
  """

  # load audio and return signal as tensor and the sample rate
  @staticmethod
  def load(path):
    waveform, sample_rate = torchaudio.load(path)
    return (waveform, sample_rate)
  
  # conversion of channels (Mono to Stereo and vice versa)
  @staticmethod
  def rechannel(audio, new_channel):
    waveform, sample_rate = audio

    if (waveform.shape[0] == new_channel):
      # no rechanneling needed
      return audio
    
    if (new_channel==1):
      # converting stereo to mono
      # by selecting the first channel
      new_waveform = waveform[:1,:]
    elif (new_channel==2):
      # converting mono to stereo
      # by duplicating the first channel
      new_waveform = torch.cat([waveform, waveform])
    
    return (new_waveform, sample_rate)
  
  # resampling
  @staticmethod
  def resample(audio, new_sr):
    waveform, sr = audio

    if (sr==new_sr):
      # no resampling needed
      return audio
    
    num_channels = waveform.shape[0]

    # resample first channel
    new_audio = torchaudio.transforms.Resample(sr, new_sr)(waveform[:1,:])
    if (num_channels) > 1:
      # resample second channel and merge the two
      re_two = torchaudio.transforms.Resample(sr, new_sr)(waveform[1:,:])
      new_audio = torch.cat([new_audio, re_two])
    
    return (new_audio, new_sr)
  
  # resizing audio to same max length (max_ms) in milliseconds
  @staticmethod
  def pad_trunc(audio, max_ms):
    waveform, sr = audio
    num_channels, num_frames = waveform.shape
    max_len = sr//1000 * max_ms

    if (num_frames>max_len):
      # truncate signal to given length
      waveform = waveform[:,:max_len]
    
    if (num_frames<max_len):
      # get padding lengths for beginning and end
      begin_ln = random.randint(0, max_len-num_frames)
      end_ln = max_len - num_frames - begin_ln

      # pad the audio with zeros
      pad_begin = torch.zeros((num_channels, begin_ln))
      pad_end = torch.zeros((num_channels, end_ln))

      waveform = torch.cat((pad_begin, waveform, pad_end), 1)
    return (waveform, sr)

  # time shift data augmentation
  @staticmethod
  def time_shift(audio, shift_limit):
    waveform, sr = audio

    _, num_frames = waveform.shape
    shift_amt = int(random.random() * shift_limit * num_frames)
    # roll along the time axis (dims=1) so samples are not shifted across channels
    return (waveform.roll(shift_amt, dims=1), sr)

  # generating a Mel Spectrogram
  @staticmethod
  def melspectro(audio, n_mels=64, n_fft=1024, hop_len=None):
    waveform, sr = audio
    top_db = 80

    # spec shape == (num_channels, n_mels, time)
    spec = torchaudio.transforms.MelSpectrogram(sr, n_fft=n_fft, hop_length=hop_len, n_mels=n_mels)(waveform)

    # convert into db
    spec = torchaudio.transforms.AmplitudeToDB(top_db=top_db)(spec)
    
    return spec

We also need a way to implement SpecAugment from Google in PyTorch.

class SpecAugment(object):
    """Augment the spectograms based on SpecAugment from Google
    https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html

    Args:
        max_mask_pct: The percentange of the spectrogram to be augmented
        n_freq_masks: The number of frequency masks to place in spectrogram
        n_time_masks: The number of time masks to place in spectrogram
    """

    def __init__(self, max_mask_pct=0.1, n_freq_masks=1, n_time_masks=1):
        self.max_mask_pct = max_mask_pct
        self.n_freq_masks = n_freq_masks
        self.n_time_masks = n_time_masks

    def __call__(self, spec):
        _, n_mels, n_steps = spec.shape
        mask_value = spec.mean()
        aug_spec = spec
  
        # apply the augmentation one after the other
        # order: freq_aug -----> time_aug
        freq_mask_param = self.max_mask_pct * n_mels
        for _ in range(self.n_freq_masks):
          aug_spec = torchaudio.transforms.FrequencyMasking(freq_mask_param)(aug_spec, mask_value)
      
        time_mask_param = self.max_mask_pct * n_steps
        for _ in range(self.n_time_masks):
          aug_spec = torchaudio.transforms.TimeMasking(time_mask_param)(aug_spec, mask_value)
      
        return aug_spec

Now that we have those two out of the way, we need to move on to the next step in creating our Dataset.

Our model expects data in numerical form, which means we cannot pass the labels to it as strings. We need to represent them as numbers.

One solution is to create a list of all the possible labels and convert it into a dictionary that maps each label to a unique number:

files = (data_path/'audio_and_txt_files/').ls(file_exts='.wav')
lbls = files.map(get_y).unique()
lbls
(#8) ['COPD','Healthy','Bronchiectasis','Pneumonia','Bronchiolitis','URTI','Asthma','LRTI']
v2i = {v:k for k,v in enumerate(lbls)}
v2i
{'Asthma': 6,
 'Bronchiectasis': 2,
 'Bronchiolitis': 4,
 'COPD': 0,
 'Healthy': 1,
 'LRTI': 7,
 'Pneumonia': 3,
 'URTI': 5}

Now we can create our Dataset. We will rechannel the audio to two channels, resample it to 44100 Hz, resize each clip to 20 seconds (20,000 milliseconds) and use a shift limit of 40 percent for the Time Shift:

class RespiratoryDataset(Dataset):

  def __init__(self, fns, v2i, transform):
    self.fns = fns
    self.v2i = v2i
    self.duration = 20_000
    self.sr = 44100
    self.channel = 2
    self.shift_pct=0.4
    self.transform = transform
  
  def __len__(self):
    return len(self.fns)
  
  def __getitem__(self, idx):
    # get audio file
    audio_file = self.fns[idx]
    # get label
    label = self.v2i[get_y(audio_file)]

    # preprocess the audio file
    # load -> resample -> rechannel -> resize -> time_shift -> convert into spec
    # -> spec augment

    aud = AudioUtil.load(audio_file)
    resampled = AudioUtil.resample(aud, self.sr)
    rechanneled = AudioUtil.rechannel(resampled, self.channel)
    resized = AudioUtil.pad_trunc(rechanneled, self.duration)
    shifted = AudioUtil.time_shift(resized, self.shift_pct)
    sgram = AudioUtil.melspectro(shifted)

    if self.transform:
        sgram = self.transform(sgram)

    return sgram, torch.tensor(label)

And pass in our SpecAugment as our transform:

specaugment = SpecAugment(
    max_mask_pct=0.1,
    n_freq_masks=1,
    n_time_masks=2
)
dset = RespiratoryDataset(files, v2i, transform=specaugment)

We can confirm our Dataset contains every file by checking its length:

len(dset)
920

Let’s take a random sample from the Dataset and check its shape and the value of its label:

sample = dset[100]
sample[0].shape, type(sample),sample[1]
(torch.Size([2, 64, 1719]), tuple, tensor(0))

The shape shows that we have two channels, which makes sense since we rechanneled all our audio files to stereo. The 64 corresponds to the mel bins, and the time dimension results from resizing our files, by either padding or truncating, to 20,000 milliseconds (the Time Shift only rolls the samples and does not change the shape).
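We can also sanity-check the time dimension. pad_trunc keeps sr // 1000 * max_ms samples, and melspectro (with its default n_mels=64 and the hop length of 512 that MelSpectrogram falls back to when hop_length is None) produces one frame per hop plus one. A rough check, assuming those defaults:

num_samples = (44100 // 1000) * 20_000  # 880000 samples kept by pad_trunc
num_samples // 512 + 1                  # 1719 time frames, matching the shape above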

As for the label, we can convert it back into a readable form using the created dictionary.

lbls[sample[1]]
'COPD'

We need to split the data into training and validation sets. We randomly take the data and split 80% into the training set and the remaining 20% into the validation set.

num_items = len(dset)
num_train = round(num_items * 0.8)
num_val = num_items - num_train
train_ds, val_ds = random_split(dset, [num_train, num_val])
len(train_ds), len(val_ds)
(736, 184)

Handling the Imbalance Problem

Now we can handle the imbalance problem.

Let’s have one more look at the frequencies of the diagnoses.

count = CountFrequency(diagnosis_list)
count
{'Asthma': 1,
 'Bronchiectasis': 16,
 'Bronchiolitis': 13,
 'COPD': 793,
 'Healthy': 35,
 'LRTI': 2,
 'Pneumonia': 37,
 'URTI': 23}

We can convert that into a numpy array, which will be easier to work with when creating our solution:

data = list(count.values())
class_count = np.array(data)
class_count
array([793,  35,  16,  37,  13,  23,   1,   2])

PyTorch approaches such problems with the concept of Samplers. A Sampler defines how the DataLoader fetches data from the dataset by implementing the __iter__ function.

PyTorch comes with some built-in samplers, and one of them will help us with our problem.

The WeightedRandomSampler fetches data randomly but in a weighted manner, so that samples from low-frequency classes are drawn more often than their natural frequency would suggest, which balances out the batches.

To create the sampler, we will need to pass in weights and number of samples.

I struggled a little with implementing this Sampler but luckily found a solution in the PyTorch Forums

To create the weights, we take the inverse of the class counts, then look up the class weight for each sample in our training dataset and store these per-sample weights in a list:

class_weights = 1./ torch.Tensor(class_count)
train_targets = [sample[1] for sample in train_ds]
train_samples_weight = [class_weights[class_id] for class_id in train_targets]

And finally create the Sampler.

train_sampler = WeightedRandomSampler(train_samples_weight, len(train_ds))

Creating the Dataloader

When creating the training DataLoader, we pass in the Sampler we created above. For the validation DataLoader, we set shuffle=False, which makes it use the SequentialSampler that fetches data one item after the other.

We create a function that returns both of these DataLoaders:

def get_dls(bs, train_sampler=train_sampler):
  train_dl = torch.utils.data.DataLoader(
    train_ds, 
    batch_size=bs,
    sampler=train_sampler,
    num_workers=2,
    pin_memory=True)

  val_dl = torch.utils.data.DataLoader(
    val_ds, 
    batch_size=bs, 
    shuffle=False,
    num_workers=2,
    pin_memory=True)
  
  return train_dl, val_dl

We can see the distribution of data in the train dataloader:

for i, (data,target) in enumerate(get_dls(32, train_sampler)[0]):
  count=Counter(target.numpy())
  print(f'batch {i}, {count}')
batch 0, Counter({2: 5, 1: 5, 4: 5, 3: 4, 6: 4, 0: 3, 5: 3, 7: 3})
batch 1, Counter({7: 6, 3: 5, 4: 4, 5: 4, 6: 4, 0: 3, 2: 3, 1: 3})
batch 2, Counter({1: 7, 5: 7, 2: 6, 6: 6, 0: 2, 3: 2, 4: 1, 7: 1})
batch 3, Counter({6: 5, 4: 5, 0: 5, 3: 4, 2: 4, 5: 3, 1: 3, 7: 3})
batch 4, Counter({6: 7, 7: 6, 2: 5, 3: 4, 1: 4, 4: 3, 5: 2, 0: 1})
batch 5, Counter({7: 7, 6: 6, 3: 4, 0: 4, 2: 3, 1: 3, 5: 3, 4: 2})
batch 6, Counter({6: 7, 1: 7, 0: 6, 3: 3, 5: 3, 4: 3, 7: 2, 2: 1})
batch 7, Counter({6: 8, 1: 7, 7: 4, 3: 4, 0: 4, 4: 3, 5: 1, 2: 1})
batch 8, Counter({5: 6, 1: 5, 3: 5, 7: 5, 4: 4, 6: 3, 2: 2, 0: 2})
batch 9, Counter({0: 7, 7: 5, 5: 5, 2: 5, 4: 3, 3: 3, 6: 2, 1: 2})
batch 10, Counter({1: 7, 6: 5, 0: 4, 4: 4, 5: 4, 3: 3, 2: 3, 7: 2})
batch 11, Counter({1: 10, 2: 6, 5: 4, 0: 3, 4: 3, 6: 2, 3: 2, 7: 2})
batch 12, Counter({1: 8, 5: 5, 3: 5, 0: 4, 6: 3, 2: 3, 7: 3, 4: 1})
batch 13, Counter({1: 6, 6: 6, 3: 5, 5: 4, 4: 3, 7: 3, 2: 3, 0: 2})
batch 14, Counter({4: 7, 7: 5, 3: 5, 5: 4, 6: 4, 2: 4, 1: 3})
batch 15, Counter({5: 7, 1: 5, 7: 4, 6: 4, 0: 4, 2: 4, 4: 2, 3: 2})
batch 16, Counter({7: 7, 6: 5, 0: 5, 5: 4, 1: 3, 4: 3, 3: 3, 2: 2})
batch 17, Counter({7: 5, 5: 5, 2: 5, 6: 4, 1: 4, 0: 4, 4: 3, 3: 2})
batch 18, Counter({2: 6, 3: 5, 6: 5, 7: 4, 0: 3, 1: 3, 5: 3, 4: 3})
batch 19, Counter({2: 8, 7: 7, 6: 5, 5: 4, 3: 3, 1: 2, 0: 2, 4: 1})
batch 20, Counter({3: 6, 0: 6, 7: 5, 5: 5, 6: 5, 2: 2, 4: 2, 1: 1})
batch 21, Counter({5: 8, 6: 6, 2: 5, 4: 4, 1: 3, 7: 3, 3: 3})
batch 22, Counter({4: 6, 3: 5, 7: 5, 0: 5, 2: 4, 1: 4, 6: 2, 5: 1})

As expected, classes with smaller frequencies get sampled more often.

We can also take one batch and inspect the shapes as a sanity check:

batch = next(iter(get_dls(32, train_sampler)[0]))
batch[0].shape
torch.Size([32, 2, 64, 1719])

And plot one data item in the batch. We can see that this particular item has already been preprocessed and augmented and is ready to be passed to the model.

plot_spectrogram(batch[0][0][0]);

Creating the Model

As a final piece of our creation process, we can now create the model that will learn this data.

I chose to design and train a model from scratch.

Before designing the model, we will create some useful functions that will make it easier to create layers in our model.

The conv function returns a sequential block of a convolutional layer, a ReLU activation and a batch normalization layer, in that order.

The linear_classifier function returns a sequential block of a batch normalization layer, a dropout layer and finally a linear layer.

AdaptiveConcatPool2d is a layer I got from the fastai library that concatenates an adaptive average pool and an adaptive max pool, which tends to give better results than using either one individually. The only catch is that its output is double the number of channels of the previous layer, e.g., if our last conv block outputs 64 channels, the output from the concat pool will be 64 * 2 = 128, since the average pool contributes 64 values and the max pool contributes another 64.

def conv(ni,nf, ks=3, act=True):
  layers = [ ]
  layers.append(nn.Conv2d(ni, nf, kernel_size=ks, stride=2, padding=ks//2))
  if act: layers.append(nn.ReLU())
  layers.append(nn.BatchNorm2d(nf))
  
  return nn.Sequential(*layers)

def linear_classifier(nf, out):
  layers = [ ]
  layers.append(nn.BatchNorm1d(num_features=nf))
  layers.append(nn.Dropout(0.25))
  layers.append(nn.Linear(in_features=nf, out_features=out))
  
  return nn.Sequential(*layers)

class AdaptiveConcatPool2d(nn.Module):
  "Layer that concats `AdaptiveAvgPool2d` and `AdaptiveMaxPool2d`"
  def __init__(self, size=None):
    super(AdaptiveConcatPool2d, self).__init__()
    self.size = size or 1
    self.ap = nn.AdaptiveAvgPool2d(self.size)
    self.mp = nn.AdaptiveMaxPool2d(self.size)

  def forward(self, x): return torch.cat([self.mp(x), self.ap(x)], 1)

Next, we define our custom model by subclassing nn.Module. We will have 4 conv blocks, followed by our pooling block and then three linear blocks.

class Net(nn.Module):
  
  def __init__(self):
    super(Net, self).__init__()

    # Conv Layers
    self.conv_layers = nn.Sequential(
        conv(2, 8, ks=5),
        conv(8, 16),
        conv(16, 32),
        conv(32, 64))
    
    # Adaptive Concat Pool
    self.pool = nn.Sequential(
        AdaptiveConcatPool2d(size=1),
        nn.Flatten())
    
    # Linear Classifiers
    # the first layer's input is double the last conv layer's output
    # because our adaptive pool concats avg and max pool
    self.lin = nn.Sequential(
        linear_classifier(128, 256),
        linear_classifier(256, 128),
        linear_classifier(128, 8))
  
  def forward(self, x):
    x = self.conv_layers(x)
    x = self.pool(x)
    x = self.lin(x)

    return x
net = Net()
net
Net(
  (conv_layers): Sequential(
    (0): Sequential(
      (0): Conv2d(2, 8, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
      (1): ReLU()
      (2): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): Sequential(
      (0): Conv2d(8, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (1): ReLU()
      (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (2): Sequential(
      (0): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (1): ReLU()
      (2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (3): Sequential(
      (0): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (1): ReLU()
      (2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (pool): Sequential(
    (0): AdaptiveConcatPool2d(
      (ap): AdaptiveAvgPool2d(output_size=1)
      (mp): AdaptiveMaxPool2d(output_size=1)
    )
    (1): Flatten(start_dim=1, end_dim=-1)
  )
  (lin): Sequential(
    (0): Sequential(
      (0): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.25, inplace=False)
      (2): Linear(in_features=128, out_features=256, bias=True)
    )
    (1): Sequential(
      (0): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.25, inplace=False)
      (2): Linear(in_features=256, out_features=128, bias=True)
    )
    (2): Sequential(
      (0): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.25, inplace=False)
      (2): Linear(in_features=128, out_features=8, bias=True)
    )
  )
)
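Before training, we can sanity-check the architecture by passing a randomly generated batch shaped like our spectrograms through the untrained network. It should return one logit per diagnosis class; this is only a shape check, the values themselves are meaningless:

# dummy batch: (batch size, channels, mel bins, time frames)
dummy = torch.randn(4, 2, 64, 1719)
net(dummy).shape  # expected: torch.Size([4, 8])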

Training The Model

Finally, after all our pipelining and modeling, we can train our model:

def train_epoch(model, optimizer, loss_fn, train_loader, scheduler=None):
  # set the model to training mode
  model.train()
  running_loss = 0.0
  for images, labels in train_loader:
    # move the data to the training device
    images, labels = images.to(device), labels.to(device)
    preds = model(images)
    loss = loss_fn(preds, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if scheduler:
      scheduler.step()

    running_loss += loss.item()
  
  return running_loss / len(train_loader)

def validate_epoch(model, loss_fn, val_loader):
  # set the model to evaluation mode
  model.eval()
  running_loss = 0.0
  correct = 0
  total = 0

  with torch.no_grad():
    for images, labels in val_loader:
      # move the data to the training device
      images, labels = images.to(device), labels.to(device)
      preds = model(images)
      loss = loss_fn(preds, labels)

      predicted = torch.argmax(preds, axis=1)
      total += labels.shape[0]
      correct += int((predicted==labels).sum())
      running_loss += loss.item()
  
  return running_loss / len(val_loader), correct / total


# The Main Training Loop
def training_loop(epochs,model, optimizer, loss_fn, train_loader, val_loader, 
                  scheduler=None):
  # loop through the epochs
  for epoch in range(1, epochs+1):
    # forward pass + backpropagation
    train_loss = train_epoch(model, optimizer, loss_fn, train_loader, 
                             scheduler=scheduler)

    val_loss, acc = validate_epoch(model, loss_fn, val_loader)

    if epoch == 1 or epoch % 2 == 0:
      print('\n')
      print(f'Epoch {epoch}/{epochs}')
      print('-' * 10)
      print(f'Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')
      print(f'Accuracy: {acc:.2}')

We use a batch size of 128 and train for 20 epochs:

trainloader, validloader = get_dls(128)

We will use AdamW as our optimizer and train using the one-cycle learning rate policy from Leslie Smith’s super-convergence paper, which lets us train neural networks an order of magnitude faster than ordinary methods. Since this is a classification task, we use the cross-entropy loss function.

net = Net()

# Initialize the weights
def init_weights(m):
  if type(m) == nn.Linear or type(m) == nn.Conv2d:
    nn.init.xavier_uniform_(m.weight)

net.apply(init_weights)

device = (torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu'))
net.to(device)

epochs = 20
lr = 0.001
optimizer = optim.AdamW(net.parameters(), lr=lr)
scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr=lr,
                                          steps_per_epoch=int(len(trainloader)),
                                          epochs=epochs,
                                          anneal_strategy='cos')
loss_fn = nn.CrossEntropyLoss()


training_loop(
    epochs=epochs,
    model=net,
    optimizer=optimizer,
    loss_fn=loss_fn,
    train_loader=trainloader,
    val_loader=validloader,
    scheduler=scheduler)


Epoch 1/20
----------
Train Loss: 2.7297, Val Loss: 2.3048
Accuracy: 0.0054


Epoch 2/20
----------
Train Loss: 2.3913, Val Loss: 2.6318
Accuracy: 0.06


Epoch 4/20
----------
Train Loss: 1.5233, Val Loss: 2.1714
Accuracy: 0.43


Epoch 6/20
----------
Train Loss: 1.2452, Val Loss: 0.8995
Accuracy: 0.76


Epoch 8/20
----------
Train Loss: 0.9807, Val Loss: 0.8484
Accuracy: 0.74


Epoch 10/20
----------
Train Loss: 0.9345, Val Loss: 0.8602
Accuracy: 0.74


Epoch 12/20
----------
Train Loss: 0.8758, Val Loss: 0.6720
Accuracy: 0.79


Epoch 14/20
----------
Train Loss: 0.7463, Val Loss: 0.6032
Accuracy: 0.81


Epoch 16/20
----------
Train Loss: 0.7504, Val Loss: 0.7792
Accuracy: 0.78


Epoch 18/20
----------
Train Loss: 0.6423, Val Loss: 0.7427
Accuracy: 0.78


Epoch 20/20
----------
Train Loss: 0.6852, Val Loss: 0.6396
Accuracy: 0.82

In this notebook, we learned how to use Deep Learning to solve audio problems. It was a great learning experience for me and I am definitely going to delve more into this subfield.

References

Useful Kaggle Kernels:

  • https://www.kaggle.com/dienhoa/healthy-lung-classification-spectrogram-fast-ai
  • https://www.kaggle.com/craq21/pytorch-meets-audio
  • https://www.kaggle.com/shivam316/part-1-preprocessing
  • https://www.kaggle.com/shivam316/part-2-handel-imbalance-creating-spectrogram
  • https://www.kaggle.com/shivam316/part-3-feature-extraction-modeling-95-acc

Useful Blog Posts and Websites:

  • Audio manipulation with torchaudio - https://pytorch.org/audio_preprocessing_tutorial
  • 6 part blog - https://towardsdatascience.com/audio-deep-learning-made-simple-part-1-state-of-the-art-techniques-da1d3dff2504
  • fastaudio - https://fastaudio.github.io/Introduction%20to%20Audio.html