This is the first in what I'm hoping to make a series of posts on representation learning and unsupervised methods in general. I've noticed that there are far fewer resources out there detailing these topics than there are for common supervised learning topics, and next-to-none that show them off in practice (i.e. with code) along with the underlying math. I'd like these posts to be accessible to a wider audience while still providing mathematical intuition.
Suppose you are at a banquet with $n$ total attendees, all simultaneously engaged in conversation. Should you stand in the middle of this crowd, you will be able to pick out individual voices to tune in and out of at will; however, any microphone positioned in the banquet hall will record an incomprehensible cacophony, all $n$ voices jumbled together, each weighted by its distance from the device. Say you would like to be able to listen to the crowd of voices on a per-speaker basis. With only the one recording, you might be out of luck. But if, instead of one recording, you have recordings from $n$ microphones each placed at a different position, how can you recover the individual voice signals of every attendee?
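This setup can be written as a linear mixing model: each microphone records a weighted sum of the $n$ source signals, with weights determined by each speaker's distance to that microphone. Here is a minimal sketch with made-up signals and a hypothetical mixing matrix (these are not the post's actual recordings, just an illustration of the mixing process):

```python
import numpy as np

# Two toy "voice" signals sampled over one second
t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 5 * t)             # source 1: a sine wave
s2 = np.sign(np.sin(2 * np.pi * 3 * t))    # source 2: a square wave
S = np.stack([s1, s2])                     # shape: (n_sources, n_samples)

# Hypothetical mixing matrix: row i holds the weights with which
# microphone i picks up each source (closer speaker -> larger weight)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Each microphone's recording is a weighted sum of the sources
X = A @ S                                  # shape: (n_mics, n_samples)
```

Recovering the sources from $X$ alone, without knowing $A$, is exactly the blind source separation problem this post works toward.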
What better way to illustrate this than to listen to some recordings! (The individual source voice signals were created by Google Translate Text-to-Speech and then mixed by me.)
```python
import numpy as np
from numpy import linalg
import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile
from typing import Tuple
import os
import glob
from IPython.display import Audio, display
```
```python
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
```
```python
# create convenience function for plotting and playing audio
def show_audio(a: Tuple[int, np.ndarray]) -> None:
    # a: (sample_rate, audio_array)
    samp_rate, audio = a
    fig, ax = plt.subplots()
    time_axis = np.linspace(start=0, stop=len(audio) / samp_rate, num=len(audio))
    ax.plot(time_axis, audio)
    ax.set_xlabel('Time (seconds)')
    ax.set_ylabel('Amplitude')
    display(Audio(audio, rate=samp_rate))
```
```python
# collect all the wav files
files = glob.glob('./data/mixed_data/*.wav')

samp_rates = []
sound_list = []

# collect sampling frequencies and audio signals
for f in files:
    samp_rate, sound = wavfile.read(f)
    samp_rates.append(samp_rate)
    sound_list.append(sound)
```
```python
# store as numpy array (stacking assumes all clips have the same length)
audio_array = np.array(sound_list)
```
```python
# listen and visualize sound waves as sanity check
for a in zip(samp_rates, sound_list):
    show_audio(a)
```