This example was implemented by
@NandanApoorv. Let's take a look at the model architecture.
It starts by defining two embedding layers: a positional embedding for text tokens, and an embedding for speech features, that uses 1D convolutions with strides for downsampling.