Training pipeline for assistant distillation: red modules are trainable, while blue modules are frozen pretrained components.
Large Language Models are increasingly aligned to provide a wide range of assistant capabilities, but only for text inputs. On the other hand, post-hoc early fusion with massively multitask finetuning has been observed to cause forgetting. We merge the capabilities of an existing ASR model and an existing LLM into a single differentiable model via distillation. Similar to context distillation, this preserves the model's core capabilities while adding native speech support that simplifies and accelerates inference. Our approach is efficient and quick to train, adapting the model to speech in 12 hours on a TPU v4 Pod. For concrete details, including a simple lemma showing how the KL Divergence can be optimized with a stable, compute-efficient proxy, please read the methods section of our work!
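For intuition, below is a minimal sketch of a token-level distillation loss of this general kind: a forward KL between the frozen text LLM's next-token distribution (computed from the transcript) and the speech-conditioned student's distribution. This is an illustration only, not the compute-efficient proxy derived in the paper, and all tensor and function names are placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Token-level forward KL(teacher || student).

    student_logits: [batch, seq, vocab] from the speech-conditioned model.
    teacher_logits: [batch, seq, vocab] from the frozen text LLM on the transcript,
                    computed under torch.no_grad() since the teacher is not trained.
    Note: this is a standard KL distillation sketch, not the paper's exact proxy.
    """
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    vocab = student_logits.size(-1)
    # Flatten tokens so "batchmean" averages the KL per token position.
    return F.kl_div(
        student_logprobs.reshape(-1, vocab),
        teacher_probs.reshape(-1, vocab),
        reduction="batchmean",
    )
```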
We assess our model on two capabilities that should transfer directly from the text LLM: Spoken Question Answering and Translation. We also test three capabilities that rely on understanding tone: Emotion, Sarcasm, and Humor detection. While we feel our results are quantitatively strong, we encourage readers to assess quality for themselves using the interactive demo above, especially in comparison with similar demos for other models such as Qwen 2!
Evaluation results on a large general-purpose QA benchmark [HeySQUAD] and a smaller benchmark evaluating the robustness of Speech LMs to accent and dialectal variation [Spoken Dialect QA]. DiVA improves significantly (p < 0.05) over the baselines (+5 PANDA) across both benchmarks and all accents. However, it is unclear whether the baselines' lower accuracy is directly attributable to catastrophic forgetting. We explore this question qualitatively by labeling a sample of 50 responses from the HeySQUAD dataset for whether each response even attempts an answer relevant to the task.
Qwen Audio shows signs of severe forgetting, with 30% of responses ignoring the prompt instructions entirely and instead transcribing the question, e.g. "The citation for the Pearson v. Society of Sisters case is What is the citation for the Pearson v. Society of Sisters case?". By comparison, SALMONN, which applies inference-time interventions to reduce overfitting by partially ablating the LoRA modules learned for the base LLM, sees reduced overfitting, with only 8% of model responses ignoring the prompt and transcribing instead. Qwen 2, with over 350k hours of training data, sees relatively minimal forgetting, with just 4% of responses ignoring the prompt. DiVA sees no cases where the model ignores the prompt, ironically despite being trained only on transcription data.
Finally, we assess the speech-to-text translation capabilities of each model from English speech to text in another language. We prompt each model with the instruction "Translate the input from {input_lang} to {output_lang}." DiVA is the second-best model on average, behind only Qwen 2! This is particularly promising given that the Qwen Audio models report training directly on CoVoST 2, while DiVA relies only on capabilities transferred from the base LLM. DiVA performs best on the languages with the least data in CoVoST 2, Tamil and Turkish, so the generalization is still promising. For example, Qwen Audio trains on 3,700 hours of speech-to-text translation in German and Chinese, more data on this one task alone than was used to train DiVA in total. Qwen 2 likely increased this volume as part of its 370k+ hours of training data, but sadly minimal details about that data are released.
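As a rough sketch of this setup, the snippet below builds the prompt from the instruction above and scores model outputs against reference translations with sacreBLEU. The use of sacreBLEU as the metric is our assumption for illustration, and the helper names are hypothetical.

```python
import sacrebleu

def build_prompt(input_lang: str, output_lang: str) -> str:
    # The instruction string quoted in the text above.
    return f"Translate the input from {input_lang} to {output_lang}."

def score(hypotheses: list[str], references: list[str]) -> float:
    """Corpus-level BLEU between model outputs and CoVoST 2-style references."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```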
DiVA notably performs poorly in Chinese and Japanese. Inspecting outputs and comparing them to Llama 3's translations of the corresponding text, we find that our distillation loss preserves a negative behaviour: for both Chinese and Japanese, Llama 3 appears to have a strong bias towards generating translations in Pinyin and Romaji rather than the expected native script. This leads to especially poor results in these languages. It highlights the dependence of the DiVA method on the strength of the base LLM, but may be easily addressed by updating to Llama 3.1.
For emotion classification, we prompt each model with the instruction "Respond in a single word what emotion the input exhibits. If there is no clear emotion, respond 'Neutral'." To use each model as a classifier, we follow MMLU and take the log-probabilities assigned to each possible label, selecting the most likely as the model's prediction.
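The sketch below illustrates this MMLU-style scoring with a generic Hugging Face causal LM. Each Speech LM is scored the same way under its own prompt format; the model name and label set here are only placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder text LM; the speech models are scored analogously on audio inputs.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Example label set; the benchmarks define their own label inventories.
LABELS = ["Neutral", "Joy", "Sadness", "Anger", "Fear", "Surprise", "Disgust"]

def classify(prompt: str) -> str:
    """Pick the label whose first token receives the highest log-probability."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    logprobs = torch.log_softmax(next_token_logits, dim=-1)
    scores = {
        label: logprobs[tokenizer.encode(" " + label, add_special_tokens=False)[0]].item()
        for label in LABELS
    }
    return max(scores, key=scores.get)
```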
DiVA performs significantly better than all baselines on both the MELD benchmark, sourced from television, and IEMOCAP, which is drawn from recorded conversations.
In comparison to DiVA, the baseline models struggle to predict a diverse array of labels. Qwen-Audio predicts Sadness for more than 90% of inputs on both MELD and IEMOCAP, while SALMONN and Qwen 2 Audio behave similarly, defaulting to Neutral predictions.
These results are quite surprising given that DiVA is trained without any explicit emotion supervision. However, many examples in both IEMOCAP and MELD communicate emotion through both text and audio signals, which may confound how well these evaluations capture true sociophonetic signal. This highlights a possible need for more carefully designed emotion recognition tests.
On the other hand, DiVA, and most other models, perform relatively poorly on Sarcasm and Humor detection! Only one model performs significantly above chance on either of these tasks: Qwen Audio on Humor detection. Interestingly, Qwen 2 Audio regresses on this task. These results show that all Speech LLMs likely still have a way to go in understanding more nuanced sociophonetic communication.
We built DiVA around a few guiding principles so that it is more than just a single fixed model and can serve as a foundation for future Speech LM research.
@misc{held2024diva,
    author = "Held, Will and Li, Ella and Ryan, Michael and Shi, Weiyan and Zhang, Yanzhe and Yang, Diyi",
    title = "Distilling an End-to-End Voice Assistant from Speech Recognition Data",
    year = "2024",
    publisher = "HuggingFace",
}