Training Pipeline for Assistant Distillation. Red modules are trainable, while blue modules are frozen pretrained modules.
Large Language Models are increasingly aligned to provide a wide range of assistant capabilities, but only for text inputs. On the other hand, post-hoc early fusion with massively multitask finetuning has been observed to cause forgetting. We merge the capabilities of an existing ASR model and an existing LLM into a single differentiable model via distillation. Similar to context distillation, this preserves the model's core capabilities while adding native speech support that simplifies and accelerates inference.
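As a rough sketch of the kind of distillation objective this implies, the student (speech model, conditioned on audio) can be trained to match the next-token distribution of the frozen teacher (text LLM, conditioned on the transcript). The function names and the exact form of the loss below are illustrative assumptions, not DiVA's actual implementation:

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def distill_loss(student_logits, teacher_logits):
    """KL(teacher || student), averaged over token positions.

    teacher_logits: frozen text LLM run on the gold transcript
    student_logits: trainable speech model run on the raw audio
    """
    teacher_logp = log_softmax(teacher_logits)
    student_logp = log_softmax(student_logits)
    kl = (np.exp(teacher_logp) * (teacher_logp - student_logp)).sum(axis=-1)
    return kl.mean()
```

Because the target is the teacher's full distribution rather than a hard task label, the student inherits the LLM's behavior on transcribed speech without any task-specific supervision.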
We assess our model on two capabilities that should transfer directly from the text LLM: Spoken Question Answering and Translation. We also test one capability that relies more directly on understanding tone: Emotion Recognition. While these results are numerically quite strong given that we do not directly train on any of these tasks in the speech domain, we encourage readers to draw their own conclusions using the interactive demo comparing models above!
Evaluation results on a large general-purpose QA benchmark [HeySQUAD] and a smaller benchmark evaluating the robustness of Speech LMs to accent and dialectal variation [Spoken Dialect QA]. DiVA significantly improves (P<0.05) over the baselines by at least 10% (+5 PANDA) across both benchmarks and all accents. However, it is unclear whether lower accuracy is directly attributable to catastrophic forgetting. We qualitatively explore this question by labeling a sample of 50 responses from the HeySQUAD dataset for whether they include even an attempted answer relevant to the task.
Qwen Audio shows signs of severe forgetting, with 30% of responses ignoring the prompt instructions entirely and instead transcribing the question, e.g. "The citation for the Pearson v. Society of Sisters case is What is the citation for the Pearson v. Society of Sisters case?". By comparison, SALMONN, which applies inference-time interventions to reduce overfitting by partially ablating the LoRA modules learned for the base LLM, sees reduced overfitting, with only 8% of model responses ignoring the prompt and instead transcribing. DiVA sees no cases where the model ignores the prompt, ironically despite being trained only on transcription data.
Finally, we assess each model's speech-to-text translation capabilities from English speech to text in another language. We prompt each model with the instruction Translate the input from {input_lang} to {output_lang}. DiVA performs significantly better than both other models on 4 out of 7 languages. This is particularly promising since both SALMONN and Qwen Audio train directly on CoVoST 2 training data, while DiVA operates on the benchmark zero-shot, relying only on capabilities transferred from the base LLM. Notably, Qwen Audio, which performs the best on Chinese, German, and Japanese, was trained with 3,700 hours of speech-to-text translation data in German and Chinese: more data on just this one task than was used to train DiVA in total.
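For concreteness, filling the instruction template above per language pair is straightforward; the helper function name here is ours, not part of the released code:

```python
# Mirrors the instruction template quoted above; the helper name
# build_translation_prompt is a hypothetical convenience wrapper.
TEMPLATE = "Translate the input from {input_lang} to {output_lang}."

def build_translation_prompt(input_lang: str, output_lang: str) -> str:
    """Format the zero-shot translation instruction for one pair."""
    return TEMPLATE.format(input_lang=input_lang, output_lang=output_lang)
```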
DiVA notably performs poorly in Chinese and Japanese. Inspecting outputs and comparing them to translations from Llama 3 in response to text, we find that our distillation loss preserves a negative behavior: for both Chinese and Japanese, Llama 3 appears to have a strong bias towards generating translations in Pinyin and Romaji, rather than the expected native scripts. This leads to especially poor results in these languages. It highlights the dependence of the DiVA method on the strength of the base LLM, but may be easily addressed by updating to Llama 3.1.
For emotion classification, we prompt each model with the instruction Respond in a single word what emotion the input exhibits. If there is no clear emotion, respond 'Neutral'. To use each model as a classifier, we follow MMLU and take the log-probabilities assigned to each possible label, taking the most likely token as the model's prediction.
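This MMLU-style scoring can be sketched as follows; the label set and token ids are illustrative assumptions, not the exact evaluation code:

```python
def classify_emotion(next_token_logits, label_token_ids):
    """Score each candidate label by the logit of its (first) token in
    the model's next-token distribution, then return the highest-scoring
    label. Softmax is monotonic, so comparing raw logits is equivalent
    to comparing log-probabilities."""
    scores = {label: next_token_logits[token_id]
              for label, token_id in label_token_ids.items()}
    return max(scores, key=scores.get)

# Hypothetical example: a 4-token vocabulary where each label happens
# to map to a single token id.
logits = [0.1, 2.5, -0.3, 1.0]
labels = {"Neutral": 0, "Happy": 1, "Sad": 2, "Angry": 3}
prediction = classify_emotion(logits, labels)  # "Happy"
```

Restricting the comparison to the label tokens sidesteps free-form generation, so the model cannot answer off-label even if instruction following is imperfect.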
DiVA performs significantly better than both baselines on both the MELD benchmark, sourced from television, and IEMOCAPS, which consists of recorded conversations.
In comparison to DiVA, both baseline models struggle to predict a diverse array of labels. Qwen-Audio predicts Sadness for more than 90% of inputs on both MELD and IEMOCAPS, while SALMONN behaves similarly with Neutral predictions.
These results are quite surprising given that DiVA is trained without explicit emotion supervision. However, many examples in both IEMOCAPS and MELD communicate emotion through both text and audio signals, which may confound how well these evaluations capture true sociophonetic signal. This highlights the need for more truly multimodal tests of speech LMs, especially as they become more prevalent.
We built DiVA with a few guiding principles so that it serves not just as a single fixed model, but as a foundation for future speech LM research.
@misc{held2024diva,
author="Held, Will and Li, Ella and Ryan, Michael and Shi, Weiyan and Zhang, Yanzhe and Yang, Diyi",
title="Distilling an End-to-End Voice Assistant from Speech Recognition Data",
year="2024",
publisher="HuggingFace",
}