DiVA: Distilled Voice Assistant

Will Held*, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, Diyi Yang**
All authors except the first and last are ordered alphabetically

[TL;DR] DiVA Llama 3 outperforms existing Speech LMs on QA, Emotion Recognition, and Translation with a speech encoder trained using only weak supervision. DiVA learns to encode speech while preserving the underlying LLM output distribution using cross-modal context distillation between text and speech. DiVA was trained with open-source code in Levanter on 3.5k hours of publicly available and permissively licensed ASR data from Common Voice.

Demo

Method

Figure: Training pipeline for assistant distillation. The trainable modules (red) are the Whisper Decoder, Query Tokens, and a Projection; the frozen modules (blue) are the Whisper Encoder and all of Llama.

Large Language Models are increasingly aligned to provide a wide range of assistant capabilities, but only for text inputs. Post-hoc early fusion with massively multitask finetuning can add speech support, but has been observed to cause forgetting. Instead, we merge the capabilities of an existing ASR model and an existing LLM into a single differentiable model via distillation. Similar to context distillation, this preserves the model's core capabilities while adding native speech support that simplifies and accelerates inference.
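To make the objective concrete, the core idea can be pictured as a KL divergence between the frozen LLM's next-token distribution when it reads the text transcript and its distribution when it instead receives the projected speech embeddings. Below is a minimal PyTorch sketch of that idea, not the actual Levanter training code; the single-step loss, the `speech_encoder` module, and the Hugging Face-style model interface are simplifying assumptions.

    import torch
    import torch.nn.functional as F

    def context_distillation_loss(frozen_llm, speech_encoder, audio, transcript_ids):
        # Teacher: the frozen LLM's next-token distribution when reading the text transcript.
        with torch.no_grad():
            text_embeds = frozen_llm.get_input_embeddings()(transcript_ids)
            teacher_logits = frozen_llm(inputs_embeds=text_embeds).logits[:, -1, :]

        # Student: the same frozen LLM, fed continuous speech embeddings produced by the
        # trainable modules (in DiVA: Whisper decoder, query tokens, and a projection).
        speech_embeds = speech_encoder(audio)          # (batch, n_query_tokens, llm_dim)
        student_logits = frozen_llm(inputs_embeds=speech_embeds).logits[:, -1, :]

        # KL divergence pulls the speech-conditioned distribution toward the text-conditioned
        # one, preserving the LLM's original response behaviour for spoken inputs.
        return F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.log_softmax(teacher_logits, dim=-1),
            log_target=True,
            reduction="batchmean",
        )

Only the parameters feeding the speech embeddings receive gradient updates; the LLM's weights stay frozen, so its text behaviour is untouched.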

Evaluation

We assess our model on two capabilities that should transfer directly from the text LLM, Spoken Question Answering and Translation, as well as one capability that relies more directly on understanding tone, Emotion Recognition. While the numerical results are quite strong given that we never directly train on any of these tasks in the speech domain, we encourage readers to draw their own conclusions using the interactive demo comparing models above!

QA results on Spoken Dialect QA and HeySQUAD. DiVA significantly (p < 0.05) outperforms SALMONN and Qwen Audio.

Evaluation results on a large general-purpose QA benchmark [HeySQUAD] and a smaller benchmark evaluating the robustness of Speech LMs to accent and dialectal variation [Spoken Dialect QA]. DiVA significantly improves (p < 0.05) over the baselines by at least 10% (+5 PANDA) across both benchmarks and all accents. However, it is unclear whether the baselines' lower accuracy is directly attributable to catastrophic forgetting. We qualitatively explore this question by labeling a sample of 50 responses from the HeySQUAD dataset for whether each response includes even an attempted answer relevant to the task.

Qwen Audio shows signs of severe forgetting, with 30% of responses ignoring the prompt instructions entirely and instead transcribing the question, e.g., "The citation for the Pearson v. Society of Sisters case is What is the citation for the Pearson v. Society of Sisters case?". By comparison, SALMONN, which applies inference-time interventions to reduce overfitting by partially ablating the LoRA modules learned for the base LLM, fares better, with only 8% of model responses ignoring the prompt and transcribing instead. DiVA sees no cases where the model ignores the prompt, ironically despite being trained only on transcription data.

Translation results on CoVoST 2. DiVA significantly (p < 0.05) outperforms SALMONN and Qwen Audio on 4 out of 7 languages.

Next, we assess the speech-to-text translation capabilities of each model, from English speech to text in another language. We prompt each model with the instruction "Translate the input from {input_lang} to {output_lang}." DiVA performs significantly better than both other models on 4 out of 7 languages. This is particularly promising because both SALMONN and Qwen Audio train directly on CoVoST 2 training data, while DiVA operates on the benchmark zero-shot, relying only on capabilities transferred from the base LLM. Notably, Qwen Audio, which performs the best on Chinese, German, and Japanese, was trained with 3,700 hours of speech-to-text translation data in German and Chinese, more data for just this one task than was used to train DiVA in total.
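As a concrete picture of this zero-shot setup, here is a small sketch of how the translation prompt could be applied and scored with corpus BLEU; the generate_from_audio call is a hypothetical stand-in for each model's actual inference API, which differs per model.

    # Illustrative only: `generate_from_audio` is a hypothetical interface.
    import sacrebleu

    def translate_clip(model, audio, input_lang="English", output_lang="German"):
        prompt = f"Translate the input from {input_lang} to {output_lang}."
        return model.generate_from_audio(audio=audio, prompt=prompt)

    def covost_bleu(model, clips, references, output_lang):
        hypotheses = [translate_clip(model, audio, output_lang=output_lang) for audio in clips]
        return sacrebleu.corpus_bleu(hypotheses, [references]).score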

DiVA notably performs poorly in Chinese and Japanese. Inspecting outputs and comparing them to Llama 3's translations of the text, we find that our distillation loss preserves a negative behaviour: for both Chinese and Japanese, Llama 3 appears to have a strong bias towards generating translations in Pinyin and Romaji rather than the expected native script. This leads to especially poor results in these languages. It highlights the dependence of the DiVA method on the strength of the base LLM, but may be easily addressed by updating to Llama 3.1.

Emotion Recognition results on MELD and IEMOCAP. DiVA significantly (p < 0.05) outperforms SALMONN and Qwen Audio.

For emotion classification, we prompt each model with the instruction "Respond in a single word what emotion the input exhibits. If there is no clear emotion, respond 'Neutral'." To use each model as a classifier, we follow MMLU: we compute the log-probability assigned to each possible label and take the most likely as the model's prediction.
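For illustration, this scoring scheme might look like the sketch below, which scores each candidate label by the log-probability of its first token and returns the argmax; the forward_from_audio interface and the MELD-style label set are assumptions rather than the released evaluation code.

    import torch.nn.functional as F

    PROMPT = ("Respond in a single word what emotion the input exhibits. "
              "If there is no clear emotion, respond 'Neutral'.")
    LABELS = ["Neutral", "Joy", "Sadness", "Anger", "Surprise", "Fear", "Disgust"]

    def classify_emotion(model, tokenizer, audio):
        # Vocabulary logits for the first token generated after the audio and instruction.
        logits = model.forward_from_audio(audio=audio, prompt=PROMPT)   # (vocab_size,)
        log_probs = F.log_softmax(logits, dim=-1)

        # Score each candidate label by the log-probability of its first token, take the argmax.
        scores = {
            label: log_probs[tokenizer(label, add_special_tokens=False).input_ids[0]].item()
            for label in LABELS
        }
        return max(scores, key=scores.get)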

DiVA performs significantly better than both baselines on both the MELD benchmark, sourced from television, and IEMOCAP, which consists of recorded conversations. In comparison to DiVA, both baseline models struggle to predict a diverse array of labels: Qwen Audio predicts Sadness for more than 90% of inputs on both MELD and IEMOCAP, while SALMONN behaves similarly with Neutral predictions.

These results are quite surprising given that DiVA is trained without any explicit emotion supervision. However, many examples in both IEMOCAP and MELD communicate emotion through both text and audio signals, which may confound how well these evaluations capture true sociophonetic signal. This highlights the need for more truly multimodal tests of speech LMs, especially as they become more prevalent.

Looking Forward

We built DiVA around a few guiding principles so that it is more than just a single fixed model and can serve as a foundation for future speech LM research.

  1. Fully Open. We use entirely open, permissively licensed data so that both academic and industry researchers can build on DiVA for their work. Models trained on outputs from GPT or other closed-source models, on the other hand, sit in a questionable legal position, since the terms of service of most platform providers forbid training on model outputs. Furthermore, unlike prior models, we release DiVA's full training code, not just evaluation code, so that it is easier for future work to build their own DiVAs!
  2. Accessible to Train. One advantage of our distillation loss is that it enables strong results, comparable to Qwen Audio for speech, with an order of magnitude less data. This means the model trains in under a day. By implementing DiVA in Levanter, we are able to leverage both TPU and GPU resources. For example, all hardware used to train DiVA was accessed through the Google-supported TPU Research Cloud, which is generously available to researchers broadly.

    This not only expands reproducibility, but also means DiVA can be quickly and affordably applied to newly released text-only LLMs. We feel this is especially important given the exciting and increasing pace at which new capable models are released.
  3. End-To-End Differentiable. Models that accept speech directly as input have the potential to simplify and accelerate inference, reduce annotation costs, and capture the rich social information inevitably lost by ASR. We hope that, by making it easier to fuse existing capable models into an end-to-end differentiable pipeline, LLM research can expand more easily to the complexities of human speech.

BibTeX


    @misc{held2024diva,
        author="Held, Will and Li, Ella and Ryan, Michael and Shi, Weiyan and Zhang, Yanzhe and Yang, Diyi",
        title="Distilling an End-to-End Voice Assistant from Speech Recognition Data",
        year="2024",
        publisher="HuggingFace",
    }