Distilled Voice Assistant (DiVA)

Distilling an End-to-End Voice Assistant Without Instruction Training Data

Will Held*, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, Diyi Yang**
[Initial Release: July 26, 2024; Last Updated October 4, 2024]
All authors except the first and last are ordered alphabetically

[TL;DR] DiVA is a new method to turn a text-only LLM into a Speech LLM using only weak supervision. DiVA learns to encode speech while preserving the underlying LLM's output distribution using cross-modal context distillation between text and speech. DiVA Llama 3 8B is preferred by users over prior SoTA Speech LLMs and achieves competitive benchmark numbers, despite training with 100x less compute. DiVA was trained using entirely open-source code in Levanter, on 3.5k hours of publicly available ASR data from Common Voice, and is released under the Mozilla Public License.


Demo

Method

Training pipeline for assistant distillation. The trainable (red) modules are the Whisper Decoder, Query Tokens, and a Projection; the frozen (blue) modules are the Whisper Encoder and all of Llama.

Large Language Models are increasingly aligned to provide a wide range of assistant capabilities, but only for text inputs. On the other hand, post-hoc early fusion with massively multitask finetuning has been observed to cause forgetting. We instead merge the capabilities of an existing ASR model and an existing LLM into a single differentiable model via distillation. Similar to context distillation, this preserves the model's core capabilities while adding native speech support that simplifies and accelerates inference. Our approach is efficient and quick to train, adapting the model to speech in 12 hours on a TPU v4 Pod. For concrete details, including a simple lemma showing how the KL divergence can be optimized with a stable, compute-efficient proxy, please read the methods section of our work!
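
For readers who want a concrete picture, below is a minimal sketch of the plain cross-modal distillation objective in JAX. It assumes you already have next-token logits from the frozen Llama teacher (conditioned on the text transcript) and from the speech-conditioned student; the function name and shapes are illustrative, not the actual Levanter implementation, and the stable, compute-efficient proxy from our lemma is not shown here.

    # Minimal, illustrative sketch of cross-modal context distillation (not the Levanter code).
    # teacher_logits: next-token logits from the frozen LLM given the *text* transcript.
    # student_logits: next-token logits from the same frozen LLM given the speech embeddings
    #                 produced by the trainable Whisper decoder, query tokens, and projection.
    import jax.numpy as jnp
    from jax.nn import log_softmax

    def cross_modal_distillation_loss(teacher_logits, student_logits):
        """KL(teacher || student) over the next-token distribution, averaged over the batch."""
        teacher_logp = log_softmax(teacher_logits, axis=-1)  # [batch, vocab]
        student_logp = log_softmax(student_logits, axis=-1)  # [batch, vocab]
        kl = jnp.sum(jnp.exp(teacher_logp) * (teacher_logp - student_logp), axis=-1)
        return jnp.mean(kl)

Only the red modules in the figure above (Whisper Decoder, Query Tokens, and Projection) would receive gradients from this loss; the Llama weights stay frozen.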

Evaluation

We assess our model on two capabilities that should transfer directly from the text LLM, Spoken Question Answering and Translation, and on three capabilities that rely on understanding tone: Emotion, Sarcasm, and Humor detection. While we feel our results are quantitatively strong, we encourage people to assess the quality for themselves using the interactive demo above, especially compared to similar demos for other models such as Qwen 2!

QA results on Spoken Dialect QA and HeySQUAD. DiVA significantly (p < 0.05) outperforms SALMONN and both Qwen Audio models.

Evaluation results on a large general-purpose QA benchmark [HeySQUAD] and a smaller benchmark evaluating the robustness of Speech LMs to accent and dialectal variation [Spoken Dialect QA]. DiVA significantly improves (p < 0.05) over the baselines (+5 PANDA) across both benchmarks and all accents. However, it is unclear whether the baselines' lower accuracy can be directly attributed to catastrophic forgetting. We explore this question qualitatively by labeling a sample of 50 responses from the HeySQUAD dataset for whether each response even attempts an answer relevant to the task.

Qwen Audio shows signs of severe forgetting, with 30% of responses ignoring the prompt instructions entirely and instead transcribing the question, e.g. "The citation for the Pearson v. Society of Sisters case is What is the citation for the Pearson v. Society of Sisters case?". By comparison, SALMONN, which applies inference-time interventions to reduce overfitting by partially ablating the LoRA modules learned for the base LLM, fares better, with only 8% of responses ignoring the prompt and instead transcribing. Qwen 2, with over 350k hours of training data, sees relatively minimal forgetting, with just 4% of responses ignoring the prompt. DiVA sees no cases where the model ignores the prompt, ironically, despite being trained only on transcription data.

Speech translation results on CoVoST 2. Qwen 2 performs strongest on the most language pairs, with DiVA coming in second.

Finally, we assess each model's speech-to-text translation capabilities from English speech to text in another language. We prompt each model with the instruction "Translate the input from {input_lang} to {output_lang}." DiVA is the second-best model on average, behind only Qwen 2! This is particularly promising given that the Qwen Audio models report training directly on CoVoST 2, while DiVA relies only on capabilities transferred from the base LLM. Given that DiVA performs best on the languages with the least data in CoVoST 2, Tamil and Turkish, the generalization is still promising. For example, Qwen Audio trains on 3,700 hours of speech-to-text translation in German and Chinese alone, more data on just this one task than was used to train DiVA in total. Qwen 2 likely increased this volume as part of its 370k+ hours of training data, but sadly minimal details are released about its training data.
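
To make the setup concrete, here is a small illustrative sketch of how the translation instruction above can be instantiated for a few of the CoVoST 2 language pairs we discuss; the pair list and prompt handling are simplified assumptions rather than the exact evaluation harness.

    # Illustrative sketch of the translation instruction used above (exact harness may differ).
    TEMPLATE = "Translate the input from {input_lang} to {output_lang}."

    # A few English -> X pairs from CoVoST 2 discussed above, for illustration.
    PAIRS = [("English", "German"), ("English", "Chinese"),
             ("English", "Tamil"), ("English", "Turkish")]

    for input_lang, output_lang in PAIRS:
        instruction = TEMPLATE.format(input_lang=input_lang, output_lang=output_lang)
        # Each model receives this instruction alongside the English audio clip.
        print(instruction)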

DiVA notably performs poorly in Chinese and Japanese. Inspecting outputs and comparing them to Llama 3's translations of the text inputs, we find that our distillation loss preserves a negative behaviour: for both Chinese and Japanese, Llama 3 appears to have a strong bias towards generating translations in Pinyin and Romaji rather than the expected native scripts, which leads to especially poor results in these languages. This highlights the dependence of the DiVA method on the strength of the base LLM, but it may be easily addressed by updating to Llama 3.1.

Emotion recognition results on MELD and IEMOCAP. DiVA significantly (p < 0.05) outperforms SALMONN and Qwen Audio.

For emotion classification, we prompt each model with the instruction "Respond in a single word what emotion the input exhibits. If there is no clear emotion, respond 'Neutral'." To use each model as a classifier, we follow MMLU: we take the log-probabilities assigned to each possible label and select the most likely one as the model's prediction.
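
As a minimal sketch of that scoring step, the snippet below picks the label whose first token receives the highest log-probability; the label set (MELD's seven emotions) and the assumption that each label maps to a distinct single token are simplifications for illustration, not the exact evaluation code.

    # Illustrative sketch of MMLU-style label scoring (not the exact evaluation code).
    import jax.numpy as jnp
    from jax.nn import log_softmax

    # MELD's seven emotion labels, used here as the candidate set.
    LABELS = ["Neutral", "Joy", "Sadness", "Anger", "Surprise", "Fear", "Disgust"]

    def classify_emotion(next_token_logits, label_token_ids):
        """Return the label whose first token gets the highest log-probability.

        next_token_logits: [vocab] logits for the token immediately after the prompt.
        label_token_ids: first token id of each label under the model's tokenizer,
                         assumed here to be distinct single tokens for simplicity.
        """
        logp = log_softmax(next_token_logits, axis=-1)
        label_scores = logp[jnp.asarray(label_token_ids)]
        return LABELS[int(jnp.argmax(label_scores))]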

DiVA performs significantly better than all baselines on both the MELD benchmark, sourced from television, and IEMOCAP, which operates over recorded conversations. In comparison to DiVA, the baseline models struggle to predict a diverse array of labels. Qwen Audio predicts Sadness for more than 90% of inputs on both MELD and IEMOCAP, while SALMONN and Qwen 2 Audio behave similarly with Neutral predictions.

These results are quite surprising given that DiVA is trained without any explicit emotion supervision. However, many examples in both IEMOCAP and MELD communicate emotion through both text and audio signals, which may confound how well these evaluations capture true sociophonetic signal. This highlights a possible need for more carefully designed emotion recognition tests.

Sarcasm and Humor detection results. Most models, including DiVA, perform near chance; only Qwen Audio performs significantly above chance, on Humor detection.

On the other hand, DiVA, and most other models, perform relatively poorly on the Sarcasm and Humor detection tasks! Only one model performs significantly above chance on either task: Qwen Audio on Humor detection. Interestingly, Qwen 2 Audio regresses on this task. These results show that all Speech LLMs likely have a ways to go in understanding more nuanced sociophonetic communication.



Looking Forward

We built DiVA around a few principles so that it is more than just a single fixed model and can serve as a foundation for future Speech LM research.

  1. Fully Open. We use entirely open, permissively licensed data so that both academic and industry researchers can build on DiVA in their work. Models trained on outputs from GPT or other closed-source models, on the other hand, occupy a questionable legal status, since the terms of service of most platform providers forbid training on model outputs. Furthermore, unlike prior models, we release DiVA's full training code, not just evaluation code, so that it is easier for future work to build their own DiVAs!

  2. Accessible to Train. One advantage of our distillation loss is that it achieves strong results, comparable to Qwen Audio for speech, with an order of magnitude less data. This means the model trains in under a day. By implementing in Levanter, we are able to leverage both TPU and GPU resources; for example, all hardware used for training DiVA was accessed through the Google-supported TPU Research Cloud, which is generously available to researchers broadly.
    This not only expands reproducibility, but also means DiVA can be quickly and affordably applied to newly released text-only LLMs. We feel this is especially important given the exciting and increasing frequency with which new capable models are released.

  3. End-To-End Differentiable. Models that can accept speech directly as input have the potential to simplify and accelerate inference, reduce annotation costs, and capture the rich social information inevitably lost by ASR. We hope that by making it easier to fuse existing capable models into an end-to-end differentiable pipeline, LLM research can expand more easily to the complexities of human speech.

BibTeX


    @misc{held2024diva,
        author    = "Held, Will and Li, Ella and Ryan, Michael and Shi, Weiyan and Zhang, Yanzhe and Yang, Diyi",
        title     = "Distilling an End-to-End Voice Assistant from Speech Recognition Data",
        year      = "2024",
        publisher = "HuggingFace",
    }