Build a speech transcription demo with OpenAI’s Whisper model in-browser using Transformers.js

Transformers.js

Transformers.js is a JavaScript implementation of the Python library transformers, developed by Hugging Face. Transformers.js uses ONNX Runtime to run pre-trained AI models in JavaScript. Long story short, ONNX Runtime allows developers to share neural network models interchangeably among frameworks and supports a wide range of hardware.

Common tasks supported by Transformers.js:

  • Natural Language Processing: text classification, named entity recognition, question answering, language modelling, summarization, translation, multiple choice, and text generation.
  • Computer Vision: image classification, object detection, segmentation, and depth estimation.
  • Audio: automatic speech recognition, audio classification, and text-to-speech.
  • Multimodal: embeddings, zero-shot audio classification, zero-shot image classification, and zero-shot object detection.
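
As a quick taste of the API, any of these tasks can be run through the pipeline helper in a few lines. A minimal sketch (the task's default model is downloaded from the Hugging Face Hub on first use):

import { pipeline } from "@huggingface/transformers";

// Create a sentiment-analysis pipeline; the default checkpoint for the task is fetched automatically
const classifier = await pipeline("sentiment-analysis");

// Run inference entirely in JavaScript via ONNX Runtime
const result = await classifier("Transformers.js runs models right in the browser!");
console.log(result); // e.g. [{ label: "POSITIVE", score: 0.99 }]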

Build a voice transcription app

The demo web app features a button to choose a WAV file, a status message showing the progress of the recognition process, and a box below it to display the results.

Form the foundation with Vite

Create a Vue application with Vite, install the dependencies, and install the Transformers.js library.

npm create vite@latest speechtranscription -- --template vue
cd speechtranscription
npm install
npm install @huggingface/transformers
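
To confirm the scaffold works, start the Vite dev server and open the printed local URL:

npm run dev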

Download the pretrained ONNX model

The Hugging Face repository Xenova/whisper-tiny.en provides a Transformers.js-compatible ONNX model for English speech recognition.

The following files are required:

  • config.json
  • generation_config.json
  • onnx/decoder_model_merged_quantized.onnx
  • onnx/encoder_model_quantized.onnx
  • preprocessor_config.json
  • tokenizer_config.json
  • tokenizer.json
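
These can be downloaded from the model's Hugging Face page and placed under public/models/whisper-tiny.en/, keeping the onnx/ subfolder. For example, with curl (the URLs follow the Hub's resolve/main pattern; verify them against the repository's Files tab):

mkdir -p public/models/whisper-tiny.en/onnx
cd public/models/whisper-tiny.en
curl -L -O https://huggingface.co/Xenova/whisper-tiny.en/resolve/main/config.json
curl -L -O https://huggingface.co/Xenova/whisper-tiny.en/resolve/main/generation_config.json
curl -L -O https://huggingface.co/Xenova/whisper-tiny.en/resolve/main/preprocessor_config.json
curl -L -O https://huggingface.co/Xenova/whisper-tiny.en/resolve/main/tokenizer_config.json
curl -L -O https://huggingface.co/Xenova/whisper-tiny.en/resolve/main/tokenizer.json
curl -L -o onnx/encoder_model_quantized.onnx https://huggingface.co/Xenova/whisper-tiny.en/resolve/main/onnx/encoder_model_quantized.onnx
curl -L -o onnx/decoder_model_merged_quantized.onnx https://huggingface.co/Xenova/whisper-tiny.en/resolve/main/onnx/decoder_model_merged_quantized.onnx
cd ../../..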

Project structure

The coding section will primarily focus on the App.vue file; the rest of the generated files can be left as they are.

├── index.html
├── package.json
├── package-lock.json
├── public
│   ├── models
│   │   └── whisper-tiny.en
│   │       ├── config.json
│   │       ├── generation_config.json
│   │       ├── onnx
│   │       │   ├── decoder_model_merged_quantized.onnx
│   │       │   └── encoder_model_quantized.onnx
│   │       ├── preprocessor_config.json
│   │       ├── tokenizer_config.json
│   │       └── tokenizer.json
├── README.md
├── src
│   ├── App.vue
│   ├── main.js
│   └── style.css
└── vite.config.js
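
For reference, main.js is left exactly as Vite generated it; it only mounts App.vue:

import { createApp } from 'vue'
import './style.css'
import App from './App.vue'

createApp(App).mount('#app')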

Initialise variables and the template structure in App.vue
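
All of the snippets in this and the following sections live in App.vue, which keeps the usual three single-file-component blocks. A structural sketch (it assumes the <script setup> syntax used by the code below; the styles could equally live in src/style.css):

<script setup>
// imports, refs, the onMounted hook, and the event handlers shown below
</script>

<template>
  <!-- the markup shown below -->
</template>

<style>
/* the styles shown below */
</style>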

Create the component's reactive refs (including a template ref for the hidden file input) and add an onMounted hook for initialising the pre-trained model.

import { ref, onMounted } from "vue";

import { pipeline, env } from "@huggingface/transformers";

const status = ref('Loading model...');
const currentText = ref('');
const isLoading = ref(true);
const isTranscribing = ref(false);
const buttonText = ref('Select file');

const audioFile = ref(null);
const audioBlob = ref('');

onMounted(async () => {
    try {
        // Load up the pretrained model
    } catch (err) {
        status.value = 'Failed to load the model: ' + err.message;
    }
});

The template includes a hidden file input, an audio preview element, a button that lets users select an audio file, a status message, and an area that displays the transcription results.

  <div class="contianer">
    <h1>Voice Transcribe</h1>
    <div class="controls">
      <input type="file" ref="audioFile" accept="audio/wave" style="display: none" @change="handleFileSelect" />
      <audio :src="audioBlob" controls v-show="audioBlob != ''"/>
      <button :disabled="isLoading" @click="handleTranscribe" :class="{ transcribing: isTranscribing }">
        {{ buttonText }}
      </button>
    </div>
    <div class="status">{{ status }}</div>
    <pre class="transcription">
    {{ currentText }}
  </pre>
  </div>

Style the HTML elements.

.status,h1{text-align:center}
.container{max-width:800px;margin:0 auto;padding:20px;background-color:#fff;border-radius:10px;box-shadow:0 2px 4px rgba(0,0,0,.1)}h1{color:#333;margin-bottom:30px}
.controls{display:flex;justify-content:center;gap:20px;margin-bottom:20px}
button{padding:10px 20px;border:none;border-radius:5px;cursor:pointer;font-size:16px;transition:background-color .3s;background-color:#f44;color:#fff}
button:hover{background-color:#c00}
button:disabled{background-color:#ccc;cursor:not-allowed}
button.transcribing{background-color:#00c851}
button.transcribing:hover{background-color:#007e33}
.transcription{margin-top:20px;padding:15px;border:1px solid #ddd;border-radius:5px;min-height:100px;background-color:#f9f9f9;white-space:pre-wrap}
.status{margin:10px 0;color:#666}

Initialise the pre-trained model

Import Transformers.js, allow it to read the pre-trained model locally, and stop it from downloading remote models:

import { pipeline, env } from "@huggingface/transformers";

let transcriber = null;
env.allowRemoteModels = false; // do not download remote models
env.allowLocalModels = true; // read the pretrained model from local files
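
With allowLocalModels enabled, the library resolves model names against env.localModelPath, which defaults to /models/ in the browser; that is why the files sit under public/models/whisper-tiny.en. If you keep them somewhere else, point the library at that folder (a sketch with a hypothetical path):

// Only needed when the model files are NOT served from /models/
env.localModelPath = "/my-models/"; // hypothetical folder under public/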

Load the model in the onMounted function, inside the try-catch clause. Among the given parameters, quantized specifies that a quantised model will be loaded. Quantisation reduces the computational cost and memory consumption of inference by lowering the numerical precision of the model's weights.

        transcriber = await pipeline('automatic-speech-recognition', 'whisper-tiny.en', {
            quantized: true, // Specify to use the quantised model
            local_files_only: true, // Specify to load the local model
            progress_callback: (message) => {
                // Display verbose status messages while loading the model
                if (message.status === 'done') {
                    currentText.value += message.file + " - loaded\n"
                }
                if (message.status === 'ready') {
                    currentText.value = ''
                    isLoading.value = false
                    status.value = 'Ready to transcribe'
                }
            }
        });
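
Depending on the installed library version, the quantized flag may be superseded by a dtype option (newer @huggingface/transformers releases select the precision this way). An equivalent call would look roughly like the following sketch; check the documentation of the version you have installed:

        transcriber = await pipeline('automatic-speech-recognition', 'whisper-tiny.en', {
            dtype: 'q8',            // request 8-bit quantised weights
            local_files_only: true,
        });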

Add the click event handler

When the “Select file” button is clicked, programmatically click the hidden audioFile input to open the file selection dialog.

const handleTranscribe = () => {
    audioFile.value.click();
};

Now switch to the handleFileSelect change event handler.

const handleFileSelect = async (event) => {
    // Clean up previous status and results
    audioBlob.value = "";
    status.value = "";
    currentText.value = "";
    const file = event.target.files[0];
    if (!file) {
        status.value = "Audio file is required.";
        return;
    }
    // Change the status to 'processing'
    isTranscribing.value = true;
    buttonText.value = 'Processing...';
    status.value = 'Processing...';

    try {
        // Process speech transcription
    } catch (err) {
        status.value = "Transcription failed: " + err.message;
        currentText.value = "";
    } finally {
        // Reset status and button text
        isTranscribing.value = false;
        buttonText.value = "Select file";
        audioFile.value.value = ""; // reset the file input so the same file can be selected again
    }
}

Convert the audio file into a Blob object and create an object URL for it in the try-catch clause.

    let audio = await file.arrayBuffer();
    audio = new Blob([audio], { type: 'audio/wav' });
    audio = URL.createObjectURL(audio);
    // For preview purposes, load the audio element by assigning the blob URL to the audioBlob variable
    audioBlob.value = audio;
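
One optional refinement: object URLs are not released automatically, so the previous preview URL can be revoked at the very top of handleFileSelect, before the existing cleanup lines overwrite it (a small sketch):

    // Optional: release the previous preview URL, if any, before it is overwritten
    if (audioBlob.value) {
        URL.revokeObjectURL(audioBlob.value);
    }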

Feed the audio to the transcriber pipeline, specifying that timestamps should be returned alongside the text.

    const result = await transcriber(audio, {
        return_timestamps: true,
    });
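
The pipeline resolves to an object holding the full text and, because return_timestamps is true, a chunks array of timestamped segments. Logging it shows roughly the following shape (illustrative values only):

    console.log(result);
    // e.g. { text: " Hello world.", chunks: [{ timestamp: [0, 3.2], text: " Hello world." }] }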

Display the timestamps and the transcribed text, and update the status value to inform the user that the transcription is complete.

    result.chunks.forEach(chunk => {
        currentText.value += `[${chunk.timestamp.join("-")}s]: ${chunk.text}\n`; 
    });

    status.value = "Transcription finished."

The final result

The audio sample used here is from Kaggle.

[Screenshot: the finished demo showing the transcription result for the sample audio]

References

Quantization. (n.d.). https://huggingface.co/docs/optimum/en/concept_guides/quantization#quantization
ONNX Runtime. (n.d.). Onnxruntime. https://onnxruntime.ai/docs/
