Build a speech transcription demo with OpenAI’s Whisper model in-browser using Transformers.js

Transformers.js

Transformers.js is a JavaScript implementation of the Python transformers library developed by Hugging Face. It uses ONNX Runtime to run pre-trained AI models directly in JavaScript. ONNX Runtime generally performs better on edge and low-end devices with ONNX models than frameworks such as PyTorch or TensorFlow. ONNX Runtime will be introduced in detail in another post; for now, we will focus on implementing a Transformers.js demo rather than diving deep into the background.
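As a quick illustration of the API (a minimal sketch, separate from the demo built below; the sentiment-analysis model named here is only an example), a pipeline is created for a task and then called like a function:

import { pipeline } from "@huggingface/transformers";

// Create a pipeline for a task; by default the model is fetched from the Hugging Face Hub and cached
const classifier = await pipeline("sentiment-analysis", "Xenova/distilbert-base-uncased-finetuned-sst-2-english");

// Run inference in the browser via ONNX Runtime
const output = await classifier("Transformers.js makes in-browser inference easy.");
console.log(output); // e.g. [{ label: "POSITIVE", score: 0.99 }]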

Common tasks supported by Transformers.js:

  • Natural Language Processing: text classification, named entity recognition, question answering, language modelling, summarization, translation, multiple choice, and text generation.
  • Computer Vision: image classification, object detection, segmentation, and depth estimation.
  • Audio: automatic speech recognition, audio classification, and text-to-speech.
  • Multimodal: embeddings, zero-shot audio classification, zero-shot image classification, and zero-shot object detection.

Build a voice transcription app

The demo web app features a button for choosing a WAV file, a status message showing the progress of the recognition process, and a box below it that displays the results.

Form the foundation with Vite

Create a Vue application with Vite, install the dependencies, and add the Transformers.js library:

npm create vite@latest speechtranscription -- --template vue
cd speechtranscription
npm install
npm install @huggingface/transformers
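The generated vite.config.js from the Vue template needs no changes for this demo; for reference, it should look roughly like this (the default output of the template):

import { defineConfig } from "vite";
import vue from "@vitejs/plugin-vue";

// Register the Vue plugin so .vue single-file components are compiled
export default defineConfig({
  plugins: [vue()],
});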

Download the pretrained ONNX model

Hugging Face's Xenova/whisper-tiny.en repository provides a Transformers.js-compatible ONNX model for English speech recognition.

The following files are required (a sketch of how to fetch them follows this list):

  • config.json
  • generation_config.json
  • onnx/decoder_model_merged_quantized.onnx
  • onnx/encoder_model_quantized.onnx
  • preprocessor_config.json
  • tokenizer_config.json
  • tokenizer.json
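One way to fetch them (a sketch, assuming curl is available; copying the files by any other means works just as well) is to download each one from the repository's resolve/main URL into public/models/whisper-tiny.en:

mkdir -p public/models/whisper-tiny.en/onnx
curl -L -o public/models/whisper-tiny.en/config.json https://huggingface.co/Xenova/whisper-tiny.en/resolve/main/config.json
curl -L -o public/models/whisper-tiny.en/onnx/encoder_model_quantized.onnx https://huggingface.co/Xenova/whisper-tiny.en/resolve/main/onnx/encoder_model_quantized.onnx
# ...and likewise for the remaining files in the list above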

Project structure

The coding walkthrough focuses almost entirely on the App.vue file; the rest of the generated project can be left untouched.

├── index.html
├── package.json
├── package-lock.json
├── public
│   └── models
│   │   └── whisper-tiny.en
│   │       ├── config.json
│   │       ├── generation_config.json
│   │       ├── onnx
│   │       │   ├── decoder_model_merged_quantized.onnx
│   │       │   └── encoder_model_quantized.onnx
│   │       ├── preprocessor_config.json
│   │       ├── tokenizer_config.json
│   │       └── tokenizer.json
├── README.md
├── src
│   ├── App.vue
│   ├── main.js
│   └── style.css
└── vite.config.js

Initialise variables and the template structure in App.vue

Create the reactive state refs and the template refs, and add an onMounted hook that will initialise the pre-trained model.

import { ref, onMounted } from "vue";

import { pipeline, env, read_audio } from "@huggingface/transformers";

const status = ref('Loading model...');
const currentText = ref('');
const isLoading = ref(true);
const isTranscribing = ref(false);
const buttonText = ref('Select file');

const audioFile = ref(null);
const audioBlob = ref('');

onMounted(async () => {
    try {
        // Load the pretrained model
    } catch (err) {
        status.value = 'Failed to load the model: ' + err.message;
    }
});

The template includes a hidden file input, an audio preview element, a button that lets the user select an audio file, a status line, and an area that displays the transcription results.

  <div class="contianer">
    <h1>Voice Transcribe</h1>
    <div class="controls">
      <input type="file" ref="audioFile" accept="audio/wave" style="display: none" @change="handleFileSelect" />
      <audio :src="audioBlob" controls v-show="audioBlob != ''"/>
      <button :disabled="isLoading" @click="handleTranscribe" :class="{ transcribing: isTranscribing }">
        {{ buttonText }}
      </button>
    </div>
    <div class="status">{{ status }}</div>
    <pre class="transcription">
    {{ currentText }}
  </pre>
  </div>

Add some CSS to style the elements.

.status,h1{text-align:center}
.container{max-width:800px;margin:0 auto;padding:20px;background-color:#fff;border-radius:10px;box-shadow:0 2px 4px rgba(0,0,0,.1)}h1{color:#333;margin-bottom:30px}
.controls{display:flex;justify-content:center;gap:20px;margin-bottom:20px}
button{padding:10px 20px;border:none;border-radius:5px;cursor:pointer;font-size:16px;transition:background-color .3s;background-color:#f44;color:#fff}
button:hover{background-color:#c00}
button:disabled{background-color:#ccc;cursor:not-allowed}
button.transcribing{background-color:#00c851}
button.transcribing:hover{background-color:#007e33}
.transcription{margin-top:20px;padding:15px;border:1px solid #ddd;border-radius:5px;min-height:100px;background-color:#f9f9f9;white-space:pre-wrap}
.status{margin:10px 0;color:#666}

Initialise the pre-trained model

The pipeline and env utilities are already imported at the top of the script. Configure the environment so that the pre-trained model is read from local files and remote models are never downloaded:

let transcriber = null;
env.allowRemoteModels = false; // do not download remote models
env.allowLocalModels = true;   // read the pretrained model from local files

Load the model in the onMounted hook, inside the try-catch block. Among the options passed, quantized specifies that the quantised model should be loaded. Quantisation reduces the computation and memory cost of inference by lowering the numerical precision of the model weights; for example, whisper-tiny's roughly 39 million parameters take about 150 MB at 32-bit floating point but only about 39 MB at 8-bit.

        transcriber = await pipeline('automatic-speech-recognition', 'whisper-tiny.en', {
            quantized: true, // Specify to use the quantised model
            local_files_only: true, // Specify to load the local model
            progress_callback: (message) => {
                // Display verbose status messages while loading the model
                if (message.status === 'done') {
                    currentText.value += message.file + " - loaded\n"
                }
                if (message.status === 'ready') {
                    currentText.value = ''
                    isLoading.value = false
                    status.value = 'Ready to transcribe'
                }
            }
        });
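For reference, the progress_callback receives plain objects; their shapes are roughly as follows (field names are illustrative and may differ slightly between library versions):

// While a model file is being loaded:   { status: "progress", file: "onnx/encoder_model_quantized.onnx", progress: 42.5 }
// When a model file has finished:       { status: "done", file: "onnx/encoder_model_quantized.onnx" }
// When the whole pipeline is ready:     { status: "ready", task: "automatic-speech-recognition", model: "whisper-tiny.en" }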

Add the click event handler

When the “Select file” button is clicked, programmatically click the hidden file input (the audioFile template ref) to open the file selection dialog.

const handleTranscribe = () => {
    audioFile.value.click();
};

Now switch to the handleFileSelect change event handler.

const handleFileSelect = async (event) => {
    // Clean up previous status and results
    audioBlob.value = "";
    status.value = "";
    currentText.value = "";
    const file = event.target.files[0];
    if (!file) {
        status.value = "Audio file is required.";
        return;
    }
    // Change the status to 'processing'
    isTranscribing.value = true;
    buttonText.value = 'Processing...';
    status.value = 'Processing...';

    try {
        // Process speech transcription
    } catch (err) {
        status.value = "Transcription failed: " + err.message;
        currentText.value = "";
    } finally {
        // Reset status and button text
        isTranscribing.value = false;
        buttonText.value = "Select file";
        audioFile.value.value = ""; // Reset the file input so the same file can be selected again
    }
}

In the try block, convert the audio file into a Blob object and create an object URL for it.

    let audio = await file.arrayBuffer();
    // Wrap the raw bytes in a Blob and create an object URL for it
    audio = new Blob([audio], { type: "audio/wav" });
    audio = URL.createObjectURL(audio);
    // For preview purposes, load the audio element by assigning the blob URL to the audioBlob variable
    audioBlob.value = audio;
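As an aside, the read_audio helper imported earlier offers an alternative: it decodes audio into a Float32Array resampled to a given rate, and the pipeline accepts raw samples as well as a URL. A minimal sketch (16000 Hz is the sampling rate Whisper expects):

// Optional alternative: decode the blob URL into raw 16 kHz samples
const samples = await read_audio(audio, 16000);
// The samples array could then be passed to the transcriber instead of the URL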

Feed the audio to the transcriber pipeline, asking it to return timestamps alongside the text.

    const result = await transcriber(audio, {
        return_timestamps: true,
    });
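With return_timestamps enabled, the result contains the full transcription plus a chunks array; an illustrative (not verbatim) shape for a short two-sentence clip:

// Example shape of the result object (values are illustrative)
// {
//   text: " Hello world. This is a test.",
//   chunks: [
//     { timestamp: [0.0, 2.5], text: " Hello world." },
//     { timestamp: [2.5, 4.1], text: " This is a test." }
//   ]
// }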

Display the timestamps and the transcribed text, and update the status value to tell the user that the transcription is complete.

    result.chunks.forEach(chunk => {
        currentText.value += `[${chunk.timestamp.join("-")}s]: ${chunk.text}\n`; 
    });

    status.value = "Transcription finished."

The final result

The audio sample used here is from Kaggle.

[Screenshot: the finished voice transcription demo]

References

Quantization. (n.d.). https://huggingface.co/docs/optimum/en/concept_guides/quantization#quantization
ONNX Runtime. (n.d.). Onnxruntime. https://onnxruntime.ai/docs/
