Transformers.js
Transformers.js is a JavaScript implementation of transformers, the Python library developed by Hugging Face. It uses ONNX Runtime to run pre-trained AI models directly in JavaScript. Compared with PyTorch, TensorFlow, and similar frameworks, ONNX Runtime and its ONNX models generally perform better on edge and low-end devices. ONNX Runtime will be introduced in detail in another post; for now, we will focus on building a Transformers.js demo rather than diving deep into the background.
Common tasks supported by Transformers.js:
- Natural Language Processing: text classification, named entity recognition, question answering, language modelling, summarization, translation, multiple choice, and text generation.
- Computer Vision: image classification, object detection, segmentation, and depth estimation.
- Audio: automatic speech recognition, audio classification, and text-to-speech.
- Multimodal: embeddings, zero-shot audio classification, zero-shot image classification, and zero-shot object detection.
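All of these tasks are exposed through the same pipeline() API. As a quick illustration, here is a minimal sketch of a text-classification pipeline (the model name and input sentence are only examples; on first use the model would be fetched from the Hugging Face Hub):
import { pipeline } from "@huggingface/transformers";
// Create a task-specific pipeline (the model name is illustrative).
const classifier = await pipeline("sentiment-analysis", "Xenova/distilbert-base-uncased-finetuned-sst-2-english");
// Run inference on a plain string and log the predicted label and score.
const output = await classifier("Transformers.js makes in-browser inference easy!");
console.log(output); // e.g. [{ label: 'POSITIVE', score: 0.99 }]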
Build a voice transcription app
The demo web app features a button for choosing a WAV file, a status message showing the progress of the recognition process, and a box below it that displays the results.
Form the foundation with Vite
Create a Vue application with Vite, install the dependencies, and then install the Transformers.js library.
npm create vite@latest speechtranscription -- --template vue
cd speechtranscription
npm install
npm install @huggingface/transformers
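To preview the app while developing, the standard Vite dev server can be started at any point (this script is generated by the Vue template):
npm run dev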
Download the pretrained ONNX model
The Hugging Face repository Xenova/whisper-tiny.en provides a Transformers.js-compatible ONNX model for English speech recognition.
The following files are required (a download sketch follows the list):
- config.json
- generation_config.json
- onnx/decoder_model_merged_quantized.onnx
- onnx/encoder_model_quantized.onnx
- preprocessor_config.json
- tokenizer_config.json
- tokenizer.json
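Place these files under public/models/whisper-tiny.en/ so that Vite serves them as static assets. One possible way to fetch them is via the Hub's resolve URLs, sketched below (assuming curl is available; any other download method works just as well):
mkdir -p public/models/whisper-tiny.en/onnx
cd public/models/whisper-tiny.en
BASE=https://huggingface.co/Xenova/whisper-tiny.en/resolve/main
curl -L -O $BASE/config.json
curl -L -O $BASE/generation_config.json
curl -L -O $BASE/preprocessor_config.json
curl -L -O $BASE/tokenizer_config.json
curl -L -O $BASE/tokenizer.json
curl -L -o onnx/decoder_model_merged_quantized.onnx $BASE/onnx/decoder_model_merged_quantized.onnx
curl -L -o onnx/encoder_model_quantized.onnx $BASE/onnx/encoder_model_quantized.onnx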
Project structure
The coding section will primarily focus on the App.vue file; the rest can be ignored.
├── index.html
├── package.json
├── package-lock.json
├── public
│   └── models
│       └── whisper-tiny.en
│           ├── config.json
│           ├── generation_config.json
│           ├── onnx
│           │   ├── decoder_model_merged_quantized.onnx
│           │   └── encoder_model_quantized.onnx
│           ├── preprocessor_config.json
│           ├── tokenizer_config.json
│           └── tokenizer.json
├── README.md
├── src
│   ├── App.vue
│   ├── main.js
│   └── style.css
└── vite.config.js
Initialise variables and the template structure in App.vue
Create a set of Vue ref objects (reactive state plus a template ref for the file input) and add an onMounted hook for initialising the pre-trained model.
import { ref, onMounted } from "vue";
import { pipeline, env, read_audio } from "@huggingface/transformers";
const status = ref('Loading model...');
const currentText = ref('');
const isLoading = ref(true);
const isTranscribing = ref(false);
const buttonText = ref('Select file');
const audioFile = ref(null);
const audioBlob = ref('');
onMounted(async () => {
  try {
    // Load the pretrained model here (filled in below)
  } catch (err) {
    status.value = 'Failed to load the model: ' + err.message;
  }
});
The template includes a hidden file input, an audio preview element, a button that lets the user select an audio file, a status message, and an area that displays the transcription results.
<div class="contianer">
<h1>Voice Transcribe</h1>
<div class="controls">
<input type="file" ref="audioFile" accept="audio/wave" style="display: none" @change="handleFileSelect" />
<audio :src="audioBlob" controls v-show="audioBlob != ''"/>
<button :disabled="isLoading" @click="handleTranscribe" :class="{ transcribing: isTranscribing }">
{{ buttonText }}
</button>
</div>
<div class="status">{{ status }}</div>
<pre class="transcription">
{{ currentText }}
</pre>
</div>
Style the HTML elements with the following CSS.
.status,h1{text-align:center}
.container{max-width:800px;margin:0 auto;padding:20px;background-color:#fff;border-radius:10px;box-shadow:0 2px 4px rgba(0,0,0,.1)}h1{color:#333;margin-bottom:30px}
.controls{display:flex;justify-content:center;gap:20px;margin-bottom:20px}
button{padding:10px 20px;border:none;border-radius:5px;cursor:pointer;font-size:16px;transition:background-color .3s;background-color:#f44;color:#fff}
button:hover{background-color:#c00}
button:disabled{background-color:#ccc;cursor:not-allowed}
button.transcribing{background-color:#00c851}
button.transcribing:hover{background-color:#007e33}
.transcription{margin-top:20px;padding:15px;border:1px solid #ddd;border-radius:5px;min-height:100px;background-color:#f9f9f9;white-space:pre-wrap}
.status{margin:10px 0;color:#666}
Initialise the pre-trained model
Import Transformers.js and configure it to read the pre-trained model locally instead of downloading remote models:
import { pipeline, env } from "@huggingface/transformers";
let transcriber = null;
env.allowRemoteModels = false; // do not download remote models
env.allowLocalModels = true; // read the pretrained model from local files
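In the browser, Transformers.js resolves local model names against env.localModelPath, which defaults to /models/ and therefore matches the public/models directory served by Vite. Setting it explicitly is optional; a one-line sketch for documentation purposes:
// Optional: make the local model directory explicit ('/models/' is already the default in the browser).
env.localModelPath = '/models/';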
Load the model in the onMounted hook, inside the try-catch clause. Among the parameters given below, quantized specifies that a quantised model should be loaded. Quantisation reduces the computational cost and memory consumption of inference by lowering the numerical precision of the model's weights.
transcriber = await pipeline('automatic-speech-recognition', 'whisper-tiny.en', {
  quantized: true, // Specify to use the quantised model
  local_files_only: true, // Specify to load the local model
  progress_callback: (message) => {
    // Display verbose status messages while loading the model
    if (message.status === 'done') {
      currentText.value += message.file + " - loaded\n"
    }
    if (message.status === 'ready') {
      currentText.value = ''
      isLoading.value = false
      status.value = 'Ready to transcribe'
    }
  }
});
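For reference, the messages passed to progress_callback look roughly like the following (an illustrative sketch; only the 'done' and 'ready' statuses are relied on above):
// { status: 'progress', file: 'onnx/encoder_model_quantized.onnx', progress: 42.0 }
// { status: 'done', file: 'onnx/encoder_model_quantized.onnx' }
// { status: 'ready', task: 'automatic-speech-recognition', model: 'whisper-tiny.en' }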
Add the click event handler
When the “Select file” button is clicked, programmatically click the hidden audioFile input element to open the file selection dialog.
const handleTranscribe = () => {
  audioFile.value.click();
};
Now move on to the handleFileSelect change event handler.
const handleFileSelect = async (event) => {
  // Clean up previous status and results
  audioBlob.value = "";
  status.value = "";
  currentText.value = "";
  const file = event.target.files[0];
  if (!file) {
    status.value = "Audio file is required.";
    return;
  }
  // Change the status to 'processing'
  isTranscribing.value = true;
  buttonText.value = 'Processing...';
  status.value = 'Processing...';
  try {
    // Process speech transcription
  } catch (err) {
    status.value = "Transcription failed: " + err.message;
    currentText.value = "";
  } finally {
    // Reset status and button text
    isTranscribing.value = false;
    buttonText.value = "Select file";
    // Clear the file input so selecting the same file again re-triggers the change event
    audioFile.value.value = "";
  }
}
Inside the try block, convert the audio file into a Blob object and create an object URL for it.
let audio = await file.arrayBuffer();
audio = new Blob([audio], { type: 'audio/wave' });
audio = URL.createObjectURL(audio);
// For preview purposes, load the audio element by assigning the blob URL to the audioBlob variable
audioBlob.value = audio;
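Alternatively, since read_audio was imported earlier, the blob URL could first be decoded into raw samples at the 16 kHz sampling rate Whisper expects (a sketch; passing the URL directly, as done below, also works because the pipeline decodes URLs itself):
// Decode the blob URL into a Float32Array resampled to 16 kHz.
const samples = await read_audio(audio, 16000);
// `samples` could then be passed to the transcriber instead of the URL.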
Feed the audio to the transcriber pipeline, specifying that timestamps should be returned alongside the text.
const result = await transcriber(audio, {
  return_timestamps: true,
});
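For reference, the result object has roughly the following shape (values are made up; each chunk's timestamp is a [start, end] pair in seconds):
// {
//   text: " Hello world, this is a test.",
//   chunks: [
//     { timestamp: [0, 2.5], text: " Hello world," },
//     { timestamp: [2.5, 4.8], text: " this is a test." }
//   ]
// }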
Show the timestamps and the transcribed text, and update the status value to inform the user that the transcription is complete.
result.chunks.forEach(chunk => {
  currentText.value += `[${chunk.timestamp.join("-")}s]: ${chunk.text}\n`;
});
status.value = "Transcription finished.";
The final result
The audio sample used here is from Kaggle.
References
Quantization. (n.d.). Hugging Face Optimum documentation. https://huggingface.co/docs/optimum/en/concept_guides/quantization#quantization
ONNX Runtime. (n.d.). ONNX Runtime documentation. https://onnxruntime.ai/docs/