Transformers.js
Transformers.js is a JavaScript implementation of the Python library transformers, developed by Hugging Face. Transformers.js uses ONNX Runtime to run pre-trained AI models in JavaScript. Long story short, the ONNX format lets developers exchange neural network models between frameworks, and ONNX Runtime executes those models across a wide range of hardware.
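The core of the library is the pipeline function, which bundles pre-processing, model inference, and post-processing into a single call. A minimal sketch (the task name is as documented; the default model for the task is fetched from the Hugging Face Hub and cached on first use):

import { pipeline } from "@huggingface/transformers";

// Create a text-classification pipeline; the task's default model is downloaded on first use
const classifier = await pipeline("text-classification");

// Run inference on a sentence
const output = await classifier("Transformers.js makes in-browser inference easy.");
console.log(output); // e.g. [{ label: 'POSITIVE', score: 0.99 }]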
Common tasks supported by Transformers.js:
- Natural Language Processing: text classification, named entity recognition, question answering, language modelling, summarization, translation, multiple choice, and text generation.
- Computer Vision: image classification, object detection, segmentation, and depth estimation.
- Audio: automatic speech recognition, audio classification, and text-to-speech.
- Multimodal: embeddings, zero-shot audio classification, zero-shot image classification, and zero-shot object detection.
Build a voice transcription app
The demo web app features a button to choose a WAV file, a status message showing the progress of the recognition process, and a box below to display the results.
Form the foundation with Vite
Create a Vue application with Vite, install the dependencies, and add the Transformers.js library:
npm create vite@latest speechtranscription -- --template vue
cd speechtranscription
npm install
npm install @huggingface/transformers
Download the pretrained ONNX model
The Xenova/whisper-tiny.en repository on Hugging Face provides a Transformers.js-compatible ONNX model for English speech recognition.
The following files are required:
- config.json
- generation_config.json
- onnx/decoder_model_merged_quantized.onnx
- onnx/encoder_model_quantized.onnx
- preprocessor_config.json
- tokenizer_config.json
- tokenizer.json
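The files need to end up under public/models/whisper-tiny.en (see the project structure below). One way to fetch them is with curl, a sketch assuming the files sit on the repository's main branch and using Hugging Face's standard resolve URLs:

mkdir -p public/models/whisper-tiny.en/onnx
curl -L -o public/models/whisper-tiny.en/config.json https://huggingface.co/Xenova/whisper-tiny.en/resolve/main/config.json
curl -L -o public/models/whisper-tiny.en/onnx/encoder_model_quantized.onnx https://huggingface.co/Xenova/whisper-tiny.en/resolve/main/onnx/encoder_model_quantized.onnx
# ...repeat for the remaining files in the list above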
Project structure
The coding section primarily focuses on the App.vue file; the rest can be ignored.
├── index.html
├── package.json
├── package-lock.json
├── public
│   ├── models
│   │   └── whisper-tiny.en
│   │       ├── config.json
│   │       ├── generation_config.json
│   │       ├── onnx
│   │       │   ├── decoder_model_merged_quantized.onnx
│   │       │   └── encoder_model_quantized.onnx
│   │       ├── preprocessor_config.json
│   │       ├── tokenizer_config.json
│   │       └── tokenizer.json
├── README.md
├── src
│   ├── App.vue
│   ├── main.js
│   └── style.css
└── vite.config.js
Initialise variables and the template structure in App.vue
Create the reactive state with ref(), including a template ref for the file input, and add an onMounted hook for initialising the pre-trained model.
import { ref, onMounted } from "vue";
import { pipeline, env, read_audio } from "@huggingface/transformers";
const status = ref('Loading model...');
const currentText = ref('');
const isLoading = ref(true);
const isTranscribing = ref(false);
const buttonText = ref('Select file');
const audioFile = ref(null);
const audioBlob = ref('');
onMounted(async () => {
try {
// Load up the pretrained model (filled in below)
} catch (err) {
status.value = 'Failed to load the model: ' + err.message;
}
});
The template includes a hidden file input, an audio preview element, a button that lets the user select an audio file, a status line, and an area that displays the transcription results.
<div class="contianer">
<h1>Voice Transcribe</h1>
<div class="controls">
<input type="file" ref="audioFile" accept="audio/wave" style="display: none" @change="handleFileSelect" />
<audio :src="audioBlob" controls v-show="audioBlob != ''"/>
<button :disabled="isLoading" @click="handleTranscribe" :class="{ transcribing: isTranscribing }">
{{ buttonText }}
</button>
</div>
<div class="status">{{ status }}</div>
<pre class="transcription">
{{ currentText }}
</pre>
</div>
Style the HTML elements:
.status,h1{text-align:center}
.container{max-width:800px;margin:0 auto;padding:20px;background-color:#fff;border-radius:10px;box-shadow:0 2px 4px rgba(0,0,0,.1)}h1{color:#333;margin-bottom:30px}
.controls{display:flex;justify-content:center;gap:20px;margin-bottom:20px}
button{padding:10px 20px;border:none;border-radius:5px;cursor:pointer;font-size:16px;transition:background-color .3s;background-color:#f44;color:#fff}
button:hover{background-color:#c00}
button:disabled{background-color:#ccc;cursor:not-allowed}
button.transcribing{background-color:#00c851}
button.transcribing:hover{background-color:#007e33}
.transcription{margin-top:20px;padding:15px;border:1px solid #ddd;border-radius:5px;min-height:100px;background-color:#f9f9f9;white-space:pre-wrap}
.status{margin:10px 0;color:#666}
Initialise the pre-trained model
Import Transformers.js, allow it to read the pre-trained model from local files, and stop it from downloading remote models:
import { pipeline, env } from "@huggingface/transformers";
let transcriber = null;
env.allowRemoteModels = false; // do not download remote models
env.allowLocalModels = true; // read the pretrained model from local files
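In the browser build, local models are resolved relative to env.localModelPath, which defaults to '/models/'; that is why the files placed under public/models/whisper-tiny.en are found when the pipeline below is created with the ID 'whisper-tiny.en'. If the model lives elsewhere, the path can be set explicitly, a sketch shown here with the default value:

// Optional: point Transformers.js at the folder Vite serves from public/
// ('/models/' is already the default in the browser build)
env.localModelPath = '/models/';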
Load the model in the onMounted hook, inside the try-catch clause. Among the given parameters, quantized specifies that a quantised model will be loaded. Quantisation reduces the computational cost and memory consumption of inference by lowering the numerical precision of the model's weights.
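As a rough sense of scale: whisper-tiny has about 39 million parameters, so storing each weight as an 8-bit integer instead of a 32-bit float shrinks the weights from roughly 150 MB to around 40 MB, at the cost of a small loss in accuracy.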
transcriber = await pipeline('automatic-speech-recognition', 'whisper-tiny.en', {
quantized: true, // Specify to use the quantised model
local_files_only: true, // Specify to load the local model
progress_callback: (message) => {
// Display verbose status messages while loading the model
if (message.status === 'done') {
currentText.value += message.file + " - loaded\n"
}
if (message.status === 'ready') {
currentText.value = ''
isLoading.value = false
status.value = 'Ready to transcribe'
}
}
});
Add the click event handler
When the “Select file” button is clicked, programmatically click the hidden audioFile input to open the file selection dialog.
const handleTranscribe = () => {
audioFile.value.click();
};
Now switch to the handleFileSelect change event handler.
const handleFileSelect = async (event) => {
// Clean up previous status and results
audioBlob.value = "";
status.value = "";
currentText.value = "";
const file = event.target.files[0];
if (!file) {
status.value = "Audio file is required.";
return;
}
// Change the status to 'processing'
isTranscribing.value = true;
buttonText.value = 'Processing...';
status.value = 'Processing...';
try {
// Process speech transcription
} catch (err) {
status.value = "Transcription failed: " + err.message;
currentText.value = "";
} finally {
// Reset status and button text
isTranscribing.value = false;
buttonText.value = "Select file";
audioFile.value.value = ""; // clear the file input so the same file can be selected again
}
};
Inside the try clause, read the file into an ArrayBuffer, wrap it in a Blob object, and create an object URL for it.
let audio = await file.arrayBuffer();
audio = new Blob([audio], {type: 'audio/wave'});
audio = URL.createObjectURL(audio);
// For preview purposes, load the audio element by assigning the blob URL to the audioBlob variable
audioBlob.value = audio;
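The object URL is handed straight to the pipeline in the next step. As an aside, the read_audio helper imported at the top of the script offers an alternative input path: it fetches a URL and decodes it into a Float32Array at a given sample rate (16 kHz is what Whisper expects). A sketch, not used in the final app:

// Alternative input: decode the WAV file to a Float32Array at 16 kHz;
// the resulting array can be passed to the transcriber in place of the object URL
const audioData = await read_audio(audio, 16000);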
Feed the audio to the transcriber pipeline, specifying that timestamps should be returned alongside the text.
const result = await transcriber(audio, {
return_timestamps: true,
});
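With return_timestamps: true, the result contains the full transcript plus a chunks array of timestamped segments. Roughly, it looks like this (the values are illustrative):

// {
//   text: " Hello world, this is a test.",
//   chunks: [
//     { timestamp: [0, 3.2], text: " Hello world," },
//     { timestamp: [3.2, 5.6], text: " this is a test." }
//   ]
// }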
Show the timestamps and the transcribed text, and update the status value to inform the user that the transcription is complete.
result.chunks.forEach(chunk => {
currentText.value += `[${chunk.timestamp.join("-")}s]: ${chunk.text}\n`;
});
status.value = "Transcription finished."
The final result
The audio sample used here is from Kaggle.
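To try it out, start the Vite dev server and open the printed local URL in a browser:

npm run dev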
References
Quantization. (n.d.). https://huggingface.co/docs/optimum/en/concept_guides/quantization#quantization
ONNX Runtime. (n.d.). https://onnxruntime.ai/docs/