Speech translation

My parents don’t speak English… They were born in the Soviet Union (in what is now Lithuania), where the only English speakers were probably spies. I don’t think my parents were undercover agents.

I’ve been grinding through a Master’s degree in Denmark for the past few years, and the big day for my thesis defense is almost here. I really want my folks there, even though they wouldn’t understand a word of English.

I figured, how tough could it be to translate my presentation on the fly, so everyone catches the drift of my spiel? I don’t have a juicy GPU lurking under my desk, nor do I have a massive collection of English-to-Lithuanian speech data to train a model. But hey, transcribing and translating are old news, so I thought, why not try combining them for kicks?

I checked out whisper.cpp, and in just a few clicks my broken English was getting transcribed into text in real time. They even have an example program that transcribes straight from your mic. I cranked up the threads to utilize all 16 cores of my machine (to make it sweat a bit), and it worked like a charm.

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
bash ./models/download-ggml-model.sh base.en
make
make stream
./stream -m ./models/ggml-base.en.bin -t 16 --step 0 --length 10000 -vth 0.8

Start speaking…

### Transcription 1 START | t0 = 0 ms | t1 = 4077 ms
... Hello world.
### Transcription 1 END

Voilà!
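Hard-coding -t 16 only fits a 16-core machine. Here is a minimal Python sketch that builds the same command with the thread count taken from whatever machine it runs on. Note that build_stream_command is a hypothetical helper of mine, not part of whisper.cpp, and it only reuses the flags from the command above.

```python
import os


def build_stream_command(model="./models/ggml-base.en.bin", threads=None,
                         length_ms=10000, vad_threshold=0.8):
    """Build the argument list for whisper.cpp's ./stream example.

    Hypothetical helper: it only reuses the flags from the command above.
    """
    if threads is None:
        # os.cpu_count() can return None, so fall back to a single thread.
        threads = os.cpu_count() or 1
    return [
        "./stream",
        "-m", model,
        "-t", str(threads),
        "--step", "0",
        "--length", str(length_ms),
        "-vth", str(vad_threshold),
    ]


if __name__ == "__main__":
    # Print the command so it can be pasted into a shell.
    print(" ".join(build_stream_command()))
```

The returned list can be passed directly to subprocess.Popen, which avoids any shell quoting issues.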

Let’s hook this output up to the Google conglomerate’s translation API. Here’s a nifty little function I whipped up to flip English text into Lithuanian…

from google.cloud import translate

client = translate.TranslationServiceClient()
project_id = "..."
location = "global"
parent = f"projects/{project_id}/locations/{location}"

def translate_text(text="Hello, world!"):
    response = client.translate_text(
        request={
            "parent": parent,
            "contents": [text],
            "mime_type": "text/plain",
            "source_language_code": "en-US",
            "target_language_code": "lt",
        }
    )
    return response.translations[0].translated_text

The only thing left is to glue the two together… Start the whisper.cpp stream app as a subprocess, and route its output, with some Python string magic, to the translation function above. Here’s a snippet of the final code:

import subprocess
from pynput import keyboard
import os
from translate import translate_text

terminate = False

def on_press(key):
    global terminate
    try:
        if key.char == 'q':  # Check if 'q' is pressed
            terminate = True
    except AttributeError:
        pass

# Listen for key presses in a background thread
listener = keyboard.Listener(on_press=on_press)
listener.start()

command = ["./stream", 
            "-m", "./models/ggml-base.en.bin", 
            "-t", "16", 
            "--step", "0", 
            "--length", "5000", 
            "-vth", "0.8"]
whisper_dir = os.path.expanduser("~/code/whisper.cpp")
process = subprocess.Popen(
    command,
    cwd=whisper_dir,
    stdout=subprocess.PIPE,
    # stderr is never read here; an unread PIPE can fill up and stall ./stream
    stderr=subprocess.DEVNULL,
    text=True)

try:
    while not terminate:
        output = process.stdout.readline()

        if output == '' and process.poll() is not None:
            break
        if output:
            if '[' in output and ']' in output:
                text = " ".join(output.split("]")[1:]).strip()
                if "BLANK_AUDIO" in text:
                    continue
                if text == "":
                    continue
                translation = translate_text(text)
                print(translation)
finally:
    process.kill()
    listener.stop()
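The string magic in the loop can also be pulled out into a small function that is easy to test without a microphone. This is only a sketch: extract_transcript is a hypothetical name, and it assumes each transcribed line starts with a bracketed timestamp range such as "[00:00:00.000 --> 00:00:04.000]", which is my reading of the stream output and not something guaranteed here. Unlike the loop above, it strips any bracketed marker (such as [BLANK_AUDIO]) instead of skipping the whole line.

```python
import re

# Assumed line shape: "[00:00:00.000 --> 00:00:04.000]   Hello world."
TIMESTAMP_PREFIX = re.compile(r"^\s*\[[^\]]*-->[^\]]*\]\s*")
# Bracketed non-speech markers such as [BLANK_AUDIO]
BRACKET_TAG = re.compile(r"\[[^\]]*\]")


def extract_transcript(line):
    """Return the spoken text from one line of output, or None to skip it."""
    match = TIMESTAMP_PREFIX.match(line)
    if match is None:
        # Banners, "### Transcription ..." markers, and blank lines land here.
        return None
    # Drop the timestamp prefix, then remove any remaining bracketed markers.
    text = BRACKET_TAG.sub("", line[match.end():]).strip()
    return text or None


if __name__ == "__main__":
    print(extract_transcript("[00:00:00.000 --> 00:00:04.000]   Hello world.\n"))
```

In the loop, the body then shrinks to calling extract_transcript(output) and translating the result whenever it is not None.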

Let’s test it! … (Me speaking into the mic) …

ernie@andromeda:~/code/translate$ python main.py 
Labas pasauli.

You guessed it! “Hello world” in Lithuanian. I hope my parents are going to love it! : )

The scripts can be found on GitHub: https://github.com/simutisernestas/polyglot