Speech translation
My parents don’t speak English… They were born in the Soviet Union (in what is now Lithuania), where the only English speakers were probably spies. I don’t think my parents were undercover agents.
I’ve been grinding through a Master’s degree in Denmark for the past few years, and the big day for my thesis defense is almost here. I really want my folks there, even though they wouldn’t understand a word of English.
I figured, how tough could it be to translate my presentation on the fly, so everyone catches the drift of my spiel? I don’t have a juicy GPU lurking under my desk, nor do I have a massive collection of English-to-Lithuanian speech data to train a model. But hey, transcribing and translating are old news, so I thought, why not try combining them for kicks?
I checked out whisper.cpp, and in just a few clicks my broken English was getting transcribed into text in real time. They even have an example program that transcribes straight from your mic. I cranked up the threads to utilize all 16 cores of my machine (to make it sweat a bit), and it worked like a charm.
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
bash ./models/download-ggml-model.sh base.en
make
make stream
./stream -m ./models/ggml-base.en.bin -t 16 --step 0 --length 10000 -vth 0.8
Start speaking…
### Transcription 1 START | t0 = 0 ms | t1 = 4077 ms
... Hello world.
### Transcription 1 END
Voilà!
Let’s hook this output up to the Google conglomerate’s Translation API. Here’s a nifty little function I whipped up to flip English text into Lithuanian…
from google.cloud import translate

client = translate.TranslationServiceClient()
project_id = "..."
location = "global"
parent = f"projects/{project_id}/locations/{location}"


def translate_text(text="Hello, world!"):
    response = client.translate_text(
        request={
            "parent": parent,
            "contents": [text],
            "mime_type": "text/plain",
            "source_language_code": "en-US",
            "target_language_code": "lt",
        }
    )
    return response.translations[0].translated_text
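Since `contents` is a list, the same kind of request can carry several sentences at once. Here’s a rough sketch of that idea. Note that `translate_many` is a hypothetical helper I’m making up for illustration, not something from the scripts, and it takes the client and parent as arguments so it can be tried with a stand-in client before touching the real API. I’m also assuming the translations come back in the same order as the inputs.

```python
def translate_many(client, parent, texts, target="lt"):
    """Translate several English strings in one request (hypothetical helper)."""
    texts = list(texts)
    if not texts:
        # Nothing to translate, so skip the API call entirely.
        return []
    response = client.translate_text(
        request={
            "parent": parent,
            "contents": texts,
            "mime_type": "text/plain",
            "source_language_code": "en-US",
            "target_language_code": target,
        }
    )
    # Assumption: one translation per input string, in the same order.
    return [t.translated_text for t in response.translations]
```

With the real client from above, the call would look like `translate_many(client, parent, ["Hello world.", "Thank you for coming."])`.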
The only thing left is to glue the two together… Start a whisper stream app process, and route the outputs with some Python string magic to the aforementioned translation function. Here’s a snippet of the final code:
import os
import subprocess

from pynput import keyboard

from translate import translate_text

terminate = False


def on_press(key):
    global terminate
    try:
        if key.char == 'q':  # Check if 'q' is pressed
            terminate = True
    except AttributeError:
        # Special keys (shift, ctrl, ...) have no .char attribute
        pass


listener = keyboard.Listener(on_press=on_press)
listener.start()

command = ["./stream",
           "-m", "./models/ggml-base.en.bin",
           "-t", "16",
           "--step", "0",
           "--length", "5000",
           "-vth", "0.8"]
whisper_dir = os.path.expanduser("~/code/whisper.cpp")
process = subprocess.Popen(
    command,
    cwd=whisper_dir,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,  # nothing reads stderr, so don't let a full pipe stall the stream
    text=True)

try:
    while not terminate:
        output = process.stdout.readline()
        if output == '' and process.poll() is not None:
            break  # the stream process has exited
        if output:
            # Only handle lines that carry a [...] prefix; take the text after it
            if '[' in output and ']' in output:
                text = " ".join(output.split("]")[1:]).strip()
                if "BLANK_AUDIO" in text:
                    continue
                if text == "":
                    continue
                translation = translate_text(text)
                print(translation)
finally:
    process.kill()
    listener.stop()
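The string magic in the loop above can also be pulled out into a small function, which makes it easy to poke at without a microphone. This is only a sketch: `extract_text` is a name I made up, and the sample lines in the usage note assume the stream output puts a `[start --> end]` timestamp in front of each transcribed line. Unlike the loop, it simply keeps everything after the first closing bracket.

```python
def extract_text(line):
    """Return the spoken text from one line of stream output, or None to skip it."""
    # Only lines with a bracketed prefix are treated as transcriptions.
    if "[" not in line or "]" not in line:
        return None
    # Keep everything after the first closing bracket.
    text = line.split("]", 1)[1].strip()
    # Drop empty lines and the silence marker.
    if not text or "BLANK_AUDIO" in text:
        return None
    return text
```

For example, `extract_text("[00:00:00.000 --> 00:00:04.000]   Hello world.")` gives back the sentence without the timestamp, while a `[BLANK_AUDIO]` line or a `### Transcription 1 END` line gives `None`.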
Let’s test it! … (Me speaking into the mic) …
ernie@andromeda:~/code/translate$ python main.py
Labas pasauli.
You guessed it! “Hello world” in Lithuanian. I hope my parents are gonna love it! : )
The scripts can be found on GitHub: https://github.com/simutisernestas/polyglot.