Want a tiny voice assistant that actually follows your lead, runs on your own hardware, and won’t accidentally order twelve pineapples because it misheard you? A DIY AI Assistant with Raspberry Pi is surprisingly achievable, fun, and flexible. You’ll wire up a wake word, speech recognition (ASR = automatic speech recognition), a brain for natural language (rules or an LLM), and text-to-speech (TTS). Add a few scripts, one or two services, and some careful audio tweaks, and you’ve got a pocketable smart speaker that obeys your rules.
Let’s get you from zero to talking-to-your-Pi without the usual hair-pulling. We’ll cover parts, setup, code, comparisons, gotchas... the whole burrito. 🌯
Articles you may like to read after this one:
🔗 How to study AI effectively
Create a study roadmap, practice projects, and track progress.
🔗 How to start an AI company
Validate problem, build MVP, assemble team, secure initial customers.
🔗 How to use AI to be more productive
Automate routine tasks, streamline workflows, and augment creative output.
🔗 How to incorporate AI into your business
Identify high-impact processes, implement pilots, measure ROI, scale.
What Makes a Good DIY AI Assistant with Raspberry Pi ✅
- Private by default – keep audio local where possible. You decide what leaves the device.
- Modular – swap components like Lego: wake word engine, ASR, LLM, TTS.
- Affordable – mostly open source, commodity mics, speakers, and a Pi.
- Hackable – want home automation, dashboards, routines, custom skills? Easy.
- Reliable – service-managed, boots and starts listening automatically.
- Fun – you’ll learn a lot about audio, processes, and event-driven design.
Tiny tip: If you use a Raspberry Pi 5 and plan to run heavier local models, a clip-on cooler helps under sustained load. (When in doubt, pick the official Active Cooler designed for Pi 5.) [1]
Parts & Tools You’ll Need 🧰
- Raspberry Pi: Pi 4 or Pi 5 recommended for headroom.
- microSD card: 32 GB+ recommended.
- USB microphone: a simple USB conference mic is great.
- Speaker: USB or 3.5 mm speaker, or an I2S amp HAT.
- Network: Ethernet or Wi-Fi.
- Optional niceties: case, active cooler for Pi 5, push button for push-to-talk, LED ring. [1]
OS & Baseline Setup
- Flash Raspberry Pi OS with Raspberry Pi Imager. It’s the straightforward way to get a bootable microSD with the presets you want. [1]
- Boot, connect to the network, then update packages:
sudo apt update && sudo apt upgrade -y
- Audio basics: On Raspberry Pi OS you can set the default output, levels, and devices via the desktop UI or raspi-config. USB and HDMI audio are supported across models; Bluetooth output is available on models with Bluetooth. [1]
- Verify devices:
arecord -l
aplay -l
Then test capture and playback. If levels seem weird, check mixers and defaults before blaming the mic.
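A quick round-trip smoke test looks like this – swap plughw:1,0 for the card and device numbers arecord -l actually reported:
arecord -D plughw:1,0 -f S16_LE -r 16000 -c 1 -d 5 test.wav
aplay test.wav
If you hear your own voice back, capture and playback are wired correctly and you can move on.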
The Architecture At A Glance 🗺️
A sensible DIY AI Assistant with Raspberry Pi flow looks like this:
Wake word → live audio capture → ASR transcription → intent handling or LLM → response text → TTS → audio playback → optional actions via MQTT or HTTP.
- Wake word: Porcupine is small, accurate, and runs locally with per-keyword sensitivity control. [2]
- ASR: Whisper is a multilingual, general-purpose ASR model trained on ~680k hours; it’s robust to accents/background noise. For on-device use, whisper.cpp provides a lean C/C++ inference path. [3][4]
- Brain: Your pick – a cloud LLM via API, a rules engine, or local inference depending on horsepower.
- TTS: Piper generates natural speech locally, fast enough for snappy responses on modest hardware. [5]
Quick Comparison Table 🔎
| Tool | Best For | Price-ish | Why It Works |
|---|---|---|---|
| Porcupine Wake Word | Always-listening trigger | Free tier + | Low CPU, accurate, easy bindings [2] |
| Whisper.cpp | Local ASR on Pi | Open source | Good accuracy, CPU-friendly [4] |
| Faster-Whisper | Faster ASR on CPU/GPU | Open source | CTranslate2 optimizations |
| Piper TTS | Local speech output | Open source | Fast voices, many languages [5] |
| Cloud LLM API | Rich reasoning | Usage based | Offloads heavy compute |
| Node-RED | Orchestrating actions | Open source | Visual flows, MQTT friendly |
Step-by-Step Build: Your First Voice Loop 🧩
We’ll use Porcupine for wake word, Whisper for transcription, a lightweight “brain” function for the reply (replace with your LLM of choice), and Piper for speech. Keep it minimal, then iterate.
1) Install dependencies
sudo apt install -y python3-pip portaudio19-dev sox ffmpeg
pip3 install sounddevice numpy
- Porcupine: grab the SDK/bindings for your language and follow the quick start (access key + keyword list + audio frames → .process). [2]
- Whisper (CPU-friendly): build whisper.cpp:
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp && cmake -B build && cmake --build build -j
./models/download-ggml-model.sh base.en
./build/bin/whisper-cli -m ./models/ggml-base.en.bin -f your.wav -otxt
The above mirrors the project’s quick start. [4]
Prefer Python? faster-whisper (CTranslate2) is often snappier than the vanilla Python Whisper package on modest CPUs – see the sketch just below.
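A minimal faster-whisper sketch, assuming you’ve run pip3 install faster-whisper; the "base.en" model and int8 compute type are just sensible defaults for a Pi-class CPU, not requirements:
from faster_whisper import WhisperModel

# int8 keeps memory and CPU use modest on a small board
model = WhisperModel("base.en", device="cpu", compute_type="int8")
segments, info = model.transcribe("your.wav")
print(" ".join(seg.text.strip() for seg in segments))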
2) Set up Piper TTS
git clone https://github.com/rhasspy/piper
cd piper
make
# Download a voice model you like, e.g., en_US-amy
echo "Hello there." | ./piper --model voices/en/en_US-amy-medium.onnx --output_file hello.wav
aplay hello.wav
Piper is designed for on-device TTS with multiple voice/language options. [5]
3) A minimal assistant loop in Python
Deliberately compact: waits for a wake phrase (stub), records, transcribes with whisper.cpp, generates a reply (placeholder), then speaks via Piper. Swap the placeholder with your favorite LLM or rule logic.
import os, subprocess, wave
import sounddevice as sd

WAKE_WORD = "hey computer"  # swap for Porcupine in production [2]
RECORD_SECONDS = 6
SAMPLE_RATE = 16000
CHANNELS = 1
WORKDIR = "/home/pi/assistant"
ASR_BIN = "/home/pi/whisper.cpp/build/bin/whisper-cli"  # [4]
ASR_MODEL = "/home/pi/whisper.cpp/models/ggml-base.en.bin"
PIPER_BIN = "/home/pi/piper/build/piper"  # [5]
PIPER_VOICE = "/home/pi/piper/voices/en/en_US-amy-medium.onnx"

os.makedirs(WORKDIR, exist_ok=True)

def record_wav(path, seconds=RECORD_SECONDS):
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=CHANNELS, dtype='int16')
    sd.wait()
    with wave.open(path, 'wb') as w:
        w.setnchannels(CHANNELS); w.setsampwidth(2); w.setframerate(SAMPLE_RATE)
        w.writeframes(audio.tobytes())

def transcribe(path):
    out_base = path.rsplit(".", 1)[0]  # pass -of so we control where the .txt lands
    cmd = [ASR_BIN, "-m", ASR_MODEL, "-f", path, "-otxt", "-of", out_base]
    subprocess.run(cmd, check=True, cwd=WORKDIR)
    with open(out_base + ".txt", "r", encoding="utf-8") as f:
        return f.read().strip()

def generate_reply(prompt):
    if "weather" in prompt.lower():
        return "I can't see the clouds, but it might be fine. Bring a jacket just in case."
    return "You said: " + prompt

def speak(text):
    proc = subprocess.Popen([PIPER_BIN, "--model", PIPER_VOICE, "--output_file", f"{WORKDIR}/reply.wav"], stdin=subprocess.PIPE)
    proc.stdin.write(text.encode("utf-8")); proc.stdin.close(); proc.wait()
    subprocess.run(["aplay", f"{WORKDIR}/reply.wav"], check=True)

print("Assistant ready. Type the wake phrase to test.")
while True:
    typed = input("> ").strip().lower()
    if typed == WAKE_WORD:
        wav_path = f"{WORKDIR}/input.wav"
        record_wav(wav_path)
        text = transcribe(wav_path)
        reply = generate_reply(text)
        print("User:", text); print("Assistant:", reply)
        speak(reply)
    else:
        print("Type the wake phrase to test the loop.")
For real wake-word detection, integrate Porcupine’s streaming detector (low CPU, per-keyword sensitivity). [2]
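Here’s a rough shape of that swap, assuming pip3 install pvporcupine pvrecorder and a Picovoice access key; the built-in "computer" keyword and 0.6 sensitivity are placeholders, not recommendations:
import pvporcupine
from pvrecorder import PvRecorder

porcupine = pvporcupine.create(
    access_key="YOUR_ACCESS_KEY",   # free key from the Picovoice console
    keywords=["computer"],          # built-in keyword; custom .ppn files also work
    sensitivities=[0.6],            # 0..1 – higher triggers more easily [2]
)
recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()
try:
    while True:
        pcm = recorder.read()               # one frame of 16-bit samples
        if porcupine.process(pcm) >= 0:     # >= 0 means a keyword fired
            print("Wake word detected")     # hand off to record_wav()/transcribe() here
finally:
    recorder.stop()
    recorder.delete()
    porcupine.delete()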
Audio Tuning That Actually Matters 🎚️
A few tiny fixes make your assistant feel 10× smarter:
- Mic distance: 30–60 cm is a sweet spot for many USB mics.
- Levels: avoid clipping on input and keep playback sane; fix routing before chasing code ghosts. On Raspberry Pi OS, you can manage the output device and levels via system tools or raspi-config (see the mixer commands after this list). [1]
- Room acoustics: hard walls cause echoes; a soft mat under the mic helps.
- Wake word threshold: too sensitive → ghost triggers; too strict → you’ll be yelling at plastic. Porcupine lets you tweak sensitivity per keyword. [2]
- Thermals: long transcriptions on Pi 5 benefit from the official active cooler for sustained performance. [1]
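For the levels point above, the stock ALSA tools are usually enough to spot a muted or red-lining channel. Card numbers come from arecord -l / aplay -l, and control names like 'Mic' vary by device:
alsamixer                 # F6 selects the sound card, F4 shows capture controls
amixer -c 1 sget 'Mic'    # print the current capture level for card 1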
Going From Toy To Appliance: Services, Autostart, Healthchecks 🧯
Humans forget to run scripts. Computers forget to be nice. Turn your loop into a managed service:
- Create a systemd unit:
[Unit]
Description=DIY Voice Assistant
After=network.target sound.target
[Service]
User=pi
WorkingDirectory=/home/pi/assistant
ExecStart=/usr/bin/python3 /home/pi/assistant/assistant.py
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
- Enable it:
sudo cp assistant.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now assistant.service
- Tail the logs:
journalctl -u assistant -f
Now it starts on boot, restarts on crash, and generally behaves like an appliance. A little boring, a lot better.
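The heading promised healthchecks: Restart=always already covers crashes, but if you want a belt-and-braces check, one low-tech option is a line in root’s crontab (sudo crontab -e) that pokes the service every few minutes:
*/5 * * * * systemctl is-active --quiet assistant.service || systemctl restart assistant.service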
Skill System: Make It Actually Useful At Home 🏠✨
Once voice-in and voice-out are solid, add actions:
- Intent router: simple keyword routes for common tasks.
- Smart home: publish events to MQTT or call Home Assistant’s HTTP endpoints.
- Plugins: quick Python functions like set_timer, what_is_the_time, play_radio, run_scene.
Even with a cloud LLM in the loop, route obvious local commands first for speed and reliability.
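A minimal sketch of that router idea – the intent names and handlers here are made up for illustration; anything that doesn’t match falls through to generate_reply() from the main loop:
from datetime import datetime

def what_is_the_time(text):
    return "It is " + datetime.now().strftime("%H:%M")

def set_timer(text):
    return "Timer skill not wired up yet."   # placeholder handler

# keyword -> handler; first match wins
INTENTS = [
    ("time", what_is_the_time),
    ("timer", set_timer),
]

def route(text):
    lowered = text.lower()
    for keyword, handler in INTENTS:
        if keyword in lowered:
            return handler(text)
    return None  # caller falls back to the LLM / rules brain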
Local Only vs Cloud Assist: Trade-offs You’ll Feel 🌓
Local only
Pros: private, offline, predictable costs.
Cons: heavier models may be slow on small boards. Whisper’s multilingual training helps with robustness if you keep it on-device or on a nearby server. [3]
Cloud assist
Pros: powerful reasoning, larger context windows.
Cons: data leaves device, network dependency, variable costs.
A hybrid often wins: wake word + ASR local → call an API for reasoning → TTS local. [2][3][5]
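As a sketch of that hybrid shape – the endpoint, auth header, and JSON fields below are placeholders, not any real provider’s API; swap in the client library for whichever LLM service you actually use:
import os, requests

LLM_URL = os.environ.get("LLM_URL", "https://example.com/v1/chat")   # hypothetical endpoint
LLM_KEY = os.environ.get("LLM_KEY", "")

def generate_reply_hybrid(prompt):
    local = route(prompt)              # local intents first (see the skill router above)
    if local is not None:
        return local
    try:
        resp = requests.post(
            LLM_URL,
            headers={"Authorization": f"Bearer {LLM_KEY}"},
            json={"prompt": prompt},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json().get("reply", "Sorry, I got an empty answer.")
    except requests.RequestException:
        return "The cloud is unreachable, but I'm still here."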
Troubleshooting: Strange Gremlins & Quick Fixes 👾
- Wake word false triggers: lower sensitivities or try a different mic. [2]
- ASR lag: use a smaller Whisper model or build whisper.cpp with release flags (-j --config Release; full commands after this list). [4]
- Choppy TTS: pre-generate common phrases; confirm your audio device and sample rates.
- No mic detected: check arecord -l and mixers.
- Thermal throttling: use the official Active Cooler on Pi 5 for sustained performance. [1]
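Those release flags map to a clean rebuild along these lines (standard CMake options, run from the whisper.cpp checkout):
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --config Release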
Security & Privacy Notes You Should Actually Read 🔒
- Keep your Pi updated with APT.
- If you use any cloud API, log what you send and consider redacting personal bits locally first.
- Run services with least privilege; avoid sudo in ExecStart unless required (a hardening sketch follows this list).
- Provide a local-only mode for guests or quiet hours.
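A small hardening sketch for the unit above – these are standard systemd directives, but check that the paths your assistant writes to (e.g. /home/pi/assistant) still work after adding them:
[Service]
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=full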
Build Variants: Mix And Match Like A Sandwich 🥪
- Ultra-local: Porcupine + whisper.cpp + Piper + simple rules. Private and sturdy. [2][4][5]
- Speedy cloud assist: Porcupine + (smaller local Whisper or cloud ASR) + local TTS + cloud LLM.
- Home automation central: add Node-RED or Home Assistant flows for routines, scenes, and sensors.
Example Skill: Lights On via MQTT 💡
import paho.mqtt.client as mqtt

MQTT_HOST = "192.168.1.10"
TOPIC = "home/livingroom/light/set"

def set_light(state: str):
    client = mqtt.Client()
    client.connect(MQTT_HOST, 1883, 60)
    client.loop_start()  # background network loop so the QoS 1 publish actually completes
    payload = "ON" if state.lower().startswith("on") else "OFF"
    client.publish(TOPIC, payload, qos=1, retain=False).wait_for_publish()
    client.loop_stop()
    client.disconnect()

# if "turn on the lights" in text: set_light("on")
Add a voice line like: “turn on the living room lamp,” and you’ll feel like a wizard.
Why This Stack Works in Practice 🧪
- Porcupine is efficient and accurate at wake-word detection on small boards, which makes always-listening feasible. [2]
- Whisper’s large, multilingual training makes it robust to varied environments and accents. [3]
- whisper.cpp keeps that power usable on CPU-only devices like the Pi. [4]
- Piper keeps responses snappy without shipping audio to a cloud TTS. [5]
Too Long, Didn't Read It
Build a modular, private DIY AI Assistant with Raspberry Pi by combining Porcupine for wake word, Whisper (via whisper.cpp) for ASR, your choice of brain for replies, and Piper for local TTS. Wrap it as a systemd service, tune audio, and wire in MQTT or HTTP actions. It’s cheaper than you think, and oddly delightful to live with. [1][2][3][4][5]
References
[1] Raspberry Pi Software & Cooling – Raspberry Pi Imager (download & use) and Pi 5 Active Cooler product info
[2] Porcupine Wake Word – SDK & quick start (keywords, sensitivity, local inference)
[3] Whisper (ASR model) – Radford et al., Robust Speech Recognition via Large-Scale Weak Supervision; multilingual, robust ASR trained on ~680k hours
[4] whisper.cpp – CPU-friendly Whisper inference with CLI and build steps
[5] Piper TTS – Fast, local neural TTS with multiple voices/languages